Capstone: Building a Knowledge Graph Platform
Project Duration: 15-20 hours
Learning Style: Full-Stack Implementation + Graph Algorithms + Production Deployment
Outcome: Complete knowledge graph application demonstrating Neo4j mastery
Project: Academic Research Knowledge Graph
Build ScholarGraph - a knowledge graph connecting:
- Papers (research publications)
- Authors (researchers)
- Institutions (universities, labs)
- Topics (research areas)
- Citations (paper references)
Features:
- Import data from academic APIs
- Entity relationship extraction
- Semantic search (find similar papers)
- Recommendation engine (relevant papers for researchers)
- Network analysis (influential authors, emerging topics)
- Visualization dashboard
Part 1: Data Model Design
Schema
// Node labels
(:Paper {id, title, abstract, year, citation_count})
(:Author {id, name, h_index, institution})
(:Institution {id, name, country, rank})
(:Topic {name})
// Relationships
(author)-[:AUTHORED]->(paper)
(paper)-[:CITES]->(paper)
(paper)-[:ABOUT]->(topic)
(author)-[:AFFILIATED_WITH]->(institution)
(author)-[:COLLABORATES_WITH {papers_together}]->(author)
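Once data is loaded, Neo4j can render the live model for comparison against this design:
CALL db.schema.visualization();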
Indexes and Constraints
// Unique constraints
CREATE CONSTRAINT paper_id FOR (p:Paper) REQUIRE p.id IS UNIQUE;
CREATE CONSTRAINT author_id FOR (a:Author) REQUIRE a.id IS UNIQUE;
CREATE CONSTRAINT institution_id FOR (i:Institution) REQUIRE i.id IS UNIQUE;
// Indexes for search
CREATE INDEX paper_title FOR (p:Paper) ON (p.title);
CREATE INDEX author_name FOR (a:Author) ON (a.name);
CREATE FULLTEXT INDEX paper_search FOR (p:Paper) ON EACH [p.title, p.abstract];
// Vector index for embeddings (CREATE VECTOR INDEX requires Neo4j 5.15+;
// on older 5.x use the db.index.vector.createNodeIndex procedure instead)
CREATE VECTOR INDEX paper_embedding FOR (p:Paper) ON (p.embedding)
OPTIONS {indexConfig: {
  `vector.dimensions`: 384,              // all-MiniLM-L6-v2 emits 384-dim vectors
  `vector.similarity_function`: 'cosine'
}};
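A quick sanity check that all the schema objects were created:
SHOW CONSTRAINTS;
SHOW INDEXES;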
Part 2: Data Import
Sample Data Source: arXiv API
Python Script (import_arxiv.py):
import arxiv
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer
# Neo4j connection
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# Embedding model (for semantic search)
model = SentenceTransformer('all-MiniLM-L6-v2')
def import_paper(paper_data):
    with driver.session() as session:
        session.execute_write(_create_paper, paper_data)

def _create_paper(tx, data):
    # Generate embedding for semantic search
    embedding = model.encode(data['abstract']).tolist()

    # Create the paper, its authors, and its topics in one statement
    query = """
    MERGE (p:Paper {id: $id})
    SET p.title = $title,
        p.abstract = $abstract,
        p.year = $year,
        p.embedding = $embedding

    // Create authors
    FOREACH (author IN $authors |
        MERGE (a:Author {name: author.name})
        MERGE (a)-[:AUTHORED]->(p)
    )

    // Create topics
    FOREACH (topic IN $topics |
        MERGE (t:Topic {name: topic})
        MERGE (p)-[:ABOUT]->(t)
    )
    RETURN p
    """
    tx.run(query,
           id=data['id'],
           title=data['title'],
           abstract=data['abstract'],
           year=data['year'],
           embedding=embedding,
           authors=data['authors'],
           topics=data['topics'])
# Fetch papers from arXiv (Client.results replaces the deprecated Search.results)
client = arxiv.Client()
search = arxiv.Search(
    query="cat:cs.AI",  # AI papers
    max_results=1000,
    sort_by=arxiv.SortCriterion.SubmittedDate
)
for result in client.results(search):
    paper_data = {
        'id': result.entry_id,
        'title': result.title,
        'abstract': result.summary,
        'year': result.published.year,
        'authors': [{'name': author.name} for author in result.authors],
        'topics': list(result.categories)  # categories are plain strings
    }
    import_paper(paper_data)
    print(f"Imported: {result.title}")
driver.close()
Import Citations
def import_citations(paper_id, cited_ids):
    with driver.session() as session:
        session.run("""
            MATCH (citing:Paper {id: $citing_id})
            UNWIND $cited_ids AS cited_id
            MATCH (cited:Paper {id: cited_id})
            MERGE (citing)-[:CITES]->(cited)
        """, citing_id=paper_id, cited_ids=cited_ids)
Part 3: Core Features
Feature 1: Semantic Search
Find similar papers by embedding. Query through the vector index with db.index.vector.queryNodes rather than filtering on vector.similarity.cosine, which would scan every paper:
// User searches for "graph neural networks"; the query embedding is
// generated in the application layer and passed in as $query_embedding
CALL db.index.vector.queryNodes('paper_embedding', 10, $query_embedding)
YIELD node AS p, score AS similarity
WHERE similarity > 0.7
RETURN p.title, p.abstract, similarity
ORDER BY similarity DESC
Python API:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
def search_papers(query_text):
    query_embedding = model.encode(query_text).tolist()
    with driver.session() as session:
        result = session.run("""
            CALL db.index.vector.queryNodes('paper_embedding', 10, $embedding)
            YIELD node AS p, score
            RETURN p.title AS title,
                   p.abstract AS abstract,
                   score
            ORDER BY score DESC
        """, embedding=query_embedding)
        return [dict(record) for record in result]

# Usage
results = search_papers("transformers for natural language processing")
for r in results:
    print(f"{r['title']} (score: {r['score']:.3f})")
Feature 2: Author Recommendations
“Authors working on similar topics to you”:
MATCH (me:Author {id: $author_id})-[:AUTHORED]->(:Paper)-[:ABOUT]->(topic:Topic)
WITH me, collect(DISTINCT topic) AS my_topics
MATCH (other:Author)-[:AUTHORED]->(:Paper)-[:ABOUT]->(topic)
WHERE other <> me AND topic IN my_topics
WITH other, count(DISTINCT topic) AS common_topics
ORDER BY common_topics DESC
LIMIT 10
OPTIONAL MATCH (other)-[:AFFILIATED_WITH]->(inst:Institution)
RETURN other.name AS author,
common_topics,
inst.name AS institution
Feature 3: Paper Recommendations
“Papers you might be interested in” (collaborative filtering):
// Based on papers written by authors who cite your work
MATCH (me:Author {id: $author_id})-[:AUTHORED]->(my_paper:Paper)
MATCH (my_paper)<-[:CITES]-(citing:Paper)<-[:AUTHORED]-(other:Author)
MATCH (other)-[:AUTHORED]->(rec:Paper)
WHERE NOT (me)-[:AUTHORED]->(rec)
WITH rec, count(DISTINCT other) AS score
ORDER BY score DESC
LIMIT 10
RETURN rec.title, rec.year, score
Feature 4: Influential Authors (PageRank)
Find the most influential researchers:
// CITES links Paper->Paper, so a native projection of Author nodes with
// CITES would contain no relationships. Derive author-to-author citation
// edges with a Cypher projection instead.
CALL gds.graph.project.cypher(
  'citation-network',
  'MATCH (a:Author) RETURN id(a) AS id',
  'MATCH (a1:Author)-[:AUTHORED]->(:Paper)-[:CITES]->(:Paper)<-[:AUTHORED]-(a2:Author)
   WHERE a1 <> a2
   RETURN id(a1) AS source, id(a2) AS target'
)
// Run PageRank
CALL gds.pageRank.write('citation-network', {
  writeProperty: 'influence_score',
  dampingFactor: 0.85,
  maxIterations: 20
})
// Query top authors
MATCH (a:Author)
RETURN a.name, a.h_index, a.influence_score
ORDER BY a.influence_score DESC
LIMIT 20
Feature 5: Research Communities
Detect research groups (Louvain):
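The import script never writes COLLABORATES_WITH edges, so materialize them from co-authorship first; a sketch that also fills the papers_together property named in the schema:
// Derive one collaboration edge per unordered author pair
MATCH (a1:Author)-[:AUTHORED]->(p:Paper)<-[:AUTHORED]-(a2:Author)
WHERE elementId(a1) < elementId(a2)
WITH a1, a2, count(p) AS papers_together
MERGE (a1)-[c:COLLABORATES_WITH]-(a2)
SET c.papers_together = papers_together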
// Project collaboration network (undirected for community detection)
CALL gds.graph.project(
  'collaboration-network',
  'Author',
  {COLLABORATES_WITH: {orientation: 'UNDIRECTED'}}
)
// Detect communities
CALL gds.louvain.write('collaboration-network', {
writeProperty: 'community'
})
// View communities
MATCH (a:Author)
WITH a.community AS community, collect(a.name) AS members
WHERE size(members) > 5
RETURN community, size(members) AS size, members[0..10] AS sample
ORDER BY size DESC
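In-memory projections hold heap until dropped, so clean up once the algorithms have written their results back:
CALL gds.graph.drop('citation-network', false);
CALL gds.graph.drop('collaboration-network', false);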
Part 4: REST API (FastAPI)
File: api.py
import os
from typing import List

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

app = FastAPI()

# Read connection settings from the environment so the same code runs
# locally and inside Docker Compose (see Part 7)
driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI", "bolt://localhost:7687"),
    auth=(os.getenv("NEO4J_USER", "neo4j"), os.getenv("NEO4J_PASSWORD", "password")),
)
model = SentenceTransformer('all-MiniLM-L6-v2')
class SearchQuery(BaseModel):
    query: str
    limit: int = 10

class Paper(BaseModel):
    id: str
    title: str
    abstract: str
    year: int
    authors: List[str]
    topics: List[str]
@app.get("/papers/{paper_id}", response_model=Paper)
def get_paper(paper_id: str):
with driver.session() as session:
result = session.run("""
MATCH (p:Paper {id: $id})
OPTIONAL MATCH (a:Author)-[:AUTHORED]->(p)
OPTIONAL MATCH (p)-[:ABOUT]->(t:Topic)
RETURN p.id AS id,
p.title AS title,
p.abstract AS abstract,
p.year AS year,
collect(DISTINCT a.name) AS authors,
collect(DISTINCT t.name) AS topics
""", id=paper_id).single()
if not result:
raise HTTPException(status_code=404, detail="Paper not found")
return Paper(**result)
@app.post("/search")
def search_papers(query: SearchQuery):
embedding = model.encode(query.query).tolist()
with driver.session() as session:
result = session.run("""
MATCH (p:Paper)
WHERE vector.similarity.cosine(p.embedding, $embedding) > 0.6
OPTIONAL MATCH (a:Author)-[:AUTHORED]->(p)
WITH p, vector.similarity.cosine(p.embedding, $embedding) AS score,
collect(a.name) AS authors
ORDER BY score DESC
LIMIT $limit
RETURN p.id AS id, p.title AS title, p.year AS year,
authors, score
""", embedding=embedding, limit=query.limit)
return [dict(record) for record in result]
@app.get("/authors/{author_id}/recommendations")
def recommend_papers(author_id: str, limit: int = 10):
with driver.session() as session:
result = session.run("""
MATCH (me:Author {id: $author_id})-[:AUTHORED]->(my_paper:Paper)
MATCH (my_paper)-[:ABOUT]->(topic:Topic)<-[:ABOUT]-(rec:Paper)
WHERE NOT (me)-[:AUTHORED]->(rec)
WITH rec, count(DISTINCT topic) AS relevance
ORDER BY relevance DESC, rec.citation_count DESC
LIMIT $limit
OPTIONAL MATCH (a:Author)-[:AUTHORED]->(rec)
RETURN rec.id AS id, rec.title AS title, rec.year AS year,
relevance, collect(a.name) AS authors
""", author_id=author_id, limit=limit)
return [dict(record) for record in result]
@app.get("/stats")
def get_stats():
with driver.session() as session:
result = session.run("""
MATCH (p:Paper) WITH count(p) AS papers
MATCH (a:Author) WITH papers, count(a) AS authors
MATCH ()-[c:CITES]->() WITH papers, authors, count(c) AS citations
MATCH (t:Topic) WITH papers, authors, citations, count(t) AS topics
RETURN papers, authors, citations, topics
""").single()
return dict(result)
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
Run API:
uvicorn api:app --reload  # or: python api.py
Test:
curl http://localhost:8000/stats
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{"query": "graph neural networks", "limit": 5}'
Part 5: Visualization Dashboard
Frontend: React + vis.js
File: App.js
import React, { useState } from 'react';
import Graph from 'react-graph-vis';

function App() {
  const [graph, setGraph] = useState({ nodes: [], edges: [] });
  const [search, setSearch] = useState('');

  const fetchGraph = async (authorId) => {
    const response = await fetch(`http://localhost:8000/authors/${authorId}/network`);
    const data = await response.json();

    const nodes = data.authors.map(a => ({
      id: a.id,
      label: a.name,
      title: `${a.name}\nPapers: ${a.paper_count}`,
      value: a.paper_count
    }));

    const edges = data.collaborations.map(c => ({
      from: c.author1,
      to: c.author2,
      value: c.papers_together
    }));

    setGraph({ nodes, edges });
  };

  const options = {
    nodes: {
      shape: 'dot',
      scaling: { min: 10, max: 30 }
    },
    edges: {
      width: 0.5,
      color: { inherit: 'from' },
      smooth: { type: 'continuous' }
    },
    physics: {
      stabilization: false,
      barnesHut: {
        gravitationalConstant: -8000,
        springConstant: 0.04,
        springLength: 95
      }
    }
  };

  return (
    <div>
      <h1>ScholarGraph</h1>
      <input
        type="text"
        placeholder="Search authors..."
        value={search}
        onChange={(e) => setSearch(e.target.value)}
      />
      <button onClick={() => fetchGraph(search)}>Visualize Network</button>
      <Graph
        graph={graph}
        options={options}
        style={{ height: '600px' }}
      />
    </div>
  );
}

export default App;
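The dashboard calls an /authors/{author_id}/network endpoint that Part 4's API never defines. A minimal sketch, assuming the COLLABORATES_WITH edges from Feature 5 (note the dashboard passes the raw search-box value as the id, so you may prefer matching on name):

@app.get("/authors/{author_id}/network")
def author_network(author_id: str):
    """One-hop collaboration network around an author (sketch)."""
    with driver.session() as session:
        authors = session.run("""
            MATCH (me:Author {id: $id})-[:COLLABORATES_WITH]-(peer:Author)
            WITH me, collect(DISTINCT peer) AS peers
            UNWIND peers + [me] AS a
            OPTIONAL MATCH (a)-[:AUTHORED]->(p:Paper)
            RETURN a.id AS id, a.name AS name, count(p) AS paper_count
        """, id=author_id).data()
        collaborations = session.run("""
            MATCH (a1:Author {id: $id})-[c:COLLABORATES_WITH]-(a2:Author)
            RETURN a1.id AS author1, a2.id AS author2,
                   c.papers_together AS papers_together
        """, id=author_id).data()
    return {"authors": authors, "collaborations": collaborations}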
Part 6: Performance Optimization
Batch Imports
Use UNWIND for bulk inserts:
// Instead of 1000 individual queries
UNWIND $papers AS paper
MERGE (p:Paper {id: paper.id})
SET p.title = paper.title,
p.abstract = paper.abstract,
p.year = paper.year
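On the driver side, a sketch of sending those batches (the helper name and batch size are illustrative):

def _merge_papers(tx, batch):
    tx.run("""
        UNWIND $papers AS paper
        MERGE (p:Paper {id: paper.id})
        SET p.title = paper.title,
            p.abstract = paper.abstract,
            p.year = paper.year
    """, papers=batch)

def import_papers_batch(papers, batch_size=500):
    # One UNWIND statement per chunk instead of one query per paper
    with driver.session() as session:
        for i in range(0, len(papers), batch_size):
            session.execute_write(_merge_papers, papers[i:i + batch_size])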
Query Optimization
Before (slow):
MATCH (a:Author)
MATCH (a)-[:AUTHORED]->(p:Paper)
WHERE p.year = 2023
RETURN a.name, count(p)
After (fast):
MATCH (a:Author)-[:AUTHORED]->(p:Paper {year: 2023})
RETURN a.name, count(p)
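Either form only becomes an index seek if an index on the filtered property exists, so it's worth adding one alongside the Part 1 schema:
CREATE INDEX paper_year FOR (p:Paper) ON (p.year);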
Caching
Add Redis for hot queries:
import redis
import json
cache = redis.Redis(host='localhost', port=6379, decode_responses=True)
def get_paper(paper_id):
    # Check cache first
    cached = cache.get(f"paper:{paper_id}")
    if cached:
        return json.loads(cached)

    # Fall back to Neo4j
    with driver.session() as session:
        result = session.run(
            "MATCH (p:Paper {id: $id}) RETURN p", id=paper_id
        ).single()
        if result is None:
            return None  # unknown paper; nothing to cache
        paper = dict(result['p'])

    # Cache for 1 hour
    cache.setex(f"paper:{paper_id}", 3600, json.dumps(paper))
    return paper
Part 7: Deployment
Docker Compose
docker-compose.yml:
version: '3.8'

services:
  neo4j:
    image: neo4j:5.13
    ports:
      - "7474:7474"  # Browser
      - "7687:7687"  # Bolt
    environment:
      NEO4J_AUTH: neo4j/password
      # Neo4j 5 renamed dbms.memory.* settings to server.memory.*
      NEO4J_server_memory_heap_max__size: 8G
      NEO4J_server_memory_pagecache_size: 4G
      # Graph Data Science is required for the PageRank/Louvain features
      NEO4J_PLUGINS: '["graph-data-science"]'
    volumes:
      - neo4j_data:/data
      - neo4j_logs:/logs

  api:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      - neo4j
    environment:
      NEO4J_URI: bolt://neo4j:7687
      NEO4J_USER: neo4j
      NEO4J_PASSWORD: password

  redis:
    image: redis:7
    ports:
      - "6379:6379"

volumes:
  neo4j_data:
  neo4j_logs:
Dockerfile (API):
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]
Deploy:
docker compose up -d --build
Part 8: Deliverables
- Data Model Diagram: Nodes, relationships, properties
- Import Scripts: Python code to load arXiv data
- API Implementation: FastAPI with all endpoints
- Frontend: React dashboard with graph visualization
- Query Collection: 20+ Cypher queries for features
- Performance Report: Query times, optimization results
- Deployment Guide: Docker Compose setup
- Presentation: Demo video + slides
Summary
ScholarGraph demonstrates:
- ✅ Graph data modeling (papers, authors, topics)
- ✅ Semantic search (vector embeddings)
- ✅ Recommendations (collaborative filtering)
- ✅ Graph algorithms (PageRank, Louvain)
- ✅ REST API (FastAPI)
- ✅ Visualization (React + vis.js)
- ✅ Production deployment (Docker)
Congratulations on completing the Neo4j mastery course! 🎉
What’s Next?