Capstone: Building a Knowledge Graph Platform

Project Duration: 15-20 hours
Learning Style: Full-Stack Implementation + Graph Algorithms + Production Deployment
Outcome: Complete knowledge graph application demonstrating Neo4j mastery

Project: Academic Research Knowledge Graph

Build ScholarGraph - a knowledge graph connecting:
  • Papers (research publications)
  • Authors (researchers)
  • Institutions (universities, labs)
  • Topics (research areas)
  • Citations (paper references)
Features:
  1. Import data from academic APIs
  2. Entity relationship extraction
  3. Semantic search (find similar papers)
  4. Recommendation engine (relevant papers for researchers)
  5. Network analysis (influential authors, emerging topics)
  6. Visualization dashboard

Part 1: Data Model Design

Schema

// Node labels
(:Paper {id, title, abstract, year, citation_count})
(:Author {id, name, h_index, institution})
(:Institution {id, name, country, rank})
(:Topic {name})

// Relationships
(author)-[:AUTHORED]->(paper)
(paper)-[:CITES]->(paper)
(paper)-[:ABOUT]->(topic)
(author)-[:AFFILIATED_WITH]->(institution)
(author)-[:COLLABORATES_WITH {papers_together}]->(author)

Indexes and Constraints

// Unique constraints (Neo4j 5 syntax)
CREATE CONSTRAINT paper_id FOR (p:Paper) REQUIRE p.id IS UNIQUE;
CREATE CONSTRAINT author_id FOR (a:Author) REQUIRE a.id IS UNIQUE;
CREATE CONSTRAINT institution_id FOR (i:Institution) REQUIRE i.id IS UNIQUE;

// Indexes for search
CREATE INDEX paper_title FOR (p:Paper) ON (p.title);
CREATE INDEX author_name FOR (a:Author) ON (a.name);
CREATE FULLTEXT INDEX paper_search FOR (p:Paper) ON EACH [p.title, p.abstract];

// Vector index for embeddings
// (384 dimensions to match all-MiniLM-L6-v2, the model used in Part 2)
CREATE VECTOR INDEX paper_embedding FOR (p:Paper) ON (p.embedding)
OPTIONS {indexConfig: {
  `vector.dimensions`: 384,
  `vector.similarity_function`: 'cosine'
}};

Part 2: Data Import

Sample Data Source: arXiv API

Python Script (import_arxiv.py):
import arxiv
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

# Neo4j connection
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Embedding model (for semantic search)
model = SentenceTransformer('all-MiniLM-L6-v2')

def import_paper(paper_data):
    with driver.session() as session:
        session.execute_write(_create_paper, paper_data)

def _create_paper(tx, data):
    # Generate embedding
    embedding = model.encode(data['abstract']).tolist()

    # Create paper
    query = """
    MERGE (p:Paper {id: $id})
    SET p.title = $title,
        p.abstract = $abstract,
        p.year = $year,
        p.embedding = $embedding

    // Create authors (deduplicated by name; arXiv exposes no stable author id)
    FOREACH (author IN $authors |
        MERGE (a:Author {name: author.name})
        MERGE (a)-[:AUTHORED]->(p)
    )

    // Create topics
    FOREACH (topic IN $topics |
        MERGE (t:Topic {name: topic})
        MERGE (p)-[:ABOUT]->(t)
    )

    RETURN p
    """

    tx.run(query,
        id=data['id'],
        title=data['title'],
        abstract=data['abstract'],
        year=data['year'],
        embedding=embedding,
        authors=data['authors'],
        topics=data['topics']
    )

# Fetch papers from arXiv (Client.results() replaces the
# deprecated Search.results() in recent versions of the arxiv package)
client = arxiv.Client()
search = arxiv.Search(
    query="cat:cs.AI",  # AI papers
    max_results=1000,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

for result in client.results(search):
    paper_data = {
        'id': result.entry_id,
        'title': result.title,
        'abstract': result.summary,
        'year': result.published.year,
        'authors': [{'name': author.name} for author in result.authors],
        'topics': result.categories  # plain list of category strings, e.g. "cs.AI"
    }
    import_paper(paper_data)
    print(f"Imported: {result.title}")

driver.close()

Import Citations

The arXiv API does not return reference lists, so the citation pairs for this function must come from a separate source (the Semantic Scholar API is one common choice):
def import_citations(paper_id, cited_ids):
    with driver.session() as session:
        session.run("""
            MATCH (citing:Paper {id: $citing_id})
            UNWIND $cited_ids AS cited_id
            MATCH (cited:Paper {id: cited_id})
            MERGE (citing)-[:CITES]->(cited)
        """, citing_id=paper_id, cited_ids=cited_ids)

Part 3: Core Features

Feature 1: Semantic Search

Find similar papers by embedding. The query embedding is generated in the application layer and passed in as $query_embedding; querying through the vector index from Part 1 avoids a full scan over every Paper node:
CALL db.index.vector.queryNodes('paper_embedding', 10, $query_embedding)
YIELD node AS p, score AS similarity
RETURN p.title, p.abstract, similarity
ORDER BY similarity DESC
Python API:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def search_papers(query_text, limit=10):
    query_embedding = model.encode(query_text).tolist()

    with driver.session() as session:
        # Top-k nearest neighbors via the vector index; results
        # come back ordered by similarity score
        result = session.run("""
            CALL db.index.vector.queryNodes('paper_embedding', $limit, $embedding)
            YIELD node AS p, score
            RETURN p.title AS title,
                   p.abstract AS abstract,
                   score
        """, embedding=query_embedding, limit=limit)

        return [dict(record) for record in result]

# Usage
results = search_papers("transformers for natural language processing")
for r in results:
    print(f"{r['title']} (score: {r['score']:.3f})")

Feature 2: Author Recommendations

“Authors working on similar topics to you”:
MATCH (me:Author {id: $author_id})-[:AUTHORED]->(:Paper)-[:ABOUT]->(topic:Topic)
WITH me, collect(DISTINCT topic) AS my_topics

MATCH (other:Author)-[:AUTHORED]->(:Paper)-[:ABOUT]->(topic)
WHERE other <> me AND topic IN my_topics
WITH other, count(DISTINCT topic) AS common_topics
ORDER BY common_topics DESC
LIMIT 10

OPTIONAL MATCH (other)-[:AFFILIATED_WITH]->(inst:Institution)
RETURN other.name AS author,
       common_topics,
       inst.name AS institution

Feature 3: Paper Recommendations

“Papers you might be interested in” (collaborative filtering):
// Based on what similar authors read
MATCH (me:Author {id: $author_id})-[:AUTHORED]->(my_paper:Paper)
MATCH (my_paper)<-[:CITES]-(citing:Paper)<-[:AUTHORED]-(other:Author)
MATCH (other)-[:AUTHORED]->(rec:Paper)
WHERE NOT (me)-[:AUTHORED]->(rec)
WITH rec, count(DISTINCT other) AS score
ORDER BY score DESC
LIMIT 10
RETURN rec.title, rec.year, score

Feature 4: Influential Authors (PageRank)

Find the most influential researchers. CITES connects papers, not authors, so the projection is over Paper nodes; author influence is then aggregated from their papers' scores:
// Project the paper citation network
CALL gds.graph.project(
  'citation-network',
  'Paper',
  {
    CITES: {
      orientation: 'NATURAL',
      aggregation: 'SINGLE'
    }
  }
)

// Run PageRank over papers
CALL gds.pageRank.write('citation-network', {
  writeProperty: 'influence_score',
  dampingFactor: 0.85,
  maxIterations: 20
})

// Aggregate paper scores per author
MATCH (a:Author)-[:AUTHORED]->(p:Paper)
RETURN a.name, a.h_index, sum(p.influence_score) AS influence
ORDER BY influence DESC
LIMIT 20

Feature 5: Research Communities

Detect research groups with Louvain, using the COLLABORATES_WITH edges derived in Part 2 (Louvain expects an undirected graph):
// Project collaboration network
CALL gds.graph.project(
  'collaboration-network',
  'Author',
  {COLLABORATES_WITH: {orientation: 'UNDIRECTED'}}
)

// Detect communities
CALL gds.louvain.write('collaboration-network', {
  writeProperty: 'community'
})

// View communities
MATCH (a:Author)
WITH a.community AS community, collect(a.name) AS members
WHERE size(members) > 5
RETURN community, size(members) AS size, members[0..10] AS sample
ORDER BY size DESC
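
GDS projections live in memory until dropped. A small cleanup helper (a sketch using the same Python driver as in Part 2):
def drop_projection(name):
    # In-memory projections persist until dropped or the server restarts;
    # failIfMissing=false makes this a no-op when the graph doesn't exist
    with driver.session() as session:
        session.run("CALL gds.graph.drop($name, false)", name=name)

drop_projection('citation-network')
drop_projection('collaboration-network')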

Part 4: REST API (FastAPI)

File: api.py
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer
from typing import List
import os

app = FastAPI()
# Read connection details from the environment so the same code
# works locally and inside the Docker Compose network (Part 7)
driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI", "bolt://localhost:7687"),
    auth=(os.getenv("NEO4J_USER", "neo4j"), os.getenv("NEO4J_PASSWORD", "password"))
)
model = SentenceTransformer('all-MiniLM-L6-v2')

class SearchQuery(BaseModel):
    query: str
    limit: int = 10

class Paper(BaseModel):
    id: str
    title: str
    abstract: str
    year: int
    authors: List[str]
    topics: List[str]

@app.get("/papers/{paper_id}", response_model=Paper)
def get_paper(paper_id: str):
    with driver.session() as session:
        result = session.run("""
            MATCH (p:Paper {id: $id})
            OPTIONAL MATCH (a:Author)-[:AUTHORED]->(p)
            OPTIONAL MATCH (p)-[:ABOUT]->(t:Topic)
            RETURN p.id AS id,
                   p.title AS title,
                   p.abstract AS abstract,
                   p.year AS year,
                   collect(DISTINCT a.name) AS authors,
                   collect(DISTINCT t.name) AS topics
        """, id=paper_id).single()

        if not result:
            raise HTTPException(status_code=404, detail="Paper not found")

        return Paper(**result)

@app.post("/search")
def search_papers(query: SearchQuery):
    embedding = model.encode(query.query).tolist()

    with driver.session() as session:
        # Nearest neighbors via the vector index, then enrich with authors
        result = session.run("""
            CALL db.index.vector.queryNodes('paper_embedding', $limit, $embedding)
            YIELD node AS p, score
            OPTIONAL MATCH (a:Author)-[:AUTHORED]->(p)
            WITH p, score, collect(a.name) AS authors
            RETURN p.id AS id, p.title AS title, p.year AS year,
                   authors, score
            ORDER BY score DESC
        """, embedding=embedding, limit=query.limit)

        return [dict(record) for record in result]

@app.get("/authors/{author_id}/recommendations")
def recommend_papers(author_id: str, limit: int = 10):
    with driver.session() as session:
        result = session.run("""
            MATCH (me:Author {id: $author_id})-[:AUTHORED]->(my_paper:Paper)
            MATCH (my_paper)-[:ABOUT]->(topic:Topic)<-[:ABOUT]-(rec:Paper)
            WHERE NOT (me)-[:AUTHORED]->(rec)
            WITH rec, count(DISTINCT topic) AS relevance
            ORDER BY relevance DESC, rec.citation_count DESC
            LIMIT $limit
            OPTIONAL MATCH (a:Author)-[:AUTHORED]->(rec)
            RETURN rec.id AS id, rec.title AS title, rec.year AS year,
                   relevance, collect(a.name) AS authors
        """, author_id=author_id, limit=limit)

        return [dict(record) for record in result]

@app.get("/stats")
def get_stats():
    with driver.session() as session:
        result = session.run("""
            MATCH (p:Paper) WITH count(p) AS papers
            MATCH (a:Author) WITH papers, count(a) AS authors
            MATCH ()-[c:CITES]->() WITH papers, authors, count(c) AS citations
            MATCH (t:Topic) WITH papers, authors, citations, count(t) AS topics
            RETURN papers, authors, citations, topics
        """).single()

        return dict(result)

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Run API:
python api.py
Test:
curl http://localhost:8000/stats
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "graph neural networks", "limit": 5}'

Part 5: Visualization Dashboard

Frontend: React + vis.js
File: App.js
import React, { useState, useEffect } from 'react';
import Graph from 'react-graph-vis';

function App() {
  const [graph, setGraph] = useState({ nodes: [], edges: [] });
  const [search, setSearch] = useState('');

  const fetchGraph = async (authorId) => {
    const response = await fetch(`http://localhost:8000/authors/${authorId}/network`);
    const data = await response.json();

    const nodes = data.authors.map(a => ({
      id: a.id,
      label: a.name,
      title: `${a.name}\nPapers: ${a.paper_count}`,
      value: a.paper_count
    }));

    const edges = data.collaborations.map(c => ({
      from: c.author1,
      to: c.author2,
      value: c.papers_together
    }));

    setGraph({ nodes, edges });
  };

  const options = {
    nodes: {
      shape: 'dot',
      scaling: { min: 10, max: 30 }
    },
    edges: {
      width: 0.5,
      color: { inherit: 'from' },
      smooth: { type: 'continuous' }
    },
    physics: {
      stabilization: false,
      barnesHut: {
        gravitationalConstant: -8000,
        springConstant: 0.04,
        springLength: 95
      }
    }
  };

  return (
    <div>
      <h1>ScholarGraph</h1>
      <input
        type="text"
        placeholder="Author id..."
        value={search}
        onChange={(e) => setSearch(e.target.value)}
      />
      {/* The network endpoint matches on Author.id, so the input takes an id */}
      <button onClick={() => fetchGraph(search)}>Visualize Network</button>

      <Graph
        graph={graph}
        options={options}
        style={{ height: '600px' }}
      />
    </div>
  );
}

export default App;

Part 6: Performance Optimization

Batch Imports

Use UNWIND for bulk inserts:
// Instead of 1000 individual queries
UNWIND $papers AS paper
MERGE (p:Paper {id: paper.id})
SET p.title = paper.title,
    p.abstract = paper.abstract,
    p.year = paper.year
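
A driver-side wrapper (a sketch, assuming the same paper dicts built in Part 2) sends one UNWIND query per chunk instead of one query per paper:
def import_papers_batch(papers, batch_size=500):
    # One round trip and one transaction per chunk
    with driver.session() as session:
        for i in range(0, len(papers), batch_size):
            session.run("""
                UNWIND $papers AS paper
                MERGE (p:Paper {id: paper.id})
                SET p.title = paper.title,
                    p.abstract = paper.abstract,
                    p.year = paper.year
            """, papers=papers[i:i + batch_size])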

Query Optimization

The two forms below are semantically equivalent; the inline property version is more compact, and with an index on :Paper(year) the planner can anchor the match on papers instead of expanding from every author. Verify with PROFILE rather than assuming a speedup.

Verbose:
MATCH (a:Author)
MATCH (a)-[:AUTHORED]->(p:Paper)
WHERE p.year = 2023
RETURN a.name, count(p)
Compact:
MATCH (a:Author)-[:AUTHORED]->(p:Paper {year: 2023})
RETURN a.name, count(p)
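
To check the executed plan from Python, prefix the query with PROFILE and read the plan off the result summary (a sketch; the plan is a nested dict of operators as exposed by the neo4j Python driver's ResultSummary):
def profile_query(query, **params):
    # PROFILE executes the query and attaches the actual plan,
    # including per-operator rows and db hits, to the summary
    with driver.session() as session:
        summary = session.run("PROFILE " + query, **params).consume()
        return summary.profile

plan = profile_query(
    "MATCH (a:Author)-[:AUTHORED]->(p:Paper {year: $year}) "
    "RETURN a.name, count(p)",
    year=2023
)
print(plan["operatorType"], plan["dbHits"])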

Caching

Add Redis for hot queries:
import redis
import json

cache = redis.Redis(host='localhost', port=6379, decode_responses=True)

def get_paper(paper_id):
    # Check cache first
    cached = cache.get(f"paper:{paper_id}")
    if cached:
        return json.loads(cached)

    # Fall back to Neo4j
    with driver.session() as session:
        result = session.run("MATCH (p:Paper {id: $id}) RETURN p", id=paper_id).single()
        if result is None:
            return None  # unknown paper; don't cache misses
        paper = dict(result['p'])

    # Cache for 1 hour
    cache.setex(f"paper:{paper_id}", 3600, json.dumps(paper))

    return paper

Part 7: Deployment

Docker Compose

docker-compose.yml:
version: '3.8'

services:
  neo4j:
    image: neo4j:5.13
    ports:
      - "7474:7474"  # Browser
      - "7687:7687"  # Bolt
    environment:
      NEO4J_AUTH: neo4j/password
      # Neo4j 5 memory settings live under the server.* namespace
      NEO4J_server_memory_heap_max__size: 8G
      NEO4J_server_memory_pagecache_size: 4G
      # Required for the gds.* procedures used in Part 3
      NEO4J_PLUGINS: '["graph-data-science"]'
    volumes:
      - neo4j_data:/data
      - neo4j_logs:/logs

  api:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      - neo4j
    environment:
      NEO4J_URI: bolt://neo4j:7687
      NEO4J_USER: neo4j
      NEO4J_PASSWORD: password

  redis:
    image: redis:7
    ports:
      - "6379:6379"

volumes:
  neo4j_data:
  neo4j_logs:
Dockerfile (API):
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]
Deploy:
docker-compose up -d

Part 8: Deliverables

  1. Data Model Diagram: Nodes, relationships, properties
  2. Import Scripts: Python code to load arXiv data
  3. API Implementation: FastAPI with all endpoints
  4. Frontend: React dashboard with graph visualization
  5. Query Collection: 20+ Cypher queries for features
  6. Performance Report: Query times, optimization results
  7. Deployment Guide: Docker Compose setup
  8. Presentation: Demo video + slides

Summary

ScholarGraph demonstrates:
  • ✅ Graph data modeling (papers, authors, topics)
  • ✅ Semantic search (vector embeddings)
  • ✅ Recommendations (collaborative filtering)
  • ✅ Graph algorithms (PageRank, Louvain)
  • ✅ REST API (FastAPI)
  • ✅ Visualization (React + vis.js)
  • ✅ Production deployment (Docker)
Congratulations on completing the Neo4j mastery course! 🎉

What’s Next?