Capstone: Building a Knowledge Graph Platform
Project Duration: 15-20 hours
Learning Style: Full-Stack Implementation + Graph Algorithms + Production Deployment
Outcome: Complete knowledge graph application demonstrating Neo4j mastery
Project: Academic Research Knowledge Graph
Build ScholarGraph - a knowledge graph connecting:
- Papers (research publications)
- Authors (researchers)
- Institutions (universities, labs)
- Topics (research areas)
- Citations (paper references)
Features:
- Import data from academic APIs
- Entity relationship extraction
- Semantic search (find similar papers)
- Recommendation engine (relevant papers for researchers)
- Network analysis (influential authors, emerging topics)
- Visualization dashboard
Part 1: Data Model Design
Schema
// Node labels
(:Paper {id, title, abstract, year, citation_count})
(:Author {id, name, h_index, institution})
(:Institution {id, name, country, rank})
(:Topic {name})
// Relationships
(author)-[:AUTHORED]->(paper)
(paper)-[:CITES]->(paper)
(paper)-[:ABOUT]->(topic)
(author)-[:AFFILIATED_WITH]->(institution)
(author)-[:COLLABORATES_WITH {papers_together}]->(author)
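Once data is loaded, Neo4j can render the live model for comparison against this design:
CALL db.schema.visualization();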
Indexes and Constraints
// Unique constraints
CREATE CONSTRAINT paper_id FOR (p:Paper) REQUIRE p.id IS UNIQUE;
CREATE CONSTRAINT author_id FOR (a:Author) REQUIRE a.id IS UNIQUE;
CREATE CONSTRAINT institution_id FOR (i:Institution) REQUIRE i.id IS UNIQUE;
// Indexes for search
CREATE INDEX paper_title FOR (p:Paper) ON (p.title);
CREATE INDEX author_name FOR (a:Author) ON (a.name);
CREATE FULLTEXT INDEX paper_search FOR (p:Paper) ON EACH [p.title, p.abstract];
// Vector index for embeddings (CREATE VECTOR INDEX requires Neo4j 5.15+;
// on older 5.x use the db.index.vector.createNodeIndex procedure instead)
CREATE VECTOR INDEX paper_embedding FOR (p:Paper) ON (p.embedding)
OPTIONS {indexConfig: {
  `vector.dimensions`: 384,              // all-MiniLM-L6-v2 emits 384-dim vectors
  `vector.similarity_function`: 'cosine'
}};
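A quick sanity check that all the schema objects were created:
SHOW CONSTRAINTS;
SHOW INDEXES;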
Part 2: Data Import
Sample Data Source: arXiv API
Python Script (import_arxiv.py):
import arxiv
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer
# Neo4j connection
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
# Embedding model (for semantic search)
model = SentenceTransformer('all-MiniLM-L6-v2')
def import_paper(paper_data):
    with driver.session() as session:
        session.execute_write(_create_paper, paper_data)

def _create_paper(tx, data):
    # Generate embedding for semantic search
    embedding = model.encode(data['abstract']).tolist()

    # Create the paper, its authors, and its topics in one statement
    query = """
    MERGE (p:Paper {id: $id})
    SET p.title = $title,
        p.abstract = $abstract,
        p.year = $year,
        p.embedding = $embedding

    // Create authors
    FOREACH (author IN $authors |
        MERGE (a:Author {name: author.name})
        MERGE (a)-[:AUTHORED]->(p)
    )

    // Create topics
    FOREACH (topic IN $topics |
        MERGE (t:Topic {name: topic})
        MERGE (p)-[:ABOUT]->(t)
    )
    RETURN p
    """
    tx.run(query,
           id=data['id'],
           title=data['title'],
           abstract=data['abstract'],
           year=data['year'],
           embedding=embedding,
           authors=data['authors'],
           topics=data['topics'])
# Fetch papers from arXiv (Client.results replaces the deprecated Search.results)
client = arxiv.Client()
search = arxiv.Search(
    query="cat:cs.AI",  # AI papers
    max_results=1000,
    sort_by=arxiv.SortCriterion.SubmittedDate
)
for result in client.results(search):
    paper_data = {
        'id': result.entry_id,
        'title': result.title,
        'abstract': result.summary,
        'year': result.published.year,
        'authors': [{'name': author.name} for author in result.authors],
        'topics': list(result.categories)  # categories are plain strings
    }
    import_paper(paper_data)
    print(f"Imported: {result.title}")
driver.close()
Import Citations
def import_citations(paper_id, cited_ids):
    with driver.session() as session:
        session.run("""
            MATCH (citing:Paper {id: $citing_id})
            UNWIND $cited_ids AS cited_id
            MATCH (cited:Paper {id: cited_id})
            MERGE (citing)-[:CITES]->(cited)
        """, citing_id=paper_id, cited_ids=cited_ids)
Part 3: Core Features
Feature 1: Semantic Search
Find similar papers by embedding. Query through the vector index with db.index.vector.queryNodes rather than filtering on vector.similarity.cosine, which would scan every paper:
// User searches for "graph neural networks"; the query embedding is
// generated in the application layer and passed in as $query_embedding
CALL db.index.vector.queryNodes('paper_embedding', 10, $query_embedding)
YIELD node AS p, score AS similarity
WHERE similarity > 0.7
RETURN p.title, p.abstract, similarity
ORDER BY similarity DESC
Python API:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
def search_papers(query_text):
    query_embedding = model.encode(query_text).tolist()
    with driver.session() as session:
        result = session.run("""
            CALL db.index.vector.queryNodes('paper_embedding', 10, $embedding)
            YIELD node AS p, score
            RETURN p.title AS title,
                   p.abstract AS abstract,
                   score
            ORDER BY score DESC
        """, embedding=query_embedding)
        return [dict(record) for record in result]

# Usage
results = search_papers("transformers for natural language processing")
for r in results:
    print(f"{r['title']} (score: {r['score']:.3f})")
Feature 2: Author Recommendations
“Authors working on similar topics to you”:
MATCH (me:Author {id: $author_id})-[:AUTHORED]->(:Paper)-[:ABOUT]->(topic:Topic)
WITH me, collect(DISTINCT topic) AS my_topics
MATCH (other:Author)-[:AUTHORED]->(:Paper)-[:ABOUT]->(topic)
WHERE other <> me AND topic IN my_topics
WITH other, count(DISTINCT topic) AS common_topics
ORDER BY common_topics DESC
LIMIT 10
OPTIONAL MATCH (other)-[:AFFILIATED_WITH]->(inst:Institution)
RETURN other.name AS author,
common_topics,
inst.name AS institution
Feature 3: Paper Recommendations
“Papers you might be interested in” (collaborative filtering):
// Based on papers written by authors who cite your work
MATCH (me:Author {id: $author_id})-[:AUTHORED]->(my_paper:Paper)
MATCH (my_paper)<-[:CITES]-(citing:Paper)<-[:AUTHORED]-(other:Author)
MATCH (other)-[:AUTHORED]->(rec:Paper)
WHERE NOT (me)-[:AUTHORED]->(rec)
WITH rec, count(DISTINCT other) AS score
ORDER BY score DESC
LIMIT 10
RETURN rec.title, rec.year, score
Feature 4: Influential Authors (PageRank)
Find the most influential researchers:
// CITES links Paper->Paper, so a native projection of Author nodes with
// CITES would contain no relationships. Derive author-to-author citation
// edges with a Cypher projection instead.
CALL gds.graph.project.cypher(
  'citation-network',
  'MATCH (a:Author) RETURN id(a) AS id',
  'MATCH (a1:Author)-[:AUTHORED]->(:Paper)-[:CITES]->(:Paper)<-[:AUTHORED]-(a2:Author)
   WHERE a1 <> a2
   RETURN id(a1) AS source, id(a2) AS target'
)
// Run PageRank
CALL gds.pageRank.write('citation-network', {
  writeProperty: 'influence_score',
  dampingFactor: 0.85,
  maxIterations: 20
})
// Query top authors
MATCH (a:Author)
RETURN a.name, a.h_index, a.influence_score
ORDER BY a.influence_score DESC
LIMIT 20
Feature 5: Research Communities
Detect research groups (Louvain):
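The import script never writes COLLABORATES_WITH edges, so materialize them from co-authorship first; a sketch that also fills the papers_together property named in the schema:
// Derive one collaboration edge per unordered author pair
MATCH (a1:Author)-[:AUTHORED]->(p:Paper)<-[:AUTHORED]-(a2:Author)
WHERE elementId(a1) < elementId(a2)
WITH a1, a2, count(p) AS papers_together
MERGE (a1)-[c:COLLABORATES_WITH]-(a2)
SET c.papers_together = papers_together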
// Project collaboration network (undirected for community detection)
CALL gds.graph.project(
  'collaboration-network',
  'Author',
  {COLLABORATES_WITH: {orientation: 'UNDIRECTED'}}
)
// Detect communities
CALL gds.louvain.write('collaboration-network', {
writeProperty: 'community'
})
// View communities
MATCH (a:Author)
WITH a.community AS community, collect(a.name) AS members
WHERE size(members) > 5
RETURN community, size(members) AS size, members[0..10] AS sample
ORDER BY size DESC
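In-memory projections hold heap until dropped, so clean up once the algorithms have written their results back:
CALL gds.graph.drop('citation-network', false);
CALL gds.graph.drop('collaboration-network', false);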
Part 4: REST API (FastAPI)
File: api.py
import os
from typing import List

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

app = FastAPI()

# Read connection settings from the environment so the same code runs
# locally and inside Docker Compose (see Part 7)
driver = GraphDatabase.driver(
    os.getenv("NEO4J_URI", "bolt://localhost:7687"),
    auth=(os.getenv("NEO4J_USER", "neo4j"), os.getenv("NEO4J_PASSWORD", "password")),
)
model = SentenceTransformer('all-MiniLM-L6-v2')
class SearchQuery(BaseModel):
    query: str
    limit: int = 10

class Paper(BaseModel):
    id: str
    title: str
    abstract: str
    year: int
    authors: List[str]
    topics: List[str]
@app.get("/papers/{paper_id}", response_model=Paper)
def get_paper(paper_id: str):
with driver.session() as session:
result = session.run("""
MATCH (p:Paper {id: $id})
OPTIONAL MATCH (a:Author)-[:AUTHORED]->(p)
OPTIONAL MATCH (p)-[:ABOUT]->(t:Topic)
RETURN p.id AS id,
p.title AS title,
p.abstract AS abstract,
p.year AS year,
collect(DISTINCT a.name) AS authors,
collect(DISTINCT t.name) AS topics
""", id=paper_id).single()
if not result:
raise HTTPException(status_code=404, detail="Paper not found")
return Paper(**result)
@app.post("/search")
def search_papers(query: SearchQuery):
embedding = model.encode(query.query).tolist()
with driver.session() as session:
result = session.run("""
MATCH (p:Paper)
WHERE vector.similarity.cosine(p.embedding, $embedding) > 0.6
OPTIONAL MATCH (a:Author)-[:AUTHORED]->(p)
WITH p, vector.similarity.cosine(p.embedding, $embedding) AS score,
collect(a.name) AS authors
ORDER BY score DESC
LIMIT $limit
RETURN p.id AS id, p.title AS title, p.year AS year,
authors, score
""", embedding=embedding, limit=query.limit)
return [dict(record) for record in result]
@app.get("/authors/{author_id}/recommendations")
def recommend_papers(author_id: str, limit: int = 10):
with driver.session() as session:
result = session.run("""
MATCH (me:Author {id: $author_id})-[:AUTHORED]->(my_paper:Paper)
MATCH (my_paper)-[:ABOUT]->(topic:Topic)<-[:ABOUT]-(rec:Paper)
WHERE NOT (me)-[:AUTHORED]->(rec)
WITH rec, count(DISTINCT topic) AS relevance
ORDER BY relevance DESC, rec.citation_count DESC
LIMIT $limit
OPTIONAL MATCH (a:Author)-[:AUTHORED]->(rec)
RETURN rec.id AS id, rec.title AS title, rec.year AS year,
relevance, collect(a.name) AS authors
""", author_id=author_id, limit=limit)
return [dict(record) for record in result]
@app.get("/stats")
def get_stats():
with driver.session() as session:
result = session.run("""
MATCH (p:Paper) WITH count(p) AS papers
MATCH (a:Author) WITH papers, count(a) AS authors
MATCH ()-[c:CITES]->() WITH papers, authors, count(c) AS citations
MATCH (t:Topic) WITH papers, authors, citations, count(t) AS topics
RETURN papers, authors, citations, topics
""").single()
return dict(result)
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
Run API:
uvicorn api:app --reload  # or: python api.py
Test:
curl http://localhost:8000/stats
curl -X POST http://localhost:8000/search \
-H "Content-Type: application/json" \
-d '{"query": "graph neural networks", "limit": 5}'
Part 5: Visualization Dashboard
Frontend: React + vis.js
File: App.js
import React, { useState } from 'react';
import Graph from 'react-graph-vis';

function App() {
  const [graph, setGraph] = useState({ nodes: [], edges: [] });
  const [search, setSearch] = useState('');

  const fetchGraph = async (authorId) => {
    const response = await fetch(`http://localhost:8000/authors/${authorId}/network`);
    const data = await response.json();

    const nodes = data.authors.map(a => ({
      id: a.id,
      label: a.name,
      title: `${a.name}\nPapers: ${a.paper_count}`,
      value: a.paper_count
    }));

    const edges = data.collaborations.map(c => ({
      from: c.author1,
      to: c.author2,
      value: c.papers_together
    }));

    setGraph({ nodes, edges });
  };

  const options = {
    nodes: {
      shape: 'dot',
      scaling: { min: 10, max: 30 }
    },
    edges: {
      width: 0.5,
      color: { inherit: 'from' },
      smooth: { type: 'continuous' }
    },
    physics: {
      stabilization: false,
      barnesHut: {
        gravitationalConstant: -8000,
        springConstant: 0.04,
        springLength: 95
      }
    }
  };

  return (
    <div>
      <h1>ScholarGraph</h1>
      <input
        type="text"
        placeholder="Search authors..."
        value={search}
        onChange={(e) => setSearch(e.target.value)}
      />
      <button onClick={() => fetchGraph(search)}>Visualize Network</button>
      <Graph
        graph={graph}
        options={options}
        style={{ height: '600px' }}
      />
    </div>
  );
}

export default App;
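The dashboard calls an /authors/{author_id}/network endpoint that Part 4's API never defines. A minimal sketch, assuming the COLLABORATES_WITH edges from Feature 5 (note the dashboard passes the raw search-box value as the id, so you may prefer matching on name):

@app.get("/authors/{author_id}/network")
def author_network(author_id: str):
    """One-hop collaboration network around an author (sketch)."""
    with driver.session() as session:
        authors = session.run("""
            MATCH (me:Author {id: $id})-[:COLLABORATES_WITH]-(peer:Author)
            WITH me, collect(DISTINCT peer) AS peers
            UNWIND peers + [me] AS a
            OPTIONAL MATCH (a)-[:AUTHORED]->(p:Paper)
            RETURN a.id AS id, a.name AS name, count(p) AS paper_count
        """, id=author_id).data()
        collaborations = session.run("""
            MATCH (a1:Author {id: $id})-[c:COLLABORATES_WITH]-(a2:Author)
            RETURN a1.id AS author1, a2.id AS author2,
                   c.papers_together AS papers_together
        """, id=author_id).data()
    return {"authors": authors, "collaborations": collaborations}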
Part 6: Performance Optimization
Batch Imports
Use UNWIND for bulk inserts:
// Instead of 1000 individual queries
UNWIND $papers AS paper
MERGE (p:Paper {id: paper.id})
SET p.title = paper.title,
p.abstract = paper.abstract,
p.year = paper.year
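On the driver side, a sketch of sending those batches (the helper name and batch size are illustrative):

def _merge_papers(tx, batch):
    tx.run("""
        UNWIND $papers AS paper
        MERGE (p:Paper {id: paper.id})
        SET p.title = paper.title,
            p.abstract = paper.abstract,
            p.year = paper.year
    """, papers=batch)

def import_papers_batch(papers, batch_size=500):
    # One UNWIND statement per chunk instead of one query per paper
    with driver.session() as session:
        for i in range(0, len(papers), batch_size):
            session.execute_write(_merge_papers, papers[i:i + batch_size])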
Query Optimization
Before (slow):
MATCH (a:Author)
MATCH (a)-[:AUTHORED]->(p:Paper)
WHERE p.year = 2023
RETURN a.name, count(p)
After (fast):
MATCH (a:Author)-[:AUTHORED]->(p:Paper {year: 2023})
RETURN a.name, count(p)
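Either form only becomes an index seek if an index on the filtered property exists, so it's worth adding one alongside the Part 1 schema:
CREATE INDEX paper_year FOR (p:Paper) ON (p.year);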
Caching
Add Redis for hot queries:
import redis
import json
cache = redis.Redis(host='localhost', port=6379, decode_responses=True)
def get_paper(paper_id):
    # Check cache first
    cached = cache.get(f"paper:{paper_id}")
    if cached:
        return json.loads(cached)

    # Fall back to Neo4j
    with driver.session() as session:
        result = session.run(
            "MATCH (p:Paper {id: $id}) RETURN p", id=paper_id
        ).single()
        if result is None:
            return None  # unknown paper; nothing to cache
        paper = dict(result['p'])

    # Cache for 1 hour
    cache.setex(f"paper:{paper_id}", 3600, json.dumps(paper))
    return paper
Part 7: Deployment
Docker Compose
docker-compose.yml:
version: '3.8'

services:
  neo4j:
    image: neo4j:5.13
    ports:
      - "7474:7474"  # Browser
      - "7687:7687"  # Bolt
    environment:
      NEO4J_AUTH: neo4j/password
      # Neo4j 5 renamed dbms.memory.* settings to server.memory.*
      NEO4J_server_memory_heap_max__size: 8G
      NEO4J_server_memory_pagecache_size: 4G
      # Graph Data Science is required for the PageRank/Louvain features
      NEO4J_PLUGINS: '["graph-data-science"]'
    volumes:
      - neo4j_data:/data
      - neo4j_logs:/logs

  api:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      - neo4j
    environment:
      NEO4J_URI: bolt://neo4j:7687
      NEO4J_USER: neo4j
      NEO4J_PASSWORD: password

  redis:
    image: redis:7
    ports:
      - "6379:6379"

volumes:
  neo4j_data:
  neo4j_logs:
Dockerfile (API):
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]
Deploy:
docker compose up -d --build
Part 8: Deliverables
- Data Model Diagram: Nodes, relationships, properties
- Import Scripts: Python code to load arXiv data
- API Implementation: FastAPI with all endpoints
- Frontend: React dashboard with graph visualization
- Query Collection: 20+ Cypher queries for features
- Performance Report: Query times, optimization results
- Deployment Guide: Docker Compose setup
- Presentation: Demo video + slides
Summary
ScholarGraph demonstrates:
- ✅ Graph data modeling (papers, authors, topics)
- ✅ Semantic search (vector embeddings)
- ✅ Recommendations (collaborative filtering)
- ✅ Graph algorithms (PageRank, Louvain)
- ✅ REST API (FastAPI)
- ✅ Visualization (React + vis.js)
- ✅ Production deployment (Docker)
Congratulations on completing the Neo4j mastery course! 🎉
What’s Next?