ScholarGraph

# Capstone: Building a Knowledge Graph Platform

> Build a production-ready knowledge graph with entity extraction, semantic search, recommendations, and graph analytics

<Info>
  **Project Duration**: 15-20 hours
  **Learning Style**: Full-Stack Implementation + Graph Algorithms + Production Deployment
  **Outcome**: Complete knowledge graph application demonstrating Neo4j mastery
</Info>

## Project: Academic Research Knowledge Graph

Build **ScholarGraph** - a knowledge graph connecting:

* **Papers** (research publications)
* **Authors** (researchers)
* **Institutions** (universities, labs)
* **Topics** (research areas)
* **Citations** (paper references)

**Features**:

1. Import data from academic APIs
2. Entity relationship extraction
3. Semantic search (find similar papers)
4. Recommendation engine (relevant papers for researchers)
5. Network analysis (influential authors, emerging topics)
6. Visualization dashboard

***

## Part 1: Data Model Design

### Schema

```cypher theme={null}
// Node labels
(:Paper {id, title, abstract, year, citation_count})
(:Author {id, name, h_index, institution})
(:Institution {id, name, country, rank})
(:Topic {name})

// Relationships
(author)-[:AUTHORED]->(paper)
(paper)-[:CITES]->(paper)
(paper)-[:ABOUT]->(topic)
(author)-[:AFFILIATED_WITH]->(institution)
(author)-[:COLLABORATES_WITH {papers_together}]->(author)
```

### Indexes and Constraints

```cypher theme={null}
// Unique constraints
CREATE CONSTRAINT paper_id FOR (p:Paper) REQUIRE p.id IS UNIQUE;
CREATE CONSTRAINT author_id FOR (a:Author) REQUIRE a.id IS UNIQUE;
CREATE CONSTRAINT institution_id FOR (i:Institution) REQUIRE i.id IS UNIQUE;

// Indexes for search
CREATE INDEX paper_title FOR (p:Paper) ON (p.title);
CREATE INDEX author_name FOR (a:Author) ON (a.name);
CREATE FULLTEXT INDEX paper_search FOR (p:Paper) ON EACH [p.title, p.abstract];

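// Vector index for embeddings (CREATE VECTOR INDEX requires Neo4j 5.15+)
// all-MiniLM-L6-v2, used in Part 2, produces 384-dimensional vectors
CREATE VECTOR INDEX paper_embedding FOR (p:Paper) ON (p.embedding)
OPTIONS {indexConfig: {
  `vector.dimensions`: 384,
  `vector.similarity_function`: 'cosine'
}};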
```

***

## Part 2: Data Import

### Sample Data Source: arXiv API

**Python Script** (import\_arxiv.py):

```python theme={null}
import arxiv
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer

# Neo4j connection
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Embedding model (for semantic search)
model = SentenceTransformer('all-MiniLM-L6-v2')

def import_paper(paper_data):
    with driver.session() as session:
        session.execute_write(_create_paper, paper_data)

def _create_paper(tx, data):
    # Generate embedding
    embedding = model.encode(data['abstract']).tolist()

    # Create paper
    query = """
    MERGE (p:Paper {id: $id})
    SET p.title = $title,
        p.abstract = $abstract,
        p.year = $year,
        p.embedding = $embedding

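    // Create authors. arXiv provides no author IDs, so name is the merge key
    // and an id is generated on first sight (assumes names are unique enough).
    FOREACH (author IN $authors |
        MERGE (a:Author {name: author.name})
          ON CREATE SET a.id = randomUUID()
        MERGE (a)-[:AUTHORED]->(p)
    )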

    // Create topics
    FOREACH (topic IN $topics |
        MERGE (t:Topic {name: topic})
        MERGE (p)-[:ABOUT]->(t)
    )

    RETURN p
    """

    tx.run(query,
        id=data['id'],
        title=data['title'],
        abstract=data['abstract'],
        year=data['year'],
        embedding=embedding,
        authors=data['authors'],
        topics=data['topics']
    )

# Fetch papers from arXiv
search = arxiv.Search(
    query="cat:cs.AI",  # AI papers
    max_results=1000,
    sort_by=arxiv.SortCriterion.SubmittedDate
)

# Search.results() is deprecated in arxiv 2.x; fetch through a Client instead
for result in arxiv.Client().results(search):
    paper_data = {
        'id': result.entry_id,
        'title': result.title,
        'abstract': result.summary,
        'year': result.published.year,
        'authors': [{'name': author.name} for author in result.authors],
        'topics': list(result.categories)  # categories are plain strings, e.g. 'cs.AI'
    }
    import_paper(paper_data)
    print(f"Imported: {result.title}")

driver.close()
```

### Import Citations

```python theme={null}
def import_citations(paper_id, cited_ids):
    with driver.session() as session:
        session.run("""
            MATCH (citing:Paper {id: $citing_id})
            UNWIND $cited_ids AS cited_id
            MATCH (cited:Paper {id: cited_id})
            MERGE (citing)-[:CITES]->(cited)
        """, citing_id=paper_id, cited_ids=cited_ids)
```

***

## Part 3: Core Features

### Feature 1: Semantic Search

**Find similar papers by embedding**:

```cypher theme={null}
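// $query_embedding is computed in the application layer from the user's
// search text (for example "graph neural networks") and passed as a parameter.
// vector.similarity.cosine() requires Neo4j 5.18+ and is evaluated for every
// Paper; for large graphs, query the paper_embedding index with
// db.index.vector.queryNodes() instead.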

MATCH (p:Paper)
WHERE vector.similarity.cosine(p.embedding, $query_embedding) > 0.7
RETURN p.title, p.abstract,
       vector.similarity.cosine(p.embedding, $query_embedding) AS similarity
ORDER BY similarity DESC
LIMIT 10
```

**Python API**:

```python theme={null}
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')

def search_papers(query_text):
    query_embedding = model.encode(query_text).tolist()

    with driver.session() as session:
        result = session.run("""
            MATCH (p:Paper)
            WHERE vector.similarity.cosine(p.embedding, $embedding) > 0.7
            RETURN p.title AS title,
                   p.abstract AS abstract,
                   vector.similarity.cosine(p.embedding, $embedding) AS score
            ORDER BY score DESC
            LIMIT 10
        """, embedding=query_embedding)

        return [dict(record) for record in result]

# Usage
results = search_papers("transformers for natural language processing")
for r in results:
    print(f"{r['title']} (score: {r['score']:.3f})")
```
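To make the `> 0.7` threshold easier to reason about, here is a minimal pure-Python sketch of cosine similarity. The `normalized_score` helper is an assumption: it maps the raw cosine from [-1, 1] onto [0, 1] with `(1 + cos) / 2`, which is how Neo4j's vector scores are commonly described. Verify this against the documentation for your Neo4j version before tuning thresholds.

```python
import math


def cosine(a, b):
    """Raw cosine similarity of two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def normalized_score(a, b):
    """Assumed mapping of the raw cosine onto [0, 1]: (1 + cos) / 2."""
    return (1 + cosine(a, b)) / 2
```

Under that assumption, a threshold of 0.7 corresponds to a raw cosine of 0.4, so it is less strict than it looks.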

### Feature 2: Author Recommendations

**"Authors working on similar topics to you"**:

```cypher theme={null}
MATCH (me:Author {id: $author_id})-[:AUTHORED]->(:Paper)-[:ABOUT]->(topic:Topic)
WITH me, collect(DISTINCT topic) AS my_topics

MATCH (other:Author)-[:AUTHORED]->(:Paper)-[:ABOUT]->(topic:Topic)
WHERE other <> me AND topic IN my_topics
WITH other, count(DISTINCT topic) AS common_topics
ORDER BY common_topics DESC
LIMIT 10

OPTIONAL MATCH (other)-[:AFFILIATED_WITH]->(inst:Institution)
RETURN other.name AS author,
       common_topics,
       inst.name AS institution
```
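The ranking this query aims for is the number of distinct topics shared with each other author. A small in-memory sketch of that logic can serve as a test oracle for the Cypher. The `author_topics` mapping is a hypothetical stand-in for the graph, not part of the project code.

```python
def similar_authors(author_topics, me, limit=10):
    """Rank other authors by the number of distinct topics shared with `me`.

    author_topics: dict mapping author name -> set of topic names
    (a hypothetical in-memory stand-in for the Author/Paper/Topic graph).
    """
    my_topics = author_topics[me]
    overlap = {}
    for other, topics in author_topics.items():
        if other == me:
            continue
        shared = len(my_topics & topics)
        if shared:
            overlap[other] = shared
    # Highest overlap first; ties broken by name so the result is deterministic
    ranked = sorted(overlap.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]
```

Counting distinct topics matters: an author with many papers on one shared topic should not outrank an author who shares several topics.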

### Feature 3: Paper Recommendations

**"Papers you might be interested in"** (collaborative filtering):

```cypher theme={null}
// Based on other papers written by the authors who cite your work
MATCH (me:Author {id: $author_id})-[:AUTHORED]->(my_paper:Paper)
MATCH (my_paper)<-[:CITES]-(citing:Paper)<-[:AUTHORED]-(other:Author)
MATCH (other)-[:AUTHORED]->(rec:Paper)
WHERE NOT (me)-[:AUTHORED]->(rec)
WITH rec, count(DISTINCT other) AS score
ORDER BY score DESC
LIMIT 10
RETURN rec.title, rec.year, score
```

### Feature 4: Influential Authors (PageRank)

**Find most influential researchers**:

```cypher theme={null}
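// Create citation network projection. CITES connects papers, so the
// projection uses Paper nodes (an Author projection would have no edges).
CALL gds.graph.project(
  'citation-network',
  'Paper',
  {
    CITES: {
      orientation: 'NATURAL',
      aggregation: 'SINGLE'
    }
  }
);

// Run PageRank over paper citations
CALL gds.pageRank.write('citation-network', {
  writeProperty: 'influence_score',
  dampingFactor: 0.85,
  maxIterations: 20
});

// Query top authors by the summed influence of their papers
MATCH (a:Author)-[:AUTHORED]->(p:Paper)
RETURN a.name, a.h_index, sum(p.influence_score) AS influence_score
ORDER BY influence_score DESC
LIMIT 20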
```
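To see what `dampingFactor` and `maxIterations` control, here is a minimal sketch of the textbook PageRank power iteration in plain Python. It is not GDS's implementation. GDS scales scores differently, so the numbers will not match `influence_score`, but the relative ordering on a small graph should be comparable.

```python
def pagerank(edges, damping=0.85, max_iterations=20):
    """Textbook PageRank over (source, target) pairs, e.g. (citing, cited)."""
    edges = list(edges)
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for source, target in edges:
        out_links[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iterations):
        # Rank held by nodes without outgoing links is spread evenly
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1 - damping) / n + damping * dangling / n for node in nodes
        }
        # Every node passes its rank on in equal shares along outgoing links
        for source in nodes:
            for target in out_links[source]:
                new_rank[target] += damping * rank[source] / len(out_links[source])
        rank = new_rank
    return rank
```

A higher damping factor gives more weight to the link structure and less to the uniform baseline. More iterations bring the scores closer to their fixed point.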

### Feature 5: Research Communities

**Detect research groups** (Louvain):

```cypher theme={null}
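// Derive COLLABORATES_WITH from co-authorship (the import does not create it)
MATCH (a:Author)-[:AUTHORED]->(p:Paper)<-[:AUTHORED]-(b:Author)
WHERE a.name < b.name
WITH a, b, count(p) AS papers_together
MERGE (a)-[r:COLLABORATES_WITH]->(b)
SET r.papers_together = papers_together;

// Project collaboration network (collaboration has no direction)
CALL gds.graph.project(
  'collaboration-network',
  'Author',
  {COLLABORATES_WITH: {orientation: 'UNDIRECTED'}}
);

// Detect communities
CALL gds.louvain.write('collaboration-network', {
  writeProperty: 'community'
});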

// View communities
MATCH (a:Author)
WITH a.community AS community, collect(a.name) AS members
WHERE size(members) > 5
RETURN community, size(members) AS size, members[0..10] AS sample
ORDER BY size DESC
```
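The `papers_together` property on `COLLABORATES_WITH` is a co-authorship count. As a sanity check outside the database, the same count can be sketched in plain Python. The `paper_authors` mapping is a hypothetical in-memory stand-in for the imported data.

```python
from collections import Counter
from itertools import combinations


def count_collaborations(paper_authors):
    """Count shared papers for every unordered pair of co-authors.

    paper_authors: dict mapping paper id -> list of author names.
    Returns a Counter keyed by (author_a, author_b) with author_a < author_b.
    """
    pairs = Counter()
    for authors in paper_authors.values():
        # set() drops duplicate names; sorted() fixes the order inside a pair
        for a, b in combinations(sorted(set(authors)), 2):
            pairs[(a, b)] += 1
    return pairs
```

Keeping each pair in a fixed order gives one edge per pair of collaborators, which suits an undirected community algorithm such as Louvain.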

***

## Part 4: REST API (FastAPI)

**File: api.py**

```python theme={null}
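import os

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from neo4j import GraphDatabase
from sentence_transformers import SentenceTransformer
from typing import List

app = FastAPI()

# Connection settings come from the environment (set in docker-compose.yml),
# falling back to the local defaults used elsewhere in this guide.
driver = GraphDatabase.driver(
    os.environ.get("NEO4J_URI", "bolt://localhost:7687"),
    auth=(
        os.environ.get("NEO4J_USER", "neo4j"),
        os.environ.get("NEO4J_PASSWORD", "password"),
    ),
)
model = SentenceTransformer('all-MiniLM-L6-v2')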

class SearchQuery(BaseModel):
    query: str
    limit: int = 10

class Paper(BaseModel):
    id: str
    title: str
    abstract: str
    year: int
    authors: List[str]
    topics: List[str]

@app.get("/papers/{paper_id}", response_model=Paper)
def get_paper(paper_id: str):
    with driver.session() as session:
        result = session.run("""
            MATCH (p:Paper {id: $id})
            OPTIONAL MATCH (a:Author)-[:AUTHORED]->(p)
            OPTIONAL MATCH (p)-[:ABOUT]->(t:Topic)
            RETURN p.id AS id,
                   p.title AS title,
                   p.abstract AS abstract,
                   p.year AS year,
                   collect(DISTINCT a.name) AS authors,
                   collect(DISTINCT t.name) AS topics
        """, id=paper_id).single()

        if not result:
            raise HTTPException(status_code=404, detail="Paper not found")

        return Paper(**result)

@app.post("/search")
def search_papers(query: SearchQuery):
    embedding = model.encode(query.query).tolist()

    with driver.session() as session:
        result = session.run("""
            MATCH (p:Paper)
            WHERE vector.similarity.cosine(p.embedding, $embedding) > 0.6
            OPTIONAL MATCH (a:Author)-[:AUTHORED]->(p)
            WITH p, vector.similarity.cosine(p.embedding, $embedding) AS score,
                 collect(a.name) AS authors
            ORDER BY score DESC
            LIMIT $limit
            RETURN p.id AS id, p.title AS title, p.year AS year,
                   authors, score
        """, embedding=embedding, limit=query.limit)

        return [dict(record) for record in result]

@app.get("/authors/{author_id}/recommendations")
def recommend_papers(author_id: str, limit: int = 10):
    with driver.session() as session:
        result = session.run("""
            MATCH (me:Author {id: $author_id})-[:AUTHORED]->(my_paper:Paper)
            MATCH (my_paper)-[:ABOUT]->(topic:Topic)<-[:ABOUT]-(rec:Paper)
            WHERE NOT (me)-[:AUTHORED]->(rec)
            WITH rec, count(DISTINCT topic) AS relevance
            ORDER BY relevance DESC, rec.citation_count DESC
            LIMIT $limit
            OPTIONAL MATCH (a:Author)-[:AUTHORED]->(rec)
            RETURN rec.id AS id, rec.title AS title, rec.year AS year,
                   relevance, collect(a.name) AS authors
        """, author_id=author_id, limit=limit)

        return [dict(record) for record in result]

@app.get("/stats")
def get_stats():
    with driver.session() as session:
        result = session.run("""
            MATCH (p:Paper) WITH count(p) AS papers
            MATCH (a:Author) WITH papers, count(a) AS authors
            MATCH ()-[c:CITES]->() WITH papers, authors, count(c) AS citations
            MATCH (t:Topic) WITH papers, authors, citations, count(t) AS topics
            RETURN papers, authors, citations, topics
        """).single()

        return dict(result)

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```

**Run API**:

```bash theme={null}
python api.py
```

**Test**:

```bash theme={null}
curl http://localhost:8000/stats
curl -X POST http://localhost:8000/search \
  -H "Content-Type: application/json" \
  -d '{"query": "graph neural networks", "limit": 5}'
```

***

## Part 5: Visualization Dashboard

**Frontend: React + vis.js**

**File: App.js**

```javascript theme={null}
import React, { useState } from 'react';
import Graph from 'react-graph-vis';

function App() {
  const [graph, setGraph] = useState({ nodes: [], edges: [] });
  const [search, setSearch] = useState('');

  const fetchGraph = async (authorId) => {
    const response = await fetch(`http://localhost:8000/authors/${authorId}/network`);
    const data = await response.json();

    const nodes = data.authors.map(a => ({
      id: a.id,
      label: a.name,
      title: `${a.name}\nPapers: ${a.paper_count}`,
      value: a.paper_count
    }));

    const edges = data.collaborations.map(c => ({
      from: c.author1,
      to: c.author2,
      value: c.papers_together
    }));

    setGraph({ nodes, edges });
  };

  const options = {
    nodes: {
      shape: 'dot',
      scaling: { min: 10, max: 30 }
    },
    edges: {
      width: 0.5,
      color: { inherit: 'from' },
      smooth: { type: 'continuous' }
    },
    physics: {
      stabilization: false,
      barnesHut: {
        gravitationalConstant: -8000,
        springConstant: 0.04,
        springLength: 95
      }
    }
  };

  return (
    <div>
      <h1>ScholarGraph</h1>
      <input
        type="text"
        placeholder="Search authors..."
        value={search}
        onChange={(e) => setSearch(e.target.value)}
      />
      <button onClick={() => fetchGraph(search)}>Visualize Network</button>

      <Graph
        graph={graph}
        options={options}
        style={{ height: '600px' }}
      />
    </div>
  );
}

export default App;
```
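The dashboard calls `/authors/{author_id}/network`, which `api.py` above does not define. One possible way to shape that response on the server is sketched below. The input row keys (`a_id`, `a_name`, `a_papers`, `b_id`, `b_name`, `b_papers`, `papers_together`) are hypothetical names for whatever a collaboration query would return. Only the output fields are taken from what `App.js` reads.

```python
def build_network_payload(rows):
    """Turn collaboration rows into the {authors, collaborations} shape
    that App.js expects.

    rows: iterable of dicts, one per collaboration edge, using the
    hypothetical keys a_id, a_name, a_papers, b_id, b_name, b_papers
    and papers_together.
    """
    authors = {}
    collaborations = []
    for row in rows:
        # Register both endpoints once, keyed by id to avoid duplicates
        for prefix in ("a", "b"):
            author_id = row[f"{prefix}_id"]
            authors[author_id] = {
                "id": author_id,
                "name": row[f"{prefix}_name"],
                "paper_count": row[f"{prefix}_papers"],
            }
        collaborations.append({
            "author1": row["a_id"],
            "author2": row["b_id"],
            "papers_together": row["papers_together"],
        })
    return {"authors": list(authors.values()), "collaborations": collaborations}
```

A FastAPI handler could run the collaboration query, pass the records through this function, and return the result directly as JSON.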

***

## Part 6: Performance Optimization

### Batch Imports

**Use UNWIND for bulk inserts**:

```cypher theme={null}
// Instead of 1000 individual queries
UNWIND $papers AS paper
MERGE (p:Paper {id: paper.id})
SET p.title = paper.title,
    p.abstract = paper.abstract,
    p.year = paper.year
```
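On the Python side, the `$papers` parameter has to be sent in bounded chunks so a single transaction does not grow without limit. A minimal sketch follows. The `run` argument is assumed to be any callable with the shape of `session.run(query, **params)`, and the batch size of 500 is an arbitrary starting point, not a measured optimum.

```python
from itertools import islice

BATCH_QUERY = """
UNWIND $papers AS paper
MERGE (p:Paper {id: paper.id})
SET p.title = paper.title,
    p.abstract = paper.abstract,
    p.year = paper.year
"""


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def import_in_batches(run, papers, batch_size=500):
    """Send papers to Neo4j one UNWIND query per batch; returns the total sent."""
    total = 0
    for batch in chunked(papers, batch_size):
        run(BATCH_QUERY, papers=batch)
        total += len(batch)
    return total
```

With the driver from Part 2 this could be called as `import_in_batches(session.run, papers)` inside a `with driver.session() as session:` block.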

### Query Optimization

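**Before** (slow when `Paper.year` is not indexed, because papers are checked one by one while expanding from every author):

```cypher theme={null}
MATCH (a:Author)
MATCH (a)-[:AUTHORED]->(p:Paper)
WHERE p.year = 2023
RETURN a.name, count(p)
```

**After** (fast: index the filtered property so the planner can start from the 2023 papers; the inline filter is equivalent to the `WHERE` clause but more compact):

```cypher theme={null}
CREATE INDEX paper_year IF NOT EXISTS FOR (p:Paper) ON (p.year);

MATCH (a:Author)-[:AUTHORED]->(p:Paper {year: 2023})
RETURN a.name, count(p)
```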
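To back an optimization with numbers for the performance report, each query can be timed several times and summarized by its median. The helper below is a minimal sketch. The name `time_call` and the default of five runs are illustrative choices.

```python
import statistics
import time


def time_call(fn, runs=5):
    """Return the median wall-clock time, in milliseconds, of `runs` calls to fn."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)
```

For a Neo4j query, `fn` should also consume the result, for example `lambda: session.run(query).consume()`, so that the measurement covers the full execution. Prefixing the query with `PROFILE` in Neo4j Browser shows where the time goes.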

### Caching

**Add Redis for hot queries**:

```python theme={null}
import redis
import json

cache = redis.Redis(host='localhost', port=6379, decode_responses=True)

def get_paper(paper_id):
    # Check cache
    cached = cache.get(f"paper:{paper_id}")
    if cached:
        return json.loads(cached)

    # Query Neo4j
    with driver.session() as session:
        record = session.run("MATCH (p:Paper {id: $id}) RETURN p", id=paper_id).single()

    # Unknown id: return None and do not cache the miss
    if record is None:
        return None
    paper = dict(record['p'])

    # Cache for 1 hour
    cache.setex(f"paper:{paper_id}", 3600, json.dumps(paper))

    return paper
```
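The same cache-aside pattern can be factored into a helper that works for any query. In the sketch below, `get_or_load` and `DictCache` are hypothetical names. `DictCache` is an in-memory stand-in that ignores the TTL and exists only so the example runs without Redis. A `redis.Redis` client offers the same `get` and `setex` calls.

```python
import json


class DictCache:
    """In-memory stand-in for a Redis client (ignores the TTL)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def setex(self, key, ttl_seconds, value):
        self._data[key] = value


def get_or_load(cache, key, loader, ttl_seconds=3600):
    """Return the cached JSON value for key, or call loader() and cache it."""
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    value = loader()
    # Misses (None) are not cached, so a paper imported later becomes visible
    if value is not None:
        cache.setex(key, ttl_seconds, json.dumps(value))
    return value
```

Cached entries go stale when a paper is updated. Deleting the key after each write, or choosing a shorter TTL, keeps that window small.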

***

## Part 7: Deployment

### Docker Compose

**docker-compose.yml**:

```yaml theme={null}
version: '3.8'

services:
  neo4j:
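    image: neo4j:5.18  # vector.similarity.cosine() in Part 3 needs 5.18+
    ports:
      - "7474:7474"  # Browser
      - "7687:7687"  # Bolt
    environment:
      NEO4J_AUTH: neo4j/password
      NEO4J_PLUGINS: '["graph-data-science"]'  # GDS for PageRank and Louvain
      NEO4J_server_memory_heap_max__size: 8G
      NEO4J_server_memory_pagecache_size: 4G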
    volumes:
      - neo4j_data:/data
      - neo4j_logs:/logs

  api:
    build: .
    ports:
      - "8000:8000"
    depends_on:
      - neo4j
    environment:
      NEO4J_URI: bolt://neo4j:7687
      NEO4J_USER: neo4j
      NEO4J_PASSWORD: password

  redis:
    image: redis:7
    ports:
      - "6379:6379"

volumes:
  neo4j_data:
  neo4j_logs:
```

**Dockerfile** (API):

```dockerfile theme={null}
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]
```

**Deploy**:

```bash theme={null}
docker-compose up -d
```
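`depends_on` only orders container startup. It does not wait until Neo4j accepts Bolt connections, so the API may come up first and fail on its first query. One way to handle this is a small retry loop at startup, sketched below. The name `wait_until_ready` and its defaults are illustrative.

```python
import time


def wait_until_ready(check, attempts=30, delay_seconds=2.0, sleep=time.sleep):
    """Call check() until it stops raising; return the attempt that succeeded.

    check: zero-argument callable that raises while the service is unavailable.
    sleep: injectable for tests; defaults to time.sleep.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            check()
            return attempt
        except Exception as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds)
    raise TimeoutError(f"service not ready after {attempts} attempts") from last_error
```

In `api.py` this could be called as `wait_until_ready(driver.verify_connectivity)` before serving requests. A Compose `healthcheck` on the `neo4j` service is an alternative that keeps the wait out of application code.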

***

## Part 8: Deliverables

1. **Data Model Diagram**: Nodes, relationships, properties
2. **Import Scripts**: Python code to load arXiv data
3. **API Implementation**: FastAPI with all endpoints
4. **Frontend**: React dashboard with graph visualization
5. **Query Collection**: 20+ Cypher queries for features
6. **Performance Report**: Query times, optimization results
7. **Deployment Guide**: Docker Compose setup
8. **Presentation**: Demo video + slides

***

## Summary

**ScholarGraph** demonstrates:

* ✅ Graph data modeling (papers, authors, topics)
* ✅ Semantic search (vector embeddings)
* ✅ Recommendations (collaborative filtering)
* ✅ Graph algorithms (PageRank, Louvain)
* ✅ REST API (FastAPI)
* ✅ Visualization (React + vis.js)
* ✅ Production deployment (Docker)

**Congratulations on completing the Neo4j mastery course!** 🎉

***

## What's Next?

<CardGroup cols={2}>
  <Card title="Apache Cassandra Course" icon="database" href="/distributed-systems-tools/cassandra-overview">
    Master distributed NoSQL databases with Cassandra
  </Card>

  <Card title="Apache Spark Course" icon="fire" href="/distributed-systems-tools/spark-introduction">
    Learn distributed data processing at scale
  </Card>

  <Card title="Apache Flink Course" icon="water" href="/distributed-systems-tools/flink-overview">
    Real-time stream processing and analytics
  </Card>

  <Card title="Apache Kafka Course" icon="stream" href="/distributed-systems-tools/kafka-overview">
    Build event-driven architectures with Kafka
  </Card>
</CardGroup>