Module Duration: 4-5 hours
Learning Style: Historical + Theoretical + Practical Evolution
Outcome: Understand the mathematical foundations and evolution of graph databases, and why they revolutionized data modeling
Consider finding “friends of friends” in a relational database:
Copy
-- Find friends of user 123SELECT friend_id FROM friendships WHERE user_id = 123;-- Find friends of friends (2 hops)SELECT f2.friend_idFROM friendships f1JOIN friendships f2 ON f1.friend_id = f2.user_idWHERE f1.user_id = 123;-- Find friends of friends of friends (3 hops)SELECT f3.friend_idFROM friendships f1JOIN friendships f2 ON f1.friend_id = f2.user_idJOIN friendships f3 ON f2.friend_id = f3.user_idWHERE f1.user_id = 123;
Performance: Each additional hop requires another JOIN, and query time explodes exponentially!
Hops
JOINs
Rows Scanned
Query Time
1
0
100
5ms
2
1
10,000
50ms
3
2
1,000,000
5s
4
3
100,000,000
500s (8+ min!)
The Insight: “Joins are computed at query time. But relationships exist in the data itself—why not store them explicitly?”This realization led to the property graph model and eventually Neo4j (2007).
Graph theory began with Leonhard Euler and the Seven Bridges of Königsberg problem.The Problem:The city of Königsberg had seven bridges connecting four land areas:
Copy
A / \ B C \ / DBridges:A-B, A-B, A-C, B-D, B-D, C-D, C-D
Question: Can you walk through the city crossing each bridge exactly once?Euler’s Insight: Model the problem as a graph:
Nodes (Vertices): Land areas (A, B, C, D)
Edges: Bridges
Euler’s Theorem (1736):
A graph has an Eulerian path (visiting every edge exactly once) if and only if it has exactly 0 or 2 vertices of odd degree.
Analysis:
Copy
Vertex A: degree 3 (odd) - connects to 3 bridgesVertex B: degree 5 (odd) - connects to 5 bridgesVertex C: degree 3 (odd) - connects to 3 bridgesVertex D: degree 3 (odd) - connects to 3 bridgesResult: 4 vertices with odd degree → No Eulerian path exists!
Conclusion: The walk is impossible!Impact: This was the birth of graph theory—representing real-world problems as nodes and edges.
Edgar F. Codd published “A Relational Model of Data for Large Shared Data Banks” (1970).Key Ideas:
Data stored in tables (relations)
Declarative queries (SQL) instead of navigational code
Mathematical foundation (set theory, relational algebra)
Example:
Copy
-- Declarative (what you want, not how to get it)SELECT customer.name, order.totalFROM customersJOIN orders ON customers.id = orders.customer_id;
Impact: Relational databases dominated for 40+ years (Oracle, MySQL, PostgreSQL, SQL Server).But: Relationships are implicit (reconstructed via JOINs at query time).
Authors: Jim Webber (Neo4j Chief Scientist), Emil EifremProblem: How to scale graph databases beyond single machines?Key Insights:1. Locality of TraversalsMost graph queries are localized:
Social network: “Friends of friends” stays within a community
Recommendation: “Users like you who bought…” limited scope
Implication: Sharding is hard because you need to traverse across shards!2. Vertical Scaling FirstModern servers can have:
1-2 TB RAM (entire graph in memory!)
NVMe SSDs (millions of IOPS)
64-128 CPU cores
Neo4j Approach: Optimize for single-machine performance first3. Read Replicas for Scaling
Composable (build complex queries from simple patterns)
Cypher vs. SQL:Find Alice’s friends:
Copy
-- SQL: Think in tablesSELECT u2.nameFROM users u1JOIN friendships f ON u1.id = f.user_idJOIN users u2 ON f.friend_id = u2.idWHERE u1.name = 'Alice';
Copy
-- Cypher: Think in graphsMATCH (alice:Person {name: 'Alice'})-[:KNOWS]->(friend)RETURN friend.name
Find Alice’s friends who live in the same city:
Copy
-- SQL: 3 JOINsSELECT u2.name, c.nameFROM users u1JOIN friendships f ON u1.id = f.user_idJOIN users u2 ON f.friend_id = u2.idJOIN cities c ON u2.city_id = c.idWHERE u1.name = 'Alice' AND u1.city_id = u2.city_id;
Native graph storage: Nodes and relationships stored as disk records with pointers
ACID transactions: Full transactional guarantees
Embedded Java API: Neo4j ran inside your JVM
Storage Format:
Copy
Node Record (9 bytes):┌────────┬────────┬────────┬────────┬────────┐│ In Use │ Next │ Next │ Labels │ Extra ││ (1 bit)│ Rel │ Prop │ │ ││ │ (4 B) │ (4 B) │ │ │└────────┴────────┴────────┴────────┴────────┘Relationship Record (33 bytes):┌────────┬────────┬────────┬────────┬────────┬────────┬────────┬────────┐│ In Use │ First │ Second │ Rel │ First │ First │ Second │ Second ││ (1 bit)│ Node │ Node │ Type │ Prev │ Next │ Prev │ Next ││ │ (4 B) │ (4 B) │ (4 B) │ Rel │ Rel │ Rel │ Rel ││ │ │ │ │ (4 B) │ (4 B) │ (4 B) │ (4 B) │└────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
Key Innovation: Fixed-size records enable O(1) pointer following!To traverse (A)-[:KNOWS]->(B):
// Before: No way to distinguish node typesMATCH (n {name: 'Alice'}) // Scans ALL nodes!// After: Label-based indexingMATCH (alice:Person {name: 'Alice'}) // Only scans Person nodes
2. Schema Indexes:
Copy
CREATE INDEX FOR (p:Person) ON (p.name)
Performance: Name lookup changed from O(N) to O(log N)!3. Cypher as Default:Before: Imperative Java Traversal API
// Create vector index for embeddingsCREATE VECTOR INDEX user_embeddingsFOR (u:User) ON (u.embedding)OPTIONS {dimension: 256, similarity: 'cosine'};// Similarity searchMATCH (u:User)WHERE u.embedding SIMILAR TO $query_vectorRETURN u.name, vector.similarity(u.embedding, $query_vector) AS scoreORDER BY score DESC LIMIT 10;
3. GQL Standard (SQL for graphs):ISO/IEC 39075 (Graph Query Language) standardization in progress.Neo4j contributing to make Cypher the basis for GQL.
Complex relationships: missions, components, failures, teams
SQL queries took hours for deep analysis
Solution: Migrated to Neo4jResults:
Queries reduced from hours to seconds
Engineers could explore connections interactively
Discovered hidden patterns in failure cascades
Example Query:
Copy
// Find all missions affected by a specific component failureMATCH (failure:Failure {component: 'O-ring'})<-[:EXPERIENCED]-(mission:Mission)MATCH (mission)-[:USED]->(component)RETURN mission.name, collect(component.name) AS components
Pre-computed recommendations (batch jobs overnight)
Couldn’t personalize in real-time
2-3% conversion rate
New System (Neo4j):
Copy
// Real-time: "Customers who bought X also bought Y"MATCH (product:Product {id: $product_id})<-[:PURCHASED]-(buyer:User)MATCH (buyer)-[:PURCHASED]->(other:Product)WHERE other <> productRETURN other.name, count(*) AS scoreORDER BY score DESC LIMIT 10
1. “The Network is the Computer” (2012)Authors: Jim Webber, Ian RobinsonAbstract:
Graph databases flip the database paradigm: instead of optimizing for data storage, optimize for data traversal. The network (relationships) is the primary asset.
Key Quote:
“In a graph database, relationships are stored, not computed. This is the single most important difference between graph databases and other NoSQL stores.”
2. “Graph Databases and the Future of Large-Scale Knowledge Management” (2013)Authors: Peter Mika (Yahoo Research)Focus: Using graphs for enterprise knowledge graphsApplications:
Semantic search
Entity resolution
Recommendation systems
3. “Benchmarking Graph Databases” (2015)Authors: Various (LDBC - Linked Data Benchmark Council)Contribution: Standard benchmarks for graph databasesBenchmarks:
Social Network Benchmark (SNB): Facebook-like workload
Business Intelligence Benchmark: Analytics queries
# PyTorch Geometric on Neo4j datafrom torch_geometric.data import Data# Load graph from Neo4jquery = "MATCH (a)-[r]->(b) RETURN id(a), id(b), r.weight"edges = neo4j.execute(query)# Convert to PyTorch tensoredge_index = torch.tensor([[e.a for e in edges], [e.b for e in edges]])data = Data(edge_index=edge_index)# Train GNNmodel = GCNConv(in_channels=128, out_channels=64)
Use Cases:
Node classification (fraud detection)
Link prediction (recommendations)
Graph generation (molecule discovery)
2. Multi-Modal GraphsCombine different data types:
// Relationship with timestampCREATE (alice)-[:KNOWS {from: date('2020-01-01'), to: date('2021-12-31')}]->(bob)// Query: Who was Alice's friend in 2020?MATCH (alice)-[k:KNOWS]->(friend)WHERE date('2020-06-15') >= k.from AND date('2020-06-15') <= k.toRETURN friend.name
4. Distributed Graph DatabasesSharding graphs across machines:Challenges: