Graph Algorithms at Scale
Module Duration : 6-8 hours
Learning Style : Algorithm Theory + Implementation + Real-World Applications
Outcome : Apply graph algorithms to solve real problems: recommendations, fraud detection, network analysis
Introduction: Why Graph Algorithms?
Graph algorithms solve problems that are intractable in relational databases:
Shortest Path : GPS navigation, network routing
PageRank : Search engine ranking, influence scoring
Community Detection : Social groups, customer segmentation
Centrality : Find influencers, critical infrastructure
Link Prediction : Recommendations, fraud detection
Neo4j provides these via the Graph Data Science (GDS) library .
Part 1: Neo4j GDS Setup
Installation
// Check if GDS is installed
CALL gds . version ()
If not installed, download from: https://neo4j.com/download-center/#algorithms
Creating a Graph Projection
Problem : Running algorithms on live data is slow and risky.
Solution : Create an in-memory graph projection (read-only snapshot):
CALL gds . graph . project (
'my-graph' , // Graph name
'Person' , // Node labels
'KNOWS' // Relationship types
)
YIELD graphName , nodeCount , relationshipCount
With properties :
CALL gds . graph . project (
'social-network' ,
'Person' ,
'KNOWS' ,
{
nodeProperties : [ 'age' , 'city' ] ,
relationshipProperties : [ 'since' ]
}
)
Cypher projection (more flexible):
CALL gds . graph . project . cypher (
'my-graph' ,
'MATCH (n:Person) RETURN id(n) AS id' ,
'MATCH (a:Person)-[r:KNOWS]->(b:Person) RETURN id(a) AS source, id(b) AS target'
)
Listing and Dropping Projections
// List all projections
CALL gds . graph . list ()
// Drop projection
CALL gds . graph . drop ( 'my-graph' )
Part 2: Pathfinding Algorithms
1. Shortest Path (Dijkstra)
Use Case : Find minimum-cost path between two nodes
Example : Shortest route between cities
// Create graph with weighted relationships
MATCH ( source : City { name : 'San Francisco' } )
MATCH ( target : City { name : 'New York' } )
CALL gds . shortestPath . dijkstra . stream ( 'city-network' , {
sourceNode : source ,
targetNode : target ,
relationshipWeightProperty : 'distance'
} )
YIELD index , sourceNode , targetNode , totalCost , nodeIds , costs , path
RETURN
gds . util . asNode ( sourceNode ). name AS from ,
gds . util . asNode ( targetNode ). name AS to ,
totalCost AS distance ,
[ nodeId IN nodeIds | gds . util . asNode ( nodeId ). name ] AS route
Output :
from: San Francisco
to: New York
distance: 2900
route: ["San Francisco", "Denver", "Chicago", "New York"]
2. All Shortest Paths
Find all shortest paths (multiple paths with same cost):
MATCH ( source : Person { name : 'Alice' } )
MATCH ( target : Person { name : 'Bob' } )
CALL gds . shortestPath . dijkstra . stream ( 'social-network' , {
sourceNode : source ,
targetNode : target
} )
YIELD path
RETURN [ node IN nodes ( path ) | node . name ] AS path
3. Single-Source Shortest Path
Find shortest paths from one node to all others :
MATCH ( source : Person { name : 'Alice' } )
CALL gds . allShortestPaths . delta . stream ( 'social-network' , {
sourceNode : source
} )
YIELD targetNode , distance
RETURN gds . util . asNode ( targetNode ). name AS person , distance
ORDER BY distance ASC
LIMIT 10
Part 3: Centrality Algorithms
1. Degree Centrality
Measures : Number of connections (simplest centrality)
Use Case : Find most connected people
CALL gds . degree . stream ( 'social-network' )
YIELD nodeId , score
RETURN gds . util . asNode ( nodeId ). name AS person , score AS connections
ORDER BY score DESC
LIMIT 10
Write back to graph :
CALL gds . degree . write ( 'social-network' , {
writeProperty : 'degree'
} )
YIELD nodePropertiesWritten , computeMillis
Measures : Importance based on connections (Google’s algorithm)
Formula :
PR(A) = (1-d) + d × Σ(PR(T_i) / C(T_i))
Where:
d = damping factor (0.85)
T_i = nodes linking to A
C(T_i) = out-degree of T_i
Use Case : Rank influencers, important nodes
CALL gds . pageRank . stream ( 'social-network' , {
maxIterations : 20 ,
dampingFactor : 0.85
} )
YIELD nodeId , score
RETURN gds . util . asNode ( nodeId ). name AS person , score
ORDER BY score DESC
LIMIT 10
Real-World Example : Twitter influence
// High PageRank = influential (many followers who are also influential)
CALL gds . graph . project (
'twitter' ,
'User' ,
{ FOLLOWS : { orientation : 'NATURAL' } }
)
CALL gds . pageRank . write ( 'twitter' , {
writeProperty : 'influence_score'
} )
YIELD nodePropertiesWritten
3. Betweenness Centrality
Measures : How often a node lies on shortest paths between others
Use Case : Find bridges, bottlenecks
CALL gds . betweenness . stream ( 'social-network' )
YIELD nodeId , score
RETURN gds . util . asNode ( nodeId ). name AS person , score AS betweenness
ORDER BY score DESC
LIMIT 10
Interpretation : High betweenness = broker, gatekeeper
4. Closeness Centrality
Measures : Average distance to all other nodes
Use Case : Find nodes that can spread information quickly
CALL gds . closeness . stream ( 'social-network' )
YIELD nodeId , score
RETURN gds . util . asNode ( nodeId ). name AS person , score
ORDER BY score DESC
LIMIT 10
Interpretation : High closeness = well-connected, central position
1. Louvain (Modularity-Based)
Finds : Groups with more internal connections than external
Use Case : Customer segmentation, fraud rings
CALL gds . louvain . stream ( 'social-network' )
YIELD nodeId , communityId
WITH communityId , collect ( gds . util . asNode ( nodeId ). name ) AS members
RETURN communityId , size ( members ) AS size , members
ORDER BY size DESC
Write communities back :
CALL gds . louvain . write ( 'social-network' , {
writeProperty : 'community'
} )
YIELD communityCount , modularity
Query by community :
MATCH ( p : Person { community : 5 } )
RETURN p . name
2. Label Propagation
Faster than Louvain, less accurate
Algorithm : Nodes adopt the most common label among neighbors
CALL gds . labelPropagation . stream ( 'social-network' )
YIELD nodeId , communityId
RETURN gds . util . asNode ( nodeId ). name AS person , communityId
3. Weakly Connected Components
Finds : Disconnected subgraphs
Use Case : Find isolated clusters
CALL gds . wcc . stream ( 'social-network' )
YIELD nodeId , componentId
WITH componentId , collect ( gds . util . asNode ( nodeId ). name ) AS members
WHERE size ( members ) > 1
RETURN componentId , size ( members ) AS size , members
ORDER BY size DESC
Part 5: Similarity Algorithms
1. Node Similarity (Jaccard)
Measures : Overlap of neighbors
Use Case : Recommend similar users
CALL gds . nodeSimilarity . stream ( 'social-network' , {
topK : 10 // Top 10 most similar per node
} )
YIELD node1 , node2 , similarity
RETURN
gds . util . asNode ( node1 ). name AS person1 ,
gds . util . asNode ( node2 ). name AS person2 ,
similarity
ORDER BY similarity DESC
Write similarities as relationships :
CALL gds . nodeSimilarity . write ( 'social-network' , {
writeRelationshipType : 'SIMILAR_TO' ,
writeProperty : 'score' ,
topK : 10
} )
2. Cosine Similarity (Vector-Based)
Use Case : Compare embeddings, feature vectors
// Assume nodes have 'embedding' property (array of floats)
CALL gds . graph . project (
'embeddings-graph' ,
{ Person : { properties : 'embedding' } },
'*'
)
CALL gds . similarity . cosine . stream ( 'embeddings-graph' , {
nodeProperties : [ 'embedding' ]
} )
YIELD node1 , node2 , similarity
RETURN
gds . util . asNode ( node1 ). name ,
gds . util . asNode ( node2 ). name ,
similarity
ORDER BY similarity DESC
Part 6: Link Prediction
1. Adamic Adar
Predicts : Likelihood of future connection
Formula : Sum of 1/log(degree) for common neighbors
Use Case : Friend recommendations, missing links
MATCH ( p1 : Person { name : 'Alice' } )
MATCH ( p2 : Person { name : 'Bob' } )
WHERE NOT ( p1 ) - [: KNOWS ] - ( p2 ) // Not already connected
RETURN gds . alpha . linkprediction . adamicAdar ( p1 , p2 ) AS score
Find top recommendations :
MATCH ( alice : Person { name : 'Alice' } )
MATCH ( candidate : Person )
WHERE NOT ( alice ) - [: KNOWS ] - ( candidate ) AND alice <> candidate
WITH alice , candidate ,
gds . alpha . linkprediction . adamicAdar ( alice , candidate ) AS score
WHERE score > 0
RETURN candidate . name , score
ORDER BY score DESC
LIMIT 10
2. Common Neighbors
Simpler : Count shared neighbors
MATCH ( alice : Person { name : 'Alice' } )
MATCH ( candidate : Person )
WHERE NOT ( alice ) - [: KNOWS ] - ( candidate ) AND alice <> candidate
MATCH ( alice ) - [: KNOWS ] - ( mutual ) - [: KNOWS ] - ( candidate )
WITH candidate , count ( DISTINCT mutual ) AS common_friends
RETURN candidate . name , common_friends
ORDER BY common_friends DESC
LIMIT 10
Part 7: Graph Embeddings
Node2Vec
Generates : Vector representation of nodes
Use Case : Machine learning features, clustering
CALL gds . node2vec . stream ( 'social-network' , {
embeddingDimension : 128 ,
iterations : 10
} )
YIELD nodeId , embedding
RETURN gds . util . asNode ( nodeId ). name AS person , embedding
Write embeddings :
CALL gds . node2vec . write ( 'social-network' , {
embeddingDimension : 128 ,
writeProperty : 'embedding'
} )
Use embeddings for similarity :
MATCH ( p1 : Person { name : 'Alice' } )
MATCH ( p2 : Person )
WITH p1 , p2 , gds . similarity . cosine ( p1 . embedding , p2 . embedding ) AS similarity
WHERE similarity > 0.8 AND p1 <> p2
RETURN p2 . name , similarity
ORDER BY similarity DESC
Part 8: Real-World Use Cases
Use Case 1: Fraud Detection
Goal : Find fraud rings (groups of accounts committing fraud together)
Model :
( account : Account ) - [: SHARED_IP ] -> ( ip : IP )
( account ) - [: SHARED_DEVICE ] -> ( device : Device )
( account ) - [: SHARED_EMAIL ] -> ( email : Email )
Algorithm : Louvain community detection
CALL gds . graph . project ( 'fraud-network' , 'Account' , [ 'SHARED_IP' , 'SHARED_DEVICE' , 'SHARED_EMAIL' ])
CALL gds . louvain . write ( 'fraud-network' , {
writeProperty : 'fraud_community'
} )
// Find suspicious communities (high connectivity)
MATCH ( a : Account )
WITH a . fraud_community AS community , collect ( a ) AS accounts
WHERE size ( accounts ) > 5 // Rings with 5+ accounts
RETURN community , size ( accounts ) AS ring_size , accounts
ORDER BY ring_size DESC
Use Case 2: Supply Chain Analysis
Goal : Find critical suppliers (bottlenecks)
Algorithm : Betweenness centrality
CALL gds . graph . project ( 'supply-chain' , [ 'Supplier' , 'Manufacturer' , 'Distributor' ], 'SUPPLIES' )
CALL gds . betweenness . write ( 'supply-chain' , {
writeProperty : 'criticality'
} )
// Find critical nodes
MATCH ( n )
WHERE n . criticality > 1000
RETURN labels ( n ) AS type , n . name , n . criticality
ORDER BY n . criticality DESC
Use Case 3: Knowledge Graph Reasoning
Goal : Infer missing relationships
Example : Predict who might know whom
// Find people in same company who don't know each other yet
MATCH ( p1 : Person ) - [: WORKS_AT ] -> ( company : Company ) <- [: WORKS_AT ] - ( p2 : Person )
WHERE NOT ( p1 ) - [: KNOWS ] - ( p2 ) AND p1 <> p2
WITH p1 , p2 , gds . alpha . linkprediction . adamicAdar ( p1 , p2 ) AS score
WHERE score > 2
RETURN p1 . name , p2 . name , score
ORDER BY score DESC
1. Use Graph Projections
Always project before running algorithms:
// Bad: Direct algorithm on database (slow, locks data)
CALL gds . pageRank . stream ( { nodeQuery : 'MATCH (n:Person) RETURN id(n)' , ... } )
// Good: Project first (fast, safe)
CALL gds . graph . project ( 'my-graph' , 'Person' , 'KNOWS' )
CALL gds . pageRank . stream ( 'my-graph' )
2. Estimate Memory Usage
CALL gds . degree . estimate ( 'my-graph' )
YIELD requiredMemory , nodeCount , relationshipCount
3. Limit Iterations
CALL gds . pageRank . stream ( 'my-graph' , {
maxIterations : 20 , // Stop after 20 iterations
tolerance : 0.0001 // Or when change < 0.01%
} )
4. Filter Projections
Don’t load entire graph if you only need subset:
CALL gds . graph . project . cypher (
'active-users' ,
'MATCH (n:User) WHERE n.active = true RETURN id(n) AS id' ,
'MATCH (a:User)-[r:FOLLOWS]->(b:User) RETURN id(a) AS source, id(b) AS target'
)
Summary
Pathfinding : Shortest path, Dijkstra for routing
Centrality : Degree, PageRank, Betweenness for influence
Community Detection : Louvain, Label Propagation for clustering
Similarity : Node similarity, Cosine for recommendations
Link Prediction : Adamic Adar for missing connections
Embeddings : Node2Vec for ML features
Best Practices :
Always use graph projections
Estimate memory before running
Write results back for reuse
Filter projections for performance
What’s Next?
Module 7: Production Deployment & Operations Deploy Neo4j clusters, configure high availability, monitor performance, and operate at scale