Skip to main content

Graph Algorithms at Scale

Module Duration: 6-8 hours Learning Style: Algorithm Theory + Implementation + Real-World Applications Outcome: Apply graph algorithms to solve real problems: recommendations, fraud detection, network analysis

Introduction: Why Graph Algorithms?

Graph algorithms solve problems that are intractable in relational databases:
  • Shortest Path: GPS navigation, network routing
  • PageRank: Search engine ranking, influence scoring
  • Community Detection: Social groups, customer segmentation
  • Centrality: Find influencers, critical infrastructure
  • Link Prediction: Recommendations, fraud detection
Neo4j provides these via the Graph Data Science (GDS) library.

Part 1: Neo4j GDS Setup

Installation

// Check if GDS is installed
CALL gds.version()
If not installed, download from: https://neo4j.com/download-center/#algorithms

Creating a Graph Projection

Problem: Running algorithms on live data is slow and risky. Solution: Create an in-memory graph projection (read-only snapshot):
CALL gds.graph.project(
  'my-graph',              // Graph name
  'Person',                // Node labels
  'KNOWS'                  // Relationship types
)
YIELD graphName, nodeCount, relationshipCount
With properties:
CALL gds.graph.project(
  'social-network',
  'Person',
  'KNOWS',
  {
    nodeProperties: ['age', 'city'],
    relationshipProperties: ['since']
  }
)
Cypher projection (more flexible):
CALL gds.graph.project.cypher(
  'my-graph',
  'MATCH (n:Person) RETURN id(n) AS id',
  'MATCH (a:Person)-[r:KNOWS]->(b:Person) RETURN id(a) AS source, id(b) AS target'
)

Listing and Dropping Projections

// List all projections
CALL gds.graph.list()

// Drop projection
CALL gds.graph.drop('my-graph')

Part 2: Pathfinding Algorithms

1. Shortest Path (Dijkstra)

Use Case: Find minimum-cost path between two nodes Example: Shortest route between cities
// Create graph with weighted relationships
MATCH (source:City {name: 'San Francisco'})
MATCH (target:City {name: 'New York'})
CALL gds.shortestPath.dijkstra.stream('city-network', {
  sourceNode: source,
  targetNode: target,
  relationshipWeightProperty: 'distance'
})
YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, path
RETURN
  gds.util.asNode(sourceNode).name AS from,
  gds.util.asNode(targetNode).name AS to,
  totalCost AS distance,
  [nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS route
Output:
from: San Francisco
to: New York
distance: 2900
route: ["San Francisco", "Denver", "Chicago", "New York"]

2. All Shortest Paths

Find all shortest paths (multiple paths with same cost):
MATCH (source:Person {name: 'Alice'})
MATCH (target:Person {name: 'Bob'})
CALL gds.shortestPath.dijkstra.stream('social-network', {
  sourceNode: source,
  targetNode: target
})
YIELD path
RETURN [node IN nodes(path) | node.name] AS path

3. Single-Source Shortest Path

Find shortest paths from one node to all others:
MATCH (source:Person {name: 'Alice'})
CALL gds.allShortestPaths.delta.stream('social-network', {
  sourceNode: source
})
YIELD targetNode, distance
RETURN gds.util.asNode(targetNode).name AS person, distance
ORDER BY distance ASC
LIMIT 10

Part 3: Centrality Algorithms

1. Degree Centrality

Measures: Number of connections (simplest centrality) Use Case: Find most connected people
CALL gds.degree.stream('social-network')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS person, score AS connections
ORDER BY score DESC
LIMIT 10
Write back to graph:
CALL gds.degree.write('social-network', {
  writeProperty: 'degree'
})
YIELD nodePropertiesWritten, computeMillis

2. PageRank

Measures: Importance based on connections (Google’s algorithm) Formula:
PR(A) = (1-d) + d × Σ(PR(T_i) / C(T_i))
Where:
  • d = damping factor (0.85)
  • T_i = nodes linking to A
  • C(T_i) = out-degree of T_i
Use Case: Rank influencers, important nodes
CALL gds.pageRank.stream('social-network', {
  maxIterations: 20,
  dampingFactor: 0.85
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS person, score
ORDER BY score DESC
LIMIT 10
Real-World Example: Twitter influence
// High PageRank = influential (many followers who are also influential)
CALL gds.graph.project(
  'twitter',
  'User',
  {FOLLOWS: {orientation: 'NATURAL'}}
)

CALL gds.pageRank.write('twitter', {
  writeProperty: 'influence_score'
})
YIELD nodePropertiesWritten

3. Betweenness Centrality

Measures: How often a node lies on shortest paths between others Use Case: Find bridges, bottlenecks
CALL gds.betweenness.stream('social-network')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS person, score AS betweenness
ORDER BY score DESC
LIMIT 10
Interpretation: High betweenness = broker, gatekeeper

4. Closeness Centrality

Measures: Average distance to all other nodes Use Case: Find nodes that can spread information quickly
CALL gds.closeness.stream('social-network')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS person, score
ORDER BY score DESC
LIMIT 10
Interpretation: High closeness = well-connected, central position

Part 4: Community Detection

1. Louvain (Modularity-Based)

Finds: Groups with more internal connections than external Use Case: Customer segmentation, fraud rings
CALL gds.louvain.stream('social-network')
YIELD nodeId, communityId
WITH communityId, collect(gds.util.asNode(nodeId).name) AS members
RETURN communityId, size(members) AS size, members
ORDER BY size DESC
Write communities back:
CALL gds.louvain.write('social-network', {
  writeProperty: 'community'
})
YIELD communityCount, modularity
Query by community:
MATCH (p:Person {community: 5})
RETURN p.name

2. Label Propagation

Faster than Louvain, less accurate Algorithm: Nodes adopt the most common label among neighbors
CALL gds.labelPropagation.stream('social-network')
YIELD nodeId, communityId
RETURN gds.util.asNode(nodeId).name AS person, communityId

3. Weakly Connected Components

Finds: Disconnected subgraphs Use Case: Find isolated clusters
CALL gds.wcc.stream('social-network')
YIELD nodeId, componentId
WITH componentId, collect(gds.util.asNode(nodeId).name) AS members
WHERE size(members) > 1
RETURN componentId, size(members) AS size, members
ORDER BY size DESC

Part 5: Similarity Algorithms

1. Node Similarity (Jaccard)

Measures: Overlap of neighbors Use Case: Recommend similar users
CALL gds.nodeSimilarity.stream('social-network', {
  topK: 10  // Top 10 most similar per node
})
YIELD node1, node2, similarity
RETURN
  gds.util.asNode(node1).name AS person1,
  gds.util.asNode(node2).name AS person2,
  similarity
ORDER BY similarity DESC
Write similarities as relationships:
CALL gds.nodeSimilarity.write('social-network', {
  writeRelationshipType: 'SIMILAR_TO',
  writeProperty: 'score',
  topK: 10
})

2. Cosine Similarity (Vector-Based)

Use Case: Compare embeddings, feature vectors
// Assume nodes have 'embedding' property (array of floats)
CALL gds.graph.project(
  'embeddings-graph',
  {Person: {properties: 'embedding'}},
  '*'
)

CALL gds.similarity.cosine.stream('embeddings-graph', {
  nodeProperties: ['embedding']
})
YIELD node1, node2, similarity
RETURN
  gds.util.asNode(node1).name,
  gds.util.asNode(node2).name,
  similarity
ORDER BY similarity DESC

1. Adamic Adar

Predicts: Likelihood of future connection Formula: Sum of 1/log(degree) for common neighbors Use Case: Friend recommendations, missing links
MATCH (p1:Person {name: 'Alice'})
MATCH (p2:Person {name: 'Bob'})
WHERE NOT (p1)-[:KNOWS]-(p2)  // Not already connected
RETURN gds.alpha.linkprediction.adamicAdar(p1, p2) AS score
Find top recommendations:
MATCH (alice:Person {name: 'Alice'})
MATCH (candidate:Person)
WHERE NOT (alice)-[:KNOWS]-(candidate) AND alice <> candidate
WITH alice, candidate,
     gds.alpha.linkprediction.adamicAdar(alice, candidate) AS score
WHERE score > 0
RETURN candidate.name, score
ORDER BY score DESC
LIMIT 10

2. Common Neighbors

Simpler: Count shared neighbors
MATCH (alice:Person {name: 'Alice'})
MATCH (candidate:Person)
WHERE NOT (alice)-[:KNOWS]-(candidate) AND alice <> candidate
MATCH (alice)-[:KNOWS]-(mutual)-[:KNOWS]-(candidate)
WITH candidate, count(DISTINCT mutual) AS common_friends
RETURN candidate.name, common_friends
ORDER BY common_friends DESC
LIMIT 10

Part 7: Graph Embeddings

Node2Vec

Generates: Vector representation of nodes Use Case: Machine learning features, clustering
CALL gds.node2vec.stream('social-network', {
  embeddingDimension: 128,
  iterations: 10
})
YIELD nodeId, embedding
RETURN gds.util.asNode(nodeId).name AS person, embedding
Write embeddings:
CALL gds.node2vec.write('social-network', {
  embeddingDimension: 128,
  writeProperty: 'embedding'
})
Use embeddings for similarity:
MATCH (p1:Person {name: 'Alice'})
MATCH (p2:Person)
WITH p1, p2, gds.similarity.cosine(p1.embedding, p2.embedding) AS similarity
WHERE similarity > 0.8 AND p1 <> p2
RETURN p2.name, similarity
ORDER BY similarity DESC

Part 8: Real-World Use Cases

Use Case 1: Fraud Detection

Goal: Find fraud rings (groups of accounts committing fraud together) Model:
(account:Account)-[:SHARED_IP]->(ip:IP)
(account)-[:SHARED_DEVICE]->(device:Device)
(account)-[:SHARED_EMAIL]->(email:Email)
Algorithm: Louvain community detection
CALL gds.graph.project('fraud-network', 'Account', ['SHARED_IP', 'SHARED_DEVICE', 'SHARED_EMAIL'])

CALL gds.louvain.write('fraud-network', {
  writeProperty: 'fraud_community'
})

// Find suspicious communities (high connectivity)
MATCH (a:Account)
WITH a.fraud_community AS community, collect(a) AS accounts
WHERE size(accounts) > 5  // Rings with 5+ accounts
RETURN community, size(accounts) AS ring_size, accounts
ORDER BY ring_size DESC

Use Case 2: Supply Chain Analysis

Goal: Find critical suppliers (bottlenecks) Algorithm: Betweenness centrality
CALL gds.graph.project('supply-chain', ['Supplier', 'Manufacturer', 'Distributor'], 'SUPPLIES')

CALL gds.betweenness.write('supply-chain', {
  writeProperty: 'criticality'
})

// Find critical nodes
MATCH (n)
WHERE n.criticality > 1000
RETURN labels(n) AS type, n.name, n.criticality
ORDER BY n.criticality DESC

Use Case 3: Knowledge Graph Reasoning

Goal: Infer missing relationships Example: Predict who might know whom
// Find people in same company who don't know each other yet
MATCH (p1:Person)-[:WORKS_AT]->(company:Company)<-[:WORKS_AT]-(p2:Person)
WHERE NOT (p1)-[:KNOWS]-(p2) AND p1 <> p2
WITH p1, p2, gds.alpha.linkprediction.adamicAdar(p1, p2) AS score
WHERE score > 2
RETURN p1.name, p2.name, score
ORDER BY score DESC

Part 9: Performance Tips

1. Use Graph Projections

Always project before running algorithms:
// Bad: Direct algorithm on database (slow, locks data)
CALL gds.pageRank.stream({nodeQuery: 'MATCH (n:Person) RETURN id(n)', ...})

// Good: Project first (fast, safe)
CALL gds.graph.project('my-graph', 'Person', 'KNOWS')
CALL gds.pageRank.stream('my-graph')

2. Estimate Memory Usage

CALL gds.degree.estimate('my-graph')
YIELD requiredMemory, nodeCount, relationshipCount

3. Limit Iterations

CALL gds.pageRank.stream('my-graph', {
  maxIterations: 20,  // Stop after 20 iterations
  tolerance: 0.0001   // Or when change < 0.01%
})

4. Filter Projections

Don’t load entire graph if you only need subset:
CALL gds.graph.project.cypher(
  'active-users',
  'MATCH (n:User) WHERE n.active = true RETURN id(n) AS id',
  'MATCH (a:User)-[r:FOLLOWS]->(b:User) RETURN id(a) AS source, id(b) AS target'
)

Summary

Pathfinding: Shortest path, Dijkstra for routing Centrality: Degree, PageRank, Betweenness for influence Community Detection: Louvain, Label Propagation for clustering Similarity: Node similarity, Cosine for recommendations Link Prediction: Adamic Adar for missing connections Embeddings: Node2Vec for ML features Best Practices:
  • Always use graph projections
  • Estimate memory before running
  • Write results back for reuse
  • Filter projections for performance

What’s Next?

Module 7: Production Deployment & Operations

Deploy Neo4j clusters, configure high availability, monitor performance, and operate at scale