> ## Documentation Index
> Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Graph Algorithms at Scale

> Implement PageRank, community detection, shortest paths, centrality measures, and graph analytics using Neo4j GDS

# Graph Algorithms at Scale

<Info>
  **Module Duration**: 6-8 hours
  **Learning Style**: Algorithm Theory + Implementation + Real-World Applications
  **Outcome**: Apply graph algorithms to solve real problems: recommendations, fraud detection, network analysis
</Info>

## Introduction: Why Graph Algorithms?

Graph algorithms solve problems that are **intractable** in relational databases:

* **Shortest Path**: GPS navigation, network routing
* **PageRank**: Search engine ranking, influence scoring
* **Community Detection**: Social groups, customer segmentation
* **Centrality**: Find influencers, critical infrastructure
* **Link Prediction**: Recommendations, fraud detection

Neo4j provides these via the **Graph Data Science (GDS) library**.

***

## Part 1: Neo4j GDS Setup

### Installation

```cypher theme={null}
// Check if GDS is installed
CALL gds.version()
```

If not installed, download from: [https://neo4j.com/download-center/#algorithms](https://neo4j.com/download-center/#algorithms)

### Creating a Graph Projection

**Problem**: Running algorithms on live data is slow and risky.

**Solution**: Create an **in-memory graph projection** (read-only snapshot):

```cypher theme={null}
CALL gds.graph.project(
  'my-graph',              // Graph name
  'Person',                // Node labels
  'KNOWS'                  // Relationship types
)
YIELD graphName, nodeCount, relationshipCount
```

**With properties**:

```cypher theme={null}
CALL gds.graph.project(
  'social-network',
  'Person',
  'KNOWS',
  {
    nodeProperties: ['age', 'city'],
    relationshipProperties: ['since']
  }
)
```

**Cypher projection** (more flexible):

```cypher theme={null}
CALL gds.graph.project.cypher(
  'my-graph',
  'MATCH (n:Person) RETURN id(n) AS id',
  'MATCH (a:Person)-[r:KNOWS]->(b:Person) RETURN id(a) AS source, id(b) AS target'
)
```

### Listing and Dropping Projections

```cypher theme={null}
// List all projections
CALL gds.graph.list()

// Drop projection
CALL gds.graph.drop('my-graph')
```

***

## Part 2: Pathfinding Algorithms

### 1. Shortest Path (Dijkstra)

**Use Case**: Find minimum-cost path between two nodes

**Example**: Shortest route between cities

```cypher theme={null}
// Create graph with weighted relationships
MATCH (source:City {name: 'San Francisco'})
MATCH (target:City {name: 'New York'})
CALL gds.shortestPath.dijkstra.stream('city-network', {
  sourceNode: source,
  targetNode: target,
  relationshipWeightProperty: 'distance'
})
YIELD index, sourceNode, targetNode, totalCost, nodeIds, costs, path
RETURN
  gds.util.asNode(sourceNode).name AS from,
  gds.util.asNode(targetNode).name AS to,
  totalCost AS distance,
  [nodeId IN nodeIds | gds.util.asNode(nodeId).name] AS route
```

**Output**:

```
from: San Francisco
to: New York
distance: 2900
route: ["San Francisco", "Denver", "Chicago", "New York"]
```

### 2. All Shortest Paths

**Find all shortest paths** (multiple paths with same cost):

```cypher theme={null}
MATCH (source:Person {name: 'Alice'})
MATCH (target:Person {name: 'Bob'})
CALL gds.shortestPath.dijkstra.stream('social-network', {
  sourceNode: source,
  targetNode: target
})
YIELD path
RETURN [node IN nodes(path) | node.name] AS path
```

### 3. Single-Source Shortest Path

**Find shortest paths from one node to all others**:

```cypher theme={null}
MATCH (source:Person {name: 'Alice'})
CALL gds.allShortestPaths.delta.stream('social-network', {
  sourceNode: source
})
YIELD targetNode, distance
RETURN gds.util.asNode(targetNode).name AS person, distance
ORDER BY distance ASC
LIMIT 10
```

***

## Part 3: Centrality Algorithms

### 1. Degree Centrality

**Measures**: Number of connections (simplest centrality)

**Use Case**: Find most connected people

```cypher theme={null}
CALL gds.degree.stream('social-network')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS person, score AS connections
ORDER BY score DESC
LIMIT 10
```

**Write back to graph**:

```cypher theme={null}
CALL gds.degree.write('social-network', {
  writeProperty: 'degree'
})
YIELD nodePropertiesWritten, computeMillis
```

### 2. PageRank

**Measures**: Importance based on connections (Google's algorithm)

**Formula**:

```
PR(A) = (1-d) + d × Σ(PR(T_i) / C(T_i))
```

Where:

* d = damping factor (0.85)
* T\_i = nodes linking to A
* C(T\_i) = out-degree of T\_i

**Use Case**: Rank influencers, important nodes

```cypher theme={null}
CALL gds.pageRank.stream('social-network', {
  maxIterations: 20,
  dampingFactor: 0.85
})
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS person, score
ORDER BY score DESC
LIMIT 10
```

**Real-World Example**: Twitter influence

```cypher theme={null}
// High PageRank = influential (many followers who are also influential)
CALL gds.graph.project(
  'twitter',
  'User',
  {FOLLOWS: {orientation: 'NATURAL'}}
)

CALL gds.pageRank.write('twitter', {
  writeProperty: 'influence_score'
})
YIELD nodePropertiesWritten
```

### 3. Betweenness Centrality

**Measures**: How often a node lies on shortest paths between others

**Use Case**: Find bridges, bottlenecks

```cypher theme={null}
CALL gds.betweenness.stream('social-network')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS person, score AS betweenness
ORDER BY score DESC
LIMIT 10
```

**Interpretation**: High betweenness = broker, gatekeeper

### 4. Closeness Centrality

**Measures**: Average distance to all other nodes

**Use Case**: Find nodes that can spread information quickly

```cypher theme={null}
CALL gds.closeness.stream('social-network')
YIELD nodeId, score
RETURN gds.util.asNode(nodeId).name AS person, score
ORDER BY score DESC
LIMIT 10
```

**Interpretation**: High closeness = well-connected, central position

***

## Part 4: Community Detection

### 1. Louvain (Modularity-Based)

**Finds**: Groups with more internal connections than external

**Use Case**: Customer segmentation, fraud rings

```cypher theme={null}
CALL gds.louvain.stream('social-network')
YIELD nodeId, communityId
WITH communityId, collect(gds.util.asNode(nodeId).name) AS members
RETURN communityId, size(members) AS size, members
ORDER BY size DESC
```

**Write communities back**:

```cypher theme={null}
CALL gds.louvain.write('social-network', {
  writeProperty: 'community'
})
YIELD communityCount, modularity
```

**Query by community**:

```cypher theme={null}
MATCH (p:Person {community: 5})
RETURN p.name
```

### 2. Label Propagation

**Faster** than Louvain, less accurate

**Algorithm**: Nodes adopt the most common label among neighbors

```cypher theme={null}
CALL gds.labelPropagation.stream('social-network')
YIELD nodeId, communityId
RETURN gds.util.asNode(nodeId).name AS person, communityId
```

### 3. Weakly Connected Components

**Finds**: Disconnected subgraphs

**Use Case**: Find isolated clusters

```cypher theme={null}
CALL gds.wcc.stream('social-network')
YIELD nodeId, componentId
WITH componentId, collect(gds.util.asNode(nodeId).name) AS members
WHERE size(members) > 1
RETURN componentId, size(members) AS size, members
ORDER BY size DESC
```

***

## Part 5: Similarity Algorithms

### 1. Node Similarity (Jaccard)

**Measures**: Overlap of neighbors

**Use Case**: Recommend similar users

```cypher theme={null}
CALL gds.nodeSimilarity.stream('social-network', {
  topK: 10  // Top 10 most similar per node
})
YIELD node1, node2, similarity
RETURN
  gds.util.asNode(node1).name AS person1,
  gds.util.asNode(node2).name AS person2,
  similarity
ORDER BY similarity DESC
```

**Write similarities as relationships**:

```cypher theme={null}
CALL gds.nodeSimilarity.write('social-network', {
  writeRelationshipType: 'SIMILAR_TO',
  writeProperty: 'score',
  topK: 10
})
```

### 2. Cosine Similarity (Vector-Based)

**Use Case**: Compare embeddings, feature vectors

```cypher theme={null}
// Assume nodes have 'embedding' property (array of floats)
CALL gds.graph.project(
  'embeddings-graph',
  {Person: {properties: 'embedding'}},
  '*'
)

CALL gds.similarity.cosine.stream('embeddings-graph', {
  nodeProperties: ['embedding']
})
YIELD node1, node2, similarity
RETURN
  gds.util.asNode(node1).name,
  gds.util.asNode(node2).name,
  similarity
ORDER BY similarity DESC
```

***

## Part 6: Link Prediction

### 1. Adamic Adar

**Predicts**: Likelihood of future connection

**Formula**: Sum of 1/log(degree) for common neighbors

**Use Case**: Friend recommendations, missing links

```cypher theme={null}
MATCH (p1:Person {name: 'Alice'})
MATCH (p2:Person {name: 'Bob'})
WHERE NOT (p1)-[:KNOWS]-(p2)  // Not already connected
RETURN gds.alpha.linkprediction.adamicAdar(p1, p2) AS score
```

**Find top recommendations**:

```cypher theme={null}
MATCH (alice:Person {name: 'Alice'})
MATCH (candidate:Person)
WHERE NOT (alice)-[:KNOWS]-(candidate) AND alice <> candidate
WITH alice, candidate,
     gds.alpha.linkprediction.adamicAdar(alice, candidate) AS score
WHERE score > 0
RETURN candidate.name, score
ORDER BY score DESC
LIMIT 10
```

### 2. Common Neighbors

**Simpler**: Count shared neighbors

```cypher theme={null}
MATCH (alice:Person {name: 'Alice'})
MATCH (candidate:Person)
WHERE NOT (alice)-[:KNOWS]-(candidate) AND alice <> candidate
MATCH (alice)-[:KNOWS]-(mutual)-[:KNOWS]-(candidate)
WITH candidate, count(DISTINCT mutual) AS common_friends
RETURN candidate.name, common_friends
ORDER BY common_friends DESC
LIMIT 10
```

***

## Part 7: Graph Embeddings

### Node2Vec

**Generates**: Vector representation of nodes

**Use Case**: Machine learning features, clustering

```cypher theme={null}
CALL gds.node2vec.stream('social-network', {
  embeddingDimension: 128,
  iterations: 10
})
YIELD nodeId, embedding
RETURN gds.util.asNode(nodeId).name AS person, embedding
```

**Write embeddings**:

```cypher theme={null}
CALL gds.node2vec.write('social-network', {
  embeddingDimension: 128,
  writeProperty: 'embedding'
})
```

**Use embeddings for similarity**:

```cypher theme={null}
MATCH (p1:Person {name: 'Alice'})
MATCH (p2:Person)
WITH p1, p2, gds.similarity.cosine(p1.embedding, p2.embedding) AS similarity
WHERE similarity > 0.8 AND p1 <> p2
RETURN p2.name, similarity
ORDER BY similarity DESC
```

***

## Part 8: Real-World Use Cases

### Use Case 1: Fraud Detection

**Goal**: Find fraud rings (groups of accounts committing fraud together)

**Model**:

```cypher theme={null}
(account:Account)-[:SHARED_IP]->(ip:IP)
(account)-[:SHARED_DEVICE]->(device:Device)
(account)-[:SHARED_EMAIL]->(email:Email)
```

**Algorithm**: Louvain community detection

```cypher theme={null}
CALL gds.graph.project('fraud-network', 'Account', ['SHARED_IP', 'SHARED_DEVICE', 'SHARED_EMAIL'])

CALL gds.louvain.write('fraud-network', {
  writeProperty: 'fraud_community'
})

// Find suspicious communities (high connectivity)
MATCH (a:Account)
WITH a.fraud_community AS community, collect(a) AS accounts
WHERE size(accounts) > 5  // Rings with 5+ accounts
RETURN community, size(accounts) AS ring_size, accounts
ORDER BY ring_size DESC
```

### Use Case 2: Supply Chain Analysis

**Goal**: Find critical suppliers (bottlenecks)

**Algorithm**: Betweenness centrality

```cypher theme={null}
CALL gds.graph.project('supply-chain', ['Supplier', 'Manufacturer', 'Distributor'], 'SUPPLIES')

CALL gds.betweenness.write('supply-chain', {
  writeProperty: 'criticality'
})

// Find critical nodes
MATCH (n)
WHERE n.criticality > 1000
RETURN labels(n) AS type, n.name, n.criticality
ORDER BY n.criticality DESC
```

### Use Case 3: Knowledge Graph Reasoning

**Goal**: Infer missing relationships

**Example**: Predict who might know whom

```cypher theme={null}
// Find people in same company who don't know each other yet
MATCH (p1:Person)-[:WORKS_AT]->(company:Company)<-[:WORKS_AT]-(p2:Person)
WHERE NOT (p1)-[:KNOWS]-(p2) AND p1 <> p2
WITH p1, p2, gds.alpha.linkprediction.adamicAdar(p1, p2) AS score
WHERE score > 2
RETURN p1.name, p2.name, score
ORDER BY score DESC
```

***

## Part 9: Performance Tips

### 1. Use Graph Projections

**Always** project before running algorithms:

```cypher theme={null}
// Bad: Direct algorithm on database (slow, locks data)
CALL gds.pageRank.stream({nodeQuery: 'MATCH (n:Person) RETURN id(n)', ...})

// Good: Project first (fast, safe)
CALL gds.graph.project('my-graph', 'Person', 'KNOWS')
CALL gds.pageRank.stream('my-graph')
```

### 2. Estimate Memory Usage

```cypher theme={null}
CALL gds.degree.estimate('my-graph')
YIELD requiredMemory, nodeCount, relationshipCount
```

### 3. Limit Iterations

```cypher theme={null}
CALL gds.pageRank.stream('my-graph', {
  maxIterations: 20,  // Stop after 20 iterations
  tolerance: 0.0001   // Or when change < 0.01%
})
```

### 4. Filter Projections

**Don't load entire graph** if you only need subset:

```cypher theme={null}
CALL gds.graph.project.cypher(
  'active-users',
  'MATCH (n:User) WHERE n.active = true RETURN id(n) AS id',
  'MATCH (a:User)-[r:FOLLOWS]->(b:User) RETURN id(a) AS source, id(b) AS target'
)
```

***

## Summary

**Pathfinding**: Shortest path, Dijkstra for routing
**Centrality**: Degree, PageRank, Betweenness for influence
**Community Detection**: Louvain, Label Propagation for clustering
**Similarity**: Node similarity, Cosine for recommendations
**Link Prediction**: Adamic Adar for missing connections
**Embeddings**: Node2Vec for ML features

**Best Practices**:

* Always use graph projections
* Estimate memory before running
* Write results back for reuse
* Filter projections for performance

***

## What's Next?

<Card title="Module 7: Production Deployment & Operations" icon="server" href="/distributed-systems-tools/neo4j-production-deployment">
  Deploy Neo4j clusters, configure high availability, monitor performance, and operate at scale
</Card>