Neo4j Architecture & Native Graph Storage

Module Duration: 6-8 hours
Learning Style: Deep Technical + Implementation Details + Performance Analysis
Outcome: Understand how Neo4j achieves O(1) traversals and ACID guarantees at the storage layer

Introduction: What Makes Neo4j “Native”?

Most databases were built for tables, then retrofitted for graphs. Neo4j was designed from day one for graphs.
Key Difference: Index-free adjacency
Traditional (PostgreSQL with graph extension):
MATCH (a:Person)-[:KNOWS]->(b)
→ Index scan for Person nodes
→ Index scan for KNOWS relationships
→ JOIN on IDs
→ O(log N) per hop

Neo4j (Native Graph):
MATCH (a:Person)-[:KNOWS]->(b)
→ Follow pointer from node to relationship to node
→ O(1) per hop (constant time, regardless of graph size!)
Performance: often 10-1000x faster for multi-hop traversals. This module reveals how Neo4j achieves this at the storage level.
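The gap can be sanity-checked with a back-of-the-envelope cost model. The sketch below is illustrative only — the functions and constants are assumptions, not Neo4j internals: it counts B-tree index lookups for a join-based store versus pointer hops for index-free adjacency.

```python
import math

def relational_hops(n_nodes: int, degree: int, hops: int) -> int:
    """Estimated index operations: each hop pays an O(log N) B-tree
    lookup per frontier edge (simplified cost model)."""
    return sum(degree ** h * math.ceil(math.log2(n_nodes))
               for h in range(1, hops + 1))

def neo4j_hops(degree: int, hops: int) -> int:
    """Estimated pointer dereferences: O(1) per edge followed."""
    return sum(degree ** h for h in range(1, hops + 1))

# 1M nodes, average degree 50, 3-hop traversal
print(relational_hops(10**6, 50, 3))  # 2551000 index operations
print(neo4j_hops(50, 3))              # 127550 pointer hops
```

Even this crude model shows a ~20x gap at 3 hops, and the relational cost keeps a log N factor that grows with the graph.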

Part 1: High-Level Architecture

System Layering

┌─────────────────────────────────────────────────────┐
│            Client Applications                       │
│  (Browser, Drivers: Java, Python, Go, JS)          │
└────────────────────┬────────────────────────────────┘
                     │ Bolt Protocol (Binary)
┌────────────────────▼────────────────────────────────┐
│               Query Engine                           │
│  ┌─────────────────────────────────────────┐       │
│  │  Cypher Parser                           │       │
│  │  (Convert Cypher → AST)                  │       │
│  └─────────────────┬───────────────────────┘       │
│                    │                                 │
│  ┌─────────────────▼───────────────────────┐       │
│  │  Query Planner                           │       │
│  │  (Optimize: cost-based, rule-based)      │       │
│  └─────────────────┬───────────────────────┘       │
│                    │                                 │
│  ┌─────────────────▼───────────────────────┐       │
│  │  Execution Engine                        │       │
│  │  (Execute plan: operators, pipelining)   │       │
│  └─────────────────┬───────────────────────┘       │
└────────────────────┼────────────────────────────────┘
                     │ Kernel API
┌────────────────────▼────────────────────────────────┐
│              Transaction Manager                     │
│  • ACID guarantees                                   │
│  • Write-Ahead Logging (WAL)                         │
│  • Locking (pessimistic, read/write locks)          │
└────────────────────┬────────────────────────────────┘

┌────────────────────▼────────────────────────────────┐
│               Storage Engine                         │
│  ┌──────────────────────────────────────────┐      │
│  │  Page Cache (in-memory buffer)           │      │
│  │  • LRU eviction                           │      │
│  │  • Memory-mapped files                    │      │
│  └──────────────────┬───────────────────────┘      │
│                     │                                │
│  ┌──────────────────▼───────────────────────┐      │
│  │  Store Files (disk)                      │      │
│  │  • neostore.nodestore.db                 │      │
│  │  • neostore.relationshipstore.db         │      │
│  │  • neostore.propertystore.db             │      │
│  │  • neostore.labeltokenstore.db           │      │
│  └──────────────────────────────────────────┘      │
└─────────────────────────────────────────────────────┘

Core Components

1. Cypher Parser
  • Converts Cypher query string → Abstract Syntax Tree (AST)
  • Validates syntax
2. Query Planner
  • Generates execution plan (like SQL EXPLAIN)
  • Cost-based optimization (estimates row counts, cardinality)
  • Rule-based optimization (predicate push-down, etc.)
3. Execution Engine
  • Executes query plan using operators (Scan, Filter, Expand, etc.)
  • Pipelined execution (streaming results)
4. Transaction Manager
  • Ensures ACID properties
  • Write-Ahead Log (WAL) for durability
  • Locking for isolation
5. Storage Engine
  • Native graph storage (nodes, relationships, properties)
  • Page cache (in-memory buffer pool)
  • Indexes (for lookups)

Part 2: Storage Layout

Store Files

Neo4j stores graph data in multiple fixed-size record stores:
# In data/databases/neo4j/
neostore.nodestore.db              # Node records
neostore.relationshipstore.db      # Relationship records
neostore.propertystore.db          # Property records
neostore.propertystore.db.strings  # String values
neostore.propertystore.db.arrays   # Array values
neostore.relationshipgroupstore.db # Relationship type groups (dense nodes)
neostore.labeltokenstore.db        # Label names
neostore.relationshiptypestore.db  # Relationship type names
neostore.schemastore.db            # Indexes, constraints
Key Principle: Fixed-size records enable O(1) access via record ID.

Node Store

Node Record Format (15 bytes in Neo4j 4.x):
┌────────┬────────────┬────────────┬────────────┬────────┐
│ In Use │ Next Rel   │ Next Prop  │ Labels     │ Extra  │
│ 1 bit  │ 35 bits    │ 36 bits    │ 5 bytes    │ ...    │
└────────┴────────────┴────────────┴────────────┴────────┘
Fields:
  • In Use (1 bit): Is this record active? (0 = deleted, 1 = active)
  • Next Rel (35 bits): Pointer to first relationship (ID)
  • Next Prop (36 bits): Pointer to first property (ID)
  • Labels (5 bytes): Inline label storage or pointer to label array
Example: Node with ID = 42:
Record at offset: 42 × 15 bytes = 630 bytes into neostore.nodestore.db

┌─────┬─────────┬─────────┬──────────────┐
│  1  │  10050  │  20010  │  0x0001      │  (Person label)
└─────┴─────────┴─────────┴──────────────┘
Accessing Node 42:
// Pseudocode
long nodeId = 42;
long offset = nodeId * NODE_RECORD_SIZE;  // 42 * 15 = 630
byte[] record = readFromFile("neostore.nodestore.db", offset, 15);

boolean inUse = getBit(record, 0);
long firstRelId = getBits(record, 1, 35);
long firstPropId = getBits(record, 36, 36);
Time Complexity: O(1)—direct calculation, no scanning!
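To make the fixed-size-record idea concrete, here is a runnable toy decoder for the simplified 15-byte layout above. The field offsets follow this module's diagram; real Neo4j packs the high bits differently, so treat this as a sketch, not the actual format.

```python
NODE_RECORD_SIZE = 15  # bytes per node record (simplified layout)

def read_node_record(store: bytes, node_id: int) -> dict:
    """O(1) access: offset = node_id * record_size, then decode bits."""
    off = node_id * NODE_RECORD_SIZE
    rec = int.from_bytes(store[off:off + NODE_RECORD_SIZE], "big")
    total_bits = NODE_RECORD_SIZE * 8  # 120 bits

    def bits(start: int, width: int) -> int:
        # Extract `width` bits starting `start` bits from the MSB
        return (rec >> (total_bits - start - width)) & ((1 << width) - 1)

    return {
        "in_use": bits(0, 1),        # 1 bit
        "first_rel": bits(1, 35),    # 35 bits
        "first_prop": bits(36, 36),  # 36 bits
        "labels": bits(72, 40),      # 5 bytes
    }

# Fake store: node 42 in use, first rel 10050, first prop 20010, label 0x0001
rec = (1 << 119) | (10050 << 84) | (20010 << 48) | (0x0001 << 8)
store = b"\x00" * (42 * NODE_RECORD_SIZE) + rec.to_bytes(NODE_RECORD_SIZE, "big")
print(read_node_record(store, 42))
# {'in_use': 1, 'first_rel': 10050, 'first_prop': 20010, 'labels': 1}
```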

Relationship Store

Relationship Record Format (34 bytes in Neo4j 4.x):
┌────────┬────────┬────────┬─────────┬──────────┬──────────┬──────────┬──────────┬────────┐
│ In Use │ First  │ Second │ Rel     │ First    │ First    │ Second   │ Second   │ Next   │
│        │ Node   │ Node   │ Type    │ Prev Rel │ Next Rel │ Prev Rel │ Next Rel │ Prop   │
│ 1 bit  │ 35 bits│ 35 bits│ 16 bits │ 35 bits  │ 35 bits  │ 35 bits  │ 35 bits  │ 36 bits│
└────────┴────────┴────────┴─────────┴──────────┴──────────┴──────────┴──────────┴────────┘
Fields:
  • First Node: Source node ID
  • Second Node: Target node ID
  • Rel Type: Relationship type ID (index into relationship type store)
  • First Prev/Next Rel: Doubly-linked list for source node’s relationships
  • Second Prev/Next Rel: Doubly-linked list for target node’s relationships
  • Next Prop: Pointer to first property
Why Doubly-Linked Lists? Each node maintains a linked list of its relationships:
Node A (ID=10):
  ↓ Next Rel = 100
Relationship 100: (A)-[:KNOWS]->(B)
  ├─ First Node = 10 (A)
  ├─ Second Node = 20 (B)
  ├─ First Next Rel = 101  ← Next relationship for A
  └─ Second Next Rel = 200 ← Next relationship for B

Relationship 101: (A)-[:LIKES]->(C)
  ├─ First Node = 10 (A)
  ├─ Second Node = 30 (C)
  └─ First Next Rel = NULL  ← End of A's relationships
Traversal: To find all relationships for Node A:
  1. Read Node A’s record → get Next Rel = 100
  2. Read Relationship 100 → check if First Node == A or Second Node == A
  3. Follow appropriate Next Rel pointer (100 → 101 → NULL)
Time Complexity: O(D), where D = degree of node (number of relationships)
Critical Insight: Independent of total graph size!
  • In relational DB: O(log N) per relationship (index scan across all edges)
  • In Neo4j: O(D) (only scan node’s own relationships)
Example:
  • Graph with 1 billion relationships
  • Node A has 10 relationships
  • Relational: 10 relationships × O(log 10⁹) ≈ 10 × 30 = 300 operations
  • Neo4j: O(10) = 10 operations
Speed-up: ~30x here, and the gap widens as the graph grows!
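The chain walk above is easy to model. In this toy version, relationship records live in a Python dict keyed by record ID, with the fields sketched earlier (not Neo4j's exact layout):

```python
# Toy relationship records mimicking the doubly-linked chains above.
rels = {
    100: {"first_node": 10, "second_node": 20, "type": "KNOWS",
          "first_next": 101, "second_next": 200},  # node 20's chain continues at 200 (not modeled)
    101: {"first_node": 10, "second_node": 30, "type": "LIKES",
          "first_next": None, "second_next": None},
}

def relationships_of(node_id: int, first_rel_id):
    """Walk one node's relationship chain: O(degree), independent of
    how many relationships exist in the whole graph."""
    out, rel_id = [], first_rel_id
    while rel_id is not None:
        rel = rels[rel_id]
        out.append((rel_id, rel["type"]))
        # Follow the chain that belongs to *this* node
        if rel["first_node"] == node_id:
            rel_id = rel["first_next"]
        else:
            rel_id = rel["second_next"]
    return out

print(relationships_of(10, 100))  # [(100, 'KNOWS'), (101, 'LIKES')]
```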

Property Store

Property Record Format (simplified; actual Neo4j 4.x property records are 41 bytes holding up to four property blocks):
┌────────┬─────────┬─────────┬──────────┬───────────┐
│ In Use │ Type    │ Key ID  │ Value    │ Next Prop │
│ 1 bit  │ 4 bits  │ 24 bits │ 64 bits  │ 36 bits   │
└────────┴─────────┴─────────┴──────────┴───────────┘
Property Types:
| Type         | Value Storage                        |
|--------------|--------------------------------------|
| bool         | Inline (1 bit in value field)        |
| int          | Inline (up to 64 bits)               |
| float        | Inline (IEEE 754, 64 bits)           |
| short string | Inline (up to 12 characters encoded) |
| long string  | Pointer to string store              |
| array        | Pointer to array store               |
Property Chain: Properties form a singly-linked list:
Node A:
  Next Prop = 500

Property 500: {name: "Alice"}
  ├─ Key ID = 1 (points to "name" in property key store)
  ├─ Value = pointer to string store → "Alice"
  └─ Next Prop = 501

Property 501: {age: 30}
  ├─ Key ID = 2 ("age")
  ├─ Value = 30 (inline int)
  └─ Next Prop = NULL
Reading All Properties:
// Pseudocode
long propId = node.nextProp;  // e.g. 500
List<Property> props = new ArrayList<>();

while (propId != NO_PROPERTY) {  // NO_PROPERTY: sentinel for "end of chain"
    PropertyRecord prop = readProperty(propId);
    props.add(new Property(
        getKeyName(prop.keyId),             // resolve key ID → "name", "age"
        decodeValue(prop.type, prop.value)  // inline value, or pointer into string/array store
    ));
    propId = prop.nextProp;                 // follow the chain
}

String Store

Stores strings > 12 characters.
String Record (128 bytes: metadata + 120 chars of data):
┌────────┬──────────┬────────────────────────┬─────────────┐
│ In Use │ Length   │ Data (120 chars)       │ Next Record │
│ 1 byte │ 4 bytes  │ 120 bytes              │ 3 bytes     │
└────────┴──────────┴────────────────────────┴─────────────┘
Long Strings (> 120 chars): Stored across multiple records forming a linked list:
String: "This is a very long string..." (200 chars)

Record 1: chars 0-119   + Next Record = 5001
Record 2: chars 120-199 + Next Record = NULL
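A minimal sketch of that chunking, assuming a flat 120-character data field per record (the `next` field here is just a list index standing in for a record ID):

```python
CHUNK = 120  # data characters per string record, per the layout above

def to_records(s: str):
    """Split a long string into linked 120-char records."""
    chunks = [s[i:i + CHUNK] for i in range(0, len(s), CHUNK)] or [""]
    # Each record carries its data plus a pointer to the next record
    return [{"data": c, "next": i + 1 if i + 1 < len(chunks) else None}
            for i, c in enumerate(chunks)]

recs = to_records("x" * 200)  # a 200-char string → two linked records
print(len(recs), len(recs[0]["data"]), len(recs[1]["data"]))  # 2 120 80
```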

Dense Node Optimization

Problem: Nodes with many relationships (high degree) slow down traversals. Example: Celebrity node with 1M followers
Without optimization:
MATCH (celeb)-[:FOLLOWED_BY]->(follower)
→ Scan 1M relationship records to find FOLLOWED_BY relationships!
Solution: Relationship Group Store (introduced in Neo4j 2.1)
When a node exceeds a threshold (default: 50 relationships), Neo4j creates relationship type groups:
Node (Celebrity):
  Next Rel = NULL
  Next Rel Group = 7000  ← Points to group store

Relationship Group 7000:
  ├─ Type = FOLLOWED_BY
  ├─ First Outgoing = 8000
  ├─ First Incoming = 9000
  ├─ First Loop = NULL
  └─ Next Group = 7001

Relationship Group 7001:
  ├─ Type = FOLLOWS
  ├─ First Outgoing = 10000
  ├─ ...
Traversal:
MATCH (celeb)-[:FOLLOWED_BY]->(follower)
Process:
  1. Read celeb node → Get Next Rel Group = 7000
  2. Scan relationship groups for FOLLOWED_BY (typically < 10 types)
  3. Follow First Outgoing pointer
  4. Scan only FOLLOWED_BY relationships (not all 1M!)
Performance: From O(1M) to O(types) + O(followers of type) = much faster!
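The group scan can be sketched the same way: a short chain of per-type groups, each pointing at its own relationship chain (toy data, with the hypothetical IDs from the example above):

```python
# Toy relationship-group chain for a dense node: one group per type.
groups = {
    7000: {"type": "FOLLOWED_BY", "first_out": 8000, "first_in": 9000,
           "next_group": 7001},
    7001: {"type": "FOLLOWS", "first_out": 10000, "first_in": None,
           "next_group": None},
}

def first_rel_of_type(first_group_id, rel_type, direction="out"):
    """Scan the (short) group chain for the wanted type, then jump
    straight to that type's relationship chain — no 1M-record scan."""
    gid = first_group_id
    while gid is not None:
        g = groups[gid]
        if g["type"] == rel_type:
            return g["first_out"] if direction == "out" else g["first_in"]
        gid = g["next_group"]
    return None

print(first_rel_of_type(7000, "FOLLOWED_BY"))  # 8000
print(first_rel_of_type(7000, "FOLLOWS"))      # 10000
```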

Part 3: Indexes

Label Indexes (Default)

Neo4j automatically maintains a node label index (the label scan store), so filtering by label never requires a full node scan:
CREATE (p:Person {name: "Alice"})
Without Index:
MATCH (p:Person {name: "Alice"})
Scan ALL nodes (O(N))
With Label Index:
MATCH (p:Person {name: "Alice"})
Scan only Person nodes (O(P), where P = # of Person nodes)
Storage: Labels are stored in the Node Record (5 bytes for labels):
Node 42:
  Labels = 0x0001  → Label ID 1 = "Person"
Label IDs map to names in neostore.labeltokenstore.db.

Property Indexes

Create Index:
CREATE INDEX person_name FOR (p:Person) ON (p.name)
Index Structure (Neo4j 4.x): native index (a B+tree variant)
Index Files: schema/index/native-btree-1.0/<id>/
B+Tree Structure:
                Root
               /    \
          [Alice]  [Charlie]
           /   \      /    \
     [Alice] [Bob] [Charlie] [David]
       ↓      ↓       ↓        ↓
    Node 10 Node 20 Node 30  Node 40
Query:
MATCH (p:Person {name: "Bob"})
Process:
  1. Look up “Bob” in index → O(log P)
  2. Get node ID (20)
  3. Read node 20 → O(1)
Total: O(log P) << O(P) (full scan)
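The seek itself behaves like a binary search over sorted keys. A minimal stand-in using Python's `bisect` (not Neo4j's actual B+tree code; the entries are the toy tree above):

```python
import bisect

# Sorted (key, node_id) pairs standing in for B+tree leaf entries.
entries = [("Alice", 10), ("Bob", 20), ("Charlie", 30), ("David", 40)]
keys = [k for k, _ in entries]

def index_seek(name: str):
    """O(log P) lookup, like NodeIndexSeek over a sorted structure."""
    i = bisect.bisect_left(keys, name)
    if i < len(keys) and keys[i] == name:
        return entries[i][1]  # node ID; fetching the node itself is O(1)
    return None

print(index_seek("Bob"))  # 20
```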

Composite Indexes

Create:
CREATE INDEX person_name_age FOR (p:Person) ON (p.name, p.age)
Use Case:
MATCH (p:Person {name: "Alice", age: 30})
Index Key: Composite key (name, age) → ("Alice", 30)
Benefit: Single index lookup instead of two separate lookups + intersection.

Full-Text Indexes

Create:
CREATE FULLTEXT INDEX person_description FOR (p:Person) ON EACH [p.name, p.bio]
Query:
CALL db.index.fulltext.queryNodes("person_description", "software engineer")
YIELD node, score
RETURN node.name, score
Backend: Apache Lucene (inverted index)
Use Cases:
  • Fuzzy search
  • Tokenization (split “software engineer” → [“software”, “engineer”])
  • Relevance scoring (TF-IDF)

Vector Indexes (Neo4j 5.x)

Create:
CREATE VECTOR INDEX user_embedding FOR (u:User) ON (u.embedding)
OPTIONS {indexConfig: {
  `vector.dimensions`: 256,
  `vector.similarity_function`: 'cosine'
}}
Query (nearest-neighbor search):
CALL db.index.vector.queryNodes('user_embedding', 10, $query_vector)
YIELD node, score
RETURN node.name, score
Backend: HNSW (Hierarchical Navigable Small World) graph for approximate nearest neighbor (ANN) search
Use Cases:
  • Recommendation systems (user/item embeddings)
  • Semantic search (document embeddings)
  • Image similarity (CNN features)
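For intuition, here is the exact (brute-force) version of the query that HNSW approximates: score every vector by cosine similarity and keep the top k. Toy 2-dimensional vectors; real embeddings have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query, vectors, k=2):
    """Exact nearest neighbours by cosine similarity. HNSW returns
    (approximately) the same answer without scoring every vector."""
    scored = sorted(vectors.items(),
                    key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in scored[:k]]

vectors = {"u1": (1.0, 0.0), "u2": (0.9, 0.1), "u3": (0.0, 1.0)}
print(top_k((1.0, 0.05), vectors))  # ['u1', 'u2']
```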

Part 4: Transaction Management

ACID Guarantees

Neo4j provides full ACID transactions:
  • Atomicity: All-or-nothing (commit or rollback)
  • Consistency: Constraints enforced (uniqueness, existence)
  • Isolation: Transactions don’t interfere (locking)
  • Durability: Committed transactions survive crashes (WAL)

Write-Ahead Log (WAL)

Purpose: Durability (survive crashes)
Location: neostore.transaction.db.0, neostore.transaction.db.1, …
Process:
1. Transaction begins

2. Modifications stored in memory

3. Transaction commits

4. Write transaction log entry to WAL (fsync to disk)

5. Respond "SUCCESS" to client

6. Apply changes to store files (asynchronous)
Log Entry Format:
┌──────────┬─────────┬──────────────────────────────┐
│ TX ID    │ Command │ Data                         │
│ 8 bytes  │ 1 byte  │ Variable                     │
└──────────┴─────────┴──────────────────────────────┘
Commands:
  • CREATE_NODE: Node ID, labels, properties
  • CREATE_RELATIONSHIP: Rel ID, type, start/end nodes, properties
  • SET_PROPERTY: Node/Rel ID, key, value
  • DELETE_NODE: Node ID
Crash Recovery:
On startup:
1. Read last committed transaction ID from store
2. Replay WAL entries from that point forward
3. Restore to consistent state
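A replay sketch using the command set listed above (the store is modeled as a plain dict; the command names and payload shapes are assumptions for illustration, not Neo4j's log format):

```python
def replay(store: dict, last_applied_tx: int, wal: list) -> dict:
    """Recovery sketch: re-apply every logged command whose TX ID is
    newer than the last one already reflected in the store files."""
    for tx_id, command, data in wal:
        if tx_id <= last_applied_tx:
            continue  # already in the store files
        if command == "CREATE_NODE":
            store[data["id"]] = {"labels": data["labels"], "props": {}}
        elif command == "SET_PROPERTY":
            store[data["id"]]["props"][data["key"]] = data["value"]
        elif command == "DELETE_NODE":
            del store[data["id"]]
    return store

wal = [
    (1001, "CREATE_NODE", {"id": 1, "labels": ["Person"]}),
    (1002, "SET_PROPERTY", {"id": 1, "key": "name", "value": "Alice"}),
]
# Store checkpointed at TX 1000, so both entries are replayed:
print(replay({}, 1000, wal))
```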
Checkpointing: Periodically, Neo4j flushes in-memory changes to store files and truncates WAL:
┌─────────────────────────────────────────────────────┐
│                  Timeline                            │
├─────────────────────────────────────────────────────┤
│ t=0:   Last checkpoint (TX ID = 1000)               │
│ t=10:  TX 1001, 1002, 1003 (in WAL)                 │
│ t=20:  TX 1004, 1005 (in WAL)                       │
│ t=30:  Checkpoint! (flush 1001-1005 to store)       │
│        → Truncate WAL (safe to discard 1001-1005)   │
│ t=40:  TX 1006, 1007 (new WAL)                      │
└─────────────────────────────────────────────────────┘
Configuration:
# neo4j.conf
dbms.checkpoint.interval.time=15m   # Checkpoint every 15 minutes (Neo4j 4.x name)
dbms.checkpoint.interval.tx=100000  # Or every 100K transactions

Locking

Neo4j uses pessimistic locking (lock before modify). Lock Types:
  1. Read Locks (Shared locks):
    • Multiple transactions can hold read locks simultaneously
    • Prevents writes while reading
  2. Write Locks (Exclusive locks):
    • Only one transaction can hold a write lock
    • Blocks reads and writes
Lock Granularity: Neo4j locks at the node/relationship level:
// Transaction 1:
MATCH (p:Person {name: "Alice"})
SET p.age = 31  // Acquires WRITE lock on Node(Alice)

// Transaction 2 (concurrent):
MATCH (p:Person {name: "Alice"})
RETURN p.age  // Blocks until TX1 commits and releases the lock
Deadlock Detection:
// Transaction 1:
MATCH (a:Person {name: "Alice"})
MATCH (b:Person {name: "Bob"})
SET a.age = 31, b.age = 32

// Transaction 2 (concurrent):
MATCH (b:Person {name: "Bob"})
MATCH (a:Person {name: "Alice"})
SET b.age = 33, a.age = 34
Deadlock:
  • TX1 locks Alice, waits for Bob
  • TX2 locks Bob, waits for Alice
  • Deadlock!
Resolution: Neo4j detects deadlock and aborts one transaction:
Neo4jException: DeadlockDetected
Transaction aborted. Please retry.
Best Practice: Always acquire locks in consistent order:
// Both transactions lock in order: Alice → Bob
MATCH (a:Person {name: "Alice"})
MATCH (b:Person {name: "Bob"})
SET a.age = ..., b.age = ...
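The same discipline in runnable form: both "transactions" sort the node names before locking, so neither can hold one lock while waiting for the other (Python threading locks standing in for Neo4j's per-node write locks):

```python
import threading

lock_table = {"Alice": threading.Lock(), "Bob": threading.Lock()}

def update_ages(names_and_ages: dict):
    """Acquire node locks in a globally consistent (sorted) order, so
    two concurrent transactions can never wait on each other in a cycle."""
    ordered = sorted(names_and_ages)  # always Alice before Bob
    for name in ordered:
        lock_table[name].acquire()
    try:
        for name, age in names_and_ages.items():
            pass  # the SET n.age = age write would happen here
    finally:
        for name in reversed(ordered):
            lock_table[name].release()

# Both of these lock Alice first, then Bob — regardless of argument order
update_ages({"Bob": 32, "Alice": 31})
update_ages({"Alice": 33, "Bob": 34})
print("no deadlock")
```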

Isolation Levels

Neo4j provides Read Committed isolation by default.
Behavior:
  • Reads see committed data only (no dirty reads)
  • Phantom reads possible (another TX inserts/deletes between reads)
Example:
// Transaction 1:
BEGIN
MATCH (p:Person) RETURN count(p)  → 100

// Transaction 2 (concurrent):
CREATE (p:Person {name: "New"})
COMMIT

// Transaction 1 (continues):
MATCH (p:Person) RETURN count(p)  → 101 (phantom read!)
COMMIT
Higher Isolation (Serializable): Neo4j doesn’t support serializable isolation natively. Use explicit locking:
// Pessimistic locking (manual):
MATCH (counter:Counter {id: 1})
SET counter.value = counter.value + 1  // Acquires write lock
RETURN counter.value

Part 5: Page Cache

Memory Architecture

Page Cache = Neo4j’s in-memory buffer pool (like PostgreSQL’s shared buffers). Purpose:
  • Cache frequently accessed pages (node, relationship, property records)
  • Reduce disk I/O
Configuration:
# neo4j.conf
dbms.memory.pagecache.size=4g  # 4GB for page cache (Neo4j 4.x; server.memory.pagecache.size in 5.x)
Sizing Rule (starting point): pagecache.size ≈ 0.5 × (RAM - heap - OS) Example (64GB RAM):
Heap (JVM): 8GB
OS: 8GB
Available for page cache: 64 - 8 - 8 = 48GB
Recommended: start around 24GB; increase if the hit rate is low

Page Cache Structure

Page Size: 8KB (fixed) Cache Entry:
┌─────────────────────────────────────────┐
│  File: neostore.nodestore.db            │
│  Offset: 630 (Node ID 42)               │
│  Data: [15 bytes of node record]        │
│  LRU: Last accessed timestamp           │
└─────────────────────────────────────────┘
Page Eviction: LRU (Least Recently Used)
When cache is full:
  1. Find least recently used page
  2. If dirty (modified), flush to disk
  3. Evict page
  4. Load new page into cache
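Steps 1-4 map neatly onto an ordered dictionary. A minimal LRU page cache sketch (no dirty-page flushing; `load_from_disk` is a stand-in for a real page read):

```python
from collections import OrderedDict

class PageCache:
    """Minimal LRU page cache: hits move a page to the MRU end,
    misses evict the LRU page once capacity is reached."""
    def __init__(self, capacity: int):
        self.capacity, self.pages = capacity, OrderedDict()
        self.hits = self.faults = 0

    def get(self, page_id, load_from_disk):
        if page_id in self.pages:
            self.hits += 1
            self.pages.move_to_end(page_id)    # mark most recently used
        else:
            self.faults += 1
            if len(self.pages) >= self.capacity:
                self.pages.popitem(last=False)  # evict LRU (flush if dirty)
            self.pages[page_id] = load_from_disk(page_id)
        return self.pages[page_id]

cache = PageCache(2)
for pid in [1, 2, 1, 3, 1]:       # page 2 is evicted when page 3 loads
    cache.get(pid, lambda p: f"page-{p}")
print(cache.hits, cache.faults)   # 2 3
```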

Monitoring Page Cache

CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Page cache")
YIELD attributes
RETURN attributes.hits, attributes.faults, attributes.evictions
Metrics:
  • Hits: Requests served from cache (fast!)
  • Faults: Requests requiring disk read (slow)
  • Evictions: Pages evicted to make room
Hit Rate: hits / (hits + faults)
Target: > 95% hit rate
Example:
Hits: 950,000
Faults: 50,000
Hit Rate: 950,000 / 1,000,000 = 95%  ✓ Good!
If hit rate < 90%: Increase page cache size!

Part 6: Query Execution

Explain and Profile

Explain (planning only, no execution):
EXPLAIN
MATCH (p:Person {name: "Alice"})-[:KNOWS]->(friend)
RETURN friend.name
Output:
+---------------------------+
| Operator         | Rows  |
+---------------------------+
| ProduceResults   | 10    |
| Projection       | 10    |
| Expand(All)      | 10    |
| NodeIndexSeek    | 1     |
+---------------------------+
Profile (execute and collect statistics):
PROFILE
MATCH (p:Person {name: "Alice"})-[:KNOWS]->(friend)
RETURN friend.name
Output:
+----------------------------------------+
| Operator         | Rows | DB Hits     |
+----------------------------------------+
| ProduceResults   | 10   | 0           |
| Projection       | 10   | 20          |
| Expand(All)      | 10   | 21          |
| NodeIndexSeek    | 1    | 2           |
+----------------------------------------+
Total DB Hits: 43
DB Hits: Number of page cache accesses (lower = better!)

Query Operators

NodeIndexSeek:
Look up node by indexed property
Cost: O(log N)
DB Hits: 1-3 (index lookup + node fetch)
NodeByLabelScan:
Scan all nodes with a specific label
Cost: O(L), where L = # of labeled nodes
DB Hits: L
Expand(All):
Traverse relationships from a node
Cost: O(D), where D = degree
DB Hits: D (read relationship records)
Filter:
Apply predicate (WHERE clause)
Cost: O(R), where R = input rows
DB Hits: R × P (P = properties checked per row)
Sort:
Sort results (ORDER BY)
Cost: O(R log R)
DB Hits: 0 (in-memory sort)
Aggregation (COUNT, SUM, etc.):
Group and aggregate
Cost: O(R)
DB Hits: 0 (in-memory)

Example Query Analysis

Query:
MATCH (p:Person {name: "Alice"})-[:KNOWS]->(friend:Person)
WHERE friend.age > 25
RETURN friend.name, friend.age
ORDER BY friend.age DESC
Plan:
ProduceResults

Sort (friend.age DESC)

Projection (friend.name, friend.age)

Filter (friend.age > 25)

Expand(All) [:KNOWS]

NodeIndexSeek (Person.name = "Alice")
Step-by-Step:
  1. NodeIndexSeek: Look up Alice in Person.name index
    • Cost: O(log P)
    • DB Hits: 2
    • Rows: 1
  2. Expand: Traverse :KNOWS relationships
    • Cost: O(D), D = Alice’s degree (e.g., 10 friends)
    • DB Hits: 10 (relationship records)
    • Rows: 10
  3. Filter: Check friend.age > 25
    • Cost: O(10)
    • DB Hits: 10 (read age property)
    • Rows: 6 (assume 6 friends > 25)
  4. Projection: Extract name and age
    • Cost: O(6)
    • DB Hits: 6 (read name property)
    • Rows: 6
  5. Sort: Order by age DESC
    • Cost: O(6 log 6) ≈ 15
    • DB Hits: 0 (in-memory)
    • Rows: 6
  6. ProduceResults: Return to client
    • Rows: 6
Total DB Hits: 2 + 10 + 10 + 6 = 28

Optimization Tips

1. Use Indexes:
// Bad (full scan)
MATCH (p:Person)
WHERE p.name = "Alice"

// Good (index seek)
CREATE INDEX FOR (p:Person) ON (p.name)
MATCH (p:Person {name: "Alice"})
2. Filter Early:
// Bad (expand, then filter every friend)
MATCH (p:Person {name: "Alice"})-[:KNOWS]->(friend)
WHERE friend.age > 25

// Good (label on friend lets the planner filter during expansion)
MATCH (p:Person {name: "Alice"})-[:KNOWS]->(friend:Person)
WHERE friend.age > 25
3. Limit Results:
// Add LIMIT to avoid processing unnecessary rows
MATCH (p:Person)-[:KNOWS]->(friend)
RETURN friend.name
LIMIT 10
4. Use EXPLAIN/PROFILE: Always check query plans for:
  • Missing indexes (NodeByLabelScan instead of NodeIndexSeek)
  • High DB Hits
  • Cartesian products (avoid!)

Part 7: Performance Characteristics

Time Complexity

| Operation                       | Neo4j    | Relational DB |
|---------------------------------|----------|---------------|
| Find node by ID                 | O(1)     | O(1)          |
| Find node by indexed property   | O(log N) | O(log N)      |
| Find node by unindexed property | O(N)     | O(N)          |
| Get node's relationships        | O(D)     | O(log N + D)  |
| Traverse 2 hops                 | O(D²)    | O(D² log N)   |
| Traverse 3 hops                 | O(D³)    | O(D³ log N)   |
| Shortest path (BFS, depth L)    | O(D^L)   | O(D^L log N)  |
Typical Values:
  • N = 1,000,000 (total nodes)
  • D = 50 (average degree)
3-hop traversal:
  • Neo4j: 50³ = 125,000 pointer hops
  • Relational: 50³ × log₂(10⁶) ≈ 125,000 × 20 = 2,500,000 index operations, plus join materialization overhead

Benchmark: Social Network Queries

Setup:
  • 1M users
  • 50M friendships (average 50 friends/user)
Query 1: Direct friends
MATCH (u:User {id: 12345})-[:FRIEND]->(friend)
RETURN friend.name
| Database   | Time |
|------------|------|
| Neo4j      | 5ms  |
| PostgreSQL | 20ms |
Query 2: Friends of friends
MATCH (u:User {id: 12345})-[:FRIEND*2]->(fof)
RETURN fof.name
| Database   | Time    |
|------------|---------|
| Neo4j      | 50ms    |
| PostgreSQL | 5,000ms |
Query 3: Friends up to 3 hops
MATCH (u:User {id: 12345})-[:FRIEND*1..3]->(friend)
RETURN DISTINCT friend.name
| Database   | Time             |
|------------|------------------|
| Neo4j      | 200ms            |
| PostgreSQL | Timeout (> 60s)  |
Key Takeaway: Neo4j shines for deep traversals (2+ hops).

Part 8: Hands-On Exercises

Exercise 1: Explore Store Files

Task: Inspect Neo4j store files and calculate sizes
# Navigate to Neo4j data directory
cd <NEO4J_HOME>/data/databases/neo4j/

# List store files
ls -lh neostore.*

# Example output:
# -rw-r--r-- neostore.nodestore.db          150M
# -rw-r--r-- neostore.relationshipstore.db  2.1G
# -rw-r--r-- neostore.propertystore.db      500M
Questions:
  1. How many nodes? 150M / 15 bytes = 10M nodes
  2. How many relationships? 2.1G / 34 bytes ≈ 61.7M relationships
  3. Average degree? 61.7M × 2 / 10M ≈ 12.3
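The answers above are just integer arithmetic over the record sizes. A quick script to reproduce them (file sizes are the example values; real stores also contain some header/reserved and deleted records, so these counts are upper bounds):

```python
NODE_REC, REL_REC = 15, 34  # bytes per record (Neo4j 4.x layout above)

def store_stats(node_store_bytes: int, rel_store_bytes: int):
    """Estimate node count, relationship count, and average degree
    from raw store-file sizes."""
    nodes = node_store_bytes // NODE_REC
    rels = rel_store_bytes // REL_REC
    # Each relationship touches two nodes, so degree sums count it twice
    avg_degree = 2 * rels / nodes
    return nodes, rels, round(avg_degree, 1)

# 150MB node store, 2.1GB relationship store (decimal bytes)
print(store_stats(150_000_000, 2_100_000_000))  # (10000000, 61764705, 12.4)
```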

Exercise 2: Analyze Query Plans

Query:
PROFILE
MATCH (p:Person {name: "Alice"})-[:KNOWS*2]->(fof)
RETURN fof.name
Tasks:
  1. Identify bottleneck operator (highest DB Hits)
  2. Check if index is used (NodeIndexSeek vs NodeByLabelScan)
  3. Optimize query (add indexes, limit results)

Exercise 3: Measure Page Cache Hit Rate

// Run query multiple times
MATCH (p:Person)-[:KNOWS]->(friend)
RETURN count(friend)

// Check hit rate
CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Page cache")
YIELD attributes
RETURN attributes.hits, attributes.faults,
       (toFloat(attributes.hits) / (attributes.hits + attributes.faults)) AS hit_rate
Expected: Hit rate increases on subsequent runs (data cached)

Summary

Native Graph Storage:
  • Fixed-size records (O(1) access by ID)
  • Pointers for relationships (no JOINs!)
  • Index-free adjacency (O(D) traversal, not O(N))
Indexes:
  • Label indexes (filter by node type)
  • Property indexes (B+tree, O(log N))
  • Full-text indexes (Lucene)
  • Vector indexes (HNSW for ANN)
Transactions:
  • Full ACID guarantees
  • Write-Ahead Log (durability)
  • Pessimistic locking (read/write locks)
  • Deadlock detection
Performance:
  • 10-1000x faster than relational for traversals
  • Scales with degree (D), not graph size (N)
  • Page cache critical for performance (aim > 95% hit rate)

What’s Next?

Module 4: Cypher Query Language Mastery

Master Cypher syntax, pattern matching, aggregations, and advanced query techniques