> ## Documentation Index > Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt > Use this file to discover all available pages before exploring further. # Neo4j Architecture & Native Graph Storage > Master Neo4j's native graph storage engine, index structures, transaction management, and performance optimization # Neo4j Architecture & Native Graph Storage **Module Duration**: 6-8 hours **Learning Style**: Deep Technical + Implementation Details + Performance Analysis **Outcome**: Understand how Neo4j achieves O(1) traversals and ACID guarantees at the storage layer ## Introduction: What Makes Neo4j "Native"? Most databases were built for **tables**, then retrofitted for graphs. Neo4j was designed **from day one** for graphs. **Key Difference**: **Index-free adjacency** ``` Traditional (PostgreSQL with graph extension): MATCH (a:Person)-[:KNOWS]->(b) → Index scan for Person nodes → Index scan for KNOWS relationships → JOIN on IDs → O(log N) per hop Neo4j (Native Graph): MATCH (a:Person)-[:KNOWS]->(b) → Follow pointer from node to relationship to node → O(1) per hop (constant time, regardless of graph size!) ``` **Performance**: 1000x faster for multi-hop traversals! This module reveals **how** Neo4j achieves this at the storage level. *** ## Part 1: High-Level Architecture ### System Layering ``` ┌─────────────────────────────────────────────────────┐ │ Client Applications │ │ (Browser, Drivers: Java, Python, Go, JS) │ └────────────────────┬────────────────────────────────┘ │ Bolt Protocol (Binary) ┌────────────────────▼────────────────────────────────┐ │ Query Engine │ │ ┌─────────────────────────────────────────┐ │ │ │ Cypher Parser │ │ │ │ (Convert Cypher → AST) │ │ │ └─────────────────┬───────────────────────┘ │ │ │ │ │ ┌─────────────────▼───────────────────────┐ │ │ │ Query Planner │ │ │ │ (Optimize: cost-based, rule-based) │ │ │ └─────────────────┬───────────────────────┘ │ │ │ │ │ ┌─────────────────▼───────────────────────┐ │ │ │ Execution Engine │ │ │ │ (Execute plan: operators, pipelining) │ │ │ └─────────────────┬───────────────────────┘ │ └────────────────────┼────────────────────────────────┘ │ Kernel API ┌────────────────────▼────────────────────────────────┐ │ Transaction Manager │ │ • ACID guarantees │ │ • Write-Ahead Logging (WAL) │ │ • Locking (pessimistic, read/write locks) │ └────────────────────┬────────────────────────────────┘ │ ┌────────────────────▼────────────────────────────────┐ │ Storage Engine │ │ ┌──────────────────────────────────────────┐ │ │ │ Page Cache (in-memory buffer) │ │ │ │ • LRU eviction │ │ │ │ • Memory-mapped files │ │ │ └──────────────────┬───────────────────────┘ │ │ │ │ │ ┌──────────────────▼───────────────────────┐ │ │ │ Store Files (disk) │ │ │ │ • neostore.nodestore.db │ │ │ │ • neostore.relationshipstore.db │ │ │ │ • neostore.propertystore.db │ │ │ │ • neostore.labeltokenstore.db │ │ │ └──────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────┘ ``` ### Core Components **1. Cypher Parser** * Converts Cypher query string → Abstract Syntax Tree (AST) * Validates syntax **2. Query Planner** * Generates execution plan (like SQL EXPLAIN) * Cost-based optimization (estimates row counts, cardinality) * Rule-based optimization (predicate push-down, etc.) **3. Execution Engine** * Executes query plan using operators (Scan, Filter, Expand, etc.) * Pipelined execution (streaming results) **4. Transaction Manager** * Ensures ACID properties * Write-Ahead Log (WAL) for durability * Locking for isolation **5. Storage Engine** * Native graph storage (nodes, relationships, properties) * Page cache (in-memory buffer pool) * Indexes (for lookups) *** ## Part 2: Storage Layout ### Store Files Neo4j stores graph data in **multiple fixed-size record stores**: ```bash theme={null} # In data/databases/neo4j/ neostore.nodestore.db # Node records neostore.relationshipstore.db # Relationship records neostore.propertystore.db # Property records neostore.propertystore.db.strings # String values neostore.propertystore.db.arrays # Array values neostore.relationshipgroupstore.db # Relationship type groups (dense nodes) neostore.labeltokenstore.db # Label names neostore.relationshiptypestore.db # Relationship type names neostore.schemastore.db # Indexes, constraints ``` **Key Principle**: **Fixed-size records** enable O(1) access via record ID. ### Node Store **Node Record Format** (15 bytes in Neo4j 4.x): ``` ┌────────┬────────────┬────────────┬────────────┬────────┐ │ In Use │ Next Rel │ Next Prop │ Labels │ Extra │ │ 1 bit │ 35 bits │ 36 bits │ 5 bytes │ ... │ └────────┴────────────┴────────────┴────────────┴────────┘ ``` **Fields**: * **In Use** (1 bit): Is this record active? (0 = deleted, 1 = active) * **Next Rel** (35 bits): Pointer to first relationship (ID) * **Next Prop** (36 bits): Pointer to first property (ID) * **Labels** (5 bytes): Inline label storage or pointer to label array **Example**: Node with ID = 42: ``` Record at offset: 42 × 15 bytes = 630 bytes into neostore.nodestore.db ┌─────┬─────────┬─────────┬──────────────┐ │ 1 │ 10050 │ 20010 │ 0x0001 │ (Person label) └─────┴─────────┴─────────┴──────────────┘ ``` **Accessing Node 42**: ```java theme={null} // Pseudocode long nodeId = 42; long offset = nodeId * NODE_RECORD_SIZE; // 42 * 15 = 630 byte[] record = readFromFile("neostore.nodestore.db", offset, 15); boolean inUse = getBit(record, 0); long firstRelId = getBits(record, 1, 35); long firstPropId = getBits(record, 36, 36); ``` **Time Complexity**: O(1)—direct calculation, no scanning! ### Relationship Store **Relationship Record Format** (34 bytes in Neo4j 4.x): ``` ┌────────┬────────┬────────┬─────────┬──────────┬──────────┬──────────┬──────────┬────────┐ │ In Use │ First │ Second │ Rel │ First │ First │ Second │ Second │ Next │ │ │ Node │ Node │ Type │ Prev Rel │ Next Rel │ Prev Rel │ Next Rel │ Prop │ │ 1 bit │ 35 bits│ 35 bits│ 16 bits │ 35 bits │ 35 bits │ 35 bits │ 35 bits │ 36 bits│ └────────┴────────┴────────┴─────────┴──────────┴──────────┴──────────┴──────────┴────────┘ ``` **Fields**: * **First Node**: Source node ID * **Second Node**: Target node ID * **Rel Type**: Relationship type ID (index into relationship type store) * **First Prev/Next Rel**: Doubly-linked list for source node's relationships * **Second Prev/Next Rel**: Doubly-linked list for target node's relationships * **Next Prop**: Pointer to first property **Why Doubly-Linked Lists?** Each node maintains a **linked list of its relationships**: ``` Node A (ID=10): ↓ Next Rel = 100 Relationship 100: (A)-[:KNOWS]->(B) ├─ First Node = 10 (A) ├─ Second Node = 20 (B) ├─ First Next Rel = 101 ← Next relationship for A └─ Second Next Rel = 200 ← Next relationship for B Relationship 101: (A)-[:LIKES]->(C) ├─ First Node = 10 (A) ├─ Second Node = 30 (C) └─ First Next Rel = NULL ← End of A's relationships ``` **Traversal**: To find all relationships for Node A: 1. Read Node A's record → get `Next Rel = 100` 2. Read Relationship 100 → check if `First Node == A` or `Second Node == A` 3. Follow appropriate `Next Rel` pointer (100 → 101 → NULL) **Time Complexity**: O(D), where D = degree of node (number of relationships) **Critical Insight**: **Independent of total graph size!** * In relational DB: O(log N) per relationship (index scan across all edges) * In Neo4j: O(D) (only scan node's own relationships) **Example**: * Graph with 1 billion relationships * Node A has 10 relationships * Relational: O(log 10⁹) ≈ 30 operations * Neo4j: O(10) = 10 operations **Speed-up**: 3x, and scales better! ### Property Store **Property Record Format** (25 bytes): ``` ┌────────┬─────────┬─────────┬──────────┬───────────┐ │ In Use │ Type │ Key ID │ Value │ Next Prop │ │ 1 bit │ 4 bits │ 24 bits │ 64 bits │ 36 bits │ └────────┴─────────┴─────────┴──────────┴───────────┘ ``` **Property Types**: | Type | Value Storage | | -------------- | ------------------------------------ | | `bool` | Inline (1 bit in value field) | | `int` | Inline (up to 64 bits) | | `float` | Inline (IEEE 754, 64 bits) | | `short string` | Inline (up to 12 characters encoded) | | `long string` | Pointer to string store | | `array` | Pointer to array store | **Property Chain**: Properties form a **singly-linked list**: ``` Node A: Next Prop = 500 Property 500: {name: "Alice"} ├─ Key ID = 1 (points to "name" in property key store) ├─ Value = pointer to string store → "Alice" └─ Next Prop = 501 Property 501: {age: 30} ├─ Key ID = 2 ("age") ├─ Value = 30 (inline int) └─ Next Prop = NULL ``` **Reading All Properties**: ```java theme={null} long propId = node.nextProp; // 500 List props = new ArrayList<>(); while (propId != NULL) { PropertyRecord prop = readProperty(propId); props.add(new Property( getKeyName(prop.keyId), decodeValue(prop.type, prop.value) )); propId = prop.nextProp; } ``` ### String Store Stores strings > 12 characters: **String Record** (128 bytes, stores 120 chars + metadata): ``` ┌────────┬──────────┬────────────────────────────────┐ │ In Use │ Length │ Data (120 chars) │ │ 1 byte │ 4 bytes │ 120 bytes │ └────────┴──────────┴────────────────────────────────┘ ``` **Long Strings** (> 120 chars): Stored across **multiple records** forming a linked list: ``` String: "This is a very long string..." (200 chars) Record 1: chars 0-119 + Next Record = 5001 Record 2: chars 120-199 + Next Record = NULL ``` ### Dense Node Optimization **Problem**: Nodes with **many relationships** (high degree) slow down traversals. **Example**: Celebrity node with 1M followers ``` Without optimization: MATCH (celeb)-[:FOLLOWED_BY]->(follower) → Scan 1M relationship records to find FOLLOWED_BY relationships! ``` **Solution**: **Relationship Group Store** (introduced Neo4j 2.1) When a node exceeds a threshold (default: 50 relationships), Neo4j creates **relationship type groups**: ``` Node (Celebrity): Next Rel = NULL Next Rel Group = 7000 ← Points to group store Relationship Group 7000: ├─ Type = FOLLOWED_BY ├─ First Outgoing = 8000 ├─ First Incoming = 9000 ├─ First Loop = NULL └─ Next Group = 7001 Relationship Group 7001: ├─ Type = FOLLOWS ├─ First Outgoing = 10000 ├─ ... ``` **Traversal**: ```cypher theme={null} MATCH (celeb)-[:FOLLOWED_BY]->(follower) ``` **Process**: 1. Read celeb node → Get `Next Rel Group = 7000` 2. Scan relationship groups for `FOLLOWED_BY` (typically \< 10 types) 3. Follow `First Outgoing` pointer 4. Scan only FOLLOWED\_BY relationships (not all 1M!) **Performance**: From O(1M) to O(types) + O(followers of type) = much faster! *** ## Part 3: Indexes ### Label Indexes (Default) When you create a label, Neo4j automatically creates an index for lookups: ```cypher theme={null} CREATE (p:Person {name: "Alice"}) ``` **Without Index**: ```cypher theme={null} MATCH (p:Person {name: "Alice"}) → Scan ALL nodes (O(N)) ``` **With Label Index**: ```cypher theme={null} MATCH (p:Person {name: "Alice"}) → Scan only Person nodes (O(P), where P = # of Person nodes) ``` **Storage**: Labels are stored in the **Node Record** (5 bytes for labels): ``` Node 42: Labels = 0x0001 → Label ID 1 = "Person" ``` Label IDs map to names in `neostore.labeltokenstore.db`. ### Property Indexes **Create Index**: ```cypher theme={null} CREATE INDEX person_name FOR (p:Person) ON (p.name) ``` **Index Structure** (Neo4j 4.x): **Native Index** (custom B+tree variant) **Index File**: `schema/index/lucene-*/1/` (despite "lucene" in path, it's native in 4.x+) **B+Tree Structure**: ``` Root / \ [Alice] [Charlie] / \ / \ [Alice] [Bob] [Charlie] [David] ↓ ↓ ↓ ↓ Node 10 Node 20 Node 30 Node 40 ``` **Query**: ```cypher theme={null} MATCH (p:Person {name: "Bob"}) ``` **Process**: 1. Look up "Bob" in index → O(log P) 2. Get node ID (20) 3. Read node 20 → O(1) **Total**: O(log P) \<\< O(P) (full scan) ### Composite Indexes **Create**: ```cypher theme={null} CREATE INDEX person_name_age FOR (p:Person) ON (p.name, p.age) ``` **Use Case**: ```cypher theme={null} MATCH (p:Person {name: "Alice", age: 30}) ``` **Index Key**: Composite key `(name, age)` → `("Alice", 30)` **Benefit**: Single index lookup instead of two separate lookups + intersection. ### Full-Text Indexes **Create**: ```cypher theme={null} CREATE FULLTEXT INDEX person_description FOR (p:Person) ON EACH [p.name, p.bio] ``` **Query**: ```cypher theme={null} CALL db.index.fulltext.queryNodes("person_description", "software engineer") YIELD node, score RETURN node.name, score ``` **Backend**: Apache Lucene (inverted index) **Use Cases**: * Fuzzy search * Tokenization (split "software engineer" → \["software", "engineer"]) * Relevance scoring (TF-IDF) ### Vector Indexes (Neo4j 5.x) **Create**: ```cypher theme={null} CREATE VECTOR INDEX user_embedding FOR (u:User) ON (u.embedding) OPTIONS {dimension: 256, similarity: 'cosine'} ``` **Query** (Nearest neighbor search): ```cypher theme={null} MATCH (u:User) WHERE u.embedding SIMILAR TO $query_vector RETURN u.name, vector.similarity(u.embedding, $query_vector) AS score ORDER BY score DESC LIMIT 10 ``` **Backend**: HNSW (Hierarchical Navigable Small World) graph for approximate nearest neighbor (ANN) search **Use Cases**: * Recommendation systems (user/item embeddings) * Semantic search (document embeddings) * Image similarity (CNN features) *** ## Part 4: Transaction Management ### ACID Guarantees Neo4j provides **full ACID** transactions: * **Atomicity**: All-or-nothing (commit or rollback) * **Consistency**: Constraints enforced (uniqueness, existence) * **Isolation**: Transactions don't interfere (locking) * **Durability**: Committed transactions survive crashes (WAL) ### Write-Ahead Log (WAL) **Purpose**: Durability (survive crashes) **Location**: `neostore.transaction.db.0`, `neostore.transaction.db.1`, ... **Process**: ``` 1. Transaction begins ↓ 2. Modifications stored in memory ↓ 3. Transaction commits ↓ 4. Write transaction log entry to WAL (fsync to disk) ↓ 5. Respond "SUCCESS" to client ↓ 6. Apply changes to store files (asynchronous) ``` **Log Entry Format**: ``` ┌──────────┬─────────┬──────────────────────────────┐ │ TX ID │ Command │ Data │ │ 8 bytes │ 1 byte │ Variable │ └──────────┴─────────┴──────────────────────────────┘ ``` **Commands**: * `CREATE_NODE`: Node ID, labels, properties * `CREATE_RELATIONSHIP`: Rel ID, type, start/end nodes, properties * `SET_PROPERTY`: Node/Rel ID, key, value * `DELETE_NODE`: Node ID **Crash Recovery**: ``` On startup: 1. Read last committed transaction ID from store 2. Replay WAL entries from that point forward 3. Restore to consistent state ``` **Checkpointing**: Periodically, Neo4j flushes in-memory changes to store files and **truncates WAL**: ``` ┌─────────────────────────────────────────────────────┐ │ Timeline │ ├─────────────────────────────────────────────────────┤ │ t=0: Last checkpoint (TX ID = 1000) │ │ t=10: TX 1001, 1002, 1003 (in WAL) │ │ t=20: TX 1004, 1005 (in WAL) │ │ t=30: Checkpoint! (flush 1001-1005 to store) │ │ → Truncate WAL (safe to discard 1001-1005) │ │ t=40: TX 1006, 1007 (new WAL) │ └─────────────────────────────────────────────────────┘ ``` **Configuration**: ```conf theme={null} # neo4j.conf db.checkpoint.interval.time=15m # Checkpoint every 15 minutes db.checkpoint.interval.tx=100000 # Or every 100K transactions ``` ### Locking Neo4j uses **pessimistic locking** (lock before modify). **Lock Types**: 1. **Read Locks** (Shared locks): * Multiple transactions can hold read locks simultaneously * Prevents writes while reading 2. **Write Locks** (Exclusive locks): * Only one transaction can hold a write lock * Blocks reads and writes **Lock Granularity**: Neo4j locks at the **node/relationship level**: ```cypher theme={null} // Transaction 1: MATCH (p:Person {name: "Alice"}) SET p.age = 31 ← Acquires WRITE lock on Node(Alice) // Transaction 2 (concurrent): MATCH (p:Person {name: "Alice"}) RETURN p.age ← Blocks until TX1 commits (released lock) ``` **Deadlock Detection**: ```cypher theme={null} // Transaction 1: MATCH (a:Person {name: "Alice"}) MATCH (b:Person {name: "Bob"}) SET a.age = 31, b.age = 32 // Transaction 2 (concurrent): MATCH (b:Person {name: "Bob"}) MATCH (a:Person {name: "Alice"}) SET b.age = 33, a.age = 34 ``` **Deadlock**: * TX1 locks Alice, waits for Bob * TX2 locks Bob, waits for Alice * **Deadlock!** **Resolution**: Neo4j detects deadlock and aborts one transaction: ``` Neo4jException: DeadlockDetected Transaction aborted. Please retry. ``` **Best Practice**: Always acquire locks in **consistent order**: ```cypher theme={null} // Both transactions lock in order: Alice → Bob MATCH (a:Person {name: "Alice"}) MATCH (b:Person {name: "Bob"}) SET a.age = ..., b.age = ... ``` ### Isolation Levels Neo4j provides **Read Committed** isolation by default: **Behavior**: * Reads see committed data only (no dirty reads) * Phantom reads possible (another TX inserts/deletes between reads) **Example**: ```cypher theme={null} // Transaction 1: BEGIN MATCH (p:Person) RETURN count(p) → 100 // Transaction 2 (concurrent): CREATE (p:Person {name: "New"}) COMMIT // Transaction 1 (continues): MATCH (p:Person) RETURN count(p) → 101 (phantom read!) COMMIT ``` **Higher Isolation** (Serializable): Neo4j doesn't support serializable isolation natively. Use explicit locking: ```cypher theme={null} // Pessimistic locking (manual): MATCH (counter:Counter {id: 1}) SET counter.value = counter.value + 1 ← Acquires write lock RETURN counter.value ``` *** ## Part 5: Page Cache ### Memory Architecture **Page Cache** = Neo4j's in-memory buffer pool (like PostgreSQL's shared buffers). **Purpose**: * Cache frequently accessed pages (node, relationship, property records) * Reduce disk I/O **Configuration**: ```conf theme={null} # neo4j.conf db.memory.pagecache.size=4g # 4GB for page cache ``` **Sizing Rule**: `pagecache.size = 0.5 × (RAM - heap - OS)` **Example** (64GB RAM): ``` Heap (JVM): 8GB OS: 8GB Available for page cache: 64 - 8 - 8 = 48GB Recommended: 24-32GB ``` ### Page Cache Structure **Page Size**: 8KB (default, configurable) **Cache Entry**: ``` ┌─────────────────────────────────────────┐ │ File: neostore.nodestore.db │ │ Offset: 630 (Node ID 42) │ │ Data: [15 bytes of node record] │ │ LRU: Last accessed timestamp │ └─────────────────────────────────────────┘ ``` **Page Eviction**: LRU (Least Recently Used) When cache is full: 1. Find least recently used page 2. If dirty (modified), flush to disk 3. Evict page 4. Load new page into cache ### Monitoring Page Cache ```cypher theme={null} CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Page cache") YIELD attributes RETURN attributes.hits, attributes.faults, attributes.evictions ``` **Metrics**: * **Hits**: Requests served from cache (fast!) * **Faults**: Requests requiring disk read (slow) * **Evictions**: Pages evicted to make room **Hit Rate**: `hits / (hits + faults)` **Target**: > 95% hit rate **Example**: ``` Hits: 950,000 Faults: 50,000 Hit Rate: 950,000 / 1,000,000 = 95% ✓ Good! ``` **If hit rate \< 90%**: Increase page cache size! *** ## Part 6: Query Execution ### Explain and Profile **Explain** (planning only, no execution): ```cypher theme={null} EXPLAIN MATCH (p:Person {name: "Alice"})-[:KNOWS]->(friend) RETURN friend.name ``` **Output**: ``` +---------------------------+ | Operator | Rows | +---------------------------+ | ProduceResults | 10 | | Projection | 10 | | Expand(All) | 10 | | NodeIndexSeek | 1 | +---------------------------+ ``` **Profile** (execute and collect statistics): ```cypher theme={null} PROFILE MATCH (p:Person {name: "Alice"})-[:KNOWS]->(friend) RETURN friend.name ``` **Output**: ``` +----------------------------------------+ | Operator | Rows | DB Hits | +----------------------------------------+ | ProduceResults | 10 | 0 | | Projection | 10 | 20 | | Expand(All) | 10 | 21 | | NodeIndexSeek | 1 | 2 | +----------------------------------------+ Total DB Hits: 43 ``` **DB Hits**: Number of page cache accesses (lower = better!) ### Query Operators **NodeIndexSeek**: ``` Look up node by indexed property Cost: O(log N) DB Hits: 1-3 (index lookup + node fetch) ``` **NodeByLabelScan**: ``` Scan all nodes with a specific label Cost: O(L), where L = # of labeled nodes DB Hits: L ``` **Expand(All)**: ``` Traverse relationships from a node Cost: O(D), where D = degree DB Hits: D (read relationship records) ``` **Filter**: ``` Apply predicate (WHERE clause) Cost: O(R), where R = input rows DB Hits: R × P (P = properties checked per row) ``` **Sort**: ``` Sort results (ORDER BY) Cost: O(R log R) DB Hits: 0 (in-memory sort) ``` **Aggregation** (COUNT, SUM, etc.): ``` Group and aggregate Cost: O(R) DB Hits: 0 (in-memory) ``` ### Example Query Analysis **Query**: ```cypher theme={null} MATCH (p:Person {name: "Alice"})-[:KNOWS]->(friend:Person) WHERE friend.age > 25 RETURN friend.name, friend.age ORDER BY friend.age DESC ``` **Plan**: ``` ProduceResults ↑ Sort (friend.age DESC) ↑ Projection (friend.name, friend.age) ↑ Filter (friend.age > 25) ↑ Expand(All) [:KNOWS] ↑ NodeIndexSeek (Person.name = "Alice") ``` **Step-by-Step**: 1. **NodeIndexSeek**: Look up Alice in `Person.name` index * Cost: O(log P) * DB Hits: 2 * Rows: 1 2. **Expand**: Traverse `:KNOWS` relationships * Cost: O(D), D = Alice's degree (e.g., 10 friends) * DB Hits: 10 (relationship records) * Rows: 10 3. **Filter**: Check `friend.age > 25` * Cost: O(10) * DB Hits: 10 (read `age` property) * Rows: 6 (assume 6 friends > 25) 4. **Projection**: Extract `name` and `age` * Cost: O(6) * DB Hits: 6 (read `name` property) * Rows: 6 5. **Sort**: Order by `age DESC` * Cost: O(6 log 6) ≈ 15 * DB Hits: 0 (in-memory) * Rows: 6 6. **ProduceResults**: Return to client * Rows: 6 **Total DB Hits**: 2 + 10 + 10 + 6 = 28 ### Optimization Tips **1. Use Indexes**: ```cypher theme={null} -- Bad (full scan) MATCH (p:Person) WHERE p.name = "Alice" -- Good (index seek) CREATE INDEX FOR (p:Person) ON (p.name) MATCH (p:Person {name: "Alice"}) ``` **2. Filter Early**: ```cypher theme={null} -- Bad (expand then filter) MATCH (p:Person {name: "Alice"})-[:KNOWS]->(friend) WHERE friend.age > 25 -- Good (filter during expand, if possible) MATCH (p:Person {name: "Alice"})-[:KNOWS]->(friend:Person) WHERE friend.age > 25 ``` **3. Limit Results**: ```cypher theme={null} -- Add LIMIT to avoid processing unnecessary rows MATCH (p:Person)-[:KNOWS]->(friend) RETURN friend.name LIMIT 10 ``` **4. Use EXPLAIN/PROFILE**: Always check query plans for: * Missing indexes (NodeByLabelScan instead of NodeIndexSeek) * High DB Hits * Cartesian products (avoid!) *** ## Part 7: Performance Characteristics ### Time Complexity | Operation | Neo4j | Relational DB | | ------------------------------- | -------- | ------------- | | Find node by ID | O(1) | O(1) | | Find node by indexed property | O(log N) | O(log N) | | Find node by unindexed property | O(N) | O(N) | | Get node's relationships | O(D) | O(log N) | | Traverse 2 hops | O(D²) | O(N²) | | Traverse 3 hops | O(D³) | O(N³) | | Shortest path (BFS, depth L) | O(D^L) | O(N^L) | **Typical Values**: * N = 1,000,000 (total nodes) * D = 50 (average degree) **3-hop traversal**: * Neo4j: 50³ = 125,000 * Relational: (10⁶)³ = 10¹⁸ (not feasible!) ### Benchmark: Social Network Queries **Setup**: * 1M users * 50M friendships (average 50 friends/user) **Query 1: Direct friends** ```cypher theme={null} MATCH (u:User {id: 12345})-[:FRIEND]->(friend) RETURN friend.name ``` | Database | Time | | ---------- | ---- | | Neo4j | 5ms | | PostgreSQL | 20ms | **Query 2: Friends of friends** ```cypher theme={null} MATCH (u:User {id: 12345})-[:FRIEND*2]->(fof) RETURN fof.name ``` | Database | Time | | ---------- | ------- | | Neo4j | 50ms | | PostgreSQL | 5,000ms | **Query 3: Friends up to 3 hops** ```cypher theme={null} MATCH (u:User {id: 12345})-[:FRIEND*1..3]->(friend) RETURN DISTINCT friend.name ``` | Database | Time | | ---------- | --------------- | | Neo4j | 200ms | | PostgreSQL | Timeout (> 60s) | **Key Takeaway**: Neo4j shines for **deep traversals** (2+ hops). *** ## Part 8: Hands-On Exercises ### Exercise 1: Explore Store Files **Task**: Inspect Neo4j store files and calculate sizes ```bash theme={null} # Navigate to Neo4j data directory cd /data/databases/neo4j/ # List store files ls -lh neostore.* # Example output: # -rw-r--r-- neostore.nodestore.db 150M # -rw-r--r-- neostore.relationshipstore.db 2.1G # -rw-r--r-- neostore.propertystore.db 500M ``` **Questions**: 1. How many nodes? `150M / 15 bytes = 10M nodes` 2. How many relationships? `2.1G / 34 bytes ≈ 61.7M relationships` 3. Average degree? `61.7M × 2 / 10M ≈ 12.3` ### Exercise 2: Analyze Query Plans **Query**: ```cypher theme={null} PROFILE MATCH (p:Person {name: "Alice"})-[:KNOWS*2]->(fof) RETURN fof.name ``` **Tasks**: 1. Identify bottleneck operator (highest DB Hits) 2. Check if index is used (NodeIndexSeek vs NodeByLabelScan) 3. Optimize query (add indexes, limit results) ### Exercise 3: Measure Page Cache Hit Rate ```cypher theme={null} // Run query multiple times MATCH (p:Person)-[:KNOWS]->(friend) RETURN count(friend) // Check hit rate CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Page cache") YIELD attributes RETURN attributes.hits, attributes.faults, (toFloat(attributes.hits) / (attributes.hits + attributes.faults)) AS hit_rate ``` **Expected**: Hit rate increases on subsequent runs (data cached) *** ## Summary **Native Graph Storage**: * Fixed-size records (O(1) access by ID) * Pointers for relationships (no JOINs!) * Index-free adjacency (O(D) traversal, not O(N)) **Indexes**: * Label indexes (filter by node type) * Property indexes (B+tree, O(log N)) * Full-text indexes (Lucene) * Vector indexes (HNSW for ANN) **Transactions**: * Full ACID guarantees * Write-Ahead Log (durability) * Pessimistic locking (read/write locks) * Deadlock detection **Performance**: * 10-1000x faster than relational for traversals * Scales with degree (D), not graph size (N) * Page cache critical for performance (aim > 95% hit rate) *** ## What's Next? Master Cypher syntax, pattern matching, aggregations, and advanced query techniques