
Production Deployment & Operations

Module Duration: 5-6 hours
Learning Style: Configuration + Monitoring + Troubleshooting
Outcome: Deploy and operate production Neo4j clusters with high availability and performance

Neo4j Deployment Options

1. Single Instance: Development, testing
2. Causal Cluster: Production HA (3+ servers)
3. Neo4j Aura: Managed cloud service
4. Fabric: Multi-database federation (sharding)

Part 1: Causal Clustering Architecture

How It Works

        Core Servers (Raft Consensus)
    ┌─────────┬─────────┬─────────┐
    │ Leader  │ Follower│ Follower│
    │ (Writes)│ (Reads) │ (Reads) │
    └────┬────┴────┬────┴────┬────┘
         │         │         │
         └─────────┼─────────┘

        ┌──────────┴──────────┐
        │                     │
   Read Replica          Read Replica
    (Reads Only)         (Reads Only)
Core Servers:
  • Handle writes (via Raft leader election)
  • Minimum 3 servers (tolerate 1 failure)
  • Recommended: 5 or 7 (tolerate 2 or 3 failures)
Read Replicas:
  • Asynchronous replication from core servers
  • Scale read throughput
  • Can be added/removed dynamically
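The sizing advice above follows directly from Raft's majority requirement. A minimal sketch of the arithmetic (plain Python, no Neo4j involved):

```python
# Raft commits a write only once a majority (quorum) of core servers accept it.
# This is the arithmetic behind the "3, 5, or 7 cores" recommendation.

def quorum(cores: int) -> int:
    """Smallest majority of `cores` servers."""
    return cores // 2 + 1

def tolerated_failures(cores: int) -> int:
    """Failures the cluster survives while still reaching quorum."""
    return cores - quorum(cores)

for n in (3, 4, 5, 7):
    print(f"{n} cores: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Note that 4 cores tolerate no more failures than 3, which is why odd cluster sizes are recommended.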

Configuration

neo4j.conf (Core Server):
# Cluster settings
dbms.mode=CORE
causal_clustering.minimum_core_cluster_size_at_formation=3
causal_clustering.minimum_core_cluster_size_at_runtime=3

# Discovery (initial servers)
causal_clustering.initial_discovery_members=server1:5000,server2:5000,server3:5000

# Ports
causal_clustering.discovery_listen_address=0.0.0.0:5000
causal_clustering.transaction_listen_address=0.0.0.0:6000
causal_clustering.raft_listen_address=0.0.0.0:7000

# Bolt connector (client connections)
dbms.connector.bolt.enabled=true
dbms.connector.bolt.listen_address=0.0.0.0:7687
neo4j.conf (Read Replica):
dbms.mode=READ_REPLICA
causal_clustering.initial_discovery_members=server1:5000,server2:5000,server3:5000

Starting a Cluster

# On each server
neo4j start

# Check cluster status
cypher-shell "CALL dbms.cluster.overview()"
Output:
┌──────────┬─────────┬──────────────┬──────────┐
│ id       │ address │ role         │ database │
├──────────┼─────────┼──────────────┼──────────┤
│ server1  │ :7687   │ LEADER       │ neo4j    │
│ server2  │ :7687   │ FOLLOWER     │ neo4j    │
│ server3  │ :7687   │ FOLLOWER     │ neo4j    │
│ replica1 │ :7687   │ READ_REPLICA │ neo4j    │
└──────────┴─────────┴──────────────┴──────────┘

Part 2: Routing and Load Balancing

Bolt Routing

Driver Configuration (Python):
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "neo4j://server1:7687",  # Any core server; the driver discovers the full cluster topology
    auth=("neo4j", "password"),
    max_connection_pool_size=50
)

# Write transaction (routed to leader)
with driver.session(default_access_mode="WRITE") as session:
    session.run("CREATE (p:Person {name: $name})", name="Alice")

# Read transaction (routed to follower or replica)
with driver.session(default_access_mode="READ") as session:
    result = session.run("MATCH (p:Person) RETURN p.name")
    for record in result:
        print(record["p.name"])
Driver auto-discovers cluster topology via Bolt routing protocol.
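For production code, managed transaction functions are preferable to the auto-commit session.run() calls above: execute_write and execute_read (Python driver 5.x) retry the whole function on transient cluster errors such as a leader switch. A sketch, using the same placeholder address and credentials as above:

```python
def create_person(tx, name):
    # Runs inside a write transaction; retried automatically on transient failures
    tx.run("CREATE (p:Person {name: $name})", name=name)

def person_names(tx):
    # Runs inside a read transaction on a follower or read replica
    result = tx.run("MATCH (p:Person) RETURN p.name AS name")
    return [record["name"] for record in result]

def main():
    from neo4j import GraphDatabase  # deferred import so the sketch loads without a server
    with GraphDatabase.driver("neo4j://server1:7687",
                              auth=("neo4j", "password")) as driver:
        with driver.session(database="neo4j") as session:
            session.execute_write(create_person, "Alice")  # routed to the leader
            print(session.execute_read(person_names))      # routed to a reader

# Call main() against a running cluster; it is not invoked here.
```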

External Load Balancer

For additional control, place an external load balancer such as HAProxy in front of the cluster.

haproxy.cfg:
frontend neo4j_bolt
    bind *:7687
    mode tcp
    default_backend neo4j_core_servers

backend neo4j_core_servers
    mode tcp
    balance roundrobin
    option tcp-check
    server core1 server1:7687 check
    server core2 server2:7687 check
    server core3 server3:7687 check
Client connects to HAProxy (note the bolt:// scheme — with neo4j:// the driver would fetch the cluster's routing table and bypass the load balancer):
driver = GraphDatabase.driver("bolt://haproxy:7687", auth=("neo4j", "password"))

Part 3: Backup and Recovery

Online Backup

Full Backup:
neo4j-admin backup --backup-dir=/backups --database=neo4j
Incremental Backup (the same command runs incrementally once the target directory already contains a backup):
# First backup (full)
neo4j-admin backup --backup-dir=/backups --database=neo4j

# Subsequent backups (incremental, faster)
neo4j-admin backup --backup-dir=/backups --database=neo4j
Automated Backups (cron):
# Daily backup at 2 AM
0 2 * * * /usr/bin/neo4j-admin backup --backup-dir=/backups/$(date +\%Y-\%m-\%d) --database=neo4j

Restore from Backup

# Stop Neo4j
neo4j stop

# Restore backup
neo4j-admin restore --from=/backups/2024-01-15 --database=neo4j --force

# Start Neo4j
neo4j start

Disaster Recovery Strategy

3-2-1 Rule:
  • 3 copies of data (production + 2 backups)
  • 2 different storage types (local + cloud)
  • 1 off-site backup (S3, Azure Blob, etc.)
Example Script:
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/backups/$(date +%Y-%m-%d)"
neo4j-admin backup --backup-dir="$BACKUP_DIR" --database=neo4j

# Upload to S3
aws s3 sync "$BACKUP_DIR" "s3://my-neo4j-backups/$(date +%Y-%m-%d)"

# Delete backups older than 30 days
find /backups -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +

Part 4: Monitoring

Built-In Metrics (JMX)

Enable JMX:
# neo4j.conf
dbms.jvm.additional=-Dcom.sun.management.jmxremote.port=3637
dbms.jvm.additional=-Dcom.sun.management.jmxremote.authenticate=false
dbms.jvm.additional=-Dcom.sun.management.jmxremote.ssl=false
Query JMX Metrics:
CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Transactions")
YIELD attributes
RETURN attributes.NumberOfOpenTransactions

Prometheus + Grafana

1. Enable the Prometheus endpoint. Neo4j Enterprise exposes Prometheus metrics natively, so no separate exporter plugin is required.
2. Configure (neo4j.conf):
metrics.prometheus.enabled=true
metrics.prometheus.endpoint=localhost:2004
3. Configure Prometheus (prometheus.yml):
scrape_configs:
  - job_name: 'neo4j'
    static_configs:
      - targets: ['server1:2004', 'server2:2004', 'server3:2004']
4. Import Grafana Dashboard: Dashboard ID 14331

Key Metrics to Monitor

| Metric               | Description            | Alert Threshold |
|----------------------|------------------------|-----------------|
| Transactions/sec     | Write throughput       | -               |
| Open transactions    | Active transactions    | > 100 (leaks?)  |
| Page cache hit ratio | Cache efficiency       | < 90%           |
| GC pause time        | JVM garbage collection | > 1s            |
| Store size           | Disk usage             | > 80%           |
| Cluster lag          | Replication delay      | > 10s           |
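These thresholds can be encoded as alert rules. Here is a small self-contained Python sketch of the same checks — the metric names are illustrative, not real Neo4j metric identifiers; in practice you would express these rules as Prometheus alerts:

```python
# Alert rules mirroring the thresholds in the table above.
# Metric names are placeholders for illustration only.
THRESHOLDS = {
    "open_transactions":        lambda v: v > 100,   # possible transaction leak
    "page_cache_hit_ratio":     lambda v: v < 0.90,  # cache too small
    "gc_pause_seconds":         lambda v: v > 1.0,   # JVM tuning needed
    "store_disk_used_fraction": lambda v: v > 0.80,  # disk filling up
    "cluster_lag_seconds":      lambda v: v > 10.0,  # replication falling behind
}

def breached(metrics: dict) -> list:
    """Return the names of metrics that violate their alert threshold."""
    return [name for name, bad in THRESHOLDS.items()
            if name in metrics and bad(metrics[name])]

sample = {"open_transactions": 12, "page_cache_hit_ratio": 0.85,
          "gc_pause_seconds": 0.2, "cluster_lag_seconds": 42.0}
print(breached(sample))  # → ['page_cache_hit_ratio', 'cluster_lag_seconds']
```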

Part 5: Performance Tuning

JVM Configuration

Heap Size (neo4j.conf):
# Set heap to 8-16GB (or 25% of RAM)
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
GC Settings (jvm.additional):
dbms.jvm.additional=-XX:+UseG1GC
dbms.jvm.additional=-XX:MaxGCPauseMillis=200
dbms.jvm.additional=-XX:ParallelGCThreads=16

Page Cache

Size Recommendation: 50-75% of remaining RAM (after heap)
# neo4j.conf
dbms.memory.pagecache.size=32g  # For 64GB RAM (8GB heap + 32GB cache)
Monitor Hit Ratio:
CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Page cache")
YIELD attributes
RETURN
  attributes.hits AS hits,
  attributes.faults AS faults,
  (toFloat(attributes.hits) / (attributes.hits + attributes.faults)) AS hit_ratio
Target: > 95% hit ratio
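Putting the heap and page-cache guidance together, the split can be sketched as a small calculator. The 8 GB default heap and the 60% share of remaining RAM are this section's rules of thumb (the middle of the 50-75% range), not official formulas:

```python
# Sketch of the sizing heuristic above: fix the heap, give the page cache
# a share of what remains, and leave the rest as OS headroom.
def size_memory(total_ram_gb: int, heap_gb: int = 8, cache_share: float = 0.6):
    remaining = total_ram_gb - heap_gb
    page_cache = int(remaining * cache_share)
    os_headroom = total_ram_gb - heap_gb - page_cache
    return {"heap_gb": heap_gb, "page_cache_gb": page_cache, "os_gb": os_headroom}

print(size_memory(64))  # → {'heap_gb': 8, 'page_cache_gb': 33, 'os_gb': 23}
```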

Transaction Log

Configuration:
dbms.tx_log.rotation.retention_policy=2 days  # Keep 2 days of logs
dbms.tx_log.rotation.size=256M               # Rotate at 256MB

Part 6: Security

Authentication

Enable authentication:
dbms.security.auth_enabled=true
Create users:
// Create admin user
CREATE USER admin SET PASSWORD 'secure_password' CHANGE NOT REQUIRED;
GRANT ROLE admin TO admin;

// Create read-only user
CREATE USER analyst SET PASSWORD 'password' CHANGE NOT REQUIRED;
GRANT ROLE reader TO analyst;

Authorization (RBAC)

Create custom role:
CREATE ROLE data_scientist;

// Grant read access to neo4j database
GRANT MATCH {*} ON GRAPH neo4j TO data_scientist;

// Grant algorithm execution
GRANT EXECUTE PROCEDURE gds.* ON DBMS TO data_scientist;

// Assign role to user
GRANT ROLE data_scientist TO alice;

Encryption

SSL/TLS for Bolt:
dbms.connector.bolt.tls_level=REQUIRED
dbms.ssl.policy.bolt.enabled=true
dbms.ssl.policy.bolt.base_directory=certificates/bolt
dbms.ssl.policy.bolt.private_key=private.key
dbms.ssl.policy.bolt.public_certificate=public.crt
Generate certificates:
openssl req -newkey rsa:2048 -nodes -keyout private.key -x509 -days 365 -out public.crt

Part 7: Troubleshooting

Issue 1: Slow Queries

Diagnosis:
# neo4j.conf — enable query logging
dbms.logs.query.enabled=true
dbms.logs.query.threshold=1s  # Log queries slower than 1 second

// Cypher: list currently running queries slower than 1 second
CALL dbms.listQueries()
YIELD query, elapsedTimeMillis
WHERE elapsedTimeMillis > 1000
RETURN query, elapsedTimeMillis
ORDER BY elapsedTimeMillis DESC
Solutions:
  1. Add indexes
  2. Use PROFILE to find bottlenecks
  3. Rewrite query (filter early, limit results)
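Beyond dbms.listQueries() (which only shows queries currently running), the query.log written by the settings above can be mined offline. A sketch — the line layout assumed here is an approximation; check your Neo4j version's actual query.log format and adjust the regex:

```python
import re

# Pull slow queries out of query.log. The assumed layout
# "... INFO <elapsed> ms: <query text>" is illustrative — verify it
# against your Neo4j version's query.log before relying on it.
LINE = re.compile(r"INFO\s+(\d+)\s+ms:\s+(.*)")

def slow_queries(lines, threshold_ms=1000):
    out = []
    for line in lines:
        m = LINE.search(line)
        if m and int(m.group(1)) >= threshold_ms:
            out.append((int(m.group(1)), m.group(2)))
    return sorted(out, reverse=True)  # slowest first

sample = [
    "2024-01-15 02:00:00.000+0000 INFO 1543 ms: MATCH (p:Person) RETURN p",
    "2024-01-15 02:00:01.000+0000 INFO 12 ms: RETURN 1",
]
print(slow_queries(sample))  # → [(1543, 'MATCH (p:Person) RETURN p')]
```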

Issue 2: High Memory Usage

Diagnosis:
CALL dbms.queryJmx("java.lang:type=Memory")
YIELD attributes
RETURN attributes.HeapMemoryUsage
Solutions:
  1. Increase heap size (dbms.memory.heap.max_size)
  2. Increase page cache (dbms.memory.pagecache.size)
  3. Check for transaction leaks (unclosed transactions)

Issue 3: Cluster Split-Brain

Symptom: Multiple leaders elected

Diagnosis:
CALL dbms.cluster.overview()
YIELD role
WITH role, count(*) AS count
WHERE role = 'LEADER'
RETURN count  // Should be 1!
Prevention:
  • Use odd number of core servers (3, 5, 7)
  • Ensure network stability
  • Configure causal_clustering.minimum_core_cluster_size_at_runtime correctly

Issue 4: Replication Lag

Diagnosis: replication lag is not stored in the graph, so there is no node to MATCH against. Observe it through metrics instead: compare the last committed transaction ID on the leader with the one on each read replica (exposed via the metrics endpoint scraped by Prometheus), or watch the catchup messages in the replica's debug.log. A steadily growing transaction-ID gap indicates lag.
Solutions:
  1. Add more read replicas
  2. Increase network bandwidth
  3. Tune causal_clustering.catchup.batch_size

Part 8: Best Practices Checklist

Deployment:
  • Use causal cluster (3+ core servers)
  • Configure read replicas for read scaling
  • Use Bolt routing in drivers
  • External load balancer (HAProxy) for fine control
Backup:
  • Automated daily backups
  • Off-site backup to S3/cloud
  • Tested restore procedure
  • 30-day retention
Monitoring:
  • Prometheus + Grafana setup
  • Alerts for key metrics (transactions, GC, lag)
  • Query logging enabled (threshold: 1s)
Performance:
  • Heap: 8-16GB
  • Page cache: 50-75% of RAM
  • Page cache hit ratio > 95%
  • Indexes on frequently queried properties
Security:
  • Authentication enabled
  • RBAC configured (least privilege)
  • SSL/TLS for Bolt
  • Firewall rules (only necessary ports open)

Summary

  • Causal Clustering: 3+ core servers for HA, read replicas for scale
  • Backup: Automated daily backups with off-site storage
  • Monitoring: Prometheus + Grafana for metrics
  • Performance: Proper heap/cache sizing, GC tuning
  • Security: Authentication, RBAC, SSL/TLS
Production-Ready: HA, monitored, secured, backed up, performant!

What’s Next?

Module 8: Capstone Project - Knowledge Graph Platform

Build a complete knowledge graph application with recommendations, search, and analytics