Production Deployment & Operations
Module Duration: 5-6 hours
Learning Style: Configuration + Monitoring + Troubleshooting
Outcome: Deploy and operate production Neo4j clusters with high availability and strong performance
Neo4j Deployment Options
1. Single Instance: Development and testing
2. Causal Cluster: Production HA (3+ core servers)
3. Neo4j Aura: Fully managed cloud service
4. Fabric: Multi-database federation (sharding)
Part 1: Causal Clustering Architecture
How It Works
Core Servers (Raft Consensus)
┌─────────┬─────────┬─────────┐
│ Leader │ Follower│ Follower│
│ (Writes)│ (Reads) │ (Reads) │
└────┬────┴────┬────┴────┬────┘
│ │ │
└─────────┼─────────┘
│
┌──────────┴──────────┐
│ │
Read Replica Read Replica
(Reads Only) (Reads Only)
Core Servers:
Handle writes (coordinated via Raft leader election)
Minimum 3 servers (tolerates 1 failure)
Recommended: 5 or 7 (tolerates 2 or 3 failures)
Read Replicas:
Replicate asynchronously from the core servers
Scale out read throughput
Can be added or removed dynamically
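The fault-tolerance numbers above follow directly from Raft's majority-quorum rule. A small sketch of the arithmetic (the function names are illustrative, not part of Neo4j):

```python
def quorum_size(core_servers: int) -> int:
    """Raft needs a strict majority of core servers to commit a write."""
    return core_servers // 2 + 1

def tolerated_failures(core_servers: int) -> int:
    """Failures the cluster survives while still reaching quorum."""
    return core_servers - quorum_size(core_servers)

for n in (3, 4, 5, 7):
    print(f"{n} cores: quorum={quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Note that 4 cores tolerate no more failures than 3, which is why odd cluster sizes are recommended.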
Configuration
neo4j.conf (Core Server):
# Cluster settings
dbms.mode=CORE
causal_clustering.minimum_core_cluster_size_at_formation=3
causal_clustering.minimum_core_cluster_size_at_runtime=3
# Discovery (initial servers)
causal_clustering.initial_discovery_members=server1:5000,server2:5000,server3:5000
# Ports
causal_clustering.discovery_listen_address=0.0.0.0:5000
causal_clustering.transaction_listen_address=0.0.0.0:6000
causal_clustering.raft_listen_address=0.0.0.0:7000
# Bolt connector (client connections)
dbms.connector.bolt.enabled=true
dbms.connector.bolt.listen_address=0.0.0.0:7687
neo4j.conf (Read Replica):
dbms.mode=READ_REPLICA
causal_clustering.initial_discovery_members=server1:5000,server2:5000,server3:5000
Starting a Cluster
# On each server
neo4j start
# Check cluster status
cypher-shell "CALL dbms.cluster.overview()"
Output:
┌──────────┬─────────┬──────────────┬──────────┐
│ id       │ address │ role         │ database │
├──────────┼─────────┼──────────────┼──────────┤
│ server1  │ :7687   │ LEADER       │ neo4j    │
│ server2  │ :7687   │ FOLLOWER     │ neo4j    │
│ server3  │ :7687   │ FOLLOWER     │ neo4j    │
│ replica1 │ :7687   │ READ_REPLICA │ neo4j    │
└──────────┴─────────┴──────────────┴──────────┘
Part 2: Routing and Load Balancing
Bolt Routing
Driver Configuration (Python):
from neo4j import GraphDatabase

# Connect to any one core server; the driver discovers the rest
# (the Python driver does not accept comma-separated host lists)
driver = GraphDatabase.driver(
    "neo4j://server1:7687",
    auth=("neo4j", "password"),
    max_connection_pool_size=50
)

# Write transaction (routed to the leader)
with driver.session(default_access_mode="WRITE") as session:
    session.run("CREATE (p:Person {name: $name})", name="Alice")

# Read transaction (routed to a follower or read replica)
with driver.session(default_access_mode="READ") as session:
    result = session.run("MATCH (p:Person) RETURN p.name")
    for record in result:
        print(record["p.name"])

The driver auto-discovers the cluster topology via the Bolt routing protocol, so a single neo4j:// endpoint is enough.
External Load Balancer
For additional control, use HAProxy:
haproxy.cfg:
frontend neo4j_bolt
bind *:7687
mode tcp
default_backend neo4j_core_servers
backend neo4j_core_servers
mode tcp
balance roundrobin
option tcp-check
server core1 server1:7687 check
server core2 server2:7687 check
server core3 server3:7687 check
Client connects to HAProxy:
driver = GraphDatabase.driver("bolt://haproxy:7687", auth=("neo4j", "password"))
Note: use the bolt:// scheme here so all traffic stays behind the proxy. With neo4j://, the routing table would point the client directly at the servers' own addresses, bypassing HAProxy.
Part 3: Backup and Recovery
Online Backup
Full Backup:
neo4j-admin backup --backup-dir=/backups --database=neo4j
Incremental Backup:
# First run against an empty directory performs a full backup
neo4j-admin backup --backup-dir=/backups --database=neo4j
# Re-running against the same directory performs a faster incremental backup
neo4j-admin backup --backup-dir=/backups --database=neo4j
Automated Backups (cron):
# Daily backup at 2 AM (% must be escaped in crontab)
0 2 * * * /usr/bin/neo4j-admin backup --backup-dir=/backups/$(date +\%Y-\%m-\%d) --database=neo4j
Restore from Backup
# Stop Neo4j
neo4j stop
# Restore backup
neo4j-admin restore --from=/backups/2024-01-15 --database=neo4j --force
# Start Neo4j
neo4j start
Disaster Recovery Strategy
3-2-1 Rule:
3 copies of your data (production plus 2 backups)
2 different storage media (local disk plus cloud)
1 copy off-site (S3, Azure Blob, etc.)
Example Script:
#!/bin/bash
set -euo pipefail

BACKUP_DIR="/backups/$(date +%Y-%m-%d)"
neo4j-admin backup --backup-dir="$BACKUP_DIR" --database=neo4j

# Upload to S3
aws s3 sync "$BACKUP_DIR" "s3://my-neo4j-backups/$(date +%Y-%m-%d)"

# Delete local backups older than 30 days
find /backups -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} \;
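The retention step can also be expressed as a pure function, which is easier to unit-test than a find command. A sketch, assuming backup directories are named with the YYYY-MM-DD pattern used above:

```python
from datetime import date, timedelta

def backups_to_delete(backup_dates, today, keep_days=30):
    """Return the backup dates that fall outside the retention window."""
    cutoff = today - timedelta(days=keep_days)
    return sorted(d for d in backup_dates if d < cutoff)

dates = [date(2024, 1, 15), date(2023, 12, 1), date(2023, 11, 1)]
print(backups_to_delete(dates, today=date(2024, 1, 16)))
```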
Part 4: Monitoring
Built-In Metrics (JMX)
Enable JMX:
# neo4j.conf
dbms.jvm.additional=-Dcom.sun.management.jmxremote.port=3637
dbms.jvm.additional=-Dcom.sun.management.jmxremote.authenticate=false
dbms.jvm.additional=-Dcom.sun.management.jmxremote.ssl=false
Query JMX Metrics:
CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Transactions")
YIELD attributes
RETURN attributes.NumberOfOpenTransactions
Prometheus + Grafana
1. Enable the Prometheus endpoint: Neo4j Enterprise exposes metrics natively, so no separate exporter plugin is needed.
2. Configure (neo4j.conf):
metrics.prometheus.enabled=true
metrics.prometheus.endpoint=localhost:2004
3. Configure Prometheus (prometheus.yml):
scrape_configs:
  - job_name: 'neo4j'
    static_configs:
      - targets: ['server1:2004', 'server2:2004', 'server3:2004']
4. Import Grafana Dashboard: Dashboard ID 14331
Key Metrics to Monitor
Metric                 Description              Alert Threshold
Transactions/sec       Write throughput         -
Open transactions      Active transactions      > 100 (possible leak)
Page cache hit ratio   Cache efficiency         < 90%
GC pause time          JVM garbage collection   > 1s
Store size             Disk usage               > 80% of disk
Cluster lag            Replication delay        > 10s
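The alert thresholds in the table can be encoded directly in an alert check. A minimal sketch; the metric names are illustrative keys, not official Neo4j metric identifiers:

```python
def check_alerts(metrics: dict) -> list:
    """Return the names of metrics that breach the table's thresholds."""
    alerts = []
    if metrics.get("open_transactions", 0) > 100:
        alerts.append("open_transactions")        # possible transaction leak
    if metrics.get("page_cache_hit_ratio", 1.0) < 0.90:
        alerts.append("page_cache_hit_ratio")
    if metrics.get("gc_pause_seconds", 0) > 1.0:
        alerts.append("gc_pause_seconds")
    if metrics.get("disk_used_fraction", 0) > 0.80:
        alerts.append("disk_used_fraction")
    if metrics.get("cluster_lag_seconds", 0) > 10:
        alerts.append("cluster_lag_seconds")
    return alerts

print(check_alerts({"open_transactions": 150, "page_cache_hit_ratio": 0.85}))
# -> ['open_transactions', 'page_cache_hit_ratio']
```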
Part 5: Performance Tuning
JVM Configuration
Heap Size (neo4j.conf):
# Set heap to 8-16GB (or 25% of RAM)
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
GC Settings (jvm.additional):
dbms.jvm.additional=-XX:+UseG1GC
dbms.jvm.additional=-XX:MaxGCPauseMillis=200
dbms.jvm.additional=-XX:ParallelGCThreads=16
Page Cache
Size Recommendation: 50-75% of remaining RAM (after the heap)
# neo4j.conf example for a 64GB machine (8GB heap leaves room for a 32GB page cache)
dbms.memory.pagecache.size=32g
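The rules of thumb above (heap at roughly 25% of RAM, capped around 16GB; 50-75% of the remainder for the page cache) can be sketched as a quick calculator. This illustrates the guidance only; it is not an official sizing tool:

```python
def size_memory(total_ram_gb: float, cache_fraction: float = 0.75):
    """Split RAM per the rules of thumb: ~25% heap (capped at 16GB),
    then a configurable fraction of the remainder for the page cache."""
    heap = min(total_ram_gb * 0.25, 16)
    page_cache = (total_ram_gb - heap) * cache_fraction
    return {"heap_gb": round(heap), "page_cache_gb": round(page_cache)}

print(size_memory(64))  # -> {'heap_gb': 16, 'page_cache_gb': 36}
```

Whatever split you choose, leave headroom for the OS and other processes.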
Monitor Hit Ratio:
CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Page cache")
YIELD attributes
RETURN
  attributes.hits AS hits,
  attributes.faults AS faults,
  toFloat(attributes.hits) / (attributes.hits + attributes.faults) AS hit_ratio
Target: > 95% hit ratio
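The same hit-ratio formula as a plain function, handy when the hits/faults counters are scraped into a monitoring script:

```python
def page_cache_hit_ratio(hits: int, faults: int) -> float:
    """hits / (hits + faults); an empty cache counts as a perfect ratio."""
    total = hits + faults
    return hits / total if total else 1.0

print(page_cache_hit_ratio(980, 20))  # 0.98 -- above the 95% target
```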
Transaction Log
Configuration:
# Keep 2 days of transaction logs
dbms.tx_log.rotation.retention_policy=2 days
# Rotate at 256MB
dbms.tx_log.rotation.size=256M
Part 6: Security
Authentication
Enable authentication:
dbms.security.auth_enabled=true
Create users:
// Create an admin user
CREATE USER admin SET PASSWORD 'secure_password' CHANGE NOT REQUIRED;
GRANT ROLE admin TO admin;
// Create a read-only user
CREATE USER analyst SET PASSWORD 'password' CHANGE NOT REQUIRED;
GRANT ROLE reader TO analyst;
Authorization (RBAC)
Create a custom role:
CREATE ROLE data_scientist;
// Grant read access to the neo4j database
GRANT MATCH {*} ON GRAPH neo4j TO data_scientist;
// Grant permission to run GDS procedures
GRANT EXECUTE PROCEDURE gds.* ON DBMS TO data_scientist;
// Assign the role to a user
GRANT ROLE data_scientist TO alice;
Encryption
SSL/TLS for Bolt:
dbms.connector.bolt.tls_level=REQUIRED
dbms.ssl.policy.bolt.enabled=true
dbms.ssl.policy.bolt.base_directory=certificates/bolt
dbms.ssl.policy.bolt.private_key=private.key
dbms.ssl.policy.bolt.public_certificate=public.crt
Generate a self-signed certificate (fine for testing; use a CA-signed certificate in production):
openssl req -newkey rsa:2048 -nodes -keyout private.key -x509 -days 365 -out public.crt
Part 7: Troubleshooting
Issue 1: Slow Queries
Diagnosis:
Enable query logging (neo4j.conf):
dbms.logs.query.enabled=true
dbms.logs.query.threshold=1s
View currently running slow queries:
CALL dbms.listQueries()
YIELD query, elapsedTimeMillis
WHERE elapsedTimeMillis > 1000
RETURN query, elapsedTimeMillis
ORDER BY elapsedTimeMillis DESC
Solutions:
Add indexes on the filtered properties
Use PROFILE to find bottlenecks
Rewrite the query (filter early, limit results)
Issue 2: High Memory Usage
Diagnosis:
CALL dbms.queryJmx("java.lang:type=Memory")
YIELD attributes
RETURN attributes.HeapMemoryUsage
Solutions :
Increase heap size (dbms.memory.heap.max_size)
Increase page cache (dbms.memory.pagecache.size)
Check for transaction leaks (unclosed transactions)
Issue 3: Cluster Split-Brain
Symptom: Multiple leaders elected
Diagnosis:
CALL dbms.cluster.overview()
YIELD role
WITH role, count(*) AS count
WHERE role = 'LEADER'
RETURN count  // Should be 1!
Prevention:
Use an odd number of core servers (3, 5, 7)
Ensure network stability between cluster members
Configure causal_clustering.minimum_core_cluster_size_at_runtime correctly
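The leader-count check can be automated in a monitoring script. A sketch that counts leaders in rows shaped like dbms.cluster.overview() output (the row format is simplified for illustration):

```python
def detect_split_brain(overview_rows) -> bool:
    """Return True if more than one member reports the LEADER role."""
    leaders = [row for row in overview_rows if row["role"] == "LEADER"]
    return len(leaders) > 1

rows = [
    {"id": "server1", "role": "LEADER"},
    {"id": "server2", "role": "FOLLOWER"},
    {"id": "server3", "role": "FOLLOWER"},
]
print(detect_split_brain(rows))  # False -- exactly one leader
```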
Issue 4: Replication Lag
Diagnosis:
// Run on each cluster member and compare the last committed
// transaction id; a member trailing the leader is lagging
CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Transactions")
YIELD attributes
RETURN attributes.LastCommittedTxId AS last_committed_tx
Solutions:
Add more read replicas to spread the catchup load
Increase network bandwidth between members
Tune causal_clustering.catchup.batch_size
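Comparing last-committed transaction ids across members gives a simple lag measure in transactions. A sketch; member names and ids are illustrative:

```python
def replication_lag(tx_ids: dict) -> dict:
    """Transactions each member trails behind the most advanced member."""
    head = max(tx_ids.values())
    return {member: head - tx for member, tx in tx_ids.items()}

print(replication_lag({"leader": 1000, "follower1": 998, "replica1": 950}))
# -> {'leader': 0, 'follower1': 2, 'replica1': 50}
```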
Part 8: Best Practices Checklist
Deployment: 3+ core servers (odd count), read replicas for read scaling, routing driver or load balancer configured
Backup: automated daily backups (cron), 3-2-1 rule followed, restores tested regularly
Monitoring: Prometheus + Grafana with alerts on the key metrics, slow-query logging enabled
Performance: heap and page cache sized to the machine, G1 GC tuned, hit ratio > 95%
Security: authentication enabled, RBAC roles assigned, SSL/TLS on Bolt
Summary
Causal Clustering: 3+ core servers for HA, read replicas for scale
Backup: Automated daily backups with off-site storage
Monitoring: Prometheus + Grafana for metrics
Performance: Proper heap/cache sizing, GC tuning
Security: Authentication, RBAC, SSL/TLS
Production-Ready: HA, monitored, secured, backed up, performant!
What’s Next?
Module 8: Capstone Project - Knowledge Graph Platform. Build a complete knowledge graph application with recommendations, search, and analytics.