
Production Deployment & Operations

Module Duration: 5-6 hours
Learning Style: Configuration + Monitoring + Troubleshooting
Outcome: Deploy and operate production Neo4j clusters with high availability and performance

Neo4j Deployment Options

1. Single Instance: Development, testing
2. Causal Cluster: Production HA (3+ servers)
3. Neo4j Aura: Managed cloud service
4. Fabric: Multi-database federation (sharding)

Part 1: Causal Clustering Architecture

How It Works

        Core Servers (Raft Consensus)
    ┌─────────┬─────────┬─────────┐
    │ Leader  │ Follower│ Follower│
    │ (Writes)│ (Reads) │ (Reads) │
    └────┬────┴────┬────┴────┬────┘
         │         │         │
         └─────────┼─────────┘

        ┌──────────┴──────────┐
        │                     │
   Read Replica          Read Replica
    (Reads Only)         (Reads Only)
Core Servers:
  • Handle writes (via Raft leader election)
  • Minimum 3 servers (tolerate 1 failure)
  • Recommended: 5 or 7 (tolerate 2 or 3 failures)
Read Replicas:
  • Asynchronous replication from core servers
  • Scale read throughput
  • Can be added/removed dynamically
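The sizing advice above follows directly from Raft's majority requirement. A minimal sketch of the arithmetic (plain Python, no Neo4j involved):

```python
# Raft commits a write only once a majority (quorum) of core servers accept it.
# This is the arithmetic behind the "3, 5, or 7 cores" recommendation.

def quorum(cores: int) -> int:
    """Smallest majority of `cores` servers."""
    return cores // 2 + 1

def tolerated_failures(cores: int) -> int:
    """Failures the cluster survives while still reaching quorum."""
    return cores - quorum(cores)

for n in (3, 4, 5, 7):
    print(f"{n} cores: quorum={quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Note that 4 cores tolerate no more failures than 3, which is why odd cluster sizes are recommended.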

Configuration

neo4j.conf (Core Server):
# Cluster settings
dbms.mode=CORE
causal_clustering.minimum_core_cluster_size_at_formation=3
causal_clustering.minimum_core_cluster_size_at_runtime=3

# Discovery (initial servers)
causal_clustering.initial_discovery_members=server1:5000,server2:5000,server3:5000

# Ports
causal_clustering.discovery_listen_address=0.0.0.0:5000
causal_clustering.transaction_listen_address=0.0.0.0:6000
causal_clustering.raft_listen_address=0.0.0.0:7000

# Bolt connector (client connections)
dbms.connector.bolt.enabled=true
dbms.connector.bolt.listen_address=0.0.0.0:7687
neo4j.conf (Read Replica):
dbms.mode=READ_REPLICA
causal_clustering.initial_discovery_members=server1:5000,server2:5000,server3:5000

Starting a Cluster

# On each server
neo4j start

# Check cluster status
cypher-shell "CALL dbms.cluster.overview()"
Output:
┌──────────┬─────────┬──────────────┬──────────┐
│ id       │ address │ role         │ database │
├──────────┼─────────┼──────────────┼──────────┤
│ server1  │ :7687   │ LEADER       │ neo4j    │
│ server2  │ :7687   │ FOLLOWER     │ neo4j    │
│ server3  │ :7687   │ FOLLOWER     │ neo4j    │
│ replica1 │ :7687   │ READ_REPLICA │ neo4j    │
└──────────┴─────────┴──────────────┴──────────┘

Part 2: Routing and Load Balancing

Bolt Routing

Driver Configuration (Python):
from neo4j import GraphDatabase

driver = GraphDatabase.driver(
    "neo4j://server1:7687",  # Any core server; the driver discovers the full cluster topology
    auth=("neo4j", "password"),
    max_connection_pool_size=50
)

# Write transaction (routed to leader)
with driver.session(default_access_mode="WRITE") as session:
    session.run("CREATE (p:Person {name: $name})", name="Alice")

# Read transaction (routed to follower or replica)
with driver.session(default_access_mode="READ") as session:
    result = session.run("MATCH (p:Person) RETURN p.name")
    for record in result:
        print(record["p.name"])
Driver auto-discovers cluster topology via Bolt routing protocol.
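For production code, managed transaction functions are preferable to the auto-commit session.run() calls above: execute_write and execute_read (Python driver 5.x) retry the whole function on transient cluster errors such as a leader switch. A sketch, using the same placeholder address and credentials as above:

```python
def create_person(tx, name):
    # Runs inside a write transaction; retried automatically on transient failures
    tx.run("CREATE (p:Person {name: $name})", name=name)

def person_names(tx):
    # Runs inside a read transaction on a follower or read replica
    result = tx.run("MATCH (p:Person) RETURN p.name AS name")
    return [record["name"] for record in result]

def main():
    from neo4j import GraphDatabase  # deferred import so the sketch loads without a server
    with GraphDatabase.driver("neo4j://server1:7687",
                              auth=("neo4j", "password")) as driver:
        with driver.session(database="neo4j") as session:
            session.execute_write(create_person, "Alice")  # routed to the leader
            print(session.execute_read(person_names))      # routed to a reader

# Call main() against a running cluster; it is not invoked here.
```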

External Load Balancer

For additional control, place an external load balancer such as HAProxy in front of the cluster.

haproxy.cfg:
frontend neo4j_bolt
    bind *:7687
    mode tcp
    default_backend neo4j_core_servers

backend neo4j_core_servers
    mode tcp
    balance roundrobin
    option tcp-check
    server core1 server1:7687 check
    server core2 server2:7687 check
    server core3 server3:7687 check
Client connects to HAProxy (note the bolt:// scheme — with neo4j:// the driver would fetch the cluster's routing table and bypass the load balancer):
driver = GraphDatabase.driver("bolt://haproxy:7687", auth=("neo4j", "password"))

Part 3: Backup and Recovery

Online Backup

Full Backup:
neo4j-admin backup --backup-dir=/backups --database=neo4j
Incremental Backup (the same command runs incrementally once the target directory already contains a backup):
# First backup (full)
neo4j-admin backup --backup-dir=/backups --database=neo4j

# Subsequent backups (incremental, faster)
neo4j-admin backup --backup-dir=/backups --database=neo4j
Automated Backups (cron):
# Daily backup at 2 AM
0 2 * * * /usr/bin/neo4j-admin backup --backup-dir=/backups/$(date +\%Y-\%m-\%d) --database=neo4j

Restore from Backup

# Stop Neo4j
neo4j stop

# Restore backup
neo4j-admin restore --from=/backups/2024-01-15 --database=neo4j --force

# Start Neo4j
neo4j start

Disaster Recovery Strategy

3-2-1 Rule:
  • 3 copies of data (production + 2 backups)
  • 2 different storage types (local + cloud)
  • 1 off-site backup (S3, Azure Blob, etc.)
Example Script:
#!/bin/bash
set -euo pipefail
BACKUP_DIR="/backups/$(date +%Y-%m-%d)"
neo4j-admin backup --backup-dir="$BACKUP_DIR" --database=neo4j

# Upload to S3
aws s3 sync "$BACKUP_DIR" "s3://my-neo4j-backups/$(date +%Y-%m-%d)"

# Delete backups older than 30 days
find /backups -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +

Part 4: Monitoring

Built-In Metrics (JMX)

Enable JMX:
# neo4j.conf
dbms.jvm.additional=-Dcom.sun.management.jmxremote.port=3637
dbms.jvm.additional=-Dcom.sun.management.jmxremote.authenticate=false
dbms.jvm.additional=-Dcom.sun.management.jmxremote.ssl=false
Query JMX Metrics:
CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Transactions")
YIELD attributes
RETURN attributes.NumberOfOpenTransactions

Prometheus + Grafana

1. Enable the Prometheus endpoint. Neo4j Enterprise exposes Prometheus metrics natively, so no separate exporter plugin is required.
2. Configure (neo4j.conf):
metrics.prometheus.enabled=true
metrics.prometheus.endpoint=localhost:2004
3. Configure Prometheus (prometheus.yml):
scrape_configs:
  - job_name: 'neo4j'
    static_configs:
      - targets: ['server1:2004', 'server2:2004', 'server3:2004']
4. Import Grafana Dashboard: Dashboard ID 14331

Key Metrics to Monitor

| Metric               | Description            | Alert Threshold |
|----------------------|------------------------|-----------------|
| Transactions/sec     | Write throughput       | -               |
| Open transactions    | Active transactions    | > 100 (leaks?)  |
| Page cache hit ratio | Cache efficiency       | < 90%           |
| GC pause time        | JVM garbage collection | > 1s            |
| Store size           | Disk usage             | > 80%           |
| Cluster lag          | Replication delay      | > 10s           |
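These thresholds can be encoded as alert rules. Here is a small self-contained Python sketch of the same checks — the metric names are illustrative, not real Neo4j metric identifiers; in practice you would express these rules as Prometheus alerts:

```python
# Alert rules mirroring the thresholds in the table above.
# Metric names are placeholders for illustration only.
THRESHOLDS = {
    "open_transactions":        lambda v: v > 100,   # possible transaction leak
    "page_cache_hit_ratio":     lambda v: v < 0.90,  # cache too small
    "gc_pause_seconds":         lambda v: v > 1.0,   # JVM tuning needed
    "store_disk_used_fraction": lambda v: v > 0.80,  # disk filling up
    "cluster_lag_seconds":      lambda v: v > 10.0,  # replication falling behind
}

def breached(metrics: dict) -> list:
    """Return the names of metrics that violate their alert threshold."""
    return [name for name, bad in THRESHOLDS.items()
            if name in metrics and bad(metrics[name])]

sample = {"open_transactions": 12, "page_cache_hit_ratio": 0.85,
          "gc_pause_seconds": 0.2, "cluster_lag_seconds": 42.0}
print(breached(sample))  # → ['page_cache_hit_ratio', 'cluster_lag_seconds']
```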

Part 5: Performance Tuning

JVM Configuration

Heap Size (neo4j.conf):
# Set heap to 8-16GB (or 25% of RAM)
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
GC Settings (jvm.additional):
dbms.jvm.additional=-XX:+UseG1GC
dbms.jvm.additional=-XX:MaxGCPauseMillis=200
dbms.jvm.additional=-XX:ParallelGCThreads=16

Page Cache

Size Recommendation: 50-75% of remaining RAM (after heap)
# neo4j.conf
dbms.memory.pagecache.size=32g  # For 64GB RAM (8GB heap + 32GB cache)
Monitor Hit Ratio:
CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Page cache")
YIELD attributes
RETURN
  attributes.hits AS hits,
  attributes.faults AS faults,
  (toFloat(attributes.hits) / (attributes.hits + attributes.faults)) AS hit_ratio
Target: > 95% hit ratio
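Putting the heap and page-cache guidance together, the split can be sketched as a small calculator. The 8 GB default heap and the 60% share of remaining RAM are this section's rules of thumb (the middle of the 50-75% range), not official formulas:

```python
# Sketch of the sizing heuristic above: fix the heap, give the page cache
# a share of what remains, and leave the rest as OS headroom.
def size_memory(total_ram_gb: int, heap_gb: int = 8, cache_share: float = 0.6):
    remaining = total_ram_gb - heap_gb
    page_cache = int(remaining * cache_share)
    os_headroom = total_ram_gb - heap_gb - page_cache
    return {"heap_gb": heap_gb, "page_cache_gb": page_cache, "os_gb": os_headroom}

print(size_memory(64))  # → {'heap_gb': 8, 'page_cache_gb': 33, 'os_gb': 23}
```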

Transaction Log

Configuration:
dbms.tx_log.rotation.retention_policy=2 days  # Keep 2 days of logs
dbms.tx_log.rotation.size=256M               # Rotate at 256MB

Part 6: Security

Authentication

Enable authentication:
dbms.security.auth_enabled=true
Create users:
// Create admin user
CREATE USER admin SET PASSWORD 'secure_password' CHANGE NOT REQUIRED;
GRANT ROLE admin TO admin;

// Create read-only user
CREATE USER analyst SET PASSWORD 'password' CHANGE NOT REQUIRED;
GRANT ROLE reader TO analyst;

Authorization (RBAC)

Create custom role:
CREATE ROLE data_scientist;

// Grant read access to neo4j database
GRANT MATCH {*} ON GRAPH neo4j TO data_scientist;

// Grant algorithm execution
GRANT EXECUTE PROCEDURE gds.* ON DBMS TO data_scientist;

// Assign role to user
GRANT ROLE data_scientist TO alice;

Encryption

SSL/TLS for Bolt:
dbms.connector.bolt.tls_level=REQUIRED
dbms.ssl.policy.bolt.enabled=true
dbms.ssl.policy.bolt.base_directory=certificates/bolt
dbms.ssl.policy.bolt.private_key=private.key
dbms.ssl.policy.bolt.public_certificate=public.crt
Generate certificates:
openssl req -newkey rsa:2048 -nodes -keyout private.key -x509 -days 365 -out public.crt

Part 7: Troubleshooting

Issue 1: Slow Queries

Diagnosis:
# neo4j.conf — enable query logging
dbms.logs.query.enabled=true
dbms.logs.query.threshold=1s  # Log queries slower than 1 second

// Cypher: list currently running queries slower than 1 second
CALL dbms.listQueries()
YIELD query, elapsedTimeMillis
WHERE elapsedTimeMillis > 1000
RETURN query, elapsedTimeMillis
ORDER BY elapsedTimeMillis DESC
Solutions:
  1. Add indexes
  2. Use PROFILE to find bottlenecks
  3. Rewrite query (filter early, limit results)
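Beyond dbms.listQueries() (which only shows queries currently running), the query.log written by the settings above can be mined offline. A sketch — the line layout assumed here is an approximation; check your Neo4j version's actual query.log format and adjust the regex:

```python
import re

# Pull slow queries out of query.log. The assumed layout
# "... INFO <elapsed> ms: <query text>" is illustrative — verify it
# against your Neo4j version's query.log before relying on it.
LINE = re.compile(r"INFO\s+(\d+)\s+ms:\s+(.*)")

def slow_queries(lines, threshold_ms=1000):
    out = []
    for line in lines:
        m = LINE.search(line)
        if m and int(m.group(1)) >= threshold_ms:
            out.append((int(m.group(1)), m.group(2)))
    return sorted(out, reverse=True)  # slowest first

sample = [
    "2024-01-15 02:00:00.000+0000 INFO 1543 ms: MATCH (p:Person) RETURN p",
    "2024-01-15 02:00:01.000+0000 INFO 12 ms: RETURN 1",
]
print(slow_queries(sample))  # → [(1543, 'MATCH (p:Person) RETURN p')]
```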

Issue 2: High Memory Usage

Diagnosis:
CALL dbms.queryJmx("java.lang:type=Memory")
YIELD attributes
RETURN attributes.HeapMemoryUsage
Solutions:
  1. Increase heap size (dbms.memory.heap.max_size)
  2. Increase page cache (dbms.memory.pagecache.size)
  3. Check for transaction leaks (unclosed transactions)

Issue 3: Cluster Split-Brain

Symptom: Multiple leaders elected

Diagnosis:
CALL dbms.cluster.overview()
YIELD role
WITH role, count(*) AS count
WHERE role = 'LEADER'
RETURN count  // Should be 1!
Prevention:
  • Use odd number of core servers (3, 5, 7)
  • Ensure network stability
  • Configure causal_clustering.minimum_core_cluster_size_at_runtime correctly

Issue 4: Replication Lag

Diagnosis: replication lag is not stored in the graph, so there is no node to MATCH against. Observe it through metrics instead: compare the last committed transaction ID on the leader with the one on each read replica (exposed via the metrics endpoint scraped by Prometheus), or watch the catchup messages in the replica's debug.log. A steadily growing transaction-ID gap indicates lag.
Solutions:
  1. Add more read replicas
  2. Increase network bandwidth
  3. Tune causal_clustering.catchup.batch_size

Part 8: Best Practices Checklist

Deployment:
  • Use causal cluster (3+ core servers)
  • Configure read replicas for read scaling
  • Use Bolt routing in drivers
  • External load balancer (HAProxy) for fine control
Backup:
  • Automated daily backups
  • Off-site backup to S3/cloud
  • Tested restore procedure
  • 30-day retention
Monitoring:
  • Prometheus + Grafana setup
  • Alerts for key metrics (transactions, GC, lag)
  • Query logging enabled (threshold: 1s)
Performance:
  • Heap: 8-16GB
  • Page cache: 50-75% of RAM
  • Page cache hit ratio > 95%
  • Indexes on frequently queried properties
Security:
  • Authentication enabled
  • RBAC configured (least privilege)
  • SSL/TLS for Bolt
  • Firewall rules (only necessary ports open)

Summary

  • Causal Clustering: 3+ core servers for HA, read replicas for scale
  • Backup: Automated daily backups with off-site storage
  • Monitoring: Prometheus + Grafana for metrics
  • Performance: Proper heap/cache sizing, GC tuning
  • Security: Authentication, RBAC, SSL/TLS
Production-Ready: HA, monitored, secured, backed up, performant!

What’s Next?

Module 8: Capstone Project - Knowledge Graph Platform

Build a complete knowledge graph application with recommendations, search, and analytics