# Production Deployment & Operations

> Deploy Neo4j clusters with high availability, configure causal clustering, monitor performance, backup strategies, and production best practices

<Info>
  **Module Duration**: 5-6 hours
  **Learning Style**: Configuration + Monitoring + Troubleshooting
  **Outcome**: Deploy and operate production Neo4j clusters with HA and performance
</Info>

## Neo4j Deployment Options

1. **Single Instance**: Development and testing
2. **Causal Cluster**: Production HA (3+ core servers)
3. **Neo4j Aura**: Managed cloud service
4. **Fabric**: Multi-database federation (sharding)

***

## Part 1: Causal Clustering Architecture

### How It Works

```
        Core Servers (Raft Consensus)
    ┌─────────┬─────────┬─────────┐
    │ Leader  │ Follower│ Follower│
    │ (Writes)│ (Reads) │ (Reads) │
    └────┬────┴────┬────┴────┬────┘
         │         │         │
         └─────────┼─────────┘
                   │
        ┌──────────┴──────────┐
        │                     │
   Read Replica          Read Replica
    (Reads Only)         (Reads Only)
```

**Core Servers**:

* Commit writes through Raft consensus: the elected leader accepts writes and replicates them to the followers
* Minimum 3 servers (tolerate 1 failure)
* Recommended: 5 or 7 (tolerate 2 or 3 failures)
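The failure-tolerance figures above follow from Raft's majority rule: a write commits only when more than half of the core servers acknowledge it. A minimal sketch of that arithmetic (the function names are illustrative and not part of any Neo4j API):

```python
def quorum_size(core_count: int) -> int:
    """Smallest majority of `core_count` core servers."""
    if core_count < 1:
        raise ValueError("core_count must be at least 1")
    return core_count // 2 + 1


def tolerated_failures(core_count: int) -> int:
    """Core servers that can fail while a majority is still available."""
    return core_count - quorum_size(core_count)


for n in (3, 4, 5, 7):
    print(n, quorum_size(n), tolerated_failures(n))
```

A 4-server cluster tolerates the same single failure as a 3-server cluster, which is why odd cluster sizes are the usual choice.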

**Read Replicas**:

* Asynchronous replication from core servers
* Scale read throughput
* Can be added/removed dynamically

### Configuration

**neo4j.conf** (Core Server, using Neo4j 4.x setting names):

```conf theme={null}
# Cluster settings
dbms.mode=CORE
causal_clustering.minimum_core_cluster_size_at_formation=3
causal_clustering.minimum_core_cluster_size_at_runtime=3

# Discovery (initial servers)
causal_clustering.initial_discovery_members=server1:5000,server2:5000,server3:5000

# Ports
causal_clustering.discovery_listen_address=0.0.0.0:5000
causal_clustering.transaction_listen_address=0.0.0.0:6000
causal_clustering.raft_listen_address=0.0.0.0:7000

# Bolt connector (client connections)
dbms.connector.bolt.enabled=true
dbms.connector.bolt.listen_address=0.0.0.0:7687
```

**neo4j.conf** (Read Replica):

```conf theme={null}
dbms.mode=READ_REPLICA
causal_clustering.initial_discovery_members=server1:5000,server2:5000,server3:5000
```

### Starting a Cluster

```bash theme={null}
# On each server
neo4j start

# Check cluster status
cypher-shell "CALL dbms.cluster.overview()"
```

**Output**:

```
┌──────────┬─────────┬──────────────┬──────────┐
│ id       │ address │ role         │ database │
├──────────┼─────────┼──────────────┼──────────┤
│ server1  │ :7687   │ LEADER       │ neo4j    │
│ server2  │ :7687   │ FOLLOWER     │ neo4j    │
│ server3  │ :7687   │ FOLLOWER     │ neo4j    │
│ replica1 │ :7687   │ READ_REPLICA │ neo4j    │
└──────────┴─────────┴──────────────┴──────────┘
```

***

## Part 2: Routing and Load Balancing

### Bolt Routing

**Driver Configuration** (Python):

```python theme={null}
from neo4j import GraphDatabase

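# The Python driver takes a single address in the URI; after the first
# connection it learns the other cluster members from the routing table.
driver = GraphDatabase.driver(
    "neo4j://server1:7687",  # Any core server can act as the entry point
    auth=("neo4j", "password"),
    max_connection_pool_size=50
)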

# Write transaction (routed to leader)
with driver.session(default_access_mode="WRITE") as session:
    session.run("CREATE (p:Person {name: $name})", name="Alice")

# Read transaction (routed to follower or replica)
with driver.session(default_access_mode="READ") as session:
    result = session.run("MATCH (p:Person) RETURN p.name")
    for record in result:
        print(record["p.name"])
```

**Driver auto-discovers** cluster topology via Bolt routing protocol.
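The Python driver expects one address in the connection URI, so a single unreachable entry point can block the initial routing-table fetch. One way to supply several seed servers is the driver's `resolver` option, sketched below. The host name `cluster.example` and the helper names are assumptions made for illustration only.

```python
# Hypothetical seed list; replace with your own core server addresses.
SEED_ADDRESSES = [("server1", 7687), ("server2", 7687), ("server3", 7687)]


def seed_resolver(address):
    """Ignore the address from the URI and return every seed address."""
    return list(SEED_ADDRESSES)


def make_driver(password):
    # Imported lazily so the resolver can be exercised without the driver installed.
    from neo4j import GraphDatabase

    return GraphDatabase.driver(
        "neo4j://cluster.example:7687",  # placeholder host, replaced by the resolver
        auth=("neo4j", password),
        resolver=seed_resolver,
    )
```

The driver tries the resolved addresses in order until one of them returns a routing table.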

### External Load Balancer

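For additional control, you can place HAProxy in front of the core servers. Keep in mind that a driver using the `neo4j://` scheme fetches the routing table through the load balancer and then connects directly to the cluster members it advertises, so HAProxy mainly provides a stable entry point for the initial connection: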

**haproxy.cfg**:

```conf theme={null}
frontend neo4j_bolt
    bind *:7687
    mode tcp
    default_backend neo4j_core_servers

backend neo4j_core_servers
    mode tcp
    balance roundrobin
    option tcp-check
    server core1 server1:7687 check
    server core2 server2:7687 check
    server core3 server3:7687 check
```

**Client connects to HAProxy**:

```python theme={null}
driver = GraphDatabase.driver("neo4j://haproxy:7687", auth=("neo4j", "password"))
```

***

## Part 3: Backup and Recovery

### Online Backup

**Full Backup**:

```bash theme={null}
neo4j-admin backup --backup-dir=/backups --database=neo4j
```

**Incremental Backup**:

```bash theme={null}
# First backup (full)
neo4j-admin backup --backup-dir=/backups --database=neo4j

# Subsequent runs against the same --backup-dir are incremental (faster)
neo4j-admin backup --backup-dir=/backups --database=neo4j
```

**Automated Backups** (cron):

```bash theme={null}
# Daily backup at 2 AM (a new dated directory each day means every run is a full backup)
0 2 * * * /usr/bin/neo4j-admin backup --backup-dir=/backups/$(date +\%Y-\%m-\%d) --database=neo4j
```

### Restore from Backup

```bash theme={null}
# Stop Neo4j
neo4j stop

# Restore backup (the backup lives in a subdirectory named after the database)
neo4j-admin restore --from=/backups/2024-01-15/neo4j --database=neo4j --force

# Start Neo4j
neo4j start
```

### Disaster Recovery Strategy

**3-2-1 Rule**:

* **3** copies of data (production + 2 backups)
* **2** different storage types (local + cloud)
* **1** off-site backup (S3, Azure Blob, etc.)

**Example Script**:

```bash theme={null}
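#!/bin/bash
set -euo pipefail  # Stop before pruning if the backup or upload fails

TODAY="$(date +%Y-%m-%d)"
BACKUP_DIR="/backups/${TODAY}"
mkdir -p "$BACKUP_DIR"
neo4j-admin backup --backup-dir="$BACKUP_DIR" --database=neo4j

# Upload to S3
aws s3 sync "$BACKUP_DIR" "s3://my-neo4j-backups/${TODAY}"

# Delete dated backup directories older than 30 days (top level only)
find /backups -mindepth 1 -maxdepth 1 -type d -mtime +30 -exec rm -rf {} +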
```
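The `find -mtime` pruning step depends on directory modification times, which can change when files inside a backup are touched. As a hedged alternative, retention can be decided from the date in each directory name. This sketch assumes the `YYYY-MM-DD` naming used by the script; the function name and sample values are hypothetical.

```python
from datetime import date, timedelta


def expired_backups(dir_names, today, keep_days=30):
    """Return date-named backup directories older than `keep_days`."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in dir_names:
        try:
            backup_date = date.fromisoformat(name)
        except ValueError:
            continue  # skip anything that is not named YYYY-MM-DD
        if backup_date < cutoff:
            expired.append(name)
    return sorted(expired)


names = ["2024-01-01", "2024-02-10", "2024-02-14", "lost+found"]
print(expired_backups(names, today=date(2024, 2, 15)))  # ['2024-01-01']
```

Returning the list instead of deleting directly makes it easy to log or review what would be removed before anything is deleted.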

***

## Part 4: Monitoring

### Built-In Metrics (JMX)

**Enable JMX**:

```conf theme={null}
# neo4j.conf
# Remote JMX without authentication or TLS: use only on a trusted, firewalled network
dbms.jvm.additional=-Dcom.sun.management.jmxremote.port=3637
dbms.jvm.additional=-Dcom.sun.management.jmxremote.authenticate=false
dbms.jvm.additional=-Dcom.sun.management.jmxremote.ssl=false
```

**Query JMX Metrics**:

```cypher theme={null}
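// Each attribute is a map with `description` and `value` keys.
// Bean names vary by Neo4j version; CALL dbms.queryJmx("*:*") lists what is available.
CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Transactions")
YIELD attributes
RETURN attributes.NumberOfOpenTransactions.value AS open_transactions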
```

### Prometheus + Grafana

**1. Check the edition**: Neo4j Enterprise Edition ships a built-in Prometheus endpoint, so no separate exporter plugin has to be installed.

**2. Configure** (neo4j.conf):

```conf theme={null}
metrics.prometheus.enabled=true
# Bind to an address Prometheus can reach; localhost only accepts local scrapes
metrics.prometheus.endpoint=0.0.0.0:2004
```

**3. Configure Prometheus** (prometheus.yml):

```yaml theme={null}
scrape_configs:
  - job_name: 'neo4j'
    static_configs:
      - targets: ['server1:2004', 'server2:2004', 'server3:2004']
```

**4. Import Grafana Dashboard**: Dashboard ID 14331

### Key Metrics to Monitor

| Metric                   | Description            | Alert Threshold |
| ------------------------ | ---------------------- | --------------- |
| **Transactions/sec**     | Write throughput       | -               |
| **Open transactions**    | Active transactions    | > 100 (leaks?)  |
| **Page cache hit ratio** | Cache efficiency       | \< 90%          |
| **GC pause time**        | JVM garbage collection | > 1s            |
| **Store size**           | Disk usage             | > 80% of disk   |
| **Cluster lag**          | Replication delay      | > 10s           |
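The thresholds in the table can be expressed as a small check that runs against each metrics sample. This is a sketch only: the metric key names are hypothetical and need to be mapped to whatever your monitoring stack exports.

```python
# Thresholds copied from the table; adjust them to your workload.
THRESHOLDS = {
    "open_transactions": ("above", 100),
    "page_cache_hit_ratio": ("below", 0.90),
    "gc_pause_seconds": ("above", 1.0),
    "disk_used_ratio": ("above", 0.80),
    "cluster_lag_seconds": ("above", 10.0),
}


def breached(metrics):
    """Return the names of metrics that cross their alert threshold."""
    alerts = []
    for name, (direction, limit) in THRESHOLDS.items():
        if name not in metrics:
            continue  # metric not collected in this sample
        value = metrics[name]
        if (direction == "above" and value > limit) or (
            direction == "below" and value < limit
        ):
            alerts.append(name)
    return alerts


sample = {"open_transactions": 12, "page_cache_hit_ratio": 0.87, "gc_pause_seconds": 0.2}
print(breached(sample))  # ['page_cache_hit_ratio']
```

In practice the same rules would normally live in Prometheus alerting rules; the Python form just makes the comparison logic explicit.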

***

## Part 5: Performance Tuning

### JVM Configuration

**Heap Size** (neo4j.conf):

```conf theme={null}
# Set heap to 8-16GB (or 25% of RAM)
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
```

**GC Settings** (jvm.additional):

```conf theme={null}
dbms.jvm.additional=-XX:+UseG1GC
dbms.jvm.additional=-XX:MaxGCPauseMillis=200
# Size ParallelGCThreads to the CPU cores available to the JVM (16 is an example)
dbms.jvm.additional=-XX:ParallelGCThreads=16
```

### Page Cache

**Size Recommendation**: 50-75% of remaining RAM (after heap)

```conf theme={null}
# neo4j.conf
# For 64GB RAM: 8GB heap + 32GB page cache (keep comments on their own line)
dbms.memory.pagecache.size=32g
```
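The sizing rule can be checked with a few lines of arithmetic. This sketch applies the 50-75% guideline to the RAM left after the heap; the function name is illustrative.

```python
def page_cache_range_gb(total_ram_gb, heap_gb):
    """Return the (low, high) page cache size in GB: 50-75% of RAM left after heap."""
    remaining = total_ram_gb - heap_gb
    if remaining <= 0:
        raise ValueError("heap must be smaller than total RAM")
    return 0.50 * remaining, 0.75 * remaining


low, high = page_cache_range_gb(total_ram_gb=64, heap_gb=8)
print(f"page cache between {low:.0f}g and {high:.0f}g")  # page cache between 28g and 42g
```

The 32g setting for a 64GB server with an 8GB heap falls inside that 28-42GB range, leaving the rest for the operating system.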

**Monitor Hit Ratio**:

```cypher theme={null}
// Each attribute is a map with `description` and `value` keys.
// Bean and attribute names vary by Neo4j version.
CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Page cache")
YIELD attributes
WITH attributes.hits.value AS hits, attributes.faults.value AS faults
RETURN hits, faults, toFloat(hits) / (hits + faults) AS hit_ratio
```

**Target**: > 95% hit ratio
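Hit and fault counters accumulate from server start, so a ratio computed from the raw totals can hide a recent drop. A hedged sketch that computes the ratio for the interval between two samples instead (the sample numbers are made up):

```python
def interval_hit_ratio(previous, current):
    """Hit ratio for the interval between two (hits, faults) samples."""
    hits = current[0] - previous[0]
    faults = current[1] - previous[1]
    total = hits + faults
    if total <= 0:
        return None  # no page accesses in the interval, or counters were reset
    return hits / total


ratio = interval_hit_ratio(previous=(1_000, 100), current=(10_700, 400))
print(ratio, ratio is not None and ratio > 0.95)
```

Returning `None` for an empty interval avoids a division by zero and lets the caller skip the alert for that sample.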

### Transaction Log

**Configuration**:

```conf theme={null}
# Keep 2 days of logs
dbms.tx_log.rotation.retention_policy=2 days
# Rotate at 256MB
dbms.tx_log.rotation.size=256M
```

***

## Part 6: Security

### Authentication

**Enable authentication**:

```conf theme={null}
dbms.security.auth_enabled=true
```

**Create users**:

```cypher theme={null}
// Create admin user
CREATE USER admin SET PASSWORD 'secure_password' CHANGE NOT REQUIRED;
GRANT ROLE admin TO admin;

// Create read-only user
CREATE USER analyst SET PASSWORD 'password' CHANGE NOT REQUIRED;
GRANT ROLE reader TO analyst;
```

### Authorization (RBAC)

**Create custom role**:

```cypher theme={null}
CREATE ROLE data_scientist;

// Grant read access to neo4j database
GRANT MATCH {*} ON GRAPH neo4j TO data_scientist;

// Grant algorithm execution
GRANT EXECUTE PROCEDURE gds.* ON DBMS TO data_scientist;

// Assign role to an existing user (alice must already have been created)
GRANT ROLE data_scientist TO alice;
```

### Encryption

**SSL/TLS for Bolt**:

```conf theme={null}
dbms.connector.bolt.tls_level=REQUIRED
dbms.ssl.policy.bolt.enabled=true
dbms.ssl.policy.bolt.base_directory=certificates/bolt
dbms.ssl.policy.bolt.private_key=private.key
dbms.ssl.policy.bolt.public_certificate=public.crt
```

**Generate a self-signed certificate** (for testing; use CA-issued certificates in production):

```bash theme={null}
openssl req -newkey rsa:2048 -nodes -keyout private.key -x509 -days 365 -out public.crt
```

***

## Part 7: Troubleshooting

### Issue 1: Slow Queries

**Diagnosis**:

```cypher theme={null}
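// Enable query logging in neo4j.conf (these two lines are configuration, not Cypher):
//   dbms.logs.query.enabled=true
//   dbms.logs.query.threshold=1s    (log queries slower than 1 second)

// List currently running queries that have exceeded 1 second
CALL dbms.listQueries()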
YIELD query, elapsedTimeMillis
WHERE elapsedTimeMillis > 1000
RETURN query, elapsedTimeMillis
ORDER BY elapsedTimeMillis DESC
```

**Solutions**:

1. Add indexes
2. Use PROFILE to find bottlenecks
3. Rewrite query (filter early, limit results)
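Once query logging is enabled, the log can be filtered for the slowest entries before deciding where to add indexes or run PROFILE. The sketch below assumes each log line contains `INFO <elapsed> ms:`; that line shape and the sample lines are assumptions, so check them against your own `query.log`.

```python
import re

# Assumed line shape: "<timestamp> INFO  <elapsed> ms: <details>"
ELAPSED = re.compile(r"\bINFO\s+(\d+) ms:")


def slow_entries(lines, threshold_ms=1000):
    """Return (elapsed_ms, line) pairs above the threshold, slowest first."""
    found = []
    for line in lines:
        match = ELAPSED.search(line)
        if match and int(match.group(1)) > threshold_ms:
            found.append((int(match.group(1)), line.rstrip()))
    return sorted(found, key=lambda item: item[0], reverse=True)


# Made-up sample lines for illustration only.
sample = [
    "2024-01-15 10:00:00.000+0000 INFO  1520 ms: bolt-session - MATCH (p:Person) RETURN p",
    "2024-01-15 10:00:01.000+0000 INFO  12 ms: bolt-session - RETURN 1",
    "2024-01-15 10:00:02.000+0000 INFO  4300 ms: bolt-session - MATCH (a)-[*]-(b) RETURN count(*)",
]
for elapsed, _line in slow_entries(sample):
    print(elapsed)
```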

### Issue 2: High Memory Usage

**Diagnosis**:

```cypher theme={null}
CALL dbms.queryJmx("java.lang:type=Memory")
YIELD attributes
RETURN attributes.HeapMemoryUsage.value
```

**Solutions**:

1. Increase heap size (dbms.memory.heap.max\_size)
2. Review the page cache size (dbms.memory.pagecache.size) so that heap plus page cache still leaves memory for the operating system
3. Check for transaction leaks (unclosed transactions)

### Issue 3: Cluster Split-Brain

**Symptom**: Multiple leaders elected

**Diagnosis**:

```cypher theme={null}
CALL dbms.cluster.overview()
YIELD role
WITH role, count(*) AS count
WHERE role = 'LEADER'
RETURN count  // Should be 1!
```

**Prevention**:

* Use odd number of core servers (3, 5, 7)
* Ensure network stability
* Configure causal\_clustering.minimum\_core\_cluster\_size\_at\_runtime correctly
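The leader check can also run from a monitoring script. This sketch assumes the overview rows have already been fetched as dictionaries with `role` and `database` keys, matching the columns shown in the cluster status output earlier; the function names are illustrative.

```python
from collections import Counter


def leaders_per_database(rows):
    """Count LEADER rows per database in a cluster overview result."""
    counts = Counter()
    for row in rows:
        if row["role"] == "LEADER":
            counts[row["database"]] += 1
    return dict(counts)


def databases_with_leader_problems(rows):
    """Databases whose leader count is not exactly one."""
    databases = {row["database"] for row in rows}
    counts = leaders_per_database(rows)
    return sorted(db for db in databases if counts.get(db, 0) != 1)


rows = [
    {"id": "server1", "role": "LEADER", "database": "neo4j"},
    {"id": "server2", "role": "FOLLOWER", "database": "neo4j"},
    {"id": "server3", "role": "FOLLOWER", "database": "neo4j"},
]
print(databases_with_leader_problems(rows))  # []
```

The check flags a database with no leader as well as one with several, since both mean writes are at risk.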

### Issue 4: Replication Lag

**Diagnosis**:

```cypher theme={null}
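// dbms.cluster.overview() does not report lag. Run this on each member and
// compare the results: an ID lower than the leader's means the member is behind.
// Bean and attribute names vary by Neo4j version.
CALL dbms.queryJmx("org.neo4j:instance=kernel#0,name=Transactions")
YIELD attributes
RETURN attributes.LastCommittedTxId.value AS last_committed_tx_id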
```

**Solutions**:

1. Add more read replicas
2. Increase network bandwidth
3. Tune causal\_clustering.catchup\_batch\_size
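Replication lag can be expressed as the number of transactions each member is behind the leader. This sketch assumes the last committed transaction ID of every member has already been collected; the member names and IDs below are made up.

```python
def replication_lag(last_tx_ids, leader_id):
    """Transactions each member is behind the leader, largest lag first."""
    leader_tx = last_tx_ids[leader_id]
    lag = {
        member: leader_tx - tx_id
        for member, tx_id in last_tx_ids.items()
        if member != leader_id
    }
    return sorted(lag.items(), key=lambda item: item[1], reverse=True)


sample = {"server1": 10_500, "server2": 10_498, "server3": 10_500, "replica1": 10_120}
print(replication_lag(sample, leader_id="server1"))
# [('replica1', 380), ('server2', 2), ('server3', 0)]
```

Tracking this value over time shows whether a replica is catching up or falling further behind, which a single reading cannot tell you.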

***

## Part 8: Best Practices Checklist

**Deployment**:

* [ ] Use causal cluster (3+ core servers)
* [ ] Configure read replicas for read scaling
* [ ] Use Bolt routing in drivers
* [ ] External load balancer (HAProxy) for fine control

**Backup**:

* [ ] Automated daily backups
* [ ] Off-site backup to S3/cloud
* [ ] Tested restore procedure
* [ ] 30-day retention

**Monitoring**:

* [ ] Prometheus + Grafana setup
* [ ] Alerts for key metrics (transactions, GC, lag)
* [ ] Query logging enabled (threshold: 1s)

**Performance**:

* [ ] Heap: 8-16GB
* [ ] Page cache: 50-75% of RAM remaining after heap
* [ ] Page cache hit ratio > 95%
* [ ] Indexes on frequently queried properties

**Security**:

* [ ] Authentication enabled
* [ ] RBAC configured (least privilege)
* [ ] SSL/TLS for Bolt
* [ ] Firewall rules (only necessary ports open)

***

## Summary

* **Causal Clustering**: 3+ core servers for HA, read replicas for scale
* **Backup**: Automated daily backups with off-site storage
* **Monitoring**: Prometheus + Grafana for metrics
* **Performance**: Proper heap/cache sizing, GC tuning
* **Security**: Authentication, RBAC, SSL/TLS

**Production-Ready**: HA, monitored, secured, backed up, performant!

***

## What's Next?

<Card title="Module 8: Capstone Project - Knowledge Graph Platform" icon="rocket" href="/distributed-systems-tools/neo4j-capstone">
  Build a complete knowledge graph application with recommendations, search, and analytics
</Card>
