Capstone Project: Building a Real-Time Analytics Platform
Project Duration: 20-30 hours
Learning Style: Hands-On Implementation + Architecture Design + Production Deployment
Outcome: A complete, production-ready Cassandra application demonstrating mastery of all concepts
Project Overview
You will design and implement SensorMetrics, a real-time IoT analytics platform that:
- Ingests tens of thousands of sensor readings per second (50,000 writes/sec at peak) from 100,000+ devices
- Stores time-series data with automatic TTL-based expiration
- Provides real-time dashboards with low-latency queries (< 10ms p95)
- Supports multi-datacenter deployment for geographic distribution
- Handles device failures gracefully (late data, out-of-order events)
- Operates 24/7 with 99.9% uptime and automated recovery
The project integrates every major concept from the course:
- Architecture (consistent hashing, replication, vnodes)
- Data Modeling (partition keys, clustering columns, time-series patterns)
- Read/Write Internals (CommitLog, MemTable, SSTable, compaction)
- Cluster Operations (gossip, failure detection, repair, multi-DC)
- Performance (JVM tuning, monitoring, capacity planning)
Part 1: Requirements Analysis
Functional Requirements
1. Data Ingestion:
- 100,000 sensors sending metrics every 10 seconds
- Throughput: 10,000 writes/second sustained, 50,000 writes/second peak
- Metrics: temperature, humidity, pressure, battery level
- Late arrivals: data delayed by up to 5 minutes must be accepted
- Deduplication: the same reading must not be stored twice
2. Data Retention:
- Hot data: last 7 days (low-latency queries)
- Warm data: 8-30 days (moderate latency acceptable)
- Cold data: 31-365 days (high latency acceptable, compressed)
- Ancient data: > 365 days deleted automatically (TTL)
3. Query Patterns:
- Query Pattern 1 (latest raw readings for a sensor): very high frequency (1000s/sec), latency < 10ms p95
- Query Pattern 2 (daily aggregates for a sensor): moderate frequency (100s/sec), latency < 50ms p95
- Query Pattern 3 (all sensors in a building): low frequency (10s/sec), latency < 100ms p95
- Query Pattern 4 (anomaly events, consumed by the monitoring system): low frequency, latency < 1s
Non-Functional Requirements
Availability:
- 99.9% uptime (< 9 hours downtime/year)
- No single point of failure
- Graceful degradation during node failures
Scalability:
- Horizontal scaling to 1M+ sensors
- Linear performance scaling with nodes
- Automatic rebalancing on node addition
Performance:
- Write latency: < 5ms p95
- Read latency: < 10ms p95 (hot data)
- Throughput: 50,000 writes/sec peak
Durability:
- No data loss for acknowledged writes
- Automatic replication (RF=3)
- Regular backups with 7-day retention
Geographic Distribution:
- Multi-datacenter deployment (US-East, EU-West)
- Local reads/writes (< 50ms latency)
- Eventual consistency across DCs (< 10 seconds)
Part 2: Architecture Design
High-Level Architecture
Cassandra Cluster Design
Cluster Configuration: two datacenters (us-east and eu-west) with NetworkTopologyStrategy and RF=3 in each DC; the keyspace definition appears at the end of Part 3.
Part 3: Data Model Design
Schema Design Process
Step 1: Identify Queries (done in Part 1)
Step 2: Design Tables (one table per query pattern)
Table 1: Raw Sensor Metrics (Query Pattern 1)
Query: the latest N readings for a specific sensor, optionally bounded by a time range. A possible CQL definition is sketched after the rationale below.
Design rationale:
- Partition Key = sensor_id:
  - Queries are always for a specific sensor
  - Ensures even distribution (100K sensors)
  - Partition size: ~200 bytes/reading × 8,640 readings/day ≈ 1.7 MB/day (acceptable)
- Clustering Key = timestamp DESC:
  - Time-series data sorted newest-first
  - Efficient range queries (WHERE timestamp > ?)
  - Descending order suits “latest N readings” queries
- TWCS Compaction:
  - Time-bucketed data (1-day windows)
  - Entire SSTable dropped when its TTL expires (ultra-fast)
  - Minimal compaction overhead
- TTL = 365 days:
  - Automatic deletion (no manual cleanup)
  - Aligns with the retention policy
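A possible CQL definition and matching query, consistent with the rationale above; the column names mirror the metrics listed in Part 1, and the exact identifiers are illustrative assumptions:

```cql
-- Sketch: raw readings table (identifiers are illustrative)
CREATE TABLE sensor_metrics (
    sensor_id     text,
    timestamp     timestamp,
    temperature   double,
    humidity      double,
    pressure      double,
    battery_level double,
    PRIMARY KEY ((sensor_id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC)
  AND default_time_to_live = 31536000        -- 365 days
  AND compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': 1
  };

-- Query Pattern 1: latest N readings for one sensor
SELECT * FROM sensor_metrics WHERE sensor_id = ? LIMIT 100;
```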
Table 2: Daily Aggregates (Query Pattern 2)
Query: daily aggregate values for a sensor over a date range. A possible CQL definition is sketched after the rationale below.
Design rationale:
- Partition Key = sensor_id:
  - Matches the query pattern
  - Small partitions (1 row/day = 365 rows/year)
- Clustering Key = date:
  - Enables efficient range queries
  - Sorted descending (recent aggregates first)
- LCS Compaction:
  - Aggregates are updated daily (read-modify-write)
  - LCS handles updates efficiently
- Longer TTL (3 years):
  - Aggregates are more valuable than raw data
  - Smaller storage footprint (365 rows/sensor/year vs 3M+ raw readings)
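A possible definition; the aggregate column names are illustrative assumptions:

```cql
-- Sketch: daily roll-ups (aggregate columns are illustrative)
CREATE TABLE sensor_daily_aggregates (
    sensor_id     text,
    date          date,
    min_temp      double,
    max_temp      double,
    avg_temp      double,
    reading_count bigint,
    PRIMARY KEY ((sensor_id), date)
) WITH CLUSTERING ORDER BY (date DESC)
  AND default_time_to_live = 94608000        -- ~3 years
  AND compaction = {'class': 'LeveledCompactionStrategy'};

-- Query Pattern 2: aggregates for a sensor over a date range
SELECT * FROM sensor_daily_aggregates
WHERE sensor_id = ? AND date >= ? AND date <= ?;
```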
Table 3: Sensors by Building (Query Pattern 3)
Query: all sensors in a building, with their latest values. A possible CQL definition is sketched after the rationale below.
Design rationale:
- Partition Key = building_id:
  - The query retrieves all sensors in a building
  - Partition size: ~100 sensors/building × 200 bytes = 20 KB (tiny!)
- Clustering Key = sensor_id:
  - Sorted by sensor_id for easy lookup
  - Enables WHERE building_id = ? AND sensor_id = ? queries
- Denormalization:
  - Stores latest values (redundant with sensor_metrics)
  - Avoids a scatter-gather query across all sensors
  - Updated on every sensor reading (write amplification accepted)
- No TTL:
  - Metadata table (doesn’t expire)
  - Updated in place as new readings arrive
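A possible definition; the latest-value column names are illustrative assumptions:

```cql
-- Sketch: per-building lookup table (latest-value columns are illustrative)
CREATE TABLE sensors_by_building (
    building_id        text,
    sensor_id          text,
    latest_temperature double,
    latest_humidity    double,
    latest_battery     double,
    last_seen          timestamp,
    PRIMARY KEY ((building_id), sensor_id)
);

-- Query Pattern 3: all sensors in a building, or one specific sensor
SELECT * FROM sensors_by_building WHERE building_id = ?;
SELECT * FROM sensors_by_building WHERE building_id = ? AND sensor_id = ?;
```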
Table 4: Anomaly Events (Query Pattern 4)
Query: recent anomaly events for the monitoring system. A possible CQL definition is sketched after the rationale below.
Design rationale:
- Composite Clustering Key:
  - event_time for time-series ordering
  - event_type to allow multiple anomalies at the same timestamp
- TWCS with 7-day windows:
  - Anomalies are time-series data
  - Fast TTL-based expiration
- Shorter TTL (90 days):
  - Anomalies are less valuable after resolution
  - Reduces storage
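A possible definition; partitioning by sensor_id and the details column are assumptions not fixed by the requirements:

```cql
-- Sketch: anomaly events (partitioning by sensor_id is an assumption)
CREATE TABLE anomaly_events (
    sensor_id  text,
    event_time timestamp,
    event_type text,
    details    text,
    PRIMARY KEY ((sensor_id), event_time, event_type)
) WITH CLUSTERING ORDER BY (event_time DESC, event_type ASC)
  AND default_time_to_live = 7776000         -- 90 days
  AND compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': 7
  };
```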
Complete Keyspace Definition
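A minimal keyspace consistent with the multi-DC requirement (RF=3 in each of us-east and eu-west); the keyspace name is an illustrative assumption, and the DC labels must match the names reported by your snitch:

```cql
-- Sketch: keyspace spanning both datacenters, RF=3 in each
CREATE KEYSPACE sensor_metrics_ks
WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'eu-west': 3
};
```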
Part 4: Implementation
Phase 1: Ingestion API
Technology: Python FastAPI (async I/O for high throughput)
File: ingestion_api.py
Key design decisions:
- Batching: the client sends 100-1000 readings per request (reduces network overhead)
- Prepared Statements: pre-compiled queries (10x performance improvement)
- Deduplication: a Bloom filter prevents duplicate writes
- Validation: Pydantic models ensure data quality
- LOCAL_QUORUM: fast writes to the local DC, asynchronous replication to the remote DC
- Denormalization: sensors_by_building is updated in the same batch (eventual consistency)
A sketch of the service follows.
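A minimal sketch of the ingestion service, assuming the illustrative schema above and the DataStax Python driver; contact points, keyspace name, and the endpoint path are assumptions, and deduplication is omitted for brevity:

```python
# Sketch of ingestion_api.py: FastAPI + DataStax Python driver.
from datetime import datetime
from typing import List

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
cluster = Cluster(["10.0.0.1", "10.0.0.2"])     # assumption: contact points
session = cluster.connect("sensor_metrics_ks")  # assumption: keyspace name

# Prepared statements are compiled once and reused for every request.
insert_reading = session.prepare("""
    INSERT INTO sensor_metrics
        (sensor_id, timestamp, temperature, humidity, pressure, battery_level)
    VALUES (?, ?, ?, ?, ?, ?)
""")
update_latest = session.prepare("""
    UPDATE sensors_by_building
    SET latest_temperature = ?, latest_humidity = ?, last_seen = ?
    WHERE building_id = ? AND sensor_id = ?
""")


class Reading(BaseModel):
    sensor_id: str
    building_id: str
    timestamp: datetime
    temperature: float
    humidity: float
    pressure: float
    battery_level: float


@app.post("/readings")
def ingest(readings: List[Reading]):
    """Accept a batch of 100-1000 readings per request."""
    # UNLOGGED batch: the rows hit different partitions and need no atomicity.
    batch = BatchStatement(batch_type=BatchType.UNLOGGED,
                           consistency_level=ConsistencyLevel.LOCAL_QUORUM)
    for r in readings:
        batch.add(insert_reading, (r.sensor_id, r.timestamp, r.temperature,
                                   r.humidity, r.pressure, r.battery_level))
        batch.add(update_latest, (r.temperature, r.humidity, r.timestamp,
                                  r.building_id, r.sensor_id))
    session.execute(batch)
    return {"accepted": len(readings)}
```

Note that very large multi-partition unlogged batches can trip Cassandra's batch size warnings; in practice the request may be split into smaller batches or concurrent asynchronous executes.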
Phase 2: Background Aggregation
Technology: Python (scheduled job, or Spark at larger scale)
File: daily_aggregator.py
A sketch of the job follows.
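A minimal sketch of the nightly roll-up, assuming the illustrative schema above; contact point and keyspace name are assumptions, and scheduling (cron, Airflow, or Spark) is left to the orchestrator:

```python
# Sketch of daily_aggregator.py: roll up yesterday's readings per sensor.
from datetime import date, datetime, time, timedelta

from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.1"])                 # assumption: contact point
session = cluster.connect("sensor_metrics_ks")  # assumption: keyspace name

select_day = session.prepare(
    "SELECT temperature FROM sensor_metrics "
    "WHERE sensor_id = ? AND timestamp >= ? AND timestamp < ?")
upsert_agg = session.prepare(
    "INSERT INTO sensor_daily_aggregates "
    "(sensor_id, date, min_temp, max_temp, avg_temp, reading_count) "
    "VALUES (?, ?, ?, ?, ?, ?)")


def aggregate_sensor(sensor_id: str, day: date) -> None:
    # Timestamps are treated as UTC here.
    start = datetime.combine(day, time.min)
    end = start + timedelta(days=1)
    temps = [row.temperature
             for row in session.execute(select_day, (sensor_id, start, end))
             if row.temperature is not None]
    if not temps:
        return
    session.execute(upsert_agg, (sensor_id, day, min(temps), max(temps),
                                 sum(temps) / len(temps), len(temps)))


if __name__ == "__main__":
    yesterday = date.today() - timedelta(days=1)
    # The sensor list is taken from the metadata table (full scan, run nightly).
    for row in session.execute("SELECT building_id, sensor_id FROM sensors_by_building"):
        aggregate_sensor(row.sensor_id, yesterday)
```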
Phase 3: Dashboard API
File: dashboard_api.py
A sketch of the read endpoints follows.
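A minimal sketch of the read side, assuming the illustrative schema above; contact point, keyspace name, and endpoint paths are assumptions, and caching is omitted for brevity:

```python
# Sketch of dashboard_api.py: read endpoints for Query Patterns 1 and 3.
from cassandra.cluster import Cluster
from fastapi import FastAPI

app = FastAPI()
cluster = Cluster(["10.0.0.1"])                 # assumption: contact point
session = cluster.connect("sensor_metrics_ks")  # assumption: keyspace name

latest_readings = session.prepare(
    "SELECT * FROM sensor_metrics WHERE sensor_id = ? LIMIT ?")
building_sensors = session.prepare(
    "SELECT * FROM sensors_by_building WHERE building_id = ?")


@app.get("/sensors/{sensor_id}/latest")
def get_latest(sensor_id: str, limit: int = 100):
    # Single-partition read, sorted newest-first by the clustering order.
    rows = session.execute(latest_readings, (sensor_id, limit))
    return [row._asdict() for row in rows]


@app.get("/buildings/{building_id}/sensors")
def get_building(building_id: str):
    # Single small partition holding every sensor in the building.
    rows = session.execute(building_sensors, (building_id,))
    return [row._asdict() for row in rows]
```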
Part 5: Deployment and Operations
Cassandra Configuration
cassandra.yaml (production settings): representative values are sketched below.
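A sketch of production-leaning settings (Cassandra 3.x/4.0 key names); concrete values depend on hardware and workload and must be validated per environment:

```yaml
# Sketch: representative cassandra.yaml values (not a complete file)
cluster_name: 'SensorMetrics'
num_tokens: 16                          # vnodes per node
endpoint_snitch: GossipingPropertyFileSnitch
hinted_handoff_enabled: true
commitlog_sync: periodic
commitlog_sync_period_in_ms: 10000
concurrent_reads: 64                    # assumption: sized for SSD-backed nodes
concurrent_writes: 128
compaction_throughput_mb_per_sec: 64
```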
Monitoring Setup
Prometheus Configuration (prometheus.yml): a minimal scrape configuration is sketched below.
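A minimal sketch; the hostnames and the exporter port assume a JMX/metrics exporter agent running on each Cassandra node:

```yaml
# Sketch: prometheus.yml scrape configuration
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'cassandra'
    static_configs:
      - targets: ['cassandra-node1:7070', 'cassandra-node2:7070']  # assumption: exporter port
  - job_name: 'ingestion_api'
    static_configs:
      - targets: ['api-host:8000']      # assumption: FastAPI metrics endpoint
```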
Backup Strategy
Automated Snapshot Script (backup.sh): a sketch follows.
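A sketch of a daily snapshot script with 7-day retention; the paths and keyspace name are assumptions:

```bash
#!/usr/bin/env bash
# Sketch of backup.sh: daily snapshot with 7-day retention.
set -euo pipefail

KEYSPACE="sensor_metrics_ks"            # assumption: keyspace name
TAG="daily_$(date +%F)"
ARCHIVE_DIR="/backups/cassandra"        # assumption: archive location

# Flush memtables, then take a point-in-time snapshot of the keyspace.
nodetool flush "$KEYSPACE"
nodetool snapshot -t "$TAG" "$KEYSPACE"

# Archive the snapshot directories, then clear the on-disk snapshot.
mkdir -p "$ARCHIVE_DIR"
tar czf "$ARCHIVE_DIR/${TAG}.tar.gz" \
    /var/lib/cassandra/data/"$KEYSPACE"/*/snapshots/"$TAG"
nodetool clearsnapshot -t "$TAG" "$KEYSPACE"

# Enforce 7-day retention on archived snapshots.
find "$ARCHIVE_DIR" -name 'daily_*.tar.gz' -mtime +7 -delete
```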
Repair Schedule
Automated Repair with Cassandra Reaper: schedule weekly, per-DC repairs so every token range is repaired well within gc_grace_seconds (10 days by default).
Part 6: Testing and Validation
Load Testing
cassandra-stress Configuration: a sample invocation is sketched below; record the measured values in the results table that follows.
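A baseline invocation using the built-in stress schema (hostnames are assumptions); exercising the real sensor_metrics tables would require a cassandra-stress user profile YAML:

```bash
# Sketch: baseline write and read stress runs at LOCAL_QUORUM
cassandra-stress write n=1000000 cl=LOCAL_QUORUM \
    -rate threads=200 \
    -node 10.0.0.1,10.0.0.2 \
    -log file=stress-write.log

cassandra-stress read n=500000 cl=LOCAL_QUORUM \
    -rate threads=100 \
    -node 10.0.0.1,10.0.0.2 \
    -log file=stress-read.log
```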
| Metric | Target | Actual (to be filled) |
|---|---|---|
| Write throughput | 10,000 ops/sec | _____ |
| Write latency p95 | < 5ms | _____ |
| Read throughput | 5,000 ops/sec | _____ |
| Read latency p95 | < 10ms | _____ |
| Error rate | 0% | _____ |
Failure Testing
Test 1: Node Failure
Stop one node (for example, systemctl stop cassandra), verify that reads and writes at LOCAL_QUORUM continue to succeed with RF=3, then restart the node and confirm hinted handoff brings it back up to date.
Part 7: Advanced Challenges (Optional)
Challenge 1: Hot Partition Mitigation
Problem: one sensor (sensor-99999) sends 10x more data than the others, creating a hot partition.
Task:
- Detect the hot partition (use nodetool toppartitions)
- Redesign the data model to shard the hot partition
- Modify the ingestion API to compute a shard: shard = hash(timestamp) % 10
- Modify the dashboard API to query all shards and merge the results
A sketch of the sharding logic follows.
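A minimal sketch, assuming sensor_metrics gains a shard column in its partition key, i.e. PRIMARY KEY ((sensor_id, shard), timestamp), and a prepared SELECT that binds (sensor_id, shard, limit):

```python
# Sketch: shard a hot sensor's partition into 10 sub-partitions.
NUM_SHARDS = 10


def shard_for(timestamp_ms: int) -> int:
    # Deterministic shard from the reading's timestamp, as described above.
    # hash() is stable for ints; avoid hashing strings here (per-process
    # hash randomization would make the shard non-deterministic).
    return hash(timestamp_ms) % NUM_SHARDS


def read_latest_across_shards(session, stmt, sensor_id: str, limit: int):
    """Fan out to every shard, merge, and return the newest `limit` rows.

    stmt is assumed to be a prepared statement like:
      SELECT * FROM sensor_metrics WHERE sensor_id = ? AND shard = ? LIMIT ?
    """
    rows = []
    for shard in range(NUM_SHARDS):
        rows.extend(session.execute(stmt, (sensor_id, shard, limit)))
    rows.sort(key=lambda r: r.timestamp, reverse=True)
    return rows[:limit]
```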
Challenge 2: Cross-DC Latency Optimization
Problem: reads from eu-west against us-east data take ~150ms (cross-Atlantic latency).
Task:
- Implement read-from-local-DC-first logic (a sketch follows this list)
- Measure latency improvement
- Consider trade-offs (stale data vs latency)
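A minimal sketch using the DataStax Python driver's DC-aware load balancing; the local DC name and contact point are assumptions:

```python
# Sketch: prefer the local datacenter (eu-west) for reads.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster, ExecutionProfile, EXEC_PROFILE_DEFAULT
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

profile = ExecutionProfile(
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="eu-west")),   # assumption: DC name
    consistency_level=ConsistencyLevel.LOCAL_ONE,       # never wait on the remote DC
)

cluster = Cluster(["10.1.0.1"],                          # assumption: eu-west contact point
                  execution_profiles={EXEC_PROFILE_DEFAULT: profile})
session = cluster.connect("sensor_metrics_ks")
```

The trade-off matches the task above: LOCAL_ONE reads in eu-west may return slightly stale data until cross-DC replication catches up.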
Challenge 3: Materialized Views (Advanced)
Task: create a materialized view for “sensors by latest temperature”; a possible definition is sketched below.
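A possible definition, reusing the illustrative sensors_by_building columns from Part 3:

```cql
-- Sketch: order each building's sensors by their latest temperature
CREATE MATERIALIZED VIEW sensors_by_temperature AS
    SELECT building_id, latest_temperature, sensor_id, last_seen
    FROM sensors_by_building
    WHERE building_id IS NOT NULL
      AND latest_temperature IS NOT NULL
      AND sensor_id IS NOT NULL
    PRIMARY KEY ((building_id), latest_temperature, sensor_id)
    WITH CLUSTERING ORDER BY (latest_temperature DESC);
```

Keep in mind that materialized views are considered experimental and may be disabled by default in recent Cassandra releases, so weigh this against maintaining an extra denormalized table yourself.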
Part 8: Deliverables and Evaluation
Project Deliverables
1. Architecture Diagram (draw.io or similar)
   - Cassandra cluster topology
   - Application components
   - Data flow
2. Data Model Documentation
   - CQL schema definitions
   - Query pattern → table mapping
   - Design decision rationale
3. Implementation Code
   - Ingestion API (ingestion_api.py)
   - Dashboard API (dashboard_api.py)
   - Background jobs (daily_aggregator.py)
4. Configuration Files
   - cassandra.yaml
   - jvm11-server.options
   - prometheus.yml, alerts.yml
5. Operational Runbook
   - Deployment procedure
   - Backup/restore steps
   - Common troubleshooting scenarios
6. Load Test Results
   - cassandra-stress logs
   - Latency percentiles
   - Throughput measurements
7. Presentation (10-15 slides)
   - Problem statement
   - Architecture overview
   - Data model design
   - Performance results
   - Lessons learned
Evaluation Criteria
| Category | Weight | Criteria |
|---|---|---|
| Architecture | 20% | Appropriate use of Cassandra features (replication, consistency, compaction) |
| Data Model | 25% | Partitioning strategy, denormalization, query alignment |
| Implementation | 20% | Code quality, error handling, batching, prepared statements |
| Performance | 20% | Meets latency/throughput targets, efficient queries |
| Operations | 10% | Monitoring, backups, repair schedules, runbook |
| Documentation | 5% | Clarity, completeness, diagrams |
Mastery Checklist
- Data Modeling
  - Query-driven design (1 query = 1 table)
  - Appropriate partition keys (even distribution)
  - Time-series clustering keys
  - Denormalization for performance
  - TTL for automatic expiration
- Write Path
  - Batch inserts (100+ per batch)
  - Prepared statements
  - Appropriate consistency level (LOCAL_QUORUM)
  - Deduplication logic
- Read Path
  - Efficient partition key queries (no ALLOW FILTERING)
  - Caching (Redis or application-level)
  - Pagination for large results
- Compaction
  - TWCS for time-series data
  - LCS for frequently updated data
  - Appropriate window sizes
- Multi-DC
  - NetworkTopologyStrategy (RF per DC)
  - LOCAL consistency levels
  - Per-DC repair
- JVM Tuning
  - Heap ≤ 8GB
  - G1GC with 200ms pause target
  - GC logging enabled
- Monitoring
  - Prometheus + Grafana setup
  - Critical alerts (dropped mutations, GC, node down)
  - Dashboard for key metrics
- Backup/Repair
  - Automated daily snapshots
  - Weekly repair schedule
  - Tested restore procedure
Part 9: Real-World Extensions
Extending the Project
1. Stream Processing Integration:
- Add Kafka for event streaming
- Use the Kafka Connect Cassandra Sink for ingestion
- Implement Kafka Streams for real-time anomaly detection
2. Predictive Maintenance:
- Train an ML model to predict sensor failures (battery level, read patterns)
- Use Spark MLlib or TensorFlow
- Store predictions in Cassandra for the dashboard
3. Multi-Tenancy:
- Support multiple customers (tenant isolation)
- Redesign the schema with tenant_id in the partition key
- Implement tenant-level quotas
4. Advanced Analytics:
- Time-series forecasting (predict future temperature)
- Correlation analysis (temperature vs humidity patterns)
- Spatial queries (sensors within a 1 km radius)
Production Considerations
Security:
- Enable authentication/authorization and TLS encryption for client and inter-node traffic
High Availability:
- Deploy across 3+ availability zones per DC
- Use Kubernetes for the API layer (auto-scaling, self-healing)
- Implement circuit breakers and retries in clients
Cost Optimization:
- Use tiered storage (hot = SSD, cold = HDD)
- Implement a data lifecycle (move old data to S3 for archival)
- Right-size nodes (monitor CPU, memory, and disk usage)
Summary
You’ve designed and implemented a production-ready, scalable IoT analytics platform using Apache Cassandra. This capstone demonstrated:
✅ Data Modeling: query-driven design with proper partitioning
✅ Write Optimization: batching, prepared statements, TWCS compaction
✅ Read Optimization: caching, denormalization, efficient queries
✅ Multi-DC: geographic distribution with LOCAL consistency
✅ Performance Tuning: JVM, OS, and Cassandra configuration
✅ Operations: monitoring, backups, repair, troubleshooting
✅ Production Deployment: the complete stack from ingestion to visualization
Congratulations on completing the Cassandra mastery course! 🎉
You now have the skills to:
- Design schemas for any query pattern
- Operate Cassandra clusters at scale
- Troubleshoot performance issues
- Deploy production-ready systems
What’s Next?
Neo4j Graph Database Course
Learn graph databases, Cypher queries, and property graph modeling
Apache Spark Course
Master distributed data processing with RDDs, DataFrames, and Structured Streaming
Apache Kafka Course
Build real-time streaming pipelines with Kafka, Kafka Streams, and KSQL
Advanced Cassandra Topics
Explore change data capture, materialized views, and advanced operations