Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Batch vs Stream Processing
Lambda Architecture
Kappa Architecture
Data Pipeline Components
Message Queues
Stream Processing
Handling Late Data (Windowing)
Watermarks for Late Data
Late data is the single most underestimated problem in stream processing. In a batch world, you process a complete dataset — no data is “late” because you wait for all of it. In a streaming world, events arrive out of order due to network delays, client clock skew, mobile devices going offline then reconnecting, and retry logic in upstream producers. A robust pipeline must decide: how long do I wait for stragglers before closing a window and emitting results? Watermarks are the mechanism for making this decision systematically. They are essentially the pipeline’s best guess at “what time is it in the event world?”Exactly-Once Processing
Idempotent Consumer Implementation
Stream Processing Implementation
- Python
- JavaScript
ETL vs ELT
Data Quality
Senior Interview Questions
How would you design a real-time analytics dashboard?
How would you design a real-time analytics dashboard?
- What metrics? (pageviews, conversions, revenue)
- How real-time? (seconds vs minutes)
- Scale? (events per second)
- Pre-aggregate: Don’t query raw events for dashboards
- Materialized views: Update incrementally, not full recompute
- Time-series DB: Optimized for time-based queries
How do you handle schema changes in a data pipeline?
How do you handle schema changes in a data pipeline?
-
Backward compatible: New code reads old data
- Add optional fields only
- Don’t remove/rename fields
-
Forward compatible: Old code reads new data
- Ignore unknown fields
-
Schema registry: Centralized schema versioning
- Avro/Protobuf with Confluent Schema Registry
- Validate compatibility on publish
- Deploy new schema (backward compatible)
- Backfill if needed
- Deploy new producers
- Old data still works
How would you design a data lake?
How would you design a data lake?
- Bronze (Raw): Exact copy of source, append-only
- Silver (Cleaned): Deduplicated, validated, typed
- Gold (Business): Aggregated, ready for consumption
- Storage: S3/ADLS with Delta Lake/Iceberg format
- Compute: Spark/Databricks
- Catalog: AWS Glue, Hive Metastore
- Query: Athena, Presto, Trino
- Partition by date for time-series data
- Use columnar formats (Parquet)
- Implement data quality checks at each layer
- Track data lineage
How do you handle backpressure in streaming systems?
How do you handle backpressure in streaming systems?
- Buffering: Kafka naturally buffers in log
- Rate limiting: Limit producer rate
- Sampling: Process subset of events
- Auto-scaling: Add more consumers
- Load shedding: Drop low-priority events
Interview Deep-Dive
You are building a real-time fraud detection pipeline for a payment processor handling 50,000 transactions per second. Walk me through the architecture and explain why you cannot use batch processing here.
You are building a real-time fraud detection pipeline for a payment processor handling 50,000 transactions per second. Walk me through the architecture and explain why you cannot use batch processing here.
- Ingestion: Kafka topic with transactions partitioned by
merchant_id(keeps related transactions together for pattern detection). At 50K TPS with ~1KB per transaction event, that is 50 MB/sec = 432 TB/day of raw event data. - Stream processor: Apache Flink with event-time windowing. Each transaction triggers three parallel checks:
- Rule engine (sub-10ms): Hard-coded rules like “decline if amount exceeds 3x the user’s average transaction in the last 30 days.” This requires a pre-computed user profile in Redis (average transaction amount, transaction count, last transaction time). Redis lookup: ~1ms.
- Velocity check (sub-20ms): Count transactions per card in sliding windows (1 minute, 1 hour, 24 hours). Flink maintains this state in-memory with RocksDB state backend. If a card has 10+ transactions in 1 minute, flag it.
- ML model scoring (sub-50ms): Feature vector (transaction amount, merchant category, time of day, device fingerprint, geo-distance from last transaction) fed into a pre-trained model served via a low-latency inference service (TensorFlow Serving or ONNX Runtime). The model returns a fraud probability score.
- Decision aggregation: Combine all three scores. If any hard rule triggers, decline. If ML score exceeds threshold (0.8), decline. If velocity check + ML score is borderline, route to human review queue.
- Total pipeline latency budget: Kafka consume (5ms) + parallel checks (50ms worst case) + decision logic (2ms) + Kafka produce for downstream (5ms) = ~62ms end-to-end. Well within the 500ms gateway timeout.
Your company has both a Lambda architecture (batch + speed layer) and is considering migrating to Kappa architecture (stream-only). The batch layer runs a nightly Spark job that takes 4 hours to reprocess all data. What are the trade-offs, and what would you recommend?
Your company has both a Lambda architecture (batch + speed layer) and is considering migrating to Kappa architecture (stream-only). The batch layer runs a nightly Spark job that takes 4 hours to reprocess all data. What are the trade-offs, and what would you recommend?
- Your nightly Spark batch takes 4 hours to process a full day of data. Assume 1 day of data = 10 TB.
- In Kappa, reprocessing means replaying Kafka. Flink can typically process faster than real-time when reading from Kafka (no external I/O latency). If real-time throughput is 50 MB/sec, replay throughput might be 500 MB/sec (10x, limited by state checkpointing and sink writes). 10 TB / 500 MB/sec = 20,000 seconds = ~5.5 hours. Comparable to the 4-hour batch.
- But what if you need to reprocess 30 days of data (a bug was found in the business logic 30 days ago)? That is 300 TB. Replay: 300 TB / 500 MB/sec = 600,000 seconds = ~7 days. With Lambda, you run the corrected Spark job against the data lake in 4 hours (it processes 30 days in parallel, not sequentially).
Estimate the infrastructure cost of a data pipeline that ingests 1 billion events per day, stores raw events for 1 year, and produces real-time aggregations every 5 minutes.
Estimate the infrastructure cost of a data pipeline that ingests 1 billion events per day, stores raw events for 1 year, and produces real-time aggregations every 5 minutes.
- 1B events/day = 11,574 events/sec average, ~35K peak.
- Average event size: 500 bytes. Throughput: 11,574 * 500 = 5.8 MB/sec average, 17.4 MB/sec peak.
- Kafka retention: 7 days for hot replay. 7 * 86,400 * 11,574 * 500 bytes = 3.5 TB.
- Kafka cluster: 3 brokers with replication factor 3. Each broker stores 3.5 TB / 3 * 3 (replication) = 3.5 TB. Use i3.xlarge instances (1 TB NVMe). Need 4 i3.xlarge per broker = 12 instances total. At ~2,700/month.
- 5-minute tumbling windows for aggregations. State size: depends on aggregation dimensions. If aggregating by 1M unique keys with 200 bytes of state each = 200 MB of state. Modest — a 3-node Flink cluster handles this easily. 3 m5.2xlarge at ~820/month.
- 1B events * 500 bytes = 500 GB/day raw. Compressed with Parquet: ~100 GB/day.
- 1 year: 100 GB * 365 = 36.5 TB.
- S3 Standard: 839/month.
- After 90 days, move to S3 Glacier Instant Retrieval: 400/month.
- 5-minute aggregations for 1M keys = 288 aggregation points per key per day. 1M * 288 * 100 bytes = 28.8 GB/day. 1 year = 10.5 TB.
- TimescaleDB on a db.r6g.4xlarge: ~$2,500/month. With compression (10x on time-series data): 1 TB actual storage.
- Kafka to Flink: same VPC, free.
- Flink to S3: $0.00/GB within same region.
- API reads from TimescaleDB: negligible.