Data Engineering — Pipelines, Platforms, and Production Reality
Data engineering is the unsexy backbone of every data-driven company. Every recommendation engine, every analytics dashboard, every ML model depends on data that arrived on time, in the right format, without duplicates, and without silent corruption. This chapter covers how senior data engineers think about warehouse architecture, pipeline design, stream processing, data quality, and system design — the skills that keep the entire data stack from collapsing under its own weight.
Real-World Stories: Why This Matters
How Uber Rebuilt Their Data Platform After a $100M Data Quality Crisis
How Spotify Processes 600 Billion Events Per Day
How Netflix's Data Mesh Transformed Their Data Culture
Part I — Data Architecture and Modeling
1. Data Warehouse Architecture
1.1 Classical Approaches
- Kimball (Dimensional)
- Inmon (Enterprise)
- Data Vault 2.0
1.2 Modern Cloud Warehouse Architecture
The cloud fundamentally changed data warehouse economics by separating storage and compute.
| Platform | Architecture | Strengths | Pricing Model |
|---|---|---|---|
| Snowflake | Storage/compute separation, multi-cluster warehouses | Auto-scaling, zero-copy cloning, data sharing, time travel | Credit-based (pay per second of compute) |
| BigQuery | Serverless, columnar storage on Dremel engine | Zero infrastructure management, slot-based autoscaling, ML integration (BQML) | On-demand (per TB scanned) or slot-based |
| Redshift | PostgreSQL-based, node clusters | Familiar SQL, tight AWS integration, Redshift Serverless | Provisioned (per node) or serverless (per RPU) |
| Databricks | Spark-based lakehouse, Delta Lake | Unified batch + streaming, ML integration, photon engine | DBU-based (compute units per hour) |
Interview: Your company uses Snowflake and monthly costs have tripled in six months. How do you diagnose and reduce costs?
- Check SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY to identify which warehouses are consuming the most credits.
- Check QUERY_HISTORY to find the most expensive individual queries (sort by CREDITS_USED_CLOUD_SERVICES and EXECUTION_TIME).
- Check for warehouses without auto-suspend configured (they run 24/7 even when idle).
- Enable auto-suspend (5 minutes for interactive, 1 minute for ETL warehouses). A warehouse that is used 8 hours/day but never suspends wastes 66% of its compute budget on the 16 idle hours.
- Right-size warehouses. A common pattern: someone created an XL warehouse for a one-time backfill and left it as the default. Most scheduled queries run on Small or Medium.
- Identify and fix full-table scans. Snowflake’s micro-partition pruning only works if queries filter on the clustering key. A WHERE clause on a non-clustered column scans every micro-partition.
- Implement query tagging so each team/pipeline is attributed to a cost center. You cannot optimize what you cannot measure.
- Move infrequent queries to auto-scaling multi-cluster warehouses with min=0 (scales to zero when idle).
- Implement incremental models in dbt instead of full table rebuilds. A model that rebuilds a 500M row table daily when only 100K rows changed is wasting 99.98% of its compute.
- Consider clustering keys for large tables (>1TB) where queries consistently filter on the same columns.
- Evaluate Snowflake’s resource monitors to set per-warehouse spending alerts and hard limits.
- Consider moving infrequently-accessed historical data to lower-cost storage tiers.
- Negotiate reserved capacity pricing if usage is predictable.
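The query-tagging bullet above can be sketched as a small aggregation over a QUERY_HISTORY export. This is a hedged illustration, not Snowflake code: the field names query_tag and credits_used are assumptions standing in for whatever your export actually contains.

```python
from collections import defaultdict

def attribute_credits(query_history):
    """Sum estimated credit consumption per team from a QUERY_HISTORY
    export. 'query_tag' and 'credits_used' are assumed field names;
    adapt them to your actual export schema."""
    totals = defaultdict(float)
    for row in query_history:
        tag = row.get("query_tag") or "untagged"
        team = tag.split(":", 1)[1] if tag.startswith("team:") else "untagged"
        totals[team] += row.get("credits_used", 0.0)
    return dict(totals)

history = [
    {"query_tag": "team:analytics", "credits_used": 12.5},
    {"query_tag": "team:analytics", "credits_used": 7.5},
    {"query_tag": "team:ml", "credits_used": 30.0},
    {"query_tag": None, "credits_used": 5.0},  # untagged query surfaces as its own bucket
]
per_team = attribute_credits(history)
```

In practice this aggregation would live as a view in the warehouse itself; the point is that untagged spend is surfaced explicitly rather than silently dropped.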
Join QUERY_HISTORY with team tags and compute per-team credit consumption. Expose this in a dashboard that updates daily. Add a weekly Slack alert for teams whose spend increased more than 20% week-over-week.
Follow-up Chain:
- Failure mode: What if a cost optimization breaks a downstream report? Always test changes in a cloned environment (CREATE DATABASE ... CLONE) before applying to production. Roll back by reverting the warehouse size or materialization config — both are metadata changes, not data changes.
- Rollout: Phase cost reductions in priority order — auto-suspend first (zero risk), then right-sizing (low risk, monitor query runtimes for 1 week), then incremental models (medium risk, validate row counts against full rebuild).
- Rollback: Keep the old dbt model YAML with a --full-refresh flag so you can revert any incremental model to a full rebuild instantly if data quality degrades.
- Measurement: Track credits/day per warehouse, p95 query runtime per model, and data freshness SLO adherence. Cost reduction that degrades freshness from 6 AM to 8 AM is not a win.
- Cost: The meta-cost: building the attribution system itself consumes engineering time. Estimate 2-3 sprint days for tagging + dashboard + alerts. ROI is immediate if monthly spend exceeds $20K.
- Security/Governance: Ensure QUERY_HISTORY access is restricted — it exposes query text which may contain PII in WHERE clauses. Grant access only to the FinOps team and data platform leads.
A solid senior answer: “Start with WAREHOUSE_METERING_HISTORY, right-size warehouses, enable auto-suspend, and convert top models to incremental.”
A staff/principal engineer adds: “Before touching anything, I want to understand the cost trajectory — is this a one-time spike from a backfill or a structural trend from organic growth? I would build a cost model that projects spend 3 months out under current growth, then present engineering leadership with a menu of options at different effort/savings trade-offs. I would also evaluate whether the warehouse architecture itself needs restructuring — separate warehouses per workload class (ETL, interactive, reporting) with different scaling policies, rather than optimizing individual queries.”
AI-Assisted Engineering Lens: Cloud Warehouse Cost Optimization
Paste QUERY_HISTORY exports into an LLM and prompt: “Identify the top 10 queries by total credit consumption that could benefit from incremental logic, partition pruning, or caching.” LLMs excel at pattern-matching across hundreds of query plans to surface optimization candidates that a human reviewing one query at a time would miss. The trap: LLM-suggested optimizations must be validated against actual execution plans — a query that “looks” optimizable may already be hitting Snowflake’s result cache. Always benchmark before and after.
Work-Sample Prompt: Snowflake Cost Triage
You have access to the ACCOUNT_USAGE views and the Airflow/dbt project. Walk through your first 2 hours of investigation: what do you query first, what are you looking for, and what are the three most likely root causes based on your experience?
2. Data Modeling at Scale
2.1 Dimensional Modeling
Fact tables store measurements — quantitative data about business events. Dimension tables store the context for those measurements — the who, what, where, when, and why.
| Fact Table Type | Description | Example | Grain |
|---|---|---|---|
| Transactional | One row per event | Each order line item | Finest grain |
| Periodic Snapshot | One row per entity per time period | Daily account balance | Regular intervals |
| Accumulating Snapshot | One row per entity lifecycle | Order (created → shipped → delivered) | Updates as stages complete |
| Factless Fact | Records events with no measurement | Student attendance (present/absent) | Event occurrence |
2.2 Slowly Changing Dimensions (SCDs)
When dimension data changes (a customer moves to a new city, a product changes categories), how do you handle the history?
| SCD Type | Strategy | Use When | Example |
|---|---|---|---|
| Type 0 | Never update | Dimension is immutable | Date dimension, original values |
| Type 1 | Overwrite | History does not matter | Fixing a typo in a customer name |
| Type 2 | Add new row with effective dates | Full history required | Customer address changes (for order attribution) |
| Type 3 | Add “previous” column | Only need current and prior value | Current and previous department |
| Type 4 | Separate history table | Dimension changes frequently | Rapidly changing attributes (loyalty points) |
| Type 6 | Hybrid (1 + 2 + 3) | Need current value on historical rows | Complex reporting requirements |
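A minimal sketch of the Type 2 mechanics from the table: close the current row with an end date and append a new row with open-ended validity. Column names (valid_from, valid_to, is_current) are illustrative conventions, not a fixed standard.

```python
from datetime import date

def scd2_apply(dim_rows, change, effective):
    """Apply one attribute change to a Type 2 dimension: close the
    customer's current row and append a new current row."""
    out = []
    for row in dim_rows:
        if row["customer_id"] == change["customer_id"] and row["is_current"]:
            # Close the current version as of the effective date.
            out.append(dict(row, valid_to=effective, is_current=False))
        else:
            out.append(row)
    out.append({
        "customer_id": change["customer_id"],
        "city": change["city"],
        "valid_from": effective,
        "valid_to": None,       # open-ended: this is the current version
        "is_current": True,
    })
    return out

dim = [{"customer_id": 1, "city": "Austin",
        "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True}]
dim = scd2_apply(dim, {"customer_id": 1, "city": "Denver"}, date(2024, 6, 1))
```

Old orders joined on the validity window still attribute to Austin; new orders attribute to Denver, which is exactly the order-attribution use case in the table.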
AI-Assisted Engineering Lens: Data Modeling
3. Data Lake Architecture
3.1 Medallion Architecture
The medallion (or multi-hop) architecture organizes data into layers of increasing quality: bronze (raw ingested data), silver (cleaned and conformed), and gold (business-level aggregates).
3.2 Table Formats
Table formats bring database-like capabilities (ACID transactions, schema evolution, time travel) to files on object storage.
| Format | Originated By | Key Features | Ecosystem |
|---|---|---|---|
| Delta Lake | Databricks | ACID transactions, time travel, Z-ordering, liquid clustering | Spark, Databricks, Flink (limited) |
| Apache Iceberg | Netflix | Hidden partitioning, partition evolution, vendor-neutral | Spark, Flink, Trino, Snowflake, BigQuery |
| Apache Hudi | Uber | Upserts/deletes optimized, incremental processing, record-level indexing | Spark, Flink, Presto, AWS native |
3.3 Table Format Decision Framework
Choosing a table format is one of the stickiest decisions in a data platform — switching later is a multi-quarter migration. The right answer depends on your workload profile, ecosystem commitments, and operational maturity.
| Decision Factor | Choose Iceberg | Choose Delta Lake | Choose Hudi |
|---|---|---|---|
| Multi-engine reads | You query from Spark, Trino, Flink, AND a warehouse | You are Databricks-first | You tolerate Spark/Flink only |
| Partition evolution | Partitioning strategy will change over time | Partitioning is stable | Partitioning is stable |
| Upsert volume | Moderate (batch merges) | Moderate (MERGE INTO) | Heavy — millions of row-level upserts/day from CDC |
| Vendor lock-in tolerance | Low (Apache Foundation governance) | Medium (Databricks stewards, open-sourced) | Low-Medium (Apache Foundation) |
| Time travel needs | Snapshot-based, cheap | Log-based, configurable retention | Timeline-based, per-record |
| Community momentum (2025) | Highest (every major vendor adopted) | Strong (Databricks ecosystem) | Niche (CDC-heavy shops, AWS-oriented) |
Interview: Your team is choosing between Iceberg and Delta Lake for a new lakehouse. How do you decide?
1. Multi-engine reads. If queries will come from Spark, Trino, Flink, and a warehouse engine, Iceberg’s vendor-neutral support is the safer bet.
2. Partition evolution. Delta Lake fixes the partitioning scheme at table creation with PARTITIONED BY, and changing it means rewriting the table. If your access patterns will evolve — which they always do — Iceberg is more forgiving.
3. Ecosystem commitment. If the company is all-in on Databricks (Unity Catalog, MLflow, Photon), Delta Lake is the native citizen. You get the deepest integration, the best performance optimizations (liquid clustering, predictive I/O), and the smoothest dbt integration. Fighting the ecosystem is expensive.
4. CDC workload profile. If 80% of your writes are row-level upserts from Debezium CDC, Hudi’s record-level indexing is genuinely faster than Iceberg’s copy-on-write or merge-on-read. But Hudi’s ecosystem is narrower — you trade write performance for read flexibility.
My default recommendation for a new project in 2025: Iceberg, unless you are committed to Databricks — then Delta Lake. Hudi only if CDC upserts are your dominant write pattern and you have benchmarked it against Iceberg MoR (merge-on-read).
Red Flag Answer: “We should use Delta Lake because it’s the most popular” — popularity is not a technical argument. Or: “Iceberg because Netflix uses it” — cargo-culting without understanding the trade-offs.
Follow-up: “How would you migrate from Delta Lake to Iceberg if the business decides to move off Databricks?”
Databricks actually anticipated this — Delta Lake UniForm writes Iceberg metadata alongside Delta metadata, so tables can be read by Iceberg-compatible engines without data conversion. For a full migration: (1) Enable UniForm on existing Delta tables. (2) Validate that Iceberg reads produce identical results (row counts, checksums). (3) Switch read engines to Iceberg catalogs. (4) Once all readers are on Iceberg, stop writing Delta metadata. The data files (Parquet) are identical — only the metadata layer changes.
Follow-up Chain:
- Failure mode: What if UniForm metadata divergence causes silent data mismatches?
Run a daily reconciliation job that compares row counts and checksum aggregates between Delta and Iceberg reads during the migration window. Any delta (pun intended) > 0 rows blocks the cutover.
- Rollout: Migrate read-only consumers first (dashboards, ad-hoc analysts). They are lower risk because they do not produce data. Migrate writers last. Use a canary pattern: one non-critical table migrated end-to-end before touching the 500-table production catalog.
- Rollback: Because data files are shared Parquet, rollback means re-pointing the catalog back to Delta metadata. Keep Delta transaction logs for 30 days post-migration as insurance.
- Measurement: Track query performance (latency P50/P95) and data freshness SLOs per table before, during, and after migration. A successful migration means zero SLO regressions.
- Cost: The migration itself is mostly metadata — near-zero storage cost. The real cost is engineering time: expect 1 quarter for a 500-table catalog with testing. Budget for 2 data engineers full-time.
- Security/Governance: Table-level access control must be re-established in the new Iceberg catalog (REST catalog, Nessie, or Polaris). Audit that permission parity exists before cutover — a migration that accidentally opens restricted tables to all users is a compliance incident.
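The reconciliation job described in the failure-mode item can be sketched as an order-insensitive fingerprint comparison. The helper names are hypothetical; in practice the two row sets would come from reading the same table through the Delta path and the Iceberg path.

```python
import hashlib

def table_fingerprint(rows):
    """Order-insensitive fingerprint: (row_count, XOR of per-row
    hashes). Lets two read paths be compared without sorting."""
    acc = 0
    for row in rows:
        digest = hashlib.sha256(repr(sorted(row.items())).encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
    return len(rows), acc

def reconcile(delta_rows, iceberg_rows):
    """True when both read paths agree; any mismatch should block
    the cutover, as described above."""
    return table_fingerprint(delta_rows) == table_fingerprint(iceberg_rows)
```

XOR makes the fingerprint independent of row order, so the two engines need not return rows in the same sequence for the comparison to hold.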
Part II — Pipelines and Processing
4. ETL vs ELT
4.1 The Paradigm Shift
| Aspect | Traditional ETL | Modern ELT |
|---|---|---|
| Transform location | Before loading (external process) | After loading (inside the warehouse) |
| Tools | Informatica, Talend, custom scripts | dbt, Snowflake SQL, BigQuery SQL |
| Compute | Dedicated ETL servers | Warehouse compute (scalable) |
| Agility | Schema changes require pipeline changes | Transform changes are SQL changes |
| Cost model | Fixed infrastructure | Pay-per-query |
| Who transforms | Data engineers | Data engineers AND analytics engineers |
4.2 dbt (Data Build Tool)
dbt is the de facto standard for the “T” in ELT. It treats SQL transformations as software — version-controlled, tested, documented, and deployed through CI/CD. Core concepts:
- Models — SQL SELECT statements that dbt materializes as tables or views
- Sources — Declarations of raw tables that dbt reads from (with freshness checks)
- Tests — Assertions on data (not_null, unique, accepted_values, relationships, custom SQL tests)
- Macros — Jinja-templated SQL functions for reuse
- Materializations — How a model is persisted (view, table, incremental, ephemeral)
Interview: Explain how you would design dbt models for an e-commerce analytics platform
- One model per source table: stg_orders, stg_customers, stg_products, stg_payments
- Responsible for: renaming columns to consistent conventions, casting types, filtering test/internal data, deduplication
- Materialized as views (no storage cost, always fresh)
- Business logic that combines staging models: int_orders_with_payments (join orders with payments), int_customer_order_summary (aggregate orders per customer)
- Materialized as views or ephemeral (no direct querying, only used as inputs to gold models)
- Business-facing models: fct_orders (one row per order with all metrics), dim_customers (SCD Type 2 with customer attributes over time), dim_products
- Materialized as tables with incremental strategy for large fact tables
Tests:
- unique and not_null on every primary key
- accepted_values on enum columns (order_status, payment_method)
- relationships to ensure referential integrity between facts and dimensions
- Custom SQL tests for business rules (e.g., total_amount >= 0, ship_date >= order_date)
(1) Add a lookback window — instead of filtering with WHERE order_date > MAX(order_date), use WHERE order_date > MAX(order_date) - INTERVAL '3 days'. This reprocesses the last 3 days on every run, catching late arrivals at the cost of some recomputation. (2) Use a merge strategy with unique_key — dbt performs an upsert, so late-arriving data updates existing rows rather than creating duplicates.
Follow-up Chain:
- Failure mode: What if an incremental model silently drops rows because the is_incremental() filter excludes late data beyond the lookback window? Add a weekly full-refresh run (dbt run --full-refresh --select fct_orders) and compare row counts against the incremental version. Any divergence > 0.1% triggers an alert and a root-cause investigation.
- Rollout: Start new dbt projects with views for all staging models. Only convert to incremental when a model exceeds 5-minute build time or 100M rows. Premature incrementalization adds complexity for no benefit.
- Rollback: Every incremental model should support --full-refresh as an escape hatch. If an incremental model produces bad data, a full refresh rebuilds from scratch using the same SQL. This must be tested in CI — a model that only works incrementally but fails on full refresh is a ticking time bomb.
- Measurement: Track model build times, row counts per run (incremental delta vs full), and data freshness SLO. A good incremental model processes <1% of total rows per run.
- Cost: Incremental models trade compute cost for complexity cost. A 500M-row table that takes 45 minutes as a full rebuild but 30 seconds as an incremental saves ~$360/day — significant enough to justify the complexity.
- Security/Governance: dbt models that contain business logic (discount calculations, revenue attribution) should have their SQL reviewed by finance/analytics stakeholders, not just engineers. Treat gold-layer models like application code — PR reviews with domain experts.
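The lookback-plus-merge pattern discussed above can be simulated in a few lines. This models the behavior of dbt's incremental merge strategy; it is not dbt code, and the field names are illustrative.

```python
from datetime import date, timedelta

def incremental_merge(target, source, high_water, lookback_days=3):
    """Only source rows newer than (high_water - lookback) are
    processed, and they upsert on order_id so late arrivals update
    existing rows instead of duplicating them."""
    cutoff = high_water - timedelta(days=lookback_days)
    by_id = {r["order_id"]: r for r in target}
    for row in source:
        if row["order_date"] > cutoff:      # the is_incremental() filter
            by_id[row["order_id"]] = row    # merge on unique_key
    return list(by_id.values())

target = [{"order_id": 1, "order_date": date(2024, 6, 8), "amount": 10}]
source = [
    {"order_id": 1, "order_date": date(2024, 6, 8), "amount": 12},  # late correction
    {"order_id": 2, "order_date": date(2024, 6, 10), "amount": 7},  # new order
    {"order_id": 3, "order_date": date(2024, 6, 1), "amount": 9},   # beyond lookback: dropped
]
merged = incremental_merge(target, source, high_water=date(2024, 6, 10))
```

Note the third source row: it falls outside the lookback window and is silently excluded, which is precisely the failure mode the weekly full-refresh comparison is meant to catch.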
A staff/principal engineer adds: “I would include a dbt test step in CI that runs against a PR branch’s models using a subset of production data, so we catch logic errors before they hit production. And I would set up dbt exposures to document which dashboards and ML models consume each mart — so when someone proposes a schema change to fct_orders, we know exactly what breaks.”
AI-Assisted Engineering Lens: dbt Development
Three high-leverage uses: (1) Scaffold staging models — paste DESCRIBE TABLE output and prompt “generate a dbt staging model with standard naming conventions.” The LLM produces a clean SELECT with renamed columns, type casts, and deduplication logic in seconds. (2) Generate test YAML — prompt “generate dbt schema tests for this model” with the model SQL, and the LLM identifies primary key candidates, enum columns for accepted_values, and foreign key relationships. (3) Documentation — prompt “generate a dbt model description and column descriptions for this SQL” to auto-generate the schema.yml documentation that nobody wants to write manually. The trap: LLM-generated incremental logic often gets the is_incremental() filter wrong for edge cases (late-arriving data, deletions). Always hand-verify incremental predicates.
Work-Sample Prompt: dbt Model Review
You are reviewing a PR that adds a fct_daily_revenue model. It is materialized as table (full rebuild), joins 4 staging models, and has no tests defined. The table currently has 2M rows but is growing by 50K rows/day. The dbt project has 200+ models and CI runs take 45 minutes. Write your PR review: what do you approve, what do you request changes on, and what questions do you ask about the business requirements before making your recommendation?
5. Batch Processing
5.1 Apache Spark Internals
Understanding Spark’s execution model is essential for writing performant data pipelines.
1. Driver creates a logical plan — transformations like filter, map, join add nodes to the DAG. Only “actions” (count, write, collect) trigger execution.
2. Catalyst optimizer creates a physical plan.
3. Data is partitioned across executors.
| Problem | Solution | Impact |
|---|---|---|
| Small table joined with large table | Broadcast join — Spark sends the small table to every executor | Eliminates shuffle entirely |
| Too many small files in output | Coalesce before write | Reduces file count, improves downstream reads |
| Skewed join keys (one key has 100x more data) | Salting — add random suffix to skewed key, join, then aggregate | Distributes work evenly |
| Full table scans when only recent data needed | Partition pruning — partition by date, filter early | Reads 1/365th of the data |
| Spark reads columns it does not need | Column pruning with columnar formats (Parquet/ORC) | Reads only needed columns from storage |
| Out of memory errors | Increase spark.sql.shuffle.partitions, use disk-based join strategies | Trades memory for disk I/O |
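The salting row in the table can be made concrete with a small sketch. This is a pure-Python illustration of the idea, not PySpark; the salt count and hot-key set are tuning assumptions.

```python
import random

def salt_key(key, n_salts, hot_keys, rng=random):
    """Fact-side salting: hot keys get a random salt suffix so their
    rows spread across partitions; cold keys keep salt 0."""
    salt = rng.randrange(n_salts) if key in hot_keys else 0
    return f"{key}#{salt}"

def explode_dim(key, n_salts, hot_keys):
    """Dimension-side explode: hot keys are duplicated once per salt
    so every salted fact row still finds its join match."""
    salts = range(n_salts) if key in hot_keys else [0]
    return [f"{key}#{s}" for s in salts]

rng = random.Random(0)  # seeded for reproducibility
fact_keys = {salt_key("A", 4, {"A"}, rng) for _ in range(100)}
```

After the salted join, a final aggregation strips the suffix to recover per-key results; the win is that the hot key's rows no longer all land on one task.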
AI-Assisted Engineering Lens: Spark Development and Optimization
(2) Plan analysis: paste the physical plan (df.explain(True)) and prompt “identify performance issues and suggest optimizations.” LLMs can spot missing predicate pushdown, unnecessary shuffles, and broadcast join opportunities from the physical plan. (3) Error diagnosis: Paste a Spark OOM stack trace and prompt “explain the root cause and suggest fixes with specific Spark configuration changes.” The LLM maps the error pattern to known causes (e.g., skewed partition, insufficient spark.sql.shuffle.partitions, broadcast threshold too high). Caveat: LLMs trained on Spark 2.x code may suggest deprecated APIs. Always verify against your Spark version (3.x defaults differ significantly for AQE, shuffle partitions, and broadcast thresholds).
5.2 Batch Processing: Failure and Recovery
Every batch pipeline will fail. The question is how fast you detect it, how cleanly you recover, and whether recovery introduces data corruption.
| Failure Mode | What Happens | Detection | Recovery Pattern |
|---|---|---|---|
| Task OOM (Out of Memory) | Executor dies mid-partition, partial output may exist | Spark UI shows failed stages, executor lost events | Increase memory OR repartition to reduce per-task data. Delete partial output before retry. |
| Data skew blows up one task | 99 tasks finish in 2 minutes, 1 task runs for 3 hours | Spark UI task duration histogram, one bar 100x taller | Salt the skewed key, enable AQE skew handling (spark.sql.adaptive.skewJoin.enabled) |
| Upstream table schema changed | Read fails on column mismatch, or worse — silently reads NULLs | Schema validation at read time, dbt source tests | Pin schemas explicitly. Add schema-change detection in bronze layer. Never assume upstream stability. |
| Source data arrived late or incomplete | Pipeline runs on time but processes partial input | Row count checks vs expected volume, freshness sensors | Implement a “gate” that blocks the pipeline until source readiness is confirmed (Airflow ExternalTaskSensor, dbt source freshness) |
| Cluster preemption (spot instances) | Cloud provider reclaims spot nodes mid-job | Spark task failures with “executor lost” pattern | Use checkpointing for long jobs. Mix on-demand and spot nodes. Design jobs to be restartable from the last successful stage. |
| Output corruption from partial write | Pipeline crashes after writing 50% of partitions | Row count reconciliation, partition-level checksums | Write to a staging location, then atomically swap (ALTER TABLE SWAP, RENAME). Never write directly to the production table. |
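The staging-then-atomic-swap row translates to files like this: a sketch using os.replace as the filesystem analogue of writing to a staging table and swapping it in. Real pipelines would do this at the table or partition level rather than single files.

```python
import os
import tempfile

def publish_atomically(lines, final_path):
    """Write output to a staging file in the same directory, then
    os.replace() it into place. The rename is atomic on POSIX, so a
    crash mid-write never leaves a half-written production file."""
    fd, staging = tempfile.mkstemp(dir=os.path.dirname(final_path) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            for line in lines:
                f.write(line + "\n")
        os.replace(staging, final_path)   # atomic cutover
    except BaseException:
        if os.path.exists(staging):
            os.remove(staging)            # never leave partial output behind
        raise
```

Readers see either the old complete output or the new complete output, never a mixture, which is the same guarantee the ALTER TABLE SWAP pattern gives in a warehouse.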
Interview: Your Spark job ran for 6 hours, failed at 95% completion, and needs to deliver by 6 AM. It is 4 AM. What do you do?
If the failure was an OOM: (1) Increase spark.sql.shuffle.partitions from 200 to 2000 (smaller partitions = less memory per task). (2) Increase executor memory. (3) Restart.
If the output table already has partial data: Check if the write mode is overwrite-by-partition. If yes, safe to restart — it will overwrite the partial partitions. If it is append mode, you MUST truncate the target partition before restarting or you will get duplicates.
If the job cannot possibly finish by 6 AM: Communicate to stakeholders immediately. Offer a partial delivery: “We can deliver data through yesterday with 95% of tables ready. The remaining 5% will land by 8 AM.” A partial, known-incomplete delivery with clear communication is always better than a silent miss.
Red Flag Answer: “I would just restart the job.” Without diagnosing the root cause, you will likely hit the same failure again and waste another 6 hours.
Follow-up: “How would you redesign this pipeline so a 95% failure does not mean starting from zero?”
Checkpoint intermediate results. Break the monolithic 6-hour job into stages that write intermediate DataFrames to Parquet. Use Airflow to orchestrate stages independently — if stage 3 of 5 fails, rerun from stage 3. The cost is more storage for intermediate files and slightly more orchestration complexity, but the recovery time drops from 6 hours to ~1 hour.
Follow-up Chain:
- Failure mode: What if the checkpointed intermediate data is itself corrupted (e.g., a skewed partition wrote partial results before OOM)? Add row-count validation at each checkpoint boundary. Each stage’s output should match expected count within a tolerance before the next stage starts.
- Rollout: Do not refactor the entire 6-hour job at once. Start by splitting off the most failure-prone stage (usually the largest join). Run the split version in shadow mode alongside the monolith for 1 week to validate output parity.
- Rollback: Keep the monolithic job definition in version control. If the staged version introduces a regression, revert to the monolith while investigating. The monolith is slow but correct — correctness beats speed in a rollback scenario.
- Measurement: Track per-stage duration, per-stage failure rate, and overall pipeline MTTR (mean time to recovery). Success means MTTR drops from 6 hours to <1 hour.
- Cost: Intermediate Parquet files on S3 cost ~$11.50/month in storage — negligible compared to the compute cost of re-running a failed 6-hour job ($200-500 on EMR).
- Security/Governance: Intermediate checkpoint files may contain sensitive data (PII, financial records). Apply the same encryption and access controls to intermediate storage as to final output. Do not write intermediate PII to an unencrypted temp bucket.
AI-Assisted Engineering Lens: Spark Failure Diagnosis
Work-Sample Prompt: Pipeline SLA Pressure
6. Stream Processing
6.1 Apache Kafka Architecture
Kafka is the backbone of most modern streaming architectures. Understanding its internals is table stakes for data engineering interviews. Core concepts:
- Topics — Named streams of events. A topic is the logical unit.
- Partitions — Each topic is split into partitions. Partitions enable parallelism — each partition is consumed by exactly one consumer in a consumer group.
- Brokers — Servers that store partition data. A typical cluster has 3-50+ brokers.
- Consumer Groups — A group of consumers that cooperate to consume a topic. Each partition is assigned to exactly one consumer in the group.
- Offsets — Each message in a partition has a sequential offset. Consumers track their position by committing offsets.
- ISR (In-Sync Replicas) — Replicas that are fully caught up with the leader. Only ISR members can become the new leader if the current leader fails.
6.2 Exactly-Once Semantics
The reality: “exactly-once” is a marketing term. What actually exists:
| Guarantee | Meaning | Implementation |
|---|---|---|
| At-most-once | Messages may be lost, never duplicated | Fire-and-forget (no acks) |
| At-least-once | Messages never lost, may be duplicated | Ack after processing, retry on failure |
| Effectively exactly-once | Duplicates may occur but are handled (idempotency) | At-least-once + idempotent consumers |
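The “effectively exactly-once” row combines at-least-once delivery with an idempotent consumer. A minimal in-memory sketch: a production system would persist the seen-set transactionally alongside the side effect rather than keeping it in process memory.

```python
class IdempotentConsumer:
    """Duplicates redelivered by the broker become no-ops, turning
    at-least-once delivery into effectively-exactly-once processing."""
    def __init__(self):
        self.seen = set()
        self.total = 0

    def handle(self, event):
        if event["id"] in self.seen:   # duplicate redelivery: skip
            return False
        self.seen.add(event["id"])
        self.total += event["amount"]  # the side effect, applied once
        return True

consumer = IdempotentConsumer()
# The broker redelivers event 2 after an ack timeout:
for e in [{"id": 1, "amount": 10}, {"id": 2, "amount": 5},
          {"id": 2, "amount": 5}]:
    consumer.handle(e)
```

The design hinges on every event carrying a stable unique ID; without one, deduplication has nothing to key on.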
AI-Assisted Engineering Lens: Kafka Operations and Development
Work-Sample Prompt: Kafka Consumer Lag Investigation
The consumer group analytics-pipeline has been stable for 6 months. This morning, consumer lag jumped from 200 messages to 2 million messages on topic user-events (12 partitions, 3 consumers in the group). Producer throughput has not changed. No deployments occurred overnight. The consumer instances show normal CPU and memory usage. Walk through your diagnosis: what do you check first, what are the three most likely causes, and how do you resolve each?
6.3 Stream Processing Frameworks
| Framework | Architecture | Strengths | Best For |
|---|---|---|---|
| Kafka Streams | Library (no separate cluster) | Simple deployment, exactly-once with Kafka | Kafka-native transformations |
| Apache Flink | Distributed cluster, true streaming | Lowest latency, event time processing, savepoints | Complex event processing, large-scale streaming |
| Spark Structured Streaming | Micro-batch (or continuous) | Unified batch + streaming API | Teams already on Spark |
| Apache Beam | Abstraction layer (runs on Flink, Spark, Dataflow) | Portable across runners | Multi-cloud, Google Cloud Dataflow |
Interview: Design a real-time fraud detection pipeline
- Process payment events within 500ms of occurrence
- Flag suspicious transactions based on rules and ML model scores
- Handle 50,000 events per second at peak
- Allow rules to be updated without pipeline restart
- Ingestion — Payment service publishes events to Kafka topic payments with at-least-once guarantee. Events include: transaction_id, user_id, amount, merchant, timestamp, device_fingerprint, IP address.
- Stream enrichment (Flink) — Flink job consumes from Kafka, enriches each event with:
- User’s historical spending patterns (from a feature store, keyed by user_id)
- Merchant risk score (from a lookup table, updated daily)
- Device/IP reputation (from an external service, cached with 5-minute TTL)
- Rule engine — Apply configurable rules (amount > $5000, transaction in new country, >3 transactions in 5 minutes). Rules stored in a configuration service and hot-reloaded via Flink’s broadcast state pattern — rule updates fan out to all parallel instances without restart.
- ML scoring — For transactions that pass rules but have medium risk signals, invoke an ML model (served via a low-latency endpoint, <50ms P99). The model scores based on behavioral features computed in real-time.
- Decision and action — Based on combined rule + ML score:
- Block (score > 0.95): Reject transaction, alert user
- Challenge (0.7-0.95): Trigger 2FA or OTP
- Allow (< 0.7): Approve, log for post-hoc analysis
- Feedback loop — Analyst-confirmed fraud/not-fraud labels feed back into model retraining (daily batch). False positive rates are tracked as an SLO (target: < 2% false positive rate).
- Flink over Kafka Streams: Flink’s event-time processing and windowing are better suited for “3 transactions in 5 minutes” type rules. Kafka Streams could work but requires more manual state management.
- ML model latency: If the model endpoint is slow (>100ms), consider pre-computing scores in batch and using them as features rather than calling the model in real-time for every transaction.
- False positives vs false negatives: Blocking a legitimate transaction costs customer trust. Missing fraud costs money. The threshold tuning is a business decision, not an engineering one.
Follow-up Chain:
- Failure mode: What if the Flink job crashes and restarts — do you lose the windowed state (“3 transactions in 5 minutes”)? Flink checkpoints state to durable storage (S3/HDFS). On restart, it restores from the last checkpoint and replays from the last committed Kafka offset. The gap between checkpoint and crash may result in a few seconds of duplicate processing — this is why the downstream decision service must be idempotent (do not block the same transaction twice).
- Rollout: Deploy new fraud rules via the broadcast state pattern — rules are published to a Kafka topic and fanned out to all Flink parallel instances. No pipeline restart required. For ML model updates: canary deployment — route 5% of traffic to the new model, compare false-positive rates for 24 hours, then promote or rollback.
- Rollback: Old rules are versioned in the config topic. Rollback means publishing the previous rule version. For ML models, the serving layer maintains the last 3 model versions — rollback is a config change, not a deployment.
- Measurement: Track false-positive rate (target: <2%), false-negative rate (target: <0.1%), P99 decision latency (target: <500ms), and rules-engine throughput. Alert if any metric breaches for >5 minutes.
- Cost: A Flink cluster processing 50K events/second costs roughly $1K/month. Compare against fraud losses prevented — typically $50K-500K/month for a mid-size payments company. The ROI is clear.
- Security/Governance: Payment event data is PCI-DSS scoped. Flink state stores and Kafka topics must be encrypted at rest and in transit. Access to the fraud pipeline codebase and rule configuration should be restricted to the fraud engineering team with audit logging. Model training data must be anonymized — do not train on raw card numbers.
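The decision tiers above map cleanly to a pure function. A minimal sketch (function and action names are illustrative, not from any specific fraud system):

```python
def decide(score: float) -> str:
    """Map a combined rule + ML fraud score to an action tier.

    Thresholds mirror the tiers above: block above 0.95,
    challenge for 0.7-0.95, allow below 0.7.
    """
    if score > 0.95:
        return "block"      # reject transaction, alert user
    if score >= 0.7:
        return "challenge"  # trigger 2FA or OTP
    return "allow"          # approve, log for post-hoc analysis
```

Keeping these thresholds in configuration (alongside the broadcast-state rules) lets the business tune them without a redeploy — consistent with the point that threshold tuning is a business decision.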
AI-Assisted Engineering Lens: Fraud Detection
Work-Sample Prompt: Fraud Pipeline Degradation
7. Orchestration
7.1 Apache Airflow
Airflow is the most widely used data pipeline orchestrator. Understanding its architecture and anti-patterns is essential. Architecture: Scheduler (parses DAGs, creates task instances), Executor (runs tasks — Local, Celery, Kubernetes), Workers (execute task logic), Metadata DB (PostgreSQL, tracks state), Web Server (UI).

Common anti-patterns:

| Anti-Pattern | Problem | Fix |
|---|---|---|
| Fat DAGs | 500+ tasks in one DAG, slow to parse | Split into multiple DAGs with ExternalTaskSensor or dataset triggers |
| Top-level code in DAG files | API calls or DB queries at import time run on every scheduler heartbeat | Move all logic into task callables |
| Non-idempotent tasks | Re-running a failed task creates duplicates | Use MERGE/upsert, delete-then-insert, or unique constraints |
| No backfill strategy | Historical reprocessing is manual and error-prone | Design DAGs with catchup=True and date-partitioned logic from day one |
| Hardcoded connections | Credentials in code | Use Airflow Connections and Variables, backed by a secrets backend (Vault, AWS Secrets Manager) |
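The delete-then-insert fix from the table can be sketched outside Airflow with a plain task callable and SQLite standing in for the warehouse (table and column names are illustrative):

```python
import sqlite3

def load_daily_sales(conn: sqlite3.Connection, ds: str, rows: list) -> None:
    """Idempotent delete-then-insert for one execution date.

    Re-running the task for the same `ds` replaces that day's rows
    instead of duplicating them; the delete and insert share one
    transaction, so readers never see a half-written partition.
    """
    with conn:  # sqlite3 connection as context manager = one transaction
        conn.execute("DELETE FROM daily_sales WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO daily_sales (ds, order_id, amount) VALUES (?, ?, ?)",
            [(ds, order_id, amount) for order_id, amount in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (ds TEXT, order_id TEXT, amount REAL)")
load_daily_sales(conn, "2024-01-01", [("o1", 10.0), ("o2", 20.0)])
load_daily_sales(conn, "2024-01-01", [("o1", 10.0), ("o2", 20.0)])  # retry: no duplicates
row_count = conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone()[0]
```

In a real DAG this logic would live inside the task callable (never at module top level), with the execution date injected by the scheduler.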
AI-Assisted Engineering Lens: Pipeline Orchestration
`catchup=False`, and overly broad `depends_on_past` settings. (3) Migration assistance: for teams moving from Airflow to Dagster, LLMs can translate Airflow DAGs into Dagster asset definitions, handling the conceptual shift from task-centric to asset-centric. The trap: LLM-generated DAGs often use deprecated Airflow operators or miss platform-specific configurations (e.g., KubernetesPodOperator resource limits). Always validate against your Airflow version and infrastructure.

Part III — Data Quality and Governance
8. Data Quality
8.1 The Six Dimensions
| Dimension | Definition | Example Check |
|---|---|---|
| Accuracy | Data reflects real-world truth | Order amounts match payment processor records |
| Completeness | Required fields are populated | No NULL customer_id in orders table |
| Consistency | Same data agrees across systems | Customer count matches between CRM and warehouse |
| Timeliness | Data arrives when expected | Daily sales table refreshed by 6 AM |
| Uniqueness | No unintended duplicates | Each order_id appears exactly once |
| Validity | Data conforms to expected format/range | Email contains @, age is 0-150, dates are not in the future |
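The example checks in the table are each a one-line predicate. A sketch of a row-level validator covering the completeness and validity rows (field names assumed; uniqueness, consistency, and timeliness are dataset-level checks and belong in a separate job):

```python
from datetime import date

def quality_issues(record: dict) -> list:
    """Return human-readable violations for one row, mirroring the
    completeness and validity dimensions in the table above."""
    issues = []
    if record.get("customer_id") is None:                  # completeness
        issues.append("null customer_id")
    if "@" not in record.get("email", ""):                 # validity: format
        issues.append("malformed email")
    if not 0 <= record.get("age", 0) <= 150:               # validity: range
        issues.append("age out of range")
    if record.get("order_date", date.min) > date.today():  # validity: no future dates
        issues.append("order_date in the future")
    return issues

good = {"customer_id": 42, "email": "a@b.com", "age": 30, "order_date": date(2024, 1, 1)}
bad = {"customer_id": None, "email": "nope", "age": 200, "order_date": date(2024, 1, 1)}
```

In practice these predicates are usually declared in dbt tests or Great Expectations suites rather than hand-rolled, but the underlying checks are this simple.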
8.2 Data Contracts
A data contract is an explicit agreement between a data producer and its consumers about the schema, quality, and delivery guarantees of a dataset. What a data contract includes:

- Schema — Field names, types, nullable/required, descriptions
- SLAs — Freshness (data available within 2 hours of event), completeness (>99.9% of events captured)
- Quality rules — Uniqueness constraints, valid value ranges, referential integrity
- Ownership — Which team owns the data, escalation path for quality issues
- Breaking change policy — How schema changes are communicated and when they take effect
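A sketch of enforcing the schema portion of such a contract at ingestion. The contract structure and field names here are hypothetical — real implementations typically use JSON Schema, protobuf, or dbt model contracts:

```python
# Hypothetical contract schema: field -> (expected type, required?)
ORDERS_CONTRACT = {
    "order_id": (str, True),
    "customer_id": (str, True),
    "amount": (float, True),
    "coupon_code": (str, False),  # optional/nullable
}

def contract_violations(row: dict) -> list:
    """Check one row against the contract; return violation messages."""
    errors = []
    for field, (ftype, required) in ORDERS_CONTRACT.items():
        value = row.get(field)
        if value is None:
            if required:
                errors.append(f"missing required field: {field}")
        elif not isinstance(value, ftype):
            errors.append(f"bad type for {field}: got {type(value).__name__}")
    return errors

ok = contract_violations({"order_id": "o1", "customer_id": "c1", "amount": 9.99})
bad = contract_violations({"order_id": "o1", "amount": "9.99"})
```

Rejecting (or quarantining) violating rows at the ingestion boundary is what makes the SLA and breaking-change clauses of the contract enforceable rather than aspirational.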
Interview: How would you implement data quality monitoring for a critical financial pipeline?
- `unique` and `not_null` on all primary keys
- `accepted_values` on enum columns
- Custom SQL tests for business invariants: `SUM(debits) = SUM(credits)` for double-entry accounting
- Row count tests: `row_count > 0` (catches empty table failures)
In dbt, the `--fail-fast` flag combined with a dependency chain stops the run at the first failing test. In Airflow, failed quality check tasks block downstream task execution.

The key principle: The further upstream you catch a quality issue, the cheaper it is to fix. A schema validation failure at ingestion costs seconds. A data quality issue discovered by a VP in a board meeting costs trust that takes months to rebuild.

Follow-up Chain:

- Failure mode: What if the circuit breaker is too aggressive — it blocks a dashboard refresh for a minor quality issue, and the CFO cannot see yesterday’s revenue? Implement severity tiers: P1 quality failures (missing data, duplicate transactions) block downstream. P2 failures (null rate spike in a non-critical column) alert but do not block. Let the data owner configure the tier per test.
- Rollout: Start with Layer 2 (dbt tests) — highest ROI, lowest effort. Add statistical anomaly detection (Layer 3) only after you have 30+ days of baseline data. Cross-system reconciliation (Layer 4) requires buy-in from source system teams — start with one critical source and expand.
- Rollback: Quality checks themselves can have bugs (a test with a wrong threshold blocks every run). Every quality rule should be version-controlled and deployable independently. A broken test should be disableable via a config flag without a code deploy.
- Measurement: Track data quality SLO adherence per dataset (target: 99.5% of runs pass all checks). Track mean time to detection (MTTD) for quality issues — the gap between when bad data enters and when the alert fires. Target: <15 minutes.
- Cost: Great Expectations or Elementary (open-source) is free but requires engineering time to configure. Monte Carlo (commercial) costs $50K-150K/year but provides automated anomaly detection with minimal setup. The decision depends on team size: <3 data engineers, go commercial; >5, build in-house.
- Security/Governance: Quality check results are metadata about data — they can reveal business-sensitive patterns (e.g., “revenue table had 0 rows today” tells you there were no sales). Restrict quality dashboard access to data team and business stakeholders, not the entire company.
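The severity-tier circuit breaker from the failure-mode bullet above can be sketched as a small gate (the result-record shape is hypothetical):

```python
def gate_downstream(check_results: list) -> dict:
    """Decide whether to block downstream tasks from quality results.

    P1 failures block and alert; P2 failures alert only — the
    severity tiers described in the failure-mode bullet.
    """
    p1_failures = [r["name"] for r in check_results
                   if not r["passed"] and r["tier"] == "P1"]
    p2_failures = [r["name"] for r in check_results
                   if not r["passed"] and r["tier"] == "P2"]
    return {
        "block_downstream": bool(p1_failures),
        "alerts": p1_failures + p2_failures,
    }

decision = gate_downstream([
    {"name": "orders_not_null", "tier": "P1", "passed": True},
    {"name": "null_rate_spike", "tier": "P2", "passed": False},
])
```

The tier assignment itself should live in version-controlled config owned by the data owner, so changing a test from blocking to alert-only is a config change, not a deploy.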
AI-Assisted Engineering Lens: Data Quality
Work-Sample Prompt: Data Quality Incident Response
`fct_daily_revenue` and see that yesterday’s partition has 0 rows, but the day before has the normal ~1.2M rows. The dbt job completed successfully at 5:30 AM with no test failures. Walk through your investigation: how did 0 rows pass through with no alerts, where do you look first, and what changes do you make to prevent this from happening again?

9. Data Governance
9.1 Data Catalogs
| Tool | Type | Strengths |
|---|---|---|
| DataHub (LinkedIn) | Open-source | Lineage, metadata, search. Extensible. Active community. |
| Amundsen (Lyft) | Open-source | Simple, search-focused. Good for discovery. |
| Atlan | Commercial | Modern UX, collaboration features, embedded governance |
| Unity Catalog (Databricks) | Platform-native | Deep Delta Lake integration, fine-grained access control |
| AWS Glue Data Catalog | Cloud-native | Integrated with Athena, EMR, Redshift. Low operational overhead. |
9.2 PII and Compliance
| Regulation | Scope | Key Data Engineering Requirements |
|---|---|---|
| GDPR | EU residents | Right to erasure (delete on request), data portability, consent tracking, DPIAs |
| CCPA/CPRA | California residents | Right to know, right to delete, opt-out of sale |
| HIPAA | US health data | Encryption at rest and in transit, access audit logs, BAAs with vendors |
- Column-level encryption for PII fields in the data lake
- Tokenization — replace PII with reversible tokens, store the mapping in a separate, access-controlled vault
- Dynamic data masking — show full data to authorized users, masked data to others (Snowflake, BigQuery support this natively)
- Soft deletes with TTL — mark records as deleted, purge after retention period
- Lineage-based impact analysis — when a GDPR deletion request comes in, trace which downstream tables contain that user’s data
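A toy sketch of the tokenization pattern from the list above — PII replaced by an opaque token, with the reversible mapping held in a separate store. Here the "vault" is just a dict; a real vault is a separate, access-controlled service:

```python
import secrets

class TokenVault:
    """Reversible tokenization: the same PII value always yields the
    same token; detokenization is a privileged lookup."""

    def __init__(self) -> None:
        self._pii_to_token = {}
        self._token_to_pii = {}

    def tokenize(self, pii: str) -> str:
        if pii not in self._pii_to_token:
            token = "tok_" + secrets.token_hex(8)  # opaque, not derivable from the PII
            self._pii_to_token[pii] = token
            self._token_to_pii[token] = pii
        return self._pii_to_token[pii]

    def detokenize(self, token: str) -> str:
        return self._token_to_pii[token]  # gate behind access control + audit logging

vault = TokenVault()
t1 = vault.tokenize("alice@example.com")
t2 = vault.tokenize("alice@example.com")  # stable: joins on the token still work
```

Because tokens are stable per value, analysts can join and aggregate on tokenized columns without ever seeing the underlying PII; only the vault service can reverse the mapping.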
AI-Assisted Engineering Lens: PII Detection and Compliance
Work-Sample Prompt: GDPR Deletion Request
`user_id`. Some are raw (bronze), some are aggregated (gold), and some contain the user’s data only as part of an aggregate metric (e.g., `daily_revenue_by_city`). The legal team says you have 30 days. Walk through: which tables require hard deletion, which require anonymization, which require nothing (aggregated beyond individual identification), and how do you ensure you have not missed any table?

Part IV — Advanced Topics
10. Real-Time Analytics
10.1 OLAP Engines
| Engine | Architecture | Query Latency | Best For |
|---|---|---|---|
| ClickHouse | Column-store, shared-nothing | Sub-second on billions of rows | Logs, metrics, product analytics |
| Apache Druid | Pre-aggregation + inverted index | Sub-second | Time-series analytics, dashboards |
| Apache Pinot | Real-time ingestion + offline segments | Sub-second | User-facing analytics (LinkedIn, Uber) |
| StarRocks | Vectorized, MPP | Sub-second | Mixed workloads (OLAP + ad-hoc) |
10.2 Lambda vs Kappa Architecture
| Aspect | Lambda | Kappa |
|---|---|---|
| Architecture | Separate batch + speed layers, serving layer merges | Single streaming layer for everything |
| Complexity | High (maintain two codepaths) | Lower (one codepath) |
| Accuracy | Batch layer corrects streaming approximations | Streaming must be accurate enough |
| Reprocessing | Re-run batch job | Replay from Kafka (if retention is long enough) |
| When to use | When streaming accuracy is insufficient | When event stream is the source of truth |
11. Data Mesh
- Organization has 5+ domain teams with distinct data domains
- Each domain team has at least 1-2 data-savvy engineers
- A platform team exists to provide self-service tooling
- Leadership supports the organizational change required
- It is used as an excuse to disband the central data team without building platform capabilities
- Domain teams lack the skill or staffing to own their data products
- There is no federated governance — each domain invents its own conventions
- It is adopted for a 20-person company (you do not have “domains” at that size)
AI-Assisted Engineering Lens: Data Mesh and Data Discovery
Work-Sample Prompt: Data Mesh Evaluation
12. Data System Design Patterns
System Design: Change Data Capture (CDC) Pipeline
- CDC Source — Debezium connector reads PostgreSQL WAL (Write-Ahead Log) and publishes row-level changes to Kafka topics. One topic per table. Changes include the full row state (before and after) and the operation type (INSERT, UPDATE, DELETE).
- Kafka as the backbone — Events are retained for 7 days (configurable). Multiple consumers can read independently at their own pace without affecting each other.
- Warehouse sink — Kafka Connect with S3 sink writes events to S3 in Parquet format. Snowflake Snowpipe or BigQuery data transfer auto-ingests new files. The warehouse table uses MERGE to apply changes.
- Search index sink — Separate Kafka consumer transforms events into Elasticsearch/OpenSearch documents and indexes them. Uses bulk indexing with a 5-second buffer for efficiency.
- Monitoring — Track consumer lag per consumer group. Alert if lag exceeds 5 minutes (data is too stale). Monitor Debezium slot LSN progress to detect WAL retention issues.
- Debezium vs application-level events: Debezium captures ALL changes (including manual database updates) but is coupled to the database schema. Application events are more semantic but miss changes made outside the application.
- WAL retention: PostgreSQL must retain WAL segments until Debezium has read them. If Debezium falls behind, WAL can consume excessive disk space. Configure `wal_level=logical` and monitor slot lag.
- Schema evolution: When the source table schema changes (new column, dropped column), Debezium can forward the change, but downstream consumers must handle it. Use a schema registry to manage this.
- Failure mode: What if Debezium silently falls behind and WAL accumulates on the PostgreSQL primary? (This is the exact scenario from the War Stories section.) Monitor replication slot lag in bytes and seconds. Alert at 1 hour of lag. Set `max_slot_wal_keep_size` to cap WAL retention and prevent disk exhaustion — accept CDC data loss over a production database crash.
- Rollout: Start CDC with a single non-critical table (e.g., a logging table). Validate end-to-end: row counts match between source and warehouse within the expected lag window. Then onboard tables in batches of 5-10 per week, with validation after each batch.
- Rollback: If CDC is causing production database performance issues (WAL decoding adds CPU load), you can drop the replication slot immediately. Data sync reverts to the previous method (full snapshot loads). Keep the snapshot-based pipeline in maintenance mode (runnable but not scheduled) for 90 days after CDC go-live.
- Measurement: Track consumer lag per topic (target: <5 minutes), end-to-end latency (source write to warehouse availability), and event throughput (events/second). Alert if lag exceeds the SLA agreed with downstream consumers.
- Cost: Debezium on Kafka Connect costs minimal compute (roughly one `m5.large` per 10 source tables). Kafka storage for 7-day retention depends on event volume — a high-write table producing 10K events/second generates ~60GB/day. S3 sink storage is ~$0.023/GB/month. The hidden cost: engineering time to manage schema evolution and handle edge cases (e.g., DDL changes, TOAST columns).
- Security/Governance: CDC events contain the full row state, including PII. The Kafka topic inherits the sensitivity classification of the source table. Apply column-level encryption or masking in the Kafka Connect transformation chain (SMTs) before events reach downstream consumers who lack PII access. Audit who can read which CDC topics.
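The warehouse-sink MERGE step can be sketched with a toy in-memory "table" keyed by primary key. The event shape loosely follows Debezium's op codes (`c`reate, `u`pdate, `d`elete), and applying the same batch twice is a no-op — the idempotency that replay-after-restart relies on:

```python
def apply_cdc(table: dict, events: list) -> dict:
    """Apply row-level change events to a table keyed by primary key,
    with MERGE (upsert/delete) semantics."""
    for event in events:
        if event["op"] in ("c", "u"):
            table[event["key"]] = event["after"]  # upsert full row state
        elif event["op"] == "d":
            table.pop(event["key"], None)         # delete if present
    return table

events = [
    {"op": "c", "key": 1, "after": {"id": 1, "status": "new"}},
    {"op": "u", "key": 1, "after": {"id": 1, "status": "paid"}},
    {"op": "c", "key": 2, "after": {"id": 2, "status": "new"}},
    {"op": "d", "key": 2, "after": None},
]
state = apply_cdc({}, events)
state_replayed = apply_cdc(dict(state), events)  # replay after a restart: same result
```

This is why a few seconds of duplicate processing after a checkpoint restore is harmless: re-applying already-applied changes converges to the same state.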
AI-Assisted Engineering Lens: CDC Pipeline Operations
Work-Sample Prompt: CDC Production Incident
`orders` table topic has jumped from 30 seconds to 45 minutes and is growing. The PostgreSQL primary’s disk usage has increased from 60% to 78% in the last hour. The source `orders` table is the busiest table in the OLTP database (5K writes/second during peak). Walk through your incident response: what is your immediate action, how do you diagnose the root cause, and what is your communication plan?

Cross-Chapter Connections
| Topic | Connection | Chapter |
|---|---|---|
| Database internals | Understanding B-trees, LSM trees, and query optimization for data pipeline performance | Database Deep Dives |
| Messaging systems | Kafka fundamentals, message queues, pub/sub for data pipelines | Messaging & Concurrency |
| Cloud architecture | AWS/GCP data services, cost optimization, multi-region strategies | Cloud & Problem Framing |
| ML systems | Feature stores, training data pipelines, model serving data requirements | ML & AI Systems |
| Observability | Pipeline monitoring, data freshness alerts, SLOs for data | Caching & Observability |
| System design | Data-intensive system design patterns | System Design Practice |
Quick Reference: Data Engineering Decision Cheatsheet
| Decision | Default Choice | When to Deviate |
|---|---|---|
| Warehouse | Snowflake or BigQuery | Redshift if deep AWS, Databricks if heavy ML |
| Table format | Apache Iceberg | Delta Lake if Databricks ecosystem, Hudi for upsert-heavy CDC |
| Transformation | dbt (ELT) | Spark for complex transformations that exceed SQL expressiveness |
| Orchestration | Airflow | Dagster for asset-centric pipelines, Prefect for simpler workflows |
| Streaming | Kafka + Flink | Kafka Streams for simple transformations, Pub/Sub on GCP |
| File format | Parquet (columnar) | Avro for row-based writes/Kafka, JSON only for bronze/raw |
| Data quality | dbt tests + Great Expectations | Monte Carlo for automated anomaly detection at scale |
| Data catalog | DataHub (open-source) | Atlan (commercial, better UX), Unity Catalog (Databricks-native) |
| CDC | Debezium | Application events if schema coupling is unacceptable |
| Real-time OLAP | ClickHouse | Pinot for user-facing analytics, Druid for time-series |
Part V — Deep Dive Interview Questions
13. Data Engineering Interview Gauntlet
Interview: Your daily pipeline takes 8 hours to run and the business needs results by 6 AM. It finished at 7:30 AM yesterday. How do you fix this?
| Bottleneck | Diagnosis | Fix | Impact |
|---|---|---|---|
| Data skew in joins | One Spark task takes 10x longer than others | Salt the join key, or use broadcast join for small tables | 2-5x faster |
| Full table rebuilds | dbt models materialized as `table` rebuild every row daily | Switch to incremental with a merge strategy | 10-100x faster |
| Small file problem | Input has 100,000 tiny Parquet files | Compact files with a scheduled job, use OPTIMIZE in Delta/Iceberg | 2-5x faster reads |
| Unnecessary shuffles | Multiple groupBy operations on different keys | Reorganize transformations to minimize shuffles, pre-aggregate | 2-3x faster |
| Over-provisioned scans | Reading entire table when only last 7 days needed | Add partition pruning, push predicates to scan | 10-50x faster |
| Sequential execution | Tasks that could run in parallel are sequential | Redesign DAG to maximize parallelism (Airflow task groups) | 2-4x faster |
- Increase parallelism (more Spark executors, more Airflow workers)
- Enable Adaptive Query Execution (AQE) in Spark for automatic skew handling
- Check if source tables have changed format or partitioning (a schema change upstream can silently break partition pruning)
- Split the monolithic 8-hour pipeline into independent sub-pipelines that run in parallel
- Implement incremental processing (only process rows changed since last run)
- Consider a streaming approach for the most time-sensitive data (Kafka → Flink → serving table) with a batch reconciliation layer
- Failure mode: What if the optimization makes the pipeline faster but introduces subtle data differences (e.g., incremental logic misses edge-case late arrivals that full rebuilds caught)? Run both the old and new pipeline in parallel for 2 weeks and diff the outputs. Any divergence > 0 rows must be explained before decommissioning the old pipeline.
- Rollout: Apply fixes in impact order: partition pruning first (biggest time savings, lowest risk), then incremental models (medium risk), then DAG parallelization (requires Airflow changes). Never optimize all three simultaneously — you will not know which change caused a regression.
- Rollback: Keep the old DAG definition under a feature flag. If the optimized pipeline produces incorrect results, revert to the old DAG while investigating. The old pipeline is slow but correct.
- Measurement: Track total pipeline runtime, per-task runtime, data freshness SLO adherence, and row-count accuracy. The goal is not just speed — it is speed without sacrificing correctness.
- Cost: Faster pipelines often cost less (fewer compute-hours). But some optimizations (more executors for parallelism) cost more compute per run while reducing wall-clock time. Clarify with the business: is the constraint time-to-delivery or cost? These are different optimization targets.
- Security/Governance: If the pipeline touches PII and you are splitting it into parallel sub-pipelines, ensure each sub-pipeline has the same access controls. A parallel DAG that writes intermediate PII to an unprotected temp table is a compliance risk.
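The "salt the join key" fix from the table can be sketched in plain Python. In Spark you would add a salt column with `rand()` on the large side and explode the small side the same way; the names and bucket count here are illustrative:

```python
import random

BUCKETS = 8  # number of salt values; tune to the observed skew

def salt_fact_rows(rows: list, key: str) -> list:
    """Append a random salt to the hot key so its rows spread across
    BUCKETS partitions instead of all landing on one task."""
    return [{**r, "salted_key": f"{r[key]}#{random.randrange(BUCKETS)}"}
            for r in rows]

def explode_dim_rows(rows: list, key: str) -> list:
    """Replicate each dimension row once per salt value so the
    salted equi-join still matches every fact row."""
    return [{**r, "salted_key": f"{r[key]}#{i}"}
            for r in rows for i in range(BUCKETS)]

facts = salt_fact_rows([{"user_id": "hot"} for _ in range(100)], "user_id")
dims = explode_dim_rows([{"user_id": "hot", "country": "US"}], "user_id")
dim_keys = {d["salted_key"] for d in dims}
```

The trade-off: the dimension side grows by a factor of BUCKETS, so salting only pays off when the skewed side dominates — otherwise a broadcast join of the small table is simpler.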
AI-Assisted Engineering Lens: Pipeline Optimization
AI-Assisted Engineering Lens: Pipeline Optimization
Interview: Explain how you would handle schema evolution in a production data pipeline
- Backward compatible (default): new schema can read old data. New fields must have defaults. Fields cannot be removed.
- Forward compatible: old schema can read new data. Fields can be added but not removed.
- Full compatible: both backward and forward.
- Schema detection picks it up (Great Expectations schema check, dbt source freshness test)
- Alert fires (“new column `loyalty_tier` detected in source `customers`”)
- Data engineer evaluates: is this column needed downstream? If yes, add to the silver model. If no, ignore it (it persists in bronze for future use).
- Notification to all downstream consumers
- Migration period (old and new columns coexist)
- Consumer migration
- Deprecation of old column
- Avro with schema registry for Kafka: schemas are versioned, compatibility checked on every produce
- Delta Lake / Iceberg schema evolution: `ALTER TABLE ADD COLUMN` is metadata-only (no data rewrite). Column renames and type widening are supported.
- dbt `on_schema_change` config: `append_new_columns` automatically adds new source columns to incremental models
- Failure mode: What if a schema change is deployed to the source system without notifying the data team? The bronze layer silently ingests the new schema, but the silver layer breaks on the next dbt run because a column type changed. Defense: add a schema fingerprint check at ingestion — hash the column names and types, compare against the last known fingerprint, and alert on any change before processing.
- Rollout: Schema evolution rules should be documented in a data contract between the source team and the data team. New columns: auto-propagated to bronze and silver. Type changes: 2-week migration window with dual columns. Column removals: 30-day deprecation notice.
- Rollback: If a schema change breaks downstream, rollback means re-reading from the bronze layer (which stored the raw data in the original format). Bronze’s schema-on-read design is exactly for this scenario — the data is preserved even if the interpretation changes.
- Measurement: Track schema change frequency per source, mean time to absorb a schema change (from detection to production deployment), and the number of downstream failures caused by unannounced schema changes per quarter.
- Cost: Schema registries (Confluent, AWS Glue) add minimal cost ($50-200/month). The real cost of NOT having schema management is the engineering time spent debugging failures from unannounced changes — typically 2-4 hours per incident.
- Security/Governance: Schema changes to PII-containing tables require a privacy review. If a new column contains PII (e.g., `phone_number` added to `customers`), it must be classified and masked/encrypted before propagating beyond the bronze layer.
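The schema-fingerprint defense from the failure-mode bullet above is only a few lines: hash the (column, type) pairs at ingestion and compare against the last known value. A sketch (where the last fingerprint is stored — a metadata table, a state file — is up to you):

```python
import hashlib

def schema_fingerprint(columns: list) -> str:
    """Hash sorted (name, type) pairs; any added, removed, renamed,
    or retyped column changes the digest."""
    canonical = "|".join(f"{name}:{dtype}" for name, dtype in sorted(columns))
    return hashlib.sha256(canonical.encode()).hexdigest()

v1 = schema_fingerprint([("id", "bigint"), ("email", "varchar")])
v1_again = schema_fingerprint([("email", "varchar"), ("id", "bigint")])  # order-insensitive
v2 = schema_fingerprint([("id", "bigint"), ("email", "text")])           # type changed
```

Comparing one short string per run is cheap enough to do on every ingestion, which is what lets you alert on a change before processing rather than after a silver-layer failure.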
`on_schema_change='append_new_columns'` in dbt, and a dual-column migration strategy for type changes.”
A staff/principal engineer adds: “I want to shift the schema evolution problem left. Instead of the data team reacting to schema changes, I would advocate for data contracts where the source team commits to a compatibility guarantee (backward-compatible changes only, 2-week notice for breaking changes). This is an organizational solution, not a technical one — it requires buy-in from engineering leadership that data stability is a shared responsibility, not just the data team’s problem. I would draft the data contract template, get one high-profile source team to adopt it as a pilot, measure the reduction in downstream incidents, and use that data to drive company-wide adoption.”
AI-Assisted Engineering Lens: Schema Evolution
`manifest.json`, which contains the full dependency graph and column lineage. Another application: auto-generating migration SQL for dual-column strategies — given the old and new column definitions, generate the ALTER TABLE, backfill UPDATE, and model-level SQL changes.
Interview: Design a data pipeline for an e-commerce recommendation engine
- Input: clickstream data (user views, adds-to-cart, purchases), product catalog, user profiles
- Output: personalized product recommendations, updated in near-real-time
- Scale: 50M users, 10M products, 500M events/day
- Clickstream events → Kafka topic `user-events` (partitioned by user_id for ordering)
- Schema:
{user_id, event_type, product_id, timestamp, session_id, device, context} - Schema registry enforces Avro schema. Events retained for 7 days.
- User features: purchase history, category preferences, average order value, recency/frequency/monetary (RFM) scores
- Product features: popularity score, average rating, conversion rate, category embeddings
- User-product features: collaborative filtering scores (ALS matrix factorization)
- Session features: items viewed in current session, cart contents, time spent on product pages
- Trending: products gaining views rapidly in the last hour
- Two-stage model:
- Stage 1 (Candidate Generation): ALS collaborative filtering + content-based filtering → top 500 candidates per user
- Stage 2 (Ranking): Gradient boosted tree (XGBoost) or neural ranker that scores each candidate using user features + product features + interaction features
- Training data: last 30 days of implicit feedback (views, clicks, purchases weighted differently)
- Evaluation: offline NDCG@10, online A/B test CTR
- Pre-computed candidate sets stored in Redis (key: user_id, value: ranked product IDs)
- When user opens the app: fetch candidate set from Redis, re-rank using real-time session features
- Cold start (new users): fallback to popularity-based recommendations + content-based from initial category preferences
- Monitor event volume per hour (alert if drops >20%)
- Monitor feature freshness in the online store (alert if >1 hour stale)
- Monitor recommendation diversity (alert if >60% of recommendations are from one category)
- A/B test every model update against the incumbent
- Pre-computed candidates are stale (updated daily). Mitigation: streaming features adjust ranking in real-time, and trending products are injected into candidate sets hourly.
- Collaborative filtering cold-starts for new products. Mitigation: new products get a “boost” period where they are injected into candidate sets for relevant categories regardless of CF scores.
- Cost: the streaming Flink job costs 3x more than a batch-only approach. Justify with A/B data: real-time features typically improve CTR by 15-25% over batch-only.
- Failure mode: What if the feature store’s online store (Redis) goes down? Recommendations degrade to pre-computed candidate sets without real-time re-ranking. Design the serving layer with a graceful fallback: if Redis is unavailable within 50ms, serve cached recommendations from the last successful fetch. The user experience degrades (slightly stale recommendations) but does not break.
- Rollout: Launch with batch-only features first (collaborative filtering + content-based). Add streaming features in phase 2 after validating the batch-only baseline via A/B test. This establishes a clear measurement of real-time feature incremental value.
- Rollback: Every model version is tagged and stored. Rollback means changing a config to point at the previous model version in the serving layer — no code deploy required. Feature store keeps the last 7 days of feature values, so the previous model can still score against recent features.
- Measurement: Primary metric: CTR (click-through rate) on recommendations. Secondary: revenue per session, recommendation diversity (Gini coefficient), and cold-start coverage (% of new users receiving non-generic recommendations). Run A/B tests for every model or feature change with at least 7 days of data and 95% statistical significance.
- Cost: Feature store (Feast + Redis) ~$3K/month. Spark training ~$8K/month. Compare against the incremental revenue from improved recommendations — typically $100K-500K/month for a 50M-user e-commerce platform. The ROI is overwhelming if the model works.
- Security/Governance: User behavioral data (clickstream) is privacy-sensitive. Ensure all features are computed from aggregated or anonymized data — no raw user event sequences stored in the feature store. GDPR deletion requests must propagate to the feature store: when a user exercises their right to erasure, their features must be deleted from both online and offline stores within the SLA (typically 30 days).
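The graceful-fallback idea from the failure-mode bullet, sketched with a pluggable fetch function. The timeout handling and cache here are simplified stand-ins for a Redis client with a 50ms deadline:

```python
def recommend(user_id: str, fetch_live, fallback_cache: dict):
    """Serve live recommendations when the online store answers;
    otherwise degrade to the last successfully fetched set."""
    try:
        recs = fetch_live(user_id)          # e.g. Redis GET with a 50ms timeout
        fallback_cache[user_id] = recs      # refresh the fallback copy
        return recs, "live"
    except TimeoutError:
        return fallback_cache.get(user_id, []), "fallback"

cache = {}
live, src1 = recommend("u1", lambda uid: ["p1", "p2"], cache)

def redis_down(uid):
    raise TimeoutError("online store unavailable")

stale, src2 = recommend("u1", redis_down, cache)
```

Tagging each response with its source ("live" vs "fallback") also gives you a direct metric for how often the degraded path is serving traffic.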
AI-Assisted Engineering Lens: Recommendation Pipelines
Interview: How do you ensure idempotency in data pipelines?
| Component | Idempotency Pattern | Implementation |
|---|---|---|
| Kafka consumer | Exactly-once via transactions or idempotent writes | Consumer reads → process → write output + commit offset in a single transaction |
| Spark write | Dynamic partition overwrite | With `spark.sql.sources.partitionOverwriteMode=dynamic`, `df.write.mode("overwrite").partitionBy("date")` replaces only the partitions present in the DataFrame (the default static mode overwrites the whole table) |
| dbt incremental | Merge/upsert | unique_key config ensures duplicate rows are updated, not inserted |
| Airflow task | Delete-then-insert | Task deletes target data for the execution date, then inserts fresh data |
| File-based | Atomic rename | Write to a temp file, then rename to target path (rename is atomic on most filesystems) |
| Database write | Upsert (MERGE) | INSERT ... ON CONFLICT DO UPDATE ensures no duplicates on retry |
`INSERT INTO` without deduplication. If the pipeline fails after inserting 500 of 1000 rows and retries, you get 1500 rows (500 duplicated).

Follow-up Chain:

- Failure mode: What if the unique key used for MERGE is not truly unique? You get non-deterministic merge behavior (Snowflake and BigQuery will error or pick one row arbitrarily). Validate uniqueness of the merge key with a dbt test before relying on it for idempotency.
- Rollout: When converting a pipeline from INSERT to MERGE, run both in parallel for 1 week. Compare row counts and checksums. MERGE has different performance characteristics (slower for large batches) and may need different cluster sizing.
- Rollback: Reverting from MERGE to INSERT is risky — you lose the idempotency guarantee. If you must roll back, add an explicit deduplication step (`DELETE WHERE execution_date = '...'` before INSERT) as a temporary bridge.
- Measurement: Track duplicate row counts per table per day. A table with growing duplicates indicates an idempotency gap. Run a weekly audit query: `SELECT merge_key, COUNT(*) FROM table GROUP BY 1 HAVING COUNT(*) > 1`.
- Cost: MERGE operations are more expensive than INSERT (they scan the target table for matching keys). For a 1B-row target table, a MERGE of 100K rows scans 1B rows to find matches. Mitigation: partition the target table by date and MERGE only within the target partition.
- Security/Governance: Delete-then-insert patterns (an alternative idempotency strategy) temporarily create a window where data is missing. If a dashboard queries during that window, it shows incomplete data. Use a transaction to make the delete+insert atomic, or prefer MERGE which never has a missing-data window.
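The transactional delete-then-insert described above can be sketched with sqlite3 (a minimal sketch; the table and columns are illustrative):

```python
import sqlite3

def idempotent_load(conn: sqlite3.Connection, execution_date: str, rows: list) -> None:
    """Delete-then-insert for one partition inside a single transaction.
    A retry deletes its own earlier partial insert, so reruns cannot
    duplicate rows, and the delete+insert commits as one atomic unit."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("DELETE FROM fct_orders WHERE execution_date = ?",
                     (execution_date,))
        conn.executemany(
            "INSERT INTO fct_orders (execution_date, order_id, amount) "
            "VALUES (?, ?, ?)",
            [(execution_date, oid, amt) for oid, amt in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fct_orders (execution_date TEXT, order_id INT, amount REAL)")
idempotent_load(conn, "2024-01-01", [(1, 9.99), (2, 19.99)])
idempotent_load(conn, "2024-01-01", [(1, 9.99), (2, 19.99)])  # retry: still 2 rows
```

In a warehouse without multi-statement transactions for this pattern, MERGE is the safer choice, since it never exposes the empty-partition window.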
Interview: Compare Airflow, Dagster, and Prefect for pipeline orchestration
| Aspect | Airflow | Dagster | Prefect |
|---|---|---|---|
| Philosophy | Task-centric (define tasks and dependencies) | Asset-centric (define data assets and their lineage) | Flow-centric (define Python functions as tasks) |
| Configuration | DAG files in Python, scheduler parses them | Software-defined assets with type checking | Python decorators, minimal configuration |
| Scheduling | Cron-based, external triggers | Schedule by asset freshness policies | Cron-based, event-based, deployments |
| Testing | Difficult (DAGs are hard to unit test) | First-class testing (assets are testable functions) | Good (flows are regular Python functions) |
| UI | Mature, functional, not beautiful | Modern, asset-lineage-focused | Modern, cloud-native |
| Scaling | CeleryExecutor, KubernetesExecutor | Kubernetes, multi-process, cloud-native | Kubernetes, serverless workers |
| Community | Massive (most popular, most plugins) | Growing fast | Growing |
| Best for | Teams already on Airflow, complex DAG dependencies | New projects, data mesh, asset-oriented thinking | Simple-to-medium workflows, rapid development |
| Weakness | Complex to operate, DAG parsing overhead at scale | Smaller ecosystem, newer | Fewer integrations than Airflow |
- Existing Airflow users: Stay on Airflow unless the pain is severe. The migration cost is real.
- New projects with data engineering focus: Consider Dagster. Asset-centric thinking aligns naturally with data engineering (each dbt model is an asset, each table is an asset).
- New projects with general workflow needs: Prefect for simplicity. If your pipelines are mostly “run this Python function, then that one,” Prefect’s decorator-based API is the simplest.
- Enterprise with heavy compliance: Airflow has the most mature RBAC, audit logging, and enterprise features.
- Failure mode: What if Airflow’s scheduler becomes a bottleneck (DAG parsing takes minutes, task scheduling is delayed)? At 500+ DAGs, the scheduler spends most of its time parsing Python files. Mitigation: tune dag_dir_list_interval, use @dag decorators instead of top-level Python, and consider Airflow 2.x’s DAG serialization, which parses once and stores the result in the metadata DB.
- Rollout: Never migrate orchestrators mid-sprint. Plan a 1-quarter migration: weeks 1-4, run both orchestrators in parallel for 5 non-critical DAGs. Weeks 5-8, migrate all non-critical pipelines. Weeks 9-12, migrate critical pipelines with extensive validation.
- Rollback: Keep the old orchestrator’s DAG definitions for 90 days post-migration. If the new orchestrator has a critical bug, you can revert by re-enabling the old scheduler.
- Measurement: Track scheduler latency (time from scheduled run to actual task start), task success rate, and MTTR for failed tasks. The orchestrator should not be the bottleneck.
- Cost: Managed Airflow (MWAA, Cloud Composer, Astronomer) costs $100-500/month for small teams. Self-hosted Airflow on Kubernetes costs engineering time (~0.5 FTE for operations). Factor in operational burden, not just license cost.
- Security/Governance: Airflow’s Connections store credentials. Ensure they are backed by a secrets manager (Vault, AWS Secrets Manager) rather than stored in the metadata DB in plaintext. Dagster and Prefect have native secrets management that is more secure by default.
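To make the task-centric vs asset-centric distinction concrete, here is a toy pure-Python sketch (no orchestrator library; the asset names are illustrative) that declares assets with their upstream dependencies and derives a build order — roughly what an asset-centric tool computes from your definitions:

```python
from graphlib import TopologicalSorter

# Asset-centric declaration: each asset names the assets it reads from,
# and execution order is derived from that lineage rather than wired by hand.
assets = {
    "stg_orders": [],                                  # staged from raw source
    "stg_customers": [],
    "fct_orders": ["stg_orders", "stg_customers"],     # joins the two staging tables
    "orders_dashboard": ["fct_orders"],                # consumes the fact table
}

# static_order() yields each asset only after all of its upstreams.
build_order = list(TopologicalSorter(assets).static_order())
print(build_order)
```

A task-centric tool like Airflow asks you to wire the same edges as explicit task dependencies; the asset framing additionally gives you lineage and freshness tracking from the same declaration.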
Interview: What is the difference between event time and processing time, and why does it matter?
- For analytics dashboards: processing time is fine. A few seconds of inaccuracy does not matter.
- For billing/financial: event time with generous watermarks (hours or days of allowed lateness). Accuracy is non-negotiable.
- For fraud detection: event time with tight watermarks (minutes). Speed matters, and you can reconcile later.
- Failure mode: What if a mobile app buffers events offline for hours and then sends them all at once when the user reconnects? This creates a burst of late events with event times hours in the past. If your watermark has already advanced past those timestamps, the events are dropped. Mitigation: use a side output for late events and process them in a separate “late reconciliation” job that updates the affected windows.
- Rollout: Default to processing time for new pipelines (simpler, fewer edge cases). Switch to event time only when you have evidence that processing-time inaccuracy affects business decisions. Over-engineering for event time correctness in a pipeline that powers a daily dashboard is wasted effort.
- Rollback: If event-time processing introduces bugs (incorrect watermark heuristics, excessive state size), reverting to processing time is a one-line config change in Flink. The output changes, but the pipeline keeps running.
- Measurement: Track late event rate (% of events arriving after the watermark). If >5% of events are “late,” your watermark is too aggressive. Track window completeness: does the 13:00-14:00 window contain all events that actually occurred in that hour?
- Cost: Event-time processing requires more state (windows stay open longer to accommodate late events). In Flink, state is stored on RocksDB with checkpoints to S3. A 24-hour allowed-lateness window for a high-volume topic can consume hundreds of GB of state, costing significant S3 storage and checkpoint I/O.
- Security/Governance: Event time is set by the client (producer). A malicious or buggy producer can set event timestamps in the future or far past, corrupting windows. Validate event timestamps at ingestion: reject events with timestamps >1 hour in the future or >7 days in the past.
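The event-time machinery discussed above — tumbling windows, a max-minus-lateness watermark heuristic, and a side output for late events — can be sketched in pure Python (in Flink these are built in; all names and constants here are illustrative):

```python
from collections import defaultdict

WINDOW = 60             # tumbling window size, seconds
ALLOWED_LATENESS = 30   # watermark trails the max seen event time by this much

windows = defaultdict(int)   # window start -> event count
late_events = []             # "side output" for a late-reconciliation job
watermark = float("-inf")

def on_event(event_time: float) -> None:
    """Assign an event to its tumbling window by event time, or route it to
    the late side output if its window has already closed at the watermark."""
    global watermark
    watermark = max(watermark, event_time - ALLOWED_LATENESS)
    window_start = int(event_time // WINDOW) * WINDOW
    if window_start + WINDOW <= watermark:   # window already finalized
        late_events.append(event_time)
    else:
        windows[window_start] += 1

# Event at t=15 arrives after the watermark has passed window [0, 60):
for t in [10, 20, 95, 130, 15]:
    on_event(t)
print(dict(windows), late_events)   # {0: 2, 60: 1, 120: 1} [15]
```

Raising ALLOWED_LATENESS keeps windows open longer (fewer dropped events, more state); lowering it finalizes results sooner — the exact trade-off the follow-up chain describes.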
Interview: Design a data quality monitoring platform
- Monitor 500+ datasets across 3 data warehouses
- Detect anomalies (volume drops, freshness delays, distribution shifts, schema changes)
- Alert appropriate data owners within 5 minutes
- Provide root cause analysis (which upstream pipeline caused the issue)
- Agent per warehouse (Snowflake, BigQuery, Redshift) that queries information_schema every 15 minutes
- Collects: row counts, null rates per column, min/max/mean for numeric columns, distinct count for categorical columns, schema fingerprint, last update timestamp
- Stores metadata in a time-series database (ClickHouse or TimescaleDB)
- Statistical baselines: For each metric, compute a 30-day rolling baseline (mean + standard deviation). Alert when current value exceeds 3 sigma.
- Pattern-based: Learn daily/weekly patterns (Monday volumes are always lower). Use STL decomposition to separate trend, seasonality, and residual. Alert on residual anomalies.
- Rule-based: Configurable rules per dataset (row_count > 0, freshness < 2 hours, null_rate(‘email’) < 0.01)
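The 3-sigma baseline check above reduces to a few lines (a sketch; in the real platform the history would come from the 30-day metric store):

```python
from statistics import mean, stdev

def is_anomalous(history: list, current: float, sigmas: float = 3.0) -> bool:
    """Flag `current` when it falls more than `sigmas` standard deviations
    from the rolling baseline mean. A tiny floor on the stdev keeps
    constant histories from dividing by zero."""
    mu = mean(history)
    sd = stdev(history) or 1e-9
    return abs(current - mu) > sigmas * sd

baseline = [1_200_000, 1_180_000, 1_220_000, 1_190_000, 1_210_000]  # daily row counts
print(is_anomalous(baseline, 0))          # volume dropped to zero -> True
print(is_anomalous(baseline, 1_205_000))  # within normal range -> False
```

This is the statistical layer only; the pattern-based layer would first remove trend and seasonality (e.g. via STL decomposition) and apply the same test to the residual.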
- Maintain a lineage graph (extracted from dbt, Airflow, or a lineage tool like DataHub)
- When an anomaly is detected on table X, trace upstream: check the health of all parent tables
- If parent table Y also has an anomaly, the root cause is likely Y (or Y’s parent). Surface this in the alert.
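The upstream trace described above is a short walk over the lineage graph (a toy sketch; `parents` would come from dbt or DataHub, and `anomalous` from the detection layer):

```python
def root_causes(table: str, parents: dict, anomalous: set) -> set:
    """Walk upstream from an anomalous table and return the highest anomalous
    ancestors -- anomalous tables none of whose own parents are anomalous.
    Those are the likely root causes to surface in the alert."""
    roots, stack, seen = set(), [table], set()
    while stack:
        t = stack.pop()
        if t in seen:
            continue
        seen.add(t)
        bad_parents = [p for p in parents.get(t, []) if p in anomalous]
        if t in anomalous and not bad_parents:
            roots.add(t)
        stack.extend(bad_parents)   # climb only through anomalous lineage
    return roots

parents = {"fct_orders": ["stg_orders"], "stg_orders": ["raw_orders"], "raw_orders": []}
print(root_causes("fct_orders", parents, {"fct_orders", "stg_orders"}))
# {'stg_orders'} -- fct_orders is a downstream symptom, stg_orders is the cause
```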
- Route alerts to dataset owners (from data catalog ownership metadata)
- Slack integration with actionable context: “Table fct_orders has 0 rows (expected ~1.2M). Upstream table stg_orders last updated 6 hours ago (expected: hourly). Airflow DAG etl_orders last run failed at 04:15 UTC.”
- Severity tiers: P1 (revenue-impacting tables), P2 (team-level impact), P3 (informational)
- Define SLOs per dataset: freshness (updated within X hours), completeness (>99.9% non-null for required fields), accuracy (reconciliation with source within Y%)
- Error budget tracking: if a dataset breaches its freshness SLO 5 times in a month, escalate to the data platform team
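The error-budget escalation rule above is just a count over a trailing window (a sketch; the budget and window values mirror the text and are otherwise illustrative):

```python
from datetime import date, timedelta

def should_escalate(breaches: list, today: date,
                    budget: int = 5, window_days: int = 30) -> bool:
    """Escalate to the platform team once freshness-SLO breaches in the
    trailing window reach the monthly error budget."""
    cutoff = today - timedelta(days=window_days)
    recent = [d for d in breaches if d >= cutoff]
    return len(recent) >= budget

breaches = [date(2024, 3, d) for d in (2, 5, 9, 14, 21)]  # five breaches in March
print(should_escalate(breaches, date(2024, 3, 25)))       # budget exhausted -> True
print(should_escalate(breaches, date(2024, 5, 1)))        # breaches aged out -> False
```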
- Statistical anomaly detection has false positives (holidays, promotions cause legitimate volume spikes). Mitigation: allow users to acknowledge and suppress known patterns.
- Metadata collection adds load to the warehouse. Mitigation: schedule during off-peak hours, use information_schema queries (metadata, not data scans).
- Lineage-based root cause requires accurate lineage. If pipelines are not tracked in a lineage system, root cause analysis falls back to manual investigation.
- Failure mode: What if the anomaly detection system itself has a bug and generates false alerts at 3 AM for 200 datasets? Alert fatigue is a real risk — after 3 false-alarm weekends, the on-call engineer starts ignoring alerts. Mitigation: implement alert suppression for known patterns (holidays, promotions), require anomalies to persist for 2 consecutive checks before alerting, and track alert-to-action ratio (% of alerts that result in actual investigation).
- Rollout: Start with rule-based checks only (row count > 0, freshness < X hours). Add statistical anomaly detection after 30+ days of baseline data. Add lineage-based root cause analysis after lineage coverage exceeds 80% of critical tables.
- Rollback: Each monitoring rule is independently toggleable. If a rule produces excessive false positives, disable it via config (not a code deploy). The monitoring platform should never block data pipeline execution — it is an observer, not a gate (unless explicitly configured as a circuit breaker for P1 datasets).
- Measurement: Track MTTD (mean time to detection), MTTR (mean time to resolution), false positive rate, and alert-to-action ratio. A good data quality platform has MTTD <15 minutes and a false positive rate <10%.
- Cost: Self-hosting metadata storage on ClickHouse or TimescaleDB is cheap to run but takes engineering time to build; commercial data observability tools cost roughly $50K-150K/year but require zero build time. Build vs buy depends on team size and customization needs.
- Security/Governance: The quality monitoring platform has read access to every dataset in the warehouse (to compute statistics). This makes it a high-value target — if compromised, it provides a map of all data assets and their schemas. Apply the principle of least privilege: the monitoring agent should have read-only access, restricted to metadata queries (COUNT, MIN, MAX, DISTINCT) rather than SELECT * access.
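The persist-for-2-consecutive-checks suppression from the failure-mode note above is a small stateful filter (a sketch; class and method names are illustrative):

```python
from collections import defaultdict

class AlertDebouncer:
    """Fire an alert only after an anomaly has persisted for `required`
    consecutive checks; a single-check blip resets the streak and never
    pages anyone at 3 AM."""
    def __init__(self, required: int = 2):
        self.required = required
        self.streak = defaultdict(int)

    def observe(self, dataset: str, anomalous: bool) -> bool:
        if not anomalous:
            self.streak[dataset] = 0
            return False
        self.streak[dataset] += 1
        return self.streak[dataset] >= self.required

d = AlertDebouncer()
print(d.observe("fct_orders", True))   # first check: suppressed -> False
print(d.observe("fct_orders", True))   # persisted: alert -> True
print(d.observe("fct_orders", False))  # recovered: streak reset -> False
```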
AI-Assisted Engineering Lens: Data Quality Monitoring
etl_payments failed at the extract_stripe task due to an expired API key.” This saves 20 minutes of manual investigation per incident. (3) Automated documentation of data quality rules: Feed a table schema and sample data to an LLM and prompt “generate a comprehensive Great Expectations suite.” The LLM infers appropriate tests from data characteristics, bootstrapping monitoring for newly onboarded datasets.
Work-Sample Prompt: Data Quality Platform Design
14. Data Engineering War Stories
The Great Snowflake Cost Explosion at a Series B Startup
- They created XL warehouses for every job, including simple 5-second queries that ran on a schedule
- They set auto-suspend to 10 minutes, so warehouses kept billing for up to 10 idle minutes after every query burst (Snowflake bills per second with a 60-second minimum charge per resume, so a short auto-suspend is safe for scheduled ETL)
- They rebuilt every dbt model as a full table materialization (no incremental logic). A 500M row fact table was rebuilt from scratch every hour.
- Right-sized warehouses: XS for simple queries, S for most ETL, M only for the largest joins. Saved 60%.
- Set auto-suspend to 1 minute for ETL warehouses (they run, finish, and suspend). Saved 15%.
- Converted the top 5 largest dbt models to incremental. The 500M row rebuild became a 100K row upsert. Saved 20%.
- Implemented per-team cost attribution and weekly cost reviews. Made cost visible to the people creating it.
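The auto-suspend saving above is simple arithmetic. A back-of-the-envelope sketch, assuming an hourly 5-second job on a warehouse billed per second with a 60-second minimum per resume (the workload numbers are illustrative):

```python
def billed_seconds_per_run(query_seconds: float, auto_suspend_seconds: float,
                           minimum_seconds: float = 60.0) -> float:
    """Seconds billed for one resume/run/suspend cycle: the warehouse stays
    up for the query plus the auto-suspend idle tail, with a per-resume floor."""
    return max(query_seconds + auto_suspend_seconds, minimum_seconds)

runs_per_month = 24 * 30  # hourly job
ten_min = billed_seconds_per_run(5, 600) * runs_per_month
one_min = billed_seconds_per_run(5, 60) * runs_per_month
print(ten_min / one_min)  # ~9.3x more billed compute with a 10-minute suspend
```

For a 5-second query, over 99% of the 10-minute-suspend bill is idle time, which is why the one-line config change shows up as a double-digit percentage saving.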
How a CDC Pipeline Filled a PostgreSQL Disk and Took Down Production
- Immediate: dropped the replication slot to free disk. Restarted Debezium. Accepted 36 hours of missing CDC events (backfilled from a full snapshot).
- Prevention: (1) Set max_slot_wal_keep_size in PostgreSQL to cap WAL retention per slot. (2) Monitor replication slot lag with an alert at 1 hour. (3) Configure Debezium with automatic reconnection and dead-letter queuing for failed events. (4) Run PostgreSQL disk space alerts at a 70% threshold, not 90%.
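The slot-lag monitor in step (2) can be a small query against pg_replication_slots, wrapped in a function that works with any DB-API cursor (a sketch; the 1 GiB threshold and function name are illustrative):

```python
LAG_QUERY = """
SELECT slot_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS lag_bytes
FROM pg_replication_slots
"""

def slots_over_threshold(cursor, max_lag_bytes: int = 1 << 30) -> list:
    """Return replication slots retaining more than `max_lag_bytes` of WAL.
    A stalled Debezium consumer shows up here long before the disk fills."""
    cursor.execute(LAG_QUERY)
    return [name for name, lag in cursor.fetchall() if lag > max_lag_bytes]
```

Run it on a schedule and page when the list is non-empty; pairing it with max_slot_wal_keep_size means a stuck slot is eventually invalidated rather than filling the disk.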
15. Data Engineering Self-Assessment
Rate your confidence in each area. Any “Low” is a gap worth studying before an interview.
| Topic | Check Your Knowledge |
|---|---|
| Warehouse architecture | Can you explain Kimball vs Inmon vs Data Vault and when each fits? |
| Table formats | Can you explain why Iceberg/Delta/Hudi exist and the problems they solve vs raw Parquet? |
| Spark optimization | Can you explain what causes a shuffle, how to broadcast join, and how to handle data skew? |
| Kafka internals | Can you draw the partition → consumer group mapping and explain exactly-once semantics? |
| dbt | Can you explain incremental models, testing strategy, and how dbt fits in the ELT pattern? |
| Data quality | Can you describe a multi-layer data quality strategy with specific tools and thresholds? |
| Schema evolution | Can you handle a source system adding, removing, or changing column types without pipeline breakage? |
| Cost optimization | Can you diagnose and reduce a 10x cost increase in Snowflake/BigQuery? |
| Stream processing | Can you explain event time vs processing time, watermarks, and windowing strategies? |
| Orchestration | Can you compare Airflow, Dagster, and Prefect with specific trade-offs? |
| Data mesh | Can you explain when it works, when it fails, and the four principles? |
| CDC | Can you design a Debezium-based CDC pipeline and explain the operational risks? |