These latency numbers (originally from Jeff Dean, updated for modern hardware) are essential for back-of-envelope calculations and system design interviews:
| Operation | Latency | Notes |
|---|---|---|
| L1 cache reference | 0.5 ns | |
| L2 cache reference | 7 ns | |
| Main memory reference | 100 ns | |
| SSD random read | 16 us | ~16,000 ns |
| Read 1 MB from memory | 250 us | |
| Read 1 MB from SSD | 1 ms | |
| Network round-trip (same datacenter) | 0.5 ms | 500 us |
| Read 1 MB from network | 10 ms | |
| Disk seek (HDD) | 10 ms | |
| Network round-trip (cross-continent) | 150 ms | |
| Read 1 MB from HDD | 20 ms | |
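Using these numbers in practice means summing them along a request's critical path. A small sketch (the dictionary keys and the composite operation are illustrative, not from the source):

```python
# Illustrative helper: estimate end-to-end latency by summing the approximate
# nanosecond figures from the reference table above.
LATENCY_NS = {
    "main_memory_ref": 100,
    "ssd_random_read": 16_000,
    "read_1mb_ssd": 1_000_000,   # 1 ms
    "dc_round_trip": 500_000,    # 0.5 ms, same datacenter
    "hdd_seek": 10_000_000,      # 10 ms
}

def estimate_ns(*operations: str) -> int:
    """Sum the table latencies for a sequence of operations."""
    return sum(LATENCY_NS[op] for op in operations)

# Example: a cache miss that calls another service in the same datacenter,
# which then does one SSD random read -- dominated by the network round-trip.
total = estimate_ns("dc_round_trip", "ssd_random_read")
print(f"{total / 1_000_000:.2f} ms")  # ~0.52 ms
```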
Useful throughput numbers:
- A single PostgreSQL instance: ~5,000-20,000 simple queries/second
- Redis: ~100,000 operations/second
- A single web server (Node.js/Go): ~10,000-50,000 requests/second for simple endpoints
- Kafka: ~1 million messages/second per broker (small messages)
Order of Magnitude. An estimate accurate to within 10x of the real answer. In capacity planning, you do not need exact numbers — you need to know whether you need 1 GB or 1 TB, 100 requests/second or 100,000. If your estimate says 50 GB and reality is 80 GB, you were right (same order of magnitude). If your estimate says 50 GB and reality is 5 TB, you were wrong (two orders of magnitude off). The goal of back-of-envelope math is getting the order of magnitude right.
Think like a contractor, not an accountant. Back-of-envelope estimation is like being a building contractor — before you build, you estimate materials. You do not need exact numbers, you need to know if you are building a shed or a skyscraper. A contractor who says “we need about 50 cubic yards of concrete” is useful. One who says “we need exactly 47.3 cubic yards” has wasted time on false precision. One who says “we need about 5 cubic yards” and then runs out mid-pour has made the project fail. Get the order of magnitude right and you will make the right architectural decisions. Sweat the decimals and you will waste time on precision that does not change any decision.
Forgetting Multiplicative Factors. Raw data is just the beginning. Add replication (3x for databases). Add indexes (20-50% of data size). Add overhead (protocol headers, metadata, JSON serialization bloat). Add headroom (plan for 3x current traffic). A 100 GB estimate easily becomes 500+ GB in reality.
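The compounding is easy to script. A sketch with the factors above as illustrative defaults (headroom for traffic growth would multiply the result further):

```python
def provisioned_gb(raw_gb: float,
                   replication: float = 3.0,    # copies kept by the database
                   index_factor: float = 1.3,   # indexes: 20-50% of data size
                   overhead: float = 1.3) -> float:  # protocol, metadata, JSON bloat
    """Back-of-envelope: raw data size -> storage you actually provision.
    The default factors are illustrative assumptions, not universal constants."""
    return raw_gb * index_factor * overhead * replication

# A 100 GB raw estimate becomes ~507 GB once the multiplicative factors land.
print(round(provisioned_gb(100)))
```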
Tools: Napkin math calculators (back-of-the-envelope.com). Cloud pricing calculators (AWS, GCP, Azure). pgbench, k6 (for validating estimates with real load tests).
Use this reference table during system design interviews and architecture sessions. These are approximate values — the point is speed, not decimal precision.

Storage Sizes
The skill of quickly estimating whether a design will work at the required scale. This comes up constantly in system design interviews and real architecture discussions.

The process: Start with the requirement (e.g., “store 5 years of user activity”). Estimate the data size per event. Estimate the volume (events/day x 365 x 5). Multiply. Add overhead (indexes, replication, headroom). Compare against known limits of the technology you are considering.

Example — URL Shortener storage: 100 million new URLs/month. Each record: short code (7 bytes) + original URL (average 200 bytes) + metadata (50 bytes) = ~257 bytes. Per month: 100M x 257 bytes = ~25.7 GB. Per year: ~308 GB. Over 5 years: ~1.5 TB. This fits easily in a single PostgreSQL instance. No need for sharding — the problem is read throughput (caching), not storage.

Example — Chat message throughput: 10 million daily active users, average 50 messages/day = 500 million messages/day = ~5,787 messages/second average, ~17,000 messages/second peak (3x). Each message: ~500 bytes. Daily storage: 500M x 500 bytes = 250 GB/day. This helps decide: do we need Kafka? (Yes, for the throughput.) Do we need sharding? (Eventually, for storage.) Can a single database handle it? (Not long-term.)
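The URL-shortener arithmetic above, written as a few lines of Python you could adapt for other estimates:

```python
# URL shortener estimate: record size x volume, then extend over the horizon.
RECORD_BYTES = 7 + 200 + 50          # short code + average URL + metadata ≈ 257
URLS_PER_MONTH = 100_000_000

monthly_gb = URLS_PER_MONTH * RECORD_BYTES / 1e9
five_year_tb = monthly_gb * 12 * 5 / 1e3

# ~25.7 GB/month, ~1.5 TB over 5 years -- comfortably single-instance storage.
print(f"{monthly_gb:.1f} GB/month, {five_year_tb:.2f} TB over 5 years")
```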
Interview Question: Estimate the storage needed for a service that stores all tweets for 5 years.
Strong answer: ~500 million tweets/day (public figure). Average tweet: 280 characters = ~280 bytes text + 200 bytes metadata (user_id, timestamp, geo, reply_to, etc.) = ~480 bytes. Many tweets have media — store media separately in object storage, just the URL reference in the DB (~100 bytes). So ~580 bytes per tweet. Daily: 500M x 580 = 290 GB. Yearly: ~106 TB. 5 years: ~530 TB. This requires distributed storage — no single database handles this. Add replication (3x) = ~1.6 PB. Plus indexes. This is why Twitter built custom infrastructure. The point is not getting the exact number — it is showing you can reason through scale.
Interview Question: Estimate the storage needed for a YouTube-like platform — 500 hours of video uploaded per minute, 1 billion daily views. Walk me through your math.
Strong answer structure:

Step 1 — Ingestion (uploads). 500 hours/minute x 60 x 24 = 720,000 hours/day, or ~263 million hours/year. Average video: assume a mix of resolutions. Raw upload average ~1 GB/hour (compressed H.264 at 1080p is roughly 1.5–3 GB/hour, but many uploads are lower resolution or shorter clips, so ~1 GB/hour is a reasonable average). Daily upload volume: 720,000 hours x 1 GB = ~720 TB/day raw. But YouTube stores multiple resolutions (144p, 360p, 720p, 1080p, 4K). A reasonable multiplier for transcoded copies is 5–8x the raw storage. So daily storage after transcoding: 720 TB x 6 = ~4.3 PB/day. Yearly: ~1.6 EB/year. Over 5 years: ~8 EB just for video files. This goes to object storage (Google uses Colossus, you would use S3 or GCS).

Step 2 — Metadata. Each video has metadata: title, description, uploader, timestamps, tags, category, thumbnails, view count, comments count. Estimate ~5 KB per video. With an average video length of ~5 minutes, 720,000 hours/day of uploads is ~8.6 million videos/day. Daily metadata: 8.6M x 5 KB = ~43 GB/day. Yearly: ~16 TB. Trivially small next to the video files — fits in a single database cluster.

Step 3 — Thumbnails. Each video generates multiple thumbnails (auto-generated + custom). Assume 5 thumbnails at ~50 KB each = 250 KB per video. 8.6M videos/day x 250 KB = ~2.2 TB/day. Small relative to video.

Step 4 — Serving (reads). 1 billion daily views. Average watch time: assume 5 minutes at 720p average (~0.5 GB/hour = ~42 MB for 5 minutes). Daily egress: 1B x 42 MB = ~42 PB/day of bandwidth. This is the real cost driver — per byte, CDN egress costs far more than object storage. This is why YouTube has one of the largest CDN networks in the world.

Step 5 — Summary. Storage grows at ~4.3 PB/day (~1.6 EB/year), dominated by video files. Bandwidth is dominated by reads at ~42 PB/day. The key insight: for a video platform, bandwidth and CDN cost dwarf storage cost. Storage is a solved problem (object storage scales indefinitely).
The hard problems are transcoding throughput (processing 500 hours/minute of video), CDN distribution (serving petabytes/day globally with low latency), and cost optimization (tiering storage, adaptive bitrate streaming).
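The unit conversions in Step 1 are easy to fumble in an interview; a few lines of Python, under the same assumptions (500 hours uploaded per minute, ~1 GB/hour raw, 6x transcoding multiplier), keep them honest:

```python
# Ingestion math for the video-platform estimate. All inputs are the stated
# interview assumptions, not measured YouTube figures.
hours_per_day = 500 * 60 * 24                  # 500 h/min -> 720,000 h/day
raw_tb_per_day = hours_per_day * 1 / 1e3       # 1 GB/hour, expressed in TB
stored_pb_per_day = raw_tb_per_day * 6 / 1e3   # 6x for transcoded copies
yearly_eb = stored_pb_per_day * 365 / 1e3      # PB/day -> EB/year

print(f"{raw_tb_per_day:.0f} TB/day raw, "
      f"{stored_pb_per_day:.1f} PB/day stored, ~{yearly_eb:.1f} EB/year")
```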
Follow graph total: 100B edges x 24 bytes each = ~2.4 TB (manageable in a graph database or sharded relational DB)
Step 7: Summary.
| Component | Daily | Yearly | 5 Years | Storage Technology |
|---|---|---|---|---|
| User profiles | - | - | ~3 TB | Sharded RDBMS |
| Tweets | 192 GB | 70 TB | ~1.4 PB (with replication) | Distributed DB (Cassandra, etc.) |
| Images | 16 TB | ~5.8 PB | ~29 PB | Object storage (S3) |
| Videos | 100 TB | ~36.5 PB | ~182 PB | Object storage (S3) + CDN |
| Follow graph | - | - | ~2.4 TB | Graph DB / sharded RDBMS |
| Timelines (cache) | - | - | ~10-50 TB | Redis cluster |
The key insight from this exercise: media storage dwarfs everything else. Tweet text and metadata are large in aggregate but manageable with distributed databases. Images and especially video require object storage at petabyte scale plus a CDN for delivery. This is why companies like Twitter invest heavily in media infrastructure and CDN partnerships.
Real-World Story: How Google Manages Billions of Lines of Code in a Single Repository
Google stores virtually all of its code — billions of lines across tens of thousands of projects — in a single monolithic repository. As of their 2016 paper, the repository contained over 2 billion lines of code across 9 million source files, with 86 TB of content and roughly 45,000 commits per day from 25,000+ developers.

They do not use Git for this. Git was not designed for a repository of this scale — operations like git status and git clone would take hours or days on a repo this size. Instead, Google built a custom version control system called Piper, backed by their distributed storage infrastructure (Bigtable and Colossus). Developers do not clone the entire repo. They use a virtual filesystem called CitC (Clients in the Cloud) that presents the illusion of having the full repo locally, while fetching only the files you actually open or build.

Why a monorepo? Because dependency management across thousands of projects becomes tractable. If Team A changes an API, they can also update every caller of that API across the company in a single commit (called a “large-scale change” or LSC). This eliminates dependency versioning hell — there is only one version of every library: the one at HEAD. Google built custom tooling (Rosie, TAP) to automatically test the impact of changes across the entire codebase before they land.

The lesson for capacity estimation: Google’s monorepo is a masterclass in understanding your scale constraints and building tools to match. They estimated that a standard VCS could not handle their volume, validated that estimate, and built custom infrastructure. They did not guess — they did the math. 86 TB of code. 45,000 commits/day. 800,000 queries/second to the version control backend. Those numbers told them exactly what architecture they needed.
Continuous Integration (CI). The practice of merging all developer working copies to a shared mainline multiple times a day. The key insight: integration pain grows exponentially with the time between integrations. Merging a 2-day branch is easy. Merging a 2-week branch is painful. Merging a 2-month branch is a project in itself.
Long-Lived Feature Branches. A branch that lives for 3+ weeks almost always ends in a painful merge. The code has diverged from main, other features have shipped, and conflicts are everywhere. If a feature takes 3 weeks, break it into smaller increments that can be merged daily behind a feature flag. The branch should never live longer than the feature flag.
Tools: GitHub (PRs, code review, Actions). GitLab (merge requests, CI/CD). Bitbucket (PRs, Pipelines). Conventional Commits (standardized commit message format). git-absorb, git-revise (tools for cleaning up commit history before merge).
Interview Question: Your team has 12 engineers and frequent merge conflicts. What do you do?
Strong answer: Merge conflicts are a symptom of long-lived branches and overlapping work. Fix: (1) Move to trunk-based development or enforce short-lived branches (merged within 1-2 days). (2) Break large features into smaller PRs — a 2000-line PR is a review bottleneck and a conflict magnet. (3) Reduce PR review latency — if PRs wait 3 days for review, branches live 3 extra days. Set a team norm: review within 4 hours. (4) Improve code ownership clarity — if two teams keep modifying the same files, they need to coordinate or the code needs to be restructured so each team owns separate files. (5) Use feature flags so incomplete work can be merged to main without being released.
Trunk-Based Development: All developers commit to main (or trunk). Short-lived feature branches (hours, not weeks). Feature flags hide incomplete work. CI runs on every commit. This is the strategy used by Google, Facebook, and most high-performing teams measured in the Accelerate research. Requires strong CI, feature flags, and automated testing.

GitHub Flow: Create a branch from main, do work, open a pull request, review, merge to main, deploy. Simple, effective for most teams. Branches should be short-lived (1-3 days). Long-lived branches create merge conflicts and delay integration.

GitFlow: Separate develop, release, and hotfix branches alongside main. Popular but complex. Appropriate for software with formal release cycles (mobile apps, packaged software). Overkill for web services that deploy continuously.

The rule: Start with trunk-based development or GitHub Flow. Only adopt GitFlow if you have a genuine need for parallel release management.
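The flag mechanism that makes trunk-based development possible can be very small. An illustrative Python sketch (flag and function names are invented; real teams typically use LaunchDarkly, Unleash, or a config service):

```python
# Minimal feature-flag sketch: incomplete work merges to main behind a flag
# and stays dark until the flag is flipped. All names here are hypothetical.
FLAGS = {"new_checkout_flow": False}

def is_enabled(flag: str) -> bool:
    return FLAGS.get(flag, False)

def legacy_checkout(cart: list) -> str:
    return f"legacy:{len(cart)}"

def new_checkout(cart: list) -> str:   # merged to main, not yet released
    return f"new:{len(cart)}"

def checkout(cart: list) -> str:
    if is_enabled("new_checkout_flow"):
        return new_checkout(cart)
    return legacy_checkout(cart)

print(checkout(["book", "pen"]))  # legacy:2 -- the new path ships but is dark
```

Flipping the flag in config (not in code) releases the feature without a deploy, and reverting it is the rollback strategy.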
| Aspect | Trunk-Based Development | GitHub Flow | GitFlow |
|---|---|---|---|
| Best for | SaaS, web services, high-deployment-frequency teams | Most web projects, startups, open source | Mobile apps, packaged software, regulated environments |
| Merge conflict risk | Very low | Low (if branches are short) | High (long-lived branches diverge) |
| CI/CD requirement | Strong CI mandatory | CI recommended | CI helpful but not critical |
| Rollback strategy | Feature flags / revert commit | Revert commit / redeploy | Hotfix branch from main |
How to choose: If your team deploys multiple times per day, use trunk-based development. If your team deploys once per day or per sprint, use GitHub Flow. If you ship versioned software with parallel supported releases (e.g., a mobile app with v3.1 in production and v3.2 in beta), consider GitFlow. When in doubt, start with GitHub Flow — it is the simplest model that works for the majority of teams.
Interview Question: Your team uses GitFlow but releases are painful and slow. How would you propose migrating to trunk-based development?
Strong answer:

First, diagnose why GitFlow is painful. Common symptoms: merge conflicts between long-lived develop and release branches, hotfixes that need cherry-picking across multiple branches, release branches that accumulate so many changes that testing them is a project in itself, and a “merge day” before each release where the team stops feature work to resolve conflicts.

The migration plan — do it incrementally, not as a big bang:

Phase 1 — Shorten branch lifetimes (2-4 weeks). Before changing the branching model, start merging feature branches to develop more frequently. Set a team norm: no branch lives longer than 2 days. Break large features into smaller, independently mergeable increments. This alone reduces most merge pain.

Phase 2 — Introduce feature flags (2-4 weeks). Deploy a feature flag system (LaunchDarkly, Unleash, or even a simple config file). Start wrapping incomplete features behind flags so that partially complete work can be merged to develop without being visible to users. This is the critical prerequisite — without feature flags, trunk-based development does not work because you cannot merge incomplete features.

Phase 3 — Eliminate the develop branch (1-2 weeks). Once branches are short-lived and feature flags are in place, develop is no longer adding value — it is just an extra step between your branch and main. Start merging directly to main. Keep the CI pipeline running on every merge. The develop branch becomes a mirror of main and can be retired.

Phase 4 — Eliminate release branches (gradual). With CI/CD deploying from main, release branches become unnecessary for web services. If you still ship versioned software (mobile apps), you may keep lightweight release branches that are cut from main and only receive cherry-picked hotfixes — but no feature development happens on them.

Phase 5 — Invest in CI and automated testing. Trunk-based development requires that main is always deployable. This means: fast CI (under 10 minutes), comprehensive automated tests, and a culture where a broken main is treated as a stop-the-world emergency. If your CI takes 45 minutes, fix that before going fully trunk-based.

Key risks to address: (1) Team members accustomed to long-lived branches will need coaching — pair programming helps. (2) Feature flags add complexity — clean them up aggressively. (3) CI must be fast and reliable — flaky tests undermine trunk-based development because people stop trusting the green build.

The strongest answer acknowledges that this is as much a cultural change as a technical one. The tooling is straightforward. Getting 12 engineers to change how they work daily requires buy-in, patience, and visible wins early in the migration.
Merge commit: Preserves full branch history. Creates a merge commit. Good for seeing exactly what a PR contained. Noisy history.

Squash merge: Combines all branch commits into a single commit on main. Clean history. Loses granular commit history (which rarely matters). Most teams prefer this for PRs.

Rebase: Replays branch commits on top of main. Linear history. Can cause confusion with force pushes. Best for personal workflow, risky for shared branches.
For authors: Keep PRs small (under 400 lines changed). Write a clear description (what, why, how, testing). Self-review before requesting review. Respond to feedback promptly. Do not take feedback personally.

For reviewers: Review within 24 hours (review latency kills velocity). Focus on correctness, clarity, and maintainability — not style preferences (use linters for that). Ask questions instead of making demands (“What happens if this is null?” vs “This will crash on null”). Approve when it is “good enough” — do not block on perfection. Distinguish blocking issues from nits.

What to look for: Logic errors. Missing edge cases. Security vulnerabilities. Performance concerns at scale. Missing tests for changed behavior. API contract changes. Error handling gaps. Race conditions.
Use this checklist when reviewing pull requests. Not every item applies to every PR, but scanning through this list ensures you do not miss critical issues.
| Category | What to Check |
|---|---|
| Correctness | Does the code do what the PR description says? Are edge cases handled (null, empty, overflow, boundary values)? Are error paths correct? |
| Readability | Can you understand the code in a single pass? Are names descriptive? Is the logic straightforward or needlessly clever? |
| Performance | Any N+1 queries? Unnecessary loops over large data sets? Missing indexes for new queries? Unbounded result sets? |
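The N+1 item in the performance row is worth a concrete illustration. A sketch using an in-memory SQLite database (the schema and data are invented for the demo):

```python
# Illustrating the N+1 query pattern reviewers should flag, and its fix.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, total REAL);
    INSERT INTO users VALUES (1, 'ada'), (2, 'lin');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 5.0), (3, 2, 7.5);
""")

# N+1: one query for users, then one query PER user -> N extra round-trips.
users = db.execute("SELECT id, name FROM users").fetchall()
n_plus_one = {name: db.execute(
    "SELECT COALESCE(SUM(total), 0) FROM orders WHERE user_id = ?", (uid,)
).fetchone()[0] for uid, name in users}

# Fix: a single JOIN + GROUP BY does the same work in one round-trip.
joined = dict(db.execute("""
    SELECT u.name, COALESCE(SUM(o.total), 0)
    FROM users u LEFT JOIN orders o ON o.user_id = u.id
    GROUP BY u.name
""").fetchall())

assert n_plus_one == joined == {"ada": 15.0, "lin": 7.5}
```

On two users the difference is invisible; on ten thousand it is ten thousand extra round-trips, which is why it rarely shows up until production scale.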
Recognizing these patterns helps teams improve their review culture.
Nit-Picking Style. A reviewer blocks a PR over brace placement, import ordering, or variable naming preferences that are not covered by the team’s style guide. The fix: Automate style enforcement with linters and formatters (Prettier, ESLint, Black, gofmt). If the formatter does not flag it, it is not worth blocking a PR. Prefix optional style suggestions with “Nit:” so the author knows they are non-blocking.
Rubber-Stamping. A reviewer approves without actually reading the code — clicking “Approve” within 30 seconds of a 500-line PR. This defeats the purpose of code review and lets bugs, security issues, and design problems slip through. The fix: Set a team expectation that reviews include at least one substantive comment or question. Track review quality, not just review speed.
Blocking on Personal Preferences. A reviewer refuses to approve because they would have implemented the feature differently — not because the submitted approach is wrong. “I would have used a strategy pattern here” is not a blocking issue if the submitted approach is correct, readable, and maintainable. The fix: Distinguish between “request changes” (blocking — there is a bug, security issue, or maintainability concern) and “comment” (non-blocking — suggestion for a different approach). Only block on objective issues, not subjective preferences.
Enormous PRs. An author submits a 3,000-line PR and expects a meaningful review. No reviewer can effectively review 3,000 lines. The fix: Set a team norm for PR size (under 400 lines is a good target). If a feature requires more, break it into stacked PRs or incremental changes. A PR that is too large to review is too large to merge safely.
Real-World Story: How GitHub Scales Code Review Across Thousands of Engineers
GitHub itself — the company that hosts code review for millions of teams — has a distinctive code review culture worth studying. With over 3,000 employees and hundreds of engineers shipping to a monolithic Ruby on Rails application (and increasingly a set of microservices), they have had to solve the “code review at scale” problem for their own codebase, not just for their customers.

How they do it: GitHub uses their own pull request workflow (naturally), but the interesting part is the culture and tooling around it. They practice what they call “review-driven development” — PRs are the unit of work, and no code reaches main without at least one substantive review. They use CODEOWNERS files extensively, which automatically assign reviewers based on which files a PR touches. This prevents the bottleneck of a few senior engineers being the review gateway for everything.

Review latency as a metric: GitHub treats review latency (the time between a PR being opened and receiving its first substantive review) as a key engineering health metric. Their internal target is that most PRs receive a review within a few hours. They found that when review latency creeps up, branch lifetimes grow, merge conflicts increase, and deploy frequency drops — a cascading effect that slows the entire organization.

The “ship small” principle: GitHub’s engineering blog has documented their emphasis on small, incremental PRs. They encourage engineers to break work into PRs of under 400 lines when possible. Smaller PRs get reviewed faster, reviewed more thoroughly, and merge with fewer conflicts. They use feature flags extensively to allow incomplete features to be merged behind a flag.

The lesson: Code review is not just a quality gate — it is a communication mechanism. GitHub’s approach shows that the biggest leverage is not in what reviewers look for (checklists help, but they are secondary) but in how fast the review loop turns. Fast reviews mean short branches mean few conflicts mean fast deploys. Slow reviews poison the entire development cycle.
Further reading: Chelsea Troy’s “Reviewing Pull Requests” — one of the best deep dives on what makes code review effective. Covers cognitive load, the reviewer’s mindset, and practical techniques for giving useful feedback. Her follow-up posts on reviewing with empathy and handling disagreements in review are equally valuable.
Further reading:
- Conventional Commits specification — a lightweight convention for commit messages that adds structure (feat, fix, chore, docs, refactor) and enables automated changelogs and semantic versioning. Worth adopting for any team that wants consistent, parseable commit history.
- Martin Fowler on Trunk-Based Development — the definitive reference on trunk-based development patterns, including feature flags, branch by abstraction, and managing releases from trunk.
Data pipelines are like plumbing. Nobody notices when the plumbing works perfectly — water flows, toilets flush, life is good. But everyone notices when it is backed up. Data pipelines are the same. When they work, clean data flows from source systems into dashboards, ML models, and business reports on schedule and nobody thinks about it. When a pipeline breaks or delivers stale data, the CEO is asking why the revenue numbers are wrong at 8 AM. Like plumbing, the best data engineering is invisible. And like plumbing, the cost of fixing a problem grows exponentially the further downstream you discover it. A broken join in a staging model is a 10-minute fix. That same broken join discovered three months later when a financial report is wrong is a week-long investigation with trust damage that takes months to repair.
Data Lineage. Tracking where data came from, how it was transformed, and where it went. When a dashboard shows incorrect revenue numbers, data lineage lets you trace back: the dashboard reads from a BI table, which was built by a dbt model, which joins data from the orders table and the refunds table, which were loaded by Fivetran from the production database. Without lineage, debugging data issues is guesswork.
“The Data Is Wrong.” The most common data engineering incident. Usually caused by: a schema change in the source that broke the pipeline silently, a pipeline that ran before the source data was complete (partial day’s data), duplicate records from a retry that was not idempotent, or a business logic change that the transformation did not account for. Build data quality checks into every pipeline stage.
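One of the causes above — duplicates from a non-idempotent retry — is avoidable by keying writes on a unique event id. A minimal sketch with an in-memory SQLite table (the schema and event ids are illustrative):

```python
# Idempotent load: keyed on a unique event_id, so re-running a batch after a
# partial failure cannot create duplicates. Schema/names are hypothetical.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (event_id TEXT PRIMARY KEY, amount REAL)")

def load(batch):
    # INSERT OR REPLACE: loading the same batch twice is a no-op, not a dup.
    db.executemany(
        "INSERT OR REPLACE INTO events (event_id, amount) VALUES (?, ?)", batch
    )

batch = [("evt-1", 9.99), ("evt-2", 4.50)]
load(batch)
load(batch)  # simulated retry after a partial failure

count = db.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(count)  # 2, not 4 -- the retry was safe
```

The same idea applies at warehouse scale: MERGE/upsert on a natural key, or deduplicate on event id in the first transformation stage.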
Interview Question: A business analyst reports that yesterday's revenue dashboard is $50,000 lower than expected. How do you investigate?
Strong answer: Start at the dashboard and work backward. Check: did the pipeline run successfully yesterday? (Airflow/Dagster logs.) Did it process the expected number of records? (Row count checks.) Compare the raw source data against the transformed data — is the discrepancy in extraction (missing data from source) or transformation (wrong calculation)? Common causes: a new payment type that the transformation does not handle, orders stuck in a “pending” status that are not counted, a timezone issue (the pipeline uses UTC but the dashboard shows local time, so some orders fall on the wrong day), or a failed job that was silently skipped. The fix depends on the cause, but the investigation pattern is always: trace the data backward from the wrong output to the source, comparing row counts and amounts at each stage.
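The backward trace can be sketched mechanically: compare the same figure at each pipeline stage, from output toward source, and stop at the first divergence. The stage names and totals below are hypothetical:

```python
# Walk the lineage from dashboard back to source; the first stage pair whose
# totals disagree localizes the bug. Figures are invented for the demo.
stages = [
    ("dashboard",    950_000.00),
    ("bi_table",     950_000.00),
    ("dbt_model",    950_000.00),
    ("raw_orders", 1_000_000.00),
]

def first_divergence(stages, tolerance=0.01):
    """Return (upstream, downstream, delta) at the first mismatch, else None."""
    for (down, down_total), (up, up_total) in zip(stages, stages[1:]):
        if abs(down_total - up_total) > tolerance:
            return up, down, up_total - down_total
    return None

src, dst, delta = first_divergence(stages)
print(f"${delta:,.2f} lost between {src} and {dst}")
# The $50,000 disappears in the dbt_model transformation -- look there first.
```

In practice the "totals" are row counts and sums pulled from each stage's table, but the investigation pattern is exactly this loop.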
ETL (Extract, Transform, Load): Data is extracted from sources, transformed in a staging area, then loaded into the destination. Traditional approach. Transformation happens before loading, so only clean data enters the warehouse. Good when the destination has limited compute (traditional data warehouses).

ELT (Extract, Load, Transform): Data is extracted and loaded raw into the destination, then transformed there. Modern approach enabled by powerful cloud data warehouses (BigQuery, Snowflake, Redshift) that have cheap, scalable compute. Faster initial loading. Transforms can be iterated without re-extraction. This is the dominant pattern today.
| Approach | Best for |
|---|---|
| ETL | Legacy systems, constrained warehouse compute, strict compliance (only approved data enters the warehouse) |
| ELT | Modern cloud warehouses, fast iteration, analytics teams that own transformations |
The industry shift: ELT has become the dominant pattern because cloud data warehouses (Snowflake, BigQuery, Redshift) offer nearly unlimited, pay-per-query compute. Storing raw data in the warehouse and transforming it with dbt gives analysts flexibility to iterate on transformations without involving the data engineering team for re-extraction. ETL still makes sense when you need to filter sensitive data before it enters the warehouse (e.g., stripping PII before loading into a shared analytics environment) or when working with on-premises systems with limited compute.
Batch: Process accumulated data at scheduled intervals (hourly, daily). Simpler, cheaper, handles large volumes efficiently. Use for: daily reports, analytics aggregation, data warehouse loading, ML model training, end-of-day reconciliation.

Stream: Process data as it arrives, continuously. More complex, higher infrastructure cost, but enables real-time insights. Use for: fraud detection, live dashboards, real-time recommendations, alerting, event-driven architectures.

The practical reality: Most companies need both. Stream processing for time-sensitive operations (fraud alerts within seconds), batch processing for everything else (daily reports, historical analytics). Do not default to stream processing — it is significantly more complex to build, debug, and operate.
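The complexity asymmetry is visible even in a toy example. A complete batch "job" is a few lines — read all input, aggregate, write output — with no windows, watermarks, or state stores (the events are invented):

```python
# A whole batch job in miniature: all of yesterday's events are already on
# disk, so aggregation is a plain loop. Event data is illustrative.
from collections import defaultdict
from datetime import datetime

events = [
    {"ts": "2024-05-01T09:30:00", "amount": 20.0},
    {"ts": "2024-05-01T17:10:00", "amount": 15.0},
    {"ts": "2024-05-02T08:00:00", "amount": 40.0},
]

def daily_revenue(events):
    totals = defaultdict(float)
    for e in events:
        day = datetime.fromisoformat(e["ts"]).date().isoformat()
        totals[day] += e["amount"]
    return dict(totals)

print(daily_revenue(events))  # {'2024-05-01': 35.0, '2024-05-02': 40.0}
```

The streaming version of this same aggregation must decide how long to wait for late events, when a day's window is "closed", and where the running totals live if the process restarts — none of which the batch version has to think about.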
Real-World Story: Spotify’s Journey from Batch to Streaming
Spotify’s data infrastructure evolution is one of the best-documented cases of a company navigating the batch-to-streaming transition — and learning the hard way that “just stream everything” is not the answer.

The early days (batch-first). Spotify started with a Hadoop-based batch processing architecture. Every night, massive MapReduce jobs would crunch the previous day’s listening data — billions of events — to produce the analytics, recommendations, and royalty calculations that powered the business. This worked well when Spotify had tens of millions of users. Daily batch jobs ran overnight, and by morning the data team had fresh dashboards and the recommendation models had updated features.

The pain point. As Spotify scaled past 100 million users and the event volume grew into the hundreds of billions per day, the batch architecture started breaking. Jobs that used to finish in 2 hours were taking 8. The “nightly batch window” was not enough — some jobs were still running when users woke up. Worse, the batch-only model meant that features like Discover Weekly and personalized playlists could only update daily. If a user listened to jazz all morning, the recommendations would not reflect that until the next day’s batch run.

The hybrid architecture. Rather than ripping out batch and going fully streaming, Spotify built a hybrid architecture. They adopted Google Cloud Dataflow (based on Apache Beam) for streaming use cases where latency mattered — real-time listening activity, live A/B test metrics, and fraud detection. But they kept batch processing (migrating from Hadoop to Google BigQuery and Scio, their open-source Scala library for Apache Beam) for the heavy analytical workloads where daily granularity was perfectly fine: royalty calculations, long-term trend analysis, and ML model training.

The lesson. Spotify’s story illustrates a pattern that almost every data-intensive company discovers: you end up with both batch and streaming, not one or the other.
The key skill is knowing which workloads belong where. Spotify engineers have talked publicly about the mistake of over-eagerly migrating batch workloads to streaming — the added complexity of managing streaming state, handling late-arriving events, and debugging time-windowed aggregations was not worth it for workloads where “yesterday’s data by 8 AM” was a perfectly acceptable SLA.
Further reading:Netflix Technology Blog — data pipeline architecture posts — Netflix’s data engineering team has published extensively on their pipeline architecture, including their migration from batch to a hybrid batch/streaming model, their use of Apache Spark and Flink, and how they handle data quality at massive scale. Their posts on “Data Mesh” adoption and the evolution of their data platform are particularly valuable for understanding how large organizations manage data infrastructure.
| Aspect | Batch | Stream |
|---|---|---|
| Complexity | Lower | Higher (ordering, exactly-once, late arrivals, watermarks) |
| Infrastructure cost | Lower (runs periodically, can use spot instances) | Higher (always running, dedicated compute) |
| Error recovery | Re-run the entire batch | Complex (replay from offset, handle duplicates) |
| Throughput | Very high (optimized for bulk) | High but costlier per event |
| State management | Straightforward (read all input, produce output) | Complex (windowing, checkpointing, state stores) |
| Debugging | Easier (input/output are files, reproducible) | Harder (transient state, time-dependent behavior) |
| Use cases | Daily reports, data warehouse loading, ML training, reconciliation, historical analytics | Fraud detection, live dashboards, alerting, real-time recommendations, IoT event processing |
Do not default to stream processing. Stream processing is an order of magnitude more complex to build, test, debug, and operate than batch. If your business requirement is “the dashboard should update daily,” batch is the right choice. Stream processing is justified only when the business explicitly needs sub-minute latency (fraud detection, live bidding, real-time alerts). Many teams have wasted months building streaming pipelines for use cases where a cron job running every hour would have been sufficient.
- Extract: Fivetran, Airbyte, Stitch (managed connectors from SaaS tools and databases to the warehouse).
- Load: Direct loading into BigQuery, Snowflake, Redshift, Databricks.
- Transform: dbt (SQL-based transformations with version control, testing, and documentation — the standard tool).
- Orchestrate: Airflow, Dagster, Prefect (schedule and manage pipeline dependencies).
- Serve: Looker, Metabase, Tableau (BI and visualization), or reverse ETL back to operational systems.
These three tools cover the majority of modern data processing needs. Knowing when to use each is a critical skill.

Apache Spark

Spark is a distributed data processing engine designed for large-scale batch processing and analytics. It processes data in parallel across a cluster, making it ideal for workloads that would take hours on a single machine.
| Aspect | Details |
| --- | --- |
| Primary mode | Batch processing (with Structured Streaming for micro-batch stream processing) |
| Languages | Python (PySpark), Scala, Java, SQL |
| Best for | ETL on massive data sets (terabytes+), ML feature engineering, data lake processing, complex transformations that exceed SQL expressiveness |
| Not ideal for | True low-latency streaming (sub-second), small data sets (overhead is not worth it), simple SQL transformations (use dbt instead) |
| Where it runs | Databricks, AWS EMR, Google Dataproc, self-managed clusters |
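Spark's execution model (split the data into partitions, process partitions in parallel, combine the partial results) can be sketched in plain Python. This is a conceptual toy, not PySpark: Spark distributes partitions across executor processes on a cluster, while this sketch uses threads in one process for simplicity.

```python
# Conceptual sketch of Spark's partition-parallel model: a
# per-partition "map" step followed by a "reduce" that merges
# partial aggregates. In real PySpark this would be something
# like df.groupBy("key").count() running on a cluster.
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def process_partition(rows):
    """Per-partition aggregation (the 'map' side)."""
    return Counter(rows)

def parallel_count(rows, n_partitions=4):
    # Split the input into roughly equal partitions.
    partitions = [rows[i::n_partitions] for i in range(n_partitions)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        for partial in pool.map(process_partition, partitions):
            total += partial  # the 'reduce' side
    return total
```

The "data skew" problem mentioned under quick wins later in this section is visible even here: if one partition ends up far larger than the others, the whole job waits on it.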
Apache Flink
Flink is a distributed stream processing engine designed for true real-time, event-at-a-time processing. Unlike Spark Structured Streaming (which uses micro-batches), Flink processes each event individually with very low latency.
| Aspect | Details |
| --- | --- |
| Primary mode | Stream processing (with batch as a special case of bounded streams) |
| Languages | Java, Scala, Python, SQL |
| Best for | True real-time processing (fraud detection, live analytics), event-driven applications, complex event processing (CEP), applications needing exactly-once semantics on streams |
| Not ideal for | Simple batch ETL (overkill — use Spark or dbt), teams without streaming expertise (steep learning curve) |
| Where it runs | AWS Kinesis Data Analytics, Confluent Cloud, self-managed clusters |
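One of the core concepts Flink implements is the event-time tumbling window: events carry their own timestamps and are assigned to fixed-size windows based on when they happened, not when they arrived. A minimal pure-Python sketch of the windowing logic (names and window size are illustrative):

```python
# Event-time tumbling windows: each (event_time, key) pair is
# bucketed into the window containing its own timestamp.
from collections import defaultdict

def tumbling_window_counts(events, window_size_s=60):
    """events: iterable of (event_time_seconds, key) pairs.
    Returns {(window_start, key): count} for fixed-size windows."""
    counts = defaultdict(int)
    for event_time, key in events:
        window_start = (event_time // window_size_s) * window_size_s
        counts[(window_start, key)] += 1
    return dict(counts)

# Note: an event with timestamp 30 that arrives *after* one with
# timestamp 130 still lands in the 0-60s window. Deciding how long
# to wait for such late events is what Flink's watermarks are for.
```

This sketch keeps all windows in memory forever; Flink's checkpointed state stores and watermark-driven window expiry exist precisely because a real stream never ends.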
dbt (data build tool)
dbt is a SQL-first transformation tool that lets analytics engineers define transformations as versioned, tested, documented SQL models. It has become the standard for the “T” in ELT.
| Aspect | Details |
| --- | --- |
| Primary mode | SQL-based transformations inside the data warehouse |
| Languages | SQL + Jinja templating |
| Best for | Data warehouse transformations, building analytics models, data quality testing, documentation and lineage tracking, enabling analysts to own their transformations |
| Not ideal for | Real-time processing, non-SQL transformations, processing data outside the warehouse |
| Where it runs | dbt Cloud, or dbt Core (open source CLI) against any supported warehouse (Snowflake, BigQuery, Redshift, Databricks, Postgres) |
Decision framework: If your data fits in a single warehouse and transformations are expressible in SQL, use dbt. If you have terabytes of raw data in a data lake that needs heavy processing before loading, use Spark. If you need true real-time processing with sub-second latency, use Flink. Many production data platforms use all three: Flink for real-time event processing, Spark for heavy batch ETL, and dbt for warehouse-layer transformations.
Real-World Story: How Airbnb’s Data Quality Crisis Led to Minerva
Airbnb’s data quality journey is a cautionary tale that became a success story — and it is one of the best-documented examples of why data quality cannot be an afterthought.
The problem. By the mid-2010s, Airbnb had grown rapidly and so had their data infrastructure. Hundreds of engineers and analysts were writing SQL queries, building dashboards, and creating derived data sets. The problem: there was no single source of truth for core business metrics. Different teams had different definitions of “active listing,” “booking,” and “revenue.” One team’s revenue dashboard would show a number 15% different from another team’s — both built from the same raw data but with different filters, joins, and business logic applied. Leadership could not trust the numbers. Meetings that should have been about strategy devolved into arguments about whose dashboard was right.
The investigation. Airbnb’s data team traced the root cause: metric definitions were scattered across hundreds of SQL queries, dbt models, and notebook analyses. There was no central, authoritative definition of what “revenue” meant — gross or net? Including cancellations or not? In the host’s currency or USD? Different analysts made different reasonable choices, and over time the definitions diverged silently.
The solution: Minerva. Airbnb built Minerva, a centralized metrics platform that serves as the single source of truth for all business metrics. Minerva enforces that every metric has one canonical definition — written in a structured configuration language, version-controlled, reviewed, and tested. When an analyst or a dashboard needs “revenue,” they query Minerva rather than writing their own SQL. Minerva handles the joins, filters, currency conversions, and edge cases consistently every time.
Key design decisions in Minerva:
Metric definitions as code: Every metric is defined in a YAML-like configuration file, version-controlled in Git, and subject to code review. Changing a metric definition is a PR, not an ad-hoc SQL edit.
Dimensional consistency: Minerva enforces that metrics can be sliced by a standard set of dimensions (date, country, platform, etc.) and that these dimensions are defined once and reused.
Data quality checks built in: Every metric computation includes automated validation — row counts, null rates, bounds checking, and comparison against previous runs.
Self-serve with guardrails: Analysts can explore and slice metrics freely, but they cannot redefine them. New metrics go through a formal definition and review process.
The outcome. After adopting Minerva, Airbnb reported that metric discrepancies dropped dramatically. The “whose numbers are right?” debates largely disappeared. Data trust increased across the organization, and the data team could spend time building new capabilities rather than debugging inconsistencies.
The lesson for interviews: When discussing data quality, Airbnb’s story shows that the problem is usually not bad data in the source — it is inconsistent interpretation in the transformation layer. The fix is not more data quality checks (though those help) — it is centralizing metric definitions so that “revenue” means exactly one thing, everywhere, always.
Bad data is worse than no data — it leads to wrong decisions. Data quality checks: schema validation (are the expected columns present with correct types?), freshness checks (is the data from today or stale?), volume checks (did we receive the expected number of rows?), null checks (are critical fields populated?), uniqueness checks (no duplicate primary keys?), referential integrity (do foreign keys point to valid records?).
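Several of the checks listed above (null checks, uniqueness, referential integrity) reduce to one-line predicates over the rows. A minimal sketch over rows represented as plain dicts; in production these would typically be dbt tests or Great Expectations suites, and the field names here are illustrative:

```python
# Record-level data quality checks over rows-as-dicts.
def check_not_null(rows, field):
    """Critical field is populated in every row (missing counts as null)."""
    return all(r.get(field) is not None for r in rows)

def check_unique(rows, field):
    """No duplicate values (e.g. primary keys) in the field."""
    values = [r[field] for r in rows]
    return len(values) == len(set(values))

def check_referential_integrity(rows, fk_field, valid_keys):
    """Every foreign key points at a known record."""
    return all(r[fk_field] in valid_keys for r in rows)
```

Freshness and volume checks follow the same pattern but compare against history rather than against the current batch alone, which is why they are covered under drift monitoring below.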
Data Quality: Schema Validation, Data Contracts, and Monitoring for Drift
Data quality is not a one-time check — it is a continuous practice woven into every stage of the pipeline.
Schema Validation
Every pipeline stage should validate that incoming data matches the expected schema before processing. This catches breaking changes early instead of letting corrupt data flow downstream into dashboards and ML models.
At extraction: Validate that source API responses or database exports match the expected schema (column names, types, non-null constraints)
At loading: Enforce schema-on-write or schema-on-read validation in the warehouse. Snowflake and BigQuery support schema enforcement.
At transformation: dbt tests validate the output schema — every model should have not_null, unique, and accepted_values tests on critical columns
Data Contracts
A data contract is a formal agreement between a data producer (the team that owns the source system) and a data consumer (the analytics or ML team). It specifies:
What fields will be present and their types
What guarantees are made about data freshness and completeness
What the change process looks like (producer cannot rename a field without notifying consumers)
SLAs for data delivery (e.g., “yesterday’s data available by 6 AM UTC”)
Data contracts shift data quality left — instead of discovering that a source schema changed when the dashboard breaks, the producing team is obligated to communicate changes in advance. This is the same principle as API versioning applied to data.
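“Contracts as code” can be made concrete with a small sketch: the contract declares the expected fields, their types, and a freshness SLA, and the consuming pipeline validates against it on every run. The contract shape and field names here are hypothetical; real implementations often live in schema registries or YAML specs:

```python
# A data contract expressed as code: expected schema plus a
# freshness SLA the producer commits to. Names are illustrative.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class DataContract:
    schema: dict              # field name -> expected Python type
    freshness_sla: timedelta  # maximum allowed data age

    def validate_schema(self, row):
        """Return a list of violations for one row (empty means OK)."""
        problems = []
        for field, expected in self.schema.items():
            if field not in row:
                problems.append(f"missing field: {field}")
            elif not isinstance(row[field], expected):
                problems.append(f"wrong type for {field}")
        return problems

    def is_fresh(self, latest_ts, now):
        """Does the newest data meet the delivery SLA?"""
        return now - latest_ts <= self.freshness_sla

orders_contract = DataContract(
    schema={"order_id": int, "amount_cents": int, "currency": str},
    freshness_sla=timedelta(hours=6),
)
```

The point of the dataclass is social as much as technical: because the contract is a reviewed artifact, renaming `amount_cents` becomes a visible, negotiated change rather than a silent breakage.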
Monitoring for Drift
Even with contracts, data can drift over time. Monitor for:
| Drift Type | What to Watch | How to Detect |
| --- | --- | --- |
| Schema drift | New columns appearing, columns removed, type changes | Compare current schema against contract/baseline schema on each pipeline run |
| Volume drift | Gradual increase or sudden drop in row counts | Track daily row counts over time; alert on deviations beyond 2 standard deviations |
| Distribution drift | A column that was 5% null is now 40% null; an enum gains new values | Statistical profiling on each load; Great Expectations or dbt-expectations for automated checks |
| Freshness drift | Data arrives later than expected | Track the max timestamp in each load; alert if it falls behind the expected SLA |
| Semantic drift | A field’s meaning changes (e.g., “revenue” used to be gross, now it is net) | Harder to detect automatically; requires data contracts and communication between teams |
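The volume-drift rule (alert on deviations beyond 2 standard deviations from recent history) is simple enough to sketch directly; the threshold and history length here are illustrative defaults, and tools like Great Expectations or Monte Carlo implement more robust versions:

```python
# Volume drift: flag today's row count if it deviates from the
# historical mean by more than n_sigma standard deviations.
from statistics import mean, stdev

def volume_drift(history, today, n_sigma=2.0):
    """history: recent daily row counts. Returns True if today's
    count looks like an outlier versus that history."""
    if len(history) < 2:
        return False  # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu  # perfectly flat history: any change is drift
    return abs(today - mu) > n_sigma * sigma
```

A rolling window (say, the last 30 days) works better than all-time history here, since legitimate gradual growth would otherwise inflate the baseline variance and mask real drops.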
Tools: Great Expectations, dbt tests, Soda, Monte Carlo (data observability). Apache Spark (large-scale batch processing). Apache Flink (stream processing). Apache Airflow (workflow orchestration).
Interview Question: Your data pipeline has a 6-hour SLA but it has been taking 8 hours recently. The data volume grew 3x. What are your options?
Strong answer: This is a common production scenario. The pipeline was designed for volume X and now it is processing 3X. Before jumping to solutions, diagnose where the time is going.
Step 1 — Profile the pipeline. Which stages are slow? A typical pipeline has extraction, loading, and multiple transformation steps. Often, one or two stages dominate the runtime. Check Airflow/Dagster task durations, query execution plans, and resource utilization (CPU, memory, I/O). The fix depends entirely on which stage is the bottleneck.
Step 2 — Quick wins (can often recover 30-50% of runtime):
Incremental processing. If the pipeline reprocesses all data daily (full refresh), switch to incremental loads — only process rows that changed since the last run. If your source has a reliable updated_at timestamp, this can reduce processing volume by 90%+.
Query optimization. Check for missing indexes, expensive full table scans, poorly written joins (Cartesian products, joining on non-indexed columns), and unnecessary SELECT *. In Spark, check for data skew causing uneven partition sizes.
Partition pruning. If the data is partitioned by date but the query scans all partitions, add partition filters. This is one of the most common performance issues in data warehouses.
Parallelism. If independent transformation steps run sequentially, parallelize them. Airflow and Dagster support this natively through DAG structure.
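The incremental-processing pattern from the first quick win can be sketched as a high-water mark: persist the largest `updated_at` seen so far and only select rows past it on the next run. Table and column names here are hypothetical; in real code the query should be parameterized, not built with f-strings:

```python
# Incremental load via a high-water mark on updated_at.
def incremental_query(table, last_watermark):
    # f-string used for readability only; use bound parameters
    # in production to avoid SQL injection.
    return (
        f"SELECT * FROM {table} "
        f"WHERE updated_at > '{last_watermark}' "
        f"ORDER BY updated_at"
    )

def advance_watermark(rows, previous):
    """New watermark = max updated_at seen this run, or the
    previous watermark if nothing changed."""
    if not rows:
        return previous
    return max(r["updated_at"] for r in rows)
```

The usual caveats apply: this only works if `updated_at` is reliably set on every write, and hard deletes are invisible to it (a periodic full refresh or change data capture covers those).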
Step 3 — Medium-term fixes (if quick wins are not enough):
Vertical scaling. Bigger warehouse size (Snowflake), more DPUs (AWS Glue), larger cluster (Spark/EMR). Costs more money but buys time.
Horizontal scaling. Partition the data and process partitions in parallel. If you have 3 years of data, process each month as an independent task.
Rearchitect the bottleneck stage. If a single complex dbt model takes 4 hours, break it into intermediate models that can be materialized as tables (not views) and computed incrementally.
Pre-aggregate. If downstream queries aggregate over fine-grained data, pre-compute common aggregations in an intermediate table.
Step 4 — Strategic options (if the 3x growth is continuing):
Move the bottleneck to a different technology. If a SQL-based transformation is hitting warehouse limits, consider moving that stage to Spark for distributed processing.
Renegotiate the SLA. Sometimes the honest answer is: a 6-hour SLA was set when the data was 3x smaller. If the business can tolerate an 8-hour SLA (data ready by 10 AM instead of 8 AM), that buys time for a proper rearchitecture.
Split into fast and slow paths. If 80% of the data is needed by 6 hours (yesterday’s transactions) but the remaining 20% can arrive later (historical recalculations), split the pipeline into a fast path and a slow path with different SLAs.
The meta-answer: The strongest candidates show a systematic approach (profile first, then optimize, then scale, then rearchitect) rather than jumping to “just add more machines.” They also mention that this problem will recur — if data grew 3x once, it will grow 3x again. The fix should include monitoring and alerting on pipeline duration so the team sees the next growth spike before it breaches the SLA.
A collection of essential references across all three topics covered in this module. These are not filler links — each one is here because it is genuinely worth your time.
Capacity Planning and Estimation
System Design Interview by Alex Xu — the most practical guide to back-of-envelope estimation. Chapters walk through real problems step by step.
Chelsea Troy — Reviewing Pull Requests — a thoughtful, deeply practical series on what makes code review effective. Covers cognitive load, empathy in review, and how to give feedback that actually helps the author improve.
dbt documentation and the Analytics Engineering Guide — docs.getdbt.com — the official dbt documentation is excellent. Start with the “Getting Started” tutorial, then read the Analytics Engineering Guide for the philosophy behind the tool. Understanding dbt is non-negotiable for modern data engineering interviews.
Apache Flink documentation and use cases — flink.apache.org — the official Flink docs include excellent conceptual guides on stream processing fundamentals (windowing, watermarks, exactly-once semantics) that are valuable even if you never use Flink specifically.
Netflix Technology Blog — data pipeline posts — Netflix’s data engineering team has published some of the best real-world accounts of building and operating data pipelines at massive scale. Search for posts on their data mesh adoption, Spark usage, and pipeline reliability practices.
Designing Data-Intensive Applications by Martin Kleppmann — the chapters on batch processing (Chapter 10) and stream processing (Chapter 11) are some of the best technical writing on these topics anywhere. A must-read for anyone building or operating data systems.