Documentation Index
Fetch the complete documentation index at: https://resources.devweekends.com/llms.txt
Use this file to discover all available pages before exploring further.
Part XXXIV — Capacity Planning and Back-of-Envelope Estimation
Chapter 41: Capacity Planning
41.1 Numbers Every Engineer Should Know
These latency numbers (originally from Jeff Dean, updated for modern hardware) are essential for back-of-envelope calculations and system design interviews:| Operation | Latency | Notes |
|---|---|---|
| L1 cache reference | 0.5 ns | |
| L2 cache reference | 7 ns | |
| Main memory reference | 100 ns | |
| SSD random read | 16 us | ~16,000 ns |
| Read 1 MB from memory | 250 us | |
| Read 1 MB from SSD | 1 ms | |
| Network round-trip (same datacenter) | 0.5 ms | 500 us |
| Read 1 MB from network | 10 ms | |
| Disk seek (HDD) | 10 ms | |
| Network round-trip (cross-continent) | 150 ms | |
| Read 1 MB from HDD | 20 ms | |
| NVMe SSD random read | 10-20 us | 10x faster than SATA SSD |
sendfile() zero-copy transfer (1 MB) | 50-100 us | Why Kafka is fast — see OS Fundamentals |
- A single PostgreSQL instance: ~5,000-20,000 simple queries/second
- Redis: ~100,000 operations/second
- A single web server (Node.js/Go): ~10,000-50,000 requests/second for simple endpoints
- Kafka: ~1 million messages/second per broker (small messages)
- 1 Gbps network: ~125 MB/second theoretical throughput
- DynamoDB: ~25,000 RCU / 25,000 WCU per table in on-demand mode before throttling (see Database Deep Dives for partition-level limits)
- S3: ~5,500 GET and ~3,500 PUT requests/second per partitioned prefix (see Cloud Service Patterns for S3 performance tuning)
| Quantity | Human-Scale Equivalent |
|---|---|
| 1 million seconds | ~11.5 days |
| 1 billion seconds | ~31.7 years |
| 1 trillion seconds | ~31,700 years |
| 1 million rows x 1 KB each | 1 GB |
| 1 billion daily active users x 10 requests each | ~115,000 requests/second |
Back-of-Envelope Cheat Sheet
Use this reference table during system design interviews and architecture sessions. These are approximate values — the point is speed, not decimal precision. Storage Sizes| Data Type | Approximate Size |
|---|---|
| A single ASCII character | 1 byte |
| A UUID | 36 bytes (string), 16 bytes (binary) |
| A typical JSON API response | 1-10 KB |
| A user profile row (RDBMS) | 0.5-2 KB |
| A tweet / short message | 0.5-1 KB (text + metadata) |
| A thumbnail image | 10-50 KB |
| A high-res photo | 2-5 MB |
| A 1-minute video clip (compressed) | 5-10 MB |
| 1 million user rows | ~1 GB |
| 1 billion user rows | ~1 TB |
| Daily Volume | Requests/Second (avg) | Requests/Second (peak, 3x) |
|---|---|---|
| 100K / day | ~1.2 req/s | ~3.5 req/s |
| 1M / day | ~12 req/s | ~35 req/s |
| 10M / day | ~116 req/s | ~350 req/s |
| 100M / day | ~1,160 req/s | ~3,500 req/s |
| 1B / day | ~11,600 req/s | ~35,000 req/s |
| Operation | Latency |
|---|---|
| In-memory cache lookup (Redis) | < 1 ms |
| Simple DB query (indexed) | 1-5 ms |
| Complex DB query (join, aggregation) | 10-100 ms |
| Intra-datacenter API call | 1-10 ms |
| Cross-region API call | 50-150 ms |
| Third-party API call (Stripe, Twilio) | 100-1000 ms |
| DNS lookup (uncached) | 20-100 ms |
41.2 Back-of-Envelope Estimation
The skill of quickly estimating whether a design will work at the required scale. This comes up constantly in system design interviews and real architecture discussions. The process: Start with the requirement (e.g., “store 5 years of user activity”). Estimate the data size per event. Estimate the volume (events/day x 365 x 5). Multiply. Add overhead (indexes, replication, headroom). Compare against known limits of the technology you are considering.Interview Question: Estimate the storage needed for a service that stores all tweets for 5 years.
Interview Question: Estimate the storage needed for a service that stores all tweets for 5 years.
- Failure mode: What happens if your estimate is 5x too low? You run out of disk on your distributed DB cluster and writes start failing. The mitigation is monitoring disk utilization with alerts at 60% and 80%, plus auto-scaling policies or pre-provisioned headroom.
- Rollout: How would you validate this estimate before committing to infrastructure? Run a 1-week pilot ingesting real tweet volumes into a staging cluster, measure actual storage including indexes and MVCC overhead, then extrapolate.
- Cost: At 1.6 PB on S3 Standard, storage alone is ~$37K/month. With S3 Intelligent-Tiering, cold tweets (older than 90 days, rarely accessed) automatically move to cheaper tiers, potentially cutting cost by 40%.
- Security/governance: 5 years of tweet data includes deleted tweets, GDPR erasure requests, and court-ordered takedowns. Your storage design must support efficient single-record deletion across a distributed system — Cassandra tombstones handle this, but tombstone accumulation is its own operational problem.
Interview Question: Estimate the storage needed for a YouTube-like platform — 500 hours of video uploaded per minute, 1 billion daily views. Walk me through your math.
Interview Question: Estimate the storage needed for a YouTube-like platform — 500 hours of video uploaded per minute, 1 billion daily views. Walk me through your math.
- Failure mode: What if your CDN cache hit rate drops from 95% to 80%? Origin egress explodes. At 42 PB/day, a 15% increase in origin fetches is ~6.3 PB/day of additional origin bandwidth — potentially $300K+/day in data transfer costs. Monitor cache hit ratio as a first-class SLA metric.
- Rollout: How would you validate transcoding throughput estimates? Start by benchmarking a single transcoding worker against real video samples (not synthetic data) — actual user uploads are more variable in codec, resolution, and duration than benchmarks assume.
- Measurement: Track cost-per-view-minute as the north star metric. If infrastructure cost grows faster than views, you have an efficiency problem. YouTube reportedly spends ~$0.001 per view-minute on infrastructure.
- Security/governance: DMCA takedowns require removing a video and all its transcoded variants from all CDN edge caches globally within a compliance window. Design your CDN purge mechanism to handle this at scale — invalidating a popular video cached at 200+ edge locations is not instantaneous.
AI-Assisted Engineering: How LLMs Change Capacity Estimation
AI-Assisted Engineering: How LLMs Change Capacity Estimation
Worked Example: Design Twitter — Estimate Storage for 500M Users, 200M Daily Active
This is a complete back-of-envelope walkthrough combining multiple estimation skills. Step 1: Define the scope.- 500M total users, 200M daily active users (DAU)
- Features to estimate: user profiles, tweets, timelines, media, follows graph
- 500M users x ~2 KB per profile (name, bio, email, settings, timestamps) = 1 TB
- With 3x replication = ~3 TB
- This is trivially handled by a single sharded database cluster
- Average DAU posts 2 tweets/day: 200M x 2 = 400M tweets/day
- Each tweet: 280 bytes (text) + 200 bytes (metadata) = ~480 bytes
- Daily tweet storage: 400M x 480 = 192 GB/day
- Yearly: 192 x 365 = ~70 TB/year
- 5 years: ~350 TB of raw tweet data
- With replication (3x) + indexes (~30%): ~350 x 3 x 1.3 = ~1.4 PB
- Assume 20% of tweets have an image (~200 KB avg compressed) and 5% have a video (~5 MB avg compressed)
- Images: 400M x 0.20 x 200 KB = 16 TB/day
- Videos: 400M x 0.05 x 5 MB = 100 TB/day
- Media dominates storage. Per year: ~42 PB. This goes in object storage (S3), not a database.
- 200M DAU, each opens the app 5 times/day, each load fetches ~50 tweets
- Timeline reads: 200M x 5 = 1B timeline loads/day = ~11,600 requests/second (avg), ~35,000 peak
- Each timeline load fans out to fetch tweets from followed users
- This demands heavy caching (pre-computed timelines in Redis) and a fan-out architecture
- Average user follows 200 accounts: 500M x 200 = 100B edges
- Each edge: 8 bytes (follower_id) + 8 bytes (followee_id) + 8 bytes (timestamp) = ~24 bytes
- Total: 100B x 24 = ~2.4 TB (manageable in a graph database or sharded relational DB)
| Component | Daily | Yearly | 5 Years | Storage Technology |
|---|---|---|---|---|
| User profiles | - | - | ~3 TB | Sharded RDBMS |
| Tweets | 192 GB | 70 TB | ~1.4 PB (with replication) | Distributed DB (Cassandra, etc.) |
| Images | 16 TB | ~5.8 PB | ~29 PB | Object storage (S3) |
| Videos | 100 TB | ~36.5 PB | ~182 PB | Object storage (S3) + CDN |
| Follow graph | - | - | ~2.4 TB | Graph DB / sharded RDBMS |
| Timelines (cache) | - | - | ~10-50 TB | Redis cluster |
Estimation Sanity Checks
After any back-of-envelope calculation, pause and ask: does this make sense? Compare your estimate against known real-world benchmarks to verify you are in the right ballpark. The Sanity Check Process:- Compare against known systems. You estimated 100 TB/day for a social media app — is that reasonable? Yes. Facebook generates roughly 4 PB/day of data. Your estimate for a Twitter-scale system should be a meaningful fraction of that. If you estimated 100 GB/day for a platform with 200 million daily active users, something is wrong — that is less than 1 KB per user per day.
- Check per-user numbers. Divide your total by the number of users. If you estimated 70 TB/year for tweet storage with 200M DAU, that is about 350 KB per user per year, or roughly 1 KB per user per day. For a text-heavy service, that feels right. If your per-user number is 1 GB/day for a text messaging app, you have made an error somewhere.
- Check against your monthly cloud bill. Storage costs roughly 10M/month storage bill for a startup with 1M users, reconsider your assumptions. Real-world benchmark: Dropbox famously saved $75 million over two years by moving from S3 to their own infrastructure — but they were storing exabytes. For most companies (under 1 PB), S3 is cheaper than running your own storage. The Cloud Service Patterns chapter covers S3 storage class tiers (Standard vs Infrequent Access vs Glacier) that can cut costs 40-80% for cold data.
-
Order-of-magnitude gut check. Round your answer to the nearest power of 10 and ask if it passes the smell test:
- 1 GB — a single machine handles this trivially
- 1 TB — a single beefy machine or a small database cluster
- 1 PB — you need distributed storage, object storage, and serious infrastructure
- 1 EB — you are Google, Facebook, or AWS
- Cross-check with the throughput estimate. If your storage estimate implies 100 TB/day but your throughput estimate says the system handles 1,000 requests/second at 1 KB each, those numbers do not agree (1,000 req/s x 1 KB x 86,400 seconds = ~86 GB/day, not 100 TB). When storage and throughput estimates diverge, one of your assumptions is wrong — find it.
Real-World Story: How Google Manages Billions of Lines of Code in a Single Repository
Google stores virtually all of its code — billions of lines across tens of thousands of projects — in a single monolithic repository. As of their 2016 paper, the repository contained over 2 billion lines of code across 9 million source files, with 86 TB of content and roughly 45,000 commits per day from 25,000+ developers. They do not use Git for this. Git was not designed for a repository of this scale — operations likegit status and git clone would take hours or days on a repo this size. Instead, Google built a custom version control system called Piper, backed by their distributed file system (Bigtable and Colossus). Developers do not clone the entire repo. They use a virtual filesystem called CitC (Clients in the Cloud) that presents the illusion of having the full repo locally, while fetching only the files you actually open or build.
Why a monorepo? Because dependency management across thousands of projects becomes tractable. If Team A changes an API, they can also update every caller of that API across the company in a single commit (called a “large-scale change” or LSC). This eliminates dependency versioning hell — there is only one version of every library: the one at HEAD. Google built custom tooling (Rosie, TAP) to automatically test the impact of changes across the entire codebase before they land.
The lesson for capacity estimation: Google’s monorepo is a masterclass in understanding your scale constraints and building tools to match. They estimated that a standard VCS could not handle their volume, validated that estimate, and built custom infrastructure. They did not guess — they did the math. 86 TB of code. 45,000 commits/day. 800,000 queries/second to the version control backend. Those numbers told them exactly what architecture they needed.
Monorepo vs Polyrepo: The Decision Guide
The Google story above is the extreme end of a spectrum. Most teams face a more practical question: should our code live in one repository or many? This decision affects developer experience, CI/CD complexity, dependency management, and team autonomy. There is no universally correct answer — but there is a clear framework for deciding. Monorepo: All projects, services, and libraries live in a single Git repository. Shared code is consumed directly (not via package registries). A change to a shared library and all its consumers can be a single atomic commit. Polyrepo: Each project, service, or library has its own repository. Shared code is published as versioned packages (npm, PyPI, Maven) and consumed via dependency managers. Changes to shared code require a publish-then-update cycle.| Aspect | Monorepo | Polyrepo |
|---|---|---|
| Atomic cross-project changes | Yes — one commit updates library + all consumers | No — requires publish, then PRs to each consumer repo |
| Dependency management | Simplified — there is one version of everything at HEAD | Complex — each repo pins its own dependency versions |
| Code sharing | Effortless — just import from the shared directory | Requires publishing packages with proper versioning |
| CI/CD complexity | Higher — must determine which tests to run for which changes (affected-target analysis) | Lower — each repo has its own independent CI pipeline |
| Repository size | Grows over time — can become slow without tooling | Each repo stays small and fast |
| Team autonomy | Lower — shared tooling, shared CI, shared standards | Higher — each team controls their own repo, tooling, and release cadence |
| Onboarding | Easier to discover code — everything is searchable in one place | Harder to find things — code scattered across dozens of repos |
| Access control | Coarse-grained (Git does not natively support per-directory permissions) | Fine-grained (per-repo permissions are straightforward) |
| Build tooling required | Significant — need tools that understand project boundaries within the repo | Standard — each repo uses its own build system |
| Tool | Best For | Learning Curve | Language Support | Caching | Affected-Target Analysis |
|---|---|---|---|---|---|
| Nx | JS/TS-heavy monorepos, full-stack teams | Moderate | JS/TS native, others via plugins | Local + remote (Nx Cloud) | Yes (project graph) |
| Turborepo | Vercel/Next.js ecosystems, simplicity-first teams | Low | JS/TS primarily | Local + remote (Vercel) | Basic (task-level) |
| Bazel | Large polyglot repos, correctness-critical builds | High | Any language (via rules) | Local + remote (configurable) | Yes (hermetic dependency graph) |
- Small to medium teams (under ~50 engineers) where coordination cost is high and the overhead of managing dozens of repos is not worth it.
- Tightly coupled services that change together frequently — if Services A, B, and C always change in lockstep, separate repos mean three PRs instead of one.
- Strong shared library culture — if your team maintains internal libraries consumed by many services, a monorepo eliminates the publish-and-upgrade cycle.
- Early-stage companies where the codebase is small enough that tooling overhead is minimal and the ability to refactor across the entire codebase in a single PR is transformative.
- Real-world examples: Google (Piper), Meta (monorepo with custom tooling), Vercel (Turborepo), Stripe (Bazel).
- Large organizations with autonomous teams that release on independent schedules — a monorepo would force coordination overhead that slows teams down.
- Open-source projects or teams that need fine-grained access control — repository-level permissions are far simpler than directory-level permissions in a monorepo.
- Polyglot environments where each service uses a different language, build system, and deployment pipeline — forcing these into a shared monorepo build system is painful.
- Compliance-heavy environments where teams need isolated audit trails and access logs per service.
- Real-world examples: Netflix (hundreds of independent repos), Amazon (famously polyrepo — “two-pizza teams” with full ownership), Spotify (squads own their repos).
Interview Question: Your company has 8 microservices across 8 separate repos. Teams spend 30% of their time on cross-repo dependency updates. Should you move to a monorepo?
Interview Question: Your company has 8 microservices across 8 separate repos. Teams spend 30% of their time on cross-repo dependency updates. Should you move to a monorepo?
Part XXXV — Git Workflows and Code Review
Chapter 42: Git Workflows
Interview Question: Your team has 12 engineers and frequent merge conflicts. What do you do?
Interview Question: Your team has 12 engineers and frequent merge conflicts. What do you do?
42.1 Branching Strategies
Trunk-Based Development: All developers commit tomain (or trunk). Short-lived feature branches (hours, not weeks). Feature flags hide incomplete work. CI runs on every commit. This is the strategy used by Google, Facebook, and most high-performing teams measured in the Accelerate research. Requires strong CI, feature flags, and automated testing.
GitHub Flow: Create a branch from main, do work, open a pull request, review, merge to main, deploy. Simple, effective for most teams. Branches should be short-lived (1-3 days). Long-lived branches create merge conflicts and delay integration.
GitFlow: Separate develop, release, and hotfix branches alongside main. Popular but complex. Appropriate for software with formal release cycles (mobile apps, packaged software). Overkill for web services that deploy continuously.
The rule: Start with trunk-based development or GitHub Flow. Only adopt GitFlow if you have a genuine need for parallel release management.
Git Workflow Comparison
| Aspect | Trunk-Based Development | GitHub Flow | GitFlow |
|---|---|---|---|
| Branch lifetime | Hours (or direct commits) | 1-3 days | Days to weeks |
| Main branches | main only | main only | main, develop, release/*, hotfix/* |
| Complexity | Low | Low | High |
| Release model | Continuous deployment | Continuous deployment | Scheduled releases |
| Feature flags needed? | Yes (essential) | Helpful but optional | Rarely needed |
| Best team size | Any size (Google uses this at 30K+ engineers) | Small to medium (2-20 engineers) | Medium to large with formal release cycles |
| Best for | SaaS, web services, high-deployment-frequency teams | Most web projects, startups, open source | Mobile apps, packaged software, regulated environments |
| Merge conflict risk | Very low | Low (if branches are short) | High (long-lived branches diverge) |
| CI/CD requirement | Strong CI mandatory | CI recommended | CI helpful but not critical |
| Rollback strategy | Feature flags / revert commit | Revert commit / redeploy | Hotfix branch from main |
Interview Question: Your team uses GitFlow but releases are painful and slow. How would you propose migrating to trunk-based development?
Interview Question: Your team uses GitFlow but releases are painful and slow. How would you propose migrating to trunk-based development?
develop and release branches, hotfixes that need cherry-picking across multiple branches, release branches that accumulate so many changes that testing them is a project in itself, and a “merge day” before each release where the team stops feature work to resolve conflicts.The migration plan — do it incrementally, not as a big bang:Phase 1 — Shorten branch lifetimes (2-4 weeks). Before changing the branching model, start merging feature branches to develop more frequently. Set a team norm: no branch lives longer than 2 days. Break large features into smaller, independently mergeable increments. This alone reduces most merge pain.Phase 2 — Introduce feature flags (2-4 weeks). Deploy a feature flag system (LaunchDarkly, Unleash, or even a simple config file). Start wrapping incomplete features behind flags so that partially complete work can be merged to develop without being visible to users. This is the critical prerequisite — without feature flags, trunk-based development does not work because you cannot merge incomplete features.Phase 3 — Eliminate the develop branch (1-2 weeks). Once branches are short-lived and feature flags are in place, develop is no longer adding value — it is just an extra step between your branch and main. Start merging directly to main. Keep the CI pipeline running on every merge. The develop branch becomes a mirror of main and can be retired.Phase 4 — Eliminate release branches (gradual). With CI/CD deploying from main, release branches become unnecessary for web services. If you still ship versioned software (mobile apps), you may keep lightweight release branches that are cut from main and only receive cherry-picked hotfixes — but no feature development happens on them.Phase 5 — Invest in CI and automated testing. Trunk-based development requires that main is always deployable. This means: fast CI (under 10 minutes), comprehensive automated tests, and a culture where a broken main is treated as a stop-the-world emergency. If your CI takes 45 minutes, fix that before going fully trunk-based.Key risks to address: (1) Team members accustomed to long-lived branches will need coaching — pair programming helps. (2) Feature flags add complexity — clean them up aggressively. (3) CI must be fast and reliable — flaky tests undermine trunk-based development because people stop trusting the green build.The strongest answer acknowledges that this is as much a cultural change as a technical one. The tooling is straightforward. Getting 12 engineers to change how they work daily requires buy-in, patience, and visible wins early in the migration.42.2 Merge Strategies
Merge commit: Preserves full branch history. Creates a merge commit. Good for seeing exactly what a PR contained. Noisy history. Squash merge: Combines all branch commits into a single commit on main. Clean history. Loses granular commit history (which rarely matters). Most teams prefer this for PRs. Rebase: Replays branch commits on top of main. Linear history. Can cause confusion with force pushes. Best for personal workflow, risky for shared branches.42.3 Code Review Best Practices
For authors: Keep PRs small (under 400 lines changed). Write a clear description (what, why, how, testing). Self-review before requesting review. Respond to feedback promptly. Do not take feedback personally. For reviewers: Review within 24 hours (review latency kills velocity). Focus on correctness, clarity, and maintainability — not style preferences (use linters for that). Ask questions instead of making demands (“What happens if this is null?” vs “This will crash on null”). Approve when it is “good enough” — do not block on perfection. Distinguish blocking issues from nits. What to look for: Logic errors. Missing edge cases. Security vulnerabilities. Performance concerns at scale. Missing tests for changed behavior. API contract changes. Error handling gaps. Race conditions.Code Review: What to Look For (Checklist)
Use this checklist when reviewing pull requests. Not every item applies to every PR, but scanning through this list ensures you do not miss critical issues.| Category | What to Check |
|---|---|
| Correctness | Does the code do what the PR description says? Are edge cases handled (null, empty, overflow, boundary values)? Are error paths correct? |
| Readability | Can you understand the code in a single pass? Are names descriptive? Is the logic straightforward or needlessly clever? |
| Performance | Any N+1 queries? Unnecessary loops over large data sets? Missing indexes for new queries? Unbounded result sets? |
| Security | SQL injection? XSS? Missing input validation? Secrets hardcoded? Broken authentication or authorization checks? |
| Test coverage | Are new behaviors tested? Are edge cases covered? Do tests actually assert meaningful behavior (not just “no error”)? |
| API design | Are new endpoints or contracts backward compatible? Are field names consistent with the rest of the codebase? |
| Error handling | Are errors caught and handled (not silently swallowed)? Are error messages helpful for debugging? Are retries safe (idempotent)? |
| Concurrency | Any shared mutable state? Race conditions? Missing locks or transactions? Safe under concurrent access? |
| Dependencies | New dependency added? Is it maintained, well-known, and license-compatible? Does it pull in a large transitive dependency tree? |
| Observability | Are new operations logged? Are metrics emitted for new features? Can you debug this in production? |
Code Review Anti-Patterns
Recognizing these patterns helps teams improve their review culture.Real-World Story: How GitHub Scales Code Review Across Thousands of Engineers
GitHub itself — the company that hosts code review for millions of teams — has a distinctive code review culture worth studying. With over 3,000 employees and hundreds of engineers shipping to a monolithic Ruby on Rails application (and increasingly a set of microservices), they have had to solve the “code review at scale” problem for their own codebase, not just for their customers. How they do it: GitHub uses their own pull request workflow (naturally), but the interesting part is the culture and tooling around it. They practice what they call “review-driven development” — PRs are the unit of work, and no code reachesmain without at least one substantive review. They use CODEOWNERS files extensively, which automatically assign reviewers based on which files a PR touches. This prevents the bottleneck of a few senior engineers being the review gateway for everything.
Review latency as a metric: GitHub treats review latency (the time between a PR being opened and receiving its first substantive review) as a key engineering health metric. Their internal target is that most PRs receive a review within a few hours. They found that when review latency creeps up, branch lifetimes grow, merge conflicts increase, and deploy frequency drops — a cascading effect that slows the entire organization.
The “ship small” principle: GitHub’s engineering blog has documented their emphasis on small, incremental PRs. They encourage engineers to break work into PRs of under 400 lines when possible. Smaller PRs get reviewed faster, reviewed more thoroughly, and merge with fewer conflicts. They use feature flags extensively to allow incomplete features to be merged behind a flag.
The lesson: Code review is not just a quality gate — it is a communication mechanism. GitHub’s approach shows that the biggest leverage is not in what reviewers look for (checklists help, but they are secondary) but in how fast the review loop turns. Fast reviews mean short branches mean few conflicts mean fast deploys. Slow reviews poison the entire development cycle.
Part XXXVI — Data Pipelines and Batch Processing
Chapter 43: Data Engineering Fundamentals
Interview Question: A business analyst reports that yesterday's revenue dashboard is $50,000 lower than expected. How do you investigate?
Interview Question: A business analyst reports that yesterday's revenue dashboard is $50,000 lower than expected. How do you investigate?
43.1 ETL vs ELT
ETL (Extract, Transform, Load): Data is extracted from sources, transformed in a staging area, then loaded into the destination. Traditional approach. Transformation happens before loading, so only clean data enters the warehouse. Good when the destination has limited compute (traditional data warehouses). ELT (Extract, Load, Transform): Data is extracted and loaded raw into the destination, then transformed there. Modern approach enabled by powerful cloud data warehouses (BigQuery, Snowflake, Redshift) that have cheap, scalable compute. Faster initial loading. Transforms can be iterated without re-extraction. This is the dominant pattern today.ETL vs ELT: The Shift and When Each Applies
| Aspect | ETL | ELT |
|---|---|---|
| Transform location | Staging server / middleware | Inside the data warehouse |
| Warehouse compute needed | Low (receives clean data) | High (does heavy transformations) |
| Data in warehouse | Only transformed, curated data | Raw + transformed data |
| Iteration speed | Slow (re-extract to re-transform) | Fast (just re-run the transformation SQL) |
| Schema changes | Painful (pipeline changes needed) | Easier (raw data is preserved, transform adapts) |
| Tooling | Informatica, Talend, SSIS, custom scripts | dbt, Snowflake, BigQuery, Databricks |
| Best for | Legacy systems, constrained warehouse compute, strict compliance (only approved data enters the warehouse) | Modern cloud warehouses, fast iteration, analytics teams that own transformations |
43.2 Batch vs Stream Processing
Batch: Process accumulated data at scheduled intervals (hourly, daily). Simpler, cheaper, handles large volumes efficiently. Use for: daily reports, analytics aggregation, data warehouse loading, ML model training, end-of-day reconciliation. Stream: Process data as it arrives, continuously. More complex, higher infrastructure cost, but enables real-time insights. Use for: fraud detection, live dashboards, real-time recommendations, alerting, event-driven architectures. The practical reality: Most companies need both. Stream processing for time-sensitive operations (fraud alerts within seconds), batch processing for everything else (daily reports, historical analytics). Do not default to stream processing — it is significantly more complex to build, debug, and operate.Real-World Story: Spotify’s Journey from Batch to Streaming
Spotify’s data infrastructure evolution is one of the best-documented cases of a company navigating the batch-to-streaming transition — and learning the hard way that “just stream everything” is not the answer. The early days (batch-first). Spotify started with a Hadoop-based batch processing architecture. Every night, massive MapReduce jobs would crunch the previous day’s listening data — billions of events — to produce the analytics, recommendations, and royalty calculations that powered the business. This worked well when Spotify had tens of millions of users. Daily batch jobs ran overnight, and by morning the data team had fresh dashboards and the recommendation models had updated features. The pain point. As Spotify scaled past 100 million users and the event volume grew into the hundreds of billions per day, the batch architecture started breaking. Jobs that used to finish in 2 hours were taking 8. The “nightly batch window” was not enough — some jobs were still running when users woke up. Worse, the batch-only model meant that features like Discover Weekly and personalized playlists could only update daily. If a user listened to jazz all morning, the recommendations would not reflect that until the next day’s batch run. The hybrid architecture. Rather than ripping out batch and going fully streaming, Spotify built a hybrid architecture. They adopted Google Cloud Dataflow (based on Apache Beam) for streaming use cases where latency mattered — real-time listening activity, live A/B test metrics, and fraud detection. But they kept batch processing (migrating from Hadoop to Google BigQuery and Scio, their open-source Scala library for Apache Beam) for the heavy analytical workloads where daily granularity was perfectly fine: royalty calculations, long-term trend analysis, and ML model training. The lesson. Spotify’s story illustrates a pattern that almost every data-intensive company discovers: you end up with both batch and streaming, not one or the other. The key skill is knowing which workloads belong where. Spotify engineers have talked publicly about the mistake of over-eagerly migrating batch workloads to streaming — the added complexity of managing streaming state, handling late-arriving events, and debugging time-windowed aggregations was not worth it for workloads where “yesterday’s data by 8 AM” was a perfectly acceptable SLA. Similar stories exist at LinkedIn (moved from batch Hadoop to Kafka Streams for real-time people-you-may-know, but kept batch for data warehouse loads), Uber (real-time pricing with Apache Flink, batch analytics with Spark on Hive), and Airbnb (real-time search ranking with online feature stores, batch for the Minerva metrics platform described earlier in this chapter). The underlying OS-level reasons why batch throughput is so much cheaper than streaming — sequential I/O, prefetching, and amortized syscall overhead — are explained in OS Fundamentals.Batch vs Stream Processing: Detailed Comparison
| Aspect | Batch Processing | Stream Processing |
|---|---|---|
| Data arrival | Accumulated, processed on schedule | Continuous, processed as it arrives |
| Latency | Minutes to hours | Milliseconds to seconds |
| Complexity | Lower (simpler failure handling, easier debugging) | Higher (ordering, exactly-once, late arrivals, watermarks) |
| Infrastructure cost | Lower (runs periodically, can use spot instances) | Higher (always running, dedicated compute) |
| Error recovery | Re-run the entire batch | Complex (replay from offset, handle duplicates) |
| Throughput | Very high (optimized for bulk) | High but costlier per event |
| State management | Straightforward (read all input, produce output) | Complex (windowing, checkpointing, state stores) |
| Debugging | Easier (input/output are files, reproducible) | Harder (transient state, time-dependent behavior) |
| Use cases | Daily reports, data warehouse loading, ML training, reconciliation, historical analytics | Fraud detection, live dashboards, alerting, real-time recommendations, IoT event processing |
43.3 The Modern Data Stack
Extract: Fivetran, Airbyte, Stitch (managed connectors from SaaS tools and databases to the warehouse). Load: Direct loading into BigQuery, Snowflake, Redshift, Databricks. Transform: dbt (SQL-based transformations with version control, testing, and documentation — the standard tool). Orchestrate: Airflow, Dagster, Prefect (schedule and manage pipeline dependencies). Serve: Looker, Metabase, Tableau (BI and visualization), or reverse ETL back to operational systems.Key Tools: Apache Spark, Apache Flink, and dbt
These three tools cover the majority of modern data processing needs. Knowing when to use each is a critical skill. Apache Spark Spark is a distributed data processing engine designed for large-scale batch processing and analytics. It processes data in parallel across a cluster, making it ideal for workloads that would take hours on a single machine.| Aspect | Details |
|---|---|
| Primary mode | Batch processing (with Structured Streaming for micro-batch stream processing) |
| Languages | Python (PySpark), Scala, Java, SQL |
| Best for | ETL on massive data sets (terabytes+), ML feature engineering, data lake processing, complex transformations that exceed SQL expressiveness |
| Not ideal for | True low-latency streaming (sub-second), small data sets (overhead is not worth it), simple SQL transformations (use dbt instead) |
| Where it runs | Databricks, AWS EMR, Google Dataproc, self-managed clusters |
| Aspect | Details |
|---|---|
| Primary mode | Stream processing (with batch as a special case of bounded streams) |
| Languages | Java, Scala, Python, SQL |
| Best for | True real-time processing (fraud detection, live analytics), event-driven applications, complex event processing (CEP), applications needing exactly-once semantics on streams |
| Not ideal for | Simple batch ETL (overkill — use Spark or dbt), teams without streaming expertise (steep learning curve) |
| Where it runs | AWS Kinesis Data Analytics, Confluent Cloud, self-managed clusters |
| Aspect | Details |
|---|---|
| Primary mode | SQL-based transformations inside the data warehouse |
| Languages | SQL + Jinja templating |
| Best for | Data warehouse transformations, building analytics models, data quality testing, documentation and lineage tracking, enabling analysts to own their transformations |
| Not ideal for | Real-time processing, non-SQL transformations, processing data outside the warehouse |
| Where it runs | dbt Cloud, or dbt Core (open source CLI) against any supported warehouse (Snowflake, BigQuery, Redshift, Databricks, Postgres) |
Real-World Story: How Airbnb’s Data Quality Crisis Led to Minerva
Airbnb’s data quality journey is a cautionary tale that became a success story — and it is one of the best-documented examples of why data quality cannot be an afterthought. The problem. By the mid-2010s, Airbnb had grown rapidly and so had their data infrastructure. Hundreds of engineers and analysts were writing SQL queries, building dashboards, and creating derived data sets. The problem: there was no single source of truth for core business metrics. Different teams had different definitions of “active listing,” “booking,” and “revenue.” One team’s revenue dashboard would show a number 15% different from another team’s — both built from the same raw data but with different filters, joins, and business logic applied. Leadership could not trust the numbers. Meetings that should have been about strategy devolved into arguments about whose dashboard was right. The investigation. Airbnb’s data team traced the root cause: metric definitions were scattered across hundreds of SQL queries, dbt models, and notebook analyses. There was no central, authoritative definition of what “revenue” meant — gross or net? Including cancellations or not? In the host’s currency or USD? Different analysts made different reasonable choices, and over time the definitions diverged silently. The solution: Minerva. Airbnb built Minerva, a centralized metrics platform that serves as the single source of truth for all business metrics. Minerva enforces that every metric has one canonical definition — written in a structured configuration language, version-controlled, reviewed, and tested. When an analyst or a dashboard needs “revenue,” they query Minerva rather than writing their own SQL. Minerva handles the joins, filters, currency conversions, and edge cases consistently every time. Key design decisions in Minerva:- Metric definitions as code: Every metric is defined in a YAML-like configuration file, version-controlled in Git, and subject to code review. Changing a metric definition is a PR, not an ad-hoc SQL edit.
- Dimensional consistency: Minerva enforces that metrics can be sliced by a standard set of dimensions (date, country, platform, etc.) and that these dimensions are defined once and reused.
- Data quality checks built in: Every metric computation includes automated validation — row counts, null rates, bounds checking, and comparison against previous runs.
- Self-serve with guardrails: Analysts can explore and slice metrics freely, but they cannot redefine them. New metrics go through a formal definition and review process.
43.4 Data Quality
Bad data is worse than no data — it leads to wrong decisions. Data quality checks: schema validation (are the expected columns present with correct types?), freshness checks (is the data from today or stale?), volume checks (did we receive the expected number of rows?), null checks (are critical fields populated?), uniqueness checks (no duplicate primary keys?), referential integrity (do foreign keys point to valid records?).Data Quality: Schema Validation, Data Contracts, and Monitoring for Drift
Data quality is not a one-time check — it is a continuous practice woven into every stage of the pipeline. Schema Validation Every pipeline stage should validate that incoming data matches the expected schema before processing. This catches breaking changes early instead of letting corrupt data flow downstream into dashboards and ML models.- At extraction: Validate that source API responses or database exports match the expected schema (column names, types, non-null constraints)
- At loading: Enforce schema-on-write or schema-on-read validation in the warehouse. Snowflake and BigQuery support schema enforcement.
- At transformation: dbt tests validate the output schema — every model should have
not_null,unique, andaccepted_valuestests on critical columns
- What fields will be present and their types
- What guarantees are made about data freshness and completeness
- What the change process looks like (producer cannot rename a field without notifying consumers)
- SLAs for data delivery (e.g., “yesterday’s data available by 6 AM UTC”)
| Drift Type | What to Watch | How to Detect |
|---|---|---|
| Schema drift | New columns appearing, columns removed, type changes | Compare current schema against contract/baseline schema on each pipeline run |
| Volume drift | Gradual increase or sudden drop in row counts | Track daily row counts over time; alert on deviations beyond 2 standard deviations |
| Distribution drift | A column that was 5% null is now 40% null; an enum gains new values | Statistical profiling on each load; Great Expectations or dbt-expectations for automated checks |
| Freshness drift | Data arrives later than expected | Track the max timestamp in each load; alert if it falls behind the expected SLA |
| Semantic drift | A field’s meaning changes (e.g., “revenue” used to be gross, now it is net) | Harder to detect automatically; requires data contracts and communication between teams |
Data Quality Framework: Great Expectations, dbt Tests, and Data Contracts in Practice
The tools mentioned above deserve deeper treatment. In interviews and in production, the difference between a junior and senior data engineer is not whether they know data quality matters — everyone agrees it matters. The difference is whether they can describe a concrete, layered framework for enforcing it. Here is that framework. Layer 1: dbt Tests — The First Line of Defense dbt tests run after every transformation and catch structural issues before data reaches dashboards. They are the unit tests of the data world. Built-in generic tests (the four you should apply to every model):unique— no duplicate values in the specified column. Apply to primary keys and any column that should be unique (email, order_id).not_null— no null values. Apply to every column that downstream consumers depend on for joins, aggregations, or display.accepted_values— column values must be in an explicit list. Apply to enums, status fields, and category columns. When a new value appears (e.g., a new payment type), the test fails and forces the team to make a conscious decision about how to handle it.relationships— referential integrity between models. If anordersmodel referencescustomer_id, this test verifies everycustomer_idinordersexists in thecustomersmodel.
expect_column_values_to_be_between— revenue should be between 1,000,000 per transaction. A value outside that range is likely corrupt data.expect_column_proportion_of_unique_values_to_be_between— if a user_id column is suddenly 95% duplicates, something went wrong upstream.expect_table_row_count_to_be_between— your orders table should have between 50,000 and 200,000 rows per day. A sudden drop to 5,000 means the extraction broke; a spike to 2,000,000 means duplicates leaked in.
unique and not_null tests on the primary key. Critical business models (MRR, ARR, customer counts) have bounds tests and freshness tests. Their CI pipeline blocks merges if any dbt test fails — exactly like a code CI pipeline blocks merges on failing unit tests.
Layer 2: Great Expectations — Profiling and Validation at the Data Layer
Great Expectations (GE) is an open-source Python framework for validating, profiling, and documenting data. It operates at a different level than dbt tests: where dbt tests validate the output of transformations, GE can validate data at any point — during extraction, before loading, after loading, or anywhere a DataFrame or database table exists.
Core concepts:
- Expectation: A declarative assertion about your data (e.g., “this column should never be null,” “this column’s mean should be between 50 and 150”). Expectations are written in Python or configured in JSON/YAML.
- Expectation Suite: A collection of expectations applied to a specific dataset. Think of it as a test suite for a specific table or DataFrame.
- Validation Result: The output of running a suite against actual data — pass/fail for each expectation, with detailed statistics on failures.
- Data Docs: GE auto-generates HTML documentation of your expectations and their latest validation results. This becomes a living data quality dashboard that analysts and engineers can browse without reading code.
-
Producer defines the contract. The team that owns the source system (e.g., the payments team that owns the
transactionstable) publishes a contract specifying: schema (columns, types, nullable constraints), freshness SLA (“data available within 2 hours of event”), volume expectations (“between 100K and 500K rows per day”), and change policy (“14-day notice before any breaking change”). - Consumer agrees to the contract. The analytics team or ML team that consumes the data reviews the contract and confirms it meets their needs. If not, they negotiate — maybe they need an additional field, or a tighter freshness SLA.
-
Contract is enforced automatically. Every pipeline run validates incoming data against the contract. If the schema changes without notice, if freshness falls below the SLA, or if volume drops below the expected range, the pipeline fails loudly with a clear message: “Contract violation: field
payment_methodexpected type STRING, received INTEGER. Contact the Payments team.” - Changes go through a review process. When the producing team needs to change the contract (adding a field, deprecating a field, changing a type), they open a PR against the contract specification. Consuming teams are automatically notified and must approve — exactly like an API deprecation process.
- Protobuf/Avro schemas — enforce schema at the serialization layer. Kafka topics with Avro schemas (managed by Confluent Schema Registry) make contract violations impossible: a producer cannot publish a message that does not conform to the registered schema.
- dbt model contracts (dbt v1.5+) — dbt now supports
contract: {enforced: true}on models, which enforces column names, types, and constraints at build time. This is the simplest path for teams already using dbt. - Soda — open-source data quality tool that supports contract-as-code definitions in YAML and integrates with Airflow, dbt, and standalone pipelines.
- Custom contract registries — some organizations (like Airbnb with Minerva, described earlier) build internal contract systems that serve as the single source of truth for data definitions.
Interview Question: Two teams keep blaming each other for data quality issues — the data engineering team says the source data changed without notice, and the product team says the pipeline should handle schema changes gracefully. How do you resolve this?
Interview Question: Two teams keep blaming each other for data quality issues — the data engineering team says the source data changed without notice, and the product team says the pipeline should handle schema changes gracefully. How do you resolve this?
43.5 Change Data Capture (CDC)
Change Data Capture is the practice of listening to changes in a source database and streaming those changes to downstream consumers in near-real-time. Instead of polling the source on a schedule (“did anything change?”), CDC reads the database’s write-ahead log (WAL) and emits a stream of insert, update, and delete events. Why CDC matters: Every data pipeline that copies data from an operational database to a warehouse faces the same problem: how do you know what changed? The traditional approach is periodic full exports or timestamp-based incremental loads (WHERE updated_at > last_run). Both have flaws. Full exports are expensive and wasteful at scale. Timestamp-based loads miss deletes (a deleted row has no updated_at), miss updates to rows where the application does not update the timestamp, and require the source to have a reliable timestamp column in the first place.
CDC solves all three problems by reading the source of truth — the database’s transaction log. If a row was inserted, updated, or deleted, the event appears in the WAL. No polling. No missed deletes. No timestamp dependency.
Key tools:
- Debezium — The dominant open-source CDC tool. Reads WAL from PostgreSQL, MySQL, MongoDB, SQL Server, and others. Publishes change events to Kafka topics. Each change event includes the before and after state of the row, the operation type (create, update, delete), and the transaction metadata.
- AWS DMS (Database Migration Service) — Managed CDC for AWS. Supports continuous replication from RDS, Aurora, and on-premises databases to S3, Redshift, Kinesis, or Kafka.
- Fivetran / Airbyte — Managed ELT platforms that use CDC under the hood for supported sources. Simpler setup than Debezium but less control over the change stream.
Interview Question: Your team uses timestamp-based incremental loads. A product manager reports that deleted records still appear in the dashboard. What happened and how do you fix it?
Interview Question: Your team uses timestamp-based incremental loads. A product manager reports that deleted records still appear in the dashboard. What happened and how do you fix it?
updated_at changed since the last run. Deletes do not update a row — they remove it. The pipeline never sees the deletion, so the deleted record persists in the warehouse indefinitely.Immediate fix: Run a periodic reconciliation job that compares the full set of IDs in the source against the warehouse and removes the delta. This is expensive (full table scan of the source) but catches deletes.Long-term fix: Migrate to CDC with Debezium. CDC captures delete events from the WAL, so the pipeline receives an explicit “this row was deleted” message. The warehouse consumer can then either hard-delete the row or soft-delete it (set a _deleted_at timestamp) depending on the analytics requirement.Alternative if CDC is too heavy: Implement soft deletes in the source application — instead of DELETE FROM orders WHERE id = 123, use UPDATE orders SET deleted_at = NOW() WHERE id = 123. The timestamp-based load will pick up the soft delete because updated_at (or deleted_at) changed. This requires application code changes and discipline from the product engineering team.43.6 Late-Arriving Data, Backfills, and Reprocessing
Three operational realities that every production data pipeline must handle — and that most pipeline designs ignore until they cause an incident. Late-Arriving Data Data rarely arrives in perfect chronological order. Mobile apps queue events while offline and flush them hours later. Third-party APIs deliver webhook events with retry delays. Cross-timezone batch exports land at unpredictable times. A pipeline that assumes “all data for hour H arrives by hour H+1” will produce incorrect results when that assumption is violated. The impact: If a pipeline closes its processing window at midnight UTC and a batch of events with timestamps from 11:45 PM UTC arrives at 12:15 AM, those events are lost from yesterday’s aggregates and may or may not be counted in today’s. The dashboard shows yesterday’s revenue as 4.35M. But by then, the CFO has already cited the $4.2M figure in a board presentation. Strategies for late data:- Lookback window. Process not just new data but also data from the last N hours (e.g., reprocess the last 3 hours on every run). Late arrivals within the lookback window are automatically included. Cost: reprocessing overlapping data on every run. This is the simplest approach and works well when late arrivals are bounded by a known maximum delay.
- Allowed lateness with corrections. Accept that the first result is approximate. Emit a correction when late data arrives. The dashboard shows “4.35M (final)” when the window closes. This requires downstream consumers to handle corrections gracefully.
- Separate late-arrival pipeline. Route late events (detected by comparing event timestamp against arrival timestamp) to a dedicated reconciliation pipeline that merges them into the appropriate historical partition. This keeps the main pipeline simple while handling the edge case separately.
- Warehouse cost explosion. A backfill that reprocesses 2 years of data in BigQuery or Snowflake can consume more compute in one run than the pipeline normally uses in a month. A team backfilling a dbt model against 18 months of Snowflake data ran up a $47,000 compute bill in a single afternoon. Always estimate the cost before running a backfill. Use smaller warehouse sizes (slower but cheaper) for backfills that are not time-sensitive.
- Downstream confusion. If the backfill changes historical numbers, dashboards and reports that referenced the old numbers are now retroactively wrong. Communicate backfills to all downstream consumers before running them. Include the date range affected, what changed, and why.
- Pipeline concurrency. Running a backfill alongside the daily incremental pipeline can cause race conditions — both pipelines writing to the same tables simultaneously. Either pause the incremental pipeline during the backfill or use table partitioning to ensure the backfill writes to historical partitions while the incremental pipeline writes to the current partition.
- Idempotency. The pipeline must produce the same result regardless of how many times it runs for the same input. If the pipeline appends rows without checking for duplicates, reprocessing will double-count data.
- Partition-level replacement. Use
INSERT OVERWRITE(Spark, Hive) orDELETE + INSERTwithin a transaction (PostgreSQL, Snowflake) to atomically replace the old output. Never rely on “the new data will overwrite” without explicit deletion of the old data. - Audit trail. Log every reprocessing run: who triggered it, what date range, why, and the before/after row counts. When a finance team asks “why did yesterday’s numbers change at 3 PM?” the audit trail provides the answer.
Interview Question: Your pipeline backfill for a critical dbt model ran up a $47,000 Snowflake bill in a single afternoon. How do you prevent this and what process changes do you implement?
Interview Question: Your pipeline backfill for a critical dbt model ran up a $47,000 Snowflake bill in a single afternoon. How do you prevent this and what process changes do you implement?
- Cost estimation before backfill. Before any backfill, estimate the compute cost. In Snowflake: check the query profile for a single day’s run, note the credits consumed, and multiply by the number of days in the backfill. A model that costs 0.5 credits/day costs 365 credits for a year-long backfill. At 1,460 — reasonable. But a model that costs 50 credits/day costs $73,000 for a year — that requires a different approach (smaller warehouse, incremental backfill, or pre-filtering).
- Use a dedicated, smaller warehouse for backfills. The daily pipeline might run on an XL warehouse for speed. Backfills should run on a Medium or Small warehouse — 4-8x slower but 4-8x cheaper. For a non-urgent backfill, the speed difference does not matter.
- Backfill in chunks, not all at once. Instead of
dbt run --full-refreshfor 2 years of data, run the backfill month by month. This lets you monitor cost per chunk, catch issues early (the first month might reveal a Cartesian join that would be catastrophic at full scale), and pause if costs are tracking above estimate. - Require approval for backfills exceeding a cost threshold. Any backfill estimated at more than $500 requires engineering manager approval. This is not bureaucracy — it is a gate that forces the engineer to estimate costs and consider alternatives before running an expensive operation.
43.7 Freshness vs Cost: The Data SLA Trade-off
Every data pipeline operates under a freshness SLA — how recent does the data need to be? The answer directly determines the pipeline’s architecture, infrastructure cost, and operational complexity. Most teams default to “as fresh as possible” without quantifying the cost of that freshness. The freshness-cost spectrum:| Freshness | Architecture | Approximate Monthly Cost (10M events/day) | Operational Complexity |
|---|---|---|---|
| Real-time (< 1 minute) | Kafka + Flink, always-on infrastructure | 10,000+ | High (streaming state, exactly-once, watermarks) |
| Near-real-time (1-15 minutes) | Kafka + micro-batch (Spark Structured Streaming) | 5,000 | Medium (micro-batch tuning, small window management) |
| Hourly | Airflow + incremental dbt models, scheduled compute | 1,500 | Low-Medium (scheduling, incremental logic) |
| Daily | Airflow + full-refresh dbt models, spot/preemptible compute | 800 | Low (simple scheduling, full-refresh is idempotent by default) |
- Ask the consumer: “What decision would change if this data were 1 hour fresher?” If the answer is “none” — daily is fine.
- Ask: “What is the dollar cost of stale data?” For fraud detection, the answer might be “$10,000 per hour of delay.” For a weekly executive dashboard, the answer is zero.
- Separate SLAs by data product. The fraud alert pipeline needs real-time. The revenue dashboard needs daily. The ML training pipeline needs weekly. Do not apply the most demanding SLA to all pipelines.
43.8 Data Lineage, Ownership Boundaries, and Trust in Metrics
Data Lineage Data lineage is the ability to trace a metric from its final dashboard value back through every transformation, join, filter, and source table to the original raw event. When the CEO asks “why is revenue down 8% this week?” lineage turns a guessing game into a systematic investigation. Why lineage matters:- Debugging. A wrong number in a dashboard could originate from a source schema change, a transformation bug, a stale intermediate table, or a filter that excluded valid records. Without lineage, you check each possibility manually. With lineage, you follow the path.
- Impact analysis. Before changing a dbt model or dropping a column, lineage tells you which downstream dashboards, ML models, and reports will be affected. Without it, every schema change is a gamble.
- Compliance. Regulations like GDPR require knowing where personal data flows. Lineage answers: “Which downstream tables contain data derived from the
users.emailcolumn?”
- dbt’s built-in lineage graph — Shows model-to-model dependencies within the dbt project. Good for understanding the transformation layer. Limited to dbt models only.
- OpenLineage — An open standard for lineage metadata. Integrates with Airflow, Spark, dbt, and Flink to capture cross-tool lineage automatically.
- DataHub (LinkedIn) — Open-source metadata platform with lineage visualization, search, and data discovery. Integrates with most data tools.
- Monte Carlo — Commercial data observability platform with automated lineage discovery and anomaly detection.
- Source ownership. The team that runs the application owns the raw data it produces. The payments team owns the
transactionstable schema, SLA, and change notification process. - Pipeline ownership. The data engineering team owns the extraction, loading, and transformation pipelines. They are responsible for pipeline uptime, freshness SLA, and data quality checks at each stage.
- Metric ownership. The analytics or data science team owns metric definitions. They decide what “revenue” means, how it is calculated, and where it is served. This is the Minerva model from the Airbnb story.
- Consumer responsibility. Dashboard authors and ML engineers are responsible for using the correct, approved metric definitions — not rolling their own SQL.
- A dashboard shows a number that “does not look right.” An analyst investigates for 2 hours and discovers the pipeline silently skipped a data source.
- Two dashboards showing “revenue” display different numbers because they use different metric definitions.
- A metric changes retroactively because a backfill corrected historical data, but nobody communicated the change.
- A pipeline is “green” (no errors) but the data is 3 days stale because the source stopped sending events and the pipeline has no freshness check.
- Freshness badges. Every dashboard shows when the data was last refreshed. “Revenue as of 2024-03-15 06:00 UTC” is trustworthy. “Revenue” with no timestamp is suspicious.
- Data quality scorecards. Publish a weekly report showing: pipeline uptime, freshness SLA adherence, data quality test pass rate, and incident count. Transparency builds trust.
- Proactive communication. When a pipeline fails or a backfill changes numbers, notify consumers before they discover it themselves. “Heads up: yesterday’s revenue dashboard is 2% lower than initially reported due to a refund processing delay. The corrected numbers are available as of 10 AM” is trustworthy. Silence until someone asks is not.
43.9 Pipeline Failure and Recovery Patterns
Every pipeline will fail. The difference between a mature data platform and a fragile one is not whether failures happen — it is how quickly and cleanly they are recovered. Failure taxonomy for data pipelines:| Failure Type | Example | Detection | Recovery |
|---|---|---|---|
| Hard failure | Pipeline task throws an exception, OOM kill, network timeout | Airflow/Dagster task status, alerting | Retry with backoff, fix and rerun |
| Silent failure | Pipeline succeeds but produces wrong data (partial load, wrong join, stale source) | Data quality tests, freshness checks, anomaly detection | Identify root cause, fix transformation, reprocess affected date range |
| Upstream failure | Source system schema changed, API rate-limited, database migrated | Schema validation at extraction, contract violation alerts | Coordinate with source team, update extraction logic, backfill |
| Infrastructure failure | Warehouse unavailable, Airflow scheduler hung, Kafka consumer lag spike | Infrastructure monitoring, scheduler heartbeat checks | Failover to backup, restart scheduler, scale consumers |
| Cost failure | Backfill ran on an oversized warehouse, query plan regressed, full-refresh instead of incremental | Resource monitors, cost alerts, query duration monitoring | Kill runaway query, resize warehouse, fix incremental logic |
INSERT OVERWRITE or DELETE + INSERT within a transaction, not bare INSERT.
Dead-letter routing. When individual records fail validation or processing, do not fail the entire pipeline. Route the bad records to a dead-letter table, process the good records, and alert the team to investigate the dead letters. This is especially important for streaming pipelines where a single malformed event should not halt processing of millions of good events.
Circuit breaker for upstream dependencies. If the source API is returning errors, do not retry 10,000 times and DDoS the source. Implement a circuit breaker: after N consecutive failures, stop retrying and alert. Resume when the source recovers. Airflow supports this with sensor timeouts and TriggerRule.ALL_DONE for graceful degradation.
Partition-level recovery. When a pipeline fails mid-run, you need to reprocess only the affected partitions, not the entire history. Design pipelines with date-partitioned output so that reprocessing 2024-03-15 does not require reprocessing all of 2024. This is the data equivalent of blast radius containment.
Interview Question: Your streaming pipeline has been down for 4 hours. When it comes back up, there are 4 hours of unprocessed events in Kafka. How do you recover without losing data or producing duplicates?
Interview Question: Your streaming pipeline has been down for 4 hours. When it comes back up, there are 4 hours of unprocessed events in Kafka. How do you recover without losing data or producing duplicates?
| Pipeline Type | Most Common Failure | Recovery Pattern | What to Monitor |
|---|---|---|---|
| Batch (Airflow/dbt) | Upstream schema change breaks transformation | Fix dbt model, reprocess affected dates | dbt test results, row count trends, freshness SLA |
| Stream (Flink/Kafka) | Consumer lag spike causes backpressure, OOM | Scale consumers, increase parallelism, replay from offset | Consumer lag, checkpoint duration, backpressure metrics |
| Orchestration (Airflow) | Scheduler hung, metadata DB bloat, worker pool exhausted | Restart scheduler, clean metadata DB, increase worker pool | Scheduler heartbeat, task_instance table size, worker CPU |
| Data quality (dbt tests/GE) | Silent failure — tests pass but data is wrong due to missing test coverage | Add tests for the missed case, reprocess, communicate correction | Test coverage of critical metrics, anomaly detection on key aggregates |
Interview Question: Your data pipeline has a 6-hour SLA but it has been taking 8 hours recently. The data volume grew 3x. What are your options?
Interview Question: Your data pipeline has a 6-hour SLA but it has been taking 8 hours recently. The data volume grew 3x. What are your options?
- Incremental processing. If the pipeline reprocesses all data daily (full refresh), switch to incremental loads — only process rows that changed since the last run. If your source has a reliable
updated_attimestamp, this can reduce processing volume by 90%+. - Query optimization. Check for missing indexes, expensive full table scans, poorly written joins (Cartesian products, joining on non-indexed columns), and unnecessary
SELECT *. In Spark, check for data skew causing uneven partition sizes. - Partition pruning. If the data is partitioned by date but the query scans all partitions, add partition filters. This is one of the most common performance issues in data warehouses.
- Parallelism. If independent transformation steps run sequentially, parallelize them. Airflow and Dagster support this natively through DAG structure.
- Vertical scaling. Bigger warehouse size (Snowflake), more DPUs (AWS Glue), larger cluster (Spark/EMR). Costs more money but buys time.
- Horizontal scaling. Partition the data and process partitions in parallel. If you have 3 years of data, process each month as an independent task.
- Rearchitect the bottleneck stage. If a single complex dbt model takes 4 hours, break it into intermediate models that can be materialized as tables (not views) and computed incrementally.
- Pre-aggregate. If downstream queries aggregate over fine-grained data, pre-compute common aggregations in an intermediate table.
- Move the bottleneck to a different technology. If a SQL-based transformation is hitting warehouse limits, consider moving that stage to Spark for distributed processing.
- Renegotiate the SLA. Sometimes the honest answer is: a 6-hour SLA was set when the data was 3x smaller. If the business can tolerate an 8-hour SLA (data ready by 10 AM instead of 8 AM), that buys time for a proper rearchitecture.
- Split into fast and slow paths. If 80% of the data is needed by 6 hours (yesterday’s transactions) but the remaining 20% can arrive later (historical recalculations), split the pipeline into a fast path and a slow path with different SLAs.
Connecting the Dots: Cross-Chapter Links
The topics in this module do not exist in isolation. Capacity planning, Git workflows, and data pipelines connect directly to concepts covered in other chapters. Recognizing these connections in an interview signals that you think in systems, not silos. Capacity Planning connects to Cloud Service Patterns. Every back-of-envelope estimate eventually becomes a cloud bill. When you estimate “250 GB/day of chat storage,” the next question is: what does that cost on S3 Standard vs S3 Infrequent Access vs Glacier? When you estimate “35,000 peak requests/second,” the next question is: Lambda (pay-per-invocation) or Fargate (pay-per-hour) or EC2 (pay-per-instance)? The Cloud Service Patterns chapter covers these pricing models in detail, including the cold start trade-offs for Lambda, DynamoDB capacity modes (on-demand vs provisioned), and the S3 storage class tiers that turn your storage estimates into dollar figures. In interviews, connecting your capacity estimate to a cost estimate demonstrates that you think like a business owner, not just a technician. Capacity Planning connects to Database Deep Dives. When you estimate storage for a system, you are implicitly making database decisions. Your estimate of “1.4 PB for tweet data with replication” directly informs whether you need a single PostgreSQL instance, a sharded relational cluster, or a distributed database like Cassandra. The Database Deep Dives chapter covers the specific storage characteristics that affect your estimates: PostgreSQL’s MVCC overhead (dead tuples can double your effective storage), MongoDB’s WiredTiger compression ratios (typically 3-5x for JSON-like documents), DynamoDB’s 400 KB item size limit (which constrains how you model data), and Redis’s memory overhead per key (roughly 50-80 bytes overhead per key beyond the data itself). Using these database-specific numbers in your estimates signals deep knowledge. Capacity Planning connects to OS Fundamentals. Your throughput estimates assume hardware that behaves in predictable ways — but disk I/O, memory limits, and CPU scheduling can invalidate those assumptions. The OS Fundamentals chapter explains why sequential disk reads are 100x faster than random reads (prefetching and page cache behavior), why Kafka achieves GB/s throughput using zero-copy I/O (sendfile()), and how Linux’s page cache means your “disk read” might actually be a memory read. When your capacity estimate says “read 1 MB from SSD = 1 ms,” the OS Fundamentals chapter explains the conditions under which that is true — and the conditions under which it is not (random I/O, memory pressure, cgroup I/O limits in containers).
Capacity Planning connects to Performance. Back-of-envelope throughput calculations (“11,600 requests/second average, 35,000 peak”) are the starting point for the performance engineering work covered in the Performance & Scalability chapter. Your capacity estimates define the load profile; performance engineering tells you how to meet that profile efficiently. If your estimate says you need 35,000 req/s and a single server handles 10,000 req/s, you need at minimum 4 servers — but the Performance chapter will tell you whether caching, connection pooling, or query optimization might let you hit that number with fewer machines.
Data Pipelines connect to Observability. Pipeline monitoring is a specialized form of the observability practices covered in the Caching & Observability chapter. The same principles apply: instrument your pipelines with metrics (row counts, processing duration, error rates), set up alerts on SLA breaches, and build dashboards that let you trace data lineage when something goes wrong. A data pipeline without observability is flying blind — you will not know it is broken until someone downstream complains.
Git Workflows connect to System Design. Your branching strategy directly affects your deployment strategy, which directly affects your system architecture. Trunk-based development implies continuous deployment, which implies feature flags, which implies your system needs a feature flag service. GitFlow implies scheduled releases, which implies versioned APIs and backward compatibility windows. The Git decisions you make in this chapter have architectural consequences discussed in the System Design Practice chapter. The monorepo vs polyrepo choice (covered earlier in this chapter) also has direct implications for the Networking & Deployment chapter — monorepos need different CI/CD pipeline architectures than polyrepos.
Data Quality connects to APIs and Databases. Data contracts (covered in this chapter) are the data equivalent of API contracts covered in the APIs & Databases chapter. The same principles apply: versioning, backward compatibility, deprecation policies, and consumer-driven testing. If you understand API versioning, you already understand 80% of data contracts — the remaining 20% is adapting those principles to batch data instead of request/response APIs.
Curated Resources
A collection of essential references across all three topics covered in this module. These are not filler links — each one is here because it is genuinely worth your time. Capacity Planning and Estimation- System Design Interview by Alex Xu — the most practical guide to back-of-envelope estimation. Chapters walk through real problems step by step.
- The Google SRE Book — Capacity Planning chapter — how Google approaches capacity planning in production. Freely available online.
- “Why Google Stores Billions of Lines of Code in a Single Repository” — Rachel Potvin and Josh Levenberg (2016). The definitive paper on monorepo architecture at extreme scale. Essential for understanding how capacity constraints shape tooling decisions.
- Martin Fowler and the Trunk-Based Development community — trunkbaseddevelopment.com — the comprehensive reference on trunk-based development patterns including feature flags, branch by abstraction, and release strategies.
- Conventional Commits specification — conventionalcommits.org — a lightweight, widely adopted standard for structuring commit messages. Enables automated changelogs and semantic versioning.
- Google Engineering Practices — Code Review Guide — the best freely available guide on how to conduct and receive code reviews effectively.
- Chelsea Troy — Reviewing Pull Requests — a thoughtful, deeply practical series on what makes code review effective. Covers cognitive load, empathy in review, and how to give feedback that actually helps the author improve.
- Accelerate by Nicole Forsgren, Jez Humble, and Gene Kim — research-backed evidence that trunk-based development, short-lived branches, and fast code review cycles predict high-performing engineering teams.
- Nx documentation — nx.dev — the official Nx documentation. Start with the “Why Nx?” guide for the philosophy, then the monorepo tutorial for hands-on setup. The computation caching and affected-target analysis sections are what differentiate Nx from simpler build tools.
- Turborepo documentation — turbo.build — Vercel’s monorepo build tool. The “Core Concepts” section explains task-based caching clearly. Best starting point for teams in the Vercel/Next.js ecosystem.
- Bazel documentation — bazel.build — Google’s open-source build system. Start with the “Concepts” guide. Be warned: the learning curve is steep, but the “Why Bazel?” page makes a compelling case for large polyglot repositories.
- dbt documentation and the Analytics Engineering Guide — docs.getdbt.com — the official dbt documentation is excellent. Start with the “Getting Started” tutorial, then read the Analytics Engineering Guide for the philosophy behind the tool. Understanding dbt is non-negotiable for modern data engineering interviews.
- Apache Spark documentation — spark.apache.org — the official docs, plus the Databricks blog which publishes practical Spark optimization guides and architecture patterns regularly.
- Apache Flink documentation and use cases — flink.apache.org — the official Flink docs include excellent conceptual guides on stream processing fundamentals (windowing, watermarks, exactly-once semantics) that are valuable even if you never use Flink specifically.
- Netflix Technology Blog — data pipeline posts — Netflix’s data engineering team has published some of the best real-world accounts of building and operating data pipelines at massive scale. Search for posts on their data mesh adoption, Spark usage, and pipeline reliability practices.
- Fundamentals of Data Engineering by Joe Reis and Matt Housley — the most comprehensive modern introduction to the full data engineering lifecycle. Covers extraction, loading, transformation, orchestration, and data quality.
- Designing Data-Intensive Applications by Martin Kleppmann — the chapters on batch processing (Chapter 10) and stream processing (Chapter 11) are some of the best technical writing on these topics anywhere. A must-read for anyone building or operating data systems.
- Great Expectations documentation — greatexpectations.io — the Expectations Gallery is the best starting point. Browse the 300+ built-in expectations to understand the range of data quality checks available, then read the Getting Started tutorial.
- dbt-expectations package — GitHub — brings Great Expectations-style assertions into dbt as custom tests. The README has excellent examples. Pair this with the dbt generic tests documentation for a complete testing strategy.
- GoCardless Engineering Blog — Data Contracts — GoCardless’s posts on implementing data contracts across their payment infrastructure are among the most practical, real-world accounts of data contracts in production. Search for their data quality and data contract articles.
- Soda documentation — docs.soda.io — Soda offers a YAML-based approach to data quality checks that integrates with Airflow, dbt, and standalone pipelines. Lighter-weight than Great Expectations for teams that want contract-as-code without the Python framework.
Interview Deep-Dive Questions
These questions go beyond surface-level recall. They are structured the way a senior interviewer actually probes: an opening question to set the stage, follow-ups that push into real-world judgment, and “going deeper” prompts that separate staff-level thinking from rehearsed answers. Use these to pressure-test your understanding of capacity planning, Git workflows, and data pipelines.Q1: You are asked to estimate whether a single PostgreSQL instance can handle the read traffic for a new feature. Walk me through your approach.
Q1: You are asked to estimate whether a single PostgreSQL instance can handle the read traffic for a new feature. Walk me through your approach.
- Step 1 — Estimate the load. I need to know the expected daily active users for this feature, the average number of reads per user session, and the peak-to-average ratio. Say the feature serves 5 million DAU, each triggering 3 reads per session. That is 15 million reads/day, or roughly 175 reads/second average. With a 3x peak multiplier, I plan for about 525 reads/second peak.
- Step 2 — Know the machine’s capability. A well-tuned single PostgreSQL instance on modern hardware (NVMe SSD, 64 GB RAM, good indexes) can handle 5,000 to 20,000 simple indexed queries per second. For complex queries with joins and aggregations, that drops to 500-2,000 per second depending on the query plan.
- Step 3 — Compare with headroom. My peak estimate of 525 reads/second is well within PostgreSQL’s range for simple queries. But I would not just declare “it fits.” I would ask: are these queries simple key lookups, or do they involve joins across large tables? What is the working set size compared to available RAM? If the active dataset fits in memory, most reads hit the page cache and latency stays under 5 ms. If not, I am hitting disk and throughput drops dramatically.
- The honest conclusion: For this load, a single PostgreSQL instance almost certainly works. But I would validate the estimate with a load test using pgbench or k6 against a replica before committing. The estimate tells me “this should work.” The load test tells me “this does work.”
Follow-up: “What if six months later, DAU has grown 10x and reads are now 5,000/second peak. What changes?”
At 5,000 reads/second peak, I am still within PostgreSQL’s theoretical throughput for simple queries, but I have lost all headroom. One slow query or a vacuum operation could cause latency spikes that cascade. My options, in order of effort:- Read replicas. The simplest horizontal scaling for reads. Send read traffic to one or two replicas. This buys me 2-3x immediately with minimal application changes. The trade-off is replication lag — if the feature cannot tolerate reading data that is a few hundred milliseconds stale, I need to be careful about which queries go to replicas.
- Caching layer. Put Redis or Memcached in front of the most frequently read data. If 80% of reads hit the same 20% of data (which is common for user-facing features), a cache with a 30-second TTL could absorb 70-80% of the traffic.
- Connection pooling with PgBouncer. If the bottleneck is connection overhead rather than query execution, PgBouncer in transaction mode can dramatically increase effective throughput by multiplexing hundreds of application connections across a smaller pool of database connections.
Follow-up: “How do you decide between adding read replicas versus adding a cache?”
In my experience, the decision comes down to two factors: data staleness tolerance and access pattern shape.- If the feature can tolerate 30-60 seconds of stale data and the access pattern is skewed (a small percentage of keys get most of the reads), caching wins. It is cheaper, faster to implement, and absorbs more load per dollar. A single Redis instance handling 100K ops/second can replace multiple database replicas.
- If the feature needs near-real-time data consistency or the access pattern is uniform (every key is equally likely to be read), read replicas are better. Caching uniform access patterns leads to low hit rates and wasted memory.
- The hybrid approach is common in production: cache for the hot path (popular items, user profiles, session data), read replicas for the long tail (search results, analytics queries, reports).
Going Deeper: “You mentioned working set size versus RAM. Explain why that matters for PostgreSQL specifically.”
PostgreSQL relies heavily on the OS page cache. When you read a row, the data page it lives on gets loaded into the Linux page cache. Subsequent reads of nearby rows hit RAM instead of SSD. If your active dataset — the tables and indexes you are querying frequently — fits in RAM, PostgreSQL effectively becomes an in-memory database for reads. Latency stays under 1 ms, throughput is high.The moment your working set exceeds available RAM, reads start hitting the SSD. Even on NVMe (10-20 us per random read), that is an order of magnitude slower than a cache hit. Worse, random I/O patterns from index lookups across a large table can fragment the page cache and cause evictions, creating a cliff effect where performance degrades non-linearly. This is why a PostgreSQL instance that handles 15,000 queries/second at 80% memory utilization might struggle with 5,000 queries/second at 110% — the working set spilled to disk.The practical takeaway: always compare your estimated data size (including indexes, which often double the table size) against the instance’s available RAM. If the data fits, single-node PostgreSQL is remarkably capable. If it does not, no amount of query tuning will save you — you need to reduce the working set (partitioning, archiving old data) or add more RAM.Q2: Compare trunk-based development and GitFlow. When would you advocate for each, and when would you fight against each?
Q2: Compare trunk-based development and GitFlow. When would you advocate for each, and when would you fight against each?
- Trunk-based development means everyone commits to main (or a very short-lived branch that merges within hours). Feature flags hide incomplete work. CI runs on every commit. The branch is the feature flag. This works when you deploy continuously (multiple times per day), your CI is fast and reliable (under 10 minutes), and your team has the discipline and tooling for feature flags. Google, Meta, and most high-performing SaaS teams use this. The benefit: integration pain approaches zero because you never diverge far from main.
- GitFlow uses long-lived develop, release, and hotfix branches. It works when you ship versioned software — mobile apps, embedded systems, on-premise installations — where you need to support multiple versions simultaneously. If you are maintaining v3.1 in production, v3.2 in beta, and developing v4.0, GitFlow’s branch structure maps to that reality. The benefit: explicit control over what goes into each release.
Follow-up: “You inherit a team of 15 engineers using GitFlow for a SaaS product. Merge conflicts are a weekly problem. How do you migrate without disrupting the team?”
I would not do a big-bang migration. I have seen that fail — you announce “we are trunk-based now” on Monday and by Wednesday everyone is confused and the build is broken. Instead, I would do it in phases over 6-8 weeks:- Weeks 1-2: Shorten branch lifetimes within GitFlow. Do not change the branching model yet. Just enforce a new norm: no feature branch lives longer than 2 days. Break large features into smaller, independently mergeable pieces. This alone eliminates 70% of merge conflicts because conflicts grow exponentially with branch lifetime, not linearly.
- Weeks 3-4: Introduce feature flags. Deploy LaunchDarkly, Unleash, or even a simple config-driven flag system. Start wrapping new features behind flags so partially complete work can be merged. This is the critical prerequisite — without it, you cannot merge incomplete work to main.
- Weeks 5-6: Merge directly to main, retire the develop branch. With short branches and feature flags, the develop branch is just a mirror of main with extra merge steps. Start bypassing it. Keep the CI pipeline running on every merge to main.
- Weeks 7-8: Retire release branches for the SaaS product. With CI/CD deploying from main and feature flags controlling visibility, you do not need release branches for a web service. Cut them. If you also ship a mobile app, keep lightweight release branches for that — but no feature development happens on them.
Follow-up: “A principal engineer on the team pushes back, arguing that trunk-based development is riskier because untested code reaches main. How do you respond?”
With respect, because this concern comes from a real place — but the premise is inverted. Trunk-based development is not “less tested.” It is “tested differently and more frequently.”In GitFlow, a feature branch might accumulate two weeks of changes. Those changes are tested in isolation on the branch, but they are not tested against everyone else’s two weeks of changes until merge day. That is when you discover that three features have conflicting database migrations, or that two branches modified the same configuration file in incompatible ways. The “testing” happened, but the integration testing did not.In trunk-based development, every merge to main runs CI against the current state of the world. If my change breaks something, I find out in hours, not weeks. The feedback loop is tighter. Yes, main might occasionally break — but you catch it immediately and the blast radius is one small change, not two weeks of accumulated divergence.The concrete data to cite: the Accelerate research (Forsgren, Humble, Kim) measured thousands of engineering teams and found that trunk-based development with short-lived branches is one of the strongest predictors of high-performing teams. Long-lived branches correlate with lower deployment frequency and higher change failure rates. The data is clear.That said, I would acknowledge the principal engineer’s underlying concern: trunk-based development requires investment in CI speed, test coverage, and feature flags. If the team’s CI takes 45 minutes and test coverage is 20%, jumping to trunk-based development will be painful. Fix those prerequisites first.Q3: Explain the difference between batch and stream processing. Then tell me about a time you chose one over the other and what you learned.
Q3: Explain the difference between batch and stream processing. Then tell me about a time you chose one over the other and what you learned.
Follow-up: “You mentioned Flink for the fraud detection use case. Why Flink over Spark Structured Streaming?”
For that specific use case, the deciding factor was true event-at-a-time processing with low latency. Spark Structured Streaming uses micro-batches — it accumulates events for a configurable interval (default 100 ms, but in practice often configured to 1-10 seconds for throughput), then processes the batch. For a dashboard, that is fine. For fraud detection where every second of delay means another potentially fraudulent charge, we needed genuine per-event processing with sub-100ms latency. Flink processes each event individually as it arrives.The second factor was Flink’s native support for complex event processing (CEP). Fraud rules often involve patterns across multiple events: “flag a user if they make more than 5 transactions from 3 different countries within 10 minutes.” Flink’s CEP library expresses these temporal patterns natively. In Spark, I would have had to manage windowed state manually, which is doable but more code and more error-prone.The trade-off: Flink has a steeper learning curve and a smaller talent pool. If the team had no streaming experience, I might have started with Spark Structured Streaming as a stepping stone — accepting slightly higher latency in exchange for a tool the team already knew.Follow-up: “What is the hardest operational problem you have faced with streaming systems?”
Late-arriving events. Hands down. In batch processing, you have all the data for the window before you process it. In streaming, events arrive out of order. A transaction that happened at 2:00:15 PM might arrive at the processing engine at 2:01:30 PM due to network delays, retry queues, or mobile devices coming back online.The question is: what do you do with it? If your 2:00 PM window already closed and emitted results, do you reprocess the window? Emit a correction? Drop the late event? There is no universally correct answer — it depends on the use case. For fraud detection, we could not afford to drop late events (a fraudulent transaction arriving late still needs to be checked). For the analytics dashboard, late events arriving within 5 minutes were processed (using Flink’s allowed-lateness feature); beyond that, they were routed to a batch reconciliation job that corrected the numbers overnight.The operational pain is that late events make streaming results approximate until some reconciliation window closes. Business stakeholders who are used to batch — where the number is either ready or not ready — struggle with “the number is 95% accurate right now and will be fully accurate by tomorrow.” Setting those expectations upfront saves a lot of trust debt later.Going Deeper: “Explain watermarks in stream processing and why they matter.”
A watermark is the streaming engine’s way of saying: “I believe I have received all events with timestamps up to time T.” It is a heuristic, not a guarantee. Flink (and Beam/Dataflow) use watermarks to decide when to close a time window and emit results.Here is why it matters concretely. Suppose I have a 5-minute tumbling window aggregating transaction counts: 2:00-2:05, 2:05-2:10, etc. When should I close the 2:00-2:05 window and emit the count? If I close it at exactly 2:05 wall-clock time, I will miss any event with a 2:03 timestamp that arrives at 2:06 due to network delay. The watermark solves this: it tracks the progress of event time (not wall-clock time) based on observed events, and the engine closes the window when the watermark passes 2:05.The trade-off is between completeness and latency. A conservative watermark (waits longer, allows more late data) gives more accurate results but higher latency. An aggressive watermark (closes windows quickly) gives lower latency but may miss late events. In practice, I configure watermarks with an allowed lateness — say, 2 minutes. Events arriving within 2 minutes after the watermark trigger an update to the already-emitted result. Events arriving after that are sent to a side output for batch reconciliation.The mental model I use: watermarks are the streaming equivalent of the “data cutoff time” in batch processing. In batch, you implicitly say “I will process all data that arrived by 7 AM.” A watermark makes that same contract explicit and continuous.Q4: You are doing a system design interview and the interviewer says: 'Design a notification system for 50 million users.' Start with a capacity estimate.
Q4: You are doing a system design interview and the interviewer says: 'Design a notification system for 50 million users.' Start with a capacity estimate.
- Assumptions: 50 million total users, 20 million daily active. Each DAU receives an average of 5 notifications/day (some users get 1, power users get 20). Notification payload: ~500 bytes (JSON with user_id, type, title, body, timestamp, deeplink URL, metadata). Notifications are written once and read 1-3 times (initial delivery, plus the user opening the notification tray).
- Write throughput: 20M DAU x 5 notifications = 100 million notifications/day. That is roughly 1,160 writes/second average, approximately 3,500 writes/second peak (3x for peak hours — morning commute and evening in major timezones).
- Read throughput: Each DAU checks their notification tray ~10 times/day (pull to refresh, app open, etc.), fetching the latest 20 notifications. That is 200 million reads/day, or about 2,300 reads/second average, approximately 7,000 peak. Each read returns ~20 x 500 bytes = 10 KB.
- Storage: 100M notifications/day x 500 bytes = 50 GB/day. If we retain 90 days of notifications, that is 4.5 TB. With 3x replication, roughly 13.5 TB. This is well within a single sharded database cluster.
- Push delivery bandwidth: If 80% of notifications are pushed in real-time (via APNs, FCM, or WebSocket), that is 80M push deliveries/day, or about 925/second average, 2,800/second peak. Each push payload is small (~200 bytes to the push service). This is moderate throughput — a few horizontally scaled workers can handle it.
- Sanity check: 500 bytes per notification, 5 per user per day = 2.5 KB per user per day. For 20M DAU, that is 50 GB/day. Per-user storage for 90 days is about 225 KB — seems reasonable for a notification history.
- Key insight from the estimate: The system is write-heavy in terms of creation (every backend event generates notifications for potentially thousands of users via fan-out), but the per-user read pattern is small (fetch 20 recent items). Storage is modest. The real challenge is not storage or throughput — it is the fan-out. When a popular creator posts content that triggers notifications for 10 million followers, that single event becomes 10 million writes. That fan-out spike is the design challenge.
Follow-up: “Good. You mentioned fan-out as the hard problem. How do you handle a single event that needs to fan out to 10 million users?”
This is the classic fan-out-on-write vs fan-out-on-read decision. Both have major trade-offs.- Fan-out-on-write: When the event happens, immediately create 10 million notification records, one per follower. Pro: reads are fast — each user just queries their own notification table. Con: the write amplification is enormous. A single post from a celebrity generates 10 million writes. If that celebrity posts 5 times a day, that is 50 million writes from one user. This can overwhelm the write path.
- Fan-out-on-read: When the event happens, store it once. When a user opens their notification tray, query for recent events from all sources they follow. Pro: writes are cheap (one write per event). Con: reads are expensive — you need to query across all of a user’s subscription sources in real-time.
- The hybrid approach (what Twitter, Instagram, and most large-scale systems actually do): Fan-out-on-write for normal users (under 10K followers — the vast majority). Store the event reference for high-follower accounts and merge it at read time. This caps the write amplification while keeping reads fast for 99% of users.
Follow-up: “Your estimate assumed 500 bytes per notification. What if the product team wants to add rich media previews — an image thumbnail with each notification, about 50 KB each?”
That changes the math dramatically. 100 million notifications/day x 50 KB = 5 TB/day just for thumbnails. Over 90 days, that is 450 TB. With replication, over 1 PB. We have moved from “fits in a database cluster” to “needs object storage.”The solution is to not store the thumbnail inline with the notification. Store a URL reference to the thumbnail in the notification record (keeping it at ~600 bytes), and serve the actual image from a CDN-backed object store (S3 + CloudFront). The notification payload stays small. The client fetches the thumbnail separately when rendering the notification tray, and the CDN caches popular thumbnails.This is a general pattern: always separate metadata from media. Store references, not blobs. The capacity plan for metadata and the capacity plan for media should be independent because they have different storage technologies, different cost profiles, and different scaling characteristics.Q5: Your team is debating monorepo versus polyrepo for a new platform with 6 services. Three engineers want a monorepo, three want separate repos. How do you make this decision?
Q5: Your team is debating monorepo versus polyrepo for a new platform with 6 services. Three engineers want a monorepo, three want separate repos. How do you make this decision?
- How coupled are the services? If services A, B, and C share data models and change together in 70% of PRs, a monorepo eliminates the multi-repo coordination tax. If the services are truly independent (different teams, different release cadences, different languages), separate repos preserve autonomy. I would actually look at the last 3 months of changes — how often did a change in one service require a corresponding change in another? That data answers the question better than any opinion.
- Team size and growth trajectory. 6 engineers today can manage a monorepo with Turborepo or Nx with almost zero tooling overhead. If we plan to grow to 60 engineers in 18 months, the monorepo will need serious investment in affected-target analysis, build caching, and potentially CODEOWNERS to manage review routing. Separate repos handle team growth more naturally because each repo has its own CI, its own permissions, and its own release cycle.
- Language homogeneity. If all 6 services are TypeScript, a monorepo with Nx or Turborepo is a natural fit — shared linting configs, shared tsconfig, shared test infrastructure. If three services are in Python and three are in Go, cramming them into a monorepo with a single build system is painful. You end up maintaining parallel build configurations, and the “shared tooling” benefit of a monorepo evaporates.
- CI/CD maturity. A monorepo requires CI that understands project boundaries — only build and test what changed. If the team’s CI is “run everything on every push,” a monorepo with 6 services means every PR triggers 6 test suites. That will frustrate the team in week two. Polyrepo gives you per-service CI for free.
Follow-up: “Six months in, the monorepo CI is taking 25 minutes per PR because it runs all tests. Engineers are complaining. What do you do?”
This is the classic monorepo scaling problem, and it is solvable. The root cause is that the CI does not know what changed — it treats every PR as if it touched every service.- Immediate fix: affected-target analysis. Configure Turborepo (or Nx) to detect which packages/services were affected by the changed files and only run those tests. If I changed service A and services B-F are unaffected, only service A’s tests should run. This typically cuts CI time by 70-80% for most PRs.
- Add remote caching. Turborepo supports remote caching (via Vercel or a self-hosted cache). If a build or test has already been run for the exact same inputs (same code hash), serve the result from cache. This means if I pull main and run tests locally, and then CI runs the same tests, CI gets a cache hit and finishes in seconds.
- Parallelize the remaining tests. If the affected service still has a 15-minute test suite, split it into parallel jobs. Most CI systems (GitHub Actions, CircleCI) support fan-out: run unit tests, integration tests, and lint in parallel instead of sequentially.
- Profile the slow tests. Often 80% of CI time comes from 20% of tests. Find the slow ones. Are integration tests spinning up real databases for every test? Use a shared test database. Are tests doing unnecessary I/O? Mock the expensive parts. Are snapshot tests regenerating large fixtures? Cache the fixtures.
Going Deeper: “Compare Nx, Turborepo, and Bazel. When is each the right choice?”
- Turborepo is the right choice for teams that want monorepo benefits with minimal configuration. It works at the task level — it hashes inputs, caches outputs, and parallelizes tasks. It does not deeply model the dependency graph between projects. If your monorepo is 5-15 packages in the JavaScript/TypeScript ecosystem and you want fast setup, Turborepo wins. It is the Honda Civic of monorepo tools — reliable, efficient, no surprises.
- Nx is the right choice for teams that need deeper project-graph awareness. Nx models the dependency relationships between every project in the repo and uses that graph for affected-target analysis, build ordering, and code generation. If your monorepo has 30+ packages with complex inter-dependencies and you want the tooling to understand “changing this shared library requires retesting services X, Y, and Z but not W,” Nx is more powerful than Turborepo. It is more opinionated and has a steeper learning curve, but the project graph is genuinely useful at scale.
- Bazel is the right choice for large, polyglot repositories where build correctness and hermeticity are non-negotiable. Bazel guarantees that builds are reproducible — the same inputs always produce the same outputs, regardless of the developer’s local environment. It supports any language via rules, and it scales to millions of files. The cost: you write BUILD files for every package, you learn Starlark (Bazel’s configuration language), and you invest significant engineering time in build infrastructure. Bazel is the right choice for organizations with 50+ engineers, multiple languages, and a dedicated developer productivity team. It is the wrong choice for a 6-person startup.
Q6: A nightly data pipeline that has been running reliably for 18 months starts producing incorrect numbers. No code was deployed. How do you investigate?
Q6: A nightly data pipeline that has been running reliably for 18 months starts producing incorrect numbers. No code was deployed. How do you investigate?
- Step 1 — Confirm the symptoms. What specifically is wrong? Are numbers too high, too low, or missing entirely? Compare today’s output against yesterday’s and last week’s. Quantify the discrepancy. Is it all data or a specific segment (one region, one product category, one payment type)?
- Step 2 — Check the source. “No code was deployed” to the pipeline, but was code deployed to the source system? A product team shipping a schema change — renaming a column, adding a new enum value, changing a field from nullable to non-nullable — is the number one cause of “nothing changed but the data is wrong.” Check the source system’s deployment history and schema changelog. In my experience, 60-70% of data pipeline incidents trace back to upstream schema changes that were not communicated.
- Step 3 — Check data volume and completeness. Run row counts by day/hour and compare to the previous week. A sudden drop suggests data is missing (extraction failed partially, source system had an outage during the extraction window, network timeout caused a partial load). A sudden spike suggests duplicates (a retry that was not idempotent, a reprocessed batch that double-loaded records).
- Step 4 — Check for environmental changes. Did the warehouse resize? Did Airflow scheduling change? Did a dependency (database, API, file system) change credentials, IP addresses, or rate limits? Did DST (daylight saving time) cause a timezone shift in the extraction window?
- Step 5 — Diff the data. Take yesterday’s correct output and today’s incorrect output. Diff them at the row level. Where do they diverge? Trace those divergent rows back through each transformation stage to find the first point of divergence. This is the data equivalent of
git bisect— find the exact stage where the data went wrong.
Follow-up: “You traced it to a new enum value in the source system’s payment_type field. Your pipeline’s CASE WHEN statement does not handle the new value, so those rows fall into a NULL bucket and are excluded from revenue totals. How do you fix this — both immediately and long-term?”
Immediate fix: Add the new payment type to the CASE WHEN and rerun the pipeline for the affected days. Validate that the output now matches expected totals. Communicate the correction to downstream consumers (dashboard users, finance team) with a clear explanation of the date range affected and the magnitude of the correction.Long-term fix — this is where the answer separates senior from junior:- Replace the CASE WHEN with a dimension table. Instead of hardcoding payment types in SQL, maintain a
payment_typesreference table that maps every valid payment_type to its display name, category, and processing rules. The pipeline joins against this table. When a new payment type appears, adding a row to the reference table is all that is needed — no pipeline code change required. - Add a “catch-all” assertion. Configure a dbt test or Great Expectations check that flags any rows with unrecognized payment types. This way, the next time a new enum value appears, the pipeline fails loudly with “unknown payment_type: ‘crypto_wallet’” instead of silently dropping rows into NULL.
- Implement a data contract with the source team. The payments team should not be able to add a new payment type without notifying the data team. A data contract with a 14-day notice period for new enum values prevents this class of issue entirely. Use the Confluent Schema Registry or a dbt model contract with
accepted_valuesenforcement. - Backfill validation. After the fix, run the pipeline for the last 30 days and compare against previously generated numbers. If the new payment type has been trickling in for weeks before it reached a noticeable volume, earlier reports may also be slightly wrong.
Follow-up: “How would you set up monitoring so you catch this kind of issue before a business analyst notices?”
Three layers of monitoring, each catching different failure modes:- Freshness monitoring. Alert if the pipeline has not completed by its SLA (say, 7 AM). This catches outright failures. Tools: Airflow SLA monitoring, Monte Carlo, or a simple check on the max timestamp in the output table.
- Volume monitoring. Track daily row counts and total revenue over time. Alert on deviations beyond 2 standard deviations from the 7-day rolling average. This catches partial loads and duplicates. A 15% revenue drop should trigger an alert, not wait for a human to notice.
- Distribution monitoring. Profile the output data’s key columns on each run. Track the distinct values in payment_type, the null rate in critical fields, and the distribution of revenue across categories. Alert when a new value appears or a null rate exceeds its historical baseline. This is specifically what would have caught the enum issue — a new “unhandled” category appearing in payment_type or a spike in NULL values.
Q7: Explain data contracts. Then tell me why most teams that need them do not have them, and how you would introduce them to a resistant organization.
Q7: Explain data contracts. Then tell me why most teams that need them do not have them, and how you would introduce them to a resistant organization.
- Start with the pain. Do not lead with “we should implement data contracts.” Lead with “we have had 12 data incidents in the last quarter, 9 of which were caused by upstream schema changes. Each incident took 4-6 hours of data engineering time to investigate and fix, plus a day of incorrect dashboards. That is 36-54 hours of engineering time and 9 days of unreliable data. Here is the dollar cost.” Frame it as a business problem, not a tooling preference.
- Start small, with one high-value contract. Do not try to contract every table. Pick the single most critical data source — the one that feeds the revenue dashboard or the ML model that drives pricing. Define a contract for that one source. Implement automated enforcement. When the first would-be incident is caught by the contract instead of by a panicked Slack message, you have your proof of concept.
- Make compliance easy for producers. If the contract enforcement is painful for the product team, they will route around it. Provide tooling: a schema registry that validates automatically, a CI check that flags breaking changes in their PR before merge, a clear process for requesting contract changes. The goal is that compliance takes 5 minutes, not a meeting.
- Escalate through wins, not mandates. After the first contract saves an incident, publicize it. After three contracts save three incidents, present the results to engineering leadership. Let the data speak. Mandating contracts top-down without demonstrated value creates resentment. Growing them bottom-up with evidence creates buy-in.
Follow-up: “What is the relationship between data contracts and API versioning? How does understanding one help with the other?”
They are solving the same problem in different contexts. An API contract says “I will always return a JSON response with fielduser_id as a string and created_at as an ISO 8601 timestamp.” A data contract says “the transactions table will always have a payment_type column of type VARCHAR with values in the set .”The versioning patterns are identical:- Non-breaking changes (adding a new nullable column, adding a new optional API field) can be deployed without consumer coordination.
- Breaking changes (removing a column, changing a type, renaming a field) require a deprecation period, consumer notification, and a migration path.
- Versioning strategies apply: you can version your data contract (v1 guarantees these columns; v2 adds these new ones) just like you version an API (v1 returns this response shape; v2 returns a richer one).
Going Deeper: “How does Confluent Schema Registry enforce data contracts at the Kafka level?”
Schema Registry stores Avro, Protobuf, or JSON schemas for every Kafka topic. When a producer publishes a message, the serializer validates it against the registered schema. If the message does not conform, serialization fails and the message is not published. This is enforcement at the wire level — contract violations are physically impossible, not just flagged.The power is in the compatibility modes. You can configure a topic’s schema to enforce:- BACKWARD compatibility: New schema can read data written by old schema. Consumers can upgrade before producers.
- FORWARD compatibility: Old schema can read data written by new schema. Producers can upgrade before consumers.
- FULL compatibility: Both backward and forward. The safest, most restrictive mode.
payment_type is always a string, but it cannot guarantee that the string is always one of a known set of values. For semantic validation, you still need application-level checks (dbt accepted_values tests, Great Expectations assertions).Q8: Walk me through how you would sanity-check a back-of-envelope estimate that feels wrong.
Q8: Walk me through how you would sanity-check a back-of-envelope estimate that feels wrong.
- Per-user reasonableness. Divide the total by the number of users. If I estimated 500 TB/year for a service with 10 million users, that is 50 MB per user per year. For a text messaging app, that feels right (a few hundred messages/day at 500 bytes each). For a simple profile service, that is way too high. For a video platform, that is way too low. The per-user number gives immediate intuition.
- Comparison against known systems. I keep a mental database of real-world reference points. Twitter stores ~300 PB total. Netflix accounts for ~15% of internet bandwidth. Instagram stored 2+ billion images by 2022. If my estimate for a similar-scale system is 3 orders of magnitude off from a known reference point, I have an error.
- Cross-check storage against throughput. If I estimated 100 TB/day of storage but my throughput estimate says the system handles 1,000 requests/second at 1 KB each, those numbers conflict. 1,000 req/s x 1 KB x 86,400 seconds = 86 GB/day, not 100 TB/day. When storage and throughput estimates disagree, one of my assumptions is wrong.
- Cost gut-check. Multiply by cloud pricing. S3 at 5M/month storage bill for a company with 50 employees, something is wrong — or I have just discovered why they need to build their own storage infrastructure.
- Order-of-magnitude bands. 1 GB = a laptop handles it. 1 TB = a single beefy server. 1 PB = you need distributed systems engineering. 1 EB = you are a top-10 internet company. If my estimate puts a startup at exabyte scale, I dropped a factor of 1,000 somewhere (almost always confusing MB with GB or seconds with milliseconds).
Follow-up: “You did an estimate for a video streaming platform and arrived at 50 PB/year for storage. The actual number turns out to be 500 PB. Where did you probably go wrong?”
A 10x underestimate (one order of magnitude off) usually means I missed a major multiplicative factor, not that my base math was wrong. For video specifically, the most likely culprits:- Transcoding multiplier. I probably estimated raw upload storage but forgot that video platforms store multiple transcoded versions (144p, 360p, 720p, 1080p, 4K). Each resolution is a separate file. A single uploaded video might become 5-8 stored files. That is a 5-8x multiplier I may have omitted.
- Thumbnails and previews. Each video generates multiple thumbnails, animated preview clips (hover previews), and chapters. These add 5-10% on top of video storage — not a 10x factor by themselves, but they compound.
- Retention and growth. If I estimated based on current upload rate but did not account for the platform growing 2-3x during the year, my annual number would be low by end of year.
- Redundancy and availability. Object storage like S3 stores data with internal redundancy (3x in standard storage class). If my estimate was for logical data size, the physical storage is 3x that.
Follow-up: “In an interview, you realize mid-calculation that your estimate is going to be wildly off. What do you do?”
The worst thing to do is quietly hope the interviewer does not notice. The best thing is to stop, acknowledge it, and recalibrate. I would say something like: “Wait — let me sanity-check this. I am getting 50 exabytes for a mid-size social media app, which would make it larger than all of Google. I think I dropped a factor somewhere. Let me walk back through…” Then find the error (usually a units confusion or a missed division by 1,000) and correct it.Interviewers are watching for exactly this behavior. The ability to catch your own mistakes, flag them, and self-correct is more impressive than getting the right answer on the first try. It shows the self-monitoring that experienced engineers apply to everything — code, architecture, estimates, production incidents. An engineer who plows forward with an obviously wrong number is more concerning than one who catches and corrects a mistake.Q9: Your company has 30 engineers and 4 teams. Code reviews are taking 3-5 days on average. Deployment frequency has dropped. How do you diagnose and fix this?
Q9: Your company has 30 engineers and 4 teams. Code reviews are taking 3-5 days on average. Deployment frequency has dropped. How do you diagnose and fix this?
- Where is the latency? Measure time from PR opened to first review, first review to approval, and approval to merge. Often the bottleneck is “time to first review,” not “time from review to approval.” If PRs sit unreviewed for 3 days but are approved within an hour of being reviewed, the problem is reviewer allocation, not review quality.
- Who is the bottleneck? In many teams, one or two senior engineers are the review gateway for everything. If 80% of PRs are waiting for the same two people, the fix is distributing review responsibility. Use CODEOWNERS to auto-assign reviewers based on which files changed, and expand the set of trusted reviewers through mentorship.
- How big are the PRs? If the average PR is 1,500 lines, no reviewer wants to pick it up. That is not a process problem — it is a PR size problem. Large PRs are intimidating, take hours to review, and get procrastinated. Set a team norm: under 400 lines. If a feature requires more, use stacked PRs.
- Set a team norm: first review within 4 business hours. Make it a team agreement, not a mandate. Track it on the team dashboard. Public visibility creates accountability.
- Reduce PR size. Under 400 lines. This is the single highest-leverage change. Small PRs get reviewed faster, reviewed more thoroughly, and merge with fewer conflicts. A 200-line PR is a 15-minute review. A 2,000-line PR is a 2-hour commitment that no one volunteers for.
- Distribute review load. Use CODEOWNERS so PRs are automatically routed to the right reviewer. Rotate the “reviewer of the day” role so the burden is shared. Pair junior reviewers with senior ones to build capacity.
- Automate what machines can do. If reviewers spend time commenting on formatting, import ordering, or naming conventions, automate those with linters and formatters (Prettier, ESLint, gofmt, Black). Reserve human review time for logic, design, and edge cases — the things machines cannot catch.
- Address the cultural root cause. In some teams, review latency is high because engineers view review as a chore, not a primary responsibility. Reframe it: code review is not overhead on your “real work” — it is your work. An engineer who writes great code but does not review is half an engineer. Make review throughput part of performance discussions.
Follow-up: “A team lead argues that thorough reviews are more important than fast reviews, and rushing reviews leads to bugs. How do you respond?”
This is a false dichotomy, and it is one of the most common objections. Speed and thoroughness are not opposed — they are correlated.Small PRs (under 400 lines) can be reviewed both quickly and thoroughly. A reviewer can hold 200 lines of context in their head, spot edge cases, and give meaningful feedback in 20 minutes. A 2,000-line PR cannot be reviewed thoroughly regardless of how much time you spend — cognitive research shows that reviewer effectiveness drops dramatically after the first 200-400 lines. So the “thorough” 3-day review of a 2,000-line PR is actually less thorough than a 20-minute review of a 200-line PR.The data supports this: Google’s engineering practices research found that the most effective reviews happen within 24 hours and on PRs under 400 lines. Review quality (measured by defects caught) correlates with PR size more strongly than with review duration. The path to both speed and thoroughness is smaller PRs, not longer review sessions.That said, I would validate the team lead’s concern. If we speed up reviews and defect rates increase, we have a real problem. Track defect escape rate (bugs found in production that were present in a recently reviewed PR) before and after the process change. If it increases, the speed-up went too far.Follow-up: “How do you handle PRs that genuinely need to be large — a major database migration, a cross-cutting refactor?”
Some PRs are legitimately large and cannot be easily broken up. For those:- Use stacked PRs where possible. A database migration can often be broken into: (1) add the new column/table (schema only, no logic), (2) write the migration code that backfills, (3) update the application to use the new schema, (4) remove the old schema. Each PR is independently reviewable and mergeable.
- For truly atomic large changes, use the “guided review” approach. The author writes an extensive PR description that walks the reviewer through the changes in the order they should be read. Include a “how to review this PR” section: “Start with
migration.sqlto understand the schema change. Then readUserService.javafor the updated queries. ThenUserServiceTest.javafor the new test cases.” This reduces the cognitive load of a large review dramatically. - Pair-review for large PRs. Instead of asynchronous review, schedule a 30-minute synchronous review session where the author walks the reviewer through the code. This is 10x more efficient for large, complex changes than async comment threads.
Q10: You need to choose between ETL and ELT for a new data platform. What factors drive your decision?
Q10: You need to choose between ETL and ELT for a new data platform. What factors drive your decision?
- Modern cloud warehouses have cheap, scalable compute. Transforming data inside the warehouse using SQL (via dbt) is often faster than transforming it in a separate staging system and then loading it. You are leveraging the warehouse’s massively parallel query engine instead of a separate ETL server.
- Raw data preservation. In ELT, raw data is loaded first, then transformed. If your transformation logic has a bug, you can fix the logic and re-run the transformation on the preserved raw data. In ETL, if the transformation was wrong, you might need to re-extract from the source — which may not be possible if the source data has changed.
- Analytics team autonomy. With dbt and SQL-based transformations, analytics engineers can own and iterate on transformations without depending on the data engineering team. This is a huge productivity unlock for organizations where the data engineering team is a bottleneck.
- Compliance requires it. If PII must be stripped, redacted, or tokenized before entering the warehouse, you need to transform before loading. This is common in healthcare (HIPAA) and finance (PCI-DSS). You cannot load raw credit card numbers into a shared analytics warehouse and then mask them later — the raw data should never be there.
- Source volume is massive but you only need a subset. If the source system generates 10 TB/day of event data but you only need 500 GB of filtered events for analytics, transforming (filtering) during extraction saves significant warehouse storage and compute costs.
- The warehouse has limited compute. If you are working with an on-premises data warehouse (legacy Teradata, Oracle, on-prem Hadoop) with finite compute resources, doing heavy transformations outside the warehouse preserves its capacity for serving queries.
- Non-SQL transformations. If transformations require complex logic that is difficult to express in SQL — ML feature engineering, image processing, graph algorithms — you need a general-purpose compute layer (Spark, Python) that runs before loading.
Follow-up: “You chose ELT. Six months later, the data engineering team complains that analysts are writing inefficient dbt models that cost $50,000/month in warehouse compute. What do you do?”
This is the shadow cost of ELT — when transformations happen inside the warehouse, the warehouse bill reflects transformation costs, not just query costs. The fix is not “go back to ETL.” It is governance.- Cost attribution. Tag every dbt model with a team owner and track compute cost per model. Most warehouses (Snowflake, BigQuery) support query tagging and cost attribution. When a team sees that their
dim_users_enrichedmodel costs $8,000/month, they have the data to prioritize optimization. - Review expensive models. The top 10 most expensive models probably account for 70% of the cost. Review their query plans. Common culprits: full table scans on large tables (missing partition pruning), unnecessary
SELECT *, Cartesian products from poorly written joins, and models that do a full refresh daily when incremental processing would suffice. - Incremental models. The biggest cost lever in dbt. Instead of rebuilding a table from scratch every day, an incremental model only processes rows that changed since the last run. For a table with 2 years of historical data that grows by 0.1% daily, this can reduce processing volume by 99%.
- Warehouse sizing governance. In Snowflake, analysts might be running transformations on an XL warehouse that costs 64 credits/hour when a Medium (8 credits/hour) would finish only 20% slower. Set default warehouse sizes per workload type and require approval for larger sizes.
- Materialization strategy. Not every dbt model needs to be a table. Views are free (computed at query time). Ephemeral models are compiled into the downstream query and never materialized. Only materialize as tables when the model is queried frequently or takes significant time to compute. Many analysts materialize everything as tables “just to be safe,” which is expensive.
Going Deeper: “Explain incremental models in dbt. What is the failure mode that catches most teams off guard?”
An incremental model in dbt works like this: on the first run, it materializes the full result as a table. On subsequent runs, it only processes rows where a specified condition is true (typicallyWHERE updated_at > (SELECT MAX(updated_at) FROM {{ this }})). The new rows are merged into the existing table.The failure mode that catches teams: late-arriving data and non-monotonic timestamps. The incremental model assumes that once it has processed rows up to timestamp T, it never needs to look at rows before T again. If a source system backfills historical data, corrects past records, or has events that arrive with timestamps in the past (common in mobile apps that queue events offline), the incremental model will miss them.The fix is a combination of: (1) using a lookback window (process rows from the last 3 days, not just since the last max timestamp) to catch late arrivals at the cost of some reprocessing, (2) running a periodic full refresh (weekly or monthly) to reconcile any drift, and (3) monitoring the delta between incremental and full-refresh results to detect when they diverge.The second gotcha: schema evolution. If a new column is added to the source, the incremental model will not add it to the existing table automatically (it only inserts/updates rows, not DDL). You need to run a full refresh or use dbt’s on_schema_change: 'append_new_columns' configuration. Teams that forget this end up with stale table schemas that are missing recently added columns.Q11: A candidate in a system design interview gives you an estimate of '10 TB' for a service but cannot explain how they got there. Another gives you '50 TB' with clear assumptions and reasoning. Which is a stronger answer, and why?
Q11: A candidate in a system design interview gives you an estimate of '10 TB' for a service but cannot explain how they got there. Another gives you '50 TB' with clear assumptions and reasoning. Which is a stronger answer, and why?
- Estimation questions test reasoning, not arithmetic. The interviewer already knows the approximate answer. They are watching the candidate’s process: Do they state assumptions explicitly? Do they decompose the problem into estimable components? Do they sanity-check the result? The 10 TB answer with no reasoning is indistinguishable from a random guess. The 50 TB answer with clear assumptions is a window into how the candidate thinks about systems at scale.
- Assumptions are auditable. When a candidate says “I assumed 500 bytes per record, 100 million records per day, 365 days, with 3x replication and 30% index overhead,” the interviewer can challenge specific assumptions (“Why 500 bytes? What if records include a JSON payload that averages 2 KB?”). This starts a productive technical conversation. With “10 TB,” there is nothing to discuss.
- Self-correction is possible. If the 50 TB candidate made an error in one assumption, the interviewer can nudge them (“What if users upload images? How does that change your estimate?”) and the candidate can adjust cleanly because their reasoning is transparent. The 10 TB candidate has no framework to adjust from.
Follow-up: “As an interviewer, what red flags do you watch for in back-of-envelope estimation?”
The biggest red flags, in order of severity:- No stated assumptions. The candidate writes numbers on the whiteboard without explaining where they came from. This makes the estimate unchallengeable and unverifiable.
- False precision. “We need exactly 47.3 TB.” Real systems have variable workloads, unpredictable growth, and requirements that change. A precise point estimate signals the candidate does not understand uncertainty. I want to hear “40 to 100 TB depending on media retention policy.”
- Forgetting multiplicative factors. Estimating raw data size without accounting for replication (3x), indexes (30-50%), serialization overhead (JSON is 2-5x larger than the raw data), and headroom (plan for 3x current traffic). A 100 GB estimate that does not account for these factors is really 500+ GB.
- No sanity check. The candidate arrives at a number and moves on without asking “does this make sense?” The strongest candidates always compare their result against a known reference point or a per-user reasonableness check.
- Confusing units. MB vs GB vs TB is a factor of 1,000 at each step. Mbps vs MBps is a factor of 8. Seconds vs milliseconds is a factor of 1,000. These confusions silently shift estimates by orders of magnitude.
Follow-up: “How do you teach junior engineers to get better at back-of-envelope estimation?”
Three practices that accelerate learning:- Build a personal reference table. Memorize the Jeff Dean latency numbers, the throughput numbers for common systems (Redis: 100K ops/s, PostgreSQL: 5-20K queries/s, Kafka: 1M messages/s), and the standard storage sizes (1M rows at 1 KB = 1 GB). These are the building blocks of every estimate. Run flashcards if you need to.
- Practice with real systems. Take a system you use every day — Slack, Spotify, Uber — and estimate its storage, throughput, and bandwidth requirements. Then compare against any publicly available data (engineering blogs, SEC filings, conference talks). The gap between your estimate and reality teaches you which assumptions to calibrate.
- Always estimate before benchmarking. Before running a load test, estimate what you think the result will be. Before checking the cloud bill, estimate what you think it should be. The gap between prediction and reality is where learning happens. An estimate that turns out to be 5x off teaches you more than one that was right — because you have to find and correct the wrong assumption.
Q12: Design the data quality strategy for a fintech company processing $2 billion in payments annually. What is different about data quality when money is involved?
Q12: Design the data quality strategy for a fintech company processing $2 billion in payments annually. What is different about data quality when money is involved?
- Exactness, not approximation. In analytics, “roughly 500,000 orders yesterday” is fine. In finance, “499,997 orders totaling $4,237,891.23” is the only acceptable answer. Financial data quality must be exact to the cent, exact to the row, and exact to the timestamp.
- Auditability. Every transformation must be traceable. If a regulator asks “how did you arrive at this revenue figure?”, you must be able to trace it from the final report back through every transformation to the original transaction records. This means full data lineage, immutable transformation logs, and versioned transformation logic.
- Reconciliation. Financial systems require reconciliation — comparing numbers from two independent sources to verify they agree. The payment processor says you processed 4,237,891.23. Your bank statement says $4,237,891.23. If any of these disagree, it is an incident. Build automated reconciliation into the pipeline.
- Idempotency is non-negotiable. In most data pipelines, a duplicate row is a data quality issue. In financial pipelines, a duplicate row is a double charge. Every pipeline stage must be idempotent: running it twice must produce the same result as running it once. Use unique transaction IDs as deduplication keys at every stage.
- Layer 1 — Source validation. Every payment event is validated against a schema contract before entering the pipeline. Amount must be positive. Currency must be a valid ISO 4217 code. Transaction ID must be unique. Timestamp must be within the last 24 hours (rejects stale replays).
- Layer 2 — Transformation assertions. After every dbt transformation, assert that: the sum of credits equals the sum of debits (double-entry accounting identity), no transaction appears more than once, all foreign keys resolve, and the total amount processed today is within 3 standard deviations of the trailing 30-day average (catches anomalies).
- Layer 3 — Cross-source reconciliation. Daily automated reconciliation between: the pipeline’s output, the payment processor’s settlement report, and the general ledger. Any discrepancy triggers an alert to the finance engineering team. The tolerance is zero — in financial reconciliation, there is no “close enough.”
- Layer 4 — Immutable audit trail. Every pipeline run writes its configuration, input row counts, output row counts, transformation logic version (git SHA), and any anomalies to an append-only audit log. This log is immutable (write-once storage, like S3 with Object Lock) and retained for 7 years per regulatory requirements.
Follow-up: “What happens when your automated reconciliation finds a 2 million settlement?”
A 3-cent discrepancy on a $2M settlement is almost certainly a rounding issue — but you cannot assume that. In financial systems, you investigate every discrepancy, no matter how small.- Investigate immediately. Trace both numbers to their source. The $0.03 difference likely comes from a currency conversion rounding difference (the payment processor rounds to 2 decimal places at the transaction level, while the pipeline sums first and rounds at the aggregate level) or a floating-point precision issue (storing monetary values as floats instead of decimals — a cardinal sin in financial software that still happens).
- Classify and document. If it is a known rounding behavior, document it as a “known reconciliation variance” with the exact cause, the affected calculation, and the maximum possible deviation. Finance and audit teams need this documentation.
- Fix the root cause. If it is a floating-point issue, migrate to decimal types (DECIMAL/NUMERIC in SQL, not FLOAT/DOUBLE). If it is a rounding order issue, standardize rounding rules across all systems (round per transaction, not per aggregate).
- Set a tolerance policy. After investigation, you might establish: discrepancies under 0.10 or from unknown causes are escalated to the finance engineering team. Discrepancies over $100 are escalated to the CFO’s office. The thresholds vary by company, but having them defined prevents both under-reaction (ignoring a real problem) and over-reaction (waking someone up at 3 AM for a rounding penny).
Going Deeper: “Why should monetary values never be stored as floating-point numbers?”
IEEE 754 floating-point numbers cannot exactly represent most decimal fractions. The number 0.1 in binary floating-point is actually 0.1000000000000000055511151231257827021181583404541015625. When you sum thousands of transactions stored as floats, these tiny representation errors accumulate and produce results that are off by cents, sometimes dollars.The classic example:0.1 + 0.2 = 0.30000000000000004 in nearly every programming language using IEEE 754 floats. In a financial system summing 10 million transactions, accumulated floating-point errors can produce discrepancies of hundreds of dollars — enough to fail an audit.The fix: use exact decimal types everywhere in the financial data path. DECIMAL(19,4) in SQL. BigDecimal in Java. Decimal in Python. Integer cents (store $10.50 as 1050) in systems without native decimal types. The performance cost of decimal arithmetic is negligible compared to the cost of a single reconciliation failure. Stripe, Square, and every serious payment processor stores amounts as integers in the smallest currency unit (cents, pence, yen) precisely to avoid this class of error.Advanced Interview Scenarios
These questions are designed to expose gaps that textbook answers cannot fill. They target architecture decisions where the obvious answer is wrong, debugging war stories that require production intuition, and cross-cutting scenarios that span capacity planning, Git workflows, and data pipelines simultaneously. If you can answer these convincingly, you are operating at staff-level.Q13: Your capacity estimate says you need to handle 200,000 writes/second for an event ingestion service. A junior engineer proposes DynamoDB because 'it scales infinitely.' What is wrong with this reasoning, and what would you actually recommend?
Q13: Your capacity estimate says you need to handle 200,000 writes/second for an event ingestion service. A junior engineer proposes DynamoDB because 'it scales infinitely.' What is wrong with this reasoning, and what would you actually recommend?
- Trap 1 — Partition key heat. DynamoDB does not give you 200K writes/second uniformly. It gives you roughly 1,000 writes/second per partition. If your partition key is poorly chosen (e.g.,
dateas the partition key, so all today’s events land on the same partition), you will hit throttling at 1,000 writes/second while paying for 200,000. I have seen a team at an adtech company provision 50,000 WCU on a DynamoDB table and still getProvisionedThroughputExceededExceptionbecause 80% of traffic hit 3 hot partitions. The fix was switching the partition key fromcampaign_idtocampaign_id#shard_numberwith write-sharding across 10 shards per campaign — a pattern DynamoDB’s own docs recommend but most developers do not implement until they are already on fire. - Trap 2 — Cost at sustained scale. At 200,000 WCU provisioned, DynamoDB costs roughly 1.25 per million writes, 200K writes/second sustained is ~3,000/month. The cost difference is two orders of magnitude. DynamoDB’s “serverless scaling” premium is acceptable at 5,000 writes/second. At 200,000, you are paying the serverless tax on infrastructure that should be provisioned.
- Trap 3 — Conflating ingestion with storage. At 200K events/second, the first question is not “where do I store this?” but “how do I buffer this?” The correct architecture is: ingest into Kafka (or Kinesis) as a durable buffer, then consume from Kafka into your storage layer (DynamoDB, S3, ClickHouse, whatever) at a pace the storage layer can handle. Kafka decouples ingestion throughput from storage throughput. The junior engineer’s mistake is trying to use the database as both the ingestion buffer and the storage layer.
Follow-up: “At what write volume does DynamoDB stop being the right default?”
The break-even is roughly 10,000-20,000 sustained writes/second. Below that, DynamoDB’s operational simplicity (no servers to manage, automatic scaling, built-in replication) justifies the per-operation premium. Above that, you are paying a managed-service tax on volume that is predictable enough to provision yourself. The analogy I use: DynamoDB is like a taxi — perfect for short, unpredictable trips. Kafka is like owning a car — higher upfront cost but dramatically cheaper per mile at high volume. At 200K writes/second, you are taking a taxi from New York to Los Angeles.Follow-up: “The team pushes back and says Kafka adds operational complexity they do not want. How do you respond?”
Fair concern — self-managed Kafka is genuinely complex (ZooKeeper coordination, partition rebalancing, broker failures). But the answer is not “avoid Kafka.” The answer is “use managed Kafka.” AWS MSK, Confluent Cloud, and Redpanda Cloud all provide Kafka-compatible managed services. MSK costs 450/month for a system that handles your entire write volume with headroom. That is operationally simpler than DynamoDB at this scale because you are not debugging partition throttling, managing adaptive capacity, or deciphering a $180K monthly bill.Follow-up: “How do you size a Kafka cluster for 200K messages/second?”
The back-of-envelope: each Kafka broker on modern hardware (NVMe SSD, 10 Gbps network) handles roughly 200-500 MB/second of write throughput. At 200K messages/second with 1 KB average message size, that is 200 MB/second of write throughput. A single broker handles that, but you need a minimum of 3 for replication (replication factor 3 means each message is written to 3 brokers). So you need 3 brokers minimum, with the bottleneck being network bandwidth for replication, not disk. I would provision 5-6 brokers to give 40-50% headroom for traffic spikes, rebalancing overhead, and the inevitable growth to 400K messages/second six months later.Q14: It is Friday at 4 PM. Production is down because a hotfix was merged to the wrong branch. Your team uses GitHub Flow and the last 3 deployments each contain work from different engineers. Walk me through exactly what you do in the next 60 minutes.
Q14: It is Friday at 4 PM. Production is down because a hotfix was merged to the wrong branch. Your team uses GitHub Flow and the last 3 deployments each contain work from different engineers. Walk me through exactly what you do in the next 60 minutes.
- Minutes 0-5 — Assess blast radius. What is “down”? Complete outage (500s on all endpoints) or partial degradation (one feature broken, rest works)? Check the monitoring dashboard (Datadog, Grafana) and the error rate spike. If it is a partial failure, determine which users are affected. This decides whether we need an immediate rollback or can afford 20 minutes to do a surgical fix.
- Minutes 5-10 — Revert in production, not in Git. The fastest way to restore service is not
git revert— it is to redeploy the last known good artifact. If you have immutable deployments (Docker images tagged with Git SHAs), revert the deployment to the previous image:kubectl rollout undo deployment/apior update the ECS task definition to the previous image tag. This restores service in 2-3 minutes without touching Git history. The Git cleanup happens after service is restored. - Minutes 10-15 — Communicate. Post in the incident channel: what broke, what we did, current status, ETA for full resolution. Tag the on-call manager. If customer-facing, notify the support team so they can respond to inbound tickets with accurate information. Under-communicating during an incident is worse than over-communicating.
- Minutes 15-30 — Understand what happened. Now that production is stable on the rollback, examine the bad merge.
git log --oneline --graphon main to see the commit history. Identify which commit introduced the break. Was it the hotfix itself that was buggy? Or was the hotfix correct but merged into a branch that included two other engineers’ unfinished work? If the last 3 deployments are interleaved, usegit log --authorandgit diffbetween the last good SHA and the bad SHA to isolate the breaking change. - Minutes 30-45 — Surgical fix. If the hotfix was correct but collided with other changes, cherry-pick only the hotfix onto a clean branch from the last known good SHA:
git checkout -b hotfix/actual-fix <last-good-sha>thengit cherry-pick <hotfix-commit>. Run CI against this branch. If CI passes, deploy this branch. - Minutes 45-60 — Stabilize and schedule the postmortem. Confirm production is healthy on the new deployment. Verify metrics are back to baseline. Schedule a blameless postmortem for Monday. Document the timeline in the incident channel.
kubectl rollout undo. The permanent fix was: disabling force-push on main via branch protection rules, adding a pre-deploy integration test suite that validated the top 5 critical API endpoints, and lowering the error-rate alert threshold from 5% to 1%. The postmortem also revealed that the engineer force-pushed because they were trying to “clean up” a messy rebase — which led to a team-wide session on git reflog as a safety net and a norm against rebasing shared branches.Follow-up: “One of the 3 interleaved deployments contains a database migration that has already run. You cannot simply roll back the code because the old code expects the old schema. What do you do?”
This is the scenario that separates teams with mature deployment practices from those without. The database migration has moved forward — rolling back the code without rolling back the schema creates a mismatch.- If the migration is backward-compatible (additive — new column, new table): Roll back the code safely. The old code ignores the new column. The new column sits there harmlessly until the fixed code deploys. This is why expand-and-contract migrations are the gold standard: always add before you remove.
- If the migration is NOT backward-compatible (renamed column, changed type, dropped column): You cannot roll back the code. You must fix forward. Create a new migration that makes the schema compatible with the old code (e.g., re-add the dropped column as nullable), deploy that migration, then deploy the old code. Alternatively, if the breaking migration can be reversed without data loss, run a reverse migration script — but test it first, because reverse migrations in production databases are where careers end.
- The postmortem action item: Enforce the rule that every migration must be backward-compatible with the previous code version. This means always using a two-phase approach: Phase 1 deploys the new schema (additive only), Phase 2 deploys the code that uses it, Phase 3 (later) removes the old schema. This is called expand-and-contract, and it makes rollbacks safe by construction.
Follow-up: “How do you prevent this class of incident from happening again?”
Three layers of defense:- Process layer: Establish a documented hotfix procedure. The runbook says: create a branch from the current production SHA (not from main HEAD), apply only the fix, get one expedited review, merge, deploy, then backport to main. Under pressure, engineers default to muscle memory. If the runbook exists and has been practiced, they follow it. If it does not exist, they improvise, and improvisation under stress is how you get force-pushes to main.
- Tooling layer: Branch protection on main: require CI pass, require at least one approval, disable force-push. Deployment pipeline: require a smoke test suite to pass against the staged artifact before promoting to production. Git hooks: pre-push hook that warns if you are pushing directly to main.
- Cultural layer: Blameless postmortems after every incident. The goal is not “who did this?” but “what system allowed this to happen?” The engineer who force-pushed is not the root cause — the missing branch protection rule is. Fix the system, not the person.
Q15: Your capacity estimate says you need Elasticsearch for a new search feature. Your principal engineer says: 'PostgreSQL full-text search is good enough. You are over-engineering.' Who is right?
Q15: Your capacity estimate says you need Elasticsearch for a new search feature. Your principal engineer says: 'PostgreSQL full-text search is good enough. You are over-engineering.' Who is right?
- Number 1 — Index size. PostgreSQL full-text search with GIN indexes works well up to about 10-50 million documents (depending on document size and query complexity). Beyond that, index maintenance becomes expensive and query latency degrades. If our feature indexes 5 million product listings with an average of 200 words each, PostgreSQL is genuinely sufficient. If we are indexing 500 million documents with complex faceted search, PostgreSQL will struggle.
- Number 2 — Query complexity. If the search is “find products matching these keywords, sorted by relevance,” PostgreSQL’s
ts_vectorandts_rankhandle that well. If the search requires fuzzy matching, autocomplete with typo tolerance, faceted filtering (filter by price range AND category AND rating AND availability simultaneously), and custom scoring algorithms, Elasticsearch’s inverted index architecture is purpose-built for that workload. PostgreSQL can do some of this, but the query complexity and maintenance cost grow fast. - Number 3 — Operational budget. Elasticsearch requires its own cluster, its own monitoring, its own capacity planning, and its own expertise. For a team of 8 engineers with no Elasticsearch experience, introducing a new distributed system for a feature that PostgreSQL can handle is genuinely over-engineering. The principal engineer might be making an implicit argument: “We do not have the operational capacity to run another distributed system.”
Follow-up: “At what scale does PostgreSQL full-text search actually break down? Give me specific numbers.”
From production experience and benchmarks:- 10 million documents, simple keyword search: PostgreSQL handles this at sub-50ms p99 latency with a properly maintained GIN index. No issues.
- 50 million documents, keyword search with ranking: Latency starts creeping to 100-300ms depending on query complexity. GIN index rebuild after bulk updates takes minutes, during which write performance degrades. Manageable but you are noticing it.
- 100+ million documents, faceted search with autocomplete: PostgreSQL is out of its comfort zone. GIN index sizes grow to tens of gigabytes, VACUUM on the search table takes hours, and combining full-text search with faceted filtering (range queries, multi-value filters) requires multiple index scans that do not compose well. This is where Elasticsearch’s architecture — inverted indexes per field, segment-based storage, distributed search across shards — genuinely outperforms.
- The breakpoint is not just document count — it is the product of document count, query complexity, and concurrent query volume. 100M documents with 1 query/second is different from 100M documents with 1,000 queries/second.
Follow-up: “If you do migrate to Elasticsearch, how do you keep it in sync with PostgreSQL as the source of truth?”
This is the classic dual-write problem, and it has exactly one correct answer: do not dual-write.- The wrong approach: Write to PostgreSQL and Elasticsearch in the same application transaction. This couples your write path to Elasticsearch availability, introduces distributed transaction complexity, and creates inconsistency windows when one write succeeds and the other fails.
- The correct approach: Change Data Capture (CDC). Use Debezium to capture PostgreSQL WAL changes and stream them to Kafka, then consume from Kafka into Elasticsearch. This decouples the write path entirely. PostgreSQL is the source of truth. Elasticsearch is an eventually consistent read replica. If Elasticsearch falls behind or goes down, PostgreSQL continues serving writes, and Elasticsearch catches up from the Kafka log when it recovers.
- The acceptable shortcut for small scale: A background job that polls PostgreSQL for recently changed rows (using
updated_at > last_sync_timestamp) and upserts them into Elasticsearch every 30-60 seconds. Simpler than CDC, acceptable latency for most search use cases, but breaks down at high write volume because polling is O(changed rows per interval) and Debezium/CDC is O(1) per change.
Q16: Your Airflow DAG has 47 tasks with complex dependencies. It ran fine for a year. This month, it has failed 11 times. The failures are all different tasks. What is going on and how do you fix it?
Q16: Your Airflow DAG has 47 tasks with complex dependencies. It ran fine for a year. This month, it has failed 11 times. The failures are all different tasks. What is going on and how do you fix it?
- Hypothesis 1 — Resource exhaustion. A 47-task DAG with complex dependencies means many tasks can run in parallel. If the Airflow cluster’s worker pool has not grown but the data volume has, tasks are competing for CPU, memory, and database connections. The symptom: random tasks fail with OOM kills, worker timeouts, or “task received SIGTERM” errors. Check: compare this month’s data volume against last month’s. If it grew 2-3x (common in Q4 for retail, tax season for fintech), the DAG is the same but the work per task has increased. Fix: increase worker resources (vertical scaling) or reduce parallelism (
max_active_tasks_per_dag,poolslots in Airflow) so fewer tasks compete for resources. - Hypothesis 2 — Upstream instability. The DAG depends on external sources — APIs, databases, S3 buckets. If an upstream source has become flakier (higher latency, intermittent timeouts, rate limiting), different tasks that depend on different sources will fail at different times. Check: correlate task failure times with upstream system health dashboards. Look for timeout errors in the task logs. Fix: add retry logic with exponential backoff to tasks that call external systems (
retries=3, retry_delay=timedelta(minutes=5)in Airflow). Add circuit breakers for sources that are consistently degraded. - Hypothesis 3 — Airflow metadata database bloat. Airflow stores task instance records, XCom data, and DAG run history in a PostgreSQL or MySQL metadata database. After a year of running a 47-task DAG daily, that is 47 x 365 = ~17,000 task instance records, plus XCom entries, logs, and rendered templates. If the metadata database has never been cleaned, it may be slowing down the scheduler. The scheduler cannot keep up with task scheduling, tasks time out waiting to be scheduled, and they fail with “task was externally killed” or “task is in a None state.” Check: query the Airflow metadata database for table sizes. If
task_instancehas millions of rows andxcomis gigabytes, this is your problem. Fix: runairflow db cleanto purge old records (keep the last 30 days), add a recurring maintenance job, and increase the metadata database instance size. - Hypothesis 4 — DAG design has reached its complexity limit. 47 tasks with complex dependencies means the dependency graph is deep and wide. If a single task in the middle fails and the retry takes 20 minutes, all downstream tasks are delayed. If the DAG has a hard timeout (dagrun_timeout), those delayed downstream tasks get killed. The cascade looks like “11 different tasks failed” but really it was “1 task was slow and 10 downstream tasks timed out.” Check: look at the DAG’s Gantt chart in the Airflow UI. If you see a single long bar followed by a cascade of short red bars, you have found it. Fix: add SLA monitoring per task, separate the bottleneck task into a shorter critical path and a longer non-critical path, and consider splitting the 47-task monolith into multiple smaller DAGs with sensor-based dependencies between them.
airflow db clean --clean-before-timestamp 2024-01-01 dropped the metadata database to 3 GB, scheduler loop time dropped to 1.5 seconds, and the “random” failures stopped overnight. Total cost of not having a metadata cleanup cron: 6 weeks of engineering time chasing ghosts.Follow-up: “You decide the DAG needs to be split. How do you decompose a 47-task DAG without breaking the dependency chain?”
The principle is: find the natural seams — points in the dependency graph where the coupling between task groups is minimal (ideally a single data handoff).- Step 1 — Map the dependency graph visually. Use the Airflow UI’s graph view or export the DAG as a DOT file and render it. Identify clusters of tightly connected tasks and the bridges between clusters. Usually a 47-task DAG has 3-5 natural clusters.
- Step 2 — Split at data boundaries. If tasks 1-15 extract and load raw data, tasks 16-30 transform that data, and tasks 31-47 produce final tables and reports, split into three DAGs:
extract_load,transform,reporting. The contract between DAGs is the data itself:extract_loadproduces raw tables in the warehouse,transformruns only after those tables are fresh,reportingruns after transformation is complete. - Step 3 — Use sensors or dataset-aware scheduling. Airflow 2.4+ supports dataset-aware scheduling: DAG B can be configured to run automatically when DAG A updates a specific dataset. This replaces hard-coded cross-DAG dependencies with a data contract: “run when the
raw_orderstable has been updated.” This is more resilient than time-based scheduling (“run DAG B at 4 AM and hope DAG A finished by then”).
Follow-up: “What is the trade-off between one large DAG and many small DAGs?”
- One large DAG: The scheduler sees all dependencies and can optimize execution order globally. Easier to reason about the full pipeline in one place. But: a single failure can cascade to many tasks, the DAG becomes hard to understand visually, and different teams cannot own different parts independently.
- Many small DAGs: Each DAG is owned by a team, has its own SLA, and fails independently. But: cross-DAG dependencies must be managed explicitly (sensors, datasets, or external triggers), and the “big picture” of the full pipeline is harder to see. You need a separate lineage tool (like DataHub or OpenLineage) to understand the end-to-end flow.
- The practical answer: Split when the DAG has more than 20-25 tasks or when multiple teams need to own different parts. The overhead of cross-DAG coordination is worth it when the alternative is a 60-task monolith that one engineer understands and everyone else is afraid to touch.
Q17: Your team adopted trunk-based development 6 months ago. It has been great for velocity, but you now have 143 feature flags in your codebase, half of which nobody is sure are still needed. What is the risk, and how do you clean this up?
Q17: Your team adopted trunk-based development 6 months ago. It has been great for velocity, but you now have 143 feature flags in your codebase, half of which nobody is sure are still needed. What is the risk, and how do you clean this up?
- Combinatorial testing explosion. Each boolean flag doubles the number of possible code paths. With 143 flags, you have 2^143 theoretical code paths — more than the number of atoms in the observable universe. In practice, most flags are independent, but even 10 interacting flags create 1,024 combinations that nobody has tested. I have seen a production incident where two flags that were individually safe interacted to create a race condition — Flag A disabled a cache warming step, Flag B enabled a new query path that assumed the cache was warm. Neither flag was buggy alone. Together, they caused a 40-second p99 latency spike for 12% of users.
- Code comprehension collapse. Engineers reading the code encounter
if (flags.NEW_CHECKOUT_FLOW)everywhere. They do not know if this flag is active for all users, 5% of users, or has been 100% for 6 months and nobody removed the conditional. The code is harder to read, harder to debug, and harder to modify because every change requires understanding which flag states are actually reachable. - Stale flags become landmines. A flag that was 100% on for 3 months is effectively dead code in the
elsebranch. But nobody has deleted theelsebranch because “what if we need to roll back?” After 3 months, the surrounding code has changed enough that theelsebranch would not work anyway. It is dead code that provides a false sense of safety.
- Step 1 — Inventory and classify. Query your feature flag system (LaunchDarkly, Unleash, environment variables) for every flag. Classify each as: (a) actively being rolled out (keep), (b) 100% on for all users for 30+ days (candidate for removal), (c) 100% off / 0% rollout (dead — remove immediately), (d) unknown owner (escalate). Most flag systems provide this data through their API. At 143 flags, this takes a day of scripting, not weeks of manual archaeology.
- Step 2 — Set expiration dates from day one going forward. Every new flag gets a TTL — a date by which it must be either fully rolled out and the flag removed, or rolled back and the feature abandoned. In LaunchDarkly, you can set flag expiration dates that create Jira tickets automatically. I use a default of 30 days for simple flags and 90 days for major features. A flag without an expiration date is a flag that lives forever.
- Step 3 — Treat flag removal as first-class engineering work. Removing a flag is not just deleting an
ifstatement. It requires: removing the flag definition, removing the conditional code paths, removing any tests that branch on the flag, and deploying the change. For 70 stale flags, estimate 1-2 hours each, so 70-140 hours of engineering time. This is a multi-sprint effort that must be staffed and tracked, not a “we will get to it” backlog item. - Step 4 — Lint for flag hygiene. Add a CI check that fails if a flag older than its TTL has not been removed. This is the enforcement mechanism. Without it, step 2 is aspirational. With it, stale flags become tech debt that blocks the build, just like failing tests.
if (flags.TEMP_ADMIN_ACCESS) and the flag name implied it was controlled. The security team estimated the endpoint was exposed for 6 months beyond its intended lifetime. The fix was trivial (delete the flag, remove the endpoint). The discovery was accidental. They now mandate flag expiration dates and a monthly flag audit specifically for security-sensitive flags.Follow-up: “How do you handle the case where removing a flag requires changing 47 files?”
This is common for cross-cutting flags (likeNEW_AUTH_FLOW or ENABLE_V2_API). The approach:- Use your IDE’s find-all-references. Search for the flag name across the codebase. Group the changes by module/service. If the flag is referenced in 47 files, it probably touches 3-4 logical areas. Make one PR per area, not one 47-file PR that nobody will review.
- Remove the dead branch first, deploy, then remove the flag. If the flag is 100% on, first remove the
elsebranch in a PR (this is safe — the else branch was unreachable). Deploy. Verify. Then in a second PR, remove theifconditional and the flag definition, collapsing the code to just thethenbranch. Two small, safe PRs instead of one large, scary PR. - Use automated codemods. For JavaScript/TypeScript, jscodeshift can programmatically remove dead code branches based on a flag name. For other languages, custom AST-based scripts or even simple regex-based transformations work when the flag pattern is consistent.
Follow-up: “Your PM says the team should be building features, not cleaning up flags. How do you justify the investment?”
With data, not opinions.- Incident cost: “We had 2 incidents in the last quarter caused by flag interactions. Each cost 4 hours of engineering time plus 30 minutes of user-facing degradation. At our incident rate, stale flags will cause ~8 incidents/year.”
- Velocity cost: “Engineers report spending 15-20 minutes per PR understanding which flags are active in the code they are modifying. With 143 flags and 25 PRs/week, that is 6-8 hours/week of engineering time spent on flag comprehension.”
- Security risk: “Three flags control access to sensitive endpoints. Without a cleanup process, we cannot guarantee these are in the expected state.”
- Then frame the investment: “Cleaning up 70 stale flags is ~100 hours of work. That is one engineer for 2.5 weeks, or spread across the team, 2 hours/week for 3 months. The return is fewer incidents, faster feature development, and reduced security surface area.”
Q18: You estimated that your new service needs 500 GB of storage and 2,000 requests/second. The service launches and immediately hits problems — but storage and throughput are both within your estimates. What went wrong?
Q18: You estimated that your new service needs 500 GB of storage and 2,000 requests/second. The service launches and immediately hits problems — but storage and throughput are both within your estimates. What went wrong?
- Latency distribution, not average throughput. The estimate said 2,000 req/s and the system handles 2,000 req/s on average. But if 5% of requests take 10 seconds (slow queries, external API calls, lock contention), those requests consume worker threads/connections disproportionately. At 2,000 req/s with 5% slow requests, that is 100 slow requests in flight at any given moment. If each holds a database connection for 10 seconds, you need 1,000 connections just for the slow path — and most connection pools default to 20-50. The estimate captured throughput but not the tail latency distribution.
- Write amplification patterns. The estimate said 500 GB of storage. The database has 500 GB of logical data. But PostgreSQL’s MVCC creates dead tuples on every UPDATE. If the workload is update-heavy (e.g., updating a user’s
last_seen_aton every request), the table bloats to 2x-3x logical size between VACUUMs. The disk fills up at 1.5 TB even though the “real” data is 500 GB. The estimate captured logical data size but not the physical storage behavior of the chosen database engine. - Connection storms, not sustained throughput. 2,000 req/s sustained is fine. But if the service is behind a load balancer and 50 application servers each maintain a persistent connection pool of 20 connections, that is 1,000 connections. If a deployment rolls through and all servers reconnect simultaneously, you get a connection storm of 1,000 new connections in 5 seconds. PostgreSQL’s
max_connectionsdefault is 100. The service crashes not because of throughput but because of connection concurrency during transient events. - Hot key / hot partition. The estimate assumed uniform distribution. 2,000 req/s spread evenly across 1 million keys is trivial. But if 40% of requests hit the same 100 keys (a “celebrity” problem, a popular product, a viral post), those keys become a hot partition. DynamoDB throttles. Redis hits CPU limits on a single shard. PostgreSQL row-level locks cause contention. The throughput estimate was correct globally but wrong locally.
INSERT ... ON CONFLICT (user_id) DO UPDATE SET last_active = NOW(). Every active user updated the same row repeatedly. PostgreSQL’s MVCC created dead tuples faster than autovacuum could clean them. The users table — 200 GB of logical data — ballooned to 580 GB of physical data in 4 hours as dead tuples accumulated. The disk filled to 95%, PostgreSQL stopped accepting writes, and the service went down. The fix was two-fold: (1) tune autovacuum aggressively for that table (autovacuum_vacuum_scale_factor = 0.01 instead of the default 0.2), and (2) batch the last_active updates through Redis, flushing to PostgreSQL every 60 seconds instead of on every request. This reduced the PostgreSQL write rate from 3,000/s to 50/s while the logical data volume was identical.Follow-up: “How do you load test in a way that catches these issues before launch?”
The key is realistic traffic patterns, not synthetic benchmarks.- Replay production traffic. If you have an existing system, capture real request logs (with PII redacted) and replay them against the new service. Real traffic has hot keys, burst patterns, and tail latency that synthetic tools like
wrkdo not generate by default. - Simulate transient events. During the load test, trigger a rolling deployment. Kill a database replica. Simulate a network partition. These transient events are when real systems fail, not during steady-state throughput.
- Run the test for hours, not minutes. Many issues (MVCC bloat, connection pool exhaustion, memory leaks, GC pauses) only manifest after sustained load. A 5-minute load test proves your system handles a sprint. A 4-hour test proves it handles a marathon. The PostgreSQL dead tuple issue from the war story would not have appeared in a 10-minute benchmark.
- Measure p99 and p99.9, not just p50 and average. Average latency is meaningless for capacity planning. If your p50 is 5ms but your p99 is 2 seconds, 1% of your users are having a terrible experience, and those 2-second requests are consuming resources that could serve 400 fast requests. The tail is where problems hide.
Follow-up: “You mentioned PostgreSQL MVCC bloat. How do you monitor for this in production before it causes an outage?”
Three metrics to watch:- Dead tuple count per table. Query
pg_stat_user_tablesand trackn_dead_tupover time. If dead tuples for a table are growing faster than autovacuum can clean them, you are headed for bloat. Alert whenn_dead_tup > 10 * n_live_tup(more dead than live rows — autovacuum is losing the race). - Table bloat ratio. Compare the logical table size (
pg_relation_size) against the estimated minimum size (row count x average row size). If the table is 3x larger than it should be, you have bloat. Thepgstattupleextension provides exact bloat measurements. - Autovacuum runtime and frequency. Query
pg_stat_user_tables.last_autovacuumand check if autovacuum is running and completing. If autovacuum is constantly running but never catching up (checkpg_stat_activityfor long-running autovacuum workers), you need to increase autovacuum workers or tune the cost settings.
Q19: Your data pipeline produces a daily report showing $12.4M in revenue. Finance says the real number from the payment processor is $12.1M. The pipeline is not 'wrong' — it is $300K off. Where do you look?
Q19: Your data pipeline produces a daily report showing $12.4M in revenue. Finance says the real number from the payment processor is $12.1M. The pipeline is not 'wrong' — it is $300K off. Where do you look?
- Step 1 — Determine the direction. The pipeline says 12.1M. The pipeline is $300K higher. This means the pipeline is either counting transactions that should not be counted, or counting them at the wrong amount. The direction narrows the hypothesis space immediately.
- Step 2 — Check for double-counting. The most common cause of “pipeline is too high.” Look for duplicate transaction IDs in the pipeline output. This can happen if: (a) the extraction ran twice and the load was not idempotent, (b) the source emits both a pending and a completed event for the same transaction and the pipeline counts both, (c) retries in the extraction layer produced duplicate records. Query:
SELECT transaction_id, COUNT(*) FROM revenue_table WHERE date = 'yesterday' GROUP BY transaction_id HAVING COUNT(*) > 1. If you find duplicates, quantify their total value. If it is near $300K, you have your answer. - Step 3 — Check for refund/chargeback handling. The payment processor’s 12.4M is gross. If yesterday had $300K in refunds and the pipeline does not subtract them, the discrepancy is explained. Check: does the pipeline join against the refunds table? Is the refund data on the same schedule as the transaction data, or does it arrive with a 24-48 hour delay?
- Step 4 — Check for currency conversion. If the business processes international payments, the pipeline might convert foreign currency transactions to USD at a different exchange rate than the payment processor. The pipeline might use the rate at the time of the transaction, while the processor uses the rate at settlement time. A 1-2% difference in exchange rates on international transactions could produce a $300K discrepancy for a business with significant international revenue.
- Step 5 — Check the time window. This is the one that catches experienced engineers off guard. The pipeline counts transactions with a
created_attimestamp in yesterday’s UTC window (00:00:00 to 23:59:59 UTC). The payment processor counts transactions settled yesterday in their processing timezone (Pacific Time). The 8-hour offset means transactions from late in the UTC day fall into different days in the two systems. If transaction volume is roughly uniform, an 8-hour window mismatch at ~$516K/hour would produce a discrepancy of exactly this magnitude. - Step 6 — Reconcile at the transaction level. Export the set of transaction IDs from the pipeline’s output and the set from the payment processor’s report. Compute the symmetric difference. The transactions present in one but not the other, plus the transactions present in both but with different amounts, explain the entire discrepancy. This is slow but definitive — and for a $300K discrepancy, it is worth the effort.
created_at for the transaction timestamp, while the processor used captured_at. Transactions authorized at 11:50 PM and captured at 12:05 AM fell on different days in the two systems, creating a rolling discrepancy of ~0.8%. Combined: ~2.3%, exactly matching the observed pattern. The fix was straightforward: use captured_at as the revenue timestamp and exclude non-captured authorizations. The 4 months of incorrect daily reports had to be retroactively corrected, which required a painful conversation with the CFO about which board reports had used the inflated numbers.Follow-up: “Finance asks: ‘Can we guarantee this will never happen again?’ What is your honest answer?”
No — and I would say exactly that, with an explanation of what we can guarantee.We can guarantee that this specific class of discrepancy will not happen again because we have automated reconciliation that compares our pipeline output against the processor’s settlement report every morning and alerts if the discrepancy exceeds $100. We can guarantee that our time windows and transaction statuses now match the processor’s definitions.What we cannot guarantee is that a new class of discrepancy will not emerge. The payment processor could change their settlement timing. A new payment method (crypto, buy-now-pay-later) could have different capture semantics. A new currency could introduce rounding behavior. The honest promise is not “it will never happen again” — it is “we will detect it within 24 hours instead of 4 months.”Follow-up: “How do you build automated reconciliation for a $2B/year payment pipeline?”
Three reconciliation levels, running daily:- Level 1 — Aggregate reconciliation. Compare daily totals: pipeline total vs processor total. If they differ by more than $100, alert. This catches gross errors (double-counting, missing data) within hours.
- Level 2 — Cohort reconciliation. Compare totals by payment method, currency, and region. The aggregates might match but hide offsetting errors (over-counting credit cards by 50K). Cohort-level reconciliation catches these.
- Level 3 — Transaction-level reconciliation. Weekly (not daily — too expensive at $2B/year volume). Match every transaction ID between the pipeline and the processor. Flag unmatched transactions and amount discrepancies. This is the most thorough check but the most computationally expensive, so running it weekly is a reasonable trade-off.
Q20: Your company wants to migrate from a polyrepo setup (23 repositories) to a monorepo. You have been asked to lead the migration. The CTO wants it done in one quarter. Walk me through your plan — and tell me where the CTO's timeline is wrong.
Q20: Your company wants to migrate from a polyrepo setup (23 repositories) to a monorepo. You have been asked to lead the migration. The CTO wants it done in one quarter. Walk me through your plan — and tell me where the CTO's timeline is wrong.
- Git history migration alone is a multi-week project. You do not just
cp -r repo1/ monorepo/services/repo1/. You need to rewrite each repo’s Git history so that file paths reflect their new location in the monorepo. Otherwise,git logandgit blame— the tools engineers use daily — will show paths that no longer exist. The tool for this isgit filter-repo, and it takes 2-4 hours per repo to run, validate, and fix edge cases (submodules, large files, symlinks). For 23 repos, that is 2-3 weeks of hands-on work just for history migration. - CI/CD pipeline rewrite is the largest work item. Each of the 23 repos has its own CI configuration (GitHub Actions workflows, Jenkinsfiles, CircleCI configs). In a monorepo, you need affected-target analysis — CI must detect which services changed and only build/test those. This requires adopting a monorepo build tool (Nx, Turborepo, Bazel) and rewriting every service’s build and test configuration to work within it. For 23 services, this is 4-8 weeks of work depending on how heterogeneous the build configurations are.
- Dependency management migration is treacherous. In polyrepo, each service pins its own dependency versions. Service A uses Express 4.17, Service B uses Express 4.18. In a monorepo, you need to decide: do you allow different versions per service (complex tooling), or align all services to the same version (requires upgrading 22 services to match the newest one)? Dependency alignment is where hidden breakage lives — a minor version bump in a shared library can break services that were happily pinned to the old version. Budget 2-4 weeks for dependency alignment and the resulting test failures.
- Developer workflow retraining is the hidden cost. Engineers have muscle memory around their polyrepo workflows — their IDE workspace settings, their shell aliases, their review habits. In a monorepo,
git clonetakes longer, IDE indexing is slower, and PR review scope changes because a single PR might touch multiple services. Without training and documentation, engineers will fight the monorepo rather than adopt it. Budget 2-3 weeks for documentation, brown-bag sessions, and pair programming during the transition.
- Quarter 1 — Phase 1 (weeks 1-6): Foundation. Set up the monorepo structure. Migrate the 3 most tightly coupled services first (the ones that change together most often — check cross-repo PRs to identify them). Establish the build tool configuration, CI pipeline with affected-target analysis, and developer documentation. This phase is the proof of concept.
- Quarter 1 — Phase 2 (weeks 7-13): Bulk migration. Migrate the next 10 services, in batches of 3-4 per week. Each batch follows the playbook established in Phase 1. Keep the old polyrepo CI running in parallel as a safety net — engineers can still open PRs in the old repo if the monorepo migration for their service is not ready.
- Quarter 2 — Phase 3 (weeks 14-20): Tail migration and cleanup. Migrate the remaining 10 services. These are often the most independent services with the most unique configurations — they take longer per service. Decommission old repos (archive, do not delete). Remove parallel CI configurations.
- Quarter 2 — Phase 4 (weeks 21-26): Optimization and adoption. Tune CI performance (remote caching, parallelization). Set up CODEOWNERS for review routing. Conduct team retrospectives to identify remaining pain points. This phase is where the monorepo goes from “working” to “productive.”
Follow-up: “During the migration, how do you handle the transition period where some services are in the monorepo and some are still in separate repos?”
This is the most operationally painful part. For the 4-8 weeks where services are split across both, you need:- Keep both CI systems running. Services in the monorepo use the monorepo CI. Services still in polyrepo use their existing CI. Do not turn off old CI until the service is migrated and validated.
- Cross-repo dependencies during transition. If Service A (in monorepo) depends on Service B (still in polyrepo), Service A must continue pulling Service B as an external dependency (npm package, Docker image, API call). Only convert to direct monorepo imports after Service B is migrated. This means the monorepo’s build tool must handle a mix of internal and external dependencies during transition.
- Communication. A Slack channel or daily standup dedicated to migration status. Engineers need to know: “Which services are in the monorepo today? Where do I open my PR? Which CI system do I check?” Without clear communication, engineers waste time in the wrong repo.
Follow-up: “After the migration is complete, what metrics tell you it was worth it?”
- Cross-service change frequency. Before: changes spanning multiple services required multiple PRs in multiple repos, reviewed by multiple teams, merged in a coordinated sequence. After: one PR, one review, one merge. Measure the number of multi-service changes per week and the time-to-merge for those changes. If multi-service changes went from 3 days to 3 hours, the migration is paying for itself.
- CI time per PR. This should be the same or better than polyrepo (thanks to affected-target analysis). If it is worse, the build tool configuration needs more work.
- Developer satisfaction survey. Quarterly survey asking: “How easy is it to make cross-service changes? How easy is it to discover code? How confident are you in the CI pipeline?” If satisfaction dropped, the migration has ergonomic issues to fix. If it improved, you are winning.
- Dependency update lag. Before: shared library updates required PRs to 23 repos, often taking weeks to fully propagate. After: one PR updates the library everywhere. Measure the time from “library update committed” to “all consumers using the new version.” This should drop from weeks to hours.
Q21: You are designing a data pipeline for an ML feature store that needs to serve features with sub-10ms latency at 50,000 requests/second, updated hourly from batch sources. Most candidates propose a simple architecture. What do they miss?
Q21: You are designing a data pipeline for an ML feature store that needs to serve features with sub-10ms latency at 50,000 requests/second, updated hourly from batch sources. Most candidates propose a simple architecture. What do they miss?
-
Problem 1 — The hourly update is not atomic. The batch job computes features for 10 million entities and writes them to Redis. That write takes 15-20 minutes (10M entities x 1 KB average x network overhead). During those 15-20 minutes, some entities have new features and some have stale features. An ML model requesting features for a batch of 50 entities might get a mix of old and new features — a consistency window that introduces subtle prediction errors. At 50,000 req/s, thousands of predictions during the write window are made on inconsistent feature sets.
The fix: Double-buffering. Maintain two feature sets in Redis:
features:currentandfeatures:next. The batch job writes entirely tofeatures:next. When the write is complete, atomically swap the pointer so all reads switch tofeatures:next(which becomesfeatures:current). Redis supports this with key renaming or with read-through references. Tecton, Feast, and other production feature stores use this pattern. - Problem 2 — Redis memory is expensive at 10M entities. 10 million entities x 1 KB of features x 2 (double-buffered) = 20 GB. That fits in a single Redis instance. But feature stores grow — 50 million entities with 5 KB of features each (dozens of feature columns) x 2 = 500 GB. Redis at 500 GB requires a cluster with 16+ shards, costing 25,000/month. And that is just one ML model’s features — a mature ML platform might have 20 models each with their own feature sets. The fix: Tiered storage. Hot features (recent, high-QPS entities like active users) stay in Redis. Warm features (less frequently requested entities) live in DynamoDB or Cassandra with sub-20ms latency. The serving layer checks Redis first and falls through to the warm tier on cache miss. This cuts Redis costs by 60-80% while keeping p99 latency under 10ms for the vast majority of requests.
-
Problem 3 — Feature freshness is not the same as feature correctness. The batch job runs hourly, but what if the batch job fails? The features in Redis are now 2 hours old instead of 1. The ML model continues serving predictions on stale features with no indication that anything is wrong. Unlike an API that returns a 500 error when the backend is down, a feature store with stale data returns a 200 with subtly wrong answers.
The fix: Every feature write includes a
computed_attimestamp. The serving layer checks the freshness of the features before returning them. If features are older than the SLA (say, 90 minutes for an hourly pipeline), the serving layer can: (a) return the stale features with astale: trueflag so the model can degrade gracefully, (b) fall back to default/baseline features, or (c) return an error so the caller can handle the degradation. The choice depends on the ML model’s sensitivity to stale features — a recommendation model can tolerate 2-hour staleness, a fraud model probably cannot. - Problem 4 — Schema evolution between batch and serving. The ML team adds a new feature column to the batch pipeline. The batch job starts writing it to Redis. But the serving layer’s deserialization code has not been updated — it ignores the new column, or worse, it breaks on the unexpected field. The batch and serving layers are two separate deployments with coupled schemas but decoupled release cycles. The fix: Use a schema registry (Protobuf or Avro with schema evolution rules) for the feature payload. The batch writer serializes features using a registered schema. The serving layer deserializes using the same schema, with forward/backward compatibility guarantees. Adding a new feature column is a non-breaking schema change. Removing one requires a deprecation process.
Follow-up: “How do you handle a feature that needs both real-time updates (streaming) and batch backfill?”
This is the Lambda architecture applied to feature stores, and it is the hardest operational problem in ML infrastructure.- The dual-write approach: The streaming pipeline writes fresh features from real-time events (e.g., “user clicked 3 times in the last 5 minutes”). The batch pipeline writes historical features (e.g., “user’s average session length over the last 30 days”). The serving layer merges both at read time, preferring the streaming value when it is fresher than the batch value.
- The correctness trap: The streaming pipeline and batch pipeline must compute the same feature definition using different implementations (one in Flink/Spark Streaming, one in Spark/dbt). If the implementations diverge — different rounding, different time-window boundaries, different null handling — the feature value jumps when the batch pipeline overwrites the streaming value. This creates training-serving skew: the model was trained on batch-computed features but is served a mix of batch and streaming features.
- The industry solution: Feature stores like Tecton address this by allowing a single feature definition that is compiled to both batch and streaming implementations, guaranteeing consistency. If you are building in-house, the pragmatic approach is: compute all features in batch (the source of truth), and use streaming only for the small set of features where hourly staleness is truly unacceptable. The fewer features that need streaming, the smaller the consistency surface area you need to manage.
Follow-up: “What is training-serving skew and why does it matter more than most engineers realize?”
Training-serving skew is the difference between the features the model saw during training and the features it sees during inference. Even small skew can degrade model performance silently — the model’s accuracy drops by 2-3%, nobody notices because there is no error, and the business loses revenue for weeks before someone investigates.Common causes: (1) the training pipeline computes features on historical data using batch logic, but the serving pipeline uses streaming logic that produces slightly different values (different window boundaries, different aggregation order), (2) the training pipeline uses exact point-in-time features, but the serving pipeline uses the “latest” features which may include future data leakage, (3) a feature transformation was updated in the serving path but not backfilled in the training data.The fix is feature monitoring: compare the distribution of features at serving time against the distribution at training time. If they diverge beyond a threshold, alert. This is the feature store equivalent of data quality monitoring — and it is just as neglected.Q22: A senior engineer on your team says: 'Back-of-envelope estimates are a waste of time. Just build a prototype and load test it.' Argue both sides, then give your actual opinion.
Q22: A senior engineer on your team says: 'Back-of-envelope estimates are a waste of time. Just build a prototype and load test it.' Argue both sides, then give your actual opinion.
- Estimates are based on assumptions. Prototypes are based on measurements. An estimate that says “PostgreSQL can handle 10,000 queries/second” is a generalization. A prototype that runs a load test against your actual schema, your actual query patterns, and your actual data distribution gives you a number specific to your system. I have seen estimates that were 5x optimistic because they assumed simple key lookups but the actual queries involved 3-table joins with text search.
- Estimates give false confidence. Once a number is on a whiteboard, people treat it as fact. “We estimated 2 TB, so we provisioned 2 TB.” The estimate becomes the plan, and when reality diverges, the team is surprised. A prototype forces you to confront reality early.
- For well-understood problems, estimates are overhead. If you are building a CRUD API with PostgreSQL and Redis, the capacity characteristics are well-known. Spending 2 hours on a back-of-envelope estimate produces the same answer as “spin up an instance, run k6 against it for 30 minutes, and look at the numbers.” The prototype is faster and more accurate.
- You cannot prototype what does not exist yet. In a system design interview or a greenfield architecture decision, there is no system to load test. The estimate is how you decide what to build — should it be PostgreSQL or DynamoDB? Should it be a monolith or microservices? Should it be a single region or multi-region? These are architecture decisions that must be made before a prototype exists, and they are informed by estimates.
- Prototypes test the current scale, not the future scale. A load test today shows the system handles 5,000 req/s. But you are planning for 50,000 req/s in 18 months. You cannot load test a scale you have not reached yet — but you can estimate it. The estimate tells you whether the current architecture has a ceiling below your growth trajectory.
- Estimates are fast; prototypes are expensive. A back-of-envelope calculation takes 15 minutes on a whiteboard. A meaningful load test requires: provisioning infrastructure, writing load scripts, generating realistic test data, running the test, analyzing results, and cleaning up. That is 2-5 days of work. If the estimate clearly shows “this is 100x beyond what a single database can handle,” the prototype was wasted effort — you knew the answer before you started.
Follow-up: “When do estimates actively mislead — when is the senior engineer’s skepticism justified?”
Estimates mislead in three scenarios:- When the workload has pathological access patterns. Estimates assume average-case behavior. If 80% of traffic hits 1% of data (hot keys, celebrity accounts, viral content), average-case estimates are wildly optimistic. Only a load test with realistic hot-key distribution reveals the actual behavior.
- When the system involves complex interactions. A single-component estimate (“Redis handles 100K ops/s”) is reliable. A multi-component estimate (“the API server calls the cache, which calls the database, which triggers an async worker”) compounds errors. Each component estimate has a margin of error; multiplied together across 5 components, the total error can be an order of magnitude. Prototyping the integrated system catches interaction effects that component-level estimates miss.
- When you are using unfamiliar technology. If the team has 5 years of PostgreSQL experience, their PostgreSQL estimates are well-calibrated. If they are using DynamoDB for the first time, their estimates will miss DynamoDB-specific gotchas (partition throttling, GSI costs, eventual consistency delays). For unfamiliar technology, prototype early to calibrate your estimates.
Follow-up: “How do you handle disagreements between an estimate and a load test result?”
The load test wins on facts but the estimate wins on reasoning. If my estimate said 10,000 req/s and the load test shows 3,000 req/s, the load test is correct — but I need to understand why the estimate was off. Was my assumption about query complexity wrong? Was I right about CPU but wrong about I/O? Did I miss a locking bottleneck?The gap between estimate and measurement is the most valuable learning moment. It tells you which of your mental models is wrong. Fix the mental model, not just the number — because the next estimate needs to be better.Advanced Data Pipeline Topics
Change Data Capture (CDC) — Advanced Patterns
Section 43.5 introduced CDC fundamentals. This section covers the advanced operational patterns that separate production CDC deployments from proof-of-concept demos. CDC and Schema Evolution When the source database schema changes — a column is added, renamed, or its type changes — the CDC stream must handle it without breaking downstream consumers. This is where most CDC deployments encounter their first serious incident.- Additive changes (new nullable column). Debezium emits the new column in change events after the DDL executes. Downstream consumers using schema-on-read (e.g., writing to a data lake in Parquet format) handle this automatically. Consumers with strict schemas (Avro with a Schema Registry) require a compatible schema evolution — adding a nullable field is a backward-compatible change in Avro.
- Breaking changes (column rename, type change, column removal). These produce events that do not match the consumer’s expected schema. The consumer crashes or silently drops data. The fix: use a Schema Registry with compatibility enforcement. Confluent Schema Registry supports backward, forward, and full compatibility modes. A producer cannot register a schema that would break existing consumers.
- The operational pattern: Treat CDC schema changes like API versioning. The source team notifies downstream consumers before making breaking schema changes. The data contract (covered in Section 43.4) governs this handoff. Without this discipline, CDC becomes a fragile coupling between teams rather than the decoupling mechanism it was meant to be.
Late Data — Advanced Strategies
Section 43.6 introduced late-arriving data. Here is the deeper treatment that senior and staff-level candidates need. Watermarks in Stream Processing A watermark is a stream processor’s declaration: “I believe all events with timestamps up to time T have arrived.” Events arriving after the watermark has advanced past their timestamp are considered “late.” The watermark is the stream processor’s answer to the question: “When can I safely close a time window and emit results?”- Perfect watermarks wait for every event to arrive before advancing. This guarantees correctness but introduces unbounded latency — one delayed event holds up the entire pipeline.
- Heuristic watermarks advance based on observed event patterns (e.g., “I have not seen a new event with timestamp before T for the last 30 seconds, so I will advance the watermark to T”). This produces timely results but may miss late events.
- The trade-off is always latency vs completeness. A real-time fraud detection system uses aggressive watermarks (close windows after 10 seconds, accept that 0.1% of events will be late and handle them as corrections). A financial reconciliation system uses conservative watermarks (wait 24 hours before closing a window, because every transaction must be accounted for).
Backfills — Cost and Safety Guardrails
Section 43.6 covered backfill risks. Here are the engineering guardrails that prevent backfills from becoming incidents. The Backfill Safety Checklist:| Step | What to Check | Why |
|---|---|---|
| Cost estimate | Compute the backfill’s warehouse cost before running. Single-day cost * number of days = total. | A 1-year backfill on a model that costs 50 credits/day = 4/credit. |
| Idempotency verification | Confirm the pipeline uses INSERT OVERWRITE or DELETE + INSERT, not bare INSERT. Run the pipeline twice on one day and verify identical results. | A non-idempotent backfill doubles every row it touches. |
| Downstream notification | Alert all consumers of the affected tables before starting. Include date range, expected completion time, and what numbers will change. | Prevents “why did yesterday’s numbers change?” incidents. |
| Chunked execution | Run month-by-month or week-by-week, not all at once. Monitor cost and correctness after the first chunk. | Catches issues early and prevents runaway cost. |
| Concurrency guard | Pause the daily incremental pipeline or use partition-level locking to prevent simultaneous writes to the same partitions. | Prevents race conditions between the backfill and the regular pipeline. |
| Rollback plan | Know how to revert if the backfill produces wrong results. Keep a snapshot of the pre-backfill state or use table versioning (Snowflake Time Travel, Delta Lake versioning). | Allows recovery without re-running the original pipeline. |
Freshness vs Cost — Detailed Decision Framework
Section 43.7 introduced the freshness-cost spectrum. This section provides the decision framework for negotiating SLAs with stakeholders. The Freshness Negotiation Script: When a stakeholder says “we need real-time data,” do not accept that at face value. Walk through this script:- “What decision changes if this data is 1 hour fresher?” — This identifies whether freshness has actual business value or is a perceived need.
- “What is the cost of stale data in dollars per hour of delay?” — For fraud detection, the answer might be “$10,000/hour.” For a weekly board report, the answer is zero.
- “Can we differentiate freshness by data product?” — The fraud pipeline needs real-time. The revenue dashboard needs daily. The ML training pipeline needs weekly. Apply the most demanding SLA only where justified.
- “Are you willing to pay 5-10x more for real-time vs daily?” — This is not confrontational — it is budget transparency. Many stakeholders retract their real-time requirement when they see the cost.
| Use Case | Recommended Freshness | Architecture | Justification |
|---|---|---|---|
| Fraud/risk alerting | Real-time (< 1 min) | Kafka + Flink | Dollar cost of missed fraud exceeds infrastructure cost |
| Live operational dashboards | Near-real-time (1-15 min) | Kafka + micro-batch | Operators need current data but do not need sub-second |
| Executive dashboards | Daily | Airflow + dbt | Executives review once/day; hourly adds no decision value |
| ML model training | Daily to weekly | Scheduled Spark/dbt | Models retrain on schedule, not on every new row |
| Financial reporting | Daily with reconciliation | Batch + validation gates | Accuracy matters more than speed; corrections are expensive |
| Ad-hoc analytics | Hourly to daily | Incremental dbt models | Analysts query on-demand; hourly freshness covers most needs |
Data Lineage — Operational Depth
Section 43.8 introduced lineage concepts. Here is the operational depth that interviewers expect at senior and staff levels. Column-Level Lineage Table-level lineage (“this dashboard reads from table X, which was built from tables Y and Z”) is the minimum. Column-level lineage (“therevenue column in the dashboard comes from orders.amount minus refunds.amount, joined on order_id”) is what you need for debugging and compliance.
- dbt provides column-level lineage through its documentation features and the
dbt-osmosisordbt-column-lineagecommunity packages. - OpenLineage captures column-level lineage from Spark, Airflow, and other tools by inspecting query plans at runtime.
- Why it matters: When GDPR requires you to answer “where does
users.emailappear in all downstream tables and models,” table-level lineage tells you which tables to check. Column-level lineage tells you exactly which columns contain derived PII — without manually inspecting every transformation.
- Domain teams own their operational data. The payments team owns the
transactionsschema. They define the contract, manage the schema, and guarantee the freshness SLA. - The data platform team owns the infrastructure. Kafka, Airflow, the warehouse, CDC connectors. They do not own the data inside the infrastructure — they own the pipes.
- Analytics engineering owns the transformation layer. dbt models, metric definitions, data quality tests. They translate raw operational data into business-consumable datasets.
- Consumers own their interpretations. Dashboard authors and ML engineers use the approved metrics. If they need a new metric, they request it from analytics engineering — they do not write their own SQL against raw tables.
Pipeline Failure and Recovery — Extended Patterns
Section 43.9 introduced failure taxonomy. Here are the advanced recovery patterns for each pipeline type. Batch Pipeline Recovery (Airflow/dbt)| Failure Scenario | Immediate Action | Long-term Fix |
|---|---|---|
| dbt model fails mid-run | Airflow marks task as failed. Rerun from the failed task (not the entire DAG). | Add data quality tests that catch the root cause earlier. |
| Airflow scheduler dies | Workers finish in-flight tasks. No new tasks are scheduled. | Deploy Airflow in HA mode with multiple schedulers (Airflow 2.0+). Monitor scheduler heartbeat with a separate watchdog. |
| Warehouse runs out of compute credits | Queries queue indefinitely or fail with timeout. | Set resource monitors (Snowflake) or query timeout limits. Use warehouse auto-suspend to prevent idle cost. |
| Source data arrives late | Pipeline runs on schedule but processes incomplete data. | Add a sensor task that waits for source data (check row count or max timestamp) before proceeding. Set a timeout on the sensor to avoid indefinite waits. |
| Failure Scenario | Immediate Action | Long-term Fix |
|---|---|---|
| Consumer lag spike | Events back up in Kafka. If lag exceeds retention, events are lost. | Scale consumers (add partitions + consumer instances). Alert on lag threshold, not just lag existence. |
| Checkpoint failure | Flink cannot save state. If it crashes, it restarts from the last successful checkpoint, reprocessing events since then. | Monitor checkpoint duration and size. If checkpoints take longer than the checkpoint interval, the pipeline is behind and will never catch up. |
| Kafka broker failure | Partitions become unavailable until ISR (in-sync replicas) elect a new leader. | Ensure min.insync.replicas >= 2 and replication factor >= 3. Monitor under-replicated partitions. |
| Poison message (malformed event) | Consumer crashes in a loop trying to deserialize. | Implement a dead-letter topic. Route undeserializable events there and continue processing. Alert on dead-letter volume. |
| Failure Scenario | Immediate Action | Long-term Fix |
|---|---|---|
| Metadata database bloat | Airflow UI becomes slow. Scheduler performance degrades. | Run airflow db clean on a schedule. Set [core] max_active_tasks_per_dag and [scheduler] dag_dir_list_interval to reduce metadata growth. |
| Worker pool exhaustion | Tasks queue but never execute. | Monitor Celery/Kubernetes worker queue depth. Auto-scale workers based on queue size. Set [core] parallelism and [core] max_active_tasks_per_dag to prevent one DAG from consuming all workers. |
| DAG import error | The DAG disappears from the UI. Scheduled runs stop. | Add a CI step that validates DAG syntax (python -c "import dag_file") before merging. Monitor active DAG count — a sudden drop indicates an import error. |
Interview Question: Your CDC pipeline using Debezium has been silently dropping delete events. Downstream consumers still show deleted records. Walk me through your investigation and fix.
Interview Question: Your CDC pipeline using Debezium has been silently dropping delete events. Downstream consumers still show deleted records. Walk me through your investigation and fix.
op: 'd' (delete operations). If delete events are present in Kafka but missing downstream, the problem is in the consumer.Layer 2 — Is the source database actually performing hard deletes? If the application uses soft deletes (UPDATE ... SET deleted_at = NOW() instead of DELETE), Debezium emits an update event, not a delete event. The consumer must check for the deleted_at field in update events and treat them as deletions. This is the most common cause of “missing deletes” in CDC pipelines.Layer 3 — Is the Debezium tombstone event being handled? After a delete event, Debezium emits a “tombstone” event (null value for the same key) to support Kafka log compaction. Some consumer implementations skip null-value events, inadvertently discarding the signal that a record was deleted.Layer 4 — Is the connector configured for delete handling? Debezium’s delete.handling.mode configuration determines how deletes are emitted. The default (drop) actually drops the value portion of the delete event in some connector versions. Set it to rewrite to include the full before-image of the deleted row in the event, or none to emit the standard Debezium delete envelope.The fix: Depends on the root cause, but the structural fix is to add a data quality assertion in the consumer that compares the count of active records in the source (via a periodic SELECT COUNT(*)) against the warehouse. If they diverge beyond a threshold, alert. This catches not just missing deletes but any category of missed change events.