The Engineering Mindset — How to Think, Not Just What to Know
1. First Principles Thinking
First principles thinking means decomposing a problem down to its fundamental truths (the things that are undeniably true) and then building your reasoning upward from there instead of reasoning by analogy or convention. A critical distinction: first principles thinking is not about being contrarian. It is not about rejecting every convention or reinventing every wheel. It is about understanding the WHY behind every technical choice so you can make better choices in new situations. The engineer who understands why we use connection pooling, not just that we use it, can reason about resource management in any context, even ones they have never encountered. The engineer who only knows the “what” is stuck the moment the context changes.
What It Means in Practice
- What is the actual problem we are solving?
- What are the fundamental constraints?
- What are all the possible ways to satisfy those constraints?
- Which way best fits our specific context?
- What is the real problem? Our service A needs to communicate with service B, but they run at different speeds and we cannot afford to lose requests.
- What are the fundamental needs? Decoupling (A should not crash if B is down), buffering (absorb traffic spikes), async processing (A should not wait for B).
- What solutions exist? In-memory queue, database-backed queue, Redis streams, RabbitMQ, Kafka, cloud-managed queues (SQS, Pub/Sub).
- What fits our context? We process 500 messages/second with a 3-person team. Kafka’s operational overhead is not justified. SQS or RabbitMQ fits better.
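To make the three fundamental needs concrete, here is a minimal in-process sketch using Python's standard `queue.Queue`. It only illustrates decoupling, buffering, and async hand-off. The function name `run_demo` and its parameters are hypothetical, and an in-memory queue does not survive a process crash, which is one reason the durable options listed above exist.

```python
import queue
import threading
import time


def run_demo(num_messages=20, consumer_delay=0.001, buffer_size=100):
    """Service A (producer) hands work to service B (consumer) through a buffer."""
    buffer = queue.Queue(maxsize=buffer_size)  # buffering: absorbs bursts
    processed = []

    def consumer():
        # Service B drains the buffer at its own, slower pace.
        while True:
            item = buffer.get()
            if item is None:  # sentinel: the producer is done
                break
            time.sleep(consumer_delay)  # simulate slow processing
            processed.append(item)

    worker = threading.Thread(target=consumer)
    worker.start()

    # Service A enqueues and moves on; put() only blocks if the buffer is full.
    for message_id in range(num_messages):
        buffer.put(message_id)
    buffer.put(None)

    worker.join()
    return processed
```

With a single consumer the messages come out in the order they were enqueued. The bounded `maxsize` is a deliberate choice: when the buffer fills, the producer blocks instead of growing memory without limit.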
The Anti-Pattern: Cargo Culting
- “We need microservices because Amazon uses them” — Amazon has 10,000+ engineers. You have 12.
- “We need Kubernetes” — for a single service with predictable traffic, a managed PaaS may be simpler.
- “We need a NoSQL database because it scales” — your relational database handles your load perfectly fine and gives you ACID guarantees you actually need.
Exercise: The Five Whys
Interview Questions — First Principles
- What specific problems is the monolith causing? (Deploy speed? Team coupling? Scaling bottlenecks?)
- Are there simpler solutions? (Modular monolith? Extracting only the bottleneck service?)
- What is the cost of microservices? (Network complexity, distributed transactions, operational overhead)
- What is our team size and operational maturity?
- Can we do an incremental extraction (strangler fig pattern) instead of a full rewrite?
- What evidence would change your mind? If deploy frequency is the bottleneck and you can show that a modular monolith with independent deploy pipelines achieves the same result, the microservices argument collapses.
- What artifact do you produce before writing code? A one-page RFC comparing modular monolith vs strangler fig vs full rewrite, with estimated cost, timeline, and rollback plan for each. The act of writing that document kills most premature microservices proposals.
- What would you do first in production? Extract exactly one service — the one causing the most pain — and run it for 90 days. Measure deploy frequency, incident rate, and operational overhead. That single extraction teaches you more about your org’s readiness than any whiteboard discussion.
- What problem does GraphQL solve? (Over-fetching, under-fetching, multiple round trips)
- Do we actually have those problems? (If we have 3 endpoints consumed by 1 frontend, probably not)
- What does GraphQL cost? (Learning curve, caching complexity, N+1 query risks, schema management)
- Is there a middle ground? (BFF pattern, optimized REST endpoints, sparse fieldsets)
- What is the rollback plan if GraphQL adoption goes badly? If you have external consumers on your REST API, you are running two API surfaces forever. That is a one-way door disguised as a two-way door.
- How do you measure success? Track frontend developer velocity (time from design to working API call), payload sizes, and number of API round trips per page load. If none of those improve after 60 days, the migration is not paying for itself.
- Failure mode: GraphQL’s flexibility lets clients craft arbitrarily expensive queries. Without query complexity analysis and depth limiting, a single malicious or careless query can take down the backend. You need a query cost budget before going to production.
- Rollout: Run the GraphQL gateway alongside REST for at least 60 days. Internal consumers migrate first; external consumers stay on REST until the gateway is proven.
- Rollback: Since REST still exists, rollback means reverting clients to the REST endpoints. The cost is wasted frontend migration effort, not a data problem.
- Measurement: Frontend time-from-design-to-working-API-call, aggregate payload sizes, API round trips per page load, and p99 query execution time on the GraphQL server.
- Cost: GraphQL gateway adds a hop, compute for query parsing, and potentially a schema registry. At high traffic, the gateway itself becomes a scaling concern.
- Security/governance: GraphQL’s introspection feature exposes your entire schema by default. Disable introspection in production and implement field-level authorization — a detail most teams miss until their first security audit.
- Martin Fowler — “MonolithFirst” (martinfowler.com/bliki/MonolithFirst.html) — the canonical argument for starting with a monolith and extracting services only when pain justifies it.
- Shopify Engineering — “Deconstructing the Monolith” — how Shopify built module boundaries inside a Rails monolith instead of breaking it apart.
- Dan McKinley — “Choose Boring Technology” (boringtechnology.club) — the essay every senior engineer should read before proposing a new tool.
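The GraphQL failure mode above calls for depth limiting before production. The following is a minimal sketch of the idea, not the API of any real GraphQL library: it assumes a query's selection set has already been parsed into nested dictionaries, and the names `selection_depth`, `enforce_depth_limit`, and `QueryTooDeepError` are hypothetical.

```python
class QueryTooDeepError(ValueError):
    """Raised when a query nests deeper than the configured budget."""


def selection_depth(selection):
    """Depth of a selection set given as nested dicts: {field: {subfield: {...}}}."""
    if not selection:
        return 0
    return 1 + max(selection_depth(child) for child in selection.values())


def enforce_depth_limit(selection, max_depth=5):
    """Reject a query before execution if it nests deeper than max_depth."""
    depth = selection_depth(selection)
    if depth > max_depth:
        raise QueryTooDeepError(f"query depth {depth} exceeds limit {max_depth}")
    return depth


# Hypothetical query: orders -> lineItems -> product -> name
query = {"orders": {"lineItems": {"product": {"name": {}}}}}
print(enforce_depth_limit(query, max_depth=5))
```

Depth is only one dimension of cost. A production budget would usually also weight list fields and pagination arguments, which this sketch leaves out.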
AI-Assisted Engineering Lens -- First Principles
Work-Sample Pattern -- First Principles
2. Systems Thinking
Systems thinking means understanding that everything is connected. Changing one component in a system creates ripple effects across other components, often in ways you did not predict.
Everything Is Connected
- Direct effect: that endpoint is faster.
- Second-order effect: the endpoint now handles more traffic, which increases connection pool usage.
- Third-order effect: other endpoints sharing the connection pool start timing out.
- Fourth-order effect: users retry those endpoints, creating a thundering herd problem.
Feedback Loops
- Server slows down → requests queue up → more load → server slows down more → cascading failure
- Retry storms: a failed request triggers a retry, which adds load, which causes more failures, which triggers more retries
- Alert fatigue: too many alerts → engineers ignore alerts → real incidents get missed → more alerts
- Auto-scaling: load increases → more instances spin up → load per instance decreases
- Circuit breakers: failures increase → circuit opens → failing service gets relief → recovers → circuit closes
- Rate limiting: traffic spikes → excess requests get rejected → backend stays healthy
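The circuit breaker loop above (failures increase, circuit opens, service recovers, circuit closes) can be sketched in a few lines. This is a simplified illustration, not a production implementation: the class name, the thresholds, and the injectable `clock` parameter are assumptions made for the example, and it is not thread-safe.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Open: fail fast so the struggling dependency gets relief.
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Passing the clock in makes the open and half-open transitions testable without sleeping.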
Emergent Behavior in Distributed Systems
- The Thundering Herd: Caches expire at the same time. Every server hits the database simultaneously. No single server decided to overload the database — the behavior emerged from the interaction.
- The Metastable Failure: The system is stable under normal load, but a brief spike pushes it into a degraded state it cannot recover from, even after the spike ends. The degraded state sustains itself through positive feedback loops.
- Split-Brain: Two nodes each believe they are the leader. Neither is “wrong” given their local view — the emergent behavior (data corruption) arises from the network partition between them.
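One common mitigation for the thundering herd described above is to stop caches from expiring at the same instant by adding random jitter to each TTL. A minimal sketch, assuming a hypothetical helper `jittered_ttl` and a 10% spread chosen only for illustration:

```python
import random


def jittered_ttl(base_ttl_seconds, jitter_fraction=0.1, rng=random):
    """Return the base TTL shifted by a random amount within +/- jitter_fraction."""
    spread = base_ttl_seconds * jitter_fraction
    return base_ttl_seconds + rng.uniform(-spread, spread)


# Each cache entry gets a slightly different lifetime, so expirations spread out
# over a window instead of landing on the database at the same moment.
ttls = [jittered_ttl(300) for _ in range(5)]
print(ttls)
```

Jitter spreads the load but does not remove it. Request coalescing, where only one caller recomputes an expired key, addresses the same problem from the other side.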
The "Blast Radius" Mental Model
| Blast Radius | Example | Approach |
|---|---|---|
| Small | A CSS color change | Ship it, fix forward if wrong |
| Medium | A new API endpoint | Feature flag, canary deploy |
| Large | Database migration | Blue-green deploy, extensive testing, rollback plan |
| Critical | Auth system change | Multi-stage rollout, shadow testing, manual approval gates |
Second-Order Effects
Second-Order: Stale Data
Second-Order: Memory Pressure
Second-Order: Cache Invalidation Complexity
Interview Questions — Systems Thinking
- What would you do first in production? Check the error rate dashboard and the request count metric side by side. If request count dropped proportionally to CPU, traffic is being rejected or redirected before reaching your service — check the load balancer and DNS.
- What artifact would you create after confirming the cause? A Grafana dashboard panel that correlates CPU utilization with request count, error rate, and response code distribution. The next time CPU drops unexpectedly, the on-call engineer sees the correlation immediately instead of guessing.
- What evidence would change your analysis? If the dependencies are called in parallel with independent failure modes and any 3 of 5 succeeding is sufficient (quorum), the math changes dramatically. Architecture decisions alter the reliability equation.
- What is the security and governance angle? If one of those 5 dependencies is an auth service, its failure mode is not “request fails gracefully” — it is “request succeeds without authorization.” Reliability and security failure modes are different beasts.
- What would you do first in production? Pull the DORA metrics dashboard for the last 6 months. Plot deploy frequency, lead time, change failure rate, and mean time to recovery on the same timeline. The correlation pattern tells you which metric degraded first — that is the root cause, not the symptoms.
- What artifact would you create? A “Deploy Health” Grafana dashboard that shows deploys per week, average changeset size, rollback frequency, and time-from-merge-to-production. Set an alert if deploy frequency drops below 4 per week for 2 consecutive weeks. The goal is catching the boiling frog before the water boils.
- What evidence would change your mind about the feedback loop theory? If deploy size remained constant while frequency dropped, the cause is not fear — it is something else. Check whether the CI pipeline got slower, whether a new approval gate was added, or whether the team lost headcount. Same symptom, completely different root cause.
- What is the security implication? Infrequent deploys mean security patches sit in the queue longer. If your deploy cadence is 2 per week and a critical CVE drops on Wednesday, the patch might not reach production until next Monday. Deploy frequency is a security metric that most teams do not track.
- First 30 minutes: identify which metrics I can trust. If CPU and memory on the database host are instrumented correctly, start there. If connection pool utilization is reliable, that is my second signal. Work with what you have, not what you wish you had.
- Next 30 minutes: query `pg_stat_activity` (or equivalent) to see active queries, waiting queries, and lock contention right now. This is real-time ground truth that does not depend on dashboard configuration.
- Next 30 minutes: characterize the workload. Is this read-heavy (read replica helps), write-heavy (vertical scale or schema optimization), or connection-heavy (connection pooler like PgBouncer)? The answer determines the solution.
- Final 90 minutes: propose the solution with the lowest blast radius and fastest rollback. Vertical scaling is a one-click operation with 5 minutes of downtime. A read replica takes hours to provision but zero downtime. A caching layer is a code change with the highest risk of introducing bugs. Under time pressure with incomplete data, I optimize for reversibility.
- What evidence would change your solution ranking? If the slow query log shows 80% of load from 3 queries that scan full tables, none of the three options is right — the answer is indexing. Always check whether the problem is the engine or the fuel before replacing the car.
- What artifact comes out of this decision? An ADR titled “Database scaling decision under incomplete observability — [date].” It documents: what data we had, what data we lacked, what we decided and why, and a trigger condition to revisit. It also includes an action item: “Fix misconfigured dashboards within 2 weeks so this decision can be validated with real data.”
- How would you use AI tooling here? Paste the `pg_stat_activity` output and slow query log into an LLM with the prompt: “Given this PostgreSQL workload profile, which scaling strategy (read replica, caching layer, or vertical scaling) addresses the bottleneck most directly? Show your reasoning.” The LLM can surface likely bottlenecks faster than you can reason through the raw output manually. But verify its recommendation against your specific constraints, because the LLM does not know your deployment topology or change management process.
- Nicole Forsgren et al. — “Accelerate: The Science of Lean Software and DevOps” — the source for the DORA metrics framework.
- Netflix Tech Blog (netflixtechblog.com) — “The Netflix Simian Army” — original writeup on Chaos Monkey and regional failure simulation.
- Martin Fowler — “CircuitBreaker” (martinfowler.com/bliki/CircuitBreaker.html) — the pattern that breaks multiplicative failure chains.
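The dependency-chain discussion earlier in this section notes that requiring any 3 of 5 dependencies (a quorum) changes the reliability math dramatically. A minimal sketch of that arithmetic follows. The 99.9% per-dependency availability is an illustrative assumption, not a figure from the text, and both functions assume the dependencies fail independently.

```python
from math import comb


def serial_availability(p, n):
    """All n dependencies must succeed: availabilities multiply."""
    return p ** n


def quorum_availability(p, n, k):
    """At least k of n independent dependencies must succeed (binomial tail)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


p = 0.999  # assumed availability of each dependency
print(f"all 5 required: {serial_availability(p, 5):.6f}")
print(f"any 3 of 5:     {quorum_availability(p, 5, 3):.9f}")
```

With these assumed numbers, requiring all five gives roughly 99.5% availability, while a 3-of-5 quorum is far closer to 100%. Correlated failures (a shared network, a shared deploy) break the independence assumption and make real systems worse than this model suggests.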
3. Trade-Off Thinking
The hallmark of a senior engineer is understanding that there are no “best” solutions, only trade-offs. Every decision optimizes for some things at the expense of others.
The "It Depends" Framework
- Scale: 100 users vs 100 million users demand different architectures.
- Team size and expertise: A 3-person team cannot operate 20 microservices. A 200-person org cannot share a single monolith.
- Timeline: A startup racing to product-market fit needs different trade-offs than a bank migrating a core system.
- Requirements clarity: If requirements will change significantly, optimize for flexibility. If they are well-understood, optimize for performance.
- Regulatory constraints: GDPR, HIPAA, SOX — these are non-negotiable and override other preferences.
- Budget: A managed database at $500/month might be better than a self-hosted one requiring 20 hours/month of DBA time.
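The budget comparison above becomes clearer once operational hours are priced in. A minimal sketch of that arithmetic, where the $75/hour engineer cost and the $100/month self-hosted infrastructure bill are hypothetical numbers chosen only for illustration:

```python
def monthly_cost(infra_dollars, ops_hours, hourly_rate):
    """Total monthly cost: the infrastructure bill plus the engineer time to run it."""
    return infra_dollars + ops_hours * hourly_rate


HOURLY_RATE = 75  # assumed fully loaded cost of one engineer hour

managed = monthly_cost(infra_dollars=500, ops_hours=0, hourly_rate=HOURLY_RATE)
self_hosted = monthly_cost(infra_dollars=100, ops_hours=20, hourly_rate=HOURLY_RATE)

print(f"managed:     ${managed}/month")
print(f"self-hosted: ${self_hosted}/month")
```

Under these assumptions the self-hosted option costs more even though its infrastructure bill is lower. The conclusion flips only if the hourly rate or the required hours are much smaller than assumed here.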
Reversibility: The Amazon Decision Framework
- Choosing a logging library
- API response format (if you version your API)
- UI layout changes
- Feature flag experiments
- Database schema for a core entity with billions of rows
- Public API contract (once external clients depend on it)
- Choice of programming language for a core system
- Data deletion policies
YAGNI Applied to Architecture
- Building a plugin system for an internal tool used by 5 people
- Adding Kafka when your throughput is 10 events/second
- Implementing CQRS when you have a single database with straightforward read/write patterns
- Creating an abstraction layer “in case we switch databases” when you have never switched databases
- More code to maintain
- More indirection to debug
- More complexity for new team members to learn
- Abstractions built without real use cases often have the wrong API
When TO Over-Engineer
Interview Questions — Trade-Off Thinking
- What are the access patterns? (Relational joins? Key-value lookups? Document retrieval?)
- What are the consistency requirements? (Financial transactions need ACID. Social media feeds tolerate eventual consistency.)
- What is the schema stability? (Rapidly evolving schema favors document stores. Stable, relational data favors SQL.)
- What scale are we targeting? (At moderate scale, PostgreSQL handles almost everything. At extreme write throughput, you might need DynamoDB or Cassandra.)
- What does the team know? (Operational expertise matters — a team skilled in PostgreSQL will run it better than a team learning MongoDB.)
- What evidence would make you switch databases mid-project? If write latency at p99 exceeds your SLA for 3 consecutive weeks despite optimization, and profiling shows the bottleneck is the storage engine itself (not queries), that is evidence — not just a feeling — that you need a different database.
- What is the cost dimension nobody talks about? DBA time. A self-hosted Cassandra cluster that saves $2K/month over Aurora but requires 30 hours/month of operational attention is not saving money — it is spending more through a different budget line.
- What artifact documents the shortcuts? An ADR (Architecture Decision Record) titled “Temporary: [Feature] shipped with [shortcut] — revisit by [date].” It lists: what was skipped, what breaks if we do not fix it, and the trigger condition (traffic threshold, user count, or date) that forces the follow-up.
- What is the rollback plan? Feature flag with a kill switch. If the 2-week version causes production issues, one engineer can disable it in 30 seconds without a deploy.
- What would you measure to know the shortcut is becoming a problem? Set a Datadog alert on the specific dimension that was compromised. Skipped pagination? Alert on response payload size exceeding 1MB. Skipped rate limiting? Alert on requests per user per minute exceeding 100.
- Security vulnerability first — even if it affects zero users today, an unpatched vulnerability is a ticking bomb. If the bug bounty reporter goes public, the blast radius becomes infinite. One engineer starts here immediately.
- P1 bug second — 5% of users affected means real revenue impact right now. But “P1” needs validation: is it truly 5%, or is that a noisy metric? The second engineer verifies the impact while the first works on the security fix.
- Partner integration last — a 48-hour deadline is a business commitment, but it is negotiable. A security breach and a P1 outage are not. I call the partner and explain a 24-hour delay, which buys the second engineer time to address the P1 and then pivot. The meta-principle: never let a deadline override a security issue or an active outage. Deadlines can be renegotiated. Breaches and customer trust cannot.
- Martin Fowler — “TechnicalDebtQuadrant” (martinfowler.com/bliki/TechnicalDebtQuadrant.html) — the prudent vs reckless / deliberate vs inadvertent framework.
- Amazon Leadership Principles (amazon.jobs/en/principles): the “Bias for Action” principle makes the point that many decisions are reversible; the “one-way vs two-way doors” language itself comes from Jeff Bezos’s shareholder letters, which go deeper.
- GitHub Engineering Blog — posts on Vitess, MySQL sharding, and why GitHub did not migrate off MySQL.
4. The Inversion Technique — Think Backward to Move Forward
Most engineers approach problems by asking: “How do I make this work?” Inversion flips the question to “How could this fail?” and then you systematically prevent each failure mode. This is not pessimism. It is the single most reliable way to build robust systems, and it is Charlie Munger’s favorite mental tool. Munger, Warren Buffett’s longtime partner at Berkshire Hathaway, borrowed the technique from the mathematician Carl Jacobi, who famously advised: “Invert, always invert.” Munger applied it to investing, business, and life decisions. For engineers, it is devastatingly effective, because software systems have far more ways to fail than to succeed, and the failure modes are often more enumerable than the success conditions.
How Inversion Works in Practice
- A charge goes through but we do not record it (money lost, customer charged twice on retry)
- We record it but the charge did not actually go through (revenue leakage)
- The same payment processes twice (double-charge)
- The system is down during peak checkout (lost revenue, lost trust)
- A partial failure leaves the order in an inconsistent state (charged but no order, or order but no charge)
- An attacker replays a payment request (fraud)
| Failure Mode | Prevention |
|---|---|
| Charge without record | Write to database before calling payment provider, use idempotency keys |
| Record without charge | Reconciliation job that compares internal records with provider |
| Double-charge | Idempotency keys on every payment API call |
| System downtime at peak | Queue-based processing, graceful degradation, retry with backoff |
| Inconsistent state | Saga pattern or two-phase approach with compensation logic |
| Replay attack | Unique request IDs with server-side deduplication, expiring tokens |
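Idempotency keys appear twice in the table above, so a small sketch may help. It is a simplified in-memory illustration: the class `PaymentService`, its `charge_fn` callback, and the dictionary used as a result store are all hypothetical. A real system would need a durable store with an atomic check-and-set so that concurrent retries cannot both reach the provider.

```python
class PaymentService:
    """Replays the stored result when the same idempotency key is seen again."""

    def __init__(self, charge_fn):
        self._charge_fn = charge_fn  # stand-in for the call to the payment provider
        self._results = {}  # idempotency key -> result of the first attempt

    def charge(self, idempotency_key, amount_cents):
        if idempotency_key in self._results:
            # Retry or replay: return the original outcome, do not charge again.
            return self._results[idempotency_key]
        result = self._charge_fn(amount_cents)
        self._results[idempotency_key] = result
        return result


provider_calls = []


def fake_provider(amount_cents):
    provider_calls.append(amount_cents)
    return {"status": "charged", "amount_cents": amount_cents}


service = PaymentService(fake_provider)
first = service.charge("order-123-attempt", 4999)
retry = service.charge("order-123-attempt", 4999)  # e.g. a client retry after a timeout
print(first == retry, len(provider_calls))
```

The key is chosen by the caller per logical payment, so a network retry reuses it while a genuinely new payment gets a new one.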
Inversion Beyond System Design
- Instead of “Does this code work?” ask “How could this code break?”
- Instead of “Is this test good?” ask “What bugs would this test NOT catch?”
- Instead of “Is this API well-designed?” ask “How could a consumer misuse this API?”
- Instead of “How do we deliver on time?” ask “What would cause us to miss the deadline?”
- Common answers: unclear requirements, key person unavailable, dependency on another team, underestimated migration complexity.
- Now mitigate each risk before starting.
- Instead of “How do I get promoted?” ask “What behaviors would guarantee I do NOT get promoted?”
- Common answers: only doing assigned work, never writing design docs, avoiding cross-team visibility, not mentoring others.
- Stop doing those things.
- Before a major deploy, run a pre-mortem: assume the deploy has already failed catastrophically. Now work backward — what went wrong? This surfaces risks that optimism hides. Amazon, Google, and many other companies use pre-mortems as a standard practice before high-stakes launches.
Interview Questions — Inversion Technique
- Large uploads fail midway — use chunked uploads with resumability so users do not restart from zero.
- Storage fills up — set per-user quotas, implement lifecycle policies, monitor disk usage with alerts.
- Malicious files uploaded — scan uploads asynchronously with antivirus, validate file types, sandbox processing.
- Two users upload the same filename — use unique storage keys (UUIDs), not user-provided filenames.
- Upload succeeds but metadata write fails: write metadata first as ‘pending’, update to ‘complete’ after storage confirms.
With those failure modes covered, the core architecture is: chunked upload API, object storage (S3) for files, database for metadata, async processing pipeline for validation.
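Two of the mitigations above, unique storage keys and the pending-then-complete metadata write, can be sketched together. This is an in-memory illustration only: plain dictionaries stand in for the metadata database and the object store, and the function name `upload_file` is hypothetical.

```python
import uuid


def upload_file(metadata_db, object_store, user_filename, data):
    """Record metadata as 'pending', store the bytes, then mark 'complete'."""
    # A server-generated key avoids collisions between identical user filenames.
    storage_key = str(uuid.uuid4())

    # Step 1: write metadata first, so every stored object has a record.
    metadata_db[storage_key] = {"filename": user_filename, "status": "pending"}

    # Step 2: store the bytes. If this raises, the record stays 'pending'
    # and a cleanup job can find and remove it later.
    object_store[storage_key] = data

    # Step 3: only after storage confirms, flip the status.
    metadata_db[storage_key]["status"] = "complete"
    return storage_key


metadata_db, object_store = {}, {}
key_a = upload_file(metadata_db, object_store, "report.pdf", b"first")
key_b = upload_file(metadata_db, object_store, "report.pdf", b"second")
print(key_a != key_b, metadata_db[key_a]["status"])
```

The ordering means a crash can leave a 'pending' record without an object, but never an object without a record, which is the easier inconsistency to detect and repair.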
- What is the rollout plan? 1% canary for 1 hour, 10% for 4 hours, 50% for 24 hours, 100%. At each stage, check: error rate, p99 latency, storage usage growth rate, and antivirus scan completion rate.
- What artifact comes out of this design? A runbook page: “File Upload Service: Operational Guide.” Sections: architecture diagram, failure modes and mitigations, alerting thresholds, escalation contacts, and the exact `curl` command to check health.
- What is the cost dimension? Storage costs are the hidden killer for file upload services. At 5GB per file, 1,000 uploads per day means 5TB per day. At S3 standard pricing, that is roughly $17,250/month. Without lifecycle policies that move cold files to Glacier or delete them after retention periods, your storage bill grows linearly forever. The inversion question “what if storage fills up?” should really be “what is our storage cost at 10x current usage and do we have a lifecycle policy?”
- What is the security consideration beyond antivirus? File uploads are a classic attack vector for server-side request forgery (SSRF) and path traversal. If the upload path includes user-provided filenames without sanitization, an attacker can overwrite system files. If the upload triggers server-side processing (thumbnail generation, document preview), a malicious file can exploit image library vulnerabilities. Always process uploads in a sandboxed environment isolated from your main application.
- How would you use AI-assisted tooling in this design? Ask an LLM: “Given these file upload failure modes, generate the Terraform configuration for an S3 bucket with lifecycle policies, event notifications for antivirus scanning, and CloudWatch alarms for storage growth rate.” The infrastructure-as-code generation is where AI saves the most time: the YAML/HCL boilerplate is tedious for humans but well-suited for LLMs. Review the generated IAM policies carefully, though, because AI frequently generates overly permissive policies (`s3:*` instead of `s3:PutObject`).
- What document does the pre-mortem produce? A launch checklist with go/no-go criteria. Example: “Go if: error rate < 0.5%, p99 latency < 500ms, no P1 bugs open. No-go if: on-call engineer is unfamiliar with the feature, rollback was not tested in staging, or support FAQ is not published.”
- What is the rollback plan, and have you tested it? A feature flag kill switch that has been tested in staging by actually toggling it and verifying the feature disappears cleanly without data corruption. Untested rollback plans are not plans — they are hopes.
- What would you measure for 72 hours after full rollout? Error rates, latency, conversion rate (if applicable), support ticket volume, and — critically — the absence of expected behavior. If the feature is supposed to increase engagement and engagement is flat, that is a signal the feature may not be working even though nothing is technically broken.
- Gary Klein — “Performing a Project Premortem” (HBR, 2007) — the original framing of pre-mortems.
- Netflix Tech Blog (netflixtechblog.com) — “Chaos Engineering Principles” — how inversion scales to operational practice.
- Shopify Engineering (shopify.engineering) — posts on BFCM readiness and game-day exercises.
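The storage cost concern raised for the file upload service above is easier to reason about with the arithmetic written out. A minimal sketch, assuming a flat price of $0.023 per GB-month (an assumption; real S3 pricing is tiered and varies by region), decimal units (1 TB = 1,000 GB), and no lifecycle policy, so nothing is ever deleted:

```python
def monthly_storage_bill(days_elapsed, tb_per_day=5.0, price_per_gb_month=0.023):
    """Monthly bill for everything accumulated so far, with no deletions."""
    stored_gb = days_elapsed * tb_per_day * 1000
    return stored_gb * price_per_gb_month


for days in (30, 150, 365):
    print(f"after {days:>3} days: ${monthly_storage_bill(days):,.0f}/month")
```

Under these assumptions the monthly bill is about $3,450 after the first 30 days and reaches about $17,250 after roughly 150 days of accumulated uploads. The bill keeps rising by the same amount every month until a lifecycle policy starts expiring or archiving old files.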
5. Thinking in Layers of Abstraction
The ability to fluidly move between layers of abstraction (zooming out to see the architecture, zooming in to see the implementation, and knowing which layer matters for the question at hand) is one of the most reliable markers of engineering seniority. Junior engineers get stuck at one layer. Senior engineers shift between them effortlessly, like adjusting the zoom on a map.
What Layers of Abstraction Mean in Software
- Transistors and logic gates — electrical signals, binary math
- CPU instructions — registers, memory addresses, opcodes
- Operating system — processes, threads, virtual memory, file systems
- Runtime / VM — garbage collection, JIT compilation, event loop
- Language and standard library — syntax, data structures, I/O abstractions
- Framework — routing, middleware, ORM, templating
- Application code — your business logic, domain models
- API surface — the contract your service exposes to consumers
- System architecture — how services interact, data flows, infrastructure topology
- Product / business — what the user experiences, what the business needs
Zooming In and Zooming Out
- During system design — you need the 30,000-foot view
- When a bug seems to involve multiple services
- When discussing trade-offs with product or leadership
- When evaluating whether a project is worth doing at all
- During performance optimization — the bottleneck lives in specifics
- When debugging a production issue — you need the exact failure path
- When reviewing security-sensitive code — the devil is in the details
- When the abstraction is leaking — something at a lower layer is violating the assumptions of the layer above
Leaky Abstractions and Why Layers Break
- TCP abstracts away packet loss — but when packets are lost, your “reliable” connection stalls and latency spikes. The abstraction leaks.
- An ORM abstracts away SQL — but when you write a complex query through the ORM, it generates horrifically inefficient SQL. The abstraction leaks.
- A managed Kubernetes service abstracts away infrastructure — but when a node runs out of memory, your pods get OOM-killed and your “self-healing” system enters a crash loop. The abstraction leaks.
- Garbage collection abstracts away memory management — but when GC pauses cause latency spikes in your real-time system, the abstraction leaks.
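The ORM leak above often shows up as the N+1 query pattern: code that looks like a simple loop over objects quietly issues one query per row. The sketch below uses plain `sqlite3` instead of a real ORM to make the difference visible. The table layout and function names are hypothetical, and the query counter only counts the per-order lookups, not the initial query that would list the orders.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE line_items (order_id INTEGER, sku TEXT)")
    conn.executemany(
        "INSERT INTO line_items VALUES (?, ?)",
        [(1, "A"), (1, "B"), (2, "C"), (3, "D")],
    )
    return conn


def fetch_one_by_one(conn, order_ids):
    """N+1 style: one query per order, which is what lazy loading tends to emit."""
    items, queries = {}, 0
    for order_id in order_ids:
        rows = conn.execute(
            "SELECT sku FROM line_items WHERE order_id = ? ORDER BY sku", (order_id,)
        ).fetchall()
        queries += 1
        items[order_id] = [sku for (sku,) in rows]
    return items, queries


def fetch_batched(conn, order_ids):
    """Batched style: a single IN (...) query covering all orders."""
    if not order_ids:
        return {}, 0
    placeholders = ",".join("?" for _ in order_ids)
    rows = conn.execute(
        f"SELECT order_id, sku FROM line_items "
        f"WHERE order_id IN ({placeholders}) ORDER BY order_id, sku",
        list(order_ids),
    ).fetchall()
    items = {order_id: [] for order_id in order_ids}
    for order_id, sku in rows:
        items[order_id].append(sku)
    return items, 1
```

Both functions return the same data. The difference is the number of round trips, which grows with the number of orders in the first version and stays constant in the second.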
Abstraction Layers in Communication
`getOrderDetails` resolver. Each order fetches its line items individually instead of batching. I am going to add a DataLoader to batch and deduplicate the queries.”
Talking to your engineering manager (mid-level):
“The order details page has a performance issue caused by inefficient database access patterns. I have identified the root cause and the fix is straightforward, about half a day of work. No user impact yet, but it will become a problem as order sizes grow.”
Talking to a VP or product leader (zoom out):
“The order page is fast today but will slow down as we onboard larger customers. I am fixing it proactively: half a day of work, no feature impact.”
Same problem, three different layers of abstraction. The ability to shift between them is what makes an engineer effective beyond just writing code.
Interview Questions — Thinking in Layers
- Browser parses the URL, checks its local cache (application layer)
- DNS resolution: browser cache, OS cache, recursive resolver, authoritative nameserver (name resolution layer)
- TCP handshake — SYN, SYN-ACK, ACK. If HTTPS, TLS handshake on top (transport layer)
- HTTP request sent to the server (application protocol layer)
- Server-side: load balancer routes to an application server, which processes the request (infrastructure layer)
- Application logic executes — reads from database, applies business rules, renders a response (application layer)
- HTTP response sent back, browser parses HTML, fetches CSS/JS/images (rendering layer)
- Browser constructs the DOM, applies styles, executes JavaScript, paints the screen (browser engine layer)
- What is the failure mode at each layer? DNS failure returns NXDOMAIN and the user sees “site not found.” TCP failure means the connection hangs for the OS timeout (usually 30-75 seconds) — the worst user experience because there is no feedback. TLS failure shows a browser security warning. HTTP 5xx shows an error page. Application bugs return wrong data with a 200 status — the most dangerous failure because it is silent.
- What is the security layer you did not mention? TLS certificate validation. If the certificate is expired, self-signed, or does not match the domain, the browser blocks the connection. This is the layer that protects against man-in-the-middle attacks. A surprising number of production outages are caused by expired TLS certificates — an artifact that should be monitored.
- What would you measure end-to-end? Real User Monitoring (RUM) that captures DNS lookup time, TCP connection time, TLS handshake time, time-to-first-byte, and time-to-interactive. Most teams only measure time-to-first-byte from the server’s perspective, missing everything that happens before the request reaches their infrastructure.
- Network layer: Are other services on the same network also slow? Check ping times, packet loss.
- Infrastructure layer: Is the host resource-constrained? Check CPU, memory, disk I/O.
- Runtime layer: Is garbage collection pausing? Are threads exhausted?
- Application layer: Did a recent deploy change anything? Are specific endpoints slow or all of them?
- Data layer: Is the database slow? Check slow query logs, connection pool saturation.
- Dependency layer: Is an external API timing out, causing our requests to back up?
I would use distributed tracing to see exactly where time is being spent in a request, which immediately tells me which layer to investigate.
- What would you do first in production? Open the distributed tracing dashboard, find a single slow request, and look at the span breakdown. If 90% of the time is in one span, that is the layer. This takes 60 seconds and eliminates guessing. If you do not have distributed tracing, that is the real problem to solve after the incident.
- What artifact should exist before this happens? A “Service Latency Investigation” runbook with a decision tree: “If all endpoints are slow, check infrastructure. If one endpoint is slow, check its specific dependencies. If the database is the bottleneck, check these 3 things. If an external dependency is the bottleneck, check these 3 things.” The runbook saves 15 minutes per investigation for every engineer on the team.
- How would you use AI-assisted tooling? Export the last hour of distributed traces for slow requests (p99 > 2s) and feed them to an LLM: “Analyze these trace spans and identify the common bottleneck layer and specific operation causing latency.” LLMs are excellent at finding patterns across 50 trace files that a human would take an hour to manually correlate. But always verify the identified span against the actual code path.
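The "find the span that dominates the request" check can be sketched in a few lines. This is a minimal illustration, assuming flat, non-overlapping spans; the span names and durations are hypothetical and do not come from any particular tracing backend.

```python
from typing import Optional


def dominant_span(spans: dict[str, float], threshold: float = 0.9) -> Optional[str]:
    """Return the span that accounts for at least `threshold` of total time, else None."""
    total = sum(spans.values())
    if total <= 0:
        return None
    name, duration = max(spans.items(), key=lambda item: item[1])
    return name if duration / total >= threshold else None


# Hypothetical span durations (milliseconds) for one slow request.
slow_request = {"auth": 12.0, "db.query": 1840.0, "render": 48.0}
print(dominant_span(slow_request))
```

If no single span crosses the threshold, the function returns None, which is itself a signal: the latency is spread across layers rather than concentrated in one.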
ss, dmesg, iostat), they start investigating instead of guessing.
Q: Distributed tracing sounds great, but we have basic logs. How do I convince leadership to invest in it?
A: I would quantify the cost of not having it. Take three recent production incidents and estimate how much engineer time was spent correlating logs across services to locate the slow hop. Even at a conservative 3 hours per incident, a team with 10 incidents a quarter loses 120 engineer-hours to manual correlation per year. Against that, Honeycomb, Datadog APM, or open-source Jaeger + Tempo typically pay for themselves inside a quarter. The pitch is not “tracing is modern” — it is “we are spending $X/year in hidden engineer time that tracing would save.”
Q: When should you deliberately leak an abstraction?
A: When hiding the layer below costs more than exposing it. For example, databases deliberately expose transaction isolation levels because the cost of the wrong isolation level (lost updates, phantom reads) is far worse than the cost of making the developer think about it. Similarly, HTTP status codes leak server state intentionally because “something went wrong” is useless — you need to know if it is a 4xx (your fault) or 5xx (our fault). Good API design chooses which parts of the lower layer must remain visible.
- Joel Spolsky — “The Law of Leaky Abstractions” (joelonsoftware.com) — the essay that named the pattern.
- Cloudflare Blog — “Details of the Cloudflare outage on July 2, 2019” — multi-layer failure postmortem.
- OpenTelemetry documentation (opentelemetry.io) — the standard vocabulary and SDK for distributed tracing across services.
6. Debugging Mindset
Debugging is not a mystical art. It is the scientific method applied to software. The best debuggers are methodical, not lucky.
The Scientific Method for Debugging
Observe
Hypothesize
Test
"What Changed?" — The First Question
"What Changed?" — The First Question
- A deploy went out
- A config was updated
- Traffic patterns shifted
- A dependency released a new version
- A certificate expired
- A cloud provider had an incident
- Recent deployments (git log, deploy dashboard)
- Configuration changes (feature flags, environment variables)
- Infrastructure changes (scaling events, cloud provider status)
- Dependency updates (package lock file changes)
- External factors (traffic spike, time-based event like daylight saving time or month-end batch job)
Bisection Strategy
- Disable half the middleware. Problem persists? It is in the other half. Problem gone? It is in the disabled half.
- Route traffic to half the servers. If one set has errors and the other does not, the problem is environmental (host-specific).
- Comment out half the configuration. Narrow down which config block is causing the issue.
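The halving idea behind these steps is a binary search over an ordered list of changes. The sketch below assumes the first item is known good, the last is known bad, and that the problem persists once introduced (the same assumption git bisect relies on). The `is_bad` callback stands in for whatever reproduction check you have.

```python
from typing import Callable, Sequence


def first_bad(items: Sequence[str], is_bad: Callable[[int], bool]) -> tuple[int, int]:
    """Return (index of the first bad item, number of checks performed)."""
    good, bad = 0, len(items) - 1  # items[good] is known good, items[bad] known bad
    steps = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        steps += 1
        if is_bad(mid):
            bad = mid   # the regression is at mid or earlier
        else:
            good = mid  # the regression is after mid
    return bad, steps


# Hypothetical history of 200 commits where commit 137 introduced the bug.
commits = [f"c{i}" for i in range(200)]
index, steps = first_bad(commits, lambda i: i >= 137)
print(commits[index], steps)
```

Each check halves the remaining range, so 200 candidates need at most 8 checks. The result is only as trustworthy as `is_bad`: a flaky check can send the search down the wrong half.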
Reading Error Messages (Seriously)
- Read the entire error message. Not just the first line. Stack traces, context fields, and “caused by” chains contain the actual answer.
- Read it literally. “Connection refused on port 5432” means nothing is listening on that port. Not “the database is slow” — it is not running or not reachable.
- Check the line number. Most error messages tell you exactly where the problem is.
- Decode the error code. HTTP 429 is not “server error” — it is rate limiting. HTTP 503 is not “it’s broken” — the server is explicitly telling you it is temporarily unable to handle the request, typically because it is overloaded or down for maintenance.
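As one illustration of reading the whole message, Python exceptions carry their "caused by" chain on `__cause__` and `__context__`. The sketch below walks that chain so the root cause is printed alongside the top-level error. The `load_orders` function and its messages are hypothetical.

```python
def cause_chain(exc: BaseException) -> list[str]:
    """Collect messages from the outermost exception down to the root cause."""
    messages = []
    current = exc
    while current is not None:
        messages.append(f"{type(current).__name__}: {current}")
        # __cause__ is set by `raise ... from ...`; fall back to the implicit context.
        current = current.__cause__ or current.__context__
    return messages


def load_orders():
    try:
        raise ConnectionRefusedError("connection refused on port 5432")
    except ConnectionRefusedError as err:
        raise RuntimeError("could not load orders") from err


try:
    load_orders()
except RuntimeError as exc:
    chain = cause_chain(exc)
    print("\n".join(chain))
```

Reading only the first line here ("could not load orders") points at application code. The last line of the chain says nothing is listening on port 5432.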
Rubber Duck Debugging
- Forces sequential reasoning. Your brain can hold contradictory beliefs simultaneously. Speaking forces you to linearize your thoughts, exposing contradictions.
- Activates different cognitive pathways. Reading code silently uses visual processing. Explaining it aloud engages verbal and auditory processing, sometimes revealing what the visual path missed.
- Exposes assumptions. When you say “this variable is always positive,” you sometimes immediately realize — wait, is it? What if the input is negative?
Interview Questions — Debugging Mindset
- Define “slow” — which pages/endpoints? For all users or some? Since when?
- Check metrics — p50, p95, p99 latency. Is it a general degradation or tail latency?
- Ask “what changed?” — recent deploys, config changes, traffic patterns.
- Check infrastructure — CPU, memory, disk I/O, network. Is any resource saturated?
- Trace a slow request end-to-end — where is time being spent? Database? External API? Application code?
- Form a hypothesis based on the data and test it.
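To make the p50/p95/p99 step concrete, here is a small sketch using the nearest-rank percentile definition (one of several common definitions; monitoring tools may interpolate differently). The latency samples are made up to show a tail-latency pattern.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical latencies in ms: 95 fast requests and 5 slow outliers.
latencies = [100.0] * 95 + [3000.0] * 5

summary = {p: percentile(latencies, p) for p in (50, 95, 99)}
print(summary)
```

Here p50 and p95 look healthy while p99 is 30 times worse, which points to tail latency affecting a small share of requests rather than general degradation. An average would hide that distinction.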
- OS, language version, dependency versions (check lock files)
- Environment variables, config files
- Timing-dependent code (flaky tests often involve race conditions or time zones)
- File system differences (case sensitivity, temp directory paths)
- Network access (CI may not reach external services)
- State leakage from other tests (test execution order may differ)
- What artifact prevents this class of bug from recurring? A CI environment parity checklist in the repo’s CONTRIBUTING.md. It documents: required environment variables, expected OS behavior (case sensitivity, line endings), network access assumptions, and test isolation requirements. When a new environment-specific flake is found, it gets added to the checklist.
- How would you use AI-assisted tooling here? Paste the CI failure log into an LLM with the prompt: “Compare this CI environment failure against these local test results. What environmental differences could explain the discrepancy?” LLMs excel at pattern-matching across log outputs that humans skim too quickly. But verify every hypothesis the LLM suggests — it will confidently propose plausible-sounding causes that are wrong 30% of the time.
git bisect automates it: you mark a known-good and known-bad commit, then test the midpoint, halving the search space each round.
Use it naturally: “The regression appeared somewhere in the last 200 commits, so I kicked off git bisect with a smoke test script; it found the offending commit in 8 steps.”
Warning: Bisection only works if you have a reliable reproduction. If the bug is flaky, each test step gives noisy results and bisect points at the wrong commit.
- Brendan Gregg — “Systems Performance” and the USE Method (brendangregg.com/usemethod.html) — the canonical framework for investigating system-level performance issues.
- GitHub Engineering — “October 21 post-incident analysis” (github.blog) — a model postmortem walking through the full debugging arc of a 24-hour incident.
- Julia Evans — “Debugging Manifesto” (jvns.ca) — a practitioner’s guide to the mindset and tooling of effective debugging.
7. Growth Mindset for Engineers
Technical skill is necessary but not sufficient. The engineers who progress fastest are the ones who deliberately invest in how they learn, not just what they learn.
T-Shaped Skills
- You are the go-to person for this area on your team.
- You understand not just how to use the tools, but how they work internally.
- You can debug problems in this area that others cannot.
- Examples: distributed systems, frontend performance, database internals, ML infrastructure.
- You can read and understand code in languages you do not primarily write.
- You can have intelligent conversations about areas outside your specialty.
- You can identify when a problem falls outside your expertise and know who to ask.
- Examples: basic understanding of networking, security fundamentals, business domain knowledge, product thinking.
- Go deep by working on the hardest problems in your area, reading source code, and writing technical deep-dives.
- Go broad by rotating across teams, reading architecture docs, attending cross-team design reviews, and working on side projects in unfamiliar areas.
Learning from Production Incidents
Timeline
Root Cause Analysis
What Went Well
What Could Be Improved
Reading Other People's Code
- Open source libraries you use daily. Read the Express.js source to understand middleware. Read React’s reconciliation algorithm. You will become dramatically better at using these tools.
- Code from senior engineers on your team. Notice their patterns, naming conventions, how they structure error handling, and how they write tests.
- Code in languages you do not know. Expands your mental model of what is possible. A Python developer reading Go learns to think about concurrency differently.
- Rejected pull requests and design docs. Understanding why something was NOT done teaches you as much as understanding why something was done.
Open Source as Professional Development
- Code review from world-class engineers. Maintainers of popular projects give detailed, high-quality feedback.
- Reading unfamiliar codebases. Forces you to develop code navigation and comprehension skills.
- Writing for a broad audience. Your code must be understandable to strangers, which improves your clarity.
- Public portfolio. Contributions are visible proof of your skills.
- Fix documentation or typos (low barrier, high value to maintainers).
- Tackle issues labeled “good first issue.”
- Add tests for uncovered code paths.
- Graduate to bug fixes and small features.
"I Don't Know Yet" vs "I Don't Know"
"I Don't Know Yet" vs "I Don't Know"
- When asked something you do not know, say: “I haven’t worked with that directly, but here is how I would approach learning it…” Then describe your learning process.
- When facing an unfamiliar problem, say: “I don’t have experience with this specific situation, but based on what I know about [related area], I would start by…”
Interview Questions — Growth Mindset
- Describe a real decision and the reasoning behind it at the time.
- Explain what happened and how you discovered you were wrong.
- Focus on what you learned — both the specific technical lesson and the meta-lesson about your decision-making process.
- Show that you updated your mental model, not just fixed the immediate problem.
- What evidence would change your learning priorities? If I kept encountering the same class of production incident (e.g., Kubernetes networking issues), that is a signal my learning should shift toward container orchestration internals — even if it is not “exciting.” Production pain is the most honest curriculum advisor.
- How has AI changed your learning process? I use LLMs to compress the “awareness” phase of learning. Instead of spending 2 hours reading documentation to understand what a technology does, I spend 20 minutes in a conversational session: “Explain Raft consensus assuming I understand Paxos but have never implemented a consensus protocol.” But I never trust the LLM’s explanation as complete — I always follow up by reading the actual paper or source code for the areas that matter. AI accelerates the map; you still have to walk the territory.
- Why was the human able to make that error? Was there no validation, no guard rail, no confirmation step?
- Why did the system not detect the error before it reached production? Was there no staging environment, no canary deploy, no automated test for this scenario?
- Why did the team not recover faster? Was there no runbook, no rollback mechanism, no alert that caught the impact early?
- What artifact should every postmortem produce? At minimum: (1) a specific, assigned, time-bound action item that addresses the systemic gap, (2) an update to the relevant runbook, and (3) a new or improved alert that would have caught this incident earlier. If the postmortem does not produce all three, it was a catharsis exercise, not a learning exercise.
- What is the failure mode of postmortem culture itself? Action item fatigue. If postmortems produce 8 action items and only 2 get completed, the team learns that postmortems are performative. Track postmortem action item completion rate as a meta-metric. If it drops below 70%, reduce the number of action items per postmortem rather than letting them pile up unfinished.
- What would you do first after this postmortem? Schedule a 30-minute session where the on-call engineer walks through the incident timeline and identifies the single moment where better tooling, documentation, or automation would have cut the time-to-resolution the most. That single improvement is more valuable than a list of 10 action items.
- Carol Dweck — “Mindset: The New Psychology of Success” — the original research on fixed vs growth mindsets and why the distinction matters for expertise.
- John Allspaw — “Blameless PostMortems and a Just Culture” (Etsy Code as Craft blog) — the essay that shaped how modern tech companies run postmortems.
- Will Larson — “Staff Engineer: Leadership Beyond the Management Track” (staffeng.com) — a practical guide to the learning curve from senior to staff.
8. AI-Assisted Engineering — Using LLMs as a Thinking Partner
The engineering mindset in 2024+ includes knowing when and how to use AI tools — LLMs, code assistants, and AI-powered debugging — as amplifiers for your thinking, not replacements for it. The engineers who use AI most effectively are the ones who already think clearly; AI makes their clear thinking faster.
Where AI Accelerates Engineering Thinking
Where AI Fails -- And the Mindset Traps
The AI-Augmented Engineering Workflow
- Prompt with constraints. Do not ask “write a retry function.” Ask “write a retry function with exponential backoff, jitter, a maximum of 5 attempts, and circuit breaker integration that matches our existing CircuitBreaker class in src/resilience/.” Constraints eliminate the most common failure modes.
- Read every line. AI-generated code that you do not read is worse than code you wrote yourself — at least your own code reflects your mental model. Unread AI code is someone else’s assumptions embedded in your system.
- Test it yourself. Do not trust “this code should work.” Run it. Write a test. The 5 minutes you spend verifying saves the 2 hours you would spend debugging a subtle AI-introduced bug in production.
- Use AI for the artifact layer. Where AI truly shines in the engineering mindset: generating first-draft ADRs, runbooks, postmortem templates, and dashboard configurations. These artifacts (Section 9) are high-value but often skipped because of the writing effort. AI removes the effort barrier.
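For reference when reviewing what an assistant produces for the constrained prompt above, here is a minimal sketch of retry with exponential backoff, full jitter, and a 5-attempt cap. The circuit breaker integration is deliberately left out because it depends on a project-specific class. The function name, parameters, and default delays are illustrative assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    operation: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.1,
    max_delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on any Exception with full-jitter exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Upper bound doubles each attempt, capped at max_delay.
            cap = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Full jitter: wait a random amount between 0 and the cap.
            sleep(random.uniform(0, cap))


# Hypothetical flaky operation that succeeds on the third call.
calls = {"n": 0}

def flaky() -> str:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

delays: list[float] = []
result = retry_with_backoff(flaky, sleep=delays.append)
```

Injecting `sleep` keeps the example testable without real waiting. Production code would usually retry only on errors known to be transient instead of on every Exception.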
Interview Questions -- AI-Assisted Engineering
- What evidence would make you trust AI more for critical code paths? If AI tools developed verifiable formal proofs alongside their generated code — not just “this compiles” but “this satisfies these correctness properties” — I would trust them for more sensitive work. Until then, AI is a first-draft tool, not a final-draft tool.
- How do you handle an engineer on your team who uses AI to write code they do not understand? The same way I handle any code review where the author cannot explain their code: I reject it. “Walk me through what this does” is the test. If the answer is “the AI generated it and the tests pass,” that is a red flag. Tests pass for wrong code all the time.
- What evidence would change your position? If AI systems could reliably explain why their generated code works — not just produce it, but trace the reasoning through system constraints, failure modes, and production considerations — I would trust them more for critical paths. Current LLMs generate plausible code; they do not reason about the production context that code will run in.
- What is the concrete engineering moment where this matters most? Schema migrations. An LLM can generate a migration script in 30 seconds. But it does not know that your orders table has 80 million rows, that adding a column with a non-null default rewrites the table and can lock it for 15 minutes in PostgreSQL versions before 11, or that your SLA requires zero downtime. The engineer who blindly runs the AI-generated migration takes down production. The engineer who understands the underlying storage engine knows to use ALTER TABLE ... ADD COLUMN ... DEFAULT NULL followed by a backfill — and the LLM often will not mention this unless specifically prompted.
- What is the artifact-thinking angle? AI is most transformative for the artifacts that engineers should produce but do not because of writing friction: ADRs, runbooks, postmortem drafts, onboarding documentation. An LLM that generates a first-draft runbook from your service’s code and configuration saves 2 hours and produces a document that would otherwise never exist. The engineering mindset shift is: use AI to eliminate the excuses for missing artifacts.
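The "add a nullable column, then backfill" pattern from the schema migration answer can be sketched as follows. SQLite is used only as a stand-in so the example runs anywhere; it does not reproduce PostgreSQL locking behavior. The `currency` column, the 'USD' value, and the batch size are hypothetical.

```python
import sqlite3


def backfill_in_batches(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Fill NULL currency values in small batches; return the number of batches applied."""
    batches = 0
    while True:
        cursor = conn.execute(
            "UPDATE orders SET currency = 'USD' "
            "WHERE id IN (SELECT id FROM orders WHERE currency IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch keeps locks brief
        if cursor.rowcount == 0:
            return batches
        batches += 1


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders (total) VALUES (?)", [(1.0,)] * 2500)

# Step 1: add the column as nullable so existing rows are not rewritten up front.
conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT DEFAULT NULL")
# Step 2: backfill existing rows in batches.
batches = backfill_in_batches(conn, batch_size=1000)
remaining = conn.execute(
    "SELECT COUNT(*) FROM orders WHERE currency IS NULL"
).fetchone()[0]
print(batches, remaining)
```

On a real 80-million-row table you would also throttle between batches and watch replication lag. Those details depend on your database and are not shown here.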
- Check for hardcoded secrets, API keys, or credentials that the LLM might have hallucinated from training data
- Look for common vulnerability patterns: SQL injection via string concatenation, XSS through unescaped user input, insecure deserialization, overly permissive CORS headers
- Verify that authentication and authorization checks are present on every endpoint — LLMs frequently generate functional code that skips authz because the prompt did not mention it
- Run the code through SAST (static analysis security testing) tools before committing
- What is the rollout consideration? AI-generated security-sensitive code should go through the same review process as any security-critical change: a second engineer reviews it, ideally someone with security expertise. The speed advantage of AI generation is in the first draft, not in skipping review.
- What would you measure? Track the ratio of security findings in AI-generated code versus human-written code in your SAST reports. If AI-generated code has a higher vulnerability density, tighten the review process. If it is comparable or lower, you can calibrate trust accordingly.
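One pattern from the review checklist above, SQL injection via string concatenation, is easy to demonstrate. The sketch uses an in-memory SQLite database with a hypothetical `users` table. The unsafe function exists only to show what to look for in generated code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])


def find_user_unsafe(name: str) -> list[tuple]:
    # Vulnerable: user input is concatenated into the SQL text.
    query = "SELECT id, name FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_user_safe(name: str) -> list[tuple]:
    # Parameterized: the value is passed separately from the SQL text.
    return conn.execute("SELECT id, name FROM users WHERE name = ?", (name,)).fetchall()


payload = "x' OR '1'='1"
leaked = find_user_unsafe(payload)  # the injected OR clause matches every row
blocked = find_user_safe(payload)   # treated as a literal name, matches nothing
print(len(leaked), len(blocked))
```

Both functions return the same result for ordinary input, which is why this class of bug survives tests that only cover the happy path.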
setTimeoutMs() method on the HTTP client that does not exist; I caught it because the test failed, but a junior might have added the import and shipped it.”
Warning: Do not use “hallucination” as a hand-wave for all LLM errors. Reserve it for invented facts. Logic errors, context misses, and prompt misunderstandings are different failure modes with different mitigations.
- Simon Willison’s blog (simonwillison.net) — ongoing, practitioner-grade writing on LLM capabilities, prompt injection, and engineering applications.
- OWASP — “Top 10 for Large Language Model Applications” (owasp.org) — the authoritative security checklist for LLM-integrated systems.
- Anthropic and OpenAI official prompt engineering guides — the source documents, not the wrapper articles.
9. Decision-Making Under Uncertainty
Real engineering happens under uncertainty. Requirements are incomplete. Timelines are tight. You will never have enough information to be 100% confident. The best engineers make good decisions anyway.
Perfect Is the Enemy of Good
- You learn from real users, not hypothetical ones.
- Requirements change. Your “perfect” solution may solve the wrong problem.
- Speed compounds. Faster iterations mean faster learning, which means a better product sooner.
- The core functionality works correctly.
- Edge cases are handled gracefully (even if not optimally).
- The code is clean enough to iterate on.
- You have observability to detect problems quickly.
Time-Boxing Exploration
- Prevents analysis paralysis. Without a deadline, investigation expands to fill all available time.
- Forces prioritization. You focus on the highest-signal questions first.
- Makes “good enough” information acceptable. You stop seeking certainty and start seeking sufficiency.
- Creates a forcing function for group decisions. Everyone knows the decision point is coming.
- Choosing between technologies: 2 hours of research, then decide.
- Investigating a production issue: 30 minutes of focused debugging, then escalate if unresolved.
- Designing a new feature: 1-day spike to prototype the riskiest part, then review as a team.
Decision Journals
- The decision and its context.
- The options you considered.
- The trade-offs you weighed.
- What you expected to happen.
- The confidence level (low/medium/high).
- Defeats hindsight bias. Six months later, you can review what you actually thought at the time, not what you think you thought.
- Accelerates learning. Comparing predictions to outcomes reveals systematic biases in your decision-making.
- Improves team knowledge transfer. New team members can understand why the system is the way it is by reading the decision log.
When to Ask for Help vs Push Through
| Situation | Action |
|---|---|
| Stuck for < 30 minutes | Keep pushing. Try different approaches. |
| Stuck for 30-60 minutes | Rubber duck it. Write out the problem. Search more broadly. |
| Stuck for > 60 minutes with no new leads | Ask for help. |
| Blocked by something outside your access | Ask immediately (permissions, credentials, domain knowledge). |
| Making a one-way-door decision | Seek input proactively, even if you are not stuck. |
- State what you are trying to do.
- State what you have already tried.
- State your current best hypothesis.
- Ask a specific question, not “this doesn’t work.”
The "Two-Pizza Team" Decision Authority
Interview Questions — Decision-Making
- Define the evaluation criteria based on our specific requirements (not generic benchmarks).
- Time-box a spike: build a small prototype with each option, focused on the riskiest aspect.
- Consult with team members who have experience with either technology.
- Document the decision (ADR) including context, options considered, and trade-offs.
- Choose the option that is the best fit for our constraints, with a preference for the more reversible choice if the options are close.
- First, ensure both sides have clearly articulated their reasoning (not just preferences).
- Identify the specific criteria where they disagree and try to get data on those points.
- If the decision is reversible, choose one and set a review date (“Let’s try A for 2 sprints and evaluate”).
- If irreversible, invest more time: prototype both, or bring in an outside perspective.
- Avoid design-by-committee (merging both solutions into a Frankenstein). Pick one coherent approach.
- The worst outcome is not picking the “wrong” solution — it is not deciding at all.
- What artifact resolves this structurally? An RFC with a formal “Decision” section where one approach is selected and the “Consequences” section explicitly states what the rejected approach would have provided. This prevents relitigating the decision 3 months later when someone forgets why option B was rejected.
- What evidence would change your mind after committing? Define specific, measurable criteria before implementation begins. “If approach A’s p99 latency exceeds 200ms after 30 days in production, we revisit.” Without pre-committed criteria, teams either never revisit (sunk cost bias) or revisit every time someone is frustrated (decision instability).
- What is the rollback plan for the halfway deploy? It depends on what kind of deploy. For a stateless service with blue-green deployment, roll back by shifting traffic to the old fleet — 30 seconds, zero risk. For a database migration that is halfway through, rolling back may be impossible or more dangerous than rolling forward. Know your deploy type before deciding.
- What artifact prevents this chaos next time? An incident response playbook that separates roles: Incident Commander (triage and communication), Investigator (root cause), and Mitigator (stopping the bleeding). When the same person does all three, nothing gets done well.
- What is the cost dimension people miss in incidents? Engineering hours. A 90-minute incident with 4 engineers responding costs 6 engineer-hours in direct response time, plus another 8-10 hours in postmortem, follow-up action items, and context-switching recovery. That is 2 full engineering days for one incident. When leadership asks “why is the roadmap slipping,” this is often the hidden answer.
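The incident-cost arithmetic above is worth writing down, because it is the number you will quote to leadership. A minimal sketch, assuming an 8-hour engineering day (the day length is an assumption; the other figures come from the answer above):

```python
def incident_engineer_hours(duration_minutes: float, responders: int, follow_up_hours: float) -> float:
    """Direct response time plus follow-up work, in engineer-hours."""
    return duration_minutes / 60 * responders + follow_up_hours


low = incident_engineer_hours(90, 4, 8)    # 6 direct hours + 8 follow-up hours
high = incident_engineer_hours(90, 4, 10)  # 6 direct hours + 10 follow-up hours
print(low, high)          # engineer-hours
print(low / 8, high / 8)  # engineering days at 8 hours per day
```

The range of 14 to 16 engineer-hours is where the "2 full engineering days" figure comes from.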
- Daniel Kahneman — “Thinking, Fast and Slow” — the foundational work on how cognitive biases distort decision-making, and why structured frameworks matter under uncertainty.
- Annie Duke — “Thinking in Bets” — decision-making as probabilistic reasoning rather than outcome optimization, written by a former pro poker player.
- Jeff Bezos 2015 and 2016 Shareholder Letters (aboutamazon.com) — the 2015 letter introduced the one-way/two-way door framing; the 2016 letter introduced the “disagree and commit” principle.
9. Artifact Thinking — Engineering Outputs Beyond Code
The best engineers do not just write code. They produce artifacts — documents, dashboards, runbooks, RFCs, postmortems, and decision records — that amplify their impact beyond the immediate task. Code solves today’s problem. Artifacts prevent tomorrow’s.
The Artifact Hierarchy
| Artifact | When to Create | Who Benefits | Lifespan |
|---|---|---|---|
| Runbook | Before going on-call for a new service | On-call engineers at 3 AM | Years (if maintained) |
| ADR (Architecture Decision Record) | Before or immediately after any one-way-door decision | Future engineers asking “why?” | Permanent |
| RFC (Request for Comments) | Before building anything that affects multiple teams | Cross-team alignment, future hires | Months to years |
| Postmortem | After every significant incident | The entire org (pattern learning) | Permanent |
| Dashboard | When a service goes to production | Daily operations, incident response | Evolves continuously |
| Operational Runbook | When a new failure mode is discovered | On-call, SREs, new team members | Updated per incident |
| Decision Journal | During any ambiguous technical choice | Your future self | Personal, indefinite |
ADRs: Making Decisions Discoverable
- Title: A short descriptive name — “Use PostgreSQL over Kafka for event storage”
- Status: Proposed, Accepted, Superseded, Deprecated
- Context: What situation prompted this decision? What constraints exist?
- Decision: What did we choose and why?
- Consequences: What trade-offs did we accept? What becomes easier? What becomes harder?
- Trigger conditions for revisiting: Under what circumstances should this decision be re-evaluated?
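The fields above map naturally onto a small record type. The sketch below is one possible shape, not a standard ADR schema. The field names mirror the list, and the example context, consequences, and trigger are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ADR:
    title: str
    status: str  # Proposed, Accepted, Superseded, or Deprecated
    context: str
    decision: str
    consequences: list[str]
    revisit_when: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the record as plain text, one field per section."""
        lines = [
            f"Title: {self.title}",
            f"Status: {self.status}",
            f"Context: {self.context}",
            f"Decision: {self.decision}",
            "Consequences:",
            *[f"- {item}" for item in self.consequences],
            "Trigger conditions for revisiting:",
            *[f"- {item}" for item in self.revisit_when],
        ]
        return "\n".join(lines)

    def missing_fields(self) -> list[str]:
        """Names of fields left empty, usable as a review checklist."""
        return [name for name, value in vars(self).items() if not value]


adr = ADR(
    title="Use PostgreSQL over Kafka for event storage",
    status="Accepted",
    context="Hypothetical: modest event volume and a small team with no Kafka experience.",
    decision="Store events in a PostgreSQL table.",
    consequences=["One less system to operate", "Harder to fan out to many consumers"],
    revisit_when=["Event volume outgrows a single PostgreSQL instance"],
)
print(adr.render())
```

A check like `missing_fields()` makes the trigger conditions hard to skip, and they are the field most often left blank.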
Dashboards as Thinking Tools
- Latency — how long requests take (p50, p95, p99)
- Traffic — how many requests per second
- Errors — what percentage of requests fail
- Saturation — how full are your resources (CPU, memory, connection pool, disk)
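As a rough illustration of how the four signals fall out of raw request data, the sketch below computes them for one time window. It assumes a non-empty window, treats 5xx responses as errors, and uses connection pool usage as the saturation measure. The `Request` type and the sample numbers are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    status: int


def golden_signals(
    requests: list[Request], window_seconds: float, pool_in_use: int, pool_size: int
) -> dict[str, float]:
    """Compute latency (p99, nearest rank), traffic, error rate, and saturation for one window."""
    ordered = sorted(r.latency_ms for r in requests)
    p99_rank = max(1, math.ceil(99 * len(ordered) / 100))
    failures = sum(1 for r in requests if r.status >= 500)
    return {
        "latency_p99_ms": ordered[p99_rank - 1],
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": failures / len(requests),
        "saturation": pool_in_use / pool_size,
    }


# Hypothetical 10-second window: 97 fast successes and 3 slow 503s.
window = [Request(40.0, 200)] * 97 + [Request(900.0, 503)] * 3
signals = golden_signals(window, window_seconds=10, pool_in_use=18, pool_size=20)
print(signals)
```

Even in this small sample the signals explain each other: the pool is 90% full, and the few requests that fail are also the slow ones.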
- Checkout completion rate, not just API success rate
- User login success rate, not just auth service uptime
- Search result relevance (click-through rate on first result), not just search latency
Postmortems: Learning That Compounds
- Action items say “improve testing” without specifying what test, who writes it, and by when
- The same root cause appears in postmortems 6 months apart
- Postmortems are only written for P1 incidents, missing the near-misses that teach just as much
- Nobody reads old postmortems — they are write-only documents
- Action items are specific, assigned, and time-bound: “Add integration test for partial refund edge case — Alice — by March 15”
- New engineers are assigned to read the last 10 postmortems during onboarding
- Near-misses get “mini postmortems” — a 15-minute write-up, not a 2-hour meeting
- A quarterly review checks which postmortem action items actually got completed
Interview Questions -- Artifact Thinking
- A dashboard with the four golden signals plus at least one business-outcome metric
- A runbook covering: what the service does, its dependencies, what healthy looks like, known failure modes, and mitigation steps
- Alerting rules tied to actionable thresholds (not just “CPU > 80%” but “connection pool utilization > 90% for > 2 minutes, which historically precedes an outage”)
- An ADR documenting the key design decisions, especially the ones where we chose simplicity over completeness, with trigger conditions for when to revisit
- What evidence would tell you the dashboard is being used? Check Grafana access logs or add a simple counter. If nobody has viewed the dashboard in 30 days, it is not serving its purpose — either the panels are wrong or the team does not know it exists.
- What is the governance angle? For services handling PII or financial data, the artifact list includes a data flow diagram showing where sensitive data enters, transits, and rests. Compliance teams need this during audits, and building it at service creation time is 10x cheaper than reconstructing it during an audit crunch.
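The alerting example above ("connection pool utilization > 90% for > 2 minutes") differs from a plain threshold in that it requires a sustained breach. A minimal sketch of that check, assuming time-ordered samples of (timestamp in seconds, utilization); the sample data is made up:

```python
def sustained_breach(
    samples: list[tuple[float, float]], threshold: float, min_duration_s: float
) -> bool:
    """True if the value stayed above `threshold` continuously for more than `min_duration_s`."""
    breach_start = None
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # start of a new breach streak
            if timestamp - breach_start > min_duration_s:
                return True
        else:
            breach_start = None  # streak broken: reset
    return False


# Hypothetical samples every 30 seconds.
brief_spike = [(0, 0.95), (30, 0.50), (60, 0.95)]
sustained = [(t, 0.93) for t in range(0, 151, 30)]

print(sustained_breach(brief_spike, threshold=0.9, min_duration_s=120))
print(sustained_breach(sustained, threshold=0.9, min_duration_s=120))
```

The duration requirement is what keeps the alert actionable: a brief spike that recovers on its own does not page anyone.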
- The change affects more than one service or team — RFC required
- The change involves a database schema migration on a table with more than 1M rows — RFC required
- The change introduces a new dependency or removes an existing one — RFC required
- The change is a new endpoint on an internal service with one consumer — no RFC, just a design discussion in the PR
- Michael Nygard — “Documenting Architecture Decisions” (cognitect.com/blog/2011/11/15/documenting-architecture-decisions) — the original ADR essay.
- Google SRE Book — Chapter 6 “Monitoring Distributed Systems” (sre.google/sre-book) — the source for the four golden signals and operational observability patterns.
- Will Larson — “An Engineering-Leaders’ Guide to Runbooks” (lethain.com) — practical guidance on runbook structure and lifecycle.
10. Mental Models Every Engineer Should Know
Mental models are thinking tools. They are not always literally true, but they help you make better decisions faster by giving you a framework to reason through complex situations. Think of mental models like tools in a toolbox. A hammer is great for nails, but if that is all you have, everything looks like a nail. The engineer who only knows “scale it horizontally” will horizontally scale their way into a distributed systems nightmare when the real problem was an unindexed database query. The more models you carry, the more likely you are to reach for the right one — and the more clearly you can see when someone else is using the wrong tool for the job.
Pareto Principle (80/20 Rule)
- 80% of bugs come from 20% of the code. Focus code reviews and testing on the most complex, most-changed files.
- 80% of performance gains come from 20% of optimizations. Profile first. Optimize the hot path. Do not micro-optimize cold code.
- 80% of user value comes from 20% of features. Build the critical features well. The rest can be good enough.
- 80% of outages come from 20% of failure modes. Identify and harden against the most common failures first.
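Before acting on an 80/20 claim, it is worth checking how concentrated your own data actually is. The sketch below computes the share of bugs that comes from the top 20% of files. The file names and counts are made-up illustration data, and `top_share` is a hypothetical helper, not part of any library.

```python
def top_share(counts: dict[str, int], fraction: float = 0.2) -> float:
    """Share of the total contributed by the top `fraction` of keys."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Always look at one key or more, even for very small inputs.
    k = max(1, round(len(counts) * fraction))
    top = sorted(counts.values(), reverse=True)[:k]
    return sum(top) / total


# Hypothetical bug counts per file over the last year.
bugs_per_file = {
    "billing/invoice.py": 42,
    "auth/session.py": 31,
    "api/routes.py": 6,
    "api/serializers.py": 5,
    "jobs/export.py": 4,
    "ui/forms.py": 3,
    "ui/tables.py": 3,
    "util/dates.py": 2,
    "util/strings.py": 2,
    "config/loader.py": 2,
}

share = top_share(bugs_per_file)
print(f"Top 20% of files account for {share:.0%} of bugs")
```

If the share comes out near 20% instead, the bugs are spread evenly and the Pareto model is the wrong tool for that codebase.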
Occam's Razor
- The production outage is more likely a bad config deploy than a kernel bug.
- The API failure is more likely a network issue than a race condition.
- The “impossible” bug is more likely a wrong assumption in your mental model than a compiler error.
- A typo in a variable name
- An off-by-one error
- A null value where you assumed non-null
- A stale cache
- A missing environment variable
Hanlon's Razor
- A colleague’s “bad” code is more likely written under time pressure than incompetence.
- A broken API from a partner team is more likely an oversight than sabotage.
- A manager’s “unreasonable” deadline is more likely based on business context you do not have than disrespect for engineering.
- In code reviews, assume the author had reasons. Ask “What was the thinking behind this?” before criticizing.
- In incident response, focus on fixing the problem, not finding someone to blame.
- In cross-team interactions, assume positive intent. “Can you help me understand this decision?” works better than “Why did you break this?”
Conway's Law
- If your frontend and backend teams are separate, you will end up with a clear API boundary between them (which may be good).
- If your payment team and notification team do not talk, the payment system and notification system will not integrate well (which is bad).
- If you want microservices, you need small, autonomous teams. If you have one big team, you will build a monolith regardless of your stated architecture.
Goodhart's Law
- Code coverage as a target: Teams write meaningless tests that execute code without asserting anything, just to hit the coverage number. Coverage goes up. Quality does not.
- Lines of code as productivity: Engineers write verbose code, avoid refactoring that reduces lines, and split simple changes into multiple commits. Output goes up. Value does not.
- Story points as velocity: Teams inflate estimates. Velocity increases every sprint. Actual throughput stays the same.
- Mean Time To Resolve (MTTR) as a target: Engineers close incidents prematurely or reclassify them to lower severity. MTTR improves. Reliability does not.
Hyrum's Law
- You cannot change “internal” behavior safely at scale. Every observable behavior is a potential contract.
- Versioning is essential. Once behavior exists, the only safe way to change it is to create a new version.
- Be deliberate about what you expose. The less observable surface area your system has, the more freedom you have to change internals.
- Shadow testing is critical for migrations. Run the old and new systems in parallel and compare outputs — you will discover dependencies you did not know existed.
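A shadow comparison can be sketched in a few lines. Everything here is illustrative: `old_lookup` and `new_lookup` are hypothetical implementations, and a real shadow test would replay sampled production traffic, not a hard-coded list.

```python
from typing import Any, Callable


def shadow_compare(
    old: Callable[[Any], Any],
    new: Callable[[Any], Any],
    requests: list[Any],
) -> list[tuple[Any, Any, Any]]:
    """Run both implementations on each request and collect the differences."""
    mismatches = []
    for request in requests:
        expected = old(request)
        try:
            actual = new(request)
        except Exception as exc:  # a crash in the new path is also a mismatch
            mismatches.append((request, expected, repr(exc)))
            continue
        if actual != expected:
            mismatches.append((request, expected, actual))
    return mismatches


def old_lookup(user_ids: list[int]) -> list[int]:
    # Deduplicates. Sorting was never documented, but callers can observe it.
    return sorted(set(user_ids))


def new_lookup(user_ids: list[int]) -> list[int]:
    # Deduplicates while preserving input order: "equivalent" on paper.
    return list(dict.fromkeys(user_ids))


sample_requests = [[3, 1, 2], [1, 2, 3], [2, 2, 1]]
mismatches = shadow_compare(old_lookup, new_lookup, sample_requests)
```

The two functions satisfy the same documented contract, yet the comparison flags every input that was not already sorted. That is the undocumented dependency Hyrum's Law predicts.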
Chesterton's Fence
- That “weird” code block with no comments — before you delete it, figure out what it does. It might handle a race condition that only occurs under high concurrency. It might be a workaround for a third-party library bug. It might compensate for a quirk in how a specific browser renders a component. If you cannot figure out why it exists, that is a reason for caution, not confidence.
- That “unnecessary” configuration flag — before you simplify it away, check if a specific customer or deployment environment depends on it. Hyrum’s Law (above) guarantees that someone does.
- That “overcomplicated” deployment process — before you streamline it, ask the person who built it what failure it was designed to prevent. You may discover that the “extra” step exists because the simpler version caused a production outage two years ago.
- That “redundant” database index — before you drop it for write performance, check if it supports a critical monthly reporting query that you have never run yourself.
Dunning-Kruger Effect
- The dangerous zone is peak confidence with limited experience. A developer who has built one CRUD app and declares “microservices are easy” is at the peak of the Dunning-Kruger curve. They have not yet encountered distributed transactions, network partitions, service discovery failures, or the operational overhead of running 50 services with a 10-person team.
- The productive zone is calibrated confidence. The senior engineer who says “I have built three distributed systems and I am still nervous about this one” is not being modest — they have an accurate model of what can go wrong.
- Estimating work: Junior engineers consistently underestimate tasks because they do not know what they do not know. They estimate the happy path. Experienced engineers estimate the happy path plus error handling, plus testing, plus edge cases, plus deployment, plus documentation — and they are still often short.
- In design discussions, the loudest voice is often the most confident — and the most confident person may be the least qualified to judge complexity. Seek out the quiet engineer who says “I am not sure, but here is what worries me.” Their worry is often worth more than the confident person’s certainty.
- When interviewing, beware of candidates who have strong opinions about technologies they have barely used. Probe depth: “Tell me about a time that technology surprised you.” If they cannot name a surprise, they are likely at the peak of the curve.
Survivorship Bias
- “Netflix uses microservices, so microservices work.” Survivorship bias. You are looking at the companies that succeeded with microservices. You are not seeing the hundreds of startups that adopted microservices prematurely, drowned in operational complexity, and failed before anyone wrote a blog post about them. The survivors get conference talks. The failures get silence.
- “This architecture has been running for 3 years without issues.” Maybe it is well-designed. Or maybe your traffic has never actually stressed it. The absence of failure is not proof of resilience — it might be proof that the failure conditions have not occurred yet.
- “Our hiring process works — look at our great engineers.” You are only seeing the people you hired. What about the great engineers you rejected? You have no feedback loop on false negatives.
- “We never need feature flags — we have never had a bad deploy.” You might have been lucky, not skilled. The distinction only becomes clear when your luck runs out.
Applying Mental Models in Interviews
- Pareto Principle: 80% of notifications are email. Design that path first and make it excellent. SMS, push, and webhook can be handled in less detail.
- Conway’s Law: If the notification system will be maintained by the same team as the main app, a library is fine. If it will be owned by a separate team, it should be a separate service with a clear API contract.
- Hyrum’s Law: If we expose delivery timestamps, someone will depend on their precision. Decide upfront what guarantees we actually want to make.
- Goodhart’s Law: If we measure “notifications sent,” teams will send more notifications (spam). Measure “notifications acted on” instead.
- Occam’s Razor: Start with the simplest architecture that works (a queue and a worker). Add complexity only when you hit specific limitations.
- Chesterton’s Fence: If there is an existing notification system, understand why it was built the way it was before redesigning. That “weird” batching logic might exist because sending notifications one-at-a-time overwhelmed the email provider’s rate limit.
- Survivorship Bias: “Slack’s notification system uses X” — but you are only hearing about the architecture that survived. Ask what alternatives they tried and abandoned.
- Dunning-Kruger Effect: If the team says “notifications are easy, we’ll have it done in two weeks,” probe for experience. Have they handled delivery guarantees, retry logic, user preference management, and unsubscribe compliance before?
- What would you do first in production? Send a test notification to yourself through every channel before opening the system to real users. The number of “notification systems” that pass all automated tests but send malformed emails or silent push notifications is embarrassingly high. Manual smoke tests for notification delivery are not optional.
- What is the rollout plan? Start with a single notification type (e.g., “order confirmation”) for 1% of users. Monitor delivery rate, bounce rate, and user complaint rate for 48 hours. Then expand to 10%, then 100% for that type. Only then add additional notification types. This isolates “is the infrastructure working?” from “is the notification content correct?”
- What is the rollback plan? A feature flag that disables the notification service entirely and falls back to whatever existed before (even if “before” was “no notifications”). If the new system sends 1,000 duplicate notifications at 3 AM, you need a kill switch that works in under 60 seconds.
- What evidence would change your mental model rankings? If analytics show that 60% of notification engagement is via push, not email, the Pareto analysis shifts. Design push notification delivery first, not email. Most teams default to email because it is the easiest channel to implement, not because it is the highest-value channel. Data should override implementation convenience.
- What is the cost consideration? Third-party notification services (SendGrid, Twilio, Firebase) charge per message. At 10 million notifications per month across email, SMS, and push, costs can run as high as $50,000/month depending on the channel mix. SMS is 10-100x more expensive per message than email. The cost model should influence the architecture: if SMS costs $0.0075 per message, deduplication is not just a UX feature — it is a cost control mechanism.
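The deduplication point is simple arithmetic, sketched below. Only the $0.0075 per-message price comes from the text above. The monthly SMS volume and the duplicate rate are assumptions chosen for illustration, and real provider pricing varies by country and volume.

```python
from decimal import Decimal

SMS_PRICE = Decimal("0.0075")  # per-message price quoted in the text


def monthly_cost(messages: int, price: Decimal) -> Decimal:
    """Cost of sending `messages` at a flat per-message price."""
    return messages * price


def dedup_savings(messages: int, duplicate_rate: Decimal, price: Decimal) -> Decimal:
    """Money saved per month if duplicates are dropped before sending."""
    duplicates = int(messages * duplicate_rate)
    return duplicates * price


# Assumptions for illustration: 1M SMS per month, 8% of them duplicates.
sms_per_month = 1_000_000
duplicate_rate = Decimal("0.08")

cost = monthly_cost(sms_per_month, SMS_PRICE)
savings = dedup_savings(sms_per_month, duplicate_rate, SMS_PRICE)
print(f"SMS spend: ${cost:,.2f}/month, dedup saves ${savings:,.2f}/month")
```

`Decimal` is used instead of `float` so that fractions of a cent do not accumulate rounding error across millions of messages.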
Putting It All Together
Daily Practice Exercises — 15 Minutes to a Stronger Engineering Mind
The engineering mindset is a muscle. These five exercises, done daily, will rewire how you think about software. Each takes about three minutes. Do them during your morning coffee, on your commute, or as a warm-up before your first code review. In six weeks, you will notice a measurable difference in how you approach problems.
Exercise 1: Question One Assumption (3 minutes)
- “We need this microservice to be separate.” — Do we? What problem does the separation solve? Is the operational cost worth it at our current scale?
- “This endpoint needs to be real-time.” — Does it? Would a 5-second delay actually matter to users?
- “We cannot change this table schema.” — Why not? What would a migration actually cost?
- “Users need to see this data immediately.” — Do they? What if we showed stale data with a refresh button?
Exercise 2: Explain Something to a Rubber Duck (3 minutes)
- No hand-waving. If you say “it basically does…” — stop. What does it actually do?
- No jargon shortcuts. If you say “it uses pub/sub,” expand: “There is a publisher that sends messages to a topic. Subscribers listen on that topic and process messages independently. The publisher does not know or care who the subscribers are.”
- If you get stuck, that is the exercise working. The place where you cannot explain clearly is the place where your understanding has a gap.
Exercise 3: Read One Postmortem (3 minutes)
- Root cause: Was it a code bug, a configuration error, a process failure, or a systemic design flaw?
- Detection: How was the problem discovered? Monitoring? User reports? Sheer luck?
- Blast radius: How many users were affected? How long was the impact?
- The Five Whys: Does the postmortem go deep enough? If it stops at “an engineer deployed a bad config,” push further in your head: why was that possible?
- Action items: Are they systemic fixes or band-aids?
The repository at github.com/danluu/post-mortems collects links to public postmortems.
Exercise 4: Draw One System Diagram (3 minutes)
- Include the data stores. Where does state live?
- Include the failure points. What happens when each arrow breaks?
- Include the scale numbers. How many requests per second flow through each arrow?
- If you do not know a number, write a question mark. Those question marks are the gaps in your understanding.
Exercise 5: Ask "What Could Go Wrong?" (3 minutes)
- “What if this gets 10x the expected traffic?”
- “What if this external API starts returning errors?”
- “What if a user does this in an order we did not expect?”
- “What if this runs concurrently with itself?”
- “What if the clock skews between these two services?”
- “What if the deploy rolls out to half the fleet and then fails?”
Where This Mindset Applies — Cross-Chapter Connections
Every chapter in this guide is an application of the engineering mindset. The mental models, trade-off thinking, systems reasoning, and debugging discipline you have learned here are not separate skills — they are the foundation that makes every other topic click. Here is how this chapter connects to every other chapter in the guide, with the specific mindset tool that matters most.
Technical Foundations
Systems and Infrastructure
Security, Quality, and Operations
Professional Growth and Leadership
Cross-Cutting Interview Questions — Combining Multiple Mindset Tools
Cross-Cut: You are building a payments feature and realize the requirements are ambiguous. Three stakeholders want different things. Ship date is in 3 weeks.
- What artifact do you produce before writing code? The requirements resolution doc becomes an ADR-lite: “Payments Feature — Requirements Decisions.” It captures which interpretation was chosen, who approved it, and what the rejected alternatives were. When someone asks “why does it work this way?” in 6 months, the answer exists.
- What evidence would change your approach to the conflicts? Revenue data. If Stakeholder A’s interpretation serves 90% of payment volume and Stakeholder B’s serves 10%, that is a clear signal for prioritization that opinion-based discussions will not resolve. Always ask: “Do we have data that makes this decision for us?”
- What is the security consideration that ambiguity hides? In payments, ambiguous requirements about retry behavior can create charge-without-record or record-without-charge scenarios. Before shipping, I would verify: does every payment state transition end in a consistent state? Ambiguity in payment flows is not just a product risk — it is a financial and compliance risk.
- What would you measure after shipping? Payment success rate, reconciliation match rate (do charges in our system match charges in the provider’s system), and support ticket volume for payment-related issues. Set alerts on all three with thresholds defined before launch.
- Amazon “Working Backwards” process (aboutamazon.com) — the PR-FAQ document as a requirements-alignment artifact, written before engineering begins.
- Stripe engineering blog (stripe.com/blog/engineering) — several posts on internal documentation norms and the role of writing in decision-making.
- Patrick McKenzie — “Writing is Thinking” series on bitsaboutmoney.com and kalzumeus.com — practical guidance on why written-first processes outperform meeting-first ones.
Cross-Cut: Your team just shipped a feature that nobody is using. Product wants to iterate, engineering wants to kill it, leadership wants a postmortem. What do you do?
- I run a lightweight postmortem — not because something failed, but because learning from a feature that did not resonate is as valuable as learning from an outage. The postmortem asks: what was our assumption about user need, what did we validate before building, and what does the usage data tell us about the assumption?
- I give Product 2 weeks and a specific hypothesis to test. Not ‘iterate until it works’ — that is an open-ended investment. ‘We believe that if we move the entry point from the settings page to the main dashboard, adoption will increase to X% within 2 weeks.’ If it does not, we kill it.
- I do NOT delete the code immediately. I feature-flag it off, document the decision in an ADR, and schedule a code cleanup for next quarter. Rushing to delete code that took weeks to build creates resentment and discourages future experimentation.
- What evidence would convince you to keep iterating beyond the 2-week test? Qualitative signal. If 5% of users who discover the feature use it daily and give enthusiastic feedback, that is a distribution problem, not a product problem. Low adoption with high engagement among adopters is the signature of a feature that needs better marketing, not deletion.
- What is the cost angle nobody discusses? The ongoing maintenance cost. Even feature-flagged-off code occupies mental space during refactors, shows up in dependency audits, and creates confusion for new engineers. If the feature is dead, schedule a specific date for code removal — do not let it haunt the codebase indefinitely.
- What artifact prevents repeating this? A “Feature Experiment Template” that every new feature fills out before engineering begins: target audience, success metric, measurement plan, and kill criteria. If the team cannot fill this out, the feature is not ready for engineering investment.
- Marty Cagan — “Inspired: How to Create Tech Products Customers Love” — the canonical framework for product discovery and validation gates that would prevent this scenario.
- Eric Ries — “The Lean Startup” — the validated-learning loop and how to structure hypotheses so that failure is informative rather than demoralizing.
- Shreyas Doshi’s writing on product sense (shreyas.substack.com) — practitioner-grade guidance on distinguishing discovery-stage from execution-stage feature work.
Real-World Stories: The Engineering Mindset in Action
These are not hypothetical scenarios. They are real stories from real companies where the engineering mindset — or the lack of it — made all the difference.
SpaceX: Elon Musk and First Principles Rocket Science
Toyota's Five Whys: How a Production System Changed Manufacturing
- Why did the machine stop? — Because the fuse blew due to an overload.
- Why was there an overload? — Because the bearing was not sufficiently lubricated.
- Why was it not lubricated? — Because the lubrication pump was not working properly.
- Why was the pump not working? — Because the pump shaft was worn out.
- Why was the shaft worn out? — Because there was no strainer attached, and metal scrap got in.
Amazon's Working Backwards: Starting with the Press Release
Mars Pathfinder: Debugging a Priority Inversion from 100 Million Miles Away
Additional Interview Questions: Deeper Practice
Questioning Assumptions
- Set the scene — what was the assumption, and why was it widely accepted?
- Explain what made you question it — was it data, intuition, or experience from a different context?
- Describe how you raised the concern — did you bring data? Run an experiment? Write a proposal?
- Share the outcome — even if the team did not change course, show that the process of questioning led to a better-informed decision.
- Reflect on what you learned about challenging assumptions effectively.
- Telling a story where you were right and everyone else was wrong (comes across as arrogant).
- Not explaining how you raised the concern (the process matters as much as the outcome).
- Choosing a trivial example that does not demonstrate real risk in challenging the status quo.
Investigation vs Shipping
- Explain the factors you weigh — severity, recurrence likelihood, blast radius, time pressure, and the cost of the workaround becoming permanent.
- Describe your decision heuristic — not a rigid rule, but a mental model for how you balance these.
- Give a concrete example of each choice — one where you investigated, one where you shipped a workaround, and why each was correct in context.
- Emphasize documentation — if you ship a workaround, how do you ensure the real fix does not get forgotten?
- Dogmatically always choosing one path (“I always investigate fully” or “I always ship fast”).
- Not mentioning documentation of the workaround and follow-up plan.
- Ignoring the team and business context (deadlines, on-call burden, customer impact).
Learning System
- Describe your criteria for depth vs breadth — what signals tell you something deserves deep investment vs surface-level awareness?
- Explain your actual learning process — not just “I read docs.” Do you build prototypes? Teach others? Read source code? Write about it?
- Give a concrete recent example of something you learned deeply and something you deliberately chose to only skim.
- Show awareness of opportunity cost — time spent learning one thing is time not spent on another.
- Listing technologies learned without explaining the system for deciding what to learn.
- Implying you learn everything deeply (not believable and signals poor prioritization).
- Not mentioning how you identify what is important to learn (reactive vs proactive).
Curated Resources: Go Deeper
Mental Models & First Principles Thinking
Systems Thinking & Complex Systems
Debugging, Learning & Engineering Craft
Interview Deep-Dive Questions
Q1: How do you decide when a system is 'good enough' to ship? (Intermediate)
The Question
You have been building a new service for three weeks. It works correctly for the main use cases, but you know there are edge cases you have not handled, the error handling is basic, and you have not written integration tests yet. Your product manager is asking when it can go to production. How do you think about this?
Strong Answer
“The way I think about ‘good enough’ is through a risk matrix, not a feature checklist. I ask three questions:
- What is the blast radius of the known gaps? If the unhandled edge cases affect 0.1% of requests and fail gracefully with a clear error rather than silently corrupting data, that is shippable. If they could cause data loss or inconsistent state, it is not.
- Is the failure mode observable? If I have logging and alerting in place so I will know within minutes when an edge case hits, I am comfortable shipping earlier. If I am flying blind and will only learn about problems from angry user reports three days later, I need to invest in observability before shipping.
- Is the technical debt conscious and documented? I would create specific tickets for the missing integration tests and unhandled edge cases, with clear descriptions of the risk each one carries. Then I would have an honest conversation with the PM: ‘We can ship Wednesday with these known gaps. Here is what each gap means in terms of user impact. I recommend we allocate the following sprint to close the critical ones before we scale up traffic.’
- Most candidates: ‘I would have a conversation with the PM about trade-offs and ship an MVP.’ (Vague, no framework for deciding what to cut.)
- Great candidates: ‘I would classify each gap by blast radius — silent data corruption is a ship-blocker, graceful error returns are shippable, missing integration tests are an acceptable risk if I have monitoring. Then I give the PM a concrete menu: ship Wednesday with gaps A, B, C, or ship next Monday with only gap C remaining. Let them pick.’
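The classification in the great-candidate answer can be sketched as code. The `Gap` type, the `FailureMode` names, and the example gaps are hypothetical. The three rules are taken from the answer above: silent corruption blocks the ship, graceful errors are shippable, and missing integration tests are acceptable only with monitoring in place.

```python
from dataclasses import dataclass
from enum import Enum


class FailureMode(Enum):
    SILENT_CORRUPTION = "silent data corruption"
    GRACEFUL_ERROR = "graceful error return"
    MISSING_TESTS = "missing integration tests"


@dataclass
class Gap:
    name: str
    failure_mode: FailureMode


def is_ship_blocker(gap: Gap, has_monitoring: bool) -> bool:
    if gap.failure_mode is FailureMode.SILENT_CORRUPTION:
        return True                 # never shippable
    if gap.failure_mode is FailureMode.MISSING_TESTS:
        return not has_monitoring   # acceptable risk only if we can see failures
    return False                    # graceful errors are shippable


def ship_menu(gaps: list[Gap], has_monitoring: bool) -> dict[str, list[str]]:
    """Split known gaps into blockers and consciously accepted risks."""
    menu: dict[str, list[str]] = {"blockers": [], "accepted": []}
    for gap in gaps:
        key = "blockers" if is_ship_blocker(gap, has_monitoring) else "accepted"
        menu[key].append(gap.name)
    return menu


# Hypothetical gaps A, B, C for the service in the question.
known_gaps = [
    Gap("A: duplicate webhook overwrites order total", FailureMode.SILENT_CORRUPTION),
    Gap("B: unsupported currency returns 422", FailureMode.GRACEFUL_ERROR),
    Gap("C: no integration tests for refunds", FailureMode.MISSING_TESTS),
]
menu = ship_menu(known_gaps, has_monitoring=True)
```

The output is the concrete menu to hand to the PM: the blockers set the earliest ship date, and the accepted items become the documented debt tickets.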
Follow-up: How do you handle the case where management pressures you to skip the observability investment too?
“This is where I draw a harder line, because observability is not a feature you can add later when things are calm — you need it most during the chaos of a production incident, which is exactly when you cannot add it. I would frame it concretely for management: ‘Without basic alerting, if this service fails at 2 AM, we will not know until users complain on Twitter at 9 AM. That is a seven-hour outage versus a fifteen-minute outage. The alerting takes one day to set up. The reputational cost of a seven-hour silent outage on a new service is significantly higher than one day of delay.’
Most managers respond well when you quantify the risk in terms they care about — user impact, brand reputation, and on-call burden. If they still push back, I would document the decision and the risk in writing. Not as a CYA move, but because when the incident happens — and it will — having that documented context means the postmortem focuses on improving the process rather than assigning blame.”
Follow-up: How do you calibrate differently for a greenfield service versus adding a feature to an existing production system?
“Very differently, and this is a distinction a lot of engineers miss. A greenfield service with zero traffic has much more room for iteration — you can ship a rough version, observe how real traffic behaves, and harden as you go. The blast radius is near zero because nobody depends on it yet. Your biggest risk is over-engineering for problems that never materialize.
An existing production system is the opposite. Users depend on its current behavior (Hyrum’s Law). Other services may have integrated with it. The blast radius of a regression is large. Here I invest more in backward compatibility, integration tests, and canary deployments before going to 100% of traffic. I also need to understand what the system currently does under the hood, even the parts that seem unnecessary — Chesterton’s Fence applies strongly.
The practical rule I use: for greenfield, I optimize for learning speed. For existing systems, I optimize for safety. These require different engineering practices, different testing strategies, and different conversations with stakeholders.”
Going Deeper: How does the concept of ‘reversibility’ change your threshold for shipping?
“This connects directly to the Amazon one-way-door/two-way-door framework. If what I am shipping is easily reversible — say, a new UI component behind a feature flag — my threshold for ‘good enough’ is much lower. I can ship at 70% confidence because the cost of being wrong is a five-minute rollback.
But if the change involves a database migration, a new public API contract, or a data format that other services will consume — those are one-way doors. My threshold jumps to 90%+ confidence. I want more thorough testing, more peer review, and a concrete rollback plan.
The mistake I see teams make is applying the same ‘good enough’ bar to everything. They either agonize over two-way-door decisions (bikeshedding on a logging format) or rush one-way-door decisions (shipping a database schema without enough thought). The engineering leverage is in correctly categorizing which kind of door you are walking through, then adjusting your rigor accordingly.”
Unexpected Tangent: How do you handle the situation where your ‘good enough’ shipped service becomes the permanent solution because nobody ever goes back to fix it?
“This is the most common outcome, and pretending otherwise is dishonest. About 70% of ‘temporary’ solutions I have shipped are still running years later. So I have changed how I think about ‘good enough’ — I now assume my temporary solution is the permanent solution and ask: ‘Can I live with this for two years?’ If the answer is no, I invest more before shipping.
The specific practice: when I document technical debt as tickets, I classify them as ‘bomb’ or ‘rust.’ Bomb debt will explode at a specific threshold — ‘this approach fails above 10,000 concurrent users, and we are at 6,000.’ Bomb debt gets a monitoring alert tied to the threshold. Rust debt degrades slowly — ‘the codebase becomes slightly harder to maintain each month.’ Rust debt gets a quarterly review. The bombs get fixed because the alert fires. The rust usually does not get fixed, and that is often fine — as long as you are honest about which category each item falls into.
The organizational trick: I no longer create tickets that say ‘clean up X later.’ Those tickets have a 5% completion rate. Instead, I add a TODO comment in the code with a date and my name, and I add a calendar reminder for myself in 6 weeks. When the reminder fires, I spend 30 minutes evaluating whether the cleanup is still needed. Half the time, the answer is no — the temporary solution turned out to be adequate. That saves the team from working on cleanup tickets that exist because someone felt guilty, not because the work is needed.”
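The “bomb debt gets a monitoring alert tied to the threshold” practice from the answer above can be sketched as a small record plus a check. The `BombDebt` type and the 80% warning fraction are assumptions for illustration. The 10,000 and 6,000 figures are the ones quoted in the answer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BombDebt:
    """Hypothetical record for debt that fails at a known threshold."""
    description: str
    metric: str
    failure_threshold: float
    warn_fraction: float = 0.8  # assumption: alert at 80% of the failure point

    def headroom(self, current: float) -> float:
        """Fraction of capacity left before the debt 'explodes'."""
        return 1 - current / self.failure_threshold

    def should_alert(self, current: float) -> bool:
        return current >= self.failure_threshold * self.warn_fraction


session_store = BombDebt(
    description="in-memory session store fails above 10,000 concurrent users",
    metric="concurrent_users",
    failure_threshold=10_000,
)

quiet_today = session_store.should_alert(6_000)   # current level from the example
fires_later = session_store.should_alert(8_500)   # growth crosses the warning line
```

Alerting below the failure point, not at it, is the design choice that matters: the alert has to leave enough time to do the fix before the threshold is reached.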
Q2: Walk me through a time you diagnosed a problem that crossed multiple layers of the stack. (Senior)
The Question
Tell me about a production issue where the root cause was not in the obvious layer. How did you find it?
Strong Answer
“The most memorable one was a latency degradation that looked like a database problem but was actually a garbage collection issue caused by a logging change.
Our alerting fired on p99 latency for our order service — it went from 200ms to 1.5 seconds. The natural assumption was the database, since that is where most of our time is usually spent. I checked the slow query logs — nothing unusual. Database CPU and connections were normal. The query execution plans had not changed.
So I moved up a layer to the application. I pulled a flame graph from our profiling tool and saw that 40% of the time was spent in GC pauses — way above our normal 5%. Something was allocating a massive amount of short-lived objects.
I correlated the timing with our deploy log and found that a teammate had added structured logging to the hot path the previous day. The logging library was constructing a new JSON object for every log line, including serializing the entire request context — headers, body, metadata — into a string for each of the 50,000 requests per second hitting that endpoint. That was millions of string allocations per second that the GC had to clean up.
The fix was two-fold: we moved the detailed logging behind a sampling flag (log 1% of requests at full detail, 100% at summary level), and we switched to a zero-allocation logging path for the hot loop. Latency dropped back to normal within minutes of the deploy.
The lesson I took from this: the symptom was in the response time (application layer), the initial suspicion was the database (data layer), but the root cause was memory allocation patterns (runtime layer) triggered by a code change (application layer) that affected the garbage collector (runtime layer) which manifested as latency (application layer). The problem crossed three layers. If I had stayed fixated on the database, I would have wasted hours.
War Story: This kind of layer-crossing bug is more common than people realize. At a healthcare SaaS company, we had a service that processed insurance claims. Every Tuesday at 2 PM, latency would spike to 8 seconds. For three weeks, the on-call team blamed the database — Tuesdays were when the largest insurance partner sent batch submissions. But the database was fine; query times were flat. The real cause: a cron job ran at 1:55 PM every Tuesday that rebuilt a Lucene search index. It pinned one CPU core at 100%, and our JVM’s G1 garbage collector was configured with only 2 GC threads on a 4-core machine. With one core pegged, GC throughput halved, and the concurrent claims processing — which generated a lot of short-lived objects — started experiencing 800ms GC pauses. The fix was a two-line JVM flag change (-XX:ParallelGCThreads=3 -XX:ConcGCThreads=2), but finding it required crossing from the application layer, through the infrastructure layer, down to the JVM runtime layer, and connecting it to a completely unrelated cron job on the same host.
Contrarian Take: Most debugging advice says ‘start with the most likely cause.’ I disagree for cross-layer bugs. Start with the most provable cause. If you can definitively rule out the database in 90 seconds with a slow query log check, do that first even if you think the database is unlikely. Elimination-based debugging beats probability-based debugging when the problem crosses layers, because your probability estimates are wrong — they are biased toward the layer you know best.
What Most Candidates Say vs What Great Candidates Say:
- Most candidates: ‘I would check the logs and look at the database.’ (Single-layer thinking, no systematic approach.)
- Great candidates: ‘I would bracket the problem. First, I confirm where time is being spent using distributed tracing or manual timing instrumentation — is it pre-database, database, or post-database? That tells me which layer to investigate. Then within that layer, I reach for the layer-appropriate tool: EXPLAIN ANALYZE for the database, flame graphs for the application, perf top or vmstat for the OS.’
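The sampling fix described in the answer above (full detail for about 1% of requests, a summary line for all of them) can be sketched as follows. This is a minimal illustration, not the team's actual code: the function name `log_request`, the constant `FULL_DETAIL_SAMPLE_RATE`, and the shape of the request dictionary are all assumptions.

```python
import json
import logging
import random
from typing import Optional

logger = logging.getLogger("orders")

# Assumed value: the answer describes logging 1% of requests at full detail.
FULL_DETAIL_SAMPLE_RATE = 0.01


def log_request(request: dict, rng: Optional[random.Random] = None) -> str:
    """Log a cheap summary for every request and serialize the full context
    only for a sampled fraction. Returns which path was taken."""
    source = rng if rng is not None else random
    # Summary path: lazy %-style formatting, no serialization of headers or body.
    logger.info("request id=%s path=%s", request["id"], request["path"])
    if source.random() < FULL_DETAIL_SAMPLE_RATE:
        # Expensive path: full JSON serialization, paid by roughly 1% of requests.
        logger.info("request detail %s", json.dumps(request))
        return "full"
    return "summary"
```

Passing a seeded `random.Random` makes the sampling reproducible in tests. The summary line still costs something per request, so this sketch reduces allocation pressure rather than eliminating it.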
Follow-up: How did you decide to look at the flame graph instead of continuing to investigate the database?
“Two things triggered the pivot. First, the absence of evidence — the database metrics were clean. Not just ‘within normal range,’ but genuinely unchanged from before the latency spike. When the obvious suspect has a solid alibi, you move on.

Second, I had a heuristic I have developed over the years: if the database looks fine but latency is high, check whether the time is being spent before or after the database call. I added timing logs around the database call itself and saw that the query took 20ms — same as always. But the request took 1.5 seconds. So 1.48 seconds was being spent somewhere else. That told me the problem was in the application or runtime layer, which is when I reached for the profiler.

The meta-skill here is knowing what tool to use at each layer. Database slow? Check slow query logs and execution plans. Application slow? Check profiler and flame graphs. Network slow? Check traceroute and packet captures. Infrastructure slow? Check CPU, memory, disk I/O. Each layer has its own diagnostic toolset, and experienced engineers build a mental mapping of symptom to tool.”

Follow-up: How would you prevent this category of problem from recurring?
“I would attack it at multiple levels. At the code review level, I would advocate for a team guideline: any logging change to a hot path (endpoints handling more than 1,000 RPS) requires a brief performance check — run the profiler before and after in staging. It is a two-minute habit that catches these issues before production.

At the system level, I would add a GC pressure metric to our standard dashboard and alert on it. We were already monitoring p99 latency, which caught the symptom, but if we had been monitoring GC pause time directly, we would have had the root cause within seconds instead of the thirty minutes it took me to trace it.

At the architectural level, the deeper question is: why was our logging path coupled to our hot path at all? In a high-throughput system, logging should be asynchronous and buffered. The log statement should drop a lightweight event into a ring buffer, and a background thread should handle serialization and I/O. That way, even if someone adds verbose logging, the hot path pays the cost of a pointer write, not a full JSON serialization.

This is a pattern I have seen repeat: what seems like a one-off incident is usually a symptom of a missing guardrail. The individual fix (sampling the logs) solves today’s problem. The systemic fix (async logging, GC alerting, performance review guidelines) prevents the next five instances.”

Going Deeper: How do you think about observability costs versus observability value?
“This is a real tension in production systems. Every metric you collect, every log you write, every trace you propagate costs CPU, memory, network bandwidth, and storage. I have seen systems where the observability infrastructure itself became the performance bottleneck — the logging was slower than the business logic.

The way I frame it is: observability is an investment with diminishing returns. The first 20% of observability effort — basic request latency, error rates, saturation metrics, and structured logs on error paths — gives you 80% of the debugging value. The Pareto Principle applied directly.

The next tier — distributed tracing, detailed flame graphs, custom business metrics — is valuable but more expensive. I invest in these for the critical path (the parts of the system where an outage costs real money) and keep lighter instrumentation elsewhere.

The anti-pattern I watch for is ‘observe everything at maximum detail.’ A team I worked with was logging the full request and response body for every API call across 40 services. Their log storage bill was 2,000 and the logs became actually useful because the signal-to-noise ratio improved dramatically.

The principle: observability should be proportional to the blast radius of the thing you are observing. Your payment service deserves more instrumentation than your ‘about us’ page.”

Unexpected Tangent: When have you seen observability itself cause an outage?
“Twice, and both times were humbling. The first: a Prometheus scrape was configured with a 5-second interval for 400 application instances, each exposing 2,000 metrics. That is 800,000 samples every 5 seconds, or 160,000 samples per second. Prometheus’s memory usage hit 28GB, it OOM-killed, and we lost all metrics during a separate unrelated incident — the one time we needed them most. We switched to a 30-second scrape interval and reduced per-instance metrics from 2,000 to 300 by removing histograms for non-critical endpoints. Prometheus dropped to 4GB.

The second: a distributed tracing system (Jaeger) was configured to sample 100% of requests. Trace data was being sent synchronously from the application to the Jaeger collector. When the Jaeger collector went down for maintenance, every application request blocked for the trace timeout (500ms) before falling back. P99 latency across the entire platform went from 150ms to 650ms because of a monitoring system maintenance window. We moved to async trace emission with a local buffer and a 10% sampling rate — better data at 1/10th the cost, with zero coupling between the application’s latency and the tracing system’s health.

The meta-lesson: observability systems are systems too. They have their own scaling limits, failure modes, and blast radii. The irony of an observability system causing the outage it was supposed to help you debug is painful enough that you only let it happen once.”
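The asynchronous, buffered emission described in the answers above (for log lines and for trace data) can be sketched with a bounded queue and a background thread. This is a minimal illustration under stated assumptions: `BufferedEmitter`, its `send` callback, and the drop-when-full policy are hypothetical names and choices, not the API of Jaeger or of any logging library.

```python
import queue
import threading


class BufferedEmitter:
    """The request path enqueues and never blocks; a background thread
    performs the slow send (serialization, network I/O)."""

    def __init__(self, send, capacity: int = 1024):
        self._send = send                          # slow callable, e.g. a network call
        self._queue = queue.Queue(maxsize=capacity)
        self.dropped = 0                           # approximate count, not lock-protected
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def emit(self, event) -> bool:
        """Called on the hot path. Drops the event if the buffer is full."""
        try:
            self._queue.put_nowait(event)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self) -> None:
        while True:
            event = self._queue.get()
            if event is None:                      # sentinel: stop the worker
                break
            try:
                self._send(event)
            except Exception:
                pass                               # a failing collector must not affect callers

    def close(self) -> None:
        """Flush remaining events, then stop the worker."""
        self._queue.put(None)
        self._worker.join()
```

The design choice that matters is `put_nowait`: when the collector is slow or down, the buffer fills and events are dropped, so request latency stays independent of the observability backend. Losing some telemetry is the price of that decoupling.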
Q3: Explain a technical concept to me at three different levels. (Foundational)
The Question
Pick any concept from your work — caching, load balancing, database indexing, whatever you are most comfortable with. Explain it to me three times: once as if I am a junior engineer on your team, once as if I am your engineering manager, and once as if I am the CEO.
Strong Answer
“I will use database indexing.

To a junior engineer: ‘Imagine the database is a 10,000-page book with no table of contents and no page numbers. When you run a query like SELECT * FROM users WHERE email = 'alice@example.com', the database has to read every single page to find Alice. That is called a full table scan and it is slow. An index is like adding a sorted table of contents at the back of the book — it says ‘alice@example.com is on page 4,721.’ Now the database looks up the index, jumps straight to the right page, and returns in milliseconds instead of seconds. The trade-off is that every time you add or update a row, the index has to be updated too, so writes get a little slower. You want indexes on columns you search by frequently, but not on everything.’

To an engineering manager: ‘Our order lookup endpoint is hitting 2-second response times at current load and it will get worse as we grow. The root cause is that we are missing an index on the customer_id column in our orders table, so every lookup scans 8 million rows. Adding the index is a one-line migration. Write performance will decrease by roughly 5% based on our benchmarks, which is well within our SLA. I can have this in production today. No code changes, no downtime — PostgreSQL supports concurrent index creation.’

To the CEO: ‘Our customers are experiencing slow order lookups and it is getting worse as we grow. I have identified the fix — it is a small infrastructure change, no feature impact, and I can deploy it today. After the fix, lookups will be instant regardless of how many orders we have. No cost increase.’

The key difference is what each audience needs: the junior needs to understand the mechanism, the manager needs to understand the impact and effort, and the CEO needs to understand the business outcome. Same problem, same fix, three completely different conversations.

War Story: I learned this lesson the hard way at a logistics company.
We had a critical Elasticsearch cluster running dangerously low on disk. I presented the problem to the VP of Engineering with a 15-minute explanation of shard allocation, replica balancing, and JVM heap sizing. His eyes glazed over by minute 3. He interrupted with: ‘Do I need to approve a purchase order?’ I said ‘Yes, we need 6 more nodes at 8,000/month by switching from Akamai to CloudFront. The migration takes two weeks with no user downtime.’ Approved on the spot. That engineer got promoted that cycle. I started paying attention to how she communicated.

Contrarian Take: The ability to explain things at the right abstraction level is more important for career advancement than raw technical skill. I have watched brilliant engineers stall at senior because they cannot talk to non-technical stakeholders, while good-but-not-great engineers make staff because every VP they talk to thinks ‘that person really gets it.’ This is not politics — it is communication as a force multiplier. An engineer who can get a $500K infrastructure investment approved in a 5-minute conversation is worth more to the organization than one who can optimize an algorithm by 15%.

What Most Candidates Say vs What Great Candidates Say:

- Most candidates: Give a textbook explanation, then say ‘I would simplify it for the CEO.’ (They cannot actually do the simplification in real time.)
- Great candidates: Deliver all three explanations fluently, with different vocabulary, different levels of detail, and different calls to action for each audience. The CEO explanation contains zero technical terms and ends with a business outcome.
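The mechanism in the junior-engineer explanation above (full table scan versus index lookup) can be observed directly by asking the database for its query plan before and after adding the index. The sketch below uses SQLite from the Python standard library so that it is self-contained; the answer itself concerns PostgreSQL, where the equivalent tools are `EXPLAIN` and `CREATE INDEX CONCURRENTLY`. The table layout, row count, and index name are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, float(i)) for i in range(10_000)],
)

QUERY = "SELECT * FROM orders WHERE customer_id = ?"


def query_plan(connection: sqlite3.Connection) -> str:
    """Return SQLite's description of how it would execute QUERY."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + QUERY, (42,)).fetchall()
    # The last column of each row is the human-readable plan step.
    return "; ".join(row[-1] for row in rows)


plan_before = query_plan(conn)   # describes a scan over the whole orders table
conn.execute("CREATE INDEX idx_orders_customer_id ON orders (customer_id)")
plan_after = query_plan(conn)    # names idx_orders_customer_id as the access path

print(plan_before)
print(plan_after)
```

The exact wording of the plan text varies between SQLite versions, so compare the two plans by looking for a scan in the first and the index name in the second rather than matching the full string.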
Follow-up: What if the CEO then asks ‘Why did this happen in the first place?’
“Now I need to zoom out without throwing anyone under the bus. I would say: ‘As our order volume grew, a database configuration that was fine for 100,000 orders became a bottleneck at 8 million. This is normal for growing systems — the architecture that works at one scale needs adjustments at the next. We are putting monitoring in place so we catch these scaling thresholds before they become user-visible.’

What I am doing is reframing a ‘failure’ as a natural consequence of growth (which it is), while also showing that we are building systems to prevent it from recurring. The CEO does not need to know about missing indexes or full table scans. They need to know: this is handled, and we are getting ahead of similar issues.

The trap to avoid is over-explaining the technical details to a non-technical audience. It makes them feel like you are deflecting or making excuses. Keep it at the business-impact layer.”

Follow-up: When you are explaining something technical, how do you know you have picked the wrong level of abstraction?
“The clearest signal is the other person’s face. If they look confused, you are too deep. If they look bored or start checking their phone, you are too shallow — they already know this part and you are wasting their time.

But since reading faces is unreliable (especially in remote meetings), I have developed a habit: I start at a deliberately high level and then say, ‘Would it be helpful if I went deeper into how this works?’ This gives the other person control over the zoom level. It also signals that I can go deeper, which builds trust.

The other technique I use is analogies. If my analogy lands — the person nods, or says ‘Oh, so it is like…’ — I know I am at the right level. If the analogy creates more confusion than it resolves, I have picked the wrong abstraction layer and I adjust.

The biggest mistake I see engineers make is defaulting to the level they are most comfortable at. Backend engineers explain everything in terms of database queries. Frontend engineers explain everything in terms of component rendering. The skill is meeting the audience where they are, not where you are.”

Unexpected Tangent: How do you explain a decision that you believe is wrong but was made by someone above you?
“This happens more often than anyone admits, and it is one of the hardest communication challenges in engineering. You own the implementation of a decision you disagree with, and a junior engineer asks you why the system works this way.

My approach: I explain the decision’s rationale honestly, including the constraints that led to it, without undermining the decision-maker. ‘We chose X because of constraint Y and priority Z. I personally would have leaned toward A instead, and here is why — but the trade-off was reasonable given the timeline.’ This is honest without being toxic.

What I never do: pretend I agree when I do not (‘this is the best approach’), or throw someone under the bus (‘the VP made us do this’). Both destroy trust — the first because your team can sense dishonesty, the second because it signals political immaturity.

The hardest version of this is explaining a decision to a customer or external stakeholder when you disagree with it. There, I focus entirely on the outcome for the stakeholder: ‘This approach means you get feature X by March instead of June. The trade-off is that Y will not be included in the initial release.’ I keep my personal engineering opinion out of external conversations entirely. Disagreements are internal. Externally, we ship a unified message.

A staff engineer I admired once told me: ‘You earn the right to disagree in private by supporting the decision in public.’ That distinction — private disagreement, public alignment — is one of the markers of engineering leadership.”
Q4: A design decision you made six months ago is now causing problems. What do you do? (Senior)
The Question
You led the design of a system six months ago. The team followed your recommendations. Now the system is hitting scaling issues that trace back to a core design choice you made. How do you handle this?
Strong Answer
“First, I own it. Not in a performative way, but factually: ‘I designed this and the choice I made about X is not holding up at our current scale. Here is what I recommend we do about it.’ The fastest way to lose credibility as a senior engineer is to deflect blame or pretend the problem is someone else’s.

Then I separate two things: the immediate mitigation and the long-term fix. If the system is actively degrading, the first priority is stabilizing it — maybe that means adding a cache, increasing resources, or putting a rate limit in front of the bottleneck. These are band-aids, and I am explicit about that.

For the long-term fix, I go back to the original design. I pull up the design doc or ADR if we wrote one. Critically, I look at what has changed since the original decision. In my experience, the original design was usually correct for the information available at the time. The question is not ‘why did I make a bad decision?’ but ‘what changed — scale, requirements, usage patterns — that invalidated the assumptions?’

For example, I once designed a notification service using a fan-out-on-write pattern — every time a user posted, we precomputed notifications for all their followers and stored them. This worked beautifully at 10,000 users. At 500,000 users, we had celebrities with 100,000 followers and a single post was generating 100,000 writes that overwhelmed our database. The design was not wrong — the user distribution changed. The fix was a hybrid approach: fan-out-on-write for normal users, fan-out-on-read for users with more than 5,000 followers. Classic Twitter-style solution, but we did not need it until we had Twitter-style usage patterns.

I wrote up the analysis, proposed the migration plan, and estimated the effort.
The key was presenting it as ‘here is what we learned and here is the path forward’ rather than ‘I messed up.’ Because honestly, if we had built the hybrid approach from day one, we would have wasted months of engineering time on complexity we did not need yet — YAGNI was correct at the time.

War Story: At an e-commerce company processing 200K orders per day, I designed the order search system using Elasticsearch with a synchronous indexing strategy — every order write went to both PostgreSQL and Elasticsearch in the same request path. At 50K orders/day this was fine. At 200K, Elasticsearch bulk indexing started causing 2-second write latencies that backed up the checkout queue. The worst part: I had specifically chosen synchronous indexing over async because ‘users need to see their order immediately after placing it.’ But when I dug into the actual user behavior data, only 12% of users ever searched for an order within 60 seconds of placing it — and those users could see the order on the confirmation page (which read from PostgreSQL, not Elasticsearch). The search index could have been 30 seconds stale and nobody would have noticed. My ‘requirement’ was an assumption I never validated. We switched to async indexing via a PostgreSQL WAL consumer feeding Elasticsearch, write latency dropped to 15ms, and not a single customer complained about search staleness. The lesson: validate your constraints before designing around them, because fake constraints create real complexity.

Contrarian Take: Owning your design mistakes publicly and loudly actually increases your credibility, not decreases it. I have seen engineers try to quietly fix their design failures, hoping nobody notices. The team always notices. The engineer who says ‘I designed this, it broke, here is why, and here is the fix’ gets trusted with bigger designs. The engineer who silently patches things gets a reputation for shipping unreliable systems.
Counterintuitive, but ownership is a credibility multiplier.

What Most Candidates Say vs What Great Candidates Say:

- Most candidates: ‘I would take responsibility and fix it.’ (Correct direction, but no framework for how to analyze what went wrong or prevent recurrence.)
- Great candidates: ‘I would pull up the original design doc, identify which specific assumptions changed, propose both an immediate stabilization and a migration path, and add a scaling-assumptions section to the doc showing at what thresholds the new design will also need revisiting. I treat every design as having an expiration date — the question is whether you label it.’
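The hybrid fan-out from the notification-service example above can be sketched in a few lines. This is an in-memory illustration only, assuming hypothetical names (`HybridFeed`, `follow`, `publish`, `read`); the 5,000-follower threshold is the one figure taken from the answer. It deliberately ignores what happens when an account crosses the threshold after it has already posted, which a real migration would need to handle.

```python
from collections import defaultdict


class HybridFeed:
    """Fan-out-on-write for ordinary authors, fan-out-on-read for large ones."""

    def __init__(self, threshold: int = 5_000):
        self.threshold = threshold
        self.followers = defaultdict(set)   # author -> users who follow them
        self.following = defaultdict(set)   # user -> authors they follow
        self.inboxes = defaultdict(list)    # user -> precomputed notifications
        self.posts = defaultdict(list)      # author -> their own posts

    def follow(self, user: str, author: str) -> None:
        self.followers[author].add(user)
        self.following[user].add(author)

    def _is_large(self, author: str) -> bool:
        return len(self.followers[author]) > self.threshold

    def publish(self, author: str, post: str) -> None:
        self.posts[author].append(post)
        if not self._is_large(author):
            # Write path: one inbox write per follower, bounded by the threshold.
            for user in self.followers[author]:
                self.inboxes[user].append(post)

    def read(self, user: str) -> list:
        items = list(self.inboxes[user])
        # Read path: pull posts from large authors at read time instead.
        for author in self.following[user]:
            if self._is_large(author):
                items.extend(self.posts[author])
        return items
```

The threshold caps the write amplification of a single post at the threshold value, while reads pay an extra lookup only for the large accounts a user follows.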
Follow-up: How do you prevent this from becoming a pattern — designing things that need redesign in six months?
“You cannot fully prevent it, and I would be skeptical of anyone who claims they can. If you are growing fast, some designs will need revisiting. The goal is not to predict the future perfectly — it is to make the redesign as cheap as possible when it happens.

The practices I use: First, I keep a ‘scaling assumptions’ section in my design docs. I write things like: ‘This design assumes fewer than 100,000 users. If we exceed that, the fan-out pattern will need to change to a hybrid approach.’ That way, when the assumption breaks, the team already has a breadcrumb trail pointing to the fix.

Second, I design for modularity at the boundaries where I expect scale to matter. I would not abstract everything — that is YAGNI territory — but the specific component I think might need to change gets a clean interface so it can be swapped without rewriting the whole system.

Third, I invest in metrics that track my scaling assumptions. If my assumption is ‘this table will stay under 10 million rows,’ I add an alert at 7 million. That gives me lead time to redesign before the system falls over, instead of reacting to an incident.

The meta-lesson is: the quality of a design is not whether it lasts forever. It is whether it lasts long enough for the context it was designed for, and whether it is easy to evolve when the context changes.”

Follow-up: How do you balance ‘own the mistake’ with ‘defend the original reasoning’?
“This is a genuinely tricky interpersonal dynamic. If you only own the mistake, you undermine confidence in your future designs — people think ‘well, they got it wrong before.’ If you only defend the reasoning, you seem incapable of admitting error.

The frame I use is: ‘The decision was sound given what we knew. Here is what changed, and here is what I would do differently with today’s information.’ This is honest, demonstrates learning, and maintains credibility.

What I specifically avoid is the word ‘but’ after owning the mistake. ‘I got this wrong, BUT the requirements changed’ sounds like deflection. Instead: ‘I got the scaling characteristics wrong. We did not anticipate the celebrity-user pattern. Knowing what I know now, I would have built in a threshold-based switching mechanism from the start. Here is the plan to add it now.’

The engineers who handle this best are the ones who create a culture where revisiting designs is normal and healthy, not a sign of failure. At one team I was on, we had a standing ‘design retrospective’ every quarter where we specifically looked at decisions from three to six months ago and asked: what held up, what did not, and what would we do differently? It destigmatized the conversation entirely.”

Going Deeper: When is it correct to NOT redesign, even though the original design is showing strain?
“This is where engineering judgment matters most. Sometimes the correct answer is to add duct tape and move on. I would not redesign if:

- The system is being replaced entirely within the next two quarters — redesigning a system with a known expiration date is wasted effort. Add the minimum mitigation to keep it alive.
- The scaling issue has a known ceiling — if we are hitting limits at 500,000 users but our market caps at 600,000 and we are at 480,000, the band-aid might be genuinely sufficient.
- The redesign would block higher-priority work — everything has an opportunity cost. If the team needs to ship a revenue-critical feature, a workaround for the scaling issue might be the right trade-off, as long as it is conscious and documented.
- The redesign introduces more risk than the current problem — sometimes the devil you know is safer than the devil you do not. If the current system is degraded but stable, and the redesign requires a risky migration, the status quo plus monitoring might be the better bet.

The hardest part of engineering seniority is knowing when not to do the technically elegant thing because the business context does not justify it.”

Unexpected Tangent: Have you ever seen a design decision fail in a way that was actually better than if it had succeeded?
“Yes, and this is a surprisingly common phenomenon. At an e-commerce company, I designed a recommendation engine that was supposed to show personalized product suggestions on the homepage. The ML model was underperforming — it kept recommending products from the same category the user had just bought (you bought running shoes, here are more running shoes). We planned to spend 6 weeks improving the model.

While the model was broken, we showed a fallback: trending products across all categories. The fallback outperformed the personalized model in conversion rate by 22%. It turned out our users were browsing-oriented, not search-oriented — they wanted to discover new categories, not go deeper into ones they already knew. If the personalized model had worked as designed, we would have never discovered this. The failure taught us something about our users that success would have hidden.

I now build intentional fallback paths into every system and instrument them. The fallback is not just a safety net — it is a free A/B test. If the fallback outperforms the primary, you have learned something valuable about your assumptions. This connects to the broader point about design failures: the goal is not to never be wrong. The goal is to structure your systems so that being wrong teaches you something useful.”
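The idea of an instrumented fallback from the answer above can be sketched as a small wrapper that records which path served each response and how each path converts. This is a minimal illustration under stated assumptions: `InstrumentedFallback`, `record_conversion`, and the rule that an empty or failed primary result triggers the fallback are all hypothetical choices, not a description of the system in the story.

```python
from collections import Counter


class InstrumentedFallback:
    """Serve from the primary when it produces a result, otherwise from the
    fallback, and keep per-path counts so the two can be compared."""

    def __init__(self, primary, fallback):
        self.primary = primary
        self.fallback = fallback
        self.served = Counter()        # responses produced by each path
        self.conversions = Counter()   # successful outcomes attributed to each path

    def get(self, *args):
        """Return (path, result). The caller keeps `path` to attribute outcomes."""
        try:
            result = self.primary(*args)
        except Exception:
            result = None
        if result:
            path = "primary"
        else:
            path = "fallback"
            result = self.fallback(*args)
        self.served[path] += 1
        return path, result

    def record_conversion(self, path: str) -> None:
        self.conversions[path] += 1

    def conversion_rate(self, path: str) -> float:
        served = self.served[path]
        return self.conversions[path] / served if served else 0.0
```

Comparing `conversion_rate("primary")` with `conversion_rate("fallback")` is only a rough signal, because traffic is not randomly assigned between the two paths. A proper experiment would still be needed before acting on a difference.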
Q5: How do you approach debugging a problem you have never seen before in a system you did not build? (Intermediate)
The Question
You join a new team and in your second week, you get paged for a production incident in a service you have barely looked at. Walk me through your approach.
Strong Answer
“First — and this sounds obvious but many people skip it — I read the alert. Not just the title, but the full context: what metric triggered it, what threshold was breached, when it started, and what the impact is. Half the time, the alert itself contains enough information to form an initial hypothesis.

Then I apply the ‘what changed?’ heuristic. Even though I am new to the system, I can check the deploy log, the config change history, and any feature flag changes in the last few hours. If something was deployed 20 minutes before the alert fired, that is my leading hypothesis.

If nothing obviously changed, I work outside-in. I start with the user-visible symptoms (errors, slowness, specific endpoints affected) and trace inward. I check the service’s health dashboard — if there is one — looking for CPU, memory, error rates, and latency percentiles. I look for patterns: is it affecting all users or a subset? All endpoints or specific ones? Constant or intermittent?

Here is the critical part for someone new to the system: I do not try to understand the entire architecture before debugging. I follow the specific failing request path. I find the error in the logs, look at what that code does, look at what it depends on, and trace the failure to its source. I treat it like a murder mystery — follow the evidence, not the map.

And I ask for help early. Being new to a team during an incident is actually a superpower in one way — I have no preconceptions.
But I also do not know the history of the system, the known quirks, the ‘oh yeah, that service does that when the weather changes.’ A senior person on the team can say, ‘Oh, that error happens when the third-party payment API has a partial outage — check their status page.’ That saves me thirty minutes of tracing.

The worst thing I could do is stay silent, flailing around in an unfamiliar codebase for an hour while the incident burns, because I am too proud to admit I need context.

War Story: My second day at a streaming media company, the content delivery pipeline went down at 6 PM — peak viewing hours, 3 million active sessions. I had never even seen the architecture diagram. The on-call engineer was on vacation (the pager rotation had not been updated). I opened the incident channel, said ‘I am new and do not know this system — who can give me 2 minutes of context?’ A backend engineer sent me a three-sentence Slack message: ‘The CDN origin is a service called media-gateway. It reads from a Redis cluster for hot content and falls back to S3. Check if Redis is responding.’ I ran redis-cli ping against the cluster — connection refused. The Redis primary had run out of memory and crashed. I called out the finding, another engineer scaled up the Redis instance, and we were back in 11 minutes. Total time I spent ‘understanding the system’: 2 minutes of reading Slack. The rest was basic debugging. Afterward, I mapped out the full content delivery architecture in a Miro board and shared it — it became the team’s official architecture doc. The incident made me the most informed new hire about that pipeline because I learned it under fire.

Contrarian Take: Being new to a system during an incident is an advantage, not a disadvantage, in one specific way: you have no assumptions to be wrong about. Veteran engineers often debug slowly because they ‘know’ how the system works — but their mental model is outdated or wrong in the specific way that is causing the incident.
The new person asks basic questions (‘is this thing actually running?’) that veterans skip because they assume the answer. I have seen new hires find root causes faster than 10-year veterans precisely because they verified what everyone else assumed.

What Most Candidates Say vs What Great Candidates Say:

- Most candidates: ‘I would read the documentation and try to understand the architecture first.’ (This is studying, not incident response. Documentation is often outdated or nonexistent.)
- Great candidates: ‘I would follow the failing request, not the architecture. Find the error in the logs, trace it to the responsible code, and diagnose from there. I treat the system like a crime scene — follow the evidence, not the map. And I would ask for help within 10 minutes, not 60.’
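The ‘what changed?’ heuristic from the answer above amounts to a filter over change events: keep whatever landed shortly before the alert and look at the newest first. The sketch below assumes a hypothetical `Change` record and a three-hour window (the answer only says ‘the last few hours’); in practice the events would come from a deploy log, config history, and feature-flag audit trail.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Change(NamedTuple):
    at: datetime
    kind: str       # "deploy", "config", or "flag"
    summary: str


def recent_changes(
    changes: List[Change],
    alert_at: datetime,
    window: timedelta = timedelta(hours=3),   # assumed window
) -> List[Change]:
    """Changes that landed within `window` before the alert, newest first."""
    candidates = [c for c in changes if alert_at - window <= c.at <= alert_at]
    return sorted(candidates, key=lambda c: c.at, reverse=True)
```

The first entry of the returned list is the leading hypothesis, not a verdict: a change that landed 20 minutes before the alert is worth checking first, but correlation in time still has to be confirmed against the failing request path.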
Follow-up: How do you distinguish between ‘I need to understand more context’ and ‘I am going down a rabbit hole’?
“I time-box myself. When I start investigating a thread, I set a mental timer for 15 minutes. If after 15 minutes I have not made progress — meaning I have not either narrowed down the problem or eliminated a hypothesis — I step back and reassess.

The key question at that reassessment point is: ‘Am I stuck because I lack information or because I lack context?’ If I lack information — I need a log that does not exist, or access to a dashboard I do not have — that is a specific blocker I can ask for help with. If I lack context — I do not understand why this service talks to that service, or what this config flag does — that is when I pull in a teammate for five minutes of explanation.

The rabbit hole trap is when you think you are making progress because you are learning things, but none of those things are leading you toward the root cause. You are doing archaeology, not debugging. The discipline is asking: ‘Does this piece of information help me form or test a hypothesis about the current incident?’ If not, it is interesting but it is not debugging.”

Follow-up: After the incident is resolved, what do you do differently as the new person versus what a veteran team member would do?
“The incident becomes my crash course in the system. I write up detailed notes about the incident path — not just for the postmortem, but for myself. What services were involved, how they connect, what the failure mode was, and where I got stuck.

Then I do something a veteran probably would not: I draw the architecture diagram from what I learned during the incident. My understanding is incomplete, and that is the point — the gaps in my diagram tell me where I need to learn more. I share the diagram with the team and say ‘Here is my understanding after this incident — what am I missing?’ This serves double duty: it fills my knowledge gaps and it often reveals documentation gaps that the team has been blind to because they have the context in their heads.

I also look at the incident through fresh eyes and ask questions the veterans might not think to ask because they have normalized certain risks. ‘Is it expected that a single service failure can take down the entire checkout flow? Should there be a fallback?’ Sometimes the team says ‘Yes, we know, it is on the roadmap.’ Sometimes they say ‘Huh, we never thought about it that way.’ Both responses are valuable.”

Unexpected Tangent: What is the worst debugging mistake you have seen someone make during a production incident?
“The worst was not a technical mistake — it was a communication mistake. At a previous company, an engineer diagnosed a database connection pool exhaustion during a P1 incident. The correct fix was to increase the pool size and restart the service. Instead of doing that, he decided to investigate why the pool was exhausted first, because he wanted to find the root cause before applying the fix. Noble instinct, wrong time. He spent 40 minutes tracing connection leaks while the site was down for 200K users. The correct sequence during an active incident is: mitigate first, investigate second. Restart the service, increase the pool, stop the bleeding — and then figure out why it happened with the system stable and the pressure off.

The second worst: during a different incident, an engineer made a config change to fix one problem and accidentally introduced a worse one. The original issue was high latency. Their fix was to increase the timeout from 5 seconds to 30 seconds. This did ‘fix’ the latency alerts (requests no longer timed out at 5s), but it meant every slow request now held a connection for 30 seconds instead of failing fast. The connection pool filled up and the entire service went down. Mitigations applied during incidents need the same ‘what is the blast radius?’ analysis as any other change — but under time pressure, engineers skip that step.

The pattern behind both mistakes: incident adrenaline distorts judgment. It pushes you toward either analysis paralysis (investigating instead of mitigating) or panic-driven action (changing things without thinking about consequences). The discipline is a simple mental checklist: mitigate, communicate, investigate — in that order, every time.”
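The timeout incident in the answer above follows from Little's law: the average number of connections held equals the arrival rate of requests multiplied by how long each one holds a connection. The sketch below works through that arithmetic. Only the 5-second and 30-second timeouts come from the story; the pool size of 100 and the rate of 10 slow requests per second are assumed numbers chosen for illustration.

```python
def connections_held(slow_requests_per_second: float, hold_seconds: float) -> float:
    """Little's law: average in-flight requests = arrival rate x time held."""
    return slow_requests_per_second * hold_seconds


POOL_SIZE = 100   # assumed pool size, not from the incident
SLOW_RPS = 10     # assumed rate of requests that run until the timeout

for timeout_seconds in (5, 30):   # the two timeouts from the story
    held = connections_held(SLOW_RPS, timeout_seconds)
    status = "pool exhausted" if held > POOL_SIZE else "within pool"
    print(f"timeout={timeout_seconds}s -> ~{held:.0f} connections held ({status})")
```

Under these assumed numbers, slow requests hold about 50 connections with a 5-second timeout and about 300 with a 30-second timeout, so a sixfold increase in the timeout turns a pool with headroom into an exhausted one without any change in traffic.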
Q6: Tell me about a time you had to make a one-way-door decision with incomplete information. (Staff-Level)
The Question
Describe a significant technical decision where you could not get more information, could not easily reverse the choice, and had to commit anyway. How did you decide, and how did it turn out?
Strong Answer
“We were building a new event-sourcing system for our fintech platform. The core decision was the storage layer: do we use PostgreSQL with a well-designed append-only schema, or do we adopt Apache Kafka as the event store? This was a true one-way door — once we started writing financial events to a storage system, migrating to a different one would mean moving millions of immutable records with audit trail requirements and zero tolerance for data loss.
I did not have enough information to be certain. Our projected scale was 50 million events per month initially, growing to maybe 500 million within two years. Kafka handles that volume trivially. But PostgreSQL with proper partitioning could also handle it. The performance difference was not the deciding factor.
I mapped the decision across several dimensions:
- Operational expertise: Our team had deep PostgreSQL experience. Nobody had run Kafka in production. Learning curve matters when you are responsible for a financial system.
- Failure modes we understand: We knew how PostgreSQL fails. We knew how to back it up, restore it, replicate it, and debug it. Kafka’s failure modes were unknown to us — and in a financial system, an unknown failure mode is an existential risk.
- Ecosystem fit: Everything else in our stack spoke SQL. Reporting, auditing, analytics — all of it. Kafka would have required building a separate query layer.
- Reversibility analysis: Technically both choices were one-way doors. But PostgreSQL gave us more escape hatches — we could add a Kafka-based stream later by tailing the PostgreSQL WAL (write-ahead log). Going the other direction — from Kafka to PostgreSQL — would be much harder.
I chose PostgreSQL. The team pushed back — Kafka ‘felt’ like the right tool for event sourcing, and several blog posts recommended it. I acknowledged their concern and said: ‘I agree that Kafka is the industry-standard choice for event sourcing at scale. But our specific context — team expertise, compliance requirements, and ecosystem integration — makes PostgreSQL the lower-risk bet right now. If we hit PostgreSQL’s limits, we have a clear migration path. If we hit Kafka operational problems with a team that does not know Kafka, we have a much harder situation.’
Two years later, we were at 300 million events per month on PostgreSQL with partitioned tables. It was working well. We never needed to migrate. The most valuable outcome was not the technology choice itself — it was the process of explicitly mapping the decision across dimensions and documenting why we chose what we chose. When new engineers joined and asked ‘why not Kafka?’, we could point them to the ADR instead of relitigating it.”
War Story: The pressure to choose Kafka was intense. One engineer shared a blog post from a well-known fintech showing how they processed 2 billion events per month on Kafka. What he did not mention — and I looked it up — was that company had a 15-person platform team dedicated to Kafka operations, including 3 engineers who were Kafka committers. We had 8 engineers total. I printed out the blog post and highlighted every mention of operational overhead they glossed over: ZooKeeper management (this was pre-KRaft), partition rebalancing during deploys, consumer group lag monitoring, offset management for exactly-once semantics, schema registry maintenance. Then I asked the team: ‘Who among us is going to do all this? And who is going to build product features while they do?’ That ended the Kafka discussion. Two years later, the engineer who had pushed hardest for Kafka told me it was the right call — he had since joined a company that used Kafka and spent his first three months just getting the operational runbook stable.
Contrarian Take: For one-way-door decisions, the team’s emotional state matters more than most engineers admit. If half the team feels unheard or overridden, they will subconsciously under-invest in making the chosen approach succeed. A technically inferior decision that the team is enthusiastic about will often outperform a technically superior decision that half the team resents. This does not mean you let popularity drive architecture — but it means the social process around the decision (making people feel heard, giving them ownership of adjacent decisions, setting revisit milestones) is not soft-skills fluff. It is execution risk management.
What Most Candidates Say vs What Great Candidates Say:
- Most candidates: ‘I would research both options thoroughly and pick the better one.’ (No framework for how to decide, no acknowledgment of team dynamics or irreversibility.)
- Great candidates: ‘I would map the decision across five dimensions: operational expertise, failure modes we understand, ecosystem fit, reversibility, and team morale. I would write an ADR with a trigger-conditions-for-revisiting section. And I would give meaningful ownership to the engineers whose preference was not selected, because a one-way door that the team half-heartedly walks through is worse than a two-way door.’
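A trigger-conditions-for-revisiting section is easier to honor when it can be evaluated mechanically instead of remembered. The following is a minimal sketch of that idea, not the format of any particular ADR tool. The class and function names are hypothetical, and the default thresholds are placeholders that mirror the p99 latency and monthly engineering-hours figures quoted in the follow-up answer.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RevisitTriggers:
    """Thresholds copied from an ADR's trigger-conditions section (placeholder values)."""
    max_p99_latency_ms: float = 500.0
    max_partition_hours_per_month: float = 10.0


def triggers_fired(p99_latency_ms: float, partition_hours: float,
                   triggers: RevisitTriggers = RevisitTriggers()) -> List[str]:
    """Return a human-readable reason for each trigger that has been crossed."""
    reasons = []
    if p99_latency_ms > triggers.max_p99_latency_ms:
        reasons.append(
            f"p99 latency {p99_latency_ms:.0f}ms exceeds {triggers.max_p99_latency_ms:.0f}ms"
        )
    if partition_hours > triggers.max_partition_hours_per_month:
        reasons.append(
            f"partition upkeep {partition_hours:.1f}h/month exceeds "
            f"{triggers.max_partition_hours_per_month:.1f}h/month"
        )
    return reasons


# Hypothetical monthly review: measured values would come from real dashboards.
for reason in triggers_fired(p99_latency_ms=650.0, partition_hours=4.0):
    print("Revisit storage decision:", reason)
```

An empty list means the decision stands for another review period; a non-empty list is the agreed signal to reopen the ADR, which keeps the revisit from depending on who happens to remember it.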
Follow-up: How do you know when you have done ‘enough’ analysis before committing to a one-way door?
“I use a diminishing-returns test. I track whether each additional hour of analysis is changing my confidence or my decision. Early on, each hour of research shifted my thinking significantly — I learned about Kafka’s operational complexity, about PostgreSQL’s partitioning capabilities, about the team’s actual skill gaps. But after about three days, each new piece of information was confirming what I already believed rather than challenging it. That is the signal that more analysis has diminishing returns.
I also watch for a specific failure mode: analysis that is really procrastination dressed up as diligence. If I find myself researching increasingly obscure edge cases that would only matter at 100x our current scale, I am procrastinating. The question is not ‘what is the perfect choice?’ but ‘do I have enough information to make a defensible choice?’
Defensible means: if this decision goes wrong, I can explain the reasoning and the information available at the time, and a reasonable senior engineer would agree the process was sound. You do not need to be right — you need to have a rigorous process that maximizes your probability of being right.”
Follow-up: What would you have done differently if the decision turned out to be wrong — say PostgreSQL hit a wall at 200 million events?
“Having planned for this possibility is actually what made me comfortable committing in the first place. The ADR included a ‘trigger conditions for revisiting’ section. I wrote: ‘If we observe query latency on the events table exceeding 500ms at p99, or if the partition management overhead exceeds 10 hours of engineering time per month, we should revisit the storage choice.’
If we hit the wall, my first response would not be to immediately jump to Kafka. I would ask: what specifically failed? Was it write throughput? Read query performance? Storage costs? The answer determines the next step. If it was write throughput, PostgreSQL has options — better partitioning, connection pooling optimization, or even a simple dedicated write replica. If it was fundamentally an architectural mismatch — PostgreSQL’s row-based storage was wrong for append-only event streams at our scale — then yes, we migrate.
The migration plan was already sketched in the original design: tail the PostgreSQL WAL to populate Kafka, run both systems in parallel for a validation period, then cut over reads to Kafka while keeping PostgreSQL as the backup. We estimated two months of work if we ever needed it.
The key is that we had thought about the failure path before committing to the decision. You should never walk through a one-way door without at least a rough sketch of the emergency exit.”
Going Deeper: How do you handle the social dynamics when your one-way-door decision faces team resistance?
“This is where the technical and the interpersonal merge, and I think many senior engineers underinvest in this aspect. When I made the PostgreSQL call, two engineers on the team were genuinely disappointed. They wanted to work with Kafka — partly for the technical merits, partly because it would be a growth opportunity for them.
I did three things. First, I acknowledged their position was technically valid. Kafka is a great fit for event sourcing in many contexts. I was not saying they were wrong — I was saying our specific context tilted toward PostgreSQL.
Second, I gave them ownership of something meaningful. The engineer who was most excited about Kafka led the design of our event schema and the WAL-tailing prototype (the escape hatch). That let them engage deeply with event streaming concepts even though we were not using Kafka as the primary store.
Third, I committed to revisiting the decision at a specific point — the 100 million events per month milestone — with data. This was not a vague ‘we will revisit later.’ It was a calendar event with defined metrics to evaluate. This turned the decision from a permanent loss for the Kafka advocates into a time-bounded experiment.
The broader principle: one-way-door decisions create winners and losers. The technical decision might be correct, but the team dynamics around it need active management. The engineers whose preference was not selected need to feel heard, respected, and still invested in the outcome.”
Unexpected Tangent: What is a one-way-door decision that most engineers treat as a two-way door?
“Logging and event schemas. Most teams treat log formats and event payloads as trivial decisions — ‘we can always change them later.’ But once you have 6 months of logs in Elasticsearch or events in a data warehouse, those schemas become load-bearing infrastructure. Dashboards, alerts, and reports are built on top of them. Analytics queries reference specific field names. Compliance systems parse specific log formats for audit trails.
At a company I worked at, someone renamed a log field from user_id to userId during a ‘consistency cleanup.’ Thirty-seven Grafana dashboards broke. The fraud detection pipeline stopped scoring transactions. A quarterly compliance report that parsed access logs failed silently and delivered an empty report to auditors. All because a log field name — something nobody thought of as a one-way door — had become a dependency for a dozen downstream systems over 18 months.
Other underrated one-way doors: error message text (users and scripts parse them — Hyrum’s Law), feature flag naming conventions (once feature flags are referenced in A/B test analysis, renaming them invalidates historical data), and the order of steps in a deployment pipeline (teams build muscle memory and mental models around the existing sequence). The common thread is that anything observable for long enough becomes a contract, whether you intended it to or not.”
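If a log field has already become a contract, a rename can be staged instead of applied in one step. The sketch below is one possible approach under stated assumptions, not what the team in the story did: it emits both the new and the legacy field name during a transition window and provides a check for fields that consumers still require. The alias map and function names are hypothetical.

```python
import json
from typing import Any, Dict, Set

# Hypothetical rename map: new field name -> legacy name that consumers still read.
LEGACY_ALIASES = {"userId": "user_id"}


def with_legacy_aliases(event: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the event that carries both the new and the legacy field names."""
    out = dict(event)
    for new_name, old_name in LEGACY_ALIASES.items():
        if new_name in out and old_name not in out:
            out[old_name] = out[new_name]
    return out


def missing_contract_fields(event: Dict[str, Any], required: Set[str]) -> Set[str]:
    """Fields that downstream consumers depend on but the event no longer provides."""
    return required - set(event)


renamed = {"userId": "u-123", "action": "login"}
print(json.dumps(with_legacy_aliases(renamed), sort_keys=True))
print(missing_contract_fields(renamed, {"user_id", "action"}))
```

Running a check like `missing_contract_fields` in CI against the set of fields that dashboards and pipelines are known to read turns an implicit contract into an explicit one, so a cleanup that drops `user_id` fails a build instead of a compliance report.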
Q7: You are asked to improve the reliability of a system that 'works fine' but has no tests, no monitoring, and no documentation. Where do you start? (Senior)
The Question
You inherit a production service that handles real traffic and has not had an outage — but it has zero tests, minimal logging, no metrics dashboard, and the original author left the company. Leadership says “improve its reliability.” What is your approach?
Strong Answer
“Before I write a single test or add a single metric, I need to understand what this system actually does. This sounds obvious, but I have seen engineers jump straight to adding test coverage for code they do not understand, which produces tests that verify the current behavior without knowing if the current behavior is correct.
My first week is a discovery phase:
- Read the code, trace the critical paths. I follow an incoming request from entry point to response. I identify what data it reads, what data it writes, what external services it calls, and what happens when each of those calls fails. I draw the architecture diagram that does not exist.
- Look at the data. What queries hit this service? What is the traffic pattern? Are there spikes? What is the error rate according to the load balancer logs, even if the application does not log errors? Production traffic is the most honest documentation you will ever read.
- Talk to the users of the system. Who depends on this service? What do they care about? If the order service team tells me ‘we retry three times because that endpoint sometimes returns 500s on Mondays,’ that is a reliability problem that no monitoring will show me yet.
After discovery, I prioritize ruthlessly. This is where Pareto matters: I am not going to add 100% test coverage to a codebase I barely understand. Instead:
- First: observability. I add basic metrics — request rate, error rate, latency percentiles, and saturation (connection pool usage, memory, CPU). I add structured logging on error paths. This gives me eyes on the system. I literally cannot improve what I cannot measure.
- Second: the kill switch. I make sure I can stop this system safely if it starts misbehaving. Can I drain traffic? Is there a feature flag to disable it? Can I roll back a deploy quickly? If not, that is the first thing I build.
- Third: tests for the scariest paths. I do not try to cover everything. I identify the three or four code paths where a failure would be most damaging — data corruption, financial miscalculation, security vulnerability — and I write integration tests for those. Not unit tests that mock everything, but tests that exercise the real behavior with a real database.
- Fourth: documentation. I write the architecture doc and the runbook based on what I learned. Not comprehensive documentation — the minimum an on-call engineer would need to handle an incident at 3 AM. ‘What does this service do? What are its dependencies? What does a healthy state look like? What are the known failure modes and how do you mitigate them?’
This is a six-to-eight-week investment for a moderately complex service. After that, I have a system that is observable, stoppable, tested on the critical paths, and documented enough to be on-called. Then I can start improving it incrementally.”
War Story: I inherited a Python service at an ad-tech company that computed real-time bidding prices. Zero tests, no metrics, logging was print() statements to stdout that nobody read. The original author had left 8 months earlier. Everyone was terrified to touch it. My first discovery: the service had a memory leak — RSS grew by about 50MB per hour. It had been silently restarted by a cron job every 4 hours since the original author set it up. Nobody knew this. When I asked about the cron job, three different people said ‘oh, that is just how we deploy it.’ The cron job was the reliability strategy. My second discovery: the service was doing floating-point arithmetic for currency calculations, which meant bid prices were sometimes off by fractions of a cent — in aggregate, about $23K/month. Fixing the currency issue alone paid for my salary multiple times over. But the key insight was: the system was not ‘working fine.’ It was failing in ways that were invisible because there was no instrumentation to make the failures visible.
Contrarian Take: When inheriting a system with no tests, do NOT start by writing unit tests. Unit tests for code you do not understand just lock in the current behavior — including the bugs. Start with integration tests that validate the business outcomes (does the service return the correct bid price for this input?), and add monitoring that tells you about real-world behavior. Unit tests come later, after you actually understand what the code is supposed to do versus what it actually does. I have seen teams write 200 unit tests for a legacy service and still miss a critical business logic bug because the tests verified the implementation, not the intent.
What Most Candidates Say vs What Great Candidates Say:
- Most candidates: ‘I would add tests and monitoring.’ (Correct direction, but no prioritization framework. Adding tests to a system you do not understand produces false confidence.)
- Great candidates: ‘I would spend the first week in pure discovery — reading code, tracing request paths, talking to consumers, and looking at real production traffic patterns. I do not write a single test until I understand what correct behavior looks like. Then I prioritize: observability first (because I cannot improve what I cannot measure), kill switch second (because I need to be able to stop it safely), critical-path integration tests third, and documentation fourth.’
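The floating-point currency problem from the war story is easy to reproduce, and a targeted test for it is a good candidate for the first critical-path check. The sketch below shows the drift and one common remedy, Python's standard `decimal` module. The amounts are hypothetical and chosen only to expose the behavior; the source does not say how the original service was fixed.

```python
from decimal import Decimal, ROUND_HALF_EVEN

# Hypothetical bid components in dollars, summed with binary floats.
float_total = sum([0.10] * 10)

# The same components as exact decimal values.
exact_total = sum([Decimal("0.10")] * 10, Decimal("0"))


def to_cents(amount: Decimal) -> Decimal:
    """Round to whole cents explicitly, using banker's rounding."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_EVEN)


print("float sum equals 1.0:  ", float_total == 1.0)
print("decimal sum equals 1.0:", exact_total == Decimal("1.00"))
print("rounded bid:", to_cents(Decimal("0.125")))
```

The point of the example is the test it suggests: assert that a known set of inputs produces an exact expected price, so that the intended business outcome is pinned down before any refactoring begins.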
Follow-up: How do you convince leadership that spending 6-8 weeks on a service that ‘works fine’ is worth it?
“I frame it in terms of risk, not engineering ideals. ‘This service works fine’ actually means ‘this service has not failed yet.’ Those are very different statements. I would say:
‘This service processes X transactions per day and has no tests, no monitoring, and no documentation. If it fails — and we will not know it is failing because we have no alerting — we will be debugging a system nobody understands, under production pressure, with no runbook. Our mean time to detect will be measured in hours (until a user complains), and our mean time to resolve will be measured in days (because nobody knows how it works).
Here is my proposal: in six weeks, I can make this system observable, tested on its critical paths, and documented enough for on-call. The investment is N engineering-weeks. The alternative is waiting for an incident that will cost us far more than that in engineering time, customer trust, and potentially lost revenue.’
Most leaders respond to this framing because it quantifies the downside of inaction. The abstract appeal to ‘engineering quality’ rarely works. The concrete appeal to ‘here is the cost when this breaks, and here is a cheaper alternative’ usually does.”
Follow-up: You discover during your exploration that the system has a significant bug that nobody knows about because there is no monitoring. Do you fix it immediately or stick to your plan?
“It depends on the severity. If the bug is actively causing data corruption or financial miscalculation, I stop everything, fix the bug, and alert stakeholders. Some things cannot wait for a phased plan.
If the bug is causing intermittent incorrect behavior that is not dangerous — say, a race condition that returns stale data 0.1% of the time — I document it, add it to my priority list, and keep following my plan. The reason is that fixing a bug in a system with no tests and no monitoring is risky. I might introduce a regression I cannot detect. I would rather have the observability and test coverage in place first so that my fix is safe and verifiable.
This is a judgment call that I would discuss with the team lead. I would say: ‘I found this bug. Here is the impact. Here is my recommendation: let me add monitoring and a targeted test for this specific behavior first, then apply the fix with confidence that we will catch any regression. The total timeline is an extra week compared to fixing it blindly today, but the risk of introducing a new problem drops significantly.’ Most experienced engineers agree with that approach.”
Unexpected Tangent: How do you handle the political dimension — the original author is gone, and the current team sees the system as ‘not their problem’?
“This is the hidden challenge nobody talks about in system reliability discussions. Orphaned services create organizational dead zones. Nobody wants to own them. Engineers view being assigned to improve an orphaned service as a punishment, not a growth opportunity.
I reframe it. When I took on the bidding service, I positioned it to my manager as: ‘This service handles $4M in monthly ad spend with zero observability. Whoever makes it reliable owns a critical piece of revenue infrastructure and gains deep production operations experience. I want that to be me.’ That framing turns an obligation into a visible, high-impact project.
The harder political challenge is when the orphaned service was built by a team that still exists but has ‘moved on’ to shinier projects. They do not want to maintain it but they also do not want someone else to change it (territorial instinct). I handle this by explicitly asking for their blessing: ‘I am going to add monitoring and tests to this service. I will send you the PRs for review since you have the most context. You do not need to maintain it — I am just asking for your expertise during the stabilization period.’ This gives them credit without giving them work, which is usually the magic formula.
The worst outcome is when nobody owns the orphan and leadership does not assign it. These services become ticking time bombs. If you spot one in your organization, volunteering to stabilize it is one of the highest-leverage career moves you can make — you become the hero when (not if) it eventually breaks, and you have already done the groundwork.”
Q8: How do you think about second-order effects when designing systems? (Staff-Level)
The Question
Give me a concrete example of a time where a design decision had unexpected second-order effects. How did you discover them, and what did it teach you about anticipating them in the future?
Strong Answer
“We added aggressive client-side caching to our mobile app to improve offline performance. The first-order effect was exactly what we wanted — the app felt snappy, offline mode worked, and user satisfaction scores went up.
The second-order effects hit us over the next month:
- Customer support costs increased. Users would update their profile on the web, then open the mobile app and see stale data. They thought it was broken. Support tickets for ‘my changes are not showing up’ tripled. This was predictable in hindsight, but we had not talked to the support team during design.
- Our A/B testing became unreliable. We were running experiments on the backend, but the mobile app was serving cached responses. Some users were seeing experiment variant A from yesterday mixed with variant B’s backend behavior today. Our data science team spent two weeks debugging what looked like a bizarre statistical anomaly before realizing it was stale caches.
- Push notification engagement dropped. We sent notifications saying ‘New message from X’ but when users opened the app, the cached view did not show the new message. They had to pull-to-refresh. The notification team saw engagement metrics drop 15% and could not figure out why.
What I learned about anticipating second-order effects:
First, I now ask ‘who else consumes or depends on the data we are changing?’ before any caching decision. Caching is never just about the service that implements it — it affects every downstream consumer’s view of reality.
Second, I learned to run a ‘second-order effects brainstorm’ for significant design changes. I gather representatives from adjacent teams — support, data science, the mobile team, the notifications team — and ask them: ‘We are planning to cache X aggressively. What could go wrong from your perspective?’ This takes one hour and consistently surfaces effects I would not have thought of.
Third, I now design cache invalidation before I design caching. The question is not ‘should we cache this?’ but ‘how will we invalidate this, and what is the worst-case staleness users will experience?’ If I cannot answer the invalidation question, I do not cache.”
War Story: The A/B testing contamination from the mobile caching incident cost the data science team three weeks. They had been running a pricing experiment — showing different prices to different user cohorts to measure price elasticity. But the mobile cache was serving yesterday’s prices to users who had been reassigned to a different cohort today. The experiment data showed impossible results: users in the ‘high price’ cohort were converting at a higher rate than the ‘low price’ cohort. The data science team published a report saying ‘counterintuitively, higher prices increase conversion.’ The product team almost raised prices across the board based on this analysis. A junior data scientist caught the anomaly by checking raw request logs and discovered the cached price mismatch. If she had not, we would have raised prices based on contaminated data and likely lost $200K+ in quarterly revenue. That is a second-order effect of a caching decision nearly causing a pricing disaster through a data science pipeline — three teams, three systems, one root cause that nobody anticipated at design time.
Contrarian Take: Second-order effects are not a bug in your design process — they are a feature of working in complex systems. You will never predict all of them, and trying to predict all of them before shipping is a form of analysis paralysis that kills velocity. The real skill is not preventing second-order effects — it is building detection mechanisms so you discover them in hours rather than weeks. Invest in anomaly detection, cross-team communication channels, and fast rollback capabilities. A team that ships fast and detects second-order effects within 24 hours outperforms a team that tries to predict everything and ships 3x slower.
What Most Candidates Say vs What Great Candidates Say:
- Most candidates: ‘I would think about what could go wrong before implementing the change.’ (Correct instinct, but no method. How do you systematically identify second-order effects?)
- Great candidates: ‘I use a three-step framework: trace the data flow (where does this data appear across the system?), check the feedback loops (does this change alter any auto-scaling, caching, or retry behavior?), and run a cross-team impact review (a 30-minute meeting with representatives from adjacent teams asking: what could go wrong from your perspective?). The cross-team review catches 80% of the effects I would miss on my own.’
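Designing invalidation before caching can be made concrete by forcing every read to state how much staleness it tolerates and by giving the write path an explicit invalidation hook. The sketch below is a minimal in-memory illustration under those assumptions, not the mobile cache from the story. The class name, key format, and staleness budgets are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class BoundedStalenessCache:
    """Tiny in-memory cache where each read states how stale a value it will accept."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def put(self, key: str, value: Any) -> None:
        """Store a value together with the time it was cached."""
        self._entries[key] = (self._clock(), value)

    def invalidate(self, key: str) -> None:
        """Called by the write path, for example when a profile is updated elsewhere."""
        self._entries.pop(key, None)

    def get(self, key: str, max_staleness_s: float) -> Optional[Any]:
        """Return the cached value only if it is within the caller's staleness budget."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if self._clock() - stored_at > max_staleness_s:
            return None
        return value


cache = BoundedStalenessCache()
cache.put("profile:1", {"name": "Ada"})
print(cache.get("profile:1", max_staleness_s=300))  # a profile view may accept minutes
cache.invalidate("profile:1")                        # a write elsewhere evicts the entry
print(cache.get("profile:1", max_staleness_s=300))
```

Because the staleness budget is an argument on every read, each consumer has to answer the worst-case staleness question at the call site, and a consumer that cannot tolerate stale data can pass zero and effectively bypass the cache.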
Follow-up: Is it possible to systematically predict second-order effects, or is it always reactive?
“You cannot predict all of them, but you can catch the majority with a structured approach. I use a simple framework I call ‘trace the data flow’:
Step one: identify all the places the affected data appears. If I am caching user profile data, where does profile data show up? The mobile app, the web app, the admin dashboard, email templates, third-party integrations, analytics pipelines. Each of those is a potential site for a second-order effect.
Step two: for each consumer, ask ‘what is the acceptable staleness?’ The mobile app showing a profile photo that is 5 minutes old is fine. A payment processor using stale billing info is not. This immediately tells you where the cache needs real-time invalidation versus where eventual consistency is acceptable.
Step three: ask ‘what feedback loops does this create?’ If caching reduces database load, will our auto-scaler reduce database capacity? If so, what happens when the cache fails and all requests hit a database that has been right-sized for cached traffic?
This process takes about an hour for a significant caching change. It does not catch everything, but it consistently catches the effects that would otherwise become incidents.”
Going Deeper: How do you think about second-order effects at the organizational level, not just the technical level?
“This is where systems thinking connects to leadership. Technical decisions have organizational second-order effects that engineers often ignore.
Example: we decided to extract our authentication logic into a shared library so all services could use it consistently. Technically sound. But the organizational second-order effects were significant:
- The team that owned the library became a bottleneck. Every service team needed their changes prioritized, and the auth team could not keep up. This created resentment and workarounds — some teams forked the library to unblock themselves, which defeated the consistency goal.
- On-call responsibility became ambiguous. When an auth failure happened, was it the library’s fault or the consuming service’s fault? Both teams would point at each other during incidents.
- Hiring became harder for the auth team because their roadmap was mostly ‘maintain the library and handle requests from other teams’ — not the most exciting pitch for candidates.
These are predictable using Conway’s Law in reverse: if you create a shared component, you need a team structure that supports it. That means either staffing the team adequately, creating a self-service model where consumers can contribute changes, or accepting that the library will evolve slowly.
The lesson: when I design systems now, I always ask ‘what team structure does this architecture imply, and is that structure feasible?’ If the architecture requires a team that does not exist or a collaboration pattern that conflicts with how the organization actually operates, the architecture will fail no matter how technically elegant it is.”
Unexpected Tangent: Can second-order effects ever be deliberately engineered as a positive force?
“Absolutely, and the best platform teams do this intentionally. At Stripe, the decision to make every API response include a request_id field was a first-order feature for debugging. But it had a powerful second-order effect: customers started including request IDs in their support tickets, which cut average support resolution time from 45 minutes to 12 minutes. That was not the original goal — but Stripe kept it because the second-order effect was more valuable than the first-order one.
I have used this pattern deliberately. At one company, I added a latency_ms field to every API response header. The first-order purpose was debugging. The second-order effect: frontend developers started noticing when their requests were slow and proactively optimizing their API usage patterns — batching requests, caching responses, reducing unnecessary calls. Backend performance improved 15% without the backend team doing anything, because the frontend team could see the cost of their decisions in real time. That is a positive feedback loop engineered through transparency.
Another example: making deployment logs public to the entire engineering org (not just the deploying team). First-order effect: better deployment visibility. Second-order effect: teams started coordinating deploys informally because they could see when another team was deploying to the same infrastructure. Deployment collision incidents dropped 60%.
The principle: information transparency creates positive second-order effects. When people can see the consequences of their actions (latency, cost, error rates, deployment conflicts), they self-correct. You do not need to build enforcement mechanisms — you need to build visibility.”
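Exposing latency to API consumers, as in the latency_ms example, can be done in a thin middleware layer so that no handler has to change. The sketch below is a standard-library WSGI wrapper written under stated assumptions: the header name `X-Latency-Ms` and the function names are hypothetical, and the measurement covers only the time until the application sends its response headers, not body streaming.

```python
import time
from typing import Callable


def latency_header_middleware(app: Callable, clock: Callable[[], float] = time.perf_counter) -> Callable:
    """Wrap a WSGI app so every response reports how long the app took to respond."""

    def wrapped(environ, start_response):
        started = clock()

        def timed_start_response(status, headers, exc_info=None):
            # Elapsed time between receiving the request and sending response headers.
            elapsed_ms = (clock() - started) * 1000.0
            headers = list(headers) + [("X-Latency-Ms", f"{elapsed_ms:.1f}")]
            return start_response(status, headers, exc_info)

        return app(environ, timed_start_response)

    return wrapped


def demo_app(environ, start_response):
    """Hypothetical handler used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"ok"]


application = latency_header_middleware(demo_app)
```

Putting the number in a response header rather than in a server-side dashboard is the design choice that matters here: it places the cost of a request in front of the person who issued it.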
Q9: What is the most important skill for debugging, and how do you develop it? (Foundational)
The Question
If you had to pick one single skill that makes someone a great debugger, what would it be?
Strong Answer
“The ability to observe precisely — meaning to separate what you see from what you assume.
Most debugging failures I have witnessed, including my own, come from a single root cause: the engineer assumed something was true instead of verifying it. ‘The config is correct’ — did you read it, or did you assume it? ‘The deploy went out’ — did you check the deploy dashboard, or are you assuming it succeeded? ‘This code path is not being hit’ — did you add a log line to confirm, or are you reading the code and believing your reading?
The best debuggers I have worked with have an almost annoying habit of verifying everything. They will say: ‘I think the problem is in the cache layer. Let me prove that the request is actually reaching the cache.’ And they add a log, or check a metric, or run a query — before moving on. Weaker debuggers say: ‘The problem must be in the cache layer’ and start redesigning the cache without confirming the premise.
To develop this skill, I practice a simple discipline: for every hypothesis I form while debugging, I write it down and then write down how I would prove or disprove it. Not how I would reason about it — how I would empirically verify it. This forces me to find evidence rather than building chains of logic on top of unverified assumptions.
A concrete example: I was debugging a timeout issue. My chain of reasoning was: the request hits our API gateway, which forwards to the backend, which queries the database. The database must be slow. I was about to start investigating the database when I paused and asked: ‘Have I verified that the request even reaches the backend?’ I checked the backend access logs. The request was not there. The timeout was happening at the API gateway level — a misconfigured route was sending the request to a non-existent service. If I had followed my unverified assumption into the database, I could have spent an hour looking in the wrong place.
War Story: At a travel booking company, we had a P1 incident where hotel search results were returning duplicate listings — the same hotel appearing 3-4 times with slightly different prices. The first engineer to investigate assumed it was a deduplication bug in the search service and spent 90 minutes reading search code. The second engineer assumed it was a data ingestion issue and spent an hour checking the ETL pipeline. I asked a simpler question: ‘Are we actually seeing duplicates, or are these genuinely different listings?’ I pulled 5 duplicate hotel IDs and queried the raw database. They had different listing_id values, different supplier_id values, and slightly different metadata. They were not duplicates in our system — they were the same physical hotel listed by 3 different suppliers, which was correct behavior our system had always had. What changed was the frontend: a CSS update had removed the supplier badge that previously differentiated them visually. Users suddenly noticed duplicates that had always been there. The ‘bug’ was a CSS class rename. Two senior engineers spent 2.5 combined hours investigating the wrong systems because they assumed the report was accurate without verifying the premise. I found it in 8 minutes because I verified the basic claim before investigating the mechanism.
Contrarian Take: The best debugging tool is not a debugger, a profiler, or a tracing system — it is a text editor and 5 minutes of writing. Before I touch any tool, I write three sentences: what I expect to be true, what I actually observe, and the specific discrepancy between them. This forces me to articulate the gap precisely, which cuts my debugging time by half. Engineers who jump straight into tools often debug symptoms of symptoms, three levels removed from the actual discrepancy. The writing step grounds you in what is actually wrong.
What Most Candidates Say vs What Great Candidates Say:
- Most candidates: ‘The most important skill is understanding the system.’ (True but circular — understanding the system is the goal of debugging, not the skill that enables it.)
- Great candidates: ‘Precise observation — the ability to separate what I see from what I assume. I treat every fact as unverified until I have evidence. Did the deploy go out? Let me check the deploy dashboard, not assume it did. Is the database slow? Let me check the slow query log, not infer from latency numbers. Every debugging dead end I have ever gone down started with an unverified assumption that felt like a fact.’
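The "write down each hypothesis and how to verify it" discipline from the answer above can be kept in something as small as the following sketch. The `Hypothesis` class and `unverified` helper are hypothetical names introduced for illustration; the two entries mirror the timeout example in the answer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hypothesis:
    claim: str                       # what I currently believe is true
    check: str                       # the empirical check that would prove or disprove it
    verified: Optional[bool] = None  # None until the check has actually been run


def unverified(log: list[Hypothesis]) -> list[Hypothesis]:
    """Return the hypotheses that are still assumptions, not observations."""
    return [h for h in log if h.verified is None]


log = [
    Hypothesis("Request reaches the backend", "Search the backend access logs for the request"),
    Hypothesis("Database is slow", "Check the slow query log for the same time window"),
]

# Running the first check disproves the premise the second hypothesis rests on.
log[0].verified = False

for h in unverified(log):
    print(f"UNVERIFIED: {h.claim} -> {h.check}")
```

The point of the `verified` field defaulting to `None` is that a claim stays visibly unproven until someone records the result of a real check.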
Follow-up: How do you apply this skill when you are debugging under time pressure during an incident?
“This is where the skill matters most and is hardest to apply. Adrenaline during an incident pushes you toward action — ‘just try something!’ The discipline is to slow down for 30 seconds before each action and ask: ‘What am I about to change, and what evidence do I have that this will help?’
I use a technique I call ‘narrate to the channel.’ During an incident, I post my observations and hypotheses in the incident Slack channel in real time: ‘Observing: error rate spiked at 14:32. Hypothesis: related to deploy at 14:28. Testing: checking if rolling back the deploy resolves the error rate.’ This forces me to be explicit about what I know versus what I assume, and it gives the rest of the team visibility into my reasoning — someone might immediately say ‘that deploy only touched the admin service, but the errors are on the customer-facing service.’
The temptation during incidents is to skip observation and jump to action. The best incident responders I have worked with resist that temptation. They take two minutes to look at the data, form a hypothesis, and then act. Those two minutes often save thirty minutes of misdirected effort.”
Follow-up: What is the difference between a developer who is ‘good at debugging’ and one who is just ‘experienced’?
“Experience gives you pattern matching — you have seen similar bugs before, so you can quickly recognize them. That is valuable, but it is fragile. It breaks down the moment you encounter a bug outside your experience.
True debugging skill is domain-independent. It is the scientific method applied to code: observe, hypothesize, test, conclude. A great debugger can diagnose problems in systems they have never seen before because their approach does not depend on recognizing patterns — it depends on systematically eliminating possibilities.
I have worked with engineers who had 15 years of experience but were poor debuggers. Their approach was: ‘I have seen this before, it is probably X.’ If X was right, they looked brilliant. If X was wrong, they had no systematic fallback. They would try Y, then Z, then get frustrated. That is not debugging — that is guessing from experience.
I have also worked with engineers with two years of experience who were excellent debuggers. Their approach was methodical: gather symptoms, form a specific hypothesis, design a test for that hypothesis, run the test, and adjust based on results. They were slower on familiar problems but much faster on novel ones.
The ideal is both: experience for pattern matching on common problems, plus a rigorous method for the problems that do not match any pattern. If I had to develop one in a junior engineer, I would invest in the method. Patterns accumulate naturally over time.”
Unexpected Tangent: How do you debug problems that are not reproducible?
“Non-reproducible bugs — the ones that happen in production but never in your local environment — are where debugging skill truly separates itself from debugging knowledge. You cannot step through the code. You cannot add breakpoints. You are working from forensic evidence.
My approach to Heisenbugs (bugs that disappear when you observe them) has three pillars:
First, increase the resolution of your observation without changing the system’s behavior. This means adding passive instrumentation: metrics that are always collected (not debug logging you turn on during investigation), distributed trace data that captures the full request lifecycle, and structured events that log state transitions. If the bug happens again, you need the evidence to already be there — you cannot ask the bug to wait while you add logging.
Second, reason about what conditions are different between production and local. The usual suspects for non-reproducible bugs: concurrency (your local machine has 4 cores, production has 32), timing (network latency between services that are co-located locally), data volume (your test database has 1,000 rows, production has 80 million), and configuration drift (a feature flag that is on in production but off in staging). I make a checklist of these differences and systematically evaluate which ones could explain the bug.
Third, if I truly cannot reproduce it, I build a hypothesis about the trigger condition and add an alert that fires when that condition occurs. For example, if I suspect a race condition between two services, I add a metric that tracks when both services are writing to the same resource within a 100ms window. When the alert fires, I have the context to capture a full trace of the next occurrence.
The most memorable non-reproducible bug I tracked down was a payment double-charge that happened about once per 50,000 transactions. It only occurred when a user hit the ‘pay’ button, got a network timeout, and retried — but only when the retry hit a different application server than the original request, and only when the first server’s request succeeded after the timeout but before the idempotency key expired from Redis. The time window was about 800ms. I found it by adding a custom metric that tracked ‘payment requests where the idempotency key was written by a different server than the one processing the request.’ It took three weeks of passive monitoring to catch it, but the evidence was conclusive.”
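The trigger-condition metric from the non-reproducible bugs answer (two services writing to the same resource within a 100ms window) could be prototyped roughly as below. This is a minimal in-memory sketch, not production instrumentation: `WriteEvent` and `CoWriteDetector` are hypothetical names, and it assumes write events arrive in timestamp order from some existing event stream.

```python
from collections import defaultdict, deque
from dataclasses import dataclass


@dataclass(frozen=True)
class WriteEvent:
    resource_id: str
    service: str
    timestamp_ms: int


class CoWriteDetector:
    """Flags writes to one resource by different services within a time window."""

    def __init__(self, window_ms: int = 100):
        self.window_ms = window_ms
        self._recent = defaultdict(deque)  # resource_id -> recent WriteEvents

    def observe(self, event: WriteEvent) -> list[WriteEvent]:
        """Record a write and return earlier writes from other services in the window.

        Assumes events are observed in non-decreasing timestamp order.
        """
        recent = self._recent[event.resource_id]
        # Drop writes that have fallen out of the window.
        while recent and event.timestamp_ms - recent[0].timestamp_ms > self.window_ms:
            recent.popleft()
        conflicts = [e for e in recent if e.service != event.service]
        recent.append(event)
        return conflicts


detector = CoWriteDetector(window_ms=100)
detector.observe(WriteEvent("order-42", "billing", 1_000))
hits = detector.observe(WriteEvent("order-42", "fulfillment", 1_060))  # 60 ms later
late = detector.observe(WriteEvent("order-42", "billing", 1_500))      # outside the window
```

In a real system, a non-empty result would increment a counter or fire the alert; the detector itself only observes and never changes the write path, which keeps the instrumentation passive.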
Q10: How do you evaluate whether a team should build something custom or adopt an existing solution? (Senior)
The Question
Your team needs a workflow orchestration system. There are open-source options (Temporal, Airflow), managed services (AWS Step Functions), and the option to build something custom. How do you approach this decision?
Strong Answer
“I start with a strong bias toward not building custom — and then I let the specific requirements either confirm or override that bias.
The reason for the bias is simple arithmetic. A custom workflow engine is probably 3-6 months of engineering effort to build, and then it requires ongoing maintenance, bug fixes, documentation, and on-call support forever. Open-source and managed options amortize that cost across thousands of users. The question is whether your specific needs are unusual enough to justify the custom investment.
I evaluate along five dimensions:
Requirements fit. I list the ten most important capabilities we need and score each option on how well it covers them natively. If an existing solution covers 80%+ of our needs, the remaining 20% is almost certainly cheaper to work around than to build an entire system for.
Operational cost. This is where teams consistently undercount. Building a workflow engine is the easy part. Operating it — handling upgrades, monitoring, debugging, scaling, and training new team members — is the hard part. A managed service like Step Functions eliminates most of this. Self-hosted open source (Temporal) reduces it but does not eliminate it. Custom means you own all of it forever.
Team expertise. If the team has deep experience with one of the options, that is a significant factor. A tool the team knows how to operate and debug is worth more than a theoretically superior tool nobody understands.
Vendor lock-in risk. Step Functions ties you to AWS. Temporal is portable but still a dependency. Custom gives you full control. How much does this matter? If you are already deeply on AWS and have no plans to leave, the lock-in cost is near zero.
Evolution speed. How fast are your requirements changing? If you know the workflow patterns well and they are stable, an existing solution is great. If you are in an exploratory phase and the patterns are shifting every month, a custom solution gives you more flexibility — but only if you actually have the team capacity to iterate on it.
For most teams in most situations, I would recommend starting with a managed service (Step Functions or equivalent) for its operational simplicity, evaluating Temporal if the workflow complexity genuinely exceeds what the managed service supports, and only building custom if there is a specific, articulated requirement that no existing solution can meet.
The red flag I watch for: ‘Let us build our own because existing solutions do not do X’ — where X is a nice-to-have, not a hard requirement. That is rationalized NIH (Not Invented Here) syndrome.
War Story: At a logistics company, the platform team spent 5 months building a custom job scheduler because Airflow ‘did not fit our needs.’ The specific objection was that Airflow’s UI was ‘not intuitive enough for our data analysts.’ Five months and 12,000 lines of Go later, the custom scheduler had no UI at all — analysts had to submit jobs via curl commands. It also lacked retry logic, dependency management, and alerting — all things Airflow provides out of the box. The team spent another 4 months adding those features. By the end, they had spent 9 engineering-months building a worse version of Airflow plus a React dashboard. If they had spent 2 weeks customizing Airflow’s UI with a plugin or built a thin wrapper around its API, they would have had a better solution in a fraction of the time. The project lead later admitted the real motivation was that the team wanted to build something in Go, not that Airflow was genuinely insufficient. I now ask a pointed question in build-vs-buy discussions: ‘If the existing tool were written in our preferred language, would we still want to build custom?’
Contrarian Take: The total cost of adopting a third-party tool is almost always 3-5x higher than the team estimates, because they count the integration cost but forget the ongoing costs: staying current with upgrades, debugging interactions between the tool and your system, training new hires on the tool, and dealing with the tool’s bugs and limitations. That said, the total cost of building custom is usually 5-10x higher than the team estimates. So ‘buy’ is still usually cheaper, but by a smaller margin than people think. The honest comparison is not ‘free open-source tool vs. months of custom development.’ It is ‘200K+ in custom development and permanent ownership.’
What Most Candidates Say vs What Great Candidates Say:
- Most candidates: ‘I would evaluate the options based on features and pick the best fit.’ (No framework for evaluation, no mention of operational cost, team expertise, or long-term maintenance.)
- Great candidates: ‘I evaluate across five dimensions: requirements coverage (does it do 80%+ of what we need?), operational cost (who runs it, patches it, debugs it at 3 AM?), team expertise (do we know how to operate this, or are we learning a new system under production pressure?), lock-in risk (what does it cost to leave?), and evolution speed (how fast are our requirements changing?). For most teams, I have a strong bias toward managed services for non-differentiating infrastructure.’
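The five-dimension evaluation described in this answer can be made explicit as a small weighted scoring matrix. The sketch below is only one way to do it: the weights, the 1-5 scores, and the generic option names are all illustrative placeholders, not measurements of any real product, and a real evaluation would fill them in from the team's own requirements list.

```python
# Relative importance of each dimension; assumed values that sum to 1.0.
WEIGHTS = {
    "requirements_fit": 0.30,
    "operational_cost": 0.25,
    "team_expertise": 0.20,
    "lock_in_risk": 0.10,
    "evolution_speed": 0.15,
}

# Hypothetical scores on a 1-5 scale, where 5 is most favorable on that dimension
# (so a 5 for operational_cost means "cheapest to operate").
OPTIONS = {
    "managed_service": {"requirements_fit": 4, "operational_cost": 5,
                        "team_expertise": 3, "lock_in_risk": 2, "evolution_speed": 3},
    "self_hosted_oss": {"requirements_fit": 4, "operational_cost": 3,
                        "team_expertise": 2, "lock_in_risk": 4, "evolution_speed": 4},
    "custom_build":    {"requirements_fit": 5, "operational_cost": 1,
                        "team_expertise": 2, "lock_in_risk": 5, "evolution_speed": 4},
}


def weighted_score(scores: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Weighted sum across all dimensions; refuses to score an incomplete option."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[dim] * scores[dim] for dim in weights)


def rank(options: dict[str, dict[str, float]]) -> list[str]:
    """Option names ordered from highest to lowest weighted score."""
    return sorted(options, key=lambda name: weighted_score(options[name]), reverse=True)


for name in rank(OPTIONS):
    print(f"{name}: {weighted_score(OPTIONS[name]):.2f}")
```

The number it produces matters less than the argument it forces: disagreements move from "I prefer option B" to "I think operational cost deserves more weight than 0.25, and here is why."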
Follow-up: What is the difference between healthy skepticism of third-party tools and NIH syndrome?
“Healthy skepticism is specific: ‘This tool does not support exactly-once delivery semantics, which we need for financial transactions.’ NIH syndrome is vague: ‘I do not trust other people’s code’ or ‘We could build something simpler.’
The test I use: can the person articulate a specific, non-trivial requirement that the existing tool cannot meet, AND can they explain why that requirement is essential (not just preferred)? If yes, building custom might be justified. If the objections are general (‘it is too complex,’ ‘it has too many features we will not use,’ ‘I want to understand every line of code’), that is usually NIH.
I have also seen the inverse anti-pattern: ‘tool worship,’ where a team adopts a powerful third-party solution for a simple problem and then wrestles with its complexity. Kubernetes for a single service. Kafka for 10 messages per second. Elasticsearch for a dataset that fits in PostgreSQL with room to spare. The right tool is the simplest one that meets your actual requirements, whether that is custom-built, open-source, or managed.”
Going Deeper: How does the build-vs-buy calculus change when the component in question is your company’s core competency?
“This is where the calculus flips completely. For supporting infrastructure — workflow orchestration, logging, CI/CD, monitoring — I default to buy or adopt. These are solved problems and your engineering time is better spent on what differentiates your product.
But for your core differentiator — the thing that makes your product valuable — building custom is usually the right call. If you are a fintech company, your payment orchestration engine might look similar to a generic workflow engine, but the specific requirements around idempotency, reconciliation, regulatory compliance, and real-time fraud detection make it your competitive moat. Building that on top of a generic tool means you are constantly fighting the tool’s assumptions.
Amazon could have used an off-the-shelf recommendation engine. Netflix could have adopted a standard CDN. Google could have used existing search infrastructure. They built custom because those systems ARE their product.
The principle: use commodity solutions for commodity problems. Build custom for your differentiation. The hard part is being honest about what is actually your differentiator versus what you just want to build because it sounds interesting. In my experience, engineering teams tend to overestimate how much of their stack is differentiated. Usually it is 10-20% of the codebase. Everything else should be as boring and standard as possible.”
Unexpected Tangent: How do you evaluate a third-party tool’s actual reliability versus its claimed reliability?
“Marketing pages and GitHub stars tell you nothing about operational reliability. Here is my actual evaluation checklist, learned from adopting tools that looked great and then burned us:
Check the issue tracker, not the README. Search for ‘data loss,’ ‘corruption,’ ‘production,’ and ‘outage.’ Look at how maintainers respond. A project where critical bugs get triaged in hours is worth 10x a project where they sit for months. Temporal’s GitHub issues, for example, show rapid response from the core team on production-affecting bugs — that is a strong signal.
Find companies at your scale that run it in production. Not 10x your scale (their problems are different) and not conference-talk testimonials (survivorship bias). Find companies at roughly your stage, roughly your traffic volume, and talk to their engineers directly. LinkedIn messages with ‘we are evaluating X, would love 15 minutes of your experience’ have a surprisingly high response rate.
Deploy it in a non-critical path first and measure operational overhead for 30 days. How often does it need attention? What is the on-call burden? At a previous company, we evaluated a message queue by running it for a month handling internal analytics events (low stakes). It crashed twice due to a known JVM GC issue at our heap size. That 30-day trial saved us from deploying it on our payment processing pipeline.
Read the upgrade path. How do major version upgrades work? Do they require downtime? Schema migration? Data reindexing? A tool that is easy to adopt but nightmarish to upgrade becomes a trap. I specifically check if the last 3 major version upgrades had a clean migration path and how many GitHub issues mention ‘upgrade’ alongside ‘broken’ or ‘data loss.’
Check the bus factor. If the project has 1-2 active maintainers and they work at a startup that might not exist next year, that is a risk you need to price in. If the project is backed by a foundation (CNCF, Apache) or a well-funded company (Temporal Inc., Confluent), the continuity risk is lower.”
Q11: How do you handle disagreements with a more senior engineer about a technical approach? (Intermediate)
The Question
You are a senior engineer. A staff engineer on your team proposes an approach you believe is wrong. They have more experience and more organizational credibility than you. How do you handle the disagreement?
Strong Answer
“I start from the assumption that they know something I do not. That is not deference — it is Bayesian reasoning. They have more experience, they have probably seen more failure modes, and they have context about the system and the organization that I might lack. So my first move is not to argue — it is to understand.
I would say: ‘Can you walk me through the reasoning behind this approach? I want to make sure I understand the constraints you are designing for.’ This accomplishes two things: I might learn something that changes my mind (it happens more often than you would think), and if their reasoning has a gap, I can point to the specific gap rather than making a vague objection.
If after understanding their reasoning I still believe the approach is wrong, I escalate my concern — but with data, not opinions. ‘I ran some numbers and this approach would require 3x the storage at our projected scale’ is much more persuasive than ‘I think this will not scale.’ If I can build a quick prototype that demonstrates the problem, even better. Evidence beats authority.
I also pay attention to the type of disagreement. Is this a fundamental architectural concern where being wrong is costly? Or is it a stylistic preference where both approaches work? For the first type, I push harder and am willing to escalate. For the second, I defer and save my credibility for the fights that matter.
There was a situation where a staff engineer wanted to use eventual consistency for a feature that I believed required strong consistency — it involved inventory counts where overselling would cost real money. I gathered three specific scenarios where eventual consistency would lead to overselling, estimated the financial impact based on our order volume, and presented them. The staff engineer agreed the scenarios were valid and we redesigned that component with strong consistency. No ego involved on either side — the data spoke.
The worst outcome in a technical disagreement is not picking the wrong approach. It is picking no approach because the team is stuck in analysis paralysis, or picking a compromise that is worse than either original proposal.
War Story: Early in my career, a principal engineer proposed replacing our PostgreSQL read replicas with a Redis cache layer for our product catalog. I believed this was wrong — the catalog had complex relational queries (category trees, attribute filtering, cross-sells) that Redis could not express without denormalizing everything. But I made the mistake of arguing in a group Slack channel with vague objections: ‘I think this adds too much complexity.’ The principal dismissed it. I escalated by sending a long email to the engineering director. That backfired — it looked political, not technical. What I should have done: taken 4 hours to prototype both approaches, benchmarked them against our actual query patterns, and shared the results. When I eventually did this (after the principal’s approach was already approved), the benchmark showed Redis was 8x faster for simple key lookups but required 14 separate Redis calls to reconstruct a single product detail page that PostgreSQL served in one query. Total latency was actually worse. The principal looked at the data and reversed his own decision. The lesson: data is the ultimate organizational equalizer. A junior engineer with a benchmark beats a principal engineer with an opinion.
Contrarian Take: Most advice says ‘disagree and commit’ — meaning once a decision is made, support it fully. This is mostly correct, but there is a dangerous edge case: if you genuinely believe the decision will cause irreversible harm (data loss, security breach, regulatory violation), ‘disagree and commit’ is the wrong framework. In those cases, you have a professional obligation to escalate, even if it is uncomfortable. The phrase should be ‘disagree and commit for reversible decisions; disagree and escalate for irreversible ones.’ Knowing the difference is a senior-level judgment call.
What Most Candidates Say vs What Great Candidates Say:
- Most candidates: ‘I would respectfully share my opinion and defer to seniority.’ (Too passive. Also suggests seniority equals correctness, which is cargo-cult thinking.)
- Great candidates: ‘I would start by genuinely trying to understand their reasoning — they may have context I lack. If I still disagree after understanding their position, I would produce data: a prototype, a benchmark, a cost analysis, or concrete failure scenarios with estimated financial impact. Evidence beats authority. And I would classify the disagreement: if it is a reversible decision, I would express my concern, document it, and commit fully. If it is irreversible and high-stakes, I would push harder and involve additional perspectives.’
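The benchmark result in the war story (8x faster lookups, yet worse page latency) is easy to sanity-check with back-of-the-envelope arithmetic. The story gives no absolute timings, so the sketch below rests on stated assumptions: the 8 ms PostgreSQL figure is hypothetical, the single page query is assumed to cost about as much as the lookup the 8x figure was measured against, and the 14 Redis calls are assumed to run one after another with no pipelining or batching.

```python
def sequential_latency(calls: int, per_call_ms: float) -> float:
    """Total latency when each call waits for the previous one to finish."""
    return calls * per_call_ms


postgres_query_ms = 8.0                   # hypothetical single-query latency
redis_lookup_ms = postgres_query_ms / 8   # "8x faster" per lookup, from the story
redis_page_ms = sequential_latency(14, redis_lookup_ms)  # 14 calls per page, from the story

slowdown = redis_page_ms / postgres_query_ms
print(f"PostgreSQL page: {postgres_query_ms} ms, Redis page: {redis_page_ms} ms "
      f"({slowdown}x the PostgreSQL latency)")
```

Under these assumptions the ratio is 14 / 8 regardless of the absolute timing chosen, which is why a per-operation speedup alone says little about end-to-end latency. Batching or pipelining the Redis calls would change the picture, and a real benchmark would need to measure that rather than assume it.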
Follow-up: What if you present data and the senior engineer still disagrees?
“Then I need to consider a few possibilities. Maybe my data is incomplete — perhaps there is a constraint I am not seeing. Maybe the senior engineer is wrong but the organizational dynamics make it difficult for them to reverse their position publicly. Or maybe this is genuinely a judgment call where reasonable people can disagree.
If it is a reversible decision, I let it go. I document my concern in the design doc or code review for the record, and I let the implementation prove or disprove us. If I was right, the data will show it and we can adjust. If I was wrong, I learn something.
If it is an irreversible decision with significant risk, I escalate — respectfully and through the right channels. I would go to the engineering manager or the tech lead and say: ‘I want to raise a concern about this design decision. Staff engineer X and I disagree on approach Y. Here is the data I have gathered. I want to make sure this gets a broader review before we commit.’ This is not going over someone’s head — it is responsible engineering.
What I never do is comply outwardly while sabotaging implicitly. If I disagree with the decision but the team commits to it, I commit fully. I implement it as well as I can. Passive-aggressive compliance — building the thing poorly to prove a point — is career poison and harms the team.”
Follow-up: How do you create a team culture where junior engineers feel safe disagreeing with you?
“I actively invite disagreement and reward it when it happens. Concretely:
In design reviews, I ask specific people for their concerns. Not ‘does anyone have thoughts?’ but ‘Alice, you worked on the similar system at your last company — does anything about this approach worry you?’ Directed questions give people permission to speak up.
When someone finds a flaw in my proposal, I thank them publicly. ‘Good catch — I had not considered that race condition. Let me rethink this section.’ The first time a junior engineer sees a senior engineer openly accept criticism and change their mind, it permanently changes the team’s dynamic.
I also share stories about times I was wrong. Not in a performative way, but naturally: ‘This reminds me of a service I designed two years ago that hit scaling problems because I underestimated the write volume. Let me make sure we are not making the same mistake here.’ This signals that being wrong is normal and recoverable, not career-threatening.
The thing I am most careful about is never punishing disagreement, even indirectly. If a junior engineer pushes back on my idea and turns out to be wrong, I make sure they feel just as valued as when they push back and are right. The goal is to reward the act of critical thinking, not just the accuracy of the conclusion.”
Unexpected Tangent: How do you handle the situation where you disagree with the senior engineer and they turn out to be right?
“This is the situation nobody prepares for, and how you handle it defines your reputation more than any technical win. The natural human instinct is to quietly move on and hope nobody remembers. The senior move is to explicitly acknowledge it.
I once pushed hard against a staff engineer’s proposal to use event sourcing for our order management system. I argued it was over-engineering — CRUD was fine for our scale. Six months later, we needed to add a complete audit trail for regulatory compliance, replay events to reconstruct order states for a billing dispute, and support real-time event streaming to a new analytics platform. Event sourcing would have given us all three for free. CRUD meant we spent 8 weeks retrofitting audit logging, building a manual state reconstruction tool, and adding change data capture to stream events out of PostgreSQL.
I went to the staff engineer and said: ‘You were right about event sourcing. I underestimated the value of the event log as a first-class data structure. Here is what I would evaluate differently next time.’ She appreciated the acknowledgment, but more importantly, the team saw a senior engineer openly update their mental model based on evidence. That created a culture where changing your mind was seen as strength, not weakness.
The practical habit: I keep a ‘decisions I was wrong about’ list. Not for self-flagellation — for calibration. After reviewing 20+ wrong decisions, I noticed a pattern: I consistently underweighted future flexibility and overweighted current simplicity. That pattern recognition has improved my decision-making more than any book on architecture.”
Q12: Describe a situation where doing nothing was the correct engineering decision. (Staff-Level)
The Question
Tell me about a time where the right technical decision was to explicitly choose inaction — to not build something, not fix something, or not migrate something — even though there was pressure to act.
Strong Answer
“We had a legacy billing service written in PHP that handled all invoice generation. It was ugly — the code was poorly structured, had no tests, and everyone complained about it. When our platform team proposed a company-wide migration to Go microservices, there was strong momentum to include billing in the rewrite. The team estimated 3-4 months to rebuild billing in Go.I argued that we should do nothing. Here is why:The system worked. It processed 50,000 invoices per month with a 99.97% success rate. The 0.03% failures were all edge cases with specific international tax calculations that were manually resolved. Not elegant, but functional and understood.The risk of rewriting was high. Billing is a domain with enormous hidden complexity. Tax rules, proration logic, currency conversion, refund handling, credit notes, compliance requirements for different jurisdictions. This complexity was encoded in the existing codebase through years of bug fixes and edge case handling. Rewriting it meant re-discovering every edge case the hard way — through production incidents that could affect actual revenue.The maintenance cost was low. Yes, the code was ugly. But how often did we actually change it? I checked the git log: 4 commits in the last 6 months. Two were dependency updates, one was a tax rate change, and one was a bug fix. We were spending maybe 2 hours per month on this service. Rebuilding it would cost 3-4 months upfront and would likely require more maintenance initially as we hit new edge cases.The opportunity cost was real. Those 3-4 engineering months could go toward building our new analytics pipeline, which had direct revenue impact.I presented this analysis to the team: ‘The billing service is ugly but stable, cheap to maintain, and handles a domain with enormous hidden complexity. Rewriting it risks months of rework plus production billing errors that directly affect our customers and our revenue. 
I recommend we leave it alone and invest our time in higher-impact work.’The pushback was emotional: ‘But it is PHP!’ ‘But it does not match our new architecture!’ These were real concerns but they were aesthetic, not functional. I asked the critical question: ‘What specific problem will rewriting billing solve that justifies 3-4 months of work and the risk of billing errors?’ Nobody could name one that was not cosmetic.We did not rewrite it. Two years later, it was still running with that same 99.97% success rate. Eventually, when we genuinely needed to change billing — to add multi-currency support — we did a targeted modification of the existing service rather than a full rewrite. It took two weeks instead of four months.War Story: The billing example is mild compared to the most painful ‘do nothing’ decision I ever had to defend. At a B2B SaaS company, our event tracking pipeline was built on a custom Kafka consumer written in Scala by an engineer who had since left. The Scala code was dense, the build system was SBT (which nobody else on the team knew), and it took 45 minutes to compile. Every engineer who touched it swore afterward. The new platform lead wanted to rewrite it in Go — estimated 6 weeks. I dug into the actual operational data: the pipeline processed 800 million events per month with zero data loss over 14 months. It had been touched exactly twice in that period — both times to update a dependency version. The SBT build was slow but ran in CI, not locally. I calculated the real maintenance cost: about 3 hours per quarter. The Go rewrite would cost 6 weeks of a senior engineer’s time (roughly 60K over 6 months, maintenance cost of the existing system ~$1,200 over the same period. We kept the Scala pipeline. Three years later, it was still running. 
When we finally replaced it, it was because we moved to a managed service (AWS Kinesis Data Firehose), not because we rewrote it in Go.

Contrarian Take: The most undervalued skill in software engineering is the ability to say ‘no, we should not build that.’ Engineering culture celebrates building. Promotions reward building. Conference talks are about things people built. But the engineer who prevents 6 weeks of unnecessary work has created exactly as much value as the engineer who builds something valuable in 6 weeks; the opportunity cost math is symmetric. The problem is that preventing unnecessary work is invisible. Nobody gives a conference talk titled ‘How I Saved My Company 6 Weeks By Not Rewriting a Service.’ But those invisible decisions are where the highest-leverage engineering judgment lives.

What Most Candidates Say vs What Great Candidates Say:

- Most candidates: ‘I believe in continuous improvement, so I would always try to improve the system.’ (Sounds proactive but is actually undisciplined: it does not distinguish between improvement that delivers value and improvement that is wasted effort.)
- Great candidates: ‘I evaluate three things before acting: the actual operational cost of the status quo (not the perceived cost), the total cost of the proposed change (including migration risk, re-learning edge cases, and opportunity cost), and whether the change delivers business value or just engineering satisfaction. Some of the best engineering decisions I have made were choosing not to act when there was pressure to do something.’
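The first two checks in that answer (real cost of the status quo versus total cost of the change) can be sketched as a small calculation. The sketch below uses the billing numbers from the story above: about 2 hours per month of maintenance, and the low end of the 3-4 month rewrite estimate. The 160-hour engineer-month, the assumption that the rewrite needs zero maintenance afterwards (a best case for the rewrite), and all names are illustrative assumptions, not figures from the original analysis.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: one engineer-month is roughly 160 working hours.
HOURS_PER_ENGINEER_MONTH = 160


@dataclass
class Option:
    name: str
    upfront_hours: float
    maintenance_hours_per_month: float

    def total_hours(self, horizon_months: int) -> float:
        """Upfront cost plus ongoing maintenance over the planning horizon."""
        return self.upfront_hours + self.maintenance_hours_per_month * horizon_months


def break_even_months(status_quo: Option, change: Option) -> Optional[float]:
    """Months until the change pays back its extra upfront cost.

    Returns None when the change does not reduce monthly maintenance,
    because in that case it never pays back on maintenance hours alone.
    """
    monthly_saving = (
        status_quo.maintenance_hours_per_month - change.maintenance_hours_per_month
    )
    if monthly_saving <= 0:
        return None
    extra_upfront = change.upfront_hours - status_quo.upfront_hours
    return extra_upfront / monthly_saving


# Numbers from the story: ~2 h/month of maintenance, 3-4 months to rewrite.
keep_php = Option("keep PHP billing", upfront_hours=0, maintenance_hours_per_month=2)
# Hypothetical best case for the rewrite: zero maintenance once it ships.
rewrite_go = Option(
    "rewrite in Go",
    upfront_hours=3 * HOURS_PER_ENGINEER_MONTH,
    maintenance_hours_per_month=0,
)

if __name__ == "__main__":
    print(break_even_months(keep_php, rewrite_go))
```

Under these assumptions the rewrite needs 240 months to pay for itself on maintenance hours alone, which is why the argument has to rest on some benefit other than "the code is ugly." Migration risk and opportunity cost are not modeled here; they only push further toward leaving the system alone.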
Follow-up: How do you distinguish between ‘healthy inaction’ and ‘dangerous complacency’?
“The key test is whether the decision to do nothing is active or passive. Active inaction means you have evaluated the situation, considered the alternatives, quantified the costs and risks, and deliberately chosen that doing nothing is the best use of resources. You can articulate why. Passive inaction means you are avoiding the problem because it is uncomfortable, or you are hoping it goes away.

Concretely, healthy inaction looks like: ‘We evaluated rewriting the billing service. Here is the risk analysis. Here is the opportunity cost. The recommendation is to not rewrite at this time. We will revisit when X condition changes.’

Dangerous complacency looks like: nobody has looked at the billing service in a year, nobody knows what it does, and when someone raises a concern, the response is ‘it’s fine, do not touch it.’ That is not a decision; that is ignorance.

I also set review triggers. When I recommended against rewriting billing, I said: ‘We should revisit this decision if the failure rate exceeds 0.1%, if we need to make more than 2 feature changes per quarter, or if we lose the last person who understands the codebase.’ Those triggers turn inaction from a permanent state into a monitored position.”

Follow-up: How do you manage the team’s frustration when they want to build something and you are saying no?
“I acknowledge the frustration honestly. Engineers want to build things. Working on a greenfield Go microservice is more exciting than maintaining legacy PHP. That is a legitimate emotional reality, and dismissing it with ‘be professional’ does not help.

What I try to do is channel that energy productively. The engineers who wanted to rewrite billing were really expressing a desire to work with modern tools and learn new skills. So I found other opportunities for that: they led the Go migration for a different service that was lower-risk and higher-value to rewrite.

I also try to make the ‘do nothing’ decision feel empowering rather than defeatist. Saying ‘We chose not to rewrite billing, and here is the analysis that shows why that saves us 4 months of effort’ is a story about smart prioritization. The team made a deliberate, data-driven decision to protect their time for higher-impact work. That is a stronger narrative than ‘we just could not get to it.’

The engineer who can argue persuasively for not building something is often more valuable than the one who can build anything. Knowing what not to do is a higher-order skill than knowing how to do things.”

Going Deeper: What signals tell you it is time to reverse the ‘do nothing’ decision?
“There are both quantitative and qualitative signals:

Quantitative: The metrics I set as trigger conditions (failure rate increasing, maintenance hours climbing, the number of workarounds growing). When the cost of inaction starts exceeding the cost of action, it is time to act.

Qualitative: When the team starts routing around the system. If engineers are building workarounds, duplicating functionality in other services, or avoiding the system entirely because it is too risky to modify, the system is no longer ‘working fine.’ It is a tumor that the organization is growing around.

Another signal: when the knowledge of how the system works concentrates in fewer and fewer people. A system that two people understand is a risk. A system that one person understands is a crisis waiting to happen. A system that nobody understands is already a crisis; you just do not know it yet.

The most dangerous signal is the one you do not get: the gradual normalization of small failures. When the team starts saying ‘yeah, that service does that sometimes, just retry,’ that normalization is the precursor to a major incident. Small failures that are accepted rather than investigated are the canary in the coal mine.

The discipline is setting those review triggers at decision time, not after the crisis. If you decide to do nothing and you do not define what would change your mind, you have not made a decision; you have just procrastinated.”

Unexpected Tangent: How do you deal with the resume-driven development impulse: engineers wanting to rewrite systems because they want to learn new technology?
“I treat it as a legitimate human need, not something to suppress. Engineers who never get to work on interesting problems leave for companies that let them. Pretending this motivation does not exist is how you lose your best people.

My approach is to create sanctioned outlets for technology exploration that do not risk production systems. At one company, I instituted ‘Tech Exploration Fridays’: 4 hours every other Friday where engineers could prototype anything using any technology. The rule: it must be potentially useful to the team, but it does not need to ship. One engineer’s Friday prototype of a Go-based metrics aggregator actually replaced a fragile Python script in production. Another engineer’s Rust experiment taught us that Rust was not the right fit for our team and saved us from a misguided rewrite proposal.

The key insight: the desire to work with new technology is not the problem. The problem is when it masquerades as a technical rationale. ‘We should rewrite this in Go because Go is better for our use case’ is often really ‘I want to learn Go.’ Both are valid, but they need different evaluation frameworks. A legitimate technical migration needs a cost-benefit analysis, risk assessment, and timeline. A learning opportunity needs a sandbox, a time box, and zero production risk.

When someone proposes a rewrite and I suspect resume-driven development, I ask a diagnostic question: ‘If we had to do this rewrite in the same language as the current implementation, would you still advocate for it?’ If the answer is ‘yes, because the architecture needs to change,’ the motivation is technical. If the answer is ‘well, no, the architecture is fine, but since we are rewriting anyway…’ the motivation is the language, and we should find a different way to satisfy that need.”

Advanced Interview Scenarios
S1: Your VP asks you to present a migration plan. The 'obvious' answer will get you killed. (Staff-Level)
The Scenario
Your VP of Engineering asks you to present a plan for migrating your company’s primary database from a self-hosted PostgreSQL cluster to Amazon Aurora. The VP heard at a conference that Aurora “solves scaling problems” and wants a 90-day migration timeline. The database is 4TB with 200+ tables, serves 15 microservices, and handles 12,000 queries per second at peak. What do you actually do?

What weak candidates say: “Here is my 90-day migration plan: phase 1 is schema migration, phase 2 is data migration using DMS, phase 3 is cutover with a maintenance window. We should use AWS Database Migration Service and follow their documentation.”

What strong candidates say:

“Before I present a migration plan, I need to present a migration decision. The VP has jumped from a problem (scaling) to a specific solution (Aurora) without establishing whether the problem actually exists or whether this solution fits. My first conversation with the VP is not about timelines; it is about problem definition.

I would come prepared with data:

Is there actually a scaling problem? I would pull the last 6 months of database metrics: CPU utilization, connection count, query latency percentiles, replication lag, storage growth rate. If our PostgreSQL cluster is running at 40% capacity with room to grow, there is no scaling problem. The conference talk sold a solution we do not need.

If there IS a scaling problem, is Aurora the right fix? Maybe the real issue is that 3 of our 200 tables account for 80% of query load, and adding read replicas or targeted caching would solve the problem at 10% of the cost and risk of a full migration. Maybe the problem is poorly optimized queries, not database capacity. I have seen teams migrate to a faster database only to hit the same bottlenecks 6 months later because nobody fixed the N+1 queries.

If Aurora genuinely is the right move, 90 days is almost certainly fantasy.
Here is why: with 15 microservices consuming this database, we are not migrating a database; we are migrating an ecosystem. Each service has connection strings, ORM configurations, PostgreSQL-specific SQL (CTEs, window functions, PL/pgSQL stored procedures), and implicit dependencies on PostgreSQL behavior (Hyrum’s Law guarantees this). I would need to audit every service for Aurora compatibility. In my experience, teams discover 30-50% more migration work than they initially estimate because of these hidden dependencies.

My actual presentation to the VP would be:

‘I investigated the scaling concern. Here is what I found: [data]. Based on this, I see three options ranked by cost and risk: (1) Optimize the current PostgreSQL setup: estimated 3 weeks, lowest risk, addresses 80% of the scaling headroom. (2) Add read replicas and targeted caching: estimated 6 weeks, moderate risk, addresses 95% of headroom. (3) Full Aurora migration: estimated 5-7 months realistically, highest risk, addresses all scaling concerns and reduces operational burden long-term. I recommend starting with option 1, which buys us 12+ months of runway, while we plan option 3 properly if the business growth trajectory demands it.’

This reframes the conversation from ‘execute the VP’s solution’ to ‘solve the VP’s actual problem.’ The engineers who get promoted to staff and principal are the ones who do this reframing: they do not take orders, they take problems.

War Story: At a Series C e-commerce company, the CTO wanted to migrate from MongoDB to PostgreSQL because of consistency issues. I did the audit and found the real issue: 14 of the 16 services were doing unacknowledged writes (w:0) for performance reasons, a setting someone had copy-pasted from a blog post 3 years ago. The writes were fast but MongoDB was not confirming them, so under load, some writes silently dropped. Switching to w:majority writes and adding a few indexes on the hottest collections solved the consistency problems in 2 weeks. The PostgreSQL migration would have taken 6 months and introduced new categories of bugs in services that relied on MongoDB’s document model. The CTO was initially frustrated that I pushed back on his plan, but when I showed the data, he became the biggest advocate for the fix-in-place approach. The lesson: always diagnose before prescribing.”

Follow-up: The VP insists on Aurora because the board wants to see “cloud modernization” on the roadmap. Now what?
“Now we are in political territory, not technical territory, and pretending otherwise is naive. If the board has cloud modernization as a strategic priority, there may be legitimate business reasons (fundraising optics, acquisition readiness, compliance certifications) that override the purely technical calculus.

In that case, I would say: ‘I understand the strategic context. Let me propose a migration plan that achieves the cloud modernization goal while managing technical risk. Instead of migrating the primary database in 90 days, let us start with 2-3 lower-risk services, the ones with simpler schemas and lower traffic. We migrate those to Aurora in 90 days, prove the pattern, build the tooling, and establish the team’s expertise. Then we migrate the core database in quarter 2 with a team that actually knows what they are doing.’

This gives the VP a board-friendly narrative (‘we have begun Aurora migration and completed phase 1 on schedule’) while protecting the business from a rushed migration of the most critical system. The senior engineering skill here is finding the overlap between what leadership needs politically and what engineering needs technically.”

Follow-up: How do you estimate the 5-7 months for the full migration? Break down the work.
“I would break it into phases and estimate each independently:

Phase 0 (Audit, 3-4 weeks): Catalog every table, every stored procedure, every service’s database access patterns. Run pgAudit or query logging to capture actual query patterns: not what the code says it does, but what it actually does in production. Identify PostgreSQL-specific features we rely on. Test Aurora compatibility with our most complex queries.

Phase 1 (Tooling and infrastructure, 2-3 weeks): Set up Aurora cluster, configure networking, establish replication from PostgreSQL to Aurora using DMS for continuous sync. Build the monitoring to compare query performance between the two.

Phase 2 (Shadow traffic, 4-6 weeks): Route a copy of read traffic to Aurora while PostgreSQL remains primary. Compare results and latency. This is where you discover the surprises: queries that behave differently, features that do not exist in Aurora’s PostgreSQL compatibility layer, performance regressions on specific query patterns.

Phase 3 (Service-by-service cutover, 6-8 weeks): Migrate services one at a time, starting with the lowest risk. Each service gets a canary period where it reads from Aurora but writes to both. Monitor for a week before cutting over fully.

Phase 4 (Core service cutover and decommission, 3-4 weeks): The last, riskiest services. Dual-write period, extensive testing, planned maintenance window for the final switchover.

Total: 18-25 weeks, which is 4.5-6 months. I add a month of buffer for the surprises we cannot predict, getting to 5-7 months. And I would bet the upper end.”
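The roll-up in that answer is simple enough to check mechanically. A minimal sketch follows, assuming the phase ranges quoted above; the data layout and function name are illustrative assumptions, not tooling described in the answer.

```python
# (phase label, optimistic weeks, pessimistic weeks), copied from the breakdown above.
PHASES = [
    ("Phase 0: audit", 3, 4),
    ("Phase 1: tooling and infrastructure", 2, 3),
    ("Phase 2: shadow traffic", 4, 6),
    ("Phase 3: service-by-service cutover", 6, 8),
    ("Phase 4: core cutover and decommission", 3, 4),
]


def total_weeks(phases):
    """Sum the low and high estimates independently across all phases."""
    low = sum(lo for _, lo, _ in phases)
    high = sum(hi for _, _, hi in phases)
    return low, high


if __name__ == "__main__":
    low, high = total_weeks(PHASES)
    print(f"{low}-{high} weeks before buffer")
```

Summing the ranges gives 18-25 weeks, matching the stated total. Keeping the phases as data makes it easy to re-run the estimate when the audit in Phase 0 changes any single range.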
S2: Production is down, but the dashboards all say 'green.' (Senior)
The Scenario
It is 2:47 AM. You get a PagerDuty alert from the customer support team’s escalation channel, not from your monitoring. Support has received 400+ tickets in 90 minutes saying “checkout is broken.” You pull up Grafana: all service dashboards are green. Error rates normal. Latency normal. CPU and memory normal. But checkout is clearly broken for real users. What do you do?

What weak candidates say: “I would check the error logs and look for exceptions. If the dashboards are green, maybe it is a client-side issue. I would check the CDN and JavaScript error tracking.”

What strong candidates say:

“Green dashboards during a real outage is one of the scariest scenarios because it means our observability has a blind spot. The system is broken in a way we did not anticipate when we built the monitoring. My first job is not to fix the problem; it is to establish ground truth about what is actually happening.

Step 1: Reproduce it myself (2 minutes). I open the checkout flow in an incognito browser, on mobile, on a different network. If I can reproduce it, I can trace exactly what is happening. If I cannot, the problem is environment-specific, and that narrows things dramatically.

Step 2: Check what the dashboards are NOT measuring (5 minutes). Green dashboards mean our measured metrics are healthy. But what are we not measuring? Common blind spots:

- Business-logic correctness. The API returns 200 OK, but the response payload is wrong: an empty cart, a zero-dollar total, a missing shipping option. The error rate metric counts HTTP status codes, not business logic validity. I would check a sample of actual API responses for the checkout endpoint.
- Client-side errors. The backend is fine, but a JavaScript error in the checkout form prevents the ‘Place Order’ button from working. This would show zero backend errors and zero backend latency impact while making checkout completely unusable. I would check our client-side error tracking (Sentry, LogRocket) if we have it — and if we do not, that is a major observability gap we just discovered the hard way.
- Third-party payment processor. Our service sends the payment request successfully (our metrics see a fast, successful outbound call), but the payment processor is rejecting or timing out silently. We might be logging the request as successful based on the HTTP 200 we get back, without checking that the response body contains a success status. I have seen this exact pattern: Stripe returns 200 with { status: 'failed', reason: 'card_declined' } and the service logs it as a successful request.
- A/B test or feature flag gone wrong. A percentage of users are in an experiment that broke checkout. Our aggregate metrics show low error rates because 80% of users are fine. But the 20% in the broken variant are all experiencing failures. I would check which experiments are active on the checkout flow.
- DNS or CDN partial outage. Some geographic regions cannot resolve our domain or are getting served stale or broken cached assets. Backend metrics are fine because the requests never reach the backend.
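The payment-processor blind spot in the list above comes down to which signal the success metric reads. A minimal sketch of classifying an outcome from the response body instead of the HTTP status alone is shown below. The payload shape mirrors the example quoted in the list; the `succeeded` status value, the class, and the counter names are hypothetical and do not describe any real processor's API.

```python
from dataclasses import dataclass


@dataclass
class ProcessorResponse:
    """Hypothetical wrapper around a payment processor's HTTP response."""
    http_status: int
    body: dict


def payment_succeeded(resp: ProcessorResponse) -> bool:
    """A payment counts as successful only if the body says so.

    An HTTP 200 alone only proves the processor answered the request.
    """
    if resp.http_status != 200:
        return False
    # Assumption: the processor reports the business outcome in body["status"].
    return resp.body.get("status") == "succeeded"


def record_outcome(resp: ProcessorResponse, counters: dict) -> str:
    """Increment a business-outcome counter instead of an HTTP-status counter."""
    key = "payment_success" if payment_succeeded(resp) else "payment_failure"
    counters[key] = counters.get(key, 0) + 1
    return key


if __name__ == "__main__":
    counters = {}
    declined = ProcessorResponse(200, {"status": "failed", "reason": "card_declined"})
    print(record_outcome(declined, counters), counters)
```

With a metric built this way, a wave of declined or failed payments shows up on the dashboard even while every outbound call keeps returning 200.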
Follow-up: How do you build monitoring that catches problems dashboards miss?
“The meta-principle is: measure outcomes, not just components. Component metrics (CPU, memory, error rate) tell you whether the machine is healthy. Outcome metrics tell you whether users are actually achieving their goals.

For a checkout flow, the key outcome metric is checkout completion rate: the percentage of users who start checkout and successfully place an order. If that drops by 10%, something is broken, even if every individual component shows green. This is a business metric, not a technical metric, and that is exactly why it works as a circuit-breaker for the blind spots in technical monitoring.

I would also add synthetic transactions: an automated bot that completes a full checkout flow every 60 seconds in production. Not a health check endpoint that returns 200, but an actual end-to-end flow that exercises the real code path. When that bot fails, you know something real is broken, regardless of what the dashboards say.

The third layer is anomaly detection on support ticket volume. If support tickets spike 3x above the rolling average for that time of day, that is an automatic page to engineering. Humans notice problems that machines miss.”

Follow-up: After the incident, how do you build the case that the company needs to invest in better observability?
“I use the incident itself as the business case. In the postmortem, I quantify the cost:

‘This incident lasted 90 minutes before detection. During that window, approximately 1,200 users attempted checkout and failed. At our average order value of $85, that is $102,000 in lost revenue, some of which we will never recover because users went to a competitor. Our detection time was 90 minutes because we rely on support ticket volume, which is a lagging indicator.

Proposed investment: 2 engineering-weeks to add synthetic monitoring, client-side error tracking, and checkout completion rate dashboards. Expected outcome: detection time drops from 90 minutes to under 5 minutes for this category of issue. At one incident per quarter (conservative), that is $300K+ in prevented revenue loss per year against a one-time cost of 2 weeks of engineering.’

Numbers talk. ‘We need better observability’ gets deprioritized forever. ‘$300K per year in prevented revenue loss’ gets funded.”
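The arithmetic behind that business case can be written out as a short sketch. It uses the figures quoted in the postmortem (1,200 failed checkouts, $102,000 lost, one incident per quarter). The $85 average order value is inferred by dividing the two quoted figures rather than taken from a separate source, and the function names are illustrative.

```python
def lost_revenue(failed_checkouts: int, average_order_value: float) -> float:
    """Upper-bound revenue lost while checkout was broken."""
    return failed_checkouts * average_order_value


def annual_exposure(loss_per_incident: float, incidents_per_year: int) -> float:
    """Yearly revenue at risk if similar incidents keep going undetected."""
    return loss_per_incident * incidents_per_year


# Figures from the postmortem text; the order value is inferred (102,000 / 1,200).
FAILED_CHECKOUTS = 1_200
AVERAGE_ORDER_VALUE = 85
INCIDENTS_PER_YEAR = 4  # "one incident per quarter"

if __name__ == "__main__":
    per_incident = lost_revenue(FAILED_CHECKOUTS, AVERAGE_ORDER_VALUE)
    print(per_incident, annual_exposure(per_incident, INCIDENTS_PER_YEAR))
```

The full annual exposure comes to $408,000, so the "$300K+" claim in the pitch leaves room for incidents that faster detection shortens but does not eliminate.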
S3: The team wants to adopt the hot new framework. Everybody is excited. You think it is a mistake. (Intermediate)
The Scenario
Your frontend team of 6 engineers is excited about migrating the main product from React to a newer framework (pick your era: Next.js, Remix, Svelte, Solid, whatever is hot right now). Three engineers have built side projects with it. There are blog posts from Shopify and Vercel praising it. The team lead is enthusiastic. You are the one senior engineer who thinks this is a bad idea. How do you approach this?

What weak candidates say: “I would present my concerns about migration risk and suggest we keep using React. If the team disagrees, I would go along with it.”

What strong candidates say:

“I have been in exactly this situation, and the first thing I check is whether I am wrong. The most dangerous version of this scenario is the one where I am the dinosaur resisting legitimate progress. So before I argue against the migration, I need to honestly evaluate why I am skeptical.

Is my skepticism based on evidence or on comfort? If my argument boils down to ‘I know React well and I do not want to learn something new,’ that is not a technical objection; it is inertia. I need to separate my emotional attachment to the current stack from my technical assessment of the migration.

Assuming my skepticism survives that self-check, here is my framework:

Apply Survivorship Bias analysis. The blog posts and conference talks about this framework are from teams where it worked. We are not hearing from the teams where it did not: where the migration stalled at 60%, where the framework’s ecosystem was too immature for production edge cases, where the team ended up maintaining two frameworks for 18 months because they could not finish the migration. I would research whether anyone has publicly written about failed migrations to this framework. Those stories are more instructive than the success stories.

Apply Chesterton’s Fence. Our React codebase was not born ugly. It was built over 3 years by real engineers solving real problems.
Before we tear it down, can we articulate what specific problems the new framework solves that React cannot? Not ‘it is faster’ in a benchmark, but does our specific application have the specific performance problems that this framework addresses? If our React app loads in 1.8 seconds and the new framework promises 1.2 seconds, is 600ms worth 6 months of migration work? For a stock trading app, maybe. For an internal admin dashboard, absolutely not.

Quantify the migration cost honestly. I have never seen a major frontend migration finish on time. Our React app has 150 components, 40 custom hooks, 3 third-party integrations that assume React’s lifecycle model, and a test suite with 800 tests. None of that migrates automatically. Realistically, a migration like this is 6-9 months with the whole team, or 12-18 months with a partial team, during which we are shipping features at half speed because everyone is context-switching between two frameworks.

What I would actually propose:

‘I am hearing that the team is excited about [framework]. Let me propose we make this decision with data instead of enthusiasm. Pick the most isolated, least critical part of our app (maybe the settings page) and rebuild it with the new framework. Time-box this to 2 weeks with 2 engineers. At the end, we evaluate: How did it feel to develop? Did we hit any ecosystem gaps (missing libraries, incompatible tooling)? Is the performance difference meaningful for our use case? Was the migration process for that component representative of the rest of the app? Then we make the full migration decision based on that evidence.’

This approach respects the team’s excitement, channels it productively, and generates real data to replace the blog posts and side project impressions they are currently relying on.

War Story: At a B2B SaaS company, the frontend team was dying to migrate from Angular to React in 2019. I was skeptical but I did not block it; I proposed the pilot approach.
Two engineers spent 2 weeks migrating the user settings module. They discovered that our authentication library had deep Angular-specific bindings with no React equivalent. Our form validation system was built on Angular’s reactive forms; porting it to React meant either rewriting 60+ forms or adopting a React form library with different semantics that would require retraining the QA team. The i18n setup assumed Angular’s DI system. The 2-week pilot revealed 3 months of hidden infrastructure work that nobody had accounted for. The team decided to do a targeted migration instead: new features would be built in React (using Module Federation for micro-frontends), and existing Angular code would only be migrated when it needed significant changes anyway. Two years later, the app was 60% React, 40% Angular, and both parts worked fine. The gradual approach avoided the 6-month feature freeze that a full migration would have required.”

Follow-up: What if the 2-week pilot goes well and confirms the team’s enthusiasm? Do you change your mind?
“If the pilot genuinely goes well (meaning the migration was smooth, performance improved meaningfully for our use case, and no ecosystem gaps surfaced), then yes, I update my position. The whole point of the pilot was to replace opinion with evidence. If the evidence supports the migration, my job is to help plan it well, not to keep objecting.

But I would pressure-test ‘went well’ before accepting it at face value. The settings page is the simplest, most isolated part of the app. The hard parts are: the payment flow with its complex state management, the real-time dashboard with WebSocket integrations, the form-heavy workflows with intricate validation. I would ask: ‘Based on what we learned in the pilot, what is your confidence that these harder modules will migrate smoothly?’ If the answer is ‘pretty confident’ with no specific reasoning, I am still worried. If the answer is ‘we identified that the WebSocket integration will need a compatibility layer, and here is our plan for that,’ I am much more comfortable.”

Follow-up: How do you handle the social dynamics when you are the lone dissenter and it feels like the team resents your caution?
“This is a real cost of being the skeptic, and I think about it honestly. Being the person who always says ‘wait, but what about…’ erodes social capital. Engineers stop inviting you to brainstorming sessions because they expect you to shoot everything down.

I manage this by being specific about what I support, not just what I oppose. Instead of ‘I think this migration is a bad idea,’ I say ‘I think an unplanned migration is risky. Here is what I propose instead.’ I am not saying no; I am saying ‘yes, differently.’ That is an important distinction.

I also pick my battles. I probably voice concerns about 30% of the decisions where I have doubts and let the other 70% go. The ones I raise are the ones where I believe the downside is large and the team has not considered it. The ones I let go are cases where I think the team might be slightly wrong but the cost of being wrong is low.

And when I am overruled and the decision turns out well, I say so publicly. ‘I was skeptical about adopting [framework] and the team proved me wrong. The migration went smoother than I expected.’ This builds credibility for the times when my skepticism is validated.”
S4: You discover that your team's 'successful' system is only working by accident. (Senior)
The Scenario
You are investigating a performance issue and discover that your team’s order processing pipeline has been silently dropping approximately 0.5% of orders for the last 4 months. The system reports these as successful because a retry mechanism accidentally masks the failures: the retry writes the order to a secondary data store that nothing reads from. Revenue impact is roughly $180K over 4 months. Nobody noticed because all the dashboards show a 99.5% success rate, which is “within SLA.” What do you do?

What weak candidates say: “I would fix the bug that drops orders, add monitoring for the secondary data store, and write a postmortem. Then I would work on recovering the lost orders from the secondary store.”

What strong candidates say:

“This is a Goodhart’s Law scenario playing out in real time. We set ‘99.5% success rate’ as our target, and the system achieved the metric while failing at the actual goal. The orders are not succeeding; they are being misclassified as successful. Our monitoring optimized for the metric, not the outcome. This is not just a bug; it is a systemic observability failure.

Immediate actions (first 4 hours):

First, I need to stop the bleeding. Every hour this continues, more orders are lost. But I also cannot just hack a fix into production at 2 AM without understanding the full picture. My immediate move is to add an emergency alert: if any order is written to the secondary data store, page the on-call engineer immediately. This turns the silent failure into a visible one while I work on the actual fix.

Second, I assess the recovery situation. Those orders in the secondary data store: can we process them? Are they complete records? Is the data still valid (inventory still available, prices not changed)?
I need to understand the recovery path before I commit to anything, because telling leadership ‘we can recover the lost orders’ and then discovering we cannot is worse than saying ‘we are assessing recovery options.’

Third, I escalate to leadership. This is a $180K revenue impact that has been hidden for 4 months. Engineering leadership, product, and likely finance need to know. I am not going to try to fix this quietly and hope nobody notices. The escalation includes: what happened, the scope of impact, what we are doing right now, and when we will have a full recovery and prevention plan.

Root cause analysis (next 2 days):

The technical bug is probably straightforward to fix: the retry is writing to the wrong store. But the interesting root cause is: why did nobody notice for 4 months?

- The SLA was measured wrong. ‘99.5% success rate’ counted HTTP 200 responses, not actual order completion. The retry mechanism returned 200 after writing to the secondary store, so the metric said ‘success.’ This is a measurement problem, not a code problem.
- Nobody validated end-to-end. If we had a reconciliation job that compared orders placed vs orders fulfilled vs revenue booked, the discrepancy would have surfaced in week 1.
- The secondary data store was a ghost. A data store that nothing reads from should not exist. It is either dead code that should be removed, or it serves a purpose nobody remembers (Chesterton’s Fence). In this case, it became an accidental black hole for lost orders.
- Replace HTTP-status-based success metrics with business-outcome metrics. An order is successful when payment is confirmed AND the order appears in the fulfillment queue AND the customer receives a confirmation. Anything less is not success — it is an in-progress state.
- Add daily reconciliation between order intake, payment processing, and fulfillment. Any discrepancy over 0.01% triggers an alert.
- Audit every retry mechanism in our pipeline. Retries that silently swallow failures are time bombs. Every retry should either succeed on the primary path or raise a visible alarm.
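The daily reconciliation described above can be sketched as a small job. This is a minimal illustration, not the pipeline's actual implementation: the function name `reconcile`, the result type, and the idea of passing in plain collections of order IDs are all assumptions. Only the 0.01% alert threshold comes from the text.

```python
from dataclasses import dataclass

ALERT_THRESHOLD = 0.0001  # 0.01% discrepancy, the threshold named in the text


@dataclass
class ReconciliationResult:
    missing_payment: set       # accepted at intake, never paid
    missing_fulfillment: set   # accepted at intake, never reached fulfillment
    discrepancy_rate: float    # lost orders / orders accepted
    should_alert: bool


def reconcile(intake_ids, paid_ids, fulfilled_ids, threshold=ALERT_THRESHOLD):
    """Compare orders accepted at intake against orders paid and fulfilled.

    An order counts as lost if it is missing from either downstream stage,
    regardless of what HTTP status the intake endpoint returned.
    """
    intake = set(intake_ids)
    missing_payment = intake - set(paid_ids)
    missing_fulfillment = intake - set(fulfilled_ids)
    lost = missing_payment | missing_fulfillment
    rate = len(lost) / len(intake) if intake else 0.0
    return ReconciliationResult(missing_payment, missing_fulfillment, rate, rate > threshold)


if __name__ == "__main__":
    # Hypothetical day: 1,000 orders accepted, 5 never reach fulfillment.
    intake = range(1, 1001)
    fulfilled = [i for i in range(1, 1001) if i % 200 != 0]
    result = reconcile(intake, intake, fulfilled)
    print(f"{result.discrepancy_rate:.2%} lost, alert={result.should_alert}")
```

With the hypothetical data above, 5 of 1,000 orders are missing from fulfillment, which is the same 0.5% loss rate as in the scenario and well above the alert threshold. The check never looks at response codes, which is the point: it compares the two ends of the pipeline.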
Follow-up: How do you present this to leadership without it sounding like the engineering team was negligent?
“I frame it as a systemic gap, not a human failure. The messaging is: ‘Our monitoring measured component health, not business outcome. The system was functioning correctly from a technical standpoint — services responded, retries fired, no errors were logged. What we lacked was end-to-end business validation that would have caught the discrepancy between orders accepted and orders fulfilled.’
I would also come prepared with the fix plan and timeline. Leaders can absorb bad news much more easily when it is paired with a concrete plan. ‘We have identified the issue, recovery of X% of affected orders is underway, and we are implementing three systemic changes that will prevent this category of problem going forward. Here is the timeline.’
What I specifically avoid: technical jargon that obscures accountability (‘a race condition in the retry handler caused misrouted writes to a secondary persistence layer’). Leadership translates that as ‘I do not understand what went wrong and I am hiding behind complexity.’ Plain language: ‘Orders were accidentally saved to the wrong place, so they were marked as complete but never actually fulfilled. We are fixing the orders and changing how we measure success so this cannot happen again.’”
Follow-up: After this incident, how do you audit other systems for similar hidden failures?
“I would run what I call a ‘metric honesty audit’ across our critical pipelines. For each system:
- What does the SLA metric actually measure? Trace it back to the code. Does ‘99.9% success rate’ mean 99.9% of HTTP requests returned 200, or 99.9% of business transactions completed end-to-end? These are very different numbers.
- Is there a reconciliation between the start and end of each pipeline? If money enters the system and products leave the system, do those numbers match?
- Are there any ‘dead’ data stores or queues that might be silently accumulating lost records? Run a query on every table and queue that is not part of the primary data flow. If any of them have recent writes, investigate why.
- For every retry mechanism: what happens to the data if the retry ultimately fails? Is there a dead-letter queue? Is the dead-letter queue monitored?
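The audit question about retries can be made concrete with a small wrapper. The sketch below assumes a hypothetical `retry_or_alarm` helper and uses a plain list as a stand-in for a dead-letter queue. It illustrates the rule that a retry either succeeds on the primary path or fails loudly. It is not a description of any particular retry library.

```python
import logging
import time

logger = logging.getLogger("retry_audit")


class RetriesExhausted(Exception):
    """Raised when every attempt on the primary path has failed."""


def retry_or_alarm(operation, payload, dead_letter, attempts=3, base_delay=0.0):
    """Run `operation(payload)` with exponential backoff.

    On final failure the payload goes to `dead_letter` (a list standing in
    for a monitored queue) and an exception is raised, so the caller can
    never mistake the outcome for success.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return operation(payload)
        except Exception as exc:  # broad on purpose: any failure counts
            last_error = exc
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt < attempts:
                time.sleep(base_delay * 2 ** (attempt - 1))
    dead_letter.append(payload)
    logger.error("retries exhausted; payload sent to dead-letter queue")
    raise RetriesExhausted(str(last_error)) from last_error
```

The design choice that matters is the final `raise`: the fallback write is recorded, but it is never reported upstream as a success. In a real system the dead-letter queue would also need its own alert on depth or age, which this sketch leaves out.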
S5: You inherit a team that ships fast but breaks things constantly. (Staff-Level)
The Scenario
You join as a new engineering manager for a team of 8 engineers. The team ships features fast — leadership loves their velocity. But they have had 11 production incidents in the last quarter, 3 of which were P1 (customer-facing outages lasting more than 30 minutes). The previous manager told the team that “moving fast and breaking things is fine.” How do you change the culture without killing the velocity that leadership values?
What weak candidates say: “I would implement code review requirements, add a CI/CD pipeline with tests, and create an on-call rotation. We need to slow down and prioritize quality.”
What strong candidates say:
“The first thing I would not do is announce that ‘things are going to change.’ That is a recipe for the team to see me as the new manager who does not trust them and wants to slow them down. I need to earn trust before I can change behavior.
Week 1-2: Listen and measure. I would read every postmortem from those 11 incidents (if postmortems exist — I suspect they do not). I would categorize each incident: Was it a code bug? A deploy issue? A missing test? A config error? A dependency failure? I am looking for patterns, not individual blame. In my experience, 70% of incidents in teams like this come from 2-3 root categories.
I would also have 1:1s with every engineer and ask: ‘What do you think causes most of our incidents? What would you change if you could?’ Engineers on teams like this usually know exactly what is broken — they have just never been asked or empowered to fix it. I am betting at least half the team is frustrated by the incidents even if they do not show it.
Week 3-4: Find the 80/20 fix. Based on the pattern analysis, I would identify the single change that would prevent the most incidents with the least disruption to velocity. Common examples:
- If most incidents come from deploys: add a canary deployment step with automatic rollback. This costs engineers zero extra time per deploy but catches 60-70% of bad deploys before they go wide.
- If most incidents come from missing error handling: add a 30-minute ‘incident prevention’ code review checklist — not full code review, just a focused check for error handling on the critical path. Most teams can add this without materially slowing velocity.
- If most incidents come from config changes: put configs in version control with a review process. Config changes are a leading cause of outages at many companies, and they often go out completely unreviewed.
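The pattern analysis from weeks 1-2 is a simple Pareto count. The sketch below assumes each postmortem has already been tagged with one root category. The function name `top_categories` and the sample category split are invented for illustration; only the total of 11 incidents comes from the scenario.

```python
from collections import Counter


def top_categories(incident_categories, coverage=0.7):
    """Return the most frequent categories, in order, until their
    combined incident count reaches `coverage` of all incidents."""
    counts = Counter(incident_categories)
    total = sum(counts.values())
    selected, covered = [], 0
    for category, count in counts.most_common():
        if total == 0 or covered / total >= coverage:
            break
        selected.append((category, count))
        covered += count
    return selected


if __name__ == "__main__":
    # Hypothetical tagging of the 11 incidents from the scenario.
    tagged = ["deploy"] * 5 + ["config"] * 3 + ["code bug"] * 2 + ["dependency"]
    for category, count in top_categories(tagged):
        print(f"{category}: {count} incidents")
```

With this made-up split, deploys and config changes together account for 8 of 11 incidents, so those two categories are where the first fix should go.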
Follow-up: One of the senior engineers pushes back, saying the new practices are “red tape” and threatens to leave. What do you do?
“I take the conversation seriously and have it privately. This engineer might be the strongest individual contributor on the team — and losing a top IC over process changes is a real cost.
First, I listen to their specific objection. ‘Red tape’ usually means one of three things: (1) they do not see the problem — incidents do not affect them personally; (2) they agree there is a problem but think my solution is wrong; or (3) they fundamentally value individual autonomy over team process and will resist any guardrail.
For (1), I share the data. ‘You personally have spent 23 hours on incident response this quarter. That is 3 days you could have spent building things.’
For (2), I ask them for an alternative. ‘You have more context on this codebase than I do. If you agree we need to reduce incidents but think my approach is wrong, what would you propose?’ This channels their energy constructively and often produces better solutions.
For (3), this is a harder conversation. Some engineers genuinely thrive in chaos and resist structure. If they cannot accept any guardrails, this may not be the right team for them anymore — but I would exhaust other options first, including giving them autonomy on experimental or greenfield projects where the blast radius of moving fast is smaller.”
Follow-up: How do you measure whether your cultural changes are working, without falling into Goodhart’s Law traps?
“I would track a basket of metrics rather than a single one, specifically because of Goodhart’s Law:
- Incident count per quarter — but paired with incident severity distribution. If incidents drop but the remaining ones are all P1s, we have not actually improved.
- Mean time to detect (MTTD) — how quickly we find incidents. This measures observability.
- Mean time to resolve (MTTR) — how quickly we fix incidents. But I pair this with mean time between incidents (MTBI) to make sure we are not just resolving faster by closing prematurely.
- Feature velocity — story points shipped, PRs merged, or whatever proxy the team already uses. This ensures we are not trading velocity for reliability.
- Team sentiment — quarterly anonymous survey asking ‘how confident are you in the stability of our systems?’ and ‘how much time do you spend on unplanned work vs planned work?’
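The incident-related part of that basket can be computed from plain incident records. The following is a sketch under stated assumptions: the `Incident` fields and the `reliability_basket` function are hypothetical, MTTR is measured here from detection to resolution, and MTBI is the mean gap between successive incident start times. Teams define these metrics differently, so treat the definitions as one reasonable choice.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    started: datetime   # when the fault began
    detected: datetime  # when an alert or a person noticed
    resolved: datetime  # when service was restored
    severity: str       # e.g. "P1", "P2"


def reliability_basket(incidents):
    """Summarize MTTD, MTTR, MTBI and severity mix for a list of incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    ordered = sorted(incidents, key=lambda i: i.started)
    mttd = mean((i.detected - i.started).total_seconds() for i in ordered) / 60
    mttr = mean((i.resolved - i.detected).total_seconds() for i in ordered) / 60
    # Gaps between consecutive incident starts, in days.
    gaps = [(b.started - a.started).total_seconds() / 86400
            for a, b in zip(ordered, ordered[1:])]
    return {
        "mttd_minutes": mttd,
        "mttr_minutes": mttr,
        "mtbi_days": mean(gaps) if gaps else None,
        "by_severity": dict(Counter(i.severity for i in ordered)),
    }
```

Reporting all four values together is what guards against gaming: a falling MTTR with a shrinking MTBI, or a falling incident count with a rising share of P1s, shows up immediately in the same summary.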
S6: Two teams built the same thing independently and nobody noticed for 6 months. (Senior)
The Scenario
You are doing an architecture review and discover that Team Alpha (payments) and Team Beta (subscriptions) have each independently built their own retry-with-backoff library, their own dead-letter queue implementation, and their own idempotency-key generation system. Both work. Both are slightly different. Neither team knows about the other’s implementation. How do you think about this?
What weak candidates say: “This is clearly a waste. We should consolidate into a single shared library and have both teams adopt it. I would create a platform team to own shared infrastructure.”
What strong candidates say:
“My first reaction — ‘this is wasteful, we need to consolidate’ — is the obvious answer, and it is probably wrong. Before I propose any action, I need to apply Conway’s Law and think about why this happened.
Why it happened matters more than what happened. Two teams building the same thing independently is not an engineering failure — it is an organizational signal. It means:
- The teams do not have a communication channel for shared infrastructure. There is no architecture review, no RFC process, and no shared Slack channel where someone would say ‘hey, we just built a retry library.’
- The teams were incentivized to ship fast, not to collaborate. If the KPI is ‘features shipped per quarter,’ of course teams will build what they need locally instead of spending 3 weeks negotiating a shared solution with another team.
- A platform team does not exist, and maybe it should not. Creating shared infrastructure requires someone to own it, maintain it, and support other teams using it. That is a full-time job. At a 30-person engineering org, a dedicated platform team may not be justified.
Before consolidating, I would weigh what a shared library actually costs:
- Migration cost. Both teams have to rewrite their code to use the shared version. Both have tests, both have production behavior they depend on. The shared version needs to support both teams’ requirements, which are subtly different (Hyrum’s Law guarantees their code depends on implementation details of their current version).
- Ongoing coordination cost. A shared library means every change requires coordination. Team Alpha needs a new feature? They have to convince Team Beta it is worth the complexity. Team Beta finds a bug? They have to make sure the fix does not break Team Alpha’s usage. This coordination tax is real and ongoing.
- Bottleneck risk. Who owns the shared library? If it is Team Alpha, Team Beta’s requests get deprioritized. If it is a new platform team, you just increased your headcount for a library. If it is ‘everyone,’ nobody actually maintains it and it rots.
Follow-up: How do you decide when duplication crosses the line from ‘acceptable’ to ‘problematic’?
“I look at three signals:
Divergent correctness. If both implementations should behave the same way but they do not, and the difference causes bugs or inconsistent user experience, that is problematic. Two different retry libraries are fine. Two different ‘calculate sales tax’ implementations that give different answers for the same input are not fine.
Maintenance multiplication. If a security vulnerability is found in one implementation, do we need to remember to patch the other? If a new requirement comes in (like adding circuit breaker support to retries), do we have to implement it twice? If the answer is ‘yes and we keep forgetting,’ consolidation saves future time.
Cognitive load for new engineers. When a new engineer joins and asks ‘how do I implement retries in my service?’, are they confused by two options with no guidance? If the duplication creates decision paralysis for people who should be focused on building features, that is a signal.”
Follow-up: What organizational changes would prevent this from happening in the first place?
“Lightweight architecture visibility — not a heavyweight review board. Specifically:
A shared channel (Slack, Teams) where engineers post when they are about to build something that might be reusable. Not a formal proposal — a one-line message: ‘Hey, I am about to build a retry library for the payments service. Has anyone built one already?’ This takes 30 seconds and prevents months of duplicated work.
A quarterly ‘architecture show and tell’ where each team demos their recent infrastructure work. Not a review — a show-and-tell. The payments team demos their retry library, the subscriptions team says ‘wait, we just built one too,’ and they can decide to align or not. The decision is theirs — but at least they know.
An internal package registry or catalog. Even if teams do not share code, a catalog that says ‘Team Alpha built a retry library (link)’ and ‘Team Beta built an idempotency system (link)’ gives new teams a starting point.
The anti-pattern is a formal architecture review board that must approve all new infrastructure. That creates a bottleneck, slows teams down, and generates resentment. The goal is visibility, not control.”
S7: Your metrics say performance improved. Your users say it got worse. Who is right? (Intermediate)
The Scenario
Your team deployed an optimization last week. The p50 latency dropped from 120ms to 80ms. The p95 dropped from 400ms to 250ms. The team is celebrating. But social media and support channels are flooded with complaints: “the app is slower than ever.” Your metrics say it is faster. Your users say it is slower. Who do you believe?
What weak candidates say: “The users must be experiencing something we are not measuring. I would check client-side performance and network latency.”
What strong candidates say:
“I believe the users. Always. Metrics can be wrong or incomplete. Users experience reality.
But I also believe the metrics are measuring what they claim to measure. So the question is not ‘who is right’ but ‘what is the gap between what we are measuring and what users are experiencing?’ That gap is where the bug lives.
Here are the most likely explanations, in order of probability:
1. The p99 or p99.9 got worse while p50 and p95 improved. This is the most classic trap in performance optimization. An optimization that makes most requests faster can make the slowest requests much slower. For example, if we added an in-memory cache, 95% of requests hit the cache and are blazing fast. But the 5% that miss the cache now pay the cache-lookup overhead PLUS the original database query, making them slower than before. Our aggregate metrics look great because most requests improved. But the users who hit the slow tail experience degradation — and those are often the most engaged users (complex pages, large accounts, rare but real use cases). I would immediately check p99 and p99.9 latency, which are often not on the default dashboard.
2. We changed what we are measuring. The optimization might have changed the measurement point. Did we add a CDN cache that serves responses before they hit the backend? Backend latency drops — we are measuring fewer requests at the backend because the CDN is absorbing them. But the CDN might be serving stale content, or the CDN’s own latency is higher for cache misses, and we are not measuring that. The metric is accurate for what it measures, but it no longer measures the user’s experience.
3. We made the server faster but the client slower. Maybe we moved computation from server to client — returning raw data that the client now has to process. Server-side latency drops (we are doing less work). Client-side rendering time increases (the user’s phone is doing more work). Total perceived time is worse. Our monitoring only measures server-side latency.
4. We improved latency but broke something else. The optimization might have introduced a subtle correctness bug. Pages load faster but show the wrong data, requiring users to retry. Or the optimization broke streaming and progressive loading — now users stare at a blank screen for 800ms before everything appears at once, whereas before they saw progressive content loading in 1200ms but felt engaged after 300ms. Time-to-first-byte improved. Time-to-interactive worsened.
What I would do immediately:
Pull up Real User Monitoring (RUM) data if we have it — actual browser timing from real user sessions, not server-side metrics. Compare Core Web Vitals (LCP, INP, and CLS; INP replaced FID as a Core Web Vital in March 2024) before and after. If we do not have RUM data, this incident is the reason to add it.
Look at the specific complaints. Are users saying ‘the page takes forever to load’ or ‘the page loads but then hangs’ or ‘I have to reload twice’? The symptom tells you the layer.
War Story: At a media company, we optimized our article page API from 200ms to 60ms by pre-rendering articles to static HTML and caching aggressively. Dashboards looked incredible. Users were furious. The issue: our pre-rendered HTML included ads at fixed positions. The ad network’s JavaScript took 2-3 seconds to load after the page rendered, causing massive layout shifts — the page would jump around as ads loaded. Before the optimization, the page loaded slowly but ads loaded during the slow render, so the final layout was stable. After optimization, the page appeared instantly but then spasmed for 3 seconds as ads injected themselves. Perceived performance was much worse despite measured performance being 3x better. The fix was adding skeleton placeholders for ads so the layout was stable from first paint. Lesson: never measure latency without also measuring stability.”
Follow-up: How do you design performance metrics that actually correlate with user experience?
“You need to measure what the user sees, not what the server does. Specifically:
First Contentful Paint (FCP): When does the user see anything? A blank white screen for 2 seconds feels broken even if data is loading.
Largest Contentful Paint (LCP): When is the main content visible? This is the moment the user decides ‘the page loaded.’
Interaction to Next Paint (INP), which replaced First Input Delay (FID) as a Core Web Vital in March 2024: When the user clicks something, how long until the UI responds? A fast-loading page that is unresponsive to clicks feels slow.
Cumulative Layout Shift (CLS): Does the page jump around after loading? Layout instability makes fast pages feel janky and broken.
Custom business metrics: For a checkout flow, measure time-to-checkout-complete, not individual API latencies. For a search app, measure time-from-keystroke-to-results-visible.
The key principle: server-side latency is a component metric. User experience is a composite of multiple component metrics, and the composite can worsen even when every component improves.”
Follow-up: Should you roll back the optimization while you investigate?
“It depends on the severity of user complaints and whether the optimization is easily reversible.
If users are saying ‘I literally cannot complete checkout,’ roll back immediately. User-facing functionality trumps performance gains.
If users are saying ‘it feels slower’ but functionality is intact, I would keep the optimization deployed and investigate in parallel. Rolling back loses the data we are gathering from the current state and delays diagnosis.
But here is the subtle point: even if I keep it deployed, I would tell the team ‘this optimization is not shipped until we understand the user complaints.’ Celebrating a metric win while users are unhappy creates a cultural blind spot. The team needs to internalize that metrics exist to approximate user experience, not replace it.”
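The first explanation in this scenario, a cache that improves p50 and p95 while making the tail worse, can be checked with a few lines of arithmetic. The latency numbers below are invented for illustration, and so are the 80ms hit time, the 30ms miss overhead, and the assumption that the 5% of requests missing the cache are also the slowest queries. The `percentile` helper uses the nearest-rank method.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the value at rank ceil(p% of n)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(round(p * len(ordered) / 100, 9)))
    return ordered[rank - 1]


CACHE_HIT_MS = 80             # assumed latency of a cache hit
CACHE_MISS_OVERHEAD_MS = 30   # assumed extra cost of a failed cache lookup

# Hypothetical latencies for 1,000 requests before the change.
before = [120] * 900 + [400] * 90 + [900] * 10

# After: 95% hit the cache; the 5% that miss (assumed to be the slowest
# queries) pay the lookup overhead plus the original database time.
slowest = sorted(before)[-50:]
after = [CACHE_HIT_MS] * 950 + [ms + CACHE_MISS_OVERHEAD_MS for ms in slowest]

for p in (50, 95, 99, 99.9):
    print(f"p{p}: {percentile(before, p)}ms -> {percentile(after, p)}ms")
```

Under these assumptions p50 goes from 120ms to 80ms and p95 from 400ms to 80ms, while p99 rises from 400ms to 430ms and p99.9 from 900ms to 930ms. A dashboard that stops at p95 would show only the improvement.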
S8: You are asked to estimate a project but you know the estimate will be wrong. (Intermediate)
The Scenario
Your product manager asks you to estimate a project: rebuilding the search feature to support fuzzy matching, autocomplete, and faceted filtering. You have never built a search system before. Your team has never integrated Elasticsearch (or any search engine). The PM needs the estimate for a board meeting next week and wants a number in weeks. How do you provide an honest estimate without sandbagging or setting yourself up for failure?
What weak candidates say: “I would break it down into tasks, estimate each one, add 20% buffer, and give the number. For a search feature with three components, probably 6-8 weeks.”
What strong candidates say:
“This is a situation where the honest answer is: my estimate will be wrong, and what matters is how I communicate the uncertainty — not pretending it does not exist.
Why the estimate will be wrong: There is a concept in estimation called the ‘cone of uncertainty.’ At project start, estimates can be off by as much as 4x in either direction. They only converge toward accuracy as you learn more about the problem. Since we have never built search and never used Elasticsearch, we are at the widest part of the cone. Any single number I give will be a fiction.
What I would actually do:
Instead of one number, I would give three numbers and a spike:
‘I can give you a range right now, and a more precise estimate in one week after we do a technical spike. Here is the range:
- Best case (everything goes smoothly, no surprises): 6 weeks. This assumes Elasticsearch integrates cleanly with our data model, our team ramps up on it quickly, and the product requirements do not change.
- Most likely (normal amount of surprises): 10-12 weeks. This accounts for learning curve on Elasticsearch, data migration complexity, iteration on relevance tuning (which is always harder than expected), and normal requirement clarification.
- Worst case (significant unknowns materialize): 16-18 weeks. This accounts for discovering that our data model is a poor fit for Elasticsearch, needing to redesign the data pipeline, and multiple iterations on search relevance because users do not find what they expect.
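If a single planning figure is needed alongside the range, one common way to combine three such numbers is a PERT-style weighted average. This technique is not part of the answer above; it is shown here as an assumption about how the three cases might be summarized. The inputs use 6 weeks for the best case and the midpoints of the other two ranges (11 and 17 weeks).

```python
def three_point_estimate(best, likely, worst):
    """PERT-style summary: the most likely value is weighted four times,
    and the spread is one sixth of the best-to-worst range."""
    expected = (best + 4 * likely + worst) / 6
    spread = (worst - best) / 6
    return expected, spread


if __name__ == "__main__":
    expected, spread = three_point_estimate(best=6, likely=11, worst=17)
    print(f"expected about {expected:.1f} weeks, give or take {spread:.1f}")
```

For these inputs the weighted average is 67/6, a little over 11 weeks, with a spread of just under 2 weeks. The figure is only as good as the three inputs, so it complements the range and the spike rather than replacing them.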
Follow-up: The PM says “the board wants a single number, not a range. Give me your best guess.”
“I would give them the most-likely number and explicitly state the assumption: ‘10 weeks, assuming no major surprises in the Elasticsearch integration and a single round of relevance tuning. I want to flag that the confidence level on this is moderate — if the spike next week reveals significant data model issues, I will update the estimate immediately.’
The key is putting the assumption on the record. If I say ‘10 weeks’ and the PM tells the board ‘10 weeks,’ that is a commitment. If I say ‘10 weeks assuming X, Y, and Z,’ and any of those assumptions break, I have a justified basis for updating the estimate. This is not covering myself — it is honest communication of what the number depends on.
What I would never do: pad the estimate to 18 weeks to guarantee I come in under budget. That is sandbagging, and it destroys trust if the PM finds out. It also means the board might kill the project because 18 weeks seems too expensive, when 10-12 weeks might be perfectly acceptable.”
Follow-up: How do you handle the situation when you are 6 weeks in and realize it will take 16 weeks instead of 10?
“The moment I realize the estimate is wrong, I communicate immediately. Not at the next standup. Not at the next sprint review. That day. Delays get worse with time, and surprises get less forgivable the later they are revealed.
I would say to the PM: ‘I need to update the estimate. Here is what we have learned: [specific technical findings]. Based on this, the remaining work is X weeks, putting us at 16 weeks total instead of 10. Here is why: [concrete explanation, not vague]. Here are three options: (1) Keep the current scope and timeline extends to 16 weeks. (2) Cut faceted filtering from v1 and we can hit 11 weeks. (3) Ship fuzzy search only next week as a quick win, and autocomplete plus facets follow in a second phase.’
Giving options is critical. ‘It will take 6 weeks longer than planned’ puts the PM in a corner. ‘Here are three paths with different scope and timeline trade-offs’ gives them agency and makes the conversation collaborative instead of adversarial.
The meta-lesson: estimates are not promises. They are hypotheses. The difference between a junior and senior engineer is not estimation accuracy — it is how quickly they detect and communicate deviations, and whether they bring options or just bad news.”
S9: You need to delete a feature that an unknown number of users depend on. (Senior)
The Scenario
Your team needs to remove a 4-year-old feature to simplify the codebase and unblock a major refactor. The feature was built by someone who left the company. Analytics show it gets roughly 200 daily active users out of your 500,000 user base (0.04%). The PM says “just remove it.” You are not so sure. How do you approach this?
What weak candidates say: “200 users out of 500,000 is negligible. I would remove the feature, update the docs, and move on. We can always add it back if users complain.”
What strong candidates say:
“This is a Chesterton’s Fence problem combined with Hyrum’s Law, and the ‘just remove it’ answer has killed more product trust than most people realize.
Why I am not just removing it:
First, analytics might be lying. ‘200 daily active users’ — how is ‘active use’ defined? If the analytics track clicking a specific button, the feature might have 200 clickers and 5,000 users who passively benefit from it (like a background sync, an email digest, or an API integration that triggers without a UI click). I need to understand what the feature actually does end-to-end before I can trust the usage numbers.
Second, Hyrum’s Law applies powerfully to features. Those 200 users might be the most important users. If this is an enterprise SaaS product and those 200 users are power users at your top 3 accounts representing 30% of revenue, removing this feature could trigger a contract renegotiation. ‘0.04% of users’ and ‘30% of revenue’ can describe the same people.
Third, ‘we can always add it back’ is technically true and practically false. Once you remove a feature, the codebase moves on. Other features fill the space. The data model evolves. Adding it back 6 months later is not a rollback — it is a rebuild, and by then you have lost the institutional knowledge of how it worked.
What I would actually do:
Step 1 — Understand what the feature does and who uses it (1-2 days). Read the code. Read the commit history. Check if there are support tickets, feedback emails, or feature requests related to it. Look at the user profiles of the 200 daily users — are they free tier or paying? Are they concentrated in specific accounts?
Step 2 — Soft deprecation (2-4 weeks). Do not remove the feature. Instead, add a banner: ‘We are planning to retire this feature on [date]. If this affects your workflow, please let us know at [feedback link].’ This is minimal engineering effort and provides ground truth about how many users actually care — not how many passively use it, but how many actively depend on it.
Step 3 — Analyze the feedback. If nobody responds, you have strong evidence that removal is safe. If 15 users respond and they are all from your largest enterprise customer, you have very different information than the analytics provided.
Step 4 — Offer a migration path if needed. If users depend on the feature, can we point them to an alternative? A different tool, a workaround, an API they can use to replicate the functionality? Removing a feature and leaving users stranded is how you generate the kind of social media thread that goes viral: ‘Company X just killed the one feature that made their product worth using.’
War Story: At a developer tools company, we wanted to remove a legacy webhook format that analytics said 400 users were using. We added the deprecation banner. Within a week, we got 38 emails — 12 from companies whose entire deployment pipeline depended on the specific JSON structure of our legacy webhook. Two of them were mid-six-figure annual contracts. One had automated tests that parsed our webhook payloads field-by-field — classic Hyrum’s Law. We ended up supporting both the legacy and new webhook format for 18 months with a compatibility shim. The engineering cost of the shim was 2 days. The cost of losing those contracts would have been $800K+ per year. The PM who said ‘just remove it’ was looking at user counts. The business reality lived in revenue per user.”
Follow-up: How do you handle the refactor if you cannot remove the feature? Does the refactor just stall?
“No — this is where engineering creativity matters. The question is not ‘remove the feature to enable the refactor’ but ‘how do we refactor around the feature.’ Options:
Encapsulate it. If the feature’s code is tangled throughout the codebase (which is probably why it is blocking the refactor), extract it into an isolated module with a clean interface. The refactor can proceed around the module. The module becomes a black box that we maintain until we can genuinely deprecate it.
Strangler fig pattern. Build the new system alongside the old feature. Route new users to the new system. Keep the old feature running for existing users. Over time, as users naturally migrate or as the deprecation timeline plays out, the old feature’s usage drops to zero organically.
Feature flag isolation. Put the feature behind a flag and give only the dependent users access. The refactored system does not need to know about the feature — it is entirely encapsulated behind the flag. When the last dependent user migrates, flip the flag off.
The point is that ‘cannot remove a feature’ does not mean ‘cannot refactor.’ It means the refactor is harder, and we need to be creative about the boundary between old and new.”
Follow-up: What is the most common mistake you see teams make when deprecating features?
“Announcing the deprecation date but not enforcing it. I have seen teams announce ‘this feature will be removed on January 1’ and then push the date back three times because one more customer asked for an extension. After the third push, nobody takes the deprecation seriously. Users learn that complaining equals extension, and the feature lives forever.
The second most common mistake is deprecating without understanding the migration path. ‘We are removing feature X’ without a clear answer to ‘what should users do instead?’ guarantees a backlash.
The third mistake is removing features in a big bang. Instead of quietly removing one thing and seeing the reaction, the team batches 5 deprecations into a single release and gets 5x the pushback. Each individual removal might have been fine. All five at once feels like the product is being gutted.”
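The claim in this scenario that ‘0.04% of users’ and ‘30% of revenue’ can describe the same people is easy to verify before any deprecation decision. The sketch below assumes a hypothetical mapping from user ID to annual revenue and a list of the feature's users. The function name and all figures are invented for illustration.

```python
def feature_impact(feature_user_ids, revenue_by_user):
    """Return (share_of_users, share_of_revenue) for a feature's users."""
    users = set(feature_user_ids)
    total_users = len(revenue_by_user)
    total_revenue = sum(revenue_by_user.values())
    feature_users = sum(1 for uid in revenue_by_user if uid in users)
    feature_revenue = sum(rev for uid, rev in revenue_by_user.items() if uid in users)
    user_share = feature_users / total_users if total_users else 0.0
    revenue_share = feature_revenue / total_revenue if total_revenue else 0.0
    return user_share, revenue_share


if __name__ == "__main__":
    # Hypothetical customer base: 9,996 small accounts and 4 large ones.
    revenue = {f"user-{i}": 10 for i in range(9996)}
    big_accounts = [f"enterprise-{i}" for i in range(4)]
    revenue.update({uid: 10710 for uid in big_accounts})
    users, money = feature_impact(big_accounts, revenue)
    print(f"{users:.2%} of users, {money:.0%} of revenue")
```

In this made-up data set the feature's users are 0.04% of all users and 30% of revenue. Running the same query against real billing data is a cheap step to take alongside reading the code and the support tickets.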
S10: The 'correct' architecture will take too long and the 'fast' one has a landmine. (Staff-Level)
The Scenario
You are designing a new notification system. The “correct” architecture (event-driven with guaranteed delivery, deduplication, user preference management, multi-channel routing) will take 4 months with 3 engineers. The CEO wants to launch the new product in 6 weeks and notifications are a hard requirement. You can build a “fast” version in 3 weeks that sends emails directly from the application code — but you know from experience this creates a tightly coupled mess that will take 6 months to untangle later. Neither option is acceptable as-is. What do you do?
What weak candidates say: “I would build the fast version and plan to refactor later. We need to hit the launch date. Technical debt is the cost of doing business.”
What strong candidates say:
“Both extremes are wrong, and the real senior engineering skill here is finding the architecture that is neither the 4-month gold-plated version nor the 3-week prototype-that-becomes-permanent.
The trap is thinking in two options. There are always more than two options. Let me decompose the problem:
What does the CEO actually need in 6 weeks? Not the full notification system. They need notifications that work for the launch — which probably means one channel (email), one trigger (the product’s core action), and basic delivery. User preference management, multi-channel routing, and deduplication can come later.
What is the landmine in the fast approach? It is not that the code is quick and dirty — that is fixable. The landmine is the coupling. If every service sends emails directly, then:
- Adding SMS later means touching every service.
- Changing the email provider means touching every service.
- Adding user preference checks means touching every service.
- Rate limiting and deduplication have to be implemented in every service.
POST /notify with a payload). It has one handler: email via SendGrid. No preference management, no deduplication, no multi-channel. The handler is a single function that formats the email and calls SendGrid. This is deliberately simple — maybe 500 lines of code.
Week 3: Integrate the product services. Instead of calling SendGrid directly, they call the notification service. This is the critical investment — it establishes the boundary that prevents coupling.
Week 4-5: Load test, add basic monitoring, handle edge cases for the launch use case.
Week 6: Launch.
What this buys us: The notification service exists as a single place to add channels, preferences, deduplication, and routing later. No service-by-service rewrites needed. The ‘fast’ version and the ‘correct’ version share the same interface. We just add capabilities to the notification service over time.
What I explicitly chose to skip and why:
- Message queue between services and notification service. For the launch, REST is fine. At our scale (probably fewer than 1,000 notifications per hour initially), we do not need async processing yet. We add the queue when we need it — and the services do not need to change because they just call the notification service’s API either way.
- Deduplication. We accept that in rare edge cases, a user might get a duplicate email for the first few months. This is annoying but not harmful. We add deduplication when it becomes a real problem.
- User preferences. For launch, everyone gets email. We add preferences when the product team asks for them — which is usually 2-3 months after launch, giving us time.
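The boundary this plan relies on can be sketched in a few lines. The example below is an illustration under assumptions, not the service's real code: `Notification`, `NotificationService`, and `RecordingEmailHandler` are hypothetical names, the email handler only records messages instead of calling a provider, and the HTTP layer in front of `notify` is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Notification:
    user_id: str
    channel: str   # "email" at launch; other channels can be added later
    subject: str
    body: str


Handler = Callable[[Notification], None]


@dataclass
class NotificationService:
    """Single entry point that product services call instead of a provider."""
    handlers: Dict[str, Handler] = field(default_factory=dict)

    def register(self, channel: str, handler: Handler) -> None:
        self.handlers[channel] = handler

    def notify(self, notification: Notification) -> None:
        handler = self.handlers.get(notification.channel)
        if handler is None:
            raise ValueError(f"no handler for channel {notification.channel!r}")
        handler(notification)


@dataclass
class RecordingEmailHandler:
    """Stand-in for the launch email handler; a real one would call the provider."""
    sent: List[Notification] = field(default_factory=list)

    def __call__(self, notification: Notification) -> None:
        self.sent.append(notification)


if __name__ == "__main__":
    service = NotificationService()
    email = RecordingEmailHandler()
    service.register("email", email)
    service.notify(Notification("user-1", "email", "Welcome", "Thanks for signing up."))
    print(len(email.sent), "email queued for delivery")
```

Callers depend only on `notify`. Adding SMS, preference checks, or deduplication later means registering another handler or wrapping `notify` inside the service, with no change to the product services that call it.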