Mercor Interview Preparation Guide
This guide is specifically curated for Software Engineer / Developer positions at Mercor. It covers multi-disciplinary topics including database design, system architecture, caching, real-time systems, and behavioral decision-making. Mercor interviews are known for testing breadth across the stack and engineering judgment under constraints, so expect questions that blend system design with pragmatic trade-off analysis.
1. Database Design & Architecture
Core Design Principles
When designing database architecture for user flows, consider:
- Data Flow Mapping: Map user actions to CRUD operations and plan transaction boundaries. Identify which operations need strong consistency (payments, inventory) vs. eventual consistency (notifications, analytics).
- Schema Design: Normalize data to reduce redundancy and design efficient indexes. But know when to denormalize — read-heavy dashboards with complex joins are a classic case where denormalization saves you from slow aggregation queries at the cost of write complexity.
- Scalability: Choose between read-replicas, sharding, or microservices patterns. The decision depends on your read/write ratio, data locality requirements, and whether your bottleneck is query throughput or storage volume.
Q: How do you handle database scaling?
- Vertical Scaling (Scale Up): Increase CPU/RAM/IOPS on the existing server. This is usually the first move because it requires zero application changes. At AWS, moving from an `r6g.large` to an `r6g.4xlarge` RDS instance gives you 8x the vCPUs and memory within minutes, although the real throughput gain depends on where the bottleneck is. The ceiling is real though: the largest RDS instance (e.g., `db.r6g.16xlarge` with 512GB RAM) costs ~$8K/month and you eventually hit single-node limits on connection count and write IOPS.
- Connection Pooling: Before adding infrastructure, check if your bottleneck is actually connection overhead. Tools like PgBouncer (for PostgreSQL) or ProxySQL (for MySQL) can reduce 10,000 application connections down to 200 actual database connections. This alone has saved teams from premature scaling decisions. A Node.js app with 50 serverless functions each holding 10 connections asks for 500 connections and exhausts PostgreSQL’s default `max_connections=100` almost immediately.
- Read Replicas: Distribute `SELECT` queries to replicas. This is the highest-ROI scaling move for read-heavy workloads (most web apps are 80-95% reads). The gotcha is replication lag: a user writes data and immediately reads it back from a replica that has not received the write yet. Solution: route writes and “read-your-own-writes” queries to the primary, everything else to replicas.
- Sharding (Horizontal Partitioning): Split data across multiple database instances by a shard key (e.g., `user_id % N`). This is the nuclear option. It gives you near-linear write scaling but introduces enormous complexity: cross-shard joins become impossible (or require scatter-gather), transactions cannot span shards without distributed transaction protocols (2PC), and rebalancing shards when data distribution is skewed is an operational nightmare. Instagram famously sharded PostgreSQL by user ID and built custom tooling to manage it.
- Caching Layer (Redis/Memcached): Store hot query results in memory. This does not scale the database itself but reduces load on it. At scale, a well-tuned Redis cache can absorb 90%+ of read traffic. The key trade-off is cache invalidation complexity (see Caching Patterns below).
- CQRS (Command Query Responsibility Segregation): Separate the write model (optimized for transactional integrity) from the read model (optimized for query performance). The write side uses a normalized relational schema; the read side uses denormalized views, Elasticsearch, or materialized views. Systems such as LinkedIn’s feed and Netflix’s catalog service are often cited as examples of this approach at scale.
- “You mentioned replication lag with read replicas. How would you handle a scenario where a user updates their profile and immediately sees stale data?”
- “What shard key would you choose for a multi-tenant SaaS application, and what happens when one tenant has 100x more data than others?”
- “At what point would you consider moving from a relational database to a NoSQL solution instead of continuing to scale your RDBMS?”
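The read-replica rule from the answer above (writes and read-your-own-writes queries go to the primary, everything else to replicas) can be sketched as a small router. This is an illustration under assumptions, not a production driver: `createRouter`, the node labels, and the 2-second `stickyMs` window (a stand-in for worst-case replication lag) are all hypothetical.

```javascript
// Hypothetical router: decides which database node a query should use.
// `stickyMs` is an assumed upper bound on replication lag.
function createRouter({ replicas, stickyMs = 2000, now = Date.now }) {
  const lastWriteAt = new Map(); // userId -> timestamp of that user's last write
  let next = 0;                  // round-robin cursor over replicas

  return {
    // Call for INSERT/UPDATE/DELETE: always the primary.
    forWrite(userId) {
      lastWriteAt.set(userId, now());
      return 'primary';
    },
    // Call for SELECT: primary if this user wrote recently, else a replica.
    forRead(userId) {
      const wroteAt = lastWriteAt.get(userId);
      if (wroteAt !== undefined && now() - wroteAt < stickyMs) {
        return 'primary'; // read-your-own-writes
      }
      const replica = replicas[next % replicas.length];
      next += 1;
      return replica;
    },
  };
}

// Usage with a fake clock so the behaviour is easy to follow.
let clock = 0;
const router = createRouter({ replicas: ['replica-1', 'replica-2'], now: () => clock });
console.log(router.forRead('u1')); // replica-1 (no recent write)
router.forWrite('u1');
console.log(router.forRead('u1')); // primary (inside the sticky window)
clock += 5000;
console.log(router.forRead('u1')); // replica-2 (window has passed)
```

Keeping the "recent writer" state in process memory only works with one application instance; with several, the same idea is usually carried in a session cookie or a shared store.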
Q: Explain ACID properties and when you would intentionally relax them
- Atomicity: “All or nothing.” If any operation in a transaction fails, the entire transaction rolls back. Under the hood, this is implemented via a Write-Ahead Log (WAL) in PostgreSQL or the redo/undo log in MySQL’s InnoDB. The WAL records every change before it is applied, so the database can replay or roll back on crash recovery. The cost: every write has to hit the WAL first (sequential disk I/O), which is why `fsync` settings matter so much for write throughput.
- Consistency: The database enforces all constraints (foreign keys, unique constraints, CHECK constraints) before and after every transaction. What most people miss: “consistency” in ACID is different from “consistency” in CAP theorem. ACID consistency means the data satisfies application-defined invariants. CAP consistency means all nodes see the same data at the same time. Conflating the two is a common interview mistake.
- Isolation: Concurrent transactions do not interfere with each other. But isolation has levels, and the default is usually not the strongest. PostgreSQL defaults to Read Committed, which means each statement sees whatever was committed before that statement started, so two reads inside the same transaction can return different results. Serializable isolation gives you the strongest guarantee (transactions behave as if they ran sequentially) but often at a significant throughput cost (figures of 30-50% lower TPS are commonly quoted for contended workloads) because of conflict tracking and serialization failures that require retries. MySQL’s InnoDB defaults to Repeatable Read and uses MVCC (Multi-Version Concurrency Control) to avoid read locks, maintaining “snapshots” of data at the transaction start time.
- Durability: Once a transaction commits, it survives crashes. This means the WAL must be flushed to disk (`fsync`) before the commit is acknowledged. You can relax this with `synchronous_commit=off` in PostgreSQL for a 2-5x write throughput improvement, at the risk of losing the last ~600ms of commits on a crash (with default settings). Some teams use this for analytics writes or logging where losing a few records is acceptable.
ONE, QUORUM, ALL).
Red flag answer: Reciting the four properties without mentioning isolation levels, WAL mechanics, or real scenarios where you would relax guarantees. Also, confusing ACID consistency with CAP consistency.
Follow-up:
- “Walk me through what happens internally when PostgreSQL commits a transaction, from the `COMMIT` statement to the data being durable on disk.”
- “You have a financial system that needs Serializable isolation for balance transfers but also handles 50K TPS in reads. How do you architect this?”
- “Explain the difference between optimistic and pessimistic concurrency control. When would you choose each?”
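The retries that Serializable isolation requires can be sketched as a small wrapper. This is a minimal example under assumptions: `withSerializableRetry` and `runTx` are hypothetical names, `runTx` stands for any async function that runs one complete transaction, and the error is assumed to expose the SQLSTATE in a `code` property (`40001` is PostgreSQL's `serialization_failure`).

```javascript
// SQLSTATE 40001 is PostgreSQL's serialization_failure code.
const SERIALIZATION_FAILURE = '40001';

// Retries `runTx` (a hypothetical async function that runs one whole
// transaction) when it fails with a serialization error.
async function withSerializableRetry(runTx, { maxAttempts = 3, baseDelayMs = 10 } = {}) {
  for (let attempt = 1; ; attempt++) {
    try {
      return await runTx(attempt);
    } catch (err) {
      const retryable = err && err.code === SERIALIZATION_FAILURE;
      if (!retryable || attempt >= maxAttempts) throw err;
      // Exponential backoff with jitter so competing transactions spread out.
      const delay = baseDelayMs * 2 ** (attempt - 1) * (0.5 + Math.random() / 2);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage: a fake transaction that conflicts twice, then commits.
async function demo() {
  let calls = 0;
  const result = await withSerializableRetry(async () => {
    calls += 1;
    if (calls < 3) {
      const err = new Error('could not serialize access');
      err.code = SERIALIZATION_FAILURE;
      throw err;
    }
    return 'committed';
  });
  console.log(result, 'after', calls, 'attempts');
}
demo();
```

The whole transaction is retried, not just the failed statement, because the snapshot it read from is no longer valid after a serialization failure.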
Q: When would you choose NoSQL over a relational database?
- Choose relational (PostgreSQL, MySQL) when: Your data has well-defined relationships and you need complex joins, transactions across multiple entities, or strong consistency guarantees. Examples: financial ledgers, inventory systems, user account management. If your queries look like “give me all orders for this user with their shipping addresses and payment methods,” relational is natural.
- Choose document stores (MongoDB, DynamoDB) when: Your data is naturally hierarchical or semi-structured, you access it by a single key or narrow partition, and the schema evolves frequently. Example: a product catalog where each product category has completely different attributes. Storing this in SQL means either a massive sparse table or a complex EAV (Entity-Attribute-Value) pattern. In MongoDB, each document can have its own shape.
- Choose wide-column stores (Cassandra, ScyllaDB) when: You need massive write throughput (100K+ writes/sec), time-series data, or multi-datacenter replication with tunable consistency. Example: IoT sensor data, event logging, messaging systems like Discord (which stores billions of messages in Cassandra, though they later migrated to ScyllaDB for performance).
- Choose graph databases (Neo4j, Amazon Neptune) when: Your primary queries traverse relationships — “friends of friends who also liked X” or fraud detection patterns. Running BFS/DFS on a relational database with recursive CTEs is possible but becomes prohibitively slow beyond 3-4 hops.
- “You are designing a system that needs to store user profiles (relational), activity feeds (time-series), and social connections (graph). Would you use one database or multiple? How would you keep them in sync?”
- “What is the CAP theorem, and how does it actually influence your database choice in practice — not just in theory?”
- “DynamoDB charges per read/write capacity unit. How does the data modeling approach differ from SQL when cost is a primary constraint?”
2. System Design: Load Balancing
Load balancing distributes incoming traffic across multiple servers to prevent bottlenecks. But the choice of load balancer, algorithm, and layer has significant implications for latency, observability, and failure recovery.
Algorithms
- Round Robin: Sequential distribution. Simple and stateless. Works well when all servers are identical and requests are roughly equal in cost. Falls apart when some requests are 10x more expensive than others (e.g., a report generation endpoint vs. a health check).
- Weighted Round Robin: Routes more traffic to more powerful servers. Useful in heterogeneous fleets or during rolling deployments where new instances are warming up.
- Least Connections: Routes to the server with the fewest active connections. Best for long-lived connections (WebSockets, gRPC streams) or when request processing times vary significantly.
- IP Hash: Ensures a user always hits the same server (sticky sessions). Useful when server-side state exists (in-memory sessions, local caches). The trade-off: if that server goes down, all its “sticky” users lose their sessions. Consistent hashing minimizes this disruption.
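Three of the algorithms above can be sketched in a few lines each. The function names and the FNV-1a hash are illustrative choices, and the connection counts are assumed to be maintained by the caller.

```javascript
// Round robin: cycle through servers in order.
function roundRobin(servers) {
  let i = 0;
  return () => servers[i++ % servers.length];
}

// Least connections: pick the server with the fewest active connections.
// `active` maps server name -> current connection count (kept up to date by the caller).
function leastConnections(active) {
  return () => {
    let best = null;
    for (const [server, count] of Object.entries(active)) {
      if (best === null || count < active[best]) best = server;
    }
    return best;
  };
}

// 32-bit FNV-1a string hash (any stable hash would do here).
function fnv1a(str) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// IP hash: the same client IP always maps to the same server.
function ipHash(servers) {
  return (ip) => servers[fnv1a(ip) % servers.length];
}

// Usage
const rr = roundRobin(['a', 'b', 'c']);
console.log(rr(), rr(), rr(), rr()); // a b c a
console.log(leastConnections({ a: 3, b: 1, c: 2 })()); // b
const sticky = ipHash(['a', 'b', 'c']);
console.log(sticky('203.0.113.7') === sticky('203.0.113.7')); // true
```

With plain modulo hashing, adding or removing a server remaps most clients; that is the disruption consistent hashing is meant to limit.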
Layer 4 vs Layer 7 Load Balancing
This distinction matters enormously and is a common interview topic:
- Layer 4 (Transport Layer): Operates on TCP/UDP. Sees IP addresses and ports but not HTTP headers, URLs, or cookies. Extremely fast (near wire-speed) because it does not parse application data. AWS NLB operates at Layer 4. Use when: you need raw throughput, TLS passthrough, or are load balancing non-HTTP protocols (database connections, custom TCP). gRPC can also be passed through at Layer 4, but because it runs over HTTP/2 you only get per-connection balancing, not per-request.
- Layer 7 (Application Layer): Operates on HTTP/HTTPS. Can route based on URL path (`/api/*` to backend, `/static/*` to CDN origin), headers (`Accept-Language` for regional routing), cookies (session affinity), or even request body content. AWS ALB, NGINX, and HAProxy operate at Layer 7. Use when: you need content-based routing, A/B testing, canary deployments, or WAF integration.
Technical Implementation (Node.js)
Using the built-in `cluster` module to utilize all CPU cores:
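A minimal sketch of that approach, assuming Node.js 16 or later (for `cluster.isPrimary`). The port, the `workerCount` helper, and the `START_CLUSTER` environment guard are illustrative choices rather than part of the original guide.

```javascript
const cluster = require('node:cluster');
const http = require('node:http');
const os = require('node:os');

// Hypothetical helper: never fork fewer than 1 or more than the available cores.
function workerCount(requested, available = os.cpus().length) {
  return Math.max(1, Math.min(requested, available));
}

function startCluster(port = 3000) {
  if (cluster.isPrimary) {
    const n = workerCount(os.cpus().length);
    console.log(`primary ${process.pid} forking ${n} workers`);
    for (let i = 0; i < n; i++) cluster.fork();

    // Replace a worker that dies so capacity stays constant.
    cluster.on('exit', (worker, code, signal) => {
      console.log(`worker ${worker.process.pid} exited (${signal || code}), restarting`);
      cluster.fork();
    });
  } else {
    // Every worker listens on the same port; the primary distributes connections.
    http
      .createServer((req, res) => {
        res.end(`handled by worker ${process.pid}\n`);
      })
      .listen(port);
  }
}

// Guard so the file can be loaded without starting servers.
// Run with: START_CLUSTER=1 node server.js
if (process.env.START_CLUSTER === '1') startCluster();
```

In containerized deployments it is also common to run one process per container and let the orchestrator handle the fan-out instead.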
Q: How would you design a load balancing strategy for a service handling 100K requests per second?
- Tier 1 (DNS-based load balancing): Use Route 53 weighted routing or GeoDNS to distribute traffic across multiple regions or availability zones. This gives you coarse-grained distribution and disaster recovery. Latency-based routing sends European users to `eu-west-1` and US users to `us-east-1`.
- Tier 2 (Network Load Balancer, L4): In each region, an AWS NLB or equivalent handles TLS termination at wire speed. NLBs can handle millions of requests per second with single-digit millisecond latency. They distribute TCP connections across a fleet of application load balancers or directly to instances.
- Tier 3 (Application Load Balancer, L7): For content-based routing, send `/api/v2/*` to the new service fleet, `/api/v1/*` to the legacy fleet, and `/static/*` to the CDN origin. This is where you implement canary deployments (5% of traffic to the new version).
- Health checks: Configure both shallow health checks (TCP connect or HTTP 200 on `/health`) for fast detection and deep health checks (that verify database connectivity, cache availability, and downstream dependencies) on a separate endpoint. The shallow check runs every 5 seconds; the deep check runs every 30 seconds. A server failing the deep check gets drained gracefully (stop sending new connections, let existing ones finish) rather than killed instantly.
- Capacity math: 100K RPS across, say, 20 application instances means ~5K RPS per instance. If each request takes ~50ms of wall-clock time on average (mostly waiting on I/O), Little’s law gives `5000 * 0.05 = 250` requests in flight per instance. With Node.js’s event loop, this is comfortable as long as the work is I/O-bound rather than CPU-bound. With thread-per-request models (Spring Boot default), you would need 250 threads per instance, which is feasible, but you would want to benchmark the memory overhead (~1MB stack per thread = 250MB just for thread stacks).
- Failure scenario: What happens when 5 of your 20 instances die simultaneously? The remaining 15 now handle ~6,700 RPS each. If they were already at 70% capacity, they are now at 93%, dangerously close to cascading failure. This is why you always provision for N+2 or N+3 redundancy and set up auto-scaling with a target of 60% CPU utilization, not 80%.
- “A canary deployment is sending 5% of traffic to a new version, but the new version has 3x higher latency. How does the load balancer detect and respond to this?”
- “What is the difference between connection draining and deregistration delay, and why does it matter for zero-downtime deployments?”
- “How would you handle a sudden 10x traffic spike, say 100K RPS jumping to 1M RPS in 30 seconds?”
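The capacity math and failure scenario from the answer above can be reproduced with a short calculation. The `capacityPlan` function is a hypothetical helper; it assumes utilization scales linearly with per-instance load, which is a simplification.

```javascript
// Little's law: requests in flight = arrival rate (req/s) * time in system (s).
function capacityPlan({ totalRps, instances, latencyMs, failed = 0, baselineUtilization }) {
  const healthy = instances - failed;
  const rpsPerInstance = totalRps / healthy;
  const concurrentPerInstance = (rpsPerInstance * latencyMs) / 1000;
  // Assumption: utilization grows in proportion to per-instance load.
  const utilization = baselineUtilization * (instances / healthy);
  return { rpsPerInstance, concurrentPerInstance, utilization };
}

const fleet = { totalRps: 100000, instances: 20, latencyMs: 50, baselineUtilization: 0.7 };

const normal = capacityPlan(fleet);
console.log(normal.rpsPerInstance, normal.concurrentPerInstance); // 5000 250

const degraded = capacityPlan({ ...fleet, failed: 5 });
console.log(Math.round(degraded.rpsPerInstance));    // 6667
console.log(Math.round(degraded.utilization * 100)); // 93 (percent)
```

Running the same function with a 60% baseline shows why the lower auto-scaling target matters: losing 5 of 20 instances then lands at 80% instead of 93%.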
3. Caching & Streaming
Caching Patterns
Understanding these patterns is not about memorizing definitions; it is about knowing which pattern fits which problem and what breaks when you choose wrong.
- Cache-Aside (Lazy Loading): App checks cache first; on a miss, reads from DB, then writes the result to cache. This is the most common pattern. The gotcha is the cache stampede: when a popular key expires, hundreds of concurrent requests all miss the cache simultaneously and slam the database. Mitigation: use locking (only one request fetches from DB, others wait) or probabilistic early expiration (refresh the cache slightly before TTL expires).
- Write-Through: App writes to cache and DB simultaneously (or the cache layer handles the DB write). Guarantees cache is always consistent with DB. The cost: every write has the latency of both the cache write and the DB write. Used when read-after-write consistency is critical, like user session data.
- Write-Behind (Write-Back): App writes to cache immediately; the cache asynchronously flushes to DB later (typically in batches). Gives you the fastest write latency but introduces a durability risk — if the cache node crashes before flushing, you lose data. Used for: analytics counters, view counts, rate limiting counters — data where losing a few writes is acceptable. Facebook uses this pattern for “Like” counts.
- Read-Through: Similar to cache-aside, but the cache itself is responsible for loading from the database on a miss (the application only talks to the cache, never directly to the DB). Simplifies application code but requires the cache layer to understand your data source.
- Refresh-Ahead: The cache proactively refreshes entries before they expire, based on predicted access patterns. Reduces cache miss latency for hot keys but wastes resources if predictions are wrong.
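Cache-aside with the locking mitigation for stampedes can be sketched as follows. This is an in-process illustration under assumptions: `createCacheAside` and `loadFromDb` are hypothetical names, and a `Map` stands in for Redis or Memcached.

```javascript
// Cache-aside with "single flight": concurrent misses for the same key
// share one database load instead of stampeding the database.
function createCacheAside({ loadFromDb, ttlMs = 60000, now = Date.now }) {
  const cache = new Map();    // key -> { value, expiresAt }
  const inFlight = new Map(); // key -> Promise of a load in progress

  return function get(key) {
    const hit = cache.get(key);
    if (hit && hit.expiresAt > now()) return Promise.resolve(hit.value); // cache hit

    // Another caller is already loading this key: wait for its result.
    if (inFlight.has(key)) return inFlight.get(key);

    const load = Promise.resolve()
      .then(() => loadFromDb(key))
      .then((value) => {
        cache.set(key, { value, expiresAt: now() + ttlMs });
        return value;
      })
      .finally(() => inFlight.delete(key));

    inFlight.set(key, load);
    return load;
  };
}

// Usage: 100 concurrent requests for a cold key trigger a single DB read.
async function demo() {
  let dbReads = 0;
  const get = createCacheAside({
    loadFromDb: async (key) => {
      dbReads += 1;
      return `row-for-${key}`;
    },
  });
  await Promise.all(Array.from({ length: 100 }, () => get('user:1')));
  console.log('database reads:', dbReads); // database reads: 1
}
demo();
```

Across several application servers the same idea needs a shared lock (for example a short-lived lock key in the cache itself), since each process only sees its own in-flight loads.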
Cache Invalidation
Phil Karlton famously said there are only two hard problems in computer science: cache invalidation and naming things. Here is why:
- TTL-based expiration: Simplest approach. Set a 60-second TTL and accept that data might be stale for up to 60 seconds. Works for most use cases. The art is choosing the right TTL: too short and you get excessive cache misses; too long and users see stale data.
- Event-driven invalidation: When data changes, publish an event (via Kafka, Redis Pub/Sub, or database triggers) that invalidates the corresponding cache key. More complex but gives you near-real-time consistency. This is what Facebook’s McSqueal does — it listens to MySQL’s binlog to invalidate Memcached keys.
- Version-based invalidation: Instead of invalidating, append a version number to the cache key (`user:123:v7`). When data changes, increment the version. Old keys naturally expire via TTL. Simple and avoids race conditions but wastes cache memory.
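Version-based invalidation is small enough to show end to end. This sketch assumes an in-memory `Map` in place of a real cache and omits TTL eviction; `createVersionedCache` is a hypothetical name.

```javascript
// Version-based invalidation: bump a per-entity version instead of deleting keys.
function createVersionedCache() {
  const versions = new Map(); // entity id -> current version
  const cache = new Map();    // versioned key -> value (TTL eviction omitted for brevity)

  const keyFor = (id) => `user:${id}:v${versions.get(id) ?? 1}`;

  return {
    keyFor,
    get: (id) => cache.get(keyFor(id)),
    set: (id, value) => cache.set(keyFor(id), value),
    // On a data change, increment the version; the old key is never read again.
    invalidate: (id) => versions.set(id, (versions.get(id) ?? 1) + 1),
  };
}

// Usage
const users = createVersionedCache();
users.set(123, { name: 'Ada' });
console.log(users.keyFor(123)); // user:123:v1
users.invalidate(123);
console.log(users.keyFor(123)); // user:123:v2
console.log(users.get(123));    // undefined (forces a fresh load)
```

The memory cost mentioned above is visible here: the `v1` entry stays in the cache until something evicts it.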
Real-time Communication
Q: Compare WebSockets, SSE, and Long Polling. When would you choose each?
- WebSockets: Full-duplex communication over a single persistent TCP connection. After an HTTP upgrade handshake, both client and server can send messages at any time. Best for: chat applications, multiplayer games, collaborative editing (Google Docs), trading platforms, and anything else where both sides send data frequently.
  - Production gotcha: WebSocket connections are stateful: the connection is tied to a specific server. This means sticky sessions or a pub/sub layer (Redis Pub/Sub, Kafka) is required when you have multiple server instances. If Server A holds User 1’s WebSocket and Server B holds User 2’s WebSocket, and User 1 sends a message to User 2, Server A must publish to a shared channel that Server B subscribes to. Also, many enterprise proxies, firewalls, and older load balancers struggle with WebSocket upgrades. AWS ALB supports them; AWS NLB supports them (L4); some corporate proxies silently drop them.
  - Scale concern: Each WebSocket holds a TCP connection open. A single Node.js server can typically handle 50K-100K concurrent WebSocket connections (memory-bound, ~10KB per connection). At 1M concurrent users, you need 10-20 servers just for connection management, plus the pub/sub backplane.
- SSE (Server-Sent Events): Uni-directional stream from server to client over a standard HTTP connection. The client opens a long-lived HTTP GET request, and the server sends `text/event-stream` formatted messages. Best for: live dashboards, stock tickers, notification feeds, build/deploy status updates, and anything else where the server pushes updates but the client does not need to send data back.
  - Advantage over WebSockets: Uses standard HTTP, so it works through most proxies, CDNs, and load balancers without special configuration (intermediaries that buffer responses may need buffering disabled for the stream). Automatic reconnection is built into the browser’s `EventSource` API (with `Last-Event-ID` for resuming). Much simpler infrastructure story.
  - Limitation: Uni-directional only. If the client needs to send data, it uses regular HTTP requests alongside the SSE connection. Also, over HTTP/1.1 browsers limit you to ~6 concurrent connections per domain, SSE streams included (HTTP/2 raises this significantly).
- Long Polling: Client sends an HTTP request; server holds it open until it has data (or a timeout, typically 30-60 seconds). Client immediately re-opens a new request after receiving a response. This is the legacy pattern that predates WebSockets and SSE.
  - When it is still used: When you must support very old browsers or environments where WebSockets and SSE are blocked. Firebase Realtime Database falls back to long polling when WebSockets fail. Slack used long polling for years before migrating to WebSockets.
  - Why it is inferior: Each poll cycle creates a new HTTP request with full headers (~800 bytes overhead), a new TCP connection (or reuses via keep-alive), and the server must track pending requests. At scale, this means significantly more overhead than persistent connections.
| Feature | WebSockets | SSE | Long Polling |
|---|---|---|---|
| Direction | Bidirectional | Server-to-client | Simulated bidirectional |
| Protocol | WS/WSS | HTTP | HTTP |
| Reconnection | Manual | Automatic | Manual |
| Binary data | Yes | No (text only) | Yes |
| Proxy-friendly | Sometimes | Always | Always |
| Best for | Chat, games, collaboration | Dashboards, notifications | Legacy fallback |
- “You are building a notification system that needs to push updates to 5 million concurrent mobile users. Would you use WebSockets, SSE, or something else entirely? Why?”
- “How does HTTP/2 server push differ from SSE, and why did Chrome remove support for HTTP/2 push?”
- “Your WebSocket connections are dropping every 60 seconds in production. What is the most likely cause and how do you fix it?”
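To make the SSE option concrete, here is a sketch of the `text/event-stream` wire format and a handler for Node's built-in `http` server. The names `formatSseEvent` and `sseHandler`, the `tick` event, and the one-second interval are assumptions made for illustration.

```javascript
// Formats one Server-Sent Event in the text/event-stream wire format.
function formatSseEvent({ id, event, data }) {
  const lines = [];
  if (id !== undefined) lines.push(`id: ${id}`);
  if (event !== undefined) lines.push(`event: ${event}`);
  // Each line of the payload needs its own "data:" field.
  for (const line of String(data).split('\n')) lines.push(`data: ${line}`);
  return lines.join('\n') + '\n\n'; // a blank line terminates the event
}

// Request handler for Node's http server: http.createServer(sseHandler).listen(3000)
function sseHandler(req, res) {
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    Connection: 'keep-alive',
  });
  // If the browser reconnected, it sends the id of the last event it saw.
  let nextId = Number(req.headers['last-event-id'] || 0) + 1;
  const timer = setInterval(() => {
    res.write(formatSseEvent({ id: nextId++, event: 'tick', data: new Date().toISOString() }));
  }, 1000);
  // Stop producing events when the client goes away.
  req.on('close', () => clearInterval(timer));
}

console.log(JSON.stringify(formatSseEvent({ id: 7, event: 'tick', data: 'hello' })));
// "id: 7\nevent: tick\ndata: hello\n\n"
```

On the browser side, `new EventSource('/events')` with an `addEventListener('tick', ...)` callback is enough to consume this stream, including the automatic reconnect.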
4. Behavioral & Decision-Making (Scenario-Based)
Mercor frequently uses Option A vs Option B scenarios to test engineering trade-offs. The key to these questions is not picking the “right” answer; it is demonstrating a structured decision framework that considers time horizons, reversibility, and second-order consequences.
Scenario 1: Critical Deadline
Q: Deadline is tomorrow, but the feature is not ready. Option A: Add more developers. Option B: Ask current team to work overtime.
- Option A (Add developers) — The Classic Trap: Brooks’ Law states that “adding manpower to a late software project makes it later.” The new developers need context transfer (the existing team stops coding to explain), environment setup, codebase familiarization, and integration of their work. For a 24-hour deadline, the ramp-up time alone exceeds the deadline. This option only makes sense if: (a) the remaining work is highly parallelizable and well-defined (e.g., writing independent API endpoints with clear specs), (b) the new developers already know the codebase, and (c) the time horizon is 2+ weeks, not hours.
- Option B (Overtime), the Default but Dangerous Choice: For a 24-hour crunch, this is often the only viable option. But experienced engineers know the hidden costs: tired developers write buggier code, code review quality drops, and you accumulate technical debt that will slow you down in the next sprint. Commonly quoted figures put the productivity drop at roughly 25% after 8 hours and 50% after 12 hours; the exact numbers vary, but the direction is consistent. The bugs introduced during crunch often take longer to fix than the time “saved.”
- The answer the interviewer actually wants — Option C (Scope Negotiation): The best engineers do not accept the premise. They ask: “What is the minimum viable feature set that delivers value by the deadline?” Cut scope to the critical path. Ship the 80% that matters, defer the 20% that is nice-to-have. Communicate transparently with stakeholders: “We can deliver the core flow by tomorrow. The edge cases and polish will be in a fast-follow by Thursday.” This shows leadership, not just execution.
- Option D (if available) — Feature Flags: Ship what you have behind a feature flag. The code goes to production but is not user-facing. This meets the deployment deadline (if the concern is a release train), gives you more time to finish, and allows incremental rollout. Tools like LaunchDarkly, Unleash, or even a simple database toggle make this trivial.
- “The PM insists that all features are must-haves and scope cannot be cut. How do you handle this conversation?”
- “You chose to crunch and shipped on time, but the code is now full of shortcuts. How do you handle the technical debt in the next sprint?”
- “How would your answer change if the deadline was 2 weeks away instead of 24 hours?”
Scenario 2: Security vs Feature Launch
Q: You found a vulnerability just before launch. Option A: Delay launch and fix security. Option B: Launch now, fix in next sprint.
- Option A (Delay and fix) is correct in almost every case. The calculus is asymmetric: the worst case of delaying a launch is lost revenue and frustrated stakeholders (recoverable). The worst case of shipping a vulnerability is a data breach, regulatory fines (GDPR fines can reach 4% of global annual revenue), customer lawsuits, and reputational damage that takes years to recover from, if it is recoverable at all. Equifax’s 2017 breach, which exploited a known vulnerability that had not been patched, cost the company on the order of $1.4 billion. Fixing it in time would have cost a tiny fraction of that.
- The nuance: Not all vulnerabilities are equal. A CVSS 9.8 remote code execution? Stop everything. A CVSS 3.1 information disclosure that requires authenticated access and only leaks non-sensitive metadata? You might ship with a documented risk acceptance and a P1 ticket for the next sprint. The key is having a risk assessment framework:
  - What data is exposed? (PII, financial, health data = stop; internal logs = maybe proceed)
  - What is the attack vector? (Unauthenticated remote = critical; requires physical access = lower risk)
  - Is there a mitigating control? (Can you WAF-block the attack vector while shipping?)
  - What is the blast radius? (One user affected vs. all users)
- What great candidates add: “I would document the risk decision. If we decide to ship, I want a written risk acceptance from the security lead and the product owner, a monitoring alert for exploitation attempts, a WAF rule to block known attack patterns for this vulnerability, and a committed fix date within 72 hours.”
- “The CEO says the launch has been promised to investors and cannot be delayed. How do you escalate and what do you do?”
- “The vulnerability is in a third-party dependency, not your code. Does that change your approach?”
- “How would you build a process to prevent this last-minute discovery from happening again?”
Scenario 3: Monolith vs Microservices Migration
Q: Your monolith is becoming hard to scale and deploy. Do you migrate to microservices?
- Do not start with microservices. The Majestic Monolith (as DHH from Basecamp calls it) is perfectly fine for most companies. Shopify runs one of the largest Ruby on Rails monoliths in the world, handling billions of dollars in transactions. The question is not “should we use microservices?” but “what specific problem are we solving, and is the operational cost worth it?”
- When microservices make sense: (a) Independent teams need to deploy independently (organizational scaling, not technical). (b) Different parts of the system have fundamentally different scaling needs (the image processing service needs GPU instances, the API gateway needs CPU instances). (c) You need polyglot persistence — one service needs PostgreSQL, another needs Elasticsearch. (d) Blast radius isolation — a bug in one service should not take down the entire platform.
- The hidden costs people underestimate: Service discovery, distributed tracing (Jaeger, Zipkin), circuit breakers (Hystrix, resilience4j), API gateways, inter-service authentication, distributed transaction management (Saga pattern), network latency between services (a monolith function call is nanoseconds; a network call is milliseconds), and the sheer operational overhead of managing 50+ deployable units. Most teams that adopt microservices prematurely spend more time fighting infrastructure than building features.
- The pragmatic middle ground: The Modular Monolith — a single deployable unit with strict internal module boundaries (separate packages/namespaces, defined interfaces between modules, no shared database tables across modules). This gives you most of the organizational benefits of microservices without the operational cost. When (and if) a module needs to be extracted, the boundary is already clean. This is what Shopify did with their “components” architecture.
- “You decided to extract one service from the monolith. Walk me through how you would handle data that is currently in shared database tables.”
- “How do you handle a distributed transaction that spans three microservices — say, placing an order that involves inventory, payment, and shipping?”
- “What observability stack would you set up before splitting the monolith? Why is this a prerequisite?”
5. Technical Deep Dives (Sample Questions)
Q: Design a URL Shortener (Bit.ly)
- Generate a unique short code (6-8 characters using base62: `a-z, A-Z, 0-9`) that maps to a long URL.
- Two approaches for generating short codes:
  - Hash-based: MD5/SHA256 the long URL and take the first 7 characters. Problem: collisions. Two different long URLs can produce the same 7-character prefix. Mitigation: check for collision and append a counter or re-hash.
  - Counter-based (preferred at scale): Use a global auto-incrementing counter and convert to base62. Counter 1000000 becomes `4c92` in base62. No collisions by definition. Problem: sequential IDs are predictable (users can enumerate URLs). Mitigation: use a Snowflake-like ID generator or a pre-generated pool of random IDs.
- Primary store: A key-value lookup (`short_code -> long_url`). DynamoDB or Redis are natural fits. If using SQL, a table with a unique index on `short_code` and an index on `long_url` (to check if a URL was already shortened).
- Schema: `short_code (PK)`, `long_url`, `created_at`, `user_id`, `expiry`, `click_count`.
- Traffic is heavily skewed, in the spirit of the 80/20 rule: a small fraction of shortened URLs (on the order of 10-20%) receives the large majority of traffic. Use Redis as a read-through cache for the hot set.
- At Bitly’s scale (~200M shortens/month, billions of redirects/month), the redirect path must be sub-10ms. This means: cache-first lookup, with the database as fallback.
- 301 vs 302 redirect: 301 (permanent) lets browsers cache the redirect — reduces server load but you lose analytics on repeat visits. 302 (temporary) forces every click through your server — more load but complete analytics. Most URL shorteners use 302 because analytics is the product.
- Every redirect logs: `short_code`, `timestamp`, `referrer`, `user_agent`, `geo_ip`. This is an append-only write-heavy workload, a good fit for Kafka feeding a data warehouse (BigQuery, Redshift) or an analytics store (ClickHouse).
- Real-time click counts: increment a Redis counter on each redirect (`INCR clicks:abc123`). Periodically flush to the database.
- The redirect service must be highly available: if it is down, every shortened link on the internet breaks. Design for 99.99% uptime (about 52.6 minutes of downtime per year). Multi-region deployment with DNS failover.
- “A user shortens the same URL twice. Should they get the same short code or a different one? What are the trade-offs?”
- “How would you implement link expiration? What happens to expired links — 404 or redirect to a landing page?”
- “An attacker is using your service to shorten malicious URLs for phishing. How do you detect and prevent this?”
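The counter-based approach in the answer above comes down to a base62 conversion. The sketch below assumes the alphabet order digits, then lowercase, then uppercase, which is the order under which counter 1000000 encodes to `4c92`; the function names are illustrative.

```javascript
// Assumed alphabet order: 0-9, a-z, A-Z.
const ALPHABET = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';

// Converts a non-negative integer counter to its base62 short code.
function encodeBase62(n) {
  if (!Number.isSafeInteger(n) || n < 0) {
    throw new RangeError('n must be a non-negative safe integer');
  }
  if (n === 0) return ALPHABET[0];
  let out = '';
  while (n > 0) {
    out = ALPHABET[n % 62] + out; // least significant digit first, prepended
    n = Math.floor(n / 62);
  }
  return out;
}

// Converts a short code back to the counter value.
function decodeBase62(code) {
  let n = 0;
  for (const ch of code) {
    const digit = ALPHABET.indexOf(ch);
    if (digit === -1) throw new RangeError(`invalid base62 character: ${ch}`);
    n = n * 62 + digit;
  }
  return n;
}

console.log(encodeBase62(1000000)); // 4c92
console.log(decodeBase62('4c92'));  // 1000000
```

Seven base62 characters cover 62^7, roughly 3.5 trillion codes, which is why 6-8 characters is a comfortable range.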
Q: Design a Social Media Relationship Feed
Q: Implement a Health Check System
Liveness check (`/healthz` or `/health/live`):
- Returns 200 if the process is alive and the HTTP server is accepting connections. No dependency checks.
- Used by: Kubernetes liveness probes, load balancer TCP checks.
- Purpose: Detect crashed processes or deadlocked event loops. If this fails, the instance should be killed and restarted.
- Must be fast (under 10ms) and must never fail due to downstream dependencies.
Readiness check (`/health/ready` or `/readyz`):
- Verifies the application can actually serve traffic: database is reachable, Redis is connected, critical downstream services are responding, disk space is sufficient.
- Used by: Kubernetes readiness probes, load balancer application-level checks.
- Purpose: Prevent routing traffic to an instance that is alive but cannot serve requests (e.g., database connection pool is exhausted).
- Timeouts on dependency checks: If your database health check hangs for 30 seconds, your health check endpoint hangs for 30 seconds, and the load balancer marks you as unhealthy (or worse, times out and retries). Always wrap dependency checks in a `Promise.race` with a 2-3 second timeout.
- Do not health-check yourself into a cascade: If your database goes down and all 50 instances fail their deep health check simultaneously, the load balancer removes all of them, and now you have zero healthy instances. Consider making some dependencies “soft” (warning, not failure) so instances stay in rotation in degraded mode.
- Kubernetes distinction: `livenessProbe` failures trigger a pod restart. `readinessProbe` failures remove the pod from the Service (no traffic routed, but pod stays alive). Confusing these causes either unnecessary restarts or traffic to broken pods.
- Include version information: Return the app version, git SHA, and build timestamp in the health response. This is invaluable during deployments to verify which version is running on which instance.
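The timeout and soft-dependency points above can be sketched together. The text names `Promise.race` (JavaScript); the example below shows the same idea in Python using `asyncio.wait_for`. The names `run_check`, `readiness`, and the hard/soft split of checks are illustrative assumptions, not part of any framework.

```python
import asyncio
from typing import Awaitable, Callable, Dict

Check = Callable[[], Awaitable[None]]

CHECK_TIMEOUT_SECONDS = 2.0  # assumed default, in line with the 2-3 second guidance


async def run_check(check: Check, timeout: float) -> str:
    """Run one dependency check, never waiting longer than `timeout` seconds."""
    try:
        await asyncio.wait_for(check(), timeout=timeout)
        return "ok"
    except asyncio.TimeoutError:
        return "timeout"
    except Exception as exc:  # a failing dependency must not crash the endpoint
        return f"error: {exc}"


async def readiness(hard: Dict[str, Check], soft: Dict[str, Check],
                    timeout: float = CHECK_TIMEOUT_SECONDS) -> dict:
    """Build a readiness report; only hard dependencies can make the instance unready."""
    checks = {**hard, **soft}
    names = list(checks)
    # Checks run concurrently, so total time is bounded by one timeout, not the sum.
    results = await asyncio.gather(*(run_check(checks[n], timeout) for n in names))
    status = dict(zip(names, results))
    ready = all(status[n] == "ok" for n in hard)
    degraded = ready and any(status[n] != "ok" for n in soft)
    return {"ready": ready, "degraded": degraded, "checks": status}
```

A handler would map `ready` to a 200 or 503 status code. Treating a cache as a soft dependency means a cache outage shows up as `degraded` in the report while the instance stays in rotation.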
Common mistakes: a health check that only returns `200 OK` with no dependency verification, or one that does not have timeouts on dependency checks. Also, not distinguishing between liveness and readiness.
Follow-up:
- “Your health check includes a database query, but during a deployment, the database connection pool takes 10 seconds to initialize. How do you handle the startup window?”
- “Should health check endpoints require authentication? What are the security implications of exposing internal status?”
- “How would you implement a health check for a service that depends on an external third-party API with unpredictable latency?”
Q: Design a Rate Limiting System
- Fixed Window Counter: Count requests in fixed time windows (e.g., 100 requests per minute, window resets at :00, :01, :02…). Simple to implement with Redis `INCR` + `EXPIRE`. The problem: boundary burst. A user sends 100 requests at 12:00:59 and 100 more at 12:01:00, so they have sent 200 requests in 2 seconds while technically respecting the 100/minute limit in each window.
- Sliding Window Log: Store the timestamp of every request. To check the limit, count timestamps within the last 60 seconds. Perfectly accurate but memory-intensive: storing timestamps for millions of users at high RPS is expensive.
- Sliding Window Counter (most common in production): Hybrid of fixed window and sliding window. Uses two adjacent fixed windows and weights the previous window's count by how much of it still overlaps the sliding window. If we are 30 seconds into the current minute, the effective count is `(current_window_count) + (previous_window_count * 0.5)`. Nearly as accurate as the sliding log with the memory efficiency of fixed windows. This is what Cloudflare uses.
- Token Bucket: A bucket holds tokens (max capacity = burst size). Tokens are added at a fixed rate (e.g., 10/second). Each request consumes one token. If the bucket is empty, the request is rejected. This naturally allows short bursts while enforcing an average rate. AWS API Gateway, for example, throttles requests with a token bucket.
- Leaky Bucket: Requests enter a queue (the bucket) and are processed at a fixed rate. If the queue is full, new requests are rejected. Unlike token bucket, this enforces a smooth output rate with no bursts. Used when downstream systems cannot handle burst traffic at all.
- What key to rate limit by: IP address (easy to evade with rotating proxies), API key (best for authenticated APIs), user ID (best for logged-in users), or a combination. For unauthenticated endpoints, use IP + fingerprint heuristics.
- Response headers: Always return `X-RateLimit-Limit`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset` headers so clients can implement backoff. Return HTTP 429 (Too Many Requests) with a `Retry-After` header.
- Graceful degradation: Instead of hard-blocking, consider degraded service: rate-limited users get cached results instead of fresh data, or lower-priority requests are queued instead of rejected.
- Multi-tier limits: GitHub’s API has per-user, per-IP, and per-resource rate limits. This prevents one API endpoint from consuming a user’s entire rate limit budget.
Follow-up:
- “Your Redis rate limiter adds 2ms to every API request. How would you reduce this overhead for high-throughput services?”
- “A legitimate enterprise customer is hitting the rate limit due to a batch import. How do you handle this without removing the limit?”
- “How would you rate-limit a GraphQL API where a single request can be trivially cheap or extremely expensive depending on the query?”
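Two of the algorithms described in this answer, the sliding window counter and the token bucket, are small enough to sketch in a few lines. This is a single-process illustration under stated assumptions: the names `sliding_window_estimate` and `TokenBucket` are hypothetical, time is passed in explicitly so the behavior is deterministic, and a distributed limiter would keep this state in a shared store such as Redis.

```python
def sliding_window_estimate(current_count: int, previous_count: int,
                            elapsed_fraction: float) -> float:
    """Weighted request count for the sliding window counter.

    elapsed_fraction is how far we are into the current window (0.0 to 1.0);
    the previous window contributes only the part that still overlaps.
    """
    return current_count + previous_count * (1.0 - elapsed_fraction)


class TokenBucket:
    """Token bucket: `capacity` bounds the burst, `refill_rate` is tokens per second."""

    def __init__(self, capacity: float, refill_rate: float, now: float = 0.0) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity      # start full so an initial burst is allowed
        self.last_refill = now

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the time elapsed since the last call, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

For example, 30 seconds into a one-minute window with 40 requests so far and 80 in the previous window, `sliding_window_estimate(40, 80, 0.5)` gives 80.0. The `cost` parameter is one possible hook for charging expensive requests more than cheap ones.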
Q: How would you debug a production performance regression?
- Check deployment history: was there a recent deploy? Run `git log --since="2 hours ago"` on the deployed branch. In my experience, 80%+ of production regressions are caused by recent code changes.
- Check metrics dashboards (Grafana, Datadog): CPU utilization, memory usage, request latency (p50, p95, p99), error rates, database query times, cache hit ratios. Look for step-function changes (correlate with deploy times) vs. gradual degradation (usually a leak or accumulation problem).
- Check infrastructure changes: auto-scaling events, dependency updates, cloud provider incidents (check the AWS status page — it has lied before, so also check Twitter/X and Downdetector).
- Distributed tracing (Jaeger, Datadog APM, AWS X-Ray): Look at trace waterfalls for slow requests. Is the latency in the application code, the database, a downstream service, or the network? A trace that shows 500ms total with 480ms in a database call tells you exactly where to look.
- Database slow query log: Enable `pg_stat_statements` in PostgreSQL or the slow query log in MySQL. Look for queries that suddenly started doing sequential scans instead of index scans (often caused by stale statistics; run `ANALYZE`).
- Application profiling: If the bottleneck is in application code, use a flame graph (via `perf` on Linux, or `--prof` in Node.js). Flame graphs visually show where CPU time is spent: wide bars at the top indicate hot functions.
- Can you reproduce the issue on a single instance by removing it from the load balancer and sending test traffic?
- Is it affecting all endpoints or specific ones? All users or specific segments (geo, account type)?
- Does it correlate with traffic volume (load-dependent) or is it constant (code-path dependent)?
- If a recent deploy is the cause, roll back first, investigate second. Restoring service is more important than understanding the root cause in the moment. You can always re-deploy a fixed version later.
- If the fix is a code change, deploy it through the normal pipeline with a feature flag so you can instantly disable it if it makes things worse.
- Confirm metrics return to baseline. Watch for 30+ minutes — some issues are intermittent.
- Write a blameless postmortem: timeline, root cause, impact (duration, affected users, revenue impact), action items (how to prevent recurrence).
Follow-up:
- “The regression only affects p99 latency — p50 is fine. What does this tell you about the likely cause?”
- “You rolled back the deploy but latency is still elevated. What do you investigate next?”
- “How would you set up alerting to catch performance regressions within 5 minutes of a deployment?”
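Since this answer leans on p50, p95, and p99 latency, a short sketch of how those numbers behave may help. The `percentile` helper below uses the nearest-rank definition, which is one of several conventions (monitoring tools often interpolate or use histograms), and the latency samples are made up for illustration.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Hypothetical data: 980 requests at 40 ms, 20 requests stuck behind something slow at 900 ms.
latencies_ms = [40] * 980 + [900] * 20

p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
mean = sum(latencies_ms) / len(latencies_ms)
```

With this data the median and p95 stay at 40 ms and the mean moves only to 57.2 ms, while p99 jumps to 900 ms. A small fraction of slow requests (lock contention, GC pauses, cold caches, one bad shard) is invisible in averages, which is why tail percentiles belong on the dashboard.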
Q: Explain the CAP Theorem and how it applies to real system design decisions
- CP System (e.g., ZooKeeper, etcd, Google Spanner): During a partition, the minority side of the partition refuses to serve reads/writes to prevent stale data. This means some requests get errors or timeouts. Used for: distributed locks, leader election, configuration management — anywhere serving stale data is worse than serving no data.
- AP System (e.g., Cassandra, DynamoDB default, CouchDB): During a partition, all nodes continue to serve requests. Writes that happen on different sides of the partition may conflict and need to be resolved later (via last-write-wins, vector clocks, or application-level conflict resolution). Used for: shopping carts (Amazon’s Dynamo paper), DNS, social media feeds — anywhere availability is more important than perfect consistency.
- PACELC extension: CAP only describes behavior during partitions. PACELC adds: “when there is no partition, do you optimize for Latency or Consistency?” DynamoDB with eventually consistent reads gives you lower latency (PA/EL). DynamoDB with strongly consistent reads gives you higher latency (PC/EC). Same system, different per-request trade-offs.
- Per-operation granularity: Real systems do not make one global CAP choice. A banking system might be CP for balance reads/writes but AP for transaction history display. Cassandra lets you choose consistency per query (`ONE`, `QUORUM`, `ALL`). This is tunable consistency, not a binary choice.
- CRDTs (Conflict-free Replicated Data Types): A class of data structures that can be merged without conflicts after a partition heals. Counters, sets, and registers can be designed so that concurrent updates on different nodes can always be merged deterministically. Used by Redis Enterprise, Riak, and collaborative editing systems.
Follow-up:
- “You are building a global e-commerce platform. Product catalog, shopping cart, and payment processing each have different CAP requirements. Walk me through your choices for each.”
- “Google Spanner claims to be ‘effectively CA’ by using TrueTime and GPS-synchronized clocks. How does this work and what are the limitations?”
- “A database vendor tells you their product is ‘fully consistent and fully available.’ What questions would you ask to challenge this claim?”
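The CRDT idea mentioned in this answer is easiest to see with the simplest example, a grow-only counter (G-Counter). The sketch below is a minimal illustration, not the implementation used by any particular database; the class and method names are assumptions.

```python
from typing import Dict


class GCounter:
    """Grow-only counter CRDT: each node only ever increments its own slot."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        """The counter's value is the sum of every node's slot."""
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Take the per-node maximum; safe to apply in any order, any number of times."""
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
```

Because the merge is a per-node maximum, it is commutative, associative, and idempotent. Two replicas that accepted writes on opposite sides of a partition converge to the same total once they exchange state, with no conflict resolution step.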
Q: Design an API Authentication and Authorization System
Q: How would you design a job queue / background task processing system?
- Producer sends jobs to a message broker (Redis, RabbitMQ, SQS, Kafka).
- Consumer/Worker pulls jobs, processes them, and acknowledges completion.
- Dead Letter Queue (DLQ) captures jobs that fail after max retries for manual inspection.
- Redis (with BullMQ/Sidekiq): Simple, fast, great for web application background jobs (sending emails, generating PDFs, processing uploads). BullMQ adds reliable queue semantics (delayed jobs, retries, priorities, rate limiting) on top of Redis. Limitation: if Redis crashes and persistence is not configured, you lose queued jobs. Use AOF persistence with `appendfsync everysec` for durability; a crash can still lose up to about one second of writes.
- RabbitMQ: Full-featured message broker with routing, exchanges, and consumer acknowledgments. Supports multiple messaging patterns (work queue, pub/sub, topic routing). Better durability guarantees than Redis. Good for: medium-scale (tens of thousands of messages/sec) with complex routing needs.
- SQS: Managed, serverless, nearly infinite scale. No infrastructure to manage. Standard queues offer at-least-once delivery with best-effort ordering. FIFO queues offer exactly-once processing with strict ordering, but lower throughput: by default about 300 API calls per second, or 3,000 messages/sec with batching, unless high-throughput mode is enabled. Good for: AWS-native architectures where operational simplicity is the priority.
- Kafka: Not a queue but a distributed log. Consumers track their position (offset) in the log. Messages are retained for a configurable period regardless of consumption. Good for: event sourcing, stream processing, cases where multiple consumers need to read the same messages independently, or where you need message replay.
- Idempotency: Jobs must be safe to process more than once. Network failures, consumer crashes, and at-least-once delivery all mean duplicate processing will happen. Solution: assign each job a unique `idempotency_key`, check a deduplication store (Redis SET with TTL) before processing, and write the result atomically with the deduplication record.
- Retry with exponential backoff: First retry after 1s, second after 4s, third after 16s, with jitter (random ±20%) to prevent thundering herd. After N retries (typically 3-5), move to the DLQ.
- Poison message handling: A poison message is a job that crashes the worker every time it runs (e.g., its input triggers an OOM kill). Without proper handling, this job gets retried forever, blocking the queue. Solution: track per-job delivery counts outside the worker process (a crashed worker cannot record its own failure) and move the job to the DLQ after max retries.
- Monitoring: Track queue depth (growing = consumers are not keeping up), processing latency (time from enqueue to completion), failure rate, and DLQ size. Alert when queue depth exceeds a threshold or DLQ is non-empty.
Follow-up:
- “A job takes 30 minutes to process. The consumer crashes at minute 25. How do you avoid reprocessing from scratch?”
- “You have jobs with different priorities — some must be processed within 1 second, others within 1 hour. How do you design the queue system?”
- “How would you implement exactly-once processing with an at-least-once delivery queue like SQS Standard?”
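The idempotency, backoff, and DLQ points in this answer fit together in one small worker loop. The sketch below is an in-memory illustration under stated assumptions: `backoff_delay` and `process_job` are hypothetical names, a Python `set` stands in for the deduplication store, a `list` stands in for the DLQ, and the 1s/4s/16s schedule is modeled as base 1 second with a growth factor of 4.

```python
import random
import time
from typing import Callable, List, Optional, Set


def backoff_delay(attempt: int, base: float = 1.0, factor: float = 4.0,
                  jitter: float = 0.2, rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (1-based): base * factor**(attempt - 1), +/- jitter."""
    source = rng if rng is not None else random
    nominal = base * factor ** (attempt - 1)
    return nominal * source.uniform(1.0 - jitter, 1.0 + jitter)


def process_job(job: dict, handler: Callable[[dict], None],
                processed_keys: Set[str], dead_letters: List[dict],
                max_retries: int = 3,
                sleep: Callable[[float], None] = time.sleep) -> str:
    """Run `handler` at most once per idempotency key, retrying with backoff on failure."""
    key = job["idempotency_key"]
    if key in processed_keys:          # duplicate delivery: skip the side effects
        return "duplicate"
    for retry in range(max_retries + 1):   # one initial attempt plus max_retries retries
        try:
            handler(job)
        except Exception:
            if retry == max_retries:
                break                  # out of retries
            sleep(backoff_delay(retry + 1))
        else:
            processed_keys.add(key)    # record success so redeliveries become no-ops
            return "done"
    dead_letters.append(job)           # park the job for manual inspection
    return "dead-lettered"
```

In a real system the success record and the job's result would need to be written atomically, as the idempotency bullet notes. This sketch also cannot catch a job that kills the whole worker process, which is why delivery counts for poison messages are usually tracked by the broker.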
Q: What happens when you type a URL into a browser and press Enter?
- Browser checks its own DNS cache, then asks the OS resolver, which consults `/etc/hosts` and its own cache before querying the configured DNS resolver.
- If not cached, a recursive DNS query goes to your configured resolver (e.g., 8.8.8.8 or your ISP’s resolver), which walks the DNS hierarchy: the root server (`.`) tells it where `.com` is, the `.com` TLD server tells it where `example.com`’s authoritative NS is, and the authoritative NS returns the A/AAAA record.
- Modern optimization: DNS prefetching (`<link rel="dns-prefetch" href="//api.example.com">`) resolves domains for resources the page will need before they are requested.
- Client initiates a three-way handshake: SYN, SYN-ACK, ACK. This takes one round-trip time (RTT) — ~10ms on the same continent, ~150ms cross-ocean.
- For HTTPS, this is followed by the TLS handshake (adds 1-2 more RTTs). TLS 1.3 reduced this to 1-RTT (or 0-RTT for resumption), which is why upgrading from TLS 1.2 to 1.3 gives a measurable latency improvement.
- TCP congestion control starts with a small congestion window (typically 10 segments = ~14KB). This means the first response can only send ~14KB before waiting for an ACK. This is why keeping your critical CSS/HTML under 14KB improves First Contentful Paint — it fits in the first congestion window.
- Browser sends an HTTP request with headers: `Host`, `User-Agent`, `Accept`, `Accept-Encoding` (gzip, br), cookies, `Connection: keep-alive`.
- Server processes the request (routing, middleware, controller, database queries, template rendering) and returns a response with headers: `Content-Type`, `Content-Length`, `Cache-Control`, `Set-Cookie`, `Content-Encoding`.
- HTTP/2 multiplexes multiple requests over a single TCP connection (no head-of-line blocking at HTTP layer). HTTP/3 uses QUIC (UDP-based) to eliminate TCP head-of-line blocking entirely.
- Browser parses HTML into the DOM tree, CSS into the CSSOM tree, and combines them into the Render Tree.
- JavaScript execution blocks DOM parsing unless the script is `async` or `defer`. This is why scripts at the bottom of `<body>` or with `defer` improve perceived performance.
- The render pipeline: Layout (calculate positions/sizes), then Paint (draw pixels), then Composite (layer management for GPU-accelerated animations).
- Critical Rendering Path optimization: Inline critical CSS, defer non-critical CSS, preload key resources (`<link rel="preload">`), use `font-display: swap` to prevent invisible text during web font loading.
Follow-up:
- “Where in this entire flow would you add caching to improve performance, and what kind of caching at each layer?”
- “The page loads in 3 seconds. Using Chrome DevTools, walk me through how you would identify whether the bottleneck is network, server processing, or client-side rendering.”
- “How does a CDN change this flow? What parts of the journey are eliminated or shortened?”
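The round-trip and congestion-window figures quoted in this answer can be checked with a little arithmetic. The sketch below is a rough model under stated assumptions: a 1460-byte maximum segment size (typical for a 1500-byte Ethernet MTU), a full handshake with no TLS resumption, no packet loss, and DNS and server time supplied separately. The function names are hypothetical.

```python
from typing import Optional

MSS_BYTES = 1460               # assumed maximum segment size
INITIAL_CWND_SEGMENTS = 10     # initial congestion window, as described in the text


def initial_window_bytes(segments: int = INITIAL_CWND_SEGMENTS,
                         mss: int = MSS_BYTES) -> int:
    """Bytes the server can send before it must wait for the first ACK."""
    return segments * mss


def handshake_rtts(tls_version: Optional[str]) -> int:
    """Round trips spent before the HTTP request can be sent."""
    tls_rtts = {None: 0, "1.2": 2, "1.3": 1}[tls_version]
    return 1 + tls_rtts        # 1 RTT for the TCP three-way handshake


def time_to_first_byte_ms(rtt_ms: float, tls_version: Optional[str],
                          dns_ms: float = 0.0, server_ms: float = 0.0) -> float:
    """DNS + handshakes + one request/response round trip + server processing."""
    return dns_ms + (handshake_rtts(tls_version) + 1) * rtt_ms + server_ms
```

Under these assumptions the first flight is 14,600 bytes, which is where the "keep critical HTML/CSS under about 14KB" guidance comes from. On a 150 ms cross-ocean link, a cold HTTPS request costs about 450 ms with TLS 1.3 and 600 ms with TLS 1.2 before the first response byte arrives, which is the part of the journey a nearby CDN edge shortens.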
6. Preparation Checklist
- STAR Method: Practice Situation, Task, Action, Result for behavioral answers. Record yourself and listen for vague language (“we did” instead of “I did”, “improved performance” instead of “reduced p95 latency from 800ms to 120ms”).
- Trade-offs: Always mention pros and cons before picking an option. The phrase “It depends on…” followed by specific factors is what interviewers want to hear — not definitive statements without qualification.
- Coding: Be ready to explain Express middleware chains (the `next()` flow), Redis data structures beyond just `GET`/`SET` (sorted sets for leaderboards, HyperLogLog for cardinality estimation), and SQL indexing (B-tree vs hash indexes, composite index column ordering, partial indexes).
- Infrastructure: Understand Layer 4 vs Layer 7 load balancing, the difference between horizontal and vertical scaling, and why auto-scaling has a lag that matters.
- Numbers to know: about 1ms for a Redis roundtrip, 5ms for a simple database query, 1-2ms for a cross-AZ network call within a region, tens of milliseconds for a cross-region call, 150ms for a cross-continent roundtrip. Know the latency hierarchy and use it to justify design decisions.
- System design framework: Requirements (functional + non-functional), then API design, then data model, then high-level architecture, then a deep dive on the hardest part, then scaling and trade-offs. Practice this flow until it is automatic.
- Ask clarifying questions: Before diving into an answer, ask 2-3 clarifying questions. “What is the expected QPS?”, “Do we need strong consistency or is eventual OK?”, “What is the read/write ratio?” This signals senior-level thinking.