Interview Preparation
Master the most common microservices interview questions asked at top tech companies.
What This Chapter Covers:
- Common interview questions with answers
- System design exercises
- Behavioral questions about microservices
- Whiteboard coding challenges
- Tips for success
Common Interview Questions
Architecture & Design
Q1: When would you choose microservices over monolith?
Choose Microservices when:
- Team is large (multiple teams need autonomy)
- Different parts need different scaling
- Need technology diversity
- Complex domain with clear boundaries
- Organization is ready for DevOps culture
Choose a Monolith when:
- Small team (< 10 developers)
- Simple domain
- Startup/MVP phase
- Unclear boundaries
- Limited DevOps expertise
Q2: How do you handle distributed transactions?
Options:
- Saga Pattern (Preferred)
  - Choreography: Services react to each other's events; a failure event triggers compensation
  - Orchestration: A central coordinator manages the steps and their compensations
- Event Sourcing
  - Store events, not state
  - Replay events to rebuild a consistent state
- Two-Phase Commit (Avoid)
  - Blocking, doesn’t scale
  - Coordinator is a single point of failure
Best Practice: Design for eventual consistency, and prefer compensation over rollback.
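The orchestration variant of the saga pattern can be sketched in a few lines. This is a minimal in-process illustration, not a production coordinator: `SagaStep`, `run_saga`, and the order workflow steps are hypothetical names, and a real saga would persist its progress and call remote services.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SagaStep:
    name: str
    action: Callable[[], None]       # forward operation (a local transaction)
    compensate: Callable[[], None]   # undoes the action if a later step fails


def run_saga(steps: List[SagaStep]) -> bool:
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed: List[SagaStep] = []
    for step in steps:
        try:
            step.action()
            completed.append(step)
        except Exception:
            for done in reversed(completed):
                done.compensate()
            return False
    return True


# Hypothetical order workflow: the payment step fails, so inventory is released.
log: List[str] = []


def charge_payment() -> None:
    raise RuntimeError("card declined")


order_saga = [
    SagaStep("reserve_inventory",
             lambda: log.append("inventory reserved"),
             lambda: log.append("inventory released")),
    SagaStep("charge_payment",
             charge_payment,
             lambda: log.append("payment refunded")),
]

succeeded = run_saga(order_saga)
```

Only steps that completed are compensated, which is why the failed payment step is never "refunded" here. In an interview it is worth adding that compensations must themselves be idempotent, since the coordinator may retry them.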
Q3: How do you ensure data consistency across services?
Strategies:
- Eventual Consistency
  - Accept temporary inconsistency
  - Design idempotent operations
  - Use event-driven updates
- Outbox Pattern
  - Write to DB + outbox in same transaction
  - Separate process publishes events
  - Guarantees at-least-once delivery (consumers must tolerate duplicates)
- Change Data Capture (CDC)
  - Listen to database changes
  - Publish events from DB logs
  - Example: Debezium
General guidelines:
- Avoid distributed transactions
- Design for failure recovery
- Monitor for inconsistencies
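To make the outbox pattern concrete, here is a small sketch using an in-memory SQLite database. The table layout, the `place_order` and `relay_once` functions, and the `publish` callback are assumptions for illustration; a real relay would publish to a message broker and run as a separate process.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL,
        published INTEGER NOT NULL DEFAULT 0
    );
""")


def place_order(order_id: int) -> None:
    # One local transaction: the state change and the event commit together.
    with conn:
        conn.execute("INSERT INTO orders (id, status) VALUES (?, 'PLACED')", (order_id,))
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps({"order_id": order_id})),
        )


def relay_once(publish) -> int:
    """Publish pending outbox rows, marking each as published afterwards."""
    rows = conn.execute(
        "SELECT id, event_type, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, event_type, payload in rows:
        publish(event_type, json.loads(payload))  # may repeat if we crash before the update
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent = []
place_order(42)
relay_once(lambda event_type, payload: sent.append((event_type, payload)))
```

Because the row is marked as published only after `publish` returns, a crash between the two steps causes a re-send rather than a lost event. That is the at-least-once guarantee, and it is why consumers need to be idempotent.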
Q4: How would you split a monolith into microservices?
Step-by-Step Approach:
- Identify Boundaries
  - Use Domain-Driven Design
  - Find bounded contexts
  - Look for natural seams
- Start with Edge Services
  - Authentication
  - Notifications
  - Low-risk, well-defined
- Strangler Fig Pattern
  - Route traffic through facade
  - Gradually extract functionality
  - No big bang migration
- Database Extraction
  - Identify service data
  - Create new database
  - Sync during transition
  - Switch reads, then writes
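The facade in the strangler fig pattern is essentially a routing table that grows as functionality is extracted. A minimal sketch follows, assuming hypothetical internal hostnames and path prefixes; in practice this logic usually lives in an API gateway or reverse proxy configuration rather than application code.

```python
# Hypothetical upstream addresses, for illustration only.
MONOLITH = "http://monolith.internal"
EXTRACTED = {
    "/auth": "http://auth-service.internal",
    "/notifications": "http://notification-service.internal",
}


def route(path: str) -> str:
    """Return the upstream for a request path; unmigrated paths stay on the monolith."""
    for prefix, upstream in EXTRACTED.items():
        # Match whole path segments so "/authors" does not hit "/auth".
        if path == prefix or path.startswith(prefix + "/"):
            return upstream
    return MONOLITH
```

Migrating another capability means adding one entry to `EXTRACTED`, and rolling back means removing it, which is what keeps each step low-risk.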
Q5: Explain the CAP theorem and its implications
CAP Theorem: A distributed system can guarantee at most two of these three properties at once:
- Consistency: All nodes see the same data
- Availability: Every request gets a response
- Partition Tolerance: System keeps working despite network failures
Because partitions cannot be ruled out, the practical choice is between:
- CP (Consistency + Partition): Reject requests until consistent (e.g., banking)
- AP (Availability + Partition): Accept requests, sync later (e.g., shopping cart)
Implications for microservices:
- Networks will fail → must handle partitions
- Usually choose AP with eventual consistency
- Use compensation for errors
Examples:
- Payment: CP - never double charge
- Inventory display: AP - show slightly stale data
Communication Patterns
Q6: When to use sync vs async communication?
Synchronous (REST, gRPC):
- Need immediate response
- Query operations
- Simple request-reply
- Real-time requirements
Asynchronous (message queues, events):
- Fire and forget
- Long-running operations
- Decouple services
- Handle spikes/backpressure
Best Practice: Default to async, use sync only when necessary.
Q7: How do you handle API versioning?
Strategies:
- URL Versioning: /api/v1/orders
- Header Versioning: Accept: application/vnd.api+json; version=1
- Query Parameter: /orders?version=1
Best Practices:
- Support N-1 versions minimum
- Deprecation warnings in responses
- Clear migration documentation
- Use semantic versioning
Breaking vs. non-breaking changes:
- Breaking: Remove field, change type, remove endpoint
- Non-Breaking: Add optional field, new endpoint
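The three versioning strategies can be compared side by side with a small resolver. This is a sketch under stated assumptions: `resolve_version`, the `SUPPORTED` set (current version 2 plus N-1), and the precedence order of URL, then header, then query parameter are illustrative choices, not a standard. Most APIs pick one strategy rather than accepting all three.

```python
import re
from typing import Mapping, Optional

SUPPORTED = {1, 2}        # assumed: current version 2 plus N-1
DEFAULT_VERSION = 2       # assumed default when the client sends no version


def resolve_version(path: str,
                    headers: Mapping[str, str],
                    query: Mapping[str, str]) -> Optional[int]:
    """Return the requested API version, or None if it is not supported."""
    url_match = re.match(r"^/api/v(\d+)/", path)
    header_match = re.search(r"version=(\d+)", headers.get("Accept", ""))
    if url_match:
        version = int(url_match.group(1))
    elif header_match:
        version = int(header_match.group(1))
    elif query.get("version", "").isdigit():
        version = int(query["version"])
    else:
        version = DEFAULT_VERSION
    return version if version in SUPPORTED else None
```

A `None` result would typically map to a client error response that points at the migration documentation.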
Q8: How do you implement rate limiting?
Algorithms:
- Token Bucket
  - Tokens added at fixed rate
  - Request consumes token
  - Allows bursts
- Sliding Window
  - Count requests in time window
  - More accurate than fixed window
- Leaky Bucket
  - Fixed output rate
  - Queue excess requests
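Interviewers often ask for the token bucket on a whiteboard. The sketch below is a single-process version; the `TokenBucket` class name and the injectable `clock` parameter (used to make it testable) are assumptions, and a distributed limiter would keep the counters in a shared store such as Redis.

```python
import time


class TokenBucket:
    def __init__(self, rate: float, capacity: int, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1        # each request consumes one token
            return True
        return False
```

With `rate=1` and `capacity=2`, two back-to-back requests pass (the burst), the third is rejected, and one more is admitted after a second has elapsed.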
Headers: X-RateLimit-Limit, X-RateLimit-Remaining, Retry-After
Resilience & Reliability
Q9: What happens when a service goes down?
Defense Layers:
- Circuit Breaker: Fail fast, don’t wait
- Retry with Backoff: Handle transient failures
- Fallback: Cached data or default response
- Timeout: Don’t wait forever
- Bulkhead: Isolate failure impact
Key: Graceful degradation, not complete failure.
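A circuit breaker combined with a fallback can be sketched as follows. The class, its thresholds, and the injectable `clock` are illustrative assumptions; in production you would more likely use an established resilience library or a service mesh than hand-roll this.

```python
import time


class CircuitBreaker:
    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after   # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()        # open: fail fast without calling the service
            self.opened_at = None        # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()            # degrade gracefully instead of raising
        self.failures = 0                # success closes the circuit
        return result
```

While the circuit is open the failing dependency is not called at all, which protects both the caller's threads and the struggling service. A timeout on `func` and retries with backoff would sit inside or around this call.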
Q10: How do you debug issues in distributed systems?
Tools & Techniques:
- Distributed Tracing (Jaeger, Zipkin)
  - Trace requests across services
  - Identify bottlenecks
  - Find error source
- Centralized Logging (ELK, Loki)
  - Correlation IDs across logs
  - Structured logging (JSON)
  - Searchable logs
- Metrics (Prometheus, Grafana)
  - RED metrics: Rate, Errors, Duration
  - Dashboards for visibility
  - Alerting on anomalies
Debugging process:
- Check dashboards for anomalies
- Find trace ID from failed request
- Follow trace through services
- Search logs with correlation ID
- Identify root cause
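Correlation IDs and structured logging are easy to demonstrate with the standard library. In this sketch the `X-Correlation-ID` header name, the `JsonFormatter` class, and `handle_request` are assumptions chosen for illustration. Tracing libraries such as OpenTelemetry propagate trace context for you, so treat this as a way to explain the idea rather than a recommended implementation.

```python
import json
import logging
import uuid
from contextvars import ContextVar

# Holds the ID of the request currently being handled.
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Emit one JSON object per line so a log store can index the fields.
        return json.dumps({
            "level": record.levelname,
            "service": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        })


def handle_request(headers: dict, logger: logging.Logger) -> str:
    # Reuse the caller's ID if present so logs line up across services.
    cid = headers.get("X-Correlation-ID") or str(uuid.uuid4())
    correlation_id.set(cid)
    logger.info("order received")
    return cid  # forward this value in the headers of any outgoing calls
```

Attach `JsonFormatter` to a handler on each service's logger. Searching the centralized logs for one correlation ID then returns that request's entries from every service it touched.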
System Design Exercises
Exercise 1: Design an E-Commerce Order System
- Start with requirements clarification
- Estimate scale (10K orders/hour = ~3 orders/second)
- Identify service boundaries using DDD
- Explain saga pattern for order workflow
- Discuss failure scenarios and handling
- Address scaling (horizontal scaling of stateless services)
- Mention observability approach
Exercise 2: Design URL Shortener
Requirements:
- 100M URLs/month
- Redirect latency < 100ms
- 5-year data retention
Design considerations:
- Read-heavy workload (100:1 read/write)
- Cache heavily (Redis)
- Generate short codes (Base62)
- Distributed ID generation
- Consistent hashing for distribution
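The numbers in this exercise support a quick back-of-envelope calculation, and Base62 encoding is short enough to write out. The sketch assumes a 30-day month and that short codes are derived from numeric IDs; the alphabet ordering is an arbitrary choice.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase  # 62 symbols


def base62_encode(n: int) -> str:
    """Encode a non-negative integer ID as a Base62 short code."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


# Back-of-envelope sizing from the stated requirements.
urls_per_month = 100_000_000
total_urls = urls_per_month * 12 * 5                     # 6 billion over 5 years
writes_per_second = urls_per_month / (30 * 24 * 3600)    # roughly 39
reads_per_second = writes_per_second * 100               # roughly 3,900 at 100:1

# Smallest code length whose keyspace covers every URL created in 5 years.
code_length = 1
while 62 ** code_length < total_urls:
    code_length += 1
```

Five characters give about 916 million codes, which is too few for 6 billion URLs. Six characters give about 56.8 billion, so six is the minimum length under these assumptions.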
Exercise 3: Design Notification Service
Requirements:
- Multi-channel (email, SMS, push)
- Template support
- Delivery guarantees
- Rate limiting
Design considerations:
- Message queue for reliability
- Channel-specific workers
- Dead letter queue for failures
- Priority queues
- Idempotency for retries
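Idempotent retries and a dead letter queue fit together in the worker loop. The sketch below uses an in-memory `deque` as a stand-in for a real message queue. The message fields, `MAX_ATTEMPTS`, and the `send` callback are hypothetical, and a real worker would store delivered IDs durably.

```python
from collections import deque

MAX_ATTEMPTS = 3  # assumed retry budget before a message is dead-lettered


def process_queue(queue: deque, send, delivered_ids: set, dead_letters: list) -> None:
    """Drain the queue, skipping duplicates and dead-lettering repeated failures."""
    while queue:
        msg = queue.popleft()
        if msg["id"] in delivered_ids:
            continue                      # idempotency: a duplicate is a no-op
        try:
            send(msg["channel"], msg["body"])
            delivered_ids.add(msg["id"])
        except Exception:
            msg["attempts"] = msg.get("attempts", 0) + 1
            if msg["attempts"] >= MAX_ATTEMPTS:
                dead_letters.append(msg)  # park for inspection instead of retrying forever
            else:
                queue.append(msg)         # retry later, behind other messages
```

Deduplicating on the message ID is what makes at-least-once delivery from the queue safe: a redelivered notification is skipped instead of being sent twice.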
Behavioral Questions
Tell me about a time you dealt with a production incident
STAR Format:
Situation: “Payment service started timing out during Black Friday peak.”
Task: “I was on-call and needed to restore service quickly.”
Action:
- Checked dashboards, saw 95th percentile latency spike
- Identified database connection pool exhaustion
- Temporary: Increased connection pool, added more replicas
- Long-term: Implemented connection pooling with PgBouncer
Result:
- Service restored in 15 minutes
- Added connection pool monitoring
- Implemented load shedding for future peaks
Describe a challenging microservices migration
Focus on:
- Why migration was needed
- Planning and preparation
- Strangler fig pattern usage
- Data migration strategy
- Rollback plan
- Lessons learned
Quick Reference Card
Interview Tips
Do's
- Clarify requirements upfront
- Think out loud
- Discuss trade-offs
- Mention failure scenarios
- Draw diagrams
- Reference real experience
- Ask good questions
Don'ts
- Jump to solution immediately
- Over-engineer simple problems
- Ignore scale requirements
- Forget about data consistency
- Skip error handling discussion
- Claim expertise you don’t have
- Dismiss simpler solutions
Summary
Key Interview Themes
- Architecture: Know when to use microservices and how to design boundaries
- Data: Understand eventual consistency, sagas, and CQRS
- Resilience: Circuit breakers, retries, fallbacks are essential
- Communication: Know sync vs async trade-offs
- Observability: Logs, metrics, traces - you need all three
- Experience: Have real examples ready to share
Next Steps
Practice
Work through the capstone project to apply everything you’ve learned.
Capstone Project
Build a complete e-commerce microservices system from scratch.