Cloud Service Patterns — AWS in Depth
Knowing that “the cloud” exists is table stakes. Knowing which specific service to reach for, why it behaves the way it does under load, and where the cost traps hide: that is what separates engineers who build on AWS from engineers who get paged by AWS. This chapter goes deep on the services you will actually use in production: Lambda, S3, DynamoDB, SQS, EventBridge, ECS, and the networking and security primitives that hold it all together. Every section is grounded in real production behavior, not marketing pages.
Part I — Serverless Patterns (Lambda / Cloud Functions)
1.1 Cold Starts — What Actually Happens
When you invoke a Lambda function, AWS must run your code somewhere. If there is already a warm execution environment sitting idle from a recent invocation, your code runs immediately; this is a “warm start” and adds near-zero overhead. But if no warm environment exists, AWS must provision one from scratch. This is a cold start, and understanding exactly what happens during one is the key to mitigating them.
The cold start lifecycle:
- Download your code — AWS fetches your deployment package (zip or container image) from S3 or ECR. Larger packages take longer.
- Create the execution environment — AWS provisions a microVM (Firecracker), allocates memory, sets up networking, and mounts the filesystem.
- Initialize the runtime — The language runtime starts (Python interpreter, Node.js V8, JVM, .NET CLR). This is where JVM-based languages pay a massive tax.
- Run your initialization code — Any code outside your handler function executes: import statements, database connection setup, SDK client creation, global variable initialization.
- Execute your handler — Finally, your actual function logic runs.
| Runtime | Typical Cold Start | Worst Case | Why |
|---|---|---|---|
| Python | 100-300 ms | 500 ms+ | Interpreter starts fast; import-heavy packages (pandas, numpy) add time |
| Node.js | 100-300 ms | 500 ms+ | V8 is quick; large node_modules increase download/init |
| Go | 50-100 ms | 200 ms | Compiled binary, no runtime initialization |
| Rust | 50-100 ms | 200 ms | Same as Go — compiled, minimal runtime |
| Java | 3-10 seconds | 15+ seconds | JVM startup, class loading, JIT compilation warmup |
| .NET | 1-3 seconds | 5+ seconds | CLR initialization, assembly loading |
Mitigation strategies:
- Provisioned Concurrency — Pre-warms a specified number of execution environments. You pay for them whether they are used or not (like reserved instances for Lambda). Use when: your API has strict latency SLAs and cold starts are unacceptable.
- SnapStart — AWS takes a snapshot of the initialized execution environment after your init code runs, then restores from the snapshot on cold start instead of re-initializing. Reduces Java cold starts from seconds to ~200-500 ms. Released for Java in late 2022 and later extended to Python and .NET. Use for any Java Lambda function.
- Keep functions small — Smaller deployment packages download faster. Trim unused dependencies. Use Lambda layers for shared libraries.
- Warm-up pings — A CloudWatch Events rule that invokes your function every 5 minutes with a no-op payload. Cheap and effective, but does not guarantee warm instances under concurrent load (one ping keeps one instance warm).
- Choose lightweight runtimes — Go and Rust have the fastest cold starts. Python and Node.js are a good middle ground. Java and .NET are the slowest without SnapStart or equivalent.
- Initialize outside the handler — Database connections, SDK clients, and configuration should be created in the global scope (outside the handler function). This code runs once per cold start and is reused across warm invocations.
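The last point can be illustrated with a minimal Python sketch. The names `_create_client`, `CLIENT`, and `INIT_COUNT` are hypothetical stand-ins for real setup work such as opening a database connection or building an SDK client.

```python
import time

# Module scope: this code runs once per cold start (the init phase)
# and its results are reused by every warm invocation afterwards.
INIT_COUNT = 0


def _create_client():
    """Hypothetical stand-in for expensive setup (DB connection, SDK client)."""
    global INIT_COUNT
    INIT_COUNT += 1
    return {"created_at": time.time()}


CLIENT = _create_client()  # created during init, not per request


def handler(event, context):
    # Only per-request work happens here; CLIENT already exists.
    return {"init_count": INIT_COUNT, "echo": event.get("name")}
```

Calling `handler` repeatedly in the same process leaves `init_count` at 1, which mirrors how a warm execution environment skips the init phase.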
1.2 Concurrency Model
Lambda concurrency is one of the most misunderstood aspects of the service. Here is how it actually works: One invocation = one execution environment. If your function receives 100 simultaneous requests, AWS creates 100 execution environments. Each environment handles exactly one request at a time (your code may use threads internally, but an environment never processes two invocations concurrently). This is fundamentally different from a container or VM that handles many concurrent requests.
Account-level limits:
| Limit | Default | Can Increase |
|---|---|---|
| Concurrent executions (per region) | 1,000 | Yes, via support ticket (common to get 10K-100K) |
| Burst concurrency | 500-3,000 (region-dependent) | No |
| Function-level reserved concurrency | Configurable (0 to account limit) | N/A |
Reserved vs. unreserved concurrency:
- Unreserved — All functions share the account’s concurrency pool. A traffic spike on one function can starve all others.
- Reserved — You allocate a guaranteed number of concurrent executions to a specific function. That function can never exceed its reservation, and no other function can consume its allocation. This is both a guarantee and a throttle.
Throttling: when a function has no concurrency available, Lambda rejects the invocation with a 429 TooManyRequestsException. For synchronous invocations (API Gateway), this surfaces as a 429 to the client. For asynchronous invocations (S3 events, SNS), Lambda retries automatically (twice by default). For SQS event sources, the message returns to the queue and is retried based on visibility timeout.
1.3 Event Sources — The Lambda Trigger Taxonomy
Lambda’s power comes from its integration with dozens of AWS services. But each event source has different invocation semantics, and misunderstanding them causes subtle bugs.
| Event Source | Invocation Type | Retry Behavior | Concurrency Model |
|---|---|---|---|
| API Gateway | Synchronous | No retry (client retries) | One Lambda per HTTP request |
| SQS | Polling (Lambda pulls) | Message returns to queue on failure | Batch size controls concurrency |
| S3 | Asynchronous | 2 retries, then DLQ | One Lambda per event |
| EventBridge | Asynchronous | Retries for 24 hours | One Lambda per event |
| Kinesis | Polling (Lambda pulls) | Retries until data expires | One Lambda per shard |
| DynamoDB Streams | Polling (Lambda pulls) | Retries until data expires | One Lambda per shard |
| SNS | Asynchronous | 3 retries | One Lambda per message |
1.4 Lambda Architecture Patterns
Pattern 1: API Backend
API Gateway -> Lambda -> DynamoDB/RDS. The most common pattern. Each HTTP endpoint maps to a Lambda function (or a single function with routing logic). Works well for moderate traffic APIs. Watch out for: cold starts on infrequently-hit endpoints, connection pooling with RDS (use RDS Proxy), and API Gateway’s 29-second timeout limit.
Pattern 2: Event Processor
S3/SQS/SNS/EventBridge -> Lambda -> DynamoDB/S3. Process events asynchronously. Image uploaded to S3 triggers a Lambda that generates thumbnails. Order placed publishes to SNS, triggering Lambdas for email, inventory, and analytics. This is where serverless shines: zero cost when idle, and it scales automatically with event volume.
Pattern 3: Scheduled Task
EventBridge Scheduled Rule -> Lambda. Replaces cron jobs. Clean up expired sessions every hour, generate daily reports, sync data between systems. Cheaper and more reliable than running a dedicated EC2 instance for cron.
Pattern 4: Stream Processor
Kinesis/DynamoDB Streams -> Lambda. Process real-time data streams. Clickstream analytics, change data capture, real-time aggregations. Lambda processes records in order within each shard. Use enhanced fan-out for high-throughput consumers.
1.5 Container Images vs Zip Packages
Lambda supports two deployment formats, and the choice matters more than most teams realize.
Zip packages (the default):
- Max size: 50 MB zipped, 250 MB unzipped
- Fast deployment (seconds)
- Works with Lambda layers for shared dependencies
- Best for: most Lambda functions, small-to-medium dependency footprints
Container images:
- Max size: 10 GB
- Uses standard Docker tooling (Dockerfile, docker build, ECR)
- Slower cold starts (larger image to download, though AWS caches aggressively)
- No Lambda layer support (bake everything into the image)
- Best for: ML inference (large model files), functions with massive dependencies, teams with existing Docker workflows, functions that need specific system libraries
1.6 Step Functions — Orchestration Done Right
When you need to coordinate multiple Lambda functions into a workflow, you have three options: synchronous chaining (Lambda calls Lambda, a terrible idea), event-driven choreography (SQS/EventBridge between steps, good for simple flows), or orchestration with Step Functions (good for complex flows with branching, retries, and error handling).
When to use Step Functions:
- Workflows with conditional logic (if payment succeeds, do X; if it fails, do Y)
- Long-running workflows that exceed Lambda’s 15-minute timeout
- Workflows that need human approval steps
- Complex retry and error handling across multiple steps
- When you need a visual representation of workflow state for debugging
When to use event-driven choreography instead:
- Simple fan-out (one event triggers multiple independent actions)
- When steps are truly independent and do not need coordination
- When you want loose coupling between services owned by different teams
- When the “workflow” is really just a pipeline (A -> B -> C with no branching)
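To make the "conditional logic plus retries" case concrete, here is a minimal sketch of a Step Functions definition in Amazon States Language, built as a Python dict. The state names, the `$.payment.status` path, and the three function ARNs are hypothetical; only the ASL field names (`Retry`, `Catch`, `Choice`, and so on) come from the States Language itself.

```python
import json


def build_payment_workflow(charge_arn, fulfill_arn, notify_arn):
    """Return an Amazon States Language definition as a plain dict."""
    return {
        "StartAt": "ChargePayment",
        "States": {
            "ChargePayment": {
                "Type": "Task",
                "Resource": charge_arn,
                # Retry transient task failures with exponential backoff.
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": 2,
                    "MaxAttempts": 3,
                    "BackoffRate": 2.0,
                }],
                # After retries are exhausted, route to the failure branch.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                "Next": "CheckPaymentStatus",
            },
            "CheckPaymentStatus": {
                "Type": "Choice",
                "Choices": [{
                    "Variable": "$.payment.status",  # assumed output shape
                    "StringEquals": "SUCCEEDED",
                    "Next": "FulfillOrder",
                }],
                "Default": "NotifyFailure",
            },
            "FulfillOrder": {"Type": "Task", "Resource": fulfill_arn, "End": True},
            "NotifyFailure": {"Type": "Task", "Resource": notify_arn, "End": True},
        },
    }


def to_json(definition):
    """Serialize the definition for use as a state machine definition string."""
    return json.dumps(definition, indent=2)
```

The branching and retry policy live in the definition rather than in function code, which is the main reason to prefer orchestration once a workflow has more than one path.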
1.7 Cost Model — When Serverless Stops Being Cheap
Lambda pricing has two components: per-invocation ($0.20 per 1 million requests) and per-duration ($0.0000166667 per GB-second). The combination means Lambda is extraordinarily cheap at low scale and surprisingly expensive at high scale.
The break-even calculation:
| Metric | Lambda (128 MB, 200ms avg) | ECS Fargate (0.25 vCPU, 0.5 GB) |
|---|---|---|
| 100K requests/month | ~$0.43 | ~$9.00 |
| 1M requests/month | ~$4.30 | ~$9.00 |
| 10M requests/month | ~$43.00 | ~$9.00 |
| 100M requests/month | ~$430.00 | ~$36.00 (4 tasks) |
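The arithmetic behind the request and duration charges can be sketched as follows. The per-request price of $0.20 per million is an assumed public list price, and the function names are illustrative. The sketch models only those two Lambda charges: it leaves out API Gateway, data transfer, and the free tier, so treat its output as a lower bound rather than a fully loaded estimate.

```python
REQUEST_PRICE_PER_MILLION = 0.20   # USD, assumed list price
GB_SECOND_PRICE = 0.0000166667     # USD per GB-second


def lambda_monthly_cost(requests, memory_mb, avg_duration_ms):
    """Request charge plus duration charge for one month of invocations."""
    request_cost = requests / 1_000_000 * REQUEST_PRICE_PER_MILLION
    gb_seconds = requests * (avg_duration_ms / 1000) * (memory_mb / 1024)
    return request_cost + gb_seconds * GB_SECOND_PRICE


def break_even_requests(fixed_monthly_cost, memory_mb, avg_duration_ms):
    """Monthly request count at which Lambda matches a fixed-cost alternative."""
    per_request = lambda_monthly_cost(1, memory_mb, avg_duration_ms)
    return fixed_monthly_cost / per_request
```

Because both charges scale linearly with traffic while an always-on task is a flat fee, there is always a crossover point; where it falls depends heavily on memory size, average duration, and what else sits in the request path.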
1.8 Anti-Patterns
The Lambda-lith: Deploying your entire Express/Flask application as a single Lambda function behind API Gateway. You lose all the benefits of serverless (granular scaling, independent deployment, per-function monitoring) and keep all the downsides (cold starts, 15-minute timeout, payload limits). If you want to run a monolith, use containers.
Synchronous Lambda chains: Lambda A calls Lambda B, which calls Lambda C, all synchronously. Each function waits for the next, consuming concurrency and accumulating latency. If Lambda C times out, Lambda B times out, and Lambda A times out, a cascade failure. Use Step Functions or async patterns (SQS between steps) instead.
Cold-start-sensitive critical paths: Placing Lambda in the hot path of a user-facing request without provisioned concurrency, then wondering why p99 latency is 3 seconds. Either use provisioned concurrency, or move the latency-sensitive operation to an always-on service.
Recursive Lambda invocations: A Lambda function that writes to an S3 bucket that triggers the same Lambda function. This creates an infinite loop that racks up a massive bill in minutes. AWS now has recursive loop detection for some event sources, but do not rely on it; design your event routing to prevent cycles.
Lambda-to-Lambda synchronous chaining (expanded): This deserves emphasis because it remains one of the most common mistakes in serverless architectures. The problems compound:
- Latency accumulation: Each hop adds cold start risk + network latency. A chain of A -> B -> C -> D with 300ms cold starts becomes 1.2 seconds in the worst case, before any business logic runs.
- Concurrency multiplication: If A calls B synchronously, both A and B consume a concurrent execution for the full duration. A chain of 4 functions processing 100 requests consumes 400 concurrent executions. You hit the account limit 4x faster.
- Cascading timeouts: If D’s timeout is 30 seconds, C must have a timeout > 30 seconds, B > C, A > B. Now A has a 2-minute timeout for what should be a 5-second operation.
- Cost doubling: Every function in the chain is billed for the time it spends waiting for the downstream function. You are literally paying twice (or more) for the same wall-clock time.
- The fix: Use Step Functions for orchestration (each step is invoked independently), SQS between steps for loose coupling, or collapse the chain into a single function if the logic is simple enough.
Bloated deployment packages:
- Why it happens: Teams install entire frameworks (all of boto3 when they only need S3, all of pandas when they only need CSV parsing), bundle test fixtures, include unused native binaries, or skip tree-shaking in Node.js.
- The fix: Audit dependencies ruthlessly. Use pip install --target with only the packages you need. In Node.js, use bundlers (esbuild, webpack) to tree-shake dead code. Move shared dependencies to Lambda Layers: layers are downloaded once and cached across function updates, so updating your function code does not re-download shared libraries.
- Lambda Layers best practices: Create layers for common SDKs, database drivers, and utility libraries. A function can use up to 5 layers, and the 250 MB unzipped limit applies to the function and its layers combined. Layers are versioned and each version is immutable; a function always references a specific layer version, so pin versions in production and upgrade them deliberately.
Key CloudWatch metrics to monitor: ConcurrentExecutions, Throttles, Duration (p50/p95/p99), IteratorAge (for stream-based triggers), and DeadLetterErrors.
Part II — Storage Patterns (S3 / Blob Storage)
2.1 S3 Consistency Model
Before December 2020, S3 had eventual consistency for overwrite PUTs and DELETEs. You could update an object and immediately read the old version. This was one of the most common sources of subtle bugs in S3-based architectures. Since December 2020, S3 provides strong read-after-write consistency for all operations. After a successful PUT, any subsequent GET returns the new version. After a DELETE, any subsequent GET returns a 404. This applies to all S3 API operations, all storage classes, and all regions. You do not need to pay extra for it. There is no performance penalty.
2.2 Storage Classes and Lifecycle Policies
The six S3 storage classes below cover most workloads, each with different cost and access characteristics. Choosing the right one, and automating transitions between them, is one of the easiest ways to cut AWS bills.
| Storage Class | Use Case | Retrieval Time | Monthly Cost (per GB) | Retrieval Cost |
|---|---|---|---|---|
| S3 Standard | Frequently accessed data | Instant | ~$0.023 | None |
| S3 Intelligent-Tiering | Unknown/changing access patterns | Instant | ~$0.023 + monitoring fee | None |
| S3 Standard-IA | Infrequently accessed, needs instant access | Instant | ~$0.0125 | Per-GB retrieval fee |
| S3 One Zone-IA | Infrequent, reproducible data (thumbnails, transcoded media) | Instant | ~$0.01 | Per-GB retrieval fee |
| S3 Glacier Flexible | Archive, retrieval in minutes to hours | Minutes (expedited) to 12 hours (bulk) | ~$0.0036 | Per-GB + per-request |
| S3 Glacier Deep Archive | Long-term archive, compliance | 12-48 hours | ~$0.00099 | Per-GB + per-request |
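Transitions between these classes are usually automated with a lifecycle configuration. The sketch below builds one as a Python dict in the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument. The rule ID, the `logs/` prefix, and every day threshold are illustrative assumptions, not recommendations from the text.

```python
def build_lifecycle_rules(prefix="logs/"):
    """Tier objects under `prefix` to colder storage over time, then expire them."""
    return {
        "Rules": [{
            "ID": "tier-then-expire",          # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
            ],
            # Delete current versions after two years.
            "Expiration": {"Days": 730},
            # Keep old versions from accumulating on versioned buckets.
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            # Clean up parts left behind by abandoned multipart uploads.
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        }]
    }
```

With a real client this would be applied along the lines of `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=build_lifecycle_rules())`; the right thresholds depend on how often the data is actually read.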
2.3 Multipart Uploads
For files larger than 100 MB, S3’s multipart upload is essential. Instead of uploading a single large object (which fails if the network drops midway), you split the file into parts (5 MB to 5 GB each), upload each part independently (possibly in parallel), and then tell S3 to assemble them.
Why this matters:
- Resilience — If one part fails, you retry only that part, not the entire file.
- Parallelism — Upload parts concurrently to saturate your bandwidth.
- Required for files > 5 GB — S3 has a 5 GB single-PUT limit.
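- Resumable — Parts already uploaded persist until the upload is completed or aborted, so you can resume after a failure. Stored parts are billed, so add a lifecycle rule that aborts incomplete uploads after a set period (7 days is a common choice).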
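The splitting step can be sketched as a small planner that turns an object size into part numbers and byte ranges. The 64 MiB preferred part size is an assumption for illustration; the 5 MiB minimum, 5 GiB maximum, and 10,000-part ceiling are S3 limits. Each resulting tuple is what you would hand to an uploader that calls `upload_part` for that range.

```python
import math

MIN_PART = 5 * 1024**2    # 5 MiB minimum (the last part may be smaller)
MAX_PART = 5 * 1024**3    # 5 GiB maximum per part
MAX_PARTS = 10_000        # maximum number of parts in one upload


def plan_parts(total_size, preferred_part_size=64 * 1024**2):
    """Return a list of (part_number, offset, length) tuples covering the object."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    # Grow the part size if the preferred size would exceed the part-count limit.
    part_size = max(preferred_part_size, MIN_PART, math.ceil(total_size / MAX_PARTS))
    if part_size > MAX_PART:
        raise ValueError("object is too large for a single multipart upload")
    parts = []
    offset = 0
    number = 1  # S3 part numbers start at 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts
```

Because every part has an explicit offset and length, a failed part can be retried on its own and independent parts can be uploaded in parallel.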
2.4 Pre-Signed URLs — Secure Temporary Access
Pre-signed URLs let you grant temporary access to private S3 objects without making them public or sharing credentials. You generate a URL server-side that includes a cryptographic signature and expiration time. Anyone with the URL can download (or upload) the object until the URL expires.
Common patterns:
- Secure file downloads — User requests a file, your API generates a pre-signed GET URL valid for 15 minutes, returns it to the client. Client downloads directly from S3, bypassing your server entirely.
- Direct-to-S3 uploads — Your API generates a pre-signed PUT URL. Client uploads directly to S3. Your server never touches the file data, eliminating a bandwidth and memory bottleneck.
- Temporary media access — Serve images or videos through pre-signed URLs with short expiration. Works with CloudFront signed URLs for even better performance.
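The download and upload patterns above might look like the following sketch. It assumes `s3_client` behaves like `boto3.client("s3")` and uses that client's `generate_presigned_url` method; the function names and the 900-second default (15 minutes) are illustrative. The client is passed in as a parameter so the helpers can be exercised without AWS credentials.

```python
def presigned_download_url(s3_client, bucket, key, expires_seconds=900):
    """Return a time-limited GET URL for a private object."""
    return s3_client.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_seconds,
    )


def presigned_upload_url(s3_client, bucket, key, expires_seconds=900):
    """Return a time-limited PUT URL so a client can upload directly to S3."""
    return s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_seconds,
    )
```

The URL carries the permissions of whoever signed it, so the signing role should itself be scoped to the bucket and prefix in question.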
2.5 S3 Event Notifications
S3 can emit events when objects are created, deleted, or restored. These events can trigger Lambda functions, SQS queues, SNS topics, or EventBridge rules. This is the foundation of most event-driven data processing architectures on AWS.
Common patterns:
- Upload image to S3 -> Lambda generates thumbnails -> writes to another S3 bucket
- CSV uploaded to S3 -> SQS message -> worker processes and loads to database
- Log file uploaded -> EventBridge rule -> Step Functions workflow for analysis
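On the receiving side, a Lambda handler has to unpack the notification before doing any work. The sketch below reads the bucket and key from each record and URL-decodes the key, since S3 delivers object keys URL-encoded in event payloads. The `thumbnails/` prefix check is a hypothetical guard against a function re-triggering on its own output.

```python
from urllib.parse import unquote_plus


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    objects = []
    for record in event.get("Records", []):
        s3 = record.get("s3")
        if not s3:
            continue  # not an S3 record
        bucket = s3["bucket"]["name"]
        key = unquote_plus(s3["object"]["key"])  # keys arrive URL-encoded
        objects.append((bucket, key))
    return objects


def handler(event, context):
    processed = []
    for bucket, key in extract_s3_objects(event):
        if key.startswith("thumbnails/"):
            continue  # skip our own output to avoid a recursive trigger
        processed.append(f"{bucket}/{key}")
    return {"processed": processed}
```

Writing results to a separate bucket, as in the first pattern above, removes the need for the prefix guard altogether.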
2.6 S3 Select and Athena — Querying Data in Place
Instead of downloading an entire CSV or JSON file to filter it, S3 Select lets you run SQL expressions directly against objects in S3. It pushes the filtering to the storage layer, reducing data transfer and processing time.
- S3 Select — Simple filtering on a single object. Quick, cheap, no setup.
- Athena — Complex queries across many objects. Joins, aggregations, window functions. Best with columnar formats (Parquet, ORC) that reduce data scanned (and therefore cost).
2.7 S3 Cost Traps
Trap 1: Request costs. S3 storage is cheap, but GET and PUT requests are not free. GET: $0.0004 per 1,000 requests. PUT: $0.005 per 1,000 requests. A system that makes 100 million GET requests per month pays $40 in request fees alone, on top of storage and transfer.
Trap 2: Data transfer out. S3 to the internet costs about $0.09/GB, so serving 1 TB per month costs roughly $90/month in transfer alone. Use CloudFront ($0.085/GB but with caching, actual transfer from S3 drops dramatically).
Trap 3: Storing everything in STANDARD. Old logs, backups, and infrequently accessed data sitting in S3 Standard is waste. Lifecycle policies are free to configure and can cut storage costs by 50-80%.
Trap 4: Versioning without lifecycle rules. S3 versioning keeps every version of every object. Without lifecycle rules to expire old versions, storage grows silently. A 1 GB file overwritten daily generates 365 GB of versions per year.
Part III — Compute Patterns
3.1 The Compute Decision Matrix
This is the question you will face in every architecture review and most system design interviews: which compute platform for this workload?
| Factor | Lambda | ECS Fargate | ECS on EC2 | EKS | EC2 |
|---|---|---|---|---|---|
| Startup time | 100ms-10s (cold start) | 30-60s | 30-60s | 30-60s | Minutes |
| Max execution | 15 minutes | Unlimited | Unlimited | Unlimited | Unlimited |
| Scaling | Automatic, per-request | Auto, task-based | Auto, instance+task | Auto, pod-based | Auto, instance-based |
| Min cost | $0 (scale to zero) | ~$9/month (1 task) | EC2 instance cost | EC2 + EKS $73/mo | EC2 instance |
| Operational overhead | Minimal | Low | Medium | High | High |
| Container support | Yes (images) | Native | Native | Native | Manual |
| Team size | 1-3 OK | 2-5 OK | 5+ | 5+ | Any |
| Best for | Event-driven, APIs, glue | Web apps, APIs, workers | Cost-optimized steady load | Multi-cloud, complex orchestration | Legacy, GPU, special hardware |
3.2 Fargate Spot — The Cost Optimization Cheat Code
Fargate Spot tasks run on spare AWS capacity at up to 70% discount. AWS can reclaim them with a two-minute warning: the task receives SIGTERM, and with the default stopTimeout its containers then have 30 seconds to exit unless you raise that value (up to 120 seconds).
Use for: batch processing, CI/CD builds, development environments, any workload that can handle interruption.
Do not use for: Production API servers, database tasks, anything where a short-notice termination causes data loss or user-facing errors.
Strategy: Run your baseline on regular Fargate tasks and burst onto Spot. Set up capacity providers to automatically split tasks between the two (e.g., 80% regular, 20% Spot).
3.3 Auto-Scaling Patterns
Target tracking (simplest, usually best): “Keep average CPU at 60%.” AWS handles the math: it adds tasks when CPU exceeds 60% and removes them when it drops below. Works for most workloads. Also supports custom metrics (queue depth, request count).
Step scaling (more control): “When CPU > 70% for 3 minutes, add 2 tasks. When CPU > 90% for 1 minute, add 5 tasks.” Gives you fine-grained control over scaling speed. Useful when you know your scaling curve.
Predictive scaling: Uses machine learning to forecast traffic based on historical patterns and pre-scales before the load arrives. Ideal for workloads with predictable daily/weekly cycles (e-commerce sites, business applications). Needs at least 24 hours of historical data.
3.4 Graviton Processors — ARM on AWS
AWS Graviton processors are ARM-based chips designed by AWS. They offer up to 40% better price-performance compared to equivalent x86 instances. Available on EC2 (M7g, C7g, R7g families), ECS, EKS, Lambda, and RDS.
Why switch:
- 20-40% cheaper for compute-heavy workloads (no code changes needed for interpreted languages)
- Better energy efficiency (lower carbon footprint)
- Same or better single-threaded performance for most workloads
What to watch for:
- Native compiled code (C, C++, Go, Rust) needs ARM cross-compilation
- Some third-party software does not have ARM builds yet
- Docker images must be built for ARM (--platform linux/arm64) or use multi-arch manifests
Part IV — Messaging and Event Patterns
4.1 The Messaging Decision Tree
This is the question that comes up in every system design interview involving async processing. Here is the honest decision framework:
Do you need a simple task queue? -> SQS Standard. Messages are delivered at-least-once, ordering is best-effort. Simple, cheap, scales automatically. This handles 80% of async use cases.
Do you need strict ordering? -> SQS FIFO. Messages are delivered exactly-once (within the deduplication window), in order, grouped by message group ID. Limited to 300 messages/second (3,000 with batching). Use when order matters (financial transactions, sequential processing).
Do you need fan-out (one event, many consumers)? -> SNS for simple fan-out, EventBridge for filtered routing. SNS pushes to all subscribers. EventBridge lets you filter by event content (only route OrderPlaced events where order.total > 1000 to the fraud detection service).
Do you need high-throughput stream processing with replay? -> Kinesis Data Streams. Consumers can replay the stream from any point. Supports multiple independent consumers. Sharded for parallelism. Use for: clickstream analytics, real-time aggregations, change data capture.
| Feature | SQS | SNS | EventBridge | Kinesis |
|---|---|---|---|---|
| Model | Queue (point-to-point) | Pub/sub (fan-out) | Event bus (filtered routing) | Stream (ordered log) |
| Ordering | FIFO option | No | No | Per-shard |
| Retention | Up to 14 days | No retention (push only) | Archive + replay | 1-365 days |
| Throughput | Nearly unlimited (Standard) | 30M messages/s | Thousands/s (soft limit) | Per-shard (1 MB/s in, 2 MB/s out) |
| Consumer model | Pull (polling) | Push (HTTP, Lambda, SQS) | Push (100+ targets) | Pull (KCL, Lambda) |
| Replay | No | No | Yes (archive) | Yes (any point in stream) |
| Cost model | Per-request | Per-publish + delivery | Per-event | Per-shard-hour + per-PUT |
| Best for | Task queues, decoupling | Simple fan-out, notifications | Event routing, cross-account events | Real-time analytics, CDC |
4.2 SQS Deep Dive
Visibility timeout: When a consumer reads a message, SQS hides it from other consumers for the visibility timeout period (default 30 seconds). If the consumer processes it and deletes it within that window, done. If the consumer crashes, the timeout expires and the message becomes visible again for another consumer. Set the timeout to slightly longer than your maximum processing time.
Dead Letter Queue (DLQ): After a configurable number of failed processing attempts (maxReceiveCount), SQS moves the message to a DLQ. Monitor your DLQ. An empty DLQ means your processing is healthy. A growing DLQ means something is wrong. Set up CloudWatch alarms on DLQ message count.
Batching: SQS supports reading up to 10 messages at a time and sending up to 10 messages at a time. Always batch when possible — it reduces API calls and therefore cost by up to 10x.
Long polling: By default, ReceiveMessage returns immediately even if no messages are available (short polling). Set WaitTimeSeconds to 20 (maximum) for long polling — the call blocks until a message arrives or the timeout expires. This eliminates empty responses and reduces API call costs.
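Putting visibility timeout, batching, and long polling together, one polling cycle of a consumer could be sketched as below. It assumes `sqs_client` behaves like `boto3.client("sqs")`; `drain_once` and the `process` callback are hypothetical names. Messages whose processing raises are deliberately not deleted, so they reappear after the visibility timeout and eventually reach the DLQ.

```python
def drain_once(sqs_client, queue_url, process):
    """Run one long-poll cycle. Returns (succeeded, failed) counts."""
    response = sqs_client.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # read a full batch per API call
        WaitTimeSeconds=20,      # long polling: wait up to 20s for messages
    )
    messages = response.get("Messages", [])
    done = []
    for message in messages:
        try:
            process(message["Body"])
        except Exception:
            # Leave the message alone: it becomes visible again after the
            # visibility timeout and will be retried.
            continue
        done.append({
            "Id": message["MessageId"],
            "ReceiptHandle": message["ReceiptHandle"],
        })
    if done:
        # Delete all successfully processed messages in one batched call.
        sqs_client.delete_message_batch(QueueUrl=queue_url, Entries=done)
    return len(done), len(messages) - len(done)
```

A worker would call this in a loop. Processing must be idempotent, because at-least-once delivery means the same message can arrive more than once.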
4.3 EventBridge Patterns
EventBridge is the most underused AWS service relative to its power. Think of it as a serverless event router with content-based filtering.
Event patterns let you filter events by content, for example matching only OrderPlaced events with total > $1,000 from US or Canada. Events that do not match are silently ignored, with no code needed.
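A pattern for OrderPlaced events over $1,000 from the US or Canada might look like the dict below. The field names `total` and `country` inside `detail` are assumptions about the event shape. The `matches` function is a deliberately simplified local illustration that handles only exact-value lists and `numeric` comparisons; it is not EventBridge's implementation and ignores the rest of the pattern syntax.

```python
import operator

# Hypothetical event shape: detail.total (number) and detail.country (string).
ORDER_PATTERN = {
    "detail-type": ["OrderPlaced"],
    "detail": {
        "total": [{"numeric": [">", 1000]}],
        "country": ["US", "CA"],
    },
}

_OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
        "<=": operator.le, "=": operator.eq}


def _value_matches(value, candidates):
    """A field matches if ANY candidate in its list matches."""
    for candidate in candidates:
        if isinstance(candidate, dict) and "numeric" in candidate:
            spec = candidate["numeric"]  # e.g. [">", 0, "<=", 5]
            pairs = zip(spec[0::2], spec[1::2])
            if isinstance(value, (int, float)) and all(
                _OPS[op](value, bound) for op, bound in pairs
            ):
                return True
        elif value == candidate:
            return True
    return False


def matches(pattern, event):
    """Every field named in the pattern must be present and match."""
    for field, expected in pattern.items():
        if field not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[field], dict) or not matches(expected, event[field]):
                return False
        elif not _value_matches(event[field], expected):
            return False
    return True
```

The filter is declarative data attached to a rule rather than code running in a consumer, which is why non-matching events never reach the target at all.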
Schema registry automatically discovers and stores event schemas as events flow through EventBridge. You can generate type-safe code bindings from these schemas. This solves one of the biggest pain points of event-driven architecture: knowing what events look like.
Cross-account events let you route events between AWS accounts — critical for multi-account organizational architectures where production, staging, and shared services run in separate accounts.
4.4 Event-Driven Anti-Patterns
The event spaghetti: Every service emits events, every other service consumes them, nobody knows the full event flow. Solve with: an event catalog (EventBridge schema registry), event ownership documentation, and architectural diagrams that show event flows.
The event that should have been a command: Events describe what happened (“OrderPlaced”). Commands describe what should happen (“SendConfirmationEmail”). If your event consumers are doing imperative work that the publisher expects to happen, you have a disguised command. Use a queue (SQS) for commands, events (EventBridge/SNS) for notifications.
Missing dead letter queues: Events that fail processing silently disappear. Every async integration must have a failure path: a DLQ, an on-failure destination, or error logging at minimum.
Part V — Database Patterns on AWS
5.1 RDS vs Aurora
When to choose Aurora:
- Availability requirements: Aurora’s storage layer keeps six copies of your data across three Availability Zones and survives losing two copies without write impact and three copies without read impact. RDS Multi-AZ failover takes 60-120 seconds; Aurora failover takes under 30 seconds.
- Read-heavy workloads: Aurora supports up to 15 read replicas (RDS supports 5), with replication lag typically under 20ms (RDS lag can be seconds).
- Auto-scaling storage: Aurora storage grows automatically in 10 GB increments up to 128 TB. No need to provision storage upfront.
- Large databases: For databases over 1 TB, Aurora’s distributed storage outperforms RDS’s EBS-based storage significantly.
When to stay with RDS:
- Small databases (< 100 GB) with moderate traffic
- Development and staging environments
- When you need Oracle or SQL Server (Aurora only supports PostgreSQL and MySQL)
- When your workload is write-heavy and does not benefit from Aurora’s read replica architecture
5.2 Aurora Serverless v2
Aurora Serverless v2 scales database compute automatically based on demand. You define a minimum and maximum capacity (in ACUs, Aurora Capacity Units), and Aurora adjusts in near real-time. You pay only for the capacity you use, measured in ACU-seconds.
When it shines: Development environments, applications with variable traffic patterns, new products where you do not know the traffic profile yet. You get the Aurora storage engine’s reliability without paying for a provisioned instance 24/7.
When it does not: Highly predictable, steady-state workloads. Provisioned Aurora instances are cheaper when utilization is consistently high. Also, ACU scaling has a brief pause that can cause latency spikes during rapid scale-up.
5.3 ElastiCache — Redis vs Memcached on AWS
| Factor | ElastiCache Redis | ElastiCache Memcached |
|---|---|---|
| Data structures | Strings, hashes, lists, sets, sorted sets, streams, geospatial | Strings only |
| Persistence | Yes (snapshots, AOF) | No (pure cache) |
| Replication | Yes (read replicas, Multi-AZ) | No |
| Cluster mode | Yes (horizontal scaling) | Yes (multi-node) |
| Pub/sub | Yes | No |
| Use case | Session store, leaderboards, rate limiting, real-time analytics | Simple caching, session caching |
Part VI — Networking and Security on AWS
6.1 VPC Design Patterns
VPC endpoints:
- Gateway Endpoints (free): S3, DynamoDB. Use these always.
- Interface Endpoints (about $0.01/hour per AZ plus $0.01/GB): Everything else (SQS, ECR, Secrets Manager, CloudWatch). Use when the endpoint cost is lower than sending the same traffic through a NAT Gateway.
6.2 Security Groups vs NACLs
Both control network traffic, but they work differently:
| Feature | Security Groups | Network ACLs |
|---|---|---|
| Level | Instance/ENI level | Subnet level |
| Statefulness | Stateful (return traffic automatically allowed) | Stateless (must explicitly allow both directions) |
| Rules | Allow rules only | Allow AND deny rules |
| Evaluation | All rules evaluated together | Rules evaluated in order (lowest number first) |
| Default | Deny all inbound, allow all outbound | Allow all inbound and outbound |
| Use case | Primary security control | Additional subnet-level defense |
6.3 IAM Least Privilege
- Never use root credentials. Create IAM users or roles for everything.
- Use roles, not long-lived access keys. EC2 instance profiles, ECS task roles, Lambda execution roles — all use temporary credentials that rotate automatically.
- Scope permissions narrowly. "Resource": "*" is almost never necessary. Specify the exact ARN of the resource.
- Use conditions. Restrict by IP range, VPC, time of day, MFA requirement, or source account.
For example, a well-scoped policy grants access only to the reports/ prefix of one specific bucket, not all of S3.
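A policy scoped to the reports/ prefix of a single bucket could be sketched as follows, built as a Python dict and serialized to JSON. The bucket name is supplied by the caller and the `Sid` values are hypothetical; read-only access is an assumption chosen for illustration.

```python
import json


def reports_read_policy(bucket):
    """Read-only access to objects under reports/ in one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadReportObjects",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                # Object-level permission: only keys under reports/.
                "Resource": f"arn:aws:s3:::{bucket}/reports/*",
            },
            {
                "Sid": "ListReportsPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                # Bucket-level permission, narrowed to the prefix by a condition.
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": "reports/*"}},
            },
        ],
    }


def to_policy_document(policy):
    """Serialize for use as an IAM policy document string."""
    return json.dumps(policy, indent=2)
```

Listing is granted on the bucket ARN and reading on the object ARN because `s3:ListBucket` and `s3:GetObject` apply to different resource types.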
6.4 Secrets Manager vs Parameter Store
Both store configuration and secrets. The difference is scope and cost.
| Feature | Secrets Manager | Systems Manager Parameter Store |
|---|---|---|
| Cost | $0.40/secret/month + $0.05/10K API calls | Free (standard), $0.05/10K for advanced |
| Auto-rotation | Built-in for RDS, Redshift, DocumentDB | Manual (you write the Lambda) |
| Cross-account | Native support | Requires additional IAM |
| Encryption | Always encrypted (KMS) | Optional encryption (KMS) |
| Best for | Database passwords, API keys, rotation-required secrets | App configuration, feature flags, non-sensitive parameters |
6.5 WAF and Shield
AWS WAF (Web Application Firewall): Filters HTTP/HTTPS requests to CloudFront, ALB, or API Gateway. You define rules to block common attacks: SQL injection, XSS, IP reputation lists, rate limiting, geo-blocking. Managed rule groups from AWS and partners provide pre-built protection against OWASP Top 10 threats.
AWS Shield Standard: Free DDoS protection for all AWS resources. Protects against common network-layer (L3/L4) attacks. Automatically enabled.
AWS Shield Advanced: $3,000/month. Provides DDoS cost protection (AWS credits you for scaling costs during an attack), 24/7 access to the AWS DDoS Response Team, advanced detection, and real-time attack visibility. Worth it only for organizations where a DDoS attack has significant financial impact.
6.6 Multi-Account Strategy
Most teams start with a single AWS account. This works fine for a side project or an early startup. But as the organization grows (multiple teams, multiple environments, compliance requirements, cost attribution needs), a single account becomes a liability. Multi-account strategy is not an advanced topic you deal with “later.” It is a foundational decision that gets exponentially harder to change the longer you wait.
- Blast radius containment. A misconfigured IAM policy, a compromised credential, or a runaway resource in production cannot affect staging or other teams’ workloads. Each account is a hard security boundary.
- Cost isolation. Each account generates its own bill. No tagging strategy needed to separate prod from dev — it is structural. Finance gets clear cost attribution by account.
- Service limit isolation. Lambda concurrency limits, API Gateway throttling, EC2 instance limits — all are per-account. A runaway batch job in the data team’s account cannot starve the API team’s Lambda functions.
- Compliance boundaries. PCI-DSS, HIPAA, SOC2, and GDPR all benefit from isolating regulated workloads into dedicated accounts with stricter controls. Auditors love clean account boundaries.
- Independent deployments. Teams can deploy, experiment, and break things in their own accounts without coordination overhead.
- Organizational Units (OUs): Group accounts hierarchically. Common structure:
- Consolidated billing: All accounts roll up to a single bill. Volume discounts (Savings Plans, Reserved Instances) apply across the entire organization. You buy RIs in one account and they apply to matching usage in any account.
- Service Control Policies (SCPs): Organization-wide guardrails that override IAM permissions. An SCP is a ceiling, not a floor — even if an IAM policy grants permission, an SCP can deny it.
- Region restriction: Deny all actions outside approved regions.
- Prevent root user access: Deny all actions by the root user (except account recovery).
- Require encryption: Deny `s3:PutObject` unless `s3:x-amz-server-side-encryption` is set. Deny `rds:CreateDBInstance` unless `StorageEncrypted` is true.
- Prevent leaving the organization: Deny `organizations:LeaveOrganization` on all accounts.
- Deny expensive services: Deny `ec2:RunInstances` for instance types larger than `xlarge` in sandbox accounts. Prevent accidental p4d.24xlarge launches that cost $32/hour.
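A region-restriction guardrail like the one listed above can be sketched as a policy document. This is a minimal illustration, not a production SCP: the approved regions, the set of exempted global-service actions, and the helper name `region_restriction_scp` are all assumptions chosen for the example.

```python
import json


def region_restriction_scp(approved_regions, exempt_actions):
    """Build an SCP document that denies requests outside approved regions.

    exempt_actions lists actions for global services that should keep
    working regardless of region (an assumption; tune for your org).
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideApprovedRegions",
                "Effect": "Deny",
                # NotAction: everything EXCEPT these actions is denied
                # when the condition below matches.
                "NotAction": sorted(exempt_actions),
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "aws:RequestedRegion": sorted(approved_regions)
                    }
                },
            }
        ],
    }


# Hypothetical inputs for illustration only.
scp = region_restriction_scp(
    approved_regions={"us-east-1", "eu-west-1"},
    exempt_actions={"iam:*", "organizations:*", "route53:*", "support:*"},
)
print(json.dumps(scp, indent=2))
```

Because an SCP only sets a ceiling, this document grants nothing by itself; it removes permissions that IAM policies in member accounts would otherwise allow.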
| Strategy | When to Use | Pros | Cons |
|---|---|---|---|
| Account-per-environment | Small-to-medium orgs, single team | Simple, clear prod/staging/dev separation | Does not scale to many teams |
| Account-per-team | Medium orgs, 3-10 teams | Team autonomy, independent limits and billing | Many accounts to manage, cross-team access complexity |
| Account-per-workload | Large orgs, regulated industries | Maximum isolation, granular compliance | Account sprawl, significant management overhead |
| Hybrid (recommended) | Most growing organizations | Balances isolation with manageability | Requires thoughtful OU structure |
- IAM cross-account roles: Account A’s Lambda assumes a role in Account B to read from Account B’s S3 bucket. The trust policy on Account B’s role explicitly allows Account A’s principal.
- Resource-based policies: S3 bucket policies, KMS key policies, and SNS topic policies can grant access to specific external accounts without role assumption.
- AWS RAM (Resource Access Manager): Share VPC subnets, Transit Gateway, and other resources across accounts without duplicating them.
- EventBridge cross-account event buses: Route events from workload accounts to a central observability account for unified monitoring.
- Account vending machine. Automate new account creation with Control Tower Account Factory or a custom pipeline (Terraform + Step Functions). Every new account is provisioned with: baseline SCPs applied via its OU, CloudTrail logging shipped to the centralized log archive, GuardDuty enrolled in the delegated administrator account, VPC with Transit Gateway attachment pre-configured, and a pre-seeded CI/CD role for the owning team. A developer requests an account through a self-service portal; 15 minutes later, the account is ready with all guardrails in place. No tickets, no waiting.
- Drift detection. AWS Config rules and Config Conformance Packs continuously audit every account against your organization’s baseline. Common rules: “all S3 buckets must have versioning enabled,” “no security groups allowing 0.0.0.0/0 inbound on port 22,” “all RDS instances must be encrypted.” When drift is detected, AWS Config sends findings to Security Hub, which triggers SNS notifications or auto-remediation Lambda functions. Without drift detection, your carefully crafted guardrails erode within weeks as engineers work around them.
- Centralized audit and compliance. AWS CloudTrail Organization Trail captures every API call across every account in the organization and delivers logs to the centralized log archive account. The log archive account has an S3 bucket with Object Lock (WORM — Write Once Read Many) so that even a compromised account cannot tamper with its own audit trail. CloudTrail Lake enables SQL-based queries across the organization-wide audit log for investigations.
- Delegated administrator accounts. Instead of running all security tools from the management account (which should have minimal usage), delegate administration to specialized accounts: GuardDuty delegated admin, Security Hub delegated admin, AWS Config delegated admin. This follows the principle of least privilege at the account level — the management account only manages organization structure and SCPs.
- Your CI/CD platform (GitHub Actions, GitLab CI, CircleCI) issues an OIDC token to the running job. This token contains claims about the job: repository name, branch, workflow, actor.
- The job calls AWS STS `AssumeRoleWithWebIdentity`, presenting the OIDC token.
- AWS validates the token against the OIDC provider’s public keys (configured as an IAM Identity Provider in your account).
- If valid, AWS returns temporary credentials (access key, secret key, session token) scoped to a specific IAM role with a short TTL (typically 1 hour).
- The job uses these temporary credentials to deploy. When the job ends, the credentials expire automatically.
The Condition block in the role’s trust policy is critical: it restricts which repositories and branches can assume this role. Without it, any GitHub repository could assume your deployment role. The same OIDC pattern works for Kubernetes pods (EKS Pod Identity / IAM Roles for Service Accounts), GitLab CI, and any OIDC-compliant identity provider.
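As a rough sketch of what such a trust policy looks like for GitHub Actions, the snippet below builds one as a Python dictionary. The account ID, repository, and branch are placeholder values, and the helper name `github_oidc_trust_policy` is hypothetical; the condition keys follow the `token.actions.githubusercontent.com` naming that GitHub's OIDC provider uses.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_oidc_trust_policy(account_id, repo, branch):
    """Return a trust policy limited to one repository and one branch."""
    provider_arn = f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Audience must be STS, and the subject claim must
                        # match exactly this repo and branch.
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        f"{GITHUB_OIDC_HOST}:sub": f"repo:{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


# Placeholder values for illustration only.
policy = github_oidc_trust_policy("123456789012", "example-org/example-repo", "main")
print(json.dumps(policy, indent=2))
```

Using `StringEquals` on the `sub` claim pins the role to a single branch; a `StringLike` match with a wildcard is looser and widens who can assume the role.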
Part VII — Cost Engineering
7.1 Reserved Instances vs Savings Plans vs Spot
This is the compute pricing decision framework that every team running significant AWS workloads must understand.
| Option | Discount | Commitment | Flexibility | Best For |
|---|---|---|---|---|
| On-Demand | 0% (baseline) | None | Complete | Unknown workloads, testing, burst capacity |
| Savings Plans (Compute) | Up to 66% | 1 or 3 years, $/hour commitment | Any instance family, region, OS, tenancy | Teams that know they will spend $X/hour on compute but want flexibility |
| Savings Plans (EC2) | Up to 72% | 1 or 3 years, specific instance family | Same family, any size/OS/tenancy | Workloads committed to a specific instance family |
| Reserved Instances | Up to 72% | 1 or 3 years, specific instance type | Limited (convertible RIs have some flex) | Steady-state databases, well-known workloads |
| Spot Instances | Up to 90% | None (but can be terminated) | Any instance type with spare capacity | Batch processing, CI/CD, fault-tolerant workloads |
- Cover your steady-state baseline (databases, minimum API capacity) with Savings Plans or RIs.
- Handle predictable burst capacity with On-Demand (auto-scaling groups).
- Run fault-tolerant batch work on Spot.
- Use Compute Savings Plans when you are not sure which instance types you will use — they apply across Lambda, Fargate, and EC2.
7.2 Cost Allocation Tags
Tags are key-value pairs you attach to AWS resources. They flow through to Cost Explorer and billing reports, letting you answer: “How much does the recommendations team spend on compute?” or “What does our staging environment cost?”
Why they matter before day 1: Retroactive tagging is painful and often incomplete. Resources created without tags generate unattributable costs. Once your bill is $50K/month with 500 untagged resources, you have a cost visibility problem that takes weeks to clean up.
Mandatory tag strategy:
| Tag Key | Purpose | Example Values |
|---|---|---|
| team | Cost attribution to team | platform, payments, ml |
| environment | Separate prod from dev costs | production, staging, dev |
| service | Map costs to services | api-gateway, order-processor |
| cost-center | Finance/budgeting alignment | eng-100, data-200 |
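One way to keep the mandatory tags from eroding is to check them in CI or in a provisioning hook. The sketch below is an assumption about how such a check might look; the function name `missing_tags` and the sample resource are hypothetical, and only the four tag keys come from the table above.

```python
REQUIRED_TAGS = ("team", "environment", "service", "cost-center")


def missing_tags(resource_tags):
    """Return the required tag keys that are absent or blank."""
    return [
        key for key in REQUIRED_TAGS
        if not str(resource_tags.get(key, "")).strip()
    ]


# Hypothetical resource that is only partially tagged.
example = {"team": "payments", "environment": "production", "service": " "}
print(missing_tags(example))
```

A check like this reports gaps; it only prevents them if the pipeline fails the deployment when the returned list is non-empty.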
7.3 The Serverless Tax
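Managed services trade operational cost for financial cost. At some point, the financial cost exceeds what you would pay to self-manage. Examples where serverless costs more:
- NAT Gateway: roughly $32/month per gateway plus $0.045 per GB processed. A self-managed NAT instance on a small EC2 instance can run for about $4/month (with reduced availability and throughput).
- API Gateway REST: $3.50 per million requests. An ALB (~$16/month + $0.008/LCU-hour) is cheaper above roughly 5 million requests/month.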
- Managed Kafka (MSK): $200+/month minimum. Self-managed Kafka on EC2 can be cheaper at scale (but operational cost is significant).
- Lambda at scale: As shown in section 1.7, Lambda is 10-40x more expensive than containers at sustained high traffic.
- Your team is small and engineering time is more expensive than AWS bills
- Traffic is variable and unpredictable
- The operational risk of self-managing exceeds the cost savings
- You are optimizing for speed of delivery, not cost efficiency
Part VIII — Cloud-Agnostic Patterns
This entire chapter uses AWS service names because they are the most widely deployed. But if you only think in AWS, you are building a career on one vendor’s terminology. The best cloud engineers think in patterns and then implement in services. This section maps the portable patterns (they work everywhere) versus the cloud-specific patterns (where vendor lock-in is real and sometimes worth it).
8.1 Patterns That Transfer Across Clouds
These patterns have direct equivalents on AWS, GCP, and Azure. The APIs differ, but the architecture, trade-offs, and failure modes are the same. If you understand the pattern, switching clouds is a configuration change, not an architecture rewrite.
Container orchestration:
| Pattern | AWS | GCP | Azure |
|---|---|---|---|
| Managed Kubernetes | EKS | GKE | AKS |
| Managed containers (serverless) | ECS Fargate | Cloud Run | Container Apps |
| Container registry | ECR | Artifact Registry | ACR |
| Pattern | AWS | GCP | Azure |
|---|---|---|---|
| Object storage | S3 | Cloud Storage | Blob Storage |
| Storage classes/tiers | Standard/IA/Glacier | Standard/Nearline/Coldline/Archive | Hot/Cool/Cold/Archive |
| Lifecycle policies | S3 Lifecycle Rules | Object Lifecycle Management | Lifecycle Management Policies |
| Pre-signed URLs | S3 pre-signed URLs | Signed URLs | Shared Access Signatures (SAS) |
| Pattern | AWS | GCP | Azure |
|---|---|---|---|
| Task queue | SQS | Cloud Tasks | Storage Queues / Service Bus |
| Pub/sub fan-out | SNS | Pub/Sub | Service Bus Topics / Event Grid |
| Event routing | EventBridge | Eventarc | Event Grid |
| Streaming | Kinesis | Pub/Sub (streaming mode) | Event Hubs |
| Pattern | AWS | GCP | Azure |
|---|---|---|---|
| Managed PostgreSQL/MySQL | RDS, Aurora | Cloud SQL, AlloyDB | Azure Database for PostgreSQL/MySQL |
| Serverless relational | Aurora Serverless | Cloud SQL (auto-scale) | Azure SQL Serverless |
| Read replicas | RDS/Aurora Read Replicas | Cloud SQL Replicas | Azure Read Replicas |
| Pattern | AWS | GCP | Azure |
|---|---|---|---|
| Secrets management | Secrets Manager | Secret Manager | Key Vault |
| Configuration store | Parameter Store | Runtime Configurator | App Configuration |
| Key management | KMS | Cloud KMS | Key Vault (Keys) |
| Pattern | AWS | GCP | Azure |
|---|---|---|---|
| Identity & access management | IAM | IAM | Entra ID (formerly Azure AD) + RBAC |
| Service accounts / workload identity | IAM Roles | Service Accounts / Workload Identity | Managed Identities |
| Organization policies | SCPs (Organizations) | Organization Policies | Azure Policy |
The most common IAM failure is over-permissioning: someone grants s3:* on * to unblock themselves, and that policy persists for months. The second most common failure is IAM propagation delay: policy changes can take up to 60 seconds to propagate globally, so a “grant then immediately use” pattern may fail intermittently. The mitigation: test permission changes with iam:SimulatePrincipalPolicy before relying on them, and always include a brief retry window after policy attachment.
Workload identity portability: The underlying concept — “workloads prove their identity through short-lived tokens issued by a trusted identity provider, not through long-lived credentials” — is identical across clouds. AWS IAM Roles for Service Accounts (IRSA) and EKS Pod Identity, GCP Workload Identity, and Azure Managed Identities all implement the same OIDC-based pattern. If your CI/CD and Kubernetes workloads use OIDC federation, switching clouds means reconfiguring the trust relationship, not rewriting your authentication logic.
8.2 Patterns That Are Cloud-Specific
These are services where the cloud vendor has built something with unique architecture or performance characteristics that do not have a direct portable equivalent. Using these services is a conscious lock-in trade: you gain performance, simplicity, or cost benefits that portable alternatives cannot match.
DynamoDB (AWS): Single-digit millisecond latency at any scale, single-table design, global tables for multi-region replication. No direct equivalent on GCP or Azure. The closest alternatives (Bigtable, Cosmos DB) have different data models, consistency guarantees, and access patterns. If you design around DynamoDB’s single-table pattern, migrating to another database is an application rewrite, not a configuration change. This is often worth it: DynamoDB’s operational simplicity and performance at scale are hard to match.
- Default failure mode: DynamoDB’s primary failure mode is hot partition throttling: if a single partition key receives disproportionate traffic, requests to that partition are throttled even if the table’s aggregate throughput is well within limits. Global tables add replication lag (typically sub-second, but can spike to seconds during regional degradation). The recovery pattern: monitor `ThrottledRequests` per partition, use write sharding for hot keys, and design your access patterns to distribute load evenly across the key space.
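Write sharding for a hot key can be sketched in a few lines. This is an illustrative example, not DynamoDB client code: the `#` separator, the shard count, and the helper names are assumptions. Writers append a random shard suffix so traffic for one logical key spreads across several partition keys, and readers query every shard and merge the results.

```python
import random


def sharded_partition_key(base_key, shard_count, rng=random):
    """Pick one of shard_count suffixed keys at random for a write."""
    shard = rng.randrange(shard_count)
    return f"{base_key}#{shard}"


def all_shard_keys(base_key, shard_count):
    """List every suffixed key a reader must query to see all writes."""
    return [f"{base_key}#{shard}" for shard in range(shard_count)]


# Hypothetical hot key spread over 10 shards.
write_key = sharded_partition_key("customer-42", 10)
read_keys = all_shard_keys("customer-42", 10)
print(write_key, len(read_keys))
```

The trade-off is read amplification: every read of the logical key now fans out to `shard_count` queries, so the shard count should be only as large as the write rate requires.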
8.3 The Abstraction Layer Decision
When teams want cloud portability, they reach for abstraction layers. Here is the honest trade-off:
Infrastructure as Code (Terraform / Pulumi):
- Portable: Resource definitions can target any cloud. Switching providers means rewriting `.tf` files, not application code.
- Not portable: The resources themselves are cloud-specific. Defining an Aurora cluster in Terraform does not make it portable; it just means your infrastructure definition is in a consistent language.
- Verdict: Use Terraform/Pulumi for consistency and automation, not for cloud portability.
Kubernetes:
- Portable: Kubernetes manifests (Deployments, Services, ConfigMaps) work on any Kubernetes cluster. Move from EKS to GKE by repointing `kubectl`.
- Not portable: Ingress controllers, storage classes, load balancer annotations, and IAM integration are cloud-specific. You will still have cloud-specific YAML.
- Verdict: Kubernetes gives you compute portability. Data layer, networking, and identity are still cloud-specific. This is why Kubernetes is popular for multi-cloud: it solves the hardest portability problem (running your applications) while accepting that storage and networking are different.
Cloud-agnostic SDKs:
- Libraries like `libcloud`, `fog`, or cloud-agnostic SDK wrappers provide unified APIs across providers.
- The honest truth: These abstractions target the lowest common denominator. You lose the best features of each cloud. In practice, teams use them for basic operations (file upload, queue send) and still use native SDKs for advanced features.
- Verdict: Use for simple, commodity operations (put object, send message). Do not use for complex patterns (DynamoDB streams, BigQuery ML, Lambda event mappings).
Interview Questions — Cloud Service Patterns
Design a file processing pipeline using S3, Lambda, and SQS
- Trigger: File uploaded to S3 triggers an event notification.
- Buffering: S3 event goes to SQS (not directly to Lambda). SQS provides buffering if Lambda is at concurrency limits, retry with backoff, and DLQ for failed messages.
- Processing: Lambda polls SQS, processes each file (parse, validate, transform).
- Output: Processed results written to a separate S3 bucket or DynamoDB.
- Error handling: Failed messages go to a DLQ after 3 retries. A separate Lambda monitors the DLQ and alerts or routes to manual review.
- “I would not trigger Lambda directly from S3 because SQS gives me buffering, retry control, and a DLQ. Direct S3-to-Lambda has limited retry semantics.”
- “I would set reserved concurrency on the processing Lambda to prevent it from consuming the entire account’s concurrency pool.”
- “For large files, I would use S3 Select to process the file in place rather than downloading the entire object.”
- “I would add a visibility timeout on SQS that is 6x the Lambda timeout, per AWS recommendations, to prevent a message from becoming visible again while Lambda is still processing it.”
maxReceiveCount, backoff via visibility timeout, and a real DLQ. You also get buffering for free when Lambda is throttled.
Q: What happens if the same file is uploaded twice with the same key, and how do you prevent double-processing?
A: S3 generates a new event per upload even on overwrite, so you must make processing idempotent. Store a processed-marker (e.g., s3://processed-bucket/{etag}) or use a DynamoDB dedupe table keyed on etag + key. The first Lambda to write the marker wins; others short-circuit.
Q: The upstream system starts producing 10x more files overnight. What breaks first?
A: Usually the downstream database, not Lambda or SQS. Lambda auto-scales, SQS auto-scales, but your DynamoDB write capacity or Postgres connection pool becomes the bottleneck. Second failure mode: you hit the account-level Lambda concurrency cap and starve other functions; set reserved concurrency before this happens.
- AWS Lambda Operator Guide: failure modes, concurrency, retries.
- SQS Developer Guide: Amazon SQS visibility timeout
- AWS Architecture Blog: Fanout S3 Event Notifications to Multiple Endpoints
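The "first writer of the marker wins" idea from the double-processing answer can be sketched with an in-memory stand-in for the dedupe table. Everything here is illustrative: `DedupeTable` is a hypothetical class that mimics a put-if-absent write (what a conditional write would give you in a real table), and `handle_s3_event` is an assumed handler shape, not an AWS API.

```python
class DedupeTable:
    """In-memory stand-in for a table that supports put-if-absent."""

    def __init__(self):
        self._items = {}

    def put_if_absent(self, key, value):
        """Store value and return True only if key was not already present."""
        if key in self._items:
            return False
        self._items[key] = value
        return True


def handle_s3_event(table, bucket, key, etag, process):
    """Process an object at most once per (bucket, key, etag)."""
    marker = f"{bucket}/{key}#{etag}"
    if not table.put_if_absent(marker, "claimed"):
        return "skipped"  # another invocation already claimed this version
    process(bucket, key)
    return "processed"


# Simulate the same upload event being delivered twice.
table = DedupeTable()
processed = []
first = handle_s3_event(table, "uploads", "a.csv", "etag-1", lambda b, k: processed.append(k))
second = handle_s3_event(table, "uploads", "a.csv", "etag-1", lambda b, k: processed.append(k))
print(first, second, processed)
```

One design caveat of this sketch: the marker is written before processing, so a crash mid-processing leaves a claimed marker with no result. A real pipeline would need either a marker status that is updated on completion or a timeout after which a claim can be retried.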
Your Lambda function has a 30% cold start rate. How do you fix it?
Init Duration shows cold start time. Check invocation frequency. If the function is invoked once every 10 minutes, cold starts are inevitable without intervention.
Step 2: Quick wins (free or cheap).
- Move initialization code outside the handler (SDK clients, DB connections).
- Reduce deployment package size (remove unused dependencies, use Lambda layers).
- If using Java, enable SnapStart.
- Switch to a lighter runtime if possible (Python/Node.js start 10-50x faster than Java).
- Analyze traffic patterns. If traffic is predictable (business hours), use scheduled provisioned concurrency (scale up at 8 AM, down at 6 PM).
- If traffic is unpredictable, use Application Auto Scaling on provisioned concurrency to scale based on utilization.
- Start with the minimum needed: if peak concurrency is 50, provisioning 30-40 instances covers 80-90% of cold starts.
- If this is a user-facing API with strict latency requirements and high cold start rate, consider whether Lambda is the right compute model. A small Fargate task that is always warm may be cheaper and faster than Lambda with full provisioned concurrency.
Compare the cost of processing 10M events/day: Lambda vs ECS Fargate
- 10M events/day = ~300M events/month
- Assume 128 MB memory, 200ms average duration
- Compute: 300M * 0.2s * (128/1024) GB = 7.5M GB-seconds
- Compute cost: 7.5M GB-seconds * $0.0000166667 per GB-second = ~$125/month
- Request cost: 300M requests * $0.20 per 1M requests = $60/month
- Lambda total: ~$185/month
- 10M events/day = ~116 events/second average
- Assume each event takes 200ms to process
- Need: 116 * 0.2 = ~24 concurrent processing slots
- Run 3 Fargate tasks (0.5 vCPU, 1 GB each) with 10 concurrent processors each
- Cost: 3 tasks * 0.5 vCPU * $0.04048 per vCPU-hour * 730 hours = ~$44/month (vCPU)
- Plus: 3 tasks * 1 GB * $0.004445 per GB-hour * 730 hours = ~$10/month (memory)
- Fargate total: ~$54/month
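The arithmetic above can be reproduced with a small calculator. The unit prices are assumptions based on commonly published us-east-1 list prices (x86, Linux) and will drift over time; the function names are hypothetical. Treat the output as a back-of-the-envelope check, not a quote.

```python
# Assumed list prices (us-east-1, x86); verify against current pricing pages.
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667
LAMBDA_PRICE_PER_MILLION_REQUESTS = 0.20
FARGATE_PRICE_PER_VCPU_HOUR = 0.04048
FARGATE_PRICE_PER_GB_HOUR = 0.004445
HOURS_PER_MONTH = 730


def lambda_monthly_cost(events, duration_s, memory_mb):
    """Compute plus request charges for a month of invocations."""
    gb_seconds = events * duration_s * (memory_mb / 1024)
    compute = gb_seconds * LAMBDA_PRICE_PER_GB_SECOND
    requests = (events / 1_000_000) * LAMBDA_PRICE_PER_MILLION_REQUESTS
    return compute + requests


def fargate_monthly_cost(tasks, vcpu_per_task, memory_gb_per_task):
    """vCPU plus memory charges for always-on tasks."""
    vcpu = tasks * vcpu_per_task * FARGATE_PRICE_PER_VCPU_HOUR * HOURS_PER_MONTH
    memory = tasks * memory_gb_per_task * FARGATE_PRICE_PER_GB_HOUR * HOURS_PER_MONTH
    return vcpu + memory


lambda_total = lambda_monthly_cost(events=300_000_000, duration_s=0.2, memory_mb=128)
fargate_total = fargate_monthly_cost(tasks=3, vcpu_per_task=0.5, memory_gb_per_task=1)
print(f"Lambda: ${lambda_total:.0f}/month, Fargate: ${fargate_total:.0f}/month")
```

Changing `events` or `duration_s` makes it easy to find the traffic level at which the two options cross over for a given workload.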
You notice your AWS bill jumped 40% this month. Walk me through how you investigate.
- Forgotten resources: Someone launched large EC2 instances for testing and forgot to terminate them. Check for instances with low utilization.
- NAT Gateway data processing: A new service was deployed that routes traffic through NAT Gateway. Check NAT Gateway charges in the VPC cost breakdown.
- S3 request costs: A misconfigured client making millions of unnecessary API calls. Check S3 request metrics.
- Data transfer: Cross-region replication, S3-to-internet transfer, or inter-AZ traffic increase. Check the “Data Transfer” line item.
- DynamoDB scaling: On-demand pricing with a traffic spike, or provisioned capacity that was scaled up and never scaled back down.
- Spot interruptions: If Spot capacity became unavailable, workloads fell back to On-Demand pricing.
DataTransfer-Out-Bytes), NAT processing (NatGateway-Bytes), and EBS gp3 throughput overages under the EC2 umbrella. One of those is usually the culprit when compute counts are flat.
Q: You find the spike started on the day of a deploy. How do you confirm causation, not just correlation?
A: Compare CloudWatch metrics on the suspect service (invocation count, bytes transferred, DB read/write units) against the cost timeline. Look at feature flags flipped that day. Use the Cost Explorer API to export hourly costs and overlay with deployment timestamps from your CI system.
Q: How do you prevent the same thing next month?
A: Three layers: (1) AWS Budgets with 80%/100% alerts per-service and per-tag, (2) Cost Anomaly Detection with Slack/email subscribers, (3) a tag policy that blocks untagged resource creation so every cost has an owner. None of these are optional at scale.
When would you choose Kinesis over SQS for event processing?
- You need replay. Kinesis retains data for 1-365 days. Multiple consumers can read from any point in the stream. SQS deletes messages after they are consumed.
- You need ordering. Kinesis guarantees order within a shard. SQS Standard does not guarantee order; SQS FIFO guarantees it but at lower throughput.
- You need multiple independent consumers. With Kinesis, consumer A reads the sales analytics and consumer B reads the fraud detection stream — independently, at their own pace, from the same data. SQS delivers each message to one consumer.
- You are doing real-time analytics. Kinesis integrates with Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics), allowing windowed aggregations, pattern detection, and real-time SQL over streams.
- You need a simple task queue. One message, one consumer, delete after processing.
- You need per-message retry and DLQ. SQS has built-in retry counting and dead letter queue support. Kinesis requires you to build this yourself.
- Traffic is spiky and you need buffering. SQS scales automatically with no shard management.
- You do not need replay or ordering. SQS Standard is simpler and cheaper for most async processing.
customer_id + transaction_minute instead of just customer_id) and accept weaker ordering, or (2) split that one customer into their own dedicated stream. Kinesis ordering is per-shard, so a hot shard is a design smell, not an AWS limit.
Q: Could you just use SQS FIFO with message group IDs to get ordering cheaper than Kinesis?
A: Yes, for low-throughput use cases. SQS FIFO gives you per-group ordering and dedupe without shard management. The trade-off: no replay, no multi-consumer fanout, and a hard ceiling around 3000 TPS per group with batching. Above that, Kinesis wins.
Q: Your Kinesis consumer Lambda is falling behind by 20 minutes. How do you catch up without losing data?
A: Three options: (1) increase parallelization factor on the event source mapping (up to 10 concurrent Lambdas per shard), (2) add shards to increase parallelism, (3) switch to Kinesis Enhanced Fan-Out which gives each consumer its own 2MB/s read throughput instead of sharing. Do not delete and recreate the consumer; you will lose the checkpoint.
Design a multi-region active-active architecture for a critical payment service on AWS
- DynamoDB Global Tables for payment state (transaction records, idempotency keys). Global Tables provide multi-region replication with eventual consistency (typically under 1 second). Use conditional writes and idempotency keys to prevent duplicate payments when both regions are active.
- Aurora Global Database for account data (if relational is required). One primary region handles writes; the secondary has a read replica that can be promoted to writer in under 1 minute.
- Idempotency keys stored in DynamoDB Global Tables (both regions see the same key within ~1 second).
- Route each customer to a “home” region by default (reduces cross-region conflict).
- Implement optimistic locking: write with a condition that the item does not already exist.
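Routing each customer to a stable "home" region can be sketched with a deterministic hash. This is one possible approach under stated assumptions, not the only one: the region list and the function name `home_region` are hypothetical, and many systems instead store the home region explicitly per customer.

```python
import hashlib

# Hypothetical pair of active regions.
REGIONS = ("us-east-1", "eu-west-1")


def home_region(customer_id, regions=REGIONS):
    """Map a customer to the same region on every call via a stable hash."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(regions)
    return regions[index]


for cid in ("cust-001", "cust-002", "cust-003"):
    print(cid, "->", home_region(cid))
```

A hash-based mapping needs no lookup table, but changing the region list remaps some customers. If the set of regions is expected to change, a stored assignment (or consistent hashing) avoids that reshuffle.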
attribute_not_exists(idempotency_key)) in both regions. The second write will fail with a ConditionalCheckFailedException, and your code treats that as “already processed, return success.”
Q: How do you test this works without waiting for a real region outage?
A: Chaos engineering. Tools such as Netflix’s Chaos Monkey or Gremlin let you simulate failures. Simpler: in a staging environment, block Route 53 health checks to one region and verify traffic fails over within your RTO. Run this quarterly.
Q: Aurora Global Database only has one writer region. How is that “active-active”?
A: It is not; that is active-passive with fast promotion. True active-active RDBMS at cross-region scale is very hard (see Spanner, CockroachDB). For most enterprise use cases, active-passive Aurora with a <1 minute failover is the correct answer even if the interviewer asked for “active-active.”
How would you design a cost-effective log aggregation pipeline on AWS?
- “I would use Parquet format for logs in S3 — Athena charges per TB scanned, and Parquet reduces scan volume by 80-90% compared to raw JSON.”
- “I would avoid keeping logs in CloudWatch Logs long-term. CloudWatch Logs Insights is expensive for large volumes. Export to S3 within 24 hours and query with Athena.”
- “I would set up S3 Intelligent-Tiering for logs with unpredictable access patterns — it automatically moves objects between access tiers with no retrieval fees.”
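Scan cost in Athena depends heavily on how objects are laid out in S3. A common convention is Hive-style partition prefixes by time, which the sketch below generates. The base prefix and the function name `partition_prefix` are hypothetical, and the function assumes a timezone-aware timestamp that it normalizes to UTC.

```python
from datetime import datetime, timezone


def partition_prefix(base_prefix, event_time):
    """Return a Hive-style year=/month=/day=/hour= prefix in UTC."""
    if event_time.tzinfo is None:
        raise ValueError("event_time must be timezone-aware")
    t = event_time.astimezone(timezone.utc)
    base = base_prefix.rstrip("/")
    return f"{base}/year={t:%Y}/month={t:%m}/day={t:%d}/hour={t:%H}/"


# Hypothetical log event timestamp.
prefix = partition_prefix("logs/app", datetime(2024, 3, 5, 7, 30, tzinfo=timezone.utc))
print(prefix)
```

Zero-padded values keep prefixes lexically sortable, and a query that filters on these partition columns only reads the matching prefixes.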
year=/month=/day=/hour= so the query only reads relevant partitions, (3) use partition projection so Athena does not query Glue for partition metadata, (4) select only the columns you need; never SELECT *.
Q: Compliance says logs must be retained for 7 years but queryable within 48 hours. Which tier?
A: Glacier Flexible Retrieval (retrieval in 1-5 min to 3-5 hours depending on tier). Deep Archive is cheaper but 12+ hours to restore; that blows the 48-hour SLA if the request comes in on a Friday evening.
Q: How do you handle a log spike (e.g., deploy-time error flood) without blowing up Firehose cost?
A: Firehose charges per GB ingested, so you cannot fully escape cost, but: (1) sample debug/trace logs at the source (Fluent Bit filter), (2) enable Firehose compression (GZIP) for a ~5x reduction on text logs, (3) set a CloudWatch Logs metric filter that alerts when a single app’s log volume 10x’s, so you catch runaway logging before the month-end bill.
A developer asks you to give their Lambda function AdministratorAccess. How do you respond?
- Understand the requirement. What AWS services does the function actually need to access? S3? DynamoDB? SQS? Get the specific operations — read, write, list?
- Create a scoped policy. Grant only the permissions the function needs, on only the resources it needs to access.
- Use IAM Access Analyzer. If the function already exists with broad permissions, enable IAM Access Analyzer to see which permissions it actually uses over a 30-day period. Then generate a least-privilege policy based on actual usage.
- Explain the risk. AdministratorAccess on a Lambda function means a vulnerability in the function code (injection, SSRF, dependency vulnerability) becomes a complete account compromise. The attacker can create IAM users, access any database, exfiltrate any S3 bucket, and launch cryptomining instances — all through the Lambda’s execution role.
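To make "scoped policy" concrete, here is a minimal sketch of a least-privilege policy for a function that reads objects from one bucket and writes to one table, plus a simple check for wildcard actions. The bucket name, table ARN, and helper names are hypothetical; the actions shown are examples, and a real function's needs should come from the requirement-gathering step above.

```python
import json


def scoped_lambda_policy(bucket, table_arn):
    """Allow only object reads from one bucket and item writes to one table."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ReadInputObjects",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                "Sid": "WriteResults",
                "Effect": "Allow",
                "Action": ["dynamodb:PutItem"],
                "Resource": table_arn,
            },
        ],
    }


def has_wildcard_action(policy):
    """Return True if any statement allows '*' or a 'service:*' action."""
    for statement in policy["Statement"]:
        actions = statement["Action"]
        if isinstance(actions, str):
            actions = [actions]
        if any(a == "*" or a.endswith(":*") for a in actions):
            return True
    return False


# Hypothetical resources for illustration only.
policy = scoped_lambda_policy(
    "example-input-bucket",
    "arn:aws:dynamodb:us-east-1:123456789012:table/example-results",
)
print(json.dumps(policy, indent=2))
print("wildcards:", has_wildcard_action(policy))
```

A check like `has_wildcard_action` is a coarse lint, not a substitute for Access Analyzer: it catches `s3:*`-style grants but says nothing about overly broad resources.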
s3:* access across buckets. A single SSRF vulnerability allowed the attacker to steal credentials and exfiltrate 100M records. Capital One’s post-incident remediation was exactly this pattern: scope every role to the specific buckets and actions, monitored by Access Analyzer.
ReadOnlyAccess (not Admin) for 24 hours via a condition (aws:CurrentTime < tomorrow), and open a ticket to scope it down. Put the ticket on their sprint board, not yours. “Tighten later” never happens unless it is someone’s scheduled work.
Q: Lambda is behind an API Gateway that is public. Does that change anything?
A: Yes. It means the blast-radius from a vulnerability (SSRF, injection, dependency CVE) is your entire AWS account. This is the Capital One scenario almost exactly. Public-facing Lambdas need extra rigor: scoped IAM, a permission boundary, and VPC egress controls if they do not need internet access.
Q: How do you enforce this org-wide, not just case-by-case?
A: Service Control Policies at the AWS Organization level that deny iam:AttachRolePolicy with arn:aws:iam::aws:policy/AdministratorAccess unless the caller is in a specific ops account. Combine with AWS Config rules that flag any role with *:* in its policy. Automation, not vigilance.
Your team is spending $2,000/month on NAT Gateway. How do you reduce it?
- S3 Gateway Endpoint (free) — Eliminates all S3 traffic through NAT Gateway.
- DynamoDB Gateway Endpoint (free) — Same for DynamoDB.
- ECR, CloudWatch Logs, SQS, Secrets Manager Interface Endpoints (~$7/month each) — Eliminates pull-through traffic for container images and logging.
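Deciding which endpoints to add first depends on knowing which AWS services the NAT traffic is going to. One way to find out is to match destination IPs from flow logs against AWS's published prefix list. The sketch below assumes entries shaped like those in `ip-ranges.json` (an `ip_prefix` and a `service` field); the sample prefixes are reserved documentation ranges used purely for illustration, not real AWS ranges, and `classify_ip` is a hypothetical helper.

```python
import ipaddress


def classify_ip(ip, prefixes):
    """Return the service of the first prefix containing ip, else 'unknown'."""
    address = ipaddress.ip_address(ip)
    for entry in prefixes:
        if address in ipaddress.ip_network(entry["ip_prefix"]):
            return entry["service"]
    return "unknown"


# Illustrative prefixes only (RFC 5737 documentation ranges).
SAMPLE_PREFIXES = [
    {"ip_prefix": "198.51.100.0/24", "service": "S3"},
    {"ip_prefix": "203.0.113.0/24", "service": "DYNAMODB"},
]

for dst in ("198.51.100.17", "203.0.113.200", "192.0.2.5"):
    print(dst, "->", classify_ip(dst, SAMPLE_PREFIXES))
```

In the real file an address can fall inside more than one entry (a broad range and a service-specific one), so ordering the list with service-specific prefixes first gives more useful answers than a plain first match.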
SELECT dstaddr, sum(bytes) FROM flow_logs WHERE action='ACCEPT' AND dstport IN (443) GROUP BY dstaddr ORDER BY 2 DESC. Cross-reference the top IPs against AWS IP ranges (published at ip-ranges.json) to identify which service family they belong to.
Q: You deploy an ECR VPC endpoint, but NAT cost does not drop. Why?
A: Check three things: (1) the VPC endpoint policy actually allows ecr:* for your account (default policy is permissive but someone may have locked it down), (2) private DNS is enabled on the endpoint (otherwise clients resolve the public ECR hostname and route via NAT), (3) Docker pulls also hit S3 for image layers, so you need the S3 Gateway endpoint too.
Q: Is the managed NAT Gateway ever worth keeping over a NAT instance?
A: Yes, in prod. NAT Gateway is fully managed with HA inside the AZ and no maintenance burden. NAT instances (a t4g.nano running iptables) save 80% of cost but you own patching, failover, and scaling bandwidth. Only viable in dev/staging where an outage is acceptable.
Real-World Architecture Examples
Example 1: E-Commerce Order Pipeline
Context: An e-commerce platform processing 50,000 orders per day.
Example 2: Media Processing Pipeline
Context: A platform where users upload videos up to 5 GB, which need to be transcoded into multiple formats.
Example 3: Real-Time Fraud Detection
Context: A fintech processing credit card transactions that need fraud scoring within 100ms.
Curated Resources
AWS Documentation and Architecture Center
- AWS Well-Architected Framework — The foundational framework for evaluating any cloud architecture. Read the “Pillars” section at minimum.
- Lambda Operator Guide — The best single resource for understanding Lambda’s behavior in production. Covers cold starts, concurrency, monitoring, and cost optimization.
- S3 User Guide: Performance Design Patterns — Request rate optimization, prefix distribution, and multipart upload best practices.
- Serverless Application Lens — Well-Architected review specifically for serverless architectures.
- AWS Architecture Center — Reference architectures, diagrams, and best practices for common workloads.
re:Invent Talks (Best of)
- “Optimizing Lambda Cost and Performance” (SVS401) — Deep dive into Lambda internals, Firecracker, and optimization techniques from the Lambda team.
- “Advanced Serverless Patterns” (SVS340) — Step Functions, event-driven patterns, and saga implementations.
- “S3 Masterclass” (STG301) — Storage classes, lifecycle management, and cost optimization from the S3 team.
- “Networking Best Practices for VPC” (NET301) — VPC design, Transit Gateway, and PrivateLink patterns.
- “Running Kubernetes at Scale” (CON301) — For teams that have chosen EKS, this covers production patterns and pitfalls.
Books, Blogs, and Courses
- Designing Data-Intensive Applications by Martin Kleppmann — Not AWS-specific, but the distributed systems principles behind every AWS service.
- AWS Certified Solutions Architect Study Guide — Even if you are not pursuing the cert, the SAA-C03 study materials are the most structured way to learn AWS service selection.
- AWS Architecture Blog — Practical architecture patterns from AWS solutions architects.
- Last Week in AWS by Corey Quinn — The best newsletter for AWS cost analysis and commentary. Corey is brutally honest about pricing.
- theburningmonk.com by Yan Cui — The best independent blog on serverless patterns, costs, and pitfalls.
- cloudonaut.io by Michael and Andreas Wittig — Deep technical blog covering VPC, IAM, cost optimization, and CloudFormation.
Interview Deep-Dive Questions
1. Walk me through exactly what happens inside AWS when a Lambda function cold-starts. Where does the time go, and what levers do you have at each stage?
Difficulty: Senior
What the interviewer is really testing: Whether you understand the cold start lifecycle at an infrastructure level, not just “it is slow.” Can you decompose latency and apply targeted fixes?
Strong answer: A cold start is not one thing — it is a pipeline of five sequential stages, and each has different optimization levers.
- Stage 1 — Code download (~50-500ms). The Lambda service fetches your deployment package from an internal S3 cache. Larger packages take longer. Container images are loaded on demand: Lambda splits the image into chunks and caches them close to the execution fleet, so only the data needed at startup is fetched. (SOCI, Seekable OCI, is the comparable lazy-loading mechanism on Fargate, not on Lambda.) Lever: Shrink package size. Use tree-shaking in Node.js, strip unused dependencies in Python. For container images, use a minimal base image (`alpine` or `distroless`) and order Dockerfile layers so frequently-changing code is last.
- Stage 2 — MicroVM provisioning (~50-100ms). AWS uses Firecracker to spin up a lightweight microVM. This is largely outside your control, but it is fast because Firecracker was purpose-built for this workload (its published boot time is as little as ~125ms). Lever: Almost none directly. But choosing more memory indirectly allocates proportional CPU, which speeds up subsequent stages.
- Stage 3 — Runtime initialization (~10-5000ms). The language runtime starts. Python interpreter: fast. Node.js V8: fast. JVM with class loading and JIT warmup: 3-10 seconds. .NET CLR assembly loading: 1-3 seconds. Lever: Choose a lighter runtime for cold-start-sensitive paths. For Java, SnapStart snapshots the JVM state after init and restores from the snapshot, cutting cold start from seconds to 200-500ms. For .NET, Native AOT compiles ahead-of-time, eliminating CLR startup.
- Stage 4 — Init code execution (~10-2000ms). Your code outside the handler runs: imports, SDK client creation, DB connection establishment, config loading from Parameter Store or Secrets Manager. Lever: This is where you have the most control. Lazy-initialize anything you do not need on every invocation. Cache configuration in memory. Use the AWS SDK’s built-in credential provider instead of making explicit STS calls. If you call Secrets Manager during init, that is a network round-trip on every cold start.
- Stage 5 — Handler execution. This happens on every invocation, warm or cold. Not part of cold start.
Follow-up: Your team runs Java on Lambda and the CTO refuses to let you switch runtimes. Cold starts are 6 seconds. What do you do?
Answer: This is a real constraint I have seen. Here is the playbook, in order of cost and complexity:
- Enable SnapStart immediately. This is the single biggest win for Java Lambda. AWS takes a snapshot of the initialized execution environment (memory and disk state) after your init code runs, and exposes CRaC (Coordinated Restore at Checkpoint) hooks so your code can react to the checkpoint and the restore. On cold start, it restores from the snapshot instead of re-initializing the JVM. This typically cuts cold starts from 6s to 400-800ms. It is a configuration flag, with zero code changes for most applications.
- Audit your init code. Spring Boot with annotation scanning and component scanning is the biggest offender. If you are using Spring Boot, consider switching to Micronaut, Quarkus, or plain Java (no framework). Quarkus was designed for fast startup and has an explicit Lambda extension. If switching frameworks is off the table, at least disable annotation scanning for packages you do not need.
- Trim the dependency tree. Run `mvn dependency:tree` and look for transitive dependencies you do not use. Every JAR that the classloader touches adds startup time. Shade only the classes you use with the Maven Shade plugin.
- Provisioned concurrency as the fallback. SnapStart and provisioned concurrency cannot be enabled on the same function version, so this is an either/or choice. If cold starts are still above SLA after SnapStart and init optimization, switch the latency-critical function to provisioned concurrency and provision enough warm instances to cover baseline traffic. Use Application Auto Scaling to track the `ProvisionedConcurrencyUtilization` metric and scale provisioned capacity with demand.
- Measure the cost trade-off. Provisioned concurrency for 50 instances at 512MB costs roughly $20/day. Compare that to the engineering cost of rewriting in another language. Usually, SnapStart or provisioned concurrency is dramatically cheaper than a rewrite.
Follow-up: What are the gotchas with SnapStart that most people miss?
Answer: SnapStart has three specific traps:
- Uniqueness assumptions break. If your init code generates a random seed, UUID, or unique identifier, the snapshot captures that value. Every restored instance starts with the same “random” value. This breaks security tokens, encryption IVs, and anything that assumes init-time uniqueness. You must use CRaC’s `beforeCheckpoint` and `afterRestore` hooks to re-initialize these values. The AWS docs call this out, but most developers miss it until they see duplicate “unique” IDs in production.
- Network connections are stale. A database connection opened during init is in the snapshot but the TCP connection is dead by the time the function restores. You must implement connection validation or re-establishment in `afterRestore`. Connection pools that validate connections on borrow (like HikariCP’s `connectionTestQuery`) handle this naturally, but raw connections do not.
- Seeded randomness becomes deterministic. If your init code seeds a pseudorandom number generator, the snapshot freezes the seed state, and all restored instances start from the same PRNG state. This is a subtle security vulnerability. Re-seed in an `afterRestore` hook (by registering an `org.crac.Resource`), or draw randomness from `java.security.SecureRandom`, which AWS documents as safe to use with SnapStart.
2. You are designing a system that processes 500,000 file uploads per day to S3. Each file needs validation, transformation, and loading into a database. Walk me through the architecture.
Difficulty: Senior / Staff-Level
What the interviewer is really testing: End-to-end event-driven architecture design, back-of-envelope capacity planning, failure handling at scale, and the judgment to pick the right AWS services.
Strong answer: First, some quick math to size the problem. 500K files per day is roughly 6 files per second sustained, with likely peaks of 20-30/s during business hours. This is well within Lambda’s comfort zone but high enough that error handling and cost optimization matter.
The architecture:
- Upload path: Clients upload via pre-signed PUT URLs generated by an API (Lambda behind API Gateway). Files land in an S3 bucket partitioned by date prefix (`uploads/2026/04/10/`). Pre-signed URLs let the client upload directly to S3, so our API never touches the file bytes — this eliminates a huge bandwidth and memory bottleneck.
- Event routing: S3 Event Notifications route to SQS (not directly to Lambda). I put SQS in the middle for three reasons: (1) it buffers during Lambda throttling or cold start bursts, (2) it gives me configurable retry with exponential backoff, and (3) it provides a DLQ for poison messages. I would set the visibility timeout to 6x the Lambda timeout per AWS best practices.
- Processing Lambda: Polls SQS, downloads the file from S3, validates it (schema check, file type, size limits, malware scan if required), transforms it (normalize fields, enrich with lookup data), and writes to the database. I would set reserved concurrency at 50-100 to protect the database from connection storms — this is critical.
- Database writes: If the target is DynamoDB, I would use `BatchWriteItem` for throughput. If it is Aurora PostgreSQL, I would use RDS Proxy to pool connections from Lambda (otherwise each concurrent Lambda opens its own connection and you exhaust `max_connections` within minutes). Batch inserts via `COPY` or multi-row `INSERT` rather than one-row-at-a-time.
- Error handling: After 3 failed processing attempts, SQS moves the message to a DLQ. A separate Lambda monitors the DLQ depth via CloudWatch Alarm. If the DLQ grows beyond a threshold, it triggers an SNS notification to PagerDuty. I would also write failed records to a separate S3 bucket (`failed-uploads/`) with the error details for manual reprocessing.
- Idempotency: Files may be delivered more than once (SQS is at-least-once). I would use a DynamoDB table or a database unique constraint on a hash of the file content or the S3 object key to prevent duplicate processing.
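The sizing math behind this answer can be written down as a small worked example. The 2-second processing time per file and the 60-second function timeout are assumptions chosen for illustration, not figures from the design above.

```python
import math


def sustained_rate(files_per_day):
    """Average arrival rate in files per second."""
    return files_per_day / 86_400


def required_concurrency(files_per_second, seconds_per_file):
    """Little's law: in-flight work = arrival rate x time spent per item."""
    return math.ceil(files_per_second * seconds_per_file)


def visibility_timeout(function_timeout_s, factor=6):
    """SQS visibility timeout sized as a multiple of the function timeout."""
    return function_timeout_s * factor


print(round(sustained_rate(500_000), 2))   # 5.79 files/s sustained
print(required_concurrency(30, 2.0))       # 60 concurrent executions at a 30/s peak
print(visibility_timeout(60))              # 360 seconds
```

Under these assumptions the peak needs about 60 concurrent executions, which sits inside the 50-100 reserved concurrency range mentioned above.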
Follow-up: The product team now says some files are 5GB+. How does that change your architecture?
Answer: This changes several things:
- Upload path: Files over 5GB require multipart upload. The pre-signed URL approach still works but I need to generate pre-signed URLs for each part (using `CreateMultipartUpload`, then `UploadPart` pre-signed URLs, then `CompleteMultipartUpload`). The client-side SDK handles this — the AWS SDK’s `TransferManager` in Java or `boto3`’s `upload_file` with multipart threshold in Python. I must also add a lifecycle policy to abort incomplete multipart uploads after 7 days to avoid hidden storage costs.
- Processing Lambda timeout: A 5GB file cannot be downloaded, validated, transformed, and loaded within Lambda’s 15-minute timeout. I have two options: (1) use S3 Select to process the file in place if it is CSV/JSON/Parquet — push the filtering to the storage layer, or (2) move processing to an ECS Fargate task that can run for hours. The SQS message would trigger a Step Functions workflow that launches a Fargate task instead of a Lambda.
- Memory constraints: Lambda maxes out at 10 GB memory. A 5GB file loaded entirely into memory leaves little room for processing. For Fargate, I would configure tasks with 8-16 GB memory and stream the file from S3 in chunks rather than loading it all at once.
- The hybrid approach: Keep Lambda for files under 100MB (the vast majority) and route large files to Fargate. The SQS consumer Lambda checks the file size from the S3 event metadata. Small files are processed inline; large files trigger a Step Functions workflow that launches a Fargate task. This optimizes cost (Lambda for the 99% of small files) while handling the edge case (Fargate for the 1% of large files).
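The size-based routing in the hybrid approach could look roughly like this. It is a sketch that assumes the SQS message body is the S3 event notification JSON; the `"lambda"` and `"fargate"` target labels are hypothetical, and starting the Step Functions workflow is left out.

```python
import json
from urllib.parse import unquote_plus

LARGE_FILE_BYTES = 100 * 1024 * 1024  # 100 MB threshold from the text


def route_sqs_record(sqs_record):
    """Return (target, bucket, key) for each S3 object in one SQS message."""
    s3_event = json.loads(sqs_record["body"])
    routes = []
    for rec in s3_event.get("Records", []):
        bucket = rec["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(rec["s3"]["object"]["key"])
        size = rec["s3"]["object"]["size"]
        target = "fargate" if size >= LARGE_FILE_BYTES else "lambda"
        routes.append((target, bucket, key))
    return routes
```

Reading the size from the event avoids a `HeadObject` call per file, which matters at 500K files per day.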
Follow-up: How do you prevent Lambda from overwhelming your Aurora PostgreSQL database with connections?
Answer: This is one of the most common production failures in serverless-to-relational architectures. Lambda can scale to hundreds of concurrent executions in seconds, each opening its own database connection. Aurora PostgreSQL’s `max_connections` defaults to around 1,600 for a db.r5.large, but each connection consumes memory and process resources.
- RDS Proxy is the primary solution. RDS Proxy sits between Lambda and Aurora, maintaining a pool of warm database connections. Hundreds of Lambda instances share a pool of, say, 100 database connections. RDS Proxy handles connection multiplexing, authentication caching, and automatic failover. It adds ~1ms of latency, which is negligible. Cost: roughly $21/month for a small instance.
- Reserved concurrency as a guard rail. Even with RDS Proxy, I would set reserved concurrency on the Lambda function to cap the maximum number of simultaneous executions. If the proxy pool is 100 connections and each Lambda holds one connection for 500ms, I would cap Lambda at 200 concurrent executions — this ensures the proxy is never overwhelmed.
- Connection reuse in Lambda. Initialize the database connection outside the handler (in the module scope). The connection persists across warm invocations of the same execution environment. This means a warm Lambda reuses its connection instead of opening a new one per invocation.
- The pattern I would avoid: Opening and closing a connection per invocation. This creates massive connection churn, wastes time on TCP + TLS + auth handshake on every request, and generates load on the database’s process management.
Going Deeper: What happens if RDS Proxy itself becomes the bottleneck?
RDS Proxy scales based on the underlying database instance size — it provisions enough capacity to handle the database’s `max_connections`. But the real bottleneck is rarely the proxy; it is the database’s ability to handle concurrent queries. If you are seeing proxy connection timeouts, the actual problem is usually slow queries holding connections open. The fix is query optimization (indexes, query plans, connection timeout settings on the proxy), not more proxy capacity. I would check `pg_stat_activity` in Aurora to see if connections are stuck in the `idle in transaction` state, which is a common leak pattern where application code opens a transaction and never commits or rolls back.
3. Your company runs 200 Lambda functions across 5 teams. You are tasked with designing the AWS account strategy. What do you recommend?
Difficulty: Staff-Level
What the interviewer is really testing: Organizational architecture thinking, AWS Organizations knowledge, understanding of blast radius and governance, and the ability to balance isolation with operational simplicity.
Strong answer: The way I think about this is: accounts are a security and operational boundary, not an organizational chart. The goal is to contain blast radius, enable independent team velocity, and give finance clean cost attribution — without drowning in account sprawl.
My recommended structure for 5 teams and 200 functions:
- Management Account — AWS Organizations root. No workloads here. Only billing, SCPs, and Control Tower configuration.
- Security OU: Log Archive Account (centralized CloudTrail, Config, GuardDuty findings), Security Tooling Account (Security Hub, IAM Access Analyzer).
- Infrastructure OU: Shared Services Account (CI/CD pipelines, ECR repositories, shared Lambda layers, DNS zones in Route 53), Networking Account (Transit Gateway, VPN, if needed).
- Workloads OU: Split into Production OU and Non-Production OU. Within each, one account per team. So 5 production accounts and 5 development/staging accounts. Total: 10 workload accounts.
- Sandbox OU: Individual developer sandbox accounts with aggressive SCPs (no resources larger than medium, auto-cleanup after 7 days via aws-nuke, an open-source cleanup tool).
Guardrail SCPs by OU:
- Root OU: Deny all actions outside approved regions (us-east-1, eu-west-1). Deny root user access. Require S3 encryption. Prevent leaving the organization.
- Sandbox OU: Deny expensive instance types. Deny production service creation (RDS Multi-AZ, Aurora, large DynamoDB tables).
- Production OU: Deny deletion of CloudTrail logs. Deny disabling encryption. Require MFA for destructive actions.
Follow-up: Team A needs to read data from Team B’s DynamoDB table. How do you handle cross-account access without breaking isolation?
Answer: I have three options, and the right one depends on the access pattern:
- Cross-account IAM role assumption (my default choice). Team B creates an IAM role in their account with a trust policy allowing Team A’s Lambda execution role to assume it. Team A’s Lambda calls `sts:AssumeRole`, gets temporary credentials, and reads from Team B’s DynamoDB table. The role in Team B’s account has a scoped policy: only `dynamodb:GetItem` and `dynamodb:Query` on the specific table, nothing else. This is auditable (CloudTrail logs every role assumption), revocable (delete the trust policy), and granular (limit by IP, VPC, or condition keys).
- DynamoDB Streams to EventBridge (for async data sharing). If Team A does not need synchronous reads but just needs to react to changes in Team B’s data, use DynamoDB Streams in Team B’s account to publish change events to EventBridge. Streams do not deliver to EventBridge directly, so this needs an EventBridge Pipe or a small forwarding Lambda in between. EventBridge cross-account rules forward relevant events to Team A’s event bus. This fully decouples the teams — Team A does not need any access to Team B’s account.
- Shared data layer (for truly shared data). If the DynamoDB table is shared reference data (product catalog, config data), consider putting it in the Shared Services account with cross-account read access for all workload accounts. This is appropriate when the data is organizational, not team-owned.
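The two policy documents behind the role-assumption option can be sketched as plain data. The account IDs, role name, and table name below are hypothetical placeholders; the functions only build the JSON documents and do not call AWS.

```python
import json


def trust_policy(consumer_role_arn):
    """Trust policy on Team B's role: who is allowed to assume it."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"AWS": consumer_role_arn},
            "Action": "sts:AssumeRole",
        }],
    }


def read_only_table_policy(table_arn):
    """Permissions policy on Team B's role: what the assumed role may do."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["dynamodb:GetItem", "dynamodb:Query"],
            "Resource": table_arn,
        }],
    }


# Hypothetical ARNs for illustration only.
team_a_role = "arn:aws:iam::111111111111:role/team-a-lambda-exec"
team_b_table = "arn:aws:dynamodb:us-east-1:222222222222:table/orders"

print(json.dumps(trust_policy(team_a_role), indent=2))
print(json.dumps(read_only_table_policy(team_b_table), indent=2))
```

Team A's execution role also needs its own `sts:AssumeRole` permission on Team B's role ARN; both sides must agree before the call succeeds.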
Follow-up: How do you prevent SCP misconfigurations from causing a production outage?
Answer: This is a real risk. An overly broad deny SCP applied at the wrong OU level can instantly break production across every account in that OU.
- Test SCPs in a staging OU first. Create a mirror of your production OU structure with a test account. Apply the SCP there, run automated integration tests, and verify nothing breaks before promoting to the production OU.
- Use the SCP simulator. AWS IAM Policy Simulator can evaluate SCPs against specific API calls. Before applying a deny policy, simulate the actions your production services make (Lambda invoke, DynamoDB read/write, S3 access, CloudWatch logging) and verify they are not blocked.
- Deploy SCPs through CI/CD. SCPs should be in version control (Terraform or CloudFormation StackSets), deployed through a pipeline with approval gates and automatic rollback. Never apply SCPs manually through the console.
- Always include a break-glass exclusion. Every deny SCP should have a condition excluding a specific “break-glass” IAM role that can bypass the restriction in emergencies. This role should require MFA and be heavily audited, but it prevents you from locking yourself out.
- Monitor with CloudTrail. Set up CloudWatch alarms on `AccessDenied` events in CloudTrail. A sudden spike in access denied errors after an SCP change indicates something is broken.
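A region-restriction SCP with a break-glass exclusion might be generated like this, so it can live in version control and go through a pipeline. The role name pattern and the list of exempted global services are assumptions for illustration; review both against your own organization before using anything like it.

```python
import json

# Global services that are commonly exempted from region restrictions.
# This list is an assumption; adjust it to the services you actually use.
GLOBAL_SERVICE_ACTIONS = [
    "iam:*", "organizations:*", "sts:*",
    "route53:*", "cloudfront:*", "support:*",
]


def region_deny_scp(allowed_regions, break_glass_role_arn_pattern):
    """Deny actions outside approved regions, except for the break-glass role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyOutsideApprovedRegions",
            "Effect": "Deny",
            "NotAction": GLOBAL_SERVICE_ACTIONS,
            "Resource": "*",
            # Both conditions must match for the deny to apply.
            "Condition": {
                "StringNotEquals": {"aws:RequestedRegion": allowed_regions},
                "ArnNotLike": {"aws:PrincipalARN": break_glass_role_arn_pattern},
            },
        }],
    }


scp = region_deny_scp(
    ["us-east-1", "eu-west-1"],
    "arn:aws:iam::*:role/break-glass",  # hypothetical role name
)
print(json.dumps(scp, indent=2))
```

Because the two condition operators are ANDed, a request from the break-glass role fails the `ArnNotLike` test and is therefore not denied, which is what keeps you from locking yourself out.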
4. Explain the difference between SQS, SNS, EventBridge, and Kinesis. When have you used each one, and when have you made the wrong choice?
Difficulty: Intermediate / Senior
What the interviewer is really testing: Whether you have genuine production experience with these services versus textbook knowledge. The “wrong choice” part specifically tests self-awareness and learning from mistakes.
Strong answer: The way I think about it is along two axes: how many consumers and do you need replay.
- SQS is a task queue. One message goes to one consumer. Once processed, it is deleted. Think of it as a to-do list: each item gets assigned to one worker. I use SQS for work distribution — processing uploaded files, handling order fulfillment, running async jobs. The killer feature is simplicity: no shards to manage, no partitioning strategy, nearly infinite throughput for Standard queues, built-in DLQ, and visibility timeout for safe concurrent processing.
- SNS is a megaphone. One message goes to many subscribers (Lambda, SQS, HTTP, email, SMS). It is fan-out with no retention — if the subscriber is down, the message is lost (unless the subscriber is an SQS queue, which buffers it). I use SNS when a single event should trigger multiple independent reactions: “order placed” triggers email, inventory update, and analytics simultaneously.
- EventBridge is SNS with brains. It adds content-based filtering (only route high-value orders to fraud detection), schema registry (know what your events look like), cross-account routing, and archive/replay. I use EventBridge as the default for new event-driven integrations. It is more expensive per event than SNS, but the filtering alone eliminates hundreds of lines of consumer-side `if` statements.
- Kinesis is a log. Events are ordered within a shard, retained for 1-365 days, and multiple consumers read independently at their own pace from any point in the stream. I use Kinesis for real-time analytics (clickstream data), change data capture (reacting to database changes), and any use case where I need to replay the event history (reprocess last 24 hours after deploying a bug fix).
bisectBatchOnFunctionError. We migrated to SQS in a weekend and the system became dramatically simpler. The lesson: do not reach for the most powerful tool when the simple one fits.
Follow-up: You need exactly-once processing. Which of these gives it to you?
Answer: None of them give you exactly-once processing out of the box. This is a fundamental distributed systems truth that separates real practitioners from textbook learners.
- SQS FIFO claims “exactly-once delivery” within a 5-minute deduplication window. It deduplicates messages with the same `MessageDeduplicationId`. But “exactly-once delivery” is not “exactly-once processing.” If your Lambda reads the message, processes it, writes to the database, and then crashes before deleting the message from SQS, the message becomes visible again and gets processed a second time. You still need idempotency in your consumer.
- Kinesis is explicitly at-least-once. Checkpointing happens after processing. If the consumer crashes between processing and checkpointing, the record is reprocessed.
- SNS and EventBridge are at-least-once. Retry logic can deliver the same event multiple times.
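Since every one of these services can redeliver, the consumer has to tolerate duplicates. A minimal sketch of an idempotent consumer follows; `ProcessedStore` is an in-memory stand-in for a conditional write (for example a DynamoDB put guarded by `attribute_not_exists`), and the message shape is hypothetical.

```python
class ProcessedStore:
    """In-memory stand-in for a table that supports a conditional put."""

    def __init__(self):
        self._seen = set()

    def claim(self, message_id):
        """Return True only for the first caller that claims this ID."""
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        return True


def handle(message, store, side_effects):
    """Apply the side effect at most once per message ID."""
    if not store.claim(message["id"]):
        return "skipped"  # duplicate delivery: already handled
    side_effects.append(message["payload"])
    return "processed"


store, effects = ProcessedStore(), []
msg = {"id": "order-42", "payload": "charge card"}
print(handle(msg, store, effects))  # processed
print(handle(msg, store, effects))  # skipped
```

This sketch claims the ID before doing the work, so a crash between the two steps would drop the message. In a real system the claim and the business write should be committed together (one transaction, or one conditional write that carries the result).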
Going Deeper: You mentioned Kinesis shard blocking. Walk me through what happens and how to prevent it.
When Lambda processes records from a Kinesis stream, it reads a batch of records from a shard and invokes your function. If the function returns an error, Lambda retries the same batch. By default it keeps retrying until the records expire from the stream (which could be 24 hours to 365 days depending on your retention setting). During this entire retry loop, no new records on that shard are processed. One poison record blocks all processing on the shard.
Prevention:
- `bisectBatchOnFunctionError: true` — After a failure, Lambda splits the batch in half and retries each half. This narrows down to the single failing record through binary search rather than blocking on the entire batch.
- `maxRetryAttempts` — Set a maximum number of retries (e.g., 3-5). After exhausting retries, the failing record is sent to the on-failure destination.
- On-failure destination — Route failed records to an SQS queue or SNS topic for investigation. This is your DLQ equivalent for stream sources.
- Error handling in your function — Catch exceptions per record within the batch and use the `batchItemFailures` response (enabled with the `ReportBatchItemFailures` function response type) to report which record failed. For Kinesis, Lambda checkpoints up to the lowest failed sequence number and retries from that record onward, instead of replaying the entire batch.
- `maxRecordAge` — Set a maximum age for records. If a record is older than this threshold, Lambda skips it and sends it to the on-failure destination. This prevents ancient records from blocking the shard indefinitely.
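A handler that reports partial batch failures for a Kinesis source could be sketched as below. `process` and its `order_id` check are hypothetical business logic, and the event source mapping is assumed to have `ReportBatchItemFailures` enabled; without that setting the return value is ignored.

```python
import base64
import json


def process(payload):
    """Hypothetical business logic: rejects records without an order_id."""
    if "order_id" not in payload:
        raise ValueError("missing order_id")


def handler(event, context=None):
    for record in event["Records"]:
        seq = record["kinesis"]["sequenceNumber"]
        try:
            # Kinesis record data is base64-encoded in the Lambda event.
            payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
            process(payload)
        except Exception:
            # Stop at the first failure: Lambda retries from this record
            # onward, so later records would be reprocessed anyway.
            return {"batchItemFailures": [{"itemIdentifier": seq}]}
    return {"batchItemFailures": []}
```

Combined with a retry limit and an on-failure destination, the poison record is eventually set aside and the shard keeps moving.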
5. You are building a new microservice. A colleague says “just put it on Lambda” and another says “use Fargate.” How do you make this decision?
Difficulty: Intermediate
What the interviewer is really testing: Structured decision-making about compute platforms, awareness of trade-offs, and the ability to ask clarifying questions instead of jumping to an answer.
Strong answer: My first response would be: “What are the requirements?” Because neither is universally better — they optimize for different things. Here are the questions I would ask:
- What is the traffic pattern? If it is spiky (zero at night, peak during the day), Lambda’s scale-to-zero saves money. If it is steady 24/7 traffic, Fargate’s always-on pricing wins.
- What is the latency SLA? If p99 must be under 100ms, Lambda cold starts are a risk unless I pay for provisioned concurrency. Fargate is always warm.
- How long does a single request take? Lambda has a 15-minute hard timeout. If the service does long-running work (video processing, ML training, large file transformations), Lambda is out.
- What are the dependency requirements? If the service needs 2GB of ML model files, Lambda’s container image path works but cold starts will be painful. Fargate handles large images without cold start penalties.
- Does it talk to a relational database? Lambda-to-RDS without RDS Proxy is a foot-gun (connection storms). Fargate maintains persistent connection pools naturally.
- What does the team know? If the team has deep container experience with existing Dockerfiles, CI/CD pipelines, and monitoring, Fargate is lower friction. If the team is small and does not want to manage containers, Lambda is simpler.
Follow-up: At what traffic volume does the cost crossover from Lambda to Fargate typically happen?
Answer: The crossover depends on three variables: memory allocation, execution duration, and traffic steadiness. But for a common configuration (256MB memory, 200ms average duration):
- Below 1 million requests/month: Lambda wins overwhelmingly. The free tier alone covers most of this. A Fargate task running 24/7 costs at minimum ~$9/month regardless of traffic.
- 1-10 million requests/month: Comparable cost. Lambda is roughly 9-20/month. The difference is not worth optimizing for.
- 10-100 million requests/month: Fargate wins on raw compute cost. Lambda at 50M requests/month with 256MB/200ms costs ~40/month.
- Above 100 million requests/month: Fargate or ECS on EC2 is dramatically cheaper. At this scale, consider EC2 with Savings Plans — you might save another 50-70%.
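To rerun this comparison for your own workload, a small calculator helps. The unit prices below are assumptions (approximate us-east-1 x86 on-demand list prices at the time of writing) and ignore the Lambda free tier, so check current pricing before relying on the output.

```python
# Assumed unit prices; verify against the current AWS pricing pages.
LAMBDA_PER_MILLION_REQUESTS = 0.20
LAMBDA_PER_GB_SECOND = 0.0000166667
FARGATE_PER_VCPU_HOUR = 0.04048
FARGATE_PER_GB_HOUR = 0.004445
HOURS_PER_MONTH = 730


def lambda_monthly_cost(requests, memory_mb, duration_ms):
    """Request charge plus compute charge, ignoring the free tier."""
    gb_seconds = requests * (duration_ms / 1000) * (memory_mb / 1024)
    request_cost = requests / 1_000_000 * LAMBDA_PER_MILLION_REQUESTS
    return request_cost + gb_seconds * LAMBDA_PER_GB_SECOND


def fargate_monthly_cost(vcpu, memory_gb, tasks=1):
    """Always-on cost for a fixed number of tasks."""
    hourly = vcpu * FARGATE_PER_VCPU_HOUR + memory_gb * FARGATE_PER_GB_HOUR
    return hourly * HOURS_PER_MONTH * tasks


for monthly_requests in (1_000_000, 10_000_000, 50_000_000, 100_000_000):
    cost = lambda_monthly_cost(monthly_requests, memory_mb=256, duration_ms=200)
    print(f"{monthly_requests:>11,} requests: Lambda ${cost:,.2f}")

print(f"One 0.25 vCPU / 0.5 GB Fargate task: ${fargate_monthly_cost(0.25, 0.5):,.2f}")
```

The Fargate side depends on how many tasks you need to serve the same traffic, which this sketch leaves as the `tasks` parameter rather than guessing.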
Follow-up: What about App Runner? When does that fit?
Answer: App Runner is the “I want Fargate but even simpler” option. You point it at a container image or a source code repository, and AWS handles build, deploy, scaling, TLS, load balancing — everything. It is the closest AWS equivalent to Heroku or Google Cloud Run.
When it fits: Web applications and APIs where the team does not want to configure VPCs, load balancers, or auto-scaling policies. Prototypes. Services owned by teams that are not infrastructure-savvy. The developer experience is excellent — push code, get a URL.
When it does not fit: When you need VPC integration (private databases, internal services), custom networking, fine-grained scaling controls, or cost optimization. App Runner’s pricing is slightly higher than raw Fargate because you are paying for the managed layer. And it has fewer knobs — you cannot configure health check paths, connection draining timeouts, or scaling cooldown periods with the same granularity as Fargate.
My honest take: App Runner is under-discussed. For teams that want container-based deployment without Kubernetes or Fargate complexity, it is a great option. But it occupies a narrow sweet spot — simpler than Fargate, more capable than Lambda, less flexible than either.
6. A critical production service depends on DynamoDB. The table is getting hot-partition throttling at 3 AM. Walk me through your diagnosis and fix.
Difficulty: Senior
What the interviewer is really testing: Real operational debugging ability with DynamoDB, understanding of partition key design, and the ability to diagnose production issues under pressure.
Strong answer: Hot-partition throttling means one or more partition keys are receiving disproportionate traffic. DynamoDB distributes data and throughput across partitions based on the partition key. If one key gets 10x the traffic of others, that partition exhausts its allocated throughput while other partitions sit idle. DynamoDB’s adaptive capacity helps (it redistributes throughput to hot partitions over minutes), but it cannot fully compensate for severely skewed access patterns.
Diagnosis steps:
- Check CloudWatch Contributor Insights. Enable it on the table — it shows the most frequently accessed and throttled partition keys. This immediately tells you which keys are hot. If the top key accounts for 30% of all reads at 3 AM, you have found the problem.
- Correlate with application behavior. What runs at 3 AM? Batch jobs, cron tasks, data exports, report generation. A nightly job scanning a specific partition key range (e.g., all orders from yesterday using a partition key of `date=2026-04-09`) concentrates all traffic on one partition.
- Check if it is on-demand or provisioned. On-demand tables can still throttle if traffic exceeds the table’s previous peak by more than 2x within 30 minutes. Provisioned tables throttle when consumed capacity exceeds allocated capacity per partition.
Fixes:
- Immediate: Increase provisioned capacity or switch to on-demand. This buys time but does not fix the root cause if the partition key design is skewed.
- Short-term: Add write sharding to the hot key. Append a random suffix (e.g., `date=2026-04-09#3`) to spread writes across multiple physical partitions. The batch reader scatter-gathers across all suffixes. This is the standard DynamoDB hot-partition pattern.
- Medium-term: Redesign the access pattern. If the 3 AM job scans by date, and the partition key is `date`, every daily scan hits one partition. Restructure the key to include another dimension (e.g., `PK = customer_id`, `SK = date`) so that the scan is distributed. Use a GSI if query access patterns require the date-first lookup.
- Architectural: DAX (DynamoDB Accelerator). If the 3 AM traffic is read-heavy and reads the same data repeatedly, put DAX in front of the table. DAX caches reads at the item and query level, absorbing the burst without hitting the underlying partition.
Follow-up: The hot key is a single global counter that multiple services increment. How do you handle that?
Answer: A global counter on a single DynamoDB item is the classic hot-partition anti-pattern. Every increment hits the same partition key, same physical partition, and that partition maxes out at roughly 1,000 WCUs.
The scatter-gather counter pattern: Instead of one item `{PK: "global_counter", count: N}`, create N shards: `{PK: "counter#0", count: X}`, `{PK: "counter#1", count: Y}`, …, `{PK: "counter#N-1", count: Z}`. Each writer randomly picks a shard and increments it. To read the total, query all shards and sum. With 10 shards, you spread write throughput 10x.
- Choose shard count based on expected writes per second. Each shard can handle ~1,000 WCUs, so 10 shards handle ~10,000 increments per second.
- The trade-off is read complexity: getting the current total requires reading all shards. For use cases where you only need an approximate count or can tolerate 1-second staleness, cache the total in a separate item and update it periodically.
- If exact real-time counts are needed, use DynamoDB Streams to aggregate changes into a single total asynchronously.
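The sharded counter can be illustrated without touching DynamoDB. In this sketch a plain dict stands in for the table, and `ShardedCounter` is a hypothetical helper; a real implementation would issue an atomic `ADD` update against the chosen shard's item.

```python
import random


class ShardedCounter:
    """Spread increments over N items; sum them to read the total."""

    def __init__(self, name, shards=10, rng=None):
        # Shard keys run from name#0 to name#(shards - 1).
        self.keys = [f"{name}#{i}" for i in range(shards)]
        self.table = {key: 0 for key in self.keys}  # stand-in for the table
        self.rng = rng or random.Random()

    def increment(self, amount=1):
        """Pick a random shard so writes spread across partition keys."""
        key = self.rng.choice(self.keys)
        self.table[key] += amount
        return key

    def total(self):
        """Scatter-gather read: one lookup per shard, then sum."""
        return sum(self.table[key] for key in self.keys)


counter = ShardedCounter("counter", shards=10)
for _ in range(1_000):
    counter.increment()
print(counter.total())  # 1000
```

Reads cost one lookup per shard, which is why caching the summed total in a separate item is attractive when a slightly stale count is acceptable.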
INCR is atomic and handles hundreds of thousands of increments per second on a single key. Periodically persist the value to DynamoDB for durability. This is what I would recommend if the counter is in a hot path (e.g., rate limiting) rather than just analytics.
7. Your S3 data transfer bill is $15,000/month. How do you bring it down?
Difficulty: Senior
What the interviewer is really testing: Deep understanding of S3 cost model, CDN and caching strategies, architectural thinking about data movement, and cost engineering as a core engineering skill.
Strong answer: $15,000/month at $0.09/GB means roughly 167 TB of data leaving S3 per month. That is significant. Here is my systematic approach:
Step 1: Understand where the data is going. S3 server access logs and CloudTrail S3 data events tell me which buckets, prefixes, and clients are responsible. I would break it down by:
- S3 to internet (most expensive at $0.09/GB)
- S3 to CloudFront (free, at $0.00/GB for origin fetches in most regions)
- S3 cross-region (e.g., replication to another region at $0.02/GB)
- S3 to EC2/Lambda in the same region (free — but going through NAT Gateway costs $0.045/GB)
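As a quick sanity check on the volume estimate, the arithmetic can be reproduced in a few lines. The sketch assumes the whole bill is internet egress at the $0.09/GB rate quoted above; the function name is illustrative.

```python
EGRESS_PRICE_PER_GB = 0.09  # S3-to-internet rate quoted above


def gb_for_bill(monthly_bill_usd, price_per_gb):
    """Data volume (GB) implied by a bill at a flat per-GB price."""
    return monthly_bill_usd / price_per_gb


egress_gb = gb_for_bill(15_000, EGRESS_PRICE_PER_GB)
print(f"{egress_gb:,.0f} GB, about {egress_gb / 1000:.0f} TB")
# 166,667 GB, about 167 TB
```

If part of the traffic actually flows over one of the cheaper paths in the breakdown, the real volume is higher than this figure, which is why attributing the bill per path comes first.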
Follow-up: You put CloudFront in front of S3 and cache hit ratio is only 20%. Why, and how do you fix it?
Answer: A 20% cache hit ratio means 80% of requests are cache misses, which means either the content is not cacheable or the cache is not configured well. Common causes:
- Unique URLs per user. If each request includes a unique query string parameter (session token, timestamp, user ID), CloudFront treats each URL as a unique cache key. Fix: Configure CloudFront to ignore irrelevant query strings in the cache key. Whitelist only query parameters that actually change the content (e.g., size=thumbnail vs size=full).
- Low TTL or Cache-Control: no-cache headers. If S3 objects have no Cache-Control header or set max-age=0, CloudFront does not cache them (or caches very briefly). Fix: Set appropriate Cache-Control headers on S3 objects. Static assets: max-age=31536000, immutable. Dynamic content: use a shorter TTL but still cache (even max-age=60 dramatically reduces origin load).
- Long-tail content. If you serve millions of unique objects and each is accessed once per day, even perfect caching does not help because the first request is always a miss. Fix: This is a content distribution problem, not a caching problem. Consider pre-warming the cache for popular content or accepting the miss rate and focusing on reducing object sizes instead.
- Single edge location. If all traffic comes from one region but CloudFront distributes across hundreds of edge locations, each edge sees low traffic and evicts cached content quickly. Fix: Use CloudFront’s Price Class to restrict to fewer edge locations (higher hit rate per location) or use Origin Shield (an additional caching layer between edge locations and S3 that centralizes cache fills).
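The query-string cause is easiest to see by modelling the cache key directly. The sketch below is not CloudFront configuration; it is a small Python model of what a whitelist does to the key, with `ALLOWED_PARAMS` and the example URLs as assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit

# Hypothetical whitelist: only parameters that change the response body.
ALLOWED_PARAMS = {"size"}


def cache_key(url):
    """Build a normalized cache key: path plus whitelisted, sorted query params."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query) if k in ALLOWED_PARAMS)
    query = urlencode(kept)
    return parts.path + ("?" + query if query else "")


a = cache_key("https://cdn.example.com/img/cat.jpg?size=thumbnail&session=abc123")
b = cache_key("https://cdn.example.com/img/cat.jpg?session=zzz999&size=thumbnail")
print(a)       # /img/cat.jpg?size=thumbnail
print(a == b)  # True
```

Without the whitelist, the two requests would produce two different keys and therefore two origin fetches for the same object.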
8. Your team wants to go multi-region active-active. When would you argue against it?
Difficulty: Staff-Level
What the interviewer is really testing: Architectural maturity, the ability to push back on complexity, and understanding of the real costs and risks of multi-region deployments.
Strong answer: Multi-region active-active is one of the most over-prescribed architectural patterns. It sounds great in a design meeting, but the operational complexity is staggering. I would argue against it in most cases. Here is my framework:
Argue against when:
- The business does not actually need zero-downtime failover. Most services can tolerate 30-60 seconds of downtime during an active-passive failover. Ask the business: “What is the cost per minute of downtime?” If the answer is less than the engineering cost of maintaining active-active, it is not worth it. For most SaaS companies, active-passive with Route 53 health checks and automated failover provides 99.95%+ availability.
- The data layer has strong consistency requirements. Active-active means writes happen in multiple regions simultaneously. For eventually consistent data (user preferences, analytics, content caches), this is manageable. For strongly consistent data (financial balances, inventory counts, sequential ordering), you enter split-brain territory. DynamoDB Global Tables give you eventual consistency with ~1 second replication lag. That 1-second window is where duplicate orders, double-charges, and inventory over-sells live.
- The team does not have the operational maturity to run it. Active-active requires: per-region monitoring and alerting, automated failover testing, data conflict resolution strategies, region-aware routing, region-specific deployment pipelines, and engineers who can debug cross-region replication issues at 3 AM. If the team has not mastered single-region operations (observability, incident response, deployment automation), adding a second region multiplies problems rather than solving them.
- The cost does not justify the benefit. Active-active roughly doubles your infrastructure cost (two of everything) plus adds 30-50% operational overhead (cross-region replication, routing, testing). For a service whose revenue is small relative to that added spend, active-active infrastructure is absurd. For a service whose revenue dwarfs it, it is a no-brainer.
Argue for when:
- Regulatory requirement (EU data must be served from EU, US from US) — but this is often better solved with region-pinned routing, not true active-active.
- Latency-sensitive global user base where 200ms cross-ocean latency is unacceptable.
- The business genuinely cannot tolerate the 30-60 seconds of active-passive failover (financial trading, real-time gaming, emergency services).
Follow-up: If you do go active-active, how do you handle data conflicts?
Answer: Data conflicts in active-active occur when both regions write to the same record within the replication window. The strategies depend on the data type:
- Last-writer-wins (LWW). DynamoDB Global Tables use this by default: the write with the latest timestamp wins. This is fine for data where the most recent value is always correct (user profile updates, session data, preferences). It is dangerous for counters, balances, or any additive data where both writes carry information.
- Application-level conflict resolution. Instead of overwriting, design your data model so writes are additive. Use event sourcing: instead of updating a balance, append a debit/credit event. Both regions can append independently, and the balance is derived by replaying events. Conflicts become “concurrent events” that are both valid.
- Region-pinning with failover. Assign each entity (user, account, order) a “home” region. All writes for that entity go to its home region. The other region serves reads from the replica. On failover, the secondary region promotes to writer. This avoids conflicts entirely at the cost of cross-region write latency for entities whose users are in the “wrong” region.
- CRDTs (Conflict-Free Replicated Data Types). For specific data structures (counters, sets, flags), CRDTs guarantee that concurrent updates from different regions merge deterministically without conflicts. This is theoretically elegant but requires redesigning your data model around CRDT-compatible structures. Practically, it works for counters (G-Counter), boolean flags (LWW-Register), and sets (OR-Set). It does not work for arbitrary relational data.
- My recommendation for most teams: Region-pinning with failover. It sidesteps the conflict problem entirely, is simple to reason about, and gives you active-active read scalability with single-region write simplicity. True active-active writes to the same data from multiple regions should be reserved for teams with deep distributed systems expertise.
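To make the CRDT option concrete, here is a minimal sketch of a G-Counter (grow-only counter). It is an illustration of the merge rule only, with region names chosen as examples; it is not how DynamoDB Global Tables resolve conflicts, which is last-writer-wins.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per region, merged by element-wise max."""

    def __init__(self, region):
        self.region = region
        self.counts = {}  # region name -> increments originated in that region

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.region] = self.counts.get(self.region, 0) + amount

    def merge(self, other):
        # Taking the max per region makes merge commutative and idempotent,
        # so replicas converge regardless of delivery order or duplicates.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self):
        return sum(self.counts.values())


us = GCounter("us-east-1")
eu = GCounter("eu-west-1")
us.increment(3)  # concurrent writes in two regions
eu.increment(2)
us.merge(eu)
eu.merge(us)
print(us.value(), eu.value())  # 5 5
```

Under last-writer-wins the same two concurrent writes would leave the counter at 3 or 2, silently dropping the other region's increments.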
9. Explain VPC Endpoints — Gateway vs Interface. When do you use each, and what is the cost impact?
Difficulty: Intermediate / Senior
What the interviewer is really testing: Whether you understand AWS networking costs at a practical level and can optimize real infrastructure spend.
Strong answer: VPC endpoints provide private connectivity from your VPC to AWS services without routing traffic through the internet (via a NAT Gateway or Internet Gateway). There are two types, and they have very different architectures and cost profiles.
Gateway Endpoints:
- Available only for S3 and DynamoDB.
- Free. No hourly charge, no data processing charge.
- Implemented as a route table entry. You add the endpoint, associate it with your private subnet route tables, and traffic to S3/DynamoDB is routed directly through the AWS backbone — it never leaves the AWS network and never touches your NAT Gateway.
- You should enable these on every VPC with private subnets. There is zero reason not to.
Interface Endpoints:
- Available for 100+ AWS services (SQS, SNS, ECR, CloudWatch Logs, Secrets Manager, KMS, etc.) and third-party services.
- Cost: ~$0.01/hour per AZ (~$7.30/month per AZ) plus $0.01/GB of data processed.
- Implemented as an ENI (Elastic Network Interface) in your subnet. The endpoint gets a private IP address in your VPC, and traffic to the service resolves to that private IP via a private hosted zone.
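- Use when: the data processing savings through avoiding NAT Gateway ($0.045/GB) exceed the endpoint hourly cost. Each GB moved off the NAT Gateway saves about $0.035 ($0.045 minus the endpoint's $0.01), so the $7.30 monthly charge breaks even at roughly 200 GB per month per AZ. High-traffic services (ECR image pulls, CloudWatch Logs ingestion) usually clear that easily.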
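The break-even point follows directly from the prices quoted in this answer. A minimal sketch of the arithmetic, assuming a $0.01/hour per-AZ endpoint charge and a 730-hour month (both assumptions; check current pricing for your region):

```python
HOURS_PER_MONTH = 730
ENDPOINT_HOURLY_PER_AZ = 0.01  # assumed list price per AZ-hour
ENDPOINT_PER_GB = 0.01         # endpoint data processing, per GB
NAT_PER_GB = 0.045             # NAT Gateway data processing, per GB


def break_even_gb(az_count=1):
    """Monthly GB at which an Interface Endpoint costs the same as using NAT."""
    fixed_monthly = ENDPOINT_HOURLY_PER_AZ * HOURS_PER_MONTH * az_count
    saving_per_gb = NAT_PER_GB - ENDPOINT_PER_GB
    return fixed_monthly / saving_per_gb


print(round(break_even_gb(1)))  # 209
print(round(break_even_gb(2)))  # 417
```

Below that volume the endpoint is a net cost and is justified only by the security benefit of keeping traffic private.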
Follow-up: You have Interface Endpoints for ECR in two AZs but container image pulls are still going through the NAT Gateway. What went wrong?
Answer: This is a surprisingly common misconfiguration. There are several things to check:
- Private DNS is not enabled. When you create an Interface Endpoint, you can enable “Private DNS.” This creates a private hosted zone that overrides the public DNS name (e.g., api.ecr.us-east-1.amazonaws.com) to resolve to the endpoint’s private IP address. If private DNS is not enabled, your ECS tasks still resolve the ECR hostname to its public IP, which routes through the NAT Gateway.
- DNS resolution is not enabled on the VPC. The VPC must have enableDnsSupport and enableDnsHostnames set to true for private hosted zones to work. Check VPC settings.
- There are actually two ECR endpoints required. ECR requires both com.amazonaws.region.ecr.api (for API calls like authentication) and com.amazonaws.region.ecr.dkr (for Docker image layer downloads). Missing either one causes partial or complete fallback to the public endpoint. You also need an S3 Gateway Endpoint because ECR stores image layers in S3.
- Security group on the endpoint is too restrictive. Interface Endpoints have security groups. If the security group does not allow inbound HTTPS (port 443) from your ECS tasks’ security group, the connection fails and the SDK falls back to the public endpoint.
- The endpoint is in the wrong AZs. If your ECS tasks run in AZs where the endpoint does not have an ENI, traffic from those AZs goes through the public path. Create the endpoint in all AZs where your tasks run.
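The "two ECR endpoints plus S3" check lends itself to a small script. The sketch below only compares service names; it assumes you have already collected the `ServiceName` values of the VPC's endpoints (for example from the EC2 `DescribeVpcEndpoints` API), and the function name is hypothetical.

```python
def missing_ecr_endpoints(region, existing_service_names):
    """Return the endpoint service names ECR image pulls need but the VPC lacks."""
    required = {
        f"com.amazonaws.{region}.ecr.api",  # auth and other API calls
        f"com.amazonaws.{region}.ecr.dkr",  # Docker registry traffic
        f"com.amazonaws.{region}.s3",       # image layers are stored in S3
    }
    return sorted(required - set(existing_service_names))


existing = ["com.amazonaws.us-east-1.ecr.api"]
missing = missing_ecr_endpoints("us-east-1", existing)
print(missing)
# ['com.amazonaws.us-east-1.ecr.dkr', 'com.amazonaws.us-east-1.s3']
```

A name check like this does not cover private DNS, security groups, or AZ placement; those still need to be verified separately.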
10. You are designing the CI/CD pipeline for a team with 30 Lambda functions. What is your deployment strategy?
Difficulty: Senior
What the interviewer is really testing: Practical DevOps thinking for serverless, understanding of deployment safety mechanisms, and the ability to design for both velocity and reliability.
Strong answer: 30 Lambda functions is enough that you need structure, but not so many that you need a platform team. My approach:
Repository structure: Monorepo with each function in its own directory. Shared code (utilities, models, SDK wrappers) in a shared/ directory that builds into a Lambda Layer. A change to shared code triggers deployment of all functions that use that layer. A change to a single function deploys only that function.
Build pipeline (CodePipeline or GitHub Actions):
- On PR: Lint, unit test, sam build to verify the functions compile/package. Run integration tests against a dev-stage AWS account.
- On merge to main: Build all changed functions, push artifacts to S3 (zip) or ECR (container images), deploy to staging account.
- Staging validation: Run smoke tests against the staging deployment. If all pass, proceed to production.
- Production deployment: Deploy with traffic shifting using Lambda aliases and CodeDeploy integration.
- Each function has a live alias pointing to the current production version.
- On deployment, I publish a new version and shift the live alias using CodeDeploy with a canary or linear strategy:
  - Canary10Percent5Minutes: Route 10% of traffic to the new version. If CloudWatch alarms (error rate, latency p99, custom business metrics) trigger within 5 minutes, auto-rollback. If clean, shift to 100%.
  - Linear10PercentEvery1Minute: For less risky changes, shift 10% every minute over 10 minutes.
- The rollback is instant: just re-point the alias to the previous version. No re-deployment needed.
- Publish the shared layer as a versioned Layer. Pin production functions to a specific layer version (not $LATEST).
- Layer updates go through the same canary deployment. Deploy the new layer to one function first, validate, then roll out to all functions.
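Under the hood, a canary shift is a weighted alias. CodeDeploy manages this for you, but as a sketch of what the alias looks like mid-deployment, the helper below builds the keyword arguments one could pass to the Lambda `UpdateAlias` API (for example `boto3`'s `update_alias`). The function name `orders-api` and version numbers are hypothetical, and nothing here calls AWS.

```python
def canary_alias_update(function_name, stable_version, canary_version, canary_weight):
    """Build UpdateAlias arguments that send `canary_weight` of traffic to the new version."""
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0 and 1")
    weights = {canary_version: canary_weight} if canary_weight > 0 else {}
    return {
        "FunctionName": function_name,
        "Name": "live",                     # the production alias
        "FunctionVersion": stable_version,  # receives the remaining traffic
        "RoutingConfig": {"AdditionalVersionWeights": weights},
    }


# 10% canary: version 42 gets a tenth of invocations, version 41 the rest.
canary = canary_alias_update("orders-api", "41", "42", 0.10)
# Rollback: the same alias with no additional weights, pointing at version 41.
rollback = canary_alias_update("orders-api", "41", "42", 0.0)
print(canary["RoutingConfig"])    # {'AdditionalVersionWeights': {'42': 0.1}}
print(rollback["RoutingConfig"])  # {'AdditionalVersionWeights': {}}
```

Because rollback is only an alias update, it takes effect without building or uploading anything.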
Follow-up: How do you handle database schema migrations in this serverless CI/CD pipeline?
Answer: Schema migrations in serverless are tricky because there is no persistent server to run migration scripts from. Here is what works:
- Dedicated migration Lambda. A Lambda function whose sole purpose is running database migrations. It is triggered as a custom resource in CloudFormation/CDK or as a step in the CI/CD pipeline (invoke via AWS CLI after deployment). It connects to the database via RDS Proxy, runs the migration scripts (using a tool like Flyway, Alembic, or Knex), and reports success/failure.
- Backward-compatible migrations only. Since Lambda functions are deployed with canary traffic shifting, the old version and new version run simultaneously during deployment. The database schema must be compatible with both versions. This means: add columns (nullable or with defaults), never rename or drop columns during deployment. Use a two-phase migration: Phase 1 deploys the new schema (additive only) and the new code that can use both old and new schema. Phase 2 (days later, after all traffic has shifted) deploys a cleanup migration that removes the old columns.
- Migration as a pre-deploy step. In the CI/CD pipeline, run the migration Lambda before deploying the new function code. If the migration fails, the deployment stops and the old code continues running against the old schema. If the migration succeeds, deploy the new code that uses the new schema.
- For DynamoDB: There are no schema migrations in the traditional sense (DynamoDB is schemaless). But adding a new GSI takes time and consumes write capacity. Add new GSIs in a separate deployment step, monitor the backfill progress, and only deploy the code that queries the new GSI after the backfill completes.
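The backward-compatible (additive) phase can be demonstrated with an in-memory SQLite database. This is only a sketch of the principle; the `orders` table, the `currency` column, and the two insert functions standing in for "old code" and "new code" are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total_cents INTEGER NOT NULL)")


def old_insert(conn, total_cents):
    # Old function version: does not know about the new column.
    conn.execute("INSERT INTO orders (total_cents) VALUES (?)", (total_cents,))


def new_insert(conn, total_cents, currency):
    # New function version: writes the new column explicitly.
    conn.execute(
        "INSERT INTO orders (total_cents, currency) VALUES (?, ?)",
        (total_cents, currency),
    )


old_insert(conn, 1000)

# Phase 1: additive migration only (new column with a default, nothing dropped).
conn.execute("ALTER TABLE orders ADD COLUMN currency TEXT DEFAULT 'USD'")

# During the canary window both versions run against the same schema.
old_insert(conn, 2500)
new_insert(conn, 990, "EUR")

rows = conn.execute("SELECT total_cents, currency FROM orders ORDER BY id").fetchall()
print(rows)  # [(1000, 'USD'), (2500, 'USD'), (990, 'EUR')]
```

Had the migration renamed or dropped `total_cents`, `old_insert` would start failing the moment the migration ran, which is exactly what the two-phase approach avoids.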
Going Deeper: How do you test 30 Lambda functions locally before deploying?
SAM CLI local invoke is the primary tool. sam local invoke FunctionName -e event.json runs your function in a Docker container that mimics the Lambda runtime. It is not perfect (no cold start simulation, no IAM role), but it catches most logic errors.
For integration testing, I would use sam local start-api to spin up a local API Gateway + Lambda environment, then run the API test suite against it. For event-driven functions (SQS, S3 triggers), use sam local invoke with sample events generated by sam local generate-event.
The honest caveat: local testing catches code bugs but not deployment bugs, IAM issues, or VPC networking problems. I would maintain a dedicated dev/test AWS account where the CI pipeline deploys on every PR and runs integration tests against real AWS services. The confidence hierarchy is: unit tests (fast, local) -> local Lambda testing (medium, Docker) -> integration tests in dev account (slow, real AWS) -> canary deployment in production (final safety net).
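The first rung of that hierarchy, a fast local unit test, needs nothing but the handler and a hand-built event. A minimal sketch, assuming a hypothetical SQS-triggered handler; the event is reduced to the fields the handler reads rather than the full SQS record shape.

```python
import json


def handler(event, context):
    """Hypothetical handler: sums the `amount` field across SQS message bodies."""
    amounts = [json.loads(record["body"])["amount"] for record in event["Records"]]
    return {"processed": len(amounts), "sum": sum(amounts)}


def sqs_event(*bodies):
    """Build a minimal SQS-style event containing one record per body."""
    return {
        "Records": [
            {"messageId": str(i), "body": json.dumps(body)}
            for i, body in enumerate(bodies)
        ]
    }


result = handler(sqs_event({"amount": 5}, {"amount": 7}), None)
print(result)  # {'processed': 2, 'sum': 12}
```

Tests like this run in milliseconds and catch logic errors, but they say nothing about IAM, VPC networking, or event source configuration, which is what the later rungs are for.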
11. What is the most expensive mistake you have seen (or made) on AWS, and what did you learn from it?
Difficulty: Senior (tests production experience and self-awareness)
What the interviewer is really testing: Real-world experience, not theoretical knowledge. They want to see humility, systematic thinking about cost, and whether you have actually operated AWS at scale.
Strong answer (example narrative): The most expensive mistake I saw was a recursive Lambda invocation that went undetected for 4 hours on a Friday night. A Lambda function processed images uploaded to an S3 bucket. The processed output was written back to the same bucket with a different prefix. But the S3 event notification was configured with no prefix filter, so it triggered on all PutObject events, not just the upload prefix. So: upload triggers Lambda, Lambda writes output to same bucket, output triggers Lambda again, which writes another output, infinitely.
In 4 hours, the function executed 23 million times and generated 4 TB of duplicate output objects. The bill was around $12,000 — Lambda invocations, S3 PUT requests, S3 storage, and the biggest chunk was S3 request costs on the millions of PUTs.
What I learned:
- Always use prefix filters on S3 event notifications. Source prefix and destination prefix must be different, and the event notification must filter to only the source prefix.
- Set reserved concurrency on every Lambda function. If this function had reserved concurrency of 10, the loop would have been throttled and the blast radius contained.
- AWS now has recursive loop detection for Lambda-S3-Lambda loops (released 2023). But do not rely on detection — design to prevent recursion.
- Set up AWS Budgets with action thresholds. A budget action can automatically apply an SCP that denies Lambda invocations if the monthly spend exceeds a threshold. We now have a $500 daily anomaly alert on every account.
- Use a separate destination bucket. The simplest architectural fix is never writing output to the same bucket as input. Two buckets, one event notification, zero recursion risk.
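Beyond the notification filter, the handler itself can refuse to process its own output as a second line of defense. A minimal sketch, assuming hypothetical `uploads/` and `processed/` prefixes and an event trimmed to the S3 fields the code reads:

```python
from urllib.parse import unquote_plus

SOURCE_PREFIX = "uploads/"    # hypothetical input prefix
OUTPUT_PREFIX = "processed/"  # hypothetical output prefix


def keys_to_process(event):
    """Return only the object keys under the source prefix; everything else is ignored."""
    keys = []
    for record in event.get("Records", []):
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(SOURCE_PREFIX):
            keys.append(key)
    return keys


def output_key(source_key):
    """Map uploads/cat.jpg to processed/cat.jpg."""
    return OUTPUT_PREFIX + source_key[len(SOURCE_PREFIX):]


event = {
    "Records": [
        {"s3": {"object": {"key": "uploads/cat+photo.jpg"}}},
        {"s3": {"object": {"key": "processed/cat+photo.jpg"}}},  # our own output
    ]
}
todo = keys_to_process(event)
print(todo)                           # ['uploads/cat photo.jpg']
print([output_key(k) for k in todo])  # ['processed/cat photo.jpg']
```

The guard does not stop the extra invocations from being billed if the notification is misconfigured, but it does stop each one from writing a new object and feeding the loop.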
Follow-up: How do you set up cost guardrails to prevent runaway bills on a new AWS account?
Answer:
- AWS Budgets: Create a monthly budget with alerts at 50%, 80%, and 100% of expected spend. At 100%, trigger a budget action that notifies via SNS and optionally applies an SCP restricting resource creation.
- Cost Anomaly Detection: Enable it on the billing account. It uses ML to detect unusual spending patterns and alerts within hours rather than waiting for the end of the month. Configure alerts for both percentage-based (50% above forecast) and absolute-dollar thresholds ($100 anomaly).
- SCPs for sandbox accounts: Deny creation of expensive resources (large EC2 instances, large RDS instances, Redshift clusters, SageMaker endpoints). Deny actions in regions you do not use.
- Lambda concurrency limits: Set account-level concurrency limits in non-production accounts to a low number (100-200). This prevents any single runaway function from generating millions of invocations.
- S3 lifecycle policies from day one: Abort incomplete multipart uploads after 3 days. Expire old object versions after 30 days. Transition to cheaper storage classes on a schedule.
- Tag enforcement: Use AWS Organizations Tag Policies to require team and environment tags on all resources. Untagged resources generate unattributable costs that nobody optimizes.
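The S3 lifecycle guardrail can be expressed as a single rule. The sketch below builds the dictionary shape accepted by the S3 `PutBucketLifecycleConfiguration` API; the rule ID, the 90-day transition, and the `STANDARD_IA` target are assumptions chosen for illustration, and nothing here calls AWS.

```python
def guardrail_lifecycle(abort_days=3, noncurrent_days=30, transition_days=90):
    """Build a bucket-wide lifecycle configuration with the day-one guardrails."""
    return {
        "Rules": [
            {
                "ID": "day-one-guardrails",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": ""},    # empty prefix: applies to every object
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": abort_days},
                "NoncurrentVersionExpiration": {"NoncurrentDays": noncurrent_days},
                # Assumed schedule: move to infrequent access after 90 days.
                "Transitions": [{"Days": transition_days, "StorageClass": "STANDARD_IA"}],
            }
        ]
    }


config = guardrail_lifecycle()
rule = config["Rules"][0]
print(rule["AbortIncompleteMultipartUpload"])  # {'DaysAfterInitiation': 3}
print(rule["NoncurrentVersionExpiration"])     # {'NoncurrentDays': 30}
```

Keeping the thresholds as function parameters makes it easy to apply the same rule to every new bucket while still allowing per-bucket overrides.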
12. You need to migrate a monolithic application from EC2 to a cloud-native architecture. How do you approach this?
Difficulty: Staff-Level
What the interviewer is really testing: Strategic migration planning, risk management, understanding of the Strangler Fig pattern, and the judgment to prioritize what to migrate first.
Strong answer: The first thing I would not do is rewrite everything at once. The big-bang rewrite is the highest-risk approach and fails more often than it succeeds. Instead, I use the Strangler Fig pattern: incrementally replacing pieces of the monolith with cloud-native services while the monolith continues running.
Phase 0: Understand the monolith (2-4 weeks). Before touching anything, I would map the monolith’s components, their dependencies, data stores, and traffic patterns. Identify which components have clear boundaries (a payments module that talks to a specific set of tables) versus which are deeply entangled (a “utils” module imported by everything). Use application performance monitoring (X-Ray, Datadog APM) to trace request flows and identify hot paths.
Phase 1: Lift and shift to containers (2-4 weeks). Containerize the monolith as-is and deploy on ECS Fargate. No code changes, no re-architecture. This gets you off EC2, into a reproducible deployment pipeline, with auto-scaling and health checks. The monolith runs identically, just in a container. This de-risks the migration: if anything breaks later, you can always fall back to “container running the monolith.”
Phase 2: Extract the easiest, highest-value component (4-8 weeks). Pick a component that is: (a) loosely coupled (few dependencies on other monolith code), (b) independently deployable (has its own API or event interface), and (c) a good candidate for serverless (event-driven, bursty traffic, or stateless). Common first extractions: image processing, email/notification sending, report generation, webhook handling. Extract it into a Lambda function or Fargate service. Route traffic to the new service using an API Gateway or load balancer rule. The monolith continues handling everything else.
Phase 3: Repeat for each component (ongoing). Work through the monolith component by component, extracting to the appropriate compute platform. For each extraction: deploy the new service alongside the monolith, shadow traffic or canary to verify correctness, then cut over. Keep the monolith code for that component intact but unreachable as a rollback path for 2-4 weeks.
Phase 4: Decommission the monolith. When the last component is extracted, the monolith container is an empty shell. Turn it off. In practice, this takes 6-18 months for a medium-sized monolith.
Key decisions at each phase:
- Database: Do not try to split the database first. That is the hardest part. Start with the monolith and new services sharing the same database (via RDS Proxy if needed). Extract service-specific tables into their own databases (DynamoDB, separate RDS instances) later, once the service boundaries are stable.
- Authentication: Extract auth into a shared service early (Cognito, Auth0, or a custom auth service). Every new microservice needs auth, and having a consistent auth layer prevents each service from reimplementing it differently.
- Shared data: Use an event bus (EventBridge) from the beginning. When the monolith writes an order, it publishes an OrderCreated event. New services subscribe to events rather than querying the monolith’s database. This decouples the extraction pace from the data migration pace.
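As a sketch of what the monolith would publish, the helper below builds one entry in the shape the EventBridge `PutEvents` API expects. The bus name, the `monolith.orders` source, and the payload fields are hypothetical; the code only constructs the entry and does not call AWS.

```python
import json


def order_created_entry(order_id, total_cents, bus_name="orders-bus"):
    """Build a PutEvents entry announcing that the monolith created an order."""
    return {
        "EventBusName": bus_name,     # hypothetical custom bus
        "Source": "monolith.orders",  # hypothetical source identifier
        "DetailType": "OrderCreated",
        # Detail must be a JSON string, not a nested dict.
        "Detail": json.dumps({"orderId": order_id, "totalCents": total_cents}),
    }


entry = order_created_entry("ord-123", 4999)
print(entry["DetailType"])          # OrderCreated
print(json.loads(entry["Detail"]))  # {'orderId': 'ord-123', 'totalCents': 4999}
```

Subscribers match on `Source` and `DetailType`, so new services can start consuming the event without the monolith knowing they exist.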
Follow-up: The CTO wants the migration done in 3 months. How do you push back?
Answer: I would present data, not opinions. “Here is what we can realistically deliver in 3 months, and here is what that timeline risks.”
What 3 months can deliver: Phase 1 (containerization) and Phase 2 (one component extraction). This gives us a deployable, auto-scaling container setup, a CI/CD pipeline, and proof that the extraction pattern works. We will have migrated one component end-to-end and validated the approach.
What 3 months cannot deliver safely: Full decomposition of a monolith into microservices. Rushing this leads to: services with unclear boundaries that need to be re-merged, data consistency bugs from premature database splitting, and an explosion of inter-service communication complexity that nobody has monitoring for yet.
The risk framework: “Each week we spend on the migration is a week we are not shipping features. The Strangler Fig approach lets us do both: we ship features on the monolith while extracting components in parallel. If we try to do everything in 3 months, we do neither well.”
My concrete proposal: “Give me 3 months for Phase 1 and Phase 2. After that, we will have a validated pattern, a realistic velocity measurement, and I can give you an evidence-based timeline for the rest. I would rather under-promise and over-deliver than commit to an aggressive timeline that leads to cutting corners on testing and monitoring.”
Going Deeper: How do you handle the database during migration? Do you split it early or late?
Late. Always late. Premature database splitting is the number one reason monolith-to-microservice migrations fail. Here is why: when you extract a service from the monolith but both still share the same database, you can move incrementally. The new service reads and writes to the same tables. If something goes wrong, you can revert to the monolith handling that component without any data migration. When you split the database, you introduce: data synchronization (change data capture between the old and new databases), distributed transactions (what was a single database transaction is now a cross-service saga), and migration risk (moving data between databases while both systems are live).
My approach: share the database until the service boundary is stable (the service has been running independently for 2-4 months with no boundary changes). Then split the data using DynamoDB Streams or PostgreSQL logical replication to keep the old and new databases in sync during the transition period. Once the new service is fully cut over, decommission the sync and drop the tables from the old database.
The exception: if the monolith’s database is itself the bottleneck (connection limits, query contention, scaling ceiling), then an early database split for the highest-traffic service is justified, but do it for operational reasons, not for architectural purity.
Advanced Interview Scenarios
13. Your Lambda function works fine in dev but randomly times out in production — about 5% of invocations hit the 30-second limit. CloudWatch shows the function uses only 60% of allocated memory. What is going on?
Difficulty: Senior
What the interviewer is really testing: Whether you understand Lambda’s CPU allocation model, downstream dependency failures, and production debugging beyond “just increase the timeout.” The obvious wrong answer is “increase the timeout to 60 seconds.”
What weak candidates say:
- “Increase the timeout to 60 seconds or give it more memory.”
- “It is probably cold starts causing the timeouts.”
- “Add retry logic in the function.”
What strong candidates say:
- Step 1: Enable X-Ray tracing. X-Ray shows the time breakdown per invocation. When I had this exact problem on a payment processing Lambda, X-Ray revealed that 5% of invocations spent 28 seconds waiting on an HTTP call to a third-party fraud detection API. The API had a long-tail latency problem — p50 was 200ms but p99 was 29 seconds. Without X-Ray, CloudWatch only shows total duration, which is useless for multi-step functions.
- Step 2: Check the CPU angle. Lambda allocates CPU proportional to memory. At 128MB, you get roughly 1/10th of a vCPU. If the function does CPU-intensive work (JSON parsing of large payloads, image manipulation, crypto operations), it can be CPU-starved even with memory headroom. I once debugged a Node.js Lambda that parsed a 5MB JSON payload — at 128MB memory, parsing took 8 seconds. At 1024MB memory, it took 400ms. The fix was increasing memory from 128MB to 512MB, which cut p99 latency from 12 seconds to 1.2 seconds. The extra memory was irrelevant — it was the CPU that mattered.
- Step 3: Check DNS resolution and connection establishment. Lambda in a VPC resolves DNS through the VPC’s DNS resolver, which can be slow under load. Also check if the function creates new TCP/TLS connections per invocation instead of reusing them. A TLS handshake to a downstream service adds 100-300ms, and if the downstream has connection limits, the handshake can queue for seconds.
- Step 4: Check for connection pool exhaustion on downstream services. If the Lambda talks to RDS without RDS Proxy, and 200 concurrent Lambda instances each hold a connection, Aurora’s max_connections might be exhausted. New connections queue until one frees up, causing sporadic timeouts on the functions waiting for a connection. CloudWatch metric DatabaseConnections on the RDS side confirms this.
- Step 5: Set explicit socket timeouts. The AWS SDK defaults to no socket timeout in some configurations. If a downstream service silently drops a connection (no RST, no FIN), the Lambda hangs until its own timeout. I always set connectTimeout: 3000, socketTimeout: 10000 on every HTTP client. This converts a 30-second timeout into a 10-second error with a clear message.
SO_KEEPALIVE with a 15-second interval on the HTTP client and add a 5-second read timeout. Timeout rate dropped from 5% to 0.01% overnight.
Follow-up: How would you differentiate between a CPU-bound timeout and a network-bound timeout without X-Ray?
Answer: Instrument the function with manual timestamps around each operation and log them. Even console.log(Date.now()) before and after each external call gives you a poor-man’s trace. If the gap between “before DynamoDB call” and “after DynamoDB call” is 25 seconds, it is network. If the gap between “received payload” and “finished parsing” is 25 seconds, it is CPU.
The power-user move: check the REPORT line in CloudWatch Logs. It shows Billed Duration, Max Memory Used, and Init Duration. If Max Memory Used is 90%+ of allocated memory, the garbage collector may be thrashing. If Max Memory Used is low but duration is high, it is probably a downstream wait. Also, Lambda Power Tuning (an open-source tool from Alex Casalboni) systematically tests your function at different memory settings and graphs cost vs duration — it reveals CPU-bound vs I/O-bound behavior instantly.
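The manual-timestamp approach is a few lines in any runtime. Here is a minimal Python sketch of the same idea; the step names are hypothetical, and a computation plus a `time.sleep` stand in for payload parsing and a slow downstream call.

```python
import time
from contextlib import contextmanager

timings = {}  # step name -> elapsed milliseconds


@contextmanager
def timed(step):
    """Log and record how long the wrapped block took."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[step] = (time.perf_counter() - start) * 1000
        # One structured line per step is easy to filter in CloudWatch Logs.
        print(f"step={step} elapsed_ms={timings[step]:.1f}")


with timed("parse_payload"):       # CPU-bound stand-in
    checksum = sum(i * i for i in range(100_000))

with timed("downstream_call"):     # I/O-bound stand-in
    time.sleep(0.05)
```

If the large gap shows up around the computation, the function is CPU-bound and more memory will help; if it shows up around the downstream call, more memory will not change anything.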
Follow-up: You discover it is CPU-bound. Increasing memory from 128MB to 1769MB gives you a full vCPU. But the function only needs 80MB of memory. Are you paying for waste?
Answer: This is a question where the math is non-obvious. At 128MB, if the function runs for 10 seconds (CPU-starved), you pay for 128MB * 10s = 1,280 MB-seconds. At 1769MB, it runs in 800ms, so you pay 1769MB * 0.8s = 1,415 MB-seconds. Roughly the same cost, but your latency dropped from 10 seconds to 800ms. In many cases, increasing memory actually decreases cost because the function finishes so much faster that the total MB-seconds goes down. Lambda Power Tuning visualizes this exact tradeoff; I have seen cases where 3x memory allocation reduced both latency and cost by 40%.
14. Your team chose Step Functions to orchestrate an order fulfillment workflow. Six months later, the monthly Step Functions bill is $8,000 and growing. The CTO asks why “serverless” is so expensive. Diagnose and fix.
Difficulty: Senior
What the interviewer is really testing: Whether you understand Step Functions pricing models (Standard vs Express), can identify architectural cost traps, and know when to replace orchestration with choreography. The trap: most candidates do not know that Standard Workflows charge per state transition.
What weak candidates say:
- “Step Functions is just expensive, we should rewrite it.”
- “Move everything to Lambda and SQS.”
- “The CTO should expect high costs for serverless at scale.”
What strong candidates say:
- Pull the Step Functions CloudWatch metrics. ExecutionsStarted tells me volume. StateMachinesCount and per-machine execution counts tell me if it is one workflow or many. Most importantly, look at the number of states per execution: StatesExecuted / ExecutionsStarted gives average transitions per workflow.
- Check for “chatty” workflows. I have seen teams put every micro-operation as a separate Step Functions state: validate input (1 transition), check inventory (2), reserve inventory (3), process payment (4), handle payment failure choice (5), update order (6), send email (7), update analytics (8), log completion (9). That is 9 transitions per order. But the real killer is retry states and map states. A Map state iterating over 100 line items in an order fires 100+ transitions per order. If each line item has 3 states, a 100-item order costs 300+ transitions.
- Identify the fix: Standard to Express migration for eligible workflows. Express Workflows charge per execution and duration rather than per state transition. A workload that costs about $0.83/month on Express costs $250/month on Standard with 10 transitions per execution, so Express is roughly 300x cheaper for high-volume, short-duration workflows.
- The catch with Express Workflows: They have a 5-minute maximum duration, are asynchronous by default (no waiting for the result), and do not record execution history in the console (you must log to CloudWatch). If the order workflow takes longer than 5 minutes (waiting for payment confirmation, external API calls), Express does not work and you need a hybrid approach.
- Hybrid approach: Use Express for the high-volume, fast inner loop (process each line item) and Standard for the outer orchestration (order lifecycle with human approval, long waits). The Express workflow is called as a nested workflow within the Standard outer workflow.
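To see how transition counts drive a Standard Workflows bill, the cost can be modelled in a few lines. The sketch assumes a price of $0.025 per 1,000 state transitions and a volume of 1 million executions per month; both are assumptions (the volume is chosen because it reproduces the $250 figure above), and the free tier is ignored.

```python
STANDARD_PRICE_PER_1000_TRANSITIONS = 0.025  # assumed list price, USD


def standard_monthly_cost(executions_per_month, transitions_per_execution):
    """Monthly Standard Workflows cost from execution volume and workflow size."""
    transitions = executions_per_month * transitions_per_execution
    return transitions / 1000 * STANDARD_PRICE_PER_1000_TRANSITIONS


simple = standard_monthly_cost(1_000_000, 10)    # a 10-state workflow
chatty = standard_monthly_cost(1_000_000, 300)   # Map over 100 items x 3 states
print(f"${simple:,.2f}")  # $250.00
print(f"${chatty:,.2f}")  # $7,500.00
```

The same order volume costs 30 times more once a Map state multiplies the transitions, which is why average transitions per execution is the first number to pull.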
Follow-up: When would you replace Step Functions entirely with event choreography (SQS + EventBridge)?
Answer: When the workflow is a straight pipeline with no branching, no parallel states, no retries beyond what SQS provides, and no need for visual execution history. “Order placed -> process payment -> update inventory -> send email” is a pipeline. Each step publishes an event, the next step subscribes. SQS between each step gives you retry and DLQ for free. The cost is effectively zero beyond the SQS API calls ($0.40 per million). But the moment you need conditional branching (“if payment fails, try backup processor, then notify customer, then schedule retry in 24 hours”), error handling across steps, or parallel execution with aggregation (“transcode to 3 formats and wait for all to finish”), choreography becomes spaghetti. That is when Step Functions earns its keep: the state machine definition is the documentation, and the console visualization is the debugging tool. My heuristic: fewer than 4 steps with no branching = choreography. More than 4 steps or any branching/parallel = Step Functions (Express if under 5 minutes, Standard if over).
Follow-up: How do you test Step Functions workflows locally?
Answer: Step Functions Local is the official answer, a Docker container that emulates the Step Functions service. It runs your state machine definition and can invoke local Lambda functions via SAM CLI or mock the service integrations. It is adequate for happy-path testing. The honest answer: Step Functions Local is painful for complex workflows. The mock integrations are limited, and real service integrations (DynamoDB, SQS) do not work locally. What actually works in practice is: (1) unit test each Lambda function independently, (2) integration test the full workflow in a dev AWS account using the real service (deploy via SAM/CDK on every PR, run the workflow with test data, assert on the execution output, and tear down), and (3) use Step Functions’ built-in test state feature (added in 2023) to test individual states against real AWS services without running the full workflow. The feedback loop is slower than local testing, but the confidence is dramatically higher.
15. An S3 bucket contains 50TB of customer data. A security audit reveals the bucket was publicly accessible for 72 hours due to a misconfigured bucket policy. The data was not encrypted at rest. Walk me through your incident response AND how you prevent recurrence.
Difficulty: Staff-Level
What the interviewer is really testing: Security incident response maturity, S3 security model depth, and whether you think about prevention systemically (guardrails, not just fixes). This is a question where panic and “just make it private” is the weak answer.
What weak candidates say:
- “Remove the public policy immediately and enable encryption.”
- “Check CloudTrail to see who changed the policy.”
- “Turn on S3 Block Public Access.”
- Enable S3 Block Public Access at the account level, not just the bucket level. This overrides any bucket policy, ACL, or access point policy that grants public access. It is a single API call: aws s3control put-public-access-block --account-id 123456789012 --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true. This immediately locks down every bucket in the account.
- Enable default encryption on the bucket with SSE-S3 or SSE-KMS. Existing objects are not retroactively encrypted; you must re-upload or use S3 Batch Operations to copy objects in place with encryption. For 50TB, S3 Batch Operations takes hours but runs server-side.
- Restrict bucket access to only specific IAM roles via an explicit deny bucket policy.
- Pull S3 Server Access Logs for the bucket during the 72-hour window. These logs show every GET, PUT, LIST request including source IP, requester, and user agent. If access logging was not enabled (common oversight), check CloudTrail S3 data events — but these are only available if data event logging was enabled. If neither was enabled, you have a visibility gap.
- Analyze access logs for anomalous IPs, bulk downloads, or programmatic access patterns (many GET requests in rapid succession from non-internal IPs). Tools: Athena query over the access logs in S3, or import into a SIEM (Splunk, Elastic).
- Check AWS CloudTrail for the PutBucketPolicy or PutBucketAcl event that opened the bucket. This tells you who changed the policy, from which IP, and at what time. Determine if it was intentional (developer testing who forgot to revert) or compromised credentials.
- If the data contains PII (names, emails, financial records), this likely triggers GDPR Article 33 (72-hour notification to the supervisory authority), CCPA breach notification, HIPAA breach notification, or industry-specific requirements. Notify the legal team immediately. The notification clock starts when you discover the breach, not when it occurred.
- Determine the scope: which customers’ data was in the bucket, what data fields were exposed, was any data actually downloaded by unauthorized parties. The access logs from Track 2 determine this.
- SCP: Deny public S3 access at the organization level. Apply an SCP that denies s3:PutBucketPolicy and s3:PutBucketAcl when the policy grants public access. This makes it structurally impossible for any account in the organization to make a bucket public, regardless of individual IAM permissions.
- AWS Config rule: s3-bucket-public-read-prohibited. This continuously monitors all buckets and alerts (or auto-remediates) if any bucket becomes public. Set it to auto-remediate by invoking a Lambda that re-applies the Block Public Access configuration.
- Macie for sensitive data discovery. Enable Amazon Macie on all S3 buckets to automatically discover and classify sensitive data (PII, financial data, credentials). Macie would have flagged this bucket as containing sensitive data that was publicly accessible.
- S3 Block Public Access as the default. Enable Block Public Access at the organization level via SCP so every new account inherits the restriction. The only way to make a bucket public is to first remove the organization-level block — which requires Security OU approval.
- Mandatory encryption via SCP. Deny s3:PutObject when s3:x-amz-server-side-encryption is not set. This forces encryption on all new objects organization-wide.
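The mandatory-encryption guardrail can be sketched as a policy document built in code. This is a minimal sketch, assuming the common pattern of a Deny statement with a Null condition on the s3:x-amz-server-side-encryption key; the statement ID and function name are hypothetical, and the exact behavior alongside default bucket encryption should be verified in a test account, because uploads that omit the header would be rejected by this statement.

```python
import json


def deny_unencrypted_put_policy() -> dict:
    """Build a policy document that denies s3:PutObject when the request
    carries no x-amz-server-side-encryption header."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyUnencryptedObjectUploads",  # illustrative name
                "Effect": "Deny",
                "Action": "s3:PutObject",
                "Resource": "*",
                "Condition": {
                    # Null = "true" matches requests where the key is absent.
                    "Null": {"s3:x-amz-server-side-encryption": "true"}
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(deny_unencrypted_put_policy(), indent=2))
```

Generating the document from code rather than pasting JSON makes it easy to keep one reviewed definition and attach it wherever the organization applies guardrails.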
Follow-up: The developer says they needed the bucket to be public for a legitimate use case (serving static assets for a marketing site). How do you accommodate that safely?
Answer: You do not make a data bucket public. You create a separate, purpose-built bucket with a strict lifecycle:
- Dedicated bucket with a name that clearly indicates it is public (e.g., company-public-assets-prod).
- Block Public Access is disabled on only this bucket, with a documented exception approved by the security team.
- The bucket contains only static assets (HTML, CSS, JS, images). No customer data, no PII, no application data.
- A bucket policy with a Condition restricting uploads to only the CI/CD pipeline’s IAM role, so no developer can upload directly.
- CloudFront in front of the bucket with OAC (Origin Access Control), so the bucket is not directly accessible and only CloudFront can read from it.
- AWS Config rule monitors this specific bucket and alerts if any object matching sensitive data patterns (SSN regex, email regex, etc.) is uploaded.
16. You inherit a system with 40 Fargate tasks running 24/7 in production. The monthly AWS bill is $12,000. The CTO asks you to cut it by 50%. How?
Difficulty: Senior
What the interviewer is really testing: Real FinOps chops: not just “use Savings Plans” but a systematic approach to right-sizing, architectural optimization, and knowing which cost levers actually move the needle. The trap: most candidates jump to Reserved Instances without first checking if the resources are right-sized.
What weak candidates say:
- “Buy Reserved Instances or Savings Plans.”
- “Move to Lambda.”
- “Use Spot instances.”
- Pull CloudWatch metrics for every task: CPU utilization average and p99, memory utilization average and peak. In my experience, 60-70% of Fargate tasks are over-provisioned. I have seen teams running 2 vCPU / 4GB tasks at 8% average CPU and 15% memory utilization because they copied the task definition from a template and never revisited it.
- AWS Compute Optimizer provides right-sizing recommendations for ECS tasks. It analyzes 14 days of CloudWatch metrics and suggests optimal CPU/memory configurations. It is free and takes 5 minutes to check.
- For a task running at 8% CPU / 15% memory on 2 vCPU / 4GB, dropping to 0.5 vCPU / 1GB cuts cost by 75% for that task. Across 40 tasks, this alone could save $4,000-6,000/month.
- If those 40 tasks include staging, dev, and QA environments running 24/7, schedule non-production environments to run only during business hours (8am-8pm weekdays = 60 hours/week vs 168 hours/week). Use ECS Scheduled Scaling or a Lambda that scales desired count to 0 at night and back up in the morning. This alone cuts non-prod costs by 64%.
- Better yet, use Fargate Spot for non-production. 70% discount with the only risk being occasional task termination — which is acceptable in dev/staging.
- Switch task definitions from x86 (X86_64) to ARM (ARM64) and use Graviton-based Fargate tasks. This is a single line change in the task definition if you build multi-arch Docker images. Graviton Fargate is 20% cheaper than x86 Fargate for the same CPU/memory, with equal or better performance. For interpreted languages (Python, Node.js, Ruby), no code changes are needed. For compiled languages (Go, Rust, Java), you rebuild for linux/arm64.
- Only after right-sizing and scheduling, commit to a Compute Savings Plan. A 1-year no-upfront Compute Savings Plan saves ~20%. A 3-year partial-upfront saves ~35%. Compute Savings Plans apply across Fargate, Lambda, and EC2, so they flex as your architecture evolves. I would commit to covering only 70% of the right-sized baseline — leave 30% on-demand for flexibility.
- Are any tasks doing batch processing that could be moved to Lambda (no idle cost between batches)?
- Are any tasks running singleton workers (one task processing a queue) that could be Lambda SQS consumers?
- Are there sidecar containers (log forwarders, metrics agents) that could be replaced with Firelens (built into ECS) or CloudWatch agent (no sidecar needed)?
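The savings percentages quoted above follow from simple ratios, and they compound multiplicatively rather than adding up. The sketch below works through them; the per-vCPU and per-GB hourly rates are assumed placeholder prices (the right-sizing ratio does not depend on them as long as CPU and memory shrink by the same factor), and all function names are hypothetical.

```python
# Assumed on-demand Fargate rates (Linux/x86). Treat these as placeholders
# and substitute current prices for your region.
VCPU_HOUR = 0.04048
GB_HOUR = 0.004445
HOURS_PER_WEEK = 168


def task_hourly_cost(vcpu: float, memory_gb: float) -> float:
    """Hourly cost of one task at the assumed rates."""
    return vcpu * VCPU_HOUR + memory_gb * GB_HOUR


def rightsizing_saving(old: tuple, new: tuple) -> float:
    """Fractional saving from moving a task from old to new (vcpu, gb)."""
    return 1 - task_hourly_cost(*new) / task_hourly_cost(*old)


def schedule_saving(hours_on_per_week: float) -> float:
    """Fractional saving from running only part of the week."""
    return 1 - hours_on_per_week / HOURS_PER_WEEK


def stacked_saving(*savings: float) -> float:
    """Combine independent savings: each one applies to what is left."""
    remaining = 1.0
    for s in savings:
        remaining *= 1 - s
    return 1 - remaining


if __name__ == "__main__":
    print(round(rightsizing_saving((2, 4), (0.5, 1)), 2))   # right-sizing
    print(round(schedule_saving(60), 2))                    # 8am-8pm weekdays
    print(round(stacked_saving(0.75, 0.20), 2))             # plus Graviton
```

Stacking matters when setting expectations: a 75% right-sizing cut followed by a 20% Graviton discount yields 80% in total on that task, not 95%.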
Follow-up: How do you prevent the costs from creeping back up over the next 6 months?
Answer: Cost creep is the default state. Without active governance, spend returns to pre-optimization levels within 6-12 months as teams scale up for new features and never scale back down.
- Weekly cost review. A 15-minute team meeting every Monday that reviews the Cost Explorer dashboard. Not a detailed analysis, just “are we trending up or down vs last week?” Anomalies get investigated immediately.
- AWS Budgets per service and per team. Set a monthly budget that is 10% above the optimized baseline. Alert at 80%, 90%, and 100%. At 100%, require a justification ticket for the increase.
- Right-sizing automation. A monthly Lambda function that runs Compute Optimizer recommendations and posts them to Slack. If a task has been over-provisioned for 30+ days, it opens a Jira ticket automatically.
- Tagging enforcement. Require cost-center and team tags on all ECS services via AWS Config rules. Untagged resources cannot be attributed and therefore cannot be optimized.
- Governance in the deployment pipeline. The CDK/Terraform pipeline runs a cost estimation tool (Infracost) on every PR. If the estimated monthly cost increase exceeds a threshold ($100/month), the PR requires a finance-team approval label.
17. A teammate says: “We should use DynamoDB for everything — it scales infinitely and costs less than RDS.” Where are they right, and where is this dangerously wrong?
Difficulty: Intermediate / Senior
What the interviewer is really testing: Whether you can challenge popular narratives with nuance. DynamoDB is excellent, but “use it for everything” is one of the most common mistakes in AWS-heavy shops. This question separates engineers who have hit DynamoDB’s walls from those who have only read the marketing page.
What weak candidates say:
- “They are right: DynamoDB handles any scale and is fully managed.”
- “They are wrong — relational databases are always better for complex queries.”
- Scaling. DynamoDB on-demand mode genuinely scales to virtually unlimited throughput. I have seen tables handling 400,000 reads/second with single-digit millisecond latency. You do not manage shards, replicas, or connection pools. It just works.
- Operational simplicity. No patching, no vacuuming, no connection pool tuning, no replica lag monitoring. For a small team, this operational savings is massive — it is easily worth $20K/year in engineering time you do not spend.
- Performance at scale. For key-value and simple query patterns, DynamoDB’s latency is consistently 1-5ms regardless of table size. An RDS PostgreSQL query that returns in 2ms at 1GB can return in 200ms at 1TB without careful indexing and tuning.
- “Costs less than RDS” is often false. DynamoDB on-demand pricing is $1.25 per million write request units and $0.25 per million read request units. A table with 10,000 writes/second sustained costs about $32,400/month (10,000 × 86,400 × 30 ≈ 25.9 billion writes at $1.25 per million). An Aurora db.r6g.xlarge instance handling the same write throughput costs roughly $800/month. DynamoDB is cheaper for spiky, low-baseline workloads. It is dramatically more expensive for sustained high-throughput workloads unless you use provisioned capacity with reserved pricing, which erodes the “no capacity planning” benefit.
- “For everything” ignores the access pattern constraint. DynamoDB requires you to know your access patterns at design time. You design the partition key, sort key, and GSIs around specific queries. If the product team adds a new query pattern 6 months later (e.g., “find all orders by product ID across all customers”), you either already have a GSI for it or you add one (which consumes additional write capacity on every write to the table). In PostgreSQL, you add an index. In DynamoDB, you redesign your data model. I have been on a team where a “simple” new reporting requirement forced a complete DynamoDB table redesign and data migration because the existing single-table design could not support the new access pattern.
- Analytical queries are a non-starter. “Show me total revenue by region for the last 90 days” is a full table scan in DynamoDB. You export to S3 and query with Athena. In PostgreSQL, it is a 200ms query with a covering index. If your application needs even basic ad-hoc querying or reporting, DynamoDB alone is insufficient.
- Transactions are limited. DynamoDB transactions span up to 100 items in a single request and can touch multiple tables, but only within the same account and region. There is no equivalent to a multi-statement PostgreSQL transaction with savepoints, rollbacks, and isolation levels. For workflows like “transfer $100 from account A to account B while recording an audit log entry” where all three writes must succeed atomically, DynamoDB transactions work. For “insert an order, create 50 line items, update inventory for each product, and record 50 ledger entries atomically,” you exceed the 100-item limit (1 + 50 + 50 + 50 = 151 items).
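The on-demand cost claim is easy to check with a few lines of arithmetic. This is a rough sketch under stated assumptions: the write price of $1.25 per million write request units is assumed because it reproduces the 32,400/month figure quoted above, every write is assumed to fit in a single write request unit, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400
# Assumed on-demand price per million write request units; check current
# pricing before relying on this number.
WRITE_PRICE_PER_MILLION = 1.25


def on_demand_write_cost(writes_per_second: int,
                         active_seconds_per_day: int = SECONDS_PER_DAY,
                         days: int = 30,
                         price_per_million: float = WRITE_PRICE_PER_MILLION) -> float:
    """Monthly on-demand write cost in dollars, assuming one write request
    unit per write."""
    total_writes = writes_per_second * active_seconds_per_day * days
    return total_writes / 1_000_000 * price_per_million


if __name__ == "__main__":
    # Sustained load: 10,000 writes/second around the clock.
    print(on_demand_write_cost(10_000))
    # Spiky load: the same peak rate, but only one hour per day.
    print(on_demand_write_cost(10_000, active_seconds_per_day=3_600))
```

The second call illustrates the "spiky, low-baseline" case: the same peak throughput for one hour a day costs a small fraction of the sustained figure, which is where on-demand pricing tends to win.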
Follow-up: When would you use DynamoDB single-table design versus multi-table design?
Answer: Single-table design, putting multiple entity types (users, orders, order items) in one table with carefully designed PK/SK structures, is powerful but has a steep learning curve and real trade-offs.
Use single-table when: You have 3-6 well-known access patterns that will not change frequently, you need to fetch related entities in a single query (get user and their recent orders in one request), and your team has DynamoDB expertise. The performance benefit is real: one round trip instead of multiple table queries.
Use multi-table when: Your team is new to DynamoDB (the cognitive overhead of single-table is high and mistakes are expensive), access patterns are still evolving (easier to add GSIs to isolated tables), or you have more than 8-10 access patterns (single-table GSIs become unwieldy). Multi-table is also easier to monitor: per-table CloudWatch metrics give you clear visibility into which entity type is causing hot partitions or throttling.
The honest take from production: Single-table design is an optimization. Like all optimizations, apply it when you have data showing you need it, not upfront because a conference talk convinced you. Most DynamoDB applications work perfectly fine with multi-table design and never need the complexity of single-table.
18. It is 2 AM. PagerDuty wakes you up. Your ECS Fargate service is returning 503 errors. CPU and memory look fine. Tasks are running. What is happening and how do you triage?
Difficulty: Senior
What the interviewer is really testing: Real incident response skills under pressure, knowledge of the ECS/ALB/networking stack, and the ability to systematically narrow down a root cause when the obvious metrics look normal. The trap: candidates fixate on the application and ignore the infrastructure.
What weak candidates say:
- “Check the application logs for errors.”
- “Restart the tasks.”
- “Scale up to more tasks.”
- Check the ALB target group health. If targets show unhealthy, the ALB is pulling tasks out of rotation. The health check might be failing even though CPU/memory are fine: the application could be returning 500 on the health check endpoint due to a downstream dependency failure (database unreachable, config service down).
- Check HealthyHostCount and UnHealthyHostCount CloudWatch metrics on the target group. If healthy count dropped to zero, the ALB has no targets to route to and returns 503 by default.
- HTTPCode_ELB_503_Count metric confirms the ALB is generating the 503s (as opposed to the application returning 503 via the ALB). A 503 from the ALB means: no healthy targets, all targets are deregistering, or the target group has no registered targets.
- RequestCount on the target group: are requests even reaching the targets? If request count is zero but the ALB is receiving traffic, the issue is target registration or health checks.
- Check if a deployment just happened. ECS rolling deployments deregister old tasks and register new ones. If the new task definition has a bug (crashes on startup, fails health check), ECS keeps trying to start new tasks while the old ones are draining. During the crossover, healthy target count can drop to zero.
- aws ecs describe-services: check runningCount vs desiredCount. If running is less than desired, ECS is trying to start tasks but they are failing. Check events on the service for messages like “task failed to start” or “ECS was unable to place a task.”
- Check stopped tasks: aws ecs list-tasks --desired-status STOPPED. Look at stoppedReason; common culprits: “CannotPullContainerError” (ECR access issue, usually a VPC endpoint or NAT Gateway problem), “ResourceNotFoundException” (Secrets Manager or Parameter Store secret was deleted), “OutOfMemoryError” (container killed by OOM despite CloudWatch showing memory below limit; this happens when the container’s hard memory limit is hit but the task-level memory is not).
- Check the ECS task definition: did someone push a bad container image to ECR? Did the entrypoint script change? A new image that crashes on startup will pass docker pull but fail immediately, and ECS will cycle through start-crash-restart.
- Security group on the tasks — did someone modify it? If the ALB’s security group cannot reach the task’s security group on the container port, health checks fail and the ALB returns 503.
- NACLs on the subnets — less common but devastating. A NACL change that blocks ephemeral ports (1024-65535) breaks ALB-to-task communication.
- NAT Gateway — if the tasks need to pull config from Secrets Manager or Parameter Store at startup, and the NAT Gateway is down or overloaded, tasks start but hang during initialization and fail health checks.
/health) made a synchronous call to Redis (ElastiCache) to verify cache connectivity. A maintenance window on the ElastiCache cluster (automatic patching, which we had not scheduled for off-hours) caused a 90-second Redis failover. During failover, every /health call timed out, the ALB marked all 24 targets unhealthy, and started returning 503. The fix had two parts: (1) moved ElastiCache maintenance window to Sunday 4 AM with explicit scheduling, and (2) changed the health check to not depend on Redis — the health check should verify the application can respond to HTTP requests, not that every dependency is healthy. For dependency health, we added a separate /ready endpoint used for operational monitoring but not ALB routing. The principle: liveness checks (is the process alive?) should be simple and dependency-free. Readiness checks (can it serve traffic?) can check dependencies but should not be used for ALB health checks unless you want a single dependency failure to take down all traffic.
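The liveness/readiness split described above can be sketched as two small functions. This is a minimal, framework-free illustration: the function names, the shape of the dependency checks, and the response strings are all hypothetical, and wiring them to /health and /ready routes is left to whatever web framework the service uses.

```python
from typing import Callable, Dict, Tuple


def liveness() -> Tuple[int, str]:
    """Target for the ALB health check. It makes no dependency calls, so a
    Redis failover cannot cause the ALB to mark the task unhealthy."""
    return 200, "ok"


def readiness(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Operational endpoint that reports which dependencies are failing.
    Each check returns True when its dependency is reachable."""
    failed = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False  # a check that raises or times out counts as down
        if not healthy:
            failed.append(name)
    if failed:
        return 503, "degraded: " + ", ".join(sorted(failed))
    return 200, "ready"
```

With this split, a dependency outage shows up on dashboards through the readiness endpoint while the load balancer keeps routing to tasks that can still answer HTTP requests.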
Follow-up: How do you distinguish between a deployment-caused outage and an infrastructure-caused outage in under 2 minutes?
Answer: One command: aws ecs describe-services --services my-service --cluster my-cluster. The events field shows the 100 most recent events. If the most recent events say “service my-service has started 4 tasks” and “service my-service has stopped 4 tasks: task was stopped because the underlying container exited,” you have a deployment issue. If the events show steady state (“service has reached a steady state”) but the ALB is returning 503, it is infrastructure: security groups, NACLs, NAT Gateway, or the health check endpoint.
The second check: aws ecs describe-services also shows deployments. If there are two deployments listed (PRIMARY with desiredCount and ACTIVE with desiredCount), a rolling update is in progress. If the PRIMARY deployment has runningCount: 0 and pendingCount: N, the new tasks are failing to start. Roll back immediately by pointing the service at the last known-good task definition revision: aws ecs update-service --service my-service --cluster my-cluster --task-definition <previous-task-definition-revision>. Note that --force-new-deployment on its own only restarts the current (presumably broken) task definition, so it is not a rollback. Better, use the ECS deployment circuit breaker, which automatically rolls back if the new deployment fails health checks.
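The two checks above can be combined into a small triage helper. This is a sketch under assumptions: it takes a dict shaped like one service entry from the describe-services response (with deployments and events keys), it assumes the events list is ordered newest first, and the function name and return labels are hypothetical. It makes no AWS calls, so it can be exercised against saved output.

```python
def classify_outage(service: dict) -> str:
    """Classify a 503 incident as 'deployment', 'infrastructure', or
    'inconclusive' from one describe-services service entry."""
    deployments = service.get("deployments", [])
    primary = next(
        (d for d in deployments if d.get("status") == "PRIMARY"), None
    )

    # A PRIMARY deployment with nothing running and tasks stuck pending
    # means the new task definition is failing to start.
    if primary is not None:
        if primary.get("runningCount", 0) == 0 and primary.get("pendingCount", 0) > 0:
            return "deployment"

    # Steady state in the latest event while the ALB still serves 503s
    # points away from the deployment and toward networking or health checks.
    events = service.get("events", [])
    latest = events[0].get("message", "") if events else ""
    if "steady state" in latest:
        return "infrastructure"
    return "inconclusive"
```

A result of "inconclusive" is deliberate: it tells the on-call engineer to fall back to the stopped-task and target-group checks rather than trusting a guess.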
19. You are designing an event-driven system with EventBridge. A colleague says you do not need to worry about schema evolution because “events are just JSON.” Why is this a ticking time bomb, and how do you defuse it?
Difficulty: Senior / Staff-Level
What the interviewer is really testing: Whether you have built event-driven systems that survived past version 1.0. Schema evolution is the silent killer of event-driven architectures: it never matters on day one, and it always matters on day 180.
What weak candidates say:
- “JSON is flexible, consumers just ignore fields they do not know.”
- “We will version our events when we need to.”
- “EventBridge schema registry handles this automatically.”
- The phantom field break. Producer adds a field order.discount_percentage as a float. Consumer A expects it. Consumer B (written 6 months ago) does not know about it and is fine. Consumer C (legacy Python service) receives the event and passes it to a function that chokes because it deserializes the entire payload and feeds it to a pandas DataFrame that cannot handle the new column. Nothing in the event contract told Consumer C this would happen.
- The type change time bomb. Producer changes order.amount from integer cents (1599) to a string with currency ("$15.99"). Every consumer that parses amount as an integer breaks silently: no exception, just wrong calculations. I have seen this exact scenario cause a $47,000 billing discrepancy over 3 days before anyone noticed.
- The required-to-optional change. Producer decides customer.phone is optional and starts sending events without it. Consumer D (the SMS notification service) dereferences a null pointer and crashes. Its SQS consumer dies, messages pile up in the DLQ, and nobody notices until 10,000 customers complain they did not get order confirmations.
- Event schema registry with enforcement. EventBridge Schema Registry discovers schemas automatically from live events. But discovery is not enough — you need enforcement. Define schemas explicitly (JSON Schema or OpenAPI), version them, and validate events at the producer before publishing. The producer calls a schema validation library before putting the event on the bus. If validation fails, the event is rejected, not published. This catches breaking changes at the source.
- Schema versioning with semantic conventions. Every event type gets a version in its detail-type: OrderPlaced.v2. Consumers subscribe to the version they support. When the producer needs a breaking change, it publishes OrderPlaced.v3 alongside v2. Consumers migrate at their own pace. The producer deprecates v2 only after all consumers have migrated (tracked via EventBridge metrics on rule invocations: if the v2 rule has zero invocations for 30 days, it is safe to remove).
- Backward-compatible changes only by default. Adding new optional fields is always safe. Removing fields, renaming fields, or changing field types is a breaking change that requires a version bump. This is the same contract discipline as API versioning, and it needs to be enforced in code review, not left to developer goodwill.
- Consumer contract testing. Each consumer publishes a “contract test” that defines the minimum fields and types it requires from each event type. The CI pipeline for the producer runs all downstream consumer contract tests before deploying. If the producer’s schema change breaks any consumer contract, the pipeline fails. This is the Pact testing model applied to events.
- Dead letter queue per consumer. Every EventBridge rule that invokes Lambda or SQS should have a DLQ. When a consumer fails to process an event (possibly due to a schema change), the event goes to the DLQ rather than disappearing. Monitor DLQ depth — a sudden spike after a producer deployment is a strong signal that a schema change broke something.
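Producer-side validation with a versioned detail-type can be sketched as follows. This is a minimal illustration, not a real schema library: the hand-rolled validate function stands in for a JSON Schema validator, the ORDER_PLACED_V2 contract and its field names are hypothetical, and the source name is made up. The returned dict uses the Source, DetailType, and Detail keys that an EventBridge PutEvents entry expects, but nothing here calls AWS.

```python
import json

# Hypothetical v2 contract for the OrderPlaced event: field name -> type.
ORDER_PLACED_V2 = {
    "order_id": str,
    "amount": int,           # integer cents, e.g. 1599
    "customer_phone": str,
}


def validate(detail: dict, schema: dict) -> list:
    """Return a list of contract violations; an empty list means valid.
    Unknown extra fields are allowed, since additive changes are safe."""
    errors = []
    for field, expected in schema.items():
        if field not in detail:
            errors.append(f"missing field: {field}")
        elif type(detail[field]) is not expected:
            actual = type(detail[field]).__name__
            errors.append(
                f"wrong type for {field}: expected {expected.__name__}, got {actual}"
            )
    return errors


def build_entry(detail: dict) -> dict:
    """Validate first, then build a PutEvents-style entry. Invalid events
    are rejected at the producer instead of reaching the bus."""
    errors = validate(detail, ORDER_PLACED_V2)
    if errors:
        raise ValueError("; ".join(errors))
    return {
        "Source": "orders.service",        # hypothetical source name
        "DetailType": "OrderPlaced.v2",    # version lives in the detail-type
        "Detail": json.dumps(detail),
    }
```

The same validate function doubles as a consumer contract test: each consumer declares the minimum fields it needs, and the producer pipeline runs every declared contract against a sample event before deploying.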
shipment.estimated_delivery to shipment.eta in their producer because “eta is shorter.” No version bump, no consumer notification, no contract tests. Five consumers broke. The warehouse management system stopped scheduling dock workers because it could not read the delivery estimate. The customer notification service stopped sending “your package is arriving tomorrow” emails. It took 4 hours to diagnose because the error was not a crash — the consumers just returned null for the missing field and continued with degraded logic. We caught it when a customer service team noticed customers were not getting notifications. After that incident, we implemented contract testing and a mandatory event schema review process for any PR that changes an event payload.
Follow-up: How do you handle the EventBridge schema registry’s limitations in practice?
Answer: The EventBridge schema registry discovers schemas from events passing through the bus, which is useful for documentation but has real limitations:
- It discovers schemas reactively (an event must flow through the bus before the schema appears). It does not enforce schemas; a producer can publish any JSON.
- Schema versions in the registry are discovered versions, not semantic versions. Adding a new field creates a “new version” even if it is backward compatible.
- There is no built-in mechanism to reject events that do not match a schema.
quicktype or json-schema-to-typescript) that producers and consumers import. The combination gives you discoverability (registry), enforcement (validation Lambda), and developer experience (type-safe code).
20. Your company runs a SaaS product with 200 tenants on a shared infrastructure. One tenant starts sending 50x their normal traffic. Within minutes, all other tenants experience degraded performance. Diagnose the failure and design the fix.
Difficulty: Staff-Level
What the interviewer is really testing: Multi-tenant isolation thinking, noisy-neighbor problem understanding, and the ability to design tenant-aware architectures on AWS. This is a cross-cutting question that spans Lambda concurrency, DynamoDB throughput, API Gateway throttling, and SQS queue design.
What weak candidates say:
- “Rate limit the noisy tenant.”
- “Give each tenant their own infrastructure.”
- “Scale up to handle the increased load.”
- API Gateway: If all tenants hit the same API Gateway, the noisy tenant’s 50x spike may be consuming the account-level API Gateway throttle limit (10,000 requests/second default). All tenants share this limit. When exceeded, every tenant gets 429 errors.
- Lambda: If backend Lambda functions have no reserved concurrency, the noisy tenant’s requests consume the account’s 1,000 concurrent execution limit. Other tenants’ requests are throttled with 429.
- DynamoDB: If tenants share a DynamoDB table, the noisy tenant’s requests may hot-partition the table (especially if tenant ID is the partition key and one tenant dominates throughput). Adaptive capacity helps but cannot fully compensate for a 50x spike.
- SQS: If tenants share an SQS queue, the noisy tenant’s messages flood the queue. Consumer Lambda processes messages in order of arrival, starving other tenants’ messages.
- API Gateway per-tenant throttling. Use API Gateway usage plans with API keys. Each tenant gets an API key with a throttle limit (e.g., 100 requests/second) and a burst limit. When Tenant X exceeds their limit, only they get 429 errors. Other tenants are unaffected. This is the first line of defense and the cheapest to implement.
- Lambda concurrency isolation. Deploy a separate Lambda function per tenant tier (free, standard, enterprise); reserved concurrency is configured per function, not per alias. Set reserved concurrency so that free tier functions get 50 concurrent executions, standard gets 200, and enterprise gets 500. The noisy tenant’s tier exhausts its own reserved pool and is throttled without impacting other tiers. For critical-path functions, I would also consider per-tenant Lambda functions for the largest tenants (top 5-10 by revenue), each with their own reserved concurrency.
- DynamoDB tenant-aware design. Partition key should include tenant ID (e.g., PK = TENANT#123#ORDER#456). This naturally distributes each tenant’s data across different partitions. But if a single tenant’s throughput exceeds a partition’s capacity, add write sharding within the tenant key. Also: use DynamoDB on-demand mode so individual tenant spikes are absorbed without provisioned capacity limits, and set per-tenant item-level monitoring via Contributor Insights.
- SQS queue-per-tenant or priority queuing. For the noisy-neighbor SQS problem, the cleanest solution is a queue-per-tenant for the top 20 tenants (by volume) and a shared queue for the remaining 180. Each per-tenant queue has its own Lambda consumer with independent reserved concurrency. The shared queue has a Lambda consumer that processes small tenants. If a top-20 tenant spikes, only their queue backs up. Alternative: a single queue with tenant-aware consumer logic that implements fair scheduling (round-robin across tenant IDs in the batch).
- Rate limiting at the application layer. Beyond API Gateway throttling, implement a token bucket rate limiter in the application code (backed by Redis INCR with TTL). Each tenant has a bucket with a per-second and per-minute limit based on their plan tier. This catches bursts that exceed the API Gateway throttle (which operates at the API key level, not the business logic level).
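A per-tenant token bucket can be sketched in memory as follows. This is an illustration of the algorithm only, under stated assumptions: state lives in a local dict rather than Redis, so it is not shared across instances, and the class names and injectable clock are hypothetical conveniences for testing. Note that a plain Redis INCR with a TTL gives a fixed-window counter, which is simpler but allows bursts at window boundaries; a shared token bucket would need the refill logic below executed atomically in the store.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start full so an initial burst is allowed
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Credit tokens earned since the last call, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """One bucket per tenant, so one tenant's burst cannot drain another's."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str) -> bool:
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[tenant_id] = bucket
        return bucket.allow()
```

In a tiered plan, the limiter would be constructed with different rate and capacity values per tier; the per-tenant isolation is what keeps a 50x spike from one tenant contained.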
Follow-up: How do you detect the noisy tenant automatically before other tenants are impacted?
Answer: The key is per-tenant observability. Most teams only have aggregate metrics, which tell you the system is degraded but not which tenant is causing it.
- Per-tenant request counting in real-time. A Lambda@Edge function (or API Gateway access logging to Kinesis) that increments a Redis sorted set keyed by tenant ID with TTL-based windows. A background Lambda samples the sorted set every 30 seconds and publishes a CloudWatch custom metric RequestsPerTenant. A CloudWatch anomaly detection alarm triggers when any tenant exceeds 3 standard deviations from their 7-day baseline.
- API Gateway access logs to Athena. API Gateway access logs include the API key (tenant) and can be queried in near real-time via Kinesis Firehose to S3 and Athena. A scheduled query every 5 minutes identifies tenants exceeding their normal traffic by 5x+.
- Automatic throttling. When the anomaly detection alarm fires, a Lambda function automatically lowers the offending tenant’s API Gateway usage plan throttle to their normal baseline + 20% buffer. This contains the blast radius within minutes, without human intervention. A Slack notification tells the on-call engineer what happened so they can investigate whether the traffic is legitimate or malicious.
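The detection and containment rules above reduce to a small amount of arithmetic. The sketch below is a simplified stand-in and not how CloudWatch anomaly detection works internally: it flags a tenant whose current request rate exceeds the baseline mean by more than three sample standard deviations, and computes the "baseline + 20%" throttle. The function names and the shape of the input dictionaries are assumptions made for illustration.

```python
from statistics import mean, stdev
from typing import Dict, List


def noisy_tenants(
    baseline: Dict[str, List[float]],
    current: Dict[str, float],
    sigmas: float = 3.0,
) -> List[str]:
    """Return tenants whose current rate exceeds mean + sigmas * stdev.

    `baseline` maps tenant ID to historical request rates (for example one
    sample per day over 7 days); `current` maps tenant ID to the latest rate.
    """
    flagged = []
    for tenant, rate in current.items():
        history = baseline.get(tenant, [])
        if len(history) < 2:
            continue  # stdev needs at least two samples; skip new tenants
        threshold = mean(history) + sigmas * stdev(history)
        if rate > threshold:
            flagged.append(tenant)
    return flagged


def contained_throttle(history: List[float], buffer: float = 0.20) -> float:
    """Throttle to apply to a flagged tenant: normal baseline plus a buffer."""
    return mean(history) * (1.0 + buffer)
```

Note that a perfectly flat baseline has a standard deviation of zero, so any increase would be flagged; a real alarm would likely add a minimum absolute threshold on top of this.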
21. You are reviewing a pull request that adds "Action": "s3:*", "Resource": "*" to a Lambda execution role. The developer says “I will tighten it later.” How do you handle this technically and interpersonally?
Difficulty: Intermediate
What the interviewer is really testing: Security discipline, ability to give constructive code review feedback, and whether you understand the real blast radius of overly permissive IAM. Also tests soft skills — can you push back without being adversarial? The obvious-answer trap: many candidates focus only on the technical risk and miss the interpersonal dimension.
What weak candidates say:
- “Block the PR and tell them to fix it.”
- “It is fine for now, we can tighten it later.”
- “Add a TODO comment and merge.”
s3:* usage comes from functions that need exactly 3 actions:
s3:* actually includes: s3:DeleteBucket, s3:PutBucketPolicy (can make any bucket public), s3:GetObject on every bucket (can exfiltrate customer data from unrelated services), s3:PutBucketReplication (can replicate data to an attacker-controlled account). Most developers are shocked when they see the full list — they think s3:* means “read and write files” and do not realize it includes bucket-level administrative operations.
On the systemic problem:
The fact that this reached a PR means the guardrails are missing:
- IAM Access Analyzer as a CI check. AWS IAM Access Analyzer has a policy validation API. Add a CI step that runs aws accessanalyzer validate-policy on every IAM policy in the CDK/Terraform code. It flags overly permissive actions and wildcard resources as findings. The PR cannot merge with ERROR-level findings.
- SCP as a safety net. An SCP that denies s3:PutBucketPolicy, s3:DeleteBucket, and s3:PutBucketPublicAccessBlock for Lambda execution roles (identified by a naming convention or tag condition). This means even if a wildcard policy slips through, the most dangerous S3 actions are blocked at the organization level.
- CDK/Terraform policy templates. Create reusable IAM policy modules for common patterns (S3 read-only, S3 read-write to specific bucket, DynamoDB CRUD on specific table). Developers import the module instead of writing raw JSON. This is faster than writing a custom policy and correct by default.
"Action": "*", "Resource": "*" — full admin access. Three months later, a dependency vulnerability in a Node.js library (prototype pollution in a logging package) allowed an attacker to execute arbitrary code in the Lambda context. Because the Lambda had full admin access, the attacker created a new IAM user, attached AdministratorAccess, and used those credentials to launch 48 p3.16xlarge instances for cryptocurrency mining. The bill for the weekend: $28,000. AWS reversed most of it under their abuse policy, but the incident response, credential rotation, forensic analysis, and customer notification took 3 weeks of engineering time. One IAM policy that should have been s3:GetObject on one bucket.
Follow-up: The developer pushes back: “IAM Access Analyzer is too strict, it flags everything. It slows us down.” How do you respond?
Answer: They are partially right: IAM Access Analyzer’s default findings include informational items that are not security risks. The fix is not to disable the tool; it is to configure the CI check to fail only on ERROR and SECURITY_WARNING level findings, not SUGGESTION or WARNING. This eliminates 80% of the noise while catching the dangerous patterns (wildcard resources, overpermissive actions, public access).
I would also invest 2 days in creating a library of pre-approved IAM policy modules for the 10 most common patterns in the codebase. Developers import a module (import { s3ReadWrite } from '@infra/iam-policies') instead of writing raw JSON. The module is already validated, already scoped, and faster to use than writing a custom policy. You turn security from a tax into a productivity tool.
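The "fail only on ERROR and SECURITY_WARNING" gate from this answer could look roughly like the sketch below. It assumes the CI step has already captured the JSON printed by `aws accessanalyzer validate-policy`, whose response contains a `findings` list where each entry has a `findingType`. The function names and the exit-code convention are illustrative choices, not part of any AWS tooling.

```python
import json
from typing import Iterable, List

# Finding types that should block a merge; SUGGESTION and WARNING are ignored.
BLOCKING_TYPES = {"ERROR", "SECURITY_WARNING"}


def blocking_findings(
    validate_policy_output: str,
    blocking: Iterable[str] = BLOCKING_TYPES,
) -> List[dict]:
    """Return only the findings whose type should fail the build.

    `validate_policy_output` is the raw JSON text captured from the
    validate-policy command for one policy document.
    """
    findings = json.loads(validate_policy_output).get("findings", [])
    blocking_set = set(blocking)
    return [f for f in findings if f.get("findingType") in blocking_set]


def ci_exit_code(validate_policy_output: str) -> int:
    """Exit code for the CI step: 1 if any blocking finding exists, else 0."""
    return 1 if blocking_findings(validate_policy_output) else 0
```

In a pipeline, the step would run this once per policy file and exit with the first nonzero code, printing the blocking findings so the developer can see exactly which statement to tighten.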
22. A service processes 2 million SQS messages per day. Occasionally, the same message is processed twice, causing duplicate charges to customers. The team says “SQS is at-least-once, duplicates are expected.” Is that an acceptable answer?
Difficulty: Senior
What the interviewer is really testing: Whether you treat “the infrastructure does not guarantee it” as an excuse or an engineering constraint to solve. This is a question where the technically correct answer (“SQS Standard is at-least-once”) is the wrong answer in a business context. Strong engineers solve for business requirements, not infrastructure limitations.
What weak candidates say:
- “Switch to SQS FIFO for exactly-once delivery.”
- “SQS duplicates are rare, the business should accept them.”
- “Add a database check before processing each message.”
- Idempotency key. Every message carries a unique idempotency key (the SQS MessageId, a hash of the message body, or a business-level identifier like order_id + charge_type). Before processing, the consumer checks if this key has been processed before.
- The atomic check-and-process pattern. This is where most implementations go wrong. A naive approach (“read from database, check if key exists, if not process, then write key”) has a race condition. Two concurrent Lambda instances receive the same duplicate message, both check the database simultaneously, both see “not processed,” both process, both write the key. You still get a duplicate.
ConditionExpression makes the check-and-write atomic. If two Lambda instances try simultaneously, one succeeds and one gets ConditionalCheckFailedException. The one that fails skips processing. Zero duplicates.
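To make the atomic check-and-write concrete, here is a minimal sketch that uses an in-memory store as a stand-in for the DynamoDB table. Its `put_if_absent` method mimics a `PutItem` call with `ConditionExpression="attribute_not_exists(pk)"`, where `pk` is an assumed key attribute name. The class names, the `handle_message` function, and the message fields are hypothetical.

```python
import threading
from typing import Callable, Dict


class ConditionalCheckFailed(Exception):
    """Stand-in for DynamoDB's ConditionalCheckFailedException."""


class InMemoryIdempotencyStore:
    """Stand-in for a DynamoDB table holding idempotency keys."""

    def __init__(self) -> None:
        self._items: Dict[str, dict] = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key: str, item: dict) -> None:
        # The existence check and the write happen under one lock, which is
        # the guarantee a conditional write gives you on the server side.
        with self._lock:
            if key in self._items:
                raise ConditionalCheckFailed(key)
            self._items[key] = item


def handle_message(
    store: InMemoryIdempotencyStore,
    message: dict,
    charge: Callable[[dict], None],
) -> bool:
    """Process a message at most once; return False for duplicates."""
    key = f"{message['order_id']}#{message['charge_type']}"
    try:
        store.put_if_absent(key, {"status": "PROCESSED"})
    except ConditionalCheckFailed:
        return False  # another delivery already claimed this key: skip
    charge(message)
    return True
```

Because the claim is written before the charge runs, a crash between the two steps leaves a key with no completed work, which is the failure mode the follow-up question below addresses.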
- TTL for cleanup. Idempotency keys should not live forever. Set a DynamoDB TTL to expire keys after 7 days (or whatever your SQS message retention is). This keeps the table from growing unbounded.
- What about SQS FIFO? SQS FIFO provides exactly-once delivery within a 5-minute deduplication window. But it does not provide exactly-once processing. If the consumer crashes after processing but before deleting the message, the message reappears after the visibility timeout. You still need the idempotency pattern. FIFO reduces the probability of duplicates but does not eliminate it. For 2 million messages/day (23 messages/second), SQS FIFO works within its throughput limits. But I would still implement idempotency because the cost is negligible and the protection is absolute.
- The broader pattern. Idempotency is not just for SQS. API endpoints should be idempotent (Stripe’s Idempotency-Key header is the gold standard). Event processors should be idempotent. Anything that can be retried should be idempotent. I treat idempotency the same way I treat input validation: it is not optional, it is a fundamental requirement for any operation that has side effects.
Follow-up: What happens if the Lambda crashes between the idempotency write and the actual processing? The key is recorded but the charge never happened.
Answer: This is the “phantom record” problem. The idempotency key is in the database, so any retry skips processing, but the original processing never completed. The customer is never charged.
Solution: Two-phase idempotency. Write the idempotency record with status: PENDING in the first atomic write. Process the charge. Update the record to status: COMPLETED with the result. If the Lambda crashes between the initial write and the processing, the record exists with status: PENDING.
A separate “reconciliation” Lambda runs every 5 minutes (scheduled via EventBridge), queries for PENDING records older than 10 minutes, and retries them. The retry Lambda reads the PENDING record, attempts the charge, and updates the status.
This handles all failure modes: duplicate delivery (blocked by conditional write), crash before processing (retried by reconciler), crash after processing but before status update (reconciler retries, but the downstream payment processor should also be idempotent — Stripe’s Idempotency-Key prevents double-charging even if we call it twice).
The pattern is: make every step independently retry-safe, and have a background reconciler that catches anything that fell through the cracks.
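The two-phase flow and the reconciler can be sketched as follows. This is a single-threaded, in-memory illustration under stated assumptions: `IdempotencyTable` stands in for the DynamoDB table (its `claim` method would be a conditional write in practice), timestamps are plain seconds passed in by the caller, and the `charge` callable is assumed to be idempotent on the key, as a payment processor that honors an idempotency key would be. All names are hypothetical.

```python
from typing import Callable, Dict, List

PENDING, COMPLETED = "PENDING", "COMPLETED"


class IdempotencyTable:
    """In-memory stand-in for the idempotency table."""

    def __init__(self) -> None:
        self.items: Dict[str, dict] = {}

    def claim(self, key: str, payload: dict, now: float) -> bool:
        # Phase 1: record the key as PENDING only if it does not exist yet.
        if key in self.items:
            return False
        self.items[key] = {"status": PENDING, "payload": payload, "created_at": now}
        return True

    def complete(self, key: str, result: str) -> None:
        # Phase 2: mark the work as done and keep the result.
        self.items[key].update(status=COMPLETED, result=result)

    def stale_pending(self, now: float, older_than_s: float = 600) -> List[str]:
        return [
            key
            for key, item in self.items.items()
            if item["status"] == PENDING and now - item["created_at"] > older_than_s
        ]


def process(table: IdempotencyTable, key: str, payload: dict,
            charge: Callable[[str, dict], str], now: float) -> bool:
    """Consumer path: claim, charge, then complete. False means duplicate."""
    if not table.claim(key, payload, now):
        return False
    table.complete(key, charge(key, payload))
    return True


def reconcile(table: IdempotencyTable,
              charge: Callable[[str, dict], str], now: float) -> int:
    """Scheduled path: retry PENDING records older than 10 minutes."""
    retried = 0
    for key in table.stale_pending(now):
        # Safe to call again only because `charge` is idempotent on `key`.
        table.complete(key, charge(key, table.items[key]["payload"]))
        retried += 1
    return retried
```

The 10-minute staleness window matters: it must be longer than the consumer's worst-case processing time, otherwise the reconciler would race with a consumer that is still working on the record.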
Follow-up: How do you size the DynamoDB table for the idempotency store at 2 million messages/day?
Answer: Quick math: 2 million writes/day = 23 writes/second sustained. Each idempotency record is roughly 200 bytes (key + timestamp + status + result). With 7-day TTL, the table holds ~14 million items at steady state = ~2.8 GB storage. For DynamoDB on-demand: 23 WRU/second is trivial. Cost: ~$75/month. Read capacity for the conditional check is included in the write (DynamoDB’s conditional write does a read internally). For provisioned: 25 WCU (with some headroom) and 5 RCU (for the reconciliation queries). Cost: ~$15/month. Storage: 2.8 GB * $0.25/GB-month = $0.70/month. Total: $16-76/month depending on mode. This is negligible compared to the cost of duplicate charges it prevents.
Cross-Chapter Connections
- Cloud, Problem Framing & Trade-Offs — The decision frameworks and architectural thinking that precede service selection. Read that chapter first for the “why,” then this chapter for the “how.”
- Database Deep Dives — PostgreSQL internals (MVCC, WAL, VACUUM, query planner), DynamoDB access pattern design and single-table strategy, MongoDB aggregation patterns, and Redis data structures. Essential for understanding what runs inside RDS, Aurora, DynamoDB, and ElastiCache.
- Messaging, Concurrency & State — Deeper coverage of messaging semantics (exactly-once, idempotency, ordering guarantees), Kafka vs RabbitMQ vs SQS comparisons, dead letter queue patterns, and event-driven architecture principles that apply to SQS, Kinesis, and EventBridge patterns in this chapter.
- Authentication & Security — OAuth 2.0, OIDC, RBAC, ABAC, Zero Trust Architecture, and service-to-service authentication patterns that IAM roles, Secrets Manager, and SCPs implement at the AWS level. Essential context for the IAM, multi-account, and security sections in this chapter.
- Networking & Deployment — DNS resolution (Route 53 implements these patterns), load balancing (ALB/NLB), TLS termination, service discovery, VPC fundamentals, and deployment strategies (blue-green, canary) applied to AWS services.
- Compliance, Cost & Debugging — FinOps practices and cloud cost governance that extend this chapter’s Cost Engineering section, GDPR/HIPAA implications for S3 data residency and multi-account compliance boundaries, and incident response for cloud outages.
- APIs & Databases — REST API design, database indexing, and transaction patterns that inform API Gateway, DynamoDB, and Aurora design decisions.
- Caching & Observability — ElastiCache patterns (cache-aside, write-through, stampede prevention), cache invalidation strategies, and the observability stack (CloudWatch, X-Ray, Datadog) for monitoring cloud-native applications.
- Performance & Scalability — Auto-scaling patterns, latency optimization, and capacity planning that apply directly to Lambda concurrency, Fargate scaling, and database sizing.
- Reliability Principles — SLOs, error budgets, circuit breakers, retry patterns, bulkhead isolation, and chaos engineering. These resilience patterns are what make cloud architectures production-ready — Lambda retry behavior, SQS DLQ patterns, and multi-AZ failover all implement the principles covered there.
- Distributed Systems Theory — CAP theorem, consensus algorithms, and the fundamental challenges of distributed computing. DynamoDB’s eventual consistency, Aurora’s distributed storage, and multi-region active-active architectures all trace back to these theoretical foundations.