Cloud Service Patterns — AWS in Depth

Knowing that “the cloud” exists is table stakes. Knowing which specific service to reach for, why it behaves the way it does under load, and where the cost traps hide — that is what separates engineers who build on AWS from engineers who get paged by AWS. This chapter goes deep on the services you will actually use in production: Lambda, S3, DynamoDB, SQS, EventBridge, ECS, and the networking and security primitives that hold it all together. Every section is grounded in real production behavior, not marketing pages.
Cross-chapter context. This chapter builds on the cloud architecture foundations covered in Cloud, Problem Framing & Trade-Offs — especially the 5-Question Framework, the Well-Architected Framework, and the compute decision matrix. Where that chapter teaches you how to think about cloud decisions, this chapter teaches you how the services actually behave when real traffic hits them. For database internals behind RDS, Aurora, and DynamoDB, see Database Deep Dives. For IAM, Secrets Manager, and cloud security patterns, see Authentication & Security. For SQS/SNS/EventBridge messaging semantics (delivery guarantees, idempotency, dead letter queues), see Messaging, Concurrency & State. For VPC, DNS, and deployment strategies, see Networking & Deployment. For FinOps and cloud cost management, see Compliance, Cost & Debugging.
AWS-centric, principles-universal. This chapter uses AWS service names because AWS has the largest market share and the most granular service catalog. But every pattern here has a direct equivalent on GCP and Azure. If you understand why Lambda has cold starts, you understand why Cloud Functions and Azure Functions do too. The principles transfer; the console does not.

Part I — Serverless Patterns (Lambda / Cloud Functions)

1.1 Cold Starts — What Actually Happens

When you invoke a Lambda function, AWS must run your code somewhere. If there is already a warm execution environment sitting idle from a recent invocation, your code runs immediately — this is a “warm start” and adds near-zero overhead. But if no warm environment exists, AWS must provision one from scratch. This is a cold start, and understanding exactly what happens during one is the key to mitigating them. The cold start lifecycle:
  1. Download your code — AWS fetches your deployment package (zip or container image) from S3 or ECR. Larger packages take longer.
  2. Create the execution environment — AWS provisions a microVM (Firecracker), allocates memory, sets up networking, and mounts the filesystem.
  3. Initialize the runtime — The language runtime starts (Python interpreter, Node.js V8, JVM, .NET CLR). This is where JVM-based languages pay a massive tax.
  4. Run your initialization code — Any code outside your handler function executes: import statements, database connection setup, SDK client creation, global variable initialization.
  5. Execute your handler — Finally, your actual function logic runs.
Steps 1-4 are the cold start. Step 5 is what happens every time. Cold start duration by language:
| Runtime | Typical Cold Start | Worst Case | Why |
| --- | --- | --- | --- |
| Python | 100-300 ms | 500 ms+ | Interpreter starts fast; import-heavy packages (pandas, numpy) add time |
| Node.js | 100-300 ms | 500 ms+ | V8 is quick; large node_modules increase download/init |
| Go | 50-100 ms | 200 ms | Compiled binary, no runtime initialization |
| Rust | 50-100 ms | 200 ms | Same as Go: compiled, minimal runtime |
| Java | 3-10 seconds | 15+ seconds | JVM startup, class loading, JIT compilation warmup |
| .NET | 1-3 seconds | 5+ seconds | CLR initialization, assembly loading |
Java cold starts are not a myth. If you are building a latency-sensitive API on Lambda with Java, you will have cold start problems. A Spring Boot application with dependency injection, annotation scanning, and connection pool initialization can easily take 8-12 seconds on first invocation. This is not an edge case — it is the default behavior. Either mitigate aggressively or choose a different runtime.
Mitigation strategies:
  • Provisioned Concurrency — Pre-warms a specified number of execution environments. You pay for them whether they are used or not (like reserved instances for Lambda). Use when: your API has strict latency SLAs and cold starts are unacceptable.
  • SnapStart (Java at launch; since extended to Python and .NET) — AWS takes a snapshot of the initialized execution environment after your init code runs, then restores from that snapshot on cold start instead of re-initializing. Reduces Java cold starts from seconds to ~200-500 ms. Released late 2022. Use for any Java Lambda function.
  • Keep functions small — Smaller deployment packages download faster. Trim unused dependencies. Use Lambda layers for shared libraries.
  • Warm-up pings — A CloudWatch Events rule that invokes your function every 5 minutes with a no-op payload. Cheap and effective, but does not guarantee warm instances under concurrent load (one ping keeps one instance warm).
  • Choose lightweight runtimes — Go and Rust have the fastest cold starts. Python and Node.js are a good middle ground. Java and .NET are the slowest without SnapStart or equivalent.
  • Initialize outside the handler — Database connections, SDK clients, and configuration should be created in the global scope (outside the handler function). This code runs once per cold start and is reused across warm invocations.
# GOOD: Connection created once, reused across invocations
import boto3
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('Orders')

def handler(event, context):
    return table.get_item(Key={'order_id': event['order_id']})

# BAD: Connection created on EVERY invocation
def handler(event, context):
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('Orders')
    return table.get_item(Key={'order_id': event['order_id']})
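
To see how often invocations actually pay the cold start, one option is a module-level flag: module scope runs once per execution environment, so only the first invocation in each environment sees the flag set. This is a minimal sketch; the keys in the returned dict are illustrative and not a Lambda convention.

```python
import time

# Module scope runs once per execution environment (step 4 of the lifecycle).
_ENV_CREATED_AT = time.monotonic()
_cold = True


def handler(event, context):
    """Report whether this invocation ran in a freshly created environment."""
    global _cold
    was_cold, _cold = _cold, False
    return {
        "cold_start": was_cold,
        "env_age_seconds": round(time.monotonic() - _ENV_CREATED_AT, 3),
    }


first = handler({}, None)   # first call in this environment
second = handler({}, None)  # environment is now warm
```

Logging the flag as a structured field makes it possible to chart the cold start rate per function instead of guessing it from latency percentiles.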

1.2 Concurrency Model

Lambda concurrency is one of the most misunderstood aspects of the service. Here is how it actually works: one invocation = one execution environment. If your function receives 100 simultaneous requests, AWS creates 100 execution environments. Each environment handles exactly one request at a time; your code may spawn threads, but Lambda never routes a second request to an environment that is still busy. This is fundamentally different from a container or VM that handles many concurrent requests. Account-level limits:
| Limit | Default | Can Increase |
| --- | --- | --- |
| Concurrent executions (per region) | 1,000 | Yes, via support ticket (common to get 10K-100K) |
| Burst concurrency | 500-3,000 (region-dependent) | No |
| Function-level reserved concurrency | Configurable (0 to account limit) | N/A |
Reserved vs unreserved concurrency:
  • Unreserved — All functions share the account’s concurrency pool. A traffic spike on one function can starve all others.
  • Reserved — You allocate a guaranteed number of concurrent executions to a specific function. That function can never exceed its reservation, and no other function can consume its allocation. This is both a guarantee and a throttle.
Always set reserved concurrency on critical functions. Without it, a runaway function processing a backlog of SQS messages can consume your entire account’s concurrency, causing throttling on your user-facing API Lambda. This is one of the most common production incidents in serverless architectures.
What happens when you hit the limit: Lambda returns a 429 TooManyRequestsException. For synchronous invocations (API Gateway), this surfaces as a 429 to the client. For asynchronous invocations (S3 events, SNS), Lambda retries automatically (twice by default). For SQS event sources, the message returns to the queue and is retried based on visibility timeout.
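
Reserved concurrency is set per function. The sketch below wraps the Lambda API call `put_function_concurrency`; `RecordingLambdaClient` is a hypothetical stand-in for `boto3.client("lambda")` so the example runs without AWS credentials, and the function name and the default limit of 1,000 are assumptions.

```python
class RecordingLambdaClient:
    """Hypothetical stand-in for boto3.client("lambda"); it only records calls."""

    def __init__(self):
        self.calls = []

    def put_function_concurrency(self, **kwargs):
        self.calls.append(kwargs)
        return {"ReservedConcurrentExecutions": kwargs["ReservedConcurrentExecutions"]}


def reserve_concurrency(lambda_client, function_name, reserved, account_limit=1000):
    """Guarantee and cap a function's share of the account concurrency pool."""
    if not 0 <= reserved <= account_limit:
        raise ValueError("reserved concurrency must be between 0 and the account limit")
    return lambda_client.put_function_concurrency(
        FunctionName=function_name,
        ReservedConcurrentExecutions=reserved,
    )


client = RecordingLambdaClient()
# Cap the backlog worker so it cannot starve the user-facing API.
reserve_concurrency(client, "orders-backlog-worker", 50)
```

With a real client, the same call both guarantees 50 concurrent executions to the worker and throttles it above 50, which is the guarantee-and-throttle behavior described above.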

1.3 Event Sources — The Lambda Trigger Taxonomy

Lambda’s power comes from its integration with dozens of AWS services. But each event source has different invocation semantics, and misunderstanding them causes subtle bugs.
| Event Source | Invocation Type | Retry Behavior | Concurrency Model |
| --- | --- | --- | --- |
| API Gateway | Synchronous | No retry (client retries) | One Lambda per HTTP request |
| SQS | Polling (Lambda pulls) | Message returns to queue on failure | Batch size controls concurrency |
| S3 | Asynchronous | 2 retries, then DLQ | One Lambda per event |
| EventBridge | Asynchronous | Retries for 24 hours | One Lambda per event |
| Kinesis | Polling (Lambda pulls) | Retries until data expires | One Lambda per shard |
| DynamoDB Streams | Polling (Lambda pulls) | Retries until data expires | One Lambda per shard |
| SNS | Asynchronous | 3 retries | One Lambda per message |
Kinesis and DynamoDB Streams will block on failure. If your Lambda function fails processing a record from a Kinesis stream, Lambda by default retries that same batch until the record expires from the stream (24 hours by default, longer if you have extended Kinesis retention). This means a single poison record blocks all processing on that shard. Always configure BisectBatchOnFunctionError (to narrow down the failing record) and MaximumRetryAttempts (to eventually skip it) on the event source mapping, and send failed records to a DLQ or on-failure destination.
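
As a sketch of those settings, the function below builds the keyword arguments for the Lambda API call `create_event_source_mapping`. The parameter names are the real API fields; the ARNs, function name, batch size, and retry count are placeholder assumptions.

```python
def kinesis_mapping_params(stream_arn, function_name, failure_destination_arn,
                           batch_size=100, max_retries=3):
    """Keyword arguments for lambda_client.create_event_source_mapping(**params)."""
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",
        "BatchSize": batch_size,
        # Split a failing batch in half on each retry to isolate the poison record.
        "BisectBatchOnFunctionError": True,
        # Give up after a bounded number of retries instead of blocking the shard.
        "MaximumRetryAttempts": max_retries,
        # Metadata about discarded records is sent here (an SQS queue or SNS topic).
        "DestinationConfig": {"OnFailure": {"Destination": failure_destination_arn}},
    }


params = kinesis_mapping_params(
    "arn:aws:kinesis:us-east-1:123456789012:stream/clicks",  # placeholder ARN
    "clickstream-processor",                                 # placeholder name
    "arn:aws:sqs:us-east-1:123456789012:clicks-failed",      # placeholder ARN
)
```

With boto3 available, `boto3.client("lambda").create_event_source_mapping(**params)` would apply it.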

1.4 Lambda Architecture Patterns

Pattern 1: API Backend (API Gateway -> Lambda -> DynamoDB/RDS). The most common pattern. Each HTTP endpoint maps to a Lambda function (or a single function with routing logic). Works well for moderate traffic APIs. Watch out for: cold starts on infrequently-hit endpoints, connection pooling with RDS (use RDS Proxy), and API Gateway’s 29-second timeout limit.
Pattern 2: Event Processor (S3/SQS/SNS/EventBridge -> Lambda -> DynamoDB/S3). Process events asynchronously. Image uploaded to S3 triggers a Lambda that generates thumbnails. Order placed publishes to SNS, triggering Lambdas for email, inventory, and analytics. This is where serverless shines: zero cost when idle, and it scales automatically with event volume.
Pattern 3: Scheduled Task (EventBridge Scheduled Rule -> Lambda). Replaces cron jobs. Clean up expired sessions every hour, generate daily reports, sync data between systems. Cheaper and more reliable than running a dedicated EC2 instance for cron.
Pattern 4: Stream Processor (Kinesis/DynamoDB Streams -> Lambda). Process real-time data streams. Clickstream analytics, change data capture, real-time aggregations. Lambda processes records in order within each shard. Use enhanced fan-out for high-throughput consumers.

1.5 Container Images vs Zip Packages

Lambda supports two deployment formats, and the choice matters more than most teams realize. Zip packages (the default):
  • Max size: 50 MB zipped, 250 MB unzipped
  • Fast deployment (seconds)
  • Works with Lambda layers for shared dependencies
  • Best for: most Lambda functions, small-to-medium dependency footprints
Container images:
  • Max size: 10 GB
  • Uses standard Docker tooling (Dockerfile, docker build, ECR)
  • Slower cold starts (larger image to download, though AWS caches aggressively)
  • No Lambda layer support (bake everything into the image)
  • Best for: ML inference (large model files), functions with massive dependencies, teams with existing Docker workflows, functions that need specific system libraries
The real reason to use container images: It is not about Docker familiarity. It is about dependency size. If your function needs TensorFlow, PyTorch, or a large native binary, the 250 MB zip limit is a hard wall. Container images give you 10 GB of headroom. For everything else, zip packages are simpler and deploy faster.

1.6 Step Functions — Orchestration Done Right

When you need to coordinate multiple Lambda functions into a workflow, you have three options: synchronous chaining (Lambda calls Lambda — terrible idea), event-driven choreography (SQS/EventBridge between steps — good for simple flows), or orchestration with Step Functions (good for complex flows with branching, retries, and error handling). When to use Step Functions:
  • Workflows with conditional logic (if payment succeeds, do X; if it fails, do Y)
  • Long-running workflows that exceed Lambda’s 15-minute timeout
  • Workflows that need human approval steps
  • Complex retry and error handling across multiple steps
  • When you need a visual representation of workflow state for debugging
When to use SQS/EventBridge instead:
  • Simple fan-out (one event triggers multiple independent actions)
  • When steps are truly independent and do not need coordination
  • When you want loose coupling between services owned by different teams
  • When the “workflow” is really just a pipeline (A -> B -> C with no branching)
Step Functions pricing trap: Standard workflows charge per state transition ($0.025 per 1,000 transitions). A workflow with 10 steps processing 1 million events costs $250 in state transitions alone. For high-volume, simple workflows, Express Workflows (charged per execution and duration, not per transition) are dramatically cheaper.
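
The arithmetic behind that figure, as a small sketch. It assumes the Standard workflow price of $0.025 per 1,000 state transitions quoted above and one billed transition per step; real workflows may incur extra transitions for retries or choice states.

```python
STANDARD_PRICE_PER_1K_TRANSITIONS = 0.025  # USD, the price quoted in the text


def standard_workflow_cost(steps_per_execution, executions):
    """State-transition cost of a Standard workflow, in USD."""
    transitions = steps_per_execution * executions
    return transitions / 1000 * STANDARD_PRICE_PER_1K_TRANSITIONS


cost = standard_workflow_cost(steps_per_execution=10, executions=1_000_000)
print(f"${cost:,.2f}")  # prints $250.00
```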

1.7 Cost Model — When Serverless Stops Being Cheap

Lambda pricing has two components: per-invocation ($0.20 per million requests) and per-compute-time ($0.0000166667 per GB-second). The combination means Lambda is extraordinarily cheap at low scale and surprisingly expensive at high scale. The break-even calculation:
| Traffic | Lambda (128 MB, 200ms avg) | ECS Fargate (0.25 vCPU, 0.5 GB) |
| --- | --- | --- |
| 100K requests/month | ~$0.43 | ~$9.00 |
| 1M requests/month | ~$4.30 | ~$9.00 |
| 10M requests/month | ~$43.00 | ~$9.00 |
| 100M requests/month | ~$430.00 | ~$36.00 (4 tasks) |
The crossover point depends heavily on memory allocation, execution duration, and traffic pattern. But the rule of thumb: at roughly 1-5 million sustained requests per month, containers start winning on cost. Below that, Lambda is almost always cheaper. Above that, do the math for your specific workload.
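
To do the math for a specific workload, a small calculator helps. The sketch below counts only the two Lambda charges named above ($0.20 per million requests, $0.0000166667 per GB-second) and ignores the free tier. On their own those two charges come out lower than the table's Lambda column, so the table presumably includes other per-request costs; that reading is an assumption, and the `other_per_million` parameter is a hypothetical hook for such costs.

```python
REQUEST_PRICE_PER_MILLION = 0.20            # USD per 1M requests
COMPUTE_PRICE_PER_GB_SECOND = 0.0000166667  # USD per GB-second


def lambda_monthly_cost(requests, memory_mb, avg_duration_ms, other_per_million=0.0):
    """Monthly cost in USD: request charge + compute charge + optional extras."""
    gb_seconds = requests * (memory_mb / 1024) * (avg_duration_ms / 1000)
    request_cost = requests / 1_000_000 * REQUEST_PRICE_PER_MILLION
    compute_cost = gb_seconds * COMPUTE_PRICE_PER_GB_SECOND
    other_cost = requests / 1_000_000 * other_per_million
    return request_cost + compute_cost + other_cost


def break_even_requests(fixed_monthly_cost, memory_mb, avg_duration_ms,
                        other_per_million=0.0):
    """Monthly request count at which Lambda costs as much as a fixed-price service."""
    per_request = lambda_monthly_cost(1, memory_mb, avg_duration_ms, other_per_million)
    return fixed_monthly_cost / per_request


lambda_only = lambda_monthly_cost(1_000_000, memory_mb=128, avg_duration_ms=200)
crossover = break_even_requests(9.00, memory_mb=128, avg_duration_ms=200)
```

The crossover moves a lot with memory size, duration, and whatever sits in front of the function, which is why the rule of thumb above should be checked against your own numbers.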
The hidden cost of “free.” Lambda’s free tier (1M requests + 400,000 GB-seconds per month) is generous for development and low-traffic production. But it masks the true cost curve. Teams that prototype on the free tier and then scale to production are often shocked by the bill. Always model costs at your projected 6-month traffic, not your current traffic.

1.8 Anti-Patterns

The Lambda-lith: Deploying your entire Express/Flask application as a single Lambda function behind API Gateway. You lose all the benefits of serverless (granular scaling, independent deployment, per-function monitoring) and keep all the downsides (cold starts, 15-minute timeout, payload limits). If you want to run a monolith, use containers.
Synchronous Lambda chains: Lambda A calls Lambda B, which calls Lambda C, all synchronously. Each function waits for the next, consuming concurrency and accumulating latency. If Lambda C times out, Lambda B times out, and Lambda A times out: a cascade failure. Use Step Functions or async patterns (SQS between steps) instead.
Cold-start-sensitive critical paths: Placing Lambda in the hot path of a user-facing request without provisioned concurrency, then wondering why p99 latency is 3 seconds. Either use provisioned concurrency, or move the latency-sensitive operation to an always-on service.
Recursive Lambda invocations: A Lambda function that writes to an S3 bucket that triggers the same Lambda function. This creates an infinite loop that racks up a massive bill in minutes. AWS now has recursive loop detection for some event sources, but do not rely on it; design your event routing to prevent cycles.
Lambda-to-Lambda synchronous chaining (expanded): This deserves emphasis because it remains one of the most common mistakes in serverless architectures. The problems compound:
  • Latency accumulation: Each hop adds cold start risk + network latency. A chain of A -> B -> C -> D with 300ms cold starts becomes 1.2 seconds in the worst case — before any business logic runs.
  • Concurrency multiplication: If A calls B synchronously, both A and B consume a concurrent execution for the full duration. A chain of 4 functions processing 100 requests consumes 400 concurrent executions. You hit the account limit 4x faster.
  • Cascading timeouts: If D’s timeout is 30 seconds, C must have a timeout > 30 seconds, B > C, A > B. Now A has a 2-minute timeout for what should be a 5-second operation.
  • Cost doubling: Every function in the chain is billed for the time it spends waiting for the downstream function. You are literally paying twice (or more) for the same wall-clock time.
  • The fix: Use Step Functions for orchestration (each step is invoked independently), SQS between steps for loose coupling, or collapse the chain into a single function if the logic is simple enough.
Oversized Lambda deployment packages: Stuffing everything into a single zip or container image because “it is easier.” A 200 MB zip package takes 2-5 seconds to download on cold start. This single choice can make the difference between 200ms and 5-second cold starts.
  • Why it happens: Teams install entire frameworks (all of boto3 when they only need S3, all of pandas when they only need CSV parsing), bundle test fixtures, include unused native binaries, or skip tree-shaking in Node.js.
  • The fix: Audit dependencies ruthlessly. Use pip install --target with only the packages you need. In Node.js, use bundlers (esbuild, webpack) to tree-shake dead code. Move shared dependencies to Lambda Layers — layers are downloaded once and cached across function updates, so updating your function code does not re-download shared libraries.
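  • Lambda Layers best practices: Create layers for common SDKs, database drivers, and utility libraries. A function can use up to 5 layers, and the function code plus all of its layers must fit within the 250 MB unzipped limit. Layers are versioned and immutable: a function always references a specific layer version ARN (there is no $LATEST for layers), so pin versions explicitly and roll out new ones deliberately.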
# Create a Lambda Layer for shared Python dependencies
mkdir -p python/lib/python3.12/site-packages
pip install requests boto3 pyjwt -t python/lib/python3.12/site-packages
zip -r shared-libs-layer.zip python/
aws lambda publish-layer-version \
  --layer-name shared-libs \
  --zip-file fileb://shared-libs-layer.zip \
  --compatible-runtimes python3.12
Not using Lambda Layers (or using them wrong): Teams either ignore layers entirely (duplicating dependencies across 20 functions) or create a single “kitchen sink” layer with every dependency (defeating the purpose of keeping packages small). The correct pattern: create focused layers by domain (one for database drivers, one for HTTP utilities, one for ML libraries). Each function pulls only the layers it needs.
Lambda functions with VPC access when they do not need it: Placing a Lambda function in a VPC historically added 1-10 seconds to cold start time, because AWS had to create an ENI (Elastic Network Interface) in your VPC on cold start. Since 2019, Hyperplane ENIs have removed most of that penalty, but VPC attachment still adds some overhead compared to non-VPC Lambda. If your function only calls public AWS APIs (S3, DynamoDB, SQS) or external HTTP endpoints, it does not need VPC access. Only attach a VPC when the function must reach private resources (RDS in a private subnet, ElastiCache, internal APIs).
Ignoring function-level metrics: Teams deploy Lambda functions and only look at aggregate CloudWatch metrics. But each function has its own cold start profile, error rate, and concurrency pattern. Use CloudWatch Lambda Insights or Datadog/Lumigo to get per-function visibility into: ConcurrentExecutions, Throttles, Duration (p50/p95/p99), IteratorAge (for stream-based triggers), and DeadLetterErrors.

Part II — Storage Patterns (S3 / Blob Storage)

2.1 S3 Consistency Model

Before December 2020, S3 had eventual consistency for overwrite PUTs and DELETEs. You could update an object and immediately read the old version. This was one of the most common sources of subtle bugs in S3-based architectures. Since December 2020, S3 provides strong read-after-write consistency for all operations. After a successful PUT, any subsequent GET returns the new version. After a DELETE, any subsequent GET returns a 404. This applies to all S3 API operations, all storage classes, and all regions. You do not need to pay extra for it. There is no performance penalty.
Why this matters for interviews: Many older blog posts, Stack Overflow answers, and even some AWS documentation references still mention eventual consistency in S3. If an interviewer asks about S3 consistency, demonstrate that you know the current model (strong consistency since December 2020) while also knowing what the old model was and why it caused problems. This shows you stay current, not just that you memorized an old textbook.

2.2 Storage Classes and Lifecycle Policies

S3 offers several storage classes; the six below cover most workloads, each with different cost and access characteristics. Choosing the right one, and automating transitions between them, is one of the easiest ways to cut AWS bills.
| Storage Class | Use Case | Retrieval Time | Monthly Cost (per GB) | Retrieval Cost |
| --- | --- | --- | --- | --- |
| S3 Standard | Frequently accessed data | Instant | ~$0.023 | None |
| S3 Intelligent-Tiering | Unknown/changing access patterns | Instant | ~$0.023 + monitoring fee | None |
| S3 Standard-IA | Infrequently accessed, needs instant access | Instant | ~$0.0125 | Per-GB retrieval fee |
| S3 One Zone-IA | Infrequent, reproducible data (thumbnails, transcoded media) | Instant | ~$0.01 | Per-GB retrieval fee |
| S3 Glacier Flexible | Archive, retrieval in minutes to hours | 1-12 hours | ~$0.0036 | Per-GB + per-request |
| S3 Glacier Deep Archive | Long-term archive, compliance | 12-48 hours | ~$0.00099 | Per-GB + per-request |
Lifecycle policies automate transitions:
{
  "Rules": [
    {
      "ID": "ArchiveOldLogs",
      "Status": "Enabled",
      "Prefix": "logs/",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "Expiration": { "Days": 2555 }
    }
  ]
}
This policy moves logs to Infrequent Access after 30 days, Glacier after 90 days, Deep Archive after a year, and deletes them after 7 years. For a company storing 10 TB of logs per month, this can reduce storage costs by 70-80%.
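
A small sketch of what the policy above means for a single object: given its age in days, which storage class it should be in. The thresholds are copied from the JSON; treating each one as "age greater than or equal to Days" is a simplification, since S3 evaluates lifecycle rules roughly once a day and real transitions can lag.

```python
# (minimum age in days, storage class), copied from the lifecycle policy above.
TRANSITIONS = [(30, "STANDARD_IA"), (90, "GLACIER"), (365, "DEEP_ARCHIVE")]
EXPIRATION_DAYS = 2555  # about 7 years


def storage_class_at(age_days):
    """Return the expected storage class, or None once the object has expired."""
    if age_days >= EXPIRATION_DAYS:
        return None
    current = "STANDARD"
    for min_age, storage_class in TRANSITIONS:
        if age_days >= min_age:
            current = storage_class
    return current


timeline = {age: storage_class_at(age) for age in (10, 45, 100, 400, 3000)}
```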

2.3 Multipart Uploads

For files larger than 100 MB, S3’s multipart upload is essential. Instead of uploading a single large object (which fails if the network drops midway), you split the file into parts (5 MB to 5 GB each), upload each part independently (possibly in parallel), and then tell S3 to assemble them. Why this matters:
  • Resilience — If one part fails, you retry only that part, not the entire file.
  • Parallelism — Upload parts concurrently to saturate your bandwidth.
  • Required for files > 5 GB — S3 has a 5 GB single-PUT limit.
  • Resumable — Uploaded parts persist until you complete or abort the upload (or until a lifecycle rule aborts it), so you can resume after a failure.
Incomplete multipart uploads cost money. If you start a multipart upload but never complete it (your application crashes, the user cancels), the uploaded parts remain in S3 and you pay for them. They are invisible in the S3 console unless you specifically look for them. Set a lifecycle policy to abort incomplete multipart uploads after a few days: "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }. This is one of the most common hidden S3 cost traps.
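
The splitting step can be sketched as a pure function that plans byte ranges for each part. It uses the 5 MB to 5 GB part-size bounds from the text plus S3's limit of 10,000 parts per upload; the 64 MiB preferred part size is an arbitrary assumption.

```python
import math

MIN_PART = 5 * 1024**2   # 5 MiB minimum (the last part may be smaller)
MAX_PART = 5 * 1024**3   # 5 GiB maximum per part
MAX_PARTS = 10_000       # S3 limit on parts per multipart upload


def plan_parts(total_size, preferred_part_size=64 * 1024**2):
    """Return a list of (part_number, offset, length) covering the whole object."""
    if total_size <= 0:
        raise ValueError("total_size must be positive")
    # Grow the part size if needed so the object fits within MAX_PARTS parts.
    part_size = max(preferred_part_size, MIN_PART, math.ceil(total_size / MAX_PARTS))
    if part_size > MAX_PART:
        raise ValueError("object is too large for a single multipart upload")
    parts, offset, number = [], 0, 1
    while offset < total_size:
        length = min(part_size, total_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


small = plan_parts(200 * 1024**2)  # 200 MiB object
large = plan_parts(1024**4)        # 1 TiB object
```

Each tuple maps to one `upload_part` call with that part number and byte range, and a failed part can be retried on its own. In practice, boto3's managed `upload_file` transfer does this planning for you.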

2.4 Pre-Signed URLs — Secure Temporary Access

Pre-signed URLs let you grant temporary access to private S3 objects without making them public or sharing credentials. You generate a URL server-side that includes a cryptographic signature and expiration time. Anyone with the URL can download (or upload) the object until the URL expires. Common patterns:
  • Secure file downloads — User requests a file, your API generates a pre-signed GET URL valid for 15 minutes, returns it to the client. Client downloads directly from S3, bypassing your server entirely.
  • Direct-to-S3 uploads — Your API generates a pre-signed PUT URL. Client uploads directly to S3. Your server never touches the file data, eliminating a bandwidth and memory bottleneck.
  • Temporary media access — Serve images or videos through pre-signed URLs with short expiration. Works with CloudFront signed URLs for even better performance.
import boto3
s3 = boto3.client('s3')

# Generate a download URL valid for 1 hour
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'my-bucket', 'Key': 'reports/q4-2025.pdf'},
    ExpiresIn=3600
)
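
The direct-to-S3 upload pattern uses the same call with `put_object`. In the sketch below, `FakeS3Client` is a hypothetical stand-in for `boto3.client("s3")` so the example runs offline; it returns an unsigned placeholder URL, whereas a real client returns a signed one. The bucket name, key, and 15-minute expiry are assumptions.

```python
class FakeS3Client:
    """Hypothetical stand-in for boto3.client("s3"); returns an UNSIGNED placeholder."""

    def generate_presigned_url(self, ClientMethod, Params=None, ExpiresIn=3600):
        return (f"https://{Params['Bucket']}.s3.amazonaws.com/{Params['Key']}"
                f"?method={ClientMethod}&X-Amz-Expires={ExpiresIn}")


def presigned_upload_url(s3_client, bucket, key, content_type, expires_in=900):
    """URL the client can PUT the file to directly, bypassing your server."""
    return s3_client.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": key, "ContentType": content_type},
        ExpiresIn=expires_in,
    )


upload_url = presigned_upload_url(
    FakeS3Client(), "my-bucket", "uploads/avatar.png", "image/png"
)
```

Because `ContentType` is part of the signed parameters, the client has to send the same `Content-Type` header on its PUT request.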

2.5 S3 Event Notifications

S3 can emit events when objects are created, deleted, or restored. These events can trigger Lambda functions, SQS queues, SNS topics, or EventBridge rules. This is the foundation of most event-driven data processing architectures on AWS. Common patterns:
  • Upload image to S3 -> Lambda generates thumbnails -> writes to another S3 bucket
  • CSV uploaded to S3 -> SQS message -> worker processes and loads to database
  • Log file uploaded -> EventBridge rule -> Step Functions workflow for analysis
S3 to EventBridge is the modern choice. While S3 can send events directly to Lambda, SQS, or SNS, routing through EventBridge gives you filtering, transformation, multiple targets, archival, and replay. Enable EventBridge notifications on the bucket and let EventBridge handle the routing. This decouples your event routing from your S3 configuration.
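
For the direct S3 -> Lambda path, the handler receives a `Records` list in which object keys arrive URL-encoded. A minimal parsing sketch follows; the `thumbnails/` prefix check is a hypothetical guard against the function re-triggering itself on its own output. Events routed through EventBridge use a different shape (under `detail`) and are not covered here.

```python
from urllib.parse import unquote_plus


def objects_from_s3_event(event):
    """Extract (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys are URL-encoded
        pairs.append((bucket, key))
    return pairs


def handler(event, context):
    # Skip objects this function wrote itself (hypothetical output prefix).
    to_process = [
        (bucket, key)
        for bucket, key in objects_from_s3_event(event)
        if not key.startswith("thumbnails/")
    ]
    return {"processed": to_process}


sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "media"}, "object": {"key": "uploads/my+photo%281%29.jpg"}}},
        {"s3": {"bucket": {"name": "media"}, "object": {"key": "thumbnails/my+photo.jpg"}}},
    ]
}
result = handler(sample_event, None)
```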

2.6 S3 Select and Athena — Querying Data in Place

Instead of downloading an entire CSV or JSON file to filter it, S3 Select lets you run SQL expressions directly against objects in S3. It pushes the filtering to the storage layer, reducing data transfer and processing time.
-- S3 Select: filter a CSV in S3 without downloading it
SELECT s.name, s.revenue
FROM S3Object s
WHERE s.revenue > 1000000
Athena takes this further: a serverless query engine that runs standard SQL against data in S3. No infrastructure to manage, no data to load. You point it at a bucket, define a schema, and run queries. Pay per query ($5 per TB scanned). When to use which:
  • S3 Select — Simple filtering on a single object. Quick, cheap, no setup.
  • Athena — Complex queries across many objects. Joins, aggregations, window functions. Best with columnar formats (Parquet, ORC) that reduce data scanned (and therefore cost).

2.7 S3 Cost Traps

Trap 1: Request costs. S3 storage is cheap, but GET and PUT requests are not free. GET: $0.0004 per 1,000 requests. PUT: $0.005 per 1,000 requests. A system that makes 100 million GET requests per month pays $40 in request fees alone, on top of storage and transfer.
Trap 2: Data transfer out. S3 to the internet costs $0.09 per GB (first 10 TB/month). Serving 1 TB of media directly from S3 costs $90/month in transfer alone. Use CloudFront ($0.085/GB but with caching, actual transfer from S3 drops dramatically).
Trap 3: Storing everything in STANDARD. Old logs, backups, and infrequently accessed data sitting in S3 Standard is waste. Lifecycle policies are free to configure and can cut storage costs by 50-80%.
Trap 4: Versioning without lifecycle rules. S3 versioning keeps every version of every object. Without lifecycle rules to expire old versions, storage grows silently. A 1 GB file overwritten daily generates 365 GB of versions per year.
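
The request and transfer figures in Traps 1 and 2 can be reproduced with a few lines. The sketch assumes the prices quoted there ($0.0004 per 1,000 GETs, $0.005 per 1,000 PUTs, $0.09 per GB out in the first pricing tier) and counts 1 TB as 1,000 GB.

```python
GET_PRICE_PER_1K = 0.0004   # USD per 1,000 GET requests
PUT_PRICE_PER_1K = 0.005    # USD per 1,000 PUT requests
EGRESS_PRICE_PER_GB = 0.09  # USD per GB to the internet, first 10 TB/month


def s3_request_and_transfer_cost(get_requests, put_requests, egress_gb):
    """Monthly request and egress charges in USD, excluding storage."""
    return {
        "get": get_requests / 1000 * GET_PRICE_PER_1K,
        "put": put_requests / 1000 * PUT_PRICE_PER_1K,
        "egress": egress_gb * EGRESS_PRICE_PER_GB,
    }


costs = s3_request_and_transfer_cost(
    get_requests=100_000_000, put_requests=0, egress_gb=1000
)
```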

Part III — Compute Patterns

3.1 The Compute Decision Matrix

This is the question you will face in every architecture review and most system design interviews: which compute platform for this workload?
| Factor | Lambda | ECS Fargate | ECS on EC2 | EKS | EC2 |
| --- | --- | --- | --- | --- | --- |
| Startup time | 100ms-10s (cold start) | 30-60s | 30-60s | 30-60s | Minutes |
| Max execution | 15 minutes | Unlimited | Unlimited | Unlimited | Unlimited |
| Scaling | Automatic, per-request | Auto, task-based | Auto, instance+task | Auto, pod-based | Auto, instance-based |
| Min cost | $0 (scale to zero) | ~$9/month (1 task) | EC2 instance cost | EC2 + EKS $73/mo | EC2 instance |
| Operational overhead | Minimal | Low | Medium | High | High |
| Container support | Yes (images) | Native | Native | Native | Manual |
| Team size | 1-3 OK | 2-5 OK | 5+ | 5+ | Any |
| Best for | Event-driven, APIs, glue | Web apps, APIs, workers | Cost-optimized steady load | Multi-cloud, complex orchestration | Legacy, GPU, special hardware |
The honest decision framework: Start with Lambda. When you hit its limits (execution time, cold starts, cost at scale), move to Fargate. When you need cost optimization at steady-state high traffic, consider ECS on EC2. Only reach for EKS when you need Kubernetes-specific features (multi-cloud portability, complex scheduling, service mesh) or your team already has deep Kubernetes expertise. Plain EC2 is for when you need GPU access, specific hardware, or are running software that does not containerize well.
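
One way to make the framework concrete is to write it as a function. This is only a sketch of the reasoning above: the flag names and the order in which the checks run are assumptions, and the only hard number is Lambda's 15-minute limit.

```python
def choose_compute(max_runtime_minutes, needs_special_hardware=False,
                   needs_kubernetes=False, steady_high_traffic=False,
                   latency_sensitive=False):
    """Pick a starting compute platform following the decision framework."""
    if needs_special_hardware:       # GPU, specific hardware, hard to containerize
        return "EC2"
    if needs_kubernetes:             # Kubernetes-specific features or existing expertise
        return "EKS"
    if steady_high_traffic:          # cost optimization at steady-state load
        return "ECS on EC2"
    if max_runtime_minutes > 15 or latency_sensitive:
        return "ECS Fargate"         # past Lambda's execution-time or cold start limits
    return "Lambda"                  # the default starting point


default_choice = choose_compute(max_runtime_minutes=1)
batch_choice = choose_compute(max_runtime_minutes=45)
```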

3.2 Fargate Spot — The Cost Optimization Cheat Code

Fargate Spot tasks run on spare AWS capacity at up to 70% discount. AWS can reclaim them after a two-minute warning, delivered as a task state change event and a SIGTERM to the containers.
Use for: batch processing, CI/CD builds, development environments, any workload that can handle interruption.
Do not use for: Production API servers, database tasks, anything where an interruption on two minutes’ notice causes data loss or user-facing errors.
Strategy: Run your baseline on regular Fargate tasks and burst onto Spot. Set up capacity providers to automatically split tasks between the two (e.g., 80% regular, 20% Spot).
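
An 80/20 split can be expressed as an ECS capacity provider strategy, which is passed as `capacityProviderStrategy` when creating or updating a service. The sketch below only builds that structure; the base of 2 on-demand tasks is an assumption, and ECS's exact task placement may round differently from the simple ratio computed here.

```python
def fargate_spot_strategy(base_on_demand_tasks, on_demand_weight=4, spot_weight=1):
    """Capacity provider strategy: a guaranteed on-demand base, then a weighted split."""
    return [
        {"capacityProvider": "FARGATE", "base": base_on_demand_tasks,
         "weight": on_demand_weight},
        {"capacityProvider": "FARGATE_SPOT", "weight": spot_weight},
    ]


def on_demand_share(strategy):
    """Fraction of tasks beyond the base that land on regular Fargate."""
    total = sum(item["weight"] for item in strategy)
    on_demand = next(item["weight"] for item in strategy
                     if item["capacityProvider"] == "FARGATE")
    return on_demand / total


strategy = fargate_spot_strategy(base_on_demand_tasks=2)
share = on_demand_share(strategy)  # weights 4:1 give an 80/20 split
```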

3.3 Auto-Scaling Patterns

Target tracking (simplest, usually best): “Keep average CPU at 60%.” AWS handles the math: it adds tasks when CPU exceeds 60% and removes them when it drops below. Works for most workloads. Also supports custom metrics (queue depth, request count).
Step scaling (more control): “When CPU > 70% for 3 minutes, add 2 tasks. When CPU > 90% for 1 minute, add 5 tasks.” Gives you fine-grained control over scaling speed. Useful when you know your scaling curve.
Predictive scaling: Uses machine learning to forecast traffic based on historical patterns and pre-scales before the load arrives. Ideal for workloads with predictable daily/weekly cycles (e-commerce sites, business applications). Needs at least 24 hours of historical data.
Scale-in is harder than scale-out. It is easy to add capacity when load increases. It is tricky to remove capacity without dropping active connections or requests. Always configure scale-in cooldown periods (5-10 minutes minimum), use connection draining, and scale in more slowly than you scale out.
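
For an ECS service, a target tracking policy with a slower scale-in than scale-out might be configured like this. The function builds the keyword arguments for the Application Auto Scaling call `put_scaling_policy`; the cluster and service names and the cooldown values are assumptions, and the service has to be registered as a scalable target first.

```python
def cpu_target_tracking_policy(cluster, service, target_cpu=60.0,
                               scale_out_cooldown=60, scale_in_cooldown=300):
    """Keyword arguments for application-autoscaling put_scaling_policy(**params)."""
    return {
        "PolicyName": f"{service}-cpu-target-tracking",
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_cpu,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
            },
            # Scale out quickly, scale in slowly (cooldowns are in seconds).
            "ScaleOutCooldown": scale_out_cooldown,
            "ScaleInCooldown": scale_in_cooldown,
        },
    }


policy = cpu_target_tracking_policy("prod", "checkout-api")  # placeholder names
```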

3.4 Graviton Processors — ARM on AWS

AWS Graviton processors are ARM-based chips designed by AWS. They offer up to 40% better price-performance compared to equivalent x86 instances. Available on EC2 (M7g, C7g, R7g families), ECS, EKS, Lambda, and RDS. Why switch:
  • 20-40% cheaper for compute-heavy workloads (no code changes needed for interpreted languages)
  • Better energy efficiency (lower carbon footprint)
  • Same or better single-threaded performance for most workloads
Watch out for:
  • Native compiled code (C, C++, Go, Rust) needs ARM cross-compilation
  • Some third-party software does not have ARM builds yet
  • Docker images must be built for ARM (--platform linux/arm64) or use multi-arch manifests

Part IV — Messaging and Event Patterns

4.1 The Messaging Decision Tree

This is the question that comes up in every system design interview involving async processing. Here is the honest decision framework:
Do you need a simple task queue? -> SQS Standard. Messages are delivered at-least-once, ordering is best-effort. Simple, cheap, scales automatically. This handles 80% of async use cases.
Do you need strict ordering? -> SQS FIFO. Messages are delivered exactly-once (within the deduplication window), in order, grouped by message group ID. Limited to 300 messages/second (3,000 with batching). Use when order matters (financial transactions, sequential processing).
Do you need fan-out (one event, many consumers)? -> SNS for simple fan-out, EventBridge for filtered routing. SNS pushes to all subscribers. EventBridge lets you filter by event content (only route OrderPlaced events where order.total > 1000 to the fraud detection service).
Do you need high-throughput stream processing with replay? -> Kinesis Data Streams. Consumers can replay the stream from any point. Supports multiple independent consumers. Sharded for parallelism. Use for: clickstream analytics, real-time aggregations, change data capture.
| Feature | SQS | SNS | EventBridge | Kinesis |
| --- | --- | --- | --- | --- |
| Model | Queue (point-to-point) | Pub/sub (fan-out) | Event bus (filtered routing) | Stream (ordered log) |
| Ordering | FIFO option | No | No | Per-shard |
| Retention | Up to 14 days | No retention (push only) | Archive + replay | 1-365 days |
| Throughput | Nearly unlimited (Standard) | 30M messages/s | Thousands/s (soft limit) | Per-shard (1 MB/s in, 2 MB/s out) |
| Consumer model | Pull (polling) | Push (HTTP, Lambda, SQS) | Push (100+ targets) | Pull (KCL, Lambda) |
| Replay | No | No | Yes (archive) | Yes (any point in stream) |
| Cost model | Per-request | Per-publish + delivery | Per-event | Per-shard-hour + per-PUT |
| Best for | Task queues, decoupling | Simple fan-out, notifications | Event routing, cross-account events | Real-time analytics, CDC |
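The decision tree above can be written down as a small function, which is a handy way to check that the questions are asked in a consistent order. This is a minimal sketch, not an official selection algorithm: the `Requirements` fields and the order in which they are checked (replay first, then fan-out, then ordering) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    # All field names are hypothetical; they mirror the four questions above.
    needs_replay: bool = False          # consumers must re-read past messages
    needs_fan_out: bool = False         # one event, many consumers
    needs_content_filter: bool = False  # route on fields inside the event
    needs_strict_order: bool = False    # ordering within a message group matters


def choose_messaging_service(req: Requirements) -> str:
    """Walk the decision tree from the most specialised need to the default."""
    if req.needs_replay:
        return "Kinesis Data Streams"
    if req.needs_fan_out:
        return "EventBridge" if req.needs_content_filter else "SNS"
    if req.needs_strict_order:
        return "SQS FIFO"
    return "SQS Standard"  # the default that covers most async use cases


if __name__ == "__main__":
    print(choose_messaging_service(Requirements(needs_fan_out=True)))
```

Real systems often combine services (for example SNS fanning out into several SQS queues), so treat the single return value as a starting point rather than a verdict.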
Cross-chapter connection. For deeper coverage of messaging semantics — delivery guarantees, exactly-once processing, idempotency patterns, and dead letter queues — see Messaging, Concurrency & State. This section focuses on the AWS-specific service selection; that chapter covers the distributed systems principles that apply regardless of provider.

4.2 SQS Deep Dive

  • Visibility timeout: When a consumer reads a message, SQS hides it from other consumers for the visibility timeout period (default 30 seconds). If the consumer processes it and deletes it within that window, done. If the consumer crashes, the timeout expires and the message becomes visible again for another consumer. Set the timeout to slightly longer than your maximum processing time.
  • Dead Letter Queue (DLQ): After a configurable number of failed processing attempts (maxReceiveCount), SQS moves the message to a DLQ. Monitor your DLQ. An empty DLQ means your processing is healthy. A growing DLQ means something is wrong. Set up CloudWatch alarms on DLQ message count.
  • Batching: SQS supports reading up to 10 messages at a time and sending up to 10 messages at a time. Always batch when possible: it reduces API calls and therefore cost by up to 10x.
  • Long polling: By default, ReceiveMessage returns immediately even if no messages are available (short polling). Set WaitTimeSeconds to 20 (maximum) for long polling: the call blocks until a message arrives or the timeout expires. This eliminates empty responses and reduces API call costs.
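To make the batching claim concrete, here is a minimal sketch of splitting outgoing messages into groups of at most 10 and counting the API calls that result. The helper names `chunked` and `api_calls_needed` are hypothetical; in a real producer each chunk would be handed to a batch send call in your SDK of choice.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

SQS_MAX_BATCH = 10  # per-call limit for batch send/receive, as described above


def chunked(items: Sequence[T], size: int = SQS_MAX_BATCH) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each holding at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def api_calls_needed(message_count: int, batch_size: int = SQS_MAX_BATCH) -> int:
    """Ceiling division: how many send calls are needed for `message_count` messages."""
    return -(-message_count // batch_size)


if __name__ == "__main__":
    messages = [f"order-{i}" for i in range(25)]
    for batch in chunked(messages):
        print(len(batch), batch[0], "...", batch[-1])
    print("calls unbatched:", api_calls_needed(1000, batch_size=1))
    print("calls batched:  ", api_calls_needed(1000))
```

Sending 1,000 messages one at a time takes 1,000 calls; in full batches it takes 100, which is where the "up to 10x" figure comes from.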

4.3 EventBridge Patterns

EventBridge is the most underused AWS service relative to its power. Think of it as a serverless event router with content-based filtering. Event patterns let you filter events by content:
{
  "source": ["my-app.orders"],
  "detail-type": ["OrderPlaced"],
  "detail": {
    "total": [{ "numeric": [">", 1000] }],
    "country": ["US", "CA"]
  }
}
This rule only matches OrderPlaced events with total > $1,000 from US or Canada. Events that do not match are silently ignored, with no code needed.
Schema registry automatically discovers and stores event schemas as events flow through EventBridge. You can generate type-safe code bindings from these schemas. This solves one of the biggest pain points of event-driven architecture: knowing what events look like.
Cross-account events let you route events between AWS accounts, which is critical for multi-account organizational architectures where production, staging, and shared services run in separate accounts.
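To build intuition for how the pattern above is evaluated, here is a minimal local matcher that can also serve as a unit-test helper. It is a sketch under stated assumptions, not EventBridge's implementation: it handles only the constructs used in the example (lists of literal values, nested objects, and `numeric` comparisons), and the function name `matches` is hypothetical.

```python
import operator
from typing import Any, Dict, List

_NUMERIC_OPS = {
    "<": operator.lt, "<=": operator.le, "=": operator.eq,
    ">": operator.gt, ">=": operator.ge,
}


def _matches_numeric(value: Any, spec: List[Any]) -> bool:
    """spec looks like [">", 1000] or [">", 0, "<=", 5]; every pair must hold."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    pairs = zip(spec[0::2], spec[1::2])
    return all(_NUMERIC_OPS[op](value, bound) for op, bound in pairs)


def _matches_any(value: Any, candidates: List[Any]) -> bool:
    """A field matches if ANY candidate in the pattern's list matches it."""
    for candidate in candidates:
        if isinstance(candidate, dict) and "numeric" in candidate:
            if _matches_numeric(value, candidate["numeric"]):
                return True
        elif value == candidate:
            return True
    return False


def matches(pattern: Dict[str, Any], event: Dict[str, Any]) -> bool:
    """Every key in the pattern must be present in the event and match."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):  # nested object: recurse
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif not _matches_any(actual, expected):
            return False
    return True


PATTERN = {
    "source": ["my-app.orders"],
    "detail-type": ["OrderPlaced"],
    "detail": {"total": [{"numeric": [">", 1000]}], "country": ["US", "CA"]},
}

if __name__ == "__main__":
    event = {
        "source": "my-app.orders",
        "detail-type": "OrderPlaced",
        "detail": {"total": 1500, "country": "CA", "orderId": "hypothetical-123"},
    }
    print(matches(PATTERN, event))
```

Note the two rules the sketch encodes: fields within a pattern are ANDed together, while the values inside each list are ORed. Extra fields in the event (such as the hypothetical `orderId`) are ignored.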

4.4 Event-Driven Anti-Patterns

  • The event spaghetti: Every service emits events, every other service consumes them, nobody knows the full event flow. Solve with: an event catalog (EventBridge schema registry), event ownership documentation, and architectural diagrams that show event flows.
  • The event that should have been a command: Events describe what happened (“OrderPlaced”). Commands describe what should happen (“SendConfirmationEmail”). If your event consumers are doing imperative work that the publisher expects to happen, you have a disguised command. Use a queue (SQS) for commands, events (EventBridge/SNS) for notifications.
  • Missing dead letter queues: Events that fail processing silently disappear. Every async integration must have a failure path: a DLQ, an on-failure destination, or error logging at minimum.

Part V — Database Patterns on AWS

5.1 RDS vs Aurora

Cross-chapter connection. This section covers the AWS service-level differences between RDS and Aurora — pricing, availability, and scaling behavior. For the database engine internals that run inside these managed services — MVCC, WAL, VACUUM, query planning, connection pooling, partitioning — see Database Deep Dives: PostgreSQL Internals. For DynamoDB access pattern design, single-table design, partition key strategy, and GSI patterns, see Database Deep Dives: DynamoDB Strategies. Understanding the engine internals is what lets you tune RDS/Aurora effectively rather than just throwing bigger instances at performance problems.
RDS is managed PostgreSQL/MySQL/MariaDB/Oracle/SQL Server. AWS handles patching, backups, and replication. You choose the instance size and manage storage scaling.
Aurora is AWS’s cloud-native database engine compatible with PostgreSQL and MySQL. It uses a fundamentally different storage architecture: a distributed, fault-tolerant storage layer that replicates data six ways across three AZs.
When Aurora is worth 2x the cost:
  • Availability requirements: Aurora’s storage layer survives losing two of its six copies without write impact and three copies without read impact (writes need a 4-of-6 quorum, reads need 3 of 6). RDS Multi-AZ failover takes 60-120 seconds; Aurora failover takes under 30 seconds.
  • Read-heavy workloads: Aurora supports up to 15 read replicas (RDS supports 5 to 15 depending on the engine), with replication lag typically under 20ms because replicas share the same storage layer (RDS lag can be seconds).
  • Auto-scaling storage: Aurora storage grows automatically in 10 GB increments up to 128 TB. No need to provision storage upfront.
  • Large databases: For databases over 1 TB, Aurora’s distributed storage outperforms RDS’s EBS-based storage significantly.
When RDS is fine (save the money):
  • Small databases (< 100 GB) with moderate traffic
  • Development and staging environments
  • When you need Oracle or SQL Server (Aurora only supports PostgreSQL and MySQL)
  • When your workload is write-heavy and does not benefit from Aurora’s read replica architecture

5.2 Aurora Serverless v2

Aurora Serverless v2 scales database compute automatically based on demand. You define a minimum and maximum capacity (in ACUs, Aurora Capacity Units), and Aurora adjusts in near real-time. You pay only for the capacity you use, measured in ACU-seconds.
When it shines: Development environments, applications with variable traffic patterns, new products where you do not know the traffic profile yet. You get the Aurora storage engine’s reliability without paying for a provisioned instance 24/7.
When it does not: Highly predictable, steady-state workloads. Provisioned Aurora instances are cheaper when utilization is consistently high. Also, ACU scaling has a brief pause that can cause latency spikes during rapid scale-up.

5.3 ElastiCache — Redis vs Memcached on AWS

| Factor | ElastiCache Redis | ElastiCache Memcached |
| --- | --- | --- |
| Data structures | Strings, hashes, lists, sets, sorted sets, streams, geospatial | Strings only |
| Persistence | Yes (snapshots, AOF) | No (pure cache) |
| Replication | Yes (read replicas, Multi-AZ) | No |
| Cluster mode | Yes (horizontal scaling) | Yes (multi-node) |
| Pub/sub | Yes | No |
| Use case | Session store, leaderboards, rate limiting, real-time analytics | Simple caching, session caching |
The honest answer in 2025-2026: Just use Redis. Memcached’s only advantage is slightly simpler multi-threaded architecture for pure caching workloads. Redis does everything Memcached does plus vastly more. The operational simplicity of running one caching technology (Redis) instead of choosing between two is worth far more than Memcached’s marginal performance edge in narrow scenarios.
Cross-chapter connection. For deeper coverage of caching patterns (cache-aside, write-through, cache stampede, invalidation strategies) and observability, see Caching & Observability. For database internals (indexing, transactions, replication, sharding), see APIs & Databases.

Part VI — Networking and Security on AWS

6.1 VPC Design Patterns

Cross-chapter connection. VPC design is the AWS implementation of networking fundamentals covered in Networking & Deployment — DNS resolution (Route 53), load balancing (ALB/NLB), TLS termination, and service discovery. This section covers the AWS-specific patterns; that chapter covers the protocols and concepts underneath.
A VPC (Virtual Private Cloud) is your isolated network in AWS. Every resource you launch (EC2 instances, RDS databases, Lambda functions in a VPC, ECS tasks) lives inside a VPC. Getting VPC design right is foundational — changing it later is painful. The standard pattern:
VPC (10.0.0.0/16)
├── Public Subnet A (10.0.1.0/24) -- AZ-a
│   └── ALB, NAT Gateway, Bastion Host
├── Public Subnet B (10.0.2.0/24) -- AZ-b
│   └── ALB, NAT Gateway
├── Private Subnet A (10.0.10.0/24) -- AZ-a
│   └── Application servers, ECS tasks
├── Private Subnet B (10.0.11.0/24) -- AZ-b
│   └── Application servers, ECS tasks
├── Database Subnet A (10.0.20.0/24) -- AZ-a
│   └── RDS Primary
└── Database Subnet B (10.0.21.0/24) -- AZ-b
    └── RDS Standby
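Before committing a layout like the one above to infrastructure code, it is cheap to sanity-check the CIDR plan offline. The sketch below uses Python's standard `ipaddress` module to confirm that every subnet sits inside the VPC range and that no two subnets overlap. The subnet names and the `validate_layout` helper are hypothetical; the CIDRs are the ones from the diagram.

```python
import ipaddress
from itertools import combinations
from typing import Dict, List

VPC_CIDR = "10.0.0.0/16"
SUBNETS: Dict[str, str] = {
    "public-a": "10.0.1.0/24",
    "public-b": "10.0.2.0/24",
    "private-a": "10.0.10.0/24",
    "private-b": "10.0.11.0/24",
    "database-a": "10.0.20.0/24",
    "database-b": "10.0.21.0/24",
}

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves five addresses in every subnet


def validate_layout(vpc_cidr: str, subnets: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the plan is consistent."""
    vpc = ipaddress.ip_network(vpc_cidr)
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in subnets.items()}
    problems = []
    for name, net in nets.items():
        if not net.subnet_of(vpc):
            problems.append(f"{name} ({net}) is outside the VPC range {vpc}")
    for (name_a, net_a), (name_b, net_b) in combinations(nets.items(), 2):
        if net_a.overlaps(net_b):
            problems.append(f"{name_a} ({net_a}) overlaps {name_b} ({net_b})")
    return problems


def usable_addresses(cidr: str) -> int:
    """Addresses left for your resources after the AWS-reserved ones."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


if __name__ == "__main__":
    print(validate_layout(VPC_CIDR, SUBNETS) or "layout OK")
    print("usable addresses per /24:", usable_addresses("10.0.1.0/24"))
```

The gaps between the public (10.0.1-2), private (10.0.10-11), and database (10.0.20-21) ranges leave room to add a third AZ to each tier later without renumbering.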
NAT Gateway vs VPC Endpoints: Resources in private subnets need a NAT Gateway ($0.045/hour + $0.045/GB processed) to reach the internet. But most “internet” traffic from private subnets is actually going to AWS services (S3, DynamoDB, SQS, ECR). VPC Endpoints provide private connectivity to AWS services without going through the NAT Gateway.
  • Gateway Endpoints (free): S3, DynamoDB. Use these always.
  • Interface Endpoints ($0.01/hour + $0.01/GB): Everything else (SQS, ECR, Secrets Manager, CloudWatch). Use when the cost is less than NAT Gateway traffic.
NAT Gateway is one of the most expensive items on AWS bills for many teams. A single NAT Gateway processing 1 TB of data per month costs $45 in processing fees plus $32 in hourly fees, which is $77/month for what feels like basic networking. For a team pulling Docker images from ECR multiple times per day across 50 ECS tasks, the NAT Gateway bill can be hundreds of dollars monthly. VPC endpoints for ECR, S3, and CloudWatch Logs often pay for themselves within the first month.
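The arithmetic behind those figures is worth having as a reusable calculation. The sketch below uses the per-hour and per-GB rates quoted in this section and assumes a 730-hour month and 1 TB = 1,000 GB; the function names and the idea of counting interface endpoints as "endpoint-AZ pairs" are illustrative assumptions, and actual rates vary by Region.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def nat_gateway_monthly_cost(gb_processed: float,
                             hourly_rate: float = 0.045,
                             per_gb_rate: float = 0.045) -> float:
    """Hourly charge for one NAT Gateway plus its data-processing charge."""
    return hourly_rate * HOURS_PER_MONTH + per_gb_rate * gb_processed


def interface_endpoint_monthly_cost(gb_processed: float,
                                    endpoint_az_pairs: int = 1,
                                    hourly_rate: float = 0.01,
                                    per_gb_rate: float = 0.01) -> float:
    """Hourly charge per endpoint per AZ, plus the data-processing charge."""
    return endpoint_az_pairs * hourly_rate * HOURS_PER_MONTH + per_gb_rate * gb_processed


if __name__ == "__main__":
    traffic_gb = 1000  # 1 TB of AWS-service traffic per month
    nat = nat_gateway_monthly_cost(traffic_gb)
    # Hypothetical setup: 3 interface endpoints deployed in 2 AZs = 6 pairs.
    endpoints = interface_endpoint_monthly_cost(traffic_gb, endpoint_az_pairs=6)
    print(f"NAT Gateway:         ${nat:.2f}/month")
    print(f"Interface endpoints: ${endpoints:.2f}/month")
```

One caveat on reading the comparison: if you still need the NAT Gateway for genuine internet traffic, its hourly fee stays on the bill, so the saving comes from the per-GB processing charge that moves to the cheaper endpoint rate (or to zero for S3 and DynamoDB gateway endpoints).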

6.2 Security Groups vs NACLs

Both control network traffic, but they work differently:
| Feature | Security Groups | Network ACLs |
| --- | --- | --- |
| Level | Instance/ENI level | Subnet level |
| Statefulness | Stateful (return traffic automatically allowed) | Stateless (must explicitly allow both directions) |
| Rules | Allow rules only | Allow AND deny rules |
| Evaluation | All rules evaluated together | Rules evaluated in order (lowest number first) |
| Default | Deny all inbound, allow all outbound | Allow all inbound and outbound |
| Use case | Primary security control | Additional subnet-level defense |
In practice: Security groups do 95% of the work. NACLs are a belt-and-suspenders defense layer that most teams configure once and rarely touch.

6.3 IAM Least Privilege

Cross-chapter connection. This section covers IAM as a cloud service primitive. For the broader authentication and authorization patterns that IAM implements — OAuth 2.0, OIDC, RBAC, ABAC, Zero Trust Architecture, service-to-service authentication — see Authentication & Security. IAM roles for Lambda/ECS/EC2 are the implementation of the principles covered there.
IAM (Identity and Access Management) is the most important security service on AWS. Every API call is authenticated and authorized through IAM. Getting IAM right is the difference between a secure architecture and a breach waiting to happen. Principles:
  • Never use root credentials. Create IAM users or roles for everything.
  • Use roles, not long-lived access keys. EC2 instance profiles, ECS task roles, Lambda execution roles — all use temporary credentials that rotate automatically.
  • Scope permissions narrowly. "Resource": "*" is almost never necessary. Specify the exact ARN of the resource.
  • Use conditions. Restrict by IP range, VPC, time of day, MFA requirement, or source account.
Trust policies define who can assume a role. This is separate from the role’s permissions (which define what the role can do). A Lambda execution role’s trust policy says “Lambda service can assume this role.” An ECS task role’s trust policy says “ECS task service can assume this role.” The permissions policy is where least privilege is enforced; a narrowly scoped statement looks like this:
{
  "Effect": "Allow",
  "Action": "s3:GetObject",
  "Resource": "arn:aws:s3:::my-bucket/reports/*"
}
This grants read access to only the reports/ prefix of one specific bucket — not all of S3.
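To make the split between "who can assume the role" and "what the role can do" concrete, here is a minimal sketch that builds both documents as plain Python dictionaries and prints them as JSON. The helper name `service_trust_policy` is hypothetical; the service principals `lambda.amazonaws.com` and `ecs-tasks.amazonaws.com` are the ones used for Lambda execution roles and ECS task roles.

```python
import json


def service_trust_policy(service_principal: str) -> dict:
    """Trust policy: lets one AWS service assume the role. Grants no permissions by itself."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": service_principal},
            "Action": "sts:AssumeRole",
        }],
    }


# Who may assume the role (one trust policy per role).
LAMBDA_TRUST = service_trust_policy("lambda.amazonaws.com")
ECS_TASK_TRUST = service_trust_policy("ecs-tasks.amazonaws.com")

# What the role may do once assumed: the scoped statement from the text above.
READ_REPORTS_PERMISSIONS = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "s3:GetObject",
        "Resource": "arn:aws:s3:::my-bucket/reports/*",
    }],
}

if __name__ == "__main__":
    print(json.dumps(LAMBDA_TRUST, indent=2))
    print(json.dumps(READ_REPORTS_PERMISSIONS, indent=2))
```

A role needs both halves: the trust policy has a `Principal` and no `Resource`, while the permissions policy has a `Resource` and no `Principal`.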

6.4 Secrets Manager vs Parameter Store

Both store configuration and secrets. The difference is scope and cost.
| Feature | Secrets Manager | Systems Manager Parameter Store |
| --- | --- | --- |
| Cost | $0.40/secret/month + $0.05/10K API calls | Free (standard), $0.05/10K for advanced |
| Auto-rotation | Built-in for RDS, Redshift, DocumentDB | Manual (you write the Lambda) |
| Cross-account | Native support | Requires additional IAM |
| Encryption | Always encrypted (KMS) | Optional encryption (KMS) |
| Best for | Database passwords, API keys, rotation-required secrets | App configuration, feature flags, non-sensitive parameters |
The cost-effective pattern: Use Parameter Store for application configuration (feature flags, endpoint URLs, non-sensitive settings). Use Secrets Manager only for credentials that need automatic rotation. A team with 50 secrets in Secrets Manager pays $20/month; the same 50 values in Parameter Store's standard tier cost $0.

6.5 WAF and Shield

  • AWS WAF (Web Application Firewall): Filters HTTP/HTTPS requests to CloudFront, ALB, or API Gateway. You define rules to block common attacks: SQL injection, XSS, IP reputation lists, rate limiting, geo-blocking. Managed rule groups from AWS and partners provide pre-built protection against OWASP Top 10 threats.
  • AWS Shield Standard: Free DDoS protection for all AWS resources. Protects against common network-layer (L3/L4) attacks. Automatically enabled.
  • AWS Shield Advanced: $3,000/month. Provides DDoS cost protection (AWS credits you for scaling costs during an attack), 24/7 access to the AWS DDoS Response Team, advanced detection, and real-time attack visibility. Worth it only for organizations where a DDoS attack has significant financial impact.

6.6 Multi-Account Strategy

Most teams start with a single AWS account. This works fine for a side project or an early startup. But as the organization grows — multiple teams, multiple environments, compliance requirements, cost attribution needs — a single account becomes a liability. Multi-account strategy is not an advanced topic you deal with “later.” It is a foundational decision that gets exponentially harder to change the longer you wait.
Cross-chapter connection. Multi-account strategy intersects with several other chapters: Authentication & Security covers the IAM and access control patterns that SCPs enforce at the organization level. Compliance, Cost & Debugging covers FinOps practices like showback/chargeback that multi-account enables. Networking & Deployment covers the VPC peering, Transit Gateway, and DNS patterns that connect accounts.
Why multiple accounts:
  • Blast radius containment. A misconfigured IAM policy, a compromised credential, or a runaway resource in production cannot affect staging or other teams’ workloads. Each account is a hard security boundary.
  • Cost isolation. Each account generates its own bill. No tagging strategy needed to separate prod from dev — it is structural. Finance gets clear cost attribution by account.
  • Service limit isolation. Lambda concurrency limits, API Gateway throttling, EC2 instance limits — all are per-account. A runaway batch job in the data team’s account cannot starve the API team’s Lambda functions.
  • Compliance boundaries. PCI-DSS, HIPAA, SOC2, and GDPR all benefit from isolating regulated workloads into dedicated accounts with stricter controls. Auditors love clean account boundaries.
  • Independent deployments. Teams can deploy, experiment, and break things in their own accounts without coordination overhead.
AWS Organizations — the management layer: AWS Organizations lets you create and manage multiple AWS accounts from a central management account. Key features:
  • Organizational Units (OUs): Group accounts hierarchically. Common structure:
Root
├── Security OU
│   ├── Log Archive Account (centralized CloudTrail, Config)
│   └── Security Tooling Account (GuardDuty, Security Hub)
├── Infrastructure OU
│   ├── Networking Account (Transit Gateway, DNS, VPN)
│   └── Shared Services Account (CI/CD, container registries, artifact repos)
├── Workloads OU
│   ├── Production OU
│   │   ├── Team A Prod Account
│   │   └── Team B Prod Account
│   └── Non-Production OU
│       ├── Team A Dev Account
│       ├── Team A Staging Account
│       └── Team B Dev Account
└── Sandbox OU
    └── Developer Sandbox Accounts (experimentation, auto-cleanup)
  • Consolidated billing: All accounts roll up to a single bill. Volume discounts (Savings Plans, Reserved Instances) apply across the entire organization. You buy RIs in one account and they apply to matching usage in any account.
  • Service Control Policies (SCPs): Organization-wide guardrails that override IAM permissions. An SCP is a ceiling, not a floor — even if an IAM policy grants permission, an SCP can deny it.
Service Control Policies (SCPs) — the guardrails: SCPs are the most powerful governance tool in AWS Organizations. They restrict what any principal in an account can do, including the account’s root user. Common SCPs:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyRegionsOutsideAllowed",
      "Effect": "Deny",
      "Action": "*",
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["us-east-1", "us-west-2", "eu-west-1"]
        }
      }
    }
  ]
}
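This SCP denies every action, not just resource creation, in any region outside the three allowed ones, which is critical for data sovereignty compliance where EU data must stay in EU regions. Global services such as IAM, Route 53, and CloudFront are served from us-east-1, so if your allowed list omits us-east-1, exempt those services with NotAction instead of denying "*". Common SCP patterns: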
  • Region restriction: Deny all actions outside approved regions.
  • Prevent root user access: Deny all actions by the root user (except account recovery).
  • Require encryption: Deny s3:PutObject unless s3:x-amz-server-side-encryption is set. Deny rds:CreateDBInstance unless StorageEncrypted is true.
  • Prevent leaving the organization: Deny organizations:LeaveOrganization on all accounts.
  • Deny expensive services: Deny ec2:RunInstances for instance types larger than xlarge in sandbox accounts. Prevent accidental p4d.24xlarge launches that cost $32/hour.
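One detail of the region-restriction SCP shown earlier is easy to misread: with a negated operator such as `StringNotEquals` and a list of values, the condition is true only when the request value differs from every entry in the list. The sketch below models just that rule for the three-region policy; it is an illustration of the evaluation logic, not a policy simulator, and the function names are hypothetical.

```python
from typing import List

ALLOWED_REGIONS = ["us-east-1", "us-west-2", "eu-west-1"]


def string_not_equals(actual: str, values: List[str]) -> bool:
    """True only if `actual` matches none of `values` (negated operators behave as NOR)."""
    return all(actual != value for value in values)


def scp_denies(requested_region: str) -> bool:
    """The Deny statement applies when the condition is true."""
    return string_not_equals(requested_region, ALLOWED_REGIONS)


if __name__ == "__main__":
    for region in ("eu-west-1", "ap-south-1"):
        verdict = "DENIED" if scp_denies(region) else "not denied by this SCP"
        print(f"{region}: {verdict}")
```

"Not denied" is not the same as "allowed": an SCP only sets the ceiling, so the calling principal still needs an IAM policy that grants the action.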
Account strategies:
| Strategy | When to Use | Pros | Cons |
| --- | --- | --- | --- |
| Account-per-environment | Small-to-medium orgs, single team | Simple, clear prod/staging/dev separation | Does not scale to many teams |
| Account-per-team | Medium orgs, 3-10 teams | Team autonomy, independent limits and billing | Many accounts to manage, cross-team access complexity |
| Account-per-workload | Large orgs, regulated industries | Maximum isolation, granular compliance | Account sprawl, significant management overhead |
| Hybrid (recommended) | Most growing organizations | Balances isolation with manageability | Requires thoughtful OU structure |
The practical recommendation for most teams: Start with account-per-environment (prod, staging, dev) plus a shared-services account (CI/CD, ECR, DNS). When you add a second team that needs isolation, add account-per-team within each environment OU. This hybrid approach gives you 80% of the benefit with 20% of the complexity. Use AWS Control Tower to automate account provisioning with pre-configured guardrails, logging, and networking — it eliminates weeks of manual setup per account.
Cross-account access patterns:
  • IAM cross-account roles: Account A’s Lambda assumes a role in Account B to read from Account B’s S3 bucket. The trust policy on Account B’s role explicitly allows Account A’s principal.
  • Resource-based policies: S3 bucket policies, KMS key policies, and SNS topic policies can grant access to specific external accounts without role assumption.
  • AWS RAM (Resource Access Manager): Share VPC subnets, Transit Gateway, and other resources across accounts without duplicating them.
  • EventBridge cross-account event buses: Route events from workload accounts to a central observability account for unified monitoring.
Multi-account governance at scale: As organizations grow past 20-30 accounts, manual governance breaks down. The following patterns keep multi-account estates healthy:
  • Account vending machine. Automate new account creation with Control Tower Account Factory or a custom pipeline (Terraform + Step Functions). Every new account is provisioned with: baseline SCPs applied via its OU, CloudTrail logging shipped to the centralized log archive, GuardDuty enrolled in the delegated administrator account, VPC with Transit Gateway attachment pre-configured, and a pre-seeded CI/CD role for the owning team. A developer requests an account through a self-service portal; 15 minutes later, the account is ready with all guardrails in place. No tickets, no waiting.
  • Drift detection. AWS Config rules and Config Conformance Packs continuously audit every account against your organization’s baseline. Common rules: “all S3 buckets must have versioning enabled,” “no security groups allowing 0.0.0.0/0 inbound on port 22,” “all RDS instances must be encrypted.” When drift is detected, AWS Config sends findings to Security Hub, which triggers SNS notifications or auto-remediation Lambda functions. Without drift detection, your carefully crafted guardrails erode within weeks as engineers work around them.
  • Centralized audit and compliance. AWS CloudTrail Organization Trail captures every API call across every account in the organization and delivers logs to the centralized log archive account. The log archive account has an S3 bucket with Object Lock (WORM — Write Once Read Many) so that even a compromised account cannot tamper with its own audit trail. CloudTrail Lake enables SQL-based queries across the organization-wide audit log for investigations.
  • Delegated administrator accounts. Instead of running all security tools from the management account (which should have minimal usage), delegate administration to specialized accounts: GuardDuty delegated admin, Security Hub delegated admin, AWS Config delegated admin. This follows the principle of least privilege at the account level — the management account only manages organization structure and SCPs.
Workload Identity with OIDC — eliminating long-lived credentials: The shift toward workload identity federation is one of the most impactful security improvements in cloud architecture. Instead of distributing IAM access keys (long-lived credentials that can be leaked, committed to Git, or stolen), workloads authenticate using short-lived tokens from trusted identity providers. How OIDC federation works with AWS:
  1. Your CI/CD platform (GitHub Actions, GitLab CI, CircleCI) issues an OIDC token to the running job. This token contains claims about the job: repository name, branch, workflow, actor.
  2. The job calls AWS STS AssumeRoleWithWebIdentity, presenting the OIDC token.
  3. AWS validates the token against the OIDC provider’s public keys (configured as an IAM Identity Provider in your account).
  4. If valid, AWS returns temporary credentials (access key, secret key, session token) scoped to a specific IAM role with a short TTL (typically 1 hour).
  5. The job uses these temporary credentials to deploy. When the job ends, the credentials expire automatically.
# GitHub Actions example -- no stored AWS credentials
name: Deploy
on:
  push:
    branches: [main]
permissions:
  id-token: write  # Required for OIDC
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/GitHubDeployRole
          aws-region: us-east-1
          # No access keys stored anywhere -- OIDC federation handles auth
      - run: aws ecs update-service --cluster prod --service api --force-new-deployment
Why this matters: Zero long-lived credentials means zero credential rotation burden, zero risk of credential leakage in logs or Git history, and a complete audit trail linking every AWS API call back to a specific CI/CD job, branch, and actor. If you are still distributing IAM access keys to CI/CD systems, fixing this is the highest-impact security improvement you can make.
OIDC trust policy on the IAM role:
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Federated": "arn:aws:iam::123456789012:oidc-provider/token.actions.githubusercontent.com"
    },
    "Action": "sts:AssumeRoleWithWebIdentity",
    "Condition": {
      "StringEquals": {
        "token.actions.githubusercontent.com:aud": "sts.amazonaws.com"
      },
      "StringLike": {
        "token.actions.githubusercontent.com:sub": "repo:my-org/my-repo:ref:refs/heads/main"
      }
    }
  }]
}
The Condition block is critical — it restricts which repositories and branches can assume this role. Without it, any GitHub repository could assume your deployment role. The same OIDC pattern works for Kubernetes pods (EKS Pod Identity / IAM Roles for Service Accounts), GitLab CI, and any OIDC-compliant identity provider.
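To see why the `sub` condition matters, it helps to evaluate a few token claims against it by hand. The sketch below models the two checks from the trust policy: an exact match on `aud` and a `StringLike` match on `sub`, where `*` matches any run of characters and `?` matches a single character. The function names and sample claims are hypothetical, and this is not how STS is implemented; it only mirrors the matching rules.

```python
import re
from typing import Dict


def string_like(pattern: str, value: str) -> bool:
    """IAM-style wildcard match: '*' = any run of characters, '?' = exactly one character."""
    regex = "".join(
        ".*" if ch == "*" else "." if ch == "?" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


def may_assume(claims: Dict[str, str], expected_aud: str, sub_pattern: str) -> bool:
    """Both conditions in the trust policy must hold."""
    return (claims.get("aud") == expected_aud
            and string_like(sub_pattern, claims.get("sub", "")))


AUD = "sts.amazonaws.com"
MAIN_ONLY = "repo:my-org/my-repo:ref:refs/heads/main"

if __name__ == "__main__":
    main_job = {"aud": AUD, "sub": "repo:my-org/my-repo:ref:refs/heads/main"}
    feature_job = {"aud": AUD, "sub": "repo:my-org/my-repo:ref:refs/heads/feature-x"}
    other_repo = {"aud": AUD, "sub": "repo:someone-else/fork:ref:refs/heads/main"}
    for name, claims in [("main", main_job), ("feature", feature_job), ("other", other_repo)]:
        print(name, may_assume(claims, AUD, MAIN_ONLY))
```

Loosening the pattern to `repo:my-org/my-repo:*` would admit every branch, tag, and pull request in that repository, and `repo:my-org/*` would admit every repository in the organization, so widen it deliberately.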

Part VII — Cost Engineering

Cross-chapter connection. This section covers AWS-specific pricing mechanics — Reserved Instances, Savings Plans, Spot, and cost allocation tags. For the broader FinOps discipline — cost culture, showback/chargeback models, unit economics, cloud budgeting frameworks, and organizational cost governance — see Compliance, Cost & Debugging: Cost-Aware Engineering. A senior engineer knows the AWS pricing levers and the organizational practices that keep cloud spend sustainable.

7.1 Reserved Instances vs Savings Plans vs Spot

This is the compute pricing decision framework that every team running significant AWS workloads must understand.
| Option | Discount | Commitment | Flexibility | Best For |
| --- | --- | --- | --- | --- |
| On-Demand | 0% (baseline) | None | Complete | Unknown workloads, testing, burst capacity |
| Savings Plans (Compute) | Up to 66% | 1 or 3 years, $/hour commitment | Any instance family, region, OS, tenancy | Teams that know they will spend $X/hour on compute but want flexibility |
| Savings Plans (EC2) | Up to 72% | 1 or 3 years, specific instance family | Same family, any size/OS/tenancy | Workloads committed to a specific instance family |
| Reserved Instances | Up to 72% | 1 or 3 years, specific instance type | Limited (convertible RIs have some flex) | Steady-state databases, well-known workloads |
| Spot Instances | Up to 90% | None (but can be terminated) | Any instance type with spare capacity | Batch processing, CI/CD, fault-tolerant workloads |
The decision framework:
  1. Cover your steady-state baseline (databases, minimum API capacity) with Savings Plans or RIs.
  2. Handle predictable burst capacity with On-Demand (auto-scaling groups).
  3. Run fault-tolerant batch work on Spot.
  4. Use Compute Savings Plans when you are not sure which instance types you will use — they apply across Lambda, Fargate, and EC2.
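The framework above amounts to pricing three slices of usage differently, which a few lines of arithmetic can make tangible. In this sketch every number is a hypothetical input chosen for illustration: the $0.10 on-demand rate, the 40% commitment discount, and the 70% Spot discount are assumptions (the table only gives "up to" ceilings), and the function name is made up.

```python
def blended_monthly_cost(baseline_hours: float,
                         burst_hours: float,
                         batch_hours: float,
                         on_demand_rate: float,
                         commit_discount: float,
                         spot_discount: float) -> float:
    """Price each slice of instance-hours with the purchasing option the framework suggests."""
    baseline = baseline_hours * on_demand_rate * (1 - commit_discount)  # Savings Plans / RIs
    burst = burst_hours * on_demand_rate                                # On-Demand
    batch = batch_hours * on_demand_rate * (1 - spot_discount)          # Spot
    return baseline + burst + batch


if __name__ == "__main__":
    # Hypothetical fleet: 10 always-on instances (7,300 h), 1,000 burst hours, 2,000 batch hours.
    usage = dict(baseline_hours=7300, burst_hours=1000, batch_hours=2000, on_demand_rate=0.10)
    all_on_demand = blended_monthly_cost(**usage, commit_discount=0.0, spot_discount=0.0)
    blended = blended_monthly_cost(**usage, commit_discount=0.40, spot_discount=0.70)
    print(f"all On-Demand: ${all_on_demand:,.2f}")
    print(f"blended:       ${blended:,.2f}")
```

With these assumed inputs the bill drops from about $1,030 to about $598. The commitment discount only pays off if the baseline hours are really used, which is why step 1 says to cover the steady-state floor and nothing more.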

7.2 Cost Allocation Tags

Tags are key-value pairs you attach to AWS resources. They flow through to Cost Explorer and billing reports, letting you answer: “How much does the recommendations team spend on compute?” or “What does our staging environment cost?” Why they matter before day 1: Retroactive tagging is painful and often incomplete. Resources created without tags generate unattributable costs. Once your bill is $50K/month with 500 untagged resources, you have a cost visibility problem that takes weeks to clean up. Mandatory tag strategy:
| Tag Key | Purpose | Example Values |
| --- | --- | --- |
| team | Cost attribution to team | platform, payments, ml |
| environment | Separate prod from dev costs | production, staging, dev |
| service | Map costs to services | api-gateway, order-processor |
| cost-center | Finance/budgeting alignment | eng-100, data-200 |
Enable AWS Cost Allocation Tags in the Billing console to make your tags appear in Cost Explorer reports. Use AWS Organizations Tag Policies to enforce tagging compliance across accounts.
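A tagging standard is easiest to keep when it is checked automatically, for example in CI against the tags declared in infrastructure code. Here is a minimal sketch of such a check using the four mandatory keys from the table. The helper names and the idea of restricting `environment` to the three example values are assumptions for illustration; Tag Policies remain the enforcement mechanism inside AWS.

```python
from typing import Dict, List

REQUIRED_TAGS = ("team", "environment", "service", "cost-center")
ALLOWED_ENVIRONMENTS = {"production", "staging", "dev"}  # assumed closed set


def missing_tags(tags: Dict[str, str]) -> List[str]:
    """Required keys that are absent or blank."""
    return [key for key in REQUIRED_TAGS if not tags.get(key, "").strip()]


def tag_problems(tags: Dict[str, str]) -> List[str]:
    """All violations for one resource; an empty list means the resource is compliant."""
    problems = [f"missing tag: {key}" for key in missing_tags(tags)]
    environment = tags.get("environment", "").strip()
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unexpected environment value: {environment}")
    return problems


if __name__ == "__main__":
    resources = {
        "order-processor-queue": {"team": "payments", "environment": "production",
                                  "service": "order-processor", "cost-center": "eng-100"},
        "scratch-bucket": {"team": "ml", "environment": "prod"},
    }
    for name, tags in resources.items():
        print(name, tag_problems(tags) or "OK")
```

Catching a value like `prod` matters as much as catching a missing key: Cost Explorer groups by exact tag value, so `prod` and `production` show up as two separate environments.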

7.3 The Serverless Tax

Managed services trade operational cost for financial cost. At some point, the financial cost exceeds what you would pay to self-manage. Examples where serverless costs more:
  • NAT Gateway: $77/month for 1 TB of traffic. A t3.nano instance running as a NAT costs ~$4/month (with reduced availability and throughput).
  • API Gateway REST: $3.50 per million requests. An ALB ($16/month + $0.008/LCU-hour) is cheaper above roughly 5 million requests/month.
  • Managed Kafka (MSK): $200+/month minimum. Self-managed Kafka on EC2 can be cheaper at scale (but operational cost is significant).
  • Lambda at scale: As shown in section 1.7, Lambda is 10-40x more expensive than containers at sustained high traffic.
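The API Gateway versus ALB break-even in the list above can be reproduced with a short calculation. This sketch uses the prices quoted there and assumes a 730-hour month; the average LCU consumption is workload-dependent and is passed in as a hypothetical input, and the function names are made up for illustration.

```python
HOURS_PER_MONTH = 730  # assumption: average hours in a month


def api_gateway_monthly_cost(requests: float, per_million: float = 3.50) -> float:
    """Pure per-request pricing: no traffic, no cost."""
    return requests / 1_000_000 * per_million


def alb_monthly_cost(avg_lcus: float, fixed: float = 16.0, lcu_hour: float = 0.008) -> float:
    """Fixed monthly charge plus capacity units consumed on average each hour."""
    return fixed + avg_lcus * lcu_hour * HOURS_PER_MONTH


def break_even_requests(avg_lcus: float, per_million: float = 3.50) -> float:
    """Monthly request volume at which API Gateway costs the same as the ALB."""
    return alb_monthly_cost(avg_lcus) / per_million * 1_000_000


if __name__ == "__main__":
    for lcus in (0, 1, 2):
        millions = break_even_requests(lcus) / 1_000_000
        print(f"avg {lcus} LCU: break-even at about {millions:.1f}M requests/month")
```

With zero to one LCU on average, the crossover lands between roughly 4.6 and 6.2 million requests per month, consistent with the "roughly 5 million" rule of thumb. Below that volume, API Gateway's pay-per-request model is the cheaper option.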
When the serverless tax is worth paying:
  • Your team is small and engineering time is more expensive than AWS bills
  • Traffic is variable and unpredictable
  • The operational risk of self-managing exceeds the cost savings
  • You are optimizing for speed of delivery, not cost efficiency

Part VIII — Cloud-Agnostic Patterns

This entire chapter uses AWS service names because they are the most widely deployed. But if you only think in AWS, you are building a career on one vendor’s terminology. The best cloud engineers think in patterns and then implement in services. This section maps the portable patterns (they work everywhere) versus the cloud-specific patterns (where vendor lock-in is real and sometimes worth it).
“Cloud-agnostic” does not mean “use the lowest common denominator.” The goal is not to avoid all managed services or abstract everything behind Terraform. The goal is to know which decisions lock you in and decide consciously. Some lock-in is a great trade — DynamoDB single-table design gives you performance no portable alternative matches. Other lock-in is accidental and costly — wiring business logic directly to Lambda’s event format when a container would be just as good and run anywhere.

8.1 Patterns That Transfer Across Clouds

These patterns have direct equivalents on AWS, GCP, and Azure. The APIs differ, but the architecture, trade-offs, and failure modes are the same. If you understand the pattern, switching clouds is a configuration change, not an architecture rewrite. Container orchestration:
| Pattern | AWS | GCP | Azure |
| --- | --- | --- | --- |
| Managed Kubernetes | EKS | GKE | AKS |
| Managed containers (serverless) | ECS Fargate | Cloud Run | Container Apps |
| Container registry | ECR | Artifact Registry | ACR |
The pattern is the same everywhere: package your application as a container image, push to a registry, run on managed infrastructure. Kubernetes is the most portable option: a Kubernetes manifest deploys on EKS, GKE, AKS, or self-hosted with minimal changes. If you are designing for potential multi-cloud or cloud exit, containers on Kubernetes are the safest compute choice.
Default failure mode: When a managed Kubernetes node becomes unhealthy, the control plane reschedules pods to healthy nodes. If the control plane itself fails (rare with managed services), existing workloads continue running but no new scheduling or scaling occurs until the control plane recovers. Your pods do not crash; they become unmanageable. This means deploys, scaling events, and self-healing stop until the control plane returns.
Object storage:
| Pattern | AWS | GCP | Azure |
| --- | --- | --- | --- |
| Object storage | S3 | Cloud Storage | Blob Storage |
| Storage classes/tiers | Standard/IA/Glacier | Standard/Nearline/Coldline/Archive | Hot/Cool/Cold/Archive |
| Lifecycle policies | S3 Lifecycle Rules | Object Lifecycle Management | Lifecycle Management Policies |
| Pre-signed URLs | S3 pre-signed URLs | Signed URLs | Shared Access Signatures (SAS) |
The S3 API has become a de facto standard. MinIO, Backblaze B2, Cloudflare R2, and most on-premises storage systems support the S3 API. Code written against S3 often works against these alternatives with just an endpoint change.
Default failure mode: Object storage is designed for 99.999999999% (11 nines) durability. The realistic failure mode is not data loss but availability degradation: for a few minutes during a regional incident, reads may return 503 errors or experience elevated latency. Writes may fail transiently. Lifecycle transitions (Standard to Glacier) may be delayed. The critical mitigation: always retry S3 API calls with exponential backoff. Cross-region replication provides disaster recovery, but adds cost and replication lag (typically seconds to minutes).
Managed message queues:
| Pattern | AWS | GCP | Azure |
| --- | --- | --- | --- |
| Task queue | SQS | Cloud Tasks | Storage Queues / Service Bus |
| Pub/sub fan-out | SNS | Pub/Sub | Service Bus Topics / Event Grid |
| Event routing | EventBridge | Eventarc | Event Grid |
| Streaming | Kinesis | Pub/Sub (streaming mode) | Event Hubs |
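Queues drop a message once it has been consumed and deleted, so replay has to be layered on top by recording each message in a durable log before it is processed. The sketch below illustrates that "replayable queue" shape entirely in memory. `ReplayableQueue` is a hypothetical class: the list stands in for a durable log such as objects in S3 or Cloud Storage, and the deque stands in for the managed queue.

```python
from collections import deque


class ReplayableQueue:
    """In-memory illustration: log every message first, then enqueue it."""

    def __init__(self):
        self._log = []           # stand-in for a durable, append-only log
        self._pending = deque()  # stand-in for the managed queue (holds offsets)

    def publish(self, message):
        self._log.append(message)                 # 1. durable write comes first
        self._pending.append(len(self._log) - 1)  # 2. then hand off for processing

    def consume(self, handler):
        """Process and remove pending messages, as a normal queue consumer would."""
        while self._pending:
            offset = self._pending.popleft()
            handler(self._log[offset])

    def replay(self, handler, from_offset=0):
        """Re-deliver logged messages, including ones already consumed."""
        for message in self._log[from_offset:]:
            handler(message)
```

Because replayed messages reach the handler a second time, the handler still has to be idempotent; the log adds recoverability, not exactly-once delivery.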
The queue-vs-stream distinction is universal: queues are for work distribution (one consumer per message), streams are for event replay (multiple consumers, ordered log). See Messaging, Concurrency & State for the underlying semantics that apply regardless of vendor.
Default failure mode: Managed queues can experience two failure patterns. First, message delivery delay: during broker scaling or partition rebalancing, messages may be delayed by seconds to minutes but are not lost. Second, duplicate delivery: SQS, Pub/Sub, and Service Bus all provide at-least-once delivery by default, meaning your consumers must be idempotent. The stream failure mode is different: Kinesis and Event Hubs can experience hot shard/partition throttling under skewed partition keys, causing consumer lag to grow while messages pile up. The recovery pattern is the same across clouds: monitor consumer lag, scale consumers horizontally, and ensure your dead letter queue catches poison messages.
Replay and recovery across clouds: Streams (Kinesis, Kafka, Event Hubs) provide message replay from any point in the retention window. This is the primary disaster recovery mechanism for event-driven architectures. Queues do not support replay; once a message is consumed and deleted, it is gone. If you need replay capability with a queue-based architecture, write every message to a durable log (S3, Cloud Storage) before processing, creating a “replayable queue” pattern. This dual-write adds complexity but is essential for audit trails and disaster recovery.
Relational databases:
| Pattern | AWS | GCP | Azure |
| --- | --- | --- | --- |
| Managed PostgreSQL/MySQL | RDS, Aurora | Cloud SQL, AlloyDB | Azure Database for PostgreSQL/MySQL |
| Serverless relational | Aurora Serverless | Cloud SQL (auto-scale) | Azure SQL Serverless |
| Read replicas | RDS/Aurora Read Replicas | Cloud SQL Replicas | Azure Read Replicas |
PostgreSQL and MySQL are the same engines regardless of where they run. Your queries, indexes, and tuning strategies transfer directly. The managed service differences are in availability SLAs, failover speed, and pricing: important, but not architectural. See Database Deep Dives for engine internals that apply everywhere.
Default failure mode: When the primary fails, managed relational databases recover by promoting a standby or read replica to primary. Failover typically takes from about 30 seconds to 2 minutes (Aurora is usually fastest at ~30 seconds; standard RDS can take 1-2 minutes). During failover, writes fail and reads may be stale or unavailable. The critical application-level requirement: your connection string must use the cluster endpoint (which DNS-resolves to the current primary), not the instance endpoint. Applications must handle connection resets gracefully: a connection pool that retries on disconnect survives failover; one that does not results in a user-facing outage that lasts far longer than the actual database failover.
Portability note: The engine (PostgreSQL, MySQL) is portable, but the management plane is not. Aurora-specific features (Aurora Serverless auto-scaling, Aurora Global Database, Aurora Parallel Query) do not exist on GCP’s Cloud SQL or Azure’s Flexible Server. If portability matters, restrict yourself to standard engine features and treat managed-service-specific features as opt-in lock-in.
Secrets and configuration:
| Pattern | AWS | GCP | Azure |
| --- | --- | --- | --- |
| Secrets management | Secrets Manager | Secret Manager | Key Vault |
| Configuration store | Parameter Store | Runtime Configurator | App Configuration |
| Key management | KMS | Cloud KMS | Key Vault (Keys) |
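Rotation is the usual weak point for managed secrets: a consumer that caches a credential keeps presenting the old value after it has been rotated. One common consumer-side pattern is to refresh the cached value and retry once on an authentication failure. The sketch below assumes a hypothetical `fetch` callable that returns the current secret value and a hypothetical `do_call` that raises `PermissionError` on a rejected credential; neither is a real SDK API.

```python
class CachedSecret:
    """Caches a secret value and refetches it on demand."""

    def __init__(self, fetch):
        self._fetch = fetch  # hypothetical callable returning the current secret value
        self._value = None

    def get(self):
        if self._value is None:
            self._value = self._fetch()
        return self._value

    def refresh(self):
        self._value = self._fetch()
        return self._value


def call_with_secret(secret, do_call):
    """Try the cached credential; on an auth failure, refetch and retry once."""
    try:
        return do_call(secret.get())
    except PermissionError:
        # The cached credential may have been rotated out from under us.
        return do_call(secret.refresh())
```

A single retry is deliberate: if the freshly fetched credential is also rejected, the problem is more likely a failed rotation than a stale cache, and that should surface as an error and an alert.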
The pattern: never hardcode secrets, rotate them automatically, encrypt at rest with managed keys. The implementation APIs differ, but tools like HashiCorp Vault, External Secrets Operator (Kubernetes), and Doppler abstract across clouds if needed.
Default failure mode: Secrets management services are designed for extreme availability (Secrets Manager, Secret Manager, and Key Vault all target 99.99%+). The realistic failure mode is rotation failure: an automatic rotation Lambda/Function fails silently, the secret expires, and downstream services start getting authentication errors. The mitigation: monitor rotation Lambda invocations and failures as a top-level alert. Always keep the previous secret version active for a grace period (dual-secret rotation) so that consumers using cached credentials continue to work during the rotation window.
IAM and identity:
| Pattern | AWS | GCP | Azure |
| --- | --- | --- | --- |
| Identity & access management | IAM | IAM | Entra ID (formerly Azure AD) + RBAC |
| Service accounts / workload identity | IAM Roles | Service Accounts / Workload Identity | Managed Identities |
| Organization policies | SCPs (Organizations) | Organization Policies | Azure Policy |
The principle of least privilege, role-based access, and workload identity (no long-lived credentials) are universal. See Authentication & Security for the foundational concepts.
Default failure mode: IAM is the one service where a “failure” is usually self-inflicted misconfiguration, not a service outage. The most dangerous failure mode is permission escalation through overly broad policies: a developer grants s3:* on * to unblock themselves, and that policy persists for months. The second most common failure is IAM propagation delay: policy changes can take up to 60 seconds to propagate globally, so a “grant then immediately use” pattern may fail intermittently. The mitigation: test permission changes with iam:SimulatePrincipalPolicy before relying on them, and always include a brief retry window after policy attachment.
Workload identity portability: The underlying concept (“workloads prove their identity through short-lived tokens issued by a trusted identity provider, not through long-lived credentials”) is identical across clouds. AWS IAM Roles for Service Accounts (IRSA) and EKS Pod Identity, GCP Workload Identity, and Azure Managed Identities all implement the same OIDC-based pattern. If your CI/CD and Kubernetes workloads use OIDC federation, switching clouds means reconfiguring the trust relationship, not rewriting your authentication logic.
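Overly broad grants such as s3:* on * are easy to detect mechanically, because IAM policy documents are plain JSON. The following is a minimal sketch of such a check, assuming the standard policy document shape (`Statement`, `Effect`, `Action`, `Resource`). The function names are hypothetical, and the check is far cruder than a real policy analyzer (it ignores `NotAction`, conditions, and partial wildcards).

```python
def _as_list(value):
    """IAM allows a single string or a list for Action and Resource."""
    return [value] if isinstance(value, str) else list(value)


def find_overbroad_statements(policy):
    """Return (statement_index, wildcard_actions) for Allow statements that
    combine a wildcard action ("*" or "service:*") with Resource "*"."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may be a bare object
        statements = [statements]

    findings = []
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        actions = _as_list(statement.get("Action", []))
        resources = _as_list(statement.get("Resource", []))
        wildcard_actions = [a for a in actions if a == "*" or a.endswith(":*")]
        if wildcard_actions and "*" in resources:
            findings.append((index, wildcard_actions))
    return findings
```

Running a check like this in CI against policy files catches the "temporary" wildcard grant before it ships, which is cheaper than finding it months later.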

8.2 Patterns That Are Cloud-Specific

These are services where the cloud vendor has built something with unique architecture or performance characteristics that do not have a direct portable equivalent. Using these services is a conscious lock-in trade: you gain performance, simplicity, or cost benefits that portable alternatives cannot match.
DynamoDB (AWS): Single-digit millisecond latency at any scale, single-table design, global tables for multi-region replication. No direct equivalent on GCP or Azure. The closest alternatives (Bigtable, Cosmos DB) have different data models, consistency guarantees, and access patterns. If you design around DynamoDB’s single-table pattern, migrating to another database is an application rewrite, not a configuration change. This is often worth it: DynamoDB’s operational simplicity and performance at scale are hard to match.
  • Default failure mode: DynamoDB’s primary failure mode is hot partition throttling: if a single partition key receives disproportionate traffic, requests to that partition are throttled even if the table’s aggregate throughput is well within limits. Global tables add replication lag (typically sub-second, but can spike to seconds during regional degradation). The recovery pattern: monitor the table’s ThrottledRequests metric, use CloudWatch Contributor Insights for DynamoDB to identify the hottest keys, use write sharding for those keys, and design your access patterns to distribute load evenly across the key space.
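Write sharding spreads a hot logical key across several physical partition keys by appending a suffix. A minimal sketch follows; the `#` separator, the shard count, and both function names are illustrative assumptions, not DynamoDB APIs.

```python
import random


def sharded_write_key(base_key, shard_count, pick=random.randrange):
    """Choose one of shard_count suffixed keys at random for a write."""
    return f"{base_key}#{pick(shard_count)}"


def all_shard_keys(base_key, shard_count):
    """List every suffixed key; a reader must query all of them and merge."""
    return [f"{base_key}#{shard}" for shard in range(shard_count)]
```

The trade-off is on the read side: writes to one logical key are spread across `shard_count` partition keys, but reading it back costs `shard_count` queries plus a merge, so the shard count should stay as small as the write rate allows.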
Lambda@Edge / CloudFront Functions (AWS): Run code at CDN edge locations. Cloudflare Workers, Vercel Edge Functions, and Deno Deploy offer similar patterns, but the integration with the rest of the AWS ecosystem (S3, DynamoDB, IAM) is unique.
Google BigQuery (GCP): Serverless analytics with separation of storage and compute, automatic optimization, and ML integration (BigQuery ML). Athena on AWS is comparable for ad-hoc queries, but BigQuery’s execution engine, slot-based pricing, and BI Engine caching layer are architecturally distinct.
Azure Active Directory / Entra ID (Azure): The deepest enterprise identity integration. If your organization runs Microsoft 365, Teams, and Active Directory on-premises, Azure’s identity story is years ahead of AWS IAM Identity Center or GCP’s Cloud Identity. Choosing Azure for identity-heavy enterprise workloads is often the right call regardless of where your compute runs.
Cloud Spanner (GCP): A globally distributed, strongly consistent relational database. CockroachDB is the closest open-source equivalent. AWS has no direct managed equivalent (Aurora Global Database is async-replicated, not strongly consistent across regions). If you need global strong consistency, Spanner is one of the only managed options.

8.3 The Abstraction Layer Decision

When teams want cloud portability, they reach for abstraction layers. Here is the honest trade-off:
Infrastructure as Code (Terraform / Pulumi):
  • Portable: Resource definitions can target any cloud. Switching providers means rewriting .tf files, not application code.
  • Not portable: The resources themselves are cloud-specific. Defining an Aurora cluster in Terraform does not make it portable — it just means your infrastructure definition is in a consistent language.
  • Verdict: Use Terraform/Pulumi for consistency and automation, not for cloud portability.
Kubernetes as the abstraction layer:
  • Portable: Kubernetes manifests (Deployments, Services, ConfigMaps) work on any Kubernetes cluster. Move from EKS to GKE by repointing kubectl.
  • Not portable: Ingress controllers, storage classes, load balancer annotations, and IAM integration are cloud-specific. You will still have cloud-specific YAML.
  • Verdict: Kubernetes gives you compute portability. Data layer, networking, and identity are still cloud-specific. This is why Kubernetes is popular for multi-cloud — it solves the hardest portability problem (running your applications) while accepting that storage and networking are different.
Application-level abstractions (SDKs, libraries):
  • Libraries like libcloud, fog, or cloud-agnostic SDK wrappers provide unified APIs across providers.
  • The honest truth: These abstractions target the lowest common denominator. You lose the best features of each cloud. In practice, teams use them for basic operations (file upload, queue send) and still use native SDKs for advanced features.
  • Verdict: Use for simple, commodity operations (put object, send message). Do not use for complex patterns (DynamoDB streams, BigQuery ML, Lambda event mappings).
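For the commodity operations named in that verdict, a thin interface owned by your own codebase is often enough. The sketch below is one possible shape, not any library's API: `ObjectStore`, `InMemoryStore`, and `archive_report` are all hypothetical names, and a real deployment would add one adapter per provider behind the same two methods.

```python
from typing import Protocol


class ObjectStore(Protocol):
    """The lowest common denominator: put and get bytes by key."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Test double; an S3 or Cloud Storage adapter would expose the same methods."""

    def __init__(self) -> None:
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]


def archive_report(store: ObjectStore, report_id: str, body: bytes) -> str:
    """Application code depends only on the interface, not on a cloud SDK."""
    key = f"reports/{report_id}.json"
    store.put(key, body)
    return key
```

Anything beyond put and get (lifecycle rules, event notifications, pre-signed URLs) falls outside this interface by design, which is exactly where the native SDK takes over.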
A senior engineer’s perspective on cloud lock-in: “Lock-in is a spectrum, not a binary. I am happy to lock in to S3 (the API is a de facto standard), container orchestration on EKS (Kubernetes is portable), and managed PostgreSQL on Aurora (the engine is portable, only the management plane is locked). I am cautious about locking in to DynamoDB single-table design (hard to migrate), Step Functions workflows (vendor-specific state machines), or proprietary ML pipelines (SageMaker vs Vertex AI). The question is never ‘is this portable?’ — it is ‘what would it cost to migrate this specific component, and is that cost justified by the benefits we get today?’”

Interview Questions — Cloud Service Patterns

What they are testing: Can you design an event-driven architecture, handle failure gracefully, and reason about scaling behavior?
Strong answer framework:
  1. Trigger: File uploaded to S3 triggers an event notification.
  2. Buffering: S3 event goes to SQS (not directly to Lambda). SQS provides buffering if Lambda is at concurrency limits, retry with backoff, and DLQ for failed messages.
  3. Processing: Lambda polls SQS, processes each file (parse, validate, transform).
  4. Output: Processed results written to a separate S3 bucket or DynamoDB.
  5. Error handling: Failed messages go to a DLQ after 3 retries. A separate Lambda monitors the DLQ and alerts or routes to manual review.
Key details that impress:
  • “I would not trigger Lambda directly from S3 because SQS gives me buffering, retry control, and a DLQ. Direct S3-to-Lambda has limited retry semantics.”
  • “I would set reserved concurrency on the processing Lambda to prevent it from consuming the entire account’s concurrency pool.”
  • “For large files, I would read only the bytes I need (byte-range GETs, or S3 Select where it is still available to the account) rather than downloading the entire object.”
  • “I would add a visibility timeout on SQS that is 6x the Lambda timeout, per AWS recommendations, to prevent a message from becoming visible again while Lambda is still processing it.”
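The visibility-timeout and retry settings from these points can be derived rather than hand-typed. The sketch below builds the attribute map that the SQS CreateQueue and SetQueueAttributes APIs accept (`VisibilityTimeout` and `RedrivePolicy` are real attribute names, and values are passed as strings). The function name, the default of 3 receives, and the guard are illustrative assumptions.

```python
import json

SQS_MAX_VISIBILITY_TIMEOUT_SECONDS = 12 * 60 * 60  # SQS caps visibility timeout at 12 hours


def sqs_queue_attributes(lambda_timeout_seconds, dlq_arn, max_receive_count=3):
    """Derive source-queue attributes from the consumer Lambda's timeout."""
    visibility_timeout = 6 * lambda_timeout_seconds  # the 6x guidance cited above
    if visibility_timeout > SQS_MAX_VISIBILITY_TIMEOUT_SECONDS:
        raise ValueError("6x the Lambda timeout exceeds the SQS maximum")
    return {
        "VisibilityTimeout": str(visibility_timeout),
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,        # where poison messages end up
            "maxReceiveCount": max_receive_count,  # receives before moving to the DLQ
        }),
    }
```

Deriving the queue settings from the function timeout keeps the two from drifting apart when someone later raises the timeout.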
Architecture diagram (text):
S3 Upload -> S3 Event Notification -> SQS Queue -> Lambda (processor) -> S3 / DynamoDB
                                         |
                                         v (after 3 failures)
                                      DLQ -> Lambda (alerter) -> SNS -> PagerDuty/Slack
Structured Answer Template:
  1. Trigger (what starts the flow) -> 2. Buffer (decouple producer from consumer) -> 3. Processing (Lambda or worker) -> 4. Sink (where results go) -> 5. Failure path (retries, DLQ, alerting) -> 6. Back-pressure & scaling knobs (reserved concurrency, visibility timeout, batch size). Always call out why you inserted the queue between S3 and Lambda, not just that you did.
Real-World Example: Netflix’s Cosmos media-processing platform uses a near-identical shape: files land in S3, S3 events fan out through SQS/SNS to worker Lambdas and Titus containers that transcode video and run QC checks, with DLQs feeding a human-triage queue for failed assets. Their writeup describes tens of millions of files/day flowing through this exact pattern.
Big Word Alert — DLQ (Dead Letter Queue): a second queue where messages land after they have failed processing N times. Use it so a single poison message does not block the main queue and so humans can inspect failures later.
Big Word Alert — Visibility Timeout: the window during which SQS hides a message from other consumers after one consumer picks it up. Set it to ~6x your Lambda timeout so the message does not reappear mid-processing.
Follow-up Q&A Chain:
Q: Why not let Lambda retry directly on S3 events and skip SQS entirely? A: Direct S3-to-Lambda invocation is asynchronous, so retries are limited (two retries by default, with little control over timing), and failed events are only captured if you configure a function-level DLQ or an on-failure destination. SQS gives you explicit retry count via maxReceiveCount, backoff via visibility timeout, and a real DLQ. You also get buffering for free when Lambda is throttled.
Q: What happens if the same file is uploaded twice with the same key? How do you prevent double-processing? A: S3 generates a new event per upload even on overwrite, so you must make processing idempotent. Store a processed-marker (e.g., s3://processed-bucket/{etag}) or use a DynamoDB dedupe table keyed on etag + key. The first Lambda to write the marker wins; others short-circuit.
Q: The upstream system starts producing 10x more files overnight. What breaks first? A: Usually the downstream database, not Lambda or SQS. Lambda auto-scales, SQS auto-scales, but your DynamoDB write capacity or Postgres connection pool becomes the bottleneck. Second failure mode: you hit the account-level Lambda concurrency cap and starve other functions; set reserved concurrency before this happens.
What they are testing: Do you understand cold start mechanics and can you apply the right mitigation for the context?
Strong answer:
A 30% cold start rate means roughly 1 in 3 invocations hits a cold start. This indicates the function is not invoked frequently enough to keep warm instances alive (Lambda recycles idle instances after roughly 5-15 minutes; the exact window is not documented).
Step 1: Diagnose. Look at the function’s REPORT lines in CloudWatch Logs: Init Duration shows cold start time. Check invocation frequency. If the function is invoked once every 10 minutes, cold starts are inevitable without intervention.
Step 2: Quick wins (free or cheap).
  • Move initialization code outside the handler (SDK clients, DB connections).
  • Reduce deployment package size (remove unused dependencies, use Lambda layers).
  • If using Java, enable SnapStart.
  • Switch to a lighter runtime if possible (Python/Node.js start 10-50x faster than Java).
Step 3: Provisioned concurrency (costs money).
  • Analyze traffic patterns. If traffic is predictable (business hours), use scheduled provisioned concurrency (scale up at 8 AM, down at 6 PM).
  • If traffic is unpredictable, use Application Auto Scaling on provisioned concurrency to scale based on utilization.
  • Start with the minimum needed: if peak concurrency is 50, provisioning 30-40 instances covers 80-90% of cold starts.
Step 4: Architectural alternatives.
  • If this is a user-facing API with strict latency requirements and high cold start rate, consider whether Lambda is the right compute model. A small Fargate task that is always warm may be cheaper and faster than Lambda with full provisioned concurrency.
The nuance that separates senior from mid-level: “The fix depends on whether this is a cost problem or a latency problem. If cold starts add 200ms and the SLA is 500ms, it is tolerable. If cold starts add 5 seconds on a Java function and the SLA is 1 second, I need to fix it aggressively. The 30% rate alone does not tell me if it is a problem — the duration of the cold start and the SLA of the service do.”
Structured Answer Template:
  1. Reframe: is this a latency problem or a cost problem? -> 2. Quantify: Init Duration vs total Duration, p50/p99 impact, SLA. -> 3. Cheap fixes: init-outside-handler, smaller package, lighter runtime, SnapStart (Java). -> 4. Paid fixes: Provisioned Concurrency (scheduled or autoscaled). -> 5. Architectural fix: is Lambda even the right compute? Fargate/Cloud Run for always-warm workloads.
Real-World Example: Snapchat’s backend team has publicly described moving latency-critical services off Java Lambda onto always-on ECS/EKS after finding that even with provisioned concurrency, JVM cold starts blew their p99 SLA during traffic spikes. For their spiky-but-latency-tolerant workloads (e.g., media thumbnail processing) they stayed on Lambda with provisioned concurrency + SnapStart.
Big Word Alert — Provisioned Concurrency: a feature that keeps N Lambda execution environments pre-initialized so the first invocation does not pay the cold-start tax. Use it when latency matters more than idle cost.
Big Word Alert — SnapStart: AWS-managed snapshot of a fully initialized execution environment that Lambda restores instead of booting from scratch. It launched for Java (AWS has since extended it to other runtimes, so check current support); it cuts Java cold starts from seconds to ~200-500 ms.
Follow-up Q&A Chain:
Q: Provisioned Concurrency is set to 50 but you still see cold starts at peak. Why? A: Provisioned Concurrency only covers the first 50 concurrent environments. Invocation 51+ during a burst falls back to on-demand and pays the full cold start. Fix: enable Application Auto Scaling on the provisioned value with a target utilization of ~70%, so it scales ahead of real demand.
Q: Why does moving code outside the handler actually help? What is the mechanism? A: Code outside the handler runs once during the Init phase (which is what “cold start” is). Warm invocations skip Init entirely and jump straight to the handler. So DB clients, config loads, and ML model warmups paid at Init are amortized across every subsequent warm invocation on that instance.
Q: When would you deliberately not fix cold starts? A: When the workload is truly async (SQS consumer, scheduled job) and the caller does not care about the extra 300ms. Paying for Provisioned Concurrency here is wasted money. Save the budget for user-facing API functions.
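The Init-phase mechanism is easy to see in a Python handler module. In the sketch below, `build_expensive_client` is a hypothetical stand-in for creating an SDK client or database connection, and the `INIT_CALLS` counter exists only to make the behavior observable.

```python
INIT_CALLS = 0


def build_expensive_client():
    """Hypothetical stand-in for slow setup (SDK client, DB connection, config load)."""
    global INIT_CALLS
    INIT_CALLS += 1
    return {"connected": True}


# Module scope: runs once per execution environment, during the Init phase.
CLIENT = build_expensive_client()


def handler(event, context):
    # Warm invocations reuse CLIENT instead of rebuilding it on every call.
    return {"ok": CLIENT["connected"], "init_calls": INIT_CALLS}
```

Calling `handler` repeatedly within the same process leaves `INIT_CALLS` unchanged, which mirrors how warm invocations skip Init. Had the client been built inside `handler`, every invocation would pay the setup cost.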
What they are testing: Can you do back-of-envelope cost estimation and reason about when serverless is and is not cost-effective?
Strong answer:
Lambda estimate:
  • 10M events/day = ~300M events/month
  • Assume 128 MB memory, 200ms average duration
  • Compute: 300M * 0.2s * (128/1024) GB = 7.5M GB-seconds
  • Compute cost: 7.5M GB-seconds * $0.0000166667 per GB-second = ~$125/month
  • Request cost: 300M requests * $0.20 per 1M requests = ~$60/month
  • Lambda total: ~$185/month
Fargate estimate:
  • 10M events/day = ~116 events/second average
  • Assume each event takes 200ms to process
  • Need: 116 * 0.2 = ~24 concurrent processing slots
  • Run 3 Fargate tasks (0.5 vCPU, 1 GB each) with 10 concurrent processors each
  • Cost: 3 tasks * $0.04048/vCPU-hour * 0.5 vCPU * 730 hours = ~$44/month (vCPU)
  • Plus: 3 tasks * $0.004445/GB-hour * 1 GB * 730 hours = ~$10/month (memory)
  • Fargate total: ~$54/month
The conclusion: At 10M events/day with short processing times, Fargate is roughly 3x cheaper than Lambda. But the total difference is ~$130/month. If your team values Lambda’s operational simplicity (no containers to manage, automatic scaling, zero idle cost on weekends), $130/month is nothing compared to engineering time.
The senior nuance: “The raw cost comparison favors Fargate, but the total cost of ownership includes engineering time for container management, deployment pipelines, health checks, and scaling configuration. For a team of 3 engineers, I would stay on Lambda until the bill exceeds $500/month. For a team with existing container infrastructure, I would use Fargate from the start.”
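The napkin math above is simple enough to script, which makes it easy to rerun with different volumes or memory sizes. This sketch reuses the unit prices quoted in the estimate; treat them as the example's assumptions rather than current list prices, and note that the function names are illustrative.

```python
# Unit prices taken from the estimate above (assumed, not current list prices).
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667
LAMBDA_PRICE_PER_MILLION_REQUESTS = 0.20
FARGATE_PRICE_PER_VCPU_HOUR = 0.04048
FARGATE_PRICE_PER_GB_HOUR = 0.004445
HOURS_PER_MONTH = 730


def lambda_monthly_cost(events_per_month, memory_mb, avg_duration_seconds):
    """Compute charge (GB-seconds) plus request charge."""
    gb_seconds = events_per_month * avg_duration_seconds * (memory_mb / 1024)
    compute = gb_seconds * LAMBDA_PRICE_PER_GB_SECOND
    requests = (events_per_month / 1_000_000) * LAMBDA_PRICE_PER_MILLION_REQUESTS
    return compute + requests


def fargate_monthly_cost(tasks, vcpu_per_task, memory_gb_per_task):
    """Always-on tasks: billed for every hour whether busy or idle."""
    vcpu = tasks * vcpu_per_task * FARGATE_PRICE_PER_VCPU_HOUR * HOURS_PER_MONTH
    memory = tasks * memory_gb_per_task * FARGATE_PRICE_PER_GB_HOUR * HOURS_PER_MONTH
    return vcpu + memory
```

With the inputs from the estimate, `lambda_monthly_cost(300_000_000, 128, 0.2)` comes to roughly $185 and `fargate_monthly_cost(3, 0.5, 1.0)` to roughly $54, matching the figures worked by hand.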
Structured Answer Template:
  1. Convert the workload to the billing unit each service uses (GB-seconds + requests for Lambda; vCPU-hour + GB-hour for Fargate). -> 2. Do the math out loud. -> 3. Add the missing dimensions: data transfer, NAT, observability, eng time. -> 4. Name the crossover point, not just the winner. -> 5. Pick based on team shape, not just dollars.
Real-World Example: iRobot migrated much of their backend from EC2/Fargate-style compute to Lambda and publicly reported saving ~$5K/month at their volume because their workload was bursty enough that Fargate was idle 60% of the time. Similarly, Coca-Cola’s vending-machine payment service stayed on Lambda specifically because traffic was spiky and mostly zero between purchases.
Big Word Alert — GB-seconds: Lambda’s billing unit = configured memory in GB multiplied by execution duration in seconds. A 512MB function running 400ms = 0.2 GB-seconds. Always compute this before sizing up memory.
Big Word Alert — Fargate vCPU-hour: the billable unit for Fargate = allocated vCPU count multiplied by wall-clock hours the task ran. Unlike Lambda, you pay whether the task is busy or idle.
Follow-up Q&A Chain:
Q: At what event volume does Fargate actually win? A: Rule of thumb: if your workload runs essentially 24/7 at sustained load, Fargate wins by 2-4x. Crossover is usually around 50-70% steady-state utilization. Below that, Lambda wins because you pay zero when idle.
Q: What dimensions does this napkin math miss? A: Data transfer (egress + NAT Gateway), CloudWatch logs ingestion cost (often larger than the Lambda bill itself at high volumes), and observability tooling that charges per container or per invocation. Also missing: the hidden cost of the infrastructure around Fargate (ALB, target groups, service discovery).
Q: How does GPU or ARM change the calculation? A: Graviton (ARM) Lambda and Fargate are ~20% cheaper than x86 for the same workload. GPUs are a different story: neither Lambda nor Fargate offers them, so any GPU-backed ML inference changes the answer to “ECS or EKS on GPU EC2 instances, or a SageMaker endpoint.”
What they are testing: Systematic debugging approach, familiarity with AWS cost tools, and the ability to distinguish signal from noise.
Strong answer:
Step 1: Scope the increase. Open AWS Cost Explorer and compare this month to last month. Group by service: is the increase concentrated in one service (e.g., EC2 jumped 60%) or spread across many? Group by region: did someone spin up resources in a new region? Group by tag: is it one team or one service?
Step 2: Identify the top contributors. Cost Explorer’s “Daily Costs” view shows when the increase started. If it started on a specific day, correlate with deployments, configuration changes, or traffic spikes. Look at the “Top 5 Cost Changes” widget.
Step 3: Common culprits.
  • Forgotten resources: Someone launched large EC2 instances for testing and forgot to terminate them. Check for instances with low utilization.
  • NAT Gateway data processing: A new service was deployed that routes traffic through NAT Gateway. Check NAT Gateway charges in the VPC cost breakdown.
  • S3 request costs: A misconfigured client making millions of unnecessary API calls. Check S3 request metrics.
  • Data transfer: Cross-region replication, S3-to-internet transfer, or inter-AZ traffic increase. Check the “Data Transfer” line item.
  • DynamoDB scaling: On-demand pricing with a traffic spike, or provisioned capacity that was scaled up and never scaled back down.
  • Spot interruptions: If Spot capacity became unavailable, workloads fell back to On-Demand pricing.
Step 4: Prevent recurrence. Set up AWS Budgets with alerts at 80% and 100% of expected spend. Enable Cost Anomaly Detection for automatic alerting on unusual spending patterns. Require cost allocation tags on all resources via Organization Tag Policies. Review costs weekly in a 15-minute team meeting.
The line that impresses: “I would check our cost anomaly detection alerts first. If those were configured, we should have caught this before the bill arrived.”
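The scoping in Steps 1 and 2 is a month-over-month diff grouped by service. Once per-service totals have been exported (for example from Cost Explorer), ranking the increases takes a few lines. The function name and the sample figures in the usage note are made up for illustration.

```python
def top_cost_increases(previous, current, limit=3):
    """Rank services by absolute month-over-month increase, largest first."""
    services = set(previous) | set(current)
    deltas = [
        (service, current.get(service, 0) - previous.get(service, 0))
        for service in services
    ]
    increases = [item for item in deltas if item[1] > 0]
    # Sort by largest increase; break ties alphabetically for stable output.
    return sorted(increases, key=lambda item: (-item[1], item[0]))[:limit]
```

For instance, with last month at `{"EC2": 1000, "S3": 200, "NAT Gateway": 50}` and this month at `{"EC2": 1600, "S3": 210, "NAT Gateway": 450, "DynamoDB": 30}`, the top three come out as EC2 (+600), NAT Gateway (+400), and DynamoDB (+30). A service that is new this month counts from zero, which is how a resource launched in a forgotten region shows up.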
Structured Answer Template:
  1. Scope the delta (which service, which region, which tag). -> 2. Find the start-date in Cost Explorer’s daily view. -> 3. Correlate with a deploy, feature flag, or traffic change. -> 4. Apply the usual-suspect checklist (forgotten EC2, NAT Gateway, S3 requests, data transfer, on-demand DDB, Spot fallback). -> 5. Close the loop with a guardrail (Budgets alarm, Cost Anomaly Detection, tag policy).
Real-World Example: The Adobe Creative Cloud team published a retro on a $200K/month surprise NAT Gateway bill caused by a new service pulling container images through NAT instead of an ECR VPC endpoint. They found it via Cost Explorer grouped by usage-type, deployed the endpoint, and the bill dropped 80% in a single billing cycle.
Big Word Alert — VPC Endpoint: a private connection from your VPC to an AWS service that bypasses the public internet and the NAT Gateway. Gateway endpoints (S3, DynamoDB) are free; Interface endpoints cost ~$7/month per AZ.
Big Word Alert — Cost Anomaly Detection: an AWS service that uses ML to baseline your spend and alerts you when a service’s daily cost deviates beyond normal. Set it up once; it pays for itself the first time someone leaves a p4d instance running overnight.
Follow-up Q&A Chain:
Q: Cost Explorer shows EC2 is up 40% but all instance counts look normal. Where else do you look? A: Switch “Group by” to “Usage Type”. EC2 billing hides data-transfer (DataTransfer-Out-Bytes), NAT processing (NatGateway-Bytes), and EBS gp3 throughput overages under the EC2 umbrella. One of those is usually the culprit when compute counts are flat.
Q: You find the spike started on the day of a deploy. How do you confirm causation, not just correlation? A: Compare CloudWatch metrics on the suspect service (invocation count, bytes transferred, DB read/write units) against the cost timeline. Look at feature flags flipped that day. Use the Cost Explorer API to export hourly costs and overlay with deployment timestamps from your CI system.
Q: How do you prevent the same thing next month? A: Three layers: (1) AWS Budgets with 80%/100% alerts per-service and per-tag, (2) Cost Anomaly Detection with Slack/email subscribers, (3) a tag policy that blocks untagged resource creation so every cost has an owner. None of these are optional at scale.
What they are testing: Understanding of stream vs queue semantics and the ability to pick the right tool for specific requirements.
Strong answer:
Choose Kinesis when:
  • You need replay. Kinesis retains data for 1-365 days. Multiple consumers can read from any point in the stream. SQS deletes messages after they are consumed.
  • You need ordering. Kinesis guarantees order within a shard. SQS Standard does not guarantee order; SQS FIFO guarantees it but at lower throughput.
  • You need multiple independent consumers. With Kinesis, consumer A (sales analytics) and consumer B (fraud detection) read the same stream independently, each at its own pace. SQS delivers each message to one consumer.
  • You are doing real-time analytics. Kinesis integrates with Amazon Managed Service for Apache Flink (formerly Kinesis Data Analytics), allowing windowed aggregations, pattern detection, and real-time SQL over streams.
Choose SQS when:
  • You need a simple task queue. One message, one consumer, delete after processing.
  • You need per-message retry and DLQ. SQS has built-in retry counting and dead letter queue support. Kinesis requires you to build this yourself.
  • Traffic is spiky and you need buffering. SQS scales automatically with no shard management.
  • You do not need replay or ordering. SQS Standard is simpler and cheaper for most async processing.
The sharp distinction: “SQS is for work distribution — assign tasks to workers. Kinesis is for event streaming — publish a log of what happened and let multiple consumers derive meaning from it.”
Structured Answer Template:
  1. State the mental model (queue = work handoff, stream = durable log). -> 2. Walk the four axes: replay, ordering, multi-consumer, retention. -> 3. Map each axis to the requirement. -> 4. Mention cost/ops shape (shards vs auto-scaling). -> 5. End with the one-liner distinction.
Real-World Example: Netflix’s Keystone pipeline ingests over 1 trillion events/day into Kinesis-style streams (they use a custom Kafka-based system but the pattern is identical): a single stream is consumed by real-time fraud detection, recommendation updates, and batch analytics — each reading at its own pace from the same data. That multi-consumer fanout is exactly why they chose streams over queues.
Big Word Alert — Shard: the unit of parallelism in Kinesis Data Streams. Each shard handles 1MB/s in and 2MB/s out; ordering is guaranteed within a shard, not across shards. You manage shard count manually (or use on-demand mode).
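Those per-shard limits turn provisioned-mode sizing into a small calculation: take the largest requirement across each dimension. The sketch below uses the 1 MB/s in and 2 MB/s out figures from the definition, plus the 1,000 records/s per-shard write limit, which is not stated above and is added here as an assumption to verify against current quotas. The function name is illustrative.

```python
import math

SHARD_INGEST_MB_PER_S = 1.0    # per-shard write throughput
SHARD_EGRESS_MB_PER_S = 2.0    # per-shard shared read throughput
SHARD_RECORDS_PER_S = 1000     # per-shard write record limit (assumption, see lead-in)


def required_shards(ingest_mb_per_s, records_per_s, egress_mb_per_s):
    """Smallest shard count that satisfies every per-shard limit."""
    return max(
        1,
        math.ceil(ingest_mb_per_s / SHARD_INGEST_MB_PER_S),
        math.ceil(records_per_s / SHARD_RECORDS_PER_S),
        math.ceil(egress_mb_per_s / SHARD_EGRESS_MB_PER_S),
    )
```

This assumes an evenly distributed partition key. A skewed key can throttle one shard long before the aggregate numbers suggest a problem.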
Big Word Alert — FIFO Queue (SQS): a variant of SQS that preserves ordering within a message group ID and deduplicates within a 5-minute window. Lower throughput than Standard SQS (300 TPS without batching) — use only when order actually matters.
Follow-up Q&A Chain:
Q: You chose Kinesis for ordering. A hot partition key (say, one VIP customer) is overwhelming one shard. What do you do? A: Either (1) change the partition key to spread load (e.g., hash customer_id + transaction_minute instead of just customer_id) and accept weaker ordering, or (2) split that one customer into their own dedicated stream. Kinesis ordering is per-shard, so a hot shard is a design smell, not an AWS limit.
Q: Could you just use SQS FIFO with message group IDs to get ordering cheaper than Kinesis? A: Yes, for low-throughput use cases. SQS FIFO gives you per-group ordering and dedupe without shard management. The trade-off: no replay, no multi-consumer fanout, and a ceiling of roughly 3,000 messages/second per queue with batching unless you enable high-throughput mode. Above that, Kinesis wins.
Q: Your Kinesis consumer Lambda is falling behind by 20 minutes. How do you catch up without losing data? A: Three options: (1) increase parallelization factor on the event source mapping (up to 10 concurrent Lambdas per shard), (2) add shards to increase parallelism, (3) switch to Kinesis Enhanced Fan-Out which gives each consumer its own 2MB/s read throughput instead of sharing. Do not delete and recreate the consumer; you will lose the checkpoint.
What they are testing: Deep understanding of distributed systems, data consistency, and the practical challenges of multi-region deployments.
Strong answer framework:
Compute layer: Deploy the payment service as ECS Fargate tasks in two regions (e.g., us-east-1 and eu-west-1). Use Route 53 latency-based routing to direct users to the nearest region. Health checks automatically fail over to the surviving region.
Data layer (the hard part):
  • DynamoDB Global Tables for payment state (transaction records, idempotency keys). Global Tables provide multi-region replication with eventual consistency (typically under 1 second). Use conditional writes and idempotency keys to prevent duplicate payments when both regions are active.
  • Aurora Global Database for account data (if relational is required). One primary region handles writes; the secondary has a read replica that can be promoted to writer in under 1 minute.
Messaging: SQS queues are regional. Use SNS cross-region subscriptions or EventBridge cross-region event buses to replicate events between regions.
The consistency challenge: Active-active with payments is dangerous because two regions could process the same payment simultaneously. Mitigations:
  • Idempotency keys stored in DynamoDB Global Tables (both regions see the same key within ~1 second).
  • Route each customer to a “home” region by default (reduces cross-region conflict).
  • Implement optimistic locking: write with a condition that the item does not already exist.
Why an experienced engineer raises this concern: “Active-active sounds great on paper, but for payments, I would actually recommend active-passive unless the business requirement truly demands near-zero RTO. Active-active introduces split-brain risk where both regions accept a conflicting write, and asynchronous replication means it does not give you zero RPO anyway. For most payment systems, a 30-second failover (active-passive) is acceptable, and the consistency guarantees are much simpler.”
Structured Answer Template:
  1. Clarify requirements: RTO, RPO, and “active-active vs hot-standby”. -> 2. Compute: Fargate in both regions, Route 53 latency routing. -> 3. Data: pick DynamoDB Global Tables OR Aurora Global (never hand-wave “we’ll sync”). -> 4. Idempotency strategy (dedupe key in both regions). -> 5. Failure modes: split-brain, replication lag, region-partition scenarios. -> 6. Push back if active-active is overkill.
Real-World Example: Stripe runs active-active across two regions for its API plane but explicitly keeps the ledger (money-moving writes) as active-passive with a strongly consistent primary. Their blog explains they made this split because the latency cost of cross-region consensus on every payment write was unacceptable, and the business can tolerate a 30-60s failover if needed.
Big Word Alert — DynamoDB Global Tables: a multi-region, multi-active DynamoDB deployment with asynchronous replication (typically <1s lag). Last-writer-wins on conflicts — which is why idempotency keys matter for correctness.
Big Word Alert — RPO / RTO: Recovery Point Objective = how much data loss is acceptable (measured in time). Recovery Time Objective = how long failover may take. Payment systems typically want RPO=0, RTO<60s — which drives the entire architecture.
Follow-up Q&A Chain:
Q: Both regions accept a write for the same idempotency key within the replication window. What happens? A: With DynamoDB Global Tables, conflicts resolve last-writer-wins by timestamp, so one write is silently overwritten. A conditional write (attribute_not_exists(idempotency_key)) in both regions helps once the key has replicated: the second write fails with a ConditionalCheckFailedException, and your code treats that as “already processed, return success.” The condition is evaluated against the local replica, though, so inside the replication window both regions can still pass the check. That is why you also pin each customer (or idempotency key) to a home region, so concurrent writes for the same key land in one place.
Q: How do you test this works without waiting for a real region outage? A: Chaos engineering. Tools such as Netflix’s Chaos Monkey or Gremlin let you simulate failures. Simpler: in a staging environment, block Route 53 health checks to one region and verify traffic fails over within your RTO. Run this quarterly.
Q: Aurora Global Database only has one writer region. How is that “active-active”? A: It is not. That is active-passive with fast promotion. True active-active RDBMS at cross-region scale is very hard (see Spanner, CockroachDB). For most enterprise use cases, active-passive Aurora with a <1 minute failover is the correct answer even if the interviewer asked for “active-active.”
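The conditional-write defense from the first answer might look like the sketch below. It only builds the request and classifies the error, so it runs without AWS credentials; the dictionary is shaped like the keyword arguments of the low-level DynamoDB `put_item` call, while the table name, attribute names, and the two helper functions are hypothetical.

```python
from typing import Any, Dict


def build_idempotent_put(table_name: str, idempotency_key: str, amount_cents: int) -> Dict[str, Any]:
    """Build put_item arguments that succeed only if the key has not been written yet."""
    return {
        "TableName": table_name,
        "Item": {
            "idempotency_key": {"S": idempotency_key},
            "amount_cents": {"N": str(amount_cents)},
            "status": {"S": "PROCESSED"},
        },
        # Rejects the write when an item with this key already exists
        # in the replica that receives the request.
        "ConditionExpression": "attribute_not_exists(idempotency_key)",
    }


def outcome_for_error(error_code: str) -> str:
    """Map a DynamoDB error code to what the payment handler should do next."""
    if error_code == "ConditionalCheckFailedException":
        return "already_processed"  # duplicate request: return the earlier success
    return "retry"  # throttling, timeouts, and similar transient errors


if __name__ == "__main__":
    print(build_idempotent_put("payments", "pay_123", 4999))
    print(outcome_for_error("ConditionalCheckFailedException"))
```

Treating the conditional failure as success is what makes client retries safe: the caller sees the same result whether the first or the fifth attempt was the one that landed.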
What they are testing: Practical architecture with cost awareness, understanding of storage tiers, and data lifecycle management.
Strong answer:
Ingestion: Applications write logs to CloudWatch Logs (or ship directly to Kinesis Data Firehose via the Fluent Bit agent for higher volume). Firehose batches logs into compressed files and delivers to S3 every 60-300 seconds.
Hot storage (0-30 days): Logs land in S3 Standard. Use CloudWatch Logs Insights or OpenSearch for interactive querying during active development and incident investigation.
Warm storage (30-90 days): S3 lifecycle policy transitions logs to S3 Standard-IA. Still queryable via Athena on demand, but cheaper storage.
Cold storage (90+ days): Transition to Glacier Flexible Retrieval. Available within hours if needed for compliance or historical investigation.
Archival (1+ years): Glacier Deep Archive for long-term compliance retention at about $1/TB/month.
The cost-saving details that impress:
  • “I would use Parquet format for logs in S3 — Athena charges per TB scanned, and Parquet reduces scan volume by 80-90% compared to raw JSON.”
  • “I would avoid keeping logs in CloudWatch Logs long-term. CloudWatch Logs Insights is expensive for large volumes. Export to S3 within 24 hours and query with Athena.”
  • “I would set up S3 Intelligent-Tiering for logs with unpredictable access patterns — it automatically moves objects between access tiers with no retrieval fees.”
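The hot/warm/cold/archive tiers described in this answer map onto a single S3 lifecycle rule. The sketch below builds that rule as a plain dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration`; the `logs/` prefix and the rule ID are assumptions, and `GLACIER` is the storage-class name S3 uses for Glacier Flexible Retrieval.

```python
from typing import Any, Dict


def log_lifecycle_rules(prefix: str = "logs/") -> Dict[str, Any]:
    """Lifecycle configuration for the 30 / 90 / 365 day tiering described above."""
    return {
        "Rules": [
            {
                "ID": "tier-application-logs",  # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},    # warm
                    {"Days": 90, "StorageClass": "GLACIER"},        # cold
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},  # archival
                ],
            }
        ]
    }


if __name__ == "__main__":
    import json

    print(json.dumps(log_lifecycle_rules(), indent=2))
```

A retention requirement would add an `Expiration` entry to the same rule; it is left out here because the answer ties retention to the specific compliance regime.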
Structured Answer Template:
  1. Ingestion path (Fluent Bit / Firehose -> S3, not CloudWatch Logs long-term). -> 2. Hot/warm/cold/archive tiers with transition ages. -> 3. File format and partitioning (Parquet + year/month/day partitions). -> 4. Query layer (Athena with partition projection). -> 5. Retention policy tied to compliance requirement, not gut feel. -> 6. Alerting layer on live tail (subscription filters or OpenSearch).
Real-World Example: Airbnb publicly described their log pipeline as Fluent Bit -> Kinesis Firehose -> S3 (Parquet, partitioned by hour) -> Athena/Presto. Their writeup cites 90%+ query-cost reduction after converting from JSON to Parquet and enabling columnar pruning — the single biggest lever on any S3-based log pipeline.
Big Word Alert — Parquet: a columnar file format that stores each column contiguously with type-aware compression. Athena only reads the columns you query, so a query selecting 3 columns out of 50 reads ~6% of the data it would read from JSON.
Big Word Alert — S3 Intelligent-Tiering: an S3 storage class that auto-moves objects between Frequent Access, Infrequent Access, and Archive Instant Access tiers (plus optional Archive Access and Deep Archive Access tiers) based on access pattern. There are no retrieval fees, but you pay a small per-object monitoring charge. Set it once and stop babysitting lifecycle rules.
Follow-up Q&A Chain:
Q: Athena scans cost $5/TB. Your query over 30 days of logs scans 8 TB. How do you cut it 10x? A: Four levers, stacked: (1) convert to Parquet (~10x scan reduction), (2) partition on year=/month=/day=/hour= so the query only reads relevant partitions, (3) use partition projection so Athena does not query Glue for partition metadata, (4) select only the columns you need and never SELECT *.
Q: Compliance says logs must be retained for 7 years but queryable within 48 hours. Which tier? A: Glacier Flexible Retrieval (restores take 1-5 minutes expedited, 3-5 hours standard, or 5-12 hours bulk). Deep Archive is cheaper, but standard restores take up to 12 hours and bulk restores up to 48 hours, which leaves little margin against a 48-hour SLA once you add human response time (for example, a request that comes in on a Friday evening).
Q: How do you handle a log spike (e.g., deploy-time error flood) without blowing up Firehose cost? A: Firehose charges per GB ingested, so you cannot fully escape cost, but: (1) sample debug/trace logs at the source (Fluent Bit filter), (2) enable Firehose compression (GZIP), which gives ~5x reduction on text logs, (3) set a CloudWatch Logs metric filter that alerts when a single app’s log volume 10x’s, so you catch runaway logging before the month-end bill.
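The arithmetic behind the first answer is worth doing explicitly. The sketch below uses the $5/TB price and the ~10x Parquet reduction quoted above; the partition-pruning factor in the usage example is an illustrative assumption, not a measured number.

```python
from typing import Iterable


def scanned_tb_after(raw_tb: float, reduction_factors: Iterable[float]) -> float:
    """Apply stacked scan reductions (each factor divides the bytes scanned)."""
    tb = raw_tb
    for factor in reduction_factors:
        tb /= factor
    return tb


def athena_cost_usd(scanned_tb: float, price_per_tb: float = 5.0) -> float:
    """Athena bills on data scanned; price_per_tb defaults to the figure quoted above."""
    return scanned_tb * price_per_tb


if __name__ == "__main__":
    baseline = athena_cost_usd(8.0)                         # raw JSON, full scan
    parquet = athena_cost_usd(scanned_tb_after(8.0, [10]))  # Parquet only
    # Assumed: partition pruning removes a further 3/4 of what is left.
    pruned = athena_cost_usd(scanned_tb_after(8.0, [10, 4]))
    print(baseline, parquet, pruned)
```

Parquet alone already delivers the 10x the interviewer asked for (from $40 to $4 per query); partitioning and column selection stack on top of that.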
What they are testing: Security mindset, ability to push back constructively, IAM understanding.
Strong answer:
“I would not grant AdministratorAccess. Here is what I would do instead:
  1. Understand the requirement. What AWS services does the function actually need to access? S3? DynamoDB? SQS? Get the specific operations — read, write, list?
  2. Create a scoped policy. Grant only the permissions the function needs, on only the resources it needs to access.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:GetItem",
        "dynamodb:PutItem",
        "dynamodb:Query"
      ],
      "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/Orders"
    }
  ]
}
  3. Use IAM Access Analyzer. If the function already exists with broad permissions, enable IAM Access Analyzer to see which permissions it actually uses over a 30-day period. Then generate a least-privilege policy based on actual usage.
  4. Explain the risk. AdministratorAccess on a Lambda function means a vulnerability in the function code (injection, SSRF, dependency vulnerability) becomes a complete account compromise. The attacker can create IAM users, access any database, exfiltrate any S3 bucket, and launch cryptomining instances, all through the Lambda’s execution role.
The principle is: give the function the minimum permissions it needs, on the specific resources it needs, and nothing more. If the developer does not know what permissions they need, that is a conversation worth having before writing the code.”
Structured Answer Template:
  1. Do not say no flatly — understand the underlying need. -> 2. Enumerate the specific actions + resources. -> 3. Write a scoped policy (Action + Resource + Condition). -> 4. Offer a tool (Access Analyzer, IAM Access Advisor) so the developer can iterate. -> 5. Explain the blast-radius if * stays. -> 6. Commit to unblocking them within the same day.
Real-World Example: The Capital One breach in 2019 was enabled in part by an over-privileged IAM role on an EC2 instance that had s3:* access across buckets. A single SSRF vulnerability allowed the attacker to steal credentials and exfiltrate 100M records. Capital One’s post-incident remediation was exactly this pattern: scope every role to the specific buckets and actions, monitored by Access Analyzer.
Big Word Alert — IAM Access Analyzer: an AWS service that inspects a role's CloudTrail activity over a window of up to 90 days and generates a least-privilege policy based on what the role actually used. Run it before writing any scoped policy from scratch.
Big Word Alert — Permission Boundary: a policy attached to a role that caps the maximum permissions the role can ever have, even if someone attaches AdministratorAccess to it. Use it as a safety net on developer-created roles.
Follow-up Q&A Chain:
Q: The developer pushes back: “I need it to work now, I’ll tighten it later.” How do you handle that? A: Unblock them with a timeboxed exception: grant ReadOnlyAccess (not Admin) for 24 hours via a condition (aws:CurrentTime < tomorrow), and open a ticket to scope it down. Put the ticket on their sprint board, not yours. “Tighten later” never happens unless it is someone’s scheduled work.
Q: Lambda is behind an API Gateway that is public. Does that change anything? A: Yes. With AdministratorAccess attached, the blast radius of a vulnerability (SSRF, injection, dependency CVE) is your entire AWS account, and a public endpoint means anyone on the internet can probe for one. This is the Capital One scenario almost exactly. Public-facing Lambdas need extra rigor: scoped IAM, a permission boundary, and VPC egress controls if they do not need internet access.
Q: How do you enforce this org-wide, not just case-by-case? A: Service Control Policies at the AWS Organization level that deny iam:AttachRolePolicy with arn:aws:iam::aws:policy/AdministratorAccess unless the caller is in a specific ops account. Combine with AWS Config rules that flag any role with *:* in its policy. Automation, not vigilance.
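The timeboxed exception from the first answer can be expressed as an IAM statement whose `Condition` expires on its own. The sketch below generates one; `DateLessThan` and `aws:CurrentTime` are standard IAM condition elements, while the helper name, the action list, and the bucket ARN in the usage example are hypothetical.

```python
import json
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, Iterable


def timeboxed_statement(actions: Iterable[str], resource: str,
                        granted_at: datetime, hours: int = 24) -> Dict[str, Any]:
    """Allow statement that stops matching `hours` after `granted_at` (which must be UTC)."""
    expires = granted_at.astimezone(timezone.utc) + timedelta(hours=hours)
    return {
        "Effect": "Allow",
        "Action": list(actions),
        "Resource": resource,
        "Condition": {
            # IAM compares the request time against this ISO 8601 timestamp.
            "DateLessThan": {"aws:CurrentTime": expires.strftime("%Y-%m-%dT%H:%M:%SZ")}
        },
    }


if __name__ == "__main__":
    statement = timeboxed_statement(
        ["s3:GetObject", "s3:ListBucket"],
        "arn:aws:s3:::example-bucket/*",  # hypothetical resource
        datetime.now(timezone.utc),
    )
    print(json.dumps({"Version": "2012-10-17", "Statement": [statement]}, indent=2))
```

The expiry only removes the permission; it does not remove the policy, so the follow-up ticket to delete or replace it is still needed.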
What they are testing: Practical cost engineering, VPC networking knowledge, and ability to identify hidden cloud costs.
Strong answer:
Step 1: Identify the traffic. Enable VPC Flow Logs and analyze where the NAT Gateway traffic is going. Most of it is probably going to AWS services, not the public internet.
Step 2: Deploy VPC Endpoints for AWS services.
  • S3 Gateway Endpoint (free) — Eliminates all S3 traffic through NAT Gateway.
  • DynamoDB Gateway Endpoint (free) — Same for DynamoDB.
  • ECR, CloudWatch Logs, SQS, Secrets Manager Interface Endpoints (~$7/month each) — Eliminates pull-through traffic for container images and logging.
Step 3: Check for unnecessary cross-AZ traffic. If instances in AZ-b route through a NAT Gateway that lives in AZ-a, every byte crosses an AZ boundary first and costs about $0.01/GB in each direction on top of NAT processing. Deploy one NAT Gateway per AZ and point each private subnet's route table at its local gateway. Service-to-service calls across AZs do not touch the NAT Gateway but carry the same cross-AZ charge, so consider AZ-affinity for tightly coupled services.
Step 4: Evaluate the NAT instance alternative. For non-critical workloads (dev, staging), a t3.nano or t4g.nano instance acting as a NAT costs ~$4/month. You lose the managed high-availability, but for non-production environments, that is acceptable.
Expected savings: Most teams that deploy S3 and ECR VPC endpoints see a 50-70% reduction in NAT Gateway costs within the first billing cycle.
Structured Answer Template:
  1. Measure first — VPC Flow Logs grouped by destination. -> 2. Deploy free gateway endpoints (S3, DynamoDB). -> 3. Deploy interface endpoints for chatty services (ECR, CloudWatch Logs, Secrets Manager). -> 4. Hunt cross-AZ traffic (zonal placement or AZ affinity). -> 5. Consider NAT instance for non-prod. -> 6. Quantify expected savings before committing.
Real-World Example: Lyft’s engineering team posted a retro where a single misconfigured pod pulling 50GB container images through NAT Gateway every 5 minutes was costing them ~$18K/month. After adding ECR Interface endpoints in each AZ, the NAT bill dropped 80% and the ECR endpoint cost was <$100/month. The before-after ratio shows up in nearly every team’s first optimization pass.
Big Word Alert — NAT Gateway: a managed AWS service that lets private-subnet resources reach the internet (or AWS public endpoints) via a shared public IP. Charges $0.045/hour per gateway + $0.045/GB processed. The processing fee is what kills you.
Big Word Alert — Gateway Endpoint vs Interface Endpoint: Gateway endpoints (S3, DynamoDB only) attach to a route table and are free. Interface endpoints (most other AWS services) are ENIs in your subnet and cost ~$7/month per AZ + $0.01/GB, which is still cheaper than NAT for anything chatty.
Follow-up Q&A Chain:
Q: How do you find which service is driving NAT cost without reading gigabytes of Flow Logs manually? A: Ship Flow Logs to S3 in Parquet, query with Athena: SELECT dstaddr, sum(bytes) FROM flow_logs WHERE action='ACCEPT' AND dstport = 443 GROUP BY dstaddr ORDER BY 2 DESC. Cross-reference the top IPs against AWS IP ranges (published at ip-ranges.json) to identify which service family they belong to.
Q: You deploy an ECR VPC endpoint, but NAT cost does not drop. Why? A: Check three things: (1) the VPC endpoint policy actually allows ecr:* for your account (default policy is permissive but someone may have locked it down), (2) private DNS is enabled on the endpoint (otherwise clients resolve the public ECR hostname and route via NAT), (3) Docker pulls also hit S3 for image layers, so you need the S3 Gateway endpoint too.
Q: Is the managed NAT Gateway ever worth keeping over a NAT instance? A: Yes, in prod. NAT Gateway is fully managed with HA inside the AZ and no maintenance burden. NAT instances (a t4g.nano running iptables) save 80% of cost but you own patching, failover, and scaling bandwidth. Only viable in dev/staging where an outage is acceptable.
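To quantify expected savings before committing, a rough comparison is enough. The sketch below uses the approximate rates from the definitions above ($0.045/hour and $0.045/GB for NAT Gateway, about $7/month per AZ and $0.01/GB for an interface endpoint); the 730-hour month and the traffic volume in the usage example are assumptions.

```python
HOURS_PER_MONTH = 730  # assumed average month length


def nat_gateway_monthly_cost(gb_processed: float, gateways: int = 1,
                             hourly: float = 0.045, per_gb: float = 0.045) -> float:
    """Fixed hourly charge per gateway plus the per-GB processing fee."""
    return gateways * hourly * HOURS_PER_MONTH + gb_processed * per_gb


def interface_endpoint_monthly_cost(gb_processed: float, az_count: int,
                                    monthly_per_az: float = 7.0, per_gb: float = 0.01) -> float:
    """Fixed charge per AZ plus the (much lower) per-GB fee."""
    return az_count * monthly_per_az + gb_processed * per_gb


if __name__ == "__main__":
    traffic_gb = 10_000  # hypothetical monthly ECR pull volume
    via_nat = traffic_gb * 0.045  # processing fee only; the gateway itself stays
    via_endpoint = interface_endpoint_monthly_cost(traffic_gb, az_count=3)
    print(f"NAT processing: ${via_nat:.2f}, interface endpoint: ${via_endpoint:.2f}")
```

Note that moving traffic to an endpoint removes only the processing fee for that traffic; the hourly NAT Gateway charge remains as long as anything still needs internet egress.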

Real-World Architecture Examples

Example 1: E-Commerce Order Pipeline

Context: An e-commerce platform processing 50,000 orders per day.
User places order
  -> API Gateway -> Lambda (validate order)
  -> DynamoDB (store order, status: PENDING)
  -> SQS (order-processing queue)
  -> Lambda (process payment via Stripe API)
  -> EventBridge (OrderPaid event)
    -> Lambda: send confirmation email (SES)
    -> Lambda: update inventory (DynamoDB)
    -> Lambda: publish to analytics (Kinesis Firehose -> S3)
  -> DynamoDB (update order, status: CONFIRMED)
Why this works: Each step is independent and can fail/retry without affecting others. The SQS queue buffers orders during traffic spikes. EventBridge fan-out ensures the email, inventory, and analytics services are decoupled — adding a new consumer (loyalty points, warehouse notification) requires zero changes to existing code.
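The fan-out step can be made concrete with the event the payment Lambda would publish and the pattern a new consumer's rule would match on. This is a sketch under stated assumptions: the entry is shaped like one element of the `Entries` list for EventBridge `put_events`, the source name, bus name, and detail fields are hypothetical, and `rule_matches` implements only exact matching on `source` and `detail-type`, a small subset of real EventBridge pattern matching.

```python
import json
from typing import Any, Dict


def order_paid_entry(order_id: str, total_cents: int, bus_name: str = "orders") -> Dict[str, Any]:
    """One put_events entry announcing that an order was paid."""
    return {
        "Source": "shop.orders",        # hypothetical source name
        "DetailType": "OrderPaid",
        "Detail": json.dumps({"orderId": order_id, "totalCents": total_cents}),
        "EventBusName": bus_name,       # hypothetical bus name
    }


# A new consumer (for example loyalty points) only needs its own rule with this pattern.
LOYALTY_RULE_PATTERN = {"source": ["shop.orders"], "detail-type": ["OrderPaid"]}


def rule_matches(pattern: Dict[str, Any], entry: Dict[str, Any]) -> bool:
    """Exact-match check on source and detail-type only (illustration, not the full matcher)."""
    source_ok = entry["Source"] in pattern.get("source", [entry["Source"]])
    type_ok = entry["DetailType"] in pattern.get("detail-type", [entry["DetailType"]])
    return source_ok and type_ok


if __name__ == "__main__":
    entry = order_paid_entry("ord_1001", 4999)
    print(rule_matches(LOYALTY_RULE_PATTERN, entry))
```

The producer never references its consumers, which is what lets the loyalty or warehouse service be added without touching the payment code.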

Example 2: Media Processing Pipeline

Context: A platform where users upload videos up to 5 GB, which need to be transcoded into multiple formats.
Client -> Pre-signed URL (PUT) -> S3 (raw-uploads bucket)
  -> S3 Event -> SQS -> Lambda (validate file, extract metadata)
  -> Step Functions workflow:
    -> Parallel:
      -> MediaConvert job: 1080p H.264
      -> MediaConvert job: 720p H.264
      -> MediaConvert job: 480p H.264
      -> Lambda: generate thumbnail from first frame
    -> Lambda: update DynamoDB (status: READY, URLs)
    -> Lambda: send notification (SNS -> user's webhook)
Why Step Functions here: The workflow has parallelism (transcode all formats simultaneously), error handling (retry individual format conversions), and state tracking (know exactly which step a video is in). SQS alone cannot express this.
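A rough Amazon States Language definition for this workflow is sketched below, built as a Python dictionary so it can be serialized to JSON. The state names and Lambda ARNs are hypothetical placeholders; in this sketch each transcode branch is a Task that would wrap the MediaConvert job submission, and the retry settings are illustrative rather than recommended values.

```python
import json
from typing import Any, Dict, Optional, Sequence

RETRY_POLICY = [{"ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 30,
                 "MaxAttempts": 2, "BackoffRate": 2.0}]


def lambda_arn(name: str) -> str:
    """Placeholder ARN; account ID and function names are made up for the example."""
    return f"arn:aws:lambda:us-east-1:123456789012:function:{name}"


def task(resource: str, next_state: Optional[str] = None, retry: bool = False) -> Dict[str, Any]:
    """Build a Task state that either transitions to next_state or ends its (sub)workflow."""
    state: Dict[str, Any] = {"Type": "Task", "Resource": resource}
    if retry:
        state["Retry"] = list(RETRY_POLICY)
    if next_state is not None:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def build_definition(resolutions: Sequence[str] = ("1080p", "720p", "480p")) -> Dict[str, Any]:
    """Parallel transcodes plus thumbnail, then status update, then notification."""
    branches = []
    for res in resolutions:
        name = f"Transcode{res}"
        # Each format retries independently of the others.
        branches.append({"StartAt": name,
                         "States": {name: task(lambda_arn(f"transcode-{res}"), retry=True)}})
    branches.append({"StartAt": "Thumbnail",
                     "States": {"Thumbnail": task(lambda_arn("thumbnail"))}})
    return {
        "StartAt": "ProcessVideo",
        "States": {
            "ProcessVideo": {"Type": "Parallel", "Branches": branches, "Next": "UpdateStatus"},
            "UpdateStatus": task(lambda_arn("update-status"), next_state="Notify"),
            "Notify": task(lambda_arn("notify")),
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_definition(), indent=2))
```

The Parallel state only moves on to UpdateStatus once every branch has finished, which is the state tracking that a plain queue cannot express.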

Example 3: Real-Time Fraud Detection

Context: A fintech processing credit card transactions that need fraud scoring within 100ms.
Transaction API -> Kinesis Data Streams (partitioned by card_id)
  -> Lambda (enhanced fan-out consumer):
    -> Call ML inference endpoint (SageMaker) for fraud score
    -> If score > 0.8: publish to SNS (block transaction)
    -> If score 0.5-0.8: publish to SQS (human review queue)
    -> All: write to DynamoDB (transaction log)
    -> All: Kinesis Firehose -> S3 (for model retraining)
Why Kinesis, not SQS: Ordering by card_id ensures all transactions for the same card are processed sequentially (detecting velocity patterns — 5 transactions in 10 seconds). Multiple consumers (real-time scoring, batch analytics, model retraining) read from the same stream independently. Replay capability allows re-scoring historical transactions when the model is updated.
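The routing thresholds and the velocity pattern mentioned above can be sketched in a few lines. The class and function names are hypothetical, and the tracker keeps state in memory for illustration only. It relies on events for one card arriving in timestamp order, which is the property that partitioning by card_id provides.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


def route_transaction(score: float) -> str:
    """Map a fraud score to an action using the thresholds from the diagram."""
    if score > 0.8:
        return "block"   # publish to SNS
    if score >= 0.5:
        return "review"  # publish to the human review queue
    return "allow"


class VelocityTracker:
    """Flags a card once it reaches max_events within window_seconds."""

    def __init__(self, max_events: int = 5, window_seconds: float = 10.0) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._seen: Dict[str, Deque[float]] = defaultdict(deque)

    def record(self, card_id: str, timestamp: float) -> bool:
        """Record one transaction and return True when the velocity rule trips."""
        window = self._seen[card_id]
        window.append(timestamp)
        # Drop events that fell out of the sliding window.
        while window and timestamp - window[0] > self.window_seconds:
            window.popleft()
        return len(window) >= self.max_events


if __name__ == "__main__":
    tracker = VelocityTracker()
    flags = [tracker.record("card-1", t) for t in (0, 1, 2, 3, 4)]
    print(flags, route_transaction(0.65))
```

If the same card's events could arrive out of order, the sliding window would evict the wrong entries, which is why a queue without per-key ordering makes this kind of check unreliable.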

Curated Resources

re:Invent Talks:
  • “Optimizing Lambda Cost and Performance” (SVS401) — Deep dive into Lambda internals, Firecracker, and optimization techniques from the Lambda team.
  • “Advanced Serverless Patterns” (SVS340) — Step Functions, event-driven patterns, and saga implementations.
  • “S3 Masterclass” (STG301) — Storage classes, lifecycle management, and cost optimization from the S3 team.
  • “Networking Best Practices for VPC” (NET301) — VPC design, Transit Gateway, and PrivateLink patterns.
  • “Running Kubernetes at Scale” (CON301) — For teams that have chosen EKS, this covers production patterns and pitfalls.
Search for these on YouTube — AWS publishes all re:Invent sessions publicly.
Books:
  • Designing Data-Intensive Applications by Martin Kleppmann — Not AWS-specific, but the distributed systems principles behind every AWS service.
  • AWS Certified Solutions Architect Study Guide — Even if you are not pursuing the cert, the SAA-C03 study materials are the most structured way to learn AWS service selection.
Blogs:
  • AWS Architecture Blog — Practical architecture patterns from AWS solutions architects.
  • Last Week in AWS by Corey Quinn — The best newsletter for AWS cost analysis and commentary. Corey is brutally honest about pricing.
  • theburningmonk.com by Yan Cui — The best independent blog on serverless patterns, costs, and pitfalls.
  • cloudonaut.io by Michael and Andreas Wittig — Deep technical blog covering VPC, IAM, cost optimization, and CloudFormation.

Interview Deep-Dive Questions

How to use this section. These questions go beyond surface-level recall. Each one starts with a core question, then chains into follow-ups the way a real senior-to-staff-level interview unfolds. The follow-ups dig into internals, trade-offs, failure modes, and production judgment. Practice answering the core question first, then see if you can handle the follow-up chain without looking at the answers.

1. Walk me through exactly what happens inside AWS when a Lambda function cold-starts. Where does the time go, and what levers do you have at each stage?

Difficulty: Senior What the interviewer is really testing: Whether you understand the cold start lifecycle at an infrastructure level, not just “it is slow.” Can you decompose latency and apply targeted fixes? Strong answer: A cold start is not one thing — it is a pipeline of five sequential stages, and each has different optimization levers.
  • Stage 1 — Code download (~50-500ms). The Lambda service fetches your deployment package from an internal S3 cache. Larger packages take longer. Container images are handled differently: Lambda splits the image into chunks, caches them close to the execution fleet, and loads them on demand, so only the data needed at startup is fetched. Lever: Shrink package size. Use tree-shaking in Node.js, strip unused dependencies in Python. For container images, use a minimal base image (alpine or distroless) and order Dockerfile layers so frequently-changing code is last.
  • Stage 2 — MicroVM provisioning (~50-100ms). AWS uses Firecracker to spin up a lightweight microVM. This is largely outside your control, but it is fast because Firecracker was purpose-built for this workload (its published figure is a microVM boot in as little as ~125ms). Lever: Almost none directly. But choosing more memory indirectly allocates proportional CPU, which speeds up subsequent stages.
  • Stage 3 — Runtime initialization (~10-5000ms). The language runtime starts. Python interpreter: fast. Node.js V8: fast. JVM with class loading and JIT warmup: 3-10 seconds. .NET CLR assembly loading: 1-3 seconds. Lever: Choose a lighter runtime for cold-start-sensitive paths. For Java, SnapStart snapshots the JVM state after init and restores from the snapshot, cutting cold start from seconds to 200-500ms. For .NET, Native AOT compiles ahead-of-time, eliminating CLR startup.
  • Stage 4 — Init code execution (~10-2000ms). Your code outside the handler runs: imports, SDK client creation, DB connection establishment, config loading from Parameter Store or Secrets Manager. Lever: This is where you have the most control. Lazy-initialize anything you do not need on every invocation. Cache configuration in memory. Use the AWS SDK’s built-in credential provider instead of making explicit STS calls. If you call Secrets Manager during init, that is a network round-trip on every cold start.
  • Stage 5 — Handler execution. This happens on every invocation, warm or cold. Not part of cold start.
The critical insight: stages 1-4 happen sequentially, so the total cold start is their sum. A Java Lambda with a 200MB container image calling Secrets Manager during init could easily see close to 8 seconds of cold start: 500ms download + 100ms VM + 5s JVM + 2s init code = 7.6s.
The line that separates mid-level from senior: “I would not just say ‘use provisioned concurrency.’ I would first instrument Init Duration in CloudWatch, break down where the time is going, fix the cheap problems (package size, init code), and only then reach for provisioned concurrency if the remaining cold start still violates the SLA.”
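Because the stages run sequentially, a cold-start budget is just a sum, and the largest term tells you where to spend effort first. The sketch below uses the example figures from the Java scenario above; the "after" numbers are illustrative assumptions, not measurements.

```python
from typing import Dict


def cold_start_ms(stages: Dict[str, int]) -> int:
    """Total cold start is the sum of the sequential stages."""
    return sum(stages.values())


def biggest_stage(stages: Dict[str, int]) -> str:
    """The stage worth optimizing first."""
    return max(stages, key=stages.get)


JAVA_EXAMPLE = {
    "code_download": 500,
    "microvm": 100,
    "runtime_init": 5000,
    "init_code": 2000,
}

if __name__ == "__main__":
    print(cold_start_ms(JAVA_EXAMPLE), biggest_stage(JAVA_EXAMPLE))
    # Assumed effect of fixing the cheap problems first: smaller package, lazy init.
    tuned = dict(JAVA_EXAMPLE, code_download=200, init_code=500)
    print(cold_start_ms(tuned), biggest_stage(tuned))
```

Even after the cheap fixes, runtime initialization dominates in this example, which is the signal to look at runtime-level options before buying provisioned concurrency.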

Follow-up: Your team runs Java on Lambda and the CTO refuses to let you switch runtimes. Cold starts are 6 seconds. What do you do?

Answer: This is a real constraint I have seen. Here is the playbook, in order of cost and complexity:
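  • Enable SnapStart immediately. This is the single biggest win for Java Lambda. AWS takes a Firecracker microVM snapshot of the execution environment after your init code runs. On cold start, it restores from that snapshot instead of re-initializing the JVM. (CRaC, Coordinated Restore at Checkpoint, is the hook API your code uses to react to the checkpoint and restore events; it is not the snapshot mechanism itself.) This typically cuts cold starts from 6s to 400-800ms. It is a configuration flag that applies to published function versions, with zero code changes for most applications.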
  • Audit your init code. Spring Boot with annotation scanning and component scanning is the biggest offender. If you are using Spring Boot, consider switching to Micronaut, Quarkus, or plain Java (no framework). Quarkus was designed for fast startup and has an explicit Lambda extension. If switching frameworks is off the table, at least disable annotation scanning for packages you do not need.
  • Trim the dependency tree. Run mvn dependency:tree and look for transitive dependencies you do not use. Every JAR that the classloader touches adds startup time. Shade only the classes you use with the Maven Shade plugin.
  • Provisioned concurrency as the final layer. After SnapStart and init optimization, if cold starts are still above SLA, provision enough warm instances to cover baseline traffic. Use Application Auto Scaling to track the ProvisionedConcurrencyUtilization metric and scale provisioned capacity with demand.
  • Measure the cost trade-off. Provisioned concurrency for 50 instances at 512MB costs roughly $9/day in us-east-1 (50 × 0.5 GB × 86,400 s at about $0.0000042 per GB-second), before request and duration charges. Compare that to the engineering cost of rewriting in another language. Usually, SnapStart plus provisioned concurrency is dramatically cheaper than a rewrite.

Follow-up: What are the gotchas with SnapStart that most people miss?

Answer: SnapStart has three specific traps:
  • Uniqueness assumptions break. If your init code generates a random seed, UUID, or unique identifier, the snapshot captures that value. Every restored instance starts with the same “random” value. This breaks security tokens, encryption IVs, and anything that assumes init-time uniqueness. You must use CRaC’s beforeCheckpoint and afterRestore hooks to re-initialize these values. The AWS docs call this out, but most developers miss it until they see duplicate “unique” IDs in production.
  • Network connections are stale. A database connection opened during init is in the snapshot but the TCP connection is dead by the time the function restores. You must implement connection validation or re-establishment in afterRestore. Connection pools that do health checks on borrow (like HikariCP’s connectionTestQuery) handle this naturally, but raw connections do not.
  • Randomness seeded at init becomes deterministic. If your init code seeds a pseudorandom number generator (including a CSPRNG that you seed yourself), the snapshot freezes the generator state. All restored instances start from the same state and produce the same sequence. This is a subtle security vulnerability. Re-seed in an afterRestore hook (the org.crac Resource interface that the Lambda Java runtime supports), or draw randomness inside the handler instead of during init.
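The uniqueness and PRNG traps are easier to believe once you see them reproduced. The sketch below is a simulation only: it uses `copy.deepcopy` to stand in for snapshot and restore, and Python's non-cryptographic `random.Random` to stand in for any generator seeded during init. It does not use the SnapStart or CRaC APIs, and the class and method names are hypothetical.

```python
import copy
import os
import random


class InitState:
    """State built once during init, as a function's module-level code would."""

    def __init__(self) -> None:
        # Seeded during init, so the seed ends up inside the snapshot.
        self.rng = random.Random(os.urandom(16))

    def new_token(self) -> str:
        return f"{self.rng.getrandbits(64):016x}"

    def after_restore(self) -> None:
        """What a restore hook should do: pick up fresh entropy."""
        self.rng.seed(os.urandom(16))


def restore(snapshot: InitState) -> InitState:
    """Stand-in for restoring an execution environment from a snapshot."""
    return copy.deepcopy(snapshot)


if __name__ == "__main__":
    snapshot = InitState()

    a, b = restore(snapshot), restore(snapshot)
    print("without hook:", a.new_token(), b.new_token())  # identical tokens

    c, d = restore(snapshot), restore(snapshot)
    c.after_restore()
    d.after_restore()
    print("with hook:   ", c.new_token(), d.new_token())  # now they differ
```

The same reasoning applies to anything else computed once during init and assumed unique, such as instance IDs or cached timestamps.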

2. You are designing a system that processes 500,000 file uploads per day to S3. Each file needs validation, transformation, and loading into a database. Walk me through the architecture.

Difficulty: Senior / Staff-Level What the interviewer is really testing: End-to-end event-driven architecture design, back-of-envelope capacity planning, failure handling at scale, and the judgment to pick the right AWS services. Strong answer: First, some quick math to size the problem. 500K files per day is roughly 6 files per second sustained, with likely peaks of 20-30/s during business hours. This is well within Lambda’s comfort zone but high enough that error handling and cost optimization matter. The architecture:
  • Upload path: Clients upload via pre-signed PUT URLs generated by an API (Lambda behind API Gateway). Files land in an S3 bucket partitioned by date prefix (uploads/2026/04/10/). Pre-signed URLs let the client upload directly to S3, so our API never touches the file bytes — this eliminates a huge bandwidth and memory bottleneck.
  • Event routing: S3 Event Notifications route to SQS (not directly to Lambda). I put SQS in the middle for three reasons: (1) it buffers during Lambda throttling or cold start bursts, (2) it gives me configurable retry with exponential backoff, and (3) it provides a DLQ for poison messages. I would set the visibility timeout to 6x the Lambda timeout per AWS best practices.
  • Processing Lambda: Polls SQS, downloads the file from S3, validates it (schema check, file type, size limits, malware scan if required), transforms it (normalize fields, enrich with lookup data), and writes to the database. I would set reserved concurrency at 50-100 to protect the database from connection storms — this is critical.
  • Database writes: If the target is DynamoDB, I would use BatchWriteItem for throughput. If it is Aurora PostgreSQL, I would use RDS Proxy to pool connections from Lambda (otherwise each concurrent Lambda opens its own connection and you exhaust max_connections within minutes). Batch inserts via COPY or multi-row INSERT rather than one-row-at-a-time.
  • Error handling: After 3 failed processing attempts, SQS moves the message to a DLQ. A separate Lambda monitors the DLQ depth via CloudWatch Alarm. If the DLQ grows beyond a threshold, it triggers an SNS notification to PagerDuty. I would also write failed records to a separate S3 bucket (failed-uploads/) with the error details for manual reprocessing.
  • Idempotency: Files may be delivered more than once (SQS is at-least-once). I would use a DynamoDB table or a database unique constraint on a hash of the file content or the S3 object key to prevent duplicate processing.
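The processing Lambda's retry and idempotency behavior can be sketched as a handler that reports per-message failures, so SQS redelivers only the messages that failed. The `batchItemFailures` response shape is the one Lambda expects when ReportBatchItemFailures is enabled on the event source mapping. The `process` callable and the in-memory `seen_keys` set are hypothetical stand-ins for the real transform-and-load step and for a DynamoDB table or unique constraint.

```python
import json
from typing import Any, Callable, Dict, List, Set


def handle_batch(event: Dict[str, Any], process: Callable[[str], None],
                 seen_keys: Set[str]) -> Dict[str, List[Dict[str, str]]]:
    """Process each SQS message; return only the failed ones for redelivery."""
    failures: List[Dict[str, str]] = []
    for record in event.get("Records", []):
        try:
            body = json.loads(record["body"])  # the S3 event notification
            for s3_record in body.get("Records", []):
                key = s3_record["s3"]["object"]["key"]
                if key in seen_keys:
                    continue  # duplicate delivery: already processed
                process(key)
                seen_keys.add(key)  # mark done only after success
        except Exception:
            # After maxReceiveCount failures SQS moves this message to the DLQ.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}


def make_message(message_id: str, key: str) -> Dict[str, str]:
    """Build a fake SQS record wrapping an S3 event, for local testing."""
    body = {"Records": [{"s3": {"object": {"key": key, "size": 1024}}}]}
    return {"messageId": message_id, "body": json.dumps(body)}


if __name__ == "__main__":
    def process(key: str) -> None:
        if key.endswith(".bad"):
            raise ValueError(f"validation failed for {key}")

    event = {"Records": [make_message("m1", "uploads/a.csv"),
                         make_message("m2", "uploads/b.bad")]}
    print(handle_batch(event, process, set()))
```

Without the partial-failure response, one bad file would force the whole batch back onto the queue and the good files would be reprocessed, which is exactly the duplicate work the idempotency check then has to absorb.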

Follow-up: The product team now says some files are 5GB+. How does that change your architecture?

Answer: This changes several things:
  • Upload path: Files over 5GB require multipart upload. The pre-signed URL approach still works but I need to generate pre-signed URLs for each part (using CreateMultipartUpload, then UploadPart pre-signed URLs, then CompleteMultipartUpload). The client-side SDK handles this — the AWS SDK’s TransferManager in Java or boto3’s upload_file with multipart threshold in Python. I must also add a lifecycle policy to abort incomplete multipart uploads after 7 days to avoid hidden storage costs.
  • Processing Lambda timeout: A 5GB file cannot be downloaded, validated, transformed, and loaded within Lambda’s 15-minute timeout. I have two options: (1) use S3 Select to process the file in place if it is CSV/JSON/Parquet — push the filtering to the storage layer, or (2) move processing to an ECS Fargate task that can run for hours. The SQS message would trigger a Step Functions workflow that launches a Fargate task instead of a Lambda.
  • Memory constraints: Lambda maxes out at 10 GB memory. A 5GB file loaded entirely into memory leaves little room for processing. For Fargate, I would configure tasks with 8-16 GB memory and stream the file from S3 in chunks rather than loading it all at once.
  • The hybrid approach: Keep Lambda for files under 100MB (the vast majority) and route large files to Fargate. The SQS consumer Lambda checks the file size from the S3 event metadata. Small files are processed inline; large files trigger a Step Functions workflow that launches a Fargate task. This optimizes cost (Lambda for the 99% of small files) while handling the edge case (Fargate for the 1% of large files).
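The routing decision in the hybrid approach is a one-line check on the object size carried in the S3 event. A minimal sketch, assuming the 100MB threshold named above; the function name and return labels are hypothetical.

```python
from typing import Any, Dict

LAMBDA_MAX_BYTES = 100 * 1024 * 1024  # 100MB threshold from the hybrid approach


def choose_processor(s3_record: Dict[str, Any], threshold: int = LAMBDA_MAX_BYTES) -> str:
    """Return 'lambda' for small files and 'fargate' for everything else."""
    size = s3_record["s3"]["object"]["size"]  # bytes, as reported in the S3 event
    return "lambda" if size < threshold else "fargate"


if __name__ == "__main__":
    small = {"s3": {"object": {"key": "uploads/report.csv", "size": 2_000_000}}}
    large = {"s3": {"object": {"key": "uploads/video.mov", "size": 6 * 1024**3}}}
    print(choose_processor(small), choose_processor(large))
```

Keeping the threshold as a parameter makes it easy to tune once you have real data on how long files of a given size take to process.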

Follow-up: How do you prevent Lambda from overwhelming your Aurora PostgreSQL database with connections?

Answer: This is one of the most common production failures in serverless-to-relational architectures. Lambda can scale to hundreds of concurrent executions in seconds, each opening its own database connection. Aurora PostgreSQL’s max_connections defaults to around 1,600 for a db.r5.large, but each connection consumes memory and process resources.
  • RDS Proxy is the primary solution. RDS Proxy sits between Lambda and Aurora, maintaining a pool of warm database connections. Hundreds of Lambda instances share a pool of, say, 100 database connections. RDS Proxy handles connection multiplexing, authentication caching, and automatic failover. It adds ~1ms of latency, which is negligible. Cost: roughly $21/month for a small instance.
  • Reserved concurrency as a guard rail. Even with RDS Proxy, I would set reserved concurrency on the Lambda function to cap the maximum number of simultaneous executions. If the proxy pool is 100 connections and each Lambda holds one connection for 500ms, I would cap Lambda at 200 concurrent executions — this ensures the proxy is never overwhelmed.
  • Connection reuse in Lambda. Initialize the database connection outside the handler (in the module scope). The connection persists across warm invocations of the same execution environment. This means a warm Lambda reuses its connection instead of opening a new one per invocation.
  • The pattern I would avoid: Opening and closing a connection per invocation. This creates massive connection churn, wastes time on TCP + TLS + auth handshake on every request, and generates load on the database’s process management.
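The connection-reuse bullet can be illustrated with a small sketch. It uses the standard-library sqlite3 module purely as a stand-in for a PostgreSQL client pointed at the RDS Proxy endpoint, so the connect call and query are assumptions; the point is only that the connection object lives at module scope and survives across warm invocations.

```python
import sqlite3

# Module scope: this variable outlives a single invocation and is reused
# for as long as the execution environment stays warm.
_connection = None


def get_connection():
    """Open the connection on first use, then hand back the same object."""
    global _connection
    if _connection is None:
        # Stand-in for connecting to the RDS Proxy endpoint with a PostgreSQL driver.
        _connection = sqlite3.connect(":memory:")
    return _connection


def handler(event, context):
    """Lambda-style handler that never opens a connection per invocation."""
    conn = get_connection()
    row = conn.execute("SELECT 1").fetchone()
    return {"ok": row[0] == 1}


if __name__ == "__main__":
    handler({}, None)
    handler({}, None)
    print(get_connection() is get_connection())
```

A production version would also need to detect a dead connection and reconnect, which this sketch leaves out.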

Going Deeper: What happens if RDS Proxy itself becomes the bottleneck?

RDS Proxy scales based on the underlying database instance size — it provisions enough capacity to handle the database’s max_connections. But the real bottleneck is rarely the proxy; it is the database’s ability to handle concurrent queries. If you are seeing proxy connection timeouts, the actual problem is usually slow queries holding connections open. The fix is query optimization (indexes, query plans, connection timeout settings on the proxy), not more proxy capacity. I would check pg_stat_activity in Aurora to see if connections are stuck in idle in transaction state, which is a common leak pattern where application code opens a transaction and never commits or rolls back.

3. Your company runs 200 Lambda functions across 5 teams. You are tasked with designing the AWS account strategy. What do you recommend?

Difficulty: Staff-Level What the interviewer is really testing: Organizational architecture thinking, AWS Organizations knowledge, understanding of blast radius and governance, and the ability to balance isolation with operational simplicity. Strong answer: The way I think about this is: accounts are a security and operational boundary, not an organizational chart. The goal is to contain blast radius, enable independent team velocity, and give finance clean cost attribution — without drowning in account sprawl. My recommended structure for 5 teams and 200 functions:
  • Management Account — AWS Organizations root. No workloads here. Only billing, SCPs, and Control Tower configuration.
  • Security OU: Log Archive Account (centralized CloudTrail, Config, GuardDuty findings), Security Tooling Account (Security Hub, IAM Access Analyzer).
  • Infrastructure OU: Shared Services Account (CI/CD pipelines, ECR repositories, shared Lambda layers, DNS zones in Route 53), Networking Account (Transit Gateway, VPN, if needed).
  • Workloads OU: Split into Production OU and Non-Production OU. Within each, one account per team. So 5 production accounts and 5 development/staging accounts. Total: 10 workload accounts.
  • Sandbox OU: Individual developer sandbox accounts with aggressive SCPs (no resources larger than medium, auto-cleanup after 7 days via the open-source aws-nuke tool).
Total: ~15 accounts. This sounds like a lot, but AWS Control Tower automates provisioning with guardrails pre-configured. Key SCPs I would apply:
  • Root OU: Deny all actions outside approved regions (us-east-1, eu-west-1). Deny root user access. Require S3 encryption. Prevent leaving the organization.
  • Sandbox OU: Deny expensive instance types. Deny production service creation (RDS Multi-AZ, Aurora, large DynamoDB tables).
  • Production OU: Deny deletion of CloudTrail logs. Deny disabling encryption. Require MFA for destructive actions.
Cost attribution: Each team’s production and non-production accounts generate separate line items in consolidated billing. No tagging games needed for team-level cost attribution — the account boundary handles it structurally. Within each account, teams tag by service for finer granularity.

Follow-up: Team A needs to read data from Team B’s DynamoDB table. How do you handle cross-account access without breaking isolation?

Answer: I have three options, and the right one depends on the access pattern:
  • Cross-account IAM role assumption (my default choice). Team B creates an IAM role in their account with a trust policy allowing Team A’s Lambda execution role to assume it. Team A’s Lambda calls sts:AssumeRole, gets temporary credentials, and reads from Team B’s DynamoDB table. The role in Team B’s account has a scoped policy: only dynamodb:GetItem and dynamodb:Query on the specific table, nothing else. This is auditable (CloudTrail logs every role assumption), revocable (delete the trust policy), and granular (limit by IP, VPC, or condition keys).
  • DynamoDB Streams to EventBridge (for async data sharing). If Team A does not need synchronous reads but just needs to react to changes in Team B’s data, enable DynamoDB Streams in Team B’s account and forward the change records to EventBridge (for example with an EventBridge Pipe or a small Lambda, since a stream does not publish to an event bus on its own). EventBridge cross-account rules forward relevant events to Team A’s event bus. This fully decouples the teams: Team A does not need any access to Team B’s account.
  • Shared data layer (for truly shared data). If the DynamoDB table is shared reference data (product catalog, config data), consider putting it in the Shared Services account with cross-account read access for all workload accounts. This is appropriate when the data is organizational, not team-owned.
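What I would be cautious about: Resource-based policies directly on the DynamoDB table. DynamoDB has supported resource-based policies on tables since 2024 (as S3 and SQS long have), so they are a valid way to grant direct cross-account access. I still default to the cross-account role, because the grant then lives in one explicit role that can be audited and revoked separately from the table.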
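The two policy documents behind the cross-account role option look roughly like this. It is a sketch only: the account IDs, role name, and table name are hypothetical placeholders, and a real setup would attach these documents through IaC rather than printing them.

```python
import json

# Hypothetical identifiers for illustration only.
TEAM_A_LAMBDA_ROLE = "arn:aws:iam::111111111111:role/team-a-orders-reader"
TEAM_B_TABLE = "arn:aws:dynamodb:us-east-1:222222222222:table/Orders"


def trust_policy(consumer_role_arn: str) -> dict:
    """Trust policy on Team B's role: which principal may call sts:AssumeRole."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": consumer_role_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def read_only_policy(table_arn: str) -> dict:
    """Permissions policy on Team B's role: read access to one table, nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["dynamodb:GetItem", "dynamodb:Query"],
                "Resource": table_arn,
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(trust_policy(TEAM_A_LAMBDA_ROLE), indent=2))
    print(json.dumps(read_only_policy(TEAM_B_TABLE), indent=2))
```

Keeping the trust policy and the permissions policy separate is what makes the access both revocable and narrowly scoped: removing the trust statement cuts Team A off without touching the table.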

Follow-up: How do you prevent SCP misconfigurations from causing a production outage?

Answer: This is a real risk. An overly broad deny SCP applied at the wrong OU level can instantly break production across every account in that OU.
  • Test SCPs in a staging OU first. Create a mirror of your production OU structure with a test account. Apply the SCP there, run automated integration tests, and verify nothing breaks before promoting to the production OU.
  • Use the SCP simulator. AWS IAM Policy Simulator can evaluate SCPs against specific API calls. Before applying a deny policy, simulate the actions your production services make (Lambda invoke, DynamoDB read/write, S3 access, CloudWatch logging) and verify they are not blocked.
  • Deploy SCPs through CI/CD. SCPs should be in version control (Terraform or CloudFormation StackSets), deployed through a pipeline with approval gates and automatic rollback. Never apply SCPs manually through the console.
  • Always include a break-glass exclusion. Every deny SCP should have a condition excluding a specific “break-glass” IAM role that can bypass the restriction in emergencies. This role should require MFA and be heavily audited, but it prevents you from locking yourself out.
  • Monitor with CloudTrail. Set up CloudWatch alarms on AccessDenied events in CloudTrail. A sudden spike in access denied errors after an SCP change indicates something is broken.
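A region-restriction SCP with a break-glass exclusion might look like the sketch below. The role name pattern and the list of exempted global services are assumptions chosen for illustration; a real policy needs its exemption list checked against the services the organization actually uses.

```python
import json


def region_deny_scp(allowed_regions, break_glass_role_pattern):
    """Build an SCP that denies calls outside approved regions.

    The deny does not apply to the break-glass role, and it skips a few
    global services (an illustrative list, not a complete one).
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideApprovedRegions",
                "Effect": "Deny",
                "NotAction": [
                    "iam:*",
                    "organizations:*",
                    "route53:*",
                    "cloudfront:*",
                    "sts:*",
                    "support:*",
                ],
                "Resource": "*",
                "Condition": {
                    # Both conditions must match for the deny to take effect.
                    "StringNotEquals": {"aws:RequestedRegion": list(allowed_regions)},
                    "ArnNotLike": {"aws:PrincipalArn": break_glass_role_pattern},
                },
            }
        ],
    }


if __name__ == "__main__":
    scp = region_deny_scp(
        ["us-east-1", "eu-west-1"],
        "arn:aws:iam::*:role/break-glass-admin",  # hypothetical role name
    )
    print(json.dumps(scp, indent=2))
```

Generating the document from code keeps it reviewable in version control, which fits the advice to deploy SCPs only through a pipeline.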

4. Explain the difference between SQS, SNS, EventBridge, and Kinesis. When have you used each one, and when have you made the wrong choice?

Difficulty: Intermediate / Senior What the interviewer is really testing: Whether you have genuine production experience with these services versus textbook knowledge. The “wrong choice” part specifically tests self-awareness and learning from mistakes. Strong answer: The way I think about it is along two axes: how many consumers and do you need replay.
  • SQS is a task queue. One message goes to one consumer. Once processed, it is deleted. Think of it as a to-do list: each item gets assigned to one worker. I use SQS for work distribution — processing uploaded files, handling order fulfillment, running async jobs. The killer feature is simplicity: no shards to manage, no partitioning strategy, nearly infinite throughput for Standard queues, built-in DLQ, and visibility timeout for safe concurrent processing.
  • SNS is a megaphone. One message goes to many subscribers (Lambda, SQS, HTTP, email, SMS). It is fan-out with no retention — if the subscriber is down, the message is lost (unless the subscriber is an SQS queue, which buffers it). I use SNS when a single event should trigger multiple independent reactions: “order placed” triggers email, inventory update, and analytics simultaneously.
  • EventBridge is SNS with brains. It adds content-based filtering (only route high-value orders to fraud detection), schema registry (know what your events look like), cross-account routing, and archive/replay. I use EventBridge as the default for new event-driven integrations. It is more expensive per event than SNS, but the filtering alone eliminates hundreds of lines of consumer-side if statements.
  • Kinesis is a log. Events are ordered within a shard, retained for 1-365 days, and multiple consumers read independently at their own pace from any point in the stream. I use Kinesis for real-time analytics (clickstream data), change data capture (reacting to database changes), and any use case where I need to replay the event history (reprocess last 24 hours after deploying a bug fix).
A wrong choice I have made: Early in a project, I used Kinesis for what was fundamentally a task queue. The team needed to process webhook events from a payment provider — one event, one handler, no replay needed. Kinesis added shard management overhead, required handling the iterator/checkpoint logic, and when a single poison record arrived, it blocked the entire shard for hours because we had not configured bisectBatchOnFunctionError. We migrated to SQS in a weekend and the system became dramatically simpler. The lesson: do not reach for the most powerful tool when the simple one fits.
The sharp distinction to memorize for interviews: SQS = work distribution (one consumer per message). SNS/EventBridge = event notification (many consumers per event). Kinesis = event streaming (ordered log with replay). If you get this taxonomy wrong, the interviewer will probe relentlessly.

Follow-up: You need exactly-once processing. Which of these gives it to you?

Answer: None of them give you exactly-once processing out of the box. This is a fundamental distributed systems truth that separates real practitioners from textbook learners.
  • SQS FIFO claims “exactly-once delivery” within a 5-minute deduplication window. It deduplicates messages with the same MessageDeduplicationId. But “exactly-once delivery” is not “exactly-once processing.” If your Lambda reads the message, processes it, writes to the database, and then crashes before deleting the message from SQS, the message becomes visible again and gets processed a second time. You still need idempotency in your consumer.
  • Kinesis is explicitly at-least-once. Checkpointing happens after processing. If the consumer crashes between processing and checkpointing, the record is reprocessed.
  • SNS and EventBridge are at-least-once. Retry logic can deliver the same event multiple times.
The real answer: Build idempotent consumers. Use a deduplication store (DynamoDB with a conditional write on a unique event ID, or a database unique constraint). The pattern is: before processing, check if this event ID has been seen. If yes, skip. If no, process and record the event ID atomically. This works regardless of which messaging service you use, which is exactly the point — exactly-once semantics is an application concern, not an infrastructure guarantee.
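The idempotent-consumer pattern can be sketched with an in-memory store standing in for the deduplication table. The class and function names are hypothetical; in DynamoDB the `put_if_absent` step would be a conditional write that fails when the event ID already exists, and the marker plus the side effect would ideally commit in one transaction.

```python
class DedupStore:
    """In-memory stand-in for a table with a conditional write on event ID."""

    def __init__(self):
        self._seen = set()

    def put_if_absent(self, event_id: str) -> bool:
        """Record the ID; return True only the first time it is seen."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        return True

    def forget(self, event_id: str) -> None:
        """Remove the marker so a failed event can be retried."""
        self._seen.discard(event_id)


def process_once(event: dict, store: DedupStore, apply_effect) -> bool:
    """Apply the side effect at most once per event ID; return True if applied."""
    if not store.put_if_absent(event["id"]):
        return False  # duplicate delivery: skip
    try:
        apply_effect(event)
    except Exception:
        # The effect did not happen, so release the marker and let the retry through.
        store.forget(event["id"])
        raise
    return True


if __name__ == "__main__":
    store, ledger = DedupStore(), []
    event = {"id": "evt-1", "amount": 5}
    print(process_once(event, store, ledger.append))  # first delivery
    print(process_once(event, store, ledger.append))  # redelivery of the same event
    print(len(ledger))
```

The same wrapper works whether the event arrived from SQS, SNS, EventBridge, or Kinesis, which is the point of treating exactly-once as an application concern.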

Going Deeper: You mentioned Kinesis shard blocking. Walk me through what happens and how to prevent it.

When Lambda processes records from a Kinesis stream, it reads a batch of records from a shard and invokes your function. If the function returns an error, Lambda retries the same batch. It keeps retrying until the records expire from the stream (which could be 24 hours to 365 days depending on your retention setting). During this entire retry loop, no new records on that shard are processed. One poison record blocks all processing on the shard. Prevention:
  • bisectBatchOnFunctionError: true — After a failure, Lambda splits the batch in half and retries each half. This narrows down to the single failing record through binary search rather than blocking on the entire batch.
  • maxRetryAttempts — Set a maximum number of retries (e.g., 3-5). After exhausting retries, the failing record is sent to the on-failure destination.
  • On-failure destination — Route failed records to an SQS queue or SNS topic for investigation. This is your DLQ equivalent for stream sources.
  • Error handling in your function: Catch exceptions per record within the batch. Process what you can, collect failures, and return them in a batchItemFailures response (this requires enabling ReportBatchItemFailures on the event source mapping). For a stream source, Lambda checkpoints at the lowest failed sequence number and retries from that record onward, rather than retrying the entire batch.
  • maxRecordAge — Set a maximum age for records. If a record is older than this threshold, Lambda skips it and sends it to the on-failure destination. This prevents ancient records from blocking the shard indefinitely.

5. You are building a new microservice. A colleague says “just put it on Lambda” and another says “use Fargate.” How do you make this decision?

Difficulty: Intermediate What the interviewer is really testing: Structured decision-making about compute platforms, awareness of trade-offs, and the ability to ask clarifying questions instead of jumping to an answer. Strong answer: My first response would be: “What are the requirements?” Because neither is universally better — they optimize for different things. Here are the questions I would ask:
  • What is the traffic pattern? If it is spiky (zero at night, peak during the day), Lambda’s scale-to-zero saves money. If it is steady 24/7 traffic, Fargate’s always-on pricing wins.
  • What is the latency SLA? If p99 must be under 100ms, Lambda cold starts are a risk unless I pay for provisioned concurrency. Fargate is always warm.
  • How long does a single request take? Lambda has a 15-minute hard timeout. If the service does long-running work (video processing, ML training, large file transformations), Lambda is out.
  • What are the dependency requirements? If the service needs 2GB of ML model files, Lambda’s container image path works but cold starts will be painful. Fargate handles large images without cold start penalties.
  • Does it talk to a relational database? Lambda-to-RDS without RDS Proxy is a foot-gun (connection storms). Fargate maintains persistent connection pools naturally.
  • What does the team know? If the team has deep container experience with existing Dockerfiles, CI/CD pipelines, and monitoring, Fargate is lower friction. If the team is small and does not want to manage containers, Lambda is simpler.
My default framework: Start with Lambda for new services unless one of these disqualifiers is present: execution longer than 15 minutes, hard latency SLA below 50ms at p99, steady high-traffic workload where containers are 3x+ cheaper, or heavy reliance on relational databases without RDS Proxy. If any of those apply, use Fargate. If none apply, Lambda’s operational simplicity wins.
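The default framework reads naturally as a checklist function. This is only a sketch of the four disqualifiers named above; the `ServiceProfile` fields are hypothetical, and the "3x+ cheaper" judgment is passed in as a boolean because it depends on a separate cost estimate.

```python
from dataclasses import dataclass

LAMBDA_MAX_SECONDS = 15 * 60  # Lambda's hard timeout


@dataclass
class ServiceProfile:
    max_request_seconds: float
    p99_latency_sla_ms: float
    steady_high_traffic: bool  # containers estimated to be 3x+ cheaper
    uses_relational_db: bool
    has_rds_proxy: bool = False


def choose_compute(profile: ServiceProfile) -> tuple:
    """Return ("lambda" | "fargate", reasons) using the disqualifier list."""
    reasons = []
    if profile.max_request_seconds > LAMBDA_MAX_SECONDS:
        reasons.append("execution longer than 15 minutes")
    if profile.p99_latency_sla_ms < 50:
        reasons.append("hard p99 latency SLA below 50ms")
    if profile.steady_high_traffic:
        reasons.append("steady high traffic where containers are much cheaper")
    if profile.uses_relational_db and not profile.has_rds_proxy:
        reasons.append("relational database without RDS Proxy")
    return ("fargate", reasons) if reasons else ("lambda", [])


if __name__ == "__main__":
    webhook = ServiceProfile(2, 300, False, False)
    transcoder = ServiceProfile(3600, 5000, False, False)
    print(choose_compute(webhook))
    print(choose_compute(transcoder))
```

Returning the reasons alongside the answer keeps the decision explainable, which matters more in a design review than the answer itself.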

Follow-up: At what traffic volume does the cost crossover from Lambda to Fargate typically happen?

Answer: The crossover depends on three variables: memory allocation, execution duration, and traffic steadiness. But for a common configuration (256MB memory, 200ms average duration):
  • Below 1 million requests/month: Lambda wins overwhelmingly. The free tier alone covers most of this. A Fargate task running 24/7 costs at minimum ~$9/month regardless of traffic.
  • 1-10 million requests/month: Comparable cost. Lambda is roughly $10-80/month. A single Fargate task handling this load costs ~$9-20/month. The difference is not worth optimizing for.
  • 10-100 million requests/month: Fargate wins on raw compute cost. Lambda at 50M requests/month with 256MB/200ms costs ~$200/month. Two Fargate tasks can handle the same load for ~$40/month.
  • Above 100 million requests/month: Fargate or ECS on EC2 is dramatically cheaper. At this scale, consider EC2 with Savings Plans — you might save another 50-70%.
The nuance: Lambda’s “cost” includes zero operational overhead. You do not configure health checks, manage deployments, set up auto-scaling rules, or handle container image builds. Those tasks consume engineering hours. At $150/hour fully loaded engineer cost, even 2 hours per month of container management eliminates the Lambda-to-Fargate cost savings up to roughly $300/month. For small teams, I bias toward Lambda longer than the raw math suggests.

Follow-up: What about App Runner? When does that fit?

Answer: App Runner is the “I want Fargate but even simpler” option. You point it at a container image or a source code repository, and AWS handles build, deploy, scaling, TLS, and load balancing. It is the closest AWS equivalent to Heroku or Google Cloud Run.
When it fits: Web applications and APIs where the team does not want to configure VPCs, load balancers, or auto-scaling policies. Prototypes. Services owned by teams that are not infrastructure-savvy. The developer experience is excellent: push code, get a URL.
When it does not fit: When you need networking beyond what its VPC connector covers (custom routing, complex private connectivity), fine-grained scaling controls, or cost optimization. App Runner’s pricing is slightly higher than raw Fargate because you are paying for the managed layer. It also has fewer knobs: you cannot tune health checks, connection draining timeouts, or scaling cooldown periods with the same granularity as Fargate.
My honest take: App Runner is under-discussed. For teams that want container-based deployment without Kubernetes or Fargate complexity, it is a great option. But it occupies a narrow sweet spot: simpler than Fargate, more capable than Lambda, less flexible than either.

6. A critical production service depends on DynamoDB. The table is getting hot-partition throttling at 3 AM. Walk me through your diagnosis and fix.

Difficulty: Senior What the interviewer is really testing: Real operational debugging ability with DynamoDB, understanding of partition key design, and the ability to diagnose production issues under pressure. Strong answer: Hot-partition throttling means one or more partition keys are receiving disproportionate traffic. DynamoDB distributes data and throughput across partitions based on the partition key. If one key gets 10x the traffic of others, that partition exhausts its allocated throughput while other partitions sit idle. DynamoDB’s adaptive capacity helps (it redistributes throughput to hot partitions over minutes), but it cannot fully compensate for severely skewed access patterns. Diagnosis steps:
  1. Check CloudWatch Contributor Insights. Enable it on the table — it shows the most frequently accessed and throttled partition keys. This immediately tells you which keys are hot. If the top key accounts for 30% of all reads at 3 AM, you have found the problem.
  2. Correlate with application behavior. What runs at 3 AM? Batch jobs, cron tasks, data exports, report generation. A nightly job scanning a specific partition key range (e.g., all orders from yesterday using a partition key of date=2026-04-09) concentrates all traffic on one partition.
  3. Check if it is on-demand or provisioned. On-demand tables can still throttle if traffic exceeds the table’s previous peak by more than 2x within 30 minutes. Provisioned tables throttle when consumed capacity exceeds allocated capacity per partition.
Fixes, from fastest to most architectural:
  • Immediate: Increase provisioned capacity or switch to on-demand. This buys time but does not fix the root cause if the partition key design is skewed.
  • Short-term: Add write sharding to the hot key. Append a random suffix (e.g., date=2026-04-09#3) to spread writes across multiple physical partitions. The batch reader scatters-gathers across all suffixes. This is the standard DynamoDB hot-partition pattern.
  • Medium-term: Redesign the access pattern. If the 3 AM job scans by date, and the partition key is date, every daily scan hits one partition. Restructure the key to include another dimension (e.g., PK = customer_id, SK = date) so that the scan is distributed. Use a GSI if query access patterns require the date-first lookup.
  • Architectural: DAX (DynamoDB Accelerator). If the 3 AM traffic is read-heavy and reads the same data repeatedly, put DAX in front of the table. DAX caches reads at the item and query level, absorbing the burst without hitting the underlying partition.

Follow-up: The hot key is a single global counter that multiple services increment. How do you handle that?

Answer: A global counter on a single DynamoDB item is the classic hot-partition anti-pattern. Every increment hits the same partition key, same physical partition, and that partition maxes out at roughly 1,000 WCUs. The scatter-gather counter pattern: Instead of one item {PK: "global_counter", count: N}, create N shards: {PK: "counter#0", count: X}, {PK: "counter#1", count: Y}, …, {PK: "counter#N-1", count: Z}. Each writer randomly picks a shard and increments it. To read the total, read all shards and sum. With 10 shards, you spread write throughput 10x.
  • Choose shard count based on expected writes per second. Each shard can handle ~1,000 WCUs, so 10 shards handle ~10,000 increments per second.
  • The trade-off is read complexity: getting the current total requires reading all shards. For use cases where you only need an approximate count or can tolerate 1-second staleness, cache the total in a separate item and update it periodically.
  • If exact real-time counts are needed, use DynamoDB Streams to aggregate changes into a single total asynchronously.
Alternative: Use ElastiCache Redis. If the counter is truly global, high-frequency, and needs sub-millisecond reads, Redis INCR is atomic and handles hundreds of thousands of increments per second on a single key. Periodically persist the value to DynamoDB for durability. This is what I would recommend if the counter is in a hot path (e.g., rate limiting) rather than just analytics.
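The sharded counter can be sketched in a few lines. A dict stands in for the DynamoDB items here, and the class name is hypothetical; against a real table, `increment` would be an atomic ADD update on the chosen shard key and `total` would read every shard item.

```python
import random


class ShardedCounter:
    """Spread increments over several keys; sum them to read the total."""

    def __init__(self, name: str, shard_count: int, rng=None):
        self.keys = [f"{name}#{i}" for i in range(shard_count)]
        self.items = {key: 0 for key in self.keys}  # stand-in for table items
        self._rng = rng or random.Random()

    def increment(self, amount: int = 1) -> str:
        """Add to one randomly chosen shard and return the key that was hit."""
        key = self._rng.choice(self.keys)
        self.items[key] += amount
        return key

    def total(self) -> int:
        """Scatter-gather read: visit every shard and sum the counts."""
        return sum(self.items[key] for key in self.keys)


if __name__ == "__main__":
    counter = ShardedCounter("counter", shard_count=10)
    for _ in range(1000):
        counter.increment()
    print(counter.total())
    print(counter.items)
```

Random shard choice keeps writers stateless. The cost is that every exact read touches all shards, which is why caching the total in a separate item is attractive for read-heavy counters.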

7. Your S3 data transfer bill is $15,000/month. How do you bring it down?

Difficulty: Senior What the interviewer is really testing: Deep understanding of S3 cost model, CDN and caching strategies, architectural thinking about data movement, and cost engineering as a core engineering skill. Strong answer: $15,000/month in S3 data transfer at $0.09/GB means roughly 167 TB of data leaving S3 per month. That is significant. Here is my systematic approach:
Step 1: Understand where the data is going. S3 server access logs and CloudTrail S3 data events tell me which buckets, prefixes, and clients are responsible. I would break it down by:
  • S3 to internet (most expensive at $0.09/GB)
  • S3 to CloudFront (cheaper at $0.00/GB for origin fetches in most regions)
  • S3 cross-region (e.g., replication to another region at $0.02/GB)
  • S3 to EC2/Lambda in the same region (free — but going through NAT Gateway costs $0.045/GB)
Step 2: Put CloudFront in front of everything serving to the internet. If end-users or external clients are reading from S3, they should hit a CloudFront distribution, not S3 directly. CloudFront caches objects at edge locations, so repeated requests for the same object are served from cache without hitting S3. For content with even modest cache hit ratios (50-70%), this cuts S3 egress by half or more. CloudFront’s data transfer pricing is also slightly lower than direct S3 egress.
Step 3: Consider S3 Transfer Acceleration for upload-heavy use cases. If a chunk of the traffic is long-distance uploads (e.g., users in Asia uploading to us-east-1), Transfer Acceleration routes them through CloudFront edge locations for faster transfers. It adds its own per-GB charge, so treat it as a performance fix rather than a cost reduction and enable it only where upload speed matters.
Step 4: Eliminate unnecessary cross-region replication. If you are replicating data to multiple regions for disaster recovery, ask whether all that data needs to be replicated. Often, only critical data (database backups, configuration) needs cross-region copies, while logs and analytics data can stay in one region.
Step 5: Move compute to the data. If an analytics pipeline in eu-west-1 is pulling 50 TB/month from S3 in us-east-1, move the pipeline or create a replica bucket in the same region as the compute. Data gravity matters: move compute to data, not data to compute.
Step 6: Use S3 Select and Athena to reduce scan volumes. If clients download entire objects to extract a few fields, S3 Select pushes the filtering to the storage layer. You transfer only the matching rows, not the whole file. For analytics queries, Athena with Parquet-formatted data scans (and transfers) 80-90% less data than querying raw JSON.
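The opening estimate and the effect of a cache in front of S3 are simple arithmetic. The sketch below uses the $0.09/GB internet egress price quoted above and decimal terabytes; the function names are hypothetical, and real bills are tiered, so treat the output as an order-of-magnitude check.

```python
def egress_gb(monthly_bill_usd: float, price_per_gb: float) -> float:
    """Back out the transferred volume from a flat per-GB price."""
    return monthly_bill_usd / price_per_gb


def origin_fetch_gb(total_gb: float, cache_hit_ratio: float) -> float:
    """Volume that still has to be fetched from S3 at a given cache hit ratio."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    return total_gb * (1.0 - cache_hit_ratio)


if __name__ == "__main__":
    volume_gb = egress_gb(15_000, 0.09)
    print(f"{volume_gb / 1000:.0f} TB leaving S3 per month")
    for ratio in (0.2, 0.5, 0.7):
        remaining_tb = origin_fetch_gb(volume_gb, ratio) / 1000
        print(f"hit ratio {ratio:.0%}: {remaining_tb:.0f} TB still served from S3")
```

Running the same calculation per bucket or prefix, using the breakdown from Step 1, shows which traffic is worth optimizing first.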

Follow-up: You put CloudFront in front of S3 and cache hit ratio is only 20%. Why, and how do you fix it?

Answer: A 20% cache hit ratio means 80% of requests are cache misses, which means either the content is not cacheable or the cache is not configured well. Common causes:
  • Unique URLs per user. If each request includes a unique query string parameter (session token, timestamp, user ID), CloudFront treats each URL as a unique cache key. Fix: Configure CloudFront to ignore irrelevant query strings in the cache key. Whitelist only query parameters that actually change the content (e.g., size=thumbnail vs size=full).
  • Low TTL or Cache-Control: no-cache headers. If S3 objects set Cache-Control: no-cache or max-age=0, or the distribution’s TTL settings are very low, CloudFront does not cache them (or caches very briefly). Objects with no Cache-Control header fall back to the distribution’s default TTL, so check that setting too. Fix: Set appropriate Cache-Control headers on S3 objects. Static assets: max-age=31536000, immutable. Dynamic content: use a shorter TTL but still cache (even max-age=60 dramatically reduces origin load).
  • Long-tail content. If you serve millions of unique objects and each is accessed once per day, even perfect caching does not help because the first request is always a miss. Fix: This is a content distribution problem, not a caching problem. Consider pre-warming the cache for popular content or accepting the miss rate and focusing on reducing object sizes instead.
  • Traffic spread thin across edge locations. If requests arrive through hundreds of edge locations, each edge sees only a small share of the traffic for any given object, fills its cache separately, and evicts cached content quickly. Fix: Use CloudFront’s Price Class to restrict to fewer edge locations (higher hit rate per location) or use Origin Shield (an additional caching layer between edge locations and S3 that centralizes cache fills).
Origin Shield is the underrated fix. It adds a regional cache between edge locations and S3. Instead of 200 edge locations each making their own origin requests, they all hit Origin Shield first. If Origin Shield has the object cached, it serves it — one origin request instead of 200. For large-scale distributions, Origin Shield can improve cache hit ratios by 20-40%.
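The unique-URL problem is easiest to see by computing cache keys by hand. This sketch mimics the effect of allowing only selected query parameters into the cache key; it is not CloudFront's actual algorithm, and the sorting of parameters is a simplification added here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit


def cache_key(url: str, allowed_params: set) -> str:
    """Build a cache key from the path plus only the allowed query parameters."""
    parts = urlsplit(url)
    kept = sorted(
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if name in allowed_params
    )
    query = urlencode(kept)
    return parts.path + ("?" + query if query else "")


if __name__ == "__main__":
    urls = [
        "https://cdn.example.com/img/cat.jpg?size=thumbnail&session=abc",
        "https://cdn.example.com/img/cat.jpg?session=xyz&size=thumbnail",
        "https://cdn.example.com/img/cat.jpg?size=full&session=abc",
    ]
    for url in urls:
        print(cache_key(url, {"size"}))
```

With the session parameter excluded, the first two URLs collapse to one cache entry, while the third stays separate because `size` really does change the content.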

8. Your team wants to go multi-region active-active. When would you argue against it?

Difficulty: Staff-Level What the interviewer is really testing: Architectural maturity, the ability to push back on complexity, and understanding of the real costs and risks of multi-region deployments. Strong answer: Multi-region active-active is one of the most over-prescribed architectural patterns. It sounds great in a design meeting, but the operational complexity is staggering. I would argue against it in most cases. Here is my framework: Argue against when:
  • The business does not actually need zero-downtime failover. Most services can tolerate 30-60 seconds of downtime during an active-passive failover. Ask the business: “What is the cost per minute of downtime?” If the answer is less than the engineering cost of maintaining active-active, it is not worth it. For most SaaS companies, active-passive with Route 53 health checks and automated failover provides 99.95%+ availability.
  • The data layer has strong consistency requirements. Active-active means writes happen in multiple regions simultaneously. For eventually consistent data (user preferences, analytics, content caches), this is manageable. For strongly consistent data (financial balances, inventory counts, sequential ordering), you enter split-brain territory. DynamoDB Global Tables give you eventual consistency with ~1 second replication lag. That 1-second window is where duplicate orders, double-charges, and inventory over-sells live.
  • The team does not have the operational maturity to run it. Active-active requires: per-region monitoring and alerting, automated failover testing, data conflict resolution strategies, region-aware routing, region-specific deployment pipelines, and engineers who can debug cross-region replication issues at 3 AM. If the team has not mastered single-region operations (observability, incident response, deployment automation), adding a second region multiplies problems rather than solving them.
  • The cost does not justify the benefit. Active-active roughly doubles your infrastructure cost (two of everything) plus adds 30-50% operational overhead (cross-region replication, routing, testing). For a service doing $1M ARR, spending $500K on active-active infrastructure is absurd. For a service doing $100M ARR where an hour of downtime costs $500K, it is a no-brainer.
When I would advocate for it:
  • Regulatory requirement (EU data must be served from EU, US from US) — but this is often better solved with region-pinned routing, not true active-active.
  • Latency-sensitive global user base where 200ms cross-ocean latency is unacceptable.
  • The business genuinely cannot tolerate the 30-60 seconds of active-passive failover (financial trading, real-time gaming, emergency services).
The staff-level insight: “My default recommendation is active-passive with automated failover. This gives you disaster recovery, meets 99.95%+ availability SLAs, and avoids the data consistency nightmare of active-active. I would only move to active-active when I have exhausted single-region reliability improvements (multi-AZ, auto-scaling, circuit breakers) and the business case specifically demands it.”

Follow-up: If you do go active-active, how do you handle data conflicts?

Answer: Data conflicts in active-active occur when both regions write to the same record within the replication window. The strategies depend on the data type:
  • Last-writer-wins (LWW). DynamoDB Global Tables use this by default — the write with the latest timestamp wins. This is fine for data where the most recent value is always correct (user profile updates, session data, preferences). It is dangerous for counters, balances, or any additive data where both writes carry information.
  • Application-level conflict resolution. Instead of overwriting, design your data model so writes are additive. Use event sourcing: instead of updating a balance, append a debit/credit event. Both regions can append independently, and the balance is derived by replaying events. Conflicts become “concurrent events” that are both valid.
  • Region-pinning with failover. Assign each entity (user, account, order) a “home” region. All writes for that entity go to its home region. The other region serves reads from the replica. On failover, the secondary region promotes to writer. This avoids conflicts entirely at the cost of cross-region write latency for entities whose users are in the “wrong” region.
  • CRDTs (Conflict-Free Replicated Data Types). For specific data structures (counters, sets, flags), CRDTs guarantee that concurrent updates from different regions merge deterministically without conflicts. This is theoretically elegant but requires redesigning your data model around CRDT-compatible structures. Practically, it works for counters (G-Counter), boolean flags (LWW-Register), and sets (OR-Set). It does not work for arbitrary relational data.
  • My recommendation for most teams: Region-pinning with failover. It sidesteps the conflict problem entirely, is simple to reason about, and gives you active-active read scalability with single-region write simplicity. True active-active writes to the same data from multiple regions should be reserved for teams with deep distributed systems expertise.
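To make the CRDT option concrete, here is a minimal G-Counter (grow-only counter) sketch. The region names are placeholders and the class is illustrative, not taken from any particular library. Each region only ever increments its own slot, and merging takes the per-region maximum, so replicas converge regardless of the order in which they exchange state.

```python
class GCounter:
    """Grow-only counter CRDT with one slot per region."""

    def __init__(self, region: str):
        self.region = region
        self.counts = {}

    def increment(self, amount: int = 1) -> None:
        """Increase this region's own slot; decrements are not allowed."""
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.region] = self.counts.get(self.region, 0) + amount

    def merge(self, other: "GCounter") -> None:
        """Fold in another replica's state by taking the per-region maximum."""
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

    def value(self) -> int:
        """The counter's value is the sum over all regions."""
        return sum(self.counts.values())


if __name__ == "__main__":
    us, eu = GCounter("us-east-1"), GCounter("eu-west-1")
    us.increment(3)  # concurrent writes in two regions
    eu.increment(2)
    us.merge(eu)
    eu.merge(us)
    print(us.value(), eu.value())
```

Compare this with last-writer-wins on a plain number, where one of the two concurrent increments would simply be discarded.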

9. Explain VPC Endpoints — Gateway vs Interface. When do you use each, and what is the cost impact?

Difficulty: Intermediate / Senior
What the interviewer is really testing: Whether you understand AWS networking costs at a practical level and can optimize real infrastructure spend.
Strong answer: VPC endpoints provide private connectivity from your VPC to AWS services without routing traffic through the internet (via a NAT Gateway or Internet Gateway). There are two types, and they have very different architectures and cost profiles.
Gateway Endpoints:
  • Available only for S3 and DynamoDB.
  • Free. No hourly charge, no data processing charge.
  • Implemented as a route table entry. You add the endpoint, associate it with your private subnet route tables, and traffic to S3/DynamoDB is routed directly through the AWS backbone — it never leaves the AWS network and never touches your NAT Gateway.
  • You should enable these on every VPC with private subnets. There is zero reason not to.
Interface Endpoints (powered by AWS PrivateLink):
  • Available for 100+ AWS services (SQS, SNS, ECR, CloudWatch Logs, Secrets Manager, KMS, etc.) and third-party services.
  • Cost: ~$0.01/hour per AZ (~$7.30/month per AZ) plus $0.01/GB of data processed.
  • Implemented as an ENI (Elastic Network Interface) in your subnet. The endpoint gets a private IP address in your VPC, and traffic to the service resolves to that private IP via a private hosted zone.
  • Use when: the data processing savings from avoiding the NAT Gateway ($0.045/GB) exceed the endpoint’s hourly cost. Each GB moved off the NAT Gateway saves about $0.035 ($0.045 - $0.01), so an endpoint breaks even at roughly 210 GB per month per AZ ($7.30 / $0.035). High-traffic services (ECR image pulls, CloudWatch Logs ingestion) usually clear that threshold easily.
Cost example: 50 ECS tasks pull container images from ECR 3 times/day. Each pull downloads ~500MB. Monthly data: 50 * 3 * 0.5GB * 30 = 2,250 GB. Through NAT Gateway: 2,250 * $0.045 = $101.25/month. Through ECR Interface Endpoint: $7.30 * 2 AZs + 2,250 * $0.01 = $37.10/month. Savings: $64/month for this single service.
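The arithmetic in the cost example can be sketched as a small calculator. The rates are the ones quoted in this section and should be treated as assumptions (they vary by region and change over time); the function names are illustrative.

```python
# Rates quoted in this section; treat them as assumptions, not current pricing.
NAT_PER_GB = 0.045             # NAT Gateway data processing, $/GB
ENDPOINT_PER_GB = 0.01         # Interface Endpoint data processing, $/GB
ENDPOINT_PER_AZ_MONTH = 7.30   # ~$0.01/hour per AZ


def nat_cost(gb_per_month: float) -> float:
    """Monthly NAT Gateway data processing cost for the given volume."""
    return gb_per_month * NAT_PER_GB


def endpoint_cost(gb_per_month: float, azs: int) -> float:
    """Monthly Interface Endpoint cost: hourly charge per AZ plus data processing."""
    return azs * ENDPOINT_PER_AZ_MONTH + gb_per_month * ENDPOINT_PER_GB


def break_even_gb(azs: int) -> float:
    """Monthly volume above which the endpoint is cheaper than the NAT path."""
    return azs * ENDPOINT_PER_AZ_MONTH / (NAT_PER_GB - ENDPOINT_PER_GB)


monthly_gb = 50 * 3 * 0.5 * 30  # 50 tasks, 3 pulls/day, 0.5 GB each, 30 days
via_nat = nat_cost(monthly_gb)
via_endpoint = endpoint_cost(monthly_gb, azs=2)
```

With these inputs `via_nat` is about $101.25 and `via_endpoint` about $37.10. Note that this only counts NAT data processing; the NAT Gateway's own hourly charge is unaffected unless the endpoint lets you remove the gateway entirely.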

Follow-up: You have Interface Endpoints for ECR in two AZs but container image pulls are still going through the NAT Gateway. What went wrong?

Answer: This is a surprisingly common misconfiguration. There are several things to check:
  • Private DNS is not enabled. When you create an Interface Endpoint, you can enable “Private DNS.” This creates a private hosted zone that overrides the public DNS name (e.g., api.ecr.us-east-1.amazonaws.com) to resolve to the endpoint’s private IP address. If private DNS is not enabled, your ECS tasks still resolve the ECR hostname to its public IP, which routes through the NAT Gateway.
  • DNS resolution is not enabled on the VPC. The VPC must have enableDnsSupport and enableDnsHostnames set to true for private hosted zones to work. Check VPC settings.
  • There are actually two ECR endpoints required. ECR requires both com.amazonaws.region.ecr.api (for API calls like authentication) and com.amazonaws.region.ecr.dkr (for Docker image layer downloads). Missing either one causes partial or complete fallback to the public endpoint. You also need an S3 Gateway Endpoint because ECR stores image layers in S3.
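  • Security group on the endpoint is too restrictive. Interface Endpoints have security groups. If the security group does not allow inbound HTTPS (port 443) from your ECS tasks’ security group, connections to the endpoint’s private IP time out. With Private DNS enabled there is no automatic fallback to the public endpoint, so fixing the DNS issues above without opening port 443 turns NAT traffic into failed pulls.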
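  • The endpoint is in the wrong AZs. The endpoint’s private DNS name applies VPC-wide, so tasks in an AZ without an endpoint ENI still reach the endpoint, but through an ENI in another AZ. That adds cross-AZ data transfer charges and ties those tasks to the other AZ’s availability. Create the endpoint in all AZs where your tasks run.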
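A quick way to test the Private DNS hypothesis is to resolve the service hostname from inside the VPC and see whether the answer is a private address. The sketch below uses only the Python standard library; the function name is illustrative, and the `resolve` parameter exists purely so the check can be exercised without real DNS.

```python
from __future__ import annotations

import ipaddress
import socket


def resolves_to_private_ips(
    hostname: str, resolve=socket.gethostbyname_ex
) -> tuple[bool, list[str]]:
    """Return (all_private, addresses) for a hostname as seen from this host.

    all_private is True only if at least one IPv4 address came back and
    every address is in a private range, which is what you expect when an
    Interface Endpoint's private hosted zone is answering the query.
    """
    _name, _aliases, addresses = resolve(hostname)
    all_private = bool(addresses) and all(
        ipaddress.ip_address(addr).is_private for addr in addresses
    )
    return all_private, addresses
```

Run from a task or instance in the affected subnet, `resolves_to_private_ips("api.ecr.us-east-1.amazonaws.com")` returning `False` points at Private DNS or the VPC DNS attributes; `True` means name resolution is fine and the problem is elsewhere (security group, missing `ecr.dkr` endpoint, or the S3 Gateway Endpoint).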

10. You are designing the CI/CD pipeline for a team with 30 Lambda functions. What is your deployment strategy?

Difficulty: Senior
What the interviewer is really testing: Practical DevOps thinking for serverless, understanding of deployment safety mechanisms, and the ability to design for both velocity and reliability.
Strong answer: 30 Lambda functions is enough that you need structure, but not so many that you need a platform team. My approach:
Repository structure: Monorepo with each function in its own directory. Shared code (utilities, models, SDK wrappers) in a shared/ directory that builds into a Lambda Layer. A change to shared code triggers deployment of all functions that use that layer. A change to a single function deploys only that function.
Build pipeline (CodePipeline or GitHub Actions):
  1. On PR: Lint, unit test, sam build to verify the functions compile/package. Run integration tests against a dev-stage AWS account.
  2. On merge to main: Build all changed functions, push artifacts to S3 (zip) or ECR (container images), deploy to staging account.
  3. Staging validation: Run smoke tests against the staging deployment. If all pass, proceed to production.
  4. Production deployment: Deploy with traffic shifting using Lambda aliases and CodeDeploy integration.
Traffic shifting (the critical safety mechanism):
  • Each function has a live alias pointing to the current production version.
  • On deployment, I publish a new version and shift the live alias using CodeDeploy with a canary or linear strategy:
    • Canary10Percent5Minutes: Route 10% of traffic to the new version. If CloudWatch alarms (error rate, latency p99, custom business metrics) trigger within 5 minutes, auto-rollback. If clean, shift to 100%.
    • Linear10PercentEvery1Minute: For less risky changes, shift 10% every minute over 10 minutes.
  • The rollback is instant — just re-point the alias to the previous version. No re-deployment needed.
Lambda Layers for shared dependencies:
  • Publish the shared layer as a versioned Layer. Pin production functions to a specific layer version ARN rather than resolving the newest version at deploy time (layers have no $LATEST; only functions do).
  • Layer updates go through the same canary deployment. Deploy the new layer to one function first, validate, then roll out to all functions.
Infrastructure as Code: All Lambda configurations (memory, timeout, reserved concurrency, environment variables, event source mappings) are in SAM templates or CDK. No ClickOps. Every change is reviewed in a PR.
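The "deploy only what changed" rule from the repository structure can be sketched as a small path-mapping function that a CI step might run over the output of `git diff --name-only`. The `functions/<name>/` layout and the function names are assumptions for illustration; only the `shared/` directory comes from the description above.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def functions_to_deploy(
    changed_paths: list[str],
    layer_users: set[str],
    functions_dir: str = "functions",  # assumed layout: functions/<name>/...
    shared_dir: str = "shared",
) -> list[str]:
    """Map changed file paths to the Lambda functions that need redeploying."""
    to_deploy: set[str] = set()
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        if not parts:
            continue
        if parts[0] == shared_dir:
            # Shared code changed: every function using the layer is affected.
            to_deploy |= layer_users
        elif parts[0] == functions_dir and len(parts) > 2:
            # functions/<name>/<file...>: only that function is affected.
            to_deploy.add(parts[1])
    return sorted(to_deploy)


single = functions_to_deploy(
    ["functions/resize/handler.py", "README.md"],
    layer_users={"resize", "notify"},
)
everything = functions_to_deploy(
    ["shared/models.py"],
    layer_users={"resize", "notify"},
)
```

In this example `single` contains only `resize`, while a change under `shared/` selects both layer users. A real pipeline would also need to handle deleted functions and template-only changes, which this sketch ignores.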

Follow-up: How do you handle database schema migrations in this serverless CI/CD pipeline?

Answer: Schema migrations in serverless are tricky because there is no persistent server to run migration scripts from. Here is what works:
  • Dedicated migration Lambda. A Lambda function whose sole purpose is running database migrations. It is triggered as a custom resource in CloudFormation/CDK or as a step in the CI/CD pipeline (invoke via AWS CLI after deployment). It connects to the database via RDS Proxy, runs the migration scripts (using a tool like Flyway, Alembic, or Knex), and reports success/failure.
  • Backward-compatible migrations only. Since Lambda functions are deployed with canary traffic shifting, the old version and new version run simultaneously during deployment. The database schema must be compatible with both versions. This means: add columns (nullable or with defaults), never rename or drop columns during deployment. Use a two-phase migration: Phase 1 deploys the new schema (additive only) and the new code that can use both old and new schema. Phase 2 (days later, after all traffic has shifted) deploys a cleanup migration that removes the old columns.
  • Migration as a pre-deploy step. In the CI/CD pipeline, run the migration Lambda before deploying the new function code. If the migration fails, the deployment stops and the old code continues running against the old schema. If the migration succeeds, deploy the new code that uses the new schema.
  • For DynamoDB: There are no schema migrations in the traditional sense (DynamoDB is schemaless). But adding a new GSI takes time and consumes write capacity. Add new GSIs in a separate deployment step, monitor the backfill progress, and only deploy the code that queries the new GSI after the backfill completes.

Going Deeper: How do you test 30 Lambda functions locally before deploying?

SAM CLI local invoke is the primary tool. sam local invoke FunctionName -e event.json runs your function in a Docker container that mimics the Lambda runtime. It is not perfect (no cold start simulation, no IAM role), but it catches most logic errors. For integration testing, I would use sam local start-api to spin up a local API Gateway + Lambda environment, then run the API test suite against it. For event-driven functions (SQS, S3 triggers), use sam local invoke with sample events generated by sam local generate-event. The honest caveat: local testing catches code bugs but not deployment bugs, IAM issues, or VPC networking problems. I would maintain a dedicated dev/test AWS account where the CI pipeline deploys on every PR and runs integration tests against real AWS services. The confidence hierarchy is: unit tests (fast, local) -> local Lambda testing (medium, Docker) -> integration tests in dev account (slow, real AWS) -> canary deployment in production (final safety net).

11. What is the most expensive mistake you have seen (or made) on AWS, and what did you learn from it?

Difficulty: Senior (tests production experience and self-awareness)
What the interviewer is really testing: Real-world experience, not theoretical knowledge. They want to see humility, systematic thinking about cost, and whether you have actually operated AWS at scale.
Strong answer (example narrative): The most expensive mistake I saw was a recursive Lambda invocation that went undetected for 4 hours on a Friday night. A Lambda function processed images uploaded to an S3 bucket. The processed output was written back to the same bucket with a different prefix. But the S3 event notification was configured with no prefix filter, so it triggered on all PutObject events, not just those under the upload prefix. So: upload triggers Lambda, Lambda writes output to same bucket, output triggers Lambda again, which writes another output, infinitely. In 4 hours, the function executed 23 million times and generated 4 TB of duplicate output objects. The bill was around $12,000, covering Lambda invocations, S3 storage, and (the biggest chunk) S3 request costs on the millions of PUTs.
What I learned:
  • Always use prefix filters on S3 event notifications. Source prefix and destination prefix must be different, and the event notification must filter to only the source prefix.
  • Set reserved concurrency on every Lambda function. If this function had reserved concurrency of 10, the loop would have been throttled and the blast radius contained.
  • AWS now has recursive loop detection for Lambda (launched in 2023 for SQS and SNS loops, with Lambda-S3-Lambda loops covered later). But do not rely on detection; design to prevent recursion.
  • Set up AWS Budgets with action thresholds. A budget action can automatically apply an SCP that denies Lambda invocations if the monthly spend exceeds a threshold. We now have a $500 daily anomaly alert on every account.
  • Use a separate destination bucket. The simplest architectural fix is never writing output to the same bucket as input. Two buckets, one event notification, zero recursion risk.
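As a defense-in-depth measure alongside the notification filter, the handler itself can refuse to process anything outside the source prefix. This is a minimal sketch: the prefix values and function names are hypothetical, and the event shape is the standard S3 notification record with a URL-encoded object key.

```python
from urllib.parse import unquote_plus

SOURCE_PREFIX = "uploads/"     # hypothetical: where users upload originals
OUTPUT_PREFIX = "processed/"   # hypothetical: where the function writes results


def keys_to_process(event: dict) -> list:
    """Return only the object keys that live under the source prefix."""
    keys = []
    for record in event.get("Records", []):
        # S3 event notifications URL-encode the key (spaces arrive as '+').
        key = unquote_plus(record["s3"]["object"]["key"])
        if key.startswith(SOURCE_PREFIX):
            keys.append(key)
        # Anything else (including our own output) is ignored, which breaks
        # the loop even if the notification filter is misconfigured.
    return keys


def output_key(source_key: str) -> str:
    """Map uploads/<name> to processed/<name>."""
    return OUTPUT_PREFIX + source_key[len(SOURCE_PREFIX):]


sample_event = {
    "Records": [
        {"s3": {"object": {"key": "uploads/cat+photo.jpg"}}},
        {"s3": {"object": {"key": "processed/cat+photo.jpg"}}},
    ]
}
selected = keys_to_process(sample_event)
```

The guard does not stop the function from being invoked (and billed) for output objects, so it limits the loop to one extra invocation per object rather than replacing the prefix filter or the separate-bucket design.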
Why interviewers love this question: It separates engineers who have run things in production from those who have only studied. If you cannot name a real expensive mistake (yours or one you witnessed), it signals limited production experience. Have a genuine story ready — the more specific the dollar amount and the more concrete the lesson, the more credible you are.

Follow-up: How do you set up cost guardrails to prevent runaway bills on a new AWS account?

Answer:
  • AWS Budgets: Create a monthly budget with alerts at 50%, 80%, and 100% of expected spend. At 100%, trigger a budget action that notifies via SNS and optionally applies an SCP restricting resource creation.
  • Cost Anomaly Detection: Enable it on the billing account. It uses ML to detect unusual spending patterns and alerts within hours rather than waiting for the end of the month. Configure alerts for both percentage-based (50% above forecast) and absolute-dollar thresholds ($100 anomaly).
  • SCPs for sandbox accounts: Deny creation of expensive resources (large EC2 instances, large RDS instances, Redshift clusters, SageMaker endpoints). Deny actions in regions you do not use.
  • Lambda concurrency limits: Set account-level concurrency limits in non-production accounts to a low number (100-200). This prevents any single runaway function from generating millions of invocations.
  • S3 lifecycle policies from day one: Abort incomplete multipart uploads after 3 days. Expire old object versions after 30 days. Transition to cheaper storage classes on a schedule.
  • Tag enforcement: Use AWS Organizations Tag Policies to require team and environment tags on all resources. Untagged resources generate unattributable costs that nobody optimizes.

12. You need to migrate a monolithic application from EC2 to a cloud-native architecture. How do you approach this?

Difficulty: Staff-Level
What the interviewer is really testing: Strategic migration planning, risk management, understanding of the Strangler Fig pattern, and the judgment to prioritize what to migrate first.
Strong answer: The first thing I would not do is rewrite everything at once. The big-bang rewrite is the highest-risk approach and fails more often than it succeeds. Instead, I use the Strangler Fig pattern: incrementally replacing pieces of the monolith with cloud-native services while the monolith continues running.
Phase 0: Understand the monolith (2-4 weeks). Before touching anything, I would map the monolith’s components, their dependencies, data stores, and traffic patterns. Identify which components have clear boundaries (a payments module that talks to a specific set of tables) versus which are deeply entangled (a “utils” module imported by everything). Use application performance monitoring (X-Ray, Datadog APM) to trace request flows and identify hot paths.
Phase 1: Lift and shift to containers (2-4 weeks). Containerize the monolith as-is and deploy on ECS Fargate. No code changes, no re-architecture. This gets you off EC2, into a reproducible deployment pipeline, with auto-scaling and health checks. The monolith runs identically, just in a container. This de-risks the migration: if anything breaks later, you can always fall back to “container running the monolith.”
Phase 2: Extract the easiest, highest-value component (4-8 weeks). Pick a component that is: (a) loosely coupled (few dependencies on other monolith code), (b) independently deployable (has its own API or event interface), and (c) a good candidate for serverless (event-driven, bursty traffic, or stateless). Common first extractions: image processing, email/notification sending, report generation, webhook handling. Extract it into a Lambda function or Fargate service. Route traffic to the new service using an API Gateway or load balancer rule. The monolith continues handling everything else.
Phase 3: Repeat for each component (ongoing). Work through the monolith component by component, extracting to the appropriate compute platform. For each extraction: deploy the new service alongside the monolith, shadow traffic or canary to verify correctness, then cut over. Keep the monolith code for that component intact but unreachable as a rollback path for 2-4 weeks.
Phase 4: Decommission the monolith. When the last component is extracted, the monolith container is an empty shell. Turn it off. In practice, this takes 6-18 months for a medium-sized monolith.
Key decisions at each phase:
  • Database: Do not try to split the database first. That is the hardest part. Start with the monolith and new services sharing the same database (via RDS Proxy if needed). Extract service-specific tables into their own databases (DynamoDB, separate RDS instances) later, once the service boundaries are stable.
  • Authentication: Extract auth into a shared service early (Cognito, Auth0, or a custom auth service). Every new microservice needs auth, and having a consistent auth layer prevents each service from reimplementing it differently.
  • Shared data: Use an event bus (EventBridge) from the beginning. When the monolith writes an order, it publishes an OrderCreated event. New services subscribe to events rather than querying the monolith’s database. This decouples the extraction pace from the data migration pace.

Follow-up: The CTO wants the migration done in 3 months. How do you push back?

Answer: I would present data, not opinions. “Here is what we can realistically deliver in 3 months, and here is what that timeline risks.”
What 3 months can deliver: Phase 1 (containerization) and Phase 2 (one component extraction). This gives us a deployable, auto-scaling container setup, a CI/CD pipeline, and proof that the extraction pattern works. We will have migrated one component end-to-end and validated the approach.
What 3 months cannot deliver safely: Full decomposition of a monolith into microservices. Rushing this leads to: services with unclear boundaries that need to be re-merged, data consistency bugs from premature database splitting, and an explosion of inter-service communication complexity that nobody has monitoring for yet.
The risk framework: “Each week we spend on the migration is a week we are not shipping features. The Strangler Fig approach lets us do both: we ship features on the monolith while extracting components in parallel. If we try to do everything in 3 months, we do neither well.”
My concrete proposal: “Give me 3 months for Phase 1 and Phase 2. After that, we will have a validated pattern, a realistic velocity measurement, and I can give you an evidence-based timeline for the rest. I would rather under-promise and over-deliver than commit to an aggressive timeline that leads to cutting corners on testing and monitoring.”

Going Deeper: How do you handle the database during migration — do you split it early or late?

Late. Always late. Premature database splitting is the number one reason monolith-to-microservice migrations fail. Here is why: when you extract a service from the monolith but both still share the same database, you can move incrementally. The new service reads and writes to the same tables. If something goes wrong, you can revert to the monolith handling that component without any data migration. When you split the database, you introduce: data synchronization (change data capture between the old and new databases), distributed transactions (what was a single database transaction is now a cross-service saga), and migration risk (moving data between databases while both systems are live). My approach: share the database until the service boundary is stable (the service has been running independently for 2-4 months with no boundary changes). Then split the data using DynamoDB Streams or PostgreSQL logical replication to keep the old and new databases in sync during the transition period. Once the new service is fully cut over, decommission the sync and drop the tables from the old database. The exception: if the monolith’s database is itself the bottleneck (connection limits, query contention, scaling ceiling), then an early database split for the highest-traffic service is justified — but do it for operational reasons, not for architectural purity.

Advanced Interview Scenarios

How this section differs from the questions above. The previous deep-dive questions test whether you know how AWS services work. These scenarios test whether you have been burned by them. Each question is designed so that the “textbook” answer is either incomplete or actively wrong. The strong answers come from engineers who have debugged these problems at 2 AM with PagerDuty screaming.

13. Your Lambda function works fine in dev but randomly times out in production — about 5% of invocations hit the 30-second limit. CloudWatch shows the function uses only 60% of allocated memory. What is going on?

Difficulty: Senior
What the interviewer is really testing: Whether you understand Lambda’s CPU allocation model, downstream dependency failures, and production debugging beyond “just increase the timeout.” The obvious wrong answer is “increase the timeout to 60 seconds.”
What weak candidates say:
  • “Increase the timeout to 60 seconds or give it more memory.”
  • “It is probably cold starts causing the timeouts.”
  • “Add retry logic in the function.”
These miss the root cause entirely. Increasing the timeout just makes the failure take longer. Cold starts do not cause 30-second delays in any runtime. Retrying inside a function that is about to time out burns time with no benefit.
What strong candidates say: The way I would approach this is: 5% timeout rate with normal memory usage screams downstream dependency, not Lambda compute. Here is the diagnostic playbook I have actually run in production:
  • Step 1: Enable X-Ray tracing. X-Ray shows the time breakdown per invocation. When I had this exact problem on a payment processing Lambda, X-Ray revealed that 5% of invocations spent 28 seconds waiting on an HTTP call to a third-party fraud detection API. The API had a long-tail latency problem — p50 was 200ms but p99 was 29 seconds. Without X-Ray, CloudWatch only shows total duration, which is useless for multi-step functions.
  • Step 2: Check the CPU angle. Lambda allocates CPU proportional to memory. At 128MB, you get roughly 1/10th of a vCPU. If the function does CPU-intensive work (JSON parsing of large payloads, image manipulation, crypto operations), it can be CPU-starved even with memory headroom. I once debugged a Node.js Lambda that parsed a 5MB JSON payload — at 128MB memory, parsing took 8 seconds. At 1024MB memory, it took 400ms. The fix was increasing memory from 128MB to 512MB, which cut p99 latency from 12 seconds to 1.2 seconds. The extra memory was irrelevant — it was the CPU that mattered.
  • Step 3: Check DNS resolution and connection establishment. Lambda in a VPC resolves DNS through the VPC’s DNS resolver, which can be slow under load. Also check if the function creates new TCP/TLS connections per invocation instead of reusing them. A TLS handshake to a downstream service adds 100-300ms, and if the downstream has connection limits, the handshake can queue for seconds.
  • Step 4: Check for connection pool exhaustion on downstream services. If the Lambda talks to RDS without RDS Proxy, and 200 concurrent Lambda instances each hold a connection, Aurora’s max_connections might be exhausted. New connections queue until one frees up, causing sporadic timeouts on the functions waiting for a connection. CloudWatch metric DatabaseConnections on the RDS side confirms this.
  • Step 5: Set explicit socket timeouts. The AWS SDK defaults to no socket timeout in some configurations. If a downstream service silently drops a connection (no RST, no FIN), the Lambda hangs until its own timeout. I always set connectTimeout: 3000, socketTimeout: 10000 on every HTTP client. This converts a 30-second timeout into a 10-second error with a clear message.
War Story: At a fintech startup processing 2M transactions/day, we had a Lambda with exactly this 5% timeout pattern. Root cause: the function called DynamoDB, which was fine, and then called an internal microservice via ALB. The ALB had a 60-second idle timeout, but the Lambda’s HTTP client had no keepalive. Connections were being reused from the Lambda connection pool but the ALB had already closed them server-side. The HTTP client would write the request, wait for a response that never came (half-open connection), and hang until Lambda’s 30-second timeout killed it. Fix: set SO_KEEPALIVE with a 15-second interval on the HTTP client and add a 5-second read timeout. Timeout rate dropped from 5% to 0.01% overnight.
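The "explicit socket timeout" step can be illustrated with the Python standard library. This is a sketch, not the AWS SDK configuration mentioned above: `get_with_timeout` and `DownstreamTimeout` are hypothetical names, and `http.client` applies one timeout value to each blocking socket operation (connect, then each read) rather than separate connect and read budgets.

```python
import http.client
import socket


class DownstreamTimeout(Exception):
    """Raised when a downstream call gives no response within the deadline."""


def get_with_timeout(host: str, port: int, path: str, timeout_s: float = 5.0):
    """GET a path, failing fast instead of hanging until the Lambda timeout."""
    # The timeout applies to connect and to every subsequent socket read.
    conn = http.client.HTTPConnection(host, port, timeout=timeout_s)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read()
    except (socket.timeout, TimeoutError) as exc:
        # A half-open or silent connection lands here after timeout_s
        # with a clear error instead of a 30-second function timeout.
        raise DownstreamTimeout(
            f"{host}:{port}{path} gave no response within {timeout_s}s"
        ) from exc
    finally:
        conn.close()
```

The design choice is to convert a silent hang into a fast, named failure that can be logged, retried, or routed to a fallback while the invocation still has time budget left.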

Follow-up: How would you differentiate between a CPU-bound timeout and a network-bound timeout without X-Ray?

Answer: Instrument the function with manual timestamps around each operation and log them. Even console.log(Date.now()) before and after each external call gives you a poor-man’s trace. If the gap between “before DynamoDB call” and “after DynamoDB call” is 25 seconds, it is network. If the gap between “received payload” and “finished parsing” is 25 seconds, it is CPU. The power-user move: check the REPORT line in CloudWatch Logs. It shows Billed Duration, Max Memory Used, and Init Duration. If Max Memory Used is 90%+ of allocated memory, the garbage collector may be thrashing. If Max Memory Used is low but duration is high, it is probably a downstream wait. Also, Lambda Power Tuning (an open-source tool from Alex Casalboni) systematically tests your function at different memory settings and graphs cost vs duration — it reveals CPU-bound vs I/O-bound behavior instantly.
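The manual-timestamp approach translates directly to Python. The sketch below is one way to do it, assuming a hypothetical handler with a parse step and a downstream step; `timed` is an illustrative helper, not a Lambda or CloudWatch API.

```python
import json
import time
from contextlib import contextmanager


@contextmanager
def timed(step: str, sink=print):
    """Log how long a block took as one JSON line (easy to query in logs)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        sink(json.dumps({"step": step, "elapsed_ms": round(elapsed_ms, 1)}))


def handler(event, context):
    # Hypothetical handler: wrap each stage so the slow one shows up in logs.
    with timed("parse_payload"):
        payload = json.loads(event["body"])
    with timed("downstream_call"):
        time.sleep(0.05)  # stand-in for a DynamoDB or HTTP call
    return {"items": len(payload)}
```

If `parse_payload` dominates, the function is CPU-bound and more memory (and therefore CPU) should help; if `downstream_call` dominates, the time is going to the network or the dependency.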

Follow-up: You discover it is CPU-bound. Increasing memory from 128MB to 1769MB gives you a full vCPU. But the function only needs 80MB of memory. Are you paying for waste?

Answer: This is a question where the math is non-obvious. At 128MB, if the function runs for 10 seconds (CPU-starved), you pay for 128MB * 10s = 1,280 MB-seconds. At 1769MB, it runs in 800ms, so you pay 1769MB * 0.8s = 1,415 MB-seconds. Roughly the same cost, but your latency dropped from 10 seconds to 800ms. In many cases, increasing memory actually decreases cost because the function finishes so much faster that the total MB-seconds goes down. Lambda Power Tuning visualizes this exact tradeoff — I have seen cases where 3x memory allocation reduced both latency and cost by 40%.
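The comparison above is just memory multiplied by duration, which makes the break-even point easy to compute. A minimal sketch with illustrative function names, ignoring the per-request charge and billing granularity:

```python
def mb_seconds(memory_mb: float, duration_s: float) -> float:
    """Billable compute for one invocation, in MB-seconds."""
    return memory_mb * duration_s


def break_even_duration_s(old_memory_mb: float, old_duration_s: float,
                          new_memory_mb: float) -> float:
    """Duration at the new memory size that costs the same as the old setup.

    If the function finishes faster than this at the new size, the larger
    allocation is cheaper as well as faster.
    """
    return mb_seconds(old_memory_mb, old_duration_s) / new_memory_mb


starved = mb_seconds(128, 10)      # CPU-starved configuration
full_vcpu = mb_seconds(1769, 0.8)  # full vCPU configuration
threshold = break_even_duration_s(128, 10, 1769)
```

For the numbers in the answer, `starved` is 1,280 MB-seconds, `full_vcpu` is about 1,415 MB-seconds, and `threshold` is roughly 0.72 seconds: at 800ms the larger setting costs slightly more, and anything under about 720ms makes it strictly cheaper.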

14. Your team chose Step Functions to orchestrate an order fulfillment workflow. Six months later, the monthly Step Functions bill is $8,000 and growing. The CTO asks why “serverless” is so expensive. Diagnose and fix.

Difficulty: Senior
What the interviewer is really testing: Whether you understand Step Functions pricing models (Standard vs Express), can identify architectural cost traps, and know when to replace orchestration with choreography. The trap: most candidates do not know that Standard Workflows charge per state transition.
What weak candidates say:
  • “Step Functions is just expensive, we should rewrite it.”
  • “Move everything to Lambda and SQS.”
  • “The CTO should expect high costs for serverless at scale.”
What strong candidates say: $8,000/month in Step Functions almost certainly means Standard Workflows with high transition volume. Let me do the math backward. Standard Workflows cost $0.025 per 1,000 state transitions. $8,000 / $0.025 * 1,000 = 320 million state transitions per month. If the workflow has 10 states and processes 1 million orders/month, that is 10 million transitions, which costs $250/month, not $8,000. So either the volume is massive, or the workflow has far more states than necessary.
Diagnosis steps I would actually take:
  • Pull the Step Functions CloudWatch metrics. ExecutionsStarted tells me volume. StateMachinesCount and per-machine execution counts tell me if it is one workflow or many. Most importantly, look at the number of states per execution: StatesExecuted / ExecutionsStarted gives average transitions per workflow.
  • Check for “chatty” workflows. I have seen teams put every micro-operation as a separate Step Functions state: validate input (1 transition), check inventory (2), reserve inventory (3), process payment (4), handle payment failure choice (5), update order (6), send email (7), update analytics (8), log completion (9). That is 9 transitions per order. But the real killer is retry states and map states. A Map state iterating over 100 line items in an order fires 100+ transitions per order. If each line item has 3 states, a 100-item order costs 300+ transitions.
  • Identify the fix: Standard to Express migration for eligible workflows. Express Workflows charge per request and duration ($0.00001667/GB-second), not per transition. For the order fulfillment example: a workflow running 1 million times/month at 5 seconds average with 64MB memory uses 1M * 5s * 0.0625GB = 312,500 GB-seconds, or roughly $5.21/month in duration charges, plus a per-request charge of about $1 per million executions. The same workflow on Standard with 10 transitions costs $250/month. Express is roughly 40x cheaper for high-volume, short-duration workflows.
  • The catch with Express Workflows: They have a 5-minute maximum duration, are asynchronous by default (no waiting for the result), and do not record execution history in the console (you must log to CloudWatch). If the order workflow takes longer than 5 minutes (waiting for payment confirmation, external API calls), Express does not work and you need a hybrid approach.
  • Hybrid approach: Use Express for the high-volume, fast inner loop (process each line item) and Standard for the outer orchestration (order lifecycle with human approval, long waits). The Express workflow is called as a nested workflow within the Standard outer workflow.
War Story: At an e-commerce company processing 800K orders/month, the Step Functions bill was $6,200/month. The root cause: a Map state iterated over order line items, and the average order had 12 items, each with 5 states. That is 60 transitions per order baseline. But returns processing reused the same workflow with additional states, adding 8 more transitions per return. 800K orders * 60 transitions + 200K returns * 68 transitions = 61.6 million transitions = $1,540/month, except the team had also added logging and wait states, pushing per-order transitions to 90+. We migrated the line-item processing to an Express sub-workflow (saved $4,800/month) and collapsed three consecutive Lambda states that always ran together into a single Lambda (saved another $400/month). Final bill: $980/month.
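The backward math for Standard Workflows can be sketched as two small functions. The rate is the one quoted in this answer and should be treated as an assumption; the function names are illustrative.

```python
# Standard Workflows rate quoted in this section: $0.025 per 1,000 transitions.
STANDARD_PER_TRANSITION = 0.025 / 1000


def standard_monthly_cost(executions: int, transitions_per_execution: int) -> float:
    """Monthly Standard Workflows cost for a given volume and workflow size."""
    return executions * transitions_per_execution * STANDARD_PER_TRANSITION


def transitions_for_bill(monthly_bill: float) -> float:
    """Work backward from a bill to the number of state transitions behind it."""
    return monthly_bill / STANDARD_PER_TRANSITION


implied = transitions_for_bill(8000)                    # transitions behind $8,000
ten_state = standard_monthly_cost(1_000_000, 10)        # 1M orders, 10 states each
war_story_baseline = (
    standard_monthly_cost(800_000, 60) + standard_monthly_cost(200_000, 68)
)
```

These reproduce the figures in the text: about 320 million transitions for an $8,000 bill, $250 for the 10-state workflow, and $1,540 for the war story's baseline. Dividing `implied` by monthly executions gives the average transitions per execution to compare against what the state machine definition suggests.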

Follow-up: When would you replace Step Functions entirely with event choreography (SQS + EventBridge)?

Answer: When the workflow is a straight pipeline with no branching, no parallel states, no retries beyond what SQS provides, and no need for visual execution history. “Order placed -> process payment -> update inventory -> send email” is a pipeline. Each step publishes an event, the next step subscribes. SQS between each step gives you retry and DLQ for free. The cost is effectively zero beyond the SQS API calls ($0.40 per million). But the moment you need conditional branching (“if payment fails, try backup processor, then notify customer, then schedule retry in 24 hours”), error handling across steps, or parallel execution with aggregation (“transcode to 3 formats and wait for all to finish”), choreography becomes spaghetti. That is when Step Functions earns its keep — the state machine definition is the documentation, and the console visualization is the debugging tool. My heuristic: fewer than 4 steps with no branching = choreography. More than 4 steps or any branching/parallel = Step Functions (Express if under 5 minutes, Standard if over).

Follow-up: How do you test Step Functions workflows locally?

Answer: Step Functions Local is the official answer: a Docker container that emulates the Step Functions service. It runs your state machine definition and can invoke local Lambda functions via SAM CLI or mock the service integrations. It is adequate for happy-path testing. The honest answer: Step Functions Local is painful for complex workflows. The mock integrations are limited, and real service integrations (DynamoDB, SQS) do not work locally. What actually works in practice is: (1) unit test each Lambda function independently; (2) integration test the full workflow in a dev AWS account using the real service (deploy via SAM/CDK on every PR, run the workflow with test data, assert on the execution output, and tear down); and (3) use Step Functions’ built-in test state feature (added in 2023) to test individual states against real AWS services without running the full workflow. The feedback loop is slower than local testing, but the confidence is dramatically higher.

15. An S3 bucket contains 50TB of customer data. A security audit reveals the bucket was publicly accessible for 72 hours due to a misconfigured bucket policy. The data was not encrypted at rest. Walk me through your incident response AND how you prevent recurrence.

Difficulty: Staff-Level
What the interviewer is really testing: Security incident response maturity, S3 security model depth, and whether you think about prevention systemically (guardrails, not just fixes). This is a question where panic and “just make it private” is the weak answer.
What weak candidates say:
  • “Remove the public policy immediately and enable encryption.”
  • “Check CloudTrail to see who changed the policy.”
  • “Turn on S3 Block Public Access.”
These are step 1 of a 10-step process. Removing the policy fixes the symptom. The real question is: was data exfiltrated, who is affected, what are the regulatory obligations, and how do you make this structurally impossible in the future?
What strong candidates say: This is a potential data breach with regulatory implications. I would run the incident response playbook in parallel tracks:
Track 1 — Immediate containment (first 15 minutes):
  • Enable S3 Block Public Access at the account level, not just the bucket level. This overrides any bucket policy, ACL, or access point policy that grants public access. It is a single API call (the account ID must be the full 12 digits): aws s3control put-public-access-block --account-id 123456789012 --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true. This immediately locks down every bucket in the account.
  • Enable default encryption on the bucket with SSE-S3 or SSE-KMS. Existing objects are not retroactively encrypted — you must re-upload or use S3 Batch Operations to copy objects in place with encryption. For 50TB, S3 Batch Operations takes hours but runs server-side.
  • Restrict bucket access to only specific IAM roles via an explicit deny bucket policy.
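The explicit-deny policy from the last containment step could be sketched as follows. This builds the policy document as a Python dict and prints JSON suitable for `aws s3api put-bucket-policy`. The bucket name and role ARN are hypothetical placeholders, and the `ArnNotLike` condition on `aws:PrincipalArn` is one common way to express "deny everyone except these roles", not necessarily the form used in the incident described here. Verify the allow-list before applying it, since an over-broad deny can lock out your own responders.

```python
import json

def lockdown_policy(bucket: str, allowed_role_arns: list[str]) -> dict:
    """Deny every S3 action on the bucket to principals outside the allow-list."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyAllExceptIncidentRoles",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            # Cover both the bucket itself (LIST) and the objects in it (GET/PUT).
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"ArnNotLike": {"aws:PrincipalArn": allowed_role_arns}},
        }],
    }

policy = lockdown_policy(
    "customer-data-prod",  # hypothetical bucket name
    ["arn:aws:iam::123456789012:role/IncidentResponder"],  # hypothetical role
)
print(json.dumps(policy, indent=2))
```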
Track 2 — Forensic investigation (first 2 hours):
  • Pull S3 Server Access Logs for the bucket during the 72-hour window. These logs show every GET, PUT, LIST request including source IP, requester, and user agent. If access logging was not enabled (common oversight), check CloudTrail S3 data events — but these are only available if data event logging was enabled. If neither was enabled, you have a visibility gap.
  • Analyze access logs for anomalous IPs, bulk downloads, or programmatic access patterns (many GET requests in rapid succession from non-internal IPs). Tools: Athena query over the access logs in S3, or import into a SIEM (Splunk, Elastic).
  • Check AWS CloudTrail for the PutBucketPolicy or PutBucketAcl event that opened the bucket. This tells you who changed the policy, from which IP, and at what time. Determine if it was intentional (developer testing who forgot to revert) or compromised credentials.
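The access-log analysis in Track 2 can be prototyped before the Athena tables exist. The sketch below assumes the log lines have already been parsed into dicts; the field names (`remote_ip`, `operation`, `http_status`, `bytes_sent`) and the internal CIDR range are illustrative assumptions. It totals successful read and list requests per external IP so that bulk downloaders stand out.

```python
import ipaddress
from collections import defaultdict

# Assumption: everything inside this range counts as internal traffic.
INTERNAL_NETWORKS = [ipaddress.ip_network("10.0.0.0/8")]

def is_internal(ip: str) -> bool:
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in INTERNAL_NETWORKS)

def summarize_external_reads(records):
    """Total successful GET/LIST requests and bytes sent per external IP."""
    summary = defaultdict(lambda: {"requests": 0, "bytes": 0})
    for rec in records:
        if is_internal(rec["remote_ip"]):
            continue
        # REST.GET.OBJECT is a download; REST.GET.BUCKET is a listing.
        if not rec["operation"].startswith("REST.GET."):
            continue
        if rec["http_status"] >= 400:
            continue  # denied or failed requests did not return data
        summary[rec["remote_ip"]]["requests"] += 1
        summary[rec["remote_ip"]]["bytes"] += rec["bytes_sent"]
    return dict(summary)

records = [
    {"remote_ip": "10.1.2.3", "operation": "REST.GET.OBJECT", "http_status": 200, "bytes_sent": 1000},
    {"remote_ip": "203.0.113.7", "operation": "REST.GET.BUCKET", "http_status": 200, "bytes_sent": 500},
    {"remote_ip": "203.0.113.7", "operation": "REST.GET.OBJECT", "http_status": 200, "bytes_sent": 4096},
    {"remote_ip": "198.51.100.9", "operation": "REST.GET.OBJECT", "http_status": 403, "bytes_sent": 0},
]
result = summarize_external_reads(records)
print(result)
```

Sorting the result by bytes gives a first-pass list of IPs to investigate; the same grouping translates directly into an Athena `GROUP BY` once the logs are queryable.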
Track 3 — Regulatory notification (first 24 hours):
  • If the data contains PII (names, emails, financial records), this likely triggers GDPR Article 33 (72-hour notification to the supervisory authority), CCPA breach notification, HIPAA breach notification, or industry-specific requirements. Notify the legal team immediately. The notification clock starts when you discover the breach, not when it occurred.
  • Determine the scope: which customers’ data was in the bucket, what data fields were exposed, was any data actually downloaded by unauthorized parties. The access logs from Track 2 determine this.
Track 4 — Structural prevention (next 1-2 weeks):
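  • SCP: Prevent public S3 access at the organization level. An SCP cannot evaluate the contents of a bucket policy, so it cannot deny only those s3:PutBucketPolicy calls that grant public access. The workable pattern is to turn on Block Public Access in every account and apply an SCP that denies s3:PutAccountPublicAccessBlock and s3:PutBucketPublicAccessBlock for everyone except a designated security role. With the block locked in place, no account in the organization can make a bucket public, regardless of individual IAM permissions.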
  • AWS Config rule: s3-bucket-public-read-prohibited. This continuously monitors all buckets and alerts (or auto-remediates) if any bucket becomes public. Set it to auto-remediate by invoking a Lambda that re-applies the Block Public Access configuration.
  • Macie for sensitive data discovery. Enable Amazon Macie on all S3 buckets to automatically discover and classify sensitive data (PII, financial data, credentials). Macie would have flagged this bucket as containing sensitive data that was publicly accessible.
  • S3 Block Public Access as the default. Enable Block Public Access at the organization level via SCP so every new account inherits the restriction. The only way to make a bucket public is to first remove the organization-level block — which requires Security OU approval.
  • Mandatory encryption via SCP. Deny s3:PutObject when s3:x-amz-server-side-encryption is not set. This forces encryption on all new objects organization-wide.
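As a rough illustration of the SCP guardrails in Track 4, the sketch below assembles a policy document with two deny statements: one that stops anyone except a security role from changing Block Public Access settings, and one that rejects `s3:PutObject` calls lacking the server-side-encryption header. The role name `SecurityAdmin` is a hypothetical placeholder, and this is one possible shape for the policy rather than a vetted production SCP.

```python
import json

SECURITY_ROLE = "arn:aws:iam::*:role/SecurityAdmin"  # hypothetical role name

scp = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Keep Block Public Access switched on: only the security role may change it.
            "Sid": "DenyChangingBlockPublicAccess",
            "Effect": "Deny",
            "Action": [
                "s3:PutAccountPublicAccessBlock",
                "s3:PutBucketPublicAccessBlock",
            ],
            "Resource": "*",
            "Condition": {"ArnNotLike": {"aws:PrincipalArn": SECURITY_ROLE}},
        },
        {
            # Reject uploads that do not specify a server-side encryption header.
            "Sid": "DenyUnencryptedObjectUploads",
            "Effect": "Deny",
            "Action": "s3:PutObject",
            "Resource": "*",
            "Condition": {"Null": {"s3:x-amz-server-side-encryption": "true"}},
        },
    ],
}
print(json.dumps(scp, indent=2))
```

One design caveat: the second statement also rejects uploads that omit the header and rely on bucket default encryption, so clients must send the header explicitly once it is in force.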
War Story: At a healthcare SaaS company, a developer created a temporary S3 bucket for a data migration, disabled Block Public Access because a vendor tool required a public pre-signed URL (it did not — the developer misunderstood the tool), and the bucket contained 3.2 million patient records. It was public for 5 days before an external security researcher reported it through a responsible disclosure program. The forensic analysis (using Athena over 800GB of S3 access logs) showed 14 unique non-internal IPs that had listed or downloaded objects. HIPAA notification was required. The total cost: $180,000 in legal fees, forensics, and patient notification. The fix cost: $0 for SCPs and Config rules that would have prevented the entire incident.

Follow-up: The developer says they needed the bucket to be public for a legitimate use case (serving static assets for a marketing site). How do you accommodate that safely?

Answer: You do not make a data bucket public. You create a separate, purpose-built bucket with a strict lifecycle:
  • Dedicated bucket with a name that clearly indicates it is public (e.g., company-public-assets-prod).
  • Block Public Access is disabled on only this bucket, with a documented exception approved by the security team.
  • The bucket contains only static assets (HTML, CSS, JS, images). No customer data, no PII, no application data.
  • A bucket policy with a Condition restricting uploads to only the CI/CD pipeline’s IAM role — no developer can upload directly.
  • CloudFront in front of the bucket with OAC (Origin Access Control), so the bucket is not directly accessible — only CloudFront can read from it.
  • AWS Config rule monitors this specific bucket and alerts if any object matching sensitive data patterns (SSN regex, email regex, etc.) is uploaded.
The key principle: public buckets exist for static content only, are created through infrastructure-as-code with security review, and have monitoring that ensures they stay static-content-only. Any bucket containing customer or application data must never be public, period.

16. You inherit a system with 40 Fargate tasks running 24/7 in production. The monthly AWS bill is $12,000. The CTO asks you to cut it by 50%. How?

Difficulty: Senior
What the interviewer is really testing: Real FinOps chops — not just “use Savings Plans” but a systematic approach to right-sizing, architectural optimization, and knowing which cost levers actually move the needle. The trap: most candidates jump to Reserved Instances without first checking if the resources are right-sized.
What weak candidates say:
  • “Buy Reserved Instances or Savings Plans.”
  • “Move to Lambda.”
  • “Use Spot instances.”
Buying commitments on oversized resources locks in waste for 1-3 years. Moving 40 tasks to Lambda is a multi-month rewrite with unknown cost outcomes. Spot is not appropriate for production API tasks.
What strong candidates say: Cutting 50% from $12,000 means saving $6,000/month. I would attack this in layers, starting with the cheapest and fastest wins:
Layer 1 — Right-sizing (week 1, typical savings: 30-50%):
  • Pull CloudWatch metrics for every task: CPU utilization average and p99, memory utilization average and peak. In my experience, 60-70% of Fargate tasks are over-provisioned. I have seen teams running 2 vCPU / 4GB tasks at 8% average CPU and 15% memory utilization because they copied the task definition from a template and never revisited it.
  • AWS Compute Optimizer provides right-sizing recommendations for ECS tasks. It analyzes 14 days of CloudWatch metrics and suggests optimal CPU/memory configurations. It is free and takes 5 minutes to check.
  • For a task running at 8% CPU / 15% memory on 2 vCPU / 4GB, dropping to 0.5 vCPU / 1GB cuts cost by 75% for that task. Across 40 tasks, this alone could save $4,000-6,000/month.
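The 75% figure in the right-sizing step follows directly from how Fargate bills per vCPU-hour and per GB-hour. The sketch below works it through. The two hourly rates are assumed on-demand x86 prices used only for illustration (check the current price list for your Region); the savings ratio does not depend on them because both dimensions shrink by the same factor.

```python
HOURS_PER_MONTH = 730
VCPU_HOUR_USD = 0.04048   # assumed on-demand rate, for illustration only
GB_HOUR_USD = 0.004445    # assumed on-demand rate, for illustration only

def monthly_cost(vcpu: float, memory_gb: float, tasks: int = 1) -> float:
    """Monthly cost of always-on Fargate tasks at the assumed rates."""
    hourly = vcpu * VCPU_HOUR_USD + memory_gb * GB_HOUR_USD
    return tasks * HOURS_PER_MONTH * hourly

before = monthly_cost(2, 4)     # the over-provisioned task from the example
after = monthly_cost(0.5, 1)    # the right-sized task
savings = 1 - after / before
print(f"${before:.2f} -> ${after:.2f} per task ({savings:.0%} saved)")
```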
Layer 2 — Non-production environments (week 1, typical savings: 15-25%):
  • If those 40 tasks include staging, dev, and QA environments running 24/7, schedule non-production environments to run only during business hours (8am-8pm weekdays = 60 hours/week vs 168 hours/week). Use ECS Scheduled Scaling or a Lambda that scales desired count to 0 at night and back up in the morning. This alone cuts non-prod costs by 64%.
  • Better yet, use Fargate Spot for non-production. 70% discount with the only risk being occasional task termination — which is acceptable in dev/staging.
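The scheduling arithmetic in Layer 2 is easy to check, and it compounds with the Spot discount. This small sketch reproduces the 64% figure and then applies the 70% Spot discount quoted above to the remaining hours; the actual Spot discount varies, so treat the combined number as an estimate.

```python
ALWAYS_ON_HOURS = 24 * 7        # 168 hours per week
FARGATE_SPOT_DISCOUNT = 0.70    # figure quoted in the text; real discounts vary

def weekly_hours(start_hour: int, end_hour: int, days: int) -> int:
    """Hours per week a schedule keeps the environment running."""
    return (end_hour - start_hour) * days

scheduled = weekly_hours(8, 20, 5)                   # 8am-8pm, weekdays
schedule_saving = 1 - scheduled / ALWAYS_ON_HOURS
# Spot pricing applies only to the hours that still run.
combined_saving = 1 - (scheduled / ALWAYS_ON_HOURS) * (1 - FARGATE_SPOT_DISCOUNT)
print(f"{scheduled}h/week: schedule saves {schedule_saving:.0%}, schedule + Spot saves {combined_saving:.0%}")
```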
Layer 3 — Graviton migration (week 2, typical savings: 20%):
  • Switch task definitions from x86 (X86_64) to ARM (ARM64) and use Graviton-based Fargate tasks. This is a single line change in the task definition if you build multi-arch Docker images. Graviton Fargate is 20% cheaper than x86 Fargate for the same CPU/memory, with equal or better performance. For interpreted languages (Python, Node.js, Ruby), no code changes are needed. For compiled languages (Go, Rust), you rebuild for linux/arm64. JVM applications run unchanged on an arm64 base image unless they bundle native libraries.
Layer 4 — Savings Plans (week 3, typical savings: 20-35% on remaining spend):
  • Only after right-sizing and scheduling, commit to a Compute Savings Plan. A 1-year no-upfront Compute Savings Plan saves ~20%. A 3-year partial-upfront saves ~35%. Compute Savings Plans apply across Fargate, Lambda, and EC2, so they flex as your architecture evolves. I would commit to covering only 70% of the right-sized baseline — leave 30% on-demand for flexibility.
Layer 5 — Architectural review (weeks 4-6, case-specific):
  • Are any tasks doing batch processing that could be moved to Lambda (no idle cost between batches)?
  • Are any tasks running singleton workers (one task processing a queue) that could be Lambda SQS consumers?
  • Are there sidecar containers (log forwarders, metrics agents) that could be replaced with Firelens (built into ECS) or CloudWatch agent (no sidecar needed)?
War Story: At a Series B startup, I inherited 52 Fargate tasks across 3 environments. Monthly bill: $18,400. After analysis: 16 tasks were dev/staging running 24/7 (should have been business hours only), 23 production tasks were provisioned at 1 vCPU / 2GB but averaged 12% CPU / 300MB memory, and nobody had considered Graviton. The optimization path: right-sized production to 0.25 vCPU / 0.5GB (saved $5,200/month), scheduled non-prod to business hours (saved $2,800/month), migrated all tasks to ARM/Graviton (saved $1,900/month), then bought a 1-year Compute Savings Plan on the remaining baseline (saved $1,600/month). Total savings: $11,500/month — 62% reduction. Zero downtime, zero code changes, completed in 3 weeks.

Follow-up: How do you prevent the costs from creeping back up over the next 6 months?

Answer: Cost creep is the default state. Without active governance, spend returns to pre-optimization levels within 6-12 months as teams scale up for new features and never scale back down.
  • Weekly cost review. A 15-minute team meeting every Monday that reviews the Cost Explorer dashboard. Not a detailed analysis — just “are we trending up or down vs last week?” Anomalies get investigated immediately.
  • AWS Budgets per service and per team. Set a monthly budget that is 10% above the optimized baseline. Alert at 80%, 90%, and 100%. At 100%, require a justification ticket for the increase.
  • Right-sizing automation. A monthly Lambda function that runs Compute Optimizer recommendations and posts them to Slack. If a task has been over-provisioned for 30+ days, it opens a Jira ticket automatically.
  • Tagging enforcement. Require cost-center and team tags on all ECS services via AWS Config rules. Untagged resources cannot be attributed and therefore cannot be optimized.
  • Governance in the deployment pipeline. The CDK/Terraform pipeline runs a cost estimation tool (Infracost) on every PR. If the estimated monthly cost increase exceeds a threshold ($100/month), the PR requires a finance-team approval label.
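The budget and pipeline thresholds above reduce to a few lines of arithmetic. The sketch below is only an illustration of those rules using the $6,000/month target from this scenario as the optimized baseline; the function names are hypothetical and nothing here calls AWS Budgets or Infracost.

```python
def budget_with_alerts(baseline: float, headroom: float = 0.10,
                       thresholds: tuple = (0.80, 0.90, 1.00)):
    """Monthly budget set above the baseline, plus the spend levels that trigger alerts."""
    budget = round(baseline * (1 + headroom), 2)
    return budget, [round(budget * t, 2) for t in thresholds]

def needs_finance_approval(estimated_monthly_increase: float,
                           threshold: float = 100.0) -> bool:
    """PR gate: flag changes whose estimated monthly cost increase exceeds the threshold."""
    return estimated_monthly_increase > threshold

budget, alerts = budget_with_alerts(6000)
print(budget, alerts)
print(needs_finance_approval(150), needs_finance_approval(40))
```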

17. A teammate says: “We should use DynamoDB for everything — it scales infinitely and costs less than RDS.” Where are they right, and where is this dangerously wrong?

Difficulty: Intermediate / Senior
What the interviewer is really testing: Whether you can challenge popular narratives with nuance. DynamoDB is excellent, but “use it for everything” is one of the most common mistakes in AWS-heavy shops. This question separates engineers who have hit DynamoDB’s walls from those who have only read the marketing page.
What weak candidates say:
  • “They are right — DynamoDB handles any scale and is fully managed.”
  • “They are wrong — relational databases are always better for complex queries.”
Both extremes are wrong. The real answer requires understanding access patterns, cost models, and operational trade-offs.
What strong candidates say: My teammate is right about three things and dangerously wrong about four things.
Where they are right:
  • Scaling. DynamoDB on-demand mode genuinely scales to virtually unlimited throughput. I have seen tables handling 400,000 reads/second with single-digit millisecond latency. You do not manage shards, replicas, or connection pools. It just works.
  • Operational simplicity. No patching, no vacuuming, no connection pool tuning, no replica lag monitoring. For a small team, this operational savings is massive — it is easily worth $20K/year in engineering time you do not spend.
  • Performance at scale. For key-value and simple query patterns, DynamoDB’s latency is consistently 1-5ms regardless of table size. An RDS PostgreSQL query that returns in 2ms at 1GB can return in 200ms at 1TB without careful indexing and tuning.
Where they are dangerously wrong:
  • “Costs less than RDS” is often false. DynamoDB on-demand pricing is $1.25 per million write request units and $0.25 per million read request units. A table with 10,000 writes/second sustained costs $1.25 * 10,000 * 86,400 / 1,000,000 * 30 = $32,400/month. An Aurora db.r6g.xlarge instance handling the same write throughput costs roughly $800/month. DynamoDB is cheaper for spiky, low-baseline workloads. It is dramatically more expensive for sustained high-throughput workloads unless you use provisioned capacity with reserved pricing, which erodes the “no capacity planning” benefit.
  • “For everything” ignores the access pattern constraint. DynamoDB requires you to know your access patterns at design time. You design the partition key, sort key, and GSIs around specific queries. If the product team adds a new query pattern 6 months later (e.g., “find all orders by product ID across all customers”), you either already have a GSI for it or you add one (which consumes additional write capacity on every write to the table). In PostgreSQL, you add an index. In DynamoDB, you redesign your data model. I have been on a team where a “simple” new reporting requirement forced a complete DynamoDB table redesign and data migration because the existing single-table design could not support the new access pattern.
  • Analytical queries are a non-starter. “Show me total revenue by region for the last 90 days” is a full table scan in DynamoDB. You export to S3 and query with Athena. In PostgreSQL, it is a 200ms query with a covering index. If your application needs even basic ad-hoc querying or reporting, DynamoDB alone is insufficient.
  • Transactions are limited. DynamoDB transactions cover up to 100 items per request. They can span multiple tables, but only within a single AWS account and Region. There is no equivalent to a multi-statement PostgreSQL transaction with savepoints, rollbacks, and isolation levels. For workflows like “transfer $100 from account A to account B while recording an audit log entry” where all three writes must succeed atomically, DynamoDB transactions work. For “insert an order, create 50 line items, update inventory for each product, and record 50 ledger entries atomically,” you need 151 writes and exceed the 100-item limit.
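Two of the numbers in the list above are worth checking by hand: the sustained on-demand write cost and the transaction item count. The sketch below reproduces both. It assumes each write consumes exactly one write request unit (items of 1KB or less, non-transactional) and that the 50 line items map to 50 distinct products; larger items or transactional writes would cost more.

```python
WRITE_PRICE_PER_MILLION_USD = 1.25   # on-demand price quoted in the text
SECONDS_PER_DAY = 86_400
TRANSACTION_ITEM_LIMIT = 100

def on_demand_write_cost(writes_per_second: int, days: int = 30) -> float:
    """Monthly on-demand write cost, assuming one write request unit per write."""
    total_writes = writes_per_second * SECONDS_PER_DAY * days
    return total_writes / 1_000_000 * WRITE_PRICE_PER_MILLION_USD

cost = on_demand_write_cost(10_000)

# 1 order + 50 line items + 50 inventory updates + 50 ledger entries
order_writes = 1 + 50 + 50 + 50
fits_in_one_transaction = order_writes <= TRANSACTION_ITEM_LIMIT

print(f"${cost:,.0f}/month; {order_writes} writes, fits in one transaction: {fits_in_one_transaction}")
```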
My framework: DynamoDB for well-defined access patterns with high scale requirements (user profiles, session stores, IoT telemetry, gaming leaderboards). PostgreSQL/Aurora for evolving query patterns, complex relationships, reporting, and transactional workloads. Often the answer is both — DynamoDB for the hot-path key-value lookups and Aurora for the analytical and relational queries.
War Story: A team I consulted for went all-in on DynamoDB single-table design for their e-commerce platform. It was beautiful — 6 access patterns, 2 GSIs, sub-5ms responses. Then the product team asked for “search orders by date range and status across all customers.” Single-table design had customer ID as the partition key. A date-range query across all customers required a full table scan or a new GSI with a different partition key. The GSI added 40% to the write costs (every order write now updated the GSI). Six months later, the product team wanted full-text search on order notes. They ended up streaming DynamoDB data to OpenSearch via DynamoDB Streams and Lambda. The “simple DynamoDB” architecture now had DynamoDB + OpenSearch + Lambda + Streams — more operational complexity than a single Aurora instance with pg_trgm would have had from day one.

Follow-up: When would you use DynamoDB single-table design versus multi-table design?

Answer: Single-table design — putting multiple entity types (users, orders, order items) in one table with carefully designed PK/SK structures — is powerful but has a steep learning curve and real trade-offs.
Use single-table when: You have 3-6 well-known access patterns that will not change frequently, you need to fetch related entities in a single query (get user and their recent orders in one request), and your team has DynamoDB expertise. The performance benefit is real — one round trip instead of multiple table queries.
Use multi-table when: Your team is new to DynamoDB (the cognitive overhead of single-table is high and mistakes are expensive), access patterns are still evolving (easier to add GSIs to isolated tables), or you have more than 8-10 access patterns (single-table GSIs become unwieldy). Multi-table is also easier to monitor — per-table CloudWatch metrics give you clear visibility into which entity type is causing hot partitions or throttling.
The honest take from production: Single-table design is an optimization. Like all optimizations, apply it when you have data showing you need it, not upfront because a conference talk convinced you. Most DynamoDB applications work perfectly fine with multi-table design and never need the complexity of single-table.

18. It is 2 AM. PagerDuty wakes you up. Your ECS Fargate service is returning 503 errors. CPU and memory look fine. Tasks are running. What is happening and how do you triage?

Difficulty: Senior
What the interviewer is really testing: Real incident response skills under pressure, knowledge of the ECS/ALB/networking stack, and the ability to systematically narrow down a root cause when the obvious metrics look normal. The trap: candidates fixate on the application and ignore the infrastructure.
What weak candidates say:
  • “Check the application logs for errors.”
  • “Restart the tasks.”
  • “Scale up to more tasks.”
Checking logs is not wrong, but starting there without triaging the infrastructure path wastes precious minutes. Restarting randomly is a panic move. Scaling up when existing tasks are healthy does not help if the problem is not capacity.
What strong candidates say: 503 errors with healthy CPU and memory means the requests are not reaching the application, or the application is rejecting them for a non-resource reason. I work backward from the user through the request path:
Minute 0-2: Scope the blast radius.
  • Check the ALB target group health. If targets show unhealthy, the ALB is pulling tasks out of rotation. The health check might be failing even though CPU/memory are fine — the application could be returning 500 on the health check endpoint due to a downstream dependency failure (database unreachable, config service down).
  • Check HealthyHostCount and UnHealthyHostCount CloudWatch metrics on the target group. If healthy count dropped to zero, the ALB has no targets to route to and returns 503 by default.
Minute 2-5: Check the ALB itself.
  • HTTPCode_ELB_503_Count metric confirms the ALB is generating the 503s (as opposed to the application returning 503 via the ALB). A 503 from the ALB means: no healthy targets, all targets are deregistering, or the target group has no registered targets.
  • RequestCount on the target group — are requests even reaching the targets? If request count is zero but the ALB is receiving traffic, the issue is target registration or health checks.
  • Check if a deployment just happened. ECS rolling deployments deregister old tasks and register new ones. If the new task definition has a bug (crashes on startup, fails health check), ECS keeps trying to start new tasks while the old ones are draining. During the crossover, healthy target count can drop to zero.
Minute 5-10: Check the deployment and task lifecycle.
  • aws ecs describe-services — check runningCount vs desiredCount. If running is less than desired, ECS is trying to start tasks but they are failing. Check events on the service for messages like “task failed to start” or “ECS was unable to place a task.”
  • Check stopped tasks: aws ecs list-tasks --desired-status STOPPED. Look at stoppedReason — common culprits: “CannotPullContainerError” (ECR access issue, usually a VPC endpoint or NAT Gateway problem), “ResourceNotFoundException” (Secrets Manager or Parameter Store secret was deleted), “OutOfMemoryError” (container killed by OOM despite CloudWatch showing memory below limit — this happens when the container’s hard memory limit is hit but the task-level memory is not).
  • Check the ECS task definition — did someone push a bad container image to ECR? Did the entrypoint script change? A new image that crashes on startup will pass docker pull but fail immediately, and ECS will cycle through start-crash-restart.
Minute 10-15: Check networking.
  • Security group on the tasks — did someone modify it? If the ALB’s security group cannot reach the task’s security group on the container port, health checks fail and the ALB returns 503.
  • NACLs on the subnets — less common but devastating. A NACL change that blocks ephemeral ports (1024-65535) breaks ALB-to-task communication.
  • NAT Gateway — if the tasks need to pull config from Secrets Manager or Parameter Store at startup, and the NAT Gateway is down or overloaded, tasks start but hang during initialization and fail health checks.
War Story: At a SaaS company running 24 Fargate tasks behind an ALB, we got paged for 503s at 1:47 AM. CPU: 12%. Memory: 35%. Tasks running: 24/24. All “healthy” from ECS’s perspective. But ALB target group showed 0 healthy targets. Root cause: the application health check endpoint (/health) made a synchronous call to Redis (ElastiCache) to verify cache connectivity. A maintenance window on the ElastiCache cluster (automatic patching, which we had not scheduled for off-hours) caused a 90-second Redis failover. During failover, every /health call timed out, the ALB marked all 24 targets unhealthy, and started returning 503. The fix had two parts: (1) moved ElastiCache maintenance window to Sunday 4 AM with explicit scheduling, and (2) changed the health check to not depend on Redis — the health check should verify the application can respond to HTTP requests, not that every dependency is healthy. For dependency health, we added a separate /ready endpoint used for operational monitoring but not ALB routing. The principle: liveness checks (is the process alive?) should be simple and dependency-free. Readiness checks (can it serve traffic?) can check dependencies but should not be used for ALB health checks unless you want a single dependency failure to take down all traffic.
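The liveness-versus-readiness split from this war story can be sketched as two handler functions. This is a framework-agnostic illustration, not the code from the incident: the function names, the shape of the dependency checks, and the simulated Redis failure are all hypothetical. The point is that the endpoint the ALB polls never touches a dependency, while the operational endpoint reports each dependency separately.

```python
def liveness():
    """ALB health check target: no dependency calls, only proves the process can respond."""
    return 200, {"status": "ok"}

def readiness(dependency_checks: dict):
    """Operational endpoint: per-dependency status for dashboards, not for ALB routing."""
    results = {}
    for name, check in dependency_checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A timeout or connection error marks this dependency as down.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results

def redis_during_failover():
    raise TimeoutError("simulated Redis failover")

print(liveness())
print(readiness({"redis": redis_during_failover, "database": lambda: True}))
```

During the simulated failover, `liveness()` still returns 200, so the targets stay in rotation, while `readiness()` returns 503 with `redis` flagged for whoever is watching the dashboard.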

Follow-up: How do you distinguish between a deployment-caused outage and an infrastructure-caused outage in under 2 minutes?

Answer: One command: aws ecs describe-services --services my-service --cluster my-cluster. The events field shows the 100 most recent service events, newest first. If the most recent events say “service my-service has started 4 tasks” and “service my-service has stopped 4 tasks: task was stopped because the underlying container exited,” you have a deployment issue. If the events show steady state (“service has reached a steady state”) but the ALB is returning 503, it is infrastructure — security groups, NACLs, NAT Gateway, or the health check endpoint. The second check: aws ecs describe-services also shows deployments. If there are two deployments listed (PRIMARY and ACTIVE, each with its own desiredCount), a rolling update is in progress. If the PRIMARY deployment has runningCount: 0 and pendingCount: N, the new tasks are failing to start. Roll back immediately by pointing the service at the last known-good task definition revision: aws ecs update-service --service my-service --cluster my-cluster --task-definition <family>:<previous-revision>. Running update-service with only --force-new-deployment restarts tasks on the current (presumably broken) task definition, so it is not a rollback. Better still, enable the ECS deployment circuit breaker with rollback, which automatically rolls back if the new deployment fails health checks.
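The two checks in this answer can be turned into a small classifier over the JSON that `aws ecs describe-services` returns. The sketch below operates on one entry of the `services` array and uses the real field names (`deployments`, `status`, `runningCount`, `pendingCount`, `events`, `message`); the returned labels are made up for illustration, and it assumes the first event in the list is the most recent one.

```python
def classify_outage(service: dict) -> str:
    """Rough deployment-vs-infrastructure triage from describe-services output."""
    deployments = service.get("deployments", [])
    primary = next((d for d in deployments if d["status"] == "PRIMARY"), None)
    rolling = len(deployments) > 1

    # New tasks pending but none running: the new revision cannot start.
    if rolling and primary and primary["runningCount"] == 0 and primary["pendingCount"] > 0:
        return "deployment: new tasks failing to start"
    if rolling:
        return "deployment: rolling update in progress"

    events = service.get("events", [])
    latest = events[0]["message"] if events else ""  # assumes newest-first ordering
    if "steady state" in latest:
        return "infrastructure: check security groups, NACLs, NAT, health check endpoint"
    return "unclear: inspect events and stopped tasks"

failing = {
    "deployments": [
        {"status": "PRIMARY", "desiredCount": 4, "runningCount": 0, "pendingCount": 4},
        {"status": "ACTIVE", "desiredCount": 4, "runningCount": 4, "pendingCount": 0},
    ],
    "events": [{"message": "(service my-service) has started 4 tasks"}],
}
steady = {
    "deployments": [{"status": "PRIMARY", "desiredCount": 4, "runningCount": 4, "pendingCount": 0}],
    "events": [{"message": "(service my-service) has reached a steady state."}],
}
print(classify_outage(failing))
print(classify_outage(steady))
```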

19. You are designing an event-driven system with EventBridge. A colleague says you do not need to worry about schema evolution because “events are just JSON.” Why is this a ticking time bomb, and how do you defuse it?

Difficulty: Senior / Staff-Level
What the interviewer is really testing: Whether you have built event-driven systems that survived past version 1.0. Schema evolution is the silent killer of event-driven architectures — it never matters on day one, and it always matters on day 180.
What weak candidates say:
  • “JSON is flexible, consumers just ignore fields they do not know.”
  • “We will version our events when we need to.”
  • “EventBridge schema registry handles this automatically.”
The first answer is wishful thinking. The second is “we will fix the roof when it rains.” The third misunderstands what the schema registry does.
What strong candidates say: “Events are just JSON” is technically true and architecturally catastrophic. The moment you have 3 producer services and 8 consumer services exchanging events, you have an implicit contract between all of them. Without explicit schema governance, you get:
  • The phantom field break. Producer adds a field order.discount_percentage as a float. Consumer A expects it. Consumer B (written 6 months ago) does not know about it and is fine. Consumer C (legacy Python service) receives the event and passes it to a function that chokes because it deserializes the entire payload and feeds it to a pandas DataFrame that cannot handle the new column. Nothing in the event contract told Consumer C this would happen.
  • The type change time bomb. Producer changes order.amount from integer cents (1599) to a string with currency ("$15.99"). Every consumer that parses amount as an integer breaks silently — no exception, just wrong calculations. I have seen this exact scenario cause a $47,000 billing discrepancy over 3 days before anyone noticed.
  • The required-to-optional change. Producer decides customer.phone is optional and starts sending events without it. Consumer D (the SMS notification service) dereferences a null pointer and crashes. Its SQS consumer dies, messages pile up in the DLQ, and nobody notices until 10,000 customers complain they did not get order confirmations.
How I defuse this:
  • Event schema registry with enforcement. EventBridge Schema Registry discovers schemas automatically from live events. But discovery is not enough — you need enforcement. Define schemas explicitly (JSON Schema or OpenAPI), version them, and validate events at the producer before publishing. The producer calls a schema validation library before putting the event on the bus. If validation fails, the event is rejected, not published. This catches breaking changes at the source.
  • Schema versioning with semantic conventions. Every event type gets a version in its detail-type: OrderPlaced.v2. Consumers subscribe to the version they support. When the producer needs a breaking change, it publishes OrderPlaced.v3 alongside v2. Consumers migrate at their own pace. The producer deprecates v2 only after all consumers have migrated (tracked via EventBridge metrics on rule invocations — if the v2 rule has zero invocations for 30 days, it is safe to remove).
  • Backward-compatible changes only by default. Adding new optional fields is always safe. Removing fields, renaming fields, or changing field types is a breaking change that requires a version bump. This is the same contract discipline as API versioning, and it needs to be enforced in code review — not left to developer goodwill.
  • Consumer contract testing. Each consumer publishes a “contract test” that defines the minimum fields and types it requires from each event type. The CI pipeline for the producer runs all downstream consumer contract tests before deploying. If the producer’s schema change breaks any consumer contract, the pipeline fails. This is the Pact testing model applied to events.
  • Dead letter queue per consumer. Every EventBridge rule that invokes Lambda or SQS should have a DLQ. When a consumer fails to process an event (possibly due to a schema change), the event goes to the DLQ rather than disappearing. Monitor DLQ depth — a sudden spike after a producer deployment is a strong signal that a schema change broke something.
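The "backward-compatible changes only" rule from the list above lends itself to an automated check in the producer's CI. The following is a minimal sketch, not a real schema tool: schemas are represented as hypothetical `{field: (type_name, required)}` dicts, and the checker flags the three breaking patterns described earlier (removed or renamed fields, type changes, and required fields becoming optional). New optional fields pass silently.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """Compare two {field: (type_name, required)} schemas and list breaking changes."""
    problems = []
    for field, (old_type, old_required) in old.items():
        if field not in new:
            problems.append(f"removed field: {field}")  # a rename shows up here too
            continue
        new_type, new_required = new[field]
        if new_type != old_type:
            problems.append(f"type changed: {field} {old_type} -> {new_type}")
        if old_required and not new_required:
            problems.append(f"required became optional: {field}")
    return problems

def next_detail_type(current: str, problems: list[str]) -> str:
    """Bump the version suffix (OrderPlaced.v2 -> OrderPlaced.v3) only on breaking changes."""
    name, version = current.rsplit(".v", 1)
    return f"{name}.v{int(version) + 1}" if problems else current

v2 = {
    "order.amount": ("integer", True),
    "customer.phone": ("string", True),
    "shipment.estimated_delivery": ("string", True),
}
proposed = {
    "order.amount": ("string", True),               # type change
    "customer.phone": ("string", False),            # required -> optional
    "shipment.eta": ("string", True),               # rename of estimated_delivery
    "order.discount_percentage": ("number", False), # new optional field: safe
}
problems = breaking_changes(v2, proposed)
print(problems)
print(next_detail_type("OrderPlaced.v2", problems))
```

In a pipeline, a non-empty `problems` list would either fail the build or force the producer to publish under the bumped detail-type alongside the old one.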
War Story: At a logistics company with 14 microservices communicating through EventBridge, a team renamed the field shipment.estimated_delivery to shipment.eta in their producer because “eta is shorter.” No version bump, no consumer notification, no contract tests. Five consumers broke. The warehouse management system stopped scheduling dock workers because it could not read the delivery estimate. The customer notification service stopped sending “your package is arriving tomorrow” emails. It took 4 hours to diagnose because the error was not a crash — the consumers just returned null for the missing field and continued with degraded logic. We caught it when a customer service team noticed customers were not getting notifications. After that incident, we implemented contract testing and a mandatory event schema review process for any PR that changes an event payload.

Follow-up: How do you handle the EventBridge schema registry’s limitations in practice?

Answer: The EventBridge schema registry discovers schemas from events passing through the bus, which is useful for documentation but has real limitations:
  • It discovers schemas reactively (an event must flow through the bus before the schema appears). It does not enforce schemas — a producer can publish any JSON.
  • Schema versions in the registry are discovered versions, not semantic versions. Adding a new field creates a “new version” even if it is backward compatible.
  • There is no built-in mechanism to reject events that do not match a schema.
In practice, I layer on top: (1) a shared event schema library (a Git repo with JSON Schema files for every event type), (2) a lightweight validation Lambda on the event bus that validates events against the schema and routes invalid events to a quarantine queue, and (3) code-generated type-safe event classes from the schemas (using tools like quicktype or json-schema-to-typescript) that producers and consumers import. The combination gives you discoverability (registry), enforcement (validation Lambda), and developer experience (type-safe code).
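As a rough sketch of the validation step in (2), the routing decision might look like the following. The schema dictionary, the `ShipmentUpdated` event type, and the `forward` and `quarantine` callables are hypothetical, and the validator covers only `required` and simple `type` checks; a real deployment would more likely use a full JSON Schema validator and send quarantined events to a queue.

```python
# Hypothetical schema library: detail-type -> minimal JSON-Schema-like dict.
SCHEMAS = {
    "ShipmentUpdated": {
        "required": ["shipment_id", "estimated_delivery"],
        "properties": {
            "shipment_id": {"type": "string"},
            "estimated_delivery": {"type": "string"},
            "weight_kg": {"type": "number"},
        },
    },
}

# Subset of JSON Schema type names mapped to Python types.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate_detail(detail, schema):
    """Return a list of validation errors for one event detail payload."""
    errors = []
    for field in schema.get("required", []):
        if field not in detail:
            errors.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field not in detail:
            continue
        value = detail[field]
        # bool is a subclass of int in Python, so reject it for non-boolean types.
        wrong_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if wrong_bool or not isinstance(value, JSON_TYPES[spec["type"]]):
            errors.append(f"wrong type for {field}: expected {spec['type']}")
    return errors


def route_event(event, forward, quarantine):
    """Forward valid events; hand invalid or unknown ones to the quarantine callable."""
    schema = SCHEMAS.get(event.get("detail-type"))
    if schema is None:
        quarantine(event, ["unknown detail-type"])
        return "quarantine"
    errors = validate_detail(event.get("detail", {}), schema)
    if errors:
        quarantine(event, errors)
        return "quarantine"
    forward(event)
    return "forward"
```

Passing `forward` and `quarantine` in as callables keeps the routing logic testable without AWS; in a Lambda they would wrap whatever client calls the team uses.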

20. Your company runs a SaaS product with 200 tenants on a shared infrastructure. One tenant starts sending 50x their normal traffic. Within minutes, all other tenants experience degraded performance. Diagnose the failure and design the fix.

Difficulty: Staff-Level
What the interviewer is really testing: Multi-tenant isolation thinking, noisy-neighbor problem understanding, and the ability to design tenant-aware architectures on AWS. This is a cross-cutting question that spans Lambda concurrency, DynamoDB throughput, API Gateway throttling, and SQS queue design.
What weak candidates say:
  • “Rate limit the noisy tenant.”
  • “Give each tenant their own infrastructure.”
  • “Scale up to handle the increased load.”
Rate limiting is part of the answer but not the architecture. Per-tenant infrastructure does not scale to 200 tenants cost-effectively. Scaling up treats the symptom and does not protect against the next noisy tenant.
What strong candidates say: This is the noisy-neighbor problem, and it is the defining architectural challenge of multi-tenant SaaS. The failure cascades because shared resources have finite capacity and no isolation between tenants. Here is how I diagnose and fix it layer by layer.
Diagnosis (tracing the blast radius):
  • API Gateway: If all tenants hit the same API Gateway, the noisy tenant’s 50x spike may be consuming the account-level API Gateway throttle limit (10,000 requests/second default). All tenants share this limit. When exceeded, every tenant gets 429 errors.
  • Lambda: If backend Lambda functions have no reserved concurrency, the noisy tenant’s requests consume the account’s 1,000 concurrent execution limit. Other tenants’ requests are throttled with 429.
  • DynamoDB: If tenants share a DynamoDB table, the noisy tenant’s requests may hot-partition the table (especially if tenant ID is the partition key and one tenant dominates throughput). Adaptive capacity helps but cannot fully compensate for a 50x spike.
  • SQS: If tenants share an SQS queue, the noisy tenant’s messages flood the queue. Consumer Lambda processes messages in order of arrival, starving other tenants’ messages.
The fix — tenant-aware isolation architecture:
  • API Gateway per-tenant throttling. Use API Gateway usage plans with API keys. Each tenant gets an API key with a throttle limit (e.g., 100 requests/second) and a burst limit. When Tenant X exceeds their limit, only they get 429 errors. Other tenants are unaffected. This is the first line of defense and the cheapest to implement.
  • Lambda concurrency isolation. Deploy a separate Lambda function per tenant tier (free, standard, enterprise). Reserved concurrency is configured on the function as a whole, not on an alias or version, so tier isolation needs separate functions. Set reserved concurrency per tier: free-tier functions get 50 concurrent executions, standard gets 200, enterprise gets 500. The noisy tenant’s tier exhausts its own reserved pool and is throttled without impacting other tiers. For critical-path functions, I would also consider per-tenant Lambda functions for the largest tenants (top 5-10 by revenue), each with their own reserved concurrency.
  • DynamoDB tenant-aware design. Partition key should include tenant ID (e.g., PK = TENANT#123#ORDER#456). This naturally distributes each tenant’s data across different partitions. But if a single tenant’s throughput exceeds a partition’s capacity, add write sharding within the tenant key. Also: use DynamoDB on-demand mode so individual tenant spikes are absorbed without provisioned capacity limits, and set per-tenant item-level monitoring via Contributor Insights.
  • SQS queue-per-tenant or priority queuing. For the noisy-neighbor SQS problem, the cleanest solution is a queue-per-tenant for the top 20 tenants (by volume) and a shared queue for the remaining 180. Each per-tenant queue has its own Lambda consumer with independent reserved concurrency. The shared queue has a Lambda consumer that processes small tenants. If a top-20 tenant spikes, only their queue backs up. Alternative: a single queue with tenant-aware consumer logic that implements fair scheduling (round-robin across tenant IDs in the batch).
  • Rate limiting at the application layer. Beyond API Gateway throttling, implement a per-tenant rate limiter in the application code, backed by Redis. A token bucket handles bursts smoothly; a fixed-window counter (Redis INCR with a TTL) is the simpler approximation. Each tenant has a per-second and per-minute limit based on their plan tier. This catches bursts that exceed the API Gateway throttle (which operates at the API key level, not the business logic level).
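A token bucket for the application-layer limiter can be sketched in-process as below. This is a minimal illustration under stated assumptions: the class names, the plan table, and the injectable clock are hypothetical, and the state lives in local memory. In a multi-instance deployment the bucket state would need to live in a shared store such as Redis and be updated atomically.

```python
import time


class TokenBucket:
    """Refills at rate_per_sec tokens per second, up to capacity (the burst size)."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so an idle tenant can burst
        self.updated_at = clock()

    def allow(self, cost=1):
        """Consume `cost` tokens if available; return whether the request may proceed."""
        now = self.clock()
        elapsed = now - self.updated_at
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """One bucket per tenant, sized by the tenant's plan tier."""

    def __init__(self, plan_limits, clock=time.monotonic):
        self.plan_limits = plan_limits  # plan name -> (rate_per_sec, burst)
        self.clock = clock
        self.buckets = {}

    def allow(self, tenant_id, plan):
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            rate, burst = self.plan_limits[plan]
            bucket = self.buckets[tenant_id] = TokenBucket(rate, burst, self.clock)
        return bucket.allow()
```

Because each tenant owns a bucket, one tenant draining its tokens has no effect on the `allow` result for any other tenant, which is the isolation property the list above is after.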
War Story: At a B2B SaaS company with 340 tenants, one enterprise tenant ran a data migration on a Friday evening that sent 12,000 API requests/second (normal was 200/second). Within 3 minutes: Lambda concurrency hit 1,000 (account limit at the time), DynamoDB consumed 4x provisioned capacity causing throttling, and SQS queue depth grew to 2 million messages. 339 other tenants experienced 15-second API response times. Time to detect: 8 minutes (CloudWatch alarm on p99 latency). Time to mitigate: 45 minutes (manually throttling the tenant by modifying their API Gateway usage plan). Time to fix properly: 6 weeks to implement per-tier reserved concurrency, per-tenant API throttling, and queue isolation for the top 20 tenants. Monthly infrastructure cost increase for isolation: $1,200. Revenue protected: $80K MRR from the other tenants who would have churned if degradation continued.
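For the shared-queue alternative in the fix list (tenant-aware consumer logic with round-robin across tenant IDs in the batch), a minimal sketch could look like this. The `tenant_id` field name and the message shape are assumptions. Note that it only reorders messages within one received batch, so it softens starvation rather than removing it.

```python
from collections import OrderedDict, deque


def fair_order(batch, tenant_key="tenant_id"):
    """Reorder a batch so tenants take turns instead of being served in arrival order."""
    # Group messages per tenant, preserving each tenant's own arrival order.
    queues = OrderedDict()
    for message in batch:
        queues.setdefault(message[tenant_key], deque()).append(message)

    ordered = []
    while queues:
        # One message per tenant per round; drop a tenant once its queue is empty.
        for tenant in list(queues):
            ordered.append(queues[tenant].popleft())
            if not queues[tenant]:
                del queues[tenant]
    return ordered


# Illustrative batch: tenant A floods the queue ahead of B and C.
batch = [
    {"tenant_id": "A", "id": 1},
    {"tenant_id": "A", "id": 2},
    {"tenant_id": "A", "id": 3},
    {"tenant_id": "B", "id": 4},
    {"tenant_id": "C", "id": 5},
]
processing_order = [m["id"] for m in fair_order(batch)]
```

With this ordering, tenants B and C are served after one message from A rather than after all three.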

Follow-up: How do you detect the noisy tenant automatically before other tenants are impacted?

Answer: The key is per-tenant observability — most teams only have aggregate metrics, which tells you the system is degraded but not which tenant is causing it.
  • Per-tenant request counting in real-time. A Lambda@Edge function (or API Gateway access logging to Kinesis) that increments a Redis sorted set keyed by tenant ID with TTL-based windows. A background Lambda samples the sorted set every 30 seconds and fires a CloudWatch custom metric RequestsPerTenant. A CloudWatch anomaly detection alarm triggers when any tenant exceeds 3 standard deviations from their 7-day baseline.
  • API Gateway access logs to Athena. API Gateway access logs include the API key (tenant) and can be queried in near real-time via Kinesis Firehose to S3 and Athena. A scheduled query every 5 minutes identifies tenants exceeding their normal traffic by 5x+.
  • Automatic throttling. When the anomaly detection alarm fires, a Lambda function automatically lowers the offending tenant’s API Gateway usage plan throttle to their normal baseline + 20% buffer. This contains the blast radius within minutes, without human intervention. A Slack notification tells the on-call engineer what happened so they can investigate whether the traffic is legitimate or malicious.
The goal: detect in under 2 minutes, auto-mitigate in under 5 minutes, and notify a human for follow-up. No pager for other tenants, ever.
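The detection rule (more than 3 standard deviations above a tenant's baseline) and the mitigation target (baseline plus a 20% buffer) reduce to a few lines of arithmetic. The sketch below is an illustration, not CloudWatch anomaly detection itself; the function names and sample numbers are assumptions.

```python
from statistics import mean, stdev


def is_anomalous(baseline, current, threshold_sigmas=3.0):
    """True if `current` exceeds the baseline mean by more than N sample std devs."""
    if len(baseline) < 2:
        return False  # not enough history to estimate spread
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return current > mu  # perfectly flat baseline: any increase stands out
    return current > mu + threshold_sigmas * sigma


def mitigation_throttle(baseline, buffer=0.20):
    """Requests/second limit to apply automatically: normal baseline plus a buffer."""
    return round(mean(baseline) * (1 + buffer))


# Illustrative 7-day baseline of requests/second for one tenant.
baseline_rps = [200, 210, 190, 205, 195, 200, 200]
```

For this baseline, a reading of 12,000 requests/second is flagged while 215 is not, and the automatic throttle would be set to 240 requests/second.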

21. You are reviewing a pull request that adds "Action": "s3:*", "Resource": "*" to a Lambda execution role. The developer says “I will tighten it later.” How do you handle this technically and interpersonally?

Difficulty: Intermediate
What the interviewer is really testing: Security discipline, ability to give constructive code review feedback, and whether you understand the real blast radius of overly permissive IAM. Also tests soft skills: can you push back without being adversarial? The obvious-answer trap: many candidates focus only on the technical risk and miss the interpersonal dimension.
What weak candidates say:
  • “Block the PR and tell them to fix it.”
  • “It is fine for now, we can tighten it later.”
  • “Add a TODO comment and merge.”
Blocking without helping is adversarial. “Fix it later” never happens; this is how every S3 data breach starts. A TODO is a wish, not a policy.
What strong candidates say: I would handle this in two parts: fix the immediate PR, and fix the systemic problem that lets this happen.
On the PR (technical): I would comment: “I cannot approve this as-is, but let me help you scope it correctly. What S3 operations does the function actually need?” Then I would pair with them for 10 minutes to write the correct policy. In my experience, 90% of s3:* usage comes from functions that need exactly 3 actions:
{
  "Effect": "Allow",
  "Action": [
    "s3:GetObject",
    "s3:PutObject",
    "s3:ListBucket"
  ],
  "Resource": [
    "arn:aws:s3:::specific-bucket-name",
    "arn:aws:s3:::specific-bucket-name/*"
  ]
}
This takes 5 minutes to write and reduces the blast radius from “the Lambda can delete any object in any bucket in the account, create new buckets, modify bucket policies, exfiltrate any data to any external account” to “the Lambda can read and write objects in one specific bucket.” If the function is compromised (dependency vulnerability, injection attack), the attacker can access one bucket instead of everything.
I would also show the developer what s3:* actually includes: s3:DeleteBucket, s3:PutBucketPolicy (can make any bucket public), s3:GetObject on every bucket (can exfiltrate customer data from unrelated services), s3:PutReplicationConfiguration (can replicate data to an attacker-controlled account). Most developers are shocked when they see the full list. They think s3:* means “read and write files” and do not realize it includes bucket-level administrative operations.
On the systemic problem: The fact that this reached a PR means the guardrails are missing:
  • IAM Access Analyzer as a CI check. AWS IAM Access Analyzer has a policy validation API. Add a CI step that runs aws accessanalyzer validate-policy on every IAM policy in the CDK/Terraform code. It flags overly permissive actions and wildcard resources as findings. The PR cannot merge with ERROR-level findings.
  • SCP as a safety net. An SCP that denies s3:PutBucketPolicy, s3:DeleteBucket, and s3:PutBucketPublicAccessBlock for Lambda execution roles (identified by a naming convention or tag condition). This means even if a wildcard policy slips through, the most dangerous S3 actions are blocked at the organization level.
  • CDK/Terraform policy templates. Create reusable IAM policy modules for common patterns (S3 read-only, S3 read-write to specific bucket, DynamoDB CRUD on specific table). Developers import the module instead of writing raw JSON. This is faster than writing a custom policy and correct by default.
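A cheap local pre-check for the wildcard patterns discussed here can run before any AWS call. The sketch below is a hand-rolled linter with hypothetical function names; it is not IAM Access Analyzer and only looks for `*` actions, `service:*` actions, and a bare `*` resource in Allow statements.

```python
def as_list(value):
    """IAM allows a single string or a list for Action/Resource/Statement; normalize to a list."""
    return value if isinstance(value, list) else [value]


def lint_policy(policy):
    """Return human-readable findings for wildcard actions and resources in Allow statements."""
    findings = []
    for index, statement in enumerate(as_list(policy.get("Statement", []))):
        if statement.get("Effect") != "Allow":
            continue
        for action in as_list(statement.get("Action", [])):
            if action == "*" or action.endswith(":*"):
                findings.append(f"Statement {index}: wildcard action {action!r}")
        for resource in as_list(statement.get("Resource", [])):
            if resource == "*":
                findings.append(f"Statement {index}: wildcard resource '*'")
    return findings


# The statement from the PR under review.
pr_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "s3:*", "Resource": "*"}],
}

# The scoped replacement suggested in the review.
scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::specific-bucket-name",
                "arn:aws:s3:::specific-bucket-name/*",
            ],
        }
    ],
}
```

`lint_policy(pr_policy)` reports both the wildcard action and the wildcard resource, while the scoped policy passes cleanly. A check like this gives the developer feedback in seconds, with the Access Analyzer step remaining the authoritative gate.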
On the interpersonal dimension: The developer saying “I will tighten it later” usually means “I do not know what permissions I need and scoping it correctly feels like extra work that slows me down.” Responding with “that is a security risk, fix it” is technically correct but interpersonally counterproductive. The productive approach: “Let me help you figure out the right permissions now. It takes 5 minutes and it means we do not have to come back to this. What AWS services does the function call?” This turns a confrontation into a collaboration.
War Story: At a startup where I was the second senior engineer, a developer deployed a Lambda with "Action": "*", "Resource": "*" (full admin access). Three months later, a dependency vulnerability in a Node.js library (prototype pollution in a logging package) allowed an attacker to execute arbitrary code in the Lambda context. Because the Lambda had full admin access, the attacker created a new IAM user, attached AdministratorAccess, and used those credentials to launch 48 p3.16xlarge instances for cryptocurrency mining. The bill for the weekend: $28,000. AWS reversed most of it under their abuse policy, but the incident response, credential rotation, forensic analysis, and customer notification took 3 weeks of engineering time. One IAM policy that should have been s3:GetObject on one bucket.

Follow-up: The developer pushes back: “IAM Access Analyzer is too strict, it flags everything. It slows us down.” How do you respond?

Answer: They are partially right — IAM Access Analyzer’s default findings include informational items that are not security risks. The fix is not to disable the tool; it is to configure the CI check to fail only on ERROR and SECURITY_WARNING level findings, not SUGGESTION or WARNING. This eliminates 80% of the noise while catching the dangerous patterns (wildcard resources, overpermissive actions, public access). I would also invest 2 days in creating a library of pre-approved IAM policy modules for the 10 most common patterns in the codebase. Developers import a module (import { s3ReadWrite } from '@infra/iam-policies') instead of writing raw JSON. The module is already validated, already scoped, and faster to use than writing a custom policy. You turn security from a tax into a productivity tool.
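The severity filter can be a small function applied to the findings returned by the validation step. This sketch assumes each finding is a dict with a `findingType` field holding one of ERROR, SECURITY_WARNING, WARNING, or SUGGESTION; the helper names and the sample findings are hypothetical.

```python
# Only these finding types should fail the build.
BLOCKING_TYPES = frozenset({"ERROR", "SECURITY_WARNING"})


def blocking_findings(findings, blocking_types=BLOCKING_TYPES):
    """Keep only the findings severe enough to block a merge."""
    return [f for f in findings if f.get("findingType") in blocking_types]


def ci_exit_code(findings):
    """0 lets the pipeline continue; 1 fails it."""
    return 1 if blocking_findings(findings) else 0


# Illustrative findings, not real Access Analyzer output.
sample_findings = [
    {"findingType": "SUGGESTION", "findingDetails": "Consider removing an empty array."},
    {"findingType": "WARNING", "findingDetails": "A condition key is not recognized."},
    {"findingType": "SECURITY_WARNING", "findingDetails": "Wildcard grants broad access."},
]
```

Here only the SECURITY_WARNING entry blocks the merge; the other two can be surfaced as non-blocking PR comments.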

22. A service processes 2 million SQS messages per day. Occasionally, the same message is processed twice, causing duplicate charges to customers. The team says “SQS is at-least-once, duplicates are expected.” Is that an acceptable answer?

Difficulty: Senior
What the interviewer is really testing: Whether you treat “the infrastructure does not guarantee it” as an excuse or an engineering constraint to solve. This is a question where the technically correct answer (“SQS Standard is at-least-once”) is the wrong answer in a business context. Strong engineers solve for business requirements, not infrastructure limitations.
What weak candidates say:
  • “Switch to SQS FIFO for exactly-once delivery.”
  • “SQS duplicates are rare, the business should accept them.”
  • “Add a database check before processing each message.”
SQS FIFO has a throughput limit of 300 messages/second (3,000 with batching) and may not fit. Telling the business to accept duplicate charges is a career-limiting move. A database check without atomic writes has a race condition.
What strong candidates say: “SQS is at-least-once” is a statement about infrastructure behavior, not an excuse for billing customers twice. The customer does not care about our messaging semantics; they care about being charged correctly. This is an application-level idempotency problem, and it is one of the most important patterns in distributed systems.
The idempotency approach I have implemented in production:
  • Idempotency key. Every message carries a unique idempotency key (the SQS MessageId, a hash of the message body, or a business-level identifier like order_id + charge_type). Before processing, the consumer checks if this key has been processed before.
  • The atomic check-and-process pattern. This is where most implementations go wrong. A naive approach — “read from database, check if key exists, if not process, then write key” — has a race condition. Two concurrent Lambda instances receive the same duplicate message, both check the database simultaneously, both see “not processed,” both process, both write the key. You still get a duplicate.
The correct pattern uses an atomic conditional write:
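import json
import time

from botocore.exceptions import ClientError


def handle_message(idempotency_table, message_id, message):
    """Process one SQS message at most once, keyed by message_id."""
    now = int(time.time())
    try:
        # DynamoDB: conditional put that fails if the key already exists,
        # so the existence check and the write are a single atomic operation.
        idempotency_table.put_item(
            Item={
                'idempotency_key': message_id,
                'processed_at': now,
                'result': None,  # Will be updated after processing
                'ttl': now + 86400 * 7,  # Expire after 7 days
            },
            ConditionExpression='attribute_not_exists(idempotency_key)',
        )
    except ClientError as err:
        if err.response['Error']['Code'] != 'ConditionalCheckFailedException':
            raise  # Unrelated failure (throttling, permissions): let SQS retry
        # Key already recorded -- skip
        return {'statusCode': 200, 'body': 'Duplicate, skipped'}

    # Process the message (charge the customer)
    result = process_charge(message)

    # Update the idempotency record with the result. The attribute name is
    # aliased so it cannot collide with a DynamoDB reserved word.
    idempotency_table.update_item(
        Key={'idempotency_key': message_id},
        UpdateExpression='SET #res = :r',
        ExpressionAttributeNames={'#res': 'result'},
        ExpressionAttributeValues={':r': json.dumps(result)},
    )
    return {'statusCode': 200, 'body': 'Processed'}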
The ConditionExpression makes the check-and-write atomic. If two Lambda instances try simultaneously, one succeeds and one gets ConditionalCheckFailedException. The one that fails skips processing. Zero duplicates.
  • TTL for cleanup. Idempotency keys should not live forever. Set a DynamoDB TTL to expire keys after 7 days (or whatever your SQS message retention is). This keeps the table from growing unbounded.
  • What about SQS FIFO? SQS FIFO provides exactly-once delivery within a 5-minute deduplication window. But it does not provide exactly-once processing. If the consumer crashes after processing but before deleting the message, the message reappears after the visibility timeout. You still need the idempotency pattern. FIFO reduces the probability of duplicates but does not eliminate it. For 2 million messages/day (23 messages/second), SQS FIFO works within its throughput limits. But I would still implement idempotency because the cost is negligible and the protection is absolute.
  • The broader pattern. Idempotency is not just for SQS. API endpoints should be idempotent (Stripe’s Idempotency-Key header is the gold standard). Event processors should be idempotent. Anything that can be retried should be idempotent. I treat idempotency the same way I treat input validation — it is not optional, it is a fundamental requirement for any operation that has side effects.
War Story: At a subscription billing platform processing 4.2 million charge events/month through SQS, we discovered that 0.03% of charges were duplicated, roughly 1,260 double-charges per month. Each averaged $45. Total customer overcharge: $56,700/month. The duplicate rate was low enough that customer support handled it as one-off refund requests for months before anyone connected the pattern. Root cause: SQS Standard’s at-least-once delivery during high-throughput periods, combined with Lambda consumers that had no idempotency check. The fix (DynamoDB conditional writes as shown above) took 3 days to implement across 4 consumer Lambda functions. DynamoDB cost for the idempotency table: $12/month. Revenue impact prevented: $56,700/month. The cost-to-fix-to-cost-of-not-fixing ratio was roughly 1:5,000.

Follow-up: What happens if the Lambda crashes between the idempotency write and the actual processing? The key is recorded but the charge never happened.

Answer: This is the “phantom record” problem. The idempotency key is in the database, so any retry skips processing, but the original processing never completed. The customer is never charged. Solution: Two-phase idempotency. Write the idempotency record with a status: PENDING in the first atomic write. Process the charge. Update the record to status: COMPLETED with the result. If the Lambda crashes between the initial write and the processing, the record exists with status: PENDING. A separate “reconciliation” Lambda runs every 5 minutes (scheduled via EventBridge), queries for PENDING records older than 10 minutes, and retries them. The retry Lambda reads the PENDING record, attempts the charge, and updates the status. This handles all failure modes: duplicate delivery (blocked by conditional write), crash before processing (retried by reconciler), crash after processing but before status update (reconciler retries, but the downstream payment processor should also be idempotent — Stripe’s Idempotency-Key prevents double-charging even if we call it twice). The pattern is: make every step independently retry-safe, and have a background reconciler that catches anything that fell through the cracks.
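The two-phase flow and the reconciler can be sketched with an in-memory stand-in for the table. Everything named here (`InMemoryIdempotencyStore`, `put_if_absent`, `handle`, `reconcile`) is hypothetical; `put_if_absent` plays the role of the DynamoDB conditional put, and `charge` is assumed to be idempotent on the key, as the answer requires of the payment processor.

```python
import time


class InMemoryIdempotencyStore:
    """Stand-in for the DynamoDB table; put_if_absent mimics a conditional put."""

    def __init__(self):
        self.records = {}

    def put_if_absent(self, key, record):
        if key in self.records:
            return False
        self.records[key] = record
        return True


def _complete(store, key, charge):
    """Run the charge and mark the record COMPLETED. Safe to call more than once."""
    record = store.records[key]
    record["result"] = charge(key, record["message"])  # charge must be idempotent on key
    record["status"] = "COMPLETED"


def handle(store, key, message, charge, now=time.time):
    """Phase 1: claim the key as PENDING. Phase 2: charge and mark COMPLETED."""
    claimed = store.put_if_absent(
        key, {"status": "PENDING", "message": message, "created_at": now()}
    )
    if not claimed:
        return "skipped"  # duplicate delivery
    _complete(store, key, charge)  # a crash here leaves the record PENDING
    return "processed"


def reconcile(store, charge, now=time.time, max_age_seconds=600):
    """Retry PENDING records older than max_age_seconds; return the keys retried."""
    retried = []
    for key, record in store.records.items():
        stale = now() - record["created_at"] > max_age_seconds
        if record["status"] == "PENDING" and stale:
            _complete(store, key, charge)
            retried.append(key)
    return retried
```

Storing the original message on the PENDING record is a design choice of this sketch: it lets the reconciler retry without needing the SQS message to still be available.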

Follow-up: How do you size the DynamoDB table for the idempotency store at 2 million messages/day?

Answer: Quick math: 2 million writes/day = 23 writes/second sustained. Each idempotency record is roughly 200 bytes (key + timestamp + status + result). With 7-day TTL, the table holds ~14 million items at steady state = ~2.8 GB storage. For DynamoDB on-demand: 23 WRU/second is trivial. Cost: ~$1.25/million writes * 60 million writes/month = $75/month. Read capacity for the conditional check is included in the write (DynamoDB’s conditional write does a read internally). For provisioned: 25 WCU (with some headroom) and 5 RCU (for the reconciliation queries). Cost: ~$15/month. Storage: 2.8 GB * $0.25/GB = $0.70/month. Total: $16-76/month depending on mode. This is negligible compared to the cost of duplicate charges it prevents.
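The sizing arithmetic is easy to re-run for other volumes. In the sketch below the unit prices are the figures quoted in this answer, not current AWS pricing, and the function name and parameters are made up for the example.

```python
SECONDS_PER_DAY = 86_400


def size_idempotency_table(
    messages_per_day,
    item_bytes=200,
    ttl_days=7,
    price_per_million_writes=1.25,  # on-demand figure quoted in the text (assumption)
    price_per_gb_month=0.25,        # storage figure quoted in the text (assumption)
    days_per_month=30,
):
    """Back-of-envelope sizing for the idempotency store, one write per message."""
    steady_state_items = messages_per_day * ttl_days
    storage_gb = steady_state_items * item_bytes / 1e9
    monthly_writes = messages_per_day * days_per_month
    return {
        "writes_per_second": messages_per_day / SECONDS_PER_DAY,
        "steady_state_items": steady_state_items,
        "storage_gb": storage_gb,
        "on_demand_write_cost": monthly_writes / 1e6 * price_per_million_writes,
        "storage_cost": storage_gb * price_per_gb_month,
    }


estimate = size_idempotency_table(2_000_000)
```

For 2 million messages/day this reproduces the figures above: about 23 writes/second, 14 million items, 2.8 GB, $75/month in on-demand writes, and $0.70/month in storage. The estimate counts one write per message; a flow that does a conditional put followed by a separate update performs two writes per message, so doubling the write cost gives a more conservative number.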

Cross-Chapter Connections

This chapter connects to:
  • Cloud, Problem Framing & Trade-Offs — The decision frameworks and architectural thinking that precede service selection. Read that chapter first for the “why,” then this chapter for the “how.”
  • Database Deep Dives — PostgreSQL internals (MVCC, WAL, VACUUM, query planner), DynamoDB access pattern design and single-table strategy, MongoDB aggregation patterns, and Redis data structures. Essential for understanding what runs inside RDS, Aurora, DynamoDB, and ElastiCache.
  • Messaging, Concurrency & State — Deeper coverage of messaging semantics (exactly-once, idempotency, ordering guarantees), Kafka vs RabbitMQ vs SQS comparisons, dead letter queue patterns, and event-driven architecture principles that apply to SQS, Kinesis, and EventBridge patterns in this chapter.
  • Authentication & Security — OAuth 2.0, OIDC, RBAC, ABAC, Zero Trust Architecture, and service-to-service authentication patterns that IAM roles, Secrets Manager, and SCPs implement at the AWS level. Essential context for the IAM, multi-account, and security sections in this chapter.
  • Networking & Deployment — DNS resolution (Route 53 implements these patterns), load balancing (ALB/NLB), TLS termination, service discovery, VPC fundamentals, and deployment strategies (blue-green, canary) applied to AWS services.
  • Compliance, Cost & Debugging — FinOps practices and cloud cost governance that extend this chapter’s Cost Engineering section, GDPR/HIPAA implications for S3 data residency and multi-account compliance boundaries, and incident response for cloud outages.
  • APIs & Databases — REST API design, database indexing, and transaction patterns that inform API Gateway, DynamoDB, and Aurora design decisions.
  • Caching & Observability — ElastiCache patterns (cache-aside, write-through, stampede prevention), cache invalidation strategies, and the observability stack (CloudWatch, X-Ray, Datadog) for monitoring cloud-native applications.
  • Performance & Scalability — Auto-scaling patterns, latency optimization, and capacity planning that apply directly to Lambda concurrency, Fargate scaling, and database sizing.
  • Reliability Principles — SLOs, error budgets, circuit breakers, retry patterns, bulkhead isolation, and chaos engineering. These resilience patterns are what make cloud architectures production-ready — Lambda retry behavior, SQS DLQ patterns, and multi-AZ failover all implement the principles covered there.
  • Distributed Systems Theory — CAP theorem, consensus algorithms, and the fundamental challenges of distributed computing. DynamoDB’s eventual consistency, Aurora’s distributed storage, and multi-region active-active architectures all trace back to these theoretical foundations.