An API Gateway is the single entry point for all client requests to your microservices. It handles cross-cutting concerns and provides a unified interface.
Think of an API Gateway like the reception desk at a large corporate office. Visitors (clients) do not wander the building knocking on individual office doors (services). They go to reception, identify themselves (authentication), get a visitor badge (authorization), and reception directs them to the right floor (routing). If a department is closed, reception tells the visitor immediately (circuit breaking) rather than letting them walk to an empty office. The trade-off: reception adds a stop to every visit (latency), and if reception itself goes down, nobody gets in (single point of failure). That is why production gateways run as horizontally scaled, stateless clusters: you need multiple receptionists.
                   CLIENT
                      │
        ┌─────────────┼─────────────┐
        │             │             │
        ▼             ▼             ▼
   ┌─────────┐   ┌─────────┐   ┌─────────┐
   │  User   │   │  Order  │   │ Product │
   │ Service │   │ Service │   │ Service │
   └─────────┘   └─────────┘   └─────────┘
Problems:
• Client needs to know all service URLs
• Each service implements auth, rate limiting
• No unified error handling
• CORS issues
• Multiple round trips
• Exposing internal services
                   CLIENT
                      │
                      ▼
               ┌─────────────┐
               │ API GATEWAY │
               │ ─────────── │
               │ • Routing   │
               │ • Auth      │
               │ • Rate Limit│
               │ • Caching   │
               │ • Logging   │
               └──────┬──────┘
        ┌─────────────┼─────────────┐
        │             │             │
        ▼             ▼             ▼
   ┌─────────┐   ┌─────────┐   ┌─────────┐
   │  User   │   │  Order  │   │ Product │
   │ Service │   │ Service │   │ Service │
   └─────────┘   └─────────┘   └─────────┘
Benefits:
• Single entry point
• Centralized cross-cutting concerns
• Protocol translation
• Request aggregation
• Service discovery abstraction
Caveats & Common Pitfalls: Why API Gateway (and why it can hurt you)
The gateway as a single point of failure. Every client request goes through it. If it is a single process, a single deploy, or a single AZ, you have created a top-level SPOF. Outages here are 100% customer-visible.
The gateway as a magnet for business logic. “We need to add a little logic to combine two responses.” Six months later, the gateway has become a de facto service with domain knowledge about orders, payments, and users. Deployments are slow, ownership is fuzzy, and bugs span teams.
The gateway as a bottleneck. All traffic passes through it, so any extra latency or CPU cost is multiplied by every request. Adding three middleware layers that each add 5ms turns into a flat 15ms tax on every endpoint.
The gateway bypass pattern emerges. Because the gateway is slow, painful to change, or owned by another team, internal clients start calling services directly. Auth, rate limiting, and observability are silently lost for those paths.
Solutions & Patterns: Thin gateway, clear responsibilities, horizontal scaling
Treat the gateway as infrastructure with a narrow, stable charter: routing, auth, rate limiting, TLS, observability. Push everything else (aggregation, business logic, domain translation) into a BFF service behind the gateway. Run the gateway as a stateless, horizontally scalable fleet with at least three replicas per AZ. Invest in gateway-specific CI/CD so changes can ship without a meeting.
Decision rule for “does this belong in the gateway?”: if the logic is cross-cutting (applies to every service uniformly) and has no domain knowledge, it belongs in the gateway. If it touches business rules or service-specific data shapes, it belongs in a BFF or the service itself.
Before: Checkout latency is 800ms; investigation reveals the gateway is doing JWT verification, rate limit check, tenant resolution, audit log write, recommendation fetch, and response shaping before calling the actual service.
After: Gateway handles JWT + rate limit (together ~3ms). Tenant resolution moves to a shared library used by services. Audit log becomes async. Recommendation fetch and response shaping move to a BFF service. Total latency drops to around 200ms and the gateway is now stateless and fast.
Your API gateway becomes a bottleneck: p99 climbs to 400ms in the gateway alone, separate from downstream latency. Walk through diagnosis and fix.
Strong Answer Framework:
Profile the gateway. Most gateways have per-middleware timing or you can add it cheaply. Identify which middleware is dominant — usually JWT verification (if doing a remote call), rate limiting (if not using local counters), or request body inspection.
Short-term wins: cache JWT verification results (only for tokens whose signature already verified, and with a TTL shorter than the token's remaining lifetime); move rate limit counters to local memory with periodic flush to Redis; stop inspecting request bodies for anything that does not strictly need them.
Move heavy logic out of the hot path. Audit logging, analytics enrichment, request/response transforms for a single endpoint, and business-specific rewrites should not be in the gateway. Push to sidecars, async pipelines, or per-service BFFs.
Scale horizontally. Gateways should be stateless so adding replicas linearly scales throughput. If scaling does not help, there is a shared dependency (Redis, auth service) that needs its own attention.
Set an SLO: “gateway-only overhead under 10ms at p99.” Alert on breaches. Over time, teams will push back on anything that threatens this budget, which is exactly the pressure you want.
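The profiling step in the framework above can be sketched with a small timing wrapper. This is a minimal illustration, not a specific gateway's API: the `timed` wrapper, the in-process `timings_ms` registry, and the dict-based request are all hypothetical stand-ins for whatever per-middleware metrics your gateway exposes.

```python
import time
from collections import defaultdict
from statistics import quantiles
from typing import Callable

# Hypothetical in-process registry: middleware name -> durations in milliseconds.
timings_ms: dict[str, list[float]] = defaultdict(list)


def timed(name: str, fn: Callable[[dict], dict]) -> Callable[[dict], dict]:
    """Wrap a middleware step so every call records its wall-clock duration."""
    def wrapper(request: dict) -> dict:
        start = time.perf_counter()
        try:
            return fn(request)
        finally:
            timings_ms[name].append((time.perf_counter() - start) * 1000.0)
    return wrapper


def p99(samples: list[float]) -> float:
    """99th percentile of the samples (needs at least two data points)."""
    return quantiles(samples, n=100)[98]


def slowest_step() -> str:
    """Name of the middleware with the highest p99, i.e. the first thing to fix."""
    return max(timings_ms, key=lambda name: p99(timings_ms[name]))
```

Comparing each step's p99 against the gateway-only budget (the 10ms SLO mentioned above) shows which middleware is spending it.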
Real-World Example: Netflix’s Zuul 2 rewrite (described in a 2016 Netflix Tech Blog post and open-sourced in 2018) addressed a closely related problem: Zuul 1 used a blocking, thread-per-request model, and the rewrite moved the gateway to an asynchronous, non-blocking architecture built on Netty so that slow work no longer tied up request threads. Envoy has a similar origin: Lyft built it because networking behavior and observability were inconsistent across its per-service HTTP libraries.
Senior Follow-up Questions:
Follow-up 1: “How do you decide if JWT validation should call an auth service or be local?”
If the token is self-contained (RS256 with a rotating public key), validate locally with a cached public key. That is under 1ms per request. If the token requires revocation checks or dynamic permission lookups, you cannot avoid a remote call, but you can cache its result for the token’s remaining lifetime.
Follow-up 2: “What is the risk of caching JWT validation results?”
If you cache for longer than the token’s remaining TTL, a revoked token could still be honored. The cache key should include the token’s expiration, and the cached entry should never outlive the token itself. For immediate revocation, you need an out-of-band revocation list with its own refresh cadence.
Follow-up 3: “When does a service mesh replace an API gateway?”
A service mesh (Istio, Linkerd) handles service-to-service concerns: mTLS, retries, observability. An API gateway handles edge concerns: client auth, rate limiting per client, public API versioning. They complement each other; most mature orgs run both. The mesh is not a replacement for the gateway.
Common Wrong Answers:
“Add more replicas until latency drops.” Fails when the bottleneck is a shared dependency (e.g., Redis for rate limits); more gateway replicas just hit the same wall.
“Remove middleware to speed up.” Fails because some middleware (auth, rate limit) is non-negotiable; the right move is to make it faster, not remove it.
Further Reading:
Netflix Tech Blog’s “Zuul 2: The Netflix Journey to Asynchronous, Non-Blocking Systems.”
Envoy documentation on filter chains and latency budgets.
Sam Newman’s Building Microservices, 2nd edition, chapter on API gateways.
Before we write a single line of routing code, we need to wire up the middleware pipeline that every request will flow through. The order of middleware matters enormously: request IDs, security headers, and body size limits must come before body parsing (so oversized bodies are rejected early), logging must come before auth (so you can see unauthenticated attacks), and error handlers must come last (to catch everything). Without this careful ordering, you get subtle bugs: a missing request ID means you cannot correlate logs across services, a missing CORS header means browsers silently reject responses, and a missing body size limit means a single attacker can OOM your gateway with a 10GB upload. The key tradeoff here is that every middleware adds latency and CPU cost, so you only add what earns its keep. Every team eventually wants to add “just one more” middleware and ends up with a 100ms gateway overhead.
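To make the ordering argument concrete, here is a framework-free sketch of such a pipeline. Everything in it is hypothetical (the dict-based request, the step names, the `MAX_BODY_BYTES` value, and the `Reject` exception); a real gateway would register the equivalent steps with its web framework in the same order.

```python
import uuid
from typing import Callable, Optional

Request = dict  # hypothetical minimal request: {"method", "path", "headers"}
MAX_BODY_BYTES = 1_000_000  # assumed limit for illustration


class Reject(Exception):
    """Raised by a step to stop the pipeline with an HTTP status."""
    def __init__(self, status: int, code: str) -> None:
        super().__init__(code)
        self.status, self.code = status, code


def request_id(req: Request) -> None:
    req["headers"].setdefault("x-request-id", str(uuid.uuid4()))


def body_limit(req: Request) -> None:
    # Runs before any body parsing, using only the declared length.
    if int(req["headers"].get("content-length", 0)) > MAX_BODY_BYTES:
        raise Reject(413, "PAYLOAD_TOO_LARGE")


def access_log(req: Request) -> None:
    req.setdefault("log", []).append(f'{req["method"]} {req["path"]}')


def auth(req: Request) -> None:
    if "authorization" not in req["headers"]:
        raise Reject(401, "UNAUTHORIZED")


# Order is the point: ID -> size limit -> logging -> auth.
PIPELINE: list[Callable[[Request], None]] = [request_id, body_limit, access_log, auth]


def handle(req: Request) -> dict:
    """Run the steps in order; the error handler wraps all of them."""
    try:
        for step in PIPELINE:
            step(req)
        return {"status": 200, "request_id": req["headers"]["x-request-id"]}
    except Reject as exc:
        rid: Optional[str] = req["headers"].get("x-request-id")
        return {"status": exc.status, "code": exc.code, "request_id": rid}
```

Because `access_log` runs before `auth`, a rejected anonymous request still leaves a log line, and because `request_id` runs first, even a 413 response carries an ID you can search for.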
Routing is the heart of a gateway: when a request comes in for /api/users/123, the gateway must know to forward it to the user service. This sounds trivial, but production routing has three non-obvious requirements. First, the gateway must rewrite the path so the downstream service does not need to know about the gateway prefix — the user service should receive /users/123, not /api/users/123. Second, authentication context must flow through — once the gateway validates the JWT, it injects headers like X-User-ID so downstream services do not need to re-validate. Third, correlation headers must propagate — without X-Request-ID flowing end-to-end, debugging a distributed trace is impossible. The key tradeoff is that every proxy call adds a TCP hop; use connection pooling and HTTP/2 to amortize this. Without proper routing, you end up with either clients that know all internal service URLs (tight coupling), or a big-ball-of-mud gateway where routing logic and business logic mix.
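The three routing requirements above (path rewrite, identity propagation, correlation headers) can be sketched in a few lines. The route table, hostnames, and ports are hypothetical, and the helper names are illustrative rather than part of any proxy library.

```python
from typing import Optional

# Hypothetical route table: gateway prefix -> upstream base URL.
ROUTES = {
    "/api/users": "http://user-service:3001/users",
    "/api/orders": "http://order-service:3002/orders",
}


def resolve_upstream(path: str) -> Optional[str]:
    """Strip the gateway prefix and return the full upstream URL, or None."""
    for prefix, upstream in ROUTES.items():
        # Match whole path segments only, so /api/usersX does not match /api/users.
        if path == prefix or path.startswith(prefix + "/"):
            return upstream + path[len(prefix):]
    return None


def upstream_headers(incoming: dict[str, str], user_id: str,
                     fallback_request_id: str) -> dict[str, str]:
    """Build the headers forwarded to the upstream service (keys lowercased)."""
    headers = {k.lower(): v for k, v in incoming.items()}
    # Never trust a client-supplied identity header; overwrite it with the
    # identity the gateway established from the validated JWT.
    headers["x-user-id"] = user_id
    # Keep an existing correlation ID so the trace stays end-to-end.
    headers.setdefault("x-request-id", fallback_request_id)
    return headers
```

For example, `resolve_upstream("/api/users/123")` yields a URL whose path is `/users/123`, so the user service never sees the `/api` prefix.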
Hard-coded upstreams. Service URLs compiled into gateway code means adding a new service requires a gateway deploy. Do this once and you are locked into a slow feedback loop.
No health checks on upstream registrations. The gateway keeps routing to a dead upstream because it trusts the registry. Symptoms: a fraction of requests time out and recover on retry, with no clear cause.
Routing rules that are too clever. Regex-based path routing or header-based feature flags accumulate until no one can predict where a request will go. Incidents become “why did this request reach the wrong service?” debugging.
Bypass paths. Internal teams hit services directly because the gateway does not support their use case yet. Those paths skip auth, rate limiting, and observability, and nobody notices until an incident.
Solutions & Patterns: Config-driven routing, active health checks, no bypass
Routing is configuration, not code. Use a dynamic config system (Consul, etcd, Kubernetes CRDs, or a purpose-built routing service) that the gateway watches. Services register themselves with health check endpoints; the gateway probes and removes unhealthy upstreams automatically. Keep routing rules declarative and simple: one path pattern per service, with feature flags handled inside the service.
Decision rule for bypass: if a client cannot get through the gateway, that is a gateway bug. Fix the gateway, do not let the client bypass. Track “bypass count” as a negative metric.
Before: gateway.js has a switch statement with 40 cases mapping paths to services. New service requires a PR, a review, a deploy, and a restart.
After: Gateway reads route table from Consul, refreshing every 10 seconds. New service registers itself; gateway picks it up within seconds. Health checks remove dead instances automatically.
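A sketch of the "after" state, with the config source abstracted away. The `RouteTable` class, its `failure_threshold`, and the snapshot shape passed to `load` are assumptions for illustration; in practice the snapshot would come from Consul, etcd, or a similar store on each refresh, and probe results from the gateway's health checker.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional


@dataclass
class Upstream:
    url: str
    healthy: bool = True
    consecutive_failures: int = 0


@dataclass
class RouteTable:
    failure_threshold: int = 3
    _routes: dict[str, list[Upstream]] = field(default_factory=dict)
    _counter: count = field(default_factory=count)

    def load(self, config: dict[str, list[str]]) -> None:
        """Apply a fresh config snapshot, keeping health state for known URLs."""
        known = {u.url: u for ups in self._routes.values() for u in ups}
        self._routes = {
            prefix: [known.get(url) or Upstream(url) for url in urls]
            for prefix, urls in config.items()
        }

    def report_probe(self, url: str, ok: bool) -> None:
        """Record one health-check result for an upstream instance."""
        for upstreams in self._routes.values():
            for upstream in upstreams:
                if upstream.url != url:
                    continue
                upstream.consecutive_failures = 0 if ok else upstream.consecutive_failures + 1
                upstream.healthy = upstream.consecutive_failures < self.failure_threshold

    def pick(self, prefix: str) -> Optional[str]:
        """Round-robin over healthy instances; None if the route has none."""
        healthy = [u for u in self._routes.get(prefix, []) if u.healthy]
        if not healthy:
            return None
        return healthy[next(self._counter) % len(healthy)].url
```

Adding a service is then a config change picked up by the next `load`, with no gateway deploy, and a dead instance drops out after `failure_threshold` failed probes.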
Two services with different teams both want to own '/api/payments/*'. How do you decide routing, and what does 'gateway governance' look like?
Strong Answer Framework:
Recognize this as a governance problem, not a technical one. URL paths are contracts with clients; splitting one path across two services is messy and usually indicates a team boundary drawn in the wrong place.
Clarify the domain. Is this truly two services (e.g., payments-intake and payments-settlement), or is it one domain with two teams collaborating? In the former, sub-path routing (/api/payments/intake/*, /api/payments/settlement/*) works. In the latter, find a single team to own the path.
Codify the governance in a gateway config review process. Changes to public URL paths go through a CODEOWNERS-style approval with a small platform team as backstop. No team ships a routing change without this review.
Use a service catalog. Every public path has a single owning service with a single team. Cross-team changes require explicit re-assignment.
Sunset paths carefully. If a team wants to take over a path, the old owner runs a parallel route with a deprecation date, telemetry shows the migration, and the switch-over is boring.
Real-World Example: Amazon’s well-known “two-pizza team” model gives each service a single owning team, and the same principle carries over to URL paths: one owner per path, with an explicit namespace split before two teams touch the same prefix. Uber’s 2020 “Introducing Domain-Oriented Microservice Architecture” blog post describes grouping services into domains, each exposed through a single gateway, which gives every entry point one clear owner and avoids this kind of dispute.
Senior Follow-up Questions:
Follow-up 1: “What if the two services genuinely serve different use cases under the same path (e.g., v1 and v2)?”
That is URL versioning, not cross-team ownership. /api/v1/payments/* and /api/v2/payments/* can be different services during migration. But they should share a common owner or a well-defined interface between owners.
Follow-up 2: “How do you audit routing changes?”
Every change to the route table is a git commit in a config repo with review. The gateway reads from that repo (or a derived store). You can git blame any route and find the PR that introduced it.
Follow-up 3: “What about canary routing: 10% to service-v2, 90% to service-v1?”
Supported by most gateways via weighted routing. Represent the weight in config; monitor by version; automate rollback on error-rate regression. Keep weights simple (increments of 10%) to avoid debugging percentage arithmetic in production.
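Weighted canary routing from Follow-up 3 can be sketched with a deterministic hash bucket. This is one possible approach, not how any particular gateway implements it; `pick_version` and the choice of request key (for example a user ID, so one user stays on one version) are assumptions.

```python
import hashlib


def pick_version(request_key: str, weights: dict[str, int]) -> str:
    """Map a stable key to a version according to percentage weights."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    # Hash the key into one of 100 buckets; the same key always lands in
    # the same bucket, so routing is deterministic and debuggable.
    digest = hashlib.sha256(request_key.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    cumulative = 0
    for version, weight in weights.items():
        cumulative += weight
        if bucket < cumulative:
            return version
    raise AssertionError("unreachable: weights sum to 100")
```

Usage: `pick_version(user_id, {"service-v1": 90, "service-v2": 10})`. Rolling back is a config change to `{"service-v1": 100, "service-v2": 0}`.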
Common Wrong Answers:
“Let both teams route to the same path and the gateway picks one.” Fails because non-deterministic routing makes incidents impossible to debug.
“Add a header to distinguish which service.” Fails because clients cannot reliably set headers without coordination, and this effectively forks the URL contract.
Authentication at the gateway is the single most important security boundary in a microservices architecture. The principle is simple: validate once at the edge, then trust inside the perimeter. Without this, every service would need to duplicate JWT validation logic — and when you rotate a key, you have to update every service. But there is a critical tradeoff: if the gateway’s auth logic is wrong, every service is wrong. That is why the code validates the token locally (using the signing key) instead of calling an auth service on every request (which would add 20-50ms per request and create a hard dependency on the auth service being up). If we skip this middleware entirely, every service must re-validate the JWT and enforce public/private route rules, leading to inconsistency — one service forgets to check the exp claim and tokens “never expire” there.
# gateway/src/middleware/auth.py
import re
from dataclasses import dataclass, field

from fastapi import Depends, HTTPException, Request
from jose import ExpiredSignatureError, JWTError, jwt
from pydantic_settings import BaseSettings, SettingsConfigDict


class AuthSettings(BaseSettings):
    # extra="ignore" lets this class share one .env file with other settings
    # classes without failing validation on keys it does not declare.
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    jwt_secret: str
    jwt_algorithm: str = "HS256"


auth_settings = AuthSettings()


@dataclass
class PublicRoute:
    method: str
    path: str | re.Pattern[str]


PUBLIC_PATHS: list[PublicRoute] = [
    PublicRoute("POST", "/api/users/register"),
    PublicRoute("POST", "/api/users/login"),
    PublicRoute("GET", "/api/products"),
    PublicRoute("GET", re.compile(r"^/api/products/[\w-]+$")),
]


@dataclass
class AuthenticatedUser:
    id: str
    email: str
    roles: list[str] = field(default_factory=list)
    permissions: list[str] = field(default_factory=list)


def _unauthorized(code: str, message: str) -> HTTPException:
    """Build a 401 with the gateway's error shape."""
    return HTTPException(status_code=401, detail={"code": code, "message": message})


def _is_public_path(method: str, path: str) -> bool:
    for route in PUBLIC_PATHS:
        method_match = route.method in (method, "*")
        if isinstance(route.path, re.Pattern):
            path_match = bool(route.path.match(path))
        else:
            path_match = route.path == path
        if method_match and path_match:
            return True
    return False


async def auth_dependency(request: Request) -> AuthenticatedUser | None:
    """Validate JWT on protected paths. Returns None for public routes."""
    if _is_public_path(request.method, request.url.path):
        return None

    auth_header = request.headers.get("authorization")
    if not auth_header or not auth_header.startswith("Bearer "):
        raise _unauthorized("UNAUTHORIZED", "Missing or invalid authorization header")

    token = auth_header[len("Bearer "):]
    try:
        # Signature and exp are checked locally; no call to an auth service.
        decoded = jwt.decode(
            token,
            auth_settings.jwt_secret,
            algorithms=[auth_settings.jwt_algorithm],
        )
    except ExpiredSignatureError:  # subclass of JWTError, so it must come first
        raise _unauthorized("TOKEN_EXPIRED", "Authentication token has expired")
    except JWTError:
        raise _unauthorized("INVALID_TOKEN", "Invalid authentication token")

    # A validly signed token without a subject is still unusable; reject it
    # with a 401 instead of letting a KeyError surface as a 500.
    user_id = decoded.get("sub")
    if not user_id:
        raise _unauthorized("INVALID_TOKEN", "Invalid authentication token")

    user = AuthenticatedUser(
        id=user_id,
        email=decoded.get("email", ""),
        roles=decoded.get("roles", []),
        permissions=decoded.get("permissions", []),
    )
    # Stash on request.state so downstream handlers/proxies can forward it
    request.state.user = user
    return user


# Wire as a dependency on the protected router
auth_middleware = Depends(auth_dependency)


def require_roles(*required_roles: str):
    """Declarative RBAC for individual routes."""

    async def checker(
        user: AuthenticatedUser | None = Depends(auth_dependency),
    ) -> AuthenticatedUser:
        if user is None:
            raise _unauthorized("UNAUTHORIZED", "Authentication required")
        if not any(role in user.roles for role in required_roles):
            raise HTTPException(
                status_code=403,
                detail={"code": "FORBIDDEN", "message": "Insufficient permissions"},
            )
        return user

    return checker
Caveats & Common Pitfalls: Auth at the gateway
Trust-the-gateway everywhere. Once a request is inside the perimeter, any service accepts it. A single bypass (an internal port open, a misconfigured Istio rule) lets an attacker call any service freely. Defense in depth means services still verify their caller.
Auth as a remote call on every request. A 20-50ms hop to an auth service on every request destroys latency and creates a hard dependency on the auth service being up. Use self-contained tokens (RS256 JWT) or cached validation.
Leaking the bearer token downstream. Forwarding the raw client JWT to every internal service means an internal service can impersonate the user to any other service. Use a per-hop service identity (mTLS or SPIFFE) and pass claims in a stripped-down internal header.
No key rotation plan. The signing key has been the same for 4 years. When it leaks, you cannot rotate without breaking every token in flight. Plan rotation from day one with overlapping validity windows.
Solutions & Patterns: Validate-at-edge, stripped identity propagation, defense in depth
The gateway validates the client JWT, extracts claims (user id, tenant, scopes), and propagates them via internal headers such as X-User-Id and X-User-Tenant, signed with an internal key that services verify. The original client token does not leave the gateway. Services trust the internal header because they verify its signature, not because they trust “anything that got to us.” Key rotation uses a JWKS endpoint so new keys propagate automatically.
Decision rule for auth location: always at the gateway for client-facing traffic. For service-to-service, use workload identity (mTLS certificates, SPIFFE IDs) that is independent of the user’s identity. The two authentications compose: “user U acting via service S.”
Before: Gateway forwards the raw user JWT to downstream services. A compromised service can call any other service in the user’s name. Key rotation requires coordinating across every service simultaneously.
After: Gateway validates JWT, issues an internal signed header with user claims, services trust that header only if signed by the gateway’s internal key. Service-to-service auth uses mTLS with rotating certificates. Key rotation affects only the gateway and its JWKS endpoint.
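The signed internal header can be sketched with the standard library. Note the simplification: this uses a shared HMAC key to stay dependency-free, whereas the pattern described above uses asymmetric keys published via JWKS so that services hold only the public half. The function names, header format, and 30-second lifetime are all assumptions.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def sign_identity(claims: dict, key: bytes, ttl_seconds: int = 30,
                  now: Optional[float] = None) -> str:
    """Gateway side: serialize claims with a short expiry and append a MAC."""
    issued = time.time() if now is None else now
    payload = dict(claims, exp=int(issued) + ttl_seconds)
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode())
    sig = hmac.new(key, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{sig}"


def verify_identity(header: str, key: bytes,
                    now: Optional[float] = None) -> Optional[dict]:
    """Service side: return the claims only if the MAC and expiry check out."""
    body, sep, sig = header.rpartition(".")
    if not sep:
        return None
    expected = hmac.new(key, body.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison; verify before decoding anything.
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return None
    payload = json.loads(base64.urlsafe_b64decode(body))
    current = time.time() if now is None else now
    if payload["exp"] <= current:
        return None
    return payload
```

The short expiry limits replay: a header captured by a compromised service is useless within seconds, unlike a forwarded client JWT that stays valid for its full lifetime.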
You are designing auth for a new platform. The team lead says 'just put auth in the gateway and trust downstream.' What do you push back on, and what do you propose?
Strong Answer Framework:
Agree on the core: yes, validate the client token at the gateway to avoid duplicating JWT parsing in every service.
Push back on “trust downstream without verification.” Defense in depth requires services to verify their caller, even if it is another internal service. The mechanism is workload identity (mTLS or SPIFFE), not user identity.
Propose a two-layer model: user identity (gateway-signed internal header, services verify the signature) and service identity (mTLS certificate issued by an internal CA, rotated frequently). A service authorizes on both: “user U via service S.”
Address key management: a JWKS endpoint the gateway hosts, services fetch and cache the public keys, rotation is automated with overlapping windows. No manual key distribution.
Document the threat model: what happens if a service is compromised? If only user identity, the attacker can act as any user. With service identity, the attacker can only act as the compromised service to services that explicitly trust it.
Real-World Example: Google’s BeyondCorp model abandoned the “trusted internal network” assumption and enforces identity at every hop. SPIFFE and SPIRE, now graduated CNCF projects, codify workload identity for microservices. Istio’s mTLS and authorization policies implement this two-layer model.
Senior Follow-up Questions:
Follow-up 1: “Why not use the raw JWT internally?”
Three reasons. One, the JWT may carry sensitive claims (email, PII) that not every downstream service needs; stripped identity headers follow least privilege. Two, a compromised service can replay the user’s JWT against other services anywhere, not just the ones it was supposed to call. Three, rotating signing keys for user JWTs is harder than rotating internal-only signing keys.
Follow-up 2: “How do you handle service-to-service auth when there is no user context (e.g., a cron job)?”
Service identity (mTLS or SPIFFE) is sufficient. The cron service has its own identity; downstream services authorize based on “which service am I talking to” without needing a user claim.
Follow-up 3: “How does a service mesh fit in?”
The mesh handles service identity automatically: every sidecar proxy presents an mTLS cert, and policy governs which service can call which. The gateway still handles user identity. Together, you get “this user via this service is allowed to call this operation.”
Common Wrong Answers:
“Validate JWT at the gateway and trust the internal network.” Fails the “what if one service is compromised” test, and modern threat models (supply chain, zero-day) make internal compromise realistic.
“Pass the raw JWT everywhere for simplicity.” Fails because it creates a blast-radius problem on compromise and couples service auth to user auth forever.
Further Reading:
BeyondCorp papers from Google on identity-aware proxies.
SPIFFE/SPIRE documentation and the “Solving the Bottom Turtle” book.
Istio documentation on authorization policies and peer authentication.
Rate limiting exists because a single misbehaving client can bring down your entire platform. One infinite loop in a customer’s script, one bot scraping your API, one bug in a mobile app that retries too aggressively — any of these can 100x your traffic in seconds. The job of rate limiting is to protect your services by rejecting excess requests at the edge, before they consume database connections or CPU. The tradeoff is operational complexity: per-user limits need a shared counter across gateway instances (Redis), and that Redis call adds 1-2ms per request. Skip rate limiting and you get cascading outages — a single bad actor causes database contention, which slows down legitimate users, who retry, which creates more load. The layered approach below (global + per-user + per-endpoint) matters because different attacks look different: a credential-stuffing attack hits /login hard but looks normal globally, while a scraper hits /products from many IPs but stays under per-user limits.
# gateway/src/middleware/rate_limiter.py
import hmac
from typing import Callable

import redis.asyncio as redis_async
from fastapi import HTTPException, Request
from pydantic_settings import BaseSettings, SettingsConfigDict
from slowapi import Limiter
from slowapi.util import get_remote_address
from starlette.responses import JSONResponse


class RateLimitSettings(BaseSettings):
    # extra="ignore" lets this class share one .env file with other settings classes.
    model_config = SettingsConfigDict(env_file=".env", extra="ignore")

    redis_url: str = "redis://localhost:6379/0"
    internal_secret: str = ""


rl_settings = RateLimitSettings()

# slowapi uses a sync storage URI; redis-py async client is for custom counters.
# Note: default_limits apply per key_func value (here, per client IP).
limiter = Limiter(
    key_func=get_remote_address,
    storage_uri=rl_settings.redis_url,
    default_limits=["1000/minute"],
)

# Separate async client for custom per-endpoint logic
redis_client = redis_async.from_url(rl_settings.redis_url, decode_responses=True)


def _rate_key(request: Request) -> str:
    """Per-user when authenticated, otherwise per-IP.

    request.state.user is only present if auth ran earlier in the pipeline;
    if auth is a route dependency, top-level middleware will key by IP.
    """
    user = getattr(request.state, "user", None)
    if user is not None:
        return f"user:{user.id}"
    return f"ip:{get_remote_address(request)}"


def _is_internal_call(request: Request) -> bool:
    """True only when a non-empty internal secret is configured and matches."""
    secret = rl_settings.internal_secret
    presented = request.headers.get("x-internal-service", "")
    # An unset secret must never match, or an empty header would bypass the limiter.
    return bool(secret) and hmac.compare_digest(presented.encode(), secret.encode())


async def _hit(key: str, window_seconds: int) -> int:
    """Increment a fixed-window counter and return the new count."""
    count = await redis_client.incr(key)
    if count == 1:
        # First hit in this window starts the expiry clock. INCR and EXPIRE are
        # two separate commands, so this pair is not atomic.
        await redis_client.expire(key, window_seconds)
    return count


async def rate_limiter(request: Request, call_next: Callable):
    """Top-level middleware that enforces 100 req/min per user (or IP)."""
    # Skip for internal service-to-service calls
    if _is_internal_call(request):
        return await call_next(request)

    count = await _hit(f"ratelimit:{_rate_key(request)}:minute", 60)
    if count > 100:
        return JSONResponse(
            status_code=429,
            content={
                "error": {
                    "code": "RATE_LIMIT_EXCEEDED",
                    "message": "Too many requests, please try again later",
                }
            },
            headers={"Retry-After": "60"},
        )

    response = await call_next(request)
    response.headers["X-RateLimit-Limit"] = "100"
    response.headers["X-RateLimit-Remaining"] = str(max(0, 100 - count))
    return response


async def endpoint_limiter(
    request: Request,
    *,
    max_requests: int,
    window_seconds: int,
    message: str = "Rate limit exceeded for this endpoint",
) -> None:
    """Endpoint-scoped limiter — call from inside specific routes."""
    user_key = _rate_key(request)
    key = f"ratelimit:endpoint:{user_key}:{request.url.path}:{window_seconds}"
    count = await _hit(key, window_seconds)
    if count > max_requests:
        raise HTTPException(
            status_code=429,
            detail={"code": "ENDPOINT_RATE_LIMIT", "message": message},
            # Tell clients when to come back so they do not retry immediately.
            headers={"Retry-After": str(window_seconds)},
        )


# Ready-made helpers for specific routes
async def auth_limiter(request: Request) -> None:
    await endpoint_limiter(
        request,
        max_requests=5,
        window_seconds=15 * 60,
        message="Too many login attempts, please try again later",
    )


async def password_reset_limiter(request: Request) -> None:
    await endpoint_limiter(
        request,
        max_requests=3,
        window_seconds=60 * 60,
        message="Too many password reset requests",
    )
Caveats & Common Pitfalls: Rate limiting at the wrong layer
Per-instance counters that do not aggregate. Each gateway replica tracks its own counter. With 10 replicas, a “100 requests per minute” limit effectively becomes 1,000, so abusive clients never hit the limit you think you configured until you move the counters to shared state.
Rate limiting in the service instead of the gateway. Services spend CPU processing requests that should have been rejected at the edge. Under attack, the service is still overwhelmed; the rate limiter was just a rubber stamp.
One global limit for all use cases. Login, search, and file upload all share the same per-user limit. A user hitting search hard cannot log in. Or attackers hammer /login under the shared limit without tripping anything specific.
Ignoring Retry-After. Clients retry immediately, hitting the limit again, getting rejected again. The header exists specifically to coordinate backoff; always set it.
Solutions & Patterns: Layered limits, shared state, honest responses
Three layers of rate limits: global (protect infrastructure from absolute volume), per-user (fair usage), per-endpoint (protect specific operations like login). Counters live in a shared store (Redis) with atomic increment. On limit exceeded, return 429 with Retry-After. Different endpoints get different limits (login: 5/minute, search: 100/minute, file upload: 10/hour).
Decision rule for layer placement: always at the gateway, closest to the client. Any service-level rate limit is defense in depth, not the primary mechanism. The gateway rejects before consuming downstream resources.
Before: Gateway has a flat 1000 req/min per user limit with in-memory counters. Ten replicas mean actual limit is 10,000/min. A credential-stuffing attack against /login succeeds because login shares the global limit.
After: Gateway uses Redis for global counters. /login has a specific 5/min limit per IP and per account. /api/* has 100/min per user. File upload has 10/hour per user. A credential-stuffing attack trips the login-specific limit at request 6 and gets locked out.
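The layered check in the "after" state can be sketched as one function that consults each layer in turn. The in-memory `FixedWindowCounter` is a stand-in for the shared Redis counters (and never evicts old windows), and the global figure of 1000/min is an assumed value; the login and per-user limits follow the numbers above.

```python
import time
from typing import Callable, Optional


class FixedWindowCounter:
    """In-memory stand-in for a shared counter store such as Redis."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._counts: dict[tuple[str, int], int] = {}

    def hit(self, key: str, window_seconds: int) -> int:
        window = int(self._clock() // window_seconds)
        slot = (key, window)
        self._counts[slot] = self._counts.get(slot, 0) + 1
        return self._counts[slot]


def check_layers(counter: FixedWindowCounter, *, client_ip: str,
                 user_id: Optional[str], path: str) -> Optional[str]:
    """Return the name of the first layer that rejects, or None to allow."""
    # (layer name, counter key, limit, window in seconds)
    layers = [("global", "global", 1000, 60)]  # assumed global cap
    if path == "/login":
        layers.append(("endpoint", f"login:ip:{client_ip}", 5, 60))
    if user_id is not None:
        layers.append(("per-user", f"user:{user_id}", 100, 60))
    for name, key, limit, window in layers:
        if counter.hit(key, window) > limit:
            return name
    return None
```

Returning the layer name makes 429 responses and logs say which limit tripped, which is the first question in any rate-limit incident.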
A legitimate enterprise customer complains they are getting rate-limited even though they pay for a 'premium tier.' Walk through how you diagnose and fix the system.
Strong Answer Framework:
Confirm the diagnosis: capture request ID, timing, 429 response, and the rate-limit headers. Is the limit being hit globally, per-user, or per-endpoint?
Check the rate-limit key. Is the customer being identified as a single “user” even though they have 100 machines making requests? A common mistake is keying on user ID when the enterprise has many concurrent processes.
Verify tier logic. Does the gateway actually read the customer’s tier from the token/profile and apply a higher limit? Often the code reads “default” when the user object is missing the tier field.
Fix the immediate case: raise their limit manually, capture the evidence, and push a permanent fix.
Generalize: rate-limit keys should match the billing model. If enterprises pay per-seat, key on seat or API key. If per-request volume, key on tenant. Misalignment between billing and rate-limit keys produces exactly these incidents.
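The two failure modes in this framework (a silent fallback to the default tier, and a key that does not match the billing unit) can be sketched together. The tier names, limits, and claim fields (`tier`, `tenant_id`) are hypothetical, and keying on the tenant assumes the customer is billed per tenant.

```python
import logging

logger = logging.getLogger("gateway.ratelimit")

# Hypothetical tiers and per-minute limits.
TIER_LIMITS = {"free": 100, "premium": 1000}
DEFAULT_TIER = "free"


def resolve_rate_limit(claims: dict) -> tuple[str, int]:
    """Return (counter key, per-minute limit) for already-verified claims."""
    tier = claims.get("tier")
    if tier not in TIER_LIMITS:
        # Falling back is fine; falling back silently is how a paying
        # customer ends up on the free limit without anyone noticing.
        logger.warning("missing or unknown tier %r for tenant %s; using %s",
                       tier, claims.get("tenant_id"), DEFAULT_TIER)
        tier = DEFAULT_TIER
    # Key on the billing unit (assumed: tenant) so that all of one
    # customer's machines share the budget they actually paid for.
    return f"ratelimit:tenant:{claims['tenant_id']}", TIER_LIMITS[tier]
```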
Real-World Example: GitHub’s API rate limits are explicitly tiered by auth method and customer tier; the docs spell out 5,000 requests/hour for authenticated users and 15,000/hour for GitHub App installations on GitHub Enterprise Cloud organizations. Stripe similarly documents its rate limits, returns 429 when they are exceeded, and can raise limits for high-volume customers on request. The failure mode where “premium” means “same limit as free” is a recurring pattern in startup APIs.
Senior Follow-up Questions:
Follow-up 1: “What is the sliding-window vs fixed-window trade-off?”
Fixed window is simpler: count per minute, reset at minute boundary. The problem is edge behavior: a client can send 2x the limit if they send N at minute’s end and N at the next minute’s start. Sliding window counts over a rolling period and avoids this, at slightly higher compute cost. Most production systems use token bucket, which has similar smoothness properties.
Follow-up 2: “How do you handle burst capacity?”
Token bucket allows accumulated tokens up to a max, so short bursts are fine as long as the long-term rate stays within the limit. This matches real usage patterns: a user might send 50 requests in a second and then nothing for a minute. A strict rate limit rejects the burst; token bucket allows it.
Follow-up 3: “How do you rate-limit anonymous traffic?”
By IP, with an understanding that IP is noisy (NAT, VPN, CGNAT). Use a looser limit per IP and tighter limit per identified user. Layer WAF and bot-detection for known abuse patterns.
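The token bucket from Follow-up 2 fits in a few lines. This is a single-process sketch with an injectable clock for testing; a gateway fleet would keep the bucket state in a shared store, and the rate and capacity in the usage note are illustrative values.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows bursts up to `capacity` while enforcing `rate_per_second` long-term."""

    def __init__(self, rate_per_second: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_second
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self._clock()
        # Refill for the elapsed time, never beyond capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

With `TokenBucket(rate_per_second=2, capacity=50)`, a client can send 50 requests at once, is then rejected, and earns back two requests for every second it stays quiet.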
Common Wrong Answers:
“Just raise the limit for everyone.” Fails because it removes protection from infrastructure and does not address the ticket.
“Tell the customer to call less.” Fails because the customer is paying for a tier that was supposed to provide the capacity.
Further Reading:
GitHub REST API rate limit documentation.
Stripe’s engineering blog on rate limiting.
Kong documentation on rate-limiting plugins and their Redis backends.
Design the auth, rate limiting, and routing for a new API: what goes in the gateway vs in the service? Walk through your reasoning.
Strong Answer Framework:
State the principle: cross-cutting concerns with no domain knowledge belong in the gateway. Anything touching business rules belongs in the service.
In the gateway: TLS termination, JWT signature verification, rate limiting (global, per-user, per-endpoint), request routing, correlation ID injection, basic request/response logging, CORS.
In the service: authorization decisions (can this user do this action on this resource), business-specific validation, domain-level rate limits (e.g., “you can only create 10 projects per org per day”), idempotency key handling, and all business logic.
Explain the gray areas. Authentication is gateway; authorization is service because it needs domain context. Basic rate limiting is gateway; business-tier rate limiting (“paid customers get 10x”) can be either depending on whether tier is exposed in the JWT.
Document the contract: the gateway promises to pass a verified X-User-Id and scopes; the service promises to enforce authorization and business rules. Nothing in either layer assumes the other’s responsibilities.
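The contract in the last step can be made concrete with a small service-side sketch. It assumes the gateway forwards a verified `X-User-Id` header (named in the text) plus a space-separated `X-Scopes` header; the `X-Scopes` name, the `orders:read` scope, and all function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Caller:
    user_id: str
    scopes: frozenset


class Forbidden(Exception):
    """Raised when the service-level authorization check fails."""


def caller_from_headers(headers: Mapping[str, str]) -> Caller:
    # The service trusts these headers only because the gateway verified the JWT.
    # Assumption: header lookup here is case-sensitive; real frameworks normalize names.
    user_id = headers.get("X-User-Id", "")
    if not user_id:
        raise Forbidden("missing verified identity")
    scopes = frozenset(headers.get("X-Scopes", "").split())
    return Caller(user_id=user_id, scopes=scopes)


def authorize_order_read(caller: Caller, order_owner_id: str) -> None:
    # Authorization needs domain context (who owns the order), so it lives in the service.
    if "orders:read" not in caller.scopes:
        raise Forbidden("missing scope orders:read")
    if caller.user_id != order_owner_id:
        raise Forbidden("order belongs to another user")
```

The gateway never learns who owns an order; it only guarantees that the identity and scopes it forwards are authentic.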
Real-World Example: Kong, Envoy, and AWS API Gateway all ship with plugins/filters for exactly the concerns listed as “gateway” above. The pattern has settled across the industry over the past 5-7 years. Gateways like Netflix’s Zuul 2 explicitly exclude business logic via architectural policy. Conversely, early monolithic gateway attempts (e.g., Mashery-style enterprise gateways circa 2012) that held business logic became unmaintainable.
Senior Follow-up Questions:
Follow-up 1: “Where does request transformation (e.g., GraphQL-to-REST) fit?”
In a BFF or a dedicated transformation service, behind the gateway. The gateway should not hold service-specific contract knowledge; transformations are domain-aware and belong next to the domain.
Follow-up 2: “Is it ever right to put business logic in the gateway?”
Rarely, and only for cross-cutting policy that genuinely applies to every service uniformly: tenant isolation (reject requests whose tenant does not match the JWT), global IP denylists, legal/compliance (block specific countries for certain endpoints). Anything narrower belongs in the service.
Follow-up 3: “How do you prevent scope creep into the gateway?”
Explicit RFC process for adding gateway functionality; a latency budget the gateway owner defends; regular reviews of gateway plugins to remove or relocate ones that do not fit. If you do not have active curation, scope creep is guaranteed over 2-3 years.
Common Wrong Answers:
“Put everything in the gateway so services can be simple.” Fails because the gateway becomes a monolith with no ownership and blocks every team.
“Put everything in the services so the gateway is just a proxy.” Fails because every service re-implements auth and rate limiting with subtle differences, and you inevitably get bugs and gaps.
Further Reading:
Sam Newman’s Building Microservices, 2nd edition, on gateway patterns.
Phil Calçado’s “Pattern: API Gateway / Backends for Frontends.”
Netflix Tech Blog on Zuul 2 and its plugin scope policy.
Aggregation exists because mobile and web clients are penalized by round trips: on a mobile network each request pays at least one high-latency radio round trip, and a new connection can add roughly 100-300ms for the TCP + TLS handshake. If a dashboard needs data from five services, doing that client-side means five mobile round trips (sequential or parallel) versus one trip if the gateway aggregates server-side. The key technique is parallel fan-out using Promise.all (JavaScript) or asyncio.gather (Python): you call all five services simultaneously and wait for the slowest one, not the sum of all. The important nuance is Promise.allSettled versus Promise.all; the asyncio counterpart is gather with return_exceptions=True versus plain gather. For a dashboard, if the recommendations service is down, you still want to show orders and cart (partial response), so use allSettled. For an order detail page where missing data breaks the UI, use all and fail fast. The tradeoff is that the gateway now has partial knowledge of downstream service contracts, which couples it to those services. Overdo this and your gateway becomes a dumping ground for orchestration logic that belongs in a BFF service.
# gateway/src/aggregation/order_details.py
import asyncio
from dataclasses import dataclass
from datetime import datetime
from typing import Any

import httpx
from fastapi import APIRouter, Depends, HTTPException

from gateway.middleware.auth import auth_dependency, AuthenticatedUser
from gateway.routes import services, get_http_client


class AggregationError(Exception):
    pass


@dataclass
class OrderAggregator:
    client: httpx.AsyncClient

    async def _get(self, url: str) -> Any:
        resp = await self.client.get(url)
        resp.raise_for_status()
        return resp.json()

    async def get_order(self, order_id: str) -> dict:
        return await self._get(f"{services.order_service_url}/orders/{order_id}")

    async def get_user(self, user_id: str) -> dict:
        return await self._get(f"{services.user_service_url}/users/{user_id}")

    async def get_order_products(self, order: dict) -> list[dict]:
        product_ids = [item["productId"] for item in order["items"]]
        resp = await self.client.post(
            f"{services.product_service_url}/products/batch",
            json={"ids": product_ids},
        )
        resp.raise_for_status()
        return resp.json()

    async def get_order_timeline(self, order_id: str) -> list[dict]:
        # gather with return_exceptions is the asyncio equivalent of Promise.allSettled
        order_events, shipping_events = await asyncio.gather(
            self._get(f"{services.order_service_url}/orders/{order_id}/events"),
            self._get(f"{services.order_service_url}/shipments/order/{order_id}/events"),
            return_exceptions=True,
        )
        events: list[dict] = []
        if not isinstance(order_events, Exception):
            events.extend(order_events)
        if not isinstance(shipping_events, Exception):
            events.extend(shipping_events)
        return sorted(events, key=lambda e: datetime.fromisoformat(e["timestamp"]))

    async def get_order_details(self, order_id: str, user_id: str) -> dict:
        # Start the user lookup right away; the product lookup needs the order first.
        order_task = asyncio.create_task(self.get_order(order_id))
        user_task = asyncio.create_task(self.get_user(user_id))
        try:
            order = await order_task
            # Plain gather is the Promise.all equivalent: the first failure propagates.
            user, products = await asyncio.gather(
                user_task,
                self.get_order_products(order),
            )
        except httpx.HTTPError as exc:
            # Do not leave the user lookup running if the aggregation already failed.
            user_task.cancel()
            raise AggregationError("Failed to aggregate order details") from exc

        quantity_by_product = {i["productId"]: i["quantity"] for i in order["items"]}
        return {
            "order": {
                "id": order["id"],
                "status": order["status"],
                "total": order["total"],
                "created_at": order["createdAt"],
            },
            "customer": {"id": user["id"], "name": user["name"], "email": user["email"]},
            "items": [
                {
                    "product_id": p["id"],
                    "name": p["name"],
                    "price": p["price"],
                    "quantity": quantity_by_product.get(p["id"]),
                }
                for p in products
            ],
            "shipping": order.get("shippingAddress"),
            "timeline": await self.get_order_timeline(order_id),
        }

    async def get_dashboard(self, user_id: str) -> dict:
        # Every panel is optional here, so failures are collected instead of raised.
        orders, cart, recommendations, notifications = await asyncio.gather(
            self._get(f"{services.order_service_url}/users/{user_id}/orders?limit=5"),
            self._get(f"{services.order_service_url}/users/{user_id}/cart"),
            self._get(f"{services.product_service_url}/users/{user_id}/recommendations"),
            self._get(f"{services.notification_service_url}/users/{user_id}/unread"),
            return_exceptions=True,
        )

        def ok(v):
            return v if not isinstance(v, Exception) else None

        return {
            "recent_orders": ok(orders) or [],
            "cart": ok(cart),
            "recommendations": ok(recommendations) or [],
            "unread_notifications": len(ok(notifications) or []),
        }


agg_router = APIRouter()


@agg_router.get("/aggregate/orders/{order_id}")
async def aggregate_order(
    order_id: str,
    user: AuthenticatedUser = Depends(auth_dependency),
) -> dict:
    if user is None:
        raise HTTPException(status_code=401, detail="auth required")
    aggregator = OrderAggregator(get_http_client())
    try:
        return await aggregator.get_order_details(order_id, user.id)
    except AggregationError as exc:
        raise HTTPException(status_code=502, detail=str(exc))


@agg_router.get("/aggregate/dashboard")
async def aggregate_dashboard(
    user: AuthenticatedUser = Depends(auth_dependency),
) -> dict:
    if user is None:
        raise HTTPException(status_code=401, detail="auth required")
    aggregator = OrderAggregator(get_http_client())
    return await aggregator.get_dashboard(user.id)
Caveats & Common Pitfalls: Request aggregation in the gateway
The gateway grows domain knowledge. Every aggregation encodes which services have which fields. After dozens of endpoints, the gateway becomes a de facto service with no clear owner.
Promise.all on partial-failure-tolerant flows. One failed call rejects the whole aggregation, and one slow call delays it. For dashboards where partial data is fine, use Promise.allSettled (or asyncio.gather with return_exceptions=True) and return what you have.
No per-sub-request timeout. One slow downstream drags the total response to whatever the longest call is. Set per-sub-request timeouts so any single slow dependency fails fast.
Fan-out storms. A single aggregated endpoint calls 8 downstreams. Under load, each gateway replica is doing 8x the downstream request volume. Plan capacity accordingly and consider caching.
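The per-sub-request timeout pitfall is easiest to see in code. The sketch below wraps each downstream call in its own `asyncio.wait_for` so one slow dependency cannot hold the whole response; `fan_out` and its parameters are hypothetical names, and each call is passed in as a zero-argument coroutine function.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def fan_out(
    calls: Dict[str, Callable[[], Awaitable[Any]]],
    timeout_s: float,
) -> Dict[str, Any]:
    """Run all calls in parallel; a timed-out or failed call yields None."""

    async def guarded(call: Callable[[], Awaitable[Any]]) -> Any:
        # Each sub-request gets its own deadline rather than sharing one.
        return await asyncio.wait_for(call(), timeout=timeout_s)

    names = list(calls)
    results = await asyncio.gather(
        *(guarded(calls[name]) for name in names),
        return_exceptions=True,  # collect failures instead of raising the first one
    )
    return {
        name: (None if isinstance(result, BaseException) else result)
        for name, result in zip(names, results)
    }
```

Total latency is then bounded by `timeout_s` rather than by the slowest downstream, at the cost of occasionally dropping a panel that would have arrived slightly late.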
Solutions & Patterns: Aggregation lives in a BFF, not the gateway
The clean pattern is a dedicated BFF service that handles aggregation for each client type (mobile, web, internal tools). The gateway remains thin and stateless; the BFF owns the domain coupling. Each BFF uses parallel fan-out with per-sub-request timeouts and allSettled where partial data is acceptable. Cache aggregated responses for short TTLs (5-60 seconds) if the data is not user-specific or if staleness is acceptable.
Decision rule: if the aggregation involves business logic, data shaping, or client-specific views, it belongs in a BFF. If it is genuinely just “fetch these three things in parallel and concatenate,” it could live at the gateway, but be honest: most “simple” aggregations grow over time.
Before: Gateway aggregates 6 services for the dashboard. p99 is 1.2 seconds because the slowest service is always on the critical path. Whenever recommendations is broken, the whole dashboard returns 500.
After: Mobile BFF does the aggregation with 400ms per-sub-request timeouts, allSettled for optional fields (recommendations), and a 30-second cache. p99 drops to 450ms and dashboard never fails outright; at worst it drops the recommendations panel.
Your gateway aggregates data from 6 services for a mobile dashboard. Mobile engineers complain about latency and occasional 500s. Walk through your redesign.
Strong Answer Framework:
Measure. Per-sub-request timing at the gateway; identify the long-tail service and the error rate per service.
Identify the critical vs. non-critical services. Core dashboard data (account balance, recent transactions) must be present; recommendations, promotions, and usage tips can be missing.
Move aggregation to a mobile BFF. Gateway just routes /mobile/dashboard to the BFF; BFF does the fan-out with per-sub-request timeouts (say 300ms each) and allSettled.
For critical services, use Promise.all with tight timeouts. For non-critical, use allSettled and drop failures from the response with a flag like {"recommendations": null, "recommendations_available": false}.
Add short-TTL caching for fan-out results where staleness is acceptable. Even a 30-second cache dramatically reduces downstream load.
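The critical versus non-critical split from the framework can be sketched as two parallel groups: one that fails fast and one that degrades to a null plus an availability flag. The function name, the 300ms default, and the `<name>_available` flag convention are illustrative assumptions that follow the example in the list above.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict

Call = Callable[[], Awaitable[Any]]


class CriticalDependencyError(Exception):
    """A service the dashboard cannot render without has failed."""


async def build_dashboard(
    critical: Dict[str, Call],
    optional: Dict[str, Call],
    timeout_s: float = 0.3,
) -> Dict[str, Any]:
    async def guarded(call: Call) -> Any:
        return await asyncio.wait_for(call(), timeout=timeout_s)

    crit_names, opt_names = list(critical), list(optional)
    # Both groups start running as soon as gather() is called, so they overlap.
    crit_group = asyncio.gather(*(guarded(critical[n]) for n in crit_names))
    opt_group = asyncio.gather(
        *(guarded(optional[n]) for n in opt_names), return_exceptions=True
    )
    try:
        crit_results = await crit_group  # Promise.all semantics: first failure raises
    except Exception as exc:
        opt_group.cancel()  # no point finishing optional work for a failed response
        raise CriticalDependencyError("critical dashboard data unavailable") from exc

    body: Dict[str, Any] = dict(zip(crit_names, crit_results))
    for name, result in zip(opt_names, await opt_group):
        failed = isinstance(result, BaseException)
        body[name] = None if failed else result
        body[f"{name}_available"] = not failed
    return body
```

The explicit flag lets the mobile client distinguish "no recommendations for this user" from "recommendations are temporarily unavailable".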
Real-World Example: SoundCloud introduced the BFF pattern publicly in a 2015 blog post precisely because their iOS and web apps had different aggregation needs and their single API was becoming bloated. Netflix’s device-specific “Edge Services” (one per device family) are a more evolved version; each service knows its client’s needs and shapes responses accordingly.
Senior Follow-up Questions:
Follow-up 1: “When does a BFF become worth the operational overhead?”
When you have multiple client types with meaningfully different needs and shared aggregation hurts each of them (too much data for mobile, too little for web). For a single-client product, a BFF is overkill: one API shaped for that client is simpler.
Follow-up 2: “How do you cache a per-user dashboard without cache explosion?”
Cache the sub-responses by their natural keys (user balance by user ID, recommendations by user ID) with short TTLs, and compose on each request. The composition is cheap; the sub-requests are not. This gets you most of the cache benefit without per-aggregate-response cache entries.
Follow-up 3: “What is the right timeout for each sub-request?”
Calibrate to each service’s p95 or p99 under normal load. Non-critical services can have shorter timeouts since their failure is tolerable. Because the sub-requests run in parallel, the latency budget is set by the largest timeout, not the sum, so the largest per-sub-request timeout should stay well under the total SLA.
Common Wrong Answers:
“Add caching to the gateway.” Fails because it does not address per-user data and still couples aggregation to the gateway.
“Reduce the number of services called.” Fails because each service exists for a reason; the right fix is to fail gracefully, not to remove features.
Further Reading:
Phil Calçado’s “Pattern: Backends for Frontends.”
SoundCloud’s blog post “BFF@SoundCloud.”
Netflix Tech Blog on device-specific edge services.
The Mobile BFF exists because mobile clients are bandwidth-constrained and battery-constrained in ways web clients are not. Sending a 2MB product payload when the user’s phone only needs 50KB to render a list item wastes data (costing the user money on metered plans) and battery (the radio stays active longer). The mobile BFF’s job is to be ruthlessly aggressive about payload size: one thumbnail instead of a full image gallery, a short description instead of full HTML, top-3 reviews instead of all reviews. The tradeoff is that mobile BFFs need their own deployment pipeline and their own on-call rotation — that is real operational cost. If you skip the BFF and have the mobile app call services directly, you end up either over-fetching (slow mobile experience) or exposing internal service URLs and auth (security nightmare). A well-built Mobile BFF can reduce payload by 80-90% for common list endpoints.
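As a rough illustration of that payload trimming, the sketch below shapes a full product record into a mobile list item. The field names (`images`, `thumbnail_url`, `helpful_votes`, and so on) are hypothetical and stand in for whatever the product service actually returns; the description is assumed to be plain text.

```python
from typing import Any, Dict


def to_mobile_list_item(
    product: Dict[str, Any],
    max_desc: int = 120,
    max_reviews: int = 3,
) -> Dict[str, Any]:
    """Keep only what a mobile list row needs (hypothetical field names)."""
    images = product.get("images") or []
    desc = product.get("description", "")
    if len(desc) > max_desc:
        # Truncate and mark the cut; the full text stays on the detail endpoint.
        desc = desc[: max_desc - 3].rstrip() + "..."
    reviews = sorted(
        product.get("reviews") or [],
        key=lambda r: r.get("helpful_votes", 0),
        reverse=True,
    )
    return {
        "id": product["id"],
        "name": product["name"],
        "price": product["price"],
        "thumbnail": images[0].get("thumbnail_url") if images else None,
        "description": desc,
        "top_reviews": reviews[:max_reviews],
    }
```

Everything not listed in the returned dict (full gallery, specifications, remaining reviews) is simply never sent to the phone.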
The Web BFF has the opposite constraints from the Mobile BFF: bandwidth is plentiful, but round trips are perceived as slow (the user stares at a loading spinner). So the Web BFF leans hard into parallel fan-out — one request to the BFF produces five concurrent calls to downstream services, and the aggregated response fills the entire page at once. The tradeoff is that a single slow service slows the entire page render, but that is usually a better experience than a page that appears in pieces over several seconds. The Web BFF also returns richer data: full descriptions, all images, related products, questions, breadcrumbs — everything the web UI needs for SEO and interactivity. Without a Web BFF, you either make the web client orchestrate all these calls (more client complexity, worse perceived performance) or return identical responses to mobile and web (you slow down mobile, or you under-serve web).
BFF-per-team explosion. Every team wants their own BFF; a year later you have 15 BFFs, each a small monolith, each doing 80% the same cross-cutting work. The right granularity is BFF-per-client-type (mobile, web, internal tools), not per-team.
BFFs that reach into each other. A mobile BFF calls a web BFF because “they already have that logic.” Now they are coupled and a web BFF change breaks mobile. BFFs should call services, not other BFFs.
Contract drift between BFF and service. BFFs cache service response shapes in code. When the service evolves, the BFF silently breaks or returns wrong data. Shared contract tests are required, not optional.
BFF as a place to avoid fixing the service. When the underlying service has the wrong shape, teams patch around it in the BFF instead of fixing the service. Over time, the BFF accumulates workarounds and nobody can remove them because clients depend on the patched shape.
Solutions & Patterns: BFF per client type, contract tests, no BFF-to-BFF calls
One BFF per client type: a mobile BFF, a web BFF, a partner-integration BFF. Each BFF calls underlying domain services directly; BFFs do not call each other. Contract tests sit between BFF and service, validating that response shapes match expectations. The BFF owns client-specific concerns (pagination strategy, image URL signing, field selection) but delegates domain logic to services.
Decision rule for whether to add logic to a BFF vs a service: if the logic is client-specific (mobile wants thumbnails, web wants full-size), BFF. If the logic is domain-invariant (charge a card, update an address), service. When in doubt, prefer the service; BFFs are easier to throw away than services.
Before: Mobile team adds a “compute discount” function to their BFF because the order service returns raw pricing. Web team adds the same function to their BFF with slightly different rounding. Customer service sees different totals in different clients.
After: Order service exposes a pricing endpoint that returns fully-computed totals. Both BFFs consume the same endpoint. No client-specific business logic; rounding is authoritative.
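A contract test between a BFF and the pricing endpoint can start very small. The sketch below checks a response against the fields the BFF relies on; the contract itself (`subtotal_cents`, `total_cents`, and so on) is an invented example, and real projects often use a dedicated tool such as a schema validator or consumer-driven contract framework instead.

```python
from typing import Any, Dict, List

# Hypothetical contract: the fields both BFFs depend on, with expected types.
PRICING_CONTRACT: Dict[str, type] = {
    "order_id": str,
    "subtotal_cents": int,
    "discount_cents": int,
    "total_cents": int,
    "currency": str,
}


def contract_violations(payload: Dict[str, Any], contract: Dict[str, type]) -> List[str]:
    """Return human-readable problems; an empty list means the shape matches."""
    problems: List[str] = []
    for field_name, expected in contract.items():
        if field_name not in payload:
            problems.append(f"missing field: {field_name}")
        elif not isinstance(payload[field_name], expected):
            actual = type(payload[field_name]).__name__
            problems.append(f"{field_name}: expected {expected.__name__}, got {actual}")
    # Extra fields are allowed so the service can evolve additively.
    return problems
```

Run against a recorded or staging response in CI, a check like this fails the build when the service renames or retypes a field, instead of the BFF silently returning wrong totals.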
Your team wants to create a new BFF for admin users. The platform team pushes back with 'BFFs are getting out of hand.' How do you decide, and what governance applies?
Strong Answer Framework:
Agree on the concern: BFF proliferation is real and expensive. Each BFF is another deployable, another oncall rotation, another set of cross-cutting infrastructure (auth, logging, tracing) to keep in sync.
Apply the client-type test: is “admin users” a genuinely different client from existing ones? If admins use the same web app with different permissions, no new BFF — extend the existing web BFF. If admins use a dedicated admin console with very different data shapes, yes.
Propose governance: a “BFF registry” with explicit justification for each one, a periodic review of whether existing BFFs should merge, and a shared library for cross-cutting concerns so new BFFs start with auth/logging/tracing baked in.
Define the alternatives: sometimes a BFF is overkill. A dedicated endpoint in the existing web BFF, or a small query service, may be enough. Pick the smallest thing that solves the problem.
Commit to the cost model: if the team wants a new BFF, they own it end-to-end including oncall, upgrades, and eventual deprecation when no longer needed.
Real-World Example: Spotify’s Backstage and “Golden Paths” material talks about curating client-facing services with explicit governance. Atlassian’s 2020 internal platform discussion (published in talks at QCon) described exactly this push-back model: “new BFFs require justification because cross-cutting maintenance compounds.”
Senior Follow-up Questions:
Follow-up 1: “What goes into the shared BFF library?”
Auth header verification, correlation ID propagation, tracing, structured logging, retry/timeout primitives for downstream calls, rate limit header pass-through. Basically everything a BFF has to get right but that is not client-specific.
Follow-up 2: “Should BFFs be public-facing or always behind the gateway?”
Always behind the gateway. The gateway handles TLS, initial JWT validation, and global rate limiting. The BFF assumes verified identity and focuses on composition and shaping.
Follow-up 3: “How do you migrate away from a BFF that has outlived its purpose?”
Redirect its traffic to the primary BFF for that client type, or to direct service calls if the BFF was thin. Monitor until usage drops to zero, then remove. The hardest part is finding consumers, which is why the BFF registry matters.
Common Wrong Answers:
“BFFs are light, just add one whenever you need shaping.” Fails because the operational cost per BFF is significant and proliferation erodes platform consistency.
“Use a single API for every client.” Fails because different clients genuinely have different needs and forcing uniformity hurts each of them.
Further Reading:
Phil Calçado’s “Pattern: Backends for Frontends.”
Spotify Engineering’s “Backstage” and Golden Paths documentation.
Sam Newman’s Building Microservices, 2nd edition, chapter on BFFs.
A circuit breaker exists because cascading failures are the single most destructive failure mode in microservices. When the payment service slows from 50ms to 5 seconds, every gateway request waits 5 seconds, the gateway’s connection pool fills up with pending requests, new requests queue behind them, and within 30 seconds your entire gateway is wedged — even for requests that have nothing to do with payments. A circuit breaker short-circuits this: after N failures, it stops calling the failing service entirely for some cooldown period, returning an immediate error (or a fallback value) and letting the failing service recover without the pressure of retrying traffic. The tradeoff is that a tripped breaker causes some requests to fail that might have succeeded — you are trading “some requests fail immediately” for “no requests succeed because the whole gateway is frozen.” The critical tuning parameters are error threshold (50% is standard), volume threshold (must have enough requests to have statistical signal), and reset timeout (how long to wait before probing again). Without circuit breakers, one slow downstream service takes down your entire fleet in a classic retry-storm pattern.
# gateway/src/middleware/circuit_breaker.py
import asyncio
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Awaitable, Callable

import structlog
from fastapi import APIRouter, HTTPException, Request

logger = structlog.get_logger()


class CircuitState(str, Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


@dataclass
class CircuitStats:
    successes: int = 0
    failures: int = 0
    rejects: int = 0
    fallbacks: int = 0


@dataclass
class CircuitBreaker:
    """Async circuit breaker for an individual downstream service."""

    name: str
    timeout: float = 10.0
    error_threshold_pct: float = 50.0
    reset_timeout: float = 30.0
    volume_threshold: int = 10
    state: CircuitState = CircuitState.CLOSED
    stats: CircuitStats = field(default_factory=CircuitStats)
    _opened_at: float = 0.0
    _lock: asyncio.Lock = field(default_factory=asyncio.Lock)

    async def call(
        self,
        operation: Callable[[], Awaitable[Any]],
        fallback: Callable[[], Awaitable[Any]] | None = None,
    ) -> Any:
        rejected = False
        async with self._lock:
            # Try to transition OPEN -> HALF_OPEN after reset_timeout elapses
            if self.state == CircuitState.OPEN:
                # monotonic() is immune to wall-clock adjustments
                if time.monotonic() - self._opened_at >= self.reset_timeout:
                    self.state = CircuitState.HALF_OPEN
                    logger.info("circuit.half_open", service=self.name)
                else:
                    self.stats.rejects += 1
                    rejected = True

        # Run the fallback outside the lock so a slow fallback cannot block other callers.
        if rejected:
            if fallback is not None:
                self.stats.fallbacks += 1
                return await fallback()
            raise HTTPException(
                status_code=503,
                detail=f"circuit open for {self.name}",
            )

        # Note: while HALF_OPEN, every concurrent caller acts as a probe.
        try:
            result = await asyncio.wait_for(operation(), timeout=self.timeout)
        except Exception:
            await self._record_failure()
            if fallback is not None:
                self.stats.fallbacks += 1
                return await fallback()
            raise
        else:
            await self._record_success()
            return result

    def _trip(self, failure_pct: float) -> None:
        # Caller must hold self._lock.
        self.state = CircuitState.OPEN
        self._opened_at = time.monotonic()
        logger.warning("circuit.opened", service=self.name, pct=failure_pct)

    async def _record_success(self) -> None:
        async with self._lock:
            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.CLOSED
                # Start a fresh window so pre-outage failures do not re-trip the breaker.
                self.stats.successes = 0
                self.stats.failures = 0
                logger.info("circuit.closed", service=self.name)
            self.stats.successes += 1

    async def _record_failure(self) -> None:
        async with self._lock:
            self.stats.failures += 1
            if self.state == CircuitState.HALF_OPEN:
                # A failed probe re-opens immediately, regardless of the percentage.
                self._trip(failure_pct=100.0)
                return
            # Counters are cumulative since the last close; a production breaker
            # would evaluate a rolling time window instead.
            total = self.stats.successes + self.stats.failures
            if total < self.volume_threshold:
                return
            failure_pct = (self.stats.failures / total) * 100
            if failure_pct >= self.error_threshold_pct:
                self._trip(failure_pct=failure_pct)


class GatewayCircuitBreaker:
    def __init__(self) -> None:
        self._breakers: dict[str, CircuitBreaker] = {}

    def get(self, service_name: str, **options) -> CircuitBreaker:
        if service_name not in self._breakers:
            self._breakers[service_name] = CircuitBreaker(name=service_name, **options)
        return self._breakers[service_name]

    def status(self) -> dict:
        return {
            name: {"state": br.state.value, "stats": br.stats.__dict__}
            for name, br in self._breakers.items()
        }


circuit_breaker = GatewayCircuitBreaker()
router = APIRouter()


@router.get("/api/users/{user_id}")
async def get_user(user_id: str, request: Request) -> dict:
    from gateway.routes import services, get_http_client

    client = get_http_client()
    breaker = circuit_breaker.get("user-service")

    async def call_user_service() -> dict:
        resp = await client.get(f"{services.user_service_url}/users/{user_id}")
        resp.raise_for_status()
        return resp.json()

    async def fallback() -> dict:
        # Placeholder response; the _fallback flag lets clients detect degraded data.
        return {"id": user_id, "name": "Unknown User", "_fallback": True}

    return await breaker.call(call_user_service, fallback=fallback)


@router.get("/health/circuits")
async def circuits_status() -> dict:
    return circuit_breaker.status()
Caveats & Common Pitfalls: Circuit breakers in the gateway
One breaker per upstream, shared across all endpoints. A breaker that trips on /admin/heavy trips for every endpoint on that service, taking down read endpoints because of a write endpoint’s problem. Breakers should be per (service, operation-class).
Volume threshold too low. A breaker that trips after 5 failures on a low-traffic endpoint is noisy; one failed canary run trips prod. Require a statistically significant sample before tripping.
No fallback. Breaker opens, gateway returns 503 to every client. For some endpoints, a cached or default response is infinitely better than a hard failure.
Breaker state not observable. When a breaker is open, you want to know immediately. Without metrics per (service, state) and alerts on “breaker open for over 5 minutes,” breakers trip silently and users see failures for longer than needed.
Solutions & Patterns: Scoped breakers, fallbacks, full observability
Run breakers at the per-operation level (service + path pattern), not service-global. Volume threshold of at least 20 requests in the window before evaluating error rate. Tune error rate thresholds to the endpoint’s tolerance: 50% is fine for most, lower for read-heavy endpoints where any error is unusual. Always define a fallback: cached response, default response, or honest 503 with Retry-After. Emit breaker state as a metric with labels (service, operation, state) and alert on open-for-too-long.
Decision rule for fallback vs failure: if the data is user-visible and staleness is acceptable (product list, recommendations), use a cached fallback. If the data is authoritative and stale data would be wrong (account balance, inventory count), fail honestly with 503.
Before: Gateway has one breaker per downstream service. Payment service has a bug in one endpoint; breaker trips; all payment endpoints fail; checkout is fully down even though most payment endpoints are healthy.
After: Breakers scoped to (service, operation-class). Only the broken endpoint trips. Checkout still works via other endpoints. Oncall sees the specific alert “payment-service:POST /authorize breaker open.”
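One way to scope breakers to (service, operation-class) is to derive the breaker name from the method and a normalized path, so that `/orders/123` and `/orders/456` share a breaker while `/authorize` gets its own. The normalization rule below (numeric and UUID-like segments collapse to `:id`) is an assumption for illustration; a real gateway would more likely use its route templates.

```python
import re

# Segments that look like identifiers: all digits, or a UUID-shaped token.
_ID_SEGMENT = re.compile(r"\d+|[0-9a-fA-F]{8}-[0-9a-fA-F-]{27}")


def breaker_key(service: str, method: str, path: str) -> str:
    """Build a per-operation breaker name such as 'payment-service:POST /authorize'."""
    segments = [
        ":id" if _ID_SEGMENT.fullmatch(segment) else segment
        for segment in path.strip("/").split("/")
        if segment
    ]
    return f"{service}:{method.upper()} /{'/'.join(segments)}"
```

The resulting string can be passed to `circuit_breaker.get(...)` from the example above in place of the bare service name, and doubles as the label in the "breaker open" alert.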
A gateway circuit breaker trips on every deploy of a downstream service because the first 50 requests on the new instance fail while it warms up. How do you fix this without removing the breaker?
Strong Answer Framework:
Identify the real problem: the breaker is working correctly; the downstream is presenting unhealthy responses during warm-up. Fixing the breaker is the wrong layer.
Short-term: raise the volume threshold so small initial failures do not trip, or add a warm-up delay to the breaker’s evaluation window.
Real fix: the downstream needs proper readiness checks. Kubernetes should route traffic only after readinessProbe succeeds, and the probe should exercise the full critical path, not just return 200 from a static endpoint.
For even warmer deploys, use prefetch / JIT warmup in the service: on startup, hit caches, compile regexes, warm connection pools before declaring readiness.
Monitor “newly-deployed-pod error rate” as a separate metric from steady-state error rate. If the two diverge significantly, the deploy process itself needs work.
Real-World Example: Netflix’s internal deploy tooling (Spinnaker plus their deployment strategies) explicitly handles pod warm-up with staged traffic ramping. LinkedIn’s Rest.li service framework uses connection warmup to avoid exactly this pattern.
Senior Follow-up Questions:
Follow-up 1: “Why do brand new instances fail requests?”
JIT compilation, lazy connection pool initialization, cold cache, unwarmed DNS. A brand new JVM service may process the first 100 requests 10x slower than steady state because the JIT has not kicked in. Database connection pools acquire connections on first use.
Follow-up 2: “How do readiness probes interact with deploy rollouts?”
Kubernetes only sends traffic to pods that pass readiness. A well-designed probe tests the actual critical path (hitting the database, verifying cache, verifying downstream connectivity), not just an empty HTTP handler. This keeps bad pods out of rotation until they are actually ready.
Follow-up 3: “What is traffic ramping / canary deploy?”
Send a small percentage (say 1%) of traffic to the new version, monitor error rates and latency, and ramp up on success. If the new version has issues, the rollback affects only 1% of users. This can be implemented directly in the gateway via weighted routing.
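Weighted routing for a canary can be sketched with a stable hash, so the same user consistently lands on the same version while the percentage ramps. The function name, the `v1`/`v2` labels, and the choice of SHA-256 are assumptions for illustration; managed gateways usually expose this as configuration rather than code.

```python
import hashlib


def pick_upstream(
    request_key: str,
    canary_percent: float,
    stable: str = "v1",
    canary: str = "v2",
) -> str:
    """Route roughly `canary_percent` of keys to the canary, deterministically."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    # A key's bucket never changes, so raising the percentage only adds users
    # to the canary; nobody flips back and forth between versions.
    return canary if bucket < canary_percent * 100 else stable
```

Keying on user ID rather than on each request keeps a single user's session on one version, which makes canary error rates easier to attribute.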
Common Wrong Answers:
“Disable the breaker during deploys.” Fails because it removes the safety net exactly when it is most needed.
“Lower the breaker sensitivity.” Fails because now the breaker does not trip on real failures either, defeating its purpose.
Further Reading:
Kubernetes documentation on readiness and startup probes.
Netflix Tech Blog on canary analysis and deploy strategies.
Resilience4j docs on circuit-breaker configuration for warm-up.
Declarative YAML works well for static deployments, but mature platforms need to configure Kong dynamically — onboarding a new tenant should not require a redeploy. The Kong Admin API on port 8001 lets you register services, routes, consumers, and plugins via HTTP. The tradeoff is that dynamic configuration means your “source of truth” is now in Kong’s database instead of git, so you need to either sync back to git or accept that operational drift will happen. Use this approach when you have self-service onboarding; use the declarative file otherwise.
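A tenant-onboarding flow against the Admin API might look like the sketch below. It assumes the Admin API is reachable at `http://localhost:8001` and uses the `/services`, `/services/{name}/routes`, and `/services/{name}/plugins` endpoints with the bundled `rate-limiting` plugin; the naming scheme (`<tenant>-api`, `/<tenant>` path) and the limit of 600 per minute are invented for the example, and payload fields should be checked against the Kong version in use.

```python
import json
import urllib.request
from typing import Dict, List, Tuple

ADMIN_URL = "http://localhost:8001"  # assumption: default local Admin API address

AdminRequest = Tuple[str, str, Dict]


def onboarding_requests(tenant: str, upstream_url: str) -> List[AdminRequest]:
    """Build the Admin API calls needed to expose one tenant (hypothetical naming)."""
    service_name = f"{tenant}-api"
    return [
        ("POST", "/services", {"name": service_name, "url": upstream_url}),
        (
            "POST",
            f"/services/{service_name}/routes",
            {"name": f"{tenant}-route", "paths": [f"/{tenant}"]},
        ),
        (
            "POST",
            f"/services/{service_name}/plugins",
            {"name": "rate-limiting", "config": {"minute": 600, "policy": "local"}},
        ),
    ]


def send(method: str, path: str, body: Dict, admin_url: str = ADMIN_URL) -> int:
    """Send one request to the Admin API and return the HTTP status code."""
    request = urllib.request.Request(
        admin_url + path,
        data=json.dumps(body).encode("utf-8"),
        method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status
```

Separating "build the requests" from "send them" makes it straightforward to log or commit the same payloads to git, which is one way to limit the drift described above.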
Caveats & Common Pitfalls: Managed gateways (Kong, AWS API Gateway, Apigee)
Plugin sprawl. Every team enables “just one more” plugin for their use case. Latency creeps up. Nobody owns the plugin config or knows which plugins are still necessary.
Vendor lock-in through custom plugins. Custom Lua plugins in Kong or Velocity templates in Apigee are hard to port. When you decide to switch vendors (or move to Envoy), you rewrite everything.
Gateway bypass in tests. Integration tests hit services directly because the gateway is “hard to run locally.” Tests pass; production has gateway-specific bugs no test caught. Always test through the gateway for at least smoke tests.
Per-route config drift. Hundreds of routes accumulate with subtly different plugin configs. Nobody knows why /api/orders has a 30s timeout and /api/orders/cancel has a 10s timeout. Standardize defaults and justify exceptions.
Solutions & Patterns: Declarative config, CI validation, standardized defaults
Every gateway configuration lives in version control as declarative YAML or similar. A CI pipeline validates syntax, enforces standards (all routes must have timeouts, all public routes must require auth), and deploys on merge. Exceptions to defaults require explicit justification in the PR. Run the gateway locally in docker-compose for development and CI so “gateway is hard to run locally” stops being an excuse.
Decision rule for custom plugins: only when the generic plugin catalog is genuinely insufficient. Custom plugins are a last resort; before writing one, ask if the logic belongs in a BFF or a service instead.
Before: Kong config drifts in the admin UI. Three different teams have modified it this quarter. Nobody can reproduce prod locally. A change to a rate limit for one service silently affects another because shared plugin state was not recognized.
After: All Kong config is a kong.yml file in git. CI validates and deploys. Developers run docker-compose up to get a local Kong that matches prod exactly. Changes are reviewed like code.
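A CI check for the two standards mentioned above (timeouts everywhere, auth on public routes) could look roughly like the following sketch. It assumes the `kong.yml` file has already been parsed into a dictionary (for example with a YAML library), that timeouts are set explicitly on each service, and that non-public routes carry an `internal` tag; the tag convention and the choice of which auth plugins count are illustrative policy decisions, not something Kong enforces.

```python
# Auth plugins accepted by this hypothetical policy (a subset of Kong's bundled plugins).
AUTH_PLUGINS = {"jwt", "key-auth", "oauth2", "basic-auth"}
# Service-level timeout fields in Kong's declarative format.
TIMEOUT_FIELDS = ("connect_timeout", "read_timeout", "write_timeout")


def plugin_names(entity: dict) -> set[str]:
    """Names of the plugins attached directly to a config entity."""
    return {plugin["name"] for plugin in entity.get("plugins", [])}


def validate(config: dict) -> list[str]:
    """Return policy violations; an empty list means the config passes."""
    errors = []
    global_auth = bool(plugin_names(config) & AUTH_PLUGINS)
    for service in config.get("services", []):
        name = service.get("name", "<unnamed>")
        # Rule 1: every service must state its timeouts explicitly.
        missing = [field for field in TIMEOUT_FIELDS if field not in service]
        if missing:
            errors.append(f"service {name}: missing {', '.join(missing)}")
        service_auth = bool(plugin_names(service) & AUTH_PLUGINS)
        for route in service.get("routes", []):
            # Assumed convention: routes tagged "internal" are not public.
            if "internal" in route.get("tags", []):
                continue
            # Rule 2: public routes need auth at global, service, or route level.
            route_auth = bool(plugin_names(route) & AUTH_PLUGINS)
            if not (global_auth or service_auth or route_auth):
                route_name = route.get("name", "<unnamed>")
                errors.append(f"route {route_name}: public route without an auth plugin")
    return errors
```

In a pipeline, the job would fail the build whenever `validate(...)` returns a non-empty list, and the messages would point the PR author at the route or service that needs either a fix or a justified exception.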
'Your API Gateway is adding 50ms of latency to every request. The P99 is 200ms. Product says this is unacceptable for the checkout flow. How do you optimize it?'
Strong Answer:
50ms at the gateway is high: a well-tuned gateway should add 1-5ms for simple proxying. I would profile where that 50ms is spent. Common culprits, in order of likelihood:
First, JWT validation on every request. If the gateway is making a network call to an auth service to validate tokens, that alone can cost 20-30ms. The fix is switching to local JWT validation using a cached public key. The gateway verifies the signature locally in microseconds instead of making a round-trip. Key rotation is handled by refreshing the public key every few minutes.
Second, rate limiter hitting Redis. If every request does a Redis INCR for rate limiting, and Redis is in a different availability zone, you are paying 5-10ms per request for rate checking. Fix: use a local in-memory rate limiter (like a token bucket per IP) for the first layer, and only fall back to Redis for distributed rate limiting on endpoints that need it. Most checkout requests from authenticated users do not need per-request Redis calls.
Third, request aggregation happening at the gateway level. If the gateway is calling multiple downstream services to assemble a response, that is architectural, not tunable. Move aggregation logic into a BFF service or into the client, and have the gateway do pure proxying only.
Fourth, missing connection reuse. If the gateway opens a new connection to downstream services on every request instead of reusing keep-alive connections, every request pays for a TCP handshake, plus a TLS handshake when upstream traffic is encrypted. Fix: configure upstream connection pooling with keep-alive.
For the checkout flow specifically, I would also implement a “fast path” that bypasses non-essential middleware (detailed request logging, analytics collection) and only runs the minimum required middleware (auth, routing). This is a common pattern at Stripe and Amazon: critical paths get special treatment.
Follow-up: “Should the API Gateway be the place where you aggregate data from multiple services?”
For simple aggregation (combining user + order data for a dashboard), yes, because it reduces client round trips. But it should be thin aggregation (parallel fetch + merge), not business logic. The moment you add conditional logic, transformation rules, or error handling that is specific to a business domain, that aggregation belongs in a BFF service or a dedicated orchestration service, not the gateway. The gateway should remain a commodity infrastructure component that any team can configure, not a place where business logic accumulates.
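The first fix in the answer above, validating JWTs locally against a cached public key, depends on a small cache that refetches keys every few minutes. The sketch below shows only that caching piece; the `PublicKeyCache` name, the `fetch_keys` callback, and the five-minute TTL are assumptions for illustration, and the actual signature check would be done by whatever JWT library the gateway uses.

```python
import time
from typing import Callable, Dict, Optional


class PublicKeyCache:
    """Keeps signing keys in memory and refetches them once the TTL expires."""

    def __init__(
        self,
        fetch_keys: Callable[[], Dict[str, str]],  # hypothetical: returns {key_id: public_key}
        ttl_seconds: float = 300.0,                # assumption: refresh every 5 minutes
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch_keys = fetch_keys
        self._ttl = ttl_seconds
        self._clock = clock
        self._keys: Dict[str, str] = {}
        self._fetched_at: Optional[float] = None

    def get(self, key_id: str) -> str:
        """Return the key for key_id, hitting the network only when the cache is stale."""
        now = self._clock()
        if self._fetched_at is None or now - self._fetched_at >= self._ttl:
            self._keys = self._fetch_keys()  # the only network call on this path
            self._fetched_at = now
        # A KeyError here means the token names an unknown key: reject the request.
        return self._keys[key_id]
```

With a cache like this, the per-request cost is a dictionary lookup plus local signature verification, and the auth service is contacted once per TTL window instead of once per request.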
'You have a mobile app, a web app, and an admin dashboard. Should you use one API Gateway or multiple? What about the BFF pattern?'
Strong Answer:
One gateway for routing and cross-cutting concerns (auth, rate limiting, TLS termination), but separate BFF services per client type for response shaping. This is the pattern Netflix and Airbnb use.
The single gateway handles the commodity work: JWT validation, CORS, rate limiting, request logging, and routing to the correct service. It does not know or care whether the request came from mobile or web.
Behind the gateway, each client type has its own BFF: Mobile BFF, Web BFF, Admin BFF. The Mobile BFF returns slim responses (small payloads, fewer images, paginated data). The Web BFF returns rich responses (full product details, reviews, recommendations in one call). The Admin BFF returns detailed internal data (audit logs, system metrics, user activity) that would never be exposed to customers.
Why not just use one BFF? Because the needs diverge rapidly. Mobile needs to minimize payload size for bandwidth. Web needs to minimize round trips for perceived speed. Admin needs completeness and accuracy over speed. When these are in one codebase, every change becomes a negotiation between three teams with different priorities. Separate BFFs let each team iterate independently.
The cost is code duplication: all three BFFs call the same downstream services. I mitigate this with shared client libraries for the service calls, but the response shaping logic is intentionally duplicated because that is where the differentiation lives.
Follow-up: “What happens when a downstream service fails? Should the BFF handle the fallback or the gateway?”
The gateway handles infrastructure-level failures: circuit breaking to prevent cascade failures, returning 503 when a service is completely down. The BFF handles business-level fallbacks: if the recommendation service is down, the web BFF shows “Popular Products” from a cache instead of personalized recommendations. The mobile BFF might skip the recommendations section entirely to reduce payload. These are product decisions, not infrastructure decisions, so they belong in the BFF where the product team has ownership.
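To make the business-level fallback concrete, here is a minimal sketch of how the web and mobile BFFs might react differently to the same recommendation-service failure. The function names, the response shapes, the five-item mobile cap, and the hard-coded `POPULAR_PRODUCTS` list (standing in for a real cache) are all hypothetical.

```python
from typing import Callable, List, Optional

# Stand-in for a cached "popular products" list; a real BFF would read a cache.
POPULAR_PRODUCTS = ["p-100", "p-200", "p-300"]

Fetcher = Callable[[str], List[str]]  # hypothetical recommendation-service client


def recommendations_or_none(fetch: Fetcher, user_id: str) -> Optional[List[str]]:
    """Call the recommendation service; return None if it is unreachable."""
    try:
        return fetch(user_id)
    except (ConnectionError, TimeoutError):
        return None


def web_home(user_id: str, fetch: Fetcher) -> dict:
    """Web BFF: always render a product section, falling back to popular items."""
    recs = recommendations_or_none(fetch, user_id)
    if recs is None:
        return {"section_title": "Popular Products", "products": POPULAR_PRODUCTS}
    return {"section_title": "Recommended for you", "products": recs}


def mobile_home(user_id: str, fetch: Fetcher) -> dict:
    """Mobile BFF: drop the section entirely on failure to keep the payload small."""
    page = {"user_id": user_id}
    recs = recommendations_or_none(fetch, user_id)
    if recs is not None:
        page["recommendations"] = recs[:5]  # assumed cap for a slim payload
    return page
```

Both handlers see the same failure, but each makes its own product decision about what the user sees, which is the reason this logic sits in the BFFs rather than in the gateway.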
'How do you implement rate limiting in a distributed system where the API Gateway runs as multiple instances behind a load balancer?'
Strong Answer:
The fundamental challenge is that per-user rate limits need to be consistent across gateway instances. If a user sends 100 requests and you have 4 gateway instances, each instance sees 25 requests and thinks the user is within a 50-request limit.
The standard solution is a centralized counter in Redis. Each gateway instance does an atomic INCR on a key like ratelimit:user:123:minute:2024-01-15T10:30 with a TTL equal to the window size. Redis handles the concurrency. This adds 1-2ms per request for the Redis round-trip, which is acceptable for most use cases.
For higher performance, I use a two-tier approach. The first tier is a local in-memory token bucket per user. It handles burst absorption and catches obvious abuse without any network call. The second tier is Redis for accurate distributed counting. The local tier allows up to 80% of the limit, and the remaining 20% is checked against Redis. This means a user who distributes requests perfectly across all instances can theoretically exceed their limit (around 120% in this setup; the exact overshoot depends on the number of instances and how the local allowance is divided among them), but that is an acceptable trade-off for removing Redis from the hot path of 80% of requests.
For extreme scale (millions of requests per second), I have seen teams use sliding window algorithms with Redis sorted sets, or even move rate limiting into a service mesh sidecar (Envoy) with shared state via a control plane. But for most systems, the simple Redis counter with fixed windows is sufficient and much easier to operate.
The gotcha that catches teams: clock skew between instances. If instance A thinks it is 10:30:59 and instance B thinks it is 10:31:01, they use different window keys and the limit doubles for that second. Use NTP synchronization and make the window slightly longer than the stated interval to absorb clock drift.
Follow-up: “A customer is complaining they are being rate limited even though they are well within their plan limits. How do you debug this?”
I would check three things. First, are they using multiple API keys? Rate limits are typically counted per key, so the total usage the customer compares against their plan can differ from what the limiter sees on each individual key. Second, check if retry storms are inflating their request count: their client library might be retrying failed requests aggressively, with little or no backoff, doubling or tripling their effective request rate. Third, check for shared IP rate limiting: if they are behind a corporate NAT, all employees share one IP, and the IP-based rate limit (separate from user-based) might be the one triggering. The fix depends on the cause: consolidate keys, fix retry logic, or switch from IP-based to authenticated user-based rate limiting.
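The fixed-window counter from the answer above can be sketched as follows. To keep the example runnable without a Redis server, it uses a small in-memory `InMemoryStore` as a stand-in; that class, `FixedWindowLimiter`, and `simulate` are hypothetical names, and in a real deployment the store would be a Redis client exposing equivalent `incr` and `expire` operations.

```python
import time
from datetime import datetime, timezone
from typing import Optional


class InMemoryStore:
    """Stand-in for Redis: counts per key, with expire as a no-op."""

    def __init__(self) -> None:
        self.counts = {}

    def incr(self, key: str) -> int:
        self.counts[key] = self.counts.get(key, 0) + 1
        return self.counts[key]

    def expire(self, key: str, seconds: int) -> None:
        pass  # a real Redis would delete the key after `seconds`


class FixedWindowLimiter:
    """One gateway instance's view of a per-user, per-minute limit."""

    def __init__(self, store, limit: int) -> None:
        self.store = store
        self.limit = limit

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        # Window key at minute granularity, e.g. ...:minute:2024-01-15T10:30
        stamp = datetime.fromtimestamp(now, tz=timezone.utc).strftime("%Y-%m-%dT%H:%M")
        key = f"ratelimit:user:{user_id}:minute:{stamp}"
        count = self.store.incr(key)
        if count == 1:
            # First hit in this window: set the TTL so old windows clean themselves up.
            self.store.expire(key, 60)
        return count <= self.limit


def simulate(shared_store: bool, instances: int = 4, requests: int = 100, limit: int = 50) -> int:
    """Round-robin requests across instances; return how many were allowed."""
    shared = InMemoryStore()
    limiters = [
        FixedWindowLimiter(shared if shared_store else InMemoryStore(), limit)
        for _ in range(instances)
    ]
    now = 1705314600.0  # fixed timestamp so every request lands in the same window
    return sum(limiters[i % instances].allow("123", now) for i in range(requests))
```

Running `simulate(shared_store=False)` reproduces the problem from the first paragraph (each instance counts only its own 25 requests, so all 100 pass), while `simulate(shared_store=True)` caps the user at 50. Note that issuing INCR and EXPIRE as two separate commands leaves a small gap in which a crash could leave a key without a TTL; production setups often close that gap with a script or transaction.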