GraphQL Federation
GraphQL Federation enables you to compose a unified API from multiple GraphQL services, which makes it a natural fit for microservices architectures. The core problem it solves: in a REST microservices world, the client has to make N requests to N services and stitch the data together itself. With federation, a single GraphQL query like “get this user with their orders and each order’s products” gets decomposed by the gateway into efficient parallel requests to the right services. The client sees one API; behind the scenes, the gateway is orchestrating across your entire service fleet.
- Understand GraphQL Federation architecture
- Implement Apollo Federation with Node.js and Python
- Master schema composition and entities
- Build federated queries and mutations
- Handle authentication and authorization
Why GraphQL Federation?
REST APIs force a painful choice on microservice architectures. Either you build a single “mega endpoint” that returns everything (ballooning the payload and coupling services), or you force the client to make a chain of N requests across services and stitch results together. Neither scales. REST’s resource-oriented design fits a single service beautifully, but the moment data lives across service boundaries, the client becomes a de facto orchestrator, a job it was never designed for. Mobile clients on flaky networks pay the worst price: every extra round trip is a new opportunity for latency, timeouts, and partial failures.
GraphQL fixed the client-side problem (ask for exactly what you want, get it in one query) but early approaches to multi-service GraphQL were rough. Schema stitching, the predecessor to federation, asked a central gateway to import every subgraph’s full schema and manually declare how types link together. This worked for small teams but collapsed under real-world pressure: every schema change required redeploying the gateway, type conflicts between services had to be resolved by hand, and there was no standard way to describe cross-service relationships. Federation codified these concerns (entities, keys, reference resolvers) into a standard so that subgraphs can evolve independently while still composing into a coherent API.
The tradeoff is explicit: you’re accepting gateway complexity, query-planning overhead, and a steeper operational learning curve in exchange for client simplicity, parallel subgraph execution, and a single typed schema across your entire fleet. For small systems (under ~5 services, under ~20 engineers) this tradeoff is rarely worth it: a REST BFF or a single GraphQL service will serve you better. For larger organizations with independent frontend teams and dozens of services, federation pays for itself within months.
Federation Architecture
The heart of federation is a directed conversation between the router (or gateway) and a set of subgraphs. When a client query arrives at the router, the router consults its composed supergraph schema, produces a query plan (essentially a DAG of subgraph calls with data dependencies), and executes that plan, fetching from multiple subgraphs in parallel where it can, and sequentially where one subgraph’s output feeds another’s input. Subgraphs never talk to each other directly. This indirection is crucial: it’s what lets subgraphs evolve independently and what lets the router enforce global concerns like tracing, rate limiting, and authentication in one place.
Subgraphs communicate with the router via standard GraphQL over HTTP, but with a twist: each subgraph exposes a special _entities query field the router uses to resolve cross-subgraph references. When the router needs a User that was mentioned by the Orders subgraph, it sends the key fields ({ __typename: "User", id: "123" }) to the Users subgraph’s _entities field, which invokes a reference resolver to hydrate the full object. This is the mechanism that makes the whole thing feel seamless to clients despite being a distributed orchestration under the hood.
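To make the _entities mechanism concrete, here is a minimal, library-free Python sketch of what a subgraph does when the router sends it a list of entity representations. The USERS table, the resolver registry, and the function names are assumptions made for illustration; a real subgraph would let its federation library (Apollo Server or strawberry) perform this dispatch.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical in-memory store standing in for the Users subgraph's database.
USERS: Dict[str, Dict[str, Any]] = {
    "123": {"id": "123", "name": "Ada", "email": "ada@example.com"},
}

Representation = Dict[str, Any]
ReferenceResolver = Callable[[Representation], Optional[Dict[str, Any]]]


def resolve_user_reference(rep: Representation) -> Optional[Dict[str, Any]]:
    """Reference resolver: given only the key fields, return the full User."""
    user = USERS.get(rep["id"])
    return {"__typename": "User", **user} if user else None


# One reference resolver per entity type this subgraph owns.
REFERENCE_RESOLVERS: Dict[str, ReferenceResolver] = {
    "User": resolve_user_reference,
}


def resolve_entities(
    representations: List[Representation],
) -> List[Optional[Dict[str, Any]]]:
    """Model of the _entities field: route each representation by __typename."""
    results = []
    for rep in representations:
        resolver = REFERENCE_RESOLVERS.get(rep["__typename"])
        if resolver is None:
            raise ValueError(f"Unknown entity type: {rep['__typename']}")
        # Results keep the order of the incoming representations.
        results.append(resolver(rep))
    return results
```

Calling resolve_entities([{"__typename": "User", "id": "123"}]) returns the hydrated user, while an unknown id yields None in the same position, so the router can line results up with the representations it sent.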
Implementing Subgraphs
A subgraph is just a normal GraphQL service with two extras: it speaks the federation directive dialect (@key, @external, @requires, @provides), and it implements reference resolvers for any entity it owns. “Owning” an entity means being the authoritative source for that type — you declare a @key on it, and you provide a __resolveReference (Node.js) or resolve_reference (Python/strawberry) function that, given just the key fields, returns the full object. Other subgraphs can then “extend” your entity by adding fields of their own without touching your code.
The Python and Node.js patterns differ in surface syntax but map one-to-one conceptually. In Node.js with Apollo Server, you write SDL as a template string and wire up a resolver map. In Python with strawberry-graphql, you use class decorators (@strawberry.federation.type(keys=["id"])) and methods — strawberry introspects your Python types to produce the schema. Both approaches give you the same federation capabilities; the choice is mostly about which language ecosystem your team already lives in.
Users Subgraph
The Users subgraph is the simplest case: one entity (User) with no external dependencies. It owns the User type via @key(fields: "id") and provides the reference resolver. Other subgraphs — Orders, Reviews — will extend User to add their fields, but they’ll always call back here to get the core user data.
Orders Subgraph
The Orders subgraph demonstrates two key federation patterns: it owns the Order entity, and it extends the User entity with an orders field. This is how federation distributes schema ownership without sacrificing a unified API. The Users team doesn’t need to know Orders exists; they just own the User type. The Orders team adds order-related capabilities to User by declaring extend type User (in SDL) or by passing extend=True to strawberry’s federation type decorator. Composition merges these definitions into one supergraph User type.
Notice also how the Orders subgraph returns a reference stub ({ __typename: 'User', id: order.userId }) rather than a full User object. The subgraph only knows the user’s ID — it doesn’t need to fetch the user. The router sees this stub in the response and, if the client asked for User fields like name or email, calls back to the Users subgraph’s reference resolver to fill them in. This lazy hydration is what keeps federation fast: data is fetched only when the query demands it.
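The lazy hydration described above can be modeled in a few lines of plain Python. This is a simplified sketch, not router code: the ORDERS and USERS tables, the function names, and the call log are hypothetical, and a real router makes this decision from its query plan rather than from a set of field names.

```python
from typing import Any, Dict, List, Set

# Hypothetical data owned by each subgraph.
ORDERS = {"o1": {"id": "o1", "total": 42.0, "userId": "123"}}
USERS = {"123": {"id": "123", "name": "Ada", "email": "ada@example.com"}}

# Records each simulated call to the Users subgraph, for illustration only.
users_subgraph_calls: List[str] = []


def orders_get_order(order_id: str) -> Dict[str, Any]:
    """Orders subgraph: return the order with a User stub, not a full User."""
    order = ORDERS[order_id]
    return {
        "id": order["id"],
        "total": order["total"],
        "user": {"__typename": "User", "id": order["userId"]},
    }


def users_resolve_reference(stub: Dict[str, Any]) -> Dict[str, Any]:
    """Users subgraph: hydrate a stub into the full User."""
    users_subgraph_calls.append(stub["id"])
    return {"__typename": "User", **USERS[stub["id"]]}


def router_resolve_order(order_id: str, user_fields: Set[str]) -> Dict[str, Any]:
    """Router model: call Users only when non-key fields were requested."""
    order = orders_get_order(order_id)
    key_fields = {"id"}
    if user_fields - key_fields:
        full_user = users_resolve_reference(order["user"])
        order["user"] = {f: full_user[f] for f in user_fields}
    else:
        # The stub already carries everything the client asked for.
        order["user"] = {f: order["user"][f] for f in user_fields}
    return order
```

Requesting only user { id } never touches the Users subgraph; adding name to the selection triggers exactly one reference-resolver call.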
Products Subgraph
Products is the “owner” of the Product entity and also demonstrates @shareable, a directive that lets multiple subgraphs legitimately define the same field (e.g., averageRating might be computed by both Products and a dedicated Reviews subgraph). Without @shareable, composition would fail with a duplicate-field error. Use it sparingly: it’s usually a sign that ownership boundaries are fuzzy, and fuzzy boundaries lead to drifting implementations.
Scenario: A subgraph team wants to rename a field `productName` to `title` on the `Product` type. That field is used by 200 client queries across 4 frontend apps. How do you manage the migration without breaking production?
- Never rename in place. Add the new field as an alias alongside the old one. The schema now has productName: String! and title: String!, both resolving from the same underlying data.
- Mark the old field @deprecated, with a clear reason: @deprecated(reason: "Use 'title' instead. Removal planned 2026-Q3."). Apollo Studio surfaces deprecations to every client; this is the primary notification channel.
- Track usage per client. Apollo Studio reports field-level usage by client name. You know which of the 200 queries are hitting the deprecated field and which clients issue them. Contact those teams directly.
- Migrate clients one at a time. Frontend team A replaces productName with title in their queries. Deploy. Verify usage of productName from client A drops to zero in Apollo Studio. Repeat for teams B, C, D.
- Wait for usage to hit zero. All clients migrated. Old field has zero usage for two consecutive weeks. Now you can plan removal.
- Remove with a breaking-change gate. The PR removing productName requires approval from the schema registry’s composition check (which passes because no query references it anymore) plus sign-off from the platform team. Deploy during a low-traffic window.
- Keep a rollback path. If removal causes unexpected failures (a client you didn’t know about, a server-side script, a forgotten scheduled report), the field can be re-added in under an hour. Don’t truncate the underlying data column; leave it readable.
Asset.title to Asset.displayTitle (different semantics: title was the internal name, displayTitle was the localized customer-facing name). The migration took 14 months because of the long tail of internal tools and batch jobs. They used Apollo Studio to identify each consuming service, filed tickets to each team, and kept the deprecated field in place until every consumer had migrated. The critical discipline was “never rush removal”: a single forgotten cron job that ran monthly would have caused an incident if the field disappeared before it migrated.
Senior Follow-up Questions
productName and title should return slightly different values because of a business rule change?
Strong Answer: This is no longer a rename; it’s a semantic split. Make it explicit: productName keeps its original behavior (maybe adding an @deprecated(reason: "Use 'title' for the new display logic")), and title implements the new rule. Document the difference in the schema description field. Clients deciding which to use read the descriptions and choose deliberately. Don’t silently migrate semantics; that produces bugs only noticed after finance reports diverge.
@requiresExplicitAck directive that forces clients to opt in to using deprecated fields via a query hint.
- “We’ll coordinate a synchronized big-bang rename.” Never works. Someone is on vacation, a client is forgotten, a legacy system has a hardcoded query. One team that rolls back blocks the whole migration. Deprecations are asynchronous on purpose.
- “The client teams should migrate first, then we rename.” Client teams won’t migrate without motivation. Deprecation with a dated removal plan creates the motivation; pure “please migrate” emails get ignored.
- Netflix Engineering, “GraphQL at Netflix Studio” (2021 Summit talk) — the exact migration pattern.
- Apollo Docs, “Schema change management” — official tooling for the workflow.
- Lee Byron, “GraphQL at Facebook” (2015) — origin story of schema evolution principles.
Scenario: You want to enable a new subgraph team, but the existing three subgraphs each use slightly different naming conventions (camelCase vs snake_case, `createdAt` vs `created_timestamp`). How do you handle this?
- Pick a single convention for the federated schema. GraphQL convention is camelCase for fields. Pick it, document it, and stop debating.
- Inventory existing deviations. A week-long project: go through every subgraph’s schema and list the fields that deviate. Roughly triage into “easy to migrate” (no external clients), “hard to migrate” (many clients), and “historical” (nobody knows why it’s snake_case but it is).
- Standardize new subgraphs immediately. The onboarding team writes camelCase from day one. Their schema is correct; they don’t carry legacy debt.
- For existing subgraphs, treat the deviation as a deprecation. Add the camelCase version alongside the old one, deprecate the old, migrate clients, remove (same playbook as the previous scenario).
- Document conventions in a living style guide. Include naming, nullability defaults, enum uppercasing, directive usage. Enforce with a linter that runs in every subgraph’s CI.
- Avoid mapping in the gateway. Some teams try to rename fields at the gateway layer. Don’t — it hides the inconsistency rather than fixing it, and the mapping is yet another place to maintain state. Normalize at the source.
- Accept that some inconsistencies will persist. Perfectly consistent schema across 20 subgraphs over 5 years is unrealistic. Pick the high-traffic inconsistencies to fix, accept the rest.
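The linter step in the list above could start as small as the following sketch. It assumes the schema has already been reduced to a mapping of type names to field names (a real linter would parse the SDL); the function name and input shape are hypothetical.

```python
import re
from typing import Dict, List

# camelCase: starts with a lowercase letter, then letters and digits only.
CAMEL_CASE = re.compile(r"^[a-z][a-zA-Z0-9]*$")


def find_naming_violations(schema_fields: Dict[str, List[str]]) -> List[str]:
    """Return 'Type.field' for every field name that is not camelCase."""
    violations = []
    for type_name, fields in schema_fields.items():
        for field in fields:
            if not CAMEL_CASE.match(field):
                violations.append(f"{type_name}.{field}")
    return sorted(violations)
```

Run in CI, a non-empty result would fail the build; for {"Order": ["createdAt", "created_timestamp"]} it reports Order.created_timestamp and lets createdAt through.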
auto_camel_case=True. The Python code stays idiomatic; the schema stays federation-standard. If the team still refuses, escalate with data: query logs showing every other subgraph uses camelCase, and clients that query across subgraphs have to context-switch for this one team’s schema. It’s an organization norm, not a language preference.
@company/graphql-schema-lint). Every subgraph repo’s CI imports it and runs it as a required check. Rules are defined centrally and updated centrally; subgraphs inherit new rules automatically. For exceptions, the repo adds a lint-ignore comment with a ticket reference and a removal date, so exceptions are tracked, not buried.
- “Rewrite all subgraphs to the new convention in one release.” Coordinated rewrites across 30 repos fail. Stage incrementally.
- “Leave it — clients can handle both.” Clients end up with conditional logic for each subgraph’s quirks, which is tech debt on the client side instead of the server side. Consolidate at the source.
- Airbnb Engineering Blog, “Schema Stitching at Scale” and follow-ups on their federation migration.
- Apollo GraphQL Docs, “Schema design best practices.”
- Shopify GraphQL Design Tutorial — opinionated and thorough style guide.
Apollo Gateway
The gateway (sometimes called the router) is the single entry point your clients talk to. Its job is three-fold: compose the supergraph schema from all subgraph schemas, plan each incoming query as a DAG of subgraph fetches, and execute that plan while handling cross-cutting concerns like auth header forwarding, tracing, and error propagation. In production, the recommended choice is the Apollo Router (written in Rust) because its query-planning and execution are dramatically faster than the Node.js gateway, which matters when you’re processing thousands of QPS. For development and small-scale deployments, the Node.js ApolloGateway is plenty capable and easier to customize.
A key architectural point: the gateway is stateless. All it holds is the composed supergraph schema. Requests can be load-balanced across any number of gateway instances. This is deliberately simple — keeping the gateway stateless means you can scale it horizontally without worrying about session affinity or distributed cache consistency. When you need per-user data (auth context, dataloader caches), it lives in request-scoped context, not gateway state.
For Python stacks, there’s no equivalent Python-native gateway with the same maturity as Apollo’s. The pragmatic pattern is to run the Apollo Router (Rust binary, language-agnostic) in front of Python strawberry subgraphs. The router doesn’t care what language implements the subgraphs — it speaks federation over HTTP/GraphQL. Below we show Node.js gateway code alongside a YAML router config and a Python FastAPI-based auth proxy.
Supergraph Configuration (Alternative to IntrospectAndCompose)
For production, don’t use runtime introspection (IntrospectAndCompose). It’s convenient for local dev, but it tightly couples gateway startup to subgraph availability and makes rollbacks hard. Instead, compose the supergraph schema ahead of time in CI (using rover supergraph compose against a schema registry), publish the resulting supergraph.graphql as an artifact, and point your gateway at it. This gives you atomic deploys, clean rollbacks, and schema validation before anything touches production.
Scenario: Your GraphQL gateway response time is 800ms at the 99th percentile, but each subgraph responds in under 100ms. Diagnose what's going wrong.
- Get a single slow query and its query plan. Capture a 99th-percentile request with its full plan (Apollo Studio shows this). The plan is a DAG: which subgraph calls run in parallel, which are sequential, which feed data from one into another.
- Compute the critical path. If the plan runs Users (50ms) → Orders (90ms) → Products (70ms) sequentially, the critical path is 210ms. If the three calls run in parallel, it is 90ms. The gap between your measured 800ms and the computed critical path is the problem.
- Check for N+1 entity resolution. A query like users { orders { product { name } } } where each product reference is resolved individually (say, 50 products) means 50 sequential lookups. Each is fast; together they are 50 × 20ms = 1000ms. DataLoader on the Products subgraph’s __resolveReference batches these into one call.
- Check for serialized @requires. If Subgraph A’s resolver declares @requires(fields: "x y z"), the gateway makes the dependency fetch first, sequentially. A chain of three such fetches is three serial round trips. Audit @requires usage and inline small required fields into the declaring subgraph when possible.
- Check connection pooling to subgraphs. If the gateway opens a new HTTP connection per subgraph call, TCP setup plus the TLS handshake can add 20-50ms each time. Enable keep-alive with a sufficient pool size (typically 50-200 per subgraph).
- Check for gateway CPU bottleneck. The gateway does query planning (expensive for complex queries) and response merging. If CPU is saturated on the gateway host, responses queue up. Solution: scale out the gateway, or move to the Rust-based Apollo Router, which can be roughly 10x more efficient than the Node.js gateway.
- Check for subgraph cold starts. In serverless or autoscaled deployments, infrequently hit subgraphs may cold-start on a request. This looks like “100ms usual, 2000ms occasional”: the gateway inherits the cold-start latency. Keep warm instances or pre-warm on scale events.
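The critical-path step above is a longest-path computation over the query plan. The sketch below assumes the plan has been exported as two plain dictionaries (per-subgraph latency and a dependency map); that shape and the function name are assumptions for illustration, not an Apollo Studio export format.

```python
from functools import lru_cache
from typing import Dict, List


def critical_path_ms(
    latency_ms: Dict[str, int], depends_on: Dict[str, List[str]]
) -> int:
    """Longest latency-weighted path through a query-plan DAG (no cycles)."""

    @lru_cache(maxsize=None)
    def finish_time(node: str) -> int:
        # A fetch starts once its slowest dependency has finished.
        deps = depends_on.get(node, [])
        start = max((finish_time(d) for d in deps), default=0)
        return start + latency_ms[node]

    return max(finish_time(node) for node in latency_ms)


latencies = {"Users": 50, "Orders": 90, "Products": 70}
sequential = critical_path_ms(
    latencies, {"Orders": ["Users"], "Products": ["Orders"]}
)  # 50 + 90 + 70 = 210
parallel = critical_path_ms(latencies, {})  # max(50, 90, 70) = 90
```

Subtracting the result from the measured 800ms gives the time that the plan itself does not explain (590ms in the sequential case), which is where the remaining checks in the list come in.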
@requires(fields: "authorName") on a Post resolver triggered a separate fetch to Users for each post in a feed query returning 20 posts. 20 × 30ms = 600ms. The fix was to add authorName directly to the Posts subgraph (denormalizing at write time via an event subscription from Users), eliminating the fetch entirely. Gateway response time dropped to 70ms.
Senior Follow-up Questions
@defer directive so initial response lands quickly and the rest streams), or migrating to Apollo Router (Rust) which is significantly faster at merging.
_entities queries, not one batched query. The fix is at the gateway: Apollo Gateway’s RemoteGraphQLDataSource supports batched entity fetching in Federation 2, but it must be explicitly enabled. Check the gateway config. The net effect is one HTTP call with 50 entity references per subgraph, not 50 calls.
- “The network must be slow.” Rarely the actual issue in a single region with properly-sized pools. Measure before blaming infrastructure.
- “Add a cache.” Caching in GraphQL is much harder than REST because queries are unique. Caching won’t help if each request is a new query shape. Solve the fan-out problem first, then cache selectively.
- Apollo Blog, “Performance tuning Apollo Gateway” (2022).
- Medium Engineering, “GraphQL fan-out avoidance at Medium” (2020).
- Rishi Nair and team at Meta, “DataLoader: history and internals” (GraphQL Conf 2023 talk).
Scenario: Your gateway takes the entire API down when it crashes. How do you design for gateway high-availability without rearchitecting everything?
- Horizontal scaling is table stakes. The gateway is stateless, so put at least 3 instances behind a load balancer in at least 2 availability zones. Configure autoscaling on CPU (typically scale out above 60% sustained). This alone eliminates “single crash = total outage.”
- Use Apollo Router (Rust) over Apollo Gateway (Node.js) in production. The Rust binary has predictable performance, lower memory usage, and tighter crash characteristics than Node. The gateway crashes most teams see are OOM in Node under load.
- Load the supergraph from a static file, not from live introspection. A gateway that starts by introspecting every subgraph is fragile — any subgraph being down prevents startup. A gateway that boots from a supergraph SDL file starts in seconds and is decoupled from subgraph availability.
- Implement health checks that don’t depend on all subgraphs. The gateway’s /healthz endpoint should return healthy as long as the gateway process is alive. Subgraph health is a separate concern; one subgraph being down should not cascade to “the gateway is dead.”
- Set per-subgraph timeouts aggressively. If a subgraph hangs, the gateway waits. Default timeouts are often too generous (30s). Set them based on the subgraph’s 99th percentile + margin (maybe 2s). Fail fast, return partial data.
- Deploy gateway updates with canaries. 1% of traffic to the new version first, then 10%, then 100% over a few minutes. A buggy gateway deploy is the most common way federation goes down; canary protects against this.
- Practice failover. Quarterly chaos drill: kill one gateway instance during peak, watch the LB route around it, verify no user-visible errors. If it doesn’t work in the drill, it won’t work during an incident.
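The “fail fast, return partial data” item in the list above can be sketched with asyncio. This is a simplified model, not gateway code: the fetch functions, the SUBGRAPH_TIMEOUT code, and the response shape are hypothetical, and in practice the timeout would be set in the router or gateway configuration.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional, Tuple

Fetch = Callable[[], Awaitable[Any]]


async def fetch_with_timeout(
    name: str, fetch: Fetch, timeout_s: float
) -> Tuple[str, Any, Optional[Dict[str, Any]]]:
    """Run one subgraph fetch; on timeout return an error instead of raising."""
    try:
        data = await asyncio.wait_for(fetch(), timeout=timeout_s)
        return name, data, None
    except asyncio.TimeoutError:
        error = {
            "message": f"{name} timed out",
            "extensions": {"code": "SUBGRAPH_TIMEOUT"},  # hypothetical code
        }
        return name, None, error


async def gather_subgraphs(
    fetches: Dict[str, Fetch], timeout_s: float
) -> Dict[str, Any]:
    """Fetch all subgraphs in parallel; keep whatever finished in time."""
    results = await asyncio.gather(
        *(fetch_with_timeout(name, fetch, timeout_s) for name, fetch in fetches.items())
    )
    data = {name: value for name, value, _ in results}
    errors = [err for _, _, err in results if err is not None]
    return {"data": data, "errors": errors}


async def fast_users() -> Dict[str, str]:
    return {"id": "123"}


async def hung_recommendations() -> list:
    await asyncio.sleep(10)  # simulates a subgraph that hangs
    return []
```

With a 50ms timeout, the users data comes back normally while recommendations is null with a matching entry in errors, so one hung subgraph costs the client a field rather than the whole response.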
- “Add caching to make the gateway optional.” Caching in front of the gateway helps for repeated queries but doesn’t address the failure mode (the gateway itself dying). You need gateway redundancy, not cache redundancy.
- “Have clients talk directly to subgraphs as a fallback.” This breaks the entire federation abstraction, re-exposes internal schemas, and defeats the benefit of having a single endpoint. It is not a viable fallback.
- Apollo, “Router vs Gateway comparison” (2023) — operational guidance.
- Expedia Engineering, “Federation at Scale” (GraphQL Summit 2022).
- “Site Reliability Engineering” (Google, 2016) — the general principles apply: canarying, defense in depth, graceful degradation.
Advanced Federation Patterns
Entity References and @requires
The @requires directive is one of the most powerful (and most misunderstood) features of Federation. It lets a subgraph declare: “I need field X from another subgraph before I can compute field Y.” The gateway will automatically fetch the required data first, then pass it to your resolver. This avoids the need for your subgraph to make direct HTTP calls to other services; the gateway handles the orchestration for you.
Production pitfall: Overusing @requires can create deep dependency chains where the gateway needs to make sequential calls across multiple subgraphs to resolve a single field. Monitor your query plans with Apollo Studio to catch these chains before they impact latency.
reviewSummary receives the externally-required name as part of the entity representation passed to it — it doesn’t have to fetch it. This is declarative cross-service data flow: you describe the dependency in SDL, the planner figures out the fetch sequence, and your code stays simple.
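As a rough illustration of that data flow, the sketch below shows a resolver for reviewSummary that reads name straight from the representation the gateway passes in. It assumes the entity is a Product owned elsewhere and that this subgraph stores ratings per product id; the REVIEWS table and the summary format are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical ratings owned by this subgraph, keyed by product id.
REVIEWS: Dict[str, List[int]] = {"p1": [5, 4, 3]}


def resolve_review_summary(representation: Dict[str, Any]) -> str:
    """Compute reviewSummary from local ratings plus the externally supplied name.

    The gateway has already fetched `name` from the owning subgraph because
    the field was declared with @requires(fields: "name").
    """
    name = representation["name"]
    ratings = REVIEWS.get(representation["id"], [])
    if not ratings:
        return f"{name}: no reviews yet"
    average = sum(ratings) / len(ratings)
    return f"{name}: {average:.1f}/5 from {len(ratings)} reviews"
```

Note that the resolver never calls another service: everything it needs arrives in the representation.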
Compound Keys
Sometimes a single-field key isn’t enough: cart items, for example, are naturally identified by the combination of user and product. Federation supports compound keys through the same @key directive, just with space-separated fields.
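On the resolver side, a compound key simply means the representation carries more than one key field. The sketch below assumes a hypothetical CartItem entity declared with @key(fields: "userId productId") and an in-memory table keyed by that pair.

```python
from typing import Any, Dict, Optional, Tuple

# Hypothetical store keyed by the (userId, productId) pair.
CART_ITEMS: Dict[Tuple[str, str], Dict[str, Any]] = {
    ("u1", "p1"): {"userId": "u1", "productId": "p1", "quantity": 2},
}


def resolve_cart_item_reference(rep: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Reference resolver for a compound key: both key fields identify the row."""
    key = (rep["userId"], rep["productId"])
    item = CART_ITEMS.get(key)
    return {"__typename": "CartItem", **item} if item else None
```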
Interface Entities
Federation also supports interfaces as entities (Federation 2.3+), which is useful for Relay-style node(id) queries and polymorphic APIs. The gateway can route a single query through the correct concrete-type subgraph based on __typename.
Error Handling in Federation
Error handling in federation has to solve two problems at once: local subgraph errors (a DB query failed, a validation check rejected input) and cross-subgraph failures (one subgraph is down while others are healthy). GraphQL’s response model is kind to both: the errors array can contain partial failures alongside a data object with whatever fields resolved successfully. This is fundamentally more forgiving than REST, where a 500 from one service often poisons the entire response.
The pattern below shows typed error classes that carry structured extensions (error codes, entity names, field names) so clients can programmatically branch on failure modes rather than parsing error strings. The trick is to resist the temptation to throw on every failure — for non-critical nested fields, returning null with a logged warning often gives users a better experience than a full query failure.
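A minimal, framework-free sketch of such typed errors might look like the following. The class names, the error codes, and the resolve_non_critical helper are assumptions for illustration; a real subgraph would hook these into its GraphQL library’s error formatting.

```python
import logging
from typing import Any, Callable, Dict, List

logger = logging.getLogger("subgraph")


class SubgraphError(Exception):
    """Base error carrying structured extensions for clients to branch on."""

    code = "INTERNAL_ERROR"

    def __init__(self, message: str, **extensions: Any) -> None:
        super().__init__(message)
        self.message = message
        self.extensions: Dict[str, Any] = {"code": self.code, **extensions}


class NotFoundError(SubgraphError):
    code = "NOT_FOUND"


class ValidationError(SubgraphError):
    code = "BAD_USER_INPUT"


def to_graphql_error(exc: SubgraphError, path: List[str]) -> Dict[str, Any]:
    """Shape an exception like an entry in the GraphQL errors array."""
    return {"message": exc.message, "path": path, "extensions": exc.extensions}


def resolve_non_critical(resolver: Callable[..., Any], *args: Any) -> Any:
    """For optional nested fields: log the failure and return null instead of throwing."""
    try:
        return resolver(*args)
    except SubgraphError as exc:
        logger.warning("non-critical field failed: %s", exc.message)
        return None
```

A client can then switch on extensions.code (for example NOT_FOUND versus BAD_USER_INPUT) instead of matching on message text.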
Authentication & Authorization
Auth in federated systems follows a clear split: authentication at the gateway, authorization at the subgraph. The gateway validates the incoming JWT or session token once, extracts user claims, and forwards them to subgraphs as trusted headers (typically x-user-id, x-user-role). Subgraphs trust these headers because they’re reachable only through the gateway (enforced via network policy or service mesh). Each subgraph then enforces its own authorization rules: who can see this user’s email, who can modify this order, who can delete a product. The gateway shouldn’t try to centralize authorization because it doesn’t know each subgraph’s domain rules: the Orders subgraph knows what “order owner” means, not the gateway.
The critical security rule: subgraphs must never trust arbitrary callers. Either enforce that they can only be reached via the gateway (network-level), or have them re-validate the JWT themselves. The second approach is slower but safer in zero-trust environments. When in doubt, do both.
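The split between the two layers can be sketched without any web framework. The example assumes the gateway has already validated the token and holds its claims; the claim names (sub, role), the Viewer type, and the email rule are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


def build_subgraph_headers(claims: Mapping[str, str]) -> Dict[str, str]:
    """Gateway side: turn already-validated token claims into trusted headers."""
    return {
        "x-user-id": claims["sub"],
        "x-user-role": claims.get("role", "user"),
    }


@dataclass(frozen=True)
class Viewer:
    user_id: str
    role: str


def viewer_from_headers(headers: Mapping[str, str]) -> Optional[Viewer]:
    """Subgraph side: rebuild the caller's identity from forwarded headers."""
    user_id = headers.get("x-user-id")
    if not user_id:
        return None  # anonymous request
    return Viewer(user_id=user_id, role=headers.get("x-user-role", "user"))


def can_view_email(viewer: Optional[Viewer], owner_id: str) -> bool:
    """Subgraph-owned rule: only the owner or an admin may see a user's email."""
    if viewer is None:
        return False
    return viewer.role == "admin" or viewer.user_id == owner_id
```

The header-based identity is only as trustworthy as the network boundary around the subgraph, which is why the rule above about restricting callers (or re-validating the JWT) still applies.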
Testing Federation
Testing federated systems has two layers. First, unit-test each subgraph in isolation; its resolvers should be testable with standard GraphQL execution against a mock database. Second, integration-test the composed supergraph: run real subgraphs (or mocks speaking the federation protocol) behind a gateway and assert that cross-subgraph queries return what you expect. The integration layer is where subtle federation bugs hide: a missing @external directive, a reference resolver returning the wrong shape, a circular dependency between subgraphs.
In Python, the easiest integration-test pattern is to spin up each subgraph on a distinct port via pytest fixtures, use httpx.AsyncClient to send queries to a running Apollo Router (or a mock gateway), and assert on the responses. Strawberry ships a schema.execute method that’s handy for testing individual subgraphs without HTTP.
Interview Questions
Q1: What is GraphQL Federation and how does it work?
- Subgraphs: Individual GraphQL services
- Gateway: Entry point that composes supergraph
- Supergraph: Combined schema from all subgraphs
- Entities: Types that can span multiple subgraphs
- Each subgraph defines its portion of the schema
- Gateway fetches schemas and composes supergraph
- Client sends query to gateway
- Gateway plans execution across subgraphs
- Results are combined and returned
- @key: Defines the entity’s unique identifier
- @external: Field defined in another subgraph
- @requires: Field needs external data
Q2: How do entities work in Federation?
- Orders subgraph returns { __typename: 'User', id: '1' }
- Gateway sees it needs User fields
- Gateway calls Users subgraph’s __resolveReference
- Full User data is returned
Q3: What's the difference between @external and @requires?
- Gateway fetches price from Products subgraph
- Passes it to this subgraph’s resolver
- Resolver uses it to compute priceWithTax
Q4: How do you handle authentication in Federation?
- Validate JWT in gateway middleware
- Add user info to context
- Forward auth headers to subgraphs
- Read user from headers
- Apply authorization in resolvers
- Use directives for declarative auth
Chapter Summary
- Federation composes unified GraphQL from multiple services
- Entities enable cross-subgraph type resolution
- Gateway handles query planning and execution
- Use @key, @external, @requires for entity relationships
- Forward authentication from gateway to subgraphs
- Test federated queries with mock subgraphs
- Both Node.js (Apollo) and Python (strawberry-graphql + Apollo Router) are production-ready stacks
Interview Deep-Dive
'Your team is debating between REST with a BFF and GraphQL Federation for a microservices API. What are the real-world trade-offs, not just the theoretical ones?'
'Explain how entity resolution works in Apollo Federation. What happens when one subgraph needs data from another subgraph to resolve a field?'
type User @key(fields: "id") { id: ID!, name: String, email: String }. The Order subgraph extends it: extend type User @key(fields: "id") { orders: [Order!]! }. The Order subgraph can add the orders field to User without the User subgraph knowing about it.
When a client queries { user(id: "123") { name, orders { total } } }, the gateway creates a query plan. Step one: fetch { user(id: "123") { id, name } } from the User subgraph. Step two: take the user ID from step one and send the representation { __typename: "User", id: "123" } to the Order subgraph’s _entities field, which invokes __resolveReference to fetch the orders.
The Order subgraph’s __resolveReference method receives the User entity representation (just the key fields) and uses it to look up orders: (user) => orderService.getOrdersByUserId(user.id). It returns the orders, which the gateway merges with the user data from step one.
The key insight: subgraphs do not call each other directly. The gateway orchestrates all communication. This means subgraphs can be developed independently: the Order team adds the orders field to User without the User team changing anything.
The performance concern: the gateway sends the representations for all users in a single _entities request, but the subgraph invokes __resolveReference once per representation. A query that returns 50 users, each with orders, therefore triggers 50 reference-resolver calls, and if each one issues its own database query, that is the N+1 problem. The solution is to batch inside the subgraph (for example with DataLoader) so that the orders for all 50 users are fetched in a single database query.
Follow-up: “What happens when one subgraph is down? Does the entire query fail?”
Not necessarily. A failed subgraph fetch surfaces as an entry in the errors array, and the fields that subgraph owns come back null. The danger is non-nullable fields: GraphQL’s null propagation can then wipe out the parent object or even the whole response, so a user query that includes non-critical data (recommendations) can fail just because the recommendation subgraph is down.
The @defer directive (supported by the Apollo Router) helps with slow fields by returning the initial result first and streaming the deferred fields when they become available. For non-streamable clients, I make non-critical fields nullable: the gateway returns null for the failed subgraph’s fields and includes an error in the response indicating which fields were unavailable. The client handles the partial response gracefully.
'How do you handle schema evolution in a federated GraphQL architecture with 10 subgraphs owned by different teams?'
rover subgraph check locally to verify their schema change composes with the current production schemas. Second, CI composition check: the PR pipeline pulls all production subgraph schemas from the registry, substitutes the PR’s schema for its subgraph, and runs composition. If it fails, the PR is blocked. Third, staging validation: the change deploys to a staging environment where all subgraphs run, and integration tests verify that existing queries still return expected results. This catches runtime errors that composition checks miss (a field exists in the schema but the resolver throws an error).