GraphQL Federation

GraphQL Federation enables you to compose a unified API from multiple GraphQL services, perfect for microservices architectures. The core problem it solves: in a REST microservices world, the client has to make N requests to N services and stitch the data together itself. With federation, a single GraphQL query like “get this user with their orders and each order’s products” gets decomposed by the gateway into efficient parallel requests to the right services. The client sees one API; behind the scenes, the gateway is orchestrating across your entire service fleet.
Learning Objectives:
  • Understand GraphQL Federation architecture
  • Implement Apollo Federation with Node.js and Python
  • Master schema composition and entities
  • Build federated queries and mutations
  • Handle authentication and authorization

Why GraphQL Federation?

REST APIs force a painful choice on microservice architectures. Either you build a single “mega endpoint” that returns everything (ballooning the payload and coupling services), or you force the client to make a chain of N requests across services and stitch results together. Neither scales. REST’s resource-oriented design fits a single service beautifully, but the moment data lives across service boundaries, the client becomes a de facto orchestrator, a job it was never designed for. Mobile clients on flaky networks pay the worst price: every extra round trip is a new opportunity for latency, timeouts, and partial failures.

GraphQL fixed the client-side problem (ask for exactly what you want, get it in one query) but early approaches to multi-service GraphQL were rough. Schema stitching, the predecessor to federation, asked a central gateway to import every subgraph’s full schema and manually declare how types link together. This worked for small teams but collapsed under real-world pressure: every schema change required redeploying the gateway, type conflicts between services had to be resolved by hand, and there was no standard way to describe cross-service relationships. Federation codified these concerns (entities, keys, reference resolvers) into a standard so that subgraphs can evolve independently while still composing into a coherent API.

The tradeoff is explicit: you’re accepting gateway complexity, query-planning overhead, and a steeper operational learning curve in exchange for client simplicity, parallel subgraph execution, and a single typed schema across your entire fleet. For small systems (under ~5 services, under ~20 engineers) this tradeoff is rarely worth it: a REST BFF or a single GraphQL service will serve you better. For larger organizations with independent frontend teams and dozens of services, federation pays for itself within months.
┌─────────────────────────────────────────────────────────────────────────────┐
│                    THE PROBLEM WITH REST IN MICROSERVICES                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Client wants: User with their orders and order products                    │
│                                                                              │
│  REST APPROACH (Multiple round trips):                                      │
│  ─────────────────────────────────────                                      │
│                                                                              │
│  1. GET /users/123                    → { name, email }                     │
│  2. GET /users/123/orders             → [{ id: 'o1' }, { id: 'o2' }]        │
│  3. GET /orders/o1                    → { items: ['p1', 'p2'] }             │
│  4. GET /orders/o2                    → { items: ['p3'] }                   │
│  5. GET /products/p1                  → { name, price }                     │
│  6. GET /products/p2                  → { name, price }                     │
│  7. GET /products/p3                  → { name, price }                     │
│                                                                              │
│  Total: 7 HTTP requests!                                                    │
│                                                                              │
│  ═══════════════════════════════════════════════════════════════════════   │
│                                                                              │
│  GRAPHQL FEDERATION (Single request):                                       │
│  ────────────────────────────────────                                       │
│                                                                              │
│  query {                                                                    │
│    user(id: "123") {                                                        │
│      name                                                                   │
│      email                                                                  │
│      orders {                                                               │
│        id                                                                   │
│        items {                                                              │
│          product { name, price }                                            │
│          quantity                                                           │
│        }                                                                    │
│      }                                                                      │
│    }                                                                        │
│  }                                                                          │
│                                                                              │
│  Total: 1 request! Gateway handles federation                               │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
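To make the round-trip count concrete, here is a minimal sketch of the client-side orchestration that the REST approach implies. The `fakeGet` function and its canned responses are stand-ins for real HTTP calls (assumptions for illustration only); the routes mirror the ones in the diagram above.

```javascript
// Canned responses standing in for three separate REST services (illustrative only).
const routes = {
  '/users/123': { name: 'Alice', email: 'alice@example.com' },
  '/users/123/orders': [{ id: 'o1' }, { id: 'o2' }],
  '/orders/o1': { items: ['p1', 'p2'] },
  '/orders/o2': { items: ['p3'] },
  '/products/p1': { name: 'Wireless Headphones', price: 29.99 },
  '/products/p2': { name: 'Laptop Stand', price: 49.99 },
  '/products/p3': { name: 'Mechanical Keyboard', price: 199.99 }
};

let requestCount = 0;

// Hypothetical stand-in for an HTTP GET; each call represents one round trip.
async function fakeGet(path) {
  requestCount += 1;
  return routes[path];
}

// The client has to walk the data graph itself: user -> orders -> products.
async function loadUserWithOrders(userId) {
  const user = await fakeGet(`/users/${userId}`);
  const orderRefs = await fakeGet(`/users/${userId}/orders`);
  const orders = await Promise.all(
    orderRefs.map(async ({ id }) => {
      const order = await fakeGet(`/orders/${id}`);
      const products = await Promise.all(
        order.items.map((productId) => fakeGet(`/products/${productId}`))
      );
      return { id, products };
    })
  );
  return { ...user, orders };
}

// Resolves to the stitched user plus the number of round trips it took.
const result = loadUserWithOrders('123').then((user) => ({ user, requestCount }));
```

Even with `Promise.all` parallelising each level, the three levels still run one after another, and all of that sequencing logic lives in the client.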

Federation Architecture

The heart of federation is a directed conversation between the router (or gateway) and a set of subgraphs. When a client query arrives at the router, the router consults its composed supergraph schema, produces a query plan (essentially a DAG of subgraph calls with data dependencies), and executes that plan, fetching from multiple subgraphs in parallel where it can, and sequentially where one subgraph’s output feeds another’s input. Subgraphs never talk to each other directly. This indirection is crucial: it’s what lets subgraphs evolve independently and what lets the router enforce global concerns like tracing, rate limiting, and authentication in one place.

Subgraphs communicate with the router via standard GraphQL over HTTP, but with a twist: each subgraph exposes a special _entities query field the router uses to resolve cross-subgraph references. When the router needs a User that was mentioned by the Orders subgraph, it sends the key fields ({ __typename: "User", id: "123" }) as a representation to the Users subgraph’s _entities field, which invokes a reference resolver to hydrate the full object. This is the mechanism that makes the whole thing feel seamless to clients despite being a distributed orchestration under the hood.
┌─────────────────────────────────────────────────────────────────────────────┐
│                    APOLLO FEDERATION ARCHITECTURE                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│                           ┌─────────────────┐                               │
│                           │     Clients     │                               │
│                           └────────┬────────┘                               │
│                                    │                                        │
│                                    ▼                                        │
│                    ┌───────────────────────────────┐                       │
│                    │       Apollo Gateway          │                       │
│                    │  ┌──────────────────────────┐ │                       │
│                    │  │   Supergraph Schema      │ │                       │
│                    │  │   (Composed from all)    │ │                       │
│                    │  └──────────────────────────┘ │                       │
│                    │  ┌──────────────────────────┐ │                       │
│                    │  │   Query Planner          │ │                       │
│                    │  │   (Optimizes execution)  │ │                       │
│                    │  └──────────────────────────┘ │                       │
│                    └──────────────┬────────────────┘                       │
│                                   │                                         │
│           ┌───────────────────────┼───────────────────────┐                │
│           │                       │                       │                │
│           ▼                       ▼                       ▼                │
│   ┌───────────────┐       ┌───────────────┐       ┌───────────────┐       │
│   │ Users Subgraph│       │Orders Subgraph│       │Product Subgraph│      │
│   │               │       │               │       │               │       │
│   │ type User     │       │ type Order    │       │ type Product  │       │
│   │   @key(id)    │       │   @key(id)    │       │   @key(id)    │       │
│   │               │       │               │       │               │       │
│   │ Users DB      │       │ Orders DB     │       │ Products DB   │       │
│   └───────────────┘       └───────────────┘       └───────────────┘       │
│                                                                              │
│  KEY CONCEPTS:                                                              │
│  • Subgraph: Individual GraphQL service with part of the schema           │
│  • Supergraph: Composed schema from all subgraphs                          │
│  • Entity: Type that can be resolved across subgraphs (@key directive)    │
│  • Query Planner: Optimizes how to fetch data from multiple subgraphs     │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
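The "parallel where it can, sequential where it must" behaviour of a query plan can be sketched as a small dependency-driven executor. This is a simplified model, not Apollo's actual plan format or planner: the `plan` shape, the `featured` step, and the `callSubgraph` helper are all assumptions made for illustration.

```javascript
// Simplified model of a query plan: each step names a subgraph and the steps
// whose output it needs. This is an illustration, not Apollo's plan format.
const plan = [
  { id: 'user', subgraph: 'users', dependsOn: [] },
  { id: 'orders', subgraph: 'orders', dependsOn: ['user'] },
  { id: 'products', subgraph: 'products', dependsOn: ['orders'] },
  { id: 'featured', subgraph: 'products', dependsOn: [] }
];

const callLog = [];

// Hypothetical subgraph call; a real router would send GraphQL over HTTP here.
async function callSubgraph(step) {
  callLog.push(`start:${step.id}`);
  await new Promise((resolve) => setTimeout(resolve, 10));
  callLog.push(`end:${step.id}`);
  return `${step.subgraph} data for ${step.id}`;
}

// Start every step as soon as all of its dependencies have finished.
async function executePlan(steps) {
  const pending = new Map();
  const run = (step) => {
    if (!pending.has(step.id)) {
      const deps = step.dependsOn.map((depId) =>
        run(steps.find((s) => s.id === depId))
      );
      pending.set(step.id, Promise.all(deps).then(() => callSubgraph(step)));
    }
    return pending.get(step.id);
  };
  const results = await Promise.all(steps.map(run));
  return Object.fromEntries(steps.map((s, i) => [s.id, results[i]]));
}

const planResult = executePlan(plan);
```

In this model the independent `user` and `featured` steps start together, while `orders` waits for `user` and `products` waits for `orders`; `callLog` records that ordering.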

Implementing Subgraphs

A subgraph is just a normal GraphQL service with two extras: it speaks the federation directive dialect (@key, @external, @requires, @provides), and it implements reference resolvers for any entity it owns. “Owning” an entity means being the authoritative source for that type — you declare a @key on it, and you provide a __resolveReference (Node.js) or resolve_reference (Python/strawberry) function that, given just the key fields, returns the full object. Other subgraphs can then “extend” your entity by adding fields of their own without touching your code. The Python and Node.js patterns differ in surface syntax but map one-to-one conceptually. In Node.js with Apollo Server, you write SDL as a template string and wire up a resolver map. In Python with strawberry-graphql, you use class decorators (@strawberry.federation.type(keys=["id"])) and methods — strawberry introspects your Python types to produce the schema. Both approaches give you the same federation capabilities; the choice is mostly about which language ecosystem your team already lives in.
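To make the reference-resolver contract concrete, the sketch below models what happens when the router calls a subgraph's `_entities(representations: [_Any!]!)` field: each representation is routed by its `__typename` to that type's `__resolveReference`. The `resolveEntities` helper is hypothetical and heavily simplified; in a real Node.js subgraph the `@apollo/subgraph` library provides this behaviour.

```javascript
const users = [
  { id: '1', email: 'alice@example.com', name: 'Alice' },
  { id: '2', email: 'bob@example.com', name: 'Bob' }
];

// Same shape as an Apollo resolver map: one entry per type.
const resolvers = {
  User: {
    __resolveReference: (ref) => users.find((u) => u.id === ref.id)
  }
};

// Hypothetical helper modelling the `_entities` field of a subgraph.
function resolveEntities(representations, resolverMap) {
  return representations.map((rep) => {
    const typeResolvers = resolverMap[rep.__typename];
    if (!typeResolvers || !typeResolvers.__resolveReference) {
      throw new Error(`No reference resolver for type ${rep.__typename}`);
    }
    const entity = typeResolvers.__resolveReference(rep);
    // Unknown keys resolve to null; otherwise tag the result with its type.
    return entity ? { __typename: rep.__typename, ...entity } : null;
  });
}

// The router sends only key fields; the subgraph returns the full objects.
const entities = resolveEntities(
  [{ __typename: 'User', id: '2' }, { __typename: 'User', id: '99' }],
  resolvers
);
```

The first representation is hydrated into Bob's full record, and the unknown id comes back as `null`, which is how a missing entity surfaces to the router in this model.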

Users Subgraph

The Users subgraph is the simplest case: one entity (User) with no external dependencies. It owns the User type via @key(fields: "id") and provides the reference resolver. Other subgraphs — Orders, Reviews — will extend User to add their fields, but they’ll always call back here to get the core user data.
// users-subgraph/index.js

const { ApolloServer } = require('@apollo/server');
const { startStandaloneServer } = require('@apollo/server/standalone');
const { buildSubgraphSchema } = require('@apollo/subgraph');
const { gql } = require('graphql-tag');

const typeDefs = gql`
  extend schema @link(url: "https://specs.apollo.dev/federation/v2.0", import: ["@key", "@external", "@requires"])

  type Query {
    user(id: ID!): User
    users: [User!]!
    me: User
  }

  type Mutation {
    createUser(input: CreateUserInput!): User!
    updateUser(id: ID!, input: UpdateUserInput!): User!
  }

  # Entity - can be referenced from other subgraphs
  type User @key(fields: "id") {
    id: ID!
    email: String!
    name: String!
    avatar: String
    createdAt: String!
  }

  input CreateUserInput {
    email: String!
    name: String!
    password: String!
  }

  input UpdateUserInput {
    name: String
    avatar: String
  }
`;

// In-memory data store (use real DB in production)
const users = [
  { id: '1', email: 'alice@example.com', name: 'Alice', createdAt: '2024-01-01' },
  { id: '2', email: 'bob@example.com', name: 'Bob', createdAt: '2024-01-02' }
];

const resolvers = {
  Query: {
    user: (_, { id }) => users.find(u => u.id === id),
    users: () => users,
    me: (_, __, { user }) => user  // From auth context
  },
  
  Mutation: {
    createUser: (_, { input }) => {
      const newUser = {
        id: String(users.length + 1),
        ...input,
        createdAt: new Date().toISOString()
      };
      users.push(newUser);
      return newUser;
    },
    updateUser: (_, { id, input }) => {
      const user = users.find(u => u.id === id);
      if (!user) throw new Error('User not found');
      Object.assign(user, input);
      return user;
    }
  },
  
  User: {
    // Reference resolver - called when other subgraphs need to resolve a User
    __resolveReference: (userRef) => {
      return users.find(u => u.id === userRef.id);
    }
  }
};

const server = new ApolloServer({
  schema: buildSubgraphSchema({ typeDefs, resolvers })
});

startStandaloneServer(server, { listen: { port: 4001 } })
  .then(({ url }) => console.log(`Users subgraph ready at ${url}`));

Orders Subgraph

The Orders subgraph demonstrates two key federation patterns: it owns the Order entity, and it extends the User entity with an orders field. This is how federation distributes schema ownership without sacrificing a unified API. The Users team doesn’t need to know Orders exists; they just own the User type. The Orders team adds order-related capabilities to User by declaring its own User definition with the same @key and only the fields it contributes. In Federation 2 the extend keyword is optional for this (Federation 1 required extend type User), and Python subgraphs built with strawberry have an equivalent federation type declaration. Composition merges these definitions into a single User type in the supergraph schema that the router serves.

Notice also how the Orders subgraph returns a reference stub ({ __typename: 'User', id: order.userId }) rather than a full User object. The subgraph only knows the user’s ID, and it doesn’t need to fetch the user. The router sees this stub in the response and, if the client asked for User fields like name or email, calls back to the Users subgraph’s reference resolver to fill them in. This lazy hydration is what keeps federation fast: data is fetched only when the query demands it.
// orders-subgraph/index.js

const { ApolloServer } = require('@apollo/server');
const { startStandaloneServer } = require('@apollo/server/standalone');
const { buildSubgraphSchema } = require('@apollo/subgraph');
const { gql } = require('graphql-tag');

const typeDefs = gql`
  extend schema @link(url: "https://specs.apollo.dev/federation/v2.0", import: ["@key", "@external", "@requires", "@extends"])

  type Query {
    order(id: ID!): Order
    orders(userId: ID): [Order!]!
  }

  type Mutation {
    createOrder(input: CreateOrderInput!): Order!
    updateOrderStatus(id: ID!, status: OrderStatus!): Order!
  }

  type Order @key(fields: "id") {
    id: ID!
    status: OrderStatus!
    items: [OrderItem!]!
    total: Float!
    createdAt: String!
    # Reference to User entity (defined in Users subgraph)
    user: User!
  }

  type OrderItem {
    product: Product!
    quantity: Int!
    price: Float!
  }

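  # Contribute an orders field to the User entity owned by the Users subgraph.
  # In Federation 2 the key field needs no @external here.
  type User @key(fields: "id") {
    id: ID!
    # Add orders field to User type
    orders: [Order!]!
  }

  # Reference to Product entity (defined in Products subgraph).
  # resolvable: false marks this as a stub that Orders cannot resolve itself.
  type Product @key(fields: "id", resolvable: false) {
    id: ID!
  }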

  enum OrderStatus {
    PENDING
    CONFIRMED
    SHIPPED
    DELIVERED
    CANCELLED
  }

  input CreateOrderInput {
    userId: ID!
    items: [OrderItemInput!]!
  }

  input OrderItemInput {
    productId: ID!
    quantity: Int!
  }
`;

// Sample data
const orders = [
  {
    id: 'o1',
    userId: '1',
    status: 'DELIVERED',
    items: [
      { productId: 'p1', quantity: 2, price: 29.99 },
      { productId: 'p2', quantity: 1, price: 49.99 }
    ],
    total: 109.97,
    createdAt: '2024-01-10'
  },
  {
    id: 'o2',
    userId: '1',
    status: 'SHIPPED',
    items: [
      { productId: 'p3', quantity: 1, price: 199.99 }
    ],
    total: 199.99,
    createdAt: '2024-01-15'
  }
];

const resolvers = {
  Query: {
    order: (_, { id }) => orders.find(o => o.id === id),
    orders: (_, { userId }) => 
      userId ? orders.filter(o => o.userId === userId) : orders
  },
  
  Mutation: {
    createOrder: async (_, { input }, { dataSources }) => {
      // Fetch product prices from Products subgraph
      const items = await Promise.all(
        input.items.map(async item => {
          const product = await dataSources.productsAPI.getProduct(item.productId);
          return {
            productId: item.productId,
            quantity: item.quantity,
            price: product.price
          };
        })
      );
      
      const total = items.reduce((sum, item) => sum + item.price * item.quantity, 0);
      
      const newOrder = {
        id: `o${orders.length + 1}`,
        userId: input.userId,
        status: 'PENDING',
        items,
        total,
        createdAt: new Date().toISOString()
      };
      
      orders.push(newOrder);
      return newOrder;
    },
    
    updateOrderStatus: (_, { id, status }) => {
      const order = orders.find(o => o.id === id);
      if (!order) throw new Error('Order not found');
      order.status = status;
      return order;
    }
  },
  
  Order: {
    __resolveReference: (orderRef) => orders.find(o => o.id === orderRef.id),
    
    // Return reference stub - Gateway will resolve via Users subgraph
    user: (order) => ({ __typename: 'User', id: order.userId }),
    
    items: (order) => order.items.map(item => ({
      ...item,
      product: { __typename: 'Product', id: item.productId }
    }))
  },
  
  // Extend User type with orders
  User: {
    orders: (user) => orders.filter(o => o.userId === user.id)
  }
};

const server = new ApolloServer({
  schema: buildSubgraphSchema({ typeDefs, resolvers })
});

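// createOrder reads dataSources.productsAPI from the context, so it must be
// supplied here (Apollo Server 4 has no built-in dataSources option).
// Minimal client for the Products service; assumes Node 18+ (global fetch)
// and the Products subgraph from this guide listening on port 4003.
const productsAPI = {
  async getProduct(id) {
    const res = await fetch('http://localhost:4003/', {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify({
        query: 'query ($id: ID!) { product(id: $id) { id price } }',
        variables: { id }
      })
    });
    const { data } = await res.json();
    if (!data?.product) throw new Error(`Product ${id} not found`);
    return data.product;
  }
};

startStandaloneServer(server, {
  listen: { port: 4002 },
  context: async () => ({ dataSources: { productsAPI } })
})
  .then(({ url }) => console.log(`Orders subgraph ready at ${url}`));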
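
The reference-stub behaviour can be checked without starting a server by calling the resolver functions directly. The sketch below uses a trimmed copy of the Orders resolvers and one sample order; the argument passed to `User.orders` is an assumed example of the representation the router would supply.

```javascript
const orders = [
  { id: 'o1', userId: '1', items: [{ productId: 'p1', quantity: 2, price: 29.99 }] }
];

// Trimmed copy of the Orders resolvers shown above.
const resolvers = {
  Order: {
    user: (order) => ({ __typename: 'User', id: order.userId }),
    items: (order) => order.items.map((item) => ({
      ...item,
      product: { __typename: 'Product', id: item.productId }
    }))
  },
  User: {
    orders: (user) => orders.filter((o) => o.userId === user.id)
  }
};

// Order.user returns only the key fields, never a full user record.
const userStub = resolvers.Order.user(orders[0]);

// Each item carries a Product stub next to its own quantity and price.
const firstItem = resolvers.Order.items(orders[0])[0];

// When the router asks Orders for User.orders, the parent is just a representation.
const ordersForUser = resolvers.User.orders({ __typename: 'User', id: '1' });
```

Nothing here touches the Users or Products data: the Orders subgraph works purely from ids, which is what lets the router decide whether further fetches are needed.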

Products Subgraph

Products is the “owner” of the Product entity and also demonstrates @shareable — a directive that lets multiple subgraphs legitimately define the same field (e.g., averageRating might be computed by both Products and a dedicated Reviews subgraph). Without @shareable, composition would fail with a duplicate-field error. Use it sparingly: it’s usually a sign that ownership boundaries are fuzzy, and fuzzy boundaries lead to drifting implementations.
// products-subgraph/index.js

const { ApolloServer } = require('@apollo/server');
const { startStandaloneServer } = require('@apollo/server/standalone');
const { buildSubgraphSchema } = require('@apollo/subgraph');
const { gql } = require('graphql-tag');

const typeDefs = gql`
  extend schema @link(url: "https://specs.apollo.dev/federation/v2.0", import: ["@key", "@shareable"])

  type Query {
    product(id: ID!): Product
    products(category: String): [Product!]!
    searchProducts(query: String!): [Product!]!
  }

  type Mutation {
    createProduct(input: CreateProductInput!): Product!
    updateProduct(id: ID!, input: UpdateProductInput!): Product!
    updateInventory(id: ID!, quantity: Int!): Product!
  }

  type Product @key(fields: "id") {
    id: ID!
    name: String!
    description: String
    price: Float!
    category: String!
    inventory: Int!
    images: [String!]!
    reviews: [Review!]!
    averageRating: Float @shareable
  }

  type Review {
    id: ID!
    userId: ID!
    rating: Int!
    comment: String
    createdAt: String!
  }

  input CreateProductInput {
    name: String!
    description: String
    price: Float!
    category: String!
    inventory: Int!
    images: [String!]
  }

  input UpdateProductInput {
    name: String
    description: String
    price: Float
    category: String
    images: [String!]
  }
`;

const products = [
  {
    id: 'p1',
    name: 'Wireless Headphones',
    description: 'High-quality wireless headphones',
    price: 29.99,
    category: 'Electronics',
    inventory: 100,
    images: ['/images/headphones.jpg'],
    reviews: [
      { id: 'r1', userId: '1', rating: 5, comment: 'Great sound!', createdAt: '2024-01-05' }
    ]
  },
  {
    id: 'p2',
    name: 'Laptop Stand',
    description: 'Ergonomic laptop stand',
    price: 49.99,
    category: 'Accessories',
    inventory: 50,
    images: ['/images/stand.jpg'],
    reviews: []
  },
  {
    id: 'p3',
    name: 'Mechanical Keyboard',
    description: 'RGB mechanical keyboard',
    price: 199.99,
    category: 'Electronics',
    inventory: 25,
    images: ['/images/keyboard.jpg'],
    reviews: [
      { id: 'r2', userId: '2', rating: 4, comment: 'Nice typing experience', createdAt: '2024-01-12' }
    ]
  }
];

const resolvers = {
  Query: {
    product: (_, { id }) => products.find(p => p.id === id),
    products: (_, { category }) => 
      category ? products.filter(p => p.category === category) : products,
    searchProducts: (_, { query }) =>
      products.filter(p => 
        p.name.toLowerCase().includes(query.toLowerCase()) ||
        p.description?.toLowerCase().includes(query.toLowerCase())
      )
  },
  
  Mutation: {
    createProduct: (_, { input }) => {
      const newProduct = {
        id: `p${products.length + 1}`,
        ...input,
        images: input.images || [],
        reviews: []
      };
      products.push(newProduct);
      return newProduct;
    },
    
    updateProduct: (_, { id, input }) => {
      const product = products.find(p => p.id === id);
      if (!product) throw new Error('Product not found');
      Object.assign(product, input);
      return product;
    },
    
    updateInventory: (_, { id, quantity }) => {
      const product = products.find(p => p.id === id);
      if (!product) throw new Error('Product not found');
      product.inventory = quantity;
      return product;
    }
  },
  
  Product: {
    __resolveReference: (productRef) => products.find(p => p.id === productRef.id),
    
    averageRating: (product) => {
      if (product.reviews.length === 0) return null;
      const sum = product.reviews.reduce((acc, r) => acc + r.rating, 0);
      return sum / product.reviews.length;
    }
  }
};

const server = new ApolloServer({
  schema: buildSubgraphSchema({ typeDefs, resolvers })
});

startStandaloneServer(server, { listen: { port: 4003 } })
  .then(({ url }) => console.log(`Products subgraph ready at ${url}`));
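
Because a `@shareable` field may be resolved by more than one subgraph, the implementations need to agree. One way to guard against drift is a shared fixture test. The sketch below assumes a second, hypothetical Reviews-side implementation of `averageRating` that works from a flat review list; only the Products-side logic comes from the code above.

```javascript
// Products subgraph implementation (same logic as the resolver above).
function averageRatingFromProduct(product) {
  if (product.reviews.length === 0) return null;
  const sum = product.reviews.reduce((acc, r) => acc + r.rating, 0);
  return sum / product.reviews.length;
}

// Hypothetical Reviews subgraph implementation working from a flat review list.
function averageRatingFromReviews(productId, allReviews) {
  const ratings = allReviews
    .filter((r) => r.productId === productId)
    .map((r) => r.rating);
  if (ratings.length === 0) return null;
  return ratings.reduce((acc, rating) => acc + rating, 0) / ratings.length;
}

const fixtureReviews = [
  { productId: 'p1', rating: 5 },
  { productId: 'p1', rating: 4 },
  { productId: 'p3', rating: 4 }
];

// Compare both implementations on every product, including one with no reviews.
const mismatches = ['p1', 'p2', 'p3'].filter((productId) => {
  const product = {
    reviews: fixtureReviews.filter((r) => r.productId === productId)
  };
  return (
    averageRatingFromProduct(product) !==
    averageRatingFromReviews(productId, fixtureReviews)
  );
});
```

An empty `mismatches` list means the two resolvers agree on the fixture; a rounding change or a different empty-list rule in either one would show up here before it reaches clients.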
Caveats and Common Pitfalls for Subgraphs
  1. Schema stitching vs federation confusion. Engineers from a pre-2019 stack keep calling federation “stitching” or try to apply stitching patterns (manually declaring type links in the gateway) on top of federation. The two are different systems. Apollo moved away from stitching for good reason: the gateway was the single point of schema knowledge, and every subgraph schema change required a gateway redeploy. Federation inverts this: subgraphs declare their own entity ownership, and the gateway composes them.
  2. Subgraph teams moving at different speeds break the supergraph. The Orders team ships schema changes weekly; the Users team ships monthly. When Orders introduces an extension of User that depends on a new User field, and the Users team hasn’t released it yet, composition fails in CI and blocks the Orders deploy. The root cause is schema coupling without release coordination.
  3. Over-extending another team’s entity. A subgraph that adds 20 fields to User creates an implicit ownership boundary problem: clients querying user.myField can’t tell whether that field lives with the Users team or with some other team, which makes bug triage a cross-team scavenger hunt.
  4. Returning full entities instead of references. The Orders subgraph resolves order.user by calling the Users service and returning the full User object, instead of returning { __typename: "User", id: order.userId }. This bypasses the query planner’s ability to batch and dedupe, re-creates the N+1 problem, and defeats the whole point of federation.
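The difference between stubs and eager fetching in pitfall 4 can be sketched as follows. This is a simplified model of the batching and deduplication described above, not the router's actual implementation; `fetchUsers` is a hypothetical stand-in for a single `_entities` request to the Users subgraph.

```javascript
// Order rows as the Orders subgraph would return them (two orders share a user).
const orderRows = [
  { id: 'o1', userId: '1' },
  { id: 'o2', userId: '1' },
  { id: 'o3', userId: '2' }
];

let usersSubgraphCalls = 0;

// Hypothetical stand-in for one `_entities` request to the Users subgraph.
function fetchUsers(representations) {
  usersSubgraphCalls += 1;
  return representations.map((rep) => ({ ...rep, name: `User ${rep.id}` }));
}

// Step 1: Orders returns stubs only.
const stubs = orderRows.map((o) => ({ __typename: 'User', id: o.userId }));

// Step 2: the stubs are deduplicated by type and key before fetching.
const unique = [
  ...new Map(stubs.map((s) => [`${s.__typename}:${s.id}`, s])).values()
];

// Step 3: a single batched call hydrates every distinct user.
const hydrated = new Map(fetchUsers(unique).map((u) => [u.id, u]));
const ordersWithUsers = orderRows.map((o) => ({
  ...o,
  user: hydrated.get(o.userId)
}));
```

Three orders produce two distinct representations and one call to Users. If each `order.user` resolver fetched its own user, the same data would cost three calls and the shared user would be loaded twice.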
Solutions and Patterns
  • Always return reference stubs for entities owned by other subgraphs. Let the router resolve them. The stub is { __typename: "User", id: "123" } in a Node.js resolver, or a User(id=id) instance in strawberry.
  • Use a schema registry with CI composition checks. Every PR that modifies a subgraph schema runs rover subgraph check against the current production supergraph. Breaking changes are caught before merge, not at runtime.
  • Establish entity ownership rules. One team owns the entity (defines @key). Other teams can extend it, but with restrictions: only new fields, never renames, never key changes. Document ownership in the schema registry so it shows in code review.
  • Coordinate schema releases across teams via a platform team. Major cross-entity changes get a dedicated Slack channel and a synchronized rollout window. Minor additions can flow independently.
  • Prefer owning the field over extension where possible. If the Users team can include an orders: [Order!] field on their own User type (and fetch from Orders internally), do it there; extension is for cases where the field genuinely belongs to a different domain.
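The entity ownership rules above (only new fields, never key changes) lend themselves to an automated check. The sketch below is a toy version under stated assumptions: types are described as plain objects rather than parsed SDL, and `checkExtension` is a hypothetical helper, not part of any Apollo tool.

```javascript
// Plain-object description of one entity as seen by its owning subgraph.
// A real check would parse SDL; this shape is an assumption for illustration.
const owned = { type: 'User', key: 'id', fields: ['id', 'email', 'name'] };

// Hypothetical rule check for a subgraph that extends someone else's entity.
function checkExtension(ownedType, extension) {
  const problems = [];
  if (extension.key !== ownedType.key) {
    problems.push(`key changed from "${ownedType.key}" to "${extension.key}"`);
  }
  for (const field of extension.addedFields) {
    if (ownedType.fields.includes(field)) {
      problems.push(`field "${field}" is already defined by the owner`);
    }
  }
  return problems;
}

// Adding a new field under the same key passes.
const okExtension = checkExtension(owned, { key: 'id', addedFields: ['orders'] });

// Changing the key and redefining an owned field are both reported.
const badExtension = checkExtension(owned, { key: 'email', addedFields: ['name'] });
```

A check like this would sit alongside composition checks in CI, so that ownership violations show up in code review rather than after a deploy.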
Strong Answer Framework
  1. Never rename in place. Add the new field as an alias alongside the old one. The schema now has productName: String! and title: String!, both resolving from the same underlying data.
  2. Mark the old field @deprecated. With a clear reason: @deprecated(reason: "Use 'title' instead. Removal planned 2026-Q3."). Apollo Studio surfaces deprecations to every client; this is the primary notification channel.
  3. Track usage per client. Apollo Studio reports field-level usage by client name. You know which of the 200 queries are hitting the deprecated field and which clients issue them. Contact those teams directly.
  4. Migrate clients one at a time. Frontend team A replaces productName with title in their queries. Deploy. Verify usage of productName from client A drops to zero in Apollo Studio. Repeat for teams B, C, D.
  5. Wait for usage to hit zero. All clients migrated. Old field has zero usage for two consecutive weeks. Now you can plan removal.
  6. Remove with a breaking-change gate. The PR removing productName requires approval from the schema registry’s composition check (which passes because no query references it anymore) plus sign-off from the platform team. Deploy during a low-traffic window.
  7. Keep a rollback path. If removal causes unexpected failures (a client you didn’t know about, a server-side script, a forgotten scheduled report), the field can be re-added in under an hour. Don’t truncate the underlying data column — leave it readable.
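Steps 1 and 2 of the framework can be sketched as an SDL fragment plus a resolver map in which both fields read the same stored value. The `display_name` column name and the `readTitle` helper are assumptions for illustration; the field names and deprecation reason come from the steps above.

```javascript
// SDL fragment: the new field sits next to the deprecated one.
const typeDefs = `
  type Product {
    id: ID!
    title: String!
    productName: String! @deprecated(reason: "Use 'title' instead. Removal planned 2026-Q3.")
  }
`;

// Both fields read the same stored value, so they cannot drift apart.
// The `display_name` column name is an assumption for illustration.
const readTitle = (row) => row.display_name;

const resolvers = {
  Product: {
    title: readTitle,
    productName: readTitle
  }
};

const row = { id: 'p1', display_name: 'Wireless Headphones' };
const oldValue = resolvers.Product.productName(row);
const newValue = resolvers.Product.title(row);
```

Sharing one reader function keeps the old and new fields identical for the whole migration window, and removing `productName` later is a two-line change that leaves the underlying data untouched.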
Real-World Example
Netflix’s Studio Platform team, in a 2021 GraphQL Summit talk, documented migrating Asset.title to Asset.displayTitle (different semantics: title was the internal name, displayTitle was the localized customer-facing name). The migration took 14 months because of the long tail of internal tools and batch jobs. They used Apollo Studio to identify each consuming service, filed tickets to each team, and kept the deprecated field in place until every consumer had migrated. The critical discipline was “never rush removal”: a single forgotten cron job that ran monthly would have caused an incident if the field disappeared before it migrated.
Senior Follow-up Questions
Follow-up 1: Two of the 200 queries can’t be migrated because they come from a legacy mobile app with users still running it a year later. Do you keep the old field forever?
Strong Answer: Often yes, at least for the lifetime of the legacy app. The cost of keeping a deprecated field is small (one resolver line, one SDL entry), and the risk of breaking a legacy client with active users is high (support tickets, bad reviews). Define a “sunset policy” with product/legal: after N versions of the legacy app are forced-upgraded or usage falls below a threshold, the field retires. Keep deprecation notices prominent in Apollo Studio so new clients don’t adopt the deprecated field.
Follow-up 2: What if during the migration you discover that productName and title should return slightly different values because of a business rule change?
Strong Answer: This is no longer a rename; it’s a semantic split. Make it explicit: productName keeps its original behavior (maybe adding an @deprecated(reason: "Use 'title' for the new display logic")), and title implements the new rule. Document the difference in the schema description field. Clients deciding which to use read the descriptions and choose deliberately. Don’t silently migrate semantics: that produces bugs only noticed after finance reports diverge.
Follow-up 3: How do you prevent someone from adding a new query that uses the deprecated field during the migration period?
Strong Answer: Apollo Studio’s schema linting rules include “no new usage of deprecated fields.” Add it to the CI pipeline for client repositories; any PR that introduces a query with a deprecated field fails the lint. For client teams not in the lint pipeline, the deprecation warnings in Apollo Explorer (the IDE) make the issue visible during development. If you need a stronger gate, add a custom directive of your own (for example, a hypothetical @requiresExplicitAck) that forces clients to opt in to using deprecated fields via a query hint.
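A CI gate of the kind described in Follow-up 3 can be approximated with a small script in client repositories. The sketch below is deliberately naive and rests on assumptions: it matches field names with a regular expression instead of parsing the query, and the `deprecatedFields` list is maintained by hand. A production check would parse operations with a GraphQL parser and resolve each field against its parent type.

```javascript
// Field names currently marked @deprecated (maintained by hand in this sketch).
const deprecatedFields = ['productName'];

// Naive check: strip comments and string literals, then look for each
// deprecated field name as a whole word in what remains.
function findDeprecatedUsage(queryText, deprecated) {
  const stripped = queryText
    .replace(/#.*$/gm, '')
    .replace(/"(?:[^"\\]|\\.)*"/g, '""');
  return deprecated.filter((field) =>
    new RegExp(`\\b${field}\\b`).test(stripped)
  );
}

const oldQuery = 'query { product(id: "p1") { productName price } }';
const newQuery = `
  # productName was replaced by title
  query { product(id: "p1") { title price } }
`;

const oldHits = findDeprecatedUsage(oldQuery, deprecatedFields);
const newHits = findDeprecatedUsage(newQuery, deprecatedFields);
```

In a pipeline, a non-empty result for any changed query file would fail the build. Note that the mention of `productName` inside a comment in `newQuery` is ignored because comments are stripped first.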
Common Wrong Answers
  • “We’ll coordinate a synchronized big-bang rename.” Never works. Someone is on vacation, a client is forgotten, a legacy system has a hardcoded query. One team that rolls back blocks the whole migration. Deprecations are asynchronous on purpose.
  • “The client teams should migrate first, then we rename.” Client teams won’t migrate without motivation. Deprecation with a dated removal plan creates the motivation; pure “please migrate” emails get ignored.
Further Reading
  • Netflix Engineering, “GraphQL at Netflix Studio” (2021 Summit talk) — the exact migration pattern.
  • Apollo Docs, “Schema change management” — official tooling for the workflow.
  • Lee Byron, “GraphQL at Facebook” (2015) — origin story of schema evolution principles.
Strong Answer Framework
  1. Pick a single convention for the federated schema. GraphQL convention is camelCase for fields. Pick it, document it, and stop debating.
  2. Inventory existing deviations. A week-long project: go through every subgraph’s schema and list the fields that deviate. Roughly triage into “easy to migrate” (no external clients), “hard to migrate” (many clients), and “historical” (nobody knows why it’s snake_case but it is).
  3. Standardize new subgraphs immediately. The onboarding team writes camelCase from day one. Their schema is correct; they don’t carry legacy debt.
  4. For existing subgraphs, treat the deviation as a deprecation. Add the camelCase version alongside the old one, deprecate the old, migrate clients, remove (same playbook as the previous scenario).
  5. Document conventions in a living style guide. Include naming, nullability defaults, enum uppercasing, directive usage. Enforce with a linter that runs in every subgraph’s CI.
  6. Avoid mapping in the gateway. Some teams try to rename fields at the gateway layer. Don’t — it hides the inconsistency rather than fixing it, and the mapping is yet another place to maintain state. Normalize at the source.
  7. Accept that some inconsistencies will persist. Perfectly consistent schema across 20 subgraphs over 5 years is unrealistic. Pick the high-traffic inconsistencies to fix, accept the rest.
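Steps 2 and 5 above can be combined into a lint with a ratchet, meaning a check that tolerates a recorded baseline of legacy violations but rejects any new one. The following is a minimal sketch, not an existing tool: the function name, the "Type.field" input format, and the sample field names are all assumptions made for illustration.

```javascript
// Hypothetical lint helper: flags non-camelCase field names, with a ratchet.
const CAMEL_CASE = /^[a-z][a-zA-Z0-9]*$/;

/**
 * @param {string[]} fields        Qualified names such as "User.first_name".
 * @param {Set<string>} baseline   Known legacy violations tolerated for now.
 */
function lintFieldNames(fields, baseline = new Set()) {
  const violations = fields.filter((qualified) => {
    const fieldName = qualified.split('.').pop();
    return !CAMEL_CASE.test(fieldName);
  });
  return {
    // Anything not already in the baseline should fail CI.
    newViolations: violations.filter((v) => !baseline.has(v)),
    // Baseline entries that no longer violate can be deleted from the baseline.
    fixed: [...baseline].filter((b) => !violations.includes(b)),
  };
}

const result = lintFieldNames(
  ['User.firstName', 'User.last_name', 'Order.total_cents'],
  new Set(['User.last_name', 'Order.created_at'])
);
console.log(result);
```

Shrinking the baseline whenever `fixed` is non-empty is what makes the ratchet only ever tighten.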
Real-World Example
The Airbnb platform team in 2020 published a style guide, “Reconciling the supergraph,” addressing exactly this. They had ~60 subgraphs with roughly 15% field name inconsistencies. The approach was to publish the standard, run lint in CI with a ratchet (no new violations, but existing ones allowed), and prioritize migration of the top-50 most-queried inconsistent fields. After 18 months, inconsistency was under 3%. Complete eradication was deemed not worth the effort.
Senior Follow-up Questions
Follow-up 1: One team insists on snake_case because their backend is Python and they want schema to mirror their resolvers. How do you handle the disagreement?
Strong Answer: Remind them the schema is a public contract; the resolver is an implementation detail. Strawberry generates camelCase field names from Python snake_case automatically (the auto_camel_case setting, which is on by default). The Python code stays idiomatic; the schema stays federation-standard. If the team still refuses, escalate with data: query logs showing every other subgraph uses camelCase, and clients that query across subgraphs have to context-switch for this one team’s schema. It’s an organization norm, not a language preference.
Follow-up 2: How do you enforce the convention in a world of 30 subgraph repos?
Strong Answer: Central linter shipped as a package (@company/graphql-schema-lint). Every subgraph repo’s CI imports it and runs it as a required check. Rules are defined centrally and updated centrally; subgraphs inherit new rules automatically. For exceptions, the repo adds a lint-ignore comment with a ticket reference and a removal date, so exceptions are tracked, not buried.
Follow-up 3: The federation supergraph composition succeeds but the actual query-time experience is inconsistent: some resolvers are synchronous, some are async, some have DataLoaders. How do you address that?
Strong Answer: Schema consistency is necessary but not sufficient. Add a “subgraph maturity model” document: level 1 is “schema compliant,” level 2 is “uses DataLoaders for N+1 avoidance,” level 3 is “supports cache hints,” level 4 is “passes chaos test suite.” Rate each subgraph and publish the ratings. New subgraphs must reach level 2 at launch; older subgraphs can stay at level 1 but the platform team prioritizes help for teams wanting to level up. Make it a visible, measurable form of technical excellence rather than a scolding.
Common Wrong Answers
  • “Rewrite all subgraphs to the new convention in one release.” Coordinated rewrites across 30 repos fail. Stage incrementally.
  • “Leave it — clients can handle both.” Clients end up with conditional logic for each subgraph’s quirks, which is tech debt on the client side instead of the server side. Consolidate at the source.
Further Reading
  • Airbnb Engineering Blog, “Schema Stitching at Scale” and follow-ups on their federation migration.
  • Apollo GraphQL Docs, “Schema design best practices.”
  • Shopify GraphQL Design Tutorial — opinionated and thorough style guide.

Apollo Gateway

The gateway (sometimes called the router) is the single entry point your clients talk to. Its job is threefold: compose the supergraph schema from all subgraph schemas, plan each incoming query as a DAG of subgraph fetches, and execute that plan while handling cross-cutting concerns like auth header forwarding, tracing, and error propagation.

In production, the recommended choice is the Apollo Router (written in Rust) because its query planning and execution are substantially faster than the Node.js gateway, which matters when you’re processing thousands of QPS. For development and small-scale deployments, the Node.js ApolloGateway is plenty capable and easier to customize.

A key architectural point: the gateway is stateless. All it holds is the composed supergraph schema. Requests can be load-balanced across any number of gateway instances. This is deliberately simple: keeping the gateway stateless means you can scale it horizontally without worrying about session affinity or distributed cache consistency. When you need per-user data (auth context, dataloader caches), it lives in request-scoped context, not gateway state.

For Python stacks, there’s no equivalent Python-native gateway with the same maturity as Apollo’s. The pragmatic pattern is to run the Apollo Router (Rust binary, language-agnostic) in front of Python strawberry subgraphs. The router doesn’t care what language implements the subgraphs; it speaks federation over HTTP/GraphQL. Below we show Node.js gateway code alongside a YAML router config and a Python FastAPI-based auth proxy.
// gateway/index.js

const { ApolloServer } = require('@apollo/server');
const { expressMiddleware } = require('@apollo/server/express4');
const { ApolloGateway, IntrospectAndCompose, RemoteGraphQLDataSource } = require('@apollo/gateway');
const express = require('express');
const cors = require('cors');
const jwt = require('jsonwebtoken');

// Custom data source for authentication
class AuthenticatedDataSource extends RemoteGraphQLDataSource {
  willSendRequest({ request, context }) {
    // Forward auth token to subgraphs
    if (context.token) {
      request.http.headers.set('authorization', context.token);
    }
    
    // Forward user info
    if (context.user) {
      request.http.headers.set('x-user-id', context.user.id);
      request.http.headers.set('x-user-role', context.user.role);
    }
  }
}

const gateway = new ApolloGateway({
  supergraphSdl: new IntrospectAndCompose({
    subgraphs: [
      { name: 'users', url: 'http://localhost:4001/graphql' },
      { name: 'orders', url: 'http://localhost:4002/graphql' },
      { name: 'products', url: 'http://localhost:4003/graphql' }
    ]
  }),
  buildService({ name, url }) {
    return new AuthenticatedDataSource({ url });
  }
});

const server = new ApolloServer({
  gateway,
  plugins: [
    {
      async requestDidStart() {
        return {
          async didResolveOperation({ request, document, operationName }) {
            console.log(`Operation: ${operationName}`);
          },
          async executionDidStart() {
            return {
              async executionDidEnd(err) {
                if (err) {
                  console.error('Execution error:', err);
                }
              }
            };
          }
        };
      }
    }
  ]
});

async function startServer() {
  await server.start();

  const app = express();
  
  app.use(cors());
  app.use(express.json());
  
  // Authentication middleware
  app.use('/graphql', (req, res, next) => {
    const token = req.headers.authorization?.replace('Bearer ', '');
    
    if (token) {
      try {
        const user = jwt.verify(token, process.env.JWT_SECRET);
        req.user = user;
      } catch (err) {
        // Invalid token - continue without user
      }
    }
    
    next();
  });
  
  app.use(
    '/graphql',
    expressMiddleware(server, {
      context: async ({ req }) => ({
        token: req.headers.authorization,
        user: req.user
      })
    })
  );
  
  app.listen(4000, () => {
    console.log('Gateway ready at http://localhost:4000/graphql');
  });
}

startServer().catch(console.error);

Supergraph Configuration (Alternative to IntrospectAndCompose)

For production, don’t use runtime introspection (IntrospectAndCompose). It’s convenient for local dev, but it tightly couples gateway startup to subgraph availability and makes rollbacks hard. Instead, compose the supergraph schema ahead of time in CI (using rover supergraph compose against a schema registry), publish the resulting supergraph.graphql as an artifact, and point your gateway at it. This gives you atomic deploys, clean rollbacks, and schema validation before anything touches production.
# supergraph.yaml - For production with Apollo Studio

federation_version: =2.3.2  # exact composition version (major.minor.patch); pin to the release you have tested

subgraphs:
  users:
    routing_url: http://users-service:4001/graphql
    schema:
      subgraph_url: http://users-service:4001/graphql
      
  orders:
    routing_url: http://orders-service:4002/graphql
    schema:
      subgraph_url: http://orders-service:4002/graphql
      
  products:
    routing_url: http://products-service:4003/graphql
    schema:
      subgraph_url: http://products-service:4003/graphql
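On the gateway side, booting from the pre-composed artifact can be as small as reading a file at startup. The sketch below assumes CI has already written the composed schema to disk; the SUPERGRAPH_PATH variable, the default path, and the loadSupergraphSdl helper are illustrative names, not part of any Apollo API. The ApolloGateway wiring is shown as a comment so the snippet runs without extra packages.

```javascript
const fs = require('node:fs');

// Hypothetical location; in practice, wherever CI publishes the composed artifact.
const SUPERGRAPH_PATH = process.env.SUPERGRAPH_PATH || './supergraph.graphql';

function loadSupergraphSdl(path) {
  // Fail fast with a clear message if the artifact is missing or empty,
  // instead of starting a gateway that cannot serve any query.
  const sdl = fs.readFileSync(path, 'utf8');
  if (sdl.trim().length === 0) {
    throw new Error(`Supergraph SDL at ${path} is empty`);
  }
  return sdl;
}

// Wiring (not executed here; requires @apollo/gateway):
//   const { ApolloGateway } = require('@apollo/gateway');
//   const gateway = new ApolloGateway({
//     supergraphSdl: loadSupergraphSdl(SUPERGRAPH_PATH),
//   });
```

Because startup no longer talks to any subgraph, a rollback is a matter of pointing the gateway at the previous file and restarting.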
Caveats and Common Pitfalls for the Federation Gateway
  1. The gateway as SPOF (single point of failure). Every client request goes through the gateway. If the gateway process dies, the entire API is down regardless of whether the subgraphs are healthy. Teams deploy one gateway instance in one region and then act surprised when a single crash takes down production.
  2. Persisted query security neglected. Production gateways accept arbitrary queries from clients. A malicious client sends a deeply nested query that fans out across subgraphs and brings everything to its knees. Persisted queries (clients only allowed to send pre-registered query hashes) close this attack surface but are often skipped because “we’ll add it later.”
  3. IntrospectAndCompose in production. The gateway pulls schemas from subgraphs at startup. A subgraph is down during a gateway restart; the gateway cannot start. Worse, the gateway picks up a breaking schema change silently because nobody gated composition in CI.
  4. Gateway logs hide which subgraph failed. An error log says “query failed” but doesn’t attribute the failure to a specific subgraph. Ops spends 40 minutes correlating timestamps before finding that Orders was the culprit.
  5. Gateway response time balloons because of sequential subgraph calls. The query plan has a dependency chain: Users → Orders → Products. Each subgraph takes 100ms. The gateway takes 300ms+ even though no single subgraph is slow. Common when @requires creates unnecessary dependencies.
Solutions and Patterns
  • Run the gateway in multiple regions with a load balancer in front. At least 3 instances, auto-scaling on CPU. The gateway is stateless, so this is cheap and essential.
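  • Adopt persisted queries before production launch. Clients ship with a map of query hashes → query text, and the gateway only accepts registered hashes. Note that Apollo’s Automatic Persisted Queries (APQ) on its own is a bandwidth optimization: it registers whatever query a client sends, so it is not an allowlist. The security benefit comes from rejecting operations that were not registered ahead of time (safelisting).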
  • Compose supergraphs ahead of time in CI. rover supergraph compose produces a supergraph SDL file that is committed as an artifact. The gateway boots from the file, not by introspecting subgraphs. Rollbacks are a file swap.
  • Forward subgraph identity in error responses. The gateway wraps subgraph errors with extensions.serviceName so ops knows immediately which service to investigate.
  • Instrument query plan latency per subgraph. Apollo’s OpenTelemetry integration emits spans per subgraph call. A dashboard showing “% of request time spent in each subgraph” makes slow plans diagnosable at a glance.
  • Rate-limit at the gateway by query complexity score. Depth-limited, cost-limited, and per-user/per-operation rate limits. GraphQL’s expressive power means a single query can be as expensive as 1000 REST calls — the rate limit must reflect that.
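The persisted-query allowlist from the list above can be sketched in a few lines. This is an illustration under assumptions, not Apollo's implementation: the manifest format, function names, and the PERSISTED_QUERY_NOT_ALLOWED code are hypothetical. The request shape (extensions.persistedQuery.sha256Hash) follows the convention used by the persisted-query protocol.

```javascript
const crypto = require('node:crypto');

const sha256 = (text) => crypto.createHash('sha256').update(text).digest('hex');

// Build-time step: client operations are hashed into a manifest that ships to the gateway.
function buildManifest(operations) {
  return new Map(operations.map((query) => [sha256(query), query]));
}

// Request-time step: only a registered hash resolves to query text.
// Free-form query strings in the body are ignored entirely.
function resolvePersistedQuery(manifest, body) {
  const hash = body?.extensions?.persistedQuery?.sha256Hash;
  if (!hash || !manifest.has(hash)) {
    return { ok: false, status: 400, error: 'PERSISTED_QUERY_NOT_ALLOWED' };
  }
  return { ok: true, query: manifest.get(hash) };
}

const manifest = buildManifest(['query Me { me { id name } }']);
const knownHash = sha256('query Me { me { id name } }');
console.log(resolvePersistedQuery(manifest, {
  extensions: { persistedQuery: { version: 1, sha256Hash: knownHash } },
}));
```

The design choice that matters is the rejection branch: an unknown hash is an error, never a prompt to register the query.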
Strong Answer Framework
  1. Get a single slow query and its query plan. Capture a 99th-percentile request with its full plan (Apollo Studio shows this). The plan is a DAG: which subgraph calls run in parallel, which are sequential, which feed data from one into another.
  2. Compute the critical path. If the plan has Users (50ms) → Orders (90ms) → Products (70ms) sequential, critical path is 210ms. If parallel, critical path is 90ms. The gap between your measured 800ms and the computed critical path is the problem.
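  3. Check for N+1 entity resolution. For a query like users { orders { product { name } } } that touches 50 products, the gateway typically sends the Products subgraph one _entities fetch per plan step carrying all 50 entity representations, but the subgraph runs __resolveReference once per representation. If each of those is its own database or service lookup, performed one after another, the result is 50 × 20ms = 1000ms even though each lookup is fast. DataLoader in the Products subgraph’s __resolveReference batches these into one lookup.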
  4. Check for serialized @requires. If Subgraph A’s resolver declares @requires(fields: "x y z"), the gateway makes the dependency fetch first, sequentially. A chain of three such fetches is three serial round trips. Audit @requires usage and inline small required fields into the declaring subgraph when possible.
  5. Check connection pooling to subgraphs. The gateway opens a new HTTP connection per subgraph call by default. TLS handshake + TCP setup each add 20-50ms. Enable keep-alive with a sufficient pool size (typically 50-200 per subgraph).
  6. Check for gateway CPU bottleneck. The gateway does query planning (expensive for complex queries) and response merging. If CPU is saturated on the gateway host, responses queue up. Solution: scale out the gateway, or move to the Rust-based Apollo Router, which is considerably more efficient than the Node.js gateway.
  7. Check for subgraph cold starts. In serverless or autoscaled deployments, infrequently-hit subgraphs may cold-start on a request. This looks like “100ms usual, 2000ms occasional”; the gateway inherits the cold-start latency. Keep warm instances or pre-warm on scale events.
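The critical-path arithmetic in step 2 can be checked mechanically once a query plan is written down as a dependency graph. The sketch below uses a simplified, assumed plan format (a map of step name to duration and dependencies), not the structure Apollo Studio exports, and it assumes the plan is acyclic.

```javascript
// Each step: { durationMs, dependsOn: [step names] }. Names are illustrative.
function criticalPathMs(plan) {
  const memo = new Map();
  const finishTime = (name) => {
    if (memo.has(name)) return memo.get(name);
    const step = plan[name];
    // A step starts when its slowest dependency finishes (0 if it has none).
    const start = Math.max(0, ...step.dependsOn.map((dep) => finishTime(dep)));
    const finish = start + step.durationMs;
    memo.set(name, finish);
    return finish;
  };
  return Math.max(...Object.keys(plan).map((name) => finishTime(name)));
}

const sequential = {
  users: { durationMs: 50, dependsOn: [] },
  orders: { durationMs: 90, dependsOn: ['users'] },
  products: { durationMs: 70, dependsOn: ['orders'] },
};
const parallel = {
  users: { durationMs: 50, dependsOn: [] },
  orders: { durationMs: 90, dependsOn: [] },
  products: { durationMs: 70, dependsOn: [] },
};

console.log(criticalPathMs(sequential)); // 210
console.log(criticalPathMs(parallel));   // 90
```

Comparing this number with the measured response time shows how much latency the plan itself cannot explain.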
Real-World Example
At Medium (the publishing platform), the engineering team blogged in 2020 about a 600ms-vs-60ms discrepancy on their federated graph. The root cause: @requires(fields: "authorName") on a Post resolver triggered a separate fetch to Users for each post in a feed query returning 20 posts. 20 × 30ms = 600ms. The fix was to add authorName directly to the Posts subgraph (denormalizing at write time via an event subscription from Users), eliminating the fetch entirely. Gateway response time dropped to 70ms.
Senior Follow-up Questions
Follow-up 1: How do you measure the per-subgraph latency without adding too much observability overhead?
Strong Answer: OpenTelemetry tracing at the gateway: every subgraph call becomes a span. Sampling is crucial: a 1% sample of production traffic gives statistical coverage without flooding the tracing backend. For latency distributions, use a histogram metric per subgraph (P50, P95, P99) in Prometheus; this is always-on and cheap. Reserve full tracing for incidents and targeted investigations.
Follow-up 2: The plan is already parallel everywhere it can be, but the response is still slow. What else could be going on?
Strong Answer: Response merging at the gateway. For large responses (thousands of rows), the gateway spends CPU stitching subgraph results together. Check CPU profile on the gateway during slow requests. If merging is the bottleneck, the fix is either response pagination (return fewer rows per query), response streaming (@defer directive so initial response lands quickly and the rest streams), or migrating to Apollo Router (Rust) which is significantly faster at merging.
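Follow-up 3: You identified N+1 but DataLoader is already installed on the subgraph. It’s still slow. Why?
Strong Answer: DataLoader only batches load calls made within the same request and the same tick, so first confirm where the fan-out actually happens. Check the query plan and the subgraph’s access logs: the gateway normally sends one _entities request per fetch step, carrying all the entity representations. If the subgraph receives one request but still issues 50 lookups, the loader is not batching. The usual causes are a new loader being created inside each resolver call instead of once per request, a resolver that awaits something else before calling load (so the calls land in different ticks), or a batch function that loops over the keys and queries them one at a time. If the subgraph really does receive many separate _entities requests, the problem is in the plan (many small fetch steps, often from @requires chains), not in DataLoader. The target state is one HTTP call with 50 entity references and one backing lookup per subgraph, not 50 of either.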
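To make the same-tick batching behavior concrete, here is a minimal stand-in for DataLoader written from scratch. It is a teaching sketch, not the dataloader package: it has no caching and no per-key error handling, and createBatchLoader, loadProduct, and the sample data are assumed names.

```javascript
// Minimal illustration of tick-based batching (not the real DataLoader).
function createBatchLoader(batchFn) {
  let queue = [];
  return function load(key) {
    return new Promise((resolve, reject) => {
      queue.push({ key, resolve, reject });
      if (queue.length === 1) {
        // First key of this tick: schedule exactly one flush.
        process.nextTick(async () => {
          const batch = queue;
          queue = [];
          try {
            const values = await batchFn(batch.map((item) => item.key));
            batch.forEach((item, i) => item.resolve(values[i]));
          } catch (err) {
            batch.forEach((item) => item.reject(err));
          }
        });
      }
    });
  };
}

// Hypothetical data source that records how often it is hit.
const batchCalls = [];
const loadProduct = createBatchLoader(async (ids) => {
  batchCalls.push(ids);
  return ids.map((id) => ({ id, name: `Product ${id}` }));
});

// Shape of a subgraph reference resolver that goes through the loader.
const Product = { __resolveReference: (ref) => loadProduct(ref.id) };

// Three references resolved in the same tick produce one backing call.
const done = Promise.all(
  ['1', '2', '3'].map((id) => Product.__resolveReference({ id }))
);
done.then(() => console.log('backing calls:', batchCalls.length));
```

If a resolver awaited something before calling the loader, the three calls could land in different ticks and each would trigger its own flush, which is the failure mode described above.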
Common Wrong Answers
  • “The network must be slow.” Rarely the actual issue in a single region with properly-sized pools. Measure before blaming infrastructure.
  • “Add a cache.” Caching in GraphQL is much harder than REST because queries are unique. Caching won’t help if each request is a new query shape. Solve the fan-out problem first, then cache selectively.
Further Reading
  • Apollo Blog, “Performance tuning Apollo Gateway” (2022).
  • Medium Engineering, “GraphQL fan-out avoidance at Medium” (2020).
  • Rishi Nair and team at Meta, “DataLoader: history and internals” (GraphQL Conf 2023 talk).
Strong Answer Framework
  1. Horizontal scaling is table stakes. The gateway is stateless, so put at least 3 instances behind a load balancer in at least 2 availability zones. Configure autoscaling on CPU (typically scale out above 60% sustained). This alone eliminates “single crash = total outage.”
  2. Use Apollo Router (Rust) over Apollo Gateway (Node.js) in production. The Rust binary has predictable performance, lower memory usage, and tighter crash characteristics than Node. The gateway crashes most teams see are OOM in Node under load.
  3. Load the supergraph from a static file, not from live introspection. A gateway that starts by introspecting every subgraph is fragile — any subgraph being down prevents startup. A gateway that boots from a supergraph SDL file starts in seconds and is decoupled from subgraph availability.
  4. Implement health checks that don’t depend on all subgraphs. The gateway’s /healthz endpoint should return healthy as long as the gateway process is alive. Subgraph health is a separate concern; one subgraph being down should not cascade to “the gateway is dead.”
  5. Set per-subgraph timeouts aggressively. If a subgraph hangs, the gateway waits. Default timeouts are often too generous (30s). Set them based on the subgraph’s 99th percentile + margin (maybe 2s). Fail fast, return partial data.
  6. Deploy gateway updates with canaries. 1% of traffic to the new version first, then 10%, then 100% over a few minutes. A buggy gateway deploy is the most common way federation goes down; canary protects against this.
  7. Practice failover. Quarterly chaos drill: kill one gateway instance during peak, watch the LB route around it, verify no user-visible errors. If it doesn’t work in the drill, it won’t work during an incident.
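Step 5 (fail fast, return partial data) can be sketched as a small wrapper around each subgraph call. This is an assumption-laden illustration rather than gateway configuration: withDeadline, the error codes, and hungSubgraph are hypothetical names, and callSubgraph stands for any function that accepts an AbortSignal the way fetch(url, { signal }) does.

```javascript
// Runs one subgraph call under a deadline.
async function withDeadline(callSubgraph, timeoutMs) {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return { data: await callSubgraph(controller.signal), error: null };
  } catch (err) {
    // Report the failure for this subgraph only, so the caller can still
    // return data from the subgraphs that answered in time.
    const code = controller.signal.aborted ? 'SUBGRAPH_TIMEOUT' : 'SUBGRAPH_ERROR';
    return { data: null, error: { code, message: err.message } };
  } finally {
    clearTimeout(timer);
  }
}

// Stand-in for a hung subgraph: answers after 1s unless aborted first.
const hungSubgraph = (signal) =>
  new Promise((resolve, reject) => {
    const t = setTimeout(() => resolve({ ok: true }), 1000);
    signal.addEventListener('abort', () => {
      clearTimeout(t);
      reject(new Error('request aborted'));
    });
  });

const timedOut = withDeadline(hungSubgraph, 20);
const healthy = withDeadline(async () => ({ ok: true }), 20);
timedOut.then((result) => console.log(result.error.code));
```

The wrapper only helps if the underlying call honors the abort signal; a call that ignores it will keep the request open past the deadline.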
Real-World Example
Expedia’s federation migration, documented at GraphQL Summit 2022, hit a severe incident early on when their single gateway region had a bad deploy that caused a cascading memory leak. Every gateway instance crashed within 20 minutes of the deploy. Their entire booking API was down for 45 minutes. Post-incident, they mandated: (a) Apollo Router instead of Node gateway, (b) supergraph served from S3 artifact, (c) two-region gateway deployment with 10% canary, and (d) a kill switch to revert to a previous supergraph SDL file within 90 seconds. In the following 18 months, their gateway availability was 99.995% despite handling a 4x traffic increase.
Senior Follow-up Questions
Follow-up 1: What if a subgraph is intentionally down for deployment? Should the gateway refuse requests for its types?
Strong Answer: No. The gateway should return partial responses with typed errors for the downed subgraph’s fields. Clients that tolerate partial data (usually possible for non-critical fields like recommendations) get a degraded but functional response. Clients that cannot tolerate nulls error out for that specific path. Blanket refusal is worse than partial failure because it propagates the outage surface beyond the actual affected domain.
Follow-up 2: How does canary deployment interact with schema composition? A canary gateway on a new supergraph version might serve queries differently than the non-canary one.
Strong Answer: Two approaches. First, backward-compatible supergraph changes only in canaries: no field removals, no type changes that affect existing queries. This limits what canary can prove but keeps safety. Second, canary by query hash: the canary gateway serves queries that have been validated against the new schema in CI, others go to the old gateway. This is more complex but enables canarying breaking changes. Most teams pick the first approach as the simpler default.
Follow-up 3: The Apollo Router is managed by a different team than the Python subgraphs. How do you coordinate deploys?
Strong Answer: Release trains. The router team owns the binary and its version; subgraph teams own their own services. The supergraph SDL (the composition artifact) is the contract between them. A subgraph team publishes a schema change to the registry, the router team’s CI picks up the new composed supergraph and deploys a config update (not a binary update). This is extremely fast because it’s a file swap, not a rebuild. Cross-team coordination is needed only for router binary upgrades or for cross-subgraph breaking changes, which should be rare.
Common Wrong Answers
  • “Add caching to make the gateway optional.” Caching in front of the gateway helps for repeated queries but doesn’t address the failure mode (the gateway itself dying). You need gateway redundancy, not cache redundancy.
  • “Have clients talk directly to subgraphs as a fallback.” This breaks the entire federation abstraction, re-exposes internal schemas, and defeats the benefit of having a single endpoint. It is not a viable fallback.
Further Reading
  • Apollo, “Router vs Gateway comparison” (2023) — operational guidance.
  • Expedia Engineering, “Federation at Scale” (GraphQL Summit 2022).
  • “Site Reliability Engineering” (Google, 2016) — the general principles apply: canarying, defense in depth, graceful degradation.

Advanced Federation Patterns

Entity References and @requires

The @requires directive is one of the most powerful (and most misunderstood) features of Federation. It lets a subgraph declare: “I need field X from another subgraph before I can compute field Y.” The gateway automatically fetches the required data first, then passes it to your resolver. This avoids the need for your subgraph to make direct HTTP calls to other services; the gateway handles the orchestration for you.

Production pitfall: Overusing @requires can create deep dependency chains where the gateway needs to make sequential calls across multiple subgraphs to resolve a single field. Monitor your query plans with Apollo Studio to catch these chains before they impact latency.
# In Reviews subgraph - needs product name from Products subgraph

type Review @key(fields: "id") {
  id: ID!
  rating: Int!
  comment: String
  product: Product!
}

# Extend Product with additional fields
type Product @key(fields: "id") {
  id: ID! @external
  name: String! @external
  
  # Computed field that requires external data
  reviewSummary: ReviewSummary @requires(fields: "name")
}

type ReviewSummary {
  productName: String!  # Comes from @requires
  averageRating: Float!
  totalReviews: Int!
}
The resolver for reviewSummary receives the externally-required name as part of the entity representation passed to it — it doesn’t have to fetch it. This is declarative cross-service data flow: you describe the dependency in SDL, the planner figures out the fetch sequence, and your code stays simple.
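// Reviews subgraph resolver
// `reviews` stands in for this subgraph's data source (e.g. an in-memory array)
const resolvers = {
  Product: {
    // @requires means 'name' will be fetched before this resolver runs
    reviewSummary: (product) => {
      const productReviews = reviews.filter(r => r.productId === product.id);
      const totalReviews = productReviews.length;

      // No reviews: return null (the field is nullable) rather than computing
      // 0 / 0 = NaN, which would violate averageRating: Float!
      if (totalReviews === 0) return null;

      return {
        productName: product.name,  // Available due to @requires
        averageRating: productReviews.reduce((sum, r) => sum + r.rating, 0) / totalReviews,
        totalReviews
      };
    }
  }
};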

Compound Keys

Sometimes a single-field key isn’t enough — cart items, for example, are naturally identified by the combination of user and product. Federation supports compound keys through the same @key directive, just with space-separated fields.
# Entity with compound key
type CartItem @key(fields: "userId productId") {
  userId: ID!
  productId: ID!
  quantity: Int!
  addedAt: String!
}
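On the resolver side, a compound key means the entity representation carries every field named in the @key, and the reference resolver has to use all of them. A minimal sketch, assuming an in-memory store keyed by "userId:productId" (the store, its key format, and the sample record are illustrative):

```javascript
// In-memory stand-in for the cart store.
const cartItems = new Map([
  ['u1:p1', { userId: 'u1', productId: 'p1', quantity: 2, addedAt: '2024-01-01T00:00:00Z' }],
]);

const resolvers = {
  CartItem: {
    // The representation includes both fields from @key(fields: "userId productId").
    __resolveReference({ userId, productId }) {
      return cartItems.get(`${userId}:${productId}`) ?? null;
    },
  },
};

console.log(resolvers.CartItem.__resolveReference({ userId: 'u1', productId: 'p1' }));
```

Returning null for an unknown pair lets the gateway null out that entity instead of failing the whole fetch.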

Interface Entities

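Federation also supports interfaces as entities (Federation 2.3+), which is useful for Relay-style node(id) queries and polymorphic APIs. The gateway can route a single query through the correct concrete-type subgraph based on __typename. The schema below shows the simpler building block, a shared Node interface implemented by entity types; in the 2.3+ form, the interface itself also carries a @key, and other subgraphs contribute fields to it through @interfaceObject.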
# Shared interface across subgraphs
interface Node {
  id: ID!
}

type User implements Node @key(fields: "id") {
  id: ID!
  name: String!
}

type Product implements Node @key(fields: "id") {
  id: ID!
  name: String!
}

# Query that returns any Node
type Query {
  node(id: ID!): Node
}

Error Handling in Federation

Error handling in federation has to solve two problems at once: local subgraph errors (a DB query failed, a validation check rejected input) and cross-subgraph failures (one subgraph is down while others are healthy). GraphQL’s response model is kind to both: the errors array can contain partial failures alongside a data object with whatever fields resolved successfully. This is fundamentally more forgiving than REST, where a 500 from one service often poisons the entire response. The pattern below shows typed error classes that carry structured extensions (error codes, entity names, field names) so clients can programmatically branch on failure modes rather than parsing error strings. The trick is to resist the temptation to throw on every failure — for non-critical nested fields, returning null with a logged warning often gives users a better experience than a full query failure.
// error-handling.js - Federation-aware error handling

const { GraphQLError } = require('graphql');

// Custom error classes
class NotFoundError extends GraphQLError {
  constructor(entity, id) {
    super(`${entity} with id '${id}' not found`, {
      extensions: {
        code: 'NOT_FOUND',
        entity,
        id
      }
    });
  }
}

class UnauthorizedError extends GraphQLError {
  constructor(message = 'Not authorized') {
    super(message, {
      extensions: {
        code: 'UNAUTHORIZED'
      }
    });
  }
}

class ValidationError extends GraphQLError {
  constructor(message, field) {
    super(message, {
      extensions: {
        code: 'VALIDATION_ERROR',
        field
      }
    });
  }
}

// Error formatting in gateway
const formatError = (error) => {
  // Log all errors
  console.error('GraphQL Error:', error);
  
  // Don't expose internal errors to clients
  if (error.extensions?.code === 'INTERNAL_SERVER_ERROR') {
    return new GraphQLError('Internal server error', {
      extensions: { code: 'INTERNAL_SERVER_ERROR' }
    });
  }
  
  return error;
};

// Subgraph error handling
const resolvers = {
  Query: {
    user: async (_, { id }, context) => {
      // Authorization check
      if (!context.user) {
        throw new UnauthorizedError();
      }
      
      const user = await findUser(id);
      if (!user) {
        throw new NotFoundError('User', id);
      }
      
      return user;
    }
  },
  
  Mutation: {
    createUser: async (_, { input }) => {
      // Validation
      if (!input.email.includes('@')) {
        throw new ValidationError('Invalid email format', 'email');
      }
      
      // Check for duplicates
      const existing = await findUserByEmail(input.email);
      if (existing) {
        throw new ValidationError('Email already in use', 'email');
      }
      
      return createUser(input);
    }
  }
};

// Partial failures handling
const userResolver = {
  orders: async (user) => {
    try {
      return await orderService.getByUserId(user.id);
    } catch (error) {
      // Log error but don't fail the whole query
      console.error(`Failed to fetch orders for user ${user.id}:`, error);
      
      // Return null or empty array instead of failing
      return null;  // Client will see orders: null
    }
  }
};

module.exports = { NotFoundError, UnauthorizedError, ValidationError, formatError };
Caveats and Common Pitfalls for Federation Errors and N+1
  1. N+1 query problem in federated resolvers. A query for users { orders { product { name } } } fans out into one Users call, then one Orders call per user (N), then one Products call per order-item (N*M). Without batching, a list of 50 users with an average of 3 orders and 4 items per order issues roughly 650 subgraph requests. DataLoader is not optional at scale.
  2. Throwing on every nested field failure. A single null value from a degraded service propagates as a whole-query error because the resolver threw. The client sees nothing, when they could have seen most of what they asked for.
  3. Internal error details leaking to clients. The GraphQL errors array includes stack traces from the subgraph. Attackers map out your infrastructure from a single malformed request.
  4. Partial failures invisible to monitoring. A subgraph returns null with an error for a specific user’s data; the query appears “successful” in aggregate metrics (200 response code, data was returned), but that user saw an incomplete result. Partial failures should have a metric of their own.
Solutions and Patterns
  • DataLoader on every subgraph’s __resolveReference and every cross-subgraph field resolver. Batching turns N+1 into 1+1.
  • Catch, log, and return null for non-critical fields. Wrap non-critical nested resolvers in try/catch that returns null. Log the exception with correlation ID for debugging.
  • Normalize error responses at the gateway. A formatError hook strips stack traces, maps internal codes to client-safe codes, and enriches with a user-facing message. Subgraphs can throw detailed errors; the client only sees a curated version.
  • Count partial failures as a distinct metric. graphql_partial_failure_total{subgraph="orders"} incremented in the gateway every time a subgraph returned data with errors. Alert when the rate exceeds a threshold.
  • Set query complexity limits. A single malicious query with 10 levels of nesting can fan out into millions of subgraph calls. graphql-query-complexity (Node) enforces cost limits; on the Python side, Strawberry’s built-in QueryDepthLimiter extension enforces a maximum depth.
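The partial-failure metric described above amounts to inspecting each response for the combination of data and errors. A minimal sketch, assuming subgraph errors carry extensions.serviceName and using an in-process Map where a real deployment would use its metrics client (all names here are illustrative):

```javascript
// Hypothetical in-process counter, keyed by subgraph name.
const partialFailures = new Map();

function recordPartialFailures(response) {
  const { data, errors = [] } = response;
  // A partial failure has BOTH data and errors. A response with no data
  // is a total failure and should be counted by a different metric.
  if (data == null || errors.length === 0) return 0;
  for (const error of errors) {
    const subgraph = error.extensions?.serviceName ?? 'unknown';
    partialFailures.set(subgraph, (partialFailures.get(subgraph) ?? 0) + 1);
  }
  return errors.length;
}

const counted = recordPartialFailures({
  data: { user: { name: 'Ada', orders: null } },
  errors: [{ message: 'orders unavailable', extensions: { serviceName: 'orders' } }],
});
console.log(counted, partialFailures.get('orders'));
```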

Authentication & Authorization

Auth in federated systems follows a clear split: authentication at the gateway, authorization at the subgraph. The gateway validates the incoming JWT or session token once, extracts user claims, and forwards them to subgraphs as trusted headers (typically x-user-id, x-user-role). Subgraphs trust these headers because they’re reachable only through the gateway (enforced via network policy or service mesh). Each subgraph then enforces its own authorization rules: who can see this user’s email, who can modify this order, who can delete a product. The gateway shouldn’t try to centralize authorization because it doesn’t know each subgraph’s domain rules — the Orders subgraph knows what “order owner” means, not the gateway. The critical security rule: subgraphs must never trust arbitrary callers. Either enforce that they can only be reached via the gateway (network-level), or have them re-validate the JWT themselves. The second approach is slower but safer in zero-trust environments. When in doubt, do both.
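The subgraph half of that split can be sketched as follows, before the gateway-oriented code. This is a minimal illustration under the network-level assumption stated above (only the gateway can reach the subgraph); buildSubgraphContext and canViewOrder are hypothetical helpers, and ownerId is an assumed field on the order record.

```javascript
// Builds request context inside a subgraph from headers the gateway forwards.
// Safe only if network policy guarantees the gateway is the sole caller.
// Node lowercases incoming header names, hence the lowercase keys.
function buildSubgraphContext(headers) {
  const id = headers['x-user-id'];
  if (!id) return { user: null };
  return { user: { id, role: headers['x-user-role'] ?? null } };
}

// A domain rule that belongs in the Orders subgraph, not in the gateway.
function canViewOrder(context, order) {
  if (!context.user) return false;
  return context.user.role === 'ADMIN' || context.user.id === order.ownerId;
}

const context = buildSubgraphContext({ 'x-user-id': 'u1', 'x-user-role': 'USER' });
console.log(canViewOrder(context, { id: 'o1', ownerId: 'u1' }));
```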
// auth.js - Authentication across federated services

const { GraphQLError } = require('graphql');
const jwt = require('jsonwebtoken');

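// Authentication directive (sketch built on @graphql-tools/utils)
const { mapSchema, getDirective, MapperKind } = require('@graphql-tools/utils');
const { defaultFieldResolver } = require('graphql');
const authDirective = (directiveName) => ({
  authDirectiveTypeDefs: `directive @${directiveName}(requires: Role = USER) on FIELD_DEFINITION`,
  authDirectiveTransformer: (schema) =>
    mapSchema(schema, {
      [MapperKind.OBJECT_FIELD]: (fieldConfig) => {
        const directive = getDirective(schema, fieldConfig, directiveName)?.[0];
        if (!directive) return fieldConfig;  // Field is not annotated: leave it alone
        const { resolve = defaultFieldResolver } = fieldConfig;
        // Assumed policy: ADMIN passes every check; otherwise the role must match exactly
        const allowed = (role) => role === 'ADMIN' || role === directive.requires;
        fieldConfig.resolve = (source, args, context, info) => {
          if (!allowed(context.user?.role)) throw new GraphQLError('Not authorized');
          return resolve(source, args, context, info);
        };
        return fieldConfig;
      }
    })
});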

// Context creation with user
const createContext = async ({ req }) => {
  const token = req.headers.authorization?.replace('Bearer ', '');
  
  if (!token) {
    return { user: null };
  }
  
  try {
    const decoded = jwt.verify(token, process.env.JWT_SECRET);
    
    // Optionally fetch full user from Users subgraph
    // const user = await usersService.getUser(decoded.userId);
    
    return {
      user: {
        id: decoded.userId,
        email: decoded.email,
        role: decoded.role
      },
      token
    };
  } catch (error) {
    return { user: null };
  }
};

// Field-level authorization in resolvers
const resolvers = {
  User: {
    // Only accessible to the user themselves or admins
    email: (user, _, context) => {
      if (!context.user) {
        throw new GraphQLError('Not authenticated');
      }
      
      if (context.user.id !== user.id && context.user.role !== 'ADMIN') {
        throw new GraphQLError('Not authorized to view email');
      }
      
      return user.email;
    },
    
    // Private field - only for the user
    orders: (user, _, context) => {
      if (!context.user || context.user.id !== user.id) {
        throw new GraphQLError('Not authorized to view orders');
      }
      
      return getOrdersForUser(user.id);
    }
  },
  
  Mutation: {
    // Admin only mutation
    deleteUser: (_, { id }, context) => {
      if (!context.user || context.user.role !== 'ADMIN') {
        throw new GraphQLError('Admin access required');
      }
      
      return deleteUser(id);
    }
  }
};

// Permission matrix for complex authorization
const permissions = {
  User: {
    id: ['*'],                    // Everyone can see
    name: ['*'],                  // Everyone can see
    email: ['self', 'admin'],     // Only self or admin
    orders: ['self', 'admin'],    // Only self or admin
    creditCard: ['self']          // Only self
  },
  Order: {
    '*': ['owner', 'admin'],      // Only owner or admin can see any field
  }
};

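function checkPermission(entity, field, context, record) {
  // Deny by default: a field missing from the matrix is readable by no one
  const allowed = permissions[entity]?.[field] || permissions[entity]?.['*'] || [];
  
  if (allowed.includes('*')) return true;
  
  // Every remaining rule needs an authenticated user. Without this guard,
  // undefined === undefined would grant access to anonymous callers.
  const user = context.user;
  if (!user) return false;
  
  // 'self' applies to User records, whose own id identifies the user
  if (allowed.includes('self') && user.id === record.id) return true;
  if (allowed.includes('owner') && user.id === record.ownerId) return true;
  if (allowed.includes('admin') && user.role === 'ADMIN') return true;
  
  return false;
}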

module.exports = { createContext, permissions, checkPermission };
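On the subgraph side, the forwarded headers have to be turned back into a user context before resolvers like the ones above can run. The following is a minimal sketch, assuming the gateway sends x-user-id and x-user-role as described earlier. The function name contextFromGatewayHeaders and the KNOWN_ROLES set are illustrative assumptions, not part of any Apollo API, and the approach is only safe when the subgraph cannot be reached except through the gateway.

```javascript
// Hypothetical helper: builds the resolver context inside a subgraph from
// headers the gateway is assumed to set (x-user-id, x-user-role).
// `headers` is a plain object with lower-cased keys, as Node's http module provides.
const KNOWN_ROLES = new Set(['USER', 'ADMIN']);

function contextFromGatewayHeaders(headers) {
  const id = headers['x-user-id'];
  const role = headers['x-user-role'];

  // No identity header means the gateway saw an anonymous request
  if (!id) return { user: null };

  // Reject unknown roles instead of passing arbitrary strings to resolvers
  if (!KNOWN_ROLES.has(role)) return { user: null };

  return { user: { id, role } };
}

console.log(contextFromGatewayHeaders({ 'x-user-id': '42', 'x-user-role': 'ADMIN' }));
console.log(contextFromGatewayHeaders({}));
```

Treating an unknown role as anonymous keeps a malformed header from silently widening access.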
Caveats and Common Pitfalls for Federation Auth
  1. Subgraphs trust headers from anyone. The gateway sets x-user-id, but a subgraph exposed on the network also accepts it from direct callers who spoof the header. Anyone inside the network can impersonate any user if the subgraph doesn’t validate.
  2. Authorization logic duplicated and drifting across subgraphs. Each subgraph re-implements “is this user an admin?” with subtle differences. Over time the rules diverge, and a user forbidden by one subgraph is allowed by another.
  3. JWT validation skipped in subgraphs. The gateway validates once; subgraphs trust the gateway. But in a hybrid environment with some services accessible via multiple paths (service mesh, direct network calls, webhooks), only validating at the gateway is insufficient.
  4. Overly granular field-level authorization. Every field has its own auth check, making the schema slow and the code a maintenance nightmare. Aggregating auth at the type or subgraph level is usually sufficient.
  5. Persisted queries trusted unconditionally. Client sends a query hash, gateway looks up the pre-registered query, executes it with gateway-level trust. If the client can register new queries (common in dev workflows), they bypass any query complexity limits applied to arbitrary queries.
Solutions and Patterns
  • Enforce “no subgraph reachable from outside the gateway.” Use a service mesh (Istio, Linkerd) or a VPC network policy that blocks direct traffic to subgraphs. Then header-based auth is safe.
  • Re-validate JWTs in each subgraph when zero-trust is required. Use a shared library so the validation logic is identical.
  • Centralize the permission matrix in a shared package. Each subgraph imports the same PERMISSIONS map, so changes propagate atomically.
  • Authorize at the query entry point, not at every field, when possible. If a user can access an order, they can usually see all its fields. Field-level auth is reserved for sensitive fields like SSN or credit card.
  • Audit persisted query registrations. Only deploys via a trusted pipeline can register new query hashes. Dev environments can register freely; prod requires a signed registration.
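To make the "re-validate JWTs in each subgraph" pattern concrete, here is a sketch of what such a shared validation helper checks. It is written against Node's built-in crypto module only so that it runs standalone in a recent Node.js release; it assumes HS256 tokens and a shared secret, and the names signHs256 and verifyHs256 are hypothetical. In a real service you would more likely call jwt.verify from jsonwebtoken, as the gateway code above does.

```javascript
const crypto = require('crypto');

const b64url = (input) => Buffer.from(input).toString('base64url');

// Hypothetical helper, used here only to mint a token for the demo
function signHs256(payload, secret) {
  const header = b64url(JSON.stringify({ alg: 'HS256', typ: 'JWT' }));
  const body = b64url(JSON.stringify(payload));
  const signature = crypto
    .createHmac('sha256', secret)
    .update(`${header}.${body}`)
    .digest('base64url');
  return `${header}.${body}.${signature}`;
}

// Returns the decoded claims, or null if the token is malformed,
// signed with a different key or algorithm, or expired
function verifyHs256(token, secret, nowSeconds = Math.floor(Date.now() / 1000)) {
  const parts = typeof token === 'string' ? token.split('.') : [];
  if (parts.length !== 3) return null;
  const [header, body, signature] = parts;

  let decodedHeader;
  let claims;
  try {
    decodedHeader = JSON.parse(Buffer.from(header, 'base64url').toString('utf8'));
    claims = JSON.parse(Buffer.from(body, 'base64url').toString('utf8'));
  } catch {
    return null;
  }

  // Pin the algorithm rather than trusting whatever the token header claims
  if (decodedHeader.alg !== 'HS256') return null;

  const expected = crypto.createHmac('sha256', secret).update(`${header}.${body}`).digest();
  const actual = Buffer.from(signature, 'base64url');
  // timingSafeEqual throws on length mismatch, so compare lengths first
  if (actual.length !== expected.length || !crypto.timingSafeEqual(actual, expected)) {
    return null;
  }

  if (typeof claims.exp === 'number' && claims.exp <= nowSeconds) return null;
  return claims;
}

const demoToken = signHs256({ userId: '1', role: 'USER', exp: 2000 }, 'demo-secret');
console.log(verifyHs256(demoToken, 'demo-secret', 1000));
console.log(verifyHs256(demoToken, 'other-secret', 1000));
```

Publishing one such function in a shared package is what keeps validation identical across subgraphs.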

Testing Federation

Testing federated systems has two layers. First, unit-test each subgraph in isolation — its resolvers should be testable with standard GraphQL execution against a mock database. Second, integration-test the composed supergraph — run real subgraphs (or mocks speaking the federation protocol) behind a gateway and assert that cross-subgraph queries return what you expect. The integration layer is where subtle federation bugs hide: a missing @external directive, a reference resolver returning the wrong shape, a circular dependency between subgraphs. In Python, the easiest integration-test pattern is to spin up each subgraph on a distinct port via pytest fixtures, use httpx.AsyncClient to send queries to a running Apollo Router (or a mock gateway), and assert on the responses. Strawberry ships a schema.execute method that’s handy for testing individual subgraphs without HTTP.
// federation.test.js

const { ApolloServer } = require('@apollo/server');
const {
  ApolloGateway,
  IntrospectAndCompose,
  LocalGraphQLDataSource
} = require('@apollo/gateway');
const { buildSubgraphSchema } = require('@apollo/subgraph');

describe('Federation Integration Tests', () => {
  let server;
  
  beforeAll(async () => {
    // Build each subgraph schema in-process. Each schema module is assumed
    // to export { typeDefs, resolvers }.
    const schemas = {
      users: buildSubgraphSchema(require('./users-subgraph/schema')),
      orders: buildSubgraphSchema(require('./orders-subgraph/schema')),
      products: buildSubgraphSchema(require('./products-subgraph/schema'))
    };
    
    const gateway = new ApolloGateway({
      supergraphSdl: new IntrospectAndCompose({
        // URLs are placeholders: buildService below routes every request
        // to the in-process schemas, so nothing needs to listen on these ports
        subgraphs: [
          { name: 'users', url: 'http://localhost:4001' },
          { name: 'orders', url: 'http://localhost:4002' },
          { name: 'products', url: 'http://localhost:4003' }
        ]
      }),
      buildService({ name }) {
        return new LocalGraphQLDataSource(schemas[name]);
      }
    });
    
    // Create and start the server once, then share it across tests
    // instead of re-loading the gateway in every test
    server = new ApolloServer({ gateway });
    await server.start();
  });
  
  afterAll(async () => {
    await server.stop();
  });
  
  test('federated query across subgraphs', async () => {
    const result = await server.executeOperation({
      query: `
        query UserWithOrders($userId: ID!) {
          user(id: $userId) {
            id
            name
            orders {
              id
              total
              items {
                product {
                  name
                  price
                }
                quantity
              }
            }
          }
        }
      `,
      variables: { userId: '1' }
    });
    
    // Expected values assume the subgraph fixtures contain user '1' (Alice)
    expect(result.body.singleResult.errors).toBeUndefined();
    expect(result.body.singleResult.data.user).toMatchObject({
      id: '1',
      name: 'Alice',
      orders: expect.arrayContaining([
        expect.objectContaining({
          items: expect.arrayContaining([
            expect.objectContaining({
              product: expect.objectContaining({
                name: expect.any(String)
              })
            })
          ])
        })
      ])
    });
  });
  
  test('entity resolution works correctly', async () => {
    // Query order and verify user is resolved by the Users subgraph
    const result = await server.executeOperation({
      query: `
        query Order($orderId: ID!) {
          order(id: $orderId) {
            id
            user {
              name
              email
            }
          }
        }
      `,
      variables: { orderId: 'o1' }
    });
    
    expect(result.body.singleResult.errors).toBeUndefined();
    expect(result.body.singleResult.data.order.user.name).toBeDefined();
  });
});
Caveats and Common Pitfalls for Testing Federated Systems
  1. Testing subgraphs in isolation, never as a supergraph. Every subgraph’s tests pass, but the composed supergraph has reference-resolution bugs that only surface at integration time. The first federated query in production returns garbage.
  2. Mocking the gateway instead of exercising it. Unit-style tests that stub out federation behavior miss the real contract — they prove the code you control works, not that your code works with Apollo’s query planner.
  3. No contract tests between subgraphs. A subgraph team removes a field used only by the reference resolver in another subgraph. The other subgraph still compiles and starts; at runtime, reference resolution returns nulls. No test caught this.
  4. Tests coupled to test-mode data that diverges from production shape. Integration tests pass because the test fixtures happen to avoid the edge cases. Production hits them and breaks.
Solutions and Patterns
  • Run full supergraph integration tests in CI. Spin up all subgraphs plus a real gateway/router, execute representative queries, assert structure and values. Containerize the setup for reproducibility.
  • Contract-test at the federation boundary. For each entity a subgraph owns, assert that its __resolveReference returns the shape other subgraphs expect. Publish the expected shape to the schema registry and verify on every PR.
  • Run rover subgraph check in CI. It validates that a subgraph schema change is composable with the current production supergraph. Breaking changes are blocked at PR time, not at deploy time.
  • Chaos-test the gateway. Randomly fail one subgraph during the test suite and assert queries return partial data with correct error extensions rather than hanging or throwing opaque errors.
  • Keep fixtures close to production shape. Export a sanitized subset of production data as test fixtures; refresh quarterly. Synthetic fixtures miss real-world complexity.

Interview Questions

Answer: Federation composes a unified GraphQL API from multiple services.
Key components:
  • Subgraphs: Individual GraphQL services
  • Gateway: Entry point that composes supergraph
  • Supergraph: Combined schema from all subgraphs
  • Entities: Types that can span multiple subgraphs
How it works:
  1. Each subgraph defines its portion of the schema
  2. Gateway fetches schemas and composes supergraph
  3. Client sends query to gateway
  4. Gateway plans execution across subgraphs
  5. Results are combined and returned
Key directives:
  • @key: Defines entity’s unique identifier
  • @external: Field defined in another subgraph
  • @requires: Field needs external data
Answer: An entity is a type that can be resolved across multiple subgraphs.
Definition:
type User @key(fields: "id") {
  id: ID!
  name: String!
}
Reference resolution:
User: {
  __resolveReference: (userRef) => {
    // userRef = { __typename: 'User', id: '123' }
    return findUserById(userRef.id);
  }
}
How references work:
  1. Orders subgraph returns { __typename: 'User', id: '1' }
  2. Gateway sees it needs User fields
  3. Gateway calls Users subgraph’s __resolveReference
  4. Full User data is returned
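The four steps above can be simulated in plain JavaScript. This is a simplified stand-in for what a subgraph's _entities field does, not Apollo's implementation; usersById, the resolver map, and resolveEntities are hypothetical names used for illustration.

```javascript
// Hypothetical data and resolver map for a Users subgraph
const usersById = { '1': { id: '1', name: 'Alice' } };
const resolvers = {
  User: {
    __resolveReference: (ref) => usersById[ref.id] ?? null,
  },
};

// Simplified stand-in for the _entities field: route every representation
// to the __resolveReference registered for its __typename
function resolveEntities(representations, resolverMap) {
  return representations.map((ref) => {
    const resolveReference = resolverMap[ref.__typename]?.__resolveReference;
    if (!resolveReference) {
      throw new Error(`No reference resolver for ${ref.__typename}`);
    }
    const entity = resolveReference(ref);
    return entity ? { __typename: ref.__typename, ...entity } : null;
  });
}

// Step 1: the Orders subgraph returned only a reference to the user
const order = { id: 'o1', user: { __typename: 'User', id: '1' } };

// Steps 2-4: the gateway asks the Users subgraph to resolve that reference
// and merges the full user into the order
const [user] = resolveEntities([order.user], resolvers);
console.log({ ...order, user });
```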
Answer: @external marks a field as defined in another subgraph.
type Product @key(fields: "id") {
  id: ID! @external
  name: String! @external
}
@requires: Indicates a field needs external data to resolve.
type Product @key(fields: "id") {
  id: ID! @external
  price: Float! @external
  
  # This field needs 'price' from Products subgraph
  priceWithTax: Float! @requires(fields: "price")
}
Execution:
  1. Gateway fetches price from Products subgraph
  2. Passes it to this subgraph’s resolver
  3. Resolver uses it to compute priceWithTax
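A resolver for priceWithTax might look like the sketch below. It assumes the representation sent by the gateway is passed through as the resolver's parent object, and TAX_RATE is a made-up flat rate for illustration only.

```javascript
const TAX_RATE = 0.25; // hypothetical flat rate, for illustration only

const resolvers = {
  Product: {
    // `product` is the representation from the gateway. Because of
    // @requires(fields: "price"), it carries price alongside the key field.
    priceWithTax: (product) => {
      if (typeof product.price !== 'number') {
        throw new Error('price missing from representation; check the @requires selection');
      }
      return product.price * (1 + TAX_RATE);
    },
  },
};

const representation = { __typename: 'Product', id: 'p1', price: 100 };
console.log(resolvers.Product.priceWithTax(representation));
```

Failing loudly when price is absent makes a misconfigured @requires selection visible instead of returning NaN.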
Answer: At Gateway level:
  1. Validate JWT in gateway middleware
  2. Add user info to context
  3. Forward auth headers to subgraphs
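class AuthenticatedDataSource extends RemoteGraphQLDataSource {
  willSendRequest({ request, context }) {
    // Forward identity only for authenticated requests, so subgraphs never
    // receive the literal string "undefined" as a header value
    if (context.token) {
      // createContext stripped the "Bearer " prefix, so restore it here
      request.http.headers.set('authorization', `Bearer ${context.token}`);
    }
    if (context.user) {
      request.http.headers.set('x-user-id', String(context.user.id));
    }
  }
}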
At Subgraph level:
  • Read user from headers
  • Apply authorization in resolvers
  • Use directives for declarative auth
Best practice: Centralize authentication at gateway, distribute authorization to subgraphs.

Chapter Summary

Key Takeaways:
  • Federation composes unified GraphQL from multiple services
  • Entities enable cross-subgraph type resolution
  • Gateway handles query planning and execution
  • Use @key, @external, @requires for entity relationships
  • Forward authentication from gateway to subgraphs
  • Test federated queries with mock subgraphs
  • Both Node.js (Apollo) and Python (strawberry-graphql + Apollo Router) are production-ready stacks
Next Chapter: Interview Preparation - Comprehensive practice for microservices interviews.

Interview Deep-Dive

Strong Answer: The theoretical comparison (fewer round trips, client-driven queries, typed schema) is well-documented. The real-world trade-offs that teams discover after 6 months are more nuanced.

GraphQL Federation’s real advantages: the client team iterates faster because they can change their data requirements without backend changes (add a field to a query, no new endpoint needed). The gateway handles query planning across services, which eliminates the BFF’s manual orchestration code. Schema composition catches breaking changes at build time (the gateway composition fails if subgraphs are incompatible).

The real costs:

First, caching is dramatically harder. REST endpoints are naturally cacheable (HTTP cache headers, CDN caching, reverse proxy caching). GraphQL queries are typically POST requests with unique bodies, so standard HTTP caches do not apply out of the box. You need application-level caching (Apollo cache) or persisted queries (a pre-registered query map), which adds complexity.

Second, monitoring and alerting is harder. In REST, you alert on “POST /orders endpoint has 5% error rate.” In GraphQL, every request hits the same /graphql endpoint. You need to parse the query to know which resolver failed, which makes setting up per-operation alerts more complex.

Third, the N+1 query problem moves from the client to the backend. Instead of the client making N REST calls, a subgraph resolves N entity references one at a time, each triggering its own database query or downstream call. DataLoader (batch loading) mitigates this, but it requires careful implementation in every subgraph.

Fourth, team coordination for schema design. With REST, each team designs their own endpoints independently. With Federation, teams must agree on shared entity types (@key directives), which introduces coordination overhead.

My recommendation: REST with BFF for teams under 30 engineers or with fewer than 10 services. GraphQL Federation for larger organizations with dedicated front-end teams that need rapid iteration independence, and a platform team that can own the gateway infrastructure.

Follow-up: “How do you handle authentication and authorization in a federated GraphQL architecture?”

Authentication happens at the gateway: the gateway validates the JWT and extracts user claims. Authorization happens at the subgraph level: each resolver checks whether the authenticated user has permission to access that specific piece of data. The gateway passes the user context (user ID, roles) to subgraphs via headers or a shared context object. The critical rule: never rely on the gateway to enforce all authorization. The User subgraph must independently verify that the requesting user has permission to access another user’s data, because the gateway cannot understand every subgraph’s authorization rules.
Strong Answer: Entity resolution is the mechanism that allows one subgraph to extend a type defined in another subgraph. It works through the @key directive and the __resolveReference method.

Concrete example: the User subgraph defines type User @key(fields: "id") { id: ID!, name: String, email: String }. The Order subgraph extends it: extend type User @key(fields: "id") { id: ID! @external, orders: [Order!]! }. The Order subgraph can add the orders field to User without the User subgraph knowing about it.

When a client queries { user(id: "123") { name, orders { total } } }, the gateway creates a query plan. Step one: fetch { user(id: "123") { id, name } } from the User subgraph. Step two: take the user ID from step one and send the representation { __typename: "User", id: "123" } to the Order subgraph, which runs it through __resolveReference to fetch the orders.

The Order subgraph receives the User entity representation (just the key fields) and uses it to look up orders: (user) => orderService.getOrdersByUserId(user.id). It returns the orders, which the gateway merges with the user data from step one.

The key insight: subgraphs do not call each other directly. The gateway orchestrates all communication. This means subgraphs can be developed independently: the Order team adds the orders field to User without the User team changing anything.

The performance concern: the gateway sends the representations for one fetch to the subgraph in a single _entities request, but inside the subgraph the reference and field resolvers run once per entity. A query that returns 50 users, each with orders, would trigger 50 separate order lookups in the Order subgraph. This is the N+1 problem. The solution: batch entity resolution inside the subgraph. Instead of 50 individual lookups, a DataLoader-style batch function collects all 50 user IDs and fetches all 50 users’ orders in a single database query.

Follow-up: “What happens when one subgraph is down? Does the entire query fail?”

Not necessarily. When a subgraph fetch fails, the gateway reports an error and returns null for that subgraph’s fields, and GraphQL null propagation decides how much of the response survives. If the failed fields are non-nullable, the null bubbles up to the nearest nullable parent and can take the entire query down with it. This is problematic because a user query that includes non-critical data (recommendations) should not fail just because the recommendation subgraph is down. I handle this in two ways. First, I make fields owned by non-critical subgraphs nullable: the gateway returns null for the failed subgraph’s fields and includes an error in the response indicating which fields were unavailable, and the client handles the partial response gracefully. Second, for clients that support incremental delivery, the @defer directive (supported by Apollo Router) lets the critical part of the response return immediately while the deferred fields stream in when they become available.
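The batch resolution described above can be sketched as follows. The ordersTable array and the queryCount counter are hypothetical stand-ins for a real database, and resolveUserReferences illustrates the idea rather than any particular library's API (DataLoader packages the same pattern behind an async interface).

```javascript
// Hypothetical orders table and a counter standing in for database round trips
const ordersTable = [
  { id: 'o1', userId: '1', total: 30 },
  { id: 'o2', userId: '2', total: 15 },
  { id: 'o3', userId: '1', total: 50 },
];
let queryCount = 0;

// Stand-in for: SELECT * FROM orders WHERE user_id IN (...)
function findOrdersByUserIds(userIds) {
  queryCount += 1;
  const wanted = new Set(userIds);
  return ordersTable.filter((order) => wanted.has(order.userId));
}

// Resolve all User representations from one _entities request with a single
// query, then regroup the rows so results line up with the input order
function resolveUserReferences(representations) {
  const rows = findOrdersByUserIds(representations.map((ref) => ref.id));
  const byUser = new Map();
  for (const row of rows) {
    if (!byUser.has(row.userId)) byUser.set(row.userId, []);
    byUser.get(row.userId).push(row);
  }
  // Users without orders get an empty list rather than undefined
  return representations.map((ref) => ({ ...ref, orders: byUser.get(ref.id) ?? [] }));
}

const refs = ['1', '2', '3'].map((id) => ({ __typename: 'User', id }));
const resolved = resolveUserReferences(refs);
console.log(resolved.map((u) => u.orders.length), 'queries:', queryCount);
```

Returning results in the same order as the incoming representations matters, because the gateway matches entities back to references by position.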
Strong Answer: Schema evolution in federation is more constrained than in REST because adding a field to the supergraph requires the composition to succeed. If two subgraphs define conflicting types, the gateway build fails.

The process I enforce: every subgraph publishes its schema to a schema registry (Apollo Studio or a self-hosted registry). Before a subgraph deploys a schema change, it runs a composition check that validates the new schema composes successfully with all other subgraphs’ current schemas. If composition fails (naming conflict, incompatible types), the deploy is blocked. This is the GraphQL equivalent of contract testing.

For backward-compatible changes (adding a nullable field, adding a new type), the process is straightforward: the subgraph deploys the new schema, publishes it to the registry, and the gateway picks up the change. No other teams are affected.

For breaking changes (removing a field, changing a field’s type), I use a deprecation workflow. Step one: mark the field as @deprecated with a reason and a removal date. The gateway still serves the field but logs usage. Step two: notify all client teams that use the deprecated field (Apollo Studio tracks field usage per client). Step three: after all clients have migrated away (usage drops to zero), remove the field. Step four: publish the updated schema and verify composition succeeds.

The organizational challenge: type ownership. In federation, multiple subgraphs can contribute fields to the same type (User type gets fields from both User subgraph and Order subgraph). If the User team wants to rename the User type, they need to coordinate with every subgraph that extends it. I mitigate this by establishing clear ownership rules: the subgraph that defines the type with @key is the “owner.” Other subgraphs extend it but cannot modify the key fields. Schema changes to key fields require cross-team coordination (rare by design).

Follow-up: “How do you test schema changes across subgraphs before deploying to production?”

I use a three-step validation pipeline. First, local composition check: the subgraph developer runs rover subgraph check locally to verify their schema change composes with the current production schemas. Second, CI composition check: the PR pipeline pulls all production subgraph schemas from the registry, substitutes the PR’s schema for its subgraph, and runs composition. If it fails, the PR is blocked. Third, staging validation: the change deploys to a staging environment where all subgraphs run, and integration tests verify that existing queries still return expected results. This catches runtime errors that composition checks miss (a field exists in the schema but the resolver throws an error).