
Security

Security in microservices is more complex than in a monolith because the attack surface is larger: every service-to-service call needs authentication and authorization.
Learning Objectives:
  • Implement JWT-based authentication
  • Design authorization strategies
  • Set up mutual TLS (mTLS)
  • Manage secrets securely
  • Protect against common attacks

Security Challenges in Microservices

In a monolith, you had one front door to lock. In microservices, every service is its own building with its own door, and they all talk to each other constantly. That means the attack surface multiplies with each new service you deploy. Zero-trust architecture flips the old “trust the internal network” model on its head: assume the network is hostile, authenticate every request, and authorize every action — even between your own services. The tradeoff is operational complexity: every service now needs identity, every call needs verification, and every secret needs rotation.
┌─────────────────────────────────────────────────────────────────────────────┐
│                    MICROSERVICES SECURITY CHALLENGES                         │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ATTACK SURFACE                                                              │
│  ──────────────                                                              │
│                                                                              │
│  Monolith:                      Microservices:                              │
│  ┌────────────┐                 ┌─────┐  ┌─────┐  ┌─────┐                   │
│  │            │                 │     │──│     │──│     │                   │
│  │   Single   │   vs            │  A  │  │  B  │  │  C  │                   │
│  │   Entry    │                 │     │──│     │──│     │                   │
│  │            │                 └─────┘  └─────┘  └─────┘                   │
│  └────────────┘                    │        │        │                      │
│       │                            └────────┴────────┘                      │
│  1 entry point                   N entry points + M connections             │
│                                                                              │
│                                                                              │
│  CHALLENGES:                                                                 │
│  ────────────                                                                │
│                                                                              │
│  • How do services authenticate each other?                                 │
│  • How to propagate user identity across services?                          │
│  • Where to enforce authorization?                                          │
│  • How to secure service-to-service communication?                          │
│  • How to manage secrets across many services?                              │
│  • How to audit access across distributed system?                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Authentication Strategies

JWT-Based Authentication

JWTs (JSON Web Tokens) solve a core distributed-systems problem: how do services verify a user’s identity without every service calling back to a central auth service on every request? The auth service signs a token once at login, and any downstream service can then verify the signature locally using a shared secret (HMAC) or the issuer’s public key (RSA/ECDSA). This keeps the auth service from becoming a bottleneck and a single point of failure, and it enables stateless authentication that scales horizontally. The tradeoff is revocation: because tokens are self-contained, you cannot invalidate one instantly without a blacklist, which reintroduces the centralized check you were trying to avoid. In zero-trust terms, JWTs carry identity claims across trust boundaries so each service can independently answer “who is this, and what are they allowed to do?”
Caveats & Common Pitfalls with JWTs
  • JWT secret rotation causes downtime. A team rotates the signing secret during a maintenance window; all existing tokens immediately fail verification; every logged-in user is kicked out at once; the support inbox fills up in 5 minutes. This is the most common JWT operational incident.
  • alg: none and algorithm confusion. If your verifier trusts the alg claim in the token header instead of enforcing one server-side, an attacker can send a token with alg: none and no signature. Alternatively, the attacker sets alg: HS256 on a token sent to a service that verifies with your RSA public key: the library then uses that public key as an HMAC secret, and because the public key is public, anyone can forge tokens. This vulnerability class has produced CVEs in many JWT libraries since it was publicized in 2015.
  • Tokens in localStorage. XSS-vulnerable — any injected script can read and exfiltrate the token. Same goes for logging tokens in error responses or stack traces.
  • Long-lived tokens with no revocation path. A 24-hour access token stolen at hour 1 gives the attacker 23 free hours regardless of what you do.
Solutions & Patterns
  • Rotate with overlapping keys (JWKS). Publish a JWKS endpoint (/.well-known/jwks.json) with multiple active public keys, each identified by the kid (key ID) that the token header carries. When rotating, publish the new key first (so verifiers can refresh their cached key set), then start signing new tokens with it, keep the old key published until every token signed with it has expired (15 minutes after the switch, for 15-minute access tokens), and only then remove it. Zero downtime. This is how Auth0, Okta, and AWS Cognito handle it.
  • Pin the algorithm server-side. Explicitly pass the allowed algorithms to the verifier (e.g., { algorithms: ['RS256'] } for jwt.verify in Node’s jsonwebtoken, or algorithms=["RS256"] for jwt.decode in Python’s PyJWT) and never trust the token’s alg header. Use RS256 or ES256 in production (asymmetric, so verifiers only need the public key); use HS256 only when the issuer and verifier are the same process.
  • Short-lived access tokens (5-15 min) + long-lived refresh tokens stored server-side (so they are revocable). This bounds stolen-token damage to minutes.
  • Store access tokens in HttpOnly cookies, not localStorage, so JavaScript cannot read them and XSS cannot exfiltrate. Pair with SameSite=Lax or Strict for CSRF protection.
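  • Keep a small Redis blacklist for force-logout scenarios (e.g., “user reported compromise”). Store each revoked token (or its jti) with a TTL equal to the token’s remaining lifetime, so entries expire on their own. Because the blacklist holds only revoked tokens, the check is a single sub-millisecond Redis lookup.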
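A minimal sketch of the first two fixes above (pinning the algorithm, and looking keys up by kid while an old and a new key overlap), using only Node’s built-in crypto module (Node 16 or later, for the base64url encoding) rather than a JWT library. The key IDs, the in-memory Map standing in for a JWKS document, and the names signRS256/verifyRS256 are illustrative assumptions; the sketch checks only exp, and a real verifier should also check iss and aud.

```javascript
const crypto = require('node:crypto');

const b64url = (data) => Buffer.from(data).toString('base64url');
const fromB64url = (str) => Buffer.from(str, 'base64url');

// Sign an RS256 JWT and stamp the signing key's ID into the header.
function signRS256(payload, privateKey, kid) {
  const header = { alg: 'RS256', typ: 'JWT', kid };
  const signingInput = `${b64url(JSON.stringify(header))}.${b64url(JSON.stringify(payload))}`;
  // For RSA keys, crypto.sign('sha256', ...) uses PKCS#1 v1.5 padding, i.e. RS256.
  const signature = crypto.sign('sha256', Buffer.from(signingInput), privateKey);
  return `${signingInput}.${b64url(signature)}`;
}

// Verify against a key set (kid -> public key). The algorithm is fixed here, so
// tokens whose header claims "none" or "HS256" are rejected before any crypto runs.
function verifyRS256(token, keySet, nowSeconds = Math.floor(Date.now() / 1000)) {
  const parts = token.split('.');
  if (parts.length !== 3) throw new Error('malformed token');
  const [headerB64, payloadB64, signatureB64] = parts;

  const header = JSON.parse(fromB64url(headerB64).toString('utf8'));
  if (header.alg !== 'RS256') throw new Error(`unexpected alg: ${header.alg}`);

  const publicKey = keySet.get(header.kid);
  if (!publicKey) throw new Error(`unknown kid: ${header.kid}`);

  const valid = crypto.verify(
    'sha256',
    Buffer.from(`${headerB64}.${payloadB64}`),
    publicKey,
    fromB64url(signatureB64)
  );
  if (!valid) throw new Error('bad signature');

  const payload = JSON.parse(fromB64url(payloadB64).toString('utf8'));
  if (typeof payload.exp !== 'number' || payload.exp <= nowSeconds) throw new Error('token expired');
  return payload;
}

// Rotation in progress: both the old and the new public key are published.
const oldPair = crypto.generateKeyPairSync('rsa', { modulusLength: 2048 });
const newPair = crypto.generateKeyPairSync('rsa', { modulusLength: 2048 });
const keySet = new Map([
  ['key-2024-01', oldPair.publicKey],
  ['key-2024-02', newPair.publicKey]
]);

const exp = Math.floor(Date.now() / 1000) + 15 * 60;
const oldToken = signRS256({ sub: 'user123', exp }, oldPair.privateKey, 'key-2024-01');
const newToken = signRS256({ sub: 'user123', exp }, newPair.privateKey, 'key-2024-02');
const forged = `${b64url(JSON.stringify({ alg: 'none', typ: 'JWT' }))}.${b64url(JSON.stringify({ sub: 'admin', exp }))}.`;

console.log(verifyRS256(oldToken, keySet).sub); // user123 (old key still published)
console.log(verifyRS256(newToken, keySet).sub); // user123
try {
  verifyRS256(forged, keySet);
} catch (err) {
  console.log(err.message); // unexpected alg: none
}
```

Deleting key-2024-01 from the key set once the last token signed with it has expired completes the rotation; after that, old tokens fail with “unknown kid”. In production you would normally let a maintained JWT library do the parsing, configured with an explicit algorithm allow-list, rather than hand-rolling it.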
┌─────────────────────────────────────────────────────────────────────────────┐
│                    JWT AUTHENTICATION FLOW                                   │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. User Login                                                               │
│  ┌──────────┐         ┌──────────────┐                                      │
│  │  Client  │──login─▶│  Auth Service │                                      │
│  │          │◀─JWT────│  (issues JWT) │                                      │
│  └──────────┘         └──────────────┘                                      │
│                                                                              │
│  2. Access Services                                                          │
│  ┌──────────┐  Authorization: Bearer <JWT>   ┌──────────────┐               │
│  │  Client  │───────────────────────────────▶│  API Gateway │               │
│  └──────────┘                                └──────┬───────┘               │
│                                                      │                      │
│  3. JWT Propagation                                  │ JWT                  │
│                                                      ▼                      │
│                       ┌──────────┐      ┌──────────┐      ┌──────────┐     │
│                       │  Order   │─JWT─▶│ Payment  │─JWT─▶│ Inventory│     │
│                       │ Service  │      │ Service  │      │ Service  │     │
│                       └──────────┘      └──────────┘      └──────────┘     │
│                                                                              │
│  JWT Structure:                                                              │
│  ─────────────                                                               │
│  Header.Payload.Signature                                                    │
│                                                                              │
│  Payload: {                                                                  │
│    "sub": "user123",                                                        │
│    "email": "user@example.com",                                             │
│    "roles": ["customer", "premium"],                                        │
│    "permissions": ["orders:read", "orders:write"],                          │
│    "exp": 1705347600,                                                       │
│    "iss": "auth-service"                                                    │
│  }                                                                           │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
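The Header.Payload.Signature structure in the diagram is only base64url-encoded, not encrypted: anyone holding a token can read its claims, and the signature merely stops them from changing those claims. A small sketch that splits a token into its three parts and decodes the first two with Node’s Buffer (Node 16 or later). The inspectJwt name and the unsigned sample token are illustrative assumptions, and the function deliberately skips signature verification, so it is suitable only for debugging and logging.

```javascript
// Split a compact JWT (header.payload.signature) and decode the JSON parts.
// Does NOT verify the signature: never base an access decision on this output.
function inspectJwt(token) {
  const parts = token.split('.');
  if (parts.length !== 3) throw new Error('expected header.payload.signature');
  const decode = (part) => JSON.parse(Buffer.from(part, 'base64url').toString('utf8'));
  return { header: decode(parts[0]), payload: decode(parts[1]), signature: parts[2] };
}

// Build an illustrative token carrying the payload from the diagram.
const encode = (obj) => Buffer.from(JSON.stringify(obj)).toString('base64url');
const sampleToken = [
  encode({ alg: 'RS256', typ: 'JWT' }),
  encode({
    sub: 'user123',
    email: 'user@example.com',
    roles: ['customer', 'premium'],
    permissions: ['orders:read', 'orders:write'],
    exp: 1705347600,
    iss: 'auth-service'
  }),
  'c2lnbmF0dXJl' // placeholder bytes, not a real signature
].join('.');

const { header, payload } = inspectJwt(sampleToken);
console.log(header.alg, payload.sub, payload.roles);
console.log(new Date(payload.exp * 1000).toISOString()); // exp is in seconds since the Unix epoch
```

Because the claims are readable by anyone who sees the token, keep secrets and sensitive personal data out of the payload.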

JWT Service Implementation

The auth service is the gatekeeper that issues signed identity tokens after verifying credentials. Password hashing with bcrypt (or argon2) prevents an attacker who dumps your database from getting plaintext passwords — bcrypt’s adaptive cost factor makes brute-force attacks computationally expensive even with modern GPUs. Using short-lived access tokens (15 minutes) paired with long-lived refresh tokens limits the damage window if a token is stolen: an attacker has minutes, not days, to exploit it. Storing the refresh token server-side gives you a revocation point (one of the few places stateful auth is worth the complexity). In zero-trust, this service is the identity provider — it establishes “who” before any service decides “what they can do.”
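// services/AuthService.js
const crypto = require('crypto');
const jwt = require('jsonwebtoken');
const bcrypt = require('bcrypt');

// Dedicated error type so route handlers can map auth failures to HTTP 401.
class UnauthorizedError extends Error {
  constructor(message) {
    super(message);
    this.name = 'UnauthorizedError';
    this.statusCode = 401;
  }
}

class AuthService {
  // userRepository and tokenRepository are injected so the service does not
  // depend on a particular database.
  constructor({ userRepository, tokenRepository, ...options } = {}) {
    this.userRepository = userRepository;
    this.tokenRepository = tokenRepository;

    this.accessTokenSecret = process.env.JWT_ACCESS_SECRET;
    this.refreshTokenSecret = process.env.JWT_REFRESH_SECRET;
    if (!this.accessTokenSecret || !this.refreshTokenSecret) {
      throw new Error('JWT_ACCESS_SECRET and JWT_REFRESH_SECRET must be set');
    }

    this.accessTokenExpiry = options.accessTokenExpiry || '15m';
    this.refreshTokenExpiry = options.refreshTokenExpiry || '7d';
    this.issuer = options.issuer || 'auth-service';
  }

  async login(email, password) {
    const user = await this.userRepository.findByEmail(email);

    // One generic error for "unknown email" and "wrong password" so the
    // response does not reveal which accounts exist.
    if (!user || !(await bcrypt.compare(password, user.passwordHash))) {
      throw new UnauthorizedError('Invalid credentials');
    }

    return this.issueTokens(user);
  }

  // Sign a new token pair and persist the refresh token server-side,
  // which is what makes it revocable.
  async issueTokens(user) {
    const tokens = this.generateTokens(user);

    await this.tokenRepository.storeRefreshToken(
      user.id,
      tokens.refreshToken,
      this.refreshTokenExpiry
    );

    return tokens;
  }

  generateTokens(user) {
    const payload = {
      sub: user.id,
      email: user.email,
      roles: user.roles,
      permissions: this.getPermissionsForRoles(user.roles)
    };

    // HS256 with a shared secret keeps the example short; when many services
    // verify tokens, prefer RS256/ES256 so they only need the public key.
    const accessToken = jwt.sign(payload, this.accessTokenSecret, {
      algorithm: 'HS256',
      expiresIn: this.accessTokenExpiry,
      issuer: this.issuer,
      audience: 'microservices'
    });

    const refreshToken = jwt.sign(
      { sub: user.id, type: 'refresh' },
      this.refreshTokenSecret,
      {
        algorithm: 'HS256',
        expiresIn: this.refreshTokenExpiry,
        issuer: this.issuer,
        // Unique ID so two refresh tokens issued in the same second still differ
        jwtid: crypto.randomUUID()
      }
    );

    return { accessToken, refreshToken };
  }

  async refreshAccessToken(refreshToken) {
    let decoded;
    try {
      // Pin the algorithm and issuer; never trust the token's own alg header
      decoded = jwt.verify(refreshToken, this.refreshTokenSecret, {
        algorithms: ['HS256'],
        issuer: this.issuer
      });
    } catch (error) {
      throw new UnauthorizedError('Invalid refresh token');
    }

    if (decoded.type !== 'refresh') {
      throw new UnauthorizedError('Invalid refresh token');
    }

    // Verify the token is still the one on file (not revoked or already rotated)
    const storedToken = await this.tokenRepository.getRefreshToken(decoded.sub);
    if (storedToken !== refreshToken) {
      throw new UnauthorizedError('Invalid refresh token');
    }

    const user = await this.userRepository.findById(decoded.sub);
    if (!user) {
      throw new UnauthorizedError('User not found');
    }

    // Rotate: assuming storeRefreshToken overwrites the user's previous token,
    // the refresh token just used can no longer be replayed.
    return this.issueTokens(user);
  }

  async logout(userId, accessToken) {
    // Revoke the refresh token so no new access tokens can be minted
    await this.tokenRepository.revokeRefreshToken(userId);

    // Optionally blacklist the current access token until it expires:
    // await this.tokenRepository.blacklistAccessToken(accessToken);
  }

  getPermissionsForRoles(roles = []) {
    const rolePermissions = {
      admin: ['*'],
      customer: [
        'orders:read',
        'orders:create',
        'profile:read',
        'profile:update'
      ],
      support: [
        'orders:read',
        'users:read',
        'tickets:*'
      ]
    };

    const permissions = new Set();
    roles.forEach(role => {
      const perms = rolePermissions[role] || [];
      perms.forEach(p => permissions.add(p));
    });

    return Array.from(permissions);
  }
}

module.exports = { AuthService, UnauthorizedError };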

JWT Middleware

The middleware is the first line of defense at every service boundary — it validates that incoming requests carry a legitimate token before any business logic runs. This prevents unauthenticated traffic from reaching your handlers, protecting against direct endpoint abuse, token forgery, and expired-token replay. In zero-trust, this is where “never trust, always verify” becomes concrete: even if the request came from inside the cluster, the middleware re-validates the signature, issuer, audience, and expiration. The tradeoff is latency — every request pays a cryptographic verification cost — but with modern HMAC or RSA, this is microseconds and well worth the security guarantee.
// middleware/auth.js
const jwt = require('jsonwebtoken');

function authenticate(options = {}) {
  const {
    secret = process.env.JWT_ACCESS_SECRET,
    optional = false,
    // Optional revocation list: any object with an async check(token) -> boolean,
    // e.g. backed by Redis. Skipped when not provided.
    tokenBlacklist = null
  } = options;

  return async (req, res, next) => {
    const authHeader = req.headers.authorization;

    if (!authHeader || !authHeader.startsWith('Bearer ')) {
      if (optional) {
        return next();
      }
      return res.status(401).json({ error: 'Missing authorization header' });
    }

    const token = authHeader.substring('Bearer '.length);

    let decoded;
    try {
      decoded = jwt.verify(token, secret, {
        algorithms: ['HS256'], // pin the algorithm; never trust the token's alg header
        issuer: 'auth-service',
        audience: 'microservices'
      });
    } catch (error) {
      if (error.name === 'TokenExpiredError') {
        return res.status(401).json({ error: 'Token expired' });
      }
      return res.status(401).json({ error: 'Invalid token' });
    }

    // Check token blacklist
    try {
      if (tokenBlacklist && (await tokenBlacklist.check(token))) {
        return res.status(401).json({ error: 'Token revoked' });
      }
    } catch (error) {
      // Blacklist store unavailable: fail closed via the error handler
      return next(error);
    }

    req.user = {
      id: decoded.sub,
      email: decoded.email,
      roles: decoded.roles,
      permissions: decoded.permissions
    };

    next();
  };
}

module.exports = { authenticate };

// Usage
app.use('/api', authenticate());

Authorization Patterns

Role-Based Access Control (RBAC)

Authentication answers “who are you?”; authorization answers “what can you do?”. RBAC is the most common authorization model because it mirrors how organizations think: users have roles (admin, customer, support), and roles have permissions. This prevents privilege escalation attacks by enforcing the principle of least privilege — a customer cannot suddenly access admin-only endpoints just because they know the URL. In zero-trust, RBAC is typically enforced at every service boundary, not just the gateway, so that even if an attacker bypasses the front door they cannot call sensitive internal endpoints. The tradeoff is granularity: RBAC works well when permissions cleanly map to roles, but gets clumsy when individual users need one-off exceptions (which is where ABAC takes over).
// middleware/authorize.js
function authorize(...allowedRoles) {
  return (req, res, next) => {
    if (!req.user) {
      return res.status(401).json({ error: 'Not authenticated' });
    }

    const hasRole = req.user.roles.some(role => allowedRoles.includes(role));
    
    if (!hasRole) {
      return res.status(403).json({ error: 'Insufficient permissions' });
    }

    next();
  };
}

// Usage
app.get('/admin/users', authenticate(), authorize('admin'), (req, res) => {
  // Only admins can access
});

app.get('/orders', authenticate(), authorize('customer', 'support', 'admin'), (req, res) => {
  // Customers, support, and admins can access
});

Permission-Based Authorization

Permissions are more granular than roles — instead of “is this user an admin?” you ask “does this user have the orders:delete permission?”. This defends against privilege creep, where admin roles accumulate every capability and become too risky to assign. By separating permissions from roles and attaching them to tokens, you enable precise access control: support can read orders but not delete them, even though both actions live on the same resource. Wildcard matching (orders:*) keeps policies maintainable without enumerating every action. In zero-trust, permission-based auth is the “least privilege in practice” layer — each service verifies not just that the caller is authenticated, but that they have the specific capability for this specific operation.
// middleware/permission.js
function checkPermission(...requiredPermissions) {
  return (req, res, next) => {
    if (!req.user) {
      return res.status(401).json({ error: 'Not authenticated' });
    }

    const userPermissions = req.user.permissions;
    
    // Check for wildcard permission
    if (userPermissions.includes('*')) {
      return next();
    }

    const hasPermission = requiredPermissions.every(required => {
      // Check exact match
      if (userPermissions.includes(required)) {
        return true;
      }
      
      // Check wildcard match (e.g., 'orders:*' matches 'orders:read')
      const [resource, action] = required.split(':');
      if (userPermissions.includes(`${resource}:*`)) {
        return true;
      }
      
      return false;
    });

    if (!hasPermission) {
      return res.status(403).json({
        error: 'Insufficient permissions',
        required: requiredPermissions,
        have: userPermissions
      });
    }

    next();
  };
}

// Usage
app.post('/orders', 
  authenticate(),
  checkPermission('orders:create'),
  createOrder
);

app.delete('/orders/:id',
  authenticate(),
  checkPermission('orders:delete'),
  deleteOrder
);

Attribute-Based Access Control (ABAC)

ABAC is the most expressive authorization model — decisions are made based on attributes of the user, resource, action, and environment (time, IP, device). This prevents a class of attacks that RBAC cannot catch: a user with orders:update permission modifying someone else’s order (RBAC says “yes, they have the permission”; ABAC says “no, they are not the owner”). ABAC also enables contextual policies like “no financial operations outside business hours” or “no admin actions from unfamiliar IPs”, which are critical for fraud prevention. In zero-trust, ABAC is where you encode business-level risk: the policy engine evaluates every request against a rulebook, and each new rule hardens the system against a specific threat. The complexity cost is real — policy engines must be debuggable, testable, and performant, because they run on every request.
// authorization/PolicyEngine.js
class PolicyEngine {
  constructor() {
    this.policies = [];
  }

  addPolicy(policy) {
    this.policies.push(policy);
  }

  // Deny-overrides: every policy is evaluated, any DENY wins, otherwise at
  // least one ALLOW is required (default deny). With first-match-wins, an
  // early ALLOW such as owner-access would skip a later DENY such as
  // time-based-access.
  evaluate(context) {
    let allowed = false;
    for (const policy of this.policies) {
      const result = policy.evaluate(context);
      if (result.effect === 'DENY') {
        return { allowed: false, reason: result.reason || `Denied by ${policy.name}` };
      }
      if (result.effect === 'ALLOW') {
        allowed = true;
      }
    }
    return allowed
      ? { allowed: true }
      : { allowed: false, reason: 'No matching policy' };
  }
}

// Policy definitions
const policies = [
  {
    name: 'owner-access',
    evaluate: (context) => {
      const { user, resource } = context;

      if (resource && resource.ownerId === user.id) {
        return { effect: 'ALLOW' };
      }
      return { effect: 'CONTINUE' };
    }
  },
  {
    name: 'admin-access',
    evaluate: (context) => {
      const { user } = context;

      if (user.roles.includes('admin')) {
        return { effect: 'ALLOW' };
      }
      return { effect: 'CONTINUE' };
    }
  },
  {
    name: 'time-based-access',
    evaluate: (context) => {
      const { user, action, environment } = context;
      // Hour in the server's local time zone
      const hour = environment.time.getHours();

      // No financial operations outside business hours (09:00-17:59) for non-admins
      if (action.startsWith('payment:') && !user.roles.includes('admin')) {
        if (hour < 9 || hour > 17) {
          return {
            effect: 'DENY',
            reason: 'Financial operations only available during business hours'
          };
        }
      }
      return { effect: 'CONTINUE' };
    }
  }
];

const policyEngine = new PolicyEngine();
policies.forEach((policy) => policyEngine.addPolicy(policy));

// Middleware using policy engine
function abacAuthorize(action, getResource) {
  return async (req, res, next) => {
    let resource;
    try {
      resource = await getResource(req);
    } catch (error) {
      return next(error); // hand lookup failures to Express's error handler
    }

    if (!resource) {
      return res.status(404).json({ error: 'Not found' });
    }

    const context = {
      user: req.user,
      resource,
      action,
      environment: {
        ip: req.ip,
        time: new Date(),
        method: req.method
      }
    };

    const result = policyEngine.evaluate(context);

    if (!result.allowed) {
      return res.status(403).json({
        error: 'Access denied',
        reason: result.reason
      });
    }

    req.resource = resource;
    next();
  };
}

// Usage
app.put('/orders/:id',
  authenticate(),
  abacAuthorize('orders:update', async (req) => {
    return orderRepository.findById(req.params.id);
  }),
  updateOrder
);

Service-to-Service Authentication

API Keys for Internal Services

When services call each other, you cannot rely on user JWTs: many service-to-service calls happen in background jobs or system flows where no user is present. HMAC-signed requests solve this by giving each service a shared secret and requiring it to sign every request together with a timestamp. This defends against two critical attacks: impersonation (an attacker cannot forge requests without the secret) and replay (a captured request is rejected once its timestamp is more than five minutes old; tracking recently seen nonces closes even that window). In zero-trust, this is “machine identity”: services prove who they are just as users do. Compared to mTLS, HMAC is easier to deploy (no certificate infrastructure) but weaker: the secret is only as safe as your secrets manager, and there is no transport-layer mutual authentication.
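A minimal sketch of the HMAC scheme described above, using Node’s built-in crypto module and a ±5-minute timestamp window. The header names (x-service-id, x-timestamp, x-nonce, x-signature), the canonical string layout, and the in-memory nonce Map are assumptions for illustration, not a standard; a service running several instances would keep nonces in a shared store such as Redis.

```javascript
const crypto = require('node:crypto');

const MAX_SKEW_MS = 5 * 60 * 1000; // accept timestamps up to 5 minutes old or ahead

// The string both sides sign: changing the method, path, body, timestamp or
// nonce changes the signature.
function canonicalString({ method, path, body = '', timestamp, nonce }) {
  const bodyHash = crypto.createHash('sha256').update(body).digest('hex');
  return [method.toUpperCase(), path, timestamp, nonce, bodyHash].join('\n');
}

function hmac(secret, req, timestamp, nonce) {
  return crypto.createHmac('sha256', secret)
    .update(canonicalString({ ...req, timestamp, nonce }))
    .digest();
}

// Caller side: produce the headers to attach to the outgoing request.
function signRequest(req, serviceId, secret, now = Date.now()) {
  const timestamp = String(now);
  const nonce = crypto.randomUUID();
  return {
    'x-service-id': serviceId,
    'x-timestamp': timestamp,
    'x-nonce': nonce,
    'x-signature': hmac(secret, req, timestamp, nonce).toString('hex')
  };
}

// Receiver side. secrets: Map(serviceId -> secret); seenNonces: Map(nonce -> expiry ms).
function verifyRequest(req, headers, secrets, seenNonces, now = Date.now()) {
  const serviceId = headers['x-service-id'];
  const secret = secrets.get(serviceId);
  if (!secret) return { ok: false, reason: 'unknown service' };

  const ts = Number(headers['x-timestamp']);
  if (!Number.isFinite(ts) || Math.abs(now - ts) > MAX_SKEW_MS) {
    return { ok: false, reason: 'timestamp outside allowed window' };
  }

  const expected = hmac(secret, req, headers['x-timestamp'], headers['x-nonce']);
  const given = Buffer.from(headers['x-signature'] || '', 'hex');
  // timingSafeEqual throws if lengths differ, so compare lengths first.
  if (given.length !== expected.length || !crypto.timingSafeEqual(given, expected)) {
    return { ok: false, reason: 'bad signature' };
  }

  // Drop expired nonces, then reject any nonce already used inside the window.
  for (const [nonce, expiry] of seenNonces) {
    if (expiry < now) seenNonces.delete(nonce);
  }
  if (seenNonces.has(headers['x-nonce'])) return { ok: false, reason: 'replayed request' };
  seenNonces.set(headers['x-nonce'], ts + MAX_SKEW_MS);

  return { ok: true, serviceId };
}

// Usage
const secrets = new Map([['order-service', 'example-shared-secret']]);
const seenNonces = new Map();
const request = { method: 'POST', path: '/charge', body: '{"amount":4200}' };
const headers = signRequest(request, 'order-service', secrets.get('order-service'));

console.log(verifyRequest(request, headers, secrets, seenNonces)); // { ok: true, serviceId: 'order-service' }
console.log(verifyRequest(request, headers, secrets, seenNonces).reason); // replayed request
```

The signature is checked with crypto.timingSafeEqual rather than ===, so response timing does not reveal how many leading bytes of a guessed signature were correct. Nonces are recorded only after the signature verifies, so unauthenticated callers cannot fill the nonce store.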
Caveats & Common Pitfalls in Service-to-Service Auth
  • “The network is secure” fallacy. Teams put all services in a VPC, assume the perimeter is enough, and skip per-call authentication. One compromised pod (SSRF, exposed dashboard, leaked credential) and the attacker now has full lateral access — because every internal service trusts every caller by IP.
  • Over-privileged service accounts. The order-service account has DROP TABLE rights on the users DB “because we might need it someday.” When the service is compromised, the blast radius is the entire company’s user data.
  • Shared HMAC secret distributed by email / Slack / config file. It leaks into Git history, CI logs, screenshots, and build artifacts within weeks. Rotation requires coordinated change across every consumer — which nobody does, so the original secret lives forever.
  • No clock-skew tolerance on timestamp validation. NTP drift of a few seconds across pods causes legitimate requests to be rejected as replay attempts. On-call gets paged at 2am and cannot figure out why only some instances fail.
Solutions & Patterns
  • Authenticate every call, including internal. Zero-trust means no implicit trust from network location. Use mTLS in production (via a service mesh) so every hop is cryptographically authenticated — a compromised pod cannot simply call http://payments-svc/charge.
  • One service, one identity, minimum privileges. Each service has its own IAM / SPIFFE identity, its own DB user with row-level scoping, and only the specific permissions it needs. When order-service is compromised, the attacker gets what the order-service account can do, nothing more.
  • Use short-lived, issuer-rotated credentials (Vault dynamic DB creds, AWS STS AssumeRole, SPIFFE SVIDs with 1-hour TTL) so a leaked credential expires before exploitation.
  • Allow ±5 minutes clock skew on HMAC timestamp validation and keep NTP synced via chrony / systemd-timesyncd.
  • Prefer SPIFFE/SPIRE or a service mesh (Istio, Linkerd) over hand-rolled HMAC — they give you automatic rotation, cryptographic identity, and audit logs.
Strong Answer Framework:
  1. Do not flip the table on day one. This pattern exists because it shipped and the fraud team has real performance constraints. Punishing a team for shortcuts they had to take closes the door on future cooperation. Frame it as a shared architectural fix, not a blame assignment.
  2. Quantify the actual risk. The payments DB has the canonical card-on-file data and transaction history. A compromise of fraud-detection — which has attacker-facing inputs (webhooks, ML model scoring, third-party signals) — now becomes a direct breach of payment data. This is a bounded-context violation and also likely a PCI-DSS scope expansion, which has real compliance and audit implications.
  3. Immediate mitigation (this week). Lock down the credentials: the fraud service gets a read-only DB user scoped to only the columns/tables it actually reads. Remove write access. Add row-level security if the DB supports it. This shrinks blast radius without breaking the fraud team.
  4. Short-term fix (this quarter). Work with the payment service owners to expose the exact data the fraud team needs as a read-only internal API — probably a gRPC endpoint that returns a narrow DTO. Benchmark it against the direct-DB path; if the fraud team’s concern was latency, a well-designed gRPC call + Redis cache is usually within 2-5 ms of a direct query and well inside their budget.
  5. Long-term (over the next few quarters). Move to event-driven: payment service emits PaymentProcessed, ChargebackFlagged events to Kafka; fraud service consumes and maintains its own read model. Now there is no synchronous coupling at all, and each service owns its data.
  6. Prevent recurrence. Add a CI check / network policy that denies outbound connections from services to databases they do not own (easy to enforce with Kubernetes NetworkPolicy or service mesh authorization). “Each service owns its DB” becomes a structurally enforced invariant, not a nice-to-have.
Real-World Example: Monzo published a detailed 2019 post-incident blog about exactly this pattern: a cross-service DB access shortcut that, when one of the consumer services had a bug, caused an outage for the owner service. Their fix was strict per-service DBs with API-only access, which is now a foundational rule in their platform.
Senior Follow-up Questions:
  • “What if the fraud team pushes back hard on latency?” Instrument both paths end-to-end. Usually the difference is 1-3 ms of network hop plus JSON/gRPC serialization. If that genuinely matters, use gRPC + connection pooling, or push the fraud check into the payment service itself via a sidecar. Almost never is direct DB access the right answer — it is the path of least design effort, not least latency.
  • “How does this interact with PCI-DSS?” Direct DB access likely puts fraud-detection fully in PCI scope (it touches cardholder data). Going through a narrow API keeps it in scope only for the narrow DTO it receives — shrinking audit surface and cost significantly.
  • “What is your enforcement mechanism so someone does not re-add this in 6 months?” NetworkPolicy in Kubernetes + IAM-at-the-database (services authenticate to RDS / Spanner with their own IAM role), plus a service_dependencies.yaml that is code-reviewed on every PR. Three layers: platform, identity, review.
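The code-review layer can be backed by a CI check over the dependency manifest. A minimal sketch, assuming the service_dependencies.yaml contents have already been parsed into a plain object (YAML parsing needs a library and is omitted) and that a separate registry records which service owns each database; both data shapes and all names here are hypothetical.

```javascript
// Hypothetical ownership registry: database -> owning service.
const dbOwners = new Map([
  ['payments_db', 'payment-service'],
  ['fraud_db', 'fraud-service'],
  ['orders_db', 'order-service']
]);

// Return one message per declared database dependency that crosses a service boundary.
function findCrossBoundaryDbAccess(dependencies, owners) {
  const violations = [];
  for (const [service, deps] of Object.entries(dependencies)) {
    for (const db of deps.databases || []) {
      const owner = owners.get(db);
      if (owner === undefined) {
        violations.push(`${service} -> ${db}: database has no registered owner`);
      } else if (owner !== service) {
        violations.push(`${service} -> ${db}: owned by ${owner}; call its API or consume its events`);
      }
    }
  }
  return violations;
}

// Example manifest contents, as if parsed from service_dependencies.yaml.
const declared = {
  'payment-service': { databases: ['payments_db'] },
  'fraud-service': { databases: ['fraud_db', 'payments_db'] }
};

const violations = findCrossBoundaryDbAccess(declared, dbOwners);
violations.forEach((v) => console.log(v));
// In a CI job, a non-empty result should fail the build (e.g. set process.exitCode = 1).
```

This only sees dependencies that are declared; the NetworkPolicy and database-IAM layers are what catch undeclared connections.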
Common Wrong Answers:
  • “Tell them to fix it by next sprint or block the deploy.” Adversarial; does not solve the underlying design problem and breeds workarounds.
  • “Just rotate the credential.” Does nothing — the architecture is still cross-boundary DB access, just with a fresh password.
Strong Answer Framework:
  1. Name the concrete threats. Env vars in images show up in: docker inspect, kubectl describe pod, core dumps, crash reports, logs if someone does console.log(process.env), CI build logs, registry layer dumps. The secret leaks in a dozen ways and rotation requires rebuilding and redeploying every image.
  2. Introduce a secrets manager without breaking anything. Deploy Vault / AWS Secrets Manager / GCP Secret Manager. Give every service a new Kubernetes ServiceAccount bound to a Vault role via the Kubernetes auth method. The service authenticates to Vault with its pod identity (no bootstrap credential problem).
  3. Dual-read phase (week 1-2). Update each service to read the secret from Vault if available, fall back to env var if not. Deploy this version everywhere. Nothing has changed operationally yet.
  4. Flip the switch (week 3). Populate Vault with the current secret. Services now read from Vault. Env var stays as fallback in case of Vault outage.
  5. Rotation test (week 4). Rotate the secret in Vault with overlap (new kid added, old kept for 1 hour). Services pick up the new one within their cache TTL. Validate zero-downtime rotation works end-to-end.
  6. Remove env var (week 5+). Rebuild images without the env var. Now the secret lives only in Vault with audit logs, TTL, and rotation — and the image is safe to leak to a public registry.
  7. Prevent regression. Add a CI check that fails the build if any env var matches a secret-like name pattern (*_SECRET, *_KEY, *_PASSWORD, *_TOKEN).
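The regression check in step 7 can be sketched as a scan of Dockerfile text for ENV instructions whose names match the secret-like suffixes listed above. This sketch also flags ARG, since build-time variable values can be visible in the image history; the function name and output shape are illustrative assumptions, and the parser ignores line continuations.

```javascript
// Names ending in _SECRET, _KEY, _PASSWORD or _TOKEN are treated as secrets.
const SECRET_NAME = /_(SECRET|KEY|PASSWORD|TOKEN)$/i;

// Return { line, instruction, name } for every ENV/ARG that declares a secret-like name.
// Simplified parser: assumes one instruction per line (no "\" continuations).
function findSecretLikeVars(dockerfileText) {
  const findings = [];
  dockerfileText.split('\n').forEach((rawLine, index) => {
    const match = rawLine.trim().match(/^(ENV|ARG)\s+(.+)$/i);
    if (!match) return;
    const tokens = match[2].trim().split(/\s+/);
    // "ENV A=1 B=2" declares several names; legacy "ENV NAME value" and "ARG NAME" declare one.
    const names = tokens[0].includes('=')
      ? tokens.filter((t) => t.includes('=')).map((t) => t.split('=')[0])
      : [tokens[0]];
    for (const name of names) {
      if (SECRET_NAME.test(name)) {
        findings.push({ line: index + 1, instruction: match[1].toUpperCase(), name });
      }
    }
  });
  return findings;
}

const dockerfile = [
  'FROM node:20-slim',
  'ARG NPM_TOKEN',
  'ENV NODE_ENV=production JWT_ACCESS_SECRET=changeme',
  'ENV LOG_LEVEL info',
  'CMD ["node", "server.js"]'
].join('\n');

const findings = findSecretLikeVars(dockerfile);
console.log(findings);
// A CI step would fail the build whenever findings.length > 0.
```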
Real-World Example: Uber’s security team published (around 2018-2019) a detailed post on migrating from static env-var secrets to Vault dynamic secrets, specifically calling out that they found leaked database credentials in a decommissioned internal wiki that had been copied from image env vars; the migration was partly triggered by that finding.
Senior Follow-up Questions:
  • “What if Vault is down? Is your service dead?” In-memory secret cache with a TTL longer than typical Vault outages (e.g., 15-30 min cache, 5-min refresh). Vault HA with Raft cluster. For truly critical paths, secrets are fetched at startup and refreshed in the background — a Vault outage does not affect already-running pods.
  • “How do you rotate a JWT signing secret without logging everyone out?” JWKS with multiple active kids. Add the new key first, sign new tokens with the new kid, let old tokens (signed with old kid) expire naturally. Remove the old key only after the longest token TTL has passed. Zero-downtime rotation.
  • “How do you ensure a rogue or compromised service cannot read another service’s secrets?” Vault policies scoped per role. order-service role has read access to secret/order-service/* only. If compromised, it cannot read secret/payment-service/*. Audit log every read so you can detect anomalous access.
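The “what if Vault is down?” answer can be sketched as a small in-memory cache using the 5-minute refresh and a 30-minute staleness limit from the range suggested above. The fetchSecret function stands in for a real Vault or cloud secrets-manager read and is hypothetical; for brevity this version refreshes lazily on read instead of on a background timer, and the injectable clock exists only to make the behaviour easy to demonstrate.

```javascript
class SecretCache {
  // fetchSecret: hypothetical async () => string, e.g. a Vault KV read.
  constructor(fetchSecret, { refreshMs = 5 * 60 * 1000, maxStaleMs = 30 * 60 * 1000, now = Date.now } = {}) {
    this.fetchSecret = fetchSecret;
    this.refreshMs = refreshMs;
    this.maxStaleMs = maxStaleMs;
    this.now = now;
    this.value = undefined;
    this.fetchedAt = -Infinity; // forces a fetch on first use
  }

  async get() {
    const age = this.now() - this.fetchedAt;
    if (age < this.refreshMs) return this.value; // fresh enough, no network call

    try {
      this.value = await this.fetchSecret();
      this.fetchedAt = this.now();
      return this.value;
    } catch (err) {
      // Secrets manager unreachable: keep serving the last good value until it is too old.
      if (this.value !== undefined && age < this.maxStaleMs) return this.value;
      throw new Error(`secret unavailable and cache expired: ${err.message}`);
    }
  }
}

// Demo with a fake clock and a simulated outage.
async function demo() {
  let clock = 0;
  let vaultUp = true;
  const cache = new SecretCache(async () => {
    if (!vaultUp) throw new Error('connection refused');
    return 'db-password-v1';
  }, { now: () => clock });

  console.log(await cache.get()); // fetched from the simulated secrets manager
  vaultUp = false;
  clock = 10 * 60 * 1000;
  console.log(await cache.get()); // outage, but the cached value is only 10 minutes old
  clock = 40 * 60 * 1000;
  await cache.get().catch((err) => console.log(err.message)); // past the 30-minute limit
}

demo();
```

Failed refreshes do not update fetchedAt, so every read during an outage retries the secrets manager and the cache recovers as soon as it is reachable again.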
Common Wrong Answers:
  • “Put secrets in Kubernetes Secrets.” Kubernetes Secrets are base64-encoded, not encrypted at rest by default, and anyone with get secrets RBAC can read them in plaintext. Better than env-vars-in-image but still weak — enable KMS encryption at rest and restrict RBAC, or use a real secrets manager.
  • “Rotate the JWT secret at 2am during the maintenance window.” Every logged-in user is logged out. Works once; the team never rotates again because it was painful. Do it right with overlapping keys.
Strong Answer Framework:
  1. Acknowledge the fundamental issue. A shared API key provides knowledge-based auth — anyone who has it is trusted. There is no cryptographic binding between “this request” and “the real order-service identity.” Stealing the secret is equivalent to being the service.
  2. Move to mTLS with per-service certificates. Every service gets its own certificate signed by an internal CA. The payment service verifies the client cert on every connection. A copied certificate is useless without its private key, which never leaves the workload, and even a stolen key pair stops working quickly because the cert has a short TTL.
  3. Automate cert issuance and rotation via a service mesh. Istio, Linkerd, or Consul Connect issue and rotate short-lived certs (1-24 hour TTL) to every pod automatically via SPIFFE SVIDs. The app code does not handle certificates at all — the sidecar proxy does. This removes the operational burden that makes teams avoid mTLS.
  4. Add authorization on top of authentication. mTLS proves identity; authorization policies enforce “who can call what.” In Istio: AuthorizationPolicy that says “only workloads with SPIFFE identity spiffe://cluster/ns/default/sa/order-service can call POST /charge on payment-service.” Identity + action policy, both enforced at the mesh sidecar.
  5. Add audit logging. Every inter-service call logs the client identity (from cert CN or SPIFFE ID), endpoint, and timestamp. After a breach, you can trace exactly what moved laterally.
  6. Defense in depth: network policies. Kubernetes NetworkPolicy or Cilium restricts which pods can reach which pods at L4. Even if auth is bypassed, the network layer denies the connection.
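Once the caller's identity is known, step 4 (identity plus action) reduces to a default-deny lookup. The following is a minimal sketch that assumes the caller's SPIFFE ID was already extracted from its verified client certificate, either by the mesh sidecar or by your own middleware, and exposed as req.spiffeId. The policy shape is invented for illustration and is not Istio's AuthorizationPolicy schema.

```javascript
// Hypothetical policy: per target service, which caller identity may do what.
const POLICY = {
  'payment-service': [
    { principal: 'spiffe://cluster/ns/default/sa/order-service', method: 'POST', path: '/charge' },
  ],
};

// Default deny: a target with no rules, or no matching rule, rejects the call.
function isAllowed(policy, targetService, callerSpiffeId, method, path) {
  const rules = policy[targetService] || [];
  return rules.some(
    (rule) => rule.principal === callerSpiffeId && rule.method === method && rule.path === path
  );
}

// Express-style middleware; assumes an earlier step set req.spiffeId.
function requireAuthorized(policy, targetService) {
  return (req, res, next) => {
    if (!isAllowed(policy, targetService, req.spiffeId, req.method, req.path)) {
      return res.status(403).json({ error: 'Forbidden' });
    }
    next();
  };
}
```

Keeping the check default-deny means that a newly deployed service can call nothing until someone writes a rule for it, which is the zero-trust posture the framework describes.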
Real-World Example: Netflix’s Application Security team published (2019) a post describing their adoption of SPIFFE for internal mTLS specifically because shared API keys “fail open”: stolen keys give full access with no cryptographic binding. They moved to per-workload cert-bound identities with hour-scale TTLs.
Senior Follow-up Questions:
  • “What is the operational cost of running a service mesh?” Non-trivial — you are adding a sidecar to every pod (roughly 100 MB RAM, 1-5 ms added latency). For organizations with 50+ services, the amortized security benefit is worth it; for a 5-service startup, a simpler per-service JWT pattern with rotating keys might suffice until you scale.
  • “What happens if the mesh control plane goes down?” Data plane (sidecars) keeps working with the last cert they received. Istio recommends cert TTLs longer than any realistic control-plane outage (e.g., 24-hour certs with 12-hour rotation). You get graceful degradation, not instant failure.
  • “How do you handle cross-cluster / cross-cloud service calls?” Federated trust: the CA of cluster A is trusted by cluster B. SPIFFE Federation or Istio’s multi-cluster mode makes this explicit. Without it, each cluster is an identity silo.
Common Wrong Answers:
  • “Just rotate the API key more frequently.” A 1-hour rotation does not help if the attacker steals the freshly-rotated key in the first minute. The knowledge-based auth model is the problem.
  • “Use a VPN between services.” Perimeter thinking — once inside the VPN, everyone is trusted again. Zero-trust requires per-call authentication.
// middleware/serviceAuth.js
const crypto = require('crypto');

class ServiceAuthenticator {
  constructor(options = {}) {
    this.services = new Map();
    this.algorithm = 'sha256';
  }

  registerService(serviceId, secret) {
    this.services.set(serviceId, secret);
  }

  generateSignature(serviceId, timestamp, body) {
    const secret = this.services.get(serviceId);
    if (!secret) {
      throw new Error(`Unknown service: ${serviceId}`);
    }

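    // Assumes both sides serialize the body identically; signing the raw
    // request bytes is more robust if clients and servers differ.
    const payload = `${serviceId}:${timestamp}:${JSON.stringify(body)}`;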
    return crypto
      .createHmac(this.algorithm, secret)
      .update(payload)
      .digest('hex');
  }

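  verify(serviceId, timestamp, body, signature) {
    // Check timestamp freshness (narrows, but does not close, the replay window)
    const ts = Number.parseInt(timestamp, 10);
    if (!Number.isFinite(ts)) {
      return { valid: false, reason: 'Invalid timestamp' };
    }
    const age = Date.now() - ts;
    if (age > 300000 || age < -30000) { // older than 5 minutes, or >30s in the future
      return { valid: false, reason: 'Request timestamp out of range' };
    }

    let expectedSignature;
    try {
      expectedSignature = this.generateSignature(serviceId, timestamp, body);
    } catch {
      return { valid: false, reason: 'Unknown service' };
    }

    const received = Buffer.from(signature, 'hex');
    const expected = Buffer.from(expectedSignature, 'hex');
    // timingSafeEqual throws when lengths differ, so compare lengths first
    const valid = received.length === expected.length &&
      crypto.timingSafeEqual(received, expected);

    return { valid, reason: valid ? null : 'Invalid signature' };
  }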
}

// Middleware
function serviceAuthenticate(authenticator) {
  return (req, res, next) => {
    const serviceId = req.headers['x-service-id'];
    const timestamp = req.headers['x-timestamp'];
    const signature = req.headers['x-signature'];

    if (!serviceId || !timestamp || !signature) {
      return res.status(401).json({ error: 'Missing service authentication' });
    }

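    // Requires express.json() earlier in the chain; fall back to {} for
    // bodiless requests so the payload matches what the client signed
    const result = authenticator.verify(serviceId, timestamp, req.body ?? {}, signature);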
    
    if (!result.valid) {
      return res.status(401).json({ error: result.reason });
    }

    req.service = { id: serviceId };
    next();
  };
}

// Client-side usage
class ServiceClient {
  constructor(serviceId, secret) {
    this.serviceId = serviceId;
    this.authenticator = new ServiceAuthenticator();
    this.authenticator.registerService(serviceId, secret);
  }

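  async request(url, options = {}) {
    const method = (options.method || 'GET').toUpperCase();
    // fetch rejects a body on GET/HEAD, so sign an empty object for those
    const hasBody = method !== 'GET' && method !== 'HEAD';
    const body = hasBody ? (options.body || {}) : {};
    const timestamp = Date.now().toString();
    const signature = this.authenticator.generateSignature(
      this.serviceId,
      timestamp,
      body
    );

    return fetch(url, {
      ...options,
      method,
      headers: {
        ...options.headers,
        'X-Service-ID': this.serviceId,
        'X-Timestamp': timestamp,
        'X-Signature': signature,
        'Content-Type': 'application/json'
      },
      body: hasBody ? JSON.stringify(body) : undefined
    });
  }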
}
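The 5-minute freshness check in verify() narrows the replay window, but it does not close it: a captured request can be resent unchanged until it ages out. A common complement is to remember the signatures that have already been accepted. The following is a minimal in-memory sketch. The ReplayCache name is illustrative, and with several instances the set of seen signatures would need to live in a shared store such as Redis.

```javascript
// Remembers accepted signatures until they would have expired anyway.
class ReplayCache {
  constructor(windowMs = 5 * 60 * 1000) {
    this.windowMs = windowMs;
    this.seen = new Map(); // signature -> time (ms) after which it can be forgotten
  }

  // Returns true the first time a signature is seen, false on a replay.
  checkAndRemember(signature, now = Date.now()) {
    this.evictExpired(now);
    if (this.seen.has(signature)) return false;
    this.seen.set(signature, now + this.windowMs);
    return true;
  }

  // Deleting from a Map while iterating it is safe in JavaScript.
  evictExpired(now) {
    for (const [sig, forgetAt] of this.seen) {
      if (forgetAt <= now) this.seen.delete(sig);
    }
  }
}
```

Call checkAndRemember(signature) in the middleware only after verify() has succeeded, and reject with 401 when it returns false. The eviction loop is O(n) per call. That is acceptable for a sketch, but under real traffic you would replace it with a store that supports TTLs.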

Mutual TLS (mTLS)

Standard TLS proves the server’s identity to the client (“yes, this really is bank.com”), but the server trusts anyone who connects. mTLS adds the reverse direction: both sides present certificates, so the server also verifies the client. In microservices, this means the Order Service can prove “I am really the Order Service” cryptographically, not just by claiming so in a header. This defeats an entire class of attacks. A rogue pod without a certificate signed by your internal CA cannot talk to other services at all, and a compromised service can only act under its own identity. mTLS is the transport-layer backbone of zero-trust, because identity is verified on every connection instead of being inferred from network location. The tradeoff is heavy operational complexity. You need a certificate authority, an issuance pipeline, a rotation policy, and a way to handle expired certs gracefully. This is why service meshes like Istio exist: they automate the whole lifecycle.
Caveats & Common Pitfalls with mTLS
  • Expired certificates = instant outage. A 1-year cert silently ticks down; someone forgets the rotation; every service stops accepting connections at midnight on the expiry date. Entire production environments have gone down this way (including a famous Microsoft Teams outage in 2020).
  • Trust store drift. Not every pod has the current CA bundle; some trust the old CA, some the new, and connections fail intermittently. Debugging “works from pod A, fails from pod B” at 3am is miserable.
  • Relying on the network perimeter instead of mTLS. “We are inside the VPC so we trust everything.” One compromised pod (exposed admin dashboard, SSRF, prototype pollution) and the attacker now speaks to every service with full trust.
  • Hand-rolled cert management. A bash script that generates certs once and hopes for the best. Rotation never happens. Certs outlive the people who issued them.
Solutions & Patterns
  • Short-lived certs with automated rotation: 24-hour TTL certs issued by a mesh (Istio, Linkerd, Consul Connect) or SPIRE. Rotation runs continuously in the background, so a broken rotation pipeline shows up within a day instead of lying dormant until a one-year cert expires in production.
  • Monitor certificate expiry. A Prometheus alert on probe_ssl_earliest_cert_expiry - time() < 7 * 86400 (the metric comes from blackbox_exporter TLS probes) warns when any probed cert expires within the next 7 days. This catches anything the automation misses.
  • Use cert-manager in Kubernetes for service certs — it handles issuance, renewal, and Kubernetes Secret updates automatically. Pair with Let’s Encrypt for public-facing, or an internal CA (step-ca, HashiCorp Vault PKI) for service-to-service.
  • Pin SPIFFE IDs, not IPs or DNS. The mesh verifies spiffe://cluster/ns/default/sa/order-service, which is cryptographically bound to the pod identity — no DNS spoofing, no IP tricks.
  • Keep sane failure modes: during cert rotation, accept both old and new CA for an overlap window so an in-flight rotation cannot cause an outage.
┌─────────────────────────────────────────────────────────────────────────────┐
│                         MUTUAL TLS (mTLS)                                    │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  Regular TLS:                    Mutual TLS:                                │
│  ────────────                    ──────────────                             │
│                                                                              │
│  Client                          Client                                      │
│    │                               │                                        │
│    │──▶ Verify Server Cert         │──▶ Verify Server Cert                  │
│    │                               │                                        │
│    │    (server authenticated)     │    (server authenticated)              │
│    │                               │                                        │
│    │                               │◀── Verify Client Cert                  │
│    │                               │                                        │
│    ▼                               ▼    (BOTH authenticated)                │
│  Server                          Server                                      │
│                                                                              │
│                                                                              │
│  Certificate Chain:                                                          │
│  ─────────────────                                                          │
│                                                                              │
│       ┌─────────────────────┐                                               │
│       │     Root CA         │                                               │
│       │  (Trusted by all)   │                                               │
│       └─────────┬───────────┘                                               │
│                 │                                                           │
│       ┌─────────┴───────────┐                                               │
│       │   Intermediate CA   │                                               │
│       └─────────┬───────────┘                                               │
│                 │                                                           │
│    ┌────────────┼────────────┐                                              │
│    │            │            │                                              │
│    ▼            ▼            ▼                                              │
│  ┌─────┐    ┌─────┐    ┌─────┐                                             │
│  │Order│    │Pay  │    │Inv  │                                             │
│  │Cert │    │Cert │    │Cert │                                             │
│  └─────┘    └─────┘    └─────┘                                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

mTLS Configuration

Configuring mTLS in application code gives you fine-grained control over what certificate properties you trust (CN, SANs, issuer chain). The server demands a client certificate, validates it against the trusted CA bundle, and extracts the client’s identity from the certificate’s Common Name. This prevents “anonymous caller” attacks — every connection is tied to a specific service identity, and you can audit every byte of traffic back to an actual certificate. In zero-trust, extracting the service name from the cert CN is the foundation of authorization: “the Payment Service is calling me — should I let it?”. The tradeoff is that manual mTLS is brittle compared to a mesh: certificates expire, rotations break things, and debugging TLS handshake errors at 3am is not fun.
// security/mtls.js
const https = require('https');
const fs = require('fs');

class MTLSServer {
  constructor(options) {
    this.caCert = fs.readFileSync(options.caPath);
    this.serverCert = fs.readFileSync(options.certPath);
    this.serverKey = fs.readFileSync(options.keyPath);
  }

  createServer(app) {
    return https.createServer({
      ca: this.caCert,
      cert: this.serverCert,
      key: this.serverKey,
      requestCert: true,        // Request client certificate
      rejectUnauthorized: true  // Reject invalid certificates
    }, app);
  }

  // Middleware to extract client certificate info
  extractClientCert() {
    return (req, res, next) => {
      const cert = req.socket.getPeerCertificate();
      
      if (!cert || Object.keys(cert).length === 0) {
        return res.status(401).json({ error: 'Client certificate required' });
      }

      req.clientCert = {
        subject: cert.subject,
        issuer: cert.issuer,
        valid_from: cert.valid_from,
        valid_to: cert.valid_to,
        fingerprint: cert.fingerprint,
        serialNumber: cert.serialNumber
      };

      // Extract service name from certificate CN
      req.serviceName = cert.subject.CN;

      next();
    };
  }
}

class MTLSClient {
  constructor(options) {
    this.agent = new https.Agent({
      ca: fs.readFileSync(options.caPath),
      cert: fs.readFileSync(options.certPath),
      key: fs.readFileSync(options.keyPath),
      rejectUnauthorized: true
    });
  }

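  // Node's built-in fetch ignores the `agent` option, so the client certificate
  // would never be presented. https.request honors the mTLS agent.
  // Resolves with { status, headers, body } (body as a UTF-8 string).
  request(url, options = {}) {
    const { body, ...requestOptions } = options;
    return new Promise((resolve, reject) => {
      const req = https.request(url, { ...requestOptions, agent: this.agent }, (res) => {
        const chunks = [];
        res.on('data', (chunk) => chunks.push(chunk));
        res.on('end', () => resolve({
          status: res.statusCode,
          headers: res.headers,
          body: Buffer.concat(chunks).toString('utf8')
        }));
        res.on('error', reject);
      });
      req.on('error', reject);
      if (body) req.write(body);
      req.end();
    });
  }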
}

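// Usage
const express = require('express');
const app = express();

const mtlsServer = new MTLSServer({
  caPath: '/certs/ca.crt',
  certPath: '/certs/server.crt',
  keyPath: '/certs/server.key'
});

// Register the identity middleware before any routes that rely on req.serviceName
app.use(mtlsServer.extractClientCert());

const server = mtlsServer.createServer(app);
server.listen(8443);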
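The middleware above derives the caller's identity from the certificate CN. Pinning a SPIFFE ID, as the solutions list recommends, means reading the URI SAN instead. Node exposes a peer certificate's SANs as a single comma-separated string in the subjectaltname field returned by getPeerCertificate(), so a minimal sketch can parse that string directly. The function names and the trust-domain check are illustrative assumptions.

```javascript
// Returns the caller's SPIFFE ID, or null if the cert does not carry exactly
// one spiffe:// URI SAN in the expected trust domain.
function spiffeIdFromPeerCert(cert, trustDomain) {
  if (!cert || typeof cert.subjectaltname !== 'string') return null;
  // e.g. "DNS:order-service, URI:spiffe://cluster/ns/default/sa/order-service"
  const uris = cert.subjectaltname
    .split(', ')
    .filter((entry) => entry.startsWith('URI:spiffe://'))
    .map((entry) => entry.slice('URI:'.length));
  // An X.509 SVID carries exactly one SPIFFE ID; anything else is suspicious.
  if (uris.length !== 1) return null;
  const id = uris[0];
  return id.startsWith(`spiffe://${trustDomain}/`) ? id : null;
}

// Express middleware; assumes the server was created with requestCert: true and
// rejectUnauthorized: true, so the certificate chain has already been verified.
function requireSpiffeId(trustDomain, allowedIds) {
  return (req, res, next) => {
    const id = spiffeIdFromPeerCert(req.socket.getPeerCertificate(), trustDomain);
    if (!id || !allowedIds.includes(id)) {
      return res.status(403).json({ error: 'Caller identity not allowed' });
    }
    req.spiffeId = id;
    next();
  };
}
```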

Certificate Generation Script

Shell scripts using OpenSSL are the universal baseline for certificate operations — they work in any CI pipeline, any container, any cloud. The CA is the root of trust; every service certificate is signed by it, and services trust any cert with that signature. In production you would replace this with cert-manager, Vault PKI, or a service mesh CA, but understanding the raw OpenSSL flow is essential for debugging when those higher-level tools fail.
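#!/bin/bash
# generate-certs.sh
set -euo pipefail

# Create CA (protect ca.key: anyone holding it can mint service identities)
openssl genrsa -out ca.key 4096
chmod 600 ca.key
openssl req -new -x509 -days 365 -key ca.key -out ca.crt \
  -subj "/CN=Microservices CA/O=MyCompany"

# Generate service certificate
generate_service_cert() {
  local SERVICE=$1

  # Generate key
  openssl genrsa -out "${SERVICE}.key" 2048
  chmod 600 "${SERVICE}.key"

  # Generate CSR
  openssl req -new -key "${SERVICE}.key" -out "${SERVICE}.csr" \
    -subj "/CN=${SERVICE}/O=MyCompany"

  # Sign with CA. TLS clients check hostnames against subjectAltName, and in
  # mTLS the same cert serves as both server and client identity.
  openssl x509 -req -days 365 -in "${SERVICE}.csr" \
    -CA ca.crt -CAkey ca.key -CAcreateserial \
    -extfile <(printf "subjectAltName=DNS:%s\nextendedKeyUsage=serverAuth,clientAuth\n" "${SERVICE}") \
    -out "${SERVICE}.crt"

  echo "Generated certificates for ${SERVICE}"
}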

generate_service_cert "order-service"
generate_service_cert "payment-service"
generate_service_cert "inventory-service"

Secrets Management

HashiCorp Vault Integration

Secrets management is where most teams have the worst security posture — database passwords in env vars, API keys in CI configs, encryption keys in Git history. Vault solves this by centralizing secret storage with encryption at rest, audit logging, and (critically) dynamic secret generation. Instead of a long-lived database password shared by every instance, Vault issues a unique PostgreSQL user for each service request with a short TTL. This limits blast radius: if a service is compromised, the attacker gets one service’s credentials for one hour, not the master password forever. Kubernetes auth means services authenticate using their pod identity (no bootstrap credential problem), and leases auto-renew so secrets stay fresh. In zero-trust, Vault is the “dynamic identity for machines” layer — short-lived, verifiable, revocable.
Caveats & Common Pitfalls with Secrets Management
  • Secrets in environment variables. Visible to ps, docker inspect, kubectl describe pod, crash dumps, core files, and anything that logs process.env by accident. A single console.log(process.env) in a dev build and your secrets are in CloudWatch.
  • Secrets in Git. Even after git rm --cached, the secret lives in history forever. Scanners like TruffleHog pick it up within days. Every secret that has been committed must be treated as compromised and rotated.
  • Kubernetes Secrets without encryption at rest. Default Kubernetes Secrets are base64-encoded (not encrypted) and stored in etcd. Anyone with etcd access — including a backup copy — reads them plaintext.
  • Over-broad Vault policies. order-service has read access to secret/* because “simpler to set up.” When compromised, the attacker reads every secret in the organization.
  • Single master key with no rotation path. Envelope encryption saves you only if the master key itself rotates. Most teams never rotate it.
Solutions & Patterns
  • Never use env vars for production secrets. Either mount secrets as files from a secrets manager (Vault Agent, the Secrets Store CSI driver, the AWS Secrets Manager CSI provider) or fetch them from Vault at startup. Files can at least be permission-scoped to the app user (0400).
  • Dynamic secrets whenever possible. Vault can issue per-request DB credentials with a 1-hour TTL. A leaked credential expires before it is useful. Works for Postgres, MySQL, MongoDB, AWS IAM, and more.
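  • Enable encryption at rest for Kubernetes Secrets in etcd with an EncryptionConfiguration file passed to kube-apiserver (--encryption-provider-config), using a KMS provider (AWS KMS, GCP Cloud KMS, or HashiCorp Vault via a KMS plugin). This is not enabled by default, so you have to turn it on explicitly.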
  • Fine-grained Vault policies per service. Each service sees only its own secret paths. order-service role reads secret/order-service/* and nothing else. Audit log every read.
  • Regular rotation is automated, not manual. Schedule rotation jobs (Vault’s built-in rotation, AWS Secrets Manager rotation Lambdas) so secrets rotate on a cadence without human involvement. The only secret a human should type is the initial root token.
// security/VaultClient.js
const vault = require('node-vault');

class VaultClient {
  constructor(options = {}) {
    this.client = vault({
      apiVersion: 'v1',
      endpoint: options.endpoint || process.env.VAULT_ADDR,
      token: options.token || process.env.VAULT_TOKEN
    });
    
    this.cache = new Map();
    this.leases = new Map();
  }

  async init() {
    // If using Kubernetes auth
    if (process.env.KUBERNETES_SERVICE_HOST) {
      await this.authenticateKubernetes();
    }
  }

  async authenticateKubernetes() {
    const jwt = require('fs').readFileSync(
      '/var/run/secrets/kubernetes.io/serviceaccount/token',
      'utf8'
    );

    const result = await this.client.kubernetesLogin({
      role: process.env.VAULT_ROLE,
      jwt
    });

    this.client.token = result.auth.client_token;
  }

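  async getSecret(path, key = null) {
    // Serve from cache while it is fresh
    const cached = this.cache.get(path);
    if (cached && cached.expiry > Date.now()) {
      return key ? cached.data[key] : cached.data;
    }

    try {
      const result = await this.client.read(path);
      // KV v2 nests the payload under data.data; KV v1 returns it under data
      const data = result.data.data || result.data;

      const now = Date.now();
      this.cache.set(path, {
        data,
        expiry: now + 5 * 60 * 1000,      // refresh after 5 minutes
        staleUntil: now + 30 * 60 * 1000  // still usable during a Vault outage
      });

      return key ? data[key] : data;
    } catch (error) {
      // Vault unreachable: fall back to the last known value for a bounded time
      if (cached && cached.staleUntil > Date.now()) {
        console.warn(`Vault read failed for ${path}; serving cached value`);
        return key ? cached.data[key] : cached.data;
      }
      console.error(`Failed to read secret ${path}:`, error.message);
      throw error;
    }
  }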

  async getDatabaseCredentials(role) {
    const result = await this.client.read(`database/creds/${role}`);
    
    // Store lease for renewal
    this.leases.set(role, {
      leaseId: result.lease_id,
      leaseDuration: result.lease_duration
    });

    // Schedule renewal
    this.scheduleRenewal(role, result.lease_duration);

    return {
      username: result.data.username,
      password: result.data.password
    };
  }

  scheduleRenewal(role, leaseDuration) {
    // Renew at 75% of lease duration
    const renewIn = leaseDuration * 0.75 * 1000;
    
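    const timer = setTimeout(async () => {
      const lease = this.leases.get(role);
      if (lease) {
        try {
          const result = await this.client.write('sys/leases/renew', {
            lease_id: lease.leaseId
          });
          this.scheduleRenewal(role, result.lease_duration);
        } catch (error) {
          // Leases cannot be renewed past their max TTL, and Vault may be down:
          // plan to fetch fresh credentials (getDatabaseCredentials) and rotate the pool
          console.error(`Failed to renew lease for ${role}:`, error.message);
        }
      }
    }, renewIn);
    timer.unref(); // don't keep the process alive just for renewals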
  }
}

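// Usage (CommonJS has no top-level await, so wrap startup in a function)
async function main() {
  const vaultClient = new VaultClient();
  await vaultClient.init();

  const dbCreds = await vaultClient.getDatabaseCredentials('order-service');
  // Assuming a KV v2 engine mounted at secret/, reads go through secret/data/...
  const apiKey = await vaultClient.getSecret('secret/data/api-keys', 'stripe');

  return { dbCreds, apiKey };
}

main().catch((error) => {
  console.error('Startup failed:', error.message);
  process.exit(1);
});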
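The solutions list recommends mounting secrets as files with 0400 permissions instead of passing them as env vars. The following is a minimal sketch of the reading side. It assumes a POSIX filesystem and a mount such as /var/run/secrets/app/db-password; both the path and the readSecretFile helper are hypothetical. The helper refuses to use a file that group or other users can access.

```javascript
const fs = require('fs');

// Reads a mounted secret file, rejecting it if any group/other permission bit is set.
function readSecretFile(filePath, { enforceMode = true } = {}) {
  const stat = fs.statSync(filePath); // follows symlinks, as Kubernetes volume mounts use them
  // 0o077 covers every group and "other" permission bit.
  if (enforceMode && (stat.mode & 0o077) !== 0) {
    throw new Error(`${filePath} is accessible to group/other (mode ${(stat.mode & 0o777).toString(8)})`);
  }
  // Trim the single trailing newline many tools append when writing secret files.
  return fs.readFileSync(filePath, 'utf8').replace(/\r?\n$/, '');
}
```

Kubernetes Secret volumes default to mode 0644, so this check fails unless the volume spec sets defaultMode: 0400 (or you pass enforceMode: false as a deliberate decision, for example when fsGroup requires group access).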

Environment Variable Encryption

When Vault is overkill, encrypting secret values stored in env vars and config files is the pragmatic middle ground. Secrets are stored encrypted under a master key and decrypted only in memory. Strictly speaking, the class below encrypts directly with the master key; envelope encryption adds a per-secret data key that is itself wrapped by a KMS-held master key. AES-256-GCM provides both confidentiality and integrity (the auth tag catches tampering), which is non-negotiable for production. This defends against config file leaks, backup exposure, and container image scans, because an attacker who sees the encrypted value cannot decrypt it without the master key (which lives in a KMS or a separate bootstrap step). In zero-trust, this is a compensating control when a full secrets manager is not yet in place, and a defense-in-depth layer even when one is. For Kubernetes Secrets, remember that they are base64-encoded, not encrypted by default, so you still need envelope encryption on top.
// security/secrets.js
const crypto = require('crypto');

class SecretManager {
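  constructor(masterKey) {
    this.masterKey = Buffer.from(masterKey, 'hex');
    if (this.masterKey.length !== 32) {
      throw new Error('Master key must be 32 bytes (64 hex characters) for AES-256');
    }
  }

  encrypt(plaintext) {
    const iv = crypto.randomBytes(12); // 96-bit IV is the recommended size for GCM
    const cipher = crypto.createCipheriv('aes-256-gcm', this.masterKey, iv);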
    
    let encrypted = cipher.update(plaintext, 'utf8', 'hex');
    encrypted += cipher.final('hex');
    
    const authTag = cipher.getAuthTag();
    
    return `${iv.toString('hex')}:${authTag.toString('hex')}:${encrypted}`;
  }

  decrypt(encrypted) {
    const [ivHex, authTagHex, ciphertext] = encrypted.split(':');
    
    const iv = Buffer.from(ivHex, 'hex');
    const authTag = Buffer.from(authTagHex, 'hex');
    
    const decipher = crypto.createDecipheriv('aes-256-gcm', this.masterKey, iv);
    decipher.setAuthTag(authTag);
    
    let decrypted = decipher.update(ciphertext, 'hex', 'utf8');
    decrypted += decipher.final('utf8');
    
    return decrypted;
  }

  // Load and decrypt environment variables
  loadSecrets(encryptedEnv) {
    const secrets = {};
    
    for (const [key, value] of Object.entries(encryptedEnv)) {
      if (value.startsWith('ENC:')) {
        secrets[key] = this.decrypt(value.substring(4));
      } else {
        secrets[key] = value;
      }
    }
    
    return secrets;
  }
}

// Kubernetes Secret management
async function loadK8sSecret(secretName, namespace = 'default') {
  const k8s = require('@kubernetes/client-node');
  const kc = new k8s.KubeConfig();
  kc.loadFromCluster();
  
  const coreApi = kc.makeApiClient(k8s.CoreV1Api);
  
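  // Positional-argument form from @kubernetes/client-node 0.x; in 1.x the call is
  // readNamespacedSecret({ name, namespace }) and resolves to the V1Secret itself
  const response = await coreApi.readNamespacedSecret(secretName, namespace);
  const secrets = {};

  // data is absent when the Secret has no keys; Object.entries(undefined) would throw
  for (const [key, value] of Object.entries(response.body.data || {})) {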
    secrets[key] = Buffer.from(value, 'base64').toString('utf8');
  }
  
  return secrets;
}
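The SecretManager above encrypts every value directly with the master key, so rotating that key means re-encrypting every secret. Envelope encryption, mentioned in the caveats, avoids this problem. Each secret gets its own random data key (DEK), and only the DEK is encrypted (wrapped) with the master key (KEK). The following is a minimal sketch using Node's crypto module. It assumes a local 32-byte key stands in for what would normally be a KMS-held key, with KMS wrap and unwrap calls in place of the local encryption.

```javascript
const crypto = require('crypto');

function gcmEncrypt(key, plaintext) {
  const iv = crypto.randomBytes(12); // 96-bit IV, the recommended size for GCM
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext), cipher.final()]);
  return { iv, ciphertext, tag: cipher.getAuthTag() };
}

function gcmDecrypt(key, { iv, ciphertext, tag }) {
  const decipher = crypto.createDecipheriv('aes-256-gcm', key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]); // throws on wrong key or tampering
}

// Each secret gets a fresh DEK; only wrapping the DEK needs the KEK (the KMS call in production).
function envelopeEncrypt(kek, plaintext) {
  const dek = crypto.randomBytes(32);
  const payload = gcmEncrypt(dek, Buffer.from(plaintext, 'utf8'));
  const wrappedKey = gcmEncrypt(kek, dek);
  return { payload, wrappedKey };
}

function envelopeDecrypt(kek, { payload, wrappedKey }) {
  const dek = gcmDecrypt(kek, wrappedKey);
  return gcmDecrypt(dek, payload).toString('utf8');
}

// KEK rotation re-wraps the small DEK; the secret payload is never touched.
function rewrap(oldKek, newKek, envelope) {
  const dek = gcmDecrypt(oldKek, envelope.wrappedKey);
  return { payload: envelope.payload, wrappedKey: gcmEncrypt(newKek, dek) };
}
```

rewrap() is the point of the pattern: rotating the master key touches only the 32-byte wrapped keys, which makes regular master-key rotation cheap enough to actually schedule.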

Security Best Practices

Input Validation

Input validation is your first line of defense against injection attacks such as SQL injection, NoSQL injection, command injection, and XSS. Every input that crosses a trust boundary must be validated against a strict schema: unexpected fields dropped, sizes bounded, formats enforced. This blocks attacks that smuggle in unexpected fields (mass assignment) or oversized payloads (for example, a giant array meant to exhaust memory). Parameterized queries are non-negotiable. They separate code from data at the driver level, which is the only reliable way to prevent SQL injection. In zero-trust, “validate all input, even from trusted services” is the rule, because a compromised internal service can send malicious payloads too. The tradeoff is code verbosity, but schema libraries such as Joi (JavaScript) or Pydantic (Python) turn that into a small, declarative cost.
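// validation/sanitize.js
const Joi = require('joi');
const xss = require('xss');
const { Pool } = require('pg');

const db = new Pool(); // connection settings come from the standard PG* env vars

class ValidationError extends Error {
  constructor(details) {
    super(details.map((d) => d.message).join('; '));
    this.name = 'ValidationError';
    this.details = details;
  }
}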

// Schema validation
const orderSchema = Joi.object({
  customerId: Joi.string().uuid().required(),
  items: Joi.array().items(
    Joi.object({
      productId: Joi.string().uuid().required(),
      quantity: Joi.number().integer().min(1).max(100).required()
    })
  ).min(1).max(50).required(),
  shippingAddress: Joi.object({
    street: Joi.string().max(200).required(),
    city: Joi.string().max(100).required(),
    country: Joi.string().length(2).required()
  }).required()
});

function validateOrder(data) {
  const { error, value } = orderSchema.validate(data, {
    abortEarly: false,
    stripUnknown: true
  });

  if (error) {
    throw new ValidationError(error.details);
  }

  // Sanitize string fields
  value.shippingAddress.street = xss(value.shippingAddress.street);
  value.shippingAddress.city = xss(value.shippingAddress.city);

  return value;
}

// SQL Injection prevention - use parameterized queries
async function findOrder(orderId) {
  // BAD - SQL injection vulnerable
  // const result = await db.query(`SELECT * FROM orders WHERE id = '${orderId}'`);
  
  // GOOD - parameterized query
  const result = await db.query(
    'SELECT * FROM orders WHERE id = $1',
    [orderId]
  );
  
  return result.rows[0];
}

Rate Limiting

Rate limiting is how you survive abuse — brute-force login attempts, credential stuffing, scraping, and straight-up DoS. Without it, an attacker with a botnet can exhaust your database connections or CPU budget faster than you can autoscale. The key insight is that different endpoints need different limits: a product listing can tolerate 1000 req/s, but a login endpoint should cap at 5 attempts per hour per IP (because each failed attempt is a password guess). Backing the limiter with Redis makes it work across multiple instances, so an attacker cannot just round-robin to dodge it. In zero-trust, rate limiting is a quota layer: even authenticated, authorized requests get budgeted, because a compromised legitimate account is still an attack vector.
// middleware/rateLimit.js
const { rateLimit } = require('express-rate-limit');
// rate-limit-redis v4+ exports RedisStore by name and talks to Redis via sendCommand
const { RedisStore } = require('rate-limit-redis');
const Redis = require('ioredis');

const redis = new Redis(process.env.REDIS_URL);

// Each limiter needs its own store instance (and key prefix)
const makeStore = (prefix) => new RedisStore({
  sendCommand: (command, ...args) => redis.call(command, ...args),
  prefix
});

// General API rate limit
const apiLimiter = rateLimit({
  store: makeStore('rl:api:'),
  windowMs: 15 * 60 * 1000, // 15 minutes
  max: 100, // 100 requests per client IP per window
  message: {
    error: 'Too many requests',
    retryAfter: 'Please try again later'
  },
  standardHeaders: true,
  legacyHeaders: false
});

// Stricter limit for auth endpoints
const authLimiter = rateLimit({
  store: makeStore('rl:auth:'),
  windowMs: 60 * 60 * 1000, // 1 hour
  max: 5, // 5 failed attempts per client IP per window
  skipSuccessfulRequests: true, // successful logins do not count against the limit
  message: {
    error: 'Too many login attempts',
    retryAfter: 'Too many failed attempts from this address; try again later'
  }
});

// Apply
app.use('/api/', apiLimiter);
app.use('/auth/login', authLimiter);
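For each client, the Redis-backed store keeps roughly a fixed-window counter: it counts hits per key within the current window and resets the count when the window ends. The following is a minimal in-memory sketch of that algorithm; the class name is illustrative. Because it only sees traffic to one process, an attacker could dodge it by spreading requests across instances. That limitation is exactly why the configuration above keeps its counters in Redis.

```javascript
// Fixed-window counter keyed by client (e.g. IP address).
class FixedWindowLimiter {
  constructor({ windowMs, limit }) {
    this.windowMs = windowMs;
    this.limit = limit;
    this.counters = new Map(); // key -> { count, resetAt }
  }

  hit(key, now = Date.now()) {
    let entry = this.counters.get(key);
    // Start a new window on the first hit or once the old window has ended
    if (!entry || now >= entry.resetAt) {
      entry = { count: 0, resetAt: now + this.windowMs };
      this.counters.set(key, entry);
    }
    entry.count += 1;
    const allowed = entry.count <= this.limit;
    return {
      allowed,
      remaining: Math.max(0, this.limit - entry.count),
      retryAfterMs: allowed ? 0 : entry.resetAt - now,
    };
  }
}
```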

Security Headers

Security headers are the browser-side half of defense in depth. Content Security Policy (CSP) neutralizes XSS by telling the browser “only run scripts from these origins” — even if an attacker injects a <script> tag, the browser refuses to execute it. HSTS forces HTTPS so a man-in-the-middle cannot downgrade to plaintext. X-Frame-Options blocks clickjacking by preventing your site from being iframed. These controls are cheap (one middleware) but prevent entire attack categories. In zero-trust, they extend the “verify everything” principle to the client: the browser is not trusted to render your content safely without explicit instructions. The tradeoff is that strict CSP can break inline scripts and third-party widgets, so rollouts typically start in report-only mode to find violations before enforcing them.
// middleware/security.js
const helmet = require('helmet');

app.use(helmet({
  contentSecurityPolicy: {
    directives: {
      defaultSrc: ["'self'"],
      scriptSrc: ["'self'"],
      styleSrc: ["'self'", "'unsafe-inline'"],
      imgSrc: ["'self'", 'data:', 'https:'],
      connectSrc: ["'self'"],
      fontSrc: ["'self'"],
      objectSrc: ["'none'"],
      mediaSrc: ["'self'"],
      frameSrc: ["'none'"]
    }
  },
  crossOriginEmbedderPolicy: true, // require-corp: cross-origin resources must opt in (may block the https: images allowed above)
  crossOriginOpenerPolicy: true,
  crossOriginResourcePolicy: { policy: 'same-site' },
  dnsPrefetchControl: { allow: false },
  frameguard: { action: 'deny' },
  hsts: {
    maxAge: 31536000,
    includeSubDomains: true,
    preload: true
  },
  ieNoOpen: true,
  noSniff: true,
  originAgentCluster: true,
  permittedCrossDomainPolicies: { permittedPolicies: 'none' },
  referrerPolicy: { policy: 'strict-origin-when-cross-origin' },
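  xssFilter: true // sets X-XSS-Protection: 0, turning off the legacy browser XSS filter; CSP is the real XSS defense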
}));
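Helmet assembles the Content-Security-Policy header from the directives object above. The following minimal hand-rolled sketch (all names are illustrative) makes the wire format visible and shows the report-only rollout described earlier. With the Content-Security-Policy-Report-Only header, browsers report violations without blocking anything. Note that directive names are kebab-case on the wire (default-src, script-src), while Helmet accepts camelCase keys.

```javascript
// Serializes { 'default-src': ["'self'"], ... } into a CSP header value.
function buildCsp(directives) {
  return Object.entries(directives)
    .map(([name, sources]) => `${name} ${sources.join(' ')}`)
    .join('; ');
}

// Report-only mode lets browsers flag violations while still loading the content,
// so nothing breaks while the policy is being tuned.
function cspMiddleware(directives, { reportOnly = true } = {}) {
  const header = reportOnly ? 'Content-Security-Policy-Report-Only' : 'Content-Security-Policy';
  const value = buildCsp(directives);
  return (req, res, next) => {
    res.setHeader(header, value);
    next();
  };
}
```

With Helmet itself, the same rollout means setting reportOnly: true inside the contentSecurityPolicy options, then switching it to false once the violation reports are clean.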

Interview Questions

Answer:
Options:
  1. mTLS (Mutual TLS)
    • Both client and server present certificates
    • Strongest authentication
    • Used in service meshes (Istio, Linkerd)
  2. API Keys + HMAC
    • Service ID + timestamp + signature
    • Prevents replay attacks
    • Easier to implement than mTLS
  3. JWT Tokens
    • Service-specific JWTs
    • Can include permissions
    • Short expiry for security
Best Practice:
  • Use mTLS in production
  • API keys for development/simple cases
  • Always encrypt in transit (TLS)
Answer:
Storage:
  • Never in localStorage (XSS vulnerable)
  • HttpOnly cookies (preferred)
  • Memory only for SPAs
Token Security:
  • Short expiry (15-30 minutes)
  • Refresh tokens for session management
  • Rotate secrets regularly
Validation:
  • Verify signature, issuer, audience
  • Check expiry and not-before claims
  • Validate against token blacklist
Algorithm:
  • Use RS256 or ES256 for production
  • Never use alg: none
  • Specify algorithm in verification
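The claim checks listed above can be sketched as a single function that runs after the signature has been verified with a pinned algorithm. This is illustrative only: libraries such as jsonwebtoken perform these checks when given the corresponding options. The option names, the 30-second clock-skew allowance, and the isRevoked hook, which stands in for the denylist lookup, are all assumptions.

```javascript
// Returns null when every check passes, otherwise a short reason string.
function validateClaims(
  header,
  claims,
  { issuer, audience, allowedAlgs, clockSkewSec = 30, isRevoked = () => false },
  nowSec = Math.floor(Date.now() / 1000)
) {
  // Pinned algorithm list: "none" (or any unexpected alg) is rejected outright
  if (!allowedAlgs.includes(header.alg)) return 'alg not allowed';
  if (claims.iss !== issuer) return 'bad issuer';
  // aud may be a single string or an array of strings
  const aud = Array.isArray(claims.aud) ? claims.aud : [claims.aud];
  if (!aud.includes(audience)) return 'bad audience';
  // Tokens without exp are rejected too
  if (typeof claims.exp !== 'number' || nowSec >= claims.exp + clockSkewSec) return 'expired';
  if (typeof claims.nbf === 'number' && nowSec + clockSkewSec < claims.nbf) return 'not yet valid';
  if (claims.jti && isRevoked(claims.jti)) return 'revoked';
  return null;
}
```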
Answer:
Never:
  • Hardcode in source code
  • Commit to version control
  • Store in plain environment variables
Best Practices:
  1. Secrets Manager (HashiCorp Vault, AWS Secrets Manager)
    • Dynamic secret generation
    • Automatic rotation
    • Audit logging
  2. Kubernetes Secrets
    • Enable encryption at rest (not on by default)
    • RBAC for access control
    • Mount as files, not env vars
  3. Encryption
    • Encrypt secrets at rest
    • Use envelope encryption
    • Rotate encryption keys
  4. Principle of Least Privilege
    • Service-specific secrets
    • Time-limited access
    • Audit all access

Summary

Key Takeaways

  • JWT for user authentication
  • RBAC/ABAC for authorization
  • mTLS for service-to-service auth
  • Use secrets managers (Vault)
  • Defense in depth approach

Next Steps

In the next chapter, we’ll cover Containerization - Docker best practices for microservices.

Interview Deep-Dive

Strong Answer:The standard pattern is “trust the gateway, verify at the edge.” The API Gateway validates the JWT token once, extracts user claims (user ID, roles, permissions), and forwards them as trusted headers (X-User-ID, X-User-Roles) to downstream services. Internal services trust these headers because they are on an internal network that external clients cannot reach.The critical assumption is network segmentation. Internal services must be unreachable from outside the cluster. If an attacker can send requests directly to Service B bypassing the gateway, they can forge the X-User-ID header. This is why mutual TLS (mTLS) between services is important — it ensures that Service B only accepts requests from authenticated internal services, not from arbitrary sources.For the JWT itself, I propagate the original token alongside the trusted headers. Service B can optionally re-validate the JWT if it needs to make authorization decisions based on claims not forwarded by the gateway. But it should not call the auth service — it validates the signature locally using a cached public key.The tricky scenario is service-to-service calls that are not on behalf of a user — for example, a nightly batch job in Service A calling Service B. For these, I use service accounts with machine-to-machine JWTs (client credentials grant in OAuth2). The JWT identifies the calling service, not a user, and Service B’s authorization logic handles both user-context and service-context requests.Follow-up: “What happens when the JWT expires mid-request in a long-running saga that spans multiple services over several minutes?”This is a real problem with choreographed sagas. The JWT might expire between step 3 and step 4. The solution: the saga orchestrator (or the first service) extracts the relevant claims from the JWT at the start and stores them in the saga context (in the database, not the token). Downstream saga steps use these stored claims rather than the original JWT. 
The JWT was only needed for the initial authentication — once the saga is in progress, the saga’s own state carries the authorization context. This also handles the case where the user’s permissions change mid-saga: the saga completes with the permissions that were valid when it started, which is the correct behavior for financial transactions.
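The gateway side of the main answer can be sketched with the Python standard library alone. One loud assumption: the sketch verifies an HS256 (HMAC-SHA256) signature because the stdlib has no RSA or ECDSA verification; a real gateway would verify RS256/ES256 against a cached public key from the auth service, typically through a maintained JWT library. `verify_jwt_hs256`, `build_upstream_headers`, and the `make_hs256_token` test helper are hypothetical names.

```python
from __future__ import annotations

import base64
import hashlib
import hmac
import json
import time


def _b64url_encode(raw: bytes) -> str:
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


def _b64url_decode(segment: str) -> bytes:
    # JWT segments are unpadded base64url; restore the padding before decoding.
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def make_hs256_token(claims: dict, key: bytes) -> str:
    """Hypothetical test helper standing in for the auth service that issues tokens."""
    header = _b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url_encode(json.dumps(claims).encode())
    sig = hmac.new(key, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url_encode(sig)}"


def verify_jwt_hs256(token: str, key: bytes, now: float | None = None) -> dict:
    """Verify signature and expiry locally, with no call to the auth service."""
    header_b64, payload_b64, sig_b64 = token.split(".")  # ValueError if malformed
    header = json.loads(_b64url_decode(header_b64))
    if header.get("alg") != "HS256":
        raise ValueError("unexpected signing algorithm")  # pin alg; never trust the header
    signing_input = f"{header_b64}.{payload_b64}".encode()
    expected = hmac.new(key, signing_input, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
        raise ValueError("bad signature")
    claims = json.loads(_b64url_decode(payload_b64))
    # A token without "exp" is treated as expired.
    if claims.get("exp", 0) <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return claims


def build_upstream_headers(client_headers: dict, key: bytes) -> dict:
    """Validate the bearer token once and derive trusted identity headers."""
    auth = client_headers.get("Authorization", "")
    if not auth.startswith("Bearer "):
        raise ValueError("missing bearer token")
    claims = verify_jwt_hs256(auth[len("Bearer "):], key)
    # Drop client-supplied identity headers so they cannot be forged through the gateway.
    headers = {k: v for k, v in client_headers.items() if not k.lower().startswith("x-user-")}
    # The original Authorization header is kept, so downstream services can re-validate.
    headers["X-User-ID"] = claims["sub"]
    headers["X-User-Roles"] = ",".join(claims.get("roles", []))
    return headers
```

Pinning the expected `alg` instead of trusting the token's own header is deliberate: accepting whatever algorithm a token declares is a classic JWT vulnerability.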
Strong Answer: The approaches, in order of maturity, are environment variables, Kubernetes Secrets, and dedicated secrets managers (HashiCorp Vault, AWS Secrets Manager).

Environment variables are where most teams start and where many stay too long. The problem is that they leak easily: they can be read through process inspection (for example /proc/<pid>/environ on Linux), captured by crash reporters, and included in core dumps. They are also static; rotating a secret requires redeploying every service that uses it.

Kubernetes Secrets are a step up: they are stored in etcd (encrypted at rest only if configured), mounted as files or env vars, and, when mounted as files, can be updated without redeploying (values injected as env vars still need a pod restart). But they have a fundamental flaw: anyone whose RBAC permissions allow reading Secrets in the namespace can read them in plaintext, because the values are only base64-encoded. In a team of 30 engineers with that access, that is 30 people who can read every production database password.

HashiCorp Vault is the production-grade answer. Each service authenticates to Vault using its Kubernetes service account (no hardcoded credentials). Vault returns short-lived, auto-rotating secrets. Database credentials can be dynamic: Vault creates a unique PostgreSQL user for each credential request, with a 1-hour TTL. If credentials leak, the blast radius is one service for one hour, not the entire system forever.

The failure mode most teams hit with Vault is that Vault itself becomes a single point of failure. If Vault is down and a service restarts, the service cannot fetch its secrets and fails to start. The mitigation is caching secrets locally (encrypted) with a TTL, so a service can start with cached secrets even if Vault is temporarily unavailable. Vault also needs its own HA setup (multiple replicas, auto-unseal with a cloud KMS).

Follow-up: “A developer accidentally committed an API key to a public GitHub repository. Walk me through your incident response.”

Immediate actions in the first 15 minutes: revoke or rotate the exposed key at the provider (Stripe, AWS, etc.), scan the commit history for other secrets using a tool like trufflehog or gitleaks, and remove the commit from history using git filter-repo (the recommended replacement for git filter-branch) or BFG Repo-Cleaner. Rotation is what actually contains the leak: once a key has been public, assume it has been copied, because rewriting history does not remove it from existing clones and forks. Then update every service that uses the key with the new value. Post-incident: add a pre-commit hook that scans for secrets (detect-secrets, git-secrets), enable GitHub’s secret scanning feature, and run a team retrospective on why the secret was in code in the first place (usually because the local development setup required it; fix that with a .env.example file and Vault integration for development).
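The Vault caching mitigation from the main answer can be sketched as a read-through cache with two thresholds: refresh after `refresh_after` seconds, but keep serving the last good value for up to `max_stale` seconds while Vault is unreachable. This is a sketch under assumptions: `fetch` is a hypothetical callable wrapping your real Vault client and is assumed to raise `ConnectionError` when Vault is down, and `store` is any dict-like object. To survive a restart, as the answer requires, `store` must be durable and encrypted (for example, an encrypted file), which this sketch does not implement.

```python
from __future__ import annotations

import time
from typing import Callable, MutableMapping


class SecretCache:
    """Read-through secret cache that tolerates short secrets-manager outages."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        store: MutableMapping[str, tuple[str, float]],
        refresh_after: float,
        max_stale: float,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._fetch = fetch                  # hypothetical wrapper around the Vault client
        self._store = store                  # path -> (value, fetched_at); durable + encrypted in production
        self._refresh_after = refresh_after  # try Vault again once an entry is this old
        self._max_stale = max_stale          # never serve an entry older than this
        self._clock = clock                  # wall-clock time, so ages stay meaningful across restarts

    def get(self, path: str) -> str:
        now = self._clock()
        cached = self._store.get(path)
        if cached is not None and now - cached[1] < self._refresh_after:
            return cached[0]  # fresh enough: no call to Vault
        try:
            value = self._fetch(path)
        except ConnectionError:
            # Vault unreachable: fall back to the last good value if it is not too old.
            if cached is not None and now - cached[1] < self._max_stale:
                return cached[0]
            raise
        self._store[path] = (value, now)
        return value
```

For dynamic database credentials, keep `max_stale` below the lease TTL: once the lease expires, Vault revokes the database user, so a cached copy older than that is useless.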
Strong Answer: Zero Trust means “never trust, always verify,” even for internal service-to-service communication. In a traditional architecture, services inside the network perimeter are trusted. Zero Trust eliminates the perimeter concept entirely: every request, from any source, must be authenticated and authorized.

mTLS is the foundation, but it is not sufficient. mTLS proves identity: “this request came from the Payment Service, not from a random attacker.” It does not prove authorization: “is the Payment Service allowed to call the User Service’s delete endpoint?” You need both.

A complete Zero Trust implementation for microservices includes:
  • mTLS for transport-layer authentication (every connection is encrypted and mutually authenticated)
  • JWT or SPIFFE for application-layer identity (the calling service presents a token with its identity and permissions)
  • Authorization policies at each service (the Payment Service can call the User Service’s GET /users but not DELETE /users)
  • Network policies that restrict which pods can communicate (even with mTLS, the Inventory Service has no business talking to the Auth Service)
  • Audit logging of all service-to-service calls for compliance

In Istio, this translates to PeerAuthentication for mTLS enforcement and AuthorizationPolicy for per-endpoint access control, alongside Kubernetes NetworkPolicy (a core Kubernetes resource, not an Istio one) for pod-level traffic restriction. The authorization policies are the part most teams skip, but they are critical: they are the microservices equivalent of firewall rules.

The operational cost is real: every new service interaction requires an authorization policy update. At 50 services with 10 endpoints each, that is potentially 500 policies to maintain. I use a convention-based approach where services in the same namespace can communicate freely and cross-namespace calls require explicit policies. This reduces the policy count while maintaining meaningful boundaries.

Follow-up: “How do you handle the bootstrap problem, where a new service needs to authenticate itself but does not yet have credentials?”

In Kubernetes with Istio, the bootstrap is handled by the sidecar injector and Citadel (Istio’s certificate authority, merged into istiod since Istio 1.5). When a new pod starts, the sidecar automatically receives an mTLS certificate from Istio’s CA, with an identity derived from the pod’s service account. No manual credential provisioning is needed. For Vault-based systems, the Kubernetes auth method lets a pod authenticate using its service account token, which Kubernetes issues automatically. The chain of trust goes: Kubernetes API server -> service account token -> Vault authentication -> secret access. The only prerequisite is a one-time Vault role configuration per service, which is part of the service’s Terraform/Helm setup.
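The namespace convention from the main answer can be sketched as a small policy check. It mirrors the intent of Istio AuthorizationPolicy rules but is not how Istio evaluates them; `Rule`, `namespace_of`, and `is_allowed` are hypothetical names. It assumes Istio-style SPIFFE identities of the form spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>, taken from the verified mTLS peer certificate rather than from a request header.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Explicit allowance for one cross-namespace caller (hypothetical)."""
    source: str  # caller's SPIFFE ID
    method: str  # HTTP method, e.g. "GET"
    path: str    # path prefix, matched on segment boundaries


def namespace_of(spiffe_id: str) -> str:
    # Istio-style identity: spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>
    if not spiffe_id.startswith("spiffe://"):
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    parts = spiffe_id[len("spiffe://"):].split("/")
    if len(parts) != 5 or parts[1] != "ns" or parts[3] != "sa":
        raise ValueError(f"unrecognized identity: {spiffe_id!r}")
    return parts[2]


def _path_matches(prefix: str, path: str) -> bool:
    # "/users" matches "/users" and "/users/42" but not "/users-admin".
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")


def is_allowed(caller: str, callee_namespace: str, method: str, path: str, rules: list[Rule]) -> bool:
    if namespace_of(caller) == callee_namespace:
        return True  # convention: same-namespace traffic needs no explicit policy
    # Cross-namespace calls are denied unless a rule explicitly allows them.
    return any(
        r.source == caller and r.method == method and _path_matches(r.path, path)
        for r in rules
    )
```

Default-deny for cross-namespace traffic is the important design choice here: a missing rule fails closed, which is the Zero Trust posture the answer argues for.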