Service Mesh
A service mesh provides infrastructure-layer functionality for service-to-service communication, taking networking concerns out of your application code. Think of it like the postal system for your microservices: your application writes a letter (sends a request), and the postal system handles routing, tracking, insurance (mTLS), and delivery confirmation (metrics), so your code never touches any of that infrastructure. The trade-off is real overhead: extra latency per request, additional memory per pod, and significant operational complexity. A service mesh is powerful medicine, but not every system needs it.
- Understand service mesh architecture and benefits
- Implement Istio for traffic management
- Configure mTLS automation
- Set up advanced traffic patterns (canary, A/B testing)
- Compare Istio vs Linkerd
What is a Service Mesh?
A service mesh solves a class of problems that simply cannot be solved cleanly inside application code, no matter how well-architected your libraries are. The core issue is that cross-cutting networking concerns (retries, mTLS, observability, traffic shaping) must behave identically across every service, in every language, at every version. Application-level libraries inevitably drift: Team A is on version 2.1 of the retry library while Team B is still on 1.3; the Java implementation interprets “timeout” differently from the Python one; a language you just adopted has no library at all. A service mesh moves these concerns out of application processes entirely and into a separate network layer, so the behavior is enforced once, uniformly, by operators rather than by dozens of developers. It is fundamentally a statement about where responsibility should live: infrastructure concerns belong in infrastructure, not sprinkled across product code.
A service mesh is worth adopting when you have enough services, enough languages, and enough compliance pressure that fixing these problems at the library level has become a tax on every team. It is overkill when you have a handful of services in a single language and a small ops team: the cognitive load of learning Envoy, xDS, and 100+ CRDs will exceed the benefit you get back.
The Problem
As microservices grow, cross-cutting concerns proliferate faster than features. A single service today needs retry logic, circuit breaking, load balancing, TLS, metrics, tracing, rate limiting, and service discovery, and every other service in your cluster needs the exact same set. The natural first instinct is to build a shared library that each service imports, but that breaks the moment you have Java and Python services in the same cluster. Even within a single language, coordinating library upgrades across 20 teams is a nightmare. The deeper issue is a violation of separation of concerns: networking is infrastructure, but you have pushed it into application code where developers must reason about it on every pull request.
Caveats & Common Pitfalls: Sidecar Overhead at Scale
Service Mesh Architecture
The sidecar proxy model is how a service mesh pulls off its transparency trick. When a pod starts in a mesh-enabled namespace, an init container rewrites the pod’s iptables rules so that every byte of inbound and outbound TCP traffic is redirected through the sidecar Envoy before reaching the application (inbound) or the network (outbound). Your application makes a plain HTTP call to http://inventory-service/items/42, believing it is talking directly to the remote service. What actually happens: the syscall hits iptables, gets redirected to localhost:15001 where Envoy is listening, Envoy performs service discovery, load balancing, mTLS, retries, and metrics collection, then forwards the request to a remote pod’s Envoy, which does its own checks before handing the request to the destination application. The application sees none of this machinery. This is the entire reason a service mesh needs no SDK: the kernel’s network stack does the interception.
The control plane versus data plane distinction is the other foundational concept. Think of it like the difference between air traffic control and the planes themselves: controllers decide the routing rules, but the planes fly independently. In a mesh, the control plane (istiod for Istio) computes configuration from your CRDs and pushes it via the xDS protocol to every sidecar. The data plane (the Envoy sidecars) carries actual request traffic. Because the data plane caches configuration locally, a control plane outage is annoying but not catastrophic — existing routing keeps working; you just lose the ability to change rules or issue new certificates. This decoupling is what makes the mesh operationally viable at scale.
A service mesh splits into two logical planes, and understanding this split is essential for reasoning about failure modes. The data plane consists of the sidecar proxies (usually Envoy) that sit next to every pod and intercept all network traffic — both inbound and outbound. These proxies carry actual request traffic and must stay up for your services to communicate. The control plane (istiod in Istio, the control components in Linkerd) pushes configuration and certificates to the data plane but does not touch live request traffic. This separation means a control plane outage does not immediately break request flow — existing proxies keep running with their last-known config — but you lose the ability to change routing rules, rotate certificates, or onboard new services until it recovers.
Istio Deep Dive
Installing Istio
Before you deploy Istio, understand what istioctl install actually does: it creates the istiod control plane deployment, installs the CRDs (VirtualService, DestinationRule, and 20+ others), and configures a mutating admission webhook that will inject Envoy sidecars into every pod in labeled namespaces. That webhook is where the “magic” happens: any pod created in a namespace with the istio-injection=enabled label automatically gets a second container (the Envoy proxy) plus init containers that set up iptables rules to redirect traffic through the proxy. If you skip the label, your pods run as usual and are invisible to the mesh.
Deploying a Service with Istio
The beauty of Istio sidecar injection is that your deployment YAML does not change. This is not an accident; it is a design decision that lets you adopt Istio incrementally without forcing every team to rewrite their manifests. The version: v1 label matters more than you might think: Istio’s DestinationRule resources use this label to define traffic subsets (v1 vs v2), so your deployments should always include a version label even if you only have one version today. Leaving it off means you cannot do canary deployments later without editing every deployment.
Reading Istio-Injected Environment Variables in Your Service
When the Istio sidecar is injected into your pod, it also exposes useful metadata via environment variables and the downward API: things like the pod’s workload name, namespace, mesh ID, and trust domain. Application code can read these to enrich logs, construct SPIFFE identities in logs, or make mesh-aware configuration decisions (like whether to skip TLS because the mesh is already handling it). A typical pattern is to use pydantic-settings to load these variables into a strongly-typed config object that the rest of your application can depend on. This keeps mesh awareness confined to a single configuration module rather than scattered across the codebase.
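As a minimal sketch of that single configuration module, the block below loads mesh metadata into one typed object. It uses a standard-library dataclass instead of pydantic-settings so that it runs without extra dependencies. The variable names POD_NAMESPACE, SERVICE_ACCOUNT, TRUST_DOMAIN, and MESH_MTLS_ENABLED are assumptions for illustration; use whatever names your pod spec actually exposes.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MeshSettings:
    """Mesh metadata the rest of the application can depend on."""

    namespace: str
    service_account: str
    trust_domain: str
    mesh_handles_tls: bool

    @property
    def spiffe_id(self) -> str:
        # Same shape as the SPIFFE identities Istio puts in workload certificates.
        return (
            f"spiffe://{self.trust_domain}"
            f"/ns/{self.namespace}/sa/{self.service_account}"
        )


def load_mesh_settings(env=None) -> MeshSettings:
    """Read settings from a mapping; defaults to the process environment."""
    env = os.environ if env is None else env
    return MeshSettings(
        # All four variable names are hypothetical, not something Istio guarantees.
        namespace=env.get("POD_NAMESPACE", "default"),
        service_account=env.get("SERVICE_ACCOUNT", "default"),
        trust_domain=env.get("TRUST_DOMAIN", "cluster.local"),
        mesh_handles_tls=env.get("MESH_MTLS_ENABLED", "false").lower()
        in {"1", "true", "yes"},
    )


if __name__ == "__main__":
    print(load_mesh_settings().spiffe_id)
```

Passing the environment in as a mapping keeps the loader easy to unit test, and the same shape carries over to a pydantic-settings model if you prefer validation on load.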
Traffic Management
Traffic management is where the sidecar model really earns its keep. In a meshless world, shaping traffic (canary deployments, A/B tests, header-based routing, fault injection) requires application code changes or bespoke ingress configuration for every service. In a mesh, these become declarative YAML applied at the control plane and propagated instantly to every sidecar. The crucial mental shift is that traffic policy is decoupled from deployment: you can deploy v2 to production and send it zero traffic, then gradually shift traffic in via VirtualService edits, without touching the deployment spec. This separation of “what runs” from “what gets traffic” is a genuine architectural superpower that Kubernetes alone cannot provide. Kubernetes can only shape traffic by adjusting replica counts, which is coarse and conflates capacity with exposure.
Virtual Services
A VirtualService is Istio’s way of answering “when traffic arrives for this hostname, where should it actually go?” It is fundamentally a routing table that sits in front of your Kubernetes Service. Without a VirtualService, Kubernetes simply spreads connections across all pods backing a Service; that is the limit of what it can express. With a VirtualService, you gain header-based routing, weighted traffic splits, timeouts, retries, fault injection, and traffic mirroring. The mental model: a Kubernetes Service defines the set of pods; a VirtualService defines the routing policy over that set.
Control how requests are routed:
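As a rough sketch, the Python helper below builds a weight-based VirtualService manifest as a plain dictionary and prints it as JSON, which kubectl also accepts. The host name inventory-service, the subset names, and the timeout and retry values are illustrative assumptions, not recommended settings.

```python
import json


def weighted_virtual_service(host, weights, namespace="default"):
    """Build a VirtualService manifest that splits traffic by subset weight.

    `weights` maps subset name -> percentage; the percentages must sum to 100.
    """
    if sum(weights.values()) != 100:
        raise ValueError("subset weights must sum to 100")
    routes = [
        {"destination": {"host": host, "subset": subset}, "weight": weight}
        for subset, weight in weights.items()
    ]
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": host, "namespace": namespace},
        "spec": {
            "hosts": [host],
            "http": [
                {
                    "route": routes,
                    # Illustrative values only; tune them per service.
                    "timeout": "10s",
                    "retries": {"attempts": 3, "perTryTimeout": "2s"},
                }
            ],
        },
    }


if __name__ == "__main__":
    manifest = weighted_virtual_service("inventory-service", {"v1": 95, "v2": 5})
    print(json.dumps(manifest, indent=2))
```

Generating the manifest from code makes the "weights sum to 100" rule a checked invariant rather than something a reviewer has to spot in YAML.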
Destination Rules
If a VirtualService answers “where should traffic go,” a DestinationRule answers “how should it be delivered once it arrives.” This is where you define connection pooling, load balancing algorithms, and outlier detection (circuit breaking). The separation exists because routing (VirtualService) and delivery policy (DestinationRule) often change independently: you might change traffic weights weekly while connection pool settings stay fixed for months. Subsets defined in DestinationRule are the “named versions” that VirtualServices route to; without a matching subset definition, a VirtualService referring to subset: v2 simply fails to apply.
Define subsets and policies:
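The sketch below builds a DestinationRule manifest as a Python dictionary, with one subset per version label plus connection pool and outlier detection settings. It also includes a small, hypothetical helper, missing_subsets, that checks whether every subset a VirtualService refers to is actually defined. All numeric values are placeholder assumptions.

```python
import json


def destination_rule(host, versions, namespace="default"):
    """Build a DestinationRule with one subset per `version` label value."""
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": host, "namespace": namespace},
        "spec": {
            "host": host,
            "trafficPolicy": {
                # Placeholder numbers; size these from real traffic data.
                "connectionPool": {
                    "tcp": {"maxConnections": 100},
                    "http": {"http1MaxPendingRequests": 50},
                },
                "loadBalancer": {"simple": "LEAST_REQUEST"},
                "outlierDetection": {
                    "consecutive5xxErrors": 5,
                    "interval": "10s",
                    "baseEjectionTime": "30s",
                    "maxEjectionPercent": 50,
                },
            },
            "subsets": [
                {"name": version, "labels": {"version": version}}
                for version in versions
            ],
        },
    }


def missing_subsets(referenced, rule):
    """Return subset names a VirtualService uses but the rule does not define."""
    defined = {subset["name"] for subset in rule["spec"]["subsets"]}
    return sorted(set(referenced) - defined)


if __name__ == "__main__":
    rule = destination_rule("inventory-service", ["v1"])
    print(json.dumps(rule, indent=2))
    print("undefined subsets:", missing_subsets(["v1", "v2"], rule))
```

Running a check like missing_subsets in CI is one way to catch a VirtualService that points at a subset nobody defined before it reaches the cluster.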
Building an httpx Client That Respects Mesh Timeouts and Retries
When your service runs in a mesh, the sidecar already handles retries and timeouts for outbound traffic. But your HTTP client still needs to set a client-side timeout that is at least as long as the mesh’s maximum total time budget; otherwise, your client cancels the request while Envoy is still retrying, and you miss the benefit. The rule of thumb: client timeout = mesh timeout + perTryTimeout * attempts + a small buffer. As for retries and circuit breaking, let Envoy handle them; duplicating those at the application layer just multiplies latency with no added reliability. In Python, httpx fits this pattern well: configure a transport that propagates mesh headers automatically and set timeouts aligned with the VirtualService.
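To make the rule of thumb concrete, here is a small worked calculation. It applies the formula literally, which errs on the generous side. The example values (a 10 s route timeout, 3 attempts at 2 s each, a 0.5 s buffer) are assumptions rather than recommendations.

```python
def client_timeout_seconds(mesh_timeout, per_try_timeout, attempts, buffer=0.5):
    """Client timeout = mesh timeout + perTryTimeout * attempts + buffer."""
    if min(mesh_timeout, per_try_timeout, attempts, buffer) < 0:
        raise ValueError("all inputs must be non-negative")
    return mesh_timeout + per_try_timeout * attempts + buffer


if __name__ == "__main__":
    # 10 s route timeout, 3 attempts of 2 s each, 0.5 s of slack.
    budget = client_timeout_seconds(10.0, 2.0, 3)
    print(f"set the HTTP client timeout to at least {budget} s")
```

The resulting number is what you would pass as the overall timeout when constructing the HTTP client, so the client never gives up while the sidecar is still working on the request.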
Canary Deployments with Istio
Canary deployments are where service mesh really shines compared to Kubernetes-native approaches. Without a mesh, you can only do canary by replica ratio (1 canary pod out of 10 total = 10% traffic). With Istio’s VirtualService, you can route exactly 5% of traffic to v2 regardless of replica count. You can even route based on headers: send all internal employees to v2 while keeping customers on v1. This level of control is impossible with plain Kubernetes services.
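The arithmetic behind the replica-ratio limitation can be sketched in a few lines. The helper names are hypothetical. The calculation shows the smallest fleet that can express a given canary percentage when the traffic share equals the share of pods.

```python
from fractions import Fraction


def replica_canary_percent(canary_pods, total_pods):
    """Traffic share a canary receives when traffic follows the pod ratio."""
    return 100 * canary_pods / total_pods


def min_pods_for_canary(percent):
    """Smallest (canary_pods, total_pods) pair that yields exactly `percent`."""
    share = Fraction(percent, 100)
    return share.numerator, share.denominator


if __name__ == "__main__":
    print(replica_canary_percent(1, 10))  # 1 canary pod out of 10
    for percent in (10, 5, 1):
        canary, total = min_pods_for_canary(percent)
        print(f"{percent}% by replica ratio needs {canary} canary of {total} pods")
```

A 1% canary by replica ratio needs a 100-pod fleet, whereas a weighted route expresses the same split with two pods, which is the decoupling of capacity from exposure described above.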
Caveats & Common Pitfalls: When Service Mesh Is Over-Engineering
mTLS (Mutual TLS)
mTLS in a service mesh is a genuine game-changer because it solves the two hardest parts of in-cluster encryption: certificate distribution and rotation. In a non-mesh world, rolling out mTLS means every team writes TLS code, manages key material, handles certificate expiration, and rebuilds images when certs change. With a mesh, the control plane issues short-lived X.509 certificates to each sidecar via the SDS (Secret Discovery Service) protocol, rotates them automatically every 24 hours, and your application never sees any of this. The SPIFFE identity baked into each certificate (spiffe://cluster.local/ns/default/sa/payment-service) becomes the basis for authorization policies — you can write rules like “only the order-service workload may call /checkout,” enforced by Envoy at the network layer rather than fought for in every microservice.
The deeper shift: mTLS turns the network itself into a trust boundary. You can migrate from “trust the network” to “trust nothing on the network and verify cryptographic identity per request” without changing a line of application code. This is what makes zero-trust networking operationally feasible at scale.
mTLS is arguably the single most compelling reason to adopt a service mesh. Without a mesh, implementing mutual TLS between services means every team needs to manage certificates, handle rotation, and write TLS code in their language of choice. With Istio, mTLS happens automatically in the sidecar proxy — zero code changes, zero certificate management by developers. The control plane handles issuing, distributing, and rotating certificates. This turns what would be a multi-month security initiative into a single YAML configuration.
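To show what the SPIFFE identity carries, the sketch below parses an ID of the form shown above and applies a toy allow-list. In a real mesh the rule would live in an AuthorizationPolicy enforced by Envoy, not in application code. The ALLOWED_CALLERS table and the may_call helper are hypothetical.

```python
from typing import NamedTuple
from urllib.parse import urlparse


class WorkloadIdentity(NamedTuple):
    trust_domain: str
    namespace: str
    service_account: str


def parse_spiffe_id(spiffe_id):
    """Parse spiffe://<trust-domain>/ns/<namespace>/sa/<service-account>."""
    parsed = urlparse(spiffe_id)
    parts = parsed.path.strip("/").split("/")
    if (
        parsed.scheme != "spiffe"
        or not parsed.netloc
        or len(parts) != 4
        or parts[0] != "ns"
        or parts[2] != "sa"
    ):
        raise ValueError(f"not a workload SPIFFE ID: {spiffe_id!r}")
    return WorkloadIdentity(parsed.netloc, parts[1], parts[3])


# Hypothetical policy table: request path -> service accounts allowed to call it.
ALLOWED_CALLERS = {"/checkout": {"order-service"}}


def may_call(identity, path):
    """Toy stand-in for an authorization rule evaluated by the proxy."""
    return identity.service_account in ALLOWED_CALLERS.get(path, set())


if __name__ == "__main__":
    caller = parse_spiffe_id("spiffe://cluster.local/ns/default/sa/payment-service")
    print(caller, may_call(caller, "/checkout"))
```

The point of the sketch is that the identity is structured data derived from the certificate, so policies can be written against namespaces and service accounts instead of IP addresses.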
Application Code That Works With a Service Mesh
One of the most important things to internalize: when your service runs inside a mesh, your HTTP client code mostly stays the same, but you gain superpowers through headers. The mesh propagates tracing headers (x-request-id, traceparent, x-b3-*) that you should forward on outbound calls so that distributed traces stitch together across hops. If you do not forward these headers, each service call looks like a fresh request to the tracing system and you lose the end-to-end view. The mesh also respects custom headers for routing — setting x-user-type: premium on a request lets Istio route it to a different backend subset without your application having to know anything about subsets.
The tradeoff: if you forget to propagate headers, debugging becomes painful because traces break. Most frameworks offer middleware to handle this automatically, but always verify it is wired up.
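A framework-agnostic sketch of manual propagation follows: pick the tracing headers out of the incoming request and attach them to every outbound call. The header list covers the names mentioned above plus the other common B3 and W3C variants. The extra parameter is an assumption, added for routing headers such as x-user-type.

```python
TRACE_HEADERS = (
    "x-request-id",
    "traceparent",
    "tracestate",
    "x-b3-traceid",
    "x-b3-spanid",
    "x-b3-parentspanid",
    "x-b3-sampled",
    "x-b3-flags",
    "b3",
)


def headers_to_forward(incoming, extra=()):
    """Return only the headers that should be copied onto outbound requests.

    Header names are matched case-insensitively and returned lowercased.
    """
    wanted = set(TRACE_HEADERS) | {name.lower() for name in extra}
    return {
        name.lower(): value
        for name, value in incoming.items()
        if name.lower() in wanted
    }


if __name__ == "__main__":
    incoming = {
        "X-Request-Id": "abc-123",
        "X-B3-TraceId": "80f198ee56343ba864fe8b2a57d3eff7",
        "Authorization": "Bearer not-forwarded",
        "X-User-Type": "premium",
    }
    print(headers_to_forward(incoming, extra=("x-user-type",)))
```

Using an explicit allow-list keeps credentials and other request-scoped headers from leaking to downstream services while still stitching the trace together.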
FastAPI with OpenTelemetry: Mesh-Aware Distributed Tracing
Manually extracting and forwarding tracing headers works, but it is fragile: miss one header or one outbound call and the trace breaks. A better approach is to let the OpenTelemetry SDK handle propagation automatically. With opentelemetry-instrumentation-fastapi and opentelemetry-instrumentation-httpx, incoming B3/W3C TraceContext headers from Istio are parsed into a live span context on request entry, and every subsequent httpx call automatically injects the correct trace headers outbound. The OTLP exporter then ships spans to your collector (Jaeger, Tempo, Honeycomb, whatever), and the spans from your app seamlessly merge with the spans Envoy emits, giving you a complete picture of each request across the mesh and inside each service. This is the gold-standard pattern for mesh-aware observability.
Circuit Breaking with Istio
Circuit breaking in a service mesh works at the connection and request level, not at the application level. Envoy tracks how many consecutive 5xx responses it has seen from each upstream pod; once the threshold is breached, that pod is temporarily removed from the load balancing pool (“ejected”). This is a fundamentally different mental model from application-level circuit breakers (like Netflix Hystrix or Polly): the mesh sees per-pod health, while your app code traditionally sees per-service health. Doing both is fine, but the mesh-level breaker is strictly more granular and happens without any application awareness. The risk: set thresholds too aggressively and a few bad requests can eject half your healthy pods, causing cascading failure. Always configure maxEjectionPercent to cap how much of your fleet can be marked unhealthy simultaneously.
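The following toy model illustrates the mechanism: it counts consecutive 5xx responses per pod and ejects a pod once the threshold is reached, unless the ejection cap is already hit. It is a simplified sketch, not Envoy's actual algorithm. It omits the time-based re-admission of ejected pods, and the class and parameter names are hypothetical.

```python
class OutlierDetector:
    """Simplified per-pod outlier detection with an ejection cap."""

    def __init__(self, pods, consecutive_5xx=5, max_ejection_percent=50):
        self.failures = {pod: 0 for pod in pods}
        self.ejected = set()
        self.threshold = consecutive_5xx
        # Assumption for this model: at least one pod may always be ejected.
        self.max_ejected = max(1, len(self.failures) * max_ejection_percent // 100)

    def record(self, pod, status):
        """Record one response; eject the pod if it crosses the threshold."""
        if status >= 500:
            self.failures[pod] += 1
        else:
            self.failures[pod] = 0  # any success resets the streak
        over_threshold = self.failures[pod] >= self.threshold
        under_cap = len(self.ejected) < self.max_ejected
        if over_threshold and pod not in self.ejected and under_cap:
            self.ejected.add(pod)

    def healthy_pods(self):
        return sorted(set(self.failures) - self.ejected)


if __name__ == "__main__":
    detector = OutlierDetector(["a", "b", "c", "d"], consecutive_5xx=2)
    for pod in ("a", "b", "c"):
        detector.record(pod, 503)
        detector.record(pod, 503)
    # With a 50% cap, only two of the three failing pods are ejected.
    print(detector.healthy_pods())
```

The cap is what prevents a burst of errors from emptying the pool: the third failing pod stays in rotation even though it crossed the threshold.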
Rate Limiting
Rate limiting in Istio comes in two flavors: local (per-sidecar) and global (centralized counter via an external service). With local rate limiting, each Envoy tracks its own token bucket independently. This is fast and has no external dependencies, but it is inaccurate in aggregate: 10 sidecars with a local limit of 100 req/s each can collectively let through 1000 req/s, not 100. For strict cluster-wide limits you need global rate limiting with Envoy’s rate limit service, which introduces a new dependency and a network hop per request. Pick local when approximate per-pod limits are good enough; pick global when a customer-tier quota must be enforced exactly across your entire cluster.
Istio vs Linkerd Comparison
Linkerd Quick Start
Observability with Service Mesh
Observability is the area where the service mesh delivers the most leverage per unit of effort. Because every request traverses a sidecar, the mesh can emit a uniform set of RED metrics (Rate, Errors, Duration) for every service-to-service edge without any instrumentation effort from application teams. The mental model: instead of each service being a black box that might or might not have metrics (depending on how the team instrumented it), the network itself becomes instrumented. You get a consistent baseline of observability across the entire fleet on day one, and application-level metrics become supplementary rather than foundational. The distinction matters: mesh metrics tell you that inventory-service is slow; application metrics tell you why (database wait time, cache miss rate, etc.). You need both, and the mesh gives you the first one for free.
The same principle applies to distributed tracing. Envoy generates spans for every inbound and outbound request and propagates B3/W3C TraceContext headers automatically. The only work your application has to do is forward those headers on outbound calls or, better, use an OpenTelemetry SDK that handles propagation transparently, as shown earlier.
Automatic Metrics
The single biggest “free” benefit of a service mesh is automatic, consistent metrics across every service. Because every request flows through Envoy, every request is counted, timed, and labeled identically, regardless of whether the underlying service is Java, Python, Go, or Rust. This is transformative for an organization with polyglot services: you no longer have to convince (or coerce) every team to instrument their code consistently. The trade-off: you get what Envoy measures, which is request-level metrics. If you need business-level metrics (orders per minute, revenue per region), you still have to instrument those in application code. Istio automatically generates:
Distributed Tracing Integration
Service Mesh Interview Questions
Q1: What is a service mesh and when would you use one?
- Traffic management: Load balancing, routing, retries
- Security: mTLS, authorization
- Observability: Metrics, tracing, logging
- 10+ microservices
- Need consistent security policies
- Multiple languages/frameworks
- Complex traffic patterns
- Small number of services
- Simple architecture
- Resource constraints
- Team unfamiliar with Kubernetes
Q2: Explain the sidecar pattern
- App doesn’t need networking code
- Language agnostic
- Updated independently
- Consistent behavior
- Latency overhead (1-3ms)
- Resource consumption
- Complexity
Q3: How does mTLS work in a service mesh?
- Certificate Authority (CA) generates root certificate
- Each service gets unique certificate (SPIFFE identity)
- Sidecars automatically handle TLS handshake
- Both client and server verify each other’s certificates
- Zero code changes
- Automatic rotation
- Service identity verification
- Encrypted in transit
Q4: How do you implement canary deployments with Istio?
- Deploy new version alongside old
- Create VirtualService with weight-based routing
- Gradually shift traffic (10% → 25% → 50% → 100%)
- Monitor error rates at each step
- Rollback if issues detected
- Error rates (5xx responses)
- Latency percentiles
- Business metrics
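The shift, monitor, and rollback loop described in this answer can be sketched as a single decision function. The step sequence matches the one above. The error-rate and latency thresholds are assumptions for illustration, and the function name is hypothetical.

```python
CANARY_STEPS = (10, 25, 50, 100)


def next_canary_weight(
    current_weight,
    error_rate,
    p99_latency_ms,
    max_error_rate=0.01,
    max_p99_ms=500.0,
):
    """Return the next traffic percentage for the canary.

    Returns 0 (full rollback) if either health signal breaches its threshold;
    otherwise advances to the next step, staying at 100 once fully promoted.
    """
    if error_rate > max_error_rate or p99_latency_ms > max_p99_ms:
        return 0
    for step in CANARY_STEPS:
        if step > current_weight:
            return step
    return 100


if __name__ == "__main__":
    weight = 0
    # Hypothetical observations: (5xx error rate, p99 latency in ms) per step.
    for error_rate, p99 in [(0.001, 120), (0.002, 130), (0.05, 140)]:
        weight = next_canary_weight(weight, error_rate, p99)
        print(f"canary weight is now {weight}%")
```

In practice the returned weight would be written into the VirtualService route, and the observations would come from the mesh's request metrics after each step has had time to bake.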
Q5: What's the difference between Istio and Linkerd?
| Aspect | Istio | Linkerd |
|---|---|---|
| Complexity | High | Low |
| Resource usage | Higher | Lower |
| Latency | ~2-3ms | ~1ms |
| Features | Extensive | Focused |
| Proxy | Envoy (C++) | Rust-based |
Service Mesh Adoption Decision Framework
A service mesh is a significant commitment: it adds operational complexity, consumes cluster resources, and requires a team that understands Kubernetes deeply. Here is a structured way to decide whether you need one, and which one to pick.
“Do I Even Need a Service Mesh?”
| Signal | Without Mesh | With Mesh | Verdict |
|---|---|---|---|
| 5 services, one team, one language | Library-based resilience (e.g., Polly, resilience4j) works fine | Overkill; the sidecar overhead exceeds the benefit | Skip the mesh |
| 15+ services, 3+ teams, mixed languages | Each team re-implements retries, TLS, tracing differently | Consistent behavior across all services with zero code changes | Strong candidate |
| Compliance requires mTLS everywhere | Manual cert management across all services (months of work) | Automatic mTLS with SPIFFE identities (days of work) | Mesh pays for itself |
| Need canary deployments with traffic splitting | Kubernetes-native canary by replica ratio only | Precise percentage-based traffic splitting with header routing | Mesh adds clear value |
| Running on VMs, not Kubernetes | Service mesh assumes Kubernetes sidecar injection | Most meshes require Kubernetes (Consul Connect is an exception) | Probably not ready |
Istio vs Linkerd vs Consul Connect — Detailed Comparison
| Dimension | Istio | Linkerd | Consul Connect |
|---|---|---|---|
| Proxy | Envoy (C++, battle-tested) | linkerd2-proxy (Rust, purpose-built) | Envoy or built-in proxy |
| Memory per sidecar | ~50-100MB | ~10-20MB | ~30-50MB |
| Latency overhead | ~2-3ms p99 | ~1ms p99 | ~1-2ms p99 |
| Kubernetes required? | Yes | Yes | No (works on VMs too) |
| mTLS | Yes (opt-in, configurable) | Yes (on by default) | Yes (built into Consul) |
| Traffic splitting | Advanced (headers, weights, mirroring) | Basic (weights only) | Moderate (weights, headers) |
| Multi-cluster | Yes (complex setup) | Yes (simpler setup) | Yes (native multi-DC) |
| Learning curve | Steep (100+ CRDs) | Gentle (minimal CRDs) | Moderate (Consul ecosystem) |
| Community/backing | Google, IBM, large community | Buoyant (CNCF graduated) | HashiCorp |
| Best for | Enterprises needing fine-grained control | Teams wanting simplicity with low overhead | Hybrid cloud / VM environments |
Edge Case: Service Mesh + gRPC
Istio handles gRPC traffic well because Envoy natively supports HTTP/2. However, gRPC uses long-lived connections, which means Envoy’s load balancing happens at connection time, not per-request. If you have 3 backend pods and a client opens one gRPC connection, all requests go to the same pod. Solution: configure maxRequestsPerConnection (Envoy’s max_requests_per_connection) in the DestinationRule’s connection pool settings to force periodic reconnection, enabling rebalancing across pods.
Edge Case: Sidecar Startup Race Condition
During pod startup, your application container might start before the Envoy sidecar is ready. Any outbound HTTP call during this window fails because the sidecar is not yet intercepting traffic. Solutions:
- Use holdApplicationUntilProxyStarts: true in Istio’s global mesh config
- Add a retry-on-startup loop in your application’s initialization code
- Use Kubernetes init containers to wait for the sidecar
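A retry-on-startup loop might look like the sketch below, which polls the sidecar's readiness endpoint before the application makes its first outbound call. The URL http://localhost:15021/healthz/ready is an assumption based on Istio's default status port, so verify it against your installation. The probe and sleep parameters exist only so the loop can be tested without a network.

```python
import time
import urllib.error
import urllib.request

# Assumed default readiness endpoint of the Istio sidecar; verify for your setup.
SIDECAR_READY_URL = "http://localhost:15021/healthz/ready"


def sidecar_ready(url=SIDECAR_READY_URL, timeout=1.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_for_sidecar(probe=sidecar_ready, attempts=30, delay=1.0, sleep=time.sleep):
    """Poll `probe` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False


if __name__ == "__main__":
    if not wait_for_sidecar(attempts=5):
        raise SystemExit("sidecar did not become ready; refusing to start")
    print("sidecar ready, starting application")
```

Failing fast when the sidecar never becomes ready lets Kubernetes restart the container instead of leaving it running with broken outbound networking.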
Best Practices
Start Simple
Test Thoroughly
Monitor Resources
Plan for Failures
Chapter Summary
- Service mesh moves networking from app code to infrastructure
- Istio provides advanced traffic management and security
- mTLS ensures encrypted, authenticated service communication
- Canary deployments enable safe progressive rollouts
- Choose between Istio and Linkerd based on complexity needs
Interview Questions: Service Mesh Adoption Decisions
Your team wants to adopt Istio. What questions do you ask before saying yes?
- What specific problems are we solving that cannot be solved with simpler tools? If it is only mTLS: use cert-manager. If it is only tracing: use OpenTelemetry. Istio is justified only when you need three or more of: mTLS, canary/traffic shaping, authorization policies, retry/circuit breaking, observability, multi-cluster routing.
- What is our current service count, language count, and team count? Fewer than 15 services in 1-2 languages with fewer than 3 teams rarely benefits. 30+ services across 5+ teams with polyglot codebases usually does.
- What is our latency budget? 1-3 ms per hop times 5-7 hops equals roughly 5-20 ms of added latency. If SLOs are under 50 ms p99, this is a real bite.
- Do we have the operational capacity? istiod HA, upgrade paths, cert rotation, EnvoyFilter debugging — who owns this? If the answer is “our one platform engineer,” the mesh will own them, not the other way around.
- Have we done a bake-off with Linkerd or Consul Connect? Istio is the default choice, not always the right one. Linkerd costs one-fifth of the overhead for 80% of the features.
- What is our rollout strategy? Permissive mode first, strict mode after full coverage. Namespace-by-namespace, not cluster-wide. If the plan is “enable it on all namespaces next sprint,” reject that plan — it is how outages happen.
- What is our rollback plan? If the mesh causes an outage, can we disable injection and restart pods to remove sidecars? Is this tested? How long does it take?
- “Yes, let’s adopt Istio because it’s the industry standard.” Industry standard is not the same as right-for-your-situation. Most companies publicly touting their mesh adoption have platform teams 10x the size of yours.
- “We’ll adopt Istio to solve future scaling problems.” You cannot pay operational cost now for benefits you might need later. Adopt it when the pain of not having it exceeds the pain of running it.
- Istio docs: “Should you use Istio?” — https://istio.io/latest/docs/concepts/what-is-istio/
- Airbnb Engineering blog, “The Airbnb service mesh journey” (2020).
- Buoyant’s Linkerd vs Istio benchmark (updated annually) — https://buoyant.io/linkerd-vs-istio
Your Istio mesh just started failing mTLS handshakes across the entire fleet at 03:00. How do you diagnose and recover?
- First check the control plane. kubectl get pods -n istio-system. Is istiod healthy? If it is crashlooping, new pods cannot get certs and existing cert renewals will fail.
- Check cert expiry. Use istioctl proxy-config secret <pod> on a failing pod to see the cert’s NotAfter field. If certs have expired, the root cause is istiod not pushing renewals.
- Check for a recent config change. kubectl get peerauthentications,destinationrules -A -o yaml | grep -i mtls. A PeerAuthentication flipped from PERMISSIVE to STRICT while not all workloads had sidecars will cause global handshake failures.
- Check clock skew. Sidecars reject certs whose validity window does not overlap with local time. A node’s clock drifting more than a few minutes breaks mTLS for every workload on that node.
- Recovery path. If istiod is down and you need immediate recovery: flip PeerAuthentication to PERMISSIVE (allows plaintext) to restore traffic while you fix the control plane. Do not leave it there; that is emergency-only.
pilot_xds_pushes should be steady; if it drops to zero, istiod is no longer configuring sidecars. Also alert on istio_agent_cert_expiry_seconds dropping below 12 hours across any pod.
- “Restart all pods to pick up new certs.” Does not help if istiod cannot issue certs. Also causes a thundering herd of cert requests that can further destabilize istiod.
- “Roll back the Istio version.” Only helps if the control plane was freshly upgraded, which is a specific failure mode, not the general one.
- Istio docs: “Troubleshooting mutual TLS”.
- Monzo Engineering postmortem blog: “Our service mesh outage” (typically searchable by year).
- istioctl command reference for proxy-config and analyze subcommands.
Your team wants to enforce 'all traffic must be mTLS' in strict mode across 80 services. How do you roll this out without taking down production?
- Start in PERMISSIVE mode, never STRICT. PERMISSIVE accepts both mTLS and plaintext, so services without sidecars keep working during migration.
- Inventory what’s actually in the mesh. Use kubectl get pods -A -o json | jq to count pods with and without the istio-proxy container. Do not assume every namespace has injection enabled.
- Enable injection namespace-by-namespace. Label the namespace, do a rolling restart of all deployments, verify with Kiali that all traffic is showing mTLS. Check logs for any TLS errors.
- Handle the edge cases. Cron jobs with short-lived pods may not have sidecars because they finish before injection completes; use holdApplicationUntilProxyStarts. Traffic from outside the mesh (load balancers, monitoring systems) must have explicit ServiceEntry or AuthorizationPolicy exemptions.
- Flip STRICT per namespace, not cluster-wide. The failure mode of STRICT is that any un-meshed caller gets refused. Do one namespace at a time, let it bake for a week, roll forward.
- Have the rollback one kubectl away. kubectl patch peerauthentication default -n <ns> --type=merge -p '{"spec":{"mtls":{"mode":"PERMISSIVE"}}}' should be in your runbook.
istioctl authn tls-check <pod> for programmatic verification. Most robustly, set up a Prometheus alert on istio_requests_total{connection_security_policy="none"} > 0 firing for any in-mesh service.
- “Enable STRICT mode everywhere at once and fix what breaks.” This is the Monday-morning outage pattern. Every un-injected pod and every cross-namespace caller fails simultaneously.
- “Use a MeshConfig to enable mTLS globally.” That flag exists but is crude — it does not let you stage the rollout by namespace, which is the entire safety mechanism.
- Istio docs: “Mutual TLS Migration” — the canonical staged-rollout guide.
- “BeyondProd” whitepaper (Google, 2019) — architectural model for zero-trust in-cluster communication.
- Kiali documentation on traffic visualization.
Interview Deep-Dive
'Your team is considering adopting Istio for a 20-service Kubernetes deployment. What questions would you ask before making the decision, and what are the realistic operational costs?'
'Explain how mTLS works in a service mesh and what happens when a new service is deployed that does not have mTLS configured.'
tls.mode: SIMPLE for the external host. Without the ServiceEntry, Istio’s default behavior (depending on outboundTrafficPolicy) is either to allow the traffic through without mTLS or to block it entirely. I recommend setting outboundTrafficPolicy to REGISTRY_ONLY and explicitly defining all external dependencies as ServiceEntries; this gives you visibility into every external call and the ability to apply retry/timeout policies to them.
'How do you implement a canary deployment using a service mesh, and what metrics do you use to decide whether to promote or roll back the canary?'