Agentic Architecture refers to design patterns for building AI systems where multiple specialized agents collaborate to solve complex problems. Instead of one monolithic agent, you have:
Specialized agents with focused capabilities
Orchestration to coordinate agents
Communication protocols between agents
Shared memory/state for collaboration
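Key Insight: Complex tasks are often better handled by multiple specialized agents than by one generalist agent trying to do everything. Anthropic has publicly described Claude’s Research feature as a multi-agent system of this kind; OpenAI’s o1, by contrast, is documented as a single model trained to reason step by step, not as a set of collaborating agents.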
You are designing a multi-agent system for an enterprise workflow. How do you decide between the Supervisor pattern and the Swarm pattern?
Strong Answer:
The Supervisor pattern is the right default for most enterprise workflows because it provides a single point of control, making it easier to reason about execution order, debug failures, and enforce business rules. The supervisor acts as an explicit router: it receives the task, decides which agent to delegate to, collects results, and decides what to do next. This maps well to workflows with well-defined stages like “research, then analyze, then write.”
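The receive, delegate, collect, decide loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `run_supervisor`, `staged_route`, and the lambda workers are hypothetical names, and the router here is a plain function standing in for what would normally be an LLM call.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical worker type: takes the work so far and returns the updated work.
Worker = Callable[[str], str]
# Hypothetical router type: returns the next agent name, or None when finished.
Router = Callable[[str, List[str]], Optional[str]]


def run_supervisor(task: str, workers: Dict[str, Worker], route: Router,
                   max_steps: int = 10) -> Tuple[str, List[str]]:
    """Route, delegate, collect, and repeat until the router says stop."""
    result = task
    history: List[str] = []          # agents invoked so far, in order
    for _ in range(max_steps):       # safety cap so a bad router cannot loop forever
        next_agent = route(result, history)
        if next_agent is None:       # router signals that the workflow is done
            break
        result = workers[next_agent](result)
        history.append(next_agent)
    return result, history


def staged_route(result: str, history: List[str]) -> Optional[str]:
    """Fixed 'research, then analyze, then write' routing (stand-in for an LLM)."""
    stages = ["research", "analyze", "write"]
    return stages[len(history)] if len(history) < len(stages) else None


# Toy workers that just tag the text with their own name.
workers = {name: (lambda text, name=name: f"{text}|{name}")
           for name in ["research", "analyze", "write"]}
final, path = run_supervisor("question", workers, staged_route)
print(path, final)
```

Because every routing decision passes through one function, the `history` list doubles as the audit trail that makes this pattern easy to debug.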
The Swarm pattern is appropriate when the task is highly dynamic and you cannot predict at design time how many agents you need or what types. For example, an incident response system where you might need to spawn a log-analysis agent, a metrics-analysis agent, and a customer-communication agent depending on the nature of the incident. Swarm excels when parallelism and adaptability matter more than predictable execution.
The key trade-off is control versus flexibility. Supervisors give you deterministic routing and clear audit trails but can become bottlenecks if the supervisor LLM makes poor routing decisions. Swarms give you emergent problem-solving behavior but are harder to debug, harder to test, and harder to set cost guardrails on because agents can spawn other agents unpredictably.
In practice, most production systems I have seen use a hybrid: a supervisor pattern for the top-level orchestration with swarm-like behavior within specific subtasks. For example, the supervisor routes a research task to a research sub-system, and that sub-system uses a small swarm of specialized search agents internally. This gives you control at the macro level and flexibility at the micro level.
One often-overlooked factor: the supervisor pattern degrades more gracefully. If one worker agent fails, the supervisor can retry, route to an alternative, or return a partial result. In a swarm, a failing agent can create cascading confusion because other agents may depend on its output without clear fallback paths.
Follow-up: How do you prevent the supervisor agent from becoming a bottleneck or single point of failure?
Two strategies. First, make the supervisor decision lightweight by using a smaller, faster model (gpt-4o-mini) for routing decisions, while worker agents use the full gpt-4o for actual work. The routing decision is usually a classification task that does not need the strongest model. Second, implement supervisor-level caching: if the same type of task comes in repeatedly, cache the routing decision so you skip the supervisor LLM call entirely. For high availability, you can run the supervisor as a stateless function with the execution state stored externally (in Redis or a database), so if the supervisor process dies, another instance can pick up the state and continue.
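Supervisor-level caching of routing decisions might look like the sketch below. Everything here is an assumption for illustration: `CachedRouter` is a hypothetical class, `classify` stands in for the routing LLM call, and `task_type` is whatever cheap function you use to decide that two tasks are "the same type". An in-process dict is used for brevity; a shared store such as Redis would be the natural choice with multiple supervisor instances.

```python
from typing import Callable, Dict


class CachedRouter:
    """Caches routing decisions per task type to skip repeated LLM calls."""

    def __init__(self, classify: Callable[[str], str],
                 task_type: Callable[[str], str]) -> None:
        self._classify = classify      # stand-in for the (expensive) routing LLM call
        self._task_type = task_type    # cheap function that maps a task to a cache key
        self._cache: Dict[str, str] = {}
        self.llm_calls = 0             # counter, only here to make the saving visible

    def route(self, task: str) -> str:
        key = self._task_type(task)
        if key not in self._cache:     # cache miss: pay for one routing call
            self.llm_calls += 1
            self._cache[key] = self._classify(task)
        return self._cache[key]


# Toy stand-ins: the first word is the task type; refunds go to a billing agent.
router = CachedRouter(
    classify=lambda task: "billing_agent" if "refund" in task else "general_agent",
    task_type=lambda task: task.split()[0].lower(),
)
print(router.route("Refund order 1001"), router.route("Refund order 1002"),
      router.llm_calls)
```

The quality of the cache depends entirely on the `task_type` key: too coarse and different tasks get misrouted, too fine and the cache never hits.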
How do you handle shared state and memory across agents in a multi-agent system without creating race conditions?
Strong Answer:
Shared state in multi-agent systems is one of the hardest design problems because agents may run concurrently and update the same state. The pattern I follow is to treat shared memory as an append-only log rather than a mutable document. Each agent writes new facts, hypotheses, or decisions to the log, and reads the full log before acting. This eliminates write conflicts because you never modify existing entries, only add new ones.
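A minimal sketch of the append-only log idea, assuming agents run as threads in one process. `Entry` and `AppendOnlyMemory` are hypothetical names, not the chapter's SharedMemory type; the lock only protects the append itself, and existing entries are frozen so they cannot be modified after being written.

```python
import threading
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)              # frozen: an entry cannot be changed once written
class Entry:
    seq: int                         # position in the log
    agent: str                       # which agent wrote it
    kind: str                        # "fact", "hypothesis", or "decision"
    text: str


class AppendOnlyMemory:
    """Shared memory that only supports appending and reading snapshots."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []
        self._lock = threading.Lock()

    def append(self, agent: str, kind: str, text: str) -> Entry:
        with self._lock:             # serialize appends so sequence numbers are unique
            entry = Entry(len(self._entries), agent, kind, text)
            self._entries.append(entry)
            return entry

    def snapshot(self) -> Tuple[Entry, ...]:
        with self._lock:             # immutable copy: later appends do not affect it
            return tuple(self._entries)


memory = AppendOnlyMemory()
memory.append("researcher", "hypothesis", "Latency spike is caused by the cache")
memory.append("validator", "fact", "Cache hit rate dropped at 09:00")
for entry in memory.snapshot():
    print(entry.seq, entry.agent, entry.kind, entry.text)
```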
For the implementation, I use a structure similar to the SharedMemory TypedDict shown in this chapter: separate lists for confirmed facts, unconfirmed hypotheses, and decisions. The critical rule is that only one agent at a time can promote a hypothesis to a fact, and this promotion should go through the supervisor or a dedicated “validator” agent to prevent conflicting facts.
When agents run in parallel (like the debate pattern), I use a turn-based protocol rather than true concurrent writes. Each agent gets the current state snapshot, produces its output, and then the orchestrator merges all outputs into the state before the next round. This is essentially an optimistic concurrency model where conflicts are resolved by the orchestrator.
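The turn-based protocol can be sketched as a single round function. This is an illustration under assumptions: `run_round` is a hypothetical helper, agents are plain callables standing in for LLM-backed agents, and the merge rule (sort by agent name, skip duplicates) is one simple deterministic choice among many.

```python
from typing import Callable, Dict, List, Tuple

State = Tuple[str, ...]                    # immutable snapshot of shared state
Agent = Callable[[State], List[str]]       # reads a snapshot, proposes new entries


def run_round(state: State, agents: Dict[str, Agent]) -> State:
    """Give every agent the same snapshot, then merge outputs deterministically."""
    # Every agent sees the state as it was at the start of the round,
    # never another agent's output from the same round.
    outputs = {name: agent(state) for name, agent in agents.items()}

    merged = list(state)
    for name in sorted(outputs):           # fixed merge order, independent of timing
        for item in outputs[name]:
            tagged = f"{name}: {item}"
            if tagged not in merged:       # the orchestrator resolves duplicates
                merged.append(tagged)
    return tuple(merged)


agents = {
    "pro": lambda state: [f"argument after seeing {len(state)} entries"],
    "con": lambda state: [f"rebuttal after seeing {len(state)} entries"],
}
state = run_round((), agents)
state = run_round(state, agents)
print(state)
```

In a real system the agent calls inside the dict comprehension could run in parallel, since none of them writes to shared state directly.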
For long-term memory backed by a vector database, each agent gets its own collection or namespace to prevent pollution. A shared collection is useful for cross-agent knowledge, but writes to it should go through a gatekeeper that checks for contradictions with existing entries. In LangGraph, this is handled through the state management system where each node returns state updates that are merged deterministically.
The practical gotcha is context window bloat. If every agent reads the entire shared memory, and the memory grows with each step, you quickly exhaust the context window. I implement a relevance filter: before passing shared memory to an agent, query the memory with the agent’s current task and only include the top-k most relevant entries.
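A relevance filter of this kind can be sketched as below. To keep the example self-contained it scores entries by word overlap (Jaccard similarity) with the task; that scoring is an assumption made for illustration, and a production system would more likely use embedding similarity from the vector store. `top_k_relevant` is a hypothetical helper name.

```python
import re
from typing import List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercased alphanumeric word set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_k_relevant(entries: List[str], task: str, k: int = 3) -> List[str]:
    """Return up to k memory entries most similar to the task description."""
    task_tokens = _tokens(task)
    scored = []
    for index, entry in enumerate(entries):
        entry_tokens = _tokens(entry)
        union = task_tokens | entry_tokens
        score = len(task_tokens & entry_tokens) / len(union) if union else 0.0
        scored.append((score, index, entry))
    # Highest score first; ties keep the original (chronological) order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [entry for score, _, entry in scored[:k] if score > 0]


shared_memory = [
    "database latency spiked at 09:00",
    "customer asked for refund",
    "latency alert fired on database replica",
]
print(top_k_relevant(shared_memory, "investigate database latency", k=2))
```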
Follow-up: How would you design a memory system that persists across sessions for a multi-agent system?
I use a two-tier architecture. Short-term memory (the current task’s facts, hypotheses, decisions) lives in an in-memory data structure or Redis and gets cleared after the task completes. Long-term memory lives in a vector database with metadata tags for the agent that wrote it, the task context, and a timestamp. Before starting a new task, each agent queries long-term memory for relevant past experiences. The key is a decay mechanism: memories that have not been recalled in a configurable time window get downranked and eventually archived, preventing the memory from growing indefinitely and keeping retrieval fast. I also tag memories with a confidence score that degrades over time, so recent memories are preferred over stale ones.
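The decay mechanism can be made concrete with a small sketch. The text does not specify a decay curve, so the exponential half-life used here is an assumption, as are the names `Memory`, `decayed_confidence`, and `should_archive`. Timestamps are plain seconds to keep the example deterministic.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    confidence: float          # confidence assigned when the memory was written
    created_at: float          # seconds since some epoch
    last_recalled_at: float    # updated every time retrieval returns this memory


def decayed_confidence(memory: Memory, now: float, half_life_s: float) -> float:
    """Confidence halves every half_life_s seconds (assumed decay curve)."""
    age = max(0.0, now - memory.created_at)
    return memory.confidence * 0.5 ** (age / half_life_s)


def should_archive(memory: Memory, now: float, recall_window_s: float) -> bool:
    """Archive memories that have not been recalled within the window."""
    return (now - memory.last_recalled_at) > recall_window_s


DAY = 86_400.0
note = Memory("Retry with backoff fixed the flaky search tool", 0.8,
              created_at=0.0, last_recalled_at=2 * DAY)
print(decayed_confidence(note, now=30 * DAY, half_life_s=30 * DAY))
print(should_archive(note, now=30 * DAY, recall_window_s=14 * DAY))
```

At retrieval time the decayed confidence can be multiplied into the similarity score, so a stale memory needs to be a much closer match to outrank a recent one.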
The Debate pattern has agents argue different sides of an issue. When does this actually improve output quality versus just wasting tokens?
Strong Answer:
The Debate pattern genuinely improves output quality in specific scenarios: when the task requires considering trade-offs, when there are legitimate multiple perspectives (policy decisions, architecture choices, risk assessments), or when you need to stress-test a conclusion. The mechanism works because each agent is forced to find weaknesses in the other’s argument, which surfaces edge cases and counterexamples that a single agent would miss.
It wastes tokens when the answer is factual and unambiguous. Having agents debate whether Python was created in 1991 is pure waste. It also underperforms when the LLM does not have strong enough knowledge to construct genuine counterarguments, in which case the “con” agent generates superficial objections that dilute rather than sharpen the analysis.
The biggest value I have seen from the Debate pattern is in code review and architecture decisions. Having a “pro” agent argue for a particular design and a “con” agent argue against it produces surprisingly thorough analysis. The judge agent then synthesizes a recommendation that accounts for trade-offs neither side would have surfaced alone.
For production, I limit debates to 2-3 rounds. Most of the information gain tends to happen in the first 2 rounds; beyond that, agents start repeating themselves with slight variations. Each additional round costs 2 LLM calls (pro + con), and each call re-reads all previous arguments, so the total number of input tokens grows roughly quadratically with the number of rounds.
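The cost growth can be checked with a little arithmetic. The sketch below assumes a simplified model that is not from the text: both agents read the same base prompt plus every argument from earlier rounds, and every argument has the same length. The token counts (500 and 300) are made-up illustration values.

```python
def debate_input_tokens(rounds: int, prompt_tokens: int, argument_tokens: int) -> int:
    """Total input tokens across all pro and con calls in a debate."""
    total = 0
    transcript = 0                                   # tokens of all earlier arguments
    for _ in range(rounds):
        total += 2 * (prompt_tokens + transcript)    # pro and con both read everything
        transcript += 2 * argument_tokens            # each adds one argument per round
    return total


for rounds in (1, 2, 3):
    print(rounds, debate_input_tokens(rounds, prompt_tokens=500, argument_tokens=300))
```

Under these assumptions the totals are 1,000, 3,200, and 6,600 input tokens for 1, 2, and 3 rounds. In closed form, with R rounds, prompt size p, and argument size a, the total is 2Rp + 2aR(R-1), which is where the quadratic term comes from.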
An underappreciated optimization: instead of running the debate at generation time (making the user wait), run it offline as a batch process for common query types. Pre-debate the top 100 most-asked questions in your domain and cache the judge’s synthesis. Then at serving time, retrieve the cached debate outcome and only run a live debate for novel queries.
Follow-up: How would you adapt the Debate pattern for a customer-facing product where latency matters?
I would restructure it as an async pipeline. The user gets an initial fast response from a single agent (1-2 seconds). In the background, the debate runs asynchronously. If the debate reaches a different conclusion than the initial response, the system notifies the user with a refined answer (like a “we have a more thorough analysis” update). For products where this UX does not work, I use a lightweight version: instead of separate pro/con agents, I use a single prompt that instructs the model to “consider the strongest argument for and against before concluding.” In my experience this captures roughly 60% of the debate benefit with no additional LLM round trips, at the cost of a somewhat longer single response.
Walk me through how you would add observability to a multi-agent system in production. What metrics matter most?
Strong Answer:
Observability in multi-agent systems is harder than single-model applications because you need to trace a request across multiple agents, each with their own LLM calls, tool invocations, and state mutations. The foundation is distributed tracing: assign a trace ID to each user request and propagate it through every agent invocation, so you can reconstruct the full execution path.
The metrics I track fall into three categories. First, per-agent metrics: latency per agent call, success/failure rate, token consumption, and the distribution of tool calls each agent makes. Second, orchestration metrics: how many agents were invoked per request, how many rounds of debate or supervisor loops, and which routing paths are most common. Third, quality metrics: user satisfaction signals (thumbs up/down), answer accuracy on a held-out evaluation set, and the percentage of requests that hit the max-iterations safety cap.
The most important single metric is “steps to completion,” which tells you how efficiently the agent system is solving tasks. A rising trend means either the tasks are getting harder or the agents are degrading. Combined with per-agent failure rates, this helps you pinpoint which agent is struggling.
For implementation, I use structured logging where each log entry includes the trace ID, agent name, step number, input summary, output summary, latency, and token count. I pipe these into an observability platform (Langfuse, LangSmith, or a custom Grafana dashboard) that lets me search by trace ID to reconstruct any request. The AgentTracer class in this chapter is a good starting point, but in production you want this integrated with your existing observability stack rather than standalone.
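A structured log entry with the fields listed above could be emitted as one JSON object per line. This sketch is independent of the chapter's AgentTracer class and of any specific platform; `new_trace_id` and `log_agent_step` are hypothetical helpers, and printing to stdout stands in for whatever log shipper you use.

```python
import json
import time
import uuid


def new_trace_id() -> str:
    """One ID per user request, propagated to every agent call."""
    return uuid.uuid4().hex


def log_agent_step(trace_id: str, agent: str, step: int, input_summary: str,
                   output_summary: str, latency_ms: float, tokens: int) -> str:
    """Emit one JSON log line for a single agent invocation."""
    record = {
        "timestamp": time.time(),
        "trace_id": trace_id,
        "agent": agent,
        "step": step,
        "input_summary": input_summary,
        "output_summary": output_summary,
        "latency_ms": latency_ms,
        "tokens": tokens,
    }
    line = json.dumps(record, sort_keys=True)
    print(line)                     # stand-in for the real log sink
    return line


trace_id = new_trace_id()
log_agent_step(trace_id, "supervisor", 1, "user question", "route to researcher", 120.5, 310)
log_agent_step(trace_id, "researcher", 2, "search query", "3 sources found", 2300.0, 1850)
```

Filtering log lines by `trace_id` and sorting by `step` reconstructs the execution path of a single request.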
A practical tip: log the full LLM prompts and responses for a sample of requests (say, 5%) rather than all requests. Full logging of every request generates enormous data volumes and privacy concerns. But having no prompt logs makes debugging hallucinations or routing errors nearly impossible.
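One way to sample requests for full prompt logging is to hash the trace ID, so that every agent in the same request makes the same decision without coordination. This is a sketch of that idea, which is an assumption on top of the text (the text only says to sample); `should_log_full_prompt` is a hypothetical helper.

```python
import hashlib


def should_log_full_prompt(trace_id: str, sample_rate: float = 0.05) -> bool:
    """Deterministically select roughly sample_rate of traces for full logging."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")        # integer in [0, 2**64)
    return bucket < int(sample_rate * 2**64)          # integer compare avoids float edge cases


sampled = sum(should_log_full_prompt(f"trace-{i}") for i in range(10_000))
print(f"{sampled} of 10000 traces selected for full prompt logging")
```

Because the decision depends only on the trace ID, a sampled request has full prompts logged for all of its agents, which is what you need to debug a routing error end to end.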
Follow-up: How do you set up alerts that catch multi-agent system degradation before users complain?
I set up three tiers of alerts. Tier 1 (immediate page): any agent hitting a 100% failure rate over a 5-minute window, or the overall max-iterations hit rate exceeding 10%. Tier 2 (Slack notification within the hour): average steps-to-completion increasing by more than 30% over a rolling 4-hour window, or per-agent latency p95 exceeding 2x the historical baseline. Tier 3 (daily report): drift in routing patterns (if the supervisor suddenly starts routing 80% of tasks to one agent, something changed), cost per request trending upward, and the distribution of user satisfaction scores. The tier 3 alerts catch slow degradation that tier 1 and 2 miss.
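The tier 1 and tier 2 thresholds can be expressed as a small rule function. This is a sketch: `WindowMetrics` and `alert_tier` are hypothetical names, the metrics are assumed to be pre-aggregated over the windows named in the text, and tier 3 is left out because it is a daily report rather than a threshold check.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WindowMetrics:
    worst_agent_failure_rate: float   # 0..1, worst agent over the last 5 minutes
    max_iterations_hit_rate: float    # 0..1, share of requests hitting the safety cap
    steps_increase_4h: float          # fractional rise in steps-to-completion (0.3 = +30%)
    p95_latency_ratio: float          # worst per-agent p95 latency / historical baseline


def alert_tier(m: WindowMetrics) -> Optional[int]:
    """Return 1 (page), 2 (Slack), or None if no threshold is crossed."""
    if m.worst_agent_failure_rate >= 1.0 or m.max_iterations_hit_rate > 0.10:
        return 1
    if m.steps_increase_4h > 0.30 or m.p95_latency_ratio > 2.0:
        return 2
    return None


print(alert_tier(WindowMetrics(1.0, 0.02, 0.05, 1.1)))   # one agent fully failing
print(alert_tier(WindowMetrics(0.1, 0.02, 0.45, 1.1)))   # steps-to-completion up 45%
print(alert_tier(WindowMetrics(0.1, 0.02, 0.05, 1.1)))   # healthy
```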