The Agent Revolution
Agents are how AI goes from “chat” to “do”. Every company wants AI that can actually take actions: book meetings, write code, research competitors, manage workflows.
2025 Agent Landscape
| Agent Type | Description | Example |
|---|---|---|
| Tool-Use Agents | Call APIs, search, calculate | Customer support bots |
| Code Agents | Write, execute, debug code | GitHub Copilot Workspace |
| Computer Use | Control browser/desktop | Anthropic Claude computer use |
| Multi-Agent | Teams of specialized agents | Research + Writing teams |
| MCP Agents | Connect to any data source | Database assistants |
The Agent Mental Model
Production Agent Framework
Complete Implementation
Advanced Agent Patterns
1. Planning Agent
2. Self-Correcting Agent (Reflexion)
3. Multi-Agent Collaboration
Safety and Guardrails
Computer Use Agents (2025)
Anthropic’s Computer Use capability allows agents to control a browser or desktop. This is the frontier of agentic AI.
Key Takeaways
ReAct Is Your Foundation
Tools Are Everything
Safety First
Observe and Debug
What’s Next
LangGraph
Interview Deep-Dive
You are building a production agent that can execute code and write files. What safety guardrails do you put in place, and where do most teams get this wrong?
- The first layer is input validation: before any tool executes, validate the arguments against a strict schema. The CodeExecutionTool in this chapter uses subprocess with a timeout, which is a start, but most teams miss the sandboxing requirement. Running user-influenced code in the same process or even the same container as your application is a critical vulnerability. In production, I execute code in an isolated sandbox, either a Docker container with no network access, a gVisor sandbox, or a cloud function with minimal permissions.
- The second layer is an allowlist/denylist approach to actions. The SafeAgent pattern here blocks dangerous strings like “rm -rf” and “sudo,” but string matching is brittle. An attacker can bypass “rm -rf” with “find / -delete” or encode commands in base64. A more robust approach is to restrict the execution environment itself: no filesystem access outside a temporary directory, no network access, no ability to install packages, and a hard memory limit.
- The third layer is the human-in-the-loop requirement for high-consequence actions. The REQUIRE_APPROVAL list is the right pattern, but the key design decision is what goes on that list. Most teams either put too much on it (every action needs approval, which defeats the purpose of automation) or too little (they miss that “write_file” to a path like “/etc/cron.d/backdoor” is catastrophic). I categorize actions by blast radius: reversible actions with low impact get auto-approved, reversible with high impact get logged, and irreversible actions always require human approval.
- The most common mistake I see is trusting the LLM’s judgment about safety. The LLM decides what tools to call and with what arguments, but it has no concept of security. A prompt injection can convince the agent that “delete all user data” is the correct action. The guardrails must be implemented in code, not in the system prompt. System prompt instructions are suggestions to the model; code-level checks are enforced constraints.
- Finally, implement comprehensive audit logging: record every tool call, its arguments, the result, and the agent’s reasoning that led to the call. This is both a security requirement (incident investigation) and a debugging requirement (understanding why the agent took an unexpected action).
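The layers above can be sketched as a single code-level gate that every tool call passes through. This is a minimal illustration, not the chapter's SafeAgent: the names `GuardedExecutor`, `TOOL_RISK`, `SANDBOX_ROOT`, and `path_is_sandboxed` are hypothetical, the risk assignments are example values, and the sandbox directory is an assumption. It does not replace process-level isolation such as a container with no network access.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from pathlib import Path
from typing import Any, Callable


class Risk(Enum):
    AUTO = "auto"        # reversible, low impact: run without review
    LOG = "log"          # reversible, high impact: run, flagged in the audit log
    APPROVE = "approve"  # irreversible: a human must approve first


# Hypothetical policy table; any tool not listed defaults to APPROVE.
TOOL_RISK = {"search": Risk.AUTO, "write_file": Risk.LOG, "delete_file": Risk.APPROVE}
SANDBOX_ROOT = Path("/tmp/agent_workspace")  # assumed scratch directory


def path_is_sandboxed(raw_path: str) -> bool:
    """Return True only if the resolved path stays inside SANDBOX_ROOT."""
    resolved = (SANDBOX_ROOT / raw_path).resolve()
    return resolved.is_relative_to(SANDBOX_ROOT.resolve())  # Python 3.9+


@dataclass
class GuardedExecutor:
    tools: dict[str, Callable[..., Any]]
    approve: Callable[[str, dict], bool]  # human-in-the-loop callback
    audit_log: list[dict] = field(default_factory=list)

    def call(self, name: str, args: dict, reasoning: str = "") -> Any:
        risk = TOOL_RISK.get(name, Risk.APPROVE)
        entry = {"ts": time.time(), "tool": name, "args": args,
                 "reasoning": reasoning, "risk": risk.value}
        try:
            # Checks live in code, so a prompt injection cannot talk its way past them.
            if name not in self.tools:
                raise PermissionError(f"unknown tool: {name}")
            if "path" in args and not path_is_sandboxed(str(args["path"])):
                raise PermissionError(f"path escapes sandbox: {args['path']}")
            if risk is Risk.APPROVE and not self.approve(name, args):
                raise PermissionError(f"approval denied for {name}")
            result = self.tools[name](**args)
            entry["outcome"] = "ok"
            entry["result"] = repr(result)[:200]
            return result
        except Exception as exc:
            entry["outcome"] = f"blocked_or_failed: {exc}"
            raise
        finally:
            # Every attempt is logged, including the blocked ones.
            self.audit_log.append(entry)
```

With this shape, a `write_file` call targeting `/etc/cron.d/backdoor` is rejected by the path check regardless of what the model was persuaded to request, and the rejected attempt still lands in `audit_log`.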
Compare the ReAct agent pattern with the Planning agent pattern. When does each shine, and when does each fail?
- ReAct (Reasoning + Acting) interleaves thinking and acting step by step. The agent thinks about what to do next, takes one action, observes the result, and repeats. This is fundamentally reactive: the agent discovers the solution path as it goes. Planning agents create a full plan upfront, then execute each step sequentially. This is fundamentally proactive: the agent commits to a strategy before acting.
- ReAct shines for exploratory tasks where you do not know what information you will find or what tools you will need until you start. Web research is a great example: you search for something, read the results, realize you need to search for a related topic, and so on. The strength is adaptability because each step is informed by the results of the previous step.
- Planning agents shine for structured, multi-step tasks where the steps are predictable and the order matters. Generating a report (outline, research each section, write, review) or processing a dataset (validate, clean, transform, analyze) are good fits. The plan gives you a progress bar, makes it easy to resume from failures, and lets you parallelize independent steps.
- ReAct fails when the task requires a coherent long-term strategy. Because each step is decided locally, the agent can wander in circles or make locally rational but globally suboptimal decisions. The classic failure is the agent spending 8 of its 10 allowed steps researching one subtopic and running out of steps before addressing the main question.
- Planning agents fail when the plan itself is wrong. If the LLM creates a flawed plan (missing a critical step or ordering steps incorrectly), the entire execution goes sideways. Unlike ReAct, a planning agent cannot easily deviate from its plan mid-execution. The Reflexion pattern (self-correcting agent) helps here by evaluating the result after execution and creating a revised plan if the first attempt fails.
- The hybrid approach that works best in production: create a high-level plan (3-5 major phases), but execute each phase using a ReAct loop. This gives you strategic direction from the plan and tactical flexibility from ReAct within each phase.
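The hybrid approach can be sketched as an outer loop over planned phases with a bounded ReAct loop inside each one. The function name `run_hybrid` and the `Planner` and `StepFn` signatures are hypothetical; in a real agent both callables would wrap LLM calls, which are stubbed out here.

```python
from typing import Callable

# Hypothetical signatures: both callables would wrap LLM calls in a real agent.
Planner = Callable[[str], list[str]]                   # task -> ordered phases
StepFn = Callable[[str, list[str]], tuple[str, bool]]  # (phase, observations) -> (observation, done)


def run_hybrid(
    task: str,
    plan: Planner,
    react_step: StepFn,
    max_steps_per_phase: int = 4,
) -> dict[str, list[str]]:
    """Plan once at a high level, then run a bounded ReAct loop inside each phase."""
    results: dict[str, list[str]] = {}
    for phase in plan(task):
        observations: list[str] = []
        # The per-phase budget keeps one subtopic from consuming every step.
        for _ in range(max_steps_per_phase):
            observation, done = react_step(phase, observations)
            observations.append(observation)
            if done:
                break
        results[phase] = observations
    return results


# Stub planner and step function, for illustration only.
def demo_plan(task: str) -> list[str]:
    return ["research", "write", "review"]


def demo_step(phase: str, observations: list[str]) -> tuple[str, bool]:
    step_number = len(observations) + 1
    return f"{phase} step {step_number}", step_number >= 2


if __name__ == "__main__":
    print(run_hybrid("summarize competitors", demo_plan, demo_step))
```

Because the step budget is per phase rather than global, the failure mode where the agent spends most of its steps on one subtopic is capped by construction.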
How do you evaluate whether an AI agent is actually performing well in production? What metrics do you track beyond simple success/failure?
- Success/failure is necessary but nowhere near sufficient. An agent can “succeed” by giving a plausible-sounding but incorrect answer, or “fail” by hitting the iteration limit but still having gathered useful partial information. I track metrics across four dimensions.
- Efficiency metrics: steps to completion, total tokens consumed per task, total wall-clock time, and the ratio of thinking steps to acting steps. An agent that spends 7 out of 10 steps thinking without acting has a reasoning loop problem. An agent that acts on every step without thinking is tool-spamming.
- Quality metrics: answer correctness (validated against a held-out test set or through LLM-as-judge evaluation), citation accuracy (if the agent claims a source, does the source actually support the claim), and completeness (did the agent address all parts of the user’s question). I sample 5-10% of production responses for automated evaluation and 1% for human review.
- Cost metrics: dollar cost per task (broken down by LLM calls and tool calls), cost distribution across task types (some tasks may cost 100x more than others), and the cost trend over time. A rising cost trend often indicates degradation in the agent’s efficiency.
- User experience metrics: time to first token (how long the user waits before seeing any response), total response time, the rate at which users rephrase their question (indicating the first answer was unsatisfactory), and explicit feedback signals like thumbs up/down.
- The most diagnostic metric I have found is the “tool call efficiency ratio”: the number of unique, useful tool calls divided by the total tool calls. An agent making 10 tool calls where 6 are redundant or irrelevant has a prompt or planning problem. This ratio should be above 0.7 for a well-tuned agent.
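A minimal sketch of the tool call efficiency ratio follows. The `ToolCall` record and its `useful` flag are assumptions: the text does not say how usefulness is decided, so here it is a label supplied from outside, for example by an LLM judge or human review. Two calls count as the same call when the tool name and arguments match.

```python
import json
from dataclasses import dataclass


@dataclass
class ToolCall:
    name: str
    args: dict
    useful: bool  # assumed label, e.g. from an LLM judge or human review


def tool_call_efficiency(calls: list[ToolCall]) -> float:
    """Unique, useful tool calls divided by total tool calls."""
    if not calls:
        return 0.0  # ratio is undefined with no calls; 0.0 is this sketch's convention
    unique_useful = {
        (call.name, json.dumps(call.args, sort_keys=True))
        for call in calls
        if call.useful
    }
    return len(unique_useful) / len(calls)


if __name__ == "__main__":
    # Ten calls, six of which repeat an earlier search.
    calls = [ToolCall("search", {"q": str(i)}, True) for i in range(4)]
    calls += [ToolCall("search", {"q": "0"}, True) for _ in range(6)]
    print(tool_call_efficiency(calls))
```

The example mirrors the case in the text: 10 calls with 6 redundant leaves 4 unique useful calls, a ratio of 0.4, well below the 0.7 threshold.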
Anthropic released computer use agents. What are the unique engineering challenges compared to tool-use agents, and how would you architect a safe deployment?
- Computer use agents operate in a fundamentally different paradigm from tool-use agents. Tool-use agents call structured APIs with typed parameters. Computer use agents interact with arbitrary GUIs through screenshots, mouse clicks, and keyboard input. This introduces three major challenges.
- First, the observation space is unstructured. With tool-use, the agent gets structured JSON responses it can reliably parse. With computer use, the agent processes screenshots and must perform visual understanding to determine the current state. OCR errors, dynamic page loading, and visual ambiguity mean the agent’s perception of the state can be wrong, leading to incorrect actions. A button that looks like a “Submit” button might actually say “Submit for Deletion.”
- Second, the action space is enormous, and many of the actions in it are irreversible. A tool-use agent chooses from a finite set of defined tools. A computer use agent can click anywhere on a 1024x768 screen and type any string. Most of those actions are meaningless, but some are catastrophic (clicking “Delete Account” or “Send All” on an email). There is no “undo” in most GUI interactions.
- Third, latency is dramatically higher. Each step requires taking a screenshot (100ms+), sending it to the model for analysis (1-2 seconds), receiving the action, executing it, waiting for the UI to respond (variable), and taking another screenshot. A 10-step task typically takes 30-60 seconds once UI waits are included, versus 5-10 seconds for a tool-use agent doing 10 tool calls.
- For safe deployment, I architect three layers. The environment layer: run the agent in an isolated VM or container with a sandboxed browser. No access to the host system, no persistent storage, no ability to install software. The action layer: maintain a denylist of UI elements the agent cannot interact with (logout buttons, delete buttons, payment forms) using coordinate-based exclusion zones or visual element detection. The monitoring layer: record every screenshot and action for replay and audit, with real-time anomaly detection that pauses the agent if it navigates to unexpected domains or performs rapid-fire clicks.
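The action-layer and monitoring-layer checks can be sketched as two small pre-flight functions that run before a proposed click or navigation is executed. Everything here is illustrative: `Zone`, `check_click`, and `domain_allowed` are hypothetical names, the zone coordinates are made up, and a real deployment would derive zones from visual element detection rather than fixed rectangles.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Zone:
    """A rectangular screen region the agent must not click (pixel coordinates)."""
    label: str
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def check_click(
    x: int, y: int, zones: list[Zone], width: int = 1024, height: int = 768
) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed click."""
    if not (0 <= x < width and 0 <= y < height):
        return False, "click outside screen"
    for zone in zones:
        if zone.contains(x, y):
            return False, f"blocked zone: {zone.label}"
    return True, "ok"


def domain_allowed(url: str, allowed: set[str]) -> bool:
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == domain or host.endswith("." + domain) for domain in allowed)


if __name__ == "__main__":
    zones = [Zone("delete account button", 800, 600, 1000, 700)]  # made-up coordinates
    print(check_click(900, 650, zones))
    print(domain_allowed("https://example.com.evil.io/", {"example.com"}))
```

Matching on the parsed hostname rather than a substring of the URL matters: a lookalike host such as `example.com.evil.io` contains the allowed name but is not a subdomain of it.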