Effective logging is essential for debugging, monitoring, and auditing LLM applications in production. But LLM logging is fundamentally different from traditional application logging. In a REST API, you log the request and response and you are done. In an LLM application, you need to capture the prompt (which might be 4,000 tokens), the completion (another 2,000 tokens), token counts, latency, cost, which documents were retrieved, which tools were called, and whether the user found the response helpful. Without this, debugging a bad response is like debugging a web app with only the HTTP status code. The golden rule: log enough to reproduce any conversation, but not so much that you violate privacy or fill your disks.
Structured Logging Setup
Logging Library Comparison
| Library | Output Format | Async Support | Best For | Setup Complexity |
|---|---|---|---|---|
| print() | Unstructured text | No | Quick debugging only | None |
| logging (stdlib) | Configurable (usually text) | No (blocks event loop) | Simple applications | Low |
| structlog | JSON (structured) | Yes (with async processors) | Production LLM apps | Moderate |
| loguru | Text or JSON | Yes | Developer experience, colored output | Low |
| python-json-logger | JSON | No | Drop-in JSON formatter for stdlib logging | Low |
Recommendation: use structlog for production LLM applications. It emits JSON once you configure its JSONRenderer processor, works alongside OpenTelemetry, and supports context binding (attach trace_id and user_id once and they appear on every log line). Use loguru if you prioritize developer experience during development.
Using structlog
Why structured logging instead of print() or basic logging? Because structured logs (JSON format) are machine-parseable. You can query them with tools like jq, feed them into Elasticsearch, or aggregate them in Grafana. A log line like {"event": "llm_request", "model": "gpt-4o", "latency_ms": 2300, "tokens": 1847} is far more useful than INFO: Called GPT-4o in 2.3 seconds, because every field can be filtered and aggregated without regex parsing.
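To make the idea of a machine-parseable log line concrete, here is a minimal sketch that uses only the standard library so it runs without extra dependencies. It is not structlog's API; the `JsonFormatter` class, the `build_logger` helper, and the `fields` key passed through `extra` are names assumed for illustration. With structlog, a JSON renderer processor would play the role of the formatter.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname.lower(),
            "event": record.getMessage(),
        }
        # Structured fields arrive via `extra={"fields": {...}}`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes one JSON object per line to `stream`."""
    logger = logging.getLogger("llm_app.structured_demo")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
log.info(
    "llm_request",
    extra={"fields": {"model": "gpt-4o", "latency_ms": 2300, "tokens": 1847}},
)

# Because the line is JSON, it can be parsed back and queried by field.
line = json.loads(buffer.getvalue())
print(line["model"], line["latency_ms"])
```

In a real service the stream would be stdout or a file that a log shipper tails; the in-memory buffer here only keeps the example self-checking.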
Custom LLM Logger
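One possible shape for a custom LLM logger is sketched below. It records the per-call fields this page recommends (model, token counts, latency, finish reason, cost estimate) as one JSON line. The `LLMLogger` and `LLMCallRecord` names, the `example-model` identifier, and the prices in `PRICES_PER_1K` are all assumptions made for illustration, not real provider pricing.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, Optional

# Hypothetical prices in USD per 1,000 tokens; real prices vary by provider.
PRICES_PER_1K = {"example-model": {"input": 0.0025, "output": 0.01}}


@dataclass
class LLMCallRecord:
    event: str
    request_id: str
    model: str
    input_tokens: int
    output_tokens: int
    latency_ms: float
    finish_reason: str
    cost_usd: Optional[float]


class LLMLogger:
    """Writes one JSON line per LLM call to a pluggable sink."""

    def __init__(self, sink: Callable[[str], None] = print):
        self.sink = sink

    @staticmethod
    def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> Optional[float]:
        price = PRICES_PER_1K.get(model)
        if price is None:
            return None  # unknown model: report no estimate rather than guess
        cost = input_tokens / 1000 * price["input"] + output_tokens / 1000 * price["output"]
        return round(cost, 6)

    def log_call(self, request_id: str, model: str, input_tokens: int,
                 output_tokens: int, latency_ms: float, finish_reason: str) -> LLMCallRecord:
        record = LLMCallRecord(
            event="llm_call",
            request_id=request_id,
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            latency_ms=latency_ms,
            finish_reason=finish_reason,
            cost_usd=self.estimate_cost(model, input_tokens, output_tokens),
        )
        self.sink(json.dumps(asdict(record)))
        return record


captured = []  # stand-in for a real log sink
llm_logger = LLMLogger(sink=captured.append)
call = llm_logger.log_call("req-1", "example-model", 4000, 2000, 2300.0, "stop")
```

Note that the record deliberately carries no prompt or completion text; whether to log those is a policy decision covered later on this page.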
Request Tracing
Request tracing answers the question: “this user got a bad response, so what happened?” In a typical LLM application, a single user request might trigger a retrieval query, two LLM calls (one for query expansion, one for generation), and a tool call. Without tracing, these are four disconnected log entries. With tracing, they are linked by a single trace ID, so you can reconstruct the full journey from user question to final answer. Think of tracing like a receipt at a restaurant: it links the order, the kitchen prep, the plating, and the delivery into one traceable transaction.
Audit Logging
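As a rough sketch of an append-only, tamper-evident audit trail, the example below chains each entry to the previous one with a SHA-256 hash. Hash chaining is one assumed way to implement content hashing, not necessarily the approach the original code used. The `AuditLog` class and its field names are hypothetical, and the in-memory list stands in for durable, write-once storage.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64  # assumed placeholder hash for the first entry


class AuditLog:
    """Append-only log where each entry hashes its content and predecessor."""

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(entry: dict) -> str:
        # Hash everything except the stored hash itself, with stable key order.
        payload = {k: v for k, v in entry.items() if k != "hash"}
        encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()

    def append(self, actor: str, action: str, resource: str) -> dict:
        prev_hash = self._entries[-1]["hash"] if self._entries else GENESIS_HASH
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "actor": actor,
            "action": action,
            "resource": resource,
            "prev_hash": prev_hash,
        }
        entry["hash"] = self._digest(entry)
        self._entries.append(entry)
        return dict(entry)  # return a copy so callers cannot edit the log

    def verify(self) -> bool:
        """Return True only if no entry was altered, removed, or reordered."""
        prev_hash = GENESIS_HASH
        for entry in self._entries:
            if entry["prev_hash"] != prev_hash or entry["hash"] != self._digest(entry):
                return False
            prev_hash = entry["hash"]
        return True


audit = AuditLog()
audit.append("user:42", "llm_query", "patient_record:7")
audit.append("system", "document_retrieved", "patient_record:7")
intact = audit.verify()
```

Changing any stored field afterwards makes `verify()` return False, which is what makes the log tamper-evident rather than tamper-proof.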
Audit logging serves a different purpose than operational logging. Operational logs help you debug; audit logs help you prove compliance. If your LLM application handles medical records, financial data, or personal information, regulators may require you to demonstrate who accessed what data, when, and what the system did with it. Audit logs are append-only (never modified or deleted), tamper-evident (content-hashed), and retained for a defined period. Even if you are not in a regulated industry today, audit logging is cheap insurance. When a customer asks “what did your AI do with my data?”, you want an answer.
Debug Mode
Debug mode is your “X-ray vision” during development. It prints full prompts, responses, latency, and token counts for every LLM call. The key design principle: it is controlled entirely by environment variables, so you never need to change code to toggle it on or off. Set LLM_DEBUG=true in your .env during development, and leave it unset in production. Because no code changes are involved, there is far less risk of accidentally shipping debug logging.
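The environment-variable toggle can be sketched as follows. `LLM_DEBUG` comes from the text above; the `debug_enabled` and `debug_dump` helpers, the accepted truthy values, and the output format are assumptions for illustration. The `env` and `out` parameters exist only so the behaviour can be exercised without touching the real process environment.

```python
import os


def debug_enabled(env=None) -> bool:
    """True when LLM_DEBUG is set to a truthy value (assumed: 1/true/yes)."""
    env = os.environ if env is None else env
    return env.get("LLM_DEBUG", "").strip().lower() in {"1", "true", "yes"}


def debug_dump(prompt: str, response: str, latency_ms: float, tokens: int,
               env=None, out=print) -> bool:
    """Print full call details only in debug mode; return whether it printed."""
    if not debug_enabled(env):
        return False
    out(f"[LLM_DEBUG] latency={latency_ms}ms tokens={tokens}")
    out(f"[LLM_DEBUG] prompt: {prompt}")
    out(f"[LLM_DEBUG] response: {response}")
    return True


# With the variable unset, nothing is printed and no prompt text leaves the app.
printed = debug_dump("Summarize this ticket.", "Summary...", 812.0, 240, env={})
```

Reading the flag at call time rather than at import time means the toggle follows the environment of the running process, with no code path that needs editing before a release.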
Error Diagnostics
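A pattern-matching diagnostics helper might look like the sketch below. The error strings it recognizes are the two quoted in this section plus a few common ones; the categories, regular expressions, and suggestion wording are assumptions, and a real table would be built from the errors your own providers return.

```python
import re
from typing import NamedTuple


class Diagnosis(NamedTuple):
    category: str
    suggestion: str


# Ordered list of (pattern, diagnosis); the first match wins.
_PATTERNS = [
    (re.compile(r"\b429\b|rate.?limit", re.IGNORECASE),
     Diagnosis("rate_limit", "Retry with exponential backoff and check your quota.")),
    (re.compile(r"context_length_exceeded|maximum context length", re.IGNORECASE),
     Diagnosis("context_length", "Trim conversation history or retrieved chunks.")),
    (re.compile(r"\b401\b|invalid.?api.?key|unauthorized", re.IGNORECASE),
     Diagnosis("auth", "Verify the API key and the environment it is loaded from.")),
    (re.compile(r"timeout|timed out", re.IGNORECASE),
     Diagnosis("timeout", "Raise the client timeout or reduce the requested output length.")),
]

UNKNOWN = Diagnosis("unknown", "No known pattern matched; inspect the full error and request ID.")


def diagnose(message: str) -> Diagnosis:
    """Map a raw error message to a known failure mode and a suggestion."""
    for pattern, diagnosis in _PATTERNS:
        if pattern.search(message):
            return diagnosis
    return UNKNOWN


result = diagnose("Error code: 429")
print(result.category)  # rate_limit
```

Logging the returned category alongside the raw message also gives the log analyzer a stable field to count errors by.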
When an LLM API call fails, the error message is often cryptic: “Error code: 429” or “context_length_exceeded.” A diagnostics layer that pattern-matches error messages against known failure modes can attach actionable suggestions to each failure. This is especially valuable for on-call engineers who may not be deeply familiar with LLM-specific failure patterns: instead of searching for the error, they get immediate guidance.
Log Analysis
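A small analyzer over JSON log lines could be sketched like this. It assumes each line is a JSON object with optional `model`, `latency_ms`, `cost_usd`, and `error_type` fields; those field names, the nearest-rank percentile method, and the sample data are illustrative assumptions.

```python
import json
import math
from collections import Counter, defaultdict


def percentile(values, pct: float):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def analyze(lines) -> dict:
    """Summarize cost per model, P95 latency, and the most common errors."""
    cost_by_model = defaultdict(float)
    latencies = []
    errors = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        if "error_type" in entry:
            errors[entry["error_type"]] += 1
            continue
        cost_by_model[entry.get("model", "unknown")] += entry.get("cost_usd", 0.0)
        if "latency_ms" in entry:
            latencies.append(entry["latency_ms"])
    return {
        "cost_by_model": dict(cost_by_model),
        "p95_latency_ms": percentile(latencies, 95) if latencies else None,
        "top_errors": errors.most_common(3),
    }


sample_lines = [
    '{"model": "model-a", "latency_ms": 100, "cost_usd": 0.5}',
    '{"model": "model-a", "latency_ms": 200, "cost_usd": 0.25}',
    '{"model": "model-b", "latency_ms": 300, "cost_usd": 0.25}',
    '{"error_type": "rate_limit"}',
    'not json',
]
report = analyze(sample_lines)
```

Comparing `p95_latency_ms` across successive daily or weekly reports is what surfaces slow drifts that a single report cannot show.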
Raw logs are of little use until they are analyzed. A log analyzer turns your log files into actionable insights: which models are most expensive, where the latency bottlenecks are, and what errors are most common. Run the analysis daily or weekly to catch trends before they become incidents. A gradual increase in P95 latency, for example, might indicate that your prompts are getting longer over time as conversation history grows unbounded.
Logging Levels
| Level | When to Use | Examples | Volume Expectation |
|---|---|---|---|
| DEBUG | Development only | Full prompts, responses, intermediate chain steps | 10-100x normal volume — never in production |
| INFO | Normal operations | Token counts, latency, model used, finish_reason | 1 entry per LLM call |
| WARNING | Potential issues | Retries, fallbacks, slow responses (P95+), near-limit token usage | 1-5% of requests |
| ERROR | Failures | API errors, timeouts, validation failures, content filter triggers | 1-3% of requests (normal) |
| CRITICAL | System failures | All providers down, cost budget exceeded, data pipeline broken | Should be rare (pages someone) |
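The table above can be turned into a small decision function. This is a sketch under assumptions: the `level_for_call` name, its parameters, and the default P95 threshold of 5,000 ms are hypothetical, and the order of the checks encodes one reasonable priority (system failure, then error, then warning signs).

```python
import logging
from typing import Optional


def level_for_call(error: Optional[str] = None, retries: int = 0,
                   latency_ms: float = 0.0, p95_latency_ms: float = 5000.0,
                   all_providers_down: bool = False) -> int:
    """Pick a stdlib logging level for one LLM call outcome."""
    if all_providers_down:
        return logging.CRITICAL  # system failure: should page someone
    if error is not None:
        return logging.ERROR     # API error, timeout, validation failure
    if retries > 0 or latency_ms >= p95_latency_ms:
        return logging.WARNING   # succeeded, but with retries or slowly
    return logging.INFO          # normal operation


level = level_for_call(retries=1, latency_ms=900.0)
print(logging.getLevelName(level))  # WARNING
```

Centralizing the choice in one function keeps level usage consistent across call sites, which in turn keeps the volume expectations in the table meaningful.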
What to Log vs. What Not to Log
This is where most teams get it wrong: they either log too little (cannot debug) or too much (PII exposure, storage costs, GDPR violations).
| Always Log | Never Log in Production | Log Conditionally (with consent/policy) |
|---|---|---|
| Request ID, trace ID | Full user prompts (PII risk) | Prompt text (with PII redaction) |
| Model name, provider | API keys, tokens, secrets | Response text (for quality review) |
| Token counts (input/output) | User personal data (name, email) | User ID (if you have consent) |
| Latency (total, TTFT) | Raw file uploads | Retrieved document chunks |
| Finish reason | Internal system prompt text | Tool call arguments |
| Error type and code | Full stack traces with sensitive context | Session/conversation history |
| Cost estimate | Other users’ data in multi-tenant logs | Geographic/IP data |
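One way to enforce the table above in code is an allowlist for the always-logged fields plus redaction for anything logged conditionally. The sketch below is illustrative only: the field names, the `build_log_entry` and `redact` helpers, and both regular expressions are assumptions (the `sk-` pattern in particular is just an assumed key shape), and regex redaction catches only simple cases, so it is not a substitute for a proper PII detection step.

```python
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SECRET_RE = re.compile(r"\bsk-[A-Za-z0-9_-]{8,}\b")  # assumed API-key shape

# Fields from the "Always Log" column (names are illustrative).
ALWAYS_LOG = {
    "request_id", "trace_id", "model", "input_tokens", "output_tokens",
    "latency_ms", "finish_reason", "error_type", "cost_usd",
}


def redact(text: str) -> str:
    """Mask email addresses and key-like strings before text is logged."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SECRET_RE.sub("[SECRET]", text)


def build_log_entry(raw: dict, log_prompts: bool = False) -> dict:
    """Keep allowlisted fields; add a redacted prompt only when policy allows."""
    entry = {key: value for key, value in raw.items() if key in ALWAYS_LOG}
    if log_prompts and "prompt" in raw:
        entry["prompt"] = redact(raw["prompt"])
    return entry


raw_call = {
    "request_id": "req-1",
    "model": "example-model",
    "latency_ms": 950,
    "api_key": "sk-abcdefgh12345678",
    "prompt": "Email jane@example.com about the invoice.",
}
safe_entry = build_log_entry(raw_call)
reviewed_entry = build_log_entry(raw_call, log_prompts=True)
```

An allowlist fails closed: a new field added to the request object stays out of the logs until someone deliberately adds it to `ALWAYS_LOG`.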
What is Next
API Design for AI
Learn to design robust APIs for LLM applications