December 2025 Update: Production strategies for managing prompts including version control, A/B testing, and prompt observability.
The Prompt Management Problem
Here is a scenario that plays out at many companies building with LLMs: a developer changes a system prompt to fix one edge case, and it silently breaks three other use cases. Nobody notices until customers complain a week later. By then, nobody remembers what the prompt said before or why it was changed. Sound familiar? Prompts are the new configuration files, except they are more fragile, harder to test, and have a bigger blast radius when they break. They are critical business logic, but they are often managed poorly.
Prompt Registry Pattern
A prompt registry is the foundational pattern for managing prompts at scale. Instead of hardcoding prompts as string literals scattered across your codebase, you centralize them in a registry that supports versioning, variable substitution, and metadata tracking. Think of it as a database for prompts: every prompt has a name, a version, an author, and a history of changes.
Basic Prompt Registry
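A minimal in-memory sketch of such a registry might look like the following. The class names (`PromptVersion`, `PromptRegistry`), the `$variable` placeholder syntax, and the `support_reply` prompt are illustrative assumptions, not a prescribed API; a production registry would persist versions in a database rather than a dict.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from string import Template
from typing import Optional


@dataclass(frozen=True)
class PromptVersion:
    name: str
    version: int
    template: str  # uses $variable placeholders (string.Template syntax)
    author: str
    change_note: str = ""
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def render(self, **variables: str) -> str:
        # substitute() raises KeyError on a missing variable instead of
        # silently sending a half-filled prompt to the model.
        return Template(self.template).substitute(**variables)


class PromptRegistry:
    """Hypothetical registry: each name maps to an append-only version list."""

    def __init__(self) -> None:
        self._versions = {}

    def register(self, name: str, template: str, author: str,
                 change_note: str = "") -> PromptVersion:
        history = self._versions.setdefault(name, [])
        entry = PromptVersion(name, len(history) + 1, template, author, change_note)
        history.append(entry)
        return entry

    def get(self, name: str, version: Optional[int] = None) -> PromptVersion:
        history = self._versions[name]  # KeyError if the prompt is unknown
        if version is None:
            return history[-1]  # latest version
        if not 1 <= version <= len(history):
            raise KeyError(f"{name} has no version {version}")
        return history[version - 1]

    def history(self, name: str) -> list:
        return list(self._versions[name])


registry = PromptRegistry()
registry.register("support_reply", "You are a support agent for $product.", author="alice")
registry.register(
    "support_reply",
    "You are a concise support agent for $product. Answer in $language.",
    author="bob",
    change_note="Add language control",
)
latest = registry.get("support_reply")
rendered = latest.render(product="Acme", language="English")
```

Because versions are append-only, rolling back is just a matter of serving `registry.get("support_reply", 1)` instead of the latest entry.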
File-Based Prompt Management
For teams that want prompt version control without building a database, file-based management is a pragmatic middle ground. You store prompts as text files in your git repo, organized by name and version. This gives you the benefits of git (diffs, blame, pull request reviews) for free. The trade-off: you cannot change prompts without a code deployment, which is actually a feature for teams that want deployment gates and review processes.
A/B Testing Prompts
A/B testing for prompts answers the question every team eventually asks: “is this new prompt actually better, or does it just look better on the three examples we tried?” Without controlled experiments, prompt changes are driven by vibes rather than data. A sound experiment framework uses deterministic, hash-based user assignment so the same user always sees the same variant, which is essential for consistent UX and valid statistical comparisons.
Experiment Framework
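The hash-based assignment described above can be sketched roughly as follows. The `Experiment` class, its method names, and the experiment and variant names are assumptions made for illustration; the key idea is only that the bucket is derived from a hash of the experiment name and user ID rather than from a random draw.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Experiment:
    name: str
    variants: dict  # variant name -> traffic share; shares must sum to 1.0
    outcomes: dict = field(default_factory=lambda: defaultdict(list))

    def __post_init__(self) -> None:
        if abs(sum(self.variants.values()) - 1.0) > 1e-9:
            raise ValueError("variant weights must sum to 1.0")

    def assign(self, user_id: str) -> str:
        # Hash experiment name + user id so a user's bucket is stable within
        # one experiment but independent across experiments.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).digest()
        point = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
        cumulative = 0.0
        for variant, weight in self.variants.items():
            cumulative += weight
            if point < cumulative:
                return variant
        return variant  # guards against float rounding at the upper edge

    def record(self, user_id: str, score: float) -> None:
        # Store a metric (e.g. a satisfaction score) under the user's variant.
        self.outcomes[self.assign(user_id)].append(score)

    def summary(self) -> dict:
        # variant -> (number of observations, mean score)
        return {v: (len(s), sum(s) / len(s)) for v, s in self.outcomes.items() if s}


exp = Experiment("support_reply_v1_vs_v2", {"control": 0.5, "treatment": 0.5})
variant = exp.assign("user_123")
same_again = exp.assign("user_123")  # identical to `variant` on every call
```

Including the experiment name in the hash is a deliberate choice in this sketch: it keeps the same users from always landing in the treatment group across every experiment you run.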
Prompt Storage: Database vs. Files vs. Config Service
| Approach | Pros | Cons | Best For |
|---|---|---|---|
| Git files (text files in repo) | Full version history via git, PR review process, free | Requires deployment to change prompts, no runtime switching | Teams that want deployment gates and review rigor |
| Database (PostgreSQL/Redis) | Runtime changes without deploy, A/B testing, fast rollback | Need to build admin UI, risk of unreviewed changes, migration complexity | Teams that iterate on prompts frequently in production |
| Config service (LaunchDarkly, Flagsmith) | Feature flags for prompts, gradual rollouts, instant rollback | Cost, limited version history, not purpose-built for prompts | Teams already using feature flags for other config |
| Langfuse/LangSmith | Purpose-built, tracing integration, evaluation workflows | Vendor lock-in, cost at scale, data leaves your infra | Teams already using these for observability |
| Hybrid (git + database cache) | Git for source of truth, database for runtime serving | More complex, need sync mechanism | Mature teams wanting both rigor and flexibility |
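To make the git-files row concrete, here is a rough sketch of a loader over a directory of versioned text files. The `prompts/<name>/v<N>.txt` layout and the `FilePromptStore` class are hypothetical conventions chosen for this example, not a standard.

```python
import re
from pathlib import Path
from typing import Optional

VERSION_FILE = re.compile(r"^v(\d+)\.txt$")


class FilePromptStore:
    """Reads prompts from <root>/<name>/v<N>.txt (an assumed layout)."""

    def __init__(self, root) -> None:
        self.root = Path(root)

    def versions(self, name: str) -> list:
        directory = self.root / name
        if not directory.is_dir():
            return []
        found = []
        for path in directory.iterdir():
            match = VERSION_FILE.match(path.name)
            if match:
                found.append(int(match.group(1)))
        return sorted(found)  # numeric sort, so v10 comes after v9

    def load(self, name: str, version: Optional[int] = None) -> str:
        available = self.versions(name)
        if not available:
            raise FileNotFoundError(f"no prompt files for {name!r} under {self.root}")
        chosen = available[-1] if version is None else version
        if chosen not in available:
            raise FileNotFoundError(f"{name!r} has no version {chosen}")
        return (self.root / name / f"v{chosen}.txt").read_text(encoding="utf-8")
```

With a layout such as `prompts/support_reply/v1.txt` and `prompts/support_reply/v2.txt`, `FilePromptStore("prompts").load("support_reply")` would return the contents of `v2.txt`, and every change to those files shows up as an ordinary diff in a pull request.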
Prompt Testing Framework
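One way to make the automated checks discussed in this section concrete is a small harness like the sketch below. `PromptTestCase`, `run_case`, and `run_suite` are hypothetical names, `generate` stands in for whatever function calls your model, and the forbidden-phrase list is only a crude proxy for tone; a real suite might use an LLM-based grader for that.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class PromptTestCase:
    name: str
    user_input: str
    must_contain: list = field(default_factory=list)
    must_not_contain: list = field(default_factory=list)  # crude tone proxy
    max_words: Optional[int] = None


def run_case(case: PromptTestCase, generate: Callable[[str], str]) -> list:
    """Return a list of failure messages; an empty list means the case passed."""
    output = generate(case.user_input)
    lowered = output.lower()
    failures = []
    for phrase in case.must_contain:
        if phrase.lower() not in lowered:
            failures.append(f"missing phrase: {phrase!r}")
    for phrase in case.must_not_contain:
        if phrase.lower() in lowered:
            failures.append(f"forbidden phrase: {phrase!r}")
    word_count = len(output.split())
    if case.max_words is not None and word_count > case.max_words:
        failures.append(f"too long: {word_count} words > {case.max_words}")
    return failures


def run_suite(cases: list, generate: Callable[[str], str]) -> dict:
    # case name -> failure messages (empty list for a passing case)
    return {case.name: run_case(case, generate) for case in cases}


def fake_model(user_input: str) -> str:
    # Stand-in for a real model call so the example runs offline.
    return "We are sorry for the trouble. You can request a refund within 30 days."


suite = [
    PromptTestCase("refund_policy", "Can I get my money back?",
                   must_contain=["refund"], max_words=50),
    PromptTestCase("stays_polite", "This product is garbage",
                   must_not_contain=["calm down"]),
]
results = run_suite(suite, fake_model)
failed = {name: msgs for name, msgs in results.items() if msgs}
```

In CI, a non-empty `failed` dict would block the prompt change from being deployed. Because LLM outputs vary between runs, real suites often run each case several times and require a pass rate rather than a single pass.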
Prompt testing is the most underinvested area in AI engineering. Teams will spend weeks writing unit tests for their API endpoints but deploy prompt changes to production without any automated testing. A prompt testing framework lets you define test cases with expected behaviors (contains certain phrases, stays under a word limit, maintains a professional tone) and run them automatically before any prompt deployment.
Practical tip: start with 5-10 test cases covering your most common and most critical use cases. Add a new test case every time you find a bug in production. Over time, this becomes your regression suite.
A/B Testing Edge Cases
Edge case: statistical significance with LLM non-determinism. Unlike traditional A/B tests where the treatment is deterministic (users see button A or button B), LLM outputs vary even within the same prompt version. This means you need larger sample sizes to detect real differences. A rough rule: aim for 200+ observations per variant before drawing conclusions, and use metrics that aggregate well (satisfaction score, task completion rate) rather than individual response quality.
Edge case: prompt interactions with context. A prompt version that wins the A/B test on short queries might lose on long queries. Segment your results by input characteristics (query length, topic, user tier) before declaring a winner. The “winning” prompt might only be better for 60% of your traffic.
Edge case: user experience consistency. If user_123 gets variant A on Monday and you end the experiment on Tuesday, switching them to the winning variant B means their experience changes mid-conversation. For chatbot-style applications, pin users to their variant for the duration of their session or conversation thread, not just per request.
Prompt Lifecycle Management
Just like code goes through dev, staging, and production, prompts should go through a defined lifecycle: draft, testing, canary, production, deprecated, archived. A lifecycle manager enforces valid transitions (you cannot jump from draft to production) and maintains an audit trail of who changed what and why. This is not bureaucracy; it is the minimum safety net for a system where a single word change can break your product.
| Transition | When | Who Should Approve |
|---|---|---|
| Draft -> Testing | Developer is ready for automated eval | Self (author) |
| Testing -> Canary | All test cases pass | Tech lead or prompt owner |
| Canary -> Production | Canary metrics match or exceed current production | On-call or product owner |
| Production -> Deprecated | New version promoted, or critical bug found | Anyone (emergency) / prompt owner (planned) |
| Any -> Archived | No longer needed, preserved for history | Prompt owner |
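The transitions in the table can be enforced with a small state machine. The sketch below is one possible shape, with `Stage`, `ALLOWED`, and `PromptLifecycle` as illustrative names; it encodes only the transitions listed in the table and treats archived as terminal, which is an assumption. Approval checks (who is allowed to perform each move) are left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Stage(Enum):
    DRAFT = "draft"
    TESTING = "testing"
    CANARY = "canary"
    PRODUCTION = "production"
    DEPRECATED = "deprecated"
    ARCHIVED = "archived"


# Allowed moves, taken from the transition table; any stage may be archived.
ALLOWED = {
    Stage.DRAFT: {Stage.TESTING, Stage.ARCHIVED},
    Stage.TESTING: {Stage.CANARY, Stage.ARCHIVED},
    Stage.CANARY: {Stage.PRODUCTION, Stage.ARCHIVED},
    Stage.PRODUCTION: {Stage.DEPRECATED, Stage.ARCHIVED},
    Stage.DEPRECATED: {Stage.ARCHIVED},
    Stage.ARCHIVED: set(),  # terminal in this sketch
}


@dataclass
class AuditEntry:
    from_stage: Stage
    to_stage: Stage
    actor: str
    reason: str
    at: datetime


@dataclass
class PromptLifecycle:
    prompt_name: str
    version: int
    stage: Stage = Stage.DRAFT
    audit_log: list = field(default_factory=list)

    def transition(self, to_stage: Stage, actor: str, reason: str) -> None:
        if to_stage not in ALLOWED[self.stage]:
            raise ValueError(f"cannot move {self.stage.value} -> {to_stage.value}")
        # Record who changed what and why before updating the stage.
        self.audit_log.append(
            AuditEntry(self.stage, to_stage, actor, reason, datetime.now(timezone.utc))
        )
        self.stage = to_stage


lifecycle = PromptLifecycle("support_reply", version=2)
lifecycle.transition(Stage.TESTING, actor="alice", reason="ready for automated eval")
lifecycle.transition(Stage.CANARY, actor="tech_lead", reason="all test cases pass")
```

Calling `lifecycle.transition(Stage.PRODUCTION, ...)` directly from draft would raise `ValueError`, which is the guard the lifecycle description asks for.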
Key Takeaways
Version Everything
Treat prompts like code with proper version control
Test Before Deploy
Automated testing catches issues before production
A/B Test Changes
Measure impact with controlled experiments
Enable Rollbacks
Quick rollback capability is essential for production
What’s Next
LLM Orchestration
Learn to orchestrate multiple LLM providers with unified APIs