Why FastAPI for AI?
If Flask is a Swiss Army knife, FastAPI is a purpose-built power tool for APIs. It was designed from the ground up around async I/O and type safety, two things that matter enormously when you are wrapping LLM calls. Every LLM request is I/O-bound (you are waiting for the provider’s servers), so async support is not a nice-to-have; it can be the difference between handling tens of concurrent users and handling thousands.
FastAPI is the go-to framework for AI APIs because:
- Async native: Handle thousands of concurrent LLM calls without blocking worker threads
- Auto documentation: Swagger UI out of the box — your frontend team can explore the API without reading code
- Type safety: Pydantic validation catches malformed requests before they hit your LLM (saving you tokens and money)
- Fast: One of the fastest Python frameworks, on par with Node.js and Go for I/O-bound workloads
Quick Start
Visit http://localhost:8000/docs for interactive API docs.
Request & Response Models
Pydantic models are your contract with the outside world. Think of them like a bouncer at a club: they check IDs at the door so your endpoint code never has to worry about invalid data. If a client sends temperature: "hot" instead of temperature: 0.7, Pydantic rejects it with a clear error before your code ever runs. This saves LLM tokens (you never send garbage to the API) and simplifies debugging.
Use Pydantic for validation:
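A minimal sketch of such a contract follows. The model names, field names, and bounds (`ChatRequest`, `max_tokens`, the 0 to 2 temperature range) are assumptions chosen for illustration.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class ChatRequest(BaseModel):
    # Hypothetical request shape; adjust bounds to your provider's limits.
    message: str = Field(..., min_length=1, max_length=4000)
    temperature: float = Field(0.7, ge=0.0, le=2.0)
    max_tokens: int = Field(512, gt=0, le=4096)


class ChatResponse(BaseModel):
    reply: str
    tokens_used: Optional[int] = None


# A non-numeric temperature is rejected before any LLM call is made.
try:
    ChatRequest(message="Hi", temperature="hot")
except ValidationError as exc:
    print(f"rejected: {len(exc.errors())} validation error(s)")
```

When a model like `ChatRequest` is used as an endpoint parameter, FastAPI runs the same validation automatically and returns a 422 response on failure.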
Path & Query Parameters
Dependency Injection
Dependency injection in FastAPI is like a restaurant prep kitchen: ingredients (database connections, API keys, authenticated users) are prepared before the chef (your endpoint function) starts cooking. Instead of every endpoint manually checking API keys and opening database connections, you declare what you need in the function signature and FastAPI wires it up automatically. This keeps your endpoint code focused on business logic.
Handle auth, database connections, and shared logic:
Async Operations
This is where FastAPI earns its keep for AI applications. A synchronous server handles one LLM call at a time per worker: if a call takes 3 seconds, you need 100 workers to serve 100 concurrent users. An async server can handle hundreds of concurrent calls on a single worker because each request hands control back to the event loop while it waits for the LLM provider to respond. Practical tip: always use AsyncOpenAI (not OpenAI) inside async endpoints, or you will block the event loop and lose all the concurrency benefits.
Handle concurrent LLM calls efficiently:
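The effect is easy to see with `asyncio.gather`. In this sketch `fake_llm_call` is a stand-in (an assumption) for an awaitable provider call such as one made through `AsyncOpenAI`; it only sleeps to simulate network latency.

```python
import asyncio


async def fake_llm_call(prompt: str, delay: float = 0.1) -> str:
    """Stand-in for an awaitable LLM client call; sleeps instead of doing I/O."""
    await asyncio.sleep(delay)
    return f"echo: {prompt}"


async def answer_all(prompts: list[str]) -> list[str]:
    # All calls are in flight at once, so total time is roughly one delay,
    # not the sum of the delays. Results come back in input order.
    return await asyncio.gather(*(fake_llm_call(p) for p in prompts))


if __name__ == "__main__":
    print(asyncio.run(answer_all(["a", "b", "c"])))
```

Replacing `asyncio.sleep` with a blocking call such as `time.sleep` would serialize the three calls, which is the same failure mode as using a synchronous client inside an async endpoint.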
Streaming Responses
Streaming is essential for LLM applications because users perceive a streaming response as 5-10x faster than a batch response, even when the total generation time is identical. It is the same reason loading progress bars feel faster than a blank screen. Server-Sent Events (SSE) are the standard transport for LLM streaming: the client opens one HTTP connection and the server pushes tokens as they are generated.
Error Handling
Middleware
File Uploads
Document upload is the front door to any RAG pipeline. Users submit PDFs, Word docs, or plain text, and your API needs to validate the file type, enforce size limits, and kick off the processing pipeline. Practical tip: always validate on the server side even if the frontend checks too. Clients can be spoofed, and a 500MB PDF will happily crash your worker if you do not enforce limits.
Background Tasks
When a user uploads a document, you do not want them staring at a spinner while you chunk, embed, and index it. Background tasks let you return a “processing” status immediately and do the heavy lifting after the response is sent. Think of it like dropping off dry cleaning: you get a ticket instantly and pick up the finished result later. For larger workloads, consider Celery or a dedicated task queue, but FastAPI’s built-in BackgroundTasks is perfect for lightweight jobs.
Routers for Organization
As your AI API grows beyond 5-6 endpoints, a single main.py becomes unwieldy. Routers let you split your API into logical modules (chat, documents, search), each in its own file with its own prefix and tags. This is the same idea as Blueprints in Flask, but with better typing support.
Application Structure
Configuration with Pydantic Settings
Never hardcode API keys, model names, or rate limits in your code. Pydantic Settings reads from environment variables (and .env files) and validates them at startup. If OPENAI_API_KEY is missing, your app fails immediately with a clear error instead of crashing on the first request. The @lru_cache decorator ensures the settings are parsed once and reused, not re-read on every request.
Testing
Production Deployment
Practical tip: in production on a single host, run uvicorn with multiple workers, for example --workers 4 (or 2x your CPU cores for I/O-bound AI workloads). Each worker is a separate process, so a crash in one does not take down the others. For Kubernetes deployments, use a single worker per container and scale horizontally instead; it is easier to manage and gives you cleaner resource limits.
Building MCP Servers with FastMCP
Model Context Protocol (MCP) allows AI assistants like Claude to interact with external tools. FastMCP makes it easy to create MCP servers using FastAPI-like patterns.
Install FastMCP
Basic MCP Server
MCP Resources (Read-Only Data)
MCP Prompts (Reusable Templates)
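A reusable prompt template can be as simple as a code-review checklist, for example asking the model to review code for:
- Bugs and errors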
- Performance issues
- Security vulnerabilities
- Code style and readability
Configure Claude Desktop
Add to your Claude Desktop config (claude_desktop_config.json):
Quick Reference
| Feature | Code |
|---|---|
| Create app | app = FastAPI() |
| GET endpoint | @app.get("/path") |
| POST endpoint | @app.post("/path") |
| Path param | @app.get("/items/{id}") |
| Query param | def f(q: str = Query(...)) |
| Request body | def f(body: Model) |
| Dependency | Depends(function) |
| Background task | BackgroundTasks |
| Streaming | StreamingResponse |
| File upload | UploadFile |
| MCP tool | @mcp.tool() |
| MCP resource | @mcp.resource("uri") |
Next Step: Now learn database operations with SQLAlchemy & Databases Crash Course.
Interview Deep-Dive
You are building an AI API with FastAPI that wraps multiple LLM providers. A single request might take 3-30 seconds depending on the model. How do you design the API to handle this gracefully?
Strong Answer:
- The core problem is that LLM calls are I/O-bound and highly variable in latency. I would build the API with three tiers of response patterns. For fast requests under 5 seconds, a standard async endpoint with streaming via Server-Sent Events so the user sees tokens as they arrive rather than staring at a spinner. For medium requests 5-30 seconds, I would still stream but add a timeout with a clear error message: if the upstream provider has not started streaming within 10 seconds, I return a 504 with a retry-after header. For long-running requests like batch processing or document analysis, I would use a job-based pattern: the POST endpoint returns a 202 Accepted with a job ID immediately, a background task processes the request, and the client polls a GET endpoint or receives a webhook callback.
- The critical design choice in FastAPI specifically is using AsyncOpenAI instead of the sync client. If you use the sync OpenAI() client inside an async endpoint, you block the entire event loop: one slow LLM call will freeze every other request. I have seen this bug in production at a startup where 40 concurrent users brought the API to its knees because someone used the sync client. The fix was a one-line change to AsyncOpenAI and throughput went from 5 to 200 concurrent requests.
- I would also add middleware for request ID tracking, structured logging of latency per provider, and a circuit breaker that stops sending requests to a provider after 3 consecutive timeouts. FastAPI’s dependency injection makes this clean: the circuit breaker state lives in a dependency that gets injected into every endpoint.
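The circuit breaker mentioned in the answer can be sketched in plain Python. The three-failure threshold comes from the answer; the 30-second cooldown, the class name, and the injectable clock are assumptions.

```python
import time


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures, then blocks calls
    until `reset_after` seconds have passed."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Once the cooldown has elapsed, let calls through again; the next
        # failure re-opens the circuit, a success closes it.
        return self._clock() - self._opened_at >= self.reset_after

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.max_failures:
            self._opened_at = self._clock()
```

One instance per provider could be created at startup and handed to endpoints through a dependency, so that all requests share the same failure count.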
RateLimiter dependency that checks Redis before the endpoint logic runs and raises an HTTPException(429) with a Retry-After header if the limit is exceeded. For per-provider rate limiting, I use a token bucket algorithm because LLM providers have both requests-per-minute and tokens-per-minute limits. The token bucket lives in a shared dependency and is checked before every outbound LLM call, not at the API endpoint level. The key nuance is that these are two different concerns: user rate limiting protects my service, while provider rate limiting protects my account from getting throttled. I have seen teams conflate these and end up either over-restricting users or burning through provider rate limits.
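A token bucket is small enough to sketch directly. This is an in-process illustration with assumed names and an injectable clock; a production version shared across workers would keep its state in something like Redis.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens and refills at `refill_per_second`."""

    def __init__(self, capacity: float, refill_per_second: float, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self._clock = clock
        self._tokens = capacity      # start full
        self._last = clock()

    def try_consume(self, amount: float = 1.0) -> bool:
        """Take `amount` tokens if available; return False without waiting otherwise."""
        now = self._clock()
        refill = (now - self._last) * self.refill_per_second
        self._tokens = min(self.capacity, self._tokens + refill)
        self._last = now
        if amount <= self._tokens:
            self._tokens -= amount
            return True
        return False
```

To model both provider limits, one bucket could count requests (consume 1 per call) and a second could count LLM tokens (consume the estimated token count per call).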
Explain FastAPI's dependency injection system. Why is it especially useful for AI applications?
Strong Answer:
- FastAPI’s dependency injection uses Python’s type hints and the Depends() function to automatically resolve, instantiate, and inject shared resources into endpoint handlers. Dependencies can be chained: a get_current_user dependency can itself depend on get_api_key, which depends on get_db. FastAPI resolves the full dependency graph per request and handles cleanup via generator dependencies (using yield).
- For AI applications specifically, DI solves three hard problems. First, expensive client initialization: you do not want to create a new AsyncOpenAI() client on every request. A dependency with @lru_cache or a lifespan event creates the client once and shares it. Second, context propagation: every LLM call needs the user’s API key, their rate limit tier, and a trace ID for observability. Instead of passing these through every function, I inject them as dependencies. Third, testability: I can swap the real LLM client with a mock in tests by overriding the dependency, which means I can test my API logic without making actual LLM calls that cost money and are non-deterministic.
- The pattern I use in production is an LLMService dependency that encapsulates the client, handles retries, tracks costs, and logs to the observability platform. Every endpoint gets this service injected rather than talking to OpenAI directly. When we added Anthropic as a fallback provider, we only changed the LLMService dependency; zero endpoint code changed.
Depends() with a function versus a class, and when would you choose each?
Function dependencies are great for simple resolution logic like extracting and validating an API key from a header. Class dependencies shine when you need stateful resources with lifecycle management, for example a database connection pool or an LLM client with built-in retry state. Depends(MyService) works because the class itself is callable: FastAPI reads the __init__ parameters and constructs an instance. If I want to inject a pre-configured instance instead, I give the class a __call__ method and pass the instance to Depends(). The class approach also makes testing cleaner: I can subclass for a mock implementation. In practice, my AI APIs use function dependencies for auth and validation, and class dependencies for services like the LLM client, vector store connection, and caching layer. The gotcha is that FastAPI creates a new instance of a class dependency per request by default. If you want a singleton (which you usually do for expensive clients), you need @lru_cache on the factory function or use the lifespan context manager.
What is the difference between FastAPI's BackgroundTasks and a proper task queue like Celery? When would you choose each for an AI workload?
Strong Answer:
- FastAPI’s BackgroundTasks runs tasks in the same process after the response is sent. It is perfect for lightweight fire-and-forget work: logging an analytics event, sending a notification, updating a cache entry. The task shares the same event loop and process memory, so it has zero serialization overhead and can access in-process state directly.
- Celery (or alternatives like Dramatiq, Huey, or even Redis Streams with a custom worker) runs tasks in separate worker processes with a message broker in between. This is what you need for AI workloads that are heavy: embedding a 500-page PDF, running a batch of 1000 LLM evaluations, fine-tuning a model, or generating 50 images.
- The decision criteria are: if the task takes under 5 seconds and you can tolerate losing it on a process crash, use BackgroundTasks. If the task takes longer, needs retry logic, must survive a server restart, or needs to scale horizontally across multiple workers, use a task queue. In my experience, most AI workloads fall into the second category because LLM calls are inherently slow and expensive enough that you want guaranteed delivery.
- A common mistake I see is teams starting with BackgroundTasks for document processing because it is simpler, then hitting problems when the FastAPI process runs out of memory or restarts mid-task and the user’s document is silently lost. The migration to Celery later is painful because the code was not designed for serializable task arguments.
UploadFile, validates the file type and size, stores the raw file to S3 or equivalent object storage, and enqueues a processing job with the file reference and metadata. The endpoint returns a 202 with a job ID immediately. The processing pipeline runs in Celery workers: step one chunks the document using a text splitter, step two generates embeddings via the embedding API (batched for efficiency), step three upserts the vectors into the vector database with metadata. Each step is a separate Celery task chained together so a failure at step two retries just the embedding step, not the chunking. The API exposes a GET /documents/{id}/status endpoint that reads the job state from Redis or the database. I also add a webhook callback option so the client does not need to poll. The key design insight is keeping the FastAPI process thin: it only handles HTTP and enqueuing. All heavy compute lives in the workers.
How do you test a FastAPI application that depends on external LLM APIs?
Strong Answer:
- I use a three-layer testing strategy. Layer one is unit tests with fully mocked LLM responses. I override the LLM service dependency using FastAPI’s app.dependency_overrides so every endpoint gets a mock service that returns deterministic responses. This covers request validation, error handling, auth logic, and response formatting: everything except the actual LLM interaction. These tests run in CI in under 10 seconds.
- Layer two is integration tests with recorded responses. I use a library like vcrpy or a custom fixture that records real LLM API responses once and replays them in subsequent test runs. This catches serialization bugs, response parsing edge cases, and content-type mismatches that mocks would miss. I re-record these fixtures monthly or when I change models.
- Layer three is contract tests against the live API, run on a schedule (nightly) rather than on every commit. These verify that the LLM providers have not changed their response format, that our API keys are valid, and that the end-to-end flow works. These tests have a cost budget: I cap them at $2 per run.
- The biggest testing mistake specific to AI APIs is making assertions too tight. If you assert response.json()["answer"] == "Machine learning is...", the test will break every time the LLM gives a slightly different phrasing. Instead, I assert on structure (valid JSON, required fields present, status code), on constraints (response under token limit, no banned words), and use LLM-as-Judge for quality assertions only in the nightly suite.
temperature=0 for all test calls to minimize variance, but even at temperature zero, LLM responses are not perfectly deterministic across API versions. So my assertions focus on invariants rather than exact content. I check structural invariants (is it valid JSON, does it have the required keys), content invariants (does it mention the required topics, is it within length bounds), and safety invariants (does it avoid banned phrases). For quality, I use a lightweight LLM judge in the integration test suite that scores the response on a 1-5 scale and asserts the score is above 3. The judge prompt is versioned alongside the test code. The key principle is: test the contract, not the content.
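The "test the contract, not the content" idea can be sketched as a helper that checks structural, length, and safety invariants. The banned phrase, the `answer` key, and the length bound are assumptions for illustration.

```python
import json

# Hypothetical safety list; extend to fit your product.
BANNED_PHRASES = ("as an ai language model",)


def check_invariants(raw: str, required_keys: set[str], max_chars: int = 2000) -> list[str]:
    """Return the list of violated invariants; an empty list means the response passes."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(payload, dict):
        return ["top-level JSON is not an object"]

    problems = []
    missing = required_keys - set(payload)
    if missing:
        problems.append(f"missing keys: {sorted(missing)}")
    answer = str(payload.get("answer", ""))
    if len(answer) > max_chars:
        problems.append("answer too long")
    if any(phrase in answer.lower() for phrase in BANNED_PHRASES):
        problems.append("banned phrase present")
    return problems
```

A test would then assert `check_invariants(response.text, {"answer"}) == []` instead of comparing the answer to an exact string.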