December 2025 Update: Covers LangChain 0.3+ with LCEL (LangChain Expression Language), async support, and production patterns.

Introduction

In recent years, language models have become more capable, letting us tackle complex tasks and extract information from large documents. However, every model has a limited context window, which makes it hard to work with large volumes of information in a single call. LLM chains address this by linking multiple LLM calls together, so that large inputs can be processed in stages. Chains combine different language model components to process information and generate responses in a unified way. In this article, we will discuss the main components and conventions in LangChain.
[Figure: RAG System Architecture with LangChain]

What is LangChain?

LangChain provides AI developers with tools to connect language models with external data sources.
[Figure: LangChain Components Diagram]
LLMs are large deep-learning models pre-trained on large amounts of data that can generate responses to user queries, for example by answering questions or producing text from prompts. LangChain is a framework for building applications powered by LLMs. It provides:
  • Chains: Composable sequences of LLM calls
  • Prompts: Template management and optimization
  • Memory: Conversation and context management
  • Tools: Integration with external APIs and functions
  • Agents: Autonomous decision-making workflows
Why LangChain? While you can build AI apps with raw APIs, LangChain provides abstractions that make production systems easier to build, test, and maintain.

Installation

pip install langchain langchain-openai langchain-core langchain-community

Core Concepts

1. LCEL (LangChain Expression Language)

LCEL is LangChain’s declarative way to compose chains:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# Define components
llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_template("Translate to {language}: {text}")
output_parser = StrOutputParser()

# Compose chain
chain = prompt | llm | output_parser

# Invoke
result = chain.invoke({"language": "French", "text": "Hello, world!"})
print(result)  # e.g. "Bonjour, le monde!" (exact wording varies by model)
Key Benefits:
  • Declarative syntax with pipe operator (|)
  • Automatic async support
  • Built-in streaming
  • Easy debugging and observability

2. Prompts

Prompt Templates:
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

# System + User prompt
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant. Answer in {style}."),
    ("user", "{question}")
])

# With conversation history
conversation_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant."),
    MessagesPlaceholder(variable_name="history"),
    ("user", "{input}")
])
Few-Shot Examples:
from langchain_core.prompts import FewShotChatMessagePromptTemplate

examples = [
    {"input": "happy", "output": "sad"},
    {"input": "tall", "output": "short"},
]

example_prompt = ChatPromptTemplate.from_messages([
    ("human", "{input}"),
    ("ai", "{output}"),
])

few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)

final_prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful assistant that gives antonyms."),
    few_shot_prompt,
    ("human", "{input}"),
])
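Prompt Management:
from langchain_core.prompts import PromptTemplate, load_prompt

# Save a string prompt to file. ChatPromptTemplate.save() is not implemented,
# so file-based serialization is shown here with a plain PromptTemplate.
translation_prompt = PromptTemplate.from_template("Translate to {language}: {text}")
translation_prompt.save("prompts/translation.yaml")

# Load from file
loaded_prompt = load_prompt("prompts/translation.yaml")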

3. Chains

Simple Chain:
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o")
prompt = ChatPromptTemplate.from_template("What is {topic}?")

chain = prompt | llm

response = chain.invoke({"topic": "quantum computing"})
Sequential Chain:
from langchain_core.output_parsers import StrOutputParser

# Chain 1: Generate question
question_chain = (
    ChatPromptTemplate.from_template("Generate a question about {topic}")
    | llm
    | StrOutputParser()
)

# Chain 2: Answer question
answer_chain = (
    ChatPromptTemplate.from_template("Answer this question: {question}")
    | llm
    | StrOutputParser()
)

# Combine
def qa_chain(topic: str):
    question = question_chain.invoke({"topic": topic})
    answer = answer_chain.invoke({"question": question})
    return {"question": question, "answer": answer}
RunnableParallel (Parallel Execution):
from langchain_core.runnables import RunnableParallel

parallel_chain = RunnableParallel({
    "summary": ChatPromptTemplate.from_template("Summarize: {text}") | llm,
    "sentiment": ChatPromptTemplate.from_template("Sentiment of: {text}") | llm,
    "keywords": ChatPromptTemplate.from_template("Extract keywords from: {text}") | llm,
})

results = parallel_chain.invoke({"text": "LangChain is a framework for LLM applications."})
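Conceptually, a parallel step hands the same input to every named branch and collects the outputs into a dict keyed by branch name. The following is a minimal standard-library sketch of that fan-out pattern, not LangChain's implementation; `fan_out` and the three branch functions are hypothetical stand-ins for LLM-backed chains.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict


def fan_out(branches: Dict[str, Callable[[Any], Any]], value: Any) -> Dict[str, Any]:
    """Run every branch on the same input concurrently; return results by branch name."""
    with ThreadPoolExecutor(max_workers=len(branches)) as pool:
        futures = {name: pool.submit(fn, value) for name, fn in branches.items()}
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for the summary / sentiment / keywords branches
branches = {
    "length": lambda d: len(d["text"]),
    "upper": lambda d: d["text"].upper(),
    "first_word": lambda d: d["text"].split()[0],
}

fan_out_results = fan_out(branches, {"text": "LangChain is a framework"})
print(fan_out_results)
```

Because the branches are independent, total latency is roughly that of the slowest branch rather than the sum of all three.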
Conditional Chains:
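from langchain_core.runnables import RunnableLambda

# Two specialized chains to route between (prompts are illustrative)
technical_chain = ChatPromptTemplate.from_template("Answer as a senior engineer: {question}") | llm
general_chain = ChatPromptTemplate.from_template("Answer briefly: {question}") | llm

def route_chain(input_data: dict):
    # When a RunnableLambda returns a runnable, that runnable is invoked with the same input
    if input_data["type"] == "technical":
        return technical_chain
    return general_chain

routed_chain = RunnableLambda(route_chain)
result = routed_chain.invoke({"type": "technical", "question": "What is LCEL?"})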

4. Memory

Conversation Buffer:
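# Note: ConversationChain and the langchain.memory classes are legacy APIs that are
# deprecated in LangChain 0.3; new code is steered toward RunnableWithMessageHistory.
from langchain.memory import ConversationBufferMemory
from langchain.chains import ConversationChain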

memory = ConversationBufferMemory()
chain = ConversationChain(
    llm=llm,
    memory=memory,
    verbose=True
)

chain.predict(input="Hi, I'm Alice")
chain.predict(input="What's my name?")  # Remembers "Alice"
Conversation Summary Memory:
from langchain.memory import ConversationSummaryMemory

summary_memory = ConversationSummaryMemory(llm=llm)
chain = ConversationChain(llm=llm, memory=summary_memory)

# Long conversation gets summarized automatically
Conversation Buffer Window:
from langchain.memory import ConversationBufferWindowMemory

window_memory = ConversationBufferWindowMemory(k=5)  # Last 5 exchanges
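To make the windowing behaviour concrete, here is a small standard-library sketch of a last-k buffer. It is an illustration of the idea only; the `WindowMemory` class is hypothetical and is not how LangChain implements its window memory.

```python
from collections import deque
from typing import Deque, List, Tuple


class WindowMemory:
    """Keep only the last k (user, assistant) exchanges."""

    def __init__(self, k: int) -> None:
        self.exchanges: Deque[Tuple[str, str]] = deque(maxlen=k)

    def save(self, user: str, assistant: str) -> None:
        # deque(maxlen=k) silently drops the oldest exchange once it is full
        self.exchanges.append((user, assistant))

    def history(self) -> List[str]:
        lines = []
        for user, assistant in self.exchanges:
            lines.append(f"Human: {user}")
            lines.append(f"AI: {assistant}")
        return lines


window = WindowMemory(k=5)
window.save("Hi, I'm Alice", "Hello Alice!")
for turn in range(2, 7):  # turns 2 through 6
    window.save(f"question {turn}", f"answer {turn}")

# The first exchange has been evicted, so the name is no longer in the history
print(any("Alice" in line for line in window.history()))
```

The cost is bounded at k exchanges, but anything said before the window is lost unless it is stored elsewhere.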
Vector Store Memory:
from langchain.memory import VectorStoreRetrieverMemory
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings()
vectorstore = FAISS.from_texts([""], embeddings)

vector_memory = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)
Custom Memory:
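from langchain_core.memory import BaseMemory
from pydantic import Field
from typing import Any, Dict, List

class CustomMemory(BaseMemory):
    # BaseMemory is a Pydantic model, so state is declared as a field
    # instead of being assigned in a custom __init__
    memories: List[Dict[str, Any]] = Field(default_factory=list)

    @property
    def memory_variables(self) -> List[str]:
        # Keys this memory adds to the chain inputs
        return ["history"]

    def save_context(self, inputs: Dict[str, Any], outputs: Dict[str, Any]) -> None:
        self.memories.append({"inputs": inputs, "outputs": outputs})

    def load_memory_variables(self, inputs: Dict[str, Any]) -> Dict[str, Any]:
        return {"history": self.memories}

    def clear(self) -> None:
        self.memories = []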

5. Tools

Creating Tools:
from langchain_core.tools import tool

@tool
def search_web(query: str) -> str:
    """Search the web for information."""
    # Implementation here
    return f"Results for {query}"

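@tool
def calculator(expression: str) -> str:
    """Evaluate a mathematical expression."""
    # WARNING: eval() executes arbitrary Python. Never run it on untrusted or
    # model-generated input in production; restrict evaluation to a whitelist
    # of arithmetic operations instead.
    try:
        result = eval(expression)
        return str(result)
    except Exception:
        return "Invalid expression"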

tools = [search_web, calculator]
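Since tool arguments are written by the model, a calculator tool should not pass them to `eval()`. One possible alternative, sketched here with the standard library only, parses the expression with `ast` and evaluates a whitelist of arithmetic operators. The names `safe_calculate` and `_eval_node` are hypothetical, and the operator set is an assumption that can be widened or narrowed as needed.

```python
import ast
import operator

# Whitelist of permitted operators; anything else is rejected
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def _eval_node(node: ast.AST):
    """Recursively evaluate a parsed expression made of numbers and whitelisted operators."""
    if isinstance(node, ast.Constant):
        if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
            return node.value
        raise ValueError("Only numeric literals are allowed")
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
        return _BINARY_OPS[type(node.op)](_eval_node(node.left), _eval_node(node.right))
    if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
        return _UNARY_OPS[type(node.op)](_eval_node(node.operand))
    raise ValueError("Unsupported expression")


def safe_calculate(expression: str) -> str:
    """Evaluate basic arithmetic without executing arbitrary code."""
    try:
        tree = ast.parse(expression, mode="eval")
        return str(_eval_node(tree.body))
    except (SyntaxError, ValueError, ZeroDivisionError):
        return "Invalid expression"


print(safe_calculate("15 * 23"))
print(safe_calculate("__import__('os').system('ls')"))
```

Function calls, attribute access, and names never reach an evaluator, so the second call is rejected instead of executed. Exponentiation is deliberately left out of the whitelist because very large powers can exhaust CPU and memory.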
Using Tools with LLM:
from langchain_core.messages import HumanMessage

llm_with_tools = llm.bind_tools(tools)

response = llm_with_tools.invoke([
    HumanMessage(content="What's 15 * 23? Then search for Python tutorials.")
])

# Check for tool calls
if response.tool_calls:
    for tool_call in response.tool_calls:
        print(f"Tool: {tool_call['name']}")
        print(f"Args: {tool_call['args']}")
Structured Tools:
from langchain_core.tools import StructuredTool
from pydantic import BaseModel, Field

class SearchInput(BaseModel):
    query: str = Field(description="Search query")
    max_results: int = Field(default=5, description="Maximum results")

def search_function(query: str, max_results: int = 5) -> str:
    # Implementation
    return f"Found {max_results} results for {query}"

structured_tool = StructuredTool.from_function(
    func=search_function,
    args_schema=SearchInput,
    name="web_search",
    description="Search the web for information"
)

6. RAG with LangChain

Complete RAG Chain:
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.runnables import RunnablePassthrough

# Setup vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_texts(
    texts=["LangChain is a framework for building LLM applications..."],
    embedding=embeddings
)

retriever = vectorstore.as_retriever()

# RAG chain
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_prompt = ChatPromptTemplate.from_template("""
Answer the question based on the context:

Context: {context}

Question: {question}
""")

rag_chain = (
    {
        "context": retriever | format_docs,
        "question": RunnablePassthrough()
    }
    | rag_prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("What is LangChain?")
RAG with Sources:
from langchain_core.runnables import RunnableLambda

def format_docs_with_sources(docs):
    formatted = []
    for i, doc in enumerate(docs):
        formatted.append(f"[Source {i+1}]: {doc.page_content}")
    return "\n\n".join(formatted)

rag_with_sources = (
    {
        "context": retriever | format_docs_with_sources,
        "question": RunnablePassthrough()
    }
    | rag_prompt
    | llm
    | StrOutputParser()
)

7. Streaming

Streaming Responses:
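from langchain_core.callbacks import StreamingStdOutCallbackHandler

llm = ChatOpenAI(
    model="gpt-4o",
    streaming=True,
    callbacks=[StreamingStdOutCallbackHandler()]
)

# Rebuild the chain so that it uses the streaming model defined above
chain = ChatPromptTemplate.from_template("What is {topic}?") | llm
chain.invoke({"topic": "AI"})  # tokens are printed to stdout as they arrive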
Custom Streaming Handler:
from langchain_core.callbacks import BaseCallbackHandler

class CustomStreamHandler(BaseCallbackHandler):
    def __init__(self, on_token):
        self.on_token = on_token
        self.tokens = []
    
    def on_llm_new_token(self, token: str, **kwargs):
        self.tokens.append(token)
        self.on_token(token)
    
    def get_full_response(self) -> str:
        return "".join(self.tokens)

# Usage
tokens_received = []
handler = CustomStreamHandler(lambda t: tokens_received.append(t))

llm = ChatOpenAI(model="gpt-4o", streaming=True, callbacks=[handler])
response = llm.invoke([HumanMessage(content="Explain streaming")])
Async Streaming:
import asyncio

async def stream_chain():
    # With chain = prompt | llm, each chunk is a message chunk; read its text from .content
    async for chunk in chain.astream({"topic": "AI"}):
        print(chunk.content, end="", flush=True)

asyncio.run(stream_chain())

8. Observability with LangSmith

import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "your-api-key"
os.environ["LANGCHAIN_PROJECT"] = "my-project"

# All chains automatically traced
chain.invoke({"input": "test"})
# View at smith.langchain.com
Custom Tracing:
from langchain_core.tracers import LangChainTracer

# Attach a tracer to a single call instead of enabling tracing globally
tracer = LangChainTracer(project_name="my-project")
chain.invoke({"input": "test"}, config={"callbacks": [tracer]})

Production Patterns

Error Handling

from langchain_core.runnables import RunnableLambda

def safe_invoke(chain, input_data):
    try:
        return chain.invoke(input_data)
    except Exception as e:
        return {"error": str(e), "input": input_data}

safe_chain = RunnableLambda(lambda x: safe_invoke(chain, x))
Retry Logic:
# Every runnable exposes with_retry(), which wraps it in retry logic
retry_chain = chain.with_retry(
    stop_after_attempt=3,
    retry_if_exception_type=(Exception,),
    wait_exponential_jitter=True,
)
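For readers who want to see what a retry policy does under the hood, the following is a minimal standard-library sketch of retrying with exponential backoff and jitter. It illustrates the general technique rather than LangChain's internals; `call_with_retry`, `flaky`, and the delay values are hypothetical.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given exceptions with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # The delay doubles each attempt; jitter spreads out simultaneous retries
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("max_attempts must be at least 1")


# Usage with a hypothetical flaky call that succeeds on the third try
calls = {"count": 0}


def flaky() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("transient failure")
    return "ok"


retry_result = call_with_retry(flaky, sleep=lambda _: None)  # skip real sleeping in the demo
print(retry_result, calls["count"])
```

In practice it is usually better to retry only on transient errors (timeouts, rate limits) than on every `Exception`, since retrying a malformed request only wastes calls.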

Caching

from langchain_core.caches import InMemoryCache
from langchain_core.globals import set_llm_cache

set_llm_cache(InMemoryCache())

# First call - hits API
chain.invoke({"input": "test"})

# Second call - uses cache
chain.invoke({"input": "test"})
Redis Cache:
from langchain_community.cache import RedisCache
import redis

redis_client = redis.Redis()
set_llm_cache(RedisCache(redis_client))
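An exact-match cache only helps when the prompt and the model settings are identical between calls. The sketch below shows that keying idea with the standard library; `ExactMatchCache` and `fake_llm` are hypothetical names used for illustration and do not mirror LangChain's cache classes.

```python
import hashlib
import json
from typing import Callable, Dict


class ExactMatchCache:
    """Cache completions keyed on the prompt and the model parameters."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(prompt: str, params: dict) -> str:
        # sort_keys makes the key independent of dict ordering
        payload = json.dumps({"prompt": prompt, "params": params}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, prompt: str, params: dict, call: Callable[[str], str]) -> str:
        key = self._key(prompt, params)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        self._store[key] = call(prompt)
        return self._store[key]


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for a paid API call
    return f"response to: {prompt}"


cache = ExactMatchCache()
params = {"model": "gpt-4o", "temperature": 0}
cache.get_or_call("test", params, fake_llm)  # miss: calls the model
cache.get_or_call("test", params, fake_llm)  # hit: served from the cache
cache.get_or_call("test", {**params, "temperature": 1}, fake_llm)  # miss: different params
print(cache.hits, cache.misses)
```

Even a one-character difference in the prompt produces a different key, which is why exact-match caching pays off mainly for repeated, templated queries.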

Batch Processing

# Process multiple inputs in parallel
inputs = [{"topic": "AI"}, {"topic": "ML"}, {"topic": "NLP"}]
results = chain.batch(inputs)
Async Batch:
import asyncio

async def batch_process():
    inputs = [{"topic": "AI"}, {"topic": "ML"}]
    results = await chain.abatch(inputs)
    return results

asyncio.run(batch_process())
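When batching against a rate-limited API, it helps to cap how many requests are in flight at once. The following standard-library sketch shows one common way to do that with a semaphore; `bounded_batch` and `fake_chain` are hypothetical, and the limit of 2 is an arbitrary assumption.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def bounded_batch(
    worker: Callable[[T], Awaitable[R]],
    inputs: List[T],
    max_concurrency: int = 2,
) -> List[R]:
    """Run worker over inputs concurrently, with at most max_concurrency in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(item: T) -> R:
        async with semaphore:
            return await worker(item)

    # gather returns results in input order, regardless of completion order
    return await asyncio.gather(*(run_one(item) for item in inputs))


async def fake_chain(payload: dict) -> str:
    # Hypothetical stand-in for an async chain call
    await asyncio.sleep(0.01)
    return f"about {payload['topic']}"


batch_results = asyncio.run(
    bounded_batch(fake_chain, [{"topic": "AI"}, {"topic": "ML"}, {"topic": "NLP"}])
)
print(batch_results)
```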

Output Parsing

Structured Output:
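from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

class Answer(BaseModel):
    answer: str = Field(description="The answer")
    confidence: float = Field(description="Confidence score")

parser = PydanticOutputParser(pydantic_object=Answer)

# Inject the parser's format instructions so the model knows which JSON schema to return
prompt = ChatPromptTemplate.from_template(
    "Answer the question.\n{format_instructions}\n\nQuestion: {question}"
).partial(format_instructions=parser.get_format_instructions())

chain = prompt | llm | parser

result = chain.invoke({"question": "What is AI?"})
print(result.answer)
print(result.confidence)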
JSON Output:
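from langchain_core.output_parsers import JsonOutputParser

json_parser = JsonOutputParser()

# Ask for JSON explicitly; the parser only decodes what the model returns
json_prompt = ChatPromptTemplate.from_template(
    "Return a JSON object with the key information from: {input}"
)

chain = json_prompt | llm | json_parser
result = chain.invoke({"input": "Extract key info"})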

When to Use LangChain

Use LangChain when:
  • Building complex multi-step workflows
  • Need prompt management and versioning
  • Require memory/conversation management
  • Integrating multiple tools and APIs
  • Want built-in observability (LangSmith)
  • Building production systems that need maintainability
  • Need to compose different LLM providers
Consider raw APIs when:
  • Simple one-off LLM calls
  • Maximum performance is critical
  • Want minimal dependencies
  • Building lightweight prototypes
  • Need fine-grained control over every API call

Limitations

  • Additional abstraction layer adds overhead
  • Learning curve for LCEL syntax
  • Dependency on LangChain ecosystem
  • Can be overkill for simple use cases
  • Version changes can break code

Performance Tips

  • Use async methods (ainvoke, astream) for better concurrency
  • Enable caching for repeated queries
  • Batch process when possible
  • Use streaming for better UX
  • Monitor with LangSmith to identify bottlenecks
  • Cache embeddings and prompts when possible

Key Takeaways

LCEL is Powerful

Use the pipe operator (|) to compose chains declaratively.

Prompts as Templates

Manage prompts separately from code for easier iteration.

Memory Built-in

LangChain provides multiple memory types for conversations.

Production Ready

Built-in observability, caching, and error handling.

What’s Next

LangGraph

Learn how to build complex agent workflows with state machines

Interview Deep-Dive

Strong Answer:
  • The way I think about this is: LangChain is a framework tax, and like all taxes, whether it is worth paying depends on what you get in return. For a simple chatbot that calls one LLM with a fixed prompt, LangChain adds complexity with zero benefit — you are wrapping a 5-line API call in 50 lines of abstraction. For that, I use the SDK directly.
  • LangChain earns its keep when your application has three or more of these requirements: chaining multiple LLM calls with data flowing between them, plugging in retrieval from vector stores, managing conversation memory across turns, integrating external tools, supporting multiple LLM providers with a common interface, or needing observability via LangSmith. If you are building any of those from scratch with raw SDKs, you end up reinventing half of LangChain anyway, but worse.
  • The concrete benefits: LCEL’s pipe operator gives you composable chains with free async support and streaming. The memory abstractions (buffer, summary, window, vector store) are battle-tested and save weeks of implementation. The integration ecosystem means swapping from Chroma to Pinecone is a one-line change instead of a rewrite. LangSmith tracing gives you production observability that would take months to build.
  • The concrete costs: version churn — LangChain has historically had breaking changes between minor versions. The abstraction can obscure what is happening under the hood, making debugging harder for junior engineers. Performance overhead is real but usually small (milliseconds) compared to the LLM call latency (seconds). And the dependency tree is large.
  • My rule of thumb: prototype with raw SDKs to understand the fundamentals. Build production systems with LangChain when you need the ecosystem. Never use LangChain to avoid understanding what the underlying APIs do.
Red Flags: Candidate either unconditionally advocates for LangChain on everything or dismisses it entirely without acknowledging the valid use cases. Another red flag is not mentioning the version stability concern or the debugging overhead.
Follow-up: Explain LCEL’s pipe operator. How does the data flow through a chain like prompt | llm | parser?
LCEL implements the __or__ operator on Runnable objects so the pipe (|) composes them into a RunnableSequence. When you call .invoke() on the chain, data flows left to right: the prompt template receives a dictionary of variables and formats them into a list of chat messages, passes that to the LLM, which returns an AIMessage, and the parser extracts the string content. Each Runnable implements .invoke(), .ainvoke(), .stream(), .astream(), and .batch(). The magic is that the entire chain automatically supports all of these: if I call .astream() on the chain, each component’s async streaming method is used, and tokens flow through the pipeline as they are generated. The gotcha is type compatibility: each Runnable declares its input and output types, and if the output of one step does not match the input of the next, you get a runtime error that can be confusing. RunnablePassthrough and RunnableLambda are escape hatches for transforming data between incompatible steps.
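The `__or__` mechanism can be demonstrated in a few lines of plain Python. This is a toy sketch of the idea, not LangChain's code: the `Step` class is hypothetical, and it composes with nested closures where the real library builds a sequence object.

```python
from typing import Any, Callable


class Step:
    """A minimal runnable: wraps a function and supports `|` composition."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self.fn = fn

    def invoke(self, value: Any) -> Any:
        return self.fn(value)

    def __or__(self, other: "Step") -> "Step":
        # a | b builds a new step that feeds a's output into b
        return Step(lambda value: other.invoke(self.invoke(value)))


# Hypothetical stand-ins for the prompt, the model, and the parser
prompt = Step(lambda variables: f"Translate to {variables['language']}: {variables['text']}")
fake_llm = Step(lambda text: {"content": f"<model reply to '{text}'>"})
parser = Step(lambda message: message["content"])

chain = prompt | fake_llm | parser
pipe_output = chain.invoke({"language": "French", "text": "Hello"})
print(pipe_output)
```

The type-compatibility gotcha is visible here too: if `parser` were placed before `fake_llm`, it would receive a string instead of a dict and fail at call time, not at composition time.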
Strong Answer:
  • I would diagnose this systematically by isolating each component of the RAG chain, because “mediocre answers” can mean three very different things: bad retrieval, bad generation, or bad prompt.
  • Step one: check retrieval quality independently. I pull the retriever out of the chain and run my test queries directly. For each query, I examine the top-k retrieved documents. Are they relevant? If I am getting irrelevant chunks, the problem is in embedding, chunking, or the vector store config — not the LLM. Common retrieval fixes: try a different embedding model (text-embedding-3-large vs text-embedding-3-small), adjust chunk size (512 tokens is often too small for complex topics, try 1024 with 200-token overlap), add metadata filtering so the retriever only searches relevant document categories, or use hybrid search (combining vector similarity with BM25 keyword matching).
  • Step two: if retrieval is good but answers are still poor, the problem is in the generation prompt or the LLM’s ability to synthesize. I enable LangSmith tracing and inspect the exact prompt that reaches the LLM. Common issues: the retrieved context is dumped in a blob without structure (fix: format each chunk with source labels), the prompt does not tell the model to cite sources or stay faithful to the context (fix: add explicit grounding instructions), or the context window is overloaded with too many chunks and the model gets confused (fix: reduce k from 10 to 3-5 most relevant chunks).
  • Step three: if retrieval and generation are both reasonable in isolation but the chain output is mediocre, the problem is often in how chunks are being formatted or how the prompt template combines context and question. I have seen cases where format_docs joined chunks without separators, causing the model to merge information from different documents incorrectly.
  • My debugging toolkit: LangSmith traces (shows exact inputs and outputs at every chain step), retrieval evaluation scores (NDCG on a labeled test set), and the RAG triad metrics (context relevance, faithfulness, answer relevance) evaluated with an LLM judge.
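As a concrete reference for the retrieval score mentioned above, here is a small sketch of NDCG@k using the common formulation DCG = Σ rel_i / log2(i + 1). One simplifying assumption is labeled in the code: the ideal ordering is computed from the retrieved list only, whereas a full evaluation would use every relevant document in the labeled set.

```python
import math
from typing import List


def dcg(relevances: List[float]) -> float:
    # The result at 1-based position i is discounted by log2(i + 1)
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg_at_k(relevances: List[float], k: int) -> float:
    """relevances: graded relevance of the retrieved docs, in retrieved order."""
    # Assumption: the ideal ranking is the same docs sorted by relevance
    ideal = sorted(relevances, reverse=True)
    ideal_dcg = dcg(ideal[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# A retriever that puts an irrelevant chunk first is penalized
print(round(ndcg_at_k([0, 1, 1], k=3), 4))
# A retriever that ranks the relevant chunks first scores 1.0
print(ndcg_at_k([1, 1, 0], k=3))
```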
Red Flags: Candidate immediately blames the LLM and suggests upgrading to a bigger model without checking retrieval quality, does not know how to isolate chain components for debugging, or does not mention LangSmith or any observability tool.
Follow-up: How do you choose between ConversationBufferMemory, ConversationSummaryMemory, and ConversationBufferWindowMemory?
Each memory type trades off between completeness, cost, and context window usage. BufferMemory stores the entire conversation verbatim: perfect for short conversations (under 10 turns), but it will blow the context window for longer sessions, and costs increase linearly with conversation length. WindowMemory keeps only the last K exchanges, which is cheap and bounded but causes the model to “forget” earlier context: a user who provided their name in turn 1 is forgotten once turn 6 has been stored with k=5. SummaryMemory uses an LLM to compress the conversation history into a running summary, which is the best tradeoff for long conversations: bounded size, retains key information, but adds an LLM call per turn for summarization (extra latency and cost). My default for production is a hybrid: WindowMemory with k=5 for recent turns (exact recall) plus SummaryMemory for anything older (compressed context). This gives the model exact memory of the last few turns while preserving a summary of the full conversation. The implementation is ConversationSummaryBufferMemory with a max_token_limit that triggers summarization when the buffer exceeds a threshold.
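The hybrid approach can be sketched without any framework: keep recent turns verbatim and fold evicted turns into a running summary once a size budget is exceeded. Everything below is a hypothetical illustration; word count stands in for token count, and `naive_summarize` is a placeholder where a real system would call an LLM.

```python
from typing import Callable, List


class SummaryBufferMemory:
    """Keep recent turns verbatim; fold older turns into a running summary."""

    def __init__(self, max_words: int, summarize: Callable[[str, List[str]], str]) -> None:
        self.max_words = max_words  # word count as a crude proxy for tokens
        self.summarize = summarize
        self.summary = ""
        self.buffer: List[str] = []

    def _buffer_words(self) -> int:
        return sum(len(turn.split()) for turn in self.buffer)

    def save(self, turn: str) -> None:
        self.buffer.append(turn)
        evicted: List[str] = []
        # Evict the oldest turns until the verbatim buffer fits the budget
        while self._buffer_words() > self.max_words and len(self.buffer) > 1:
            evicted.append(self.buffer.pop(0))
        if evicted:
            self.summary = self.summarize(self.summary, evicted)

    def context(self) -> str:
        parts = [f"Summary: {self.summary}"] if self.summary else []
        return "\n".join(parts + self.buffer)


def naive_summarize(previous: str, evicted: List[str]) -> str:
    # Placeholder: truncates and concatenates instead of calling an LLM
    return " ".join(filter(None, [previous] + [turn[:40] for turn in evicted]))


hybrid = SummaryBufferMemory(max_words=8, summarize=naive_summarize)
hybrid.save("Human: Hi, I'm Alice")
hybrid.save("AI: Hello Alice!")
hybrid.save("Human: What is LangChain used for?")
print(hybrid.context())
```

The summarizer only runs when turns are actually evicted, so short conversations incur no extra model calls.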
Strong Answer:
  • This is the conditional chain routing pattern, and it is one of the most valuable patterns for production LLM apps. The idea is that a single one-size-fits-all chain performs worse than specialized chains matched to the query intent.
  • The architecture has three components: a classifier, a router, and the specialized chains. The classifier determines the intent of the incoming query — it can be a lightweight LLM call (GPT-4o-mini with a classification prompt), an embedding-based classifier (compare query embedding against intent cluster centroids), or even a rule-based keyword matcher for simple cases. I prefer the embedding-based approach because it is fast (one embedding call vs an LLM call), cheap, and handles paraphrases well.
  • In LCEL, I implement this with RunnableLambda for the routing logic. The router function takes the classified intent and returns the appropriate chain. For example, technical questions go to a chain with a technical system prompt and access to documentation retrieval. Creative writing requests go to a chain with a higher temperature and a creative prompt. Customer support queries go to a chain with product knowledge RAG and a conservative tone.
  • Each specialized chain can have different: LLM models (expensive model for complex analysis, cheap model for simple Q and A), temperatures (0 for factual, 0.8 for creative), retrieval sources (different vector stores per domain), system prompts (tailored instructions per intent), and tool sets (only customer support gets the ticket creation tool).
  • The fallback pattern is critical: if the classifier confidence is below a threshold, I route to a general-purpose chain rather than risk sending a misclassified query to the wrong specialist. I also log every routing decision and the classifier confidence to monitor for drift and miscategorization over time.
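The classifier, router, and fallback described above can be sketched framework-free as follows. All names here (`make_router`, `keyword_classifier`, the intent labels) are hypothetical, the confidence values are illustrative, and the 0.7 threshold is an assumption that would need tuning on real traffic.

```python
from typing import Callable, Dict, List, Tuple

Chain = Callable[[str], str]

# Every routing decision is recorded so misrouting can be monitored over time
routing_log: List[dict] = []


def make_router(
    classify: Callable[[str], Tuple[str, float]],
    chains: Dict[str, Chain],
    fallback: Chain,
    min_confidence: float = 0.7,
) -> Chain:
    """Route to a specialist chain, or to the fallback when confidence is low."""

    def route(query: str) -> str:
        intent, confidence = classify(query)
        chain = chains.get(intent)
        if chain is None or confidence < min_confidence:
            chosen, chain = "general", fallback
        else:
            chosen = intent
        routing_log.append({"query": query, "intent": intent,
                            "confidence": confidence, "routed_to": chosen})
        return chain(query)

    return route


def keyword_classifier(query: str) -> Tuple[str, float]:
    # Rule-based stand-in for an LLM or embedding classifier
    lowered = query.lower()
    if "error" in lowered or "api" in lowered:
        return "technical", 0.9
    if "refund" in lowered:
        return "support", 0.8
    return "technical", 0.3  # weak guess, below the threshold


router = make_router(
    keyword_classifier,
    chains={"technical": lambda q: f"[technical] {q}", "support": lambda q: f"[support] {q}"},
    fallback=lambda q: f"[general] {q}",
)
print(router("Why does the API return an error?"))
print(router("Tell me a story"))
```

Keeping the chain table and threshold as arguments, instead of hard-coding them inside the router, makes the routing policy configurable and testable in isolation.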
Red Flags: Candidate builds a single monolithic chain for all query types, does not consider the classification step, or hard-codes routing logic instead of making it configurable.
Follow-up: How do you test and evaluate a multi-chain routing system end-to-end?
I test at three levels. First, I test the classifier in isolation with a labeled dataset of 200+ queries tagged by intent. I measure precision and recall per intent category and set a minimum F1 threshold of 0.85 per category. Second, I test each specialized chain independently against its own eval dataset, using the evaluation patterns we discussed earlier. Third, I test the full routing pipeline end-to-end with a mixed dataset that includes all intent types, adversarial queries that could be ambiguous between categories, and out-of-distribution queries that do not fit any category. The end-to-end test verifies that routing decisions are correct AND that the downstream chain produces good results for correctly routed queries. The metric I care about most is “misrouted and failed”: queries that went to the wrong chain AND got a bad answer. Misrouted queries that still got good answers (because the chains are robust enough) are acceptable. I track this in production using LangSmith tags and set up alerts when the misrouting rate exceeds 5%.
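The per-intent classifier check can be computed directly from (true, predicted) label pairs. The sketch below uses a tiny made-up dataset purely for illustration; `per_intent_f1` is a hypothetical helper, and 0.85 is the threshold quoted in the answer.

```python
from collections import Counter
from typing import Dict, List, Tuple


def per_intent_f1(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """pairs: (true_intent, predicted_intent) for each labeled query."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for true, predicted in pairs:
        if true == predicted:
            tp[true] += 1
        else:
            fp[predicted] += 1  # predicted this intent, but it was wrong
            fn[true] += 1       # missed the true intent
    scores = {}
    for intent in set(tp) | set(fp) | set(fn):
        predicted_total = tp[intent] + fp[intent]
        actual_total = tp[intent] + fn[intent]
        precision = tp[intent] / predicted_total if predicted_total else 0.0
        recall = tp[intent] / actual_total if actual_total else 0.0
        scores[intent] = (2 * precision * recall / (precision + recall)
                          if precision + recall else 0.0)
    return scores


# Illustrative labels only, not real evaluation data
labeled = [
    ("technical", "technical"), ("technical", "technical"), ("technical", "support"),
    ("support", "support"), ("support", "support"), ("creative", "creative"),
]
f1_scores = per_intent_f1(labeled)
below_threshold = sorted(intent for intent, score in f1_scores.items() if score < 0.85)
print(f1_scores)
print(below_threshold)
```

A single misrouted query hurts two categories at once: recall for the true intent and precision for the predicted one, which is why both appear in `below_threshold` here.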