Document processing is fundamental to RAG applications. This chapter covers extracting text from various formats, implementing chunking strategies, and building robust document pipelines.

Here is the cold truth about RAG: your retrieval quality is bounded by your document processing quality. If your PDF extractor garbles a table into nonsense, or your chunker splits a key paragraph across two chunks, no amount of prompt engineering will save you. Garbage in, garbage out, except that with RAG the garbage gets cited with a confidence score, which is worse than no answer at all.
PDFs are the most common document format in enterprise RAG systems — and the hardest to extract text from cleanly. A PDF is fundamentally a page-layout format, not a text format. It stores instructions like “draw character ‘H’ at position (72, 400)” rather than “here is a paragraph.” This means extraction quality varies wildly depending on how the PDF was created: a Word-exported PDF extracts cleanly, a scanned document returns nothing without OCR, and a two-column research paper might interleave the columns.
Chunking determines how documents are split for embedding and retrieval. This is arguably the most consequential design decision in a RAG pipeline. Chunk too small, and each piece lacks enough context to be useful. Chunk too large, and you dilute relevant information with noise, and your embedding becomes a blurry average of too many topics. The Goldilocks zone depends on your documents and queries, but 500-1000 characters is a good starting point for most use cases.
Fixed-size chunking is the “good enough” approach that works for 80% of use cases. You set a character or token limit, slide through the document, and cut. The overlap between chunks is like shingles on a roof: each piece overlaps the next so that sentences at the boundary aren’t orphaned.

Simple but effective for uniform documents:
    from dataclasses import dataclass


    @dataclass
    class Chunk:
        """Represents a text chunk."""
        text: str
        metadata: dict
        chunk_index: int


    class FixedSizeChunker:
        """Split text into fixed-size chunks with overlap."""

        def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
            if not 0 <= chunk_overlap < chunk_size:
                raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
            self.chunk_size = chunk_size
            self.chunk_overlap = chunk_overlap

        def chunk(self, text: str, metadata: dict | None = None) -> list[Chunk]:
            """Split text into overlapping chunks."""
            metadata = metadata or {}
            if len(text) <= self.chunk_size:
                return [Chunk(text=text, metadata=metadata, chunk_index=0)]

            chunks = []
            start = 0
            while start < len(text):
                end = min(start + self.chunk_size, len(text))

                # Prefer a natural break point, but never at or before `start`
                if end < len(text):
                    end = self._find_break_point(text, end, floor=start + 1)

                chunk_text = text[start:end].strip()
                if chunk_text:
                    chunks.append(Chunk(
                        text=chunk_text,
                        metadata={**metadata, "chunk_start": start, "chunk_end": end},
                        chunk_index=len(chunks),
                    ))

                # The text is exhausted: stop here so we don't emit a tail
                # chunk that contains nothing but overlap
                if end >= len(text):
                    break

                # Step back by the overlap, but always make forward progress
                next_start = end - self.chunk_overlap
                start = next_start if next_start > start else end

            return chunks

        def _find_break_point(self, text: str, position: int, floor: int = 0) -> int:
            """Find a natural break point just before position (never before floor)."""
            window_start = max(floor, position - 100)

            # Look for a paragraph break
            para_break = text.rfind("\n\n", window_start, position)
            if para_break != -1:
                return para_break + 2

            # Look for the latest sentence break of any kind
            sent_break = max(
                text.rfind(punct, window_start, position)
                for punct in (". ", "! ", "? ")
            )
            if sent_break != -1:
                return sent_break + 2

            # Look for a word break
            space = text.rfind(" ", max(floor, position - 50), position)
            if space != -1:
                return space + 1

            return position


    # Usage
    chunker = FixedSizeChunker(chunk_size=500, chunk_overlap=100)
    text = "Your long document text here..."
    chunks = chunker.chunk(text, {"source": "document.pdf"})
Semantic chunking is the “smart” approach: instead of cutting at fixed intervals, it detects where the topic changes and cuts there. Imagine reading a textbook: you would naturally break it at section transitions, not every 500 words. The trade-off is cost: you need to embed every sentence to measure similarity between adjacent sentences, which means an API call before you even start building your index. For large document collections, this cost can be significant. Use semantic chunking when retrieval quality is critical and the extra embedding cost is justified.

Split based on semantic similarity for better coherence:
    import re

    import numpy as np
    from openai import OpenAI


    class SemanticChunker:
        """Split text based on semantic similarity.

        Uses the Chunk dataclass defined alongside FixedSizeChunker.
        """

        def __init__(
            self,
            client: OpenAI,
            similarity_threshold: float = 0.8,
            min_chunk_size: int = 100,
            max_chunk_size: int = 2000,
        ):
            self.client = client
            self.similarity_threshold = similarity_threshold
            self.min_chunk_size = min_chunk_size
            self.max_chunk_size = max_chunk_size

        def chunk(self, text: str, metadata: dict | None = None) -> list[Chunk]:
            """Split text at semantic boundaries."""
            metadata = metadata or {}

            # Split into sentences
            sentences = self._split_sentences(text)
            if len(sentences) <= 1:
                return [Chunk(text=text, metadata=metadata, chunk_index=0)]

            # Get embeddings for sentences
            embeddings = self._embed_sentences(sentences)

            # Find semantic breaks (index of the last sentence in each chunk)
            break_points = self._find_semantic_breaks(embeddings, sentences)

            # Create chunks from break points
            chunks = []
            start_idx = 0
            for end_idx in break_points:
                chunk_text = " ".join(sentences[start_idx:end_idx + 1])
                start_idx = end_idx + 1

                if chunks and len(chunk_text) < self.min_chunk_size:
                    # Too short to stand alone: merge into the previous chunk
                    chunks[-1].text += " " + chunk_text
                else:
                    # A short first piece is kept as-is so no text is dropped
                    chunks.append(Chunk(
                        text=chunk_text,
                        metadata=metadata,
                        chunk_index=len(chunks),
                    ))

            return chunks

        def _split_sentences(self, text: str) -> list[str]:
            """Split text into sentences."""
            # Simple sentence splitting on terminal punctuation
            sentences = re.split(r"(?<=[.!?])\s+", text)
            return [s.strip() for s in sentences if s.strip()]

        def _embed_sentences(self, sentences: list[str]) -> np.ndarray:
            """Get embeddings for sentences."""
            embeddings = []
            batch_size = 2048  # max inputs per request
            for i in range(0, len(sentences), batch_size):
                response = self.client.embeddings.create(
                    model="text-embedding-3-small",
                    input=sentences[i:i + batch_size],
                )
                embeddings.extend(e.embedding for e in response.data)
            return np.array(embeddings)

        def _find_semantic_breaks(
            self, embeddings: np.ndarray, sentences: list[str]
        ) -> list[int]:
            """Find indices where semantic breaks occur."""
            breaks = []
            current_chunk_size = 0  # characters in the chunk being built

            for i in range(len(embeddings) - 1):
                current_chunk_size += len(sentences[i]) + 1

                # Calculate similarity with next sentence
                similarity = self._cosine_similarity(embeddings[i], embeddings[i + 1])
                would_overflow = (
                    current_chunk_size + len(sentences[i + 1]) > self.max_chunk_size
                )

                # Break if low similarity or max size reached
                if similarity < self.similarity_threshold or would_overflow:
                    breaks.append(i)
                    current_chunk_size = 0

            # Add final break
            breaks.append(len(embeddings) - 1)
            return breaks

        def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
            """Calculate cosine similarity between vectors."""
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
Recursive chunking is the best general-purpose strategy for structured documents. The idea is to try the most natural split first (double newlines, i.e. paragraph breaks), and only fall back to less natural splits (sentences, words, characters) when paragraphs are too large. This preserves document structure: a paragraph that fits within the size limit stays intact, while a 3000-character paragraph gets split at sentence boundaries rather than arbitrarily.

Hierarchical chunking for structured documents:
    class RecursiveChunker:
        """Recursively split text using multiple separators.

        Uses the Chunk dataclass defined alongside FixedSizeChunker.
        """

        def __init__(
            self,
            chunk_size: int = 1000,
            chunk_overlap: int = 200,
            separators: list[str] | None = None,
        ):
            # A non-positive step in _force_split would raise or loop forever
            if not 0 <= chunk_overlap < chunk_size:
                raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
            self.chunk_size = chunk_size
            self.chunk_overlap = chunk_overlap
            self.separators = separators or [
                "\n\n\n",  # Multiple newlines
                "\n\n",    # Paragraphs
                "\n",      # Lines
                ". ",      # Sentences
                " ",       # Words
                "",        # Characters
            ]

        def chunk(self, text: str, metadata: dict | None = None) -> list[Chunk]:
            """Recursively split text."""
            metadata = metadata or {}
            chunks = self._split_recursive(text, self.separators)
            return [
                Chunk(text=chunk, metadata=metadata, chunk_index=i)
                for i, chunk in enumerate(chunks)
            ]

        def _split_recursive(self, text: str, separators: list[str]) -> list[str]:
            """Recursively split using separators."""
            if not text:
                return []
            if len(text) <= self.chunk_size:
                return [text]
            if not separators:
                # No more separators, force split
                return self._force_split(text)

            separator = separators[0]
            remaining_separators = separators[1:]

            if separator == "":
                # Character-level split
                return self._force_split(text)

            splits = text.split(separator)
            chunks = []
            current_chunk = ""

            for split in splits:
                test_chunk = (
                    current_chunk + separator + split if current_chunk else split
                )
                if len(test_chunk) <= self.chunk_size:
                    current_chunk = test_chunk
                else:
                    if current_chunk:
                        chunks.append(current_chunk)
                    if len(split) > self.chunk_size:
                        # Recursively split with next separator
                        sub_chunks = self._split_recursive(split, remaining_separators)
                        chunks.extend(sub_chunks)
                        current_chunk = ""
                    else:
                        current_chunk = split

            if current_chunk:
                chunks.append(current_chunk)
            return chunks

        def _force_split(self, text: str) -> list[str]:
            """Force split text into overlapping fixed-size chunks."""
            chunks = []
            step = self.chunk_size - self.chunk_overlap
            for i in range(0, len(text), step):
                chunk = text[i:i + self.chunk_size]
                if chunk:
                    chunks.append(chunk)
            return chunks
The pipeline below ties everything together: load a file (any format), chunk it intelligently, embed the chunks, and package the result for storage in a vector database. The _hash_content method is a detail worth noting: it generates a fingerprint of each document so you can detect when a user re-uploads the same file and skip reprocessing. At scale, this deduplication saves significant embedding costs.

Combine extraction, chunking, and embedding:
    from dataclasses import dataclass
    from pathlib import Path
    import hashlib
    import json

    from openai import OpenAI

    # Chunk, RecursiveChunker, and UniversalDocumentLoader come from the
    # chunking and loader code elsewhere in this chapter.


    @dataclass
    class ProcessedDocument:
        """A fully processed document with chunks and embeddings."""
        source: str
        chunks: list[Chunk]
        embeddings: list[list[float]]
        metadata: dict


    class DocumentPipeline:
        """Complete document processing pipeline."""

        def __init__(
            self,
            openai_client: OpenAI,
            chunk_size: int = 1000,
            chunk_overlap: int = 200,
            embedding_model: str = "text-embedding-3-small",
        ):
            self.client = openai_client
            self.loader = UniversalDocumentLoader()
            self.chunker = RecursiveChunker(
                chunk_size=chunk_size,
                chunk_overlap=chunk_overlap,
            )
            self.embedding_model = embedding_model

        def process(self, path: str | Path) -> ProcessedDocument:
            """Process a single document."""
            path = Path(path)

            # Load document
            document = self.loader.load(path)

            # Chunk document
            chunks = self.chunker.chunk(document.content, metadata=document.metadata)

            # Generate embeddings
            embeddings = self._embed_chunks(chunks)

            return ProcessedDocument(
                source=str(path),
                chunks=chunks,
                embeddings=embeddings,
                metadata={
                    **document.metadata,
                    "document_hash": self._hash_content(document.content),
                    "chunk_count": len(chunks),
                },
            )

        def process_batch(self, paths: list[str | Path]) -> list[ProcessedDocument]:
            """Process multiple documents."""
            return [self.process(path) for path in paths]

        def _embed_chunks(self, chunks: list[Chunk]) -> list[list[float]]:
            """Generate embeddings for chunks."""
            if not chunks:
                return []

            texts = [chunk.text for chunk in chunks]

            # Batch embeddings (max 2048 per request)
            all_embeddings = []
            batch_size = 2048
            for i in range(0, len(texts), batch_size):
                batch = texts[i:i + batch_size]
                response = self.client.embeddings.create(
                    model=self.embedding_model,
                    input=batch,
                )
                batch_embeddings = [e.embedding for e in response.data]
                all_embeddings.extend(batch_embeddings)

            return all_embeddings

        def _hash_content(self, content: str) -> str:
            """Generate hash of content for deduplication."""
            return hashlib.sha256(content.encode()).hexdigest()[:16]

        def save(self, processed: ProcessedDocument, output_path: str | Path):
            """Save processed document to JSON."""
            output_path = Path(output_path)
            data = {
                "source": processed.source,
                "metadata": processed.metadata,
                "chunks": [
                    {
                        "text": chunk.text,
                        "metadata": chunk.metadata,
                        "chunk_index": chunk.chunk_index,
                        "embedding": processed.embeddings[i],
                    }
                    for i, chunk in enumerate(processed.chunks)
                ],
            }
            with open(output_path, "w") as f:
                json.dump(data, f)

        def load(self, input_path: str | Path) -> ProcessedDocument:
            """Load processed document from JSON."""
            input_path = Path(input_path)
            with open(input_path) as f:
                data = json.load(f)

            chunks = [
                Chunk(
                    text=c["text"],
                    metadata=c["metadata"],
                    chunk_index=c["chunk_index"],
                )
                for c in data["chunks"]
            ]
            embeddings = [c["embedding"] for c in data["chunks"]]

            return ProcessedDocument(
                source=data["source"],
                chunks=chunks,
                embeddings=embeddings,
                metadata=data["metadata"],
            )


    # Usage
    client = OpenAI()
    pipeline = DocumentPipeline(client)

    # Process single document
    processed = pipeline.process("research_paper.pdf")
    print(f"Created {len(processed.chunks)} chunks")

    # Save for later use
    pipeline.save(processed, "processed_paper.json")

    # Load processed document
    loaded = pipeline.load("processed_paper.json")
| Strategy | Best for | Retrieval quality | Embedding cost | Complexity |
|---|---|---|---|---|
| Semantic | Documents where topic boundaries don’t align with formatting | Excellent: cuts at topic transitions | High (embed every sentence) | High |
| Markdown/HTML header-based | Documentation, wikis, knowledge bases | Excellent for structured content | None | Medium |
| Token-based (tiktoken) | When you need precise token budget control | Good: guarantees token limits | None | Low |
Decision framework:
Is your document well-structured with headers, paragraphs, and sections (markdown, HTML, docstrings)? Use recursive chunking — it preserves structure at near-zero cost.
Is your document unstructured prose with no clear formatting (OCR output, email threads, transcripts)? Start with fixed-size + sentence boundary detection. Upgrade to semantic chunking only if retrieval quality is measurably poor.
Are you processing code files? Use language-aware chunking that splits at function/class boundaries, not character counts. A function split across two chunks is useless for code Q&A.
Do you have a hard token budget per chunk (e.g., for context window management)? Use token-based chunking with tiktoken to guarantee exact limits.
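For the code-file case in the framework above, one possible sketch for Python sources uses the standard library's `ast` module to cut at top-level function and class boundaries. The helper name `chunk_python_source` and the choice to gather imports and constants into a single "module" chunk are assumptions made for illustration, and the approach only works on files that parse.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class.

    Hypothetical helper: module-level statements that are not definitions
    (imports, constants) are collected into a single "<module>" chunk.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    module_parts = []

    for node in tree.body:
        # Decorators sit above the `def`/`class` line, so start from the
        # earliest of the node's own line and its decorators' lines
        decorators = getattr(node, "decorator_list", [])
        first = min([node.lineno] + [d.lineno for d in decorators])
        text = "\n".join(lines[first - 1:node.end_lineno])

        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({"name": node.name, "kind": type(node).__name__, "text": text})
        else:
            module_parts.append(text)

    if module_parts:
        chunks.insert(
            0, {"name": "<module>", "kind": "Module", "text": "\n".join(module_parts)}
        )
    return chunks
```

Very large classes would still need a second pass (for example, one chunk per method), which this sketch leaves out.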
PDFs with mixed content. A financial report has text, tables, charts, and footnotes. The text extractor pulls everything into a flat string, losing the structure. Tables become garbled rows of numbers. Mitigation: extract tables separately with pdfplumber (as shown above), convert them to markdown, and tag them with metadata ("content_type": "table"). This lets your retrieval system surface the table when a user asks “what were Q3 revenue numbers?”

Documents with headers and footers repeated on every page. A 50-page PDF with “Company Confidential - Page X of 50” on every page creates 50 occurrences of that text in your chunks. These pollute embeddings and waste tokens. Add a deduplication step: extract text per page, identify repeated strings across pages, and strip them before chunking.

Unicode and encoding issues. Documents from different sources use different encodings. A Windows-generated CSV may use cp1252, while a Linux export uses UTF-8. The read_text(encoding="utf-8") call in the text loader will throw an error or silently mangle characters. Add encoding detection (using chardet or charset_normalizer) as a fallback when utf-8 fails.

Empty or near-empty chunks. After splitting and cleaning, some chunks may contain only whitespace, a page number, or a header. These create useless embeddings that waste storage and pollute search results. Add a minimum content threshold (e.g., 50 characters of non-whitespace) and discard chunks that fall below it.

Very large single documents. A 500-page PDF generates thousands of chunks. Embedding all of them in one batch can time out or exceed API rate limits. The pipeline’s batching (2048 per request) handles this, but you also need progress tracking and error recovery: if embedding fails on batch 15 of 20, you should not re-embed batches 1-14.
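The header and footer deduplication step described above can be sketched as follows. This is a minimal illustration, not a complete cleaner: the function name `strip_repeated_lines` and the `min_fraction` threshold are assumptions, and digits are masked so that “Page 3 of 50” and “Page 4 of 50” count as the same line.

```python
import re
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Remove lines that recur on most pages (likely headers or footers).

    `min_fraction` is an assumed tuning knob: a line is treated as
    boilerplate when it appears on at least that share of pages.
    """
    def normalize(line: str) -> str:
        # Mask digits so varying page numbers collapse to one pattern
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for page in pages:
        # Count each distinct normalized line once per page
        counts.update({normalize(line) for line in page.splitlines() if line.strip()})

    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {line for line, n in counts.items() if n >= threshold}

    return [
        "\n".join(line for line in page.splitlines() if normalize(line) not in repeated)
        for page in pages
    ]
```

Because digits are masked, purely numeric lines that recur across pages (for example, rows of a table) could be removed by mistake, so it is worth reviewing what gets stripped on a sample of documents before relying on it.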
Chunking Best Practices
Start at 500-1000 characters — This maps to roughly 125-250 tokens. Smaller chunks give more precise retrieval but may lack context. Larger chunks provide more context but dilute the embedding signal.
Use 10-20% overlap — A 1000-char chunk with 200-char overlap means every boundary sentence appears in two chunks, so retrieval won’t miss content split across a boundary.
Match chunk size to query length — Short queries (“What is the refund policy?”) match better against shorter chunks. Long, complex queries match better against larger chunks. If your queries vary, consider indexing at multiple granularities.
Preserve metadata — Always store which document, page, and character offset each chunk came from. You will need this for citations, debugging, and re-chunking experiments.
Test with real queries — The only way to know if your chunk size is right is to run your actual user queries against the index and check if the right chunks surface in the top 5.
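The last practice can be made concrete with a small hit-rate check. The sketch below assumes you already have, for each test query, the ranked chunk IDs your retriever returned and the one chunk ID you consider correct; the function name `hit_rate_at_k` and the dictionary shapes are illustrative choices, not part of any library.

```python
def hit_rate_at_k(
    results: dict[str, list[str]],
    expected: dict[str, str],
    k: int = 5,
) -> float:
    """Fraction of queries whose expected chunk ID appears in the top-k results.

    results:  query -> ranked list of retrieved chunk IDs (best first)
    expected: query -> the chunk ID that should be retrieved
    """
    if not expected:
        return 0.0
    hits = sum(
        1
        for query, chunk_id in expected.items()
        if chunk_id in results.get(query, [])[:k]
    )
    return hits / len(expected)
```

Running this after every change to chunk size or overlap gives a single number to compare across experiments.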
You are building a RAG pipeline and need to choose a chunking strategy for a corpus of mixed documents: technical manuals, legal contracts, and marketing blog posts. Walk me through how you decide on chunk size, overlap, and splitting strategy.
Strong Answer:
There is no single right chunk size because these three document types have fundamentally different structures. Technical manuals have clear section headers and numbered steps. Legal contracts have nested clause structures with cross-references. Blog posts are freeform prose. A one-size-fits-all chunker produces mediocre results for all three.
For technical manuals, use recursive chunking that respects section headers. Split first on ## (H2 headings), then on ### (H3), then on paragraph breaks. Each section becomes a chunk with its header preserved as metadata. This keeps instructions together with their context rather than splitting a step list across two chunks.
For legal contracts, chunk on section boundaries (Section 1.1, Section 1.2). Legal language depends heavily on defined terms and cross-references, so include the definitions section as an “always-injected” context chunk that gets appended to every retrieval result. Chunk overlap is critical here: a sentence like “Subject to the limitations in Section 4.3” at a chunk boundary needs to appear in both chunks.
For blog posts, fixed-size chunking at 500-800 characters with 100-character overlap works well. Blog prose is semantically fluid without hard structural boundaries, so recursive or semantic chunking adds complexity without meaningful quality improvement.
The universal rule: start at 500-1000 characters, test with 50 real queries from your users, measure whether the correct chunk appears in the top-5 retrieval results. If retrieval recall is below 80%, your chunks are either too large (diluted embeddings) or too small (missing context). Adjust and re-test.
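The header-aware splitting suggested for technical manuals could look roughly like this. It is a sketch that assumes markdown input with ATX-style headings (`## Title`); the function name `split_by_headers` is hypothetical, and headings that appear inside fenced code blocks are not handled.

```python
import re


def split_by_headers(markdown: str, max_level: int = 3) -> list[dict]:
    """Split markdown into sections at headings up to `max_level`.

    Each section carries its heading trail (e.g. ["Guide", "Setup"]) so the
    header context can be stored as chunk metadata.
    """
    heading = re.compile(rf"^(#{{1,{max_level}}})\s+(.*)$")
    sections = []
    trail = []  # current heading path as (level, title) pairs
    body = []   # lines collected under the current heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"headers": [title for _, title in trail], "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = heading.match(line)
        if match:
            flush()  # close the section that belongs to the previous heading
            level = len(match.group(1))
            # Drop headings at the same or deeper level, then push this one
            while trail and trail[-1][0] >= level:
                trail.pop()
            trail.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return sections
```

Sections that still exceed the size limit can then be passed to a paragraph-level splitter, keeping the `headers` list attached to every resulting chunk.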
Follow-up: Your retrieval recall is 85% with 500-character chunks but only 70% with 1000-character chunks. However, answer quality is higher with the larger chunks because they provide more context. How do you resolve this trade-off?

Use a two-pass approach: index at the smaller granularity (500 characters) for retrieval, then expand at query time. When retrieval returns a 500-character chunk, look up its neighbors (the chunks immediately before and after it in the original document) and concatenate them into a larger context window. You get the retrieval precision of small chunks with the generation quality of large context. This is sometimes called “parent document retrieval” or “context window expansion.” The alternative is multi-granularity indexing: embed the same document at both 500 and 1000 characters, run retrieval against both indexes, and merge results. This doubles your embedding storage but gives the retrieval engine both fine-grained and coarse options.
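A minimal sketch of the neighbor-expansion step, assuming each document's chunks are stored in their original order and can be looked up by position. The function name `expand_with_neighbors` and the `window` parameter are illustrative.

```python
def expand_with_neighbors(chunks: list[str], hit_index: int, window: int = 1) -> str:
    """Return the retrieved chunk joined with `window` neighbors on each side.

    `chunks` is assumed to hold one document's chunks in original order.
    """
    if not 0 <= hit_index < len(chunks):
        raise IndexError(f"hit_index {hit_index} out of range for {len(chunks)} chunks")
    lo = max(0, hit_index - window)
    hi = min(len(chunks), hit_index + window + 1)
    return " ".join(chunks[lo:hi])
```

If the chunks were created with overlap, the joined text repeats the overlapping characters; slicing the original document by the stored character offsets avoids that duplication.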
Your PDF extraction pipeline works perfectly on Word-exported PDFs but produces garbled text on scanned documents and interleaved columns on research papers. How do you build a robust extraction pipeline that handles all three?
Strong Answer:
The fundamental issue is that PDFs are a page-layout format, not a text format. A PDF stores “draw character H at position (72, 400)” not “here is a paragraph.” Extraction quality varies wildly based on how the PDF was created. You need to detect the PDF type and route to the appropriate extraction strategy.
For digitally-created PDFs (Word exports, LaTeX output): PyMuPDF (fitz) extracts text cleanly. This is the fast path — call page.get_text("text") and clean whitespace. 80% of your documents likely fall here.
For scanned documents (images of printed text): you need OCR. Tesseract via pytesseract or a cloud OCR service (AWS Textract, Google Document AI) converts page images to text. The detection heuristic is simple: if PyMuPDF extracts fewer than 50 characters per page for a document that clearly has content, it is scanned. Google Document AI is significantly more accurate than Tesseract for complex layouts but costs money.
For two-column research papers: standard extraction interleaves the columns (“Left column line 1, Right column line 1, Left column line 2…”). PyMuPDF’s page.get_text("blocks") returns text blocks with bounding box coordinates. Sort blocks by x-coordinate first (left column vs right column), then by y-coordinate within each column. This reconstructs the reading order. Alternatively, use a layout-aware tool like Layout Parser or Unstructured.io that handles column detection automatically.
The production pipeline: detect PDF type (digital vs scanned), route to the appropriate extractor, apply domain-specific cleaning (remove headers/footers, fix hyphenated line breaks, handle ligatures), and validate output quality (flag pages with suspiciously low character counts for human review).
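The column-ordering idea for two-column papers can be sketched without any PDF library, operating on block tuples of the shape PyMuPDF returns from `page.get_text("blocks")`, that is `(x0, y0, x1, y1, text, block_no, block_type)`. Assigning columns by the page midline is an assumption that fails on full-width titles, figures, and three-column layouts; the function name is hypothetical.

```python
def order_two_column_blocks(blocks: list[tuple], page_width: float) -> list[str]:
    """Order text blocks left column first, then right, top to bottom in each.

    Each block is a tuple beginning with (x0, y0, x1, y1, text, ...).
    Blocks whose left edge is left of the page midline are treated as the
    left column (an assumed heuristic, not a general layout detector).
    """
    midline = page_width / 2

    def sort_key(block: tuple) -> tuple:
        x0, y0 = block[0], block[1]
        column = 0 if x0 < midline else 1
        return (column, y0, x0)

    ordered = sorted(blocks, key=sort_key)
    return [block[4].strip() for block in ordered if block[4].strip()]
```

In practice you would pass `page.rect.width` as `page_width` and may want to drop image blocks before sorting.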
Follow-up: You are processing 10,000 PDFs per day. Some PDFs are corrupted, some are password-protected, and some are 500+ pages. How do you make this pipeline reliable at scale?

Defensive engineering at every step. First, wrap every extraction call in a try/except with a per-document timeout (30 seconds for extraction, 60 seconds for OCR). Corrupted PDFs throw exceptions; timeouts catch infinite loops. Second, check for password protection before extraction and skip or flag those documents. Third, for 500+ page documents, process pages in parallel batches of 50 rather than sequentially. Fourth, implement a dead letter queue: documents that fail extraction go into a retry queue with exponential backoff. After 3 failures, they move to a manual review queue. Fifth, use content hashing to detect re-uploads: if the SHA-256 hash matches an already-processed document, return the existing chunks instead of reprocessing. At 10,000 PDFs per day, even a 1% failure rate is 100 documents per day that need attention, so your monitoring must surface these clearly.
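The retry, manual-review, and content-hash steps can be illustrated in a single-process sketch. It simplifies the design above: retries happen inline rather than through a real queue, and timeouts are not covered. The name `process_with_retries`, the `extract` callback (bytes in, text out, raising on failure), and the injectable `sleep` are assumptions for illustration.

```python
import hashlib
import time


def process_with_retries(documents, extract, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Extract each (name, data) document with retries and hash-based dedup.

    Returns (texts_by_name, manual_review), where manual_review lists the
    names of documents that failed every attempt.
    """
    by_hash = {}        # SHA-256 of content -> extracted text
    texts_by_name = {}
    manual_review = []

    for name, data in documents:
        digest = hashlib.sha256(data).hexdigest()
        if digest in by_hash:
            # Re-upload of already-processed content: reuse the earlier result
            texts_by_name[name] = by_hash[digest]
            continue

        for attempt in range(max_attempts):
            try:
                by_hash[digest] = extract(data)
                texts_by_name[name] = by_hash[digest]
                break
            except Exception:
                if attempt == max_attempts - 1:
                    manual_review.append(name)  # give up after the last attempt
                else:
                    sleep(base_delay * 2 ** attempt)  # exponential backoff

    return texts_by_name, manual_review
```

A production version would persist the hash table and the review list, and would run `extract` in a worker process so a per-document timeout can terminate it.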
Semantic chunking embeds every sentence to find topic boundaries. For a corpus of 100,000 documents averaging 50 sentences each, that is 5 million embedding calls just for chunking. Is semantic chunking worth the cost, and when would you use it over recursive chunking?
Strong Answer:
At $0.02 per 1M tokens with text-embedding-3-small, embedding 5M sentences (average 20 tokens each) costs about $2. The cost is not the issue; the time is. At 100 sentences per batch and 200ms per API call, 5M sentences take about 2.8 hours. For a one-time indexing job, that is acceptable. For real-time document uploads, it is not.
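The arithmetic behind those figures, written out so the inputs can be swapped for your own corpus. The price, tokens per sentence, batch size, and latency are the values assumed in the answer above, not measurements, and the time estimate assumes calls are made one after another.

```python
documents = 100_000
sentences_per_doc = 50
tokens_per_sentence = 20          # average assumed in the answer
price_per_million_tokens = 0.02   # USD, figure assumed in the answer
sentences_per_batch = 100
seconds_per_call = 0.2

sentences = documents * sentences_per_doc
total_tokens = sentences * tokens_per_sentence
cost_usd = total_tokens / 1_000_000 * price_per_million_tokens

# One request per batch, issued sequentially
api_calls = sentences // sentences_per_batch
hours = api_calls * seconds_per_call / 3600

print(f"{sentences:,} sentences, {total_tokens:,} tokens")
print(f"cost: ${cost_usd:.2f}; {api_calls:,} sequential calls: {hours:.1f} hours")
```

Issuing requests concurrently would shorten the wall-clock time, subject to the API's rate limits.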
Semantic chunking is worth it when: (1) your documents have long, flowing text without clear structural markers (think transcripts, meeting notes, freeform reports), (2) retrieval quality is critical and you have measured that recursive chunking produces chunks that split key information across boundaries, and (3) you are indexing once and querying many times, so the upfront cost amortizes over millions of queries.
Recursive chunking wins when: (1) documents have clear structure (headers, sections, numbered lists), (2) you need real-time processing (user uploads a document and expects it indexed in seconds), or (3) you are iterating on chunk parameters frequently and cannot afford to re-embed the entire corpus each time.
The pragmatic middle ground: use recursive chunking as the default (it handles 80% of documents well), and apply semantic chunking only to document types that recursive chunking handles poorly. Measure retrieval quality per document type — if technical manuals retrieve at 90% recall with recursive chunking but meeting transcripts only hit 65%, apply semantic chunking to transcripts only.
A lesser-known alternative: use a sliding window of 3-5 sentences, compute the average embedding similarity between adjacent windows, and split where similarity drops below a threshold. This is cheaper than per-sentence embedding because you embed windows instead of individual sentences, reducing API calls by 3-5x.
Follow-up: You implement semantic chunking and it produces great chunks for most documents, but for some documents it creates one giant chunk of 5,000 characters because the topic never changes enough to trigger a split. How do you handle this?

Add a hard maximum chunk size that overrides the semantic signal. If a chunk exceeds 2,000 characters and no semantic boundary has been detected, force-split at the nearest sentence boundary. This is the max_chunk_size parameter in the SemanticChunker implementation. The important detail is that the forced split should still try to find the weakest semantic boundary within the oversized chunk: the point where adjacent sentence similarity is lowest, even if it is above the normal threshold. This gives you “the best available split” rather than an arbitrary character-count cut. Also consider that uniformly high similarity might mean the document is about a single narrow topic, in which case large chunks are actually appropriate: the embedding will be coherent and retrieval will work well.
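The “best available split” idea can be sketched as a recursive cut at the lowest adjacent-sentence similarity. This is one possible reading of the approach, not the SemanticChunker's own logic: the function name `split_oversized` is hypothetical, and it assumes the per-pair similarities have already been computed.

```python
def split_oversized(
    sentences: list[str],
    similarities: list[float],
    max_chars: int = 2000,
) -> list[list[str]]:
    """Recursively split a run of sentences at its weakest semantic boundary.

    similarities[i] is the similarity between sentences[i] and
    sentences[i + 1], so it has one fewer entry than `sentences`.
    """
    # Length of the run once joined with single spaces
    total = sum(len(s) for s in sentences) + max(0, len(sentences) - 1)
    if total <= max_chars or len(sentences) < 2:
        return [sentences]

    # Weakest link: the adjacent pair with the lowest similarity
    cut = min(range(len(similarities)), key=similarities.__getitem__)

    left = split_oversized(sentences[:cut + 1], similarities[:cut], max_chars)
    right = split_oversized(sentences[cut + 1:], similarities[cut + 1:], max_chars)
    return left + right
```

A single sentence longer than `max_chars` is returned unsplit; handling that case would need a character-level fallback.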