PDF Text Extraction
PDFs are the most common document format in enterprise RAG systems, and the hardest to extract text from cleanly. A PDF is fundamentally a page-layout format, not a text format. It stores instructions like “draw character ‘H’ at position (72, 400)” rather than “here is a paragraph.” This means extraction quality varies wildly depending on how the PDF was created: a Word-exported PDF extracts cleanly, a scanned document returns nothing without OCR, and a two-column research paper might interleave the columns.
Basic PDF Extraction with PyMuPDF
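A minimal sketch of page-by-page extraction, assuming PyMuPDF is installed and imported as fitz. The helper names clean_text and extract_pdf_pages are illustrative and not part of the library. Pages without a text layer (scans) come back empty, so this covers only digitally created PDFs.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize whitespace left behind by layout-based extraction."""
    text = raw.replace("\u00a0", " ")        # non-breaking spaces
    text = re.sub(r"[ \t]+", " ", text)       # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)      # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)    # keep at most one blank line
    return text.strip()


def extract_pdf_pages(path: str) -> list[dict]:
    """Return one record per page: 1-based page number and cleaned text."""
    import fitz  # PyMuPDF; imported lazily so clean_text works without it

    pages = []
    with fitz.open(path) as doc:
        for index, page in enumerate(doc):
            raw = page.get_text("text")
            pages.append({"page": index + 1, "text": clean_text(raw)})
    return pages
```

Keeping the page number on every record is what later makes citations and per-page cleanup possible.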
PyMuPDF (imported as fitz) provides fast and accurate PDF text extraction.
Table Extraction with pdfplumber
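A hedged sketch of table extraction, assuming pdfplumber is installed. The functions table_to_markdown and extract_tables are hypothetical helpers, and the first row of each table is assumed to be its header, which will not hold for every document.

```python
def table_to_markdown(rows: list[list]) -> str:
    """Render a table (list of rows; cells may be None) as a markdown table."""
    if not rows:
        return ""
    width = max(len(row) for row in rows)

    def fmt(row):
        cells = [("" if c is None else str(c)).replace("\n", " ").strip() for c in row]
        cells += [""] * (width - len(cells))  # pad ragged rows
        return "| " + " | ".join(cells) + " |"

    header, *body = rows  # assumption: the first row is the header
    lines = [fmt(header), "|" + "---|" * width]
    lines += [fmt(row) for row in body]
    return "\n".join(lines)


def extract_tables(path: str) -> list[dict]:
    """Collect every table in the PDF, tagged with its page number."""
    import pdfplumber  # imported lazily so table_to_markdown works without it

    tables = []
    with pdfplumber.open(path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            for rows in page.extract_tables():
                tables.append({
                    "page": page_number,
                    "content_type": "table",
                    "markdown": table_to_markdown(rows),
                })
    return tables
```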
For documents with tables, pdfplumber excels.
Text Chunking Strategies
Chunking determines how documents are split for embedding and retrieval. This is arguably the most consequential design decision in a RAG pipeline. Chunk too small, and each piece lacks enough context to be useful. Chunk too large, and you dilute relevant information with noise while the embedding becomes a blurry average of too many topics. The Goldilocks zone depends on your documents and queries, but 500-1000 characters is a good starting point for most use cases.
Fixed-Size Chunking
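A minimal sketch of character-based fixed-size chunking with overlap. The function name chunk_fixed and its defaults (1000 characters, 200 overlap) are illustrative choices, not a prescribed implementation.

```python
def chunk_fixed(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Slide a window of chunk_size characters, stepping by chunk_size - overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk.strip():                      # skip whitespace-only windows
            chunks.append(chunk)
        if start + chunk_size >= len(text):    # this window reached the end
            break
    return chunks
```

Each chunk shares its last `overlap` characters with the start of the next one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.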
Fixed-size chunking is the “good enough” approach that works for 80% of use cases. You set a character or token limit, slide through the document, and cut. The overlap between chunks is like shingles on a roof: each piece overlaps the next so that sentences at the boundary aren’t orphaned. It is simple but effective for uniform documents.
Semantic Chunking
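A hedged sketch of embedding-based splitting: embed each sentence, compare adjacent sentences, and start a new chunk where similarity drops. The embed argument is a hypothetical callable (a list of strings in, a list of vectors out) standing in for whatever embedding API you use, the 0.5 threshold is an arbitrary starting value, and the regex sentence splitter is deliberately naive.

```python
import math
import re
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def semantic_chunks(
    text: str,
    embed: Callable[[list[str]], list[Sequence[float]]],
    threshold: float = 0.5,
) -> list[str]:
    """Start a new chunk wherever adjacent-sentence similarity falls below threshold."""
    sentences = split_sentences(text)
    if not sentences:
        return []
    vectors = embed(sentences)  # one call for all sentences in the document
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine(vectors[i - 1], vectors[i]) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```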
Semantic chunking is the “smart” approach: instead of cutting at fixed intervals, it detects where the topic changes and cuts there. Imagine reading a textbook: you would naturally break it at section transitions, not every 500 words. The trade-off is cost: you need to embed every sentence to measure similarity between adjacent sentences, which means an API call before you even start building your index. For large document collections, this cost can be significant. Use semantic chunking when retrieval quality is critical and the extra embedding cost is justified. It splits on semantic similarity for better coherence.
Recursive Chunking
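A minimal sketch of recursive splitting: try paragraph breaks first, then line breaks, sentences, and words, and fall back to a hard cut only when nothing else fits. The function name recursive_split and the separator order are assumptions for illustration; note that the separator at a cut point is dropped, so a chunk cut at a sentence boundary loses its trailing period.

```python
def recursive_split(
    text: str,
    max_size: int = 1000,
    separators: tuple = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on the coarsest separator that works, recursing into oversized parts."""
    if len(text) <= max_size:
        return [text] if text.strip() else []
    if not separators:
        # Last resort: nothing natural left to split on, so cut by length.
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:                      # separator not present at all
        return recursive_split(text, max_size, finer)

    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_size:
            current = candidate              # keep merging small parts
            continue
        if current.strip():
            chunks.append(current)
        current = ""
        if len(part) <= max_size:
            current = part
        else:
            chunks.extend(recursive_split(part, max_size, finer))
    if current.strip():
        chunks.append(current)
    return chunks
```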
Recursive chunking is the best general-purpose strategy for structured documents. The idea is to try the most natural split first (double newlines, i.e. paragraph breaks), and only fall back to less natural splits (sentences, words, characters) when paragraphs are too large. This preserves document structure: a paragraph that fits within the size limit stays intact, while a 3000-character paragraph gets split at sentence boundaries rather than arbitrarily. In effect, this is hierarchical chunking for structured documents.
Document Loaders
Multi-Format Document Loader
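A hedged sketch of a loader that dispatches on file extension. The names load_document, load_pdf, and html_to_text are hypothetical; HTML is reduced to visible text with the standard library's html.parser, and the PDF branch assumes PyMuPDF is installed.

```python
from html.parser import HTMLParser
from pathlib import Path


class _TextOnly(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html: str) -> str:
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def load_pdf(path: Path) -> str:
    import fitz  # PyMuPDF; imported lazily so other formats work without it

    with fitz.open(str(path)) as doc:
        return "\n\n".join(page.get_text("text") for page in doc)


def load_document(path) -> dict:
    """Return the file's text plus the metadata later stages need."""
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        text = load_pdf(path)
    elif suffix in (".html", ".htm"):
        text = html_to_text(path.read_text(encoding="utf-8"))
    elif suffix in (".txt", ".md"):
        text = path.read_text(encoding="utf-8")
    else:
        raise ValueError(f"Unsupported format: {suffix}")
    return {"source": str(path), "format": suffix.lstrip("."), "text": text}
```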
A single loader should handle various document formats.
Complete Document Processing Pipeline
The pipeline below ties everything together: load a file (any format), chunk it intelligently, embed the chunks, and package the result for storage in a vector database. The _hash_content method is a detail worth noting: it generates a fingerprint of each document so you can detect when a user re-uploads the same file and skip reprocessing. At scale, this deduplication saves significant embedding costs.
Combine extraction, chunking, and embedding:
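A minimal sketch of such a pipeline, not the original implementation. The class name DocumentPipeline and the embed callable are hypothetical; only the _hash_content name and the 2048-item batch size come from the surrounding text. Chunking here is the simple fixed-size variant, and deduplication is kept in memory.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DocumentPipeline:
    embed: Callable[[list[str]], list[list[float]]]  # hypothetical embedding function
    chunk_size: int = 1000
    overlap: int = 200
    batch_size: int = 2048
    _seen: set = field(default_factory=set, repr=False)

    def __post_init__(self):
        if not 0 <= self.overlap < self.chunk_size:
            raise ValueError("overlap must be non-negative and smaller than chunk_size")

    def _hash_content(self, text: str) -> str:
        """Fingerprint the document so identical re-uploads can be skipped."""
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def _chunk(self, text: str) -> list[str]:
        step = self.chunk_size - self.overlap
        # Stop early enough that no trailing chunk is fully covered by the previous one.
        starts = range(0, max(1, len(text) - self.overlap), step)
        chunks = [text[s:s + self.chunk_size] for s in starts]
        return [c for c in chunks if c.strip()]

    def process(self, source: str, text: str) -> list[dict]:
        """Return records ready for a vector store, or [] for a duplicate document."""
        doc_hash = self._hash_content(text)
        if doc_hash in self._seen:
            return []
        chunks = self._chunk(text)
        vectors = []
        for i in range(0, len(chunks), self.batch_size):
            vectors.extend(self.embed(chunks[i:i + self.batch_size]))
        self._seen.add(doc_hash)  # mark as done only after embedding succeeded
        return [
            {
                "id": f"{doc_hash[:12]}-{n}",
                "source": source,
                "chunk_index": n,
                "text": chunk,
                "embedding": vector,
            }
            for n, (chunk, vector) in enumerate(zip(chunks, vectors))
        ]
```

In a real service the set of seen hashes would live in the database alongside the chunks rather than in process memory.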
Chunking Strategy Comparison
| Strategy | Best For | Chunk Quality | Processing Cost | Complexity |
|---|---|---|---|---|
| Fixed-size | Uniform documents (plain text, transcripts) | Adequate — may split mid-sentence | None (no API calls) | Low |
| Fixed-size with sentence boundaries | Most general use cases | Good — respects sentence structure | None | Low |
| Recursive (hierarchical separators) | Structured documents (markdown, HTML, code) | Very good — preserves document structure | None | Medium |
| Semantic (embedding-based) | Documents where topic boundaries don’t align with formatting | Excellent — cuts at topic transitions | High (embed every sentence) | High |
| Markdown/HTML header-based | Documentation, wikis, knowledge bases | Excellent for structured content | None | Medium |
| Token-based (tiktoken) | When you need precise token budget control | Good — guarantees token limits | None | Low |
- Is your document well-structured with headers, paragraphs, and sections (markdown, HTML, docstrings)? Use recursive chunking — it preserves structure at near-zero cost.
- Is your document unstructured prose with no clear formatting (OCR output, email threads, transcripts)? Start with fixed-size + sentence boundary detection. Upgrade to semantic chunking only if retrieval quality is measurably poor.
- Are you processing code files? Use language-aware chunking that splits at function/class boundaries, not character counts. A function split across two chunks is useless for code Q&A.
- Do you have a hard token budget per chunk (e.g., for context window management)? Use token-based chunking with tiktoken to guarantee exact limits.
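For the code-file case, a hedged sketch of language-aware chunking for Python sources using the standard library's ast module. The function name chunk_python_source is illustrative, only top-level functions and classes are captured, and module-level statements between them are ignored here.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class, with line ranges."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above node.lineno, so start from the earliest one.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            end = node.end_lineno
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": start,
                "end_line": end,
                "text": "\n".join(lines[start - 1:end]),
            })
    return chunks
```

Other languages need their own parser; the point is only that the split follows syntax rather than character counts.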
Document Processing Edge Cases
PDFs with mixed content. A financial report has text, tables, charts, and footnotes. The text extractor pulls everything into a flat string, losing the structure. Tables become garbled rows of numbers. Mitigation: extract tables separately with pdfplumber (as shown above), convert them to markdown, and tag them with metadata ("content_type": "table"). This lets your retrieval system surface the table when a user asks “what were Q3 revenue numbers?”
Documents with headers and footers repeated on every page. A 50-page PDF with “Company Confidential - Page X of 50” on every page creates 50 occurrences of that text in your chunks. These pollute embeddings and waste tokens. Add a deduplication step: extract text per page, identify repeated strings across pages, and strip them before chunking.
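A minimal sketch of that deduplication step, assuming you already have one text string per page. The function name strip_repeated_lines and the 60% threshold are illustrative. Digits are normalized so that "Page 3 of 50" and "Page 4 of 50" count as the same line; the trade-off is that genuine content lines differing only in their numbers could be stripped too if they recur on most pages.

```python
import re
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Remove lines that recur (ignoring digits) on at least min_fraction of pages."""

    def key(line: str) -> str:
        # "Page 3 of 50" becomes "Page # of #"
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for page in pages:
        # A set, so a line repeated within one page counts once for that page.
        counts.update({key(line) for line in page.splitlines() if line.strip()})

    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {k for k, n in counts.items() if n >= threshold}
    return [
        "\n".join(line for line in page.splitlines() if key(line) not in repeated)
        for page in pages
    ]
```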
Unicode and encoding issues. Documents from different sources use different encodings. A Windows-generated CSV may use cp1252, while a Linux export uses UTF-8. The read_text(encoding="utf-8") call in the text loader will raise a UnicodeDecodeError on bytes that are not valid UTF-8, or silently mangle characters if decoding errors are suppressed. Add encoding detection (using chardet or charset_normalizer) as a fallback when utf-8 fails.
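A hedged sketch of the fallback idea using only the standard library. This swaps the detector for a fixed chain of encodings (UTF-8, then cp1252, then latin-1, which accepts any byte sequence); a detector such as chardet or charset_normalizer could be slotted in before the fixed fallbacks. The function name decode_bytes is illustrative.

```python
def decode_bytes(data: bytes, fallbacks: tuple = ("utf-8", "cp1252")) -> tuple[str, str]:
    """Try each encoding in order; return (text, encoding_used)."""
    for encoding in fallbacks:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a code point, so this never raises,
    # but the result may still be wrong for the actual source encoding.
    return data.decode("latin-1"), "latin-1"
```

Recording which encoding was used makes it possible to audit documents that fell through to the last resort.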
Empty or near-empty chunks. After splitting and cleaning, some chunks may contain only whitespace, a page number, or a header. These create useless embeddings that waste storage and pollute search results. Add a minimum content threshold (e.g., 50 characters of non-whitespace) and discard chunks that fall below it.
Very large single documents. A 500-page PDF generates thousands of chunks. Embedding all of them in one batch can time out or exceed API rate limits. The pipeline’s batching (2048 per request) handles this, but you also need progress tracking and error recovery — if embedding fails on batch 15 of 20, you should not re-embed batches 1-14.
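A minimal sketch of batch embedding with error recovery, assuming a hypothetical embed callable and a JSON checkpoint file on local disk. Completed batches are saved after each call, so a rerun skips them. The checkpoint is only valid for the same chunk list and batch size; a real system would key it by document hash.

```python
import json
from pathlib import Path


def embed_with_checkpoint(chunks, embed, checkpoint_path, batch_size=2048):
    """Embed chunks in batches, persisting each finished batch before moving on."""
    path = Path(checkpoint_path)
    done = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}

    total = (len(chunks) + batch_size - 1) // batch_size
    for batch_no in range(total):
        key = str(batch_no)                 # JSON object keys are strings
        if key in done:
            continue                        # finished in an earlier run
        batch = chunks[batch_no * batch_size:(batch_no + 1) * batch_size]
        done[key] = embed(batch)            # may raise; earlier batches are already saved
        path.write_text(json.dumps(done), encoding="utf-8")
        print(f"embedded batch {batch_no + 1}/{total}")

    return [vector for n in range(total) for vector in done[str(n)]]
```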
- Start at 500-1000 characters — This maps to roughly 125-250 tokens. Smaller chunks give more precise retrieval but may lack context. Larger chunks provide more context but dilute the embedding signal.
- Use 10-20% overlap — A 1000-char chunk with 200-char overlap means every boundary sentence appears in two chunks, so retrieval won’t miss content split across a boundary.
- Match chunk size to query length — Short queries (“What is the refund policy?”) match better against shorter chunks. Long, complex queries match better against larger chunks. If your queries vary, consider indexing at multiple granularities.
- Preserve metadata — Always store which document, page, and character offset each chunk came from. You will need this for citations, debugging, and re-chunking experiments.
- Test with real queries — The only way to know if your chunk size is right is to run your actual user queries against the index and check if the right chunks surface in the top 5.
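The last point can be turned into a number. A minimal sketch, assuming a hypothetical search(query, k) function that returns ranked chunk IDs from your index and a hand-labeled list of (query, expected chunk ID) pairs:

```python
def hit_rate_at_k(labeled_queries, search, k: int = 5) -> float:
    """Fraction of queries whose expected chunk appears in the top-k results."""
    if not labeled_queries:
        return 0.0
    hits = sum(
        1 for query, expected_id in labeled_queries
        if expected_id in search(query, k)[:k]
    )
    return hits / len(labeled_queries)
```

Rerunning this after every change to chunk size or overlap gives a like-for-like comparison between configurations.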
Practice Exercise
Build a document ingestion service:
- Accept PDF, Markdown, and HTML uploads
- Extract text with proper formatting
- Implement configurable chunking strategies
- Generate embeddings in batches
- Store chunks in a vector database
- Handling large documents efficiently
- Deduplication using content hashes
- Progress tracking for batch processing
- Error recovery for partial failures
Interview Deep-Dive
You are building a RAG pipeline and need to choose a chunking strategy for a corpus of mixed documents: technical manuals, legal contracts, and marketing blog posts. Walk me through how you decide on chunk size, overlap, and splitting strategy.
- There is no single right chunk size because these three document types have fundamentally different structures. Technical manuals have clear section headers and numbered steps. Legal contracts have nested clause structures with cross-references. Blog posts are freeform prose. A one-size-fits-all chunker produces mediocre results for all three.
- For technical manuals, use recursive chunking that respects section headers. Split first on ## (H2 headings), then on ### (H3), then on paragraph breaks. Each section becomes a chunk with its header preserved as metadata. This keeps instructions together with their context rather than splitting a step list across two chunks.
- For legal contracts, chunk on section boundaries (Section 1.1, Section 1.2). Legal language depends heavily on defined terms and cross-references, so include the definitions section as an “always-injected” context chunk that gets appended to every retrieval result. Chunk overlap is critical here: a sentence like “Subject to the limitations in Section 4.3” at a chunk boundary needs to appear in both chunks.
- For blog posts, fixed-size chunking at 500-800 characters with 100-character overlap works well. Blog prose is semantically fluid without hard structural boundaries, so recursive or semantic chunking adds complexity without meaningful quality improvement.
- The universal rule: start at 500-1000 characters, test with 50 real queries from your users, measure whether the correct chunk appears in the top-5 retrieval results. If retrieval recall is below 80%, your chunks are either too large (diluted embeddings) or too small (missing context). Adjust and re-test.
Your PDF extraction pipeline works perfectly on Word-exported PDFs but produces garbled text on scanned documents and interleaved columns on research papers. How do you build a robust extraction pipeline that handles all three?
- The fundamental issue is that PDFs are a page-layout format, not a text format. A PDF stores “draw character H at position (72, 400)” not “here is a paragraph.” Extraction quality varies wildly based on how the PDF was created. You need to detect the PDF type and route to the appropriate extraction strategy.
- For digitally-created PDFs (Word exports, LaTeX output): PyMuPDF (fitz) extracts text cleanly. This is the fast path: call page.get_text("text") and clean whitespace. 80% of your documents likely fall here.
- For scanned documents (images of printed text): you need OCR. Tesseract via pytesseract or a cloud OCR service (AWS Textract, Google Document AI) converts page images to text. The detection heuristic is simple: if PyMuPDF extracts fewer than 50 characters per page for a document that clearly has content, it is scanned. Google Document AI is significantly more accurate than Tesseract for complex layouts but costs money.
- For two-column research papers: standard extraction interleaves the columns (“Left column line 1, Right column line 1, Left column line 2…”). PyMuPDF’s page.get_text("blocks") returns text blocks with bounding box coordinates. Sort blocks by x-coordinate first (left column vs right column), then by y-coordinate within each column. This reconstructs the reading order. Alternatively, use a layout-aware tool like Layout Parser or Unstructured.io that handles column detection automatically.
- The production pipeline: detect PDF type (digital vs scanned), route to the appropriate extractor, apply domain-specific cleaning (remove headers/footers, fix hyphenated line breaks, handle ligatures), and validate output quality (flag pages with suspiciously low character counts for human review).
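The scanned-document heuristic and the column re-ordering can be sketched without touching a PDF, assuming blocks shaped like PyMuPDF's get_text("blocks") tuples (x0, y0, x1, y1, text, ...). The function names are illustrative, and splitting at the page midpoint is a simplification: full-width blocks such as titles or abstracts will be assigned to one column or the other.

```python
def looks_scanned(page_texts: list[str], min_chars: int = 50) -> bool:
    """True when extracted text averages fewer than min_chars per page."""
    if not page_texts:
        return False
    average = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return average < min_chars


def order_two_column_blocks(blocks: list[tuple], page_width: float) -> list[str]:
    """Read the left column top to bottom, then the right column."""
    midpoint = page_width / 2

    def center_x(block):
        return (block[0] + block[2]) / 2

    left = [b for b in blocks if center_x(b) < midpoint]
    right = [b for b in blocks if center_x(b) >= midpoint]
    ordered = sorted(left, key=lambda b: b[1]) + sorted(right, key=lambda b: b[1])
    return [b[4].strip() for b in ordered]
```

A router built on these would send documents where looks_scanned is true to OCR and everything else through the fast text path.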
Semantic chunking embeds every sentence to find topic boundaries. For a corpus of 100,000 documents averaging 50 sentences each, that is 5 million embedding calls just for chunking. Is semantic chunking worth the cost, and when would you use it over recursive chunking?
- At 2. The cost is not the issue — the time is. At 100 sentences per batch and 200ms per API call, 5M sentences take about 2.8 hours. For a one-time indexing job, that is acceptable. For real-time document uploads, it is not.
- Semantic chunking is worth it when: (1) your documents have long, flowing text without clear structural markers (think transcripts, meeting notes, freeform reports), (2) retrieval quality is critical and you have measured that recursive chunking produces chunks that split key information across boundaries, and (3) you are indexing once and querying many times, so the upfront cost amortizes over millions of queries.
- Recursive chunking wins when: (1) documents have clear structure (headers, sections, numbered lists), (2) you need real-time processing (user uploads a document and expects it indexed in seconds), or (3) you are iterating on chunk parameters frequently and cannot afford to re-embed the entire corpus each time.
- The pragmatic middle ground: use recursive chunking as the default (it handles 80% of documents well), and apply semantic chunking only to document types that recursive chunking handles poorly. Measure retrieval quality per document type — if technical manuals retrieve at 90% recall with recursive chunking but meeting transcripts only hit 65%, apply semantic chunking to transcripts only.
- A lesser-known alternative: use a sliding window of 3-5 sentences, compute the average embedding similarity between adjacent windows, and split where similarity drops below a threshold. This is cheaper than per-sentence embedding because you embed windows instead of individual sentences, reducing API calls by 3-5x.
max_chunk_size parameter in the SemanticChunker implementation. The important detail is that the forced split should still try to find the weakest semantic boundary within the oversized chunk — the point where adjacent sentence similarity is lowest, even if it is above the normal threshold. This gives you “the best available split” rather than an arbitrary character-count cut. Also consider that uniformly high similarity might mean the document is about a single narrow topic, in which case large chunks are actually appropriate — the embedding will be coherent and retrieval will work well.
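A hedged sketch of the "best available split" idea, independent of any particular SemanticChunker class. It assumes similarities[i] holds the similarity between sentence i and sentence i + 1; the function names are illustrative. An oversized chunk is cut at its lowest-similarity boundary, recursively, until every piece fits or cannot be split further.

```python
def weakest_boundary(similarities: list[float]) -> int:
    """Index i of the lowest similarity: split between sentence i and i + 1."""
    if not similarities:
        raise ValueError("need at least two sentences to find a boundary")
    return min(range(len(similarities)), key=similarities.__getitem__)


def force_split(sentences: list[str], similarities: list[float], max_chunk_size: int) -> list[str]:
    """Split an oversized chunk at its weakest semantic boundary, recursively."""
    text = " ".join(sentences)
    if len(text) <= max_chunk_size or len(sentences) < 2:
        return [text]  # fits, or is a single sentence that cannot be split here
    cut = weakest_boundary(similarities) + 1
    left = force_split(sentences[:cut], similarities[:cut - 1], max_chunk_size)
    right = force_split(sentences[cut:], similarities[cut:], max_chunk_size)
    return left + right
```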