Document processing is fundamental to RAG applications. This chapter covers extracting text from various formats, implementing chunking strategies, and building robust document pipelines. Here is the cold truth about RAG: your retrieval quality is bounded by your document processing quality. If your PDF extractor garbles a table into nonsense, or your chunker splits a key paragraph across two chunks, no amount of prompt engineering will save you. Garbage in, garbage out — except with RAG, the garbage gets cited with a confidence score, which is worse than no answer at all.

PDF Text Extraction

PDFs are the most common document format in enterprise RAG systems — and the hardest to extract text from cleanly. A PDF is fundamentally a page-layout format, not a text format. It stores instructions like “draw character ‘H’ at position (72, 400)” rather than “here is a paragraph.” This means extraction quality varies wildly depending on how the PDF was created: a Word-exported PDF extracts cleanly, a scanned document returns nothing without OCR, and a two-column research paper might interleave the columns.

Basic PDF Extraction with PyMuPDF

PyMuPDF (fitz) provides fast and accurate PDF text extraction:
import fitz  # PyMuPDF
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExtractedPage:
    """Represents an extracted PDF page."""
    page_number: int
    text: str
    metadata: dict


class PDFExtractor:
    """Extract text and metadata from PDF documents."""
    
    def extract(self, pdf_path: str | Path) -> list[ExtractedPage]:
        """Extract all pages from a PDF."""
        pdf_path = Path(pdf_path)
        
        if not pdf_path.exists():
            raise FileNotFoundError(f"PDF not found: {pdf_path}")
        
        pages = []
        
        with fitz.open(pdf_path) as doc:
            metadata = {
                "title": doc.metadata.get("title", ""),
                "author": doc.metadata.get("author", ""),
                "page_count": len(doc),
                "file_name": pdf_path.name,
            }
            
            for page_num, page in enumerate(doc, start=1):
                text = page.get_text("text")
                
                # Clean up whitespace
                text = self._clean_text(text)
                
                pages.append(ExtractedPage(
                    page_number=page_num,
                    text=text,
                    metadata={**metadata, "page": page_num}
                ))
        
        return pages
    
    def _clean_text(self, text: str) -> str:
        """Clean extracted text.
        
        Tip: This is where 80% of extraction bugs hide. Common issues:
        - Ligatures (fi, fl) becoming weird characters
        - Hyphenated line breaks ("docu-\\nment" should become "document")
        - Headers/footers repeated on every page
        Add domain-specific cleaning rules as you discover edge cases.
        """
        import re
        
        # Normalize whitespace (PDFs often have inconsistent spacing)
        text = re.sub(r'\s+', ' ', text)
        
        # Remove null bytes (common artifact in scanned PDFs)
        text = re.sub(r'\x00', '', text)
        
        return text.strip()


# Usage
extractor = PDFExtractor()
pages = extractor.extract("document.pdf")

for page in pages:
    print(f"Page {page.page_number}: {len(page.text)} chars")
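
The docstring of _clean_text lists ligatures and hyphenated line breaks as common problems, but the method itself only normalizes whitespace. A minimal sketch of the two extra rules follows. The function name clean_extracted_text is hypothetical, and the hyphenation rule is a heuristic: it would also join a genuine compound such as "well-known" if that happened to break across lines.

```python
import re
import unicodedata


def clean_extracted_text(text: str) -> str:
    """Hypothetical helper: repair two common PDF extraction artifacts."""
    # NFKC folds compatibility characters, such as the "fi" ligature
    # (U+FB01), into their plain-letter equivalents.
    text = unicodedata.normalize("NFKC", text)

    # Re-join words hyphenated across a line break: "docu-\nment" -> "document".
    # Only lowercase continuations are joined; this is a heuristic.
    text = re.sub(r"(\w)-\n([a-z])", r"\1\2", text)

    # Collapse runs of spaces and tabs but keep newlines, so paragraph
    # breaks are still available to a chunker later on.
    text = re.sub(r"[ \t]+", " ", text)
    return text.strip()
```

Note that NFKC also rewrites other compatibility characters (for example, full-width letters and non-breaking spaces), which is usually desirable for retrieval but worth checking against your own documents.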

Table Extraction with pdfplumber

For documents with tables, pdfplumber excels:
import pdfplumber
from dataclasses import dataclass


@dataclass
class ExtractedTable:
    """Represents an extracted table."""
    page_number: int
    table_number: int
    headers: list[str]
    rows: list[list[str]]


class TableExtractor:
    """Extract tables from PDF documents."""
    
    def __init__(self, min_rows: int = 2):
        self.min_rows = min_rows
    
    def extract_tables(self, pdf_path: str) -> list[ExtractedTable]:
        """Extract all tables from a PDF."""
        tables = []
        
        with pdfplumber.open(pdf_path) as pdf:
            for page_num, page in enumerate(pdf.pages, start=1):
                page_tables = page.extract_tables()
                
                for table_num, table in enumerate(page_tables, start=1):
                    if len(table) >= self.min_rows:
                        # First row as headers
                        headers = [str(cell or "") for cell in table[0]]
                        rows = [
                            [str(cell or "") for cell in row]
                            for row in table[1:]
                        ]
                        
                        tables.append(ExtractedTable(
                            page_number=page_num,
                            table_number=table_num,
                            headers=headers,
                            rows=rows
                        ))
        
        return tables
    
    def table_to_markdown(self, table: ExtractedTable) -> str:
        """Convert table to markdown format."""
        lines = []
        
        # Headers
        lines.append("| " + " | ".join(table.headers) + " |")
        lines.append("| " + " | ".join(["---"] * len(table.headers)) + " |")
        
        # Rows
        for row in table.rows:
            # Pad row if needed
            padded_row = row + [""] * (len(table.headers) - len(row))
            lines.append("| " + " | ".join(padded_row[:len(table.headers)]) + " |")
        
        return "\n".join(lines)


# Usage
table_extractor = TableExtractor()
tables = table_extractor.extract_tables("report.pdf")

for table in tables:
    print(f"Table on page {table.page_number}:")
    print(table_extractor.table_to_markdown(table))

Text Chunking Strategies

Chunking determines how documents are split for embedding and retrieval. This is arguably the most consequential design decision in a RAG pipeline. Chunk too small, and each piece lacks enough context to be useful. Chunk too large, and you dilute relevant information with noise, and your embedding becomes a blurry average of too many topics. The Goldilocks zone depends on your documents and queries, but 500-1000 characters is a good starting point for most use cases.

Fixed-Size Chunking

Fixed-size chunking is the “good enough” approach that works for 80% of use cases. You set a character or token limit, slide through the document, and cut. The overlap between chunks is like shingles on a roof — each piece overlaps the next so that sentences at the boundary aren’t orphaned. Simple but effective for uniform documents:
from dataclasses import dataclass


@dataclass
class Chunk:
    """Represents a text chunk."""
    text: str
    metadata: dict
    chunk_index: int


class FixedSizeChunker:
    """Split text into fixed-size chunks with overlap."""
    
    def __init__(
        self,
        chunk_size: int = 1000,
        chunk_overlap: int = 200
    ):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
    
    def chunk(self, text: str, metadata: dict = None) -> list[Chunk]:
        """Split text into overlapping chunks."""
        metadata = metadata or {}
        chunks = []
        
        if len(text) <= self.chunk_size:
            return [Chunk(text=text, metadata=metadata, chunk_index=0)]
        
        start = 0
        chunk_index = 0
        
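        while start < len(text):
            end = min(start + self.chunk_size, len(text))
            
            # Find a good break point (only when more text follows)
            if end < len(text):
                candidate = self._find_break_point(text, end)
                # Ignore break points at or before the chunk start
                if candidate > start:
                    end = candidate
            
            chunk_text = text[start:end].strip()
            
            if chunk_text:
                chunks.append(Chunk(
                    text=chunk_text,
                    metadata={
                        **metadata,
                        "chunk_start": start,
                        "chunk_end": end,
                    },
                    chunk_index=chunk_index
                ))
                chunk_index += 1
            
            # Stop at the end of the text; stepping back by the overlap here
            # would emit a trailing chunk that only repeats earlier content
            if end >= len(text):
                break
            
            # Move start back by the overlap, but always make forward progress
            next_start = end - self.chunk_overlap
            start = next_start if next_start > start else end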
        
        return chunks
    
    def _find_break_point(self, text: str, position: int) -> int:
        """Find a natural break point near position."""
        # Clamp the search windows so a negative start index is never
        # passed to rfind (it would be read as an offset from the end)
        wide_start = max(0, position - 100)
        narrow_start = max(0, position - 50)
        
        # Look for paragraph break
        para_break = text.rfind('\n\n', wide_start, position)
        if para_break != -1:
            return para_break + 2
        
        # Look for the latest sentence break in the window
        sent_break = max(
            text.rfind(punct, wide_start, position)
            for punct in ['. ', '! ', '? ']
        )
        if sent_break != -1:
            return sent_break + 2
        
        # Look for word break
        space = text.rfind(' ', narrow_start, position)
        if space != -1:
            return space + 1
        
        return position


# Usage
chunker = FixedSizeChunker(chunk_size=500, chunk_overlap=100)
text = "Your long document text here..."
chunks = chunker.chunk(text, {"source": "document.pdf"})
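
To make the shingle analogy concrete, the sketch below computes only the window offsets for a given text length, ignoring break-point detection. The function name window_offsets is an assumption for illustration; the stride between windows is chunk_size minus overlap.

```python
def window_offsets(
    text_length: int, chunk_size: int, overlap: int
) -> list[tuple[int, int]]:
    """Return (start, end) character offsets for overlapping windows."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each window advances
    offsets = []
    start = 0
    while start < text_length:
        end = min(start + chunk_size, text_length)
        offsets.append((start, end))
        if end == text_length:
            break  # the last window already reaches the end of the text
        start += step
    return offsets


print(window_offsets(1200, chunk_size=500, overlap=100))
# [(0, 500), (400, 900), (800, 1200)]
```

Each window starts 400 characters after the previous one, so the last 100 characters of one chunk are repeated at the start of the next.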

Semantic Chunking

Semantic chunking is the “smart” approach: instead of cutting at fixed intervals, it detects where the topic changes and cuts there. Imagine reading a textbook: you would naturally break it at section transitions, not every 500 words. The trade-off is cost: you need to embed every sentence to measure similarity between adjacent sentences, which means an API call before you even start building your index. For large document collections, this cost can be significant. Use semantic chunking when retrieval quality is critical and the extra embedding cost is justified.

Split based on semantic similarity for better coherence:
import numpy as np
from openai import OpenAI


class SemanticChunker:
    """Split text based on semantic similarity."""
    
    def __init__(
        self,
        client: OpenAI,
        similarity_threshold: float = 0.8,
        min_chunk_size: int = 100,
        max_chunk_size: int = 2000
    ):
        self.client = client
        self.similarity_threshold = similarity_threshold
        self.min_chunk_size = min_chunk_size
        self.max_chunk_size = max_chunk_size
    
    def chunk(self, text: str, metadata: dict = None) -> list[Chunk]:
        """Split text at semantic boundaries."""
        metadata = metadata or {}
        
        # Split into sentences
        sentences = self._split_sentences(text)
        
        if len(sentences) <= 1:
            return [Chunk(text=text, metadata=metadata, chunk_index=0)]
        
        # Get embeddings for sentences
        embeddings = self._embed_sentences(sentences)
        
        # Find semantic breaks
        break_points = self._find_semantic_breaks(embeddings)
        
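        # Create chunks from break points
        chunks = []
        start_idx = 0
        pending = ""  # Short leading text with no earlier chunk to merge into
        
        for end_idx in break_points:
            chunk_text = " ".join(sentences[start_idx:end_idx + 1])
            if pending:
                chunk_text = pending + " " + chunk_text
                pending = ""
            
            # Ensure chunk meets size requirements
            if len(chunk_text) >= self.min_chunk_size:
                chunks.append(Chunk(
                    text=chunk_text,
                    metadata=metadata,
                    chunk_index=len(chunks)  # Keep indices contiguous
                ))
            elif chunks:
                # Merge with previous chunk
                chunks[-1] = Chunk(
                    text=chunks[-1].text + " " + chunk_text,
                    metadata=chunks[-1].metadata,
                    chunk_index=chunks[-1].chunk_index
                )
            else:
                # No previous chunk yet: carry the text forward so it is not lost
                pending = chunk_text
            
            start_idx = end_idx + 1
        
        # Emit any short text that never found a chunk to join
        if pending:
            chunks.append(Chunk(text=pending, metadata=metadata, chunk_index=0))
        
        return chunks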
    
    def _split_sentences(self, text: str) -> list[str]:
        """Split text into sentences."""
        import re
        
        # Simple sentence splitting
        sentences = re.split(r'(?<=[.!?])\s+', text)
        return [s.strip() for s in sentences if s.strip()]
    
    def _embed_sentences(self, sentences: list[str]) -> np.ndarray:
        """Get embeddings for sentences."""
        response = self.client.embeddings.create(
            model="text-embedding-3-small",
            input=sentences
        )
        
        return np.array([e.embedding for e in response.data])
    
    def _find_semantic_breaks(self, embeddings: np.ndarray) -> list[int]:
        """Find indices where semantic breaks occur."""
        breaks = []
        
        for i in range(len(embeddings) - 1):
            # Calculate similarity with next sentence
            similarity = self._cosine_similarity(
                embeddings[i], embeddings[i + 1]
            )
            
            # Break where similarity drops below the threshold.
            # Note: max_chunk_size is not enforced here, because this
            # method sees only embeddings, not sentence lengths.
            if similarity < self.similarity_threshold:
                breaks.append(i)
        
        # Add final break
        breaks.append(len(embeddings) - 1)
        
        return breaks
    
    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        """Calculate cosine similarity between vectors."""
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
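
The break-detection step can be tried without an API call by using hand-made vectors in place of real embeddings. The sketch below is a simplified, standard-library version of the same idea; the two-dimensional toy_vectors are invented for illustration and do not come from any embedding model.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_breaks(vectors: list[list[float]], threshold: float) -> list[int]:
    """Return the index of the last sentence in each group."""
    if not vectors:
        return []
    breaks = [
        i for i in range(len(vectors) - 1)
        if cosine(vectors[i], vectors[i + 1]) < threshold
    ]
    breaks.append(len(vectors) - 1)  # the final sentence always closes a group
    return breaks


toy_vectors = [
    [1.0, 0.0],  # sentence 0: topic A
    [0.9, 0.1],  # sentence 1: topic A
    [0.0, 1.0],  # sentence 2: topic B
    [0.1, 0.9],  # sentence 3: topic B
]
print(find_breaks(toy_vectors, threshold=0.8))  # [1, 3]
```

Sentences 0 and 1 are nearly parallel, as are 2 and 3, while the similarity between 1 and 2 is about 0.11, so the only interior break falls after sentence 1. With real embeddings the right threshold depends on the model and should be tuned on your own data.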

Recursive Chunking

Recursive chunking is a strong general-purpose default for structured documents. The idea is to try the most natural split first (blank lines, which mark section and paragraph breaks) and fall back to less natural splits (single lines, sentences, words, characters) only when a piece is still too large. This preserves document structure: a paragraph that fits within the size limit stays intact, while a 3000-character paragraph gets split at line or sentence boundaries rather than at an arbitrary character offset.

Hierarchical chunking for structured documents:
class RecursiveChunker:
    """Recursively split text using multiple separators."""
    
    def __init__(
        self,
        chunk_size: int = 1000,
        chunk_overlap: int = 200,
        separators: list[str] = None
    ):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.separators = separators or [
            "\n\n\n",  # Multiple newlines
            "\n\n",    # Paragraphs
            "\n",      # Lines
            ". ",      # Sentences
            " ",       # Words
            ""         # Characters
        ]
    
    def chunk(self, text: str, metadata: dict = None) -> list[Chunk]:
        """Recursively split text."""
        metadata = metadata or {}
        
        chunks = self._split_recursive(text, self.separators)
        
        return [
            Chunk(text=chunk, metadata=metadata, chunk_index=i)
            for i, chunk in enumerate(chunks)
        ]
    
    def _split_recursive(
        self,
        text: str,
        separators: list[str]
    ) -> list[str]:
        """Recursively split using separators."""
        if not text:
            return []
        
        if len(text) <= self.chunk_size:
            return [text]
        
        if not separators:
            # No more separators, force split
            return self._force_split(text)
        
        separator = separators[0]
        remaining_separators = separators[1:]
        
        if separator == "":
            # Character-level split
            return self._force_split(text)
        
        splits = text.split(separator)
        
        chunks = []
        current_chunk = ""
        
        for split in splits:
            test_chunk = (
                current_chunk + separator + split
                if current_chunk else split
            )
            
            if len(test_chunk) <= self.chunk_size:
                current_chunk = test_chunk
            else:
                if current_chunk:
                    chunks.append(current_chunk)
                
                if len(split) > self.chunk_size:
                    # Recursively split with next separator
                    sub_chunks = self._split_recursive(
                        split, remaining_separators
                    )
                    chunks.extend(sub_chunks)
                    current_chunk = ""
                else:
                    current_chunk = split
        
        if current_chunk:
            chunks.append(current_chunk)
        
        return chunks
    
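    def _force_split(self, text: str) -> list[str]:
        """Force split text into fixed-size chunks with overlap."""
        # Guard against a zero or negative step when overlap >= chunk_size
        step = max(1, self.chunk_size - self.chunk_overlap)
        chunks = []
        
        for i in range(0, len(text), step):
            chunks.append(text[i:i + self.chunk_size])
            # Stop once a chunk reaches the end, to avoid a redundant tail
            if i + self.chunk_size >= len(text):
                break
        
        return chunks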
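
A related structure-aware variant is to split Markdown at its headings, so each chunk is one section and carries its heading as metadata. The sketch below is a minimal illustration under stated assumptions: split_by_headings is a hypothetical name, only ATX-style headings ("# Title") are recognized, and a "#" line inside a fenced code block would be misread as a heading.

```python
import re

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into sections, one per heading."""
    def has_content(section: dict) -> bool:
        return bool(section["heading"]) or any(
            ln.strip() for ln in section["lines"]
        )

    sections = []
    current = {"heading": "", "level": 0, "lines": []}  # text before any heading

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            if has_content(current):
                sections.append(current)
            current = {
                "heading": match.group(2).strip(),
                "level": len(match.group(1)),
                "lines": [],
            }
        else:
            current["lines"].append(line)

    if has_content(current):
        sections.append(current)

    return [
        {
            "heading": s["heading"],
            "level": s["level"],
            "text": "\n".join(s["lines"]).strip(),
        }
        for s in sections
    ]
```

Sections that exceed the size limit can then be passed to a character-based chunker, with the heading copied into each resulting chunk's metadata.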

Document Loaders

Multi-Format Document Loader

Handle various document formats:
from abc import ABC, abstractmethod
from pathlib import Path
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Document:
    """Represents a loaded document."""
    content: str
    metadata: dict
    source: str


class DocumentLoader(Protocol):
    """Protocol for document loaders."""
    
    def load(self, path: Path) -> Document:
        """Load a document from path."""
        ...
    
    def supports(self, path: Path) -> bool:
        """Check if loader supports the file type."""
        ...


class PDFLoader:
    """Load PDF documents."""
    
    def supports(self, path: Path) -> bool:
        return path.suffix.lower() == ".pdf"
    
    def load(self, path: Path) -> Document:
        import fitz
        
        text_parts = []
        
        with fitz.open(path) as doc:
            for page in doc:
                text_parts.append(page.get_text())
        
        return Document(
            content="\n\n".join(text_parts),
            metadata={
                "file_type": "pdf",
                "page_count": len(text_parts),
            },
            source=str(path)
        )


class MarkdownLoader:
    """Load Markdown documents."""
    
    def supports(self, path: Path) -> bool:
        return path.suffix.lower() in [".md", ".markdown"]
    
    def load(self, path: Path) -> Document:
        content = path.read_text(encoding="utf-8")
        
        # Extract title from first heading
        title = ""
        for line in content.split("\n"):
            if line.startswith("# "):
                title = line[2:].strip()
                break
        
        return Document(
            content=content,
            metadata={
                "file_type": "markdown",
                "title": title,
            },
            source=str(path)
        )


class TextLoader:
    """Load plain text documents."""
    
    def supports(self, path: Path) -> bool:
        return path.suffix.lower() in [".txt", ".text"]
    
    def load(self, path: Path) -> Document:
        content = path.read_text(encoding="utf-8")
        
        return Document(
            content=content,
            metadata={"file_type": "text"},
            source=str(path)
        )


class HTMLLoader:
    """Load HTML documents."""
    
    def supports(self, path: Path) -> bool:
        return path.suffix.lower() in [".html", ".htm"]
    
    def load(self, path: Path) -> Document:
        from bs4 import BeautifulSoup
        
        html = path.read_text(encoding="utf-8")
        soup = BeautifulSoup(html, "html.parser")
        
        # Remove non-content elements (scripts, styles, navigation, footers)
        for element in soup(["script", "style", "nav", "footer"]):
            element.decompose()
        
        # Extract text
        text = soup.get_text(separator="\n")
        
        # Get title (get_text avoids None when <title> is empty or has nested tags)
        title = soup.title.get_text(strip=True) if soup.title else ""
        
        return Document(
            content=text,
            metadata={
                "file_type": "html",
                "title": title,
            },
            source=str(path)
        )


class UniversalDocumentLoader:
    """Load documents of various formats."""
    
    def __init__(self):
        self.loaders: list[DocumentLoader] = [
            PDFLoader(),
            MarkdownLoader(),
            TextLoader(),
            HTMLLoader(),
        ]
    
    def load(self, path: str | Path) -> Document:
        """Load a document using the appropriate loader."""
        path = Path(path)
        
        if not path.exists():
            raise FileNotFoundError(f"File not found: {path}")
        
        for loader in self.loaders:
            if loader.supports(path):
                return loader.load(path)
        
        raise ValueError(f"Unsupported file type: {path.suffix}")
    
    def load_directory(
        self,
        directory: str | Path,
        recursive: bool = True
    ) -> list[Document]:
        """Load all documents from a directory."""
        directory = Path(directory)
        documents = []
        
        pattern = "**/*" if recursive else "*"
        
        for file_path in directory.glob(pattern):
            if file_path.is_file():
                try:
                    doc = self.load(file_path)
                    documents.append(doc)
                except ValueError:
                    # Skip unsupported files
                    continue
                except Exception as e:
                    print(f"Error loading {file_path}: {e}")
        
        return documents


# Usage
loader = UniversalDocumentLoader()

# Load single document
doc = loader.load("report.pdf")

# Load entire directory
docs = loader.load_directory("documents/", recursive=True)
print(f"Loaded {len(docs)} documents")
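
The text, Markdown, and HTML loaders above all assume UTF-8. A minimal sketch of a more forgiving reader follows. It swaps in a fixed fallback list rather than real encoding detection, and both the function name read_text_with_fallback and the choice of cp1252 as the second attempt are assumptions, not part of the loaders above.

```python
from pathlib import Path


def read_text_with_fallback(
    path: Path,
    encodings: tuple[str, ...] = ("utf-8", "cp1252"),
) -> tuple[str, str]:
    """Return (text, encoding_used), trying each encoding in order."""
    raw = path.read_bytes()
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this never raises,
    # though the result may be wrong for files in other encodings.
    return raw.decode("latin-1"), "latin-1"
```

Storing the encoding that was actually used in the document metadata makes it easier to trace mangled characters back to a wrong guess.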

Complete Document Processing Pipeline

The pipeline below ties everything together: load a file (any format), chunk it intelligently, embed the chunks, and package the result for storage in a vector database. The _hash_content method is a detail worth noting: it generates a fingerprint of each document so you can detect when a user re-uploads the same file and skip reprocessing. At scale, this deduplication saves significant embedding costs.

Combine extraction, chunking, and embedding:
from dataclasses import dataclass
from pathlib import Path
import hashlib
import json
from openai import OpenAI


@dataclass
class ProcessedDocument:
    """A fully processed document with chunks and embeddings."""
    source: str
    chunks: list[Chunk]
    embeddings: list[list[float]]
    metadata: dict


class DocumentPipeline:
    """Complete document processing pipeline."""
    
    def __init__(
        self,
        openai_client: OpenAI,
        chunk_size: int = 1000,
        chunk_overlap: int = 200,
        embedding_model: str = "text-embedding-3-small"
    ):
        self.client = openai_client
        self.loader = UniversalDocumentLoader()
        self.chunker = RecursiveChunker(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )
        self.embedding_model = embedding_model
    
    def process(self, path: str | Path) -> ProcessedDocument:
        """Process a single document."""
        path = Path(path)
        
        # Load document
        document = self.loader.load(path)
        
        # Chunk document
        chunks = self.chunker.chunk(
            document.content,
            metadata=document.metadata
        )
        
        # Generate embeddings
        embeddings = self._embed_chunks(chunks)
        
        return ProcessedDocument(
            source=str(path),
            chunks=chunks,
            embeddings=embeddings,
            metadata={
                **document.metadata,
                "document_hash": self._hash_content(document.content),
                "chunk_count": len(chunks),
            }
        )
    
    def process_batch(
        self,
        paths: list[str | Path]
    ) -> list[ProcessedDocument]:
        """Process multiple documents."""
        return [self.process(path) for path in paths]
    
    def _embed_chunks(self, chunks: list[Chunk]) -> list[list[float]]:
        """Generate embeddings for chunks."""
        if not chunks:
            return []
        
        texts = [chunk.text for chunk in chunks]
        
        # Batch embeddings (max 2048 per request)
        all_embeddings = []
        batch_size = 2048
        
        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            
            response = self.client.embeddings.create(
                model=self.embedding_model,
                input=batch
            )
            
            batch_embeddings = [e.embedding for e in response.data]
            all_embeddings.extend(batch_embeddings)
        
        return all_embeddings
    
    def _hash_content(self, content: str) -> str:
        """Generate hash of content for deduplication."""
        return hashlib.sha256(content.encode()).hexdigest()[:16]
    
    def save(self, processed: ProcessedDocument, output_path: str | Path):
        """Save processed document to JSON."""
        output_path = Path(output_path)
        
        data = {
            "source": processed.source,
            "metadata": processed.metadata,
            "chunks": [
                {
                    "text": chunk.text,
                    "metadata": chunk.metadata,
                    "chunk_index": chunk.chunk_index,
                    "embedding": processed.embeddings[i]
                }
                for i, chunk in enumerate(processed.chunks)
            ]
        }
        
        with open(output_path, "w") as f:
            json.dump(data, f)
    
    def load(self, input_path: str | Path) -> ProcessedDocument:
        """Load processed document from JSON."""
        input_path = Path(input_path)
        
        with open(input_path) as f:
            data = json.load(f)
        
        chunks = [
            Chunk(
                text=c["text"],
                metadata=c["metadata"],
                chunk_index=c["chunk_index"]
            )
            for c in data["chunks"]
        ]
        
        embeddings = [c["embedding"] for c in data["chunks"]]
        
        return ProcessedDocument(
            source=data["source"],
            chunks=chunks,
            embeddings=embeddings,
            metadata=data["metadata"]
        )


# Usage
client = OpenAI()
pipeline = DocumentPipeline(client)

# Process single document
processed = pipeline.process("research_paper.pdf")
print(f"Created {len(processed.chunks)} chunks")

# Save for later use
pipeline.save(processed, "processed_paper.json")

# Load processed document
loaded = pipeline.load("processed_paper.json")
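
As written, the pipeline computes the document hash only after chunking and embedding, so nothing is skipped yet. To save embedding costs, the hash has to be checked before that work starts. One way to do this is sketched below; ProcessedRegistry is a hypothetical in-memory class, and a real service would more likely keep the hashes in a database.

```python
import hashlib


class ProcessedRegistry:
    """Hypothetical in-memory registry keyed by content hash."""

    def __init__(self) -> None:
        self._seen: dict[str, str] = {}  # content hash -> first source seen

    @staticmethod
    def fingerprint(content: str) -> str:
        """Full SHA-256 hex digest of the document text."""
        return hashlib.sha256(content.encode("utf-8")).hexdigest()

    def register(self, content: str, source: str) -> bool:
        """Return True if the content is new, False if already seen."""
        digest = self.fingerprint(content)
        if digest in self._seen:
            return False
        self._seen[digest] = source
        return True


registry = ProcessedRegistry()
for source, content in [("a.txt", "same text"), ("copy_of_a.txt", "same text")]:
    if registry.register(content, source):
        print(f"Processing {source}")
    else:
        print(f"Skipping duplicate {source}")
```

In the pipeline, the check would sit between loading the document and calling the chunker, so duplicates never reach the embedding step.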

Chunking Strategy Comparison

| Strategy | Best For | Chunk Quality | Processing Cost | Complexity |
| --- | --- | --- | --- | --- |
| Fixed-size | Uniform documents (plain text, transcripts) | Adequate; may split mid-sentence | None (no API calls) | Low |
| Fixed-size with sentence boundaries | Most general use cases | Good; respects sentence structure | None | Low |
| Recursive (hierarchical separators) | Structured documents (markdown, HTML, code) | Very good; preserves document structure | None | Medium |
| Semantic (embedding-based) | Documents where topic boundaries don’t align with formatting | Excellent; cuts at topic transitions | High (embed every sentence) | High |
| Markdown/HTML header-based | Documentation, wikis, knowledge bases | Excellent for structured content | None | Medium |
| Token-based (tiktoken) | When you need precise token budget control | Good; guarantees token limits | None | Low |

Decision framework:
  1. Is your document well-structured with headers, paragraphs, and sections (markdown, HTML, docstrings)? Use recursive chunking — it preserves structure at near-zero cost.
  2. Is your document unstructured prose with no clear formatting (OCR output, email threads, transcripts)? Start with fixed-size + sentence boundary detection. Upgrade to semantic chunking only if retrieval quality is measurably poor.
  3. Are you processing code files? Use language-aware chunking that splits at function/class boundaries, not character counts. A function split across two chunks is useless for code Q&A.
  4. Do you have a hard token budget per chunk (e.g., for context window management)? Use token-based chunking with tiktoken to guarantee exact limits.
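
For the code case in step 3, Python's standard ast module can find function and class boundaries directly. The sketch below handles Python source only and is an illustration under stated assumptions: chunk_python_source is a hypothetical name, and module-level statements such as imports and constants are skipped rather than chunked.

```python
import ast


def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []

    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno is 1-based; end_lineno is inclusive (Python 3.8+).
            start = node.lineno - 1
            # Include decorators, which sit above the def/class line.
            if node.decorator_list:
                start = min(d.lineno for d in node.decorator_list) - 1
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "text": "\n".join(lines[start:node.end_lineno]),
            })
    return chunks
```

Other languages need their own parsers; the point is that the split follows the syntax tree rather than a character count.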

Document Processing Edge Cases

PDFs with mixed content. A financial report has text, tables, charts, and footnotes. The text extractor pulls everything into a flat string, losing the structure. Tables become garbled rows of numbers. Mitigation: extract tables separately with pdfplumber (as shown above), convert them to markdown, and tag them with metadata ("content_type": "table"). This lets your retrieval system surface the table when a user asks “what were Q3 revenue numbers?”

Documents with headers and footers repeated on every page. A 50-page PDF with “Company Confidential - Page X of 50” on every page creates 50 occurrences of that text in your chunks. These pollute embeddings and waste tokens. Add a deduplication step: extract text per page, identify repeated strings across pages, and strip them before chunking.

Unicode and encoding issues. Documents from different sources use different encodings. A Windows-generated CSV may use cp1252, while a Linux export uses UTF-8. The read_text(encoding="utf-8") call in the text loader will typically raise a UnicodeDecodeError when the file contains bytes that are not valid UTF-8. Add encoding detection (using chardet or charset_normalizer) as a fallback when utf-8 fails.

Empty or near-empty chunks. After splitting and cleaning, some chunks may contain only whitespace, a page number, or a header. These create useless embeddings that waste storage and pollute search results. Add a minimum content threshold (e.g., 50 characters of non-whitespace) and discard chunks that fall below it.

Very large single documents. A 500-page PDF generates thousands of chunks. Embedding all of them in one batch can time out or exceed API rate limits. The pipeline’s batching (2048 per request) handles this, but you also need progress tracking and error recovery: if embedding fails on batch 15 of 20, you should not re-embed batches 1-14.

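The header and footer deduplication step can be sketched as a line-frequency count across pages. This is a minimal illustration under stated assumptions: strip_repeated_lines and the 60% cutoff are hypothetical, and the input must be per-page text with newlines still intact, so it has to run before any whitespace collapsing.

```python
import re
from collections import Counter


def strip_repeated_lines(pages: list[str], min_fraction: float = 0.6) -> list[str]:
    """Remove lines that recur on at least min_fraction of pages."""
    def normalize(line: str) -> str:
        # Mask digits so "Page 3 of 50" and "Page 4 of 50" compare equal.
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for page in pages:
        # Count each distinct line at most once per page.
        counts.update({normalize(ln) for ln in page.splitlines() if ln.strip()})

    # Require at least two pages so a single-page document is left alone.
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {line for line, n in counts.items() if n >= threshold}

    return [
        "\n".join(ln for ln in page.splitlines() if normalize(ln) not in repeated)
        for page in pages
    ]
```

Digit masking can produce false positives, for example short numeric table rows that look identical once the numbers are masked, so it is worth logging what gets removed.
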
Chunking Best Practices
  • Start at 500-1000 characters — This maps to roughly 125-250 tokens. Smaller chunks give more precise retrieval but may lack context. Larger chunks provide more context but dilute the embedding signal.
  • Use 10-20% overlap — A 1000-char chunk with 200-char overlap means every boundary sentence appears in two chunks, so retrieval won’t miss content split across a boundary.
  • Match chunk size to query length — Short queries (“What is the refund policy?”) match better against shorter chunks. Long, complex queries match better against larger chunks. If your queries vary, consider indexing at multiple granularities.
  • Preserve metadata — Always store which document, page, and character offset each chunk came from. You will need this for citations, debugging, and re-chunking experiments.
  • Test with real queries — The only way to know if your chunk size is right is to run your actual user queries against the index and check if the right chunks surface in the top 5.
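
The last practice can be turned into a number you track across chunking experiments. The sketch below computes a simple hit rate at k. The function name and the shape of the inputs (chunk IDs per query, plus a hand-labeled set of relevant chunk IDs) are assumptions for illustration.

```python
def hit_rate_at_k(
    results: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int = 5,
) -> float:
    """Fraction of queries with at least one relevant chunk in the top k."""
    if not relevant:
        return 0.0
    hits = sum(
        1
        for query, expected in relevant.items()
        if expected & set(results.get(query, [])[:k])
    )
    return hits / len(relevant)


# Ranked chunk IDs returned by the index for each query (illustrative data)
results = {"refund policy?": ["c3", "c1"], "shipping time?": ["c9"]}
# Chunk IDs a human judged relevant for each query
relevant = {"refund policy?": {"c1"}, "shipping time?": {"c2"}}

print(hit_rate_at_k(results, relevant, k=5))  # 0.5
```

Re-running the same labeled queries after changing chunk size or overlap shows whether the change helped, which is more reliable than inspecting a few results by eye.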

Practice Exercise

Build a document ingestion service:
  1. Accept PDF, Markdown, and HTML uploads
  2. Extract text with proper formatting
  3. Implement configurable chunking strategies
  4. Generate embeddings in batches
  5. Store chunks in a vector database
Focus on:
  • Handling large documents efficiently
  • Deduplication using content hashes
  • Progress tracking for batch processing
  • Error recovery for partial failures
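
For the progress tracking and error recovery items, one possible starting point is to checkpoint embeddings after every batch. In the sketch below, embed_with_checkpoint and the embed_batch callable are hypothetical; embed_batch stands in for whatever function calls your embedding API. Resuming assumes the list of texts is unchanged between runs.

```python
import json
from pathlib import Path
from typing import Callable


def embed_with_checkpoint(
    texts: list[str],
    embed_batch: Callable[[list[str]], list[list[float]]],
    checkpoint_path: Path,
    batch_size: int = 2048,
) -> list[list[float]]:
    """Embed texts in batches, saving progress after every batch."""
    done: list[list[float]] = []
    if checkpoint_path.exists():
        done = json.loads(checkpoint_path.read_text(encoding="utf-8"))

    # Resume from the first text that has no saved embedding.
    for start in range(len(done), len(texts), batch_size):
        batch = texts[start:start + batch_size]
        done.extend(embed_batch(batch))
        checkpoint_path.write_text(json.dumps(done), encoding="utf-8")
        print(f"Embedded {len(done)}/{len(texts)} chunks")

    return done
```

Rewriting the whole checkpoint file on every batch keeps the sketch short but grows expensive for very large documents; appending one record per batch would scale better.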

Interview Deep-Dive

Strong Answer:
  • The fundamental issue is that PDFs are a page-layout format, not a text format. A PDF stores “draw character H at position (72, 400)” not “here is a paragraph.” Extraction quality varies wildly based on how the PDF was created. You need to detect the PDF type and route to the appropriate extraction strategy.
  • For digitally-created PDFs (Word exports, LaTeX output): PyMuPDF (fitz) extracts text cleanly. This is the fast path — call page.get_text("text") and clean whitespace. 80% of your documents likely fall here.
  • For scanned documents (images of printed text): you need OCR. Tesseract via pytesseract or a cloud OCR service (AWS Textract, Google Document AI) converts page images to text. The detection heuristic is simple: if PyMuPDF extracts fewer than 50 characters per page for a document that clearly has content, it is scanned. Google Document AI is significantly more accurate than Tesseract for complex layouts but costs money.
  • For two-column research papers: standard extraction interleaves the columns (“Left column line 1, Right column line 1, Left column line 2…”). PyMuPDF’s page.get_text("blocks") returns text blocks with bounding box coordinates. Assign each block to a column by its x-coordinate (left or right of the page midline), then sort by y-coordinate within each column. This reconstructs the reading order. Alternatively, use a layout-aware tool like Layout Parser or Unstructured.io that handles column detection automatically.
  • The production pipeline: detect PDF type (digital vs scanned), route to the appropriate extractor, apply domain-specific cleaning (remove headers/footers, fix hyphenated line breaks, handle ligatures), and validate output quality (flag pages with suspiciously low character counts for human review).
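The scanned-document heuristic and the column reordering described above can be sketched on plain data, without opening a PDF. The block tuples follow the `(x0, y0, x1, y1, text, block_no, block_type)` layout that PyMuPDF's `page.get_text("blocks")` returns. The function names, the averaging of the 50-character threshold across pages, and the assumption of exactly two columns split at the page midline are simplifications for illustration.

```python
def looks_scanned(page_texts: list[str], min_chars: int = 50) -> bool:
    """True if pages average fewer than min_chars extracted characters."""
    if not page_texts:
        return False
    average = sum(len(t.strip()) for t in page_texts) / len(page_texts)
    return average < min_chars


def two_column_reading_order(blocks: list[tuple], page_width: float) -> str:
    """Join text blocks left column first, top to bottom, then right column."""
    midline = page_width / 2

    def sort_key(block: tuple) -> tuple:
        x0, y0, x1 = block[0], block[1], block[2]
        # Use the block's horizontal center to decide which column it is in.
        column = 0 if (x0 + x1) / 2 < midline else 1
        return (column, y0, x0)

    # block_type 0 is text, 1 is an image block; keep text only.
    text_blocks = [b for b in blocks if b[6] == 0]
    return "\n".join(b[4].strip() for b in sorted(text_blocks, key=sort_key))


# With PyMuPDF, the inputs would come from something like:
#   blocks = page.get_text("blocks")
#   page_width = page.rect.width
```

A block that spans both columns, such as a centered title or a full-width figure caption, will be pushed into one column by this rule, which is one reason the layout-aware tools mentioned above exist.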
Follow-up: You are processing 10,000 PDFs per day. Some PDFs are corrupted, some are password-protected, and some are 500+ pages. How do you make this pipeline reliable at scale?
Defensive engineering at every step. First, wrap every extraction call in a try/except with a per-document timeout (30 seconds for extraction, 60 seconds for OCR). Corrupted PDFs throw exceptions; timeouts catch infinite loops. Second, check for password protection before extraction and skip or flag those documents. Third, for 500+ page documents, process pages in parallel batches of 50 rather than sequentially. Fourth, implement a dead letter queue: documents that fail extraction go into a retry queue with exponential backoff. After 3 failures, they move to a manual review queue. Fifth, use content hashing to detect re-uploads: if the SHA-256 hash matches an already-processed document, return the existing chunks instead of reprocessing. At 10,000 PDFs per day, even a 1% failure rate is 100 documents per day that need attention, so your monitoring must surface these clearly.
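The retry and dead-letter logic can be sketched as follows. Everything here is a hypothetical shape, not a specific queueing library: `extract` is whatever extraction function you use, `sleep` is injectable so the backoff can be tested without waiting, and the two dicts stand in for real queues. A hard per-document timeout is deliberately left out, since enforcing one reliably usually means running extraction in a separate process that can be killed.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class IngestReport:
    succeeded: dict = field(default_factory=dict)      # doc_id -> extracted text
    manual_review: dict = field(default_factory=dict)  # doc_id -> last error


def ingest_with_retries(
    doc_ids: list[str],
    extract: Callable[[str], str],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> IngestReport:
    """Try each document up to max_attempts times, backing off between tries."""
    report = IngestReport()
    for doc_id in doc_ids:
        last_error = ""
        for attempt in range(max_attempts):
            try:
                report.succeeded[doc_id] = extract(doc_id)
                break
            except Exception as exc:  # corrupted or encrypted files land here
                last_error = f"{type(exc).__name__}: {exc}"
                if attempt + 1 < max_attempts:
                    sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        else:
            # Every attempt failed: hand the document to a human.
            report.manual_review[doc_id] = last_error
    return report
```

Retrying inline like this blocks the worker during the backoff. With a real queue you would re-enqueue the document with a delay instead, but the bookkeeping (attempt count, last error, final destination) stays the same.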
Strong Answer:
  • At $0.02 per 1M tokens with text-embedding-3-small, embedding 5M sentences (average 20 tokens each) costs about $2. The cost is not the issue; the time is. At 100 sentences per batch and 200ms per API call, 5M sentences take about 2.8 hours. For a one-time indexing job, that is acceptable. For real-time document uploads, it is not.
  • Semantic chunking is worth it when: (1) your documents have long, flowing text without clear structural markers (think transcripts, meeting notes, freeform reports), (2) retrieval quality is critical and you have measured that recursive chunking produces chunks that split key information across boundaries, and (3) you are indexing once and querying many times, so the upfront cost amortizes over millions of queries.
  • Recursive chunking wins when: (1) documents have clear structure (headers, sections, numbered lists), (2) you need real-time processing (user uploads a document and expects it indexed in seconds), or (3) you are iterating on chunk parameters frequently and cannot afford to re-embed the entire corpus each time.
  • The pragmatic middle ground: use recursive chunking as the default (it handles 80% of documents well), and apply semantic chunking only to document types that recursive chunking handles poorly. Measure retrieval quality per document type — if technical manuals retrieve at 90% recall with recursive chunking but meeting transcripts only hit 65%, apply semantic chunking to transcripts only.
  • A lesser-known alternative: group sentences into windows of 3-5, compute the embedding similarity between adjacent windows, and split where similarity drops below a threshold. With non-overlapping windows this is cheaper than per-sentence embedding because you embed one window instead of 3-5 individual sentences, reducing the number of texts to embed by 3-5x.
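The cost and time estimate in the first bullet is simple arithmetic, and writing it down makes it easy to rerun with your own numbers. The helper below is a back-of-envelope sketch (its name and parameters are made up for this example). It assumes sequential API calls and uses the figures quoted above.

```python
def embedding_estimate(
    n_sentences: int,
    tokens_per_sentence: int,
    usd_per_million_tokens: float,
    batch_size: int,
    seconds_per_call: float,
) -> tuple:
    """Return (cost in USD, number of API calls, wall-clock hours)."""
    total_tokens = n_sentences * tokens_per_sentence
    cost = total_tokens / 1_000_000 * usd_per_million_tokens
    calls = -(-n_sentences // batch_size)  # ceiling division
    hours = calls * seconds_per_call / 3600
    return cost, calls, hours


cost_usd, api_calls, hours = embedding_estimate(
    n_sentences=5_000_000,
    tokens_per_sentence=20,
    usd_per_million_tokens=0.02,
    batch_size=100,
    seconds_per_call=0.2,
)
print(f"${cost_usd:.2f}, {api_calls:,} calls, {hours:.1f} hours")
# -> $2.00, 50,000 calls, 2.8 hours
```

Running calls concurrently would shorten the wall-clock time, subject to whatever rate limits your embedding provider applies.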
Follow-up: You implement semantic chunking and it produces great chunks for most documents, but for some documents it creates one giant chunk of 5,000 characters because the topic never changes enough to trigger a split. How do you handle this?
Add a hard maximum chunk size that overrides the semantic signal. If a chunk exceeds 2,000 characters and no semantic boundary has been detected, force-split at the nearest sentence boundary. This is the max_chunk_size parameter in the SemanticChunker implementation. The important detail is that the forced split should still try to find the weakest semantic boundary within the oversized chunk: the point where adjacent sentence similarity is lowest, even if it is above the normal threshold. This gives you “the best available split” rather than an arbitrary character-count cut. Also consider that uniformly high similarity might mean the document is about a single narrow topic, in which case large chunks are actually appropriate, because the embedding will be coherent and retrieval will work well.
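The "best available split" idea can be sketched as a recursive split at the lowest-similarity sentence boundary. This is an illustrative stand-alone function, not the SemanticChunker code itself: it assumes you already have the sentences and the similarity between each adjacent pair (`similarities[i]` compares sentence `i` with sentence `i + 1`), and the name `split_oversized` is hypothetical.

```python
def split_oversized(
    sentences: list[str],
    similarities: list[float],
    max_chunk_size: int = 2000,
) -> list[str]:
    """Split a run of sentences at its weakest boundaries until each piece fits."""
    if not sentences:
        return []
    if len(similarities) != len(sentences) - 1:
        raise ValueError("need one similarity per adjacent sentence pair")

    text = " ".join(sentences)
    # A single sentence cannot be split further here, even if it is too long.
    if len(text) <= max_chunk_size or len(sentences) == 1:
        return [text]

    # Weakest boundary: the adjacent pair with the lowest similarity,
    # even if that similarity is above the normal split threshold.
    cut = min(range(len(similarities)), key=similarities.__getitem__)

    left = split_oversized(sentences[:cut + 1], similarities[:cut], max_chunk_size)
    right = split_oversized(sentences[cut + 1:], similarities[cut + 1:], max_chunk_size)
    return left + right
```

Because the weakest boundary can sit near either end of the chunk, the pieces may be uneven. If that matters, one option is to restrict the search to boundaries in the middle portion of the chunk.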