Skip to main content

1. LLM Engineering & Prompt Design

Zero-shot prompting asks the model to perform a task without examples, relying only on pre-trained knowledge.
Few-shot prompting provides 1–5 examples to guide structure, style, and tone.
Use zero-shot when:
  • Task is simple and well-known (e.g., translation)
  • No examples available or cost-sensitive
  • You want unbiased responses
Use few-shot when:
  • Output format consistency is needed
  • Domain-specific context or edge cases exist
  • Zero-shot outputs are inconsistent
Example:
  • Zero-shot: “Classify sentiment: ‘I love this phone!’”
  • Few-shot: Add 2–3 labeled examples before the query.
Few-shot usually improves accuracy by 10–30% but increases token usage 3–5x.
Temperature controls randomness in text generation (range: 0–2).
  • Low (0–0.3): Deterministic, precise outputs (coding, classification)
  • Medium (0.4–0.7): Balanced tone (summaries, Q&A)
  • High (0.8–1.0): Creative, diverse results (brainstorming)
Examples:
  • Coding assistant → temp=0.1
  • Customer support → temp=0.4
  • Marketing content → temp=0.8
Combine with top_p (nucleus sampling) and top_k for fine control.
Prompt injection happens when malicious input manipulates an LLM to break rules or expose system prompts.Attack types:
  • Direct: “Ignore previous instructions…”
  • Context: Hidden malicious text in documents
  • Jailbreak: Role-play or DAN-style prompts
  • Prompt leak: Forcing model to reveal system prompt
Defenses:
  1. Input validation: Filter keywords like “ignore”, “system prompt”.
  2. Prompt separation: Clearly delimit system and user input.
  3. Instruction hierarchy: Reiterate rules after user input.
  4. Output validation: Sanitize responses before showing.
  5. Monitoring: Log blocked attempts for review.
Best practice: Layered security validate, isolate, and filter both input and output.
Tokenization is the process of breaking text into smaller units (tokens) that the model can process. Tokens can be words, subwords, or characters depending on the tokenizer.How it works:
  • Text → Token IDs → Model processing → Generation
  • Different tokenizers (BPE, WordPiece, SentencePiece) use different strategies
  • Token count directly affects cost and context window usage
Impact on generation:
  • Token limits: Models have maximum token limits (context window)
  • Cost: Pricing is typically per token (input + output)
  • Quality: Better tokenization preserves semantic meaning
  • Speed: Fewer tokens = faster processing
Example: “Hello world” might tokenize to ["Hello", " world"] (2 tokens) or ["Hel", "lo", " wor", "ld"] (4 tokens) depending on tokenizer.Best practices:
  • Understand your model’s tokenizer (GPT uses BPE, BERT uses WordPiece)
  • Monitor token usage to optimize costs
  • Consider tokenization when chunking documents for RAG
Embeddings convert text into dense numerical vectors (arrays of numbers) that capture semantic meaning. Similar texts have similar embeddings.How they work:
  1. Training: Models learn from large text corpora that words appearing in similar contexts should have similar vectors
  2. Vector space: Words/concepts are positioned in high-dimensional space (typically 384, 768, or 1536 dimensions)
  3. Similarity: Cosine similarity or dot product measures how “close” two embeddings are
  4. Semantic capture: “king” - “man” + “woman” ≈ “queen” (famous word2vec example)
Key concepts:
  • Dense vectors: Every dimension has meaning (unlike sparse one-hot encoding)
  • Fixed size: All texts map to same dimension vector
  • Learned representations: Capture semantic relationships from training data
Use cases:
  • Semantic search (find similar documents)
  • RAG (retrieve relevant context)
  • Clustering and classification
  • Recommendation systems
Example: “machine learning” and “artificial intelligence” have high cosine similarity (≈0.85) because they’re semantically related.
Attention mechanism:
  • Allows model to focus on relevant parts of input when generating each token
  • Computes weighted relationships between all tokens
  • Enables understanding of long-range dependencies
  • Self-attention: tokens attend to other tokens in same sequence
Positional encoding:
  • Adds information about token position since transformers process all tokens in parallel
  • Without it, “dog bites man” and “man bites dog” would be identical
  • Can be learned (BERT) or fixed sinusoidal patterns (original Transformer)
Why both matter:
  • Attention: “what to focus on” (semantic relationships)
  • Position: “where things are” (order matters for meaning)
  • Together: Model understands both meaning and structure
Example: In “The cat sat on the mat”, attention helps model understand “cat” relates to “sat” and “mat”, while positional encoding ensures correct order.
Fine-tuning adapts a pre-trained model to specific tasks or domains.What changes:
  1. Model weights: Selected layers get updated (not all layers necessarily)
  2. Learning rate: Much lower than pre-training (typically 1e-5 to 1e-3)
  3. Optimizers: Often AdamW or Adam with weight decay
  4. Schedulers: Cosine annealing, linear warmup, or constant LR
  5. Layer freezing: Early layers often frozen, only top layers trained
Common strategies:
  • Full fine-tuning: All parameters updated (expensive, needs more data)
  • PEFT (Parameter-Efficient Fine-Tuning): LoRA, Adapters, only train small subset
  • Layer freezing: Freeze embeddings and early transformer layers, train only classifier head
Hyperparameters:
  • Batch size: 4-32 (depends on GPU memory)
  • Epochs: 1-5 (often 1-3 is enough)
  • Gradient accumulation: Simulate larger batches
  • Mixed precision: FP16/BF16 for memory efficiency
Best practices:
  • Start with frozen layers, gradually unfreeze
  • Use learning rate finder to find optimal LR
  • Monitor validation loss to prevent overfitting
  • Save checkpoints frequently
The “Attention is All You Need” paper revolutionized NLP by showing attention alone (without RNNs/CNNs) could achieve state-of-the-art results.Why attention works:
  1. Parallelization: Unlike RNNs, all tokens processed simultaneously (faster training)
  2. Long-range dependencies: Direct connections between any two tokens (RNNs struggle with distance)
  3. Interpretability: Attention weights show what model focuses on
  4. Flexibility: Can attend to any part of input, not just sequential neighbors
Key innovations:
  • Multi-head attention: Multiple attention mechanisms capture different relationships
  • Self-attention: Tokens attend to other tokens in same sequence
  • Scaled dot-product attention: Efficient computation with scaling factor
Why it’s not just marketing:
  • Empirically proven: Achieved SOTA on translation, outperformed RNNs/CNNs
  • Enables modern LLMs: GPT, BERT, T5 all use attention
  • Scalable: Works with billions of parameters
  • Foundation for current AI: Most LLMs are transformer-based
Limitation: Quadratic complexity with sequence length (O(n²)), but recent work (Flash Attention, sparse attention) mitigates this.
Encoder-only (BERT, RoBERTa):
  • Bidirectional understanding (sees full context)
  • Best for: Classification, NER, sentiment analysis, understanding tasks
  • Example: BERT for question answering (reads passage, finds answer)
Decoder-only (GPT, LLaMA):
  • Autoregressive generation (predicts next token)
  • Best for: Text generation, completion, chat, creative writing
  • Example: GPT for story generation or code completion
Encoder-decoder (T5, BART):
  • Both understanding and generation
  • Best for: Translation, summarization, text-to-text tasks
  • Example: T5 for “translate English to French: Hello” → “Bonjour”
Decision framework:
  • Need to understand input? → Encoder or encoder-decoder
  • Need to generate text? → Decoder or encoder-decoder
  • Need both? → Encoder-decoder
  • Most modern LLMs are decoder-only (GPT-style) because they’re more flexible
Modern trend: Decoder-only models (GPT-4, Claude) can do both understanding and generation, making them versatile for most tasks.
BPE (Byte Pair Encoding):
  • Used by: GPT, RoBERTa
  • How: Starts with characters, iteratively merges most frequent pairs
  • Pros: Handles unknown words, good for multilingual
  • Cons: Can split words awkwardly
  • Breaks down: Very long words, domain-specific terms
WordPiece:
  • Used by: BERT, DistilBERT
  • How: Similar to BPE but uses language model likelihood
  • Pros: Better word boundaries, handles subwords well
  • Cons: Less flexible than BPE
  • Breaks down: Rare technical terms, code snippets
SentencePiece:
  • Used by: T5, ALBERT, multilingual models
  • How: Treats input as Unicode, works at sentence level
  • Pros: Language-agnostic, handles any Unicode text
  • Cons: Can be slower, larger vocabulary
  • Breaks down: Very rare characters, mixed scripts
Comparison:
  • BPE: Best for general-purpose, multilingual
  • WordPiece: Best for English, better word preservation
  • SentencePiece: Best for multilingual, code, special characters
When they break down:
  • Very long technical terms (e.g., chemical names)
  • Mixed languages in single sentence
  • Code with special syntax
  • Emojis and special Unicode characters
  • Domain-specific jargon not in training data
Why cosine similarity dominates:
  1. Magnitude-independent: Focuses on direction, not vector length
  2. Normalized: Range is [-1, 1], easy to interpret
  3. Efficient: Fast computation, works well with approximate nearest neighbor search
  4. Semantic focus: Captures semantic similarity better than Euclidean distance
Mathematical intuition:
  • Cosine similarity = dot product of normalized vectors
  • Measures angle between vectors, not distance
  • Vectors pointing same direction = similar meaning
When it works well:
  • Semantic search (find similar documents)
  • Recommendation systems
  • Clustering similar texts
  • RAG retrieval
When it fails:
  1. Magnitude matters: If vector length encodes importance, cosine ignores it
  2. Sparse vectors: Works poorly with very sparse embeddings
  3. High dimensionality: Can become less discriminative in very high dimensions
  4. Domain mismatch: Embeddings from different models aren’t comparable
  5. Fine-grained differences: May not capture subtle distinctions
Alternatives:
  • Dot product: When magnitude matters
  • Euclidean distance: When absolute distance is important
  • Manhattan distance: For sparse vectors
  • Learned similarity: Train a model to learn similarity function
Best practice: Use cosine for semantic similarity, but validate on your specific use case.
Context window constraints:
  • Models have fixed maximum context (e.g., GPT-4: 128k tokens, Claude: 200k)
  • Input + output must fit within limit
  • Longer context = higher cost and latency
Design implications:
  • Must truncate or summarize long documents
  • Need strategies for multi-turn conversations
  • RAG becomes essential for knowledge beyond context
  • Chunking strategy critical for document processing
Long-context strategies:
  1. Sliding window: Process document in overlapping chunks
  2. Hierarchical summarization: Summarize chunks, then summarize summaries
  3. Retrieval: Use RAG to fetch relevant parts instead of including everything
  4. Context compression: Use smaller models to compress context before main model
  5. Relevance filtering: Only include most relevant parts of long documents
Production hacks:
  • Prompt caching: Cache system prompts to save tokens
  • Streaming: Start generating before full context processed
  • Progressive loading: Load context incrementally
  • Smart truncation: Keep beginning and end, truncate middle (often most important parts)
Example: For 100k token document with 8k context window:
  1. Chunk into 10 pieces of 8k each
  2. Embed and store in vector DB
  3. For query, retrieve top 3 most relevant chunks
  4. Include only those in context (24k tokens, fits in window)
Trade-offs:
  • More context = better understanding but higher cost
  • Less context = faster/cheaper but may miss information
  • Balance based on use case requirements
Greedy decoding:
  • Always picks highest probability token
  • Fastest, deterministic
  • Can get stuck in repetitive loops
  • Best for: Code generation, when determinism needed
Beam search:
  • Keeps top-k candidates at each step
  • Explores multiple paths, finds better sequences
  • Slower (k× slower), more memory
  • Best for: Translation, when quality > speed
Nucleus (top-p) sampling:
  • Samples from smallest set covering p% of probability mass
  • Good balance of quality and diversity
  • Faster than beam, more diverse than greedy
  • Best for: Creative tasks, chat, when need variety
For summarization API under latency pressure:
  • Choose: Greedy or top-p with low temperature (0.3-0.5)
  • Why: Summarization needs accuracy, not creativity
  • Greedy: Fastest, good for factual summaries
  • Top-p (p=0.9, temp=0.3): Slightly slower but more natural phrasing
Recommendation:
  • Start with greedy for maximum speed
  • If quality issues, use top-p with low temperature
  • Avoid beam search (too slow for API)
  • Consider caching common summaries
Production tip: Use streaming with greedy/top-p to send first tokens immediately, improving perceived latency.
Why positional encoding is critical:
  • Transformers process all tokens in parallel (no inherent order)
  • “The cat sat” vs “sat cat The” would be identical without position info
  • Position encoding tells model where each token is in sequence
What happens without it:
  • Model can’t distinguish word order
  • “Dog bites man” = “Man bites dog” (same meaning to model)
  • Grammar and syntax understanding breaks down
  • Language is inherently sequential order matters
Types of positional encoding:
  1. Fixed sinusoidal: Original Transformer, mathematical patterns
  2. Learned: BERT-style, model learns positions during training
  3. Relative: T5-style, encodes relative distances between tokens
When models “forget” order:
  • Position encoding gets corrupted or removed
  • Very long sequences beyond training length
  • Position embeddings not properly initialized
  • Result: Nonsensical output, loss of grammatical structure
Example failure:
  • Input: “I love programming in Python”
  • Without position: Model might generate “Python in programming love I”
  • With position: Correct order maintained
Best practices:
  • Always include positional encoding
  • For long contexts, use models trained on long sequences
  • Consider relative position encoding for variable-length inputs
Distilled models (smaller, faster):
  • Examples: DistilBERT, TinyBERT, GPT-3.5-turbo vs GPT-4
  • Trained to mimic larger models
  • 2-10× smaller, 3-5× faster
When to choose distilled:
  1. Latency constraints: Real-time applications, edge devices
  2. Cost optimization: Lower inference costs, especially at scale
  3. Resource limits: Mobile apps, embedded systems, limited GPU memory
  4. Simple tasks: When smaller model is sufficient (classification, simple Q&A)
  5. High throughput: Need to process many requests quickly
When to choose frontier model:
  1. Complex reasoning: Need advanced capabilities (GPT-4, Claude Opus)
  2. Quality critical: When accuracy is more important than speed
  3. Novel tasks: Tasks smaller models can’t handle
  4. Low volume: When cost isn’t concern, quality is priority
Decision framework:
  • Task complexity: Simple → distilled, complex → frontier
  • Latency requirement: <100ms → distilled, can wait → frontier
  • Volume: High volume → distilled, low volume → frontier
  • Budget: Limited → distilled, flexible → frontier
Production strategy:
  • Use distilled for 80% of requests (fast, cheap)
  • Route complex queries to frontier model (smart routing)
  • A/B test to find right balance
Example: Customer support chatbot
  • Use GPT-3.5-turbo for common questions (fast, cheap)
  • Escalate complex issues to GPT-4 (better reasoning)
Knowledge cutoff:
  • Date when model’s training data ends
  • Model doesn’t know events/information after that date
  • Example: GPT-4 trained on data up to April 2023
Testing impact:
  1. Create test set: Questions about events before and after cutoff
  2. Measure accuracy: Compare performance on pre vs post-cutoff questions
  3. Check hallucinations: Model may confidently make up post-cutoff information
  4. Domain-specific: Test your specific domain (tech, finance, etc.)
Test methodology:
  • Before cutoff: “What happened in 2022?” → Should answer correctly
  • After cutoff: “What happened in 2024?” → May hallucinate or say “I don’t know”
  • Edge cases: Events right around cutoff date
Mitigation strategies:
  1. RAG: Use retrieval to get current information
  2. Web search: Integrate search API for recent events
  3. Fine-tuning: Fine-tune on recent data (if available)
  4. Hybrid approach: Use model for reasoning, external sources for facts
Production monitoring:
  • Track questions about recent events
  • Flag potential hallucinations
  • Use RAG for time-sensitive queries
  • Set expectations with users about knowledge cutoff
Example test:
  • Query: “Latest Python version in 2024?”
  • Without RAG: May give outdated answer or hallucinate
  • With RAG: Retrieves current info, gives accurate answer
Best practice: Always use RAG for time-sensitive information, regardless of model’s knowledge cutoff.
Key factors:
  1. Scale: Number of vectors, query volume
  2. Latency: Response time requirements
  3. Features: Filtering, metadata, hybrid search
  4. Deployment: Managed vs self-hosted
  5. Cost: Pricing model, infrastructure costs
Comparison:Pinecone (Managed):
  • Pros: Easy setup, good performance, managed scaling
  • Cons: Expensive at scale, vendor lock-in
  • Best for: Quick prototypes, small to medium scale
Chroma (Self-hosted):
  • Pros: Open source, easy to use, good for development
  • Cons: Less scalable, fewer features
  • Best for: Development, small projects, learning
Weaviate (Self-hosted/Managed):
  • Pros: Feature-rich, good performance, hybrid search
  • Cons: More complex setup
  • Best for: Production systems needing advanced features
Milvus (Self-hosted):
  • Pros: Highly scalable, production-ready, open source
  • Cons: Complex setup, needs infrastructure
  • Best for: Large-scale production systems
OpenSearch/Elasticsearch:
  • Pros: Mature, good ecosystem, supports vector search
  • Cons: Not optimized specifically for vectors
  • Best for: When you need full-text + vector search
Qdrant:
  • Pros: Fast, good filtering, open source
  • Cons: Smaller community
  • Best for: Performance-critical applications
Decision framework:
  • Prototype: Chroma or Pinecone
  • Production <10M vectors: Pinecone or Weaviate
  • Production >10M vectors: Milvus or Qdrant
  • Need full-text search: OpenSearch
  • Budget constrained: Self-hosted (Chroma, Milvus, Qdrant)
Best practice: Start with managed (Pinecone) for speed, migrate to self-hosted (Milvus) as you scale.
Challenge:
  • Updating embedding model changes all vector representations
  • Old and new embeddings aren’t compatible
  • Need to re-embed all documents without service interruption
Zero-downtime strategies:1. Dual-write approach:
  • Write to both old and new vector DBs simultaneously
  • Gradually migrate reads from old to new
  • Once migration complete, deprecate old DB
2. Blue-green deployment:
  • Maintain two environments (blue = old, green = new)
  • Re-embed all documents in green environment
  • Switch traffic when ready
  • Keep blue as backup
3. Incremental backfill:
  • Re-embed documents in batches
  • Use message queue to process updates
  • Update vector DB incrementally
  • Route queries to appropriate DB based on document version
4. Versioned embeddings:
  • Store multiple embedding versions per document
  • Query both versions, merge results
  • Gradually phase out old version
Implementation:
  1. Preparation: Set up new embedding pipeline, new vector DB
  2. Dual-write: New documents go to both DBs
  3. Backfill: Re-process existing documents in background
  4. Gradual cutover: Route percentage of queries to new DB
  5. Validation: Compare results, ensure quality maintained
  6. Full cutover: Switch all traffic to new DB
  7. Cleanup: Remove old DB after validation period
Best practices:
  • Use feature flags to control rollout
  • Monitor metrics during migration
  • Keep old system as fallback
  • Test with small subset first
  • Document the process
Example: Upgrading from text-embedding-ada-002 to text-embedding-3-small
  • Embed new documents with both models
  • Backfill existing documents in background
  • Gradually switch queries to new embeddings
  • Validate quality hasn’t degraded
Evaluation metrics:Precision@k:
  • Fraction of retrieved items that are relevant
  • Precision@5 = 3 relevant out of 5 retrieved = 0.6
  • Measures accuracy of top-k results
Recall@k:
  • Fraction of all relevant items that were retrieved
  • Recall@10 = 7 relevant retrieved out of 10 total relevant = 0.7
  • Measures coverage
MRR (Mean Reciprocal Rank):
  • Average of 1/rank of first relevant result
  • Higher is better, emphasizes top results
NDCG (Normalized Discounted Cumulative Gain):
  • Considers ranking quality, discounts lower positions
  • Best for when relevance has degrees (highly relevant vs somewhat relevant)
Reranking:
  • Second-stage ranking using more expensive model
  • Improves precision by reordering initial results
  • Trade-off: Better quality but higher latency/cost
Citation evaluation:
  • Check if retrieved documents support the answer
  • Verify citations are accurate and relevant
  • Measure citation precision (correct citations / total citations)
Evaluation process:
  1. Create test set: Queries with known relevant documents
  2. Run retrieval: Get top-k results for each query
  3. Label relevance: Human annotators mark relevant/irrelevant
  4. Calculate metrics: Precision@k, Recall@k, MRR, NDCG
  5. Iterate: Improve retrieval based on results
Production monitoring:
  • Track precision@k over time
  • Monitor user feedback (thumbs up/down)
  • A/B test different retrieval strategies
  • Alert on quality degradation
Best practices:
  • Use multiple metrics (precision + recall)
  • Test on domain-specific data
  • Monitor in production, not just offline
  • Use reranking for critical queries

2. Prompting & Context Engineering

Zero-shot outperforms when:
  • Task is well-defined and common (translation, summarization)
  • Model has strong pre-training on the task
  • Cost/token usage is critical
  • Need unbiased responses without example influence
  • Examples are hard to construct or may introduce bias
Few-shot outperforms when:
  • Need specific output format (JSON, structured data)
  • Domain-specific terminology or edge cases
  • Zero-shot produces inconsistent results
  • Task requires demonstration of pattern
  • Working with unusual or complex patterns
Real-world examples:Zero-shot better:
  • Translation: “Translate to French: Hello” (model knows this well)
  • Simple classification: “Is this positive or negative?” (clear task)
  • General Q&A: “What is machine learning?” (common knowledge)
Few-shot better:
  • Code generation with specific style: Need examples showing preferred patterns
  • Complex extraction: “Extract entities in this format: [name, age, location]”
  • Domain-specific: Medical terminology, legal documents (need examples)
  • Multi-step reasoning: Chain-of-thought needs examples
Decision rule:
  • Start with zero-shot (simpler, cheaper)
  • Add few-shot if quality/consistency issues
  • Monitor token usage vs quality trade-off
  • A/B test to measure actual improvement
Best practice: Use few-shot strategically 2-3 high-quality examples often better than 5+ mediocre ones.
Why CoT fails at scale:
  1. Model limitations:
    • Smaller models lack reasoning capacity
    • Can’t maintain coherent reasoning chains
    • Gets confused with complex multi-step problems
  2. Prompt quality:
    • Poor examples lead to poor reasoning
    • Inconsistent formatting confuses model
    • Too many steps overwhelm model
  3. Error propagation:
    • Early reasoning mistake cascades
    • Model can’t self-correct mid-chain
    • Accumulates errors across steps
  4. Context limits:
    • Long reasoning chains exceed context
    • Model forgets earlier steps
    • Truncation breaks reasoning flow
  5. Task mismatch:
    • CoT not suitable for all tasks
    • Simple tasks don’t need reasoning
    • Over-engineering can hurt performance
When CoT works:
  • Large models (GPT-4, Claude) with strong reasoning
  • Complex problems requiring multi-step thinking
  • Well-constructed prompts with clear examples
  • Tasks that benefit from explicit reasoning
When CoT fails:
  • Small models trying to reason beyond capacity
  • Simple tasks that don’t need reasoning
  • Poorly constructed prompts
  • Tasks requiring factual recall, not reasoning
Mitigation strategies:
  • Use CoT only with capable models
  • Start with simple CoT, add complexity gradually
  • Validate reasoning steps, not just final answer
  • Use self-consistency (generate multiple chains, pick best)
  • Monitor for reasoning quality, not just answer correctness
Best practice: Test CoT on your specific use case it’s not always better than direct prompting.
Versioning strategies:
  1. Git-based versioning:
    • Store prompts in version control
    • Tag versions, track changes
    • Enable rollback to previous versions
    • Review prompt changes like code
  2. Prompt registry:
    • Centralized system for prompt management
    • Version numbers, metadata (author, date, purpose)
    • A/B testing different versions
    • Track performance per version
  3. Template system:
    • Parameterized prompts with variables
    • Version templates, not individual prompts
    • Easier to update and maintain
    • Example: {system_prompt_v2} + {user_input}
  4. Configuration files:
    • YAML/JSON configs for prompts
    • Environment-specific prompts (dev, prod)
    • Easy to update without code changes
    • Version configs separately
Best practices:
  • Naming convention: prompt_v1.2.3_task_name
  • Documentation: Document why each version exists
  • Testing: Test prompts before deploying
  • Monitoring: Track performance per version
  • Rollback plan: Keep previous versions for quick rollback
Implementation example:
prompts:
  classification_v1.0:
    system: "You are a classifier..."
    examples: [...]
    created: "2024-01-15"
  
  classification_v1.1:
    system: "You are an expert classifier..."
    examples: [...]
    created: "2024-02-01"
    changes: "Added domain-specific examples"
Production workflow:
  1. Develop prompt in staging
  2. Version and test
  3. Deploy with feature flag
  4. Monitor performance
  5. Gradually roll out
  6. Keep old version as fallback
Best practice: Treat prompts like code version, test, review, and monitor.
Systematic debugging approach:
  1. Logging:
    • Log all prompts and responses
    • Include metadata (timestamp, user, model version)
    • Store for analysis and debugging
    • Enable search and filtering
  2. Categorize failures:
    • Format errors: Wrong output structure
    • Hallucinations: Made-up information
    • Refusals: Model refuses valid requests
    • Inconsistency: Same input, different outputs
    • Off-topic: Model goes off-topic
  3. Root cause analysis:
    • Prompt issues: Ambiguous instructions, poor examples
    • Model limitations: Task beyond model capability
    • Input quality: Garbage in, garbage out
    • Context problems: Missing or wrong context
    • Parameter issues: Wrong temperature, top_p settings
  4. Debugging techniques:
    • Simplify: Remove complexity, test basic version
    • Isolate: Test individual components
    • Compare: A/B test different prompts
    • Iterate: Make small changes, test each
    • Validate: Check against known good examples
  5. Tools:
    • Prompt testing frameworks
    • A/B testing platforms
    • Evaluation metrics (accuracy, latency)
    • User feedback collection
Debugging checklist:
  • Is prompt clear and unambiguous?
  • Are examples high-quality and relevant?
  • Is context complete and accurate?
  • Are parameters (temp, top_p) appropriate?
  • Is model capable of the task?
  • Are there edge cases not handled?
Best practices:
  • Build test suite of known good/bad cases
  • Monitor failure rates and patterns
  • Create runbook for common issues
  • Document solutions for future reference
  • Set up alerts for quality degradation
Example debugging session:
  1. User reports: “Model gives wrong answer”
  2. Check logs: Find prompt and response
  3. Reproduce: Run same prompt, see if consistent
  4. Simplify: Test with minimal prompt
  5. Compare: Try different prompt variations
  6. Fix: Identify issue, update prompt
  7. Validate: Test on known cases
  8. Deploy: Roll out fix with monitoring
Guardrail approaches:Regex filters:
  • How: Pattern matching on input/output
  • Pros: Fast, simple, interpretable, no model needed
  • Cons: Brittle, easy to bypass, can’t understand context
  • Use when: Simple keyword blocking, known patterns
Classifiers:
  • How: ML model classifies content (toxic, PII, etc.)
  • Pros: Understands context, more robust, can be tuned
  • Cons: Needs training data, slower, may have false positives
  • Use when: Need semantic understanding, complex patterns
Fine-tuning:
  • How: Train model to refuse harmful requests
  • Pros: Most robust, understands nuance, built-in
  • Cons: Expensive, time-consuming, may reduce capabilities
  • Use when: Need model-level safety, have resources
Trade-offs:
ApproachSpeedRobustnessCostComplexity
RegexFastLowLowLow
ClassifierMediumMediumMediumMedium
Fine-tuningSlowHighHighHigh
Layered approach (recommended):
  1. Regex: Block obvious patterns (quick wins)
  2. Classifier: Catch semantic issues (context-aware)
  3. Fine-tuning: Model-level safety (deep protection)
  4. Output validation: Final check before returning
Best practice:
  • Start with regex for known issues
  • Add classifier for complex cases
  • Use fine-tuning for critical safety requirements
  • Combine approaches for defense in depth
Example:
  • Regex: Block known attack patterns
  • Classifier: Detect toxic content
  • Fine-tuning: Model refuses harmful requests
  • Output validation: Final safety check
Instruction injection:
  • Explicit rules, constraints, format requirements
  • Pros: More control, consistent output, predictable
  • Cons: Can be too rigid, may limit creativity, longer prompts
Free-flow:
  • Minimal instructions, let model be creative
  • Pros: Natural responses, creative, flexible
  • Cons: Less control, inconsistent, may go off-topic
Balancing strategies:
  1. Task-dependent:
    • Structured tasks: More instructions (extraction, formatting)
    • Creative tasks: Less instructions (writing, brainstorming)
    • Critical tasks: More instructions (safety, accuracy)
  2. Progressive disclosure:
    • Start with minimal instructions
    • Add constraints only if needed
    • Test to find minimum viable instructions
  3. Layered instructions:
    • Core instructions (always)
    • Optional constraints (when needed)
    • Examples (for complex tasks)
  4. User control:
    • Let users choose strictness level
    • “Creative mode” vs “Precise mode”
    • Adjust instructions based on user preference
Best practices:
  • Start minimal, add only what’s necessary
  • Test different instruction levels
  • Monitor user satisfaction
  • Balance control with naturalness
  • Document instruction rationale
Example evolution:
  • v1: “Summarize this text” (too free, inconsistent)
  • v2: “Summarize in 3 bullet points” (better structure)
  • v3: “Summarize in 3 bullet points, each max 20 words” (too rigid)
  • v4: “Summarize in 3 concise bullet points” (balanced)
Decision framework:
  • Need consistency? → More instructions
  • Need creativity? → Less instructions
  • Need safety? → More instructions
  • Need naturalness? → Less instructions
Prompt drift monitoring:What to monitor:
  1. Output quality metrics:
    • Accuracy, relevance, correctness
    • User satisfaction (thumbs up/down)
    • Task completion rate
    • Error rates
  2. Output characteristics:
    • Response length (may indicate drift)
    • Tone/style changes
    • Format consistency
    • Hallucination rate
  3. Model behavior:
    • Refusal rate (may increase/decrease)
    • Confidence scores (if available)
    • Response time (may indicate issues)
    • Token usage (may change)
Monitoring design:
  1. Baseline establishment:
    • Measure metrics on known good prompts
    • Establish normal ranges
    • Set thresholds for alerts
  2. Continuous tracking:
    • Log all prompts and responses
    • Calculate metrics in real-time
    • Store for historical analysis
  3. Anomaly detection:
    • Statistical tests (z-scores, percentiles)
    • Machine learning models (detect patterns)
    • Rule-based alerts (threshold breaches)
  4. Alerting:
    • Real-time alerts for critical issues
    • Daily/weekly reports for trends
    • Dashboard for visualization
Implementation:
# Pseudo-code
def monitor_prompt_drift():
    baseline_metrics = get_baseline()
    current_metrics = calculate_current_metrics()
    
    for metric in metrics:
        if abs(current - baseline) > threshold:
            alert(f"Drift detected in {metric}")
    
    if user_satisfaction_dropped():
        alert("User satisfaction declining")
Best practices:
  • Monitor multiple metrics (not just one)
  • Use statistical significance tests
  • Track trends over time
  • Set up automated alerts
  • Have runbook for common issues
  • Regular review and adjustment
Example monitoring:
  • Accuracy: Baseline 95%, current 92% → Alert
  • Response length: Baseline 100 tokens, current 150 → Investigate
  • User satisfaction: Baseline 4.5/5, current 3.8/5 → Alert
Zero-shot works better:
  • Well-known tasks (translation, summarization)
  • When model has strong pre-training
  • Cost/token sensitive scenarios
  • Need unbiased responses
  • Examples are hard to construct
Few-shot works better:
  • Need specific output format
  • Domain-specific terminology
  • Complex or unusual patterns
  • Zero-shot produces inconsistent results
  • Task requires demonstration
Decision matrix:
Task TypeZero-shotFew-shot
Simple classification
Translation
Code generation
Complex extraction
General Q&A
Domain-specific
Best practice: Start with zero-shot, add few-shot only if needed. Test to measure actual improvement.
Robust system prompt design:
  1. Clear role definition:
    • Define model’s role clearly
    • Set boundaries and limitations
    • Specify behavior expectations
  2. Explicit constraints:
    • What model should do
    • What model shouldn’t do
    • How to handle edge cases
  3. Consistent structure:
    • Use clear sections (Role, Instructions, Constraints)
    • Consistent formatting
    • Easy to read and maintain
  4. Test with diverse inputs:
    • Test with different user types
    • Test edge cases
    • Test adversarial inputs
    • Test various languages/styles
  5. Version and iterate:
    • Version system prompts
    • A/B test different versions
    • Monitor performance
    • Update based on feedback
Example robust system prompt:
You are a helpful assistant. Your role is to:
- Answer questions accurately and helpfully
- Admit when you don't know something
- Refuse harmful or inappropriate requests

Constraints:
- Always be truthful
- Don't make up information
- If unsure, say so

Format:
- Use clear, concise language
- Structure complex answers with bullet points
Best practices:
  • Keep prompts concise but complete
  • Use examples for complex behaviors
  • Test with real users
  • Monitor for prompt injection attempts
  • Update based on observed issues
Determinism strategies:
  1. Temperature = 0:
    • Pure greedy decoding
    • Always picks highest probability token
    • Most deterministic approach
  2. Fixed seed:
    • Set random seed for reproducibility
    • Same seed = same output
    • Works with temperature > 0
  3. Top-k = 1:
    • Only consider top token
    • Combined with temperature = 0
    • Maximum determinism
  4. Prompt caching:
    • Cache system prompts
    • Reduces variability from prompt processing
    • Improves consistency
Trade-offs:
  • Deterministic: Predictable but may be less natural
  • Non-deterministic: More natural but less predictable
  • Balanced: Low temperature (0.1-0.3) for slight variation
When to use:
  • Deterministic: Testing, debugging, when consistency critical
  • Non-deterministic: Creative tasks, when variety desired
  • Balanced: Most production use cases
Best practice: Use temperature = 0 for deterministic tasks, low temperature (0.1-0.3) for natural but consistent outputs.
Context management:
  1. Versioning:
    • Version context documents
    • Track changes over time
    • Enable rollback to previous versions
  2. Tracking:
    • Log which context was used for each query
    • Store context version with responses
    • Enable audit trail
  3. Backfilling:
    • Re-process queries with updated context
    • Update responses if context changed
    • Notify users of significant changes
Implementation:
  • Store context in versioned database
  • Tag queries with context version
  • Re-run queries when context updates
  • Compare old vs new responses
Best practices:
  • Version all context documents
  • Track context usage per query
  • Automate backfilling for critical updates
  • Monitor for context-related issues
  • Document context changes
Memory systems for LLMs:
  1. Short-term memory (conversation context):
    • Last N messages in conversation
    • Stored in session/cache
    • TTL: 1-24 hours
  2. Long-term memory (user preferences):
    • User profile, preferences
    • Stored in database
    • Persists across sessions
  3. Episodic memory (conversation history):
    • Past conversations
    • Searchable, retrievable
    • Used for context
  4. Semantic memory (knowledge base):
    • RAG system with embeddings
    • Retrieves relevant information
    • Updates as knowledge changes
Maintenance:
  • Refresh: Update memory with new information
  • Prune: Remove outdated information
  • Validate: Check memory accuracy
  • Index: Make memory searchable
Best practices:
  • Use RAG for knowledge memory
  • Store user preferences in database
  • Cache recent conversations
  • Index memory for fast retrieval
  • Regularly update and validate

3. AI System Architecture

Requirements: 10k users, <2s latency, 99.9% uptime, cost-efficient.Architecture overview:
  1. Load Balancer: NGINX / AWS ALB (SSL termination, DDoS protection)
  2. API Gateway: Kong / AWS Gateway (rate limiting, JWT auth)
  3. App Servers: FastAPI / Express (Kubernetes, auto-scaling)
  4. Cache: Redis cluster for sessions & frequent responses
  5. Model Serving:
    • Managed (OpenAI/Vertex) for simplicity
    • Self-hosted (vLLM + A100 GPUs) for cost optimization
  6. Message Queue: RabbitMQ / Kafka for async tasks
  7. Databases: PostgreSQL (metadata) + Pinecone/Milvus (vector search)
  8. Monitoring: Prometheus + Grafana + ELK stack
Flow: User → Gateway → Cache → DB/vector store → Model → Streamed response
Cost: ~33k(managed)or 33k (managed) or ~15k (self-hosted) monthly.
Reliability: Circuit breakers, auto-scaling, blue-green deployments.
Vector DBs (Pinecone, Milvus, Weaviate) store high-dimensional embeddings for semantic search.Use cases:
  • RAG (Retrieval Augmented Generation)
  • Semantic similarity search
  • Contextual recommendations
Advantages:
  • Fast cosine/dot-product search
  • Horizontal scalability
  • <100ms retrieval time
Example: Retrieve top 5 semantically similar docs before LLM generation.
Caching stores frequent responses or context to minimize repeated model calls.Tools: Redis / Memcached
Strategies:
  • Response caching: For common queries
  • Context caching: Last N user messages
  • Rate limiting: Prevent abuse
    Eviction: LRU with TTL (1h for context, 24h for cache)

3. Model Deployment & Serving

  • vLLM: High throughput inference, dynamic batching
  • TorchServe: Scalable PyTorch serving
  • TensorRT / ONNX Runtime: Optimized inference for GPUs/CPUs
  • Ray Serve: Distributed deployment for microservices
Example: Deploy Mistral-7B on vLLM with 4×A100 GPUs for 200–300 tokens/sec per GPU.
  • Batch inference: Process many inputs together efficient for offline jobs
  • Streaming: Generate and send tokens live ideal for chat or long text
Example: Chatbot → stream tokens for smoother UX.
Managed API (OpenAI, Anthropic):
  • ✅ Fast to integrate
  • ✅ No infra maintenance
  • ❌ Expensive at scale
  • ❌ Limited customization
Self-hosted (vLLM, Text-Gen WebUI):
  • ✅ Lower cost after 2–5M req/month
  • ✅ Control over weights
  • ❌ Needs GPU infra + MLOps skills

4. Fine-Tuning & Alignment

LoRA (Low-Rank Adaptation):
  • Trains small adapter matrices instead of full weights
  • Only updates 0.1-1% of parameters
  • Much faster, less memory, cheaper
  • Preserves base model capabilities
Full fine-tuning:
  • Updates all model parameters
  • More expressive, can learn complex patterns
  • Slower, needs more memory, expensive
  • Risk of catastrophic forgetting
When to use LoRA:
  • Limited compute resources
  • Want to preserve base model
  • Quick iterations needed
  • Multiple task-specific adapters
  • Fine-tuning on consumer hardware
When to use full fine-tuning:
  • Large dataset available
  • Task significantly different from pre-training
  • Need maximum performance
  • Have sufficient compute resources
  • Single specialized model needed
Decision framework:
  • Resources limited? → LoRA
  • Large dataset? → Full fine-tuning
  • Multiple tasks? → LoRA (different adapters)
  • Maximum performance? → Full fine-tuning
  • Quick experiments? → LoRA
Best practice: Start with LoRA, move to full fine-tuning only if LoRA doesn’t meet requirements.
QLoRA (Quantized LoRA):
  • Quantizes model to 4-bit, then applies LoRA
  • Enables fine-tuning on single GPU
  • Very memory efficient
Hidden costs:
  1. Quantization overhead:
    • Dequantization during training
    • Slight accuracy loss from quantization
    • More complex implementation
  2. Performance trade-offs:
    • Slower than full precision
    • May not reach full fine-tuning quality
    • Limited to certain model architectures
  3. Debugging complexity:
    • Harder to debug quantized models
    • Less interpretable
    • More moving parts
  4. Compatibility:
    • Not all models support quantization
    • May need specific libraries
    • Hardware requirements vary
When QLoRA makes sense:
  • Very limited GPU memory
  • Fine-tuning large models (7B+)
  • Research/experimentation
  • Cost-constrained scenarios
When to avoid:
  • Need maximum accuracy
  • Have sufficient resources
  • Production-critical applications
  • Small models (can use full fine-tuning)
Best practice: Use QLoRA when memory is the constraint, but be aware of accuracy trade-offs.
RLHF (Reinforcement Learning from Human Feedback):
  • Uses reinforcement learning with human preferences
  • More complex, needs reward model
  • Proven in production (ChatGPT, Claude)
  • Better for complex alignment
DPO (Direct Preference Optimization):
  • Directly optimizes on preference pairs
  • Simpler, no reward model needed
  • Faster training, easier to implement
  • Good for preference alignment
For safety-critical use case:Choose RLHF if:
  • Need maximum safety guarantees
  • Have resources for complex setup
  • Need fine-grained control
  • Working with large models
Choose DPO if:
  • Need faster iteration
  • Limited resources
  • Simpler alignment needs
  • Want easier implementation
Best practice: For safety-critical, use RLHF with extensive red teaming and safety testing. DPO can be good starting point, but RLHF offers more control.
SFT (Supervised Fine-Tuning):
  • Full fine-tuning on labeled data
  • Updates all parameters
  • More expressive, can learn complex patterns
  • Needs more data and compute
PEFT (Parameter-Efficient Fine-Tuning):
  • LoRA, Adapters, Prompt Tuning
  • Updates small subset of parameters
  • Faster, cheaper, less data needed
  • May not reach SFT performance
When SFT is overkill:
  • Small dataset (<1000 examples)
  • Task similar to pre-training
  • Limited compute resources
  • Quick experiments
  • Multiple tasks (use PEFT adapters)
When PEFT is insufficient:
  • Large, diverse dataset
  • Task very different from pre-training
  • Need maximum performance
  • Have sufficient resources
  • Single specialized model
Decision rule:
  • Start with PEFT (LoRA)
  • If performance insufficient, try SFT
  • Consider hybrid: PEFT for quick iteration, SFT for final model
Best practice: PEFT is rarely overkill it’s usually the right starting point. Use SFT when PEFT doesn’t meet requirements.
Common reasons:
  1. Poor data quality:
    • Low-quality training data
    • Misaligned with use case
    • Insufficient diversity
    • Noisy labels
  2. Overfitting:
    • Too many epochs
    • Small validation set
    • Model memorizes training data
    • Poor generalization
  3. Catastrophic forgetting:
    • Loses general capabilities
    • Too focused on specific task
    • Forgets pre-training knowledge
  4. Hyperparameter issues:
    • Wrong learning rate
    • Poor scheduler choice
    • Inappropriate batch size
    • No proper validation
  5. Evaluation mismatch:
    • Evaluated on different metrics
    • Test set doesn’t reflect real use
    • Overfitting to test set
How to avoid:
  • Use high-quality, diverse data
  • Proper train/validation/test splits
  • Early stopping
  • Monitor validation metrics
  • Test on real-world scenarios
  • Use PEFT to preserve base capabilities
Best practice: Fine-tune carefully with proper validation, and test on real-world data before deployment.
Stick with RAG + prompting when:
  1. Data availability:
    • Limited training data
    • Data changes frequently
    • Hard to collect labeled data
  2. Flexibility:
    • Need to update knowledge quickly
    • Multiple knowledge domains
    • Dynamic content requirements
  3. Cost:
    • Can’t afford fine-tuning compute
    • Low volume of requests
    • Cost of fine-tuning > cost of RAG
  4. Transparency:
    • Need to cite sources
    • Want to verify answers
    • Regulatory requirements
  5. Multi-domain:
    • Need to handle multiple domains
    • Different knowledge bases
    • General-purpose system
Choose fine-tuning when:
  • Large, stable dataset
  • Task-specific behavior needed
  • High volume, cost-sensitive
  • Need consistent style/format
  • Domain-specific terminology
Best practice: Start with RAG + prompting. Fine-tune only if RAG doesn’t meet requirements or cost/performance justifies it.
Common gotchas:
  1. Data leakage:
    • Test data in training set
    • Validation contamination
    • Overfitting to test metrics
  2. Distribution shift:
    • Training data ≠ production data
    • Different user behavior
    • Changing requirements
  3. Evaluation gaps:
    • Evaluating on wrong metrics
    • Not testing on real scenarios
    • Ignoring edge cases
  4. Cost underestimation:
    • Fine-tuning cost
    • Inference cost changes
    • Maintenance overhead
  5. Model degradation:
    • Catastrophic forgetting
    • Losing general capabilities
    • Performance on other tasks drops
  6. Deployment issues:
    • Model size increases
    • Latency changes
    • Infrastructure needs
How to avoid:
  • Proper data splits
  • Test on production-like data
  • Monitor all metrics, not just target
  • Budget for full lifecycle
  • Test general capabilities after fine-tuning
  • Plan for deployment infrastructure
Best practice: Always test fine-tuned models on real-world scenarios and monitor for degradation in general capabilities.

5. RAG Systems (Retrieval Augmented Generation)

RAG = Retrieval + Generation retrieves relevant docs before generating output, grounding the model in factual context.Benefits:
  • Reduces hallucination
  • Keeps results current
  • Enables domain adaptation without retraining
Pipeline: Embed → Store → Retrieve (top-k) → Construct prompt → Generate response
Text is converted into numerical vectors using embedding models (e.g., text-embedding-3-small).Steps:
  1. Tokenize text
  2. Convert to fixed-size vector
  3. Store in vector DB
Distance metrics: Cosine similarity, dot product, Euclidean distance.
  • Smart document chunking (500–800 tokens)
  • Use metadata filters (type, tags, date)
  • Cache top-k retrievals
  • Re-rank using relevance scores
How it works:
  1. Embedding: Convert documents to vectors
  2. Storage: Store vectors in vector database
  3. Query embedding: Convert query to vector
  4. Similarity search: Find closest document vectors
  5. Retrieval: Return top-k most similar documents
  6. Generation: Use retrieved docs as context for LLM
Where it breaks:
  1. Semantic mismatch:
    • Query and documents use different terminology
    • Embeddings don’t capture exact match needs
    • Example: Query “ML” vs document “machine learning”
  2. Context loss:
    • Chunking loses document structure
    • Missing surrounding context
    • Fragmented information
  3. Retrieval quality:
    • Wrong documents retrieved
    • Missing relevant documents
    • Too many irrelevant results
  4. Scale issues:
    • Slow retrieval at large scale
    • Vector DB limitations
    • Cost of embedding everything
  5. Domain mismatch:
    • Embedding model not trained on domain
    • Different languages or formats
    • Specialized terminology
Mitigation:
  • Use hybrid search (dense + sparse)
  • Better chunking strategies
  • Domain-specific embedding models
  • Reranking for better results
  • Metadata filtering
Best practice: Always validate retrieval quality on your specific use case embeddings aren’t perfect.
Comparison:Pinecone:
  • Managed service, easy setup
  • Good performance, auto-scaling
  • Expensive at scale
  • Best for: Quick prototypes, managed solution
FAISS (Facebook AI Similarity Search):
  • Library, not a database
  • Very fast, open source
  • No persistence, needs integration
  • Best for: Research, in-memory search
Weaviate:
  • Self-hosted or managed
  • Feature-rich, hybrid search
  • More complex setup
  • Best for: Production with advanced needs
Decision framework:
  • Prototype quickly? → Pinecone
  • Research/experiment? → FAISS
  • Production with features? → Weaviate
  • Large scale? → Milvus or Qdrant
  • Budget constrained? → Self-hosted (Weaviate, Milvus)
Best practice: Start with Pinecone for speed, migrate to self-hosted (Weaviate/Milvus) as you scale.
Hybrid retrieval:
  • Combines dense (semantic) and sparse (keyword) search
  • Weighted combination of scores
  • Captures both semantic similarity and exact matches
When it matters:
  1. Exact matches needed:
    • Code search, version numbers
    • Proper nouns, technical terms
    • When precision critical
  2. Semantic understanding needed:
    • General queries, synonyms
    • Conceptual search
    • When recall important
  3. Production systems:
    • Need best of both worlds
    • Can’t afford to miss results
    • Quality is priority
When it doesn’t matter:
  • Pure semantic search sufficient
  • Pure keyword search sufficient
  • Simple use cases
  • Cost-sensitive scenarios
Implementation:
# Pseudo-code
dense_score = cosine_similarity(query_embedding, doc_embedding)
sparse_score = bm25(query, doc)
final_score = α * dense_score + (1-α) * sparse_score
Best practice: Use hybrid retrieval in production RAG systems for best results.
Reranking:
  • Second-stage ranking using more expensive model
  • Reorders initial retrieval results
  • Improves precision of top results
Why it helps:
  • Initial retrieval may miss subtle relevance
  • Reranker understands context better
  • Can catch semantic nuances
  • Improves top-k precision significantly
When it helps:
  • Initial retrieval has good recall but poor precision
  • Need high-quality top results
  • Can afford extra latency/cost
  • Complex queries requiring understanding
When it doesn’t help:
  • Initial retrieval already very good
  • Latency/cost critical
  • Simple queries
  • Reranker not better than initial retrieval
Trade-offs:
  • Better quality but higher latency/cost
  • Typically 2-5× slower than initial retrieval
  • Worth it for critical queries
Best practice: Use reranking for important queries where quality matters more than speed.
Quantitative metrics:
  1. Retrieval metrics:
    • Precision@k: Fraction of retrieved docs that are relevant
    • Recall@k: Fraction of relevant docs that were retrieved
    • MRR: Mean reciprocal rank of first relevant result
    • NDCG: Normalized discounted cumulative gain
  2. Generation metrics:
    • Answer accuracy: Correctness of generated answer
    • Faithfulness: Answer grounded in retrieved docs
    • Completeness: Answer covers all aspects
    • Citation accuracy: Correct source attribution
  3. End-to-end metrics:
    • Task completion rate
    • User satisfaction (thumbs up/down)
    • Time to correct answer
    • Error rate
Evaluation framework:
  1. Create test set with known good answers
  2. Run RAG pipeline on test set
  3. Measure retrieval quality
  4. Measure generation quality
  5. Measure end-to-end performance
  6. Compare against baselines
Best practices:
  • Use multiple metrics (not just one)
  • Test on real-world scenarios
  • Monitor in production
  • A/B test improvements
  • Regular evaluation cycles
Best practice: Combine quantitative metrics with qualitative evaluation for comprehensive RAG assessment.
Common failure modes:
  1. Semantic mismatch:
    • Query and documents use different terms
    • Embeddings don’t capture exact need
    • Example: “ML” vs “machine learning”
  2. Over-retrieval:
    • Too many irrelevant documents
    • Dilutes relevant context
    • Model gets confused
  3. Under-retrieval:
    • Missing critical documents
    • Incomplete context
    • Model makes up information
  4. Chunking issues:
    • Relevant info split across chunks
    • Missing context from surrounding text
    • Fragmented information
  5. Temporal mismatch:
    • Outdated information retrieved
    • Wrong version of document
    • Stale knowledge base
  6. Domain mismatch:
    • Embedding model not suited for domain
    • Different language or format
    • Specialized terminology
Impact:
  • Hallucinations (model makes up info)
  • Inaccurate answers
  • Missing information
  • Poor user experience
Mitigation:
  • Improve chunking strategy
  • Use hybrid search
  • Rerank results
  • Update embedding model
  • Filter by metadata (date, type)
  • Test retrieval quality regularly
Best practice: Monitor retrieval quality and have fallback strategies for when retrieval fails.
Naive RAG limitations:Small scale (<10k documents):
  • Works fine
  • Simple vector search sufficient
  • No major issues
Medium scale (10k-1M documents):
  • Starts to show issues
  • Retrieval quality may degrade
  • Need better chunking/filtering
Large scale (>1M documents):
  • Significant problems:
    • Slow retrieval
    • Poor precision (too many results)
    • Cost increases
    • Quality degradation
When it falls apart:
  1. Retrieval quality:
    • Too many similar documents
    • Hard to find most relevant
    • Precision drops significantly
  2. Performance:
    • Slow vector search
    • High latency
    • Cost increases
  3. Maintenance:
    • Hard to update embeddings
    • Complex to manage
    • Scaling challenges
Solutions:
  • Hierarchical retrieval (coarse → fine)
  • Metadata filtering
  • Better chunking strategies
  • Hybrid search
  • Distributed vector DBs
Best practice: Plan for scale from the start naive RAG works for prototypes but needs optimization for production scale.
Strategies:
  1. RAG (Retrieval Augmented Generation):
    • Ground answers in retrieved documents
    • Enables citation and verification
    • Reduces hallucinations
  2. Citation and sources:
    • Always cite sources
    • Link to original documents
    • Enable fact-checking
  3. Confidence scores:
    • Provide confidence levels
    • Flag uncertain answers
    • Admit when unsure
  4. Validation:
    • Cross-check with multiple sources
    • Verify against known facts
    • Human review for critical answers
  5. Transparency:
    • Show retrieved context
    • Explain reasoning
    • Make process auditable
Implementation:
  • Use RAG for factual queries
  • Always include citations
  • Provide confidence scores
  • Enable source verification
  • Human review for critical cases
Best practice: Combine RAG with citations and confidence scores for verifiable, reliable answers.
RAG pipeline:
  1. Document processing:
    • Chunk documents into smaller pieces
    • Embed chunks into vectors
    • Store in vector database
  2. Query processing:
    • Embed user query into vector
    • Search for similar document chunks
    • Retrieve top-k most relevant chunks
  3. Context construction:
    • Combine retrieved chunks
    • Format as context for LLM
    • Include in prompt
  4. Generation:
    • LLM generates answer using context
    • Grounded in retrieved documents
    • Can cite sources
Benefits:
  • Reduces hallucinations
  • Keeps information current
  • Enables domain adaptation
  • Provides citations
Best practice: RAG is essential for factual, verifiable LLM applications.
Benefits:
  1. Reduced hallucinations:
    • Grounded in real documents
    • Less likely to make up information
    • More accurate answers
  2. Current information:
    • Can update knowledge base
    • No need to retrain model
    • Always up-to-date
  3. Domain adaptation:
    • Add domain-specific documents
    • No fine-tuning needed
    • Quick to adapt
  4. Transparency:
    • Can cite sources
    • Verifiable answers
    • Auditable process
  5. Cost-effective:
    • No model retraining
    • Update knowledge easily
    • Lower maintenance
Best practice: RAG is the standard approach for factual, domain-specific LLM applications.
Use fine-tuning when:
  1. Task-specific behavior:
    • Need specific output format
    • Consistent style required
    • Domain-specific terminology
  2. Large, stable dataset:
    • Have sufficient training data
    • Data doesn’t change frequently
    • Can afford fine-tuning cost
  3. Performance critical:
    • Need maximum performance
    • Latency sensitive
    • High volume
  4. Consistency:
    • Need very consistent outputs
    • Style/format critical
    • Behavior must be predictable
Use RAG when:
  • Need current information
  • Multiple knowledge domains
  • Data changes frequently
  • Need citations
  • Quick to deploy
Best practice: Start with RAG. Fine-tune only if RAG doesn’t meet requirements.
Architecture patterns:
  1. RAG (Retrieval Augmented Generation):
    • Retrieve relevant docs, use as context
    • No model changes
    • Easy to update
    • Best for: Knowledge bases, Q&A
  2. Fine-tuning:
    • Train model on proprietary data
    • Model learns from data
    • More integrated
    • Best for: Task-specific behavior
  3. Hybrid:
    • Fine-tune + RAG
    • Model fine-tuned for task
    • RAG for knowledge
    • Best for: Complex requirements
  4. Prompt engineering:
    • Customize via prompts
    • No model changes
    • Very flexible
    • Best for: Quick customization
Decision framework:
  • Knowledge base? → RAG
  • Task behavior? → Fine-tuning
  • Both? → Hybrid
  • Quick test? → Prompt engineering
Best practice: Choose pattern based on requirements RAG for knowledge, fine-tuning for behavior.

5. MLOps & LLMOps

MLOps applies DevOps principles to ML ensuring consistent deployment, monitoring, and governance.Key areas:
  • Continuous integration (CI)
  • Continuous training (CT)
  • Continuous deployment (CD)
  • Model registry/versioning
  • Drift detection and rollback
Why important:
  • Faster deployment cycles
  • Better model quality
  • Reduced risk
  • Reproducibility
  • Scalability
Best practice: MLOps is essential for production ML systems enables reliable, scalable deployments.
Metrics: Latency, accuracy, token usage, cost, drift
Tools: Prometheus, Grafana, OpenTelemetry
Alerts: P95 latency, error spikes, accuracy drops
Best practices:
  • Log prompts safely
  • Anonymize data
  • Track feedback loops
Key metrics:
  • Latency: P50, P95, P99 response times
  • Accuracy: Task-specific metrics
  • Token usage: Input/output tokens
  • Cost: Per request, per day
  • Drift: Data and model drift
Best practice: Monitor multiple metrics latency, accuracy, cost, drift for comprehensive monitoring.
  • Avoid harmful or biased outputs
  • Enforce strict usage policies
  • Apply constitutional AI + red teaming
  • Audit model data and behavior
Examples:
  • Refuse harmful requests
  • Document dataset sources
  • Test for bias before release
Best practice: Safety and ethics are critical implement guardrails, test for bias, and monitor for harmful outputs.
MLOps pipeline:
  1. Data collection:
    • Raw data ingestion
    • Data validation
    • Data storage
  2. Data processing:
    • Cleaning, transformation
    • Feature engineering
    • Data versioning
  3. Model training:
    • Training pipeline
    • Hyperparameter tuning
    • Model evaluation
  4. Model registry:
    • Version models
    • Store metadata
    • Track performance
  5. Model deployment:
    • Model serving
    • A/B testing
    • Gradual rollout
  6. Monitoring:
    • Performance metrics
    • Drift detection
    • Error tracking
  7. Feedback loop:
    • Collect user feedback
    • Log predictions
    • Retrain with new data
Best practice: Build end-to-end pipeline data → model → serving → feedback for continuous improvement.
Drift monitoring:
  1. Data drift:
    • Monitor input distribution
    • Statistical tests (KS test, chi-square)
    • Alert on significant changes
  2. Model drift:
    • Monitor prediction distribution
    • Compare with baseline
    • Alert on changes
  3. Performance drift:
    • Monitor accuracy metrics
    • Compare with baseline
    • Alert on degradation
Hallucination monitoring:
  1. Output validation:
    • Check for factual claims
    • Verify against sources
    • Flag suspicious outputs
  2. Confidence scores:
    • Monitor confidence levels
    • Flag low-confidence outputs
    • Review manually
  3. User feedback:
    • Collect thumbs up/down
    • Track user reports
    • Identify patterns
Best practice: Monitor drift and hallucinations continuously set up alerts and review regularly.
Logging strategy:
  1. What to log:
    • Prompts (system + user)
    • Model responses
    • Metadata (timestamp, user, model version)
    • Performance metrics
  2. Privacy:
    • Anonymize PII
    • Hash sensitive data
    • Comply with regulations
  3. Storage:
    • Centralized logging (ELK, Splunk)
    • Searchable, filterable
    • Retention policies
  4. Access control:
    • Role-based access
    • Audit logs
    • Secure storage
Best practices:
  • Log everything for debugging
  • Anonymize for privacy
  • Enable search and filtering
  • Set retention policies
Best practice: Log prompts and outputs securely essential for debugging and auditing.
LLM-specific challenges:
  1. Prompt versioning:
    • Version prompts like code
    • A/B test prompts
    • Rollback prompts
  2. Model updates:
    • Base model updates
    • Fine-tuned model versions
    • Embedding model updates
  3. Context management:
    • Version context documents
    • Update knowledge bases
    • Backfill queries
  4. Evaluation:
    • LLM-specific metrics
    • Human evaluation
    • A/B testing
  5. Deployment:
    • Model serving (vLLM, etc.)
    • Prompt caching
    • Streaming responses
Differences from traditional ML:
  • Prompts: Version and test prompts
  • Context: Manage dynamic context
  • Evaluation: LLM-specific metrics
  • Deployment: Streaming, caching
Best practice: Adapt CI/CD for LLMs version prompts, manage context, use LLM-specific evaluation.
Deployment playbook:
  1. API development:
    • FastAPI for Python API
    • Define endpoints
    • Error handling
  2. Containerization:
    • Docker for containerization
    • Multi-stage builds
    • Optimize image size
  3. Orchestration:
    • Kubernetes for orchestration
    • Deployments, services
    • Auto-scaling
  4. Model serving:
    • vLLM, TorchServe for serving
    • GPU allocation
    • Batching
  5. Monitoring:
    • Prometheus metrics
    • Grafana dashboards
    • Alerts
  6. CI/CD:
    • GitHub Actions, GitLab CI
    • Automated testing
    • Deployment pipelines
Best practices:
  • Use FastAPI for APIs
  • Containerize with Docker
  • Orchestrate with Kubernetes
  • Monitor with Prometheus/Grafana
Best practice: Use FastAPI + Docker + Kubernetes for production LLM APIs standard, scalable stack.
LLM drift detection:
  1. Input drift:
    • Monitor prompt patterns
    • Track user query types
    • Alert on changes
  2. Output drift:
    • Monitor response patterns
    • Track response length
    • Alert on changes
  3. Performance drift:
    • Monitor accuracy metrics
    • Track user satisfaction
    • Alert on degradation
  4. Model drift:
    • Compare model versions
    • Track behavior changes
    • A/B test
Methods:
  • Statistical tests (KS test, chi-square)
  • Distribution comparison
  • Threshold-based alerts
Best practice: Monitor drift continuously set up automated alerts and review regularly.
Offline evaluation:
  • Test on held-out dataset
  • Fast, cheap
  • No user impact
  • May not reflect real usage
Online evaluation:
  • Test with real users
  • A/B testing
  • Reflects real usage
  • Slower, more expensive
Trade-offs:
  • Offline: Fast, cheap, but may not reflect reality
  • Online: Realistic, but slower and more expensive
Best practice: Use offline for initial evaluation, online (A/B testing) for final validation.
Key metrics:
  1. Latency:
    • P50, P95, P99 response times
    • Time to first token
    • End-to-end latency
  2. Accuracy:
    • Task-specific metrics
    • User satisfaction
    • Error rates
  3. Cost:
    • Token usage (input + output)
    • Cost per request
    • Daily/monthly costs
  4. Quality:
    • Hallucination rate
    • Citation accuracy
    • User feedback
  5. System:
    • Throughput (requests/sec)
    • Error rate
    • Availability
Best practice: Monitor latency, accuracy, cost, and quality essential metrics for LLMOps.
LLM rollback differences:
  1. Prompt rollbacks:
    • Rollback prompts quickly
    • No model retraining needed
    • Version control
  2. Model rollbacks:
    • Rollback model versions
    • May need infrastructure changes
    • More complex
  3. Context rollbacks:
    • Rollback context documents
    • May need re-embedding
    • Backfill queries
  4. Fast rollbacks:
    • Prompts: Very fast
    • Models: Slower
    • Context: Medium
Best practice: Version everything prompts, models, context for quick rollbacks.
Cost optimization:
  1. Model optimization:
    • Quantization (INT8, INT4)
    • Model distillation
    • Smaller models
  2. Batching:
    • Dynamic batching
    • Continuous batching (vLLM)
    • Higher throughput
  3. Caching:
    • Prompt caching
    • Response caching
    • Reduce redundant calls
  4. Smart routing:
    • Route simple queries to smaller models
    • Route complex to larger models
    • Cost-aware routing
  5. Infrastructure:
    • Spot instances
    • Auto-scaling
    • Right-sizing
Best practice: Optimize models, use batching and caching, smart routing for cost-effective scaling.
CI/CD design:
  1. Version control:
    • Git for prompts
    • Model registry for checkpoints
    • Track versions
  2. Testing:
    • Test prompts on sample queries
    • Test models on validation set
    • Automated tests
  3. Deployment:
    • Feature flags for prompts
    • Gradual rollout for models
    • A/B testing
  4. Monitoring:
    • Monitor performance
    • Track metrics
    • Alert on issues
  5. Rollback:
    • Quick rollback for prompts
    • Model rollback capability
    • Version management
Best practice: Design CI/CD for both prompts and models version, test, deploy, monitor, rollback.

6. Document Digitization & Chunking

Chunking:
  • Breaking documents into smaller pieces
  • Makes documents fit in context window
  • Enables better retrieval
Why chunk:
  1. Context limits: Models have max context (e.g., 128k tokens)
  2. Better retrieval: Smaller chunks = more precise retrieval
  3. Cost: Smaller chunks = lower embedding costs
  4. Performance: Faster processing of smaller pieces
Best practices:
  • Chunk size: 500-800 tokens (balance context vs precision)
  • Overlap: 50-100 tokens between chunks (preserve context)
  • Semantic boundaries: Split at sentence/paragraph boundaries
Best practice: Chunk size depends on use case smaller for precise retrieval, larger for more context.
Factors:
  1. Model context window:
    • Max tokens model can handle
    • Need space for query + retrieved chunks
    • Example: 8k context → chunks of 500-800 tokens
  2. Retrieval precision:
    • Smaller chunks = more precise retrieval
    • Larger chunks = more context per chunk
    • Balance precision vs context
  3. Document structure:
    • Paragraphs, sections, chapters
    • Natural boundaries matter
    • Preserve semantic units
  4. Use case:
    • Q&A: Smaller chunks for precise answers
    • Summarization: Larger chunks for context
    • Analysis: Medium chunks for balance
  5. Embedding model:
    • Max tokens per embedding
    • Some models handle longer texts better
    • Consider model limitations
Best practice: Start with 500-800 token chunks, adjust based on retrieval quality and use case.
Chunking methods:
  1. Fixed-size chunking:
    • Split by character/token count
    • Simple, fast
    • May break sentences/paragraphs
  2. Sentence-based chunking:
    • Split at sentence boundaries
    • Preserves sentence structure
    • Better semantic units
  3. Paragraph-based chunking:
    • Split at paragraph boundaries
    • Preserves paragraph context
    • Good for structured documents
  4. Recursive chunking:
    • Try different strategies hierarchically
    • Start with paragraphs, fall back to sentences
    • Best of multiple approaches
  5. Semantic chunking:
    • Split based on semantic similarity
    • Uses embeddings to find boundaries
    • Most sophisticated, preserves meaning
  6. Sliding window:
    • Overlapping chunks
    • Preserves context across boundaries
    • More chunks but better coverage
Best practice: Use recursive or semantic chunking for best results, with overlap to preserve context.
Finding ideal chunk size:
  1. Start with baseline:
    • Common: 500-800 tokens
    • Test with your documents
    • Measure retrieval quality
  2. Test different sizes:
    • Small (200-400): More precise, less context
    • Medium (500-800): Balanced
    • Large (1000-1500): More context, less precise
  3. Evaluate:
    • Precision@k: Are retrieved chunks relevant?
    • Recall@k: Do we find all relevant chunks?
    • End-to-end: Does RAG quality improve?
  4. Consider factors:
    • Document type (technical vs narrative)
    • Query type (specific vs general)
    • Model context window
    • Use case requirements
  5. Iterate:
    • Start with medium size
    • Adjust based on results
    • Test on real queries
Best practice: Test multiple chunk sizes on your specific documents and queries ideal size varies by use case.
For complex documents (annual reports):
  1. Preprocessing:
    • Extract text from PDF
    • Preserve structure (tables, sections)
    • Clean formatting
  2. Structure-aware chunking:
    • Identify sections (executive summary, financials, etc.)
    • Chunk within sections
    • Preserve section context
  3. Hierarchical chunking:
    • Document → Sections → Subsections → Paragraphs
    • Store hierarchy in metadata
    • Enable section-level retrieval
  4. Special handling:
    • Tables: Extract as structured data, chunk separately
    • Charts: Extract captions, link to images
    • Footnotes: Include with relevant sections
  5. Metadata:
    • Section name, page number, date
    • Document type, year
    • Enable filtering by metadata
Implementation:
  • Use document parsers (PyPDF2, pdfplumber)
  • Structure detection (section headers)
  • Table extraction (tabula, camelot)
  • Semantic chunking within sections
Best practice: For complex documents, use structure-aware chunking with rich metadata for better retrieval.
Table handling strategies:
  1. Extract as structured data:
    • Convert to CSV/JSON
    • Store separately from text
    • Embed table descriptions
  2. Text representation:
    • Convert table to markdown/text
    • Include in chunks
    • Preserve structure
  3. Hybrid approach:
    • Store structured data separately
    • Include table summary in chunks
    • Link table data to text chunks
  4. Metadata:
    • Table type, headers, row count
    • Enable table-specific queries
    • Filter by table metadata
Best practices:
  • Extract tables with specialized tools (tabula, camelot)
  • Include table context (surrounding text)
  • Store both structured and text representations
  • Use metadata for table-specific retrieval
Best practice: Extract tables separately, include summaries in text chunks, and store full tables as structured data.
For very large tables:
  1. Split by rows:
    • Chunk table into row groups
    • Preserve header row in each chunk
    • Maintain table structure
  2. Column-based chunking:
    • Split by columns for column-specific queries
    • Include row identifiers
    • Preserve relationships
  3. Summary chunks:
    • Create summary of table
    • Include statistics, key insights
    • Use for high-level queries
  4. Metadata:
    • Table name, dimensions, date
    • Column names, data types
    • Enable filtering
  5. Structured storage:
    • Store full table in database
    • Embed summaries and descriptions
    • Link chunks to full table
Best practice: For large tables, create summary chunks for retrieval, store full table separately, and link them.
List handling:
  1. Preserve list structure:
    • Keep list items together
    • Don’t split mid-list
    • Maintain list context
  2. List as single chunk:
    • Small lists: Keep as one chunk
    • Preserves relationships
    • Better semantic unit
  3. Split long lists:
    • Large lists: Split into groups
    • Include list title/context
    • Maintain item relationships
  4. Metadata:
    • List type (ordered, unordered)
    • List title, item count
    • Enable list-specific queries
Best practice: Keep lists together when possible, split only if necessary, and preserve list context.
Production pipeline:
  1. Document ingestion:
    • Support multiple formats (PDF, DOCX, HTML)
    • Handle errors gracefully
    • Validate document quality
  2. Preprocessing:
    • Extract text, preserve structure
    • Clean formatting
    • Handle special elements (tables, images)
  3. Chunking:
    • Structure-aware chunking
    • Preserve context
    • Generate metadata
  4. Embedding:
    • Batch processing
    • Error handling
    • Retry logic
  5. Indexing:
    • Store in vector DB
    • Store metadata
    • Enable filtering
  6. Monitoring:
    • Track processing time
    • Monitor errors
    • Quality metrics
  7. Versioning:
    • Version documents
    • Track changes
    • Enable rollback
Best practices:
  • Use async processing for scale
  • Implement retry logic
  • Monitor pipeline health
  • Version everything
  • Test on production-like data
Best practice: Build robust pipeline with error handling, monitoring, and versioning for production use.
Graphs and charts handling:
  1. Extract text:
    • Chart titles, labels, captions
    • Axis labels, legends
    • Include in text chunks
  2. Image embeddings:
    • Use vision models for image embeddings
    • Store image embeddings separately
    • Link to text chunks
  3. Metadata:
    • Chart type, data source
    • Date, context
    • Enable filtering
  4. Hybrid approach:
    • Text description in chunks
    • Image embeddings for visual search
    • Link images to text
  5. Structured data:
    • Extract underlying data if available
    • Store as structured data
    • Link to chart images
Best practice: Extract text from charts, use image embeddings for visual search, and link charts to relevant text chunks.

7. Embedding Models

Vector embeddings:
  • Numerical representations of text
  • Dense vectors (arrays of numbers)
  • Capture semantic meaning
  • Similar texts have similar vectors
Embedding model:
  • Neural network that generates embeddings
  • Trained on large text corpora
  • Maps text to fixed-size vectors
  • Examples: text-embedding-3-small, sentence-transformers
How it works:
  • Input: Text (sentence, paragraph, document)
  • Output: Vector (e.g., 384, 768, 1536 dimensions)
  • Similar texts → similar vectors
Use cases:
  • Semantic search
  • RAG retrieval
  • Clustering
  • Classification
Best practice: Choose embedding model based on your domain and use case different models work better for different tasks.
In LLM applications:
  1. RAG (Retrieval Augmented Generation):
    • Embed documents for retrieval
    • Embed queries for search
    • Find similar documents
    • Use as context for LLM
  2. Semantic search:
    • Find similar documents
    • Understand user intent
    • Improve search quality
  3. Context selection:
    • Select relevant context from large corpus
    • Filter documents
    • Rank by relevance
  4. Hybrid search:
    • Combine with keyword search
    • Best of both approaches
    • Improved retrieval
Pipeline:
  • Documents → Embeddings → Vector DB
  • Query → Embedding → Search → Retrieve → LLM
Best practice: Embeddings are essential for RAG choose model that matches your domain and use case.
Short content (sentences, phrases):
  • Better semantic capture
  • More precise embeddings
  • Faster processing
  • Less context loss
Long content (paragraphs, documents):
  • More context preserved
  • Better for document-level search
  • Slower processing
  • May lose fine-grained details
Trade-offs:
  • Short: Better precision, less context
  • Long: More context, less precision
Best practices:
  • Short: For precise retrieval, Q&A
  • Long: For document-level search, summarization
  • Hybrid: Embed both short and long versions
Best practice: Embed at chunk level (500-800 tokens) for RAG balance context and precision.
Benchmarking process:
  1. Create test set:
    • Queries with known relevant documents
    • Label relevance (relevant/irrelevant)
    • Cover different query types
  2. Embed documents:
    • Use different embedding models
    • Store in vector DB
    • Track model versions
  3. Run retrieval:
    • Query each model
    • Retrieve top-k results
    • Measure retrieval quality
  4. Evaluate:
    • Precision@k: Fraction of relevant results
    • Recall@k: Fraction of relevant docs found
    • MRR: Mean reciprocal rank
    • NDCG: Normalized discounted cumulative gain
  5. Compare:
    • Compare models on same test set
    • Consider latency, cost
    • Choose best model
Best practices:
  • Test on domain-specific data
  • Use multiple metrics
  • Consider latency and cost
  • Test on real queries
Best practice: Benchmark on your specific data generic benchmarks may not reflect your use case.
Improvement strategies:
  1. Try different models:
    • text-embedding-3-small vs text-embedding-3-large
    • Different dimensions
    • Domain-specific models
  2. Fine-tune embedding model:
    • Train on your domain data
    • Better domain understanding
    • Improved accuracy
  3. Improve chunking:
    • Better chunk size
    • Semantic chunking
    • Preserve context
  4. Hybrid search:
    • Add keyword search (BM25)
    • Combine dense + sparse
    • Better coverage
  5. Reranking:
    • Second-stage ranking
    • More expensive but better
    • Improves precision
  6. Query expansion:
    • Expand queries with synonyms
    • Better query understanding
    • Improved retrieval
  7. Metadata filtering:
    • Filter by document type, date
    • Narrow search space
    • Better precision
Best practice: Start with hybrid search and reranking often easier than fine-tuning and gives good results.
Improvement steps:
  1. Baseline evaluation:
    • Test current model
    • Measure retrieval quality
    • Identify issues
  2. Data preparation:
    • Collect domain-specific data
    • Create training pairs (query, relevant doc)
    • Label relevance
  3. Fine-tuning:
    • Use sentence-transformers library
    • Train on domain data
    • Monitor validation metrics
  4. Evaluation:
    • Test on held-out set
    • Compare with baseline
    • Measure improvement
  5. Iteration:
    • Adjust hyperparameters
    • Add more training data
    • Improve data quality
  6. Deployment:
    • Deploy new model
    • A/B test against old model
    • Monitor performance
Best practices:
  • Start with small dataset
  • Use contrastive learning
  • Monitor overfitting
  • Test on real queries
Best practice: Fine-tune on domain-specific data with proper evaluation often improves accuracy significantly.

8. Internal Working of Vector Databases

Vector database:
  • Specialized database for vector embeddings
  • Optimized for similarity search
  • Stores high-dimensional vectors
  • Fast nearest neighbor search
Key features:
  • Vector storage and indexing
  • Similarity search (cosine, dot product, Euclidean)
  • Metadata filtering
  • Scalability
Examples:
  • Pinecone, Milvus, Weaviate, Qdrant, Chroma
Use cases:
  • RAG systems
  • Semantic search
  • Recommendation systems
  • Similarity matching
Best practice: Use vector DB for production RAG systems much faster than naive similarity search.
Differences:Traditional databases (SQL, NoSQL):
  • Exact match queries
  • Structured data
  • Indexes for exact lookups
  • Not optimized for similarity
Vector databases:
  • Similarity search
  • High-dimensional vectors
  • Approximate nearest neighbor (ANN) algorithms
  • Optimized for vector operations
Key differences:
  • Query type: Exact match vs similarity
  • Data structure: Tables vs vectors
  • Indexing: B-tree vs ANN indexes
  • Use case: Structured data vs embeddings
When to use each:
  • Traditional: Structured data, exact queries
  • Vector: Embeddings, similarity search
Best practice: Use vector DB for embeddings, traditional DB for metadata and structured data.
How it works:
  1. Storage:
    • Store vectors with metadata
    • Index vectors for fast search
    • Maintain data structures
  2. Indexing:
    • Build ANN indexes (HNSW, IVF, etc.)
    • Enable fast approximate search
    • Balance accuracy vs speed
  3. Query:
    • Embed query into vector
    • Search for similar vectors
    • Return top-k results
  4. Similarity calculation:
    • Cosine similarity, dot product, Euclidean
    • Fast computation
    • Optimized algorithms
Internal mechanisms:
  • HNSW (Hierarchical Navigable Small World): Graph-based index
  • IVF (Inverted File Index): Clustering-based
  • LSH (Locality-Sensitive Hashing): Hash-based
Best practice: Vector DBs use sophisticated indexing algorithms for fast similarity search at scale.
Vector index:
  • Data structure for fast similarity search
  • Examples: HNSW, IVF, LSH
  • Can be used standalone (FAISS)
  • No persistence, needs integration
Vector database:
  • Full database system with vector support
  • Persistence, querying, management
  • Examples: Pinecone, Milvus, Weaviate
  • Production-ready solution
Vector plugins:
  • Add vector capabilities to existing DBs
  • Examples: pgvector (PostgreSQL), vector search in Elasticsearch
  • Extends traditional databases
  • Hybrid approach
Comparison:
  • Index: Fast, no persistence, needs integration
  • DB: Full solution, persistence, production-ready
  • Plugin: Extends existing DB, hybrid approach
Best practice: Use vector DB for production, vector index for research, plugins for hybrid needs.
For perfect accuracy, speed not concern:Choose: Exact nearest neighbor search (brute force)Why:
  • Perfect accuracy: Checks all vectors, finds true nearest neighbors
  • No approximation: No accuracy loss from indexing
  • Small dataset: Brute force is feasible for small datasets
  • Simple: No index tuning needed
How it works:
  • Compare query vector with all vectors
  • Calculate similarity for each
  • Return top-k most similar
Trade-offs:
  • Accuracy: Perfect (100%)
  • Speed: Slow (O(n) where n = dataset size)
  • Scalability: Doesn’t scale to large datasets
When to use:
  • Small datasets (<10k vectors)
  • Accuracy critical
  • Speed not concern
  • Simple implementation
Best practice: For small datasets where accuracy is critical, brute force is the right choice simple and perfect accuracy.
Clustering-based (IVF - Inverted File Index):
  • Cluster vectors into groups
  • Search only in relevant clusters
  • Reduces search space
  • Faster but approximate
Locality-Sensitive Hashing (LSH):
  • Hash similar vectors to same buckets
  • Search only in relevant buckets
  • Fast approximate search
  • Probabilistic guarantees
Comparison:
  • Clustering: Better accuracy, needs training
  • LSH: Faster, probabilistic
Best practice: Use clustering (IVF) for better accuracy, LSH for maximum speed.
How clustering works:
  • Group similar vectors into clusters
  • For query, find relevant clusters
  • Search only in those clusters
  • Reduces search space significantly
When it fails:
  1. Query near cluster boundary:
    • May miss vectors in adjacent clusters
    • Solution: Search multiple clusters
  2. Poor clustering:
    • Clusters don’t match query distribution
    • Solution: Better clustering algorithm, more clusters
  3. High-dimensional data:
    • Clustering less effective
    • Solution: Dimensionality reduction, better algorithms
Mitigation:
  • Search multiple clusters
  • Improve clustering quality
  • Use hierarchical clustering
  • Combine with other strategies
Best practice: Use clustering with multi-cluster search for better accuracy while maintaining speed.
Random projection:
  • Projects high-dimensional vectors to lower dimensions
  • Preserves distances approximately (Johnson-Lindenstrauss lemma)
  • Faster search in lower dimensions
  • Approximate but fast
How it works:
  • Multiply vectors by random matrix
  • Reduce dimensions (e.g., 1536 → 128)
  • Search in lower-dimensional space
  • Faster but approximate
Trade-offs:
  • Speed: Much faster (lower dimensions)
  • Accuracy: Approximate (some loss)
  • Memory: Less memory needed
Best practice: Use random projection for very large datasets where speed is critical and some accuracy loss is acceptable.
LSH (Locality-Sensitive Hashing):
  • Hash similar vectors to same buckets
  • Search only in relevant buckets
  • Fast approximate nearest neighbor search
  • Probabilistic guarantees
How it works:
  • Create hash functions that map similar vectors to same hash
  • Hash query vector
  • Search in matching buckets
  • Return top-k results
Key properties:
  • Similar vectors → same hash (high probability)
  • Different vectors → different hash (high probability)
  • Fast lookup (hash-based)
Trade-offs:
  • Speed: Very fast (hash lookup)
  • Accuracy: Approximate (probabilistic)
  • Memory: Hash tables needed
Best practice: Use LSH for very large datasets where speed is critical and approximate results are acceptable.
Product Quantization (PQ):
  • Compresses vectors using quantization
  • Reduces memory usage
  • Enables fast approximate search
  • Trade-off: accuracy vs memory
How it works:
  • Split vector into subvectors
  • Quantize each subvector (reduce precision)
  • Store quantized codes
  • Fast distance computation using lookup tables
Benefits:
  • Memory: Much less memory (compressed)
  • Speed: Fast distance computation
  • Scalability: Can handle very large datasets
Trade-offs:
  • Accuracy: Some loss from quantization
  • Complexity: More complex implementation
Best practice: Use PQ for very large datasets where memory is a constraint and some accuracy loss is acceptable.
Comparison:HNSW (Hierarchical Navigable Small World):
  • Graph-based index
  • High accuracy, good speed
  • Best for: General-purpose, production
IVF (Inverted File Index):
  • Clustering-based
  • Good accuracy, fast
  • Best for: Large datasets, known distribution
LSH (Locality-Sensitive Hashing):
  • Hash-based
  • Fast, approximate
  • Best for: Very large datasets, speed critical
PQ (Product Quantization):
  • Compression-based
  • Memory efficient
  • Best for: Memory-constrained, large datasets
Decision framework:
  • General production: HNSW
  • Large scale: IVF or HNSW
  • Memory constrained: PQ
  • Speed critical: LSH
  • Accuracy critical: HNSW or exact search
Best practice: Start with HNSW for general use, consider others based on specific constraints.
Similarity metrics:Cosine similarity:
  • Measures angle between vectors
  • Magnitude-independent
  • Best for: Semantic similarity, general use
Dot product:
  • Measures magnitude and direction
  • Magnitude-dependent
  • Best for: When magnitude matters
Euclidean distance:
  • Measures absolute distance
  • Magnitude-dependent
  • Best for: When absolute distance matters
Decision factors:
  • Vector normalization: Normalized → cosine, not normalized → dot product
  • Magnitude importance: Matters → dot product/Euclidean, doesn’t → cosine
  • Use case: Semantic search → cosine, recommendation → dot product
Best practice: Use cosine similarity for semantic search (most common), dot product for recommendation systems.
Filtering types:
  1. Metadata filtering:
    • Filter by document type, date, tags
    • Pre-filter before vector search
    • Reduces search space
  2. Post-filtering:
    • Filter after vector search
    • May reduce results below k
    • Simpler but less efficient
  3. Pre-filtering:
    • Filter before vector search
    • More efficient
    • May miss relevant results
Challenges:
  • Performance: Filtering can slow down search
  • Result quality: Pre-filtering may miss results
  • Complexity: Combining filters is complex
  • Indexing: Need indexes for fast filtering
Best practices:
  • Use metadata indexes
  • Combine pre and post-filtering
  • Test filtering impact on quality
  • Optimize filter queries
Best practice: Use metadata filtering to narrow search space, but test impact on result quality.
Decision factors:
  1. Scale:
    • Number of vectors
    • Query volume
    • Growth rate
  2. Features:
    • Filtering, hybrid search
    • Metadata support
    • Advanced features
  3. Deployment:
    • Managed vs self-hosted
    • Infrastructure requirements
    • Maintenance burden
  4. Cost:
    • Pricing model
    • Infrastructure costs
    • Total cost of ownership
  5. Performance:
    • Latency requirements
    • Throughput needs
    • Accuracy requirements
Decision framework:
  • Prototype: Pinecone or Chroma
  • Production <10M: Pinecone or Weaviate
  • Production >10M: Milvus or Qdrant
  • Budget constrained: Self-hosted (Chroma, Milvus)
  • Need features: Weaviate
Best practice: Start with managed (Pinecone) for speed, migrate to self-hosted (Milvus) as you scale.

9. Advanced Search Algorithms

Strategies:
  1. Hybrid search:
    • Combine dense + sparse
    • Better coverage
    • Improved accuracy
  2. Hierarchical retrieval:
    • Coarse → Fine search
    • Reduce search space
    • Faster retrieval
  3. Metadata filtering:
    • Filter by type, date, tags
    • Narrow search space
    • Better precision
  4. Reranking:
    • Second-stage ranking
    • Improves precision
    • Better top-k results
  5. Indexing:
    • Efficient indexes (HNSW, IVF)
    • Fast approximate search
    • Scalable
  6. Caching:
    • Cache frequent queries
    • Reduce computation
    • Lower latency
Best practice: Combine multiple strategies hybrid search, filtering, reranking for best results.
Improvement steps:
  1. Diagnose issues:
    • Measure retrieval quality (precision@k, recall@k)
    • Identify failure modes
    • Analyze query types
  2. Improve chunking:
    • Better chunk size
    • Semantic chunking
    • Preserve context
  3. Improve embeddings:
    • Try different embedding models
    • Fine-tune on domain data
    • Domain-specific models
  4. Add hybrid search:
    • Combine dense + sparse
    • Better coverage
    • Improved accuracy
  5. Add reranking:
    • Second-stage ranking
    • Improves precision
    • Better top-k results
  6. Metadata filtering:
    • Filter by type, date
    • Narrow search space
    • Better precision
  7. Query expansion:
    • Expand queries with synonyms
    • Better query understanding
    • Improved retrieval
  8. Evaluate:
    • Test on real queries
    • Measure improvement
    • Iterate
Best practice: Start with hybrid search and reranking often gives biggest improvement with least effort.
Keyword-based retrieval:
  1. TF-IDF (Term Frequency-Inverse Document Frequency):
    • Weights terms by frequency and rarity
    • Common terms get lower weight
    • Rare terms get higher weight
    • Classic information retrieval
  2. BM25 (Best Matching 25):
    • Improved version of TF-IDF
    • Better term saturation
    • Handles document length better
    • Industry standard
  3. Inverted index:
    • Maps terms to documents
    • Fast lookup
    • Efficient storage
    • Foundation of keyword search
How it works:
  • Extract keywords from query
  • Look up in inverted index
  • Score documents by term frequency
  • Rank by relevance score
Pros:
  • Fast, exact matches
  • Interpretable
  • No model needed
Cons:
  • Misses synonyms
  • No semantic understanding
  • Limited to exact matches
Best practice: Use keyword search for exact matches, combine with semantic search for best results.
Fine-tuning reranking models:
  1. Data preparation:
    • Query-document pairs
    • Relevance labels (relevant/irrelevant)
    • Multiple relevance levels (highly relevant, somewhat relevant, etc.)
  2. Model selection:
    • Cross-encoder models (BERT, RoBERTa)
    • Better than bi-encoders for reranking
    • Understands query-document interaction
  3. Training:
    • Use sentence-transformers library
    • Contrastive learning
    • Train on domain data
    • Monitor validation metrics
  4. Evaluation:
    • Test on held-out set
    • Measure precision@k, MRR, NDCG
    • Compare with baseline
  5. Deployment:
    • Deploy as second-stage reranker
    • Use after initial retrieval
    • Monitor performance
Best practices:
  • Use domain-specific data
  • Multiple relevance levels
  • Monitor overfitting
  • Test on real queries
Best practice: Fine-tune reranking models on domain-specific data for best results.
Common metrics:
  1. Precision@k:
    • Fraction of retrieved items that are relevant
    • Measures accuracy of top-k results
    • Fails when: Need to measure coverage (recall)
  2. Recall@k:
    • Fraction of relevant items that were retrieved
    • Measures coverage
    • Fails when: Need to measure accuracy (precision)
  3. MRR (Mean Reciprocal Rank):
    • Average of 1/rank of first relevant result
    • Emphasizes top results
    • Fails when: Need to measure overall quality (NDCG)
  4. NDCG (Normalized Discounted Cumulative Gain):
    • Considers ranking quality, discounts lower positions
    • Best for graded relevance
    • Fails when: Need simple binary relevance
When metrics fail:
  • Precision: When recall is important
  • Recall: When precision is important
  • MRR: When need overall ranking quality
  • NDCG: When need simple binary relevance
Best practice: Use multiple metrics precision + recall, or MRR + NDCG for comprehensive evaluation.
For Quora-like Q&A system:Choose: MRR (Mean Reciprocal Rank)Why:
  • User experience: Users want first relevant answer quickly
  • MRR emphasizes top results: Measures rank of first relevant answer
  • Fast answers: Lower rank = faster to find answer
  • User satisfaction: Users typically read top results
Alternative metrics:
  • Precision@k: Measures accuracy but not position
  • Recall@k: Measures coverage but not speed
  • NDCG: Good but more complex, MRR simpler
MRR calculation:
  • For each query, find rank of first relevant answer
  • Calculate 1/rank
  • Average across queries
  • Higher MRR = better (answers found faster)
Best practice: Use MRR for Q&A systems where users want first relevant answer quickly.
For recommendation systems:Choose: NDCG (Normalized Discounted Cumulative Gain)Why:
  • Graded relevance: Recommendations have degrees (highly relevant, somewhat relevant)
  • Position matters: Top recommendations more important
  • Ranking quality: Measures how well system ranks recommendations
  • Industry standard: Widely used for recommendation systems
Alternative metrics:
  • Precision@k: Good but doesn’t consider position
  • Recall@k: Good but doesn’t consider position
  • MRR: Good but assumes binary relevance
NDCG benefits:
  • Considers relevance grades
  • Discounts lower positions
  • Normalized (comparable across queries)
  • Industry standard
Best practice: Use NDCG for recommendation systems where ranking quality and graded relevance matter.
Comparison:Precision@k:
  • Measures: Accuracy of top-k results
  • Use when: Accuracy is priority
  • Example: Search engine results
Recall@k:
  • Measures: Coverage of relevant items
  • Use when: Coverage is priority
  • Example: Document retrieval
MRR (Mean Reciprocal Rank):
  • Measures: Rank of first relevant result
  • Use when: First relevant result matters
  • Example: Q&A systems
NDCG (Normalized Discounted Cumulative Gain):
  • Measures: Ranking quality with graded relevance
  • Use when: Ranking and relevance grades matter
  • Example: Recommendation systems
F1@k:
  • Measures: Harmonic mean of precision and recall
  • Use when: Need balance of both
  • Example: Balanced evaluation
Decision framework:
  • Accuracy priority: Precision@k
  • Coverage priority: Recall@k
  • First result matters: MRR
  • Ranking quality: NDCG
  • Balance: F1@k
Best practice: Use multiple metrics for comprehensive evaluation precision + recall, or MRR + NDCG.
Hybrid search:
  1. Dense search (semantic):
    • Embed query and documents
    • Calculate cosine similarity
    • Rank by semantic similarity
  2. Sparse search (keyword):
    • Extract keywords from query
    • Use BM25/TF-IDF
    • Rank by keyword matching
  3. Score combination:
    • Normalize scores (0-1)
    • Weighted combination: final_score = α × dense_score + (1-α) × sparse_score
    • Typical α = 0.7 (70% dense, 30% sparse)
  4. Reranking:
    • Optional: Rerank combined results
    • Use cross-encoder
    • Improve precision
Benefits:
  • Captures semantic similarity (dense)
  • Captures exact matches (sparse)
  • Better coverage
  • Improved accuracy
Best practice: Use hybrid search in production RAG systems best of both approaches.
Merging strategies:
  1. Score normalization:
    • Normalize scores to same range (0-1)
    • Use min-max or z-score normalization
    • Enables fair combination
  2. Weighted combination:
    • final_score = α × method1_score + (1-α) × method2_score
    • Adjust α based on method performance
    • Typical: 0.7 dense + 0.3 sparse
  3. Reciprocal rank fusion (RRF):
    • Combine ranks, not scores
    • RRF_score = Σ(1 / (k + rank))
    • Works with different score ranges
    • Popular in information retrieval
  4. Learning to rank:
    • Train model to combine scores
    • Learns optimal combination
    • More complex but better
  5. Reranking:
    • Merge initial results
    • Rerank with cross-encoder
    • Improves final ranking
Best practice: Use reciprocal rank fusion (RRF) for merging works well with different score ranges.
Multi-hop queries:
  1. Iterative retrieval:
    • First hop: Retrieve initial documents
    • Extract entities/concepts
    • Second hop: Query with extracted entities
    • Combine results
  2. Graph-based retrieval:
    • Build knowledge graph
    • Traverse graph for multi-hop
    • Find connected entities
  3. Query decomposition:
    • Break query into sub-queries
    • Retrieve for each sub-query
    • Combine results
  4. Agent-based:
    • Use LLM agent
    • Plan retrieval steps
    • Execute iteratively
Best practices:
  • Use iterative retrieval for simple multi-hop
  • Use graph-based for complex relationships
  • Use agents for complex reasoning
Best practice: Use iterative retrieval for multi-hop queries retrieve, extract, query again.
Retrieval improvement techniques:
  1. Hybrid search:
    • Combine dense + sparse
    • Better coverage
    • Improved accuracy
  2. Reranking:
    • Second-stage ranking
    • Improves precision
    • Better top-k results
  3. Query expansion:
    • Add synonyms, related terms
    • Better query understanding
    • Improved retrieval
  4. Metadata filtering:
    • Filter by type, date, tags
    • Narrow search space
    • Better precision
  5. Better chunking:
    • Semantic chunking
    • Preserve context
    • Better retrieval
  6. Fine-tune embeddings:
    • Domain-specific models
    • Better domain understanding
    • Improved accuracy
  7. Multi-stage retrieval:
    • Coarse → Fine search
    • Hierarchical retrieval
    • Faster and better
Best practice: Combine multiple techniques hybrid search, reranking, filtering for best results.

10. Prompt Engineering & Basics of LLM

Predictive/Discriminative AI:
  • Predicts labels or classes
  • Examples: Classification, regression
  • Input → Output (label/class)
  • Trained on labeled data
  • Examples: Image classification, sentiment analysis
Generative AI:
  • Generates new content
  • Examples: Text generation, image generation
  • Input → Output (new content)
  • Trained on unlabeled data
  • Examples: GPT, DALL-E, ChatGPT
Key differences:
  • Purpose: Prediction vs generation
  • Output: Label vs content
  • Training: Labeled vs unlabeled data
  • Use case: Classification vs creation
Best practice: Use discriminative AI for classification, generative AI for content creation.
LLM (Large Language Model):
  • Neural network trained on large text corpora
  • Generates human-like text
  • Examples: GPT, BERT, LLaMA
How LLMs are trained:
  1. Pre-training:
    • Train on large unlabeled text corpus
    • Learn language patterns
    • Self-supervised learning (predict next token)
    • Massive compute and data
  2. Fine-tuning:
    • Adapt to specific tasks
    • Supervised learning on labeled data
    • Task-specific behavior
    • Much less data needed
  3. Alignment:
    • RLHF, DPO for human preferences
    • Safety and helpfulness
    • Human feedback
    • Aligns with human values
Training process:
  • Pre-training: Months on thousands of GPUs
  • Fine-tuning: Hours to days
  • Alignment: Days to weeks
Best practice: LLMs are pre-trained on massive data, then fine-tuned and aligned for specific use cases.
Token:
  • Basic unit of text processing
  • Can be word, subword, or character
  • Depends on tokenizer (BPE, WordPiece, SentencePiece)
How tokens work:
  • Text → Tokens → Token IDs → Model
  • Model processes tokens, not raw text
  • Token count affects cost and context
Examples:
  • “Hello world” → 2 tokens (BPE)
  • “Machine learning” → 2-3 tokens (depending on tokenizer)
Tokenization methods:
  • BPE: Byte Pair Encoding (GPT)
  • WordPiece: (BERT)
  • SentencePiece: (T5, multilingual)
Best practice: Understand your model’s tokenizer affects cost, context window, and performance.
Cost estimation:SaaS-based (OpenAI, Anthropic):
  • Pricing: Per token (input + output)
  • Example: GPT-4: 0.03/1kinputtokens,0.03/1k input tokens, 0.06/1k output tokens
  • Calculate: (input_tokens × input_price) + (output_tokens × output_price)
  • Monthly: Estimate tokens/month × price
Open source (self-hosted):
  • Infrastructure: GPU instances (A100, H100)
  • Cost: $5-15k/month for GPU instances
  • Break-even: ~2-5M requests/month
  • Additional: Storage, networking, maintenance
Factors:
  • Volume: More requests = higher cost
  • Model size: Larger models = higher cost
  • Context length: Longer context = more tokens
  • Region: Different pricing by region
Best practice: Calculate based on expected volume SaaS for low volume, self-hosted for high volume.
Temperature:
  • Controls randomness in generation (0-2)
  • Lower = more deterministic
  • Higher = more creative
How to set:
  • 0.0-0.3: Deterministic (code, classification)
  • 0.4-0.7: Balanced (Q&A, summaries)
  • 0.8-1.0: Creative (writing, brainstorming)
Best practices:
  • Start with 0.3-0.5 for most tasks
  • Use 0.0 for deterministic tasks
  • Use 0.8+ for creative tasks
  • Test different values
Best practice: Use low temperature (0.1-0.3) for factual tasks, higher (0.7-0.9) for creative tasks.
Decoding strategies:
  1. Greedy:
    • Always picks highest probability token
    • Fastest, deterministic
    • Can get repetitive
  2. Beam search:
    • Keeps top-k candidates
    • Better quality, slower
    • Good for translation
  3. Top-k sampling:
    • Samples from top-k tokens
    • More diverse, less deterministic
    • Good for creative tasks
  4. Top-p (nucleus) sampling:
    • Samples from smallest set covering p% probability
    • Good balance of quality and diversity
    • Most common for chat
  5. Temperature sampling:
    • Scales probabilities before sampling
    • Controls randomness
    • Often combined with top-p
Best practice: Use top-p (p=0.9) with temperature (0.7-0.9) for chat, greedy for code.
Stopping criteria:
  1. Max tokens:
    • Stop after N tokens
    • Prevents long outputs
    • Most common
  2. Stop sequences:
    • Stop when specific sequence appears
    • Example: ”###” or “\n\n”
    • Useful for structured output
  3. EOS token:
    • Stop at end-of-sequence token
    • Model-generated
    • Natural stopping point
  4. Custom logic:
    • Stop based on content
    • Example: Complete sentence, paragraph
    • More complex
Best practice: Use max tokens + stop sequences for reliable stopping.
Stop sequences:
  1. Define sequences:
    • List of strings to stop at
    • Example: [”###”, “\n\n\n”]
    • Model stops when any sequence appears
  2. Use cases:
    • Structured output (JSON, XML)
    • Multi-turn conversations
    • Preventing continuation
  3. Best practices:
    • Use unique sequences
    • Test to ensure they work
    • Combine with max tokens
Example:
  • Stop sequence: ”###”
  • Model stops when ”###” appears
  • Useful for structured output
Best practice: Use stop sequences for structured output prevents model from continuing beyond desired point.
Prompt structure:
  1. System prompt:
    • Defines model’s role
    • Sets behavior and constraints
    • Example: “You are a helpful assistant.”
  2. Context:
    • Relevant information
    • Retrieved documents (RAG)
    • User history
  3. Instructions:
    • What model should do
    • Format requirements
    • Examples
  4. User input:
    • Actual query or request
    • User’s question or task
Example structure:
System: You are a helpful assistant.
Context: [Retrieved documents]
Instructions: Answer based on context, cite sources.
User: What is machine learning?
Best practice: Use clear structure system prompt, context, instructions, user input.
In-context learning:
  • Model learns from examples in prompt
  • No weight updates
  • Examples guide model behavior
  • Types: Zero-shot, few-shot, chain-of-thought
How it works:
  • Provide examples in prompt
  • Model learns pattern from examples
  • Applies pattern to new input
  • No training needed
Types:
  • Zero-shot: No examples
  • Few-shot: 1-5 examples
  • Chain-of-thought: Examples with reasoning
Best practice: Use few-shot learning when zero-shot doesn’t work examples guide model behavior.
Prompt engineering types:
  1. Zero-shot:
    • No examples
    • Model uses pre-training
    • Fastest, cheapest
  2. Few-shot:
    • 1-5 examples
    • Guides model behavior
    • Better consistency
  3. Chain-of-thought:
    • Examples with reasoning steps
    • Improves reasoning
    • Better for complex tasks
  4. Role-playing:
    • Define model’s role
    • Sets behavior
    • Example: “You are an expert…”
  5. Template-based:
    • Structured prompts
    • Consistent format
    • Easy to maintain
Best practice: Start with zero-shot, add few-shot if needed, use chain-of-thought for reasoning tasks.
Few-shot prompting considerations:
  1. Example quality:
    • High-quality, relevant examples
    • Representative of task
    • Clear and correct
  2. Example quantity:
    • 2-5 examples usually best
    • Diminishing returns beyond 5
    • Balance cost and quality
  3. Example diversity:
    • Cover different cases
    • Avoid bias
    • Representative sample
  4. Token usage:
    • Examples increase tokens
    • Higher cost
    • Monitor usage
  5. Format consistency:
    • Consistent format across examples
    • Clear structure
    • Easy to follow
Best practice: Use 2-3 high-quality, diverse examples more doesn’t always help.
Prompt writing strategies:
  1. Be clear and specific:
    • Clear instructions
    • Specific requirements
    • Avoid ambiguity
  2. Use examples:
    • Few-shot examples
    • Show desired format
    • Guide behavior
  3. Structure prompts:
    • System prompt, context, instructions
    • Clear sections
    • Easy to read
  4. Iterate:
    • Test different prompts
    • Refine based on results
    • A/B test
  5. Version control:
    • Version prompts
    • Track changes
    • Enable rollback
Best practice: Write clear, structured prompts with examples iterate and test.
Hallucination:
  • Model generates false information
  • Confidently states incorrect facts
  • Common in LLMs
Control with prompt engineering:
  1. Ground in context:
    • Use RAG to provide context
    • Instruct model to use only context
    • Cite sources
  2. Explicit instructions:
    • “Only use provided context”
    • “If unsure, say so”
    • “Don’t make up information”
  3. Few-shot examples:
    • Show correct behavior
    • Examples of admitting uncertainty
    • Guide model
  4. Output format:
    • Structured output
    • Confidence scores
    • Source citations
Best practice: Use RAG + explicit instructions to reduce hallucinations ground answers in context.
Improve reasoning:
  1. Chain-of-thought:
    • Ask model to think step-by-step
    • Show reasoning in examples
    • Improves complex reasoning
  2. Few-shot CoT:
    • Examples with reasoning steps
    • Model learns pattern
    • Better reasoning
  3. Self-consistency:
    • Generate multiple reasoning chains
    • Pick most common answer
    • Improves accuracy
  4. Verification:
    • Ask model to verify answer
    • Check reasoning
    • Catch errors
Best practice: Use chain-of-thought prompting for complex reasoning ask model to think step-by-step.
If CoT fails:
  1. Simplify problem:
    • Break into smaller steps
    • Solve step-by-step
    • Combine solutions
  2. Better examples:
    • Higher quality examples
    • Clearer reasoning
    • More relevant
  3. Different approach:
    • Try different reasoning style
    • Alternative methods
    • Experiment
  4. Model upgrade:
    • Use larger model
    • Better reasoning capability
    • GPT-4, Claude Opus
  5. External tools:
    • Use calculator, code execution
    • Verify with tools
    • Hybrid approach
Best practice: Simplify problem, improve examples, or upgrade model CoT isn’t always sufficient.

11. Cost & Latency Tradeoffs

Token reduction strategies:
  1. Prompt optimization:
    • Remove unnecessary text
    • Use concise instructions
    • Remove redundant examples
  2. Context management:
    • Only include relevant context
    • Use RAG to retrieve only needed docs
    • Truncate long documents
  3. Prompt caching:
    • Cache system prompts
    • Reuse across requests
    • Significant savings
  4. Response limits:
    • Set max tokens for output
    • Stop early when possible
    • Use stop sequences
  5. Model selection:
    • Use smaller models when possible
    • Distilled models
    • Task-specific models
Best practice: Optimize prompts, use caching, and manage context can reduce token usage by 30-50%.
When to quantize:
  1. Memory constraints:
    • Limited GPU memory
    • Need to fit larger models
    • Edge devices
  2. Cost optimization:
    • Reduce inference cost
    • Lower infrastructure costs
    • Scale more efficiently
  3. Latency requirements:
    • Need faster inference
    • Real-time applications
    • Lower latency
Trade-offs:
  • Pros: Less memory, faster, cheaper
  • Cons: Some accuracy loss, more complex
Best practice: Quantize when memory/cost/latency are constraints and small accuracy loss is acceptable.
Batching strategy:
  1. Dynamic batching:
    • Batch requests together
    • Process multiple requests simultaneously
    • Higher throughput
  2. Continuous batching (vLLM):
    • Add requests to batch dynamically
    • Remove completed requests
    • Optimal GPU utilization
  3. Batch size:
    • Balance latency vs throughput
    • Larger batches = higher throughput
    • Smaller batches = lower latency
Caching strategy:
  1. Prompt caching:
    • Cache system prompts
    • Reuse across requests
    • Significant latency reduction
  2. Response caching:
    • Cache common queries
    • Return cached responses
    • Very fast
  3. Context caching:
    • Cache conversation context
    • Reuse for multi-turn
    • Faster responses
Best practice: Use continuous batching + prompt caching for best latency/throughput balance.
Hosted APIs (OpenAI, Anthropic):
  • Use when:
    • Low to medium volume
    • Need latest models
    • Don’t want infrastructure management
    • Quick to market
Open-source models (self-hosted):
  • Use when:
    • High volume (>2-5M requests/month)
    • Cost-sensitive
    • Need data privacy
    • Want control over models
Decision framework:
  • Volume: Low → hosted, high → self-hosted
  • Cost: Low volume → hosted, high volume → self-hosted
  • Privacy: Need privacy → self-hosted
  • Speed: Quick to market → hosted
Best practice: Start with hosted APIs, migrate to self-hosted as you scale and cost becomes concern.

12. Agentic AI

Agent definition:
  • LLM that can use tools and take actions
  • Can plan, execute, and iterate
  • Autonomous decision-making
  • Examples: Code execution, web search, API calls
Key capabilities:
  • Tool use: Call functions, APIs, tools
  • Planning: Break down tasks into steps
  • Execution: Take actions based on plan
  • Iteration: Refine based on results
Practical examples:
  • Code agent: Writes and executes code
  • Research agent: Searches web, synthesizes info
  • API agent: Calls APIs, processes data
Best practice: Agents are LLMs with tool-use capabilities enable autonomous task completion.
Orchestration challenges:
  1. Error handling:
    • Tool failures
    • Partial failures
    • Recovery strategies
  2. State management:
    • Track execution state
    • Manage context across tools
    • Handle state transitions
  3. Planning:
    • Determine tool sequence
    • Handle dependencies
    • Adapt to failures
  4. Coordination:
    • Coordinate multiple tools
    • Handle async operations
    • Manage timeouts
  5. Debugging:
    • Complex execution paths
    • Hard to trace issues
    • Difficult to reproduce
Best practice: Error handling and state management are hardest design robust error handling and clear state management.
Why agents loop/stall:
  1. Poor planning:
    • Incomplete plans
    • Circular dependencies
    • Unclear goals
  2. No termination:
    • No stopping criteria
    • Keep trying indefinitely
    • No timeout
  3. Error recovery:
    • Same error repeatedly
    • No alternative strategies
    • Stuck in loop
  4. Context limits:
    • Lose track of progress
    • Forget what tried
    • Repeat actions
  5. Tool failures:
    • Keep retrying failed tools
    • No fallback strategies
    • Stuck on failures
Mitigation:
  • Set max iterations
  • Implement timeouts
  • Track execution history
  • Use fallback strategies
  • Clear stopping criteria
Best practice: Set max iterations, timeouts, and track execution history to prevent loops and stalls.
Single-agent:
  • Simpler, easier to debug
  • Good for most tasks
  • Single point of failure
Multi-agent:
  • More complex, harder to debug
  • Good for complex tasks
  • Parallel execution
When multi-agent pays off:
  • Complex tasks: Need multiple specialists
  • Parallel work: Can work simultaneously
  • Specialization: Different agents for different tasks
  • Scale: Handle more complex workflows
When single-agent better:
  • Simple tasks: Single agent sufficient
  • Debugging: Easier to debug
  • Cost: Lower complexity
Best practice: Use single-agent for most tasks, multi-agent only when complexity justifies it.
Evaluation metrics:
  1. Task completion:
    • Success rate
    • Task completion time
    • Quality of results
  2. Efficiency:
    • Number of steps
    • Tool calls per task
    • Time to completion
  3. Reliability:
    • Error rate
    • Recovery from failures
    • Consistency
  4. User satisfaction:
    • User feedback
    • Task success rate
    • Time saved
  5. Cost:
    • Cost per task
    • Tool usage costs
    • Total cost
Evaluation approach:
  • A/B test: Agentic vs non-agentic
  • Measure metrics above
  • Compare performance
  • User feedback
Best practice: Evaluate with A/B testing measure task completion, efficiency, reliability, and user satisfaction.
Agent types:
  1. Simple reflex agents:
    • React to current percept
    • No memory, no planning
    • Condition-action rules
    • Example: Thermostat (if temp > threshold, turn on AC)
  2. Model-based reflex agents:
    • Maintain internal model of world
    • Track how world evolves
    • Better decisions with history
    • Example: Agent tracking inventory changes
  3. Goal-based agents:
    • Have explicit goals
    • Plan actions to achieve goals
    • Consider future consequences
    • Example: Navigation agent finding path to destination
  4. Utility-based agents:
    • Maximize utility function
    • Handle uncertainty and trade-offs
    • Choose best action given preferences
    • Example: Trading agent maximizing profit while managing risk
  5. Learning agents:
    • Improve performance over time
    • Learn from experience
    • Adapt to new situations
    • Example: Agent that improves recommendations based on feedback
Comparison:
  • Simple reflex: Fastest, simplest, limited
  • Model-based: More capable, needs world model
  • Goal-based: Can plan, needs goal specification
  • Utility-based: Handles trade-offs, needs utility function
  • Learning: Most flexible, needs training data
Best practice: Choose agent type based on task complexity simple reflex for basic tasks, learning agents for complex adaptive tasks.
Reactive agents:
  • Respond to current situation only
  • No internal state or memory
  • Immediate action based on percept
  • Simple condition-action rules
How they work:
  1. Perceive: Observe current environment
  2. Match: Match percept to condition
  3. Act: Execute corresponding action
  4. Repeat: No memory of past actions
Characteristics:
  • Fast: No planning overhead
  • Simple: Easy to implement
  • Limited: Can’t handle complex tasks
  • No learning: Don’t improve over time
Use cases:
  • Simple control systems
  • Real-time responses
  • When speed > sophistication
  • Deterministic environments
Limitations:
  • Can’t plan ahead
  • No memory of past actions
  • Limited to simple tasks
  • Can’t handle uncertainty well
Best practice: Use reactive agents for simple, fast-response tasks where immediate action is more important than planning.
ReAct agents:
  • Combine reasoning and acting
  • Interleave thinking and action
  • Use chain-of-thought reasoning
  • Take actions based on reasoning
How ReAct works:
  1. Think: Reason about current situation
  2. Act: Take action based on reasoning
  3. Observe: See result of action
  4. Think: Reason about new situation
  5. Repeat: Continue until goal achieved
Key components:
  • Reasoning: Chain-of-thought thinking
  • Acting: Tool/function calls
  • Observation: Results from actions
  • Iteration: Refine based on observations
Advantages:
  • Transparency: Can see reasoning process
  • Flexibility: Adapts to new situations
  • Error recovery: Can reason about failures
  • Better decisions: Thoughtful actions
Example:
Thought: I need to find the weather. Let me search for it.
Action: search_web("weather today")
Observation: Weather is sunny, 75°F
Thought: User asked about weather. I have the answer.
Action: respond("The weather is sunny and 75°F")
Best practice: Use ReAct for complex tasks requiring reasoning combines thinking and action for better results.
Agent reaction patterns:
  1. Immediate reaction:
    • React instantly to stimulus
    • No deliberation
    • Fast response
    • Simple reflex agents
  2. Deliberative reaction:
    • Think before acting
    • Consider options
    • Plan actions
    • Goal-based agents
  3. Adaptive reaction:
    • Learn from experience
    • Adjust behavior
    • Improve over time
    • Learning agents
  4. Contextual reaction:
    • Consider context
    • Use memory/history
    • Better decisions
    • Model-based agents
Reaction mechanisms:
  • Stimulus → Action: Direct mapping
  • Stimulus → Reasoning → Action: With deliberation
  • Stimulus → Memory → Reasoning → Action: With context
  • Stimulus → Learning → Adaptation → Action: With improvement
Factors affecting reaction:
  • Agent type: Reflex vs deliberative
  • Environment: Deterministic vs stochastic
  • Goals: Immediate vs long-term
  • Experience: New vs learned
Best practice: Design agents to react appropriately immediate for urgent, deliberative for complex, adaptive for changing environments.
Agent evaluation metrics:
  1. Task success:
    • Success rate (% tasks completed)
    • Goal achievement rate
    • Task completion quality
    • Accuracy of results
  2. Efficiency:
    • Steps to completion
    • Tool calls per task
    • Time to completion
    • Resource usage
  3. Reliability:
    • Error rate
    • Failure recovery rate
    • Consistency across runs
    • Robustness to edge cases
  4. Cost:
    • Cost per task
    • Token usage
    • Tool/API costs
    • Total cost of ownership
  5. User experience:
    • User satisfaction
    • Response time
    • Quality of interactions
    • Helpfulness
  6. Learning (for learning agents):
    • Improvement over time
    • Adaptation to new tasks
    • Generalization ability
    • Sample efficiency
Evaluation framework:
  • Offline: Test on held-out dataset
  • Online: A/B test with real users
  • Simulation: Test in controlled environment
  • Human evaluation: Expert review
Best practice: Use multiple metrics task success, efficiency, reliability, and user experience for comprehensive evaluation.
End-to-end evaluation:
  1. Define evaluation tasks:
    • Realistic scenarios
    • Diverse task types
    • Clear success criteria
    • Representative of real use
  2. Set up test environment:
    • Simulated or real environment
    • Tools and APIs available
    • Controlled conditions
    • Reproducible setup
  3. Run agent on tasks:
    • Execute agent on each task
    • Record all actions
    • Capture outputs
    • Log errors/failures
  4. Measure performance:
    • Task success rate
    • Steps to completion
    • Time to completion
    • Quality of results
    • Cost per task
  5. Analyze results:
    • Identify failure modes
    • Find common errors
    • Analyze efficiency
    • Compare with baselines
  6. Iterate:
    • Fix identified issues
    • Improve agent
    • Re-evaluate
    • Continuous improvement
Evaluation datasets:
  • WebArena: Web navigation tasks
  • AgentBench: Multi-domain agent tasks
  • ToolBench: Tool-using tasks
  • Custom: Domain-specific tasks
Best practice: Evaluate end-to-end on realistic tasks measure success, efficiency, and quality comprehensively.
Tool calling:
  • Agents invoke external functions/tools
  • Extends agent capabilities
  • Enables real-world actions
  • Examples: API calls, code execution, web search
How agents use tools:
  1. Tool definition:
    • Define available tools
    • Specify parameters
    • Document functionality
    • Example: search_web(query: str) -> str
  2. Tool selection:
    • Agent decides which tool to use
    • Based on current task
    • Considers tool capabilities
    • Matches tool to need
  3. Tool invocation:
    • Call tool with parameters
    • Execute tool function
    • Get result
    • Handle errors
  4. Result processing:
    • Process tool output
    • Use result for next action
    • Integrate into reasoning
    • Continue task
Tool calling patterns:
  • Sequential: One tool at a time
  • Parallel: Multiple tools simultaneously
  • Conditional: Tool based on condition
  • Iterative: Tool in loop until done
Best practice: Design tools with clear interfaces agents need well-defined tools with good documentation.
OpenAI Functions:
  • Structured way to define tools
  • Model decides when to call functions
  • Returns structured function calls
  • Enables reliable tool use
How it works:
  1. Define functions:
    {
      "name": "get_weather",
      "description": "Get current weather",
      "parameters": {
        "type": "object",
        "properties": {
          "location": {"type": "string"}
        }
      }
    }
    
  2. Model decides:
    • Model sees function definitions
    • Decides if function needed
    • Returns function call if needed
    • Or continues conversation
  3. Execute function:
    • Parse function call
    • Execute with parameters
    • Get result
    • Return to model
  4. Model continues:
    • Model sees function result
    • Uses result in response
    • Can call more functions
    • Completes task
Advantages:
  • Reliable: Structured function calls
  • Flexible: Model decides when to use
  • Type-safe: JSON schema validation
  • Easy integration: Standard format
Best practice: Use OpenAI Functions for reliable tool calling structured, type-safe, and model-controlled.
MCP (Model Context Protocol):
  • Standard protocol for agent-tool communication
  • Enables agents to use external tools
  • Provides context to models
  • Standardizes tool interfaces
How MCP works:
  1. Tool registration:
    • Tools register with MCP server
    • Define capabilities
    • Specify interfaces
    • Make available to agents
  2. Context provision:
    • MCP provides tool context
    • Describes available tools
    • Shows tool capabilities
    • Updates dynamically
  3. Tool invocation:
    • Agent requests tool use
    • MCP routes to tool
    • Executes tool
    • Returns result
  4. Context updates:
    • MCP updates context
    • Reflects tool results
    • Maintains state
    • Enables multi-step tasks
Key features:
  • Standardization: Common protocol
  • Interoperability: Works across systems
  • Context management: Maintains state
  • Tool discovery: Agents find tools
Use cases:
  • Multi-tool agent systems
  • Tool marketplace integration
  • Standardized agent platforms
  • Cross-platform tool use
Best practice: Use MCP for standardized tool integration enables agents to discover and use tools reliably.
Agent-to-Agent (A2A) communication:
  • Agents communicate with each other
  • Coordinate on tasks
  • Share information
  • Collaborate on goals
A2A patterns:
  1. Direct communication:
    • Agents send messages directly
    • Point-to-point communication
    • Simple but limited scale
    • Example: Two agents coordinating
  2. Broadcast communication:
    • One agent broadcasts to all
    • Announcements, updates
    • Efficient for one-to-many
    • Example: Leader announcing plan
  3. Mediated communication:
    • Communication through mediator
    • Centralized coordination
    • Better for complex systems
    • Example: Message broker
  4. Shared memory:
    • Agents share common memory
    • Read/write shared state
    • Coordination through state
    • Example: Blackboard architecture
Coordination strategies:
  1. Task delegation:
    • One agent delegates to others
    • Divide and conquer
    • Specialized agents
    • Example: Manager delegates to workers
  2. Consensus:
    • Agents agree on action
    • Voting, negotiation
    • Democratic decision-making
    • Example: Agents vote on plan
  3. Auction:
    • Agents bid on tasks
    • Market-based coordination
    • Efficient resource allocation
    • Example: Task auction system
  4. Contract net:
    • One agent announces task
    • Others bid on task
    • Select best bidder
    • Example: Task allocation
Best practice: Design A2A systems with clear communication protocols enables effective coordination and collaboration.
Multi-agent system design:
  1. Agent roles:
    • Define agent responsibilities
    • Specialize agents
    • Clear role boundaries
    • Example: Researcher, Writer, Reviewer
  2. Communication protocol:
    • Define message format
    • Specify communication channels
    • Establish protocols
    • Example: JSON messages, REST API
  3. Coordination mechanism:
    • How agents coordinate
    • Task allocation
    • Conflict resolution
    • Example: Manager agent, voting
  4. Shared resources:
    • Common knowledge base
    • Shared memory
    • Tool access
    • Example: Shared database
  5. Error handling:
    • Agent failure recovery
    • Communication failures
    • Task reassignment
    • Example: Backup agents, retries
Design patterns:
  1. Hierarchical:
    • Manager-worker structure
    • Top-down coordination
    • Clear hierarchy
    • Example: Manager delegates to workers
  2. Peer-to-peer:
    • Equal agents
    • Distributed coordination
    • No central authority
    • Example: Swarm agents
  3. Market-based:
    • Agents trade resources
    • Auction-based allocation
    • Economic incentives
    • Example: Task marketplace
  4. Blackboard:
    • Shared blackboard
    • Agents read/write
    • Opportunistic coordination
    • Example: Shared knowledge base
Best practice: Design multi-agent systems with clear roles, communication protocols, and coordination mechanisms enables effective collaboration.
A2A communication challenges:
  1. Message understanding:
    • Agents interpret messages
    • Ambiguity in communication
    • Different vocabularies
    • Misunderstandings
  2. Synchronization:
    • Timing of messages
    • Async vs sync communication
    • Race conditions
    • Deadlocks
  3. Scalability:
    • Communication overhead
    • Message flooding
    • Network congestion
    • Performance degradation
  4. Reliability:
    • Message delivery
    • Lost messages
    • Duplicate messages
    • Ordering guarantees
  5. Security:
    • Authentication
    • Authorization
    • Message encryption
    • Trust between agents
  6. Coordination:
    • Avoiding conflicts
    • Resolving disputes
    • Consensus building
    • Task allocation
Solutions:
  • Protocols: Standardized communication
  • Message queues: Reliable delivery
  • Encryption: Secure communication
  • Authentication: Trusted agents
  • Coordination algorithms: Conflict resolution
Best practice: Address communication challenges with protocols, reliability mechanisms, and security critical for multi-agent systems.
A2A system evaluation:
  1. System-level metrics:
    • Overall task completion
    • System efficiency
    • Resource utilization
    • End-to-end performance
  2. Agent-level metrics:
    • Individual agent performance
    • Agent contribution
    • Agent reliability
    • Agent efficiency
  3. Communication metrics:
    • Message overhead
    • Communication latency
    • Message success rate
    • Coordination efficiency
  4. Coordination metrics:
    • Task allocation quality
    • Conflict resolution rate
    • Consensus building time
    • Coordination overhead
  5. Scalability metrics:
    • Performance with more agents
    • Communication overhead growth
    • System stability
    • Resource usage
Evaluation approaches:
  • Simulation: Test in controlled environment
  • Benchmarks: Standard test suites
  • Real-world: Deploy and monitor
  • Stress testing: Test under load
Best practice: Evaluate A2A systems at multiple levels system, agent, communication, and coordination for comprehensive assessment.
Reactive agents:
  • React to current situation
  • No planning or memory
  • Fast response
  • Simple implementation
  • Limited to simple tasks
Deliberative agents:
  • Plan before acting
  • Consider future consequences
  • Slower but better decisions
  • More complex
  • Handle complex tasks
Hybrid agents:
  • Combine reactive and deliberative
  • React for urgent, deliberate for complex
  • Balance speed and quality
  • Most practical
  • Best of both worlds
Comparison:
AspectReactiveDeliberativeHybrid
SpeedFastSlowMedium
ComplexitySimpleComplexMedium
PlanningNoYesSelective
MemoryNoYesYes
Use caseSimpleComplexGeneral
When to use:
  • Reactive: Simple, fast-response tasks
  • Deliberative: Complex planning tasks
  • Hybrid: General-purpose agents
Best practice: Use hybrid agents for most applications balance speed and quality, react when needed, deliberate when beneficial.
Plan-and-execute:
  • Plan entire task upfront
  • Execute plan step by step
  • Rigid execution
  • Can’t adapt to changes
  • Good for predictable tasks
ReAct:
  • Interleave reasoning and acting
  • Plan incrementally
  • Adapt to observations
  • Flexible execution
  • Good for dynamic tasks
Comparison:Plan-and-execute:
  • Pros: Clear plan, efficient execution
  • Cons: Rigid, can’t adapt, fails if plan wrong
  • Use when: Task is predictable, plan is reliable
ReAct:
  • Pros: Flexible, adapts, handles uncertainty
  • Cons: More steps, slower, more tokens
  • Use when: Task is dynamic, needs adaptation
Example:Plan-and-execute:
1. Plan: Search weather → Get location → Format response
2. Execute: Search weather
3. Execute: Get location
4. Execute: Format response
ReAct:
1. Think: Need weather, let me search
2. Act: search_web("weather")
3. Observe: Weather is sunny
4. Think: Good, now format response
5. Act: respond("Weather is sunny")
Best practice: Use ReAct for dynamic tasks, plan-and-execute for predictable tasks choose based on task characteristics.
Learning agent improvement:
  1. Experience collection:
    • Collect training data
    • Record actions and outcomes
    • Build experience database
    • Track performance
  2. Learning mechanisms:
    • Supervised learning: Learn from labeled examples
    • Reinforcement learning: Learn from rewards
    • Unsupervised learning: Discover patterns
    • Meta-learning: Learn to learn
  3. Performance improvement:
    • Better decision-making
    • Fewer errors
    • Faster task completion
    • Higher success rate
  4. Adaptation:
    • Adapt to new tasks
    • Handle edge cases
    • Generalize from experience
    • Transfer learning
Learning approaches:
  1. Online learning:
    • Learn during operation
    • Continuous improvement
    • Real-time adaptation
    • Example: Agent learns from user feedback
  2. Offline learning:
    • Learn from historical data
    • Batch training
    • Periodic updates
    • Example: Retrain on collected data
  3. Transfer learning:
    • Learn from related tasks
    • Apply to new domains
    • Faster adaptation
    • Example: Agent trained on task A helps with task B
Best practice: Design learning agents with clear learning objectives collect experience, learn continuously, and adapt to new situations.
Agent evaluation frameworks:
  1. AgentBench:
    • Multi-domain agent tasks
    • Standardized evaluation
    • Diverse task types
    • Comprehensive metrics
  2. WebArena:
    • Web navigation tasks
    • Realistic scenarios
    • Browser automation
    • Success rate metrics
  3. ToolBench:
    • Tool-using tasks
    • Function calling evaluation
    • Tool selection accuracy
    • Task completion rate
  4. ALFWorld:
    • Household tasks
    • Embodied agents
    • Sequential actions
    • Task success metrics
  5. Custom frameworks:
    • Domain-specific tasks
    • Real-world scenarios
    • Business metrics
    • User satisfaction
Evaluation dimensions:
  • Task success: Can agent complete task?
  • Efficiency: How many steps?
  • Quality: How good is result?
  • Reliability: How consistent?
  • Cost: How expensive?
Best practice: Use standardized frameworks (AgentBench, WebArena) for comparison, custom frameworks for domain-specific evaluation.

13. System Design Thinking

Determinism strategies:
  1. Temperature = 0:
    • Pure greedy decoding
    • Deterministic outputs
    • Reproducible
  2. Fixed seed:
    • Set random seed
    • Same seed = same output
    • Reproducible
  3. Structured output:
    • Use JSON schema
    • Validate output format
    • Consistent structure
  4. Prompt engineering:
    • Clear instructions
    • Few-shot examples
    • Consistent format
Reducing brittleness:
  1. Error handling:
    • Graceful degradation
    • Fallback strategies
    • Retry logic
  2. Validation:
    • Input validation
    • Output validation
    • Error detection
  3. Monitoring:
    • Track failures
    • Alert on issues
    • Quick response
Best practice: Use temperature=0, structured output, and robust error handling for deterministic, robust systems.
Fallback strategies:
  1. Retry:
    • Retry with same prompt
    • Exponential backoff
    • Max retries
  2. Simplified prompt:
    • Retry with simpler prompt
    • Remove complexity
    • Basic version
  3. Cached response:
    • Return cached response
    • Similar queries
    • Fast fallback
  4. Template response:
    • Pre-written responses
    • Generic answers
    • User-friendly
  5. Human escalation:
    • Route to human
    • For critical tasks
    • Last resort
Best practice: Implement layered fallbacks retry → simplified prompt → cached response → template → human escalation.
When to avoid LLMs:
  1. Simple tasks:
    • Rule-based sufficient
    • No need for AI
    • Faster, cheaper
  2. Deterministic tasks:
    • Need exact results
    • No ambiguity
    • Traditional methods better
  3. Cost-sensitive:
    • LLM too expensive
    • Simple solution sufficient
    • Cost optimization
  4. Latency-critical:
    • Need very fast response
    • LLM too slow
    • Real-time requirements
When to avoid vector DBs:
  1. Small dataset:
    • Can use simple search
    • No need for vector DB
    • Overkill
  2. Exact matches:
    • Keyword search sufficient
    • No semantic search needed
    • Simpler solution
Best practice: Consider simpler solutions first use LLMs/vector DBs only when needed.
SQL databases:
  • Use for: Structured data, exact queries, transactions
  • Examples: PostgreSQL, MySQL
  • Best for: User data, transactions, structured queries
NoSQL databases:
  • Use for: Unstructured data, flexible schema, scale
  • Examples: MongoDB, DynamoDB
  • Best for: Documents, flexible schema, high scale
Vector databases:
  • Use for: Embeddings, similarity search, RAG
  • Examples: Pinecone, Milvus, Weaviate
  • Best for: Semantic search, RAG systems
Decision framework:
  • Structured data + exact queries: SQL
  • Unstructured data + flexible schema: NoSQL
  • Embeddings + similarity search: Vector
Best practice: Use SQL for structured data, NoSQL for documents, vector DB for embeddings.

14. Risks, Integrity & Compliance

Hallucination monitoring:
  1. Output validation:
    • Check for factual claims
    • Verify against sources
    • Flag suspicious outputs
  2. Confidence scores:
    • Monitor confidence levels
    • Flag low-confidence outputs
    • Review manually
  3. User feedback:
    • Collect thumbs up/down
    • Track user reports
    • Identify patterns
  4. Citation accuracy:
    • Verify citations
    • Check source relevance
    • Measure citation precision
  5. Automated checks:
    • Fact-checking APIs
    • Knowledge base verification
    • Pattern detection
Best practice: Monitor hallucinations with output validation, confidence scores, and user feedback essential for production.
Bias vs fairness:Bias:
  • Statistical bias in model
  • Can be measured
  • Technical issue
Fairness:
  • Social concept
  • Subjective
  • Context-dependent
When fixing makes it worse:
  1. Over-correction:
    • Fixing one bias creates another
    • Unintended consequences
    • Worse outcomes
  2. Wrong metrics:
    • Optimizing wrong fairness metric
    • Doesn’t improve real fairness
    • Makes system worse
  3. Context mismatch:
    • Fixing for one context
    • Doesn’t work in other contexts
    • Creates new issues
Best practice: Carefully define fairness metrics, test in real contexts, and monitor for unintended consequences.
Red-teaming checklist:
  1. Safety:
    • Harmful content generation
    • Jailbreak attempts
    • Prompt injection
    • Safety bypasses
  2. Bias:
    • Demographic bias
    • Stereotyping
    • Unfair treatment
    • Representation issues
  3. Privacy:
    • PII leakage
    • Data exposure
    • Privacy violations
    • Compliance issues
  4. Security:
    • Prompt injection
    • Model extraction
    • Data poisoning
    • Adversarial attacks
  5. Reliability:
    • Hallucinations
    • Inconsistency
    • Error handling
    • Edge cases
Best practice: Red-team before launch test safety, bias, privacy, security, and reliability comprehensively.
Privacy handling:
  1. Anonymization:
    • Remove PII
    • Hash sensitive data
    • Pseudonymize users
  2. Access control:
    • Role-based access
    • Audit logs
    • Secure storage
  3. Retention:
    • Set retention policies
    • Delete old logs
    • Comply with regulations
  4. Encryption:
    • Encrypt at rest
    • Encrypt in transit
    • Secure storage
  5. Compliance:
    • GDPR, CCPA compliance
    • User consent
    • Right to deletion
Best practice: Anonymize logs, control access, set retention, encrypt data, and comply with regulations.

15. Scaling & Business Impact

Trade-offs:Cost vs accuracy:
  • Use smaller models → lower cost, lower accuracy
  • Use larger models → higher cost, higher accuracy
  • Decision: Balance based on requirements
Latency vs accuracy:
  • Use faster models → lower latency, lower accuracy
  • Use better models → higher latency, higher accuracy
  • Decision: Balance based on use case
Cost vs latency:
  • Use caching → lower cost, lower latency
  • Use more GPUs → higher cost, lower latency
  • Decision: Balance based on budget
Example scenario:
  • Situation: High latency, need to reduce
  • Solution: Use smaller model, add caching
  • Trade-off: Slight accuracy loss, but acceptable
  • Result: Latency reduced 50%, accuracy dropped 5%
Best practice: Understand trade-offs often need to balance cost, latency, and accuracy based on priorities.
Infrastructure concerns:
  1. Security:
    • Data privacy
    • Compliance
    • Access control
    • Encryption
  2. Reliability:
    • Uptime requirements
    • Error handling
    • Disaster recovery
    • SLAs
  3. Scalability:
    • Handle enterprise scale
    • Performance at scale
    • Cost at scale
    • Infrastructure needs
  4. Integration:
    • Existing systems
    • APIs, authentication
    • Data pipelines
    • Workflows
  5. Support:
    • Documentation
    • Support channels
    • Training
    • Maintenance
Best practice: Address security, reliability, scalability, integration, and support critical for enterprise adoption.
Design principles:
  1. Real value:
    • Solve real problems
    • Clear value proposition
    • User needs first
  2. Quality:
    • High accuracy
    • Reliable performance
    • Consistent results
  3. Scalability:
    • Handle growth
    • Cost-effective
    • Performance at scale
  4. User experience:
    • Intuitive interface
    • Fast responses
    • Good error handling
  5. Iteration:
    • Continuous improvement
    • User feedback
    • Regular updates
Best practice: Focus on real value, quality, scalability, and user experience build for long-term, not just demo.

16. Real-World Scenarios

Migration strategy:
  1. Dual-write:
    • Write to both old and new vector DBs
    • Gradually migrate reads
    • Deprecate old DB
  2. Blue-green:
    • Maintain two environments
    • Re-embed in green
    • Switch traffic when ready
  3. Incremental:
    • Re-embed in batches
    • Update incrementally
    • Route queries appropriately
  4. Validation:
    • Compare results
    • Ensure quality maintained
    • Monitor metrics
Best practice: Use dual-write or blue-green deployment migrate safely with validation and rollback capability.
Fine-tuning process:
  1. Data collection:
    • Collect user behavior data
    • Label data
    • Create training set
  2. Fine-tuning:
    • Train on user behavior
    • Monitor validation metrics
    • Iterate
  3. Evaluation:
    • Test on held-out set
    • Compare with baseline
    • Measure improvement
  4. Deployment:
    • A/B test against baseline
    • Gradual rollout
    • Monitor performance
  5. Monitoring:
    • Track metrics
    • Monitor user feedback
    • Iterate based on results
Best practice: Collect data, fine-tune carefully, evaluate thoroughly, deploy gradually, and monitor continuously.
Cost optimization strategies:
  1. Model optimization:
    • Quantization
    • Model distillation
    • Smaller models
  2. Caching:
    • Prompt caching
    • Response caching
    • Reduce redundant calls
  3. Smart routing:
    • Route simple queries to smaller models
    • Route complex to larger models
    • Cost-aware routing
  4. Batching:
    • Dynamic batching
    • Continuous batching
    • Higher throughput
  5. Infrastructure:
    • Spot instances
    • Auto-scaling
    • Right-sizing
Best practice: Optimize models, use caching and smart routing, batch requests can reduce costs 30-50% without quality loss.
Debugging process:
  1. Reproduce:
    • Reproduce the issue
    • Capture prompt and response
    • Identify pattern
  2. Analyze:
    • Check prompt quality
    • Review context
    • Check model parameters
  3. Isolate:
    • Test with minimal prompt
    • Remove complexity
    • Identify root cause
  4. Fix:
    • Update prompt
    • Adjust parameters
    • Add examples
  5. Validate:
    • Test on known cases
    • Verify fix works
    • Monitor in production
Example:
  • Issue: Model gives wrong answer
  • Debug: Check prompt, find ambiguous instruction
  • Fix: Clarify instruction, add example
  • Validate: Test on known cases, works correctly
Best practice: Systematic debugging reproduce, analyze, isolate, fix, validate.

Conclusion & Interview Tips

This guide covers all major AI engineering areas from prompt design to scalable systems and ethical deployment.

Key Preparation Tips

  • Understand system trade-offs
  • Build RAG or LLM-serving demos
  • Learn caching, monitoring, CI/CD
  • Emphasize ethics & safety
  • Explain architecture choices clearly

During the Interview

  • Clarify before answering
  • Think aloud for reasoning
  • Mention latency/cost trade-offs
  • Talk about monitoring and fallback
  • Stay calm & confident
Interviews test not just your AI knowledge but your reasoning about scale, safety, and reliability. Stay grounded and structured.
Good luck with your AI Engineer interviews!