1. LLM Engineering & Prompt Design
What are few-shot vs zero-shot prompting and when to use each?
What are few-shot vs zero-shot prompting and when to use each?
Few-shot prompting provides 1–5 examples to guide structure, style, and tone.Use zero-shot when:
- Task is simple and well-known (e.g., translation)
- No examples available or cost-sensitive
- You want unbiased responses
- Output format consistency is needed
- Domain-specific context or edge cases exist
- Zero-shot outputs are inconsistent
- Zero-shot: “Classify sentiment: ‘I love this phone!’”
- Few-shot: Add 2–3 labeled examples before the query.
What is temperature in LLM generation and how does it affect outputs?
What is temperature in LLM generation and how does it affect outputs?
- Low (0–0.3): Deterministic, precise outputs (coding, classification)
- Medium (0.4–0.7): Balanced tone (summaries, Q&A)
- High (0.8–1.0): Creative, diverse results (brainstorming)
- Coding assistant →
temp=0.1 - Customer support →
temp=0.4 - Marketing content →
temp=0.8
top_p (nucleus sampling) and top_k for fine control.What is prompt injection and how to defend against it?
What is prompt injection and how to defend against it?
- Direct: “Ignore previous instructions…”
- Context: Hidden malicious text in documents
- Jailbreak: Role-play or DAN-style prompts
- Prompt leak: Forcing model to reveal system prompt
- Input validation: Filter keywords like “ignore”, “system prompt”.
- Prompt separation: Clearly delimit system and user input.
- Instruction hierarchy: Reiterate rules after user input.
- Output validation: Sanitize responses before showing.
- Monitoring: Log blocked attempts for review.
What is tokenization, and how does it affect generation?
What is tokenization, and how does it affect generation?
- Text → Token IDs → Model processing → Generation
- Different tokenizers (BPE, WordPiece, SentencePiece) use different strategies
- Token count directly affects cost and context window usage
- Token limits: Models have maximum token limits (context window)
- Cost: Pricing is typically per token (input + output)
- Quality: Better tokenization preserves semantic meaning
- Speed: Fewer tokens = faster processing
["Hello", " world"] (2 tokens) or ["Hel", "lo", " wor", "ld"] (4 tokens) depending on tokenizer.Best practices:- Understand your model’s tokenizer (GPT uses BPE, BERT uses WordPiece)
- Monitor token usage to optimize costs
- Consider tokenization when chunking documents for RAG
How do embeddings really work?
How do embeddings really work?
- Training: Models learn from large text corpora that words appearing in similar contexts should have similar vectors
- Vector space: Words/concepts are positioned in high-dimensional space (typically 384, 768, or 1536 dimensions)
- Similarity: Cosine similarity or dot product measures how “close” two embeddings are
- Semantic capture: “king” - “man” + “woman” ≈ “queen” (famous word2vec example)
- Dense vectors: Every dimension has meaning (unlike sparse one-hot encoding)
- Fixed size: All texts map to same dimension vector
- Learned representations: Capture semantic relationships from training data
- Semantic search (find similar documents)
- RAG (retrieve relevant context)
- Clustering and classification
- Recommendation systems
What's the role of attention, positional encoding?
What's the role of attention, positional encoding?
- Allows model to focus on relevant parts of input when generating each token
- Computes weighted relationships between all tokens
- Enables understanding of long-range dependencies
- Self-attention: tokens attend to other tokens in same sequence
- Adds information about token position since transformers process all tokens in parallel
- Without it, “dog bites man” and “man bites dog” would be identical
- Can be learned (BERT) or fixed sinusoidal patterns (original Transformer)
- Attention: “what to focus on” (semantic relationships)
- Position: “where things are” (order matters for meaning)
- Together: Model understands both meaning and structure
What changes during fine-tuning? (optimizers, schedulers, layer freezing)
What changes during fine-tuning? (optimizers, schedulers, layer freezing)
- Model weights: Selected layers get updated (not all layers necessarily)
- Learning rate: Much lower than pre-training (typically 1e-5 to 1e-3)
- Optimizers: Often AdamW or Adam with weight decay
- Schedulers: Cosine annealing, linear warmup, or constant LR
- Layer freezing: Early layers often frozen, only top layers trained
- Full fine-tuning: All parameters updated (expensive, needs more data)
- PEFT (Parameter-Efficient Fine-Tuning): LoRA, Adapters, only train small subset
- Layer freezing: Freeze embeddings and early transformer layers, train only classifier head
- Batch size: 4-32 (depends on GPU memory)
- Epochs: 1-5 (often 1-3 is enough)
- Gradient accumulation: Simulate larger batches
- Mixed precision: FP16/BF16 for memory efficiency
- Start with frozen layers, gradually unfreeze
- Use learning rate finder to find optimal LR
- Monitor validation loss to prevent overfitting
- Save checkpoints frequently
Transformers hinge on attention can you explain why 'attention is all you need' isn't just marketing?
Transformers hinge on attention can you explain why 'attention is all you need' isn't just marketing?
- Parallelization: Unlike RNNs, all tokens processed simultaneously (faster training)
- Long-range dependencies: Direct connections between any two tokens (RNNs struggle with distance)
- Interpretability: Attention weights show what model focuses on
- Flexibility: Can attend to any part of input, not just sequential neighbors
- Multi-head attention: Multiple attention mechanisms capture different relationships
- Self-attention: Tokens attend to other tokens in same sequence
- Scaled dot-product attention: Efficient computation with scaling factor
- Empirically proven: Achieved SOTA on translation, outperformed RNNs/CNNs
- Enables modern LLMs: GPT, BERT, T5 all use attention
- Scalable: Works with billions of parameters
- Foundation for current AI: Most LLMs are transformer-based
Encoder vs decoder vs encoder-decoder: in what scenarios would you prefer each?
Encoder vs decoder vs encoder-decoder: in what scenarios would you prefer each?
- Bidirectional understanding (sees full context)
- Best for: Classification, NER, sentiment analysis, understanding tasks
- Example: BERT for question answering (reads passage, finds answer)
- Autoregressive generation (predicts next token)
- Best for: Text generation, completion, chat, creative writing
- Example: GPT for story generation or code completion
- Both understanding and generation
- Best for: Translation, summarization, text-to-text tasks
- Example: T5 for “translate English to French: Hello” → “Bonjour”
- Need to understand input? → Encoder or encoder-decoder
- Need to generate text? → Decoder or encoder-decoder
- Need both? → Encoder-decoder
- Most modern LLMs are decoder-only (GPT-style) because they’re more flexible
Walk me through tokenization choices (BPE vs WordPiece vs SentencePiece). Where do they break down?
Walk me through tokenization choices (BPE vs WordPiece vs SentencePiece). Where do they break down?
- Used by: GPT, RoBERTa
- How: Starts with characters, iteratively merges most frequent pairs
- Pros: Handles unknown words, good for multilingual
- Cons: Can split words awkwardly
- Breaks down: Very long words, domain-specific terms
- Used by: BERT, DistilBERT
- How: Similar to BPE but uses language model likelihood
- Pros: Better word boundaries, handles subwords well
- Cons: Less flexible than BPE
- Breaks down: Rare technical terms, code snippets
- Used by: T5, ALBERT, multilingual models
- How: Treats input as Unicode, works at sentence level
- Pros: Language-agnostic, handles any Unicode text
- Cons: Can be slower, larger vocabulary
- Breaks down: Very rare characters, mixed scripts
- BPE: Best for general-purpose, multilingual
- WordPiece: Best for English, better word preservation
- SentencePiece: Best for multilingual, code, special characters
- Very long technical terms (e.g., chemical names)
- Mixed languages in single sentence
- Code with special syntax
- Emojis and special Unicode characters
- Domain-specific jargon not in training data
Embeddings: why does cosine similarity dominate, and when does it fail?
Embeddings: why does cosine similarity dominate, and when does it fail?
- Magnitude-independent: Focuses on direction, not vector length
- Normalized: Range is [-1, 1], easy to interpret
- Efficient: Fast computation, works well with approximate nearest neighbor search
- Semantic focus: Captures semantic similarity better than Euclidean distance
- Cosine similarity = dot product of normalized vectors
- Measures angle between vectors, not distance
- Vectors pointing same direction = similar meaning
- Semantic search (find similar documents)
- Recommendation systems
- Clustering similar texts
- RAG retrieval
- Magnitude matters: If vector length encodes importance, cosine ignores it
- Sparse vectors: Works poorly with very sparse embeddings
- High dimensionality: Can become less discriminative in very high dimensions
- Domain mismatch: Embeddings from different models aren’t comparable
- Fine-grained differences: May not capture subtle distinctions
- Dot product: When magnitude matters
- Euclidean distance: When absolute distance is important
- Manhattan distance: For sparse vectors
- Learned similarity: Train a model to learn similarity function
How do context windows constrain design? How have you handled long-context hacks in production?
How do context windows constrain design? How have you handled long-context hacks in production?
- Models have fixed maximum context (e.g., GPT-4: 128k tokens, Claude: 200k)
- Input + output must fit within limit
- Longer context = higher cost and latency
- Must truncate or summarize long documents
- Need strategies for multi-turn conversations
- RAG becomes essential for knowledge beyond context
- Chunking strategy critical for document processing
- Sliding window: Process document in overlapping chunks
- Hierarchical summarization: Summarize chunks, then summarize summaries
- Retrieval: Use RAG to fetch relevant parts instead of including everything
- Context compression: Use smaller models to compress context before main model
- Relevance filtering: Only include most relevant parts of long documents
- Prompt caching: Cache system prompts to save tokens
- Streaming: Start generating before full context processed
- Progressive loading: Load context incrementally
- Smart truncation: Keep beginning and end, truncate middle (often most important parts)
- Chunk into 10 pieces of 8k each
- Embed and store in vector DB
- For query, retrieve top 3 most relevant chunks
- Include only those in context (24k tokens, fits in window)
- More context = better understanding but higher cost
- Less context = faster/cheaper but may miss information
- Balance based on use case requirements
Greedy vs beam vs nucleus sampling which would you pick for a summarization API under latency pressure?
Greedy vs beam vs nucleus sampling which would you pick for a summarization API under latency pressure?
- Always picks highest probability token
- Fastest, deterministic
- Can get stuck in repetitive loops
- Best for: Code generation, when determinism needed
- Keeps top-k candidates at each step
- Explores multiple paths, finds better sequences
- Slower (k× slower), more memory
- Best for: Translation, when quality > speed
- Samples from smallest set covering p% of probability mass
- Good balance of quality and diversity
- Faster than beam, more diverse than greedy
- Best for: Creative tasks, chat, when need variety
- Choose: Greedy or top-p with low temperature (0.3-0.5)
- Why: Summarization needs accuracy, not creativity
- Greedy: Fastest, good for factual summaries
- Top-p (p=0.9, temp=0.3): Slightly slower but more natural phrasing
- Start with greedy for maximum speed
- If quality issues, use top-p with low temperature
- Avoid beam search (too slow for API)
- Consider caching common summaries
Why is positional encoding critical, and what happens when models forget 'order'?
Why is positional encoding critical, and what happens when models forget 'order'?
- Transformers process all tokens in parallel (no inherent order)
- “The cat sat” vs “sat cat The” would be identical without position info
- Position encoding tells model where each token is in sequence
- Model can’t distinguish word order
- “Dog bites man” = “Man bites dog” (same meaning to model)
- Grammar and syntax understanding breaks down
- Language is inherently sequential order matters
- Fixed sinusoidal: Original Transformer, mathematical patterns
- Learned: BERT-style, model learns positions during training
- Relative: T5-style, encodes relative distances between tokens
- Position encoding gets corrupted or removed
- Very long sequences beyond training length
- Position embeddings not properly initialized
- Result: Nonsensical output, loss of grammatical structure
- Input: “I love programming in Python”
- Without position: Model might generate “Python in programming love I”
- With position: Correct order maintained
- Always include positional encoding
- For long contexts, use models trained on long sequences
- Consider relative position encoding for variable-length inputs
What's the trade-off between dense embeddings and sparse/lexical search?
What's the trade-off between dense embeddings and sparse/lexical search?
- Vector representations from neural networks
- Captures semantic meaning, synonyms, context
- Pros: Understands intent, handles synonyms, multilingual
- Cons: Can miss exact keyword matches, requires embedding model
- Traditional TF-IDF, BM25, keyword matching
- Exact word matching, no semantic understanding
- Pros: Fast, exact matches, interpretable, no model needed
- Cons: Misses synonyms, no semantic understanding
- Dense: Better for “find documents about machine learning” (understands ML = AI = deep learning)
- Sparse: Better for “find documents containing ‘Python 3.9’” (exact version matters)
- Combine dense + sparse scores
- Weighted combination:
final_score = α × dense_score + (1-α) × sparse_score - Captures both semantic similarity and exact matches
- Dense only: Semantic search, Q&A, when synonyms matter
- Sparse only: Exact keyword search, code search, when precision critical
- Hybrid: Production RAG systems (recommended)
- Dense: Finds “test automation”, “QA automation”, “CI/CD testing”
- Sparse: Only finds exact phrase “automated testing”
- Hybrid: Finds both semantic matches and exact phrase
When would you choose a smaller distilled model over a frontier model?
When would you choose a smaller distilled model over a frontier model?
- Examples: DistilBERT, TinyBERT, GPT-3.5-turbo vs GPT-4
- Trained to mimic larger models
- 2-10× smaller, 3-5× faster
- Latency constraints: Real-time applications, edge devices
- Cost optimization: Lower inference costs, especially at scale
- Resource limits: Mobile apps, embedded systems, limited GPU memory
- Simple tasks: When smaller model is sufficient (classification, simple Q&A)
- High throughput: Need to process many requests quickly
- Complex reasoning: Need advanced capabilities (GPT-4, Claude Opus)
- Quality critical: When accuracy is more important than speed
- Novel tasks: Tasks smaller models can’t handle
- Low volume: When cost isn’t concern, quality is priority
- Task complexity: Simple → distilled, complex → frontier
- Latency requirement: <100ms → distilled, can wait → frontier
- Volume: High volume → distilled, low volume → frontier
- Budget: Limited → distilled, flexible → frontier
- Use distilled for 80% of requests (fast, cheap)
- Route complex queries to frontier model (smart routing)
- A/B test to find right balance
- Use GPT-3.5-turbo for common questions (fast, cheap)
- Escalate complex issues to GPT-4 (better reasoning)
How do you test whether a model's 'knowledge cutoff' impacts reliability for your use case?
How do you test whether a model's 'knowledge cutoff' impacts reliability for your use case?
- Date when model’s training data ends
- Model doesn’t know events/information after that date
- Example: GPT-4 trained on data up to April 2023
- Create test set: Questions about events before and after cutoff
- Measure accuracy: Compare performance on pre vs post-cutoff questions
- Check hallucinations: Model may confidently make up post-cutoff information
- Domain-specific: Test your specific domain (tech, finance, etc.)
- Before cutoff: “What happened in 2022?” → Should answer correctly
- After cutoff: “What happened in 2024?” → May hallucinate or say “I don’t know”
- Edge cases: Events right around cutoff date
- RAG: Use retrieval to get current information
- Web search: Integrate search API for recent events
- Fine-tuning: Fine-tune on recent data (if available)
- Hybrid approach: Use model for reasoning, external sources for facts
- Track questions about recent events
- Flag potential hallucinations
- Use RAG for time-sensitive queries
- Set expectations with users about knowledge cutoff
- Query: “Latest Python version in 2024?”
- Without RAG: May give outdated answer or hallucinate
- With RAG: Retrieves current info, gives accurate answer
How do you choose a vector DB (Chroma, Pinecone, OpenSearch…)?
How do you choose a vector DB (Chroma, Pinecone, OpenSearch…)?
- Scale: Number of vectors, query volume
- Latency: Response time requirements
- Features: Filtering, metadata, hybrid search
- Deployment: Managed vs self-hosted
- Cost: Pricing model, infrastructure costs
- Pros: Easy setup, good performance, managed scaling
- Cons: Expensive at scale, vendor lock-in
- Best for: Quick prototypes, small to medium scale
- Pros: Open source, easy to use, good for development
- Cons: Less scalable, fewer features
- Best for: Development, small projects, learning
- Pros: Feature-rich, good performance, hybrid search
- Cons: More complex setup
- Best for: Production systems needing advanced features
- Pros: Highly scalable, production-ready, open source
- Cons: Complex setup, needs infrastructure
- Best for: Large-scale production systems
- Pros: Mature, good ecosystem, supports vector search
- Cons: Not optimized specifically for vectors
- Best for: When you need full-text + vector search
- Pros: Fast, good filtering, open source
- Cons: Smaller community
- Best for: Performance-critical applications
- Prototype: Chroma or Pinecone
- Production <10M vectors: Pinecone or Weaviate
- Production >10M vectors: Milvus or Qdrant
- Need full-text search: OpenSearch
- Budget constrained: Self-hosted (Chroma, Milvus, Qdrant)
Can you update or backfill embeddings with zero downtime?
Can you update or backfill embeddings with zero downtime?
- Updating embedding model changes all vector representations
- Old and new embeddings aren’t compatible
- Need to re-embed all documents without service interruption
- Write to both old and new vector DBs simultaneously
- Gradually migrate reads from old to new
- Once migration complete, deprecate old DB
- Maintain two environments (blue = old, green = new)
- Re-embed all documents in green environment
- Switch traffic when ready
- Keep blue as backup
- Re-embed documents in batches
- Use message queue to process updates
- Update vector DB incrementally
- Route queries to appropriate DB based on document version
- Store multiple embedding versions per document
- Query both versions, merge results
- Gradually phase out old version
- Preparation: Set up new embedding pipeline, new vector DB
- Dual-write: New documents go to both DBs
- Backfill: Re-process existing documents in background
- Gradual cutover: Route percentage of queries to new DB
- Validation: Compare results, ensure quality maintained
- Full cutover: Switch all traffic to new DB
- Cleanup: Remove old DB after validation period
- Use feature flags to control rollout
- Monitor metrics during migration
- Keep old system as fallback
- Test with small subset first
- Document the process
- Embed new documents with both models
- Backfill existing documents in background
- Gradually switch queries to new embeddings
- Validate quality hasn’t degraded
How do you evaluate retrieval quality (precision@k, reranking, citation)?
How do you evaluate retrieval quality (precision@k, reranking, citation)?
- Fraction of retrieved items that are relevant
- Precision@5 = 3 relevant out of 5 retrieved = 0.6
- Measures accuracy of top-k results
- Fraction of all relevant items that were retrieved
- Recall@10 = 7 relevant retrieved out of 10 total relevant = 0.7
- Measures coverage
- Average of 1/rank of first relevant result
- Higher is better, emphasizes top results
- Considers ranking quality, discounts lower positions
- Best for when relevance has degrees (highly relevant vs somewhat relevant)
- Second-stage ranking using more expensive model
- Improves precision by reordering initial results
- Trade-off: Better quality but higher latency/cost
- Check if retrieved documents support the answer
- Verify citations are accurate and relevant
- Measure citation precision (correct citations / total citations)
- Create test set: Queries with known relevant documents
- Run retrieval: Get top-k results for each query
- Label relevance: Human annotators mark relevant/irrelevant
- Calculate metrics: Precision@k, Recall@k, MRR, NDCG
- Iterate: Improve retrieval based on results
- Track precision@k over time
- Monitor user feedback (thumbs up/down)
- A/B test different retrieval strategies
- Alert on quality degradation
- Use multiple metrics (precision + recall)
- Test on domain-specific data
- Monitor in production, not just offline
- Use reranking for critical queries
2. Prompting & Context Engineering
Zero-shot vs few-shot: when have you seen one clearly outperform the other?
Zero-shot vs few-shot: when have you seen one clearly outperform the other?
- Task is well-defined and common (translation, summarization)
- Model has strong pre-training on the task
- Cost/token usage is critical
- Need unbiased responses without example influence
- Examples are hard to construct or may introduce bias
- Need specific output format (JSON, structured data)
- Domain-specific terminology or edge cases
- Zero-shot produces inconsistent results
- Task requires demonstration of pattern
- Working with unusual or complex patterns
- Translation: “Translate to French: Hello” (model knows this well)
- Simple classification: “Is this positive or negative?” (clear task)
- General Q&A: “What is machine learning?” (common knowledge)
- Code generation with specific style: Need examples showing preferred patterns
- Complex extraction: “Extract entities in this format: [name, age, location]”
- Domain-specific: Medical terminology, legal documents (need examples)
- Multi-step reasoning: Chain-of-thought needs examples
- Start with zero-shot (simpler, cheaper)
- Add few-shot if quality/consistency issues
- Monitor token usage vs quality trade-off
- A/B test to measure actual improvement
Why do chain-of-thought prompts sometimes collapse into nonsense at scale?
Why do chain-of-thought prompts sometimes collapse into nonsense at scale?
-
Model limitations:
- Smaller models lack reasoning capacity
- Can’t maintain coherent reasoning chains
- Gets confused with complex multi-step problems
-
Prompt quality:
- Poor examples lead to poor reasoning
- Inconsistent formatting confuses model
- Too many steps overwhelm model
-
Error propagation:
- Early reasoning mistake cascades
- Model can’t self-correct mid-chain
- Accumulates errors across steps
-
Context limits:
- Long reasoning chains exceed context
- Model forgets earlier steps
- Truncation breaks reasoning flow
-
Task mismatch:
- CoT not suitable for all tasks
- Simple tasks don’t need reasoning
- Over-engineering can hurt performance
- Large models (GPT-4, Claude) with strong reasoning
- Complex problems requiring multi-step thinking
- Well-constructed prompts with clear examples
- Tasks that benefit from explicit reasoning
- Small models trying to reason beyond capacity
- Simple tasks that don’t need reasoning
- Poorly constructed prompts
- Tasks requiring factual recall, not reasoning
- Use CoT only with capable models
- Start with simple CoT, add complexity gradually
- Validate reasoning steps, not just final answer
- Use self-consistency (generate multiple chains, pick best)
- Monitor for reasoning quality, not just answer correctness
What's your approach to versioning prompts for reproducibility?
What's your approach to versioning prompts for reproducibility?
-
Git-based versioning:
- Store prompts in version control
- Tag versions, track changes
- Enable rollback to previous versions
- Review prompt changes like code
-
Prompt registry:
- Centralized system for prompt management
- Version numbers, metadata (author, date, purpose)
- A/B testing different versions
- Track performance per version
-
Template system:
- Parameterized prompts with variables
- Version templates, not individual prompts
- Easier to update and maintain
- Example:
{system_prompt_v2} + {user_input}
-
Configuration files:
- YAML/JSON configs for prompts
- Environment-specific prompts (dev, prod)
- Easy to update without code changes
- Version configs separately
- Naming convention:
prompt_v1.2.3_task_name - Documentation: Document why each version exists
- Testing: Test prompts before deploying
- Monitoring: Track performance per version
- Rollback plan: Keep previous versions for quick rollback
- Develop prompt in staging
- Version and test
- Deploy with feature flag
- Monitor performance
- Gradually roll out
- Keep old version as fallback
Prompt failures are inevitable how do you debug them systematically?
Prompt failures are inevitable how do you debug them systematically?
-
Logging:
- Log all prompts and responses
- Include metadata (timestamp, user, model version)
- Store for analysis and debugging
- Enable search and filtering
-
Categorize failures:
- Format errors: Wrong output structure
- Hallucinations: Made-up information
- Refusals: Model refuses valid requests
- Inconsistency: Same input, different outputs
- Off-topic: Model goes off-topic
-
Root cause analysis:
- Prompt issues: Ambiguous instructions, poor examples
- Model limitations: Task beyond model capability
- Input quality: Garbage in, garbage out
- Context problems: Missing or wrong context
- Parameter issues: Wrong temperature, top_p settings
-
Debugging techniques:
- Simplify: Remove complexity, test basic version
- Isolate: Test individual components
- Compare: A/B test different prompts
- Iterate: Make small changes, test each
- Validate: Check against known good examples
-
Tools:
- Prompt testing frameworks
- A/B testing platforms
- Evaluation metrics (accuracy, latency)
- User feedback collection
- Is prompt clear and unambiguous?
- Are examples high-quality and relevant?
- Is context complete and accurate?
- Are parameters (temp, top_p) appropriate?
- Is model capable of the task?
- Are there edge cases not handled?
- Build test suite of known good/bad cases
- Monitor failure rates and patterns
- Create runbook for common issues
- Document solutions for future reference
- Set up alerts for quality degradation
- User reports: “Model gives wrong answer”
- Check logs: Find prompt and response
- Reproduce: Run same prompt, see if consistent
- Simplify: Test with minimal prompt
- Compare: Try different prompt variations
- Fix: Identify issue, update prompt
- Validate: Test on known cases
- Deploy: Roll out fix with monitoring
Guardrails: regex filters, classifiers, or fine-tuning? What's the trade-off?
Guardrails: regex filters, classifiers, or fine-tuning? What's the trade-off?
- How: Pattern matching on input/output
- Pros: Fast, simple, interpretable, no model needed
- Cons: Brittle, easy to bypass, can’t understand context
- Use when: Simple keyword blocking, known patterns
- How: ML model classifies content (toxic, PII, etc.)
- Pros: Understands context, more robust, can be tuned
- Cons: Needs training data, slower, may have false positives
- Use when: Need semantic understanding, complex patterns
- How: Train model to refuse harmful requests
- Pros: Most robust, understands nuance, built-in
- Cons: Expensive, time-consuming, may reduce capabilities
- Use when: Need model-level safety, have resources
| Approach | Speed | Robustness | Cost | Complexity |
|---|---|---|---|---|
| Regex | Fast | Low | Low | Low |
| Classifier | Medium | Medium | Medium | Medium |
| Fine-tuning | Slow | High | High | High |
- Regex: Block obvious patterns (quick wins)
- Classifier: Catch semantic issues (context-aware)
- Fine-tuning: Model-level safety (deep protection)
- Output validation: Final check before returning
- Start with regex for known issues
- Add classifier for complex cases
- Use fine-tuning for critical safety requirements
- Combine approaches for defense in depth
- Regex: Block known attack patterns
- Classifier: Detect toxic content
- Fine-tuning: Model refuses harmful requests
- Output validation: Final safety check
How do you balance injecting instructions vs letting the model free-flow?
How do you balance injecting instructions vs letting the model free-flow?
- Explicit rules, constraints, format requirements
- Pros: More control, consistent output, predictable
- Cons: Can be too rigid, may limit creativity, longer prompts
- Minimal instructions, let model be creative
- Pros: Natural responses, creative, flexible
- Cons: Less control, inconsistent, may go off-topic
-
Task-dependent:
- Structured tasks: More instructions (extraction, formatting)
- Creative tasks: Less instructions (writing, brainstorming)
- Critical tasks: More instructions (safety, accuracy)
-
Progressive disclosure:
- Start with minimal instructions
- Add constraints only if needed
- Test to find minimum viable instructions
-
Layered instructions:
- Core instructions (always)
- Optional constraints (when needed)
- Examples (for complex tasks)
-
User control:
- Let users choose strictness level
- “Creative mode” vs “Precise mode”
- Adjust instructions based on user preference
- Start minimal, add only what’s necessary
- Test different instruction levels
- Monitor user satisfaction
- Balance control with naturalness
- Document instruction rationale
- v1: “Summarize this text” (too free, inconsistent)
- v2: “Summarize in 3 bullet points” (better structure)
- v3: “Summarize in 3 bullet points, each max 20 words” (too rigid)
- v4: “Summarize in 3 concise bullet points” (balanced)
- Need consistency? → More instructions
- Need creativity? → Less instructions
- Need safety? → More instructions
- Need naturalness? → Less instructions
If you had to monitor prompt drift in a live system, how would you design it?
If you had to monitor prompt drift in a live system, how would you design it?
-
Output quality metrics:
- Accuracy, relevance, correctness
- User satisfaction (thumbs up/down)
- Task completion rate
- Error rates
-
Output characteristics:
- Response length (may indicate drift)
- Tone/style changes
- Format consistency
- Hallucination rate
-
Model behavior:
- Refusal rate (may increase/decrease)
- Confidence scores (if available)
- Response time (may indicate issues)
- Token usage (may change)
-
Baseline establishment:
- Measure metrics on known good prompts
- Establish normal ranges
- Set thresholds for alerts
-
Continuous tracking:
- Log all prompts and responses
- Calculate metrics in real-time
- Store for historical analysis
-
Anomaly detection:
- Statistical tests (z-scores, percentiles)
- Machine learning models (detect patterns)
- Rule-based alerts (threshold breaches)
-
Alerting:
- Real-time alerts for critical issues
- Daily/weekly reports for trends
- Dashboard for visualization
- Monitor multiple metrics (not just one)
- Use statistical significance tests
- Track trends over time
- Set up automated alerts
- Have runbook for common issues
- Regular review and adjustment
- Accuracy: Baseline 95%, current 92% → Alert
- Response length: Baseline 100 tokens, current 150 → Investigate
- User satisfaction: Baseline 4.5/5, current 3.8/5 → Alert
Few-shot vs zero-shot - which works better where?
Few-shot vs zero-shot - which works better where?
- Well-known tasks (translation, summarization)
- When model has strong pre-training
- Cost/token sensitive scenarios
- Need unbiased responses
- Examples are hard to construct
- Need specific output format
- Domain-specific terminology
- Complex or unusual patterns
- Zero-shot produces inconsistent results
- Task requires demonstration
| Task Type | Zero-shot | Few-shot |
|---|---|---|
| Simple classification | ✅ | ❌ |
| Translation | ✅ | ❌ |
| Code generation | ❌ | ✅ |
| Complex extraction | ❌ | ✅ |
| General Q&A | ✅ | ❌ |
| Domain-specific | ❌ | ✅ |
How do you design system prompts that are robust across users?
How do you design system prompts that are robust across users?
-
Clear role definition:
- Define model’s role clearly
- Set boundaries and limitations
- Specify behavior expectations
-
Explicit constraints:
- What model should do
- What model shouldn’t do
- How to handle edge cases
-
Consistent structure:
- Use clear sections (Role, Instructions, Constraints)
- Consistent formatting
- Easy to read and maintain
-
Test with diverse inputs:
- Test with different user types
- Test edge cases
- Test adversarial inputs
- Test various languages/styles
-
Version and iterate:
- Version system prompts
- A/B test different versions
- Monitor performance
- Update based on feedback
- Keep prompts concise but complete
- Use examples for complex behaviors
- Test with real users
- Monitor for prompt injection attempts
- Update based on observed issues
How do you make output deterministic?
How do you make output deterministic?
-
Temperature = 0:
- Pure greedy decoding
- Always picks highest probability token
- Most deterministic approach
-
Fixed seed:
- Set random seed for reproducibility
- Same seed = same output
- Works with temperature > 0
-
Top-k = 1:
- Only consider top token
- Combined with temperature = 0
- Maximum determinism
-
Prompt caching:
- Cache system prompts
- Reduces variability from prompt processing
- Improves consistency
- Deterministic: Predictable but may be less natural
- Non-deterministic: More natural but less predictable
- Balanced: Low temperature (0.1-0.3) for slight variation
- Deterministic: Testing, debugging, when consistency critical
- Non-deterministic: Creative tasks, when variety desired
- Balanced: Most production use cases
How do you track, version, and backfill changing context?
How do you track, version, and backfill changing context?
-
Versioning:
- Version context documents
- Track changes over time
- Enable rollback to previous versions
-
Tracking:
- Log which context was used for each query
- Store context version with responses
- Enable audit trail
-
Backfilling:
- Re-process queries with updated context
- Update responses if context changed
- Notify users of significant changes
- Store context in versioned database
- Tag queries with context version
- Re-run queries when context updates
- Compare old vs new responses
- Version all context documents
- Track context usage per query
- Automate backfilling for critical updates
- Monitor for context-related issues
- Document context changes
How do you build/maintain the memory?
How do you build/maintain the memory?
-
Short-term memory (conversation context):
- Last N messages in conversation
- Stored in session/cache
- TTL: 1-24 hours
-
Long-term memory (user preferences):
- User profile, preferences
- Stored in database
- Persists across sessions
-
Episodic memory (conversation history):
- Past conversations
- Searchable, retrievable
- Used for context
-
Semantic memory (knowledge base):
- RAG system with embeddings
- Retrieves relevant information
- Updates as knowledge changes
- Refresh: Update memory with new information
- Prune: Remove outdated information
- Validate: Check memory accuracy
- Index: Make memory searchable
- Use RAG for knowledge memory
- Store user preferences in database
- Cache recent conversations
- Index memory for fast retrieval
- Regularly update and validate
3. AI System Architecture
Design a scalable AI chatbot for 10,000 concurrent users
Design a scalable AI chatbot for 10,000 concurrent users
- Load Balancer: NGINX / AWS ALB (SSL termination, DDoS protection)
- API Gateway: Kong / AWS Gateway (rate limiting, JWT auth)
- App Servers: FastAPI / Express (Kubernetes, auto-scaling)
- Cache: Redis cluster for sessions & frequent responses
- Model Serving:
- Managed (OpenAI/Vertex) for simplicity
- Self-hosted (vLLM + A100 GPUs) for cost optimization
- Message Queue: RabbitMQ / Kafka for async tasks
- Databases: PostgreSQL (metadata) + Pinecone/Milvus (vector search)
- Monitoring: Prometheus + Grafana + ELK stack
Cost: ~15k (self-hosted) monthly.
Reliability: Circuit breakers, auto-scaling, blue-green deployments.
Why use a vector database in AI systems?
Why use a vector database in AI systems?
- RAG (Retrieval Augmented Generation)
- Semantic similarity search
- Contextual recommendations
- Fast cosine/dot-product search
- Horizontal scalability
- <100ms retrieval time
Explain caching in AI system design
Explain caching in AI system design
Strategies:
- Response caching: For common queries
- Context caching: Last N user messages
- Rate limiting: Prevent abuse
Eviction: LRU with TTL (1h for context, 24h for cache)
3. Model Deployment & Serving
What are common model serving frameworks?
What are common model serving frameworks?
- vLLM: High throughput inference, dynamic batching
- TorchServe: Scalable PyTorch serving
- TensorRT / ONNX Runtime: Optimized inference for GPUs/CPUs
- Ray Serve: Distributed deployment for microservices
Explain batch inference vs streaming
Explain batch inference vs streaming
- Batch inference: Process many inputs together efficient for offline jobs
- Streaming: Generate and send tokens live ideal for chat or long text
What is the difference between managed API vs self-hosted model?
What is the difference between managed API vs self-hosted model?
- ✅ Fast to integrate
- ✅ No infra maintenance
- ❌ Expensive at scale
- ❌ Limited customization
- ✅ Lower cost after 2–5M req/month
- ✅ Control over weights
- ❌ Needs GPU infra + MLOps skills
4. Fine-Tuning & Alignment
LoRA vs full fine-tuning: when is each justified?
LoRA vs full fine-tuning: when is each justified?
- Trains small adapter matrices instead of full weights
- Only updates 0.1-1% of parameters
- Much faster, less memory, cheaper
- Preserves base model capabilities
- Updates all model parameters
- More expressive, can learn complex patterns
- Slower, needs more memory, expensive
- Risk of catastrophic forgetting
- Limited compute resources
- Want to preserve base model
- Quick iterations needed
- Multiple task-specific adapters
- Fine-tuning on consumer hardware
- Large dataset available
- Task significantly different from pre-training
- Need maximum performance
- Have sufficient compute resources
- Single specialized model needed
- Resources limited? → LoRA
- Large dataset? → Full fine-tuning
- Multiple tasks? → LoRA (different adapters)
- Maximum performance? → Full fine-tuning
- Quick experiments? → LoRA
QLoRA claims efficiency what's the hidden cost?
QLoRA claims efficiency what's the hidden cost?
RLHF vs DPO: which alignment approach would you pick for a safety-critical use case?
RLHF vs DPO: which alignment approach would you pick for a safety-critical use case?
- Uses reinforcement learning with human preferences
- More complex, needs reward model
- Proven in production (ChatGPT, Claude)
- Better for complex alignment
- Directly optimizes on preference pairs
- Simpler, no reward model needed
- Faster training, easier to implement
- Good for preference alignment
- Need maximum safety guarantees
- Have resources for complex setup
- Need fine-grained control
- Working with large models
- Need faster iteration
- Limited resources
- Simpler alignment needs
- Want easier implementation
SFT vs PEFT: explain when one is overkill
SFT vs PEFT: explain when one is overkill
- Full fine-tuning on labeled data
- Updates all parameters
- More expressive, can learn complex patterns
- Needs more data and compute
- LoRA, Adapters, Prompt Tuning
- Updates small subset of parameters
- Faster, cheaper, less data needed
- May not reach SFT performance
- Small dataset (<1000 examples)
- Task similar to pre-training
- Limited compute resources
- Quick experiments
- Multiple tasks (use PEFT adapters)
- Large, diverse dataset
- Task very different from pre-training
- Need maximum performance
- Have sufficient resources
- Single specialized model
- Start with PEFT (LoRA)
- If performance insufficient, try SFT
- Consider hybrid: PEFT for quick iteration, SFT for final model
Why do many open-source fine-tuned models underperform their base model in the wild?
Why do many open-source fine-tuned models underperform their base model in the wild?
-
Poor data quality:
- Low-quality training data
- Misaligned with use case
- Insufficient diversity
- Noisy labels
-
Overfitting:
- Too many epochs
- Small validation set
- Model memorizes training data
- Poor generalization
-
Catastrophic forgetting:
- Loses general capabilities
- Too focused on specific task
- Forgets pre-training knowledge
-
Hyperparameter issues:
- Wrong learning rate
- Poor scheduler choice
- Inappropriate batch size
- No proper validation
-
Evaluation mismatch:
- Evaluated on different metrics
- Test set doesn’t reflect real use
- Overfitting to test set
- Use high-quality, diverse data
- Proper train/validation/test splits
- Early stopping
- Monitor validation metrics
- Test on real-world scenarios
- Use PEFT to preserve base capabilities
When would you argue for sticking with retrieval + prompting instead of fine-tuning?
When would you argue for sticking with retrieval + prompting instead of fine-tuning?
-
Data availability:
- Limited training data
- Data changes frequently
- Hard to collect labeled data
-
Flexibility:
- Need to update knowledge quickly
- Multiple knowledge domains
- Dynamic content requirements
-
Cost:
- Can’t afford fine-tuning compute
- Low volume of requests
- Cost of fine-tuning > cost of RAG
-
Transparency:
- Need to cite sources
- Want to verify answers
- Regulatory requirements
-
Multi-domain:
- Need to handle multiple domains
- Different knowledge bases
- General-purpose system
- Large, stable dataset
- Task-specific behavior needed
- High volume, cost-sensitive
- Need consistent style/format
- Domain-specific terminology
Proprietary fine-tuning: what's the most common 'gotcha' teams miss?
Proprietary fine-tuning: what's the most common 'gotcha' teams miss?
-
Data leakage:
- Test data in training set
- Validation contamination
- Overfitting to test metrics
-
Distribution shift:
- Training data ≠ production data
- Different user behavior
- Changing requirements
-
Evaluation gaps:
- Evaluating on wrong metrics
- Not testing on real scenarios
- Ignoring edge cases
-
Cost underestimation:
- Fine-tuning cost
- Inference cost changes
- Maintenance overhead
-
Model degradation:
- Catastrophic forgetting
- Losing general capabilities
- Performance on other tasks drops
-
Deployment issues:
- Model size increases
- Latency changes
- Infrastructure needs
- Proper data splits
- Test on production-like data
- Monitor all metrics, not just target
- Budget for full lifecycle
- Test general capabilities after fine-tuning
- Plan for deployment infrastructure
5. RAG Systems (Retrieval Augmented Generation)
What is RAG and why is it useful?
What is RAG and why is it useful?
- Reduces hallucination
- Keeps results current
- Enables domain adaptation without retraining
Explain the embedding process
Explain the embedding process
text-embedding-3-small).Steps:- Tokenize text
- Convert to fixed-size vector
- Store in vector DB
How to optimize RAG performance?
How to optimize RAG performance?
- Smart document chunking (500–800 tokens)
- Use metadata filters (type, tags, date)
- Cache top-k retrievals
- Re-rank using relevance scores
How do embeddings + similarity search actually work in RAG? Where does it break?
How do embeddings + similarity search actually work in RAG? Where does it break?
- Embedding: Convert documents to vectors
- Storage: Store vectors in vector database
- Query embedding: Convert query to vector
- Similarity search: Find closest document vectors
- Retrieval: Return top-k most similar documents
- Generation: Use retrieved docs as context for LLM
-
Semantic mismatch:
- Query and documents use different terminology
- Embeddings don’t capture exact match needs
- Example: Query “ML” vs document “machine learning”
-
Context loss:
- Chunking loses document structure
- Missing surrounding context
- Fragmented information
-
Retrieval quality:
- Wrong documents retrieved
- Missing relevant documents
- Too many irrelevant results
-
Scale issues:
- Slow retrieval at large scale
- Vector DB limitations
- Cost of embedding everything
-
Domain mismatch:
- Embedding model not trained on domain
- Different languages or formats
- Specialized terminology
- Use hybrid search (dense + sparse)
- Better chunking strategies
- Domain-specific embedding models
- Reranking for better results
- Metadata filtering
Vector DBs: Pinecone vs FAISS vs Weaviate. what's your decision framework?
Vector DBs: Pinecone vs FAISS vs Weaviate. what's your decision framework?
- Managed service, easy setup
- Good performance, auto-scaling
- Expensive at scale
- Best for: Quick prototypes, managed solution
- Library, not a database
- Very fast, open source
- No persistence, needs integration
- Best for: Research, in-memory search
- Self-hosted or managed
- Feature-rich, hybrid search
- More complex setup
- Best for: Production with advanced needs
- Prototype quickly? → Pinecone
- Research/experiment? → FAISS
- Production with features? → Weaviate
- Large scale? → Milvus or Qdrant
- Budget constrained? → Self-hosted (Weaviate, Milvus)
Hybrid retrieval (sparse+dense): when does it matter?
Hybrid retrieval (sparse+dense): when does it matter?
- Combines dense (semantic) and sparse (keyword) search
- Weighted combination of scores
- Captures both semantic similarity and exact matches
-
Exact matches needed:
- Code search, version numbers
- Proper nouns, technical terms
- When precision critical
-
Semantic understanding needed:
- General queries, synonyms
- Conceptual search
- When recall important
-
Production systems:
- Need best of both worlds
- Can’t afford to miss results
- Quality is priority
- Pure semantic search sufficient
- Pure keyword search sufficient
- Simple use cases
- Cost-sensitive scenarios
Why does reranking help and when does it not?
Why does reranking help and when does it not?
- Second-stage ranking using more expensive model
- Reorders initial retrieval results
- Improves precision of top results
- Initial retrieval may miss subtle relevance
- Reranker understands context better
- Can catch semantic nuances
- Improves top-k precision significantly
- Initial retrieval has good recall but poor precision
- Need high-quality top results
- Can afford extra latency/cost
- Complex queries requiring understanding
- Initial retrieval already very good
- Latency/cost critical
- Simple queries
- Reranker not better than initial retrieval
- Better quality but higher latency/cost
- Typically 2-5× slower than initial retrieval
- Worth it for critical queries
How do you measure RAG quality beyond 'felt useful'?
How do you measure RAG quality beyond 'felt useful'?
-
Retrieval metrics:
- Precision@k: Fraction of retrieved docs that are relevant
- Recall@k: Fraction of relevant docs that were retrieved
- MRR: Mean reciprocal rank of first relevant result
- NDCG: Normalized discounted cumulative gain
-
Generation metrics:
- Answer accuracy: Correctness of generated answer
- Faithfulness: Answer grounded in retrieved docs
- Completeness: Answer covers all aspects
- Citation accuracy: Correct source attribution
-
End-to-end metrics:
- Task completion rate
- User satisfaction (thumbs up/down)
- Time to correct answer
- Error rate
- Create test set with known good answers
- Run RAG pipeline on test set
- Measure retrieval quality
- Measure generation quality
- Measure end-to-end performance
- Compare against baselines
- Use multiple metrics (not just one)
- Test on real-world scenarios
- Monitor in production
- A/B test improvements
- Regular evaluation cycles
Describe failure modes when your retriever fetches irrelevant context
Describe failure modes when your retriever fetches irrelevant context
-
Semantic mismatch:
- Query and documents use different terms
- Embeddings don’t capture exact need
- Example: “ML” vs “machine learning”
-
Over-retrieval:
- Too many irrelevant documents
- Dilutes relevant context
- Model gets confused
-
Under-retrieval:
- Missing critical documents
- Incomplete context
- Model makes up information
-
Chunking issues:
- Relevant info split across chunks
- Missing context from surrounding text
- Fragmented information
-
Temporal mismatch:
- Outdated information retrieved
- Wrong version of document
- Stale knowledge base
-
Domain mismatch:
- Embedding model not suited for domain
- Different language or format
- Specialized terminology
- Hallucinations (model makes up info)
- Inaccurate answers
- Missing information
- Poor user experience
- Improve chunking strategy
- Use hybrid search
- Rerank results
- Update embedding model
- Filter by metadata (date, type)
- Test retrieval quality regularly
At what scale does naive RAG architecture fall apart?
At what scale does naive RAG architecture fall apart?
- Works fine
- Simple vector search sufficient
- No major issues
- Starts to show issues
- Retrieval quality may degrade
- Need better chunking/filtering
- Significant problems:
- Slow retrieval
- Poor precision (too many results)
- Cost increases
- Quality degradation
-
Retrieval quality:
- Too many similar documents
- Hard to find most relevant
- Precision drops significantly
-
Performance:
- Slow vector search
- High latency
- Cost increases
-
Maintenance:
- Hard to update embeddings
- Complex to manage
- Scaling challenges
- Hierarchical retrieval (coarse → fine)
- Metadata filtering
- Better chunking strategies
- Hybrid search
- Distributed vector DBs
how to increase accuracy, and reliability & make answers verifiable in LLM
how to increase accuracy, and reliability & make answers verifiable in LLM
-
RAG (Retrieval Augmented Generation):
- Ground answers in retrieved documents
- Enables citation and verification
- Reduces hallucinations
-
Citation and sources:
- Always cite sources
- Link to original documents
- Enable fact-checking
-
Confidence scores:
- Provide confidence levels
- Flag uncertain answers
- Admit when unsure
-
Validation:
- Cross-check with multiple sources
- Verify against known facts
- Human review for critical answers
-
Transparency:
- Show retrieved context
- Explain reasoning
- Make process auditable
- Use RAG for factual queries
- Always include citations
- Provide confidence scores
- Enable source verification
- Human review for critical cases
How does RAG work?
How does RAG work?
-
Document processing:
- Chunk documents into smaller pieces
- Embed chunks into vectors
- Store in vector database
-
Query processing:
- Embed user query into vector
- Search for similar document chunks
- Retrieve top-k most relevant chunks
-
Context construction:
- Combine retrieved chunks
- Format as context for LLM
- Include in prompt
-
Generation:
- LLM generates answer using context
- Grounded in retrieved documents
- Can cite sources
- Reduces hallucinations
- Keeps information current
- Enables domain adaptation
- Provides citations
What are some benefits of using the RAG system?
What are some benefits of using the RAG system?
-
Reduced hallucinations:
- Grounded in real documents
- Less likely to make up information
- More accurate answers
-
Current information:
- Can update knowledge base
- No need to retrain model
- Always up-to-date
-
Domain adaptation:
- Add domain-specific documents
- No fine-tuning needed
- Quick to adapt
-
Transparency:
- Can cite sources
- Verifiable answers
- Auditable process
-
Cost-effective:
- No model retraining
- Update knowledge easily
- Lower maintenance
When should I use Fine-tuning instead of RAG?
When should I use Fine-tuning instead of RAG?
-
Task-specific behavior:
- Need specific output format
- Consistent style required
- Domain-specific terminology
-
Large, stable dataset:
- Have sufficient training data
- Data doesn’t change frequently
- Can afford fine-tuning cost
-
Performance critical:
- Need maximum performance
- Latency sensitive
- High volume
-
Consistency:
- Need very consistent outputs
- Style/format critical
- Behavior must be predictable
- Need current information
- Multiple knowledge domains
- Data changes frequently
- Need citations
- Quick to deploy
What are the architecture patterns for customizing LLM with proprietary data?
What are the architecture patterns for customizing LLM with proprietary data?
-
RAG (Retrieval Augmented Generation):
- Retrieve relevant docs, use as context
- No model changes
- Easy to update
- Best for: Knowledge bases, Q&A
-
Fine-tuning:
- Train model on proprietary data
- Model learns from data
- More integrated
- Best for: Task-specific behavior
-
Hybrid:
- Fine-tune + RAG
- Model fine-tuned for task
- RAG for knowledge
- Best for: Complex requirements
-
Prompt engineering:
- Customize via prompts
- No model changes
- Very flexible
- Best for: Quick customization
- Knowledge base? → RAG
- Task behavior? → Fine-tuning
- Both? → Hybrid
- Quick test? → Prompt engineering
5. MLOps & LLMOps
What is MLOps and why is it important?
What is MLOps and why is it important?
- Continuous integration (CI)
- Continuous training (CT)
- Continuous deployment (CD)
- Model registry/versioning
- Drift detection and rollback
- Faster deployment cycles
- Better model quality
- Reduced risk
- Reproducibility
- Scalability
How to monitor AI models in production?
How to monitor AI models in production?
Tools: Prometheus, Grafana, OpenTelemetry
Alerts: P95 latency, error spikes, accuracy dropsBest practices:
- Log prompts safely
- Anonymize data
- Track feedback loops
- Latency: P50, P95, P99 response times
- Accuracy: Task-specific metrics
- Token usage: Input/output tokens
- Cost: Per request, per day
- Drift: Data and model drift
Explain AI safety and ethical considerations
Explain AI safety and ethical considerations
- Avoid harmful or biased outputs
- Enforce strict usage policies
- Apply constitutional AI + red teaming
- Audit model data and behavior
- Refuse harmful requests
- Document dataset sources
- Test for bias before release
Sketch a pipeline: from raw data → model → serving → feedback
Sketch a pipeline: from raw data → model → serving → feedback
-
Data collection:
- Raw data ingestion
- Data validation
- Data storage
-
Data processing:
- Cleaning, transformation
- Feature engineering
- Data versioning
-
Model training:
- Training pipeline
- Hyperparameter tuning
- Model evaluation
-
Model registry:
- Version models
- Store metadata
- Track performance
-
Model deployment:
- Model serving
- A/B testing
- Gradual rollout
-
Monitoring:
- Performance metrics
- Drift detection
- Error tracking
-
Feedback loop:
- Collect user feedback
- Log predictions
- Retrain with new data
How would you monitor performance drift or hallucinations?
How would you monitor performance drift or hallucinations?
-
Data drift:
- Monitor input distribution
- Statistical tests (KS test, chi-square)
- Alert on significant changes
-
Model drift:
- Monitor prediction distribution
- Compare with baseline
- Alert on changes
-
Performance drift:
- Monitor accuracy metrics
- Compare with baseline
- Alert on degradation
-
Output validation:
- Check for factual claims
- Verify against sources
- Flag suspicious outputs
-
Confidence scores:
- Monitor confidence levels
- Flag low-confidence outputs
- Review manually
-
User feedback:
- Collect thumbs up/down
- Track user reports
- Identify patterns
How do you log prompts and outputs for debugging and auditing?
How do you log prompts and outputs for debugging and auditing?
-
What to log:
- Prompts (system + user)
- Model responses
- Metadata (timestamp, user, model version)
- Performance metrics
-
Privacy:
- Anonymize PII
- Hash sensitive data
- Comply with regulations
-
Storage:
- Centralized logging (ELK, Splunk)
- Searchable, filterable
- Retention policies
-
Access control:
- Role-based access
- Audit logs
- Secure storage
- Log everything for debugging
- Anonymize for privacy
- Enable search and filtering
- Set retention policies
CI/CD for LLM workflows - what's different from ML?
CI/CD for LLM workflows - what's different from ML?
-
Prompt versioning:
- Version prompts like code
- A/B test prompts
- Rollback prompts
-
Model updates:
- Base model updates
- Fine-tuned model versions
- Embedding model updates
-
Context management:
- Version context documents
- Update knowledge bases
- Backfill queries
-
Evaluation:
- LLM-specific metrics
- Human evaluation
- A/B testing
-
Deployment:
- Model serving (vLLM, etc.)
- Prompt caching
- Streaming responses
- Prompts: Version and test prompts
- Context: Manage dynamic context
- Evaluation: LLM-specific metrics
- Deployment: Streaming, caching
What's your playbook for deploying an LLM API (FastAPI, Docker, K8s)?
What's your playbook for deploying an LLM API (FastAPI, Docker, K8s)?
-
API development:
- FastAPI for Python API
- Define endpoints
- Error handling
-
Containerization:
- Docker for containerization
- Multi-stage builds
- Optimize image size
-
Orchestration:
- Kubernetes for orchestration
- Deployments, services
- Auto-scaling
-
Model serving:
- vLLM, TorchServe for serving
- GPU allocation
- Batching
-
Monitoring:
- Prometheus metrics
- Grafana dashboards
- Alerts
-
CI/CD:
- GitHub Actions, GitLab CI
- Automated testing
- Deployment pipelines
- Use FastAPI for APIs
- Containerize with Docker
- Orchestrate with Kubernetes
- Monitor with Prometheus/Grafana
Drift detection: how do you monitor it with LLMs?
Drift detection: how do you monitor it with LLMs?
-
Input drift:
- Monitor prompt patterns
- Track user query types
- Alert on changes
-
Output drift:
- Monitor response patterns
- Track response length
- Alert on changes
-
Performance drift:
- Monitor accuracy metrics
- Track user satisfaction
- Alert on degradation
-
Model drift:
- Compare model versions
- Track behavior changes
- A/B test
- Statistical tests (KS test, chi-square)
- Distribution comparison
- Threshold-based alerts
Evaluation pipelines: offline vs online trade-offs?
Evaluation pipelines: offline vs online trade-offs?
- Test on held-out dataset
- Fast, cheap
- No user impact
- May not reflect real usage
- Test with real users
- A/B testing
- Reflects real usage
- Slower, more expensive
- Offline: Fast, cheap, but may not reflect reality
- Online: Realistic, but slower and more expensive
What observability metrics matter most in LLMOps?
What observability metrics matter most in LLMOps?
-
Latency:
- P50, P95, P99 response times
- Time to first token
- End-to-end latency
-
Accuracy:
- Task-specific metrics
- User satisfaction
- Error rates
-
Cost:
- Token usage (input + output)
- Cost per request
- Daily/monthly costs
-
Quality:
- Hallucination rate
- Citation accuracy
- User feedback
-
System:
- Throughput (requests/sec)
- Error rate
- Availability
Rollbacks: what's different here compared to traditional ML?
Rollbacks: what's different here compared to traditional ML?
-
Prompt rollbacks:
- Rollback prompts quickly
- No model retraining needed
- Version control
-
Model rollbacks:
- Rollback model versions
- May need infrastructure changes
- More complex
-
Context rollbacks:
- Rollback context documents
- May need re-embedding
- Backfill queries
-
Fast rollbacks:
- Prompts: Very fast
- Models: Slower
- Context: Medium
How do you scale inference infra without bleeding cost?
How do you scale inference infra without bleeding cost?
-
Model optimization:
- Quantization (INT8, INT4)
- Model distillation
- Smaller models
-
Batching:
- Dynamic batching
- Continuous batching (vLLM)
- Higher throughput
-
Caching:
- Prompt caching
- Response caching
- Reduce redundant calls
-
Smart routing:
- Route simple queries to smaller models
- Route complex to larger models
- Cost-aware routing
-
Infrastructure:
- Spot instances
- Auto-scaling
- Right-sizing
CI/CD for prompts + fine-tuned checkpoints how would you design it?
CI/CD for prompts + fine-tuned checkpoints how would you design it?
-
Version control:
- Git for prompts
- Model registry for checkpoints
- Track versions
-
Testing:
- Test prompts on sample queries
- Test models on validation set
- Automated tests
-
Deployment:
- Feature flags for prompts
- Gradual rollout for models
- A/B testing
-
Monitoring:
- Monitor performance
- Track metrics
- Alert on issues
-
Rollback:
- Quick rollback for prompts
- Model rollback capability
- Version management
6. Document Digitization & Chunking
What is chunking, and why do we chunk our data?
What is chunking, and why do we chunk our data?
- Breaking documents into smaller pieces
- Makes documents fit in context window
- Enables better retrieval
- Context limits: Models have max context (e.g., 128k tokens)
- Better retrieval: Smaller chunks = more precise retrieval
- Cost: Smaller chunks = lower embedding costs
- Performance: Faster processing of smaller pieces
- Chunk size: 500-800 tokens (balance context vs precision)
- Overlap: 50-100 tokens between chunks (preserve context)
- Semantic boundaries: Split at sentence/paragraph boundaries
What factors influence chunk size?
What factors influence chunk size?
-
Model context window:
- Max tokens model can handle
- Need space for query + retrieved chunks
- Example: 8k context → chunks of 500-800 tokens
-
Retrieval precision:
- Smaller chunks = more precise retrieval
- Larger chunks = more context per chunk
- Balance precision vs context
-
Document structure:
- Paragraphs, sections, chapters
- Natural boundaries matter
- Preserve semantic units
-
Use case:
- Q&A: Smaller chunks for precise answers
- Summarization: Larger chunks for context
- Analysis: Medium chunks for balance
-
Embedding model:
- Max tokens per embedding
- Some models handle longer texts better
- Consider model limitations
What are the different types of chunking methods?
What are the different types of chunking methods?
-
Fixed-size chunking:
- Split by character/token count
- Simple, fast
- May break sentences/paragraphs
-
Sentence-based chunking:
- Split at sentence boundaries
- Preserves sentence structure
- Better semantic units
-
Paragraph-based chunking:
- Split at paragraph boundaries
- Preserves paragraph context
- Good for structured documents
-
Recursive chunking:
- Try different strategies hierarchically
- Start with paragraphs, fall back to sentences
- Best of multiple approaches
-
Semantic chunking:
- Split based on semantic similarity
- Uses embeddings to find boundaries
- Most sophisticated, preserves meaning
-
Sliding window:
- Overlapping chunks
- Preserves context across boundaries
- More chunks but better coverage
How to find the ideal chunk size?
How to find the ideal chunk size?
-
Start with baseline:
- Common: 500-800 tokens
- Test with your documents
- Measure retrieval quality
-
Test different sizes:
- Small (200-400): More precise, less context
- Medium (500-800): Balanced
- Large (1000-1500): More context, less precise
-
Evaluate:
- Precision@k: Are retrieved chunks relevant?
- Recall@k: Do we find all relevant chunks?
- End-to-end: Does RAG quality improve?
-
Consider factors:
- Document type (technical vs narrative)
- Query type (specific vs general)
- Model context window
- Use case requirements
-
Iterate:
- Start with medium size
- Adjust based on results
- Test on real queries
What is the best method to digitize and chunk complex documents like annual reports?
What is the best method to digitize and chunk complex documents like annual reports?
-
Preprocessing:
- Extract text from PDF
- Preserve structure (tables, sections)
- Clean formatting
-
Structure-aware chunking:
- Identify sections (executive summary, financials, etc.)
- Chunk within sections
- Preserve section context
-
Hierarchical chunking:
- Document → Sections → Subsections → Paragraphs
- Store hierarchy in metadata
- Enable section-level retrieval
-
Special handling:
- Tables: Extract as structured data, chunk separately
- Charts: Extract captions, link to images
- Footnotes: Include with relevant sections
-
Metadata:
- Section name, page number, date
- Document type, year
- Enable filtering by metadata
- Use document parsers (PyPDF2, pdfplumber)
- Structure detection (section headers)
- Table extraction (tabula, camelot)
- Semantic chunking within sections
How to handle tables during chunking?
How to handle tables during chunking?
-
Extract as structured data:
- Convert to CSV/JSON
- Store separately from text
- Embed table descriptions
-
Text representation:
- Convert table to markdown/text
- Include in chunks
- Preserve structure
-
Hybrid approach:
- Store structured data separately
- Include table summary in chunks
- Link table data to text chunks
-
Metadata:
- Table type, headers, row count
- Enable table-specific queries
- Filter by table metadata
- Extract tables with specialized tools (tabula, camelot)
- Include table context (surrounding text)
- Store both structured and text representations
- Use metadata for table-specific retrieval
How do you handle very large table for better retrieval?
How do you handle very large table for better retrieval?
-
Split by rows:
- Chunk table into row groups
- Preserve header row in each chunk
- Maintain table structure
-
Column-based chunking:
- Split by columns for column-specific queries
- Include row identifiers
- Preserve relationships
-
Summary chunks:
- Create summary of table
- Include statistics, key insights
- Use for high-level queries
-
Metadata:
- Table name, dimensions, date
- Column names, data types
- Enable filtering
-
Structured storage:
- Store full table in database
- Embed summaries and descriptions
- Link chunks to full table
How to handle list item during chunking?
How to handle list item during chunking?
-
Preserve list structure:
- Keep list items together
- Don’t split mid-list
- Maintain list context
-
List as single chunk:
- Small lists: Keep as one chunk
- Preserves relationships
- Better semantic unit
-
Split long lists:
- Large lists: Split into groups
- Include list title/context
- Maintain item relationships
-
Metadata:
- List type (ordered, unordered)
- List title, item count
- Enable list-specific queries
How do you build production grade document processing and indexing pipeline?
How do you build production grade document processing and indexing pipeline?
-
Document ingestion:
- Support multiple formats (PDF, DOCX, HTML)
- Handle errors gracefully
- Validate document quality
-
Preprocessing:
- Extract text, preserve structure
- Clean formatting
- Handle special elements (tables, images)
-
Chunking:
- Structure-aware chunking
- Preserve context
- Generate metadata
-
Embedding:
- Batch processing
- Error handling
- Retry logic
-
Indexing:
- Store in vector DB
- Store metadata
- Enable filtering
-
Monitoring:
- Track processing time
- Monitor errors
- Quality metrics
-
Versioning:
- Version documents
- Track changes
- Enable rollback
- Use async processing for scale
- Implement retry logic
- Monitor pipeline health
- Version everything
- Test on production-like data
How to handle graphs & charts in RAG
How to handle graphs & charts in RAG
-
Extract text:
- Chart titles, labels, captions
- Axis labels, legends
- Include in text chunks
-
Image embeddings:
- Use vision models for image embeddings
- Store image embeddings separately
- Link to text chunks
-
Metadata:
- Chart type, data source
- Date, context
- Enable filtering
-
Hybrid approach:
- Text description in chunks
- Image embeddings for visual search
- Link images to text
-
Structured data:
- Extract underlying data if available
- Store as structured data
- Link to chart images
7. Embedding Models
What are vector embeddings, and what is an embedding model?
What are vector embeddings, and what is an embedding model?
- Numerical representations of text
- Dense vectors (arrays of numbers)
- Capture semantic meaning
- Similar texts have similar vectors
- Neural network that generates embeddings
- Trained on large text corpora
- Maps text to fixed-size vectors
- Examples: text-embedding-3-small, sentence-transformers
- Input: Text (sentence, paragraph, document)
- Output: Vector (e.g., 384, 768, 1536 dimensions)
- Similar texts → similar vectors
- Semantic search
- RAG retrieval
- Clustering
- Classification
How is an embedding model used in the context of LLM applications?
How is an embedding model used in the context of LLM applications?
-
RAG (Retrieval Augmented Generation):
- Embed documents for retrieval
- Embed queries for search
- Find similar documents
- Use as context for LLM
-
Semantic search:
- Find similar documents
- Understand user intent
- Improve search quality
-
Context selection:
- Select relevant context from large corpus
- Filter documents
- Rank by relevance
-
Hybrid search:
- Combine with keyword search
- Best of both approaches
- Improved retrieval
- Documents → Embeddings → Vector DB
- Query → Embedding → Search → Retrieve → LLM
What is the difference between embedding short and long content?
What is the difference between embedding short and long content?
- Better semantic capture
- More precise embeddings
- Faster processing
- Less context loss
- More context preserved
- Better for document-level search
- Slower processing
- May lose fine-grained details
- Short: Better precision, less context
- Long: More context, less precision
- Short: For precise retrieval, Q&A
- Long: For document-level search, summarization
- Hybrid: Embed both short and long versions
How to benchmark embedding models on your data?
How to benchmark embedding models on your data?
-
Create test set:
- Queries with known relevant documents
- Label relevance (relevant/irrelevant)
- Cover different query types
-
Embed documents:
- Use different embedding models
- Store in vector DB
- Track model versions
-
Run retrieval:
- Query each model
- Retrieve top-k results
- Measure retrieval quality
-
Evaluate:
- Precision@k: Fraction of relevant results
- Recall@k: Fraction of relevant docs found
- MRR: Mean reciprocal rank
- NDCG: Normalized discounted cumulative gain
-
Compare:
- Compare models on same test set
- Consider latency, cost
- Choose best model
- Test on domain-specific data
- Use multiple metrics
- Consider latency and cost
- Test on real queries
Suppose you are working with an open AI embedding model, after benchmarking accuracy is coming low, how would you further improve the accuracy of embedding the search model?
Suppose you are working with an open AI embedding model, after benchmarking accuracy is coming low, how would you further improve the accuracy of embedding the search model?
-
Try different models:
- text-embedding-3-small vs text-embedding-3-large
- Different dimensions
- Domain-specific models
-
Fine-tune embedding model:
- Train on your domain data
- Better domain understanding
- Improved accuracy
-
Improve chunking:
- Better chunk size
- Semantic chunking
- Preserve context
-
Hybrid search:
- Add keyword search (BM25)
- Combine dense + sparse
- Better coverage
-
Reranking:
- Second-stage ranking
- More expensive but better
- Improves precision
-
Query expansion:
- Expand queries with synonyms
- Better query understanding
- Improved retrieval
-
Metadata filtering:
- Filter by document type, date
- Narrow search space
- Better precision
Walk me through steps of improving sentence transformer model used for embedding?
Walk me through steps of improving sentence transformer model used for embedding?
-
Baseline evaluation:
- Test current model
- Measure retrieval quality
- Identify issues
-
Data preparation:
- Collect domain-specific data
- Create training pairs (query, relevant doc)
- Label relevance
-
Fine-tuning:
- Use sentence-transformers library
- Train on domain data
- Monitor validation metrics
-
Evaluation:
- Test on held-out set
- Compare with baseline
- Measure improvement
-
Iteration:
- Adjust hyperparameters
- Add more training data
- Improve data quality
-
Deployment:
- Deploy new model
- A/B test against old model
- Monitor performance
- Start with small dataset
- Use contrastive learning
- Monitor overfitting
- Test on real queries
8. Internal Working of Vector Databases
What is a vector database?
What is a vector database?
- Specialized database for vector embeddings
- Optimized for similarity search
- Stores high-dimensional vectors
- Fast nearest neighbor search
- Vector storage and indexing
- Similarity search (cosine, dot product, Euclidean)
- Metadata filtering
- Scalability
- Pinecone, Milvus, Weaviate, Qdrant, Chroma
- RAG systems
- Semantic search
- Recommendation systems
- Similarity matching
How does a vector database differ from traditional databases?
How does a vector database differ from traditional databases?
- Exact match queries
- Structured data
- Indexes for exact lookups
- Not optimized for similarity
- Similarity search
- High-dimensional vectors
- Approximate nearest neighbor (ANN) algorithms
- Optimized for vector operations
- Query type: Exact match vs similarity
- Data structure: Tables vs vectors
- Indexing: B-tree vs ANN indexes
- Use case: Structured data vs embeddings
- Traditional: Structured data, exact queries
- Vector: Embeddings, similarity search
How does a vector database work?
How does a vector database work?
-
Storage:
- Store vectors with metadata
- Index vectors for fast search
- Maintain data structures
-
Indexing:
- Build ANN indexes (HNSW, IVF, etc.)
- Enable fast approximate search
- Balance accuracy vs speed
-
Query:
- Embed query into vector
- Search for similar vectors
- Return top-k results
-
Similarity calculation:
- Cosine similarity, dot product, Euclidean
- Fast computation
- Optimized algorithms
- HNSW (Hierarchical Navigable Small World): Graph-based index
- IVF (Inverted File Index): Clustering-based
- LSH (Locality-Sensitive Hashing): Hash-based
Explain difference between vector index, vector DB & vector plugins?
Explain difference between vector index, vector DB & vector plugins?
- Data structure for fast similarity search
- Examples: HNSW, IVF, LSH
- Can be used standalone (FAISS)
- No persistence, needs integration
- Full database system with vector support
- Persistence, querying, management
- Examples: Pinecone, Milvus, Weaviate
- Production-ready solution
- Add vector capabilities to existing DBs
- Examples: pgvector (PostgreSQL), vector search in Elasticsearch
- Extends traditional databases
- Hybrid approach
- Index: Fast, no persistence, needs integration
- DB: Full solution, persistence, production-ready
- Plugin: Extends existing DB, hybrid approach
You are working on a project that involves a small dataset of customer reviews. Your task is to find similar reviews in the dataset. The priority is to achieve perfect accuracy in finding the most similar reviews, and the speed of the search is not a primary concern. Which search strategy would you choose and why?
You are working on a project that involves a small dataset of customer reviews. Your task is to find similar reviews in the dataset. The priority is to achieve perfect accuracy in finding the most similar reviews, and the speed of the search is not a primary concern. Which search strategy would you choose and why?
- Perfect accuracy: Checks all vectors, finds true nearest neighbors
- No approximation: No accuracy loss from indexing
- Small dataset: Brute force is feasible for small datasets
- Simple: No index tuning needed
- Compare query vector with all vectors
- Calculate similarity for each
- Return top-k most similar
- Accuracy: Perfect (100%)
- Speed: Slow (O(n) where n = dataset size)
- Scalability: Doesn’t scale to large datasets
- Small datasets (<10k vectors)
- Accuracy critical
- Speed not concern
- Simple implementation
Explain vector search strategies like clustering and Locality-Sensitive Hashing
Explain vector search strategies like clustering and Locality-Sensitive Hashing
- Cluster vectors into groups
- Search only in relevant clusters
- Reduces search space
- Faster but approximate
- Hash similar vectors to same buckets
- Search only in relevant buckets
- Fast approximate search
- Probabilistic guarantees
- Clustering: Better accuracy, needs training
- LSH: Faster, probabilistic
How does clustering reduce search space? When does it fail and how can we mitigate these failures?
How does clustering reduce search space? When does it fail and how can we mitigate these failures?
- Group similar vectors into clusters
- For query, find relevant clusters
- Search only in those clusters
- Reduces search space significantly
-
Query near cluster boundary:
- May miss vectors in adjacent clusters
- Solution: Search multiple clusters
-
Poor clustering:
- Clusters don’t match query distribution
- Solution: Better clustering algorithm, more clusters
-
High-dimensional data:
- Clustering less effective
- Solution: Dimensionality reduction, better algorithms
- Search multiple clusters
- Improve clustering quality
- Use hierarchical clustering
- Combine with other strategies
Explain Random projection index?
Explain Random projection index?
- Projects high-dimensional vectors to lower dimensions
- Preserves distances approximately (Johnson-Lindenstrauss lemma)
- Faster search in lower dimensions
- Approximate but fast
- Multiply vectors by random matrix
- Reduce dimensions (e.g., 1536 → 128)
- Search in lower-dimensional space
- Faster but approximate
- Speed: Much faster (lower dimensions)
- Accuracy: Approximate (some loss)
- Memory: Less memory needed
Explain Locality-sensitive hashing (LHS) indexing method?
Explain Locality-sensitive hashing (LHS) indexing method?
- Hash similar vectors to same buckets
- Search only in relevant buckets
- Fast approximate nearest neighbor search
- Probabilistic guarantees
- Create hash functions that map similar vectors to same hash
- Hash query vector
- Search in matching buckets
- Return top-k results
- Similar vectors → same hash (high probability)
- Different vectors → different hash (high probability)
- Fast lookup (hash-based)
- Speed: Very fast (hash lookup)
- Accuracy: Approximate (probabilistic)
- Memory: Hash tables needed
Explain product quantization (PQ) indexing method?
Explain product quantization (PQ) indexing method?
- Compresses vectors using quantization
- Reduces memory usage
- Enables fast approximate search
- Trade-off: accuracy vs memory
- Split vector into subvectors
- Quantize each subvector (reduce precision)
- Store quantized codes
- Fast distance computation using lookup tables
- Memory: Much less memory (compressed)
- Speed: Fast distance computation
- Scalability: Can handle very large datasets
- Accuracy: Some loss from quantization
- Complexity: More complex implementation
Compare different Vector index and given a scenario, which vector index you would use for a project?
Compare different Vector index and given a scenario, which vector index you would use for a project?
- Graph-based index
- High accuracy, good speed
- Best for: General-purpose, production
- Clustering-based
- Good accuracy, fast
- Best for: Large datasets, known distribution
- Hash-based
- Fast, approximate
- Best for: Very large datasets, speed critical
- Compression-based
- Memory efficient
- Best for: Memory-constrained, large datasets
- General production: HNSW
- Large scale: IVF or HNSW
- Memory constrained: PQ
- Speed critical: LSH
- Accuracy critical: HNSW or exact search
How would you decide ideal search similarity metrics for the use case?
How would you decide ideal search similarity metrics for the use case?
- Measures angle between vectors
- Magnitude-independent
- Best for: Semantic similarity, general use
- Measures magnitude and direction
- Magnitude-dependent
- Best for: When magnitude matters
- Measures absolute distance
- Magnitude-dependent
- Best for: When absolute distance matters
- Vector normalization: Normalized → cosine, not normalized → dot product
- Magnitude importance: Matters → dot product/Euclidean, doesn’t → cosine
- Use case: Semantic search → cosine, recommendation → dot product
Explain different types and challenges associated with filtering in vector DB?
Explain different types and challenges associated with filtering in vector DB?
-
Metadata filtering:
- Filter by document type, date, tags
- Pre-filter before vector search
- Reduces search space
-
Post-filtering:
- Filter after vector search
- May reduce results below k
- Simpler but less efficient
-
Pre-filtering:
- Filter before vector search
- More efficient
- May miss relevant results
- Performance: Filtering can slow down search
- Result quality: Pre-filtering may miss results
- Complexity: Combining filters is complex
- Indexing: Need indexes for fast filtering
- Use metadata indexes
- Combine pre and post-filtering
- Test filtering impact on quality
- Optimize filter queries
How to decide the best vector database for your needs?
How to decide the best vector database for your needs?
-
Scale:
- Number of vectors
- Query volume
- Growth rate
-
Features:
- Filtering, hybrid search
- Metadata support
- Advanced features
-
Deployment:
- Managed vs self-hosted
- Infrastructure requirements
- Maintenance burden
-
Cost:
- Pricing model
- Infrastructure costs
- Total cost of ownership
-
Performance:
- Latency requirements
- Throughput needs
- Accuracy requirements
- Prototype: Pinecone or Chroma
- Production <10M: Pinecone or Weaviate
- Production >10M: Milvus or Qdrant
- Budget constrained: Self-hosted (Chroma, Milvus)
- Need features: Weaviate
9. Advanced Search Algorithms
What are architecture patterns for information retrieval & semantic search?
What are architecture patterns for information retrieval & semantic search?
-
Dense retrieval (semantic search):
- Embeddings for semantic similarity
- Vector database for storage
- Cosine similarity for ranking
- Best for: Semantic understanding
-
Sparse retrieval (keyword search):
- TF-IDF, BM25 for keyword matching
- Inverted index for storage
- Keyword-based ranking
- Best for: Exact matches
-
Hybrid retrieval:
- Combine dense + sparse
- Weighted combination of scores
- Best of both approaches
- Best for: Production systems
-
Reranking:
- Two-stage retrieval
- Initial retrieval (fast)
- Reranking (expensive but better)
- Best for: Quality-critical queries
-
Multi-stage retrieval:
- Coarse → Fine retrieval
- Hierarchical search
- Progressive refinement
- Best for: Large-scale systems
Why it's important to have very good search
Why it's important to have very good search
-
RAG quality:
- Good retrieval = good RAG
- Bad retrieval = hallucinations
- Foundation of RAG system
-
User experience:
- Fast, accurate results
- Relevant information
- User satisfaction
-
System performance:
- Reduces LLM calls
- Lower latency
- Lower cost
-
Accuracy:
- Correct information retrieved
- Reduces hallucinations
- Better answers
How can you achieve efficient and accurate search results in large-scale datasets?
How can you achieve efficient and accurate search results in large-scale datasets?
-
Hybrid search:
- Combine dense + sparse
- Better coverage
- Improved accuracy
-
Hierarchical retrieval:
- Coarse → Fine search
- Reduce search space
- Faster retrieval
-
Metadata filtering:
- Filter by type, date, tags
- Narrow search space
- Better precision
-
Reranking:
- Second-stage ranking
- Improves precision
- Better top-k results
-
Indexing:
- Efficient indexes (HNSW, IVF)
- Fast approximate search
- Scalable
-
Caching:
- Cache frequent queries
- Reduce computation
- Lower latency
Consider a scenario where a client has already built a RAG-based system that is not giving accurate results, upon investigation you find out that the retrieval system is not accurate, what steps you will take to improve it?
Consider a scenario where a client has already built a RAG-based system that is not giving accurate results, upon investigation you find out that the retrieval system is not accurate, what steps you will take to improve it?
-
Diagnose issues:
- Measure retrieval quality (precision@k, recall@k)
- Identify failure modes
- Analyze query types
-
Improve chunking:
- Better chunk size
- Semantic chunking
- Preserve context
-
Improve embeddings:
- Try different embedding models
- Fine-tune on domain data
- Domain-specific models
-
Add hybrid search:
- Combine dense + sparse
- Better coverage
- Improved accuracy
-
Add reranking:
- Second-stage ranking
- Improves precision
- Better top-k results
-
Metadata filtering:
- Filter by type, date
- Narrow search space
- Better precision
-
Query expansion:
- Expand queries with synonyms
- Better query understanding
- Improved retrieval
-
Evaluate:
- Test on real queries
- Measure improvement
- Iterate
Explain the keyword-based retrieval method
Explain the keyword-based retrieval method
-
TF-IDF (Term Frequency-Inverse Document Frequency):
- Weights terms by frequency and rarity
- Common terms get lower weight
- Rare terms get higher weight
- Classic information retrieval
-
BM25 (Best Matching 25):
- Improved version of TF-IDF
- Better term saturation
- Handles document length better
- Industry standard
-
Inverted index:
- Maps terms to documents
- Fast lookup
- Efficient storage
- Foundation of keyword search
- Extract keywords from query
- Look up in inverted index
- Score documents by term frequency
- Rank by relevance score
- Fast, exact matches
- Interpretable
- No model needed
- Misses synonyms
- No semantic understanding
- Limited to exact matches
How to fine-tune re-ranking models?
How to fine-tune re-ranking models?
-
Data preparation:
- Query-document pairs
- Relevance labels (relevant/irrelevant)
- Multiple relevance levels (highly relevant, somewhat relevant, etc.)
-
Model selection:
- Cross-encoder models (BERT, RoBERTa)
- Better than bi-encoders for reranking
- Understands query-document interaction
-
Training:
- Use sentence-transformers library
- Contrastive learning
- Train on domain data
- Monitor validation metrics
-
Evaluation:
- Test on held-out set
- Measure precision@k, MRR, NDCG
- Compare with baseline
-
Deployment:
- Deploy as second-stage reranker
- Use after initial retrieval
- Monitor performance
- Use domain-specific data
- Multiple relevance levels
- Monitor overfitting
- Test on real queries
Explain most common metric used in information retrieval and when it fails
Explain most common metric used in information retrieval and when it fails
-
Precision@k:
- Fraction of retrieved items that are relevant
- Measures accuracy of top-k results
- Fails when: Need to measure coverage (recall)
-
Recall@k:
- Fraction of relevant items that were retrieved
- Measures coverage
- Fails when: Need to measure accuracy (precision)
-
MRR (Mean Reciprocal Rank):
- Average of 1/rank of first relevant result
- Emphasizes top results
- Fails when: Need to measure overall quality (NDCG)
-
NDCG (Normalized Discounted Cumulative Gain):
- Considers ranking quality, discounts lower positions
- Best for graded relevance
- Fails when: Need simple binary relevance
- Precision: When recall is important
- Recall: When precision is important
- MRR: When need overall ranking quality
- NDCG: When need simple binary relevance
If you were to create an algorithm for a Quora-like question-answering system, with the objective of ensuring users find the most pertinent answers as quickly as possible, which evaluation metric would you choose to assess the effectiveness of your system?
If you were to create an algorithm for a Quora-like question-answering system, with the objective of ensuring users find the most pertinent answers as quickly as possible, which evaluation metric would you choose to assess the effectiveness of your system?
- User experience: Users want first relevant answer quickly
- MRR emphasizes top results: Measures rank of first relevant answer
- Fast answers: Lower rank = faster to find answer
- User satisfaction: Users typically read top results
- Precision@k: Measures accuracy but not position
- Recall@k: Measures coverage but not speed
- NDCG: Good but more complex, MRR simpler
- For each query, find rank of first relevant answer
- Calculate 1/rank
- Average across queries
- Higher MRR = better (answers found faster)
I have a recommendation system, which metric should I use to evaluate the system?
I have a recommendation system, which metric should I use to evaluate the system?
- Graded relevance: Recommendations have degrees (highly relevant, somewhat relevant)
- Position matters: Top recommendations more important
- Ranking quality: Measures how well system ranks recommendations
- Industry standard: Widely used for recommendation systems
- Precision@k: Good but doesn’t consider position
- Recall@k: Good but doesn’t consider position
- MRR: Good but assumes binary relevance
- Considers relevance grades
- Discounts lower positions
- Normalized (comparable across queries)
- Industry standard
Compare different information retrieval metrics and which one to use when
Compare different information retrieval metrics and which one to use when
- Measures: Accuracy of top-k results
- Use when: Accuracy is priority
- Example: Search engine results
- Measures: Coverage of relevant items
- Use when: Coverage is priority
- Example: Document retrieval
- Measures: Rank of first relevant result
- Use when: First relevant result matters
- Example: Q&A systems
- Measures: Ranking quality with graded relevance
- Use when: Ranking and relevance grades matter
- Example: Recommendation systems
- Measures: Harmonic mean of precision and recall
- Use when: Need balance of both
- Example: Balanced evaluation
- Accuracy priority: Precision@k
- Coverage priority: Recall@k
- First result matters: MRR
- Ranking quality: NDCG
- Balance: F1@k
How does hybrid search works?
How does hybrid search works?
-
Dense search (semantic):
- Embed query and documents
- Calculate cosine similarity
- Rank by semantic similarity
-
Sparse search (keyword):
- Extract keywords from query
- Use BM25/TF-IDF
- Rank by keyword matching
-
Score combination:
- Normalize scores (0-1)
- Weighted combination:
final_score = α × dense_score + (1-α) × sparse_score - Typical α = 0.7 (70% dense, 30% sparse)
-
Reranking:
- Optional: Rerank combined results
- Use cross-encoder
- Improve precision
- Captures semantic similarity (dense)
- Captures exact matches (sparse)
- Better coverage
- Improved accuracy
If you have search results from multiple methods, how would you merge and homogenize the rankings into a single result set?
If you have search results from multiple methods, how would you merge and homogenize the rankings into a single result set?
-
Score normalization:
- Normalize scores to same range (0-1)
- Use min-max or z-score normalization
- Enables fair combination
-
Weighted combination:
final_score = α × method1_score + (1-α) × method2_score- Adjust α based on method performance
- Typical: 0.7 dense + 0.3 sparse
-
Reciprocal rank fusion (RRF):
- Combine ranks, not scores
RRF_score = Σ(1 / (k + rank))- Works with different score ranges
- Popular in information retrieval
-
Learning to rank:
- Train model to combine scores
- Learns optimal combination
- More complex but better
-
Reranking:
- Merge initial results
- Rerank with cross-encoder
- Improves final ranking
How to handle multi-hop/multifaceted queries?
How to handle multi-hop/multifaceted queries?
-
Iterative retrieval:
- First hop: Retrieve initial documents
- Extract entities/concepts
- Second hop: Query with extracted entities
- Combine results
-
Graph-based retrieval:
- Build knowledge graph
- Traverse graph for multi-hop
- Find connected entities
-
Query decomposition:
- Break query into sub-queries
- Retrieve for each sub-query
- Combine results
-
Agent-based:
- Use LLM agent
- Plan retrieval steps
- Execute iteratively
- Use iterative retrieval for simple multi-hop
- Use graph-based for complex relationships
- Use agents for complex reasoning
What are different techniques to be used to improved retrieval?
What are different techniques to be used to improved retrieval?
-
Hybrid search:
- Combine dense + sparse
- Better coverage
- Improved accuracy
-
Reranking:
- Second-stage ranking
- Improves precision
- Better top-k results
-
Query expansion:
- Add synonyms, related terms
- Better query understanding
- Improved retrieval
-
Metadata filtering:
- Filter by type, date, tags
- Narrow search space
- Better precision
-
Better chunking:
- Semantic chunking
- Preserve context
- Better retrieval
-
Fine-tune embeddings:
- Domain-specific models
- Better domain understanding
- Improved accuracy
-
Multi-stage retrieval:
- Coarse → Fine search
- Hierarchical retrieval
- Faster and better
10. Prompt Engineering & Basics of LLM
What is the difference between Predictive/Discriminative AI and Generative AI?
What is the difference between Predictive/Discriminative AI and Generative AI?
- Predicts labels or classes
- Examples: Classification, regression
- Input → Output (label/class)
- Trained on labeled data
- Examples: Image classification, sentiment analysis
- Generates new content
- Examples: Text generation, image generation
- Input → Output (new content)
- Trained on unlabeled data
- Examples: GPT, DALL-E, ChatGPT
- Purpose: Prediction vs generation
- Output: Label vs content
- Training: Labeled vs unlabeled data
- Use case: Classification vs creation
What is LLM, and how are LLMs trained?
What is LLM, and how are LLMs trained?
- Neural network trained on large text corpora
- Generates human-like text
- Examples: GPT, BERT, LLaMA
-
Pre-training:
- Train on large unlabeled text corpus
- Learn language patterns
- Self-supervised learning (predict next token)
- Massive compute and data
-
Fine-tuning:
- Adapt to specific tasks
- Supervised learning on labeled data
- Task-specific behavior
- Much less data needed
-
Alignment:
- RLHF, DPO for human preferences
- Safety and helpfulness
- Human feedback
- Aligns with human values
- Pre-training: Months on thousands of GPUs
- Fine-tuning: Hours to days
- Alignment: Days to weeks
What is a token in the language model?
What is a token in the language model?
- Basic unit of text processing
- Can be word, subword, or character
- Depends on tokenizer (BPE, WordPiece, SentencePiece)
- Text → Tokens → Token IDs → Model
- Model processes tokens, not raw text
- Token count affects cost and context
- “Hello world” → 2 tokens (BPE)
- “Machine learning” → 2-3 tokens (depending on tokenizer)
- BPE: Byte Pair Encoding (GPT)
- WordPiece: (BERT)
- SentencePiece: (T5, multilingual)
How to estimate the cost of running SaaS-based and Open Source LLM models?
How to estimate the cost of running SaaS-based and Open Source LLM models?
- Pricing: Per token (input + output)
- Example: GPT-4: 0.06/1k output tokens
- Calculate: (input_tokens × input_price) + (output_tokens × output_price)
- Monthly: Estimate tokens/month × price
- Infrastructure: GPU instances (A100, H100)
- Cost: $5-15k/month for GPU instances
- Break-even: ~2-5M requests/month
- Additional: Storage, networking, maintenance
- Volume: More requests = higher cost
- Model size: Larger models = higher cost
- Context length: Longer context = more tokens
- Region: Different pricing by region
Explain the Temperature parameter and how to set it
Explain the Temperature parameter and how to set it
- Controls randomness in generation (0-2)
- Lower = more deterministic
- Higher = more creative
- 0.0-0.3: Deterministic (code, classification)
- 0.4-0.7: Balanced (Q&A, summaries)
- 0.8-1.0: Creative (writing, brainstorming)
- Start with 0.3-0.5 for most tasks
- Use 0.0 for deterministic tasks
- Use 0.8+ for creative tasks
- Test different values
What are different decoding strategies for picking output tokens?
What are different decoding strategies for picking output tokens?
-
Greedy:
- Always picks highest probability token
- Fastest, deterministic
- Can get repetitive
-
Beam search:
- Keeps top-k candidates
- Better quality, slower
- Good for translation
-
Top-k sampling:
- Samples from top-k tokens
- More diverse, less deterministic
- Good for creative tasks
-
Top-p (nucleus) sampling:
- Samples from smallest set covering p% probability
- Good balance of quality and diversity
- Most common for chat
-
Temperature sampling:
- Scales probabilities before sampling
- Controls randomness
- Often combined with top-p
What are different ways you can define stopping criteria in large language model?
What are different ways you can define stopping criteria in large language model?
-
Max tokens:
- Stop after N tokens
- Prevents long outputs
- Most common
-
Stop sequences:
- Stop when specific sequence appears
- Example: ”###” or “\n\n”
- Useful for structured output
-
EOS token:
- Stop at end-of-sequence token
- Model-generated
- Natural stopping point
-
Custom logic:
- Stop based on content
- Example: Complete sentence, paragraph
- More complex
How to use stop sequences in LLMs?
How to use stop sequences in LLMs?
-
Define sequences:
- List of strings to stop at
- Example: [”###”, “\n\n\n”]
- Model stops when any sequence appears
-
Use cases:
- Structured output (JSON, XML)
- Multi-turn conversations
- Preventing continuation
-
Best practices:
- Use unique sequences
- Test to ensure they work
- Combine with max tokens
- Stop sequence: ”###”
- Model stops when ”###” appears
- Useful for structured output
Explain the basic structure prompt engineering
Explain the basic structure prompt engineering
-
System prompt:
- Defines model’s role
- Sets behavior and constraints
- Example: “You are a helpful assistant.”
-
Context:
- Relevant information
- Retrieved documents (RAG)
- User history
-
Instructions:
- What model should do
- Format requirements
- Examples
-
User input:
- Actual query or request
- User’s question or task
Explain in-context learning
Explain in-context learning
- Model learns from examples in prompt
- No weight updates
- Examples guide model behavior
- Types: Zero-shot, few-shot, chain-of-thought
- Provide examples in prompt
- Model learns pattern from examples
- Applies pattern to new input
- No training needed
- Zero-shot: No examples
- Few-shot: 1-5 examples
- Chain-of-thought: Examples with reasoning
Explain type of prompt engineering
Explain type of prompt engineering
-
Zero-shot:
- No examples
- Model uses pre-training
- Fastest, cheapest
-
Few-shot:
- 1-5 examples
- Guides model behavior
- Better consistency
-
Chain-of-thought:
- Examples with reasoning steps
- Improves reasoning
- Better for complex tasks
-
Role-playing:
- Define model’s role
- Sets behavior
- Example: “You are an expert…”
-
Template-based:
- Structured prompts
- Consistent format
- Easy to maintain
What are some of the aspect to keep in mind while using few-shots prompting?
What are some of the aspect to keep in mind while using few-shots prompting?
-
Example quality:
- High-quality, relevant examples
- Representative of task
- Clear and correct
-
Example quantity:
- 2-5 examples usually best
- Diminishing returns beyond 5
- Balance cost and quality
-
Example diversity:
- Cover different cases
- Avoid bias
- Representative sample
-
Token usage:
- Examples increase tokens
- Higher cost
- Monitor usage
-
Format consistency:
- Consistent format across examples
- Clear structure
- Easy to follow
What are certain strategies to write good prompt?
What are certain strategies to write good prompt?
-
Be clear and specific:
- Clear instructions
- Specific requirements
- Avoid ambiguity
-
Use examples:
- Few-shot examples
- Show desired format
- Guide behavior
-
Structure prompts:
- System prompt, context, instructions
- Clear sections
- Easy to read
-
Iterate:
- Test different prompts
- Refine based on results
- A/B test
-
Version control:
- Version prompts
- Track changes
- Enable rollback
What is hallucination, and how can it be controlled using prompt engineering?
What is hallucination, and how can it be controlled using prompt engineering?
- Model generates false information
- Confidently states incorrect facts
- Common in LLMs
-
Ground in context:
- Use RAG to provide context
- Instruct model to use only context
- Cite sources
-
Explicit instructions:
- “Only use provided context”
- “If unsure, say so”
- “Don’t make up information”
-
Few-shot examples:
- Show correct behavior
- Examples of admitting uncertainty
- Guide model
-
Output format:
- Structured output
- Confidence scores
- Source citations
How to improve the reasoning ability of LLM through prompt engineering?
How to improve the reasoning ability of LLM through prompt engineering?
-
Chain-of-thought:
- Ask model to think step-by-step
- Show reasoning in examples
- Improves complex reasoning
-
Few-shot CoT:
- Examples with reasoning steps
- Model learns pattern
- Better reasoning
-
Self-consistency:
- Generate multiple reasoning chains
- Pick most common answer
- Improves accuracy
-
Verification:
- Ask model to verify answer
- Check reasoning
- Catch errors
How to improve LLM reasoning if your COT prompt fails?
How to improve LLM reasoning if your COT prompt fails?
-
Simplify problem:
- Break into smaller steps
- Solve step-by-step
- Combine solutions
-
Better examples:
- Higher quality examples
- Clearer reasoning
- More relevant
-
Different approach:
- Try different reasoning style
- Alternative methods
- Experiment
-
Model upgrade:
- Use larger model
- Better reasoning capability
- GPT-4, Claude Opus
-
External tools:
- Use calculator, code execution
- Verify with tools
- Hybrid approach
11. Cost & Latency Tradeoffs
How do you reduce token usage?
How do you reduce token usage?
-
Prompt optimization:
- Remove unnecessary text
- Use concise instructions
- Remove redundant examples
-
Context management:
- Only include relevant context
- Use RAG to retrieve only needed docs
- Truncate long documents
-
Prompt caching:
- Cache system prompts
- Reuse across requests
- Significant savings
-
Response limits:
- Set max tokens for output
- Stop early when possible
- Use stop sequences
-
Model selection:
- Use smaller models when possible
- Distilled models
- Task-specific models
When should you quantize a model?
When should you quantize a model?
-
Memory constraints:
- Limited GPU memory
- Need to fit larger models
- Edge devices
-
Cost optimization:
- Reduce inference cost
- Lower infrastructure costs
- Scale more efficiently
-
Latency requirements:
- Need faster inference
- Real-time applications
- Lower latency
- Pros: Less memory, faster, cheaper
- Cons: Some accuracy loss, more complex
What's your batching + caching strategy to reduce latency?
What's your batching + caching strategy to reduce latency?
-
Dynamic batching:
- Batch requests together
- Process multiple requests simultaneously
- Higher throughput
-
Continuous batching (vLLM):
- Add requests to batch dynamically
- Remove completed requests
- Optimal GPU utilization
-
Batch size:
- Balance latency vs throughput
- Larger batches = higher throughput
- Smaller batches = lower latency
-
Prompt caching:
- Cache system prompts
- Reuse across requests
- Significant latency reduction
-
Response caching:
- Cache common queries
- Return cached responses
- Very fast
-
Context caching:
- Cache conversation context
- Reuse for multi-turn
- Faster responses
When to use hosted APIs vs open-source models?
When to use hosted APIs vs open-source models?
- Use when:
- Low to medium volume
- Need latest models
- Don’t want infrastructure management
- Quick to market
- Use when:
- High volume (>2-5M requests/month)
- Cost-sensitive
- Need data privacy
- Want control over models
- Volume: Low → hosted, high → self-hosted
- Cost: Low volume → hosted, high volume → self-hosted
- Privacy: Need privacy → self-hosted
- Speed: Quick to market → hosted
12. Agentic AI
Define an 'agent' in practical terms?
Define an 'agent' in practical terms?
- LLM that can use tools and take actions
- Can plan, execute, and iterate
- Autonomous decision-making
- Examples: Code execution, web search, API calls
- Tool use: Call functions, APIs, tools
- Planning: Break down tasks into steps
- Execution: Take actions based on plan
- Iteration: Refine based on results
- Code agent: Writes and executes code
- Research agent: Searches web, synthesizes info
- API agent: Calls APIs, processes data
What's the hardest part of orchestration when chaining multiple tools?
What's the hardest part of orchestration when chaining multiple tools?
-
Error handling:
- Tool failures
- Partial failures
- Recovery strategies
-
State management:
- Track execution state
- Manage context across tools
- Handle state transitions
-
Planning:
- Determine tool sequence
- Handle dependencies
- Adapt to failures
-
Coordination:
- Coordinate multiple tools
- Handle async operations
- Manage timeouts
-
Debugging:
- Complex execution paths
- Hard to trace issues
- Difficult to reproduce
Why do planning agents often loop or stall?
Why do planning agents often loop or stall?
-
Poor planning:
- Incomplete plans
- Circular dependencies
- Unclear goals
-
No termination:
- No stopping criteria
- Keep trying indefinitely
- No timeout
-
Error recovery:
- Same error repeatedly
- No alternative strategies
- Stuck in loop
-
Context limits:
- Lose track of progress
- Forget what tried
- Repeat actions
-
Tool failures:
- Keep retrying failed tools
- No fallback strategies
- Stuck on failures
- Set max iterations
- Implement timeouts
- Track execution history
- Use fallback strategies
- Clear stopping criteria
Multi-agent vs single-agent: when does the complexity actually pay off?
Multi-agent vs single-agent: when does the complexity actually pay off?
- Simpler, easier to debug
- Good for most tasks
- Single point of failure
- More complex, harder to debug
- Good for complex tasks
- Parallel execution
- Complex tasks: Need multiple specialists
- Parallel work: Can work simultaneously
- Specialization: Different agents for different tasks
- Scale: Handle more complex workflows
- Simple tasks: Single agent sufficient
- Debugging: Easier to debug
- Cost: Lower complexity
How would you evaluate whether adding agentic behavior improved a system?
How would you evaluate whether adding agentic behavior improved a system?
-
Task completion:
- Success rate
- Task completion time
- Quality of results
-
Efficiency:
- Number of steps
- Tool calls per task
- Time to completion
-
Reliability:
- Error rate
- Recovery from failures
- Consistency
-
User satisfaction:
- User feedback
- Task success rate
- Time saved
-
Cost:
- Cost per task
- Tool usage costs
- Total cost
- A/B test: Agentic vs non-agentic
- Measure metrics above
- Compare performance
- User feedback
Explain different types of agents: simple reflex, model-based reflex, goal-based, utility-based, and learning agents
Explain different types of agents: simple reflex, model-based reflex, goal-based, utility-based, and learning agents
-
Simple reflex agents:
- React to current percept
- No memory, no planning
- Condition-action rules
- Example: Thermostat (if temp > threshold, turn on AC)
-
Model-based reflex agents:
- Maintain internal model of world
- Track how world evolves
- Better decisions with history
- Example: Agent tracking inventory changes
-
Goal-based agents:
- Have explicit goals
- Plan actions to achieve goals
- Consider future consequences
- Example: Navigation agent finding path to destination
-
Utility-based agents:
- Maximize utility function
- Handle uncertainty and trade-offs
- Choose best action given preferences
- Example: Trading agent maximizing profit while managing risk
-
Learning agents:
- Improve performance over time
- Learn from experience
- Adapt to new situations
- Example: Agent that improves recommendations based on feedback
- Simple reflex: Fastest, simplest, limited
- Model-based: More capable, needs world model
- Goal-based: Can plan, needs goal specification
- Utility-based: Handles trade-offs, needs utility function
- Learning: Most flexible, needs training data
What are reactive agents and how do they work?
What are reactive agents and how do they work?
- Respond to current situation only
- No internal state or memory
- Immediate action based on percept
- Simple condition-action rules
- Perceive: Observe current environment
- Match: Match percept to condition
- Act: Execute corresponding action
- Repeat: No memory of past actions
- Fast: No planning overhead
- Simple: Easy to implement
- Limited: Can’t handle complex tasks
- No learning: Don’t improve over time
- Simple control systems
- Real-time responses
- When speed > sophistication
- Deterministic environments
- Can’t plan ahead
- No memory of past actions
- Limited to simple tasks
- Can’t handle uncertainty well
Explain ReAct (Reasoning + Acting) agents and their advantages
Explain ReAct (Reasoning + Acting) agents and their advantages
- Combine reasoning and acting
- Interleave thinking and action
- Use chain-of-thought reasoning
- Take actions based on reasoning
- Think: Reason about current situation
- Act: Take action based on reasoning
- Observe: See result of action
- Think: Reason about new situation
- Repeat: Continue until goal achieved
- Reasoning: Chain-of-thought thinking
- Acting: Tool/function calls
- Observation: Results from actions
- Iteration: Refine based on observations
- Transparency: Can see reasoning process
- Flexibility: Adapts to new situations
- Error recovery: Can reason about failures
- Better decisions: Thoughtful actions
How do agents react to different situations and stimuli?
How do agents react to different situations and stimuli?
-
Immediate reaction:
- React instantly to stimulus
- No deliberation
- Fast response
- Simple reflex agents
-
Deliberative reaction:
- Think before acting
- Consider options
- Plan actions
- Goal-based agents
-
Adaptive reaction:
- Learn from experience
- Adjust behavior
- Improve over time
- Learning agents
-
Contextual reaction:
- Consider context
- Use memory/history
- Better decisions
- Model-based agents
- Stimulus → Action: Direct mapping
- Stimulus → Reasoning → Action: With deliberation
- Stimulus → Memory → Reasoning → Action: With context
- Stimulus → Learning → Adaptation → Action: With improvement
- Agent type: Reflex vs deliberative
- Environment: Deterministic vs stochastic
- Goals: Immediate vs long-term
- Experience: New vs learned
What are the key metrics for agent evaluation?
What are the key metrics for agent evaluation?
-
Task success:
- Success rate (% tasks completed)
- Goal achievement rate
- Task completion quality
- Accuracy of results
-
Efficiency:
- Steps to completion
- Tool calls per task
- Time to completion
- Resource usage
-
Reliability:
- Error rate
- Failure recovery rate
- Consistency across runs
- Robustness to edge cases
-
Cost:
- Cost per task
- Token usage
- Tool/API costs
- Total cost of ownership
-
User experience:
- User satisfaction
- Response time
- Quality of interactions
- Helpfulness
-
Learning (for learning agents):
- Improvement over time
- Adaptation to new tasks
- Generalization ability
- Sample efficiency
- Offline: Test on held-out dataset
- Online: A/B test with real users
- Simulation: Test in controlled environment
- Human evaluation: Expert review
How do you perform end-to-end evaluation of agents?
How do you perform end-to-end evaluation of agents?
-
Define evaluation tasks:
- Realistic scenarios
- Diverse task types
- Clear success criteria
- Representative of real use
-
Set up test environment:
- Simulated or real environment
- Tools and APIs available
- Controlled conditions
- Reproducible setup
-
Run agent on tasks:
- Execute agent on each task
- Record all actions
- Capture outputs
- Log errors/failures
-
Measure performance:
- Task success rate
- Steps to completion
- Time to completion
- Quality of results
- Cost per task
-
Analyze results:
- Identify failure modes
- Find common errors
- Analyze efficiency
- Compare with baselines
-
Iterate:
- Fix identified issues
- Improve agent
- Re-evaluate
- Continuous improvement
- WebArena: Web navigation tasks
- AgentBench: Multi-domain agent tasks
- ToolBench: Tool-using tasks
- Custom: Domain-specific tasks
What is tool calling and how do agents use tools?
What is tool calling and how do agents use tools?
- Agents invoke external functions/tools
- Extends agent capabilities
- Enables real-world actions
- Examples: API calls, code execution, web search
-
Tool definition:
- Define available tools
- Specify parameters
- Document functionality
- Example:
search_web(query: str) -> str
-
Tool selection:
- Agent decides which tool to use
- Based on current task
- Considers tool capabilities
- Matches tool to need
-
Tool invocation:
- Call tool with parameters
- Execute tool function
- Get result
- Handle errors
-
Result processing:
- Process tool output
- Use result for next action
- Integrate into reasoning
- Continue task
- Sequential: One tool at a time
- Parallel: Multiple tools simultaneously
- Conditional: Tool based on condition
- Iterative: Tool in loop until done
Explain OpenAI Functions (Function Calling) for agents
Explain OpenAI Functions (Function Calling) for agents
- Structured way to define tools
- Model decides when to call functions
- Returns structured function calls
- Enables reliable tool use
-
Define functions:
-
Model decides:
- Model sees function definitions
- Decides if function needed
- Returns function call if needed
- Or continues conversation
-
Execute function:
- Parse function call
- Execute with parameters
- Get result
- Return to model
-
Model continues:
- Model sees function result
- Uses result in response
- Can call more functions
- Completes task
- Reliable: Structured function calls
- Flexible: Model decides when to use
- Type-safe: JSON schema validation
- Easy integration: Standard format
What is MCP (Model Context Protocol) and how does it work?
What is MCP (Model Context Protocol) and how does it work?
- Standard protocol for agent-tool communication
- Enables agents to use external tools
- Provides context to models
- Standardizes tool interfaces
-
Tool registration:
- Tools register with MCP server
- Define capabilities
- Specify interfaces
- Make available to agents
-
Context provision:
- MCP provides tool context
- Describes available tools
- Shows tool capabilities
- Updates dynamically
-
Tool invocation:
- Agent requests tool use
- MCP routes to tool
- Executes tool
- Returns result
-
Context updates:
- MCP updates context
- Reflects tool results
- Maintains state
- Enables multi-step tasks
- Standardization: Common protocol
- Interoperability: Works across systems
- Context management: Maintains state
- Tool discovery: Agents find tools
- Multi-tool agent systems
- Tool marketplace integration
- Standardized agent platforms
- Cross-platform tool use
Explain Agent-to-Agent (A2A) communication and coordination
Explain Agent-to-Agent (A2A) communication and coordination
- Agents communicate with each other
- Coordinate on tasks
- Share information
- Collaborate on goals
-
Direct communication:
- Agents send messages directly
- Point-to-point communication
- Simple but limited scale
- Example: Two agents coordinating
-
Broadcast communication:
- One agent broadcasts to all
- Announcements, updates
- Efficient for one-to-many
- Example: Leader announcing plan
-
Mediated communication:
- Communication through mediator
- Centralized coordination
- Better for complex systems
- Example: Message broker
-
Shared memory:
- Agents share common memory
- Read/write shared state
- Coordination through state
- Example: Blackboard architecture
-
Task delegation:
- One agent delegates to others
- Divide and conquer
- Specialized agents
- Example: Manager delegates to workers
-
Consensus:
- Agents agree on action
- Voting, negotiation
- Democratic decision-making
- Example: Agents vote on plan
-
Auction:
- Agents bid on tasks
- Market-based coordination
- Efficient resource allocation
- Example: Task auction system
-
Contract net:
- One agent announces task
- Others bid on task
- Select best bidder
- Example: Task allocation
How do you design multi-agent systems for collaboration?
How do you design multi-agent systems for collaboration?
-
Agent roles:
- Define agent responsibilities
- Specialize agents
- Clear role boundaries
- Example: Researcher, Writer, Reviewer
-
Communication protocol:
- Define message format
- Specify communication channels
- Establish protocols
- Example: JSON messages, REST API
-
Coordination mechanism:
- How agents coordinate
- Task allocation
- Conflict resolution
- Example: Manager agent, voting
-
Shared resources:
- Common knowledge base
- Shared memory
- Tool access
- Example: Shared database
-
Error handling:
- Agent failure recovery
- Communication failures
- Task reassignment
- Example: Backup agents, retries
-
Hierarchical:
- Manager-worker structure
- Top-down coordination
- Clear hierarchy
- Example: Manager delegates to workers
-
Peer-to-peer:
- Equal agents
- Distributed coordination
- No central authority
- Example: Swarm agents
-
Market-based:
- Agents trade resources
- Auction-based allocation
- Economic incentives
- Example: Task marketplace
-
Blackboard:
- Shared blackboard
- Agents read/write
- Opportunistic coordination
- Example: Shared knowledge base
What are the challenges in agent-to-agent communication?
What are the challenges in agent-to-agent communication?
-
Message understanding:
- Agents interpret messages
- Ambiguity in communication
- Different vocabularies
- Misunderstandings
-
Synchronization:
- Timing of messages
- Async vs sync communication
- Race conditions
- Deadlocks
-
Scalability:
- Communication overhead
- Message flooding
- Network congestion
- Performance degradation
-
Reliability:
- Message delivery
- Lost messages
- Duplicate messages
- Ordering guarantees
-
Security:
- Authentication
- Authorization
- Message encryption
- Trust between agents
-
Coordination:
- Avoiding conflicts
- Resolving disputes
- Consensus building
- Task allocation
- Protocols: Standardized communication
- Message queues: Reliable delivery
- Encryption: Secure communication
- Authentication: Trusted agents
- Coordination algorithms: Conflict resolution
How do you evaluate agent-to-agent systems?
How do you evaluate agent-to-agent systems?
-
System-level metrics:
- Overall task completion
- System efficiency
- Resource utilization
- End-to-end performance
-
Agent-level metrics:
- Individual agent performance
- Agent contribution
- Agent reliability
- Agent efficiency
-
Communication metrics:
- Message overhead
- Communication latency
- Message success rate
- Coordination efficiency
-
Coordination metrics:
- Task allocation quality
- Conflict resolution rate
- Consensus building time
- Coordination overhead
-
Scalability metrics:
- Performance with more agents
- Communication overhead growth
- System stability
- Resource usage
- Simulation: Test in controlled environment
- Benchmarks: Standard test suites
- Real-world: Deploy and monitor
- Stress testing: Test under load
Compare reactive vs deliberative vs hybrid agents
Compare reactive vs deliberative vs hybrid agents
- React to current situation
- No planning or memory
- Fast response
- Simple implementation
- Limited to simple tasks
- Plan before acting
- Consider future consequences
- Slower but better decisions
- More complex
- Handle complex tasks
- Combine reactive and deliberative
- React for urgent, deliberate for complex
- Balance speed and quality
- Most practical
- Best of both worlds
| Aspect | Reactive | Deliberative | Hybrid |
|---|---|---|---|
| Speed | Fast | Slow | Medium |
| Complexity | Simple | Complex | Medium |
| Planning | No | Yes | Selective |
| Memory | No | Yes | Yes |
| Use case | Simple | Complex | General |
- Reactive: Simple, fast-response tasks
- Deliberative: Complex planning tasks
- Hybrid: General-purpose agents
What is the difference between plan-and-execute vs ReAct agent strategies?
What is the difference between plan-and-execute vs ReAct agent strategies?
- Plan entire task upfront
- Execute plan step by step
- Rigid execution
- Can’t adapt to changes
- Good for predictable tasks
- Interleave reasoning and acting
- Plan incrementally
- Adapt to observations
- Flexible execution
- Good for dynamic tasks
- Pros: Clear plan, efficient execution
- Cons: Rigid, can’t adapt, fails if plan wrong
- Use when: Task is predictable, plan is reliable
- Pros: Flexible, adapts, handles uncertainty
- Cons: More steps, slower, more tokens
- Use when: Task is dynamic, needs adaptation
How do learning agents improve over time?
How do learning agents improve over time?
-
Experience collection:
- Collect training data
- Record actions and outcomes
- Build experience database
- Track performance
-
Learning mechanisms:
- Supervised learning: Learn from labeled examples
- Reinforcement learning: Learn from rewards
- Unsupervised learning: Discover patterns
- Meta-learning: Learn to learn
-
Performance improvement:
- Better decision-making
- Fewer errors
- Faster task completion
- Higher success rate
-
Adaptation:
- Adapt to new tasks
- Handle edge cases
- Generalize from experience
- Transfer learning
-
Online learning:
- Learn during operation
- Continuous improvement
- Real-time adaptation
- Example: Agent learns from user feedback
-
Offline learning:
- Learn from historical data
- Batch training
- Periodic updates
- Example: Retrain on collected data
-
Transfer learning:
- Learn from related tasks
- Apply to new domains
- Faster adaptation
- Example: Agent trained on task A helps with task B
What are the evaluation frameworks for agent systems?
What are the evaluation frameworks for agent systems?
-
AgentBench:
- Multi-domain agent tasks
- Standardized evaluation
- Diverse task types
- Comprehensive metrics
-
WebArena:
- Web navigation tasks
- Realistic scenarios
- Browser automation
- Success rate metrics
-
ToolBench:
- Tool-using tasks
- Function calling evaluation
- Tool selection accuracy
- Task completion rate
-
ALFWorld:
- Household tasks
- Embodied agents
- Sequential actions
- Task success metrics
-
Custom frameworks:
- Domain-specific tasks
- Real-world scenarios
- Business metrics
- User satisfaction
- Task success: Can agent complete task?
- Efficiency: How many steps?
- Quality: How good is result?
- Reliability: How consistent?
- Cost: How expensive?
13. System Design Thinking
How do you make an AI system more deterministic and less brittle?
How do you make an AI system more deterministic and less brittle?
-
Temperature = 0:
- Pure greedy decoding
- Deterministic outputs
- Reproducible
-
Fixed seed:
- Set random seed
- Same seed = same output
- Reproducible
-
Structured output:
- Use JSON schema
- Validate output format
- Consistent structure
-
Prompt engineering:
- Clear instructions
- Few-shot examples
- Consistent format
-
Error handling:
- Graceful degradation
- Fallback strategies
- Retry logic
-
Validation:
- Input validation
- Output validation
- Error detection
-
Monitoring:
- Track failures
- Alert on issues
- Quick response
What fallback do you use if the LLM fails mid-task?
What fallback do you use if the LLM fails mid-task?
-
Retry:
- Retry with same prompt
- Exponential backoff
- Max retries
-
Simplified prompt:
- Retry with simpler prompt
- Remove complexity
- Basic version
-
Cached response:
- Return cached response
- Similar queries
- Fast fallback
-
Template response:
- Pre-written responses
- Generic answers
- User-friendly
-
Human escalation:
- Route to human
- For critical tasks
- Last resort
Can you solve this without an LLM or vector DB?
Can you solve this without an LLM or vector DB?
-
Simple tasks:
- Rule-based sufficient
- No need for AI
- Faster, cheaper
-
Deterministic tasks:
- Need exact results
- No ambiguity
- Traditional methods better
-
Cost-sensitive:
- LLM too expensive
- Simple solution sufficient
- Cost optimization
-
Latency-critical:
- Need very fast response
- LLM too slow
- Real-time requirements
-
Small dataset:
- Can use simple search
- No need for vector DB
- Overkill
-
Exact matches:
- Keyword search sufficient
- No semantic search needed
- Simpler solution
What's the right database for this task - SQL, NoSQL, or vector?
What's the right database for this task - SQL, NoSQL, or vector?
- Use for: Structured data, exact queries, transactions
- Examples: PostgreSQL, MySQL
- Best for: User data, transactions, structured queries
- Use for: Unstructured data, flexible schema, scale
- Examples: MongoDB, DynamoDB
- Best for: Documents, flexible schema, high scale
- Use for: Embeddings, similarity search, RAG
- Examples: Pinecone, Milvus, Weaviate
- Best for: Semantic search, RAG systems
- Structured data + exact queries: SQL
- Unstructured data + flexible schema: NoSQL
- Embeddings + similarity search: Vector
14. Risks, Integrity & Compliance
How do you monitor hallucinations in production?
How do you monitor hallucinations in production?
-
Output validation:
- Check for factual claims
- Verify against sources
- Flag suspicious outputs
-
Confidence scores:
- Monitor confidence levels
- Flag low-confidence outputs
- Review manually
-
User feedback:
- Collect thumbs up/down
- Track user reports
- Identify patterns
-
Citation accuracy:
- Verify citations
- Check source relevance
- Measure citation precision
-
Automated checks:
- Fact-checking APIs
- Knowledge base verification
- Pattern detection
Bias vs fairness: where does 'fixing it' actually make systems worse?
Bias vs fairness: where does 'fixing it' actually make systems worse?
- Statistical bias in model
- Can be measured
- Technical issue
- Social concept
- Subjective
- Context-dependent
-
Over-correction:
- Fixing one bias creates another
- Unintended consequences
- Worse outcomes
-
Wrong metrics:
- Optimizing wrong fairness metric
- Doesn’t improve real fairness
- Makes system worse
-
Context mismatch:
- Fixing for one context
- Doesn’t work in other contexts
- Creates new issues
What's your red-teaming checklist for a new LLM product?
What's your red-teaming checklist for a new LLM product?
-
Safety:
- Harmful content generation
- Jailbreak attempts
- Prompt injection
- Safety bypasses
-
Bias:
- Demographic bias
- Stereotyping
- Unfair treatment
- Representation issues
-
Privacy:
- PII leakage
- Data exposure
- Privacy violations
- Compliance issues
-
Security:
- Prompt injection
- Model extraction
- Data poisoning
- Adversarial attacks
-
Reliability:
- Hallucinations
- Inconsistency
- Error handling
- Edge cases
How do you handle privacy when logs contain user prompts?
How do you handle privacy when logs contain user prompts?
-
Anonymization:
- Remove PII
- Hash sensitive data
- Pseudonymize users
-
Access control:
- Role-based access
- Audit logs
- Secure storage
-
Retention:
- Set retention policies
- Delete old logs
- Comply with regulations
-
Encryption:
- Encrypt at rest
- Encrypt in transit
- Secure storage
-
Compliance:
- GDPR, CCPA compliance
- User consent
- Right to deletion
15. Scaling & Business Impact
Cost vs latency vs accuracy: describe a time you had to sacrifice one
Cost vs latency vs accuracy: describe a time you had to sacrifice one
- Use smaller models → lower cost, lower accuracy
- Use larger models → higher cost, higher accuracy
- Decision: Balance based on requirements
- Use faster models → lower latency, lower accuracy
- Use better models → higher latency, higher accuracy
- Decision: Balance based on use case
- Use caching → lower cost, lower latency
- Use more GPUs → higher cost, lower latency
- Decision: Balance based on budget
- Situation: High latency, need to reduce
- Solution: Use smaller model, add caching
- Trade-off: Slight accuracy loss, but acceptable
- Result: Latency reduced 50%, accuracy dropped 5%
Enterprise readiness: what infra concerns block adoption most?
Enterprise readiness: what infra concerns block adoption most?
-
Security:
- Data privacy
- Compliance
- Access control
- Encryption
-
Reliability:
- Uptime requirements
- Error handling
- Disaster recovery
- SLAs
-
Scalability:
- Handle enterprise scale
- Performance at scale
- Cost at scale
- Infrastructure needs
-
Integration:
- Existing systems
- APIs, authentication
- Data pipelines
- Workflows
-
Support:
- Documentation
- Support channels
- Training
- Maintenance
How would you design a GenAI-first product that survives beyond prototype hype?
How would you design a GenAI-first product that survives beyond prototype hype?
-
Real value:
- Solve real problems
- Clear value proposition
- User needs first
-
Quality:
- High accuracy
- Reliable performance
- Consistent results
-
Scalability:
- Handle growth
- Cost-effective
- Performance at scale
-
User experience:
- Intuitive interface
- Fast responses
- Good error handling
-
Iteration:
- Continuous improvement
- User feedback
- Regular updates
16. Real-World Scenarios
What happens if your embedding model changes - how do you migrate safely?
What happens if your embedding model changes - how do you migrate safely?
-
Dual-write:
- Write to both old and new vector DBs
- Gradually migrate reads
- Deprecate old DB
-
Blue-green:
- Maintain two environments
- Re-embed in green
- Switch traffic when ready
-
Incremental:
- Re-embed in batches
- Update incrementally
- Route queries appropriately
-
Validation:
- Compare results
- Ensure quality maintained
- Monitor metrics
How would you fine-tune a model on user behavior and deploy it?
How would you fine-tune a model on user behavior and deploy it?
-
Data collection:
- Collect user behavior data
- Label data
- Create training set
-
Fine-tuning:
- Train on user behavior
- Monitor validation metrics
- Iterate
-
Evaluation:
- Test on held-out set
- Compare with baseline
- Measure improvement
-
Deployment:
- A/B test against baseline
- Gradual rollout
- Monitor performance
-
Monitoring:
- Track metrics
- Monitor user feedback
- Iterate based on results
How would you make this system cheaper without killing quality?
How would you make this system cheaper without killing quality?
-
Model optimization:
- Quantization
- Model distillation
- Smaller models
-
Caching:
- Prompt caching
- Response caching
- Reduce redundant calls
-
Smart routing:
- Route simple queries to smaller models
- Route complex to larger models
- Cost-aware routing
-
Batching:
- Dynamic batching
- Continuous batching
- Higher throughput
-
Infrastructure:
- Spot instances
- Auto-scaling
- Right-sizing
Can you walk me through a debugging session for incorrect LLM outputs?
Can you walk me through a debugging session for incorrect LLM outputs?
-
Reproduce:
- Reproduce the issue
- Capture prompt and response
- Identify pattern
-
Analyze:
- Check prompt quality
- Review context
- Check model parameters
-
Isolate:
- Test with minimal prompt
- Remove complexity
- Identify root cause
-
Fix:
- Update prompt
- Adjust parameters
- Add examples
-
Validate:
- Test on known cases
- Verify fix works
- Monitor in production
- Issue: Model gives wrong answer
- Debug: Check prompt, find ambiguous instruction
- Fix: Clarify instruction, add example
- Validate: Test on known cases, works correctly
Conclusion & Interview Tips
This guide covers all major AI engineering areas from prompt design to scalable systems and ethical deployment.Key Preparation Tips
- Understand system trade-offs
- Build RAG or LLM-serving demos
- Learn caching, monitoring, CI/CD
- Emphasize ethics & safety
- Explain architecture choices clearly
During the Interview
- Clarify before answering
- Think aloud for reasoning
- Mention latency/cost trade-offs
- Talk about monitoring and fallback
- Stay calm & confident