AI Engineer Interview Questions

1. LLM Engineering & Prompt Design

What are few-shot vs zero-shot prompting and when to use each?

Zero-shot prompting asks the model to perform a task without examples, relying only on pre-trained knowledge.
Few-shot prompting provides 1–5 examples to guide structure, style, and tone.Use zero-shot when:

Task is simple and well-known (e.g., translation)
No examples available or cost-sensitive
You want unbiased responses

Use few-shot when:

Output format consistency is needed
Domain-specific context or edge cases exist
Zero-shot outputs are inconsistent

Example:

Zero-shot: “Classify sentiment: ‘I love this phone!’”
Few-shot: Add 2–3 labeled examples before the query.

Few-shot usually improves accuracy by 10–30% but increases token usage 3–5x.

What is temperature in LLM generation and how does it affect outputs?

Temperature controls randomness in text generation (range: 0–2).

Low (0–0.3): Deterministic, precise outputs (coding, classification)
Medium (0.4–0.7): Balanced tone (summaries, Q&A)
High (0.8–1.0): Creative, diverse results (brainstorming)

Examples:

Coding assistant → temp=0.1
Customer support → temp=0.4
Marketing content → temp=0.8

Combine with top_p (nucleus sampling) and top_k for fine control.

What is prompt injection and how to defend against it?

Prompt injection happens when malicious input manipulates an LLM to break rules or expose system prompts.Attack types:

Direct: “Ignore previous instructions…”
Context: Hidden malicious text in documents
Jailbreak: Role-play or DAN-style prompts
Prompt leak: Forcing model to reveal system prompt

Defenses:

Input validation: Filter keywords like “ignore”, “system prompt”.
Prompt separation: Clearly delimit system and user input.
Instruction hierarchy: Reiterate rules after user input.
Output validation: Sanitize responses before showing.
Monitoring: Log blocked attempts for review.

Best practice: Layered security validate, isolate, and filter both input and output.

What is tokenization, and how does it affect generation?

Tokenization is the process of breaking text into smaller units (tokens) that the model can process. Tokens can be words, subwords, or characters depending on the tokenizer.How it works:

Text → Token IDs → Model processing → Generation
Different tokenizers (BPE, WordPiece, SentencePiece) use different strategies
Token count directly affects cost and context window usage

Impact on generation:

Token limits: Models have maximum token limits (context window)
Cost: Pricing is typically per token (input + output)
Quality: Better tokenization preserves semantic meaning
Speed: Fewer tokens = faster processing

Example: “Hello world” might tokenize to ["Hello", " world"] (2 tokens) or ["Hel", "lo", " wor", "ld"] (4 tokens) depending on tokenizer.Best practices:

Understand your model’s tokenizer (GPT uses BPE, BERT uses WordPiece)
Monitor token usage to optimize costs
Consider tokenization when chunking documents for RAG

How do embeddings really work?

Embeddings convert text into dense numerical vectors (arrays of numbers) that capture semantic meaning. Similar texts have similar embeddings.How they work:

Training: Models learn from large text corpora that words appearing in similar contexts should have similar vectors
Vector space: Words/concepts are positioned in high-dimensional space (typically 384, 768, or 1536 dimensions)
Similarity: Cosine similarity or dot product measures how “close” two embeddings are
Semantic capture: “king” - “man” + “woman” ≈ “queen” (famous word2vec example)

Key concepts:

Dense vectors: Every dimension has meaning (unlike sparse one-hot encoding)
Fixed size: All texts map to same dimension vector
Learned representations: Capture semantic relationships from training data

Use cases:

Semantic search (find similar documents)
RAG (retrieve relevant context)
Clustering and classification
Recommendation systems

Example: “machine learning” and “artificial intelligence” have high cosine similarity (≈0.85) because they’re semantically related.

What's the role of attention, positional encoding?

Attention mechanism:

Allows model to focus on relevant parts of input when generating each token
Computes weighted relationships between all tokens
Enables understanding of long-range dependencies
Self-attention: tokens attend to other tokens in same sequence

Positional encoding:

Adds information about token position since transformers process all tokens in parallel
Without it, “dog bites man” and “man bites dog” would be identical
Can be learned (BERT) or fixed sinusoidal patterns (original Transformer)

Why both matter:

Attention: “what to focus on” (semantic relationships)
Position: “where things are” (order matters for meaning)
Together: Model understands both meaning and structure

Example: In “The cat sat on the mat”, attention helps model understand “cat” relates to “sat” and “mat”, while positional encoding ensures correct order.

What changes during fine-tuning? (optimizers, schedulers, layer freezing)

Fine-tuning adapts a pre-trained model to specific tasks or domains.What changes:

Model weights: Selected layers get updated (not all layers necessarily)
Learning rate: Much lower than pre-training (typically 1e-5 to 1e-3)
Optimizers: Often AdamW or Adam with weight decay
Schedulers: Cosine annealing, linear warmup, or constant LR
Layer freezing: Early layers often frozen, only top layers trained

Common strategies:

Full fine-tuning: All parameters updated (expensive, needs more data)
PEFT (Parameter-Efficient Fine-Tuning): LoRA, Adapters, only train small subset
Layer freezing: Freeze embeddings and early transformer layers, train only classifier head

Hyperparameters:

Batch size: 4-32 (depends on GPU memory)
Epochs: 1-5 (often 1-3 is enough)
Gradient accumulation: Simulate larger batches
Mixed precision: FP16/BF16 for memory efficiency

Best practices:

Start with frozen layers, gradually unfreeze
Use learning rate finder to find optimal LR
Monitor validation loss to prevent overfitting
Save checkpoints frequently

Transformers hinge on attention can you explain why 'attention is all you need' isn't just marketing?

The “Attention is All You Need” paper revolutionized NLP by showing attention alone (without RNNs/CNNs) could achieve state-of-the-art results.Why attention works:

Parallelization: Unlike RNNs, all tokens processed simultaneously (faster training)
Long-range dependencies: Direct connections between any two tokens (RNNs struggle with distance)
Interpretability: Attention weights show what model focuses on
Flexibility: Can attend to any part of input, not just sequential neighbors

Key innovations:

Multi-head attention: Multiple attention mechanisms capture different relationships
Self-attention: Tokens attend to other tokens in same sequence
Scaled dot-product attention: Efficient computation with scaling factor

Why it’s not just marketing:

Empirically proven: Achieved SOTA on translation, outperformed RNNs/CNNs
Enables modern LLMs: GPT, BERT, T5 all use attention
Scalable: Works with billions of parameters
Foundation for current AI: Most LLMs are transformer-based

Limitation: Quadratic complexity with sequence length (O(n²)), but recent work (Flash Attention, sparse attention) mitigates this.

Encoder vs decoder vs encoder-decoder: in what scenarios would you prefer each?

Encoder-only (BERT, RoBERTa):

Bidirectional understanding (sees full context)
Best for: Classification, NER, sentiment analysis, understanding tasks
Example: BERT for question answering (reads passage, finds answer)

Decoder-only (GPT, LLaMA):

Autoregressive generation (predicts next token)
Best for: Text generation, completion, chat, creative writing
Example: GPT for story generation or code completion

Encoder-decoder (T5, BART):

Both understanding and generation
Best for: Translation, summarization, text-to-text tasks
Example: T5 for “translate English to French: Hello” → “Bonjour”

Decision framework:

Need to understand input? → Encoder or encoder-decoder
Need to generate text? → Decoder or encoder-decoder
Need both? → Encoder-decoder
Most modern LLMs are decoder-only (GPT-style) because they’re more flexible

Modern trend: Decoder-only models (GPT-4, Claude) can do both understanding and generation, making them versatile for most tasks.

Walk me through tokenization choices (BPE vs WordPiece vs SentencePiece). Where do they break down?

BPE (Byte Pair Encoding):

Used by: GPT, RoBERTa
How: Starts with characters, iteratively merges most frequent pairs
Pros: Handles unknown words, good for multilingual
Cons: Can split words awkwardly
Breaks down: Very long words, domain-specific terms

WordPiece:

Used by: BERT, DistilBERT
How: Similar to BPE but uses language model likelihood
Pros: Better word boundaries, handles subwords well
Cons: Less flexible than BPE
Breaks down: Rare technical terms, code snippets

SentencePiece:

Used by: T5, ALBERT, multilingual models
How: Treats input as Unicode, works at sentence level
Pros: Language-agnostic, handles any Unicode text
Cons: Can be slower, larger vocabulary
Breaks down: Very rare characters, mixed scripts

Comparison:

BPE: Best for general-purpose, multilingual
WordPiece: Best for English, better word preservation
SentencePiece: Best for multilingual, code, special characters

When they break down:

Very long technical terms (e.g., chemical names)
Mixed languages in single sentence
Code with special syntax
Emojis and special Unicode characters
Domain-specific jargon not in training data

Embeddings: why does cosine similarity dominate, and when does it fail?

Why cosine similarity dominates:

Magnitude-independent: Focuses on direction, not vector length
Normalized: Range is [-1, 1], easy to interpret
Efficient: Fast computation, works well with approximate nearest neighbor search
Semantic focus: Captures semantic similarity better than Euclidean distance

Mathematical intuition:

Cosine similarity = dot product of normalized vectors
Measures angle between vectors, not distance
Vectors pointing same direction = similar meaning

When it works well:

Semantic search (find similar documents)
Recommendation systems
Clustering similar texts
RAG retrieval

When it fails:

Magnitude matters: If vector length encodes importance, cosine ignores it
Sparse vectors: Works poorly with very sparse embeddings
High dimensionality: Can become less discriminative in very high dimensions
Domain mismatch: Embeddings from different models aren’t comparable
Fine-grained differences: May not capture subtle distinctions

Alternatives:

Dot product: When magnitude matters
Euclidean distance: When absolute distance is important
Manhattan distance: For sparse vectors
Learned similarity: Train a model to learn similarity function

Best practice: Use cosine for semantic similarity, but validate on your specific use case.

How do context windows constrain design? How have you handled long-context hacks in production?

Context window constraints:

Models have fixed maximum context (e.g., GPT-4: 128k tokens, Claude: 200k)
Input + output must fit within limit
Longer context = higher cost and latency

Design implications:

Must truncate or summarize long documents
Need strategies for multi-turn conversations
RAG becomes essential for knowledge beyond context
Chunking strategy critical for document processing

Long-context strategies:

Sliding window: Process document in overlapping chunks
Hierarchical summarization: Summarize chunks, then summarize summaries
Retrieval: Use RAG to fetch relevant parts instead of including everything
Context compression: Use smaller models to compress context before main model
Relevance filtering: Only include most relevant parts of long documents

Production hacks:

Prompt caching: Cache system prompts to save tokens
Streaming: Start generating before full context processed
Progressive loading: Load context incrementally
Smart truncation: Keep beginning and end, truncate middle (often most important parts)

Example: For 100k token document with 8k context window:

Chunk into 10 pieces of 8k each
Embed and store in vector DB
For query, retrieve top 3 most relevant chunks
Include only those in context (24k tokens, fits in window)

Trade-offs:

More context = better understanding but higher cost
Less context = faster/cheaper but may miss information
Balance based on use case requirements

Greedy vs beam vs nucleus sampling which would you pick for a summarization API under latency pressure?

Greedy decoding:

Always picks highest probability token
Fastest, deterministic
Can get stuck in repetitive loops
Best for: Code generation, when determinism needed

Beam search:

Keeps top-k candidates at each step
Explores multiple paths, finds better sequences
Slower (k× slower), more memory
Best for: Translation, when quality > speed

Nucleus (top-p) sampling:

Samples from smallest set covering p% of probability mass
Good balance of quality and diversity
Faster than beam, more diverse than greedy
Best for: Creative tasks, chat, when need variety

For summarization API under latency pressure:

Choose: Greedy or top-p with low temperature (0.3-0.5)
Why: Summarization needs accuracy, not creativity
Greedy: Fastest, good for factual summaries
Top-p (p=0.9, temp=0.3): Slightly slower but more natural phrasing

Recommendation:

Start with greedy for maximum speed
If quality issues, use top-p with low temperature
Avoid beam search (too slow for API)
Consider caching common summaries

Production tip: Use streaming with greedy/top-p to send first tokens immediately, improving perceived latency.

Why is positional encoding critical, and what happens when models forget 'order'?

Why positional encoding is critical:

Transformers process all tokens in parallel (no inherent order)
“The cat sat” vs “sat cat The” would be identical without position info
Position encoding tells model where each token is in sequence

What happens without it:

Model can’t distinguish word order
“Dog bites man” = “Man bites dog” (same meaning to model)
Grammar and syntax understanding breaks down
Language is inherently sequential order matters

Types of positional encoding:

Fixed sinusoidal: Original Transformer, mathematical patterns
Learned: BERT-style, model learns positions during training
Relative: T5-style, encodes relative distances between tokens

When models “forget” order:

Position encoding gets corrupted or removed
Very long sequences beyond training length
Position embeddings not properly initialized
Result: Nonsensical output, loss of grammatical structure

Example failure:

Input: “I love programming in Python”
Without position: Model might generate “Python in programming love I”
With position: Correct order maintained

Best practices:

Always include positional encoding
For long contexts, use models trained on long sequences
Consider relative position encoding for variable-length inputs

What's the trade-off between dense embeddings and sparse/lexical search?

Dense embeddings (semantic search):

Vector representations from neural networks
Captures semantic meaning, synonyms, context
Pros: Understands intent, handles synonyms, multilingual
Cons: Can miss exact keyword matches, requires embedding model

Sparse/lexical search (keyword search):

Traditional TF-IDF, BM25, keyword matching
Exact word matching, no semantic understanding
Pros: Fast, exact matches, interpretable, no model needed
Cons: Misses synonyms, no semantic understanding

Trade-offs:

Dense: Better for “find documents about machine learning” (understands ML = AI = deep learning)
Sparse: Better for “find documents containing ‘Python 3.9’” (exact version matters)

Hybrid search (best of both):

Combine dense + sparse scores
Weighted combination: final_score = α × dense_score + (1-α) × sparse_score
Captures both semantic similarity and exact matches

When to use each:

Dense only: Semantic search, Q&A, when synonyms matter
Sparse only: Exact keyword search, code search, when precision critical
Hybrid: Production RAG systems (recommended)

Example: Query “automated testing”

Dense: Finds “test automation”, “QA automation”, “CI/CD testing”
Sparse: Only finds exact phrase “automated testing”
Hybrid: Finds both semantic matches and exact phrase

Best practice: Use hybrid search in production for best results.

When would you choose a smaller distilled model over a frontier model?

Distilled models (smaller, faster):

Examples: DistilBERT, TinyBERT, GPT-3.5-turbo vs GPT-4
Trained to mimic larger models
2-10× smaller, 3-5× faster

When to choose distilled:

Latency constraints: Real-time applications, edge devices
Cost optimization: Lower inference costs, especially at scale
Resource limits: Mobile apps, embedded systems, limited GPU memory
Simple tasks: When smaller model is sufficient (classification, simple Q&A)
High throughput: Need to process many requests quickly

When to choose frontier model:

Complex reasoning: Need advanced capabilities (GPT-4, Claude Opus)
Quality critical: When accuracy is more important than speed
Novel tasks: Tasks smaller models can’t handle
Low volume: When cost isn’t concern, quality is priority

Decision framework:

Task complexity: Simple → distilled, complex → frontier
Latency requirement: <100ms → distilled, can wait → frontier
Volume: High volume → distilled, low volume → frontier
Budget: Limited → distilled, flexible → frontier

Production strategy:

Use distilled for 80% of requests (fast, cheap)
Route complex queries to frontier model (smart routing)
A/B test to find right balance

Example: Customer support chatbot

Use GPT-3.5-turbo for common questions (fast, cheap)
Escalate complex issues to GPT-4 (better reasoning)

How do you test whether a model's 'knowledge cutoff' impacts reliability for your use case?

Knowledge cutoff:

Date when model’s training data ends
Model doesn’t know events/information after that date
Example: GPT-4 trained on data up to April 2023

Testing impact:

Create test set: Questions about events before and after cutoff
Measure accuracy: Compare performance on pre vs post-cutoff questions
Check hallucinations: Model may confidently make up post-cutoff information
Domain-specific: Test your specific domain (tech, finance, etc.)

Test methodology:

Before cutoff: “What happened in 2022?” → Should answer correctly
After cutoff: “What happened in 2024?” → May hallucinate or say “I don’t know”
Edge cases: Events right around cutoff date

Mitigation strategies:

RAG: Use retrieval to get current information
Web search: Integrate search API for recent events
Fine-tuning: Fine-tune on recent data (if available)
Hybrid approach: Use model for reasoning, external sources for facts

Production monitoring:

Track questions about recent events
Flag potential hallucinations
Use RAG for time-sensitive queries
Set expectations with users about knowledge cutoff

Example test:

Query: “Latest Python version in 2024?”
Without RAG: May give outdated answer or hallucinate
With RAG: Retrieves current info, gives accurate answer

Best practice: Always use RAG for time-sensitive information, regardless of model’s knowledge cutoff.

How do you choose a vector DB (Chroma, Pinecone, OpenSearch…)?

Key factors:

Scale: Number of vectors, query volume
Latency: Response time requirements
Features: Filtering, metadata, hybrid search
Deployment: Managed vs self-hosted
Cost: Pricing model, infrastructure costs

Comparison:Pinecone (Managed):

Pros: Easy setup, good performance, managed scaling
Cons: Expensive at scale, vendor lock-in
Best for: Quick prototypes, small to medium scale

Chroma (Self-hosted):

Pros: Open source, easy to use, good for development
Cons: Less scalable, fewer features
Best for: Development, small projects, learning

Weaviate (Self-hosted/Managed):

Pros: Feature-rich, good performance, hybrid search
Cons: More complex setup
Best for: Production systems needing advanced features

Milvus (Self-hosted):

Pros: Highly scalable, production-ready, open source
Cons: Complex setup, needs infrastructure
Best for: Large-scale production systems

OpenSearch/Elasticsearch:

Pros: Mature, good ecosystem, supports vector search
Cons: Not optimized specifically for vectors
Best for: When you need full-text + vector search

Qdrant:

Pros: Fast, good filtering, open source
Cons: Smaller community
Best for: Performance-critical applications

Decision framework:

Prototype: Chroma or Pinecone
Production <10M vectors: Pinecone or Weaviate
Production >10M vectors: Milvus or Qdrant
Need full-text search: OpenSearch
Budget constrained: Self-hosted (Chroma, Milvus, Qdrant)

Best practice: Start with managed (Pinecone) for speed, migrate to self-hosted (Milvus) as you scale.

Can you update or backfill embeddings with zero downtime?

Challenge:

Updating embedding model changes all vector representations
Old and new embeddings aren’t compatible
Need to re-embed all documents without service interruption

Zero-downtime strategies:1. Dual-write approach:

Write to both old and new vector DBs simultaneously
Gradually migrate reads from old to new
Once migration complete, deprecate old DB

2. Blue-green deployment:

Maintain two environments (blue = old, green = new)
Re-embed all documents in green environment
Switch traffic when ready
Keep blue as backup

3. Incremental backfill:

Re-embed documents in batches
Use message queue to process updates
Update vector DB incrementally
Route queries to appropriate DB based on document version

4. Versioned embeddings:

Store multiple embedding versions per document
Query both versions, merge results
Gradually phase out old version

Implementation:

Preparation: Set up new embedding pipeline, new vector DB
Dual-write: New documents go to both DBs
Backfill: Re-process existing documents in background
Gradual cutover: Route percentage of queries to new DB
Validation: Compare results, ensure quality maintained
Full cutover: Switch all traffic to new DB
Cleanup: Remove old DB after validation period

Best practices:

Use feature flags to control rollout
Monitor metrics during migration
Keep old system as fallback
Test with small subset first
Document the process

Example: Upgrading from text-embedding-ada-002 to text-embedding-3-small

Embed new documents with both models
Backfill existing documents in background
Gradually switch queries to new embeddings
Validate quality hasn’t degraded

How do you evaluate retrieval quality (precision@k, reranking, citation)?

Evaluation metrics:Precision@k:

Fraction of retrieved items that are relevant
Precision@5 = 3 relevant out of 5 retrieved = 0.6
Measures accuracy of top-k results

Recall@k:

Fraction of all relevant items that were retrieved
Recall@10 = 7 relevant retrieved out of 10 total relevant = 0.7
Measures coverage

MRR (Mean Reciprocal Rank):

Average of 1/rank of first relevant result
Higher is better, emphasizes top results

NDCG (Normalized Discounted Cumulative Gain):

Considers ranking quality, discounts lower positions
Best for when relevance has degrees (highly relevant vs somewhat relevant)

Reranking:

Second-stage ranking using more expensive model
Improves precision by reordering initial results
Trade-off: Better quality but higher latency/cost

Citation evaluation:

Check if retrieved documents support the answer
Verify citations are accurate and relevant
Measure citation precision (correct citations / total citations)

Evaluation process:

Create test set: Queries with known relevant documents
Run retrieval: Get top-k results for each query
Label relevance: Human annotators mark relevant/irrelevant
Calculate metrics: Precision@k, Recall@k, MRR, NDCG
Iterate: Improve retrieval based on results

Production monitoring:

Track precision@k over time
Monitor user feedback (thumbs up/down)
A/B test different retrieval strategies
Alert on quality degradation

Best practices:

Use multiple metrics (precision + recall)
Test on domain-specific data
Monitor in production, not just offline
Use reranking for critical queries

2. Prompting & Context Engineering

Zero-shot vs few-shot: when have you seen one clearly outperform the other?

Zero-shot outperforms when:

Task is well-defined and common (translation, summarization)
Model has strong pre-training on the task
Cost/token usage is critical
Need unbiased responses without example influence
Examples are hard to construct or may introduce bias

Few-shot outperforms when:

Need specific output format (JSON, structured data)
Domain-specific terminology or edge cases
Zero-shot produces inconsistent results
Task requires demonstration of pattern
Working with unusual or complex patterns

Real-world examples:Zero-shot better:

Translation: “Translate to French: Hello” (model knows this well)
Simple classification: “Is this positive or negative?” (clear task)
General Q&A: “What is machine learning?” (common knowledge)

Few-shot better:

Code generation with specific style: Need examples showing preferred patterns
Complex extraction: “Extract entities in this format: [name, age, location]”
Domain-specific: Medical terminology, legal documents (need examples)
Multi-step reasoning: Chain-of-thought needs examples

Decision rule:

Start with zero-shot (simpler, cheaper)
Add few-shot if quality/consistency issues
Monitor token usage vs quality trade-off
A/B test to measure actual improvement

Best practice: Use few-shot strategically 2-3 high-quality examples often better than 5+ mediocre ones.

Why do chain-of-thought prompts sometimes collapse into nonsense at scale?

Why CoT fails at scale:

Model limitations:
- Smaller models lack reasoning capacity
- Can’t maintain coherent reasoning chains
- Gets confused with complex multi-step problems
Prompt quality:
- Poor examples lead to poor reasoning
- Inconsistent formatting confuses model
- Too many steps overwhelm model
Error propagation:
- Early reasoning mistake cascades
- Model can’t self-correct mid-chain
- Accumulates errors across steps
Context limits:
- Long reasoning chains exceed context
- Model forgets earlier steps
- Truncation breaks reasoning flow
Task mismatch:
- CoT not suitable for all tasks
- Simple tasks don’t need reasoning
- Over-engineering can hurt performance

When CoT works:

Large models (GPT-4, Claude) with strong reasoning
Complex problems requiring multi-step thinking
Well-constructed prompts with clear examples
Tasks that benefit from explicit reasoning

When CoT fails:

Small models trying to reason beyond capacity
Simple tasks that don’t need reasoning
Poorly constructed prompts
Tasks requiring factual recall, not reasoning

Mitigation strategies:

Use CoT only with capable models
Start with simple CoT, add complexity gradually
Validate reasoning steps, not just final answer
Use self-consistency (generate multiple chains, pick best)
Monitor for reasoning quality, not just answer correctness

Best practice: Test CoT on your specific use case it’s not always better than direct prompting.

What's your approach to versioning prompts for reproducibility?

Versioning strategies:

Git-based versioning:
- Store prompts in version control
- Tag versions, track changes
- Enable rollback to previous versions
- Review prompt changes like code
Prompt registry:
- Centralized system for prompt management
- Version numbers, metadata (author, date, purpose)
- A/B testing different versions
- Track performance per version
Template system:
- Parameterized prompts with variables
- Version templates, not individual prompts
- Easier to update and maintain
- Example: {system_prompt_v2} + {user_input}
Configuration files:
- YAML/JSON configs for prompts
- Environment-specific prompts (dev, prod)
- Easy to update without code changes
- Version configs separately

Best practices:

Naming convention: prompt_v1.2.3_task_name
Documentation: Document why each version exists
Testing: Test prompts before deploying
Monitoring: Track performance per version
Rollback plan: Keep previous versions for quick rollback

Implementation example:

prompts:
  classification_v1.0:
    system: "You are a classifier..."
    examples: [...]
    created: "2024-01-15"
  
  classification_v1.1:
    system: "You are an expert classifier..."
    examples: [...]
    created: "2024-02-01"
    changes: "Added domain-specific examples"

Production workflow:

Develop prompt in staging
Version and test
Deploy with feature flag
Monitor performance
Gradually roll out
Keep old version as fallback

Best practice: Treat prompts like code version, test, review, and monitor.

Prompt failures are inevitable how do you debug them systematically?

Systematic debugging approach:

Logging:
- Log all prompts and responses
- Include metadata (timestamp, user, model version)
- Store for analysis and debugging
- Enable search and filtering
Categorize failures:
- Format errors: Wrong output structure
- Hallucinations: Made-up information
- Refusals: Model refuses valid requests
- Inconsistency: Same input, different outputs
- Off-topic: Model goes off-topic
Root cause analysis:
- Prompt issues: Ambiguous instructions, poor examples
- Model limitations: Task beyond model capability
- Input quality: Garbage in, garbage out
- Context problems: Missing or wrong context
- Parameter issues: Wrong temperature, top_p settings
Debugging techniques:
- Simplify: Remove complexity, test basic version
- Isolate: Test individual components
- Compare: A/B test different prompts
- Iterate: Make small changes, test each
- Validate: Check against known good examples
Tools:
- Prompt testing frameworks
- A/B testing platforms
- Evaluation metrics (accuracy, latency)
- User feedback collection

Debugging checklist:

Is prompt clear and unambiguous?
Are examples high-quality and relevant?
Is context complete and accurate?
Are parameters (temp, top_p) appropriate?
Is model capable of the task?
Are there edge cases not handled?

Best practices:

Build test suite of known good/bad cases
Monitor failure rates and patterns
Create runbook for common issues
Document solutions for future reference
Set up alerts for quality degradation

Example debugging session:

User reports: “Model gives wrong answer”
Check logs: Find prompt and response
Reproduce: Run same prompt, see if consistent
Simplify: Test with minimal prompt
Compare: Try different prompt variations
Fix: Identify issue, update prompt
Validate: Test on known cases
Deploy: Roll out fix with monitoring

Guardrails: regex filters, classifiers, or fine-tuning? What's the trade-off?

Guardrail approaches:Regex filters:

How: Pattern matching on input/output
Pros: Fast, simple, interpretable, no model needed
Cons: Brittle, easy to bypass, can’t understand context
Use when: Simple keyword blocking, known patterns

Classifiers:

How: ML model classifies content (toxic, PII, etc.)
Pros: Understands context, more robust, can be tuned
Cons: Needs training data, slower, may have false positives
Use when: Need semantic understanding, complex patterns

Fine-tuning:

How: Train model to refuse harmful requests
Pros: Most robust, understands nuance, built-in
Cons: Expensive, time-consuming, may reduce capabilities
Use when: Need model-level safety, have resources

Trade-offs:

Approach	Speed	Robustness	Cost	Complexity
Regex	Fast	Low	Low	Low
Classifier	Medium	Medium	Medium	Medium
Fine-tuning	Slow	High	High	High

Layered approach (recommended):

Regex: Block obvious patterns (quick wins)
Classifier: Catch semantic issues (context-aware)
Fine-tuning: Model-level safety (deep protection)
Output validation: Final check before returning

Best practice:

Start with regex for known issues
Add classifier for complex cases
Use fine-tuning for critical safety requirements
Combine approaches for defense in depth

Example:

Regex: Block known attack patterns
Classifier: Detect toxic content
Fine-tuning: Model refuses harmful requests
Output validation: Final safety check

How do you balance injecting instructions vs letting the model free-flow?

Instruction injection:

Explicit rules, constraints, format requirements
Pros: More control, consistent output, predictable
Cons: Can be too rigid, may limit creativity, longer prompts

Free-flow:

Minimal instructions, let model be creative
Pros: Natural responses, creative, flexible
Cons: Less control, inconsistent, may go off-topic

Balancing strategies:

Task-dependent:
- Structured tasks: More instructions (extraction, formatting)
- Creative tasks: Less instructions (writing, brainstorming)
- Critical tasks: More instructions (safety, accuracy)
Progressive disclosure:
- Start with minimal instructions
- Add constraints only if needed
- Test to find minimum viable instructions
Layered instructions:
- Core instructions (always)
- Optional constraints (when needed)
- Examples (for complex tasks)
User control:
- Let users choose strictness level
- “Creative mode” vs “Precise mode”
- Adjust instructions based on user preference

Best practices:

Start minimal, add only what’s necessary
Test different instruction levels
Monitor user satisfaction
Balance control with naturalness
Document instruction rationale

Example evolution:

v1: “Summarize this text” (too free, inconsistent)
v2: “Summarize in 3 bullet points” (better structure)
v3: “Summarize in 3 bullet points, each max 20 words” (too rigid)
v4: “Summarize in 3 concise bullet points” (balanced)

Decision framework:

Need consistency? → More instructions
Need creativity? → Less instructions
Need safety? → More instructions
Need naturalness? → Less instructions

If you had to monitor prompt drift in a live system, how would you design it?

Prompt drift monitoring:What to monitor:

Output quality metrics:
- Accuracy, relevance, correctness
- User satisfaction (thumbs up/down)
- Task completion rate
- Error rates
Output characteristics:
- Response length (may indicate drift)
- Tone/style changes
- Format consistency
- Hallucination rate
Model behavior:
- Refusal rate (may increase/decrease)
- Confidence scores (if available)
- Response time (may indicate issues)
- Token usage (may change)

Monitoring design:

Baseline establishment:
- Measure metrics on known good prompts
- Establish normal ranges
- Set thresholds for alerts
Continuous tracking:
- Log all prompts and responses
- Calculate metrics in real-time
- Store for historical analysis
Anomaly detection:
- Statistical tests (z-scores, percentiles)
- Machine learning models (detect patterns)
- Rule-based alerts (threshold breaches)
Alerting:
- Real-time alerts for critical issues
- Daily/weekly reports for trends
- Dashboard for visualization

Implementation:

# Pseudo-code
def monitor_prompt_drift():
    baseline_metrics = get_baseline()
    current_metrics = calculate_current_metrics()
    
    for metric in metrics:
        if abs(current - baseline) > threshold:
            alert(f"Drift detected in {metric}")
    
    if user_satisfaction_dropped():
        alert("User satisfaction declining")

Best practices:

Monitor multiple metrics (not just one)
Use statistical significance tests
Track trends over time
Set up automated alerts
Have runbook for common issues
Regular review and adjustment

Example monitoring:

Accuracy: Baseline 95%, current 92% → Alert
Response length: Baseline 100 tokens, current 150 → Investigate
User satisfaction: Baseline 4.5/5, current 3.8/5 → Alert

Few-shot vs zero-shot - which works better where?

Zero-shot works better:

Well-known tasks (translation, summarization)
When model has strong pre-training
Cost/token sensitive scenarios
Need unbiased responses
Examples are hard to construct

Few-shot works better:

Need specific output format
Domain-specific terminology
Complex or unusual patterns
Zero-shot produces inconsistent results
Task requires demonstration

Decision matrix:

Task Type	Zero-shot	Few-shot
Simple classification	✅	❌
Translation	✅	❌
Code generation	❌	✅
Complex extraction	❌	✅
General Q&A	✅	❌
Domain-specific	❌	✅

Best practice: Start with zero-shot, add few-shot only if needed. Test to measure actual improvement.

How do you design system prompts that are robust across users?

Robust system prompt design:

Clear role definition:
- Define model’s role clearly
- Set boundaries and limitations
- Specify behavior expectations
Explicit constraints:
- What model should do
- What model shouldn’t do
- How to handle edge cases
Consistent structure:
- Use clear sections (Role, Instructions, Constraints)
- Consistent formatting
- Easy to read and maintain
Test with diverse inputs:
- Test with different user types
- Test edge cases
- Test adversarial inputs
- Test various languages/styles
Version and iterate:
- Version system prompts
- A/B test different versions
- Monitor performance
- Update based on feedback

Example robust system prompt:

You are a helpful assistant. Your role is to:
- Answer questions accurately and helpfully
- Admit when you don't know something
- Refuse harmful or inappropriate requests

Constraints:
- Always be truthful
- Don't make up information
- If unsure, say so

Format:
- Use clear, concise language
- Structure complex answers with bullet points

Best practices:

Keep prompts concise but complete
Use examples for complex behaviors
Test with real users
Monitor for prompt injection attempts
Update based on observed issues

How do you make output deterministic?

Determinism strategies:

Temperature = 0:
- Pure greedy decoding
- Always picks highest probability token
- Most deterministic approach
Fixed seed:
- Set random seed for reproducibility
- Same seed = same output
- Works with temperature > 0
Top-k = 1:
- Only consider top token
- Combined with temperature = 0
- Maximum determinism
Prompt caching:
- Cache system prompts
- Reduces variability from prompt processing
- Improves consistency

Trade-offs:

Deterministic: Predictable but may be less natural
Non-deterministic: More natural but less predictable
Balanced: Low temperature (0.1-0.3) for slight variation

When to use:

Deterministic: Testing, debugging, when consistency critical
Non-deterministic: Creative tasks, when variety desired
Balanced: Most production use cases

Best practice: Use temperature = 0 for deterministic tasks, low temperature (0.1-0.3) for natural but consistent outputs.

How do you track, version, and backfill changing context?

Context management:

Versioning:
- Version context documents
- Track changes over time
- Enable rollback to previous versions
Tracking:
- Log which context was used for each query
- Store context version with responses
- Enable audit trail
Backfilling:
- Re-process queries with updated context
- Update responses if context changed
- Notify users of significant changes

Implementation:

Store context in versioned database
Tag queries with context version
Re-run queries when context updates
Compare old vs new responses

Best practices:

Version all context documents
Track context usage per query
Automate backfilling for critical updates
Monitor for context-related issues
Document context changes

How do you build/maintain the memory?

Memory systems for LLMs:

Short-term memory (conversation context):
- Last N messages in conversation
- Stored in session/cache
- TTL: 1-24 hours
Long-term memory (user preferences):
- User profile, preferences
- Stored in database
- Persists across sessions
Episodic memory (conversation history):
- Past conversations
- Searchable, retrievable
- Used for context
Semantic memory (knowledge base):
- RAG system with embeddings
- Retrieves relevant information
- Updates as knowledge changes

Maintenance:

Refresh: Update memory with new information
Prune: Remove outdated information
Validate: Check memory accuracy
Index: Make memory searchable

Best practices:

Use RAG for knowledge memory
Store user preferences in database
Cache recent conversations
Index memory for fast retrieval
Regularly update and validate

3. AI System Architecture

Design a scalable AI chatbot for 10,000 concurrent users

Requirements: 10k users, <2s latency, 99.9% uptime, cost-efficient.Architecture overview:

Load Balancer: NGINX / AWS ALB (SSL termination, DDoS protection)
API Gateway: Kong / AWS Gateway (rate limiting, JWT auth)
App Servers: FastAPI / Express (Kubernetes, auto-scaling)
Cache: Redis cluster for sessions & frequent responses
Model Serving:
- Managed (OpenAI/Vertex) for simplicity
- Self-hosted (vLLM + A100 GPUs) for cost optimization
Message Queue: RabbitMQ / Kafka for async tasks
Databases: PostgreSQL (metadata) + Pinecone/Milvus (vector search)
Monitoring: Prometheus + Grafana + ELK stack

Flow: User → Gateway → Cache → DB/vector store → Model → Streamed response
Cost: ~

33k (managed) or ~

15k (self-hosted) monthly.
Reliability: Circuit breakers, auto-scaling, blue-green deployments.

Why use a vector database in AI systems?

Vector DBs (Pinecone, Milvus, Weaviate) store high-dimensional embeddings for semantic search.Use cases:

RAG (Retrieval Augmented Generation)
Semantic similarity search
Contextual recommendations

Advantages:

Fast cosine/dot-product search
Horizontal scalability
<100ms retrieval time

Example: Retrieve top 5 semantically similar docs before LLM generation.

Explain caching in AI system design

Caching stores frequent responses or context to minimize repeated model calls.Tools: Redis / Memcached
Strategies:

Response caching: For common queries
Context caching: Last N user messages
Rate limiting: Prevent abuse
Eviction: LRU with TTL (1h for context, 24h for cache)

3. Model Deployment & Serving

What are common model serving frameworks?

vLLM: High throughput inference, dynamic batching
TorchServe: Scalable PyTorch serving
TensorRT / ONNX Runtime: Optimized inference for GPUs/CPUs
Ray Serve: Distributed deployment for microservices

Example: Deploy Mistral-7B on vLLM with 4×A100 GPUs for 200–300 tokens/sec per GPU.

Explain batch inference vs streaming

Batch inference: Process many inputs together efficient for offline jobs
Streaming: Generate and send tokens live ideal for chat or long text

Example: Chatbot → stream tokens for smoother UX.

What is the difference between managed API vs self-hosted model?

Managed API (OpenAI, Anthropic):

✅ Fast to integrate
✅ No infra maintenance
❌ Expensive at scale
❌ Limited customization

Self-hosted (vLLM, Text-Gen WebUI):

✅ Lower cost after 2–5M req/month
✅ Control over weights
❌ Needs GPU infra + MLOps skills

4. Fine-Tuning & Alignment

LoRA vs full fine-tuning: when is each justified?

LoRA (Low-Rank Adaptation):

Trains small adapter matrices instead of full weights
Only updates 0.1-1% of parameters
Much faster, less memory, cheaper
Preserves base model capabilities

Full fine-tuning:

Updates all model parameters
More expressive, can learn complex patterns
Slower, needs more memory, expensive
Risk of catastrophic forgetting

When to use LoRA:

Limited compute resources
Want to preserve base model
Quick iterations needed
Multiple task-specific adapters
Fine-tuning on consumer hardware

When to use full fine-tuning:

Large dataset available
Task significantly different from pre-training
Need maximum performance
Have sufficient compute resources
Single specialized model needed

Decision framework:

Resources limited? → LoRA
Large dataset? → Full fine-tuning
Multiple tasks? → LoRA (different adapters)
Maximum performance? → Full fine-tuning
Quick experiments? → LoRA

Best practice: Start with LoRA, move to full fine-tuning only if LoRA doesn’t meet requirements.

QLoRA claims efficiency what's the hidden cost?

QLoRA (Quantized LoRA):

Quantizes model to 4-bit, then applies LoRA
Enables fine-tuning on single GPU
Very memory efficient

Hidden costs:

Quantization overhead:
- Dequantization during training
- Slight accuracy loss from quantization
- More complex implementation
Performance trade-offs:
- Slower than full precision
- May not reach full fine-tuning quality
- Limited to certain model architectures
Debugging complexity:
- Harder to debug quantized models
- Less interpretable
- More moving parts
Compatibility:
- Not all models support quantization
- May need specific libraries
- Hardware requirements vary

When QLoRA makes sense:

Very limited GPU memory
Fine-tuning large models (7B+)
Research/experimentation
Cost-constrained scenarios

When to avoid:

Need maximum accuracy
Have sufficient resources
Production-critical applications
Small models (can use full fine-tuning)

Best practice: Use QLoRA when memory is the constraint, but be aware of accuracy trade-offs.

RLHF vs DPO: which alignment approach would you pick for a safety-critical use case?

RLHF (Reinforcement Learning from Human Feedback):

Uses reinforcement learning with human preferences
More complex, needs reward model
Proven in production (ChatGPT, Claude)
Better for complex alignment

DPO (Direct Preference Optimization):

Directly optimizes on preference pairs
Simpler, no reward model needed
Faster training, easier to implement
Good for preference alignment

For safety-critical use case:Choose RLHF if:

Need maximum safety guarantees
Have resources for complex setup
Need fine-grained control
Working with large models

Choose DPO if:

Need faster iteration
Limited resources
Simpler alignment needs
Want easier implementation

Best practice: For safety-critical, use RLHF with extensive red teaming and safety testing. DPO can be good starting point, but RLHF offers more control.

SFT vs PEFT: explain when one is overkill

SFT (Supervised Fine-Tuning):

Full fine-tuning on labeled data
Updates all parameters
More expressive, can learn complex patterns
Needs more data and compute

PEFT (Parameter-Efficient Fine-Tuning):

LoRA, Adapters, Prompt Tuning
Updates small subset of parameters
Faster, cheaper, less data needed
May not reach SFT performance

When SFT is overkill:

Small dataset (<1000 examples)
Task similar to pre-training
Limited compute resources
Quick experiments
Multiple tasks (use PEFT adapters)

When PEFT is insufficient:

Large, diverse dataset
Task very different from pre-training
Need maximum performance
Have sufficient resources
Single specialized model

Decision rule:

Start with PEFT (LoRA)
If performance insufficient, try SFT
Consider hybrid: PEFT for quick iteration, SFT for final model

Best practice: PEFT is rarely overkill it’s usually the right starting point. Use SFT when PEFT doesn’t meet requirements.

Why do many open-source fine-tuned models underperform their base model in the wild?

Common reasons:

Poor data quality:
- Low-quality training data
- Misaligned with use case
- Insufficient diversity
- Noisy labels
Overfitting:
- Too many epochs
- Small validation set
- Model memorizes training data
- Poor generalization
Catastrophic forgetting:
- Loses general capabilities
- Too focused on specific task
- Forgets pre-training knowledge
Hyperparameter issues:
- Wrong learning rate
- Poor scheduler choice
- Inappropriate batch size
- No proper validation
Evaluation mismatch:
- Evaluated on different metrics
- Test set doesn’t reflect real use
- Overfitting to test set

How to avoid:

Use high-quality, diverse data
Proper train/validation/test splits
Early stopping
Monitor validation metrics
Test on real-world scenarios
Use PEFT to preserve base capabilities

Best practice: Fine-tune carefully with proper validation, and test on real-world data before deployment.

When would you argue for sticking with retrieval + prompting instead of fine-tuning?

Stick with RAG + prompting when:

Data availability:
- Limited training data
- Data changes frequently
- Hard to collect labeled data
Flexibility:
- Need to update knowledge quickly
- Multiple knowledge domains
- Dynamic content requirements
Cost:
- Can’t afford fine-tuning compute
- Low volume of requests
- Cost of fine-tuning > cost of RAG
Transparency:
- Need to cite sources
- Want to verify answers
- Regulatory requirements
Multi-domain:
- Need to handle multiple domains
- Different knowledge bases
- General-purpose system

Choose fine-tuning when:

Large, stable dataset
Task-specific behavior needed
High volume, cost-sensitive
Need consistent style/format
Domain-specific terminology

Best practice: Start with RAG + prompting. Fine-tune only if RAG doesn’t meet requirements or cost/performance justifies it.

Proprietary fine-tuning: what's the most common 'gotcha' teams miss?

Common gotchas:

Data leakage:
- Test data in training set
- Validation contamination
- Overfitting to test metrics
Distribution shift:
- Training data ≠ production data
- Different user behavior
- Changing requirements
Evaluation gaps:
- Evaluating on wrong metrics
- Not testing on real scenarios
- Ignoring edge cases
Cost underestimation:
- Fine-tuning cost
- Inference cost changes
- Maintenance overhead
Model degradation:
- Catastrophic forgetting
- Losing general capabilities
- Performance on other tasks drops
Deployment issues:
- Model size increases
- Latency changes
- Infrastructure needs

How to avoid:

Proper data splits
Test on production-like data
Monitor all metrics, not just target
Budget for full lifecycle
Test general capabilities after fine-tuning
Plan for deployment infrastructure

Best practice: Always test fine-tuned models on real-world scenarios and monitor for degradation in general capabilities.

5. RAG Systems (Retrieval Augmented Generation)

What is RAG and why is it useful?

RAG = Retrieval + Generation retrieves relevant docs before generating output, grounding the model in factual context.Benefits:

Reduces hallucination
Keeps results current
Enables domain adaptation without retraining

Pipeline: Embed → Store → Retrieve (top-k) → Construct prompt → Generate response

Explain the embedding process

Text is converted into numerical vectors using embedding models (e.g., text-embedding-3-small).Steps:

Tokenize text
Convert to fixed-size vector
Store in vector DB

Distance metrics: Cosine similarity, dot product, Euclidean distance.

How to optimize RAG performance?

Smart document chunking (500–800 tokens)
Use metadata filters (type, tags, date)
Cache top-k retrievals
Re-rank using relevance scores

How do embeddings + similarity search actually work in RAG? Where does it break?

How it works:

Embedding: Convert documents to vectors
Storage: Store vectors in vector database
Query embedding: Convert query to vector
Similarity search: Find closest document vectors
Retrieval: Return top-k most similar documents
Generation: Use retrieved docs as context for LLM

Where it breaks:

Semantic mismatch:
- Query and documents use different terminology
- Embeddings don’t capture exact match needs
- Example: Query “ML” vs document “machine learning”
Context loss:
- Chunking loses document structure
- Missing surrounding context
- Fragmented information
Retrieval quality:
- Wrong documents retrieved
- Missing relevant documents
- Too many irrelevant results
Scale issues:
- Slow retrieval at large scale
- Vector DB limitations
- Cost of embedding everything
Domain mismatch:
- Embedding model not trained on domain
- Different languages or formats
- Specialized terminology

Mitigation:

Use hybrid search (dense + sparse)
Better chunking strategies
Domain-specific embedding models
Reranking for better results
Metadata filtering

Best practice: Always validate retrieval quality on your specific use case embeddings aren’t perfect.

Vector DBs: Pinecone vs FAISS vs Weaviate. what's your decision framework?

Comparison:Pinecone:

Managed service, easy setup
Good performance, auto-scaling
Expensive at scale
Best for: Quick prototypes, managed solution

FAISS (Facebook AI Similarity Search):

Library, not a database
Very fast, open source
No persistence, needs integration
Best for: Research, in-memory search

Weaviate:

Self-hosted or managed
Feature-rich, hybrid search
More complex setup
Best for: Production with advanced needs

Decision framework:

Prototype quickly? → Pinecone
Research/experiment? → FAISS
Production with features? → Weaviate
Large scale? → Milvus or Qdrant
Budget constrained? → Self-hosted (Weaviate, Milvus)

Best practice: Start with Pinecone for speed, migrate to self-hosted (Weaviate/Milvus) as you scale.

Hybrid retrieval (sparse+dense): when does it matter?

Hybrid retrieval:

Combines dense (semantic) and sparse (keyword) search
Weighted combination of scores
Captures both semantic similarity and exact matches

When it matters:

Exact matches needed:
- Code search, version numbers
- Proper nouns, technical terms
- When precision critical
Semantic understanding needed:
- General queries, synonyms
- Conceptual search
- When recall important
Production systems:
- Need best of both worlds
- Can’t afford to miss results
- Quality is priority

When it doesn’t matter:

Pure semantic search sufficient
Pure keyword search sufficient
Simple use cases
Cost-sensitive scenarios

Implementation:

# Pseudo-code
dense_score = cosine_similarity(query_embedding, doc_embedding)
sparse_score = bm25(query, doc)
final_score = α * dense_score + (1-α) * sparse_score

Best practice: Use hybrid retrieval in production RAG systems for best results.

Why does reranking help and when does it not?

Reranking:

Second-stage ranking using more expensive model
Reorders initial retrieval results
Improves precision of top results

Why it helps:

Initial retrieval may miss subtle relevance
Reranker understands context better
Can catch semantic nuances
Improves top-k precision significantly

When it helps:

Initial retrieval has good recall but poor precision
Need high-quality top results
Can afford extra latency/cost
Complex queries requiring understanding

When it doesn’t help:

Initial retrieval already very good
Latency/cost critical
Simple queries
Reranker not better than initial retrieval

Trade-offs:

Better quality but higher latency/cost
Typically 2-5× slower than initial retrieval
Worth it for critical queries

Best practice: Use reranking for important queries where quality matters more than speed.

How do you measure RAG quality beyond 'felt useful'?

Quantitative metrics:

Retrieval metrics:
- Precision@k: Fraction of retrieved docs that are relevant
- Recall@k: Fraction of relevant docs that were retrieved
- MRR: Mean reciprocal rank of first relevant result
- NDCG: Normalized discounted cumulative gain
Generation metrics:
- Answer accuracy: Correctness of generated answer
- Faithfulness: Answer grounded in retrieved docs
- Completeness: Answer covers all aspects
- Citation accuracy: Correct source attribution
End-to-end metrics:
- Task completion rate
- User satisfaction (thumbs up/down)
- Time to correct answer
- Error rate

Evaluation framework:

Create test set with known good answers
Run RAG pipeline on test set
Measure retrieval quality
Measure generation quality
Measure end-to-end performance
Compare against baselines

Best practices:

Use multiple metrics (not just one)
Test on real-world scenarios
Monitor in production
A/B test improvements
Regular evaluation cycles

Best practice: Combine quantitative metrics with qualitative evaluation for comprehensive RAG assessment.

Describe failure modes when your retriever fetches irrelevant context

Common failure modes:

Semantic mismatch:
- Query and documents use different terms
- Embeddings don’t capture exact need
- Example: “ML” vs “machine learning”
Over-retrieval:
- Too many irrelevant documents
- Dilutes relevant context
- Model gets confused
Under-retrieval:
- Missing critical documents
- Incomplete context
- Model makes up information
Chunking issues:
- Relevant info split across chunks
- Missing context from surrounding text
- Fragmented information
Temporal mismatch:
- Outdated information retrieved
- Wrong version of document
- Stale knowledge base
Domain mismatch:
- Embedding model not suited for domain
- Different language or format
- Specialized terminology

Impact:

Hallucinations (model makes up info)
Inaccurate answers
Missing information
Poor user experience

Mitigation:

Improve chunking strategy
Use hybrid search
Rerank results
Update embedding model
Filter by metadata (date, type)
Test retrieval quality regularly

Best practice: Monitor retrieval quality and have fallback strategies for when retrieval fails.

At what scale does naive RAG architecture fall apart?

Naive RAG limitations:Small scale (<10k documents):

Works fine
Simple vector search sufficient
No major issues

Medium scale (10k-1M documents):

Starts to show issues
Retrieval quality may degrade
Need better chunking/filtering

Large scale (>1M documents):

Significant problems:
- Slow retrieval
- Poor precision (too many results)
- Cost increases
- Quality degradation

When it falls apart:

Retrieval quality:
- Too many similar documents
- Hard to find most relevant
- Precision drops significantly
Performance:
- Slow vector search
- High latency
- Cost increases
Maintenance:
- Hard to update embeddings
- Complex to manage
- Scaling challenges

Solutions:

Hierarchical retrieval (coarse → fine)
Metadata filtering
Better chunking strategies
Hybrid search
Distributed vector DBs

Best practice: Plan for scale from the start naive RAG works for prototypes but needs optimization for production scale.

how to increase accuracy, and reliability & make answers verifiable in LLM

Strategies:

RAG (Retrieval Augmented Generation):
- Ground answers in retrieved documents
- Enables citation and verification
- Reduces hallucinations
Citation and sources:
- Always cite sources
- Link to original documents
- Enable fact-checking
Confidence scores:
- Provide confidence levels
- Flag uncertain answers
- Admit when unsure
Validation:
- Cross-check with multiple sources
- Verify against known facts
- Human review for critical answers
Transparency:
- Show retrieved context
- Explain reasoning
- Make process auditable

Implementation:

Use RAG for factual queries
Always include citations
Provide confidence scores
Enable source verification
Human review for critical cases

Best practice: Combine RAG with citations and confidence scores for verifiable, reliable answers.

How does RAG work?

RAG pipeline:

Document processing:
- Chunk documents into smaller pieces
- Embed chunks into vectors
- Store in vector database
Query processing:
- Embed user query into vector
- Search for similar document chunks
- Retrieve top-k most relevant chunks
Context construction:
- Combine retrieved chunks
- Format as context for LLM
- Include in prompt
Generation:
- LLM generates answer using context
- Grounded in retrieved documents
- Can cite sources

Benefits:

Reduces hallucinations
Keeps information current
Enables domain adaptation
Provides citations

Best practice: RAG is essential for factual, verifiable LLM applications.

What are some benefits of using the RAG system?

Benefits:

Reduced hallucinations:
- Grounded in real documents
- Less likely to make up information
- More accurate answers
Current information:
- Can update knowledge base
- No need to retrain model
- Always up-to-date
Domain adaptation:
- Add domain-specific documents
- No fine-tuning needed
- Quick to adapt
Transparency:
- Can cite sources
- Verifiable answers
- Auditable process
Cost-effective:
- No model retraining
- Update knowledge easily
- Lower maintenance

Best practice: RAG is the standard approach for factual, domain-specific LLM applications.

When should I use Fine-tuning instead of RAG?

Use fine-tuning when:

Task-specific behavior:
- Need specific output format
- Consistent style required
- Domain-specific terminology
Large, stable dataset:
- Have sufficient training data
- Data doesn’t change frequently
- Can afford fine-tuning cost
Performance critical:
- Need maximum performance
- Latency sensitive
- High volume
Consistency:
- Need very consistent outputs
- Style/format critical
- Behavior must be predictable

Use RAG when:

Need current information
Multiple knowledge domains
Data changes frequently
Need citations
Quick to deploy

Best practice: Start with RAG. Fine-tune only if RAG doesn’t meet requirements.

What are the architecture patterns for customizing LLM with proprietary data?

Architecture patterns:

RAG (Retrieval Augmented Generation):
- Retrieve relevant docs, use as context
- No model changes
- Easy to update
- Best for: Knowledge bases, Q&A
Fine-tuning:
- Train model on proprietary data
- Model learns from data
- More integrated
- Best for: Task-specific behavior
Hybrid:
- Fine-tune + RAG
- Model fine-tuned for task
- RAG for knowledge
- Best for: Complex requirements
Prompt engineering:
- Customize via prompts
- No model changes
- Very flexible
- Best for: Quick customization

Decision framework:

Knowledge base? → RAG
Task behavior? → Fine-tuning
Both? → Hybrid
Quick test? → Prompt engineering

Best practice: Choose pattern based on requirements RAG for knowledge, fine-tuning for behavior.

5. MLOps & LLMOps

What is MLOps and why is it important?

MLOps applies DevOps principles to ML ensuring consistent deployment, monitoring, and governance.Key areas:

Continuous integration (CI)
Continuous training (CT)
Continuous deployment (CD)
Model registry/versioning
Drift detection and rollback

Why important:

Faster deployment cycles
Better model quality
Reduced risk
Reproducibility
Scalability

Best practice: MLOps is essential for production ML systems enables reliable, scalable deployments.

How to monitor AI models in production?

Metrics: Latency, accuracy, token usage, cost, drift
Tools: Prometheus, Grafana, OpenTelemetry
Alerts: P95 latency, error spikes, accuracy dropsBest practices:

Log prompts safely
Anonymize data
Track feedback loops

Key metrics:

Latency: P50, P95, P99 response times
Accuracy: Task-specific metrics
Token usage: Input/output tokens
Cost: Per request, per day
Drift: Data and model drift

Best practice: Monitor multiple metrics latency, accuracy, cost, drift for comprehensive monitoring.

Explain AI safety and ethical considerations

Avoid harmful or biased outputs
Enforce strict usage policies
Apply constitutional AI + red teaming
Audit model data and behavior

Examples:

Refuse harmful requests
Document dataset sources
Test for bias before release

Best practice: Safety and ethics are critical implement guardrails, test for bias, and monitor for harmful outputs.

Sketch a pipeline: from raw data → model → serving → feedback

MLOps pipeline:

Data collection:
- Raw data ingestion
- Data validation
- Data storage
Data processing:
- Cleaning, transformation
- Feature engineering
- Data versioning
Model training:
- Training pipeline
- Hyperparameter tuning
- Model evaluation
Model registry:
- Version models
- Store metadata
- Track performance
Model deployment:
- Model serving
- A/B testing
- Gradual rollout
Monitoring:
- Performance metrics
- Drift detection
- Error tracking
Feedback loop:
- Collect user feedback
- Log predictions
- Retrain with new data

Best practice: Build end-to-end pipeline data → model → serving → feedback for continuous improvement.

How would you monitor performance drift or hallucinations?

Drift monitoring:

Data drift:
- Monitor input distribution
- Statistical tests (KS test, chi-square)
- Alert on significant changes
Model drift:
- Monitor prediction distribution
- Compare with baseline
- Alert on changes
Performance drift:
- Monitor accuracy metrics
- Compare with baseline
- Alert on degradation

Hallucination monitoring:

Output validation:
- Check for factual claims
- Verify against sources
- Flag suspicious outputs
Confidence scores:
- Monitor confidence levels
- Flag low-confidence outputs
- Review manually
User feedback:
- Collect thumbs up/down
- Track user reports
- Identify patterns

Best practice: Monitor drift and hallucinations continuously set up alerts and review regularly.

How do you log prompts and outputs for debugging and auditing?

Logging strategy:

What to log:
- Prompts (system + user)
- Model responses
- Metadata (timestamp, user, model version)
- Performance metrics
Privacy:
- Anonymize PII
- Hash sensitive data
- Comply with regulations
Storage:
- Centralized logging (ELK, Splunk)
- Searchable, filterable
- Retention policies
Access control:
- Role-based access
- Audit logs
- Secure storage

Best practices:

Log everything for debugging
Anonymize for privacy
Enable search and filtering
Set retention policies

Best practice: Log prompts and outputs securely essential for debugging and auditing.

CI/CD for LLM workflows - what's different from ML?

LLM-specific challenges:

Prompt versioning:
- Version prompts like code
- A/B test prompts
- Rollback prompts
Model updates:
- Base model updates
- Fine-tuned model versions
- Embedding model updates
Context management:
- Version context documents
- Update knowledge bases
- Backfill queries
Evaluation:
- LLM-specific metrics
- Human evaluation
- A/B testing
Deployment:
- Model serving (vLLM, etc.)
- Prompt caching
- Streaming responses

Differences from traditional ML:

Prompts: Version and test prompts
Context: Manage dynamic context
Evaluation: LLM-specific metrics
Deployment: Streaming, caching

Best practice: Adapt CI/CD for LLMs version prompts, manage context, use LLM-specific evaluation.

What's your playbook for deploying an LLM API (FastAPI, Docker, K8s)?

Deployment playbook:

API development:
- FastAPI for Python API
- Define endpoints
- Error handling
Containerization:
- Docker for containerization
- Multi-stage builds
- Optimize image size
Orchestration:
- Kubernetes for orchestration
- Deployments, services
- Auto-scaling
Model serving:
- vLLM, TorchServe for serving
- GPU allocation
- Batching
Monitoring:
- Prometheus metrics
- Grafana dashboards
- Alerts
CI/CD:
- GitHub Actions, GitLab CI
- Automated testing
- Deployment pipelines

Best practices:

Use FastAPI for APIs
Containerize with Docker
Orchestrate with Kubernetes
Monitor with Prometheus/Grafana

Best practice: Use FastAPI + Docker + Kubernetes for production LLM APIs standard, scalable stack.

Drift detection: how do you monitor it with LLMs?

LLM drift detection:

Input drift:
- Monitor prompt patterns
- Track user query types
- Alert on changes
Output drift:
- Monitor response patterns
- Track response length
- Alert on changes
Performance drift:
- Monitor accuracy metrics
- Track user satisfaction
- Alert on degradation
Model drift:
- Compare model versions
- Track behavior changes
- A/B test

Methods:

Statistical tests (KS test, chi-square)
Distribution comparison
Threshold-based alerts

Best practice: Monitor drift continuously set up automated alerts and review regularly.

Evaluation pipelines: offline vs online trade-offs?

Offline evaluation:

Test on held-out dataset
Fast, cheap
No user impact
May not reflect real usage

Online evaluation:

Test with real users
A/B testing
Reflects real usage
Slower, more expensive

Trade-offs:

Offline: Fast, cheap, but may not reflect reality
Online: Realistic, but slower and more expensive

Best practice: Use offline for initial evaluation, online (A/B testing) for final validation.

What observability metrics matter most in LLMOps?

Key metrics:

Latency:
- P50, P95, P99 response times
- Time to first token
- End-to-end latency
Accuracy:
- Task-specific metrics
- User satisfaction
- Error rates
Cost:
- Token usage (input + output)
- Cost per request
- Daily/monthly costs
Quality:
- Hallucination rate
- Citation accuracy
- User feedback
System:
- Throughput (requests/sec)
- Error rate
- Availability

Best practice: Monitor latency, accuracy, cost, and quality essential metrics for LLMOps.

Rollbacks: what's different here compared to traditional ML?

LLM rollback differences:

Prompt rollbacks:
- Rollback prompts quickly
- No model retraining needed
- Version control
Model rollbacks:
- Rollback model versions
- May need infrastructure changes
- More complex
Context rollbacks:
- Rollback context documents
- May need re-embedding
- Backfill queries
Fast rollbacks:
- Prompts: Very fast
- Models: Slower
- Context: Medium

Best practice: Version everything prompts, models, context for quick rollbacks.

How do you scale inference infra without bleeding cost?

Cost optimization:

Model optimization:
- Quantization (INT8, INT4)
- Model distillation
- Smaller models
Batching:
- Dynamic batching
- Continuous batching (vLLM)
- Higher throughput
Caching:
- Prompt caching
- Response caching
- Reduce redundant calls
Smart routing:
- Route simple queries to smaller models
- Route complex to larger models
- Cost-aware routing
Infrastructure:
- Spot instances
- Auto-scaling
- Right-sizing

Best practice: Optimize models, use batching and caching, smart routing for cost-effective scaling.

CI/CD for prompts + fine-tuned checkpoints how would you design it?

CI/CD design:

Version control:
- Git for prompts
- Model registry for checkpoints
- Track versions
Testing:
- Test prompts on sample queries
- Test models on validation set
- Automated tests
Deployment:
- Feature flags for prompts
- Gradual rollout for models
- A/B testing
Monitoring:
- Monitor performance
- Track metrics
- Alert on issues
Rollback:
- Quick rollback for prompts
- Model rollback capability
- Version management

Best practice: Design CI/CD for both prompts and models version, test, deploy, monitor, rollback.

6. Document Digitization & Chunking

What is chunking, and why do we chunk our data?

Chunking:

Breaking documents into smaller pieces
Makes documents fit in context window
Enables better retrieval

Why chunk:

Context limits: Models have max context (e.g., 128k tokens)
Better retrieval: Smaller chunks = more precise retrieval
Cost: Smaller chunks = lower embedding costs
Performance: Faster processing of smaller pieces

Best practices:

Chunk size: 500-800 tokens (balance context vs precision)
Overlap: 50-100 tokens between chunks (preserve context)
Semantic boundaries: Split at sentence/paragraph boundaries

Best practice: Chunk size depends on use case smaller for precise retrieval, larger for more context.

What factors influence chunk size?

Factors:

Model context window:
- Max tokens model can handle
- Need space for query + retrieved chunks
- Example: 8k context → chunks of 500-800 tokens
Retrieval precision:
- Smaller chunks = more precise retrieval
- Larger chunks = more context per chunk
- Balance precision vs context
Document structure:
- Paragraphs, sections, chapters
- Natural boundaries matter
- Preserve semantic units
Use case:
- Q&A: Smaller chunks for precise answers
- Summarization: Larger chunks for context
- Analysis: Medium chunks for balance
Embedding model:
- Max tokens per embedding
- Some models handle longer texts better
- Consider model limitations

Best practice: Start with 500-800 token chunks, adjust based on retrieval quality and use case.

What are the different types of chunking methods?

Chunking methods:

Fixed-size chunking:
- Split by character/token count
- Simple, fast
- May break sentences/paragraphs
Sentence-based chunking:
- Split at sentence boundaries
- Preserves sentence structure
- Better semantic units
Paragraph-based chunking:
- Split at paragraph boundaries
- Preserves paragraph context
- Good for structured documents
Recursive chunking:
- Try different strategies hierarchically
- Start with paragraphs, fall back to sentences
- Best of multiple approaches
Semantic chunking:
- Split based on semantic similarity
- Uses embeddings to find boundaries
- Most sophisticated, preserves meaning
Sliding window:
- Overlapping chunks
- Preserves context across boundaries
- More chunks but better coverage

Best practice: Use recursive or semantic chunking for best results, with overlap to preserve context.

How to find the ideal chunk size?

Finding ideal chunk size:

Start with baseline:
- Common: 500-800 tokens
- Test with your documents
- Measure retrieval quality
Test different sizes:
- Small (200-400): More precise, less context
- Medium (500-800): Balanced
- Large (1000-1500): More context, less precise
Evaluate:
- Precision@k: Are retrieved chunks relevant?
- Recall@k: Do we find all relevant chunks?
- End-to-end: Does RAG quality improve?
Consider factors:
- Document type (technical vs narrative)
- Query type (specific vs general)
- Model context window
- Use case requirements
Iterate:
- Start with medium size
- Adjust based on results
- Test on real queries

Best practice: Test multiple chunk sizes on your specific documents and queries ideal size varies by use case.

What is the best method to digitize and chunk complex documents like annual reports?

For complex documents (annual reports):

Preprocessing:
- Extract text from PDF
- Preserve structure (tables, sections)
- Clean formatting
Structure-aware chunking:
- Identify sections (executive summary, financials, etc.)
- Chunk within sections
- Preserve section context
Hierarchical chunking:
- Document → Sections → Subsections → Paragraphs
- Store hierarchy in metadata
- Enable section-level retrieval
Special handling:
- Tables: Extract as structured data, chunk separately
- Charts: Extract captions, link to images
- Footnotes: Include with relevant sections
Metadata:
- Section name, page number, date
- Document type, year
- Enable filtering by metadata

Implementation:

Use document parsers (PyPDF2, pdfplumber)
Structure detection (section headers)
Table extraction (tabula, camelot)
Semantic chunking within sections

Best practice: For complex documents, use structure-aware chunking with rich metadata for better retrieval.

How to handle tables during chunking?

Table handling strategies:

Extract as structured data:
- Convert to CSV/JSON
- Store separately from text
- Embed table descriptions
Text representation:
- Convert table to markdown/text
- Include in chunks
- Preserve structure
Hybrid approach:
- Store structured data separately
- Include table summary in chunks
- Link table data to text chunks
Metadata:
- Table type, headers, row count
- Enable table-specific queries
- Filter by table metadata

Best practices:

Extract tables with specialized tools (tabula, camelot)
Include table context (surrounding text)
Store both structured and text representations
Use metadata for table-specific retrieval

Best practice: Extract tables separately, include summaries in text chunks, and store full tables as structured data.

How do you handle very large table for better retrieval?

For very large tables:

Split by rows:
- Chunk table into row groups
- Preserve header row in each chunk
- Maintain table structure
Column-based chunking:
- Split by columns for column-specific queries
- Include row identifiers
- Preserve relationships
Summary chunks:
- Create summary of table
- Include statistics, key insights
- Use for high-level queries
Metadata:
- Table name, dimensions, date
- Column names, data types
- Enable filtering
Structured storage:
- Store full table in database
- Embed summaries and descriptions
- Link chunks to full table

Best practice: For large tables, create summary chunks for retrieval, store full table separately, and link them.

How to handle list item during chunking?

List handling:

Preserve list structure:
- Keep list items together
- Don’t split mid-list
- Maintain list context
List as single chunk:
- Small lists: Keep as one chunk
- Preserves relationships
- Better semantic unit
Split long lists:
- Large lists: Split into groups
- Include list title/context
- Maintain item relationships
Metadata:
- List type (ordered, unordered)
- List title, item count
- Enable list-specific queries

Best practice: Keep lists together when possible, split only if necessary, and preserve list context.

How do you build production grade document processing and indexing pipeline?

Production pipeline:

Document ingestion:
- Support multiple formats (PDF, DOCX, HTML)
- Handle errors gracefully
- Validate document quality
Preprocessing:
- Extract text, preserve structure
- Clean formatting
- Handle special elements (tables, images)
Chunking:
- Structure-aware chunking
- Preserve context
- Generate metadata
Embedding:
- Batch processing
- Error handling
- Retry logic
Indexing:
- Store in vector DB
- Store metadata
- Enable filtering
Monitoring:
- Track processing time
- Monitor errors
- Quality metrics
Versioning:
- Version documents
- Track changes
- Enable rollback

Best practices:

Use async processing for scale
Implement retry logic
Monitor pipeline health
Version everything
Test on production-like data

Best practice: Build robust pipeline with error handling, monitoring, and versioning for production use.

How to handle graphs & charts in RAG

Graphs and charts handling:

Extract text:
- Chart titles, labels, captions
- Axis labels, legends
- Include in text chunks
Image embeddings:
- Use vision models for image embeddings
- Store image embeddings separately
- Link to text chunks
Metadata:
- Chart type, data source
- Date, context
- Enable filtering
Hybrid approach:
- Text description in chunks
- Image embeddings for visual search
- Link images to text
Structured data:
- Extract underlying data if available
- Store as structured data
- Link to chart images

Best practice: Extract text from charts, use image embeddings for visual search, and link charts to relevant text chunks.

7. Embedding Models

What are vector embeddings, and what is an embedding model?

Vector embeddings:

Numerical representations of text
Dense vectors (arrays of numbers)
Capture semantic meaning
Similar texts have similar vectors

Embedding model:

Neural network that generates embeddings
Trained on large text corpora
Maps text to fixed-size vectors
Examples: text-embedding-3-small, sentence-transformers

How it works:

Input: Text (sentence, paragraph, document)
Output: Vector (e.g., 384, 768, 1536 dimensions)
Similar texts → similar vectors

Use cases:

Semantic search
RAG retrieval
Clustering
Classification

Best practice: Choose embedding model based on your domain and use case different models work better for different tasks.

How is an embedding model used in the context of LLM applications?

In LLM applications:

RAG (Retrieval Augmented Generation):
- Embed documents for retrieval
- Embed queries for search
- Find similar documents
- Use as context for LLM
Semantic search:
- Find similar documents
- Understand user intent
- Improve search quality
Context selection:
- Select relevant context from large corpus
- Filter documents
- Rank by relevance
Hybrid search:
- Combine with keyword search
- Best of both approaches
- Improved retrieval

Pipeline:

Documents → Embeddings → Vector DB
Query → Embedding → Search → Retrieve → LLM

Best practice: Embeddings are essential for RAG choose model that matches your domain and use case.

What is the difference between embedding short and long content?

Short content (sentences, phrases):

Better semantic capture
More precise embeddings
Faster processing
Less context loss

Long content (paragraphs, documents):

More context preserved
Better for document-level search
Slower processing
May lose fine-grained details

Trade-offs:

Short: Better precision, less context
Long: More context, less precision

Best practices:

Short: For precise retrieval, Q&A
Long: For document-level search, summarization
Hybrid: Embed both short and long versions

Best practice: Embed at chunk level (500-800 tokens) for RAG balance context and precision.

How to benchmark embedding models on your data?

Benchmarking process:

Create test set:
- Queries with known relevant documents
- Label relevance (relevant/irrelevant)
- Cover different query types
Embed documents:
- Use different embedding models
- Store in vector DB
- Track model versions
Run retrieval:
- Query each model
- Retrieve top-k results
- Measure retrieval quality
Evaluate:
- Precision@k: Fraction of relevant results
- Recall@k: Fraction of relevant docs found
- MRR: Mean reciprocal rank
- NDCG: Normalized discounted cumulative gain
Compare:
- Compare models on same test set
- Consider latency, cost
- Choose best model

Best practices:

Test on domain-specific data
Use multiple metrics
Consider latency and cost
Test on real queries

Best practice: Benchmark on your specific data generic benchmarks may not reflect your use case.

Suppose you are working with an open AI embedding model, after benchmarking accuracy is coming low, how would you further improve the accuracy of embedding the search model?

Improvement strategies:

Try different models:
- text-embedding-3-small vs text-embedding-3-large
- Different dimensions
- Domain-specific models
Fine-tune embedding model:
- Train on your domain data
- Better domain understanding
- Improved accuracy
Improve chunking:
- Better chunk size
- Semantic chunking
- Preserve context
Hybrid search:
- Add keyword search (BM25)
- Combine dense + sparse
- Better coverage
Reranking:
- Second-stage ranking
- More expensive but better
- Improves precision
Query expansion:
- Expand queries with synonyms
- Better query understanding
- Improved retrieval
Metadata filtering:
- Filter by document type, date
- Narrow search space
- Better precision

Best practice: Start with hybrid search and reranking often easier than fine-tuning and gives good results.

Walk me through steps of improving sentence transformer model used for embedding?

Improvement steps:

Baseline evaluation:
- Test current model
- Measure retrieval quality
- Identify issues
Data preparation:
- Collect domain-specific data
- Create training pairs (query, relevant doc)
- Label relevance
Fine-tuning:
- Use sentence-transformers library
- Train on domain data
- Monitor validation metrics
Evaluation:
- Test on held-out set
- Compare with baseline
- Measure improvement
Iteration:
- Adjust hyperparameters
- Add more training data
- Improve data quality
Deployment:
- Deploy new model
- A/B test against old model
- Monitor performance

Best practices:

Start with small dataset
Use contrastive learning
Monitor overfitting
Test on real queries

Best practice: Fine-tune on domain-specific data with proper evaluation often improves accuracy significantly.

8. Internal Working of Vector Databases

What is a vector database?

Vector database:

Specialized database for vector embeddings
Optimized for similarity search
Stores high-dimensional vectors
Fast nearest neighbor search

Key features:

Vector storage and indexing
Similarity search (cosine, dot product, Euclidean)
Metadata filtering
Scalability

Examples:

Pinecone, Milvus, Weaviate, Qdrant, Chroma

Use cases:

RAG systems
Semantic search
Recommendation systems
Similarity matching

Best practice: Use vector DB for production RAG systems much faster than naive similarity search.

How does a vector database differ from traditional databases?

Differences:Traditional databases (SQL, NoSQL):

Exact match queries
Structured data
Indexes for exact lookups
Not optimized for similarity

Vector databases:

Similarity search
High-dimensional vectors
Approximate nearest neighbor (ANN) algorithms
Optimized for vector operations

Key differences:

Query type: Exact match vs similarity
Data structure: Tables vs vectors
Indexing: B-tree vs ANN indexes
Use case: Structured data vs embeddings

When to use each:

Traditional: Structured data, exact queries
Vector: Embeddings, similarity search

Best practice: Use vector DB for embeddings, traditional DB for metadata and structured data.

How does a vector database work?

How it works:

Storage:
- Store vectors with metadata
- Index vectors for fast search
- Maintain data structures
Indexing:
- Build ANN indexes (HNSW, IVF, etc.)
- Enable fast approximate search
- Balance accuracy vs speed
Query:
- Embed query into vector
- Search for similar vectors
- Return top-k results
Similarity calculation:
- Cosine similarity, dot product, Euclidean
- Fast computation
- Optimized algorithms

Internal mechanisms:

HNSW (Hierarchical Navigable Small World): Graph-based index
IVF (Inverted File Index): Clustering-based
LSH (Locality-Sensitive Hashing): Hash-based

Best practice: Vector DBs use sophisticated indexing algorithms for fast similarity search at scale.

Explain difference between vector index, vector DB & vector plugins?

Vector index:

Data structure for fast similarity search
Examples: HNSW, IVF, LSH
Can be used standalone (FAISS)
No persistence, needs integration

Vector database:

Full database system with vector support
Persistence, querying, management
Examples: Pinecone, Milvus, Weaviate
Production-ready solution

Vector plugins:

Add vector capabilities to existing DBs
Examples: pgvector (PostgreSQL), vector search in Elasticsearch
Extends traditional databases
Hybrid approach

Comparison:

Index: Fast, no persistence, needs integration
DB: Full solution, persistence, production-ready
Plugin: Extends existing DB, hybrid approach

Best practice: Use vector DB for production, vector index for research, plugins for hybrid needs.

You are working on a project that involves a small dataset of customer reviews. Your task is to find similar reviews in the dataset. The priority is to achieve perfect accuracy in finding the most similar reviews, and the speed of the search is not a primary concern. Which search strategy would you choose and why?

For perfect accuracy, speed not concern:Choose: Exact nearest neighbor search (brute force)Why:

Perfect accuracy: Checks all vectors, finds true nearest neighbors
No approximation: No accuracy loss from indexing
Small dataset: Brute force is feasible for small datasets
Simple: No index tuning needed

How it works:

Compare query vector with all vectors
Calculate similarity for each
Return top-k most similar

Trade-offs:

Accuracy: Perfect (100%)
Speed: Slow (O(n) where n = dataset size)
Scalability: Doesn’t scale to large datasets

When to use:

Small datasets (<10k vectors)
Accuracy critical
Speed not concern
Simple implementation

Best practice: For small datasets where accuracy is critical, brute force is the right choice simple and perfect accuracy.

Explain vector search strategies like clustering and Locality-Sensitive Hashing

Clustering-based (IVF - Inverted File Index):

Cluster vectors into groups
Search only in relevant clusters
Reduces search space
Faster but approximate

Locality-Sensitive Hashing (LSH):

Hash similar vectors to same buckets
Search only in relevant buckets
Fast approximate search
Probabilistic guarantees

Comparison:

Clustering: Better accuracy, needs training
LSH: Faster, probabilistic

Best practice: Use clustering (IVF) for better accuracy, LSH for maximum speed.

How does clustering reduce search space? When does it fail and how can we mitigate these failures?

How clustering works:

Group similar vectors into clusters
For query, find relevant clusters
Search only in those clusters
Reduces search space significantly

When it fails:

Query near cluster boundary:
- May miss vectors in adjacent clusters
- Solution: Search multiple clusters
Poor clustering:
- Clusters don’t match query distribution
- Solution: Better clustering algorithm, more clusters
High-dimensional data:
- Clustering less effective
- Solution: Dimensionality reduction, better algorithms

Mitigation:

Search multiple clusters
Improve clustering quality
Use hierarchical clustering
Combine with other strategies

Best practice: Use clustering with multi-cluster search for better accuracy while maintaining speed.

Explain Random projection index?

Random projection:

Projects high-dimensional vectors to lower dimensions
Preserves distances approximately (Johnson-Lindenstrauss lemma)
Faster search in lower dimensions
Approximate but fast

How it works:

Multiply vectors by random matrix
Reduce dimensions (e.g., 1536 → 128)
Search in lower-dimensional space
Faster but approximate

Trade-offs:

Speed: Much faster (lower dimensions)
Accuracy: Approximate (some loss)
Memory: Less memory needed

Best practice: Use random projection for very large datasets where speed is critical and some accuracy loss is acceptable.

Explain Locality-sensitive hashing (LHS) indexing method?

LSH (Locality-Sensitive Hashing):

Hash similar vectors to same buckets
Search only in relevant buckets
Fast approximate nearest neighbor search
Probabilistic guarantees

How it works:

Create hash functions that map similar vectors to same hash
Hash query vector
Search in matching buckets
Return top-k results

Key properties:

Similar vectors → same hash (high probability)
Different vectors → different hash (high probability)
Fast lookup (hash-based)

Trade-offs:

Speed: Very fast (hash lookup)
Accuracy: Approximate (probabilistic)
Memory: Hash tables needed

Best practice: Use LSH for very large datasets where speed is critical and approximate results are acceptable.

Explain product quantization (PQ) indexing method?

Product Quantization (PQ):

Compresses vectors using quantization
Reduces memory usage
Enables fast approximate search
Trade-off: accuracy vs memory

How it works:

Split vector into subvectors
Quantize each subvector (reduce precision)
Store quantized codes
Fast distance computation using lookup tables

Benefits:

Memory: Much less memory (compressed)
Speed: Fast distance computation
Scalability: Can handle very large datasets

Trade-offs:

Accuracy: Some loss from quantization
Complexity: More complex implementation

Best practice: Use PQ for very large datasets where memory is a constraint and some accuracy loss is acceptable.

Compare different Vector index and given a scenario, which vector index you would use for a project?

Comparison:HNSW (Hierarchical Navigable Small World):

Graph-based index
High accuracy, good speed
Best for: General-purpose, production

IVF (Inverted File Index):

Clustering-based
Good accuracy, fast
Best for: Large datasets, known distribution

LSH (Locality-Sensitive Hashing):

Hash-based
Fast, approximate
Best for: Very large datasets, speed critical

PQ (Product Quantization):

Compression-based
Memory efficient
Best for: Memory-constrained, large datasets

Decision framework:

General production: HNSW
Large scale: IVF or HNSW
Memory constrained: PQ
Speed critical: LSH
Accuracy critical: HNSW or exact search

Best practice: Start with HNSW for general use, consider others based on specific constraints.

How would you decide ideal search similarity metrics for the use case?

Similarity metrics:Cosine similarity:

Measures angle between vectors
Magnitude-independent
Best for: Semantic similarity, general use

Dot product:

Measures magnitude and direction
Magnitude-dependent
Best for: When magnitude matters

Euclidean distance:

Measures absolute distance
Magnitude-dependent
Best for: When absolute distance matters

Decision factors:

Vector normalization: Normalized → cosine, not normalized → dot product
Magnitude importance: Matters → dot product/Euclidean, doesn’t → cosine
Use case: Semantic search → cosine, recommendation → dot product

Best practice: Use cosine similarity for semantic search (most common), dot product for recommendation systems.

Explain different types and challenges associated with filtering in vector DB?

Filtering types:

Metadata filtering:
- Filter by document type, date, tags
- Pre-filter before vector search
- Reduces search space
Post-filtering:
- Filter after vector search
- May reduce results below k
- Simpler but less efficient
Pre-filtering:
- Filter before vector search
- More efficient
- May miss relevant results

Challenges:

Performance: Filtering can slow down search
Result quality: Pre-filtering may miss results
Complexity: Combining filters is complex
Indexing: Need indexes for fast filtering

Best practices:

Use metadata indexes
Combine pre and post-filtering
Test filtering impact on quality
Optimize filter queries

Best practice: Use metadata filtering to narrow search space, but test impact on result quality.

How to decide the best vector database for your needs?

Decision factors:

Scale:
- Number of vectors
- Query volume
- Growth rate
Features:
- Filtering, hybrid search
- Metadata support
- Advanced features
Deployment:
- Managed vs self-hosted
- Infrastructure requirements
- Maintenance burden
Cost:
- Pricing model
- Infrastructure costs
- Total cost of ownership
Performance:
- Latency requirements
- Throughput needs
- Accuracy requirements

Decision framework:

Prototype: Pinecone or Chroma
Production <10M: Pinecone or Weaviate
Production >10M: Milvus or Qdrant
Budget constrained: Self-hosted (Chroma, Milvus)
Need features: Weaviate

Best practice: Start with managed (Pinecone) for speed, migrate to self-hosted (Milvus) as you scale.

9. Advanced Search Algorithms

What are architecture patterns for information retrieval & semantic search?

Architecture patterns:

Dense retrieval (semantic search):
- Embeddings for semantic similarity
- Vector database for storage
- Cosine similarity for ranking
- Best for: Semantic understanding
Sparse retrieval (keyword search):
- TF-IDF, BM25 for keyword matching
- Inverted index for storage
- Keyword-based ranking
- Best for: Exact matches
Hybrid retrieval:
- Combine dense + sparse
- Weighted combination of scores
- Best of both approaches
- Best for: Production systems
Reranking:
- Two-stage retrieval
- Initial retrieval (fast)
- Reranking (expensive but better)
- Best for: Quality-critical queries
Multi-stage retrieval:
- Coarse → Fine retrieval
- Hierarchical search
- Progressive refinement
- Best for: Large-scale systems

Best practice: Use hybrid retrieval with reranking for production RAG systems.

Why it's important to have very good search

Why good search matters:

RAG quality:
- Good retrieval = good RAG
- Bad retrieval = hallucinations
- Foundation of RAG system
User experience:
- Fast, accurate results
- Relevant information
- User satisfaction
System performance:
- Reduces LLM calls
- Lower latency
- Lower cost
Accuracy:
- Correct information retrieved
- Reduces hallucinations
- Better answers

Best practice: Invest in good search it’s the foundation of RAG systems.

How can you achieve efficient and accurate search results in large-scale datasets?

Strategies:

Hybrid search:
- Combine dense + sparse
- Better coverage
- Improved accuracy
Hierarchical retrieval:
- Coarse → Fine search
- Reduce search space
- Faster retrieval
Metadata filtering:
- Filter by type, date, tags
- Narrow search space
- Better precision
Reranking:
- Second-stage ranking
- Improves precision
- Better top-k results
Indexing:
- Efficient indexes (HNSW, IVF)
- Fast approximate search
- Scalable
Caching:
- Cache frequent queries
- Reduce computation
- Lower latency

Best practice: Combine multiple strategies hybrid search, filtering, reranking for best results.

Consider a scenario where a client has already built a RAG-based system that is not giving accurate results, upon investigation you find out that the retrieval system is not accurate, what steps you will take to improve it?

Improvement steps:

Diagnose issues:
- Measure retrieval quality (precision@k, recall@k)
- Identify failure modes
- Analyze query types
Improve chunking:
- Better chunk size
- Semantic chunking
- Preserve context
Improve embeddings:
- Try different embedding models
- Fine-tune on domain data
- Domain-specific models
Add hybrid search:
- Combine dense + sparse
- Better coverage
- Improved accuracy
Add reranking:
- Second-stage ranking
- Improves precision
- Better top-k results
Metadata filtering:
- Filter by type, date
- Narrow search space
- Better precision
Query expansion:
- Expand queries with synonyms
- Better query understanding
- Improved retrieval
Evaluate:
- Test on real queries
- Measure improvement
- Iterate

Best practice: Start with hybrid search and reranking often gives biggest improvement with least effort.

Explain the keyword-based retrieval method

Keyword-based retrieval:

TF-IDF (Term Frequency-Inverse Document Frequency):
- Weights terms by frequency and rarity
- Common terms get lower weight
- Rare terms get higher weight
- Classic information retrieval
BM25 (Best Matching 25):
- Improved version of TF-IDF
- Better term saturation
- Handles document length better
- Industry standard
Inverted index:
- Maps terms to documents
- Fast lookup
- Efficient storage
- Foundation of keyword search

How it works:

Extract keywords from query
Look up in inverted index
Score documents by term frequency
Rank by relevance score

Pros:

Fast, exact matches
Interpretable
No model needed

Cons:

Misses synonyms
No semantic understanding
Limited to exact matches

Best practice: Use keyword search for exact matches, combine with semantic search for best results.

How to fine-tune re-ranking models?

Fine-tuning reranking models:

Data preparation:
- Query-document pairs
- Relevance labels (relevant/irrelevant)
- Multiple relevance levels (highly relevant, somewhat relevant, etc.)
Model selection:
- Cross-encoder models (BERT, RoBERTa)
- Better than bi-encoders for reranking
- Understands query-document interaction
Training:
- Use sentence-transformers library
- Contrastive learning
- Train on domain data
- Monitor validation metrics
Evaluation:
- Test on held-out set
- Measure precision@k, MRR, NDCG
- Compare with baseline
Deployment:
- Deploy as second-stage reranker
- Use after initial retrieval
- Monitor performance

Best practices:

Use domain-specific data
Multiple relevance levels
Monitor overfitting
Test on real queries

Best practice: Fine-tune reranking models on domain-specific data for best results.

Explain most common metric used in information retrieval and when it fails

Common metrics:

Precision@k:
- Fraction of retrieved items that are relevant
- Measures accuracy of top-k results
- Fails when: Need to measure coverage (recall)
Recall@k:
- Fraction of relevant items that were retrieved
- Measures coverage
- Fails when: Need to measure accuracy (precision)
MRR (Mean Reciprocal Rank):
- Average of 1/rank of first relevant result
- Emphasizes top results
- Fails when: Need to measure overall quality (NDCG)
NDCG (Normalized Discounted Cumulative Gain):
- Considers ranking quality, discounts lower positions
- Best for graded relevance
- Fails when: Need simple binary relevance

When metrics fail:

Precision: When recall is important
Recall: When precision is important
MRR: When need overall ranking quality
NDCG: When need simple binary relevance

Best practice: Use multiple metrics precision + recall, or MRR + NDCG for comprehensive evaluation.

If you were to create an algorithm for a Quora-like question-answering system, with the objective of ensuring users find the most pertinent answers as quickly as possible, which evaluation metric would you choose to assess the effectiveness of your system?

For Quora-like Q&A system:Choose: MRR (Mean Reciprocal Rank)Why:

User experience: Users want first relevant answer quickly
MRR emphasizes top results: Measures rank of first relevant answer
Fast answers: Lower rank = faster to find answer
User satisfaction: Users typically read top results

Alternative metrics:

Precision@k: Measures accuracy but not position
Recall@k: Measures coverage but not speed
NDCG: Good but more complex, MRR simpler

MRR calculation:

For each query, find rank of first relevant answer
Calculate 1/rank
Average across queries
Higher MRR = better (answers found faster)

Best practice: Use MRR for Q&A systems where users want first relevant answer quickly.

I have a recommendation system, which metric should I use to evaluate the system?

For recommendation systems:Choose: NDCG (Normalized Discounted Cumulative Gain)Why:

Graded relevance: Recommendations have degrees (highly relevant, somewhat relevant)
Position matters: Top recommendations more important
Ranking quality: Measures how well system ranks recommendations
Industry standard: Widely used for recommendation systems

Alternative metrics:

Precision@k: Good but doesn’t consider position
Recall@k: Good but doesn’t consider position
MRR: Good but assumes binary relevance

NDCG benefits:

Considers relevance grades
Discounts lower positions
Normalized (comparable across queries)
Industry standard

Best practice: Use NDCG for recommendation systems where ranking quality and graded relevance matter.

Compare different information retrieval metrics and which one to use when

Comparison:Precision@k:

Measures: Accuracy of top-k results
Use when: Accuracy is priority
Example: Search engine results

Recall@k:

Measures: Coverage of relevant items
Use when: Coverage is priority
Example: Document retrieval

MRR (Mean Reciprocal Rank):

Measures: Rank of first relevant result
Use when: First relevant result matters
Example: Q&A systems

NDCG (Normalized Discounted Cumulative Gain):

Measures: Ranking quality with graded relevance
Use when: Ranking and relevance grades matter
Example: Recommendation systems

F1@k:

Measures: Harmonic mean of precision and recall
Use when: Need balance of both
Example: Balanced evaluation

Decision framework:

Accuracy priority: Precision@k
Coverage priority: Recall@k
First result matters: MRR
Ranking quality: NDCG
Balance: F1@k

Best practice: Use multiple metrics for comprehensive evaluation precision + recall, or MRR + NDCG.

How does hybrid search works?

Hybrid search:

Dense search (semantic):
- Embed query and documents
- Calculate cosine similarity
- Rank by semantic similarity
Sparse search (keyword):
- Extract keywords from query
- Use BM25/TF-IDF
- Rank by keyword matching
Score combination:
- Normalize scores (0-1)
- Weighted combination: final_score = α × dense_score + (1-α) × sparse_score
- Typical α = 0.7 (70% dense, 30% sparse)
Reranking:
- Optional: Rerank combined results
- Use cross-encoder
- Improve precision

Benefits:

Captures semantic similarity (dense)
Captures exact matches (sparse)
Better coverage
Improved accuracy

Best practice: Use hybrid search in production RAG systems best of both approaches.

If you have search results from multiple methods, how would you merge and homogenize the rankings into a single result set?

Merging strategies:

Score normalization:
- Normalize scores to same range (0-1)
- Use min-max or z-score normalization
- Enables fair combination
Weighted combination:
- final_score = α × method1_score + (1-α) × method2_score
- Adjust α based on method performance
- Typical: 0.7 dense + 0.3 sparse
Reciprocal rank fusion (RRF):
- Combine ranks, not scores
- RRF_score = Σ(1 / (k + rank))
- Works with different score ranges
- Popular in information retrieval
Learning to rank:
- Train model to combine scores
- Learns optimal combination
- More complex but better
Reranking:
- Merge initial results
- Rerank with cross-encoder
- Improves final ranking

Best practice: Use reciprocal rank fusion (RRF) for merging works well with different score ranges.

How to handle multi-hop/multifaceted queries?

Multi-hop queries:

Iterative retrieval:
- First hop: Retrieve initial documents
- Extract entities/concepts
- Second hop: Query with extracted entities
- Combine results
Graph-based retrieval:
- Build knowledge graph
- Traverse graph for multi-hop
- Find connected entities
Query decomposition:
- Break query into sub-queries
- Retrieve for each sub-query
- Combine results
Agent-based:
- Use LLM agent
- Plan retrieval steps
- Execute iteratively

Best practices:

Use iterative retrieval for simple multi-hop
Use graph-based for complex relationships
Use agents for complex reasoning

Best practice: Use iterative retrieval for multi-hop queries retrieve, extract, query again.

What are different techniques to be used to improved retrieval?

Retrieval improvement techniques:

Hybrid search:
- Combine dense + sparse
- Better coverage
- Improved accuracy
Reranking:
- Second-stage ranking
- Improves precision
- Better top-k results
Query expansion:
- Add synonyms, related terms
- Better query understanding
- Improved retrieval
Metadata filtering:
- Filter by type, date, tags
- Narrow search space
- Better precision
Better chunking:
- Semantic chunking
- Preserve context
- Better retrieval
Fine-tune embeddings:
- Domain-specific models
- Better domain understanding
- Improved accuracy
Multi-stage retrieval:
- Coarse → Fine search
- Hierarchical retrieval
- Faster and better

Best practice: Combine multiple techniques hybrid search, reranking, filtering for best results.

10. Prompt Engineering & Basics of LLM

What is the difference between Predictive/Discriminative AI and Generative AI?

Predictive/Discriminative AI:

Predicts labels or classes
Examples: Classification, regression
Input → Output (label/class)
Trained on labeled data
Examples: Image classification, sentiment analysis

Generative AI:

Generates new content
Examples: Text generation, image generation
Input → Output (new content)
Trained on unlabeled data
Examples: GPT, DALL-E, ChatGPT

Key differences:

Purpose: Prediction vs generation
Output: Label vs content
Training: Labeled vs unlabeled data
Use case: Classification vs creation

Best practice: Use discriminative AI for classification, generative AI for content creation.

What is LLM, and how are LLMs trained?

LLM (Large Language Model):

Neural network trained on large text corpora
Generates human-like text
Examples: GPT, BERT, LLaMA

How LLMs are trained:

Pre-training:
- Train on large unlabeled text corpus
- Learn language patterns
- Self-supervised learning (predict next token)
- Massive compute and data
Fine-tuning:
- Adapt to specific tasks
- Supervised learning on labeled data
- Task-specific behavior
- Much less data needed
Alignment:
- RLHF, DPO for human preferences
- Safety and helpfulness
- Human feedback
- Aligns with human values

Training process:

Pre-training: Months on thousands of GPUs
Fine-tuning: Hours to days
Alignment: Days to weeks

Best practice: LLMs are pre-trained on massive data, then fine-tuned and aligned for specific use cases.

What is a token in the language model?

Token:

Basic unit of text processing
Can be word, subword, or character
Depends on tokenizer (BPE, WordPiece, SentencePiece)

How tokens work:

Text → Tokens → Token IDs → Model
Model processes tokens, not raw text
Token count affects cost and context

Examples:

“Hello world” → 2 tokens (BPE)
“Machine learning” → 2-3 tokens (depending on tokenizer)

Tokenization methods:

BPE: Byte Pair Encoding (GPT)
WordPiece: (BERT)
SentencePiece: (T5, multilingual)

Best practice: Understand your model’s tokenizer affects cost, context window, and performance.

How to estimate the cost of running SaaS-based and Open Source LLM models?

Cost estimation:SaaS-based (OpenAI, Anthropic):

Pricing: Per token (input + output)
Example: GPT-4: $0.03/1k input tokens,$ 0.06/1k output tokens
Calculate: (input_tokens × input_price) + (output_tokens × output_price)
Monthly: Estimate tokens/month × price

Open source (self-hosted):

Infrastructure: GPU instances (A100, H100)
Cost: $5-15k/month for GPU instances
Break-even: ~2-5M requests/month
Additional: Storage, networking, maintenance

Factors:

Volume: More requests = higher cost
Model size: Larger models = higher cost
Context length: Longer context = more tokens
Region: Different pricing by region

Best practice: Calculate based on expected volume SaaS for low volume, self-hosted for high volume.

Explain the Temperature parameter and how to set it

Temperature:

Controls randomness in generation (0-2)
Lower = more deterministic
Higher = more creative

How to set:

0.0-0.3: Deterministic (code, classification)
0.4-0.7: Balanced (Q&A, summaries)
0.8-1.0: Creative (writing, brainstorming)

Best practices:

Start with 0.3-0.5 for most tasks
Use 0.0 for deterministic tasks
Use 0.8+ for creative tasks
Test different values

Best practice: Use low temperature (0.1-0.3) for factual tasks, higher (0.7-0.9) for creative tasks.

What are different decoding strategies for picking output tokens?

Decoding strategies:

Greedy:
- Always picks highest probability token
- Fastest, deterministic
- Can get repetitive
Beam search:
- Keeps top-k candidates
- Better quality, slower
- Good for translation
Top-k sampling:
- Samples from top-k tokens
- More diverse, less deterministic
- Good for creative tasks
Top-p (nucleus) sampling:
- Samples from smallest set covering p% probability
- Good balance of quality and diversity
- Most common for chat
Temperature sampling:
- Scales probabilities before sampling
- Controls randomness
- Often combined with top-p

Best practice: Use top-p (p=0.9) with temperature (0.7-0.9) for chat, greedy for code.

What are different ways you can define stopping criteria in large language model?

Stopping criteria:

Max tokens:
- Stop after N tokens
- Prevents long outputs
- Most common
Stop sequences:
- Stop when specific sequence appears
- Example: ”###” or “\n\n”
- Useful for structured output
EOS token:
- Stop at end-of-sequence token
- Model-generated
- Natural stopping point
Custom logic:
- Stop based on content
- Example: Complete sentence, paragraph
- More complex

Best practice: Use max tokens + stop sequences for reliable stopping.

How to use stop sequences in LLMs?

Stop sequences:

Define sequences:
- List of strings to stop at
- Example: [”###”, “\n\n\n”]
- Model stops when any sequence appears
Use cases:
- Structured output (JSON, XML)
- Multi-turn conversations
- Preventing continuation
Best practices:
- Use unique sequences
- Test to ensure they work
- Combine with max tokens

Example:

Stop sequence: ”###”
Model stops when ”###” appears
Useful for structured output

Best practice: Use stop sequences for structured output prevents model from continuing beyond desired point.

Explain the basic structure prompt engineering

Prompt structure:

System prompt:
- Defines model’s role
- Sets behavior and constraints
- Example: “You are a helpful assistant.”
Context:
- Relevant information
- Retrieved documents (RAG)
- User history
Instructions:
- What model should do
- Format requirements
- Examples
User input:
- Actual query or request
- User’s question or task

Example structure:

System: You are a helpful assistant.
Context: [Retrieved documents]
Instructions: Answer based on context, cite sources.
User: What is machine learning?

Best practice: Use clear structure system prompt, context, instructions, user input.

Explain in-context learning

In-context learning:

Model learns from examples in prompt
No weight updates
Examples guide model behavior
Types: Zero-shot, few-shot, chain-of-thought

How it works:

Provide examples in prompt
Model learns pattern from examples
Applies pattern to new input
No training needed

Types:

Zero-shot: No examples
Few-shot: 1-5 examples
Chain-of-thought: Examples with reasoning

Best practice: Use few-shot learning when zero-shot doesn’t work examples guide model behavior.

Explain type of prompt engineering

Prompt engineering types:

Zero-shot:
- No examples
- Model uses pre-training
- Fastest, cheapest
Few-shot:
- 1-5 examples
- Guides model behavior
- Better consistency
Chain-of-thought:
- Examples with reasoning steps
- Improves reasoning
- Better for complex tasks
Role-playing:
- Define model’s role
- Sets behavior
- Example: “You are an expert…”
Template-based:
- Structured prompts
- Consistent format
- Easy to maintain

Best practice: Start with zero-shot, add few-shot if needed, use chain-of-thought for reasoning tasks.

What are some of the aspect to keep in mind while using few-shots prompting?

Few-shot prompting considerations:

Example quality:
- High-quality, relevant examples
- Representative of task
- Clear and correct
Example quantity:
- 2-5 examples usually best
- Diminishing returns beyond 5
- Balance cost and quality
Example diversity:
- Cover different cases
- Avoid bias
- Representative sample
Token usage:
- Examples increase tokens
- Higher cost
- Monitor usage
Format consistency:
- Consistent format across examples
- Clear structure
- Easy to follow

Best practice: Use 2-3 high-quality, diverse examples more doesn’t always help.

What are certain strategies to write good prompt?

Prompt writing strategies:

Be clear and specific:
- Clear instructions
- Specific requirements
- Avoid ambiguity
Use examples:
- Few-shot examples
- Show desired format
- Guide behavior
Structure prompts:
- System prompt, context, instructions
- Clear sections
- Easy to read
Iterate:
- Test different prompts
- Refine based on results
- A/B test
Version control:
- Version prompts
- Track changes
- Enable rollback

Best practice: Write clear, structured prompts with examples iterate and test.

What is hallucination, and how can it be controlled using prompt engineering?

Hallucination:

Model generates false information
Confidently states incorrect facts
Common in LLMs

Control with prompt engineering:

Ground in context:
- Use RAG to provide context
- Instruct model to use only context
- Cite sources
Explicit instructions:
- “Only use provided context”
- “If unsure, say so”
- “Don’t make up information”
Few-shot examples:
- Show correct behavior
- Examples of admitting uncertainty
- Guide model
Output format:
- Structured output
- Confidence scores
- Source citations

Best practice: Use RAG + explicit instructions to reduce hallucinations ground answers in context.

How to improve the reasoning ability of LLM through prompt engineering?

Improve reasoning:

Chain-of-thought:
- Ask model to think step-by-step
- Show reasoning in examples
- Improves complex reasoning
Few-shot CoT:
- Examples with reasoning steps
- Model learns pattern
- Better reasoning
Self-consistency:
- Generate multiple reasoning chains
- Pick most common answer
- Improves accuracy
Verification:
- Ask model to verify answer
- Check reasoning
- Catch errors

Best practice: Use chain-of-thought prompting for complex reasoning ask model to think step-by-step.

How to improve LLM reasoning if your COT prompt fails?

If CoT fails:

Simplify problem:
- Break into smaller steps
- Solve step-by-step
- Combine solutions
Better examples:
- Higher quality examples
- Clearer reasoning
- More relevant
Different approach:
- Try different reasoning style
- Alternative methods
- Experiment
Model upgrade:
- Use larger model
- Better reasoning capability
- GPT-4, Claude Opus
External tools:
- Use calculator, code execution
- Verify with tools
- Hybrid approach

Best practice: Simplify problem, improve examples, or upgrade model CoT isn’t always sufficient.

11. Cost & Latency Tradeoffs

How do you reduce token usage?

Token reduction strategies:

Prompt optimization:
- Remove unnecessary text
- Use concise instructions
- Remove redundant examples
Context management:
- Only include relevant context
- Use RAG to retrieve only needed docs
- Truncate long documents
Prompt caching:
- Cache system prompts
- Reuse across requests
- Significant savings
Response limits:
- Set max tokens for output
- Stop early when possible
- Use stop sequences
Model selection:
- Use smaller models when possible
- Distilled models
- Task-specific models

Best practice: Optimize prompts, use caching, and manage context can reduce token usage by 30-50%.

When should you quantize a model?

When to quantize:

Memory constraints:
- Limited GPU memory
- Need to fit larger models
- Edge devices
Cost optimization:
- Reduce inference cost
- Lower infrastructure costs
- Scale more efficiently
Latency requirements:
- Need faster inference
- Real-time applications
- Lower latency

Trade-offs:

Pros: Less memory, faster, cheaper
Cons: Some accuracy loss, more complex

Best practice: Quantize when memory/cost/latency are constraints and small accuracy loss is acceptable.

What's your batching + caching strategy to reduce latency?

Batching strategy:

Dynamic batching:
- Batch requests together
- Process multiple requests simultaneously
- Higher throughput
Continuous batching (vLLM):
- Add requests to batch dynamically
- Remove completed requests
- Optimal GPU utilization
Batch size:
- Balance latency vs throughput
- Larger batches = higher throughput
- Smaller batches = lower latency

Caching strategy:

Prompt caching:
- Cache system prompts
- Reuse across requests
- Significant latency reduction
Response caching:
- Cache common queries
- Return cached responses
- Very fast
Context caching:
- Cache conversation context
- Reuse for multi-turn
- Faster responses

Best practice: Use continuous batching + prompt caching for best latency/throughput balance.

When to use hosted APIs vs open-source models?

Hosted APIs (OpenAI, Anthropic):

Use when:
- Low to medium volume
- Need latest models
- Don’t want infrastructure management
- Quick to market

Open-source models (self-hosted):

Use when:
- High volume (>2-5M requests/month)
- Cost-sensitive
- Need data privacy
- Want control over models

Decision framework:

Volume: Low → hosted, high → self-hosted
Cost: Low volume → hosted, high volume → self-hosted
Privacy: Need privacy → self-hosted
Speed: Quick to market → hosted

Best practice: Start with hosted APIs, migrate to self-hosted as you scale and cost becomes concern.

12. Agentic AI

Define an 'agent' in practical terms?

Agent definition:

LLM that can use tools and take actions
Can plan, execute, and iterate
Autonomous decision-making
Examples: Code execution, web search, API calls

Key capabilities:

Tool use: Call functions, APIs, tools
Planning: Break down tasks into steps
Execution: Take actions based on plan
Iteration: Refine based on results

Practical examples:

Code agent: Writes and executes code
Research agent: Searches web, synthesizes info
API agent: Calls APIs, processes data

Best practice: Agents are LLMs with tool-use capabilities enable autonomous task completion.

What's the hardest part of orchestration when chaining multiple tools?

Orchestration challenges:

Error handling:
- Tool failures
- Partial failures
- Recovery strategies
State management:
- Track execution state
- Manage context across tools
- Handle state transitions
Planning:
- Determine tool sequence
- Handle dependencies
- Adapt to failures
Coordination:
- Coordinate multiple tools
- Handle async operations
- Manage timeouts
Debugging:
- Complex execution paths
- Hard to trace issues
- Difficult to reproduce

Best practice: Error handling and state management are hardest design robust error handling and clear state management.

Why do planning agents often loop or stall?

Why agents loop/stall:

Poor planning:
- Incomplete plans
- Circular dependencies
- Unclear goals
No termination:
- No stopping criteria
- Keep trying indefinitely
- No timeout
Error recovery:
- Same error repeatedly
- No alternative strategies
- Stuck in loop
Context limits:
- Lose track of progress
- Forget what tried
- Repeat actions
Tool failures:
- Keep retrying failed tools
- No fallback strategies
- Stuck on failures

Mitigation:

Set max iterations
Implement timeouts
Track execution history
Use fallback strategies
Clear stopping criteria

Best practice: Set max iterations, timeouts, and track execution history to prevent loops and stalls.

Multi-agent vs single-agent: when does the complexity actually pay off?

Single-agent:

Simpler, easier to debug
Good for most tasks
Single point of failure

Multi-agent:

More complex, harder to debug
Good for complex tasks
Parallel execution

When multi-agent pays off:

Complex tasks: Need multiple specialists
Parallel work: Can work simultaneously
Specialization: Different agents for different tasks
Scale: Handle more complex workflows

When single-agent better:

Simple tasks: Single agent sufficient
Debugging: Easier to debug
Cost: Lower complexity

Best practice: Use single-agent for most tasks, multi-agent only when complexity justifies it.

How would you evaluate whether adding agentic behavior improved a system?

Evaluation metrics:

Task completion:
- Success rate
- Task completion time
- Quality of results
Efficiency:
- Number of steps
- Tool calls per task
- Time to completion
Reliability:
- Error rate
- Recovery from failures
- Consistency
User satisfaction:
- User feedback
- Task success rate
- Time saved
Cost:
- Cost per task
- Tool usage costs
- Total cost

Evaluation approach:

A/B test: Agentic vs non-agentic
Measure metrics above
Compare performance
User feedback

Best practice: Evaluate with A/B testing measure task completion, efficiency, reliability, and user satisfaction.

Explain different types of agents: simple reflex, model-based reflex, goal-based, utility-based, and learning agents

Agent types:

Simple reflex agents:
- React to current percept
- No memory, no planning
- Condition-action rules
- Example: Thermostat (if temp > threshold, turn on AC)
Model-based reflex agents:
- Maintain internal model of world
- Track how world evolves
- Better decisions with history
- Example: Agent tracking inventory changes
Goal-based agents:
- Have explicit goals
- Plan actions to achieve goals
- Consider future consequences
- Example: Navigation agent finding path to destination
Utility-based agents:
- Maximize utility function
- Handle uncertainty and trade-offs
- Choose best action given preferences
- Example: Trading agent maximizing profit while managing risk
Learning agents:
- Improve performance over time
- Learn from experience
- Adapt to new situations
- Example: Agent that improves recommendations based on feedback

Comparison:

Simple reflex: Fastest, simplest, limited
Model-based: More capable, needs world model
Goal-based: Can plan, needs goal specification
Utility-based: Handles trade-offs, needs utility function
Learning: Most flexible, needs training data

Best practice: Choose agent type based on task complexity simple reflex for basic tasks, learning agents for complex adaptive tasks.

What are reactive agents and how do they work?

Reactive agents:

Respond to current situation only
No internal state or memory
Immediate action based on percept
Simple condition-action rules

How they work:

Perceive: Observe current environment
Match: Match percept to condition
Act: Execute corresponding action
Repeat: No memory of past actions

Characteristics:

Fast: No planning overhead
Simple: Easy to implement
Limited: Can’t handle complex tasks
No learning: Don’t improve over time

Use cases:

Simple control systems
Real-time responses
When speed > sophistication
Deterministic environments

Limitations:

Can’t plan ahead
No memory of past actions
Limited to simple tasks
Can’t handle uncertainty well

Best practice: Use reactive agents for simple, fast-response tasks where immediate action is more important than planning.

Explain ReAct (Reasoning + Acting) agents and their advantages

ReAct agents:

Combine reasoning and acting
Interleave thinking and action
Use chain-of-thought reasoning
Take actions based on reasoning

How ReAct works:

Think: Reason about current situation
Act: Take action based on reasoning
Observe: See result of action
Think: Reason about new situation
Repeat: Continue until goal achieved

Key components:

Reasoning: Chain-of-thought thinking
Acting: Tool/function calls
Observation: Results from actions
Iteration: Refine based on observations

Advantages:

Transparency: Can see reasoning process
Flexibility: Adapts to new situations
Error recovery: Can reason about failures
Better decisions: Thoughtful actions

Example:

Thought: I need to find the weather. Let me search for it.
Action: search_web("weather today")
Observation: Weather is sunny, 75°F
Thought: User asked about weather. I have the answer.
Action: respond("The weather is sunny and 75°F")

Best practice: Use ReAct for complex tasks requiring reasoning combines thinking and action for better results.

How do agents react to different situations and stimuli?

Agent reaction patterns:

Immediate reaction:
- React instantly to stimulus
- No deliberation
- Fast response
- Simple reflex agents
Deliberative reaction:
- Think before acting
- Consider options
- Plan actions
- Goal-based agents
Adaptive reaction:
- Learn from experience
- Adjust behavior
- Improve over time
- Learning agents
Contextual reaction:
- Consider context
- Use memory/history
- Better decisions
- Model-based agents

Reaction mechanisms:

Stimulus → Action: Direct mapping
Stimulus → Reasoning → Action: With deliberation
Stimulus → Memory → Reasoning → Action: With context
Stimulus → Learning → Adaptation → Action: With improvement

Factors affecting reaction:

Agent type: Reflex vs deliberative
Environment: Deterministic vs stochastic
Goals: Immediate vs long-term
Experience: New vs learned

Best practice: Design agents to react appropriately immediate for urgent, deliberative for complex, adaptive for changing environments.

What are the key metrics for agent evaluation?

Agent evaluation metrics:

Task success:
- Success rate (% tasks completed)
- Goal achievement rate
- Task completion quality
- Accuracy of results
Efficiency:
- Steps to completion
- Tool calls per task
- Time to completion
- Resource usage
Reliability:
- Error rate
- Failure recovery rate
- Consistency across runs
- Robustness to edge cases
Cost:
- Cost per task
- Token usage
- Tool/API costs
- Total cost of ownership
User experience:
- User satisfaction
- Response time
- Quality of interactions
- Helpfulness
Learning (for learning agents):
- Improvement over time
- Adaptation to new tasks
- Generalization ability
- Sample efficiency

Evaluation framework:

Offline: Test on held-out dataset
Online: A/B test with real users
Simulation: Test in controlled environment
Human evaluation: Expert review

Best practice: Use multiple metrics task success, efficiency, reliability, and user experience for comprehensive evaluation.

How do you perform end-to-end evaluation of agents?

End-to-end evaluation:

Define evaluation tasks:
- Realistic scenarios
- Diverse task types
- Clear success criteria
- Representative of real use
Set up test environment:
- Simulated or real environment
- Tools and APIs available
- Controlled conditions
- Reproducible setup
Run agent on tasks:
- Execute agent on each task
- Record all actions
- Capture outputs
- Log errors/failures
Measure performance:
- Task success rate
- Steps to completion
- Time to completion
- Quality of results
- Cost per task
Analyze results:
- Identify failure modes
- Find common errors
- Analyze efficiency
- Compare with baselines
Iterate:
- Fix identified issues
- Improve agent
- Re-evaluate
- Continuous improvement

Evaluation datasets:

WebArena: Web navigation tasks
AgentBench: Multi-domain agent tasks
ToolBench: Tool-using tasks
Custom: Domain-specific tasks

Best practice: Evaluate end-to-end on realistic tasks measure success, efficiency, and quality comprehensively.

What is tool calling and how do agents use tools?

Tool calling:

Agents invoke external functions/tools
Extends agent capabilities
Enables real-world actions
Examples: API calls, code execution, web search

How agents use tools:

Tool definition:
- Define available tools
- Specify parameters
- Document functionality
- Example: search_web(query: str) -> str
Tool selection:
- Agent decides which tool to use
- Based on current task
- Considers tool capabilities
- Matches tool to need
Tool invocation:
- Call tool with parameters
- Execute tool function
- Get result
- Handle errors
Result processing:
- Process tool output
- Use result for next action
- Integrate into reasoning
- Continue task

Tool calling patterns:

Sequential: One tool at a time
Parallel: Multiple tools simultaneously
Conditional: Tool based on condition
Iterative: Tool in loop until done

Best practice: Design tools with clear interfaces agents need well-defined tools with good documentation.

Explain OpenAI Functions (Function Calling) for agents

OpenAI Functions:

Structured way to define tools
Model decides when to call functions
Returns structured function calls
Enables reliable tool use

How it works:

Define functions:

{
  "name": "get_weather",
  "description": "Get current weather",
  "parameters": {
    "type": "object",
    "properties": {
      "location": {"type": "string"}
    }
  }
}

Model decides:
- Model sees function definitions
- Decides if function needed
- Returns function call if needed
- Or continues conversation
Execute function:
- Parse function call
- Execute with parameters
- Get result
- Return to model
Model continues:
- Model sees function result
- Uses result in response
- Can call more functions
- Completes task

Advantages:

Reliable: Structured function calls
Flexible: Model decides when to use
Type-safe: JSON schema validation
Easy integration: Standard format

Best practice: Use OpenAI Functions for reliable tool calling structured, type-safe, and model-controlled.

What is MCP (Model Context Protocol) and how does it work?

MCP (Model Context Protocol):

Standard protocol for agent-tool communication
Enables agents to use external tools
Provides context to models
Standardizes tool interfaces

How MCP works:

Tool registration:
- Tools register with MCP server
- Define capabilities
- Specify interfaces
- Make available to agents
Context provision:
- MCP provides tool context
- Describes available tools
- Shows tool capabilities
- Updates dynamically
Tool invocation:
- Agent requests tool use
- MCP routes to tool
- Executes tool
- Returns result
Context updates:
- MCP updates context
- Reflects tool results
- Maintains state
- Enables multi-step tasks

Key features:

Standardization: Common protocol
Interoperability: Works across systems
Context management: Maintains state
Tool discovery: Agents find tools

Use cases:

Multi-tool agent systems
Tool marketplace integration
Standardized agent platforms
Cross-platform tool use

Best practice: Use MCP for standardized tool integration enables agents to discover and use tools reliably.

Explain Agent-to-Agent (A2A) communication and coordination

Agent-to-Agent (A2A) communication:

Agents communicate with each other
Coordinate on tasks
Share information
Collaborate on goals

A2A patterns:

Direct communication:
- Agents send messages directly
- Point-to-point communication
- Simple but limited scale
- Example: Two agents coordinating
Broadcast communication:
- One agent broadcasts to all
- Announcements, updates
- Efficient for one-to-many
- Example: Leader announcing plan
Mediated communication:
- Communication through mediator
- Centralized coordination
- Better for complex systems
- Example: Message broker
Shared memory:
- Agents share common memory
- Read/write shared state
- Coordination through state
- Example: Blackboard architecture

Coordination strategies:

Task delegation:
- One agent delegates to others
- Divide and conquer
- Specialized agents
- Example: Manager delegates to workers
Consensus:
- Agents agree on action
- Voting, negotiation
- Democratic decision-making
- Example: Agents vote on plan
Auction:
- Agents bid on tasks
- Market-based coordination
- Efficient resource allocation
- Example: Task auction system
Contract net:
- One agent announces task
- Others bid on task
- Select best bidder
- Example: Task allocation

Best practice: Design A2A systems with clear communication protocols enables effective coordination and collaboration.

How do you design multi-agent systems for collaboration?

Multi-agent system design:

Agent roles:
- Define agent responsibilities
- Specialize agents
- Clear role boundaries
- Example: Researcher, Writer, Reviewer
Communication protocol:
- Define message format
- Specify communication channels
- Establish protocols
- Example: JSON messages, REST API
Coordination mechanism:
- How agents coordinate
- Task allocation
- Conflict resolution
- Example: Manager agent, voting
Shared resources:
- Common knowledge base
- Shared memory
- Tool access
- Example: Shared database
Error handling:
- Agent failure recovery
- Communication failures
- Task reassignment
- Example: Backup agents, retries

Design patterns:

Hierarchical:
- Manager-worker structure
- Top-down coordination
- Clear hierarchy
- Example: Manager delegates to workers
Peer-to-peer:
- Equal agents
- Distributed coordination
- No central authority
- Example: Swarm agents
Market-based:
- Agents trade resources
- Auction-based allocation
- Economic incentives
- Example: Task marketplace
Blackboard:
- Shared blackboard
- Agents read/write
- Opportunistic coordination
- Example: Shared knowledge base

Best practice: Design multi-agent systems with clear roles, communication protocols, and coordination mechanisms enables effective collaboration.

What are the challenges in agent-to-agent communication?

A2A communication challenges:

Message understanding:
- Agents interpret messages
- Ambiguity in communication
- Different vocabularies
- Misunderstandings
Synchronization:
- Timing of messages
- Async vs sync communication
- Race conditions
- Deadlocks
Scalability:
- Communication overhead
- Message flooding
- Network congestion
- Performance degradation
Reliability:
- Message delivery
- Lost messages
- Duplicate messages
- Ordering guarantees
Security:
- Authentication
- Authorization
- Message encryption
- Trust between agents
Coordination:
- Avoiding conflicts
- Resolving disputes
- Consensus building
- Task allocation

Solutions:

Protocols: Standardized communication
Message queues: Reliable delivery
Encryption: Secure communication
Authentication: Trusted agents
Coordination algorithms: Conflict resolution

Best practice: Address communication challenges with protocols, reliability mechanisms, and security critical for multi-agent systems.

How do you evaluate agent-to-agent systems?

A2A system evaluation:

System-level metrics:
- Overall task completion
- System efficiency
- Resource utilization
- End-to-end performance
Agent-level metrics:
- Individual agent performance
- Agent contribution
- Agent reliability
- Agent efficiency
Communication metrics:
- Message overhead
- Communication latency
- Message success rate
- Coordination efficiency
Coordination metrics:
- Task allocation quality
- Conflict resolution rate
- Consensus building time
- Coordination overhead
Scalability metrics:
- Performance with more agents
- Communication overhead growth
- System stability
- Resource usage

Evaluation approaches:

Simulation: Test in controlled environment
Benchmarks: Standard test suites
Real-world: Deploy and monitor
Stress testing: Test under load

Best practice: Evaluate A2A systems at multiple levels system, agent, communication, and coordination for comprehensive assessment.

Compare reactive vs deliberative vs hybrid agents

Reactive agents:

React to current situation
No planning or memory
Fast response
Simple implementation
Limited to simple tasks

Deliberative agents:

Plan before acting
Consider future consequences
Slower but better decisions
More complex
Handle complex tasks

Hybrid agents:

Combine reactive and deliberative
React for urgent, deliberate for complex
Balance speed and quality
Most practical
Best of both worlds

Comparison:

Aspect	Reactive	Deliberative	Hybrid
Speed	Fast	Slow	Medium
Complexity	Simple	Complex	Medium
Planning	No	Yes	Selective
Memory	No	Yes	Yes
Use case	Simple	Complex	General

When to use:

Reactive: Simple, fast-response tasks
Deliberative: Complex planning tasks
Hybrid: General-purpose agents

Best practice: Use hybrid agents for most applications balance speed and quality, react when needed, deliberate when beneficial.

What is the difference between plan-and-execute vs ReAct agent strategies?

Plan-and-execute:

Plan entire task upfront
Execute plan step by step
Rigid execution
Can’t adapt to changes
Good for predictable tasks

ReAct:

Interleave reasoning and acting
Plan incrementally
Adapt to observations
Flexible execution
Good for dynamic tasks

Comparison:Plan-and-execute:

Pros: Clear plan, efficient execution
Cons: Rigid, can’t adapt, fails if plan wrong
Use when: Task is predictable, plan is reliable

ReAct:

Pros: Flexible, adapts, handles uncertainty
Cons: More steps, slower, more tokens
Use when: Task is dynamic, needs adaptation

Example:Plan-and-execute:

Plan: Search weather → Get location → Format response
Execute: Search weather
Execute: Get location
Execute: Format response

ReAct:

Think: Need weather, let me search
Act: search_web("weather")
Observe: Weather is sunny
Think: Good, now format response
Act: respond("Weather is sunny")

Best practice: Use ReAct for dynamic tasks, plan-and-execute for predictable tasks choose based on task characteristics.

How do learning agents improve over time?

Learning agent improvement:

Experience collection:
- Collect training data
- Record actions and outcomes
- Build experience database
- Track performance
Learning mechanisms:
- Supervised learning: Learn from labeled examples
- Reinforcement learning: Learn from rewards
- Unsupervised learning: Discover patterns
- Meta-learning: Learn to learn
Performance improvement:
- Better decision-making
- Fewer errors
- Faster task completion
- Higher success rate
Adaptation:
- Adapt to new tasks
- Handle edge cases
- Generalize from experience
- Transfer learning

Learning approaches:

Online learning:
- Learn during operation
- Continuous improvement
- Real-time adaptation
- Example: Agent learns from user feedback
Offline learning:
- Learn from historical data
- Batch training
- Periodic updates
- Example: Retrain on collected data
Transfer learning:
- Learn from related tasks
- Apply to new domains
- Faster adaptation
- Example: Agent trained on task A helps with task B

Best practice: Design learning agents with clear learning objectives collect experience, learn continuously, and adapt to new situations.

What are the evaluation frameworks for agent systems?

Agent evaluation frameworks:

AgentBench:
- Multi-domain agent tasks
- Standardized evaluation
- Diverse task types
- Comprehensive metrics
WebArena:
- Web navigation tasks
- Realistic scenarios
- Browser automation
- Success rate metrics
ToolBench:
- Tool-using tasks
- Function calling evaluation
- Tool selection accuracy
- Task completion rate
ALFWorld:
- Household tasks
- Embodied agents
- Sequential actions
- Task success metrics
Custom frameworks:
- Domain-specific tasks
- Real-world scenarios
- Business metrics
- User satisfaction

Evaluation dimensions:

Task success: Can agent complete task?
Efficiency: How many steps?
Quality: How good is result?
Reliability: How consistent?
Cost: How expensive?

Best practice: Use standardized frameworks (AgentBench, WebArena) for comparison, custom frameworks for domain-specific evaluation.

13. System Design Thinking

How do you make an AI system more deterministic and less brittle?

Determinism strategies:

Temperature = 0:
- Pure greedy decoding
- Deterministic outputs
- Reproducible
Fixed seed:
- Set random seed
- Same seed = same output
- Reproducible
Structured output:
- Use JSON schema
- Validate output format
- Consistent structure
Prompt engineering:
- Clear instructions
- Few-shot examples
- Consistent format

Reducing brittleness:

Error handling:
- Graceful degradation
- Fallback strategies
- Retry logic
Validation:
- Input validation
- Output validation
- Error detection
Monitoring:
- Track failures
- Alert on issues
- Quick response

Best practice: Use temperature=0, structured output, and robust error handling for deterministic, robust systems.

What fallback do you use if the LLM fails mid-task?

Fallback strategies:

Retry:
- Retry with same prompt
- Exponential backoff
- Max retries
Simplified prompt:
- Retry with simpler prompt
- Remove complexity
- Basic version
Cached response:
- Return cached response
- Similar queries
- Fast fallback
Template response:
- Pre-written responses
- Generic answers
- User-friendly
Human escalation:
- Route to human
- For critical tasks
- Last resort

Best practice: Implement layered fallbacks retry → simplified prompt → cached response → template → human escalation.

Can you solve this without an LLM or vector DB?

When to avoid LLMs:

Simple tasks:
- Rule-based sufficient
- No need for AI
- Faster, cheaper
Deterministic tasks:
- Need exact results
- No ambiguity
- Traditional methods better
Cost-sensitive:
- LLM too expensive
- Simple solution sufficient
- Cost optimization
Latency-critical:
- Need very fast response
- LLM too slow
- Real-time requirements

When to avoid vector DBs:

Small dataset:
- Can use simple search
- No need for vector DB
- Overkill
Exact matches:
- Keyword search sufficient
- No semantic search needed
- Simpler solution

Best practice: Consider simpler solutions first use LLMs/vector DBs only when needed.

What's the right database for this task - SQL, NoSQL, or vector?

SQL databases:

Use for: Structured data, exact queries, transactions
Examples: PostgreSQL, MySQL
Best for: User data, transactions, structured queries

NoSQL databases:

Use for: Unstructured data, flexible schema, scale
Examples: MongoDB, DynamoDB
Best for: Documents, flexible schema, high scale

Vector databases:

Use for: Embeddings, similarity search, RAG
Examples: Pinecone, Milvus, Weaviate
Best for: Semantic search, RAG systems

Decision framework:

Structured data + exact queries: SQL
Unstructured data + flexible schema: NoSQL
Embeddings + similarity search: Vector

Best practice: Use SQL for structured data, NoSQL for documents, vector DB for embeddings.

14. Risks, Integrity & Compliance

How do you monitor hallucinations in production?

Hallucination monitoring:

Output validation:
- Check for factual claims
- Verify against sources
- Flag suspicious outputs
Confidence scores:
- Monitor confidence levels
- Flag low-confidence outputs
- Review manually
User feedback:
- Collect thumbs up/down
- Track user reports
- Identify patterns
Citation accuracy:
- Verify citations
- Check source relevance
- Measure citation precision
Automated checks:
- Fact-checking APIs
- Knowledge base verification
- Pattern detection

Best practice: Monitor hallucinations with output validation, confidence scores, and user feedback essential for production.

Bias vs fairness: where does 'fixing it' actually make systems worse?

Bias vs fairness:Bias:

Statistical bias in model
Can be measured
Technical issue

Fairness:

Social concept
Subjective
Context-dependent

When fixing makes it worse:

Over-correction:
- Fixing one bias creates another
- Unintended consequences
- Worse outcomes
Wrong metrics:
- Optimizing wrong fairness metric
- Doesn’t improve real fairness
- Makes system worse
Context mismatch:
- Fixing for one context
- Doesn’t work in other contexts
- Creates new issues

Best practice: Carefully define fairness metrics, test in real contexts, and monitor for unintended consequences.

What's your red-teaming checklist for a new LLM product?

Red-teaming checklist:

Safety:
- Harmful content generation
- Jailbreak attempts
- Prompt injection
- Safety bypasses
Bias:
- Demographic bias
- Stereotyping
- Unfair treatment
- Representation issues
Privacy:
- PII leakage
- Data exposure
- Privacy violations
- Compliance issues
Security:
- Prompt injection
- Model extraction
- Data poisoning
- Adversarial attacks
Reliability:
- Hallucinations
- Inconsistency
- Error handling
- Edge cases

Best practice: Red-team before launch test safety, bias, privacy, security, and reliability comprehensively.

How do you handle privacy when logs contain user prompts?

Privacy handling:

Anonymization:
- Remove PII
- Hash sensitive data
- Pseudonymize users
Access control:
- Role-based access
- Audit logs
- Secure storage
Retention:
- Set retention policies
- Delete old logs
- Comply with regulations
Encryption:
- Encrypt at rest
- Encrypt in transit
- Secure storage
Compliance:
- GDPR, CCPA compliance
- User consent
- Right to deletion

Best practice: Anonymize logs, control access, set retention, encrypt data, and comply with regulations.

15. Scaling & Business Impact

Cost vs latency vs accuracy: describe a time you had to sacrifice one

Trade-offs:Cost vs accuracy:

Use smaller models → lower cost, lower accuracy
Use larger models → higher cost, higher accuracy
Decision: Balance based on requirements

Latency vs accuracy:

Use faster models → lower latency, lower accuracy
Use better models → higher latency, higher accuracy
Decision: Balance based on use case

Cost vs latency:

Use caching → lower cost, lower latency
Use more GPUs → higher cost, lower latency
Decision: Balance based on budget

Example scenario:

Situation: High latency, need to reduce
Solution: Use smaller model, add caching
Trade-off: Slight accuracy loss, but acceptable
Result: Latency reduced 50%, accuracy dropped 5%

Best practice: Understand trade-offs often need to balance cost, latency, and accuracy based on priorities.

Enterprise readiness: what infra concerns block adoption most?

Infrastructure concerns:

Security:
- Data privacy
- Compliance
- Access control
- Encryption
Reliability:
- Uptime requirements
- Error handling
- Disaster recovery
- SLAs
Scalability:
- Handle enterprise scale
- Performance at scale
- Cost at scale
- Infrastructure needs
Integration:
- Existing systems
- APIs, authentication
- Data pipelines
- Workflows
Support:
- Documentation
- Support channels
- Training
- Maintenance

Best practice: Address security, reliability, scalability, integration, and support critical for enterprise adoption.

How would you design a GenAI-first product that survives beyond prototype hype?

Design principles:

Real value:
- Solve real problems
- Clear value proposition
- User needs first
Quality:
- High accuracy
- Reliable performance
- Consistent results
Scalability:
- Handle growth
- Cost-effective
- Performance at scale
User experience:
- Intuitive interface
- Fast responses
- Good error handling
Iteration:
- Continuous improvement
- User feedback
- Regular updates

Best practice: Focus on real value, quality, scalability, and user experience build for long-term, not just demo.

16. Real-World Scenarios

What happens if your embedding model changes - how do you migrate safely?

Migration strategy:

Dual-write:
- Write to both old and new vector DBs
- Gradually migrate reads
- Deprecate old DB
Blue-green:
- Maintain two environments
- Re-embed in green
- Switch traffic when ready
Incremental:
- Re-embed in batches
- Update incrementally
- Route queries appropriately
Validation:
- Compare results
- Ensure quality maintained
- Monitor metrics

Best practice: Use dual-write or blue-green deployment migrate safely with validation and rollback capability.

How would you fine-tune a model on user behavior and deploy it?

Fine-tuning process:

Data collection:
- Collect user behavior data
- Label data
- Create training set
Fine-tuning:
- Train on user behavior
- Monitor validation metrics
- Iterate
Evaluation:
- Test on held-out set
- Compare with baseline
- Measure improvement
Deployment:
- A/B test against baseline
- Gradual rollout
- Monitor performance
Monitoring:
- Track metrics
- Monitor user feedback
- Iterate based on results

Best practice: Collect data, fine-tune carefully, evaluate thoroughly, deploy gradually, and monitor continuously.

How would you make this system cheaper without killing quality?

Cost optimization strategies:

Model optimization:
- Quantization
- Model distillation
- Smaller models
Caching:
- Prompt caching
- Response caching
- Reduce redundant calls
Smart routing:
- Route simple queries to smaller models
- Route complex to larger models
- Cost-aware routing
Batching:
- Dynamic batching
- Continuous batching
- Higher throughput
Infrastructure:
- Spot instances
- Auto-scaling
- Right-sizing

Best practice: Optimize models, use caching and smart routing, batch requests can reduce costs 30-50% without quality loss.

Can you walk me through a debugging session for incorrect LLM outputs?

Debugging process:

Reproduce:
- Reproduce the issue
- Capture prompt and response
- Identify pattern
Analyze:
- Check prompt quality
- Review context
- Check model parameters
Isolate:
- Test with minimal prompt
- Remove complexity
- Identify root cause
Fix:
- Update prompt
- Adjust parameters
- Add examples
Validate:
- Test on known cases
- Verify fix works
- Monitor in production

Example:

Issue: Model gives wrong answer
Debug: Check prompt, find ambiguous instruction
Fix: Clarify instruction, add example
Validate: Test on known cases, works correctly

Best practice: Systematic debugging reproduce, analyze, isolate, fix, validate.

Conclusion & Interview Tips

This guide covers all major AI engineering areas from prompt design to scalable systems and ethical deployment.

Key Preparation Tips

Understand system trade-offs
Build RAG or LLM-serving demos
Learn caching, monitoring, CI/CD
Emphasize ethics & safety
Explain architecture choices clearly

During the Interview

Clarify before answering
Think aloud for reasoning
Mention latency/cost trade-offs
Talk about monitoring and fallback
Stay calm & confident

Interviews test not just your AI knowledge but your reasoning about scale, safety, and reliability. Stay grounded and structured.

Good luck with your AI Engineer interviews!

Overview

Interview Prep

Open Source Programs

Companies & Careers

AI Engineer Interview Questions

1. LLM Engineering & Prompt Design

2. Prompting & Context Engineering

3. AI System Architecture

3. Model Deployment & Serving

4. Fine-Tuning & Alignment

5. RAG Systems (Retrieval Augmented Generation)

5. MLOps & LLMOps

6. Document Digitization & Chunking

7. Embedding Models

8. Internal Working of Vector Databases

9. Advanced Search Algorithms

10. Prompt Engineering & Basics of LLM

11. Cost & Latency Tradeoffs

12. Agentic AI

13. System Design Thinking

14. Risks, Integrity & Compliance

15. Scaling & Business Impact

16. Real-World Scenarios

Conclusion & Interview Tips

Key Preparation Tips

During the Interview

Overview

Interview Prep

Open Source Programs

Companies & Careers

​1. LLM Engineering & Prompt Design

​2. Prompting & Context Engineering

​3. AI System Architecture

​3. Model Deployment & Serving

​4. Fine-Tuning & Alignment

​5. RAG Systems (Retrieval Augmented Generation)

​5. MLOps & LLMOps

​6. Document Digitization & Chunking

​7. Embedding Models

​8. Internal Working of Vector Databases

​9. Advanced Search Algorithms

​10. Prompt Engineering & Basics of LLM

​11. Cost & Latency Tradeoffs

​12. Agentic AI

​13. System Design Thinking

​14. Risks, Integrity & Compliance

​15. Scaling & Business Impact

​16. Real-World Scenarios

​Conclusion & Interview Tips

​Key Preparation Tips

​During the Interview

1. LLM Engineering & Prompt Design

2. Prompting & Context Engineering

3. AI System Architecture

3. Model Deployment & Serving

4. Fine-Tuning & Alignment

5. RAG Systems (Retrieval Augmented Generation)

5. MLOps & LLMOps

6. Document Digitization & Chunking

7. Embedding Models

8. Internal Working of Vector Databases

9. Advanced Search Algorithms

10. Prompt Engineering & Basics of LLM

11. Cost & Latency Tradeoffs

12. Agentic AI

13. System Design Thinking

14. Risks, Integrity & Compliance

15. Scaling & Business Impact

16. Real-World Scenarios

Conclusion & Interview Tips

Key Preparation Tips

During the Interview