LLM & AI Interview Questions (50+ Detailed Q&A)
1. Fundamentals (Transformers & LLMs)
1. Transformer Architecture
- Encoder-only: BERT (understanding).
- Decoder-only: GPT (generation).
- Encoder-Decoder: the original Transformer, T5 (seq2seq).
- Key components: Self-Attention, Positional Embeddings, Feed-Forward Networks.
2. Self-Attention Mechanism
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
Every token attends to every other token in parallel, which solves the long-range dependency problem of RNNs/LSTMs.
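A minimal NumPy sketch of single-head scaled dot-product attention (no masking or batching); shapes and names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q, K, V: (seq_len, d_k) for a single head
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # (seq_len, seq_len) pairwise similarities
    weights = softmax(scores)          # each row is a distribution over tokens
    return weights @ V                 # weighted sum of value vectors

# Toy example: 4 tokens, head dimension 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```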
3. Tokenization
Splitting text into subword units via BPE, WordPiece, or SentencePiece. Subwords balance vocabulary size against sequence length and handle rare or unseen words gracefully.
4. Pre-training vs Fine-tuning
- Pre-training: Self-supervised next-token prediction on internet-scale data. Learns grammar and facts. Expensive.
- Fine-tuning: Supervised (SFT) on Q&A pairs. Learns instruction following / a specific task. Cheaper.
5. RLHF (Reinforcement Learning from Human Feedback)
- SFT Model generates responses.
- Humans rank responses.
- Train Reward Model to predict rank.
- Optimize Policy (LLM) using PPO against Reward Model.
6. Context Window
The maximum number of tokens the model can attend to at once. Self-attention's O(N^2) cost in sequence length limits window size.
Tricks: FlashAttention, Sliding Window Attention.
7. Hallucination
The model produces fluent but factually wrong output because it optimizes for plausible text, not truth. Mitigations: RAG (grounding in retrieved sources), lower temperature, requiring citations, verification steps.
8. Temperature & Top-P
- Temperature: Scales logits before the softmax. 0 = deterministic (argmax); higher values = more random/creative.
- Top-P (Nucleus): Sample only from the smallest set of top tokens whose cumulative probability reaches P (e.g. 0.9); see the sketch below.
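A minimal sketch of both knobs applied to a raw logits vector (toy values, NumPy only):

```python
import numpy as np

def sample(logits, temperature=1.0, top_p=0.9):
    """Temperature + nucleus (top-p) sampling over a logits vector."""
    if temperature == 0:                          # greedy decoding = argmax
        return int(np.argmax(logits))
    z = (logits - logits.max()) / temperature     # shift for numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    # Keep the smallest set of tokens whose cumulative probability >= top_p
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    kept = probs[keep] / probs[keep].sum()        # renormalize over the nucleus
    return int(np.random.choice(keep, p=kept))

logits = np.array([2.0, 1.0, 0.5, -1.0])
print(sample(logits, temperature=0.7, top_p=0.9))
```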
9. Embeddings
Dense vector representations of text in which semantically similar inputs map to nearby points (compared via cosine similarity). The foundation of semantic search, clustering, and RAG retrieval.
10. Zero-shot vs Few-shot
- Zero-shot: “Translate this.” (No examples).
- Few-shot: “Translate this. Examples: A->B, C->D”. (In-context learning).
2. RAG (Retrieval Augmented Generation)
11. What is RAG?
Retrieve relevant documents from an external knowledge base (usually via embedding search) and inject them into the prompt, so the LLM answers grounded in that context. Reduces hallucination and gives access to fresh or private data without retraining.
12. Vector Database
Stores embeddings and serves fast approximate nearest-neighbor (ANN) search, e.g. via HNSW indexes. Examples: FAISS, Pinecone, Weaviate, Milvus, pgvector.
13. Chunking Strategies
- Fixed Size: e.g. 512 tokens with overlap between chunks.
- Semantic: break at paragraph/header boundaries.
- Recursive: split on a priority list of separators (paragraph, then line, then sentence), e.g. LangChain's RecursiveCharacterTextSplitter (sketch below).
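A minimal sketch of the recursive strategy, assuming the langchain-text-splitters package; the parameter values are illustrative:

```python
# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "First paragraph about transformers.\n\nSecond paragraph about attention. " * 20

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,       # max characters per chunk (supply a token counter for tokens)
    chunk_overlap=64,     # overlap preserves context across chunk boundaries
    separators=["\n\n", "\n", ". ", " "],  # try paragraphs first, then lines, then sentences
)
chunks = splitter.split_text(text)
print(len(chunks), len(chunks[0]))
```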
14. HyDE (Hypothetical Document Embeddings)
Ask the LLM to write a hypothetical answer to the query, embed that answer, and search with it. The fake answer sits closer in embedding space to real documents than a short query does, improving retrieval.
15. Re-ranking (Cross-Encoder)
Retrieve a broad top-k with a fast bi-encoder, then re-score each (query, document) pair with a cross-encoder that attends over both texts jointly. More accurate but slower, so it is applied only to the shortlist.
16. Parent Document Retrieval
Embed and search over small chunks for retrieval precision, but pass the larger parent chunk (or full document) to the LLM so it gets complete context.
17. Multi-Query Retrieval
Use the LLM to rewrite the user query into several differently-phrased variants, run retrieval for each, and merge/deduplicate the results. Improves recall for ambiguous queries.
18. Metadata Filtering
Constrain vector search with structured attributes, e.g. year=2023 or source="docs".
- Pre-filtering: apply the filter before the ANN search (accurate, but can be slower).
- Post-filtering: filter the top-k results after the search (fast, but may return fewer than k hits). A sketch of both follows.
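A toy brute-force comparison (not a real vector-DB API; the fields and sizes are illustrative):

```python
import numpy as np

docs = [
    {"vec": np.array([0.1, 0.9]), "year": 2023, "text": "doc A"},
    {"vec": np.array([0.8, 0.2]), "year": 2021, "text": "doc B"},
    {"vec": np.array([0.3, 0.7]), "year": 2023, "text": "doc C"},
]
query = np.array([0.2, 0.8])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Pre-filtering: restrict the candidate set first, then search only within it
candidates = [d for d in docs if d["year"] == 2023]
best = max(candidates, key=lambda d: cosine(query, d["vec"]))

# Post-filtering: search everything, then drop non-matching hits (may return < k)
ranked = sorted(docs, key=lambda d: cosine(query, d["vec"]), reverse=True)
hits = [d for d in ranked[:2] if d["year"] == 2023]
print(best["text"], [h["text"] for h in hits])
```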
19. Graph RAG
Extract entities and relations from documents into a knowledge graph and traverse it at query time. Better than flat chunk retrieval for multi-hop questions ("Who manages the team that built X?").
20. Lost in the Middle Phenomenon
Models recall information at the beginning and end of a long context far better than information buried in the middle. Mitigation: re-order retrieved chunks so the most relevant ones sit at the edges of the prompt.
3. Training & Fine-Tuning
21. PEFT (Parameter Efficient Fine Tuning)
Freeze the base model and train only a small set of added or selected parameters (LoRA, adapters, prefix tuning). Cuts GPU memory and cost dramatically compared to full fine-tuning, with little quality loss on most tasks.
22. LoRA (Low-Rank Adaptation)
Keep the pretrained weight matrix W frozen and learn a low-rank update ΔW = B·A with rank r << d, so W' = W + (α/r)·B·A. Trains well under 1% of the parameters, and the adapter can be merged into W at inference with zero added latency.
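A from-scratch PyTorch sketch of the idea (not the peft library's API); rank and scaling values are illustrative:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                  # freeze pretrained W (and bias)
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zeros => ΔW = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 12,288 trainable vs ~590k frozen parameters
```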
23. QLoRA
LoRA applied on top of a frozen base model quantized to 4-bit NF4, plus double quantization and paged optimizers. Makes fine-tuning very large models feasible on a single GPU.
24. Quantization (FP16 vs INT8 vs NF4)
- FP16 (Half): 16 bits per weight.
- INT8: 8 bits per weight.
- NF4: NormalFloat 4-bit, used by QLoRA.
Tradeoff: size/speed vs accuracy (measured as perplexity degradation). A minimal INT8 sketch follows.
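A minimal sketch of symmetric per-tensor INT8 quantization, the simplest of these schemes; NF4 additionally assumes normally distributed weights:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor INT8 quantization: w ≈ scale * q (assumes w is not all zeros)."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
err = np.abs(w - dequantize(q, s)).max()   # the accuracy cost of 4x smaller weights
print(err)
```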
25. Gradient Checkpointing
Trade compute for memory: discard intermediate activations during the forward pass and recompute them during backprop. Training gets roughly 20-30% slower, but peak memory drops sharply, enabling larger models or batches.
26. DPO (Direct Preference Optimization)
Skips RLHF's reward model and PPO: optimize the policy directly on (chosen, rejected) preference pairs with a classification-style loss computed against a frozen reference model. Simpler and more stable to train.
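A minimal sketch of the DPO loss from per-sequence log-probabilities (batching and log-prob computation omitted; the numbers are toy values):

```python
import torch
import torch.nn.functional as F

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Push the policy to prefer 'chosen' over 'rejected' more strongly
    than the frozen reference model does; beta controls the KL pressure."""
    chosen_reward = beta * (pol_chosen - ref_chosen)       # implicit reward
    rejected_reward = beta * (pol_rejected - ref_rejected)
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# Toy sequence log-probs for a batch of 2 preference pairs
loss = dpo_loss(torch.tensor([-5.0, -6.0]), torch.tensor([-7.0, -6.5]),
                torch.tensor([-5.5, -6.2]), torch.tensor([-6.8, -6.4]))
print(loss)
```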
27. Catastrophic Forgetting
Fine-tuning on a narrow dataset overwrites general capabilities learned during pre-training. Mitigations: mix general data back in (replay), use a low learning rate, prefer PEFT/LoRA over full fine-tuning.
28. Data Cleaning for LLM
Deduplication (exact and near-duplicate, e.g. MinHash), quality and language filtering, PII removal, and decontamination against evaluation benchmarks.
29. Scaling Laws (Chinchilla)
For a fixed compute budget, model size and training tokens should grow together; Chinchilla found roughly 20 tokens per parameter to be compute-optimal. Earlier models (e.g. GPT-3) were undertrained: too many parameters for too little data.
30. Mixed Precision Training
Run forward/backward passes in FP16/BF16 while keeping an FP32 master copy of the weights; loss scaling prevents FP16 gradients from underflowing. Roughly halves memory and speeds up training on modern GPUs.
4. Agents & Prompt Engineering
31. Chain of Thought (CoT)
Prompt the model to reason step by step before giving the final answer ("Let's think step by step"). The intermediate reasoning tokens markedly improve accuracy on math and logic tasks.
32. ReAct Pattern (Reasoning + Acting)
Interleave Thought -> Action (tool call) -> Observation in a loop until the model reaches a final answer. Combines CoT reasoning with tool use; the basis of most LLM agents.
33. Tree of Thoughts (ToT)
Generalizes CoT into a search tree: generate several candidate thoughts per step, score them, and expand or backtrack (BFS/DFS) toward the most promising solution path. Stronger on planning and puzzle tasks, at much higher token cost.
34. Function Calling / Tools
Describe available tools as JSON schemas; the model emits a structured call (function name + arguments), your code executes it, and the result is fed back so the model can compose the final answer.
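A short sketch using the OpenAI Python SDK as one concrete example; the get_weather tool and model name are hypothetical placeholders:

```python
# pip install openai
import json
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",                    # hypothetical tool
        "description": "Get current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Weather in Paris?"}],
    tools=tools,
)
call = resp.choices[0].message.tool_calls[0]   # model emits name + JSON arguments
args = json.loads(call.function.arguments)      # -> {"city": "Paris"}; execute the tool,
                                                # then send the result back as a "tool" message
```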
35. System Prompt
A privileged instruction block that sets the model's role, tone, constraints, and output format before any user input. Treated with higher priority than user messages, though it is not a security boundary.
36. Prompt Injection (Prevention)
Malicious input, including retrieved documents, that overrides the system instructions ("ignore previous instructions..."). Defenses: delimit and escape untrusted content, filter inputs and outputs, grant tools least privilege, and never place secrets in the prompt.
37. Agent Memory
- Short-term: Context Window.
- Long-term: Vector DB (store experiences/reflections, retrieve when relevant).
- Working Memory: Scratchpad.
38. AutoGPT / BabyAGI
Early autonomous-agent frameworks: the LLM decomposes a goal into tasks, executes them with tools, stores results, and re-plans in a loop. Influential as a pattern, but prone to loops, drift, and runaway cost.
39. Few-Shot Prompting Selection
Choose in-context examples dynamically instead of hard-coding them: retrieve the k labeled examples most similar to the current query (by embedding similarity) and balance classes and formats among them.
40. Structured Output (JSON Mode)
Force the model to emit machine-parseable output via a constrained decoding mode (e.g. OpenAI's response_format), grammar-constrained sampling, or schema validation with retries (e.g. Pydantic). Essential when downstream code consumes the output.
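A sketch combining OpenAI's JSON mode with Pydantic validation; the model name and schema are illustrative:

```python
import json
from openai import OpenAI
from pydantic import BaseModel

class Person(BaseModel):          # the schema we expect back
    name: str
    age: int

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},   # constrains decoding to valid JSON
    messages=[
        {"role": "system", "content": "Reply as JSON with keys: name, age."},
        {"role": "user", "content": "Ada Lovelace died in 1852 at age 36."},
    ],
)
person = Person(**json.loads(resp.choices[0].message.content))  # raises on schema mismatch
```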
5. Deployment & Evaluation
41. KV Cache
During autoregressive decoding, cache every layer's Key and Value tensors for already-processed tokens, so each new token computes attention against the cache instead of re-encoding the whole sequence. Cuts per-token cost from O(N^2) to O(N), at the price of GPU memory.
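A single-head, single-layer NumPy sketch of the decoding loop; real implementations cache per layer and per head:

```python
import numpy as np

d = 8
k_cache, v_cache = [], []          # grows by one entry per generated token

def decode_step(x, Wq, Wk, Wv):
    """One decoding step: compute K/V for the new token only, reuse the cache."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    k_cache.append(k); v_cache.append(v)
    K, V = np.stack(k_cache), np.stack(v_cache)   # all past tokens, never recomputed
    scores = K @ q / np.sqrt(d)                    # O(N) work per step instead of O(N^2)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
for _ in range(5):                                 # simulate 5 decoding steps
    out = decode_step(rng.normal(size=d), Wq, Wk, Wv)
print(out.shape, len(k_cache))                     # (8,) 5
```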
42. Speculative Decoding
A small, fast draft model proposes several tokens; the large model verifies them all in a single parallel forward pass and accepts the longest correct prefix. Typically 2-3x faster decoding with an output distribution identical to the large model alone.
43. vLLM / PagedAttention
vLLM is a high-throughput serving engine built on PagedAttention: the KV cache is stored in fixed-size blocks (like OS virtual-memory pages), eliminating fragmentation and enabling large batches and prefix sharing across requests.
44. LLM Evaluation Metrics
- Perplexity: exponentiated average negative log-likelihood of the next tokens (lower is better); see the sketch below.
- BLEU/ROUGE: n-gram overlap with a reference (weak proxies for meaning).
- LLM-as-a-Judge: use a strong model (e.g. GPT-4) to score outputs against a rubric.
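A minimal sketch of perplexity, computed from the probabilities a model assigned to the true next tokens (toy values):

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood) of the target tokens."""
    nll = -np.log(np.array(token_probs))
    return float(np.exp(nll.mean()))

# Probabilities the model assigned to the actual next tokens:
print(perplexity([0.5, 0.25, 0.8, 0.1]))   # ~3.2; a perfect model scores 1.0
```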
45. Ragas (RAG Assessment)
- Faithfulness: is the answer derived from the retrieved context (no hallucination)?
- Answer Relevance: does the answer address the query?
- Context Precision: are the retrieved chunks relevant to the query (and ranked highly)?
46. Batch Inference vs Streaming
- Streaming: send tokens as they are generated (SSE). Low TTFT (Time To First Token); perceived speed improves UX.
- Batch: offline processing of many requests together. Higher throughput, higher latency.
47. Cost Analysis (Token Economics)
Cost = input tokens x input price + output tokens x output price (output tokens usually cost more). Levers: shorter prompts, response caching, routing easy queries to smaller models, batching, context compression.
48. Guardrails
Validation layers around the LLM on both input and output: topic restriction, PII redaction, toxicity filters, jailbreak detection, schema checks. Tooling: NeMo Guardrails, Guardrails AI.
49. Model Merging
Combine the weights of multiple fine-tuned models without additional training, e.g. weight averaging (model soups), SLERP, or TIES-merging. A cheap way to blend capabilities from different fine-tunes of the same base model.
50. Continuous Pre-training
Further pre-training of a base model (next-token objective) on domain data such as medical, legal, or code corpora before SFT. Adapts knowledge and vocabulary to the domain, but risks catastrophic forgetting, so general data is usually mixed in.