Fine-Tuning & Deployment Interview Questions

40 essential interview questions on LLM fine-tuning with LoRA, QLoRA, PEFT, RLHF, DPO, and production deployment with vLLM, quantization, and serving infrastructure.

By TechCoder Team
Last updated: 2026-06-02

Fine-Tuning & LLM Deployment Interview Questions

Taking models from experimentation to production requires fine-tuning expertise and deployment engineering. These 40 questions cover LoRA/QLoRA, PEFT techniques, RLHF vs DPO, model serving with vLLM, quantization, and production infrastructure.


1. What is Fine-Tuning?

Fine-tuning takes a pre-trained model and further trains it on a smaller, domain-specific dataset. Unlike training from scratch, fine-tuning starts from existing knowledge and adapts it to specific tasks, styles, or domains.

2. When should you fine-tune vs. use prompting?

  • Fine-tune when: You need consistent tone/style or domain-specific vocabulary, have 100+ quality examples, want lower latency/cost at scale, or need better performance than prompting achieves.
  • Prompt when: Quick prototyping, few examples available, changing requirements, or using frontier models with strong instruction following.

3. What is Supervised Fine-Tuning (SFT)?

SFT trains a model on (instruction, response) pairs. The model learns to follow instructions and produce desired outputs. This is the first step in the post-training pipeline for chat models.

# SFT data format
sft_example = {
    "instruction": "Explain what a decorator is in Python",
    "response": "A decorator is a function that modifies another function..."
}

4. What is LoRA (Low-Rank Adaptation)?

LoRA is a parameter-efficient fine-tuning method. Instead of updating all model weights, LoRA adds small, trainable rank decomposition matrices to attention layers. Key benefits:

  • Trains 0.1-1% of total parameters
  • No added inference latency once the adapter is merged into the base weights
  • Multiple LoRA adapters can be swapped for different tasks
  • Much lower memory requirements
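
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Load the frozen base model (the checkpoint name is only an example)
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,  # Rank (higher = more capacity, more memory)
    lora_alpha=32,  # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)
# Wrap the base model so that only the LoRA adapter weights are trainable
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # Report trainable vs. total parameter counts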

5. What is QLoRA?

QLoRA (Quantized LoRA) quantizes the frozen base model to 4-bit precision, then trains LoRA adapters on top of it. This dramatically reduces the memory needed for the weights:

  • Full 7B model: ~14GB (FP16)
  • QLoRA 7B: ~4GB (4-bit)
  • Enables fine-tuning 7B/13B models on consumer GPUs (single RTX 3090/4090)
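
The figures above follow from bytes-per-parameter arithmetic. A minimal sketch, counting the weights only: it ignores activations, optimizer state, the LoRA adapter, and quantization constants, so real usage is somewhat higher. The helper name is hypothetical.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

fp16_gb = weight_memory_gb(7e9, 16)  # 7B parameters at 2 bytes each
int4_gb = weight_memory_gb(7e9, 4)   # 7B parameters at 0.5 bytes each
print(f"FP16: {fp16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB")
```

The 4-bit result (3.5 GB) is slightly below the ~4GB quoted above because of the overheads this sketch leaves out.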

6. What is PEFT?

PEFT (Parameter-Efficient Fine-Tuning) is a family of techniques that fine-tune models without updating all parameters. Includes LoRA, Prefix Tuning, Prompt Tuning, Adapters, IA³.

7. What is the difference between LoRA, Prefix Tuning, and Prompt Tuning?

Technique        What it does                                                  Parameters
LoRA             Adds trainable matrices to attention/FFN layers               0.1-1% of model
Prefix Tuning    Prepends learnable vectors to each transformer layer          <1%
Prompt Tuning    Adds learnable tokens to input embeddings                     <0.01%
Adapters         Inserts small bottleneck layers between transformer blocks    1-5%

8. What is the LoRA rank (r) parameter?

Rank (r) controls LoRA's capacity. Higher r = more trainable parameters = more capacity to learn complex patterns, but higher memory/compute. Typical values: r=8 (light), r=16 (balanced), r=64 (heavy). Diminishing returns beyond r=64 for most tasks.
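
To make the effect of r concrete, here is a rough sketch of how many trainable parameters LoRA adds to a single weight matrix. It assumes the standard factorization into A (r × d_in) and B (d_out × r); the hidden size of 4096 is a hypothetical value chosen for illustration.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in weight matrix."""
    # A has shape (r, d_in) and B has shape (d_out, r).
    return r * (d_in + d_out)

d = 4096  # hypothetical hidden size
full = d * d  # parameters in the frozen matrix itself
for r in (8, 16, 64):
    added = lora_param_count(d, d, r)
    print(f"r={r:>2}: {added:,} trainable vs {full:,} frozen ({added / full:.2%})")
```

The adapter size grows linearly with r, which is why doubling the rank doubles adapter memory while the frozen matrix stays the same.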

9. What is RLHF (Reinforcement Learning from Human Feedback)?

RLHF is a 3-stage process:

  1. SFT: Fine-tune on high-quality instruction-response pairs
  2. Reward Model: Train a model to predict human preference scores for outputs
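  3. PPO: Optimize the policy model to maximize the reward model's score using Proximal Policy Optimization, typically with a KL penalty that keeps the policy close to the SFT model

Variants of this pipeline were used to train ChatGPT, Claude, and Llama-2-Chat.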

10. What is DPO (Direct Preference Optimization)?

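DPO simplifies RLHF by eliminating the separate reward model and the RL loop. Instead, it directly optimizes the policy on preference pairs (chosen vs. rejected responses) with a classification-style loss derived from the RLHF objective, using a frozen reference model (usually the SFT model) as the anchor. It is simpler to implement and typically more stable than PPO-based RLHF, which has made it increasingly popular.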
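
A minimal sketch of the DPO loss for a single preference pair, assuming the sequence log-probabilities under the policy and under a frozen reference model have already been computed. The function name and the numbers are illustrative; `beta` is the DPO temperature hyperparameter, and 0.1 is just an example value.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # Implicit rewards: how much more likely the policy makes each response
    # than the frozen reference model does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), computed in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(baseline, improved)
```

The loss only depends on how the policy's preference between the two responses differs from the reference model's, which is what lets DPO skip the explicit reward model.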

11. What is LoRA alpha?

LoRA alpha is the scaling factor. Delta weights are scaled by alpha / r. Higher alpha = stronger adaptation. Common: alpha = 2× r (r=16, alpha=32).
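
The scaling can be seen on a toy example. This sketch merges a rank-1 adapter into a 2×2 weight matrix as W + (alpha / r) · B·A, using plain Python lists; all values and helper names are made up for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows in A."""
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight
A = [[1.0, 2.0]]               # shape (r=1, d_in=2)
B = [[0.5], [0.25]]            # shape (d_out=2, r=1)
merged = merge_lora(W, A, B, alpha=2)
print(merged)
```

Because the update is multiplied by alpha / r, raising r without raising alpha shrinks the per-component contribution, which is one reason alpha is often tied to r.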

12. How do you prepare data for fine-tuning?

  • Format: Conversation format (user/assistant pairs) or instruction/response.
  • Quality: Curate diverse, high-quality examples. Quality > quantity.
  • Quantity: 100-1000 examples for LoRA, 1000-10000+ for full fine-tuning.
  • Coverage: Include edge cases, variations, and refusal examples.
  • No duplicates, no contradictions.

13. What is full fine-tuning vs. LoRA?

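  • Full fine-tuning: All weights are updated. Requires massive GPU memory: with Adam in mixed precision, weights, gradients, and optimizer state take roughly 16 bytes per parameter, so a 7B model needs on the order of 100GB before activations (about 8× its FP16 size). Best quality but expensive.
  • LoRA: Only adapter weights are trained. A single GPU is often sufficient for 7B-13B models, and for models up to 70B when combined with 4-bit quantization (QLoRA) on a high-memory GPU. Slightly lower quality ceiling but vastly more practical.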

14. What is the difference between training loss and validation loss?

  • Training loss: Error on training data. Should decrease over time.
  • Validation loss: Error on held-out data. Monitors overfitting. If training loss decreases but validation loss increases → overfitting.
  • Optimal stopping: Save checkpoint at minimum validation loss.
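
The stopping rule above can be sketched as patience-based early stopping. The function name, the `patience` default, and the loss values are illustrative assumptions.

```python
def best_checkpoint(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for per-epoch validation losses.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs. Epochs are 0-indexed.
    """
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, bad_epochs = epoch, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return best_epoch, epoch
    return best_epoch, len(val_losses) - 1

val = [2.10, 1.80, 1.65, 1.70, 1.78, 1.90]  # made-up validation losses
print(best_checkpoint(val))  # (2, 4): keep epoch 2's checkpoint, stop at epoch 4
```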

15. What is overfitting in fine-tuning?

Overfitting occurs when the model memorizes training examples instead of learning generalizable patterns. Symptoms:

  • Training loss continues decreasing
  • Validation loss starts increasing
  • Model regurgitates training data verbatim
  • Poor performance on new inputs

Prevention: More diverse data, dropout (lora_dropout), early stopping, lower learning rate, fewer epochs.

16. What is catastrophic forgetting?

Catastrophic forgetting occurs when fine-tuning on new data degrades the model's performance on previously learned tasks. Mitigation: mix in general data (data mixing), lower learning rate, LoRA (preserves base model weights), elastic weight consolidation.

17. What are the key hyperparameters for fine-tuning?

  • Learning rate: 1e-4 to 5e-4 for LoRA, 1e-5 to 5e-5 for full fine-tuning
  • Epochs: 1-5 for LoRA, 1-3 for full fine-tuning
  • Batch size: As large as GPU memory allows (4-64)
  • LoRA rank: 8-64
  • LoRA alpha: Usually 2× rank
  • Warmup ratio: 0.03-0.1
  • Weight decay: 0.01-0.1

18. What is gradient accumulation?

Gradient accumulation simulates larger batch sizes by accumulating gradients over multiple forward/backward passes before updating weights. Enables training with effective batch size larger than GPU memory allows.
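
A small sketch of why this works, using a one-parameter model y = w·x with mean squared error. With equal-sized micro-batches, summing each micro-batch gradient divided by the number of accumulation steps reproduces the full-batch gradient. The data and names are made up.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the model y = w * x over a batch."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(1.0, 2.0), (2.0, 4.5), (3.0, 5.5), (4.0, 8.5)]  # made-up (x, y) pairs
w = 0.5

# One large batch of 4 examples.
full_grad = grad_mse(w, data)

# Two micro-batches of 2: accumulate, dividing each gradient by the number
# of steps so the sum matches the mean over the full batch.
accum_steps = 2
micro_batches = [data[0:2], data[2:4]]
accum_grad = 0.0
for mb in micro_batches:
    accum_grad += grad_mse(w, mb) / accum_steps

print(full_grad, accum_grad)
```

Only one micro-batch of activations is held in memory at a time, while the optimizer step sees the same gradient as the larger batch would produce.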

19. What is mixed precision training?

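Mixed precision runs most operations in FP16/BF16 while keeping master weights in FP32. It roughly halves activation memory and speeds up training on GPUs with tensor cores, which makes it essential for large model fine-tuning. With the Hugging Face Trainer, enable it with bf16=True in TrainingArguments; passing torch_dtype=torch.bfloat16 at load time instead stores the weights themselves in BF16.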

20. What is quantization?

Quantization reduces numerical precision of model weights:

  • FP32 → FP16: 2× reduction, negligible quality loss
  • FP16 → INT8: 2× reduction (4× vs. FP32), minimal quality loss
  • FP16 → INT4: 4× reduction (8× vs. FP32), measurable quality loss
  • GGUF: llama.cpp format, CPU-friendly
  • GPTQ/AWQ: GPU-optimized quant formats
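
As an illustration of the basic idea, here is a sketch of symmetric absmax INT8 quantization in plain Python. This is one simple scheme, not the algorithm used by GPTQ, AWQ, or GGUF; names and values are made up.

```python
def quantize_int8(values):
    """Symmetric absmax quantization of a list of floats to int8 codes."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map int8 codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.12, -0.5, 0.33, 0.9, -1.27]  # made-up weights
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

Each value is stored as one byte plus a shared scale, and the round-trip error is bounded by half the scale; lower bit widths enlarge the scale and therefore the error.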

21. What is vLLM?

vLLM is a high-throughput LLM serving engine:

  • PagedAttention: Efficient KV-cache management, near-zero waste
  • Continuous batching: Dynamically batches requests as they arrive
  • High throughput: reported gains of up to 10-20× over naive implementations
  • OpenAI-compatible API
  • Supports most open-source models

# Serve Llama 3 8B with vLLM
vllm serve meta-llama/Meta-Llama-3-8B-Instruct

22. What is PagedAttention?

PagedAttention (from vLLM) manages the KV-cache the way an OS manages virtual memory: in fixed-size blocks (pages) that do not need to be contiguous. This nearly eliminates KV-cache fragmentation and enables memory sharing across sequences, and it is a key contributor to vLLM's reported throughput gains.
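
The block-table idea can be sketched with a toy allocator. This is a simplified illustration, not vLLM's implementation: it tracks only which physical blocks each sequence owns and omits the memory sharing across sequences. All names are hypothetical.

```python
class PagedKVCache:
    """Toy block allocator: each sequence maps logical blocks to physical ones."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.lengths = {}        # seq_id -> number of cached tokens

    def append_token(self, seq_id: str) -> None:
        """Reserve cache space for one more token of a sequence."""
        length = self.lengths.get(seq_id, 0)
        table = self.block_tables.setdefault(seq_id, [])
        if length % self.block_size == 0:  # last block is full (or none yet)
            if not self.free_blocks:
                raise MemoryError("no free KV-cache blocks")
            table.append(self.free_blocks.pop(0))
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated but unused token slots (only in each sequence's last block)."""
        return sum(len(t) * self.block_size - self.lengths[s]
                   for s, t in self.block_tables.items())

cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("a")   # 5 tokens -> 2 blocks
for _ in range(3):
    cache.append_token("b")   # 3 tokens -> 1 block
print(cache.block_tables, cache.wasted_slots())
```

Because blocks are allocated on demand, waste is limited to the unfilled tail of each sequence's last block, rather than a whole pre-reserved maximum-length buffer.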

23. What is Continuous Batching?

Continuous batching schedules work at the level of individual decoding steps instead of waiting for a full batch to form and finish. When one request finishes, a new one immediately takes its slot. This reduces queueing latency and increases GPU utilization and throughput compared with static batching.
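
A toy simulation of the difference, counting decode steps only. It assumes all requests are queued at time zero, every request needs at least one token, and each step produces one token per in-flight request; the output lengths are made up.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Total decode steps when each batch runs until its longest request ends."""
    return sum(max(lengths[i:i + batch_size])
               for i in range(0, len(lengths), batch_size))

def continuous_batching_steps(lengths, batch_size):
    """Total decode steps when a freed slot is refilled on the next step."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Fill any free slots from the waiting queue.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        # Generate one token per active request and drop finished ones.
        active = [n - 1 for n in active if n - 1 > 0]
        steps += 1
    return steps

lengths = [2, 8, 3, 2, 7, 2]  # made-up output lengths
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)  # 18 14
```

In the static case short requests sit idle in their slot until the longest request in the batch finishes; the continuous scheduler reuses those slots immediately.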

24. What is Hugging Face Transformers?

Hugging Face Transformers is the de-facto standard library for working with transformer models. Provides:

  • Thousands of pre-trained models
  • Unified API: AutoModel, AutoTokenizer, pipeline()
  • Training utilities: Trainer, TrainingArguments
  • Integration with PEFT, bitsandbytes, accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

25. What is bitsandbytes?

bitsandbytes provides CUDA kernels for 4-bit and 8-bit quantization. It is used with Hugging Face Transformers to load large models in quantized form. Loading with load_in_4bit=True reduces a 7B model's weights from ~14GB (FP16) to ~4GB.

26. What is accelerate?

Hugging Face Accelerate simplifies running models across multiple GPUs or TPUs and with mixed precision, without rewriting the training code. Through its integration with Transformers, device_map="auto" automatically distributes layers across available devices.

27. What is a GGUF model?

GGUF (GPT-Generated Unified Format) is a file format for storing quantized models for use with llama.cpp. CPU-friendly, no GPU required. Hugging Face hub hosts thousands of GGUF variants. Used by Ollama, LM Studio, llama.cpp.

28. What is model sharding?

Sharding splits a model across multiple GPUs or machines when it doesn't fit on one:

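  • Tensor Parallelism: Split the weight matrices inside each layer across GPUs
  • Pipeline Parallelism: Assign consecutive groups of layers (stages) to different GPUs
  • Data Parallelism: Replicate the whole model and split the data (not sharding in itself, but often combined with the two above)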
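
Tensor parallelism can be illustrated with a column-split matrix multiply: each shard holds a slice of the weight columns, computes its part of the output, and the parts are concatenated. This sketch runs the shards sequentially in plain Python and assumes the column count divides evenly; the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def column_parallel_matmul(x, w, num_shards):
    """Split W by columns, multiply each shard separately, then concatenate."""
    cols = len(w[0])
    shard_width = cols // num_shards  # assumes cols divides evenly
    outputs = []
    for s in range(num_shards):
        w_shard = [row[s * shard_width:(s + 1) * shard_width] for row in w]
        outputs.append(matmul(x, w_shard))  # on real hardware: one GPU per shard
    # Gather step: stitch the per-shard output columns back together.
    return [sum((out[i] for out in outputs), []) for i in range(len(x))]

x = [[1.0, 2.0]]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
sharded = column_parallel_matmul(x, w, num_shards=2)
print(sharded)  # [[11.0, 14.0, 17.0, 20.0]]
```

Each shard stores only its slice of the weights, at the cost of a communication step to gather the outputs.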

29. What is speculative decoding for deployment?

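A small, fast draft model generates several candidate tokens; the large target model verifies them in a single parallel forward pass and keeps the longest accepted prefix, so the output distribution matches the target model's. Reported speedups are typically 2-3× in serving. Implementations and variants: Medusa, SpecInfer, vLLM speculative decoding.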
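
A toy sketch of the greedy variant, using deterministic next-token functions in place of real models. The verification loop calls the target function position by position; in a real system those checks happen in one batched forward pass. All names and the toy "models" are assumptions made for illustration.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k=4):
    """Greedy speculative decoding with toy next-token functions.

    Returns (tokens, target_passes), where target_passes counts verification
    rounds (each would be one forward pass of the large model).
    """
    tokens = list(prompt)
    target_passes = 0
    goal = len(prompt) + max_new_tokens
    while len(tokens) < goal:
        # 1. The draft model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # 2. The target model checks the proposed positions.
        target_passes += 1
        accepted = []
        for proposed in draft:
            expected = target_next(tokens + accepted)
            if proposed != expected:
                accepted.append(expected)  # replace the first mismatch
                break
            accepted.append(proposed)
        else:
            accepted.append(target_next(tokens + accepted))  # bonus token
        tokens.extend(accepted)
    return tokens[:goal], target_passes

def target_next(toks):
    """Toy target model: count upward modulo 10."""
    return (toks[-1] + 1) % 10

def draft_next(toks):
    """Toy draft model: agrees with the target except at every third position."""
    return 0 if len(toks) % 3 == 0 else (toks[-1] + 1) % 10

output, passes = speculative_decode(target_next, draft_next, [0], max_new_tokens=12)
print(output, passes)
```

Here the 12 new tokens are identical to what the target alone would produce, but they take 4 verification rounds instead of 12 sequential target steps.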

30. How do you serve a fine-tuned LoRA model?

  1. Merge LoRA weights into the base model: model = model.merge_and_unload()
  2. Save the merged model: model.save_pretrained(output_dir)
  3. Serve with vLLM or TGI (Text Generation Inference)

Multi-LoRA serving: alternatively, vLLM can serve multiple unmerged LoRA adapters on top of one base model simultaneously.

31. What is TGI (Text Generation Inference)?

TGI (Hugging Face) is a production serving solution:

  • Tensor parallelism
  • Continuous batching
  • Watermarking
  • Logits warping
  • gRPC + REST APIs
  • Optimized for Hugging Face models

32. What metrics do you monitor in production?

  • Latency: TTFT (time to first token), TBT (time between tokens), total response time
  • Throughput: Requests/second, tokens/second
  • GPU utilization: SM%, memory utilization
  • Error rate: API errors, timeouts
  • Cost: Tokens/$, GPU hours, per-request cost
  • Quality: User feedback, automated eval scores
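
The latency metrics above can be derived from per-token timestamps. A minimal sketch, assuming timestamps in seconds and at least one generated token; the function name and sample values are made up.

```python
def latency_metrics(request_time, token_times):
    """Compute TTFT, mean TBT, and total latency from timestamps in seconds."""
    ttft = token_times[0] - request_time
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    mean_tbt = sum(gaps) / len(gaps) if gaps else 0.0
    total = token_times[-1] - request_time
    return {
        "ttft": ttft,
        "mean_tbt": mean_tbt,
        "total": total,
        "tokens_per_second": len(token_times) / total,
    }

m = latency_metrics(0.0, [0.25, 0.30, 0.35, 0.40, 0.45])
print(m)
```

TTFT is dominated by queueing and prompt processing, while TBT reflects decode speed, so tracking them separately helps locate where latency comes from.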

33. What is a model registry?

A model registry stores and versions trained models. Examples: Hugging Face Hub, MLflow, W&B Model Registry. Tracks: model weights, config, training metadata, evaluation results, deployment status.

34. What is A/B testing for LLM deployments?

A/B testing routes traffic between model versions:

  • Split: 10% → new model, 90% → current model
  • Metrics: Compare user satisfaction, accuracy, latency, cost
  • Gradual rollout: Increase new model traffic as confidence grows
  • Requires consistent evaluation framework
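
One common way to implement the split is deterministic hashing, so that a given user always sees the same model version. A sketch under that assumption; the function name, the user-ID format, and the 10% default are illustrative.

```python
import hashlib

def assign_variant(user_id: str, new_model_fraction: float = 0.10) -> str:
    """Deterministically route a user to 'new' or 'current'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    return "new" if bucket < new_model_fraction else "current"

assignments = [assign_variant(f"user-{i}") for i in range(10_000)]
share_new = assignments.count("new") / len(assignments)
print(f"share routed to new model: {share_new:.3f}")
```

A gradual rollout then amounts to raising `new_model_fraction`: users already in the new group stay there, and more users join it.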

35. How do you handle cold starts for LLM serving?

Cold start: loading model into GPU memory takes 30s-2min. Strategies:

  • Keep models pre-loaded (warm)
  • Use smaller models for initial response, switch to larger
  • Lazy loading with request queuing
  • Use spot instances with pre-warmed AMIs

36. What is the cost structure for LLM deployment?

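  • GPU costs (approximate cloud rental prices; they vary by provider and change often): A100 (~$1.50-3/hr), H100 (~$2-4/hr)
  • Inference APIs (approximate): OpenAI (~$5-30/M tokens), Together AI ($0.10-1/M tokens)
  • Self-hosted: GPU + infra + engineering time
  • Break-even: As a rough rule of thumb, self-hosting becomes cheaper at ~50M+ tokens/day for 7B models; the actual threshold depends on GPU price, sustained throughput, and the API price being replaced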
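
A back-of-the-envelope sketch of the break-even arithmetic. It covers GPU rental only and ignores infrastructure and engineering time; the $2.00/hr price, 1,000 tokens/s throughput, and $1.00/M API price are hypothetical inputs, not measurements.

```python
def self_hosted_cost_per_million(gpu_hourly_usd: float, tokens_per_second: float) -> float:
    """USD per million tokens for a GPU kept fully busy at a given throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

def breakeven_tokens_per_day(gpu_hourly_usd: float, api_usd_per_million: float) -> float:
    """Daily volume above which an always-on GPU is cheaper than the API."""
    daily_gpu_cost = gpu_hourly_usd * 24
    return daily_gpu_cost / api_usd_per_million * 1_000_000

cost = self_hosted_cost_per_million(2.00, 1000)
volume = breakeven_tokens_per_day(2.00, 1.00)
print(f"${cost:.2f} per million tokens at full load")
print(f"break-even at {volume:,.0f} tokens/day")
```

With these inputs the break-even lands at 48M tokens/day, the same order of magnitude as the rule of thumb above; lower utilization or a cheaper API pushes it higher.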

37. What is ONNX Runtime?

ONNX (Open Neural Network Exchange) provides a standardized format for models, enabling deployment across different hardware and frameworks. ONNX Runtime optimizes inference with graph optimizations, quantization, and hardware-specific acceleration.

38. What is TensorRT?

TensorRT (NVIDIA) optimizes and deploys models on NVIDIA GPUs. Features: layer fusion, precision calibration (FP16/INT8), dynamic shapes, and kernel auto-tuning. Reported speedups are typically in the 2-5× range over native PyTorch inference, depending on the model and precision.

39. How do you evaluate fine-tuned model quality?

  • Automatic: BLEU, ROUGE, perplexity, RAGAS (if RAG), LLM-as-a-judge (e.g., GPT-4 scoring outputs)
  • Human Evaluation: Side-by-side comparisons, Likert scale ratings, A/B preference tests
  • Task-specific: Accuracy for classification, exact match for QA, BLEU for translation
  • Safety evaluation: Harmful output rate, refusal rate for unsafe prompts

40. What is the LLM deployment checklist?

  1. Model Selection: Right size, open-source vs API
  2. Quantization: 4-bit/8-bit for serving efficiency
  3. Serving Engine: vLLM, TGI, TensorRT-LLM
  4. Scaling: Autoscaling based on queue depth/GPU utilization
  5. Monitoring: Prometheus + Grafana for latency/throughput/errors
  6. Safety: Content filtering, rate limiting, authentication
  7. Caching: Semantic cache for frequent queries
  8. Fallbacks: Retry logic, model fallback, graceful degradation
  9. CI/CD: Automated testing, canary deployments, rollback capability