Gen-ai Interview Questions

Fine-Tuning & Deployment Interview Questions

40 essential interview questions on LLM fine-tuning with LoRA, QLoRA, PEFT, RLHF, DPO, and production deployment with vLLM, quantization, and serving infrastructure.

By TechCoder TeamLast updated: 2026-06-23

In a Nutshell

40 essential interview questions on LLM fine-tuning with LoRA, QLoRA, PEFT, RLHF, DPO, and production deployment with vLLM, quantization, and serving infrastructure. This interview-focused guide covers essential fine-tuning & deployment interview questions concepts for technical interviews.

Fine-Tuning & LLM Deployment Interview Questions

Taking models from experimentation to production requires fine-tuning expertise and deployment engineering. These 40 questions cover LoRA/QLoRA, PEFT techniques, RLHF vs DPO, model serving with vLLM, quantization, and production infrastructure.

1. What is Fine-Tuning?

Fine-tuning takes a pre-trained model and further trains it on a smaller, domain-specific dataset. Unlike training from scratch, fine-tuning starts from existing knowledge and adapts it to specific tasks, styles, or domains.

2. When should you fine-tune vs. use prompting?

Fine-tune when: You need consistent tone/style, have 100+ quality examples, want lower latency/cost at scale, domain-specific vocabulary, or better performance than prompting achieves.
Prompt when: Quick prototyping, few examples available, changing requirements, or using frontier models with strong instruction following.

3. What is Supervised Fine-Tuning (SFT)?

SFT trains a model on (instruction, response) pairs. The model learns to follow instructions and produce desired outputs. This is the first step in the post-training pipeline for chat models.

# SFT data format
sft_example = {
    "instruction": "Explain what a decorator is in Python",
    "response": "A decorator is a function that modifies another function..."
}

4. What is LoRA (Low-Rank Adaptation)?

LoRA is a parameter-efficient fine-tuning method. Instead of updating all model weights, LoRA adds small, trainable rank decomposition matrices to attention layers. Key benefits:

Trains 0.1-1% of total parameters
No inference latency increase (merged into weights)
Multiple LoRA adapters can be swapped for different tasks
Much lower memory requirements

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,  # Rank (higher = more capacity, more memory)
    lora_alpha=32,  # Scaling factor
    target_modules=["q_proj", "v_proj"],  # Which layers to adapt
    lora_dropout=0.1,
    task_type="CAUSAL_LM"
)
model = get_peft_model(base_model, lora_config)

5. What is QLoRA?

QLoRA = Quantized LoRA. Quantizes the base model to 4-bit precision, then applies LoRA on top. Dramatically reduces memory:

Full 7B model: ~14GB (FP16)
QLoRA 7B: ~4GB (4-bit)
Enables fine-tuning 7B/13B models on consumer GPUs (single RTX 3090/4090)

6. What is PEFT?

PEFT (Parameter-Efficient Fine-Tuning) is a family of techniques that fine-tune models without updating all parameters. Includes LoRA, Prefix Tuning, Prompt Tuning, Adapters, IA³.

7. What is the difference between LoRA, Prefix Tuning, and Prompt Tuning?

Technique	What it does	Parameters
LoRA	Adds trainable matrices to attention/FFN layers	0.1-1% of model
Prefix Tuning	Prepends learnable vectors to each transformer layer	<1%
Prompt Tuning	Adds learnable tokens to input embeddings	<0.01%
Adapters	Inserts small bottleneck layers between transformer blocks	1-5%

8. What is the LoRA rank (r) parameter?

Rank (r) controls LoRA's capacity. Higher r = more trainable parameters = more capacity to learn complex patterns, but higher memory/compute. Typical values: r=8 (light), r=16 (balanced), r=64 (heavy). Diminishing returns beyond r=64 for most tasks.

9. What is RLHF (Reinforcement Learning from Human Feedback)?

RLHF is a 3-stage process:

SFT: Fine-tune on high-quality instruction-response pairs
Reward Model: Train a model to predict human preference scores for outputs
PPO: Optimize the policy model to maximize reward model score using Proximal Policy Optimization This is how ChatGPT, Claude, and Llama-2-Chat were trained.

10. What is DPO (Direct Preference Optimization)?

DPO simplifies RLHF by eliminating the separate reward model. Instead, it directly optimizes the policy using preference pairs (chosen vs. rejected responses) with a loss function derived from the RLHF objective. Simpler, more stable, increasingly popular.

11. What is LoRA alpha?

LoRA alpha is the scaling factor. Delta weights are scaled by alpha / r. Higher alpha = stronger adaptation. Common: alpha = 2× r (r=16, alpha=32).

12. How do you prepare data for fine-tuning?

Format: Conversation format (user/assistant pairs) or instruction/response.
Quality: Curate diverse, high-quality examples. Quality > quantity.
Quantity: 100-1000 examples for LoRA, 1000-10000+ for full fine-tuning.
Coverage: Include edge cases, variations, and refusal examples.
No duplicates, no contradictions.

13. What is full fine-tuning vs. LoRA?

Full fine-tuning: All weights updated. Requires massive GPU memory (7B model = 7× model size). Best quality but expensive.
LoRA: Only adapter weights trained. 1 GPU sufficient for 7B-70B models. Slightly lower quality ceiling but vastly more practical.

14. What is the difference between training loss and validation loss?

Training loss: Error on training data. Should decrease over time.
Validation loss: Error on held-out data. Monitors overfitting. If training loss decreases but validation loss increases → overfitting.
Optimal stopping: Save checkpoint at minimum validation loss.

15. What is overfitting in fine-tuning?

Overfitting occurs when the model memorizes training examples instead of learning generalizable patterns. Symptoms:

Training loss continues decreasing
Validation loss starts increasing
Model regurgitates training data verbatim
Poor performance on new inputs

Prevention: More diverse data, dropout (lora_dropout), early stopping, lower learning rate, fewer epochs.

16. What is catastrophic forgetting?

Catastrophic forgetting occurs when fine-tuning on new data degrades the model's performance on previously learned tasks. Mitigation: mix in general data (data mixing), lower learning rate, LoRA (preserves base model weights), elastic weight consolidation.

17. What are the key hyperparameters for fine-tuning?

Learning rate: 1e-4 to 5e-4 for LoRA, 1e-5 to 5e-5 for full fine-tuning
Epochs: 1-5 for LoRA, 1-3 for full fine-tuning
Batch size: As large as GPU memory allows (4-64)
LoRA rank: 8-64
LoRA alpha: Usually 2× rank
Warmup ratio: 0.03-0.1
Weight decay: 0.01-0.1

18. What is gradient accumulation?

Gradient accumulation simulates larger batch sizes by accumulating gradients over multiple forward/backward passes before updating weights. Enables training with effective batch size larger than GPU memory allows.

19. What is mixed precision training?

Mixed precision uses FP16/BF16 for most operations while keeping master weights in FP32. ~2× memory reduction, faster training. Essential for large model fine-tuning. Use torch_dtype=torch.bfloat16.

20. What is quantization?

Quantization reduces numerical precision of model weights:

FP32 → FP16: 2× reduction, negligible quality loss
FP16 → INT8: 4× reduction, minimal quality loss
FP16 → INT4: 8× reduction, measurable quality loss
GGUF: llama.cpp format, CPU-friendly
GPTQ/AWQ: GPU-optimized quant formats

21. What is vLLM?

vLLM is a high-throughput LLM serving engine:

PagedAttention: Efficient KV-cache management, near-zero waste
Continuous batching: Dynamically batches requests as they arrive
High throughput: 10-20× vs. naive implementations
OpenAI-compatible API
Supports most open-source models

# Serve Llama 3 8B with vLLM
vllm serve meta-llama/Meta-Llama-3-8B-Instruct

22. What is PagedAttention?

PagedAttention (from vLLM) manages KV-cache like OS virtual memory — in fixed-size pages. Eliminates KV-cache fragmentation and enables memory sharing across sequences. Key to vLLM's 20× throughput improvement.

23. What is Continuous Batching?

Continuous batching processes requests as they arrive instead of waiting for a full batch. When one request finishes, a new one immediately takes its slot. Dramatically reduces latency and increases throughput vs. static batching.

24. What is Hugging Face Transformers?

Hugging Face Transformers is the de-facto standard library for working with transformer models. Provides:

Thousands of pre-trained models
Unified API: AutoModel, AutoTokenizer, pipeline()
Training utilities: Trainer, TrainingArguments
Integration with PEFT, bitsandbytes, accelerate

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

25. What is bitsandbytes?

bitsandbytes provides CUDA kernels for 4-bit and 8-bit quantization. Used with Hugging Face Transformers for loading large models in quantized form. load_in_4bit=True reduces 7B model from 14GB → ~4GB.

26. What is accelerate?

Hugging Face Accelerate simplifies running models across multiple GPUs, TPUs, or mixed precision. device_map="auto" automatically distributes layers across available devices.

27. What is a GGUF model?

GGUF (GPT-Generated Unified Format) is a file format for storing quantized models for use with llama.cpp. CPU-friendly, no GPU required. Hugging Face hub hosts thousands of GGUF variants. Used by Ollama, LM Studio, llama.cpp.

28. What is model sharding?

Sharding splits a model across multiple GPUs or machines when it doesn't fit on one:

Tensor Parallelism: Split individual layers across GPUs
Pipeline Parallelism: Split model layers across GPUs
Data Parallelism: Replicate model, split data

29. What is speculative decoding for deployment?

A draft model (small, fast) generates candidate tokens; the target model (large) verifies them in parallel. Typical 2-3× speedup in serving. Implementations: Medusa, SpecInfer, vLLM speculative decoding.

30. How do you serve a fine-tuned LoRA model?

Merge LoRA weights into base model: model.merge_and_unload()
Save merged model: model.save_pretrained()
Serve with vLLM or TGI (Text Generation Inference) Multi-LoRA serving: vLLM can serve multiple LoRA adapters from one base model simultaneously.

31. What is TGI (Text Generation Inference)?

TGI (Hugging Face) is a production serving solution:

Tensor parallelism
Continuous batching
Watermarking
Logits warping
gRPC + REST APIs
Optimized for Hugging Face models

32. What metrics do you monitor in production?

Latency: TTFT (time to first token), TBT (time between tokens), total response time
Throughput: Requests/second, tokens/second
GPU utilization: SM%, memory utilization
Error rate: API errors, timeouts
Cost: Tokens/$, GPU hours, per-request cost
Quality: User feedback, automated eval scores

33. What is a model registry?

A model registry stores and versions trained models. Examples: Hugging Face Hub, MLflow, W&B Model Registry. Tracks: model weights, config, training metadata, evaluation results, deployment status.

34. What is A/B testing for LLM deployments?

A/B testing routes traffic between model versions:

Split: 10% → new model, 90% → current model
Metrics: Compare user satisfaction, accuracy, latency, cost
Gradual rollout: Increase new model traffic as confidence grows
Requires consistent evaluation framework

35. How do you handle cold starts for LLM serving?

Cold start: loading model into GPU memory takes 30s-2min. Strategies:

Keep models pre-loaded (warm)
Use smaller models for initial response, switch to larger
Lazy loading with request queuing
Use spot instances with pre-warmed AMIs

36. What is the cost structure for LLM deployment?

GPU costs: A100 (~$1.50-3/hr), H100 (~$2-4/hr)
Inference APIs: OpenAI (~$5-30/M tokens), Together AI ($0.10-1/M tokens)
Self-hosted: GPU + infra + engineering time
Break-even: Self-hosting becomes cheaper at ~50M+ tokens/day for 7B models

37. What is ONNX Runtime?

ONNX (Open Neural Network Exchange) provides a standardized format for models, enabling deployment across hardware/different frameworks. ONNX Runtime optimizes inference with graph optimizations, quantization, and hardware-specific acceleration.

38. What is TensorRT?

TensorRT (NVIDIA) optimizes and deploys models on NVIDIA GPUs. Features: layer fusion, precision calibration (FP16/INT8), dynamic shapes, and kernel auto-tuning. 2-5× speedup over native PyTorch inference.

39. How do you evaluate fine-tuned model quality?

Automatic: BLEU, ROUGE, perplexity, RAGAS (if RAG), human eval models (GPT-4 as judge)
Human Evaluation: Side-by-side comparisons, Likert scale ratings, A/B preference tests
Task-specific: Accuracy for classification, exact match for QA, BLEU for translation
Safety evaluation: Harmful output rate, refusal rate for unsafe prompts

40. What is the LLM deployment checklist?

Model Selection: Right size, open-source vs API
Quantization: 4-bit/8-bit for serving efficiency
Serving Engine: vLLM, TGI, TensorRT-LLM
Scaling: Autoscaling based on queue depth/GPU utilization
Monitoring: Prometheus + Grafana for latency/throughput/errors
Safety: Content filtering, rate limiting, authentication
Caching: Semantic cache for frequent queries
Fallbacks: Retry logic, model fallback, graceful degradation
CI/CD: Automated testing, canary deployments, rollback capability

PYTHON PLAYGROUND

⏳ Loading editor…

AI Mentor

Assistant

Confused about "LLM fine-tuning with LoRA, QLoRA, PEFT, RLHF, DPO, model deployment with vLLM, quantization, serving infrastructure, and production monitoring"? Ask our AI mentor for a simplified explanation.

Quiz

Question 1 of 3

What is the primary advantage of LoRA over full fine-tuning?

Better model quality

Drastic reduction in trainable parameters and memory requirements

Faster inference speed

No training data needed