Gen AI Fundamentals Interview Questions

40 essential interview questions on LLMs, Transformers, GPT architecture, tokens, embeddings, attention mechanisms, and generative AI basics.

By TechCoder Team | Last updated: 2026-06-02

Master the core concepts behind modern Generative AI. These 40 questions cover the transformer architecture, GPT family evolution, tokenization, embeddings, attention mechanisms, and the math that powers large language models.


1. What is Generative AI?

Generative AI refers to AI systems that can create new content — text, images, code, audio, and video — by learning patterns from training data. Unlike discriminative models that classify or predict, generative models produce novel outputs. LLMs like GPT-4, image models like DALL-E, and code generators like Copilot are all examples.

# Simple generative text example using a pre-trained model
from transformers import pipeline
generator = pipeline('text-generation', model='gpt2')
output = generator("The future of AI is", max_length=30)
print(output[0]['generated_text'])

2. What is a Large Language Model (LLM)?

An LLM is a deep neural network (typically a transformer) trained on massive text corpora to predict the next token in a sequence. Key characteristics:

  • Billions of parameters (GPT-4's size is undisclosed; unofficial estimates put it around 1.7 trillion)
  • Trained on terabytes of text from the internet, books, and code
  • Exhibit emergent abilities: reasoning, translation, coding, summarization

3. What is the Transformer Architecture?

The Transformer (Vaswani et al., 2017) replaced RNNs with pure attention mechanisms. It consists of:

  • Encoder: Processes input sequence (used in BERT)
  • Decoder: Generates output autoregressively (used in GPT)
  • Self-Attention: Each token attends to all other tokens
  • Multi-Head Attention: Multiple parallel attention operations
  • Positional Encoding: Injects sequence order information
  • Feed-Forward Networks: Applied after attention layers

4. What is the difference between GPT and BERT?

| Feature | GPT (Decoder-only) | BERT (Encoder-only) |
|---|---|---|
| Architecture | Unidirectional (left-to-right) | Bidirectional |
| Task | Text generation | Text understanding |
| Training | Next token prediction | Masked Language Modeling |
| Use cases | Chat, completion, code gen | Classification, NER, QA |
| Size range | GPT-2: 1.5B, GPT-3: 175B | BERT-base: 110M, BERT-large: 340M |

[!NOTE] Modern trend: Decoder-only architectures (GPT family) have dominated since they scale better and exhibit stronger emergent capabilities.

5. What are tokens?

Tokens are the atomic units that LLMs process. A token is roughly ¾ of an English word. Tokenization breaks text into these sub-word units.

  • GPT tokenizer: ~1.3 tokens per word on average
  • "Hello world" → 2 tokens; "ChatGPT" → 3 tokens (Chat-G-PT)
  • Context window is measured in tokens, not words
  • OpenAI's tiktoken library counts tokens programmatically
import tiktoken
enc = tiktoken.encoding_for_model("gpt-4")
tokens = enc.encode("Hello, how are you?")
print(f"Tokens: {len(tokens)}")  # Output: 6

6. What is the Context Window?

The context window is the maximum number of tokens a model can process in a single request. It includes both input (prompt) and output (completion):

  • GPT-3.5: 4K/16K tokens
  • GPT-4: 8K/32K tokens (128K with GPT-4 Turbo)
  • Claude 3: 200K tokens (~150K words, roughly an entire book)
  • Gemini 1.5 Pro: 1M tokens

7. What is Temperature in LLMs?

Temperature controls randomness in token sampling by scaling the logits before the softmax. Typical API range: 0.0–2.0.

  • 0.0: Effectively deterministic (greedy decoding), always picks the highest-probability token
  • 0.7-1.0: Balanced creativity
  • >1.0: Highly random, creative but potentially nonsensical
  • Lower for factual tasks (code, math), higher for creative writing
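from openai import OpenAI
client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Write a poem"}],
    temperature=0.9,  # higher temperature for creative output
)
print(response.choices[0].message.content)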
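
To make the effect of temperature concrete, here is a minimal sketch of a temperature-scaled softmax in plain Python. The three logits are made-up values for illustration, and the function name `softmax_with_temperature` is our own, not part of any library.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
cold = softmax_with_temperature(logits, 0.2)  # sharp: top token dominates
warm = softmax_with_temperature(logits, 1.5)  # flat: more randomness
print([round(p, 3) for p in cold])
print([round(p, 3) for p in warm])
```

A low temperature concentrates probability on the top token, while a high temperature spreads it across the alternatives.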

8. What is Top-P (Nucleus) Sampling?

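Top-P sampling draws the next token from the smallest set of tokens whose cumulative probability reaches P, renormalizing the probabilities within that set. Example: P=0.9 means only the tokens making up the top 90% of probability mass are considered. Top-p trims the unlikely tail while temperature reshapes the whole distribution, so the two are complementary; both can be set together, although OpenAI's API documentation recommends adjusting one or the other rather than both.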
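
The selection rule can be sketched in a few lines of plain Python. The token probabilities below are invented for illustration, and `top_p_filter` and `sample_top_p` are hypothetical helper names, not functions from a real library.

```python
import random

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, renormalized."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return {token: prob / cumulative for token, prob in nucleus}

def sample_top_p(probs, p, rng=random):
    """Sample one token from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    return rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]

probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "xylophone": 0.05}
nucleus = top_p_filter(probs, 0.9)
print(nucleus)                    # the unlikely tail token is dropped
print(sample_top_p(probs, 0.9))
```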

9. What are Embeddings?

Embeddings are dense vector representations of text that capture semantic meaning. Similar concepts have similar vectors. Key uses:

  • Semantic search
  • Clustering and classification
  • Retrieval-Augmented Generation (RAG)
  • OpenAI text-embedding-3-small: 1536 dimensions
  • Cosine similarity measures vector closeness
from openai import OpenAI
client = OpenAI()
response = client.embeddings.create(
    model="text-embedding-3-small",
    input="Generative AI is transforming technology"
)
embedding = response.data[0].embedding  # List of 1536 floats
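
Cosine similarity itself is simple enough to write by hand. The sketch below uses tiny made-up 4-dimensional vectors in place of real 1536-dimensional embeddings, so the numbers are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"; real ones come from an embedding model.
king = [0.9, 0.8, 0.1, 0.0]
queen = [0.85, 0.75, 0.2, 0.05]
banana = [0.0, 0.1, 0.9, 0.8]

print(round(cosine_similarity(king, queen), 3))   # close to 1: similar
print(round(cosine_similarity(king, banana), 3))  # much lower: dissimilar
```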

10. What is the Attention Mechanism?

Attention computes the relevance of each token to every other token in a sequence. The formula:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

  • Q (Query): What I'm looking for
  • K (Key): What I can offer
  • V (Value): The actual information
  • √d_k: Scaling factor, where d_k is the key dimension; it keeps dot products from growing with dimension and pushing the softmax to extreme values

[!TIP] Self-attention is the key breakthrough: every token directly "sees" every other token, unlike RNNs where information flows sequentially. This enables parallelization and long-range dependencies.
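
The formula above can be sketched directly in plain Python on small lists. This is an illustrative, unoptimized version (real implementations use batched matrix operations on a GPU), and the Q, K, V values are made up.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return outputs, weights

Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, w = attention(Q, K, V)
print([[round(x, 3) for x in row] for row in w])
```

Each row of `w` is a probability distribution over the keys, and each output row is the corresponding blend of value vectors.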

11. What is Multi-Head Attention?

Instead of one attention function, multi-head attention runs multiple attention operations in parallel with different learned projections. Each "head" can focus on different relationships:

  • Semantic meaning
  • Syntactic structure
  • Positional relationships
  • Co-reference resolution

Outputs are concatenated and projected back to the model dimension.

12. What is the difference between Self-Attention and Cross-Attention?

  • Self-Attention: Q, K, V all come from the same sequence (used in both encoder and decoder)
  • Cross-Attention: Q comes from the decoder, K and V from the encoder output (used in encoder-decoder models like T5)

GPT uses only self-attention (causal, masked) since it's decoder-only.

13. What is Causal Masking?

Causal masking prevents the model from "cheating" by looking at future tokens during training. A triangular mask is applied to the attention matrix, setting future positions to -inf (which becomes 0 after softmax). This ensures autoregressive generation: each token only depends on previous tokens.
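
A minimal sketch of the triangular mask in plain Python follows. The attention scores are arbitrary example values, and the helper names are our own.

```python
import math

def causal_mask(n):
    """mask[i][j] is 0.0 when position i may attend to j (j <= i), else -inf."""
    return [[0.0 if j <= i else float("-inf") for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    out = []
    for row, mask_row in zip(scores, mask):
        masked = [s + m for s, m in zip(row, mask_row)]
        mx = max(masked)
        exps = [math.exp(x - mx) for x in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.5, 1.0, 2.0], [1.0, 0.2, 0.3], [0.1, 0.4, 0.9]]
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print([round(x, 3) for x in row])
```

The first token can only attend to itself, so its row is all weight on position 0; the upper triangle of the weight matrix is zero.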

14. What is the difference between Pre-training and Fine-tuning?

  • Pre-training: Training a model from scratch on massive general data. Extremely expensive; GPT-4 pre-training reportedly cost more than $100M.
  • Fine-tuning: Adapting a pre-trained model to a specific task with a smaller, domain-specific dataset. Much cheaper, and often needs only a few hundred to a few thousand examples.

15. What is RLHF (Reinforcement Learning from Human Feedback)?

RLHF aligns model outputs with human preferences:

  1. Supervised Fine-Tuning (SFT): Fine-tune on human-written responses
  2. Reward Model: Train a model to predict human preference scores
  3. PPO Optimization: Use Proximal Policy Optimization to maximize reward

This is how ChatGPT was refined from GPT-3.5.

16. What are the key differences between GPT-3.5, GPT-4, and GPT-4o?

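| Feature | GPT-3.5 | GPT-4 | GPT-4o |
|---|---|---|---|
| Multimodal | Text only | Text + Image input | Text + Image + Audio I/O |
| Context | 4K / 16K | 8K / 32K (128K with GPT-4 Turbo) | 128K |
| Reasoning | Good | Much better | Best, faster |
| Cost | Cheapest | Mid | Cheaper than GPT-4 |
| Speed | Fast | Slow | Much faster |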

17. What are Open-Source LLMs?

Popular open-source (more precisely, open-weight) alternatives to OpenAI's proprietary models:

  • Llama 3 / 3.1 (Meta): 8B, 70B, 405B params
  • Mistral/Mixtral: Strong 7B model, Mixture of Experts 8×7B
  • Gemma (Google): 2B, 7B params
  • Falcon (TII): 7B, 40B, 180B
  • Phi-3 (Microsoft): Small but capable (3.8B)

18. What is Mixture of Experts (MoE)?

MoE models use multiple "expert" sub-networks, but only a subset is activated per token. This allows scaling parameters without proportionally increasing compute.

  • Mixtral 8×7B: Total 46.7B params, but only 12.9B active per token
  • GPT-4 is rumored to use MoE with 8 experts
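
The routing idea can be sketched with toy experts. In the illustration below, the router logits are hard-coded (in a real model they come from a learned linear layer applied to the token), the experts are trivial scaling functions, and top-2 routing over 8 experts mirrors the Mixtral setup mentioned above. All names are hypothetical.

```python
import math

def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those."""
    top = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = top_k_route(router_logits, k)
    return sum(gate * experts[i](x) for i, gate in gates.items()), gates

# 8 toy experts: expert i multiplies its input by (i + 1)
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]  # assumed values
y, gates = moe_layer(3.0, experts, router_logits)
print(gates, round(y, 3))
```

Only 2 of the 8 experts execute for this token, which is why the active parameter count is much smaller than the total.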

19. What are parameters in a neural network?

Parameters are the trainable weights and biases of a neural network. They represent what the model "learned" from data:

  • Float numbers stored in matrices
  • More parameters = more capacity, more memory, more compute
  • 1B parameters × 2 bytes (FP16) = 2 GB of VRAM for the weights alone (activations and KV-cache add more)
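
The memory arithmetic generalizes to other model sizes and precisions. The helper below is a back-of-the-envelope sketch that counts weights only and uses decimal gigabytes; the function name and the dtype table are our own.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, dtype="fp16"):
    """Memory for the weights alone, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for name, n in [("1B", 1e9), ("7B", 7e9), ("70B", 70e9)]:
    print(name, {d: weight_memory_gb(n, d) for d in BYTES_PER_PARAM})
```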

20. What is Quantization?

Quantization reduces model precision (e.g., FP32 → INT8/INT4) to decrease memory and increase speed:

  • INT8: 4× memory reduction relative to FP32, minimal quality loss
  • INT4: 8× memory reduction relative to FP32, slight quality loss
  • GGUF/GPTQ/AWQ: Popular quantization file formats and methods
  • Enables running 7B models on consumer hardware
# Using quantized model with llama.cpp / Ollama
# ollama run llama3:8b-q4_K_M  # 4-bit quantized
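
To show what reducing precision means numerically, here is a sketch of the simplest scheme, symmetric "absmax" INT8 quantization. It is an illustration only: methods such as GPTQ and AWQ are considerably more sophisticated, and the weight values are made up.

```python
def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate floats from the stored integers."""
    return [v * scale for v in q]

weights = [0.12, -0.53, 0.98, -1.27, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight is stored as one byte plus a shared scale, and the rounding error per weight is bounded by half the scale.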

21. What is hallucination in LLMs?

Hallucination occurs when an LLM generates factually incorrect, nonsensical, or fabricated information that sounds plausible. Causes:

  • Training data gaps
  • Model confidently extrapolates from patterns
  • No ground-truth verification mechanism

Mitigation: RAG, grounding, chain-of-thought, factual verification.

22. What is Chain-of-Thought (CoT) prompting?

Chain-of-Thought prompting instructs the model to show its reasoning step by step before giving the final answer. It can substantially improve accuracy on math, logic, and multi-step problems.

prompt = """Q: If a train travels 120 miles in 2 hours, what's its speed?
A: Let me think step by step.
Speed = Distance / Time
Speed = 120 / 2 = 60
Answer: 60 mph"""

23. What is the difference between Zero-shot, Few-shot, and Fine-tuning?

  • Zero-shot: No examples. Pure prompt instruction.
  • Few-shot: 2-5 examples in the prompt to guide behavior.
  • Fine-tuning: Many examples used to update model weights.

24. What is Perplexity?

Perplexity measures how well a model predicts a sample. Lower perplexity = better prediction. Mathematically, it's the exponentiated cross-entropy loss. A perplexity of 10 means the model is as confused as if it had to choose between 10 equally likely options for each token.
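
The definition translates into a few lines of Python. In this sketch, `token_probs` is an assumed input: the probability the model assigned to each correct token in a sequence.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability of the true tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# A model that always spreads probability evenly over 10 options
uniform = perplexity([0.1] * 5)
# A model that is usually confident about the right token
confident = perplexity([0.9, 0.8, 0.95, 0.7])
print(round(uniform, 3), round(confident, 3))
```

The uniform case comes out at 10, matching the "10 equally likely options" interpretation.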

25. What is BLEU Score?

BLEU (Bilingual Evaluation Understudy) measures machine translation quality by comparing n-gram overlap with reference translations. Range: 0-1, often reported as 0-100. Drawbacks: it doesn't capture semantic equivalence and penalizes valid rephrasing. It is used less for LLM evaluation now.

26. What is ROUGE Score?

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures summarization quality. ROUGE-N measures n-gram recall, ROUGE-L uses longest common subsequence. More recall-focused than BLEU.
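
The precision-versus-recall distinction between BLEU and ROUGE-N can be illustrated with a simplified n-gram overlap. This is a sketch of the core idea only: full BLEU also combines several n-gram orders and applies a brevity penalty, and ROUGE-L (longest common subsequence) is not covered here.

```python
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def overlap(candidate, reference, n):
    """Clipped count of n-grams shared by candidate and reference."""
    return sum((ngrams(candidate, n) & ngrams(reference, n)).values())

def ngram_precision(candidate, reference, n=1):
    """BLEU-style: what fraction of the candidate's n-grams are in the reference?"""
    return overlap(candidate, reference, n) / max(sum(ngrams(candidate, n).values()), 1)

def ngram_recall(candidate, reference, n=1):
    """ROUGE-N-style: what fraction of the reference's n-grams were recovered?"""
    return overlap(candidate, reference, n) / max(sum(ngrams(reference, n).values()), 1)

reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()
print(ngram_precision(candidate, reference), ngram_recall(candidate, reference))
```

The short candidate scores perfectly on precision but only 0.5 on recall, which is why summarization metrics emphasize recall.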

27. What are Inference and Training?

  • Training: Forward pass + backpropagation. Updates model weights. Requires massive GPU clusters.
  • Inference: Forward pass only. Generates output. Much less compute. What happens when you chat with ChatGPT.

28. What is Autoregressive Generation?

Autoregressive models generate output one token at a time, with each new token conditioned on all previous tokens:

  • Start: "The"
  • Next: "The cat"
  • Next: "The cat sat"
  • Continue until stop token or max length

This is why LLM generation is sequential and can't be perfectly parallelized.
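
The generation loop can be sketched with a toy stand-in for the model. Here `NEXT_TOKEN` is a hypothetical lookup table that only looks at the last token; a real LLM conditions on the whole sequence and returns a probability distribution.

```python
# Hypothetical next-token table standing in for a real model's forward pass.
NEXT_TOKEN = {
    "The": "cat", "cat": "sat", "sat": "on",
    "on": "the", "the": "mat", "mat": "<eos>",
}

def generate(prompt, max_new_tokens=10, stop_token="<eos>"):
    """Append one token per step until the stop token or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], stop_token)  # the "model" call
        if next_token == stop_token:
            break
        tokens.append(next_token)  # the new token becomes part of the next input
    return tokens

print(" ".join(generate(["The"])))
```

Each iteration depends on the output of the previous one, which is the source of the sequential bottleneck.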

29. What are Stop Tokens?

Stop tokens signal the model to end generation. Models learn to predict a special end-of-sequence token (for example, <|endoftext|> in GPT models). Without one, a model would keep generating until it hit the length limit. API providers also enforce max_tokens limits as a safeguard.

30. What are Token Sampling Strategies?

Beyond temperature and top-p:

  • Top-K: Only consider top K tokens
  • Beam Search: Maintain multiple candidate sequences
  • Contrastive Search: Balance model confidence against a penalty for tokens too similar to the preceding context, which discourages repetition
  • Repetition Penalty: Reduce probability of already-generated tokens
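
Two of these strategies are easy to sketch in plain Python. The logits are made-up values, the function names are our own, and the repetition-penalty rule shown (divide positive logits, multiply negative ones) is one common convention rather than the only one.

```python
import math

def top_k_probs(logits, k):
    """Keep the k highest-logit tokens and renormalize with a softmax."""
    top = sorted(logits.items(), key=lambda kv: kv[1], reverse=True)[:k]
    m = top[0][1]
    exps = {tok: math.exp(v - m) for tok, v in top}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def apply_repetition_penalty(logits, generated, penalty=1.2):
    """Lower the logits of tokens that were already generated."""
    out = dict(logits)
    for tok in set(generated):
        if tok in out:
            out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

logits = {"the": 2.0, "cat": 1.0, "dog": 0.5, "zebra": -1.0}
kept = top_k_probs(logits, k=2)
penalized = apply_repetition_penalty(logits, generated=["the", "zebra"])
print(kept)
print(penalized)
```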

31. What is the Transformer's Positional Encoding?

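Since transformers have no recurrence, they need positional encoding to understand token order. Original approach: fixed sinusoidal functions of position. Models such as BERT and GPT-2 switched to learned absolute position embeddings. Most recent LLMs use RoPE (Rotary Position Embedding), which is not learned but applies a position-dependent rotation; it is used in Llama and GPT-NeoX.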

32. What is RoPE (Rotary Position Embedding)?

RoPE encodes position by rotating the query and key vectors. Key advantages:

  • Relative position is naturally captured
  • Can be extended to sequences longer than those seen in training, usually with scaling techniques such as position interpolation
  • Used by Llama, Mistral, GPT-NeoX, PaLM
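
The relative-position property can be checked numerically. The sketch below rotates consecutive pairs of dimensions (one common pairing convention; some implementations pair the first and second halves instead) with the usual base of 10000. The vectors are arbitrary examples.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each pair (x0, x1), (x2, x3), ... by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 0.8]
k = [0.2, -0.7, 0.9, 0.4]
near = dot(rope(q, 5), rope(k, 3))      # positions 5 and 3: offset 2
far = dot(rope(q, 105), rope(k, 103))   # positions 105 and 103: offset 2
print(near, far)
```

Both dot products agree up to floating-point error, showing that the attention score depends only on the offset between the two positions, not on their absolute values.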

33. What is Layer Normalization?

LayerNorm normalizes activations across features within each layer by subtracting the mean and dividing by the standard deviation, which stabilizes training. Modern LLMs typically use RMSNorm (Root Mean Square Normalization), which skips the mean subtraction, is computationally simpler, and performs comparably in practice.
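
The difference between the two is visible in a few lines. This sketch omits the learned gain and bias parameters that real layers include, and the input vector is an arbitrary example.

```python
import math

def layer_norm(x, eps=1e-5):
    """Center to zero mean, then scale to unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-5):
    """Scale by the root mean square only; no mean subtraction."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 4.0]
ln = layer_norm(x)
rn = rms_norm(x)
print([round(v, 3) for v in ln])
print([round(v, 3) for v in rn])
```

LayerNorm output has zero mean, while RMSNorm keeps the sign and relative offset of the inputs and only fixes their overall scale.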

34. What is the Feed-Forward Network (FFN) in Transformers?

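Each transformer block has an FFN after attention: two linear transformations with an activation in between (ReLU in the original Transformer, GELU in GPT models). This is where most model parameters live: roughly ⅔ of each block's parameters with the standard 4× hidden expansion. Modern variants use the SwiGLU activation (as in Llama and PaLM).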
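
The ⅔ figure follows from simple counting. The sketch below assumes a standard block with four attention projection matrices and an FFN with a 4× hidden expansion, and it ignores biases, normalization weights, and embeddings.

```python
def block_param_counts(d_model, ffn_multiplier=4):
    """Approximate weight counts for one transformer block."""
    attention = 4 * d_model * d_model                # W_Q, W_K, W_V, W_O
    ffn = 2 * d_model * (ffn_multiplier * d_model)   # up- and down-projection
    return attention, ffn

attn, ffn = block_param_counts(4096)
share = ffn / (attn + ffn)
print(attn, ffn, round(share, 3))
```

With these assumptions the FFN holds 8·d² of the 12·d² weights per block.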

35. What is KV-Cache?

During autoregressive generation, the Key and Value matrices for previous tokens are cached to avoid recomputation. This reduces the attention cost of each new token from O(n²) to O(n). The KV-cache grows linearly with sequence length and batch size, so it is often the main memory bottleneck during inference. PagedAttention (vLLM) manages KV-cache memory efficiently.
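
A rough size estimate shows why the cache matters. The formula below counts one K and one V tensor per layer; the example configuration (32 layers, 32 KV heads, head dimension 128, FP16) is an assumed 7B-class setup chosen for illustration.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   bytes_per_value=2, batch_size=1):
    """Bytes needed to cache keys and values for every layer and position."""
    # Factor of 2: one tensor for K and one for V in each layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch_size

per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=32, head_dim=128)
full = kv_cache_bytes(4096, n_layers=32, n_kv_heads=32, head_dim=128)
print(per_token / 2**20, "MiB per token")
print(full / 2**30, "GiB at 4096 tokens")
```

Under these assumptions a single 4096-token sequence needs about 2 GiB of cache, and the total scales with the number of concurrent requests.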

36. What is Speculative Decoding?

Speculative Decoding speeds up generation by using a smaller "draft" model to propose multiple tokens, then a larger model verifies them in parallel. Can achieve 2-3× speedup without quality loss.
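
The propose-then-verify loop can be sketched with toy models. This illustration uses the simple greedy variant, where a draft token is accepted only if it matches the target model's own choice; sampling-based versions use a probabilistic accept/reject rule instead. `TARGET` and `DRAFT` are hypothetical token sequences standing in for the two models.

```python
TARGET = ["the", "cat", "sat", "on", "the", "mat", "<eos>"]
DRAFT = ["the", "cat", "sat", "in", "the", "mat", "<eos>"]  # disagrees at index 3

def target_next(ctx):
    return TARGET[len(ctx)] if len(ctx) < len(TARGET) else "<eos>"

def draft_next(ctx):
    return DRAFT[len(ctx)] if len(ctx) < len(DRAFT) else "<eos>"

def speculative_step(prefix, draft_next, target_next, k=4):
    """One round: the draft proposes k tokens, the target keeps the agreeing prefix."""
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    accepted, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)  # in practice, one batched pass checks all k
        if tok != expected:
            accepted.append(expected)  # fix the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # bonus token when all drafts are accepted
    return accepted

first = speculative_step([], draft_next, target_next, k=4)
second = speculative_step(first, draft_next, target_next, k=2)
print(first, second)
```

The output is identical to what the target model alone would produce; the speedup comes from verifying several tokens per target-model pass.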

37. What are Emergent Abilities?

Emergent abilities appear only at scale — small models can't do them, but large models suddenly can:

  • Few-shot reasoning
  • Chain-of-thought
  • Instruction following
  • Code generation
  • Theory of mind tasks

[!IMPORTANT] Emergence is a key reason for the "scaling hypothesis": bigger models unlock qualitatively new capabilities, not just incremental improvements.

38. What are Scaling Laws?

Scaling laws (Kaplan et al., 2020) describe how model performance improves with:

  • Model size (N): More parameters
  • Dataset size (D): More training tokens
  • Compute (C): More FLOPs

The Chinchilla scaling law suggests optimal training uses ~20 tokens per parameter.
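
The rule of thumb is easy to apply. The sketch below combines the ~20 tokens-per-parameter ratio with the widely used approximation that training compute is about 6 × N × D FLOPs; both are rough estimates, and the function names are our own.

```python
def chinchilla_optimal_tokens(num_params, tokens_per_param=20):
    """Approximate compute-optimal number of training tokens."""
    return num_params * tokens_per_param

def training_flops(num_params, num_tokens):
    """Common approximation: C is roughly 6 * N * D."""
    return 6 * num_params * num_tokens

n = 70_000_000_000                      # a 70B-parameter model
d = chinchilla_optimal_tokens(n)        # about 1.4 trillion tokens
print(f"{d:.2e} tokens, {training_flops(n, d):.2e} FLOPs")
```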

39. What is Prompt Injection?

Prompt injection is a security attack where an adversary manipulates an LLM by embedding malicious instructions in user input. Types:

  • Direct: "Ignore previous instructions and..."
  • Indirect: Hidden text in documents processed by the LLM

Mitigation: input sanitization, system prompt hardening, output filtering.

40. What are the ethical considerations in Gen AI?

  • Bias & Fairness: Models inherit training data biases
  • Misinformation: Convincing false content generation
  • Privacy: Training data may contain PII
  • Copyright: Generated content ownership unclear
  • Job Displacement: Automation of knowledge work
  • Environmental Impact: Training large models consumes massive energy
  • Safety: Harmful content, deepfakes, autonomous misuse