Vector Databases & RAG Interview Questions

40 essential interview questions on vector databases, embeddings, Pinecone, Chroma, FAISS, Weaviate, RAG architectures, and retrieval strategies.

By TechCoder Team
Last updated: 2026-06-02

RAG (Retrieval-Augmented Generation) and vector databases are the backbone of knowledge-grounded AI applications. These 40 questions cover embeddings, indexing algorithms, retrieval strategies, chunking, and production RAG architectures.


1. What is RAG (Retrieval-Augmented Generation)?

RAG combines information retrieval with LLM generation. Instead of relying solely on the model's training data, RAG retrieves relevant documents from a knowledge base and provides them as context to the LLM. This grounds responses in factual, up-to-date, and domain-specific information.

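# Pseudocode RAG pipeline (LangChain-style; `vector_store` and `llm` are assumed to exist)
query = "What's our return policy?"
docs = vector_store.similarity_search(query, k=5)  # retrieve the 5 most similar chunks
context = "\n".join([doc.page_content for doc in docs])  # merge chunk text into one context string
response = llm.invoke(f"Context: {context}\nQuestion: {query}")  # generate an answer grounded in the context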

2. Why use RAG instead of fine-tuning?

| RAG | Fine-tuning |
| --- | --- |
| Dynamic knowledge updates | Fixed at training time |
| No model retraining needed | Requires training compute |
| Source attribution possible | No source traceability |
| Lower cost to implement | Higher training cost |
| Can use with any model | Model-specific |
| Slower (retrieval step) | Faster inference |

3. What is a Vector Database?

A vector database stores, indexes, and queries high-dimensional vectors (embeddings). It enables fast similarity search to find semantically similar items. Key features: CRUD operations, metadata filtering, and approximate nearest neighbor (ANN) search.

4. How do Vector Databases work?

  1. Ingestion: Convert documents → chunks → embeddings → store vectors with metadata
  2. Indexing: Build ANN index (HNSW, IVF, PQ) for fast search
  3. Querying: Convert query → embedding → similarity search → return top-K results
  4. Filtering: Combine vector similarity with metadata filters
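
The four steps above can be sketched with a tiny in-memory store. This is an illustration only: `ToyVectorStore` is a hypothetical class, the vectors are hand-written rather than produced by an embedding model, and the search is exact (brute force) instead of using an ANN index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    """Hypothetical in-memory store: exact search, no ANN index."""

    def __init__(self):
        self.rows = {}  # id -> (vector, metadata)

    def upsert(self, item_id, vector, metadata):
        # Ingestion: store the vector together with its metadata.
        self.rows[item_id] = (vector, metadata)

    def query(self, vector, k=3, where=None):
        # Filtering: keep only rows whose metadata matches every key in `where`.
        candidates = [
            (item_id, vec) for item_id, (vec, meta) in self.rows.items()
            if where is None or all(meta.get(key) == val for key, val in where.items())
        ]
        # Querying: rank the surviving rows by similarity and return the top k.
        scored = sorted(((cosine(vector, vec), item_id) for item_id, vec in candidates), reverse=True)
        return [(item_id, round(score, 3)) for score, item_id in scored[:k]]

store = ToyVectorStore()
store.upsert("a", [1.0, 0.0], {"year": 2024})
store.upsert("b", [0.9, 0.1], {"year": 2023})
store.upsert("c", [0.0, 1.0], {"year": 2024})

top = store.query([1.0, 0.0], k=2)
filtered = store.query([1.0, 0.0], k=2, where={"year": 2024})
print(top)
print(filtered)
```

Without a filter the two nearest items are `a` and `b`; with `where={"year": 2024}` item `b` is excluded before ranking, so `c` takes its place.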

5. What are the popular Vector Databases?

  • Pinecone: Managed, serverless, proprietary. Easiest to start.
  • Chroma: Open-source, lightweight. Good for prototyping.
  • Weaviate: Open-source, GraphQL API, hybrid search.
  • Qdrant: Open-source, Rust-based, high performance.
  • Milvus: Open-source, cloud-native, billion-scale.
  • FAISS: Meta's library (not a DB). In-memory, extremely fast.
  • pgvector: PostgreSQL extension. If you already use Postgres.

6. What is FAISS?

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It's NOT a database: it has no server, CRUD API, or metadata filtering, and persistence is limited to saving and loading an index file yourself (faiss.write_index / faiss.read_index). Best used for in-memory, high-performance search scenarios.

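import faiss
import numpy as np

dimension = 768
index = faiss.IndexFlatL2(dimension)  # exact (brute-force) search using L2 distance

# Random vectors stand in for real embeddings; FAISS expects float32
embeddings = np.random.random((1000, dimension)).astype('float32')
index.add(embeddings)

# The query must be a 2-D float32 array of shape (n_queries, dimension)
query_embedding = np.random.random((1, dimension)).astype('float32')
D, I = index.search(query_embedding, k=5)  # D: distances, I: indices of the top 5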

7. What are Embeddings?

Embeddings map text (words, sentences, documents) to dense vectors in high-dimensional space. Similar text → similar vectors. Models: OpenAI text-embedding-3-small (1536d), text-embedding-3-large (3072d), Cohere Embed, Sentence-BERT.

8. What similarity metrics are used in vector search?

  • Cosine Similarity: Measures angle between vectors. Range: -1 to 1. Most common.
  • Euclidean Distance (L2): Straight-line distance. Lower = more similar.
  • Dot Product: For normalized vectors, equals cosine similarity.
  • Manhattan Distance (L1): Sum of absolute differences. Less common.

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
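
To make the relationship between these metrics concrete, here is a small standard-library sketch. The helper names and the two example vectors are chosen for illustration; it checks that the dot product of unit-length vectors matches cosine similarity of the originals.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_sim(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def normalize(a):
    """Scale a vector to unit length."""
    n = norm(a)
    return [x / n for x in a]

a, b = [3.0, 4.0], [4.0, 3.0]
ua, ub = normalize(a), normalize(b)

print(cosine_sim(a, b))  # 24 / (5 * 5)
print(dot(ua, ub))       # same value once both vectors are unit length
print(euclidean(a, b))   # sqrt(1 + 1)
print(manhattan(a, b))   # 1 + 1
```

This is why many systems normalize embeddings at ingestion time: ranking by dot product then gives the same order as ranking by cosine similarity, with less arithmetic per comparison.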

9. What is ANN (Approximate Nearest Neighbor)?

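ANN algorithms trade a small amount of accuracy (recall) for large speed improvements. Exact search over 1M vectors requires 1M distance computations per query. A graph index such as HNSW typically examines only a small fraction of the vectors, often hundreds to a few thousand depending on its parameters and the recall you need.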

Popular ANN algorithms: HNSW, IVF (Inverted File), PQ (Product Quantization), DiskANN, ScaNN.
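
As a rough illustration of how an ANN index avoids scanning everything, here is a toy IVF-style (inverted file) search, one of the algorithms listed above. It is a sketch under simplifying assumptions: the centroids are given by hand (real systems learn them, usually with k-means), and the function names are illustrative.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_ivf(vectors, centroids):
    """Assign each vector index to the bucket of its nearest centroid."""
    buckets = {c: [] for c in range(len(centroids))}
    for idx, vec in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        buckets[nearest].append(idx)
    return buckets

def ivf_search(query, vectors, centroids, buckets, k=1, nprobe=1):
    """Scan only the nprobe buckets whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))[:nprobe]
    candidates = [idx for c in probe for idx in buckets[c]]
    ranked = sorted(candidates, key=lambda idx: l2(query, vectors[idx]))
    return ranked[:k], len(candidates)

vectors = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.9, 0.1], [10.0, 0.0]]
centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]  # assumed given, not learned here
buckets = build_ivf(vectors, centroids)

hits, scanned = ivf_search([5.0, 5.0], vectors, centroids, buckets, k=1, nprobe=1)
print(hits, scanned)
```

Only the two vectors in the probed bucket are compared against the query instead of all six. Raising `nprobe` scans more buckets, which improves recall at the cost of more comparisons.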

10. What is HNSW (Hierarchical Navigable Small World)?

HNSW builds a multi-layer graph. Top layers: few nodes, long-range connections. Bottom layer: all nodes, local connections. Search starts at the top, greedily moves toward the query's region, and descends layer by layer. Search complexity is roughly O(log N). It is among the most widely used ANN algorithms.

11. What are the chunking strategies for RAG?

Chunking splits documents for embedding:

  • Fixed-size: Simple 500-1000 token chunks with overlap
  • Semantic: Split by paragraph, section headers
  • Recursive: Try multiple separators in order (paragraph, sentence, word)
  • Agentic: LLM decides optimal chunk boundaries

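[!TIP] Chunk overlap (10-20%) reduces the chance that a sentence or idea is cut off at a chunk boundary. The embedding model's maximum input length is an upper bound on chunk size, not a target: OpenAI's text-embedding-3 models accept up to 8191 tokens per input, but much smaller chunks such as the 500-1000 token range above are common in practice.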
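
A minimal sketch of fixed-size chunking with overlap is shown below. It assumes whitespace-separated words as a stand-in for real tokens, and `chunk_tokens` is an illustrative helper, not a library function.

```python
def chunk_tokens(tokens, chunk_size=8, overlap=2):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this chunk already reaches the end of the text
    return chunks

text = "RAG retrieves relevant chunks and passes them to the model as context for grounded answers"
tokens = text.split()  # whitespace split stands in for a real tokenizer
chunks = chunk_tokens(tokens, chunk_size=6, overlap=2)

for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk starts with the last two tokens of the previous one, so text near a boundary appears in both neighbors.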

12. What is the difference between sparse and dense retrieval?

  • Sparse (BM25, TF-IDF): Keyword-based. Fast, interpretable. Misses semantic meaning. "car" ≠ "automobile".
  • Dense (embeddings): Semantic understanding. "car" ≈ "automobile". Requires embedding model.
  • Hybrid: Combines both. Best of both worlds.

13. What is Hybrid Search?

Hybrid search combines dense (semantic) and sparse (keyword) retrieval, then merges the two result lists with a fusion algorithm such as RRF (Reciprocal Rank Fusion). It catches both exact keyword matches and semantic similarities.

# Hybrid search pseudocode
dense_results = vector_search(query_embedding, k=20)
sparse_results = bm25_search(query_text, k=20)
combined = reciprocal_rank_fusion(dense_results, sparse_results)
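
The `reciprocal_rank_fusion` step in the pseudocode could look like the sketch below. It assumes each retriever returns a list of document IDs in rank order; the IDs are made up, and `k=60` is the smoothing constant commonly used with RRF.

```python
def reciprocal_rank_fusion(*rankings, k=60):
    """Fuse ranked lists of document IDs; each list contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

dense_results = ["doc_a", "doc_b", "doc_c"]   # hypothetical semantic ranking
sparse_results = ["doc_c", "doc_a", "doc_d"]  # hypothetical BM25 ranking
combined = reciprocal_rank_fusion(dense_results, sparse_results)
print(combined)
```

Documents that appear in both lists (`doc_a`, `doc_c`) rise above documents found by only one retriever. RRF uses ranks rather than raw scores, so it needs no score normalization between BM25 and cosine similarity.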

14. What is MMR (Maximum Marginal Relevance)?

MMR balances relevance with diversity in retrieved results. Prevents returning highly similar documents that don't add new information.

retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.7}
)
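
To show what `lambda_mult` controls, here is a standalone sketch of greedy MMR selection. It is an illustration with hand-written 2-D vectors, not the implementation any particular library uses.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mmr(query_vec, doc_vecs, k=2, lambda_mult=0.7):
    """Greedy MMR: repeatedly pick the doc that maximizes
    lambda * sim(query, doc) - (1 - lambda) * max sim(doc, already selected)."""
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            redundancy = max((cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

query_vec = [1.0, 0.0]
doc_vecs = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]  # docs 0 and 1 are near-duplicates

picked_diverse = mmr(query_vec, doc_vecs, k=2, lambda_mult=0.3)
picked_relevant = mmr(query_vec, doc_vecs, k=2, lambda_mult=1.0)
print(picked_diverse, picked_relevant)
```

With `lambda_mult=1.0` the selection is pure relevance and returns both near-duplicates. With a low value such as 0.3 the redundancy penalty dominates, so the second pick is the less similar document 2.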

15. What is the naive RAG pipeline?

  1. Load documents
  2. Split into chunks
  3. Embed chunks and store in vector DB
  4. User asks question → embed question
  5. Retrieve top-K similar chunks
  6. Pass chunks + question to LLM
  7. Return grounded answer
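
The seven steps can be walked through end to end with a toy example. Everything here is a stand-in: `embed` uses bag-of-words counts instead of an embedding model, the "vector DB" is a Python list, and the final LLM call is left out so the sketch only builds the prompt.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: bag-of-words counts (a real system calls an embedding model)."""
    return Counter(text.lower().replace("?", "").replace(".", "").split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Steps 1-3: load, split (one sentence per chunk here), embed, and store.
document = "Returns are accepted within 30 days. Shipping takes five business days. Gift cards never expire."
chunks = [s.strip() for s in document.split(".") if s.strip()]
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(question, k=1):
    # Steps 4-5: embed the question and take the top-k most similar chunks.
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(question, k=1):
    # Step 6: assemble the prompt; step 7 would send it to an LLM of your choice.
    context = "\n".join(retrieve(question, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("How many days do I have for returns?")
print(prompt)
```

The returns sentence shares the most words with the question, so it is the chunk placed in the prompt.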

16. What are advanced RAG architectures?

  • HyDE: Generate hypothetical answer first, use it for retrieval
  • Multi-hop RAG: Retrieve, answer partially, retrieve more, refine
  • Self-RAG: Model decides when to retrieve, critiques its own outputs
  • Corrective RAG: Evaluate retrieved docs quality, re-retrieve if needed
  • Graph RAG: Knowledge graph + vector search combined

17. What is the role of a reranker in RAG?

Rerankers refine initial retrieval results. Instead of just vector similarity, rerankers use cross-encoders (like Cohere Rerank, BGE reranker) to deeply compare query-document relevance. Higher quality, but slower (applied to top-N results only).

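from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# base_retriever is assumed to exist, e.g. vector_store.as_retriever(search_kwargs={"k": 20})
# The model name is an example; CohereRerank reads COHERE_API_KEY from the environment
compressor = CohereRerank(model="rerank-english-v3.0", top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,  # the reranker that rescores retrieved documents
    base_retriever=base_retriever
)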

18. How do you handle metadata filtering in vector search?

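Combine semantic search with metadata constraints: "search for documents about 'revenue' WHERE year=2024 AND department='sales'". Most vector DBs support pre-filtering (filter first, then vector search), post-filtering (vector search first, then filter), or both. Post-filtering can return fewer than K results when many of the top matches fail the filter.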
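
The difference between the two filtering orders can be seen in a short sketch. The documents are hypothetical and assumed to be already sorted by similarity, so the vector search itself is not shown.

```python
def post_filter_search(ranked_docs, predicate, k):
    """Take the top-k by similarity first, then drop docs that fail the filter."""
    return [doc for doc in ranked_docs[:k] if predicate(doc)]

def pre_filter_search(ranked_docs, predicate, k):
    """Restrict to docs passing the filter first, then take the top-k."""
    return [doc for doc in ranked_docs if predicate(doc)][:k]

def is_2024(doc):
    return doc["year"] == 2024

# Hypothetical documents, already sorted by descending similarity to the query.
ranked_docs = [
    {"id": 1, "year": 2023},
    {"id": 2, "year": 2023},
    {"id": 3, "year": 2024},
    {"id": 4, "year": 2024},
]

post = post_filter_search(ranked_docs, is_2024, k=2)
pre = pre_filter_search(ranked_docs, is_2024, k=2)
print(post)
print(pre)
```

Post-filtering returns nothing here because both of the top two matches are from 2023, while pre-filtering still finds two results from 2024.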

19. What is the difference between Pinecone and Chroma?

| Feature | Pinecone | Chroma |
| --- | --- | --- |
| Type | Managed (SaaS) | Open-source |
| Setup | Instant, no ops | Self-hosted or client-server |
| Scale | Auto-scaled | Manual scaling |
| Cost | Pay-per-use | Free (self-hosted) |
| Best for | Production | Development / Prototyping |
| Advanced features | Serverless indexing | Embedding API built-in |

20. What is Weaviate?

Weaviate is an open-source vector database with:

  • GraphQL + REST APIs
  • Built-in vectorization modules (OpenAI, Cohere, HuggingFace)
  • Hybrid search (dense + sparse) natively
  • Multi-tenancy support
  • CRUD with schema management

21. How do you evaluate RAG quality?

  • Retrieval Metrics: Recall@K, Precision@K, MRR, NDCG, Hit Rate
  • Generation Metrics: Faithfulness, Answer Relevance, Context Relevance
  • End-to-end: Human evaluation, RAGAS framework, TruLens
  • Factual Accuracy: Compare generated answer against ground truth
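
The retrieval metrics are simple to compute once you have, for each query, the ranked list of retrieved IDs and the set of relevant IDs. A minimal sketch with made-up document IDs:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant docs that appear in the top-k results."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """MRR: average reciprocal rank over (retrieved, relevant) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in queries) / len(queries)

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d9"}

print(precision_at_k(retrieved, relevant, k=3))  # 1 relevant doc in the top 3
print(recall_at_k(retrieved, relevant, k=4))     # 2 of the 3 relevant docs found
print(reciprocal_rank(retrieved, relevant))      # first relevant doc is at rank 2
```

Hit Rate is the share of queries with at least one relevant document in the top K, which equals the share of queries whose reciprocal rank within the top K is non-zero.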

22. What is the RAGAS evaluation framework?

RAGAS (RAG Assessment) provides automated metrics:

  • Faithfulness: Is the answer supported by retrieved context?
  • Answer Relevancy: Does the answer address the question?
  • Context Precision: Are retrieved docs relevant?
  • Context Recall: Did we retrieve all relevant information?

23. What is HyDE (Hypothetical Document Embeddings)?

HyDE generates a hypothetical answer before retrieval, then embeds that answer and uses it to find similar documents. Counter-intuitive but effective: a generated answer resembles real documents in vocabulary and form, so it is often closer to the relevant documents in embedding space than the question itself.

24. What are common RAG failure modes?

  • Retrieval misses relevant documents (low recall)
  • Hallucination despite having correct context
  • Context window overflow (too many/large chunks)
  • Irrelevant retrieval (noise in context)
  • Stitching issues (poor context synthesis)

25. What is Multi-Modal RAG?

Multi-modal RAG retrieves both text and images. Embed images with CLIP, store in vector DB. Query can be text or image. Used in visual QA, product search, medical imaging.

26. What is the role of an embedding model?

Converts text → fixed-length vector. Choice affects retrieval quality dramatically:

  • OpenAI text-embedding-3: General purpose, fast, 1536/3072 dims
  • Cohere Embed v3: Strong for enterprise search
  • BGE (BAAI): Open-source, strong results on the MTEB leaderboard
  • E5 (Microsoft): Open-source, strong performance

27. How do you handle document updates in RAG?

  • Re-indexing: Delete old vectors + insert new (most common)
  • Upsert: Update existing vectors with same ID
  • Incremental: Only index changed/added documents
  • Versioning: Keep multiple versions with timestamps
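
One way to combine upsert and incremental indexing is to store a content hash next to each vector and re-embed only when the hash changes. The sketch below assumes a plain dict as the index and a placeholder `fake_embed` function; both are illustrative, not a real vector DB client.

```python
import hashlib

def content_hash(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sync_documents(index, documents, embed):
    """Upsert new or changed docs and delete removed ones.
    `index` maps doc_id -> {"hash": ..., "vector": ...}."""
    changes = {"upserted": [], "deleted": [], "skipped": []}
    for doc_id, text in documents.items():
        digest = content_hash(text)
        if doc_id in index and index[doc_id]["hash"] == digest:
            changes["skipped"].append(doc_id)  # unchanged: no re-embedding cost
            continue
        index[doc_id] = {"hash": digest, "vector": embed(text)}
        changes["upserted"].append(doc_id)
    for doc_id in list(index):
        if doc_id not in documents:
            del index[doc_id]  # source document was removed
            changes["deleted"].append(doc_id)
    return changes

def fake_embed(text):
    """Placeholder for a real embedding call."""
    return [float(len(text))]

index = {}
first = sync_documents(index, {"a": "alpha", "b": "beta"}, fake_embed)
second = sync_documents(index, {"a": "alpha", "c": "gamma"}, fake_embed)
print(first)
print(second)
```

On the second run document `a` is skipped because its hash is unchanged, `c` is embedded and inserted, and `b` is deleted because it no longer exists in the source.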

28. What is Self-Querying Retrieval?

Self-querying extracts both semantic query AND metadata filters from natural language: "Show me sales reports from 2024 about Q4 performance" → query="Q4 performance" + filter={year: 2024, type: "sales report"}.

29. What is the Context Window limitation in RAG?

LLMs have max context limits. With too many retrieved chunks, you can't fit everything. Solutions: Re-ranking (keep only top-N), summarization of chunks, iterative retrieval (retrieve → reason → retrieve more if needed).
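
A simple form of the "keep only top-N" approach is to add chunks in rank order until a token budget is used up. This sketch assumes the chunks are already sorted by relevance and counts whitespace-separated words as a rough stand-in for tokens.

```python
def count_words(text):
    """Rough token estimate; a real system would use the model's tokenizer."""
    return len(text.split())

def pack_context(ranked_chunks, max_tokens, count_tokens=count_words):
    """Add chunks in rank order, skipping any chunk that would exceed the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # does not fit; a smaller chunk later in the list still might
        selected.append(chunk)
        used += cost
    return selected, used

ranked_chunks = [
    "one two three four",
    "five six seven eight nine ten",
    "eleven twelve",
]
selected, used = pack_context(ranked_chunks, max_tokens=7)
print(selected, used)
```

The second chunk is skipped because it would push the total past the budget, while the shorter third chunk still fits. Stopping at the first chunk that does not fit is an equally valid design if strict rank order matters more than filling the window.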

30. What is Sentence Window Retrieval?

Retrieve small chunks for search relevance, but return a larger window (surrounding sentences/paragraphs) for context. Better retrieval precision + richer context for the LLM.
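
A minimal sketch of the idea, assuming the document is already split into sentences and that retrieval over single sentences returned the index of the best match (the retrieval step itself is not shown):

```python
def sentence_window(sentences, hit_index, window=1):
    """Return the matched sentence plus `window` neighbors on each side."""
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])

sentences = [
    "Our store opened in 2010.",
    "Returns are accepted within 30 days.",
    "A receipt is required.",
    "Gift cards never expire.",
]

# Assume the sentence at index 1 was the best match for a question about returns.
context = sentence_window(sentences, hit_index=1, window=1)
print(context)
```

The single matched sentence drives retrieval precision, while the LLM also sees the neighboring sentence about receipts that a one-sentence chunk would have dropped.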

31. How does Pinecone's serverless indexing work?

Pinecone automatically handles index scaling, replication, and sharding. You specify dimension + metric + cloud/region. No capacity planning needed. Cold storage for infrequently accessed vectors.

32. What is pgvector and when to use it?

pgvector is a PostgreSQL extension for vector storage/search. Use when you already have Postgres and don't want another database. Supports IVFFlat and HNSW indexing. Good for moderate scale (<10M vectors). Simpler architecture.

33. What is Qdrant?

Qdrant is a Rust-based vector DB with:

  • Rich filtering (nested objects, boolean logic)
  • Quantization (scalar, binary, product)
  • Payload (metadata) indexing
  • On-disk storage for large datasets
  • gRPC and REST APIs

34. What is Milvus?

Milvus is a cloud-native vector DB for billion-scale. Features:

  • Distributed architecture (proxy, data node, index node)
  • Multiple index types (IVF, HNSW, DiskANN, GPU)
  • Multi-vector fields, hybrid search
  • CDC (Change Data Capture) for streaming

35. What is the role of a document loader?

Document loaders handle ingestion: file parsing (PDF, CSV, HTML, Markdown), web scraping, API data fetching. Must handle encoding, structure extraction, metadata preservation.

36. How do you optimize retrieval latency?

  • Use smaller embeddings (1536d vs 3072d)
  • Use approximate index (ANN) vs exact search
  • Pre-filter with metadata before vector search
  • Use caching for frequent queries
  • Choose appropriate pod size / hardware
  • Batch embeddings for large ingestions
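
Caching is the easiest of these to sketch. The example below caches query embeddings with `functools.lru_cache`; `slow_embed` is a placeholder for an expensive embedding call, and the call counter exists only to show the cache working.

```python
from functools import lru_cache

stats = {"embed_calls": 0}

def slow_embed(text):
    """Placeholder for an expensive embedding API call."""
    stats["embed_calls"] += 1
    return tuple(float(len(word)) for word in text.split())

@lru_cache(maxsize=1024)
def _cached_embed(normalized_query):
    return slow_embed(normalized_query)

def embed_query(query):
    # Normalize case and whitespace so trivially different queries share a cache entry.
    return _cached_embed(" ".join(query.lower().split()))

first = embed_query("What is RAG?")
second = embed_query("  what is   rag? ")
print(stats["embed_calls"])
```

Both calls return the same vector but the placeholder embedding function runs only once. The same pattern applies to caching full retrieval results, as long as entries are invalidated when the underlying documents change.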

37. What is streaming in RAG?

Instead of waiting for the full response, stream tokens as they're generated. The user sees the answer being typed in real time. Streaming is typically delivered over SSE (Server-Sent Events) or WebSocket.

38. How do you handle multi-lingual RAG?

  • Use multilingual embedding models (LaBSE, multilingual-e5)
  • Translate queries to document language
  • Store multilingual documents with language metadata
  • Query routing based on detected language

39. What is the cost structure of vector databases?

  • Pinecone: Pod-based ($70/mo for S1) or serverless (per-RU pricing)
  • Weaviate Cloud: Starting free tier, then per-instance
  • Self-hosted (Chroma/FAISS): Infrastructure costs only
  • pgvector: Free extension, Postgres infra costs

40. What are Vector Database security considerations?

  • Network isolation (VPC, private endpoints)
  • API key authentication
  • Encryption at rest and in transit
  • Role-based access control (RBAC) for multi-tenant
  • Audit logging for all operations
  • Data residency compliance (GDPR)