Vector Databases & RAG Interview Questions

40 essential interview questions on vector databases, embeddings, Pinecone, Chroma, FAISS, Weaviate, RAG architectures, and retrieval strategies.

By TechCoder Team | Last updated: 2026-06-23

RAG (Retrieval-Augmented Generation) and vector databases are the backbone of knowledge-grounded AI applications. These 40 questions cover embeddings, indexing algorithms, retrieval strategies, chunking, and production RAG architectures.

1. What is RAG (Retrieval-Augmented Generation)?

RAG combines information retrieval with LLM generation. Instead of relying solely on the model's training data, RAG retrieves relevant documents from a knowledge base and provides them as context to the LLM. This grounds responses in factual, up-to-date, and domain-specific information.

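# Pseudocode RAG pipeline (assumes LangChain-style `vector_store` and `llm` objects already exist)
query = "What's our return policy?"
docs = vector_store.similarity_search(query, k=5)  # retrieve the 5 most similar chunks
context = "\n".join(doc.page_content for doc in docs)  # concatenate the retrieved text
response = llm.invoke(f"Context: {context}\nQuestion: {query}")  # generate a grounded answer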

2. Why use RAG instead of fine-tuning?

RAG	Fine-tuning
Dynamic knowledge updates	Fixed at training time
No model retraining needed	Requires training compute
Source attribution possible	No source traceability
Lower cost to implement	Higher training cost
Can use with any model	Model-specific
Slower (retrieval step)	Faster inference

3. What is a Vector Database?

A vector database stores, indexes, and queries high-dimensional vectors (embeddings). It enables fast similarity search to find semantically similar items. Key features: CRUD operations, metadata filtering, and approximate nearest neighbor (ANN) search.

4. How do Vector Databases work?

Ingestion: Convert documents → chunks → embeddings → store vectors with metadata
Indexing: Build ANN index (HNSW, IVF, PQ) for fast search
Querying: Convert query → embedding → similarity search → return top-K results
Filtering: Combine vector similarity with metadata filters

5. What are the most popular Vector Databases?

Pinecone: Managed, serverless, proprietary. Easiest to start.
Chroma: Open-source, lightweight. Good for prototyping.
Weaviate: Open-source, GraphQL API, hybrid search.
Qdrant: Open-source, Rust-based, high performance.
Milvus: Open-source, cloud-native, billion-scale.
FAISS: Meta's library (not a DB). In-memory, extremely fast.
pgvector: PostgreSQL extension. If you already use Postgres.

6. What is FAISS?

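FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It's NOT a database: it has no server, no managed persistence, no full CRUD, and no metadata filtering (an index can be saved to and loaded from disk with faiss.write_index / faiss.read_index, but that is file serialization, not a storage layer). Best used for in-memory, high-performance search scenarios.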

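import faiss
import numpy as np

dimension = 768
index = faiss.IndexFlatL2(dimension)  # exact (brute-force) search with L2 distance
embeddings = np.random.random((1000, dimension)).astype('float32')  # placeholder vectors
index.add(embeddings)

# FAISS expects queries as a 2D float32 array of shape (n_queries, dimension)
query_embedding = np.random.random((1, dimension)).astype('float32')
D, I = index.search(query_embedding, k=5)  # D: distances, I: indices of the top 5 results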

7. What are Embeddings?

Embeddings map text (words, sentences, documents) to dense vectors in high-dimensional space. Similar text → similar vectors. Models: OpenAI text-embedding-3-small (1536d), text-embedding-3-large (3072d), Cohere Embed, Sentence-BERT.

8. What similarity metrics are used in vector search?

Cosine Similarity: Measures angle between vectors. Range: -1 to 1. Most common.
Euclidean Distance (L2): Straight-line distance. Lower = more similar.
Dot Product: For normalized vectors, equals cosine similarity.
Manhattan Distance (L1): Sum of absolute differences. Less common.

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
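
The relationships between these metrics can be checked numerically. The sketch below uses random vectors (an arbitrary choice for illustration) to show that, once vectors are normalized to unit length, the dot product equals cosine similarity and squared L2 distance is a simple function of it, so all three produce the same ranking.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=8), rng.normal(size=8)

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Normalize both vectors to unit length
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)

dot_unit = float(np.dot(a_unit, b_unit))          # dot product of unit vectors
l2_squared = float(np.sum((a_unit - b_unit) ** 2))  # squared Euclidean distance
l1 = float(np.sum(np.abs(a - b)))                 # Manhattan distance on the raw vectors

print("cosine:", cosine(a, b))
print("dot (unit vectors):", dot_unit)
print("squared L2 (unit vectors):", l2_squared, "== 2 - 2*cos:", 2 - 2 * cosine(a, b))
print("L1 (raw vectors):", l1)
```

This is why many vector databases recommend normalizing embeddings and using the dot product: it is cheaper to compute and ranks results identically to cosine similarity.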

9. What is ANN (Approximate Nearest Neighbor)?

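ANN algorithms trade a small amount of accuracy (recall) for large speed improvements. Exact search over 1M vectors requires 1M distance comparisons per query. A graph index such as HNSW typically visits only a small fraction of the vectors, often hundreds to a few thousand depending on index parameters.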

Popular ANN algorithms: HNSW, IVF (Inverted File), PQ (Product Quantization), DiskANN, ScaNN.
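
To make the accuracy/speed trade-off concrete, here is a minimal IVF-style sketch in NumPy. It is an illustration under simplifying assumptions, not a production index: centroids are random data points (real IVF implementations train them with k-means), and the names `ivf_search`, `n_lists`, and `n_probe` are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(size=(2000, 32)).astype(np.float32)
query = rng.normal(size=32).astype(np.float32)

def exact_search(data, query, k):
    """Brute force: compare the query against every vector."""
    dists = np.linalg.norm(data - query, axis=1)
    return np.argsort(dists)[:k]

# Build step: pick centroids and assign every vector to its nearest one.
# Assumption: random data points as centroids instead of trained k-means centroids.
n_lists = 20
centroids = data[rng.choice(len(data), size=n_lists, replace=False)]
assignments = np.argmin(
    np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2), axis=1
)
inverted_lists = [np.where(assignments == c)[0] for c in range(n_lists)]

def ivf_search(query, k, n_probe):
    """Search only the n_probe lists whose centroids are closest to the query."""
    nearest_lists = np.argsort(np.linalg.norm(centroids - query, axis=1))[:n_probe]
    candidates = np.concatenate([inverted_lists[c] for c in nearest_lists])
    dists = np.linalg.norm(data[candidates] - query, axis=1)
    return candidates[np.argsort(dists)[:k]], len(candidates)

exact_ids = exact_search(data, query, k=5)
approx_ids, scanned = ivf_search(query, k=5, n_probe=3)
overlap = len(set(exact_ids.tolist()) & set(approx_ids.tolist()))
print(f"scanned {scanned} of {len(data)} vectors, recall@5 = {overlap / 5:.2f}")
```

Raising `n_probe` scans more lists and increases recall at the cost of more comparisons; probing every list reproduces exact search.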

10. What is HNSW (Hierarchical Navigable Small World)?

HNSW builds a multi-layer graph. Top layers: few nodes, long-range connections. Bottom layer: all nodes, local connections. Search starts at the top, greedily moves toward the query's region, and descends layer by layer. Search complexity is roughly O(log N) in practice. One of the most widely used ANN algorithms.

11. What is the chunking strategy for RAG?

Chunking splits documents for embedding:

Fixed-size: Simple 500-1000 token chunks with overlap
Semantic: Split by paragraph, section headers
Recursive: Try separators in order of decreasing size (paragraph, sentence, word) until chunks fit
Agentic: LLM decides optimal chunk boundaries

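[!TIP] Chunk overlap (10-20%) reduces the chance that a sentence or idea is cut off at a chunk boundary. The embedding model's maximum input length sets an upper bound on chunk size; for OpenAI text-embedding-3 models that limit is 8191 tokens per input, though practical chunks are usually much smaller.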
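
Fixed-size chunking with overlap can be sketched in a few lines. This is a minimal example that assumes whitespace-separated words as stand-ins for tokens; a real pipeline would count tokens with the embedding model's tokenizer. The function name `chunk_tokens` is chosen for this illustration.

```python
def chunk_tokens(tokens, chunk_size, overlap):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks

# Assumption: words stand in for tokens in this toy example
words = "the quick brown fox jumps over the lazy dog near the river bank".split()
chunks = chunk_tokens(words, chunk_size=5, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

Each chunk begins with the last word of the previous one, so text near a boundary appears in both chunks and remains retrievable from either.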

12. What is the difference between sparse and dense retrieval?

Sparse (BM25, TF-IDF): Keyword-based. Fast, interpretable. Misses semantic meaning. "car" ≠ "automobile".
Dense (embeddings): Semantic understanding. "car" ≈ "automobile". Requires embedding model.
Hybrid: Combines both. Best of both worlds.

13. What is Hybrid Search?

Hybrid search combines dense (semantic) and sparse (keyword) retrieval, then merges the two result lists with a fusion algorithm such as Reciprocal Rank Fusion (RRF). It catches both exact keyword matches and semantic similarities.

# Hybrid search pseudocode
dense_results = vector_search(query_embedding, k=20)
sparse_results = bm25_search(query_text, k=20)
combined = reciprocal_rank_fusion(dense_results, sparse_results)
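
The fusion step in the pseudocode above can be implemented directly. The sketch below is one straightforward version of Reciprocal Rank Fusion: each document scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant k = 60 is the commonly used default; the document IDs are made up for illustration.

```python
from collections import defaultdict

def reciprocal_rank_fusion(*rankings, k=60):
    """Merge ranked lists of document IDs; earlier ranks contribute more."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical result lists from a dense and a sparse retriever
dense_results = ["doc_a", "doc_b", "doc_c"]
sparse_results = ["doc_b", "doc_d", "doc_a"]

combined = reciprocal_rank_fusion(dense_results, sparse_results)
print(combined)
```

RRF uses only ranks, not raw scores, which is why it works without calibrating BM25 scores against cosine similarities.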

14. What is MMR (Maximum Marginal Relevance)?

MMR balances relevance with diversity in retrieved results. Prevents returning highly similar documents that don't add new information.

retriever = vector_store.as_retriever(
    search_type="mmr",
    search_kwargs={"k": 5, "fetch_k": 20, "lambda_mult": 0.7}
)
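
To show what the `lambda_mult` parameter controls, here is a minimal NumPy sketch of the MMR selection loop. It is an illustration of the general technique, not LangChain's internal implementation; the function name `mmr` and the toy vectors are assumptions made for this example.

```python
import numpy as np

def mmr(query_vec, doc_vecs, k, lambda_mult=0.7):
    """Greedy MMR: pick documents relevant to the query but dissimilar to earlier picks."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    relevance = d @ q  # cosine similarity of each document to the query
    selected, remaining = [], list(range(len(d)))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            # Redundancy: highest similarity to anything already selected
            redundancy = max((float(d[i] @ d[j]) for j in selected), default=0.0)
            score = lambda_mult * relevance[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected

query = np.array([1.0, 0.0])
docs = np.array([
    [1.0, 0.10],   # doc 0: very relevant
    [1.0, 0.12],   # doc 1: near-duplicate of doc 0
    [0.6, 0.80],   # doc 2: less relevant, but different
])

print(mmr(query, docs, k=2, lambda_mult=1.0))  # pure relevance
print(mmr(query, docs, k=2, lambda_mult=0.3))  # favors diversity
```

With `lambda_mult=1.0` the near-duplicate is returned second; lowering it makes the redundancy penalty dominate, so the distinct document is chosen instead.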

15. What is the naive RAG pipeline?

Load documents
Split into chunks
Embed chunks and store in vector DB
User asks question → embed question
Retrieve top-K similar chunks
Pass chunks + question to LLM
Return grounded answer

16. What are advanced RAG architectures?

HyDE: Generate hypothetical answer first, use it for retrieval
Multi-hop RAG: Retrieve, answer partially, retrieve more, refine
Self-RAG: Model decides when to retrieve, critiques its own outputs
Corrective RAG: Evaluate retrieved docs quality, re-retrieve if needed
Graph RAG: Knowledge graph + vector search combined

17. What is the role of a reranker in RAG?

Rerankers refine initial retrieval results. Instead of just vector similarity, rerankers use cross-encoders (like Cohere Rerank, BGE reranker) to deeply compare query-document relevance. Higher quality, but slower (applied to top-N results only).

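from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Assumes COHERE_API_KEY is set and `base_retriever` is an existing retriever,
# e.g. vector_store.as_retriever(search_kwargs={"k": 20})
compressor = CohereRerank(model="rerank-english-v3.0", top_n=3)  # keep the 3 best documents
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,  # the reranker is passed as base_compressor
    base_retriever=base_retriever
)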

18. How do you handle metadata filtering in vector search?

Combine semantic search with metadata constraints: "search for documents about 'revenue' WHERE year=2024 AND department='sales'". Most vector DBs support pre-filtering (filter first, then vector search) or post-filtering (vector search first, then filter, which can return fewer than K results when few matches survive the filter).
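
The difference between the two filtering orders is easy to see on synthetic data. The sketch below is a toy comparison with brute-force dot-product scoring; the random vectors, the `years` metadata array, and the helper `top_k` are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.normal(size=(100, 16))
# Metadata: only every 10th document is from 2024
years = np.array([2024 if i % 10 == 0 else 2023 for i in range(100)])
query = rng.normal(size=16)
all_ids = np.arange(100)
k = 5

def top_k(candidate_ids, k):
    """Return the k candidate IDs with the highest dot-product score."""
    scores = vectors[candidate_ids] @ query
    return candidate_ids[np.argsort(-scores)[:k]]

# Pre-filtering: restrict to matching documents, then run the vector search
pre = top_k(all_ids[years == 2024], k)

# Post-filtering: run the vector search on everything, then drop non-matches
post = np.array([i for i in top_k(all_ids, k) if years[i] == 2024], dtype=int)

print("pre-filter results: ", pre.tolist())
print("post-filter results:", post.tolist())
```

Pre-filtering always returns k results when at least k documents match, whereas post-filtering may return few or none when the filter is selective. A common mitigation for post-filtering is to over-fetch (search for more than k) before filtering.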

19. What is the difference between Pinecone and Chroma?

Feature	Pinecone	Chroma
Type	Managed (SaaS)	Open-source
Setup	Instant, no ops	Self-hosted or client-server
Scale	Auto-scaled	Manual scaling
Cost	Pay-per-use	Free (self-hosted)
Best for	Production	Development / Prototyping
Advanced features	Serverless indexing	Embedding API built-in

20. What is Weaviate?

Weaviate is an open-source vector database with:

GraphQL + REST APIs
Built-in vectorization modules (OpenAI, Cohere, HuggingFace)
Hybrid search (dense + sparse) natively
Multi-tenancy support
CRUD with schema management

21. How do you evaluate RAG quality?

Retrieval Metrics: Recall@K, Precision@K, MRR, NDCG, Hit Rate
Generation Metrics: Faithfulness, Answer Relevance, Context Relevance
End-to-end: Human evaluation, RAGAS framework, TruLens
Factual Accuracy: Compare generated answer against ground truth
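
The retrieval metrics are simple to compute once you have a labeled set of relevant documents per query. This is a minimal sketch for a single query, with made-up document IDs; MRR is the mean of the reciprocal rank over all evaluation queries.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant documents that appear in the top k."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical retrieval output and ground-truth labels for one query
retrieved = ["d3", "d1", "d7", "d2", "d9"]
relevant = {"d1", "d2", "d5"}

print("Precision@5:", precision_at_k(retrieved, relevant, 5))
print("Recall@5:   ", recall_at_k(retrieved, relevant, 5))
print("RR:         ", reciprocal_rank(retrieved, relevant))
```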

22. What is the RAGAS evaluation framework?

RAGAS (RAG Assessment) provides automated metrics:

Faithfulness: Is the answer supported by retrieved context?
Answer Relevancy: Does the answer address the question?
Context Precision: Are retrieved docs relevant?
Context Recall: Did we retrieve all relevant information?

23. What is HyDE (Hypothetical Document Embeddings)?

HyDE generates a hypothetical answer before retrieval, then embeds that answer and uses it to find similar documents. Counter-intuitive but effective: a generated answer, even if factually wrong, is written in the style and vocabulary of an answer, so its embedding often lies closer to relevant documents than the embedding of the question itself.

24. What are common RAG failure modes?

Retrieval misses relevant documents (low recall)
Hallucination despite having correct context
Context window overflow (too many/large chunks)
Irrelevant retrieval (noise in context)
Stitching issues (poor context synthesis)

25. What is Multi-Modal RAG?

Multi-modal RAG retrieves both text and images. Embed images with CLIP, store in vector DB. Query can be text or image. Used in visual QA, product search, medical imaging.

26. What is the role of an embedding model?

Converts text → fixed-length vector. Choice affects retrieval quality dramatically:

OpenAI text-embedding-3: General purpose, fast, 1536/3072 dims
Cohere Embed v3: Strong for enterprise search
BGE (BAAI): Open-source, ranks highly on the MTEB leaderboard
E5 (Microsoft): Open-source, strong performance

27. How do you handle document updates in RAG?

Re-indexing: Delete old vectors + insert new (most common)
Upsert: Update existing vectors with same ID
Incremental: Only index changed/added documents
Versioning: Keep multiple versions with timestamps
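
Upserts and incremental indexing are often combined by storing a content hash next to each vector, so unchanged documents are never re-embedded. The sketch below is one possible approach under stated assumptions: a plain dict stands in for the vector database, `fake_embed` is a placeholder for a real embedding call, and `sync_documents` is a name invented for this example.

```python
import hashlib

def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def sync_documents(index: dict, documents: dict, embed) -> dict:
    """Bring `index` (doc_id -> {"hash", "vector"}) in line with `documents` (doc_id -> text)."""
    stats = {"upserted": 0, "skipped": 0, "deleted": 0}
    for doc_id, text in documents.items():
        digest = content_hash(text)
        if doc_id in index and index[doc_id]["hash"] == digest:
            stats["skipped"] += 1  # unchanged: no need to re-embed
            continue
        # New or changed document: (re-)embed and overwrite under the same ID
        index[doc_id] = {"hash": digest, "vector": embed(text)}
        stats["upserted"] += 1
    for doc_id in list(index):
        if doc_id not in documents:
            del index[doc_id]  # document was removed from the source
            stats["deleted"] += 1
    return stats

def fake_embed(text: str) -> list:
    """Placeholder for a real embedding model call."""
    return [float(len(text))]

index = {}
first = sync_documents(index, {"a": "hello", "b": "world"}, fake_embed)
second = sync_documents(index, {"a": "hello", "b": "world!", "c": "new"}, fake_embed)
third = sync_documents(index, {"b": "world!", "c": "new"}, fake_embed)
print(first, second, third, sep="\n")
```

When documents are chunked, the same idea applies per chunk, with chunk IDs derived from the document ID so that stale chunks of an edited document can be found and deleted.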

28. What is Self-Querying Retrieval?

Self-querying extracts both semantic query AND metadata filters from natural language: "Show me sales reports from 2024 about Q4 performance" → query="Q4 performance" + filter={year: 2024, type: "sales report"}.

29. What is the Context Window limitation in RAG?

LLMs have max context limits. With too many retrieved chunks, you can't fit everything. Solutions: Re-ranking (keep only top-N), summarization of chunks, iterative retrieval (retrieve → reason → retrieve more if needed).
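
The "keep only what fits" step can be sketched as a greedy packer over chunks that are already ranked by relevance. This is a minimal example that assumes a word count as the token counter; in practice you would pass the tokenizer of the target LLM. The function name `pack_context` and the sample chunks are illustrative.

```python
def pack_context(ranked_chunks, max_tokens, count_tokens=lambda t: len(t.split())):
    """Greedily add chunks in rank order while staying within the token budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_tokens(chunk)
        if used + cost > max_tokens:
            continue  # too large for the remaining budget; a later, smaller chunk may fit
        selected.append(chunk)
        used += cost
    return selected, used

# Hypothetical chunks, already sorted from most to least relevant
ranked_chunks = [
    "returns accepted within thirty days",
    "refunds go to the original payment method used at checkout",
    "gift cards are final sale",
]

selected, used = pack_context(ranked_chunks, max_tokens=12)
print(used, selected)
```

Remember to reserve part of the context window for the prompt template, the question, and the model's answer when choosing `max_tokens`.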

30. What is Sentence Window Retrieval?

Retrieve small chunks for search relevance, but return a larger window (surrounding sentences/paragraphs) for context. Better retrieval precision + richer context for the LLM.
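
A minimal sketch of the idea: match against individual sentences, then hand the LLM the matched sentence plus its neighbors. The word-overlap scorer below is a toy stand-in for embedding similarity, and the function names and sample sentences are assumptions made for this example.

```python
def best_sentence(sentences, query):
    """Toy retriever: index of the sentence sharing the most words with the query."""
    query_words = set(query.lower().split())
    overlaps = [
        len(query_words & set(s.lower().rstrip(".").split())) for s in sentences
    ]
    return max(range(len(sentences)), key=overlaps.__getitem__)

def sentence_window(sentences, hit_index, window=1):
    """Return the matched sentence plus `window` sentences on each side."""
    start = max(0, hit_index - window)
    end = min(len(sentences), hit_index + window + 1)
    return " ".join(sentences[start:end])

sentences = [
    "Our store opened in 2010.",
    "Returns are accepted within 30 days.",
    "A receipt is required for every return.",
    "Shipping is free on large orders.",
]

hit = best_sentence(sentences, "returns accepted")
context = sentence_window(sentences, hit, window=1)
print(context)
```

The search runs over short, focused units, while the generator still sees the surrounding sentences (here including the receipt requirement) that a single-sentence chunk would have dropped.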

31. How does Pinecone's serverless indexing work?

Pinecone automatically handles index scaling, replication, and sharding. You specify dimension + metric + cloud/region. No capacity planning needed. Cold storage for infrequently accessed vectors.

32. What is pgvector and when to use it?

pgvector is a PostgreSQL extension for vector storage/search. Use when you already have Postgres and don't want another database. Supports IVFFlat and HNSW indexing. Good for moderate scale (<10M vectors). Simpler architecture.

33. What is Qdrant?

Qdrant is a Rust-based vector DB with:

Rich filtering (nested objects, boolean logic)
Quantization (scalar, binary, product)
Payload (metadata) indexing
On-disk storage for large datasets
gRPC and REST APIs

34. What is Milvus?

Milvus is a cloud-native vector DB for billion-scale. Features:

Distributed architecture (proxy, data node, index node)
Multiple index types (IVF, HNSW, DiskANN, GPU)
Multi-vector fields, hybrid search
CDC (Change Data Capture) for streaming

35. What is the role of a document loader?

Document loaders handle ingestion: file parsing (PDF, CSV, HTML, Markdown), web scraping, API data fetching. Must handle encoding, structure extraction, metadata preservation.

36. How do you optimize retrieval latency?

Use smaller embeddings (1536d vs 3072d)
Use approximate index (ANN) vs exact search
Pre-filter with metadata before vector search
Use caching for frequent queries
Choose appropriate pod size / hardware
Batch embeddings for large ingestions (improves indexing throughput rather than query latency)

37. What is streaming in RAG?

Instead of waiting for the full response, stream tokens as they're generated. The user sees the answer being typed in real time, which lowers perceived latency. Streaming is typically delivered over SSE (Server-Sent Events) or WebSocket.
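
On the server side, streaming usually means wrapping a token generator in the SSE wire format, where each event is a `data:` line followed by a blank line. The sketch below assumes a fake token source (`fake_llm_tokens`) in place of a real LLM client, and uses a `[DONE]` sentinel as an end-of-stream marker, which is a common convention rather than part of the SSE format itself.

```python
import json
from typing import Iterable, Iterator

def fake_llm_tokens(answer: str) -> Iterator[str]:
    """Placeholder for an LLM client that yields tokens as they are generated."""
    for word in answer.split(" "):
        yield word + " "

def to_sse(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in a Server-Sent Events message."""
    for token in tokens:
        yield f"data: {json.dumps({'token': token})}\n\n"
    yield "data: [DONE]\n\n"  # assumed end-of-stream sentinel

events = list(to_sse(fake_llm_tokens("Returns are accepted")))
for event in events:
    print(event, end="")
```

A web framework would return this generator as a streaming response with the `text/event-stream` content type, and the browser would append each token to the page as it arrives.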

38. How do you handle multi-lingual RAG?

Use multilingual embedding models (LaBSE, multilingual-e5)
Translate queries to document language
Store multilingual documents with language metadata
Query routing based on detected language

39. What is the cost structure of vector databases?

Pinecone: Pod-based ($70/mo for S1) or serverless (per-RU pricing)
Weaviate Cloud: Starting free tier, then per-instance
Self-hosted (Chroma/FAISS): Infrastructure costs only
pgvector: Free extension, Postgres infra costs

40. What are Vector Database security considerations?

Network isolation (VPC, private endpoints)
API key authentication
Encryption at rest and in transit
Role-based access control (RBAC) for multi-tenant
Audit logging for all operations
Data residency compliance (GDPR)
