Semantic Search Systems
Beyond keywords. Master Vector Embeddings, Similarity Metrics, Hybrid Search, and building production-ready search engines. This hands-on tutorial focuses on practical implementation of semantic search systems concepts.
Traditional search (Ctrl+F, Google circa 2005) looks for exact keyword matches. Semantic Search understands the meaning behind words. If you search for "movie about a sinking ship," it should find "Titanic" even though those exact words don't appear in the title.
1. The Evolution of Search
Generation 1: Keyword Search (BM25)
- Method: Count word frequencies and rank documents with a TF-IDF-style score such as BM25
- Example: Search "car accident" → finds documents with those exact words
- Problem: Misses "automotive collision," "vehicle crash"
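To make the keyword approach concrete, here is a minimal sketch of BM25 scoring in plain Python. It assumes whitespace tokenization and the commonly used defaults `k1=1.5` and `b=0.75`; the function name and sample documents are illustrative, not taken from any particular library.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a basic BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(d) for d in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in tokenized for term in set(d))
    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # exact-match only: unseen terms contribute nothing
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            length_norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / length_norm
        scores.append(score)
    return scores

docs = [
    "car accident on the highway",
    "automotive collision reported downtown",
    "recipe for apple pie",
]
scores = bm25_scores("car accident", docs)
```

The second document describes the same event but shares no words with the query, so it scores zero. That is the vocabulary mismatch problem in miniature.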
Generation 2: Semantic Search (Dense Vectors)
- Method: Convert queries and documents to vectors, find nearest neighbors
- Example: Search "car accident" → finds "traffic collision," "automobile crash"
- Magic: Understands synonyms, context, and related concepts
Generation 3: Hybrid Search (Best of Both)
- Method: Combine keyword matching + semantic similarity
- Why: Keywords catch exact matches (product IDs, names); vectors catch concepts
- Result: Most production systems use this approach
2. Dense vs. Sparse Representations
Sparse Search (BM25)
- Representation: High-dimensional sparse vectors (one dimension per word)
- "apple": [0,0,0,...,1,0,0] (only position 47839 is 1)
- Pros: Fast, explainable, works for exact matches
- Cons: Vocabulary mismatch problem
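A small sketch of the sparse representation, assuming a toy vocabulary built from three short documents. Only non-zero dimensions are stored, which is how sparse vectors are usually kept in practice; the helper names are hypothetical.

```python
def build_vocab(docs):
    """Assign one dimension index to every distinct word."""
    vocab = {}
    for doc in docs:
        for word in doc.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_sparse(text, vocab):
    """Return {dimension index: count}, storing only non-zero entries."""
    vec = {}
    for word in text.lower().split():
        if word in vocab:
            idx = vocab[word]
            vec[idx] = vec.get(idx, 0) + 1
    return vec

def sparse_dot(a, b):
    """Dot product over the dimensions the two vectors share."""
    return sum(value * b[idx] for idx, value in a.items() if idx in b)

docs = ["car accident", "vehicle crash", "apple pie"]
vocab = build_vocab(docs)
car = to_sparse("car accident", vocab)
crash = to_sparse("vehicle crash", vocab)
overlap = sparse_dot(car, crash)
```

Because "car accident" and "vehicle crash" occupy disjoint dimensions, `overlap` is 0 even though the phrases mean nearly the same thing.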
Dense Search (Neural Embeddings)
- Representation: Low-dimensional dense vectors (e.g., 768 dimensions)
- "apple": [0.23, -0.15, 0.87, ..., 0.42] (all dimensions have values)
- Pros: Captures semantics, handles synonyms
- Cons: Computationally expensive, harder to debug
3. The Embedding Model
You need a model to convert text → vectors. This is the heart of semantic search.
Popular Embedding Models
| Model | Dimensions | Speed | Best For |
|---|---|---|---|
| text-embedding-3-small (OpenAI) | 1536 | Fast | General purpose |
| text-embedding-3-large (OpenAI) | 3072 | Slower | Highest quality |
| all-MiniLM-L6-v2 | 384 | Very Fast | CPU deployment |
| multilingual-e5-large | 1024 | Medium | Multi-language |
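The dimension count also drives storage cost. As a rough sketch, assuming each value is stored as a 4-byte float32 and a hypothetical corpus of one million chunks (index overhead ignored), the raw vector size per model works out as follows:

```python
# Dimensions taken from the table above.
DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "all-MiniLM-L6-v2": 384,
    "multilingual-e5-large": 1024,
}
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored uncompressed as float32

def raw_vector_size_gb(dimensions, num_vectors):
    """Raw storage in gigabytes (10^9 bytes), excluding index overhead."""
    return dimensions * BYTES_PER_FLOAT32 * num_vectors / 1e9

NUM_CHUNKS = 1_000_000  # hypothetical corpus size
sizes = {
    model: raw_vector_size_gb(dims, NUM_CHUNKS)
    for model, dims in DIMENSIONS.items()
}

for model, size in sizes.items():
    print(f"{model}: {size:.3f} GB")
```

Under these assumptions the largest model needs eight times the storage of the smallest, which is worth weighing against any quality gain.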
Fine-Tuning Embeddings for Your Domain
Pre-trained embeddings work well for general text, but fine-tuning on domain data can improve accuracy when words carry a domain-specific sense:
- Medical: "discharge" (hospital) vs "discharge" (electrical)
- Legal: "party" (lawsuit) vs "party" (celebration)
4. Similarity Metrics
Once you have vectors, how do you measure "closeness"?
Cosine Similarity (Most Common)
Measures the angle between vectors, ignoring magnitude.
- Formula: cos(θ) = (A · B) / (||A|| × ||B||)
- Range: -1 (opposite direction) to 1 (same direction)
- Why it works: Text length shouldn't matter ("I love AI" vs "I really truly deeply love AI")
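A minimal pure-Python sketch of the formula, using made-up three-dimensional vectors in place of real embeddings. It illustrates that scaling a vector leaves the score unchanged:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [1.0, 2.0, 3.0]
scaled = [10.0, 20.0, 30.0]     # same direction, ten times the magnitude
opposite = [-1.0, -2.0, -3.0]   # exactly the opposite direction

same_direction = cosine_similarity(a, scaled)
opposite_direction = cosine_similarity(a, opposite)
```

Here `same_direction` comes out at 1 and `opposite_direction` at -1, up to floating-point rounding.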
Dot Product
Combines angle AND magnitude.
- Formula: A · B = Σ(a_i × b_i)
- Use case: When you want to factor in document length/importance
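A short sketch of how magnitude affects the dot product, again with toy two-dimensional vectors. Once both vectors are normalized to unit length, the dot product equals cosine similarity:

```python
import math

def dot(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

query = [1.0, 0.0]
small_doc = [1.0, 0.0]   # same direction as the query
large_doc = [3.0, 0.0]   # same direction, larger magnitude

raw_small = dot(query, small_doc)
raw_large = dot(query, large_doc)
unit_small = dot(normalize(query), normalize(small_doc))
unit_large = dot(normalize(query), normalize(large_doc))
```

The raw scores differ (1.0 versus 3.0) while the normalized scores are both 1.0. If magnitude carries no meaning in your embeddings, normalizing once at index time lets you use the cheaper dot product at query time.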
Euclidean Distance (L2)
Straight-line distance in vector space.
- Formula: √(Σ(a_i - b_i)²)
- Use case: Clustering, anomaly detection
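A minimal sketch of the distance computation with toy vectors. The second half checks a standard identity: for unit-length vectors, the squared Euclidean distance equals 2 - 2 × (dot product), so ranking by L2 distance and ranking by cosine similarity give the same order on normalized embeddings.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Classic 3-4-5 right triangle.
distance = euclidean_distance([0.0, 0.0], [3.0, 4.0])

# Two unit vectors at a right angle: dot product 0, so distance^2 = 2.
u = [1.0, 0.0]
v = [0.0, 1.0]
unit_dot = sum(x * y for x, y in zip(u, v))
squared_distance = euclidean_distance(u, v) ** 2
```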
5. Building a Production Search Engine
Architecture Overview
Step 1: Indexing (Offline)
- Load documents (PDFs, web pages, etc.)
- Chunk them (see previous chapter)
- Generate embeddings for each chunk
- Store in Vector DB (Pinecone, Weaviate, Qdrant, Milvus)
Step 2: Querying (Real-time)
- User enters query
- Generate query embedding
- Vector DB finds nearest neighbors (HNSW, IVF algorithms)
- Optional: Rerank with a more powerful model
- Return top results
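The two steps can be sketched with a small in-memory index. Everything here is an illustrative assumption: `toy_embed` is a word-count stand-in for a real embedding model, `InMemoryIndex` is a hypothetical class rather than any vector database's API, and the search is exact brute force rather than an approximate algorithm such as HNSW or IVF.

```python
import math

VOCAB = ["ship", "sinking", "ocean", "recipe", "apple", "pie"]  # toy vocabulary

def toy_embed(text):
    """Stand-in for an embedding model: unit-length word-count vector."""
    words = text.lower().split()
    vec = [float(words.count(term)) for term in VOCAB]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0  # avoid dividing by zero
    return [x / norm for x in vec]

class InMemoryIndex:
    def __init__(self, embed):
        self.embed = embed
        self.entries = []  # list of (chunk text, vector)

    def add(self, chunks):
        """Indexing step: embed each chunk once and store it."""
        for chunk in chunks:
            self.entries.append((chunk, self.embed(chunk)))

    def search(self, query, top_k=3):
        """Query step: embed the query and rank every chunk by dot product."""
        q = self.embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, vec)), chunk)
            for chunk, vec in self.entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return scored[:top_k]

index = InMemoryIndex(toy_embed)
index.add([
    "a ship sinking in the ocean",
    "apple pie recipe",
    "ocean weather report",
])
results = index.search("sinking ship", top_k=2)
```

Brute force compares the query against every stored vector, which is fine for a few thousand chunks. Approximate nearest-neighbor indexes exist to avoid that full scan at larger scale.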
Advanced: Reranking
The 2-stage approach:
- Retrieval (fast, approximate): Vector search finds the top 100 candidates
- Reranking (slow, accurate): A cross-encoder model rescores those candidates and keeps the top 10
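The control flow of the two stages can be sketched as below. Both scorers are toy stand-ins chosen so the example runs without a model: `word_overlap` plays the role of the fast retriever, and `phrase_aware` plays the role of the cross-encoder. Neither resembles a real model's scoring.

```python
def retrieve_then_rerank(query, docs, cheap_score, expensive_score,
                         retrieve_k=100, final_k=10):
    """Shortlist with a cheap scorer, then reorder with an expensive one."""
    # Stage 1: the cheap scorer sees every document.
    shortlist = sorted(
        docs, key=lambda d: cheap_score(query, d), reverse=True
    )[:retrieve_k]
    # Stage 2: the expensive scorer only sees the shortlist.
    reranked = sorted(
        shortlist, key=lambda d: expensive_score(query, d), reverse=True
    )
    return reranked[:final_k]

def word_overlap(query, doc):
    """Toy retriever: number of distinct shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def phrase_aware(query, doc):
    """Toy reranker: word overlap plus a bonus for the exact phrase."""
    bonus = 2 if query.lower() in doc.lower() else 0
    return word_overlap(query, doc) + bonus

docs = [
    "ship sinking stories",
    "the sinking ship movie",
    "apple pie recipe",
]
top = retrieve_then_rerank(
    "sinking ship", docs, word_overlap, phrase_aware,
    retrieve_k=2, final_k=1,
)
```

The first two documents tie on word overlap, and the reranker breaks the tie in favor of the one containing the exact phrase. The expensive scorer runs on only `retrieve_k` documents, which is what keeps the second stage affordable.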
6. Hybrid Search Strategy
Combine BM25 (keyword) + Vector (semantic) for best results.
Reciprocal Rank Fusion (RRF)
Simple algorithm to merge two ranked lists:
- Formula: score = 1/(k + rank_bm25) + 1/(k + rank_vector)
- where k is a constant (usually 60)
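A minimal sketch of RRF over two ranked lists of document IDs. Ranks start at 1, and a document missing from one list simply gets no contribution from it, which is a common convention but an assumption here. The document IDs are made up.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked lists of document IDs into one list, best first."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]     # keyword results, best first
vector_ranking = ["doc_b", "doc_c", "doc_a"]   # semantic results, best first
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Here `doc_b` wins because it ranks near the top of both lists (1/62 + 1/61), ahead of `doc_a`, which is first in one list but last in the other (1/61 + 1/63). RRF uses only ranks, so the BM25 and vector scores never need to be put on a common scale.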
Key Takeaways
- Semantic Search understands intent, not just keywords.
- Embeddings map text to a geometric space where similarity = proximity.
- Cosine Similarity is the most common metric for comparing text embeddings.
- Hybrid Search (BM25 + Vector) is the production-standard approach.
- 2-Stage Retrieval (fast vector search + slow reranking) balances speed and accuracy.
What's Next?
English is powerful, but the world speaks 7,000 languages. Can AI truly understand them all? Next Chapter: Multilingual NLP.