Semantic Search Systems

Traditional search (Ctrl+F, Google circa 2005) looks for exact keyword matches. Semantic Search understands the meaning behind words. If you search for "movie about a sinking ship," it should find "Titanic" even though those exact words don't appear in the title.

1. The Evolution of Search 📜

Generation 1: Keyword Search (BM25)

Method: Count term frequencies and rank with TF-IDF-style weighting (BM25 adds term-frequency saturation and document-length normalization)
Example: Search "car accident" → finds documents with those exact words
Problem: Misses "automotive collision," "vehicle crash"
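
To make the vocabulary problem concrete, here is a minimal sketch of BM25 scoring in plain Python. The parameter values `k1=1.5` and `b=0.75`, the whitespace tokenizer, and the three toy documents are assumptions chosen for illustration; production engines use more careful tokenization and may use a slightly different IDF variant.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with a BM25-style formula."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no exact match, so no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Saturating term frequency, normalized by document length.
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores

docs = [
    "car accident on the highway",
    "automotive collision reported downtown",
    "vehicle crash near the bridge",
]
scores = bm25_scores("car accident", docs)
```

Only the first document receives a non-zero score. The other two describe the same event but share no terms with the query, so keyword scoring cannot see them.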

Generation 2: Semantic Search (Dense Vectors)

Method: Convert queries and documents to vectors, find nearest neighbors
Example: Search "car accident" → finds "traffic collision," "automobile crash"
Strength: Matches synonyms, paraphrases, and related concepts even when no keywords overlap

Generation 3: Hybrid Search (Best of Both)

Method: Combine keyword matching + semantic similarity
Why: Keywords catch exact matches (product IDs, names); vectors catch concepts
Result: Many production systems use this approach

2. Dense vs. Sparse Representations 🔎

Sparse Search (BM25)

Representation: High-dimensional sparse vectors (one dimension per word)
"apple": [0,0,0,...,1,0,0] (only position 47839 is 1)
Pros: Fast, explainable, works for exact matches
Cons: Vocabulary mismatch problem (synonyms and paraphrases share no dimensions, so they score zero)
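
A sparse vector is usually stored as a mapping from dimension to weight, keeping only the non-zero entries. The sketch below assumes plain word counts as weights and a whitespace tokenizer; the helper names `build_vocab`, `to_sparse`, and `sparse_dot` are hypothetical.

```python
def build_vocab(texts):
    """Assign one integer dimension to every distinct word."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_sparse(text, vocab):
    """Return {dimension: count}, storing only the non-zero entries."""
    vec = {}
    for word in text.lower().split():
        if word in vocab:
            dim = vocab[word]
            vec[dim] = vec.get(dim, 0) + 1
    return vec

def sparse_dot(a, b):
    """Dot product that only visits dimensions present in both vectors."""
    return sum(weight * b[dim] for dim, weight in a.items() if dim in b)

texts = ["car accident", "vehicle crash", "car crash"]
vocab = build_vocab(texts)
vectors = [to_sparse(t, vocab) for t in texts]

overlap_synonyms = sparse_dot(vectors[0], vectors[1])  # "car accident" vs "vehicle crash"
overlap_shared = sparse_dot(vectors[0], vectors[2])    # "car accident" vs "car crash"
```

The two phrases with the same meaning have an overlap of 0, while the pair sharing the word "car" has an overlap of 1. The representation only sees shared terms, not shared meaning.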

Dense Search (Neural Embeddings)

Representation: Low-dimensional dense vectors (e.g., 768 dimensions)
"apple": [0.23, -0.15, 0.87, ..., 0.42] (all dimensions have values)
Pros: Captures semantics, handles synonyms
Cons: Computationally expensive, harder to debug

3. The Embedding Model 🧠

You need a model to convert text → vectors. This is the heart of semantic search.

Popular Embedding Models

Model	Dimensions	Speed	Best For
`text-embedding-3-small` (OpenAI)	1536	Fast	General purpose
`text-embedding-3-large` (OpenAI)	3072	Slower	Highest quality
`all-MiniLM-L6-v2`	384	Very Fast	CPU deployment
`multilingual-e5-large`	1024	Medium	Multi-language

Fine-Tuning Embeddings for Your Domain

Pre-trained embeddings work well out of the box, but fine-tuning on domain data can improve accuracy when a word's meaning depends on the field:

Medical: "discharge" (hospital) vs "discharge" (electrical)
Legal: "party" (lawsuit) vs "party" (celebration)
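
The lesson does not say how the fine-tuning is done. One common recipe, assumed here, is contrastive training on (query, relevant passage) pairs, where every other passage in the batch serves as a negative. The sketch below computes only that loss on hand-picked toy vectors; the function name, the `temperature` value, and the vectors are all illustrative, and a real setup would backpropagate this loss through the embedding model.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def in_batch_contrastive_loss(query_vecs, passage_vecs, temperature=0.05):
    """Mean cross-entropy where passage i is the positive for query i
    and every other passage in the batch acts as a negative."""
    total = 0.0
    for i, q in enumerate(query_vecs):
        logits = [dot(q, p) / temperature for p in passage_vecs]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        total += log_sum - logits[i]
    return total / len(query_vecs)

# Hand-picked toy 2-D vectors, purely illustrative (not real model output).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]   # each passage sits next to its own query
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each passage sits next to the wrong query

loss_aligned = in_batch_contrastive_loss(queries, aligned)
loss_swapped = in_batch_contrastive_loss(queries, swapped)
```

The loss is near zero when each query is closest to its own passage and large when the pairs are swapped. Minimizing it pulls domain-specific pairs such as "discharge" and "hospital release form" together.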

4. Similarity Metrics 📏

Once you have vectors, how do you measure "closeness"?

Cosine Similarity (Most Common)

Measures the angle between vectors, ignoring magnitude.

Formula: cos(θ) = (A · B) / (||A|| × ||B||)
Range: -1 (opposite direction) to 1 (same direction); 0 means the vectors are orthogonal (unrelated)
Why it works: Text length shouldn't matter ("I love AI" vs "I really truly deeply love AI")

Dot Product

Combines angle AND magnitude.

Formula: A · B = Σ(ai × bi)
Use case: When vector magnitude carries signal (e.g., document length or importance); on unit-normalized vectors it equals cosine similarity

Euclidean Distance (L2)

Straight-line distance in vector space.

Formula: √(Σ(ai - bi)²)
Use case: Clustering, anomaly detection
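
The three metrics can be compared side by side in a few lines of plain Python. This is a minimal sketch with hand-picked vectors; real systems would typically use a vectorized library instead.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, twice the magnitude

cos_ab = cosine_similarity(a, b)   # ≈ 1.0: direction is identical
dot_ab = dot(a, b)                 # 28.0: grows with magnitude
l2_ab = euclidean_distance(a, b)   # ≈ 3.74: the vectors are not the same point
```

Cosine treats `a` and `b` as a perfect match, while dot product and Euclidean distance both react to the difference in magnitude. For unit-length vectors the three rank results identically, because ||A - B||² = 2 - 2·cos(θ).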

5. Building a Production Search Engine 🏗️

Architecture Overview

Step 1: Indexing (Offline)

Load documents (PDFs, web pages, etc.)
Chunk them (see previous chapter)
Generate embeddings for each chunk
Store in Vector DB (Pinecone, Weaviate, Qdrant, Milvus)
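
The indexing steps above can be sketched as follows. Everything here is a stand-in: `toy_embed` is a hypothetical hashing function used so the example runs without a model (it is not semantic), `chunk_words` is a simple fixed-size word window with assumed sizes, and a Python list plays the role of the vector database.

```python
import math
import zlib

def toy_embed(text, dim=64):
    """Hypothetical stand-in for a real embedding model: hashes each word
    into one of `dim` buckets, then L2-normalizes. Not semantic."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[zlib.crc32(word.encode("utf-8")) % dim] += 1.0
    length = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / length for x in vec]

def chunk_words(text, size=8, overlap=2):
    """Split text into windows of `size` words that overlap by `overlap`."""
    words = text.split()
    if not words:
        return []
    step = size - overlap
    starts = range(0, max(len(words) - overlap, 1), step)
    return [" ".join(words[i:i + size]) for i in starts]

def build_index(documents):
    """Return one record per chunk, ready to be written to a vector store."""
    index = []
    for doc_id, text in documents.items():
        for n, chunk in enumerate(chunk_words(text)):
            index.append({"doc_id": doc_id, "chunk_no": n,
                          "text": chunk, "vector": toy_embed(chunk)})
    return index

documents = {
    "long_doc": " ".join(f"word{i}" for i in range(20)),
    "short_doc": "semantic search in one line",
}
index = build_index(documents)
```

Keeping `doc_id` and `chunk_no` next to each vector matters in practice, because search results are chunks and the application usually needs to link them back to the source document.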

Step 2: Querying (Real-time)

User enters query
Generate query embedding
Vector DB finds approximate nearest neighbors (using an index such as HNSW or IVF)
Optional: Rerank with a more powerful model
Return top results
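
The query path can be sketched with an exact, brute-force search over an in-memory index. The vectors below are hand-picked for illustration and are not the output of a real model; `search` and its `top_k` parameter are hypothetical names.

```python
import heapq
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def search(query_vec, index, top_k=2):
    """Exact nearest-neighbor search: score every stored vector, keep the best k."""
    scored = ((cosine(query_vec, vec), doc_id) for doc_id, vec in index.items())
    return heapq.nlargest(top_k, scored)

# Hand-picked toy vectors, purely illustrative (not output of a real model).
index = {
    "traffic collision": [0.9, 0.1, 0.0],
    "automobile crash":  [0.8, 0.2, 0.1],
    "chocolate cake":    [0.0, 0.1, 0.9],
}
query_vec = [1.0, 0.0, 0.0]   # stands in for the embedding of "car accident"
results = search(query_vec, index)
```

This scan touches every vector, so its cost grows linearly with the index. Indexes such as HNSW and IVF avoid the full scan by examining only a promising subset, trading a small amount of recall for a large gain in speed.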

Advanced: Reranking

The 2-stage approach:

Retrieval (fast, approximate): Vector search finds the top 100 candidates
Reranking (slow, accurate): A cross-encoder model rescores those 100 candidates and keeps the top 10
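
The control flow of the two stages can be sketched without any model. Both scoring functions below are hypothetical stand-ins: `cheap_score` plays the role of vector similarity and `careful_score` plays the role of a cross-encoder. Only the shape of the pipeline is meant to carry over.

```python
def retrieve(query, corpus, first_stage_score, n_candidates=100):
    """Stage 1: cheap scoring over the whole corpus, keep the top candidates."""
    ranked = sorted(corpus, key=lambda doc: first_stage_score(query, doc), reverse=True)
    return ranked[:n_candidates]

def rerank(query, candidates, second_stage_score, top_k=10):
    """Stage 2: expensive scoring, applied to the candidates only."""
    ranked = sorted(candidates, key=lambda doc: second_stage_score(query, doc), reverse=True)
    return ranked[:top_k]

def cheap_score(query, doc):
    """Stand-in for vector similarity: number of shared words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def careful_score(query, doc):
    """Stand-in for a cross-encoder: Jaccard overlap of the word sets."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d)

corpus = [
    "a long guide on how to reset a router password at home",
    "reset router password",
    "router firmware update",
    "baking bread at home",
]
query = "reset router password"
candidates = retrieve(query, corpus, cheap_score, n_candidates=3)
final = rerank(query, candidates, careful_score, top_k=2)
```

The first stage cannot separate the two best documents (both share three words with the query), so the second stage decides the final order. The expensive scorer runs on three candidates here, not on the whole corpus, which is the point of the design.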

6. Hybrid Search Strategy 🎯

Combine BM25 (keyword) + Vector (semantic) for best results.

Reciprocal Rank Fusion (RRF)

Simple algorithm to merge two ranked lists:

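score(d) = 1/(k + rank_bm25(d)) + 1/(k + rank_vector(d))
where rank is the document's 1-based position in each list and k is a smoothing constant (usually 60). A document missing from one list gets no contribution from that list.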
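
The formula translates almost directly into code. This is a minimal sketch; the function name and the document IDs are illustrative, and ranks are taken to be 1-based.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranked list."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):   # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["doc_a", "doc_b", "doc_c"]
vector_ranking = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
# fused == ["doc_b", "doc_a", "doc_d", "doc_c"]
```

`doc_b` wins because it ranks high in both lists (1/62 + 1/61), ahead of `doc_a` (1/61 + 1/63). RRF uses only ranks, never raw scores, so BM25 and cosine values do not need to be put on a common scale first.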

Quiz

What is the main advantage of Dense Embeddings over Sparse (BM25)?

They are faster to compute
They capture semantic meaning and handle synonyms
They use less storage

Key Takeaways

✅ Semantic Search understands intent, not just keywords.
✅ Embeddings map text to a geometric space where similarity = proximity.
✅ Cosine Similarity is the most common default for comparing text embeddings.
✅ Hybrid Search (BM25 + Vector) is a common choice in production systems.
✅ 2-Stage Retrieval (fast vector search + slow reranking) balances speed and accuracy.

What's Next?

English is powerful, but the world speaks roughly 7,000 languages. Can AI truly understand them all? Next Chapter: Multilingual NLP.