Semantic Search Implementation: Embeddings, Cosine Similarity, and Hybrid Search
Semantic search understands intent, not just keywords. A user searching "how to cancel subscription" should find results about "account termination" and "membership end date" — even without keyword overlap. This guide covers building semantic search end-to-end.
How semantic search works
// Traditional keyword search
query: "cancel subscription"
result: documents containing "cancel" AND/OR "subscription"

// Semantic search
query: "cancel subscription"
result: documents semantically similar to the query intent:
  - "terminate your membership"    ← high similarity
  - "account deletion process"     ← medium similarity
  - "subscription management page" ← medium similarity
  - "pizza delivery guide"         ← low similarity
Text embeddings
An embedding is a fixed-size vector that captures semantic meaning. Similar texts have similar vectors.
import OpenAI from 'openai';

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small', // 1536 dimensions by default
    input: text,
    encoding_format: 'float',
  });
  return response.data[0].embedding;
}

const v1 = await embed('cancel subscription');
const v2 = await embed('terminate membership');
const v3 = await embed('pizza delivery');
// cosineSimilarity is defined in the next section.
// The scores in the comments are illustrative; actual values vary by model.
console.log(cosineSimilarity(v1, v2)); // high, e.g. ~0.92: very similar
console.log(cosineSimilarity(v1, v3)); // low, e.g. ~0.15: unrelated

Cosine similarity
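Cosine similarity measures the angle between two vectors and ignores their length. A small sketch with hand-checkable 2-D vectors may make that concrete; the vector names (`east`, `northEast`, and so on) and the local `cosine` helper are made up for this illustration and are not real embeddings.

```typescript
// Toy 2-D vectors chosen so the results can be verified by hand.
function cosine(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const east = [1, 0];
const northEast = [1, 1];
const north = [0, 1];
const farEast = [10, 0]; // same direction as `east`, ten times the length

const sameDirection = cosine(east, farEast); // 1: length is ignored
const diagonal = cosine(east, northEast);    // 1 / sqrt(2), about 0.707
const orthogonal = cosine(east, north);      // 0: no shared direction
```

Real embeddings behave the same way in 1536 dimensions: texts pointing in similar directions score close to 1, unrelated texts score closer to 0.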
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom; // avoid NaN for all-zero vectors
}

// For unit-length vectors (OpenAI embeddings are normalized to length 1;
// check the documentation for other models), cosine similarity equals the
// dot product, which is cheaper to compute.
function dotProduct(a: number[], b: number[]): number {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
}

Indexing pipeline
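Indexing starts by splitting each document into overlapping chunks, so that a sentence cut at a chunk boundary still appears whole in a neighbouring chunk. The sketch below shows the idea with plain fixed-size character windows. It is a simplified, hypothetical stand-in (`chunkText` is not a library function), not how `RecursiveCharacterTextSplitter` works internally; that splitter also tries to break on paragraph and sentence separators.

```typescript
// Hypothetical helper: fixed-size character windows that overlap by `chunkOverlap`.
function chunkText(text: string, chunkSize: number, chunkOverlap: number): string[] {
  if (chunkSize <= 0 || chunkOverlap < 0 || chunkOverlap >= chunkSize) {
    throw new Error('require chunkSize > 0 and 0 <= chunkOverlap < chunkSize');
  }
  const step = chunkSize - chunkOverlap; // how far each window advances
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window reached the end
  }
  return chunks;
}

const sample = 'abcdefghij'; // 10 characters
const windows = chunkText(sample, 4, 1);
// windows is ['abcd', 'defg', 'ghij']: each chunk repeats the last character of the previous one
```

With the settings used in this guide (512 and 64), consecutive chunks would share 64 characters in this simplified scheme.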
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

// `vectorStore` is assumed to be a client for your vector database that
// exposes upsertMany() and query(); it is not defined in this guide.
async function indexDocuments(documents: { id: string; content: string; metadata: Record<string, string> }[]) {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 512,   // measured in characters by default, not tokens
    chunkOverlap: 64,
  });
  const chunks: { id: string; text: string; embedding: number[]; metadata: Record<string, string> }[] = [];
  for (const doc of documents) {
    const textChunks = await splitter.splitText(doc.content);
    if (textChunks.length === 0) continue; // nothing to embed for empty documents
    // Batch embed for efficiency (one request per document)
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: textChunks,
    });
    textChunks.forEach((text, i) => {
      chunks.push({
        id: `${doc.id}-chunk-${i}`,
        text,
        embedding: response.data[i].embedding,
        // Keep the chunk text and a source id in metadata so query results can return them
        metadata: { source: doc.id, ...doc.metadata, text },
      });
    });
  }
  // Upsert to vector store
  await vectorStore.upsertMany(chunks);
  console.log(`Indexed ${chunks.length} chunks from ${documents.length} documents`);
}

Query pipeline
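The snippets in this guide call `vectorStore.upsertMany()` and `vectorStore.query()` without defining `vectorStore`. For local experiments, a brute-force in-memory stand-in could look like the sketch below. Everything here (`InMemoryVectorStore`, `rankByCosine`, the `StoredChunk` and `QueryHit` shapes) is a hypothetical illustration, not the API of any particular vector database, and it scans every chunk, so it only suits small collections.

```typescript
// Hypothetical in-memory stand-in for the `vectorStore` used in this guide.
interface StoredChunk {
  id: string;
  embedding: number[];
  metadata: Record<string, string>;
}
interface QueryHit {
  id: string;
  score: number;
  metadata: Record<string, string>;
}

function cosineScore(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Exact (brute-force) search: filter on metadata, score every chunk, keep the top K.
function rankByCosine(
  chunks: StoredChunk[],
  vector: number[],
  topK: number,
  filter: Record<string, string> = {}
): QueryHit[] {
  return chunks
    .filter(chunk => Object.keys(filter).every(key => chunk.metadata[key] === filter[key]))
    .map(chunk => ({ id: chunk.id, score: cosineScore(vector, chunk.embedding), metadata: chunk.metadata }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}

class InMemoryVectorStore {
  private chunks = new Map<string, StoredChunk>();

  async upsertMany(items: StoredChunk[]): Promise<void> {
    for (const item of items) this.chunks.set(item.id, item); // same id overwrites
  }

  async query(params: {
    vector: number[];
    topK: number;
    filter?: Record<string, string>;
    includeMetadata?: boolean; // accepted for compatibility; metadata is always returned
  }): Promise<QueryHit[]> {
    return rankByCosine(Array.from(this.chunks.values()), params.vector, params.topK, params.filter);
  }
}

const demoChunks: StoredChunk[] = [
  { id: 'a', embedding: [1, 0], metadata: { lang: 'en' } },
  { id: 'b', embedding: [0.6, 0.8], metadata: { lang: 'en' } },
  { id: 'c', embedding: [0, 1], metadata: { lang: 'de' } },
];
const demoHits = rankByCosine(demoChunks, [1, 0], 2); // ids 'a' then 'b'
```

A production store would replace the linear scan with an approximate nearest-neighbour index, but the contract stays the same: a vector in, scored ids and metadata out.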
async function semanticSearch(
  query: string,
  options: { topK?: number; threshold?: number; filter?: Record<string, string> } = {}
) {
  // Useful thresholds vary by embedding model; calibrate the 0.7 default on your own data
  const { topK = 5, threshold = 0.7, filter } = options;
  // Embed query using same model as index
  const queryEmbedding = await embed(query);
  // Search vector store
  const results = await vectorStore.query({
    vector: queryEmbedding,
    topK: topK * 2, // over-fetch for threshold filtering
    filter,
    includeMetadata: true,
  });
  // Filter by similarity threshold
  return results
    .filter(r => r.score >= threshold)
    .slice(0, topK)
    .map(r => ({ text: r.metadata.text, score: r.score, source: r.metadata.source }));
}

Hybrid search: BM25 + vector
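Vector search excels at semantics; BM25 excels at exact keyword matching. Combining the two often outperforms either alone. The hybridSearch function in this section merges both ranked lists with weighted Reciprocal Rank Fusion.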
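To make the sparse side concrete, here is a minimal BM25 scorer. It is a sketch under stated assumptions: `TinyBM25` and `tokenize` are hypothetical names, tokenization is a plain lowercase split on non-alphanumeric characters, the parameters are the commonly used defaults k1 = 1.2 and b = 0.75, and the IDF uses the non-negative variant ln(1 + (N - n + 0.5) / (n + 0.5)). A real deployment would more likely rely on a search engine or an established library.

```typescript
// Hypothetical minimal BM25 index for illustration only.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(t => t.length > 0);
}

class TinyBM25 {
  private docs: { id: string; termFreqs: Map<string, number>; length: number }[] = [];
  private docFreq = new Map<string, number>(); // number of documents containing each term
  private totalLength = 0;
  private readonly k1: number;
  private readonly b: number;

  constructor(k1 = 1.2, b = 0.75) {
    this.k1 = k1;
    this.b = b;
  }

  add(id: string, text: string): void {
    const tokens = tokenize(text);
    const termFreqs = new Map<string, number>();
    for (const t of tokens) termFreqs.set(t, (termFreqs.get(t) ?? 0) + 1);
    termFreqs.forEach((_count, term) => {
      this.docFreq.set(term, (this.docFreq.get(term) ?? 0) + 1);
    });
    this.docs.push({ id, termFreqs, length: tokens.length });
    this.totalLength += tokens.length;
  }

  search(query: string, topK: number): { id: string; score: number }[] {
    const n = this.docs.length;
    if (n === 0) return [];
    const avgLength = this.totalLength / n;
    const terms = Array.from(new Set(tokenize(query)));
    return this.docs
      .map(doc => {
        let score = 0;
        for (const term of terms) {
          const tf = doc.termFreqs.get(term) ?? 0;
          if (tf === 0) continue; // term absent: contributes nothing
          const df = this.docFreq.get(term) ?? 0;
          const idf = Math.log(1 + (n - df + 0.5) / (df + 0.5));
          const lengthNorm = 1 - this.b + this.b * (doc.length / avgLength);
          score += (idf * tf * (this.k1 + 1)) / (tf + this.k1 * lengthNorm);
        }
        return { id: doc.id, score };
      })
      .filter(hit => hit.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}

const bm25Index = new TinyBM25();
bm25Index.add('faq-1', 'How to cancel your subscription');
bm25Index.add('faq-2', 'Terminate your membership at any time');
bm25Index.add('faq-3', 'Pizza delivery guide');
const keywordHits = bm25Index.search('cancel subscription', 2); // only 'faq-1' matches
```

Note that the query "cancel subscription" never reaches 'faq-2', even though it answers the same question. That gap is what the dense side of hybrid search is meant to cover, while BM25 covers exact terms such as product names and error codes.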
import { BM25 } from 'bm25-ts';

// `bm25Index` is assumed to be a BM25 index built elsewhere over the same chunk ids,
// with a search(query, k) method that returns ranked hits containing an `id`.
async function hybridSearch(query: string, topK = 5, alpha = 0.7) {
  // alpha=1: pure vector, alpha=0: pure BM25
  const RRF_K = 60; // conventional smoothing constant for Reciprocal Rank Fusion
  // Dense (vector) search
  const queryVec = await embed(query);
  const denseHits = await vectorStore.query({ vector: queryVec, topK: topK * 2 });
  // Sparse (BM25 keyword) search
  const sparseHits = bm25Index.search(query, topK * 2);
  // Weighted Reciprocal Rank Fusion; array indexes are 0-based, RRF ranks are 1-based
  const scores = new Map<string, number>();
  denseHits.forEach(({ id }, i) => {
    scores.set(id, (scores.get(id) ?? 0) + alpha / (RRF_K + i + 1));
  });
  sparseHits.forEach(({ id }, i) => {
    scores.set(id, (scores.get(id) ?? 0) + (1 - alpha) / (RRF_K + i + 1));
  });
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([id, score]) => ({ id, score }));
}

Reranking for precision
Use a cross-encoder reranker to reorder the top-K results for precision (at the cost of latency):
// Use a cross-encoder model (e.g., Cohere Rerank, Jina Rerank)
import { CohereRerank } from '@langchain/cohere';

async function rerankResults(query: string, documents: string[], topN = 3) {
  const reranker = new CohereRerank({ model: 'rerank-english-v3.0', topN });
  return reranker.compressDocuments(
    documents.map((d, i) => ({ pageContent: d, metadata: { id: String(i) } })),
    query
  );
}

// Pipeline: vector search (recall) → rerank (precision)
const candidates = await semanticSearch(query, { topK: 20 });
const reranked = await rerankResults(query, candidates.map(c => c.text));

Query expansion
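// Expand sparse queries to improve recall
async function expandQuery(query: string): Promise<string[]> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{
      role: 'user',
      // json_object mode returns an object, so ask for a named key rather than a bare array
      content: `Generate 3 alternative phrasings of this search query. Output a JSON object with a single key "alternatives".
Query: "${query}"
Output: {"alternatives": ["alt1", "alt2", "alt3"]}`,
    }],
    response_format: { type: 'json_object' },
    max_tokens: 100,
  });
  const parsed = JSON.parse(response.choices[0].message.content ?? '{}');
  const alternatives: string[] = Array.isArray(parsed.alternatives) ? parsed.alternatives : [];
  return [query, ...alternatives];
}

// Keep the best-scoring hit for each distinct chunk text
function deduplicateByScore<T extends { text: string; score: number }>(hits: T[]): T[] {
  const best = new Map<string, T>();
  for (const hit of hits) {
    const seen = best.get(hit.text);
    if (!seen || hit.score > seen.score) best.set(hit.text, hit);
  }
  return [...best.values()].sort((a, b) => b.score - a.score);
}

// Search all expansions and merge results
const queries = await expandQuery(userQuery);
const allResults = await Promise.all(queries.map(q => semanticSearch(q, { topK: 5 })));
const deduped = deduplicateByScore(allResults.flat());

Performance optimization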
| Optimization | Impact | How |
|---|---|---|
| Batch embedding | 3–5× faster indexing | Send up to 2048 inputs per API call |
| Embedding cache | Near-zero repeated query cost | Cache in Redis, keyed by a SHA-256 hash of model name + text |
| Dimension reduction | About 83% less vector memory (1536 → 256 floats), at some cost in retrieval quality | Use text-embedding-3-small with dimensions=256 |
| HNSW index | 10–100× faster ANN search | Use an HNSW index in pgvector instead of an exact sequential scan |
| Metadata pre-filter | 2–5× faster with filters | Index metadata fields in vector store |
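As an illustration of the embedding cache row, the sketch below wraps any embedding function with a lookup keyed by a SHA-256 hash. The names `cacheKey`, `withEmbeddingCache`, and `EmbedFn` are hypothetical, and a plain `Map` stands in for Redis so the example stays self-contained; including the model name in the key is an assumption made here so that switching models never returns vectors from the old one.

```typescript
import { createHash } from 'node:crypto';

type EmbedFn = (text: string) => Promise<number[]>;

// Cache key: SHA-256 over model name + text (a NUL byte separates the two parts).
function cacheKey(model: string, text: string): string {
  return createHash('sha256').update(model).update('\u0000').update(text).digest('hex');
}

// Wraps an embedding function with a cache. The Map is a stand-in for Redis.
function withEmbeddingCache(
  model: string,
  embedFn: EmbedFn,
  cache: Map<string, number[]> = new Map()
): EmbedFn {
  return async (text: string) => {
    const key = cacheKey(model, text);
    const cached = cache.get(key);
    if (cached) return cached;           // cache hit: no API call
    const vector = await embedFn(text);  // cache miss: compute and store
    cache.set(key, vector);
    return vector;
  };
}
```

Usage would look like `const cachedEmbed = withEmbeddingCache('text-embedding-3-small', embed);`, after which repeated calls with the same query text skip the API. With Redis, the `get` and `set` calls become network round trips and the vector needs to be serialized, for example as JSON.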
Takeaway
Start with pure vector search and a single embedding model. Add BM25 hybrid search once you have enough user queries to diagnose keyword-recall gaps. Add reranking only if your top-5 precision matters more than latency. Each layer adds cost and complexity — validate the improvement before adding it.