Semantic Search Implementation: Embeddings, Cosine Similarity, and Hybrid Search

Semantic search understands intent, not just keywords. A user searching "how to cancel subscription" should find results about "account termination" and "membership end date" — even without keyword overlap. This guide covers building semantic search end-to-end.

How semantic search works

// Traditional keyword search
query: "cancel subscription"
result: documents containing "cancel" AND/OR "subscription"

// Semantic search
query: "cancel subscription"
result: documents semantically similar to the query intent:
  - "terminate your membership"     ← high similarity
  - "account deletion process"      ← medium similarity
  - "subscription management page"  ← medium similarity
  - "pizza delivery guide"          ← low similarity

Text embeddings

An embedding is a fixed-size vector that captures semantic meaning. Texts with similar meaning map to vectors that point in similar directions, which is what cosine similarity measures.

import OpenAI from 'openai';

const openai = new OpenAI();

async function embed(text: string): Promise<number[]> {
  const response = await openai.embeddings.create({
    model: 'text-embedding-3-small',  // 1536 dimensions
    input: text,
    encoding_format: 'float',
  });
  return response.data[0].embedding;
}

const v1 = await embed('cancel subscription');
const v2 = await embed('terminate membership');
const v3 = await embed('pizza delivery');

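// cosineSimilarity is defined in the next section; exact scores depend on the model
console.log(cosineSimilarity(v1, v2));  // relatively high: related meaning
console.log(cosineSimilarity(v1, v3));  // noticeably lower: unrelated topic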

Cosine similarity

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot   += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;  // guard against all-zero vectors
}

// For unit-length vectors (OpenAI embeddings are normalized to length 1;
// check the docs for other models): cosine_similarity = dot_product (faster)
function dotProduct(a: number[], b: number[]): number {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
}
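
As a quick check of the equivalence noted above, the sketch below normalizes two small hand-written vectors (made-up values, not real embeddings) and compares cosine similarity on the raw vectors with the dot product on the normalized ones. The `normalize` helper is introduced here for illustration only.

```typescript
// Cosine similarity, repeated here so the snippet runs on its own.
function cosine(a: number[], b: number[]): number {
  let sum = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    sum   += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return sum / (Math.sqrt(normA) * Math.sqrt(normB));
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
}

// Hypothetical helper: scale a non-zero vector to unit (L2) length.
function normalize(v: number[]): number[] {
  const norm = Math.sqrt(dot(v, v));
  return v.map(x => x / norm);
}

// Toy 3-dimensional vectors, both of length 5.
const vecA = [3, 4, 0];
const vecB = [4, 3, 0];

const cosRaw  = cosine(vecA, vecB);                     // 24 / (5 * 5) = 0.96
const dotUnit = dot(normalize(vecA), normalize(vecB));  // same value up to rounding

console.log(cosRaw.toFixed(4), dotUnit.toFixed(4));
```

The two numbers agree, so when every stored vector is already unit length the square roots can be skipped at query time.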

Indexing pipeline

import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

async function indexDocuments(documents: { id: string; content: string; metadata: Record<string, string> }[]) {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 512,    // measured in characters by default, not tokens
    chunkOverlap: 64,  // characters shared between neighbouring chunks
  });

  const chunks: { id: string; text: string; embedding: number[]; metadata: Record<string, string> }[] = [];

  for (const doc of documents) {
    const textChunks = await splitter.splitText(doc.content);

    // Batch embed for efficiency
    const response = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: textChunks,
    });

    textChunks.forEach((text, i) => {
      chunks.push({
        id: `${doc.id}-chunk-${i}`,
        text,
        embedding: response.data[i].embedding,
        // Copy the chunk text into metadata: semanticSearch reads r.metadata.text
        metadata: { ...doc.metadata, text },
      });
    });
  }

  // Upsert to vector store
  await vectorStore.upsertMany(chunks);
  console.log(`Indexed ${chunks.length} chunks from ${documents.length} documents`);
}
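
The `vectorStore` object is not defined in this guide. For local experiments, a brute-force in-memory stand-in along the following lines may be enough. This is a sketch under assumptions: the class name, the `Meta` and `Hit` types, and the exact shape of `query` are hypothetical and only mirror how `vectorStore` is called here; a production system would use a real vector database. The methods are synchronous for brevity, so `await`ing them still works.

```typescript
// Hypothetical in-memory stand-in for `vectorStore`; not a real library API.
type Meta = Record<string, string>;
type StoredChunk = { id: string; text: string; embedding: number[]; metadata: Meta };
type Hit = { id: string; score: number; metadata: Meta };

function cosineSim(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot   += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

class InMemoryVectorStore {
  private chunks = new Map<string, StoredChunk>();

  // Insert new chunks; a chunk with an existing id overwrites the old one.
  upsertMany(chunks: StoredChunk[]): void {
    for (const c of chunks) this.chunks.set(c.id, c);
  }

  // Exact (brute-force) search: score every chunk, then keep the topK best.
  query(opts: { vector: number[]; topK: number; filter?: Meta }): Hit[] {
    const filter = opts.filter ?? {};
    const hits: Hit[] = [];
    this.chunks.forEach(c => {
      // Metadata filter: every key in the filter must match exactly.
      const matches = Object.keys(filter).every(k => c.metadata[k] === filter[k]);
      if (!matches) return;
      hits.push({
        id: c.id,
        score: cosineSim(opts.vector, c.embedding),
        // Expose the chunk text through metadata, where semanticSearch looks for it.
        metadata: { ...c.metadata, text: c.text },
      });
    });
    return hits.sort((x, y) => y.score - x.score).slice(0, opts.topK);
  }
}

// Usage with toy 2-dimensional "embeddings".
const store = new InMemoryVectorStore();
store.upsertMany([
  { id: 'a', text: 'terminate your membership', embedding: [1, 0], metadata: { source: 'faq' } },
  { id: 'b', text: 'pizza delivery guide',      embedding: [0, 1], metadata: { source: 'blog' } },
]);
const hits = store.query({ vector: [0.9, 0.1], topK: 1 });
console.log(hits[0].id);  // 'a'
```

Brute-force scoring is linear in the number of chunks, which is fine for a few thousand vectors and a useful correctness baseline before switching to an approximate index.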

Query pipeline

async function semanticSearch(
  query: string,
  options: { topK?: number; threshold?: number; filter?: Record<string, string> } = {}
) {
  // 0.7 is a placeholder: score ranges differ by embedding model, so calibrate on real queries
  const { topK = 5, threshold = 0.7, filter } = options;

  // Embed query using same model as index
  const queryEmbedding = await embed(query);

  // Search vector store
  const results = await vectorStore.query({
    vector: queryEmbedding,
    topK: topK * 2,  // over-fetch for threshold filtering
    filter,
    includeMetadata: true,
  });

  // Filter by similarity threshold
  return results
    .filter(r => r.score >= threshold)
    .slice(0, topK)
    .map(r => ({ text: r.metadata.text, score: r.score, source: r.metadata.source }));
}

Hybrid search: BM25 + vector

Vector search excels at semantics; BM25 excels at exact keyword matching, such as product codes, error messages, and rare names. Combining both often outperforms either alone:

import { BM25 } from 'bm25-ts';

async function hybridSearch(query: string, topK = 5, alpha = 0.7) {
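  // alpha = 1: pure vector ranking, alpha = 0: pure BM25 ranking.
  // Assumes `bm25Index` was built at indexing time over the same chunk ids as the vector store.
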
  // Dense (vector) search
  const queryVec = await embed(query);
  const denseHits = await vectorStore.query({ vector: queryVec, topK: topK * 2 });

  // Sparse (BM25 keyword) search
  const sparseHits = bm25Index.search(query, topK * 2);

  // Weighted Reciprocal Rank Fusion: each list contributes weight / (k + rank),
  // with 1-based ranks and the commonly used constant k = 60
  const RRF_K = 60;
  const scores = new Map<string, number>();

  denseHits.forEach(({ id }, i) => {
    scores.set(id, (scores.get(id) ?? 0) + alpha / (RRF_K + i + 1));
  });

  sparseHits.forEach(({ id }, i) => {
    scores.set(id, (scores.get(id) ?? 0) + (1 - alpha) / (RRF_K + i + 1));
  });

  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topK)
    .map(([id, score]) => ({ id, score }));
}
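
To see how the fusion step behaves without calling any API, it can be isolated as a pure function over two ranked lists of ids. The sketch below does that, assuming 1-based ranks and k = 60; the function name `fuseRanks` and the document ids are made up for illustration.

```typescript
// Weighted Reciprocal Rank Fusion over two ranked lists of document ids.
// Each list contributes weight / (k + rank), with rank starting at 1.
function fuseRanks(
  dense: string[],
  sparse: string[],
  alpha = 0.7,
  k = 60
): [string, number][] {
  const scores = new Map<string, number>();
  dense.forEach((id, i) => {
    scores.set(id, (scores.get(id) ?? 0) + alpha / (k + i + 1));
  });
  sparse.forEach((id, i) => {
    scores.set(id, (scores.get(id) ?? 0) + (1 - alpha) / (k + i + 1));
  });
  // Highest fused score first.
  return Array.from(scores.entries()).sort((x, y) => y[1] - x[1]);
}

// doc-b appears in both lists, so it outranks doc-a even though doc-a is the top dense hit.
const fused = fuseRanks(['doc-a', 'doc-b', 'doc-c'], ['doc-b', 'doc-d'], 0.5);
console.log(fused.map(([id]) => id));  // [ 'doc-b', 'doc-a', 'doc-d', 'doc-c' ]
```

Because only ranks are used, the raw BM25 and cosine scores never need to be put on a common scale, which is the main reason to prefer rank fusion over a weighted sum of scores.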

Reranking for precision

Use a cross-encoder reranker to reorder the top-K results for precision (at the cost of latency):

// Use a cross-encoder model (e.g., Cohere Rerank, Jina Rerank)
import { CohereRerank } from '@langchain/cohere';

async function rerankResults(query: string, documents: string[], topN = 3) {
  const reranker = new CohereRerank({ model: 'rerank-english-v3.0', topN });
  return reranker.compressDocuments(
    documents.map((d, i) => ({ pageContent: d, metadata: { id: String(i) } })),
    query
  );
}

// Pipeline: vector search (recall) → rerank (precision)
const candidates = await semanticSearch(query, { topK: 20 });
const reranked   = await rerankResults(query, candidates.map(c => c.text));

Query expansion

// Expand short or underspecified queries to improve recall
async function expandQuery(query: string): Promise<string[]> {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{
      role: 'user',
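      // json_object mode returns a JSON object, so ask for an object with an "alternatives" key
      content: `Generate 3 alternative phrasings of this search query. Respond with a JSON object.
Query: ${JSON.stringify(query)}
Output format: {"alternatives": ["alt1", "alt2", "alt3"]}`,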
    }],
    response_format: { type: 'json_object' },
    max_tokens: 100,
  });

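  // Fall back to the original query alone if the model returns an unexpected shape
  const parsed = JSON.parse(response.choices[0].message.content ?? '{}');
  const alternatives: string[] = Array.isArray(parsed.alternatives) ? parsed.alternatives : [];
  return [query, ...alternatives];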
}

// Search all expansions and merge results
const queries = await expandQuery(userQuery);
const allResults = await Promise.all(queries.map(q => semanticSearch(q, { topK: 5 })));
const deduped = deduplicateByScore(allResults.flat());
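
`deduplicateByScore` is not defined in this guide. One plausible implementation, sketched below, keeps the highest-scoring copy of each distinct chunk text and returns the survivors best first. Keying on `text` is an assumption made here; keying on a chunk id would work equally well if the results carried one.

```typescript
// Result shape returned by semanticSearch in this guide.
type SearchResult = { text: string; score: number; source: string };

// Hypothetical implementation: keep the best-scoring copy of each chunk text.
function deduplicateByScore(results: SearchResult[]): SearchResult[] {
  const best = new Map<string, SearchResult>();
  for (const r of results) {
    const seen = best.get(r.text);
    if (!seen || r.score > seen.score) best.set(r.text, r);
  }
  // Highest score first.
  return Array.from(best.values()).sort((a, b) => b.score - a.score);
}

// The same chunk found by two query phrasings collapses to its best score.
const merged: SearchResult[] = [
  { text: 'A', score: 0.81, source: 'faq' },
  { text: 'B', score: 0.75, source: 'faq' },
  { text: 'A', score: 0.88, source: 'faq' },
];
const unique = deduplicateByScore(merged);
console.log(unique.map(r => `${r.text}:${r.score}`));  // [ 'A:0.88', 'B:0.75' ]
```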

Performance optimization

Optimization	Impact	How
Batch embedding	3–5× faster indexing	Send up to 2048 inputs per API call
Embedding cache	Near-zero repeated query cost	Cache in Redis with sha256 key
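Dimension reduction	About 83% less vector memory (1536 → 256 dimensions), at some cost in retrieval quality	Use text-embedding-3-small with dimensions=256
HNSW index	10–100× faster ANN search	Build an HNSW index in pgvector instead of scanning every row (exact search)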
Metadata pre-filter	2–5× faster with filters	Index metadata fields in vector store
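
The dimension-reduction row refers to the `dimensions` request parameter of the text-embedding-3 models. For vectors that are already stored at full size, a manual equivalent is to keep the leading components and re-normalize, assuming a model whose leading dimensions carry most of the signal, as the text-embedding-3 models are documented to. The helper name `truncateEmbedding` is hypothetical, and the 4-dimensional input is a toy stand-in for a 1536-dimensional embedding.

```typescript
// Hypothetical helper: keep the first `dims` components and rescale to unit length,
// so that dot product still equals cosine similarity on the shortened vectors.
function truncateEmbedding(embedding: number[], dims: number): number[] {
  const head = embedding.slice(0, dims);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  if (norm === 0) return head;  // avoid dividing by zero on an all-zero prefix
  return head.map(x => x / norm);
}

// Toy unit vector: 4 * 0.5^2 = 1.
const full = [0.5, 0.5, 0.5, 0.5];
const shortened = truncateEmbedding(full, 2);

const shortenedNorm = Math.sqrt(shortened.reduce((sum, x) => sum + x * x, 0));
console.log(shortened.length, shortenedNorm.toFixed(4));
```

Going from 1536 to 256 dimensions stores one sixth of the floats per vector. Index and query vectors must be shortened the same way, and retrieval quality should be re-measured after the change.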

Takeaway

Start with pure vector search and a single embedding model. Add BM25 hybrid search once you have enough user queries to diagnose keyword-recall gaps. Add reranking only if your top-5 precision matters more than latency. Each layer adds cost and complexity — validate the improvement before adding it.