Embeddings Deep Dive: Models, Dimensions, Normalization, and Production Patterns

Embeddings are a foundational primitive of modern AI applications: semantic search, RAG, clustering, classification, deduplication, and anomaly detection all depend on high-quality vector representations. This guide covers model selection, dimension reduction, batching, normalization, caching, and the most common production patterns.

What is an embedding?

An embedding maps text (or other data) to a fixed-size vector of floats. Semantically similar texts produce vectors that point in similar directions, and that closeness is measured with cosine similarity or the dot product.

// "cat" and "kitten" → similar vectors
// "cat" and "database" → dissimilar vectors
// embed() and cosineSimilarity() are placeholder helpers; all numbers here are illustrative
const catVec    = await embed('cat');      // [0.12, -0.87, 0.34, ...]
const kittenVec = await embed('kitten');   // [0.11, -0.89, 0.36, ...]
const dbVec     = await embed('database'); // [-0.45, 0.23, -0.67, ...]

cosineSimilarity(catVec, kittenVec);  // high, e.g. ~0.93
cosineSimilarity(catVec, dbVec);      // low, e.g. ~0.08 (real models rarely score unrelated words this close to 0)
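
The cosineSimilarity helper used above is not part of any SDK. A minimal sketch of one way to write it follows; returning 0 for a zero vector is a choice made for this example, since the similarity is undefined in that case.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|). The result lies between -1 and 1.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Assumption of this sketch: a zero vector has no direction, so report 0
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy 2-D vectors, not real embeddings
const sameDirection = cosineSimilarity([1, 0], [2, 0]); // vectors point the same way
const orthogonal = cosineSimilarity([1, 0], [0, 1]);    // vectors share no direction
```

Because the magnitudes are divided out, only direction matters: [1, 0] and [2, 0] count as identical.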

Embedding model comparison

Model	Dimensions	Max tokens	Cost/1M tokens	Best for
text-embedding-3-small	1536	8191	$0.020	General, cost-sensitive
text-embedding-3-large	3072	8191	$0.130	Max accuracy
text-embedding-ada-002	1536	8191	$0.100	Legacy
BAAI/bge-large-en-v1.5	1024	512	Free (self-hosted)	Open-source, on-prem
nomic-embed-text	768	8192	Free (Ollama)	Local dev, privacy
Cohere embed-v3	1024	512	$0.100	Multilingual
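
To turn the per-token prices into a budget figure, the arithmetic can be sketched as below. The prices are copied from the table above and may be out of date; estimateEmbeddingCost and the document counts are hypothetical names and numbers chosen for illustration.

```typescript
// USD per 1M tokens, copied from the comparison table; verify current pricing before budgeting
const PRICE_PER_MILLION_TOKENS: Record<string, number> = {
  'text-embedding-3-small': 0.02,
  'text-embedding-3-large': 0.13,
  'text-embedding-ada-002': 0.1,
};

// Hypothetical helper: one-time cost of embedding a corpus
function estimateEmbeddingCost(model: string, docCount: number, avgTokensPerDoc: number): number {
  const price = PRICE_PER_MILLION_TOKENS[model];
  if (price === undefined) throw new Error(`No price listed for ${model}`);
  const totalTokens = docCount * avgTokensPerDoc;
  return (totalTokens / 1e6) * price;
}

// 100,000 documents at roughly 500 tokens each = 50M tokens
const smallCost = estimateEmbeddingCost('text-embedding-3-small', 100000, 500); // about $1.00
const largeCost = estimateEmbeddingCost('text-embedding-3-large', 100000, 500); // about $6.50
```

At these prices the API bill for indexing is usually small next to vector storage and query costs, which is why the dimension count often matters more than the per-token price.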

Dimension reduction with Matryoshka

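OpenAI's text-embedding-3 models were trained with Matryoshka Representation Learning, which concentrates the most important information in the leading dimensions. That lets you request shorter vectors through the dimensions parameter: they are cheaper to store and faster to search, with a small accuracy trade-off: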

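import OpenAI from 'openai';

const openai = new OpenAI();  // reads OPENAI_API_KEY from the environment

// Reduce dimensions for storage efficiency
const response = await openai.embeddings.create({
  model: 'text-embedding-3-small',
  input: text,
  dimensions: 256,  // default 1536 → 256 (83% storage savings)
});

// Benchmark (MTEB averages published by OpenAI): text-embedding-3-small@512 scores 61.6
// and text-embedding-3-large@256 scores 62.0, both ahead of ada-002@1536 at 61.0.
// No figure was published for text-embedding-3-small@256, so validate that size on your own data.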
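
If full-size vectors are already stored, an alternative to re-calling the API is to shorten them yourself: keep the leading components and re-normalize. The sketch below assumes a Matryoshka-trained model such as the text-embedding-3 family (truncating other models' vectors generally degrades quality) and float32 storage; truncateAndNormalize and storageBytes are hypothetical helper names.

```typescript
// Keep the first `dims` components, then rescale the result to unit length
function truncateAndNormalize(embedding: number[], dims: number): number[] {
  if (!Number.isInteger(dims) || dims <= 0 || dims > embedding.length) {
    throw new Error(`dims must be an integer between 1 and ${embedding.length}`);
  }
  const head = embedding.slice(0, dims);
  const norm = Math.sqrt(head.reduce((sum, v) => sum + v * v, 0));
  return norm === 0 ? head : head.map(v => v / norm);
}

// Raw storage for float32 vectors: 4 bytes per dimension (index overhead not included)
function storageBytes(vectorCount: number, dims: number): number {
  return vectorCount * dims * 4;
}

const fullSize = storageBytes(1000000, 1536); // 6,144,000,000 bytes, about 6.1 GB
const reduced = storageBytes(1000000, 256);   // 1,024,000,000 bytes, about 1.0 GB
```

The re-normalization step matters: a truncated vector is no longer unit length, so dot-product scores would be skewed without it.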

Batch embedding for efficiency

// The API accepts up to 2048 inputs per request; a smaller batch keeps long texts under the per-request token limit
async function batchEmbed(texts: string[], model = 'text-embedding-3-small'): Promise<number[][]> {
  const BATCH_SIZE = 100;  // safe limit for large texts
  const results: number[][] = [];

  for (let i = 0; i < texts.length; i += BATCH_SIZE) {
    const batch = texts.slice(i, i + BATCH_SIZE);

    const response = await openai.embeddings.create({ model, input: batch });
    results.push(...response.data.map(d => d.embedding));

    // Crude pacing between batches; production code should also retry on HTTP 429 responses
    if (i + BATCH_SIZE < texts.length) {
      await new Promise(r => setTimeout(r, 100));
    }
  }

  return results;
}

// Index 10,000 documents
const allTexts = documents.map(d => d.content);
const allEmbeddings = await batchEmbed(allTexts);
// Much faster than 10,000 individual API calls
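
The fixed 100 ms pause in batchEmbed does not react to actual rate-limit errors. One common complement is exponential backoff with jitter, sketched below. It assumes the thrown error carries a numeric status property (as HTTP client errors often do); withRetry and backoffDelayMs are hypothetical helpers, and the official OpenAI Node SDK already retries some failures on its own, so check its settings before layering this on top.

```typescript
// Delay ceiling before retry number `attempt` (0-based): base * 2^attempt, capped at capMs
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Retries `fn` only when it fails with status 429; every other error is rethrown immediately
async function withRetry<T>(fn: () => Promise<T>, maxRetries = 5, baseMs = 500): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Assumption: rate-limit errors expose the HTTP status as `err.status`
      const status = (err as { status?: number } | null)?.status;
      if (status !== 429 || attempt >= maxRetries) throw err;
      // Full jitter: wait a random time between 0 and the exponential ceiling
      const delay = Math.random() * backoffDelayMs(attempt, baseMs);
      await new Promise(resolve => setTimeout(resolve, delay));
    }
  }
}
```

Inside batchEmbed, the call would become something like withRetry(() => openai.embeddings.create({ model, input: batch })).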

Vector normalization

// OpenAI embeddings are already normalized (unit vectors)
// For other models, normalize manually before storing

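function normalize(vector: number[]): number[] {
  const magnitude = Math.sqrt(vector.reduce((sum, v) => sum + v * v, 0));
  if (magnitude === 0) return vector.slice();  // a zero vector has no direction; avoid dividing by zero
  return vector.map(v => v / magnitude);
}

// For normalized vectors: cosine_similarity = dot_product (much faster)
function dotProduct(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vector dimensions must match');
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

// Storing normalized vectors lets every comparison skip two norm computations and a division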
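
Using the dot product in place of cosine similarity is only valid when every stored vector really is unit length. A small sanity check at ingestion time can catch a model or pipeline step that skips normalization. This is a sketch; the 1e-3 tolerance is an arbitrary assumption meant to absorb float32 rounding, and the helper names are hypothetical.

```typescript
function l2Norm(v: number[]): number {
  return Math.sqrt(v.reduce((sum, x) => sum + x * x, 0));
}

// True when the vector's length is within `tolerance` of 1
function isUnitLength(v: number[], tolerance = 1e-3): boolean {
  return Math.abs(l2Norm(v) - 1) <= tolerance;
}

// Throws on the first vector that is not unit length, naming its index
function assertAllUnitLength(vectors: number[][]): void {
  vectors.forEach((v, i) => {
    if (!isUnitLength(v)) {
      throw new Error(`Vector ${i} has norm ${l2Norm(v).toFixed(4)}, expected 1`);
    }
  });
}

assertAllUnitLength([[0.6, 0.8], [1, 0]]); // passes silently
```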

Embedding cache

import { createHash } from 'crypto';
import { Redis } from '@upstash/redis';

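// The Upstash REST client needs both the REST URL and its access token
const redis = new Redis({
  url: process.env.UPSTASH_REDIS_REST_URL!,
  token: process.env.UPSTASH_REDIS_REST_TOKEN!,
});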

async function cachedEmbed(text: string, model = 'text-embedding-3-small'): Promise<number[]> {
  const key = `embed:${model}:${createHash('sha256').update(text).digest('hex')}`;

  const cached = await redis.get<number[]>(key);
  if (cached) return cached;

  const response = await openai.embeddings.create({ model, input: text });
  const embedding = response.data[0].embedding;

  await redis.set(key, embedding, { ex: 86400 * 30 });  // cache 30 days
  return embedding;
}

// For static content (product catalog, docs), pre-compute and store forever
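
The cache key in cachedEmbed covers the model and the text but not the dimensions setting, so a 256-dimension request could be served a cached 1536-dimension vector. One possible key scheme that includes it is sketched here; embeddingCacheKey and the 'default' placeholder are naming choices made for this example.

```typescript
import { createHash } from 'node:crypto';

// Builds a key from everything that changes the resulting vector: model, dimensions, and text
function embeddingCacheKey(text: string, model: string, dimensions?: number): string {
  const dimsPart = dimensions === undefined ? 'default' : String(dimensions);
  const digest = createHash('sha256').update(text, 'utf8').digest('hex');
  return `embed:${model}:${dimsPart}:${digest}`;
}

const fullKey = embeddingCacheKey('hello world', 'text-embedding-3-small');
const shortKey = embeddingCacheKey('hello world', 'text-embedding-3-small', 256);
// fullKey and shortKey differ, so the two vector sizes never collide in the cache
```

Hashing the text keeps keys a fixed length regardless of input size and avoids storing raw user content in key names.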

Use case patterns

Semantic search

// Embed query → find nearest documents in vector store
const queryVec = await cachedEmbed(userQuery);
const results  = await vectorStore.query({ vector: queryVec, topK: 5 });
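
For small collections, or for understanding what vectorStore.query does conceptually, a brute-force version can be sketched in a few lines. It assumes unit-length vectors so the dot product equals cosine similarity; topKBySimilarity and the document shape are hypothetical, and real vector stores use approximate indexes instead of scanning everything.

```typescript
interface IndexedDoc { id: string; vector: number[]; }
interface ScoredDoc { id: string; score: number; }

// Scores every document against the query and returns the k best, highest score first
function topKBySimilarity(query: number[], docs: IndexedDoc[], k: number): ScoredDoc[] {
  const dot = (a: number[], b: number[]): number =>
    a.reduce((sum, v, i) => sum + v * b[i], 0);

  return docs
    .map(d => ({ id: d.id, score: dot(query, d.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 2-D unit vectors standing in for real embeddings
const hits = topKBySimilarity(
  [1, 0],
  [
    { id: 'a', vector: [1, 0] },
    { id: 'b', vector: [0, 1] },
    { id: 'c', vector: [0.6, 0.8] },
  ],
  2
);
// hits contains 'a' first, then 'c'
```

This costs O(n) dot products plus a sort per query, which is workable for thousands of vectors but not millions.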

Deduplication

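// Find near-duplicate content (cosine similarity > 0.95)
// Assumes unit-length embeddings, so dotProduct equals cosine similarity.
// Compares every pair, which is O(n²): fine for a few thousand texts, use a vector index beyond that.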
async function findDuplicates(texts: string[]): Promise<[number, number][]> {
  const embeddings = await batchEmbed(texts);
  const duplicates: [number, number][] = [];

  for (let i = 0; i < embeddings.length; i++) {
    for (let j = i + 1; j < embeddings.length; j++) {
      if (dotProduct(embeddings[i], embeddings[j]) > 0.95) {
        duplicates.push([i, j]);
      }
    }
  }
  return duplicates;
}
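
findDuplicates returns pairs, but deduplication usually needs groups: if A matches B and B matches C, all three belong together. One way to merge the pairs is a union-find pass, sketched below. clusterDuplicates is a hypothetical helper; note that chaining can group two texts whose direct similarity is below the threshold.

```typescript
// Merges index pairs into groups of transitively connected items.
// `count` is the total number of texts; items with no duplicate are left out.
function clusterDuplicates(pairs: [number, number][], count: number): number[][] {
  const parent = Array.from({ length: count }, (_, i) => i);

  const find = (x: number): number => {
    while (parent[x] !== x) {
      parent[x] = parent[parent[x]]; // path halving keeps the trees shallow
      x = parent[x];
    }
    return x;
  };

  for (const [a, b] of pairs) {
    const rootA = find(a);
    const rootB = find(b);
    if (rootA !== rootB) parent[rootB] = rootA;
  }

  const groups = new Map<number, number[]>();
  for (let i = 0; i < count; i++) {
    const root = find(i);
    const members = groups.get(root) ?? [];
    members.push(i);
    groups.set(root, members);
  }
  return Array.from(groups.values()).filter(g => g.length > 1);
}

const clusters = clusterDuplicates([[0, 1], [1, 2], [4, 5]], 6);
// clusters: [[0, 1, 2], [4, 5]]
```

From each group you can then keep one representative, for example the lowest index, and drop the rest.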

Classification without fine-tuning

// Zero-shot classification via embedding similarity to label descriptions
const labels = ['billing issue', 'technical bug', 'feature request', 'general inquiry'];
const labelEmbeddings = await batchEmbed(labels);

async function classify(text: string): Promise<string> {
  const textEmbedding = await cachedEmbed(text);
  const scores = labelEmbeddings.map(le => dotProduct(textEmbedding, le));
  const bestIdx = scores.indexOf(Math.max(...scores));
  return labels[bestIdx];
}

const category = await classify('My payment failed with error code 402');
// 'billing issue'
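
classify always returns a label, even when no label fits or two labels score almost the same. A hedged extension is to require a minimum score and a minimum lead over the runner-up, and return null otherwise. The sketch below is independent of the API; pickLabel and both default thresholds are assumptions that need tuning on labelled examples.

```typescript
interface Classification {
  label: string | null; // null means "not confident enough to decide"
  score: number;        // best similarity score
  margin: number;       // lead of the best score over the runner-up
}

function pickLabel(
  scores: number[],
  labels: string[],
  minScore = 0.3,   // assumed default, tune per model and dataset
  minMargin = 0.02  // assumed default, tune per model and dataset
): Classification {
  if (scores.length === 0 || scores.length !== labels.length) {
    throw new Error('scores and labels must be non-empty and the same length');
  }
  // Label indices ordered by score, best first
  const order = scores.map((_, i) => i).sort((a, b) => scores[b] - scores[a]);
  const best = scores[order[0]];
  const runnerUp = order.length > 1 ? scores[order[1]] : -Infinity;
  const margin = best - runnerUp;
  const confident = best >= minScore && margin >= minMargin;
  return { label: confident ? labels[order[0]] : null, score: best, margin };
}

const clear = pickLabel([0.45, 0.2, 0.1], ['billing issue', 'technical bug', 'feature request']);
const ambiguous = pickLabel([0.31, 0.305], ['billing issue', 'technical bug']);
// clear.label is 'billing issue'; ambiguous.label is null because the margin is too small
```

Routing null results to a human or to a fallback model tends to be safer than forcing a guess.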

Anomaly detection

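// Flag messages that are semantically far from normal patterns
const normalMessages = await batchEmbed(historicalNormalMessages);
if (normalMessages.length === 0) throw new Error('Need at least one normal message');

// Average the embeddings component-wise, then normalize once so dotProduct gives cosine similarity
const centroid = normalize(
  normalMessages[0].map((_, i) =>
    normalMessages.reduce((sum, v) => sum + v[i], 0) / normalMessages.length
  )
);

// The 0.5 default is only a starting point; tune it against known normal and anomalous messages
async function isAnomalous(message: string, threshold = 0.5): Promise<boolean> {
  const embedding = await cachedEmbed(message);
  const similarity = dotProduct(embedding, centroid);
  return similarity < threshold;
}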
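
The 0.5 default threshold in isAnomalous is a guess. One data-driven alternative is to score a held-out set of normal messages against the centroid and pick a low percentile of those similarities as the cutoff. The sketch below shows only the percentile step, using linear interpolation between ranks; percentile is a hypothetical helper and the sample numbers are made up.

```typescript
// p-th percentile (0 to 100) with linear interpolation between neighboring ranks
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('values must not be empty');
  if (p < 0 || p > 100) throw new Error('p must be between 0 and 100');
  const sorted = [...values].sort((a, b) => a - b);
  const rank = (p / 100) * (sorted.length - 1);
  const lower = Math.floor(rank);
  const upper = Math.ceil(rank);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (rank - lower);
}

// Made-up similarities of held-out normal messages to the centroid
const normalSimilarities = [0.9, 0.8, 0.7, 0.6, 0.5];
const median = percentile(normalSimilarities, 50); // 0.7
```

With real data, a cutoff such as percentile(similarities, 1) would flag roughly the least typical 1% of normal traffic, which gives a direct handle on the false-positive rate.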

Recommendation

// Find similar items in a catalog
async function findSimilarProducts(productId: string, topK = 5) {
  const productEmbedding = await getStoredEmbedding(productId);
  const results = await vectorStore.query({
    vector: productEmbedding,
    topK: topK + 1,  // +1 because the query item comes back as its own nearest neighbor
    filter: { type: 'product' },
  });
  return results.filter(r => r.id !== productId).slice(0, topK);
}

Production checklist

Cache embeddings for static content: never pay to embed the same text twice.
Version your embedding model: switching models invalidates all stored vectors.
Normalize vectors before storage so cosine similarity reduces to a dot product.
Use batching for indexing: never call the embeddings API one text at a time in a loop.
Pre-compute label embeddings for classification: they only change when the labels or the model change.
Monitor embedding API latency separately from generation latency.
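
The versioning item on the checklist can be made concrete by storing the model name and dimension count next to every vector and checking them before a query. This is a minimal sketch; the record shape and the helper names are assumptions, not the schema of any particular vector store.

```typescript
interface EmbeddingMeta {
  model: string;      // e.g. 'text-embedding-3-small'
  dimensions: number; // e.g. 1536 or 256
}

interface StoredVector {
  id: string;
  vector: number[];
  meta: EmbeddingMeta;
}

// Vectors are only comparable when both model and dimensions match
function isCompatible(stored: EmbeddingMeta, current: EmbeddingMeta): boolean {
  return stored.model === current.model && stored.dimensions === current.dimensions;
}

// Returns the ids that must be re-embedded before they can be searched with `current`
function findStale(records: StoredVector[], current: EmbeddingMeta): string[] {
  return records.filter(r => !isCompatible(r.meta, current)).map(r => r.id);
}

const current: EmbeddingMeta = { model: 'text-embedding-3-small', dimensions: 256 };
const stale = findStale(
  [
    { id: 'doc-1', vector: [], meta: { model: 'text-embedding-3-small', dimensions: 256 } },
    { id: 'doc-2', vector: [], meta: { model: 'text-embedding-ada-002', dimensions: 1536 } },
  ],
  current
);
// stale: ['doc-2']
```

Running a check like this at startup turns a silent relevance regression into an explicit re-indexing task.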

Takeaway

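Use text-embedding-3-small as your default and shorten it with the dimensions parameter when storage matters: OpenAI reports that at 512 dimensions it still outperforms ada-002 on MTEB at a third of the storage. Validate smaller sizes such as 256 on your own data before committing to them. Cache aggressively, batch your indexing pipeline, and never change your embedding model mid-project without re-indexing everything.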