AI Cost Optimization: Reducing LLM API Costs in Production

LLM API costs at scale can easily reach $10,000–$100,000/month. This guide covers the highest-leverage strategies to cut costs without degrading user experience.

Where the money goes

Cost driver	Typical share	Addressable by
Input tokens (prompt)	40–60%	Compression, caching, smaller models
Output tokens	30–50%	max_tokens limits, structured output
Repeated prompts	10–30%	Semantic cache
Over-powered model	20–50%	Model routing

These categories overlap (repeated prompts and over-powered models are subsets of token spend), so the shares do not sum to 100%.

1. Semantic caching

Avoid re-querying the LLM when a semantically similar question was already answered:

import { Redis } from '@upstash/redis';
import OpenAI from 'openai';

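// Upstash's REST client needs both a URL and a token; the env var names here are your own choice
const redis = new Redis({
  url: process.env.UPSTASH_REDIS_URL!,
  token: process.env.UPSTASH_REDIS_TOKEN!,
});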
const openai = new OpenAI();

async function cachedCompletion(userQuestion: string) {
  // 1. Embed the question
  const embeddingRes = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: userQuestion,
  });
  const queryVector = embeddingRes.data[0].embedding;

  // 2. Search cache (cosine similarity >= 0.95 → cache hit).
  // `vectorCache` is a placeholder for your vector store wrapper; it is not defined in this guide.
  const cacheHit = await vectorCache.findSimilar(queryVector, { threshold: 0.95 });
  if (cacheHit) return cacheHit.response;

  // 3. Miss: call the LLM
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: userQuestion }],
  });
  const answer = response.choices[0].message.content;

  // 4. Store in cache for one hour (ttl assumed to be in seconds)
  await vectorCache.store(queryVector, answer, { ttl: 3600 });
  return answer;
}

Expected savings: 20–40% for customer support and FAQ-heavy workloads.
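
The `vectorCache` object in the snippet above is not defined in this guide. A minimal in-memory sketch of what it could look like is shown here, assuming a linear scan is acceptable for a small cache. The class name `InMemoryVectorCache`, the options-object signatures, and the injectable clock are all assumptions made for illustration; a production system would more likely use a vector index.

```typescript
// Hypothetical in-memory stand-in for `vectorCache`; not a real library API.
interface CacheEntry {
  vector: number[];
  response: string;
  expiresAt: number; // epoch milliseconds
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class InMemoryVectorCache {
  private entries: CacheEntry[] = [];
  private now: () => number;

  // The clock is injectable so expiry can be exercised deterministically.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  // Returns the most similar unexpired entry at or above the threshold, else null.
  findSimilar(vector: number[], opts: { threshold: number }): { response: string } | null {
    const currentTime = this.now();
    this.entries = this.entries.filter((entry) => entry.expiresAt > currentTime);

    let best: CacheEntry | null = null;
    let bestScore = opts.threshold;
    for (const entry of this.entries) {
      const score = cosineSimilarity(vector, entry.vector);
      if (score >= bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best ? { response: best.response } : null;
  }

  // ttl is in seconds.
  store(vector: number[], response: string, opts: { ttl: number }): void {
    this.entries.push({ vector, response, expiresAt: this.now() + opts.ttl * 1000 });
  }
}
```

The threshold is the main tuning knob: set too low, the cache returns answers to questions that only look similar; set too high, it rarely hits.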

2. Model routing

Route simple queries to cheap models, complex ones to powerful models:

// systemPrompt is passed in explicitly so the function has no undefined references
async function routedCompletion(query: string, systemPrompt: string) {
  // Classify complexity with gpt-4o-mini (cheap)
  const classifyResponse = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{
      role: 'user',
      content: `Classify complexity of this task as "simple" or "complex". Reply with one word only.
Task: ${query}`,
    }],
    max_tokens: 5,
  });

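  // Normalize the reply and tolerate trailing punctuation such as "Complex."
  const reply = classifyResponse.choices[0].message.content?.trim().toLowerCase() ?? '';
  const isComplex = reply.startsWith('complex');

  return openai.chat.completions.create({
    model: isComplex ? 'gpt-4o' : 'gpt-4o-mini',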
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: query },
    ],
  });
}

Expected savings: 50–70% on workloads with a mix of simple and complex queries.
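
To see how a figure in that range can arise, the blended cost can be worked through directly. The sketch below assumes illustrative per-million-token prices, counts only input tokens, and includes the extra classifier call; `routingSavings` and every number passed to it are assumptions for illustration, not measurements.

```typescript
// Fraction of spend saved by routing, relative to sending every query to the strong model.
// All prices are per million tokens; only the ratio between them matters here.
function routingSavings(
  simpleShare: number,        // fraction of queries routed to the cheap model (0..1)
  cheapPrice: number,         // illustrative price of the cheap model
  strongPrice: number,        // illustrative price of the strong model
  tokensPerQuery: number,     // average tokens per query
  classifierTokens: number,   // tokens spent on the classification call (cheap model)
): number {
  const baseline = tokensPerQuery * strongPrice;
  const routed =
    classifierTokens * cheapPrice +
    simpleShare * tokensPerQuery * cheapPrice +
    (1 - simpleShare) * tokensPerQuery * strongPrice;
  return 1 - routed / baseline;
}

// 70% simple queries, 1,000 tokens each, 50-token classifier prompt.
const exampleSavings = routingSavings(0.7, 0.15, 5, 1000, 50);
console.log(exampleSavings.toFixed(4)); // about 0.6775
```

The result is dominated by the share of queries that still go to the strong model, so the savings fall quickly if the classifier labels too many queries as complex.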

3. Prompt compression

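// LLMLingua-style compression: reduce prompt tokens by 50%+ at roughly 5% quality loss (task-dependent; measure on your own evals).
// Note: LLMLingua is distributed as a Python library. The import below assumes a JavaScript
// binding or thin service wrapper that exposes a similar interface; adapt it to what you deploy.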
import { PromptCompressor } from 'llmlingua';

const compressor = new PromptCompressor();
const { compressed_prompt, ratio } = await compressor.compress(
  longContextDocument,
  { target_token: 512, rank_method: 'longllmlingua' }
);
console.log(`Compression ratio: ${ratio}x`);

Also consider manual techniques:

Remove filler phrases ("Please note that", "As an AI language model").
Convert verbose instructions to bullet points.
Truncate retrieved context to only the most relevant sentences.
Place frequently reused instructions at the very start of the prompt so that OpenAI's automatic prompt caching can reuse the prefix.
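
Two of the manual techniques above, removing filler phrases and truncating context to the most relevant sentences, can be sketched in a few lines. This is a deliberately naive illustration: `stripFiller`, `topSentences`, the phrase list, and the word-overlap relevance score are all assumptions, and a real pipeline would more likely rank sentences by embedding similarity.

```typescript
// Illustrative filler list; extend it with phrases that appear in your own prompts.
const FILLER_PHRASES = ['Please note that', 'As an AI language model,'];

// Remove filler phrases and collapse the whitespace they leave behind.
function stripFiller(text: string): string {
  let result = text;
  for (const phrase of FILLER_PHRASES) {
    result = result.split(phrase).join('');
  }
  return result.replace(/\s+/g, ' ').trim();
}

// Keep the `maxSentences` sentences sharing the most words with the query,
// returned in their original order.
function topSentences(context: string, query: string, maxSentences: number): string {
  const queryWords = new Set(query.toLowerCase().match(/[a-z0-9]+/g) ?? []);
  const sentences = context.split(/(?<=[.!?])\s+/).filter((s) => s.length > 0);

  const scored = sentences.map((sentence, index) => {
    const words = sentence.toLowerCase().match(/[a-z0-9]+/g) ?? [];
    const overlap = words.filter((word) => queryWords.has(word)).length;
    return { sentence, index, overlap };
  });

  return scored
    .sort((a, b) => b.overlap - a.overlap || a.index - b.index)
    .slice(0, maxSentences)
    .sort((a, b) => a.index - b.index)
    .map((item) => item.sentence)
    .join(' ');
}
```

For example, with the context "Refunds take 5 days. Our office is in Berlin. Refunds need a receipt." and the query "How do refunds work?", keeping two sentences drops the sentence about the office.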

4. Batch API (50% discount)

Use OpenAI's Batch API for any workload that can tolerate asynchronous results delivered within the 24-hour completion window:

// Create batch file (JSONL)
const batchRequests = documents.map((doc, i) => JSON.stringify({
  custom_id: `request-${i}`,
  method: 'POST',
  url: '/v1/chat/completions',
  body: {
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Summarize this document in 2 sentences.' },
      { role: 'user', content: doc.text },
    ],
    max_tokens: 100,
  },
})).join('\n');

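// Upload and submit. toFile (exported by the openai package) attaches a filename,
// which a bare Blob lacks.
import { toFile } from 'openai';

const file = await openai.files.create({
  file: await toFile(Buffer.from(batchRequests), 'batch.jsonl'),
  purpose: 'batch',
});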

const batch = await openai.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h',
});
console.log('Batch ID:', batch.id);
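
Once the batch finishes, the results come back as a JSONL output file with one line per request, matched to the input through `custom_id`. The sketch below parses that text, assuming it has already been downloaded into a string. `parseBatchOutput` is a hypothetical helper, and the `BatchOutputLine` type covers only the fields it reads rather than the full response shape.

```typescript
// Partial shape of one line in a Batch API output file (only the fields used here).
interface BatchOutputLine {
  custom_id: string;
  response: {
    status_code: number;
    body: { choices: { message: { content: string | null } }[] };
  } | null;
  error: { code: string; message: string } | null;
}

// Map each successful request's custom_id to the generated text.
// Failed or empty results are skipped so the caller can retry them separately.
function parseBatchOutput(jsonl: string): Map<string, string> {
  const results = new Map<string, string>();
  for (const line of jsonl.split('\n')) {
    if (line.trim() === '') continue;
    const parsed = JSON.parse(line) as BatchOutputLine;
    const response = parsed.response;
    if (parsed.error || !response || response.status_code !== 200) continue;
    const firstChoice = response.body.choices[0];
    const content = firstChoice ? firstChoice.message.content : null;
    if (content !== null) results.set(parsed.custom_id, content);
  }
  return results;
}
```

Output order is not guaranteed to match input order, which is why the lookup goes through `custom_id` rather than line position.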

5. Output token control

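// Always set max_tokens: an unconstrained response can be 10x more expensive
await openai.chat.completions.create({
  model: 'gpt-4o',
  messages,
  max_tokens: 256,  // set to the minimum the task needs
  // If the cap is hit, finish_reason is 'length' and JSON output may be truncated, so check for it.

  // For structured output, enforcing a schema discourages extra prose around the payload.
  // `schema` must have the shape { name, schema, strict? }.
  response_format: { type: 'json_schema', json_schema: schema },
});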

6. Prompt caching

OpenAI automatically caches the stable prefix of any prompt that is at least 1,024 tokens long and bills the cached input tokens at a discount (50% on the gpt-4o family):

// Structure your messages to maximize cached prefix
const messages = [
  // Stable system prompt (1024+ tokens): always first, never changes
  { role: 'system', content: largeStableSystemPrompt },
  // Stable examples — also cached
  { role: 'user', content: 'Example input' },
  { role: 'assistant', content: 'Example output' },
  // Dynamic user message — not cached
  { role: 'user', content: userQuery },
];
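
The effect of a cached prefix on the input bill is simple arithmetic. The sketch below assumes an illustrative input rate of $5 per million tokens and a 50% discount on cached tokens; `inputCostUSD` and both numbers are assumptions to check against current pricing. The cached token count would come from `usage.prompt_tokens_details.cached_tokens` in the API response.

```typescript
// Input-side cost of one request when part of the prompt is served from the cache.
function inputCostUSD(
  promptTokens: number,
  cachedTokens: number,
  inputRatePerMillion: number, // illustrative USD price per million input tokens
  cachedDiscount = 0.5,        // assumed discount on cached tokens
): number {
  const uncachedTokens = promptTokens - cachedTokens;
  const cachedRate = inputRatePerMillion * (1 - cachedDiscount);
  return (uncachedTokens * inputRatePerMillion + cachedTokens * cachedRate) / 1_000_000;
}

// A 2,000-token prompt of which 1,536 tokens were a cache hit.
const withCache = inputCostUSD(2000, 1536, 5);
const withoutCache = inputCostUSD(2000, 0, 5);
const cacheSaving = 1 - withCache / withoutCache; // about 0.384
console.log(withCache, withoutCache, cacheSaving);
```

Because the discount applies only to cached input tokens, the overall saving depends on how much of each prompt is stable prefix and on how large input cost is relative to output cost.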

7. Use embeddings for classification

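For routing and classification tasks, embedding a query costs roughly 100× less per input token than sending it to a large chat model such as gpt-4o, and it produces no output tokens to pay for: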

// Instead of asking GPT-4o to classify, use k-NN over embeddings
const queryEmbedding = await embedText(userQuery);
const nearestLabel = await findNearestLabel(queryEmbedding, labelEmbeddings);
// labelEmbeddings computed once, cached forever
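
The helper `findNearestLabel` is not defined above. One possible sketch is a k-nearest-neighbour majority vote over labelled example embeddings, using cosine similarity. The `LabeledEmbedding` type, the default `k = 3`, and the tie-breaking rule are assumptions for illustration; this version is synchronous, so the `await` in the snippet above would simply resolve immediately.

```typescript
// A labelled example whose embedding was computed ahead of time (hypothetical shape).
interface LabeledEmbedding {
  label: string;
  vector: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Majority vote among the k most similar examples; returns null when there are none.
// Ties go to the label whose nearest example is closest to the query.
function findNearestLabel(
  query: number[],
  examples: LabeledEmbedding[],
  k = 3,
): string | null {
  const nearest = examples
    .map((example) => ({ label: example.label, score: cosine(query, example.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);

  // Map preserves insertion order, so labels are recorded nearest-first.
  const votes = new Map<string, number>();
  for (const neighbour of nearest) {
    votes.set(neighbour.label, (votes.get(neighbour.label) ?? 0) + 1);
  }

  const ranked = Array.from(votes.entries()).sort((a, b) => b[1] - a[1]);
  return ranked.length > 0 ? ranked[0][0] : null;
}
```

With only a handful of labelled examples per class this runs in microseconds, so the only per-request API cost is the single embedding call for the query.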

Cost monitoring

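// Track cost per request. Rates are USD per token and illustrative; check the current pricing page.
const ratesPerToken: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 0.000005, output: 0.000015 },
  'gpt-4o-mini': { input: 0.00000015, output: 0.0000006 },
};

function estimateCost(model: string, usage: { prompt_tokens: number; completion_tokens: number }) {
  // Unknown models fall back to a placeholder rate
  const rate = ratesPerToken[model] ?? { input: 0.000001, output: 0.000001 };
  // Input and output tokens are billed at different rates, so price them separately
  return usage.prompt_tokens * rate.input + usage.completion_tokens * rate.output;
}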

logger.info({
  model,
  estimatedCostUSD: estimateCost(model, response.usage!),
  cachedTokens: response.usage?.prompt_tokens_details?.cached_tokens ?? 0,
});

Cost reduction summary

Strategy	Typical savings	Implementation effort
Model routing	50–70%	Medium
Batch API	50%	Low
Prompt caching	20–50%	Low
Semantic cache	20–40%	Medium
Prompt compression	20–40%	Medium
max_tokens discipline	10–30%	Low

Takeaway

Start with model routing and the Batch API: per the summary table they offer the largest savings, and the Batch API in particular needs little implementation work. Add semantic caching once you understand your query patterns. Treat every uncapped max_tokens as a latent cost spike.