AI Cost Optimization: Reducing LLM API Costs in Production
LLM API costs at scale can easily reach $10,000–$100,000/month. This guide covers the highest-leverage strategies to cut costs without degrading user experience.
Where the money goes
| Cost driver | Typical share (rows overlap) | Addressable by |
|---|---|---|
| Input tokens (prompt) | 40–60% | Compression, caching, smaller models |
| Output tokens | 30–50% | max_tokens limits, structured output |
| Repeated prompts | 10–30% | Semantic cache |
| Over-powered model | 20–50% | Model routing |
1. Semantic caching
Avoid re-querying the LLM when a semantically similar question was already answered:
```typescript
import OpenAI from 'openai';

const openai = new OpenAI();

// `vectorCache` is an application-provided helper, not a library API. Any vector
// store that supports similarity search and expiry can sit behind this shape.
declare const vectorCache: {
  findSimilar(vector: number[], opts: { threshold: number }): Promise<{ response: string } | null>;
  store(vector: number[], response: string, opts: { ttlSeconds: number }): Promise<void>;
};

async function cachedCompletion(userQuestion: string): Promise<string | null> {
  // 1. Embed the question
  const embeddingRes = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: userQuestion,
  });
  const queryVector = embeddingRes.data[0].embedding;

  // 2. Search cache (cosine similarity >= 0.95 → cache hit)
  const cacheHit = await vectorCache.findSimilar(queryVector, { threshold: 0.95 });
  if (cacheHit) return cacheHit.response;

  // 3. Miss: call the LLM
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: userQuestion }],
  });
  const answer = response.choices[0].message.content;

  // 4. Store in cache for one hour (empty answers are not cached)
  if (answer) await vectorCache.store(queryVector, answer, { ttlSeconds: 3600 });
  return answer;
}
```
Expected savings: 20–40% for customer support and FAQ-heavy workloads.
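The `vectorCache` helper used in the snippet above is not defined there. As a rough illustration of the lookup it implies, here is a minimal in-memory sketch. The class name `InMemoryVectorCache`, the option names `threshold` and `ttlSeconds`, and the linear scan are all assumptions made for this example; a production system would more likely use a dedicated vector index.

```typescript
// Hypothetical in-memory stand-in for a semantic cache (illustration only).
interface CacheEntry {
  vector: number[];
  response: string;
  expiresAt: number; // epoch milliseconds
}

// Cosine similarity of two vectors of equal length; returns 0 for zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class InMemoryVectorCache {
  private entries: CacheEntry[] = [];

  // Returns the most similar non-expired entry at or above the threshold, or null.
  findSimilar(query: number[], opts: { threshold: number }, now: number = Date.now()): CacheEntry | null {
    let best: CacheEntry | null = null;
    let bestScore = opts.threshold;
    for (const entry of this.entries) {
      if (entry.expiresAt <= now) continue; // skip expired entries
      const score = cosineSimilarity(query, entry.vector);
      if (score >= bestScore) {
        best = entry;
        bestScore = score;
      }
    }
    return best;
  }

  // Stores a response under its embedding with a time-to-live in seconds.
  store(vector: number[], response: string, opts: { ttlSeconds: number }, now: number = Date.now()): void {
    this.entries.push({ vector, response, expiresAt: now + opts.ttlSeconds * 1000 });
  }
}
```

The threshold is the main tuning knob: a lower value raises the hit rate but also the chance of returning an answer to a different question, so it is worth validating against real query pairs before relying on the savings estimate.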
2. Model routing
Route simple queries to cheap models, complex ones to powerful models:
```typescript
async function routedCompletion(query: string, systemPrompt: string) {
  // Classify complexity with gpt-4o-mini (cheap)
  const classifyResponse = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{
      role: 'user',
      content: `Classify complexity of this task as "simple" or "complex". Reply with one word only.
Task: ${query}`,
    }],
    max_tokens: 5,
    temperature: 0,
  });
  // Normalize the label; anything that is not "complex" falls back to the cheap model
  const label = classifyResponse.choices[0].message.content?.trim().toLowerCase() ?? '';
  const isComplex = label.startsWith('complex');

  return openai.chat.completions.create({
    model: isComplex ? 'gpt-4o' : 'gpt-4o-mini',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: query },
    ],
  });
}
```
Expected savings: 50–70% on workloads with a mix of simple and complex queries. The classification call adds a small cost and some latency to every request, so the net saving depends on how many queries end up on the cheap model.
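To see where a figure in that range can come from, the arithmetic can be sketched as follows. The per-query costs and the 70% share of simple queries are illustrative assumptions, not measurements, and `routingSavings` is a hypothetical helper written for this example.

```typescript
// Relative saving of a routed workload versus sending every query to the large model.
interface RoutingInputs {
  simpleShare: number;    // fraction of queries routed to the small model, 0..1
  smallModelCost: number; // average cost per query on the small model (USD)
  largeModelCost: number; // average cost per query on the large model (USD)
  classifierCost: number; // cost of the classification call, paid on every query (USD)
}

function routingSavings(inputs: RoutingInputs): number {
  const { simpleShare, smallModelCost, largeModelCost, classifierCost } = inputs;
  const routedCost =
    classifierCost + simpleShare * smallModelCost + (1 - simpleShare) * largeModelCost;
  return 1 - routedCost / largeModelCost;
}

// Assumed numbers for illustration only
const savings = routingSavings({
  simpleShare: 0.7,
  smallModelCost: 0.0003,
  largeModelCost: 0.005,
  classifierCost: 0.00005,
});
console.log(`${(savings * 100).toFixed(1)}% saved`);
```

With these assumed inputs the saving comes out at roughly 65%. The result is dominated by the share of queries that still reach the large model, so a misclassification rate that pushes simple queries to the expensive path erodes the saving quickly.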
3. Prompt compression
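```typescript
// LLMLingua-style compression can cut prompt tokens by 50% or more. The quality
// cost (around 5% is an indicative figure) should be measured on your own data.
// NOTE: LLMLingua is published as a Python library. This snippet assumes a
// JavaScript binding named `llmlingua` that exposes the same interface; confirm
// one exists for your stack, or call the Python library from a small service.
import { PromptCompressor } from 'llmlingua';

const compressor = new PromptCompressor();
const { compressed_prompt, ratio } = await compressor.compress(
  longContextDocument,
  { target_token: 512, rank_method: 'longllmlingua' }
);
console.log(`Compression ratio: ${ratio}x`);
```
Also consider manual techniques: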
- Remove filler phrases ("Please note that", "As an AI language model").
- Convert verbose instructions to bullet points.
- Truncate retrieved context to only the most relevant sentences.
- Keep frequently reused instructions in a stable prompt prefix so that OpenAI's automatic prompt caching can apply (see section 6).
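Two of the manual techniques above, removing filler phrases and truncating retrieved context, can be sketched in a few lines. The helper names `stripFiller` and `topSentences` and the filler list are assumptions for this example, and the word-overlap score is a deliberately crude stand-in for a real relevance ranking such as embedding similarity.

```typescript
// Filler phrases to strip; extend this list for your own prompts (assumed examples).
const FILLER_PHRASES = ['Please note that', 'As an AI language model,'];

// Removes each filler phrase and collapses the whitespace left behind.
function stripFiller(prompt: string): string {
  let result = prompt;
  for (const phrase of FILLER_PHRASES) {
    result = result.split(phrase).join('');
  }
  return result.replace(/\s+/g, ' ').trim();
}

// Keeps the `keep` sentences that share the most words with the query,
// returned in their original order.
function topSentences(context: string, query: string, keep: number): string {
  const queryWords = new Set<string>(query.toLowerCase().match(/[a-z0-9]+/g) ?? []);
  const sentences = (context.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  const scored = sentences.map((text, index) => {
    const words = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
    const score = words.filter((w) => queryWords.has(w)).length;
    return { text, index, score };
  });
  return scored
    .sort((a, b) => b.score - a.score || a.index - b.index) // best score first
    .slice(0, keep)
    .sort((a, b) => a.index - b.index) // restore document order
    .map((s) => s.text)
    .join(' ');
}
```

Keeping the selected sentences in document order tends to preserve readability for the model, at the cost of not signalling which sentence scored highest.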
4. Batch API (50% discount)
Use OpenAI's Batch API for any workload that is not real-time:
```typescript
import { toFile } from 'openai';

// Create batch file (JSONL): one request object per line
const batchRequests = documents.map((doc, i) => JSON.stringify({
  custom_id: `request-${i}`,
  method: 'POST',
  url: '/v1/chat/completions',
  body: {
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Summarize this document in 2 sentences.' },
      { role: 'user', content: doc.text },
    ],
    max_tokens: 100,
  },
})).join('\n');

// Upload and submit. toFile attaches the filename that a bare Blob lacks.
const file = await openai.files.create({
  file: await toFile(Buffer.from(batchRequests, 'utf8'), 'batch.jsonl'),
  purpose: 'batch',
});
const batch = await openai.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h',
});
console.log('Batch ID:', batch.id);
// Later: poll openai.batches.retrieve(batch.id) until status is 'completed',
// then download batch.output_file_id with openai.files.content().
```
5. Output token control
```typescript
// Always set max_tokens: an unconstrained response can be up to 10x more expensive
await openai.chat.completions.create({
  model: 'gpt-4o',
  messages,
  // Set to the minimum the task needs; finish_reason === 'length' signals truncation
  max_tokens: 256,
  // For structured output, enforce a schema so the model emits less free-form text.
  // `schema` must have the shape { name, schema: <JSON Schema>, strict? }.
  response_format: { type: 'json_schema', json_schema: schema },
});
```
6. Prompt caching
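For prompts of 1,024 tokens or more, OpenAI automatically caches the longest prefix it has seen recently and bills the cached input tokens at a discount (50% for gpt-4o and gpt-4o-mini at the time of writing):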
```typescript
// Structure your messages to maximize the cached prefix
const messages = [
  // Stable system prompt (1,024+ tokens): always first, never changes
  { role: 'system', content: largeStableSystemPrompt },
  // Stable examples: also part of the cached prefix
  { role: 'user', content: 'Example input' },
  { role: 'assistant', content: 'Example output' },
  // Dynamic user message: not cached
  { role: 'user', content: userQuery },
];
```
7. Use embeddings for classification
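Embedding a query with text-embedding-3-small costs on the order of 100× less per input token than sending it to gpt-4o, and it produces no output tokens, which makes embeddings a cheap option for routing and classification tasks: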
```typescript
// Instead of asking GPT-4o to classify, use k-NN over embeddings
const queryEmbedding = await embedText(userQuery);
// labelEmbeddings computed once, cached forever
const nearestLabel = await findNearestLabel(queryEmbedding, labelEmbeddings);
```
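The `findNearestLabel` helper is not defined in the snippet above. A minimal sketch of what it might look like, assuming one reference embedding per label and cosine similarity as the distance measure (the k = 1 case of k-NN), is shown here. The return shape with a `score` field is an assumption added so that callers can reject low-confidence matches.

```typescript
// One reference embedding per label, for example the embedding of a short
// label description (hypothetical data layout).
type LabelEmbeddings = Record<string, number[]>;

// Cosine similarity of two vectors of equal length; returns 0 for zero vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Returns the label whose embedding is most similar to the query, or null
// when no labels are provided.
function findNearestLabel(
  query: number[],
  labels: LabelEmbeddings,
): { label: string; score: number } | null {
  let best: { label: string; score: number } | null = null;
  for (const [label, vector] of Object.entries(labels)) {
    const score = cosine(query, vector);
    if (best === null || score > best.score) {
      best = { label, score };
    }
  }
  return best;
}
```

A caller could fall back to an LLM classifier when `score` is below some cutoff, which keeps the cheap path for clear cases and reserves the expensive call for ambiguous ones.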
Cost monitoring
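```typescript
// Track cost per request. Prices are USD per token and change over time: verify them.
const pricing: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 2.5e-6, output: 10e-6 },
  'gpt-4o-mini': { input: 0.15e-6, output: 0.6e-6 },
};
const fallbackRate = { input: 1e-6, output: 1e-6 }; // used for models missing above
function estimateCost(model: string, usage: { prompt_tokens: number; completion_tokens: number }) {
  const rate = pricing[model] ?? fallbackRate;
  // Input and output tokens are billed at different rates; cached-token discounts are ignored here
  return usage.prompt_tokens * rate.input + usage.completion_tokens * rate.output;
}
logger.info({
  model,
  estimatedCostUSD: estimateCost(model, response.usage!),
  cachedTokens: response.usage?.prompt_tokens_details?.cached_tokens ?? 0,
});
```
Cost reduction summary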
| Strategy | Typical savings | Implementation effort |
|---|---|---|
| Model routing | 50–70% | Medium |
| Batch API | 50% | Low |
| Prompt caching | 20–50% | Low |
| Semantic cache | 20–40% | Medium |
| Prompt compression | 20–40% | Medium |
| max_tokens discipline | 10–30% | Low |
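The percentages in this table should not be added together. As a rough model, if each strategy independently removes a fraction of whatever spend remains, the combined saving is one minus the product of the remaining fractions. The sketch below assumes that independence, which real workloads only approximate (for example, the Batch API and semantic caching often apply to different traffic), and `combinedSavings` is a hypothetical helper.

```typescript
// Combined saving when each strategy removes a fraction of the remaining spend.
function combinedSavings(savings: number[]): number {
  const remaining = savings.reduce((acc, s) => acc * (1 - s), 1);
  return 1 - remaining;
}

// Example: 50% from one strategy and 30% from another give 65%, not 80%.
const total = combinedSavings([0.5, 0.3]);
console.log(`${(total * 100).toFixed(0)}% combined`);
```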
Takeaway
Start with model routing and the Batch API: they carry the largest savings figures in the summary above, and the Batch API needs little implementation effort. Add semantic caching once you understand your query patterns. Treat every uncapped max_tokens as a latent cost spike.