
AI Cost Optimization: Reducing LLM API Costs in Production


LLM API costs at scale can easily reach $10,000–$100,000/month. This guide covers the highest-leverage strategies to cut costs without degrading user experience.

Where the money goes

| Cost driver | Typical share | Addressable by |
|---|---|---|
| Input tokens (prompt) | 40–60% | Compression, caching, smaller models |
| Output tokens | 30–50% | max_tokens limits, structured output |
| Repeated prompts | 10–30% | Semantic cache |
| Over-powered model | 20–50% | Model routing |

1. Semantic caching

Avoid re-querying the LLM when a semantically similar question was already answered:

import { Redis } from '@upstash/redis';
import OpenAI from 'openai';

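// The Upstash client needs both a REST URL and a token; the env var names are assumptions
const redis = new Redis({
  url: process.env.UPSTASH_REDIS_URL!,
  token: process.env.UPSTASH_REDIS_TOKEN!,
});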
const openai = new OpenAI();

async function cachedCompletion(userQuestion: string) {
  // 1. Embed the question
  const embeddingRes = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: userQuestion,
  });
  const queryVector = embeddingRes.data[0].embedding;

  // 2. Search cache (cosine similarity >= 0.95 → cache hit)
  // vectorCache is an assumed vector-store wrapper (not shown here)
  const cacheHit = await vectorCache.findSimilar(queryVector, { threshold: 0.95 });
  if (cacheHit) return cacheHit.response;

  // 3. Cache miss: call the LLM
  const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: userQuestion }],
  });
  const answer = response.choices[0].message.content;

  // 4. Store in cache with a one-hour TTL (in seconds)
  await vectorCache.store(queryVector, answer, { ttl: 3600 });
  return answer;
}

Expected savings: 20–40% for customer support and FAQ-heavy workloads.
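
The snippet above relies on a `vectorCache` helper that is not shown. As a rough sketch of what such a helper might look like, here is an in-memory version using cosine similarity. The class name `InMemoryVectorCache`, the options-object signatures, and the optional `now` parameter are assumptions made for illustration; a production setup would more likely sit on a vector database.

```typescript
// Hypothetical in-memory stand-in for the `vectorCache` helper.
interface CacheEntry {
  vector: number[];
  response: string;
  expiresAt: number; // epoch milliseconds
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class InMemoryVectorCache {
  private entries: CacheEntry[] = [];

  // Returns the best non-expired entry whose similarity meets the threshold, or null.
  findSimilar(
    query: number[],
    opts: { threshold: number },
    now: number = Date.now(),
  ): { response: string; score: number } | null {
    let best: { response: string; score: number } | null = null;
    for (const entry of this.entries) {
      if (entry.expiresAt <= now) continue;
      const score = cosineSimilarity(query, entry.vector);
      if (score >= opts.threshold && (best === null || score > best.score)) {
        best = { response: entry.response, score };
      }
    }
    return best;
  }

  // Stores a response; `ttl` is in seconds.
  store(vector: number[], response: string, opts: { ttl: number }, now: number = Date.now()): void {
    // Drop expired entries on write so the list does not grow without bound.
    this.entries = this.entries.filter((e) => e.expiresAt > now);
    this.entries.push({ vector, response, expiresAt: now + opts.ttl * 1000 });
  }
}
```

A linear scan like this is only reasonable for a few thousand entries. The 0.95 threshold is worth tuning against real traffic, since a value that is too low returns answers to questions that merely look similar.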

2. Model routing

Route simple queries to cheap models, complex ones to powerful models:

async function routedCompletion(query: string, systemPrompt: string) {
  // Classify complexity with gpt-4o-mini (cheap)
  const classifyResponse = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{
      role: 'user',
      content: `Classify complexity of this task as "simple" or "complex". Reply with one word only.
Task: ${query}`,
    }],
    max_tokens: 5,
  });

  const complexity = classifyResponse.choices[0].message.content?.trim().toLowerCase();

  return openai.chat.completions.create({
    model: complexity === 'complex' ? 'gpt-4o' : 'gpt-4o-mini',
    messages: [
      { role: 'system', content: systemPrompt },
      { role: 'user', content: query },
    ],
  });
}

Expected savings: 50–70% on workloads with a mix of simple and complex queries.
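
To see where a figure in that range can come from, it helps to work through the arithmetic. The sketch below computes a blended price for a given traffic mix, including the extra classifier call. The prices, the 60% simple-traffic share, and the 10% classifier overhead are example assumptions, not measurements, and the function name `routingSavings` is made up for this illustration.

```typescript
// Prices are USD per 1M tokens. All numbers here are illustrative assumptions;
// substitute your provider's current rates and your own traffic mix.
interface RoutingScenario {
  largeModelPrice: number;    // price on the powerful model
  smallModelPrice: number;    // price on the cheap model
  simpleFraction: number;     // share of traffic routed to the cheap model, 0..1
  classifierOverhead: number; // extra cheap-model tokens per request token, e.g. 0.1
}

function routingSavings(s: RoutingScenario): { blendedPrice: number; savingsPct: number } {
  const routed =
    s.simpleFraction * s.smallModelPrice +
    (1 - s.simpleFraction) * s.largeModelPrice;
  // Every request also pays for the classification call on the cheap model.
  const blendedPrice = routed + s.classifierOverhead * s.smallModelPrice;
  const savingsPct = (1 - blendedPrice / s.largeModelPrice) * 100;
  return { blendedPrice, savingsPct };
}

const result = routingSavings({
  largeModelPrice: 5,
  smallModelPrice: 0.15,
  simpleFraction: 0.6,
  classifierOverhead: 0.1,
});
console.log(result.blendedPrice.toFixed(3), result.savingsPct.toFixed(1));
```

Under these assumptions the blended price is about $2.11 per 1M tokens, a saving of roughly 58% against sending everything to the large model. The result is dominated by `simpleFraction`, so measuring the real share of simple queries matters more than the exact prices.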

3. Prompt compression

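// LLMLingua-style compression: can reduce prompt tokens by 50%+ at roughly 5% quality loss (task-dependent)
// Note: LLMLingua is published as a Python library. The import below assumes a
// JavaScript wrapper with a similar interface, so treat this API as illustrative.
import { PromptCompressor } from 'llmlingua';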

const compressor = new PromptCompressor();
const { compressed_prompt, ratio } = await compressor.compress(
  longContextDocument,
  { target_token: 512, rank_method: 'longllmlingua' }
);
console.log(`Compression ratio: ${ratio}x`);

Also consider manual techniques:

  • Remove filler phrases ("Please note that", "As an AI language model").
  • Convert verbose instructions to bullet points.
  • Truncate retrieved context to only the most relevant sentences.
  • Put frequently reused instructions at the start of the prompt so OpenAI's automatic prompt caching can reuse them (see section 6).
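
Two of the manual techniques above, removing filler phrases and truncating context to the most relevant sentences, can be sketched in a few lines. The phrase list and the word-overlap scoring are simplifying assumptions for illustration, and the helper names `stripFiller` and `topSentences` are made up; embedding-based ranking would usually select sentences more accurately.

```typescript
// Filler phrases to drop. The list is illustrative; extend it for your own prompts.
const FILLER_PHRASES = [
  'please note that',
  'as an ai language model',
  'it is important to note that',
];

function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

function stripFiller(prompt: string): string {
  let result = prompt;
  for (const phrase of FILLER_PHRASES) {
    // Remove the phrase plus any comma and whitespace directly after it.
    result = result.replace(new RegExp(`${escapeRegExp(phrase)},?\\s*`, 'gi'), '');
  }
  return result.replace(/\s+/g, ' ').trim();
}

// Keep the `limit` sentences sharing the most words with the query, in original order.
function topSentences(context: string, query: string, limit: number): string[] {
  const queryWords = new Set<string>(query.toLowerCase().match(/[a-z0-9]+/g) ?? []);
  const sentences = (context.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  const scored = sentences.map((sentence, index) => {
    const words: string[] = sentence.toLowerCase().match(/[a-z0-9]+/g) ?? [];
    const score = words.filter((w) => queryWords.has(w)).length;
    return { sentence, index, score };
  });
  return scored
    .sort((a, b) => b.score - a.score || a.index - b.index)
    .slice(0, limit)
    .sort((a, b) => a.index - b.index)
    .map((s) => s.sentence);
}
```

For example, `stripFiller('Please note that the refund window is 30 days.')` leaves only the sentence about the refund window, and `topSentences` can then cut a retrieved document down to the few sentences that mention the query's terms.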

4. Batch API (50% discount)

Use OpenAI's Batch API for any workload that is not real-time:

// Create batch file (JSONL)
const batchRequests = documents.map((doc, i) => JSON.stringify({
  custom_id: `request-${i}`,
  method: 'POST',
  url: '/v1/chat/completions',
  body: {
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Summarize this document in 2 sentences.' },
      { role: 'user', content: doc.text },
    ],
    max_tokens: 100,
  },
})).join('\n');

// Upload and submit
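const file = await openai.files.create({
  // The SDK expects a named file, so use File (global in Node 20+) rather than a bare Blob
  file: new File([batchRequests], 'batch_input.jsonl', { type: 'application/jsonl' }),
  purpose: 'batch',
});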

const batch = await openai.batches.create({
  input_file_id: file.id,
  endpoint: '/v1/chat/completions',
  completion_window: '24h',
});
console.log('Batch ID:', batch.id);
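
Submitting the batch is only half of the workflow. Once the batch reports a `completed` status, the results arrive as a JSONL file (referenced by `output_file_id`) with one line per request, and the lines are not guaranteed to be in input order, which is why `custom_id` matters. The sketch below parses such a file into a map keyed by `custom_id`. The function name `parseBatchOutput` is made up for this example, and the interface covers only the fields it reads.

```typescript
// Shape of one line in the Batch API output file (only the fields used here).
interface BatchOutputLine {
  custom_id: string;
  response?: {
    status_code: number;
    body: { choices?: { message: { content: string | null } }[] };
  } | null;
  error?: { code: string; message: string } | null;
}

function parseBatchOutput(jsonl: string): { results: Map<string, string>; failed: string[] } {
  const results = new Map<string, string>();
  const failed: string[] = [];
  for (const line of jsonl.split('\n')) {
    if (line.trim() === '') continue; // tolerate blank and trailing lines
    const row = JSON.parse(line) as BatchOutputLine;
    const content = row.response?.body.choices?.[0]?.message.content;
    if (!row.error && row.response?.status_code === 200 && typeof content === 'string') {
      results.set(row.custom_id, content);
    } else {
      // Collect failures so they can be retried, possibly via the regular API.
      failed.push(row.custom_id);
    }
  }
  return { results, failed };
}
```

With the submission code above, the input would typically come from something like `await (await openai.files.content(batch.output_file_id)).text()` after polling `openai.batches.retrieve(batch.id)`; treat that wiring as an outline to check against the current SDK rather than a tested recipe.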

5. Output token control

// Always set max_tokens — an unconstrained response can be 10x more expensive
await openai.chat.completions.create({
  model: 'gpt-4o',
  messages,
  max_tokens: 256,  // set to minimum needed for task

  // For structured output, enforce a schema so the model skips explanatory prose.
  // `schema` must have the shape { name, schema, strict? } that json_schema expects.
  response_format: { type: 'json_schema', json_schema: schema },
});
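
A tight `max_tokens` cap has one side effect worth handling: if the model hits the cap mid-answer, structured output comes back as truncated, invalid JSON. The API signals this with `finish_reason: 'length'`. The helper below is a minimal sketch of checking for that before parsing; `ChoiceLike` and `parseCappedJson` are names made up for this example and mirror only the fields of a chat completion choice that are used.

```typescript
// Minimal view of a chat completion choice (only the fields used here).
interface ChoiceLike {
  finish_reason: string | null;
  message: { content: string | null };
}

type ParseResult =
  | { ok: true; value: unknown }
  | { ok: false; reason: 'truncated' | 'empty' | 'invalid_json' };

function parseCappedJson(choice: ChoiceLike): ParseResult {
  // 'length' means the model stopped because it reached the max_tokens cap.
  if (choice.finish_reason === 'length') return { ok: false, reason: 'truncated' };
  if (!choice.message.content) return { ok: false, reason: 'empty' };
  try {
    return { ok: true, value: JSON.parse(choice.message.content) };
  } catch {
    return { ok: false, reason: 'invalid_json' };
  }
}
```

Logging how often `truncated` occurs gives a direct signal for whether the cap is set too low for the task. Retrying with a higher cap costs more than setting it correctly in the first place.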

6. Prompt caching

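OpenAI automatically caches the stable prefix of any prompt that is at least 1,024 tokens long and bills the cached input tokens at a discount (50% when the feature launched; check current pricing for your model):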

// Structure your messages to maximize cached prefix
const messages = [
  // Stable system prompt (1,024+ tokens): always first, never changes
  { role: 'system', content: largeStableSystemPrompt },
  // Stable examples — also cached
  { role: 'user', content: 'Example input' },
  { role: 'assistant', content: 'Example output' },
  // Dynamic user message — not cached
  { role: 'user', content: userQuery },
];
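
Whether this message ordering pays off can be checked from the `usage` object on each response, which reports how many prompt tokens were served from the cache. The following sketch aggregates that into a single ratio. `UsageLike` and `cachedShare` are names made up for this example; the field names follow the `prompt_tokens_details.cached_tokens` field used in the monitoring section of this guide.

```typescript
// Minimal view of the usage object on a chat completion response.
interface UsageLike {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Fraction of all prompt tokens that were served from the prompt cache.
function cachedShare(usages: UsageLike[]): number {
  let promptTokens = 0;
  let cachedTokens = 0;
  for (const usage of usages) {
    promptTokens += usage.prompt_tokens;
    cachedTokens += usage.prompt_tokens_details?.cached_tokens ?? 0;
  }
  return promptTokens === 0 ? 0 : cachedTokens / promptTokens;
}
```

A share near zero on a workload with a long, stable system prompt usually means something dynamic (a timestamp, a user name, a request ID) has crept into the beginning of the prompt and is breaking the shared prefix.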

7. Use embeddings for classification

Embeddings cost ~100× less than chat completions for routing and classification tasks:

// Instead of asking GPT-4o to classify, use k-NN over embeddings
const queryEmbedding = await embedText(userQuery);
const nearestLabel = await findNearestLabel(queryEmbedding, labelEmbeddings);
// labelEmbeddings computed once, cached forever
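
The `findNearestLabel` helper above is not defined in this guide. One plausible shape for it is a k-nearest-neighbour vote over labelled example embeddings, sketched below. The `LabeledVector` type, the default `k = 3`, and the tie-breaking rule are assumptions made for illustration; the vectors would come from an embeddings call such as the one in the semantic caching section.

```typescript
interface LabeledVector {
  label: string;
  vector: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Majority vote among the k most similar examples; ties go to the label
// with the higher summed similarity. Returns null when there are no examples.
function findNearestLabel(query: number[], examples: LabeledVector[], k: number = 3): string | null {
  const neighbours = examples
    .map((e) => ({ label: e.label, score: cosine(query, e.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);

  const votes = new Map<string, { count: number; total: number }>();
  for (const n of neighbours) {
    const v = votes.get(n.label) ?? { count: 0, total: 0 };
    votes.set(n.label, { count: v.count + 1, total: v.total + n.score });
  }

  let best: string | null = null;
  let bestCount = 0;
  let bestTotal = -Infinity;
  for (const [label, v] of Array.from(votes.entries())) {
    if (v.count > bestCount || (v.count === bestCount && v.total > bestTotal)) {
      best = label;
      bestCount = v.count;
      bestTotal = v.total;
    }
  }
  return best;
}
```

This needs a handful of labelled examples per class rather than a single vector per label, which tends to be more robust to phrasing. It can also replace the LLM-based complexity classifier in the routing section if the two classes separate well in embedding space.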

Cost monitoring

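// Track cost per request. Rates are USD per token and are examples only;
// check current pricing. Output tokens are billed at a higher rate than input.
const costPerToken: Record<string, { input: number; output: number }> = {
  'gpt-4o': { input: 0.000005, output: 0.000015 },
  'gpt-4o-mini': { input: 0.00000015, output: 0.0000006 },
};

function estimateCost(model: string, usage: { prompt_tokens: number; completion_tokens: number }) {
  // Fall back to a flat default rate for models missing from the table
  const rate = costPerToken[model] ?? { input: 0.000001, output: 0.000001 };
  return usage.prompt_tokens * rate.input + usage.completion_tokens * rate.output;
}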

logger.info({
  model,
  estimatedCostUSD: estimateCost(model, response.usage!),
  cachedTokens: response.usage?.prompt_tokens_details?.cached_tokens ?? 0,
});

Cost reduction summary

| Strategy | Typical savings | Implementation effort |
|---|---|---|
| Model routing | 50–70% | Medium |
| Batch API | 50% | Low |
| Prompt caching | 20–50% | Low |
| Semantic cache | 20–40% | Medium |
| Prompt compression | 20–40% | Medium |
| max_tokens discipline | 10–30% | Low |

Takeaway

Start with model routing and Batch API — both deliver the highest savings with the least risk. Add semantic caching once your query patterns are understood. Treat every uncapped max_tokens as a latent cost spike.