OpenAI API Best Practices: Production Patterns for Reliability and Cost Control

Running the OpenAI API in production requires more than calling chat.completions.create. This guide covers the patterns that separate reliable AI features from brittle ones.

Client setup

import OpenAI from 'openai';

const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  maxRetries: 3,          // built-in exponential backoff
  timeout: 30_000,        // 30-second timeout
});

Retry and rate-limit handling

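The SDK already retries connection errors and 408, 409, 429, and 5xx responses with exponential backoff (twice by default; the client above raises that to three). If you need custom retry logic on top of that, for example to control the delay yourself, set maxRetries: 0 on the client so the two layers do not multiply the number of attempts: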

import { RateLimitError, APIConnectionError } from 'openai';

async function callWithRetry(params, attempts = 0) {
  try {
    return await openai.chat.completions.create(params);
  } catch (err) {
    if ((err instanceof RateLimitError || err instanceof APIConnectionError) && attempts < 4) {
      const delay = Math.pow(2, attempts) * 1000 + Math.random() * 500;
      await new Promise(r => setTimeout(r, delay));
      return callWithRetry(params, attempts + 1);
    }
    throw err;
  }
}
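The delay calculation above can also take the server's Retry-After hint into account. The following is a minimal sketch, not the SDK's own algorithm: backoffDelayMs and its options are hypothetical names, and it assumes the caller has already read the raw Retry-After header value from the error (how to do that depends on the SDK version). It uses "full jitter" (a random delay between zero and the exponential ceiling) rather than the additive jitter used above.

```typescript
// Hypothetical helper: how long to wait before retry number `attempt` (0-based).
// `retryAfter` is the raw Retry-After header value, if one was available.
function backoffDelayMs(
  attempt: number,
  retryAfter?: string | null,
  opts: { baseMs?: number; capMs?: number; random?: () => number } = {},
): number {
  const { baseMs = 1000, capMs = 30_000, random = Math.random } = opts;

  // Prefer the server's hint when it is a plain number of seconds.
  // (Retry-After may also be an HTTP date; this sketch ignores that form.)
  const seconds =
    retryAfter == null || retryAfter.trim() === '' ? NaN : Number(retryAfter);
  if (Number.isFinite(seconds) && seconds >= 0) {
    return Math.min(seconds * 1000, capMs);
  }

  // Otherwise: exponential backoff with full jitter, capped at capMs.
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceiling);
}
```

Passing a fixed random function, as the options allow, makes the delay deterministic in unit tests.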

Streaming responses

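Stream responses in latency-sensitive UIs so users see output as soon as it is generated instead of waiting for the full completion; aim to show the first token in under 500 ms: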

const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages,
  stream: true,
});

for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content ?? '';
  process.stdout.write(delta);
}

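// Alternative: forward the stream as Server-Sent Events (Next.js route handler).
// A stream can only be iterated once, so use this instead of the loop above, not after it.
const encoder = new TextEncoder();
return new Response(
  new ReadableStream({
    async start(controller) {
      try {
        for await (const chunk of stream) {
          const text = chunk.choices[0]?.delta?.content ?? '';
          // JSON-encode the delta so newlines in the text cannot break SSE framing;
          // the client must JSON.parse each data payload.
          if (text) controller.enqueue(encoder.encode(`data: ${JSON.stringify(text)}\n\n`));
        }
        controller.close();
      } catch (err) {
        controller.error(err); // surface upstream failures to the consumer
      }
    },
  }),
  { headers: { 'Content-Type': 'text/event-stream', 'Cache-Control': 'no-cache' } }
);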
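SSE frames are delimited by blank lines, so a text delta that itself contains a newline can corrupt the framing if it is written raw. One way around this, sketched below under the assumption that you control both ends of the connection, is to JSON-encode each delta on the server and decode it on the client. toSseFrame and parseSseFrames are hypothetical helper names, and the parser assumes it is handed only complete frames.

```typescript
// Encode one text delta as an SSE frame. JSON-encoding escapes embedded
// newlines, so they cannot be mistaken for frame boundaries.
function toSseFrame(text: string): string {
  return `data: ${JSON.stringify(text)}\n\n`;
}

// Decode a buffer of complete frames back into text deltas (client side).
function parseSseFrames(buffer: string): string[] {
  return buffer
    .split('\n\n')
    .filter((frame) => frame.startsWith('data: '))
    .map((frame) => JSON.parse(frame.slice('data: '.length)) as string);
}
```

A real client also has to buffer partial frames that arrive split across network reads; that part is omitted here.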

Model selection by use case

Model	Best for	Relative cost
gpt-4o-mini	Classification, extraction, quick Q&A	$
gpt-4o	Complex reasoning, code generation	$$
o3-mini	Math, logic, multi-step planning	$$$
gpt-4o (vision)	Image understanding	$$
text-embedding-3-small	Embeddings, semantic search	$
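One way to keep these choices in a single place is a typed lookup, so call sites name a task rather than a model. This is only a sketch: the task names are hypothetical labels for the rows above, not anything defined by the API.

```typescript
// Hypothetical task labels mirroring the rows of the table above.
type Task =
  | 'classification'
  | 'extraction'
  | 'reasoning'
  | 'codegen'
  | 'planning'
  | 'vision'
  | 'embedding';

// Single source of truth for which model handles which kind of work.
const MODEL_FOR_TASK: Record<Task, string> = {
  classification: 'gpt-4o-mini',
  extraction: 'gpt-4o-mini',
  reasoning: 'gpt-4o',
  codegen: 'gpt-4o',
  planning: 'o3-mini',
  vision: 'gpt-4o',
  embedding: 'text-embedding-3-small',
};

function pickModel(task: Task): string {
  return MODEL_FOR_TASK[task];
}
```

Because the map is typed as Record<Task, string>, adding a new task label without assigning it a model fails at compile time.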

Token budgeting

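import { encoding_for_model, type TiktokenModel } from 'tiktoken'; // formerly published as @dqbd/tiktoken

function countTokens(text: string, model: TiktokenModel = 'gpt-4o'): number {
  const enc = encoding_for_model(model);
  try {
    return enc.encode(text).length;
  } finally {
    enc.free(); // release the WASM-backed encoder even if encode throws
  }
}

// Guard before sending. This counts text only; the chat format adds a few tokens per message.
const MAX_INPUT_TOKENS = 4000; // choose a limit that leaves room for the completion
const inputTokens = countTokens(systemPrompt + userMessage);
if (inputTokens > MAX_INPUT_TOKENS) {
  throw new Error(`Prompt too large: ${inputTokens} tokens`);
}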
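The guard above counts a single concatenated string. For a multi-message request it can help to budget per message and to reserve room for the completion. A rough sketch follows; the function names are hypothetical, and the per-message overhead is an assumed allowance rather than an official figure, so calibrate it against the usage.prompt_tokens values you actually observe.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// `count` is any tokenizer function, for example the countTokens helper above.
// `perMessageOverhead` is an assumed allowance for role and formatting tokens.
function estimateChatTokens(
  messages: ChatMessage[],
  count: (text: string) => number,
  perMessageOverhead = 4,
): number {
  return messages.reduce(
    (sum, m) => sum + count(m.content) + perMessageOverhead,
    0,
  );
}

// Headroom left after reserving space for the completion.
// A negative result means the request is too large.
function inputHeadroom(
  inputTokens: number,
  contextWindow: number,
  reservedOutputTokens: number,
): number {
  return contextWindow - reservedOutputTokens - inputTokens;
}
```

Injecting the tokenizer keeps the arithmetic testable without loading the WASM encoder.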

Prompt caching

OpenAI automatically caches the prefix of long prompts (≥1024 tokens). To maximise cache hits:

Keep your system prompt stable and at the front of the context.
Append the dynamic user message at the end.
Avoid adding timestamps or request IDs to the system prompt.

// Cached prefix (stable system prompt, ~1200 tokens)
const messages = [
  { role: 'system', content: stableSystemPrompt },  // ← cached after first request
  { role: 'user',   content: userQuestion },         // ← dynamic, not cached
];
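If a request does need per-call metadata such as a timestamp or request ID, one option is to carry it in the final user message so the system prompt stays byte-identical between calls. A minimal sketch, with buildMessages and RequestContext as hypothetical names:

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Hypothetical per-request metadata that must not leak into the cached prefix.
interface RequestContext {
  requestId: string;
  now: Date;
}

function buildMessages(
  stableSystemPrompt: string,
  userQuestion: string,
  ctx: RequestContext,
): ChatMessage[] {
  return [
    // Stable prefix: identical on every request, so it stays cacheable.
    { role: 'system', content: stableSystemPrompt },
    // Volatile data goes last, after everything that should be cached.
    {
      role: 'user',
      content: `[request ${ctx.requestId} at ${ctx.now.toISOString()}]\n${userQuestion}`,
    },
  ];
}
```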

Structured output

const result = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages,
  response_format: { type: 'json_schema', json_schema: {
    name: 'ExtractedData',
    strict: true,
    schema: {
      type: 'object',
      properties: {
        sentiment: { type: 'string', enum: ['positive', 'negative', 'neutral'] },
        confidence: { type: 'number' },
        summary: { type: 'string' },
      },
      required: ['sentiment', 'confidence', 'summary'],
      additionalProperties: false,
    }
  }},
});
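const { content, refusal } = result.choices[0].message;
if (refusal || content === null) {
  // A refusal (or empty content) means there is no JSON to parse
  throw new Error(`No structured output: ${refusal ?? 'empty content'}`);
}
const data = JSON.parse(content);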
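JSON.parse returns an untyped value, so it is worth narrowing the result before the rest of the application relies on it. Below is a small hand-written validator for the schema above; parseExtractedData is a hypothetical helper, and a schema library would work just as well.

```typescript
const SENTIMENTS = ['positive', 'negative', 'neutral'] as const;
type Sentiment = (typeof SENTIMENTS)[number];

interface ExtractedData {
  sentiment: Sentiment;
  confidence: number;
  summary: string;
}

// Parse and validate the model output; throws on anything unexpected.
function parseExtractedData(raw: string | null): ExtractedData {
  if (raw === null) {
    throw new Error('Model returned no content');
  }
  const value: unknown = JSON.parse(raw);
  if (typeof value !== 'object' || value === null) {
    throw new Error('Expected a JSON object');
  }
  const record = value as Record<string, unknown>;
  const sentiment = record['sentiment'];
  const confidence = record['confidence'];
  const summary = record['summary'];

  if (
    typeof sentiment !== 'string' ||
    !(SENTIMENTS as readonly string[]).includes(sentiment)
  ) {
    throw new Error('Invalid sentiment');
  }
  if (typeof confidence !== 'number' || !Number.isFinite(confidence)) {
    throw new Error('Invalid confidence');
  }
  if (typeof summary !== 'string') {
    throw new Error('Invalid summary');
  }
  return { sentiment: sentiment as Sentiment, confidence, summary };
}
```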

Cost control patterns

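Cache exact-match results: hash the prompt (plus the model and any parameters that affect the output) and store the response in Redis with a TTL. A true semantic cache would match on embedding similarity instead.
Batch API: use /v1/batches for non-realtime workloads; batches complete within 24 hours at a 50% discount.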
Route by complexity: classify the query first with gpt-4o-mini, then only escalate complex queries to gpt-4o.
Truncate history: for multi-turn conversations, summarize old turns instead of sending the full transcript.
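History truncation can be sketched as two pure steps: split the transcript into older turns (to be summarized) and recent turns (to be sent verbatim), then rebuild the message list around the summary. The function names are hypothetical, and producing the summary text itself, typically with a cheap model call, is left out.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Split a transcript into the turns to summarize and the turns to keep verbatim.
function splitHistory(
  turns: ChatMessage[],
  keepLast: number,
): { older: ChatMessage[]; recent: ChatMessage[] } {
  const cut = Math.max(0, turns.length - keepLast);
  return { older: turns.slice(0, cut), recent: turns.slice(cut) };
}

// Rebuild the request: system prompt, then the summary, then the recent turns.
function compactHistory(
  system: ChatMessage,
  summary: string,
  recent: ChatMessage[],
): ChatMessage[] {
  return [
    system,
    { role: 'system', content: `Summary of earlier conversation: ${summary}` },
    ...recent,
  ];
}
```

Keeping the original system prompt first means the summary does not disturb a cached prefix.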

// Exact-match cache check (body of an async handler; needs `import crypto from 'node:crypto'`)
const cacheKey = crypto.createHash('sha256').update(prompt).digest('hex');
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);

const response = await openai.chat.completions.create({ ... });
await redis.setex(cacheKey, 3600, JSON.stringify(response));
return response;
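The snippet above hashes only the prompt, so two requests with the same prompt but different models or sampling parameters would share a cache entry. A possible fix is to hash a canonical string built from everything that affects the output. canonicalRequest is a hypothetical helper; its result would be passed to the same SHA-256 hashing as above.

```typescript
type Scalar = string | number | boolean | null;

// Build a deterministic string from everything that affects the response.
// Parameter keys are sorted so that object key order does not change the result.
function canonicalRequest(
  model: string,
  messages: { role: string; content: string }[],
  params: Record<string, Scalar> = {},
): string {
  const sortedParams = Object.keys(params)
    .sort()
    .map((key) => [key, params[key]]);
  return JSON.stringify({
    model,
    params: sortedParams,
    messages: messages.map((m) => [m.role, m.content]),
  });
}
```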

Observability

// Log usage metadata on every call
const startTime = Date.now();
const response = await openai.chat.completions.create(params);
logger.info({
  model: response.model,
  promptTokens: response.usage?.prompt_tokens,
  completionTokens: response.usage?.completion_tokens,
  cachedTokens: response.usage?.prompt_tokens_details?.cached_tokens,
  finishReason: response.choices[0].finish_reason,
  durationMs: Date.now() - startTime,
});
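The logged usage fields can be turned into two derived metrics: the cache hit ratio and an estimated cost per call. The sketch below assumes cached_tokens is a subset of prompt_tokens and takes prices as caller-supplied inputs. The Pricing shape and summarizeUsage are hypothetical, and no real prices are hard-coded.

```typescript
// Subset of the usage object logged above.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Hypothetical, caller-supplied prices in USD per million tokens.
interface Pricing {
  inputPerMillion: number;
  cachedInputPerMillion: number;
  outputPerMillion: number;
}

function summarizeUsage(
  usage: Usage,
  pricing: Pricing,
): { cacheHitRatio: number; costUsd: number } {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  const uncached = usage.prompt_tokens - cached;
  const costUsd =
    (uncached * pricing.inputPerMillion +
      cached * pricing.cachedInputPerMillion +
      usage.completion_tokens * pricing.outputPerMillion) /
    1_000_000;
  // Share of prompt tokens served from the cache (0 when the prompt is empty).
  const cacheHitRatio =
    usage.prompt_tokens > 0 ? cached / usage.prompt_tokens : 0;
  return { cacheHitRatio, costUsd };
}
```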

Security checklist

Never expose your API key to the client — always proxy through your backend.
Validate and sanitize user input before injecting into prompts.
Set max_tokens on every request to cap unexpected cost spikes (use max_completion_tokens for o-series reasoning models, which do not accept max_tokens).
Use the moderation endpoint on user-generated content before processing.

// Always set max_tokens
await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages,
  max_tokens: 512,  // Never omit in production
});
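For the input-validation item, a basic pre-processing step might look like the sketch below. Both helpers and the user_input delimiter are hypothetical choices. This reduces accidental breakage and makes the boundary of user text explicit, but it should not be read as a defence that prevents prompt injection on its own.

```typescript
// Strip control characters (keeping tab, newline, carriage return),
// trim, reject empty input, and cap the length.
function sanitizeUserInput(input: string, maxChars = 4000): string {
  const cleaned = input
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, '')
    .trim();
  if (cleaned.length === 0) {
    throw new Error('Empty input');
  }
  return cleaned.length > maxChars ? cleaned.slice(0, maxChars) : cleaned;
}

// Wrap user text in explicit delimiters, removing any copies of the
// delimiter tags that the user supplied.
function wrapUserInput(input: string): string {
  const withoutTags = input.replace(/<\/?user_input>/gi, '');
  return `<user_input>\n${withoutTags}\n</user_input>`;
}
```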

Takeaway

Reliability comes from retries and timeouts. Cost control comes from caching, model routing, and the Batch API. Treat every missing max_tokens and exposed API key as a production incident waiting to happen.