OpenAI API Best Practices: Production Patterns for Reliability and Cost Control
Running the OpenAI API in production requires more than calling chat.completions.create. This guide covers the patterns that separate reliable AI features from brittle ones.
Client setup
import OpenAI from 'openai';
const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  maxRetries: 3, // built-in exponential backoff
  timeout: 30_000, // 30-second timeout
});
Retry and rate-limit handling
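OpenAI's SDK already retries connection errors and 408, 409, 429, and 5xx responses with exponential backoff, so most applications only need to tune maxRetries. If you need custom retry logic, for example to control jitter or log each attempt, wrap the call yourself. Because the wrapper sits on top of the SDK's own retries, consider lowering maxRetries on the client so the two layers do not multiply: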
import { RateLimitError, APIConnectionError } from 'openai';
async function callWithRetry(params, attempts = 0) {
  try {
    return await openai.chat.completions.create(params);
  } catch (err) {
    if ((err instanceof RateLimitError || err instanceof APIConnectionError) && attempts < 4) {
      const delay = Math.pow(2, attempts) * 1000 + Math.random() * 500;
      await new Promise(r => setTimeout(r, delay));
      return callWithRetry(params, attempts + 1);
    }
    throw err;
  }
}
Streaming responses
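Stream responses for latency-sensitive UIs so users see the first tokens as soon as they are generated instead of waiting for the full completion. A time-to-first-token under 500 ms is a reasonable target, though actual latency depends on the model and prompt length: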
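const stream = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages,
  stream: true,
});
// Option 1: write tokens to stdout as they arrive
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta?.content ?? '';
  process.stdout.write(delta);
}
// Option 2: Server-Sent Events (Next.js API route).
// A stream can only be consumed once, so use this instead of the loop above, not after it.
const encoder = new TextEncoder();
return new Response(
  new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        const text = chunk.choices[0]?.delta?.content ?? '';
        if (!text) continue; // skip chunks that carry no text
        // JSON-encode the payload so newlines in the text cannot break SSE framing
        controller.enqueue(encoder.encode(`data: ${JSON.stringify(text)}\n\n`));
      }
      controller.close();
    },
  }),
  { headers: { 'Content-Type': 'text/event-stream' } }
);
Model selection by use case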
| Model | Best for | Relative cost |
|---|---|---|
| gpt-4o-mini | Classification, extraction, quick Q&A | $ |
| gpt-4o | Complex reasoning, code generation | $$ |
| o3-mini | Math, logic, multi-step planning | $$$ |
| gpt-4o (vision) | Image understanding | $$ |
| text-embedding-3-small | Embeddings, semantic search | $ |
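One way to keep these choices in a single place is a lookup keyed by task type. The sketch below is only an illustration: the task labels, DEFAULT_MODEL, and pickModel are hypothetical names, and the assignments simply mirror the table above.

```typescript
// Task labels are hypothetical; the model names come from the table above.
// Note that text-embedding-3-small is served by the embeddings endpoint,
// not by chat completions.
const MODEL_BY_TASK = new Map<string, string>([
  ['classification', 'gpt-4o-mini'],
  ['extraction', 'gpt-4o-mini'],
  ['quick-qa', 'gpt-4o-mini'],
  ['complex-reasoning', 'gpt-4o'],
  ['code-generation', 'gpt-4o'],
  ['image-understanding', 'gpt-4o'],
  ['math', 'o3-mini'],
  ['multi-step-planning', 'o3-mini'],
  ['embedding', 'text-embedding-3-small'],
]);

// Assumption: unknown tasks fall back to the cheapest chat model.
const DEFAULT_MODEL = 'gpt-4o-mini';

function pickModel(task: string): string {
  return MODEL_BY_TASK.get(task) ?? DEFAULT_MODEL;
}

console.log(pickModel('math'));
console.log(pickModel('something-else'));
```

Centralising the mapping means a pricing or model change touches one table rather than every call site.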
Token budgeting
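import { encoding_for_model, type TiktokenModel } from '@dqbd/tiktoken';
function countTokens(text: string, model: TiktokenModel = 'gpt-4o'): number {
  const enc = encoding_for_model(model);
  try {
    return enc.encode(text).length;
  } finally {
    enc.free(); // release the encoder even if encode throws
  }
}
// Guard before sending. This counts raw text only; the chat format adds a few tokens per message.
const MAX_INPUT_TOKENS = 4000; // application budget, not a model limit
const inputTokens = countTokens(systemPrompt + userMessage);
if (inputTokens > MAX_INPUT_TOKENS) {
  throw new Error(`Prompt too large: ${inputTokens} tokens`);
}
Prompt caching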
OpenAI automatically caches the prefix of long prompts (≥1024 tokens). To maximise cache hits:
- Keep your system prompt stable and at the front of the context.
- Append the dynamic user message at the end.
- Avoid adding timestamps or request IDs to the system prompt.
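To see why the ordering matters, the sketch below compares how much leading text two requests share under each layout. It is a rough illustration only: it measures a character-level prefix as a stand-in for the token prefix that caching matches on, and buildMessages, serialize, and sharedPrefixLength are hypothetical helpers, not part of any SDK.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Stable content first, per-request content last.
function buildMessages(systemPrompt: string, userQuestion: string): ChatMessage[] {
  return [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: userQuestion },
  ];
}

// Flatten messages to one string so two requests can be compared.
const serialize = (messages: ChatMessage[]): string =>
  messages.map(m => `${m.role}:${m.content}`).join('\n');

// Number of leading characters two strings have in common.
function sharedPrefixLength(a: string, b: string): number {
  const max = Math.min(a.length, b.length);
  let i = 0;
  while (i < max && a[i] === b[i]) i++;
  return i;
}

// Stand-in for a long, stable system prompt.
const system = 'You are a support assistant. '.repeat(50);

// Recommended layout: identical system prompt, dynamic question at the end.
const good1 = serialize(buildMessages(system, 'How do I reset my password?'));
const good2 = serialize(buildMessages(system, 'Where is my invoice?'));

// Anti-pattern: a request ID at the front of the system prompt.
const bad1 = serialize(buildMessages(`[req 1001] ${system}`, 'How do I reset my password?'));
const bad2 = serialize(buildMessages(`[req 1002] ${system}`, 'Where is my invoice?'));

const goodShared = sharedPrefixLength(good1, good2);
const badShared = sharedPrefixLength(bad1, bad2);
console.log({ goodShared, badShared });
```

With the stable layout the two requests share the entire system prompt; with a request ID up front they diverge within the first few characters, so there is almost nothing for a prefix cache to reuse.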
// Cached prefix (stable system prompt, ~1200 tokens)
const messages = [
  { role: 'system', content: stableSystemPrompt }, // ← cached after first request
  { role: 'user', content: userQuestion }, // ← dynamic, not cached
];
Structured output
const result = await openai.chat.completions.create({
  model: 'gpt-4o',
  messages,
  response_format: {
    type: 'json_schema',
    json_schema: {
      name: 'ExtractedData',
      strict: true,
      schema: {
        type: 'object',
        properties: {
          sentiment: { type: 'string', enum: ['positive', 'negative', 'neutral'] },
          confidence: { type: 'number' },
          summary: { type: 'string' },
        },
        required: ['sentiment', 'confidence', 'summary'],
        additionalProperties: false,
      },
    },
  },
});
const data = JSON.parse(result.choices[0].message.content!);
Cost control patterns
- Cache repeated results: hash the prompt, store the response in Redis with a TTL.
- Batch API: use /v1/batches for non-realtime workloads, at a 50% cost reduction.
- Route by complexity: classify the query first with gpt-4o-mini, then only escalate complex queries to gpt-4o.
- Truncate history: for multi-turn conversations, summarize old turns instead of sending the full transcript.
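The history-truncation idea can be sketched as follows. This is a minimal illustration under stated assumptions: compactHistory and the Summarizer type are hypothetical, and the stub summarizer stands in for what would likely be a cheap model call in a real system.

```typescript
type ChatMessage = { role: 'system' | 'user' | 'assistant'; content: string };

// Hypothetical signature: a real implementation might call gpt-4o-mini here.
type Summarizer = (turns: ChatMessage[]) => Promise<string>;

// Keep system messages and the last `keepRecent` turns; replace older
// turns with a single summary message.
async function compactHistory(
  messages: ChatMessage[],
  keepRecent: number,
  summarize: Summarizer,
): Promise<ChatMessage[]> {
  const system = messages.filter(m => m.role === 'system');
  const turns = messages.filter(m => m.role !== 'system');
  if (turns.length <= keepRecent) return messages; // nothing to compact

  const old = turns.slice(0, turns.length - keepRecent);
  const recent = turns.slice(turns.length - keepRecent);
  const summaryMessage: ChatMessage = {
    role: 'system',
    content: `Summary of earlier conversation: ${await summarize(old)}`,
  };
  // Original system messages stay first, so a stable prompt prefix is preserved.
  return [...system, summaryMessage, ...recent];
}

// Stub summarizer for demonstration only.
const stubSummarize: Summarizer = async turns => `${turns.length} earlier messages omitted`;

const history: ChatMessage[] = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Hi' },
  { role: 'assistant', content: 'Hello! How can I help?' },
  { role: 'user', content: 'What is the Batch API?' },
  { role: 'assistant', content: 'It runs requests asynchronously.' },
  { role: 'user', content: 'How much cheaper is it?' },
];

compactHistory(history, 2, stubSummarize).then(compacted => console.log(compacted));
```

Placing the summary after the original system prompt rather than before it keeps the front of the context unchanged between requests.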
import crypto from 'node:crypto';
// Exact-match cache check (inside a request handler; redis is an ioredis-style client)
const cacheKey = crypto.createHash('sha256').update(prompt).digest('hex');
const cached = await redis.get(cacheKey);
if (cached) return JSON.parse(cached);
const response = await openai.chat.completions.create({ ... });
await redis.setex(cacheKey, 3600, JSON.stringify(response)); // 1-hour TTL
return response;
Observability
// Log usage metadata on every call
const startTime = Date.now();
const response = await openai.chat.completions.create(params);
logger.info({
  model: response.model,
  promptTokens: response.usage?.prompt_tokens,
  completionTokens: response.usage?.completion_tokens,
  cachedTokens: response.usage?.prompt_tokens_details?.cached_tokens,
  finishReason: response.choices[0].finish_reason,
  durationMs: Date.now() - startTime,
});
Security checklist
- Never expose your API key to the client — always proxy through your backend.
- Validate and sanitize user input before injecting into prompts.
- Set max_tokens on every request to cap unexpected cost spikes (o-series reasoning models such as o3-mini take max_completion_tokens instead).
- Use the moderation endpoint on user-generated content before processing.
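As a rough illustration of the input-validation item, the sketch below normalises user text, strips control characters, and enforces a length cap before the text goes anywhere near a prompt. The function name and the 4000-character limit are assumptions for the example, and cleaning input this way does not by itself prevent prompt injection.

```typescript
// Hypothetical application limit; tune to your own prompt budget.
const MAX_INPUT_CHARS = 4000;

function sanitizeUserInput(raw: string, maxChars: number = MAX_INPUT_CHARS): string {
  const cleaned = raw
    .normalize('NFC')
    // Remove control characters except tab, newline, and carriage return.
    .replace(/[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F]/g, '')
    .trim();

  if (cleaned.length === 0) {
    throw new Error('Input is empty');
  }
  if (cleaned.length > maxChars) {
    throw new Error(`Input too long: ${cleaned.length} > ${maxChars} characters`);
  }
  return cleaned;
}

// Usage: keep user text in its own message rather than splicing it
// into the system prompt.
const userMessage = {
  role: 'user' as const,
  content: sanitizeUserInput('  What does the Batch API cost?\u0000  '),
};
console.log(userMessage);
```

Rejecting oversized input outright, rather than silently truncating it, makes the limit visible to callers and keeps the cost guard predictable.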
// Always set max_tokens
await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages,
  max_tokens: 512, // Never omit in production
});
Takeaway
Reliability comes from retries and timeouts. Cost control comes from caching, model routing, and the Batch API. Treat every missing max_tokens and exposed API key as a production incident waiting to happen.