AI Observability: Monitoring LLM Applications in Production
LLM applications fail silently — a degraded prompt, a new model version, or a changed system prompt can tank quality without triggering any traditional alert. Observability for AI requires tracking signals that do not exist in standard APM tools.
What to measure
| Signal | Why it matters | Alert threshold |
|---|---|---|
| P95 latency | User experience | >3s for streaming, >8s for batch |
| Input token count | Cost and context-window risk | >80% of model limit |
| Output token count | Cost, truncation detection | Consistently hitting max_tokens |
| Error rate | Reliability | >1% 5xx or parse failures |
| Cache hit rate | Cost efficiency | <20% for FAQ-type workloads |
| Quality score | Output degradation | 5% drop from baseline |
| Hallucination rate | Trust and safety | Any upward trend |
| Cost per request | Business viability | 20% above budget baseline |
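A few of the thresholds in the table can be checked mechanically once the metrics are aggregated. The sketch below is illustrative only: `MetricSnapshot`, `checkThresholds`, and the alert names are hypothetical, and it assumes the "5% drop from baseline" is a relative drop rather than an absolute one.

```typescript
// Hypothetical snapshot of aggregated metrics for one evaluation window.
interface MetricSnapshot {
  avgPromptTokens: number;      // mean input tokens per request
  modelContextLimit: number;    // context window of the model in use (assumed known)
  qualityScore: number;         // current mean quality score, 0..1
  baselineQuality: number;      // mean quality score from a reference period
  costPerRequest: number;       // current mean cost in USD
  budgetCostPerRequest: number; // budgeted mean cost in USD
}

// Returns the names of the table thresholds that the snapshot violates.
function checkThresholds(s: MetricSnapshot): string[] {
  const alerts: string[] = [];

  // Input token count: more than 80% of the model limit
  if (s.avgPromptTokens / s.modelContextLimit > 0.8) {
    alerts.push('context_utilization');
  }

  // Quality score: 5% drop from baseline (assumed to be a relative drop)
  if (
    s.baselineQuality > 0 &&
    (s.baselineQuality - s.qualityScore) / s.baselineQuality >= 0.05
  ) {
    alerts.push('quality_drop');
  }

  // Cost per request: 20% above the budget baseline
  if (s.costPerRequest > s.budgetCostPerRequest * 1.2) {
    alerts.push('cost_over_budget');
  }

  return alerts;
}
```

In practice these comparisons usually live in the monitoring backend's alert rules; a function like this is mainly useful for unit-testing the thresholds themselves.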
Instrumentation: logging every LLM call
interface LLMCallLog {
traceId: string;
model: string;
promptTokens: number;
completionTokens: number;
cachedTokens: number;
latencyMs: number;
finishReason: string;
estimatedCostUSD: number;
userId?: string;
feature: string; // which product feature triggered this call
promptVersion: string; // track prompt changes over time
}
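The `trackedCompletion` wrapper in this section calls an `estimateCost` helper that the article does not define. A minimal sketch is shown here under stated assumptions: the `PRICING` table, its model names, and its prices are placeholders rather than real provider pricing, and cached tokens are assumed to be a subset of `prompt_tokens` billed at a discounted rate.

```typescript
// Hypothetical prices in USD per million tokens. Placeholder values only;
// load real numbers from your provider's current price list.
interface ModelPricing {
  inputPerMTok: number;
  cachedInputPerMTok: number;
  outputPerMTok: number;
}

const PRICING: Record<string, ModelPricing> = {
  'example-large': { inputPerMTok: 2.5, cachedInputPerMTok: 1.25, outputPerMTok: 10 },
  'example-small': { inputPerMTok: 0.2, cachedInputPerMTok: 0.1, outputPerMTok: 0.8 },
};

// Structural subset of the usage object read by the wrapper.
interface UsageLike {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

function estimateCost(model: string, usage: UsageLike): number {
  // Responses may report a dated model name, so match on the longest known prefix.
  const key = Object.keys(PRICING)
    .filter((k) => model.startsWith(k))
    .sort((a, b) => b.length - a.length)[0];
  if (!key) return 0; // unknown model: report 0 rather than guess a price

  const price = PRICING[key];
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  const uncached = Math.max(0, usage.prompt_tokens - cached);

  return (
    (uncached * price.inputPerMTok +
      cached * price.cachedInputPerMTok +
      usage.completion_tokens * price.outputPerMTok) /
    1_000_000
  );
}
```

Returning 0 for unknown models keeps logging from failing, but it also hides cost. Emitting a separate counter for unpriced models is one way to notice when the table goes stale.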
async function trackedCompletion(
params: OpenAI.ChatCompletionCreateParamsNonStreaming, // non-streaming, so the result is a ChatCompletion
meta: { feature: string; promptVersion: string; userId?: string }
): Promise<OpenAI.ChatCompletion> {
const traceId = crypto.randomUUID();
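const start = Date.now();
// Count every attempt, including failures; this is the denominator for error-rate alerts
metrics.increment('llm.calls', 1, { feature: meta.feature });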
try {
const response = await openai.chat.completions.create(params);
const latencyMs = Date.now() - start;
const log: LLMCallLog = {
traceId,
model: response.model,
promptTokens: response.usage!.prompt_tokens,
completionTokens: response.usage!.completion_tokens,
cachedTokens: response.usage?.prompt_tokens_details?.cached_tokens ?? 0,
latencyMs,
finishReason: response.choices[0].finish_reason,
estimatedCostUSD: estimateCost(response.model, response.usage!),
...meta,
};
logger.info({ event: 'llm_call', ...log });
metrics.histogram('llm.latency_ms', latencyMs, { model: response.model, feature: meta.feature });
metrics.increment('llm.tokens', log.promptTokens + log.completionTokens);
return response;
} catch (err) {
logger.error({ event: 'llm_error', traceId, feature: meta.feature, error: String(err) });
metrics.increment('llm.errors', 1, { feature: meta.feature });
throw err;
}
}
Distributed tracing with OpenTelemetry
import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('ai-service', '1.0.0');
async function tracedLLMCall(messages: OpenAI.ChatCompletionMessageParam[]) {
const span = tracer.startSpan('openai.chat.completion', {
attributes: {
'llm.model': 'gpt-4o',
'llm.feature': 'customer-support',
},
});
try {
const response = await openai.chat.completions.create({ model: 'gpt-4o', messages });
span.setAttributes({
'llm.prompt_tokens': response.usage!.prompt_tokens,
'llm.completion_tokens': response.usage!.completion_tokens,
'llm.finish_reason': response.choices[0].finish_reason,
});
span.setStatus({ code: SpanStatusCode.OK });
return response;
} catch (err) {
span.recordException(err as Error);
span.setStatus({ code: SpanStatusCode.ERROR });
throw err;
} finally {
span.end();
}
}
LangSmith integration
LangSmith provides full prompt + response tracing out of the box for LangChain applications:
// Set environment variables — LangChain auto-instruments
process.env.LANGCHAIN_TRACING_V2 = 'true';
process.env.LANGCHAIN_API_KEY = process.env.LANGSMITH_API_KEY!;
process.env.LANGCHAIN_PROJECT = 'production-ai-v2';
// All LangChain calls are now traced automatically
const chain = RunnableSequence.from([promptTemplate, model, outputParser]);
const result = await chain.invoke({ question: userQuery });
// → Appears in LangSmith with full prompt, response, latency, token counts
Quality scoring in production
// Automated quality scoring using a judge LLM
async function scoreResponse(
question: string,
response: string,
groundTruth?: string
): Promise<number> {
const judgeResponse = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [{
role: 'user',
content: `Score this AI response from 0.0 to 1.0 for quality. Reply with a number only.
Question: ${question}
Response: ${response}
${groundTruth ? `Reference: ${groundTruth}` : ''}`,
}],
max_tokens: 5,
});
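const score = parseFloat(judgeResponse.choices[0].message.content ?? '');
// The judge can return non-numeric text; fail loudly instead of recording NaN
if (Number.isNaN(score)) throw new Error('Judge returned a non-numeric score');
// Clamp to the 0.0–1.0 range requested in the prompt
return Math.min(1, Math.max(0, score));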
}
// Sample 10% of traffic for quality scoring
if (Math.random() < 0.1) {
const score = await scoreResponse(userQuestion, aiResponse);
metrics.histogram('llm.quality_score', score, { feature });
}
Dashboard metrics to track
// Key metrics for your observability dashboard:
//
// 1. Latency percentiles (P50, P95, P99) by model and feature
// 2. Token usage trend (prompt + completion), to detect prompt bloat
// 3. Estimated daily/weekly cost by model and feature
// 4. Error rate by type (timeout, rate limit, parse failure)
// 5. Cache hit rate (semantic cache efficiency)
// 6. Quality score distribution over time
// 7. Token context utilization (avg prompt / model max)
// 8. Finish reason breakdown (stop vs length → truncation alert)
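Most metrics backends compute percentiles and breakdowns for you, but it helps to see what the first and last items in the list reduce to. The sketch below uses the nearest-rank method for percentiles and a simple ratio for truncation. `percentile` and `truncationRate` are hypothetical helper names, not part of any library.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it. p must be in the range (0, 100].
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('percentile of empty sample set');
  if (p <= 0 || p > 100) throw new RangeError('p must be in (0, 100]');
  const sorted = [...samples].sort((a, b) => a - b); // numeric sort, not lexicographic
  const rank = Math.ceil((p / 100) * sorted.length); // 1-based rank
  return sorted[rank - 1];
}

// Fraction of calls whose finish reason was 'length', meaning the output
// hit the token limit and was likely truncated.
function truncationRate(finishReasons: string[]): number {
  if (finishReasons.length === 0) return 0;
  const truncated = finishReasons.filter((r) => r === 'length').length;
  return truncated / finishReasons.length;
}

// Example inputs: ten latency samples in milliseconds and four finish reasons.
const latencies = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000];
const p50 = percentile(latencies, 50);
const p95 = percentile(latencies, 95);
const truncated = truncationRate(['stop', 'length', 'stop', 'stop']);
```

With only ten samples, P95 and P99 both land on the largest value, which is a reminder that tail percentiles need a reasonable sample count per model and feature before they mean much.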
Alerting rules
# Example: Datadog-style pseudo-config
alerts:
- name: "LLM P95 latency spike"
condition: "p95(llm.latency_ms) > 5000 for 5min"
severity: warning
- name: "LLM error rate high"
condition: "rate(llm.errors) / rate(llm.calls) > 0.02 for 5min"
severity: critical
- name: "Response truncation detected"
condition: "rate(llm.finish_reason:length) > 0.05 for 15min"
severity: warning
- name: "Daily cost budget exceeded"
condition: "sum(llm.estimated_cost_usd, 24h) > 500"
severity: critical
Prompt versioning
// Track which prompt version produced which output
const PROMPT_VERSION = 'support-v3.2';
const response = await trackedCompletion(params, {
feature: 'customer-support',
promptVersion: PROMPT_VERSION,
userId: req.user.id,
});
// In your analytics: group quality scores by promptVersion
// to detect regressions after prompt changes
Takeaway
Instrument every LLM call with traceId, feature tag, prompt version, token counts, latency, and estimated cost. Sample 5–10% of traffic for quality scoring. Alert on finish_reason:length (truncation), error rate, and quality score regression — these are the canary metrics for AI system health.