AI Observability: Monitoring LLM Applications in Production

LLM applications fail silently — a degraded prompt, a new model version, or a changed system prompt can tank quality without triggering any traditional alert. Observability for AI requires tracking signals that do not exist in standard APM tools.

What to measure

Signal	Why it matters	Alert threshold
P95 latency	User experience	>3s for streaming, >8s for batch
Input token count	Cost and context-window risk	>80% of model limit
Output token count	Cost, truncation detection	Consistently hitting max_tokens
Error rate	Reliability	>1% 5xx or parse failures
Cache hit rate	Cost efficiency	<20% for FAQ-type workloads
Quality score	Output degradation	5% drop from baseline
Hallucination rate	Trust and safety	Any upward trend
Cost per request	Business viability	20% above budget baseline

Instrumentation: logging every LLM call

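import OpenAI from 'openai';

// Assumed to exist elsewhere in your codebase: `logger` (structured logger),
// `metrics` (StatsD-style client), and `estimateCost` (pricing lookup).
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

interface LLMCallLog {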
  traceId:          string;
  model:            string;
  promptTokens:     number;
  completionTokens: number;
  cachedTokens:     number;
  latencyMs:        number;
  finishReason:     string;
  estimatedCostUSD: number;
  userId?:          string;
  feature:          string;  // which product feature triggered this call
  promptVersion:    string;  // track prompt changes over time
}

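async function trackedCompletion(
  // Non-streaming params only: a streaming call returns a Stream, not a ChatCompletion
  params: OpenAI.ChatCompletionCreateParamsNonStreaming,
  meta: { feature: string; promptVersion: string; userId?: string }
): Promise<OpenAI.ChatCompletion> {
  const traceId = crypto.randomUUID();
  const start = Date.now();
  // Count every attempt so error and truncation rates have a denominator
  metrics.increment('llm.calls', 1, { feature: meta.feature });

  try {
    const response = await openai.chat.completions.create(params);
    const latencyMs = Date.now() - start;
    const usage = response.usage; // optional in the SDK types, so avoid `!`
    const finishReason = response.choices[0]?.finish_reason ?? 'unknown';

    const log: LLMCallLog = {
      traceId,
      model:            response.model,
      promptTokens:     usage?.prompt_tokens ?? 0,
      completionTokens: usage?.completion_tokens ?? 0,
      cachedTokens:     usage?.prompt_tokens_details?.cached_tokens ?? 0,
      latencyMs,
      finishReason,
      estimatedCostUSD: usage ? estimateCost(response.model, usage) : 0,
      ...meta,
    };

    logger.info({ event: 'llm_call', ...log });
    metrics.histogram('llm.latency_ms', latencyMs, { model: response.model, feature: meta.feature });
    metrics.increment('llm.tokens', log.promptTokens + log.completionTokens);
    // Emit the metrics that the truncation and cost alerts depend on
    metrics.increment('llm.finish_reason', 1, { feature: meta.feature, reason: finishReason });
    metrics.increment('llm.estimated_cost_usd', log.estimatedCostUSD, { feature: meta.feature });

    return response;
  } catch (err) {
    logger.error({ event: 'llm_error', traceId, feature: meta.feature, error: String(err) });
    metrics.increment('llm.errors', 1, { feature: meta.feature });
    throw err;
  }
}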
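
The logging code calls estimateCost without defining it. A minimal sketch of such a helper follows, assuming a hand-maintained price table. The model names and per-million-token prices are placeholders, not real pricing, and the Usage interface is a hypothetical subset covering only the fields the logger reads.

```typescript
// Subset of the usage object attached to a chat completion (assumed shape).
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// USD per one million tokens. PLACEHOLDER values: replace with your provider's price sheet.
interface ModelPricing {
  inputPerM: number;
  cachedInputPerM: number;
  outputPerM: number;
}

const PRICING: Record<string, ModelPricing> = {
  'example-large': { inputPerM: 2.0, cachedInputPerM: 1.0, outputPerM: 8.0 },
  'example-small': { inputPerM: 0.2, cachedInputPerM: 0.1, outputPerM: 0.8 },
};

function estimateCost(model: string, usage: Usage): number {
  // Response model names may carry a version suffix, so match the longest known prefix.
  const key = Object.keys(PRICING)
    .filter((name) => model.startsWith(name))
    .sort((a, b) => b.length - a.length)[0];
  if (!key) return 0; // unknown model: report zero rather than guess

  const price = PRICING[key];
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  const uncached = Math.max(usage.prompt_tokens - cached, 0);

  return (
    uncached * price.inputPerM +
    cached * price.cachedInputPerM +
    usage.completion_tokens * price.outputPerM
  ) / 1_000_000;
}
```

Returning 0 for an unknown model keeps logging from failing, but it also hides spend, so you may prefer to log a warning in that branch.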

Distributed tracing with OpenTelemetry

import { trace, SpanStatusCode } from '@opentelemetry/api';
import OpenAI from 'openai';

// `openai` is the same client instance used in the logging example
const tracer = trace.getTracer('ai-service', '1.0.0');

async function tracedLLMCall(messages: OpenAI.ChatCompletionMessageParam[]) {
  const span = tracer.startSpan('openai.chat.completion', {
    attributes: {
      'llm.model': 'gpt-4o',
      'llm.feature': 'customer-support',
    },
  });

  try {
    const response = await openai.chat.completions.create({ model: 'gpt-4o', messages });
    span.setAttributes({
      // `usage` is optional in the SDK types, so fall back to 0 rather than assert
      'llm.prompt_tokens':     response.usage?.prompt_tokens ?? 0,
      'llm.completion_tokens': response.usage?.completion_tokens ?? 0,
      'llm.finish_reason':     response.choices[0]?.finish_reason ?? 'unknown',
    });
    span.setStatus({ code: SpanStatusCode.OK });
    return response;
  } catch (err) {
    span.recordException(err as Error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: err instanceof Error ? err.message : String(err),
    });
    throw err;
  } finally {
    span.end();
  }
}

LangSmith integration

LangSmith provides full prompt + response tracing out of the box for LangChain applications:

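import { RunnableSequence } from '@langchain/core/runnables';

// Set environment variables (ideally in your deployment config); LangChain auto-instruments.
// `promptTemplate`, `model`, and `outputParser` below are assumed to be defined elsewhere.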
process.env.LANGCHAIN_TRACING_V2 = 'true';
process.env.LANGCHAIN_API_KEY = process.env.LANGSMITH_API_KEY!;
process.env.LANGCHAIN_PROJECT = 'production-ai-v2';

// All LangChain calls are now traced automatically
const chain = RunnableSequence.from([promptTemplate, model, outputParser]);
const result = await chain.invoke({ question: userQuery });
// → Appears in LangSmith with full prompt, response, latency, token counts

Quality scoring in production

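// Automated quality scoring using a judge LLM.
// Returns null when the judge reply cannot be parsed as a number in [0, 1].
async function scoreResponse(
  question: string,
  response: string,
  groundTruth?: string
): Promise<number | null> {
  const judgeResponse = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    temperature: 0, // keep scores as repeatable as possible
    messages: [{
      role: 'user',
      content: `Score this AI response from 0.0 to 1.0 for quality. Reply with a number only.
Question: ${question}
Response: ${response}
${groundTruth ? `Reference: ${groundTruth}` : ''}`,
    }],
    max_tokens: 5,
  });
  // parseFloat yields NaN on non-numeric replies; never record that as a score
  const score = parseFloat(judgeResponse.choices[0]?.message.content ?? '');
  return Number.isFinite(score) && score >= 0 && score <= 1 ? score : null;
}

// Sample 10% of traffic for quality scoring.
// Run this off the request path so the judge call does not delay the user's response.
if (Math.random() < 0.1) {
  const score = await scoreResponse(userQuestion, aiResponse);
  if (score !== null) metrics.histogram('llm.quality_score', score, { feature });
}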
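
Sampling with Math.random() means you cannot later tell whether a given call was scored. One alternative, offered here as an assumption rather than a requirement of the approach above, is to derive the decision from the trace ID so it is repeatable. The names fnv1a and shouldSample are illustrative.

```typescript
// FNV-1a style 32-bit hash over the string's UTF-16 code units.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // 32-bit multiply by the FNV prime
  }
  return hash >>> 0; // reinterpret as unsigned 32-bit
}

// Returns the same answer every time for a given traceId and rate.
function shouldSample(traceId: string, rate: number): boolean {
  if (rate <= 0) return false;
  if (rate >= 1) return true;
  // Map the hash into [0, 1) and compare against the sampling rate.
  return fnv1a(traceId) / 0x100000000 < rate;
}

// Usage: replace `Math.random() < 0.1` with `shouldSample(traceId, 0.1)`.
```

Because the hash maps each ID to a fixed point in [0, 1), raising the rate only adds calls to the sample; anything scored at 10% is still scored at 20%.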

Dashboard metrics to track

// Key metrics for your observability dashboard:
//
// 1. Latency percentiles (P50, P95, P99) by model and feature
// 2. Token usage trend (prompt + completion) — detect prompt bloat
// 3. Estimated daily/weekly cost by model and feature
// 4. Error rate by type (timeout, rate limit, parse failure)
// 5. Cache hit rate (semantic cache efficiency)
// 6. Quality score distribution over time
// 7. Token context utilization (avg prompt / model max)
// 8. Finish reason breakdown (stop vs length → truncation alert)
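
Most metrics backends compute these aggregates for you. If you need them from raw logs, items 1 and 8 can be sketched as below, assuming in-memory arrays of latencies and finish reasons. The function names are illustrative, and the percentile uses the nearest-rank method, which is one of several common definitions.

```typescript
// Nearest-rank percentile over an array sorted in ascending order.
function percentile(sortedAsc: number[], p: number): number {
  if (sortedAsc.length === 0) return NaN;
  const rank = Math.ceil((p * sortedAsc.length) / 100); // 1-based rank
  const index = Math.min(Math.max(rank, 1), sortedAsc.length) - 1;
  return sortedAsc[index];
}

function latencySummary(latenciesMs: number[]): { p50: number; p95: number; p99: number } {
  const sorted = [...latenciesMs].sort((a, b) => a - b); // copy, then numeric sort
  return {
    p50: percentile(sorted, 50),
    p95: percentile(sorted, 95),
    p99: percentile(sorted, 99),
  };
}

// Fraction of calls per finish reason; a rising 'length' share signals truncation.
function finishReasonBreakdown(reasons: string[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const reason of reasons) counts[reason] = (counts[reason] ?? 0) + 1;

  const fractions: Record<string, number> = {};
  for (const reason of Object.keys(counts)) {
    fractions[reason] = counts[reason] / reasons.length;
  }
  return fractions;
}
```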

Alerting rules

# Example: Datadog-style pseudo-config (YAML, so comments use #)
alerts:
  - name: "LLM P95 latency spike"
    condition: "p95(llm.latency_ms) > 5000 for 5min"
    severity: warning

  - name: "LLM error rate high"
    condition: "rate(llm.errors) / rate(llm.calls) > 0.02 for 5min"
    severity: critical

  - name: "Response truncation detected"
    condition: "rate(llm.finish_reason:length) / rate(llm.calls) > 0.05 for 15min"
    severity: warning

  - name: "Daily cost budget exceeded"
    condition: "sum(llm.estimated_cost_usd, 24h) > 500"
    severity: critical
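
To make the rule logic concrete, here is a sketch that evaluates the same four conditions against one pre-aggregated window. It assumes truncation is measured as a share of calls, and it ignores the "for 5min" duration clauses, which a real alerting system handles for you. WindowStats, Alert, and evaluateAlerts are hypothetical names.

```typescript
interface WindowStats {
  calls: number;
  errors: number;
  truncated: number; // calls that ended with finish_reason 'length'
  p95LatencyMs: number;
  costUsdLast24h: number;
}

interface Alert {
  name: string;
  severity: 'warning' | 'critical';
}

function evaluateAlerts(w: WindowStats): Alert[] {
  const alerts: Alert[] = [];
  // Guard against division by zero when the window has no traffic.
  const shareOfCalls = (n: number): number => (w.calls > 0 ? n / w.calls : 0);

  if (w.p95LatencyMs > 5000) {
    alerts.push({ name: 'LLM P95 latency spike', severity: 'warning' });
  }
  if (shareOfCalls(w.errors) > 0.02) {
    alerts.push({ name: 'LLM error rate high', severity: 'critical' });
  }
  if (shareOfCalls(w.truncated) > 0.05) {
    alerts.push({ name: 'Response truncation detected', severity: 'warning' });
  }
  if (w.costUsdLast24h > 500) {
    alerts.push({ name: 'Daily cost budget exceeded', severity: 'critical' });
  }
  return alerts;
}
```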

Prompt versioning

// Track which prompt version produced which output
const PROMPT_VERSION = 'support-v3.2';

const response = await trackedCompletion(params, {
  feature: 'customer-support',
  promptVersion: PROMPT_VERSION,
  userId: req.user.id,
});

// In your analytics: group quality scores by promptVersion
// to detect regressions after prompt changes
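
The grouping described in the comment above might look like the following sketch, assuming scored calls are available as an in-memory array. ScoredCall, meanScoreByVersion, and isRegression are illustrative names, and the default 5% relative drop mirrors the quality-score threshold from the signals table.

```typescript
interface ScoredCall {
  promptVersion: string;
  score: number; // judge score in [0, 1]
}

// Mean quality score per prompt version.
function meanScoreByVersion(calls: ScoredCall[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const { promptVersion, score } of calls) {
    const entry = sums.get(promptVersion) ?? { total: 0, count: 0 };
    entry.total += score;
    entry.count += 1;
    sums.set(promptVersion, entry);
  }

  const means = new Map<string, number>();
  sums.forEach(({ total, count }, version) => means.set(version, total / count));
  return means;
}

// True when the candidate's mean falls more than maxRelativeDrop below the baseline's.
function isRegression(
  baselineMean: number,
  candidateMean: number,
  maxRelativeDrop = 0.05
): boolean {
  if (baselineMean <= 0) return false; // no meaningful baseline to compare against
  return (baselineMean - candidateMean) / baselineMean > maxRelativeDrop;
}
```

With few samples per version, a difference in means can be noise, so treat a flagged regression as a prompt to investigate rather than proof.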

Takeaway

Instrument every LLM call with traceId, feature tag, prompt version, token counts, latency, and estimated cost. Sample 5–10% of traffic for quality scoring. Alert on finish_reason:length (truncation), error rate, and quality score regression — these are the canary metrics for AI system health.