March 10, 2026
Six months after shipping a LangChain-based research assistant to 40,000 users, our p95 latency was sitting pretty at 380ms—and support tickets were climbing 15% week-over-week because the model was confidently hallucinating citations. Latency is what you reach for first because it's trivial to instrument. It's almost never the signal that tells you your pipeline is quietly on fire.
Traditional observability stacks—Prometheus scraping HTTP endpoints, Grafana dashboards full of request rates and error codes, Jaeger traces through microservices—were built for deterministic systems. An LLM pipeline isn't that. A 200 OK with a 1,200-token response can be a correct answer, a hallucinated answer, a politely-worded refusal, or a partially-executed tool call that silently dropped two steps. HTTP status codes tell you nothing about any of those outcomes.
You need a second layer of evaluation metrics running alongside—or immediately after—every inference call, feeding into the same observability infrastructure your on-call engineers already know how to query. Bolting on a separate "AI observability" tool that nobody checks during an incident is not the answer.
For a retrieval-augmented pipeline using pgvector or Pinecone with a well-tuned prompt, expect 0.5–2% hallucination on factual claims measured against a ground-truth eval set. Above 3% is a yellow alert. Above 6% is a page. You can't run a judge-LLM call on every request in production—cost kills that idea fast. Instead, run a lightweight embedding-based faithfulness check on every response (cosine similarity between retrieved context and generated answer, threshold 0.72), then route a 5% sample to a GPT-4-class judge for calibration. Emit the result as a Prometheus histogram, llm_faithfulness_score, so its _bucket series can answer quantile queries. When the rolling 5-minute p10 faithfulness drops below threshold, you want the alert firing before your users start filing tickets.
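As a sketch, here's what that check can look like with prometheus_client and an OpenAI-style embeddings client; embed_texts and enqueue_judge_review are illustrative helpers, not library calls, and the bucket edges are just one reasonable choice around the 0.72 threshold.
# Lightweight faithfulness check on every response, with a 5% sample routed
# to a judge model for calibration. embed_texts() and enqueue_judge_review()
# are placeholders for your own embedding call and review queue.
import random
import numpy as np
from prometheus_client import Histogram

FAITHFULNESS = Histogram(
    "llm_faithfulness_score",
    "Cosine similarity between retrieved context and generated answer",
    ["model", "pipeline_stage"],
    buckets=[0.5, 0.6, 0.7, 0.72, 0.8, 0.9, 1.0],
)

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def check_faithfulness(context, answer, model, embed_texts, judge_sample_rate=0.05):
    ctx_vec, ans_vec = embed_texts([context, answer])  # one batched embeddings call
    score = cosine(ctx_vec, ans_vec)
    FAITHFULNESS.labels(model=model, pipeline_stage="generation").observe(score)
    if random.random() < judge_sample_rate:
        enqueue_judge_review(context, answer, score)  # async; calibrates the 0.72 threshold
    return score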
A refusal isn't inherently bad. A sudden spike almost always signals a prompt regression, a guardrail misconfiguration, or an upstream context window overflow. Track llm_refusal_total as a counter, labeled by pipeline stage and model version.
The more dangerous pattern is refusal drift—the rate slowly creeping from 1.2% to 4.8% over three weeks because a retrieval reranker started injecting ambiguous content. Nobody notices until users complain. A Grafana alert on the 7-day rolling slope catches this before it becomes visible on the support queue. Most teams get this wrong by only alerting on absolute values, not slope.
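A minimal sketch of the counter with prometheus_client; is_refusal() is a stand-in for whatever detection you already run (guardrail verdict, refusal-phrase matching, or a small classifier).
# Refusal tracking, labeled by pipeline stage and model version. The companion
# request counter gives you the denominator for the refusal rate and its slope.
from prometheus_client import Counter

REFUSALS = Counter(
    "llm_refusal_total", "Responses classified as refusals",
    ["pipeline_stage", "model_version"],
)
REQUESTS = Counter(
    "llm_requests_total", "All completed inference calls",
    ["pipeline_stage", "model_version"],
)

def record_response(answer, stage, model_version):
    REQUESTS.labels(pipeline_stage=stage, model_version=model_version).inc()
    if is_refusal(answer):  # your own classifier / guardrail verdict
        REFUSALS.labels(pipeline_stage=stage, model_version=model_version).inc()

# Alert on the 7-day slope of (llm_refusal_total / llm_requests_total),
# not just the absolute rate, to catch the slow creep described above.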
LangGraph and AutoGen pipelines routing through tool calls—API lookups, SQL generation, code execution—have a compound failure surface. Track success at two levels: parse success (did the model emit valid JSON matching your tool schema?) and execution success (did the downstream call return a non-error result?). Parse failures above 1.5% usually mean your function-calling prompt has drifted or a model version changed underneath you without anyone noticing. Execution failures need separate attribution—they're often infrastructure problems, not model problems, and conflating the two wastes hours during an incident.
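A sketch of the two-level wrapper, assuming Pydantic v2 models as tool schemas; the counter names and the execute callable are illustrative.
# Two-level tool-call tracking: parse success (schema-valid JSON from the model)
# and execution success (the downstream call returned without error).
from prometheus_client import Counter
from pydantic import BaseModel, ValidationError

TOOL_CALLS = Counter("llm_tool_calls_total", "Attempted tool calls", ["tool", "model_version"])
PARSE_FAILURES = Counter("llm_tool_parse_failures_total", "Model emitted invalid tool-call JSON", ["tool", "model_version"])
EXEC_FAILURES = Counter("llm_tool_exec_failures_total", "Tool parsed but failed downstream", ["tool", "model_version"])

def run_tool_call(raw_json: str, tool: str, schema: type[BaseModel], execute, model_version: str):
    TOOL_CALLS.labels(tool=tool, model_version=model_version).inc()
    try:
        args = schema.model_validate_json(raw_json)  # parse level: a model problem if this fails
    except ValidationError:
        PARSE_FAILURES.labels(tool=tool, model_version=model_version).inc()
        raise
    try:
        return execute(args)  # execution level: usually an infrastructure problem
    except Exception:
        EXEC_FAILURES.labels(tool=tool, model_version=model_version).inc()
        raise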
Cost observability is under-instrumented in nearly every production stack we've audited. Emit llm_tokens_total (prompt + completion, labeled by model) and join it against your cloud spend API in Grafana. A well-optimized pipeline running mixtral-8x7b for classification and GPT-4o only for synthesis should land at $0.0008–$0.0015 per user session. If that number drifts upward 20%, it's usually prompt bloat, context window mismanagement (stuffing 8k tokens when 2k suffices), or a routing rule that quietly stopped working. A platform team we worked with last quarter caught a 40% cost regression in week one this way—beats finding it on the monthly invoice by a lot.
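A sketch of the token accounting, assuming an OpenAI-style usage object on the response; the per-1k prices are placeholders for your own rate card, not real pricing.
# Token counters per model; session cost can be derived here or joined against
# the cloud spend API in Grafana. Prices are illustrative placeholders.
from prometheus_client import Counter

TOKENS = Counter("llm_tokens_total", "Prompt and completion tokens", ["model", "token_type"])

PRICE_PER_1K = {  # placeholder numbers, replace with your rate card
    ("gpt-4o", "prompt"): 0.0025, ("gpt-4o", "completion"): 0.0100,
    ("mixtral-8x7b", "prompt"): 0.0005, ("mixtral-8x7b", "completion"): 0.0005,
}

def record_usage(model: str, usage) -> float:
    TOKENS.labels(model=model, token_type="prompt").inc(usage.prompt_tokens)
    TOKENS.labels(model=model, token_type="completion").inc(usage.completion_tokens)
    return (usage.prompt_tokens * PRICE_PER_1K[(model, "prompt")]
            + usage.completion_tokens * PRICE_PER_1K[(model, "completion")]) / 1000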
The OpenTelemetry GenAI semantic conventions (the gen_ai.* attribute namespace, published in the 1.27 release of the semconv spec and still marked experimental) give you a standard schema: gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.response.finish_reasons. Instrument at the LLM client boundary—not the HTTP layer—so you capture model-level semantics rather than transport noise. A minimal span looks like this:
// OpenTelemetry span attributes (GenAI conventions)
{
  "gen_ai.system": "openai",
  "gen_ai.request.model": "gpt-4o",
  "gen_ai.request.max_tokens": 1024,
  "gen_ai.usage.input_tokens": 1847,
  "gen_ai.usage.output_tokens": 312,
  "gen_ai.response.finish_reasons": ["stop"],
  "gen_ai.response.id": "chatcmpl-abc123",
  // custom evaluator attributes
  "llm.faithfulness_score": 0.91,
  "llm.refusal": false,
  "llm.tool_call_parse_ok": true
}
Export these spans to your existing collector pipeline—Prometheus remote write for metrics, Tempo or Jaeger for traces. Custom evaluator attributes like llm.faithfulness_score sit outside the official spec but are entirely legal as extension attributes. Prefix them with your namespace to avoid collisions when the spec expands.
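Here's a minimal instrumentation sketch with the OpenTelemetry Python SDK, assuming a tracer provider is already wired to your collector and an openai>=1.x client; the evaluator attributes are passed in from whatever checks you run after the call.
# Wrap the LLM client call in a span carrying GenAI attributes plus custom
# evaluator attributes. Assumes the tracer provider is configured elsewhere.
from opentelemetry import trace

tracer = trace.get_tracer("research-assistant.llm")

def traced_completion(client, messages, model="gpt-4o", max_tokens=1024, evaluators=None):
    with tracer.start_as_current_span("llm.completion") as span:
        span.set_attribute("gen_ai.system", "openai")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.max_tokens", max_tokens)
        resp = client.chat.completions.create(model=model, messages=messages, max_tokens=max_tokens)
        span.set_attribute("gen_ai.response.id", resp.id)
        span.set_attribute("gen_ai.response.finish_reasons", [c.finish_reason for c in resp.choices])
        span.set_attribute("gen_ai.usage.input_tokens", resp.usage.prompt_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", resp.usage.completion_tokens)
        for key, value in (evaluators or {}).items():
            span.set_attribute(f"llm.{key}", value)  # e.g. faithfulness_score, refusal
        return resp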
Semantic drift is what happens when your pipeline's output distribution shifts gradually—not broken, just different. A retrieval pipeline that answered "explain the return policy" with a two-sentence direct answer in January might respond with a five-paragraph hedged essay in April because an upstream reranker model was silently updated. Users don't file tickets. They just churn.
Detect it by embedding a random sample of production responses daily with a frozen encoder model (text-embedding-3-small works fine) and tracking centroid distance against a baseline snapshot stored in Redis. Centroid drift above 0.08 cosine distance over a 14-day window is a meaningful signal. Wire this into Grafana as a daily data point—it won't page you at 3am, but it will surface in your weekly reliability review before the churn numbers do.
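A sketch of that daily job, assuming an OpenAI-style embeddings client and a Redis key holding the baseline centroid; sample_production_responses() stands in for however you pull the daily sample.
# Daily semantic-drift check: embed sampled responses with a frozen encoder,
# compare the centroid against the baseline snapshot stored in Redis.
import json
import numpy as np
import redis
from prometheus_client import Gauge

DRIFT = Gauge("llm_semantic_drift_cosine_distance", "Centroid distance vs. frozen baseline")
r = redis.Redis()

def embed_batch(texts, client):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def daily_drift_check(client, sample_size=500):
    responses = sample_production_responses(n=sample_size)  # your own sampler
    centroid = embed_batch(responses, client).mean(axis=0)
    baseline = np.array(json.loads(r.get("drift:baseline_centroid")))
    distance = 1.0 - float(np.dot(centroid, baseline)
                           / (np.linalg.norm(centroid) * np.linalg.norm(baseline)))
    DRIFT.set(distance)  # Grafana threshold: 0.08 over a 14-day window
    return distance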
Every automated metric has blind spots. A user report that an answer was incorrect—a thumbs-down, a "this is wrong" flag, an explicit correction—is ground truth at a sample rate you can't manufacture synthetically. The problem is that most teams store this signal in a product database and never join it back to the trace that produced the response. That's a waste.
Fix it by emitting a trace_id in your API response payload, storing it alongside the user feedback event in Kafka, and running a Flink or Kafka Streams job that joins feedback to the original OpenTelemetry span within a 30-minute window. The resulting labeled dataset feeds weekly eval runs and, more immediately, lets you build a Grafana panel showing user-reported error rate by model version and pipeline variant. When a new LlamaIndex retriever ships and user-reported incorrect jumps from 1.1% to 2.8%, you know within hours instead of weeks.
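On the producer side, a sketch assuming confluent-kafka; the topic name and event schema are illustrative, and the trace id comes straight off the active OpenTelemetry span so the downstream Flink or Kafka Streams job has a join key.
# Emit user feedback keyed by the trace_id that was returned in the original
# API response, so the stream job can join it back to the OpenTelemetry span.
import json
import time
from confluent_kafka import Producer
from opentelemetry import trace

producer = Producer({"bootstrap.servers": "kafka:9092"})

def current_trace_id() -> str:
    # the id to include in the API response payload
    return format(trace.get_current_span().get_span_context().trace_id, "032x")

def record_feedback(trace_id: str, verdict: str, correction: str | None = None) -> None:
    event = {
        "trace_id": trace_id,   # same id the client got back with its answer
        "verdict": verdict,     # "thumbs_down", "incorrect", "correction"
        "correction": correction,
        "ts_ms": int(time.time() * 1000),
    }
    producer.produce("llm-user-feedback", key=trace_id, value=json.dumps(event))
    producer.flush()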
Don't build a separate "AI observability" dashboard silo. Grafana's support for multiple data sources lets you pull Prometheus metrics, OpenTelemetry trace exemplars, and your custom eval database into a single dashboard that on-call engineers can use without context-switching. Organize panels by signal type, not by tool: faithfulness, refusals, tool-call parse and execution success, cost per session, semantic drift, and user-reported error rate.
Set alert thresholds at two levels: a Slack notification for slow regressions (drift, cost creep) and a PagerDuty page for acute failures (hallucination rate above 6%, tool-call parse success below 95%). The goal is that any engineer on call can read the dashboard and immediately identify which stage is degrading—retrieval, generation, tool execution, or post-processing—without reproducing the issue locally first.