January 20, 2026
Three weeks into production, our LangGraph-based research agent started burning $4,000 a day in OpenAI costs—not because it was doing useful work, but because it had found a loop between a web-search tool and a summarization step that it couldn't reason its way out of. Nobody had written a test for "agent spins forever." That incident forced us to build a failure taxonomy before we wrote another line of agent code.
Agent failures cluster into four categories, and conflating them leads to wrong mitigations. Infinite loops occur when the planner never receives a terminal signal—the agent keeps invoking tools because its stopping condition is vague or the tool output doesn't match the expected schema. Tool misuse is subtler: the agent calls the right tool with semantically wrong arguments, producing plausible-looking garbage that propagates downstream. A ReAct-style agent using pgvector for similarity search might pass a question string instead of an embedding vector; the query succeeds, the results are meaningless, and nothing raises an exception. Hallucinated plans happen when the LLM generates a multi-step plan referencing tools or APIs that don't exist—AutoGen agents are particularly prone to this when the tool manifest isn't tightly constrained. Context overflow is the slow poison: at 8k or even 128k context windows, long-running agents accumulate tool outputs, scratchpad notes, and retry traces until the useful signal is buried, p95 latency climbs past 4 seconds, and coherence degrades below any useful threshold.
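Of these, tool misuse is the most preventable: validate tool arguments against a typed schema before execution, so semantically wrong arguments fail loudly instead of silently. A minimal sketch with Pydantic (the schema, dimension constant, and wrapper function here are hypothetical, not from any framework):

```python
from pydantic import BaseModel, field_validator

EMBEDDING_DIM = 1536  # assumption: depends on your embedding model

class SimilaritySearchArgs(BaseModel):
    query_embedding: list[float]  # a raw question string fails validation here
    top_k: int = 10

    @field_validator("query_embedding")
    @classmethod
    def check_dimension(cls, v: list[float]) -> list[float]:
        if len(v) != EMBEDDING_DIM:
            raise ValueError(f"expected {EMBEDDING_DIM}-dim vector, got {len(v)}")
        return v

def run_similarity_search(raw_args: dict):
    # Raises a ValidationError before pgvector ever sees the query,
    # instead of returning plausible-looking garbage.
    args = SimilaritySearchArgs(**raw_args)
    ...  # execute the actual pgvector query with the validated args
```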
Observability isn't optional here—it's the difference between a postmortem you can actually act on and one where you're just guessing. Instrument every tool invocation with OpenTelemetry spans. Each span should carry the tool name, input hash, output token count, latency, and a monotonically incrementing step counter per agent run. Export to a Prometheus/Grafana stack and alert on step count exceeding a threshold. We use 25 steps as a soft limit, 50 as a hard kill. LangChain's callback system and LlamaIndex's instrumentation hooks both emit compatible span data with minimal configuration.
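If you're rolling your own instrumentation, a sketch with the OpenTelemetry Python SDK (the attribute names and the count_tokens stand-in are our own conventions, not a standard):

```python
import hashlib
import time
from opentelemetry import trace

tracer = trace.get_tracer("agent.tools")

def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in; swap in your model's tokenizer

def instrumented_tool_call(run_state: dict, tool_name: str, tool_fn, tool_input: str):
    # One span per tool invocation, carrying the fields the alerting relies on.
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:
        run_state["step"] += 1  # monotonic step counter per agent run
        span.set_attribute("agent.step", run_state["step"])
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.input_hash",
                           hashlib.sha256(tool_input.encode()).hexdigest()[:16])
        start = time.monotonic()
        output = tool_fn(tool_input)
        span.set_attribute("tool.latency_ms", (time.monotonic() - start) * 1000)
        span.set_attribute("tool.output_tokens", count_tokens(output))
        return output
```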
Kafka is worth considering if you're running agents at scale. Stream all agent events to a topic and replay them for post-hoc debugging without touching the agent itself. At 10k agent runs per day, synchronous logging creates measurable latency overhead; async event streaming keeps it under 15ms per step. A platform team we worked with last quarter had skipped this and was flying blind when their agents started misbehaving—reconstructing what had happened from LLM API logs alone took two days.
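When you do wire this up, a sketch with confluent-kafka's producer (topic name and event shape are illustrative; produce() buffers and delivers on a background thread, so the agent loop never blocks on the broker):

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})

def emit_agent_event(run_id: str, step: int, event: dict) -> None:
    # Buffered, async delivery: the agent loop never waits on broker acks.
    producer.produce(
        topic="agent-events",
        key=run_id,
        value=json.dumps({"run_id": run_id, "step": step, **event}),
    )
    producer.poll(0)  # serve delivery callbacks without blocking
```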
Most teams get retry logic wrong for agents. A blind retry after a tool failure re-executes the same flawed plan with the same flawed inputs. Retries for agents are not the same as retries for HTTP calls, and the distinction matters. For transient infrastructure failures—Pinecone timeout, Redis eviction—exponential backoff with jitter is correct: base 100ms, cap at 8 seconds, three attempts. For tool-output validation failures, the retry must include the validation error in the prompt context so the LLM can adjust its argument construction. For hallucinated tool names, the correct response is not a retry at all—it's an immediate fallback to a constrained tool-selection prompt.
```yaml
# Pseudo-config for a LangGraph node with typed retry policy
tool_node:
  max_retries: 3
  retry_policy:
    TRANSIENT_ERROR: exponential_backoff(base=0.1, cap=8.0)
    VALIDATION_ERROR: retry_with_error_context(max=2)
    UNKNOWN_TOOL: fallback_plan_trigger
    TIMEOUT: circuit_breaker_check_then_retry(max=1)
  circuit_breaker:
    failure_threshold: 5  # failures in window
    window_seconds: 60
    open_duration_seconds: 30
    half_open_probes: 1
```
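In application code, that dispatch might look like the following sketch (the exception taxonomy and helper callables are ours, not LangGraph's):

```python
import random
import time

class TransientError(Exception): pass
class ToolValidationError(Exception): pass
class UnknownToolError(Exception): pass

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 8.0) -> float:
    # Exponential backoff with full jitter, capped at 8 seconds.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def run_tool_with_policy(tool_fn, args: dict, llm_repair_args, fallback_plan):
    for attempt in range(3):
        try:
            return tool_fn(**args)
        except TransientError:
            time.sleep(backoff_delay(attempt))  # same inputs; infra will recover
        except ToolValidationError as err:
            if attempt >= 2:
                raise
            # Put the validation error in the prompt context so the LLM
            # can adjust its argument construction, not repeat it.
            args = llm_repair_args(args, error=str(err))
        except UnknownToolError:
            # A hallucinated tool name never deserves a retry: fall back
            # to a constrained tool-selection prompt immediately.
            return fallback_plan(args)
    raise RuntimeError("retry budget exhausted")
```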
Circuit breakers should operate at the tool level, not the agent level. If your Weaviate semantic search tool is returning errors, you want to open the breaker on that tool and route to an Elasticsearch fallback—not kill the entire agent run. Istio service mesh can enforce circuit breakers at the network layer for external tool endpoints, giving you a safety net independent of application code.
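At the application layer, a minimal per-tool breaker sketch (thresholds match the config above; the Weaviate and Elasticsearch callables are stand-ins):

```python
import time

class ToolCircuitBreaker:
    # Opens after `threshold` failures in a rolling window; while open,
    # calls route to a fallback tool instead of killing the agent run.
    def __init__(self, threshold: int = 5, window: float = 60.0, open_for: float = 30.0):
        self.threshold, self.window, self.open_for = threshold, window, open_for
        self.failures: list[float] = []
        self.opened_at: float | None = None

    def is_open(self) -> bool:
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.open_for:
                return True
            self.opened_at = None  # half-open: let one probe through
        return False

    def record_failure(self) -> None:
        now = time.monotonic()
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now

def semantic_search(breaker, weaviate_search, elastic_search, query: str):
    if breaker.is_open():
        return elastic_search(query)  # degrade to keyword search, don't die
    try:
        return weaviate_search(query)
    except Exception:
        breaker.record_failure()
        return elastic_search(query)
```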
Every agent running in production needs a degraded-mode path. Design fallback plans as first-class artifacts. A research agent that normally hits five specialized tools should have a two-tool fallback that uses only a general web search and a cached knowledge base—lower-quality output, but actual output. Define quality tiers explicitly: Tier 1 uses all tools with full context, Tier 2 drops vector search and uses keyword-only retrieval from Elasticsearch, Tier 3 returns a cached response or a structured "unable to complete" with partial results. Hallucination rates climb from a 0.5% baseline at Tier 1 to roughly 3–4% at Tier 3. That's acceptable for some use cases and a disaster for others. Either way, document those numbers in your runbook so on-call engineers aren't making judgment calls at 2am.
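Tier selection works best as an explicit function of tool health rather than branching scattered through the agent; a sketch (the health flags are illustrative):

```python
from enum import Enum

class QualityTier(Enum):
    FULL = 1     # all tools, full context
    KEYWORD = 2  # vector search dropped; keyword-only Elasticsearch retrieval
    CACHED = 3   # cached response or structured partial result

def select_tier(vector_search_healthy: bool, keyword_search_healthy: bool) -> QualityTier:
    if vector_search_healthy and keyword_search_healthy:
        return QualityTier.FULL
    if keyword_search_healthy:
        return QualityTier.KEYWORD
    return QualityTier.CACHED
```

Tag every response with the tier that produced it, so the hallucination-rate numbers in the runbook stay auditable.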
LlamaIndex query pipelines support conditional routing that makes tier switching straightforward to implement. Temporal workflows are worth considering for agents with complex multi-step dependencies—Temporal's durable execution model handles retries, timeouts, and state persistence, which eliminates a large class of edge-case failures that are tedious to handle manually in LangGraph.
Not every recovery strategy should be automated. High-stakes agent actions—executing code, writing to production databases, sending external communications—need a human confirmation gate before execution, not after failure. Implement gates as blocking checkpoints: the agent proposes an action, serializes its plan to a queue, and waits for approval before proceeding. LangGraph's interrupt mechanism and Temporal's signal/wait patterns both support this natively.
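If you're on neither, a framework-neutral sketch of a blocking checkpoint using Redis as the approval queue (key names and payload shape are our own convention):

```python
import json
import redis

r = redis.Redis()

def propose_and_wait(run_id: str, action: dict, timeout_s: int = 3600) -> bool:
    # Serialize the proposed plan where a review UI or on-call human can see it.
    r.rpush("gate:pending", json.dumps({"run_id": run_id, "action": action}))
    # Block this run until a verdict arrives on its reply key.
    verdict = r.blpop(f"gate:verdict:{run_id}", timeout=timeout_s)
    if verdict is None:
        return False  # no response in time: fail closed, never open
    return bool(json.loads(verdict[1])["approved"])
```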
OPA (Open Policy Agent) can enforce gate policies declaratively without coupling them to agent code. Define policies as Rego rules, evaluate them at the workflow checkpoint (see the sketch after this paragraph), and you get a consistent audit trail at no extra cost. One thing people consistently get wrong: gate latency should be excluded from your agent p95 SLA. Human review time is a product decision, not an infrastructure problem.
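Evaluating the gate at the checkpoint is one call against OPA's standard data API (the policy path and input shape are illustrative):

```python
import requests

def gate_allowed(action: dict) -> bool:
    # Standard OPA data API: POST /v1/data/<policy path> with an "input" document.
    resp = requests.post(
        "http://opa:8181/v1/data/agents/gates/allow",
        json={"input": action},
        timeout=2,
    )
    resp.raise_for_status()
    # "result" is absent when the rule is undefined; treat that as a denial.
    return resp.json().get("result") is True
```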
Context overflow is the failure mode teams address last and regret earliest. At 8k context, an agent accumulates roughly 6–8 tool outputs before early conversation is truncated. At 128k context, the problem is delayed but not eliminated—and larger contexts push p95 latency past 2 seconds per LLM call.
The practical fix is aggressive context pruning at each step: summarize completed sub-tasks into a compact state object, evict raw tool outputs after extraction, and never carry full retrieval results beyond the step that used them. Redis is the right store for intermediate agent state between steps—it's fast, it's cheap, and its eviction policies keep memory bounded. Keep active context under 40% of the model's window. That headroom is what preserves coherent reasoning over long runs, and the loss is gradual enough that you won't notice until p95 latency has already gone sideways.
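A sketch of per-step pruning against a token budget (summarize and count_tokens stand in for your model call and tokenizer):

```python
MAX_CONTEXT_TOKENS = 128_000
BUDGET = int(MAX_CONTEXT_TOKENS * 0.40)  # keep active context under 40%

def prune_step(state: dict, raw_tool_output: str, summarize, count_tokens) -> dict:
    # Extract what this step needs, then evict the raw tool output entirely;
    # raw retrieval results never survive past the step that used them.
    state["notes"].append(summarize(raw_tool_output))
    while count_tokens("\n".join(state["notes"])) > BUDGET and len(state["notes"]) > 1:
        # Fold the two oldest notes into one summary to reclaim headroom.
        merged = summarize(state["notes"][0] + "\n" + state["notes"][1])
        state["notes"][:2] = [merged]
    return state
```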