January 20, 2026
Three weeks into production, our LangGraph-based research agent started burning $4,000 a day in OpenAI costs—not because it was doing useful work, but because it had found a loop between a web-search tool and a summarization step that it couldn't reason its way out of. Nobody had written a test for "agent spins forever." That incident forced us to build a failure taxonomy before we wrote another line of agent code.
Agent failures cluster into four categories, and conflating them leads to wrong mitigations. Infinite loops occur when the planner never receives a terminal signal—the agent keeps invoking tools because its stopping condition is vague or the tool output doesn't match the expected schema. Tool misuse is subtler: the agent calls the right tool with semantically wrong arguments, producing plausible-looking garbage that propagates downstream. A ReAct-style agent using pgvector for similarity search might pass a question string instead of an embedding vector; the query succeeds, the results are meaningless, and nothing raises an exception. Hallucinated plans happen when the LLM generates a multi-step plan referencing tools or APIs that don't exist—AutoGen agents are particularly prone to this when the tool manifest isn't tightly constrained. Context overflow is the slow poison: at 8k or even 128k context windows, long-running agents accumulate tool outputs, scratchpad notes, and retry traces until the useful signal is buried, p95 latency climbs past 4 seconds, and coherence degrades below any useful threshold.
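Of these, tool misuse is the most preventable: validate tool arguments against a typed schema before execution, so semantically wrong arguments fail loudly instead of silently. A minimal sketch with Pydantic (the schema, dimension constant, and wrapper function here are hypothetical, not from any framework):

```python
from pydantic import BaseModel, field_validator

EMBEDDING_DIM = 1536  # assumption: depends on your embedding model

class SimilaritySearchArgs(BaseModel):
    query_embedding: list[float]  # a raw question string fails validation here
    top_k: int = 10

    @field_validator("query_embedding")
    @classmethod
    def check_dimension(cls, v: list[float]) -> list[float]:
        if len(v) != EMBEDDING_DIM:
            raise ValueError(f"expected {EMBEDDING_DIM}-dim vector, got {len(v)}")
        return v

def run_similarity_search(raw_args: dict):
    # Raises a ValidationError before pgvector ever sees the query,
    # instead of returning plausible-looking garbage.
    args = SimilaritySearchArgs(**raw_args)
    ...  # execute the actual pgvector query with the validated args
```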
Observability isn't optional here—it's the difference between a postmortem you can actually act on and one where you're just guessing. Instrument every tool invocation with OpenTelemetry spans. Each span should carry the tool name, input hash, output token count, latency, and a monotonically incrementing step counter per agent run. Export to a Prometheus/Grafana stack and alert on step count exceeding a threshold. We use 25 steps as a soft limit, 50 as a hard kill. LangChain's callback system and LlamaIndex's instrumentation hooks both emit compatible span data with minimal configuration.
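If you're rolling your own instrumentation, a sketch with the OpenTelemetry Python SDK (the attribute names and the count_tokens stand-in are our own conventions, not a standard):

```python
import hashlib
import time
from opentelemetry import trace

tracer = trace.get_tracer("agent.tools")

def count_tokens(text: str) -> int:
    return len(text.split())  # stand-in; swap in your model's tokenizer

def instrumented_tool_call(run_state: dict, tool_name: str, tool_fn, tool_input: str):
    # One span per tool invocation, carrying the fields the alerting relies on.
    with tracer.start_as_current_span(f"tool.{tool_name}") as span:
        run_state["step"] += 1  # monotonic step counter per agent run
        span.set_attribute("agent.step", run_state["step"])
        span.set_attribute("tool.name", tool_name)
        span.set_attribute("tool.input_hash",
                           hashlib.sha256(tool_input.encode()).hexdigest()[:16])
        start = time.monotonic()
        output = tool_fn(tool_input)
        span.set_attribute("tool.latency_ms", (time.monotonic() - start) * 1000)
        span.set_attribute("tool.output_tokens", count_tokens(output))
        return output
```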
Kafka is worth considering if you're running agents at scale. Stream all agent events to a topic and replay them for post-hoc debugging without touching the agent itself. At 10k agent runs per day, synchronous logging creates measurable latency overhead; async event streaming keeps it under 15ms per step. A platform team we worked with last quarter had skipped this and was flying blind when their agents started misbehaving—reconstructing what had happened from LLM API logs alone took two days.
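When you do wire this up, a sketch with confluent-kafka's producer (topic name and event shape are illustrative; produce() buffers and delivers on a background thread, so the agent loop never blocks on the broker):

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "kafka:9092"})

def emit_agent_event(run_id: str, step: int, event: dict) -> None:
    # Buffered, async delivery: the agent loop never waits on broker acks.
    producer.produce(
        topic="agent-events",
        key=run_id,
        value=json.dumps({"run_id": run_id, "step": step, **event}),
    )
    producer.poll(0)  # serve delivery callbacks without blocking
```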
Most teams get retry logic wrong for agents. A blind retry after a tool failure re-executes the same flawed plan with the same flawed inputs. Retries for agents are not the same as retries for HTTP calls, and the distinction matters. For transient infrastructure failures—Pinecone timeout, Redis eviction—exponential backoff with jitter is correct: base 100ms, cap at 8 seconds, three attempts. For tool-output validation failures, the retry must include the validation error in the prompt context so the LLM can adjust its argument construction. For hallucinated tool names, the correct response is not a retry at all—it's an immediate fallback to a constrained tool-selection prompt.
```yaml
# Pseudo-config for a LangGraph node with typed retry policy
tool_node:
  max_retries: 3
  retry_policy:
    TRANSIENT_ERROR: exponential_backoff(base=0.1, cap=8.0)
    VALIDATION_ERROR: retry_with_error_context(max=2)
    UNKNOWN_TOOL: fallback_plan_trigger
    TIMEOUT: circuit_breaker_check_then_retry(max=1)
  circuit_breaker:
    failure_threshold: 5  # failures in window
    window_seconds: 60
    open_duration_seconds: 30
    half_open_probes: 1
```
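In application code, that dispatch might look like the following sketch (the exception taxonomy and helper callables are ours, not LangGraph's):

```python
import random
import time

class TransientError(Exception): pass
class ToolValidationError(Exception): pass
class UnknownToolError(Exception): pass

def backoff_delay(attempt: int, base: float = 0.1, cap: float = 8.0) -> float:
    # Exponential backoff with full jitter, capped at 8 seconds.
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def run_tool_with_policy(tool_fn, args: dict, llm_repair_args, fallback_plan):
    for attempt in range(3):
        try:
            return tool_fn(**args)
        except TransientError:
            time.sleep(backoff_delay(attempt))  # same inputs; infra will recover
        except ToolValidationError as err:
            if attempt >= 2:
                raise
            # Put the validation error in the prompt context so the LLM
            # can adjust its argument construction, not repeat it.
            args = llm_repair_args(args, error=str(err))
        except UnknownToolError:
            # A hallucinated tool name never deserves a retry: fall back
            # to a constrained tool-selection prompt immediately.
            return fallback_plan(args)
    raise RuntimeError("retry budget exhausted")
```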
Circuit breakers should operate at the tool level, not the agent level. If your Weaviate semantic search tool is returning errors, you want to open the breaker on that tool and route to an Elasticsearch fallback—not kill the entire agent run. Istio service mesh can enforce circuit breakers at the network layer for external tool endpoints, giving you a safety net independent of application code.
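At the application layer, a minimal per-tool breaker sketch (thresholds match the config above; the Weaviate and Elasticsearch callables are stand-ins):

```python
import time

class ToolCircuitBreaker:
    # Opens after `threshold` failures in a rolling window; while open,
    # calls route to a fallback tool instead of killing the agent run.
    def __init__(self, threshold: int = 5, window: float = 60.0, open_for: float = 30.0):
        self.threshold, self.window, self.open_for = threshold, window, open_for
        self.failures: list[float] = []
        self.opened_at: float | None = None

    def is_open(self) -> bool:
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.open_for:
                return True
            self.opened_at = None  # half-open: let one probe through
        return False

    def record_failure(self) -> None:
        now = time.monotonic()
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.opened_at = now

def semantic_search(breaker, weaviate_search, elastic_search, query: str):
    if breaker.is_open():
        return elastic_search(query)  # degrade to keyword search, don't die
    try:
        return weaviate_search(query)
    except Exception:
        breaker.record_failure()
        return elastic_search(query)
```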
Every agent running in production needs a degraded-mode path. Design fallback plans as first-class artifacts. A research agent that normally hits five specialized tools should have a two-tool fallback that uses only a general web search and a cached knowledge base—lower-quality output, but actual output. Define quality tiers explicitly: Tier 1 uses all tools with full context, Tier 2 drops vector search and uses keyword-only retrieval from Elasticsearch, Tier 3 returns a cached response or a structured "unable to complete" with partial results. Hallucination rates climb from a 0.5% baseline at Tier 1 to roughly 3–4% at Tier 3. That's acceptable for some use cases and a disaster for others. Either way, document those numbers in your runbook so on-call engineers aren't making judgment calls at 2am.
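Tier selection works best as an explicit function of tool health rather than branching scattered through the agent; a sketch (the health flags are illustrative):

```python
from enum import Enum

class QualityTier(Enum):
    FULL = 1     # all tools, full context
    KEYWORD = 2  # vector search dropped; keyword-only Elasticsearch retrieval
    CACHED = 3   # cached response or structured partial result

def select_tier(vector_search_healthy: bool, keyword_search_healthy: bool) -> QualityTier:
    if vector_search_healthy and keyword_search_healthy:
        return QualityTier.FULL
    if keyword_search_healthy:
        return QualityTier.KEYWORD
    return QualityTier.CACHED
```

Tag every response with the tier that produced it, so the hallucination-rate numbers in the runbook stay auditable.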
LlamaIndex query pipelines support conditional routing that makes tier switching straightforward to implement. Temporal workflows are worth considering for agents with complex multi-step dependencies—Temporal's durable execution model handles retries, timeouts, and state persistence, which eliminates a large class of edge-case failures that are tedious to handle manually in LangGraph.
Not every recovery strategy should be automated. High-stakes agent actions—executing code, writing to production databases, sending external communications—need a human confirmation gate before execution, not after failure. Implement gates as blocking checkpoints: the agent proposes an action, serializes its plan to a queue, and waits for approval before proceeding. LangGraph's interrupt mechanism and Temporal's signal/wait patterns both support this natively.
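If you're on neither, a framework-neutral sketch of a blocking checkpoint using Redis as the approval queue (key names and payload shape are our own convention):

```python
import json
import redis

r = redis.Redis()

def propose_and_wait(run_id: str, action: dict, timeout_s: int = 3600) -> bool:
    # Serialize the proposed plan where a review UI or on-call human can see it.
    r.rpush("gate:pending", json.dumps({"run_id": run_id, "action": action}))
    # Block this run until a verdict arrives on its reply key.
    verdict = r.blpop(f"gate:verdict:{run_id}", timeout=timeout_s)
    if verdict is None:
        return False  # no response in time: fail closed, never open
    return bool(json.loads(verdict[1])["approved"])
```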
OPA (Open Policy Agent) can enforce gate policies declaratively without coupling them to agent code. Define policies as Rego rules, evaluate them at the workflow checkpoint (see the sketch after this paragraph), and you get a consistent audit trail at no extra cost. One thing people consistently get wrong: gate latency should be excluded from your agent p95 SLA. Human review time is a product decision, not an infrastructure problem.
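Evaluating the gate at the checkpoint is one call against OPA's standard data API (the policy path and input shape are illustrative):

```python
import requests

def gate_allowed(action: dict) -> bool:
    # Standard OPA data API: POST /v1/data/<policy path> with an "input" document.
    resp = requests.post(
        "http://opa:8181/v1/data/agents/gates/allow",
        json={"input": action},
        timeout=2,
    )
    resp.raise_for_status()
    # "result" is absent when the rule is undefined; treat that as a denial.
    return resp.json().get("result") is True
```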
Context overflow is the failure mode teams address last and regret earliest. At 8k context, an agent accumulates roughly 6–8 tool outputs before early conversation is truncated. At 128k context, the problem is delayed but not eliminated—and larger contexts push p95 latency past 2 seconds per LLM call.
The practical fix is aggressive context pruning at each step: summarize completed sub-tasks into a compact state object, evict raw tool outputs after extraction, and never carry full retrieval results beyond the step that used them. Redis is the right store for intermediate agent state between steps—it's fast, it's cheap, and its eviction policies keep memory bounded. Keep active context under 40% of the model's window. That headroom is what preserves coherent reasoning over long runs, and the loss is gradual enough that you won't notice until p95 latency has already gone sideways.
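A sketch of per-step pruning against a token budget (summarize and count_tokens stand in for your model call and tokenizer):

```python
MAX_CONTEXT_TOKENS = 128_000
BUDGET = int(MAX_CONTEXT_TOKENS * 0.40)  # keep active context under 40%

def prune_step(state: dict, raw_tool_output: str, summarize, count_tokens) -> dict:
    # Extract what this step needs, then evict the raw tool output entirely;
    # raw retrieval results never survive past the step that used them.
    state["notes"].append(summarize(raw_tool_output))
    while count_tokens("\n".join(state["notes"])) > BUDGET and len(state["notes"]) > 1:
        # Fold the two oldest notes into one summary to reclaim headroom.
        merged = summarize(state["notes"][0] + "\n" + state["notes"][1])
        state["notes"][:2] = [merged]
    return state
```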