
Latency-budget routing: how to stop SLA breaches before they happen

Luke Norris May 28, 2025 7 min read

P95 latency monitoring dashboard showing automatic endpoint failover before SLA breach threshold

Most LLM routing configurations handle failure the same way traditional HTTP clients do: send a request, wait for a response, and if the response takes too long, surface an error to the user. The timeout is usually set in the HTTP client config and never revisited. For batch processing pipelines, this is tolerable. For interactive applications where users are waiting on a response, it's a latency SLA breach pattern that could have been interrupted before the user ever felt it.

Latency-budget routing is a different model: declare a latency budget in your routing policy (e.g., "this workload type must complete within 3 seconds at p95"), evaluate real-time per-endpoint latency metrics continuously, and fail over to an alternate endpoint before a request is even sent when the primary endpoint's current p95 exceeds the budget. The breach doesn't happen because the routing layer acts before it starts.

This post covers the mechanics of how latency-budget routing works in practice, the scenarios where it matters most, and the places where it can introduce more problems than it solves.

The distinction between timeout-based and budget-based routing

Timeout-based routing is reactive: a request is sent, it takes too long, the timeout fires, and the application receives an error. If your application handles that error gracefully, it may retry on another provider, but the user has already waited the full timeout duration before the fallback starts. With a 5-second timeout and a retry that takes another 1-4 seconds, the user waits 6-9 seconds in total, and that is the case where the first retry succeeds. If the retry is also slow or fails, the wait is longer.

Budget-based routing is proactive: before sending the request, the gateway checks whether the target endpoint's current latency distribution is consistent with delivering a response within the declared budget. If the p95 on that endpoint is already at 4.2 seconds, and the budget is 3 seconds p95, the gateway routes the request to the next endpoint in the priority list — one whose current p95 is within budget — without the user ever waiting on the primary endpoint.

The implementation requires that the gateway maintain a rolling latency window per endpoint. At Kamiwaza, this is a 60-second exponentially weighted moving average of TTFT and total completion time per endpoint, updated on every response. The routing decision uses these live metrics, not static configuration values.
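A per-endpoint rolling window can be sketched roughly as follows. This is a minimal illustration, not Kamiwaza's implementation: it keeps raw samples in a sliding 60-second window and computes a nearest-rank p95 rather than the exponentially weighted average described above. The names `RollingLatencyWindow` and `within_budget` are hypothetical, and treating an endpoint with no recent samples as within budget is an assumption.

```python
import time
from collections import deque


class RollingLatencyWindow:
    """Sliding time window of latency samples for one endpoint (hypothetical)."""

    def __init__(self, window_s=60.0, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock          # injectable so the window can be tested
        self.samples = deque()      # (timestamp, latency_ms), oldest first

    def record(self, latency_ms):
        """Call once per response from the endpoint."""
        now = self.clock()
        self.samples.append((now, latency_ms))
        self._evict(now)

    def _evict(self, now):
        # Drop samples that have aged out of the window.
        while self.samples and now - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def p95(self):
        """Nearest-rank p95 of the samples in the window, or None if empty."""
        self._evict(self.clock())
        if not self.samples:
            return None
        ordered = sorted(latency for _, latency in self.samples)
        rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
        return ordered[rank - 1]


def within_budget(window, budget_ms):
    """True if the endpoint's current p95 is within the declared budget."""
    p95 = window.p95()
    # Assumption: an endpoint with no recent data is given the benefit of the
    # doubt, otherwise a cold endpoint would never receive traffic.
    return p95 is None or p95 <= budget_ms
```

Sorting on every read is acceptable for a sketch. A gateway handling high request rates would more likely use a streaming quantile estimator so the routing decision stays cheap.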

How the latency budget is declared

A routing policy with latency budget enforcement looks like this:

routing_policy:
  name: interactive-chat-latency-aware
  latency_budget:
    ttft_p95_ms: 1500      # Time to first token, 95th percentile
    total_p95_ms: 4000     # Full completion, 95th percentile
    evaluation_window_s: 60  # Rolling window for endpoint health

  endpoints:
    - id: anthropic-claude-3-haiku
      priority: 1
      within_budget_check: ttft_p95_ms
    - id: openai-gpt4o-mini
      priority: 2
      within_budget_check: ttft_p95_ms
    - id: bedrock-llama-3-8b
      priority: 3
      within_budget_check: ttft_p95_ms

  fallback_behavior: next-within-budget
  no_endpoints_within_budget: use-fastest-available

When a request arrives, the gateway evaluates the endpoints in priority order. It checks whether each endpoint's current rolling p95 TTFT is within the declared 1,500ms budget. The first endpoint whose current metrics are within budget receives the request. If no endpoint is currently within budget — a scenario that happens during widespread provider degradation events — the no_endpoints_within_budget: use-fastest-available directive falls back to the endpoint with the lowest current p95, even if it exceeds the budget, rather than returning an error.

A concrete scenario: the ed-tech platform under mid-day load

An ed-tech platform serving cost-sensitive student users processes roughly 2M monthly inferences — primarily tutoring chat, essay feedback, and quiz generation. Their interactive chat feature has a 3-second p95 total latency SLA, which their internal testing found to be the threshold below which student satisfaction scores don't degrade.

Their primary endpoint is Claude 3 Haiku (best quality-per-cost for their tutoring task type). During a mid-day load event in late 2024, Anthropic's API experienced elevated latency: the platform's rolling p95 at the gateway level showed 3.8 seconds and climbing. Without latency-budget routing, every new request would still go to Claude 3 Haiku, and with p95 at 3.8 seconds the 3-second SLA would be breached. With budget routing configured at 2.8 seconds p95 TTFT, the gateway automatically started routing to GPT-4o-mini (their priority-2 endpoint, whose current p95 was 1.1 seconds) within the next 60-second evaluation window.

The cost difference: Claude 3 Haiku at ~$0.25/M input is cheaper than GPT-4o-mini at the same quality level for their task type. But the fallback period was roughly 40 minutes at ~30% of their traffic volume. The cost increment from routing 30% of traffic to GPT-4o-mini for 40 minutes at their volume was approximately $8. The alternative was hundreds of SLA breaches during a period when students were actively using the platform.

Request hedging: the more aggressive variant

Latency-budget routing with ordered fallback is proactive but still sequential: try endpoint 1, if its metrics are bad, try endpoint 2. Request hedging goes further: send the request to both endpoints simultaneously and use whichever response arrives first, canceling the other.

Hedging returns whichever endpoint answers first, at the cost of doubled API spend on hedged requests. It's not appropriate for all traffic. But for the subset of requests where latency is the dominant concern (real-time suggestions, interactive completions with a 500ms user-visible budget), hedging against two cheap endpoints (GPT-4o-mini + Claude 3 Haiku) costs roughly $0.90/M blended instead of $0.45/M and removes most of the tail latency risk. It cannot remove all of it: if both providers are slow at the same time, the faster response is still slow.

routing_policy:
  name: realtime-suggestion-hedged
  mode: hedge
  hedge_endpoints:
    - openai-gpt4o-mini
    - anthropic-claude-3-haiku
  hedge_trigger:
    # Start second request after first endpoint hasn't responded in Xms
    delay_ms: 400
  cancel_loser: true  # Cancel the slower response to avoid token waste

The delay_ms: 400 parameter is the key lever. With a 400ms delay, the hedge request is sent only if the primary hasn't responded within 400ms, which avoids doubling costs on requests that would have resolved quickly anyway. The hedge fires only for the tail of requests that are already showing early signs of slowness.

We're not saying request hedging is a cost-free optimization. At scale, even with a 400ms trigger delay, a non-trivial fraction of requests will fire both legs. Platform teams should measure their hedge rate (what percentage of requests actually trigger the second leg) and treat it as a cost vs. latency tradeoff dial — not as a feature you enable and forget.
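The delayed-hedge behavior can be sketched with asyncio. This is a minimal illustration under stated assumptions, not the gateway's implementation: `hedged_call` and `fake_endpoint` are hypothetical names, and the provider calls are simulated with sleeps. The function returns a `hedged` flag so the caller can count how often the second leg fires, which is the hedge rate discussed above.

```python
import asyncio


async def hedged_call(primary, secondary, delay_s):
    """Call primary(); if it has not finished after delay_s, also call secondary().

    primary and secondary are zero-argument coroutine functions (hypothetical
    stand-ins for provider calls). Returns (winner, result, hedged).
    """
    first = asyncio.create_task(primary())
    done, _ = await asyncio.wait({first}, timeout=delay_s)
    if done:
        # Primary answered inside the delay: no second leg, no extra spend.
        return "primary", first.result(), False

    # Primary is slow: fire the hedge and take whichever finishes first.
    second = asyncio.create_task(secondary())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # corresponds to cancel_loser: true
    await asyncio.gather(*pending, return_exceptions=True)

    winner = first if first in done else second
    return ("primary" if winner is first else "secondary"), winner.result(), True


def fake_endpoint(name, latency_s):
    """Build a coroutine function that sleeps, then returns its name."""
    async def call():
        await asyncio.sleep(latency_s)
        return name
    return call


async def demo():
    # Fast primary: the hedge never fires.
    quick = await hedged_call(fake_endpoint("a", 0.01), fake_endpoint("b", 0.01), 0.2)
    # Slow primary: the hedge fires after 50ms and wins.
    slow = await hedged_call(fake_endpoint("a", 0.5), fake_endpoint("b", 0.01), 0.05)
    return quick, slow


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Two caveats apply to this sketch. If the first leg to finish raised an error, `winner.result()` re-raises it, where a production version would probably wait for the other leg instead. And cancelling the local task only stops the client from waiting: whether the provider stops generating (and billing) tokens depends on the provider.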

The rolling metric problem: what your latency window is actually measuring

The accuracy of latency-budget routing depends entirely on whether your rolling latency metrics reflect current endpoint conditions. This is harder than it sounds for three reasons:

Sample size in low-traffic windows: If your platform processes 10 requests per minute and your evaluation window is 60 seconds, your p95 estimate is based on about 10 samples. With that few samples, the p95 is effectively the slowest single request in the window, so one outlier can flip the routing decision. For low-traffic platforms, the window needs to be longer (300-600 seconds) or the fallback trigger should use a less extreme percentile (p85 or p90) that's more robustly estimated from small samples.

Latency confounding by request type: A 200-token summarization request and a 2,000-token analysis request have very different expected latency profiles. If your rolling window mixes both types without stratification, the p95 estimate is meaningless for either type individually. Routing policies for latency-sensitive workloads should track per-endpoint latency stratified by approximate token count range, not just globally.

Correlated failures across providers: During major cloud provider incidents, multiple managed API endpoints may degrade simultaneously. The routing layer's fallback chain assumes at least one endpoint is healthy at any given time. If all endpoints are degraded, the use-fastest-available directive handles the graceful degradation — but the routing layer should also surface a clear signal to the platform's observability stack so engineers know the degradation is external, not a gateway configuration problem.
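The first two problems can be handled in the metrics layer. The sketch below is one way to do it, under assumptions that are not part of the original design: the token bucket boundaries, the `min_samples` threshold of 20, and the names `StratifiedLatency`, `token_bucket`, and `nearest_rank` are all hypothetical. It keeps latency samples per (endpoint, token bucket) and declines to report a p95 when the bucket has too few samples to trust.

```python
from collections import defaultdict


def nearest_rank(samples, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer 1-100."""
    ordered = sorted(samples)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceil(pct/100 * n)
    return ordered[rank - 1]


def token_bucket(prompt_tokens):
    """Map a request to a coarse size class (boundaries are illustrative)."""
    if prompt_tokens < 500:
        return "short"
    if prompt_tokens < 1500:
        return "medium"
    return "long"


class StratifiedLatency:
    """Latency samples keyed by (endpoint id, token bucket)."""

    def __init__(self, min_samples=20):
        self.min_samples = min_samples
        self.samples = defaultdict(list)

    def record(self, endpoint_id, prompt_tokens, latency_ms):
        key = (endpoint_id, token_bucket(prompt_tokens))
        self.samples[key].append(latency_ms)

    def p95(self, endpoint_id, prompt_tokens):
        """p95 for requests of this size class, or None if the data is too thin."""
        key = (endpoint_id, token_bucket(prompt_tokens))
        data = self.samples.get(key, [])
        if len(data) < self.min_samples:
            return None  # caller should widen the window or use p90 instead
        return nearest_rank(data, 95)
```

With 10 samples, `nearest_rank(samples, 95)` returns the largest value in the list, which is the small-sample problem in concrete form. For brevity this sketch never expires old samples, so a real tracker would combine it with a time window.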

Integrating latency budgets with cost caps

Luke Norris designed Kamiwaza's routing engine to evaluate latency budget and cost cap as parallel constraints rather than sequential filters, drawing on experience with distributed compute architecture, where the challenge is always balancing resource utilization against responsiveness. A request is eligible for an endpoint only if both constraints are satisfied: the endpoint is within the latency budget and routing to it won't breach the tenant's daily cost cap.

The interaction between these two constraints creates a natural optimization pressure: expensive, fast endpoints (like Claude 3.5 Sonnet) are available as latency fallbacks, but the cost cap prevents them from becoming the default when cheaper options are within budget. This means the routing policy doesn't need to hardcode a preference — it emerges from the constraint interaction.

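def select_endpoint(request, tenant, endpoints, latency_metrics):
    # `endpoints` is expected in priority order (highest priority first).
    budget_ms = tenant.routing_policy.ttft_p95_ms
    cost_remaining = tenant.daily_cost_cap - tenant.today_cost_consumed
    max_request_cost = cost_remaining * 0.01  # <1% of remaining budget per request

    def p95_ttft(ep):
        return latency_metrics[ep.id].p95_ttft_ms

    # Apply the cost cap first so the latency fallback stays within cost when it can.
    affordable = [ep for ep in endpoints if ep.estimated_cost(request) <= max_request_cost]

    # List order is preserved, so the first match is the highest-priority
    # endpoint that meets both constraints.
    for ep in affordable:
        if p95_ttft(ep) <= budget_ms:
            return ep

    # No endpoint satisfies both constraints: fall back to the fastest within cost.
    # If nothing is affordable either, use the fastest endpoint overall
    # (mirrors use-fastest-available; raise instead if the cap must be hard).
    return min(affordable or endpoints, key=p95_ttft)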

This logic runs in the gateway's hot path and adds roughly 2-5ms of overhead per routing decision, in line with the gateway's 4ms median routing overhead target. Platform teams declare intent (latency budget, cost cap, endpoint priority) and the gateway resolves the actual routing decision in real time, so the team doesn't need to anticipate every possible endpoint degradation scenario in their policy configuration.