Use case

Federated inference

Run inference on your private GPU cluster and burst to Bedrock when p95 latency exceeds your declared budget. Failover is automatic at the gateway level — your application code is unchanged.

How it works

Latency-budget routing with auto-failover

Kamiwaza monitors real-time p95 latency per endpoint. When a private GPU cluster starts exceeding your declared budget, the gateway automatically shifts traffic to a managed endpoint until the cluster recovers — without any code change in your application.

# Federated inference: latency-budget routing
version: v1
endpoints:
  - id: private-gpu
    type: vllm
    url: https://gpu.internal.acme.com/v1
    health_check_interval_ms: 5000
  - id: bedrock-burst
    type: bedrock
    role: burst-overflow
rules:
  # Try private GPU first, failover to Bedrock if p95 > 800ms
  - match:
      endpoint_p95_ms:
        private-gpu: <800
    route_to: private-gpu
  - default:
    route_to: bedrock-burst
    alert: latency-budget-exceeded
Benefits

Why federated inference

Private GPU utilization — use your on-prem capacity to the limit before spending on managed endpoints. Bedrock is the overflow valve, not the default.

Transparent failover — your application code doesn't know or care which endpoint serves the request. Routing decisions are invisible at the API level.

Alert before SLA breach — the alert: latency-budget-exceeded field fires a webhook when the gateway falls back to burst. You know before users notice.