Use case
Federated inference
Run inference on your private GPU cluster and burst to Bedrock when p95 latency exceeds your declared budget. Failover happens automatically at the gateway, so your application code stays unchanged.
How it works
Latency-budget routing with auto-failover
Kamiwaza monitors real-time p95 latency per endpoint. When the private GPU cluster's p95 exceeds your declared budget, the gateway shifts traffic to a managed endpoint such as Bedrock until the cluster recovers, with no code change in your application.
# Federated inference: latency-budget routing
version: v1
endpoints:
  - id: private-gpu
    type: vllm
    url: https://gpu.internal.acme.com/v1
    health_check_interval_ms: 5000
  - id: bedrock-burst
    type: bedrock
    role: burst-overflow
rules:
  # Try private GPU first, failover to Bedrock if p95 > 800ms
  - match:
      endpoint_p95_ms:
        private-gpu: <800
    route_to: private-gpu
  - default:
    route_to: bedrock-burst
    alert: latency-budget-exceeded
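To make the routing rule concrete, here is a minimal Python sketch of the decision the config describes: keep a sliding window of recent latencies for the private cluster, compute p95, and route to the burst endpoint when it reaches the budget. This is an illustration only, not Kamiwaza's implementation. The class name `P95Router`, the window size, the nearest-rank percentile method, and the choice to prefer the private cluster when no samples exist are all assumptions.

```python
from collections import deque


class P95Router:
    """Hypothetical latency-budget router (illustrative sketch only)."""

    def __init__(self, budget_ms=800, window=100):
        self.budget_ms = budget_ms
        # Sliding window: only the most recent `window` samples are kept.
        self.samples = deque(maxlen=window)

    def record(self, latency_ms):
        """Record one observed latency for the private-gpu endpoint."""
        self.samples.append(latency_ms)

    def p95(self):
        """Nearest-rank 95th percentile of the current window."""
        if not self.samples:
            return 0  # assumption: no data yet means "within budget"
        ordered = sorted(self.samples)
        # ceil(0.95 * n) using integer arithmetic
        rank = (95 * len(ordered) + 99) // 100
        return ordered[rank - 1]

    def route(self):
        """Mirror the two rules: private-gpu if p95 < budget, else burst."""
        if self.p95() < self.budget_ms:
            return "private-gpu"
        return "bedrock-burst"


router = P95Router(budget_ms=800, window=20)
for _ in range(20):
    router.record(200)
healthy_choice = router.route()    # cluster is within budget

for _ in range(2):
    router.record(1500)            # two slow requests enter the window
degraded_choice = router.route()   # p95 now reflects the slow tail

for _ in range(20):
    router.record(200)             # slow samples age out of the window
recovered_choice = router.route()
```

A production gateway would likely add hysteresis or a cool-down so traffic does not flap between endpoints when p95 hovers near the budget; the sketch omits that for brevity.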
Benefits