Private LLM deployment
Route PII and regulated data exclusively to on-prem Llama. Never touch a third-party API.
How it works
Kamiwaza sits between your application and your inference endpoints. When a request carries X-Data-Class: pii-restricted, the routing engine evaluates your policy and sends the request to the on-prem GPU cluster via PrivateLink — not to any managed API. Non-PII requests can use cheaper managed endpoints.
# Private LLM routing policy
version: v1
endpoints:
- id: private-gpu
type: vllm
url: https://gpu.internal.acme.com/v1
transport: privatelink
models: [llama-3.1-70b-instruct]
- id: bedrock
type: bedrock
models: [meta.llama3-instruct]
rules:
# ALL PII data stays on-premises — hard guarantee
- match:
data_class: pii-restricted
route_to: private-gpu
on_endpoint_unavailable: reject # fail-safe: never fallback to cloud
- default:
route_to: bedrock
What makes this pattern work
Hard data class guard
on_endpoint_unavailable: reject means PII requests fail closed — they never fall back to a managed API if the GPU cluster is down.
PrivateLink transport
Traffic between the gateway and the on-prem cluster travels over PrivateLink or VPC peering. No public internet path exists for PII data.
Audit trail
Every PII-tagged request generates an audit record showing which rule matched, which endpoint was used, and that the PII guard was enforced.
Mixed-class routing
Non-PII traffic routes to managed endpoints as normal. One gateway, two routing tracks — no code changes in the application layer.