Use case

Private LLM deployment

Route PII and regulated data exclusively to on-prem Llama. Never touch a third-party API.

Architecture

How it works

Kamiwaza sits between your application and your inference endpoints. When a request carries X-Data-Class: pii-restricted, the routing engine evaluates your policy and sends the request to the on-prem GPU cluster via PrivateLink — not to any managed API. Non-PII requests can use cheaper managed endpoints.

# Private LLM routing policy
version: v1
endpoints:
  - id: private-gpu
    type: vllm
    url: https://gpu.internal.acme.com/v1
    transport: privatelink
    models: [llama-3.1-70b-instruct]
  - id: bedrock
    type: bedrock
    models: [meta.llama3-instruct]
rules:
  # ALL PII data stays on-premises — hard guarantee
  - match:
      data_class: pii-restricted
    route_to: private-gpu
    on_endpoint_unavailable: reject  # fail-safe: never fallback to cloud
  - default:
    route_to: bedrock
Key features

What makes this pattern work

Hard data class guard

on_endpoint_unavailable: reject means PII requests fail closed — they never fall back to a managed API if the GPU cluster is down.

PrivateLink transport

Traffic between the gateway and the on-prem cluster travels over PrivateLink or VPC peering. No public internet path exists for PII data.

Audit trail

Every PII-tagged request generates an audit record showing which rule matched, which endpoint was used, and that the PII guard was enforced.

Mixed-class routing

Non-PII traffic routes to managed endpoints as normal. One gateway, two routing tracks — no code changes in the application layer.