The AI Model Gateway

The AI model gateway for enterprise infrastructure

One control plane for private GPU nodes (vLLM, TGI, TensorRT-LLM), managed APIs (Anthropic, OpenAI, Bedrock, Vertex AI), and internal fine-tunes. Declarative routing policies. Full per-request audit trail.

Read the Docs Get Early Access

Architecture

Request lifecycle

Every inbound request hits a single /v1/route endpoint. Kamiwaza extracts tenant identity, data class, and request metadata, then evaluates the routing policy tree. The winning rule determines which endpoint receives the forwarded request.

Policy evaluation engine

Rules are evaluated in declaration order. Each rule is a match/action pair: match on tenant ID, data class, user role, latency budget, or cost cap; route to the named endpoint. First match wins.

Audit log pipeline

Every request generates a structured audit record: timestamp, tenant, data class, matched policy, selected endpoint, latency, cost attribution. Streamed to your configured storage (S3, GCS, or Kamiwaza hosted).

Supported endpoints

Supported endpoint types

Endpoint type	Connection	Auth	Data stays in your VPC
Private GPU (self-hosted)	VPC Peering / PrivateLink	Bearer token	Yes
Anthropic Claude	HTTPS / public	API key	No (Anthropic's infra)
OpenAI GPT-4/3.5	HTTPS / public	API key	No (OpenAI's infra)
AWS Bedrock	VPC Endpoint / HTTPS	IAM / STS	Yes (with VPC endpoint)
GCP Vertex AI	Private Service Connect	Service Account	Yes (with PSC)
Azure OpenAI	Private Endpoint / HTTPS	Azure AD / API key	Yes (with Private Endpoint)

Kamiwaza is not a model provider. We route to your existing keys — you retain your existing agreements with each provider. Kamiwaza never touches your provider billing.

Policy language

Routing policy language

YAML-first policy language. Express tenant isolation, data class guards, latency budgets, and cost caps as declarative rules. Store them in git, review them in PRs, audit them in logs.

# Full routing policy example
version: v1
rules:
  # PII data always stays on-prem
  - match:
      data_class: pii-restricted
    route_to: private-gpu
  # Tenant-specific model allowlist
  - match:
      tenant: enterprise-acme
      latency_budget_ms: "<400"
    route_to: anthropic
  # Cost-capped fallback
  - match:
      cost_cap_per_1k: "<0.002"
    route_to: bedrock
  - default:
    route_to: anthropic

Observability

Observability built in

Per-request latency

p50, p95, p99 latency by endpoint. Identify which model is lagging and trigger failover rules.

Cost attribution by tenant

Token costs mapped to tenant IDs. Know exactly which customer is spending on which model.

Model usage breakdown

Request volume and token usage by endpoint, model, and policy rule. Drive capacity planning.

Policy evaluation audit log

Every routing decision logged: matched rule, evaluated conditions, selected endpoint. Full compliance trail.

Read the full architecture docs.

Read the Docs Get Early Access