The AI Model Gateway

The AI model gateway for enterprise infrastructure

One control plane for private GPU nodes (vLLM, TGI, TensorRT-LLM), managed APIs (Anthropic, OpenAI, Bedrock, Vertex AI), and internal fine-tunes. Declarative routing policies. Full per-request audit trail.

Inbound App KAMIWAZA Gateway Policy Engine Private GPU Llama 3.1 Anthropic Claude AWS Bedrock GCP Vertex AI Audit Log
Architecture

Architecture

Kamiwaza platform architecture: request lifecycle from app through gateway to private GPU, Anthropic, Bedrock, and Vertex endpoints with policy engine and audit log

Request lifecycle

Every inbound request hits a single /v1/route endpoint. Kamiwaza extracts tenant identity, data class, and request metadata, then evaluates the routing policy tree. The winning rule determines which endpoint receives the forwarded request.

Policy evaluation engine

Rules are evaluated in declaration order. Each rule is a match/action pair: match on tenant ID, data class, user role, latency budget, or cost cap; route to the named endpoint. First match wins.

Audit log pipeline

Every request generates a structured audit record: timestamp, tenant, data class, matched policy, selected endpoint, latency, cost attribution. Streamed to your configured storage (S3, GCS, or Kamiwaza hosted).

Supported endpoints

Supported endpoint types

Endpoint type Connection Auth Data stays in your VPC
Private GPU (self-hosted) VPC Peering / PrivateLink Bearer token Yes
Anthropic Claude HTTPS / public API key No (Anthropic's infra)
OpenAI GPT-4/3.5 HTTPS / public API key No (OpenAI's infra)
AWS Bedrock VPC Endpoint / HTTPS IAM / STS Yes (with VPC endpoint)
GCP Vertex AI Private Service Connect Service Account Yes (with PSC)
Azure OpenAI Private Endpoint / HTTPS Azure AD / API key Yes (with Private Endpoint)

Kamiwaza is not a model provider. We route to your existing keys — you retain your existing agreements with each provider. Kamiwaza never touches your provider billing.

Policy language

Routing policy language

YAML-first policy language. Express tenant isolation, data class guards, latency budgets, and cost caps as declarative rules. Store them in git, review them in PRs, audit them in logs.

# Full routing policy example
version: v1
rules:
  # PII data always stays on-prem
  - match:
      data_class: pii-restricted
    route_to: private-gpu
  # Tenant-specific model allowlist
  - match:
      tenant: enterprise-acme
      latency_budget_ms: "<400"
    route_to: anthropic
  # Cost-capped fallback
  - match:
      cost_cap_per_1k: "<0.002"
    route_to: bedrock
  - default:
    route_to: anthropic
Observability

Observability built in

Per-request latency

p50, p95, p99 latency by endpoint. Identify which model is lagging and trigger failover rules.

Cost attribution by tenant

Token costs mapped to tenant IDs. Know exactly which customer is spending on which model.

Model usage breakdown

Request volume and token usage by endpoint, model, and policy rule. Drive capacity planning.

Policy evaluation audit log

Every routing decision logged: matched rule, evaluated conditions, selected endpoint. Full compliance trail.

Read the full architecture docs.