Private GPU Setup

Connect your on-prem or cloud-hosted GPU cluster to Kamiwaza. Traffic routes over PrivateLink or VPC peering — no public internet path for sensitive workloads.

Network diagram showing Kamiwaza gateway connecting to private GPU cluster via PrivateLink with no public internet path

Prerequisites

  • A GPU server (A100, H100, or equivalent) running vLLM 0.4+ or Ollama 0.1.30+
  • Network connectivity between your GPU cluster and Kamiwaza's gateway IPs (via PrivateLink, VPC peering, or a site-to-site VPN)
  • A Kamiwaza API key with endpoints:write permission
On-prem (bare metal)? If your GPU cluster is not on AWS/GCP/Azure, use a site-to-site VPN or SD-WAN to bridge your data center to Kamiwaza's gateway. Contact [email protected] for a network access token.

Step 1 — Start vLLM

Start vLLM on your GPU server with the model you want to serve. Kamiwaza expects an OpenAI-compatible /v1/chat/completions endpoint:

shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-70B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 4

Verify vLLM is responsive:

curl http://localhost:8000/v1/models
# {"data":[{"id":"meta-llama/Llama-3.1-70B-Instruct","object":"model",...}]}

Step 2b — VPC peering (alternative)

VPC peering is simpler to set up but requires non-overlapping CIDR ranges between your VPC and Kamiwaza's.

shell
kmw network peer \
  --cloud aws \
  --vpc-id vpc-XXXXXXXXXXXX \
  --region us-east-1

This command creates a peering request from Kamiwaza's gateway VPC to yours. Accept it in your AWS Console and add a route table entry pointing Kamiwaza's CIDR (10.96.0.0/16) to the peering connection.

Step 3 — Register the endpoint

Once the network path is established, register the private endpoint in Kamiwaza:

shell
kmw endpoints add \
  --id private-gpu \
  --type vllm \
  --url https://vpce-xxxx.us-east-1.vpce.amazonaws.com:8000/v1 \
  --transport privatelink \
  --models llama-3.1-70b-instruct

Or via YAML in your policy file:

yaml
endpoints:
  - id: private-gpu
    type: vllm
    url: https://vpce-xxxx.us-east-1.vpce.amazonaws.com:8000/v1
    transport: privatelink
    models: [llama-3.1-70b-instruct]
    health_check_interval_ms: 10000

Health checks

Kamiwaza polls the registered endpoint's /health path at the configured interval. If three consecutive health checks fail, the endpoint is marked unhealthy and traffic is routed to the next matching rule.

shell
kmw endpoints status private-gpu
# ID            STATUS      P50_MS    P95_MS    LAST_CHECKED
# private-gpu   healthy     198       412       2025-05-28T14:30:00Z

Configure an alert to fire when an endpoint goes unhealthy:

kmw webhooks add \
  --event endpoint-unhealthy \
  --endpoint private-gpu \
  --url https://hooks.pagerduty.com/services/XXXX