The economics of model routing: when private GPU beats managed API

Luke Norris, November 14, 2024

Figure: Cost comparison chart for private GPU vs managed API model routing

The question "should we host our own model or call a managed API?" used to be answered with a hand-wave toward complexity. Managed APIs won by default because the operational burden of private GPU clusters seemed too steep for early-stage AI products. That calculus has shifted. In late 2024, the combination of falling GPU spot prices, maturing inference runtimes like vLLM and TensorRT-LLM, and increasingly expensive frontier API tiers makes the break-even volume lower than most platform teams expect — sometimes by an order of magnitude.

This post walks through a real cost model. No vendor marketing, no simplified "GPU costs $X/hour" hand-waving. We'll use publicly available numbers for Anthropic Claude 3 Haiku, GPT-4o-mini, AWS Bedrock Llama 3 Instruct, and a self-hosted Llama 3.1 70B on a leased A100 80 GB instance. The numbers change monthly; what doesn't change is the framework.

The unit that matters: cost per 1M tokens

The industry settled on cost-per-million-tokens (CPM) as the canonical comparison unit because cost scales linearly with token count and every provider quotes prices this way. Token counts are only roughly comparable across providers, since each model family uses its own tokenizer. Here's the public pricing baseline as of late 2024:

Anthropic Claude 3 Haiku: ~$0.25/M input tokens, ~$1.25/M output tokens
OpenAI GPT-4o-mini: ~$0.15/M input, ~$0.60/M output
AWS Bedrock — Llama 3 8B Instruct: ~$0.30/M input, ~$0.60/M output (on-demand)
AWS Bedrock — Llama 3 70B Instruct: ~$0.70/M input, ~$0.99/M output (on-demand)
Anthropic Claude 3.5 Sonnet (mid-tier): ~$3.00/M input, ~$15.00/M output
OpenAI GPT-4o: ~$2.50/M input, ~$10.00/M output

Self-hosted Llama 3.1 70B on an A100 80 GB (via Lambda Labs or an equivalent spot market at roughly $1.50-$2.00/GPU-hour) produces approximately 1,800-2,200 tokens/second with KV cache prefix sharing enabled in vLLM. Note that the fp16 weights alone come to about 140 GB, so fitting the model on a single 80 GB card implies a quantized checkpoint; unquantized fp16 needs at least two cards. At 2,000 tokens/second sustained, one GPU-hour yields 7.2M tokens, so with a blended input/output ratio of 1:2 (typical for summarization and generation workloads) the practical throughput cost is roughly $0.21-$0.28 per 1M tokens at full utilization, plus ~$0.005/M for inference runtime overhead and storage. Halve the utilization and the per-token cost doubles.
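The conversion from a GPU-hour price to a per-token cost is simple enough to check by hand. Here is a minimal sketch using the throughput and lease prices quoted above. The function name and the utilization parameter are illustrative choices, not part of any vendor tooling.

```python
def cost_per_million_tokens(gpu_hourly_usd: float,
                            tokens_per_second: float,
                            utilization: float = 1.0) -> float:
    """Lease cost divided by tokens actually produced, in USD per 1M tokens."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Tokens produced in one leased hour at the given utilization.
    tokens_per_hour = tokens_per_second * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000


# Sweep the quoted lease range at full and half utilization.
for hourly in (1.50, 2.00):
    for util in (1.0, 0.5):
        cpm = cost_per_million_tokens(hourly, 2000, util)
        print(f"${hourly:.2f}/GPU-hour at {util:.0%} utilization: "
              f"${cpm:.3f} per 1M tokens")
```

Utilization is the input to watch: the lease is billed whether or not the GPU is serving requests, so idle hours raise the effective price of every token that does get served.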

The break-even volume: lower than you expect

The common assumption is that self-hosted GPU infrastructure only pencils out at "massive scale." The math doesn't support this once you account for the true comparison: not just GPU hours but total operational costs including gateway overhead, monitoring, and idle capacity.

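Consider an AI-native document processing platform running roughly 8M tokens per day across summarization, extraction, and classification tasks. At GPT-4o-mini pricing (~$0.15/M input, ~$0.60/M output, blended ~$0.33/M for a 60/40 in/out split), that's about $2.64/day, or roughly $80/month in API costs.

A 2× A100 cluster leased at $3.20/hour combined runs ~$2,300/month whether it is busy or idle. Utilization changes the cost per token, not the invoice: the same lease spread over half the tokens doubles the effective per-token price. Adding $800/month for ops tooling, monitoring, and the gateway (Kamiwaza's Team tier is $899/month) brings the self-hosted path to roughly $3,100/month, so at 8M tokens/day the managed API is far cheaper.

The break-even volume against GPT-4o-mini at this cluster size is approximately 315M tokens/day ($3,100/month is about $105/day, divided by $0.33/M). At 2,000 tokens/second per GPU the cluster tops out near 345M tokens/day, so it has to stay more than 90% busy to beat the cheapest managed tier. The threshold drops sharply against pricier endpoints: for traffic currently sent to GPT-4o (blended ~$5.50/M at the same split) that a 70B model can absorb, break-even is about 19M tokens/day. Teams focused on classification, extraction, or repetitive summarization cross these thresholds when they can keep a small cluster busy, not when they reach some notional massive scale.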
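The break-even arithmetic is worth running yourself, since it moves with every price change. Here is a minimal sketch using the cluster lease, ops overhead, and list prices quoted above. The 730-hour and 30-day month and the function names are assumptions made for illustration.

```python
HOURS_PER_MONTH = 730   # assumption: average hours in a month
DAYS_PER_MONTH = 30     # assumption: used to convert monthly cost to daily


def blended_price(input_usd_per_m: float, output_usd_per_m: float,
                  input_share: float) -> float:
    """Weighted average price per 1M tokens for a given input/output split."""
    return input_share * input_usd_per_m + (1 - input_share) * output_usd_per_m


def break_even_tokens_per_day(cluster_hourly_usd: float,
                              fixed_monthly_usd: float,
                              api_usd_per_m: float) -> float:
    """Daily token volume at which the API bill equals the self-hosted bill."""
    monthly_cost = cluster_hourly_usd * HOURS_PER_MONTH + fixed_monthly_usd
    return monthly_cost / DAYS_PER_MONTH / api_usd_per_m * 1_000_000


# 60/40 input/output split, prices from the baseline list above.
endpoints = {
    "gpt-4o-mini": blended_price(0.15, 0.60, 0.6),
    "gpt-4o": blended_price(2.50, 10.00, 0.6),
}

for name, price in endpoints.items():
    volume = break_even_tokens_per_day(3.20, 800, price)
    print(f"{name}: blended ${price:.2f}/M, "
          f"break-even {volume / 1e6:.0f}M tokens/day")
```

The result depends almost entirely on which managed tier the private cluster displaces, which is why a single break-even number for a whole platform tends to mislead.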

We're not saying managed APIs are expensive for what they offer. We're saying that routing all traffic through managed APIs when a private GPU cluster would handle 70% of volume at a fraction of the cost is a failure mode in platform economics — one that a routing layer is specifically designed to correct.

Where managed APIs still win

The break-even analysis above has important boundary conditions. Private GPU infrastructure wins on repetitive, high-volume, latency-tolerant workloads with predictable token distributions. It loses — sometimes badly — in several scenarios:

Burst traffic without hedging. If your workload spikes 10x during product launches or batch jobs, you need either over-provisioned cluster capacity (expensive idle compute) or a managed API fallback. Request hedging — sending a parallel request to a managed endpoint when the private cluster queue depth exceeds a threshold — recovers the cost advantage without sacrificing latency SLAs.

Workloads that need frontier model capability. Llama 3.1 70B is excellent for classification, extraction, moderate-complexity reasoning, and code generation in bounded domains. It is not Claude 3.5 Sonnet or GPT-4o for open-ended reasoning, complex multi-step analysis, or tasks requiring broad world knowledge. If your use case genuinely requires a frontier model, the quality premium is the cost.

Early product stage. At fewer than 500K tokens/day, cluster idle costs dominate. A growing platform with unpredictable traffic should start on managed APIs and migrate specific workloads to private GPU once traffic patterns are stable enough to size the cluster accurately.
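The hedging idea from the burst-traffic scenario can be sketched in a few lines of asyncio. This is an illustration only: the queue-depth threshold, the function names, and the two stand-in coroutines are hypothetical, and a real gateway would also need error handling and cost accounting for the duplicated request.

```python
import asyncio


async def hedged_call(primary, fallback, queue_depth: int, threshold: int):
    """Call the primary endpoint; if its queue is deep, race the fallback too."""
    tasks = [asyncio.create_task(primary())]
    if queue_depth > threshold:
        # Hedge: send a parallel request to the managed endpoint.
        tasks.append(asyncio.create_task(fallback()))
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # drop the slower request once a result is in
    return done.pop().result()


# Stand-in endpoints: the private cluster is slow because it is saturated.
async def slow_private():
    await asyncio.sleep(0.2)
    return "private-gpu"


async def fast_managed():
    await asyncio.sleep(0.01)
    return "managed-api"


print(asyncio.run(hedged_call(slow_private, fast_managed, queue_depth=50, threshold=32)))
print(asyncio.run(hedged_call(slow_private, fast_managed, queue_depth=4, threshold=32)))
```

Hedging only above a queue-depth threshold keeps the managed API bill proportional to the overflow rather than to total traffic.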

Batch inference vs. interactive: a critical distinction

The economics above assume interactive inference: synchronous request/response, typically with sub-5-second latency requirements. Batch inference (document processing pipelines, nightly classification jobs, embedding generation for vector search) has materially different economics.

OpenAI Batch API pricing is 50% off standard rates. Anthropic's batch API offers similar discounts. For a workload that can tolerate 24-hour turnaround, batch APIs close the gap substantially against self-hosted infrastructure, particularly when your cluster would otherwise sit idle overnight. The hybrid strategy — interactive workloads on private GPU, batch jobs on managed Batch API — can be optimal for platforms processing both real-time requests and nightly data pipelines.

vLLM's continuous batching engine complicates this further: it dynamically groups concurrent interactive requests into efficient batches, which means high-concurrency scenarios on self-hosted infrastructure benefit from similar throughput efficiency to explicit batch APIs. The key variable is request concurrency, not just daily volume.

The multi-model routing opportunity

Once you accept that no single endpoint is optimal for all workloads, the question becomes: how do you route efficiently at the request level without rewriting your application for each provider? This is the core problem Kamiwaza solves.

A typical enterprise platform has at least three natural tiers:

Tier 1: Commodity workloads (classification, entity extraction, embedding generation). Route to private Llama 3.1 70B or 8B. For the 70B model this costs roughly $0.20-$0.30/M tokens when the cluster is kept busy (2,000 tokens/second at $1.50-$2.00 per GPU-hour), and less for 8B. High volume, tolerant of P99 latency up to 8-10 seconds.
Tier 2: General generation (chat, summarization, structured output). Route to Claude 3 Haiku or GPT-4o-mini at $0.15-$1.25/M. Moderate volume, latency budget 2-5 seconds.
Tier 3: Complex reasoning (multi-step analysis, long-context synthesis, code review). Route to Claude 3.5 Sonnet or GPT-4o at $2.50-$15.00/M. Low volume, latency budget 10-30 seconds acceptable.

A YAML routing policy that implements this tiering looks like:

routing_policy:
  name: cost-tiered-routing
  rules:
    - name: commodity-to-private-gpu
      match:
        task_type: [classification, extraction, embedding]
      route_to: private-gpu-llama-70b
      fallback: bedrock-llama-3-70b
    - name: general-to-haiku
      match:
        task_type: [summarization, chat, structured-output]
      route_to: anthropic-claude-haiku
      fallback: openai-gpt4o-mini
    - name: reasoning-to-frontier
      match:
        task_type: [analysis, code-review, long-context]
      route_to: anthropic-claude-35-sonnet
      cost_cap_usd_per_request: 0.05

The platform team doesn't rewrite anything. Their application sends all requests to the Kamiwaza gateway endpoint with an X-Task-Type header, and the gateway handles model selection, the fallback chain, and cost attribution per tenant.
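To make the matching behavior concrete, here is a minimal sketch of first-match rule selection over the policy above. It is not the gateway's actual implementation. The `Route` type, the `select_route` function, and the choice to return `None` for an unmatched task type are all assumptions for illustration.

```python
from typing import NamedTuple, Optional


class Route(NamedTuple):
    rule: str
    route_to: str
    fallback: Optional[str]


# Mirrors the YAML policy: (rule name, task types, primary target, fallback).
RULES = [
    ("commodity-to-private-gpu", {"classification", "extraction", "embedding"},
     "private-gpu-llama-70b", "bedrock-llama-3-70b"),
    ("general-to-haiku", {"summarization", "chat", "structured-output"},
     "anthropic-claude-haiku", "openai-gpt4o-mini"),
    ("reasoning-to-frontier", {"analysis", "code-review", "long-context"},
     "anthropic-claude-35-sonnet", None),
]


def select_route(task_type: str) -> Optional[Route]:
    """Return the first rule whose task types include the request's task type."""
    for name, task_types, route_to, fallback in RULES:
        if task_type in task_types:
            return Route(name, route_to, fallback)
    return None  # hypothetical choice: the caller decides how to handle a miss


# The value would come from the X-Task-Type header on the incoming request.
print(select_route("classification"))
print(select_route("code-review"))
```

Evaluating rules in order keeps the policy easy to reason about: the first matching rule wins, so a more specific rule only needs to appear earlier in the list.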

What the cost model misses

Any cost model that only counts token prices misses several real factors. Rate limits (RPM/TPM) on managed APIs impose invisible queuing costs when your workload exceeds tier limits — GPT-4o on the Tier 1 API plan has a 500 RPM limit, and Claude API limits vary by usage tier. Hitting these limits causes latency spikes that violate your application's SLAs even though the token price looks competitive.

KV cache prefix sharing in vLLM provides a material throughput boost for workloads with shared prefixes (system prompts, document templates, few-shot examples). If 60% of your tokens are repeated system prompt, effective throughput increases substantially — the private GPU economics above assume this optimization is enabled, which not all operators configure correctly.

Finally, model pinning matters. gpt-4o is not a stable model identifier; OpenAI's API has aliased that name across multiple underlying model versions. Production deployments should pin to explicit version strings (gpt-4o-2024-11-20) and test behavior on version bumps before allowing automatic alias upgrades. This is less a cost concern than an evals concern — unexpected quality changes on unnoticed model updates are a real operational risk.
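A cheap guard against unpinned aliases is a configuration check that rejects model identifiers without a date suffix. The sketch below assumes two suffix conventions: the hyphenated date in gpt-4o-2024-11-20 and the compact eight-digit date Anthropic uses in identifiers such as claude-3-5-sonnet-20241022. The function name is hypothetical, and other providers may version their models differently.

```python
import re

# Assumed conventions: "-YYYY-MM-DD" or "-YYYYMMDD" at the end of the identifier.
DATED_SUFFIX = re.compile(r"-(\d{4}-\d{2}-\d{2}|\d{8})$")


def is_pinned(model_id: str) -> bool:
    """True if the identifier ends in an explicit date-style version suffix."""
    return bool(DATED_SUFFIX.search(model_id))


for model_id in ("gpt-4o", "gpt-4o-2024-11-20", "claude-3-5-sonnet-20241022"):
    status = "pinned" if is_pinned(model_id) else "floating alias"
    print(f"{model_id}: {status}")
```

Running a check like this in CI against the routing policy catches a floating alias before it reaches production traffic.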

The economics favor hybrid routing. The engineering complexity of implementing it correctly — per-request routing decisions, fallback chains, cost attribution, rate limit awareness — is what keeps most teams defaulting to a single provider longer than they should. A gateway that makes these decisions declaratively, without application changes, is how the break-even volume becomes actionable rather than theoretical.