
Private GPU vs. managed endpoints: a decision framework for platform teams

Luke Norris December 10, 2024 10 min read

Decision matrix comparing private GPU and managed API endpoints for enterprise AI workloads

The "private GPU vs. managed API" framing is usually presented as a binary infrastructure choice. In practice, platform teams are almost never choosing one or the other — they're deciding which workloads belong where, and how to manage the routing layer between them. The teams that get this right have answered four specific questions before committing infrastructure resources. The teams that get it wrong have usually answered a different question: "which option looks cheaper on a spreadsheet?"

This post works through that four-question framework, with enough technical detail to be useful at the infrastructure level. We'll also cover the cases where managed APIs hold a genuine structural advantage that private GPU deployments can't easily overcome.

Question 1: What does your data classification require?

Before any cost or latency calculation, data governance constrains the option space. If any request type carries PII, HIPAA-protected health data, or material non-public financial information, sending that data to a managed API may be prohibited by contract or regulation, regardless of how competitive the token pricing is.

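Provider terms differ by product tier and change over time, so check them at design time rather than assuming. Consumer-tier chat products have at times used submitted data for model improvement by default, whereas the commercial API terms from Anthropic and OpenAI state that customer data is not used for training by default. AWS Bedrock operates under the AWS Customer Agreement and likewise does not train on customer data by default. For regulated workloads, the distinctions that usually matter are retention periods, zero-retention options, and whether a data processing agreement covers the traffic.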

The practical implication: if you're building a platform that handles multiple data classes, you likely need both private GPU capacity and managed endpoints. PII-flagged requests route to private GPU infrastructure where data never leaves your VPC; non-sensitive requests can hit managed APIs. The correct architectural pattern is a routing policy that enforces this at the gateway level, rather than relying on individual application teams to remember the rule. The NIST AI Risk Management Framework (AI RMF 1.0) treats privacy and data governance as part of managing AI risk, which maps directly onto this routing decision.
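
A gateway-level policy of this kind can be sketched in a few lines. This is a minimal illustration, not any particular gateway's API: the `DataClass` values, the `RESTRICTED` set, and the pool names `private_gpu` and `managed_api` are all assumptions made for the example. The one design choice worth noting is that untagged requests fail closed to the private pool.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class DataClass(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    PII = "pii"
    PHI = "phi"    # HIPAA-protected health data
    MNPI = "mnpi"  # material non-public financial information


# Hypothetical policy: these classes must never leave the VPC.
RESTRICTED = {DataClass.PII, DataClass.PHI, DataClass.MNPI}


@dataclass(frozen=True)
class Request:
    tenant_id: str
    data_class: Optional[DataClass]  # None means the caller did not tag the request


def select_pool(request: Request) -> str:
    """Return the endpoint pool this request is allowed to use."""
    # Fail closed: an untagged request is treated as restricted.
    if request.data_class is None or request.data_class in RESTRICTED:
        return "private_gpu"
    return "managed_api"
```

Because the check lives in the gateway, an application team that forgets to tag a request gets the conservative route by default instead of leaking data to an external endpoint.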

Question 2: What's your request volume profile — flat or spiky?

Private GPU clusters are capacity-fixed. You provision a number of GPUs and they sit running whether you're using them or not. At flat, predictable throughput, this is fine — and the economics can be compelling, as we covered in our routing economics post. At spiky throughput, you're choosing between two unpleasant options: over-provision the cluster (expensive idle compute) or accept queuing latency during spikes.

Consider a B2B SaaS platform with roughly 4,000 tenant organizations. Their LLM usage is concentrated: 80% of daily token volume occurs in a 4-hour window during business hours, with batch ETL jobs generating another spike around midnight. Outside those windows, a private GPU cluster sized for peak load runs at 15-20% utilization — paying for hardware that's mostly idle.

The right architecture for this profile is hybrid: a private cluster sized for average load (not peak), with a managed API configured as the overflow target in the fallback chain. The routing policy sets a queue depth threshold on the private cluster; when depth exceeds the threshold, new requests are redirected to managed API endpoints until the queue drains. The platform captures most of the private GPU cost savings while maintaining SLA during peaks without over-provisioning.

This pattern requires a gateway that can make per-request routing decisions with awareness of real-time cluster state — not a configuration file that says "70% to private, 30% to API."
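
The queue-depth overflow rule described above can be sketched as a small state machine. This is an illustrative sketch under assumptions: the `ClusterState` fields, the two thresholds, and the pool names are hypothetical, and a real gateway would read queue depth from live cluster telemetry. Two thresholds are used so that routing does not flap when depth hovers around a single cutoff.

```python
from dataclasses import dataclass


@dataclass
class ClusterState:
    queue_depth: int      # requests currently waiting on the private cluster
    healthy: bool = True


@dataclass
class OverflowRouter:
    overflow_threshold: int   # start overflowing when depth exceeds this
    resume_threshold: int     # return to private once depth drains to this
    overflowing: bool = False

    def route(self, state: ClusterState) -> str:
        """Pick a pool for one new request given the current cluster state."""
        if not state.healthy:
            return "managed_api"
        if self.overflowing:
            # Stay on the managed API until the private queue has drained.
            if state.queue_depth <= self.resume_threshold:
                self.overflowing = False
        elif state.queue_depth > self.overflow_threshold:
            self.overflowing = True
        return "managed_api" if self.overflowing else "private_gpu"
```

For example, with `OverflowRouter(overflow_threshold=32, resume_threshold=8)`, a depth of 40 starts the overflow, a depth of 20 keeps it going, and a depth of 8 returns traffic to the private cluster.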

Question 3: What are your actual latency requirements per workload type?

Not all workloads are equal on latency. Interactive chat, real-time suggestion engines, and code completion have user-facing latency requirements in the 500ms-3s range for first token. Document batch processing, nightly summarization pipelines, and embedding generation jobs can tolerate much higher latency.

This matters because private GPU infrastructure and managed APIs have different latency profiles under different conditions:

Private GPU (vLLM on A100, Llama 3.1 70B): TTFT (time to first token) of 200-600ms under normal concurrency, degrading to 2-8s under high queue depth. Continuous batching helps throughput but can increase individual request latency under saturation.
Anthropic Claude 3 Haiku (direct API): TTFT typically 400-900ms, P99 spikes to 3-6s during provider load events — not predictable.
AWS Bedrock on-demand inference: Adds 50-150ms of routing overhead vs. direct API calls. The reduction in P99 variance comes from the Provisioned Throughput option described next, not from on-demand capacity itself.
Bedrock Provisioned Throughput: For latency-sensitive workloads, Bedrock's provisioned throughput option (purchased in model units, typically with a 1-month or 6-month commitment) can reduce P99 latency by 40-60% compared to on-demand, at a cost premium of roughly 2-3x on-demand rates.

The point: "managed API is slower" is not a reliable statement. Under sustained load that stays within provisioned capacity, a well-tuned vLLM deployment with prefix caching enabled can outperform on-demand managed API endpoints. But during burst periods, a managed API with provisioned throughput can have more predictable tail latency than a saturated private cluster.

Question 4: What's your model capability requirement — and how fast is it changing?

This question is often ignored in infrastructure discussions, but it's operationally significant. Private GPU deployments require explicit model management: downloading model weights, managing storage (Llama 3.1 70B in fp16 is roughly 140 GB), running eval harnesses before swapping to a new checkpoint, and configuring TensorRT-LLM for optimized inference if you need sub-200ms TTFT at scale.
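
The 140 GB figure follows from parameter count times bytes per parameter. The sketch below reproduces that arithmetic for a nominal 70B-parameter model; the int8 and int4 rows are added for comparison and are not from the text above. It covers weights only, so KV cache and activation memory come on top.

```python
def weight_size_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate size of the model weights in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


PARAMS_70B = 70e9  # nominal parameter count

fp16_gb = weight_size_gb(PARAMS_70B, 2)    # 2 bytes per parameter -> 140.0
int8_gb = weight_size_gb(PARAMS_70B, 1)    # 1 byte per parameter  -> 70.0
int4_gb = weight_size_gb(PARAMS_70B, 0.5)  # half a byte           -> 35.0
```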

Managed APIs abstract most of this. When Anthropic released Claude 3.5 Sonnet with significantly improved reasoning capability, API users could adopt it by changing a model identifier in their requests. Self-hosted operators have to download new weights, run eval comparisons against production benchmarks, and deploy, a process that takes days to weeks even with a well-structured MLOps pipeline.

We're not saying managed APIs are the right choice for capability-sensitive workloads at any cost. We're saying that if your platform's competitive advantage depends on tracking frontier model improvements closely, the operational overhead of private GPU deployments will slow your iteration cycle — and that tradeoff needs to be explicit in your architectural decision.

The practical guidance: use managed APIs for workloads that benefit from frontier model capability improvements, and private GPU for workloads where Llama 3.1 70B (or equivalent) is already good enough — which, for a wide range of classification, extraction, and structured generation tasks, it is.

The deployment decision matrix

Running through all four questions produces a simple decision matrix:

Private GPU only: High-volume, predictable, PII-restricted, latency-tolerant workloads where Llama-class capability is sufficient. Data governance requires VPC containment.
Managed API only: Low-volume, spiky, frontier-model-dependent workloads. Early product stage where infrastructure overhead isn't justified. Non-PII or data processor agreement in place.
Hybrid with routing layer: Mixed workload types, data class heterogeneity, burst traffic profiles. This is most enterprise platforms beyond ~6 months of production traffic.
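
As a rough illustration, the matrix can be read as a per-workload placement rule plus a platform-level rollup. The sketch below is a deliberate simplification under assumptions: the `WorkloadProfile` fields and the precedence order (data governance first, then model capability, then traffic shape) are one reading of the four questions, not a definitive encoding.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class WorkloadProfile:
    restricted_data: bool          # Question 1: must data stay in the VPC?
    predictable_high_volume: bool  # Question 2: flat, high throughput?
    needs_frontier_model: bool     # Question 4: depends on frontier capability?


def place(w: WorkloadProfile) -> str:
    """Place one workload; governance constraints are checked first."""
    if w.restricted_data:
        return "private_gpu"
    if w.needs_frontier_model:
        return "managed_api"
    return "private_gpu" if w.predictable_high_volume else "managed_api"


def recommend(workloads: List[WorkloadProfile]) -> str:
    """Roll per-workload placements up into a platform-level recommendation."""
    placements = {place(w) for w in workloads}
    return placements.pop() if len(placements) == 1 else "hybrid"
```

A platform with one PII extraction pipeline and one frontier-model chat feature comes out as `hybrid`, which matches the observation that most enterprise platforms end up in the third category.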

The "hybrid" category is where the architecture gets interesting, because it requires a gateway that understands your routing policy, not just a load balancer that distributes requests without regard to their content. Kamiwaza routes at the request level based on data class tags, tenant configuration, real-time endpoint health, and declared latency budgets, which means the hybrid architecture becomes manageable rather than a maintenance burden.

What "BYO endpoint" actually requires

When operators add private GPU nodes to a routing gateway, a few common failure patterns emerge that aren't obvious from the docs:

VPC peering vs. PrivateLink: If your inference cluster and gateway run in different cloud accounts or VPCs, you need either VPC peering (lower latency, requires non-overlapping CIDR blocks) or AWS PrivateLink (higher setup complexity, but tolerates overlapping CIDR ranges and exposes only the service endpoint rather than the whole network). Most teams default to VPC peering and discover the CIDR conflict problem after the cluster is provisioned.
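
The CIDR conflict is cheap to detect before provisioning. The sketch below uses Python's standard `ipaddress` module; the function name and the example ranges are illustrative.

```python
import ipaddress
from typing import Iterable, List, Tuple


def overlapping_cidrs(
    gateway_cidrs: Iterable[str], cluster_cidrs: Iterable[str]
) -> List[Tuple[str, str]]:
    """Return every (gateway, cluster) CIDR pair that overlaps."""
    cluster_nets = [ipaddress.ip_network(c) for c in cluster_cidrs]
    conflicts = []
    for g in gateway_cidrs:
        g_net = ipaddress.ip_network(g)
        for c_net in cluster_nets:
            if g_net.overlaps(c_net):
                conflicts.append((str(g_net), str(c_net)))
    return conflicts


# Example: the first cluster range sits inside the gateway VPC's range.
conflicts = overlapping_cidrs(["10.0.0.0/16"], ["10.0.128.0/20", "10.1.0.0/16"])
```

An empty result means peering is viable on addressing grounds; any conflict means re-addressing one side or using PrivateLink instead.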

Health check semantics: A vLLM process that is running but has a saturated GPU queue will return HTTP 200 on /health while delivering 30-second request latencies. A routing gateway needs to check more than liveness — it needs to evaluate queue depth or real-time p95 latency to determine whether to route new requests to that endpoint or fail over.
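
A routability check that goes beyond liveness might look like the sketch below. The `EndpointSample` fields and the default thresholds are assumptions for illustration; how queue depth and p95 latency are collected (for example, from the inference server's metrics endpoint) depends on the deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointSample:
    http_ok: bool         # liveness: did /health return HTTP 200?
    queue_depth: int      # requests waiting in the inference queue
    p95_latency_s: float  # rolling p95 request latency, in seconds


def is_routable(
    sample: EndpointSample,
    max_queue_depth: int = 32,   # hypothetical threshold
    max_p95_s: float = 5.0,      # hypothetical threshold
) -> bool:
    """An endpoint is routable only if it is alive AND not saturated."""
    return (
        sample.http_ok
        and sample.queue_depth <= max_queue_depth
        and sample.p95_latency_s <= max_p95_s
    )
```

An endpoint that answers 200 on /health while serving 30-second requests fails this check, so the gateway fails over instead of sending it more traffic.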

Model card alignment: When you register a BYO endpoint, you're declaring that it serves a specific model. Kamiwaza's endpoint registry includes a model_card field that specifies the model name, parameter size, and quantization level. If you update the model weights behind the endpoint without updating the registry, routing policies that assume specific capability levels will behave incorrectly.

These aren't arguments against private GPU deployments. They're the operational details that make the difference between a hybrid architecture that works reliably and one that requires manual intervention every time a cluster hiccups.

A note on total cost of ownership

Infrastructure comparisons often set GPU spot prices against API token costs and conclude that private GPU wins. They omit: engineering time to maintain the cluster, monitoring stack costs, storage for model weights and KV cache, GPU driver and CUDA version management, and the opportunity cost of engineers not building product.

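For a small platform team (3-5 engineers), the total cost of ownership for a self-hosted inference cluster that handles 2M tokens/day is roughly $8,000-$12,000/month when ops time is accounted for. At that volume (about 60M tokens/month), the cluster only beats managed APIs if the blended API price exceeds roughly $130-$200 per million tokens, which is well above typical published per-token rates. The self-hosted advantage therefore appears only at substantially higher sustained volume, and by a smaller margin than the GPU-hours-only analysis suggests. The routing layer pays for itself by capturing cost savings on high-volume workloads while letting managed APIs absorb the tail.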
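
One way to sanity-check a comparison like this against your own pricing is to compute the blended API price at which a fixed monthly TCO breaks even. The sketch below uses the figures from the paragraph above; the function name and the 30-day month are assumptions.

```python
def breakeven_price_per_million(
    monthly_tco_usd: float, tokens_per_day: float, days_per_month: int = 30
) -> float:
    """Blended API price (USD per 1M tokens) at which self-hosting breaks even."""
    millions_per_month = tokens_per_day * days_per_month / 1e6
    return monthly_tco_usd / millions_per_month


low = breakeven_price_per_million(8_000, 2_000_000)    # about 133.33
high = breakeven_price_per_million(12_000, 2_000_000)  # 200.0
```

If your actual blended API price is below the break-even value, the managed API is cheaper at that volume; the break-even price falls in proportion as daily volume rises, which is why the private cluster only pays off on high-volume workloads.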