Anthropic vs. OpenAI vs. Bedrock: latency and cost profiles for enterprise routing

Luke Norris April 17, 2025 12 min read

Latency distribution charts comparing Claude Haiku, GPT-4o-mini, and Bedrock Llama 3 at p50, p95, and p99 percentiles

Provider latency and cost comparisons for LLM APIs circulate constantly, but most of them conflate median latency with tail latency, use synthetic microbenchmarks that don't reflect production traffic patterns, or compare models at different capability tiers as if the comparison were meaningful. This post focuses on what enterprise platform teams actually care about when making routing decisions: the latency distributions at p50, p95, and p99, the cost-per-token at production-realistic token volumes, and the operational differences between first-party APIs and their cloud-hosted equivalents.

The data below reflects measurements from a production routing gateway handling mixed enterprise workloads — summarization, structured extraction, and chat completions — across the three provider configurations most commonly deployed in enterprise settings. We're not publishing methodology details that would be impossible to replicate, and we're not pretending the numbers are universal — provider latency varies by region, time of day, and your specific traffic mix. What we're sharing is the shape of the distributions and the operational patterns that matter for routing policy design.

The three configurations and what they're not

To be precise about what's being compared:

Anthropic direct API (Claude 3 Haiku): claude-3-haiku-20240307 via api.anthropic.com, Anthropic's lowest-latency production model as of late 2024. Not Claude 3.5 Sonnet, which has different latency characteristics, and not Claude 3 Opus, which is substantially slower and has lower throughput.
OpenAI direct API (GPT-4o-mini): gpt-4o-mini-2024-07-18 via api.openai.com. This is OpenAI's recommended small/fast model for high-volume use cases, priced at ~$0.15/M input and ~$0.60/M output. Not GPT-4o (~$2.50/M input), which has meaningfully different cost and latency characteristics.
AWS Bedrock on-demand (Llama 3 70B Instruct): meta.llama3-70b-instruct-v1:0 via Bedrock's on-demand inference in us-east-1. Not Bedrock's Anthropic models (which are served from AWS infrastructure behind Bedrock's IAM and billing layer), and not Bedrock Provisioned Throughput (which has different, and generally better, p99 characteristics).

This is important because "Bedrock" is not a single endpoint. Bedrock-hosted Anthropic Claude and first-party Anthropic Claude have different network paths, different latency profiles, and somewhat different rate limit structures. Treating them as interchangeable is a common misconfiguration.

Latency distributions: what the p99 tells you that p50 hides

The measurements below were taken at roughly 500 concurrent requests across a mixed workload (400-1200 token prompts, 200-800 token completions), a realistic production traffic pattern for a growing platform. The approximate distributions for time-to-first-token (TTFT):

Endpoint	p50 TTFT	p95 TTFT	p99 TTFT	p99.9 TTFT
Claude 3 Haiku (direct)	~480ms	~1,400ms	~3,200ms	~8,500ms
GPT-4o-mini (direct)	~320ms	~1,100ms	~4,800ms	~12,000ms
Bedrock Llama 3 70B (on-demand)	~650ms	~2,800ms	~7,200ms	>15,000ms

The p50 numbers flatter all three options. GPT-4o-mini's median TTFT looks attractive, but the p99 tells a different story. GPT-4o-mini's p99 is 4.8 seconds, so if your application has a 3-second interactive latency SLA, GPT-4o-mini will breach it for at least 1% of requests under this traffic pattern (the true share lies between 1% and 5%, since its p95 is still well inside the SLA). At 5M requests/month, that's at least 50,000 latency SLA breaches per month from a single provider.
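The breach estimate can be bounded from the percentile table alone. The sketch below is illustrative only: breach_bounds is a hypothetical helper, not part of any gateway. The TTFT figures are copied from the table, and the 3-second SLA and 5M requests/month are the values used in the paragraph above.

```python
# TTFT percentiles in milliseconds, copied from the table above.
TTFT_MS = {
    "claude-3-haiku": {50: 480, 95: 1400, 99: 3200, 99.9: 8500},
    "gpt-4o-mini": {50: 320, 95: 1100, 99: 4800, 99.9: 12000},
    # p99.9 is omitted here because the table only gives it as >15,000ms.
    "bedrock-llama3-70b": {50: 650, 95: 2800, 99: 7200},
}


def breach_bounds(percentiles, sla_ms):
    """Return (lower, upper) bounds on the share of requests slower than sla_ms."""
    lower, upper = 0.0, 1.0
    for pct, latency_ms in percentiles.items():
        tail = (100 - pct) / 100  # share of requests at or beyond this percentile
        if latency_ms > sla_ms:
            lower = max(lower, tail)  # everything beyond this percentile breaches
        else:
            upper = min(upper, tail)  # everything up to this percentile is within SLA
    return lower, upper


MONTHLY_REQUESTS = 5_000_000  # volume used in the text
SLA_MS = 3000                 # interactive SLA used in the text

for name, pcts in TTFT_MS.items():
    low, high = breach_bounds(pcts, SLA_MS)
    print(f"{name}: {low:.1%} to {high:.1%} of requests, "
          f"{low * MONTHLY_REQUESTS:,.0f} to {high * MONTHLY_REQUESTS:,.0f} breaches/month")
```

With these figures, all three endpoints fall into the same 1% to 5% band at a 3-second threshold, because each has a p95 under 3 seconds and a p99 over it. Four percentiles cannot separate them further, so a gateway would need to count requests against the SLA threshold directly to compare them.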

Bedrock Llama 3 70B on-demand has the worst p99 in this configuration. The on-demand tier shares capacity with all other Bedrock customers; under load, the queue can grow substantially. Bedrock Provisioned Throughput removes most of this variance at the cost of a committed purchase (minimum 1 model unit for 1 month). At high sustained volume, the Provisioned Throughput economics can be favorable, but the commitment model adds financial risk that on-demand doesn't have.

Cost at scale: where the per-token rate meets your traffic mix

The published per-token rates for these models as of early 2025:

Model	Input $/M	Output $/M	Blended (1:2 in/out)
Claude 3 Haiku	$0.25	$1.25	~$0.92/M
GPT-4o-mini	$0.15	$0.60	~$0.45/M
Bedrock Llama 3 70B (on-demand)	$0.70	$0.99	~$0.89/M
Claude 3.5 Sonnet	$3.00	$15.00	~$11.00/M
GPT-4o	$2.50	$10.00	~$7.50/M

GPT-4o-mini has the lowest blended cost at 1:2 in/out ratios. But this blended rate is sensitive to your actual token mix. For workloads with long system prompts and short outputs (classification, extraction), the input-heavy mix narrows the gap between GPT-4o-mini and Claude 3 Haiku. For generation-heavy workloads (summarization, drafting), the output-heavy mix makes GPT-4o-mini's lower output rate more significant.
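To see how the blended rate moves with the token mix, here is a minimal sketch that recomputes it at two ratios. The per-token rates come from the table above. The 10:1 input-heavy mix is an assumed example, not a measured workload, and blended_rate is a hypothetical helper name.

```python
# (input $/M, output $/M) copied from the pricing table above.
RATES = {
    "claude-3-haiku": (0.25, 1.25),
    "gpt-4o-mini": (0.15, 0.60),
    "bedrock-llama3-70b": (0.70, 0.99),
}


def blended_rate(input_rate, output_rate, input_parts, output_parts):
    """Token-weighted average $/M for a given input:output token ratio."""
    total = input_parts + output_parts
    return (input_rate * input_parts + output_rate * output_parts) / total


# Assumed traffic mixes, expressed as (input parts, output parts).
MIXES = {"generation-heavy 1:2": (1, 2), "extraction-heavy 10:1": (10, 1)}

for label, (inp, out) in MIXES.items():
    rates = {m: blended_rate(i, o, inp, out) for m, (i, o) in RATES.items()}
    gap = rates["claude-3-haiku"] - rates["gpt-4o-mini"]
    print(label, {m: round(r, 2) for m, r in rates.items()},
          f"Haiku minus mini: ${gap:.2f}/M")
```

Under these rates, the gap between Claude 3 Haiku and GPT-4o-mini shrinks from roughly $0.47/M at 1:2 to roughly $0.15/M at 10:1.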

We're not saying GPT-4o-mini is the best default choice. We're saying that cost-per-token comparisons at a single in/out ratio can misrepresent the real cost difference for your specific workload, and that routing policy should target the right model for each task type rather than optimizing for a single blended cost per million tokens across all traffic.

Bedrock vs. direct Anthropic: the IAM and compliance angle

For enterprise teams on AWS with IAM-governed access control requirements, Bedrock-hosted Anthropic Claude (Claude 3 Haiku and Claude 3.5 Sonnet are available via Bedrock) provides a meaningful operational advantage: the API call is governed by AWS IAM roles and policies, request logging can go directly to CloudTrail, and the billing consolidates into the existing AWS invoice.

This isn't a latency or cost advantage — Bedrock-hosted Claude adds approximately 50-120ms of routing overhead compared to direct Anthropic API calls due to the Bedrock request proxy layer. But for teams where the alternative is building out a separate Anthropic API key management system with per-team access controls, the operational simplification of IAM-governed access can justify that overhead.

The AWS Bedrock service documentation (specifically, the section on inference request isolation and data retention) clarifies that Bedrock does not use customer inputs to train models. This is the same guarantee Anthropic provides on enterprise agreements. For teams without an Anthropic enterprise agreement, Bedrock may provide stronger contractual data handling guarantees than the direct API on a standard developer plan.

Rate limits as a routing constraint, not just a cost constraint

Published rate limits for these APIs (which change over time and vary by account tier) affect routing policy in ways that cost analysis misses:

Anthropic's API applies rate limits in both RPM (requests per minute) and TPM (tokens per minute). At higher usage tiers, these limits increase, but they still represent a ceiling that affects burst traffic handling. A routing policy that funnels 100% of traffic to Claude 3 Haiku will hit rate limits during batch operations or traffic spikes, requiring either request queuing (adds latency) or a managed overflow to a secondary provider.

OpenAI's Tier 1 accounts have 500 RPM on GPT-4o-mini; Tier 4 accounts have 10,000 RPM. The tier is determined by cumulative spend, not by a plan selection. A platform that ramps up quickly may find itself rate-limited at the exact moment its traffic is growing fastest — before it has accumulated enough spend history to move to a higher tier.
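A routing layer can only treat rate limits as a constraint if it tracks how much headroom is left. The sketch below is one way to do that, under stated assumptions. RateWindow is a hypothetical class that approximates RPM and TPM limits with a trailing 60-second window; providers may enforce limits differently (for example with token buckets). The 500 RPM figure is the Tier 1 number quoted above, and the TPM value is a placeholder.

```python
import time
from collections import deque


class RateWindow:
    """Tracks requests and tokens sent to one endpoint in the trailing 60 seconds."""

    def __init__(self, rpm_limit, tpm_limit, clock=time.monotonic):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.clock = clock      # injectable so the window can be tested without sleeping
        self.events = deque()   # (timestamp, tokens) per request, oldest first

    def _expire(self):
        cutoff = self.clock() - 60.0
        while self.events and self.events[0][0] <= cutoff:
            self.events.popleft()

    def has_headroom(self, tokens):
        """True if one more request of this size fits under both limits."""
        self._expire()
        used_tokens = sum(t for _, t in self.events)
        return (len(self.events) + 1 <= self.rpm_limit
                and used_tokens + tokens <= self.tpm_limit)

    def record(self, tokens):
        """Call after a request has actually been sent."""
        self.events.append((self.clock(), tokens))


window = RateWindow(rpm_limit=500, tpm_limit=200_000)  # TPM value is a placeholder
if window.has_headroom(tokens=1_500):
    window.record(tokens=1_500)
```

Checking headroom before sending lets the router divert traffic ahead of a 429 rather than reacting to one.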

This is one reason routing policies with multi-provider fallback chains are operationally superior to single-provider configurations, even when one provider has the better cost/latency profile. An unhandled rate limit is a hard failure for the application, and a routing layer that fails over to a secondary provider automatically is worth more than the marginal savings from committing all traffic to the cheapest endpoint.
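A fallback chain can be sketched in a few lines. Everything here is hypothetical: RateLimited stands in for whatever error a real provider client raises on HTTP 429, and the two provider functions are stubs rather than real SDK calls.

```python
class RateLimited(Exception):
    """Stand-in for the error a provider client raises on HTTP 429."""


def complete_with_fallback(prompt, chain):
    """Try each (name, call) pair in order, moving on when one is rate-limited."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except RateLimited as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers rate-limited: " + "; ".join(errors))


def primary(prompt):
    # Stub that simulates an exhausted quota on the preferred endpoint.
    raise RateLimited("429 from primary")


def secondary(prompt):
    # Stub that simulates a healthy overflow endpoint.
    return f"echo: {prompt}"


served_by, text = complete_with_fallback(
    "hello", [("primary", primary), ("secondary", secondary)]
)
print(served_by, text)
```

Only rate-limit errors trigger failover in this sketch. Other failures propagate to the caller, so a bad request is not silently retried against every provider.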

Structured output (JSON mode) and the format contract

For applications that depend on structured output — extraction pipelines, classification systems, agentic workflows that parse model output — the reliability of JSON-mode responses across providers matters as much as latency and cost.

All three providers offer structured/JSON output modes: OpenAI's response_format: { type: "json_object" }, Anthropic's tool use API or system prompt-forced JSON, and Bedrock's equivalent. In practice, the reliability of format compliance varies. Some extraction pipelines switch providers based on which endpoint has better structured output compliance on their specific schema — and need their routing policy to target specific models for specific task types.
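Format compliance is easy to measure once responses are logged. The sketch below assumes a flat schema with a set of required top-level keys; is_compliant and compliance_rate are hypothetical helper names, and the sample responses are invented for illustration.

```python
import json


def is_compliant(raw, required_keys):
    """True if raw parses as a JSON object containing every required key."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    return isinstance(data, dict) and set(required_keys).issubset(data)


def compliance_rate(responses, required_keys):
    """Share of responses that satisfy the format contract."""
    if not responses:
        return 0.0
    ok = sum(is_compliant(r, required_keys) for r in responses)
    return ok / len(responses)


# Invented sample outputs for an invoice-extraction schema.
samples = [
    '{"vendor": "Acme", "total": 120.5}',
    '{"vendor": "Acme"}',                       # missing a required key
    'Sure! Here is the JSON: {"vendor": "A"}',  # prose wrapper, not valid JSON
]
print(compliance_rate(samples, {"vendor", "total"}))
```

Running the same check per endpoint on your own schema gives a compliance figure that can sit next to latency and cost when choosing where a task type should go.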

A Kamiwaza routing policy that maps task_type: structured-extraction to a specific model (rather than letting the cost optimizer pick) gives platform teams control over this tradeoff: pay slightly more for a model with better format compliance, reduce downstream parsing failures, maintain overall system throughput. The policy makes the tradeoff explicit and auditable rather than implicit in which SDK happens to be installed.
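The idea of pinning a task type, rather than letting a cost optimizer choose, can be illustrated in plain Python. This is not Kamiwaza's actual policy syntax. The POLICY structure, the pick_model helper, and the choice of which model to pin are all assumptions made for the example; the blended costs are the 1:2 figures from the pricing table.

```python
# Hypothetical policy: task types listed under "pinned" bypass cost optimization.
POLICY = {
    "pinned": {"structured-extraction": "claude-3-haiku"},  # placeholder choice
}

# Blended $/M at a 1:2 in/out ratio, from the pricing table above.
BLENDED_COST = {
    "claude-3-haiku": 0.92,
    "gpt-4o-mini": 0.45,
    "bedrock-llama3-70b": 0.89,
}


def pick_model(task_type, policy, blended_cost):
    """Return (model, reason) so every routing decision can be logged and audited."""
    pinned = policy["pinned"].get(task_type)
    if pinned is not None:
        return pinned, f"pinned for task_type={task_type}"
    cheapest = min(blended_cost, key=blended_cost.get)
    return cheapest, "lowest blended cost"


print(pick_model("structured-extraction", POLICY, BLENDED_COST))
print(pick_model("chat", POLICY, BLENDED_COST))
```

Returning the reason alongside the model is what makes the tradeoff auditable: the log shows whether a request went to a pricier model by policy or by accident.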

The operational reality of multi-provider routing

Choosing between Anthropic, OpenAI, and Bedrock is not a one-time architectural decision. All three providers make pricing changes, capacity investments, and model updates on timelines you don't control. A routing policy that pins all traffic to a single provider because it was cheapest at evaluation time will need manual updates every time a provider reprices, changes latency characteristics, or updates model behavior in a way that breaks your evals.

The operationally resilient architecture treats provider selection as a routing decision made per-request based on current conditions — latency, cost, rate limit headroom, data class requirements — not as a static configuration that requires a human to update when market conditions change. That's not an argument for constant provider-switching; most requests should go to the same endpoint most of the time. But the routing policy should be declarative enough that updating cost targets or swapping a provider requires changing one YAML file, not redeploying application code across every service that calls an LLM.
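As a closing illustration, per-request selection on current conditions might look like the sketch below. EndpointState and select_endpoint are hypothetical names. The p95 and blended-cost values come from the tables earlier in this post, while the rate-limit headroom and data-class values are invented placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointState:
    name: str
    p95_ttft_ms: float        # recent p95 time-to-first-token
    blended_cost: float       # $/M tokens at the workload's in/out ratio
    rpm_headroom: int         # requests still available in the current minute
    data_classes: frozenset   # data classes this endpoint is approved for


def select_endpoint(states, max_p95_ms, data_class):
    """Cheapest endpoint that meets the latency, headroom, and data-class constraints."""
    eligible = [
        s for s in states
        if s.p95_ttft_ms <= max_p95_ms
        and s.rpm_headroom > 0
        and data_class in s.data_classes
    ]
    if not eligible:
        return None  # caller decides whether to queue, relax constraints, or fail
    return min(eligible, key=lambda s: s.blended_cost)


# Headroom and data-class values are placeholders, not measurements.
STATES = [
    EndpointState("claude-3-haiku", 1400, 0.92, 120, frozenset({"public", "internal"})),
    EndpointState("gpt-4o-mini", 1100, 0.45, 0, frozenset({"public", "internal"})),
    EndpointState("bedrock-llama3-70b", 2800, 0.89, 300,
                  frozenset({"public", "internal", "restricted"})),
]

choice = select_endpoint(STATES, max_p95_ms=1500, data_class="internal")
print(choice.name if choice else "no eligible endpoint")
```

In this snapshot the cheapest endpoint has no rate-limit headroom, so the request goes to the next eligible one. When headroom returns, the same policy sends traffic back without anyone editing configuration.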