February 3, 2026
Six months into production, a mid-size platform team we worked with had an LLM API bill of $140k/month—mostly because every request, regardless of complexity, was hitting GPT-4o at full context. No exotic architecture saved them. What actually worked was a layered set of boring, measurable interventions that cut spend by 62% in eight weeks while holding p95 latency under 400ms and keeping hallucination rates at the 0.5% baseline they'd measured at launch.
The single biggest win is a cascade router that classifies query complexity before touching an expensive model. Route simple factual lookups, single-turn Q&A, and short summarizations to gpt-4o-mini or a self-hosted Llama 3 8B on Ray Serve. Escalate only when confidence is low or the task requires multi-step reasoning. A two-tier cascade—small model first, large model on fallback—typically shifts 55–70% of traffic to the cheaper tier.
The router itself should be lightweight: a fine-tuned 125M-parameter classifier or even a rules-based heuristic on prompt length and keyword density. It should add no more than 15ms overhead. Most teams get this wrong by over-engineering the router and under-engineering the escalation logic.
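A minimal sketch of the rules-based variant, with illustrative length and keyword thresholds rather than tuned values; the confidence-based escalation described above would live in the calling code.

# Sketch: rules-based cascade router; thresholds and cue words are illustrative
REASONING_CUES = ("why", "compare", "trade-off", "step by step", "explain how")

def pick_tier(prompt: str) -> str:
    text = prompt.lower()
    cue_hits = sum(cue in text for cue in REASONING_CUES)
    # Short prompts with no reasoning cues stay on the cheap tier;
    # anything longer or multi-step escalates to the expensive model.
    if len(text.split()) < 150 and cue_hits == 0:
        return "gpt-4o-mini"   # or a self-hosted Llama 3 8B behind Ray Serve
    return "gpt-4o"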
Track escalation rate in Prometheus and alert if it climbs above your target baseline. A sudden spike usually signals a prompt regression upstream, not a model quality issue—we've been burned by that confusion more than once.
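One way to wire that up with prometheus_client; the metric names and the alert threshold in the comment are illustrative, not prescriptive.

# Sketch: count routed requests per tier so escalation rate can be alerted on
from prometheus_client import Counter

ROUTED = Counter("llm_router_requests_total", "Requests seen by the router", ["tier"])

def record_route(tier: str) -> None:
    ROUTED.labels(tier=tier).inc()

# PromQL alert on the escalated share vs. your baseline (threshold is illustrative):
#   sum(rate(llm_router_requests_total{tier="gpt-4o"}[5m]))
#     / sum(rate(llm_router_requests_total[5m])) > 0.45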
Caching is where teams leave the most money on the table. The mistake is treating it as one strategy rather than three with different hit-rate profiles and implementation costs.
For deterministic or near-deterministic queries—FAQ bots, documentation Q&A, report generation from fixed templates—an exact-match cache in Redis with a 24-hour TTL is the cheapest option available, because a hit costs no API call at all. Hit rates of 30–40% on support workloads are common. Key on a SHA-256 of the normalized prompt (whitespace stripped, lowercased). Dead simple, high yield.
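A minimal sketch with redis-py; call_model here is an assumed wrapper around whatever LLM API you use.

# Sketch: exact-match response cache in Redis with a 24-hour TTL
import hashlib
import redis

r = redis.Redis()  # adjust host/port/db for your deployment

def cache_key(prompt: str) -> str:
    normalized = " ".join(prompt.lower().split())   # lowercase, collapse whitespace
    return "llm:exact:" + hashlib.sha256(normalized.encode()).hexdigest()

def cached_call(prompt: str) -> str:
    key = cache_key(prompt)
    hit = r.get(key)
    if hit is not None:
        return hit.decode()
    response = call_model(prompt)        # assumed helper around the LLM API
    r.set(key, response, ex=86400)       # 24-hour TTL
    return response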
When users rephrase the same question, exact-match fails you. Store prompt embeddings alongside their responses in pgvector or Weaviate and do a nearest-neighbor lookup on the incoming query before every API call. A cosine similarity threshold of 0.92 works well empirically—below that, quality degrades noticeably. Semantic caches add 8–20ms of latency, which is negligible against a 1–3 second model call. Expect an additional 15–25% hit rate on top of exact-match for conversational workloads.
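A rough sketch of the lookup against pgvector via psycopg, assuming an llm_cache table with a prompt_embedding vector column and a stored response, and a query embedding supplied as a numpy array from your embedding model.

# Sketch: nearest-neighbor lookup in pgvector before calling the API
# (assumes a table llm_cache(prompt_embedding vector, response text))
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=llmcache")
register_vector(conn)

def semantic_lookup(query_embedding, threshold: float = 0.92):
    row = conn.execute(
        """SELECT response, 1 - (prompt_embedding <=> %s) AS similarity
           FROM llm_cache
           ORDER BY prompt_embedding <=> %s
           LIMIT 1""",
        (query_embedding, query_embedding),
    ).fetchone()
    # <=> is pgvector's cosine distance, so similarity = 1 - distance
    if row and row[1] >= threshold:
        return row[0]      # hit: reuse the stored response
    return None            # miss: call the model, then insert the new pair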
OpenAI's prompt caching—and vLLM's prefix caching for self-hosted models—reuses KV-cache entries for shared prompt prefixes. If your system prompt is 2k tokens and your average user message is 200 tokens, you're paying full price for those 2k tokens on every single request without this. With it, OpenAI bills the cached prefix at a discounted rate and vLLM skips recomputing it entirely, so the marginal cost is mostly the incremental tokens. Structure prompts so static system context comes first and variable content comes last. This is non-negotiable once system prompts exceed 1k tokens.
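A minimal sketch of that ordering; load_static_system_prompt and load_static_examples are hypothetical helpers standing in for content that never changes between requests.

# Sketch: keep the static prefix byte-identical so prompt/prefix caching can reuse it
SYSTEM_PROMPT = load_static_system_prompt()   # assumed helper; ~2k tokens, never varies
FEW_SHOT_MESSAGES = load_static_examples()    # also static, part of the cached prefix

def build_messages(user_message: str, retrieved_context: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        *FEW_SHOT_MESSAGES,
        # Variable content goes last so it never invalidates the shared prefix
        {"role": "user", "content": f"{retrieved_context}\n\n{user_message}"},
    ]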
Most production prompts are 20–40% bloated: redundant instructions, over-specified examples, context windows stuffed with retrieved chunks that barely touch the actual answer.
Prompt compression uses a small model—LLMLingua or a custom fine-tune—to compress few-shot examples and retrieved context before sending to the expensive model. Compression ratios of 2–4x are achievable with less than 1% quality regression on structured tasks. This is especially effective when piping LlamaIndex retrieval results into GPT-4o, where retrieved chunks often contain one relevant sentence wrapped in three paragraphs of noise.
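Roughly what that looks like with LLMLingua's PromptCompressor; the target token count is illustrative, and the exact arguments are worth checking against the LLMLingua docs.

# Sketch: compressing retrieved chunks before the expensive call
# (target_token is illustrative; verify current arguments in the LLMLingua docs)
from llmlingua import PromptCompressor

compressor = PromptCompressor()   # loads a small compression model

def compress_context(chunks: list[str], question: str) -> str:
    result = compressor.compress_prompt(
        chunks,                   # retrieved context, e.g. from LlamaIndex
        question=question,
        target_token=1000,        # aim for roughly 2-4x compression
    )
    return result["compressed_prompt"]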
Context pruning means actively trimming conversation history rather than just appending forever. In a multi-turn LangGraph agent, naively accumulating messages hits 8k context within a dozen turns. Summarize older turns with a cheap model every N exchanges, keeping the active window under 4k tokens. The summarization call costs roughly $0.0001 and saves $0.01–0.05 per subsequent turn at GPT-4o pricing. The math is obvious once you write it down, but we've seen surprisingly few teams actually doing it.
# Example: sliding window summarization in LangGraph state update
def prune_history(messages: list[dict], model: str = "gpt-4o-mini") -> list[dict]:
    if token_count(messages) < 3500:
        return messages
    older = messages[:-6]   # keep last 3 exchanges verbatim
    recent = messages[-6:]
    summary = summarize(older, model=model)  # ~200 token output
    return [{"role": "system", "content": f"Prior context: {summary}"}] + recent
Not every LLM call needs a sub-second response. Document processing pipelines, nightly report generation, embedding refreshes for vector stores like Pinecone, and evaluation harnesses are all batch workloads that teams routinely run synchronously out of pure habit. OpenAI's Batch API cuts cost by 50% with a 24-hour completion SLA. For self-hosted setups, vLLM's continuous batching via Ray achieves 10k+ QPS on a single A100 node by maximizing GPU utilization—orders of magnitude better than sequential inference.
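A sketch of the Batch API flow: write requests to a JSONL file, upload it, create the batch, and let the orchestrator poll for completion. The model choice and file names here are illustrative.

# Sketch: submitting a nightly job through OpenAI's Batch API (50% discount, 24h SLA)
import json
from openai import OpenAI

client = OpenAI()

def submit_batch(prompts: list[str], path: str = "batch_input.jsonl") -> str:
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            f.write(json.dumps({
                "custom_id": f"req-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {"model": "gpt-4o-mini",
                         "messages": [{"role": "user", "content": prompt}]},
            }) + "\n")
    batch_file = client.files.create(file=open(path, "rb"), purpose="batch")
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    return batch.id   # poll client.batches.retrieve(batch.id) from the orchestrator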
The operational discipline: queue separation. Route async jobs through Kafka into a dedicated batch consumer, keeping that traffic off the real-time path. Airflow or Temporal both work well as orchestrators. Never let a background embedding job compete for rate-limit quota with user-facing inference. We've watched that mistake take down production response times at 2am more than once.
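A bare-bones consumer for that dedicated path, with illustrative topic and group names; run_batch_job is an assumed helper for whatever the background work is.

# Sketch: a dedicated batch consumer so background jobs never share the real-time path
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "llm-batch-jobs",
    bootstrap_servers="kafka:9092",
    group_id="llm-batch-workers",          # separate consumer group, separate quota
    value_deserializer=lambda v: json.loads(v.decode()),
)

for message in consumer:
    job = message.value
    run_batch_job(job)   # assumed helper: embedding refresh, report generation, etc.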
If a specific task runs at high volume with a stable input distribution—intent classification, slot filling, structured extraction—fine-tuning a small model on GPT-4o outputs is the most durable cost reduction available. Generate 10k–50k labeled examples from the large model, fine-tune a 7B or 13B base on those labels, evaluate on a held-out set against the teacher model's outputs, and deploy on Ray Serve behind the router. Cost per call drops by 90%+ for the distilled task.
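A sketch of the teacher-labeling step, with an illustrative system prompt; the fine-tuning run and held-out evaluation hang off the JSONL this produces.

# Sketch: generating teacher labels from GPT-4o for a distillation training set
import json
from openai import OpenAI

client = OpenAI()

def label_examples(inputs: list[str], out_path: str = "distill_train.jsonl") -> None:
    with open(out_path, "w") as f:
        for text in inputs:
            teacher = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": "Extract the intent as JSON."},
                    {"role": "user", "content": text},
                ],
            )
            label = teacher.choices[0].message.content
            # One supervised pair per line, ready for a 7B/13B fine-tuning run
            f.write(json.dumps({"input": text, "output": label}) + "\n")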
The catch is maintenance. The fine-tune needs periodic refresh as the input distribution drifts, so you need to budget for a retraining pipeline in Airflow and track distribution shift with Prometheus histograms on embedding centroids. Distillation only makes economic sense above roughly 1M calls/month for a given task. Below that, the engineering amortization doesn't pencil out—stick to routing and caching.