February 3, 2026
Six months into production, a mid-size platform team we worked with had an LLM API bill of $140k/month—mostly because every request, regardless of complexity, was hitting GPT-4o at full context. No exotic architecture saved them. What actually worked was a layered set of boring, measurable interventions that cut spend by 62% in eight weeks while holding p95 latency under 400ms and keeping hallucination rates at the 0.5% baseline they'd measured at launch.
The single biggest win is a cascade router that classifies query complexity before touching an expensive model. Route simple factual lookups, single-turn Q&A, and short summarizations to gpt-4o-mini or a self-hosted Llama 3 8B on Ray Serve. Escalate only when confidence is low or the task requires multi-step reasoning. A two-tier cascade—small model first, large model on fallback—typically shifts 55–70% of traffic to the cheaper tier.
The router itself should be lightweight: a fine-tuned 125M-parameter classifier or even a rules-based heuristic on prompt length and keyword density. It should add no more than 15ms overhead. Most teams get this wrong by over-engineering the router and under-engineering the escalation logic.
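A minimal sketch of the rules-based variant, with illustrative length and keyword thresholds rather than tuned values; the confidence-based escalation described above would live in the calling code.

# Sketch: rules-based cascade router; thresholds and cue words are illustrative
REASONING_CUES = ("why", "compare", "trade-off", "step by step", "explain how")

def pick_tier(prompt: str) -> str:
    text = prompt.lower()
    cue_hits = sum(cue in text for cue in REASONING_CUES)
    # Short prompts with no reasoning cues stay on the cheap tier;
    # anything longer or multi-step escalates to the expensive model.
    if len(text.split()) < 150 and cue_hits == 0:
        return "gpt-4o-mini"   # or a self-hosted Llama 3 8B behind Ray Serve
    return "gpt-4o"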
Track escalation rate in Prometheus and alert if it climbs above your target baseline. A sudden spike usually signals a prompt regression upstream, not a model quality issue—we've been burned by that confusion more than once.
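One way to wire that up with prometheus_client; the metric names and the alert threshold in the comment are illustrative, not prescriptive.

# Sketch: count routed requests per tier so escalation rate can be alerted on
from prometheus_client import Counter

ROUTED = Counter("llm_router_requests_total", "Requests seen by the router", ["tier"])

def record_route(tier: str) -> None:
    ROUTED.labels(tier=tier).inc()

# PromQL alert on the escalated share vs. your baseline (threshold is illustrative):
#   sum(rate(llm_router_requests_total{tier="gpt-4o"}[5m]))
#     / sum(rate(llm_router_requests_total[5m])) > 0.45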
Caching is where teams leave the most money on the table. The mistake is treating it as one strategy rather than three with different hit-rate profiles and implementation costs.
For deterministic or near-deterministic queries—FAQ bots, documentation Q&A, report generation from fixed templates—an exact-match cache in Redis with a 24-hour TTL is the cheapest option available, because a hit costs no API call at all. Hit rates of 30–40% on support workloads are common. Key on a SHA-256 of the normalized prompt (whitespace stripped, lowercased). Dead simple, high yield.
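A minimal sketch with redis-py; call_model here is an assumed wrapper around whatever LLM API you use.

# Sketch: exact-match response cache in Redis with a 24-hour TTL
import hashlib
import redis

r = redis.Redis()  # adjust host/port/db for your deployment

def cache_key(prompt: str) -> str:
    normalized = " ".join(prompt.lower().split())   # lowercase, collapse whitespace
    return "llm:exact:" + hashlib.sha256(normalized.encode()).hexdigest()

def cached_call(prompt: str) -> str:
    key = cache_key(prompt)
    hit = r.get(key)
    if hit is not None:
        return hit.decode()
    response = call_model(prompt)        # assumed helper around the LLM API
    r.set(key, response, ex=86400)       # 24-hour TTL
    return response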
When users rephrase the same question, exact-match fails you. Store prompt embeddings alongside their responses in pgvector or Weaviate and do a nearest-neighbor lookup on the incoming query before every API call. A cosine similarity threshold of 0.92 works well empirically—below that, quality degrades noticeably. Semantic caches add 8–20ms of latency, which is negligible against a 1–3 second model call. Expect an additional 15–25% hit rate on top of exact-match for conversational workloads.
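A rough sketch of the lookup against pgvector via psycopg, assuming an llm_cache table with a prompt_embedding vector column and a stored response, and a query embedding supplied as a numpy array from your embedding model.

# Sketch: nearest-neighbor lookup in pgvector before calling the API
# (assumes a table llm_cache(prompt_embedding vector, response text))
import psycopg
from pgvector.psycopg import register_vector

conn = psycopg.connect("dbname=llmcache")
register_vector(conn)

def semantic_lookup(query_embedding, threshold: float = 0.92):
    row = conn.execute(
        """SELECT response, 1 - (prompt_embedding <=> %s) AS similarity
           FROM llm_cache
           ORDER BY prompt_embedding <=> %s
           LIMIT 1""",
        (query_embedding, query_embedding),
    ).fetchone()
    # <=> is pgvector's cosine distance, so similarity = 1 - distance
    if row and row[1] >= threshold:
        return row[0]      # hit: reuse the stored response
    return None            # miss: call the model, then insert the new pair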
OpenAI's prompt caching—and vLLM's prefix caching for self-hosted models—reuses KV-cache entries for shared prompt prefixes. If your system prompt is 2k tokens and your average user message is 200 tokens, you're paying full price for those 2k tokens on every single request without this. With it, OpenAI bills the cached prefix at a discounted rate and vLLM skips recomputing it entirely, so the marginal cost is mostly the incremental tokens. Structure prompts so static system context comes first and variable content comes last. This is non-negotiable once system prompts exceed 1k tokens.
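A minimal sketch of that ordering; load_static_system_prompt and load_static_examples are hypothetical helpers standing in for content that never changes between requests.

# Sketch: keep the static prefix byte-identical so prompt/prefix caching can reuse it
SYSTEM_PROMPT = load_static_system_prompt()   # assumed helper; ~2k tokens, never varies
FEW_SHOT_MESSAGES = load_static_examples()    # also static, part of the cached prefix

def build_messages(user_message: str, retrieved_context: str) -> list[dict]:
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        *FEW_SHOT_MESSAGES,
        # Variable content goes last so it never invalidates the shared prefix
        {"role": "user", "content": f"{retrieved_context}\n\n{user_message}"},
    ]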
Most production prompts are 20–40% bloated: redundant instructions, over-specified examples, context windows stuffed with retrieved chunks that barely touch the actual answer.
Prompt compression uses a small model—LLMLingua or a custom fine-tune—to compress few-shot examples and retrieved context before sending to the expensive model. Compression ratios of 2–4x are achievable with less than 1% quality regression on structured tasks. This is especially effective when piping LlamaIndex retrieval results into GPT-4o, where retrieved chunks often contain one relevant sentence wrapped in three paragraphs of noise.
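Roughly what that looks like with LLMLingua's PromptCompressor; the target token count is illustrative, and the exact arguments are worth checking against the LLMLingua docs.

# Sketch: compressing retrieved chunks before the expensive call
# (target_token is illustrative; verify current arguments in the LLMLingua docs)
from llmlingua import PromptCompressor

compressor = PromptCompressor()   # loads a small compression model

def compress_context(chunks: list[str], question: str) -> str:
    result = compressor.compress_prompt(
        chunks,                   # retrieved context, e.g. from LlamaIndex
        question=question,
        target_token=1000,        # aim for roughly 2-4x compression
    )
    return result["compressed_prompt"]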
Context pruning means actively trimming conversation history rather than just appending forever. In a multi-turn LangGraph agent, naively accumulating messages hits 8k context within a dozen turns. Summarize older turns with a cheap model every N exchanges, keeping the active window under 4k tokens. The summarization call costs roughly $0.0001 and saves $0.01–0.05 per subsequent turn at GPT-4o pricing. The math is obvious once you write it down, but we've seen surprisingly few teams actually doing it.
# Example: sliding window summarization in LangGraph state update
def prune_history(messages: list[dict], model: str = "gpt-4o-mini") -> list[dict]:
    if token_count(messages) < 3500:
        return messages
    older = messages[:-6]   # keep last 3 exchanges verbatim
    recent = messages[-6:]
    summary = summarize(older, model=model)  # ~200 token output
    return [{"role": "system", "content": f"Prior context: {summary}"}] + recent
Not every LLM call needs a sub-second response. Document processing pipelines, nightly report generation, embedding refreshes for vector stores like Pinecone, and evaluation harnesses are all batch workloads that teams routinely run synchronously out of pure habit. OpenAI's Batch API cuts cost by 50% with a 24-hour completion SLA. For self-hosted setups, vLLM's continuous batching via Ray achieves 10k+ QPS on a single A100 node by maximizing GPU utilization—orders of magnitude better than sequential inference.
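A sketch of the Batch API flow: write requests to a JSONL file, upload it, create the batch, and let the orchestrator poll for completion. The model choice and file names here are illustrative.

# Sketch: submitting a nightly job through OpenAI's Batch API (50% discount, 24h SLA)
import json
from openai import OpenAI

client = OpenAI()

def submit_batch(prompts: list[str], path: str = "batch_input.jsonl") -> str:
    with open(path, "w") as f:
        for i, prompt in enumerate(prompts):
            f.write(json.dumps({
                "custom_id": f"req-{i}",
                "method": "POST",
                "url": "/v1/chat/completions",
                "body": {"model": "gpt-4o-mini",
                         "messages": [{"role": "user", "content": prompt}]},
            }) + "\n")
    batch_file = client.files.create(file=open(path, "rb"), purpose="batch")
    batch = client.batches.create(
        input_file_id=batch_file.id,
        endpoint="/v1/chat/completions",
        completion_window="24h",
    )
    return batch.id   # poll client.batches.retrieve(batch.id) from the orchestrator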
The operational discipline: queue separation. Route async jobs through Kafka into a dedicated batch consumer, keeping that traffic off the real-time path. Airflow or Temporal both work well as orchestrators. Never let a background embedding job compete for rate-limit quota with user-facing inference. We've watched that mistake take down production response times at 2am more than once.
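A bare-bones consumer for that dedicated path, with illustrative topic and group names; run_batch_job is an assumed helper for whatever the background work is.

# Sketch: a dedicated batch consumer so background jobs never share the real-time path
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "llm-batch-jobs",
    bootstrap_servers="kafka:9092",
    group_id="llm-batch-workers",          # separate consumer group, separate quota
    value_deserializer=lambda v: json.loads(v.decode()),
)

for message in consumer:
    job = message.value
    run_batch_job(job)   # assumed helper: embedding refresh, report generation, etc.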
If a specific task runs at high volume with a stable input distribution—intent classification, slot filling, structured extraction—fine-tuning a small model on GPT-4o outputs is the most durable cost reduction available. Generate 10k–50k labeled examples from the large model, fine-tune a 7B or 13B base on those labels, evaluate on a held-out set against the teacher model's outputs, and deploy on Ray Serve behind the router. Cost per call drops by 90%+ for the distilled task.
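A sketch of the teacher-labeling step, with an illustrative system prompt; the fine-tuning run and held-out evaluation hang off the JSONL this produces.

# Sketch: generating teacher labels from GPT-4o for a distillation training set
import json
from openai import OpenAI

client = OpenAI()

def label_examples(inputs: list[str], out_path: str = "distill_train.jsonl") -> None:
    with open(out_path, "w") as f:
        for text in inputs:
            teacher = client.chat.completions.create(
                model="gpt-4o",
                messages=[
                    {"role": "system", "content": "Extract the intent as JSON."},
                    {"role": "user", "content": text},
                ],
            )
            label = teacher.choices[0].message.content
            # One supervised pair per line, ready for a 7B/13B fine-tuning run
            f.write(json.dumps({"input": text, "output": label}) + "\n")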
The catch is maintenance. The fine-tune needs periodic refresh as the input distribution drifts, so you need to budget for a retraining pipeline in Airflow and track distribution shift with Prometheus histograms on embedding centroids. Distillation only makes economic sense above roughly 1M calls/month for a given task. Below that, the engineering amortization doesn't pencil out—stick to routing and caching.