October 28, 2025
Three months into a production deployment, a team I worked with had built a beautifully decomposed microservices architecture—forty-two services, clean domain boundaries, Istio handling mTLS and traffic shaping—and their AI features were a disaster. Response times were blowing past 2 seconds, context was evaporating between hops, and the LLM calls were scattered across nine different services with no unified retry or cost accounting. The problem wasn't the microservices. The problem was treating an AI reasoning pipeline like a REST API call.
Microservices architecture exists to solve organizational and operational scale: independent deployability, team autonomy, fault isolation, and horizontal scaling of discrete compute units. The contract between services is stateless and synchronous (or event-driven). Each service does one thing, knows nothing about its neighbors' internal state, and communicates through well-defined interfaces. This maps well to transactional workloads—order processing, user authentication, inventory updates.
AI orchestration solves a fundamentally different class of problem: stateful, multi-step reasoning with probabilistic outputs. A LangGraph workflow that retrieves context from Pinecone, synthesizes it against an 8k-token prompt window, invokes a tool, evaluates the result, and conditionally loops is not a microservice pipeline. It's a reasoning graph. State must persist across steps, retry semantics depend on semantic evaluation rather than HTTP status codes, and cost accumulates per token rather than per request.
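To make that concrete, here's a minimal sketch of that shape in LangGraph. The node bodies are placeholders, and the state schema, node names, and retry condition are ours for illustration, not lifted from any particular production workflow:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END

class PipelineState(TypedDict):
    query: str
    context: list[str]
    draft: str
    tool_result: str
    passed_eval: bool

def retrieve(state: PipelineState) -> dict:
    # Placeholder: vector-store lookup (e.g. Pinecone) would go here.
    return {"context": ["retrieved chunk"]}

def synthesize(state: PipelineState) -> dict:
    # Placeholder: LLM call that drafts an answer against the retrieved context.
    return {"draft": "candidate answer"}

def call_tool(state: PipelineState) -> dict:
    # Placeholder: one tool invocation, e.g. a call into an existing microservice.
    return {"tool_result": "tool output"}

def evaluate(state: PipelineState) -> dict:
    # Placeholder: a semantic check of the draft, not an HTTP status code.
    return {"passed_eval": True}

graph = StateGraph(PipelineState)
graph.add_node("retrieve", retrieve)
graph.add_node("synthesize", synthesize)
graph.add_node("call_tool", call_tool)
graph.add_node("evaluate", evaluate)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "synthesize")
graph.add_edge("synthesize", "call_tool")
graph.add_edge("call_tool", "evaluate")
# The conditional loop: retry synthesis if the evaluation fails, otherwise finish.
graph.add_conditional_edges(
    "evaluate",
    lambda s: "done" if s["passed_eval"] else "retry",
    {"done": END, "retry": "synthesize"},
)
app = graph.compile()
result = app.invoke({"query": "where is order 4812?"})
```

Every one of those nodes reads and writes the same state object; split them across service boundaries and you're re-hydrating that state on every hop.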
Conflating these two abstractions is the root cause of most production AI architecture failures we've seen. Most teams get this wrong because they reach for the decomposition patterns they already know.
The canonical microservices pattern pushes you toward fine-grained decomposition and stateless handlers. Both properties actively fight AI workloads. Decompose an AI pipeline naively into microservices and you get exactly what that team got: reasoning context dropped or re-serialized at every hop, per-hop latency compounding until responses blow past two seconds, and retries and token spend scattered across services with no single place to account for either.
The architecture that actually works in production is layered: your existing microservices handle transactional, domain-specific operations, and an AI orchestration layer sits on top, treating those services as tools and data sources. This is not a replacement—it's an additive layer with a clear boundary.
In practice, your LangGraph or CrewAI workflow calls your inventory microservice via a well-defined tool interface, retrieves embeddings from Weaviate or pgvector through your data layer, and manages all inter-step state internally. The microservices stay stateless and focused. The orchestrator owns the reasoning state, the retry logic, and the token budget. A platform team we worked with last quarter restructured along exactly these lines and cut their median AI response latency by roughly 60% without touching a single microservice.
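The tool boundary can be as thin as a typed function wrapping an HTTP call. A sketch, assuming LangChain-style tool registration; the service URL and response shape are invented for the example:

```python
import requests
from langchain_core.tools import tool

# Hypothetical in-cluster address for the existing inventory microservice.
INVENTORY_URL = "http://inventory.orders.svc.cluster.local"

@tool
def check_stock(sku: str) -> str:
    """Return the current stock level for a SKU from the inventory service."""
    # The microservice stays stateless and deterministic; the agent framework
    # owns the reasoning context, the semantic retries, and the token budget.
    resp = requests.get(f"{INVENTORY_URL}/v1/stock/{sku}", timeout=2)
    resp.raise_for_status()
    return str(resp.json()["quantity"])  # response shape assumed for illustration
```

The orchestrator binds check_stock alongside its other tools; the inventory team never needs to know an agent exists.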
The cleanest boundary we've found: if the logic requires probabilistic evaluation of outputs or maintains conversational or reasoning context across more than one operation, it belongs in the orchestration layer. If it's a deterministic read or write against a domain model, it's a microservice. AutoGen agents calling your pricing service are using a microservice as a tool—the agent framework owns the multi-turn logic, not the pricing service. That distinction sounds simple, but it's surprisingly easy to blur under deadline pressure.
Running AI orchestration workloads on Kubernetes requires rethinking your resource and networking model. LLM-backed services are not like typical microservices—they have spiky, high-latency, high-memory profiles that make the standard Horizontal Pod Autoscaler react too slowly. Give orchestration pods at least 4–8 GB of memory headroom for in-flight context, and use Ray for distributed execution of parallel agent branches rather than relying on Kubernetes replicas alone.
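A minimal sketch of the Ray side, with resource figures and branch names that are illustrative rather than prescriptive:

```python
import ray

ray.init(address="auto")  # attach to the Ray cluster running alongside the orchestrator pods

# Reserve memory per branch so parallel agent fan-out doesn't OOM a single pod;
# 2 GB here is an illustrative figure, not a recommendation.
@ray.remote(num_cpus=1, memory=2 * 1024**3)
def run_branch(branch: str, payload: dict) -> dict:
    # Placeholder: each branch would run one agent or tool chain independently.
    return {"branch": branch, "result": "..."}

# Fan out parallel branches across the cluster instead of scaling Kubernetes replicas.
futures = [run_branch.remote(b, {"query": "..."}) for b in ("pricing", "inventory", "shipping")]
results = ray.get(futures)
```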
Istio remains valuable, but its role shifts. For AI workloads, per-request load balancing across identical replicas matters less than circuit breaking to upstream LLM endpoints and traffic shaping for canary deployments of new prompt versions. A concrete Istio DestinationRule for an LLM gateway service might look like:
```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: llm-gateway
spec:
  host: llm-gateway.ai-platform.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s
    connectionPool:
      http:
        http2MaxRequests: 50
        http1MaxPendingRequests: 25
```
This caps concurrent LLM calls at 50, queues up to 25 more, and ejects the upstream after three consecutive 5xx errors, so a failing model endpoint gets cut off before the failure cascades back through your orchestration layer. The connection pool limits are what save you when your provider throttles: without them, a spike to 10k QPS on your API layer will happily attempt 10k parallel LLM calls and incinerate your rate limits. We think this config is underspecified in most reference architectures, which is why teams discover the problem in production instead of in staging.
The orchestration layer needs durable state that microservices don't. A LangGraph workflow running for 45 seconds across multiple tool calls must survive pod eviction. Use Temporal or Airflow for long-running workflow durability—they give you checkpointing, replay semantics, and retry policies that map cleanly to multi-step AI pipelines. Redis works well for short-lived session state under five minutes, but don't use it as your sole persistence layer for anything you'd be upset about losing.
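With Temporal's Python SDK, the shape is roughly this; the activities and timeouts are placeholders we've picked for the example, not a recommended configuration:

```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.common import RetryPolicy


@activity.defn
async def retrieve_context(query: str) -> str:
    # Placeholder activity: the vector-store lookup lives here so it can be
    # retried and checkpointed independently of the rest of the pipeline.
    return "retrieved context"


@activity.defn
async def call_llm(prompt: str) -> str:
    # Placeholder activity: the LLM call; a pod eviction mid-workflow replays
    # from history instead of losing this step.
    return "model output"


@workflow.defn
class AIPipelineWorkflow:
    @workflow.run
    async def run(self, query: str) -> str:
        context = await workflow.execute_activity(
            retrieve_context,
            query,
            start_to_close_timeout=timedelta(seconds=30),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
        return await workflow.execute_activity(
            call_llm,
            f"{context}\n\n{query}",
            start_to_close_timeout=timedelta(seconds=60),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )
```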
Prometheus metrics for AI orchestration need custom instrumentation beyond the defaults. Track orchestrator_step_duration_seconds per named step, llm_tokens_consumed_total labeled by model and workflow, and agent_error_rate segmented by error type—hallucination classification, tool failure, timeout. A 0.5% hallucination baseline is achievable with retrieval-augmented pipelines. Without explicit measurement, you won't know you've drifted to 3% until a user complains.
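With the Prometheus Python client, that instrumentation is a handful of lines. The label values below are illustrative, and the error rate itself is derived by applying rate() to the counter in PromQL:

```python
from prometheus_client import Counter, Histogram

STEP_DURATION = Histogram(
    "orchestrator_step_duration_seconds",
    "Wall-clock duration of each named orchestration step",
    ["workflow", "step"],
)
TOKENS_CONSUMED = Counter(
    "llm_tokens_consumed_total",
    "Tokens consumed, labeled by model and workflow",
    ["model", "workflow"],
)
AGENT_ERRORS = Counter(
    "agent_errors_total",
    "Agent errors segmented by type (hallucination, tool_failure, timeout)",
    ["workflow", "error_type"],
)

# Usage inside a step -- names here are illustrative:
with STEP_DURATION.labels(workflow="order_support", step="retrieve").time():
    ...  # run the retrieval step
TOKENS_CONSUMED.labels(model="gpt-4o", workflow="order_support").inc(1234)
AGENT_ERRORS.labels(workflow="order_support", error_type="tool_failure").inc()
```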
OPA and Vault both have roles here, but not quite the ones they play in a typical microservices stack. OPA policies at the orchestration layer should gate which tools an agent can invoke based on the request's authorization context—not just "is the user authenticated" but "is this agent invocation permitted to call the payments tool for this user tier." Vault handles secrets rotation for LLM API keys centrally, so your fifty microservices aren't each managing their own OpenAI credentials.
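The enforcement point can be a single check against OPA's data API before the orchestrator hands a tool to the agent. A sketch, with a hypothetical policy path and input shape:

```python
import requests

# Sidecar OPA; the policy package path is hypothetical.
OPA_URL = "http://localhost:8181/v1/data/orchestrator/tools/allow"

def tool_invocation_allowed(agent_id: str, tool_name: str, user_tier: str) -> bool:
    """Ask OPA whether this agent may invoke this tool for this user tier."""
    resp = requests.post(
        OPA_URL,
        json={"input": {"agent": agent_id, "tool": tool_name, "user_tier": user_tier}},
        timeout=1,
    )
    resp.raise_for_status()
    return resp.json().get("result", False)  # undefined policy result means deny

# Gate the payments tool before the agent ever sees it:
if not tool_invocation_allowed("order-support-agent", "payments", "free"):
    raise PermissionError("OPA policy denied the payments tool for this user tier")
```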
Centralizing LLM credentials in Vault with dynamic secrets and short TTLs is one of those operational changes that feels like overhead until a key gets compromised. Then it pays for itself in about twenty minutes.
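The orchestration-layer version of that change is small: fetch the credential from Vault at startup or on a rotation interval instead of baking it into every service's environment. A sketch using hvac with a hypothetical mount and path, skipping the Kubernetes auth handshake you'd use in practice:

```python
import hvac

# In-cluster Vault address is illustrative; authenticate via the Kubernetes auth
# method in practice rather than a static token.
client = hvac.Client(url="https://vault.ai-platform.svc.cluster.local:8200")

secret = client.secrets.kv.v2.read_secret_version(path="llm/openai", mount_point="ai-platform")
openai_api_key = secret["data"]["data"]["api_key"]
```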