February 18, 2026
Most teams bolt access control onto their AI platform as an afterthought, then spend three weeks in a compliance review explaining why a junior developer's credentials were used to invoke a GPT-4 fine-tune against the customer PII dataset. Getting RBAC right before you onboard your fifth team is far cheaper than retrofitting it after your fifteenth.
LLM platforms have a richer permission surface than a typical API gateway. You're not just gating HTTP endpoints. A well-structured RBAC model needs to cover at least four distinct resource types: model access (which foundation models or fine-tunes a principal can call), prompt-template access (which versioned templates can be rendered and sent), tool invocation (which external functions or APIs an agent may call at runtime), and data source access (which vector stores, relational tables, or retrieval pipelines can be queried).
Conflating these into a single "can use AI" boolean is exactly how you end up with an autonomous LangGraph agent that can read from Pinecone indices containing contractual data and call your Stripe billing API—because both were gated by the same role. We've seen this exact configuration at a platform team we worked with last quarter; it took a pentest finding to surface it. Treat each resource type as its own namespace in your policy engine from day one.
Open Policy Agent with Rego is the right choice for this layer. It decouples policy from your application code, supports dry-run evaluation for auditing, and runs cleanly as a Kubernetes sidecar (with OPA Gatekeeper covering the cluster's admission-control side). Below is a minimal Rego policy that gates model access by role and enforces a per-team token budget ceiling before the request ever reaches your LLM router:
package ai.access

default allow = false

# Allow a model call only when the principal's role may use the requested
# model and the team still has monthly token budget left for this request.
allow {
    input.resource.type == "model"
    role_has_model_permission[input.principal.role][input.resource.model_id]
    within_token_budget(input.principal.team_id, input.request.estimated_tokens)
}

# Role-to-model allowlist: each role maps to the set of model IDs it may invoke.
role_has_model_permission := {
    "ml-engineer": {"gpt-4o", "claude-3-5-sonnet", "llama-3-70b-instruct"},
    "analyst": {"gpt-4o-mini", "claude-3-haiku"},
    "agent-worker": {"gpt-4o-mini"}
}

# True when month-to-date usage plus this request stays within the team budget.
within_token_budget(team_id, tokens) {
    budget := data.budgets[team_id].monthly_tokens
    used := data.usage[team_id].current_month_tokens
    used + tokens <= budget
}
The data.budgets and data.usage documents come from your billing service, refreshed on a 60-second interval via OPA's bundle API (OPA polls for the bundle; you can also push updates through the v1 data API). At p95, OPA policy evaluation adds roughly 2–4ms of latency—well within a 400ms end-to-end budget for interactive requests. For agent loops making dozens of sequential decisions, cache the allow decision at the session level rather than re-evaluating it on every call in the loop.
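A sketch of the router-side check, assuming OPA is reachable as a sidecar and queried through its standard REST decision API; the input field names mirror the policy above, and the session-cache helper is illustrative rather than prescriptive:

import requests

OPA_URL = "http://localhost:8181/v1/data/ai/access/allow"  # OPA sidecar decision endpoint

def check_model_access(principal: dict, model_id: str, estimated_tokens: int) -> bool:
    """Ask OPA whether this principal may call this model right now."""
    input_doc = {
        "input": {
            "principal": {"role": principal["role"], "team_id": principal["team_id"]},
            "resource": {"type": "model", "model_id": model_id},
            "request": {"estimated_tokens": estimated_tokens},
        }
    }
    resp = requests.post(OPA_URL, json=input_doc, timeout=0.1)
    resp.raise_for_status()
    return resp.json().get("result", False)

# For agent loops, cache the decision per (session, role, model) instead of
# re-evaluating on every step of the loop.
_session_decisions: dict[tuple, bool] = {}

def check_cached(session_id: str, principal: dict, model_id: str, estimated_tokens: int) -> bool:
    key = (session_id, principal["role"], model_id)
    if key not in _session_decisions:
        _session_decisions[key] = check_model_access(principal, model_id, estimated_tokens)
    return _session_decisions[key]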
Model API keys, database credentials for pgvector or Elasticsearch, and third-party tool secrets should never live in environment variables or Kubernetes Secrets in plaintext. Full stop. HashiCorp Vault's dynamic secrets engine is the answer here: each team gets a Vault namespace, and each agent workload gets a short-lived token with a 15-minute TTL, issued through Vault's Kubernetes auth method with a role bound to the pod's service account.
When the agent needs to query Weaviate or call an external REST tool, it requests the credential on demand rather than holding it in memory for the session lifetime. This limits blast radius: a compromised agent process can make at most a handful of calls before the credential expires. Rotate your static model provider API keys through Vault's KV v2 engine with versioning enabled so you have a clean audit trail of which version was active during any given inference call.
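A minimal sketch of that on-demand pattern using the hvac client; the Vault address, role name, and mount paths are placeholders for whatever your platform team has configured:

import hvac

def vault_login_from_pod(role: str) -> hvac.Client:
    """Authenticate with the pod's projected service-account token; the
    returned client token carries the short (e.g. 15-minute) TTL."""
    jwt = open("/var/run/secrets/kubernetes.io/serviceaccount/token").read()
    client = hvac.Client(url="https://vault.internal:8200")  # placeholder address
    client.auth.kubernetes.login(role=role, jwt=jwt)
    return client

def fetch_pgvector_creds(client: hvac.Client) -> tuple[str, str]:
    """Request a dynamic, short-lived database credential on demand rather
    than holding a static password for the session lifetime."""
    lease = client.secrets.database.generate_credentials(name="pgvector-readonly")
    return lease["data"]["username"], lease["data"]["password"]

def fetch_model_api_key(client: hvac.Client, provider: str) -> str:
    """Static provider keys live in KV v2, so each version is auditable."""
    secret = client.secrets.kv.v2.read_secret_version(path=f"model-keys/{provider}")
    return secret["data"]["data"]["api_key"]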
A human analyst who hits the wrong endpoint gets an error and stops. An autonomous agent running a CrewAI or AutoGen workflow will retry, rephrase, and potentially find a side channel. Most teams get this wrong by applying the same mental model to agents that they apply to human users. The principle of least privilege matters more for agentic workloads than for any other principal type.
Define tool permissions as a whitelist, not a blacklist: an agent role should enumerate exactly which tools it may call, and anything outside that list is denied at dispatch.
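A role-to-tool allowlist might look like the following; role and tool names here are placeholders:

# Tools each agent role may invoke; anything absent is denied at dispatch.
AGENT_TOOL_ALLOWLIST: dict[str, set[str]] = {
    "support-triage-agent": {"search_tickets", "summarize_ticket"},
    "billing-report-agent": {"query_invoices"},
    "agent-worker": {"search_tickets"},
}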
LlamaIndex's tool abstraction makes this tractable: wrap each tool with a metadata annotation that your OPA policy can inspect at dispatch time. A hallucinating agent that tries to invoke an unapproved tool gets a policy denial logged as a structured event, not a silent no-op. That distinction matters for your 0.5% hallucination baseline monitoring—you want to know when an agent is attempting actions outside its declared scope, not just when it returns factually wrong text.
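A dispatch-time guard, continuing the allowlist sketch above, could look roughly like this; in the full setup the membership check would be the OPA policy call rather than a local lookup, and the logging plumbing is illustrative:

import json
import logging
import time

log = logging.getLogger("ai.tool_dispatch")

def dispatch_tool(role: str, tool_name: str, tool_fn, **kwargs):
    """Gate a tool call against the role's allowlist before executing it; a
    denied call raises and emits a structured event instead of a silent no-op."""
    if tool_name not in AGENT_TOOL_ALLOWLIST.get(role, set()):
        log.warning(json.dumps({
            "event": "tool_policy_denial",
            "principal_role": role,
            "tool": tool_name,
            "ts": time.time(),
        }))
        raise PermissionError(f"role {role!r} may not invoke tool {tool_name!r}")
    return tool_fn(**kwargs)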
Vector store access is often treated as binary, but you need attribute-based filtering layered on top of RBAC. A Weaviate tenant scoped to a team namespace handles coarse-grained isolation. For fine-grained control—say, an analyst who can query customer support tickets but not tickets tagged with legal-hold status—push filter predicates into the retrieval call itself. Your retrieval service reads the principal's data-source policy from OPA, constructs a metadata filter ({"path": ["sensitivity"], "operator": "NotEqual", "valueText": "legal-hold"}), and appends it to every Weaviate or Elasticsearch query before execution.
This adds about 8ms to retrieval latency at the roughly 10k QPS we run on a mid-sized deployment. That's acceptable. The alternative is a compliance incident.
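A sketch of that filter injection, assuming the v3-style Weaviate Python client; the class name, property names, and the shape of the policy returned by OPA are placeholders:

import weaviate

client = weaviate.Client("http://weaviate.internal:8080")  # placeholder address

def retrieve_tickets(query_text: str, principal_policy: dict, limit: int = 5):
    """Append the principal's data-source filter to every retrieval call.
    principal_policy is whatever the OPA data-source policy returned for this
    principal, e.g. {"exclude_sensitivity": "legal-hold"}."""
    where_filter = {
        "path": ["sensitivity"],
        "operator": "NotEqual",
        "valueText": principal_policy["exclude_sensitivity"],
    }
    return (
        client.query.get("SupportTicket", ["ticket_id", "content"])
        .with_near_text({"concepts": [query_text]})
        .with_where(where_filter)  # policy-derived filter, never optional
        .with_limit(limit)
        .do()
    )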
An audit trail that lives only in application logs is not an audit trail—it's a wishlist. Every policy decision (allow or deny), every tool invocation, and every model call should emit a structured event to Kafka, consumed by a retention pipeline that writes to immutable storage with a 90-day hot tier and a two-year cold tier. Include the OPA decision ID, principal identity, resource identifier, token count, and the specific policy version evaluated. This gives your security team a chain of custody that can answer "which agent, running under which service account, accessed which data source, at what time, under which policy version" in under 30 seconds.
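One way the event might be shaped and emitted, using kafka-python; the topic name and exact field names are illustrative, not a fixed schema:

import json
import time
import uuid
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka.internal:9092",  # placeholder broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def emit_audit_event(decision_id: str, principal: dict, resource: dict,
                     decision: str, token_count: int, policy_version: str) -> dict:
    """One structured event per policy decision, tool invocation, or model call."""
    event = {
        "event_id": str(uuid.uuid4()),
        "ts": time.time(),
        "opa_decision_id": decision_id,
        "principal": principal,      # role, team_id, service account
        "resource": resource,        # e.g. {"type": "model", "model_id": "gpt-4o"}
        "decision": decision,        # "allow" or "deny"
        "token_count": token_count,
        "policy_version": policy_version,
    }
    producer.send("ai-audit-events", value=event)
    return event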
On the budget side: aggregate token consumption per team in Prometheus, expose it as a gauge, and alarm in Grafana when a team hits 80% of their monthly allocation. At that threshold, auto-downgrade their model tier from GPT-4o to GPT-4o-mini rather than hard-blocking. We've measured 40–70% cost reduction on overflow traffic with this approach, and workflows keep running—teams barely notice until they see the monthly report.
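A sketch of the metering and downgrade logic using prometheus_client; the metric name, threshold, and tier mapping are illustrative, and the 80% alert itself would live in your Prometheus or Grafana alerting rules:

from prometheus_client import Gauge

# Fraction of the monthly token allocation each team has consumed (0.0 to 1.0).
budget_used_ratio = Gauge(
    "ai_team_token_budget_used_ratio",
    "Share of monthly token budget consumed",
    ["team_id"],
)

def pick_model_tier(team_id: str, used_tokens: int, monthly_budget: int,
                    requested_model: str) -> str:
    """Auto-downgrade overflow traffic instead of hard-blocking past 80% of budget."""
    ratio = used_tokens / monthly_budget
    budget_used_ratio.labels(team_id=team_id).set(ratio)
    downgrade = {"gpt-4o": "gpt-4o-mini"}  # overflow tier mapping
    if ratio >= 0.8 and requested_model in downgrade:
        return downgrade[requested_model]
    return requested_model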