Building a multi-tenant SaaS platform on top of LLM infrastructure presents a category of isolation problems that are structurally different from traditional multi-tenant database isolation. With databases, the isolation contract is well-understood: separate schemas, row-level security, or separate instances. The audit trail is the transaction log. Access control is mature and tested.
With LLM inference infrastructure, the problem is more complex. The "data" being isolated isn't just stored records: it's prompts, completions, system instructions, and the behavior of the model itself under tenant-specific configurations. A tenant who pays for premium model access shouldn't subsidize another tenant's heavy usage through shared rate limits. A tenant whose data is under GDPR constraints shouldn't have their inference logs stored in a region chosen to satisfy another tenant's compliance requirements. An enterprise customer's internal documents referenced in RAG context shouldn't be extractable by another tenant through model behavior side-channels.
This post covers three patterns for implementing tenant isolation in a shared inference gateway, with the tradeoffs and failure modes of each.
The scope of tenant isolation in LLM infrastructure
Before choosing a pattern, it helps to be explicit about what you're isolating. In a multi-tenant LLM gateway, there are at least four distinct isolation axes:
- Model allowlist isolation: Which models can this tenant's requests reach? A free-tier tenant may only access Llama 3.1 8B. An enterprise tenant may have access to Claude 3.5 Sonnet and a custom fine-tune on your private cluster.
- Rate limit isolation: How much throughput can this tenant consume before requests are queued or rejected? Without this, one tenant's batch job saturates the gateway and degrades latency for everyone else.
- Audit log isolation: Where are this tenant's request logs written, and who can access them? Some enterprise customers require that their audit logs are stored in a tenant-controlled S3 bucket under their own AWS account, not yours.
- Data class policy isolation: Can tenant A's routing policy (e.g., "never send data to external APIs") affect how tenant B's requests are routed? It shouldn't be possible to misconfigure one tenant's policy in a way that impacts another's.
A B2B SaaS platform with roughly 4,000 tenant organizations across three pricing tiers (self-serve, team, enterprise) will have different isolation requirements at each tier — but all four axes need to be addressed for every tenant, with different enforcement intensities.
Pattern 1: Per-tenant model allowlists
The most fundamental isolation mechanism is controlling which model endpoints each tenant can reach. This is implemented in the gateway's tenant registry as a per-tenant allowlist evaluated at routing time, before cost or latency optimization logic runs.
tenants:
  - id: acme-corp-enterprise
    plan: enterprise
    model_allowlist:
      - anthropic-claude-35-sonnet
      - anthropic-claude-3-haiku
      - private-gpu-acme-finetune   # Their dedicated fine-tune
      - bedrock-llama-3-70b
    model_denylist: []              # Enterprise gets all allowlisted models
    default_model: anthropic-claude-3-haiku
  - id: startup-inc-team
    plan: team
    model_allowlist:
      - anthropic-claude-3-haiku
      - openai-gpt4o-mini
      - bedrock-llama-3-8b
    model_denylist:
      - anthropic-claude-35-sonnet  # Not on their plan
      - private-gpu-*               # Private GPU not included in team tier
    default_model: openai-gpt4o-mini
  - id: hobbyco-starter
    plan: starter
    model_allowlist:
      - bedrock-llama-3-8b
      - openai-gpt4o-mini
    cost_cap_usd_per_day: 10.00
    default_model: bedrock-llama-3-8b
The enforcement happens at the gateway's policy evaluation layer, not in the calling application. When acme-corp-enterprise sends a request with model: claude-3-5-sonnet-20241022, the gateway verifies that the model is on their allowlist before routing. When hobbyco-starter attempts the same request, the gateway returns HTTP 403 with the message "Policy violation: model not permitted for tenant plan" before any upstream call is made.
The critical correctness requirement: allowlist checks must be evaluated before cost optimization or fallback chain logic. A fallback chain that upgrades a request to a more capable model (to handle a format error from a cheaper model) must check the fallback model against the tenant's allowlist before executing the upgrade.
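As a rough illustration of the check described above, here is a minimal Python sketch. The `TenantPolicy` class, the `PolicyViolation` exception, and the function names are hypothetical and not part of any real gateway API. It assumes model names have already been normalized to the registry identifiers used in the config, and that deny patterns (including globs like `private-gpu-*`) take precedence over allow entries.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class TenantPolicy:
    """Hypothetical in-memory view of one tenant registry entry."""
    tenant_id: str
    model_allowlist: tuple[str, ...]
    model_denylist: tuple[str, ...] = ()


class PolicyViolation(Exception):
    """Raised before any upstream call; a gateway would map this to HTTP 403."""
    status_code = 403


def is_model_permitted(policy: TenantPolicy, model: str) -> bool:
    # Assumption: deny patterns win over allow entries and may be globs.
    if any(fnmatchcase(model, pattern) for pattern in policy.model_denylist):
        return False
    return model in policy.model_allowlist


def authorize(policy: TenantPolicy, model: str) -> str:
    """Return the model unchanged if permitted, otherwise raise PolicyViolation."""
    if not is_model_permitted(policy, model):
        raise PolicyViolation(
            f"model {model!r} not permitted for tenant {policy.tenant_id!r}"
        )
    return model


startup = TenantPolicy(
    tenant_id="startup-inc-team",
    model_allowlist=("anthropic-claude-3-haiku", "openai-gpt4o-mini", "bedrock-llama-3-8b"),
    model_denylist=("anthropic-claude-35-sonnet", "private-gpu-*"),
)
```

Calling `authorize(startup, "openai-gpt4o-mini")` returns the model name, while `authorize(startup, "private-gpu-acme-finetune")` raises `PolicyViolation`. Keeping the check in one function makes it easy to call from every place that selects a model, not just initial routing.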
Pattern 2: Audit bucket isolation
Audit log isolation is the requirement that one tenant's inference logs cannot be accessed by another tenant, and optionally that an enterprise tenant can write logs directly to their own infrastructure rather than yours.
The basic implementation: each tenant has a separate audit log destination configured in the gateway registry. Self-serve and team tenants write to partitioned paths in your centralized audit store (s3://your-audit-bucket/tenant-id/YYYY/MM/DD/). Enterprise tenants can configure a BYO destination — an S3 bucket or Kinesis stream in their own AWS account, accessed via an IAM cross-account role.
tenants:
  - id: finco-enterprise
    audit_destination:
      type: s3-cross-account
      bucket: arn:aws:s3:::finco-llm-audit-prod
      role_arn: arn:aws:iam::123456789012:role/KamiwazaAuditWriter
      encryption: aws:kms
      kms_key_id: arn:aws:kms:us-east-1:123456789012:key/abc123
      retention_days: 365  # Tenant controls their own retention
  - id: startup-inc-team
    audit_destination:
      type: managed
      partition: tenant-id/startup-inc-team
      retention_days: 90
The audit log for each request must capture at minimum: tenant ID, timestamp (UTC), request fingerprint hash (not the raw prompt — hashing preserves audit utility without storing PII in the audit system itself), endpoint routed to, policy rules evaluated, data class tag, tokens consumed, and any policy violations. For the fintech enterprise tenant, their audit log may also need to satisfy financial services record-keeping requirements under SEC Rule 17a-4 or equivalent.
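To make the minimum field set concrete, here is a small Python sketch of building one audit record. The function and field names are illustrative assumptions, not a documented schema, and the fingerprint is a plain SHA-256 digest of the prompt text.

```python
import hashlib
import json
from datetime import datetime, timezone


def request_fingerprint(prompt: str) -> str:
    """SHA-256 hex digest of the prompt; the raw text never enters the record."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def build_audit_record(tenant_id, prompt, endpoint, rules_evaluated,
                       data_class, tokens_consumed, violations=()):
    """Assemble one audit record with the minimum fields listed above."""
    if not tenant_id:
        raise ValueError("tenant_id is mandatory on every audit record")
    return {
        "tenant_id": tenant_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),  # always UTC
        "request_fingerprint": request_fingerprint(prompt),
        "endpoint": endpoint,
        "policy_rules_evaluated": list(rules_evaluated),
        "data_class": data_class,
        "tokens_consumed": tokens_consumed,
        "policy_violations": list(violations),
    }


record = build_audit_record(
    tenant_id="finco-enterprise",
    prompt="Summarize the Q3 filing",
    endpoint="anthropic-claude-3-haiku",
    rules_evaluated=["model_allowlist", "data_class"],
    data_class="pii-restricted",
    tokens_consumed=412,
)
print(json.dumps(record, indent=2))
```

One design caveat: an unkeyed digest of a short or guessable prompt can be brute-forced, so a deployment with strict PII requirements might prefer a keyed hash (for example HMAC with a per-tenant key). That choice is outside what this sketch shows.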
We're not saying every platform needs per-tenant audit bucket isolation on day one. We're saying that the architecture needs to support it before your first enterprise customer signs a security review, because retrofitting it after the fact requires rebuilding the audit pipeline — and enterprise sales timelines don't accommodate that kind of remediation.
Pattern 3: Per-tenant rate limits and throughput guardrails
Rate limit isolation prevents one tenant's usage pattern from degrading service quality for others. This is the hardest pattern to implement correctly because it requires both per-tenant tracking and fair queuing semantics that don't create starvation for lower-priority tenants.
The parameters that need per-tenant limits in a real gateway:
- RPM (requests per minute): Hard limit on request rate, enforced with a token bucket or sliding window algorithm. Exceeding this returns HTTP 429 with a Retry-After header.
- TPM (tokens per minute): Limit on token consumption rate, which is more meaningful for LLM workloads than request count because token consumption variance across requests is high.
- Concurrent request limit: Maximum in-flight requests at any given moment. Prevents a single tenant from holding a large number of open streaming connections that consume gateway resources.
- Daily cost cap: Hard stop on total spend attribution in a billing day. Relevant for self-serve tenants on free trials or starter plans where unexpected usage could create billing disputes.
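The RPM and TPM limits in the list above can be sketched with two token buckets per tenant. This is a single-process illustration with hypothetical class names. A real gateway would need shared state across instances, and the concurrent-request limit and daily cost cap are not covered here.

```python
import time


class TokenBucket:
    """Refills continuously at rate_per_minute and holds at most `capacity` units."""

    def __init__(self, rate_per_minute, capacity, clock=time.monotonic):
        self.rate_per_second = rate_per_minute / 60.0
        self.capacity = capacity
        self.available = float(capacity)
        self.clock = clock
        self.last_refill = clock()

    def wait_time(self, amount):
        """Seconds until `amount` units are available (0.0 means available now)."""
        now = self.clock()
        refill = (now - self.last_refill) * self.rate_per_second
        self.available = min(self.capacity, self.available + refill)
        self.last_refill = now
        if amount <= self.available:
            return 0.0
        return (amount - self.available) / self.rate_per_second

    def consume(self, amount):
        self.available -= amount


class TenantLimiter:
    """One RPM bucket and one TPM bucket for a single tenant."""

    def __init__(self, rpm, tpm, clock=time.monotonic):
        self.requests = TokenBucket(rpm, rpm, clock)
        self.tokens = TokenBucket(tpm, tpm, clock)

    def admit(self, estimated_tokens):
        """Return 0.0 if admitted, else a Retry-After hint in seconds."""
        if estimated_tokens > self.tokens.capacity:
            raise ValueError("request exceeds the TPM bucket and can never be admitted")
        wait = max(self.requests.wait_time(1), self.tokens.wait_time(estimated_tokens))
        if wait == 0.0:
            # Consume from both buckets only once both checks have passed.
            self.requests.consume(1)
            self.tokens.consume(estimated_tokens)
        return wait


limiter = TenantLimiter(rpm=500, tpm=500_000)
retry_after = limiter.admit(estimated_tokens=1_200)  # 0.0 while under both limits
```

A non-zero return value is the number of seconds a caller would put in the Retry-After header of the 429 response. Checking both buckets before consuming from either avoids charging a tenant's request budget for a call that the token budget then rejects.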
The fair queuing requirement: when the gateway is under load and multiple tenants are at their RPM limits, how do new requests get prioritized? A naive implementation processes requests in arrival order — which means a high-volume tenant who fires a batch job at midnight will starve interactive requests from lower-volume tenants who need sub-second response times. A properly implemented per-tenant queue uses weighted fair queuing: each tenant gets a share of gateway capacity proportional to their plan tier, with burst allowances for short spikes.
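The sketch below shows one simplified way to get this behavior: a self-clocked approximation of weighted fair queuing, where each request gets a virtual finish time of `start + cost / weight` and the queue always serves the smallest finish time. The `FairQueue` class and the weights are assumptions for illustration, and burst allowances are not modeled.

```python
import heapq
import itertools


class FairQueue:
    """Simplified self-clocked fair queue keyed on virtual finish time."""

    def __init__(self, weights):
        self.weights = weights            # tenant_id -> share of capacity
        self.virtual_time = 0.0           # finish tag of the last served request
        self.last_finish = {}             # tenant_id -> finish tag of its newest request
        self.heap = []
        self.sequence = itertools.count() # tie-breaker so requests are never compared

    def enqueue(self, tenant_id, request, cost=1.0):
        # A tenant with a backlog starts after its own previous request;
        # an idle tenant starts at the current virtual time.
        start = max(self.virtual_time, self.last_finish.get(tenant_id, 0.0))
        finish = start + cost / self.weights[tenant_id]
        self.last_finish[tenant_id] = finish
        heapq.heappush(self.heap, (finish, next(self.sequence), tenant_id, request))

    def dequeue(self):
        finish, _, tenant_id, request = heapq.heappop(self.heap)
        self.virtual_time = finish
        return tenant_id, request

    def __len__(self):
        return len(self.heap)


queue = FairQueue({"batch-tenant": 1.0, "interactive-tenant": 1.0})
for i in range(100):
    queue.enqueue("batch-tenant", f"batch-{i}")
queue.enqueue("interactive-tenant", "chat-0")  # arrives after the whole batch

first_two = [queue.dequeue() for _ in range(2)]
print(first_two)
```

With arrival-order processing the interactive request would wait behind all 100 batch requests. Here it is served second, because the batch tenant's later requests carry progressively larger finish times. Raising a tenant's weight shrinks the per-request increment, which is how a higher plan tier would receive a larger share.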
Isolation failure modes to test explicitly
The three patterns above describe what to build. These are the failure modes that will bite you if you test only the happy path:
Tenant ID spoofing via header injection: If the gateway extracts tenant identity from a header that the calling application sets (e.g., X-Tenant-ID), a misconfigured application can present another tenant's ID and receive that tenant's model allowlist and routing policy. Tenant identity at the gateway must be derived from the API key itself, not from a user-supplied header. The API key carries the tenant claim; the header is informational only and must be validated against the key claim.
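A minimal sketch of that rule in Python follows. The key store, the demo keys, and the `resolve_tenant` function are hypothetical. The sketch assumes header names have already been normalized by the HTTP layer and that the gateway stores only a hash of each API key.

```python
import hashlib


class AuthError(Exception):
    def __init__(self, status_code, message):
        super().__init__(message)
        self.status_code = status_code


def hash_key(api_key: str) -> str:
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


# Hypothetical key store: hashed API key -> tenant ID.
KEY_TO_TENANT = {
    hash_key("sk-demo-acme"): "acme-corp-enterprise",
    hash_key("sk-demo-startup"): "startup-inc-team",
}


def resolve_tenant(headers: dict) -> str:
    """Derive the tenant from the API key; treat X-Tenant-ID as a cross-check only."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme != "Bearer" or not token:
        raise AuthError(401, "missing bearer token")

    tenant_id = KEY_TO_TENANT.get(hash_key(token))
    if tenant_id is None:
        raise AuthError(401, "unknown API key")

    # The header never selects the tenant. If present, it must agree with the key.
    claimed = headers.get("X-Tenant-ID")
    if claimed is not None and claimed != tenant_id:
        raise AuthError(403, "X-Tenant-ID does not match the API key's tenant")
    return tenant_id
```

A request carrying `startup-inc-team`'s key and `X-Tenant-ID: acme-corp-enterprise` is rejected with 403 instead of being routed under the enterprise tenant's policy.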
Fallback chain policy bypass: A fallback chain that cascades through multiple endpoints on error must re-evaluate the tenant's allowlist at each fallback step. A routing policy that says "fallback to claude-3-5-sonnet on error" will bypass the tenant's model allowlist if the allowlist check only runs at initial routing and not at fallback selection.
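One way to structure this is to pass the permission check into the fallback loop, so that it runs for every candidate rather than once up front. The names below (`call_with_fallback`, `UpstreamError`, the `is_permitted` predicate) are illustrative assumptions.

```python
class UpstreamError(Exception):
    """Stand-in for any error that should trigger the next fallback step."""


def call_with_fallback(chain, is_permitted, call_model):
    """Try models in order, re-checking the tenant's policy at every step.

    chain:        ordered model names from the routing policy
    is_permitted: predicate bound to the current tenant's allowlist
    call_model:   function that performs the upstream call
    """
    last_error = None
    for model in chain:
        if not is_permitted(model):
            continue  # never "upgrade" to a model outside the tenant's allowlist
        try:
            return model, call_model(model)
        except UpstreamError as exc:
            last_error = exc
    raise RuntimeError("no permitted model in the fallback chain succeeded") from last_error


# Usage with the starter tenant's allowlist from the config above.
starter_allowlist = {"bedrock-llama-3-8b", "openai-gpt4o-mini"}
calls = []


def fake_upstream(model):
    calls.append(model)
    if model == "bedrock-llama-3-8b":
        raise UpstreamError("format error")
    return f"response from {model}"


chain = ["bedrock-llama-3-8b", "anthropic-claude-35-sonnet", "openai-gpt4o-mini"]
model, response = call_with_fallback(chain, starter_allowlist.__contains__, fake_upstream)
```

In this run the chain skips `anthropic-claude-35-sonnet` without calling it and lands on `openai-gpt4o-mini`. A useful regression test is exactly this shape: force an error on the first model and assert that no disallowed model appears in the recorded calls.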
Audit log cross-contamination under high write throughput: When audit log writes are batched for efficiency, a flush that combines records from multiple tenants into a single partition write can create a shared-key collision if the partition key logic has an off-by-one on the tenant ID path segment. This is a correctness bug that looks like a performance optimization. Every audit record must carry the tenant ID as a mandatory field, and batch writes must partition by tenant before flushing.
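A sketch of a flush that partitions by tenant before writing is shown below. The `write` callback, the record shape, and the `tenant-id/YYYY/MM/DD/` key layout are assumptions based on the path format mentioned earlier. The timestamp is assumed to be an ISO 8601 UTC string.

```python
from collections import defaultdict


def partition_key(tenant_id: str, day: str) -> str:
    """Build 'tenant-id/YYYY/MM/DD/'; reject IDs that could alter the path."""
    if not tenant_id or "/" in tenant_id:
        raise ValueError(f"invalid tenant_id for partitioning: {tenant_id!r}")
    return f"{tenant_id}/{day}/"


def flush(batch, write):
    """Group a mixed batch by tenant partition, then issue one write per partition."""
    by_partition = defaultdict(list)
    # Validate and group every record before writing anything, so a bad
    # record cannot leave the batch half-flushed.
    for record in batch:
        tenant_id = record.get("tenant_id")
        if not tenant_id:
            raise ValueError("audit record missing mandatory tenant_id")
        day = record["timestamp"][:10].replace("-", "/")  # '2025-01-02' -> '2025/01/02'
        by_partition[partition_key(tenant_id, day)].append(record)

    for key, records in by_partition.items():
        write(key, records)  # each write contains exactly one tenant's records


writes = {}
flush(
    [
        {"tenant_id": "finco-enterprise", "timestamp": "2025-01-02T03:04:05+00:00"},
        {"tenant_id": "startup-inc-team", "timestamp": "2025-01-02T03:04:06+00:00"},
        {"tenant_id": "finco-enterprise", "timestamp": "2025-01-02T03:04:07+00:00"},
    ],
    lambda key, records: writes.setdefault(key, []).extend(records),
)
```

Because the key is derived from the record's own `tenant_id` field rather than from a position in a path string, there is no path segment index to get wrong.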
KV cache prefix sharing across tenants: vLLM's prefix caching allows multiple requests with the same prefix (e.g., a shared system prompt) to reuse cached key-value computation, reducing TTFT significantly. In a multi-tenant deployment, this optimization must be scoped per-tenant, even if the system prompts are identical. A cache hit requires a matching token prefix, so the main risk is not that one tenant's RAG context is replayed verbatim into another tenant's completion. The risk is that the shared cache becomes a side channel: a tenant can send candidate prefixes and use TTFT differences to infer whether another tenant has recently sent the same content, including document text retrieved for RAG.
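One generic way to scope a prefix cache per tenant is to mix the tenant ID into the cache key, so identical prefixes from different tenants never resolve to the same entry. The function below is an illustration of that idea only. It is not vLLM's internal hashing, and `prefix_cache_key` is a hypothetical name.

```python
import hashlib


def prefix_cache_key(tenant_id: str, prefix_token_ids: list) -> str:
    """Cache key that depends on both the tenant and the token prefix."""
    digest = hashlib.sha256()
    tenant_bytes = tenant_id.encode("utf-8")
    # Length-prefix the tenant ID so the tenant/token boundary is unambiguous.
    digest.update(len(tenant_bytes).to_bytes(4, "big"))
    digest.update(tenant_bytes)
    for token_id in prefix_token_ids:
        digest.update(token_id.to_bytes(4, "big"))
    return digest.hexdigest()


shared_system_prompt = [101, 2023, 2003, 1037, 2291, 25732, 102]  # made-up token IDs
key_a = prefix_cache_key("acme-corp-enterprise", shared_system_prompt)
key_b = prefix_cache_key("startup-inc-team", shared_system_prompt)
```

Here `key_a` and `key_b` differ even though the token prefix is the same, so neither tenant can observe a cache hit caused by the other. The cost is that a system prompt shared by many tenants is computed once per tenant instead of once overall.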
Putting it together: Kamiwaza's tenant configuration model
The three patterns above — model allowlist, audit bucket isolation, and rate limit guardrails — are implemented as first-class constructs in Kamiwaza's tenant registry. A new tenant is provisioned by creating a tenant record in the gateway config:
curl -X POST https://api.kamiwazaai.org/v1/tenants \
  -H "Authorization: Bearer $ADMIN_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "tenant_id": "acme-corp",
    "plan": "enterprise",
    "model_allowlist": ["anthropic-claude-3-haiku", "bedrock-llama-3-70b"],
    "rate_limits": { "rpm": 500, "tpm": 500000, "concurrent": 50 },
    "audit_destination": { "type": "managed", "retention_days": 365 },
    "data_class_policy": "pii-restricted"
  }'
The tenant receives an API key scoped to their configuration. Their application developers use this key and never need to know which model they're routing to — the gateway handles model selection within the tenant's allowlist based on the routing policy's cost and latency optimization rules.
What this means operationally: adding a new enterprise customer to the platform is a gateway configuration change, not an infrastructure provisioning operation. You don't spin up a new gateway instance, you don't add a new deployment pipeline. You add a tenant record, issue an API key, and the isolation guarantees are live within the next configuration sync cycle.