July 4, 2025
Most AI security reviews we've seen are web-app checklists with "LLM" labels pasted on top. That gap is exactly where the breaches happen. The attack surface of an enterprise AI orchestration layer is categorically different from a web app: tools execute real code, models emit text that downstream systems trust implicitly, and the blast radius of a single compromised agent can span every SaaS integration in your estate. These twelve controls are what our team treats as non-negotiable before any agent workflow touches production traffic.
Prompt injection is the SQL injection of the AI era—and it's significantly harder to detect because there's no grammar to parse, only semantics. Most teams get this wrong by only classifying the raw user prompt. The dangerous injections arrive in RAG-retrieved documents, because the vector store (Pinecone, Weaviate, pgvector) implicitly trusts its own content. We run a fine-tuned 7B classifier at p95 under 40ms that catches exactly those indirect injections before they enter a LangChain or LangGraph pipeline. Pair that with strict schema validation on structured inputs—JSON Schema or Pydantic—so any deviation gets a hard 400 before it reaches model context.
A platform team we worked with last quarter had an AutoGen agent with a shell-exec tool and an overly broad IAM role. One malformed instruction later, it was making API calls it had no business making. The fix is structural, not prompt-based.
When any AutoGen or CrewAI agent calls a tool—filesystem read, shell exec, HTTP fetch—that tool must run in an isolated execution context. We containerize every tool handler on Kubernetes with a dedicated service account, no ambient cloud credentials, and a strict seccomp profile. Network egress from tool pods is whitelisted at the Istio policy layer; anything not on the allowlist is dropped with a metric emitted to Prometheus. Model inference endpoints get the same treatment: each model family (GPT-4o, Claude, Llama-3) lives in its own namespace with ResourceQuota and NetworkPolicy. Cross-model calls go through a brokered API, never peer-to-peer, giving you a single audit point instead of a mesh of implicit trust.
An agent that can read your CRM and write to Slack is a data exfiltration pipe by definition. Model alignment is not a substitute for output filtering. You need two layers: pre-emission filtering and egress monitoring.
Before any model output reaches an external channel, run it through a classifier trained on PII patterns, internal code identifiers, and credential regexes. We use a regex engine plus a small BERT-based classifier in combination: the regex catches known-format secrets (AWS keys, JWTs) at sub-millisecond latency, while the classifier handles free-form sensitive content with a 0.5% false-negative target.
# Example output-filter config (simplified)
output_filter:
rules:
- name: aws_key
pattern: "AKIA[0-9A-Z]{16}"
action: redact
- name: internal_hostname
pattern: "\.(corp|internal)\b"
action: block_and_alert
classifier:
model: bert-pii-v3
threshold: 0.87
fallback_on_timeout: block
Rate limiting isn't cost control dressed up as security—it's your earliest signal that a credential is compromised or an agent is stuck in a loop. Set hard limits at both the API gateway (requests per minute per API key) and at the orchestration layer (LlamaIndex retrieval calls per workflow run). Export counters to Prometheus and alert at 80% of ceiling, not 100%. Waiting until you hit the wall is too late.
Every outbound HTTP call from an agent workflow should be proxied through a controlled egress gateway that logs destination, payload size, and response code to Kafka for stream processing. Anomaly detection on that Kafka topic catches novel exfiltration channels faster than any signature-based tool will. On dependencies: LangChain releases frequently and has had several supply-chain incidents. Pin versions. Maintain a Software Bill of Materials for every AI dependency and run automated SBOM diffs against a known-good baseline on every dependency update.
Output filtering beyond secrets means hallucination scoring and policy compliance. Every agent response that cites a fact or makes a recommendation should pass a grounding check against its source documents. We target a hallucination rate below 0.5% on factual retrieval tasks, measured weekly against a golden eval set running in Airflow. That number sounds small until you're operating at scale and "small" means thousands of wrong citations per day.
Incident response is the area where we see the most underinvestment. Traditional IR playbooks simply don't cover agent-native scenarios: how do you revoke tool access mid-session? How do you replay a conversation to reconstruct what data was accessed? How do you quarantine a compromised LlamaIndex index without taking down production retrieval? These need documented runbooks, with Temporal workflows handling the automated remediation steps—not just a Confluence page someone wrote once and never tested.
Red-team your AI attack surface on a fixed cadence. Quarterly is the minimum; bi-weekly if you're actively shipping new agent capabilities. The scope must include indirect prompt injection via data sources, tool chain abuse, and multi-turn context manipulation. Single-shot jailbreak testing alone tells you almost nothing about your actual risk posture.