/12 min read/Tracehold team

Prompt injection detection: 6 detectors you can ship in a weekend

You do not need a fine-tuned classifier to catch the bulk of prompt injection and tool-misuse attempts against production agents. Six boring rule-based detectors cover most of the real-world signal. Here is the code.

prompt injectionAI agent securityOWASP ASIruntime guardrailsdetectorsagentic AI

The "we need a model for this" trap

Every team that first takes AI agent security seriously hits the same fork in the road. The room splits into two camps.

One camp wants to train a classifier. They start sketching fine-tuning budgets, labelling pipelines, a data flywheel. Six months later they have a research project.

The other camp ships a list of regexes on Friday afternoon, watches real production traffic on Monday, and has blocked actual attack attempts by Wednesday.

The first camp is right that a well-trained detection model is more robust. The second camp is right that you should not let that be the reason you ship nothing.

Most of the prompt injection and tool-misuse attempts we see against production agents are not adversarial masterpieces. They are copy-pastes from a Hacker News thread, leaked user inputs with ignore previous instructions in them, a misconfigured retrieval-augmented workflow that lets a tainted document reach a tool call. A rule-based detector catches them. A surprising amount of the real signal is boring.

This post walks through six detectors from our production alerts pipeline. Each one is a short function, each one fires on real traffic, and each one is easy enough that a senior engineer can ship all six in a weekend.

The shape of a detector

Before the rules, the interface. A detector takes a context object describing a single gateway intercept and returns either None (clean) or a finding with a severity, an OWASP code, and a remediation hint.

@dataclass
class DetectorContext:
    organization_id: str
    agent_id: str
    task_id: str
    action_identifier: str        # e.g. "aws:s3:deleteobject"
    risk_level: str               # READ | WRITE | DESTRUCTIVE | CRITICAL
    decision: str                 # ALLOW | BLOCK | REQUIRES_APPROVAL
    parameters: dict[str, Any]    # the tool-call arguments
    tainted: bool                 # upstream input flagged as untrusted
    blast_radius_score: float
    recent_blocks_count: int      # this agent, last 5 min (from Redis)
    recent_actions_count: int     # this agent, last 1 min (from Redis)
    simulation_mode: bool


@dataclass
class DetectorFinding:
    owasp_code: str
    severity: str                 # critical | high | medium | low
    title: str
    description: str
    remediation_hint: str | None = None

Every detector below has the same signature: def detect_x(ctx: DetectorContext) -> DetectorFinding | None. That constraint is load-bearing. It means a detector is a pure function of context, the context is cheap to build once per intercept, and adding a new detector is one file and one registration.

Detector 1: known prompt-injection phrases in tool parameters (ASI01)

The obvious one. About 60% of the injection attempts we observe are textbook phrases.

_INJECTION_PATTERNS = re.compile(
    "|".join([
        r"ignore\s+previous\s+instructions",
        r"disregard\s+all\s+prior",
        r"you\s+are\s+now\s+in\s+developer\s+mode",
        r"jailbreak",
        r"act\s+as\s+if\s+you\s+have\s+no\s+restrictions",
        r"do\s+not\s+follow\s+your\s+guidelines",
        r"<\s*system\s*>",
        r"\[\s*INST\s*\]",
    ]),
    re.IGNORECASE,
)

def detect_prompt_injection(ctx: DetectorContext) -> DetectorFinding | None:
    hits = [k for k, v in ctx.parameters.items()
            if isinstance(v, str) and _INJECTION_PATTERNS.search(v)]
    if not hits:
        return None
    return DetectorFinding(
        owasp_code="ASI01",
        severity="critical",
        title="Potential prompt injection detected",
        description=f"Injection-like patterns in parameters: {', '.join(hits)}",
        remediation_hint=(
            "Validate and sanitize user-supplied inputs before they reach tool "
            "call parameters. Raw user text should not land in a tool argument."
        ),
    )

Two things to get right. First, scan the tool-call parameters, not the model's reasoning output. The reasoning is where the injection landed; the parameter is where the damage happens. By the time a compromised thought has turned into s3:DeleteObject(Bucket="*"), you have moved past detection and into blast radius.

Second, pair this with a signal that you are only scanning parameters that came from untrusted input. An internal workflow that includes the phrase ignore previous instructions in a help-center article lookup is not a prompt injection, it is a search query. The taint flag on the context is the cheap version of that distinction.

Detector 2: secrets and PII in tool-call parameters (ASI06)

The second-most-common signal. Not an attack in every case, but a control failure in most, and the false-positive rate is low.

_SECRET_PATTERNS = re.compile(
    "|".join([
        r"(password|passwd|secret|api[_\-]?key|access[_\-]?key|token|auth)[\s:=]+['\"]?[\w\-/+]{8,}",
        r"(aws|gcp|azure)[_\-]?(secret|key|token)[\s:=]+['\"]?[\w\-/+]{16,}",
        r"\b\d{3}-\d{2}-\d{4}\b",         # US SSN
        r"\b4[0-9]{12}(?:[0-9]{3})?\b",   # Visa-like PAN
    ]),
    re.IGNORECASE,
)

def detect_pii_in_parameters(ctx: DetectorContext) -> DetectorFinding | None:
    hits = [k for k, v in ctx.parameters.items()
            if isinstance(v, str) and _SECRET_PATTERNS.search(v)]
    if not hits:
        return None
    return DetectorFinding(
        owasp_code="ASI06",
        severity="high",
        title="Potential secrets or PII in action parameters",
        description=f"Sensitive-data patterns in: {', '.join(hits)}",
        remediation_hint=(
            "Use secret references (SSM Parameter Store, Vault, KMS-decrypt at "
            "call time) instead of passing raw credentials through tool params. "
            "Rotate anything exposed."
        ),
    )

This is the detector most likely to fire on your first day running it. A lot of "secure" agent deployments are quietly shipping API keys through tool parameters because nobody is looking. Fire this in observe mode for the first week, read the report, rotate what it finds, then flip it to enforce.

Detector 3: wildcard resources and over-permissive calls (ASI02)

The classic cloud-agent failure mode. An agent, asked to "delete the stale S3 objects", happily writes Resource: "*" into the call because that is the easiest way to "match everything".

_WILDCARD_PATTERNS = re.compile(
    r"(^|[,\[{'\":\s])\*(\s|,|\]|}|'|\"|$)"     # raw "*" as a value
    r"|arn:aws:[^:]*:[^:]*:[^:]*:\*"            # wildcard in any ARN slot
    r"|\"Resource\"\s*:\s*\"\*\""
    r"|\"Action\"\s*:\s*\"\*\"",
    re.IGNORECASE,
)

def detect_tool_misuse(ctx: DetectorContext) -> DetectorFinding | None:
    for v in ctx.parameters.values():
        if isinstance(v, str) and _WILDCARD_PATTERNS.search(v):
            return DetectorFinding(
                owasp_code="ASI02",
                severity="high",
                title="Over-permissive tool call (wildcard resource)",
                description=(
                    f"{ctx.action_identifier} includes a wildcard resource or "
                    f"action in its parameters. An agent asking for '*' access "
                    f"is almost always a scoping mistake or exploitation attempt."
                ),
                remediation_hint=(
                    "Restrict the resource to a specific ARN or ID. If the "
                    "agent genuinely needs multi-resource access, prefer "
                    "tag-based conditions over wildcards."
                ),
            )
    return None

We have never seen a legitimate agent workflow write Resource: "*" on purpose. Every hit on this detector in production has been either a scoping bug, a fine-tuned model confabulating an IAM policy it saw in training data, or a real exploitation attempt. It is free to ship.

Detector 4: identity and privilege escalation identifiers (ASI03)

This one is not a pattern match against parameters; it is a set-membership check against the action identifier itself. Certain IAM verbs should never be called by a non-admin agent.

_PRIV_ESCALATION_IDENTIFIERS = {
    # AWS IAM
    "aws:iam:attachuserpolicy",
    "aws:iam:attachrolepolicy",
    "aws:iam:putuserpolicy",
    "aws:iam:putrolepolicy",
    "aws:iam:createaccesskey",
    "aws:iam:createloginprofile",
    "aws:iam:updateassumerolepolicy",
    "aws:sts:assumerole",
    # GCP IAM
    "gcp:iam:setiampolicy",
    "gcp:iam:createserviceaccountkey",
    # Azure
    "azure:roleassignments:create",
    # Kubernetes RBAC
    "kubernetes:rbac:clusterrolebindings.create",
    "kubernetes:rbac:rolebindings.create",
}

def detect_privilege_escalation(ctx: DetectorContext) -> DetectorFinding | None:
    if ctx.action_identifier.lower() not in _PRIV_ESCALATION_IDENTIFIERS:
        return None
    role = ctx.parameters.get("RoleArn") or ctx.parameters.get("role_arn")
    suspicious_role = (
        isinstance(role, str) and ("*" in role or "admin" in role.lower())
    )
    severity = "critical" if suspicious_role or ctx.tainted else "high"
    return DetectorFinding(
        owasp_code="ASI03",
        severity=severity,
        title="Identity or privilege escalation action",
        description=(
            f"Agent {ctx.agent_id} invoked {ctx.action_identifier}, a known "
            f"privilege-escalation surface."
        ),
        remediation_hint=(
            "Restrict privileged IAM actions to a dedicated admin-persona "
            "server. Revoke the agent's credentials and rotate any access "
            "keys created in the last hour."
        ),
    )

The identifier set is the part you evolve over time. Start with this list, add whatever surfaces in your first month of alerts. If your threat model is mostly cloud agents, the AWS IAM verbs alone will catch 90% of what matters.

The taint flag and the wildcard-role check are what turn this from a "high" into a "critical". An agent calling sts:AssumeRole on a specific read-only role is different from an agent, with upstream-tainted context, calling it on arn:aws:iam::*:role/*admin*. The detector can tell the difference.

Detector 5: code-execution identifiers (ASI05)

A related one. Certain action identifiers let an agent spawn arbitrary code. Lambda invoke, SSM send-command, ECS run-task, Terraform apply, GitHub Actions workflow dispatch. These are not escalation by themselves, but they are the most common lateral-movement surface in a compromised agent.

_CODE_EXEC_IDENTIFIERS = {
    # AWS
    "aws:lambda:invoke",
    "aws:lambda:invokeasync",
    "aws:ecs:runtask",
    "aws:ssm:sendcommand",
    "aws:ssm:startsession",
    "aws:ec2:runinstances",
    "aws:batch:submitjob",
    # GCP
    "gcp:cloudfunctions:call",
    "gcp:run:services.call",
    "gcp:compute:instances.create",
    # Azure
    "azure:functions:invoke",
    "azure:virtualmachines:runcommand",
    # Kubernetes
    "kubernetes:core:pods.exec",
    "kubernetes:core:pods.create",
    # IaC
    "terraform:apply",
    "terraform:destroy",
    # CI
    "github:actions:workflows.dispatch",
}

def detect_code_execution(ctx: DetectorContext) -> DetectorFinding | None:
    if ctx.action_identifier.lower() not in _CODE_EXEC_IDENTIFIERS:
        return None
    if ctx.simulation_mode or ctx.decision != "ALLOW":
        return None
    severity = "critical" if ctx.tainted else "high"
    return DetectorFinding(
        owasp_code="ASI05",
        severity=severity,
        title="Code-execution action allowed",
        description=(
            f"Agent {ctx.agent_id} is about to execute code via "
            f"{ctx.action_identifier}. Unless this is an explicit workflow "
            f"step, arbitrary code execution from an agent is a strong "
            f"lateral-movement signal."
        ),
        remediation_hint=(
            "Require human approval for the code-execution identifiers above "
            "by adding a policy rule with effect=require_approval."
        ),
    )

A subtle one: this detector only fires when the decision is ALLOW and the mode is not simulation. The point is not to re-flag what the policy engine already blocked. The point is to surface the case where a code-execution call slipped through because no policy rule covered it. That is the silent failure mode you most want to see.

Detector 6: burst rate (ASI04 and ASI08)

The only stateful detector of the six. An agent calling 30 tools per minute is not doing anything a human operator would recognize as normal work. It is either a runaway loop, a goal-hijacking prompt telling it to hammer a target, or a misconfigured retry storm.

def detect_burst_rate(ctx: DetectorContext) -> DetectorFinding | None:
    THRESHOLD = 30
    if ctx.recent_actions_count <= THRESHOLD:
        return None
    return DetectorFinding(
        owasp_code="ASI04",
        severity="high",
        title="Anomalous action burst rate",
        description=(
            f"Agent {ctx.agent_id} performed {ctx.recent_actions_count} "
            f"actions in the last 60s (threshold: {THRESHOLD})."
        ),
        remediation_hint=(
            "Review the task for runaway loops or goal-hijacking. Consider "
            "reducing the action budget or enabling approval-gated mode."
        ),
    )

The counter itself is a one-liner on Redis. On every intercept, INCR agent:{id}:rate:{minute} and EXPIRE ... 120. Read it back in the context builder. Total cost: one Redis round-trip per intercept, which you were already doing for rate-limiting anyway.

What the threshold should be depends on the agent. An autonomous log-analysis agent legitimately fires dozens of queries per minute; a deployment agent should never fire more than a handful. For the first version, 30 is a useful floor that almost nobody hits legitimately. Per-agent-class thresholds come later, once you have two weeks of burst-rate histograms to pick them from.

Wiring it all up

Each detector is pure. The entry point is a function that runs every registered detector against one context and returns the list of findings:

DETECTORS = [
    detect_prompt_injection,
    detect_pii_in_parameters,
    detect_tool_misuse,
    detect_privilege_escalation,
    detect_code_execution,
    detect_burst_rate,
]

def run_all_detectors(ctx: DetectorContext) -> list[DetectorFinding]:
    findings = []
    for fn in DETECTORS:
        try:
            f = fn(ctx)
            if f is not None:
                findings.append(f)
        except Exception:
            logger.exception("detector %s raised", fn.__name__)
    return findings

The try/except is not paranoia. A detector that crashes should degrade silently rather than take down the alerts pipeline, because the alerts pipeline is itself a post-hoc signal, not the primary enforcement path. The primary enforcement path is the policy engine; the detectors are for surfacing patterns that the policy rules did not catch.

Call run_all_detectors from your gateway's post-decision hook. Persist findings to an alerts table keyed by action_id. Surface them in your UI with the OWASP code as a filter and the severity as a sort key. You now have production alerting.

What this does not do

It is worth being honest about the ceiling.

A rule-based detector will miss paraphrased injections. "Forget everything you were told" does not match ignore\s+previous\s+instructions. You can extend the regex, but you are playing whack-a-mole, and eventually a fine-tuned classifier earns its keep.

A rule-based detector will miss semantic tool-misuse. An agent that legitimately has s3:DeleteObject permissions but is manipulated into deleting the wrong bucket does not trigger any of these six. That is where blast-radius scoring and approval gates come in, not detection.

A rule-based detector is also the easiest thing for an attacker to probe. Anyone with read access to your code can see the regex and craft an input that evades it. Defense-in-depth exists because no single layer is sufficient.

What these six detectors buy you is the foundation on top of which the rest gets built. Before you ship a classifier, before you train a taint propagation model, before you write a semantic-equivalence checker for tool calls, you ship the rules. The rules block the 80% of attempts that are trivial. The rules give you the labelled data you need to train anything more sophisticated. The rules are what makes the first customer trust you enough to let you keep going.

That is what you can do in a weekend.

Further reading

For the context pipeline that feeds these detectors, see the companion post Why your AI agent needs a gateway, not a system prompt. For the audit trail where the findings land, How to build an immutable audit log with HMAC hash chaining.

And the framework these detectors map to is OWASP Top 10 for Agentic AI (ASI01 through ASI10).

If you want to see what a production version of this looks like end-to-end against real agent traffic, that is what Tracehold ships. Come talk to us.


See Tracehold in action

30-minute sandbox walkthrough. No SDK install, no credentials.

Book a demo