10 min read · Tracehold team

Blast radius scoring for AI agents: the runtime primitive prompt defenses don't have

Prompt-level defenses treat every tool call the same. But an AI agent deleting a log file is not the same as one deleting an IAM role. Here is how to score blast radius at runtime, before the action runs, and use that score as a gate.

Tags: AI agent security, runtime guardrails, blast radius, agentic AI, OWASP ASI

The problem prompt defenses cannot solve

Most AI agent security today is content-shaped. The system prompt says "do not delete production resources." The model card says "we tuned against prompt injection." A guardrail library scans the model's output for dangerous strings. The thesis is that if the agent never decides to do something bad, nothing bad will happen.

This framing has a blind spot that is invisible until the agent is in production with real tool access: two tool calls that look identical in the transcript can have wildly different consequences in the real world.

An agent calling aws:s3:DeleteObject to clean up a log file produces a minor cost saving. The same verb applied to an object in a bucket that holds the last copy of a customer's data is a P0 incident. An agent calling aws:iam:DeleteRole against a scratch role from a tutorial is fine. The same call against the role that fronts production authentication is a company-ending event.

No amount of prompt hardening tells you which one is about to happen. The content of the request is the same. The tool name is the same. The verb is the same. What differs is the context of the resource being acted on, and that context only exists at runtime, in the actual environment the agent is connected to.

This is what blast radius scoring is for.

What we actually want to measure

At the moment an agent calls a tool, and before we let the action touch anything real, we want a single number that represents: if this goes through and turns out to be wrong, how bad is it?

Not "how risky is the verb"; we already have risk levels for that (READ, WRITE, DESTRUCTIVE, CRITICAL). Risk is a property of the action type. Blast radius is a property of the specific instance: this action, on this resource, in this environment, right now.

The inputs that matter:

  • Environment. Prod is not the same as dev.
  • Impact plane. IAM changes affect identity. Network changes affect connectivity. Data changes affect state. These are not equally severe.
  • Resource cardinality. One row is not the same as a million.
  • Shared infrastructure. Touching a resource that fronts many tenants is not the same as a single-tenant object.
  • Rollback path. A snapshot-restorable action is recoverable. A terminate is not.
  • Maintenance window. An outage during a declared window is absorbed; an outage at 3pm on a Tuesday is not.

The output is a float between 0.0 and 1.0. The goal is not to be perfectly precise; it is to be meaningfully comparable. A score of 0.9 should always indicate more latent damage than a score of 0.3, regardless of which provider the action comes from.

A concrete scoring function

Here is how the engine does it in production:

_ENV_WEIGHT = {
    "prod": 1.0,
    "staging": 0.5,
    "dev": 0.2,
    "unknown": 0.7,   # conservative, treat unknown as near-prod
}

_RISK_WEIGHT = {
    "READ": 0.0,
    "WRITE": 0.3,
    "DESTRUCTIVE": 0.6,
    "CRITICAL": 1.0,
}

raw = (
    env_weight * 0.4
    + risk_weight * 0.5
    + shared_bonus            # +0.15 if shared infra
    + rollback_reduction      # -0.10 if rollback available
    + maintenance_reduction   # -0.15 if inside maintenance window
    + plane_bonus             # +0.10 if IAM plane touched
    + count_bonus             # up to +0.10 for many resources
)
score = clamp(raw, 0.0, 1.0)
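To make the fragment above executable, here is a minimal sketch that wraps it in a function. The weights and adjustments are the ones listed above. The function name, the parameter names, the clamp helper, and especially the linear ramp for count_bonus (reaching the +0.10 cap at 100 resources) are assumptions for illustration, since the formula only states the cap.

```python
# Sketch only: weights come from the formula above; the signature and the
# count_bonus ramp are illustrative assumptions, not the production engine.
_ENV_WEIGHT = {"prod": 1.0, "staging": 0.5, "dev": 0.2, "unknown": 0.7}
_RISK_WEIGHT = {"READ": 0.0, "WRITE": 0.3, "DESTRUCTIVE": 0.6, "CRITICAL": 1.0}


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def blast_radius_score(
    environment: str,
    risk_level: str,
    *,
    shared_infra: bool = False,
    rollback_available: bool = False,
    in_maintenance_window: bool = False,
    touches_iam: bool = False,
    resource_count: int = 1,
) -> float:
    # Missing or unrecognized environments fall back to the "unknown" weight.
    env_weight = _ENV_WEIGHT.get(environment, _ENV_WEIGHT["unknown"])
    # An unrecognized risk level raises KeyError rather than guessing.
    risk_weight = _RISK_WEIGHT[risk_level]

    shared_bonus = 0.15 if shared_infra else 0.0
    rollback_reduction = -0.10 if rollback_available else 0.0
    maintenance_reduction = -0.15 if in_maintenance_window else 0.0
    plane_bonus = 0.10 if touches_iam else 0.0
    # Assumption: linear ramp from 0 at one resource to +0.10 at 100 or more.
    count_bonus = clamp((resource_count - 1) / 99, 0.0, 1.0) * 0.10

    raw = (
        env_weight * 0.4
        + risk_weight * 0.5
        + shared_bonus
        + rollback_reduction
        + maintenance_reduction
        + plane_bonus
        + count_bonus
    )
    return clamp(raw, 0.0, 1.0)
```

Under these assumptions, blast_radius_score("dev", "WRITE", rollback_available=True) comes out at roughly 0.13, while a CRITICAL action on the IAM plane in prod saturates at 1.0.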

Three design choices in that formula are worth pulling out.

Risk carries more weight than environment. The coefficients (0.5 for risk, 0.4 for env) say that a CRITICAL action in dev is still serious, and a READ action in prod is still cheap. This matches reality: destroying the wrong thing is bad in every environment, and reading a config map in prod is not an incident.

Unknown environment gets 0.7, not 0.0. If an agent's tool call lands on a resource without an Environment tag, the scorer does not shrug and assume safety; it assumes near-production. The default is conservative because an untagged resource is exactly the case where the operator does not have good visibility, which is exactly when a cautious prior is most valuable.

IAM gets a separate bonus. An IAM mutation is never just an IAM mutation. Granting a permission, rotating a role, or deleting a policy has cascading effects on every resource in the account that trusts that identity. The +0.1 plane bonus on IAM is a small number that reflects a large qualitative fact: identity sits under everything else.
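The first two choices can be checked with arithmetic on the base term alone (env_weight * 0.4 + risk_weight * 0.5), leaving every bonus and reduction out. This is a small sketch using the weights listed above; the helper name base_term is hypothetical.

```python
# Weights copied from the formula above; base_term is an illustrative helper.
ENV_WEIGHT = {"prod": 1.0, "staging": 0.5, "dev": 0.2, "unknown": 0.7}
RISK_WEIGHT = {"READ": 0.0, "WRITE": 0.3, "DESTRUCTIVE": 0.6, "CRITICAL": 1.0}


def base_term(environment: str, risk_level: str) -> float:
    """Environment and risk contribution only, before bonuses and reductions."""
    return ENV_WEIGHT[environment] * 0.4 + RISK_WEIGHT[risk_level] * 0.5


for env, risk in [
    ("dev", "CRITICAL"),
    ("prod", "READ"),
    ("unknown", "READ"),
    ("dev", "READ"),
]:
    print(f"{risk:<9} in {env:<8} -> {base_term(env, risk):.2f}")
```

A CRITICAL action in dev (0.58) outscores a READ in prod (0.40), and a READ against an untagged resource (0.28) scores well above the same READ in dev (0.08).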

Impact plane, in detail

The hardest part of a blast radius function is figuring out what planes an action actually touches. Some verbs are obvious. Many are not.

A non-exhaustive plane map from the production engine:

Action family                                  | Plane(s)
iam, authorization                             | iam
ec2:securitygroup, ec2:vpc, compute:firewall   | network
s3, storage, rds, sqladmin                     | data
ec2:instance, lambda, virtualmachines          | compute
eks, container, containerservice               | compute, network
kubernetes:rbac                                | iam
kubernetes:core:secrets, :configmaps           | data
cloudflare:dns, cloudflare:ssl                 | network, data
cloudflare:access                              | iam, network
github:secrets                                 | data, iam
github:org:members, :teams                     | iam
terraform:apply, :destroy                      | iam, network, data, compute, storage

The Terraform row is the interesting one. A single terraform apply can touch every plane at once, because the HCL is a black box at the call site; you do not know what the plan covers without reading it. The honest answer is to treat it as cross-plane by default, score it accordingly, and require an approval path. The alternative (parsing the plan to score it precisely) is worth doing, but it is a v2 problem. Shipping a conservative v1 is more important than shipping a precise v3.

Cloudflare DNS is the other instructive case. The action identifier is humble (cloudflare:dns), but the consequence is that every visitor to a customer's site can be redirected somewhere else. The scorer tags it network + data because a DNS change is simultaneously a routing change and an effective data-exposure change.
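One way to turn the plane map above into a lookup is a longest-match over colon-separated segments. The sketch below is an assumption about how matching could work, not the production engine's logic: it assumes action identifiers look like aws:iam:DeleteRole, expands shorthand rows such as :configmaps to their full family, and falls back to every plane for an unrecognized action, in the same conservative spirit as the Terraform row. The names PLANE_MAP and planes_for are hypothetical.

```python
from __future__ import annotations

ALL_PLANES = frozenset({"iam", "network", "data", "compute", "storage"})

# Rows from the plane map above, with shorthand families written out in full.
_ROWS = [
    (("iam", "authorization"), {"iam"}),
    (("ec2:securitygroup", "ec2:vpc", "compute:firewall"), {"network"}),
    (("s3", "storage", "rds", "sqladmin"), {"data"}),
    (("ec2:instance", "lambda", "virtualmachines"), {"compute"}),
    (("eks", "container", "containerservice"), {"compute", "network"}),
    (("kubernetes:rbac",), {"iam"}),
    (("kubernetes:core:secrets", "kubernetes:core:configmaps"), {"data"}),
    (("cloudflare:dns", "cloudflare:ssl"), {"network", "data"}),
    (("cloudflare:access",), {"iam", "network"}),
    (("github:secrets",), {"data", "iam"}),
    (("github:org:members", "github:org:teams"), {"iam"}),
    (("terraform:apply", "terraform:destroy"), set(ALL_PLANES)),
]
PLANE_MAP = {
    family: frozenset(planes) for families, planes in _ROWS for family in families
}


def planes_for(action: str) -> frozenset[str]:
    """Return the planes of the longest family found inside the action id."""
    segments = action.lower().split(":")
    best: frozenset[str] | None = None
    best_len = 0
    for family, planes in PLANE_MAP.items():
        parts = family.split(":")
        n = len(parts)
        if n <= best_len:
            continue  # only a strictly longer family can replace the match
        # The family must appear as a contiguous run of whole segments.
        if any(segments[i:i + n] == parts for i in range(len(segments) - n + 1)):
            best, best_len = planes, n
    # Assumption: an unmapped action is treated as cross-plane, not as safe.
    return best if best is not None else ALL_PLANES
```

Matching on whole segments rather than substrings keeps a family like s3 from matching an unrelated identifier that merely contains those characters.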

Why this belongs at runtime, not at policy-write time

A reasonable objection: "can't we just write policies that say 'no prod DNS changes without approval'?" You can, and you should. Policies are a good tool. But they are a categorical tool; they tell you which classes of action require which treatment.

Blast radius is a graded tool. It lets you say: "this specific instance of the action scores 0.92, route it to approval even though the category normally auto-approves." Or the other direction: "this category normally requires approval, but this specific instance is against a dev resource with a known rollback path, score is 0.18, auto-approve and log." Policies express the rule; blast radius expresses the facts the rule acts on.

Concretely, the gateway uses the score as an input to the policy engine, not as a replacement for it. A policy can key on blast_radius.score > 0.7, or on blast_radius.impact_plane contains "iam", or on blast_radius.environment == "prod" and not blast_radius.rollback_available. The score is a well-defined feature the policy author can reference, not a magic threshold hardcoded somewhere else.
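As a rough illustration of a policy keying on those features, here is a sketch with the three example conditions written as named rules. The BlastRadius dataclass, the rule names, and the choice to map each condition to REQUIRES_APPROVAL are assumptions for the example; a real policy engine would have its own rule language and input schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadius:
    """Hypothetical shape of the blast_radius.* features a policy can read."""
    score: float
    impact_plane: frozenset
    environment: str
    rollback_available: bool


# (rule name, predicate, decision). First matching rule wins.
RULES = [
    ("high-score", lambda br: br.score > 0.7, "REQUIRES_APPROVAL"),
    ("iam-plane", lambda br: "iam" in br.impact_plane, "REQUIRES_APPROVAL"),
    (
        "prod-no-rollback",
        lambda br: br.environment == "prod" and not br.rollback_available,
        "REQUIRES_APPROVAL",
    ),
]


def evaluate(br: BlastRadius) -> tuple:
    """Return (decision, rule name) so the trace records which rule fired."""
    for name, predicate, decision in RULES:
        if predicate(br):
            return decision, name
    return "ALLOW", "default"
```

Returning the rule name alongside the decision keeps the threshold visible in the policy itself instead of buried in the scorer.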

This also means blast radius is observable in the trace. Every scoring pass opens a span:

blast_radius.score
  blast_radius.action = "aws:iam:DeleteRole"
  blast_radius.risk_level = "CRITICAL"
  blast_radius.environment = "prod"
  blast_radius.resource_count = 1
  blast_radius.score = 0.92

When the agent is blocked, or worse, allowed through and something goes wrong, the incident reviewer has the scoring attributes right there in the trace, next to the policy decision that referenced them. This is what "explainable trust" actually looks like in production: every decision has a derivation you can replay.
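A minimal sketch of how one scoring pass could be flattened into those span attributes follows. It uses a plain dictionary rather than any particular tracing library, and the helper names span_attributes and render are hypothetical; only the attribute keys come from the span shown above.

```python
def span_attributes(action, risk_level, environment, resource_count, score):
    """Flatten one scoring pass into blast_radius.* key/value attributes."""
    return {
        "blast_radius.action": action,
        "blast_radius.risk_level": risk_level,
        "blast_radius.environment": environment,
        "blast_radius.resource_count": resource_count,
        "blast_radius.score": round(score, 2),
    }


def render(span_name, attributes):
    """Format a span and its attributes the way the trace excerpt shows them."""
    lines = [span_name]
    for key, value in attributes.items():
        shown = f'"{value}"' if isinstance(value, str) else value
        lines.append(f"  {key} = {shown}")
    return "\n".join(lines)


attrs = span_attributes("aws:iam:DeleteRole", "CRITICAL", "prod", 1, 0.92)
print(render("blast_radius.score", attrs))
```

Keeping the attributes flat and scalar means the same dictionary can be attached to a span and written to the audit row without translation.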

What blast radius is not

Three disclaimers that matter, because a scoring function always ships with an implicit promise of precision that is never fully kept.

Blast radius is not a prediction of damage; it is a prior. A score of 0.9 does not mean this action will cause catastrophic damage. It means if the action turns out to be wrong, the damage will be larger than average. The score gets paired with other signals (taint indicators, policy rule, user approval) before it gates anything.

Blast radius does not understand semantic content. It sees aws:iam:DeleteRole and the target resource ID. It does not know that role-tracehold-prod-authz is the authorization role, or that deleting it breaks every request to the API. That knowledge comes from a CMDB, from resource tags, or from an IAM/network graph: inputs the scorer consumes rather than synthesizes.

The weights are not universal. The exact coefficients in the formula above reflect our design choices, informed by what matters for cloud/ops agents. A team doing media processing or scientific compute should tune them to their own cost structure. The shape of the function generalizes; the numbers do not.

Where it fits in the enforcement pipeline

For reference, here is the order of operations when an agent calls a tool through the gateway:

  1. Classifier: what is the action's risk level? (READ, WRITE, DESTRUCTIVE, CRITICAL)
  2. Resource context: what environment and tags does the target carry?
  3. Blast radius: compute the score from the above.
  4. Taint: has the task ingested untrusted input (prompt injection, poisoned tool output)? Taint signals propagate through delegation chains.
  5. Policy: evaluate the full PolicyInput (action, risk, blast radius, taint sources, etc.) against the org's rules.
  6. Decision: ALLOW, BLOCK, or REQUIRES_APPROVAL.
  7. Credential: if ALLOW, issue task-scoped JIT credentials with the minimum IAM surface the action requires.
  8. Audit: write a hash-chained, HMAC-signed row with every score and attribute attached.

Blast radius is step 3. It runs before the policy engine so its output is available as an input. It runs after risk classification and resource context because those are its inputs. And it runs on every action, not just destructive ones, because a READ against a sensitive data plane (for example, a secrets manager) has its own latent blast radius, even if the verb looks innocuous.
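The ordering above can be sketched as a single function in which each step writes into a record that later steps read. Every stage body here is a hypothetical stand-in (the classifier table, the default of WRITE for unlisted actions, the policy rules, the credential string, and the in-memory audit list are all assumptions); only the order and the data flow between steps mirror the list.

```python
# Weights from the scoring formula; everything else is an illustrative stub.
ENV_WEIGHT = {"prod": 1.0, "staging": 0.5, "dev": 0.2, "unknown": 0.7}
RISK_WEIGHT = {"READ": 0.0, "WRITE": 0.3, "DESTRUCTIVE": 0.6, "CRITICAL": 1.0}
# Hypothetical classifier table; a real classifier covers far more actions.
RISK_BY_ACTION = {"aws:iam:DeleteRole": "CRITICAL", "aws:s3:GetObject": "READ"}

audit_log = []  # stand-in for the hash-chained, HMAC-signed audit store


def run_pipeline(action, tags, tainted=False):
    record = {"action": action}
    # 1. Classifier. Assumption: unlisted actions default to WRITE.
    record["risk_level"] = RISK_BY_ACTION.get(action, "WRITE")
    # 2. Resource context: a missing or odd Environment tag becomes "unknown".
    env = tags.get("Environment", "unknown")
    record["environment"] = env if env in ENV_WEIGHT else "unknown"
    # 3. Blast radius: base term only; bonuses and reductions omitted here.
    raw = (
        ENV_WEIGHT[record["environment"]] * 0.4
        + RISK_WEIGHT[record["risk_level"]] * 0.5
    )
    record["score"] = max(0.0, min(1.0, raw))
    # 4. Taint: carried in from the task's input history.
    record["tainted"] = tainted
    # 5-6. Policy and decision (hypothetical rules).
    if tainted and record["risk_level"] != "READ":
        record["decision"] = "BLOCK"
    elif record["score"] > 0.7:
        record["decision"] = "REQUIRES_APPROVAL"
    else:
        record["decision"] = "ALLOW"
    # 7. Credential: issued only on ALLOW.
    record["credential"] = (
        "task-scoped-jit" if record["decision"] == "ALLOW" else None
    )
    # 8. Audit: every attribute is written, whatever the decision was.
    audit_log.append(dict(record))
    return record
```

Because the score is computed at step 3, the policy at step 5 can read it like any other field, and the audit row at step 8 captures it even when the call is blocked.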

The takeaway

Prompt-level defenses can tell the agent what it should not want to do. They cannot tell you, at the moment of the call, how much damage this specific action is capable of. That gap is the reason prompt-hardened agents still cause incidents when they land in an environment with real tool access.

A blast radius score is not glamorous. It is not a model. It is a deterministic function of inputs that are all knowable at call time: the verb, the resource, the environment, the rollback path, the maintenance window. What it gives you in return is a graded signal that policies can key on, a feature the policy engine can weight, and, most importantly, a number the auditor can point at when asking "why did you let this through?" or "why did you block this?"

Build this primitive. Wire it into your policy inputs. Log it on every decision. The first time an agent attempts something that would have been catastrophic in the wrong environment, the score will already be in the trace, whether or not anything else in your stack was watching.


Tracehold computes a blast radius score on every intercepted tool call, feeds it into a policy engine with a canonical input schema, and writes the result into a hash-chained audit log. Book a 30-minute walk-through and we will show you the full pipeline (classifier, blast radius, taint, policy, credential, audit) end to end on a live tool call.

