Prompt safety is visible. It is easy to demonstrate, easy to screenshot, and easy to turn into a story. A model refuses something. A model fails to refuse something. A jailbreak works. A jailbreak stops working. The whole interaction can fit inside a chat transcript.
Capability safety is less visible, but it is closer to what the system can actually do.
An agent is not only a model with instructions. It is a model connected to actions. The important question is not just what the model says. It is what the system allows the model to do. Can it read files? Can it write files? Can it open a browser? Can it send a request? Can it run a shell command? Can it access tokens? Can it persist memory? Can it modify a remote service?
Those questions define the real boundary.
This is why I think agent security should move from prompts to capabilities. Prompts still matter, but they should not be the center of the design. A prompt is an instruction. A capability is an affordance. If the affordance exists, a future prompt, tool call, bug, or malicious document may eventually try to use it.
The same instruction can be safe or unsafe depending on the surrounding capabilities. “Summarize this file” is different if the agent can only read an uploaded document, if it can scan the whole home directory, or if it can also send results to an external endpoint. The language is similar. The system risk is not.
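To make the "Summarize this file" comparison concrete, here is a minimal sketch in Python. The capability names and the `exposure` function are hypothetical, invented for illustration. The point is only that the instruction never appears in the logic: the outcome depends entirely on the capability set.

```python
# Hypothetical capability labels for the three setups described above.
READ_UPLOAD = "read:uploaded_document"
READ_HOME = "read:home_directory"
SEND_EXTERNAL = "send:external_endpoint"


def exposure(capabilities: frozenset) -> str:
    """Describe the worst case for a capability set, whatever the prompt says."""
    reads_beyond_upload = READ_HOME in capabilities
    sends_out = SEND_EXTERNAL in capabilities
    if reads_beyond_upload and sends_out:
        return "private files can leave the machine"
    if reads_beyond_upload:
        return "private files can be read"
    if sends_out:
        return "the uploaded document can leave the machine"
    return "only the uploaded document can be read"


# Same instruction, three different systems.
instruction = "Summarize this file"
setups = [
    frozenset({READ_UPLOAD}),
    frozenset({READ_UPLOAD, READ_HOME}),
    frozenset({READ_UPLOAD, READ_HOME, SEND_EXTERNAL}),
]
for caps in setups:
    print(instruction, "->", exposure(caps))
```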
A capability-first view asks for a different inventory:
- What tools exist?
- What resources can they touch?
- What arguments can the model control?
- What data can cross a boundary?
- What operations are irreversible?
- What gets logged?
- What requires confirmation?
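One way to keep these questions from staying abstract is to record the answers as data, one entry per tool. The sketch below is an assumption about how such a record might look, not a standard format: `ToolEntry`, its field names, and the two checks in `review_flags` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolEntry:
    """One row of the inventory. All field names are illustrative."""
    name: str                     # what tool exists
    resources: tuple              # what resources it can touch
    model_controlled_args: tuple  # which arguments the model controls
    crosses_boundary: bool        # can data leave the sandbox or host
    irreversible: bool            # can the operation not be undone
    logged: bool                  # is each call recorded
    requires_confirmation: bool   # does a human approve each call


def review_flags(tool: ToolEntry) -> list:
    """Return combinations that deserve a second look (example rules only)."""
    flags = []
    if tool.irreversible and not tool.requires_confirmation:
        flags.append("irreversible without confirmation")
    if tool.crosses_boundary and not tool.logged:
        flags.append("crosses a boundary without logging")
    return flags


# A hypothetical tool: deletes files in a workspace, with no confirmation step.
delete_file = ToolEntry(
    name="delete_file",
    resources=("workspace/**",),
    model_controlled_args=("path",),
    crosses_boundary=False,
    irreversible=True,
    logged=True,
    requires_confirmation=False,
)
print(review_flags(delete_file))  # ['irreversible without confirmation']
```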
This inventory is not glamorous, but it is where many practical failures live. The boring parts of agent security are often the important parts: path limits, network allowlists, shell restrictions, confirmation gates, audit logs, file scopes, secrets handling, and clear tool schemas.
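Two of those boring parts, path limits and network allowlists, fit in a few lines. This is a minimal sketch using only the Python standard library. The workspace root and the allowed host are placeholder values, and a real deployment would also need to think about symlinks created after the check, redirects, and DNS.

```python
from pathlib import Path
from urllib.parse import urlsplit

# Placeholder values for illustration; substitute your own scope.
ALLOWED_ROOT = Path("/srv/agent/workspace")
ALLOWED_HOSTS = frozenset({"api.example.com"})


def path_in_scope(candidate: str, root: Path = ALLOWED_ROOT) -> bool:
    """True only if the path, once resolved, stays inside the allowed root."""
    resolved_root = root.resolve()
    # Joining with an absolute candidate discards the root, so "/etc/passwd"
    # resolves outside the root and is rejected like "../" traversal.
    resolved = (resolved_root / candidate).resolve()
    return resolved.is_relative_to(resolved_root)  # Python 3.9+


def host_allowed(url: str, allowed: frozenset = ALLOWED_HOSTS) -> bool:
    """True only for https URLs whose exact hostname is on the allowlist."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in allowed


print(path_in_scope("notes/todo.txt"))
print(path_in_scope("../../etc/passwd"))
print(host_allowed("https://api.example.com/v1/items"))
print(host_allowed("https://api.example.com.evil.test/"))
```

The checks compare resolved paths and parsed hostnames rather than raw strings, because prefix or substring matching is exactly what a traversal sequence or a lookalike domain is designed to pass.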
Static analysis fits naturally into this view. It cannot decide every possible runtime behavior, but it can help build the inventory. It can read a skill before the model uses it and ask: what capability is being exposed here? Is this a file operation? A network operation? A command execution path? A browser action? A credential access path? A persistent write?
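As a rough illustration of that first pass, the sketch below assumes a skill ships as Python source and uses the standard `ast` module to list the capabilities its imports and calls hint at. The two hint tables are deliberately small and hypothetical. A scan like this misses dynamic imports, aliased calls, and anything hidden behind `getattr`, so it feeds the inventory rather than replacing review.

```python
import ast

# Illustrative, incomplete mappings from code features to capability labels.
IMPORT_HINTS = {
    "subprocess": "command execution",
    "socket": "network",
    "urllib": "network",
    "http": "network",
    "shutil": "file write",
}
CALL_HINTS = {
    "open": "file access",
    "eval": "code execution",
    "exec": "code execution",
}


def inventory_capabilities(source: str) -> set:
    """Return capability labels suggested by a skill's imports and calls."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]
        else:
            modules = []
        for module in modules:
            top_level = module.split(".")[0]
            if top_level in IMPORT_HINTS:
                found.add(IMPORT_HINTS[top_level])
        # Only direct calls by bare name, such as open(...), are detected.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in CALL_HINTS:
                found.add(CALL_HINTS[node.func.id])
    return found


skill_source = """
import subprocess
import urllib.request

def summarize(path):
    return open(path).read()[:200]
"""
print(sorted(inventory_capabilities(skill_source)))
```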
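Once capabilities are visible, policy becomes easier to discuss. The conversation shifts from whether a prompt sounds safe to whether a given tool should be able to write outside its workspace, reach a given host, or run without confirmation.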
The goal is not to make agents powerless. A powerless agent is just a chatbot with better branding. The goal is to make power legible. The system should know what it is giving away, the user should understand when a boundary is crossed, and the developer should have a way to inspect the surface before something goes wrong.
Prompts are still part of the story. But they are not the ground truth.
The ground truth is what the agent can do.