Skip to content

Capability Controls

For each capability and risk combination identified in the previous sections, we provide a list of technical controls that may be useful in mitigating those safety or security risks. Due to the rapid developments in the agentic AI space, this list provides a simple starting point and is far from complete or thorough.

Click here for a downloadable version.

Category Capability Risk Technical Control
Cognitive Planning & Goal Management Devising plans that are not effective in meeting the user's requirements Prompt the agent to self-reflect on the adherence of the plan to the user's instructions
Cognitive Planning & Goal Management Devising plans that are not effective in meeting the user's requirements Require the user to approve the plan in high-impact cases
Cognitive Planning & Goal Management Devising plans that do not adhere to common sense or implicit assumptions about the user's instructions Prompt the agent to self-reflect on whether the plan is sensible and reasonable, given the user's original request
Cognitive Planning & Goal Management Devising plans that do not adhere to common sense or implicit assumptions about the user's instructions Ensure important assumptions about feasibility, scope, and cost, where relevant, are included in the system prompt
Cognitive Reasoning & Problem-Solving Becoming ineffective, inefficient, or unsafe due to overthinking Enforce time or token limits for agents' reasoning
Cognitive Reasoning & Problem-Solving Becoming ineffective, inefficient, or unsafe due to overthinking Test different variations of the prompt to estimate likelihood of overthinking
Cognitive Reasoning & Problem-Solving Becoming ineffective, inefficient, or unsafe due to overthinking Adjust short-term and long-term memory options for the agent
Cognitive Reasoning & Problem-Solving Engaging in deceptive behaviour through pursuing or prioritising other goals Provide access to a scratchpad for agents to use to record its inner thoughts
Cognitive Agent Delegation Assigning tasks incorrectly to other agents Apply guardrails to limit the scope of tasks that can be assigned to specialised agents
Cognitive Agent Delegation Attempting to use other agents maliciously Log all task assignments by the agent to other agents
Cognitive Agent Delegation Attempting to use other agents maliciously Conduct rigorous adversarial testing on centralised planning agents
Cognitive Tool Use Choosing the wrong tool for the given action or task Provide comprehensive descriptions of each tool, including its intended use, required inputs, and potential outputs
Interaction Natural Language Communication Generating undesirable content (e.g. toxic, hateful, sexual) Implement output safety text guardrails to detect if undesirable content is being generated
Interaction Natural Language Communication Generating unqualified advice in specialised domains (e.g. medical, financial, legal) Implement input text guardrails to detect if the question is related to one of the specialised domains, and if so, to decline answering the question
Interaction Natural Language Communication Generating controversial content (e.g. political, competitors) Implement input text guardrails to detect instructions to generate content that is controversial according to the organisation's policies
Interaction Natural Language Communication Regurgitating personally identifiable information Implement output text guardrails to detect personally identifiable information in the LLM's outputs before it reaches the user
Interaction Natural Language Communication Generating non-factual or hallucinated content Implement methods to reduce hallucination rates (e.g. retrieval-augmented generation)
Interaction Natural Language Communication Generating non-factual or hallucinated content Implement UI/UX cues to highlight the risk of hallucination to the user
Interaction Natural Language Communication Generating non-factual or hallucinated content Implement features to enable users to easily verify the generated answer against the original content
Interaction Natural Language Communication Generating copyrighted content Implement input text guardrails to detect instructions to generate copyrighted content
Interaction Multimodal Understanding & Generation Generating undesirable content (e.g. toxic, hateful, sexual) Implement output multimodal safety guardrails for the output to detect if undesirable content is being generated
Interaction Multimodal Understanding & Generation Generating unqualified advice in specialised domains (e.g. medical, financial, legal) Implement input multimodal guardrails to detect if the instruction is related to one of the specialised domains, and if so, to decline fulfilling the instruction
Interaction Multimodal Understanding & Generation Generating controversial content (e.g. political, competitors) Implement input multimodal guardrails to detect instructions to generate content that is controversial according to the organisation's policies
Interaction Multimodal Understanding & Generation Regurgitating personally identifiable information Implement output multimodal guardrails to detect personally identifiable information in the LLM's outputs before it reaches the user
Interaction Multimodal Understanding & Generation Generating non-factual or hallucinated content Conduct testing to measure hallucination and factuality rates for multimodal outputs
Interaction Multimodal Understanding & Generation Generating copyrighted content Implement input guardrails to detect instructions to generate copyrighted content
Interaction Official Communications Making inaccurate promises or statements to the public Limit the communications to standard processes, where communication templates are available
Interaction Official Communications Making inaccurate promises or statements to the public Require human approval for communications for more sensitive matters
Interaction Official Communications Making inaccurate promises or statements to the public Provide alternate channels for users to clarify communications or give feedback
Interaction Official Communications Sending undesirable content to recipients Implement output safety guardrails to detect if undesirable content is in the communications before it is sent to the user
Interaction Official Communications Sending malicious content to recipients Check for adherence to communication templates prior to sending email
Interaction Official Communications Sending malicious content to recipients Validate all links and attachments prior to sending them to users
Interaction Official Communications Misleading recipients about the authorship of the communications Declare upfront that the communications are generated by an AI system
Interaction Official Communications Sending personally identifiable or sensitive data Implement output guardrails to detect personally identifiable information in the communications before it is sent to the user
Interaction Business Transactions Allowing unauthorised transactions Require human validation for high-impact transactions
Interaction Business Transactions Allowing unauthorised transactions Logging all requests leading up to the transaction
Interaction Business Transactions Allowing unauthorised transactions Apply fraud detection models or heuristics to the agent's own decisions
Interaction Business Transactions Increasing the system's vulnerability to attackers exfiltrating credentials for transactions through the agent Ensure virtual isolation for agents carrying out transactions
Interaction Business Transactions Increasing the system's vulnerability to attackers exfiltrating credentials for transactions through the agent Do not share credentials with the agent directly, require the agent to use a separate service for authentication and transactions
Interaction Internet & Search Access Opening vulnerabilities to prompt injection attacks via malicious websites Implement input guardrails to detect prompt injection or adversarial attacks
Interaction Internet & Search Access Opening vulnerabilities to prompt injection attacks via malicious websites Implement escape filtering before including web content into prompts
Interaction Internet & Search Access Opening vulnerabilities to prompt injection attacks via malicious websites Use structured retrieval APIs for searching the web rather than through web scraping
Interaction Internet & Search Access Returning unreliable information or websites Prioritise results from verified, high-quality domains (e.g. .gov, .edu, well-known publishers)
Interaction Internet & Search Access Returning unreliable information or websites Require cross-source validation for some of the claims made
Interaction Computer Use Opening vulnerabilities to prompt injection attacks Ensure computer use protocol or application provides immediate interruptability
Interaction Computer Use Opening vulnerabilities to prompt injection attacks Limit computer use to accessing only safe resources on the computer
Interaction Computer Use Accessing personally identifiable or sensitive data Ensure "take over" mode is activated when keying in sensitive data (e.g. passwords, API keys)
Interaction Other Programmatic Interfaces Leaking personally identifiable or sensitive data Use short‑lived, rotating credentials that expire immediately after agent use
Interaction Other Programmatic Interfaces Leaking personally identifiable or sensitive data Specify a whitelist of interfaces that agents are allowed to use
Interaction Other Programmatic Interfaces Increasing the system's vulnerability to supply chain attacks Enforce zero-trust input handling and validate all data flows
Operational Agent Communication Enabling the exfiltration of sensitive data Implement a whitelist approach for outward network access, including API requests
Operational Agent Communication Enabling the exfiltration of sensitive data Ensure that sensitive data is not passed and leaked between agents by using appropriate guardrails
Operational Agent Communication Communicating insecurely resulting in man-in-the-middle attacks Ensure all cross-agent authentication and message validation and encryption where necessary
Operational Agent Communication Misinterpreting inter-agent messages due to poor formatting or weak protocols Constrain agent communication with structured outputs and interactions
Operational Agent Communication Passing on prompt injection attacks across agents Sanitise messages before agents process them - strip or escape unexpected instruction-like content that may have been injected
Operational Agent Communication Impersonating or accessing peer agents or services via shared roles or credentials Isolate roles and credentials of each agent
Operational Code Execution Executing poor code Use code linters to screen for bad practices, anti-patterns, unused variables, or poor syntax
Operational Code Execution Executing poor code Use static code analysers to detect problems with the code
Operational Code Execution Executing poor code Run code only in virtually isolated compute environments (e.g. Docker containers)
Operational Code Execution Executing poor code Ensure monitoring of code runtime and memory consumption
Operational Code Execution Executing vulnerable or malicious code Use static code analysers to identify dangerous patterns in the code before execution
Operational Code Execution Executing vulnerable or malicious code Conduct CVE scanning and block execution if any High or Critical CVEs are detected
Operational Code Execution Executing vulnerable or malicious code Block all inward and outward network access by default
Operational Code Execution Executing vulnerable or malicious code Scope execution privileges strictly only to what is necessary, ensuring that privileges are customised to each agent within a system
Operational Code Execution Executing vulnerable or malicious code Do not grant admin or sudo privileges
Operational Code Execution Executing vulnerable or malicious code Sanitise all inputs
Operational Code Execution Executing vulnerable or malicious code Implement a whitelist approach for inward network access
Operational Code Execution Executing vulnerable or malicious code Review all code generated by agents, including shell scripts, before execution
Operational Code Execution Executing vulnerable or malicious code Create a Deny list of commands that agents are not allowed to run autonomously
Operational File & Data Management Overwriting or deleting database tables or files No write access to tables in the database unless strictly required
Operational File & Data Management Overwriting or deleting database tables or files Require human approval for any changes to the database, table, or file
Operational File & Data Management Overwriting or deleting database tables or files Avoid mounting broad or persistant paths
Operational File & Data Management Overwhelming the database with poor, inefficient, or repeated queries Limit the number of concurrent queries to the database from the agent
Operational File & Data Management Overwhelming the database with poor, inefficient, or repeated queries Analyse past database queries to identify repeated or inefficient queries
Operational File & Data Management Exposing personally identifiable or sensitive data from databases or files Implement input guardrails to detect personally identifiable information
Operational File & Data Management Exposing personally identifiable or sensitive data from databases or files Do not allow access to personally identifiable data or sensitive data unless strictly required
Operational File & Data Management Exposing personally identifiable or sensitive data from databases or files Log all database queries in production
Operational File & Data Management Opening vulnerabilities to prompt injection attacks via malicious data or files Validate new data used to supplement RAG databases or training data
Operational File & Data Management Opening vulnerabilities to prompt injection attacks via malicious data or files Implement input guardrails to detect prompt injection or adversarial attacks
Operational File & Data Management Opening vulnerabilities to prompt injection attacks via malicious data or files Disallow unknown or external files unless it is scanned
Operational File & Data Management Overwriting or deleting required files Require user confirmation before overwriting or deleting any files
Operational File & Data Management Overwriting or deleting required files Keep separate copy of original files
Operational File & Data Management Overwriting or deleting required files Ensure second copy of database is not changed until a pre-specified amount of time has passed / ensure database versioning
Operational File & Data Management Making unauthorised changes to files Require user confirmation before executing each change to a file
Operational System Management Escalating the agent's own privileges Scope system privileges strictly only to what is necessary
Operational System Management Escalating the agent's own privileges Do not grant admin privileges to agents
Operational System Management Escalating the agent's own privileges Do not allow agents to modify privileges
Operational System Management Misconfiguring system resources, compromising system integrity and availability Only grant agents privileges to modify system resources if strictly necessary for completion of tasks
Operational System Management Misconfiguring system resources, compromising system integrity and availability Set minimum and maximum limits to what the agent can modify within a given system resource
Operational System Management Misconfiguring system resources, compromising system integrity and availability Ensure logging of system health metrics and automated alerts to the developer team if any metrics are abnormal
Operational System Management Overwhelming the system with poor, inefficient, or repeated requests Limit the number of concurrent queries to external systems from the agent
Operational System Management Overwhelming the system with poor, inefficient, or repeated requests Log all queries to external systems from the agent