Capability Risks

Similar to what we did for baseline components and risks, we identify safety and security risks that may arise from the specific capabilities identified in the previous section. For each risk, we provide an academic paper, case study, or article which provides more details about the nature of the risk.

Click here for a downloadable version.

Category	Capability	Risk
Cognitive	Reasoning & Problem-Solving	Becoming ineffective, inefficient, or unsafe due to overthinking¹
Cognitive	Reasoning & Problem-Solving	Engaging in deceptive behaviour through pursuing or prioritising other goals²
Cognitive	Planning & Goal Management	Devising plans that are not effective in meeting the user's requirements³
Cognitive	Planning & Goal Management	Devising plans that do not adhere to common sense or implicit assumptions about the user's instructions⁴
Cognitive	Agent Delegation	Assigning tasks incorrectly to other agents⁵
Cognitive	Agent Delegation	Attempting to use other agents maliciously⁶
Cognitive	Tool Use	Choosing the wrong tool for the given action or task⁷
Interaction	Natural Language Communication	Generating undesirable content (e.g. toxic, hateful, sexual)⁸
Interaction	Natural Language Communication	Generating unqualified advice in specialised domains (e.g. medical, financial, legal)⁹
Interaction	Natural Language Communication	Generating controversial content (e.g. political, competitors)¹⁰
Interaction	Natural Language Communication	Regurgitating personally identifiable information¹¹
Interaction	Natural Language Communication	Generating non-factual or hallucinated content¹²
Interaction	Natural Language Communication	Generating copyrighted content¹³
Interaction	Multimodal Understanding & Generation	Generating undesirable content (e.g. toxic, hateful, sexual)¹⁴
Interaction	Multimodal Understanding & Generation	Generating unqualified advice in specialised domains (e.g. medical, financial, legal)¹⁵
Interaction	Multimodal Understanding & Generation	Generating controversial content (e.g. political, competitors)¹⁶
Interaction	Multimodal Understanding & Generation	Regurgitating personally identifiable information¹⁷
Interaction	Multimodal Understanding & Generation	Generating non-factual or hallucinated content¹⁸
Interaction	Multimodal Understanding & Generation	Generating copyrighted content¹⁹
Interaction	Official Communications	Making inaccurate promises or statements to the public²⁰
Interaction	Official Communications	Sending undesirable content to recipients²¹
Interaction	Official Communications	Sending malicious content to recipients²²
Interaction	Official Communications	Misleading recipients about the authorship of the communications²³
Interaction	Official Communications	Sending personally identifiable or sensitive data²⁴
Interaction	Business Transactions	Allowing unauthorised transactions²⁵
Interaction	Business Transactions	Increasing the system's vulnerability to attackers exfiltrating credentials for transactions through the agent²⁶
Interaction	Internet & Search Access	Opening vulnerabilities to prompt injection attacks via malicious websites²⁷
Interaction	Internet & Search Access	Returning unreliable information or websites²⁸
Interaction	Computer Use	Opening vulnerabilities to prompt injection attacks²⁹
Interaction	Computer Use	Accessing personally identifiable or sensitive data³⁰
Interaction	Other Programmatic Interfaces	Leaking personally identifiable or sensitive data³¹
Interaction	Other Programmatic Interfaces	Increasing the system's vulnerability to supply chain attacks³²
Operational	Agent Communication	Enabling the exfiltration of sensitive data³³
Operational	Agent Communication	Communicating insecurely resulting in man-in-the-middle attacks³⁴
Operational	Agent Communication	Misinterpreting inter-agent messages due to poor formatting or weak protocols³⁵
Operational	Agent Communication	Passing on prompt injection attacks across agents³⁶
Operational	Code Execution	Executing poor code³⁷
Operational	Code Execution	Executing vulnerable or malicious code³⁸
Operational	File & Data Management	Overwriting or deleting database tables or files³⁹
Operational	File & Data Management	Overwhelming the database with poor, inefficient, or repeated queries⁴⁰
Operational	File & Data Management	Exposing personally identifiable or sensitive data from databases or files⁴¹
Operational	File & Data Management	Opening vulnerabilities to prompt injection attacks via malicious data or files⁴²
Operational	System Management	Escalating the agent's own privileges⁴³
Operational	System Management	Misconfiguring system resources, compromising system integrity and availability⁴⁴
Operational	System Management	Overwhelming the system with poor, inefficient, or repeated requests⁴⁵

Cuadron et al. The Danger of Overthinking: Examining the Reasoning‑Action Dilemma in Agentic Tasks. arXiv preprint arXiv:2502.08235, 2025. https://arxiv.org/pdf/2502.08235, Accessed: 2025‑07‑26. ↩
Chen et al. Reasoning Models Don't Always Say What They Think. https://www.anthropic.com/research/reasoning-models-dont-say-think, Accessed: 2025‑07‑26. ↩
Xie et al. TravelPlanner: A Benchmark for Real‑World Planning with Language Agents. arXiv preprint arXiv:2402.01622, 2024. https://arxiv.org/pdf/2402.01622v4, Accessed: 2025‑07‑26. Xie et al. Revealing the Barriers of Language Agents in Planning. NAACL Long Papers 2025. https://aclanthology.org/2025.naacl-long.93.pdf, Accessed: 2025‑07‑26. ↩
Marcus et al. AI still lacks "common" sense, 70 years later. Substack essay, January 5, 2025. https://garymarcus.substack.com/p/ai-still-lacks-common-sense-70-years, Accessed: 2025‑07‑26. ↩
Cemri et al. Why Do Multi‑Agent LLM Systems Fail? arXiv preprint arXiv:2503.13657, 2025. https://arxiv.org/abs/2503.13657, Accessed: 2025‑07‑26. ↩
Lupinacci et al. The Dark Side of LLMs: Agent‑based Attacks for Complete Computer Takeover. arXiv preprint arXiv:2507.06850, 2025. https://arxiv.org/abs/2507.06850, Accessed: 2025‑07‑26. ↩
Kokane et al. ToolScan: A Benchmark for Characterizing Errors in Tool‑Use LLMs. arXiv preprint arXiv:2411.13547, 2024. https://arxiv.org/abs/2411.13547, Accessed: 2025‑07‑26. ↩
Mazeika et al. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv preprint arXiv:2402.04249v2, 2024. https://arxiv.org/abs/2402.04249v2, Accessed: 2025‑07‑26. ↩
Barbera, Isabel. AI Privacy Risks & Mitigations in Large Language Models (LLMs). European Data Protection Board Report, 2025. https://www.edpb.europa.eu/system/files/2025-04/ai-privacy-risks-and-mitigations-in-llms.pdf, Accessed: 2025‑07‑26. ↩
Stanford HAI. AI models like ChatGPT, Claude, and Gemini show partisan bias, study finds. Stanford News, May 22, 2025. https://news.stanford.edu/stories/2025/05/ai-models-llms-chatgpt-claude-gemini-partisan-bias-research-study, Accessed: 2025‑07‑26. ↩
See footnote 9. ↩
Zhang et al. Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. arXiv preprint arXiv:2309.01219, 2023. https://arxiv.org/abs/2309.01219, Accessed: 2025‑07‑26. ↩
Chen et al. CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright‑Protected Text in Language Model Generation. arXiv preprint arXiv:2407.07087, 2024. https://arxiv.org/abs/2407.07087, Accessed: 2025‑07‑26. ↩
Liu et al. MM‑SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models. arXiv preprint arXiv:2311.17600, 2023. https://arxiv.org/abs/2311.17600, Accessed: 2025‑07‑26. ↩
Yan et al. Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA. Findings of ACL 2025. https://aclanthology.org/2025.findings-acl.981.pdf, Accessed: 2025‑07‑27. ↩
Motoki et al. Assessing political bias and value misalignment in generative artificial intelligence. Journal of Economic Behavior & Organization, 2025. https://www.sciencedirect.com/science/article/pii/S0167268125000241, Accessed: 2025‑07‑27. ↩
Carlini et al. Extracting Training Data from Diffusion Models. arXiv preprint arXiv:2301.13188, 2023. https://arxiv.org/abs/2301.13188, Accessed: 2025‑07‑27. ↩
Bai et al. Hallucination of Multimodal Large Language Models: A Survey. arXiv preprint arXiv:2404.18930v2, 2025. https://arxiv.org/abs/2404.18930, Accessed: 2025‑07‑27. ↩
See footnote 17. ↩
The Decoder. People buy brand-new Chevrolets for $1 from a ChatGPT chatbot. The Decoder, 2025. https://the-decoder.com/people-buy-brand-new-chevrolets-for-1-from-a-chatgpt-chatbot/, Accessed: 2025‑07‑26. ↩
Harwell, Drew. X ordered its Grok chatbot to 'tell like it is.' Then the Nazi tirade began., July 11, 2025. https://www.washingtonpost.com/technology/2025/07/11/grok-ai-elon-musk-antisemitism/, Accessed: 2025‑07‑26. ↩
Threat Hunter Team. AI: Advent of Agents Opens New Possibilities for Attackers. Threat Intelligence Blog (Symantec / Broadcom), March 13 2025. https://www.security.com/threat-intelligence/ai-agent-attacks, Accessed: 2025‑07‑27. ↩
Goldman. A customer support AI went rogue—and it's a warning for every company. Fortune, April 2025. https://fortune.com/article/customer-support-ai-cursor-went-rogue/, Accessed: 2025‑07‑27. ↩
See footnote 9. ↩
Kulp, Patrick. AI agents may be vulnerable to financial attacks. Tech Brew (Emerging Tech Brew), May 29 2025. https://www.emergingtechbrew.com/stories/2025/05/29/ai-agents-vulnerable-financial-attacks, Accessed: 2025‑07‑27. ↩
Alizadeh et al. Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution. arXiv preprint arXiv:2506.01055, 2025. https://arxiv.org/pdf/2506.01055, Accessed: 2025‑07‑27. ↩
Chen et al. AI Agents Are Here—So Are the Threats: Unit 42 Unveils the Top 10 Agentic‑AI Security Risks. Palo Alto Networks Unit 42 blog, 2025. https://unit42.paloaltonetworks.com/agentic-ai-threats/, Accessed: 2025‑07‑27. ↩
Delaney, Max. Google's AI Overviews are often so confidently wrong that I've lost all trust in them. TechRadar, 2025. https://www.techradar.com/computing/artificial-intelligence/googles-ai-overviews-are-often-so-confidently-wrong-that-ive-lost-all-trust-in-them, Accessed: 2025‑07‑27. ↩
Mudryi et al. The Hidden Dangers of Browsing AI Agents. arXiv preprint arXiv:2505.13076v1, 2025. https://arxiv.org/html/2505.13076v1, Accessed: 2025‑07‑27. Martin, Jason. Indirect Prompt Injection of Claude Computer Use. HiddenLayer blog, 2025. https://hiddenlayer.com/innovation-hub/indirect-prompt-injection-of-claude-computer-use/, Accessed: 2025‑07‑27. ↩
Yang et al. RiOSWorld: Benchmarking the Risk of Multimodal Computer‑Use Agents. arXiv preprint arXiv:2506.00618, 2025. https://arxiv.org/html/2506.00618v3, Accessed: 2025‑07‑27. ↩
Park, Sean. Unveiling AI Agent Vulnerabilities Part III: Data Exfiltration. TrendMicro, May 2025. https://www.trendmicro.com/vinfo/sg/security/news/threat-landscape/unveiling-ai-agent-vulnerabilities-part-iii-data-exfiltration, Accessed: 2025-07-29. ↩
Unit 42. GitHub Actions Supply Chain Attack: A Targeted Attack on Coinbase Expanded to the Widespread tj-actions/changed-files Incident: Threat Assessment. Palo Alto Networks, Apr 2025. https://unit42.paloaltonetworks.com/github-actions-supply-chain-attack/, Accessed: 2025-08-04. ↩
Munoz, Alvaro. GHSL-2024-294: Environment variable injection leading to potential secret exfiltration and privilege escalation in Azure/cli. Security Lab, Dec 2024. https://securitylab.github.com/advisories/GHSL-2024-294_Azure-cli/?utm_source=chatgpt.com, Accessed: 2025-08-04. ↩
He et al. Red-Teaming LLM Multi-Agent Systems via Communication Attacks. arXiv preprint arXiv:2502.14847, 2025. https://arxiv.org/pdf/2502.14847, Accessed: 2025‑07‑27. ↩
Kong et al. A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures. arXiv preprint arXiv:2506.19676, 2025. https://arxiv.org/html/2506.19676, Accessed: 2025‑07‑27. ↩
Ferrag et al. From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows. arXiv preprint arXiv:2506.23260, 2025. https://arxiv.org/html/2506.23260, Accessed: 2025‑07‑27. ↩
Spracklen et al. We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs. USENIX Security Symposium 2025 (preprint), 2025. https://www.usenix.org/system/files/conference/usenixsecurity25/sec25cycle1-prepub-742-spracklen.pdf, Accessed: 2025‑07‑27. METR. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. METR blog, July 10 2025. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/, Accessed: 2025‑07‑27. Guo et al. RedCode: Risky Code Execution and Generation Benchmark for Code Agents. NeurIPS 2024 Datasets and Benchmarks Track, 2024. https://proceedings.neurips.cc/paper_files/paper/2024/hash/bfd082c452dffb450d5a5202b0419205-Abstract-Datasets_and_Benchmarks_Track.html, Accessed: 2025‑07‑27. ↩
Peng et al. CWEVAL: Outcome‑driven Evaluation on Functionality and Security of LLM Code Generation. arXiv preprint arXiv:2501.08200, 2025. https://arxiv.org/pdf/2501.08200, Accessed: 2025‑07‑27. Dilgren et al. SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories. arXiv preprint arXiv:2504.21205, 2025. https://arxiv.org/html/2504.21205v1, Accessed: 2025‑07‑27. ↩
Pedro et al. Holodeck: Prompt-to-SQL Injections in LLM-Integrated Web Applications: Risks and Defenses. ICSE 2025 research track. https://syssec.dpss.inesc-id.pt/papers/pedro_icse25.pdf, Accessed: 2025‑07‑27. ↩
Ramirez et al. Which LLM Writes the Best SQL? Benchmarking analytical SQL generation by LLMs. Tinybird Blog, 2025. https://www.tinybird.co/blog-posts/which-llm-writes-the-best-sql, Accessed: 2025‑07‑27. ↩
Poireault, Kevin. Microsoft 365 Copilot hit by a zero‑click AI vulnerability allowing data exfiltration. Infosecurity Magazine, 2025. https://www.infosecurity-magazine.com/news/microsoft-365-copilot-zeroclick-ai/, Accessed: 2025‑07‑27. ↩
diskordia. Inside CVE-2025-32711 (EchoLeak): Prompt injection meets AI exfiltration. Hack The Box Blog, 2025. https://www.hackthebox.com/blog/cve-2025-32711-echoleak-copilot-vulnerability, Accessed: 2025‑07‑27. Burgess, Matt. Here Come the AI Worms. WIRED, 2025. https://www.wired.com/story/here-come-the-ai-worms/, Accessed: 2025‑07‑27. ↩
Kim et al. Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents. arXiv preprint arXiv:2503.15547v1, 2025. https://arxiv.org/html/2503.15547v1, Accessed: 2025‑07‑27. ↩
Kon et al. IaC-Eval: A Code Generation Benchmark for Cloud Infrastructure-as-Code Programs. NeurIPS 2024 poster, 2024. https://neurips.cc/virtual/2024/poster/97835, Accessed: 2025‑07‑27. Romeo et al. ARPaCCino: An Agentic-RAG for Policy as Code Compliance. arXiv preprint arXiv:2507.10584v1, 2025. https://arxiv.org/html/2507.10584v1, Accessed: 2025‑07‑27. ↩
Zhang et al. Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification. arXiv preprint arXiv:2407.20859v1, 2024. https://arxiv.org/html/2407.20859v1, Accessed: 2025‑07‑27. OWASP. LLMRISK‑102025: Unbounded Consumption. OWASP GenAI Risk Database, 2025. https://genai.owasp.org/llmrisk/llm102025-unbounded-consumption/, Accessed: 2025‑07‑27. ↩