Skip to content

Capability Risks

Similar to what we did for baseline components and risks, we identify safety and security risks that may arise from the specific capabilities identified in the previous section. For each risk, we provide an academic paper, case study, or article which provides more details about the nature of the risk.

Click here for a downloadable version.

Category Capability Risk
Cognitive Reasoning & Problem-Solving Becoming ineffective, inefficient, or unsafe due to overthinking1
Cognitive Reasoning & Problem-Solving Engaging in deceptive behaviour through pursuing or prioritising other goals2
Cognitive Planning & Goal Management Devising plans that are not effective in meeting the user's requirements3
Cognitive Planning & Goal Management Devising plans that do not adhere to common sense or implicit assumptions about the user's instructions4
Cognitive Agent Delegation Assigning tasks incorrectly to other agents5
Cognitive Agent Delegation Attempting to use other agents maliciously6
Cognitive Tool Use Choosing the wrong tool for the given action or task7
Interaction Natural Language Communication Generating undesirable content (e.g. toxic, hateful, sexual)8
Interaction Natural Language Communication Generating unqualified advice in specialised domains (e.g. medical, financial, legal)9
Interaction Natural Language Communication Generating controversial content (e.g. political, competitors)10
Interaction Natural Language Communication Regurgitating personally identifiable information11
Interaction Natural Language Communication Generating non-factual or hallucinated content12
Interaction Natural Language Communication Generating copyrighted content13
Interaction Multimodal Understanding & Generation Generating undesirable content (e.g. toxic, hateful, sexual)14
Interaction Multimodal Understanding & Generation Generating unqualified advice in specialised domains (e.g. medical, financial, legal)15
Interaction Multimodal Understanding & Generation Generating controversial content (e.g. political, competitors)16
Interaction Multimodal Understanding & Generation Regurgitating personally identifiable information17
Interaction Multimodal Understanding & Generation Generating non-factual or hallucinated content18
Interaction Multimodal Understanding & Generation Generating copyrighted content19
Interaction Official Communications Making inaccurate promises or statements to the public20
Interaction Official Communications Sending undesirable content to recipients21
Interaction Official Communications Sending malicious content to recipients22
Interaction Official Communications Misleading recipients about the authorship of the communications23
Interaction Official Communications Sending personally identifiable or sensitive data24
Interaction Business Transactions Allowing unauthorised transactions25
Interaction Business Transactions Increasing the system's vulnerability to attackers exfiltrating credentials for transactions through the agent26
Interaction Internet & Search Access Opening vulnerabilities to prompt injection attacks via malicious websites27
Interaction Internet & Search Access Returning unreliable information or websites28
Interaction Computer Use Opening vulnerabilities to prompt injection attacks29
Interaction Computer Use Accessing personally identifiable or sensitive data30
Interaction Other Programmatic Interfaces Leaking personally identifiable or sensitive data31
Interaction Other Programmatic Interfaces Increasing the system's vulnerability to supply chain attacks32
Operational Agent Communication Enabling the exfiltration of sensitive data33
Operational Agent Communication Communicating insecurely resulting in man-in-the-middle attacks34
Operational Agent Communication Misinterpreting inter-agent messages due to poor formatting or weak protocols35
Operational Agent Communication Passing on prompt injection attacks across agents36
Operational Code Execution Executing poor code37
Operational Code Execution Executing vulnerable or malicious code38
Operational File & Data Management Overwriting or deleting database tables or files39
Operational File & Data Management Overwhelming the database with poor, inefficient, or repeated queries40
Operational File & Data Management Exposing personally identifiable or sensitive data from databases or files41
Operational File & Data Management Opening vulnerabilities to prompt injection attacks via malicious data or files42
Operational System Management Escalating the agent's own privileges43
Operational System Management Misconfiguring system resources, compromising system integrity and availability44
Operational System Management Overwhelming the system with poor, inefficient, or repeated requests45

  1. Cuadron et al. The Danger of Overthinking: Examining the Reasoning‑Action Dilemma in Agentic Tasks. arXiv preprint arXiv:2502.08235, 2025. https://arxiv.org/pdf/2502.08235, Accessed: 2025‑07‑26. 

  2. Chen et al. Reasoning Models Don't Always Say What They Think. https://www.anthropic.com/research/reasoning-models-dont-say-think, Accessed: 2025‑07‑26. 

  3. Xie et al. TravelPlanner: A Benchmark for Real‑World Planning with Language Agents. arXiv preprint arXiv:2402.01622, 2024. https://arxiv.org/pdf/2402.01622v4, Accessed: 2025‑07‑26. Xie et al. Revealing the Barriers of Language Agents in Planning. NAACL Long Papers 2025. https://aclanthology.org/2025.naacl-long.93.pdf, Accessed: 2025‑07‑26. 

  4. Marcus et al. AI still lacks "common" sense, 70 years later. Substack essay, January 5, 2025. https://garymarcus.substack.com/p/ai-still-lacks-common-sense-70-years, Accessed: 2025‑07‑26. 

  5. Cemri et al. Why Do Multi‑Agent LLM Systems Fail? arXiv preprint arXiv:2503.13657, 2025. https://arxiv.org/abs/2503.13657, Accessed: 2025‑07‑26. 

  6. Lupinacci et al. The Dark Side of LLMs: Agent‑based Attacks for Complete Computer Takeover. arXiv preprint arXiv:2507.06850, 2025. https://arxiv.org/abs/2507.06850, Accessed: 2025‑07‑26. 

  7. Kokane et al. ToolScan: A Benchmark for Characterizing Errors in Tool‑Use LLMs. arXiv preprint arXiv:2411.13547, 2024. https://arxiv.org/abs/2411.13547, Accessed: 2025‑07‑26. 

  8. Mazeika et al. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv preprint arXiv:2402.04249v2, 2024. https://arxiv.org/abs/2402.04249v2, Accessed: 2025‑07‑26. 

  9. Barbera, Isabel. AI Privacy Risks & Mitigations in Large Language Models (LLMs). European Data Protection Board Report, 2025. https://www.edpb.europa.eu/system/files/2025-04/ai-privacy-risks-and-mitigations-in-llms.pdf, Accessed: 2025‑07‑26. 

  10. Stanford HAI. AI models like ChatGPT, Claude, and Gemini show partisan bias, study finds. Stanford News, May 22, 2025. https://news.stanford.edu/stories/2025/05/ai-models-llms-chatgpt-claude-gemini-partisan-bias-research-study, Accessed: 2025‑07‑26. 

  11. See footnote 9. 

  12. Zhang et al. Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models. arXiv preprint arXiv:2309.01219, 2023. https://arxiv.org/abs/2309.01219, Accessed: 2025‑07‑26. 

  13. Chen et al. CopyBench: Measuring Literal and Non-Literal Reproduction of Copyright‑Protected Text in Language Model Generation. arXiv preprint arXiv:2407.07087, 2024. https://arxiv.org/abs/2407.07087, Accessed: 2025‑07‑26. 

  14. Liu et al. MM‑SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models. arXiv preprint arXiv:2311.17600, 2023. https://arxiv.org/abs/2311.17600, Accessed: 2025‑07‑26. 

  15. Yan et al. Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA. Findings of ACL 2025. https://aclanthology.org/2025.findings-acl.981.pdf, Accessed: 2025‑07‑27. 

  16. Motoki et al. Assessing political bias and value misalignment in generative artificial intelligence. Journal of Economic Behavior & Organization, 2025. https://www.sciencedirect.com/science/article/pii/S0167268125000241, Accessed: 2025‑07‑27. 

  17. Carlini et al. Extracting Training Data from Diffusion Models. arXiv preprint arXiv:2301.13188, 2023. https://arxiv.org/abs/2301.13188, Accessed: 2025‑07‑27. 

  18. Bai et al. Hallucination of Multimodal Large Language Models: A Survey. arXiv preprint arXiv:2404.18930v2, 2025. https://arxiv.org/abs/2404.18930, Accessed: 2025‑07‑27. 

  19. See footnote 17. 

  20. The Decoder. People buy brand-new Chevrolets for $1 from a ChatGPT chatbot. The Decoder, 2025. https://the-decoder.com/people-buy-brand-new-chevrolets-for-1-from-a-chatgpt-chatbot/, Accessed: 2025‑07‑26. 

  21. Harwell, Drew. X ordered its Grok chatbot to 'tell like it is.' Then the Nazi tirade began., July 11, 2025. https://www.washingtonpost.com/technology/2025/07/11/grok-ai-elon-musk-antisemitism/, Accessed: 2025‑07‑26. 

  22. Threat Hunter Team. AI: Advent of Agents Opens New Possibilities for Attackers. Threat Intelligence Blog (Symantec / Broadcom), March 13 2025. https://www.security.com/threat-intelligence/ai-agent-attacks, Accessed: 2025‑07‑27. 

  23. Goldman. A customer support AI went rogue—and it's a warning for every company. Fortune, April 2025. https://fortune.com/article/customer-support-ai-cursor-went-rogue/, Accessed: 2025‑07‑27. 

  24. See footnote 9. 

  25. Kulp, Patrick. AI agents may be vulnerable to financial attacks. Tech Brew (Emerging Tech Brew), May 29 2025. https://www.emergingtechbrew.com/stories/2025/05/29/ai-agents-vulnerable-financial-attacks, Accessed: 2025‑07‑27. 

  26. Alizadeh et al. Simple Prompt Injection Attacks Can Leak Personal Data Observed by LLM Agents During Task Execution. arXiv preprint arXiv:2506.01055, 2025. https://arxiv.org/pdf/2506.01055, Accessed: 2025‑07‑27. 

  27. Chen et al. AI Agents Are Here—So Are the Threats: Unit 42 Unveils the Top 10 Agentic‑AI Security Risks. Palo Alto Networks Unit 42 blog, 2025. https://unit42.paloaltonetworks.com/agentic-ai-threats/, Accessed: 2025‑07‑27. 

  28. Delaney, Max. Google's AI Overviews are often so confidently wrong that I've lost all trust in them. TechRadar, 2025. https://www.techradar.com/computing/artificial-intelligence/googles-ai-overviews-are-often-so-confidently-wrong-that-ive-lost-all-trust-in-them, Accessed: 2025‑07‑27. 

  29. Mudryi et al. The Hidden Dangers of Browsing AI Agents. arXiv preprint arXiv:2505.13076v1, 2025. https://arxiv.org/html/2505.13076v1, Accessed: 2025‑07‑27. Martin, Jason. Indirect Prompt Injection of Claude Computer Use. HiddenLayer blog, 2025. https://hiddenlayer.com/innovation-hub/indirect-prompt-injection-of-claude-computer-use/, Accessed: 2025‑07‑27. 

  30. Yang et al. RiOSWorld: Benchmarking the Risk of Multimodal Computer‑Use Agents. arXiv preprint arXiv:2506.00618, 2025. https://arxiv.org/html/2506.00618v3, Accessed: 2025‑07‑27. 

  31. Park, Sean. Unveiling AI Agent Vulnerabilities Part III: Data Exfiltration. TrendMicro, May 2025. https://www.trendmicro.com/vinfo/sg/security/news/threat-landscape/unveiling-ai-agent-vulnerabilities-part-iii-data-exfiltration, Accessed: 2025-07-29. 

  32. Unit 42. GitHub Actions Supply Chain Attack: A Targeted Attack on Coinbase Expanded to the Widespread tj-actions/changed-files Incident: Threat Assessment. Palo Alto Networks, Apr 2025. https://unit42.paloaltonetworks.com/github-actions-supply-chain-attack/, Accessed: 2025-08-04. 

  33. Munoz, Alvaro. GHSL-2024-294: Environment variable injection leading to potential secret exfiltration and privilege escalation in Azure/cli. Security Lab, Dec 2024. https://securitylab.github.com/advisories/GHSL-2024-294_Azure-cli/?utm_source=chatgpt.com, Accessed: 2025-08-04. 

  34. He et al. Red-Teaming LLM Multi-Agent Systems via Communication Attacks. arXiv preprint arXiv:2502.14847, 2025. https://arxiv.org/pdf/2502.14847, Accessed: 2025‑07‑27. 

  35. Kong et al. A Survey of LLM-Driven AI Agent Communication: Protocols, Security Risks, and Defense Countermeasures. arXiv preprint arXiv:2506.19676, 2025. https://arxiv.org/html/2506.19676, Accessed: 2025‑07‑27. 

  36. Ferrag et al. From Prompt Injections to Protocol Exploits: Threats in LLM-Powered AI Agents Workflows. arXiv preprint arXiv:2506.23260, 2025. https://arxiv.org/html/2506.23260, Accessed: 2025‑07‑27. 

  37. Spracklen et al. We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs. USENIX Security Symposium 2025 (preprint), 2025. https://www.usenix.org/system/files/conference/usenixsecurity25/sec25cycle1-prepub-742-spracklen.pdf, Accessed: 2025‑07‑27. METR. Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. METR blog, July 10 2025. https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/, Accessed: 2025‑07‑27. Guo et al. RedCode: Risky Code Execution and Generation Benchmark for Code Agents. NeurIPS 2024 Datasets and Benchmarks Track, 2024. https://proceedings.neurips.cc/paper_files/paper/2024/hash/bfd082c452dffb450d5a5202b0419205-Abstract-Datasets_and_Benchmarks_Track.html, Accessed: 2025‑07‑27. 

  38. Peng et al. CWEVAL: Outcome‑driven Evaluation on Functionality and Security of LLM Code Generation. arXiv preprint arXiv:2501.08200, 2025. https://arxiv.org/pdf/2501.08200, Accessed: 2025‑07‑27. Dilgren et al. SecRepoBench: Benchmarking LLMs for Secure Code Generation in Real-World Repositories. arXiv preprint arXiv:2504.21205, 2025. https://arxiv.org/html/2504.21205v1, Accessed: 2025‑07‑27. 

  39. Pedro et al. Holodeck: Prompt-to-SQL Injections in LLM-Integrated Web Applications: Risks and Defenses. ICSE 2025 research track. https://syssec.dpss.inesc-id.pt/papers/pedro_icse25.pdf, Accessed: 2025‑07‑27. 

  40. Ramirez et al. Which LLM Writes the Best SQL? Benchmarking analytical SQL generation by LLMs. Tinybird Blog, 2025. https://www.tinybird.co/blog-posts/which-llm-writes-the-best-sql, Accessed: 2025‑07‑27. 

  41. Poireault, Kevin. Microsoft 365 Copilot hit by a zero‑click AI vulnerability allowing data exfiltration. Infosecurity Magazine, 2025. https://www.infosecurity-magazine.com/news/microsoft-365-copilot-zeroclick-ai/, Accessed: 2025‑07‑27. 

  42. diskordia. Inside CVE-2025-32711 (EchoLeak): Prompt injection meets AI exfiltration. Hack The Box Blog, 2025. https://www.hackthebox.com/blog/cve-2025-32711-echoleak-copilot-vulnerability, Accessed: 2025‑07‑27. Burgess, Matt. Here Come the AI Worms. WIRED, 2025. https://www.wired.com/story/here-come-the-ai-worms/, Accessed: 2025‑07‑27. 

  43. Kim et al. Prompt Flow Integrity to Prevent Privilege Escalation in LLM Agents. arXiv preprint arXiv:2503.15547v1, 2025. https://arxiv.org/html/2503.15547v1, Accessed: 2025‑07‑27. 

  44. Kon et al. IaC-Eval: A Code Generation Benchmark for Cloud Infrastructure-as-Code Programs. NeurIPS 2024 poster, 2024. https://neurips.cc/virtual/2024/poster/97835, Accessed: 2025‑07‑27. Romeo et al. ARPaCCino: An Agentic-RAG for Policy as Code Compliance. arXiv preprint arXiv:2507.10584v1, 2025. https://arxiv.org/html/2507.10584v1, Accessed: 2025‑07‑27. 

  45. Zhang et al. Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification. arXiv preprint arXiv:2407.20859v1, 2024. https://arxiv.org/html/2407.20859v1, Accessed: 2025‑07‑27. OWASP. LLMRISK‑102025: Unbounded Consumption. OWASP GenAI Risk Database, 2025. https://genai.owasp.org/llmrisk/llm102025-unbounded-consumption/, Accessed: 2025‑07‑27.