Why the ARC Framework?

In this section, we explain why we developed the ARC framework, how it fits into the existing literature, and why we chose to focus on capabilities specifically.

Background

OpenAI dubbed 2025 the "year of the AI agent”,¹ a prediction which quickly proved prescient. Within the first half of the year, major companies launched increasingly powerful systems that allow large language model (LLM) agents to reason, plan, and autonomously execute tasks such as code development or trip planning.² However, this surge in agent-driven AI innovation also brought renewed scrutiny to these systems' safety and security risks. Recent research has shown that LLM agents are often more prone to unsafe behaviors than their base models,³ partly due to their growing autonomy and expanded capabilities through tool integration. As the ecosystem for agentic systems, including frameworks like Model Context Protocol⁴, continues to mature, there is greater urgency to establish robust governance mechanisms.

Existing Work on Technical Frameworks

As increased autonomy represents reduced determinism and predictability of agentic systems, many technical frameworks have focused on assessing the risks from a security perspective, adapting from cybersecurity literature. This includes Cloud Security Alliance's MAESTRO framework⁵ and OWASP's whitepaper on agentic AI risks⁶, which focus on threat modeling approaches to identify risks and mitigations. These approaches detail security threats agentic systems could face (e.g., data poisoning, agent impersonation, cascading hallucination) and provide accompanying mitigations. Developers are expected to map out the system's architecture, and then identify attack vectors in each component from the list of risks. However, this is an onerous and challenging process for non-cybersecurity developers, as they are expected to enumerate across all possible risks for each component. NVIDIA's recommended taint tracing approach⁷ similarly requires developers to identify security boundaries in the system's architecture with respect to the flow of untrusted data, assess the sensitivity of tools impacted by the flow and then adopt appropriate security controls accordingly. However, unlike OWASP and MAESTRO, the recommended controls are not as comprehensive and largely entail human oversight, which has severe limitations.

Academic literature has focused on producing safety and security benchmarks to assess agentic AI risks. These benchmarks entail test scenarios or tasks that are expected to reveal specific risk behaviours of the agentic system. For example, CVEBench,⁸ RedCode,⁹ AgentHarm,¹⁰, AgentDojo¹¹ assess whether LLMs can complete cybersecurity attacks or harmful tasks like fraud. In these settings, agentic systems are assessed on whether they comply with and successfully complete multi-step requests. However, these benchmarks do not help developers systematically identify the risks and attack scenarios of their specific applications. Tool-based benchmarks that entail evaluating the risks of individual tools that agents have access to face similar problems. APIBench,¹² ToolSword,¹³, and ToolEmu¹⁴, evaluate the performance and safety of LLMs in utilizing and interacting with tools like bash. However, there exists agentic system risks unrelated to tool use (e.g., using a misaligned planning agent resulting in bad plans) which such benchmarks will not cover. As such, they are insufficient in helping developers assess agentic risks.

In terms of technical controls for managing risks, AI control has emerged as a useful paradigm in preventing misaligned AI systems from causing harm.¹⁵ Rather than relying solely on training techniques to shape model behavior, AI control focuses on designing mechanisms like monitoring and human approval to constrain AI systems. In line with this paradigm, some works have emerged to generate custom controls for specific applications. Progent introduces a language for flexibly expressing privilege control policies applied during execution for each agent.¹⁶ The UK AI Security Institute similarly advocates for application-specific controls, which is derived from evaluating each system's threat model-specific capabilities based on models and deployment contexts.¹⁷

Instead of providing a list of risks and controls upfront, some works have instead published guides on best practices in evaluating and managing risks of agentic AI systems. OpenAI¹⁸ shared best practices like constraining the agent's action spaces, setting default behaviors and ensuring attributabilty, as well as open problems for further investigation. Google¹⁹ emphasises a hybrid defence-in-depth strategy that combines traditional, deterministic security measures with dynamic, reasoning-based defenses, and higlights core principles of human controls, agent permissions and observability in designing agentic systems. Similarly, Beurer-Kellner et al.²⁰ propose six design patterns for building AI agents with provable resistance to prompt injection, based on the principle that once an LLM agent has ingested untrusted input, it must be constrained so that it is impossible for that input to trigger any consequential actions.

What is different about the ARC?

The ARC builds on existing works to enable organisations to govern agentic systems effectively and provide practical guidance in assessing and mitigating the associated risks. Importantly, it attempts to structure the risks and controls identified in existing works into a unified and actionable framework. This is achieved mainly by including and differentiating between baseline and capability risks and controls. With a structured approach, the ARC attempts to strike an appropriate balance between:

Thoroughly mapping out the risk landscape, and not being too inflexible to changes - Instead of listing out all possible agentic risks upfront, the ARC ties capabilities to risks, providing flexibility for identifying more risks as agentic systems acquire more capabilities
Being applicable and adaptable to all agentic AI systems, and not being too generic or watered down - The ARC combines both generic (i.e., baseline) and specific (i.e., capability) risks and controls to ensure broad applicability and minimal standards while mantaining customisability
Providing meaningful and concrete guidance on the risks and controls, and not being too prescriptive - Contextualising risk identification from a capabilities perspective is simpler and more straightforward for developers, resulting in practical and proportional risk assessment and mitigation
Scalability at an organisation-level, and not simply being a paper exercise - The ARC provides an initial set of risks and controls to reduce compliance effort and ensure scalability across large organisations, while the customisation process ensures proper risk assessment

Hamilton, E. 2025 is the year of ai agents, OpenAI CPO says. Axios, January 2025. URL https://www.axios.com/2025/01/23/davos-2025-ai-agents. Accessed: 2025-05-11. ↩
Both Anthropic and Google released agentic tools (Claude Code and Gemini CLI respectively), built on top of their popular models. Most recently, OpenAI released ChatGPT Agent in July 2025 (article). ↩
See the following articles: (i) Chiang et al. Harmful helper: Perform malicious tasks? web AI agents might help. In ICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025. URL https://openreview.net/forum?id=4KoMbO2RJ9, (ii) Kumar et al. Aligned LLMs are not aligned browser agents. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=NsFZZU9gvk, (iii) Yu, C. and Papakyriakopoulos, O. Safety devolution in AI agents. In ICLR 2025 Workshop on Human-AI Coevolution, 2025. URL https://openreview.net/forum?id=7nJmuFFkWd. ↩
Anthropic. Model Context Protocol (MCP). https://docs.anthropic.com/en/docs/agents-and-tools/mcp, 2024. Accessed: 2025-05-11. ↩
Ken Huang. Agentic AI Threat Modeling Framework: MAESTRO. https://cloudsecurityalliance.org/blog/2025/02/06/agentic-ai-threat-modeling-framework-maestro, 2025. Accessed: 2025‑07‑26. ↩
OWASP Foundation. Agentic AI – Threats and Mitigations. https://genai.owasp.org/resource/agentic-ai-threats-and-mitigations/, 2025. Accessed: 2025‑07‑26. ↩
Harang et al. Agentic Autonomy Levels and Security. https://developer.nvidia.com/blog/agentic-autonomy-levels-and-security/, 2025. Accessed: 2025‑07‑26. ↩
Zhang et al. Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risks of Language Models. https://arxiv.org/abs/2408.08926, 2024. Accessed: 2025-05-11. ↩
Guo et al. RedCode: Risky Code Execution and Generation Benchmark for Code Agents. NeurIPS 2024 Datasets and Benchmarks Track. https://proceedings.neurips.cc/paper_files/paper/2024/hash/bfd082c452dffb450d5a5202b0419205‑Abstract‑Datasets_and_Benchmarks_Track.html, 2024. Accessed: 2025‑07‑26. ↩
Andriushchenko et al. AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents. https://arxiv.org/abs/2410.09024, 2025. Accessed: 2025-05-11. ↩
Debenedetti et al. AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents. https://arxiv.org/abs/2406.13352, 2024. Accessed: 2025-05-11. ↩
Patil et al. Gorilla: Large Language Model Connected with Massive APIs. https://arxiv.org/abs/2305.15334, 2023. Accessed: 2025-05-11. ↩
Ye et al. ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages. https://arxiv.org/abs/2402.10753, 2024. Accessed: 2025-05-11. ↩
Ruan et al. Identifying the Risks of LM Agents with an LM-Emulated Sandbox. https://arxiv.org/abs/2309.15817, 2023. Accessed: 2025-05-11. ↩
Greenblatt et al. arXiv preprint arXiv:2312.06942, 2023. https://arxiv.org/abs/2312.06942, accessed: 2025‑07‑26. ↩
Shi et al. Progent: Programmable Privilege Control for LLM Agents. https://arxiv.org/abs/2504.11703, 2025. Accessed: 2025-05-11. ↩
Korbak et al. How to evaluate control measures for LLM agents? A trajectory from today to superintelligence. https://arxiv.org/abs/2504.05259, 2025. Accessed: 2025-05-11. ↩
Shavit et al. Practices for Governing Agentic AI Systems. OpenAI white paper, 2023. https://cdn.openai.com/papers/practices-for-governing-agentic-ai-systems.pdf, Accessed: 2025‑07‑26. ↩
Díaz et al. An Introduction to Google’s Approach for Secure AI Agents. Google white paper, 2025. https://research.google/pubs/an-introduction-to-googles-approach-for-secure-ai-agents/, Accessed: 2025‑07‑26. ↩
Beurer‑Kellner et al. Design Patterns for Securing LLM Agents against Prompt Injections. arXiv preprint arXiv:2506.08837, first posted June 10, 2025; revised versions through June 27, 2025. https://arxiv.org/abs/2506.08837, Accessed: 2025‑07‑26. ↩