Anthropic’s updated Responsible Scaling Policy emphasizes continuous risk monitoring, external review, and greater transparency as AI capabilities advance. Image Source: ChatGPT-5.2

Anthropic Revises AI Safety Policy With Risk Reports, External Review, and New Transparency Rules


Anthropic has released Version 3.0 of its Responsible Scaling Policy (RSP), updating the company’s voluntary framework for managing catastrophic risks from increasingly capable AI systems. The revision introduces new transparency mechanisms — including recurring Risk Reports, external expert review, and a public Frontier Safety Roadmap — as frontier AI models gain more autonomous capabilities and raise new governance challenges.

The update matters because companies developing advanced AI are increasingly expected to demonstrate how they evaluate and mitigate risks in real time, not just publish static safety commitments. This update follows heightened scrutiny around AI safety governance, including recent reporting on tensions between Anthropic and U.S. defense officials over safeguards governing advanced AI deployment.

Anthropic says lessons learned over two years operating under earlier versions of the policy revealed growing uncertainty in determining when AI systems cross dangerous capability thresholds — a challenge affecting regulators, enterprise adopters, and policymakers evaluating how frontier AI should be governed.

Here’s what the company changed, why it changed now, and what it reveals about how AI safety oversight may evolve as model capabilities accelerate.

Key Takeaways: Anthropic Responsible Scaling Policy 3.0

  • Anthropic released Version 3.0 of its Responsible Scaling Policy, updating how the company evaluates catastrophic AI risks.

  • The policy introduces Risk Reports published every 3–6 months detailing model capabilities, threat scenarios, and mitigations.

  • New rules require external expert review of safety assessments under certain conditions.

  • Anthropic created a Frontier Safety Roadmap outlining goals for security, alignment, safeguards, and AI policy development.

  • The company acknowledges growing uncertainty in measuring AI capability thresholds, particularly in biological and dual-use domains.

  • The update separates company safety commitments from industry-wide safety recommendations requiring government or multilateral coordination.

How Anthropic’s Responsible Scaling Policy Works

Anthropic introduced its Responsible Scaling Policy (RSP) in September 2023 to address a core governance challenge: how to manage risks from AI systems that do not yet exist but may emerge rapidly as capabilities advance.

At the time the policy was first written, large language models functioned primarily as conversational tools. Since then, frontier systems have gained the ability to browse the web, write and execute code, operate computer environments, and perform autonomous multi-step actions. As these capabilities expanded, new categories of risk emerged — from misuse in biological research to the potential theft or replication of model weights.

To address this uncertainty, the policy was built around conditional commitments. In practice, this means:

  • If a model exceeds a defined capability threshold, such as demonstrating advanced biological knowledge that could assist in developing dangerous weapons,

  • Then additional safeguards must be implemented before that model can be deployed.

These safeguards are organized into escalating AI Safety Levels (ASLs). For example:

  • ASL-2 requires a defined baseline set of protections.

  • ASL-3 introduces more stringent controls aimed at preventing misuse and improving model security.

  • Higher levels were intentionally left less defined, reflecting uncertainty about what future AI systems might be capable of; the expectation was that those safeguards would be specified in greater detail once clearer evidence emerged about higher capability thresholds.

Anthropic describes the RSP as more than an internal compliance tool. It was designed as a broader “theory of change” for influencing AI governance.

First, the policy was intended to function as an internal forcing mechanism, requiring critical safeguards to be treated as formal prerequisites for both training and launching new models rather than optional add-ons. Anthropic says this helped clarify expectations across a rapidly growing organization, embedding safety requirements into development timelines and accelerating progress on mitigation efforts.

Second, Anthropic hoped the public release of the RSP would encourage a “race to the top,” prompting other frontier AI developers to adopt similar safety frameworks and strengthen industry-wide safeguards.

Third, the company anticipated that clearly defined capability thresholds could serve as coordination points for the broader ecosystem. If a model crossed a particularly sensitive threshold, such as enabling end-to-end support for bioweapon development, Anthropic intended to implement safeguards internally while encouraging coordinated action across industry and government.

Finally, the policy acknowledged that some future safeguards — particularly those aimed at defending against state-level cyber threats — might exceed what a single company could achieve independently. The expectation was that as AI capabilities advanced, governments would recognize emerging risks and collaborate on collective mitigation strategies.

Version 3.0 reflects Anthropic’s reassessment of how well those assumptions have held up in practice.

How Anthropic Evaluates the Results of Its Responsible Scaling Policy

Two and a half years after introducing the Responsible Scaling Policy, Anthropic says some parts of its original strategy worked as intended, while others proved more difficult to implement in practice.

According to Anthropic, the policy produced three main outcomes:

  • Accelerated development and activation of ASL-3 safeguards

  • Adoption of comparable safety frameworks by other frontier AI labs

  • Ongoing difficulty determining when advanced models cross defined risk thresholds

The company reports that the framework successfully accelerated the development of internal safeguards. To meet its ASL-3 deployment standard, Anthropic built increasingly sophisticated input and output classifiers designed to block harmful content, particularly in areas related to chemical and biological risks. ASL-3 protections were activated for relevant models in May 2025 and have continued to evolve since deployment.

Anthropic also says the policy helped encourage similar safety frameworks across the industry. The company notes that within months of publishing the original RSP, OpenAI and Google DeepMind introduced broadly comparable governance approaches, and that some developers implemented biological-risk classifiers similar to Anthropic’s ASL-3 safeguards.

Anthropic argues that principles behind these voluntary standards have contributed to early AI policy efforts requiring frontier developers to publish risk-assessment and mitigation frameworks. The company points to emerging regulations and policy initiatives — including California’s SB 53, New York’s RAISE Act, and elements of the EU AI Act’s Codes of Practice — which emphasize transparency and structured risk evaluation. Anthropic says encouraging industry-wide transparency frameworks was a central objective of the Responsible Scaling Policy.

However, the company acknowledges that key assumptions behind the policy did not fully play out as expected.

A central challenge emerged around what Anthropic calls a “zone of ambiguity.” The company says its original idea — that predefined capability thresholds could serve as clear coordination points for industry-wide action — did not fully materialize in practice. In several cases, model capabilities appeared to approach RSP thresholds, yet it remained unclear whether those thresholds had definitively been crossed.

Anthropic notes that the science of model evaluation is not yet mature enough to provide definitive answers in high-risk domains. In some instances, the company says it adopted a precautionary approach and implemented stricter safeguards even amid uncertainty. However, internal uncertainty made it more difficult to build a persuasive external case for coordinated, multilateral action across the AI industry.

Biological capabilities provide a concrete example. Anthropic says its models now demonstrate sufficient biological knowledge to pass most available short-form evaluation tests, making it difficult to argue that risks are clearly low. At the same time, those tests alone do not provide conclusive evidence that risks are high. The company supported more extensive validation efforts, including wet-lab studies, but notes that such research takes time — often long enough that newer, more capable models emerge before definitive results are available.

Anthropic also says government engagement on AI safety progressed more slowly than anticipated despite rapid advances in AI capabilities over the past several years. According to the company, policy discussions increasingly prioritized economic competitiveness and technological leadership, while safety-focused initiatives struggled to gain sustained traction at the federal level.

Anthropic maintains that government involvement remains necessary to address frontier AI risks but describes progress as gradual rather than automatic, arguing that stronger coordination does not naturally emerge simply as AI systems become more capable or approach predefined risk thresholds. The company says effective engagement must be grounded in evidence and framed in terms of national security interests, economic competitiveness, and public trust — areas it believes can align safety considerations with broader policy priorities.

Finally, Anthropic says its experience implementing ASL-3 safeguards demonstrated that meaningful protections can be deployed unilaterally and at manageable operational cost. However, the company warns this may not hold true at higher capability levels. While later AI Safety Levels remain only partially defined, Anthropic says some of the stronger mitigations envisioned in earlier versions of the policy could prove impossible for a single company to implement independently without collective action.

As an illustration, Anthropic cites research from RAND indicating that the highest tier of model-weight security — designed to defend against sophisticated state-level cyber operations — is currently considered unattainable without support from national security institutions. The company argues this highlights the growing gap between risks that frontier AI systems may introduce and the safeguards private organizations can realistically deploy alone.

Anthropic says this challenge is compounded by three overlapping factors: uncertainty about when models cross risk thresholds, slower-than-expected government coordination on AI safety amid a policy climate that increasingly prioritized economic competitiveness over safety-oriented regulation, and safety requirements at higher capability levels that a single company may not be able to implement on its own. The company notes that it could have redefined future ASL-4 and ASL-5 safeguards to make compliance easier, but doing so would have weakened the original intent of the framework.

Rather than lowering standards to make future compliance easier, the company says it chose to restructure the Responsible Scaling Policy before reaching those higher capability tiers. Version 3.0 shifts toward commitments that Anthropic believes are difficult but achievable independently, while continuing to outline broader safeguards that would require coordinated industry and government action.

New Safety Measures Introduced in Responsible Scaling Policy 3.0

1. Separating Company Actions From Industry Recommendations

Anthropic says Version 3.0 separates what it will do as an individual company from what it believes would be needed across the broader AI ecosystem to manage catastrophic risks at higher capability levels.

In practice, the updated policy outlines two tracks:

  • Company commitments: safeguards Anthropic says it will pursue regardless of what competitors, regulators, or governments do.

  • Industry-wide recommendations: a more ambitious capabilities-to-mitigations map describing what Anthropic believes would be required if advanced AI risks become harder to control without coordinated action.

The company presents this separation as a response to the limitations it identified in earlier versions of the RSP, including uncertainty about capability thresholds and the possibility that some future safeguards may not be achievable by any single developer acting alone.

2. Frontier Safety Roadmap

Anthropic says it will now develop and publish a Frontier Safety Roadmap that describes concrete goals for reducing risk across four categories:

  • Security (protecting systems and model weights from theft or compromise)

  • Alignment (ensuring systems reliably follow intended rules and constraints)

  • Safeguards (preventing harmful misuse and managing deployment risk)

  • Policy (proposals intended to guide scalable oversight as risks increase)

Unlike hard contractual commitments, Anthropic characterizes these as public, nonbinding goals that it plans to grade itself against over time. The company says this approach is meant to preserve ambition while making progress measurable to outside observers.

Anthropic’s examples include:

  • Launching “moonshot” R&D projects aimed at achieving unusually high levels of information security

  • Developing more automated red-teaming methods intended to exceed the coverage Anthropic currently gets from broad bug bounty participation

  • Expanding systematic measures designed to keep Claude aligned with its constitutional principles

  • Creating centralized records of critical AI development activities — and using AI to analyze those records for potential insider risk (human or AI) and other security threats

  • Publishing a policy roadmap that proposes a “regulatory ladder”: oversight requirements designed to scale as AI capability and risk increase, intended to help guide future government AI policy.

3. Risk Reports and External Review

Anthropic says Version 3.0 places greater emphasis on recurring Risk Reports, which it plans to publish online every three to six months (with some redactions).

Each Risk Report will provide detailed information about the safety profile of Anthropic’s models at the time of publication. The company says the reports will go beyond listing model capabilities. Instead, they are intended to explain how:

  • model capabilities,

  • defined threat models (the specific ways systems might pose risks), and

  • active risk mitigations

fit together into an overall assessment of the model’s safety risk at that point in time.

Anthropic says this reporting structure builds on lessons from its May 2025 Safeguards Report, which it describes as useful both for internal analysis and for communicating risk to the public.

Anthropic says it will appoint expert third-party reviewers who are deeply familiar with AI safety research, free of major conflicts of interest, and explicitly incentivized to provide open and honest assessments of the company’s safety position. These reviewers would receive unredacted or minimally redacted access to relevant Risk Reports and publish a public evaluation of Anthropic’s reasoning, analysis, and decision-making. Anthropic says it is already running pilots for this process, even though it does not believe external review is required for its current models.

Anthropic adds that Risk Reports are intended to highlight gaps between what it is currently able to implement and what it believes would be necessary for broader industry-wide safety — and argues that making those gaps visible could influence public understanding and future policy.

Q&A: Anthropic Responsible Scaling Policy 3.0

Q: What is the Responsible Scaling Policy?
A: A voluntary framework Anthropic uses to introduce stronger safeguards as AI models reach higher capability levels.

Q: What changed in Version 3.0?
A: The policy adds Risk Reports, external expert review, and a Frontier Safety Roadmap while separating company commitments from industry recommendations.

Q: Why did Anthropic update the policy now?
A: The company says real-world experience revealed uncertainty in measuring when AI systems cross dangerous capability thresholds.

Q: Will outside experts review Anthropic’s safety decisions?
A: Yes. Independent reviewers may evaluate Risk Reports and publish public assessments.

Q: Does this create new legal requirements?
A: No. The Responsible Scaling Policy remains voluntary but may inform future regulation.

What This Means: How Anthropic’s AI Safety Transparency Model Could Influence Governance

Anthropic’s updated Responsible Scaling Policy reflects a growing reality across the AI industry: technological capability is advancing faster than the tools used to measure risk and the institutions responsible for oversight.

By openly acknowledging uncertainty — rather than treating predefined thresholds as definitive proof of danger or safety — the company is moving toward a model of continuous disclosure. That approach emphasizes recurring risk documentation, independent review, and public reporting over one-time policy commitments.

These governance questions are already moving beyond internal company policy into real-world national security debates, as seen in recent disputes over how AI safeguards should apply to government and defense use cases.

Who should care: Policymakers drafting AI oversight frameworks, enterprise leaders deploying frontier models in sensitive environments, and researchers studying AI governance mechanisms should pay attention to how companies operationalize safety commitments in practice — not just how they describe them.

Why it matters now: As AI systems become more capable and begin taking autonomous actions, users, businesses, and governments increasingly depend on safeguards that evolve alongside the technology. Static safety rules written for earlier generations of AI may not address new risks, making ongoing transparency and continuous safety updates essential for maintaining trust in how these systems are developed and deployed.

What decision this affects: Organizations selecting AI providers may begin evaluating transparency practices, risk documentation processes, and external review mechanisms alongside performance benchmarks and cost considerations — particularly in regulated, defense, healthcare, and critical infrastructure sectors.

Ultimately, the debate may no longer center on whether companies claim their systems are safe — but on whether they can continuously demonstrate, in measurable and reviewable ways, how they are managing risks as those systems grow more capable.

Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.
