Anthropic Walks Back "Secret Sabotage" Policy After Researchers Revolt Over Claude Fable 5 Safeguards
Anthropic reversed course Wednesday on a controversial safeguard policy for its newly released Claude Fable 5 model that would have covertly degraded the model's performance for researchers working on competing AI systems — without informing them. The company issued an apology and pledged to make all safety interventions visible to users after facing fierce backlash from the AI research community.
The policy, first reported by WIRED, was part of the safety framework Anthropic deployed when it released Claude Fable 5 earlier this week. While the company publicly disclosed that it would reroute users asking about cybersecurity, biology, or chemistry to a less capable model, it quietly reserved a more aggressive approach for users engaged in frontier AI development: deliberately sabotaging model performance through methods like prompt modification, steering vectors, and parameter-efficient fine-tuning — all without any visible notification to the user.
Widespread Backlash and False Positives
The hidden safeguard was compounded by an unrelated but compounding problem: Fable 5's safety classifiers were so aggressively tuned that they were flagging and refusing completely benign requests. Researchers at the Gates Foundation's Institute for Disease Modeling reported that Claude Fable 5 triggered a refusal on the word "Hello" as the only user input. An immunologist at the Jackson Laboratory noted that the word "cancer" was being flagged as a biosecurity risk. Numerous bug reports flooded Anthropic's Claude Code GitHub repository documenting false positives on harmless queries.
"It felt like Anthropic was saying to the public, 'We don't trust anybody else to do AI research. We are the only ones who have to do AI research.' It feels a bit like they're starting to pull the ladder up behind them." — Will Brown, research lead at Prime Intellect, in an interview with WIRED
Dean Ball, a senior fellow at the Foundation for American Innovation and a former White House AI adviser, called the hidden performance degradation "shockingly hostile and a terrible look," noting that the "secret sabotage" policy undermines Anthropic's stated commitment to AI safety collaboration by limiting other researchers from participating in frontier AI safety work.
Anthropic's Reversal
Late Wednesday, Anthropic announced it was changing course. "We're changing Fable 5's safeguards for frontier LLM development to make them visible," the company said in a statement. "We made the wrong trade-off and we apologize for not getting the balance right." Under the revised policy, flagged requests will now visibly fall back to Claude Opus 4.8, and API users will receive a clear reason for any refusal. Anthropic estimated that the hidden safeguard affected approximately 0.05% of tasks and fewer than 0.05% of organizations, but noted that making safeguards visible requires a wider detection net, which may increase false positives in the short term.
The company previously argued that hidden safeguards are harder to probe and work around, allowing for narrower targeting. In a statement to WIRED, Anthropic said the measures were designed to prevent foreign adversaries from using its most capable models to erode the US advantage in frontier chips and optimized AI software. Critics, however, countered that the policy could have created a future where only a handful of leading AI labs could perform advanced AI research — effectively centralizing control over the direction of the field.
The episode underscores the growing tension between AI companies' safety ambitions and the developer community's expectation of transparency. As Anthropic prepares for what could be the largest IPO in history at a $965 billion valuation, the incident serves as a reminder that trust — not just capability — will determine which AI platforms earn long-term loyalty from the research community.
Source: WIRED, The Register