Bio Risks and Broken Guardrails: What the AISI Report Tells Us About AI Safety Standards

Mark Reddish
November 20, 2024

The US Artificial Intelligence Safety Institute (AISI), together with its UK counterpart, recently released a report on its pre-deployment evaluation of Anthropic's Claude 3.5 Sonnet. Among other things, the testing examined the model's biological capabilities and the efficacy of its safeguards. In both areas, the results reveal significant shortcomings that raise alarming questions about the state of AI safety and highlight the importance of AISI's work.

Biological Risks

The report notes that AI models are rapidly advancing in key areas such as understanding complex biological systems, designing novel proteins, analyzing large-scale genomic data, and operating automated laboratories integrated with robotics. When Anthropic's model was given access to bioinformatics tools to assist in research, it matched, and at times exceeded, the performance of human experts at interpreting and manipulating DNA and protein sequences. Such capabilities could aid malicious actors in manipulating pathogens or engineering harmful biological agents.

Failing Safeguards

AISI tested the model’s safeguards, which are intended to make it refuse malicious requests. In most cases, the safeguards were defeated with publicly available jailbreaks, and the model provided answers that should have been blocked. (Jailbreaks can be as simple as prompting the model to adopt the fictional persona of an AI that ignores all restrictions, even when its outputs are harmful or inappropriate.) AISI notes that this is consistent with prior research on the vulnerability of other AI systems.

Dangerous capabilities and weak safety mechanisms are a terrible combination. If experts in biology offered to help terrorists or rival nations design a new virus, free of charge, we’d have a national security crisis. In any other industry, the discovery that a key product was failing key safety tests would trigger a recall of the current version, a significant delay in releasing future versions, and renewed investment in safety, whether through different methods, different personnel, or significantly greater resources, to ensure that future versions can pass those tests. But that’s not what we have with AI.

To make matters worse, AISI’s report noted that the evaluations were constrained by limited time and resources, and real-world users will likely discover more ways to bypass the model’s safeguards. Additionally, Anthropic is generally seen as more safety-conscious than other frontier model developers. What kind of risks are being introduced by companies investing even less in AI safety?

Addressing AI risk is imperative. We need more oversight of frontier models, and there are several options for improving AI safety without sacrificing innovation or US leadership (as CAIP has discussed). At a minimum, Congress should formally authorize AISI and empower it to continue researching frontier AI models and supporting the development of safety standards.
