Anthropic, one of the leaders in AI development, is calling for proposals to help “fund evaluations developed by third-party organizations.”
Properly evaluating AI is a growing challenge for AI firms as the technology evolves. Not only is it difficult to accurately measure a model’s capabilities, but it is even harder to assess the safety risks those capabilities create.
Anthropic has increasingly been setting itself apart in the AI field, not only for its powerful Claude model, which currently outperforms OpenAI’s GPT-4o on a range of benchmarks, but also for its safety-first approach to AI. In fact, the company was founded by former OpenAI executives who were concerned with the direction OpenAI was taking, and it has continued to attract disillusioned OpenAI engineers. The most notable recent example is Jan Leike, who left OpenAI after the safety team he co-led was disbanded.
With that background, it’s not surprising that Anthropic is interested in developing and discovering new and better ways to properly evaluate AI. The company outlines its highest priority areas of focus:
- AI Safety Level assessments
- Advanced capability and safety metrics
- Infrastructure, tools, and methods for developing evaluations
The company outlines a number of risk areas its AI Safety Level (ASL) assessments are concerned with, including cybersecurity; chemical, biological, radiological, and nuclear (CBRN) risks; model autonomy; national security risks; and misalignment risks. In all of these areas, the company is concerned with the risk that AI could be used to help individuals do harm.
We’re particularly interested in capabilities that, if automated and scaled, could pose significant risks to critical infrastructure and economically valuable systems at levels approaching advanced persistent threat actors.
We’re prioritizing evaluations that assess two critical capabilities: a) the potential for models to significantly enhance the abilities of non-experts or experts in creating CBRN threats, and b) the capacity to design novel, more harmful CBRN threats.
AI systems have the potential to significantly impact national security, defense, and intelligence operations of both state and non-state actors. We’re committed to developing an early warning system to identify and assess these complex emerging risks.
Anthropic also shares a fascinating, and terrifying, observation about current AI models in what the company identifies as “misalignment risks.”
Our research shows that, under some circumstances, AI models can learn dangerous goals and motivations, retain them even after safety training, and deceive human users about actions taken in their pursuit.
The company says this represents a major danger moving forward as AI models become more advanced.
These abilities, in combination with the human-level persuasiveness and cyber capabilities of current AI models, increase our concern about the potential actions of future, more-capable models. For example, future models might be able to pursue sophisticated and hard-to-detect deception that bypasses or sabotages the security of an organization, either by causing humans to take actions they would not otherwise take or by exfiltrating sensitive information.
Anthropic goes on to highlight its desire to improve evaluation methods to address bias issues, something that has been a significant challenge in training existing AI models.
Evaluations that provide sophisticated, nuanced assessments that go beyond surface-level metrics to create rigorous assessments targeting concepts like harmful biases, discrimination, over-reliance, dependence, attachment, psychological influence, economic impacts, homogenization, and other broad societal impacts.
The company also wants to ensure AI benchmarks support multiple languages, something many current benchmarks fail to do. New evaluation methods should also be able to “detect potentially harmful model outputs,” such as “attempts to automate cyber incidents.” In addition, Anthropic wants new evaluations that better gauge AI’s ability to learn, especially in the sciences.
Anthropic’s Criteria
Parties interested in submitting a proposal should keep the company’s 10 requirements in mind:
- Sufficiently difficult: Evaluations should be relevant for measuring the capabilities listed for levels ASL-3 or ASL-4 in our Responsible Scaling Policy, and/or human-expert level behavior.
- Not in the training data: Too often, evaluations end up measuring model memorization because the data is in its training set. Where possible and useful, make sure the model hasn’t seen the evaluation. This helps indicate that the evaluation is capturing behavior that generalizes beyond the training data.
- Efficient, scalable, ready-to-use: Evaluations should be optimized for efficient execution, leveraging automation where possible. They should be easily deployable using existing infrastructure with minimal setup.
- High volume where possible: All else equal, evaluations with 1,000 or 10,000 tasks or questions are preferable to those with 100. However, high-quality, low-volume evaluations are also valuable.
- Domain expertise: If the evaluation is about expert performance on a particular subject matter (e.g. science), make sure to use subject matter experts to develop or review the evaluation.
- Diversity of formats: Consider using formats that go beyond multiple choice, such as task-based evaluations (for example, seeing if code passes a test or a flag is captured in a CTF), model-graded evaluations, or human trials. A minimal sketch of a task-based evaluation appears after this list.
- Expert baselines for comparison: It is often useful to compare the model’s performance to the performance of human experts in that domain.
- Good documentation and reproducibility: We recommend documenting exactly how the evaluation was developed and any limitations or pitfalls it is likely to have. Use standards like Inspect or the METR standard where possible.
- Start small, iterate, and scale: Start by writing just one to five questions or tasks, run a model on the evaluation, and read the model transcripts. Frequently, you’ll realize the evaluation doesn’t capture what you want to test, or it’s too easy.
- Realistic, safety-relevant threat modeling: Safety evaluations should ideally have the property that if a model scored highly, experts would believe that a major incident could be caused. Most of the time, when models have performed highly, experts have realized that high performance on that version of the evaluation is not sufficient to worry them.
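To make a few of these criteria concrete, below is a minimal sketch, in Python, of the kind of task-based evaluation harness the list describes: a small starting set of tasks, automated pass/fail grading (the “code passes a test” format), and saved transcripts for review. This is not Anthropic’s tooling or the Inspect API; the `query_model` stub, the `Task` class, and the sample task are hypothetical placeholders meant only to show how the pieces fit together.

```python
"""Minimal, hypothetical sketch of a task-based evaluation harness.

Illustrates three of Anthropic's criteria: start small (a handful of tasks),
automate grading where possible (code-passes-a-test scoring), and keep
transcripts so you can read what the model actually did.
"""

import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Task:
    prompt: str                   # instruction shown to the model
    check: Callable[[str], bool]  # automated grader for the model's output


def query_model(prompt: str) -> str:
    """Hypothetical stand-in for whatever model API you are evaluating."""
    raise NotImplementedError("Wire this up to the model under evaluation.")


def _passes_add_test(output: str) -> bool:
    """Grade a coding task by executing the model's code and testing it."""
    namespace: dict = {}
    try:
        # Only ever execute model-written code inside a sandbox.
        exec(output, namespace)
        return namespace["add"](2, 3) == 5
    except Exception:
        return False


# Start small: one to five tasks, then read the transcripts and iterate.
TASKS = [
    Task("Write a Python function add(a, b) that returns the sum of a and b.",
         _passes_add_test),
]


def run_eval(tasks: list[Task], transcript_path: str = "transcripts.jsonl") -> float:
    """Run every task, score it automatically, and log a transcript per task."""
    passed = 0
    with open(transcript_path, "w") as log:
        for task in tasks:
            output = query_model(task.prompt)
            ok = task.check(output)
            passed += int(ok)
            log.write(json.dumps({"prompt": task.prompt,
                                  "output": output,
                                  "passed": ok}) + "\n")
    return passed / len(tasks)


if __name__ == "__main__":
    print(f"Pass rate: {run_eval(TASKS):.0%}")
```

The same structure scales from a handful of hand-written tasks to the thousands Anthropic prefers, and swapping in a model-graded or human-graded rubric changes only the `check` function.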
Those interested in submitting a proposal, and possibly working long-term with Anthropic, should use this application form.
OpenAI has been criticized for a lack of transparency, leading many to believe the company has lost its way and is no longer focused on its one-time goal of safe AI development. Anthropic’s willingness to engage the community and the industry is a refreshing change of pace.