MLCommons, a non-profit organization that helps companies measure the performance of their artificial intelligence systems, is launching a new benchmark to evaluate AI's negative side as well.
The new benchmark, called AILuminate, evaluates the responses of large language models to more than 12,000 test prompts across 12 categories, including incitement to violent crime, child sexual exploitation, hate speech, promotion of self-harm, and intellectual property infringement.
Models are given a rating of “poor”, “fair”, “good”, “very good” or “excellent”, depending on their performance. The prompts used to test the models are kept secret to prevent them from becoming training data that would let a model pass the test.
Peter Mattson, founder and president of MLCommons and a senior engineer at Google, says measuring the potential harms of AI models is technically difficult and leads to inconsistencies across the industry. “AI is a really young technology, and AI testing is a really young discipline,” he says. “Improving safety benefits society; it also benefits the market.”
Reliable, independent methods for measuring AI risks may become more relevant under the next US administration. Donald Trump has promised to scrap President Biden’s executive order on artificial intelligence, which introduced measures to ensure that AI is used responsibly by companies, as well as a new AI Safety Institute to test powerful models.
The effort could also provide a more international perspective on the harms of AI. MLCommons counts numerous international companies among its member organizations, including the Chinese companies Huawei and Alibaba. If all of these companies used the new benchmark, it would provide a way to compare AI safety in the United States, China and elsewhere.
Some large US AI providers have already used AILuminate to test their models. Anthropic’s Claude model, Google’s smaller Gemma model, and a model from Microsoft called Phi all scored “very good” in tests. OpenAI’s GPT-4o and Meta’s larger Llama model both earned a “good” rating. The only model to receive a “poor” rating was OLMo from the Allen Institute for AI, although Mattson notes that it is a research offering not designed with safety in mind.
“Overall, it is good to see scientific rigor in AI evaluation processes,” says Rumman Chowdhury, CEO of Humane Intelligence, a non-profit organization that specializes in testing, or red-teaming, AI models for misbehavior. “We need best practices and inclusive measurement methods to determine whether AI models perform as we expect.”