MLCommons today released AILuminate, a new benchmark test for evaluating the safety of large language models. Launched in 2020, MLCommons is an industry consortium backed by several dozen tech firms.
OpenAI today detailed o3, its new flagship large language model for reasoning tasks. The model’s introduction caps off a 12-day product announcement series that started with the launch of a new ...
SAN FRANCISCO--(BUSINESS WIRE)--Today, MLCommons® announced new results from two industry-standard MLPerf™ benchmark suites, including MLPerf Training v3.1. The MLPerf Training benchmark suite comprises full ...
A team from Abacus.AI, New York University, ...
Google has claimed the top spot in a ...
Earlier this week, Meta landed in hot water for using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on a crowdsourced benchmark, LM Arena. The incident ...
In short: Anthropic has released Claude Opus 4.7, its most capable generally available model, with benchmark-leading scores on SWE-bench Pro (64.3% vs GPT-5.4’s 57.7%), multi-agent coordination for ...
In what appears to be a direct shot across the bow in the generative AI model arms race, Anthropic has released Claude Opus 4.5, the company's new flagship offering. Anthropic is touting the model as ...