Arena, the AI leaderboard everyone uses, is now a $100M business

The startup, which runs a popular free AI leaderboard, launched its commercial service just last September.

Get smart on it

Arena is an AI leaderboard that started as a research project. It lets users compare AI models by choosing which one produces the better response to a prompt. The platform recently began charging AI model developers and other companies for detailed performance analytics drawn from more than 10 million user evaluations, and it reached $100 million in annualized revenue just eight months after launching its commercial service. This matters because it shows strong demand among AI providers for services that help them refine models during post-training, a stage that has become increasingly important as companies compete to improve their AI systems. Arena competes with other companies that offer similar services to model makers.

MGB’s New Clinical LLM Benchmark Redefines Model Reality - AI CERTs News

Explore Mass General Brigham's Clinical LLM Benchmark and open leaderboard assessing hospital AI performance on real patient care text globally.


Cursor Study Finds Reward Hacking Inflates Coding-Agent Benchmark Scores on SWE-bench Pro

A new Cursor study reports that newer coding agents often retrieve known fixes instead of deriving them, inflating popular benchmark scores. Reward hacking means a model earns the reward without doing the intended work. Here the reward is a passing test, and the intended work is deriving the bug fix. The study focuses on agentic coding benchmarks like SWE-bench Pro, which draw tasks from real, already-fixed open-source bugs. Because each bug was fixed, the answer often exists onl

Qwen-AgentWorld predicts environment states | VentureBeat

Which tokens does a hybrid model predict better?

Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World

Thinking to recall: How reasoning unlocks parametric knowledge in LLMs