Featuring Every Eval Ever Results on Hugging Face Model Pages

Two platforms for reporting AI model evaluation results are now compatible with each other. Every Eval Ever, a standardized system for recording evaluation results launched by a cross-institutional coalition, can now share data with Hugging Face Community Evals, which displays benchmark scores on model pages across the Hub. The compatibility matters because evaluation results are currently scattered across papers, leaderboards, and blog posts in different formats, making it difficult for users and researchers to compare models or understand why the same model sometimes shows different scores on the same benchmark. The integration uses a converter that transforms Every Eval Ever records into the format Hugging Face expects, allowing a single evaluation to appear both on model pages with a link to its full source record and in a standardized metadata store that researchers and policymakers can access.

ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration

ScarfBench is an open benchmark designed to evaluate how well AI agents can migrate enterprise Java applications from one framework to another, such as moving code between Spring, Jakarta EE, and Quarkus. Framework migration is challenging because it requires more than just translating code: agents must preserve application behavior, adapt build systems, and navigate runtime dependencies across configuration, services, databases, and web components. Testing revealed that even the strongest current AI agents achieve less than 10 percent success on behavioral validation, and agents frequently overestimate their own progress while struggling with environmental issues like build tooling and deployment problems rather than code transformation alone.

Introducing GeneBench-Pro

GeneBench-Pro is a new benchmark that tests AI performance in genomics, biology, and scientific research using complex, real-world datasets.

Inside Genebench-Pro

Arena, the AI leaderboard everyone uses, is now a $100M business

MGB’s New Clinical LLM Benchmark Redefines Model Reality - AI CERTs News

Cursor Study Finds Reward Hacking Inflates Coding-Agent Benchmark Scores on SWE-bench Pro