
Two platforms for reporting AI model evaluation results are now compatible with each other. Every Eval Ever, a standardized system for recording evaluation results launched by a cross-institutional coalition, can now share data with Hugging Face Community Evals, which displays benchmark scores on model pages across the Hub. The compatibility matters because evaluation results are currently scattered across papers, leaderboards, and blog posts in different formats, making it difficult for users and researchers to compare models or understand why the same model sometimes shows different scores on the same benchmark. The integration uses a converter that transforms Every Eval Ever records into the format Hugging Face expects, allowing a single evaluation to appear both on model pages with a link to its full source record and in a standardized metadata store that researchers and policymakers can access.
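As a rough illustration of what such a converter does, here is a minimal sketch in Python. It does not reproduce either project's real schema: the `SourceRecord` class, the `to_model_page_entry` function, and every field name and URL below are hypothetical placeholders chosen only to show the idea of mapping one record into another format while keeping a link back to the source.

```python
from dataclasses import dataclass


@dataclass
class SourceRecord:
    """Hypothetical stand-in for one standardized evaluation record."""
    model_id: str
    benchmark: str
    metric: str
    score: float
    source_url: str


def to_model_page_entry(record: SourceRecord) -> dict:
    """Map a source record to a hypothetical model-page entry.

    The output keys are placeholders, not the real Hugging Face format.
    """
    return {
        "model": record.model_id,
        "benchmark": record.benchmark,
        "metric": record.metric,
        "value": record.score,
        # Keep a link to the full source record so the score stays traceable.
        "source": record.source_url,
    }


# Example values are invented for illustration only.
record = SourceRecord(
    model_id="example-org/example-model",
    benchmark="example-benchmark",
    metric="accuracy",
    score=0.5,
    source_url="https://example.org/records/1",
)
entry = to_model_page_entry(record)
```

The point of the sketch is that the same underlying record feeds both destinations, so a score shown on a model page can always be traced to its full source entry.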

ScarfBench is an open benchmark designed to evaluate how well AI agents can migrate enterprise Java applications from one framework to another, such as moving code between Spring, Jakarta EE, and Quarkus. Framework migration is challenging because it requires more than translating code: agents must preserve application behavior, adapt build systems, and navigate runtime dependencies across configuration, services, databases, and web components. Testing revealed that even the strongest current AI agents achieve less than 10 percent success on behavioral validation. Agents also frequently overestimate their own progress, and they struggle with environmental issues such as build tooling and deployment, not just with code transformation.

Introducing GeneBench-Pro, a new benchmark testing AI performance in genomics, biology, and scientific research using complex, real-world datasets.

GeneBench-Pro is a benchmark containing 10 case studies designed to evaluate how well AI models can analyze complex genomic and genetic data. The benchmark covers diverse areas including tumor therapy decisions, CRISPR gene targeting, drug target prioritization, genetic screening, and population genetics, each with real experimental datasets and supporting materials. Models are assessed not just on numerical accuracy but on the quality of their analytical reasoning across tasks that require interpreting multiple types of molecular and clinical evidence. The benchmark was released to provide a standardized way to measure performance on genomics problems that require integrating information from long-read sequencing, gene expression, tumor data, and pharmacogenomic evidence.