Introducing GeneBench-Pro

A new benchmark testing AI performance in genomics, biology, and scientific research using complex, real-world datasets.

Get smart on it

GeneBench-Pro is a research-level benchmark designed to test whether AI models can handle the complex judgment calls that computational biology research requires, such as deciding which analytical approaches to use on messy datasets and knowing when results are reliable enough to support decisions. The benchmark contains 129 problems across 10 domains of biology, from statistical genetics to cancer genomics. Each problem presents a realistic dataset that a model must explore, choosing appropriate methods and iterating to reach the correct answer. This matters because real scientific research depends not just on executing analysis steps but on higher-order judgments about ambiguity and assumptions. Those skills have been difficult to measure rigorously, even though they increasingly limit AI performance. The benchmark uses synthetically generated data with known causal structures, which lets its creators verify that correct answers require proper analytical thinking rather than numerical shortcuts or arbitrary author preferences.

ScarfBench: Benchmarking AI Agents for Enterprise Java Framework Migration

ScarfBench is an open benchmark designed to evaluate how well AI agents can migrate enterprise Java applications from one framework to another, such as moving code between Spring, Jakarta EE, and Quarkus. Framework migration is challenging because it requires more than translating code: agents must preserve application behavior, adapt build systems, and navigate runtime dependencies across configuration, services, databases, and web components. Testing revealed that even the strongest current AI agents achieve less than 10 percent success on behavioral validation. Agents also frequently overestimate their own progress, and they struggle with environmental issues such as build tooling and deployment problems, not with code transformation alone.

Inside GeneBench-Pro

GeneBench-Pro is a benchmark containing 10 case studies designed to evaluate how well AI models can analyze complex genomic and genetic data. The benchmark covers diverse areas including tumor therapy decisions, CRISPR gene targeting, drug target prioritization, genetic screening, and population genetics, each with real experimental datasets and supporting materials. Models are assessed not just on numerical accuracy but on the quality of their analytical reasoning across tasks that require interpreting multiple types of molecular and clinical evidence. The benchmark was released to provide a standardized way to measure performance on genomics problems that require integrating information from long-read sequencing, gene expression, tumor data, and pharmacogenomic evidence.

Featuring Every Eval Ever Results on Hugging Face Model Pages

Arena, the AI leaderboard everyone uses, is now a $100M business

MGB’s New Clinical LLM Benchmark Redefines Model Reality - AI CERTs News

Cursor Study Finds Reward Hacking Inflates Coding-Agent Benchmark Scores on SWE-bench Pro