
Introducing GeneBench-Pro, a new benchmark testing AI performance in genomics, biology, and scientific research using complex, real-world datasets.
GeneBench-Pro is a research-level benchmark designed to test whether AI models can handle the judgment calls that computational biology research requires, such as choosing an analytical approach for a messy dataset and knowing when results are reliable enough to act on. The benchmark contains 129 problems across 10 domains of biology, from statistical genetics to cancer genomics. Each problem presents a realistic dataset that the model must explore, analyze with appropriate methods, and iterate on to reach a correct answer. This matters because real scientific research depends not just on executing analysis steps but on higher-order judgments about ambiguity and assumptions, skills that have been difficult to measure rigorously even though they increasingly limit AI performance. The benchmark uses synthetically generated data with known causal structures, which lets its creators verify that correct answers require proper analytical thinking rather than numerical shortcuts or arbitrary author preferences.
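To make the idea of "synthetic data with a known causal structure" concrete, here is a minimal sketch in Python. It is not GeneBench-Pro's actual generator or grader. The variable names, the effect sizes, and the `grade` helper are all assumptions chosen for illustration. The sketch simulates an exposure and an outcome that share a confounder, then compares a naive estimate (the "numerical shortcut") with one that adjusts for the confounder.

```python
import random


def slope(xs, ys):
    """Ordinary least-squares slope of ys on a single predictor xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def simulate(n=5000, true_effect=0.5, seed=0):
    """Hypothetical generator: z confounds both x and y; the x -> y effect is known."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]                 # confounder
    x = [zi + rng.gauss(0, 1) for zi in z]                  # exposure, driven by z
    y = [true_effect * xi + 2.0 * zi + rng.gauss(0, 1)      # outcome, driven by x and z
         for xi, zi in zip(x, z)]
    return z, x, y


def residualize(xs, zs):
    """Remove the linear effect of zs from xs (first step of a two-stage adjustment)."""
    b = slope(zs, xs)
    mean_x = sum(xs) / len(xs)
    mean_z = sum(zs) / len(zs)
    return [x - mean_x - b * (zv - mean_z) for x, zv in zip(xs, zs)]


def grade(answer, truth=0.5, tol=0.1):
    """Hypothetical grader: accept answers within a tolerance of the planted effect."""
    return abs(answer - truth) <= tol


z, x, y = simulate()
naive = slope(x, y)                      # ignores the confounder
adjusted = slope(residualize(x, z), y)   # adjusts for the confounder

print(f"naive estimate:    {naive:.2f}, accepted: {grade(naive)}")
print(f"adjusted estimate: {adjusted:.2f}, accepted: {grade(adjusted)}")
```

Under these assumed parameters the naive slope converges to about 1.5 in expectation, while the adjusted slope converges to the planted value of 0.5, so only the analysis that accounts for the confounder passes the check. Because the generating process is known, a benchmark author can confirm that the shortcut fails by construction.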

ScarfBench is an open benchmark designed to evaluate how well AI agents can migrate enterprise Java applications from one framework to another, such as moving code between Spring, Jakarta EE, and Quarkus. Framework migration is challenging because it requires more than translating code: agents must preserve application behavior, adapt build systems, and navigate runtime dependencies across configuration, services, databases, and web components. Testing revealed that even the strongest current AI agents achieve less than 10 percent success on behavioral validation. Agents also frequently overestimate their own progress, and they struggle with environmental issues such as build tooling and deployment, not only with code transformation.

GeneBench-Pro is a benchmark containing 10 case studies designed to evaluate how well AI models can analyze complex genomic and genetic data. The benchmark covers diverse areas including tumor therapy decisions, CRISPR gene targeting, drug target prioritization, genetic screening, and population genetics, each with real experimental datasets and supporting materials. Models are assessed not just on numerical accuracy but on the quality of their analytical reasoning across tasks that require interpreting multiple types of molecular and clinical evidence. The benchmark was released to provide a standardized way to measure performance on genomics problems that require integrating information from long-read sequencing, gene expression, tumor data, and pharmacogenomic evidence.