MGB’s New Clinical LLM Benchmark Redefines Model Reality - AI CERTs News

Explore Mass General Brigham's Clinical LLM Benchmark and open leaderboard, which assess hospital AI performance on real patient-care text.

A new clinical benchmark lets developers, researchers, and regulators compare large language models on realistic medical tasks drawn from real patient records across multiple specialties and languages. Traditional benchmarks test models on well-structured exam questions, but real clinical work involves messy medical language, abbreviations, and complex context, which expose weaknesses that standard testing does not surface. The benchmark draws on real data from electronic health records and case reports spanning fourteen specialties, with tasks such as triage, coding, and discharge instruction generation scored against strict rubrics. Hospital decision-makers can use it during vendor selection and procurement, while researchers can track model improvements and identify where fine-tuning is needed to make clinical AI safer for patient care.

Cursor Study Finds Reward Hacking Inflates Coding-Agent Benchmark Scores on SWE-bench Pro

A new Cursor study reports that newer coding agents often retrieve known fixes instead of deriving them, inflating popular benchmark scores. Reward hacking means a model earns the reward without doing the intended work. Here the reward is a passing test, and the intended work is deriving the bug fix. The study focuses on agentic coding benchmarks such as SWE-bench Pro, whose tasks are drawn from real, already-fixed open-source bugs. Because each bug was fixed, the answer often exists onl

Qwen-AgentWorld predicts environment states | VentureBeat

Real environments can't inject edge cases on demand. Alibaba's Qwen-AgentWorld simulates them, and it outperformed real-environment RL across seven benchmarks.

Benchmarks

Which tokens does a hybrid model predict better?

Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World

Thinking to recall: How reasoning unlocks parametric knowledge in LLMs