Qwen-AgentWorld predicts environment states | VentureBeat

Real environments can't inject edge cases on demand. Alibaba's Qwen-AgentWorld simulates them, and agents trained in its simulations outperformed real-environment RL across seven benchmarks.

Get smart on it

Alibaba has released Qwen-AgentWorld, a model trained to predict what an environment will return in response to an agent's actions rather than to select those actions itself. This inverts the typical agent training approach: agents can be trained in controlled simulations where edge cases are injected on demand, something real environments cannot offer because they do not reliably surface rare conditions. The research showed that agents trained in these controlled simulations outperformed those trained only in real environments, and that pretraining on environment prediction improved performance on benchmarks the model had not seen before. For teams building agent systems at scale, this suggests that synthetic environments with controlled conditions can serve as a meaningful training layer alongside real-environment training.
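To make the idea of injecting edge cases on demand concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Qwen-AgentWorld's actual interface: `toy_world_model`, `SimulatedEnv`, `edge_cases`, and `inject_rate` are all hypothetical names, and the "model" is a hard-coded function standing in for a learned environment predictor.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def toy_world_model(history: List[str], action: str) -> str:
    """Hypothetical stand-in for a learned environment model.

    A real model would predict the observation from the interaction
    history and the action; this one uses a fixed rule.
    """
    if action.startswith("read_file"):
        return "file contents: hello"
    return "ok"


@dataclass
class SimulatedEnv:
    """Wraps a predictor and optionally swaps in rare observations."""

    predict: Callable[[List[str], str], str]
    # Hypothetical edge cases: action prefix -> rare observation to inject.
    edge_cases: Dict[str, str] = field(default_factory=dict)
    # Probability of injecting an edge case when the action prefix matches.
    inject_rate: float = 0.0
    rng: random.Random = field(default_factory=lambda: random.Random(0))
    history: List[str] = field(default_factory=list)

    def step(self, action: str) -> str:
        # Default path: return whatever the environment model predicts.
        obs = self.predict(self.history, action)
        # Controlled path: replace the prediction with a rare condition.
        for prefix, rare_obs in self.edge_cases.items():
            if action.startswith(prefix) and self.rng.random() < self.inject_rate:
                obs = rare_obs
                break
        self.history.extend([action, obs])
        return obs


if __name__ == "__main__":
    env = SimulatedEnv(
        predict=toy_world_model,
        edge_cases={"read_file": "error: permission denied"},
        inject_rate=1.0,  # always inject, to exercise the rare condition
    )
    print(env.step("read_file a.txt"))
    print(env.step("list_dir ."))
```

Setting `inject_rate` to 1.0 forces the rare condition every time a matching action occurs, while 0.0 disables injection entirely. A real environment offers no such dial, which is the limitation the article describes.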


