DFlash Speculative Decoding Drafts Whole Token Blocks in Parallel for Up to 15x Higher Throughput on NVIDIA Blackwell

Autoregressive large language models generate text one token at a time. Each token waits for the one before it. This serial loop leaves modern GPUs underused and keeps inference slow. The cost grows worse with long Chain-of-Thought reasoning models. Their lengthy outputs make latency the dominant part of generation. Speculative decoding is the standard fix. A small draft model proposes future tokens. The large target model verifies those tokens in parallel. Accepted tokens are kept, so the o


Autoregressive language models generate text one token at a time, which leaves modern GPUs underused and slows inference. Speculative decoding addresses this by having a small draft model propose future tokens that a large target model verifies in parallel, but existing methods still draft tokens one at a time, limiting speedups to around 2-3x. DFlash introduces a block diffusion model that proposes entire blocks of tokens in a single forward pass rather than one at a time, with the target model verifying those blocks in parallel. Research reports over 6x lossless acceleration across various models and tasks, reaching up to 15x higher throughput on NVIDIA Blackwell hardware, with particular benefits for coding agents, reasoning models with long outputs, and high-throughput serving scenarios.
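The draft-then-verify loop described above can be illustrated with a toy sketch. It is not DFlash's implementation: `toy_target`, `toy_draft`, `verify_block`, and `speculative_generate` are hypothetical names, the "models" are simple arithmetic functions, and the verification rule shown is the plain greedy-match variant of speculative decoding. In a real system the draft block would come from a single forward pass of the block diffusion drafter, and the target would score every drafted position in one parallel pass; the Python loops here only simulate that.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]
DraftBlock = Callable[[List[Token], int], List[Token]]


def toy_target(context: List[Token]) -> Token:
    """Hypothetical stand-in for the large target model (greedy next token)."""
    return (sum(context) + len(context)) % 11


def toy_draft(context: List[Token], block_size: int) -> List[Token]:
    """Hypothetical stand-in for a block drafter: proposes block_size tokens.

    It mostly agrees with the target but is deliberately wrong whenever the
    running length is a multiple of 4, so some drafts get rejected.
    """
    cur = list(context)
    block: List[Token] = []
    for _ in range(block_size):
        tok = toy_target(cur)
        if len(cur) % 4 == 0:
            tok = (tok + 1) % 11  # injected drafting error
        block.append(tok)
        cur.append(tok)
    return block


def verify_block(target_next: NextToken, context: List[Token],
                 block: List[Token]) -> List[Token]:
    """Keep the longest drafted prefix the target agrees with, plus one
    token chosen by the target itself (a correction or a bonus token)."""
    accepted: List[Token] = []
    for proposed in block:
        expected = target_next(context + accepted)
        if proposed != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(proposed)
    accepted.append(target_next(context + accepted))  # whole block accepted
    return accepted


def speculative_generate(target_next: NextToken, draft_block: DraftBlock,
                         prompt: List[Token], max_new_tokens: int,
                         block_size: int) -> Tuple[List[Token], int]:
    """Generate tokens block by block; returns (tokens, verification rounds)."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        block = draft_block(out, block_size)
        out.extend(verify_block(target_next, out, block))
        rounds += 1
    return out[:len(prompt) + max_new_tokens], rounds
```

Because every kept token is one the target would have chosen anyway, the output matches plain token-by-token greedy decoding, which is the sense in which the speedup is lossless in this sketch. The gain comes from needing fewer verification rounds than generated tokens.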
