
Autoregressive large language models generate text one token at a time. Each token waits for the one before it. This serial loop leaves modern GPUs underused and keeps inference slow. The cost grows worse with long Chain-of-Thought reasoning models. Their lengthy outputs make latency the dominant part of generation. Speculative decoding is the standard fix. A small draft model proposes future tokens. The large target model verifies those tokens in parallel. Accepted tokens are kept, so the o
Autoregressive language models generate text one token at a time, which leaves modern GPUs underused and slows inference. Speculative decoding addresses this by having a small draft model propose future tokens that a large target model verifies in parallel, but existing methods still draft tokens one at a time, limiting speedups to around 2-3x. DFlash introduces a block diffusion model that proposes entire blocks of tokens in a single forward pass rather than one at a time, with the target model verifying those blocks in parallel. Research reports over 6x lossless acceleration across various models and tasks, reaching up to 15x higher throughput on NVIDIA Blackwell hardware, with particular benefits for coding agents, reasoning models with long outputs, and high-throughput serving scenarios.

Virginia’s new electricity tax on data centers, including self-generated power, is projected to generate $600M annually.

Orbital data centers promise relief from terrestrial power challenges, but their future may hinge on a harder question: repair infrastructure or replace fleets.

Microsoft's West Texas power agreement with Chevron shows how AI developers are securing generation capacity alongside compute.
Want to go deeper than the news? Explore live, cohort-based AI courses taught by practitioners.
Browse AI courses on Maven