Auto-Regressive Masked Diffusion Models

The Auto-Regressive Masked Diffusion (ARMD) model addresses the performance gap between masked diffusion models (MDMs) and autoregressive models (ARMs) by combining the training efficiency of ARMs with the parallel generation capabilities of diffusion models. ARMD employs a causal, permutation-equivariant architecture, enabling efficient autoregressive-style decoding and a new strided parallel generation strategy. This design accelerates inference while preserving coherence, leading to state-of-the-art results on language modeling benchmarks with fewer training steps and narrowing the gap between parallel and sequential decoding methods.
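One reason a causal architecture helps training efficiency is that a single forward pass can score every next-token conditional at once, rather than requiring a separate denoising pass per step. As a minimal illustration (not the paper's model), here is the single-sweep likelihood computation that a causal architecture makes possible; `cond_prob` is a hypothetical stand-in for the network's per-position output:

```python
import math

def causal_log_likelihood(tokens, cond_prob):
    """Sum log p(x_t | x_<t) over a sequence in one sweep.

    `cond_prob(context, token)` is a stand-in for a causal model's
    output: the probability of `token` given the preceding context.
    With a causal architecture, all of these conditionals come out
    of one parallel forward pass.
    """
    return sum(
        math.log(cond_prob(tokens[:t], tokens[t]))
        for t in range(len(tokens))
    )

# Toy check with a uniform "model" over a 4-symbol vocabulary:
# each of the 3 tokens contributes log(1/4).
ll = causal_log_likelihood([1, 2, 3], lambda ctx, x: 0.25)
```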
Auto-Regressive Masked Diffusion Models Revolutionize Language Modeling
Recent work in language modeling has introduced the Auto-Regressive Masked Diffusion (ARMD) model, which merges autoregressive and diffusion-based architectures. This approach improves training efficiency and narrows the performance gap between masked diffusion models and autoregressive models.
Key Innovations of the ARMD Model
- Causal Architecture: Computes the conditional probabilities for all denoising steps in a single parallel forward pass.
- Efficient Decoding: Supports autoregressive-style decoding via a progressive permutation training scheme that accommodates arbitrary token orderings.
- Strided Parallel Generation: Accelerates inference by generating tokens across parallel streams while ensuring global coherence.
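The strided schedule can be pictured as splitting the sequence into interleaved streams and unmasking one offset per pass. The sketch below is a hypothetical reconstruction of that schedule; the paper's actual decoding API is not public, so `predict_tokens` is a stand-in for a model call that fills the given positions in one parallel forward pass:

```python
def strided_generate(predict_tokens, seq_len, stride, mask_id=-1):
    """Fill a fully masked sequence in `stride` parallel passes.

    On pass s, positions s, s + stride, s + 2*stride, ... are
    unmasked simultaneously, each conditioned on every token
    revealed in earlier passes. `predict_tokens(tokens, positions)`
    is a hypothetical model call returning one token per position.
    """
    tokens = [mask_id] * seq_len
    for step in range(stride):
        positions = list(range(step, seq_len, stride))
        # One parallel forward pass fills all streams at this offset.
        predictions = predict_tokens(tokens, positions)
        for pos, tok in zip(positions, predictions):
            tokens[pos] = tok
    return tokens
```

With `stride` parallel streams, a length-`seq_len` sequence is produced in `stride` model calls instead of `seq_len`, which is the source of the inference speedup; coherence depends on each pass conditioning on all previously revealed tokens.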
Empirical evaluations indicate that ARMD sets a new standard in language modeling benchmarks, outstripping established diffusion baselines while requiring significantly fewer training steps.
ARMD's performance enhancements showcase its ability to bridge the gap between parallel and sequential decoding methods, redefining expectations in language model training.
📰 Original Source: https://arxiv.org/abs/2601.16971v1
All rights and credit belong to the original publisher.