
Auto-Regressive Masked Diffusion Models

Source: arXiv
Original Author: Mahdi Karami et al.


The Auto-Regressive Masked Diffusion (ARMD) model addresses the performance gap between masked diffusion models (MDMs) and autoregressive models (ARMs) by combining the training efficiency of ARMs with the parallel generation capabilities of diffusion models. ARMD employs a causal, permutation-equivariant architecture that enables efficient autoregressive-style decoding and a new strided parallel generation strategy. This accelerates inference while preserving coherence, yielding state-of-the-art results on language modeling benchmarks with fewer training steps and bridging the gap between parallel and sequential decoding methods.
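To make the idea concrete, here is a minimal toy sketch of how masked-diffusion denoising can reduce to autoregressive-style decoding: starting from a fully masked sequence, each position is unmasked in order, conditioned on the prefix revealed so far. This is an illustration only; `predict_token` is a hypothetical stub standing in for the trained denoiser, not the paper's actual model or API.

```python
MASK = -1  # sentinel for a masked token

def predict_token(revealed_prefix, position):
    """Hypothetical denoiser stub: returns a deterministic toy token id.
    A real ARMD model would produce a distribution over the vocabulary
    from a single parallel forward pass."""
    return (sum(revealed_prefix) + position) % 100

def ar_style_decode(seq_len):
    """Unmask positions left to right, each conditioned on the revealed
    prefix, so the denoising schedule behaves like autoregressive decoding."""
    seq = [MASK] * seq_len
    for pos in range(seq_len):
        seq[pos] = predict_token(seq[:pos], pos)
    return seq

print(ar_style_decode(6))  # → [0, 1, 3, 7, 15, 31]
```

With a permutation-equivariant architecture, the same loop could visit positions in any order; left-to-right is just one choice of schedule.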

Auto-Regressive Masked Diffusion Models Revolutionize Language Modeling

Recent advances in language modeling have introduced Auto-Regressive Masked Diffusion (ARMD) models, which merge autoregressive and diffusion-based architectures. This approach improves training efficiency and narrows the performance gap between masked diffusion models and autoregressive models.

Key Innovations of the ARMD Model

  • Causal Architecture: Computes the conditional probabilities for multiple denoising steps within a single parallel forward pass.
  • Efficient Decoding: Supports autoregressive-style decoding with a progressive permutation training scheme, accommodating various token orderings.
  • Strided Parallel Generation: Accelerates inference by generating tokens across parallel streams while ensuring global coherence.
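The strided parallel generation strategy above can be sketched as follows. This is one plausible reading of the idea, not the paper's exact algorithm: the sequence is split into several streams, each round reveals one token per stream in parallel, and every stream conditions on all tokens revealed in earlier rounds, which is what maintains global coherence. `predict_token` is again a hypothetical stub for the trained denoiser.

```python
MASK = -1  # sentinel for a masked token

def predict_token(context, position):
    """Hypothetical denoiser stub (the real model outputs a distribution)."""
    visible = [t for t in context if t != MASK]
    return (sum(visible) + position) % 100

def strided_generate(seq_len, num_streams):
    """Reveal one token per stream each round. Stream s owns a contiguous
    block of positions; all streams in a round condition on the same frozen
    context, so their predictions can be computed in parallel."""
    assert seq_len % num_streams == 0
    block = seq_len // num_streams
    seq = [MASK] * seq_len
    for r in range(block):
        context = list(seq)  # freeze context: streams act simultaneously
        for s in range(num_streams):
            pos = s * block + r  # offset r within stream s's block
            seq[pos] = predict_token(context, pos)
    return seq

print(strided_generate(6, 2))  # → [0, 4, 16, 3, 7, 19]
```

With `num_streams` streams, a length-n sequence finishes in n / num_streams rounds instead of n sequential steps, which is where the inference speedup comes from.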

Empirical evaluations indicate that ARMD sets a new standard in language modeling benchmarks, outstripping established diffusion baselines while requiring significantly fewer training steps.

ARMD's performance enhancements showcase its ability to bridge the gap between parallel and sequential decoding methods, redefining expectations in language model training.

Related Topics:

Auto-Regressive Masked Diffusion, language modeling, autoregressive models, parallel generation, performance gap

📰 Original Source: https://arxiv.org/abs/2601.16971v1

All rights and credit belong to the original publisher.
