Soft-Masked Diffusion Language Models
March 2026
20 min read
Language Models, Diffusion, Masked Modeling, NLP
What You'll Learn
- Autoregressive LLMs generate one token at a time, which incurs high computational cost.
- Diffusion offers LLMs faster parallel generation, self-correction, and bidirectional context modeling.
- Masked diffusion brings these benefits to language, but binary masking discards the model's intermediate prediction information.
- Continuous-feedback MDLM addresses this by proposing soft masking.
- Soft masking consistently outperforms binary masking on MAUVE scores.
- Performance gains depend on the task and the compute budget.
Key Concepts Covered
Soft Masking: Enriches mask tokens with a weighted mix of the top-k predicted tokens instead of a binary all-or-nothing mask.
Masked Diffusion Language Models: Apply diffusion models to language via masking, unlocking bidirectional modeling beyond autoregression.
MAUVE: A metric for evaluating open-ended text generation quality, on which soft masking consistently outperformed binary masking.
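The soft-masking idea above can be sketched in a few lines: instead of feeding the model the plain [MASK] embedding at each masked position, blend it with the embeddings of the model's top-k predictions, weighted by their probabilities. This is a minimal illustration, not the paper's implementation; the function name, the `alpha` blending knob, and the tensor layout are assumptions (k=3 matches the hyperparameter reported in the slides).

```python
# Hypothetical sketch of soft masking (names and signature are assumptions).
import torch
import torch.nn.functional as F

def soft_mask_embeddings(logits, embedding, mask_embedding, k=3, alpha=0.5):
    """Blend the [MASK] embedding with the top-k predicted token embeddings.

    logits:         (batch, seq, vocab) model predictions at masked positions
    embedding:      nn.Embedding mapping token ids to vectors
    mask_embedding: (dim,) embedding of the plain [MASK] token
    k:              number of top predictions to mix in (slides report k=3)
    alpha:          weight kept on the [MASK] embedding (assumed knob)
    """
    topk = logits.topk(k, dim=-1)
    weights = F.softmax(topk.values, dim=-1)           # (B, S, k) prediction weights
    topk_emb = embedding(topk.indices)                 # (B, S, k, D) candidate embeddings
    soft = (weights.unsqueeze(-1) * topk_emb).sum(-2)  # weighted mixture of candidates
    return alpha * mask_embedding + (1 - alpha) * soft
```

A binary mask is the special case where the mixture term is dropped entirely (`alpha = 1`), which is exactly the prediction information the soft mask preserves.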
Resources
Slide Overview
- Issues with Autoregression & Binary Masking
- Introduction to Soft-Masked Diffusion
- Optimal Hyperparameters & Tradeoffs (80% mask ratio, k=3)
- Results & Compute Impact
- Limitations at Scale & Future Directions
