Soft-Masked Diffusion Language Models
March 2026
20 min read
Language Models, Diffusion, Masked Modeling, NLP
What You'll Learn
- Autoregressive LLMs generate one token at a time, which incurs high computational cost.
- Diffusion offers LLMs faster parallel generation, self-correction, and bidirectional context modeling.
- Masked diffusion brings these benefits to language, but binary masking discards the model's intermediate prediction information.
- Continuous-feedback MDLM addresses this by proposing soft masking.
- Soft masking consistently outperforms binary masking on MAUVE scores.
- Performance gains depend on the task and the compute budget.
Key Concepts Covered
Soft Masking: Enriches mask tokens with a weighted mix of the top-k predicted tokens instead of a binary all-or-nothing mask.
Masked Diffusion Language Models: Apply diffusion models to language via masking, unlocking bidirectional modeling beyond autoregression.
MAUVE: A metric for evaluating open-ended text generation quality, on which soft masking consistently outperformed binary masking.
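The soft-masking idea above can be sketched in a few lines: instead of feeding the model the plain [MASK] embedding at each masked position, blend it with the embeddings of the model's top-k predictions, weighted by their probabilities. This is a minimal illustration, not the paper's implementation; the function name, the `alpha` blending knob, and the tensor layout are assumptions (k=3 matches the hyperparameter reported in the slides).

```python
# Hypothetical sketch of soft masking (names and signature are assumptions).
import torch
import torch.nn.functional as F

def soft_mask_embeddings(logits, embedding, mask_embedding, k=3, alpha=0.5):
    """Blend the [MASK] embedding with the top-k predicted token embeddings.

    logits:         (batch, seq, vocab) model predictions at masked positions
    embedding:      nn.Embedding mapping token ids to vectors
    mask_embedding: (dim,) embedding of the plain [MASK] token
    k:              number of top predictions to mix in (slides report k=3)
    alpha:          weight kept on the [MASK] embedding (assumed knob)
    """
    topk = logits.topk(k, dim=-1)
    weights = F.softmax(topk.values, dim=-1)           # (B, S, k) prediction weights
    topk_emb = embedding(topk.indices)                 # (B, S, k, D) candidate embeddings
    soft = (weights.unsqueeze(-1) * topk_emb).sum(-2)  # weighted mixture of candidates
    return alpha * mask_embedding + (1 - alpha) * soft
```

A binary mask is the special case where the mixture term is dropped entirely (`alpha = 1`), which is exactly the prediction information the soft mask preserves.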
Resources
Slide Overview
- Issues with Autoregression & Binary Masking
- Introduction to Soft-Masked Diffusion
- Optimal Hyperparameters & Tradeoffs (80% mask ratio, k=3)
- Results & Compute Impact
- Limitations at Scale & Future Directions
