Diffusion Transformers with Representation Autoencoders

January 2026
25 min read
Generative Models, Diffusion Models, Transformers, Representation Learning

What You'll Learn

  • Limitations of standard VAE encoders (outdated backbones, low-dim latents)
  • Introduction to Representation Autoencoders (RAEs)
  • Challenges of operating in high-dimensional latent spaces
  • Theoretical fixes for faster convergence in these wide latent spaces (see the sketch after this list)
  • ImageNet 256×256 generation results (1.51 FID)
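One such fix is shifting the diffusion noise schedule to account for latent dimensionality, in the spirit of SD3's resolution-dependent timestep shift. Below is a minimal sketch; the sqrt(dim / base_dim) shift factor and the base_dim constant are assumptions for illustration, not the paper's exact recipe.

```python
import math

def shift_timestep(t: float, dim: int, base_dim: int = 4096) -> float:
    """Shift a flow-matching timestep toward the noisier end for wider
    latents, following the functional form of SD3's resolution shift.
    The sqrt(dim / base_dim) factor and base_dim=4096 are illustrative
    assumptions, not constants taken from the paper."""
    alpha = math.sqrt(dim / base_dim)
    return alpha * t / (1 + (alpha - 1) * t)

# At the base dimensionality nothing changes; for a ~48x wider latent
# the same nominal timestep lands much closer to pure noise.
print(shift_timestep(0.5, dim=4096))    # 0.5
print(shift_timestep(0.5, dim=196608))  # ~0.87
```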

Key Concepts Covered

Pairing frozen pretrained encoders (DINO, SigLIP) with decoders trained for pixel reconstruction.
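A minimal sketch of the idea, assuming a frozen ViT-style encoder that emits (B, N, D) patch tokens. The PatchEmbed stand-in, the layer sizes, and the plain MSE objective are illustrative only; the paper starts from a real pretrained DINO/SigLIP encoder and trains the decoder with a richer reconstruction loss.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Stand-in for a pretrained encoder: maps (B, 3, H, W) images to
    (B, N, D) patch tokens, the interface DINO/SigLIP-style ViTs expose."""
    def __init__(self, dim: int = 768, patch: int = 16):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

    def forward(self, x):
        return self.proj(x).flatten(2).transpose(1, 2)  # (B, N, D)

class RAE(nn.Module):
    """Representation Autoencoder sketch: frozen pretrained encoder,
    decoder trained from scratch to map tokens back to pixels."""
    def __init__(self, encoder: nn.Module, dim: int = 768, patch: int = 16):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad_(False)  # the representation encoder stays frozen
        self.patch = patch
        # Hypothetical lightweight decoder; the paper trains a ViT decoder.
        self.decoder = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, patch * patch * 3),
        )

    def forward(self, x):
        b, _, h, w = x.shape
        z = self.encoder(x)        # (B, N, D) latent tokens
        patches = self.decoder(z)  # (B, N, patch*patch*3)
        gh, gw = h // self.patch, w // self.patch
        img = patches.view(b, gh, gw, self.patch, self.patch, 3)
        return img.permute(0, 5, 1, 3, 2, 4).reshape(b, 3, h, w)

rae = RAE(PatchEmbed())
x = torch.randn(2, 3, 224, 224)
loss = nn.functional.mse_loss(rae(x), x)  # simplified reconstruction objective
```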

Semantically rich, high-dimensional representations that pose challenges for standard diffusion training.
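To make the gap concrete, compare latent sizes (the shapes are standard for the models named; the side-by-side comparison is an illustration): an SD-style VAE squeezes a 256×256 image into a 4×32×32 grid, while a DINOv2-B/14 encoder turns a 224×224 image into 256 tokens of 768 dimensions each.

```python
import math

vae_latent = (4, 32, 32)   # SD-style VAE: 4 channels per spatial position
rae_latent = (256, 768)    # DINOv2-B/14 tokens: 768 dims per token

print(math.prod(vae_latent))  # 4096 values to model
print(math.prod(rae_latent))  # 196608 values, ~48x more
```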

The diffusion transformer (DiT) backbone and how it is scaled for these experiments.
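For reference, the core pattern such a backbone is built from: a transformer block with adaLN-Zero conditioning, as in the original DiT. This is a generic sketch; the paper's exact block, widths, and depth are not reproduced here.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Minimal DiT-style block: self-attention + MLP, each modulated by
    scale/shift/gate terms produced from a conditioning vector (adaLN)."""
    def __init__(self, dim: int, heads: int, mlp_ratio: float = 4.0):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        hidden = int(dim * mlp_ratio)
        self.mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )
        self.ada = nn.Linear(dim, 6 * dim)
        nn.init.zeros_(self.ada.weight)  # adaLN-Zero: block starts as identity
        nn.init.zeros_(self.ada.bias)

    def forward(self, x, c):
        # c: (B, D) timestep/class embedding -> per-block modulation terms
        s1, b1, g1, s2, b2, g2 = self.ada(c).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1
        x = x + g1 * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + s2) + b2
        return x + g2 * self.mlp(h)

block = DiTBlock(dim=768, heads=12)
tokens = block(torch.randn(2, 256, 768), torch.randn(2, 768))  # (2, 256, 768)
```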

Slide Overview

  • Motivation & VAE Limitations (Slides 1-5)
  • RAE Architecture Definition (Slides 6-12)
  • Latent Space Analysis (Slides 13-20)
  • Experimental Results (Slides 21-end)