Diffusion Transformers with Representation Autoencoders
January 2026
25 min read
Generative Models, Diffusion Models, Transformers, Representation Learning
What You'll Learn
- Limitations of standard VAE encoders (outdated backbones, low-dimensional latents)
- Introduction to Representation Autoencoders (RAEs)
- Challenges of operating in high-dimensional latent spaces
- Theoretical solutions for faster convergence (see the sketch after this list)
- ImageNet generation results (1.51 FID)
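One common route to the "faster convergence" item above is a dimension-dependent timestep shift: higher-dimensional latents retain more recoverable signal at a given noise level, so training is biased toward noisier timesteps. The sketch below assumes an SD3-style shift formula; the function name, the `base_dim` reference value, and the t=1-is-pure-noise convention are illustrative assumptions, not the paper's exact schedule.

```python
import math

def shift_timestep(t: float, latent_dim: int, base_dim: int = 4096) -> float:
    """Dimension-dependent timestep shift (SD3-style, assumed here).

    alpha > 1 pushes timesteps toward t = 1 (pure noise under this
    convention), compensating for the extra signal that survives
    noising in high-dimensional latents. `base_dim` is an
    illustrative reference dimension, not a value from the paper.
    """
    alpha = math.sqrt(latent_dim / base_dim)
    return alpha * t / (1.0 + (alpha - 1.0) * t)

# Example: a 256-token, 768-channel RAE latent vs. a 32x32x4 VAE latent.
print(shift_timestep(0.5, latent_dim=256 * 768))  # ~0.87, shifted toward more noise
```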
Key Concepts Covered
- Pairing pretrained representation encoders (DINO, SigLIP) with trained decoders (see the sketch below)
- Semantically rich representations whose high dimensionality challenges standard diffusion training
- The Diffusion Transformer (DiT) backbone scaled for these experiments
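To make the encoder/decoder pairing concrete, here is a minimal RAE sketch: a frozen pretrained DINOv2 encoder (loaded via `torch.hub`) paired with a toy trainable decoder. The decoder design, image size, and model variant are illustrative placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class RAE(nn.Module):
    """Minimal Representation Autoencoder sketch: frozen pretrained
    encoder + small trainable decoder. The decoder here is a toy
    linear patch projector, not the paper's actual architecture."""

    def __init__(self, latent_dim: int = 768, patch: int = 14, img: int = 224):
        super().__init__()
        self.encoder = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
        for p in self.encoder.parameters():
            p.requires_grad = False  # the representation encoder stays frozen
        self.decoder = nn.Linear(latent_dim, patch * patch * 3)  # tokens -> pixels
        self.patch, self.img = patch, img

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            # (B, N, latent_dim) patch tokens from the frozen encoder
            z = self.encoder.forward_features(x)["x_norm_patchtokens"]
        rec = self.decoder(z)  # (B, N, patch*patch*3)
        B, g = x.shape[0], self.img // self.patch
        rec = rec.view(B, g, g, self.patch, self.patch, 3)
        rec = rec.permute(0, 5, 1, 3, 2, 4).reshape(B, 3, self.img, self.img)
        return rec
```

Only the decoder's parameters go to the optimizer, e.g. `torch.optim.AdamW(rae.decoder.parameters(), lr=1e-4)`, trained with a pixel reconstruction loss such as `F.mse_loss(rae(x), x)`; the diffusion model then operates on the frozen encoder's tokens `z`.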
Slide Overview
- Motivation & VAE Limitations (Slides 1-5)
- RAE Architecture Definition (Slides 6-12)
- Latent Space Analysis (Slides 13-20)
- Experimental Results (Slides 21-end)
