Scaling Text-to-Image Diffusion Transformers with Representation Autoencoders

Research on Representation Autoencoders (RAEs) indicates they excel at large-scale text-to-image (T2I) generation, outperforming state-of-the-art Variational Autoencoders (VAEs) across model scales. RAEs converge faster, generate higher-quality images, and remain stable during finetuning. These results suggest RAEs could simplify T2I pipelines and serve as a foundation for multimodal models that combine visual understanding with generation.
Advancements in Text-to-Image Generation with Representation Autoencoders
Recent research demonstrates that Representation Autoencoders (RAEs) scale diffusion transformers effectively to text-to-image (T2I) generation, moving beyond class-conditional benchmarks such as ImageNet. The study shows that diffusion in the high-dimensional semantic latent space of an RAE performs robustly when generating images from freeform text.
Building on a frozen representation encoder, SigLIP-2, the research team trained RAE decoders on diverse data, including web, synthetic, and text-rendering datasets. The findings suggest that while increasing model scale improves fidelity, the composition of the training data is what drives performance in specific domains such as text rendering.
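The frozen-encoder setup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the encoder module here is a stand-in for a pretrained representation encoder such as SigLIP-2, and the patch size and latent dimension are assumed values.

```python
# Hypothetical sketch: a frozen pretrained representation encoder paired
# with a trainable pixel decoder, as in the RAE setup described above.
# The encoder is a stand-in, not SigLIP-2 itself; dims are assumptions.
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Stand-in for a pretrained representation encoder (e.g. SigLIP-2)."""
    def __init__(self, latent_dim=768, patch=16):
        super().__init__()
        self.proj = nn.Conv2d(3, latent_dim, kernel_size=patch, stride=patch)
        for p in self.parameters():   # frozen: the encoder is never updated
            p.requires_grad_(False)

    def forward(self, x):
        return self.proj(x)           # (B, latent_dim, H/patch, W/patch)

class PixelDecoder(nn.Module):
    """Trainable decoder mapping semantic latents back to pixels."""
    def __init__(self, latent_dim=768, patch=16):
        super().__init__()
        self.up = nn.ConvTranspose2d(latent_dim, 3, kernel_size=patch, stride=patch)

    def forward(self, z):
        return self.up(z)

enc, dec = FrozenEncoder(), PixelDecoder()
img = torch.randn(2, 3, 256, 256)
z = enc(img)       # high-dimensional semantic latent
recon = dec(z)     # only the decoder (and the diffusion model) is trained
```

Only the decoder carries gradients here, which mirrors the paper's design choice of keeping the semantic encoder fixed while adapting reconstruction to new data distributions.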
Key Findings from RAE Scaling
The investigation revealed that scale simplifies the framework. Key insights include:
- Dimension-dependent noise scheduling remains vital for training effectively in high-dimensional latent spaces.
- Architectural enhancements such as wide diffusion heads provide minimal advantage at larger scales.
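To illustrate the first point, one common form of dimension-dependent noise scheduling is a timestep shift that pushes sampling toward higher noise levels as latent dimensionality grows. The sketch below is an assumption about the general technique, not the paper's exact schedule; the reference dimension `base_dim` is a made-up value.

```python
import math

def shift_timestep(t: float, dim: int, base_dim: int = 4096) -> float:
    """Dimension-dependent timestep shift (illustrative, not the paper's
    exact rule). With alpha = sqrt(dim / base_dim), a uniform timestep
    t in [0, 1] is remapped to t' = alpha*t / (1 + (alpha - 1)*t),
    so higher-dimensional latents spend more of training at high noise.
    """
    alpha = math.sqrt(dim / base_dim)
    return alpha * t / (1 + (alpha - 1) * t)

# At the reference dimension the schedule is unchanged; at 4x the
# dimension, mid-range timesteps shift noticeably toward t = 1.
same = shift_timestep(0.5, 4096)     # identity when dim == base_dim
shifted = shift_timestep(0.5, 16384) # alpha = 2 pushes 0.5 -> ~0.667
```

The intuition: a fixed amount of Gaussian noise destroys proportionally less information in a higher-dimensional latent, so the schedule must be rebalanced as dimensionality increases.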
RAEs were benchmarked against the state-of-the-art FLUX Variational Autoencoder (VAE) on diffusion transformers ranging from 0.5 billion to 9.8 billion parameters. RAE-based models consistently outperformed their VAE-based counterparts in pretraining at every scale tested.
During finetuning on high-quality datasets, RAE models remained stable, sustaining performance through 256 epochs, while VAE-based models tended to overfit after just 64 epochs. This points to the RAE's robustness under prolonged finetuning.
Enhanced Performance Metrics
RAE-based diffusion models converge faster and yield better image generation quality compared to VAE counterparts, positioning RAEs as a powerful foundation for large-scale T2I generation.
📰 Original Source: https://arxiv.org/abs/2601.16208v1
All rights and credit belong to the original publisher.