Yandex Unveils Switti: A Fast, High-Quality Text-to-Image Model

Yandex Research has introduced Switti, a generative transformer model for text-to-image synthesis. The model produces high-quality, aesthetically pleasing images quickly, generating a 512×512 sample in 0.13 seconds.

Advancements Over Diffusion Models

Switti challenges existing diffusion-based architectures by offering faster image generation without compromising quality. It builds on the STAR architecture, to which the researchers added enhancements that overcome instability during training.

Inspired by Lumina’s approach, they included extra normalization layers, which stabilized the training process and boosted performance.

Key Technical Improvements

The Yandex team noted that when the model generates at a given resolution, its attention rarely draws on tokens from earlier, lower-resolution scales. This insight allowed them to eliminate autoregression across scales, making generation 11% faster with no loss in quality.
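One way to picture this change is as a difference in attention masks. The sketch below is illustrative only and is not taken from the Switti code: it assumes a scale-wise model whose tokens are grouped by scale, and compares a mask in which each token may attend to its own and all earlier scales with one in which it attends only to its own scale. The scale sizes and the helper names (scale_ids, build_mask, count_allowed) are hypothetical.

```python
def scale_ids(side_lengths):
    """Label every token with the index of the scale it belongs to."""
    ids = []
    for scale_index, side in enumerate(side_lengths):
        ids.extend([scale_index] * (side * side))
    return ids


def build_mask(side_lengths, attend_to_earlier_scales):
    """Return a boolean matrix; mask[q][k] is True if query q may attend to key k."""
    ids = scale_ids(side_lengths)
    if attend_to_earlier_scales:
        # Scale-wise autoregressive: own scale plus every earlier scale.
        return [[k <= q for k in ids] for q in ids]
    # No attention to earlier scales: own scale only.
    return [[k == q for k in ids] for q in ids]


def count_allowed(mask):
    """Count the query-key pairs that the mask permits."""
    return sum(sum(row) for row in mask)


# Hypothetical pyramid of three scales: 1x1, 2x2, and 4x4 token maps.
sides = [1, 2, 4]
causal = build_mask(sides, attend_to_earlier_scales=True)
local_only = build_mask(sides, attend_to_earlier_scales=False)

print(count_allowed(causal), count_allowed(local_only))
```

In this toy setting the own-scale mask permits fewer query-key pairs, which hints at why dropping attention to earlier scales can save compute. The real speedup depends on details such as key-value caching that the sketch does not model.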

Classifier-Free Guidance (CFG) further improved image quality and text alignment. However, CFG requires two model passes per step, and the team found that it had minimal impact at higher resolutions. Disabling CFG at these resolutions sped up generation by an additional 20%.
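The idea of applying CFG only at lower resolutions can be sketched as follows. This is a minimal illustration under stated assumptions, not Switti's implementation: the model is a toy stand-in that returns constant values, the cutoff parameter cfg_max_side is hypothetical, and guidance uses the standard formula uncond + scale * (cond - uncond).

```python
def apply_guidance(cond, uncond, guidance_scale):
    """Standard CFG combination: push the prediction away from the unconditional one."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond, uncond)]


def generate(model, prompt, side_lengths, guidance_scale, cfg_max_side):
    """Run one step per scale, using CFG only while side <= cfg_max_side (assumed cutoff)."""
    outputs = []
    for side in side_lengths:
        cond = model(prompt, side)  # conditional pass, always needed
        if side <= cfg_max_side:
            uncond = model(None, side)  # second pass, only at lower resolutions
            outputs.append(apply_guidance(cond, uncond, guidance_scale))
        else:
            outputs.append(cond)  # CFG disabled: a single pass
    return outputs


class ToyModel:
    """Hypothetical stand-in that counts forward passes instead of predicting tokens."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt, side):
        self.calls += 1
        value = 1.0 if prompt is not None else 0.0
        return [value, value]


toy = ToyModel()
result = generate(toy, "a cat", [1, 2, 4, 8], guidance_scale=4.0, cfg_max_side=4)
print(toy.calls)  # forward passes used with CFG disabled at the largest scale
```

With four scales, full CFG would need eight forward passes, while this run skips the unconditional pass at the largest scale. Because the largest scales hold the most tokens, skipping their second pass is where most of the saving would come from.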

Performance and Benchmarking

Switti was trained on Yandex’s large internal dataset and evaluated against top-tier models, including Stable Diffusion XL (and its accelerated variants), SD3-Medium, Lumina-Next, and autoregressive models like LlamaGen and HART.

The model excelled on established metrics such as FID, CLIP score, PickScore, and ImageReward. User studies also confirmed its superiority over existing autoregressive methods.

Switti generates images seven times faster than the original SDXL and twice as fast as its accelerated versions, while maintaining quality comparable to diffusion-based models.

Future Development

Although Switti delivers promising results, it still lags behind models like MidJourney-v6.1, FLUX, and Ideogram-v2 in overall quality. The Yandex team plans to explore further enhancements to close this gap.

Switti demonstrates significant progress in transformer-based text-to-image synthesis, offering faster and more efficient image generation. Yandex Research is optimistic about its potential and encourages interested users to stay tuned for updates.