Get the latest tech news

Efficient high-resolution image synthesis with linear diffusion transformer


xploring the Frontiers of Efficient Generative Foundation Models Enze Xie1*, Junsong Chen1*, Junyu Chen2,3, Han Cai1, Haotian Tang2, Yujun Lin2, Zhekai Zhang2, Muyang Li2, Ligeng Zhu1, Yao Lu1, Song Han1,2 1NVIDIA, 2MIT, 3Tsinghua University *Project co-lead We introduce Sana, a text-to-image framework that can efficiently generate images up to 4096 × 4096 resolution. Sana can synthesize high-resolution, high-quality images with strong text-image alignment at a remarkably fast speed, deployable on laptop GPU.

We address training instability and design complex human instructions (CHI) to leverage Gemma’s in-context learning, improving image-text alignment. For 512 × 512 resolution, Sana-0.6 demonstrates a throughput that is 5× faster than PixArt-Σ, which has a similar model size, and significantly outperforms it in FID, Clip Score, GenEval, and DPG-Bench. Our mission is to develop efficient, lightweight, and accelerated AI technologies that address practical challenges and deliver fast, open-source solutions.

Get the Android app

Or read this on Hacker News

Read more on:

Photo of linear

linear

Related news:

News photo

Linear Variable Differential Transformer (LVDT) Basics

News photo

The Sims 5 not happening as EA moves "beyond linear, sequential Sims releases"

News photo

Linear, symmetric, self-selecting 14-bit molecular memristors (2023)