Diffusion Transformer models generate images with remarkable fidelity, yet training them at ultra-high resolutions is often cost-prohibitive due to the quadratic scaling of self-attention. In this talk, I will present Dynamic Position Extrapolation (DyPE), a training-free method that enables pre-trained diffusion transformers to synthesize images at resolutions far beyond those seen during training, with no additional sampling cost.
Our approach exploits the spectral progression of the diffusion process, in which low-frequency structures converge early and high-frequency details emerge in later stages. We introduce a mechanism that dynamically adjusts the positional encodings at each step, matching their frequency spectrum to the current stage of the generative process. I will demonstrate how DyPE enables models like FLUX to generate images at extreme scales, up to 16 million pixels, while consistently achieving state-of-the-art fidelity on high-resolution benchmarks. Project page: https://noamissachar.github.io/DyPE/
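To make the idea of a step-dependent positional encoding concrete, here is a minimal sketch. It is not the DyPE implementation: it assumes rotary-style frequencies, swaps in plain position interpolation (dividing positions by a scale factor) as the extrapolation rule, and uses a hypothetical power-law schedule. The names `dynamic_scale`, `scaled_angles`, `max_scale`, and `power` are illustrative assumptions, as is the convention that t = 1 is pure noise and t = 0 is the clean image.

```python
def rope_frequencies(dim, base=10000.0):
    """Standard rotary inverse frequencies for an even head dimension."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def dynamic_scale(t, max_scale, power=2.0):
    """Hypothetical schedule (an assumption, not taken from the talk).

    t = 1.0 (pure noise)  -> max_scale, i.e. full position scaling
    t = 0.0 (clean image) -> 1.0, i.e. the original, unscaled encoding
    """
    return 1.0 + (max_scale - 1.0) * (t ** power)


def scaled_angles(position, t, dim, max_scale, power=2.0):
    """Rotation angles for one position at diffusion time t.

    Positions are divided by the current scale (position interpolation),
    so early steps compress the position range and late steps restore
    the original high-frequency resolution.
    """
    s = dynamic_scale(t, max_scale, power)
    return [position / s * f for f in rope_frequencies(dim)]


if __name__ == "__main__":
    # Assumed setup: target resolution is 4x the training resolution per axis.
    for t in (1.0, 0.5, 0.0):
        print(t, dynamic_scale(t, max_scale=4.0), scaled_angles(8, t, dim=4, max_scale=4.0))
```

Under these assumptions, the scale is largest while the low-frequency layout is forming and decays toward 1 as fine detail emerges, which mirrors the spectral progression described above. The actual schedule and extrapolation rule used by DyPE may differ.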
Noam is a PhD student at the Hebrew University of Jerusalem under the supervision of Prof. Dani Lischinski and Prof. Raanan Fattal. His research interests are visual generative models and their applications.