
SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

About

Diffusion transformers (DiTs) have emerged as a dominant architecture for text-to-image generation, yet their performance drops when generating at resolutions beyond their training range. Existing training-free approaches mitigate this by modifying inference-time attention behavior, often through Rotary Position Embeddings (RoPE) extrapolation combined with attention scaling. However, these strategies apply a uniform and content-agnostic scaling across RoPE components with distinct frequency characteristics, inducing a trade-off between preserving global structure and recovering fine detail. We introduce SEGA, a training-free method that dynamically scales attention across RoPE components according to the latent's spatial-frequency structure at each denoising step. This adaptive scaling improves both structural coherence and fine-detail fidelity. Experiments show that SEGA consistently improves high-resolution synthesis across multiple target resolutions, outperforming state-of-the-art training-free baselines.
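The idea of content-dependent, per-component attention scaling can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify how spectral energy is measured or how it maps to scales, so the radial band split, the `per_band_scales` rule, the `strength` parameter, and the assignment of feature components to bands are all hypothetical choices made for illustration only.

```python
import numpy as np

def band_energy_fractions(latent, num_bands):
    """Share of spectral energy in each radial frequency band of a (C, H, W) latent."""
    _, h, w = latent.shape
    power = (np.abs(np.fft.fft2(latent, axes=(-2, -1))) ** 2).sum(axis=0)
    fy = np.fft.fftfreq(h)[:, None]           # vertical frequency, cycles per sample
    fx = np.fft.fftfreq(w)[None, :]           # horizontal frequency
    radius = np.sqrt(fy ** 2 + fx ** 2)       # radial frequency of every FFT bin
    edges = np.linspace(0.0, radius.max(), num_bands + 1)
    band = np.digitize(radius, edges[1:-1])   # band index in 0 .. num_bands - 1
    energy = np.bincount(band.ravel(), weights=power.ravel(), minlength=num_bands)
    return energy / max(energy.sum(), 1e-12)  # guard against an all-zero latent

def per_band_scales(fractions, base_scale, strength=0.5):
    """Hypothetical rule (an assumption, not from the paper): raise the scale of
    bands holding more than a uniform share of energy and lower the others."""
    uniform_share = 1.0 / len(fractions)
    return base_scale * (1.0 + strength * (fractions - uniform_share))

def scaled_logits(q, k, component_scales):
    """Attention logits with one scale per feature component: sum_i s_i * q_i * k_i."""
    return (q * component_scales) @ k.T

# Example: recompute the scales from the current latent at one denoising step.
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 32, 32))     # stand-in for a noisy latent
fractions = band_energy_fractions(latent, num_bands=4)
band_scales = per_band_scales(fractions, base_scale=1.2)

head_dim = 16
# Assumption: consecutive feature components are grouped evenly into the bands.
component_scales = np.repeat(band_scales, head_dim // len(band_scales))
q = rng.standard_normal((8, head_dim))
k = rng.standard_normal((8, head_dim))
logits = scaled_logits(q, k, component_scales)
```

When every entry of `component_scales` equals the same value, `scaled_logits` reduces to ordinary uniform attention scaling, which corresponds to the content-agnostic baseline described above.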

Javad Rajabi, Kimia Shaban, Koorosh Roohi, David B. Lindell, Babak Taati • 2026

Related benchmarks

Task                              Dataset                                Result   Rank
High-Resolution Image Generation  Aesthetic-4K                           IR 1.3   64
Text-to-Image Generation          Aesthetic-4K (test)                    IR 1.39  20
Text-to-Image Generation          Aesthetic-4K v1.0 (test)               IR 1.51  16
Text-to-Image Generation          Aesthetic-4K zero-shot 4096 x 4096     IR 1.58  11
