Token Pruning for In-Context Generation in Diffusion Transformers

About

In-context generation significantly enhances Diffusion Transformers (DiTs) by enabling controllable image-to-image generation through reference examples. However, the resulting input concatenation drastically increases sequence length, creating a substantial computational bottleneck. Existing token reduction techniques, primarily tailored for text-to-image synthesis, fall short in this paradigm as they apply uniform reduction strategies, overlooking the inherent role asymmetry between reference contexts and target latents across spatial, temporal, and functional dimensions. To bridge this gap, we introduce ToPi, a training-free token pruning framework tailored for in-context generation in DiTs. Specifically, ToPi utilizes offline calibration-driven sensitivity analysis to identify pivotal attention layers, serving as a robust proxy for redundancy estimation. Leveraging these layers, we derive a novel influence metric to quantify the contribution of each context token for selective pruning, coupled with a temporal update strategy that adapts to the evolving diffusion trajectory. Empirical evaluations demonstrate that ToPi can achieve over 30% speedup in inference while maintaining structural fidelity and visual consistency across complex image generation tasks.
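To make the pipeline concrete, below is a minimal sketch of influence-scored context pruning with a periodic temporal refresh. This is not the authors' implementation: the specific influence proxy (attention mass flowing from target-latent queries to context keys at one pivotal layer), the refresh schedule, and every name here (`influence_scores`, `prune_context`, `denoise_with_pruning`, `model.denoise_step`, `keep_ratio`, `refresh_every`) are assumptions for illustration only.

```python
# Hypothetical sketch of ToPi-style context-token pruning (illustration only,
# not the paper's code). Assumes the model exposes, at a pivotal layer, the
# attention probabilities from target-latent queries to context keys.
import torch


def influence_scores(attn_to_ctx: torch.Tensor) -> torch.Tensor:
    # attn_to_ctx: [heads, target_queries, ctx_keys] attention probabilities.
    # Score each context token by the mean attention mass it receives from
    # the target-latent queries -- one plausible influence proxy.
    return attn_to_ctx.mean(dim=(0, 1))  # -> [ctx_keys]


def prune_context(ctx_tokens: torch.Tensor, scores: torch.Tensor,
                  keep_ratio: float = 0.7):
    # Keep the top `keep_ratio` fraction of context tokens by influence,
    # preserving their original order in the sequence.
    k = max(1, int(keep_ratio * ctx_tokens.shape[0]))
    keep = scores.topk(k).indices.sort().values
    return ctx_tokens[keep], keep


def denoise_with_pruning(model, latents, ctx_tokens, timesteps,
                         refresh_every: int = 5):
    # Temporal update: re-score the surviving context every few denoising
    # steps so the kept set adapts to the evolving diffusion trajectory.
    for step, t in enumerate(timesteps):
        # Hypothetical API: one denoising step that also returns the pivotal
        # layer's target->context attention slice.
        latents, attn_to_ctx = model.denoise_step(latents, ctx_tokens, t)
        if (step + 1) % refresh_every == 0:
            scores = influence_scores(attn_to_ctx)
            ctx_tokens, _ = prune_context(ctx_tokens, scores)
    return latents
```

Per the abstract, the pivotal layer whose attention feeds `influence_scores` would be chosen once, offline, via calibration-driven sensitivity analysis rather than at inference time.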

Junqing Lin, Xingyu Zheng, Pei Cheng, Bin Fu, Jingwei Sun, Guangzhong Sun • 2026

Related benchmarks

Task              Dataset                        PSNR (dB)   Rank
Camera Move       AnyEdit Camera Move 1.0        25.07       10
Global Edit       AnyEdit Global Edit 1.0        22.46       10
Implicit Change   AnyEdit Implicit Change 1.0    29.40       10
Local Edit        AnyEdit Local Edit 1.0         27.55       10
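For reference, the PSNR values above are the standard peak signal-to-noise ratio in decibels between the generated image and the ground-truth edit; a minimal definition, assuming 8-bit images in [0, 255]:

```python
import numpy as np


def psnr(x: np.ndarray, y: np.ndarray, max_val: float = 255.0) -> float:
    # Peak signal-to-noise ratio in dB; higher means closer to the reference.
    mse = np.mean((x.astype(np.float64) - y.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```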
