
Vanilla Group Equivariant Vision Transformer: Simple and Effective

About

Incorporating symmetry priors as inductive biases to design equivariant Vision Transformers (ViTs) has emerged as a promising avenue for enhancing their performance. However, existing equivariant ViTs often struggle to balance performance with equivariance, primarily due to the challenge of achieving holistic equivariant modifications across the diverse modules in ViTs, particularly in harmonizing the self-attention mechanism with patch embedding. To address this, we propose a straightforward framework that systematically renders the key ViT components, including patch embedding, self-attention, positional encodings, and down/up-sampling, equivariant, thereby constructing ViTs with guaranteed equivariance. The resulting architecture serves as a plug-and-play replacement that is both theoretically grounded and practically versatile, scaling seamlessly even to Swin Transformers. Extensive experiments demonstrate that our equivariant ViTs consistently improve performance and data efficiency across a wide spectrum of vision tasks.
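The central idea behind an equivariant patch embedding, lifting features to a symmetry group so that transforming the input permutes and transforms the feature channels predictably, can be illustrated with a minimal NumPy sketch. This is an illustrative C4 (90° rotation) example under assumed conventions, not the authors' implementation; `corr2d` and `lift_c4` are hypothetical helper names:

```python
import numpy as np

def corr2d(x, w):
    """Valid-mode 2-D cross-correlation of image x with filter w."""
    H, W = x.shape
    kh, kw = w.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def lift_c4(x, w):
    """Lift x to the rotation group C4: one response map per rotated filter copy."""
    return np.stack([corr2d(x, np.rot90(w, k)) for k in range(4)])

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))   # toy "image"
w = rng.standard_normal((3, 3))   # toy patch-embedding filter

feat = lift_c4(x, w)              # shape (4, 6, 6): 4 group channels
feat_rot = lift_c4(np.rot90(x), w)

# Equivariance check: rotating the input by 90 degrees rotates each feature
# map and cyclically shifts the group channels -- the information is
# transformed predictably rather than destroyed.
for k in range(4):
    assert np.allclose(feat_rot[k], np.rot90(feat[(k - 1) % 4]))
print("C4 equivariance verified")
```

A plain (non-equivariant) embedding fails this check: its features under a rotated input bear no fixed relation to the original features, which is the gap the proposed equivariant components close.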

Jiahong Fu, Qi Xie, Deyu Meng, Zongben Xu • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Classical Image Super-Resolution | Set5 | PSNR | 38.38 | 83
Classical Image Super-Resolution | Set14 | PSNR | 34.10 | 70
Image Classification | Mini-ImageNet (val) | Peak Accuracy | 87.08 | 36
Video Super-Resolution | REDS (val) | PSNR | 34.79 | 24
Classical Image Super-Resolution | Urban100 | PSNR | 33.54 | 11
Classical Image Super-Resolution | BSD100 | PSNR | 32.46 | 11
Classical Image Super-Resolution | Manga109 | PSNR | 39.59 | 7
