Two-stage Vision Transformers and Hard Masking offer Robust Object Representations

About

Context can strongly affect object representations, sometimes leading to undesired biases, particularly when objects appear in out-of-distribution backgrounds at inference. At the same time, many object-centric tasks require leveraging context to identify the relevant image regions. We posit that this conundrum, in which context is simultaneously needed and a potential nuisance, can be addressed by an attention-based approach that uses learned binary attention masks to ensure that only attended image regions influence the prediction. To test this hypothesis, we evaluate a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, for which context cues are likely to be needed, while stage 2 leverages input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. The explicit nature of the semantic masks also makes the model's reasoning auditable, enabling powerful test-time interventions to further enhance robustness. Extensive experiments across diverse benchmarks demonstrate that this approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds. Code: https://github.com/ananthu-aniraj/ifam
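
The central property claimed above, that only attended regions influence the prediction, can be illustrated with a toy example. The following is a minimal sketch in plain Python and is not taken from the authors' repository: the function name masked_attention, the single-query setup, and the toy patch vectors are all assumptions made for illustration. It computes softmax attention in which tokens excluded by a binary mask receive exactly zero weight, so changing their content cannot alter the output.

```python
import math


def masked_attention(query, keys, values, keep):
    """Single-query softmax attention with a hard (binary) mask.

    Tokens with keep[i] == 0 receive exactly zero attention weight.
    Hypothetical helper for illustration only.
    """
    d = len(query)
    # Scaled dot-product score between the query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    kept = [s for s, m in zip(scores, keep) if m]
    if not kept:
        raise ValueError("mask must keep at least one token")
    top = max(kept)  # subtract the max kept score for numerical stability
    # Masked tokens are assigned weight 0.0 instead of exp(score).
    weights = [math.exp(s - top) if m else 0.0 for s, m in zip(scores, keep)]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    # Weighted sum of value vectors; masked tokens contribute nothing.
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


query = [1.0, 0.0]
# Toy patch embeddings; the third one plays the role of a background patch.
patches = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
keep = [1, 1, 0]  # binary mask: attend to the first two patches only

out_a = masked_attention(query, patches, patches, keep)

# Swap the masked-out background patch for something very different.
patches_swapped = patches[:2] + [[-9.0, 3.0]]
out_b = masked_attention(query, patches_swapped, patches_swapped, keep)

print(out_a == out_b)
```

In this toy setting the two outputs match because the replaced patch is masked out, whereas attending to all three patches would change the result. The framework described in the abstract learns the mask in stage 1 and applies it at the input of stage 2; this sketch only shows the masking effect itself.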

Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos • 2025

Related benchmarks

Task	Dataset	Result
Image Classification	Waterbirds	Average Accuracy: 99	209
Image Classification	ImageNet-1K	Accuracy: 84.3	52
Image Classification	MetaShift	Average Accuracy: 88.7	33
Image Classification	WaterBird (OOD)	Accuracy: 86.2	20
Classification	ImageNet-9 Backgrounds Challenge	Accuracy (Original IN-9): 97.5	17
Image Classification	CUB (in-distrib.)	Top-1 Accuracy: 90.6	10
Pneumothorax detection	SIIM-ACR (test)	AUC (A): 92.1	9

