
Two-stage Vision Transformers and Hard Masking offer Robust Object Representations

About

Context can strongly affect object representations, sometimes leading to undesired biases, particularly when objects appear in out-of-distribution backgrounds at inference. At the same time, many object-centric tasks require leveraging context to identify the relevant image regions. We posit that this conundrum, in which context is simultaneously needed and a potential nuisance, can be addressed by an attention-based approach that uses learned binary attention masks to ensure that only attended image regions influence the prediction. To test this hypothesis, we evaluate a two-stage framework. Stage 1 processes the full image to discover object parts and identify task-relevant regions, a step for which context cues are likely to be needed. Stage 2 uses input attention masking to restrict its receptive field to these regions, enabling a focused analysis while filtering out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. The explicit nature of the semantic masks also makes the model's reasoning auditable, enabling powerful test-time interventions to further enhance robustness. Extensive experiments across diverse benchmarks demonstrate that this approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds. Code: https://github.com/ananthu-aniraj/ifam
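
The hard-masking idea in stage 2 can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that). It assumes a single attention head with identity query/key/value projections, and a hand-set boolean `keep` mask standing in for the binary mask that stage 1 would learn. The names `masked_self_attention`, `keep`, and `tokens` are hypothetical. The point it shows: once masked-out patches are excluded as attention keys, the outputs for the kept patches no longer depend on what the background contains.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; entries equal to -inf get weight 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_self_attention(tokens, keep):
    """Single-head self-attention in which only kept tokens may act as keys.

    tokens: (n, d) array of patch embeddings.
    keep:   (n,) boolean array, True for patches selected by the mask.
            Assumes at least one entry is True.
    Identity projections are used for brevity (an assumption of this sketch).
    """
    n, d = tokens.shape
    scores = tokens @ tokens.T / np.sqrt(d)
    # Hard masking: masked-out keys receive -inf, hence zero attention weight.
    scores = np.where(keep[None, :], scores, -np.inf)
    weights = softmax(scores, axis=-1)
    return weights @ tokens


rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 4))                      # 6 patches, 4 dims
keep = np.array([True, True, False, True, False, False])

# Replace the masked-out ("background") patches with very different content.
perturbed = tokens.copy()
perturbed[~keep] = 10.0 * rng.normal(size=(int((~keep).sum()), 4))

out_original = masked_self_attention(tokens, keep)
out_perturbed = masked_self_attention(perturbed, keep)

# Outputs at the kept positions are unaffected by the background change.
background_invariant = np.allclose(out_original[keep], out_perturbed[keep])
print(background_invariant)
```

In this sketch the mask is applied on the key side only, so every query still produces an output, but only rows indexed by `keep` are meaningful as object features. A full model would also drop or ignore the masked query rows before classification.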

Ananthu Aniraj, Cassio F. Dantas, Dino Ienco, Diego Marcos • 2025

Related benchmarks

Task                     Dataset                            Metric                     Result   Rank
Image Classification     Waterbirds                         Average Accuracy           99       157
Image Classification     ImageNet-1K                        Accuracy                   84.3     43
Image Classification     MetaShift                          Average Accuracy           88.7     33
Image Classification     WaterBird (OOD)                    Accuracy                   86.2     20
Classification           ImageNet-9 Backgrounds Challenge   Accuracy (Original IN-9)   97.5     17
Image Classification     CUB (in-distrib.)                  Top-1 Accuracy             90.6     10
Pneumothorax detection   SIIM-ACR (test)                    AUC (A)                    92.1     9