Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs
About
Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particular, correlations between foreground objects and their backgrounds constitute a salient and practically important class of spurious dependencies. In this work, we revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components. Leveraging this insight, we introduce a pre-training approach that constructs background-invariant representations from synthetic data. Our method achieves, to our knowledge, the first worst-group accuracy exceeding $90\%$ on Waterbirds under perfect ($100\%$) spurious correlation (i.e., no minority-group examples in the training data). Furthermore, it demonstrates strong sim-to-real transfer and requires no access to real-world debiased data, making it practical to deploy.
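To make the additivity idea concrete, the following is a toy sketch on synthetic vectors, not the paper's method. It assumes a scene embedding is approximately the sum of a foreground vector and a background vector plus small noise, and that an estimate of the background vector is available. All names (`scene`, `remove_background`, the random `fg`/`bg` dictionaries) are hypothetical and stand in for real encoder outputs.

```python
import numpy as np

def normalize(v):
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def cos(x, y):
    """Cosine similarity between two vectors."""
    return float(normalize(x) @ normalize(y))

def remove_background(scene_emb, background_emb):
    """Subtract the background component (additivity assumption) and renormalize."""
    return normalize(scene_emb - background_emb)

rng = np.random.default_rng(0)
dim = 512

# Synthetic stand-ins for foreground and background embeddings (assumption:
# a real pipeline would obtain these from a VLM vision encoder).
fg = {"waterbird": rng.normal(size=dim), "landbird": rng.normal(size=dim)}
bg = {"water": rng.normal(size=dim), "land": rng.normal(size=dim)}

def scene(f, b, noise=0.05):
    """Toy additive scene embedding: foreground + background + small noise."""
    return fg[f] + bg[b] + noise * rng.normal(size=dim)

same_bird_on_water = scene("waterbird", "water")
same_bird_on_land = scene("waterbird", "land")

# Raw embeddings of the same bird differ because the backgrounds differ.
sim_raw = cos(same_bird_on_water, same_bird_on_land)

# After subtracting each background, the two embeddings nearly coincide.
sim_debiased = cos(
    remove_background(same_bird_on_water, bg["water"]),
    remove_background(same_bird_on_land, bg["land"]),
)

print(f"raw similarity:      {sim_raw:.3f}")
print(f"debiased similarity: {sim_debiased:.3f}")
```

In this toy setting the debiased similarity is much closer to 1 than the raw similarity, which is the behavior a background-invariant representation is meant to have. Real embeddings are only approximately additive, so the gap would be smaller in practice.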
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Gender Classification | COCO 95% spurious correlation | Average Score | 78.1 | 24 |
| Image Classification | Waterbirds 95% correlation (test) | Worst-group Accuracy | 92.5 | 23 |
| Image Classification | Waterbirds 100% correlation (test) | Worst-group Accuracy | 91.9 | 21 |
| Gender Classification | COCO 100% spurious correlation | Average Score | 77.9 | 20 |
| Binary Classification | CounterAnimal Pair 1: Brambling vs. Bulbul | Average Accuracy | 93.2 | 16 |
| Binary Classification | CounterAnimal Pair 2: Ptarmigan vs. Prairie-Chicken | Average Score | 83.8 | 16 |
| Binary Classification | NICO++ Car vs. Truck | Average Accuracy | 86.1 | 12 |
| Binary Classification | NICO++ Ship vs. Sailboat | Accuracy | 84.6 | 12 |
| Binary Classification | NICO++ Bike vs. Motorbike | Accuracy (AVG) | 90.2 | 12 |
| Binary Classification | NICO++ Car vs. Bus | Accuracy | 88.2 | 12 |
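
For readers unfamiliar with the worst-group accuracy metric reported for Waterbirds, here is a minimal sketch of how it is commonly computed: examples are partitioned into groups by (class label, spurious attribute), accuracy is computed per group, and the minimum is reported. This page does not spell out the exact evaluation protocol, so the grouping and the function name `worst_group_accuracy` are assumptions, and the labels below are made up for illustration.

```python
import numpy as np

def worst_group_accuracy(y_true, y_pred, spurious):
    """Return (min per-group accuracy, per-group accuracies).

    Groups are (label, spurious attribute) pairs, e.g. (waterbird, land background).
    Empty groups are skipped.
    """
    y_true, y_pred, spurious = map(np.asarray, (y_true, y_pred, spurious))
    correct = y_true == y_pred
    accs = {}
    for y in np.unique(y_true):
        for s in np.unique(spurious):
            mask = (y_true == y) & (spurious == s)
            if mask.any():
                accs[(int(y), int(s))] = float(correct[mask].mean())
    return min(accs.values()), accs

# Made-up toy data: label 0/1 = bird class, spurious 0/1 = background type.
y_true   = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
spurious = [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
y_pred   = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

worst, per_group = worst_group_accuracy(y_true, y_pred, spurious)
overall = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

print("overall accuracy:", overall)
print("per-group accuracy:", per_group)
print("worst-group accuracy:", worst)
```

In this toy data the overall accuracy is 0.8 while the worst group sits at 0.5, which shows why the metric is preferred over average accuracy when backgrounds correlate with labels.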