Birds of a Feather Flock Together: Background-Invariant Representations via Linear Structure in VLMs

About

Vision-language models (VLMs), such as CLIP and SigLIP 2, are widely used for image classification, yet their vision encoders remain vulnerable to systematic biases that undermine robustness. In particular, correlations between foreground objects and their backgrounds constitute a salient and practically important class of spurious dependencies. In this work, we revisit the well-known property of high linear additivity in VLM embedding spaces and show that it enables a decomposition of scene representations into foreground and background components. Leveraging this insight, we introduce a pre-training approach that exploits this property to construct background-invariant representations using synthetic data. Our method achieves, to our knowledge, the first worst-group accuracy exceeding $90\%$ on Waterbirds under perfect ($100\%$) spurious correlation (i.e., no minority-group examples in the training data). Furthermore, it demonstrates strong sim-to-real transfer and requires no access to real-world debiased data, making it practical for real-world deployment.
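To make the linear-additivity idea concrete, here is a toy sketch in NumPy. It is not the authors' pre-training method and it uses no real VLM: the foreground and background vectors are random stand-ins for encoder outputs, and the names `scene_embedding` and `remove_background` are hypothetical. The sketch assumes a scene embedding is approximately the sum of a foreground and a background component, and shows that subtracting the background component brings two embeddings of the same object on different backgrounds much closer together.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Hypothetical stand-ins for encoder outputs (random vectors, not real CLIP/SigLIP embeddings).
foregrounds = {"waterbird": rng.normal(size=dim), "landbird": rng.normal(size=dim)}
backgrounds = {"water": rng.normal(size=dim), "land": rng.normal(size=dim)}


def scene_embedding(fg_name, bg_name, noise=0.05):
    """Assumption: a scene embedding is roughly foreground + background plus small noise."""
    return foregrounds[fg_name] + backgrounds[bg_name] + noise * rng.normal(size=dim)


def remove_background(scene, bg_name):
    """Subtract the (assumed known) background component from a scene embedding."""
    return scene - backgrounds[bg_name]


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


# The same bird on two different backgrounds.
on_water = scene_embedding("waterbird", "water")
on_land = scene_embedding("waterbird", "land")

cos_before = cosine(on_water, on_land)
cos_after = cosine(
    remove_background(on_water, "water"),
    remove_background(on_land, "land"),
)

print(f"cosine similarity before background removal: {cos_before:.3f}")
print(f"cosine similarity after background removal:  {cos_after:.3f}")
```

In this toy setting the background vector is known exactly; with a real encoder it would have to be estimated, for example from background-only images, and additivity would hold only approximately.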

Youssef Zaazou, Mark Thomas • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Gender Classification | COCO 95% spurious correlation | Average Score | 78.1 | 24 |
| Image Classification | Waterbirds 95% correlation (test) | Worst-group Accuracy | 92.5 | 23 |
| Image Classification | Waterbirds 100% correlation (test) | Worst-group Accuracy | 91.9 | 21 |
| Gender Classification | COCO 100% spurious correlation | Average Score | 77.9 | 20 |
| Binary Classification | CounterAnimal Pair 1: Brambling vs. Bulbul | Average Accuracy | 93.2 | 16 |
| Binary Classification | CounterAnimal Pair 2: Ptarmigan vs. Prairie-Chicken | Average Score | 83.8 | 16 |
| Binary Classification | NICO++ Car vs. Truck | Average Accuracy | 86.1 | 12 |
| Binary Classification | NICO++ Ship vs. Sailboat | Accuracy | 84.6 | 12 |
| Binary Classification | NICO++ Bike vs. Motorbike | Accuracy (AVG) | 90.2 | 12 |
| Binary Classification | NICO++ Car vs. Bus | Accuracy | 88.2 | 12 |
