Test-Time Matching: Unlocking Compositional Reasoning in Multimodal Models

About

Frontier AI models have achieved remarkable progress, yet recent studies suggest they struggle with compositional reasoning, often performing at or below random chance on established benchmarks. We revisit this problem and show that widely used evaluation metrics systematically underestimate model capability. To correct this artifact, we introduce a group matching score that more faithfully evaluates model capability. Moreover, correctness under the new metric can be translated into correctness under existing metrics via a simple overfitting step. This adjustment enables SigLIP-B16 to surpass all previous results and GPT-4.1 to yield the first result surpassing estimated human performance on Winoground. Building on this insight, we propose Test-Time Matching (TTM), an iterative, self-improving algorithm that further bootstraps model performance without any external supervision. TTM delivers additional, non-trivial improvements: for example, TTM enables SigLIP-B16 to surpass GPT-4.1 on MMVP-VLM, establishing a new state of the art. TTM also extends beyond contrastive vision-language models, yielding clear gains on a generative multimodal model across benchmarks. Importantly, TTM remains broadly effective even on benchmarks without metric-induced effects or group structures, achieving relative gains up to 85.7% on challenging datasets such as WhatsUp. Across 16 dataset variants spanning diverse setups, our experiments demonstrate that TTM consistently improves model performance and advances the frontier of compositional reasoning.

Yinglun Zhu, Jiancheng Zhang, Fuzhi Tang • 2025
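
The abstract contrasts widely used evaluation metrics with a group matching score but does not spell out either definition. The sketch below is one plausible reading, not the authors' implementation. It assumes a k x k similarity matrix `sim` where `sim[i][j]` scores image i against caption j and the correct pairs lie on the diagonal. The function names, the strict-inequality tie handling, and the example values are all hypothetical.

```python
from itertools import permutations


def group_score(sim):
    """Assumed conventional metric: every image must rank its own caption
    first AND every caption must rank its own image first."""
    k = len(sim)
    rows_ok = all(sim[i][i] > sim[i][j]
                  for i in range(k) for j in range(k) if j != i)
    cols_ok = all(sim[j][j] > sim[i][j]
                  for i in range(k) for j in range(k) if i != j)
    return rows_ok and cols_ok


def group_matching_score(sim):
    """Assumed matching-based metric: the correct one-to-one assignment
    (the identity permutation) must have a strictly higher total
    similarity than every other one-to-one assignment."""
    k = len(sim)
    identity = tuple(range(k))

    def total(perm):
        return sum(sim[i][perm[i]] for i in range(k))

    return all(total(identity) > total(p)
               for p in permutations(range(k)) if p != identity)


# Hypothetical 2x2 group: image 1 slightly prefers the wrong caption
# (0.35 > 0.32), so the pairwise metric fails, yet the correct matching
# still has the larger total (0.62 vs 0.55).
example_sim = [[0.30, 0.20],
               [0.35, 0.32]]

print(group_score(example_sim))
print(group_matching_score(example_sim))
```

Under these assumptions, a group can fail the pairwise metric while still being matched correctly as a whole, which is the kind of gap the abstract attributes to existing metrics. The brute-force search over permutations is only practical for small groups such as the 2x2 case shown here.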

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Vision-Language Alignment | Winoground | Accuracy | 63.38 | 3 |
| Vision-Language Compositionality | ColorSwap | Accuracy | 85.17 | 3 |
| Vision-Language Spatial Reasoning | WhatsUp A-LR 2x2 directional variants | GroupScore | 95.87 | 3 |
| Vision-Language Spatial Reasoning | WhatsUp A-OU 2x2 directional variants | GroupScore | 99.03 | 3 |
| Vision-Language Spatial Reasoning | WhatsUp B-LR 2x2 directional variants | GroupScore | 82.84 | 3 |
| Vision-Language Spatial Reasoning | WhatsUp B-FB 2x2 directional variants | GroupScore | 66.67 | 3 |
| Visual Perception Alignment | MMVP-VLM | Accuracy | 81.67 | 3 |
| Compositional Evaluation | SugarCrepe Replace Relation | GroupScore | 76.23 | 2 |
| Compositional Evaluation | SugarCrepe Swap Attribute | GroupScore | 77.36 | 2 |
| Compositional Evaluation | SugarCrepe Swap Object | GroupScore | 66.12 | 2 |
Showing 10 of 13 rows
