| Dataset Name | SOTA Method | Metric | Trend | Updated |
|---|---|---|---|---|
| SugarCrepe | CE-CLIP+ | Overall Accuracy: 87.5 | 50 | 22d ago |
| VL-Checklist | SGVL | Attribute Score: 81.8 | 37 | 1mo ago |
| VALSE | CE-CLIP+ | Average Score: 76.7 | 26 | 1mo ago |
| Compositional Reasoning Suite Aggregated | LLaVA-7B† | Sugarcrepe Score: 93.1 | 23 | 9d ago |
| Winoground | SAIL-L-NV2 | Txt2Img Score: 40.25 | 21 | 1mo ago |
| GenAI-Bench (test) | CDG | Spatial Score: 83.79 | 18 | 1mo ago |
| ARO | CE-CLIP+ | Relation Score: 83.6 | 17 | 1mo ago |
| MMStar | GPT-4o | Accuracy: 64.7 | 16 | 4d ago |
| BLINK | GPT-4o | Accuracy: 68 | 12 | 4d ago |
| HLE | RCE | Accuracy: 23.1 | 11 | 1mo ago |
| GPQA | RCE | Accuracy: 48.9 | 11 | 1mo ago |
| ARC-AGI 2 | RCE | Accuracy: 33.6 | 11 | 1mo ago |
| CompA Attribute sub-task | CLAPScore | Text Attribute Accuracy: 44.28 | 11 | 1mo ago |
| CompA Order sub-task | AQAScore | Text Score: 0.67 | 11 | 1mo ago |
| NaturalBench | InternVL3.5-14B +FINER-Tuning | Accuracy: 35.5 | 10 | 1mo ago |
| Cola | NegCLIP++ | Txt2Img Score: 33.33 | 10 | 1mo ago |
| BISCOR-CTRL (test) | CLIP+TF_Local | Group Score: 15.1 | 8 | 4d ago |
| BIVLC (test) | CLIP+TF_Local | Group Score: 61.3 | 8 | 4d ago |
| SUGARCREPE (test) | CLIP+TF_Local | Accuracy: 86.3 | 8 | 4d ago |
| SugarCrepe 1.0 (test) | Human | Replace Acc (Object): 100 | 8 | 1mo ago |
| Compositional Reasoning Correction Input Ic | CREME | Event Probability Ratio: 95.3 | 8 | 1mo ago |
| Compositional Reasoning Dataset | CREME | Correction Score (C): 43.3 | 8 | 1mo ago |
| SugarCrepe++ | C^2LIP | Replace I2T: 79.7 | 7 | 22d ago |
| Compositional Reasoning Paraphrasing Input Ip | CREME | Event Probability p(Ac) > p(Aw): 70.5 | 6 | 1mo ago |
| Compositional Reasoning Benchmarks (ARO, VLC, SVO, CREPE) (test) | MosaiCLIP | Compositional Reasoning Score: 74.29 | 5 | 1mo ago |