| Dataset Name | SOTA Method | Metric | Score | Trend | Updated |
|---|---|---|---|---|---|
| SugarCrepe | CE-CLIP+ | Overall Accuracy | 87.5 | 43 | 4d ago |
| VL-Checklist | SGVL | Attribute Score | 81.8 | 37 | 4d ago |
| VALSE | CE-CLIP+ | Average Score | 76.7 | 26 | 4d ago |
| Winoground | SAIL-L-NV2 | Txt2Img Score | 40.25 | 21 | 3d ago |
| ARO | CE-CLIP+ | Relation Score | 83.6 | 17 | 3d ago |
| HLE | RCE | Accuracy | 23.1 | 11 | 4d ago |
| GPQA | RCE | Accuracy | 48.9 | 11 | 4d ago |
| ARC-AGI 2 | RCE | Accuracy | 33.6 | 11 | 4d ago |
| CompA Attribute sub-task | CLAPScore | Text Attribute Accuracy | 44.28 | 11 | 3d ago |
| CompA Order sub-task | AQAScore | Text Score | 0.67 | 11 | 3d ago |
| Compositional Reasoning Suite Aggregated | TripletCLIP | Overall Score | 28.06 | 10 | 4d ago |
| Cola | NegCLIP++ | Txt2Img Score | 33.33 | 10 | 3d ago |
| SugarCrepe 1.0 (test) | Human | Replace Acc (Object) | 100 | 8 | 3d ago |
| Compositional Reasoning Correction Input Ic | CREME | Event Probability Ratio | 95.3 | 8 | 4d ago |
| Compositional Reasoning Dataset | CREME | Correction Score (C) | 43.3 | 8 | 4d ago |
| Compositional Reasoning Paraphrasing Input Ip | CREME | Event Probability p(Ac) > p(Aw) | 70.5 | 6 | 4d ago |
| Compositional Reasoning Benchmarks (ARO, VLC, SVO, CREPE) (test) | MosaiCLIP | Compositional Reasoning Score | 74.29 | 5 | 4d ago |
| ARO (test) | MosaiCLIP | Relation Score | 83.7 | 4 | 3d ago |
| Winoground (test) | TLC-A | Image Accuracy | 27 | 3 | 4d ago |
| Winoground clean 171 samples | CLIP | Text Score | 31.58 | 2 | 3d ago |
| FB15k-237 1,000 paths direct edges removed (test) | Tensor Logic | MRR | 0.3346 | 1 | 4d ago |