| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Composed Image Retrieval | Standard Benchmarks (CIRR, FashionIQ, GeneCIS) | Average Performance: 38.3 | 10 |
| Language Modeling and Question Answering | Standard Benchmarks (ARC-E, ARC-C, BoolQ, HellaSwag, OBQA, PIQA, WinoGrande, MMLU, SciQ) (test) | ARC-E Acc (Norm): 49.75 | 8 |
| Text-to-image | Standard text-to-image benchmarks | CLIP Score: 97.28 | 6 |
| Correlation analysis of reasoning metrics with ground-truth accuracy | 39 standard benchmarks (including AIME, GSM8K, ARC, MMLU, MMLU-PRO, GPQA, SuperGPQA) | Pearson r: 0.741 | 4 |