
Standard benchmarks

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Composed Image Retrieval | Standard Benchmarks CIRR, FashionIQ, GeneCIS | Average Performance: 38.3 | 10 |
| Language Modeling and Question Answering | Standard Benchmarks (ARC-E, ARC-C, BoolQ, HellaSwag, OBQA, PIQA, WinoGrande, MMLU, SciQ) (test) | ARC-E Acc (Norm): 49.75 | 8 |
| Text-to-image | Standard text-to-image benchmarks | CLIP Score: 97.28 | 6 |
| Correlation analysis of reasoning metrics with ground-truth accuracy | 39 standard benchmarks AIME GSM8K ARC MMLU MMLU-PRO GPQA SuperGPQA | Pearson r: 0.741 | 4 |