Combined

Benchmarks

Task Name	Dataset Name	SOTA Result
Relative Robustness Analysis	Combined Past Tense, OR-Bench, MMLU	R-Score 78.9	36
Reasoning	Combined 37 Tasks (test)	Accuracy 72.4	28
Reasoning	Combined 107 Tasks (train)	Accuracy 68.8	28
Question Answering	Combined 7 Datasets	Average Score 45	18
Harmful prompt detection	Combined Average	F1 Score (Combined Average) 90.18	17
Question Answering	Combined NQ, TriviaQA, PopQA, HotpotQA, 2WikiMQA, MuSiQue, Bamboogle	Total Score 280.3	15
All-in-One Image Restoration	Combined (Deraining, Desnowing, Dehazing)	PSNR 34.02	13
Subject-Level Detection	Combined (ADFTD, BrainLat, AD-Auditory, ADFSU, APAVA)	Accuracy 78.23	12
Segment-Level Classification	Combined ADFTD, BrainLat, AD-Auditory, ADFSU, APAVA	Accuracy 68.03	12
Bayesian neural network regression	Combined (test)	RMSE 3.925	12
General Video Understanding	Combined (VideoMME, LVBench, LongVideoBench, EgoSchema, MLVU)	Average Score 64.9	11
Agentic Performance Aggregate	Combined LCtx, ALFWorld, ScienceWorld, Web/Tool	Mean Task Success 78.9	10
Negative Concept Suppression	Combined LLM-generated + COCO-derived	Suppression Rate 85.25	10
Multi-turn attack detection	Combined LMSYS, SafeDialBench, Synthetic (held-out)	Detection Accuracy 99	10
Malicious Prompt Detection	Combined All Datasets (test)	ASR 4.5	6
Agentic search	Combined General AI Assistant, WebWalkerQA, BrowseComp, XBench	Overall Score 11.8	5
Language Understanding and Reasoning	Combined (GSM8k, MATH500, MAWPS, SVAMP, AQuA, GLUE, CSQA, OBQA)	Average Score 72.94	5
Probabilistic Calibration	Combined 20K labeled samples	Brier Score 0.0759	5
Classification	Combined (label-stratified)	AUROC 0.986	4
Zero-shot Language Understanding	Combined Zero-shot	Average Accuracy 63.71	4
Visualization Generation	Combined (C)	Compilation Success Rate 100	4
Prompt Ambiguity Localization	Combined Synthetic (evaluation set)	AUROC 84	3
Brain Tumor Classification	Combined 4 datasets	Accuracy 99.5	3
Data-to-text generation	Combined	FE 8.05	3
Shadow Detection	Combined Dataset	Testing Time (hours) 0.55	3

Showing 25 of 31 rows