Vision-Language Reasoning

Benchmarks

Dataset Name	SOTA Method	Metric	Value		Updated
VL Reasoning Benchmarks	FRISM	MVista Score	74	28	4mo ago
VL Reasoning Benchmarks MathVista, MVerse, MathVision, MMMU, R1-OV, MMStar	FRISM	MathVista Acc	79.8	25	4mo ago
DisasterBench (val)	Gemini-2.5-pro	Overall Accuracy	87.7	22	1mo ago
DisasterBench (test)		RAE	66.67	22	1mo ago
UHR-Micro (test)	MAP-Agent	Average Score	44.1	16	2mo ago
MMStar	DAPO + ReMind	Accuracy	74.7	14	1mo ago
SQA3D ScanNet scenes (test)	DINOv3 + SpatialBoost	BLEU-1	54.9	13	4mo ago
ScanQA ScanNet scenes (test)	DINOv3 + SpatialBoost	BLEU-1	43.3	13	4mo ago
CODA-LM 1.0 (test)	Qwen2.5-VL-7B + Ours	Barrier	79.8	13	4mo ago
CVBench	Qwen3-VL-8B + GRPO	Accuracy	86.16	12	4mo ago
VRSBench	SkyNative	Accuracy	69.38	10	2mo ago
MMStar cleaned	Jigsaw + CARE	Score	77.59	10	4mo ago
Winoground	LLaVA-1.5 13B	Simple Acc	59.88	9	4mo ago
MathVista (test)	Top Probability + Confidence Modulation	Accuracy	34.6	7	1mo ago
SugarCrepe (test)	Q4 system redistr (prop)	Simple Accuracy	62.75	7	4mo ago
NaturalBench (test)	Q4 system redistr (prop)	Simple Accuracy	66.02	7	4mo ago
MME (test)		Simple Accuracy	78.98	7	4mo ago
HallusionBench (test)	Q4 system redistr (prop)	Simple Accuracy	53.31	7	4mo ago
BEAF (test)	Q4 system redistr (prop)	Simple Accuracy	88.4	7	4mo ago
LanEvil++	Omni-Q	Road Damage (Perturbed) Performance	31.58	4	17d ago
CheXthought 300 1.0 (test)	CheXthought Data	Comprehensiveness of Findings	4.67	4	2mo ago
SSRBench	SOLE-R1	General Score (SSRBench)	85.6	4	3mo ago
nuScenes (reasoning)	Qwen1.5-0.5B	BERT F1 Score	67	4	4mo ago
Winoground	CLIP	Text Score	30.5	4	4mo ago
Vision-Language Reasoning Suite (MathVerse, MathVista, MathVision, MMMU-Pro, We-Math) (test)	PLM-HoneyBee-3B-GRPO	Average Accuracy	46.2	3	4mo ago

Showing 25 of 31 rows