Multitask Language Understanding on MMLU Pro (pass@1)

86.7 pass@1

Qwen 3.5

Updated 2mo ago

Evaluation Results

Method (date)	Links	Score (pass@1)
Qwen 3.5 2026.05		86.7
Nemotron 3 Super 2026.05		83.73
Llama 4 Maverick 2026.05		80.62
Phoenix-VL 1.5 Medium 2026.05		76.81
GLM-4.5V 2026.05		72.17
UNA-score (MSE) 2024.08		48.94
DPO 2024.08		47.48
UNA-pairwise 2024.08		47.48
Qwen 8B 2024.08		47.21
KTO 2024.08		47.18
UNA-binary (BCE) 2024.08		46.89
UNA-score & binary 2024.08		44.83
UNA-score (MSE) 2024.08		34.42
UNA-score & binary 2024.08		34.25
DPO 2024.08		33.05
UNA-pairwise 2024.08		33.05
UNA-binary (BCE) 2024.08		33.01
KTO 2024.08		32.86
Llama 8B 2024.08		32.73
UNA-binary (BCE) 2024.08		30.73
KTO 2024.08		30.43
DPO 2024.08		30.41
UNA-pairwise 2024.08		30.41
Mistral 7B 2024.08		30.11
UNA-score & binary 2024.08		30.09
UNA-score (MSE) 2024.08		29.72
UNA-score (MSE) 2024.08		28.72
UNA-score & binary 2024.08		28.49
DPO 2024.08		28.03
UNA-pairwise 2024.08		28.03
KTO 2024.08		27.98
UNA-binary (BCE) 2024.08		27.95
Gemma 4B 2024.08		27.92
InfLLM-v2 2026.01		0.793
SPLA 2026.01		0.793
SPA 2026.01		0.791
Dense Attention 2026.01		0.789
NSA 2026.01		0.688