General Language Model Evaluation on Comprehensive Evaluation Suite
[Chart: Overall Average Score (y-axis) by date (x-axis, May 30, 2026). Best result: CARE-RL, 50.7.]
Evaluation Results
Method         Backbone      Date      Overall Average Score
CARE-RL        Qwen3-4B      2026.05   50.7
MOPD           Qwen3-4B      2026.05   49.8
MGS            Qwen3-4B      2026.05   49.3
V→NV           Qwen3-4B      2026.05   48.3
NV→V           Qwen3-4B      2026.05   48.3
CARE-RL        Qwen2.5-7B    2026.05   47.9
Naive Mixing   Qwen3-4B      2026.05   47.8
MOPD           Qwen2.5-7B    2026.05   46.9
MGS            Qwen2.5-7B    2026.05   45.9
V→NV           Qwen2.5-7B    2026.05   44.9
NV→V           Qwen2.5-7B    2026.05   44.7
Naive Mixing   Qwen2.5-7B    2026.05   44.2
Base           Qwen3-4B      2026.05   41.4
Base           Qwen2.5-7B    2026.05   37.8