Qwen Evaluation Suite

Benchmarks

Task Name                                   Dataset Name                              SOTA Result                Trend
LLM Evaluation                              Qwen3-1.7B Evaluation Suite (avg)         Average Performance 58.64  38
Language Model Evaluation                   Qwen3-0.6B Evaluation Suite average       Average Performance 47.8   24
Pre-verbalization preference stabilization  Qwen Evaluation Suite Prompt shift Qwen3  Accuracy 100               2
Pre-verbalization preference stabilization  Qwen3 Evaluation Suite Verbalizer shift   Accuracy 100               1
Pre-verbalization preference stabilization  Qwen3 Evaluation Suite Canonical          Accuracy 100               1