Compositional Generalization on Evaluation Dataset Unseen (Fold 2)
[Chart: Score by date. Highlighted point: Mistral-7B, Score 50. Date shown on axis: Jan 29, 2026.]
Updated 4d ago
Evaluation Results
Method | Method details | Date | Score
Mistral-7B | Fine-tuning Strategy=L... | 2026.01 | 50
Qwen2.5-7B | Fine-tuning Strategy=L... | 2026.01 | 46.36
LLaMA-3-8B + COGLM | Fine-tuning Strategy=L... | 2026.01 | 45.66
DeepSeek V3 + FS* | Evaluation Protocol=Fe... | 2026.01 | 44.51
LLaMA-3-8B | Fine-tuning Strategy=L... | 2026.01 | 44.22
Gemma-7B + COGLM | Fine-tuning Strategy=L... | 2026.01 | 44
Qwen2.5-7B + COGLM | Fine-tuning Strategy=L... | 2026.01 | 43.87
Mistral-7B + COGLM | Fine-tuning Strategy=L... | 2026.01 | 43.61
GPT-4o + FS* | Evaluation Protocol=Fe... | 2026.01 | 41.1
Claude 3.5 Sonnet + FS* | Evaluation Protocol=Fe... | 2026.01 | 40.35
DeepSeek V3 | Evaluation Protocol=Ze... | 2026.01 | 40.23
Gemma-7B | Fine-tuning Strategy=L... | 2026.01 | 39.85
DeepSeek-7B + COGLM | Fine-tuning Strategy=L... | 2026.01 | 38.69
Claude 3.5 Sonnet | Evaluation Protocol=Ze... | 2026.01 | 36.88
DeepSeek-7B | Fine-tuning Strategy=L... | 2026.01 | 34.17
Llama 3 (70B) + FS* | Evaluation Protocol=Fe... | 2026.01 | 33.3
GPT-4o | Evaluation Protocol=Ze... | 2026.01 | 32.51
Llama 3 (70B) | Evaluation Protocol=Ze... | 2026.01 | 18.33