Compositional Generalization on Evaluation Dataset (Fold 2 Seen)
Top score: 63.63 (Gemma-7B + COGLM)
[Chart: Score by date; Jan 29, 2026]
Updated 4d ago
Evaluation Results
Method                   Setting                    Date     Score
Gemma-7B + COGLM         Fine-tuning Strategy=L...  2026.01  63.63
LLaMA-3-8B + COGLM       Fine-tuning Strategy=L...  2026.01  63.46
Mistral-7B + COGLM       Fine-tuning Strategy=L...  2026.01  63.17
Mistral-7B               Fine-tuning Strategy=L...  2026.01  62.48
Qwen2.5-7B + COGLM       Fine-tuning Strategy=L...  2026.01  61.85
LLaMA-3-8B               Fine-tuning Strategy=L...  2026.01  61.81
Qwen2.5-7B               Fine-tuning Strategy=L...  2026.01  61.61
Gemma-7B                 Fine-tuning Strategy=L...  2026.01  57.37
DeepSeek-7B + COGLM      Fine-tuning Strategy=L...  2026.01  52.04
DeepSeek-7B              Fine-tuning Strategy=L...  2026.01  49.52
DeepSeek V3              Evaluation Protocol=Ze...  2026.01  45.27
DeepSeek V3 + FS*        Evaluation Protocol=Fe...  2026.01  44.18
Claude 3.5 Sonnet + FS*  Evaluation Protocol=Fe...  2026.01  43.18
Claude 3.5 Sonnet        Evaluation Protocol=Ze...  2026.01  41.11
GPT-4o                   Evaluation Protocol=Ze...  2026.01  40
GPT-4o + FS*             Evaluation Protocol=Fe...  2026.01  39.12
Llama 3 (70B) + FS*      Evaluation Protocol=Fe...  2026.01  33.49
Llama 3 (70B)            Evaluation Protocol=Ze...  2026.01  23.66