Compositional Generalization on Evaluation Dataset (Fold 3 Seen)
Top score: 66.69 (LLaMA-3-8B + COGLM), Jan 29, 2026
Evaluation Results
| Method | Setting | Date | Score |
| --- | --- | --- | --- |
| LLaMA-3-8B + COGLM | Fine-tuning Strategy=L... | 2026.01 | 66.69 |
| Mistral-7B + COGLM | Fine-tuning Strategy=L... | 2026.01 | 65.1 |
| Mistral-7B | Fine-tuning Strategy=L... | 2026.01 | 62.64 |
| Qwen2.5-7B + COGLM | Fine-tuning Strategy=L... | 2026.01 | 61.73 |
| Gemma-7B + COGLM | Fine-tuning Strategy=L... | 2026.01 | 61.7 |
| Qwen2.5-7B | Fine-tuning Strategy=L... | 2026.01 | 61.09 |
| Gemma-7B | Fine-tuning Strategy=L... | 2026.01 | 60.43 |
| LLaMA-3-8B | Fine-tuning Strategy=L... | 2026.01 | 59.72 |
| DeepSeek V3 + FS* | Evaluation Protocol=Fe... | 2026.01 | 46.62 |
| DeepSeek-7B + COGLM | Fine-tuning Strategy=L... | 2026.01 | 45.47 |
| Claude 3.5 Sonnet + FS* | Evaluation Protocol=Fe... | 2026.01 | 45.04 |
| DeepSeek V3 | Evaluation Protocol=Ze... | 2026.01 | 42.46 |
| Claude 3.5 Sonnet | Evaluation Protocol=Ze... | 2026.01 | 41.59 |
| GPT-4o + FS* | Evaluation Protocol=Fe... | 2026.01 | 40.05 |
| DeepSeek-7B | Fine-tuning Strategy=L... | 2026.01 | 39.02 |
| GPT-4o | Evaluation Protocol=Ze... | 2026.01 | 35.58 |
| Llama 3 (70B) + FS* | Evaluation Protocol=Fe... | 2026.01 | 33.15 |
| Llama 3 (70B) | Evaluation Protocol=Ze... | 2026.01 | 22.31 |