Compositional Reasoning on Harder-set

51.3 Strict Accuracy

Qwen2.5-7B

Updated 2mo ago

Evaluation Results

Method	Links	Strict Accuracy
Qwen2.5-7B 2026.05		51.3	47.6	50.4	50.4
Mistral-7B 2026.05		47.9	100	47.9	47.9
DeepHermes-3-8B 2026.05		46.7	71.4	46.7	40.2
Qwen3-8B 2026.05		3.4	14.3	2.8	3.4
Llama2-7B-Chat 2026.05		0	-	-	0
Llama2-13B-Chat 2026.05		0	76	76	0