Reasoning on Omni Hard
[Chart: Accuracy by date. Top result: Cog-DRIFT, 18.58 (Apr 6, 2026).]
Updated 11d ago
Evaluation Results
Method             Base Model        Date       Accuracy
Cog-DRIFT          Qwen3-4B-In...    2026.04    18.58
GRPO               Qwen3-4B-In...    2026.04    16.37
Few-shot           Qwen3-4B-In...    2026.04    16.31
NuRL (Abstract)    Qwen3-4B-In...    2026.04    16.03
NuRL (Prefix)      Qwen3-4B-In...    2026.04    15.69
Zero-shot          Qwen3-4B-In...    2026.04    15.67
RFT                Qwen3-4B-In...    2026.04    15.13
NuRL (Prefix)      Llama3.2-3B...    2026.04    7.19
Cog-DRIFT          Llama3.2-3B...    2026.04    7.04
NuRL (Abstract)    Llama3.2-3B...    2026.04    6.67
RFT                Llama3.2-3B...    2026.04    6.52
Few-shot           Llama3.2-3B...    2026.04    5.74
Zero-shot          Llama3.2-3B...    2026.04    5.09
GRPO               Llama3.2-3B...    2026.04    4.32