Task Completion on Real-world (test)
[Chart: Score by date. Plotted point: Vanilla, Score 8.09, Feb 12, 2026. Metrics: Score, Absolute Delta.]
Evaluation Results
Method   Setting                     Links    Score  Absolute Delta
Vanilla  LLM Backbone=Llama-3-8...   2026.02  8.09   -
Vanilla  LLM Backbone=Qwen3-8b       2026.02  8.09   -
Vanilla  LLM Backbone=GPT-oss-20b    2026.02  8.09   -
PPOpt    LLM Backbone=Qwen3-8b       2026.02  8.08   0.01
PPOpt    LLM Backbone=Llama-3-8...   2026.02  8.05   0.04
PPOpt    LLM Backbone=GPT-oss-20b    2026.02  7.99   0.1
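The Absolute Delta values in the results above appear to match the absolute difference between each PPOpt score and the Vanilla score on the same backbone (for example, 8.09 - 8.08 = 0.01). The page does not state this definition, so treat it as an assumption. A minimal sketch that reproduces the reported deltas under that assumption follows; the `scores` dictionary and the `absolute_delta` helper are hypothetical names introduced only for illustration, and the truncated backbone label is kept exactly as the page shows it.

```python
# Scores copied from the results above, keyed by (method, backbone).
# The "Llama-3-8..." label is truncated on the page and is kept as-is.
scores = {
    ("Vanilla", "Llama-3-8..."): 8.09,
    ("Vanilla", "Qwen3-8b"): 8.09,
    ("Vanilla", "GPT-oss-20b"): 8.09,
    ("PPOpt", "Qwen3-8b"): 8.08,
    ("PPOpt", "Llama-3-8..."): 8.05,
    ("PPOpt", "GPT-oss-20b"): 7.99,
}


def absolute_delta(method, backbone, baseline="Vanilla"):
    """Absolute score gap to the baseline on the same backbone.

    Assumption: the leaderboard's "Absolute Delta" is |baseline - method|,
    rounded to two decimals. The page itself does not define the column.
    """
    gap = abs(scores[(baseline, backbone)] - scores[(method, backbone)])
    # Round to two decimals to hide floating-point noise in the subtraction.
    return round(gap, 2)


if __name__ == "__main__":
    for (method, backbone), score in scores.items():
        if method != "Vanilla":
            delta = absolute_delta(method, backbone)
            print(f"{method} / {backbone}: score={score:.2f}, delta={delta:.2f}")
```

Under this reading, a larger delta means the PPOpt score sits further from the Vanilla score on that backbone. Whether a lower or higher Score is better is not stated on the page.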