One-step next-observation prediction on AgentGym Unweighted Average (test)
[Chart: Token F1 by date; data point shown: Word2World, 85 Token F1, May 29, 2026. Metrics: Token F1, BLEU-4.]
Evaluation Results
| Method | Configuration | Date | Token F1 | BLEU-4 |
|---|---|---|---|---|
| Word2World | Backbone=Qwen3.5-4B, L... | 2026.05 | 85 | 67 |
| PatchWorld-Residual | Backbone=Mimo-v2.5, LL... | 2026.05 | 70 | 49 |
| PatchWorld-Residual | Backbone=DeepSeek-V4-F... | 2026.05 | 70 | 50 |
| PatchWorld-Residual | Backbone=Qwen3-Coder-4... | 2026.05 | 69 | 50 |
| LLM-Direct | Backbone=DeepSeek-V4-F... | 2026.05 | 66 | 43 |
| PoE-World | Backbone=Mimo-v2.5, LL... | 2026.05 | 65 | 49 |
| LLM-Direct | Backbone=Qwen3-Coder-4... | 2026.05 | 64 | 41 |
| LLM-Direct | Backbone=Mimo-v2.5, LL... | 2026.05 | 63 | 41 |
| WorldCoder | Backbone=Qwen3-Coder-4... | 2026.05 | 63 | 48 |
| WorldCoder | Backbone=Mimo-v2.5, LL... | 2026.05 | 63 | 46 |
| PoE-World | Backbone=Qwen3-Coder-4... | 2026.05 | 63 | 47 |
| PatchWorld-Simple | Backbone=Mimo-v2.5, LL... | 2026.05 | 60 | 38 |
| PatchWorld-Simple | Backbone=DeepSeek-V4-F... | 2026.05 | 60 | 38 |
| WorldCoder | Backbone=DeepSeek-V4-F... | 2026.05 | 57 | 39 |
| PatchWorld-Simple | Backbone=Qwen3-Coder-4... | 2026.05 | 57 | 34 |
| PoE-World | Backbone=DeepSeek-V4-F... | 2026.05 | 51 | 35 |
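
This page does not define how Token F1 is computed. As a rough illustration only, the sketch below assumes a common definition: the harmonic mean of token-level precision and recall between a predicted and a reference observation, using whitespace tokenization and a 0-100 scale to match the numbers reported here. The function name `token_f1`, the tokenization, and the handling of empty strings are all assumptions, not the benchmark's actual scoring code.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 on a 0-100 scale (assumed definition, whitespace tokens)."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Assumption: two empty strings match perfectly; one empty string is a miss.
        return 100.0 if pred_tokens == ref_tokens else 0.0
    # Multiset intersection: a shared token counts at most as many times
    # as it appears in both the prediction and the reference.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 100.0 * 2 * precision * recall / (precision + recall)
```

Under this assumed definition, `token_f1("the door is open", "the door is closed")` evaluates to 75.0, since three of the four tokens match on each side. The benchmark may tokenize or normalize text differently, so scores from this sketch are not comparable to the table.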