One-step next-observation prediction on BabyAI (test)
[Chart: Token F1 (y-axis). Highlighted point: Word2World, 93, May 29, 2026. Metrics: Token F1, BLEU-4.]
Evaluation Results
| Method | Configuration | Links (date) | Token F1 | BLEU-4 |
|---|---|---|---|---|
| Word2World | Backbone=Qwen3.5-4B, L... | 2026.05 | 93 | 75 |
| PatchWorld-Simple | Backbone=Qwen3-Coder-4... | 2026.05 | 85 | 58 |
| LLM-Direct | Backbone=Mimo-v2.5, LL... | 2026.05 | 81 | 54 |
| LLM-Direct | Backbone=DeepSeek-V4-F... | 2026.05 | 81 | 56 |
| PatchWorld-Residual | Backbone=DeepSeek-V4-F... | 2026.05 | 80 | 56 |
| WorldCoder | Backbone=Qwen3-Coder-4... | 2026.05 | 78 | 61 |
| PoE-World | Backbone=Qwen3-Coder-4... | 2026.05 | 78 | 62 |
| PoE-World | Backbone=Mimo-v2.5, LL... | 2026.05 | 78 | 62 |
| WorldCoder | Backbone=Mimo-v2.5, LL... | 2026.05 | 77 | 61 |
| PatchWorld-Simple | Backbone=DeepSeek-V4-F... | 2026.05 | 77 | 51 |
| LLM-Direct | Backbone=Qwen3-Coder-4... | 2026.05 | 73 | 44 |
| PoE-World | Backbone=DeepSeek-V4-F... | 2026.05 | 73 | 54 |
| WorldCoder | Backbone=DeepSeek-V4-F... | 2026.05 | 70 | 54 |
| PatchWorld-Residual | Backbone=Qwen3-Coder-4... | 2026.05 | 69 | 47 |
| PatchWorld-Residual | Backbone=Mimo-v2.5, LL... | 2026.05 | 49 | 28 |
| PatchWorld-Simple | Backbone=Mimo-v2.5, LL... | 2026.05 | 41 | 19 |
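The page does not say how Token F1 is computed. As a rough illustration only, the sketch below assumes a common bag-of-tokens definition (whitespace tokenization, precision and recall over overlapping token counts) applied to a predicted next observation and its reference. The function name `token_f1` and the example strings are hypothetical, and the benchmark's actual tokenization and averaging may differ.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between two whitespace-tokenized strings.

    Assumed definition for illustration; not taken from the benchmark.
    """
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a perfect match; exactly one empty as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection: each shared token counts at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical predicted vs. reference observation text.
    score = token_f1("the red ball is left", "the red key is left")
    print(f"{100 * score:.0f}")
```

Under this assumption, a per-example score in [0, 1] would be averaged over the test set and scaled by 100 to land on the same range as the table above.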