Indirect Object Identification on IOI evaluation episodes (held-out)
Policy Score: 2.976 (MechRL, May 25, 2026)
[Chart: Policy Score, Oracle Score, Gap]
Updated 7d ago
Evaluation Results
| Method | Configuration | Date | Policy Score | Oracle Score | Gap |
| --- | --- | --- | --- | --- | --- |
| MechRL | K=1, In-episode picks=... | 2026.05 | 2.976 | - | 0.028 |
| Oracle | Selection=per-episode,... | 2026.05 | - | 2.948 | - |
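
The page does not define the Gap column. The reported values are consistent with Gap being the MechRL Policy Score minus the Oracle Score, so the sketch below checks that arithmetic under that assumption. The helper name `score_gap` is hypothetical and is not part of any benchmark tooling.

```python
# Scores copied from the results above. The formula is an assumption:
# Gap = Policy Score - Oracle Score (the page itself does not state it).
policy_score = 2.976  # MechRL, Policy Score
oracle_score = 2.948  # Oracle, Oracle Score


def score_gap(policy: float, oracle: float, ndigits: int = 3) -> float:
    """Hypothetical helper: difference of two scores, rounded to the
    three decimal places the leaderboard displays."""
    return round(policy - oracle, ndigits)


gap = score_gap(policy_score, oracle_score)
print(gap)  # 0.028
```

Rounding matters here because the raw floating-point difference is not exactly 0.028. Under this reading, a positive Gap means the policy scored slightly above the Oracle row on the held-out episodes.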