
MESSENGER-WM

Benchmarks

Task Name | Dataset Name | SOTA Result | Trend
Policy generalization | MESSENGER-WM NewAll (test) | Average Sum of Scores: 1.16 | 4
Policy generalization | MESSENGER-WM NewAttr (test) | Average Score: 115 | 4
Policy generalization | MESSENGER-WM NewCombo (test) | Avg Sum Score: 1.31 | 4