Putdishwasher on Household Tasks livingroom_and_bedroom
[Chart: Original Success Rate over time. Top result: Deepseek-R1, 82.8 (Mar 9, 2026). Available metrics: Original Success Rate, Average Action Accuracy, Total Steps.]
Updated 1mo ago
Evaluation Results
| Method | Date | Original Success Rate | Average Action Accuracy | Total Steps |
|---|---|---|---|---|
| Deepseek-R1 | 2026.03 | 82.8 | 71.3 | 11 |
| Llama3.3-70B | 2026.03 | 76.1 | 83.3 | 12 |
| GPT-5-mini | 2026.03 | 71.5 | 85.6 | 10 |