Setuptable on Household Tasks kitchen_and_bedroom
[Chart not preserved. Highlighted point: Deepseek-R1, Original Success Rate 81.9, Mar 9, 2026. Selectable metrics: Original Success Rate, Average Action Accuracy, Total Steps.]
Updated 1mo ago
Evaluation Results
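| Method       | Details            | Date    | Original Success Rate | Average Action Accuracy | Total Steps |
|--------------|--------------------|---------|-----------------------|-------------------------|-------------|
| Deepseek-R1  | Model=Deepseek-R1  | 2026.03 | 81.9                  | 88.5                    | 15          |
| Llama3.3-70B | Model=Llama3.3-70B | 2026.03 | 74.8                  | 84.4                    | 17          |
| GPT-5-mini   | Model=GPT-5-mini   | 2026.03 | 69.8                  | 88.9                    | 16          |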
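The results listed above rank the methods differently depending on the metric chosen. A minimal Python sketch of that comparison follows. The numbers are copied from the results, while the field names, the `rank` helper, and the assumption that fewer Total Steps is better are illustrative and not stated by the source.

```python
# Values copied from the results above; field names are hypothetical.
results = [
    {"method": "Deepseek-R1", "success_rate": 81.9, "action_accuracy": 88.5, "total_steps": 15},
    {"method": "Llama3.3-70B", "success_rate": 74.8, "action_accuracy": 84.4, "total_steps": 17},
    {"method": "GPT-5-mini", "success_rate": 69.8, "action_accuracy": 88.9, "total_steps": 16},
]


def rank(rows, key, higher_is_better=True):
    """Return method names ordered from best to worst on one metric."""
    ordered = sorted(rows, key=lambda r: r[key], reverse=higher_is_better)
    return [r["method"] for r in ordered]


by_success = rank(results, "success_rate")
by_accuracy = rank(results, "action_accuracy")
# Assumption: fewer steps is better. The source does not say this.
by_steps = rank(results, "total_steps", higher_is_better=False)

for label, order in [
    ("Original Success Rate", by_success),
    ("Average Action Accuracy", by_accuracy),
    ("Total Steps", by_steps),
]:
    print(f"{label}: {' > '.join(order)}")
```

Under these assumptions, Deepseek-R1 leads on Original Success Rate and Total Steps, while GPT-5-mini leads on Average Action Accuracy by a small margin.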