Putfridge on VirtualHome kitchen and bathroom
Top result: Deepseek-R1, 80.6 TSR (Mar 9, 2026)
Metrics: TSR, TSR_R, TSR_C, Error Rate (ER)
Updated 1mo ago
Evaluation Results
| Method | Configuration | Date | TSR | TSR_R | TSR_C | Error Rate (ER) |
|---|---|---|---|---|---|---|
| Deepseek-R1 | Model=Deepseek-R1 | 2026.03 | 80.6 | 94.2 | 100 | 70.9 |
| Llama3.3-70B | Model=Llama3.3-70B | 2026.03 | 73 | 86.3 | 50 | 31.5 |
| GPT-5-mini | Model=GPT-5-mini | 2026.03 | 67.1 | 92.9 | 100 | 55.1 |