| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multi-step manipulation | DROID Tabletop Multi-step tasks | Success Rate: 98 | 18 |
| Semantic reasoning manipulation | DROID Tabletop Semantic tasks | Success Rate: 26 | 18 |
| Rearrangement with distractors | DROID Tabletop Distractor tasks | Success Rate: 27 | 18 |
| Pick-and-place | DROID Tabletop Simple tasks | Success Rate: 27 | 12 |
| Video Depth | DROID | Abs Rel: 0.223 | 8 |
| Robot Policy Learning | DROID Franka Panda | Average Success Rate: 47.4 | 7 |
| Video-to-Video Generation | Droid (test) | VBench: 0.81 | 6 |
| Interactive long-trajectory generation | DROID (val) | PSNR: 23.56 | 6 |
| Video Frame Rank-Correlation | DROID | VOC Rank-Correlation (Sparse): 0.99 | 6 |
| Autoregressive rollout | DROID External Camera (val) | SSIM: 86 | 5 |
| Camera Tracking | DROID-W | Error Rate (Downtown 1): 0.1 | 5 |
| Temporal Value Estimation | DROID (test) | VOC+: 93.67 | 5 |
| Segmentation | DROID internal held-out | Dice Coefficient: 76.7 | 5 |
| Monocular Depth Estimation | DROID (unseen domain) | Abs Rel: 0.237 | 4 |
| Dynamic Affordance Prediction | DROID 70/30 (test) | Open Microwave MAE: 37 | 4 |
| Video generation | DROID (Unseen Scene) | PSNR: 19.73 | 4 |
| Video generation | DROID Unseen Camera Viewpoint | PSNR: 20.87 | 4 |
| Video generation | DROID (In-Domain) | PSNR: 22.89 | 4 |
| Multi-view Video Generation | Droid 300 cases (test) | FID: 39.97 | 3 |
| Autoregressive rollout | DROID Wrist Camera (val) | SSIM: 67 | 2 |
| Articulation Estimation | DROID 19 articulated object manipulation demos | Prismatic Joint Angle Error (deg): 7.15 | 2 |