| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Text-to-Image Generation | VISOR | OA (%) 77.28 | 21 |
| Egocentric Referring Video Object Segmentation | VISOR (val) | mIoU 67 | 10 |
| Segmentation | VISOR | mIoU 61.8 | 9 |
| 3D Hand Mesh Reconstruction | HInt VISOR All Joints (test) | PCK@0.05 47.2 | 8 |
| Referring Video Object Segmentation | VISOR hard | mIoU 62.3 | 4 |
| Referring Video Object Segmentation | VISOR novel | mIoU 60 | 4 |