InternVideo-Ego4D: A Pack of Champion Solutions to Ego4D Challenges
About
In this report, we present our champion solutions to five tracks of the Ego4D challenge. We apply InternVideo, our video foundation model, to five Ego4D tasks: Moment Queries, Natural Language Queries, Future Hand Prediction, State Change Object Detection, and Short-term Object Interaction Anticipation. InternVideo-Ego4D is an effective paradigm for adapting a strong foundation model to downstream egocentric video understanding tasks with simple head designs. On all five tasks, InternVideo-Ego4D outperforms both the baseline methods and the CVPR 2022 champions, demonstrating the strong representation ability of InternVideo as a video foundation model. Our code will be released at https://github.com/OpenGVLab/ego4d-eccv2022-solutions
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Temporal Action Detection | THUMOS-14 (test) | -- | -- | 330 |
| Temporal Action Detection | ActivityNet v1.3 (val) | -- | -- | 185 |
| Natural Language Queries | Ego4D NLQ (val) | Recall@1 (IoU=0.3) | 15.61 | 23 |
| Natural Language Queries | Ego4D NLQ (test) | R@1 (IoU=0.3) | 16.46 | 21 |
| Moment Query | Ego4D Moment Query (val) | R@1 (IoU=0.5) | 41.13 | 19 |
| Short-Term Anticipation | Ego4D STA v2 (val) | N mAP | 19.45 | 16 |
| Short-Term Anticipation | Ego4D-STA v1 (test) | mAP (N) | 24.6 | 9 |
| Natural Language Queries | Ego4D-NLQ v1 (test) | R@1 (IoU=0.3) | 16.45 | 8 |
| Object Detection | SCOD (val) | AP | 36.4 | 7 |
| Temporal Grounding | Ego4D 1.0 (test) | Recall@1 (IoU=0.3) | 16.45 | 7 |
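
Several rows above report Recall@1 at a temporal IoU threshold. As a rough illustration of how such a number is commonly computed, here is a minimal sketch. It assumes one ground-truth segment per query and segments given as `(start, end)` pairs in seconds. The function names `temporal_iou` and `recall_at_1` are hypothetical, and the official Ego4D evaluation code may differ in details such as tie handling or multiple annotations per query.

```python
from typing import Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); assumed format


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_1(
    top1_preds: Sequence[Segment],
    gts: Sequence[Segment],
    iou_thresh: float = 0.3,
) -> float:
    """Percentage of queries whose top-ranked segment reaches the IoU threshold."""
    if len(top1_preds) != len(gts):
        raise ValueError("need exactly one top-1 prediction per ground-truth segment")
    if not gts:
        return 0.0
    hits = sum(temporal_iou(p, g) >= iou_thresh for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    # Toy data, not taken from the benchmark.
    preds = [(0.0, 10.0), (5.0, 6.0), (20.0, 30.0)]
    gts = [(0.0, 10.0), (0.0, 10.0), (25.0, 35.0)]
    print(f"R@1 (IoU=0.3): {recall_at_1(preds, gts, 0.3):.2f}")
    print(f"R@1 (IoU=0.5): {recall_at_1(preds, gts, 0.5):.2f}")
```

Raising the threshold from 0.3 to 0.5 can only keep or lower the score, which is why results at different IoU thresholds in the table are not directly comparable.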