
Object-Shot Enhanced Grounding Network for Egocentric Video

About

Egocentric video grounding is a crucial task for embodied intelligence applications, distinct from exocentric video moment localization. Existing methods primarily focus on the distributional differences between egocentric and exocentric videos but often neglect key characteristics of egocentric videos and the fine-grained information emphasized by question-type queries. To address these limitations, we propose OSGNet, an Object-Shot enhanced Grounding Network for egocentric video. Specifically, we extract object information from videos to enrich video representation, particularly for objects highlighted in the textual query but not directly captured in the video features. Additionally, we analyze the frequent shot movements inherent to egocentric videos, leveraging these features to extract the wearer's attention information, which enhances the model's ability to perform modality alignment. Experiments conducted on three datasets demonstrate that OSGNet achieves state-of-the-art performance, validating the effectiveness of our approach. Our code can be found at https://github.com/Yisen-Feng/OSGNet.
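The abstract describes two mechanisms: enriching per-clip video features with detected-object information, and turning shot-movement cues into a wearer-attention signal that aids text-video alignment. Below is a minimal PyTorch sketch of that idea, assuming per-clip object features and a simple cross-attention fusion with a sigmoid gate; the module names, dimensions, and gating design are illustrative assumptions, not the official OSGNet implementation.

```python
import torch
import torch.nn as nn

class ObjectShotFusion(nn.Module):
    """Toy fusion module: clips attend to their detected objects, then a
    shot-motion-derived gate reweights clips before text-video alignment."""

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Cross-attention: each clip feature queries its K object features.
        self.obj_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Shot/camera-motion features -> scalar "wearer attention" per clip.
        self.shot_gate = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
        self.norm = nn.LayerNorm(d_model)

    def forward(self, clip_feats, obj_feats, shot_feats):
        # clip_feats: (B, T, D), obj_feats: (B, T, K, D), shot_feats: (B, T, D)
        B, T, D = clip_feats.shape
        q = clip_feats.reshape(B * T, 1, D)      # one query per clip
        kv = obj_feats.reshape(B * T, -1, D)     # K object tokens per clip
        obj_ctx, _ = self.obj_attn(q, kv, kv)    # object-aware context
        enriched = self.norm(clip_feats + obj_ctx.reshape(B, T, D))
        w = self.shot_gate(shot_feats)           # (B, T, 1) wearer-attention weight
        return enriched * w                      # reweighted, object-enriched clips

fuser = ObjectShotFusion()
video = torch.randn(2, 8, 256)       # 2 videos, 8 clips each
objects = torch.randn(2, 8, 5, 256)  # 5 detected objects per clip
shots = torch.randn(2, 8, 256)       # per-clip shot-movement features
print(fuser(video, objects, shots).shape)  # torch.Size([2, 8, 256])
```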

Yisen Feng, Haoyu Zhang, Meng Liu, Weili Guan, Liqiang Nie • 2025

Related benchmarks

Task | Dataset | Metric | Score | Rank
Moment Retrieval | TACoS (test) | Recall@1 (IoU=0.5) | 55.77 | 23
Natural Language Queries | Ego4D NLQ (val) | Recall@1 (IoU=0.3) | 21.97 | 23
Natural Language Queries | Ego4D-NLQ v1 (test) | Recall@1 (IoU=0.3) | 22.13 | 8
Natural Language Queries | Ego4D NLQ v2 (val) | Recall@1 (IoU=0.3) | 31.63 | 7
Natural Language Queries | Ego4D-NLQ v2 (test) | Recall@1 (IoU=0.3) | 27.60 | 7
Step Grounding | Ego4D (test) | Recall@1 (IoU=0.3) | 38.83 | 7
Step Grounding | Ego4D Goal-Step (val) | Recall@1 (IoU=0.3) | 42.61 | 4
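All of the table's results use Recall@1 at a temporal IoU threshold: a query counts as correct when the top-ranked predicted moment overlaps the ground-truth moment with IoU at or above the threshold. The sketch below shows the standard computation; function and variable names are ours, not from any benchmark toolkit.

```python
def temporal_iou(pred, gt):
    """IoU of two temporal moments given as (start_sec, end_sec)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, gts, thresh=0.3):
    """Percentage of queries whose top-1 prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(top1_preds, gts))
    return 100.0 * hits / len(gts)

# Two queries: one hit (IoU = 0.6), one miss (no overlap) -> 50.0
print(recall_at_1([(2.0, 6.0), (10.0, 12.0)], [(3.0, 7.0), (20.0, 25.0)]))
```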

Other info

Code: https://github.com/Yisen-Feng/OSGNet
