GroundNLQ @ Ego4D Natural Language Queries Challenge 2023
About
In this report, we present our champion solution to the Ego4D Natural Language Queries (NLQ) Challenge at CVPR 2023. To accurately ground a natural-language query in a video, two ingredients are essential: an effective egocentric feature extractor and a powerful grounding model. Motivated by this, we adopt a two-stage pre-training strategy that trains the egocentric feature extractors and the grounding model on video narrations, and then fine-tunes the model on annotated data. In addition, we introduce GroundNLQ, a novel grounding model that employs a multi-modal multi-scale grounding module to fuse video and text effectively and to cover moments of varying temporal extent, which is especially important for long videos. On the blind test set, GroundNLQ achieves 25.67 and 18.18 for R1@IoU=0.3 and R1@IoU=0.5, respectively, surpassing all other teams by a noticeable margin. Our code will be released at https://github.com/houzhijian/GroundNLQ.
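The multi-scale idea above can be illustrated with a temporal feature pyramid: the clip-level feature sequence is repeatedly downsampled so that predictions can be made at several temporal resolutions. This is a minimal pure-Python sketch of that concept; the function names and pooling scheme are illustrative assumptions, not the released GroundNLQ implementation.

```python
# Hypothetical sketch of a multi-scale temporal pyramid. A real grounding
# model would apply learned attention/convolution at each level; here we
# use simple strided average pooling to show the pyramid structure.

def downsample(features, stride=2):
    """Average-pool a sequence of feature vectors with the given stride."""
    pooled = []
    for i in range(0, len(features), stride):
        window = features[i:i + stride]
        dim = len(window[0])
        pooled.append([sum(f[d] for f in window) / len(window)
                       for d in range(dim)])
    return pooled

def temporal_pyramid(features, num_scales=3):
    """Scale 0 is the input sequence; each subsequent scale halves its length."""
    pyramid = [features]
    for _ in range(num_scales - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

# Eight one-dimensional "clip features" standing in for video embeddings.
feats = [[float(i)] for i in range(8)]
pyr = temporal_pyramid(feats)
print([len(level) for level in pyr])  # [8, 4, 2]
```

Coarser levels summarize longer stretches of video, so a short moment can be localized at the finest scale while a long moment is matched at a coarse one.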
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Natural Language Queries | Ego4D NLQ (test) | R@1 (IoU=0.3) | 24.5 | 21 |
| Natural Language Queries | Ego4D NLQ v2 (val) | R@1 (IoU=0.3) | 26.98 | 7 |
| Natural Language Queries | Ego4D NLQ v2 (test) | R@1 (IoU=0.3) | 24.5 | 7 |
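The R@1 (IoU=0.3) metric reported above counts a query as correct when the model's top-ranked predicted interval overlaps the ground-truth interval with temporal IoU at or above the threshold. A minimal sketch of that computation (function names are illustrative, not from the official evaluation code):

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) time intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, threshold=0.3):
    """Fraction of queries whose top-1 prediction reaches IoU >= threshold."""
    hits = sum(temporal_iou(p, g) >= threshold
               for p, g in zip(predictions, ground_truths))
    return hits / len(ground_truths)

# Two queries: the first top-1 prediction overlaps well, the second misses.
preds = [(2.0, 6.0), (10.0, 12.0)]
gts = [(3.0, 7.0), (20.0, 22.0)]
print(recall_at_1(preds, gts, threshold=0.3))  # 0.5
```

Raising the threshold to 0.5 demands tighter localization, which is why R1@IoU=0.5 (18.18 on the blind test set) is lower than R1@IoU=0.3 (25.67).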