LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos

About

Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.
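The abstract describes events with temporal boundaries and captions. As a minimal sketch of how such an annotation might be represented in code, the hypothetical record below stores one event and compares two time spans with temporal intersection-over-union, a common way to score a predicted span against an annotated one. The class name, field names, and example captions are illustrative assumptions, not the LongVALE release format or the authors' evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OmniModalEvent:
    """Hypothetical record for one annotated event (field names are assumptions)."""
    start: float   # event start time in seconds
    end: float     # event end time in seconds
    caption: str   # caption describing the audio-visual content

    def __post_init__(self):
        # Reject empty or negative-time spans early.
        if not 0 <= self.start < self.end:
            raise ValueError("expected 0 <= start < end")


def temporal_iou(a: OmniModalEvent, b: OmniModalEvent) -> float:
    """Overlap duration of two spans divided by the duration of their union."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union


# Illustrative values only; not taken from the dataset.
annotated = OmniModalEvent(10.0, 20.0, "A man speaks while a guitar plays.")
predicted = OmniModalEvent(15.0, 25.0, "Someone talks over guitar music.")
print(f"temporal IoU: {temporal_iou(annotated, predicted):.3f}")
```

Validating the span inside the record keeps the overlap function simple, since the union of two valid spans is always positive.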

Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, Feng Zheng • 2024

Related benchmarks

Task	Dataset	Result
Audio-Visual Question Answering	MUSIC-AVQA	Accuracy 49.4	38
Audio-to-Video temporal grounding	ChronusAV	BLEU-4 0.21	17
Text-to-Audio temporal grounding	ChronusAV	BLEU-4 0.2	17
Video-to-Audio temporal grounding	ChronusAV	BLEU-4 0.1	17
Text-to-Video (T2V) Temporally Grounded Generation	ChronusAV	BLEU-4 0.4	9
Video-to-Text (V2T) Temporally Grounded Generation	ChronusAV	R@0.5 9.5	9
Multi-Scene Segmentation	OmniDCBench 1.0 (test)	F1 Score 45.2	9
Dense Video Captioning	LongVALE	F1 31.2	9
Dense Video Captioning	ChronusAV	F1 Score 18.4	9
Time-aware Dense Captioning	OmniDCBench 1.0 (test)	Camera Score 0.8	9

Showing 10 of 20 rows
