LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos

About

Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.

Tiantian Geng, Jinrui Zhang, Qingni Wang, Teng Wang, Jinming Duan, Feng Zheng • 2024

Related benchmarks

Task                                 Dataset                 Metric          Result  Rank
Audio-Visual Question Answering      MUSIC-AVQA              Accuracy        49.4    21
Multi-Scene Segmentation             OmniDCBench 1.0 (test)  F1 Score        45.2    9
Time-aware Dense Captioning          OmniDCBench 1.0 (test)  Camera Score    0.8     9
Omni-modal dense video captioning    LongVALE 1.0 (test)     SODA_c          2.8     8
Omni-modal segment captioning        LongVALE 1.0 (test)     ROUGE-L         0.224   8
Omni-modal temporal video grounding  LongVALE 1.0 (test)     R@0.3           15.7    8
Audio-to-Video temporal grounding    ChronusAV               BLEU-4          0.21    8
Text-to-Video temporal grounding     ChronusAV               BLEU-4          0.35    8
Video-to-Text temporal grounding     ChronusAV               Recall@IoU=0.5  9.5     8
Text-to-Audio temporal grounding     ChronusAV               BLEU-4          0.15    8
Showing 10 of 13 rows
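
The grounding rows above report recall at a temporal IoU threshold (R@0.3, Recall@IoU=0.5). As a rough illustration of how such a metric is commonly computed, here is a minimal sketch. It assumes one predicted (start, end) segment per query, scored against one ground-truth segment, with recall reported as a percentage. The function names are hypothetical, and the exact evaluation protocol used by these benchmarks is not stated on this page.

```python
from typing import List, Tuple

# A segment is (start_seconds, end_seconds); this representation is an assumption.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection-over-union of two 1-D time intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def recall_at_iou(preds: List[Segment], gts: List[Segment], threshold: float) -> float:
    """Percentage of queries whose prediction reaches the IoU threshold.

    Assumes preds[i] is the single prediction for ground truth gts[i].
    """
    if not gts:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if temporal_iou(p, g) >= threshold)
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    predictions = [(0.0, 10.0), (0.0, 10.0)]
    ground_truth = [(5.0, 15.0), (20.0, 30.0)]
    print(recall_at_iou(predictions, ground_truth, threshold=0.3))
```

In this toy input the first pair overlaps for 5 of 15 seconds (IoU of one third) and the second pair does not overlap, so one of the two queries counts as a hit at threshold 0.3.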
