LongVALE: Vision-Audio-Language-Event Benchmark Towards Time-Aware Omni-Modal Perception of Long Videos
About
Despite impressive advancements in video understanding, most efforts remain limited to coarse-grained or visual-only video tasks. However, real-world videos encompass omni-modal information (vision, audio, and speech) with a series of events forming a cohesive storyline. The lack of multi-modal video data with fine-grained event annotations and the high cost of manual labeling are major obstacles to comprehensive omni-modality video perception. To address this gap, we propose an automatic pipeline consisting of high-quality multi-modal video filtering, semantically coherent omni-modal event boundary detection, and cross-modal correlation-aware event captioning. In this way, we present LongVALE, the first-ever Vision-Audio-Language Event understanding benchmark comprising 105K omni-modal events with precise temporal boundaries and detailed relation-aware captions within 8.4K high-quality long videos. Further, we build a baseline that leverages LongVALE to enable video large language models (LLMs) for omni-modality fine-grained temporal video understanding for the first time. Extensive experiments demonstrate the effectiveness and great potential of LongVALE in advancing comprehensive multi-modal video understanding.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio-Visual Question Answering | MUSIC-AVQA | Accuracy | 49.4 | 21 |
| Multi-Scene Segmentation | OmniDCBench 1.0 (test) | F1 Score | 45.2 | 9 |
| Time-aware Dense Captioning | OmniDCBench 1.0 (test) | Camera Score | 0.8 | 9 |
| Omni-modal dense video captioning | LongVALE 1.0 (test) | SODA_c | 2.8 | 8 |
| Omni-modal segment captioning | LongVALE 1.0 (test) | ROUGE-L | 0.224 | 8 |
| Omni-modal temporal video grounding | LongVALE 1.0 (test) | R@0.3 | 15.7 | 8 |
| Audio-to-Video temporal grounding | ChronusAV | BLEU-4 | 0.21 | 8 |
| Text-to-Video temporal grounding | ChronusAV | BLEU-4 | 0.35 | 8 |
| Video-to-Text temporal grounding | ChronusAV | Recall@IoU=0.5 | 9.5 | 8 |
| Text-to-Audio temporal grounding | ChronusAV | BLEU-4 | 0.15 | 8 |
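
Two of the temporal grounding rows above report recall at a temporal IoU threshold (R@0.3 and Recall@IoU=0.5). As a rough illustration of what such a metric measures, the sketch below uses the common definition of temporal IoU between a predicted and a ground-truth segment. It assumes one top-1 prediction per query and segments given as `(start, end)` in seconds; the function names are hypothetical, and the exact evaluation protocol of the benchmarks listed here is not specified on this page and may differ.

```python
from typing import List, Tuple

# A segment is (start_sec, end_sec); this representation is an assumption.
Segment = Tuple[float, float]


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two time intervals."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_iou(preds: List[Segment], gts: List[Segment], threshold: float) -> float:
    """Percentage of queries whose top-1 prediction reaches the IoU threshold.

    Assumes preds[i] is the single predicted segment for the query whose
    ground truth is gts[i].
    """
    if not gts:
        return 0.0
    hits = sum(1 for p, g in zip(preds, gts) if temporal_iou(p, g) >= threshold)
    return 100.0 * hits / len(gts)


if __name__ == "__main__":
    # Made-up segments for illustration only, not benchmark data.
    predictions = [(0.0, 10.0), (20.0, 30.0)]
    ground_truth = [(5.0, 15.0), (20.0, 30.0)]
    print(recall_at_iou(predictions, ground_truth, threshold=0.3))
    print(recall_at_iou(predictions, ground_truth, threshold=0.5))
```

In the toy example, the first prediction overlaps its ground truth by 5 seconds out of a 15-second union (IoU of one third), so it counts as a hit at threshold 0.3 but not at 0.5, which is why a looser threshold yields a higher recall.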