CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition
About
We propose Cross-Attention in Audio, Space, and Time (CA^2ST), a transformer-based method for holistic video recognition. Recognizing actions in videos requires both spatial and temporal understanding, yet most existing models lack a balanced spatio-temporal understanding of videos. To address this, we propose a novel two-stream architecture, Cross-Attention in Space and Time (CAST), which uses only RGB input. In each layer of CAST, Bottleneck Cross-Attention (B-CA) enables the spatial and temporal experts to exchange information and make synergistic predictions. For holistic video understanding, we extend CAST by integrating an audio expert, forming Cross-Attention in Visual and Audio (CAVA). We validate CAST on benchmarks with different characteristics (EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400), where it consistently shows balanced performance. We also validate CAVA on audio-visual action recognition benchmarks, including UCF-101, VGG-Sound, KineticsSound, and EPIC-SOUNDS. CAVA's favorable performance across these datasets demonstrates effective information exchange among multiple experts within the B-CA module. In summary, CA^2ST combines CAST and CAVA by connecting spatial, temporal, and audio experts through cross-attention, achieving balanced and holistic video understanding.
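The information exchange between experts can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single attention head, random untrained weights, and a simple down-project / cross-attend / up-project / residual layout, and the class name, parameter names, and token shapes are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class BottleneckCrossAttention:
    """Hypothetical single-head sketch of a bottlenecked cross-attention block."""

    def __init__(self, dim, bottleneck_dim, rng):
        def init(rows, cols):
            return rng.normal(0.0, 0.02, (rows, cols))
        self.down_q = init(dim, bottleneck_dim)    # project the querying expert down
        self.down_kv = init(dim, bottleneck_dim)   # project the other expert down
        self.w_q = init(bottleneck_dim, bottleneck_dim)
        self.w_k = init(bottleneck_dim, bottleneck_dim)
        self.w_v = init(bottleneck_dim, bottleneck_dim)
        self.up = init(bottleneck_dim, dim)        # project back to the model width

    def __call__(self, x, context):
        # x: (n_x, dim) tokens of the expert being updated.
        # context: (n_c, dim) tokens of the expert supplying information.
        q = (x @ self.down_q) @ self.w_q
        kv_in = context @ self.down_kv
        k = kv_in @ self.w_k
        v = kv_in @ self.w_v
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (n_x, n_c)
        # The residual connection keeps the expert's own features intact.
        return x + (attn @ v) @ self.up

rng = np.random.default_rng(0)
dim, bottleneck_dim = 64, 16
temporal_to_spatial = BottleneckCrossAttention(dim, bottleneck_dim, rng)
spatial_to_temporal = BottleneckCrossAttention(dim, bottleneck_dim, rng)

spatial_tokens = rng.normal(size=(49, dim))    # illustrative token counts
temporal_tokens = rng.normal(size=(98, dim))

# Each expert queries the other; both updates read the pre-exchange tokens.
new_spatial = temporal_to_spatial(spatial_tokens, temporal_tokens)
new_temporal = spatial_to_temporal(temporal_tokens, spatial_tokens)
print(new_spatial.shape, new_temporal.shape)
```

Under these assumptions, each expert keeps its own token count and width, and the narrow bottleneck keeps the added cross-attention cheap relative to the experts' own layers. An audio expert could be attached the same way by passing its tokens as `context`.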
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Action Recognition | Kinetics-400 | Top-1 Acc | 85.3 | 413 |
| Action Recognition | SSV2 | Top-1 Acc | 71.6 | 93 |
| Action Recognition | EK100 | Verb Top-1 Acc | 72.5 | 24 |
| Audio-Visual Classification | VGGSound | Top-1 Acc | 68.3 | 24 |
| Action Recognition | Epic-100 (test) | -- | -- | 20 |
| Audio-Video Classification | Kinetics-Sound | Accuracy | 93.3 | 19 |
| Action Recognition | EK100, SSV2, and K400 | Overall Harmonic Mean | 71.6 | 18 |
| Action Recognition | EPIC-SOUNDS | Top-1 Accuracy | 61 | 17 |
| Audio-Visual Recognition | UCF-101 (full) | Top-1 Accuracy | 97.2 | 11 |
| Action Recognition | ActivityNet 1.3 (val) | Top-1 Accuracy | 91.3 | 7 |
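
One row above reports an overall harmonic mean across EK100, SSV2, and K400. The short sketch below shows why a harmonic mean is a natural summary of balanced performance. The input accuracies are made up for illustration, since this page does not state which per-dataset numbers enter the reported figure, and `harmonic_mean` is a helper written for this example.

```python
def harmonic_mean(scores):
    """Harmonic mean of positive scores: n / sum(1 / s)."""
    scores = list(scores)
    if not scores or any(s <= 0 for s in scores):
        raise ValueError("scores must be a non-empty sequence of positive numbers")
    return len(scores) / sum(1.0 / s for s in scores)

# Two hypothetical models with the same arithmetic mean (70.0).
balanced = harmonic_mean([70.0, 70.0, 70.0])  # equally good on every benchmark
skewed = harmonic_mean([90.0, 90.0, 30.0])    # strong on two, weak on one

print(round(balanced, 1), round(skewed, 1))
```

With these illustrative inputs, the balanced model keeps a harmonic mean of 70, while the skewed one drops to about 54, so a single weak benchmark pulls the score down far more than it would under an arithmetic mean.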