GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection
About
Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1% mAP (+0.5%) on Ego4D-ASD, UniTalk, and WASD benchmarks, respectively, and delivering competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the generalization of our model, while comprehensive ablations show the complementary benefits of each component.
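The abstract describes gated injection only at a high level, so the following is a minimal NumPy sketch of what a bimodally-conditioned gate could look like, not the authors' HiGate implementation. The class `GatedInjection`, the helper `fuse_at_depths`, the concatenate-then-sigmoid gate, the residual update, and all shapes are assumptions made for illustration. The Transformer layers that would sit between injection points are omitted.

```python
import numpy as np


def sigmoid(x):
    """Elementwise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


class GatedInjection:
    """Hypothetical gate that injects context features into a primary stream.

    The gate is conditioned on BOTH modalities (their concatenation), which is
    one plausible reading of "bimodally-conditioned"; the paper may differ.
    """

    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        # Gate parameters: map [primary; context] (2*dim) to a per-feature gate (dim).
        self.w_gate = rng.normal(0.0, scale, size=(2 * dim, dim))
        self.b_gate = np.zeros(dim)
        # Projection applied to the context features before injection.
        self.w_proj = rng.normal(0.0, scale, size=(dim, dim))

    def __call__(self, primary, context):
        # primary, context: arrays of shape (T, dim), one row per frame.
        joint = np.concatenate([primary, context], axis=-1)   # (T, 2*dim)
        gate = sigmoid(joint @ self.w_gate + self.b_gate)     # (T, dim), values in (0, 1)
        fused = primary + gate * (context @ self.w_proj)      # gated residual injection
        return fused, gate


def fuse_at_depths(primary, context_per_depth, gates):
    """Apply one gated injection per depth (assumed: one context tensor per depth)."""
    hidden = primary
    for context, gate in zip(context_per_depth, gates):
        hidden, _ = gate(hidden, context)
    return hidden


# Usage with random stand-ins for encoder features (5 frames, 8 channels, 3 depths).
rng = np.random.default_rng(0)
audio_features = rng.normal(size=(5, 8))
video_context = [rng.normal(size=(5, 8)) for _ in range(3)]
gates = [GatedInjection(8, rng) for _ in range(3)]
fused = fuse_at_depths(audio_features, video_context, gates)  # shape (5, 8)
```

In this sketch a gate value near 0 leaves the primary stream unchanged for that feature, while a value near 1 passes the projected context through in full, which is the sense in which the injection is adaptive.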
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Active Speaker Detection | AVA-ActiveSpeaker (test) | mAP | 90.7 | 22 |
| Active Speaker Detection | Active Speaker Detection Inference Efficiency Profiling | VRAM (GB) | 3.35 | 14 |
| Active Speaker Detection | AVA-ActiveSpeaker | mAP | 95 | 11 |
| Active Speaker Detection | UniTalk (test) | Overall mAP | 86.1 | 10 |
| Active Speaker Detection | Ego4D Audio-Visual benchmark | mAP | 77.8 | 9 |
| Active Speaker Detection | WASD (test) | mAP (OC) | 98.9 | 9 |
| Active Speaker Detection | AVA-ActiveSpeaker Internal In-Domain (test) | mAP | 95 | 7 |
| Active Speaker Detection | WASD External/Out-of-Domain (test) | mAP | 88.8 | 7 |