GateFusion: Hierarchical Gated Cross-Modal Fusion for Active Speaker Detection

About

Active Speaker Detection (ASD) aims to identify who is currently speaking in each frame of a video. Most state-of-the-art approaches rely on late fusion to combine visual and audio features, but late fusion often fails to capture fine-grained cross-modal interactions, which can be critical for robust performance in unconstrained scenarios. In this paper, we introduce GateFusion, a novel architecture that combines strong pretrained unimodal encoders with a Hierarchical Gated Fusion Decoder (HiGate). HiGate enables progressive, multi-depth fusion by adaptively injecting contextual features from one modality into the other at multiple layers of the Transformer backbone, guided by learnable, bimodally-conditioned gates. To further strengthen multimodal learning, we propose two auxiliary objectives: Masked Alignment Loss (MAL) to align unimodal outputs with multimodal predictions, and Over-Positive Penalty (OPP) to suppress spurious video-only activations. GateFusion establishes new state-of-the-art results on several challenging ASD benchmarks, achieving 77.8% mAP (+9.4%), 86.1% mAP (+2.9%), and 96.1% mAP (+0.5%) on Ego4D-ASD, UniTalk, and WASD benchmarks, respectively, and delivering competitive performance on AVA-ActiveSpeaker. Out-of-domain experiments demonstrate the generalization of our model, while comprehensive ablations show the complementary benefits of each component.
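
The gating idea described above can be illustrated with a small sketch. This is not the authors' HiGate implementation. It is a minimal NumPy example of one gated injection step, assuming a sigmoid gate computed from the concatenation of both modalities and a residual update of the target stream. The function name `gated_inject`, the parameters `w_gate`, `b_gate`, and `w_proj`, and the tensor shapes are all hypothetical; the abstract does not specify them.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_inject(target, context, w_gate, b_gate, w_proj):
    """Inject `context` features into `target` through a learnable gate.

    target, context: arrays of shape (T, D), one row per video frame.
    w_gate (2*D, D), b_gate (D,), w_proj (D, D): hypothetical parameters.
    The gate sees both modalities, so it is "bimodally conditioned" in the
    loose sense assumed here.
    """
    both = np.concatenate([target, context], axis=-1)  # (T, 2*D)
    gate = sigmoid(both @ w_gate + b_gate)             # (T, D), values in (0, 1)
    return target + gate * (context @ w_proj)          # residual, gated update


# Toy usage: inject audio context into the visual stream.
rng = np.random.default_rng(0)
T, D = 4, 8
video = rng.normal(size=(T, D))
audio = rng.normal(size=(T, D))
w_gate = rng.normal(scale=0.1, size=(2 * D, D))
b_gate = np.zeros(D)
w_proj = rng.normal(scale=0.1, size=(D, D))

fused = gated_inject(video, audio, w_gate, b_gate, w_proj)
```

In a hierarchical variant, a step like this would be repeated at several encoder depths, each with its own parameters. When the gate saturates near zero, the target stream passes through essentially unchanged, which is how such a design can fall back to unimodal features.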

Yu Wang, Juhyung Ha, Frangil M. Ramirez, Yuchen Wang, David J. Crandall • 2025

Related benchmarks

Task                      Dataset                                                  Metric       Result  Rank
Active Speaker Detection  AVA-ActiveSpeaker (test)                                 mAP          90.7    22
Active Speaker Detection  Active Speaker Detection Inference Efficiency Profiling  VRAM (GB)    3.35    14
Active Speaker Detection  AVA-ActiveSpeaker                                        mAP          95      11
Active Speaker Detection  UniTalk (test)                                           Overall mAP  86.1    10
Active Speaker Detection  Ego4D Audio-Visual benchmark                             mAP          77.8    9
Active Speaker Detection  WASD (test)                                              mAP (OC)     98.9    9
Active Speaker Detection  AVA-ActiveSpeaker Internal In-Domain (test)              mAP          95      7
Active Speaker Detection  WASD External/Out-of-Domain (test)                       mAP          88.8    7