Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection
About
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike prior work, in which systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, an audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. Experiments demonstrate that TalkNet achieves 3.5% and 2.2% improvement over the state-of-the-art systems on the AVA-ActiveSpeaker dataset and the Columbia ASD dataset, respectively. Code is available at: https://github.com/TaoRuijie/TalkNet_ASD.
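
As a rough illustration of the cross-attention step, one common formulation is scaled dot-product attention in which the queries of one modality attend to the keys and values of the other. The symbols below ($Q$, $K$, $V$, $d$, $F_{a \to v}$, $F_{v \to a}$) are notation assumed here for illustration and are not taken from the abstract:

$$
F_{a \to v} = \mathrm{softmax}\!\left(\frac{Q_v K_a^{\top}}{\sqrt{d}}\right) V_a,
\qquad
F_{v \to a} = \mathrm{softmax}\!\left(\frac{Q_a K_v^{\top}}{\sqrt{d}}\right) V_v
$$

Here the subscripts $a$ and $v$ mark linear projections of the audio and visual embeddings, and $d$ is the feature dimension. Under the same assumption, self-attention is the case where queries, keys, and values all come from one sequence, so each frame can draw on evidence from distant frames.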
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Active Speaker Detection | AVA-ActiveSpeaker (val) | mAP | 96.4 | 107 |
| Active Speaker Detection | AVA-ActiveSpeaker v1.0 (val) | mAP | 92.3 | 27 |
| Active Speaker Detection | AVA-ActiveSpeaker (test) | mAP | 90.8 | 22 |
| Active Speaker Localization | EasyCom (test) | mAP | 69.13 | 16 |
| Active Speaker Detection | Active Speaker Detection Inference Efficiency Profiling | VRAM (GB) | 1.89 | 14 |
| Active Speaker Detection | Talkies (val) | mAP | 85.9 | 14 |
| Active Speaker Detection | AVA-ActiveSpeaker v1.0 (test) | mAP | 90.8 | 13 |
| Active Speaker Detection | AVA-ActiveSpeaker | mAP | 92.3 | 11 |
| Active Speaker Detection | UniTalk (test) | Overall mAP | 75.7 | 10 |
| Active Speaker Detection | WASD (test) | mAP (OC) | 95.8 | 9 |
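
Most rows above report mAP. As a minimal sketch of what that metric measures, the snippet below computes non-interpolated average precision for binary speaking / not-speaking labels. The function name `average_precision` and the toy scores are hypothetical, and the official evaluation tools for each benchmark may differ in details such as interpolation and tie handling.

```python
def average_precision(scores, labels):
    """Non-interpolated average precision for binary labels (1 = speaking).

    Illustrative only: benchmark-specific tools may compute AP differently.
    """
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank of each true positive.
            precision_sum += hits / rank
    return precision_sum / total_positives


# Toy example: four face-track frames with predicted speaking scores.
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
toy_ap = average_precision(toy_scores, toy_labels)
```

In the toy example the two positives sit at ranks 1 and 3, so the precisions 1/1 and 2/3 are averaged. A mean over classes or subsets of such values gives an mAP-style figure.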