
Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

About

Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or more speakers. Successful ASD depends on accurate interpretation of short-term and long-term audio and visual information, as well as audio-visual interaction. Unlike prior work, in which systems make decisions instantaneously using short-term features, we propose a novel framework, named TalkNet, that makes decisions by taking both short-term and long-term features into consideration. TalkNet consists of audio and visual temporal encoders for feature representation, an audio-visual cross-attention mechanism for inter-modality interaction, and a self-attention mechanism to capture long-term speaking evidence. The experiments demonstrate that TalkNet achieves improvements of 3.5% and 2.2% over the state-of-the-art systems on the AVA-ActiveSpeaker dataset and the Columbia ASD dataset, respectively. Code has been made available at: https://github.com/TaoRuijie/TalkNet_ASD.

Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li • 2021
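
The abstract describes cross-attention between the audio and visual streams followed by self-attention over the whole sequence. The sketch below illustrates that general pattern with plain scaled dot-product attention in pure Python. It is not the TalkNet implementation: the function names (`attention`, `fuse`), the absence of learned projections, the choice of which modality supplies the queries, and the toy two-dimensional features are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of per-frame vectors.

    Hypothetical helper: no learned query/key/value projections are applied.
    """
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    output = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Each output frame is a weighted average of the value frames.
        output.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
        )
    return output


def fuse(audio, visual):
    """Cross-attend the two modalities, then self-attend over the result.

    Assumes both inputs are already frame-aligned (same number of frames).
    """
    # Cross-attention: each modality queries the other one.
    audio_attended = attention(audio, visual, visual)
    visual_attended = attention(visual, audio, audio)
    # Concatenate the two attended features frame by frame.
    joint = [a + v for a, v in zip(audio_attended, visual_attended)]
    # Self-attention lets every frame look at the entire sequence,
    # which is how long-term context can enter a per-frame decision.
    return attention(joint, joint, joint)


# Toy input: two frames, two-dimensional features per modality.
audio_frames = [[1.0, 0.0], [0.0, 1.0]]
visual_frames = [[0.5, 0.5], [1.0, 0.0]]
fused = fuse(audio_frames, visual_frames)
```

Here `fused` holds one vector per frame, twice as wide as the input features because the two attended streams are concatenated. A real system would pass each frame through a classifier to obtain a speaking or not-speaking score.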

Related benchmarks

Task | Dataset | Result | Rank
--- | --- | --- | ---
Active Speaker Detection | AVA-ActiveSpeaker (val) | mAP 96.4 | 107
Active Speaker Detection | AVA-ActiveSpeaker v1.0 (val) | mAP 92.3 | 27
Active Speaker Detection | AVA-ActiveSpeaker (test) | mAP 90.8 | 22
Active Speaker Localization | EasyCom (test) | mAP 69.13 | 16
Active Speaker Detection | Active Speaker Detection Inference Efficiency Profiling | VRAM (GB) 1.89 | 14
Active Speaker Detection | Talkies (val) | mAP 85.9 | 14
Active Speaker Detection | AVA-ActiveSpeaker v1.0 (test) | mAP 90.8 | 13
Active Speaker Detection | AVA-ActiveSpeaker | mAP 92.3 | 11
Active Speaker Detection | UniTalk (test) | Overall mAP 75.7 | 10
Active Speaker Detection | WASD (test) | mAP (OC) 95.8 | 9
Showing 10 of 18 rows
