Active Speakers in Context
About
Current methods for active speaker detection focus on modeling short-term audiovisual information from a single speaker. Although this strategy can suffice in single-speaker scenarios, it prevents accurate detection when the task is to identify which of many candidate speakers is talking. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our Active Speaker Context is designed to learn pairwise and temporal relations from a structured ensemble of audio-visual observations. Our experiments show that a structured feature ensemble already improves active speaker detection performance. Moreover, we find that the proposed Active Speaker Context improves the state of the art on the AVA-ActiveSpeaker dataset, achieving an mAP of 87.1%. We present ablation studies that verify this result is a direct consequence of our long-term multi-speaker analysis.
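The abstract describes building a structured ensemble that relates a reference speaker to other candidate speakers across a long time horizon. The following is a minimal toy sketch of that idea in NumPy: pairing a reference speaker's short-term features with each context speaker, then pooling over speakers and over time. The function name, pooling choices, and tensor shapes are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def active_speaker_context(ref_feats, ctx_feats):
    """Toy sketch of a long-term multi-speaker context (illustrative only).

    ref_feats: (T, D) short-term audiovisual features of the reference speaker.
    ctx_feats: (S, T, D) features of S candidate context speakers.
    Returns a single context vector combining pairwise and temporal relations.
    """
    # Pairwise relations: pair the reference with each context speaker at
    # every timestep, giving an (S, T, 2D) structured ensemble.
    ref = np.broadcast_to(ref_feats, ctx_feats.shape)
    pairs = np.concatenate([ref, ctx_feats], axis=-1)  # (S, T, 2D)

    # Speaker pooling: max over context speakers keeps the most salient pair.
    pooled_pairs = pairs.max(axis=0)  # (T, 2D)

    # Temporal pooling: average over the long time horizon.
    return pooled_pairs.mean(axis=0)  # (2D,)

# Example: 2 context speakers, 8 timesteps, 4-dim features.
rng = np.random.default_rng(0)
ctx_vec = active_speaker_context(rng.normal(size=(8, 4)),
                                 rng.normal(size=(2, 8, 4)))
print(ctx_vec.shape)  # (8,)
```

In the paper itself the pairwise and temporal relations are learned rather than fixed pooling operations; this sketch only conveys the shape of the structured ensemble.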
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Active Speaker Detection | AVA-ActiveSpeaker (val) | mAP 87.1 | 107 |
| Active Speaker Detection | AVA-ActiveSpeaker v1.0 (val) | mAP 87.1 | 27 |
| Active Speaker Detection | AVA-ActiveSpeaker (test) | mAP 86.7 | 22 |
| Active Speaker Detection | AVA-ActiveSpeaker v1.0 (test) | mAP 86.7 | 13 |
| Active Speaker Detection | UniTalk (test) | Overall mAP 61.4 | 10 |
| Active Speaker Detection | AVA-ActiveSpeaker ActivityNet Challenge 2019 (test) | mAP 86.7 | 9 |
| Active Speaker Detection | WASD (test) | mAP (OC) 91.2 | 9 |
| Active Speaker Detection | AVA-ActiveSpeaker Internal In-Domain (test) | mAP 83.6 | 7 |
| Active Speaker Detection | WASD External/Out-of-Domain (test) | mAP 74.6 | 7 |
| Active Speaker Detection | Talkies 1.0 (test) | mAP 77.4 | 4 |