AutoAD III: The Prequel -- Back to the Pixels
About
Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently, visual language models for AD generation are limited by a lack of suitable training data, and their evaluation is hampered by performance measures that are not specialized to the AD domain. In this paper, we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data, and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD, using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well-matched to human performance. Taken together, we improve the state of the art on AD generation.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Movie Audio Description generation | MAD-eval-Named v2 (test) | C Score 24 | 17 |
| Audio Description | MAD-Eval (test) | CIDEr 24 | 16 |
| Audio Description Generation | CMD-AD (test) | CIDEr 25 | 7 |
| Movie Audio Description generation | MAD-Eval 1.0 (test) | CIDEr 24 | 7 |
| Audio Description Generation | CMDAD (test) | CIDEr 25 | 5 |
| Audio Description Generation | CMDAD | CIDEr 25 | 5 |
| Movie Audio Description generation | CMD-AD-Eval 1.0 (test) | CIDEr 25 | 5 |
| Audio Description | CMD-AD-Eval (test) | CIDEr 21.7 | 3 |
| Audio Description Generation | TV-AD | CIDEr 26.1 | 3 |
| Audio Description Generation | TVAD (test) | CIDEr 26.1 | 3 |