AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description
About
Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs), and develop visual and text prompting strategies for this task. Our contributions are three-fold: (i) we demonstrate that a VLM can successfully name and refer to characters, without any fine-tuning, if it is directly prompted with character information through visual indications; (ii) we develop a two-stage process to generate ADs, in which the first stage asks the VLM to comprehensively describe the video and the second stage uses an LLM to summarise the dense textual information into one succinct AD sentence; (iii) we formulate a new dataset for TV audio description. Our approach, named AutoAD-Zero, demonstrates outstanding performance in AD generation for both movies and TV series (even competitive with some models fine-tuned on ground truth ADs), achieving state-of-the-art CRITIC scores.
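The two-stage process described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `vlm` and `llm` callables, the prompt wording, and the function names are all hypothetical placeholders, and the frames are assumed to already carry the visual character indications.

```python
from typing import Callable, Sequence

# Hypothetical wrappers around whichever VLM and LLM are used; not a real library API.
VLMFn = Callable[[Sequence[str], str], str]  # (annotated frame paths, text prompt) -> description
LLMFn = Callable[[str], str]                 # text prompt -> completion


def stage1_prompt(characters: Sequence[str]) -> str:
    """Illustrative Stage 1 prompt: ask for a dense description of the clip."""
    names = ", ".join(characters) if characters else "none identified"
    return (
        "Characters marked in the frames: " + names + ". "
        "Describe the characters, their actions, and the scene in detail."
    )


def stage2_prompt(dense_description: str) -> str:
    """Illustrative Stage 2 prompt: compress the dense text into one AD sentence."""
    return (
        "Summarise the following description into one succinct audio "
        "description sentence:\n" + dense_description
    )


def generate_ad(
    frames: Sequence[str],
    characters: Sequence[str],
    vlm: VLMFn,
    llm: LLMFn,
) -> str:
    """Run Stage 1 (VLM dense description), then Stage 2 (LLM summary)."""
    dense = vlm(frames, stage1_prompt(characters))
    return llm(stage2_prompt(dense)).strip()
```

Because both models are passed in as plain callables, neither stage requires training, and either model can be swapped without touching the pipeline itself.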
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Movie Audio Description generation | MAD-eval-Named v2 (test) | C Score | 22.4 | 17 |
| Audio Description | MAD-Eval (test) | CIDEr | 22.4 | 16 |
| Audio Description Generation | CMD-AD (test) | CIDEr | 17.7 | 7 |
| Audio Description Generation | CMDAD (test) | CIDEr | 17.7 | 5 |
| Audio Description Generation | CMDAD | CIDEr | 17.7 | 5 |
| Audio Description Generation | TV-AD | CIDEr | 22.6 | 3 |
| Audio Description Generation | TVAD (test) | CIDEr | 22.6 | 3 |
| Audio Description Generation | TVAD | CIDEr | 22.6 | 3 |