AutoAD-Zero: A Training-Free Framework for Zero-Shot Audio Description

About

Our objective is to generate Audio Descriptions (ADs) for both movies and TV series in a training-free manner. We use the power of off-the-shelf Visual-Language Models (VLMs) and Large Language Models (LLMs), and develop visual and text prompting strategies for this task. Our contributions are three-fold: (i) We demonstrate that a VLM can successfully name and refer to characters if directly prompted with character information through visual indications, without requiring any fine-tuning; (ii) A two-stage process is developed to generate ADs, with the first stage asking the VLM to comprehensively describe the video, followed by a second stage utilising an LLM to summarise dense textual information into one succinct AD sentence; (iii) A new dataset for TV audio description is formulated. Our approach, named AutoAD-Zero, demonstrates outstanding performance (even competitive with some models fine-tuned on ground truth ADs) in AD generation for both movies and TV series, achieving state-of-the-art CRITIC scores.
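
The two-stage process in (ii) can be pictured with a minimal Python sketch. This is an illustration only, not the authors' implementation: the `MarkedCharacter` structure, the prompt wording, the idea of describing each visual indication by a marker label, and the `vlm` / `llm` callables are all hypothetical stand-ins for whichever off-the-shelf models and prompts are actually used.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MarkedCharacter:
    """A character already marked on the frames (hypothetical structure)."""
    name: str
    marker: str  # assumed label for the visual indication, e.g. a colour


def build_stage1_prompt(characters: Sequence[MarkedCharacter]) -> str:
    """Stage 1 text prompt: ask the VLM for a comprehensive description."""
    who = ""
    if characters:
        legend = "; ".join(f"{c.name} ({c.marker} marker)" for c in characters)
        who = f"Characters marked in the frames: {legend}. "
    return who + "Describe the video comprehensively, referring to characters by name."


def build_stage2_prompt(dense_description: str) -> str:
    """Stage 2 text prompt: ask the LLM for one succinct AD sentence."""
    return (
        "Summarise the following description into one succinct audio "
        "description sentence.\n\n" + dense_description
    )


def generate_ad(
    frames: Sequence[object],
    characters: Sequence[MarkedCharacter],
    vlm: Callable[[Sequence[object], str], str],  # assumed interface: (frames, prompt) -> text
    llm: Callable[[str], str],                    # assumed interface: prompt -> text
) -> str:
    """Run both stages: dense VLM description, then LLM summary."""
    dense = vlm(frames, build_stage1_prompt(characters))
    return llm(build_stage2_prompt(dense)).strip()
```

Because both models are passed in as plain callables, no training or fine-tuning step appears anywhere in the sketch; swapping in a different VLM or LLM only changes the two arguments.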

Junyu Xie, Tengda Han, Max Bain, Arsha Nagrani, Gül Varol, Weidi Xie, Andrew Zisserman • 2024

Related benchmarks

Task	Dataset	Result
Movie Audio Description generation	MAD-eval-Named v2 (test)	C Score 22.4	17
Audio Description	MAD-Eval (test)	CIDEr 22.4	16
Audio Description Generation	MAD-Eval (test)	ROUGE-L 14.6	14
Audio Description Generation	CMD-AD	CIDEr 22.4	9
Audio Description Generation	CMD-AD (test)	CIDEr 17.7	7
Audio Description Generation	CMDAD (test)	CIDEr 17.7	5
Audio Description Generation	CMDAD	CIDEr 17.7	5
Audio Description Generation	TV-AD	CIDEr 22.6	3
Audio Description Generation	TVAD (test)	CIDEr 22.6	3
Audio Description Generation	TVAD	CIDEr 22.6	3
