Adapting a Text-to-Audio Model for Room Impulse Response Generation
About
Room Impulse Responses (RIRs) enable realistic acoustic simulation, with applications ranging from multimedia production to speech data augmentation. However, acquiring high-quality real-world RIRs is labor-intensive, and data scarcity remains a challenge for data-driven RIR generation approaches. In this paper, we propose a novel approach to RIR generation by adapting a pre-trained text-to-audio model, demonstrating for the first time that large-scale generative audio priors can be effectively leveraged for this task. To address the lack of paired text-RIR data, we use a labeling pipeline that leverages vision-language models to extract acoustic descriptions from existing image-RIR datasets. We also introduce an in-context learning strategy to accommodate free-form user prompts at inference time. Evaluations, including a subjective listening test, demonstrate that our model generates plausible RIRs. Audio examples are available on our demo website.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Room Impulse Response Perceptual Realism Evaluation | BUT ReverbDB (test) | MUSHRA Score: 55.01 | 5 |
| Automatic Speech Recognition | LibriSpeech ASR (test) | -- | 5 |
| Room Impulse Response Generation | BUT ReverbDB (test) | Mean RT60 Error: 5.56 | 3 |
| Speech Quality and Intelligibility Evaluation | LibriSpeech (test) | Mean PESQ: 1.57 | 2 |
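
The RT60 metric in the table refers to reverberation time, the time it takes the sound energy in a room to decay by 60 dB. The page does not say how RT60 or its error was computed, so the following is only a minimal sketch of one common approach: Schroeder backward integration followed by a line fit to the decay curve. The function name `estimate_rt60` and the fit range of -5 dB to -25 dB are assumptions chosen for illustration, not details taken from the paper.

```python
import numpy as np


def estimate_rt60(rir, sample_rate, decay_start_db=-5.0, decay_end_db=-25.0):
    """Estimate RT60 in seconds from a room impulse response.

    Hypothetical helper: uses Schroeder backward integration and a
    straight-line fit over an assumed range (-5 dB to -25 dB by default),
    then extrapolates the fitted slope to a 60 dB decay.
    """
    energy = np.asarray(rir, dtype=np.float64) ** 2

    # Schroeder integral: energy remaining from each sample to the end.
    edc = np.cumsum(energy[::-1])[::-1]

    # Energy decay curve in dB, normalized so it starts at 0 dB.
    edc_db = 10.0 * np.log10(edc / edc[0] + 1e-30)

    # Select the samples that fall inside the fitting range.
    idx = np.where((edc_db <= decay_start_db) & (edc_db >= decay_end_db))[0]
    if idx.size < 2:
        raise ValueError("RIR decay is too short for the requested fit range")

    # Fit a line to the decay; the slope is in dB per second and negative.
    times = idx / sample_rate
    slope, _intercept = np.polyfit(times, edc_db[idx], 1)

    # Time needed for a 60 dB drop at the fitted decay rate.
    return -60.0 / slope
```

Given a generated RIR and a reference RIR at the same sample rate, the two estimates can be compared directly, for example `abs(estimate_rt60(generated, fs) - estimate_rt60(reference, fs))`. Whether the reported error is absolute or relative is not specified in the table.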