The Rhythm In Anything: Audio-Prompted Drums Generation with Masked Language Modeling

About

Musicians and nonmusicians alike use rhythmic sound gestures, such as tapping and beatboxing, to express drum patterns. While these gestures effectively communicate musical ideas, realizing these ideas as fully-produced drum recordings can be time-consuming, potentially disrupting many creative workflows. To bridge this gap, we present TRIA (The Rhythm In Anything), a masked transformer model for mapping rhythmic sound gestures to high-fidelity drum recordings. Given an audio prompt of the desired rhythmic pattern and a second prompt to represent drumkit timbre, TRIA produces audio of a drumkit playing the desired rhythm (with appropriate elaborations) in the desired timbre. Subjective and objective evaluations show that a TRIA model trained on less than 10 hours of publicly-available drum data can generate high-quality, faithful realizations of sound gestures across a wide range of timbres in a zero-shot manner.

Patrick O'Reilly, Julia Barnett, Hugo Flores García, Annie Chu, Nathan Pruyne, Prem Seetharaman, Bryan Pardo • 2025

Related benchmarks

Task	Dataset	Metric	Result	Rank
Rhythm prompt adherence	AVP	Onset F1	34.7	3
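
For readers unfamiliar with the metric, onset F1 measures how well the onsets detected in the generated audio line up with the onsets in the rhythm prompt. The following is a minimal sketch, not the authors' evaluation code: the function name `onset_f1`, the 50 ms tolerance, and the greedy one-to-one matching are all assumptions made for illustration. It returns a value between 0 and 1, whereas the result listed above appears to be on a 0 to 100 scale.

```python
def onset_f1(reference, estimated, tolerance=0.05):
    """F1 score between two lists of onset times, in seconds.

    Assumption: an estimated onset counts as a hit when it lies within
    `tolerance` seconds of a not-yet-matched reference onset. Matching is
    a greedy left-to-right pass over both sorted lists, which is simpler
    than the optimal bipartite matching used by some evaluation toolkits.
    """
    ref = sorted(reference)
    est = sorted(estimated)
    i = j = matches = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            # Pair these two onsets and move past both.
            matches += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            # Estimated onset is too early to match anything later: false positive.
            j += 1
        else:
            # Reference onset has no estimate close enough: miss.
            i += 1
    if matches == 0:
        return 0.0
    precision = matches / len(est)
    recall = matches / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    prompt_onsets = [0.0, 0.5, 1.0, 1.5]   # hypothetical onsets in the rhythm prompt
    output_onsets = [0.01, 0.52, 1.2]      # hypothetical onsets in the generated drums
    print(round(onset_f1(prompt_onsets, output_onsets), 3))
```

In this toy case two of the three generated onsets fall within tolerance of a prompt onset, so precision is 2/3 and recall is 2/4.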

