Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset

About

Generating musical audio directly with neural networks is notoriously difficult because it requires coherently modeling structure at many different timescales. Fortunately, most music is also highly structured and can be represented as discrete note events played on musical instruments. Herein, we show that by using notes as an intermediate representation, we can train a suite of models capable of transcribing, composing, and synthesizing audio waveforms with coherent musical structure on timescales spanning six orders of magnitude (~0.1 ms to ~100 s), a process we call Wave2Midi2Wave. This large advance in the state of the art is enabled by our release of the new MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organization) dataset, composed of over 172 hours of virtuosic piano performances captured with fine alignment (~3 ms) between note labels and audio waveforms. The networks and the dataset together present a promising approach toward creating new expressive and interpretable neural models of music.
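The timescale figures quoted in the abstract can be checked with a little arithmetic. The sketch below is only an illustration: the helper names are hypothetical, and the 44.1 kHz sample rate is an assumption chosen for the example, not a value stated on this page.

```python
import math


def orders_of_magnitude(shortest_s: float, longest_s: float) -> float:
    """Return how many powers of ten separate two durations given in seconds."""
    return math.log10(longest_s / shortest_s)


def seconds_to_samples(duration_s: float, sample_rate_hz: int) -> int:
    """Convert a duration in seconds to the nearest whole number of audio samples."""
    return round(duration_s * sample_rate_hz)


# Endpoints quoted in the abstract: ~0.1 ms to ~100 s.
span = orders_of_magnitude(0.1e-3, 100.0)

# ~3 ms alignment expressed in samples.
# ASSUMPTION: 44.1 kHz is used here purely for illustration.
alignment_samples = seconds_to_samples(3e-3, 44_100)

print(f"timescale span: {span:.1f} orders of magnitude")
print(f"3 ms at 44.1 kHz: about {alignment_samples} samples")
```

Under that assumed sample rate, a 3 ms alignment tolerance corresponds to roughly a hundred-odd samples, which gives a sense of how tight the note-to-audio alignment is relative to the waveform.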

Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, Douglas Eck • 2018

Related benchmarks

Task	Dataset	Metric	Result
Piano Transcription	Maps	Activation Precision	90.5	13
Piano Transcription	MAESTRO (test)	Frame Precision	93.1	6
Pedal Transcription	MAESTRO (test)	Frame Precision	94.3	5
Automatic Piano Transcription	Maps (test)	Precision	87.5	5
Automatic Piano Transcription (Onset & Offset)	MAESTRO	Precision	82.95	5
Music Transcription	MAESTRO	Precision	98.27	5
Unconditional Music Generation	MAESTRO	Selection Rate	37.7	4
Symbolic music generation	MAESTRO v1 (val)	Negative Log-Likelihood	1.84	2
