DuoTok: Source-Aware Dual-Track Tokenization for Multi-Track Music Language Modeling
About
Audio tokenization bridges continuous waveforms and multi-track music language models. In dual-track modeling, tokens should preserve three properties at once: high-fidelity reconstruction, strong predictability under a language model, and cross-track correspondence. We introduce DuoTok, a source-aware dual-track tokenizer that addresses this trade-off through staged disentanglement. DuoTok first pretrains a semantic encoder, then regularizes it with multi-task supervision, freezes the encoder, and applies hard dual-codebook routing while keeping auxiliary objectives on quantized codes. A diffusion decoder reconstructs high-frequency details, allowing tokens to focus on structured information for sequence modeling. On standard benchmarks, DuoTok achieves a favorable predictability-fidelity trade-off, reaching the lowest cnBPT while maintaining competitive reconstruction at 0.75 kbps. Under a held-constant dual-track language modeling protocol, enBPT also improves, indicating gains beyond codebook size effects. Controlled diagnostics show larger predictability costs under cross-track corruption and larger gains from longer context, suggesting that models trained on DuoTok tokens use cross-track structure and non-local history.
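As a rough illustration of what "hard dual-codebook routing" could look like, the sketch below quantizes each track's frozen-encoder latents against its own codebook only, then offsets the accompaniment ids so the two vocabularies stay disjoint. This is a sketch under stated assumptions, not the authors' implementation: the nearest-neighbour lookup, the id offset, the codebook sizes, and the names `quantize` and `route_dual_track` are all hypothetical choices that the abstract does not specify.

```python
import numpy as np

def quantize(latents, codebook):
    """Nearest-neighbour lookup: latents (T, D), codebook (K, D) -> ids (T,)."""
    # Squared Euclidean distance between every frame and every code vector.
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def route_dual_track(vocal, accomp, cb_vocal, cb_accomp):
    """Hard routing (assumed form): each track only sees its own codebook."""
    vocal_ids = quantize(vocal, cb_vocal)
    # Offset accompaniment ids so the two token vocabularies do not overlap.
    accomp_ids = quantize(accomp, cb_accomp) + cb_vocal.shape[0]
    return vocal_ids, accomp_ids

# Toy data: 8 codes per track, 4-dimensional latents (sizes are illustrative).
rng = np.random.default_rng(0)
cb_vocal = rng.normal(size=(8, 4))
cb_accomp = rng.normal(size=(8, 4))

# Latents built as slightly perturbed copies of known codes.
vocal = cb_vocal[[3, 1, 7]] + 1e-3 * rng.normal(size=(3, 4))
accomp = cb_accomp[[0, 5, 2]] + 1e-3 * rng.normal(size=(3, 4))

vocal_ids, accomp_ids = route_dual_track(vocal, accomp, cb_vocal, cb_accomp)
print(vocal_ids, accomp_ids)
```

Keeping the id ranges disjoint lets a single language model tell from the token value alone which track a token belongs to; how the two streams are actually ordered or interleaved for the language model is not stated in the abstract and is left out here.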
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio Tagging | MTT | MTT AP | 35 | 8 |
| Neural Audio Coding | Codec Benchmark | cnBPT | 48 | 8 |
| Track-to-track modeling | Codec Benchmark Vocal to Accompaniment | Accuracy @1 | 13 | 3 |
| Unconditional multi-track modeling | Multi-track Music Codec Vocal track | Accuracy @1 | 14 | 3 |
| Unconditional multi-track modeling | Multi-track Music Codec Accompaniment track | Accuracy @1 | 11 | 3 |
| Unconditional multi-track modeling | Multi-track Music Codec Average | cnBPT | 0.483 | 3 |