A Lightweight Two-Branch Architecture for Multi-Instrument Transcription via Note-Level Contrastive Clustering
About
Existing multi-timbre transcription models struggle with poor generalization beyond pre-trained instruments, rigid source-count constraints, and computational demands that hinder deployment on low-resource devices. We address these limitations with a lightweight model that extends a timbre-agnostic transcription backbone with a dedicated timbre encoder and performs deep clustering at the note level. This enables joint transcription and dynamic separation of arbitrary instruments, given a specified number of instrument classes. Practical optimizations, including spectral normalization, dilated convolutions, and contrastive clustering, further improve efficiency and robustness. Despite its small size and fast inference, the model performs competitively with heavier baselines in both transcription accuracy and separation quality and shows promising generalization, making it well suited to deployment in resource-constrained settings.
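To make the note-level clustering step more concrete, the sketch below groups per-note timbre embeddings into a specified number of instrument classes. It is an illustration under stated assumptions, not the model's actual procedure: the function name `cluster_notes`, the toy 2-D embeddings, and the use of cosine k-means with farthest-point initialization are all hypothetical stand-ins, and the contrastive training that would produce the embeddings is not shown.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    n = math.sqrt(sum(c * c for c in v)) or 1.0
    return [c / n for c in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cluster_notes(note_embeddings, num_instruments, num_iters=50):
    """Assign each note to one of `num_instruments` clusters.

    Assumption: plain cosine k-means stands in for whatever clustering
    the model applies to its learned note-level timbre embeddings.
    """
    x = [normalize(v) for v in note_embeddings]

    # Farthest-point initialization keeps the sketch deterministic.
    centers = [x[0]]
    while len(centers) < num_instruments:
        far = min(x, key=lambda v: max(dot(v, c) for c in centers))
        centers.append(far)

    def assign(cs):
        # Each note goes to the center with the highest cosine similarity.
        return [max(range(len(cs)), key=lambda k: dot(v, cs[k])) for v in x]

    labels = assign(centers)
    for _ in range(num_iters):
        for k in range(num_instruments):
            members = [v for v, lab in zip(x, labels) if lab == k]
            if members:  # leave an empty cluster's center where it is
                centers[k] = normalize([sum(col) for col in zip(*members)])
        new_labels = assign(centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Hypothetical transcribed notes as (onset_s, offset_s, midi_pitch),
# each paired with a toy 2-D timbre embedding.
notes = [(0.0, 0.5, 60), (0.5, 1.0, 62), (1.0, 1.5, 64),
         (0.0, 1.0, 48), (1.0, 2.0, 43), (2.0, 3.0, 45)]
embeddings = [[1.0, 0.1], [0.9, 0.0], [1.0, -0.1],
              [0.1, 1.0], [0.0, 0.9], [-0.1, 1.0]]

labels = cluster_notes(embeddings, num_instruments=2)

# Split the timbre-agnostic note list into one track per cluster.
tracks = {}
for note, lab in zip(notes, labels):
    tracks.setdefault(lab, []).append(note)
```

Because the number of clusters is an argument rather than a fixed output dimension, the same routine can be called with a different `num_instruments` for each mixture, which is the flexibility the abstract refers to when it describes separation given a specified number of instrument classes.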
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Timbre-agnostic transcription | BACH10 (tutti) | F_F Score | 86.5 | 14 |
| Timbre-agnostic transcription | BACH10 (stems) | F-score (F) | 91.9 | 14 |
| Timbre-separated transcription | BACH10 2mix | F-score (Frame-based Separation) | 83.4 | 7 |
| Timbre-separated transcription | BACH10 3mix | F-score FS | 77.8 | 7 |
| Timbre-separated transcription | BACH10 4mix | F-score (Frame-based) | 68.5 | 7 |
| Timbre-separated transcription | URMP 2mix | F-score (Frame-level) | 69.1 | 7 |
| Timbre-separated transcription | URMP 3mix | F-score (Frame-level Separation) | 58.6 | 7 |
| Timbre-agnostic transcription | PHENICX | F_F Score | 70.1 | 5 |
| Timbre-agnostic transcription | URMP stems | F_F Score | 82.3 | 3 |
| Timbre-agnostic transcription | URMP (tutti) | F_F | 79.7 | 3 |