ULTRAS -- Unified Learning of Transformer Representations for Audio and Speech Signals
About
Self-supervised learning (SSL) has driven impressive advances in speech processing by adopting time-domain prediction objectives, while audio representation learning frameworks operate on time-frequency spectrograms. Models optimized for one paradigm struggle to transfer to the other, highlighting the need for a joint framework. We propose Unified Learning of Transformer Representations for Audio and Speech (ULTRAS), where masking and predictive modeling are performed over long patches of the data. The model, based on the transformer architecture, encodes spectral patches of log-mel spectrogram features. Predictive modeling of the masked segments is performed on spectral and temporal targets using a combined loss function, forcing the representations to encode both time and frequency traits. Experiments on a variety of speech and audio tasks show that the ULTRAS framework achieves improved performance over other established baselines.
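The core idea above — masking long spectrogram patches and predicting both spectral and temporal targets with a combined loss — can be sketched as follows. This is an illustrative NumPy sketch, not the paper's implementation: the patch size, mask ratio, target definitions (per-frame and per-bin averages), and the `alpha` weighting are all assumptions made for clarity.

```python
import numpy as np

def mask_patches(spec, patch_frames=16, mask_ratio=0.5, rng=None):
    """Split a log-mel spectrogram (n_mels, n_frames) into long time
    patches and zero out a random subset of them.
    Patch size and mask ratio are illustrative, not the paper's values."""
    rng = rng or np.random.default_rng(0)
    n_mels, n_frames = spec.shape
    n_patches = n_frames // patch_frames
    n_masked = int(round(mask_ratio * n_patches))
    masked_idx = rng.choice(n_patches, size=n_masked, replace=False)
    masked = spec.copy()
    for i in masked_idx:
        masked[:, i * patch_frames:(i + 1) * patch_frames] = 0.0
    return masked, masked_idx

def combined_loss(pred, target, alpha=0.5):
    """Combined objective over two views of the masked region:
    a temporal target (frame-wise average over mel bins) and a
    spectral target (bin-wise average over frames), each scored
    with MSE. `alpha` is a hypothetical weighting knob."""
    temporal = np.mean((pred.mean(axis=0) - target.mean(axis=0)) ** 2)
    spectral = np.mean((pred.mean(axis=1) - target.mean(axis=1)) ** 2)
    return alpha * temporal + (1.0 - alpha) * spectral
```

In a real training loop, a transformer encoder would map the masked spectrogram to predictions for the hidden patches, and the combined loss would be computed only over the masked positions; the two loss terms push the representations to capture temporal and spectral structure jointly.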
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Environmental Sound Classification | ESC-50 (5-fold cross-validation) | Accuracy: 91.15% | 38 |
| Speech Emotion Recognition | IEMOCAP (five-fold/ten-fold cross-validation) | WA: 67.78 | 25 |
| Musical Instrument Classification | NSynth (test) | Accuracy: 76.52% | 22 |
| Audio Classification | UrbanSound8K (official 10-fold split) | Accuracy: 86.07% | 15 |
| Speaker Identification | VOX 1 (evaluation) | Accuracy: 73.55% | 5 |
| Speech Command Recognition | SPCV2 (evaluation) | Accuracy: 95.1% | 5 |