Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space

About

We present Echo, a proof-of-concept audio system built around a single 25M-parameter ViT encoder. The encoder is pretrained with a JEPA objective and then specialised in stages to carry speaker identity, phonetic content, and dynamic source routing in the same 512-dimensional latent space, with no per-task fine-tuning at deployment. Lightweight heads handle diarization (ArcFace + VBx) and dynamic source separation (null-target K-set prediction). On synthetic VoxCeleb2 mixtures with an unknown number of speakers K, the canonical stack reaches 15.00% blind DER, 97.80% PIT separation accuracy with +9.52 dB latent SI-SDR, and a +53.50-point speaker/content factorisation gap on a held-out k-NN probe. The point of Echo is not a new SOTA on any single task but the coexistence of three tasks on one encoder at this footprint. We document the design stage by stage, report the dead ends, and identify the structural wall that still bounds the PoC: end-to-end ASR through the VQ bottleneck.
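For readers unfamiliar with the separation metrics quoted above, the following is a minimal sketch of the standard SI-SDR definition and of permutation-invariant (PIT) matching under an MSE cost, applied to plain vectors. It is not Echo's evaluation code: the abstract does not say how the latent SI-SDR or the PIT accuracy are computed, so the function names `si_sdr` and `pit_mse`, the mean-centering step, and the `eps` constant are all assumptions made for illustration.

```python
import itertools
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def si_sdr(estimate, target, eps=1e-8):
    """Scale-invariant SDR in dB between two equal-length vectors.

    Standard textbook definition; assumed here, not taken from the source.
    """
    mean_e = sum(estimate) / len(estimate)
    mean_t = sum(target) / len(target)
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]
    # Project the estimate onto the target; the residual counts as noise.
    scale = _dot(e, t) / (_dot(t, t) + eps)
    projection = [scale * x for x in t]
    noise = [x - p for x, p in zip(e, projection)]
    ratio = (_dot(projection, projection) + eps) / (_dot(noise, noise) + eps)
    return 10.0 * math.log10(ratio)


def pit_mse(estimates, targets):
    """Return (best_permutation, loss): the assignment of estimates to
    targets with the lowest mean MSE, searched over all orderings."""

    def mse(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

    best_perm, best_loss = None, math.inf
    for perm in itertools.permutations(range(len(targets))):
        loss = sum(mse(estimates[i], targets[j]) for i, j in enumerate(perm)) / len(perm)
        if loss < best_loss:
            best_perm, best_loss = perm, loss
    return best_perm, best_loss
```

With two sources, `pit_mse` compares only two orderings; one plausible reading of "PIT accuracy" is the fraction of test mixtures for which the lowest-cost ordering matches the true one, though the source does not confirm this definition. The exhaustive search grows factorially with K and is only practical for small K.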

Louis Mouchon • 2026

Related benchmarks

Task	Dataset	Result
Speaker Diarization	VoxConverse v0.3	--	6
Speaker Diarization	VoxCeleb2 (held-out)	Predicted K 3.84	2
Dynamic K routing	VoxCeleb2 (held-out)	Accuracy 94.4	1
Speaker / content factorisation gap	VoxCeleb2 (held-out)	Gap (points) 53.5	1
Speaker Diarization	Synthetic VoxCeleb2 2-spk	DER 15	1
Speaker Identification	VoxCeleb1 40 spk (test)	Top-1 Accuracy 93.76	1
Speaker Identification	VoxCeleb2 40 x 30 (held-out)	Top-1 Accuracy 71.25	1
Speaker Identification	VoxCeleb2 ArcFace (held-out)	Top-1 Accuracy 63.17	1
Speaker Separation	VoxCeleb2 (held-out)	PIT Accuracy (MSE) 97.8	1
Speaker Diarization	Simulated 2-spk mixtures	--	1

Showing 10 of 13 rows
