Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space

About

We present Echo, a proof-of-concept audio system built around a single 25M-parameter ViT encoder. The encoder is pretrained with a JEPA objective and then specialised in stages to carry speaker identity, phonetic content, and dynamic source routing in the same 512-dimensional latent space, with no per-task fine-tuning at deployment. Lightweight heads handle diarization (ArcFace + VBx) and dynamic source separation (null-target K-set prediction). On synthetic VoxCeleb2 mixtures with unknown K, the canonical stack reaches 15.00% blind DER, 97.80% PIT separation accuracy with +9.52 dB latent SI-SDR, and a +53.50-point speaker/content factorisation gap on a held-out k-NN probe. The point of Echo is not a new SOTA on any single task but the coexistence of three tasks on one encoder at this footprint. We document the design stage by stage, report the dead ends, and identify the structural wall on end-to-end ASR through the VQ bottleneck that still bounds the PoC.
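For readers unfamiliar with the separation metrics quoted above, the following is a minimal sketch of scale-invariant SDR combined with a permutation-invariant (PIT) assignment between estimated and reference sources. It is an illustration under stated assumptions, not Echo's evaluation code: the abstract does not give the exact definition of "latent SI-SDR" or "PIT accuracy", so the sketch simply treats each source as a flat vector, and the names `si_sdr` and `best_permutation` are hypothetical.

```python
import itertools
import math


def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB between two equal-length vectors.

    Assumption: sources are plain 1-D sequences of floats; Echo's own
    metric is computed on latents and may differ in detail.
    """
    # Remove the mean so constant offsets do not affect the score.
    e_mean = sum(estimate) / len(estimate)
    t_mean = sum(target) / len(target)
    e = [x - e_mean for x in estimate]
    t = [x - t_mean for x in target]
    # Project the estimate onto the target: the optimally scaled target.
    alpha = sum(a * b for a, b in zip(e, t)) / (sum(b * b for b in t) + eps)
    s_target = [alpha * b for b in t]
    # Everything in the estimate not explained by the target is residual.
    residual = [a - s for a, s in zip(e, s_target)]
    num = sum(s * s for s in s_target)
    den = sum(r * r for r in residual)
    return 10.0 * math.log10((num + eps) / (den + eps))


def best_permutation(estimates, targets):
    """Return (perm, mean SI-SDR) for the assignment with the highest score.

    perm[i] is the index of the target matched to estimates[i].
    Brute force over K! permutations, so only suitable for small K.
    """
    best = None
    for perm in itertools.permutations(range(len(targets))):
        score = sum(
            si_sdr(estimates[i], targets[p]) for i, p in enumerate(perm)
        ) / len(perm)
        if best is None or score > best[1]:
            best = (perm, score)
    return best


# Two reference sources, and estimates that arrive swapped and rescaled.
targets = [[1.0, -1.0, 1.0, -1.0], [1.0, 1.0, -1.0, -1.0]]
estimates = [[2.1, 1.9, -2.0, -2.0], [0.5, -0.5, 0.5, -0.4]]
perm, score = best_permutation(estimates, targets)
print(perm, round(score, 2))
```

Because the score is scale-invariant and the assignment is searched over all permutations, the estimates are matched to the right references even though they come out in a different order and at a different gain.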

Louis Mouchon • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speaker Diarization | VoxConverse v0.3 | -- | -- | 6 |
| Speaker Diarization | VoxCeleb2 (held-out) | Predicted K | 3.84 | 2 |
| Dynamic K routing | VoxCeleb2 (held-out) | Accuracy | 94.4 | 1 |
| Speaker / content factorisation gap | VoxCeleb2 (held-out) | Gap (points) | 53.5 | 1 |
| Speaker Diarization | Synthetic VoxCeleb2 2-spk | DER | 15 | 1 |
| Speaker Identification | VoxCeleb1 40 spk (test) | Top-1 Accuracy | 93.76 | 1 |
| Speaker Identification | VoxCeleb2 40 x 30 (held-out) | Top-1 Accuracy | 71.25 | 1 |
| Speaker Identification | VoxCeleb2 ArcFace (held-out) | Top-1 Accuracy | 63.17 | 1 |
| Speaker Separation | VoxCeleb2 (held-out) | PIT Accuracy (MSE) | 97.8 | 1 |
| Speaker Diarization | Simulated 2-spk mixtures | -- | -- | 1 |
Showing 10 of 13 rows
