Echo: A Joint-Embedding Predictive Architecture for Speaker Diarization and Speech Recognition in a Shared Latent Space
About
We present Echo, a proof-of-concept audio system built around a single 25M-parameter ViT encoder. The encoder is pretrained with a JEPA objective and then specialised in stages so that speaker identity, phonetic content, and dynamic source routing share the same 512-dimensional latent space, with no per-task fine-tuning at deployment. Lightweight heads handle diarization (ArcFace + VBx) and dynamic source separation (null-target K-set prediction). On synthetic VoxCeleb2 mixtures with unknown K, the canonical stack reaches 15.00% blind DER, 97.80% PIT separation accuracy with +9.52 dB latent SI-SDR, and a +53.50-point speaker/content factorisation gap on a held-out k-NN probe. The goal of Echo is not a new state of the art on any single task but the coexistence of three tasks on one encoder at this footprint. We document the design stage by stage, report the dead ends, and identify the structural wall on end-to-end ASR through the VQ bottleneck that still bounds the proof of concept.
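For readers unfamiliar with the separation metrics quoted above, the following is a minimal sketch of the textbook definitions of SI-SDR and permutation-invariant (PIT) matching under MSE, applied to plain vectors. It is not Echo's evaluation code: the function names `si_sdr` and `pit_assign` are hypothetical, the vectors are toy data, and how Echo computes "latent" SI-SDR or turns PIT matches into an accuracy figure is not specified here. The sketch also skips mean removal, which some SI-SDR definitions apply first.

```python
import itertools
import math


def dot(a, b):
    """Inner product of two equal-length sequences of floats."""
    return sum(x * y for x, y in zip(a, b))


def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant SDR in dB (standard definition, no mean removal)."""
    # Project the estimate onto the target; the projection is the "signal" part.
    scale = dot(estimate, target) / (dot(target, target) + eps)
    projection = [scale * t for t in target]
    # Whatever is left after the projection is treated as distortion.
    residual = [e - p for e, p in zip(estimate, projection)]
    ratio = (dot(projection, projection) + eps) / (dot(residual, residual) + eps)
    return 10.0 * math.log10(ratio)


def pit_assign(estimates, targets):
    """Return (best_mse, best_perm), where estimates[best_perm[i]] is matched to targets[i]."""
    k = len(targets)
    dim = len(targets[0])
    best_mse, best_perm = float("inf"), None
    # Brute force over all K! assignments; fine for the small K of a toy example.
    for perm in itertools.permutations(range(k)):
        sq_err = sum(
            (e - t) ** 2
            for i in range(k)
            for e, t in zip(estimates[perm[i]], targets[i])
        )
        mse = sq_err / (k * dim)
        if mse < best_mse:
            best_mse, best_perm = mse, perm
    return best_mse, best_perm


# Toy example: two 4-dimensional "sources" whose estimates come out in swapped order.
targets = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
estimates = [[0.1, 0.9, 0.0, 0.0], [0.8, 0.0, 0.1, 0.0]]
mse, perm = pit_assign(estimates, targets)
scores = [si_sdr(estimates[perm[i]], targets[i]) for i in range(len(targets))]
```

In this toy case the matching recovers the swapped order before any per-source score is computed, which is the point of permutation-invariant evaluation: the separator is not penalised for emitting sources in an arbitrary order.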
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Speaker Diarization | VoxConverse v0.3 | -- | -- | 6 |
| Speaker Diarization | VoxCeleb2 (held-out) | Predicted K | 3.84 | 2 |
| Dynamic K routing | VoxCeleb2 (held-out) | Accuracy | 94.4 | 1 |
| Speaker / content factorisation gap | VoxCeleb2 (held-out) | Gap (points) | 53.5 | 1 |
| Speaker Diarization | Synthetic VoxCeleb2 2-spk | DER | 15 | 1 |
| Speaker Identification | VoxCeleb1 40 spk (test) | Top-1 Accuracy | 93.76 | 1 |
| Speaker Identification | VoxCeleb2 40 x 30 (held-out) | Top-1 Accuracy | 71.25 | 1 |
| Speaker Identification | VoxCeleb2 ArcFace (held-out) | Top-1 Accuracy | 63.17 | 1 |
| Speaker Separation | VoxCeleb2 (held-out) | PIT Accuracy (MSE) | 97.8 | 1 |
| Speaker Diarization | Simulated 2-spk mixtures | -- | -- | 1 |