Giving Sensors a Voice: Multimodal JEPA for Semantic Time-Series Embeddings
About
Transformer-based architectures have advanced sequence modeling in language and vision, yet general-purpose representation learning for heterogeneous multivariate time series remains underexplored. We introduce CHARM (Channel-Aware Representation Model), which incorporates channel-level textual descriptions into a Transformer encoder equivariant to channel order. CHARM is trained with a Joint Embedding Predictive Architecture (JEPA) and a novel loss promoting informative, temporally stable embeddings; latent-space prediction encourages robustness to sensor noise while description-aware gating provides interpretability through learned inter-channel relationships. Across anomaly detection, classification, and short- and long-term forecasting, the learned embeddings achieve strong performance using only a linear probe. Performance is driven primarily by the JEPA objective and conditioning architecture, with text descriptions serving as channel identifiers for cross-dataset generalization.
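The linear-probe evaluation mentioned above can be illustrated with a minimal sketch: the encoder is kept frozen and only a linear map is fit on its output embeddings. This is not the CHARM evaluation code. The function names `fit_linear_probe` and `predict_linear_probe`, the ridge-regression formulation, and the synthetic embeddings standing in for encoder outputs are all assumptions made for illustration.

```python
import numpy as np


def fit_linear_probe(embeddings, labels, num_classes, l2=1e-3):
    """Fit a ridge-regression probe on frozen embeddings (closed form).

    Hypothetical helper: the paper's actual probe setup may differ.
    """
    n, d = embeddings.shape
    X = np.hstack([embeddings, np.ones((n, 1))])  # append a bias column
    Y = np.eye(num_classes)[labels]               # one-hot targets
    # Solve the regularized normal equations (X^T X + l2 I) W = X^T Y.
    A = X.T @ X + l2 * np.eye(d + 1)
    return np.linalg.solve(A, X.T @ Y)


def predict_linear_probe(W, embeddings):
    """Return the class with the largest linear score for each embedding."""
    n = embeddings.shape[0]
    X = np.hstack([embeddings, np.ones((n, 1))])
    return np.argmax(X @ W, axis=1)


# Synthetic stand-in for frozen encoder outputs (assumption: 8-dim embeddings,
# two classes separated along the first dimension).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200)
centers = np.zeros((2, 8))
centers[0, 0] = 3.0
centers[1, 0] = -3.0
embeddings = centers[labels] + 0.5 * rng.standard_normal((200, 8))

# Fit on the first 150 samples, evaluate on the remaining 50.
W = fit_linear_probe(embeddings[:150], labels[:150], num_classes=2)
predictions = predict_linear_probe(W, embeddings[150:])
accuracy = float(np.mean(predictions == labels[150:]))
print(f"probe accuracy: {accuracy:.3f}")
```

Because the probe has no nonlinearity and the encoder receives no gradient, any accuracy it reaches reflects how linearly separable the embeddings already are, which is why this protocol is commonly used to assess representation quality.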
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Time Series Forecasting | ETTm1 | MSE 0.411 | 363 |
| Anomaly Detection | UCR | F1 Score 75.4 | 28 |
| Multivariate Time Series Classification | UEA | Average Accuracy 80.9 | 18 |
| Forecasting | Exchange Rate | MSE 0.092 | 16 |
| Time Series Forecasting | Weather | MSE 0.222 | 14 |
| Multivariate Anomaly Detection | SKAB | F1 Score 86 | 6 |
| Classification | UCI Hydraulic Systems (unseen) | Valve Condition Accuracy (4-class) 99.8 | 2 |