Giving Sensors a Voice: Multimodal JEPA for Semantic Time-Series Embeddings

About

Transformer-based architectures have advanced sequence modeling in language and vision, yet general-purpose representation learning for heterogeneous multivariate time series remains underexplored. We introduce CHARM (Channel-Aware Representation Model), which incorporates channel-level textual descriptions into a Transformer encoder equivariant to channel order. CHARM is trained with a Joint Embedding Predictive Architecture (JEPA) and a novel loss promoting informative, temporally stable embeddings; latent-space prediction encourages robustness to sensor noise while description-aware gating provides interpretability through learned inter-channel relationships. Across anomaly detection, classification, and short- and long-term forecasting, the learned embeddings achieve strong performance using only a linear probe. Performance is driven primarily by the JEPA objective and conditioning architecture, with text descriptions serving as channel identifiers for cross-dataset generalization.
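The abstract's claim of an encoder "equivariant to channel order" can be illustrated with a minimal sketch. It assumes channels are treated as tokens in plain single-head self-attention with no positional encoding. This is not CHARM's actual architecture, which adds description-aware gating; the helper names (`matmul`, `self_attention`, `random_matrix`) and all sizes are hypothetical and chosen only for illustration. The sketch checks that reordering the input channels reorders the outputs in the same way.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention over channel tokens, with no positional encoding."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)               # scores[i][j] = q_i . k_j
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)

def random_matrix(rows, cols, rng):
    """Matrix of standard-normal samples (illustrative stand-in for learned weights)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_channels, dim = 4, 3                        # hypothetical sizes
tokens = random_matrix(n_channels, dim, rng)  # one token per sensor channel
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))

perm = [2, 0, 3, 1]  # an arbitrary reordering of the channels
out = self_attention(tokens, w_q, w_k, w_v)
out_perm = self_attention([tokens[i] for i in perm], w_q, w_k, w_v)

# Equivariance: permuting the inputs permutes the outputs the same way.
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip([out[i] for i in perm], out_perm)
    for a, b in zip(row_a, row_b)
)
print(max_diff < 1e-9)
```

Because no positional information is attached to the channel axis, a channel's identity would have to come from somewhere else, which is consistent with the abstract's description of the text descriptions as channel identifiers.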

Utsav Dutta, Gerardo Pastrana, Sina Khoshfetrat Pakazad, Henrik Ohlsson • 2026

Related benchmarks

Task	Dataset	Result
Time Series Forecasting	ETTm1	MSE 0.411	363
Anomaly Detection	UCR	F1 Score 75.4	28
Multivariate Time Series Classification	UEA	Average Accuracy 80.9	18
Forecasting	Exchange Rate	MSE 0.092	16
Time Series Forecasting	Weather	MSE 0.222	14
Multivariate Anomaly Detection	SKAB	F1 Score 86	6
Classification	UCI Hydraulic Systems (unseen)	Valve Condition Accuracy (4-class) 99.8	2

