TimeMaster: Training Time-Series Multimodal LLMs to Reason via Reinforcement Learning

About

Time-series reasoning remains a significant challenge for multimodal large language models (MLLMs) because of dynamic temporal patterns, ambiguous semantics, and a lack of temporal priors. In this work, we introduce TimeMaster, a reinforcement learning (RL)-based method that enables time-series MLLMs to perform structured, interpretable reasoning directly over visualized time-series inputs and task prompts. TimeMaster adopts a three-part structured output format (reasoning, classification, and domain-specific extension) and is optimized via a composite reward function that aligns format adherence, prediction accuracy, and open-ended insight quality. The model is trained using a two-stage pipeline: we first apply supervised fine-tuning (SFT) to establish a good initialization, then apply Group Relative Policy Optimization (GRPO) at the token level to enable stable and targeted reward-driven improvement in time-series reasoning. We evaluate TimeMaster, built on Qwen2.5-VL-3B-Instruct, on the TimerBed benchmark across six real-world classification tasks. TimeMaster achieves state-of-the-art performance, outperforming classical time-series models and few-shot GPT-4o by over 14.6% and 7.3%, respectively. Notably, TimeMaster goes beyond time-series classification: it also exhibits expert-like reasoning behavior, generates context-aware explanations, and delivers domain-aligned insights. Our results highlight that reward-driven RL can be a scalable and promising path toward integrating temporal understanding into time-series MLLMs.
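The composite reward and the group-relative step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tag names (`<reasoning>`, `<class>`, `<extension>`), the reward weights, and the externally supplied `insight_score` are all assumptions introduced here, since the abstract names the three reward components but not how they are computed or combined. The sketch covers only reward scoring and the standard GRPO advantage normalization within a group of sampled outputs; it does not cover the token-level policy update.

```python
import re
from statistics import mean, pstdev

# Hypothetical tag names: the abstract names the three output parts
# but does not specify how they are delimited.
PATTERN = re.compile(
    r"^<reasoning>(.+?)</reasoning>\s*"
    r"<class>(.+?)</class>\s*"
    r"<extension>(.+?)</extension>$",
    re.DOTALL,
)


def composite_reward(output, true_label, insight_score,
                     w_format=0.2, w_acc=0.6, w_insight=0.2):
    """Combine format, accuracy, and insight terms into one scalar.

    The weights are illustrative assumptions. `insight_score` is a
    value in [0, 1] assumed to come from some external judge.
    """
    match = PATTERN.match(output.strip())
    if match is None:
        # Malformed output earns nothing in this sketch.
        return 0.0
    predicted = match.group(2).strip()
    accuracy = 1.0 if predicted == true_label else 0.0
    return w_format + w_acc * accuracy + w_insight * insight_score


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled outputs.

    Each advantage is (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    good = ("<reasoning>Slow waves dominate.</reasoning>"
            "<class>N3</class><extension>Deep sleep stage.</extension>")
    bad = "The answer is N3."
    rewards = [composite_reward(good, "N3", 1.0),
               composite_reward(bad, "N3", 1.0)]
    print(rewards, group_relative_advantages(rewards))
```

Normalizing within the group means an output is rewarded only for being better than its siblings sampled from the same prompt, which is why no separate value model appears in the sketch.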

Junru Zhang, Lang Feng, Xu Guo, Yuhan Wu, Yabo Dong, Duanqing Xu • 2025

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Time Series Reasoning	SLEEP QA	Acc	0.7255	22
Time Series Reasoning	RCW	Accuracy	76.99	22
Time Series Reasoning	TSQA	Accuracy	61.22	22
Time Series Reasoning	ECG-QA	Accuracy	69.31	22
Time Series Reasoning	TRQA	Accuracy	72.08	22
Time Series Reasoning	ETI	Accuracy	49	22
Anomaly Location Detection	AnomLLM (test)	Frequency Precision (P)	57.3	14
Anomaly Classification	AnomLLM (test)	Accuracy	57.9	13
