A Picture is Worth A Thousand Numbers: Enabling LLMs Reason about Time Series via Visualization

About

Large language models (LLMs), with demonstrated reasoning abilities across multiple domains, are largely underexplored for time-series reasoning (TsR), which is ubiquitous in the real world. In this work, we propose TimerBed, the first comprehensive testbed for evaluating LLMs' TsR performance. Specifically, TimerBed includes stratified reasoning patterns with real-world tasks, comprehensive combinations of LLMs and reasoning strategies, and various supervised models as comparison anchors. We perform extensive experiments with TimerBed, test multiple current beliefs, and verify the initial failures of LLMs in TsR, evidenced by the ineffectiveness of zero-shot (ZST) reasoning and the performance degradation of few-shot in-context learning (ICL). Further, we identify one possible root cause: the numerical modeling of data. To address this, we propose a prompt-based solution, VL-Time, which uses visualization-modeled data and language-guided reasoning. Experimental results demonstrate that VL-Time enables multimodal LLMs to be non-trivial ZST and powerful ICL reasoners for time series, achieving about a 140% average performance improvement and a 99% average reduction in token cost.
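The contrast the abstract draws between numerical modeling and visualization modeling of a series can be illustrated with a small sketch. This is not the authors' VL-Time code: the function names `series_as_text` and `series_as_svg`, the chart dimensions, and the sine-wave input are all hypothetical and chosen only for illustration. The chart is emitted as SVG so the example needs nothing beyond the standard library; in practice it would typically need rasterizing to an image format that a multimodal model accepts.

```python
# Illustrative sketch only: not the authors' VL-Time implementation.
import math


def series_as_text(values, precision=3):
    """Numerical modeling: serialize every sample into the prompt text."""
    return ", ".join(f"{v:.{precision}f}" for v in values)


def series_as_svg(values, width=400, height=200, pad=10):
    """Visualization modeling: draw the series as a line chart (SVG markup)."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a flat series
    step = (width - 2 * pad) / max(len(values) - 1, 1)
    points = " ".join(
        f"{pad + i * step:.1f},{height - pad - (v - lo) / span * (height - 2 * pad):.1f}"
        for i, v in enumerate(values)
    )
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        f'<polyline fill="none" stroke="black" points="{points}"/></svg>'
    )


# Hypothetical input: a 1,000-sample sine wave standing in for a real signal.
signal = [math.sin(2 * math.pi * t / 100) for t in range(1000)]

numeric_prompt = series_as_text(signal)  # one number per sample
chart = series_as_svg(signal)            # one fixed-size drawing

# The text form carries one entry per sample, so it grows with series length.
print(len(numeric_prompt.split(", ")))  # 1000
```

The numeric string grows linearly with the number of samples, whereas the rendered chart keeps the same pixel dimensions however long the series is. That is one plausible intuition for the token-cost reduction the abstract reports, although the actual cost of an image input depends on the model being used.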

Haoxin Liu, Chenghao Liu, B. Aditya Prakash • 2024

Related benchmarks

Task	Dataset	Result
Time Series Reasoning	TRQA	Accuracy 58.5	36
Time Series Reasoning	TSQA	Accuracy 44.93	36
Time Series Reasoning	ETI	Accuracy 28.5	36
Time Series Reasoning	ECG-QA	Accuracy 64.36	22
Time Series Reasoning	RCW	Accuracy 57.52	22
Time Series Reasoning	SLEEP QA	Acc 0.2353	22
Classification	TimerBed	Accuracy 33.2	9
Regression	TSQA	MAE 61.294	9
Question Answering	MTBench Weather	Accuracy 61.3	9
Classification	TSQA	Accuracy 49.8	9

Showing 10 of 16 rows
