Health-LLM: Large Language Models for Health Prediction via Wearable Sensor Data

About

Large language models (LLMs) are capable of many natural language tasks, yet they are far from perfect. In health applications, grounding and interpreting domain-specific and non-linguistic data is crucial. This paper investigates the capacity of LLMs to make inferences about health based on contextual information (e.g., user demographics, health knowledge) and physiological data (e.g., resting heart rate, sleep minutes). We present a comprehensive evaluation of 12 state-of-the-art LLMs with prompting and fine-tuning techniques on four public health datasets (PMData, LifeSnaps, GLOBEM and AW_FB). Our experiments cover 10 consumer health prediction tasks in mental health, activity, metabolic, and sleep assessment. Our fine-tuned model, HealthAlpaca, exhibits comparable performance to much larger models (GPT-3.5, GPT-4 and Gemini-Pro), achieving the best performance in 8 out of 10 tasks. Ablation studies highlight the effectiveness of context enhancement strategies. Notably, we observe that our context enhancement can yield up to 23.8% improvement in performance. While constructing contextually rich prompts (combining user context, health knowledge and temporal information) exhibits synergistic improvement, the inclusion of health knowledge context in prompts significantly enhances overall performance.
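
The context enhancement described in the abstract (user context, health knowledge and temporal information combined with physiological data) might be sketched as a simple prompt builder. This is a minimal illustration only, not the authors' prompt template: the function name `build_prompt`, its parameters, the section labels and the sample values are all assumptions made for this example.

```python
from typing import Optional


def build_prompt(
    question: str,
    physio: dict[str, float],
    user_context: Optional[str] = None,
    health_knowledge: Optional[str] = None,
    temporal: Optional[dict[str, list[float]]] = None,
) -> str:
    """Assemble a plain-text prompt from optional context blocks.

    Hypothetical helper: the labels and ordering are illustrative,
    not taken from the paper.
    """
    parts = []
    if user_context:
        parts.append(f"User context: {user_context}")
    if health_knowledge:
        parts.append(f"Health knowledge: {health_knowledge}")
    # Current physiological readings, e.g. resting heart rate, sleep minutes.
    readings = ", ".join(f"{name}: {value}" for name, value in physio.items())
    parts.append(f"Physiological data: {readings}")
    if temporal:
        # One line per signal, listing its recent daily values in order.
        for name, series in temporal.items():
            values = ", ".join(str(v) for v in series)
            parts.append(f"{name} over the last {len(series)} days: {values}")
    parts.append(f"Question: {question}")
    return "\n".join(parts)


# Made-up sample values, for illustration only.
prompt = build_prompt(
    question="Estimate the user's sleep quality for today.",
    physio={"resting heart rate (bpm)": 62, "sleep minutes": 410},
    user_context="Example adult user.",
    health_knowledge="Placeholder for a short piece of relevant health knowledge.",
    temporal={"sleep minutes": [395, 420, 410]},
)
print(prompt)
```

Omitting the optional arguments gives a baseline prompt with physiological data only, which is one way an ablation over context types could be set up.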

Yubin Kim, Xuhai Xu, Daniel McDuff, Cynthia Breazeal, Hae Won Park • 2024

Related benchmarks

Task	Dataset	Metric	Result
Medical Question Answering	VitalBench Tier A (635) (test)	Overall Accuracy	82	11
Medical Question Answering	VitalBench Tier B (670) (test)	Overall Accuracy	58.8	11
Long-Horizon Analytical Question Answering	Wearable Sensing Datasets (test)	Numeric Metric	28.9	8
Short-Horizon Analytical Question Answering	Wearable Sensing Datasets (test)	Exact Match (EM)	10	8
Predictive Reasoning Question Answering	Wearable Sensing Datasets (test)	UAR	55.7	8
Fine-grained Query Answering	Clinical Narratives Item-level (test)	ROUGE-L	14.2	3
Item-level Evaluation	Item-level QA dataset	ROUGE-1	17.6	3
Narrative Summary Generation	Clinical Narratives Summary-level (test)	ROUGE-L	15.1	3
Summary-level Evaluation	Narrative dataset Summary-level QA	ROUGE-1	29.4	3
