Dynamic Summary Generation for Interpretable Multimodal Depression Detection

About

Depression remains widely underdiagnosed and undertreated because stigma and subjective symptom ratings hinder reliable screening. To address this challenge, we propose a coarse-to-fine, multi-stage framework that leverages large language models (LLMs) for accurate and interpretable detection. The pipeline performs binary screening, five-class severity classification, and continuous regression. At each stage, an LLM produces progressively richer clinical summaries that guide a multimodal fusion module integrating text, audio, and video features, yielding predictions with transparent rationale. The system then consolidates all summaries into a concise, human-readable assessment report. Experiments on the E-DAIC and CMDC datasets show significant improvements over state-of-the-art baselines in both accuracy and interpretability.
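The coarse-to-fine flow described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the names `CoarseToFinePipeline`, `toy_summarize`, and `toy_predict` are hypothetical, the LLM call is replaced by a string-building stub, the fusion module is replaced by a plain average of the modality features, and the 0-24 score range is an assumed placeholder.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Features = Dict[str, List[float]]  # modality name -> feature vector


@dataclass
class StageResult:
    name: str
    summary: str
    prediction: float


@dataclass
class CoarseToFinePipeline:
    # Hypothetical callables standing in for the LLM summarizer and the fusion model.
    summarize: Callable[[str, str, List[str]], str]
    predict: Callable[[str, Features, str], float]
    stages: List[str] = field(
        default_factory=lambda: ["binary", "severity", "regression"]
    )

    def run(self, transcript: str, features: Features) -> List[StageResult]:
        results: List[StageResult] = []
        for stage in self.stages:
            # Each stage sees the summaries of all earlier (coarser) stages.
            earlier = [r.summary for r in results]
            summary = self.summarize(stage, transcript, earlier)
            # The stage summary guides the prediction for this stage.
            prediction = self.predict(stage, features, summary)
            results.append(StageResult(stage, summary, prediction))
        return results

    def report(self, results: List[StageResult]) -> str:
        # Consolidate all stage summaries into one readable report.
        return "\n".join(f"[{r.name}] {r.summary} -> {r.prediction}" for r in results)


def toy_summarize(stage: str, transcript: str, earlier: List[str]) -> str:
    # Stub for an LLM call: only records what a real prompt would be built from.
    words = len(transcript.split())
    return f"{stage} summary #{len(earlier) + 1} of a {words}-word transcript"


def toy_predict(stage: str, features: Features, summary: str) -> float:
    # Stub for multimodal fusion: average every feature value (summary is unused here).
    values = [v for vec in features.values() for v in vec]
    score = sum(values) / len(values)
    if stage == "binary":
        return float(score >= 0.5)            # 0.0 or 1.0
    if stage == "severity":
        return float(min(4, int(score * 5)))  # class index 0..4
    return round(score * 24, 2)               # assumed 0-24 score range


pipeline = CoarseToFinePipeline(summarize=toy_summarize, predict=toy_predict)
results = pipeline.run(
    "I have been feeling tired lately",
    {"text": [0.5], "audio": [0.5], "video": [0.5]},
)
print(pipeline.report(results))
```

Passing the earlier summaries into each later stage is what makes the sketch coarse-to-fine: the regression stage receives the context produced by the screening and severity stages.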

Shiyu Teng, Jiaqing Liu, Hao Sun, Yu Li, Shurong Chai, Ruibo Hou, Tomoko Tateyama, Lanfen Lin, Yen-Wei Chen • 2026

Related benchmarks

Task	Dataset	Result	Rank
Depression Recognition	E-DAIC-WOZ (test)	MAE 3.32	16
Depression Recognition	CMDC (test)	RMSE 3.81	8
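The two metrics in the Result column, MAE (mean absolute error) and RMSE (root mean squared error), are both error measures, so lower values are better. A minimal sketch of how they are computed, using made-up scores rather than any data from the paper:

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    # Mean absolute error: average size of the prediction errors.
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    # Root mean squared error: penalizes large errors more heavily than MAE.
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Illustrative values only (not taken from E-DAIC or CMDC).
true_scores = [10.0, 4.0]
predicted_scores = [7.0, 8.0]
print(mae(true_scores, predicted_scores), rmse(true_scores, predicted_scores))
```

Because the two rows report different metrics on different datasets, their values are not directly comparable.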

