Dynamic Summary Generation for Interpretable Multimodal Depression Detection
About
Depression remains widely underdiagnosed and undertreated because stigma and subjective symptom ratings hinder reliable screening. To address this challenge, we propose a coarse-to-fine, multi-stage framework that leverages large language models (LLMs) for accurate and interpretable detection. The pipeline performs binary screening, five-class severity classification, and continuous regression. At each stage, an LLM produces progressively richer clinical summaries that guide a multimodal fusion module integrating text, audio, and video features, yielding predictions with transparent rationale. The system then consolidates all summaries into a concise, human-readable assessment report. Experiments on the E-DAIC and CMDC datasets show significant improvements over state-of-the-art baselines in both accuracy and interpretability.
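The coarse-to-fine flow described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `summarize` stands in for the LLM call, `fuse` stands in for the multimodal fusion module, and the stage names, thresholds, and the `max_score` of 24 are all assumptions introduced here for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical representation: modality name -> feature vector.
Features = Dict[str, List[float]]


@dataclass
class StageResult:
    stage: str
    summary: str
    prediction: float


def summarize(transcript: str, stage: str, previous: List[str]) -> str:
    """Placeholder for the LLM call. Later stages see earlier summaries,
    so each summary can build on the ones before it."""
    context = " | ".join(previous) if previous else "none"
    return f"[{stage} summary, {len(transcript.split())} words, context: {context}]"


def fuse(features: Features, summary: str) -> float:
    """Placeholder for multimodal fusion: averages the per-modality means.
    A real module would condition on the summary; this stub ignores it."""
    means = [sum(v) / len(v) for v in features.values()]
    return sum(means) / len(means)


def run_pipeline(
    transcript: str, features: Features, max_score: float = 24.0
) -> Tuple[List[StageResult], str]:
    """Run binary screening, five-class severity, then regression."""
    # Each head maps the fused score in [0, 1] to a stage-specific output.
    heads = {
        "binary": lambda s: float(s >= 0.5),           # assumed threshold
        "severity": lambda s: float(min(4, int(s * 5))),  # classes 0..4
        "regression": lambda s: s * max_score,         # assumed score range
    }
    results: List[StageResult] = []
    summaries: List[str] = []
    for stage, head in heads.items():
        summary = summarize(transcript, stage, summaries)
        score = fuse(features, summary)
        results.append(StageResult(stage, summary, head(score)))
        summaries.append(summary)
    # Consolidate all stage summaries into one readable report.
    report = "\n".join(f"{r.stage}: {r.summary}" for r in results)
    return results, report


# Toy inputs, invented for illustration only.
features = {"text": [0.5, 0.75], "audio": [0.625], "video": [0.25, 1.0]}
results, report = run_pipeline("I have been feeling tired lately", features)
```

The point of the sketch is the control flow: each stage receives the summaries produced so far, and the final report is assembled from all of them rather than from the last stage alone.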
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Depression Recognition | E-DAIC-WOZ (test) | MAE | 3.32 | 16 |
| Depression Recognition | CMDC (test) | RMSE | 3.81 | 8 |
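
For reference, the two metrics in the table are the standard mean absolute error (MAE) and root mean squared error (RMSE) between predicted and true scores; lower is better for both. A minimal sketch of how they are computed, using made-up scores that are not taken from the paper:

```python
import math
from typing import Sequence


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error: sqrt of the average squared difference."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


# Illustrative values only.
y_true = [3.0, 5.0]
y_pred = [0.0, 9.0]
mae_value = mae(y_true, y_pred)
rmse_value = rmse(y_true, y_pred)
```

RMSE penalizes large individual errors more heavily than MAE, so the two numbers in the table are not directly comparable across datasets.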