Fully-automated sleep staging: multicenter validation of a generalizable deep neural network for Parkinson's disease and isolated REM sleep behavior disorder
About
Isolated REM sleep behavior disorder (iRBD) is a key prodromal marker of Parkinson's disease (PD), and video-polysomnography (vPSG) remains the diagnostic gold standard. However, manual sleep staging is particularly challenging in neurodegenerative diseases due to EEG abnormalities and fragmented sleep, making PSG assessments a bottleneck for deploying new RBD screening technologies at scale. We adapted U-Sleep, a deep neural network, for generalizable sleep staging in PD and iRBD. A U-Sleep model pretrained on a large, multisite non-neurodegenerative dataset (PUB; 19,236 PSGs across 12 sites) was fine-tuned on research datasets from two centers (the Lundbeck Foundation Parkinson's Disease Research Center (PACE) and the Cologne-Bonn Cohort (CBC); 112 PD, 138 iRBD, and 89 age-matched controls). The resulting model was evaluated on an independent dataset from the Danish Center for Sleep Medicine (DCSM; 81 PD, 36 iRBD, 87 sleep-clinic controls). A subset of PSGs with low agreement between the human rater and the model (Cohen's $\kappa$ < 0.6) was re-scored by a second blinded human rater to identify sources of disagreement. Finally, we applied confidence-based thresholds to optimize REM sleep staging. The pretrained model achieved mean $\kappa$ = 0.81 in PUB, but only $\kappa$ = 0.66 when applied directly to PACE/CBC. Fine-tuning yielded a generalized model with $\kappa$ = 0.74 on PACE/CBC (p < 0.001 vs. the pretrained model). In DCSM, mean and median $\kappa$ increased from 0.60 to 0.64 (p < 0.001) and from 0.64 to 0.69 (p < 0.001), respectively. In the interrater study, PSGs with low agreement between the model and the initial scorer showed similarly low agreement between human scorers. Applying a confidence threshold increased the proportion of correctly identified REM sleep epochs from 85% to 95.5%, while preserving sufficient (> 5 min) REM sleep for 95% of subjects.
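As a rough illustration of the two quantities the abstract relies on, the sketch below computes Cohen's $\kappa$ between two per-epoch hypnograms and keeps only REM epochs whose predicted probability clears a confidence threshold. It is not the authors' implementation. The label encoding, the function names `cohen_kappa`, `confident_rem_mask`, and `rem_minutes`, the 30-second epoch length, and the threshold value of 0.9 are all assumptions made for this example; the abstract does not state the threshold that was used.

```python
import numpy as np

# Assumed label encoding (hypothetical, not taken from the paper).
W, N1, N2, N3, REM = 0, 1, 2, 3, 4
EPOCH_SECONDS = 30  # assumed epoch length


def cohen_kappa(rater_a, rater_b, n_classes=5):
    """Cohen's kappa between two sequences of integer stage labels."""
    a = np.asarray(rater_a)
    b = np.asarray(rater_b)
    # Confusion matrix: rows index rater A, columns index rater B.
    cm = np.zeros((n_classes, n_classes), dtype=float)
    np.add.at(cm, (a, b), 1)
    n = cm.sum()
    p_observed = np.trace(cm) / n
    # Chance agreement from the two raters' marginal label frequencies.
    p_expected = (cm.sum(axis=1) @ cm.sum(axis=0)) / n**2
    if p_expected == 1.0:
        return 1.0  # both raters used one identical class throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


def confident_rem_mask(probs, threshold=0.9):
    """Mask of epochs predicted as REM with probability >= threshold.

    probs has shape (n_epochs, n_classes), one probability row per epoch.
    """
    probs = np.asarray(probs)
    predicted = probs.argmax(axis=1)
    return (predicted == REM) & (probs[:, REM] >= threshold)


def rem_minutes(mask):
    """Minutes of REM sleep retained by a boolean epoch mask."""
    return mask.sum() * EPOCH_SECONDS / 60.0


manual = [W, W, N2, N2, REM, REM, REM, N3]
model = [W, N1, N2, N2, REM, REM, N2, N3]
kappa = cohen_kappa(manual, model)

probs = [
    [0.05, 0.05, 0.05, 0.05, 0.80],  # REM, but below the threshold
    [0.01, 0.01, 0.01, 0.01, 0.96],  # REM, above the threshold
    [0.70, 0.10, 0.10, 0.05, 0.05],  # wake
]
mask = confident_rem_mask(probs, threshold=0.9)
```

Raising the threshold trades retained REM time for precision, which is why the abstract reports both the share of correctly identified REM epochs and the share of subjects who still have more than 5 minutes of REM sleep.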
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Sleep Stage Classification | SHHS | F1 Macro | 80 | 23 |
| Sleep Stage Classification | Sleep-EDF ST | MF1 | 83 | 11 |
| Sleep Staging | Physio 2018 | Macro F1 | 81 | 4 |
| Sleep Stage Classification | ABC | Macro F1 | 81 | 2 |
| Sleep Stage Classification | Chat | Macro F1 | 86 | 2 |
| Sleep Stage Classification | HOMEPAP | Macro F1 | 78 | 2 |
| Sleep Stage Classification | MESA | Macro F1 | 81 | 2 |
| Sleep Stage Classification | SOF | Macro F1 | 79 | 2 |
| Sleep Stage Classification | CCSHS | Macro F1 | 0.84 | 2 |
| Sleep Stage Classification | CFS | Macro F1 | 77 | 2 |
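The benchmark results above are macro-averaged F1 scores, which weight every sleep stage equally regardless of how many epochs it contains. A minimal sketch of that metric follows. The function name `macro_f1` and the toy labels are hypothetical, and the page does not say whether the listed scores were pooled over all epochs or averaged per recording, so this shows only the pooled variant.

```python
import numpy as np


def macro_f1(y_true, y_pred, n_classes=5):
    """Unweighted mean of per-class F1 scores over integer labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_true == c) & (y_pred == c))
        fp = np.sum((y_true != c) & (y_pred == c))
        fn = np.sum((y_true == c) & (y_pred != c))
        denom = 2 * tp + fp + fn
        # Convention chosen here: a class absent from both inputs scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
score = macro_f1(y_true, y_pred, n_classes=3)
```

Because rare stages such as N1 count as much as common ones such as N2, macro F1 tends to be lower than plain accuracy on sleep data.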