
Beyond the Fold: Quantifying Split-Level Noise and the Case for Leave-One-Dataset-Out AU Evaluation

About

Subject-exclusive cross-validation is the standard evaluation protocol for facial Action Unit (AU) detection, yet reported improvements are often small. We show that cross-validation itself introduces measurable stochastic variance. On BP4D+, repeated 3-fold subject-exclusive splits produce an empirical noise floor of $\pm 0.065$ in average F1, with substantially larger variation for low-prevalence AUs. Operating-point metrics such as F1 fluctuate more than threshold-independent measures such as AUC, and model ranking can change under different fold assignments. We further evaluate cross-dataset robustness using a Leave-One-Dataset-Out (LODO) protocol across five AU datasets. LODO removes partition randomness and exposes domain-level instability that is not visible under single-dataset cross-validation. Together, these results suggest that gains often reported in cross-fold validation may fall within protocol variance. Leave-one-dataset-out cross-validation yields more stable and interpretable findings.
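The two protocols contrasted in the abstract can be sketched in a few lines. This is a minimal illustration and not the authors' code: the function names (`subject_exclusive_folds`, `lodo_splits`, `mean_fold_f1`), the synthetic data, and the use of fixed predictions are all assumptions made for the example. The paper retrains models for each split, whereas the toy demo below only re-partitions fixed predictions, so it shows how the fold assignment alone can move an averaged F1.

```python
import random
import statistics


def subject_exclusive_folds(subject_ids, k=3, seed=0):
    """Assign every subject to exactly one of k folds (hypothetical helper).

    Returns k (train_indices, test_indices) pairs; no subject appears on
    both sides of a split. The assignment depends on `seed`.
    """
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    fold_of = {s: i % k for i, s in enumerate(subjects)}
    splits = []
    for fold in range(k):
        test = [i for i, s in enumerate(subject_ids) if fold_of[s] == fold]
        train = [i for i, s in enumerate(subject_ids) if fold_of[s] != fold]
        splits.append((train, test))
    return splits


def lodo_splits(dataset_ids):
    """Leave-One-Dataset-Out: each dataset is held out once (hypothetical helper).

    There is no seed: the partition is fully determined by the dataset labels.
    """
    splits = []
    for held_out in sorted(set(dataset_ids)):
        test = [i for i, d in enumerate(dataset_ids) if d == held_out]
        train = [i for i, d in enumerate(dataset_ids) if d != held_out]
        splits.append((held_out, train, test))
    return splits


def f1_score(y_true, y_pred):
    """Binary F1; returns 0.0 when there are no positives and no predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mean_fold_f1(subject_ids, y_true, y_pred, k=3, seed=0):
    """Average the per-fold test F1 for one random fold assignment."""
    scores = []
    for _, test in subject_exclusive_folds(subject_ids, k=k, seed=seed):
        scores.append(f1_score([y_true[i] for i in test],
                               [y_pred[i] for i in test]))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Synthetic stand-in data (assumption): 30 subjects, 20 frames each,
    # a low-prevalence label, and noisy fixed predictions.
    rng = random.Random(0)
    subject_ids, y_true, y_pred = [], [], []
    for s in range(30):
        prevalence = rng.uniform(0.02, 0.3)  # varies by subject
        for _ in range(20):
            label = 1 if rng.random() < prevalence else 0
            pred = label if rng.random() < 0.8 else 1 - label
            subject_ids.append(f"subj{s}")
            y_true.append(label)
            y_pred.append(pred)

    # Repeat the 3-fold split with different seeds and look at the spread.
    means = [mean_fold_f1(subject_ids, y_true, y_pred, k=3, seed=seed)
             for seed in range(20)]
    print("spread of mean F1 across fold assignments:",
          statistics.pstdev(means))
```

For real experiments, scikit-learn's `GroupKFold` and `LeaveOneGroupOut` offer comparable group-aware splitting; the hand-written versions here are only meant to make the source of randomness explicit. The seed appears in the subject-exclusive split and is absent from the LODO split.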

Saurabh Hinduja, Gurmeet Kaur, Maneesh Bilalpur, Jeffrey Cohn, Shaun Canavan • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Action Unit Detection | BP4D+ | -- | 22 |
| Action Unit Detection | BP4D | F1 (AU1): 53.7 | 1 |
| Action Unit Detection | BP4D+ | F1 (AU1): 36.8 | 1 |
| Action Unit Detection | GFT | F1 (AU1): 22.1 | 1 |
| Action Unit Detection | UNBC | F1 (AU4): 9.5 | 1 |
