Beyond the Fold: Quantifying Split-Level Noise and the Case for Leave-One-Dataset-Out AU Evaluation

About

Subject-exclusive cross-validation is the standard evaluation protocol for facial Action Unit (AU) detection, yet reported improvements are often small. We show that cross-validation itself introduces measurable stochastic variance. On BP4D+, repeated 3-fold subject-exclusive splits produce an empirical noise floor of $\pm 0.065$ in average F1, with substantially larger variation for low-prevalence AUs. Operating-point metrics such as F1 fluctuate more than threshold-independent measures such as AUC, and model rankings can change under different fold assignments. We further evaluate cross-dataset robustness using a Leave-One-Dataset-Out (LODO) protocol across five AU datasets. LODO removes partition randomness and exposes domain-level instability that is not visible under single-dataset cross-validation. Together, these results suggest that the gains often reported under cross-validation may fall within protocol variance, and that LODO evaluation yields more stable and interpretable findings.
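The two protocols named in the abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`subject_exclusive_folds`, `split_noise`, `leave_one_dataset_out`) and the `fit_predict` callback are hypothetical, and the standard deviation over repeated partitions is used here as one plausible reading of "noise floor".

```python
import random
import statistics


def subject_exclusive_folds(subject_ids, n_folds=3, seed=0):
    """Partition the unique subjects into n_folds disjoint sets (hypothetical helper)."""
    subjects = sorted(set(subject_ids))
    random.Random(seed).shuffle(subjects)
    return [set(subjects[i::n_folds]) for i in range(n_folds)]


def f1(y_true, y_pred):
    """Binary F1 for one AU; returns 0.0 when there are no positives at all."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def split_noise(subject_ids, labels, fit_predict, n_folds=3, n_repeats=10):
    """Return (mean, std) of fold-averaged F1 over repeated random partitions.

    fit_predict(train_idx, test_idx) is an assumed callback that trains on the
    training frames and returns one 0/1 prediction per test frame.
    """
    scores = []
    for seed in range(n_repeats):
        fold_f1 = []
        for held_out in subject_exclusive_folds(subject_ids, n_folds, seed):
            test = [i for i, s in enumerate(subject_ids) if s in held_out]
            train = [i for i, s in enumerate(subject_ids) if s not in held_out]
            preds = fit_predict(train, test)
            fold_f1.append(f1([labels[i] for i in test], preds))
        scores.append(statistics.mean(fold_f1))
    return statistics.mean(scores), statistics.stdev(scores)


def leave_one_dataset_out(dataset_ids):
    """Yield (held_out_dataset, train_idx, test_idx); no random partitioning involved."""
    for held_out in sorted(set(dataset_ids)):
        test = [i for i, d in enumerate(dataset_ids) if d == held_out]
        train = [i for i, d in enumerate(dataset_ids) if d != held_out]
        yield held_out, train, test
```

In this sketch, a reported gain smaller than the standard deviation returned by `split_noise` would be hard to distinguish from a change of fold assignment, whereas `leave_one_dataset_out` has no seed to vary, so its splits are fixed by the choice of datasets alone.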

Saurabh Hinduja, Gurmeet Kaur, Maneesh Bilalpur, Jeffrey Cohn, Shaun Canavan • 2026

Related benchmarks

Task	Dataset	Result
Action Unit Detection	BP4D+	--	22
Action Unit Detection	BP4D	F1 (AU1) 53.7	1
Action Unit Detection	BP4D+	F1 (AU1) 36.8	1
Action Unit Detection	GFT	F1 (AU1) 22.1	1
Action Unit Detection	UNBC	F1 (AU4) 9.5	1
