Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization

About

We propose HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training), a cross-attentive multimodal framework for learning document-level audio-text representations from long, segmented sequences in low-resource data settings. HILBERT leverages frozen pre-trained speech and language encoders to extract segment-level features, which are aggregated via cross-modal attention and self-attentive pooling to form modality-specific document representations and a joint cross-attentive embedding. To align modalities while preserving modality-specific structure under severe audio-text dimensional imbalance, we introduce a reciprocal dual contrastive objective that simultaneously aligns audio-to-joint and text-to-joint representations, rather than directly contrasting audio and text alone. Two auxiliary regularizers further stabilize long-sequence fusion: a Centered Kernel Alignment (CKA) loss that preserves structural consistency between each modality and the joint embedding, and a mutual information balancing loss that prevents dominance of a single modality by equalizing information flow from audio and text into the joint space. For downstream prediction, HILBERT employs a Mixture-of-Experts (MoE) classifier over concatenated audio, text, and joint representations to accommodate heterogeneous label regimes. Extensive evaluation across multiple audio-text backbone combinations demonstrates that HILBERT learns semantically meaningful long-sequence representations and achieves superior performance on highly imbalanced multi-class settings.

Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin• 2026

Related benchmarks

Task	Dataset	Result
Cognitive and Psychological spectrum prediction	offspring data 25 fold (val)	Spectrum (AUC)67.33	25
Cognitive and Psychological tasks	Parent data (25-fold cross-val)	Spectrum Score66.75	25
Document-level tasks	Parent data (25-fold cross-val)	Affect80.34	25
Document-level traits prediction	offspring data 25 fold cross-validation	Affect AUC83.85	25

Showing 4 of 4 rows

Other info

Follow for update

@wizwand_team Discord