Joint-Centric Dual Contrastive Alignment with Structure-Preserving and Information-Balanced Regularization

About

We propose HILBERT (HIerarchical Long-sequence Balanced Embedding with Reciprocal contrastive Training), a cross-attentive multimodal framework for learning document-level audio-text representations from long, segmented sequences in low-resource data settings. HILBERT leverages frozen pre-trained speech and language encoders to extract segment-level features, which are aggregated via cross-modal attention and self-attentive pooling to form modality-specific document representations and a joint cross-attentive embedding. To align modalities while preserving modality-specific structure under severe audio-text dimensional imbalance, we introduce a reciprocal dual contrastive objective that simultaneously aligns audio-to-joint and text-to-joint representations, rather than directly contrasting audio and text alone. Two auxiliary regularizers further stabilize long-sequence fusion: a Centered Kernel Alignment (CKA) loss that preserves structural consistency between each modality and the joint embedding, and a mutual information balancing loss that prevents dominance of a single modality by equalizing information flow from audio and text into the joint space. For downstream prediction, HILBERT employs a Mixture-of-Experts (MoE) classifier over concatenated audio, text, and joint representations to accommodate heterogeneous label regimes. Extensive evaluation across multiple audio-text backbone combinations demonstrates that HILBERT learns semantically meaningful long-sequence representations and achieves superior performance on highly imbalanced multi-class settings.

Habibeh Naderi, Behrouz Haji Soleimani, Stan Matwin • 2026
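
As a rough illustration of the alignment terms described in the abstract above, the following NumPy sketch pairs a joint-centric dual contrastive loss (audio-to-joint plus text-to-joint) with a CKA-based structure-preservation penalty. It is a sketch under stated assumptions, not the authors' implementation: the abstract does not give the exact formulations, so the symmetric InfoNCE form, the linear-kernel variant of CKA, the temperature of 0.07, and all function names (`info_nce`, `dual_contrastive_loss`, `linear_cka`, `cka_loss`) are hypothetical choices. The contrastive term also assumes the audio, text, and joint embeddings have already been projected to a shared dimension, whereas linear CKA accepts inputs of different widths. The mutual information balancing loss and the MoE classifier are left out because the abstract does not specify how they are computed.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE (assumed form): row i of `a` and row i of `b` are positives."""
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    a_to_b = -np.diag(log_softmax(logits, axis=1)).mean()
    b_to_a = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (a_to_b + b_to_a)


def dual_contrastive_loss(audio, text, joint, temperature=0.07):
    """Align each modality with the joint embedding instead of audio with text."""
    return info_nce(audio, joint, temperature) + info_nce(text, joint, temperature)


def linear_cka(x, y):
    """Linear CKA between two (n_samples, dim) matrices; dims may differ."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    scale = np.linalg.norm(x.T @ x, ord="fro") * np.linalg.norm(y.T @ y, ord="fro")
    return cross / scale


def cka_loss(audio, text, joint):
    """Penalize structural mismatch between each modality and the joint space."""
    return (1.0 - linear_cka(audio, joint)) + (1.0 - linear_cka(text, joint))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy batch of 8 documents with 16-dimensional projected embeddings.
    audio = rng.normal(size=(8, 16))
    text = rng.normal(size=(8, 16))
    joint = 0.5 * (audio + text)  # stand-in for the cross-attentive embedding
    print("dual contrastive:", dual_contrastive_loss(audio, text, joint))
    print("CKA penalty:", cka_loss(audio, text, joint))
```

Contrasting each modality against the joint embedding, rather than audio against text directly, means neither modality has to be mapped onto the other. In a training loop the two terms would typically be combined as a weighted sum, with the weights treated as hyperparameters.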

Related benchmarks

Task | Dataset | Result | Rank
Cognitive and Psychological spectrum prediction | offspring data 25 fold (val) | Spectrum (AUC): 67.33 | 25
Cognitive and Psychological tasks | Parent data (25-fold cross-val) | Spectrum Score: 66.75 | 25
Document-level tasks | Parent data (25-fold cross-val) | Affect: 80.34 | 25
Document-level traits prediction | offspring data 25 fold cross-validation | Affect AUC: 83.85 | 25
