
Unified Multimodal Uncertain Inference

About

We introduce Unified Multimodal Uncertain Inference (UMUI), a multimodal inference task spanning text, audio, and video, where models must produce calibrated probability estimates of hypotheses conditioned on a premise in any modality or combination. While uncertain inference has been explored in text, extension to other modalities has been limited to single-modality binary entailment judgments, leaving no framework for fine-grained probabilistic reasoning in or across other modalities. To address this, we curate a human-annotated evaluation set with scalar probability judgments across audio, visual, and audiovisual settings, and additionally evaluate on existing text and audio benchmarks. We introduce CLUE (Calibrated Latent Uncertainty Estimation), which combines self-consistent teacher calibration and distribution-based confidence probing to produce calibrated predictions. We demonstrate that our 3B-parameter model achieves equivalent or stronger performance than baselines up to 32B parameters across all modalities.
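The scalar-judgment results below are reported as mean squared error between model probabilities and human annotations. A minimal sketch of that metric (the function name and the toy values are illustrative, not from the paper):

```python
def mse(predicted, human):
    """Mean squared error between predicted probabilities and
    human scalar annotations (lower is better)."""
    assert len(predicted) == len(human) and len(predicted) > 0
    return sum((p - h) ** 2 for p, h in zip(predicted, human)) / len(predicted)

# Toy example: four premise-hypothesis pairs with hypothetical
# model probabilities and human judgments.
preds = [0.92, 0.10, 0.55, 0.71]
gold = [0.95, 0.05, 0.60, 0.80]
print(round(mse(preds, gold), 4))  # 0.0035
```

A well-calibrated model drives this toward zero by matching the full distribution of human probability judgments, not just the binary entailed/not-entailed decision.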

Dengjia Zhang, Alexander Martin, William Jurayj, Kenton Murray, Benjamin Van Durme, Reno Kriz • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Scalar probability judgment | UNLI | MSE | 0.057 | 9 |
| UMUI Judgment Calibration | UNLI (full) | MSE | 0.0573 | 9 |
| UMUI Judgment Calibration | WIKIVIDEO (V) | MSE | 0.0784 | 8 |
| Binary Judgment | Clotho UMUI-Binary | Accuracy | 97.5 | 5 |
| Binary Judgment | WikiVideo Audio-only | Accuracy | 71.5 | 5 |
| Scalar probability judgment | WikiVideo Vision-only | MSE | 0.078 | 5 |
| Scalar probability judgment | WikiVideo Audio-only | MSE (×100) | 3.4 | 5 |
| UMUI Judgment Calibration | WIKIVIDEO (A) | MSE | 0.0335 | 5 |
| Binary Judgment | WikiVideo Vision-only | Accuracy | 74.6 | 5 |
| Binary Judgment | WikiVideo Audio-Visual | Accuracy | 70.1 | 3 |

Showing 10 of 11 rows.
