Functional-level Uncertainty Quantification for Calibrated Fine-tuning on LLMs
About
Accurate uncertainty quantification in large language models (LLMs) is essential for reliable confidence estimation, yet fine-tuned LLMs often become overconfident under limited adaptation data. Existing uncertainty methods for PEFT-based LLMs are largely post hoc, estimating uncertainty after fine-tuning rather than improving how adapters specialize to task-specific input-output relationships. We propose Functional-Level Uncertainty Quantification for Calibrated Fine-Tuning (UQ4CT), which calibrates uncertainty over the functional space induced by prompt-dependent mixtures of LoRA experts. UQ4CT implements this perspective through a mixture-of-experts fine-tuning framework, where a calibration loss aligns functional-level confidence with predictive correctness during training. Across four multiple-choice benchmarks and two open-ended generative QA tasks, UQ4CT reduces Expected Calibration Error (ECE) by over $25\%$ while preserving high accuracy. Under distribution shift, UQ4CT maintains superior calibration and competitive accuracy, demonstrating improved reliability and generalization for fine-tuned LLMs.
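The headline metric above, Expected Calibration Error, can be illustrated with a short sketch. This is the common equal-width-bin formulation, not necessarily the exact evaluation code used for UQ4CT: the function name, the default of 10 bins, and the binning rule are assumptions, since the abstract does not specify them.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (illustrative; bin count and binning are assumptions).

    confidences: predicted confidence per example, each in [0, 1]
    correct:     1 if the prediction was right, 0 otherwise
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one example is required")

    conf_sum = [0.0] * n_bins   # sum of confidences per bin
    hit_sum = [0.0] * n_bins    # number of correct predictions per bin
    count = [0] * n_bins        # number of examples per bin

    for c, y in zip(confidences, correct):
        # Bin k covers [k/n_bins, (k+1)/n_bins); a confidence of 1.0 goes to the last bin.
        k = min(int(c * n_bins), n_bins - 1)
        conf_sum[k] += c
        hit_sum[k] += y
        count[k] += 1

    ece = 0.0
    for k in range(n_bins):
        if count[k] == 0:
            continue
        avg_conf = conf_sum[k] / count[k]
        accuracy = hit_sum[k] / count[k]
        # Weight each bin's |accuracy - confidence| gap by its share of examples.
        ece += (count[k] / n) * abs(accuracy - avg_conf)
    return ece
```

A model that reports 0.9 confidence on four answers but gets only three right has a gap of 0.15 in that bin. An overconfident fine-tuned model shows up as large gaps of this kind in the high-confidence bins.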
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Commonsense Reasoning | ARC-C | Accuracy | 79.601 | 215 |
| Commonsense Reasoning | OBQA | Accuracy | 88.4 | 187 |
| Commonsense Reasoning | ARC-E | Accuracy | 88.66 | 152 |
| Open-ended generation | TriviaQA | ECE | 6.63 | 37 |
| Multiple-choice Question Answering | ARC-C | Accuracy | 79 | 28 |
| Multiple-choice Question Answering | ARC-E | Accuracy | 87.8 | 16 |
| Domain-specific Reasoning | ClimateQA | Accuracy (ACC) | 79.97 | 9 |
| Multiple-choice Question Answering | OBQA | Accuracy | 0.884 | 8 |
| Multiple-choice Question Answering | ENG | Accuracy | 61.13 | 8 |
| Multiple-choice Question Answering | Law | Accuracy | 45.4 | 8 |