
Probabilistic Calibration Is a Trainable Capability in Language Models

About

Language models are increasingly used in settings where outputs must satisfy user-specified randomness constraints, yet their generation probabilities are often poorly calibrated to those targets. We study whether this capability can be improved directly through fine-tuning. Concretely, we fine-tune language models on synthetic prompts that require sampling from mathematical distributions, and compare two Calibration Fine-Tuning variants: a soft-target method that converts the desired output distribution into trie-derived next-token targets, and a hard-target method that trains on sampled completions from the same target distribution. Across 12 models spanning four families, both methods substantially improve structured-sampling fidelity on held-out distribution families and unseen parameter settings, showing that probabilistic calibration is a trainable capability. Under our selected training configurations, the two methods exhibit different empirical profiles: hard-target fine-tuning is often strongest on structured numeric sampling, while soft-target fine-tuning performs better on broader stochastic generation benchmarks, including open-ended random generation, multiple-choice answer-position balancing, and NoveltyBench. The gains sometimes reduce downstream capability, especially arithmetic reasoning, with costs varying by model. Overall, our results show that probabilistic calibration can be improved through fine-tuning, with our hard-target configuration favoring exact numeric fidelity and our soft-target configuration favoring broader stochastic transfer. Code is available at https://github.com/chandar-lab/calibration-finetuning.
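The difference between the two variants can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it only shows, under simplifying assumptions, how a target distribution over output strings could be turned into per-prefix next-token targets via a trie (soft targets) versus sampled completions (hard targets). The function names `trie_next_token_targets` and `sample_hard_targets`, the character-level tokenizer, and the `<eos>` marker are all hypothetical stand-ins chosen for illustration.

```python
import random
from collections import defaultdict

def trie_next_token_targets(dist, tokenize=list, eos="<eos>"):
    """Soft targets: map each token prefix to a next-token distribution.

    `dist` maps complete output strings to probabilities. Character-level
    tokenization (tokenize=list) is an assumption; a real model would use
    its own tokenizer.
    """
    # mass[prefix][token] = total probability of outputs that continue
    # `prefix` with `token` (the trie is stored implicitly by prefix).
    mass = defaultdict(lambda: defaultdict(float))
    for text, p in dist.items():
        tokens = tokenize(text) + [eos]
        for i, tok in enumerate(tokens):
            mass[tuple(tokens[:i])][tok] += p
    # Normalize each node's children into a conditional distribution.
    targets = {}
    for prefix, children in mass.items():
        total = sum(children.values())
        targets[prefix] = {tok: m / total for tok, m in children.items()}
    return targets

def sample_hard_targets(dist, n, seed=0):
    """Hard targets: draw n complete outputs from the same distribution."""
    rng = random.Random(seed)
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=n)

# Toy target distribution over numeric strings (illustrative values only).
target = {"1": 0.5, "10": 0.25, "12": 0.25}
soft = trie_next_token_targets(target)
hard = sample_hard_targets(target, n=4)
print(soft[("1",)])  # conditional distribution after the prefix "1"
print(hard)          # four sampled training completions
```

In this toy case every output starts with "1", so the soft target at the empty prefix puts all mass on "1", while the target after "1" splits between stopping and continuing with "0" or "2". A hard-target dataset conveys the same distribution only in aggregate, through the frequencies of the sampled completions.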

Davide Baldelli, Sruthi Kuriakose, Maryam Hashemzadeh, Amal Zouaq, Sarath Chandar • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Patience-discounted reward evaluation | NOVELTYBENCH | Utility | 4.096 | 36
Random-generation diversity evaluation | Open-ended random generation | Top-90% Support Size | 864.5 | 36
Structured Distribution Sampling | OOD distribution families Bernoulli, Poisson, Maxwell, TruncNorm, Chi, and Weibull (Held-out) | OOD W1 | 0.0707 | 36
Structured Distribution Sampling | Seen distribution families, unseen parameter settings | Logit KL Divergence | 0.45 | 36
Answer-position balance evaluation | MCQ | MCQ TV | 0.061 | 34