
Functional Entropy: Predicting Functional Correctness in LLM-Generated Code with Uncertainty Quantification

About

Large language models have shown impressive capabilities in code generation, yet they often produce functionally incorrect code. Uncertainty quantification (UQ) methods have emerged as a promising approach for detecting hallucinations in natural language generation, but their effectiveness for code generation tasks remains underexplored. We systematically evaluate how UQ techniques transfer to code generation across three programming languages, five LLMs, and over 1,700 problems. We find that some token-probability-based methods generalize effectively without modification, while sampling-based methods relying on natural language inference (NLI) fail because NLI models cannot distinguish functionally different code, causing most responses to collapse into a single semantic cluster. To address this, we introduce functional equivalence methods, a family of code-specific methods that replace NLI-based semantic equivalence with an LLM-based functional equivalence assessment, including functional entropy, a code-specific analog of semantic entropy. Functional equivalence methods achieve top AUROC in 11 out of 15 model-benchmark combinations and the best calibration across most settings, consistently outperforming both NLI-based counterparts and all other methods evaluated.
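The idea behind functional entropy can be illustrated with a minimal sketch: sample several candidate programs, group them into clusters of functionally equivalent code, and take the entropy of the cluster frequencies. The function names below (`cluster_by_equivalence`, `functional_entropy`, `toy_judge`) are hypothetical, and the exact formulation used by the authors may differ. In particular, the abstract describes an LLM-based equivalence assessment; `toy_judge` is only a stand-in that compares outputs on a few probe inputs so the example runs without a model.

```python
import math
from typing import Callable, List, Sequence


def cluster_by_equivalence(
    samples: Sequence[str],
    are_equivalent: Callable[[str, str], bool],
) -> List[List[str]]:
    """Greedily group samples; each sample joins the first cluster
    whose representative (first member) is judged equivalent to it."""
    clusters: List[List[str]] = []
    for sample in samples:
        for cluster in clusters:
            if are_equivalent(cluster[0], sample):
                cluster.append(sample)
                break
        else:
            clusters.append([sample])
    return clusters


def functional_entropy(
    samples: Sequence[str],
    are_equivalent: Callable[[str, str], bool],
) -> float:
    """Entropy (in nats) of the empirical distribution over clusters."""
    clusters = cluster_by_equivalence(samples, are_equivalent)
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def toy_judge(code_a: str, code_b: str, probes=(0, 1, 2, 5)) -> bool:
    """ASSUMPTION: stand-in for the LLM-based judge. Runs both snippets
    (each must define `f`) on probe inputs and compares the outputs."""
    def run(code: str):
        namespace: dict = {}
        exec(code, namespace)  # only safe for trusted toy snippets
        return [namespace["f"](x) for x in probes]
    return run(code_a) == run(code_b)


samples = [
    "def f(x): return x * 2",
    "def f(x): return x + x",
    "def f(x): return x ** 2",  # functionally different from the others
    "def f(x): return 2 * x",
]
entropy = functional_entropy(samples, toy_judge)
```

Here three of the four samples fall into one cluster and the fourth forms its own, so the entropy is positive; if every sample were functionally equivalent, it would be zero. Higher entropy signals more disagreement among samples and therefore more uncertainty about correctness.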

Dylan Bouchard, Mohit Singh Chauhan, Zeya Ahmad, Ho-Kyeong Ra • 2026

Related benchmarks

Task                             Dataset                 Result              Rank
Code Correctness Prediction      LiveCodeBench Python    AUROC 86.7          60
Code Correctness Prediction      LiveCodeBench Python    Brier Score 0.073   60
Code Correctness Prediction      MultiPL-E Java          AUROC 0.701         60
Predicting code correctness      LiveCodeBench Python    ECE 0.024           60
Code Correctness Prediction      MultiPL-E Java          Brier Score 0.243   60
Code Correctness Prediction      MultiPL-E Java          ECE 0.155           60
Code correctness classification  LiveSQLBench SQLite     AUROC 0.842         55
Predicting code correctness      LiveSQLBench SQLite     Brier Score 0.129   55