
ReFEree: Reference-Free and Fine-Grained Method for Evaluating Factual Consistency in Real-World Code Summarization

About

As Large Language Models (LLMs) have become capable of generating long and descriptive code summaries, accurate and reliable evaluation of factual consistency has become a critical challenge. However, previous evaluation methods are primarily designed for short summaries of isolated code snippets. Consequently, they struggle to provide fine-grained evaluation of multi-sentence functionality descriptions and fail to accurately assess the dependency context commonly found in real-world code summaries. To address this, we propose ReFEree, a reference-free and fine-grained method for evaluating factual consistency in real-world code summaries. We define factual inconsistency criteria specific to code summaries and evaluate summaries at the segment level using these criteria along with dependency information. These segment-level results are then aggregated into a fine-grained score. We construct a code summarization benchmark with human-annotated factual consistency labels. The evaluation results demonstrate that ReFEree achieves the highest correlation with human judgment among 13 baselines, improving by 15-18% over the previous state of the art. Our code and data are available at https://github.com/bsy99615/ReFEree.git.
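The pipeline the abstract describes (judge each summary segment against the code and its dependency context, then aggregate the segment verdicts into one score) can be sketched as follows. The sentence-based segmentation, the toy identifier-matching judge, and the mean aggregation are illustrative assumptions, not ReFEree's actual algorithm.

```python
import re


def split_into_segments(summary: str) -> list[str]:
    """Naively split a multi-sentence summary into sentence segments."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]


def judge_segment(segment: str, code: str, dependencies: list[str]) -> bool:
    """Placeholder for a per-segment factual-consistency judgment.

    A real evaluator would check the segment against the code and its
    dependency context (e.g. with an LLM); this toy heuristic calls the
    segment consistent if every backticked identifier it mentions occurs
    in the code or its dependencies.
    """
    context = code + " " + " ".join(dependencies)
    identifiers = re.findall(r"`([^`]+)`", segment)
    return all(ident in context for ident in identifiers)


def fine_grained_score(summary: str, code: str, dependencies: list[str]) -> float:
    """Aggregate segment-level verdicts into a single score in [0, 1]."""
    segments = split_into_segments(summary)
    if not segments:
        return 0.0
    verdicts = [judge_segment(s, code, dependencies) for s in segments]
    return sum(verdicts) / len(verdicts)


code = "def add(a, b):\n    return a + b"
summary = (
    "Defines `add`, which returns the sum of two numbers. "
    "It also calls `log_result` after each addition."
)
print(fine_grained_score(summary, code, dependencies=[]))  # 0.5: one of two segments holds
```

The point of the segment-level design is visible even in this sketch: a single summary-level judgment would mark the whole summary inconsistent, while segment-level evaluation credits the correct first sentence and penalizes only the hallucinated second one.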

Suyoung Bae, CheolWon Na, Jaehoon Lee, Yumin Lee, YunSeok Choi, Jee-Hyong Lee• 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Code Summarization Factual Consistency | Python | Pearson Correlation (rp) 0.497 | 15 |
| Code Summarization Factual Consistency | Java | Pearson Correlation (rp) 0.515 | 15 |
| Factual Consistency Evaluation | 2,055 code summary evaluations | Time (s) 10.24 | 14 |
| Docstring Evaluation | DevEval 183 human-written docstrings | Score 0.938 | 5 |
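The Pearson values (rp) in the benchmark table measure how well a metric's scores track human factual-consistency labels. A minimal sketch of that computation, with made-up scores and labels purely for illustration:

```python
import math


def pearson_r(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


metric_scores = [0.9, 0.4, 0.7, 0.2, 1.0]  # hypothetical evaluator outputs
human_labels = [1.0, 0.0, 1.0, 0.0, 1.0]   # hypothetical human annotations
print(round(pearson_r(metric_scores, human_labels), 3))  # ≈ 0.923
```

A higher rp means the metric's rankings agree more closely with human judgment, which is why the table reports it per dataset rather than a raw accuracy.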
