TRACT: Regression-Aware Fine-tuning Meets Chain-of-Thought Reasoning for LLM-as-a-Judge

About

The LLM-as-a-judge paradigm uses large language models (LLMs) for automated text evaluation: an LLM assigns a numerical score to the input text according to a scoring rubric. Existing methods for LLM-as-a-judge fine-tune with cross-entropy (CE) loss, which neglects the numeric nature of score prediction. Recent work addresses this limitation through regression-aware fine-tuning, but it does not consider chain-of-thought (CoT) reasoning for score prediction. In this paper, we introduce TRACT (Two-stage Regression-Aware fine-tuning with CoT), a method that combines CoT reasoning with regression-aware training. TRACT consists of two stages: first, a seed LLM is fine-tuned to generate CoTs; these CoTs then serve as supervision for the second-stage fine-tuning. The training objective of TRACT combines the CE loss, for learning CoT reasoning, with the regression-aware loss, for score prediction. Experiments across four LLM-as-a-judge datasets and two LLMs show that TRACT significantly outperforms existing methods. Extensive ablation studies validate the importance of each component in TRACT.
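
To make the combined objective more concrete, the following is a minimal sketch, not the authors' implementation. The abstract states only that a CE loss on the CoT is combined with a regression-aware loss on the score. The squared-error form, the use of the expected score under the model's distribution over score tokens, the `weight` parameter, and all function names below are assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cot_cross_entropy(token_probs):
    """Mean negative log-likelihood of the reference CoT tokens.

    token_probs: probability the model assigns to each reference token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def expected_score(score_logits, score_values):
    """Expected numeric score under the distribution over score tokens."""
    probs = softmax(score_logits)
    return sum(p * v for p, v in zip(probs, score_values))


def regression_aware_loss(score_logits, score_values, target):
    """Squared error between expected score and target (assumed form)."""
    return (expected_score(score_logits, score_values) - target) ** 2


def combined_loss(cot_token_probs, score_logits, score_values, target, weight=1.0):
    """CE on the CoT plus a weighted regression-aware term (hypothetical weighting)."""
    ce = cot_cross_entropy(cot_token_probs)
    reg = regression_aware_loss(score_logits, score_values, target)
    return ce + weight * reg
```

For example, with uniform logits over the scores 1 to 5 the expected score is 3, so a target of 5 gives a regression-aware term of 4 regardless of which single score token is most likely. Unlike CE on the score token alone, this term penalizes predictions in proportion to their numeric distance from the target.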

Cheng-Han Chiang, Hung-yi Lee, Michal Lukasik • 2025

Related benchmarks

Task                           Dataset                                               Result                         Rank
Reward Modeling                RewardBench v1.0 (test)                               Average Score 0.748            89
LLM-as-a-judge evaluation      FB Bench (Feedback Bench)                             Pearson's r 0.949              36
LLM-as-a-judge evaluation      MT-Bench                                              Pearson's r 0.672              36
LLM-as-a-judge evaluation      FLASK                                                 Pearson's r 0.518              36
LLM-as-a-judge evaluation      Average Across FB Bench, FLASK, Vic. Bench, MT Bench  Pearson (r) 63.2               20
LLM-as-a-judge evaluation      Vicuna benchmark                                      Pearson Correlation (r) 56.2   20
LLM-as-a-judge evaluation      Vicuna-bench                                          Pearson Correlation (r) 0.605  16
Feedback Evaluation Alignment  MT-Bench                                              Kendall's Tau 0.494            11
Feedback Evaluation Alignment  Vicuna-bench                                          Kendall's Tau 0.423            6
Feedback Evaluation Alignment  Feedback Bench                                        Kendall's Tau 82               6
Showing 10 of 14 rows
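
For readers unfamiliar with the two correlation metrics reported above, the following is a minimal pure-Python sketch of how each can be computed between judge scores and reference scores. The function names are hypothetical, and the Kendall variant shown is tau-a (no tie correction); the listing does not say which variant the benchmarks use.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def kendall_tau_a(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total pairs.

    Tied pairs count as neither concordant nor discordant (assumed variant).
    """
    concordant = 0
    discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)
```

Pearson's r measures linear agreement between predicted and reference scores, while Kendall's tau depends only on how the two lists rank pairs of items.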
