BoRP: Bootstrapped Regression Probing for Scalable and Human-Aligned LLM Evaluation

About

Accurate evaluation of user satisfaction is critical for iterative development of conversational AI. However, for open-ended assistants, traditional A/B testing lacks reliable metrics: explicit feedback is sparse, while implicit metrics are ambiguous. To bridge this gap, we introduce BoRP (Bootstrapped Regression Probing), a scalable framework for high-fidelity satisfaction evaluation. Unlike generative approaches, BoRP leverages the geometric properties of LLM latent space. It employs a polarization-index-based bootstrapping mechanism to automate rubric generation and utilizes Partial Least Squares (PLS) to map hidden states to continuous scores. Experiments on industrial datasets show that BoRP (Qwen3-8B/14B) significantly outperforms generative baselines (even Qwen3-Max) in alignment with human judgments. Furthermore, BoRP reduces inference costs by orders of magnitude, enabling full-scale monitoring and highly sensitive A/B testing via CUPED.
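The CUPED step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `cuped_adjust`, the synthetic scores, and the choice of a pre-experiment satisfaction score as the covariate are all assumptions made for illustration. The sketch uses the standard CUPED adjustment, Y_adj = Y - theta * (X - mean(X)) with theta = cov(X, Y) / var(X), which keeps the metric's mean unchanged while reducing its variance.

```python
import random
from statistics import fmean, variance


def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted metric values given a pre-experiment covariate."""
    mean_x = fmean(covariate)
    mean_y = fmean(metric)
    # Sample covariance between the covariate and the metric.
    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric)
    ) / (len(metric) - 1)
    theta = cov_xy / variance(covariate)
    # Subtract the part of the metric explained by the covariate.
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]


# Hypothetical data: per-user satisfaction scores before and during a test.
rng = random.Random(0)
pre_scores = [rng.gauss(3.0, 1.0) for _ in range(2000)]
in_scores = [0.8 * x + rng.gauss(0.5, 0.5) for x in pre_scores]

adjusted = cuped_adjust(in_scores, pre_scores)
print("variance before:", variance(in_scores))
print("variance after: ", variance(adjusted))
```

In a real A/B test, theta would typically be estimated on the pooled data from both arms and the adjusted metric compared between arms. The lower variance is what makes small differences in a satisfaction score easier to detect.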

Peng Sun, Xiangyu Zhang, Duan Wu • 2026

Related benchmarks

Task                         Dataset             Result          Rank
Helpful Response Evaluation  HelpSteer-2         --              7
LLM Evaluation Efficiency    100k sessions       Task Cost: 4    4
User Acceptance Scoring      Industrial Dataset  K-alpha: 0.796  4