*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation

About

Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.

Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu, Arnaud Delhay, Damien Lolive • 2026
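
The idea of scoring "Yes/No" answers without generating text can be illustrated with a small sketch. It assumes that the judge model exposes log-probabilities for the first answer token and that the score is the log-odds of "Yes" against "No". Both points are assumptions made for illustration; the exact *-PLUIE formulation is not given on this page. The names `yes_no_score` and `yes_probability`, and the example log-probabilities, are hypothetical.

```python
import math


def yes_no_score(next_token_logprobs, yes="Yes", no="No"):
    """Return the log-odds of answering `yes` rather than `no`.

    `next_token_logprobs` maps candidate answer tokens to log-probabilities.
    A positive score means the judge leans towards `yes`.
    """
    return next_token_logprobs[yes] - next_token_logprobs[no]


def yes_probability(score):
    """Turn the log-odds into P(yes), renormalised over the two answers only."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical log-probabilities for the first answer token (not real model output).
logprobs = {"Yes": math.log(0.6), "No": math.log(0.2), "Maybe": math.log(0.2)}

score = yes_no_score(logprobs)
print(score, yes_probability(score))
```

Because only one forward pass is needed to read these log-probabilities, no answer text has to be decoded or parsed afterwards, which is the cost and post-processing saving the abstract points to.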

Related benchmarks

Task	Dataset	Result
Scientific Text Revision	Scientific Text Revision	Pairwise Accuracy: 62	21
Nile Translation	Nile Translation	Accuracy: 82	17
Paraphrase Classification	Paraphrase Classification	Accuracy: 75	17
Nile Translation	Nile	Pairwise Accuracy: 72	15

