
*-PLUIE: Personalisable metric with Llm Used for Improved Evaluation

About

Evaluating the quality of automatically generated text often relies on LLM-as-a-judge (LLM-judge) methods. While effective, these approaches are computationally expensive and require post-processing. To address these limitations, we build upon ParaPLUIE, a perplexity-based LLM-judge metric that estimates confidence over "Yes/No" answers without generating text. We introduce *-PLUIE, task-specific prompting variants of ParaPLUIE, and evaluate their alignment with human judgement. Our experiments show that personalised *-PLUIE achieves stronger correlations with human ratings while maintaining low computational cost.
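
To illustrate what estimating confidence over "Yes/No" answers without generating text can look like, here is a minimal sketch. It assumes the score is the log-odds of the "Yes" token over the "No" token at the answer position, computed from the model's next-token logits. The function names and the toy logits are hypothetical and are not taken from the authors' code; the exact ParaPLUIE formulation and prompts may differ.

```python
import math


def log_softmax(logits):
    """Convert a dict of token -> raw logit into token -> log-probability."""
    m = max(logits.values())  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - log_z for tok, v in logits.items()}


def yes_no_score(next_token_logits, yes="Yes", no="No"):
    """Log-odds of answering `yes` rather than `no` (assumed scoring rule).

    A positive value means the model leans towards "Yes",
    a negative value means it leans towards "No".
    """
    logp = log_softmax(next_token_logits)
    return logp[yes] - logp[no]


# Hypothetical logits for the token that would follow a judge prompt such as
# "Are these two sentences paraphrases? Answer Yes or No."
toy_logits = {"Yes": 3.2, "No": 1.1, "Maybe": 0.3, "The": -1.0}

score = yes_no_score(toy_logits)
print(f"yes/no log-odds: {score:.3f}")
```

Because only one forward pass over the prompt is needed and no answer text is decoded, a score of this kind needs no post-processing of generated output, which is the cost advantage the abstract describes.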

Quentin Lemesle, Léane Jourdan, Daisy Munson, Pierre Alain, Jonathan Chevelu, Arnaud Delhay, Damien Lolive • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Scientific Text Revision | Scientific Text Revision | Pairwise Accuracy: 62 | 21 |
| Nile Translation | Nile Translation | Accuracy: 82 | 17 |
| Paraphrase Classification | Paraphrase Classification | Accuracy: 75 | 17 |
| Nile Translation | Nile | Pairwise Accuracy: 72 | 15 |
