An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment
About
Recent studies have used both automatic metrics and human evaluations to assess the simplification abilities of LLMs. However, the suitability of existing evaluation methodologies for LLMs remains in question. First, it is unclear whether current automatic metrics are suitable for evaluating LLM-generated simplifications. Second, current human evaluation approaches in sentence simplification often fall into one of two extremes: they are either too superficial, failing to offer a clear picture of a model's performance, or overly detailed, making the annotation process complex and prone to inconsistency, which in turn undermines the evaluation's reliability. To address these problems, this study provides in-depth insights into LLMs' performance while ensuring the reliability of the evaluation. We design an error-based human annotation framework to assess LLMs' simplification capabilities. We select both closed-source and open-source LLMs, including GPT-4, Qwen2.5-72B, and Llama-3.2-3B, which together offer a representative selection of large, medium, and small models. Results show that LLMs generally produce fewer erroneous simplifications than the previous state-of-the-art; however, they have their own limitations, as seen in GPT-4's and Qwen2.5-72B's struggles with lexical paraphrasing. Furthermore, we conduct meta-evaluations of widely used automatic metrics using our human annotations. We find that these metrics lack the sensitivity to assess overall high-quality simplifications, particularly those generated by high-performing LLMs.
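As a rough illustration of how such a meta-evaluation can be run, the sketch below correlates per-output automatic metric scores with human-annotated error counts using Spearman's rank correlation. The data and variable names are illustrative assumptions, not the paper's actual annotations or setup.

```python
# Minimal meta-evaluation sketch: does an automatic metric track human error
# annotations? (Illustrative data only; not the paper's annotations.)
from scipy.stats import spearmanr

# Hypothetical per-output values: a sensitive metric should assign higher
# scores to outputs in which annotators found fewer errors.
metric_scores = [28.6, 41.2, 35.0, 44.8, 39.1]   # e.g., SARI per output
human_error_counts = [5, 1, 3, 0, 2]             # errors found by annotators

# A strongly negative rank correlation indicates the metric tracks
# human-identified errors; a weak one suggests low sensitivity.
rho, p_value = spearmanr(metric_scores, human_error_counts)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```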
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Sentence Simplification (Lexical Paraphrasing) | SimPA (out-of-domain) | SARI | 28.6 | 9 |
| Sentence Simplification (Overall Rewriting) | SimPA (out-of-domain) | LENS | 59.7 | 9 |
| Sentence Simplification (Overall Rewriting) | Newsela (out-of-domain) | LENS | 60.9 | 9 |
| Lexical Paraphrasing | TURK (test) | Mean Score | 3.65 | 3 |
| Overall Rewriting | ASSET (test) | Mean Score | 3.93 | 3 |
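For reference, SARI scores like those above can be reproduced with the Hugging Face `evaluate` library, as in the minimal sketch below. The example sentences are illustrative and not drawn from SimPA, TURK, or ASSET.

```python
# Minimal sketch of computing SARI with the Hugging Face `evaluate` library.
import evaluate

sari = evaluate.load("sari")

# One source sentence, one system simplification, and its reference set
# (illustrative example, not from the benchmark datasets above).
sources = ["About 95 species are currently accepted."]
predictions = ["About 95 species are currently known."]
references = [[
    "About 95 species are currently known.",
    "About 95 species are now accepted.",
    "95 species are now accepted.",
]]

# SARI compares the prediction against both the source and the references,
# rewarding well-chosen added, deleted, and kept n-grams (0-100 scale).
result = sari.compute(sources=sources, predictions=predictions, references=references)
print(result)  # {'sari': ...}
```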