Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction
About
Aspect sentiment quad prediction (ASQP) facilitates a detailed understanding of opinions expressed in a text by identifying the opinion term, aspect term, aspect category and sentiment polarity for each opinion. However, annotating a full set of training examples to fine-tune models for ASQP is a resource-intensive process. In this study, we explore the capabilities of large language models (LLMs) for zero- and few-shot learning on the ASQP task across five diverse datasets. We report F1 scores almost up to par with those obtained with state-of-the-art fine-tuned models and exceeding previously reported zero- and few-shot performance. In the 20-shot setting on the Rest16 restaurant domain dataset, LLMs achieved an F1 score of 51.54, compared to 60.39 by the best-performing fine-tuned method MVP. Additionally, we report the performance of LLMs in target aspect sentiment detection (TASD), where the F1 scores were close to fine-tuned models, achieving 68.93 on Rest16 in the 30-shot setting, compared to 72.76 with MVP. While human annotators remain essential for achieving optimal performance, LLMs can reduce the need for extensive manual annotation in ASQP tasks.
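Reported F1 scores for ASQP are typically computed over exact-match quads: a predicted (aspect term, aspect category, opinion term, sentiment polarity) tuple counts as correct only if all four elements match a gold quad. A minimal sketch of this micro-averaged evaluation (function and field names here are illustrative, not taken from the paper's released code):

```python
def quad_f1(gold, pred):
    """Micro-averaged F1 over exact-match sentiment quads.

    gold, pred: lists of sentences, each a list of
    (aspect_term, aspect_category, opinion_term, polarity) tuples.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        g_set, p_set = set(g), set(p)
        tp += len(g_set & p_set)   # predicted quads that exactly match gold
        fp += len(p_set - g_set)   # spurious predictions
        fn += len(g_set - p_set)   # missed gold quads
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


# One sentence with one gold quad; the model predicts it plus a spurious quad.
gold = [[("pizza", "food quality", "delicious", "positive")]]
pred = [[("pizza", "food quality", "delicious", "positive"),
         ("service", "service general", "slow", "negative")]]
print(round(quad_f1(gold, pred), 2))  # → 0.67 (precision 0.5, recall 1.0)
```

The strict exact-match criterion is why quad-level F1 scores (e.g. 51.54 on Rest16) run well below what per-element accuracy would suggest: a single wrong element invalidates the whole quad.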
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Target Aspect Sentiment Detection | Rest16 | F1 score 68.53 | 31 |
| Aspect Sentiment Quad Prediction | Rest15 | F1 score 41.74 | 21 |
| Aspect Sentiment Quad Prediction | Rest16 | F1 score 51.1 | 21 |
| Target Aspect Sentiment Detection | Rest15 | F1 score 62.12 | 21 |
| Target Aspect Sentiment Detection | FlightABSA | F1 score 64.6 | 9 |
| Aspect Sentiment Quad Prediction | FlightABSA | F1 score 48.37 | 9 |
| Target Aspect Sentiment Detection | Coursera | F1 score 41.69 | 6 |
| Target Aspect Sentiment Detection | Hotels | F1 score 56.51 | 6 |