$\texttt{COSMIC}$: Mutual Information for Task-Agnostic Summarization Evaluation
About
Assessing the quality of summarizers poses significant challenges. In response, we propose a novel task-oriented evaluation approach that assesses summarizers based on their capacity to produce summaries that are useful for downstream tasks, while preserving task outcomes. We theoretically establish a direct relationship between the resulting error probability of these tasks and the mutual information between source texts and generated summaries. We introduce $\texttt{COSMIC}$ as a practical implementation of this metric, demonstrating its strong correlation with human judgment-based metrics and its effectiveness in predicting downstream task performance. Comparative analyses against established metrics like $\texttt{BERTScore}$ and $\texttt{ROUGE}$ highlight the competitive performance of $\texttt{COSMIC}$.
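To make the central quantity concrete, the following is a minimal sketch of mutual information on a toy discrete problem. It is not the $\texttt{COSMIC}$ estimator itself, which operates on texts. It is a plug-in estimate from counts, and the function name `mutual_information` and the toy labels are assumptions introduced here for illustration only. It shows that a "summary" that preserves the source's label carries one bit about the source, while a constant summary carries none.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in bits from a list of (x, y) samples.

    Hypothetical helper for illustration; not part of any COSMIC release.
    """
    n = len(pairs)
    joint = Counter(pairs)                 # counts of (x, y)
    px = Counter(x for x, _ in pairs)      # marginal counts of x
    py = Counter(y for _, y in pairs)      # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n
        # p(x, y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += p_xy * math.log2(c * n / (px[x] * py[y]))
    return mi


# Toy "source" topics paired with toy "summaries" (assumed data).
informative = [("sports", "s1"), ("sports", "s1"),
               ("politics", "s2"), ("politics", "s2")]
uninformative = [("sports", "s0"), ("sports", "s0"),
                 ("politics", "s0"), ("politics", "s0")]

print(mutual_information(informative))    # summary determines the topic
print(mutual_information(uninformative))  # summary says nothing about the topic
```

In the toy setting, any downstream classifier that only sees the constant summary cannot do better than guessing the topic, which is the intuition behind linking mutual information to downstream error probability.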
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Summarization Evaluation | SummEval | Coherence | 23 | 41 |
| Paraphrase embedding | Paraphrase embedding | Correlation | 0.81 | 12 |
| Emotion Classification | Emotion | Correlation Coefficient | 0.56 | 12 |
| Policy classification | Policy classification | Correlation | 0.58 | 12 |
| GPT detection | GPT detector | Correlation | 0.59 | 12 |