
Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling

About

Rhetorical Role Labeling (RRL) assigns a functional role to each sentence in a document and is widely used in legal, medical, and scientific domains. While language models (LMs) achieve strong average performance, they remain unreliable on hard examples, where prediction confidence is low. Existing approaches typically handle uncertainty implicitly and treat labels as discrete identifiers, overlooking the semantic information encoded in label names. We introduce RISE, an inference-time semantic reranking framework that leverages label semantics to refine predictions on hard instances. RISE automatically identifies low-confidence predictions and reranks model outputs using contrastively learned label representations, without retraining or modifying the underlying model. Experiments on eight domain-specific RRL datasets with seven LMs, including encoder-based and causal architectures, show an average gain of +9.15 macro-F1 points on hard examples. For explainability, we further propose manual hardness annotations to study difficulty from both model and human perspectives, revealing moderate agreement between the two (Cohen's kappa = 0.40).
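The two-step idea in the abstract (flag low-confidence predictions, then rerank them with label representations) can be illustrated with a minimal sketch. This is not the RISE implementation: the abstract does not specify the confidence criterion, the candidate set, or the scoring rule, so the `threshold`, `top_k`, and `alpha` parameters, the cosine-similarity scoring, the function names, and the toy label names and vectors below are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rerank_if_hard(probs, sent_emb, label_embs, threshold=0.6, top_k=2, alpha=0.5):
    """Return (label, was_reranked).

    probs      : dict mapping label -> probability from the frozen model
    sent_emb   : sentence vector in the same space as the label vectors
    label_embs : dict mapping label -> label vector (assumed to be learned elsewhere)
    threshold, top_k, alpha : hypothetical knobs, not taken from the paper
    """
    ranked = sorted(probs, key=probs.get, reverse=True)
    if probs[ranked[0]] >= threshold:
        # Confident prediction: keep the model's output untouched.
        return ranked[0], False

    # Hard example: rescore only the top-k candidates, mixing model
    # probability with sentence-to-label similarity.
    def score(label):
        return alpha * probs[label] + (1 - alpha) * cosine(sent_emb, label_embs[label])

    return max(ranked[:top_k], key=score), True


# Toy data: label names and 2-d vectors are invented for this example.
label_embs = {"Facts": [1.0, 0.0], "Argument": [0.0, 1.0], "Ruling": [0.7, 0.7]}

easy = {"Facts": 0.90, "Argument": 0.06, "Ruling": 0.04}
hard = {"Facts": 0.45, "Argument": 0.40, "Ruling": 0.15}

print(rerank_if_hard(easy, [0.9, 0.1], label_embs))  # ('Facts', False)
print(rerank_if_hard(hard, [0.1, 0.9], label_embs))  # ('Argument', True)
```

In the second call the model's top probability (0.45) falls below the assumed threshold, so the two leading candidates are rescored and the label whose vector lies closest to the sentence vector wins. The underlying model is never retrained or modified, which matches the inference-time setting described above.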

Anas Belfathi, Nicolas Hernandez, Laura Monceaux, Warren Bonnard, Richard Dufour • 2026

Related benchmarks

Task                      Dataset                 Metric     Result   Rank
Rhetorical Role Labeling  SCOTUSRF (test)         mF1        72.13    20
Rhetorical Role Labeling  SCOTUSSteps (test)      mF1        54.18    20
Rhetorical Role Labeling  DEEPRHOLE (test)        mF1        48.89    20
Rhetorical Role Labeling  SCOTUSCategory (test)   Macro F1   85.03    14
Rhetorical Role Labeling  LEGALEVAL (test)        Micro-F1   0.6297   14
Rhetorical Role Labeling  PubMed (test)           mF1        82.61    14
Rhetorical Role Labeling  BIORC (test)            mF1        87.45    14
Rhetorical Role Labeling  CS-ABSTRACTS (test)     Micro F1   67.39    14