Semantic Reranking at Inference Time for Hard Examples in Rhetorical Role Labeling

About

Rhetorical Role Labeling (RRL) assigns a functional role to each sentence in a document and is widely used in legal, medical, and scientific domains. While language models (LMs) achieve strong average performance, they remain unreliable on hard examples, where prediction confidence is low. Existing approaches typically handle uncertainty implicitly and treat labels as discrete identifiers, overlooking the semantic information encoded in label names. We introduce RISE, an inference-time semantic reranking framework that leverages label semantics to refine predictions on hard instances. RISE automatically identifies low-confidence predictions and reranks model outputs using contrastively learned label representations, without retraining or modifying the underlying model. Experiments on eight domain-specific RRL datasets with seven LMs, including encoder-based and causal architectures, show an average gain of +9.15 macro-F1 points on hard examples. For explainability, we further propose manual hardness annotations to study difficulty from both model and human perspectives, revealing a moderate agreement with Cohen's kappa = 0.40.
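The two inference-time steps described above, flagging a low-confidence prediction and then reranking candidate labels by semantic similarity to label representations, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the hardness criterion, the scoring function, or how the contrastive label representations are trained. The names `rerank_if_hard`, `threshold`, and `top_k` are hypothetical. Treating the maximum softmax probability as confidence and using cosine similarity over precomputed vectors are both assumptions made for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def rerank_if_hard(logits, sentence_vec, label_vecs, threshold=0.6, top_k=2):
    """Return (label_index, was_reranked).

    Assumption: an example is "hard" when the top softmax probability
    falls below `threshold` (a hypothetical criterion and value).
    Hard examples are reranked among the model's `top_k` candidates by
    cosine similarity between the sentence vector and each label vector.
    The base model's logits are never modified.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        return best, False  # confident: keep the model's prediction

    # Hard example: restrict to the model's top-k labels, then rerank.
    candidates = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    reranked = max(candidates, key=lambda i: cosine(sentence_vec, label_vecs[i]))
    return reranked, True


# Toy usage with made-up 2-d vectors for three labels.
label_vecs = [[1.0, 0.0], [0.1, 0.9], [0.0, 1.0]]
hard = rerank_if_hard([2.0, 1.9, -1.0], [0.0, 1.0], label_vecs)
easy = rerank_if_hard([5.0, 0.0, 0.0], [0.0, 1.0], label_vecs)
```

In the toy usage, the first call is treated as hard because the top two probabilities are nearly tied, so the decision falls to label similarity; the second call is confident and is left unchanged. Restricting the rerank to the top-k candidates is a design choice of this sketch that keeps the base model's ranking as a prior.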

Anas Belfathi, Nicolas Hernandez, Laura Monceaux, Warren Bonnard, Richard Dufour • 2026

Related benchmarks

Task	Dataset	Result
Rhetorical Role Labeling	SCOTUSRF (test)	mF1 72.13	20
Rhetorical Role Labeling	SCOTUSSteps (test)	mF1 54.18	20
Rhetorical Role Labeling	DEEPRHOLE (test)	mF1 48.89	20
Rhetorical Role Labeling	SCOTUSCategory (test)	Macro F1 85.03	14
Rhetorical Role Labeling	LEGALEVAL (test)	Micro-F1 0.6297	14
Rhetorical Role Labeling	PubMed (test)	mF1 82.61	14
Rhetorical Role Labeling	BIORC (test)	mF1 87.45	14
Rhetorical Role Labeling	CS-ABSTRACTS (test)	Micro F1 67.39	14
