
Attribute-Based Diagnosis of LLM Alignment with Hate Speech Annotations

About

Hate speech annotation is costly, subjective, and prone to annotator disagreement, making large-scale dataset construction challenging. We systematically analyze how well large language models (LLMs) align with human judgments across ten theoretically grounded subjective attributes, such as dehumanization, violence, and sentiment, evaluating both small and large variants of Llama 3.1 and Qwen 2.5. Our analysis reveals a consistent split across all models: behaviorally explicit dimensions (insult, humiliate, attack-defend) correlate strongly with human annotations, while evaluative dimensions (respect, sentiment, hate speech) are systematically inverted. Demographic persona conditioning reduces model confidence without improving alignment. Building on these insights, we propose combining attribute-level LLM predictions via a confidence-weighted Ridge regression to reconstruct continuous hate speech scores from the Measuring Hate Speech corpus, achieving an $R^2$ of up to 0.71 and outperforming direct prompting baselines. This demonstrates that structured attribute decomposition recovers a richer and more human-aligned signal than end-to-end label prediction alone.

Mohammad Amine Jradi, Faeze Ghorbanpour, Alexander Fraser • 2026
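
The confidence-weighted Ridge regression mentioned in the abstract could look roughly like the sketch below. It is not the authors' implementation. It assumes, purely for illustration, that "confidence-weighted" means scaling each attribute-level prediction by the model's confidence in that prediction before fitting a standard ridge regressor. The function names, the synthetic data, and the `alpha` value are all hypothetical.

```python
import numpy as np


def confidence_weighted_features(preds, conf):
    # ASSUMPTION: each attribute prediction is scaled by a confidence in [0, 1].
    # The abstract does not specify the weighting scheme used in the paper.
    return preds * conf


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean  # center features so the intercept is not penalized
    yc = y - y_mean
    n_features = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept


def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data (hypothetical): 200 comments, 10 attribute scores each.
rng = np.random.default_rng(0)
n_samples, n_attributes = 200, 10
preds = rng.uniform(0, 4, size=(n_samples, n_attributes))   # attribute-level LLM ratings
conf = rng.uniform(0.5, 1.0, size=(n_samples, n_attributes))  # per-prediction confidence
true_w = rng.normal(size=n_attributes)
y = (preds * conf) @ true_w + rng.normal(scale=0.1, size=n_samples)  # continuous target score

X = confidence_weighted_features(preds, conf)
coef, intercept = fit_ridge(X[:150], y[:150], alpha=1.0)  # fit on the first 150 rows
test_r2 = r2_score(y[150:], X[150:] @ coef + intercept)   # evaluate on the held-out 50
```

With real data, `preds` and `conf` would come from the attribute-level LLM outputs and `y` from the continuous hate speech scores in the Measuring Hate Speech corpus. An alternative reading of "confidence-weighted" is to use confidence as a per-sample weight in the loss, which would change `fit_ridge` rather than the features.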

Related benchmarks

Task                              Dataset           Result          Rank
Hate speech classification        MHS small models  F1 Score 60.42  14
Hate speech classification        MHS               F1 Score 69.2   14
Hate Speech Score Reconstruction  MHS               R^2 70.71       4
Hate Speech Score Reconstruction  MHS small models  62.55           4
