ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection

About

Existing Chinese toxic content detection methods mainly target sentence-level classification but often fail to provide readable and contiguous toxic evidence spans. We propose ToxiTrace, an explainability-oriented method for BERT-style encoders with three components: (1) CuSA, which refines encoder-derived saliency cues into fine-grained toxic spans with lightweight LLM guidance; (2) GCLoss, a gradient-constrained objective that concentrates token-level saliency on toxic evidence while suppressing irrelevant activations; and (3) ARCL, which constructs sample-specific contrastive reasoning pairs to sharpen the semantic boundary between toxic and non-toxic content. Experiments show that ToxiTrace improves classification accuracy and toxic span extraction while preserving efficient encoder-based inference and producing more coherent, human-readable explanations. We have released the model at https://huggingface.co/ArdLi/ToxiTrace.
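The abstract does not state the GCLoss formula. As a rough illustration of what "concentrating token-level saliency on toxic evidence" could mean, the sketch below computes the fraction of absolute saliency mass that falls outside an annotated toxic span. This is an assumption made for illustration only, not the paper's objective: the function name `saliency_concentration_penalty`, its inputs, and the normalization are all hypothetical.

```python
def saliency_concentration_penalty(saliency, span_mask, eps=1e-12):
    """Hypothetical penalty: share of |saliency| lying outside the toxic span.

    saliency  -- per-token saliency scores (e.g. gradient magnitudes)
    span_mask -- per-token flags, truthy where the token is toxic evidence
    Returns a value in [0, 1]; 0 means all saliency sits inside the span.
    """
    if len(saliency) != len(span_mask):
        raise ValueError("saliency and span_mask must have the same length")

    magnitudes = [abs(s) for s in saliency]
    total = sum(magnitudes)
    if total < eps:
        # No saliency anywhere, so there is nothing to penalize.
        return 0.0

    outside = sum(m for m, inside in zip(magnitudes, span_mask) if not inside)
    return outside / total


# Illustrative values only: four tokens, the middle two marked as toxic.
print(saliency_concentration_penalty([1, 6, 3, 0], [0, 1, 1, 0]))  # 0.1
```

In a training setup, a quantity of this kind would be computed from differentiable gradients and added to the classification loss. How ToxiTrace actually does this is not specified in the text above.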

Boyang Li, Hongzhe Shou, Yuanyuan Liang, Jingbin Zhang, Fang Zhou (2026)

Related benchmarks

Task                     Dataset        Metric          Result  Rank
Toxic span extraction    CNTP           Overlap Recall  86.36   13
Toxicity Classification  COLD (test)    Accuracy        83.84   12
Toxicity Classification  TOXICN (test)  Accuracy        83.87   12
