ToxiTrace: Gradient-Aligned Training for Explainable Chinese Toxicity Detection

About

Existing Chinese toxic content detection methods mainly target sentence-level classification but often fail to provide readable and contiguous toxic evidence spans. We propose ToxiTrace, an explainability-oriented method for BERT-style encoders with three components: (1) CuSA, which refines encoder-derived saliency cues into fine-grained toxic spans with lightweight LLM guidance; (2) GCLoss, a gradient-constrained objective that concentrates token-level saliency on toxic evidence while suppressing irrelevant activations; and (3) ARCL, which constructs sample-specific contrastive reasoning pairs to sharpen the semantic boundary between toxic and non-toxic content. Experiments show that ToxiTrace improves classification accuracy and toxic span extraction while preserving efficient encoder-based inference and producing more coherent, human-readable explanations. We have released the model at https://huggingface.co/ArdLi/ToxiTrace.
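The intuition behind a gradient-constrained objective such as GCLoss can be illustrated with a toy penalty. The sketch below is not the authors' formulation: it assumes per-token saliency scores (for example, gradient magnitudes) and a binary toxic-token mask are already available, and the function name `saliency_concentration_penalty` is hypothetical. It measures the share of saliency mass that falls outside the annotated toxic tokens, which a training objective of this kind would presumably try to reduce.

```python
def saliency_concentration_penalty(saliency, toxic_mask, eps=1e-12):
    """Illustrative penalty (an assumption, not the paper's GCLoss).

    saliency:   per-token saliency scores, e.g. gradient magnitudes.
    toxic_mask: per-token flags, truthy where the token is toxic evidence.
    Returns the fraction of total absolute saliency on non-toxic tokens,
    a value in [0, 1]; 0 means all saliency sits on the toxic span.
    """
    if len(saliency) != len(toxic_mask):
        raise ValueError("saliency and toxic_mask must have the same length")

    magnitudes = [abs(s) for s in saliency]
    total = sum(magnitudes)
    if total < eps:
        # No saliency anywhere: nothing to penalize.
        return 0.0

    outside = sum(m for m, is_toxic in zip(magnitudes, toxic_mask) if not is_toxic)
    return outside / total


if __name__ == "__main__":
    # Four tokens; tokens 1 and 2 form the toxic span.
    scores = [0.1, 0.6, 0.3, 0.0]
    mask = [0, 1, 1, 0]
    print(saliency_concentration_penalty(scores, mask))
```

In an actual training loop, a term like this would be computed on differentiable gradient tensors in an autodiff framework and added to the classification loss; the plain-Python version here only shows the quantity being constrained.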

Boyang Li, Hongzhe Shou, Yuanyuan Liang, Jingbin Zhang, Fang Zhou • 2026

Related benchmarks

Task	Dataset	Result
Toxicity Classification	COLD (test)	Accuracy: 83.84	19
Toxicity Classification	TOXICN (test)	Accuracy: 83.87	19
Toxic Span Extraction	CNTP	Overlap Recall: 86.36	13

