Towards Token-Level Text Anomaly Detection

About

Despite significant progress in text anomaly detection for web applications such as spam filtering and fake news detection, existing methods are fundamentally limited to document-level analysis and cannot identify which specific parts of a text are anomalous. We introduce token-level anomaly detection, a novel paradigm that enables fine-grained localization of anomalies within text. We formally define text anomalies at both the document and token levels, and propose a unified detection framework that operates across multiple levels. To facilitate research in this direction, we collect and annotate three benchmark datasets spanning spam, reviews, and grammar errors with token-level labels. Experimental results demonstrate that our framework outperforms six baselines, opening new possibilities for precise anomaly localization in text. All code and data are publicly available at https://github.com/charles-cao/TokenCore.

Yang Cao, Bicheng Yu, Sikun Yang, Ming Liu, Yujiu Yang • 2026

Related benchmarks

Task	Dataset	Result
Document-Level Anomaly Detection	Review (test)	AUROC 0.9594	7
Token-Level Anomaly Detection	Review (test)	AUROC 0.8271	7
Token-Level Anomaly Detection	Grammar (test)	AUROC 0.6371	7
Document-Level Anomaly Detection	SMS Spam (test)	AUROC 0.5859	7
Document-Level Anomaly Detection	Grammar (test)	AUROC 0.6553	7
Token-Level Anomaly Detection	SMS Spam (test)	AUROC 0.6792	7
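
To make the two task rows concrete, here is a minimal sketch of how AUROC might be computed at the token level and at the document level from the same per-token anomaly scores. It is not the authors' evaluation code. The toy labels and scores, the function names, the rule that a document is anomalous when any of its tokens is, and the max-over-tokens document score are all assumptions made for illustration.

```python
from typing import List, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (anomalous, normal) pairs ranked correctly.

    Ties between an anomalous and a normal score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one anomalous and one normal item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def token_level_auroc(token_labels: List[List[int]],
                      token_scores: List[List[float]]) -> float:
    """Pool every token from every document, then score the pooled tokens."""
    flat_labels = [y for doc in token_labels for y in doc]
    flat_scores = [s for doc in token_scores for s in doc]
    return auroc(flat_labels, flat_scores)


def document_level_auroc(token_labels: List[List[int]],
                         token_scores: List[List[float]]) -> float:
    """Assumption: a document is anomalous if any token is, and its score
    is the maximum of its token scores."""
    doc_labels = [int(any(doc)) for doc in token_labels]
    doc_scores = [max(doc) for doc in token_scores]
    return auroc(doc_labels, doc_scores)


# Hypothetical toy data: three documents, 1 marks an anomalous token.
token_labels = [[0, 0, 1, 0], [0, 0, 0], [1, 1, 0]]
token_scores = [[0.1, 0.2, 0.9, 0.3], [0.2, 0.4, 0.1], [0.8, 0.35, 0.3]]

print("token-level AUROC:", token_level_auroc(token_labels, token_scores))
print("document-level AUROC:", document_level_auroc(token_labels, token_scores))
```

In this toy example the document-level score is perfect while the token-level score is not, because one anomalous token is ranked below a normal token from another document. Localizing anomalies within a text is the stricter task in this sense.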

