Towards Token-Level Text Anomaly Detection
About
Despite significant progress in text anomaly detection for web applications such as spam filtering and fake news detection, existing methods are fundamentally limited to document-level analysis and cannot identify which specific parts of a text are anomalous. We introduce token-level anomaly detection, a novel paradigm that enables fine-grained localization of anomalies within text. We formally define text anomalies at both the document and token levels and propose a unified detection framework that operates across multiple levels. To facilitate research in this direction, we collect and annotate three benchmark datasets spanning spam, reviews, and grammar errors with token-level labels. Experimental results demonstrate that our framework outperforms six baselines, opening new possibilities for precise anomaly localization in text. All code and data are publicly available at https://github.com/charles-cao/TokenCore.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Document-Level Anomaly Detection | Review (test) | AUROC | 0.9594 | 7 |
| Token-Level Anomaly Detection | Review (test) | AUROC | 0.8271 | 7 |
| Token-Level Anomaly Detection | Grammar (test) | AUROC | 63.71 | 7 |
| Document-Level Anomaly Detection | SMS Spam (test) | AUROC | 0.5859 | 7 |
| Document-Level Anomaly Detection | Grammar (test) | AUROC | 0.6553 | 7 |
| Token-Level Anomaly Detection | SMS Spam (test) | AUROC | 67.92 | 7 |
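
For readers unfamiliar with the metric, the sketch below shows one way AUROC could be computed at the token level and at the document level from the same per-token scores. It is an illustration only, not the paper's evaluation protocol: the toy scores and labels are made up, and both the max-over-tokens document score and the "any anomalous token" document label are assumptions introduced here.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random anomalous item outscores a random normal one.

    Ties count as half a win. Quadratic in the number of items, which is
    fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one anomalous and one normal item")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: per-token anomaly scores and 0/1 token labels
# for three documents (not taken from the benchmark datasets).
token_scores = [[0.1, 0.9, 0.2], [0.2, 0.1], [0.3, 0.8, 0.15, 0.1]]
token_labels = [[0, 1, 0], [0, 0], [0, 1, 1, 0]]

# Token level: pool every token from every document into one ranking.
flat_scores = [s for doc in token_scores for s in doc]
flat_labels = [y for doc in token_labels for y in doc]
token_auroc = auroc(flat_labels, flat_scores)

# Document level (assumption): a document is anomalous if any of its tokens
# is, and its score is the maximum token score.
doc_scores = [max(doc) for doc in token_scores]
doc_labels = [int(any(doc)) for doc in token_labels]
doc_auroc = auroc(doc_labels, doc_scores)

print(f"token-level AUROC: {token_auroc:.4f}")
print(f"document-level AUROC: {doc_auroc:.4f}")
```

In this toy case the document-level score is perfect while the token-level score is not, because one anomalous token receives a low score inside a document that is still ranked correctly as a whole. This is the kind of gap that token-level evaluation is meant to expose.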