Charformer: Fast Character Transformers via Gradient-based Subword Tokenization
About
State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level. Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par with, and sometimes outperforming, subword-based models. Additionally, Charformer is fast, improving the speed of both vanilla byte-level and subword-level Transformers by 28%-100% while maintaining competitive quality. We believe this work paves the way for highly performant token-free models that are trained completely end-to-end.
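The GBST mechanism described above can be made concrete with a short sketch. The block below is a minimal PyTorch illustration under stated assumptions, not the paper's reference implementation: it assumes candidate blocks are formed by non-overlapping mean pooling, uses a single linear layer as the block scoring network, and downsamples by mean pooling; the class name `GBSTSketch` and the parameters `max_block_size` and `downsample_rate` are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GBSTSketch(nn.Module):
    """Minimal sketch of gradient-based subword tokenization (GBST).

    At each character position, candidate subword blocks of sizes
    1..max_block_size are formed by strided mean pooling, scored
    position-wise by a block scoring network, and softly combined via a
    softmax over block sizes. The latent sequence is then downsampled.
    """

    def __init__(self, dim: int, max_block_size: int = 4, downsample_rate: int = 2):
        super().__init__()
        self.max_block_size = max_block_size
        self.downsample_rate = downsample_rate
        self.score = nn.Linear(dim, 1)  # block scoring network

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) byte/character embeddings
        batch, seq_len, dim = x.shape
        candidates = []
        for b in range(1, self.max_block_size + 1):
            # pad so seq_len is divisible by block size b
            pad = (b - seq_len % b) % b
            xb = F.pad(x, (0, 0, 0, pad))
            # non-overlapping mean-pooled blocks: (batch, L/b, dim)
            blocks = xb.view(batch, -1, b, dim).mean(dim=2)
            # upsample back to character resolution by repetition
            blocks = blocks.repeat_interleave(b, dim=1)[:, :seq_len]
            candidates.append(blocks)

        # stack candidates: (batch, seq_len, num_block_sizes, dim)
        cand = torch.stack(candidates, dim=2)
        # position-wise scores over block sizes, normalized by softmax
        weights = F.softmax(self.score(cand).squeeze(-1), dim=-1)
        # soft block selection: weighted sum over candidate block sizes
        latent = (weights.unsqueeze(-1) * cand).sum(dim=2)

        # downsample to a shorter latent subword sequence
        d = self.downsample_rate
        pad = (d - seq_len % d) % d
        latent = F.pad(latent, (0, 0, 0, pad))
        return latent.view(batch, -1, d, dim).mean(dim=2)
```

As a quick shape check under these assumptions, `GBSTSketch(64)(torch.randn(2, 16, 64))` returns a tensor of shape `(2, 8, 64)`: the byte-level sequence is halved by the downsampling step, which is where the speedup over a vanilla byte-level Transformer comes from.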
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Natural Language Understanding | GLUE | SST-2 | 91.6 | 452 |
| Text Classification | AG News (test) | Accuracy | 94.1 | 210 |
| Text Classification | IMDB (test) | CA | 94.4 | 79 |
| Comment Classification | Civil Comments | Accuracy | 83 | 21 |
| Question Answering | TyDiQA GoldP (test) | F1 Score | 86.3 | 12 |
| Taxonomic Classification | CAMI II metagenome 2017 | Taxa F1 Score | 89.3 | 9 |
| Variant Calling | GIAB HG002 truth set (test) | F1 Score (Variant) | 85.6 | 9 |
| Sequence Reconstruction | Genomic Reads ART simulator 150bp paired-end GRCh38 reference | Reconstruction Rate | 27.9 | 9 |
| Comment Classification | Wiki Comments | Accuracy | 93.5 | 5 |