CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
About
Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.
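As a rough illustration of the two ideas in the abstract (vocabulary-free character input, and downsampling to shorten the sequence), here is a minimal sketch. The helper names `to_codepoints` and `downsample`, the mean-pooling operator, and the rate of 4 are all assumptions made for illustration. They are not taken from the abstract and should not be read as the model's actual implementation. In a real encoder the pooling would apply to learned character embeddings rather than raw code points; raw values are used here only to show the change in sequence length.

```python
def to_codepoints(text):
    """Map each character to its Unicode code point; no vocabulary lookup is involved."""
    return [ord(ch) for ch in text]


def downsample(values, rate=4):
    """Mean-pool consecutive windows of `rate` positions.

    Illustrative stand-in for a downsampling step (assumption, not the paper's operator).
    The last window may be shorter than `rate`.
    """
    pooled = []
    for start in range(0, len(values), rate):
        window = values[start:start + rate]
        pooled.append(sum(window) / len(window))
    return pooled


ids = to_codepoints("tokenization-free")
pooled = downsample([float(i) for i in ids], rate=4)
print(len(ids), len(pooled))  # character positions vs. downsampled positions
```

The point of the sketch is only that any string, in any script, maps to an input sequence without a tokenizer, and that downsampling reduces the number of positions a deep transformer stack then has to process.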
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Natural Language Understanding | GLUE | SST-2 90 | 452 |
| Text Classification | AG News (test) | Accuracy 93.5 | 210 |
| Text Classification | IMDB (test) | CA 91.1 | 79 |
| Comment Classification | Civil Comments | Accuracy 82.9 | 21 |
| Sequence Reconstruction | Genomic Reads ART simulator 150bp paired-end GRCh38 reference | Reconstruction Rate 32.5 | 9 |
| Taxonomic Classification | CAMI II metagenome 2017 | Taxa F1 Score 85.2 | 9 |
| Variant Calling | GIAB HG002 truth set (test) | F1 Score (Variant) 81.8 | 9 |
| 6-continent classification | 6-continent classification (test) | Accuracy 77.6 | 7 |
| Classification | 14-region | Accuracy 68.3 | 7 |
| Nationality classification | 99-country nationality dataset | Accuracy 45 | 7 |