CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation

About

Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, we present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias. To use its finer-grained input effectively and efficiently, CANINE combines downsampling, which reduces the input sequence length, with a deep transformer stack, which encodes context. CANINE outperforms a comparable mBERT model by 2.8 F1 on TyDi QA, a challenging multilingual benchmark, despite having 28% fewer model parameters.

Jonathan H. Clark, Dan Garrette, Iulia Turc, John Wieting • 2021
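
As a rough illustration of the two ideas named in the abstract (vocabulary-free character input and downsampling before the deep encoder), the following Python sketch maps text to Unicode codepoints and then shortens the resulting sequence. The helper names (`to_codepoints`, `toy_embed`, `downsample`), the use of mean pooling, and the rate of 4 are assumptions made for illustration only, not the paper's implementation; the transformer stack itself is omitted.

```python
from typing import List


def to_codepoints(text: str) -> List[int]:
    """Map each character to its Unicode codepoint; no vocabulary lookup is involved."""
    return [ord(ch) for ch in text]


def toy_embed(codepoint: int, dim: int = 4) -> List[float]:
    """Hypothetical stand-in for a learned character embedding (deterministic toy values)."""
    return [((codepoint * (d + 1)) % 97) / 97.0 for d in range(dim)]


def downsample(vectors: List[List[float]], rate: int) -> List[List[float]]:
    """Mean-pool each consecutive group of `rate` vectors into a single vector.

    This is an assumed, simplified pooling choice used only to show how the
    sequence length shrinks before a deep encoder would process it.
    """
    if rate < 1:
        raise ValueError("rate must be a positive integer")
    pooled = []
    for start in range(0, len(vectors), rate):
        window = vectors[start:start + rate]  # the last window may be shorter
        dim = len(window[0])
        pooled.append([sum(v[d] for v in window) / len(window) for d in range(dim)])
    return pooled


text = "tokenization-free"
codepoints = to_codepoints(text)
char_vectors = [toy_embed(cp) for cp in codepoints]
short = downsample(char_vectors, rate=4)  # rate=4 is an illustrative assumption
print(len(codepoints), "->", len(short))  # prints: 17 -> 5
```

The point of the sketch is only the shape change: a 17-character input becomes 5 positions, so a deep stack applied afterwards would attend over a much shorter sequence than the raw characters.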

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Natural Language Understanding | GLUE | SST-2 | 90 | 452 |
| Text Classification | AG News (test) | Accuracy | 93.5 | 210 |
| Text Classification | IMDB (test) | CA | 91.1 | 79 |
| Comment Classification | Civil Comments | Accuracy | 82.9 | 21 |
| Sequence Reconstruction | Genomic Reads ART simulator 150bp paired-end GRCh38 reference | Reconstruction Rate | 32.5 | 9 |
| Taxonomic Classification | CAMI II metagenome 2017 | Taxa F1 Score | 85.2 | 9 |
| Variant Calling | GIAB HG002 truth set (test) | F1 Score (Variant) | 81.8 | 9 |
| 6-continent classification | 6-continent classification (test) | Accuracy | 77.6 | 7 |
| Classification | 14-region | Accuracy | 68.3 | 7 |
| Nationality classification | 99-country nationality dataset | Accuracy | 45 | 7 |
Showing 10 of 13 rows
