SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing

About

This paper describes SentencePiece, a language-independent subword tokenizer and detokenizer designed for Neural-based text processing, including Neural Machine Translation. It provides open-source C++ and Python implementations for subword units. While existing subword segmentation tools assume that the input is pre-tokenized into word sequences, SentencePiece can train subword models directly from raw sentences, which allows us to make a purely end-to-end and language independent system. We perform a validation experiment of NMT on English-Japanese machine translation, and find that it is possible to achieve comparable accuracy to direct subword training from raw sentences. We also compare the performance of subword training and segmentation with various configurations. SentencePiece is available under the Apache 2 license at https://github.com/google/sentencepiece.

Taku Kudo, John Richardson • 2018
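
As a rough illustration of what tokenizing and detokenizing raw sentences can look like, the sketch below treats whitespace as an ordinary symbol by escaping it to "▁" (U+2581), the meta symbol SentencePiece uses, so the original sentence can be recovered by simple concatenation. Everything else here is an assumption made for illustration: the names `encode`, `decode`, and `TOY_VOCAB` are hypothetical, the vocabulary is hand-written rather than trained, and greedy longest-match stands in for the segmentation algorithms the actual tool implements. It is not the SentencePiece API.

```python
# Toy illustration only: NOT the SentencePiece library or its algorithms.
WS = "\u2581"  # "▁", the meta symbol used to mark whitespace

# Hypothetical hand-written vocabulary; a real model would learn this from raw text.
TOY_VOCAB = {"Hello", WS + "Wor", "ld", "."}


def encode(text, vocab):
    """Segment a raw sentence (no pre-tokenization) by greedy longest match."""
    escaped = text.replace(" ", WS)  # whitespace becomes a normal symbol
    pieces = []
    i = 0
    while i < len(escaped):
        # Try the longest candidate first; fall back to a single character
        # so that any input, including unsegmented scripts, can be encoded.
        for j in range(len(escaped), i, -1):
            if escaped[i:j] in vocab or j == i + 1:
                pieces.append(escaped[i:j])
                i = j
                break
    return pieces


def decode(pieces):
    """Detokenize: concatenate the pieces and restore the spaces."""
    return "".join(pieces).replace(WS, " ")


pieces = encode("Hello World.", TOY_VOCAB)
print(pieces)          # ['Hello', '▁Wor', 'ld', '.']
print(decode(pieces))  # Hello World.
```

Because the space is kept inside the pieces, decoding needs no language-specific rules, and the same two functions work on text written without spaces. One limitation of this sketch is that input already containing "▁" would not round-trip.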

Related benchmarks

Task	Dataset	Result
Natural Language Understanding	GLUE	SST-2 90.8	551
Natural Language Understanding	GLUE (dev)	SST-2 (Acc) 91.1	529
Text Classification	AG News (test)	Accuracy 92.4	293
Text Classification	Yelp P. (test)	Accuracy 93.8	34
Multiclass text classification	Multilingual Amazon Reviews Corpus (test)	Accuracy (Avg) 90.8	24
Text Classification	Average All Datasets	Accuracy 86.5	18
Text Classification	MASSIVE (test)	Accuracy 69.6	18
Tokenisation	Wikipedia/OpenWebText	F1 Score 99.9	9
Sequence Reconstruction	Genomic Reads ART simulator 150bp paired-end GRCh38 reference	Reconstruction Rate 30.1	9
Taxonomic Classification	CAMI II metagenome 2017	Taxa F1 Score 87.2	9

