Multilingual Language Processing From Bytes

About

We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads text as bytes and outputs span annotations of the form [start, length, label], where start positions, lengths, and labels are separate entries in our vocabulary. Because we operate directly on Unicode bytes rather than language-specific words or characters, we can analyze text in many languages with a single model. Due to the small vocabulary size, these multilingual models are very compact, but produce results similar to or better than the state of the art in Part-of-Speech tagging and Named Entity Recognition among systems that use only the provided training datasets (no external data sources). Our models learn "from scratch" in that they do not rely on any elements of the standard pipeline in Natural Language Processing (including tokenization), and thus can run in standalone fashion on raw text.
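The input and output representation described above can be illustrated with a small sketch. It covers only the data format, not the LSTM itself: text becomes a sequence of UTF-8 byte values, and each annotation becomes three output tokens for start, length, and label. The token names (`S16`, `L7`) and the helper functions are hypothetical choices made for this illustration and are not taken from the paper.

```python
from typing import List, Tuple

# (start byte offset, length in bytes, label); offsets count bytes, not characters.
Span = Tuple[int, int, str]


def to_bytes(text: str) -> List[int]:
    """Input side: each UTF-8 byte is one input symbol (values 0..255)."""
    return list(text.encode("utf-8"))


def spans_to_targets(spans: List[Span]) -> List[str]:
    """Output side: flatten spans into one token each for start, length, label.

    The "S<n>" / "L<n>" token spelling is an assumption for illustration.
    """
    targets: List[str] = []
    for start, length, label in sorted(spans):
        targets += [f"S{start}", f"L{length}", label]
    return targets


def span_text(text: str, span: Span) -> str:
    """Recover the annotated substring by slicing the byte sequence."""
    start, length, _ = span
    return text.encode("utf-8")[start:start + length].decode("utf-8")


# "Óscar lives in México", written with escapes so the byte counts are unambiguous.
text = "\u00d3scar lives in M\u00e9xico"
spans = [(0, 6, "PER"), (16, 7, "LOC")]

print(len(text), len(to_bytes(text)))  # character count vs. byte count
print(spans_to_targets(spans))
print([span_text(text, s) for s in spans])
```

Because the accented letters take two bytes each in UTF-8, the byte offsets differ from character offsets, which is why the spans are expressed in bytes throughout.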

Dan Gillick, Cliff Brunk, Oriol Vinyals, Amarnag Subramanya • 2015

Related benchmarks

Task                      Dataset                        Result          Rank
Named Entity Recognition  CoNLL English 2003 (test)      F1 Score 86.5   135
Named Entity Recognition  CoNLL Spanish NER 2002 (test)  F1 Score 82.95  98
Named Entity Recognition  CoNLL Dutch 2002 (test)        F1 Score 82.84  87
Named Entity Recognition  CoNLL 2003                     F1 Score 86.5   86
Named Entity Recognition  CoNLL German 2003 (test)       F1 Score 76.22  78
Part-of-Speech Tagging    UD Average 1.2 (test)          Accuracy 95.7   22
Named Entity Recognition  Dutch NER CoNLL 2002           F1 Score 82.84  8
Named Entity Recognition  Spanish NER CoNLL 2002         F1 Score 82.95  7
Part-of-Speech Tagging    UD Bulgarian 1.2 (test)        Accuracy 97.84  4
Part-of-Speech Tagging    UD English 1.2 (test)          Accuracy 93.87  4