
Multilingual Sequence-to-Sequence Models for Hebrew NLP

About

Recent work attributes progress in NLP to large language models (LMs) with increased model size and large quantities of pretraining data. Despite this, current state-of-the-art LMs for Hebrew are both under-parameterized and under-trained compared to LMs in other languages. Additionally, previous work on pretrained Hebrew LMs focused on encoder-only models. While the encoder-only architecture is beneficial for classification tasks, it is less well suited to sub-word prediction tasks, such as Named Entity Recognition, given the morphologically rich nature of Hebrew. In this paper we argue that sequence-to-sequence generative architectures are more suitable for LLMs in the case of morphologically rich languages (MRLs) such as Hebrew. We demonstrate that by casting tasks in the Hebrew NLP pipeline as text-to-text tasks, we can leverage powerful multilingual, pretrained sequence-to-sequence models such as mT5, eliminating the need for a specialized, morpheme-based, separately fine-tuned decoder. Using this approach, our experiments show substantial improvements over previously published results on existing Hebrew NLP benchmarks. These results suggest that multilingual sequence-to-sequence models present a promising building block for NLP for MRLs.
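The abstract's central idea, casting pipeline tasks as text-to-text problems, can be sketched in a few lines. The snippet below shows how an NER example might be serialized into a (source, target) string pair for a seq2seq model such as mT5. The `"ner:"` prompt prefix and the `span: LABEL` target format are illustrative assumptions, not the paper's exact serialization.

```python
# Hypothetical text-to-text casting for Hebrew NER, in the spirit of the
# paper's approach: the input sentence becomes a prompted source string and
# the labeled spans are serialized into a flat target string that a
# sequence-to-sequence model (e.g. mT5) can be fine-tuned to generate.
# The prompt wording and serialization format here are assumptions.

def cast_ner_to_text(sentence: str, entities: list[tuple[str, str]]) -> tuple[str, str]:
    """Return (source, target) strings for seq2seq fine-tuning."""
    source = f"ner: {sentence}"
    # Serialize each (span, label) pair; an empty entity list yields "none".
    if entities:
        target = " ; ".join(f"{span}: {label}" for span, label in entities)
    else:
        target = "none"
    return source, target


if __name__ == "__main__":
    src, tgt = cast_ner_to_text(
        "Reut works at Bar-Ilan University",
        [("Reut", "PER"), ("Bar-Ilan University", "ORG")],
    )
    print(src)  # ner: Reut works at Bar-Ilan University
    print(tgt)  # Reut: PER ; Bar-Ilan University: ORG
```

Because both input and output are plain strings, the same pattern extends to POS tagging, lemmatization, or segmentation without any task-specific decoder.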

Matan Eyal, Hila Noga, Roee Aharoni, Idan Szpektor, Reut Tsarfaty • 2022

Related benchmarks

Task                      Dataset                       Metric          Result   Rank
Named Entity Recognition  Ben-Mordecai Corpora (BMC)    F1 Score        93.29    13
POS Tagging               Hebrew UD Corpus v2 (test)    mset Accuracy   97.46    10
Word Segmentation         Hebrew UD Corpus v2 (test)    mset            98.67    10
Named Entity Recognition  NEMO Token-level              F1              89.86    9
Sentiment Analysis        Sentiment Analysis            F1 Score        89.61    9
Named Entity Recognition  NEMO Morpheme-level           F1 Score        86.65    8
Question Answering        ParaShoot                     EM              50.37    8
Lemmatization             Hebrew UD Corpus v2 (test)    mset Score      95.53    6
