MoNoise: Modeling Noise Using a Modular Normalization System

About

We propose MoNoise, a normalization model focused on generalizability and efficiency that aims to be easily reusable and adaptable. Normalization is the task of translating text from a non-canonical domain to a more canonical domain, in our case from social media data to standard language. Our proposed model is based on modular candidate generation, in which each module is responsible for a different type of normalization action. The most important generation modules are a spelling correction system and a word embeddings module. Depending on the definition of the normalization task, a static lookup list can be crucial for performance. We train a random forest classifier to rank the candidates, which generalizes well to all different types of normalization actions. Most features for the ranking originate from the generation modules; besides these features, N-gram features prove to be an important source of information. We show that MoNoise beats the state of the art on different normalization benchmarks for English and Dutch, which all define the task of normalization slightly differently.

Rob van der Goot, Gertjan van Noord • 2017
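
The generate-then-rank design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The vocabulary, lookup list, frequencies, weights, and all function names are hypothetical. `difflib.get_close_matches` stands in for the spelling correction system, the word embeddings module is omitted, and a hand-weighted linear score stands in for the random forest classifier. The sketch only shows how independent modules can each propose candidates with their own features, which a single ranker then scores together with a unigram feature.

```python
from difflib import get_close_matches

# Hypothetical toy resources; a real system would use far larger data.
VOCAB = ["you", "are", "great", "tomorrow", "see", "to"]
LOOKUP = {"u": "you", "r": "are", "2": "to", "gr8": "great"}
UNIGRAM_FREQ = {"you": 0.3, "are": 0.2, "to": 0.25,
                "see": 0.1, "great": 0.05, "tomorrow": 0.05}


def gen_identity(word):
    """Propose the original word itself (many tokens need no change)."""
    return {word: {"identity": 1.0}}


def gen_lookup(word):
    """Propose the replacement from a static lookup list, if there is one."""
    cand = LOOKUP.get(word)
    return {cand: {"lookup": 1.0}} if cand else {}


def gen_spelling(word):
    """Stand-in for a spelling corrector: similar in-vocabulary words."""
    matches = get_close_matches(word, VOCAB, n=3, cutoff=0.7)
    return {m: {"spelling": 1.0} for m in matches}


# Each module handles one type of normalization action.
MODULES = [gen_identity, gen_lookup, gen_spelling]

# Assumed hand-set weights; the paper trains a random forest instead.
WEIGHTS = {"identity": 0.5, "lookup": 2.0, "spelling": 1.0, "unigram": 1.0}


def generate(word):
    """Merge candidates from all modules, collecting their features."""
    candidates = {}
    for module in MODULES:
        for cand, feats in module(word).items():
            candidates.setdefault(cand, {}).update(feats)
    return candidates


def score(feats):
    """Linear stand-in for the ranking classifier."""
    return sum(WEIGHTS[name] * value for name, value in feats.items())


def normalize(tokens):
    """Pick the highest-scoring candidate for every token."""
    result = []
    for tok in tokens:
        cands = generate(tok)
        for cand, feats in cands.items():
            # N-gram feature, reduced here to a unigram frequency.
            feats["unigram"] = UNIGRAM_FREQ.get(cand, 0.0)
        result.append(max(cands, key=lambda c: score(cands[c])))
    return result


print(normalize(["see", "u", "tomorow"]))
```

Under these assumptions, adding a new type of normalization action means adding one more function to `MODULES` and one more weight; the ranking step itself stays unchanged.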

Related benchmarks

Task	Dataset	Result
Emotion intensity ordinal classification	Affect in Tweets EI-oc Fear SemEval-2018 Task 1	Pearson r 0.694	9
Irony Detection	Irony Detection Irony-a SemEval-2018 Task 3	F1 Score 72.6	9
Irony Detection	Irony Detection Irony-b SemEval-2018 Task 3	F1 Score 51	9
Emotion intensity ordinal classification	Affect in Tweets EI-oc Joy SemEval-2018 Task 1	Pearson r 0.715	9
Emotion intensity ordinal classification	Affect in Tweets EI-oc Anger SemEval-2018 Task 1	Pearson r 0.717	9
Emotion intensity ordinal classification	Affect in Tweets EI-oc Sad SemEval-2018 Task 1	Pearson r 68	9
Text Normalization	LexNorm 2015 (test)	Precision 93.53	5
