MoNoise: Modeling Noise Using a Modular Normalization System
About
We propose MoNoise, a normalization model focused on generalizability and efficiency that aims to be easily reusable and adaptable. Normalization is the task of translating texts from a non-canonical domain to a more canonical domain; in our case, from social media data to standard language. Our proposed model is based on modular candidate generation, in which each module is responsible for a different type of normalization action. The most important generation modules are a spelling correction system and a word embeddings module. Depending on the definition of the normalization task, a static lookup list can be crucial for performance. We train a random forest classifier to rank the candidates, which generalizes well to all different types of normalization actions. Most features for the ranking originate from the generation modules; besides these features, N-gram features prove to be an important source of information. We show that MoNoise beats the state of the art on different normalization benchmarks for English and Dutch, which all define the task of normalization slightly differently.
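The generate-then-rank design described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not MoNoise's actual code: the toy vocabulary and lookup list, the module and function names, and the use of `difflib` as a stand-in spelling corrector. The word embeddings module and the N-gram features are omitted because they need trained resources, and the random forest ranker is replaced by a hand-weighted score so that the example stays dependency-free.

```python
from collections import defaultdict
from difflib import get_close_matches

# Hypothetical toy resources (assumptions, not data from the paper).
VOCAB = ["you", "your", "tomorrow", "see", "great", "tonight", "are"]
LOOKUP = {"u": ["you"], "2morrow": ["tomorrow"], "gr8": ["great"], "r": ["are"]}


def original_module(token):
    """Always propose the token itself, so 'no change' is a candidate."""
    return [token]


def lookup_module(token):
    """Propose replacements from a static lookup list."""
    return LOOKUP.get(token, [])


def spelling_module(token):
    """Stand-in spelling corrector: fuzzy string matching against VOCAB."""
    return get_close_matches(token, VOCAB, n=3, cutoff=0.7)


# Each module handles one type of normalization action.
MODULES = {
    "original": original_module,
    "lookup": lookup_module,
    "spelling": spelling_module,
}


def generate_candidates(token):
    """Return {candidate: set of names of the modules that proposed it}."""
    candidates = defaultdict(set)
    for name, module in MODULES.items():
        for candidate in module(token):
            candidates[candidate].add(name)
    return dict(candidates)


def features(candidate, sources):
    """One binary feature per module, plus an in-vocabulary indicator."""
    return [int(name in sources) for name in MODULES] + [int(candidate in VOCAB)]


# Hand-set weights standing in for a trained ranker (NOT a random forest).
# Order: original, lookup, spelling, in-vocabulary.
WEIGHTS = [0.1, 1.0, 0.6, 0.5]


def score(feature_vector):
    return sum(w * f for w, f in zip(WEIGHTS, feature_vector))


def normalize(tokens):
    """Pick the highest-scoring candidate for every token."""
    normalized = []
    for token in tokens:
        candidates = generate_candidates(token)
        best = max(candidates, key=lambda c: score(features(c, candidates[c])))
        normalized.append(best)
    return normalized


print(normalize(["see", "u", "2morrow"]))  # ['see', 'you', 'tomorrow']
```

In a closer approximation of the described system, the feature vectors would be fed to a trained classifier whose predicted probability for each candidate replaces `score`; the module interface would stay the same, which is what makes adding or removing a generation module cheap.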
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Emotion intensity ordinal classification | Affect in Tweets EI-oc Fear SemEval-2018 Task 1 | Pearson r | 0.694 | 9 |
| Irony Detection | Irony Detection Irony-a SemEval-2018 Task 3 | F1 Score | 72.6 | 9 |
| Irony Detection | Irony Detection Irony-b SemEval-2018 Task 3 | F1 Score | 51 | 9 |
| Emotion intensity ordinal classification | Affect in Tweets EI-oc Joy SemEval-2018 Task 1 | Pearson r | 0.715 | 9 |
| Emotion intensity ordinal classification | Affect in Tweets EI-oc Anger SemEval-2018 Task 1 | Pearson r | 0.717 | 9 |
| Emotion intensity ordinal classification | Affect in Tweets EI-oc Sad SemEval-2018 Task 1 | Pearson r | 68 | 9 |
| Text Normalization | LexNorm 2015 (test) | Precision | 93.53 | 5 |