Title: Noisy Text Normalization Using an Enhanced Language Model

Year of Publication: Nov - 2014
Page Numbers: 111-122
Authors: Mohammad Arshi Saloot, Norisma Idris, Aiti Aw
Conference Name: The International Conference on Artificial Intelligence and Pattern Recognition (AIPR2014), Malaysia

Abstract:


User-generated text on social networking sites contains an enormous number and a wide variety of out-of-vocabulary words, formed both deliberately and by mistake by end users. Normalizing this noisy text is essential before applying NLP tasks. This paper describes an unsupervised normalization system comprising two phases: candidate generation and candidate selection. We generate candidates via six different methods: 1) lexical generation at one edit distance, 2) phonemic generation, 3) a blend of the previous methods, 4) lexical generation at two edit distances, 5) dictionary translation, and 6) heuristic rules. Although candidate selection uses a trigram language model, we present a new method that selects candidates with respect to all other words in the sentence. Our experiments on a large dataset show promising results.
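
The two phases described in the abstract can be illustrated with a minimal Python sketch. It covers only the first generation method (lexical candidates at one edit distance) and a plain trigram-count selection over the immediate context. It is not the authors' implementation and does not reproduce the sentence-wide selection method the paper proposes. The function names, the toy corpus, the lowercase ASCII alphabet, and the use of raw counts instead of smoothed probabilities are all assumptions made for illustration.

```python
import string
from collections import Counter

PAD = 2  # number of boundary markers added on each side of a sentence


def pad(tokens):
    """Wrap a token list with sentence-boundary markers (hypothetical convention)."""
    return ["<s>"] * PAD + tokens + ["</s>"] * PAD


def one_edit_candidates(word, vocabulary):
    """Return in-vocabulary words reachable from `word` by a single
    deletion, transposition, substitution, or insertion."""
    letters = string.ascii_lowercase  # assumption: lowercase ASCII text only
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + replaces + inserts) & vocabulary


def trigram_counts(sentences):
    """Count trigrams in a whitespace-tokenized corpus."""
    counts = Counter()
    for sentence in sentences:
        tokens = pad(sentence.split())
        for i in range(len(tokens) - 2):
            counts[tuple(tokens[i:i + 3])] += 1
    return counts


def select_candidate(tokens, position, candidates, counts):
    """Pick the candidate whose three surrounding trigrams are most frequent.
    Raw counts stand in for a smoothed trigram language model."""
    def score(candidate):
        trial = pad(tokens[:position] + [candidate] + tokens[position + 1:])
        center = position + PAD
        return sum(counts[tuple(trial[i:i + 3])]
                   for i in range(center - 2, center + 1))
    # Sort first so that ties are broken deterministically.
    return max(sorted(candidates), key=score)


# Toy clean corpus, used for both the vocabulary and the trigram counts.
corpus = ["see you tomorrow", "thank you so much"]
vocabulary = {w for s in corpus for w in s.split()}
counts = trigram_counts(corpus)

noisy = ["se", "you", "tomorrow"]
candidates = one_edit_candidates(noisy[0], vocabulary)
best = select_candidate(noisy, 0, candidates, counts)
```

In this toy setting the noisy token "se" yields two in-vocabulary candidates, "see" and "so", and the trigram counts favor "see" because only that word has been observed before "you tomorrow". A realistic system would replace the raw counts with smoothed probabilities and combine candidates from the other five generation methods.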