Document Categorization with Modified Statistical Language Models for Agglutinative Languages
- DOI
- 10.2991/ijcis.2010.3.5.12
- Keywords
- document categorization, statistical language modeling, n-gram, Turkish
- Abstract
In this paper, we investigate the document categorization task with statistical language models. Our study focuses mainly on the categorization of documents in agglutinative languages. Due to the productive morphology of agglutinative languages, the number of word forms encountered in naturally occurring text is very large. From the language modeling perspective, a large vocabulary results in serious data sparseness problems. To cope with this drawback, previous studies in various application areas suggest modified language models based on different morphological units, and report that these modified language models can improve performance. In our document categorization experiments, we use standard word form based language models as well as modified language models based on root words, root words and part-of-speech information, truncated word forms, and character sequences. Additionally, to find an optimum parameter set, multiple tests are carried out with different language model orders and smoothing methods. Similar to previous studies on other tasks, our experimental results on the categorization of Turkish documents reveal that applying linguistic preprocessing steps for language modeling provides improvements over standard language models to some extent. However, we also observe that a similar level of performance improvement can be achieved by simpler character-level or truncated word form models, which are language independent.
- Copyright
- © 2010, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
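As a rough illustration of the language-independent variants mentioned in the abstract, the following minimal sketch trains one character n-gram language model per category and assigns a document to the category whose model gives it the highest log-likelihood. It is not the implementation used in the paper: the add-one (Laplace) smoothing, the equal category priors, the names `char_ngrams`, `truncate_words`, and `NgramCategorizer`, and the tiny Turkish training set are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return all character n-grams of text, left-padded with spaces."""
    padded = " " * (n - 1) + text.lower()
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def truncate_words(text, k):
    """Keep only the first k characters of each word (a crude stand-in for a root)."""
    return " ".join(word[:k] for word in text.split())


class NgramCategorizer:
    """One character n-gram model per category, with additive smoothing (assumed)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n
        self.alpha = alpha
        self.ngram_counts = defaultdict(Counter)    # category -> n-gram counts
        self.context_counts = defaultdict(Counter)  # category -> (n-1)-gram context counts
        self.alphabet = set()

    def fit(self, docs, labels):
        for text, label in zip(docs, labels):
            for gram in char_ngrams(text, self.n):
                self.ngram_counts[label][gram] += 1
                self.context_counts[label][gram[:-1]] += 1
                self.alphabet.add(gram[-1])
        return self

    def log_likelihood(self, text, label):
        vocab_size = len(self.alphabet) + 1  # +1 reserves mass for unseen characters
        total = 0.0
        for gram in char_ngrams(text, self.n):
            numerator = self.ngram_counts[label][gram] + self.alpha
            denominator = self.context_counts[label][gram[:-1]] + self.alpha * vocab_size
            total += math.log(numerator / denominator)
        return total

    def predict(self, text):
        # Equal category priors are assumed, so only the likelihood is compared.
        categories = list(self.ngram_counts)
        return max(categories, key=lambda c: self.log_likelihood(text, c))


# Invented toy data, for illustration only.
train_docs = [
    "takım maçı kazandı",
    "futbol maçı gol",
    "borsa faiz dolar",
    "faiz oranı arttı borsa",
]
train_labels = ["sports", "sports", "economy", "economy"]

model = NgramCategorizer(n=3).fit(train_docs, train_labels)
print(model.predict("maçı gol"))
print(truncate_words("kitaplarımızdan evlerinde", 5))
```

The same class can be trained on truncated word forms by passing each document through `truncate_words` before calling `fit` and `predict`. The model order `n` and the smoothing constant `alpha` correspond loosely to the model order and smoothing choices that the abstract says were varied, although the specific smoothing methods used in the paper are not stated here.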
Cite this article
TY  - JOUR
AU  - Ahmet Cüneyd Tantuğ
PY  - 2010
DA  - 2010/10/01
TI  - Document Categorization with Modified Statistical Language Models for Agglutinative Languages
JO  - International Journal of Computational Intelligence Systems
SP  - 632
EP  - 645
VL  - 3
IS  - 5
SN  - 1875-6883
UR  - https://doi.org/10.2991/ijcis.2010.3.5.12
DO  - 10.2991/ijcis.2010.3.5.12
ID  - Tantuğ2010
ER  -