International Journal of Computational Intelligence Systems

Volume 3, Issue 5, October 2010, Pages 632 - 645

Document Categorization with Modified Statistical Language Models for Agglutinative Languages

Authors
Ahmet Cüneyd Tantuğ
Corresponding Author
Ahmet Cüneyd Tantuğ
Received 16 October 2009, Accepted 25 June 2010, Available Online 1 October 2010.
DOI
10.2991/ijcis.2010.3.5.12
Keywords
document categorization, statistical language modeling, n-gram, Turkish
Abstract

In this paper, we investigate the document categorization task with statistical language models. Our study focuses mainly on the categorization of documents in agglutinative languages. Because of the productive morphology of agglutinative languages, the number of word forms encountered in naturally occurring text is very large. From the language modeling perspective, a large vocabulary results in serious data sparseness problems. To cope with this drawback, previous studies in various application areas have suggested modified language models based on different morphological units, and they report that performance improvements can be achieved with these models. In our document categorization experiments, we use standard word-form-based language models as well as modified language models based on root words, root words with part-of-speech information, truncated word forms, and character sequences. Additionally, to find an optimal parameter set, we carry out multiple tests with different language model orders and smoothing methods. In line with previous studies on other tasks, our experimental results on the categorization of Turkish documents reveal that applying linguistic preprocessing steps for language modeling provides some improvement over standard language models. However, we also observe that a similar level of improvement can be obtained with simpler character-level or truncated-word-form models, which are language independent.
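As a rough illustration of the idea described in the abstract, the sketch below trains one n-gram language model per category and assigns a document to the category whose model gives it the highest log-probability. This is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not specify the decision rule, so the per-category likelihood comparison is assumed here, and the bigram order, add-one smoothing, truncation length of 3, toy Turkish training sentences, and all function and class names are hypothetical choices made for brevity. Root-word and part-of-speech units are omitted because they would require a morphological analyzer.

```python
import math
from collections import Counter


def truncate_units(text, k=3):
    """Truncated word forms: keep only the first k characters of each word."""
    return [w[:k] for w in text.lower().split()]


def char_units(text):
    """Character sequences: every character (spaces included) is one unit."""
    return list(text.lower())


class BigramLM:
    """Bigram model with add-one smoothing (an assumption made for brevity)."""

    def __init__(self, docs, unit_fn):
        self.unit_fn = unit_fn
        self.bigrams = Counter()   # counts of (previous unit, current unit)
        self.contexts = Counter()  # counts of the previous unit alone
        self.vocab = set()
        for doc in docs:
            units = ["<s>"] + unit_fn(doc) + ["</s>"]
            self.vocab.update(units)
            for prev, cur in zip(units, units[1:]):
                self.bigrams[(prev, cur)] += 1
                self.contexts[prev] += 1

    def log_prob(self, doc):
        """Smoothed log-probability of a document under this model."""
        units = ["<s>"] + self.unit_fn(doc) + ["</s>"]
        v = len(self.vocab) + 1  # one extra slot for unseen units
        return sum(
            math.log((self.bigrams[(p, c)] + 1) / (self.contexts[p] + v))
            for p, c in zip(units, units[1:])
        )


def categorize(doc, models):
    """Return the label whose language model scores the document highest."""
    return max(models, key=lambda label: models[label].log_prob(doc))


# Toy training data, invented purely for illustration.
training = {
    "sports": ["takım maçı kazandı", "futbolcu gol attı"],
    "economy": ["borsa bugün yükseldi", "dolar kuru düştü"],
}

# One model per category, here built on truncated word forms.
models = {label: BigramLM(docs, truncate_units) for label, docs in training.items()}

# The inflected forms below never occur in training, but their
# truncated prefixes ("tak", "maç", "kaz") do.
print(categorize("takımlar maçları kazandı", models))
```

Passing `char_units` instead of `truncate_units` when constructing `BigramLM` switches the same pipeline to character-level modeling, which hints at why such units can be swapped in without any language-specific tooling.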

Copyright
© 2010, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Journal
International Journal of Computational Intelligence Systems
Volume-Issue
3 - 5
Pages
632 - 645
Publication Date
2010/10/01
ISSN (Online)
1875-6883
ISSN (Print)
1875-6891

Cite this article

TY  - JOUR
AU  - Ahmet Cüneyd Tantuğ
PY  - 2010
DA  - 2010/10/01
TI  - Document Categorization with Modified Statistical Language Models for Agglutinative Languages
JO  - International Journal of Computational Intelligence Systems
SP  - 632
EP  - 645
VL  - 3
IS  - 5
SN  - 1875-6883
UR  - https://doi.org/10.2991/ijcis.2010.3.5.12
DO  - 10.2991/ijcis.2010.3.5.12
ID  - Tantuğ2010
ER  -