A Novel Term Selection Approach in sLDA for Imbalanced Text Categorization
- DOI
- 10.2991/isci-15.2015.231
- Keywords
- sLDA; imbalanced dataset; text categorization; topic model.
- Abstract
Supervised Latent Dirichlet Allocation (sLDA) is a probabilistic topic model of labelled documents that performs better than unsupervised LDA for text categorization. However, sLDA experiments have relied on the default assumption that the corpus is balanced, that is, that each class has approximately the same number of samples, and have chosen the vocabulary by tf-idf. When the corpus is imbalanced, tf-idf tends to choose terms from the majority classes and ignore terms from the minority classes, which severely degrades the performance of the text classifier. This paper therefore proposes a new term selection approach that fairly chooses more discriminative terms from every category. Experimental results show that using this approach in sLDA for imbalanced text categorization greatly improves the recall and precision of the minority classes, and that it is superior to tf-idf.
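The abstract does not spell out the proposed selection rule, so the following is only a minimal sketch of the problem it describes, not the authors' method. It assumes a corpus-level tf-idf score (total term frequency times log of N over document frequency) and contrasts a single global top-k cut with a hypothetical per-class quota. The toy corpus and the names `global_top_k` and `per_class_top_k` are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict

# Toy imbalanced corpus (assumed for illustration): 6 "sport" docs, 2 "law" docs.
docs = [
    ["goal", "goal", "match", "match"],
    ["goal", "goal", "coach", "coach"],
    ["match", "match", "coach", "coach"],
    ["goal", "match", "coach"],
    ["goal", "match", "league"],
    ["coach", "league", "team"],
    ["court", "judge"],
    ["court", "appeal"],
]
labels = ["sport"] * 6 + ["law"] * 2


def doc_freq(docs):
    """Number of documents in which each term appears at least once."""
    return Counter(t for d in docs for t in set(d))


def rank(tf, df, n):
    """Rank terms by tf * log(n / df), breaking ties alphabetically."""
    scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return sorted(scores, key=lambda t: (-scores[t], t))


def global_top_k(docs, k):
    """Baseline: one tf-idf ranking over the whole corpus."""
    tf = Counter(t for d in docs for t in d)
    return rank(tf, doc_freq(docs), len(docs))[:k]


def per_class_top_k(docs, labels, k_per_class):
    """Hypothetical balanced variant: every class contributes its own top terms."""
    by_class = defaultdict(list)
    for d, y in zip(docs, labels):
        by_class[y].append(d)
    df, n = doc_freq(docs), len(docs)
    vocab = []
    for y in sorted(by_class):
        tf = Counter(t for d in by_class[y] for t in d)
        for t in rank(tf, df, n)[:k_per_class]:
            if t not in vocab:
                vocab.append(t)
    return vocab


global_vocab = global_top_k(docs, 3)
balanced_vocab = per_class_top_k(docs, labels, 2)
print(global_vocab)
print(balanced_vocab)
```

On this toy corpus the global cut keeps only terms from the majority class, while the per-class quota guarantees that minority-class terms such as `court` enter the vocabulary. The quota is a stand-in chosen for simplicity; the paper's own criterion for "more discriminative" terms may differ.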
- Copyright
- © 2015, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - CONF
AU - Zhenyan Liu
AU - Dan Meng
AU - Weiping Wang
AU - Yong Wang
AU - Chenhao Bai
PY - 2015/01
DA - 2015/01
TI - A Novel Term Selection Approach in sLDA for Imbalanced Text Categorization
BT - Proceedings of the 2015 International Symposium on Computers & Informatics
PB - Atlantis Press
SP - 1733
EP - 1740
SN - 2352-538X
UR - https://doi.org/10.2991/isci-15.2015.231
DO - 10.2991/isci-15.2015.231
ID - Liu2015/01
ER -