A Novel Feature Selection Method Based on Category Distribution Ratio in Text Classification
- DOI
- 10.2991/mbdasm-19.2019.45
- Keywords
- text classification; feature selection; feature reduce
- Abstract
In text classification, texts are represented as a high-dimensional, sparse matrix whose dimensionality equals the total number of terms across all texts. Using all terms for classification reduces both accuracy and efficiency. A feature selection algorithm can select the features most relevant to the text category and reduce the dimension of the text representation vector. In this paper, we propose a new feature ranking metric, the category distribution ratio (CDR), which takes into account the true positive rate and false positive rate of a term, as well as their difference, when estimating the term's significance. To demonstrate the effectiveness of the proposed feature selection algorithm, we compare its performance against six metrics (balanced accuracy measure (ACC), odds ratio (OR), Gini index (GI), max-min ratio (MMR), normalized difference measure (NDM), and chi-square (CHI)) on three benchmark data sets (20 Newsgroups, Ohsumed, Reuters-21578) using multinomial naive Bayes, support vector machine, and k-nearest neighbor classifiers. The experimental results show that the macro F1 score obtained with CDR-based feature selection is higher than that obtained with the other six metrics.
- Copyright
- © 2019, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
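The abstract describes ranking terms by their true positive rate and false positive rate. As a rough illustration only, the sketch below computes both rates per term from tokenized documents and ranks terms with the balanced accuracy measure (ACC), one of the baselines named in the abstract, which is commonly defined as |tpr - fpr|. The CDR formula itself is not given on this page and is not reproduced here; the function names `term_rates` and `rank_terms` and the toy corpus are hypothetical.

```python
from collections import Counter


def term_rates(docs, labels, positive):
    """Return {term: (tpr, fpr)} with one class treated as positive.

    tpr = fraction of positive-class documents containing the term
    fpr = fraction of all other documents containing the term
    """
    n_pos = sum(1 for y in labels if y == positive)
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        target = pos_df if y == positive else neg_df
        target.update(set(tokens))  # document frequency: count each term once per doc
    rates = {}
    for term in set(pos_df) | set(neg_df):
        tpr = pos_df[term] / n_pos if n_pos else 0.0
        fpr = neg_df[term] / n_neg if n_neg else 0.0
        rates[term] = (tpr, fpr)
    return rates


def rank_terms(rates, k):
    """Rank terms by |tpr - fpr| (ACC baseline, NOT the paper's CDR metric)."""
    scored = {t: abs(tpr - fpr) for t, (tpr, fpr) in rates.items()}
    # Sort by descending score, breaking ties alphabetically for determinism.
    return sorted(scored, key=lambda t: (-scored[t], t))[:k]


# Hypothetical toy corpus: two "sport" documents and two "finance" documents.
docs = [
    ["goal", "match", "team"],
    ["team", "coach", "goal"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]

rates = term_rates(docs, labels, positive="sport")
top = rank_terms(rates, k=3)
print(top)
```

A selected term set like `top` would then define the reduced vocabulary passed to a classifier; any metric built on the same two rates could be swapped into `rank_terms`.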
Cite this article
TY - CONF
AU - Pujian Zong
AU - Jian Bian
PY - 2019/10
DA - 2019/10
TI - A Novel Feature Selection Method Based on Category Distribution Ratio in Text Classification
BT - Proceedings of the 2019 International Conference on Mathematics, Big Data Analysis and Simulation and Modelling (MBDASM 2019)
PB - Atlantis Press
SP - 195
EP - 200
SN - 2352-538X
UR - https://doi.org/10.2991/mbdasm-19.2019.45
DO - 10.2991/mbdasm-19.2019.45
ID - Zong2019/10
ER -