An Improved Feature Selection Algorithm Utilizing the within Category Variance
- DOI
- 10.2991/eame-15.2015.217
- Keywords
- text classification; feature selection; χ² statistic
- Abstract
The χ² statistic is a commonly used and effective feature selection method for text corpora. However, it suffers from several deficiencies. First, it counts only the document frequency of each feature. Second, it does not distinguish among features that have different frequency distributions within a category. To overcome these shortcomings, two indexes, namely the within-category frequency and the within-category variance, are introduced. Experiments compare the traditional χ² statistic, some existing improvements, and the improved χ² statistic proposed in this paper, using either a naive Bayesian or an SVM classifier on the corpora collected by Fudan University and Sogou. The experimental results indicate that the proposed improvement is effective and robust across classifiers and corpora.
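As a rough illustration of the quantities named in the abstract, the sketch below computes the standard χ² (chi-square) score of a term for a category from document counts, together with one plausible reading of the two proposed indexes. The function names, the token-list document representation, and the exact definitions of within-category frequency (total occurrences of the term in the category) and within-category variance (population variance of per-document counts) are assumptions made for illustration. The abstract does not state how the paper defines these indexes or combines them with χ², so no combined score is shown.

```python
from statistics import pvariance


def chi_square(a, b, c, d):
    """Standard chi-square score of a term for a category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (an empty row or column): no evidence either way.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def within_category_stats(term, category_docs):
    """Return (frequency, variance) of a term inside one category.

    Assumed definitions, not taken from the paper:
    frequency is the total number of occurrences across the category's
    documents, and variance is the population variance of the
    per-document occurrence counts.
    category_docs is a non-empty list of token lists.
    """
    counts = [doc.count(term) for doc in category_docs]
    return sum(counts), pvariance(counts)
```

Under these assumed definitions, two terms with the same document frequency in a category receive the same χ² score, yet can differ in total frequency and in how evenly their occurrences are spread over the category's documents, which is the distinction the abstract says plain χ² misses.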
- Copyright
- © 2015, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - CONF
AU - P.J. Zhang
AU - S.C. Gan
PY - 2015/07
DA - 2015/07
TI - An Improved Feature Selection Algorithm Utilizing the within Category Variance
BT - Proceedings of the 2015 International Conference on Electrical, Automation and Mechanical Engineering
PB - Atlantis Press
SP - 808
EP - 810
SN - 2352-5401
UR - https://doi.org/10.2991/eame-15.2015.217
DO - 10.2991/eame-15.2015.217
ID - Zhang2015/07
ER -