Utilizing Multi-Field Text Features for Efficient Email Spam Filtering
- DOI
- 10.1080/18756891.2012.696915
- Keywords
- Email Spam Filtering, Text Classification, Multi-Field Learning, Lightweight Field Classifier, Power Law, TREC Spam Track
- Abstract
Large-scale spam email wastes considerable time and resources. This paper investigates the text features of email documents and the feature noise among multi-field texts, and observes that feature strings within each text field follow a power law distribution. Based on this observation, we propose an efficient filtering approach comprising a compound weight method and a lightweight field text classification algorithm. The compound weight method considers both the historical classifying ability of each field classifier and the classifying contribution of each text field in the email currently being classified. The lightweight field text classification algorithm directly computes the arithmetic mean of multiple conditional probabilities predicted from feature strings, using a string-frequency index that stores labeled emails. Owing to the power law distribution, the string-frequency index structure is compressible by random sampling, which can greatly reduce storage space. Experimental results on the TREC spam track show that the proposed approach completes the filtering task at low space cost and high speed, and that its overall performance, measured by 1-ROCA, exceeds that of the best participant in the trec07p evaluation.
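The two components named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how feature strings are extracted or how the compound weight is computed, so the character 4-grams in `feature_strings`, the weight formula in `combine` (historical weight times the number of matched feature strings), and all class and function names are assumptions made purely for illustration.

```python
from collections import defaultdict


def feature_strings(text, n=4):
    """Overlapping character n-grams (an assumed feature extraction)."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class FieldClassifier:
    """Hypothetical lightweight classifier for a single text field."""

    def __init__(self):
        # String-frequency index: feature string -> [spam count, ham count].
        self.index = defaultdict(lambda: [0, 0])

    def train(self, text, is_spam):
        # Count each distinct feature string once per labeled email.
        for s in set(feature_strings(text)):
            self.index[s][0 if is_spam else 1] += 1

    def predict(self, text):
        """Return (spam score, number of feature strings found in the index)."""
        probs = []
        for s in set(feature_strings(text)):
            if s in self.index:
                spam, ham = self.index[s]
                probs.append(spam / (spam + ham))  # P(spam | feature string)
        if not probs:
            return 0.5, 0  # no evidence: neutral score
        # Arithmetic mean of the per-string conditional probabilities.
        return sum(probs) / len(probs), len(probs)


def combine(field_results, historical_weights):
    """Weighted average of field scores (an assumed compound weight).

    Each field's weight is its historical weight multiplied by the number
    of feature strings it matched in the current email.
    """
    num = den = 0.0
    for field, (score, matched) in field_results.items():
        w = historical_weights.get(field, 1.0) * matched
        num += w * score
        den += w
    return num / den if den else 0.5


# Toy usage with two fields and made-up training data.
subject = FieldClassifier()
body = FieldClassifier()
subject.train("cheap pills online", True)
subject.train("meeting agenda", False)
body.train("buy cheap pills now", True)
body.train("please review the agenda", False)

results = {
    "subject": subject.predict("cheap pills"),
    "body": body.predict("agenda for the meeting"),
}
score = combine(results, {"subject": 0.9, "body": 0.8})
print(f"spam score: {score:.3f}")
```

Because each prediction is only a handful of dictionary lookups and one average, such a classifier stays cheap per field. The sampling-based compression of the index described in the abstract is not modeled here.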
- Copyright
- © 2017, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY  - JOUR
AU  - Wuying Liu
AU  - Ting Wang
PY  - 2012
DA  - 2012/06/01
TI  - Utilizing Multi-Field Text Features for Efficient Email Spam Filtering
JO  - International Journal of Computational Intelligence Systems
SP  - 505
EP  - 518
VL  - 5
IS  - 3
SN  - 1875-6883
UR  - https://doi.org/10.1080/18756891.2012.696915
DO  - 10.1080/18756891.2012.696915
ID  - Liu2012
ER  -