International Journal of Computational Intelligence Systems

Volume 5, Issue 3, June 2012, Pages 505 - 518

Utilizing Multi-Field Text Features for Efficient Email Spam Filtering

Authors
Wuying Liu, Ting Wang
Corresponding Author
Ting Wang
Received 12 December 2010, Accepted 24 January 2012, Available Online 1 June 2012.
DOI
10.1080/18756891.2012.696915
Keywords
Email Spam Filtering, Text Classification, Multi-Field Learning, Lightweight Field Classifier, Power Law, TREC Spam Track
Abstract

Large-scale spam email wastes considerable time and resources. This paper investigates the text features of email documents and the feature noise among multi-field texts, and observes that the feature strings within each text field follow a power law distribution. Based on this observation, we propose an efficient filtering approach that comprises a compound weight method and a lightweight field text classification algorithm. The compound weight method considers both the historical classifying ability of each field classifier and the classifying contribution of each text field in the email currently being classified. The lightweight field text classification algorithm directly computes the arithmetic average of multiple conditional probabilities predicted from feature strings, using a string-frequency index that stores the labeled emails. Owing to the power law distribution, the string-frequency index structure can be compressed by random sampling, which greatly reduces storage space. Experimental results on the TREC spam track show that the proposed approach completes the filtering task at low space cost and high speed, and that its overall 1-ROCA performance exceeds that of the best participant in the trec07p evaluation.
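
The averaging step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name StringFrequencyIndex, the use of overlapping character 4-grams as feature strings, the estimate P(spam | s) = spam count / total count, and the neutral score of 0.5 for text with no known feature strings are all assumptions made for illustration.

```python
from collections import defaultdict


class StringFrequencyIndex:
    """Hypothetical index mapping a feature string to [spam count, ham count]."""

    def __init__(self, n=4):
        self.n = n  # assumed n-gram length; not taken from the paper
        self.counts = defaultdict(lambda: [0, 0])

    def features(self, text):
        # Assumption: feature strings are overlapping character n-grams.
        return {text[i:i + self.n] for i in range(len(text) - self.n + 1)}

    def train(self, text, is_spam):
        # Record one labeled text by incrementing the count for its class.
        for s in self.features(text):
            self.counts[s][0 if is_spam else 1] += 1

    def spam_score(self, text):
        # Arithmetic average of P(spam | s) over feature strings seen in training.
        probs = []
        for s in self.features(text):
            if s in self.counts:
                spam, ham = self.counts[s]
                probs.append(spam / (spam + ham))
        # Assumed fallback: a neutral score when no feature string is known.
        return sum(probs) / len(probs) if probs else 0.5


index = StringFrequencyIndex(n=4)
index.train("cheap pills now", is_spam=True)
index.train("meeting at noon", is_spam=False)
print(index.spam_score("cheap pills"))
```

In a multi-field setting, one such index would be kept per text field (for example subject and body), and the per-field scores would then be combined; the compound weight method that the paper uses for that combination is not reproduced here.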

Copyright
© 2017, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Journal
International Journal of Computational Intelligence Systems
Volume-Issue
5 - 3
Pages
505 - 518
Publication Date
2012/06/01
ISSN (Online)
1875-6883
ISSN (Print)
1875-6891

Cite this article

TY  - JOUR
AU  - Wuying Liu
AU  - Ting Wang
PY  - 2012
DA  - 2012/06/01
TI  - Utilizing Multi-Field Text Features for Efficient Email Spam Filtering
JO  - International Journal of Computational Intelligence Systems
SP  - 505
EP  - 518
VL  - 5
IS  - 3
SN  - 1875-6883
UR  - https://doi.org/10.1080/18756891.2012.696915
DO  - 10.1080/18756891.2012.696915
ID  - Liu2012
ER  -