A Type of Web Content Extraction Algorithm Based on Adaptive Threshold

Guang Zheng; Xianghui Hui; Xin Xu; Lei Xi

doi:10.2991/icsma-16.2016.45

Corresponding Author

Guang Zheng

Available Online December 2016.

DOI: 10.2991/icsma-16.2016.45
Keywords: new rural community; Web information fetching; text density; adaptive threshold; Otsu threshold algorithm; Web page text extraction algorithm
Abstract: Building on text extraction based on text density, a Web page text extraction algorithm based on an adaptive threshold is proposed, combining text density with the Otsu threshold algorithm. The algorithm is applied in a new rural community employment information service system to fetch employment information from related government affairs websites. In comparative text extraction experiments on Web pages from "The Ministry of Human Resources and Social Security of the People's Republic of China", "The Ministry of Human Resources and Social Security Hall of Henan Province" and "Sina.com", the text extraction rate of the algorithm reached 90%, 92% and 92%, respectively. The results show that applying the algorithm in the new rural community employment information service system can provide technical support for targeted employment information acquisition and enable accurate employment information retrieval.
Copyright: © 2016, the Authors. Published by Atlantis Press.
Open Access: This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
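
The abstract names two ingredients, text density and an adaptive threshold chosen by the Otsu algorithm, without giving the formulas. The following is a minimal illustrative sketch of how the two could fit together, not the authors' implementation. It assumes that density is measured per line as the share of characters left after removing tags, and that Otsu's criterion (maximizing between-class variance) is applied directly to those densities. The names text_density, otsu_threshold and extract_text are hypothetical.

```python
import re

# Assumption: a simple regex is enough to strip tags for this illustration.
TAG_RE = re.compile(r"<[^>]+>")


def text_density(line: str) -> float:
    """Share of a line's characters that remain after removing tags."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    text = TAG_RE.sub("", stripped).strip()
    return len(text) / len(stripped)


def otsu_threshold(values: list[float]) -> float:
    """Return the cut that maximizes between-class variance (Otsu's criterion)."""
    ordered = sorted(set(values))
    if len(ordered) < 2:
        # No two distinct values, so there is nothing to separate.
        return ordered[0] if ordered else 0.0
    n = len(values)
    best_cut, best_score = ordered[0], -1.0
    # Candidate cuts are midpoints between neighbouring distinct values,
    # so both classes are always non-empty.
    for low, high in zip(ordered, ordered[1:]):
        cut = (low + high) / 2
        below = [v for v in values if v <= cut]
        above = [v for v in values if v > cut]
        w0, w1 = len(below) / n, len(above) / n
        mu0 = sum(below) / len(below)
        mu1 = sum(above) / len(above)
        score = w0 * w1 * (mu0 - mu1) ** 2  # between-class variance
        if score > best_score:
            best_cut, best_score = cut, score
    return best_cut


def extract_text(html: str) -> list[str]:
    """Keep the lines whose density exceeds the page-specific threshold."""
    lines = [ln for ln in html.splitlines() if ln.strip()]
    densities = [text_density(ln) for ln in lines]
    cut = otsu_threshold(densities)
    return [
        TAG_RE.sub("", ln).strip()
        for ln, d in zip(lines, densities)
        if d > cut
    ]
```

Because the threshold is recomputed from each page's own density values, no fixed cutoff has to be tuned per site, which is the sense in which such a threshold is adaptive. If every line has the same density, this sketch finds no split and returns nothing.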

Volume Title: Proceedings of the 2016 4th International Conference on Sensors, Mechatronics and Automation (ICSMA 2016)
Series: Advances in Intelligent Systems Research
Publication Date: December 2016
ISBN: 978-94-6252-274-9
ISSN: 1951-6851

Cite this article

TY  - CONF
AU  - Guang Zheng
AU  - Xianghui Hui
AU  - Xin Xu
AU  - Lei Xi
PY  - 2016/12
DA  - 2016/12
TI  - A Type of Web Content Extraction Algorithm Based on Adaptive Threshold
BT  - Proceedings of the 2016 4th International Conference on Sensors, Mechatronics and Automation (ICSMA 2016)
PB  - Atlantis Press
SP  - 244
EP  - 250
SN  - 1951-6851
UR  - https://doi.org/10.2991/icsma-16.2016.45
DO  - 10.2991/icsma-16.2016.45
ID  - Zheng2016/12
ER  -