A Type of Web Content Extraction Algorithm Based on Adaptive Threshold
- DOI
- 10.2991/icsma-16.2016.45
- Keywords
- new rural community; Web information fetching; text density; adaptive threshold; Otsu threshold algorithm; Web page text extraction algorithm
- Abstract
Building on text extraction based on text density, this paper proposes a Web page text extraction algorithm based on an adaptive threshold, combined with the Otsu threshold algorithm. The algorithm was applied in a new rural community employment information service system to fetch employment information from related government affairs websites. In comparative text extraction experiments on Web pages from the Ministry of Human Resources and Social Security of the People's Republic of China, the Department of Human Resources and Social Security of Henan Province, and Sina.com, the algorithm's text extraction rate reached 90%, 92%, and 92%, respectively. The results show that applying the algorithm in the new rural community employment information service system can provide technical support for targeted employment information acquisition and enable accurate employment information retrieval.
- Copyright
- © 2016, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
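The idea summarized in the abstract, scoring parts of a page by text density and then choosing the cut-off adaptively with Otsu's criterion, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the line-based density measure, the regular-expression tag stripping, and the function names `text_density`, `otsu_threshold`, and `extract_text` are all assumptions made for illustration.

```python
import re

# Assumption: tags are stripped with a simple regex, which is enough for a demo
# but not for arbitrary real-world HTML.
TAG_RE = re.compile(r"<[^>]+>")


def text_density(line):
    """Fraction of a line's characters that remain after stripping tags."""
    stripped = line.strip()
    if not stripped:
        return 0.0
    text = TAG_RE.sub("", stripped).strip()
    return len(text) / len(stripped)


def otsu_threshold(values):
    """Return the cut that maximizes between-class variance (Otsu's criterion)."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    best_cut, best_score = ordered[0], -1.0
    running = 0.0
    for i in range(1, n):
        running += ordered[i - 1]
        if ordered[i] == ordered[i - 1]:
            continue  # no cut can separate equal values
        w0, w1 = i / n, (n - i) / n
        mu0, mu1 = running / i, (total - running) / (n - i)
        score = w0 * w1 * (mu0 - mu1) ** 2
        if score > best_score:
            best_score = score
            best_cut = (ordered[i - 1] + ordered[i]) / 2
    return best_cut


def extract_text(html):
    """Keep the lines whose text density lies above the adaptive threshold."""
    lines = [ln for ln in html.splitlines() if ln.strip()]
    if not lines:
        return []
    densities = [text_density(ln) for ln in lines]
    cut = otsu_threshold(densities)
    return [TAG_RE.sub("", ln).strip()
            for ln, d in zip(lines, densities) if d > cut]
```

Because the threshold is recomputed from each page's own density distribution, no fixed cut-off has to be tuned per site. In this sketch, a page whose lines all have the same density yields no output, since no cut separates them.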
Cite this article
TY  - CONF
AU  - Guang Zheng
AU  - Xianghui Hui
AU  - Xin Xu
AU  - Lei Xi
PY  - 2016/12
DA  - 2016/12
TI  - A Type of Web Content Extraction Algorithm Based on Adaptive Threshold
BT  - Proceedings of the 2016 4th International Conference on Sensors, Mechatronics and Automation (ICSMA 2016)
PB  - Atlantis Press
SP  - 244
EP  - 250
SN  - 1951-6851
UR  - https://doi.org/10.2991/icsma-16.2016.45
DO  - 10.2991/icsma-16.2016.45
ID  - Zheng2016/12
ER  -