An Algorithm to Extract and Judge the Main Text Based on the Law of Total Probability
Authors
Qingsong Lv, Shulin Cao, Yifan Wang, Qian Yin, Xin Zheng
Corresponding Author
Qingsong Lv
Available Online April 2017.
- DOI
- 10.2991/emim-17.2017.19
- Keywords
- Web page extraction; Web page classification; Law of total probability
- Abstract
Because web pages vary widely in content and have complex structure, a uniform algorithm for processing them is of great value. In this paper, we propose an algorithm, called the P value algorithm, to extract the main text of a web page. By calculating the P value of each tag in an HTML page, we can locate the main text. Moreover, the P value of a web page also represents the probability of the statement "This web page has main text". Experiments show that, without any prior knowledge, the algorithm extracts the main text with 95.42% accuracy and judges whether a page has main text with 93.98% accuracy.
- Copyright
- © 2017, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
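The abstract does not give the definition of the P value, so the following is only an illustrative sketch of how a per-tag score could be combined with the law of total probability, P(A) = sum_i P(A | B_i) P(B_i). Everything in it is an assumption made for illustration: the `Node` class, the use of link density as the leaf-level probability, and the text-length weights are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for an HTML tag; not the paper's data model."""
    tag: str
    text_len: int = 0   # characters of text directly inside this tag
    link_len: int = 0   # how many of those characters sit inside links
    children: list["Node"] = field(default_factory=list)


def total_len(node: Node) -> int:
    """All text characters under the node, including its descendants."""
    return node.text_len + sum(total_len(c) for c in node.children)


def p_value(node: Node) -> float:
    """Assumed score: probability that a random character under `node`
    belongs to the main text, combined via the law of total probability.

    B_i = "the character lies in part i" (own text or one child subtree),
    P(B_i) = that part's share of the text, and
    P(A | B_i) = the part's own score.
    """
    total = total_len(node)
    if total == 0:
        return 0.0
    # Assumed leaf-level probability: text outside links is main text.
    own_p = 1.0 - node.link_len / node.text_len if node.text_len else 0.0
    p = (node.text_len / total) * own_p
    for child in node.children:
        p += (total_len(child) / total) * p_value(child)
    return p


def best_node(root: Node) -> Node:
    """Return the tag with the highest score (ties go to the longer text)."""
    stack, nodes = [root], []
    while stack:
        n = stack.pop()
        nodes.append(n)
        stack.extend(n.children)
    return max(nodes, key=lambda n: (p_value(n), total_len(n)))


# Toy page: a link-only navigation bar and a link-free article body.
nav = Node("nav", text_len=100, link_len=100)
article = Node("article", text_len=900, link_len=0)
page = Node("body", children=[nav, article])

page_score = p_value(page)       # weighted mix of the two children
main_tag = best_node(page).tag   # tag most likely to hold the main text
```

Under these assumptions, the score at the root plays the role the abstract assigns to the page-level P value (how likely the page is to have main text), and the highest-scoring tag marks where that text is.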
Cite this article
TY - CONF
AU - Qingsong Lv
AU - Shulin Cao
AU - Yifan Wang
AU - Qian Yin
AU - Xin Zheng
PY - 2017/04
DA - 2017/04
TI - An Algorithm to Extract and Judge the Main Text Based on the Law of Total Probability
BT - Proceedings of the 7th International Conference on Education, Management, Information and Mechanical Engineering (EMIM 2017)
PB - Atlantis Press
SP - 93
EP - 96
SN - 2352-538X
UR - https://doi.org/10.2991/emim-17.2017.19
DO - 10.2991/emim-17.2017.19
ID - Lv2017/04
ER -