An Algorithm to Extract and Judge the Main Text Based on the Law of Total Probability

Qingsong Lv; Shulin Cao; Yifan Wang; Qian Yin; Xin Zheng

doi:10.2991/emim-17.2017.19

Corresponding Author

Qingsong Lv

Available Online April 2017.

DOI: 10.2991/emim-17.2017.19
Keywords: Web page extraction; Web page classification; Law of total probability
Abstract: Because web pages vary widely in content and have complex structure, a uniform algorithm for handling them is of great value. In this paper, we propose the P value algorithm to extract the main text of a web page. By calculating the P value of each tag in an HTML page, we can locate the main text. Moreover, the P value of the page as a whole can represent the probability of "This web page has main text". Experiments show an accuracy of 95.42% for extracting the main text and 93.98% for judging whether a page has main text, without any prior knowledge.
Copyright: © 2017, the Authors. Published by Atlantis Press.
Open Access: This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
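
The abstract does not state how a P value is computed, so the following is only a hypothetical sketch of how the law of total probability, P(M) = Σ P(M | B_i) P(B_i), could combine per-tag scores over an HTML tree. The `Node` class, the leaf score (share of non-link text), and the weights (share of text length) are all assumptions made for illustration and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A simplified HTML tag (hypothetical structure, not from the paper)."""
    tag: str
    text_len: int = 0   # characters of text held directly by a leaf tag
    link_len: int = 0   # how many of those characters sit inside links
    children: List["Node"] = field(default_factory=list)


def total_len(node: Node) -> int:
    """Total text length under a tag."""
    if not node.children:
        return node.text_len
    return sum(total_len(c) for c in node.children)


def p_value(node: Node) -> float:
    """Assumed P(main text | tag).

    Leaf: share of non-link text (an illustrative heuristic).
    Inner tag: law of total probability over its children, where
    P(B_i) is the child's share of the parent's text.
    """
    if not node.children:
        if node.text_len == 0:
            return 0.0
        return (node.text_len - node.link_len) / node.text_len
    total = total_len(node)
    if total == 0:
        return 0.0
    return sum(p_value(c) * total_len(c) / total for c in node.children)


def locate_main(node: Node) -> Node:
    """Follow the child contributing most to the P value down to a leaf."""
    while node.children:
        node = max(node.children, key=lambda c: p_value(c) * total_len(c))
    return node


page = Node("body", children=[
    Node("nav", text_len=100, link_len=100),      # link-only navigation
    Node("article", text_len=900, link_len=0),    # plain running text
])
print(p_value(page), locate_main(page).tag)
```

In this toy page the navigation block contributes nothing and the article carries 90% of the text, so the page-level value is 0.9 and the search lands on the `article` tag. A real implementation would need the leaf scores and weights defined in the paper itself.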

Volume Title: Proceedings of the 7th International Conference on Education, Management, Information and Mechanical Engineering (EMIM 2017)
Series: Advances in Computer Science Research
Publication Date: April 2017
ISBN: 978-94-6252-356-2
ISSN: 2352-538X

Cite this article

TY  - CONF
AU  - Qingsong Lv
AU  - Shulin Cao
AU  - Yifan Wang
AU  - Qian Yin
AU  - Xin Zheng
PY  - 2017/04
DA  - 2017/04
TI  - An Algorithm to Extract and Judge the Main Text Based on the Law of Total Probability
BT  - Proceedings of the 7th International Conference on Education, Management, Information and Mechanical Engineering (EMIM 2017)
PB  - Atlantis Press
SP  - 93
EP  - 96
SN  - 2352-538X
UR  - https://doi.org/10.2991/emim-17.2017.19
DO  - 10.2991/emim-17.2017.19
ID  - Lv2017/04
ER  -
