Quality Assessment Method of Web Documents Based on Random Forest
- DOI
- 10.2991/icmmcce-17.2017.190
- Keywords
- Web document; quality assessment; LDA topic model; random forest
- Abstract
This paper proposes a Random Forest (RF) based method for assessing the quality of web documents and formulates a novel quality evaluation index system covering features of organization structure, network access, and content. To extract the content feature of a document, a topic coverage degree calculation model based on LDA is put forward. Experiments are conducted on two document sets, Wikipedia and Baidu Encyclopedia, and precision, recall, and F-measure are used to verify the validity of the proposed quality assessment method. Experimental results show that the proposed evaluation index system and the RF-based quality assessment method achieve good performance.
- Copyright
- © 2017, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - CONF
AU - Li He
AU - Li Tang
AU - Ning Wang
PY - 2017/09
DA - 2017/09
TI - Quality Assessment Method of Web Documents Based on Random Forest
BT - Proceedings of the 2017 5th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering (ICMMCCE 2017)
PB - Atlantis Press
SP - 1058
EP - 1065
SN - 2352-5401
UR - https://doi.org/10.2991/icmmcce-17.2017.190
DO - 10.2991/icmmcce-17.2017.190
ID - He2017/09
ER -