Dis-Dyn Crawler: A Distributed Crawler for Dynamic Web Page

Jianfu Cai; Hua Zhang

doi:10.2991/icmmcce-15.2015.505

Corresponding Author

Jianfu Cai

Available Online December 2015.

DOI: 10.2991/icmmcce-15.2015.505
Keywords: distributed crawler, dynamic web page, HtmlUnit.
Abstract: Using AJAX to deliver rich content has become widespread in modern web applications, which causes two serious problems for web crawlers. The first is that the information retrieved from a page is incomplete, because the crawler cannot parse dynamic content. The second is crawler efficiency. To address both problems, this paper proposes a distributed dynamic web crawler named Dis-Dyn Crawler. The system uses HtmlUnit to parse dynamic pages and uses Redis and ZMQ (ZeroMQ) to distribute the crawl, which improves the crawler's efficiency. The experimental results show that Dis-Dyn Crawler performs better than Nutch, a distributed crawler system, and that the efficiency of dynamic page parsing is also improved.
Copyright: © 2015, the Authors. Published by Atlantis Press.
Open Access: This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title: Proceedings of the 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering 2015
Series: Advances in Computer Science Research
Publication Date: December 2015
ISBN: 978-94-6252-133-9
ISSN: 2352-538X

Cite this article

TY  - CONF
AU  - Jianfu Cai
AU  - Hua Zhang
PY  - 2015/12
DA  - 2015/12
TI  - Dis-Dyn Crawler: A Distributed Crawler for Dynamic Web Page
BT  - Proceedings of the 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering 2015
PB  - Atlantis Press
SN  - 2352-538X
UR  - https://doi.org/10.2991/icmmcce-15.2015.505
DO  - 10.2991/icmmcce-15.2015.505
ID  - Cai2015/12
ER  -
