Proceedings of the 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering 2015

Dis-Dyn Crawler: A Distributed Crawler for Dynamic Web Page

Authors
Jianfu Cai, Hua Zhang
Corresponding Author
Jianfu Cai
Available Online December 2015.
DOI
10.2991/icmmcce-15.2015.505
Keywords
distributed crawler, dynamic web page, HtmlUnit.
Abstract

AJAX has become a widespread way to deliver rich content in modern web applications, which causes two serious problems for web crawlers. The first is incomplete information: a crawler that cannot parse dynamic web pages retrieves only part of a page's content. The second is crawler efficiency. To address both problems, this paper proposes a distributed dynamic web crawler named Dis-Dyn Crawler. The system uses HtmlUnit to parse dynamic pages and uses Redis and ZMQ (ZeroMQ) to distribute the crawl, which improves the crawler's efficiency. The experimental results show that Dis-Dyn Crawler outperforms Nutch, a distributed crawler system, and that dynamic page parsing efficiency is also improved.
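The abstract does not describe the system's internals, so the following is only a minimal sketch of the general idea of several workers sharing one crawl frontier, not the authors' implementation. It assumes an in-memory `SharedFrontier` as a stand-in for a shared store such as Redis, and a caller-supplied `render` callback as a stand-in for a JavaScript-capable page loader such as HtmlUnit. Both names, and the `crawl` function, are hypothetical, and the messaging layer (ZMQ in the paper) is omitted entirely.

```python
from collections import deque
from typing import Callable, Dict, List, Optional


class SharedFrontier:
    """Hypothetical in-memory stand-in for a shared store (e.g. Redis).

    Holds a FIFO queue of pending URLs and a 'seen' set, so that a URL
    is handed out at most once no matter which worker discovers it.
    """

    def __init__(self) -> None:
        self._pending = deque()
        self._seen = set()

    def push(self, url: str) -> bool:
        """Queue a URL; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        self._pending.append(url)
        return True

    def pop(self) -> Optional[str]:
        """Return the next pending URL, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None


def crawl(
    frontier: SharedFrontier,
    render: Callable[[str], List[str]],
    worker_count: int,
    max_pages: int = 100,
) -> Dict[int, List[str]]:
    """Simulate workers taking turns on one shared frontier.

    `render` is an assumption: it should load a page, run its scripts,
    and return the links found in the resulting DOM. The result maps
    each worker index to the URLs that worker handled.
    """
    handled = {worker: [] for worker in range(worker_count)}
    fetched = 0
    while fetched < max_pages:
        progressed = False
        for worker in range(worker_count):
            url = frontier.pop()
            if url is None:
                break
            progressed = True
            handled[worker].append(url)
            fetched += 1
            for link in render(url):
                frontier.push(link)  # duplicates are dropped here
            if fetched >= max_pages:
                break
        if not progressed:
            break
    return handled
```

In a real deployment the workers would run on separate machines and the frontier would live in a networked store; the round-robin loop here only illustrates that deduplication happens in one shared place, so no page is fetched twice.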

Copyright
© 2015, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering 2015
Series
Advances in Computer Science Research
Publication Date
December 2015
ISBN
978-94-6252-133-9
ISSN
2352-538X

Cite this article

TY  - CONF
AU  - Jianfu Cai
AU  - Hua Zhang
PY  - 2015/12
DA  - 2015/12
TI  - Dis-Dyn Crawler:A Distributed Crawler for Dynamic Web Page
BT  - Proceedings of the 4th International Conference on Mechatronics, Materials, Chemistry and Computer Engineering 2015
PB  - Atlantis Press
SN  - 2352-538X
UR  - https://doi.org/10.2991/icmmcce-15.2015.505
DO  - 10.2991/icmmcce-15.2015.505
ID  - Cai2015/12
ER  -