The Design and Implementation of Web Crawler Distributed News Domain Detection System
- DOI
- 10.2991/aer.k.201124.017
- Keywords
- Web crawler, news domain, distributed, focus crawler
- Abstract
Distributing data over the internet to improve a business's chances of success through market-trend analysis is now common. Web crawling plays an important role here, because it helps ensure that the collected data is complete and up to date. Web crawler technology downloads web pages programmatically. Crawlers and search engines face unpredictable challenges. A focused web crawl is essential for mining the practically unlimited data available on the internet. Web crawls encounter unpredictable latency because websites differ in response time. This research aims to optimize the design and implementation of a distributed news domain detection system built on a web crawler. The study proposes a distributed focused crawler because it reduces timeouts on each website, eliminates backlist capabilities, distributes resources, and lets the crawler use network bandwidth and storage capacity more efficiently. The main objective of the distributed web crawler is to implement crawler scheduling, sorting sites to define URL queues. The crawler focuses only on news data. This research implements URL Gate explorer, which serves as the main bridge for instructions from the database; URL Seed, which checks all URLs for each news item; and get metadata, which checks the metadata of each item for an identical title.
- Copyright
- © 2020, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY  - CONF
AU  - I Gusti Lanang Putra Eka Prismana
AU  - Dedy Rahman Prehanto
AU  - I Kadek Dwi Nuryana
PY  - 2020
DA  - 2020/11/24
TI  - The Design and Implementation of Web Crawler Distributed News Domain Detection System
BT  - Proceedings of the International Joint Conference on Science and Engineering (IJCSE 2020)
PB  - Atlantis Press
SP  - 92
EP  - 97
SN  - 2352-5401
UR  - https://doi.org/10.2991/aer.k.201124.017
DO  - 10.2991/aer.k.201124.017
ID  - Prismana2020
ER  -