Automation Distributed Cloud Based Crawler
- DOI
- 10.2991/978-94-6463-525-6_3
- Keywords
- crawler; automation; distributed systems; fog-cloud; distributed cloud
- Abstract
Information is essential in many applications, and online news sites rank among the top 10 most visited by internet users in Indonesia, publishing new articles every minute. An online news corpus is a prerequisite for processing this information, but building one faces obstacles: large resource requirements and delays caused by access restrictions when crawlers are flagged as bots or spam, both of which slow information retrieval. To overcome these problems, this study develops a framework for building an online news corpus based on automated distributed cloud-based crawling using MCDM methods. Self-optimization of cloud tasks uses the TOPSIS method, with alternatives as the objects to be assessed, while task scheduling applies the AHP method to select the best edge nodes. The framework divides crawling and information extraction into several sub-systems. The first stage develops a distributed crawler that distributes work through a node selection mechanism; the second stage develops an information extraction system that combines pattern-based extraction with node density; the third stage develops automated node management. The contribution of this research is an automated distributed cloud-based crawler framework, which has not been proposed by previous researchers. The framework activates nodes according to the priority of pending work, so it speeds up information retrieval while using few resources. Its performance is evaluated on the accuracy of the extraction results and the average time required. The stages carried out in this research are URL collection, URL filtering, scheduling, URL access, and data extraction, with a focus on the automation of distributed cloud-based crawlers.
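The abstract names TOPSIS for cloud task self-optimization and AHP for edge node selection but gives no formulas. The sketch below is a minimal illustration of the two MCDM methods, not the paper's implementation: an AHP pairwise-comparison matrix yields approximate criterion weights, which TOPSIS then uses to rank candidate crawler nodes by closeness to the ideal solution. The criteria (CPU load, free memory, latency, bandwidth), the pairwise judgments, and the node scores are all hypothetical, and whether the paper chains the two methods this way is not stated.

```python
import numpy as np

def ahp_weights(pairwise):
    """Approximate the AHP priority vector by normalized column means."""
    p = np.asarray(pairwise, dtype=float)
    return (p / p.sum(axis=0)).mean(axis=1)

def topsis(matrix, weights, benefit):
    """Rank alternatives by TOPSIS closeness to the ideal solution."""
    m = np.asarray(matrix, dtype=float)
    norm = m / np.sqrt((m ** 2).sum(axis=0))        # vector normalization
    v = norm * np.asarray(weights)                  # weighted decision matrix
    benefit = np.asarray(benefit)
    best = np.where(benefit, v.max(axis=0), v.min(axis=0))
    worst = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_best = np.sqrt(((v - best) ** 2).sum(axis=1))
    d_worst = np.sqrt(((v - worst) ** 2).sum(axis=1))
    return d_worst / (d_best + d_worst)             # 1.0 = ideal node

# Hypothetical criteria: CPU load (cost), free memory in MB (benefit),
# latency in ms (cost), bandwidth in Mbps (benefit).
pairwise = [[1,   3,   5,   3],
            [1/3, 1,   3,   1],
            [1/5, 1/3, 1,   1/3],
            [1/3, 1,   3,   1]]
weights = ahp_weights(pairwise)

nodes = [[0.70, 2048, 35, 100],   # node 0
         [0.30, 1024, 80,  50],   # node 1
         [0.50, 4096, 20, 200]]   # node 2
scores = topsis(nodes, weights, benefit=[False, True, False, True])
print(np.argsort(scores)[::-1])   # node indices, best candidate first
```

TOPSIS closeness scores fall in [0, 1], so a scheduler built this way would simply dispatch the next crawl task to the highest-scoring node.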
- Copyright
- © 2024 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
- Cite this article
TY  - CONF
AU  - Lanang Prismana
PY  - 2024
DA  - 2024/10/29
TI  - Automation Distributed Cloud Based Crawler
BT  - Proceedings of the 2023 Brawijaya International Conference (BIC 2023)
PB  - Atlantis Press
SP  - 13
EP  - 21
SN  - 2352-5428
UR  - https://doi.org/10.2991/978-94-6463-525-6_3
DO  - 10.2991/978-94-6463-525-6_3
ID  - Prismana2024
ER  -