Proceedings of the 2023 Brawijaya International Conference (BIC 2023)

Automation Distributed Cloud Based Crawler

Authors
Lanang Prismana1, *
1Department of Informatics Engineering, Faculty of Engineering, Universitas Negeri Surabaya, Surabaya, Indonesia
*Corresponding author. Email: lanangprismana@unesa.ac.id
Corresponding Author
Lanang Prismana
Available Online 29 October 2024.
DOI
10.2991/978-94-6463-525-6_3
Keywords
crawler; automation; distributed systems; fog-cloud; distributed cloud
Abstract

Information is a vital resource that is needed for many purposes. Online news is among the ten most visited types of site for internet users in Indonesia, and online news sites publish articles to the internet every minute. An online news corpus is therefore necessary for information processing. Building such a corpus generally faces obstacles such as large resource requirements and delays caused by access restrictions that are imposed when heavy access is classified as bot or spam traffic, both of which slow the retrieval of information from online news. To overcome this problem, a framework is needed that improves performance in the creation of an online news corpus. This study develops such a framework, based on the automation of a distributed cloud-based crawler using multi-criteria decision-making (MCDM) methods. Self-optimization of cloud tasks uses the TOPSIS method, with the alternative data as the objects to be assessed, and the task scheduling step that selects edge nodes applies the AHP method to obtain the best alternatives. The framework divides the crawler and information extraction system into several sub-systems. The first stage develops a distributed crawler system that distributes work through a node selection mechanism. The second stage develops an information extraction system that combines pattern-based and node-density approaches. The third stage develops automated node management. The contribution of this research is the automation of a distributed cloud-based crawler framework, which the author states has not been done by previous researchers. The framework activates nodes according to the priority of existing work, so that it can speed up information retrieval while using few resources. The performance of the framework will be tested in terms of the accuracy of the extraction results and the average time required.
The stages carried out in this research are URL collection, URL filtering, scheduling, URL access, and data extraction. The research focuses on the automation of distributed cloud-based crawlers.
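
The abstract names TOPSIS as the ranking method but does not give its criteria, weights, or data. As a rough illustration only, the following minimal Python sketch shows the standard TOPSIS calculation on made-up values. The three criteria (CPU load, latency, queue length), the weights, and the function name `topsis` are all assumptions introduced here and are not taken from the paper.

```python
import math


def topsis(matrix, weights, benefit):
    """Return one closeness score in [0, 1] per alternative (higher is better).

    matrix  -- rows are alternatives, columns are criteria
    weights -- one weight per criterion (normalized internally)
    benefit -- True where larger is better, False where smaller is better
    """
    n_crit = len(weights)
    total_w = sum(weights)
    # Vector-normalize each column; guard against an all-zero column.
    norms = [
        math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
        for j in range(n_crit)
    ]
    # Weighted normalized decision matrix.
    v = [
        [(row[j] / norms[j]) * (weights[j] / total_w) for j in range(n_crit)]
        for row in matrix
    ]
    columns = list(zip(*v))
    # Ideal and anti-ideal points, taking the criterion direction into account.
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]
    scores = []
    for row in v:
        d_pos = math.dist(row, ideal)  # distance to the ideal point
        d_neg = math.dist(row, anti)   # distance to the anti-ideal point
        denom = d_pos + d_neg
        scores.append(d_neg / denom if denom else 0.0)
    return scores


# Hypothetical alternatives: [CPU load, latency in ms, queue length].
# All three are treated as cost criteria (smaller is better).
alternatives = [
    [0.2, 30.0, 5.0],
    [0.9, 120.0, 40.0],
    [0.5, 60.0, 10.0],
]
weights = [0.5, 0.3, 0.2]            # illustrative weights, not from the paper
benefit = [False, False, False]

scores = topsis(alternatives, weights, benefit)
best = max(range(len(scores)), key=scores.__getitem__)
```

In this made-up data the first alternative is best on every criterion, so it coincides with the ideal point and `best` is index 0. With real measurements the ranking would depend on the chosen criteria and weights.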

Copyright
© 2024 The Author(s)
Open Access
Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

Volume Title
Proceedings of the 2023 Brawijaya International Conference (BIC 2023)
Series
Advances in Economics, Business and Management Research
Publication Date
29 October 2024
ISBN
978-94-6463-525-6
ISSN
2352-5428

Cite this article

TY  - CONF
AU  - Lanang Prismana
PY  - 2024
DA  - 2024/10/29
TI  - Automation Distributed Cloud Based Crawler
BT  - Proceedings of the 2023 Brawijaya International Conference (BIC 2023)
PB  - Atlantis Press
SP  - 13
EP  - 21
SN  - 2352-5428
UR  - https://doi.org/10.2991/978-94-6463-525-6_3
DO  - 10.2991/978-94-6463-525-6_3
ID  - Prismana2024
ER  -