Bertopic and NER Stop Words for Topic Modeling on Agricultural Instructional Sentences
- DOI
- 10.2991/978-94-6463-364-1_14
- Keywords
- BERTopic; NER; Stop Words; Topic Modeling
- Abstract
A drawback of topic modeling is that sentence frequency within each topic is inconsistent, which leads to varying levels of topic coherence and topic diversity. One way to address this issue is to modify the stop words, that is, to remove unneeded or overused terms. In specialist areas such as health, law, and agriculture, stop words can be identified with Named Entity Recognition (NER), applied as a preprocessing step before topic modeling. In addition, different topic modeling components can be combined within BERTopic to improve the quality of the generated topics. The most effective BERTopic pipeline configuration uses sentence embeddings for text representation, UMAP for dimensionality reduction, HDBSCAN for clustering similar documents, and NER-based stop-word removal combined with c-TF-IDF for topic representation. This configuration achieved the highest topic diversity for JADI and PUW, 0.982 and 0.990 respectively, and produced the fewest outliers. However, topic coherence decreased.
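As a rough illustration of how entity-derived stop words can affect topic diversity, the sketch below ranks words per cluster with a c-TF-IDF-style weight and then measures diversity. It is not the authors' implementation. The toy clusters, the stop-word set `entity_stop_words` (standing in for the output of an NER step), and the function names are hypothetical. The weight follows the commonly documented c-TF-IDF form, term frequency times log(1 + average words per class / term frequency across classes), and topic diversity is assumed here to be the fraction of unique words among the top words of all topics, since the abstract does not state which definition was used.

```python
import math
from collections import Counter


def ctfidf_top_words(clusters, stop_words, top_k=3):
    """Return the top_k words per cluster under a c-TF-IDF-style weight.

    clusters: dict mapping cluster id -> list of tokenized sentences.
    stop_words: tokens to drop before counting (assumed to come from NER).
    """
    # Term frequencies per cluster, after stop-word removal.
    tf = {
        cid: Counter(t for sent in sents for t in sent if t not in stop_words)
        for cid, sents in clusters.items()
    }
    # Frequency of each term across all clusters.
    total = Counter()
    for counts in tf.values():
        total.update(counts)
    # Average number of retained words per cluster.
    avg_words = sum(total.values()) / len(tf)

    top = {}
    for cid, counts in tf.items():
        scores = {
            t: f * math.log(1 + avg_words / total[t]) for t, f in counts.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        top[cid] = ranked[:top_k]
    return top


def topic_diversity(top_words):
    """Fraction of unique words among all topics' top words (1.0 = no overlap)."""
    all_words = [w for words in top_words.values() for w in words]
    return len(set(all_words)) / len(all_words)


# Toy, hypothetical clusters of tokenized instructional sentences.
clusters = {
    0: [["water", "rice", "seedlings"], ["water", "rice", "field"],
        ["drain", "rice", "field"]],
    1: [["apply", "rice", "fertilizer"], ["apply", "rice", "compost"],
        ["spread", "rice", "compost"]],
}
# Hypothetical NER output: a crop entity that appears in every cluster.
entity_stop_words = {"rice"}

baseline = ctfidf_top_words(clusters, stop_words=set())
filtered = ctfidf_top_words(clusters, stop_words=entity_stop_words)

print(round(topic_diversity(baseline), 3))  # 0.833
print(topic_diversity(filtered))            # 1.0
```

In this toy case the shared entity "rice" appears among the top words of both clusters until it is filtered out, so removing it raises diversity. Real data may behave differently, and the sketch says nothing about topic coherence.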
- Copyright
- © 2024 The Author(s)
- Open Access
- Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY - CONF
AU - Trisna Gelar
AU - Aprianti Nanda Sari
PY - 2024
DA - 2024/02/17
TI - Bertopic and NER Stop Words for Topic Modeling on Agricultural Instructional Sentences
BT - Proceedings of the International Conference on Applied Science and Technology on Engineering Science 2023 (iCAST-ES 2023)
PB - Atlantis Press
SP - 129
EP - 140
SN - 2352-5401
UR - https://doi.org/10.2991/978-94-6463-364-1_14
DO - 10.2991/978-94-6463-364-1_14
ID - Gelar2024
ER -