Building Artificial Neural Networks for NLP Analysis and Classification of Target Content
- DOI
- 10.2991/assehr.k.210225.058
- Keywords
- artificial neural network, hybrid architecture, multiclass text analysis, cyber threat
- Abstract
Natural language processing (NLP) of texts with artificial intelligence (AI) methods is complicated by the semantic and lexical diversity of the texts. This diversity has given rise to a variety of machine learning (ML) metrics for neural network analysis. The task is further complicated by the fact that the content under study often contains “information garbage”, that is, information noise, which hinders the well-known problem of text classification. The lexical diversity of Internet content calls for improved methods of neural network NLP analysis. The purpose of the research is to identify and solve the problems that arise when information texts are analyzed with artificial neural networks (ANN), using socio-political content as an example. Well-known NLP techniques include substantiating the structure of, and building, a subject-oriented database of text corpora, constructing dictionaries based on frequency analysis, and numerical vectorization of texts. To identify latent semantic content, the use of a dense vector representation of terms in a multidimensional space (the embedding model) is justified. To account for sequences and combinations of the analyzed terms in the basic ANN architectures, modifications of convolutional (CNN, Conv1D) and recurrent (LSTM, etc.) layers that preserve token sequences were selected. Since such powerful layers make ANNs prone to overfitting, effective means of regularization, such as dropout layers, are necessary. The authors substantiate a modified NLP approach to identifying sociocultural and cyber threats contained in the content of Internet resources. Dictionaries of terms for multiclass text analysis, together with their labeling, are built in advance from a frequency analysis of the target Internet content.
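The dictionary construction and vectorization steps described above can be illustrated with a minimal sketch in plain Python. It is not the authors' code: the function names (`tokenize`, `build_vocabulary`, `vectorize`), the toy corpus, the regular-expression tokenizer, and the reserved indices 0 (padding) and 1 (unknown term) are all assumptions chosen for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def build_vocabulary(texts, max_terms=10000):
    """Map the most frequent terms to integer ids; ids 0 and 1 are reserved."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    # More frequent terms receive smaller ids, starting at 2.
    return {term: i + 2 for i, (term, _) in enumerate(counts.most_common(max_terms))}


def vectorize(text, vocab, length=8):
    """Turn a text into a fixed-length list of term ids (0 = padding, 1 = unknown)."""
    ids = [vocab.get(tok, 1) for tok in tokenize(text)][:length]
    return ids + [0] * (length - len(ids))


# Hypothetical toy corpus, used only to exercise the functions.
corpus = [
    "Phishing emails target bank customers",
    "Bank customers report phishing attempts",
    "Local festival attracts visitors",
]
vocab = build_vocabulary(corpus)
sample = vectorize("Phishing malware", vocab, length=4)
print(sample)
```

Integer sequences of this kind are the usual input to an embedding layer, which then maps each id to a dense vector.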
To justify and study the ANN architecture and hyperparameters suited to the analyzed subject area, a family of ANNs was built in Python 3 using specialized libraries, including Keras and scikit-learn. The ANN architectures included combinations of fully connected, convolutional, and/or recurrent layers. High-performance GPUs in the Google Colaboratory environment were used to train the ANNs. Recommendations are given for selecting ANN hyperparameters that are invariant across the hidden-layer architectures of hybrid ANNs aimed at multiclass NLP analysis. The share of correctly recognized texts in the test sample exceeded 80%. Recommendations for improving it are given.
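A hybrid architecture of the kind described (embedding, convolutional, recurrent, dropout, and fully connected layers) might be sketched in Keras as follows. This is an illustrative assumption, not the authors' model: the vocabulary size, sequence length, number of classes, layer widths, kernel size, and dropout rate are placeholder values, and `build_hybrid_model` is a hypothetical helper name.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Placeholder hyperparameters (assumed; the abstract does not state them).
VOCAB_SIZE = 10000   # number of dictionary terms, including reserved ids
SEQ_LEN = 100        # tokens per text after padding or truncation
NUM_CLASSES = 4      # number of target classes


def build_hybrid_model(vocab_size=VOCAB_SIZE, seq_len=SEQ_LEN, num_classes=NUM_CLASSES):
    """Build a small Embedding -> Conv1D -> LSTM -> Dropout -> Dense classifier."""
    model = keras.Sequential([
        keras.Input(shape=(seq_len,)),
        layers.Embedding(vocab_size, 64),                 # dense term vectors
        layers.Conv1D(64, 5, activation="relu"),          # local term combinations
        layers.MaxPooling1D(2),
        layers.LSTM(32),                                  # order of the token sequence
        layers.Dropout(0.5),                              # regularization against overfitting
        layers.Dense(num_classes, activation="softmax"),  # multiclass output
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # expects integer class labels
        metrics=["accuracy"],
    )
    return model


model = build_hybrid_model()
model.summary()
```

Training would then call `model.fit` on padded integer sequences and integer class labels. Swapping the `Conv1D` or `LSTM` layer in or out gives the other members of such a family of architectures.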
- Copyright
- © 2021, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - CONF
AU - Aleksey Rogachev
AU - Elena Melikhova
AU - Gennady Atamanov
PY - 2021
DA - 2021/02/26
TI - Building Artificial Neural Networks for NLP Analysis and Classification of Target Content
BT - Proceedings of the conference on current problems of our time: the relationship of man and society (CPT 2020)
PB - Atlantis Press
SP - 383
EP - 387
SN - 2352-5398
UR - https://doi.org/10.2991/assehr.k.210225.058
DO - 10.2991/assehr.k.210225.058
ID - Rogachev2021
ER -