Proceedings of the conference on current problems of our time: the relationship of man and society (CPT 2020)

Building Artificial Neural Networks for NLP Analysis and Classification of Target Content

Authors
Aleksey Rogachev, Elena Melikhova, Gennady Atamanov
Corresponding Author
Aleksey Rogachev
Available Online 26 February 2021.
DOI
10.2991/assehr.k.210225.058
Keywords
artificial neural network, hybrid architecture, multiclass text analysis, cyber threat
Abstract

The difficulties of analyzing natural-language texts (NLP) with artificial intelligence (AI) methods stem from the semantic and lexical diversity of the texts. This diversity has given rise to a variety of machine learning (ML) metrics for neural network analysis. AI analysis is further complicated by the fact that the content under study often contains “information garbage”, that is, information noise, which makes the well-known problem of text classification harder to solve. The lexical diversity of Internet content requires improved methods of neural network NLP analysis. The purpose of the research is to identify and solve problems that arise when analyzing information texts with artificial neural networks (ANN), using socio-political content as an example. Well-known NLP techniques include justifying the structure of, and building, a subject-oriented database of text corpora; constructing dictionaries based on frequency analysis; and numerical vectorization of texts. To identify latent semantic content, the authors justify the use of a dense vector representation of terms in a multidimensional space (the embedding model). To justify the choice of base architectures for the ANNs being developed, which must account for sequences and combinations of the analyzed terms, modifications of convolutional (CNN, Conv1D) and recurrent (LSTM, etc.) layers were selected that can retain token sequences. Because such powerful layers make the ANN prone to overfitting, effective regularization is needed, for example dropout layers. The authors substantiate a modified NLP approach to identifying sociocultural and cyber threats contained in the content of Internet resources. Based on frequency analysis of the target Internet content, dictionaries of terms used for multiclass text analysis are built in advance, together with their labeling.
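As an illustration of the dictionary-building and vectorization steps mentioned above, the following is a minimal sketch and not the authors' implementation. The function names, the regular-expression tokenizer, and the reserved indices for padding and unknown terms are all assumptions made for this example.

```python
import re
from collections import Counter

# Reserved indices (an assumed convention, not taken from the paper).
PAD, UNK = 0, 1


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def build_vocab(texts, max_size):
    """Map the most frequent tokens to integer indices starting at 2."""
    counts = Counter(tok for text in texts for tok in tokenize(text))
    most_frequent = counts.most_common(max_size - 2)  # leave room for PAD/UNK
    return {tok: i + 2 for i, (tok, _) in enumerate(most_frequent)}


def vectorize(text, vocab, max_len):
    """Turn a text into a fixed-length list of term indices."""
    ids = [vocab.get(tok, UNK) for tok in tokenize(text)][:max_len]
    return ids + [PAD] * (max_len - len(ids))


texts = ["cyber threat detected", "threat of cyber attack", "threat"]
vocab = build_vocab(texts, max_size=4)   # {'threat': 2, 'cyber': 3}
vector = vectorize("cyber threat attack", vocab, max_len=5)
print(vector)  # -> [3, 2, 1, 0, 0]
```

Integer sequences of this kind are the usual input to an embedding layer, which maps each index to a dense vector.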
To justify and study the ANN architecture and hyperparameters suited to the content of the analyzed subject area, a family of ANNs was built in Python 3 using specialized libraries such as Keras, scikit-learn, and others. The ANN architectures combined fully connected, convolutional, and/or recurrent layers. The ANNs were trained on high-performance GPUs in the Google Colaboratory environment. Recommendations are given for selecting ANN hyperparameters that are invariant across the hidden-layer architectures of hybrid ANNs aimed at multiclass NLP analysis. The proportion of correctly recognized texts in the test sample exceeded 80%. Recommendations for improving it are given.
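A hybrid architecture of the kind described (embedding, convolutional, recurrent, dropout, and fully connected layers) could be sketched in Keras roughly as follows. This is an assumed configuration for illustration only: the layer sizes, kernel width, dropout rate, optimizer, and the function name `build_hybrid_model` are not taken from the paper.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers


def build_hybrid_model(vocab_size, max_len, num_classes, embed_dim=64):
    """Embedding -> Conv1D -> LSTM -> Dropout -> Dense multiclass classifier."""
    inputs = keras.Input(shape=(max_len,), dtype="int32")
    # Dense vector representation of term indices (the embedding model).
    x = layers.Embedding(vocab_size, embed_dim)(inputs)
    # Convolutional layer picks up local combinations of neighboring terms.
    x = layers.Conv1D(64, 5, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(2)(x)
    # Recurrent layer summarizes the token sequence.
    x = layers.LSTM(64)(x)
    # Dropout as regularization against overfitting.
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(64, activation="relu")(x)
    # One probability per class.
    outputs = layers.Dense(num_classes, activation="softmax")(x)

    model = keras.Model(inputs, outputs)
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",  # integer class labels
        metrics=["accuracy"],
    )
    return model


model = build_hybrid_model(vocab_size=1000, max_len=20, num_classes=4)
batch = np.random.randint(0, 1000, size=(3, 20))
probabilities = model.predict(batch, verbose=0)  # shape (3, 4)
```

Swapping the `Conv1D` or `LSTM` layer in or out of this stack gives the purely recurrent or purely convolutional variants of the same family.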

Copyright
© 2021, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the conference on current problems of our time: the relationship of man and society (CPT 2020)
Series
Advances in Social Science, Education and Humanities Research
Publication Date
26 February 2021
ISBN
978-94-6239-342-4
ISSN
2352-5398

Cite this article

TY  - CONF
AU  - Aleksey Rogachev
AU  - Elena Melikhova
AU  - Gennady Atamanov
PY  - 2021
DA  - 2021/02/26
TI  - Building Artificial Neural Networks for NLP Analysis and Classification of Target Content
BT  - Proceedings of the conference on current problems of our time: the relationship of man and society (CPT 2020)
PB  - Atlantis Press
SP  - 383
EP  - 387
SN  - 2352-5398
UR  - https://doi.org/10.2991/assehr.k.210225.058
DO  - 10.2991/assehr.k.210225.058
ID  - Rogachev2021
ER  -