Large-scale Chinese Text Infringement Detection Based on Dual-Semantic Fingerprinting
- DOI
- 10.2991/978-94-6463-326-9_25
- Keywords
- Infringement detection; Text similarity; CiLin; Dual-semantic fingerprinting
- Abstract
SimHash is a hashing method originally used to deduplicate web pages at large scale. Because it is both effective and efficient, it is also widely used for text similarity comparison. In this paper, we improve the classical SimHash algorithm for semantic similarity detection in large-scale Chinese text. Our method first calculates word similarity using a text similarity measure based on the CiLin path-depth algorithm, then processes the keywords extracted with TF-IDF to reduce synonym redundancy. Finally, it generates dual-semantic fingerprints and calculates the Hamming distance between them. The experimental results show that the improved algorithm outperforms the classical SimHash algorithm in terms of F1 score. This suggests that the algorithm can raise the likelihood of detecting texts that infringe at the semantic level, and can provide technical support for digital copyright infringement detection.
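For orientation, the classical SimHash baseline that the abstract compares against can be sketched as follows. This is a minimal illustration, not the authors' dual-semantic method: the CiLin-based synonym handling is omitted, the keyword weights are assumed to come from some TF-IDF step that is not shown, and the function names, the MD5-based token hash, and the 64-bit fingerprint width are all assumptions made for the example.

```python
import hashlib
from typing import Dict


def token_hash(token: str, bits: int = 64) -> int:
    """Hash a token to a `bits`-wide integer (MD5 is an illustrative choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def simhash(weighted_tokens: Dict[str, float], bits: int = 64) -> int:
    """Classical SimHash over keyword -> weight pairs (e.g. TF-IDF weights)."""
    totals = [0.0] * bits
    for token, weight in weighted_tokens.items():
        h = token_hash(token, bits)
        for i in range(bits):
            # A set bit adds the weight to that position, a clear bit subtracts it.
            if (h >> i) & 1:
                totals[i] += weight
            else:
                totals[i] -= weight
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two documents would then be compared with `hamming_distance(simhash(doc_a), simhash(doc_b))`, where a smaller distance indicates more similar keyword sets. The threshold below which a pair counts as a match is a tuning choice that the abstract does not specify.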
- Copyright
- © 2023 The Author(s)
- Open Access
- This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 4.0 International License (http://creativecommons.org/licenses/by-nc/4.0/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
Cite this article
TY  - CONF
AU  - Ruixue Zhao
AU  - Xiao Yang
AU  - Honglei Li
PY  - 2023
DA  - 2023/12/30
TI  - Large-scale Chinese Text Infringement Detection Based on Dual-Semantic Fingerprinting
BT  - Proceedings of the 2023 3rd International Conference on Business Administration and Data Science (BADS 2023)
PB  - Atlantis Press
SP  - 232
EP  - 244
SN  - 2589-4900
UR  - https://doi.org/10.2991/978-94-6463-326-9_25
DO  - 10.2991/978-94-6463-326-9_25
ID  - Zhao2023
ER  -