Research on Large Scale Documents Deduplication Technique based on Simhash Algorithm
- DOI
- 10.2991/icismme-15.2015.262
- Keywords
- duplicated document detection; Simhash; Fingerprint calculation.
- Abstract
Motivated by the need to deduplicate repeated documents on the Internet, this paper studies a Simhash-based deduplication technique for large-scale document collections. With Simhash as the core algorithm for duplicate document detection, the procedure for extracting document features is improved: both the meaning and the length of a word are taken into account when measuring its weight. Using 64-bit Simhash signatures, a document service is provided that compares similarity at both the full-text and the paragraph level. Test data and analysis show that the technique runs stably and that each instance can store 100 million documents. The average request response time is about 20 ms; it increases during peak hours but generally does not exceed 100 ms.
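The abstract does not spell out the weighting formula or the service interface, so the following is only a minimal sketch of a 64-bit Simhash with full-text and paragraph-level signatures. The tokenizer, the MD5-based token hash, the `token_weight` rule (term frequency times word length, with no semantic component), and all function names are assumptions made for illustration, not the authors' implementation.

```python
import hashlib
import re
from collections import Counter

BITS = 64  # signature width stated in the abstract


def token_hash(token):
    """Stable 64-bit hash of a token (first 8 bytes of its MD5 digest)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def token_weight(token, count):
    """Hypothetical weight: term frequency scaled by word length.

    The paper also weighs word meaning; that part is not modelled here.
    """
    return count * len(token)


def simhash(text):
    """Return a 64-bit Simhash signature of the text as an int."""
    tokens = re.findall(r"\w+", text.lower())
    totals = [0.0] * BITS
    for token, count in Counter(tokens).items():
        h = token_hash(token)
        w = token_weight(token, count)
        # Each token votes +w on bits set in its hash and -w on the others.
        for bit in range(BITS):
            totals[bit] += w if (h >> bit) & 1 else -w
    signature = 0
    for bit in range(BITS):
        if totals[bit] > 0:
            signature |= 1 << bit
    return signature


def hamming_distance(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def paragraph_signatures(document):
    """One signature per blank-line-separated paragraph."""
    paragraphs = re.split(r"\n\s*\n", document)
    return [simhash(p) for p in paragraphs if p.strip()]
```

Two documents (or two paragraphs) would then be treated as near-duplicates when `hamming_distance` between their signatures falls below a chosen threshold; the abstract does not state the threshold used, so any value would need to be tuned on real data.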
- Copyright
- © 2015, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY  - CONF
AU  - Yi Yu
AU  - Zijian Hu
AU  - Yuzhu Zhang
PY  - 2015/07
DA  - 2015/07
TI  - Research on Large Scale Documents Deduplication Technique based on Simhash Algorithm
BT  - Proceedings of the First International Conference on Information Sciences, Machinery, Materials and Energy
PB  - Atlantis Press
SP  - 1225
EP  - 1228
SN  - 1951-6851
UR  - https://doi.org/10.2991/icismme-15.2015.262
DO  - 10.2991/icismme-15.2015.262
ID  - Yu2015/07
ER  -