Rearch on Large Scale Documents Deduplication Technique based on Simhash Algorithm

Yi Yu; Zijian Hu; Yuzhu Zhang

doi:10.2991/icismme-15.2015.262

Corresponding Author

Yi Yu

Available Online July 2015.

DOI: 10.2991/icismme-15.2015.262
Keywords: duplicated document detection; Simhash; Fingerprint calculation.
Abstract: Motivated by the need to remove duplicated documents on the Internet, this paper studies a deduplication technique for large-scale document collections based on the Simhash algorithm. With Simhash as the core algorithm for duplicate document detection, the procedure for extracting document features is improved: both the meaning and the length of a word are taken into account when measuring its weight. Using 64-bit Simhash signatures, the system provides a document service that compares similarity at the level of the full text and of individual paragraphs. Test data and analysis indicate that the technique runs stably and that each instance can hold 100 million documents. The average request response time is about 20 ms; it increases during peak hours but generally does not exceed 100 ms.
Copyright: © 2015, the Authors. Published by Atlantis Press.
Open Access: This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
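
The abstract describes a weighted 64-bit Simhash signature used for similarity comparison. The following is a minimal Python sketch of that general idea, not the authors' implementation. The tokenizer, the MD5-based token hash, the length-based weight in `token_weight`, and the distance threshold of 3 bits are all illustrative assumptions; the abstract does not specify how word meaning and length enter the weight.

```python
import hashlib
import re
from collections import Counter

BITS = 64  # signature width named in the abstract


def token_hash(token: str) -> int:
    """Map a token to a 64-bit integer (assumption: first 8 bytes of MD5)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def token_weight(token: str, count: int) -> float:
    """Hypothetical weight: term frequency scaled by a capped length factor.

    The paper also considers word meaning; that part is not modeled here.
    """
    return count * (1.0 + min(len(token), 10) / 10.0)


def simhash(text: str) -> int:
    """Compute a 64-bit Simhash fingerprint of the text."""
    tokens = re.findall(r"\w+", text.lower())
    totals = [0.0] * BITS
    for token, count in Counter(tokens).items():
        h = token_hash(token)
        w = token_weight(token, count)
        for bit in range(BITS):
            # Add the weight where the hash bit is 1, subtract where it is 0.
            if (h >> bit) & 1:
                totals[bit] += w
            else:
                totals[bit] -= w
    fingerprint = 0
    for bit in range(BITS):
        if totals[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(text_a: str, text_b: str, max_distance: int = 3) -> bool:
    """Treat two texts as near-duplicates if their fingerprints are close.

    max_distance=3 is an assumed threshold, not a value from the paper.
    """
    return hamming_distance(simhash(text_a), simhash(text_b)) <= max_distance
```

Paragraph-level comparison, as mentioned in the abstract, could be approximated under the same assumptions by splitting a document on blank lines and calling `simhash` on each paragraph separately.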

Volume Title: Proceedings of the First International Conference on Information Sciences, Machinery, Materials and Energy
Series: Advances in Intelligent Systems Research
Publication Date: July 2015
ISBN: 978-94-62520-67-7
ISSN: 1951-6851

Cite this article

TY  - CONF
AU  - Yi Yu
AU  - Zijian Hu
AU  - Yuzhu Zhang
PY  - 2015/07
DA  - 2015/07
TI  - Rearch on Large Scale Documents Deduplication Technique based on Simhash Algorithm
BT  - Proceedings of the First International Conference on Information Sciences, Machinery, Materials and Energy
PB  - Atlantis Press
SP  - 1225
EP  - 1228
SN  - 1951-6851
UR  - https://doi.org/10.2991/icismme-15.2015.262
DO  - 10.2991/icismme-15.2015.262
ID  - Yu2015/07
ER  -