Proceedings of the First International Conference on Information Sciences, Machinery, Materials and Energy

Research on Large Scale Documents Deduplication Technique based on Simhash Algorithm

Authors
Yi Yu, Zijian Hu, Yuzhu Zhang
Corresponding Author
Yi Yu
Available Online July 2015.
DOI
10.2991/icismme-15.2015.262
Keywords
Duplicated document detection; Simhash; fingerprint calculation.
Abstract

Motivated by the need to deduplicate repeated documents on the Internet, this paper studies a deduplication technique for large-scale document collections based on the Simhash algorithm. With Simhash as the core algorithm for duplicate document detection, the procedure for extracting document features is improved: both the meaning and the length of a word are taken into account when its weight is measured. Using 64-bit Simhash signatures, the technique provides a document service that compares similarity at the full-text level and at the paragraph level. Test data and analysis indicate that the service runs stably and that each instance can store 100 million documents. The average request response time is about 20 ms; it increases during peak hours but generally does not exceed 100 ms.
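The pipeline summarized in the abstract (weighted word features, a 64-bit signature, and comparison at full-text and paragraph level) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The tokenizer, the MD5-based token hash, the `token_weight` formula (frequency times word length, standing in for the paper's meaning- and length-based weighting), the blank-line paragraph split, and the distance threshold of 3 bits are all assumptions made for illustration.

```python
import hashlib
import re
from collections import Counter

BITS = 64  # signature width stated in the abstract


def token_hash(token: str) -> int:
    """Map a token to a 64-bit integer (assumed hash: first 8 bytes of MD5)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def token_weight(token: str, count: int) -> float:
    """Hypothetical weight: term frequency scaled by word length.

    The paper also uses word meaning; that part is not modeled here.
    """
    return count * len(token)


def simhash(text: str) -> int:
    """Compute a 64-bit Simhash fingerprint of the text."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    totals = [0.0] * BITS
    for token, count in counts.items():
        h = token_hash(token)
        w = token_weight(token, count)
        for bit in range(BITS):
            # Add the weight where the hash bit is 1, subtract it where it is 0.
            totals[bit] += w if (h >> bit) & 1 else -w
    fingerprint = 0
    for bit, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: int, b: int, max_distance: int = 3) -> bool:
    """Assumed decision rule; the threshold is not taken from the source."""
    return hamming_distance(a, b) <= max_distance


def paragraph_fingerprints(text: str) -> list[int]:
    """Fingerprint each paragraph (assumed to be separated by blank lines)."""
    return [simhash(p) for p in re.split(r"\n\s*\n", text) if p.strip()]
```

Two documents would be compared with `is_near_duplicate(simhash(doc_a), simhash(doc_b))`. Paragraph-level comparison applies the same distance test to pairs of values returned by `paragraph_fingerprints`.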

Copyright
© 2015, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the First International Conference on Information Sciences, Machinery, Materials and Energy
Series
Advances in Intelligent Systems Research
Publication Date
July 2015
ISBN
978-94-62520-67-7
ISSN
1951-6851

Cite this article

TY  - CONF
AU  - Yi Yu
AU  - Zijian Hu
AU  - Yuzhu Zhang
PY  - 2015/07
DA  - 2015/07
TI  - Research on Large Scale Documents Deduplication Technique based on Simhash Algorithm
BT  - Proceedings of the First International Conference on Information Sciences, Machinery, Materials and Energy
PB  - Atlantis Press
SP  - 1225
EP  - 1228
SN  - 1951-6851
UR  - https://doi.org/10.2991/icismme-15.2015.262
DO  - 10.2991/icismme-15.2015.262
ID  - Yu2015/07
ER  -