Duplicate text detection based on LCS algorithm
Authors
Jiankun Yu, Mengrong Li, Dengyin Zhang
Corresponding Author
Jiankun Yu
Available Online May 2016.
- DOI
- 10.2991/itoec-16.2016.2
- Keywords
- near-duplicate detection; duplicate detection; duplicate text filter.
- Abstract
Broder's Shingling and MinHash are two state-of-the-art approaches to detecting near-duplicate documents. However, neither method takes the relative position of elements into consideration. This paper proposes SWLR (Shingling with Location Relationship), a method that combines Shingling with the LCS algorithm, together with a pre-filter method that speeds up the execution of SWLR. Experimental results show that SWLR performs better than Shingling in both recall and precision, and better than MinHash in recall. With the pre-filter applied, SWLR can even run faster than both MinHash and Shingling.
- Copyright
- © 2016, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
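The abstract's observation that shingle-based resemblance ignores the relative position of elements can be illustrated with a small sketch. This is not the authors' SWLR method, whose details are not given on this page. It assumes that LCS means longest common subsequence, and the function names (`shingles`, `jaccard`, `lcs_length`, `lcs_ratio`), the shingle size `k=2`, and the normalization by the longer sequence are all hypothetical choices made for illustration.

```python
def shingles(tokens, k=2):
    """Return the ordered list of k-token shingles (contiguous token windows)."""
    return [tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)]


def jaccard(a, b):
    """Set resemblance of two shingle lists; order and position are discarded."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def lcs_length(a, b):
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_ratio(a, b):
    """LCS length normalized by the longer sequence (an assumed normalization)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


# Two documents with the same phrases in swapped order.
doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "over the lazy dog the quick brown fox jumps".split()

sh_a, sh_b = shingles(doc_a), shingles(doc_b)
set_score = jaccard(sh_a, sh_b)      # ignores where shingles occur
order_score = lcs_ratio(sh_a, sh_b)  # rewards shingles that keep their order
print(f"jaccard={set_score:.3f} lcs_ratio={order_score:.3f}")
```

For these two inputs the set-based score is 7/9 (about 0.778), while the order-aware score is 0.5, because only one of the two swapped phrases can be matched in sequence. A position-aware measure therefore separates reordered text from true near-duplicates more sharply than the shingle set alone.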
Cite this article
TY  - CONF
AU  - Jiankun Yu
AU  - Mengrong Li
AU  - Dengyin Zhang
PY  - 2016/05
DA  - 2016/05
TI  - Duplicate text detection based on LCS algorithm
BT  - Proceedings of the 2nd Information Technology and Mechatronics Engineering Conference (ITOEC 2016)
PB  - Atlantis Press
SP  - 5
EP  - 9
SN  - 2352-5401
UR  - https://doi.org/10.2991/itoec-16.2016.2
DO  - 10.2991/itoec-16.2016.2
ID  - Yu2016/05
ER  -