Duplicate text detection based on LCS algorithm

Jiankun Yu; Mengrong Li; Dengyin Zhang

doi:10.2991/itoec-16.2016.2

Corresponding Author

Jiankun Yu

Available Online May 2016.

DOI: 10.2991/itoec-16.2016.2
Keywords: near-duplicate detection; duplicate detection; duplicate text filter.
Abstract: Broder's Shingling and MinHash are two state-of-the-art approaches to detecting near-duplicate documents, but neither takes the relative position of elements into account. This paper proposes SWLR (Shingling with Location Relationship), a method that combines Shingling with the LCS algorithm, together with a pre-filter method that speeds up SWLR. Experimental results show that SWLR outperforms Shingling in both recall and precision and outperforms MinHash in recall. With the pre-filter applied, SWLR can even run faster than MinHash and Shingling.
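The abstract's point about relative position can be illustrated with a small sketch. It is not the authors' SWLR implementation, whose details the abstract does not give. It assumes that LCS means longest common subsequence, that shingles are word-level, and that an order-aware score can be taken as the LCS length divided by the longer shingle sequence. The function names `shingles`, `jaccard`, `lcs_length`, and `lcs_similarity` are hypothetical.

```python
from typing import List


def shingles(text: str, k: int = 2) -> List[str]:
    """Return the ordered list of word-level k-shingles of `text`."""
    words = text.lower().split()
    if len(words) < k:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def jaccard(a: List[str], b: List[str]) -> float:
    """Set-based resemblance as in plain Shingling; shingle order is ignored."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def lcs_length(a: List[str], b: List[str]) -> int:
    """Longest common subsequence length, using two-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: List[str], b: List[str]) -> float:
    """Assumed order-aware score: LCS length over the longer sequence."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


# Two texts built from the same halves in swapped order.
s1 = shingles("a b c d e f")
s2 = shingles("d e f a b c")
set_score = jaccard(s1, s2)
order_score = lcs_similarity(s1, s2)
```

For these two texts the set-based score is 4/6, because four of the six distinct shingles are shared, while the order-aware score is 2/5, because at most two shingles appear in the same relative order in both sequences. Under the stated assumptions, this is the kind of positional difference that a purely set-based resemblance cannot see.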
Copyright: © 2016, the Authors. Published by Atlantis Press.
Open Access: This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title: Proceedings of the 2nd Information Technology and Mechatronics Engineering Conference (ITOEC 2016)
Series: Advances in Engineering Research
Publication Date: May 2016
ISBN: 978-94-6252-178-0
ISSN: 2352-5401

Cite this article

TY  - CONF
AU  - Jiankun Yu
AU  - Mengrong Li
AU  - Dengyin Zhang
PY  - 2016/05
DA  - 2016/05
TI  - Duplicate text detection based on LCS algorithm
BT  - Proceedings of the 2nd Information Technology and Mechatronics Engineering Conference (ITOEC 2016)
PB  - Atlantis Press
SP  - 5
EP  - 9
SN  - 2352-5401
UR  - https://doi.org/10.2991/itoec-16.2016.2
DO  - 10.2991/itoec-16.2016.2
ID  - Yu2016/05
ER  -
