Proceedings of the 2015 International Conference on Electromechanical Control Technology and Transportation

Chinese Words Segmentation Based on Double Hash Dictionary Running on Hadoop

Authors
Chao Feng, Baoan Li
Corresponding Author
Chao Feng
Available Online November 2015.
DOI
10.2991/icectt-15.2015.88
Keywords
Search Engine, Words Segmentation, Hadoop, Double Hash
Abstract

Word segmentation is an essential stage in building a search engine, and its quality directly affects search speed and precision. Segmenting large volumes of data requires a tool that can handle big data, because traditional segmentation on a single PC no longer meets this need. This study presents a Chinese word segmentation technique based on Hadoop. It combines a dictionary built with a double hash function, the forward maximum matching method, and MapReduce (MR) programming to perform word segmentation in parallel on a distributed cluster, which greatly shortens processing time and increases efficiency. The approach provides a convenient and fast way to segment large quantities of text.
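
The abstract does not spell out the dictionary layout or the matching loop, so the following is only a minimal sketch of how a two-level hash dictionary and forward maximum matching might fit together. It assumes the first hash level is keyed by a word's first character and the second by word length; the paper's actual hash functions may differ. The names `build_dictionary`, `segment`, and `map_line` and the sample word list are hypothetical, and `map_line` merely stands in for a map step without using any Hadoop API.

```python
from collections import defaultdict


def build_dictionary(words):
    """Two-level index (assumed layout): first character -> word length -> words."""
    index = defaultdict(lambda: defaultdict(set))
    for word in words:
        if word:
            index[word[0]][len(word)].add(word)
    return index


def segment(text, index):
    """Forward maximum matching: at each position, take the longest dictionary word."""
    tokens = []
    pos = 0
    while pos < len(text):
        by_length = index.get(text[pos], {})
        match = text[pos]  # fall back to a single character when nothing matches
        for length in sorted(by_length, reverse=True):  # longest candidate first
            candidate = text[pos:pos + length]
            if candidate in by_length[length]:
                match = candidate
                break
        tokens.append(match)
        pos += len(match)
    return tokens


def map_line(line, index):
    """Map-style step (hypothetical): one input line in, one segmented line out."""
    return " ".join(segment(line.strip(), index))


# Illustrative word list, not taken from the paper.
WORDS = ["研究", "研究生", "生命", "起源"]
INDEX = build_dictionary(WORDS)
print(segment("研究生命起源", INDEX))  # ['研究生', '命', '起源']
```

The sample output shows the greedy behavior of forward maximum matching: the longest prefix "研究生" is taken first, even though "研究 / 生命 / 起源" would be the intended reading. Because each line is segmented independently, a step like `map_line` can be applied to input splits in parallel, which is the property a distributed cluster exploits.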

Copyright
© 2015, the Authors. Published by Atlantis Press.
Open Access
This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).

Volume Title
Proceedings of the 2015 International Conference on Electromechanical Control Technology and Transportation
Series
Advances in Engineering Research
Publication Date
November 2015
ISBN
978-94-6252-124-7
ISSN
2352-5401

Cite this article

TY  - CONF
AU  - Chao Feng
AU  - Baoan Li
PY  - 2015/11
DA  - 2015/11
TI  - Chinese Words Segmentation Based on Double Hash Dictionary Running on Hadoop
BT  - Proceedings of the 2015 International Conference on Electromechanical Control Technology and Transportation
PB  - Atlantis Press
SP  - 461
EP  - 465
SN  - 2352-5401
UR  - https://doi.org/10.2991/icectt-15.2015.88
DO  - 10.2991/icectt-15.2015.88
ID  - Feng2015/11
ER  -