An Adaptive Chinese Word Segmentation Method
- DOI
- 10.2991/amcce-18.2018.96
- Keywords
- Chinese word segmentation; Active learning; CRF; Domain adaptation
- Abstract
Because training corpora cover only a limited range of domains, statistical Chinese word segmentation adapts poorly to new domains. Given the difficulty of obtaining a large-scale annotated corpus in the target domain, this paper proposes a domain adaptation method that combines domain dictionaries with active learning. First, a statistical analysis of the differences between the target-domain text and the existing annotated corpus is used to select a small corpus of the unannotated sentences that differ most from the existing corpus, and these sentences are prioritized for manual annotation. Next, n-gram statistics from large-scale texts are combined with this annotated data to train a segmentation model for the target domain. Finally, domain adaptation is achieved by integrating lexical information into the CRF-based statistical word segmentation model. Experiments show that the method significantly improves the domain adaptability of statistical Chinese word segmentation.
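The sentence-selection step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring rule, so everything here is an assumption made for illustration: the function names `char_ngrams` and `select_for_annotation` are hypothetical, and "difference" is approximated as the number of distinct character bigrams in a target-domain sentence that never occur in the existing annotated corpus. The paper's actual criterion may differ.

```python
def char_ngrams(sentence, n=2):
    """Return the list of character n-grams in a sentence."""
    return [sentence[i:i + n] for i in range(len(sentence) - n + 1)]


def select_for_annotation(source_sentences, target_sentences, budget, n=2):
    """Pick up to `budget` target-domain sentences for manual annotation.

    Assumed scoring rule (not taken from the paper): a sentence scores one
    point for each distinct n-gram that is absent from the source corpus.
    """
    # Collect every n-gram already covered by the annotated source corpus.
    seen = set()
    for sentence in source_sentences:
        seen.update(char_ngrams(sentence, n))

    def score(sentence):
        # Count distinct n-grams in this sentence that the source never saw.
        return len(set(char_ngrams(sentence, n)) - seen)

    # Highest-scoring sentences first; ties keep their original order.
    ranked = sorted(target_sentences, key=score, reverse=True)
    return ranked[:budget]


if __name__ == "__main__":
    # Toy data: a general-domain source corpus and a medical target corpus.
    source = ["今天天气很好", "我们去公园散步"]
    target = ["今天天气很好", "患者出现心肌梗死症状", "我们去医院"]
    for sentence in select_for_annotation(source, target, budget=2):
        print(sentence)
```

The selected sentences would then be annotated by hand and added to the training data; training the CRF model itself is outside the scope of this sketch.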
- Copyright
- © 2018, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - CONF
AU - Zhi Yuan
PY - 2018/05
DA - 2018/05
TI - An Adaptive Chinese Word Segmentation Method
BT - Proceedings of the 2018 3rd International Conference on Automation, Mechanical Control and Computational Engineering (AMCCE 2018)
PB - Atlantis Press
SP - 556
EP - 561
SN - 2352-5401
UR - https://doi.org/10.2991/amcce-18.2018.96
DO - 10.2991/amcce-18.2018.96
ID - Yuan2018/05
ER -