New Word Identification for Chinese Patents Based on Multiple Statistic Measures and Pattern Combination
- DOI
- 10.2991/icecee-15.2015.98
- Keywords
- New Word Identification (NWI); Out of Vocabulary (OOV); Pattern Combination; Candidate Generation; Statistical Measures Integration; Pattern Filtering
- Abstract
New Word Identification (NWI) is a critical research topic in Chinese Natural Language Processing (NLP) and strongly influences downstream Chinese NLP tasks. To address the NWI problem that hampers automatic and semi-automatic translation of Chinese patent texts, this paper proposes an NWI method for Chinese patents based on the integration of multiple statistical measures and pattern combination. First, a dedicated preprocessing step divides the text into strings, preserving the technical terms in the patents and removing as many non-technical words as possible. Next, the divided strings of different lengths are combined using multiple patterns with greedy maximum matching to generate candidates. The noisy candidate strings are then filtered using four manually summarized filtering patterns. Finally, statistical measures defined for only two variables are extended to handle multiple variables, and the values of the extended measures are integrated using a ranking method that evaluates the candidates against thresholds to form the set of new words. Experiments on abstracts of Chinese patents show that precision can reach 80% and the F1 value can reach 68.15%, verifying the effectiveness of the method.
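The final step, extending two-variable measures to several variables and integrating them by rank, can be illustrated with a minimal sketch. The abstract does not name the measures or the exact ranking rule, so everything below is an assumption for illustration: an n-ary pointwise mutual information (`pmi_n`), an n-ary Dice coefficient (`dice_n`), integration by mean rank, a hypothetical `max_mean_rank` threshold, and toy counts.

```python
import math
from typing import Dict, List, Tuple

Candidate = Tuple[str, ...]  # a candidate is a sequence of divided strings


def pmi_n(cand: Candidate, joint: Dict[Candidate, int],
          unit: Dict[str, int], total: int) -> float:
    """Assumed n-ary PMI: log( p(w1..wn) / (p(w1) * ... * p(wn)) )."""
    p_joint = joint[cand] / total
    p_indep = math.prod(unit[u] / total for u in cand)
    return math.log(p_joint / p_indep)


def dice_n(cand: Candidate, joint: Dict[Candidate, int],
           unit: Dict[str, int]) -> float:
    """Assumed n-ary Dice: n * f(w1..wn) / (f(w1) + ... + f(wn))."""
    return len(cand) * joint[cand] / sum(unit[u] for u in cand)


def ranks(scores: Dict[Candidate, float]) -> Dict[Candidate, int]:
    """Rank candidates by descending score; rank 1 is the best."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {c: i + 1 for i, c in enumerate(ordered)}


def integrate(joint: Dict[Candidate, int], unit: Dict[str, int],
              total: int, max_mean_rank: float) -> List[Candidate]:
    """Keep candidates whose mean rank over both measures is within the threshold."""
    pmi_rank = ranks({c: pmi_n(c, joint, unit, total) for c in joint})
    dice_rank = ranks({c: dice_n(c, joint, unit) for c in joint})
    mean_rank = {c: (pmi_rank[c] + dice_rank[c]) / 2 for c in joint}
    kept = [c for c in joint if mean_rank[c] <= max_mean_rank]
    return sorted(kept, key=mean_rank.get)


# Toy frequencies (hypothetical, not taken from the paper's data).
unit_counts = {"A": 10, "B": 10, "C": 100, "D": 100, "E": 50}
joint_counts = {("A", "B"): 9, ("C", "D"): 10, ("A", "C", "E"): 2}
new_words = integrate(joint_counts, unit_counts, total=1000, max_mean_rank=2.0)
print(new_words)
```

Ranking before combining sidesteps the fact that the two measures live on different scales; whether the paper averages ranks or combines them differently is not stated in the abstract.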
- Copyright
- © 2015, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - CONF
AU - Xiong Wen
PY - 2015/06
DA - 2015/06
TI - New Word Identification for Chinese Patents Based on Multiple Statistic Measures and Pattern Combination
BT - Proceedings of the 2015 International Conference on Electrical, Computer Engineering and Electronics
PB - Atlantis Press
SP - 472
EP - 478
SN - 2352-538X
UR - https://doi.org/10.2991/icecee-15.2015.98
DO - 10.2991/icecee-15.2015.98
ID - Wen2015/06
ER -