Identification Of Seed Users Via Short Messages Based On Hadoop
- DOI
- 10.2991/emcs-16.2016.456
- Keywords
- Corpus processing; Parallel association rule mining; Inverted index hash table; MPI; Speedup
- Abstract
With the rapid growth of text processing technology, many knowledge discovery approaches have been introduced to handle large corpora. Data mining methods such as clustering and categorization, for example, have found wide application in corpus processing. Recently, association rule mining methods have also gained a place in this field. However, because a corpus contains a huge number of "items", traditional association rule mining algorithms face serious effectiveness and efficiency challenges. In this paper, a new parallel association rule mining algorithm customized for corpora is developed and implemented using the MPI programming interface. The main ideas are to adopt a distributed inverted index hash table and to design a communication scheme based on "chessboard decomposition" that accelerates the generation of candidate itemsets. Experiments are devised and conducted on the Tianhe-II Supercomputer of Guangzhou National Super Computing Center. The experimental results demonstrate that the new algorithm achieves desirable performance, with a speedup of 16 when using 49 processes in total.
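The role of the inverted index in candidate generation can be illustrated with a minimal single-process sketch. This is not the authors' MPI implementation: it omits the distribution of the hash table and the "chessboard decomposition" communication scheme, and the function names `build_inverted_index` and `frequent_pairs`, the toy documents, and the support threshold are all assumptions made for illustration. It shows only how the support of a candidate 2-itemset can be obtained by intersecting posting sets instead of rescanning the corpus.

```python
from collections import defaultdict
from itertools import combinations


def build_inverted_index(docs):
    """Map each item (word) to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, words in enumerate(docs):
        for word in set(words):  # count each word once per document
            index[word].add(doc_id)
    return index


def frequent_pairs(index, min_support):
    """Return {(a, b): support} for 2-itemsets with support >= min_support."""
    # Apriori pruning: only items that are frequent alone can form frequent pairs.
    frequent_items = sorted(
        word for word, ids in index.items() if len(ids) >= min_support
    )
    result = {}
    for a, b in combinations(frequent_items, 2):
        # Support of {a, b} is the number of documents containing both items.
        support = len(index[a] & index[b])
        if support >= min_support:
            result[(a, b)] = support
    return result


# Hypothetical toy corpus: each document is a list of words.
docs = [
    ["parallel", "mining", "corpus"],
    ["parallel", "mining"],
    ["corpus", "index"],
    ["parallel", "corpus", "mining"],
]
index = build_inverted_index(docs)
pairs = frequent_pairs(index, min_support=2)
print(pairs)
```

In a parallel setting, one plausible design is to partition the index across processes so that each process intersects only the posting sets it owns or receives; how the paper actually assigns and exchanges them is not specified in this abstract.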
- Copyright
- © 2016, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - CONF
AU - Zhiwei Ye
AU - Pingjian Zhang
PY - 2016/01
DA - 2016/01
TI - Identification Of Seed Users Via Short Messages Based On Hadoop
BT - Proceedings of the 2016 International Conference on Education, Management, Computer and Society
PB - Atlantis Press
SP - 1814
EP - 1817
SN - 2352-538X
UR - https://doi.org/10.2991/emcs-16.2016.456
DO - 10.2991/emcs-16.2016.456
ID - Ye2016/01
ER -