General Simhash-based Framework for News Aggregators
- DOI
- 10.2991/mecs-17.2017.152
- Keywords
- News Aggregator, Simhash, Deduplication, News Recommendation, Breaking Event Detection.
- Abstract
News aggregators typically index billions of news items from the Internet and try to recommend news according to readers' intrinsic interests. Retrieval of similar news, deduplication, and event detection are common problems in aggregator systems, and related work is reported in [1], [2], [3], [4] and [5]. We propose a general simhash-based framework for news aggregators. The system does not need to process crawled news separately for retrieval, deduplication, and event detection; each piece of news is processed only once and requires no extra storage space. Duplicates and breaking events can be detected online, before newly crawled news is stored in the system's database. Machine learning is widely used in news aggregators for tasks such as topic classification, where each piece of news is mapped to a fixed-length feature vector. Simhash fingerprints are generated from these feature vectors rather than from the original text of the news, so news retrieval, deduplication, and breaking news detection can be integrated into any running aggregator system without extra effort. Our aggregator collected around 9.6 million news items from the Internet, and the framework functions well in a real-world scenario.
- Copyright
- © 2017, the Authors. Published by Atlantis Press.
- Open Access
- This is an open access article distributed under the CC BY-NC license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - CONF
AU - Pengcheng Hu
AU - Xiangdong You
PY - 2016/06
DA - 2016/06
TI - General Simhash-based Framework for News Aggregators
BT - Proceedings of the 2017 2nd International Conference on Machinery, Electronics and Control Simulation (MECS 2017)
PB - Atlantis Press
SP - 310
EP - 315
SN - 2352-5401
UR - https://doi.org/10.2991/mecs-17.2017.152
DO - 10.2991/mecs-17.2017.152
ID - Hu2016/06
ER -