A Methodology to Refine Labels in Web Search Results Clustering
- DOI
- 10.2991/ijcis.2019.125905647
- Keywords
- Information retrieval; Machine learning; Web search results clustering; Web intelligence
- Abstract
Information retrieval systems such as web search engines meet a user’s information needs by searching for and retrieving the documents that match the user’s query. First, the query is submitted to the web search engine and is assumed to represent the user’s intention and to reflect their specific information needs; it should therefore be long enough, discriminative, specific, and unambiguous. Second, the web search engine typically responds to the query by returning a long flat list of web search results, each of which represents a relevant document. That list may contain thousands or millions of web search results, so it is difficult to navigate and to locate a specific document relevant to a specific topic. As a postretrieval process, web search results clustering may solve this issue by grouping web search results into clusters. These clusters are supposed to contain topically related documents and to carry descriptive, concise labels that correctly describe the contents of each cluster. Users can thus easily choose the cluster representing the intended topic and navigate through the relatively few documents inside that cluster. High-quality cluster labelling is crucial because it gives users insight into the clusters’ contents, their general structure, and the distribution of topics among the documents in the clusters. This enables the user to preview and navigate results quickly and easily. To this end, this paper introduces a methodology to enhance the labels of clusters of web search results. The proposed methodology is founded on the idea of taking the existing labels nominated by the original Suffix Tree Clustering (STC) algorithm and adapting these labels and/or clusters so that they become more concise and descriptive. The proposed methodology was applied to the original STC algorithm to produce an enhanced version of the classical STC algorithm.
The enhanced algorithm was tested, and the clusters and labels it produced were evaluated and compared with those of the classical STC algorithm. For evaluation, the authors used a cluster labelling performance measure that considers five parameters: f1: Comprehensibility, f2: Descriptiveness, f3: Discriminative Power, f4: Uniqueness, and f5: Nonredundancy. The reported results show that the new enhanced labels outperformed the original labels and that overall performance improved. The recorded results indicate that: (i) the proposed methodology achieved better performance, with an overall average value of 0.921 for the performance measure (f6); (ii) the number of clusters decreased from 15 to 9; (iii) the number of duplicated results decreased from 143 to 121; and (iv) the average number of phrases per label increased from 1.67 to 2.00.
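To make the starting point of the methodology more concrete, the following is a minimal sketch of how STC-style clustering groups search result snippets by shared phrases and then merges overlapping groups, which is how a cluster can end up with more than one phrase in its label. It is an illustration under stated assumptions, not the authors' refinement method: it enumerates word n-grams instead of building a real suffix tree, the function names `base_clusters` and `merge_clusters` and the parameters `max_len`, `min_docs`, and `threshold` are hypothetical, and the 0.5 overlap threshold follows the merging rule commonly described for classical STC.

```python
from collections import defaultdict
from itertools import combinations


def base_clusters(snippets, max_len=3, min_docs=2):
    """Map each word n-gram (up to max_len words) to the snippet ids containing it.

    Simplification: a real STC implementation finds shared phrases with a
    suffix tree; brute-force n-gram enumeration is used here for brevity.
    """
    phrase_docs = defaultdict(set)
    for doc_id, text in enumerate(snippets):
        words = text.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase_docs[" ".join(words[i:i + n])].add(doc_id)
    # Keep only phrases shared by at least min_docs snippets.
    return {p: d for p, d in phrase_docs.items() if len(d) >= min_docs}


def merge_clusters(clusters, threshold=0.5):
    """Merge base clusters whose document sets overlap in both directions.

    Two base clusters are linked when the shared documents exceed `threshold`
    of each cluster; linked clusters are grouped as connected components.
    """
    phrases = list(clusters)
    parent = {p: p for p in phrases}

    def find(p):
        # Union-find lookup with path halving.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in combinations(phrases, 2):
        shared = len(clusters[a] & clusters[b])
        if (shared / len(clusters[a]) > threshold
                and shared / len(clusters[b]) > threshold):
            parent[find(a)] = find(b)

    merged = defaultdict(lambda: {"labels": [], "docs": set()})
    for p in phrases:
        root = find(p)
        merged[root]["labels"].append(p)   # candidate label phrases
        merged[root]["docs"] |= clusters[p]
    return list(merged.values())
```

Each merged cluster keeps every phrase that contributed to it as a candidate label, so a label refinement step of the kind the abstract describes would operate on these phrase lists, for example by dropping phrases that are redundant with one another.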
- Copyright
- © 2019 The Authors. Published by Atlantis Press SARL.
- Open Access
- This is an open access article distributed under the CC BY-NC 4.0 license (http://creativecommons.org/licenses/by-nc/4.0/).
Cite this article
TY - JOUR
AU - Zaher Salah
AU - Ahmad Aloqaily
AU - Malak Al-Hassan
AU - Abdel-Rahman Al-Ghuwairi
PY - 2018
DA - 2018/12/31
TI - A Methodology to Refine Labels in Web Search Results Clustering
JO - International Journal of Computational Intelligence Systems
SP - 299
EP - 310
VL - 12
IS - 1
SN - 1875-6883
UR - https://doi.org/10.2991/ijcis.2019.125905647
DO - 10.2991/ijcis.2019.125905647
ID - Salah2018
ER -