P-ISSN 0974-6846 E-ISSN 0974-5645

Indian Journal of Science and Technology

Year: 2017, Volume: 10, Issue: 10, Pages: 1-9

Original Article

COCUS: Concept Based Document Clustering by Corpus Utility Scale

Abstract

Objective: As the volume of documents in corpora grows, data management and data assurance must offer high interoperability so that critical documents can be retrieved from a vast range of services. Focusing on semantic features, which can improve the accuracy of document tracing and retrieval, offers an effective way to address the issues and limitations of present models. Methods/Statistical Analysis: This study examines the robustness and efficacy of clustering techniques based on semantic features in comparison with other kinds of clustering techniques. It proposes concept based document clustering by corpus utility scale (COCUS). The utility scale in COCUS is derived from a topic-related set of selected documents that serves as a knowledge base, which enables documents to be clustered by their concept relevancy. The proposed clustering model is assessed with the state-of-the-art metrics of cluster purity, inverse purity, and cluster-level harmonic mean. Experiments were carried out on datasets comprising specific kinds of literature gathered from the open access journals of various publishers. In total, 1509 documents were collected; 497 of them were used as the knowledge base and the remaining 1012 were used for the clustering process. Findings: The experimental study indicates that the proposed model is scalable and robust. The purity and harmonic mean of the resulting clusters confirm that COCUS clusters documents by their concept relevancy with 94% accuracy (the average topic-level harmonic mean of the clusters was 0.94). Application/Improvements: The computational complexity of COCUS is shown to be linear, whereas the majority of benchmark models are NP-hard.

Keywords: Cluster, Corpus Utility, Harmonic Mean, Text Mining.
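
The three evaluation metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes the common textbook definitions (purity as the share of documents in the majority topic of their cluster, inverse purity as the same measure with the roles of clusters and topics swapped, and the harmonic mean of the two); the abstract does not state the exact formulas used for COCUS, so the function names and the toy data below are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def purity(cluster_ids, topic_labels):
    """Share of documents that belong to the majority topic of their cluster.

    Assumed textbook definition; not necessarily the paper's exact formula.
    """
    if len(cluster_ids) != len(topic_labels):
        raise ValueError("cluster_ids and topic_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one document is required")
    counts = defaultdict(Counter)
    for cluster, topic in zip(cluster_ids, topic_labels):
        counts[cluster][topic] += 1
    majority_total = sum(max(c.values()) for c in counts.values())
    return majority_total / len(cluster_ids)


def inverse_purity(cluster_ids, topic_labels):
    """Purity with the roles of clusters and topics exchanged."""
    return purity(topic_labels, cluster_ids)


def harmonic_mean(p, ip):
    """Harmonic mean of purity and inverse purity (0.0 if both are zero)."""
    return 0.0 if p + ip == 0 else 2 * p * ip / (p + ip)


# Hypothetical toy data: six documents, two clusters, two topics.
clusters = [0, 0, 0, 0, 0, 1]
topics = ["a", "a", "a", "b", "b", "b"]

p = purity(clusters, topics)
ip = inverse_purity(clusters, topics)
print(p, ip, harmonic_mean(p, ip))
```

Purity alone rewards many small clusters and inverse purity alone rewards one large cluster, which is why the two are usually combined into a single harmonic mean.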

 
