A Mathematical Approach for Mining Web Content Outliers using Term Frequency Ranking


  • Department of MCA, Sri Krishna College of Technology, Coimbatore - 641042, Tamil Nadu, India


The Internet is a vast collection of information, which makes it extremely difficult to search for and retrieve the information that is actually required and valuable. The result set returned by a search engine often contains irrelevant and redundant data, called outliers, and the primary objective of this paper is to provide the user with efficient and effective search results by removing them. In this research work, a mathematical approach based on Spearman's rank correlation coefficient is used to calculate the correlation between document pairs. If the correlation value is 1, the document is considered redundant and can be removed. The method depends on the term frequency of the words common to the two documents in a pair, with the words ranked by their frequency values. This method improves the effectiveness, efficiency and reliability of the search engine. The performance of the n-gram method, TF.IDF, Linear Correlation and Ranking Correlation has been compared. The experimental results show that the proposed method improves precision, recall, F-score and accuracy.
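
As a rough illustration of the idea described above, the following minimal sketch computes Spearman's rank correlation over the term frequencies of the words common to two documents. It is not the authors' implementation: the tokenizer, the average-rank handling of ties, the tolerance used to test for a correlation of 1, and the function names (`term_frequencies`, `average_ranks`, `spearman_common_terms`, `is_redundant`) are all assumptions made for illustration.

```python
from collections import Counter
import re


def term_frequencies(text):
    """Count lowercase word occurrences (assumed tokenizer: alphanumeric runs)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def average_ranks(values):
    """Rank values from highest (rank 1) to lowest; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_common_terms(doc_a, doc_b):
    """Spearman's rho over the frequency ranks of words shared by both documents.

    Returns None when rho is undefined (fewer than two common words,
    or all common words tied in one of the documents).
    """
    tf_a, tf_b = term_frequencies(doc_a), term_frequencies(doc_b)
    common = sorted(set(tf_a) & set(tf_b))
    if len(common) < 2:
        return None
    ranks_a = average_ranks([tf_a[t] for t in common])
    ranks_b = average_ranks([tf_b[t] for t in common])
    mean = (len(common) + 1) / 2  # mean of any average-rank vector of length n
    cov = sum((x - mean) * (y - mean) for x, y in zip(ranks_a, ranks_b))
    var_a = sum((x - mean) ** 2 for x in ranks_a)
    var_b = sum((y - mean) ** 2 for y in ranks_b)
    if var_a == 0 or var_b == 0:
        return None
    # Pearson correlation of the ranks, which equals Spearman's rho with ties.
    return cov / (var_a * var_b) ** 0.5


def is_redundant(doc_a, doc_b, tol=1e-9):
    """Flag a pair as redundant when rho equals 1 (tolerance is an assumption)."""
    rho = spearman_common_terms(doc_a, doc_b)
    return rho is not None and abs(rho - 1.0) <= tol
```

For example, `is_redundant("web web web mining mining outlier", "web web web mining mining outlier")` flags the pair, because the common words have identical frequency rankings in both documents. Computing rho as the Pearson correlation of the ranks, rather than with the shortcut formula based on squared rank differences, keeps the value correct when several words share the same frequency.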


Ranking Correlation Coefficient, Search Engines, Term Frequency, Web Content Mining, Web Content Outliers.

