A Mathematical Approach for Mining Web Content Outliers using Term Frequency Ranking


  • Department of MCA, Sri Krishna College of Technology, Coimbatore - 641042, Tamil Nadu, India


The Internet is a vast collection of information, which makes it extremely difficult to search for and retrieve relevant, valuable information. The primary objective of this paper is to give the user efficient and effective search engine results by removing the irrelevant and redundant documents, called outliers, that the result set contains. In this research work, Spearman's rank correlation coefficient is used to calculate the correlation between document pairs. The method takes the words common to both documents of a pair and ranks them by their term frequency in each document. If the correlation value is 1, the document is considered redundant and can be removed. The performance of the n-gram method, TF.IDF, Linear Correlation and Ranking Correlation has been compared. The experimental results show that the proposed method improves precision, recall, F-score and accuracy, and thereby the effectiveness, efficiency and reliability of the search engine.
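The ranking step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the tie handling (average ranks), and the helper names `term_frequencies`, `average_ranks`, `spearman_common_terms` and `is_redundant` are all assumptions made for illustration. The sketch computes Spearman's coefficient as the Pearson correlation of the two rank vectors, which is the standard definition when ties are present.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase alphanumeric tokens (an assumed, simplistic tokenizer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def average_ranks(values):
    """Rank values from highest (rank 1) to lowest; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns None when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return None
    return cov / (var_x * var_y) ** 0.5


def spearman_common_terms(doc_a, doc_b):
    """Spearman's rho over the term-frequency ranks of words common to both documents."""
    tf_a, tf_b = term_frequencies(doc_a), term_frequencies(doc_b)
    common = sorted(set(tf_a) & set(tf_b))
    if len(common) < 2:
        return None  # too few common terms to correlate
    ranks_a = average_ranks([tf_a[t] for t in common])
    ranks_b = average_ranks([tf_b[t] for t in common])
    return pearson(ranks_a, ranks_b)


def is_redundant(doc_a, doc_b, tol=1e-9):
    """Hypothetical check for the abstract's rule: a correlation of 1 marks redundancy."""
    rho = spearman_common_terms(doc_a, doc_b)
    return rho is not None and abs(rho - 1.0) < tol


doc_1 = "web web web mining mining outlier"
doc_2 = "web web web mining mining outlier"          # same term-frequency ranking
doc_3 = "web mining mining outlier outlier outlier"  # reversed ranking
print(spearman_common_terms(doc_1, doc_2), is_redundant(doc_1, doc_2))
print(spearman_common_terms(doc_1, doc_3), is_redundant(doc_1, doc_3))
```

Because only the ranks of common words are compared, two documents can reach a correlation of 1 without being textually identical, so a real pipeline would presumably apply this test after the preprocessing the full paper describes.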


Ranking Correlation Coefficient, Search Engines, Term Frequency, Web Content Mining, Web Content Outliers.






Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.