Incorporation of contextual information through Graph Modeling in Web content mining

Kaushik Kishore Phukon

doi:10.17485/IJST/v13i46.1660

Indian Journal of Science and Technology

Year: 2020, Volume: 13, Issue: 46, Pages: 4573-4578

Original Article

Kaushik Kishore Phukon¹*

¹Pandit Deendayal Upadhyaya Adarsha Mahavidyalaya, Gauhati University, Bongaigaon, 783383, Assam, India. Tel.: 917002551636

*Corresponding Author
Tel: 917002551636
Email: [email protected]

Received Date: 13 September 2020, Accepted Date: 07 December 2020, Published Date: 19 December 2020

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: The objective of this research article is to address the problem of web document clustering by modeling web documents as directed completely labeled graphs that incorporate contextual information in the computation to the extent required. The computational complexity of the maximum common subgraph (MCS) algorithm based on this graph model is O(n²), where n is the number of nodes. Because graph similarity using MCS is in general an NP-complete problem, this is an important result: it allows us to forgo sub-optimal approximation approaches and find the exact solution in polynomial time.

Method: The first step in this new approach to web document clustering is to represent each web document as a directed completely labeled graph that retains the contextual information of the document. Once a document has been modeled as a graph, the next step is to calculate the similarity between the graphs. For this purpose, we use a customized algorithm, the Algorithm for Maximum Common Subgraph Isomorphism (AMCSI) (1), which is based on a backtracking search scheme and solves the maximum common subgraph isomorphism problem in polynomial time. After obtaining the similarity values between the graphs, we use a customized fuzzy c-means algorithm to produce clusters from the target set of web documents. We use multidimensional scaling to express the distance values between the web pages (graphs) as two coordinates (x, y), and deterministic sampling to calculate the graph median during fuzzy c-means clustering.

Findings: We present an alternative method for web document clustering that represents web documents as directed completely labeled graphs, for which the computational complexity of the MCS algorithm is O(n²) (1). A new distance measure based on the directed completely labeled graph representation is also developed, which gives 16.9% better results than the prevailing methods (2). For clustering, we chose the fuzzy c-means algorithm and customized the original algorithm to work with graphs. This approach enables us to model web documents as graphs without discarding contextual information and then cluster these graphs with a well-established clustering algorithm.
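
The polynomial bound on MCS computation can be illustrated with a minimal sketch. It assumes that "completely labeled" implies every node label is unique within a graph, so two nodes can match only if their labels are equal and the common subgraph reduces to set intersections. The word-adjacency construction in `build_graph`, the names `DocGraph`, `mcs`, and `mcs_distance`, and the distance 1 − |mcs| / max(|G1|, |G2|) are assumptions made for illustration: the distance is a widely used MCS-based measure, not the new measure this article proposes, and the code is not the backtracking AMCSI algorithm itself.

```python
from typing import NamedTuple


class DocGraph(NamedTuple):
    """Directed graph whose node labels are unique (hypothetical helper type)."""
    nodes: frozenset  # node labels, here lower-cased words
    edges: frozenset  # directed edges as (source_label, target_label) pairs


def build_graph(text: str) -> DocGraph:
    """Assumed construction: one node per distinct word, one directed edge
    from each word to the word that immediately follows it."""
    words = text.lower().split()
    return DocGraph(frozenset(words), frozenset(zip(words, words[1:])))


def mcs(g1: DocGraph, g2: DocGraph) -> DocGraph:
    """Maximum common subgraph under the unique-label assumption.

    Nodes can only match on equal labels, so the result is the intersection
    of the node sets and of the edge sets. A graph with n nodes has at most
    n*n directed edges, which gives the O(n^2) bound.
    """
    return DocGraph(g1.nodes & g2.nodes, g1.edges & g2.edges)


def mcs_distance(g1: DocGraph, g2: DocGraph) -> float:
    """Assumed distance: 1 - |mcs| / max(|g1|, |g2|), with size = node count."""
    largest = max(len(g1.nodes), len(g2.nodes))
    if largest == 0:
        return 0.0  # two empty graphs are treated as identical
    return 1.0 - len(mcs(g1, g2).nodes) / largest


if __name__ == "__main__":
    a = build_graph("web document clustering with graphs")
    b = build_graph("web document mining with graphs")
    common = mcs(a, b)
    print(sorted(common.nodes), sorted(common.edges))
    print(mcs_distance(a, b))
```

Because the edges record which word follows which, two documents that share vocabulary but order it differently share fewer edges; this is the kind of contextual information a bag-of-words vector would discard.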

Keywords: Vector; graph; web document; subgraph; isomorphism; fuzzy; clustering

References

Phukon KK. Maximum Common Subgraph and Median Graph Computation from Graph Representations of Web Documents Using Backtracking Search. International Journal of Advanced Science and Technology. 2013;51:67–80.
Schenker A, Last M, Bunke H, Kandel A. Clustering of web documents using a graph model. In: Series in Machine Perception and Artificial Intelligence. (pp. 3-18) 2003.
Shaban K. Semantic Graph Model for Text Representation and Matching in Document Mining. Electrical and Computer Engineering, Faculty of Engineering, University of Waterloo thesis
Suen CY. n-Gram Statistics for Natural Language Understanding and Text Processing. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1979;PAMI-1(2):164–172. Available from: https://dx.doi.org/10.1109/tpami.1979.4766902
Martinez AR, Wegman EJ. Text Stream Transformation for Semantic-Based Clustering. (Vol. 34) 2002.
Hasan M, Matsumoto Y. Document Clustering: Before and After the Singular Value Decomposition. Information Processing Society of Japan. 1999;p. 47–55.
Hotho A, Staab S, Stumme G. Wordnet improves Text Document Clustering. Proceeding of the Semantic Web Workshop at SIGIR, 26th Annual International ACM SIGIR Conference. 2003. Available from: https://doi.org/10.1109/ICDM.2003.1250972
Hammouda K, Kamel M. Phrase-based document similarity based on an index graph model. Proceedings of the 2002 IEEE Int'l Conf. on Data Mining (ICDM'02). 2002. Available from: https://doi.org/10.1109/ICDM.2002.1183904
Schenker A, Bunke H, Last M, Kandel A. Graph Theoretic Techniques for Web Content Mining. (Vol. 62) World Scientific Publishing Co. Ltd. 2005.
Hartigan JA, Wong MA. Algorithm AS 136: A K-Means Clustering Algorithm. Applied Statistics. 1979;28(1):100. Available from: https://dx.doi.org/10.2307/2346830
Jain AK, Dubes RC. Algorithms for Clustering Data. NJ. Prentice Hall. 1988.
Dutta A, Riba P, Lladós J, Fornés A. Hierarchical stochastic graphlet embedding for graph-based pattern recognition. Neural Computing and Applications. 2020;32(15):11579–11596. Available from: https://dx.doi.org/10.1007/s00521-019-04642-7
Duda RO, Hart PE. Pattern Classification and Scene Analysis. New York. Wiley. 1973.
Gerritse EJ, Hasibi F, Vries AP. Graph-Embedding Empowered Entity Retrieval. In: Advances in Information Retrieval, 42nd European Conference on IR Research. (pp. 97-110) 2020.
Buluz B, Yilmaz B. Graph mining approach for modeling academic success. In: 25th Signal Processing and Communications Applications Conference (SIU). (pp. 1-4) 2017.
Chintalapudi RS, Prasad MHK. Mining Overlapping Communities in Real-world Networks Based on Extended Modularity Gain. International Journal of Engineering (IJE), TRANSACTIONS A: Basics. 2017;30(4):486–492.
Berkhin P. Survey of Clustering Data Mining Techniques. Accrue Software.
Hubert LJ. Some applications of graph theory to Clustering. Psychometrika. 1974;38:435–475.
Mehrotra D, Nagpal D, Srivastava R, Nagpal R. Analyse Power Consumption by Mobile Applications Using Fuzzy Clustering Approach. International Journal of Engineering (IJE), IJE TRANSACTIONS C: Aspects. 2018;31(12):2037–2043.
Mahdizadeh M, Eftekhari M. A Novel Cost Sensitive Imbalanced Classification Method based on New Hybrid Fuzzy Cost Assigning Approaches, Fuzzy Clustering and Evolutionary Algorithms. International Journal of Engineering (IJE). 2015;28(8):1160–1168.
Martinez MA, Adiego J, Fuente P. Natural Language Compression on Edge-Guided Text Preprocessing. In: Proc. of 14th International Symposium on String Processing and Information Retrieval (SPIRE’07). (pp. 14-25) 2007.
Phukon KK. A Composite Graph Model for Web Document and the MCS Technique. International Journal of Multimedia and Ubiquitous Engineering. 2012;7(1):45–51.

Copyright

© 2020 Phukon. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Published by Indian Society for Education and Environment (iSee).