Indian Journal of Science and Technology
Year: 2015, Volume: 8, Issue: 16, Pages: 1-9
Kolla Bhanu Prakash1,3* , M. A. Dorai Rangaswamy2 , T. V. Ananthan3 and V. N. Rajavarman3
1 Faculty of Computer Science Engineering, Sathyabama University, Chennai - 600 119, Tamil Nadu, India; [email protected]
2 C.S.E & IT, AVIT, Chennai - 603104, Tamil Nadu, India; [email protected]
3 Faculty of C.S.E, Dr. M.G.R. Educational & Research Institute, Chennai - 600 095, Tamil Nadu, India; [email protected] , [email protected]
Objectives: To develop a generic pixel-map based method that extracts content from web documents in a short time. Method of Analysis: Content extraction proceeds in three levels: the first develops the data inputs as attributes, the second uses those attributes to formulate a model, and the third interprets the results. Each level is varied so that results can be validated and compared across different parameters. The input data cover variations in language, script and usage, and modeling uses statistical, pattern recognition and Artificial Neural Network (ANN) approaches. Findings: The method demonstrates how the quality and size of the input data, in the form of scalars, vectors and matrices, affect the model and the result; this is shown for unstructured word sets chosen from web pages. The models chosen also indicate how input/output variations influence the results. The method is demonstrated on monolingual, multilingual and transliterated datasets, which shows that it is broadly applicable. Novelty/Improvement: The method is generic in its use of pixel-maps, analytically stable because it uses matrix input, and versatile in that it can be adopted with different models.
Keywords: Data Mining, Extraction, Image Processing, Multilingual, Pre-processing, Segmentation, Unstructured
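
The abstract refers to inputs in the form of scalars, vectors and matrices derived from a word's pixel-map, but does not say which attributes are used. The sketch below is only an illustration under assumptions: it takes a binary pixel-map (1 for ink, 0 for background) and derives one plausible attribute of each kind, namely ink density as the scalar, row and column projection profiles as the vector, and the pixel-map itself as the matrix. The function names and the sample `word` map are hypothetical and are not taken from the paper.

```python
from typing import List

# Assumed encoding: a pixel-map is a rectangular grid, 1 = ink, 0 = background.
PixelMap = List[List[int]]


def scalar_attribute(pm: PixelMap) -> float:
    """Scalar input: ink density, the fraction of pixels that are set."""
    total_pixels = sum(len(row) for row in pm)
    ink_pixels = sum(sum(row) for row in pm)
    return ink_pixels / total_pixels


def vector_attribute(pm: PixelMap) -> List[int]:
    """Vector input: row projection profile followed by column profile."""
    row_profile = [sum(row) for row in pm]
    col_profile = [sum(col) for col in zip(*pm)]
    return row_profile + col_profile


def matrix_attribute(pm: PixelMap) -> PixelMap:
    """Matrix input: the full pixel-map, copied so the caller's data is safe."""
    return [row[:] for row in pm]


# Hypothetical 3x4 pixel-map standing in for a rendered word image.
word = [
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 0, 0, 1],
]

density = scalar_attribute(word)
profile = vector_attribute(word)
grid = matrix_attribute(word)
```

Because these attributes depend only on pixels and not on character codes, the same functions apply unchanged to any script, which is the sense in which a pixel-map input can be language independent.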