Information Extraction in Unstructured Multilingual Web Documents

Kolla Bhanu Prakash, M. A. Dorai Rangaswamy, T. V. Ananthan and V. N. Rajavarman

doi:10.17485/ijst/2015/v8i16/54252

Indian Journal of Science and Technology

Year: 2015, Volume: 8, Issue: 16, Pages: 1-9

Original Article

Kolla Bhanu Prakash¹,³*, M. A. Dorai Rangaswamy², T. V. Ananthan³ and V. N. Rajavarman³

¹Faculty of Computer Science Engineering, Sathyabama University, Chennai - 600 119, Tamil Nadu, India; [email protected]
²C.S.E & IT, AVIT, Chennai - 603104, Tamil Nadu, India; [email protected]
³Faculty of C.S.E, Dr. M.G.R. Educational & Research Institute, Chennai - 600 095, Tamil Nadu, India; [email protected], [email protected]

This work is licensed under a Creative Commons Attribution 4.0 International License.

Abstract

Objectives: To develop a generic pixel-map-based method that extracts content from web documents in a short time. Method of Analysis: Content extraction proceeds in three levels: the first develops the data inputs as attributes, the second uses those attributes to formulate a model, and the third interprets the results. Each level is varied so that results can be validated and compared across different parameters. The input data vary in language, script and usage, and modelling uses statistical, pattern-recognition and artificial neural network (ANN) approaches. Findings: Using unstructured word sets chosen from web pages, the method demonstrates how the quality and size of the input data, in the form of scalars, vectors and matrices, affect the model and the result. The models chosen also indicate how input/output variations influence the results. The method is demonstrated on monolingual, multilingual and transliterated datasets, indicating that it is broadly applicable. Novelty/Improvement: The method is generic in that it uses pixel-maps, analytically stable in that it uses matrix input, and versatile in that it can be adapted to different models.
Keywords: Data Mining, Extraction, Image Processing, Multilingual, Pre-processing, Segmentation, Unstructured
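
The abstract describes word inputs presented as scalars, vectors and matrices derived from pixel-maps. The sketch below is one possible reading of that idea, not the authors' implementation: the 5x5 glyphs, the function names, and the specific attributes chosen (ink density as the scalar, row and column projections as the vector, the raw 0/1 grid as the matrix, and a Hamming distance for comparison) are all assumptions made for illustration. Because it works on pixels rather than character codes, the same code would apply to any script.

```python
from typing import List

PixelMap = List[List[int]]  # 0/1 grid; 1 marks an inked pixel


def parse_pixel_map(rows: List[str]) -> PixelMap:
    """Turn rows of '#' (ink) and '.' (background) into a 0/1 matrix."""
    width = len(rows[0])
    if any(len(row) != width for row in rows):
        raise ValueError("all rows must have the same width")
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


def scalar_attribute(pm: PixelMap) -> float:
    """Scalar input (assumed): fraction of pixels that are inked."""
    inked = sum(sum(row) for row in pm)
    return inked / (len(pm) * len(pm[0]))


def vector_attribute(pm: PixelMap) -> List[int]:
    """Vector input (assumed): row sums followed by column sums."""
    row_sums = [sum(row) for row in pm]
    col_sums = [sum(col) for col in zip(*pm)]
    return row_sums + col_sums


def matrix_attribute(pm: PixelMap) -> PixelMap:
    """Matrix input (assumed): the full pixel grid, copied unchanged."""
    return [row[:] for row in pm]


def hamming_distance(a: PixelMap, b: PixelMap) -> int:
    """Count the pixels that differ between two equally sized pixel-maps."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        raise ValueError("pixel-maps must have the same shape")
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


# Hypothetical 5x5 glyphs, used here only as stand-ins for rendered words.
GLYPH_T = parse_pixel_map(["#####", "..#..", "..#..", "..#..", "..#.."])
GLYPH_L = parse_pixel_map(["#....", "#....", "#....", "#....", "#####"])

if __name__ == "__main__":
    print("density:", scalar_attribute(GLYPH_T))
    print("projections:", vector_attribute(GLYPH_T))
    print("T vs L distance:", hamming_distance(GLYPH_T, GLYPH_L))
```

The two glyphs have the same ink density, so the scalar alone cannot separate them, while the projection vector and the full matrix can. This illustrates, in miniature, why the form of the input data can matter to the result.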