P-ISSN 0974-6846, E-ISSN 0974-5645

Indian Journal of Science and Technology

Year: 2015, Volume: 8, Issue: 16, Pages: 1-9

Original Article

Information Extraction in Unstructured Multilingual Web Documents

Abstract

Objectives: To develop a generic pixel-map based method that extracts content from web documents in a short time. Method of Analysis: Content extraction proceeds in three levels: the first develops data inputs as attributes, the second uses those attributes to formulate a model, and the third interprets the results. Each level is varied so that results can be validated and compared across different parameters. The input data vary in language, script and usage, and modeling uses statistical, pattern recognition and artificial neural network (ANN) approaches. Findings: The method demonstrates how the quality and size of the input data, in the form of scalars, vectors and matrices, affect the model and its results, using unstructured word sets chosen from web pages. The chosen models also give an idea of how input and output variations affect the results. The method is demonstrated on monolingual, multilingual and transliterated datasets, showing that it applies across languages and scripts. Novelty/Improvement: The method is generic in its use of pixel-maps, analytically stable because it uses matrix input, and versatile in that it can be adapted to different models.
Keywords: Data Mining, Extraction, Image Processing, Multilingual, Pre-processing, Segmentation, Unstructured
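
The abstract describes feeding a model with scalar, vector and matrix inputs derived from pixel-maps of words. The following is a minimal sketch of what those three input forms could look like, not the authors' implementation. The 3x5 glyph table and the function names `pixel_map`, `column_profile` and `ink_density` are hypothetical and introduced only for illustration; a real system would presumably rasterize rendered text in the target script.

```python
# Hypothetical 3x5 bitmap glyphs (assumption for illustration only);
# a real pipeline would rasterize text as rendered on the web page.
GLYPHS = {
    "a": ["010", "101", "111", "101", "101"],
    "b": ["110", "101", "110", "101", "110"],
    "c": ["011", "100", "100", "100", "011"],
}
GLYPH_HEIGHT = 5


def pixel_map(word):
    """Matrix input: rows of 0/1 pixels, glyphs side by side with a one-pixel gap."""
    rows = []
    for r in range(GLYPH_HEIGHT):
        row = []
        for i, ch in enumerate(word):
            if i:
                row.append(0)  # blank column separating adjacent glyphs
            row.extend(int(p) for p in GLYPHS[ch][r])
        rows.append(row)
    return rows


def column_profile(matrix):
    """Vector input: number of ink pixels in each column of the pixel-map."""
    return [sum(col) for col in zip(*matrix)]


def ink_density(matrix):
    """Scalar input: fraction of pixels in the pixel-map that are ink."""
    total = sum(len(row) for row in matrix)
    return sum(map(sum, matrix)) / total


if __name__ == "__main__":
    m = pixel_map("ab")
    print(column_profile(m))
    print(round(ink_density(m), 4))
```

In this sketch the matrix keeps the full glyph shape, the column profile compresses it to one value per column (a zero entry marks a gap that a segmentation step could use), and the density reduces the word to a single number. Because the representation depends only on pixels, the same three functions would apply unchanged to any script for which a pixel-map is available.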
