Mining Issues in Traditional Indian Web Documents

Recent developments in information technology are mostly in areas where information, content creation and knowledge integration are the driving forces. Beginning with adjusting to complexities in Internet and mobile communications, these developments are becoming significant sources of knowledge and expertise, and this is where countries like India and China play a major role. Indian tradition is considered more than 5000 years old, and evidence of some of this is available even now in written, oral and physical forms, such as the Mahabharata as text or Mohenjo-daro and Harappa as structures. This study presents issues in extracting information from traditional Indian documents and a method for evaluating content, as the language, script and form of the web documents vary significantly. The development is based on the pixel level, to make the approach generic, and presents results for some basic issues at the text level and shows how these can be extended to the word and document level.


Introduction
The last decade witnessed a complete shift in communications, with cellular and web technology dominating. The availability of cost-effective cell phones suited to different segments of society has created a new shift in many human activities at the societal and organizational levels, with SMS, the Internet and the web becoming common communication media. With the emergence of the Internet and easier global access to it at any time, the large amount of data available, interaction on the web and dependence on data for day-to-day issues are becoming necessary. It is said that the Internet and the web are probably the largest source of information available and accessible any time, anywhere 1 , and more than 1.5 billion people use the Internet on a daily basis, as described in several news items 2 . People have different reasons and purposes for using the Internet. Some use search engines such as Google, Yahoo, etc. 3,4 to discover and access information on a topic of interest. Some advanced users extract information from pages for later processing, which can be achieved by so-called information extraction systems 5,6 . Pre-processing, data mining and post-processing are terms normally associated with information extraction 7,8 . All these aspects relate predominantly to English-based web development, and even here terms like web noise and the static, dynamic or temporal nature of data and methods of web generation have come into focus in web usage. Web noise is defined as information present on web pages that is not relevant to the main content of the page. For example, in an article with a specific title, all the copyright notices, banners, advertisements and other redundant segments within that website are considered noise, because they do not add any value to the article 9,10 .
But in the Indian context, web and Internet communication need to cater to different languages, dialects and traditions, as variations in customs, languages and dialects are quite significant geographically. Since communication can be in any language, the web should also be able to cater to this, and the present study is on issues related to the extraction of information from typical Indian web pages.

Brief Review on Related Literature
Extracting information and content from conventional web pages written in English has received considerable attention in the research community, and surveys like Gupta et al. 12 and books like Han et al. 13 are available. Works related to structured and unstructured web pages deal with using software 14 , block importance 15,16 and cleaning 9 . Recently, Akbar et al. 19 proposed an algorithm to extract the main blocks in blog posts. Another study using a predictive approach by Weiss et al. 20 and a different approach based on pattern matching by Wu et al. 21 deal predominantly with text documents in English. Most of this work relates to extracting content from structured and well-developed web pages; but web pages in the Indian context, and in particular those using traditional data, are quite complex.

Web Documents in Indian Context
Indian web documents are quite varied: web pages developed in different regions on the same day show considerable variation in content. Figure 1 gives two typical web pages taken on the same day, 25th December 2014, in two languages, Hindi and English. One can see very few common items; even the images and text for the same content are different. Extracting content from these web pages calls for an approach which is generic and, at the same time, different from earlier data mining and text mining techniques, in order to handle language-, script- and dialect-related issues. So, current Indian web pages are multi-lingual, unstructured and non-homogeneous; but if one adds tradition to this through digitized data, the complexities increase, as in the two documents shown in Figure 2. Here an image of a 'vimana' used in the Ramayana is shown in sketch form with definitions in the form of sutras (verses), whose translations are also shown therein. A further varied form is text written on stone, metal or palm leaf, as shown in Figure 3.
With this background, Indian web pages, current or traditional, have specific features which can be grouped as unstructured, non-homogeneous, multi-lingual and strongly dialect-related. A higher level of information extraction is needed, more general than conventional data mining and able to deal with web pages as they are; towards this, media mining approaches have come into prominence.

Media Mining Approach for Indian Texts
Media mining has progressed rapidly over the last decade and has become a powerful tool, with a number of software packages for translating documents in different languages and in different forms. These approaches, which rely on the structural features of documents, may not be directly applicable to multi-lingual documents in the Indian context, mainly due to the varied forms of regional language texts and dialects, besides the syntax formats being different in some cases. A letter in an Indian regional language is quite different from one in English, as the text may be a combination of two or three different characters, like ' ' in Hindi meaning hundred. This is one aspect of Indian regional language text that makes it difficult to apply data mining approaches directly. Further, a letter in one language can be part of a word or a word by itself, like the letter 'a' or 'I' in English and ' ' in Hindi. Figure 4 gives text complexities at the letter level in the Indian context: Figure 4(a) gives the letter 'a' in English with its equivalent words, in translated form, in four different Indian regional languages. This shows a special feature: a single character is required to form the word in English, whereas more than one character is required to form the corresponding word in the other languages. Figure 5(a) gives a word in Hindi, ' ' , which is a single character in Hindi but requires more than one character for its equivalents in other languages, like 'mother' in English and ' ' in Tamil. Figure 5(b) gives another variation of Hindi 'maa', used as part of other words to give meaningful words of different content, illustrating text complexities at the letter level with the regional language Hindi as basis.
So, content extraction in multi-lingual web pages calls for a quick and efficient overall assessment of the content of a web page irrespective of the language used for its development, and it is proposed to cover this aspect by giving examples at the letter and character level and later at the word level. Figure 6 gives a typical bilingual web page with multi-tasking features: text in two different languages, images embedded with text, advertisements and add-ons, all on the same web page.
Indian languages are very different from English. Since English is the link language in communication and forms the basis of higher education, some complexities exist in migrating from English to a regional language or vice versa, like the one shown in Figure 7. Here the word 'magnet' in English, translated into Hindi, Telugu and Tamil as shown in Figure 7, clearly shows the variations in script and structure of the text in different forms. This is true in education, where textbooks written by authors in regional languages are used in web pages. Typically, in Figure 8 a page of a computer science textbook is shown in two languages, English and Tamil. Here one can see that computer terms in English are used as they are in the Tamil version, scripts retain letters in English, and the image does not change. Translating everything into Tamil may not work, may not even be possible, and even if done may not reach the student easily in terms of content. A general method is therefore proposed which operates at the pixel level so that text and language variations are taken care of. This is becoming very important in education, where mixing letters and words, with and without translation, is becoming quite common while teaching. This is detailed in the following sections.

Methodology Proposed and Typical Results
Here the overall organization of the proposed model is presented as a flowchart in Figure 9, where input preparation, followed by attribute generation, algorithm usage and content extraction, are the major segments.

Input Preparation
For any data processing or numerical modeling, data and its form play an important role. Data in the Internet and web context can be of different kinds, such as text, script, image or even video, and for computer modeling and processing these need to be converted into the basic computer-recognizable form, which is a pixel map. Pixel maps are matrix representations of data, whether text or image, giving the x and y locations and the intensity or color at each location.
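The conversion described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the rendered character or page region is already available as a grayscale intensity matrix, and that a fixed threshold separates ink from background.

```python
import numpy as np

def to_pixel_map(gray, threshold=128):
    """Convert a grayscale intensity matrix into a binary pixel map.

    Pixels darker than the threshold (ink) become 1 and the background
    becomes 0, so text in any script reduces to the same matrix form.
    """
    gray = np.asarray(gray)
    return (gray < threshold).astype(np.uint8)

# A toy 5 x 5 "glyph": dark strokes (value 0) on a white background (255).
glyph = np.array([
    [255, 255,   0, 255, 255],
    [255,   0, 255,   0, 255],
    [  0,   0,   0,   0,   0],
    [  0, 255, 255, 255,   0],
    [  0, 255, 255, 255,   0],
])
pmap = to_pixel_map(glyph)
print(pmap.sum())  # total number of ink pixels
```

Because the representation is purely positional, the same function applies unchanged to a Devanagari or Tamil glyph, which is what makes the approach language-independent.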

Attribute Generation
Features can be extracted from pixel maps in different ways; the simplest is a single number giving the sum of all non-zero voxels (volume pixels). Since the presence or absence of a pixel gives non-zero and zero values as attributes, the voxel opens a third dimension in pixel map manipulation. Figure 10 gives the histograms of attribute variations for 'a' treated as a word, in comparison with the Hindi and Tamil equivalents of that word. In comparing these attribute variations, the first chart gives values less than 1, as they are normalized, while the others are actual values. It is clear that, for voxel values, 'a' in Hindi comes closer to 'a' in Tamil than to 'a' in English. In matching the pattern with the given values, the content is easily predicted. The next case is when 'a' is used as part of a word like 'apple', 'mango' or 'orange', and Figure 11 gives similar bar charts for comparison.
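The exact definitions of the 2 x 2, 3 x 3 and other attributes are not fully specified here, so the following is a plausible sketch under the assumption that a k x k attribute is the sum of ink pixels in each k x k block of the map, with zero padding when the sides are not multiples of k; the scalar attribute is the total voxel sum.

```python
import numpy as np

def block_attribute(pmap, k):
    """Sum of ink pixels in each k x k block of a binary pixel map.

    The map is zero-padded so its sides become multiples of k; larger k
    gives coarser attributes, and k = 1 returns the map itself.
    """
    h, w = pmap.shape
    H, W = -(-h // k) * k, -(-w // k) * k   # round up to multiples of k
    padded = np.zeros((H, W), dtype=pmap.dtype)
    padded[:h, :w] = pmap
    return padded.reshape(H // k, k, W // k, k).sum(axis=(1, 3))

pmap = np.array([[1, 0, 1, 1],
                 [0, 1, 1, 0],
                 [1, 1, 0, 0],
                 [0, 0, 1, 1]])
scalar = int(pmap.sum())          # single-number (voxel-sum) attribute
a22 = block_attribute(pmap, 2)    # 2 x 2 block attribute
a33 = block_attribute(pmap, 3)    # 3 x 3 block attribute (with padding)
```

Each attribute preserves the total ink count while trading spatial detail for compactness, which matches the observation below that larger attributes track the actual matrix more closely.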
Comparing Figures 10 and 11 using pattern matching shows clearly that the two are of different content. This supports our assumption that the letter 'a' has a unique meaning when used as a single letter, and a totally different meaning when used as part of other words like 'apple', bringing in the context of totally different content.
The observations made for English now need translation to other Indian languages like Hindi, where the letter ' ' means mother; the equivalents in the other two languages are taken, and Figure 12 gives a comparison.
In comparing all the attribute variations, it is found that for non-zero attribute values Hindi and Tamil show closer similarity than 'maa' in English. For the 3-value attribute, the English and Tamil patterns match much more closely than the Hindi 'maa'. Similarly, for the 2 x 2 value attributes, Hindi and Tamil match closely in pattern. This is shown in Figure 12 and Figure 13 respectively. Another distinct observation from this study is that if ' ' is compared carefully with 'mother' and ' ' , a dot is seen in the top half in Hindi where no character is present in the others. This is confirmed by the presence of zeroes in the values observed in Figure 14.
Here again the first chart gives values less than 1, as the ratio with the total pixels is taken, whereas the others are actual values.
But the values given in these figures need generality, and normalization is done with the number of pixel-map entries so that all values will be less than 1. These normalized values for all four types of attributes are shown in Figure 15, and one can see clearly that as the size of the attribute increases from scalar to vector and then to matrix, proximity to the actual matrix increases. With these five types of attributes, an algorithm can be started to assess the content. In comparing all the attribute variations, it is found that for non-zero attribute values Hindi and Tamil show closer similarity than 'a' in English. For the 3-value attribute, the English and Tamil patterns match much more closely than the Hindi 'a'. For the 3 x 3 value attributes, Hindi and Tamil match closely compared to 'a' in English. Finally, when all the 5-value attributes are compared, it is observed that 'a' in Hindi is close to 'a' in Tamil. As the base pixel map is that of English for 'a' and of Hindi for ' ' , it is better to normalize all values with respect to the base values. This is shown in Figure 15 for 'a' and Figure 16 for ' ' , and this normalization helps one to assess, for any new letter, whether it is a word in both languages.
For the 3 x 3 value attributes, Hindi and Tamil match closely compared to 'maa' in English. Finally, when all the 5-value attributes are compared, it is observed that 'maa' in Hindi is close to 'maa' in Tamil, as shown clearly in Figure 12 and Figure 13 respectively. Figure 16 gives a comparison of all the attributes normalized with ' ' in Hindi. Since all the features are normalized, this comparison gives better observations; for example, if a median line is drawn, many values fall in the same range for each attribute. This observation, found here for the first time, helps greatly in matching the pattern. Repeating this approach for words containing the letter shows clearly that the two are of different content. This supports our assumption that the letter ' ' has a unique meaning when used as a single word and a totally different meaning when used as part of other words, bringing in the context of totally different content.
Our next interesting observation concerns the letter 'x', which is unique to English and has no meaning when used as a single character. Very rarely does a proper meaningful word start with 'x'; for example, 'xenon lamp' or 'X-mas' are not unique English words. So there is no exact translation of 'x' in the Indian regional languages; 'x' can be written in languages like Hindi and Tamil as a combination of two or more characters, as shown in Figure 17. Extraction of feature attributes can also be done for this using a similar approach, and the normalized values can likewise be displayed. Figure 18 gives a comparison of all the attributes normalized with 'x' in English. Since all the features are normalized, this comparison gives better observations; for example, if a line is drawn at the center, many values fall in the same range for each attribute, indicating content 'x'. Examining all the above cases closely, it is observed that a minimum of 2 x 2 attributes is required to get an idea of the content. Good conclusions came from the 2 x 2, 3 x 3 and 5-value attributes: for instance, 'a' treated as a single-character word has a different meaning from 'a' used as part of other words, and the same results were observed with a regional language word. So, to understand these attributes better, it is useful to train a neural network with any of these three different attributes.

Algorithm for Content Assessment
Since it is possible to generate attributes for any letter, word or image using the approach mentioned earlier, algorithms such as statistical methods, pattern matching or a neural network model can be used to assess the content. Some results for statistical methods were already presented by the authors 22 . Figure 19 gives the neural network plot comparing the 3 x 3 attribute variations. It is observed that the best validation performance is attained at the first iteration. Figure 20 gives the attribute variations for 3 x 3 compared with 2 x 2 using the neural network fitting plot; the function fit for output element 1 shows the input and output targets along with the errors observed. The results are satisfactory and indicate that the net has been trained to assess the content for any similar or dissimilar words. More work is under way on creating a repository for content assessment.
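The matching step can be illustrated with a simple nearest-neighbour sketch over normalized attribute vectors. This is not the trained neural network used above, only a minimal distance-based matcher; the labels and attribute values in `refs` are hypothetical, and normalization by the total ink count follows the ratio-to-total-pixels idea described earlier.

```python
import numpy as np

def normalize(attr):
    """Normalize an attribute vector by its total so all values are < 1."""
    attr = np.asarray(attr, dtype=float).ravel()
    total = attr.sum()
    return attr / total if total else attr

def assess_content(query, reference_set):
    """Return the label of the reference attribute vector nearest the query.

    reference_set maps labels (e.g. "a-English", "a-Hindi") to raw
    attribute vectors; Euclidean distance on normalized vectors is used.
    """
    q = normalize(query)
    return min(reference_set,
               key=lambda label: np.linalg.norm(q - normalize(reference_set[label])))

refs = {                       # hypothetical flattened 2 x 2 attributes
    "a-English": [2, 3, 2, 2],
    "a-Hindi":   [4, 1, 3, 4],
    "a-Tamil":   [4, 2, 3, 3],
}
print(assess_content([4, 2, 3, 4], refs))  # nearest reference label
```

A neural network classifier would replace the distance computation with a learned mapping, but the input (normalized attribute vectors) and output (a content label) are the same.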

Conclusion
In this paper we presented a detailed study of attribute variations at the character and word level. The complexities and structural variations in Indian regional languages were identified. Pattern matching techniques at the attribute level yielded good results in identifying the content. It is observed that a minimum of 2 x 2 attributes is required to get an idea of the content, and good conclusions came from the 2 x 2, 3 x 3 and 5-value attributes: 'a' treated as a single-character word has a different meaning from 'a' used as part of other words, and the same results were observed with a regional language word. Neural network fitting helped to judge the best validation performance of the attribute values. The work will be extended further with an in-depth study of neural network processing at more technical word and sentence levels.