Authorship Identification for Tamil Classical Poem (Mukkoodar Pallu) using Bayes Net Algorithm

Objective: To identify the authors of Tamil texts of unknown authorship based on the work of known authors. Methods/Analysis: Text processing derives high-quality information from text, including its statistical patterns. This paper proposes a text processing method that extracts features and performs classification on them. Findings: The classifier achieves an accuracy of 94.1%. Accuracy improved from 88.23% to 94.1% when the classification algorithm was changed from a C4.5 decision tree to Bayes Net. Novelty/Improvement: The method can be extended to other regional languages, and it can help identify the authors of other anonymous Tamil poems.


Introduction
The authors of many regional-language poems have not yet been identified. In Tamil, for instance, many poems remain anonymous, and identifying their authors would be valuable. Prior research suggests that most anonymous poems can be associated with an author whose name and work are already known, so a suitable algorithm can be used to attribute such work. Augustus De Morgan suggested as early as 1851 that the mean length of words could serve as a measure for resolving authorship problems. Statistical theory based on Bayes' theorem was later applied to the disputed authorship of the Federalist Papers (Mosteller and Wallace, 1964).
Identifying the writer of a text on the basis of its stylistic characteristics is known as the authorship attribution problem in linguistic research. Feature extraction contributes substantially to solving this problem; it involves extracting frequently used words, sentence lengths, special characters and similar features.
Reference [1] explains which features can be extracted from a dataset and reports the accuracy of the resulting classifier models. The Enron e-mail dataset was used with six different algorithms, the best of which reached an accuracy of 90.08%. The adaptive Metropolis algorithm gave 68.19%, Naive Bayes 79.07%, Bayes Net 79.86%, CBA 84.18%, CMAR 88.47% and CMARAA 90.08%.
The authors of [2] proposed using sonnets for authorship identification, based on features extracted from the text. Identification was performed by normalizing the sonnets that were collected from various federalist papers. A straightforward way to extract features from regional languages is described in detail in [3], which uses a Slovakian dataset and a compaction algorithm.
Keywords: Authorship, Classification, Feature Selection, Tamil Articles
Reference [4] explains how to use the AT and SAT models to perform authorship attribution on a given dataset. In [5], the authors use the random kitchen algorithm to classify the dataset and find the authors of unknown texts.
In [6], a method is developed to identify Tamil words in online datasets, extract features from them and classify them. Reference [7] explains how to perform authorship attribution on the Enron e-mail dataset by classifying it with the random forest algorithm.
Reference [8] explains how to identify and extract both local and global features in order to recognize numerals and Tamil words in online articles. Reference [9] provides a way to identify and extract features pertinent to the Kannada language in order to identify the authors of unknown articles.
In [10], author attribution is performed on Tamil datasets using techniques such as reevaluation and pattern analysis to find the authors of unknown texts, while [11] shows how to recognize Tamil text in online datasets and classify it using the random kitchen algorithm.

Materials and Method
Present authorship identification methods are mostly limited to English, and regional languages are poorly supported. Authorship identification can be performed in two ways: natural language processing and text processing. This paper proposes a text processing method. It uses features such as character count, sentence count and function word count, and performs classification on them to find the authors of unknown texts. A list of features that can be used for classification is given in [1] and in Table 1. These features are extracted from the dataset and used for classification.
These features define the stylometry of the author. Stylometry is the study of writing style in written text, and it can be applied to authorship identification. It includes the extraction of lexical, syntactic and semantic features pertinent to the language considered. Table 1 shows the lexical and syntactic features that are extracted from the dataset.
Using the Bayes Net algorithm, an accuracy of 94.1% was attained. Bayes Net uses the simple estimator, with its parameter A set to 0.5, to estimate the conditional probabilities from the instances, and the K2 algorithm to search for the network structure. The decision tree algorithm gives an accuracy of 88.23% after varying the confidence factor and the minimum number of objects, compared with 94.1% for Bayes Net. The confusion matrix, Figure 1, shows the correctly classified instances, which include the work of three authors.
Vol 9 (47) | December 2016 | www.indjst.org

Feature Extraction
Classification cannot be performed directly on the raw dataset, so features that are useful for building the classifier have to be extracted from it first. Extracting them manually is impractical, so a macro is written to perform the feature extraction iteratively. The features listed in Table 1 are the ones considered for classification. To extract them, the Tamil dataset is first converted into Unicode. The listed features are then extracted in Microsoft Excel using the macro. A macro is a small piece of code that performs a given set of operations repeatedly.
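The extraction step can be illustrated outside Excel. The following is a minimal Python sketch, not the macro used in the paper: the function name `extract_features`, the sentence delimiters and the three function words are assumptions chosen for illustration, since the paper does not list them.

```python
import re

# Hypothetical function words; the paper does not list the ones it counts.
FUNCTION_WORDS = {"மற்றும்", "ஒரு", "என்று"}

def extract_features(text, function_words=FUNCTION_WORDS):
    """Return simple lexical counts for one poem given as a Unicode string."""
    # Assumption: sentences end at '.', '!' or '?'; empty pieces are dropped.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.split()
    # Strip surrounding punctuation before matching function words.
    cleaned = [w.strip(".,;:!?\"'()") for w in words]
    avg_len = sum(len(w) for w in cleaned) / len(cleaned) if cleaned else 0.0
    return {
        # len() counts Unicode code points, not visible Tamil letters.
        "char_count": len(text),
        "word_count": len(words),
        "sentence_count": len(sentences),
        "function_word_count": sum(1 for w in cleaned if w in function_words),
        "avg_word_length": avg_len,
    }
```

Calling `extract_features` once per poem yields one feature row per poem, which is the shape a classifier expects. Because Tamil letters often span several code points, a production version would likely count grapheme clusters instead of code points.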

Feature Selection
Feature selection is done by building a decision tree with the C4.5 algorithm. The tree is built using two quantities, information gain and entropy. Information gain is the reduction in entropy obtained by splitting the data on a feature: the lower the entropy that remains after the split, the higher the information gain. The tree is built from the root node downwards. At each step, the feature with the highest information gain becomes the next node, and the data are partitioned into subsets that are more homogeneous.
The decision tree is used to prune the feature set to a minimum without affecting classifier accuracy. Pruning is based on information gain, that is, on the difference in entropy. Feature selection removes unwanted features from those initially considered. A large number of features can lower classifier accuracy and introduce irrelevant features, whereas selecting only the relevant features leads to higher classifier accuracy. The selected features are listed in Table 1.
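The two quantities that drive the tree can be computed directly. The sketch below shows plain entropy and information gain for a discrete feature, as described above; the function names are illustrative, and C4.5 itself goes one step further and normalizes the gain into a gain ratio, which is not shown here.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after the split."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature_values):
        # Labels of the instances that take this feature value.
        subset = [lab for f, lab in zip(feature_values, labels) if f == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder
```

A feature that separates the authors perfectly leaves no entropy after the split, so its gain equals the entropy of the labels; a feature unrelated to the author has a gain of zero and is a candidate for removal.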

Classification Algorithm
The classification algorithm used is Bayes Net. A Bayes Net, or Bayesian network, is a probabilistic graphical model that represents random variables and their conditional dependencies in a directed acyclic graph. It works on the basis of the probabilities of events: it builds a network from the conditional probabilities of all the specified variables. These probabilities are then used to predict how likely a similar event is to occur.
The classifier is built following the architecture shown in Figure 1, with the steps carried out in order.
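To make the probabilistic idea concrete, the following sketch builds the simplest possible network, in which the author node is the only parent of every feature node. This structure and the names `fit` and `posterior` are assumptions for illustration; the network learned by K2 in the paper may contain additional edges. The conditional probabilities are estimated with an initial count of 0.5 per value, which is how the 0.5 setting of the simple estimator is understood here.

```python
from collections import Counter

ALPHA = 0.5  # initial count per value (assumed meaning of the 0.5 setting)

def fit(rows, labels):
    """Count classes and (class, value) pairs for each discrete feature."""
    class_counts = Counter(labels)
    n_features = len(rows[0])
    # values[i] is the set of values observed for feature i.
    values = [set(row[i] for row in rows) for i in range(n_features)]
    # counts[i][(c, v)] is the number of class-c rows with feature i equal to v.
    counts = [Counter() for _ in range(n_features)]
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            counts[i][(c, v)] += 1
    return class_counts, values, counts

def posterior(model, row):
    """Return P(author | features) for one row of discrete feature values."""
    class_counts, values, counts = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        # Smoothed prior P(author).
        p = (n_c + ALPHA) / (total + ALPHA * len(class_counts))
        for i, v in enumerate(row):
            # Smoothed conditional P(feature_i = v | author).
            p *= (counts[i][(c, v)] + ALPHA) / (n_c + ALPHA * len(values[i]))
        scores[c] = p
    norm = sum(scores.values())
    return {c: s / norm for c, s in scores.items()}
```

The smoothing keeps every probability above zero, which matters on a small dataset where many feature values never co-occur with a given author. Continuous features such as counts would first have to be discretized into bins before being passed to `fit`.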

Results and Discussion
The Bayes Net algorithm produced an accuracy of 94.1% on the Mukkoodar Pallu dataset using the Weka tool. The C4.5 decision tree algorithm initially produced an accuracy of 76.4% on the dataset. To improve this, two parameters were varied: the confidence factor and the minimum number of objects. With the confidence factor set to 0.2 and the minimum number of objects set to 4, as shown in the accompanying figure, the accuracy of C4.5 rose to 88.23%.
Bayes Net performed better than C4.5 and improved the classifier accuracy further. It uses two components: the simple estimator for estimating probabilities and the K2 algorithm for searching. The simple estimator uses an initial count of 0.5 for each value when estimating the probability tables. The K2 algorithm uses a greedy search to uncover the underlying network structure over the pre-determined nodes. Table 2 shows the confusion matrix obtained after classification. The poems of authors X and Z are all correctly classified. Only 50% of the work of author Y is classified correctly: one poem is classified correctly and the other is misclassified. This gives an overall classifier accuracy of 94.1%.
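The overall accuracy follows from the confusion matrix as the number of correctly classified poems divided by the total. The sketch below assumes 17 poems in total, which is an inference from the reported 94.1% (about 16/17) and is not stated in the text. The counts 8 and 7 for authors X and Z, and the author to whom the misclassified poem of Y was assigned, are hypothetical placeholders; only the row for author Y and the absence of errors for X and Z come from the text.

```python
# Rows are actual authors, columns are predicted authors (order: X, Y, Z).
# Hypothetical values: the 8/7 split and the column of author Y's error.
confusion = [
    [8, 0, 0],  # author X: all poems correct
    [1, 1, 0],  # author Y: one poem correct, one misclassified
    [0, 0, 7],  # author Z: all poems correct
]

def accuracy(matrix):
    """Fraction of instances on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

def recall_per_class(matrix):
    """Fraction of each author's poems that were classified correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]

print(round(100 * accuracy(confusion), 1))  # 94.1
print(recall_per_class(confusion))          # [1.0, 0.5, 1.0]
```

Under the same assumption of 17 poems, the earlier figures of 88.23% and 76.4% would correspond to 15 and 13 correctly classified poems respectively.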

Conclusion
The C4.5 decision tree algorithm produced an accuracy of 76.4% on the dataset. To improve the classifier accuracy, two parameters were varied: the confidence factor and the minimum number of objects. With the confidence factor set to 0.2 and the minimum number of objects set to 4, the classifier accuracy increased to 88.23%. The Bayes Net algorithm produces an accuracy of 94.1% using the simple estimator and K2 search, so authorship identification reaches 94.1% with these two settings. By extracting general features that are common to all regional languages, a single authorship identification system could be developed for all of them.