Mining Children Ever Born Data: A Classification Tree Approach

Abstract: Classification is an important problem in many fields of science. The goal is to classify a categorical response variable on the basis of input covariates. For years, logistic regression and discriminant analysis were the standard statistical methods for classification. With the development of database storage and computer programming, newer methods such as classification trees are now also applied to classify data. The aim of this article is to introduce the CART algorithm as a classification tree rule for analyzing demographic data. CART generates binary decision trees that have simple interpretations. We use CART to classify children ever born, an important phenomenon in demographic research.


Introduction
The new generation of advanced computers can store huge data sets. This is fortunate in that we can keep any data we want; on the other hand, if we cannot use the information in these data, we gain nothing from them. Data mining is a relatively new data analysis approach that can cope with very large data sets. It draws on several disciplines, including computer science, artificial intelligence, machine learning and statistics, and takes its name from the fact that researchers mine huge data sets to extract valuable information 1 . Data mining is a useful exploratory data analysis process in which there is no pre-determined perception or interpretation of the data at hand. In fact, data mining searches for new, valuable and nontrivial information in large volumes of data. Data mining is used both to describe and to predict: explaining new and nontrivial information based on the data is a descriptive aim, while providing a model for the data is a predictive goal. In other words, one goal of data mining is prediction, that is, building a model for classification, estimation or similar purposes, and another main purpose is description, that is, identifying patterns and relationships in data. As many researchers have noted, traditional statistical methods cannot be replaced by data mining, but they can be extended by some advanced methods 2 .
One of the important issues in data mining, as in statistical analysis, is building predictive models. Depending on the type of response variable, which can be continuous or categorical, these models are divided into regression and classification models. In regression, the response mean is predicted by an equation that relates the response to the covariates. When the aim is classification of a target variable, probabilities of class membership are estimated from a training data set. These probabilities are then used to allocate new cases of the target variable to the different categories 2,3 .
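As a minimal illustration of allocating a new case by its estimated class-membership probabilities, consider the following sketch; the class labels and probabilities are invented for illustration, not taken from the data set analyzed here.

```python
# Hypothetical sketch: a new case is allocated to the class with the
# highest estimated membership probability (all values are assumed).

def allocate(class_probs):
    """Assign a case to the class with the highest estimated probability."""
    return max(class_probs, key=class_probs.get)

# Estimated probabilities for one new case (invented numbers):
probs = {"0-1 children": 0.2, "2 children": 0.5, "3+ children": 0.3}
print(allocate(probs))  # -> 2 children
```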
The classification tree is a hierarchical and flexible classification method; these characteristics make it a very attractive and widely applicable classification tool. A classification tree splits data into two or more categories based on input covariates; these covariates can be categorical, continuous or any mix of the two (e.g., gender, educational level, race, age). The results are usually displayed in a tree-like graph 4,5 . Nowadays classification trees are used in many diverse fields such as social sciences, demography, medicine, business, and biology [6][7][8][9][10] . Although constructing a classification tree can sometimes be quite complex and the extracted tree may not be simple, its graphical form has a straightforward interpretation even for complex trees. As with any predictive model, obtaining the most accurate classification model is the goal of a classification tree analysis 11 . This paper introduces the Classification and Regression Trees (CART) algorithm as a successful classification tree algorithm and applies it to classify a Children Ever Born (CEB) data set. It is organized as follows. The following section introduces classification tree induction by the CART algorithm, with the Gini index and the chi-square test as measures for determining the best way to split the data; it also shows how to prune the tree. Section 3 describes CEB and the application of some statistical methods to analyze it; the results of applying CART to CEB are presented in that section too. Finally, some concluding remarks are given in Section 4.

Classification Tree Induction
A classification tree splits the data into smaller divisions called nodes. Criteria known as impurity measures are used to divide the data, and splitting continues until no more divisions can be created. Since a classification tree splits the data based on one or more predictor variables, it can have multivariate splits on the predictor variables 12 . Linear combination splits can also be computed for classification trees when continuous predictors are measured 13 . Classification tree algorithms have some requirements that must be met before applying them. They represent supervised learning, and so require pre-classified target variables. The data set is divided into two parts: training and learning data. The training data set, which is used to extract the decision tree, should be rich and varied so as to provide the algorithm with a healthy cross-section of the types of records for which classification may be needed in the future. In a classification tree the response (target) variable must be discrete; if the target variable is not discrete, a classification tree cannot be applied and a regression tree should be used instead 14 .

Classification and Regression Trees (CART) Algorithm
Algorithms such as Automatic Interaction Detection (AID), THAID, CHi-squared Automatic Interaction Detection (CHAID) and Classification and Regression Trees (CART) can be used for extracting classification trees. AID and THAID are applied to nominal responses, while CHAID is used for nominal and ordinal responses and creates trees with multiple splits. CART was suggested by 15 and can be used for both discrete and continuous response variables. Unlike AID, THAID and CHAID, CART is a distribution-free algorithm and is more widely applicable, so it is the algorithm used in this article. CART trees are strictly binary, containing exactly two branches for each classification node. CART recursively divides the cases in the training data set into subsets with similar values of the target variable 16 . The CART algorithm searches all available variables and all possible splitting values and selects the optimal split for growing the tree; the splits are chosen by the following criteria 17 :
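The exhaustive split search just described can be sketched as follows. This is our illustrative code, not the authors' implementation: the toy data (two invented predictors and invented CEB labels) and the use of Gini impurity as the scoring criterion are assumptions for the sake of the example.

```python
# Sketch of CART's exhaustive search: for every predictor and every
# candidate threshold, evaluate the binary split and keep the split
# with the lowest weighted Gini impurity.

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(rows, target):
    """rows: list of dicts of predictor values; target: list of class labels."""
    best = None  # (weighted_gini, variable, threshold)
    n = len(rows)
    for var in rows[0]:
        values = sorted({r[var] for r in rows})
        # Candidate thresholds: midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for r, t in zip(rows, target) if r[var] <= thr]
            right = [t for r, t in zip(rows, target) if r[var] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[0]:
                best = (score, var, thr)
    return best

# Toy data: age at first marriage and birth cohort predicting a CEB class.
rows = [{"age_marr": 15, "cohort": 1960}, {"age_marr": 16, "cohort": 1970},
        {"age_marr": 24, "cohort": 1980}, {"age_marr": 26, "cohort": 1980}]
ceb = ["3+", "3+", "1", "1"]
print(best_split(rows, ceb))  # -> (0.0, 'age_marr', 20.0)
```

On this toy data the split "age at first marriage <= 20" separates the two classes perfectly, so its weighted Gini impurity is zero.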

Gini Index
The Gini index is used for binary splitting in the tree. For node t and a target variable with k categories it is defined as

Gini(t) = 1 − Σ_{j=1}^{k} p_j²,    (1)

where p_j is the probability that a case in node t belongs to class C_j and is estimated by |C_{j,D}|/|D| (|D| is the size of subset D). The sum is computed over the k categories. The Gini index considers a binary split for each variable 14,16 .
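A short numeric check of the Gini index formula; the class counts below are assumed values, not the survey's.

```python
# Worked check of Gini(t) = 1 - sum_j p_j^2 with assumed class counts.

def gini_index(counts):
    """counts[j] = |C_{j,D}|; probabilities are p_j = counts[j] / |D|."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# A node with 40, 45 and 15 cases in three CEB classes (invented numbers):
# p = (0.40, 0.45, 0.15), so Gini = 1 - (0.16 + 0.2025 + 0.0225) = 0.615.
print(round(gini_index([40, 45, 15]), 4))
```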

Chi-Square Test
The chi-square (χ²) test for node t and a target variable with k categories compares the observed and expected class counts in the child nodes produced by a split on predictor X:

χ² = Σ_{i} Σ_{j=1}^{k} (n_ij − e_ij)² / e_ij,    (2)

where n_ij is the observed number of class C_j cases in child node i and e_ij is the count expected if the split were independent of the target. When χ² is used as a purity measure, a higher value indicates that the variability between the child nodes does not occur by chance 14,16 .

Tree Pruning
Once a classification tree has been constructed, we should bear in mind that a bigger tree is not necessarily a better one: a large tree may over-fit the data, so methods are needed to determine the right size of the classification tree. The methods used for this purpose are called pruning. In other words, the optimal tree is optimal in terms of both size and classification error. First the classification error is estimated, then the tree is pruned using conventional techniques, and finally, applying an appropriate rule to the pruned trees, the tree with the lowest cost and size is selected. Pruning of a classification tree consists of two stages. The first stage uses methods that stop excessive growth of the tree; these are known as pre-pruning. In the second stage, useless branches of the classification tree are removed; this is post-pruning. The methods used in this phase create smaller trees with higher accuracy 18,19 .
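As a loose illustration of the size-versus-error trade-off behind post-pruning, the cost-complexity idea can be sketched in a few lines. This is not the procedure used in the paper; the error rates, leaf count and penalty alpha are all invented.

```python
# Cost-complexity sketch: collapse a subtree into a leaf when the
# reduction in misclassification error does not justify the extra leaves.

def keep_subtree(err_leaf, err_subtree, n_leaves, alpha):
    """Keep the subtree only if its penalized cost beats a single leaf."""
    cost_leaf = err_leaf + alpha * 1           # one leaf
    cost_subtree = err_subtree + alpha * n_leaves
    return cost_subtree < cost_leaf

# A subtree with 4 leaves lowers the error from 0.35 to 0.30:
print(keep_subtree(0.35, 0.30, 4, alpha=0.01))  # small penalty: keep (True)
print(keep_subtree(0.35, 0.30, 4, alpha=0.05))  # larger penalty: prune (False)
```

Sweeping alpha from small to large produces a nested sequence of ever smaller trees, from which the tree with the lowest estimated cost is selected.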

Children Ever Born Classification Tree
In demographic research, fertility is one of the most important phenomena. The number of children ever born per woman has important implications for public health, the economy, climate, and population structure. It can influence infant, child and maternal mortality, obstetric and child health services, economic growth (or decline), dependency burden, labor force participation, and the age structure of populations 20 . There is a rich literature on fertility and the factors that have a significant effect on it. The authors of 21 identified factors that played a major role in the enhancement and inhibition of fertility, using data on married adolescents in Bangladesh. They found that early marriage is a major concern in Bangladesh but is not the sole factor causing adolescent pregnancy; the significant factor affecting adolescent motherhood was education 21 . The authors of 22 analyzed fertility patterns and their correlates in North East India. Education, religion, occupation, economic status, child mortality, age of women, age at marriage, duration of breastfeeding and use of contraceptives by either spouse were taken as factors that could influence fertility; the status of women had a strong influence on fertility 22 .
The authors of 23 analyzed children ever born using a binary logistic model, dichotomizing data collected from 250 households in a slum area of Rajshahi, Bangladesh. Respondents were women aged 15-49. The factors contributing significantly to large families were the education levels of husband and wife, average monthly income and expenditure, ideal number of children, age at marriage and reproductive life span.
The authors of 24 investigated cause-and-effect relationships in fertility for the Kanyakumari district of India. Initially, factor analysis was used to group the variables into new factors, and then multiple regression models were fitted using the variables of each group separately. Information on the age of women, age at marriage, religion, type of family, education of husband and wife, work status, and income of husband and wife was used. The authors of 25 used multiple regression and multilevel analysis to investigate the factors that pushed women toward high parity in Uttar Pradesh, India. Religion (Islam), women's work status, ever use of contraception and number of child losses increased fertility, while place of residence, women's education, partner's education, type of house and source of lighting were associated with declines in fertility. Because of the hierarchical structure of the survey data at different levels, the effects of predictors on parity in Uttar Pradesh had to be estimated with care. In the multilevel regression analysis, the primary sampling unit was taken as a regressor along with the socio-economic and demographic factors. Both the multiple regression and the multilevel model produced almost identical results regarding the significance of factors and covariates; the difference was found only in the standard errors. The coefficients of the multiple regression analysis had undersized standard errors and consequently yielded more significant factors than the multilevel regression 26 .
There is a rich literature on the fertility transition in Iran and on how the expansion of education, reduction in child mortality, urbanization, wide access to family planning services and the importance of quality versus quantity of children have contributed to the recent fertility decline in this country [27][28][29][30][31][32] .
The empirical results of the analysis in 28 revealed three groups of determinants influencing the fertility behavior of Iranian households. The first group consists of economic factors at either the micro or the macro level. Second, the distribution of intra-household bargaining power has a strong influence on fertility in Iran. Finally, although there was no difference between the number of children in urban and rural areas, the findings support the role of other demographic determinants, such as literacy, social norms of household size, and religion, in the fertility behavior of Iranian families 27 .
Since no study has yet used classification trees to analyze CEB in Iran, and given the benefits of this model for classification, in this article we use CART to classify CEB.

In this study, Children Ever Born (CEB) in the survey "Study Marriage and Fertility Attitudes of Married 15-49 Year-Old Women in Semnan, Iran; 2012" 33 is classified by the classification tree approach. The data come from a cross-sectional survey in Semnan province, collected with a structured questionnaire. Semnan is a province that is taking efficient steps toward development and modernization, and nowadays it is considered one of the developed provinces of Iran. In this province, changes in fertility attitudes and beliefs are expected to be affected by modernization, industrialization and urbanization 33 . A sample of 405 women was selected from 2 cities and 6 villages of Semnan province, out of its 8 cities and 589 villages. The sample consists of 15-49 year-old married women in private settled households. CEB, age at first marriage, marriage type, education level, job status, birth place and birth cohort were collected in this survey. Table 1 shows the frequencies and percentages of the categorical variables. The CEB of 45 percent of the women is 2 children. Almost equal percentages of women fall in the three birth cohorts. 80, 67.1 and 77.3 percent of the women were unemployed, educated to diploma level and above, and born in urban areas, respectively. Nearly 60 percent of the women had a non-familial marriage type.

CART Classification Tree for Classification of CEB by Gini and Chi-square Indices with Estimated Prior Probabilities

Figures 1 and 2 present the classification trees of CEB grown from the predictor variables age at first marriage, marriage type, educational level, job status, birth place, and birth cohort, using the Gini index and the chi-square test with estimated prior probabilities, respectively. All of the predictor variables enter these classification trees as nodes. Birth cohort is placed at the root of the classification tree as the variable with the greatest influence on classifying CEB. Table 2 presents the misclassification matrices of Models 1 and 2, which indicate the accuracy of the two classification models. The shaded cells in Table 2 signify correct classifications, that is, the accuracy of the classification trees in Figures 1 and 2. The accuracy of each classification tree can be calculated from Table 2; the two models reach accuracies of approximately 0.64 and 0.65. A classification accuracy of 0.65 means that the CEB of 65 percent of the women has been classified correctly, so the misclassification rate equals 35 percent. The accuracies of the two models are thus approximately equal, but Model 1 is more complex than Model 2: Model 1 contains 11 nodes and 12 leaves, compared with 8 nodes and 9 leaves for Model 2. We therefore recommend using Model 2 instead of Model 1.
The following rules can be extracted from the classification tree in Figure 2:

• The CEB of women in the first birth cohort (1960s) was 3 and more children, regardless of any other predictor.

• The CEB of women in the second birth cohort (1970s) whose age at first marriage was low (≤15.5) was 3 and more children. For women in this cohort whose age at first marriage was high (>15.5), birth place and educational level did not play any specific role in classifying CEB: their CEB was 2 children whether they lived in urban or rural areas.

• The CEB of unemployed women in the third birth cohort (1980s) with non-familial and familial marriages was 1 and 0 children, respectively.

• The CEB of employed women in the third birth cohort (1980s) whose age at first marriage was low (≤24.5) or high (>24.5) was 2 and 1 children, respectively.

The risk and standard error of the classification tree for the training and learning data, with estimated prior probabilities, are shown in Tables 3 and 4, respectively. According to the results in Table 3, these values are almost equal, which indicates the validity of the classification model proposed by the classification tree in Figure 2.
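The rules read off the Figure 2 tree can be expressed as a simple decision function. This is our sketch, not output of the CART software: the variable names and class labels are ours, while the cohorts and the thresholds 15.5 and 24.5 come from the rules stated in the text.

```python
# Decision rules of the Figure 2 tree, written as plain if/else logic.
# Cohorts are labeled by decade of birth (1960, 1970, 1980).

def predict_ceb(cohort, age_at_marriage, employed=False, familial_marriage=False):
    if cohort == 1960:
        return "3+"                     # first cohort: always 3+ children
    if cohort == 1970:
        return "3+" if age_at_marriage <= 15.5 else "2"
    if cohort == 1980:
        if not employed:                # unemployed: marriage type decides
            return "0" if familial_marriage else "1"
        return "2" if age_at_marriage <= 24.5 else "1"
    raise ValueError("unknown birth cohort")

print(predict_ceb(1970, 17))                 # -> 2
print(predict_ceb(1980, 20, employed=True))  # -> 2
```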

Conclusion
Classification trees recursively partition the predictor-variable space into separate regions and allocate the data to classes. The recursive partitioning leads to a piecewise-constant model on the predictor space. To partition each node, all candidate partitions are evaluated for every predictor; the variable and the corresponding split point are selected so that the best separation between the two child nodes is obtained. This process continues recursively until each node contains only a limited number of cases. After a large tree has been grown, rules for pruning and adjusting the size of the tree are applied. Classification and Regression Trees (CARTs) are useful for generating binary classification trees, splitting subsets of the data set on all predictor variables to create two child nodes 3 . The CART algorithm is an important method for classification and regression tree analysis of large data sets. CART has many advantages, such as:

• CART makes no distributional assumptions about covariates or response variables.

• Covariates can be a mix of categorical and continuous variables.

• CART can handle missing data by several methods, so no case is deleted from the analysis because of missing information.

• CART is not affected by outliers.
Classification trees have been used for classification in many fields, such as demography, medicine, manufacturing and production, financial analysis, astronomy, and molecular biology 11 .
Classification trees differ from discriminant analysis and logistic regression, the traditional statistical methods used to classify data. Both discriminant analysis and logistic regression require assumptions without which the validity of the results is not guaranteed. For example, in discriminant analysis all covariates must be continuous and normally distributed, which is not satisfied in many applied studies. The validity of logistic regression depends on having enough cases in each category of the target variable, and as the number of covariates increases, the full model containing interactions becomes more complex. A classification tree needs no assumptions about the distributions of the response and covariate variables, and interactions between variables can be considered without added complexity 13,14 .
In this study, Children Ever Born (CEB) in the survey "Study Marriage and Fertility Attitudes of Married 15-49 Year-Old Women in Semnan, Iran; 2012" was classified by the classification tree approach. The following results were extracted from the CEB classification tree:

• The CEB of women in the first birth cohort (1960s) was 3 and more children, regardless of any other predictor.

• The CEB of women in the second birth cohort (1970s) who married at a low or high age was 3 and more children or 2 children, respectively.

• Marriage type did not affect CEB in the first and second birth cohorts, where CEB (2 or 3 and more children) depended only on age at marriage.

• The CEB of women in the third birth cohort (1980s) was affected by type of marriage: unemployed women with a non-familial marriage had 1 child, while unemployed women with a familial marriage were childless.

• The CEB of employed women in the third cohort who married at a low or high age was 2 and 1 children, respectively; employed women who married at a low age had higher CEB than unemployed women.

Acknowledgment
This article is extracted from a survey entitled "Mining Demographic Data by Decision Tree", supported by the National Population Studies and Comprehensive Management Institute in 2014 under registration number 20/15283.