Early treatment of cancer increases the chance of a cure and reduces the fatality rate and cancer recurrence ^{1}. An efficient tool is required to determine whether a patient is affected by cancer and to distinguish between different types of cancer. Data mining provides data analysis methods that uncover interesting knowledge for understanding and diagnosing diseases from microarray gene expression data. However, because microarray data have a large number of genes and a small number of samples, conventional statistical and classification techniques may not handle them efficiently ^{2}. The huge number of features (genes) in a microarray may also produce redundant results and reduce classification accuracy. To overcome this problem, feature selection, dimension reduction, informative gene identification ^{3}, discretization and rule generation are important in building an associative classification model for gene expression data. Informative genes and class association rules are used to identify candidate genes for specific cancers and to analyze gene expression together with biological information from the Gene Ontology ^{4}. Associative classification methods help healthcare professionals identify cancer risk factors and the genes that cause disease. Veroneze et al. ^{5} used association rule mining to identify molecular profile patterns from gene expression data on chronic inflammatory diseases. Luna et al.
Sen et al.
The National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) database is a public functional genomics database of high-throughput gene expression data, chips, and microarrays. The datasets GSE15781 ^{12}, GSE87211 ^{13}, GSE25070 ^{14} and GSE43580 ^{15} were downloaded from GEO. The GSE15781 dataset comes from colorectal cancer patients and includes 13 cancer tissues and 10 normal tissues. The GSE87211 dataset contains 203 colorectal cancer samples and 160 control samples. The GSE25070 dataset contains 26 tumor samples and 26 normal samples. The GSE43580 gene expression dataset contains 77 lung adenocarcinoma and 73 lung squamous cell carcinoma samples. In this research, the proposed method was also tested with a multiclass dataset. NCI60 is a data set of gene expression profiles of 60 National Cancer Institute (NCI) cell lines. The data set ^{16} consists of 7129 genes and 60 samples. The 60 human tumor cell lines are divided into nine cancer classes: eight breast cancer, six CNS cancer, seven colon cancer, six leukemia, eight melanoma, nine non-small-cell lung carcinoma, six ovarian cancer, eight renal cancer and two prostate cancer cell lines.
The objective of this work is to present an associative classification method for gene expression data that improves classification performance, running time and memory use on high-dimensional data sets. This section presents the proposed method, the Modified Associative Classification Model (MACM), for microarray gene expression data classification. The proposed model has four phases: data preprocessing, data transformation, associative classification, and biomarker prediction.
The gene expression data are standardized using z-score standardization, which brings all genes into a common range. Informative genes are then identified using the LIMMA package, which reduces the dimensionality of the dataset.
Normalization and standardization methods are applied to remove certain systematic biases that are inherent in the data. Before analysis, the gene expression data must be normalized to avoid large variation in the expression values and to avoid errors during data processing ^{17}. The Z-score is computed as
$$Z = \frac{D - \mu}{\sigma}$$
where D is the data to be normalized, μ is its arithmetic mean, σ is its standard deviation, and Z is the standardized variable with mean 0 and variance 1. This method is used to normalize the gene expression data.
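The standardization above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' R code: the function name `zscore`, the made-up input values, and the use of the population standard deviation (rather than the sample standard deviation) are all assumptions.

```python
from statistics import fmean, pstdev

def zscore(values):
    """Standardize one gene's expression values to mean 0 and variance 1."""
    mu = fmean(values)        # arithmetic mean of the data D
    sigma = pstdev(values)    # population standard deviation (an assumption)
    if sigma == 0:
        # A constant gene has no spread; map it to zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(d - mu) / sigma for d in values]

# Made-up expression values for a single gene across eight samples.
expression = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardized = zscore(expression)
print(standardized)
```

Applying the function gene by gene puts every gene on the same scale before gene selection.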
Gene selection is a key task in microarray data classification: it finds the differentially expressed genes and reduces dimensionality by removing irrelevant and noisy data ^{18}. The Linear Models for Microarray Data (LIMMA) package is R-based open-source software for statistical genomics ^{19}. It uses linear models to preprocess and analyze microarray experiments ^{19}. The LIMMA model requires a design matrix and a contrast matrix. The first step is to fit a linear model using
The moderated t-statistic for the k^{th} contrast for gene j is calculated using
where
Data transformation ^{20} is used to integrate various types of data and to apply association rule mining successfully in the rule generation phase. Data discretization is a data transformation method in which the gene expression data are transformed from continuous data into nominal data. There are several discretization techniques, such as equal-width binning, cluster-based discretization, equal-depth discretization, class-attribute contingency discretization and entropy-based discretization. Entropy-based discretization is a supervised approach that discretizes attributes using the class information.
The discretization process ^{21} follows four steps: sorting the continuous values, evaluating cut points for splitting (or adjacent intervals for merging), splitting or merging intervals according to some criterion, and finally stopping at some point. Gene expression values are continuous variables measured on an interval scale, and the process of partitioning continuous variables into categories is known as discretization. Discretization techniques can be placed along two dimensions: supervised versus unsupervised and local versus global. In this work, entropy-based discretization is used.
The entropy-based discretization process is described in Algorithm 1.
Begin
Step 1: Read the statistically significant differentially expressed genes
Step 2: Sort the gene expression values
Step 3: Calculate the entropy
Step 4: Search for a suitable cut point with the lowest entropy
Step 5: Split the range of continuous gene expression values at the cut point, which is calculated using
Step 6: Repeat steps 4-5 until the stopping criterion is satisfied and all continuous values are discretized
End.
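Steps 2 to 5 of the algorithm can be illustrated with a short Python sketch that searches for the single cut point with the lowest class-weighted entropy. It is a simplified illustration under stated assumptions: the function names are hypothetical, candidate cuts are taken as midpoints between adjacent distinct values, and the stopping criterion of Step 6 (whose formula is not given here) is left out, so only one binary split is shown.

```python
from collections import Counter
from math import log2

def class_entropy(labels):
    """Shannon entropy (in bits) of the class labels in one interval."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_cut_point(values, labels):
    """Return (cut, weighted_entropy) for the best single binary split."""
    pairs = sorted(zip(values, labels))          # Step 2: sort by expression value
    n = len(pairs)
    best_entropy, best_cut = float("inf"), None
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue                             # no boundary between equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        # Step 3: entropy of the partition, weighted by interval size
        weighted = (len(left) / n) * class_entropy(left) \
                 + (len(right) / n) * class_entropy(right)
        if weighted < best_entropy:              # Step 4: keep the lowest entropy
            best_entropy, best_cut = weighted, (low + high) / 2
    return best_cut, best_entropy

# Made-up expression values for one gene and their class labels.
values = [0.1, 0.2, 0.3, 1.1, 1.2, 1.3]
labels = ["cancer", "cancer", "cancer", "control", "control", "control"]
cut, entropy = best_cut_point(values, labels)
print(cut, entropy)
```

A full discretizer would apply `best_cut_point` recursively to each resulting interval until the chosen stopping criterion is met.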
The filtered and discretized microarray data set is transformed into a transaction set. A microarray transaction table is built based on the class labels before rule mining is applied. The number of transactions in a gene expression dataset corresponds to its number of samples. The number of transactions of biological information is the total number of discrete data items acquired. These sets of transactions are passed to the maximal class rule generation phase for model building. Association rule mining ^{22} and classification are combined in associative classification. Generating frequent itemsets is one of the key processes in association analysis for identifying interesting sets of genes, and mining frequent itemsets is essential for discovering class association rules. Many frequent itemset generation algorithms follow Apriori ^{23}, which uses a bottom-up, breadth-first search approach. However, generating long frequent patterns in dense data is computationally infeasible. A solution to this problem is to mine only the maximal frequent itemsets.
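The transformation into a transaction set can be sketched as follows. This is an illustrative Python sketch, not the authors' implementation: each sample becomes one transaction whose items are `gene=interval` strings, and the function names, the item encoding and the two example samples are assumptions made for the illustration.

```python
def to_transactions(samples):
    """Turn discretized samples (gene -> interval label) into item sets."""
    transactions = []
    for sample in samples:
        items = frozenset(f"{gene}={interval}" for gene, interval in sample.items())
        transactions.append(items)
    return transactions

def group_by_class(transactions, labels):
    """Build one transaction list per class label."""
    grouped = {}
    for transaction, label in zip(transactions, labels):
        grouped.setdefault(label, []).append(transaction)
    return grouped

# Two illustrative discretized samples (one transaction per sample).
discretized = [
    {"PCSK1N": "[0.445, Inf]", "C9orf100": "[0.125, Inf]"},
    {"PCSK1N": "[-Inf, 0.095]", "C9orf100": "[-Inf, 0.125]"},
]
labels = ["Control", "Cancer"]
transactions = to_transactions(discretized)
per_class = group_by_class(transactions, labels)
print(per_class)
```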
Given a set of items I = {i1, i2, i3, …, in} and a set of transactions T = {t1, t2, t3, …, tm}, a subset S of I is called frequent if support(S) ≥ minimum support, where minimum support is a user-defined threshold. The maximal frequent itemset
The method used to generate maximal frequent itemsets follows a depth-first search approach. A frequent itemset is maximal if none of its proper supersets is frequent.
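The definition of a maximal frequent itemset can be checked on a toy example. The Python sketch below is a brute-force, level-wise enumeration chosen for readability; it is not the depth-first miner described above, and the function names and toy transactions are assumptions.

```python
from itertools import combinations

def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def frequent_itemsets(transactions, min_support):
    """Enumerate all frequent itemsets level by level (small inputs only)."""
    items = sorted(set().union(*transactions))
    frequent = []
    for size in range(1, len(items) + 1):
        level = []
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            if support(candidate, transactions) >= min_support:
                level.append(candidate)
        if not level:
            break  # no frequent itemset of this size, so none of a larger size
        frequent.extend(level)
    return frequent

def maximal_itemsets(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return [s for s in frequent if not any(s < other for other in frequent)]

# Toy transactions over three items.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
frequent = frequent_itemsets(toy, min_support=0.6)
maximal = maximal_itemsets(frequent)
print(maximal)
```

In this example five itemsets are frequent but only two are maximal, which shows how the maximal representation reduces the number of itemsets that have to be stored.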
The maximal classifier model is built from the maximal frequent itemsets. A class association rule is of the form
The confidence of rule
Begin
Step 1: Read the discretized dataset for each class
Step 2: Transform the dataset into a transaction set
Step 3: For each class, compute the maximal frequent itemsets
Step 4: From the maximal frequent itemsets, generate the set of rules whose confidence is above the minimum confidence threshold
Step 5: Build a classifier model from these class association rules
Step 6: Repeat steps 4-6 until the maximal class association rules are formed for all classes
End.
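Step 4 of the algorithm can be sketched in Python as follows. The sketch assumes that the maximal frequent itemsets of each class are already available and that the confidence of a rule X → c is the fraction of transactions containing X that carry class label c; the function names and the toy data are hypothetical.

```python
def confidence(itemset, target, transactions, labels):
    """Fraction of transactions containing the itemset that belong to target."""
    covered = [lab for t, lab in zip(transactions, labels) if itemset <= t]
    if not covered:
        return 0.0
    return covered.count(target) / len(covered)

def build_rules(itemsets_by_class, transactions, labels, min_conf):
    """Keep the rules (itemset -> class) that reach the confidence threshold."""
    rules = []
    for target, itemsets in itemsets_by_class.items():
        for itemset in itemsets:
            conf = confidence(itemset, target, transactions, labels)
            if conf >= min_conf:
                rules.append((itemset, target, conf))
    return rules

# Toy transactions with made-up items and class labels.
transactions = [
    frozenset({"g1=high", "g2=high"}),
    frozenset({"g1=high", "g2=high"}),
    frozenset({"g1=low", "g2=high"}),
    frozenset({"g1=low", "g2=low"}),
]
labels = ["control", "control", "cancer", "cancer"]
# Assumed output of the per-class maximal itemset mining step.
itemsets_by_class = {
    "control": [frozenset({"g1=high", "g2=high"})],
    "cancer": [frozenset({"g1=low"}), frozenset({"g2=high"})],
}
rules = build_rules(itemsets_by_class, transactions, labels, min_conf=0.8)
print(rules)
```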
Prediction is one of the important steps in associative classification because it determines the accuracy of the developed model. During prediction, a sample is assigned to a particular class when it satisfies the largest number of eligible rules of that class; otherwise it is assigned the default class, which is the majority class in the dataset. Assigning the default class to a sample can reduce classifier accuracy, so the challenge is to use the generated rules in the model in a way that produces good accuracy. In this paper, a probability-based prediction method for associative classification is proposed. The Poisson probability distribution gives the probability that an event occurs a given number of times in a fixed interval when the average rate at which the event occurs is known. The Poisson probability distribution is calculated using the
Where,
Begin
End.
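For reference, the standard Poisson probability mass function is P(X = k) = λ^k e^{-λ} / k!, where λ is the average number of occurrences in the interval. The Python sketch below evaluates only this textbook formula. How the rule matches of a test sample are mapped to k and λ in the proposed prediction method is not recoverable from the text, so no such mapping is shown; the function name is an assumption.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """Probability of observing exactly k events when the mean count is lam."""
    if k < 0 or lam < 0:
        raise ValueError("k and lam must be non-negative")
    return (lam ** k) * exp(-lam) / factorial(k)

# Probability of exactly 2 events when 2 events are expected on average.
p = poisson_pmf(2, 2.0)
print(p)
```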
The microarray data are represented as a gene expression matrix. The experiments for the proposed method were carried out in the R statistical programming language.
First, in the data preprocessing phase, the raw microarray data were normalized using Z-score normalization, and candidate gene features were selected from the normalized data using the LIMMA test. The selected candidate gene features can achieve the highest classification accuracy with the fewest genes.





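| Cancer type | Dataset | Total samples | Class 1 samples | Class 2 samples |
|---|---|---|---|---|
| Colon Cancer | GSE15781 | 23 | 13 (cancer) | 10 (normal) |
| Colon Cancer | GSE87211 | 363 | 203 (cancer) | 160 (control) |
| Colon Cancer | GSE25070 | 52 | 26 (tumor) | 26 (normal) |
| Lung Cancer | GSE43580 | 150 | 73 (lung squamous cell carcinoma) | 77 (lung adenocarcinoma) |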












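| Sample | Class | | | | … | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GSM396309 | 1 | 0.97 | 1.08 | 0.55 | … | 0.84 | 0.86 | 0.89 | 1 | 0.93 | 0.42 |
| GSM396310 | 1 | 0.63 | 0.73 | 0.96 | … | 0.47 | 1.11 | 0.91 | 1.15 | 0.96 | 0.89 |
| GSM396311 | 1 | 1.1 | 0.8 | 0.92 | … | 0.95 | 0.72 | 0.91 | 1.18 | 0.9 | 0.74 |
| GSM396312 | 1 | 0.18 | 0.64 | 0.05 | … | 0.8 | 0.76 | 0.06 | 0.24 | 0.49 | 0.12 |
| GSM396313 | 1 | 1.23 | 0.97 | 1.01 | … | 0.92 | 1.22 | 0.91 | 0.96 | 0.93 | 1.15 |
| GSM396314 | 1 | 0.39 | 0.16 | 0.33 | … | 0.45 | 0.07 | 0.39 | 0.32 | 0.04 | 0.49 |
| GSM396315 | 1 | 0.38 | 0.83 | 0.96 | … | 0.91 | 0.97 | 0.85 | 0.86 | 0.75 | 1.06 |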












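| Sample | Class | Gene 1 | Gene 2 | Gene 3 | Gene 4 | Gene 5 | Gene 6 | Gene 7 | Gene 8 | Gene 9 | Gene 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GSM396309 | 1 | 0.97 | 1.08 | 0.55 | 1.36 | 0.84 | 0.86 | 0.89 | 1 | 0.93 | 0.42 |
| GSM396310 | 1 | 0.63 | 0.73 | 0.96 | 1.39 | 0.47 | 1.11 | 0.91 | 1.15 | 0.96 | 0.89 |
| GSM396311 | 1 | 1.1 | 0.8 | 0.92 | 1.39 | 0.95 | 0.72 | 0.91 | 1.18 | 0.9 | 0.74 |
| GSM396312 | 1 | 0.18 | 0.64 | 0.05 | 0.16 | 0.8 | 0.76 | 0.06 | 0.24 | 0.49 | 0.12 |
| GSM396313 | 1 | 1.23 | 0.97 | 1.01 | 0.67 | 0.92 | 1.22 | 0.91 | 0.96 | 0.93 | 1.15 |
| GSM396314 | 1 | 0.39 | 0.16 | 0.33 | 0.03 | 0.45 | 0.07 | 0.39 | 0.32 | 0.04 | 0.49 |
| GSM396315 | 1 | 0.38 | 0.83 | 0.96 | 0.54 | 0.91 | 0.97 | 0.85 | 0.86 | 0.75 | 1.06 |
| GSM396316 | 1 | 0.77 | 0.36 | 1.14 | 0.01 | 0.8 | 0.82 | 0.84 | 0.83 | 0.86 | 0.91 |
| GSM396317 | 1 | 1.14 | 0.7 | 0.93 | 1.25 | 1.04 | 1.15 | 0.91 | 0.8 | 0.89 | 1.12 |
| GSM396318 | 1 | 1.3 | 0.41 | 1.3 | 1.36 | 0.97 | 1.22 | 0.92 | 0.96 | 0.94 | 1.13 |
| GSM396319 | 1 | 0.95 | 0.98 | 0.79 | 0.05 | 0.83 | 0.07 | 0.9 | 0.63 | 0.84 | 0.78 |
| GSM396320 | 1 | 0.44 | 0.97 | 0.6 | 0.5 | 0.57 | 0.07 | 0.7 | 0.51 | 0.81 | 0.65 |
| GSM396321 | 1 | 0.84 | 0.68 | 0.52 | 0.95 | 0.61 | 0.35 | 0.78 | 0.86 | 0.74 | 0.85 |
| GSM396322 | 0 | 0.87 | 0.82 | 1.43 | 0.74 | 1.31 | 1.81 | 1.32 | 1.62 | 2.15 | 0.93 |
| GSM396323 | 0 | 1.47 | 1.72 | 0.07 | 0.32 | 1.85 | 0.27 | 1.57 | 0.15 | 0.4 | 1.03 |
| GSM396324 | 0 | 0.5 | 0.18 | 0.97 | 1.38 | 1.54 | 1.11 | 0.11 | 1.76 | 0.96 | 0.98 |
| GSM396325 | 0 | 0.59 | 0.09 | 0.62 | 0.4 | 0.53 | 0.5 | 1.65 | 0.64 | 0.23 | 2.52 |
| GSM396326 | 0 | 0.77 | 2.55 | 1.82 | 1.3 | 0.14 | 2.04 | 2.12 | 1.88 | 1.72 | 1.4 |
| GSM396327 | 0 | 0.37 | 1.16 | 0.61 | 1.51 | 0.04 | 0.49 | 0.39 | 0.49 | 0.51 | 0.12 |
| GSM396328 | 0 | 0.58 | 0.87 | 1.4 | 1.04 | 0.46 | 0.32 | 0.46 | 0.09 | 0.9 | 0.32 |
| GSM396329 | 0 | 2.09 | 1 | 0.58 | 1.47 | 0.95 | 1.11 | 1.23 | 1.43 | 1.06 | 0.34 |
| GSM396330 | 0 | 0.58 | 1.08 | 0.1 | 0.07 | 0.41 | 1.09 | 0.06 | 0.2 | 0.37 | 0.47 |
| GSM396331 | 0 | 1.71 | 0.02 | 1.85 | 1.04 | 2.12 | 1.06 | 0.48 | 0.91 | 1.53 | 0.98 |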












| Sample | Class | Gene 1 | Gene 2 | Gene 3 | Gene 4 | Gene 5 | Gene 6 | Gene 7 | Gene 8 | Gene 9 | Gene 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GSM396309 | 1 | 0, 0.095 | -Inf, 0.125 | -Inf, 0.295 | -Inf, 0.05 | -Inf, 0.255 | -Inf, 0.31 | -Inf, 0.405 | -Inf, 0.21 | -Inf, 0.555 | -Inf, 0.15 |
| GSM396310 | 1 | 0, 0.095 | -Inf, 0.125 | -Inf, 0.295 | -Inf, 0.05 | -Inf, 0.255 | -Inf, 0.31 | -Inf, 0.405 | -Inf, 0.21 | -Inf, 0.555 | -Inf, 0.15 |
| GSM396311 | 1 | 0, 0.095 | -Inf, 0.125 | -Inf, 0.295 | -Inf, 0.05 | -Inf, 0.255 | -Inf, 0.31 | -Inf, 0.405 | -Inf, 0.21 | -Inf, 0.555 | -Inf, 0.15 |
| GSM396312 | 1 | 0, 0.095 | -Inf, 0.125 | 0.010, 0.075 | 0.115, 0.240 | -Inf, 0.255 | -Inf, 0.31 | 0.085, 0.000 | 0.220, 0.405 | 0.445, 0.500 | 0.15, 0.22 |
| GSM396313 | 1 | 0, 0.095 | -Inf, 0.125 | -Inf, 0.295 | -Inf, 0.05 | -Inf, 0.255 | -Inf, 0.31 | -Inf, 0.405 | -Inf, 0.21 | -Inf, 0.555 | -Inf, 0.15 |
| GSM396314 | 1 | 0.380, 0.445 | -Inf, 0.125 | 0.215, 0.455 | -Inf, 0.05 | 0.430, 0.455 | 0.170, 0.195 | 0.225, 0.425 | 0.220, 0.405 | 0.205, 0.095 | 0.48, 0.71 |
| GSM396315 | 1 | 0, 0.095 | -Inf, 0.125 | -Inf, 0.295 | -Inf, 0.05 | -Inf, 0.255 | -Inf, 0.31 | -Inf, 0.405 | -Inf, 0.21 | -Inf, 0.555 | -Inf, 0.15 |
| GSM396316 | 1 | 0, 0.095 | -Inf, 0.125 | -Inf, 0.295 | -Inf, 0.05 | -Inf, 0.255 | -Inf, 0.31 | -Inf, 0.405 | -Inf, 0.21 | -Inf, 0.555 | -Inf, 0.15 |
| GSM396317 | 1 | 0, 0.095 | -Inf, 0.125 | -Inf, 0.295 | -Inf, 0.05 | -Inf, 0.255 | -Inf, 0.31 | -Inf, 0.405 | -Inf, 0.21 | -Inf, 0.555 | -Inf, 0.15 |
| GSM396318 | 1 | 0, 0.095 | -Inf, 0.125 | -Inf, 0.295 | -Inf, 0.05 | -Inf, 0.255 | -Inf, 0.31 | -Inf, 0.405 | -Inf, 0.21 | -Inf, 0.555 | -Inf, 0.15 |
| GSM396319 | 1 | 0, 0.095 | -Inf, 0.125 | -Inf, 0.295 | -Inf, 0.05 | -Inf, 0.255 | 0.170, 0.195 | -Inf, 0.405 | -Inf, 0.21 | -Inf, 0.555 | -Inf, 0.15 |
| GSM396320 | 1 | 0, 0.095 | -Inf, 0.125 | -Inf, 0.295 | -Inf, 0.05 | -Inf, 0.255 | 0.170, 0.195 | -Inf, 0.405 | -Inf, 0.21 | -Inf, 0.555 | -Inf, 0.15 |
| GSM396321 | 1 | 0, 0.095 | -Inf, 0.125 | -Inf, 0.295 | -Inf, 0.05 | -Inf, 0.255 | -Inf, 0.31 | -Inf, 0.405 | -Inf, 0.21 | -Inf, 0.555 | -Inf, 0.15 |
| GSM396322 | 0 | 0.445, Inf | 0.125, Inf | 0.455, Inf | 0.24, Inf | 0.455, Inf | 0.195, Inf | 0.425, Inf | 0.405, Inf | 0.5, Inf | 0.71, Inf |
| GSM396323 | 0 | 0.445, Inf | 0.125, Inf | 0.305 | 0.24, Inf | 0.455, Inf | 0.48 | 0.425, Inf | 0.21, 0.22 | 0.095, 0.445 | 0.71, Inf |
| GSM396324 | 0 | 0.445, Inf | 0.125, Inf | 0.455, Inf | 0.24, Inf | 0.455, Inf | 0.195, Inf | 0.49 | 0.405, Inf | 0.5, Inf | 0.71, Inf |
| GSM396325 | 0 | 0.445, Inf | 0.125, Inf | 0.455, Inf | 0.24, Inf | 0.455, Inf | 0.195, Inf | 0.425, Inf | 0.405, Inf | 0.095, 0.445 | 0.71, Inf |
| GSM396326 | 0 | 0.445, Inf | 0.125, Inf | 0.455, Inf | 0.24, Inf | 0.255, 0.430 | 0.195, Inf | 0.425, Inf | 0.405, Inf | 0.5, Inf | 0.71, Inf |
| GSM396327 | 0 | 0.095, 0.380 | 0.125, Inf | 0.455, Inf | 0.24, Inf | 0.255, 0.430 | 0.195, Inf | 0.225, 0.425 | 0.405, Inf | 0.5, Inf | 0.15, 0.22 |
| GSM396328 | 0 | 0.445, Inf | 0.125, Inf | 0.455, Inf | 0.24, Inf | 0.455, Inf | 0.195, Inf | 0.425, Inf | 0.21, 0.22 | 0.5, Inf | 0.22, 0.48 |
| GSM396329 | 0 | 0.445, Inf | 0.125, Inf | 0.455, Inf | 0.24, Inf | 0.455, Inf | 0.195, Inf | 0.425, Inf | 0.405, Inf | 0.5, Inf | 0.22, 0.48 |
| GSM396330 | 0 | 0.445, Inf | 0.125, Inf | 0.075, 0.215 | 0.050, 0.115 | 0.255, 0.430 | 0.195, Inf | 0.000, 0.225 | 0.21, 0.22 | 0.76 | 0.22, 0.48 |
| GSM396331 | 0 | 0.445, Inf | 0.125, Inf | 0.455, Inf | 0.24, Inf | 0.455, Inf | 0.195, Inf | 0.425, Inf | 0.405, Inf | 0.5, Inf | 0.71, Inf |












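| Sample | Class | Gene 1 | Gene 2 | Gene 3 | Gene 4 | Gene 5 | Gene 6 | Gene 7 | Gene 8 | Gene 9 | Gene 10 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GSM396309 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| GSM396310 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| GSM396311 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| GSM396312 | 1 | 1 | 1 | 3 | 3 | 1 | 1 | 3 | 3 | 5 | 2 |
| GSM396313 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| GSM396314 | 1 | 3 | 1 | 5 | 1 | 3 | 3 | 5 | 3 | 3 | 4 |
| GSM396315 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| GSM396316 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| GSM396317 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| GSM396318 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| GSM396319 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 |
| GSM396320 | 1 | 1 | 1 | 1 | 1 | 1 | 3 | 1 | 1 | 1 | 1 |
| GSM396321 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| GSM396322 | 0 | 4 | 2 | 6 | 4 | 4 | 4 | 6 | 4 | 6 | 5 |
| GSM396323 | 0 | 4 | 2 | 2 | 4 | 4 | 2 | 6 | 2 | 4 | 5 |
| GSM396324 | 0 | 4 | 2 | 6 | 4 | 4 | 4 | 2 | 4 | 6 | 5 |
| GSM396325 | 0 | 4 | 2 | 6 | 4 | 4 | 4 | 6 | 4 | 4 | 5 |
| GSM396326 | 0 | 4 | 2 | 6 | 4 | 2 | 4 | 6 | 4 | 6 | 5 |
| GSM396327 | 0 | 2 | 2 | 6 | 4 | 2 | 4 | 5 | 4 | 6 | 2 |
| GSM396328 | 0 | 4 | 2 | 6 | 4 | 4 | 4 | 6 | 2 | 6 | 3 |
| GSM396329 | 0 | 4 | 2 | 6 | 4 | 4 | 4 | 6 | 4 | 6 | 3 |
| GSM396330 | 0 | 4 | 2 | 4 | 2 | 2 | 4 | 4 | 2 | 2 | 3 |
| GSM396331 | 0 | 4 | 2 | 6 | 4 | 4 | 4 | 6 | 4 | 6 | 5 |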




| Rule antecedent (gene expression intervals) | Class |
|---|---|
| PCSK1N[lower = 0.445, upper = Inf], C9orf100[lower = 0.125, upper = Inf], SPIB[lower = 0.255, upper = 0.43], AKR1B10[lower = 0.195, upper = Inf] | Class=Control |
| PCSK1N[lower = 0.445, upper = Inf], C9orf100[lower = 0.125, upper = Inf], AKR1B10[lower = 0.195, upper = Inf], HSD17B2[lower = 0.21, upper = 0.22], KRT20[lower = 0.22, upper = 0.48] | Class=Control |
| PCSK1N[lower = 0.445, upper = Inf], C9orf100[lower = 0.125, upper = Inf], PDGFD[lower = 0.24, upper = Inf], SPIB[lower = 0.455, upper = Inf], INSL5[lower = 0.425, upper = Inf], HSD17B2[lower = 0.21, upper = 0.22] | Class=Control |
| PCSK1N[lower = 0.445, upper = Inf], C9orf100[lower = 0.125, upper = Inf], PDGFD[lower = 0.24, upper = Inf], SPIB[lower = 0.455, upper = Inf], INSL5[lower = 0.425, upper = Inf], SLC4A4[lower = 0.095, upper = 0.445], KRT20[lower = 0.71, upper = Inf] | Class=Control |
| C9orf100[lower = 0.125, upper = Inf], LGALS4[lower = 0.455, upper = Inf], PDGFD[lower = 0.24, upper = Inf], SPIB[lower = 0.255, upper = 0.43], AKR1B10[lower = 0.195, upper = Inf], HSD17B2[lower = 0.405, upper = Inf], SLC4A4[lower = 0.5, upper = Inf] | Class=Control |
| PCSK1N[lower = 0.445, upper = Inf], C9orf100[lower = 0.125, upper = Inf], LGALS4[lower = 0.455, upper = Inf], PDGFD[lower = 0.24, upper = Inf], SPIB[lower = 0.455, upper = Inf], AKR1B10[lower = 0.195, upper = Inf], INSL5[lower = 0.425, upper = Inf], SLC4A4[lower = 0.5, upper = Inf], KRT20[lower = 0.22, upper = 0.48] | Class=Control |
| PCSK1N[lower = 0.445, upper = Inf], C9orf100[lower = 0.125, upper = Inf], LGALS4[lower = 0.455, upper = Inf], PDGFD[lower = 0.24, upper = Inf], SPIB[lower = 0.455, upper = Inf], AKR1B10[lower = 0.195, upper = Inf], INSL5[lower = 0.425, upper = Inf], HSD17B2[lower = 0.405, upper = Inf], SLC4A4[lower = 0.5, upper = Inf], KRT20[lower = 0.71, upper = Inf] | Class=Control |
| C9orf100[lower = -Inf, upper = 0.125], PDGFD[lower = -Inf, upper = 0.05], AKR1B10[lower = 0.17, upper = 0.195] | Class=Cancer |
| PCSK1N[lower = -Inf, upper = 0.095], C9orf100[lower = -Inf, upper = 0.125], LGALS4[lower = -Inf, upper = 0.295], PDGFD[lower = -Inf, upper = 0.05], SPIB[lower = -Inf, upper = 0.255], AKR1B10[lower = -Inf, upper = 0.31], INSL5[lower = -Inf, upper = 0.405], HSD17B2[lower = -Inf, upper = 0.21], SLC4A4[lower = -Inf, upper = 0.555], SLC4A4[lower = -Inf, upper = 0.15] | Class=Cancer |
It is observed from the










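| Method | Normalization | Dataset | Itemset type | Min. support (%) | Min. confidence (%) | No. of rules | Accuracy (%) | Error (%) | Time |
|---|---|---|---|---|---|---|---|---|---|
| MACM | Min-Max | GSE15781 | Frequent | 20 | 80 | 2256 | 100 | 0 | 3.36 |
| MACM | Min-Max | GSE15781 | Closed | 20 | 80 | 102 | 100 | 0 | 0.27 |
| | | | | | | | | | |
| MACM | Min-Max | GSE15781 | Frequent | 40 | 80 | 1292 | 100 | 0 | 3.4 |
| MACM | Min-Max | GSE15781 | Closed | 40 | 80 | 66 | 100 | 0 | 0.21 |
| | | | | | | | | | |
| MACM | Min-Max | GSE15781 | Frequent | 80 | 80 | 266 | 100 | 0 | 16.78 |
| MACM | Min-Max | GSE15781 | Closed | 80 | 80 | 9 | 100 | 0 | 0.15 |
| | | | | | | | | | |
| MACM | Z-score | GSE15781 | Frequent | 20 | 80 | 2482 | 100 | 0 | 70.8 |
| MACM | Z-score | GSE15781 | Closed | 20 | 80 | 57 | 100 | 0 | 67.8 |
| | | | | | | | | | |
| MACM | Z-score | GSE15781 | Frequent | 40 | 80 | 1766 | 100 | 0 | 15.33 |
| MACM | Z-score | GSE15781 | Closed | 40 | 80 | 42 | 100 | 0 | 0.18 |
| | | | | | | | | | |
| MACM | Z-score | GSE15781 | Frequent | 80 | 80 | 532 | 100 | 0 | 4.66 |
| MACM | Z-score | GSE15781 | Closed | 80 | 80 | 11 | 100 | 0 | 0.15 |
| MACM | Z-score | GSE15781 | Frequent | 80 | 80 | 4 | 100 | 0 | 0.15 |
| CBA | Z-score | GSE15781 | Frequent | 20 | 80 | 770 | 100 | 0 | 0.25 |
| CBA | Z-score | GSE15781 | Frequent | 40 | 80 | 640 | 100 | 0 | 0.25 |
| CBA | Z-score | GSE15781 | Frequent | 50 | 80 | 19 | 100 | 0 | 0.19 |
| CBA | Z-score | GSE15781 | Frequent | 20 | 80 | 770 | 100 | 0 | 0.25 |
| CBA | Min-Max | GSE15781 | Frequent | 40 | 80 | 547 | 100 | 0 | 0.26 |
| CBA | Min-Max | GSE15781 | Frequent | 50 | 80 | 30 | 100 | 0 | 0.21 |
| MACM | Min-Max | GSE87211 | Frequent | 20 | 80 | 528 | 97.79 | 2.21 | 3.96 |
| MACM | Min-Max | GSE87211 | Closed | 20 | 80 | 342 | 97.79 | 2.21 | 2.57 |
| | | | | | | | | | |
| MACM | Min-Max | GSE87211 | Frequent | 40 | 80 | 277 | 96.69 | 3.31 | 1.47 |
| MACM | Min-Max | GSE87211 | Closed | 40 | 80 | 241 | 96.69 | 3.31 | 1.18 |
| | | | | | | | | | |
| MACM | Min-Max | GSE87211 | Frequent | 80 | 80 | 4 | 91.73 | 8.27 | 0.17 |
| MACM | Min-Max | GSE87211 | Closed | 80 | 80 | 4 | 91.73 | 8.27 | 0.14 |
| MACM | Min-Max | GSE87211 | Maximal | 80 | 80 | 2 | 91.73 | 8.27 | 0.14 |
| MACM | Z-score | GSE87211 | Frequent | 20 | 80 | 541 | 99.17 | 0.83 | 3.82 |
| MACM | Z-score | GSE87211 | Closed | 20 | 80 | 526 | 99.17 | 0.83 | 3.74 |
| | | | | | | | | | |
| MACM | Z-score | GSE87211 | Frequent | 40 | 80 | 77 | 98.62 | 1.38 | 0.62 |
| MACM | Z-score | GSE87211 | Closed | 40 | 80 | 77 | 98.62 | 1.38 | 0.64 |
| | | | | | | | | | |
| MACM | Z-score | GSE87211 | Frequent | 80 | 80 | 1 | 55.92 | 44.08 | 0.14 |
| MACM | Z-score | GSE87211 | Closed | 80 | 80 | 1 | 55.92 | 44.08 | 0.14 |
| MACM | Z-score | GSE87211 | Maximal | 80 | 80 | 1 | 55.92 | 44.08 | 0.14 |
| CBA | Z-score | GSE87211 | Frequent | 20 | 80 | 770 | 98.34 | 1.66 | 0.44 |
| CBA | Z-score | GSE87211 | Frequent | 40 | 80 | 615 | 98.34 | 1.66 | 0.3 |
| CBA | Z-score | GSE87211 | Frequent | 80 | 80 | 115 | 98.34 | 1.66 | 0.3 |
| CBA | Min-Max | GSE87211 | Frequent | 20 | 80 | 770 | 98.89 | 1.11 | 0.37 |
| CBA | Min-Max | GSE87211 | Frequent | 40 | 80 | 320 | 98.34 | 1.66 | 0.42 |
| CBA | Min-Max | GSE87211 | Frequent | 50 | 80 | 14 | 98.07 | 1.93 | 0.27 |
| MACM | Min-Max | GSE25070 | Frequent | 20 | 80 | 2510 | 100 | 0 | 6.02 |
| MACM | Min-Max | GSE25070 | Closed | 20 | 80 | 202 | 100 | 0 | 0.58 |
| | | | | | | | | | |
| MACM | Min-Max | GSE25070 | Frequent | 40 | 80 | 1578 | 100 | 0 | 3.85 |
| MACM | Min-Max | GSE25070 | Closed | 40 | 80 | 140 | 100 | 0 | 0.78 |
| | | | | | | | | | |
| MACM | Min-Max | GSE25070 | Frequent | 80 | 80 | 1022 | 100 | 0 | 2.68 |
| MACM | Min-Max | GSE25070 | Closed | 80 | 80 | 45 | 100 | 0 | 0.25 |
| | | | | | | | | | |
| MACM | Z-score | GSE25070 | Frequent | 20 | 80 | 2510 | 100 | 0 | 6.36 |
| MACM | Z-score | GSE25070 | Closed | 20 | 80 | 202 | 100 | 0 | 0.72 |
| | | | | | | | | | |
| MACM | Z-score | GSE25070 | Frequent | 40 | 80 | 1578 | 100 | 0 | 3.79 |
| MACM | Z-score | GSE25070 | Closed | 40 | 80 | 140 | 100 | 0 | 0.42 |
| | | | | | | | | | |
| MACM | Z-score | GSE25070 | Frequent | 80 | 80 | 1022 | 100 | 0 | 2.54 |
| MACM | Z-score | GSE25070 | Closed | 80 | 80 | 45 | 100 | 0 | 0.23 |
| | | | | | | | | | |
| CBA | Z-score | GSE25070 | Frequent | 20 | 80 | 770 | 100 | 0 | 0.21 |
| CBA | Z-score | GSE25070 | Frequent | 40 | 80 | 770 | 100 | 0 | 0.23 |
| CBA | Z-score | GSE25070 | Frequent | 50 | 80 | 262 | 100 | 0 | 0.23 |
| CBA | Min-Max | GSE25070 | Frequent | 20 | 80 | 770 | 100 | 0 | 0.28 |
| CBA | Min-Max | GSE25070 | Frequent | 40 | 80 | 770 | 100 | 0 | 0.29 |
| CBA | Min-Max | GSE25070 | Frequent | 50 | 80 | 262 | 100 | 0 | 0.24 |
| MACM | Min-Max | GSE43580 | Frequent | 20 | 80 | 121 | 94 | 6 | 0.62 |
| MACM | Min-Max | GSE43580 | Closed | 20 | 80 | 108 | 94 | 6 | 0.48 |
| | | | | | | | | | |
| MACM | Min-Max | GSE43580 | Frequent | 40 | 80 | 42 | 77.33 | 22.67 | 0.24 |
| MACM | Min-Max | GSE43580 | Closed | 40 | 80 | 42 | 77.33 | 22.67 | 0.24 |
| MACM | Min-Max | GSE43580 | Maximal | 40 | 80 | 21 | 77.33 | 22.67 | 0.19 |
| MACM | Min-Max | GSE43580 | Frequent | 80 | 80 | 13 | 48.67 | 51.33 | 0.23 |
| MACM | Min-Max | GSE43580 | Closed | 80 | 80 | 13 | 48.67 | 51.33 | 0.17 |
| MACM | Min-Max | GSE43580 | Maximal | 80 | 80 | 7 | 48.67 | 51.33 | 0.15 |
| MACM | Z-score | GSE43580 | Frequent | 20 | 80 | 282 | 84 | 16 | 1.23 |
| MACM | Z-score | GSE43580 | Closed | 20 | 80 | 238 | 84 | 16 | 1.13 |
| MACM | Z-score | GSE43580 | Maximal | 20 | 80 | 40 | 84 | 16 | 0.26 |
| MACM | Z-score | GSE43580 | Frequent | 40 | 80 | 86 | 78.67 | 21.33 | 0.59 |
| MACM | Z-score | GSE43580 | Closed | 40 | 80 | 82 | 78.67 | 21.33 | 0.4 |
| MACM | Z-score | GSE43580 | Maximal | 40 | 80 | 25 | 78.67 | 21.33 | 0.21 |
| MACM | Z-score | GSE43580 | Frequent | 80 | 80 | 18 | 48.67 | 51.33 | 0.35 |
| MACM | Z-score | GSE43580 | Closed | 80 | 80 | 18 | 48.67 | 51.33 | 0.2 |
| MACM | Z-score | GSE43580 | Maximal | 80 | 80 | 12 | 48.67 | 51.33 | 0.15 |
| CBA | Z-score | GSE43580 | Frequent | 20 | 80 | 769 | 91.33 | 8.67 | 0.49 |
| CBA | Z-score | GSE43580 | Frequent | 40 | 80 | 343 | 89.33 | 10.67 | 0.31 |
| CBA | Z-score | GSE43580 | Frequent | 50 | 80 | 1 | | | 0.14 |
| CBA | Min-Max | GSE43580 | Frequent | 20 | 80 | 672 | 92.67 | 7.33 | 0.44 |
| CBA | Min-Max | GSE43580 | Frequent | 40 | 80 | 246 | 90.67 | 9.33 | 0.32 |
| CBA | Min-Max | GSE43580 | Frequent | 50 | 80 | | | | 0.17 |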










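| Method | Normalization | Itemset type | Min. support (%) | Min. confidence (%) | No. of rules | Accuracy (%) | Error (%) | | |
|---|---|---|---|---|---|---|---|---|---|
| MACM | Z-score | Frequent | 10 | 80 | 5105 | 88.89 | 11.11 | 105.6 | 227.4 |
| MACM | Z-score | Frequent | 20 | 80 | 5105 | 33.33 | 66.67 | 13.19 | 27.3 |
| MACM | Z-score | Frequent | 40 | 80 | 2046 | 33.33 | 66.67 | 3.66 | 7.43 |
| MACM | Z-score | Closed | 10 | 80 | 94 | 88.89 | 11.11 | 0.65 | 0.51 |
| MACM | Z-score | Closed | 20 | 80 | 94 | 33.33 | 66.67 | 0.49 | 0.18 |
| MACM | Z-score | Closed | 40 | 80 | 17 | 33.33 | 66.67 | 0.37 | 0.06 |
| MACM | Z-score | Maximal | 10 | 80 | 60 | 88.89 | 11.11 | 0.81 | 0.41 |
| MACM | Z-score | Maximal | 40 | 80 | 8 | 33.33 | 66.67 | 0.46 | 0.051 |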
The proposed MACM model was evaluated using the AUC and accuracy.

True positive (TP): Number of affected tissues that are correctly diagnosed
False positive (FP): Number of healthy tissues that are wrongly identified as affected
True negative (TN): Number of healthy tissues that are correctly diagnosed
False negative (FN): Number of affected tissues that are wrongly identified as healthy
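Accuracy and error rate follow directly from these four counts using the standard definition accuracy = (TP + TN) / (TP + TN + FP + FN). The Python sketch below assumes this standard formula and reports percentages, so that accuracy and error sum to 100 as in the result tables; the function names and the example counts are illustrative.

```python
def accuracy(tp, tn, fp, fn):
    """Percentage of tissues (affected or healthy) that are classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one sample is required")
    return 100.0 * (tp + tn) / total

def error_rate(tp, tn, fp, fn):
    """Percentage of tissues that are misclassified."""
    return 100.0 - accuracy(tp, tn, fp, fn)

# Illustrative counts: 9 TP, 9 TN, 1 FP, 1 FN out of 20 test samples.
acc = accuracy(tp=9, tn=9, fp=1, fn=1)
err = error_rate(tp=9, tn=9, fp=1, fn=1)
print(acc, err)
```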
The proposed method uses supervised discretization to generate rules with gene expression intervals. The supervised discretization method uses class information to split the data into a set of discrete intervals without loss of information. In contrast to approaches based on all frequent itemsets or on closed frequent itemsets, the class association rule mining algorithm uses maximal frequent itemsets. The proposed method therefore does not generate numerous frequent itemsets, so fewer redundant rules are generated and rule mining becomes faster. The proposed algorithm improves space and time utilization for gene expression data and helps to identify relationships among gene expression profiles. It also helps to find the functions of the genes and supports enrichment analysis for groups of genes. The



| Method | Rule format | Description |
|---|---|---|
| Ong et al ^{14} | gene1, gene2, gene3, …, genen → class1; gene2, gene3, …, genen → class2 | Extracts association rules with gene names |
| Yuan et al ^{15} | If (TSC2 ≤ 124.073) and (GLTP ≥ 2042.765) Then → Lung SCC | Extracts association rules using an IF–THEN relationship (e.g., IF gene1 ≥ 6.4 AND gene2 ≥ 4.8 THEN lung AC) |
| Proposed MACM method | g1[interval], g2[interval], …, gn[interval] → class1; g5[interval], g6[interval], …, gn[interval] → class2; g1[interval], g2[interval], …, gn[interval] → classn | Extracts association rules with gene expression intervals. It is used to find correlations among gene expression profiles and to identify the positive and negative regulators of the genes |
The proposed method diagnoses diseases from microarray gene expression data using maximal frequent itemsets and a probability-based prediction method. Experimental results show that the maximal frequent itemsets generate the rules quickly and consume less memory for storing the itemsets. Existing methods use only frequent itemsets, whereas the proposed method uses frequent itemsets, closed frequent itemsets and maximal frequent itemsets. The experiments were carried out on the binary-class colon cancer and lung cancer datasets and on the multiclass National Cancer Institute 60 (NCI60) cancer cell line gene expression data. The proposed MACM model provides up to 100% accuracy on the binary-class datasets but 88.89% accuracy on the multiclass dataset. In the existing methods, when the generated rules do not match the test pattern, the most frequent class in the training data is assigned. Because the proposed method uses a probability distribution, it instead assigns the predicted class for the test pattern based on the probability of the rules covered for a class. In addition, the proposed method uses only maximal frequent itemsets, which avoids rule pruning. A limitation is that the proposed method works best only on binary-class datasets. Future work can concentrate on multiclass data sets; an ensemble soft-weighted gene-selection-based approach and cancer classification using modified metaheuristic learning could be proposed to enhance the current work.