Indian Journal of Science and Technology
Year: 2018, Volume: 11, Issue: 40, Pages: 1-14
Che Li1*, Abeer Alsadoon1, P.W.C. Prasad1 and A. Elchouemi2
1 Charles Sturt University, Business, Justice and Behavioural Sciences Faculty, Sydney Campus, Australia; [email protected], [email protected], [email protected]
2 Information Technology Faculty, Ashford University, San Diego, California, USA; [email protected]
*Author for correspondence
Charles Sturt University, Business, Justice and Behavioural Sciences Faculty, Sydney Campus, Australia; [email protected]
Objective: Organizations increasingly rely on data mining techniques to support knowledge discovery from their data. Discretization algorithms are among the main techniques used for knowledge discovery in the data cleansing stage. This study develops an enhanced discretization algorithm, the ELFD, to investigate the impact of data cleansing on knowledge discovery. Methodology: The ELFD algorithm is based on the Low Frequency Discretizer (LFD), which includes four phases: copying the dataset, calculating the correlation ratio, identifying cut points and discretizing the dataset. The ELFD uses only a subset of the categorical attributes in order to increase the correlation ratio between a numerical attribute and each categorical attribute. We evaluate the new discretization algorithm on health datasets and compare it with the LFD. The classification accuracy of the discretized dataset is the major criterion for evaluating the ELFD. Finding: The classification accuracy of the ELFD is greater than that of the LFD; accuracy is enhanced by approximately 9% with the use of the ELFD. Allowing for manual recording errors, the processing time of the ELFD is similar to that of the LFD. Conclusion: The ELFD adds a step to the LFD: it chooses the top 75% of categorical attributes, ranked by the largest correlation ratio values, and then calculates the correlation ratio between the numerical attribute and only these categorical attributes. Using a subset of the categorical attributes increases the correlation ratio values, so the ELFD improves knowledge discovery from personal information contained in health records during the data cleansing stage.
Keywords: Corrupt Data Detection, Data Discretization, Data Cleansing, Data Mining, Data Pre-processing, Missing Value Imputation
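The attribute selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the standard correlation ratio (eta, the square root of between-group variation over total variation), assumes that "top 75%" is rounded up to a whole number of attributes, and uses hypothetical names (`correlation_ratio`, `top_categorical_attributes`, `keep_fraction`) that do not appear in the paper.

```python
import math
from collections import defaultdict


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numerical attribute.

    Assumption: the standard definition, sqrt(between-group SS / total SS).
    """
    groups = defaultdict(list)
    for category, value in zip(categories, values):
        groups[category].append(value)

    grand_mean = sum(values) / len(values)
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        return 0.0  # a constant numerical attribute carries no variation

    between_ss = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    return math.sqrt(between_ss / total_ss)


def top_categorical_attributes(table, numeric_name, categorical_names,
                               keep_fraction=0.75):
    """Return the categorical attributes with the largest correlation ratio.

    `table` maps attribute names to equal-length lists of values.
    Assumption: the number of attributes kept is rounded up, with at least one.
    """
    values = table[numeric_name]
    scored = [
        (correlation_ratio(table[name], values), name)
        for name in categorical_names
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    keep = max(1, math.ceil(keep_fraction * len(scored)))
    return [name for _, name in scored[:keep]]
```

With four categorical attributes, this sketch keeps the three that are most strongly associated with the numerical attribute; cut-point identification and the discretization itself are outside its scope.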