
High Dimensional Classification - An Overview

Affiliations

  • VIT University, Vellore – 632014, Tamil Nadu, India

Abstract


Objective: A comprehensive overview of high-dimensional data classification techniques is presented for the benefit of researchers, scientists, and data engineers in both the government and private sectors who work with high-dimensional data. Methods/Statistical Analysis: A systematic approach was followed, studying and reporting the literature from 1969 to 2016. Findings: High-dimensional data classification is found to be a challenging task because the data often do not fit into main memory, as conventional classification methods require. Many of the features may be irrelevant, and as the dimensionality increases while the number of samples remains limited, any conventional supervised learning algorithm may overfit to noise. The present study reviews methods that generate artificial samples to enlarge the training data and thereby improve classification performance. It is also noted that reducing dimensionality not only reduces storage space and computational time but also improves understandability. Applications: Text Classification, Email Classification, Pattern Classification, Information Retrieval, Gene Expression Analysis, Health Care Analysis, Predictive Modelling.
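
To make the two key findings concrete, the following is a minimal sketch, not a method from any of the surveyed papers: it augments a deliberately small training set with perturbation-based synthetic patterns, then applies PCA dimensionality reduction before classification. The dataset, the perturbation scale sigma, the neighbour count, and the number of components are illustrative assumptions, and scikit-learn is assumed to be available.

```python
# Sketch: synthetic pattern generation + PCA before classification.
# All parameter values below are illustrative, not taken from the survey.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

X, y = load_digits(return_X_y=True)          # 64-dimensional digit images
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=200, random_state=0)    # deliberately small training set

def synthesize(X, y, n_new, sigma=0.5):
    """Bootstrap-style pattern synthesis: resample training patterns and
    add small Gaussian perturbations to create artificial samples."""
    idx = rng.integers(0, len(X), size=n_new)
    X_new = X[idx] + rng.normal(0.0, sigma, size=(n_new, X.shape[1]))
    return np.vstack([X, X_new]), np.concatenate([y, y[idx]])

X_aug, y_aug = synthesize(X_train, y_train, n_new=800)

# Dimensionality reduction: project onto the leading principal components,
# cutting storage and computation while retaining most of the variance.
pca = PCA(n_components=20).fit(X_aug)
clf = KNeighborsClassifier(n_neighbors=3).fit(pca.transform(X_aug), y_aug)
print("test accuracy:", clf.score(pca.transform(X_test), y_test))
```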

Keywords

Dimensionality Reduction, Feature Selection, GA, LSA, PCA, Synthetic Pattern Generation.





This work is licensed under a Creative Commons Attribution 3.0 License.