P-ISSN: 0974-6846 | E-ISSN: 0974-5645

Indian Journal of Science and Technology

Year: 2022, Volume: 15, Issue: 35, Pages: 1712-1721

Original Article

An Effective Initialization Method Based on Quartiles for the K-means Algorithm

Received Date: 08 April 2022, Accepted Date: 30 July 2022, Published Date: 07 September 2022

Abstract

Objectives: This study aims to speed up the K-means algorithm by offering a deterministic, quartile-based seeding strategy for initializing its preliminary cluster centers, enabling it to build high-quality clusters efficiently. Methods: We investigated various cluster center initialization approaches in the literature and present our findings. We then propose a novel deterministic technique based on quartiles for finding initial cluster centers for the K-means algorithm. The proposed approach is applied to the data set to obtain the preliminary cluster centers, which are then supplied to the K-means algorithm. The proposed seeding method is evaluated on sixteen benchmark clustering data sets: five synthetic and eleven real data sets. The simulations are run in Python. Findings: The empirical results show that our proposed cluster center initialization method allows the K-means algorithm to form clusters with SSE values comparable to the minimum SSE values produced by repeated Random or K-means++ initializations. Furthermore, our deterministic initialization strategy ensures that the K-means algorithm converges faster than with the Random and K-means++ initialization techniques. Novelty: In this study, we explore the potential of quartile-based seeding as a technique for accelerating the K-means algorithm. Because our seeding method is deterministic, the need to run K-means repeatedly with different stochastic initializations is eliminated entirely, and execution time is reduced remarkably compared to the Random and K-means++ initialization techniques. Moreover, after initializing with the proposed method, a single run of K-means is sufficient to produce optimal clusters. Applications: Our proposed seeding technique will be helpful for initializing the K-means algorithm in time-sensitive applications, applications managing large amounts of data, and applications that require deterministic cluster solutions.
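
The abstract does not spell out the seeding procedure itself, so the snippet below is only a minimal sketch of one way a deterministic, quantile-based initialization for K-means could work: points are ranked by a scalar score, the ranking is cut into k equal-frequency slices, and the mean of each slice seeds one cluster. The helper quantile_seeds, the ranking score, and the use of scikit-learn's KMeans are illustrative assumptions, not the authors' published method.

```python
# Minimal illustrative sketch of a deterministic, quantile-based seeding for
# K-means. NOTE: this is NOT the exact procedure from the paper (the abstract
# gives no implementation details); it only illustrates the general idea.
import numpy as np
from sklearn.cluster import KMeans

def quantile_seeds(X, k):
    """Return k deterministic initial centers taken from quantile slices of X."""
    X = np.asarray(X, dtype=float)
    # Min-max scale each feature so no single attribute dominates the ranking.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    # Rank the points by a scalar score and cut the ranking into k
    # equal-frequency slices (quartiles when k = 4).
    order = np.argsort(Xs.sum(axis=1))
    slices = np.array_split(order, k)
    # The mean of each slice becomes one initial cluster center.
    return np.vstack([X[idx].mean(axis=0) for idx in slices])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.random((500, 4))        # stand-in data set
    k = 3
    centers0 = quantile_seeds(X, k)
    # Deterministic seeds: a single K-means run, no random restarts (n_init=1).
    km = KMeans(n_clusters=k, init=centers0, n_init=1).fit(X)
    print("SSE (inertia):", km.inertia_)
```

Because the seeds are a deterministic function of the data, repeated runs return the same clustering, which is the property the abstract highlights when comparing against repeated Random or K-means++ restarts.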

Keywords: K-means Algorithm; Initialization Method; Speeding K-means; Quartiles; Clustering; Deterministic Initialization Method


Copyright

© 2022 Jambudi & Gandhi. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Published By Indian Society for Education and Environment (iSee)
