In classical Graph theory, the construction of a Minimal Spanning Tree (MST) is a very important aspect and attracts the attention of many researchers. It has been studied widely in various fields of computer science and engineering such as computer networks, routing, social networking, etc. In a plane for 'n' collected data points, the MST can be constructed with the following steps
Nowadays, the advent of big data and internet connectivity of all applications has thrown a lot of challenges in handling data efficiently as far as their requirements and customizations are concerned. For efficient handling of data, it needs to be clustered or partitioned into manageable units.
To find the MST for largescale data points various divideandconquer approaches were used in the existing literature. The advantage of this approach is that a subproblem can be handled independently without the intervention of other data sets. It indicates that the MST problem can be parallelizable and be solved efficiently by exploiting the power of multicore systems.
A bimeans clustering algorithm was used by Jothi et al.
Sandhu et al.
The disjoint set data structures were used to determine the minimum spanning tree of a graph by Khan et al.
Mishra et al. divided the dataset into subsets based on the dot product of the dispersion property and position vector
The works in
• A novel heuristic is proposed to find a suitable value of
• Delaunay Triangulation is used to form a graph for each cluster as opposed to a complete graph.
• The AMST algorithm is proposed with complexity
• A new heuristic is proposed to find the second MST capturing boundary points in order to optimize the initial MST formed from earlier clusters.
The paper is organized into five Sections. Section 2 discusses some preliminary requirements for the development of AMST and mentions an overview of the proposed algorithms followed by proofs of the supporting theorems. Section 3 shows the experimental results and their comparison with existing works. Section 4 describes the concluding remark. All references are mentioned in the Section 5.
This Section discusses Delaunay triangulation and the kmeans clustering algorithms that are necessary for the development of proposed Accelerated MST as preliminaries. The proposed method is then discussed, followed by the mathematical justifications behind it.
A Delaunay Triangulation is made up of triangles for a set of points
Let
Funke et al.
Computationally, the Delaunay triangulation can be obtained more easily from a set of points in logarithmic time. In the proposed approach Delaunay triangulation is used to get the reduced set of edges from a set of connected points, as a result, the graph formed here has a smaller number of edges compared to a complete graph.
A popular optimization problem is solved using the kmeans algorithm
The observed time complexity of the
The AMST is created in three steps for generating an approximate MST for a big dataset. In the first stage, the data is divided into different clusters, and an initial MST is constructed, comprising all MSTs of these clusters. The loss of shortdistance edges during the connection phase of MSTs necessitates a refinement stage. Finally, AMST is created by combining the results of both the stages. The following subsections briefly provide more information about the intermediate stages.
a) Divide Step. Partition of the
b) Conquer Step. Construct a Delaunay Triangulation graph for
c) Combine Step. Connecting all MSTs of
a) RePartition the dataset over the border of the cluster produced in the previous stage using a heuristics called MidPointHeuristics.
b) Generate the second approximate MST denoted as MST_{2} by using the Conquer and combine step of the DivideandConquer Step.
a) Create a graph
b) Get the AMST by applying Kruskals algorithm to
At this stage, the given dataset of size
In this step for each cluster, a graph is formed using Delaunay Triangulation. An MST is constructed using Kruskal’s algorithm for each graph of a cluster. The motivation behind the use of Kruskal's algorithm is to minimize the runtime of MST as the graph by Delaunay triangulation has a






1. 
Taking 
2. 
Construct a graph of each cluster using the Delaunay Triangulation 
3. 
Apply Kruskal’s Algorithm to the graph of each cluster to get 
4. 
return 
A Delaunay Triangulation graph is formed using the centroids of each cluster and then an MST of the graph is obtained to find the neighboring relationship between the clusters. A pair of adjacent MSTs are connected by an edge joining two vertices belonging to respective MSTs sitting nearest to their neighbor cluster's centroids. The MST generated by this step is called the first approximate MST denoted by MST_{1}. The detailed discussion of the generation of MST_{1} is described in Algorithms 1, 2 and 3 (
Algorithm 2: Joining MSTs Input: k MSTs: MST , MST , ...MST Output: Approximate MST of obtained from MST(C_{i}) for 1 ≤ i ≤ k 1. Find the centroid of each cluster Ci, 1 ≤ i ≤ k. 2. Construct a graph of all centroids μ = , , ..., µk using the Delaunay Triangulation. 3. Find an MSTµ of the graph obtained from step 2 using Kruskal’s Algorithm to determine the neighboring relationship between clusters. 4. For each pair of clusters , that their center and are connected by e ∈ MS Tµ, discovered the edge by Algorithm 3 (Get Connecter Edges), that connects MST and MS T. 5. Join the set of edges found in step 4 to k MSTs: MST , MST , ..., MST to get the First approximate MST: MST1. return {MST1}
The MST_{1} obtained using the divideandconquer approach is suboptimal. Because some edges between the two adjacent MSTs lead to optimality being missed while combining the MSTs. So a refinement of the dataset over the boundary is required to capture those edges. Hence, a repartitioning of the dataset over the boundary is required. A midpoint heuristics is proposed for repartitioning in the following subsection that obtains the second approximate MST called MST_{2}.
Algorithm 3: Get Connecter Edges 
Input: A neighbor pair MSTs, 
Output: Connecter edge between 
1. Find the nearest vertex a ∈ from the centroid of cluster . 
2. Find the nearest vertex b ∈ MST from the centroid of cluster . 
3. return } // connecter edge. 
To get a refined MST, the dataset is partitioned again by targeting all points that lie across the boundaries between a pair of clusters. The initial centroid of this clustering process is the midpoints between two neighbour centroid
A heuristic is used to calculate the midpoints. This is called MidPoint heuristics and is defined in Algorithm 5. The midpoints are calculated for some specific edges only. If an edge cost is less than or equal to the average cost of all edges in the graph of all centroids, then the
The conquer and combine steps are similar to the Conquer Step and Combine Step of the divideandconquer Stage. The MST generated in this stage is called the second approximate MST, termed MST_{2}. The stepbystep procedures are described in Algorithm 4 (
The first and second approximate MSTs are merged to produce a graph having
Algorithm 4: Get MST2 Input: Centroid Set of Clusters and Dataset having n points Output: Second Approximate MST of : MST2 1. Construct a graph of all Centroids using Delaunay Triangulation 2. Get the midpoint using Algorithm 5 (MidPointHeuristics) 3. Apply kmeans clustering over midpoint as seeds of clusters and run it for a single iteration 4. Construct a graph of each cluster using the Delaunay Triangulation 5. Apply Kruskal’s Algorithm to get MSTs for each cluster. 6. Join all MSTs by Algorithm 2 (Joining MSTs) to get the Second approximate MST: MST2.
Algorithm 5: MidPoint Heuristic Input: A graph of Centroids µ of Clusters C Output: A set of Midpoints 1. Find the Average Cost of all Edges in the given graph 2. Select edges having Cost ≤ Average Cost of all edges in the graph 3. Compute the midpoint of Selected edges 4. return {A set of Midpoints}
Before the complexity calculation of the AMST algorithm, it has been shown that the cost of Euclidian MST obtained from a complete graph of
(ii) DMST can be computed in
(a) In the Delaunay Triangulation there exists
(b) If
As a Delaunay Graph is a subgraph of the complete graph of
The Delaunay Triangulation is a subgraph of a Complete Graph. So running time of the Kruskals algorithm on Delaunay Triangulation will take time
Proof: Let
F(k) =
Where
is the time complexity for
T(Delaunay) is the time to compute the Voronoi Diagram for
The time complexity of Kruskal’s algorithm is
Then
To get the optimal value of
Taking
On solving the above Equation 6 it is found that
Since
where
The choice could be
Next combining the results of Equations 2, 3 and 4 it is found that If
Substituting
This proves the theorem 2.
Proof: Let the total points in each cluster form an arithmetic series
That is
Since all graphs are constructed using Delaunay Triangulation for each cluster
By Wallis formula
Putting
Further taking
This ensures the proof of Theorem 3.
It is interesting to note that the time complexity of the proposed AMST is
To test the efficacy of the proposed AMST against existing algorithms, the experiments are conducted on clustering datasets namely Compound, Pathbased, S1, Joensuu, A3, T4.8k, Birch1 and Europe









Data size 
300 
399 
5000 
6014 
7500 
8000 
100000 
169308 
Dimension 
2 
2 
2 
2 
2 
2 
2 
2 
A Weight Error Rate (ERweight) metric is used to test AMST and
A large dataset requires a partition approach that depends on the number of clusters
In the current work, the proposed algorithm chosen
The weight error rate versus datasets for various algorithms
The execution time of the proposed AMST is compared with earlier works
Kruskal’s algorithms and














T4.8k 
FMST 
1.61 
31.01 
41.28 
0.31 
74.21 
AMST 
1.19 


0.31 


Birch1 
FMST 
33.17 
89.91 
95.96 
0.93 
219.97 
AMST 
30.69 


0.85 


Europe 
FMST 
136.28 
614.1 
545.63 
1.42 
1297.43 
AMST 
11.93 


1.33 

In
In the following the accuracy of MST as regards clustering for the proposed AMST are discussed against clusterbased MST including FMST of Zhong et al using three indices such as Rand Index






Pathbased 
0.84 
0.93 
0.82 
0.75 
0.94 
Compound 
0.98 
0.98 
0.98 
0.98 
0.98 






Pathbased 
0.643 
0.86 
0.614 
0.509 
0.86 
Compound 
0.962 
0.962 
0.96 
0.973 
0.96 






Pathbased 
0.764 
0.907 
0.745 
0.715 
0.907 
Compound 
0.971 
0.971 
0.97 
0.979 
0.971 
The experimental results for the accuracy of MST based on clustering using the Rand Index, Adjusted Rand Index and Fowlkes Mallows Index are presented in
The Adjusted Rand Index in
A runtime advantage of AMST for various small and large datasets is found in
The proposed AMST algorithm obtains an optimal MST for scalable datasets considering a Delaunay Triangulation approach. The proposed algorithm has a promising speedup of 95% as compared to the existing algorithms and the quality of the MST is also close to the exact MST. The salient feature of the proposed AMST is time complexity is of the order