In recent decades, microprocessors that are commercially available are predominantly employed in the world of embedded systems that are RISC based processors, for instance ARM, AtmelAVR, and much more are specifically prominent. Processor from the aforementioned families has been used in several applications including IoT devices and energy constrained embedded systems ^{1}. Various sorts of RISCV instruction sets are developed by numerous research groups across the globe resulting in free accessibility for embedded systems. This is supported by the RISCV International as successful attempts are made to bring different industries companies throughout the globe, and it is an open for collaboration. The business approach is flexible with open license, with the intention that any person can contribute for the modifications or use the contributions anywhere ^{2}. This has helped the research groups to implement novel methods to modify the RISCV architecture and result as a hardware accelerator.
RISCV architecture has various blocks that include CPU, decoder, General Purpose Registers (GPR), Control and Status Register (CSR), ALU, Multiplier, and many other important segments. One of the significant sections of CPU is the prefetch, it is programmed to decipher the commands from memory and exhibits the received instructions to the rest of the processor blocks. In prefetch, sorting is one of the basic blocks to perform the desired operation. Nevertheless, sorting plays a noteworthy role as it assists in processing queries and various kinds of data sorting, the implementation of it can be witnessed in the domain, such as data centres, data mining, IOT edge nodes, and cloud servers ^{3}. Sorting has the primary upper hand in accessing the data effortlessly from the database, as mostly the data required will be in alphanumerical manner. Hence, an optimized sorting algorithm is essential that can accelerate the performance. Due to this, FPGAs have witnessed its application in intensified domains with complex computations. The basic building blocks of FPGAs are LookUp Tables (LUTs), memory units, and processing units of small scale to avail the advantage of ease of programming
In recent decades, several sorting algorithms have been developed and tested for diverse applications. The scope of the modified sorter is very wide and it has immense potential to be used for various applications. It has been inspected that verifying the drafted algorithm in a software platform is simpler. However, it is also possible to demonstrate on a hardware platform in which the results will be compared in terms of relative performance assessment, but configuring and executing it is cumbersome. Specifications that are assessed for the comparability with the optimized algorithm are size of data, number of data elements, speed, and time.
Since FPGAs are meant to handle abundant data, on which aggregation, sorting, merging, and joining operations will be performed, resulting in meaningful information. The importance of sorting algorithm is discussed in the earlier section. There are many sorting techniques like Bitonic sort, Oddeven sort, bubble sort, merge sort and insertion sort, which have certain advantages, disadvantages that are discussed below in detail
Bubble sort is one among the straightforward and well known methods
Usui.et.al. ^{9} proposed a treebased merge sorter for largescale sorting and successfully implemented. It was observed that the performance was degraded for the data that are distorted. To handle massive data sets, a dependable approach was needed. It was achieved by implementing improved sortmerge unit that enhanced the overall throughput but hard to scale for a large Bubble Sort number of data sets. Casper et.al. ^{10} Provides a detailed comparison of various sorting techniques. The parameter that is emphasized sorter that is free from comparator approach. Multiple attempts were made in the recent past to develop comparatorfree sorting, illustrating attention towards speed and resource utilization. Similar to this, in the proposed technique along with not using any comparator or complex data path, optimization is also performed ^{10}.
Alternative sorter algorithm is OddEven method; it is very similar to the bubble sorting with few modifications. The computation is executed until the array elements are sorted and in each iteration two phases occurs, Odd and Even Phases. In the odd phase, we perform a bubble sort on odd, and we perform a bubble sort on even indexed elements ^{11}. This process continues for the alternating index (oddeven pairs, evenodd pairs) until no swapping operation is performed and finally the array is in sorted form. There are two modules involved in this sorting technique: compare and swap. Therefore, the timing complexity can be deduced as O(N^{2}). By adding parallel computing to this sorting technique, the time complexity of O(N) is possible to be achieved ^{12}. Recently, it is found that this is the preeminent sorting method on a hardware device since it presents stable throughput and constant performance over skewed data distributions.
Bitonic sort is based on the Bitonic sequence as well as it is a standard parallel algorithm for sorting. A Bitonic sequence is composed of nearly half of the array in ascending order and another one in descending order ^{13}. This sorting technique is widely used since the implementation is easy due to its structure compared to oddeven sort. The complexity of this sorting technique is O (N log_{2}(N)).
Insertion sort is based on finding the right node for inserting a new element by comparing with other elements. Depending on the ascending or descending order, the right node reflects the minimum or maximum element. The data were scanned from right to left in sequential order. The sorting operation is overlapping with the input data activities in this sorting. The major drawback of this sorting method is not suitable for massive data sets.
There are various techniques to sort elements without the comparison unit. An elaborated comparisonfree sorting method, technology, and implementation methodology are presented in Table I. From the table, its clearly exhibits the advantages of comparisonfree sorting techniques and as well power analysis and reducing delay is not reported in previous work. As already mentioned, oddeven sorting is efficient technique on hardware, a comparisonfree unit inside the oddeven sorting reduces delay and provides efficient resource utilization.
The proposed method deals with the design of comparator free sorter and the obtained results are compared with the existing sorters such as Bitonic, Bubble sort and many more. The rest of the paper is organized as follows. Section 2 elaborates the proposed work of comparatorfree sorting technique with an example. Section 3 evaluates the performance of the proposed architecture by elaborating the simulation results. Finally, Section 4 concludes the paper.
The oddeven sorting technique is dedicated for parallel processors. This sorting is based on oddeven and evenodd pairs. The worstcase complexity is O(N^{2}) and the bestcase complexity is O(N). This sorting can be efficiently implemented on FPGA, since FPGAs mitigate the constraint on CPU. This sorting can also achieve higher throughput by implementing on GPU, but FPGAs immerse low power compared to GPUs. The oddeven sorting design follows a hierarchical model i.e. resulting from the instantiation of submodules like move through and swap_if_required. Due to modular parameters, it’s easy to identify the design of sorting algorithm.





A novel Hardware–based sorting technique which uses SRAMbased memory to store data elements. The stored elements follow the Hamming maximum order representation. 
90 nm TSMC technology. A simple matrix multiplication using AND operation sorts the data elements. The Hardware structure mitigates the repetitive flow of data elements between the memory and sorting units. Complexity results in O(N) order. 
Power Analysis is not described for the proposed sorting technique. The proposed sorting unit is suitable for single core architecture. 

Sorts input data onthefly without any comparators between the data. The proposed technique uses simple registers to hold the binary elements. 
A matrixmapping operation are performed to sort elements. The 
The proposed algorithm reported a power consumption of 1.6 mW for N = 1024 elements. Not suitable for a large set of elements. 


The NCL sorting unit and comparison free unit compared with respect to transistor, LUTs and power. 
By implementing onehot matrix comparison free sorting, the overall power, area, and time compared to null convention sorting is drastically reduced. 

Proposed a softwarebased comparisonfree sorting method that divides the computations into three phases: count, sum, and sort. 
Threads are bundled into singleinstruction execution units in each phase, which is a criterion for optimizing hardware use. Based on the mathematical algorithm for CPU implementation (single thread, multithread) 
Does not concentrate on power 

The proposed architecture, in the first iteration, the largest element (LE) is found. Following that, it identifies the next LE from the remaining data elements in each iteration. The sorting engine has cascaded blocks selected one after another. Each of the blocks in cascaded consists of N number of basic cells Each cell consists of a 2input AND gate and a tiny switch. 
65nm technology standard cell library. Linear sorting delay of O(N) clock cycles 
Test beyond 1024 elements of 32 bits and 2048 elements of 11 bits. Power consumption not discussed 
^{19} 
DAS (Demultiplexer Array Sorting engine)  based on radix2 sort algorithm combined with counting sort algorithm. It uses bitserial approaches for bitunit operations so that their hardware burden is much lighter than that of comparisonbased implementations which perform wordunit operations. 
DAS uses NXN deMUX array to sort N of kbit data and completes the sorting operations in k cycles regardless of N. Time complexity of DAS is O(Nk). DAS has O(N^{2}) hardware complexity due to the array of deMUX, 
DAS uses minimal memory because it does not move any data during the sorting operation, but regenerates the input data to memory in sorted order. 

Proposed an FSM that functions as comparisonfree sorting algorithm. It’s based on counting number of 1’s and uses one hot decoder and upcounter for sorting elements. 
90 nm CMOS technology In order to reduce hardware complexity, the FSM module is presented with a comparisonfree sorting algorithm. 
Designed for signed numbers 
A novel architecture based on oddeven comparatorfree sorting is introduced in this paper. This architecture sorts the N elements in N iterations. The proposed sorting is based on oddeven sorting technique that does not require any comparator or complicated circuitry or intricate algorithm.
The evenodd, indexed data elements are passed to a simple circuitry which consists of basic gates (and, or, not). In the proposed architecture, only basic gates are used, the delay can be ignored and finally results in the improvement of the overall performance.
Boolean_Expression: A Boolean expression produces a Boolean value by applying logical functions to the inputs. It plays a major part in building circuits, generating algorithms and as well in programming an application. Hence, minimizing Boolean expression is a prime requisite. There are different methods to minimize the expression like applying algebraic rules, Kmap, tabular method, and Quine–McCluskey algorithm (QM). Among these methods, QM algorithm is the most efficient and powerful way of minimizing Boolean expression. In the proposed architecture, the swap decision is made, based on the results obtained by QM method. The prime implicants obtained by the function f are as follows.
If(A>B)
then f=1;
else f=0;
where A, B are inputs to Swap_If_Required function.
The hardware engine comprises of enumerated blocks connected in a cascaded form as shown in
The internal structure of each alternate block is shown in
The Boolean expression can be easily solved on FPGAs by using Look Up Table (LUT) and Multiplexor. Hardware LUTs are specially made truth tables that can be customized as per the logic functions required. The values are loaded to FPGA as per requirement. LUTs can be customized for any combinational circuit. In the proposed structure, LUT6 is used. It is a 6input and 1output LUT. As shown in
Let us consider an example for sorting five data elements, say 4(=0100), 7(=0111), 3(=0011), 5(=0101), and 2(=0010). In the first stage, the first element is moved through the next stage without any changes. The even indexed and odd indexed elements are swapped depending on the value returned by Boolean expression. Here the elements are 7 and 3. The Boolean expression returns 1, hence swapping occurs. The next elements are 5 and 2 need to be swapped. In continuation, this process repeats for the next stage in pipeline. Finally, the last stage has a sorted array.
The designed sorter algorithmic program was simulated using Xilinx ISE 14.7 on Virtex7, it is optimized and integrated for 28nm for the best performance
The number of elements considered for sorting was eight. Hence, N=8 in all conditions. Therefore, for 4bit number the total length is 32bits, for 8bit numbers the total length is 64bits, for 16bit numbers total length is 128bits, for 32bit numbers the total length is 256bits.
The proposed architecture takes 2.147 ns, 3.104 ns, 3.791 ns, and 4.445 ns delay was observed for sorting 4bit, 8bit, 16bit and 32bit numbers (respectively) with 8 elements in each case. As well, it is evident that the numbers of LUTs employed by the proposed method/ sorter is also less compared to the other sorting methods. Hence, it can be claimed that along with the reduction of time, the area required will be less. As a result, the authors believe that the proposed method has an upper hand over the compared methods.








4 bits 
Bitonic 
72 
219 
221 
31 
68 
2.874 








Insertion 
70 
168 
168 
41 
68 
2.497 

selection 
79 
214 
216 
33 
68 
3.338 

bubble 
76 
235 
236 
31 
68 
3.433 

8 bits 
Bitonic 
140 
430 
436 
33 
134 
3.431 








Insertion 
138 
637 
640 
32 
132 
3.409 

selection 
145 
378 
382 
38 
132 
3.743 

bubble 
145 
395 
400 
37 
132 
3.475 

16 Bit 
Bitonic 
264 
650 
652 
41 
260 
4.697 








Insertion 
274 
1087 
1097 
45 
260 
5.607 

selection 
270 
645 
646 
42 
260 
4.865 

bubble 
273 
678 
679 
41 
260 
5.942 

32 bits 
Bitonic 
518 
1536 
1536 
44 
516 
5.786 








Insertion 
620 
2777 
2867 
46 
516 
5.873 

selection 
530 
1150 
1155 
46 
56 
4.899 

bubble 
533 
1331 
1338 
48 
516 
6.125 
The





1 
Slice 
64 
49 
46 
2 
LUT 
112 
85 
79 
3 
Delay 
7.646 
6.991 
1.164 






Slices Occupied 
11142 
1665 
6832 
504 
336 
LUTs Utilized 
71310 
3750 
43719 
2016 
1682 
FlipFlops 
90169 
3330 
133584 
2016 
672 
Delay (ns) 
400 
1500 
1600 
1300 
72 
Device 
Virtex7 
Virtex5 
Virtex7 
Virtex5 
Virtex7 





Slice Registers 
8bits 
256 
4288 
0.832 
LUTs 
9088 
3.53 

Flip Flops 
4224 
13.51 

Slice Registers 
512 
8555 
1.66 

LUTs 
18173 
7.05 

Flip Flops 
8442 
27.5 

Slice Registers 
1024 
17145 
3.33 

LUTs 
36347 
14.11 

Flip Flops 
16891 
54.03 

Slice Registers 
16bits 
256 
8384 
1.63 
LUTs 
15712 
6.1 

Flip Flops 
8320 
26.62 

Slice Registers 
512 
16762 
3.25 

LUTs 
31420 
12.2 

Flip Flops 
13532 
43.29 

Slice Registers 
1024 
33531 
6.51 

LUTs 
62844 
24.4 

Flip Flops 
21273 
68.05 

Slice Registers 
32bits 
256 
16672 
3.24 
LUTs 
33664 
13.07 

Flip Flops 
6576 
21.04 

Slice Registers 
512 
33334 
6.47 

LUTs 
67321 
26.13 

Flip Flops 
20510 
65.61 

Slice Registers 
1024 
66682 
12.94 

LUTs 
134655 
52.27 

Flip Flops 
24295 
77.72 
A modified Oddeven approach to sort N elements is proposed. The complexity of the novel sorting is O(N) that is used to denote the sorting speed. The expanded evenodd sorter is tested and verified by implementing on FPGAs that had the configuration of Virtex7, capable to work at 224 MHz. From the comparison table (