This post demonstrates the implementation of a systolic array based sorting engine, which consists of 8 processors containing 2 elements each, for a total of 16 elements to be sorted. This project was part of an IIT Madras course titled “Mapping DSP Algorithms to Architectures”, co-taught by Prof. Nitin Chandrachoodan of IIT Madras and Prof. CP Ravikumar of Texas Instruments. I worked on this project with my classmate Rajat Rao.

Standard sorting algorithms include bubble sort, insertion sort, selection sort, merge sort, heap sort, and quick sort. All of these are designed for single processor implementation, and cater to different time/space complexity requirements.

In parallel computer architectures, a systolic array is a homogeneous network of tightly coupled data processing units called cells or nodes. Each node independently computes a partial result as a function of data received from its neighbours, stores the result within itself, and passes the relevant data to its neighbours. Systolic arrays were invented by Richard Brent and H.T. Kung, who developed them for computing the greatest common divisors of integers and polynomials.

In this project, we need to sort a set of numbers that are already considered to be present within the processors. As an aside, if these were to be input linearly (one after the other) into the processors (either from the right or from the left), it is possible to sort them as they are being transferred across the processor array, by using the same algorithm on each processor (input from left, sorts in ascending order) – compare the number present internally (initialized with INT_MAX), and if the number saved internally is smaller, forward the new input, but if it is larger, replace it with the new input and forward the previous internal number.

For the above reason, we assumed that the numbers to be sorted are already present in the processors (parallel loaded for tests). The standard sorting technique (odd-even transposition sort) is illustrated below.

In our problem statement, there are 16 elements, 2 per processor, and hence multiple ways of performing the sorting. Two techniques that are immediately evident are described as follows.

The first method transfers one element from one processor (A) to the other (B) (1 clock cycle), performs 2 compare operations to find the two smallest elements (2 clock cycles) and returns the largest one to the other processor (1 clock cycle). This set of operations is then repeated for the other number in processor A (4 clock cycles). Finally, the two processors swap their internal values in-place, if required. A total of 10 clock cycles would be required to implement the above algorithm. The main advantage is that the bus width is limited to the width of a single number (16 bits in our implementation). The disadvantage is that the bus is being used four times, and will be the primary bottleneck on systems with high-latency buses.

The second method transfers two elements at a time using a 32-bit bus (1 clock cycle), performs a 4-number sort within each processor (6 clock cycles) and returns the corresponding numbers to processor B (1 clock cycle). This reduces the number of bus transactions by 2 (since the bus width is doubled), but the number of computations remains the same (6). The total number of clock cycles becomes 8.

The method that we have adopted is a combination of the above two techniques, described below. We have used a 16-bit width bus to connect the processors, and used 2 clock cycles to transfer both numbers from processor A to processor B, following which a 4-number sort is performed within processor A (6 clock cycles) and the results are sent to processor B.

# Implementation

Our Verilog based implementation consists of a top-level module **sorting-engine**, which initializes 8 processors implemented in the **processor **module, and a bus controller, implemented in the **bus_controller **module.

The processors are implemented in as generic a manner as possible, so as to make them compatible for a wide variety of applications. They are connected by bidirectional (inout) 16-bit width buses to the left and right, and are controlled by two 2-bit control signal inputs from the bus-controller. In addition to the above ports, they also have reset and clock signals provided by the bus-controller. Finally, there is an initialization signal that allows the bus-controller to load to all the processors in parallel. This would provide the correct initial state for the implementation of the desired algorithm, as mentioned earlier.

The bus controller module splits up its work into **even **processors and **odd **processors. The inputs to the bus controller are the reset and clock signals, and the outputs are the left/right signals for the even/odd processors. The sort states are organized such that there is one state for each possible operation, as follows:

(here RR = Receive from Right, RL = Receive from Left, SR = Send Right, SL = Send Left, CMP = Compare)

(1) odd RR, even SL

(2) odd CMP

(3) odd SR, even RL

(4) odd SL, even RR

(5) even CMP

(6) odd RL, even SR

In this implementation, each processor sends two values, waits for 6 clock cycles for the 4-number sorting to be completed, and then sends back the resulting two sorted outputs. For this, we have built separate states for the bus controller to know that the data has finished processing, and for each systole of the algorithm. The data state 0 means the system is idle, data states 1-4 mean steps 0-1, 2-3, 4-5, 6-7 of the algorithm, and data state 5 means that the sorting is complete and the system can be reset.

In both systolic steps of the sorting algorithm, Send to Right (SR) and Receive from Left (RL) operations are performed, resulting in the comparison step taking place in the left processor of that pair. The algorithm we have used for sorting the four numbers inside the “*left*” processor of the pair is as follows.

- Step 1: Compare x
_{1 }and x_{2}. Swap them if x_{1}> x_{2}. - Step 2: Compare y
_{1}and y_{2}. Swap them if y_{1}> y_{2}. - Step 3: Compare x
_{2}and y_{1}. Swap them if x_{2 }> y_{1}. - Step 4: Compare y
_{1}and y_{2}. Swap them if y_{1}> y_{2}. - Step 5: Compare x
_{1 }and x_{2}. Swap them if x_{1}> x_{2}. - Step 6: Compare x
_{2 }and y_{1}. Swap them if x_{2}> y_{1}.

The above algorithm was verified to work by implementation in Python and running it over all possible combinations of 4-tuples using the itertools.permutations iterator.

The bidirectional inter-processor buses were implemented using Verilog’s inout port construct. As a result, the value on the bus is only valid when the processor is in a “write to other processor state”. In any other state, the buses are tri-stated and will be displayed as high-impedance signals (ZZZZ).

The processor contains 9 general purpose registers (r0 to r8), out of which 5 are used by the current application. The first four registers (r0 to r3) are used for the above algorithm: r0 and r1 store the processor’s own values, whereas r2 and r3 store values obtained from the adjoining processor. The register r8 is used to keep track of the stage of the above 4-value sorting algorithm.

The above modules were implemented in Verilog and tested using a self-checking test bench, with the input array and expected output (sorted) array generated using Python. The output is as shown below.

# Results

The above implementation is able to successfully sort 16 values and requires a total of 81 clock cycles (80 clock cycles for actually sorting, and one extra clock cycle since the data_state of the bus controller is updated from idle to sorting mode and the sort_state updates only happen when data_state is 1, 2, 3 and 4.

The simulation output shows that the data_state is valid (non-idle: 1, 2, 3, 4) between times 8.9 us and 17.0 us. This gives a total time of 8.1 us, which corresponds to 8.1 us/100 ns = 81 clock cycles. The final values stored internally in the processors is the ordered array, as expected. It can also be seen that the interconnection buses are being used for short periods of time for data transfer (serially – 2 values back to back), following which they are tri-stated for 6 clock cycles (during the 4-value sorting algorithm), and then used again to send data back.

The above outputs have been generated by using a Virtual I/O (VIO) IP in Vivado for providing the inputs to the sorting_engine module, and probing the outputs. The code was ported to the Zybo FPGA board and verified to sort the input array correctly.

The above output obtained from the self-checking test bench also confirms that the implementation can sort any arbitrary array of 16 integers.

The code is freely available on GitHub. The GitHub repository includes the **full Verilog code** with functionality as described above, Python code to **validate the 4-value sorting** algorithm, Python code for **generating random input/output arrays** for testing, and finally, Verilog code for a **self-checking test-bench**.