This is a follow-up to the previous post on building a JPEG Compression Engine. This post assumes some background with the Xilinx Zynq All Programmable System on Chip (APSoC) series. For this post, the Python code was converted directly to C by adding data types and libraries wherever necessary. The image is read from a text file generated by the Python program created earlier, which keeps the focus on the computational operations.
The outputs of the Python and C programs differ slightly, probably because Python abstracts away the data types that had to be made explicit in C. In any case, the signal processing operations are performed as per the JPEG standard, so the computation required is identical to that of actual JPEG compression.
The C program thus created is available on GitHub. It can be run using GCC on any Linux based operating system, or an equivalent compiler for Windows. Details of running and sample test cases are provided with the code.
Now, in order to port the algorithms to hardware, I am going to use Xilinx’s Vivado HLS software. HLS stands for High Level Synthesis. It converts C/C++ code into synthesizable RTL code that can be directly mapped to programmable logic on an FPGA.
To make the code synthesizable while requiring less hardware, I made a number of changes, including the following:
1. Conversion of all floating point variables to Q16.16 fixed point so as to simplify arithmetic operations. IEEE 754 floating point arithmetic requires significantly more hardware than fixed point for the same operations.
2. Implementation of the cosine matrix as an 8×8 lookup table, removing the need for an expensive CORDIC engine. A CORDIC engine computes each trigonometric value iteratively, using several table lookups and arithmetic operations over approximately 20 steps (for Q16.16 accuracy).
3. Replacement of the variable-size image read from file with a fixed-size image, making the design possible to synthesize. A variable number of loop iterations makes it difficult for Vivado HLS to create fixed RTL logic for FPGA synthesis.
4. Creation of a top-level function with an AXI4-Lite interface for transferring the image from the processor to the FPGA and back. This is required for the master processor (the Zynq's ARM Cortex-A9) to transfer the data to the FPGA fabric for computation, and then retrieve the processed output.
5. Pipelining at the individual function level, loop unrolling, and function inlining to minimize latency and maximize throughput. This was done with optimization directives (short statements directing the compiler to spend additional hardware on improving throughput).
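To make the fixed point, lookup table, fixed loop bound and pipelining points concrete, here is a minimal sketch. It is not the code from the repository. It assumes an orthonormal 8-point DCT-II scaling, and the names `q16_16`, `q_mul`, `DCT_LUT` and `dct8` are illustrative. The `PIPELINE` directive is placed at function level, which, as far as I understand Vivado HLS, also unrolls the loops inside; GCC simply ignores the pragma.

```c
#include <stdint.h>

#define N 8            /* fixed block size: every loop bound is a constant */
#define FRAC_BITS 16   /* Q16.16: 16 integer bits, 16 fractional bits */

typedef int32_t q16_16;

/* 0.5 * cos(k * pi / 16) rounded to Q16.16, for k = 1..7 */
#define C1 32138
#define C2 30274
#define C3 27246
#define C4 23170   /* also sqrt(1/8), the scale factor of the DC row */
#define C5 18205
#define C6 12540
#define C7  6393

/* DCT_LUT[u][x] = c(u) * cos((2x + 1) * u * pi / 16), precomputed offline */
static const q16_16 DCT_LUT[N][N] = {
    { C4,  C4,  C4,  C4,  C4,  C4,  C4,  C4},
    { C1,  C3,  C5,  C7, -C7, -C5, -C3, -C1},
    { C2,  C6, -C6, -C2, -C2, -C6,  C6,  C2},
    { C3, -C7, -C1, -C5,  C5,  C1,  C7, -C3},
    { C4, -C4, -C4,  C4,  C4, -C4, -C4,  C4},
    { C5, -C1,  C7,  C3, -C3, -C7,  C1, -C5},
    { C6, -C2,  C2, -C6, -C6,  C2, -C2,  C6},
    { C7, -C5,  C3, -C1,  C1, -C3,  C5, -C7},
};

/* Q16.16 multiply: widen to 64 bits, then drop the extra 16 fractional bits.
   Assumes an arithmetic right shift for negative values (true for GCC). */
static q16_16 q_mul(q16_16 a, q16_16 b)
{
    return (q16_16)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}

/* 1-D 8-point DCT using only table lookups, integer multiplies and adds. */
void dct8(const q16_16 in[N], q16_16 out[N])
{
#pragma HLS PIPELINE
    for (int u = 0; u < N; u++) {
        int64_t acc = 0;
        for (int x = 0; x < N; x++) {
            acc += q_mul(DCT_LUT[u][x], in[x]);
        }
        out[u] = (q16_16)acc;
    }
}
```

Feeding in eight samples of 100.0 (raw value 100 × 65536) gives a DC output of 18536000 in raw Q16.16, about 282.837 against an exact 100·√8 ≈ 282.843, and zero for the other seven coefficients. The small DC error comes from rounding the table entries to 16 fractional bits.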
The design is pipelined to the point where each byte pushed in results in a byte pushed out, giving a throughput of approximately 1 byte per clock cycle. This is borne out by the latency of the IP, which is 49441 clock cycles for a 128×128 image consisting of 128 × 128 × 3 = 49152 bytes. This is on par with state-of-the-art implementations (which include encoding and decoding), although it could be brought down further with more intelligent techniques.
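The throughput figure can be checked with a few lines of arithmetic. This is only a sketch of the calculation, and the function names are illustrative.

```c
#include <stdint.h>

/* Number of bytes in a width x height image with the given channel count. */
uint32_t image_bytes(uint32_t width, uint32_t height, uint32_t channels)
{
    return width * height * channels;
}

/* Average number of bytes processed per clock cycle. */
double bytes_per_cycle(uint32_t bytes, uint32_t latency_cycles)
{
    return (double)bytes / (double)latency_cycles;
}
```

With the numbers above, `image_bytes(128, 128, 3)` is 49152 and `bytes_per_cycle(49152, 49441)` is a little over 0.994. The remaining 49441 − 49152 = 289 cycles are overhead on top of one cycle per byte.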
Using the AXI interface to transfer the entire image caused Block RAM utilization to rise significantly, because the image is buffered on the FPGA side. To extend this to larger images, I could keep a single 128×128 compression block in the FPGA fabric and have the ARM processor send chunks of the image to it sequentially.
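The chunking idea could look roughly like the sketch below on the processor side. This is an assumption about how it might be done, not something I have implemented: the image is taken to be interleaved RGB, `process_in_tiles` and `tile_fn` are hypothetical names, and `fake_pl_tile` is only a placeholder for the hardware call so that the sketch runs on a PC.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TILE 128                      /* tile edge in pixels, as in the post */
#define CHANNELS 3                    /* assumed interleaved RGB */
#define ROW_BYTES (TILE * CHANNELS)
#define TILE_BYTES (TILE * ROW_BYTES)

/* Hypothetical hook: hand one tile to the PL and overwrite it with the result. */
typedef void (*tile_fn)(uint8_t tile[TILE_BYTES]);

/* Placeholder for the hardware call: simply inverts every byte. */
void fake_pl_tile(uint8_t tile[TILE_BYTES])
{
    for (size_t i = 0; i < TILE_BYTES; i++)
        tile[i] = (uint8_t)(255 - tile[i]);
}

/* Processes img tile by tile, in place. Returns the number of tiles, or -1 if
   a dimension is not a multiple of TILE (padding is left out of this sketch). */
int process_in_tiles(uint8_t *img, size_t width, size_t height, tile_fn fn)
{
    static uint8_t tile[TILE_BYTES];
    int count = 0;

    if (width % TILE != 0 || height % TILE != 0)
        return -1;

    for (size_t ty = 0; ty < height; ty += TILE) {
        for (size_t tx = 0; tx < width; tx += TILE) {
            /* Gather one 128x128 tile into a contiguous buffer. */
            for (size_t y = 0; y < TILE; y++)
                memcpy(&tile[y * ROW_BYTES],
                       &img[((ty + y) * width + tx) * CHANNELS], ROW_BYTES);

            fn(tile);

            /* Scatter the processed tile back into the image. */
            for (size_t y = 0; y < TILE; y++)
                memcpy(&img[((ty + y) * width + tx) * CHANNELS],
                       &tile[y * ROW_BYTES], ROW_BYTES);
            count++;
        }
    }
    return count;
}
```

A 256×384 image, for example, would be handled as 2 × 3 = 6 tiles, so the Block RAM requirement on the FPGA side stays that of a single 128×128 buffer regardless of the image size.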
The SDK code creates a random array of integers (the 128×128×3 image), sends it to the FPGA PL and times how long the compression takes, and then does the same for identical code running on the ARM processor. I was able to achieve a speedup of roughly 770x on hardware vs software. The image is attached below.
Speedup = 38056371/49527 ≈ 768.4 ≈ 770.
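The measurement boils down to two helpers, sketched below with illustrative names. The timer reads are platform-specific and left out, so this is not the SDK code itself: it only shows generating the random test image and computing the ratio from two timer counts.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Fills the buffer with pseudo-random bytes; the same seed gives the same image,
   so the hardware and software runs can be fed identical data. */
void fill_random_image(uint8_t *img, size_t n, unsigned seed)
{
    srand(seed);
    for (size_t i = 0; i < n; i++)
        img[i] = (uint8_t)(rand() & 0xFF);
}

/* Ratio of the software timer count to the hardware timer count. */
double speedup(uint64_t sw_count, uint64_t hw_count)
{
    return (double)sw_count / (double)hw_count;
}
```

Calling `speedup(38056371, 49527)` reproduces the figure above, a little under 768.4.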
This shows that JPEG compression can indeed be implemented effectively on an FPGA. All the code used thus far for this project is available on GitHub. Suggestions and comments are appreciated.
This project was part of an IIT Madras course titled “Mapping DSP Algorithms to Architectures”, co-taught by Prof. Nitin Chandrachoodan of IIT Madras and Prof. CP Ravikumar of Texas Instruments.