DSP Functions on FPGAs

By Dr. Jürg M. Stettbacher, Stettbacher Signal Processing

In terms of size and processing speed, modern FPGAs (Field Programmable Gate Arrays) have reached a level that allows them not only to perform individual mathematical operations but also to accommodate entire signal processing algorithms. At the same time, leading manufacturers have released tools that specifically support the development of digital signal processing algorithms for FPGAs. As a result, a new and interesting platform is becoming established in the world of digital signal processing.

Modern FPGAs

Today's FPGAs contain up to 10 million logic gates. Although this number sounds impressive, it does not actually tell us very much, because the gates are distributed across different functional units. Users therefore work not with individual gates but with logic cells. Such a cell generally contains at least one flip-flop together with configurable logic. In addition, FPGAs can include RAM blocks, hardware multipliers or even complete processor cores.

One example of this class is Xilinx's Virtex-II Pro XC2VP125, which contains, among other things, four integrated PowerPC cores and 556 dedicated 18x18-bit multipliers. Some devices provide more than 1000 configurable I/O pins. Naturally, this state-of-the-art technology comes with a corresponding price tag.

For average users, an extensive range of small and medium-sized devices is now available, which has made choosing the right one considerably more difficult.

At the start of a project it is often unclear exactly what will be required of the FPGA. However, this is no longer a serious problem: the design can largely be implemented in a hardware-independent way, and the chip-specific data is not added until the final stage, when the occupancy of the individual on-chip resources is determined. If necessary, the FPGA can therefore easily be replaced by a different device.

FPGAs for signal processing

Because of their size and the components they contain, FPGAs now offer a wide variety of interesting possibilities in digital signal processing. The difference between the classical solution, a Digital Signal Processor (DSP), and an FPGA implementation is that a DSP is programmed in assembly language or C, whereas FPGA algorithms are described in a hardware description language such as VHDL. While a DSP works through its program more or less sequentially, an FPGA maps the entire algorithm directly onto hardware, so that many operations can execute in parallel.

Unlike a DSP, whose general-purpose arithmetic units are fixed in silicon, an FPGA implements only the arithmetic units the application actually needs, each optimized for its task. This makes FPGA solutions particularly cost-effective and efficient. In the high-end segment, enormous arithmetic power can be packed into a very small area, for example by integrating four DSP units on the same FPGA.

Example application

The brief example below illustrates the FPGA-based DSP design cycle. As part of a demonstration project, a three-band audio equalizer was implemented on an FPGA. An audio codec digitizes the audio signal and passes it to the FPGA, where it runs through the digital equalizer. The processed signal is then returned to the codec and converted back to analog form.

For the sake of clarity, the equalizer algorithm uses half-band filters. Each such filter consists of a digital low-pass (LP) and a digital high-pass (HP) that together split the discrete-time input signal into two sub-bands. The two filters are complementary: the sum of the two sub-bands reproduces the input signal exactly.

In MATLAB, it is possible to calculate this type of filter using just a few instructions:

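> % Example cut-off frequency (assumed value), normalized to the Nyquist frequency:
> w1 = 0.1;
>
> % 2nd-order Butterworth low-pass filter with cut-off frequency w1:
> [G_LP_num, G_LP_den] = butter(2, w1);
>
> % Complementary high-pass filter, H_HP(z) = 1 - H_LP(z):
> G_HP_num = G_LP_den - G_LP_num;
> G_HP_den = G_LP_den;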
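The same filter pair can be reproduced outside MATLAB. The Python sketch below is an illustration, not the author's code: it computes the second-order Butterworth coefficients with the bilinear transform and frequency prewarping (the method MATLAB's `butter` uses for digital designs), forms the complementary high-pass exactly as in the MATLAB lines above, and checks that the two frequency responses add up to one. The helper names `butter2_lowpass` and `freq_response` and the cut-off `w1 = 0.1` are assumptions for the example.

```python
import cmath
import math


def butter2_lowpass(wn):
    """Hypothetical helper: 2nd-order Butterworth low-pass via bilinear transform.

    wn is the cut-off normalized to the Nyquist frequency (0 < wn < 1),
    as in MATLAB's butter(2, wn). Returns (num, den) with den[0] == 1.
    """
    k = math.tan(math.pi * wn / 2)            # prewarped analog cut-off
    norm = 1 + math.sqrt(2) * k + k * k       # normalizes den[0] to 1
    b0 = k * k / norm
    num = [b0, 2 * b0, b0]
    den = [1.0,
           2 * (k * k - 1) / norm,
           (1 - math.sqrt(2) * k + k * k) / norm]
    return num, den


def freq_response(num, den, w):
    """Hypothetical helper: evaluate H(e^jw) with z^-1 = e^-jw."""
    zinv = cmath.exp(-1j * w)
    n = sum(c * zinv ** i for i, c in enumerate(num))
    d = sum(c * zinv ** i for i, c in enumerate(den))
    return n / d


w1 = 0.1                                      # example cut-off (assumed value)
lp_num, lp_den = butter2_lowpass(w1)

# Complementary high-pass: same denominator, numerator = den - num,
# so that H_HP(z) = 1 - H_LP(z).
hp_num = [d - n for d, n in zip(lp_den, lp_num)]
hp_den = lp_den

# |H_LP| at the cut-off frequency is 1/sqrt(2), i.e. -3 dB.
print(f"{abs(freq_response(lp_num, lp_den, math.pi * w1)):.4f}")  # 0.7071

# The sum of both responses should be 1 at every frequency.
for w in (0.0, 0.1 * math.pi, 0.5 * math.pi, 0.9 * math.pi):
    total = freq_response(lp_num, lp_den, w) + freq_response(hp_num, hp_den, w)
    print(f"w = {w:.3f} rad/sample: |H_LP + H_HP| = {abs(total):.6f}")
```

For `w1 = 0.1` the sketch yields approximately `num = [0.0201, 0.0402, 0.0201]` and `den = [1, -1.5610, 0.6414]`, which should agree with `butter(2, 0.1)` in MATLAB.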

Figure 1 shows the entire equalizer with two half-band filter stages.

Each of the three bands is multiplied by a coefficient (K_Low, K_Mid and K_High), and the output signal y[.] is the sum of the three weighted sub-bands. Because each filter stage is complementary, the sub-bands add up to the input: if all three coefficients equal one, the output y[.] is identical to the input x[.]. A coefficient greater than one amplifies the corresponding band; a coefficient less than one attenuates it.
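The band splitting and re-weighting described above can be sketched in a few lines of Python. To keep it short, the sketch makes two assumptions that do not come from the article: each low-pass is a simple first-order IIR stand-in rather than the Butterworth design, and the second stage is assumed to split the high band of the first stage into mid and high (Figure 1 shows the actual topology). The smoothing factors `alpha1` and `alpha2` are hypothetical. Because every high-pass is formed as input minus low-pass, the bands always sum to the input, whichever low-pass is used.

```python
def one_pole_lowpass(x, alpha):
    """First-order IIR low-pass y[n] = y[n-1] + alpha*(x[n] - y[n-1]).

    Stand-in for the article's Butterworth sections (assumption).
    """
    y, state = [], 0.0
    for sample in x:
        state += alpha * (sample - state)
        y.append(state)
    return y


def split(x, alpha):
    """Complementary split: the high band is defined as input minus low band."""
    low = one_pole_lowpass(x, alpha)
    high = [xi - li for xi, li in zip(x, low)]
    return low, high


def equalizer(x, k_low, k_mid, k_high, alpha1=0.05, alpha2=0.4):
    """Three-band equalizer built from two complementary split stages.

    alpha1 < alpha2 (hypothetical values), so stage 2 splits at a higher
    crossover frequency than stage 1.
    """
    low, rest = split(x, alpha1)        # stage 1: low band / everything above
    mid, high = split(rest, alpha2)     # stage 2 (assumed topology): mid / high
    return [k_low * lo + k_mid * mi + k_high * hi
            for lo, mi, hi in zip(low, mid, high)]


x = [1.0, 0.0, 0.0, 0.5, -0.25, 0.0, 0.75, -1.0]   # arbitrary test signal
flat = equalizer(x, 1.0, 1.0, 1.0)
print(max(abs(a - b) for a, b in zip(flat, x)))    # ~0: y equals x up to rounding
bass_boost = equalizer(x, 2.0, 1.0, 1.0)           # low band amplified
```

With `K_Low = 2` and the other gains at one, the output is simply the input plus one extra copy of the low band, which is exactly what a bass boost should do.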

Tool chain

The equalizer was developed, simulated and verified using MATLAB and Simulink. Audio files (for example in WAV format) are read into the simulation environment, passed through the algorithm and then played back. This does not happen in real time, but it is fast enough to make optimizing the algorithm easy.
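The offline workflow described above (read a WAV file, run it through the algorithm, write the result for playback) can be sketched with Python's standard `wave` module. This is an illustration, not the author's MATLAB script: it assumes 16-bit PCM mono files, and the function names, including the `process` placeholder standing in for the equalizer under test, are hypothetical.

```python
import array
import sys
import wave


def read_wav_mono16(path):
    """Read a 16-bit PCM mono WAV file; return (samples in [-1, 1), sample rate)."""
    with wave.open(path, "rb") as f:
        if f.getsampwidth() != 2 or f.getnchannels() != 1:
            raise ValueError("this sketch supports 16-bit mono PCM only")
        rate = f.getframerate()
        raw = array.array("h", f.readframes(f.getnframes()))
    if sys.byteorder == "big":
        raw.byteswap()                    # WAV sample data is little-endian
    return [s / 32768.0 for s in raw], rate


def write_wav_mono16(path, samples, rate):
    """Write samples in [-1, 1) as 16-bit PCM mono, clipping out-of-range values."""
    ints = array.array(
        "h", (max(-32768, min(32767, round(s * 32768))) for s in samples))
    if sys.byteorder == "big":
        ints.byteswap()
    with wave.open(path, "wb") as f:
        f.setnchannels(1)
        f.setsampwidth(2)
        f.setframerate(rate)
        f.writeframes(ints.tobytes())


def process(samples):
    """Hypothetical placeholder for the algorithm under test (halves the amplitude)."""
    return [0.5 * s for s in samples]
```

Usage: `x, fs = read_wav_mono16("input.wav")`, then `write_wav_mono16("output.wav", process(x), fs)`, and play `output.wav` in any audio player (file names are examples).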

The next step marks the start of the actual FPGA design phase: verifying that the algorithm still meets its requirements when all calculations are performed with bit-accurate fixed-point arithmetic. For this purpose, Xilinx offers a block set for Simulink. Its blocks calculate with bit accuracy, and their number representation, word width, and overflow and rounding behavior, among other properties, are configurable.

Note that adding two fixed-point numbers can produce one additional bit, and multiplication can even produce a result almost twice as long as its operands. The word width therefore tends to grow as the algorithm progresses. Since the final result should be available in, for example, 16-bit form, the numbers must be truncated or rounded appropriately during processing. Deciding where and how to do this requires care and engineering intuition on the part of the designer. Thanks to its flexibility, the Xilinx block set is ideally suited to this task.
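The word-growth rules can be checked with ordinary integer arithmetic. In the Python sketch below (all helper names are hypothetical), 16-bit two's complement values are held as integers: the sum of two 16-bit numbers needs up to 17 bits, and their product up to 32 bits, although only the single case (-2^15) * (-2^15) actually needs all 32. The last lines multiply 0.3 by itself in a 16-bit fractional format with 15 fraction bits and shorten the product back to 16 bits, once by truncation and once by rounding.

```python
# Hypothetical helpers for illustrating fixed-point word growth.

def bits_needed(value):
    """Smallest two's complement word width that can hold value."""
    width = 1
    while not -(1 << (width - 1)) <= value <= (1 << (width - 1)) - 1:
        width += 1
    return width


def truncate(value, drop_bits):
    """Discard the drop_bits least significant bits (rounds toward -infinity)."""
    return value >> drop_bits


def round_half_up(value, drop_bits):
    """Discard drop_bits LSBs, rounding to nearest (ties toward +infinity)."""
    return (value + (1 << (drop_bits - 1))) >> drop_bits


MIN16, MAX16 = -(1 << 15), (1 << 15) - 1

print(bits_needed(MAX16 + MAX16))     # 17: one extra bit for the sum
print(bits_needed(MAX16 * MAX16))     # 31
print(bits_needed(MIN16 * MIN16))     # 32: only this product needs all 32 bits

# 0.3 * 0.3 with 15 fraction bits, then back to 15 fraction bits.
a = round(0.3 * (1 << 15))            # 9830
p = a * a                             # product has 30 fraction bits
print(truncate(p, 15), round_half_up(p, 15))   # 2948 2949
```

The exact value is 0.09 * 2^15 = 2949.12, so rounding (2949) lands closer than truncation (2948); truncation always rounds down, which introduces a small negative bias.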

In principle, Simulink with the Xilinx block set could also be used to develop the algorithm itself. In practice, however, simulation with Xilinx blocks has proved to be slower than with MATLAB because of the bit-accurate calculations.

At the end of this step, the digital equalizer exists as a Simulink model. Because the description is bit-accurate, the model behaves exactly as the design will later behave on the FPGA.

Generation of VHDL code

The second important feature of the Xilinx blocks now comes into play: they can be converted directly into VHDL. This task is performed by the Xilinx System Generator, which converts not just individual blocks but the entire Simulink model and generates an FPGA project folder from it. This folder can then be opened and processed further with the FPGA development software. The complete tool chain is illustrated in Figure 2.

It should be noted that the System Generator is only able to convert blocks from the Xilinx block set.

Figure 2: FPGA development tool chain. Top left: development, simulation, verification and optimization of the algorithm. Bottom left: development, simulation, verification and optimization of the FPGA design.

Figure 3 shows a half-band filter built from Xilinx blocks. It contains the delay networks typical of IIR filters (at the far left and far right) together with five multipliers and four adders. The System Generator later recognizes the multipliers and maps them to the hardware multipliers available on the selected chip. Double-clicking a block opens its configuration menu.

Figure 3: IIR filter in Simulink.
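The layout described for Figure 3 (delay lines at both ends, five multipliers, four adders) matches a second-order IIR section in direct form I. The Python sketch below mirrors that data flow one sample at a time; it is a floating-point illustration with a hypothetical class name, not the hardware generated by the System Generator. The coefficients are assumed values: those of a second-order Butterworth low-pass with a cut-off of 0.1 times the Nyquist frequency, rounded to four decimals. The complementary high-pass output is formed as input minus low-pass output, which in hardware would cost one additional subtractor.

```python
class BiquadDF1:
    """Second-order IIR section in direct form I (hypothetical class name).

    y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2],
    with den[0] assumed to be normalized to 1.
    """

    def __init__(self, num, den):
        self.b0, self.b1, self.b2 = num
        _, self.a1, self.a2 = den
        self.x1 = self.x2 = 0.0       # input-side delay line (left in Figure 3)
        self.y1 = self.y2 = 0.0       # output-side delay line (right in Figure 3)

    def step(self, x):
        # Five multipliers ...
        p0, p1, p2 = self.b0 * x, self.b1 * self.x1, self.b2 * self.x2
        q1, q2 = self.a1 * self.y1, self.a2 * self.y2
        # ... and four adders/subtractors.
        y = (((p0 + p1) + p2) - q1) - q2
        # Advance both delay lines by one sample.
        self.x2, self.x1 = self.x1, x
        self.y2, self.y1 = self.y1, y
        return y


# Assumed coefficients: Butterworth low-pass, cut-off 0.1 (rounded).
lp = BiquadDF1([0.0201, 0.0402, 0.0201], [1.0, -1.5610, 0.6414])
x = [1.0] * 200                               # step input
low = [lp.step(s) for s in x]
high = [s - lo for s, lo in zip(x, low)]      # complementary high-pass output
print(low[-1], high[-1])
```

With a step input, the low-pass output settles to 1 (unity gain at DC) and the high-pass output to 0.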

Figure 4 shows the configuration menu of a multiplier. The top four boxes set the number representation and word width. In signal processing applications, numbers are normally represented in two's complement fractional format: the binary point sits immediately to the right of the leftmost (sign) bit, so values lie in the range from -1 to just under +1. The next two boxes control how the result is limited at its left (most significant) and right (least significant) ends. On the left, overflow can be handled either by wrapping or by saturation; on the right, several rounding methods are available. All other parameters relate to VHDL generation.

Figure 4: Multiplier configuration menu.
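The fraction format and the two overflow options can be illustrated with integer arithmetic. In the Python sketch below (helper names are hypothetical, and the option names in the Xilinx menu may differ), a 16-bit word holds a sign bit and 15 fraction bits, often called Q15 or 1.15 format, so it covers [-1, 1) in steps of 2^-15. Adding 0.75 and 0.5 overflows: wrapping produces a large error with the wrong sign, whereas saturation only clips to the largest representable value.

```python
# Hypothetical helpers; menu option names in the Xilinx tools may differ.
FRAC_BITS = 15                        # 16-bit word: sign bit + 15 fraction bits
MIN_Q15, MAX_Q15 = -(1 << 15), (1 << 15) - 1


def to_q15(x):
    """Quantize a real number in [-1, 1) to the nearest Q15 integer code."""
    return round(x * (1 << FRAC_BITS))


def from_q15(code):
    """Interpret a 16-bit two's complement code as a fraction."""
    return code / (1 << FRAC_BITS)


def wrap16(code):
    """Overflow mode 'wrap': keep only the 16 LSBs (modular arithmetic)."""
    code &= 0xFFFF
    return code - 0x10000 if code & 0x8000 else code


def saturate16(code):
    """Overflow mode 'saturate': clamp to the largest/smallest 16-bit value."""
    return max(MIN_Q15, min(MAX_Q15, code))


a, b = to_q15(0.75), to_q15(0.5)      # 24576 and 16384
s = a + b                             # 40960: 1.25 does not fit into [-1, 1)
print(from_q15(wrap16(s)))            # -0.75: wrap-around flips the sign
print(from_q15(saturate16(s)))        # 0.999969482421875: clipped to maximum
```

In audio, a wrapped overflow is heard as a loud click, while saturation behaves like mild clipping, which is why saturation is the usual choice at critical points in a filter.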

In principle, all that remains is to assign the physical pins and the clock to the algorithm. In this example, the serial interface to the codec was implemented separately in VHDL, and the corresponding files are added to the design. The FPGA development software then generates the finished FPGA configuration, which can, for example, be loaded onto an evaluation board via JTAG and executed immediately.
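The article does not describe the codec's serial protocol, so the following is only a generic sketch: a Python model of the shift registers that such an interface typically uses to convert between a serial bit stream and parallel 16-bit samples. The MSB-first, 16-bit, two's complement word format and the function names are assumptions; a real codec interface also needs bit-clock and frame timing, which are omitted here.

```python
def shift_in(bits, width=16):
    """Serial-to-parallel conversion, modeled as a shift register.

    bits: iterable of 0/1 values, most significant bit first (assumed
    word format). Returns the word as a two's complement integer.
    """
    reg = 0
    for bit in bits:
        reg = ((reg << 1) | bit) & ((1 << width) - 1)   # shift left, insert bit
    if reg & (1 << (width - 1)):                         # sign bit set?
        reg -= 1 << width
    return reg


def shift_out(value, width=16):
    """Parallel-to-serial conversion: emit a two's complement word MSB first."""
    value &= (1 << width) - 1
    return [(value >> i) & 1 for i in range(width - 1, -1, -1)]


print(shift_in(shift_out(-1234)))     # -1234: round trip through the "wire"
```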

Summary

Table 1 shows the equalizer's resource usage on a Xilinx XC2V1000-4FG456C FPGA. It makes clear that the selected FPGA is far larger than a single equalizer requires. Flip-flops were used mainly for the serial interface and for shift registers, while the look-up tables mostly implement the additions in the filters. The design is fully parallel, so an output value is calculated in every FPGA clock cycle; the solution is thus optimized for processing speed. The price is that the equalizer occupies 20 of the 40 available 18x18-bit multipliers on the chip. The filters could have been implemented more economically in sequential form, but this would have required somewhat more development time and would have reduced the maximum achievable data rate.

Resource	Available	Used	Occupancy
Flip-flops	10240	327	3%
Look-up tables	10240	1141	11%
I/O pins	324	11	3%
Multipliers	40	20	50%
Clock lines	16	2	12%
Data rate	10 MHz (max.)	18 kHz	0.2%

Table 1: The equalizer's occupancy of the XC2V1000

The data rate in particular shows the phenomenal arithmetic performance of today's FPGAs. The equalizer presented here processes 18,000 samples per second. With approximately 80 internal logic levels in the signal path, the FPGA reaches its limit at slightly more than 10 MHz. Since 10 MHz / 18 kHz is roughly 555, a suitably adapted version of our equalizer would be able to handle more than 500 audio channels.
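The percentages in Table 1 and the channel estimate are simple ratios, recomputed in the short Python sketch below. It takes the resource counts from Table 1 and the 18,000 samples per second from the text, and assumes a maximum data rate of exactly 10 MHz (the text says slightly more), so the resulting channel count is a lower bound. It also assumes, as the text implies, that the hardware could be shared among channels at the full data rate.

```python
# Resource counts from Table 1: (available, used).
resources = {
    "Flip-flops": (10240, 327),
    "Look-up tables": (10240, 1141),
    "I/O pins": (324, 11),
    "Multipliers": (40, 20),
    "Clock lines": (16, 2),
}
for name, (available, used) in resources.items():
    print(f"{name:15s} {100 * used / available:5.1f} %")

max_rate = 10e6        # Hz, assumed to be exactly 10 MHz (text: slightly more)
sample_rate = 18e3     # Hz, samples processed per second by the equalizer
print(f"Data rate occupancy: {100 * sample_rate / max_rate:.2f} %")
print(int(max_rate // sample_rate))   # 555 channels at full load
```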

Published 2004