HDL Serial Architectures for FIR Filters

This example shows how to generate HDL code for a symmetric lowpass FIR filter, intended for an audio filtering application, using fully parallel, fully serial, partly serial, and cascade-serial architectures.

Design the Filter

Use an audio sampling rate of 44.1 kHz, a passband edge frequency of 8.0 kHz, and a stopband edge frequency of 8.8 kHz. Set the allowable peak-to-peak passband ripple to 1 dB and the stopband attenuation to 90 dB. Then, design the filter using fdesign.lowpass, and create the FIR filter System object™ using the 'equiripple' method with the 'Direct form symmetric' structure.

Fs           = 44.1e3;         % Sampling Frequency in Hz
Fpass        = 8e3;            % Passband Frequency in Hz
Fstop        = 8.8e3;          % Stopband Frequency in Hz
Apass        = 1;              % Passband Ripple in dB
Astop        = 90;             % Stopband Attenuation in dB

fdes = fdesign.lowpass('Fp,Fst,Ap,Ast',...
    Fpass, Fstop, Apass, Astop, Fs);
lpFilter = design(fdes,'equiripple', 'FilterStructure', 'dfsymfir', ...
    'SystemObject', true);

Quantize the Filter

Assume that the input to the audio filter comes from a 12-bit ADC and that the output drives a 12-bit DAC.

nt_in = numerictype(1,12,11);
nt_out = nt_in;
lpFilter.FullPrecisionOverride = false;
lpFilter.CoefficientsDataType = 'Custom';
lpFilter.CustomCoefficientsDataType = numerictype(1,16,16);
lpFilter.OutputDataType = 'Custom';
lpFilter.CustomOutputDataType = nt_out;

% Check the response with fvtool.
fvtool(lpFilter,'Fs',Fs, 'Arithmetic', 'fixed');

Generate Fully Parallel HDL Code from the Quantized Filter

Starting with the quantized filter, generate VHDL® or Verilog® code; this example selects VHDL. Create a temporary work directory for the generated files. After generating the HDL code, open the generated VHDL file in the editor by clicking the link displayed in the command-line messages.

The default settings generate a fully parallel architecture. This architecture uses a dedicated multiplier for each filter tap in the direct form FIR structure, and one multiplier for every two symmetric taps in the symmetric FIR structure. The result is a large chip area (78 multipliers in this example). You can implement the filter in a variety of serial architectures to obtain the desired speed/area trade-off. Later sections of this example show these architecture options.
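The multiplier count quoted above follows from simple arithmetic. The sketch below assumes the total of 156 coefficients that hdlfilterserialinfo reports for this filter later in this example; the variable names are illustrative only.

```matlab
% Multiplier count for a fully parallel FIR filter (plain arithmetic).
nCoeffs = 156;                    % total coefficients reported for this filter

% Direct form: one multiplier per tap.
multDirect = nCoeffs;

% Direct form symmetric: each pair of mirrored taps shares one multiplier.
% An odd-length filter would keep one extra multiplier for the center tap.
multSymmetric = ceil(nCoeffs/2);

fprintf('Direct form: %d multipliers, symmetric: %d multipliers\n', ...
        multDirect, multSymmetric);
```

The symmetric count matches the 78 multipliers stated for the fully parallel architecture.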

workingdir = tempname;
% fully parallel (default)
generatehdl(lpFilter, 'Name', 'fullyparallel', ...
            'TargetLanguage', 'VHDL', ...
            'TargetDirectory', workingdir, ...
            'InputDataType', nt_in);

### Starting VHDL code generation process for filter: fullyparallel
### Generating: /tmp/Bdoc24a_2528353_289974/tp89139833_2d76_45c7_a246_77199ce362b4/fullyparallel.vhd
### Starting generation of fullyparallel VHDL entity
### Starting generation of fullyparallel VHDL architecture
### Successful completion of VHDL code generation process for filter: fullyparallel
### HDL latency is 2 samples

Generate a Test Bench from the Quantized Filter

Generate a VHDL test bench to verify that the HDL results match the MATLAB® results exactly. You can compile and simulate the generated VHDL code and VHDL test bench using an HDL simulator.

Generate DTMF tones to use as test stimulus for the filter. A DTMF signal consists of the sum of two sinusoids, or tones, with frequencies taken from two mutually exclusive groups. Each pair of tones contains one frequency from the low group (697 Hz, 770 Hz, 852 Hz, 941 Hz) and one frequency from the high group (1209 Hz, 1336 Hz, 1477 Hz), and represents a unique symbol. This code generates all the DTMF signals and uses only one of them (digit 1 here) as test stimulus. This choice keeps the length of the test stimulus to a reasonable limit.

symbol = {'1','2','3','4','5','6','7','8','9','*','0','#'};

lfg = [697 770 852 941]; % Low frequency group
hfg = [1209 1336 1477];  % High frequency group

% Generate a matrix containing all possible combinations of high and low 
% frequencies, where each column represents one combination.
f  = zeros(2,12);
for c=1:4
    for r=1:3
        f(:,3*(c-1)+r) = [lfg(c); hfg(r)];
    end
end

Next, generate the DTMF tones.

Fs  = 8000;       % Tone sampling frequency 8 kHz (overwrites the 44.1 kHz Fs above)
N = 800;          % Tones of 100 ms
t   = (0:N-1)/Fs; % 800 samples at Fs
pit = 2*pi*t;

tones = zeros(N,size(f,2));
for toneChoice=1:12
    % Generate tone
    tones(:,toneChoice) = sum(sin(f(:,toneChoice)*pit))';
end

% Taking the tone for digit '1' for test stimulus.
userstim = tones(:,1);

generatehdl(lpFilter, 'Name', 'fullyparallel',...
            'GenerateHDLTestbench','on', ...
            'TestBenchUserStimulus', userstim,...
            'TargetLanguage', 'VHDL',...
            'TargetDirectory', workingdir, ...
            'InputDataType', nt_in);

### Starting VHDL code generation process for filter: fullyparallel
### Generating: /tmp/Bdoc24a_2528353_289974/tp89139833_2d76_45c7_a246_77199ce362b4/fullyparallel.vhd
### Starting generation of fullyparallel VHDL entity
### Starting generation of fullyparallel VHDL architecture
### Successful completion of VHDL code generation process for filter: fullyparallel
### HDL latency is 2 samples
### Starting generation of VHDL Test Bench.
### Generating input stimulus
### Done generating input stimulus; length 800 samples.

Warning: Wrap on overflow detected. This originated from 'DiscreteFir'
Suggested Actions:
    • Suppress future instances of this diagnostic from this source. - Suppress

### Generating Test bench: /tmp/Bdoc24a_2528353_289974/tp89139833_2d76_45c7_a246_77199ce362b4/fullyparallel_tb.vhd
### Creating stimulus vectors ...
### Done generating VHDL Test Bench.
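One plausible contributor to the wrap-on-overflow warning in the log above is the stimulus range: the sum of two unit-amplitude sinusoids can approach ±2, while a signed 12-bit word with 11 fractional bits only represents values from -1 to 1 - 2^-11. The sketch below checks this with plain arithmetic. The names FsTone and scaledStim and the scaling step are assumptions for illustration, not part of the original example.

```matlab
% Regenerate the digit '1' stimulus (697 Hz + 1209 Hz) as in the code above.
FsTone   = 8000;                  % tone sampling rate in Hz
N        = 800;                   % 100 ms of samples
t        = (0:N-1)/FsTone;
userstim = (sin(2*pi*697*t) + sin(2*pi*1209*t))';

% Representable range of a signed 12-bit word with 11 fractional bits.
lo = -1;
hi = 1 - 2^-11;

peak = max(abs(userstim));
fprintf('Stimulus peak = %.3f, representable range = [%g, %g]\n', peak, lo, hi);

% Hypothetical remedy: scale so that the worst case (|x| = 2) still fits.
scaledStim = userstim * (hi/2);
fprintf('Scaled peak   = %.3f\n', max(abs(scaledStim)));
```

If the scaled stimulus were used instead, the test bench vectors would differ from the ones generated above, so treat this only as a way to reason about the warning.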

Information Regarding Serial Architectures

Serial architectures share hardware resources at the expense of increasing the clock rate relative to the sample rate. In FIR filters, each serial partition shares one multiplier among its inputs. This sharing increases the clock rate by a factor known as the folding factor.

You can use the hdlfilterserialinfo function to report the total, zero-valued, symmetric (or antisymmetric), and effective coefficient counts of the filter. This function also displays an exhaustive table of possible values for the SerialPartition property, with the corresponding folding factor and number of multipliers.

hdlfilterserialinfo(lpFilter, 'InputDataType', nt_in);

   | Total Coefficients | Zeros | A/Symm | Effective |
   ---------------------------------------------------
   |         156        |   0   |   78   |     78    |

Effective filter length for SerialPartition value is 78.

  Table of 'SerialPartition' values with corresponding values of 
  folding factor and number of multipliers for the given filter.

   | Folding Factor | Multipliers |   SerialPartition   |
   ------------------------------------------------------
   |        1       |      78     |ones(1,78)           |
   |        2       |      39     |ones(1,39)*2         |
   |        3       |      26     |ones(1,26)*3         |
   |        4       |      20     |[ones(1,19)*4, 2]    |
   |        5       |      16     |[ones(1,15)*5, 3]    |
   |        6       |      13     |ones(1,13)*6         |
   |        7       |      12     |[ones(1,11)*7, 1]    |
   |        8       |      10     |[ones(1,9)*8, 6]     |
   |        9       |      9      |[9 9 9 9 9 9 9 9 6]  |
   |       10       |      8      |[ones(1,7)*10, 8]    |
   |       11       |      8      |[ones(1,7)*11, 1]    |
   |       12       |      7      |[ones(1,6)*12, 6]    |
   |       13       |      6      |[13 13 13 13 13 13]  |
   |       14       |      6      |[14 14 14 14 14 8]   |
   |       15       |      6      |[15 15 15 15 15 3]   |
   |       16       |      5      |[16 16 16 16 14]     |
   |       17       |      5      |[17 17 17 17 10]     |
   |       18       |      5      |[18 18 18 18 6]      |
   |       19       |      5      |[19 19 19 19 2]      |
   |       20       |      4      |[20 20 20 18]        |
   |       21       |      4      |[21 21 21 15]        |
   |       22       |      4      |[22 22 22 12]        |
   |       23       |      4      |[23 23 23 9]         |
   |       24       |      4      |[24 24 24 6]         |
   |       25       |      4      |[25 25 25 3]         |
   |       26       |      3      |[26 26 26]           |
   |       27       |      3      |[27 27 24]           |
   |       28       |      3      |[28 28 22]           |
   |       29       |      3      |[29 29 20]           |
   |       30       |      3      |[30 30 18]           |
   |       31       |      3      |[31 31 16]           |
   |       32       |      3      |[32 32 14]           |
   |       33       |      3      |[33 33 12]           |
   |       34       |      3      |[34 34 10]           |
   |       35       |      3      |[35 35 8]            |
   |       36       |      3      |[36 36 6]            |
   |       37       |      3      |[37 37 4]            |
   |       38       |      3      |[38 38 2]            |
   |       39       |      2      |[39 39]              |
   |       40       |      2      |[40 38]              |
   |       41       |      2      |[41 37]              |
   |       42       |      2      |[42 36]              |
   |       43       |      2      |[43 35]              |
   |       44       |      2      |[44 34]              |
   |       45       |      2      |[45 33]              |
   |       46       |      2      |[46 32]              |
   |       47       |      2      |[47 31]              |
   |       48       |      2      |[48 30]              |
   |       49       |      2      |[49 29]              |
   |       50       |      2      |[50 28]              |
   |       51       |      2      |[51 27]              |
   |       52       |      2      |[52 26]              |
   |       53       |      2      |[53 25]              |
   |       54       |      2      |[54 24]              |
   |       55       |      2      |[55 23]              |
   |       56       |      2      |[56 22]              |
   |       57       |      2      |[57 21]              |
   |       58       |      2      |[58 20]              |
   |       59       |      2      |[59 19]              |
   |       60       |      2      |[60 18]              |
   |       61       |      2      |[61 17]              |
   |       62       |      2      |[62 16]              |
   |       63       |      2      |[63 15]              |
   |       64       |      2      |[64 14]              |
   |       65       |      2      |[65 13]              |
   |       66       |      2      |[66 12]              |
   |       67       |      2      |[67 11]              |
   |       68       |      2      |[68 10]              |
   |       69       |      2      |[69 9]               |
   |       70       |      2      |[70 8]               |
   |       71       |      2      |[71 7]               |
   |       72       |      2      |[72 6]               |
   |       73       |      2      |[73 5]               |
   |       74       |      2      |[74 4]               |
   |       75       |      2      |[75 3]               |
   |       76       |      2      |[76 2]               |
   |       77       |      2      |[77 1]               |
   |       78       |      1      |[78]                 |
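The rows of the table above appear to follow a simple pattern: fill as many partitions as possible with exactly the folding factor, then place the remaining inputs in one shorter partition. The sketch below reproduces a few rows under that assumption. It is plain arithmetic and does not claim to be how hdlfilterserialinfo is implemented.

```matlab
effLen = 78;                      % effective filter length reported above
ffList = [4 9 20 45 78];          % a few folding factors from the table
parts  = cell(size(ffList));

for k = 1:numel(ffList)
    ff    = ffList(k);
    nFull = floor(effLen/ff);     % partitions that serve exactly ff inputs
    rest  = effLen - nFull*ff;    % inputs left for a final, shorter partition
    spart = ff*ones(1, nFull);
    if rest > 0
        spart(end+1) = rest;
    end
    parts{k} = spart;
    % One multiplier per serial partition.
    fprintf('Folding factor %2d, multipliers %2d, partition [%s]\n', ...
            ff, numel(spart), num2str(spart));
end
```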

You can use the optional properties 'Multipliers' and 'FoldingFactor' to display the information for a specific number of multipliers or a specific folding factor.

hdlfilterserialinfo(lpFilter, 'Multipliers', 4, ...
                    'InputDataType', nt_in);

Serial Partition: [20 20 20 18], Folding Factor:   20, Multipliers:    4

hdlfilterserialinfo(lpFilter, 'Foldingfactor', 6, ...
                    'InputDataType', nt_in);

Serial Partition: ones(1,13)*6, Folding Factor:    6, Multipliers:   13

Fully Serial Architecture

In the fully serial architecture, instead of a dedicated multiplier for each tap, the input sample for each tap is selected serially and multiplied by the corresponding coefficient. For symmetric and antisymmetric structures, the input samples corresponding to each pair of symmetric taps are pre-added (symmetric) or pre-subtracted (antisymmetric) before multiplication by the corresponding coefficient. The products are accumulated sequentially in a register, and the final result is stored in a register before the next set of input samples arrives. This implementation needs a clock rate that exceeds the input sample rate by a factor equal to the number of products to compute. The required chip area drops because the implementation uses just one multiplier plus a few additional logic elements such as multiplexers and registers. For this example, the clock rate is 78 times the input sample rate (a folding factor of 78), which equals 3.4398 MHz.
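The pre-add and serial multiply-accumulate scheme described above can be sketched as a floating-point behavioral model. The coefficients b and the input x below are hypothetical small examples, not the audio filter from this example, and the model ignores fixed-point effects and pipeline latency.

```matlab
% Behavioral sketch of a fully serial symmetric FIR filter.
b = [1 3 5 5 3 1]/18;             % hypothetical even-length symmetric coefficients
x = sin(2*pi*0.05*(0:39));        % hypothetical input samples
nTaps = numel(b);
nHalf = nTaps/2;                  % products per output sample (folding factor)

delayLine = zeros(1, nTaps);      % delayLine(1) holds the newest sample
y = zeros(size(x));
for n = 1:numel(x)
    delayLine = [x(n), delayLine(1:end-1)];
    acc = 0;                      % accumulator cleared for each input sample
    for k = 1:nHalf               % one multiply-accumulate per fast clock cycle
        preAdd = delayLine(k) + delayLine(nTaps + 1 - k);
        acc = acc + b(k)*preAdd;  % the single shared multiplier
    end
    y(n) = acc;                   % result registered before the next sample
end

maxErr = max(abs(y - filter(b, 1, x)));
fprintf('Max difference from filter(): %g\n', maxErr);

% Clock rate for the audio filter in this example: 78 products per sample.
fClk = 78 * 44.1e3;
fprintf('Fully serial clock rate: %.4f MHz\n', fClk/1e6);
```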

To implement the fully serial architecture, use the hdlfilterserialinfo function and set the 'Multipliers' property to 1. Alternatively, set the 'SerialPartition' property to the effective filter length, which in this case is 78. The function also returns the folding factor and the number of multipliers used for that serial partition setting.

[spart, foldingfact, nMults] = hdlfilterserialinfo(lpFilter, 'Multipliers', 1, ...
                                    'InputDataType', nt_in); %#ok<ASGLU>
                      
generatehdl(lpFilter,'Name', 'fullyserial', ...
           'SerialPartition', spart, ...
           'TargetLanguage', 'VHDL', ...
           'TargetDirectory', workingdir, ...
           'InputDataType', nt_in);

### Starting VHDL code generation process for filter: fullyserial
### Generating: /tmp/Bdoc24a_2528353_289974/tp89139833_2d76_45c7_a246_77199ce362b4/fullyserial.vhd
### Starting generation of fullyserial VHDL entity
### Starting generation of fullyserial VHDL architecture
### Clock rate is 78 times the input sample rate for this architecture.
### Successful completion of VHDL code generation process for filter: fullyserial
### HDL latency is 3 samples

Generate the test bench in the same way as in the fully parallel case. It is important to generate a new test bench for each architecture implementation.

Partly Serial Architecture

Fully parallel and fully serial represent the two extremes of implementation. The fully serial architecture occupies very little area but inherently needs a faster clock rate. The fully parallel architecture takes a lot of chip area but needs no increase in clock rate. The partly serial architecture covers all the cases that lie between these two extremes.

The input taps are divided into sets. Each set is processed by a serial partition consisting of a multiply-accumulate circuit and a multiplexer. The serial partitions operate in parallel with respect to each other, but each one processes its own taps sequentially and accumulates the result for the taps it serves. Finally, adders sum the results of all serial partitions.
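As a rough illustration of this scheme, the floating-point sketch below splits the pre-added inputs of a small symmetric filter into serial partitions, lets each partition accumulate its own products over successive clock cycles, and sums the partition results at the end. The coefficients, the input, and the partition vector [2 2 1] are hypothetical values chosen for brevity.

```matlab
% Behavioral sketch of a partly serial symmetric FIR filter.
b = [1 2 3 4 5 5 4 3 2 1]/30;     % hypothetical symmetric coefficients
x = cos(2*pi*0.03*(0:49));        % hypothetical input samples
spart = [2 2 1];                  % inputs served by each serial partition
nTaps = numel(b);
assert(sum(spart) == nTaps/2);    % partitions must cover the effective length
firstIdx = cumsum([1, spart(1:end-1)]);  % first input index of each partition

delayLine = zeros(1, nTaps);
y = zeros(size(x));
for n = 1:numel(x)
    delayLine = [x(n), delayLine(1:end-1)];
    preAdd = delayLine(1:nTaps/2) + delayLine(nTaps:-1:nTaps/2+1);
    acc = zeros(size(spart));     % one accumulator per partition
    for cycle = 1:max(spart)      % fast clock cycles per input sample
        for p = 1:numel(spart)    % partitions run concurrently in hardware
            if cycle <= spart(p)
                k = firstIdx(p) + cycle - 1;
                acc(p) = acc(p) + b(k)*preAdd(k);
            end
        end
    end
    y(n) = sum(acc);              % final adders combine the partition results
end

maxErr = max(abs(y - filter(b, 1, x)));
fprintf('Max difference from filter(): %g\n', maxErr);
```

The outer cycle loop runs max(spart) times per input sample, which is why the largest partition sets the clock rate.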

Partly Serial Architecture for Resource Constraint

Assume that you want to implement this filter on an FPGA that has only 4 multipliers available for the filter. You can implement the filter using 4 serial partitions, each using one multiply-accumulate circuit.

hdlfilterserialinfo(lpFilter, 'Multipliers', 4, ...
                    'InputDataType', nt_in);

Serial Partition: [20 20 20 18], Folding Factor:   20, Multipliers:    4

The numbers of input taps processed by these serial partitions are [20 20 20 18]. Specify SerialPartition with this vector to indicate the decomposition of taps into serial partitions. The largest element of this vector determines the clock rate. In this case, the clock rate is 20 times the input sample rate, or 0.882 MHz.

[spart, foldingfact, nMults] = hdlfilterserialinfo(lpFilter, 'Multipliers', 4, ...
                                    'InputDataType', nt_in);

generatehdl(lpFilter,'Name', 'partlyserial1',...
            'SerialPartition', spart,...
            'TargetLanguage', 'VHDL', ...
            'TargetDirectory', workingdir, ...
            'InputDataType', nt_in);

### Starting VHDL code generation process for filter: partlyserial1
### Generating: /tmp/Bdoc24a_2528353_289974/tp89139833_2d76_45c7_a246_77199ce362b4/partlyserial1.vhd
### Starting generation of partlyserial1 VHDL entity
### Starting generation of partlyserial1 VHDL architecture
### Clock rate is 20 times the input sample rate for this architecture.
### Successful completion of VHDL code generation process for filter: partlyserial1
### HDL latency is 3 samples

Partly Serial Architecture for Speed Constraint

Assume that the filter implementation has a clock rate constraint and that the maximum clock frequency is 2 MHz. Because 2 MHz / 44.1 kHz is approximately 45.35, the clock rate cannot be more than 45 times the input sample rate. For this design constraint, specify the 'SerialPartition' as [45 33]. Note that this setting requires a second serial partition, implying additional circuitry to multiply-accumulate 33 taps. You can obtain the 'SerialPartition' value using hdlfilterserialinfo and its 'Foldingfactor' property as follows.
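The folding factor limit and the resulting partition can be worked out by hand. The sketch below is plain arithmetic with illustrative variable names; the two-partition split shown here only works because the effective length does not exceed twice the maximum folding factor.

```matlab
FsIn    = 44.1e3;                 % input sample rate in Hz
fClkMax = 2e6;                    % maximum clock frequency in Hz
effLen  = 78;                     % effective filter length

maxFold = floor(fClkMax/FsIn);    % largest folding factor that meets the limit
spart   = [maxFold, effLen - maxFold];   % valid because effLen <= 2*maxFold
fClk    = max(spart)*FsIn;        % resulting clock rate in Hz

fprintf('Max folding factor %d, partition [%s], clock %.4f MHz\n', ...
        maxFold, num2str(spart), fClk/1e6);
```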

spart = hdlfilterserialinfo(lpFilter, 'Foldingfactor', 45, ...
                            'InputDataType', nt_in);

generatehdl(lpFilter,'Name', 'partlyserial2', ...
            'SerialPartition', spart,...
            'TargetLanguage', 'VHDL',...
            'TargetDirectory', workingdir, ...
            'InputDataType', nt_in);

### Starting VHDL code generation process for filter: partlyserial2
### Generating: /tmp/Bdoc24a_2528353_289974/tp89139833_2d76_45c7_a246_77199ce362b4/partlyserial2.vhd
### Starting generation of partlyserial2 VHDL entity
### Starting generation of partlyserial2 VHDL architecture
### Clock rate is 45 times the input sample rate for this architecture.
### Successful completion of VHDL code generation process for filter: partlyserial2
### HDL latency is 3 samples

In general, you can specify any arbitrary decomposition of taps into serial partitions, depending on other constraints. The only requirement is that the sum of the elements of the vector must equal the effective filter length.
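A quick way to sanity-check a candidate partition before passing it to generatehdl is to test the sum requirement and compute the implied clock rate. In the sketch below, the third candidate, [30 30 20], is a deliberately invalid hypothetical value.

```matlab
effLen = 78;                      % effective filter length
FsIn   = 44.1e3;                  % input sample rate in Hz
candidates = {[20 20 20 18], [45 33], [30 30 20]};
ok = false(1, numel(candidates));

for k = 1:numel(candidates)
    sp = candidates{k};
    ok(k) = (sum(sp) == effLen);  % elements must add up to the effective length
    if ok(k)
        fprintf('[%s]: valid, clock = %.1f kHz\n', ...
                num2str(sp), max(sp)*FsIn/1e3);
    else
        fprintf('[%s]: invalid, sum is %d instead of %d\n', ...
                num2str(sp), sum(sp), effLen);
    end
end
```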

Cascade-Serial Architecture

The accumulators in serial partitions can be reused to add the result of the next serial partition. This reuse is possible if the number of taps processed by one serial partition exceeds the number processed by the next serial partition by at least 1. The advantage of this technique is that it removes the set of adders required to add the results of all serial partitions. However, it increases the clock rate multiplier (the folding factor) by 1, because an additional clock cycle is required to complete the additional accumulation step.

You can specify the cascade-serial architecture using the 'ReuseAccum' property, in one of two ways.

First, add 'ReuseAccum' to the generatehdl call and set it to 'on', together with an explicit 'SerialPartition' value. The 'SerialPartition' value must make accumulator reuse feasible: the elements of the vector must be in descending order, except for the last two, which can be equal.

Second, set 'ReuseAccum' to 'on' without specifying 'SerialPartition'. The decomposition of taps into serial partitions is then determined internally. This decomposition minimizes the clock rate while reusing the accumulator logic. For this audio filter, the serial partitions are [12 11 10 9 8 7 6 5 4 3 3]. This decomposition uses 11 serial partitions, implying 11 multiply-accumulate circuits. The clock rate is 13 times the input sample rate, or 573.3 kHz.

The following code uses the first approach with the [45 33] partition from the speed-constraint case.

generatehdl(lpFilter,'Name', 'cascadeserial1',...
            'SerialPartition', [45 33],...
            'ReuseAccum', 'on', ...
            'TargetLanguage', 'VHDL', ...
            'TargetDirectory', workingdir, ...
            'InputDataType', nt_in);

### Starting VHDL code generation process for filter: cascadeserial1
### Generating: /tmp/Bdoc24a_2528353_289974/tp89139833_2d76_45c7_a246_77199ce362b4/cascadeserial1.vhd
### Starting generation of cascadeserial1 VHDL entity
### Starting generation of cascadeserial1 VHDL architecture
### Clock rate is 46 times the input sample rate for this architecture.
### Successful completion of VHDL code generation process for filter: cascadeserial1
### HDL latency is 3 samples

When 'SerialPartition' is not specified, generatehdl chooses a decomposition into as many serial partitions as required to reach the minimum clock rate at which the accumulator can be reused.

generatehdl(lpFilter,'Name', 'cascadeserial2', ...
            'ReuseAccum', 'on',...
            'TargetLanguage', 'VHDL',...
            'TargetDirectory', workingdir, ...
            'InputDataType', nt_in);

### Starting VHDL code generation process for filter: cascadeserial2
### Generating: /tmp/Bdoc24a_2528353_289974/tp89139833_2d76_45c7_a246_77199ce362b4/cascadeserial2.vhd
### Starting generation of cascadeserial2 VHDL entity
### Starting generation of cascadeserial2 VHDL architecture
### Clock rate is 13 times the input sample rate for this architecture.
### Serial partition # 1 has 12 inputs.
### Serial partition # 2 has 11 inputs.
### Serial partition # 3 has 10 inputs.
### Serial partition # 4 has 9 inputs.
### Serial partition # 5 has 8 inputs.
### Serial partition # 6 has 7 inputs.
### Serial partition # 7 has 6 inputs.
### Serial partition # 8 has 5 inputs.
### Serial partition # 9 has 4 inputs.
### Serial partition # 10 has 3 inputs.
### Serial partition # 11 has 3 inputs.
### Successful completion of VHDL code generation process for filter: cascadeserial2
### HDL latency is 3 samples
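The accumulator-reuse condition and the clock rates reported in this section can be checked with a few lines of arithmetic. The sketch below encodes the ordering rule as stated in the text (descending order, with the last two elements allowed to be equal) and adds one clock cycle to the largest partition. The helper name isFeasible is hypothetical and is not part of any toolbox.

```matlab
FsIn = 44.1e3;                    % input sample rate in Hz

% Ordering rule from the text: strictly descending, last two may be equal.
isFeasible = @(sp) numel(sp) < 2 || ...
    (all(diff(sp(1:end-1)) < 0) && sp(end) <= sp(end-1));

candidates = {[45 33], [12 11 10 9 8 7 6 5 4 3 3], [20 20 20 18]};
feasible = false(1, numel(candidates));
clkFactor = zeros(1, numel(candidates));

for k = 1:numel(candidates)
    sp = candidates{k};
    feasible(k)  = isFeasible(sp);
    clkFactor(k) = max(sp) + 1;   % one extra cycle for the cascaded accumulation
    fprintf('[%s]: reuse feasible = %d, clock = %d x Fs = %.1f kHz\n', ...
            num2str(sp), feasible(k), clkFactor(k), clkFactor(k)*FsIn/1e3);
end
```

The first two candidates give clock factors of 46 and 13, matching the generatehdl messages above. The third candidate fails the ordering rule, so its clock factor is not meaningful for a cascade-serial implementation.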

Conclusion

You designed a lowpass direct form symmetric FIR filter to meet the given specification. You then quantized and checked your design. You generated VHDL code for fully parallel, fully serial, partly serial and cascade-serial architectures. You generated a VHDL test bench using a DTMF tone for one of the architectures.

You can use an HDL simulator to verify the generated HDL code for the different serial architectures. You can use a synthesis tool to compare the area and speed of these architectures. You can also experiment with generating Verilog code and test benches.