Lattice Semiconductor Parallel FIR Filter User’s Guide
Coefficient Registers
The Coefficient Registers module stores the FIR filter coefficients. The coefficients can either be loaded at run time
or can be fixed during core generation. If the user chooses to fix the coefficients, then the coeff bus and loadc
ports are not used in this module. For fixed coefficients, the values are hardcoded. If the coefficients are configured
to be loaded, they are loaded into the coeff registers sequentially at every clock edge. The coeff loading starts
at the first clock edge after loadc goes high and continues as long as loadc is active.
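The run-time loading behavior described above can be sketched as a small behavioral model. This is only an illustration, not the core's RTL: the function name load_coefficients, the one-register-per-edge write, the fill order (index 0 first) and the choice to stop once every register has been written are all assumptions that the text does not specify.

```python
def load_coefficients(num_taps, samples):
    """Behavioral sketch of sequential coefficient loading.

    samples holds one (loadc, coeff) pair per clock edge.
    Assumptions (not from the guide): one register is written per edge,
    registers fill in index order starting at 0, and writes stop once
    all num_taps registers have been loaded.
    """
    regs = [0] * num_taps
    index = 0
    for loadc, coeff in samples:
        # A coefficient is captured only on edges where loadc is high.
        if loadc and index < num_taps:
            regs[index] = coeff
            index += 1
    return regs


# loadc is low on the first and last edges, so those coeff values are ignored.
edges = [(0, 9), (1, 3), (1, -1), (1, 4), (0, 7)]
registers = load_coefficients(4, edges)
```

In this run only three coefficients arrive while loadc is high, so the fourth register keeps its initial value of 0.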
Data Scheduler
Data scheduling is needed to feed the tap and coefficient data to the multiplier bank for multi-cycle computations.
This module has the necessary multiplexers to supply the tap and coefficient data to the multiplier bank in
batches. For a multi-cycle implementation with C cycles, the number of multipliers, M, is equal to N/C rounded to
the next higher integer. For a fully parallel implementation (C = 1), the data scheduler reduces to a direct
connection. The data scheduler is also used to multiplex data for optimizing decimation and interpolation filters.
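As a rough illustration of batching, the sketch below splits N tap/coefficient pairs into C batches of at most M = ceil(N/C) pairs each. The function name schedule_batches and the contiguous grouping of taps are assumptions made for this example; the guide does not state how the core assigns taps to multipliers in each cycle.

```python
def schedule_batches(taps, coeffs, cycles):
    """Split tap/coefficient pairs into per-cycle batches.

    Returns (m, batches), where m = ceil(N / cycles) is the number of
    multipliers and batches[c] lists the (tap, coeff) pairs fed to the
    multiplier bank in cycle c.
    Assumption (not from the guide): taps are grouped contiguously.
    """
    n = len(taps)
    m = (n + cycles - 1) // cycles  # integer ceil(N / C)
    batches = []
    for c in range(cycles):
        lo = c * m
        hi = min(lo + m, n)
        # The last batch may be short (or empty) when N is not a multiple of M.
        batches.append(list(zip(taps[lo:hi], coeffs[lo:hi])))
    return m, batches


# N = 7 taps computed over C = 3 cycles needs M = 3 multipliers.
m, batches = schedule_batches(list(range(7)), list(range(10, 17)), 3)
```

With C = 1 the function returns a single batch containing every pair, which corresponds to the direct connection of the fully parallel case.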
Multiplier Bank
The Multiplier Bank has M multipliers, each W bits wide, where M is the number of taps N divided by the number of
computational cycles C, rounded to the next higher integer (M = ceil(N/C)). The number of multipliers is equal to
the number of taps for a fully parallel implementation. The input to the bank comes from the data scheduler and
the output goes to the adder tree. The maximum delay through the multiplier bank is equal to the delay of a
single multiplier.
Adder Tree and Output Control Unit
The Adder Tree has parallel adders instantiated in a binary tree fashion. The Output Control Unit has the scaling
and rounding logic to achieve output scalability and selectable rounding. There are also data registers to provide
synchronous registered output from the filter core. For multi-cycle or decimation filtering, an adder is also present
in the block; combined with the output registers, it forms an accumulator.
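The binary-tree summation and the accumulator can be illustrated with the following sketch. It is a behavioral approximation under stated assumptions: the names adder_tree and multicycle_output are hypothetical, an unpaired value is simply passed to the next tree level, and the scaling and rounding logic of the Output Control Unit is not modeled.

```python
def adder_tree(values):
    """Sum values pairwise, level by level, like a binary tree of adders."""
    level = list(values)
    if not level:
        return 0
    while len(level) > 1:
        # Each adder at this level combines one adjacent pair.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # unpaired value passes through (assumption)
        level = nxt
    return level[0]


def multicycle_output(product_batches):
    """Accumulate one adder-tree result per cycle into a single output."""
    acc = 0  # models the output register used as an accumulator
    for batch in product_batches:
        acc += adder_tree(batch)
    return acc


# Three cycles of multiplier outputs combined into one filter output.
result = multicycle_output([[1, 2, 3], [4, 5, 6], [7]])
```

For a single-cycle filter there is only one batch, so the accumulator step adds nothing beyond the tree sum itself.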
Core Operation
There are four distinct implementations of the parallel FIR filter: single-cycle, multi-cycle, decimation and interpolation.
This section describes these implementation types in detail. A note on rounding and truncation is also given in this
section. The complex data type is supported in all the filter implementations. For a complex data type, the complex
input data can be supplied either all at once (complex-parallel) or in two stages, real data followed by imaginary
data (complex-serial). The following notations are used:
N Number of taps
W Width of input data and coefficients
C Number of cycles for a multi-cycle operation
D Decimation ratio
U Interpolation ratio
M Number of multipliers, determined as M = ceil(N/C), that is, N/C rounded to the next higher integer
OW Output width
OFW Output full width
Single Cycle
This is the simplest of all implementations, in that it assumes availability of sufficient resources for parallel
implementation. For an N-tap filter, it uses N multipliers and N - 1 adders. The output is available on every cycle. The
timing diagrams for the single-cycle implementations are given in Figures 2 and 3. As seen in the timing diagram, real
and imaginary parts of the input are supplied in successive clock cycles in complex-serial mode. In this mode, the data
rate is equal to half the clock rate. The input irdy should be asserted high to coincide with each valid real data
value at the din port. Similarly, the core asserts the output real_out whenever the real part of the output data is
placed on the output bus.
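A minimal behavioral sketch of the single-cycle case for real data is shown below: every input sample produces one output from N products and N - 1 additions. The function name fir_single_cycle and the zero-initialized tap delay line are assumptions for illustration; pipeline latency, output width handling, rounding, and the irdy and real_out handshaking are not modeled.

```python
def fir_single_cycle(samples, coeffs):
    """Direct-form FIR model producing one output per input sample.

    Assumptions (not from the guide): the tap delay line starts at zero
    and arithmetic is full precision with no scaling or rounding.
    """
    n = len(coeffs)
    taps = [0] * n  # tap delay line, newest sample first
    outputs = []
    for x in samples:
        taps = [x] + taps[:-1]  # shift the new sample into the delay line
        products = [t * h for t, h in zip(taps, coeffs)]  # N multipliers
        outputs.append(sum(products))  # N - 1 adders
    return outputs


# An impulse at the input reproduces the coefficients at the output.
impulse_response = fir_single_cycle([1, 0, 0, 0], [2, 3, 5])
```

Feeding an impulse is a quick way to check such a model, since the output sequence should match the coefficient list followed by zeros.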