The CDSP-k5 is the fifth member of the CDSP family of high-performance,
customizable fixed-point DSP cores, featuring high execution speeds for
both signal-processing algorithms and standard microprocessor
applications. It is meant to be used as an embedded cell in ASICs
developed on most of the 0.25u and below technologies. It is highly
customizable and can be targeted at a large number of technologies
thanks to its parameterized, HDL-only design.
The CDSP-k5 processes two words in parallel (up to 32 bits each),
interpreted either as a complex number or as a pair of real numbers. The
CDSP-k5 hosts a 4-unit complex multiplier allowing speeds of one
ComplexMAC per clock cycle, or two parallel RealMACs per clock cycle; in
both cases saturation logic can be enabled to assist these operations.
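As an illustration of the two operating modes described above, the following is a minimal behavioral sketch in Python, not a model of the actual hardware. It assumes that a complex MAC maps onto four real multiplications (one per multiplier unit) and that saturation clamps each result to a signed word of `bits` bits; the function names, the (re, im) tuple layout, and the accumulator width are all assumptions made for this example.

```python
def saturate(value, bits=32):
    """Clamp a signed integer to the range of a `bits`-wide two's-complement word."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def complex_mac(acc, a, b, bits=32, sat=True):
    """Return acc + a * b for complex values held as (re, im) integer pairs.

    Four real products are formed, mirroring a 4-unit complex multiplier
    (assumed mapping, for illustration only).
    """
    acc_re, acc_im = acc
    a_re, a_im = a
    b_re, b_im = b
    p0 = a_re * b_re  # contributes to the real part
    p1 = a_im * b_im  # subtracted from the real part
    p2 = a_re * b_im  # contributes to the imaginary part
    p3 = a_im * b_re  # contributes to the imaginary part
    re = acc_re + p0 - p1
    im = acc_im + p2 + p3
    if sat:
        re, im = saturate(re, bits), saturate(im, bits)
    return re, im


def dual_real_mac(acc, a, b, bits=32, sat=True):
    """Two independent real MACs on the same word-pair operands."""
    r0 = acc[0] + a[0] * b[0]
    r1 = acc[1] + a[1] * b[1]
    if sat:
        r0, r1 = saturate(r0, bits), saturate(r1, bits)
    return r0, r1
```

For example, `complex_mac((0, 0), (1, 2), (3, 4))` returns `(-5, 10)`, while `dual_real_mac((0, 0), (1, 2), (3, 4))` treats the same operands as two real pairs and returns `(3, 8)`.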
The CDSP-k5 includes a high-performance, customizable hardware
acceleration unit that has been optimized for most of the common DSP
algorithms (MAC/FIR/Correlation, LMS, IIR, FFT/iFFT, Search/Min/Max,
Matrix/Vector operations). The modular design of the core allows
stripped-down versions to be easily obtained, while a number of
list-box/check-box customizable features enable on-the-fly tuning of the
design to match the user's specifications. The user-guided customization
process can thus achieve a highly efficient, low-power and small-area
implementation, making the CDSP-k5 well suited for high-volume, low-cost
applications, while also delivering world-class performance.
Some of the CDSP-k5 general registers can be used to interface with
application-specific hardware accelerators; this offers the user a
convenient and effective way to interact tightly with the internal CDSP
structure. Also, both the ALU and the MAC can be completely replaced
and/or complemented with user-defined hardware structures.
Architectural features:
- Single-cycle execution for most instructions
- Operates directly on complex numbers, or on pairs of real numbers
- Highly orthogonal, two-operand instruction set, with one operand
  residing in a register and the other in a register or memory location
- Unified data memory addressing replaces the traditional DSPs' X and Y
  data memories
- Configurable MAC unit optimized for most of the common DSP algorithms,
  enabling execution speeds comparable with cutting-edge parallel DSP
  processors on the market
- Saturation logic built into both the ALU and the MAC units
- Up to four index registers, each fully featured with modulo and
  bit-reversed post-increment addressing capability
- Zero-cycle block-repeat capability plus a standard looping instruction
- Dynamic shift instruction, plus a choice of static shifts (both
  arithmetic and logical)
- Compact code and large addressing space
- Low power dissipation, achieved by blocking the logic modules that are
  inactive in any given clock cycle
- Less-than-one-cycle response when in wait mode, allowing fast
  synchronization with predictable asynchronous events
- Six internal 64-bit data busses enabling up to six internal
  complex-data transfers per cycle, or twelve internal real-data
  transfers per cycle
- Special bank-based memory architecture enabling efficient usage of
  data types that are smaller than a processor word
- Synchronous program memory implementable as a RAM/ROM combination,
  giving the DSP run-time programmability
- Interface registers to allow application-specific hardware
  acceleration modules to be tightly integrated with the core
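The modulo and bit-reversed post-increment addressing modes listed among the features above can be sketched in Python as follows. This is only a behavioral illustration under stated assumptions: the function names, the `base`/`length` parameters, and the counter-reversal approach are hypothetical and are not taken from the CDSP-k5 instruction set. Hardware address generators often produce the bit-reversed sequence with a reverse-carry adder; the sketch simply reverses the bits of a running counter, which visits the same addresses.

```python
def modulo_post_increment(index, step, base, length):
    """Return (address_used, next_index) for a circular buffer.

    The buffer occupies `length` words starting at `base`; the index is
    used first and then advanced by `step`, wrapping inside the buffer.
    """
    address = index
    next_index = base + (index - base + step) % length
    return address, next_index


def bit_reverse(value, bits):
    """Reverse the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def bit_reversed_sequence(base, bits):
    """Addresses visited when walking a 2**bits-word buffer in bit-reversed order."""
    return [base + bit_reverse(i, bits) for i in range(1 << bits)]
```

For an 8-point FFT buffer at address 0, `bit_reversed_sequence(0, 3)` yields `[0, 4, 2, 6, 1, 5, 3, 7]`, and `modulo_post_increment(103, 1, 100, 4)` returns `(103, 100)`, showing the wrap back to the start of a four-word circular buffer.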
Customizable features include:
- The size of the processor word (up to 32 bits)
- The RAM and ROM sizes
- The number of general registers
- The number of index registers and the features of the address
  generators, including modulo and bit-reversed addressing modes
- The performance of the MAC unit, ranging from a simple multiplier
  producing one result bit per cycle up to a state-of-the-art, fully
  pipelined, single-cycle complex hardware accelerator
- The saturation and rounding options built into the ALU and the MAC
- The amount of shifting for the static shift instructions
- The addressing space (up to 2 GW)
- The number, size and operation mode of the communication ports
- And more...
Performance for a typical 0.25u/3V technology implementation:
- The CDSP-k5 is implemented in two versions: a 4-stage pipeline version,
  CDSP-k5-4, and a 6-stage pipeline version, CDSP-k5-6. The 4-stage
  version chains memory accesses with internal DSP processing in the same
  clock cycle, while the 6-stage version pipelines the memory accesses
  and the internal DSP processing; as a result, the 6-stage version runs
  at twice the clock speed of the 4-stage version. The critical path
  inside the CDSP-k5 is less than 5ns, leading to 200MHz operation for
  the 6-stage version (CDSP-k5-6) and 100MHz operation for the 4-stage
  version (CDSP-k5-4).
- The CDSP-k5 operates at a sustained rate of 100MIPS at 100MHz, or
  200MIPS at 200MHz. Very high performance is achieved for typical DSP
  algorithms by having up to eight internal arithmetic units, plus two
  address generator units, working in parallel every clock cycle. The
  CDSP's ALU and MAC units have been designed with special emphasis on
  efficient usage of the hardware resources during typical DSP
  algorithms, leading to effective speeds of 2GOPS (Giga Operations Per
  Second). Examples of algorithms that fully utilize this computing power
  are Complex FIR, Complex Correlation, Complex Matrix Multiplication and
  Complex Energy calculation. For other algorithms such as Real FIR, Real
  Correlation, Real Energy calculation, FFT, iFFT, LMS-based Complex FIR
  update and Echo Cancellation, speeds between 1GOPS and 1.5GOPS are
  obtained.
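The figures quoted above can be cross-checked with a few lines of arithmetic. The Python sketch below assumes that each of the ten parallel units (eight arithmetic units plus two address generators) is counted as one operation per clock cycle; the text does not state how operations are counted, so this is one accounting that reproduces the 2GOPS figure, not a confirmed definition. The helper names are made up for this example.

```python
def clock_mhz(cycle_ns):
    """Clock frequency in MHz for a given cycle time in nanoseconds."""
    return 1000.0 / cycle_ns


def peak_gops(ops_per_cycle, mhz):
    """Operations per second, in GOPS, at `ops_per_cycle` and `mhz`."""
    return ops_per_cycle * mhz / 1000.0


def ops_per_cycle(gops, mhz):
    """Average operations per cycle implied by a GOPS figure at `mhz`."""
    return gops * 1000.0 / mhz


six_stage_mhz = clock_mhz(5.0)        # 5ns cycle -> 200.0 MHz
four_stage_mhz = six_stage_mhz / 2    # chained memory access -> 100.0 MHz

units = 8 + 2                         # arithmetic units + address generators
peak = peak_gops(units, six_stage_mhz)             # 2.0 GOPS

# The 1GOPS to 1.5GOPS range corresponds to 5.0 to 7.5 operations per cycle.
low = ops_per_cycle(1.0, six_stage_mhz)
high = ops_per_cycle(1.5, six_stage_mhz)
```

Under this counting, the algorithms quoted at 1GOPS to 1.5GOPS keep between half and three quarters of the ten units busy on average at 200MHz.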