CDSP-k5 Overview

The CDSP-k5 Processor

Technology overview

The CDSP-k5 is the fifth member of the CDSP family of high-performance, customizable fixed-point DSP cores, featuring high execution speeds for both signal-processing algorithms and standard microprocessor applications. It is meant to be used as an embedded cell in ASICs developed on most 0.25 µm and smaller technologies. It is highly customizable and can be targeted at a large number of technologies thanks to its parameterized, HDL-only design.

The CDSP-k5 processes two words in parallel (up to 32 bits each), interpreted as either a complex number, or as a pair of real numbers. The CDSP-k5 hosts a 4-unit complex multiplier allowing speeds of one ComplexMAC/Clock cycle, or two parallel RealMACs/Clock cycle; in both cases saturation logic can be enabled to assist these operations.
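
The two operating modes described above can be sketched in software as follows. This is only an illustrative model, not the core's actual datapath: the function names, the tuple representation of a word pair, and the default 40-bit accumulator width are assumptions made for the example. The four products in complex_mac correspond to the four multiplier units.

```python
def saturate(value, bits):
    """Clamp a signed integer to the two's-complement range of `bits` bits."""
    lo = -(1 << (bits - 1))
    hi = (1 << (bits - 1)) - 1
    return max(lo, min(hi, value))


def complex_mac(acc, a, b, acc_bits=40, saturating=True):
    """Return acc + a*b, with acc, a and b given as (re, im) integer pairs.

    acc_bits is a hypothetical accumulator width chosen for illustration.
    """
    a_re, a_im = a
    b_re, b_im = b
    # Four real products, one per multiplier unit.
    p_rr = a_re * b_re
    p_ii = a_im * b_im
    p_ri = a_re * b_im
    p_ir = a_im * b_re
    re = acc[0] + p_rr - p_ii
    im = acc[1] + p_ri + p_ir
    if saturating:
        re = saturate(re, acc_bits)
        im = saturate(im, acc_bits)
    return (re, im)


def dual_real_mac(acc, a, b, acc_bits=40, saturating=True):
    """Return two independent real MACs: (acc0 + a0*b0, acc1 + a1*b1)."""
    r0 = acc[0] + a[0] * b[0]
    r1 = acc[1] + a[1] * b[1]
    if saturating:
        r0 = saturate(r0, acc_bits)
        r1 = saturate(r1, acc_bits)
    return (r0, r1)
```

With saturation disabled, an out-of-range result would wrap around in real hardware; the sketch simply returns the unbounded Python integer in that case.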

The CDSP-k5 includes a high performance, customizable hardware acceleration unit that has been optimized for most of the common DSP algorithms (MAC/FIR/Correlation, LMS, IIR, FFT/iFFT, Search/Min/Max, Matrix/Vector operations). The modular design of the core allows stripped-down versions to be easily obtained, while a number of list-box/check-box customizable features enable on-the-fly tuning of the design to match the user's specifications. The user-guided customization process can thus achieve a highly efficient, low power and small area implementation, making the CDSP-k5 well suited for high-volume, low-cost applications, while also delivering world-class performance.

Some of the CDSP-k5 general registers can be used to interface application-specific hardware accelerators; this offers the user a convenient and effective way to tightly interact with the internal CDSP structure. Also, both the ALU and the MAC can be completely replaced and/or complemented with user-defined hardware structures.

Architectural features:

Single-cycle execution for most instructions
Operates directly on complex numbers, or on pairs of real numbers
Highly orthogonal, two-operand instruction set, with one operand residing in a register, and the other in a register or memory location
Unified data memory addressing replaces the traditional DSPs' X and Y data memories
Configurable MAC unit optimized for most of the common DSP algorithms, enabling execution speeds comparable to those of cutting-edge parallel DSPs on the market
Saturation logic built into both the ALU and the MAC units
Up to four index registers fully featured with modulo and bit-reversed post-increment addressing capability
Zero-cycle Block-repeat capability plus a standard looping instruction
Dynamic shift instruction, plus a choice of static shifts (both arithmetic and logical)
Compact code and large addressing space
Low power dissipation, achieved by disabling, in every clock cycle, the logic modules that are inactive
Less than one cycle of response time when in wait mode, allowing fast synchronization with predictable asynchronous events
Six internal 64-bit data busses enabling up to six internal complex-data transfers per cycle, or twelve internal real-data transfers per cycle
Special bank-based memory architecture enabling efficient usage of data types that are smaller than a processor word
Synchronous program memory implementable as a RAM/ROM combination, giving the DSP run-time programmability
Interface registers to allow application-specific hardware acceleration modules to be tightly integrated with the core
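
The modulo and bit-reversed post-increment addressing modes listed above can be illustrated with a short software model. This is a sketch under stated assumptions, not a description of the CDSP-k5 address generators: the function names and the base/length parameterization of the circular buffer are hypothetical.

```python
def post_increment_modulo(index, step, base, length):
    """Model one modulo post-increment access.

    Returns (address_used, next_index). The index wraps inside the
    circular buffer occupying addresses [base, base + length).
    """
    address = index
    next_index = base + (index - base + step) % length
    return address, next_index


def bit_reverse(value, bits):
    """Reverse the order of the lowest `bits` bits of `value`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def bit_reversed_sequence(bits):
    """Offsets visited by bit-reversed post-increment over 2**bits entries."""
    return [bit_reverse(i, bits) for i in range(1 << bits)]
```

Modulo addressing is what keeps a FIR delay line inside a fixed buffer without explicit wrap checks, and the bit-reversed order is the access pattern needed to reorder the input or output of a radix-2 FFT. For an 8-point transform, bit_reversed_sequence(3) gives the offsets 0, 4, 2, 6, 1, 5, 3, 7.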

Customizable features include:

The size of the processor word (up to 32 bits)
The RAM and ROM sizes
The number of general registers
The number of index registers and the features of the address generators, including modulo and bit-reversed addressing modes
The performance of the MAC unit, ranging from a simple multiplier producing one result bit per cycle, up to a state-of-the-art, fully pipelined, single-cycle complex hardware accelerator
The saturation and rounding options built into the ALU and the MAC
The amount of shifting for the static shift instructions
The addressing space (up to 2 GW)
The number, size and operation mode of the communication ports
And more...

Performance for a typical 0.25 µm / 3 V technology implementation:

The CDSP-k5 is implemented in two versions: a 4-stage pipeline version (CDSP-k5-4) and a 6-stage pipeline version (CDSP-k5-6). The 4-stage version chains memory accesses and internal DSP processing in the same clock cycle, while the 6-stage version pipelines them into separate stages; as a result, the 6-stage version runs at twice the clock speed of the 4-stage version. The critical path inside the CDSP-k5 is less than 5 ns, leading to 200 MHz operation for the CDSP-k5-6 and 100 MHz operation for the CDSP-k5-4.
The CDSP-k5 operates at a sustained rate of 100 MIPS at 100 MHz, or 200 MIPS at 200 MHz. Very high performance is achieved for typical DSP algorithms by having up to eight internal arithmetic units, plus two address generator units, working in parallel every clock cycle. The CDSP's ALU and MAC units have been designed with special emphasis on efficient usage of the hardware resources during typical DSP algorithms, leading to effective speeds of 2 GOPS (Giga Operations Per Second). Examples of algorithms that fully utilize this computing power are Complex FIR, Complex Correlation, Complex Matrix Multiplication and Complex Energy calculation. For other algorithms, such as Real FIR, Real Correlation, Real Energy calculation, FFT, iFFT, LMS-based Complex FIR update and Echo Cancellation, speeds between 1 GOPS and 1.5 GOPS are obtained.
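
The figures above can be cross-checked with a little arithmetic. The sketch below assumes that the 4-stage version's cycle spans two chained 5 ns paths, and that the 2 GOPS figure counts one operation per cycle for each of the eight arithmetic units and two address generators; both are readings of the text rather than stated facts, and the function names are hypothetical.

```python
def clock_mhz(critical_path_ns, chained_paths=1):
    """Clock frequency in MHz when `chained_paths` critical paths
    must complete within a single clock cycle."""
    return 1000 / (critical_path_ns * chained_paths)


def peak_gops(clock_frequency_mhz, arithmetic_units=8, address_generators=2):
    """Peak rate in GOPS, assuming every unit completes one operation per cycle."""
    return clock_frequency_mhz * (arithmetic_units + address_generators) / 1000


k5_6_mhz = clock_mhz(5)                   # memory access and processing pipelined
k5_4_mhz = clock_mhz(5, chained_paths=2)  # memory access and processing chained

print(k5_6_mhz, peak_gops(k5_6_mhz))
print(k5_4_mhz, peak_gops(k5_4_mhz))
```

Under these assumptions the 6-stage version reaches 200 MHz and 2 GOPS, matching the figures quoted above, while the 4-stage version reaches 100 MHz and 1 GOPS.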