The CDSP-k2 Processor

Technology overview

The CDSP-k2 is the second member of the CDSP family of high-performance, customizable fixed-point DSP cores, featuring high execution speeds for both signal-processing algorithms and standard microprocessor applications. It is meant to be used as an embedded cell in ASICs developed on most 0.6-micron and smaller technologies. It is highly customizable and can be targeted at a large number of technologies thanks to its parameterized, HDL-only design.

The CDSP-k2 includes an integrated customizable hardware acceleration unit that has been optimized for MAC/FIR/Correlation, FFT/iFFT, and Matrix/Vector operations, thus allowing the processor to be optimized for many common DSP algorithms. The modular design of the core allows stripped-down versions to be easily obtained and enables easy tuning of the design to match the user's specifications. The user-guided customization process can thus achieve a highly efficient, low-power, small-area implementation, making the CDSP-k2 well suited for high-volume, low-cost applications, while also delivering world-class performance.
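As an illustration of why MAC throughput matters for the algorithms named above, the following minimal Python sketch writes a direct-form FIR filter as an explicit multiply-accumulate loop. It is a generic model, not CDSP-k2 code; the function name fir_mac and the sample data are hypothetical.

```python
def fir_mac(samples, coeffs):
    """Direct-form FIR filter written as an explicit multiply-accumulate loop.

    Generic illustration only; it does not model the CDSP-k2 instruction set.
    """
    taps = len(coeffs)
    history = [0] * taps              # delay line, newest sample first
    outputs = []
    for x in samples:
        history = [x] + history[:-1]  # shift the new sample into the delay line
        acc = 0                       # accumulator, cleared for each output
        for h, s in zip(coeffs, history):
            acc += h * s              # one MAC operation per filter tap
        outputs.append(acc)
    return outputs


# Feeding a unit impulse returns the coefficients themselves.
impulse_response = fir_mac([1, 0, 0, 0], [3, 2, 1])
print(impulse_response)  # [3, 2, 1, 0]
```

Each output sample costs one MAC per tap, so the number of MACs the core can sustain per cycle sets the filter rate directly.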

Architectural features:

Single-cycle execution for most instructions
Two-operand instruction set with one operand residing in memory and the other in a register
A dual-operation instruction-word option enabling sustained rates of two operations per cycle in memory-access-intensive algorithms such as buffered image processing and adaptive filtering
Four internal data busses enabling up to four internal data transfers per cycle
Zero-cycle block-repeat capability plus a standard looping instruction
Special bank-based memory architecture enabling efficient usage of data types that are smaller than a processor word
Very compact code and large addressing space
Eight logical shifts and four arithmetic shifts
Configurable hardware multiplier
Provision for full accommodation of MAC(s) by mapping one MAC input and the output on the register file and providing the full range of indexed addressing modes for the second input
Configurable butterfly unit enabling execution speeds comparable with cutting-edge parallel DSP processors on the market
Up to three index registers, each with full modulo and bit-reversed post-increment addressing capability
A constant-memory table-lookup pointer with post-increment/post-decrement options
Synchronous program memory implementable as a RAM/ROM combination, giving the DSP run-time programmability via the communication ports
Less-than-one-cycle response in wait mode, allowing fast synchronization with predictable asynchronous events
Option for shadow registers allowing zero-cycle context saving for one or more levels of interrupts
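
The modulo and bit-reversed post-increment addressing modes listed above can be illustrated with a short Python sketch. It models only the address sequences such modes produce, not how the CDSP-k2 address generators are built; the function names and the buffer sizes are assumptions chosen for illustration.

```python
def modulo_post_increment(index, step, length):
    """Return (address to use now, index for the next access).

    The index wraps around a circular buffer of `length` words.
    """
    return index, (index + step) % length


def bit_reverse(index, bits):
    """Reverse the low `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversed_sequence(bits):
    """Order in which a 2**bits-point radix-2 FFT buffer is visited."""
    return [bit_reverse(i, bits) for i in range(1 << bits)]


# Walk a 4-word circular buffer six times: the address wraps after 3.
idx = 0
visited = []
for _ in range(6):
    addr, idx = modulo_post_increment(idx, 1, 4)
    visited.append(addr)
print(visited)                    # [0, 1, 2, 3, 0, 1]

# Bit-reversed order for an 8-point buffer.
print(bit_reversed_sequence(3))   # [0, 4, 2, 6, 1, 5, 3, 7]
```

Modulo addressing keeps a filter delay line circular without explicit wrap checks, and bit-reversed addressing delivers FFT data in the order a radix-2 transform needs, which is why both are commonly provided in the address generator rather than in software.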

Customizable features include:

The size of the processor word (up to 64 bits) and the point-position within the fixed-point registers
The RAM and ROM sizes
The number of integer and fixed-point registers
The number and choice of shadow registers
The number of index registers and the features of the address generators, including modulo and bit-reversed addressing modes
The amount of shifting for the shift instructions
The addressing space (up to 2 Gwords)
The number, size and operation mode of the communication ports
The performance of the hardware multiplier, ranging from one result-bit per cycle up to pipelined single-cycle
Two instruction set implementation options, targeting lower power consumption and higher maximum execution speed respectively
And more...
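
To show what choosing the word size and the point position means in practice, here is a minimal Python sketch of signed fixed-point quantization and multiplication. The parameter names frac_bits and word_bits are hypothetical, and the saturation and truncation behaviour shown is an assumption made for the example, not a description of the CDSP-k2 arithmetic unit.

```python
def to_fixed(value, frac_bits, word_bits):
    """Quantize a real value to a signed fixed-point integer.

    `frac_bits` is the number of bits to the right of the point;
    results are saturated to the `word_bits`-wide signed range (assumed).
    """
    scaled = int(round(value * (1 << frac_bits)))
    lowest = -(1 << (word_bits - 1))
    highest = (1 << (word_bits - 1)) - 1
    return max(lowest, min(highest, scaled))


def to_float(raw, frac_bits):
    """Convert a fixed-point integer back to a real value."""
    return raw / (1 << frac_bits)


def fixed_mul(a, b, frac_bits, word_bits):
    """Multiply two fixed-point values and rescale to the same format."""
    product = (a * b) >> frac_bits    # drop the extra fractional bits
    lowest = -(1 << (word_bits - 1))
    highest = (1 << (word_bits - 1)) - 1
    return max(lowest, min(highest, product))


# 16-bit word with 15 fractional bits: 0.5 * 0.5 = 0.25
half = to_fixed(0.5, 15, 16)
quarter = fixed_mul(half, half, 15, 16)
print(half, quarter, to_float(quarter, 15))   # 16384 8192 0.25
```

Moving the point to the left trades fractional resolution for integer range, which is the kind of trade-off the configurable point position lets a designer match to the target algorithm.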

Performance for a typical 0.6-micron, 5 V technology implementation:

75 MHz or 133 MHz operation depending on instruction-set implementation option
Sustained 100 MIPS performance, leading to various execution speeds depending on the architecture variant and the specific algorithm being implemented. Peak performance for typical DSP algorithms is higher than:

100 MOPS for the basic architecture

200 MOPS by using the dual-operation instruction-set option

300 MOPS by using the dual-operation instruction-set option and a register-mapped MAC

As an example, an 8-channel ITU-T G.726 ADPCM algorithm can be implemented on a basic architecture running at 60 MHz
A rate of 10,000 256-point FFTs per second can be obtained at 75 MHz operation by using the dedicated butterfly unit
The basic architecture, without the MAC(s) and the butterfly unit, has a 130 MIPS/Watt ratio, leading to less than 300 mW power dissipation for a 40-bit processor running at 50 MHz
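
The FFT and ADPCM figures above can be turned into per-operation cycle budgets with simple arithmetic, sketched below in Python. Two assumptions are made that the figures themselves do not state: the FFT is counted as a radix-2 transform, with (N/2)*log2(N) butterflies, and each G.726 channel is sampled at the standard 8 kHz telephony rate. The function names are hypothetical.

```python
import math


def cycles_per_fft(clock_hz, ffts_per_second):
    """Clock cycles available for each complete FFT."""
    return clock_hz / ffts_per_second


def radix2_butterflies(n_points):
    """Butterfly count of a radix-2 FFT (assumed structure)."""
    return (n_points // 2) * int(math.log2(n_points))


def cycles_per_sample(clock_hz, channels, sample_rate_hz):
    """Clock cycles available per sample, per channel."""
    return clock_hz / (channels * sample_rate_hz)


fft_budget = cycles_per_fft(75e6, 10_000)        # 7500.0 cycles per FFT
butterflies = radix2_butterflies(256)            # 1024 butterflies
per_butterfly = fft_budget / butterflies         # about 7.3 cycles each

adpcm_budget = cycles_per_sample(60e6, 8, 8000)  # 937.5 cycles per sample

print(fft_budget, butterflies, round(per_butterfly, 2), adpcm_budget)
```

Under these assumptions, the quoted FFT rate leaves roughly seven cycles per butterfly including data movement and loop overhead, and the 8-channel ADPCM case leaves a little under a thousand cycles to process each sample of each channel.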