Digital Signal Processors (DSPs) are CPUs designed specifically for the high computational requirements of digital signal processing applications. DSPs are commonly found in consumer electronics such as cameras, printers, modems and phones. They are well suited for applications that process images, video, and audio signals, and for digital communications. They include a number of architectural features that, when fully exploited, make it possible to execute DSP applications several times faster than on general purpose processors.
DSP Architectural Features
While different DSP processors have varying instruction sets and other features, most DSPs tend to share certain common characteristics:
Custom Instructions
DSP processors include special instructions to handle common signal-processing tasks. For example, a Multiply-Accumulate (MAC) instruction is popular for programming signal processing filters. Multiplications with saturation arithmetic can save on the burden of checking for overflow conditions.
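As a rough illustration of what a MAC instruction with saturation accomplishes, the following portable C sketch computes a dot product, the inner loop of a typical filter. The names dot_product_sat and saturate32 are ours, chosen for illustration only; on a real DSP the loop body would map to the processor's own MAC instruction, and the accumulator width would depend on the hardware.

```c
#include <stdint.h>
#include <stddef.h>

/* Clamp a 64-bit value into the signed 32-bit range instead of letting it wrap. */
static int32_t saturate32(int64_t x)
{
    if (x > INT32_MAX) return INT32_MAX;
    if (x < INT32_MIN) return INT32_MIN;
    return (int32_t)x;
}

/* Dot product of two 16-bit vectors.
   Each loop iteration is one multiply-accumulate (MAC) step. */
int32_t dot_product_sat(const int16_t *x, const int16_t *h, size_t n)
{
    int64_t acc = 0;  /* wide accumulator: assumed wide enough not to overflow */
    for (size_t i = 0; i < n; i++) {
        acc += (int32_t)x[i] * (int32_t)h[i];
    }
    return saturate32(acc);  /* saturate once, when narrowing the result */
}
```

Accumulating in a wider type and saturating once at the end is only one possible design. Hardware that saturates on every operation can give different results when intermediate sums overflow.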
Most DSPs also have SIMD (Single Instruction, Multiple Data) instructions that can operate on multiple data elements in parallel. They do this by dividing each register into smaller segments suited to the application. For example, video data can be represented using 8 bits per color, so a single 32-bit register can hold four 8-bit color values. There are SIMD instructions to perform parallel addition, subtraction, min, max, and various types of multiplication on all four elements at once. In this example, the parallel operations would provide up to a 4-fold speedup.
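To make the idea of splitting a register into segments concrete, here is a minimal sketch in plain C that adds four 8-bit values packed into one 32-bit word, without letting a carry spill from one segment into the next. The function name add4_u8 is hypothetical, and this is an emulation: a DSP with SIMD support would perform the same operation with a single instruction.

```c
#include <stdint.h>

/* Add four unsigned 8-bit lanes packed in a 32-bit word.
   Each lane wraps around on overflow (modulo 256) and
   carries never cross from one lane into its neighbor. */
uint32_t add4_u8(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every lane; any carry stays inside the lane. */
    uint32_t low7 = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Combine the top bit of every lane separately, without a carry out. */
    uint32_t top = (a ^ b) & 0x80808080u;
    return low7 ^ top;
}
```

For example, add4_u8(0x01020304, 0x10203040) yields 0x11223344: four additions for the cost of a handful of 32-bit operations.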
The instructions available vary from processor to processor. Some can be as complex as one that computes the sum of absolute differences of two vectors of numbers, which can speed up video encoding applications tremendously.
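For reference, the sum of absolute differences can be written in scalar C as follows. This is a plain sketch with a name of our choosing (sad_u8); it shows the work that a dedicated instruction would replace, not how any particular processor implements it.

```c
#include <stdint.h>
#include <stddef.h>

/* Sum of absolute differences between two blocks of 8-bit pixels. */
uint32_t sad_u8(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int d = (int)a[i] - (int)b[i];       /* signed difference */
        sum += (uint32_t)(d < 0 ? -d : d);   /* accumulate its magnitude */
    }
    return sum;
}
```

Each element costs a subtraction, a comparison, a possible negation, and an addition here, which is why folding all of that into one instruction over many elements pays off.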
Long Registers
The concept of parallel processing using SIMD instructions is made even more powerful by using long registers. Some DSPs provide 64-bit and even 128-bit register sets. Used with SIMD instructions, these registers can provide up to a 16-fold speedup in certain applications; for example, a 128-bit register holds sixteen 8-bit elements. The actual speedup obtained depends on the size of the data elements and on the parallelism available in the algorithm itself.
DMAs
Signal processing applications need to move large amounts of data through memory. DSP processors will often include advanced Direct Memory Access (DMA) hardware blocks that can offload the movement of data from the CPU.
On-chip Memory
In order to keep the CPU operating at peak speed, it should be possible to read data from memory without any wait states. External memory (DRAM) tends to be much slower than the CPU, so DSPs typically include some amount of fast on-chip memory. The memory can be arranged either as a data cache, or as software-controllable on-chip memory.
When using a data cache, the first time a piece of data is accessed, the processor will stall until it can be read from main memory. Subsequent accesses will be read from cache without any stalls. Modified data is written back from the cache to main memory automatically when its cache line is reused for other data, or at the request of the program.
To use software-controllable on-chip memory, the program first needs to copy the data into on-chip memory (possibly with the DMA engine) before it can be processed. Similarly, the results need to be copied back into main memory using the DMA, or with word-by-word copies.
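The copy-in, process, copy-out pattern can be sketched in portable C as shown here. Everything hardware-specific is an assumption: onchip_buf stands in for a buffer placed in fast on-chip memory, copy_block is a hypothetical stand-in for a DMA transfer (implemented with memcpy), and BLOCK is an arbitrary buffer size.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define BLOCK 64  /* hypothetical on-chip buffer size, in samples */

/* Stand-in for on-chip memory. On real hardware the buffer would be
   placed in a fast memory section by a vendor-specific mechanism. */
static int16_t onchip_buf[BLOCK];

/* Hypothetical stand-in for a DMA transfer; here it is just memcpy. */
static void copy_block(void *dst, const void *src, size_t bytes)
{
    memcpy(dst, src, bytes);
}

/* Halve every sample of a large array in external memory,
   working on one on-chip block at a time. */
void scale_in_blocks(int16_t *ext, size_t n)
{
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t count = (n - start < BLOCK) ? (n - start) : BLOCK;

        copy_block(onchip_buf, ext + start, count * sizeof ext[0]);  /* copy in */
        for (size_t i = 0; i < count; i++) {
            onchip_buf[i] = (int16_t)(onchip_buf[i] / 2);            /* process */
        }
        copy_block(ext + start, onchip_buf, count * sizeof ext[0]);  /* copy out */
    }
}
```

With a real DMA engine the transfers can run while the CPU works on another buffer, which this synchronous sketch does not attempt to show.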
Multiple Instructions per Cycle
In addition to all of the complex instructions available, some of the most powerful DSPs can issue multiple instructions per cycle. This can be done in one of two ways. Superscalar processors use special hardware to dispatch instructions to one of several execution units and to coordinate access to a common set of registers. The other way to issue multiple instructions per cycle is to use Very Long Instruction Words (VLIW). In this case, the compiler is left in charge of deciding which instructions can be issued in each cycle and of keeping track of when the results of each one become available. The instructions for each execution unit are combined into one very long instruction per cycle, hence the name VLIW.
There are superscalar DSPs on the market that issue up to four instructions per cycle, and VLIW processors that issue up to eight.
Programming Languages for DSPs
Most DSPs are programmed in special versions of C that have been extended with “intrinsics,” which permit the programmer to request specific custom instructions available in the processor. In addition, the C language needs to be extended to support longer registers for SIMD operations. DSP vendors will almost always provide support for C++ programming, but it is not very popular in the DSP software industry.
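One common way to use intrinsics without tying the whole code base to one processor is to hide them behind a macro with a portable fallback. The sketch below assumes a made-up intrinsic, vendor_sat_add16, and a made-up build flag, USE_VENDOR_INTRINSICS; neither belongs to any real toolchain, and the actual names must come from the compiler documentation for the target DSP.

```c
#include <stdint.h>

#if defined(USE_VENDOR_INTRINSICS)
/* Hypothetical intrinsic name: replace with the one your compiler provides. */
#define SAT_ADD16(a, b) vendor_sat_add16((a), (b))
#else
/* Portable fallback: 16-bit addition that clamps instead of wrapping. */
static int16_t sat_add16(int16_t a, int16_t b)
{
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}
#define SAT_ADD16(a, b) sat_add16((a), (b))
#endif

/* Mix two audio samples without wrap-around distortion. */
int16_t mix_samples(int16_t left, int16_t right)
{
    return SAT_ADD16(left, right);
}
```

The fallback also makes it possible to test the same source on a desktop machine before running it on the target.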
To fully exploit the architectural features of DSPs and realize their speed potential, the software needs to be optimized by hand. Compilers are generally not capable of generating the more complex instructions available in DSPs from regular unoptimized code.
Some DSP software programmers will resort to assembly programming for DSPs. While a well-written assembly program is hard to beat with any C program, assembly programs are extremely difficult to debug and maintain, especially on VLIW architectures. At Inband Software we do not favor assembly programming of DSPs; instead we recommend using optimization techniques in C to approach the performance that would be possible in assembly.
Optimizations
Each software application is different, and therefore benefits differently from various optimization techniques. But the following techniques are generally useful when programming for DSPs:
Use Custom Instructions
A single complex instruction can replace many general instructions. One needs to pay attention to the latency of each instruction to make sure that executing the single complex instruction is still faster than executing the multiple simple ones.
Loop Unrolling
This technique is very helpful on VLIW processors. It consists of duplicating the code inside a loop and reducing the number of iterations by the same factor, 2 or higher. The result is that there are more instructions inside the loop for the compiler to work with, which tends to produce better instruction schedules.
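As a small sketch of the idea, the two functions below compute the same sum; the second is unrolled by a factor of 4. The function names are ours, and the use of four separate accumulators is an additional choice we make here so that the four additions do not depend on each other.

```c
#include <stdint.h>
#include <stddef.h>

/* Straightforward loop: one addition per iteration. */
int32_t sum_simple(const int16_t *x, size_t n)
{
    int32_t s = 0;
    for (size_t i = 0; i < n; i++) {
        s += x[i];
    }
    return s;
}

/* Unrolled by 4: a quarter of the iterations, four additions in each. */
int32_t sum_unrolled(const int16_t *x, size_t n)
{
    int32_t s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    /* Handle the leftover elements when n is not a multiple of 4. */
    for (; i < n; i++) {
        s0 += x[i];
    }
    return s0 + s1 + s2 + s3;
}
```

Whether a factor of 2, 4, or more works best depends on the processor and compiler, so it is worth measuring each variant.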
Software Pipelining
In this technique, the execution inside a loop is broken in half. Half of the first iteration is done outside the loop, half of the last iteration is done after the loop. The loop then performs half of the processing for one iteration, and half of the processing of the next iteration. This technique breaks the data dependencies inside the loop, and permits a more compact schedule of instructions.
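The following sketch applies this by hand to a simple gain loop, split into two halves: a multiply, and a scale-and-store. The function names and the choice of a gain with 8 fractional bits are assumptions made for illustration; in practice a DSP compiler may perform this transformation itself.

```c
#include <stdint.h>
#include <stddef.h>

/* Reference version: both halves of the work happen in the same iteration.
   Assumes each scaled result fits in 16 bits. */
void scale_simple(const int16_t *x, int16_t *y, size_t n, int16_t gain)
{
    for (size_t i = 0; i < n; i++) {
        int32_t prod = (int32_t)x[i] * gain;  /* first half: multiply */
        y[i] = (int16_t)(prod / 256);         /* second half: scale and store */
    }
}

/* Software-pipelined version: each loop pass finishes iteration i-1
   and starts iteration i, so the two halves are independent. */
void scale_pipelined(const int16_t *x, int16_t *y, size_t n, int16_t gain)
{
    if (n == 0) return;

    /* Before the loop: first half of the first iteration. */
    int32_t prod = (int32_t)x[0] * gain;

    for (size_t i = 1; i < n; i++) {
        int32_t next = (int32_t)x[i] * gain;  /* first half of iteration i */
        y[i - 1] = (int16_t)(prod / 256);     /* second half of iteration i-1 */
        prod = next;
    }

    /* After the loop: second half of the last iteration. */
    y[n - 1] = (int16_t)(prod / 256);
}
```

Inside the pipelined loop the multiply and the store work on different elements, so neither has to wait for the other.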
Fixed-Point Programming
Some DSPs include a floating-point unit, but many do not, so programmers need to use fixed-point representations to program arithmetic algorithms with fractional data. The basic idea is to assign a fixed number of bits to the fractional portion of the quantity, and the rest to the integer portion. For example, with 4 fractional bits the number 1.0 is represented as binary 10000 (decimal 16), and 0.5 as binary 1000 (decimal 8). As long as they share the same number of fractional bits, two numbers can be added and subtracted using integer instructions. Multiplications work very similarly, but they produce twice the number of integer and fractional bits, so the results may need to be shifted in order to return to the desired number of fractional bits.
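The arithmetic above can be tried out with a few lines of C. This is a minimal sketch using 4 fractional bits to match the example; the type and function names (fix_t, fix_add, fix_mul) are ours, and real applications usually use more fractional bits and add rounding and saturation.

```c
#include <stdint.h>

#define FRAC_BITS 4              /* number of fractional bits */
#define FIX_ONE   (1 << FRAC_BITS)  /* 1.0 is binary 10000, decimal 16 */

typedef int32_t fix_t;

/* Addition needs no adjustment when both operands use the same format. */
fix_t fix_add(fix_t a, fix_t b)
{
    return a + b;
}

/* The raw product has 2 * FRAC_BITS fractional bits, so shift right by
   FRAC_BITS to return to the original format.
   Note: right-shifting a negative value is implementation-defined in C;
   this sketch assumes the usual arithmetic shift. */
fix_t fix_mul(fix_t a, fix_t b)
{
    int64_t wide = (int64_t)a * (int64_t)b;
    return (fix_t)(wide >> FRAC_BITS);
}
```

With these definitions, 1.0 + 0.5 is fix_add(16, 8) = 24, which represents 1.5, and 1.5 × 0.5 is fix_mul(24, 8) = 12, which represents 0.75.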
Optimizing software for DSPs is a laborious process, but well worth the effort. Properly optimized, a DSP application will fit in a smaller and cheaper processor than the general purpose processor it would otherwise require. DSP programming is a specialized skill that takes years to master.
Inband Software can help optimize your applications on DSP processors. Our expert DSP programmers are intimately familiar with most DSP architectures and can help you achieve the maximum potential of your DSP processors.