Massively Parallel Processor Arrays

MPPA Technology Backgrounder
The Am2045 architecture is a member of an emerging class of massively parallel chips called Massively Parallel Processor Arrays, or MPPAs. MPPA is a generic name for a category of chip, much like FPGA, ASIC, or DSP, so Nethra does not use MPPA as a brand name.

MPPAs are distinguished from 'multi-core' general-purpose processors. These conventional multi-core devices typically have only a few processors and a shared-memory, shared-bus architecture. Most multi-core devices have been aimed at general-purpose computing, for running huge existing applications and complex operating systems. By contrast, MPPAs employ massive parallelism of at least hundreds of peer-to-peer processing elements such as complete processors, ALUs, finite state machines and distributed memories, and a rich word-wide flexible interconnect fabric which can be statically, or sometimes dynamically, reconfigured. MPPAs are usually aimed at embedded computing, where they run high-performance dedicated applications such as media or network processing, which have strict cost, power and real-time requirements.

Starting with some standard semiconductor market categorization from Gartner Group, here is a diagram suggesting where MPPAs fit.



How do MPPAs differ? To start with, there are two classes of MPPA chips: SIMD (single instruction, multiple data) and MIMD (multiple instruction, multiple data).

SIMD MPPAs only have a single instruction stream (or very few) running the processors the same way on similar data. Some SIMD chips can mask off a few different parts of the array at runtime but it's still basically SIMD.

SIMD is very efficient for simple DSP filters and other regular vector processing. It used to be good for media and network processing, but modern applications like H.264 video compression and adaptive network intrusion detection get more powerful by getting more complex and less regular. H.264 video codecs for example have many components with extremely complex looping, forward and backward data dependencies and highly variable, data-dependent compute-times, which are ill-suited to SIMD MPPAs.

MIMD MPPAs are much more capable. Every processor can have its own instruction stream running on its own data. (A MIMD MPPA running the same code everywhere becomes SIMD.) This diversity of parallel control suits today's applications: it handles many data structures, not just vectors, and its processors can stay busy on varied data types and sizes. However, previous MIMD MPPA chips have not been very practical to program. They either require the developer to deal explicitly with synchronization and to use hardware-style, timing-sensitive languages and tools, or they depend on 'system compilers' that are often regarded as research projects as open-ended as artificial intelligence.
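The difference between the two classes can be sketched in a few lines of Python. This is only an illustration: the function names and the "lane" model are hypothetical and do not reflect the instruction set of any real MPPA.

```python
from typing import Callable, List, Optional

def simd_step(op: Callable[[int], int], data: List[int],
              mask: Optional[List[bool]] = None) -> List[int]:
    """One instruction applied to every lane; masked-off lanes pass data through."""
    if mask is None:
        mask = [True] * len(data)
    return [op(x) if active else x for x, active in zip(data, mask)]

def mimd_step(programs: List[Callable[[int], int]], data: List[int]) -> List[int]:
    """Each processor runs its own program on its own data item."""
    return [program(x) for program, x in zip(programs, data)]

data = [1, 2, 3, 4]
double = lambda x: x * 2

# SIMD: the same operation everywhere, with one lane masked off.
simd_out = simd_step(double, data, mask=[True, True, False, True])

# MIMD: a different operation on every processor.
mimd_out = mimd_step([double, abs, lambda x: x + 10, lambda x: -x], data)

# MIMD running the same code everywhere behaves like SIMD.
same_everywhere = mimd_step([double] * len(data), data)
```

Here `simd_out` is `[2, 4, 3, 8]` and `mimd_out` is `[2, 2, 13, -4]`; the last call reproduces the unmasked SIMD result.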

The Ambric chip enables a practical MIMD-parallel programming model, with innovative asynchronous channels that remove the global synchronization problem. Explicit task scheduling is not necessary, and complex global state machines don't have to be coded and debugged. Applications are built as block-diagram structures of objects, each written as single-threaded, sequential code in a standard high-level language, which makes even the most complex applications reasonable and practical to develop.

For more information, refer to the Massively parallel processor array entry at Wikipedia.

Embedded Computing Problem

Complex high-bandwidth, real-time, streaming-media, sensor, and network applications for embedded systems can be achieved by programming a massively parallel processor array (MPPA) with software rather than using the hardware design required for ASICs and FPGAs. But a fundamental problem is how to productively program and validate a complex, irregular application on hundreds of processors. With conventional multiprocessing techniques, keeping processors busy, communicating, and synchronized is difficult and prone to failure, and won’t scale up to hundreds of CPUs.

Traditional architectures, such as CPUs and DSPs, are reaching limits in ease of development, performance, power and scalability. The most visible example is the industry's transition from frequency scaling, with its exponentially escalating problems, to scaling by adding multiple cores per die [5].

Single CPUs and DSPs can no longer keep pace with the performance imperative of Moore's Law by pursuing ever-higher clock speeds or more elaborate implementation architectures, because both approaches suffer from diminishing returns.

Conventional multi-core processors won't scale for long, especially for embedded systems. Non-determinism, thread complexity, shared-memory bottlenecks, and explicit developer-managed synchronization all place fundamental limits on this approach [4].

ASICs and high-end FPGAs are getting harder and more expensive to develop. Global timing convergence, million-dollar plus ASIC NREs, FPGA static power issues and the RTL hardware design productivity gap between available and delivered ASIC gates [6] are among the growing difficulties.

It is better to spend transistors on supporting a more practical parallel programming model. It has been common for chip vendors to build parallel hardware architectures without adequate consideration of the programming model.

Structural Object Programming Model

The MPPA's structural object programming model [1,2,3] is the foundation of how it all works:
img1
An MPPA chip has an array of hundreds of 32-bit RISC processors that are each programmed with ordinary software, and hundreds of distributed memories. These are called objects because they obey strict encapsulation rules enforced in hardware. Objects simply run independently on their own parallel hardware, not shared or multi-threaded or virtualized. Structural objects are strictly encapsulated, execute with no side effects on one another, and have no implicitly shared memory.

Objects communicate through a simple parallel structure of hardware channels. Each channel is word-wide, unidirectional, point-to-point from one object to another, and acts like a FIFO buffer. Channels carry both data and control tokens, in simple or structured messages. Channel hardware synchronizes the objects at each end dynamically, as needed at run time, rather than on a schedule fixed at compile time.

Inter-processor communication and synchronization are straightforward. Sending or receiving a word on a channel is as simple as reading or writing a processor register. Sending a word through a channel is both a communication and a synchronization event. This keeps objects in step with each other in a common-sense way. Since channels synchronize transparently and locally, the application achieves high performance without complex global synchronization.
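A rough software analogy of this model uses threads as objects and bounded queues as channels. The sketch below is an assumption-laden illustration, not Ambric code: the `END` token, the function names and the one-word queue depth are all hypothetical. A full queue stalls the sender and an empty queue stalls the receiver, so each transfer is both a communication and a synchronization event.

```python
import queue
import threading

END = object()  # hypothetical end-of-stream control token

def producer(out_ch: queue.Queue) -> None:
    """Leaf object: plain sequential code that emits five data words."""
    for value in range(5):
        out_ch.put(value)      # blocks while the channel is full
    out_ch.put(END)

def squarer(in_ch: queue.Queue, out_ch: queue.Queue) -> None:
    """Leaf object: reads a word, squares it, and forwards the result."""
    while True:
        token = in_ch.get()    # blocks while the channel is empty
        if token is END:
            out_ch.put(END)    # pass the control token downstream
            return
        out_ch.put(token * token)

def run_pipeline() -> list:
    """Structure: producer -> squarer -> collector, joined by two channels."""
    a = queue.Queue(maxsize=1)
    b = queue.Queue(maxsize=1)
    threads = [
        threading.Thread(target=producer, args=(a,)),
        threading.Thread(target=squarer, args=(a, b)),
    ]
    for t in threads:
        t.start()
    results = []
    while True:
        token = b.get()
        if token is END:
            break
        results.append(token)
    for t in threads:
        t.join()
    return results

results = run_pipeline()  # [0, 1, 4, 9, 16]
```

Neither object contains any locking or scheduling code; ordering follows entirely from the blocking channel operations.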

Article: Accelerating High-Performance Embedded Applications Using a Massively Parallel Processing Array, Parts I & II
— by Laurent Bonetto, Technical Marketing Engineer, Ambric, Inc., April 2008

The result is much easier and cheaper development, the performance and power advantages of massive parallelism, and long-term scalability.

Registers and Channels

How these channels are built is the key to putting this programming model on a chip. Instead of ordinary synchronous registers, Ambric registers are used throughout the chip. A chain of Ambric registers is called an Ambric channel. Ambric channels are a fully encapsulated, fully scalable technology for passing control and data between objects. Channel stages are entirely local, and there are no wires longer than one stage. Ambric channels dynamically accommodate varying delays and workloads on their own. Changing the length of a channel has no effect on functionality; only its latency changes.
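The claim that channel length affects only latency can be checked with a small behavioral model. The sketch below is not Ambric's register circuit; it assumes each stage holds at most one word and forwards it only when the next stage is empty, and the function name and the `consumer_ready` parameter are hypothetical.

```python
from typing import List, Optional, Tuple

def simulate_channel(words: List[int], stages: int,
                     consumer_ready: List[bool]) -> Tuple[List[int], int]:
    """Push words through a chain of one-word register stages.

    consumer_ready[c] says whether the receiver accepts a word in cycle c
    (cycles beyond the list are treated as ready). Returns the words received,
    in order, and the cycle in which the first word arrived.
    """
    regs: List[Optional[int]] = [None] * stages
    pending = list(words)
    received: List[int] = []
    first_arrival = -1
    cycle = 0
    while len(received) < len(words):
        ready = consumer_ready[cycle] if cycle < len(consumer_ready) else True
        # Output end: hand a word to the receiver only if it accepts.
        if regs[-1] is not None and ready:
            if not received:
                first_arrival = cycle
            received.append(regs[-1])
            regs[-1] = None
        # Each stage forwards to its neighbour only if that neighbour is empty.
        for i in range(stages - 1, 0, -1):
            if regs[i] is None and regs[i - 1] is not None:
                regs[i], regs[i - 1] = regs[i - 1], None
        # Input end: the sender stalls while the first stage is occupied.
        if regs[0] is None and pending:
            regs[0] = pending.pop(0)
        cycle += 1
    return received, first_arrival

short_out, short_latency = simulate_channel([1, 2, 3, 4], stages=2, consumer_ready=[])
long_out, long_latency = simulate_channel([1, 2, 3, 4], stages=8, consumer_ready=[])
stalled_out, _ = simulate_channel([1, 2, 3, 4], stages=3,
                                  consumer_ready=[True, True, True, False, False, True])
```

In this model the two-stage and eight-stage channels deliver the same words in the same order, with the first word arriving in cycle 2 and cycle 8 respectively, and a receiver that stalls for two cycles still sees the identical sequence.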

Ambric processors are interconnected in hardware through Ambric channels. Each processor runs independently, responding only to its own local channels. Because the system is globally asynchronous, local changes have only local effects. This makes it possible to change the clock speed of each processor independently at runtime, to minimize power.

Ambric channels are what make this possible: self-synchronizing communication with local buffering, in an asynchronous system that is practical to program, fast, and scalable.

Processor Architecture

This programming model motivates a reassessment of the assumptions that today's processors have always been based on.
Traditional von Neumann architecture reads and writes variables in a memory space, bottlenecked through a register-memory hierarchy. Communication between processors is an afterthought. Efficient handling of streaming data such as media processing or packet processing is not inherent in conventional machines.

The objective of Ambric processor architecture is processing data and control from channels. Memories are encapsulated objects like any others, read and written using channels. Instruction streams arrive through channels. So channel communication is a first-class feature of Ambric's architecture.

img2

This makes for very lightweight 32-bit RISC CPUs, which treat channels just like general registers. Every datapath is a self-synchronizing Ambric channel. Data and control tokens keep moving, so RAMs get used more for buffering, rather than as a static global workspace.

All Ambric processors can execute an operation, do a loop iteration, input from channels, and output to a channel every cycle. That means a processor running a single-cycle loop delivers performance equivalent to a configurable hardware block, like those in an FPGA. But unlike FPGAs, the full range of performance versus complexity between pure hardware and ordinary software is easily available, through software programming rather than complex, globally synchronous RTL-based hardware design.

Chip Architecture: Compute Unit, RAM Unit

In Ambric's chip architecture, a cluster of four processors is a Compute Unit (or CU). It has two types of processors:

The SRD processor is a 32-bit streaming RISC processor with DSP extensions for math-intensive processing. It has local memory for its 32-bit instructions, and can directly execute more code from the RAM Unit next door.

The SR processor is a simpler 32-bit streaming RISC CPU, used mainly for managing channel traffic, generating complex address streams, and other utility tasks which enable sustained high throughput for the SRDs.

The RAM Units (or RUs) are the main on-chip memories that can stream addresses and data over channels.
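The division of labor between the SR, the RU and the SRD can be pictured with Python generators standing in for channels. This is a loose sketch under stated assumptions: the function names, the row-major block access pattern and the memory contents are hypothetical, and real RUs and processors are configured very differently.

```python
from typing import Iterable, Iterator, List

def address_generator(base: int, rows: int, cols: int, stride: int) -> Iterator[int]:
    """SR-style utility task: emit a 2-D block's addresses in row-major order."""
    for r in range(rows):
        for c in range(cols):
            yield base + r * stride + c

def ram_unit(memory: List[int], addresses: Iterable[int]) -> Iterator[int]:
    """RU-style object: for each address that arrives, stream back the stored word."""
    for addr in addresses:
        yield memory[addr]

def accumulate(words: Iterable[int]) -> int:
    """SRD-style consumer: a running sum standing in for math-intensive work."""
    total = 0
    for w in words:
        total += w
    return total

memory = list(range(100, 164))  # 64 words, viewed as an 8x8 array
addresses = list(address_generator(base=9, rows=2, cols=3, stride=8))
block_sum = accumulate(ram_unit(memory, addresses))  # 684
```

The consumer never computes an address; it only sees a stream of data words, which is the point of offloading address generation to a simpler processor.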

img3

Chip Architecture: Brics and Interconnect

CUs and RUs are combined into the top-level physical building-block called a bric. The core of the chip is assembled just by stacking up brics. The number of brics serves as a measure of compute-capacity.

Each bric has two CU-RU pairs, totaling 8 CPUs and 21 KBytes of SRAM.

Brics connect by abutment through channels that cross bric-to-bric. The CUs and RUs are arranged so that, in the array of brics, there are contiguous CUs and contiguous RUs.

To optimize the cost and performance of each channel, the chip's configurable interconnect is organized in several levels of hierarchy. These channels are word-wide and run at up to 10 Gigabits per second.

These bric-long channels are the longest signals in the core, except for a low-power clock tree and the reset. So this physical architecture is very scalable going forward.

Chip Implementation

The core of the Am2045 has 45 brics in an array, containing a total of 336 32-bit SR and SRD processors running at 350 MHz, plus 7.1 Mbits of distributed SRAM. At full speed, all processors together are capable of 1.2 trillion operations per second, over one teraOPS. This performance is supported by the interconnect's 792 gigabit per second bisection bandwidth, 26 Gbps of off-chip DDR2 memory bandwidth, PCI Express at an effective 8 Gbps each way, and up to 13 Gbps of parallel general-purpose I/O.

Programming Model and Tools

With the Ambric programming model, developing applications for this chip is straightforward: no "magic compilers," no scheduling, no synthesis.

The aDesigner integrated development environment (IDE) based on Eclipse — an open development platform from the Eclipse Foundation (http://www.eclipse.org) — lets objects be written in a subset of standard Java or assembly code or be loaded from libraries; the structure is defined with graphical block diagrams or a text-based structural language. An application simulator is available in the IDE. Objects and structure are automatically compiled, placed, and routed onto the Am2045. This generates a configuration file, which is loaded into the chip at runtime. Symbolic source-level parallel debugging and performance monitoring, in real time on the chip, is available through the IDE.

img4

The developer starts by describing the application as a parallel structure of objects, and the data and control messages they send and receive. The process is to divide and conquer the application hierarchically, defining composite objects according to higher-level functional blocks, which is a very good match to the way developers think about an application. All the parallelism is encoded in the structure, leaving the leaf objects sequential.

Since Ambric objects, even large composite objects, are strictly encapsulated with simple common channel interfaces, they are easily reused. Once validated, encapsulation protects them, so they maintain correct behavior, with no need for re-validation.
Application-specific leaf objects are written in ordinary sequential code in the subset of a standard high-level language and/or assembly code, and compiled normally.

A functional simulator is available to run and do initial debugging in a software testbench environment on the desktop.
Finally the realization tool chain auto-maps all objects onto CUs and RUs, auto-routes the channels, and creates a configuration file for the chip. At runtime the chip is configured by a host or by itself from flash, similar to FPGAs.

Runtime debugging and performance or power tuning in the real system on real data at full speed is vital. Unused processors, memories and channels are straightforward to use for real-time debug - just by forking and copying channel traffic. The chip also includes a separate dedicated debug network. The developer can halt, step, restart processors with the usual debug features, plus observe or trap on channel events.
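Debugging by forking channel traffic can be illustrated with a "tap" object placed on a channel. The sketch below uses Python generators as stand-in channels; the `tap`, `scale` and `offset` objects are hypothetical examples, not part of the Ambric tools.

```python
from typing import Iterable, Iterator, List

def scale(channel: Iterable[int], factor: int) -> Iterator[int]:
    """First application object: multiply each word."""
    for word in channel:
        yield word * factor

def offset(channel: Iterable[int], delta: int) -> Iterator[int]:
    """Second application object: add a constant to each word."""
    for word in channel:
        yield word + delta

def tap(channel: Iterable[int], debug_log: List[int]) -> Iterator[int]:
    """Fork object: copy each word to a debug sink and pass it on unchanged."""
    for word in channel:
        debug_log.append(word)
        yield word

source = [1, 2, 3]
debug_log: List[int] = []

plain = list(offset(scale(source, 3), 1))
tapped = list(offset(tap(scale(source, 3), debug_log), 1))
```

The tapped pipeline produces the same output as the plain one (`[4, 7, 10]`), while `debug_log` captures the intermediate traffic (`[3, 6, 9]`) without either application object being modified.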

Performance Metrics

Ambric defined this programming model, architecture and tools to get very high performance from a massively parallel chip. Here are the results.

These execution examples compare the Am2045 with TI's 90nm high-end fixed-point VLIW DSP, and a large 90nm Xilinx FPGA.

 

                                        | Am2045 45-bric Chip              | TI C641x DSP | Xilinx Virtex-4 LX100 - LX200
Process                                 | 130nm                            | 90nm         | 90nm
Clock speed                             | 350 MHz                          | 1,000 MHz    | 500 MHz nominal
Published DSP benchmarks                | 10-25X throughput, 1/3 the code  | 1X           | n/a
Multiply-accum./sec. (16x16 to 32-bit)  | 60 GMACS                         | 4 GMACS      | 48 GMACS

Ambric's DSP benchmark was created by implementing the same functionality as five published TI benchmark kernels. The Am2045 delivers a range of 10 to 25 times greater throughput when extrapolated to the chip-level. Its code is 1/3 the size and much less complex, without all the VLIW-style setup and teardown. Developing the benchmarks for this chip took one field application engineer a day and a half.

Am2045’s multiply-accumulate throughput is superior to 1 GHz 90nm DSPs and general-purpose logic-oriented high-end 90nm FPGAs. (Some DSP-centric FPGAs have larger nominal MAC ratings but offer only limited capacity for application logic.)

Big numbers on raw throughput and small benchmarks are great, but what about real applications?

Application Example: Motion Estimation

Consider a video encoding application, created for a customer benchmark: real-time Motion Estimation across two reference frames of broadcast-quality 720p high-definition video. Motion Estimation is the most compute-intensive part of video compression. This benchmark does a full, exhaustive search, with the sum of absolute differences (SAD) between pixels as the best-match criterion. (Commercial motion estimators in silicon or software typically use far less rigorous and computationally cheaper search methods.)
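For readers unfamiliar with the technique, exhaustive-search motion estimation with a SAD criterion can be sketched as follows. This is a generic textbook formulation in plain Python, not the benchmark's implementation; the function names, the frame representation (lists of pixel rows) and the search-range parameter are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Frame = List[List[int]]  # rows of pixel values

def sad(cur: Frame, ref: Frame, cx: int, cy: int, rx: int, ry: int, n: int) -> int:
    """Sum of absolute differences between two n-by-n blocks."""
    return sum(abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
               for j in range(n) for i in range(n))

def full_search(cur: Frame, ref: Frame, cx: int, cy: int,
                n: int = 16, search: int = 8) -> Tuple[Tuple[int, int], Optional[int]]:
    """Try every displacement within +/-search pixels; return the best (dx, dy) and its SAD."""
    height, width = len(ref), len(ref[0])
    best_mv: Tuple[int, int] = (0, 0)
    best_sad: Optional[int] = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates whose block would fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            if best_sad is None or cost < best_sad:
                best_mv, best_sad = (dx, dy), cost
    return best_mv, best_sad

# Small synthetic check: the current frame is the reference displaced by (2, -1).
ref_frame = [[16 * y + x for x in range(16)] for y in range(16)]
cur_frame = [[ref_frame[(y - 1) % 16][(x + 2) % 16] for x in range(16)] for y in range(16)]
motion_vector, cost = full_search(cur_frame, ref_frame, cx=6, cy=6, n=4, search=2)
```

The cost grows with the block area times the number of candidate displacements, and every candidate is independent of the others, which is why the search divides so naturally across many processors.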

The first step in the algorithm is to take in the frames to compare along with candidate motion vectors. The frames are buffered in off-chip DRAM, from which the individual 16-by-16-pixel macroblocks are read and processed.

The search is done in parallel by a set of identical Motion Estimation objects that each handle a different region. Data and results are streamed down a pair of channels, one for pixels and the other for results. This is the top level of a hierarchical design.

Each of those ME units is a composite object, consisting of 4 Calculation objects which each process one macroblock at a time, and a block to collect and choose the best results. Each of those Calculation objects is another composite object assembled from leaf objects that run on the processors and memories.
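The hierarchy described here can be written down as a small tree of composite and leaf objects. The sketch below only models the structure; the leaf names inside each Calculation object and the number of ME units are hypothetical, since the text specifies only that an ME unit contains four Calculation objects and a collector.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Obj:
    """A structural object: a leaf if it has no children, otherwise a composite."""
    name: str
    children: List["Obj"] = field(default_factory=list)

    def leaves(self) -> List["Obj"]:
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

def calculation(index: int) -> Obj:
    # Hypothetical leaf breakdown: one memory object and one processor object.
    return Obj(f"calc{index}", [Obj("pixel_buffer"), Obj("sad_engine")])

def me_unit(index: int) -> Obj:
    # From the text: four Calculation objects plus a block that collects results.
    return Obj(f"me{index}", [calculation(i) for i in range(4)] + [Obj("collector")])

# Hypothetical top level with three ME units, one per search region.
top = Obj("motion_estimator", [me_unit(k) for k in range(3)])
leaf_count = len(top.leaves())  # 3 * (4 * 2 + 1) = 27
```

Only the leaves map onto processors and memories; the composites exist purely to organize the design, so replicating an ME unit adds parallel hardware without changing any leaf code.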

img5

The motion estimator for a single reference frame is shown here; the full implementation has two of these which both fit in the Am2045 chip, using 89% of the brics. They only need to run at 300 MHz. Actual sustained performance is 0.46 trillion operations per second, which is over half of the peak theoretical performance available from these brics at this clock rate.
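The "over half of peak" figure can be reproduced from the numbers quoted in this document, assuming that peak throughput scales linearly with both clock rate and the fraction of brics used. That linear scaling is an assumption of this sketch, not a statement from the source.

```python
peak_ops_full_chip = 1.2e12  # ops/s at 350 MHz with all brics (chip figure quoted earlier)
full_clock_hz = 350e6
used_clock_hz = 300e6        # clock rate needed by the motion estimator
bric_fraction = 0.89         # share of brics used by the two estimators
sustained_ops = 0.46e12      # reported sustained ops/s

# Peak available from the brics actually used, at the clock actually used.
peak_available = peak_ops_full_chip * bric_fraction * (used_clock_hz / full_clock_hz)
utilization = sustained_ops / peak_available
```

Under that assumption the available peak is roughly 0.92 trillion operations per second, and the sustained rate comes to just over 50% of it.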

This shows how the architecture and programming model succeed in delivering very high performance on real applications in a programmable chip.

Scalability

Ambric set out to overcome fundamental barriers to scalability in development cost and difficulty, timing, and power.

Our programming model's object-based modularity makes much greater design reuse possible. Its simple combination of objects written in normal software code, combined in a hierarchy of block-diagram structures, makes high-performance design development much easier and cheaper.

Our asynchronous system of processors, memories and self-synchronizing channels, with local synchronous clocking and no long wires, is very scalable into future process generations.

Conventional techniques increase power far out of proportion to performance increase. With parallel processors, performance per watt stays relatively constant. The Ambric chip and its programming model make that massive parallelism practical.

img6

Starting from today's 1 teraOPS 130nm Am2045, long term scalability in more than one direction is available.

Increased performance at the same area and cost will come from process scaling and more custom implementation, leading to 65nm parts with over 1,000 processors and over 4 teraOPS.

Scaling in the other direction is also possible: constant performance in a smaller area, for applications that need lower cost and energy, analogous to the low-cost FPGA and DSP families.

Massively Parallel Processing Arrays

The Ambric architecture is a member of an emerging class of parallel chips called Massively Parallel Processor Arrays. MPPAs are distinguished from "multi-core" conventional processors, which have only a few processors and a shared-memory architecture, by having massive parallelism of at least hundreds of processing elements and distributed memories, and a rich word-wide flexible interconnect fabric.

Conclusions

Ambric has found practical solutions to the architectural and programming challenges that have stood in the way of massively parallel embedded computing. The result delivers extremely high performance, is silicon- and software-scalable over the long term, and requires only reasonable, software-only application development effort.

It's not enough to put lots of processors on a chip without thinking about how they're programmed. We started with our Structural Object Programming Model, which scales without limit, and designed our silicon to enable it.

Its practical access to massive parallelism realizes an order-of-magnitude increase in the throughput available from a programmable chip in a given silicon process and area.

Its modularity, low-power parallelism, and local timing mean Ambric's object-based technology can continue to track Moore's Law for many process generations to come.

References

[1] Michael Butts, A. M. Jones, Paul Wasson. A Structural Object Programming Model, Architecture, Chip and Tools for Reconfigurable Computing. Proc. IEEE Symposium on Field-Configurable Custom Computing Machines (FCCM) 2007, pp. 55-64.

[2] Michael Butts. Synchronization through Communication in a Massively Parallel Processor Array. IEEE Micro, vol. 27 no. 5, pp. 32-40, Sept./Oct. 2007.

[3] A. M. Jones, Michael Butts. TeraOPS Hardware: A New Massively-Parallel MIMD Computing Fabric IC. IEEE Hot Chips Symposium, August 2006.

[4] Edward A. Lee, "The Problem with Threads," in IEEE Computer, 39(5):33-42, May 2006.

[5] Jan M. Rabaey, "Design at the End of the Silicon Roadmap," Keynote Presentation, ASPDAC 2005, Shanghai, January 2005.

[6] Gary Smith, "The Crisis of Complexity," Dataquest briefing, 40th Design Automation Conference, 2003.