Am2045 MPPA Technology Backgrounder
The Am2045 architecture is a member of a class of massively parallel chips called Massively Parallel Processor Arrays, or MPPAs. GPUs are also considered MPPAs. MPPA is a generic name for a category of chip, much like FPGA, ASIC, or DSP, so Nethra does not use MPPA as a brand name.
MPPAs are distinguished from 'multi-core' general-purpose processors. These conventional multi-core devices typically have only a few processors and a shared-memory, shared-bus architecture. Most multi-core devices have been aimed at general-purpose computing, for running large existing applications and complex operating systems. By contrast, MPPAs employ massive parallelism of at least hundreds of peer-to-peer processing elements such as complete processors, ALUs, finite state machines and distributed memories, and a rich word-wide flexible interconnect fabric which can be statically, or sometimes dynamically, reconfigured. MPPAs are usually aimed at embedded computing, where they run high-performance dedicated applications such as signal-, media- or network processing, which have strict cost, power and real-time requirements.
Starting with some standard semiconductor market categorization from Gartner Group, here is a diagram showing where MPPAs fit, though they also are used instead of, and along with, FPGAs in most cases.
How do MPPAs differ from each other? To start with, there are two classes of MPPA chips, SIMD (single instruction, multiple data) and MIMD (multiple instructions, multiple data).
SIMD MPPAs have only a single instruction stream (or very few), running the processors the same way on similar data. Some SIMD chips can mask off a few different parts of the array at runtime, but the model is still basically SIMD. GPUs are SIMT, where the T is "thread". GPUs are great for primarily feed-forward algorithms but may have great difficulty expressing complex programs with feedback loops, such as the CABAC bit-compression algorithm of the H.264 standard. Even the best GPU programmers cannot express this complex algorithm in a SIMT GPU, let alone in even more limited SIMD devices.
SIMD is very efficient for simple DSP filters and other very regular, feed-forward dataflow vector processing. It used to be good for media and network processing, but modern applications like H.264 video compression and adaptive network intrusion detection get more powerful by getting more complex and less regular. H.264 video codecs for example have many components with extremely complex looping, forward and backward data dependencies and highly variable, data-dependent compute-times, which are ill-suited to SIMD/SIMT MPPAs.
MIMD MPPAs are much more capable. Every processor can have its own parallel instruction stream running on its own data. (A MIMD MPPA running the same code everywhere becomes SIMD). MIMD's diversity of parallel control is most effective for today's applications. It's good for many data structures, not just vectors – its processors can stay busy on varied types of data and data sizes. However, previous MIMD MPPA chips haven't been very practical to program, requiring the developer to deal explicitly with synchronization, and use hardware-type, timing-sensitive languages and tools, or they depend on 'system compilers' that are often regarded as open-ended research projects – as open-ended as artificial intelligence.
The Ambric chip enables a practical MIMD-parallel programming model, with innovative asynchronous channels that remove the global synchronization problem. Explicit task scheduling is simply not necessary. Complex global state machines don't have to be coded and debugged. Applications are built as block-diagram structures of objects, each written as single-threaded, sequential code in a standard high-level language. Such structures are reasonable and practical to develop, and they handle even the most complex applications.
For more information, refer to the Massively Parallel Processor Array (MPPA) entry at Wikipedia.
Embedded Computing Problem
Complex high-bandwidth, real-time, streaming-media, sensor, and network applications for embedded systems can be achieved by programming a massively parallel processor array (MPPA) with software rather than using the hardware design required for ASICs and FPGAs. But a fundamental problem is how to productively program and validate a complex, irregular application on hundreds of processors. With conventional multiprocessing techniques, keeping processors busy, communicating, and synchronized is difficult and prone to failure, and cannot scale up to hundreds of CPUs.
Traditional architectures, such as CPUs and DSPs, are reaching limits in ease of development, performance, power and scalability. The most visible example is the industry's transition from frequency scaling, with its exponentially escalating problems, to scaling by adding multiple cores per die [5].
Single CPUs and DSPs can no longer keep pace with the performance imperative of Moore's Law by trying ever-higher clock speeds or more elaborate implementation architectures, because both approaches suffer from diminishing returns.
Conventional multi-core processors won't scale for long, especially for embedded systems. Non-determinism, thread complexity, shared-memory bottlenecks, and explicit developer-managed synchronization all place fundamental limits on this approach [4].
ASICs and high-end FPGAs are getting harder and more expensive to develop. Global timing convergence, million-dollar plus ASIC NREs, FPGA static power issues and the RTL hardware design productivity gap between available and delivered ASIC gates [6] are among the growing difficulties.
It is better to spend transistors on supporting a more practical parallel programming model. It has been a common pitfall for chip vendors to build parallel hardware architectures without adequate consideration of the programming model. Nethra avoided this pitfall by focusing on the programming model first, then inventing the circuits and silicon to support it.
An MPPA application is developed by expressing it as a hierarchical block diagram or workflow, whose basic, decomposed tasks run in parallel, each on their own processor. Likewise, large data objects may be broken up and distributed into local memories with parallel access. Objects communicate over a massively parallel mesh structure of circuit-switched channels. The objective is to maximize aggregate throughput while minimizing local latency, optimizing performance and efficiency. The Nethra MPPA's model of computation is similar to a Kahn process network with bounded buffers, or to Communicating Sequential Processes (CSP).
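To make the process-network idea concrete, here is a minimal sketch in ordinary desktop Java, not code for the chip. It assumes that threads can stand in for processors and that `ArrayBlockingQueue` can stand in for a bounded hardware channel. The class name, the `END` token and the squaring stage are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class ProcessNetworkSketch {
    // Hypothetical end-of-stream token so the demo can terminate.
    static final int END = Integer.MIN_VALUE;

    // One sequential stage: read a word, square it, pass it on.
    static void square(BlockingQueue<Integer> in, BlockingQueue<Integer> out)
            throws InterruptedException {
        for (int v = in.take(); v != END; v = in.take()) {
            out.put(v * v); // blocks while the bounded channel is full
        }
        out.put(END);
    }

    // Wires source -> square -> sink with two bounded channels of the given depth.
    static List<Integer> run(int count, int depth) {
        BlockingQueue<Integer> a = new ArrayBlockingQueue<>(depth);
        BlockingQueue<Integer> b = new ArrayBlockingQueue<>(depth);
        List<Integer> results = new ArrayList<>();

        Thread source = new Thread(() -> {
            try {
                for (int i = 1; i <= count; i++) a.put(i);
                a.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread worker = new Thread(() -> {
            try {
                square(a, b);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        source.start();
        worker.start();
        try {
            // The calling thread plays the role of the sink object.
            for (int v = b.take(); v != END; v = b.take()) results.add(v);
            source.join();
            worker.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(run(5, 2)); // prints [1, 4, 9, 16, 25]
    }
}
```

Because each stage only reads and writes its own channels, the output is the same for any channel depth and any thread scheduling. That determinism is the property that makes this style of model attractive.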
Structural Object Programming Model
The MPPA's structural component programming model [1,2,3] is the foundation of how it all works:

A Nethra MPPA chip has an array of hundreds of 32-bit RISC processors that are each programmed with ordinary software, and hundreds of distributed memories. The tasks running on the cores are called components, or objects, because they obey strict encapsulation rules enforced in the processor-interconnect hardware. Objects simply run independently on their own parallel hardware, not shared or multi-threaded or virtualized. Structural objects are strictly encapsulated, execute with no side effects on one another, and have no implicitly shared memory (with the exception of external SDRAMs for bulk storage such as video frames).
The components communicate through a simple parallel structure of circuit-switched hardware channels. Each channel is word-wide, unidirectional, point-to-point from one object to another, and acts like a FIFO-buffer, though many buffer types can be supported. Channels carry both data and control tokens, in simple or structured messages. Channel hardware synchronizes its objects at each end, dynamically as needed at run time, not scheduled at compile time.
Inter-processor communication and synchronization are straightforward. Sending or receiving a word on a channel is as simple as reading or writing a processor register. Sending a word through a channel is both a communication and a synchronization event. This keeps objects in step with each other in a common-sense way. Since channels synchronize transparently and locally, the application achieves high performance without complex global synchronization.
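As a rough desktop illustration of synchronization through communication, the sketch below connects two objects with a request channel and a reply channel. It is not chip code: a thread stands in for a processor, a depth-1 `ArrayBlockingQueue` stands in for a hardware channel, and the class and method names are hypothetical. Note that there are no locks or shared variables. The blocking `put` and `take` calls are the only thing keeping the two objects in step.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class LockstepSketch {
    // Sends each input to a running-sum object and collects the totals it sends back.
    static int[] runningTotals(int[] inputs) {
        BlockingQueue<Integer> request = new ArrayBlockingQueue<>(1);
        BlockingQueue<Integer> reply = new ArrayBlockingQueue<>(1);

        Thread accumulator = new Thread(() -> {
            int total = 0; // private state, never shared with the caller
            try {
                for (int i = 0; i < inputs.length; i++) {
                    total += request.take(); // waits until a word arrives
                    reply.put(total);        // waits until the reply slot is free
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        accumulator.start();

        int[] totals = new int[inputs.length];
        try {
            for (int i = 0; i < inputs.length; i++) {
                request.put(inputs[i]);   // communication and synchronization in one step
                totals[i] = reply.take(); // the caller cannot run ahead of the accumulator
            }
            accumulator.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted", e);
        }
        return totals;
    }

    public static void main(String[] args) {
        int[] totals = runningTotals(new int[] {1, 2, 3, 4});
        System.out.println(java.util.Arrays.toString(totals)); // prints [1, 3, 6, 10]
    }
}
```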
The result is much easier and cheaper development, the performance and power advantages of massive parallelism, and long-term scalability.
Registers and Channels
How these channels are built is the key to putting this programming model on a chip. Instead of ordinary synchronous registers, special Ambric-architecture registers are used throughout the chip. A chain of Ambric-architecture registers is called an Ambric-channel. Ambric-channels are fully encapsulated, fully scalable technology for passing both control and data between objects. Channel stages are entirely local. There are no wires longer than one stage. Ambric channels dynamically accommodate varying delays and workloads on their own. Changing the length of a channel has no effect on functionality, only its latency changes.
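The claim that channel length affects only latency can be checked with a small cycle-by-cycle simulation. The sketch below is a simplified model, not the actual register circuit. It assumes each stage holds at most one word and forwards it only when the next stage is empty, and that the consumer accepts a word only every `consumerPeriod` cycles to imitate a varying workload. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class RegisterChainSketch {
    /** Outcome of one simulated run. */
    static final class Run {
        final List<Integer> delivered = new ArrayList<>();
        int firstDeliveryCycle = -1;
    }

    // Simulates a chain of one-word stages carrying the words 0..words-1.
    static Run simulate(int stages, int words, int consumerPeriod) {
        Integer[] reg = new Integer[stages]; // null means the stage is empty
        Run run = new Run();
        int nextWord = 0;
        int last = stages - 1;
        for (int cycle = 0; run.delivered.size() < words; cycle++) {
            // Tail stage: hand over a word only if the consumer is ready this cycle.
            if (reg[last] != null && cycle % consumerPeriod == 0) {
                if (run.firstDeliveryCycle < 0) run.firstDeliveryCycle = cycle;
                run.delivered.add(reg[last]);
                reg[last] = null;
            }
            // Every other stage decides locally: move forward if the next stage is empty.
            for (int i = last; i > 0; i--) {
                if (reg[i] == null && reg[i - 1] != null) {
                    reg[i] = reg[i - 1];
                    reg[i - 1] = null;
                }
            }
            // The producer writes whenever the first stage is free, otherwise it stalls.
            if (reg[0] == null && nextWord < words) {
                reg[0] = nextWord++;
            }
        }
        return run;
    }

    public static void main(String[] args) {
        Run shortChain = simulate(1, 5, 3);
        Run longChain = simulate(8, 5, 3);
        // Same words in the same order, regardless of chain length.
        System.out.println(shortChain.delivered.equals(longChain.delivered)); // prints true
        // With an always-ready consumer, the first word arrives after one cycle per stage.
        System.out.println(simulate(1, 5, 1).firstDeliveryCycle); // prints 1
        System.out.println(simulate(8, 5, 1).firstDeliveryCycle); // prints 8
    }
}
```

In this model a slow consumer simply causes words to queue up in the stages behind it, and the producer stalls once the first stage is occupied. No stage needs to know anything beyond the state of its immediate neighbor.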
Ambric-architecture processors are interconnected in hardware through Ambric-channels. Each processor runs independently, responding only to its own local channels. This is why local software changes have only local effects, because it's a globally-asynchronous system. This makes it possible to change the clock speed of each processor independently at runtime, to minimize power. It enables enhancements and bug-fixes to be isolated in their side-effects.
Ambric-channels are what make it possible to have self-synchronizing channels with local buffering, in an asynchronous system that is practical to program, fast, and scalable.
Processor Architecture
This programming model motivates a reassessment of the assumptions that today's processors have always been based on.
Traditional von Neumann architecture reads and writes variables in a memory space, bottlenecked through a register-memory hierarchy. Communication between processors is just an afterthought. Efficient handling of streaming data such as media processing or packet processing is not inherent in conventional machines.
The objective of Ambric processor architecture is processing data and control from channels. Memories are encapsulated objects like any others, read and written using channels. Instruction streams arrive through channels. So channel communication is a first-class feature of Nethra's architecture.

This makes for very lightweight 32-bit RISC CPUs, which treat channels just like general registers. Every data-path is a self-synchronizing Ambric channel. Data and control tokens keep moving, so RAMs get used more for buffering, rather than as a static global workspace.
All Ambric processors can: execute an operation, do a loop iteration, input from channels, and output to a channel - every cycle. That means a processor running a single-cycle loop is equivalent in performance to a configurable hardware block, like in an FPGA. But unlike FPGAs, the full range of performance versus complexity between pure hardware and ordinary software is easily available, and with software programming, instead of complex, globally-synchronous RTL-based hardware design.
Chip Architecture: Compute Unit, RAM Unit
In Nethra's Ambric-architecture, a cluster of four processors is a Compute Unit (or CU). It has two types of processors:
The SRD processor is a 32-bit streaming RISC processor with DSP extensions for math-intensive processing. It has local memory for its 32-bit instructions, and can directly execute more code from the RAM Units next to it.
The SR processor is a simpler 32-bit streaming RISC CPU, used mainly for managing channel traffic, generating complex address streams, managing complex buffers, and other utility tasks which enable sustained high throughput for the SRDs.
The RAM Units (or RUs) are the distributed on-chip memories that can stream addresses and data over channels.

Chip Architecture: Brics and Interconnect
CUs and RUs are combined into the top-level physical building-block called a bric. The core of the chip is assembled simply by stacking up brics with interconnect by abutment. The number of compute brics serves as a measure of compute-capacity.
Each bric has two CU-RU pairs, totaling 8 CPUs and 21 KBytes of SRAM.
Brics connect by abutment through channels that cross bric-to-bric. The CUs and RUs are arranged so that, in the array of brics, there are contiguous CUs and contiguous RUs.
To optimize the cost and performance of each channel, the chip's configurable interconnect is hierarchical with several levels of hierarchy. These channels are word-wide and run at up to 9.6 Gigabits per second.
These bric-long channels are the longest signals in the core array, except for a low-power clock tree, low-power debug network and the reset. So this physical architecture is very scalable going forward, with a credible path to thousands of cores per chip.
Chip Implementation
The core array has 42 compute brics, containing a total of 336 32-bit SR and SRD processors running at 300 MHz and 7.1 Mbits of distributed SRAM. (Other brics are allocated to DDR controllers, for example.) At full speed, all processors together are capable of 1.03 trillion operations per second, over one teraOPS. This performance is supported by the chip's 792 Gbps bisection bandwidth, 26 Gbps of off-chip DDR2 memory bandwidth, an on-chip 4-lane bi-directional PCI Express interface, and up to 13 Gbps of parallel general-purpose I/O (GPIO), commonly used to integrate with a system FPGA and peripheral I/O devices such as HD-SDI video transceivers or A2D devices.
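Several of the figures quoted above follow from the per-bric numbers, and the short calculation below checks them. It assumes that "word-wide" means one 32-bit word per clock at 300 MHz, and that 1 Mbit is taken as 1000 kbits. Both are our reading of the figures, not stated definitions. The 1.03 teraOPS figure is deliberately not derived here, because how operations are counted per cycle is not specified.

```java
final class Am2045Arithmetic {
    static final int COMPUTE_BRICS = 42;
    static final int CPUS_PER_BRIC = 8;         // two CU-RU pairs, four processors per CU
    static final int SRAM_KBYTES_PER_BRIC = 21;
    static final int CLOCK_MHZ = 300;
    static final int WORD_BITS = 32;            // assumption: one 32-bit word per transfer

    static int totalProcessors() {
        return COMPUTE_BRICS * CPUS_PER_BRIC;
    }

    // Total distributed SRAM in Mbits (8 bits per byte, 1000 kbits per Mbit).
    static double totalSramMbits() {
        return COMPUTE_BRICS * SRAM_KBYTES_PER_BRIC * 8 / 1000.0;
    }

    // Peak rate of one channel in Gbps, if it moves one word every clock cycle.
    static double channelGbps() {
        return WORD_BITS * CLOCK_MHZ / 1000.0;
    }

    public static void main(String[] args) {
        System.out.println(totalProcessors()); // 336
        System.out.println(totalSramMbits());  // 7.056, which rounds to the quoted 7.1 Mbits
        System.out.println(channelGbps());     // 9.6
    }
}
```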
Programming Model and Tools
Using the Ambric-architecture structured component programming model, developing applications for this chip is straightforward: no "magic compilers," no scheduling, no synthesis, no waiting hours or days for a build.
The aDesigner integrated development environment (IDE), based on Eclipse (an open development platform from the Eclipse Foundation), lets software components be written in a subset of standard Java or assembly code, or be loaded from libraries. The structure is defined with graphical block diagrams or a text-based structural language. An application simulator is available in the IDE as well as on-chip execution. Software components and your structure are automatically compiled, placed, and routed onto the Am2045. This generates a configuration file, which is loaded into the chip at runtime. Symbolic source-level parallel debugging and performance monitoring, in real time on the chip, is available through the IDE.

The developer starts by describing the application as a parallel structure of software components and the data and control messages they send and receive. The process is to divide and conquer the application hierarchically, defining composite objects/components according to higher-level functional blocks, which is a very good match to the way developers think about an application. All the parallelism is encoded in the structure, leaving the leaf objects sequential. Components/objects are decomposed until they fit on one core. (Even though not strictly the same by computer-science definition, we'll use 'component' and 'object' interchangeably, with the key point that the interconnect hardware enables 'encapsulation'.)
Since Ambric-architecture components, even large composite components, are strictly encapsulated with simple common channel interfaces, they are easily reused. Once validated, encapsulation protects them, so they maintain correct behavior, with no need for re-validation.
Application-specific leaf objects are written in ordinary sequential code in a strict subset of a standard high-level language and/or assembly code, and compiled normally. Assembly can be in-lined.
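The following sketch shows what leaf and composite objects might look like in plain Java. The `WordSource` and `WordSink` interfaces, the `step` methods and the class names are hypothetical and are not the aDesigner API. To keep the result easy to check, the composite steps its leaves one after the other on the desktop. On the chip each leaf would run on its own processor.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical channel ends: a leaf object sees nothing but these.
interface WordSource { int read(); }
interface WordSink { void write(int word); }

// Leaf object: ordinary sequential code with private state and no shared memory.
final class Scaler {
    private final int factor;
    Scaler(int factor) { this.factor = factor; }
    void step(WordSource in, WordSink out) {
        out.write(in.read() * factor);
    }
}

// Another leaf: emits the difference between consecutive input words.
final class DeltaEncoder {
    private int previous = 0;
    void step(WordSource in, WordSink out) {
        int current = in.read();
        out.write(current - previous);
        previous = current;
    }
}

// Composite object: pure structure. It only wires leaves together with a channel,
// and it exposes the same kind of interface as a leaf, so it can be reused as one.
final class ScaledDelta {
    private final Deque<Integer> internal = new ArrayDeque<>(); // stand-in for a channel
    private final Scaler scaler = new Scaler(10);
    private final DeltaEncoder delta = new DeltaEncoder();

    void step(WordSource in, WordSink out) {
        scaler.step(in, word -> internal.addLast(word));
        delta.step(() -> internal.removeFirst(), out);
    }
}
```

Feeding the words 1, 3, 6 into a `ScaledDelta` produces 10, 20, 30. Neither leaf can observe the other's state, which is the encapsulation property that allows validated objects to be reused without re-validation.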
A functional simulator is available to run and do initial debugging in a software test-bench environment on the desktop.
Finally the realization tool chain auto-maps all objects onto CUs and RUs, auto-routes the channels, and creates a configuration file for the chip. At runtime the chip is configured by a host FPGA/processor or by itself from flash memory, similar to FPGAs.
Runtime debugging and performance or power tuning in the real system on real data at full speed is vital. Unused processors, memories and channels are straightforward to use for real-time debug - just by forking and copying channel traffic. The chip also includes a separate dedicated debug network. The developer can halt, step, restart processors with the usual debug features, plus observe or trap on channel events. Performance 'heat-map' analysis tools help focus in on tasks which might need more pipelining or fan-out parallelism. Usually, focus on functionality first, then performance-tune by adding parallelism, using concepts that are simple and familiar to FPGA developers.
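The idea of debugging by forking and copying channel traffic can be sketched as a small fan-out object. This is a desktop illustration only. It uses the standard `IntConsumer` interface as a stand-in for the write end of a channel, and the `fork` helper and class name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntConsumer;

final class DebugTapSketch {
    // Fork object: forwards each word unchanged and sends a copy to a debug channel.
    static IntConsumer fork(IntConsumer downstream, IntConsumer debugCopy) {
        return word -> {
            downstream.accept(word);
            debugCopy.accept(word);
        };
    }

    public static void main(String[] args) {
        List<Integer> processed = new ArrayList<>();
        List<Integer> trace = new ArrayList<>();

        // The application object under observation doubles each word it receives.
        IntConsumer application = word -> processed.add(word * 2);
        IntConsumer tapped = fork(application, word -> trace.add(word));

        for (int word : new int[] {3, 5, 8}) tapped.accept(word);

        System.out.println(processed); // prints [6, 10, 16]
        System.out.println(trace);     // prints [3, 5, 8]
    }
}
```

The application object is unchanged by the tap. Only the structure around it is edited, which is why this kind of instrumentation can be added and removed without touching validated code.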
Performance Metrics
Nethra defined this programming model, architecture and tools to get very high performance from a massively parallel chip. Here are the results.
| Metric | Am2045 |
| Operations/sec | 1.03 TeraOPS |
| Multiply-accumulates/sec (16x16 to 32-bit) | 50 GMACS |
Scalability
Nethra set out to overcome fundamental barriers to scalability in development cost and difficulty, timing, and power.
Our programming model's modularity makes much greater design reuse possible. Its simple combination of objects written in normal software code, combined in a hierarchy of block-diagram structures, makes high-performance design development much easier and cheaper.
Our globally-asynchronous system of processors, memories and self-synchronizing channels, with local synchronous clocking and no long wires, is highly scalable into future process generations with a clear path to over a thousand cores per device.
Conventional techniques increase power far out of proportion to performance increase. With parallel processors, performance per watt stays relatively constant. The Ambric-architecture MPPA and its programming model make that massive parallelism practical.
Starting from today's 1 teraOPS Am2045, long term scalability in more than one direction is available.
Increased performance at the same area and cost will come from process scaling and more custom bric implementation, leading to parts with over 1,000 processors and over 8 teraOPS.
Alternatively, the same performance in a smaller area also becomes possible for applications that need lower cost and energy, analogous to the low-cost FPGA and DSP families.
Massively Parallel Processing Arrays
The Ambric-architecture is a member of an emerging class of parallel chips called Massively Parallel Processor Arrays. MPPAs are distinguished from "multi-core" conventional processors, which have only a few processors and a shared-memory architecture, by having massive parallelism of at least hundreds of processing elements and distributed memories, and a rich word-wide flexible interconnect fabric.
Conclusions
Nethra has found practical solutions to the architectural and programming challenges that have stood in the way of massively parallel embedded computing. The result delivers extremely high performance, is silicon-scalable and software-scalable over the long term, and needs only reasonable, software-only application development effort.
It's not enough to put lots of processors on a chip without thinking about how they're programmed. We started with our Structural Component Programming Model, which scales without limit, and designed our silicon to enable it.
Its practical access to massive parallelism realizes an order-of-magnitude increase in the throughput available from a programmable chip in a given silicon process and area.
Its modularity, low-power parallelism, and self-synchronizing interconnect mean that Nethra's MPPA technology can continue to track Moore's Law for many process generations to come.
References
[1] Michael Butts, A. M. Jones, Paul Wasson. A Structural Object Programming Model, Architecture, Chip and Tools for Reconfigurable Computing. Proc. IEEE Symposium on Field-Configurable Custom Computing Machines (FCCM) 2007, pp. 55-64.
[2] Michael Butts. Synchronization through Communication in a Massively Parallel Processor Array. IEEE Micro, vol. 27 no. 5, pp. 32-40, Sept./Oct. 2007.
[3] A. M. Jones, Michael Butts. TeraOPS Hardware: A New Massively-Parallel MIMD Computing Fabric IC. IEEE Hot Chips Symposium, August 2006.
[4] Edward A. Lee. The Problem with Threads. IEEE Computer, vol. 39 no. 5, pp. 33-42, May 2006.
[5] Jan M. Rabaey. Design at the End of the Silicon Roadmap. Keynote Presentation, ASPDAC 2005, Shanghai, January 2005.
[6] Gary Smith. The Crisis of Complexity. Dataquest briefing, 40th Design Automation Conference, 2003.