Wednesday, April 23, 2008

EDP Conference in Monterey

I didn't attend Electronic Design Process 2008 (EDP) in Monterey recently. The keynote, by Timothy G. Mattson of Intel, was entitled "Parallel Computing: Can We PLEASE Do It Right This Time?" The talk focused on software languages for multi-core hardware platforms.

I think there is a key point made in his presentation that people in the hardware design and EDA communities will find interesting. And, this is coming from someone talking about software development -- a community particularly interested in using traditional, sequential C/C++ software. Timothy's conclusions are not a surprise to anyone looking at how automatic parallelization tools have historically performed -- in a long history starting in the parallel computing market (Cray/Thinking Machines/KSR/...). He concludes:

* Automatic parallelization will not solve the software community's problems (that is, be a solution for developing parallel software for multi-cores)

* And, so: "Our only hope is to get programmers to write parallel software 'by hand'."

I completely agree with this viewpoint -- and I think it equally applies to hardware design (actually, MUCH more so). We've seen a lot of "high-level" hardware design tools focused around automatically parallelizing C/C++/SystemC. And, there's no doubt that a certain class of smaller IP blocks (loop-and-array style code) can be effectively and efficiently synthesized automatically from these languages. But, these tools require a lot of tailoring to make efficient hardware -- and can't effectively address the bigger problems:

* Software development -- these tools do nothing for software -- isn't that the biggest chip design challenge? Don't we need models and implementations that arrive earlier and run faster for software? It seems to me that behavioral synthesis tools are like trees in the forest for addressing this fundamental problem. Aside from there being many vendors hyping algorithms, since when did they become the long pole in the tent?
* Algorithmic subsystem performance -- Just staying specific to the "algorithm" subsystem, these tools can't easily comprehend, express or tailor memory and switch subsystem performance, which can often be the dominating architectural consideration for power, performance, area, ... And, impedance matching these blocks to memory and switch subsystems, or impedance matching several of these smaller blocks to each other, can be a bear.
* Loop-and-array algorithms are a small part of designs. Looking at a system, these tools do nothing for the rest of the system: DMA controllers, memory controllers, processors, controllers, communications IP, etc.

Hardware designers need to develop the parallel aspects of their design by hand. That doesn't mean it needs to be at the RTL level. Simplifying concurrency is the only way to improve this process -- automatic parallelization just isn't a general solution (so C/C++/SystemC is not a path). This is where atomic transactions come in. (I know a lot less about how atomic transactions will and can work in the software space. That said, I have some thoughts about what Timothy says about the value of transactional memory in this keynote. I am in CA and my day is starting so I have to run).

Monday, April 14, 2008

Macbeth and RTL

I was speaking with Nikhil, our CTO, about the issue people often run into with RTL where one gets committed inexorably to a particular direction. When it becomes clear that there might have been a better approach, you don't have the flexibility to switch directions. You are too mired in your current approach -- and no longer have time to approach it freshly. Nikhil immediately thought of the following quote from Shakespeare's Macbeth, in Act III, Scene IV, which provides a terrific description of this situation:

"I am in blood
Stepp'd in so far, that, should I wade no more,
Returning were as tedious as go o'er."

Macbeth is in too deep -- he's already committed murder. To go back is just as tedious as just to trudge forward.

When you get to this point with RTL, the only choice is typically just to continue. You get one shot -- it better be aimed in the right direction.

Saturday, April 12, 2008

Algorithmic Myopia

I was reading the summary presentation from last year's (2007's) MemoCODE codesign contest winners and thought it highlighted an important point about algorithmic design. (Last year, the winning team beat second place by 5X. This year, which was recently announced, the winning team beat second place by 11X).

The problem last year was a Blocked Matrix-Matrix Multiplication. It was designed to have both a software and a hardware portion. The core of the hardware piece is a classic "algorithmic" design. Tightly nested for-loops -- the type of solution that one *might* target with traditional behavioral synthesis (algorithmic synthesis) technology:

void mmmBlocked(Number* A, Number* B,
                Number *C, int N, int NB) {
  int j, i, k;
  /* Walk the N x N matrices one NB x NB block at a time. */
  for (j = 0; j < N; j += NB)
    for (i = 0; i < N; i += NB)
      for (k = 0; k < N; k += NB)
        mmmKernel(&(A[i*N+k]),
                  &(B[k*N+j]),
                  &(C[i*N+j]), N, NB);
}
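
The post does not show mmmKernel or the Number type, so here is a minimal sketch of what they might look like. It assumes Number is a double, that the matrices are stored row-major, and that N is a multiple of NB; none of this is taken from the contest's reference code.

```c
#include <assert.h>

typedef double Number;  /* assumption: the real Number type is not shown */

/* Hypothetical kernel: C_blk += A_blk * B_blk, where each block is an
   NB x NB tile living inside a row-major N x N matrix (row stride N). */
void mmmKernel(Number *A, Number *B, Number *C, int N, int NB) {
    for (int i = 0; i < NB; i++) {
        for (int j = 0; j < NB; j++) {
            Number sum = C[i*N + j];
            for (int k = 0; k < NB; k++)
                sum += A[i*N + k] * B[k*N + j];
            C[i*N + j] = sum;
        }
    }
}

/* Same loop nest as in the post; C must be zeroed by the caller. */
void mmmBlocked(Number *A, Number *B, Number *C, int N, int NB) {
    assert(NB > 0 && N % NB == 0);  /* assumption: blocks tile the matrix */
    for (int j = 0; j < N; j += NB)
        for (int i = 0; i < N; i += NB)
            for (int k = 0; k < N; k += NB)
                mmmKernel(&A[i*N + k], &B[k*N + j], &C[i*N + j], N, NB);
}
```

Calling mmmBlocked(A, B, C, N, NB) on a zeroed C should give the same result as a plain triple loop. The blocking changes only the order in which data is touched, which is exactly why the memory system matters so much here.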

As is not atypical for these types of problems, however, the real magic lies in the system considerations -- not the core loop-and-array architecture. If you just focused on the portion that algorithm synthesis solutions can address, you would have missed the key architectural consideration -- which was memory bandwidth.
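
To put a rough number on the memory bandwidth point, here is a back-of-the-envelope sketch. It assumes (hypothetically) that every mmmKernel call fetches one NB x NB block of A and one of B from external memory and that nothing is kept on chip between calls; the function names are mine, not the contest's.

```c
#include <assert.h>

/* Words of A and B fetched by the blocked loop nest under the assumption
   above: (N/NB)^3 kernel calls, each loading 2 * NB * NB words. */
unsigned long long blocked_input_words(unsigned long long N,
                                       unsigned long long NB) {
    assert(NB > 0 && N % NB == 0);
    unsigned long long calls = (N / NB) * (N / NB) * (N / NB);
    return calls * 2ULL * NB * NB;   /* equals 2 * N^3 / NB */
}

/* Multiply-accumulate operations: N^3, whatever the block size. */
unsigned long long mac_ops(unsigned long long N) {
    return N * N * N;
}
```

Under these assumptions the arithmetic stays fixed at N^3 multiply-accumulates while the input traffic falls as 1/NB. For N = 1024, going from NB = 1 to NB = 64 cuts the fetched words by a factor of 64. How large a block you can hold on chip, and how you feed it, is a memory-subsystem decision, not something the loop nest alone tells you.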

The challenge was not only to recognize that this was the core architectural consideration, but focus the innovation and exploration around the memory subsystem -- AND ensure that the algorithmic piece was tightly coupled and scheduled to work with this subsystem. The first place team delivered 5X the performance of the second place team because they could rapidly explore tradeoffs in this area -- and their environment encompassed not just the "algorithm" but the system as well. (I put quotes around "algorithm" because this term is often mis-used -- and I'm mis-using it a bit here. An algorithm is not just the functional description of the problem (as it might be expressed in C), but also the cost-model for how that function is solved. A C function is not "the algorithm", but one "algorithm" for solving a problem -- it never expresses a particular hardware algorithm for solving the problem.)

Algorithmic synthesis solutions may be able to automatically produce different pipeline micro-architectures for loop-and-array hardware pieces, but the total algorithm isn't just this piece -- it's the entire system. This is a problem that has been repeatedly learned in history -- a good example is IBM's computer systems, which didn't always have the fastest processor, but didn't have to. Their architects focused on the entire system -- including the memory and disk subsystem -- which delivered mainframes that outperformed their competition.

Algorithmic design that focuses solely on the loop-and-array areas fails to take into account inter-loop-and-array-block interactions -- and interactions with the rest of the system. You need both the yin and the yang -- and they should be tightly intertwined so that you can easily optimize and match both. System, memory and inter-block interactions are often more important than pipeline choices in sub-blocks. They are not separate -- to treat them so is to be short-sighted.

Wednesday, April 09, 2008

Unfair advantage

The results of the 2008 MemoCODE Codesign contest were just released (yesterday, I believe). There were 8 teams with a total of 9 entries -- only one team and entry used Bluespec. No doubt a very strong team -- but the results are stunning. This is the difference between thinking and fluidly expressing architecture and doing RTL. In the hands of a very strong team, you get an especially powerful, mutually-reinforcing combination.

Team ID           Normalized Speedup   Platform          Design Languages
team kermin       1102.4               XUP               Bluespec
team brian        100.2                XUP               C
team marco        85.4                 XUP               C + HDL
team uljana       49.8                 XUP               C + HDL
team sunita (1)   41.1                 XUP               C + HDL
team vijay        33.0                 XUP               C + HDL
team rob          23.5                 XUP               C + HDL
team eric         12.8                 XC2VP100 Amirix   C + HDL
team sunita (2)   11.0                 XUP               C + Impulse C

Tuesday, April 01, 2008

Bluespec Acknowledges DOE and IAEA Probes into its Atomic Transactions

Apparently when you innovate, people take notice. Unfortunately, this time, it's both the Feds and U.N. showing a bit too much interest in our technology. We're expecting to clear up their misunderstandings quickly, but until then, Bluespec felt compelled to explain the situation and our perspective in a press release today. Please read the complete press release here.

Tuesday, March 11, 2008

Atomic Transactions in Processors

Sun has announced its new processor, called Rock. It's the first multi-core processor that uses atomic transactions (the first of many, I anticipate) -- specifically, it implements support for transactional memory, which is used to implement atomic transactions.

Atomic transactions are the highest-level way to specify complex concurrent behavior. Atomic transactions attack the root issue behind what makes hardware and concurrent software so:

  • Error-prone
  • Brittle
  • Complex
  • Costly to develop and verify

That core issue is managing concurrent accesses to shared resources. This is THE issue in programming multi-core processors -- how do you manage shared memory in a coordinated way? Atomic transactions make it tractable.
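
As a small software illustration of the all-or-nothing idea, here is a sketch in C11. It is not transactional memory and says nothing about how Rock works: it emulates one tiny transaction with an optimistic read, compute, and compare-and-swap commit loop. The Accounts type and every function name are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared state: two 32-bit balances packed into one 64-bit
   word so that both can be committed by a single compare-and-swap. */
typedef _Atomic uint64_t Accounts;

static uint64_t pack(uint32_t a, uint32_t b) { return ((uint64_t)a << 32) | b; }
static uint32_t balance_a(uint64_t s) { return (uint32_t)(s >> 32); }
static uint32_t balance_b(uint64_t s) { return (uint32_t)(s & 0xFFFFFFFFu); }

/* Atomically move `amount` from A to B. Returns false, changing nothing,
   if A has insufficient funds or B would overflow. */
bool transfer_a_to_b(Accounts *acc, uint32_t amount) {
    uint64_t seen = atomic_load(acc);            /* read a snapshot */
    for (;;) {
        uint32_t a = balance_a(seen), b = balance_b(seen);
        if (a < amount || b > UINT32_MAX - amount)
            return false;                        /* abort: no partial update */
        uint64_t next = pack(a - amount, b + amount);
        /* Commit only if the state is still what we read; on failure
           `seen` is refreshed with the current value and we retry. */
        if (atomic_compare_exchange_weak(acc, &seen, next))
            return true;
    }
}
```

No other thread can ever observe the money having left A without having arrived at B, and the caller never writes a lock. The catch is that this trick only works when the shared state fits in one word; the appeal of real atomic transactions is getting the same guarantee for arbitrary shared state.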

Wednesday, February 13, 2008

MIT Open Source Hardware Designs (OSHD) and Slashdot

MIT recently launched their website providing some pretty sophisticated, free, open source hardware designs. At this point, they've got three items, though one is a superset of one of the others:

1. HD-quality H.264 video decoder - there are a couple of interesting things about this design, aside from meeting the performance and quality requirements of HD-quality H.264:
a) The design is about 10,000 lines of code. The C/C++ UNSYNTHESIZABLE reference code for this function is 20K+ lines of code -- this shows how succinct and elegant Bluespec is for datapath intensive designs
b) This design illustrates how you can use Bluespec to parameterize on structure. A single design can generate lots of different micro-architectures -- and all the control logic adapts to the new micro-architecture

2. OFDM transceiver - this one's very cool because it supports both WiFi and WiMax from a single design (and MIT is working on supporting WUSB from the same source as well) -- here, it highlights the power of Bluespec to parameterize on differences in functionality

This Sunday, someone posted a link to this site on Slashdot. Fortunately (unfortunately? :>) ), this was indirect to Bluespec so our website didn't get Slashdotted (brought down with the traffic load).

Have fun!

Friday, January 11, 2008

IIT Bombay course taught this week by Arvind

Prof Arvind is teaching a short course on Bluespec at IIT Bombay this week. Hopefully when he gets back, MIT will start rolling out their open source hardware designs...

Tuesday, December 18, 2007

HW State of the Art.... Bluespec!

Mikko Terho is VP and Nokia Fellow at Nokia. According to his bio:

Mikko Terho heads Nokia's Intelligent Connectivity Group which focuses on the development of the innovations and prototypes for pervasive communication devices with novel internet services. He also advices as Nokia fellow other Nokia R&D Groups in the area of system design, software architecture and component selection.

He recently presented at edaForum 07 in Munich. Unfortunately, I don't have access to his slides, which go into detail about results using Bluespec for hardware/software architecture tradeoffs. The title of this presentation was "Are EDA tools for Systems Architecture, ASIC Design or Agile Software Development".

But, I did find a shorter, but recent, slide (mostly sub-) set here. He first shows the current virtual HW/SW world in slide fifteen. In this slide, he shows that RTL/SystemC generation from algorithms and C is "still flaky". Then he goes to the slide I'm quite fond of, slide twenty. This slide is called "HW state of the art". Using Bluespec, the "concept/algorithm" is fully synthesizable into SystemC models for software development and HW for chip implementation.

The funny thing is we just discovered these presentations. Nice surprise!

(On a side, but related, note, MIT/Nokia Research will soon be open sourcing some of the beautiful designs they've done with Bluespec. Currently it looks like H.264 and OFDM for WiFi/WiMax/... will be included in the mix... Stay tuned!)

Monday, December 17, 2007

Faster Chips Are Leaving Programmers in Their Dust

The NYTimes has this article today about the industry, and Microsoft in particular, needing to develop new languages for parallel architectures. It notes that one of the issues is that there are tasks that cannot be split across processors. This problem is analogous to the one we see in the hardware domain where many hardware designs cannot be automatically parallelized -- this disconnect has made traditional behavioral synthesis a disappointment for many that have tried it. Okay for simple, nested for-loops -- it quickly breaks down for more complex designs, especially ones with control logic intertwined.
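
A tiny illustration of the difference, as a sketch with made-up function names: the first loop has fully independent iterations, while the second carries state from one iteration to the next and mixes in data-dependent control.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: out[i] depends only on in[i], so the work can
   be split across processors (or unrolled into hardware) in any order. */
void scale(const int *in, int *out, size_t n, int k) {
    for (size_t i = 0; i < n; i++)
        out[i] = k * in[i];
}

/* Loop-carried state plus control: whether the accumulator resets at
   step i depends on everything since the previous reset, so iteration i
   cannot simply start before iteration i-1 has finished. */
int count_bursts(const int *in, size_t n, int limit) {
    int acc = 0, bursts = 0;
    for (size_t i = 0; i < n; i++) {
        acc += in[i];
        if (acc > limit) {   /* data-dependent reset */
            bursts++;
            acc = 0;
        }
    }
    return bursts;
}
```

The first function is the friendly case for automatic tools. The second is only a few lines longer, yet it already needs someone to rethink the structure of the computation before it can be spread across parallel resources.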

In the software domain, they're moving to more explicitly parallel languages -- we advocate the same in the hardware domain. The challenge is picking something that significantly raises the level of abstraction and is still synthesizable with high quality results -- it is just this need that makes SystemC pretty good for modeling, but not great for hardware design.

Another interesting observation by Microsoft's Craig Mundie is that hardware designs are less likely to be homogeneous matrices of identical processing elements than heterogeneous, optimized-per-task hardware processing elements.