Block architecture
Modern block design starts with an architect. The architect is responsible for the what, the why, and part of the how of the block. But they're usually not responsible for the rest of the how or for the where. I'll explain what the hell I'm talking about, I promise.

The what is reasonably obvious. Let's take the GPU, because I'm incredibly fond of things that make pixels. What does a GPU do? It processes graphics workloads. The why doesn't mean, "why does it process graphics workloads?" That much should be obvious. The why means, "why does it process graphics workloads in this or that particular way?" The fundamentals of computing machinery mean there are infinite ways to skin any complex computational cat.

A functional block diagram of the PowerVR Series7XT GPU. Source: Imagination Technologies

Sticking with GPUs, which have to process the pixels on your screen, you could architect something that processes all the pixels individually, one after the other. You could architect a GPU that processes pixels individually, one after the other, in a random order. Or you could architect a GPU that processes a bunch of them together in parallel, in a tile-based fashion, because pixels are largely independent of one another yet neighboring pixels share some properties, and exploiting spatial locality in the memory hierarchy leads to great things. Or you could architect a GPU that does nothing but render pictures of cats and hope that's what the user wants. It's the architect's job to figure out why their widget should process inputs and yield outputs in a particular way.
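To make the difference between those traversal choices a little more concrete, here's a minimal sketch in Python. The function names and the 4x4 tile size are my own assumptions for illustration, not how any real GPU is built. Both generators visit exactly the same pixels; only the order changes, and the tiled order keeps neighboring pixels together.

```python
def scanline_order(width, height):
    """Visit every pixel one after the other, row by row."""
    for y in range(height):
        for x in range(width):
            yield (x, y)


def tiled_order(width, height, tile=4):
    """Visit pixels tile by tile, so spatial neighbors are handled together.

    The tile size is a hypothetical parameter chosen for illustration.
    """
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            # Clamp to the image edge so partial tiles still work.
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    yield (x, y)


# Same set of pixels, different traversal order.
scan = list(scanline_order(8, 8))
tiled = list(tiled_order(8, 8, tile=4))
```

In the tiled order, the first 16 pixels visited all sit inside the same 4x4 square, which is the kind of locality a memory hierarchy can exploit.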

In terms of modern consumer-oriented semiconductors, these blocks all have a certain heft: a boundary, complexity, and physical size. They're not trivial, they're almost always programmable in some way, and they tend to be busy with memory a lot of the time. So the why is never simple, and architects today usually can't operate alone as a result.

So the block architect is operating in many similar ways to the full chip designer we talked about earlier. Interestingly, because it's so expensive to develop a chip, and because you want to be able to reuse blocks from one design to the next if you can, the blocks are basically black boxes. A block needs ways to get data in, do some work, and get resulting data out. But it often does that work in complete isolation from the rest of the system, without sharing data or resources, and it usually goes about things completely differently from other blocks. Computation is not computation, if you catch my drift.

A CPU goes about its business in a completely different fashion—from memory in, through computation, to memory out—than a GPU, never mind DSPs, modems, video encoders, video decoders, display pipelines, and everything else on a modern complex chip.

The architect therefore needs to understand how their block is used by the software and the rest of the system, and how it connects to that system. They also have to be broadly aware of some of the more physical properties that will affect the chip. But because the block is a black box, almost everything can be an implementation detail. That's the how part.

You often find that block architects, including those at companies doing all-in-house development, will design those blocks with a common interface to the outside world that's shared with other similar blocks in the same family. This lets the full chip architect make changes to their bigger design—and swap out certain blocks for others—without making material changes to anything but the eventual layout. Using a common interface also makes it possible for the block designer to create multiple variants of their block, each one specialized in certain ways to address certain markets, without making those variants any more complicated to integrate than one another.

Think of it like swapping out one CPU in your PC for another. They share the same interface with the outside world, the pins in the CPU's case and what travels across them, but the implementation could be completely different.

Software helps here, presenting a uniform layer to the rest of the system for the hardware underneath. Drivers for certain blocks let you keep a common interface at the software level while changing the implementation underneath whenever it's needed. It's the same principle.
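Here's a rough sketch of that common-interface idea in Python. Everything in it is hypothetical: the `Block` interface, the two variants, and the `run_system` function standing in for the rest of the chip are names I've made up for illustration. The point is only that the caller never needs to know which variant it's talking to.

```python
from abc import ABC, abstractmethod


class Block(ABC):
    """Hypothetical common interface: data in, some work, data out."""

    @abstractmethod
    def process(self, samples):
        """Take a list of ints and return a list of ints."""


class SerialVariant(Block):
    """Imagined area-focused variant: handles one sample at a time."""

    def process(self, samples):
        return [s * 2 for s in samples]


class WideVariant(Block):
    """Imagined performance-focused variant: handles four samples at a time."""

    def process(self, samples):
        out = []
        for i in range(0, len(samples), 4):
            group = samples[i:i + 4]
            # Same result as multiplying by two, reached a different way.
            out.extend(s << 1 for s in group)
        return out


def run_system(block, samples):
    """Stands in for the rest of the chip: it only uses the interface."""
    return block.process(samples)
```

Swapping `SerialVariant()` for `WideVariant()` in a call to `run_system` changes nothing for the caller, which is the property the full chip architect is relying on.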

Software helps drive the underlying block architecture, too, but not completely. Whether running a software instruction set architecture (ISA) or implementing drivers for a client API executed by the block, the architecture of the block can be completely different underneath while retaining compatibility, allowing differentiation on performance, power efficiency, area, and features. Look at GPUs as a block, where there's a lot of differentiation in underlying architecture in modern SoCs, yet they all implement support for a common set of APIs.

Die shots like this one of Nvidia's GK110 hint at the underlying block structure

Lastly, we come to the where part of the block, which tends not to matter as much to the architect. Especially in modern system-on-chip designs, if you could look at the physical layout and identify the blocks, you'd see them all over the place. One vendor will always place the video encoder and decoder next to each other, but another will place them separately. Sometimes discrete blocks will share power delivery or a power island, but sometimes they won't. The chip's physical topology doesn't really matter to the block architect. That's the domain of the hardware person who implements it. More about the hardware person later.

First, we need to talk about simulation.

When the architect has completed their design, at least for modern processors in consumer electronics, it's handed off to two separate teams. The simulation team's job is to take the design and create a functionally accurate software implementation, somewhat obviously called the simulation. The functionally accurate part is important; because the simulation models what the design is supposed to do, it must produce the same outputs as the real hardware for all the accepted inputs. But that doesn't mean it has to implement the exact same how as the hardware. The simulation also doesn't have to be cycle accurate, which is really important. (Since it's in software, the simulation very probably can't be cycle accurate.)

So the simulated software design acts for all intents and purposes like the hardware design; it just doesn't perform the same. In reality, it's many orders of magnitude slower for complex designs. For example, the simulated models of the GPUs that I'm used to working on can take minutes or even hours to generate single frames of output, depending on what's being rendered. Hours. While the simulation is a great way to verify the design is correct, it's still nowhere near the hardware in terms of performance.
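If the distinction between functionally accurate and cycle accurate feels abstract, this toy Python sketch might help. It's entirely made up for illustration: a multiply-add operation with a 16-bit result, modeled once as a plain function with no notion of time, and once as a clocked pipeline with two internal registers standing in for the hardware's how. It says nothing about real-world speed; it only shows that the two agree on outputs while only one of them knows about clocks.

```python
def functional_model(a, b, c):
    """Functionally accurate: the right answer, no notion of clock cycles."""
    return (a * b + c) & 0xFFFF


class PipelinedModel:
    """Cycle-stepped stand-in for hardware: multiply, then add, then output."""

    def __init__(self):
        self.stage1 = None  # register after the multiply stage
        self.stage2 = None  # register after the add stage
        self.cycles = 0

    def clock(self, inputs=None):
        """Advance one clock; return a finished result, or None if not ready."""
        # Update back to front so each stage reads last cycle's register value.
        out = None if self.stage2 is None else self.stage2 & 0xFFFF
        self.stage2 = None if self.stage1 is None else self.stage1[0] + self.stage1[1]
        self.stage1 = None if inputs is None else (inputs[0] * inputs[1], inputs[2])
        self.cycles += 1
        return out


def run_pipelined(a, b, c):
    """Feed one operation in and clock the pipeline until the result appears."""
    hw = PipelinedModel()
    result = hw.clock((a, b, c))
    while result is None:
        result = hw.clock()
    return result, hw.cycles
```

For the inputs 3, 4, and 5, both models produce 17, but only the pipelined one can tell you that the answer took three clocks to appear.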

For the curious, simulators tend to be written in C++, often using a C++ class library called SystemC. SystemC has features and idioms that allow the simulator writer to more easily model the parts of the design that operate concurrently, at the exact same point in time.

The benefit of the simulation is that it's much cheaper and faster to produce than the hardware implementation. It's much easier to test, verify, and debug, too, since it's just software. (If you're a simulator engineer, before you hunt me down and kill me in my sleep, rest assured I completely understand that it's not just software, as if there's no inherent complexity compared to other types of software.)

When an architecture is finished and handed off to the hardware and simulator teams, the simulator team is always going to finish first. It really has to, because the hardware team is going to use the simulator model to help verify that their hardware implementation works correctly for the same inputs and outputs!
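As a rough illustration of that kind of checking, here's a small Python sketch. The saturating 8-bit add, both function bodies, and the `compare` harness are all assumptions of mine; real verification flows are far more involved. The idea is simply to drive the reference model and the implementation with the same inputs and flag any disagreement.

```python
import random


def reference_model(a, b):
    """The simulator's view: a saturating 8-bit add, written for clarity."""
    return min(a + b, 255)


def implementation(a, b):
    """Stand-in for the hardware team's version, written differently."""
    total = a + b
    carry = total >> 8  # non-zero when the sum overflows 8 bits
    return 255 if carry else total & 0xFF


def compare(dut, ref, trials=1000, seed=0):
    """Drive both with the same random 8-bit inputs; collect mismatches."""
    rng = random.Random(seed)
    mismatches = []
    for _ in range(trials):
        a, b = rng.randrange(256), rng.randrange(256)
        if dut(a, b) != ref(a, b):
            mismatches.append((a, b))
    return mismatches
```

An empty list from `compare(implementation, reference_model)` means the two agreed on every input tried. An implementation that forgot to saturate, such as one that just wraps the sum to 8 bits, would show up as a list of failing input pairs.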

So we now understand that there's an architecture team responsible for coming up with the way the block should work, a simulator team responsible for a functionally accurate software model, and also a hardware team responsible for expressing the architecture in terms of the final physical design. There's a lot of overlap between architecture and hardware in many cases (and in my experience, great hardware people tend to be really good architects in many ways, and vice versa). For example, there's no point in the architecture team designing something that has to be able to absorb a certain number of clocks of memory latency if the hardware team then implements the main computational pipeline with a depth that can't keep enough work in flight to hide that latency. Architecture and hardware teams work very closely with each other to make sure the design is respected and implemented correctly.
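The latency example comes down to simple arithmetic, sketched here in Python. It's a deliberately simplified model: it assumes the block issues independent requests at a steady rate and that every request takes the same number of clocks to return. The numbers are invented for illustration and don't come from any real design.

```python
def work_in_flight_needed(latency_clocks, requests_per_clock=1.0):
    """Little's law: items in flight = latency x throughput."""
    return latency_clocks * requests_per_clock


def sustained_utilization(max_in_flight, latency_clocks, requests_per_clock=1.0):
    """Fraction of peak throughput the pipeline can sustain, capped at 1.0."""
    needed = work_in_flight_needed(latency_clocks, requests_per_clock)
    return min(1.0, max_in_flight / needed)


# Illustrative numbers only.
LATENCY_CLOCKS = 200  # latency the architect asked the block to absorb
PIPELINE_SLOTS = 64   # work the implemented pipeline can keep in flight

print(work_in_flight_needed(LATENCY_CLOCKS))                  # 200.0
print(sustained_utilization(PIPELINE_SLOTS, LATENCY_CLOCKS))  # 0.32
```

Under those assumptions, a pipeline that can only hold 64 pieces of work against 200 clocks of latency spends roughly two thirds of its time waiting on memory, which is exactly the kind of mismatch the two teams need to catch early.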

That said, there's still a lot of the hardware implementation that can be done entirely with black boxes, where the architect might not actually know how the pieces work underneath. That's the great thing about designing something in modular fashion; as long as your part of the design accepts the right inputs and produces the right outputs, the implementation can sometimes, but not always, be just details for the hardware person to worry about.

That keeps things simpler for the architecture team and lets the hardware team focus on what it's good at in terms of the physical implementation. A really good level of trust and communication is required between the architecture and hardware teams. Misinterpretation can cause bottlenecks to appear in the design where the architect didn't expect them, or in the worst case, they can result in completely broken physical implementations. There are block designs in chips that were successful in the market but shipped with hardware-level defects caused by the implementation not really gelling with what the architecture team had in mind for the design. Get it really wrong, and you can't ship.

When it comes to the building blocks of a modern chip, the real meat of the where is in power, area, and physical layout. Most blocks are rectilinear to make them easier to lay out, but that tends to waste space compared to more complex layouts, such as those where the tooling wasn't capable and a human got involved to reduce the area on the chip that contains no working logic. Sometimes that wasted space is desirable for routing or power delivery reasons, and sometimes you will always have extra space on a chip because you're pad limited. But sometimes you really need to pack blocks together as closely as possible for the smallest possible area, and it's the hardware team's job to design a block that can potentially be laid out in a flexible way.

Let's talk about that in more detail.

The output of the hardware team is RTL, short for register-transfer level: a description of the design written in a hardware description language (HDL). The two dominant HDLs used for RTL are Verilog and VHDL, but there's no clear winner between them on the market, at least not that I've been able to discern. Some companies implement their hardware with one and some with the other, and the two languages can be used together in different blocks in a single chip, integrated by the physical design teams.

RTL is a human-written expression, readable by both humans and machines, of the movement of data and the processing logic for the blocks on a chip. It's fed into the electronic design automation (EDA) tools operated by the physical design team. More on that later.

RTL encodes what needs to happen at the logical level of each part of the design, accepting inputs and working on them in some way before moving the data out at the back end. It's very much like normal computer code. The hardware programmers build libraries of common processing parts in RTL, which allow you to build more complex structures.

Going back to my favorite example, the GPU, you tend to want to multiply two numbers at almost every stage of the architecture. There's no point in the guy coding the blender, say, and the girl coding one of the ALU datapaths each writing their own interpretation of a floating-point multiplier. Instead they can collaborate, create something together that jointly fits their individual needs because they tend to want to operate on the same kinds of numbers, and build the multiplier once.

Then in the RTL for their blender or shader core, they can import that shared multiplier and instantiate it inside that larger encompassing block, saving time to implement their part of the design and reducing the cost of validation later.
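Real RTL would be written in Verilog or VHDL, but the shape of that reuse can be sketched in Python. All the class and method names here are hypothetical, and the blend equation is just a common textbook form chosen for illustration. The multiplier is written once, and each larger block creates its own instance of it, much as each RTL instantiation yields its own copy of the logic.

```python
class Multiplier:
    """Shared building block: written and validated once."""

    def compute(self, a, b):
        return a * b


class Blender:
    """Hypothetical blender that instantiates the shared multiplier."""

    def __init__(self):
        self.mul = Multiplier()  # this block's own instance of the shared part

    def blend(self, src, dst, alpha):
        # Textbook alpha blend: src * alpha + dst * (1 - alpha)
        return self.mul.compute(src, alpha) + self.mul.compute(dst, 1.0 - alpha)


class AluDatapath:
    """Hypothetical ALU datapath that reuses the same multiplier design."""

    def __init__(self):
        self.mul = Multiplier()

    def multiply_add(self, a, b, c):
        return self.mul.compute(a, b) + c
```

If a bug turns up in `Multiplier`, it gets fixed in one place and both users benefit, which is where the saving in validation cost comes from.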

One of the key things to understand about RTL is that it can be debugged, tested, and validated before a chip is created. There are various software and hardware platforms that can take RTL and execute it pretty much directly, without the need to take the RTL all the way to a physical chip to see if it works.