
Emulation and FPGAs
Remember, the complexity of a modern chip design is measured in billions of transistors these days. The physical area of a modern system-on-chip in today's smartphones and tablets starts at around 50 square millimeters and triples from there at the top end. The chips in more powerful laptops and desktop computers, especially the GPUs, run to several billion transistors and upwards of 500 square millimeters on modern process technology.

You can imagine those chips cost a lot of money to manufacture, and I'll talk a bit more about that later. Because of the inherent costs, if you weren't able to test your design functionally before taking it into physical chip form, it'd be impossible to design anything complex. Being able to prototype your design before the complex "back-end" physical processes get started is therefore completely fundamental to modern chip development.

We talked about simulation earlier, where there exists a functional software implementation that completely mimics what the hardware is supposed to do, just not exactly how it does things at the cycle level. But what about ways to execute the RTL ahead of having to turn it into a physical chip, in that desirable, cycle-accurate manner?

There are a number of options for doing so these days, ranging from the "cheap" to the incredibly expensive. Depending on the size of the block you want to implement and how it connects to the outside world, field-programmable gate arrays (FPGAs) might be an option. As the initialism suggests, FPGAs are reconfigurable arrays of programmable gates. The gates in question are the fundamental logic gates that processors are made out of, like the "and" gate, which takes two inputs and outputs a logical 1 only if both inputs are 1; otherwise it outputs a logical 0.
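
To make the idea concrete, here's a minimal Python sketch that treats a two-input gate the way an FPGA does, as a tiny lookup table you program with the gate's truth table. The helper names are mine, purely for illustration.

    # A programmable gate modeled as a lookup table, roughly how an FPGA's
    # configurable logic works at the simplest level.
    def make_gate(truth_table):
        """truth_table maps an (a, b) input pair to a 0/1 output."""
        return lambda a, b: truth_table[(a, b)]

    # "Program" a two-input AND gate: it outputs 1 only when both inputs are 1.
    and_gate = make_gate({(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1})

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "->", and_gate(a, b))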


Nvidia's G-Sync module uses an FPGA

The FPGA programming process uses the RTL from the hardware team as an input. Remember that RTL is a logical representation of the function of the hardware, so it maps particularly well to FPGAs. Processing speeds for today's fastest and largest FPGAs, which can implement the biggest designs, are in the low-single-MHz range for large blocks. That's a far cry from the GHz-class designs you might expect for something like a modern CPU, but it's only a couple of orders of magnitude away from the frequency of things like GPUs or DSPs.

Regardless of the low clock speed, it's still an incredible improvement in speed compared to simulation. Crucially, it's also cycle accurate! That property of cycle accuracy exposes the design, no matter how it's implemented in final silicon, to performance analysis. If you're able to connect the FPGA implementation to other FPGAs or prototype silicon that helps you implement the wider system architecture that might house the design, you can start to figure out how fast it's going to be in real systems.
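
Here's a rough sketch of why that cycle accuracy matters so much for performance work: the cycle counts you measure on the FPGA are the same ones the final silicon will need, so you can project wall-clock performance at any target frequency. All the numbers below are invented for the example.

    # Cycle counts measured on a cycle-accurate FPGA prototype carry over
    # to silicon; only the clock frequency changes. All numbers are made up.
    cycles_per_frame = 20_000_000     # cycles to process one frame, measured on the FPGA
    fpga_hz = 2_000_000               # prototype clock, 2 MHz
    silicon_hz = 600_000_000          # intended production clock, 600 MHz

    print("Frame time on the FPGA:  %.2f s" % (cycles_per_frame / fpga_hz))
    print("Frame time in silicon:   %.4f s" % (cycles_per_frame / silicon_hz))
    print("Projected frame rate:    %.0f fps" % (silicon_hz / cycles_per_frame))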

There's usually some kind of disconnect between the performance in FPGA form and the final shipping silicon. Still, FPGAs are usually enough to give you an idea, so you can start work on tuning the design for performance in both absolute terms and relative to what the design communicates with on the outside world.

Then you have a class of full-block emulation systems that, as long as you have enough of them, can be configured together to emulate very large designs in full. The "in full" property is important. Back to GPUs again (sorry). To keep it simple, say your GPU design consists of a front-end, a shader core, and a back-end. Imagine the design is such that even the largest FPGAs you can get your hands on are only big enough to hold the design for the shader core, but not the front- and back-end architecture as well. You'd have to split the design across multiple FPGAs or not implement it in FPGAs at all, depending on the inter-block complexity and communication you need between those parts in order to make the design work.

Emulators, which after years of consolidation in the EDA industry are now usually also produced by tools vendors like Cadence and Synopsys, are enormous. I really mean that: a large installation of a modern emulator that's big enough for a large chip design can easily fill a big room, and that room tends to need to be specially constructed due to the power and cooling requirements.

In addition to being able to implement even large designs in full if you have enough of them connected together, the emulator can also be set to appear to a connected host system as the real device. Many months ahead of ever seeing the design in silicon, you can boot your operating system, load the driver for the GPU, and run the full software stack as if it were a real device.

Just like with FPGAs, the advantages of that ease of use and the ability to run the full software stack are hard to overstate. Emulation is great for the driver writer, the performance analyst like me, the hardware team implementing the RTL, or maybe the architecture team that might want changes to be made based on the full run-time data you're now able to collect using the full software stack. All that's possible because the emulator lets the system believe that the design is real.

Full block-level emulation is slow, even slower than FPGAs, at less than 1 MHz, and also many times more expensive than any other solution (millions of dollars to buy something that will be useful to the designer of a modern consumer system-on-chip, for example, which is why they actually tend to be leased). But it's still much cheaper and much faster in terms of turnaround time compared to full silicon production.
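
A quick back-of-the-envelope calculation shows why emulation runs are measured in hours rather than seconds. The cycle counts and clocks here are invented, just to show the scale of the gap.

    # Rough arithmetic on the speed gap between an emulator and final silicon.
    boot_cycles = 5_000_000_000       # hypothetical cycles needed to boot an OS
    emulator_hz = 750_000             # emulator clock, below 1 MHz
    silicon_hz = 1_500_000_000        # final silicon at 1.5 GHz

    print("Boot time on the emulator: %.1f hours" % (boot_cycles / emulator_hz / 3600))
    print("Boot time in silicon:      %.1f seconds" % (boot_cycles / silicon_hz))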


An LTE physical layer simulation. Source: Synopsys

So, let's imagine that every step in the process we've talked about has been followed, up to and including full emulation of the design in a giant emulation platform that fills a huge, custom-designed building with fully plumbed-in liquid cooling. The design now provably works. It's been tested with real software. The driver writer is happy with the driver running against the simulated and emulated models. There are no last-minute hardware bug fixes to be made (hopefully!). The design passes the full regression suite. The team responsible for delivery to the customer—be that an internal customer if you're someone like Intel or Qualcomm, or an external customer if you're shipping to a company like MediaTek or Rockchip that only tends to work with outsourced designs—has signed off on the design.

The RTL is then shipped to the customer. This is an important milestone in the journey to full silicon, and for an IP-only supplier, it's pretty much the last point in the journey. But there's still so much to be done in order to get the design into a working chip in a device. So what's next? One of the coolest and most mysterious processes, at least from my vantage point.

Synthesis
The RTL that describes the hardware then needs to be turned into logic in the form of transistors. That's where EDA tools and the physical design team come in. The transformation of RTL into actual transistors on the chip is a process called synthesis.

If you understand how computer programming works, you'll be familiar with the concept of compilation. You take code written in a high-level language and transform it through a series of steps, generating and consuming various intermediate representations of the code, until the final hardware code generation happens, targeted at the instruction set architecture of the processor that'll consume it at runtime. Interestingly, modern processors also tend to take that binary representation of their instruction set and transform it internally into other representations they can understand, all hidden from the programmer.

The key point to take away is that, while there are multiple complex transformations of the original high-level code, the computational meaning isn't changed. If the programmer wants to add two numbers together, you better not change the addition to a divide. You can argue that it is changed at certain steps, but I'd argue, at least for simplicity's sake, that it's actually just optimized.

That same "compilation" happens during RTL synthesis, where the functional hardware design encoded in RTL isn't changed, but the final representation, in this case usually a really cool binary interchange format called GDS2, is now something that can help generate the actual transistors on the chip. Think of it as the map from the described logic in the RTL to the transistor structures on the silicon.
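
As a toy illustration of that idea, here's a Python sketch where the same logical function is expressed two ways: once as an RTL-style expression and once as a netlist of standard cells wired together. The cell names and the function itself are made up, and real synthesis output is vastly more complex, but the point is that the meaning survives the change of representation.

    # Toy sketch of the idea behind synthesis: the logical function is
    # preserved while the representation changes from an RTL-like
    # expression to a netlist of standard cells. Cell names are invented.
    def rtl_out(a, b, c):
        # "RTL" view: out = (a AND b) OR (NOT c)
        return (a & b) | (1 - c)

    # "Synthesized" view: the same function as instances of library cells.
    netlist = [
        ("AND2_X1", ["a", "b"],   "n1"),
        ("INV_X1",  ["c"],        "n2"),
        ("OR2_X1",  ["n1", "n2"], "out"),
    ]

    def eval_netlist(netlist, inputs):
        cells = {"AND2_X1": lambda x, y: x & y,
                 "INV_X1":  lambda x: 1 - x,
                 "OR2_X1":  lambda x, y: x | y}
        nets = dict(inputs)
        for cell, ins, out in netlist:
            nets[out] = cells[cell](*(nets[i] for i in ins))
        return nets["out"]

    # The two representations agree for every input combination.
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                assert rtl_out(a, b, c) == eval_netlist(netlist, {"a": a, "b": b, "c": c})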


Spot the SRAM cells in a 22-nm Intel test chip. Source: Intel

The output of the synthesis EDA tools is built from a library of standard cells. The cells are collections of transistors that implement a certain structure in the silicon. The easiest ones to point out are the SRAM cells, which tend to be readily visible in photographs of the physical chip's floorplan. SRAM cells are usually consolidated in large, highly-regular rectilinear structures built from cellular building blocks.

The cells are tied to the foundry process, of course, and they tend to be provided by either the foundries themselves (someone like TSMC or Samsung LSI) or the EDA tools vendor (like Synopsys). It's common for IP vendors to partner up with the cell library vendor to create a tooling flow for an implementation of a particular block that's optimized for a certain foundry process and set of cell libraries—and their best operating conditions—to guarantee a set of performance characteristics.

So synthesizing the RTL is the act of turning the human-readable HDL into cellular blocks of transistors, in effect. And because transistors have physical dimensions, they need to be laid out in relation to one another.

Layout
Because the GDS2 is a full physical representation of the structures on the chip, it has inherent size, and it might surprise some to realize that modern chip manufacturing is a 3D process (and no, I'm not talking about three-dimensional transistors here). Not only does the bottom silicon layer spread out in width and height in a single polysilicon layer, but the design also spreads upwards in terms of metal.

Every modern microprocessor has a metal stack. The reasons why will hopefully be obvious, or at least become obvious shortly. Think of the full chip now, with the individual blocks that we talked about designing and implementing, all connected to each other in the final large chip design. But how are they connected? Tiny wires! The silicon layer is a single planar structure, but the connections between blocks aren't implemented in the silicon. They're implemented in metal wires, so the wires have to go upward.

In today's large designs, there's no way to do the wiring in just a single layer, so there tends to be a full stack of metal layers snaking around and through each other like 3D spaghetti with insulating material in between. The more metal layers, the more costly the design is to manufacture. And you can guess that, with a large silicon floorplan and many layers in the metal stack, laying everything out is not a task for a human being—at least not entirely.


Cross-sections of different metal stack implementations. Source: AMD

Starting again at the silicon layer, the blocks occupying a shared planar surface need to be placed beside each other. For manufacturing reasons, the chips on a wafer need to be square or rectilinear, but the blocks inside don't, although it helps if they are for layout simplicity. Imagine a variant of Tetris, where you have to not only get differently shaped blocks to fit together in an optimal space, but where you're also faced with an extra constraint that requires related blocks to be near each other to keep the interconnection between them (the metal layer stack) as simple as possible.

You can probably quite easily understand that certain block layouts require fewer long wires to cross the chip connecting blocks to each other, to bus fabrics, or what have you, resulting in a simpler metal stack.

You can also probably imagine that the number of possible permutations of the block layout on the bottom silicon layer and the wire layouts in the metal stack is mind boggling. Tiny changes in block placement at the silicon layer can lead to exponential growth in the wiring, for example. Searching through the possible combinations of block placement and wiring complexity therefore tends to be done by computer.

The bounding conditions for that search tend to be block frequency (because the clock for the chip has to propagate through all of the blocks), power (more transistors equals more power, relative to a fixed input voltage and frequency), and area (because the area has a physical cost in dollars for the silicon and the metal stack that wires it all up). There are many more in reality, and the EDA tools vendors tend to sell the software that figures it all out. The important thing to note is that finding an optimal layout for one input factor can cause huge changes in all of the others.
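
To give a flavor of what that automated search looks like, here's a toy sketch that scores candidate placements of a handful of blocks on a tiny grid by total wire length plus bounding-box area, and keeps the best of a bunch of random tries. Everything here is invented for illustration; real placement tools use far more sophisticated cost models and algorithms.

    # Toy placement search: minimize wire length between connected blocks
    # plus the area of the bounding box that contains them.
    import random

    blocks = ["front_end", "shader_core", "back_end", "memory_ctrl"]
    # Pairs of blocks that talk to each other and therefore want short wires.
    nets = [("front_end", "shader_core"), ("shader_core", "back_end"),
            ("back_end", "memory_ctrl")]

    def cost(placement):
        wire = sum(abs(placement[a][0] - placement[b][0]) +
                   abs(placement[a][1] - placement[b][1]) for a, b in nets)
        xs = [p[0] for p in placement.values()]
        ys = [p[1] for p in placement.values()]
        area = (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)
        return wire + area

    best = None
    for _ in range(10_000):  # brute-force random search over a tiny 4x4 grid
        candidate = dict(zip(blocks, random.sample(
            [(x, y) for x in range(4) for y in range(4)], len(blocks))))
        if best is None or cost(candidate) < cost(best):
            best = candidate

    print("Best placement found:", best, "with cost", cost(best))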

Now it is possible for a human being to lay out parts of the design, and there are reasons why that might be desirable. The layout software is guided by an engineer or set of engineers, but it might only have a certain number of possible search strategies baked into it, and they'll all be bound by the characteristics we talked about before: power, area, frequency, and so on.

However, there's one important characteristic to the boundary conditions of the layout software: time. Because finding the final layout solution is an exponential problem in terms of the number of input conditions, the software has to somehow limit its run time. If the software isn't sophisticated enough to find an acceptable layout for the usual input parameters, it's possible for skilled folks to step in and either partially or fully guide the software to the solution the chip designer is looking for.

Finding fully or partially hand-laid-out digital logic in complex, large microprocessors is rare, but it does happen. You can imagine that the reasons why it does happen are to really optimize the last few percent of a design in a certain way, to do the best possible job on performance, power, and area. I've looked at dozens of large chip designs in the last few years, in detail, and I've seen hand-optimized layout just once. I think. I say I think because it's very hard to tell, as you can imagine.

Clocking
It's worth talking about clocking of large chips here, just briefly, since it's such a big part of certain designs. The way a clock is applied to digital logic means that the clock propagates through a block's cells, driving them pretty much in a wave carried by the wires that route through the metal stack. The clocks applied to a modern large design are complex, because they're designed to cover a large range and move between levels and lock quickly. They're also varied, because there's no one clock to govern a single system-on-chip, so you find anywhere from a few to dozens of clock sources in today's designs.

The length of the wires is the biggest factor in figuring out the delay of the clock travelling along them. That delay is the main limit for the peak operating frequency of a given block. Factor it into whatever else is sharing the same clock source, and it's possible for a single block in a modern design to limit the peak frequency of all other blocks that share the same clock.
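
A simple worked example of that last point, with invented delays: if several blocks share one clock, the slowest critical path among them sets the ceiling for all of them.

    # The block with the longest critical path limits every block on the
    # shared clock. Delays are made-up numbers in nanoseconds.
    critical_path_ns = {"front_end": 0.8, "shader_core": 1.25, "back_end": 0.9}

    worst_ns = max(critical_path_ns.values())
    fmax_mhz = 1000.0 / worst_ns      # frequency is the reciprocal of the delay
    print("Shared clock tops out at about %.0f MHz" % fmax_mhz)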

The clock is effectively a tree. It's generated at the root source and moves along the branches, which are the main feeding wires, to the leaves, which are the wires farthest from the source.
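
Here's a minimal sketch of that tree metaphor: accumulate the wire delay from the root out to each leaf, then look at the spread in arrival times, which is the skew the layout tools have to keep under control. The structure and the delays are made up for the example.

    # Walk a toy clock tree and report arrival time at each leaf plus skew.
    # Each node is (name, delay contributed by its wire in ps, children).
    clock_tree = ("root", 0, [
        ("branch_a", 40, [("leaf_cpu", 25, []), ("leaf_gpu", 60, [])]),
        ("branch_b", 55, [("leaf_isp", 30, [])]),
    ])

    def arrival_times(node, upstream=0):
        name, delay, children = node
        total = upstream + delay
        if not children:
            return {name: total}
        times = {}
        for child in children:
            times.update(arrival_times(child, total))
        return times

    times = arrival_times(clock_tree)
    print(times)
    print("Clock skew:", max(times.values()) - min(times.values()), "ps")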

The clock tree can sometimes account for a non-trivial portion of a design's area, since it needs to effectively feed the numerous complex blocks within a modern design. I point out the clocking just to give you an idea that not all of the area on a modern chip is dedicated to computational logic, and that clocking setups and clock variation strategies to keep power under control are an increasingly large part of modern chip designs.