Rys Sommefeldt works for Imagination Technologies and runs Beyond3D. He took us inside Nvidia’s Fermi architecture a couple of years ago, and now he’s back with a breakdown of how modern semiconductors like CPUs, GPUs, and SoCs are made.
Disclaimer: what you’re about to read is not an exact description of how my employer, Imagination Technologies, and its customers take semiconductor IP from idea to end user product. It draws on how they do it, but that’s it.
This essay is designed to be a guide to understanding how any semiconductor device is made, regardless of whether it’s purely an in-house design, licensed IP, or something in between. I’ll touch on chips for consumer devices, since that’s what I work on most, but the process applies almost universally to any chip in any device.
I’ve never read a really great top-to-bottom description of the process, and it’s something I’d have loved to have read years before I joined a semiconductor IP company. I hope this helps others in the same position. If you’re at all interested in chip manufacturing and how chips are made and selected for consumer products, this should hopefully be a great read.
It all starts with an idea, you see. Not quite at the level of “I want to build a smartphone,” although understanding that the smartphone might be a target application will help the idea take shape. No, we’re going to talk about things a little bit further down, at the level of the silicon (but not for long!) chips that do all the computing in modern devices, be they smartphones or otherwise.
All of the chips I can think of, even the tiniest and most specialized chips that perform just a few functions, are made up of much smaller building blocks underneath. If you want to perform any non-trivial amount of computation, even just on a single input, you’re going to need a design that builds on top of foundational blocks.
So whether the idea is “let’s build the high-performance GPU that’ll go in the chips that go into smartphones,” or something that’s much simpler, the idea (almost) never gets built in its entirety as one monolithic piece of technology. It usually must be built from smaller building blocks. The primary reason, especially these days, is that it’s incredibly rare that one single person can hold the entire design for a chip in her or his head, in order to build it from start to finish and make sure it works. Modern chips are complex, usually consisting of at least a couple hundred million transistors in most consumer products and often many more. Most main processors in a modern desktop or laptop have well over a billion transistors. There are maybe over a billion transistors in your pocket, in the main chip in your phone.
So you overwhelmingly can’t build the idea as a monolithic thing, because humans just don’t work that way. Instead, the idea must be broken down into blocks. Maybe a single person can design, build, assemble, and test all of the blocks themselves, but blocks are a must. I’ll talk a lot about blocks, so apologies if the word offends somehow, or if it means “I hate your cat” in your native language. I definitely love your cat.
For simplicity’s sake, I’m going to talk about most common processors these days, which all take at least a year to make. Nothing in the semiconductor business happens really quickly. It really does normally take years to go from an idea about a chip all the way through the design, build, validation, integration, testing, sampling, possible rework, and mass production. All that happens before the product can be sold and you hold it in your hands, put it under your TV, drive it, fly it, use it to read books, or whatever else the chip finds itself in these days.
The lifetime of a new chip is therefore never short. There are some macro views of the semiconductor industry where you might think that’s the case. For example, a modern smartphone system-on-chip (SoC) vendor might be able to go from project start to chip mass production in a short matter of months, but that’s because all they’re doing is integrating the already designed, built, validated, and tested building blocks that other people have made. Tens of thousands of man years went into all of the constituent building blocks before the chip vendor got hold of them and turned them into the full SoC.
It takes years—not months, weeks, days, or anything silly like that, at least for the main chips performing complex computation in modern consumer electronics and related industries.
Knowing what you need
Chip development taking years means there’s a certain amount of hopefully accurate prediction to be done. Smart chip designers are data-driven folks who don’t trust instinct or read tea leaves. They don’t make decisions based on whether the headline in today’s paper started with the third letter of the name of the second dog they had in their first house as a kid. Knowing what to design is almost pure data analysis.
Data inputs come in to the chip designer from everywhere: marketing teams, sales people, existing customers, potential new customers, product planners, project managers, and competitive and performance analysis folks like me. Then there’s the data they get from experience, because they built something similar last time and they know how well it worked (or not).
The chip designer’s first job is to filter all of that data and use it as the foundation of the model of what they’re going to build. They need to know as much as possible about the contextual life of the chip when it finally comes into existence. What kind of products is it going to go into eventually? What does the customer expect as a jump over the last thing someone sold them? Is there a minimal bar for new performance or a requirement for some new features? Are trends in battery life, materials science, or the manufacturing of chips by the foundry changing?
What about costs? Costs play an enormous role in things. There’s no point designing something that costs $20 if your competitor can sell their closely functional and performing equivalent for $10. Knowing your cost structure for any chip is probably the thing that shapes a chip designer’s top-level bounds the most. Every choice you make has a cost, direct or indirect.
Say your chip needs Widget A, which is 20 square millimeters in area on the process technology of your foundry. Your total chip cost lets you design something that’s 80 mm² in area, because every square millimeter costs you 20 cents and your customer won’t pay more than $20 for the full chip, and because you really need that 25% markup over the $16 manufacturing cost (a 20% gross margin on the $20 price) to pay for the next chip. Widgets B through Z only have 60 mm² left, and really a bit less than 60 because it’s incredibly hard to lay out everything on the chip so there are no gaps. Sometimes you even want gaps, for power or heat reasons. I’ll come back to that theme later.
There’s both a direct (your chip can’t cost more than $16 to fab) and an indirect (choosing Widget A affects your further choices of Widgets B through Z) set of costs to model.
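The arithmetic behind that budget fits in a few lines of Python. This is only a sketch using the numbers from the example, and it treats the 25% as a markup over manufacturing cost, which is what makes $16 of cost line up with a $20 price. The function and variable names are made up for illustration, and a real cost model would also account for yield, packaging, test, and plenty more.

```python
def area_budget(max_price, cost_per_mm2, markup_on_cost):
    """Return (maximum manufacturing cost, maximum die area in mm^2)."""
    # The selling price has to cover the manufacturing cost plus the markup on top of it.
    max_cost = max_price / (1 + markup_on_cost)
    max_area = max_cost / cost_per_mm2
    return max_cost, max_area


WIDGET_A_AREA = 20.0  # mm^2, from the example

max_cost, max_area = area_budget(max_price=20.0, cost_per_mm2=0.20, markup_on_cost=0.25)
remaining = max_area - WIDGET_A_AREA  # what Widgets B through Z have to share

print(f"fab cost cap: ${max_cost:.2f}")
print(f"die area budget: {max_area:.0f} mm^2")
print(f"left after Widget A: {remaining:.0f} mm^2")
```

Change any one input (a cheaper process, a bigger Widget A, a customer who will pay $22) and the space left for everything else moves with it, which is the indirect cost in action.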
So the chip designer takes all of those inputs and feeds them into her or his models (there’s usually a lot of spreadsheet gymnastics here, more than you might think). The designer decides what Widgets they need for their chip, intercepting all of the top-level context about the chip, when it will be made, and when it will come into the world, to make sure they take advantage of everything known about its design and manufacturing.
We now know that the designer needs some building blocks for their chip, and that they’ve made the hard decisions about what they believe they need. Where do those blocks come from these days?
Buy it in or build it yourself
If you’re a semiconductor behemoth like Intel, where you literally have the ability not just to design the chip yourself, but also to manufacture it because you also own the chip fabrication machinery, you invariably build the blocks yourself. Say you’re the lead designer for the next-generation Core i8-6789K xPro Extreme Edition Hyper Fighting. These days a product like that is not just the CPU like it used to be, where everything else in the system lies on the other end of a connected bus. Chips like the Core i7-4790K are a CPU, memory controller and internal fabric, GPU, big last level cache, video encoder, display controller, PCI Express root complex, and more. So let’s assume the i8-6789K is probably at least all of those things.
As lead designer of something like the i8-6789K, there’s probably almost nothing on the i7-4790K chip that its designer bought from outside Intel, or that you’ll now buy from a third party as a building block. I’d like to think there’s at least one block that Intel didn’t design, but I wouldn’t be surprised if someone told me there were zero third-party pieces.
Intel do make chips where they get the blocks from outside of the company, but the vast majority of their revenue comes from sales of chips that are almost completely their own.
So where are you going to get building blocks from? Intel obviously has design teams for each and every block of the chip. It’s incredibly expensive, but the competitive advantages are enormous. Knowing that all of your block designs are coming from your own company, on timescales you (hopefully) control, where your competitors have no idea what you’re building, and where you have full design-level control over every part that results in a flip-flop to be flip-flopped, is really compelling. That vertical integration is overwhelmingly an excellent idea if you can afford it, because it lets you put economies of scale to work amortizing the incredibly expensive capital expenditure required.
You can see that build-it-yourself mentality elsewhere in the chip industry. Qualcomm do as much as they can. Nvidia are trying their very best. Apple are beating the rest of the consumer device world to death with their ability to vertically integrate as much as they can. Lots of that is built on Apple doing the work themselves, at the chip’s block level.
At the other end of the scale in consumer devices like phones and tablets, you have vendors that are master integrators but design none of the blocks themselves. They go shopping, get the blueprints for the blocks from other suppliers, connect them up, and ship the result, often very quickly. It’s comparatively cheap and easy for them to take this approach. And, primarily because it’s also cheap and easy for someone else to follow suit, they’re in a horrible, slow, squeezing, cost-down race to the bottom that only a few will survive unless they can differentiate.
Choosing between buying it in or building it yourself is largely a matter of capital expenditure, expertise, and supporting shipping volume. Those are the big factors, but there’s still incredible extra nuance depending on the company making the chip. Some vendors will take a block design in-house where they previously bought it, not because doing so will make them any more money directly, but simply because it’ll increase the size of the smile on the customer’s face when they use the final product.
Now we know where the blocks tend to come from. If you’re rich and your customers love your stuff so much that your competition matters less, if anyone can even compete with you at all, and if you ship loads of whatever it is you make, you can go ahead and try to do as much of the block design as you can yourself. If your cost structure and competitive environment mean things are tighter, you need to go shopping. I’ve also written about how you should go shopping, if you want to nip off and read about that too.
Regardless, someone needs to design the blocks.
Modern block design starts with an architect. The architect is responsible for the what, the why, and part of the how of the block. But they’re usually not responsible for the rest of the how or for the where. I’ll explain what the hell I’m talking about, I promise.
The what is reasonably obvious. Let’s take the GPU, because I’m incredibly fond of things that make pixels. What does a GPU do? It processes graphics workloads. The why doesn’t mean, “why does it process graphics workloads?” That much should be obvious. The why means, “why does it process graphics workloads in this or that particular way?” The fundamentals of computing machinery mean there are infinite ways to skin any complex computational cat.
Sticking with GPUs, which have to process the pixels on your screen, you could architect something that processes all the pixels individually, one after the other. You could architect a GPU that processes pixels individually, one after the other, in a random order. Or you could architect a GPU that processes a bunch of them together in parallel, in a tile-based fashion, because the pixels have some level of complete independence yet also some inherent level of connected properties, and exploiting spatial locality in the memory hierarchy leads to great things. Or you could architect a GPU that does nothing but render pictures of cats and hope that’s what the user wants. It’s the architect’s job to figure out why his Widget should process inputs and yield outputs in a particular way.
In terms of modern consumer oriented semiconductors, these blocks all have a certain heft—a boundary, complexity, and physical size. They’re not trivial, they’re almost always programmable in some way, and they tend to be busy with memory a lot of the time. So the why is never simple, and architects today usually can’t operate alone as a result.
So the block architect is operating in many similar ways to the full chip designer we talked about earlier. Interestingly, because it’s so expensive to develop a chip, and because you want to be able to reuse blocks from one design to the next if you can, the blocks are basically black boxes. A block needs ways to get data in, do some work, and get resulting data out. But it often does that work in complete isolation from the rest of the system, without sharing data or resources, and it usually goes about things completely differently from other blocks. Computation is not computation, if you catch my drift.
A CPU goes about its business in a completely different fashion—from memory in, through computation, to memory out—than a GPU, never mind DSPs, modems, video encoders, video decoders, display pipelines, and everything else on a modern complex chip.
The architect therefore needs to understand how their block is used by the software and rest of the system and how it connects to that system. They also have to be broadly aware of some of the more physical properties that will affect the chip. But because the block is a black box, almost everything can be an implementation detail. That’s the how part.
You often find that block architects, including those at companies doing all-in-house development, will design those blocks with a common interface to the outside world that’s shared with other similar blocks in the same family. This lets the full chip architect make changes to their bigger design—and swap out certain blocks for others—without making material changes to anything but the eventual layout. Using a common interface also makes it possible for the block designer to create multiple variants of their block, each one specialized in certain ways to address certain markets, without making those variants any more complicated to integrate than one another.
Think of it like swapping out one CPU in your PC for another. They share the same interface with the outside world, the pins in the CPU’s case and what travels across them, but the implementation could be completely different.
Software helps here, presenting a uniform layer to the rest of the system for the hardware underneath. Drivers for certain blocks let you keep a common interface at the software level while changing the implementation underneath whenever it’s needed. It’s the same principle.
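If you think in software terms, the common-interface idea looks a lot like programming against an abstract base class. The sketch below is purely an analogy in Python: `ChecksumBlock`, its two variants, and `Chip` are hypothetical names invented for illustration, not anything from a real design flow.

```python
from abc import ABC, abstractmethod


class ChecksumBlock(ABC):
    """Hypothetical interface that every block in this family exposes."""

    @abstractmethod
    def process(self, data: bytes) -> int:
        ...


class SerialChecksum(ChecksumBlock):
    """Small variant: handles one byte per step."""

    def process(self, data: bytes) -> int:
        total = 0
        for byte in data:
            total = (total + byte) & 0xFFFF
        return total


class WideChecksum(ChecksumBlock):
    """Faster variant: handles four bytes per step, same result."""

    def process(self, data: bytes) -> int:
        total = 0
        for i in range(0, len(data), 4):
            total += sum(data[i:i + 4])
        return total & 0xFFFF


class Chip:
    """The integrator only ever talks to the interface, never the implementation."""

    def __init__(self, checksum_block: ChecksumBlock):
        self.checksum_block = checksum_block

    def run(self, data: bytes) -> int:
        return self.checksum_block.process(data)


payload = bytes(range(256)) * 3
results = [Chip(block).run(payload) for block in (SerialChecksum(), WideChecksum())]
print(results)
```

Swapping one variant for the other changes nothing in `Chip`, which is the property the full chip architect is after.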
Software helps drive the underlying block architecture, too, but not completely. Whether running a software instruction set architecture (ISA) or implementing drivers for a client API executed by the block, the architecture of the block can be completely different underneath while retaining compatibility, allowing differentiation on performance, power efficiency, area, and features. Look at GPUs as a block, where there’s a lot of differentiation in underlying architecture in modern SoCs, yet they all implement support for a common set of APIs.
Lastly, we come to the where part of the block, which tends not to matter as much to the architect. Especially in modern system-on-chip designs, if you could take a look at the physical layout of them and identify the blocks, you’d see them all over the place. One vendor will always place the video encoder and decoder next to each other, but another will place them separately. Sometimes discrete blocks will share power delivery or a power island, but sometimes they won’t. The chip’s physical topology doesn’t really matter to the block architect. That’s the domain of the hardware person who implements it. More about the hardware person later.
First, we need to talk about simulation.
When the architect has completed their design, at least for modern processors in consumer electronics, it’s handed off to two separate teams. The simulation team’s job is to take the design and create a functionally accurate software implementation, somewhat obviously called the simulation. The functionally accurate part is important; because the simulation is of what the design is supposed to do, it must produce the same outputs as the real hardware for all the accepted inputs. But that doesn’t mean it has to implement the exact same how as the hardware. The simulation also doesn’t have to be cycle accurate, which is really important. (Since it’s in software, the simulation very probably can’t be cycle accurate.)
So the simulated software design acts for all intents and purposes like the hardware design, it just doesn’t perform the same. In reality, it’s many orders of magnitude slower for complex designs. For example, the simulated models of the GPUs that I’m used to working on can take minutes or even hours to generate single frames of output, depending on what’s being rendered. Hours. While the simulation is a great way to verify the design is correct, it’s still nowhere near the hardware in terms of performance.
For the curious, simulators tend to be written in C++, often using a C++ class library called SystemC. SystemC has features and idioms that allow the simulator writer to more easily model the functionality of the parts of the design that will be operational at the exact same point in time.
The benefit of the simulation is that it’s much cheaper and faster to produce than the hardware implementation. It’s much easier to test, verify, and debug, too, since it’s just software. (If you’re a simulator engineer, before you hunt me down and kill me in my sleep, rest assured I completely understand that it’s not just software, as if there’s no inherent complexity compared to other types of software.)
When an architecture is finished and handed off to the hardware and simulator teams, the simulator team is always going to finish first. It really has to, because the hardware team is going to use the simulator model to help verify that their hardware implementation works correctly for the same inputs and outputs!
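Here’s a toy version of that relationship, as a sketch only. The functional model gets the right answer however it likes; the stand-in for the hardware implementation works bit by bit, the way an adder built from logic would. The 8-bit adder and every name in the code are assumptions chosen to keep the example small.

```python
WIDTH = 8
MASK = (1 << WIDTH) - 1


def reference_add(a, b):
    """Functional model: correct outputs, no regard for how hardware gets there."""
    return (a + b) & MASK


def ripple_carry_add(a, b):
    """Stand-in for the hardware implementation: one full adder per bit."""
    result, carry = 0, 0
    for i in range(WIDTH):
        x = (a >> i) & 1
        y = (b >> i) & 1
        sum_bit = x ^ y ^ carry
        carry = (x & y) | (carry & (x ^ y))
        result |= sum_bit << i
    return result


def verify(implementation, reference):
    """Return every input pair where the implementation disagrees with the reference."""
    return [
        (a, b)
        for a in range(1 << WIDTH)
        for b in range(1 << WIDTH)
        if implementation(a, b) != reference(a, b)
    ]


mismatches = verify(ripple_carry_add, reference_add)
print("mismatches:", len(mismatches))
```

An 8-bit adder is small enough to check exhaustively. Real blocks are not, which is why the choice of test inputs matters so much.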
So we now understand that there’s an architecture team responsible for coming up with the way the block should work, a simulator team responsible for a functionally accurate software model, and also a hardware team responsible for expressing the architecture in terms of the final physical design. There’s a lot of overlap between architecture and hardware in many cases (and in my experience, great hardware people tend to be really good architects in many ways, and vice versa). For example, there’s no point in the architecture team designing something that has to be able to absorb a certain number of clocks of memory latency, but the hardware team implementing the main computational pipeline depth so it can’t keep enough work in flight to hide that latency. Architecture and hardware teams work very closely with each other to make sure the design is respected and implemented correctly.
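The latency example is easy to put numbers on with Little’s law: the work you need in flight is the latency multiplied by the rate at which you issue work. The figures below (400 clocks of latency, one request per clock, room for 100 requests) are invented for illustration, and the function names are hypothetical.

```python
import math


def work_in_flight_needed(latency_cycles, requests_per_cycle):
    """Little's law: items in flight = latency x throughput."""
    return math.ceil(latency_cycles * requests_per_cycle)


def sustained_fraction(capacity, latency_cycles, requests_per_cycle):
    """Fraction of the peak request rate a pipeline with `capacity` slots can sustain."""
    needed = work_in_flight_needed(latency_cycles, requests_per_cycle)
    return min(1.0, capacity / needed)


needed = work_in_flight_needed(latency_cycles=400, requests_per_cycle=1)
fraction = sustained_fraction(capacity=100, latency_cycles=400, requests_per_cycle=1)

print(f"requests in flight needed: {needed}")
print(f"fraction of peak rate sustained: {fraction:.0%}")
```

In this made-up case the architecture assumed 400 requests in flight and the implementation can hold 100, so the block spends three quarters of its time waiting on memory.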
That said, there’s still a lot of the hardware implementation that can be done entirely with black boxes, where the architect might not actually know how the pieces work underneath. That’s the great thing about designing something in modular fashion; as long as your part of the design accepts the right inputs and produces the right outputs, the implementation can sometimes, but not always, be just details for the hardware person to worry about.
That keeps things simpler for the architecture team and lets the hardware team focus on what it’s good at in terms of the physical implementation. A really good level of trust and communication is required between the architecture and hardware teams. Misinterpretation can cause bottlenecks to appear in the design where the architect didn’t expect them, or in the worst case, they can result in completely broken physical implementations. There are block designs in chips that were successful in the market but shipped with hardware-level defects caused by the implementation not really gelling with what the architecture team had in mind for the design. Get it really wrong, and you can’t ship.
When it comes to the building blocks of a modern chip, the real meat of the where is in power, area, and physical layout. Most blocks are rectilinear to make them easier to lay out, but that tends to waste space compared to more complex layouts—those where maybe the tooling wasn’t capable and a human got involved to reduce wasted area on the chip that doesn’t contain some working logic. Sometimes that’s desirable for routing or power delivery reasons, and sometimes you will always have extra space on a chip because you’re pad limited. But sometimes you really need to pack blocks together as closely as possible for the smallest possible area, and it’s the hardware team’s job to design a block that can potentially be laid out in a flexible way.
Let’s talk about that in more detail.
The output of the hardware team is a description of the design at the register-transfer level, or RTL, written in a hardware description language (HDL). The two dominant HDLs for writing RTL are Verilog and VHDL, but there’s no clear winner between them on the market, at least not that I’ve been able to discern. Some companies implement their hardware with one and some with the other, and the two languages can be used together in different blocks in a single chip, integrated by the physical design teams.
RTL is a human-written and both human- and machine-readable expression of the movement of data and processing logic for the blocks on a chip. It’s fed into the electronic design automation (EDA) tools operated by the physical design team. More on that later.
RTL encodes what needs to happen at the logical level of each part of the design, accepting inputs and working on them in some way before moving the data out at the back end. It’s very much like normal computer code. The hardware programmers build libraries of common processing parts in RTL, which allow you to build more complex structures.
Going back to my favorite example, the GPU, you tend to want to multiply two numbers at almost every stage of the architecture. There’s no point in the guy coding the blender, say, and the girl coding one of the ALU datapaths, to write their own interpretation of a floating point multiplier. Instead they can collaborate, create something together that jointly fits their individual needs because they tend to want to operate on the same kinds of numbers, and build the multiplier once.
Then in the RTL for their blender or shader core, they can import that shared multiplier and instantiate it inside that larger encompassing block, saving time to implement their part of the design and reducing the cost of validation later.
One of the key things to understand about RTL is that it can be debugged, tested, and validated before a chip is created. There are various software and hardware platforms that can take RTL and execute it pretty much directly, without the need to take the RTL all the way to a physical chip to see if it works.
Emulation and FPGAs
Remember, the complexity of a modern chip design is measured in billions of transistors these days. The physical area of a modern system-on-chip in today’s smartphones and tablets starts at around 50 square millimeters and triples from there at the top end. The chips in more powerful laptops and desktop computers, especially the GPUs, run to several billions of transistors and upwards of 500 square millimeters on modern process technology.
You can imagine those chips cost a lot of money to manufacture, and I’ll talk a bit more about that later. Because of the inherent costs, if you weren’t able to test your design functionally before taking it into physical chip form, it’d be impossible to design anything complex. Being able to prototype your design before the complex “back-end” physical processes get started is therefore completely fundamental to modern chip development.
We talked about simulation earlier, where there exists a functional software implementation that completely mimics what the hardware is supposed to do, just not exactly how it does things at the cycle level. But what about ways for executing the RTL ahead of having to turn it into a physical chip, in that desirable, cycle-accurate manner?
There are a number of options for doing so these days, ranging from the “cheap” to the incredibly expensive. Depending on the size of the block you want to implement and how it connects to the outside world, field-programmable gate arrays (FPGAs) might be an option. As the initialism suggests, FPGAs are reconfigurable arrays of programmable gates. The gates in question are the fundamental logic gates that processors are made out of, like the “and” gate, which takes two inputs and outputs a logical 1 only if both inputs are 1; otherwise it outputs a logical 0.
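To make “programmable gate” a bit more concrete: a two-input AND gate outputs 1 only when both inputs are 1, and most FPGAs get their reconfigurability from small lookup tables whose contents decide which gate a cell behaves like. The class below is a toy model of a two-input lookup table, written as a sketch in Python; `Lut2` and the configuration constants are hypothetical names, and real FPGA cells are wider and come with registers and routing attached.

```python
class Lut2:
    """Toy two-input lookup table: four configuration bits, one per input combination."""

    def __init__(self, config_bits):
        if len(config_bits) != 4:
            raise ValueError("a two-input lookup table needs exactly four bits")
        self.config_bits = list(config_bits)

    def evaluate(self, a, b):
        # The two inputs form an index into the stored configuration.
        return self.config_bits[(a << 1) | b]


# Output for inputs (0,0), (0,1), (1,0), (1,1), in that order.
AND_CONFIG = [0, 0, 0, 1]
OR_CONFIG = [0, 1, 1, 1]

inputs = [(a, b) for a in (0, 1) for b in (0, 1)]

cell = Lut2(AND_CONFIG)
and_table = [cell.evaluate(a, b) for a, b in inputs]

cell = Lut2(OR_CONFIG)  # "reprogram" the same kind of cell to be a different gate
or_table = [cell.evaluate(a, b) for a, b in inputs]

print("AND:", and_table)
print("OR: ", or_table)
```

The hardware is the same in both cases; only the four stored bits change.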
The FPGA programming process uses the RTL from the hardware team as an input. Remember that RTL is a logical representation of the function of the hardware, so it maps particularly well to FPGAs. Processing speeds for today’s fastest and largest FPGAs, which can implement the biggest designs, are in the low-single-MHz range for large blocks. That’s a far cry from the GHz-class designs you might expect for something like a modern CPU, but it’s only a couple of orders of magnitude away from the frequency of things like GPUs or DSPs.
Regardless of the low clock speed, it’s still an incredible improvement in speed compared to simulation. Crucially, it’s also cycle accurate! That property of cycle accuracy exposes the design, no matter how it’s implemented in final silicon, to performance analysis. If you’re able to connect the FPGA implementation to other FPGAs or prototypical silicon that helps you implement the wider system architecture that might house the design, you can start to figure out how fast it’s going to be in real systems.
There’s usually some kind of disconnect between the performance in FPGA form and the final shipping silicon. Still, FPGAs are usually enough to give you an idea, so you can start work on tuning the design for performance in both absolute terms and relative to what the design communicates with on the outside world.
Then you have a class of full-block emulation systems that, as long as you have enough of them, can be configured together to emulate very large designs in full. The “in full” property is important. Back to GPUs again (sorry). To keep it simple, say your GPU design consists of a front-end, a shader core, and a back-end. Imagine the design is such that even the largest FPGAs you can get your hands on are only big enough to hold the design for the shader core, but not the front- and back-end architecture as well. You’d have to split the design across multiple FPGAs or not implement it in FPGAs at all, depending on the inter-block complexity and communication you need between those parts in order to make the design work.
Emulators, which after years of consolidation in the EDA industry are now usually also produced by tools vendors like Cadence and Synopsys, are enormous. I really mean that: a large installation of a modern emulator that’s big enough for a large chip design can easily fill a big room, and that room tends to need to be specially constructed due to the power and cooling requirements.
In addition to being able to implement even large designs in full if you have enough of them connected together, the emulator can also be set to appear to a connected host system as the real device. Many months ahead of ever seeing the design in silicon, you can boot your operating system, load the driver for the GPU, and run the full software stack as if it were a real device.
Just like with FPGAs, the advantages of that ease of use and the ability to use the full software stack are hard to overstate. Emulation is great for the driver writer, the performance analyst like me, the hardware team implementing the RTL, or maybe the architecture team that might want changes to be made based on the full run-time data you’re now able to collect using the full software stack. All that’s possible because the emulator lets the system believe that the design is real.
Full block-level emulation is slow, even slower than FPGAs, at less than 1 MHz, and also many times more expensive than any other solution (millions of dollars to buy something that will be useful to the designer of a modern consumer system-on-chip, for example, which is why they actually tend to be leased). But it’s still much cheaper and much faster in terms of turnaround time compared to full silicon production.
So, let’s imagine that every step in the process we’ve talked about has been followed, up to and including full emulation of the design in a giant emulation platform that fills a huge, custom-designed building with fully plumbed-in liquid cooling. The design now provably works. It’s been tested with real software. The driver writer is happy with the driver running against the simulated and emulated models. There are no last minute hardware bug fixes to be made (hopefully!). The design passes the full regression suite. The team responsible for delivery to the customer—be that an internal customer if you’re someone like Intel or Qualcomm, or an external customer if you’re shipping to a company like Mediatek or Rockchip that only tends to work with outsourced designs—has signed off on the design.
The RTL is then shipped to the customer. This is an important milestone in the journey to full silicon, and for an IP-only supplier, it’s pretty much the last point in the journey. But there’s still so much to be done in order to get the design into a working chip in a device. So what’s next? One of the coolest and most mysterious processes, at least from my vantage point.
The RTL that describes the hardware then needs to be turned into logic in the form of transistors. That’s where EDA tools and the physical design team come in. The transformation of RTL into actual transistors on the chip is a process called synthesis.
If you understand how computer programming works, you’ll be familiar with the concept of compilation. You take code written in a high-level language and transform it through a series of steps, generating and consuming various intermediate representations of the code, until the final hardware code generation happens, targeted at the instruction set architecture of the processor that’ll consume it at runtime. Interestingly, modern processors also tend to take that binary representation of their instruction set and transform it internally into other representations they can understand, all hidden from the programmer.
The key point to take away is that, while there are multiple complex transformations of the original high-level code, the computational meaning isn’t changed. If the programmer wants to add two numbers together, you better not change the addition to a divide. You can argue that it is changed at certain steps, but I’d argue, at least for simplicity’s sake, that it’s actually just optimized.
That same “compilation” happens during RTL synthesis and the layout that follows it, where the functional hardware design encoded in RTL isn’t changed, but the final representation, in this case usually a really cool binary interchange format called GDS2 (formally GDSII), is now something that can help generate the actual transistors on the chip. Think of it as the map from the described logic in the RTL to the transistor structures on the silicon.
The output of the synthesis EDA tools is built from a library of standard cells. The cells are collections of transistors that implement a certain structure in the silicon. The easiest ones to point out are the SRAM cells, which tend to be readily visible in photographs of the physical chip’s floorplan. SRAM cells are usually consolidated in large, highly-regular rectilinear structures built from cellular building blocks.
The cells are tied to the foundry process, of course, and they tend to be provided by either the foundries themselves (someone like TSMC or Samsung LSI) or the EDA tools vendor (like Synopsys). It’s common for IP vendors to partner up with the cell library vendor to create a tooling flow for an implementation of a particular block that’s optimized for a certain foundry process and set of cell libraries—and their best operating conditions—to guarantee a set of performance characteristics.
So synthesizing the RTL is the act of turning the human-readable HDL into cellular blocks of transistors, in effect. And because transistors have physical dimensions, they need to be laid out in relation to one another.
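As a toy illustration of that idea, the sketch below maps a one-bit XOR onto a made-up library containing only a 2-input NAND cell, then checks exhaustively that the mapped netlist computes the same function as the original description. The cell, the netlist format, and every function name here are assumptions for illustration only; real synthesis tools and cell libraries are far more involved.

```python
from itertools import product

# Hypothetical one-cell "standard cell library": a 2-input NAND.
def nand2(a, b):
    return 1 - (a & b)

# Behavioural description, standing in for the RTL: y = a XOR b.
def rtl_xor(a, b):
    return a ^ b

# Hand-written netlist for XOR using four NAND2 cells.
# Each entry is (output_net, input_net_1, input_net_2).
XOR_NETLIST = [
    ("n1", "a", "b"),
    ("n2", "a", "n1"),
    ("n3", "b", "n1"),
    ("y", "n2", "n3"),
]

def simulate(netlist, inputs):
    """Evaluate the cells in order and return the value of every net."""
    nets = dict(inputs)
    for out, in1, in2 in netlist:
        nets[out] = nand2(nets[in1], nets[in2])
    return nets

def equivalent(netlist, reference):
    """Exhaustively compare the netlist against the reference function."""
    return all(
        simulate(netlist, {"a": a, "b": b})["y"] == reference(a, b)
        for a, b in product((0, 1), repeat=2)
    )

print("netlist matches RTL:", equivalent(XOR_NETLIST, rtl_xor))
print(len(XOR_NETLIST), "NAND2 cells used")
```

The point of the check is the same one made about compilers: the representation changes completely, but the function it computes must not.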
Because the GDS2 is a full physical representation of the structures on the chip, it has inherent size, and it might surprise some to realize that modern chip manufacturing is a 3D process (note that I’m not talking about three-dimensional transistors). Not only does the bottom silicon layer spread out in width and height in a single polysilicon layer, but the design also spreads upwards in terms of metal.
Every modern microprocessor has a metal stack. The reasons why will hopefully be obvious or at least become obvious shortly. Think of the full chip now, with the individual blocks that we talked about designing and implementing, all connected to each other in the final large chip design. But how are they connected? Tiny wires! The silicon layer is a single planar structure, but the connections between blocks aren’t implemented in the silicon. They’re implemented in metal wires, so the wires have to go upward.
In today’s large designs, there’s no way to do the wiring in just a single layer, so there tends to be a full stack of metal layers snaking around and through each other like 3D spaghetti with insulating material in between. The more metal layers, the more costly the design is to manufacture. And you can guess that, with a large silicon floorplan and many layers in the metal stack, laying everything out is not a task for a human being—at least not entirely.
Starting again at the silicon layer, the blocks occupying a shared planar surface need to be placed beside each other. For manufacturing reasons, the chips on a wafer need to be square or rectilinear, but the blocks inside don’t, although it helps if they are for layout simplicity. Imagine a variant of Tetris, where you have to not only get differently shaped blocks to fit together in an optimal space, but where you’re also faced with an extra constraint that requires related blocks to be near each other to keep the interconnection between them (the metal layer stack) as simple as possible.
You can probably quite easily understand that certain block layouts require fewer long wires to cross the chip connecting blocks to each other, to bus fabrics, or what have you, resulting in a simpler metal stack.
You can also probably imagine that the number of possible permutations of the block layout on the bottom silicon layer and the wire layouts in the metal stack is mind-boggling. Tiny changes in block placement at the silicon layer can lead to exponential growth in the wiring, for example. Searching through the possible combinations of block placement and wiring complexity therefore tends to be done by computer.
The bounding conditions for that search tend to be block frequency (because the clock for the chip has to propagate through all of the blocks), power (more transistors equals more power, relative to a fixed input voltage and frequency), and area (because the area has a physical cost in dollars for the silicon and the metal stack that wires it all up). There are many more in reality, and the EDA tools vendors tend to sell the software that figures it all out. The important thing to note is that finding an optimal layout for one input factor can cause huge changes in all of the others.
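To make the search concrete, here is a deliberately tiny sketch: four blocks, four slots on a 2x2 grid, and a cost that only counts wire length between connected blocks. The block names, the connections, and the cost function are all invented for illustration; real tools juggle frequency, power, and area at once and never enumerate every option like this.

```python
from itertools import permutations

# Hypothetical blocks and the point-to-point connections between them.
BLOCKS = ["cpu", "gpu", "mem", "io"]
NETS = [("cpu", "mem"), ("gpu", "mem"), ("cpu", "io")]

# Four slots on a 2x2 grid, given as (column, row).
SLOTS = [(0, 0), (1, 0), (0, 1), (1, 1)]

def wire_length(placement):
    """Total Manhattan distance over all nets for a block -> slot mapping."""
    total = 0
    for a, b in NETS:
        (xa, ya), (xb, yb) = placement[a], placement[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total

def best_placement():
    """Try every assignment of blocks to slots and keep the cheapest."""
    candidates = (dict(zip(BLOCKS, slots)) for slots in permutations(SLOTS))
    return min(candidates, key=wire_length)

all_costs = [wire_length(dict(zip(BLOCKS, s))) for s in permutations(SLOTS)]
best = best_placement()
print(len(all_costs), "placements tried")
print("best:", wire_length(best), "worst:", max(all_costs))
```

Four blocks give only 24 orderings, so brute force is fine here. The count grows as a factorial, though: ten blocks in ten slots already give 3,628,800 orderings, before any wiring choices are considered.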
Now it is possible for a human being to lay out parts of the design, and there are reasons why that might be desirable. The layout software is guided by an engineer or set of engineers, but it might only have a certain number of possible search strategies baked into it, and they’ll all be bound by the characteristics we talked about before: power, area, frequency, and so on.
However, there’s one important characteristic to the boundary conditions of the layout software: time. Because finding the final layout solution is an exponential problem in terms of the number of input conditions, the software has to somehow limit its run time. If the software isn’t sophisticated enough to find an acceptable layout for the usual input parameters, it’s possible for skilled folks to step in and either partially or fully guide the software to the solution the chip designer is looking for.
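One way to picture a run-time limit is a search that simply stops after a fixed number of attempts and returns the best layout it has seen. The sketch below uses a greedy random-swap search on a made-up 3x3 floorplan; the block names, the connectivity, and the strategy are assumptions chosen for brevity, not a description of what any real layout tool does. The `budget` parameter stands in for wall-clock time.

```python
import random

# Hypothetical 3x3 floorplan: nine blocks named b0..b8, one slot each.
BLOCKS = [f"b{i}" for i in range(9)]
SLOTS = [(x, y) for y in range(3) for x in range(3)]
# Made-up connectivity: a chain b0-b1-...-b8 plus two long-range links.
NETS = [(f"b{i}", f"b{i + 1}") for i in range(8)] + [("b0", "b8"), ("b2", "b6")]

def wire_length(placement):
    """Total Manhattan distance over all nets."""
    return sum(
        abs(placement[a][0] - placement[b][0]) + abs(placement[a][1] - placement[b][1])
        for a, b in NETS
    )

def budgeted_search(budget, seed=0):
    """Try `budget` random swaps, keeping any swap that is not worse."""
    rng = random.Random(seed)
    slots = SLOTS[:]
    rng.shuffle(slots)
    placement = dict(zip(BLOCKS, slots))
    start_cost = cost = wire_length(placement)
    for _ in range(budget):
        a, b = rng.sample(BLOCKS, 2)
        placement[a], placement[b] = placement[b], placement[a]
        new_cost = wire_length(placement)
        if new_cost <= cost:
            cost = new_cost  # keep the swap
        else:
            placement[a], placement[b] = placement[b], placement[a]  # undo it
    return start_cost, cost, placement

for budget in (0, 10, 1000):
    start, final, _ = budgeted_search(budget)
    print(f"budget={budget:5d}  start={start}  final={final}")
```

A search like this can never end up worse than where it started, but it can get stuck in a layout that no single swap improves, which is roughly the situation where a human steps in to guide the tool.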
Finding fully or partially hand-laid-out digital logic in complex, large microprocessors is rare, but it does happen. You can imagine that the reasons why it does happen are to really optimize the last few percent of a design in a certain way, to do the best possible job on performance, power, and area. I’ve looked at dozens of large chip designs in the last few years, in detail, and I’ve seen hand-optimized layout just once. I think. I say I think because it’s very hard to tell, as you can imagine.
It’s worth talking about clocking of large chips here, just briefly, since it’s such a big part of certain designs. The way a clock is applied to digital logic means that the clock propagates through a block’s cells, driving them pretty much in a wave carried by the wires that route through the metal stack. The clocks applied to a modern large design are complex, because they’re designed to cover a large range and move between levels and lock quickly. They’re also varied, because there’s no one clock to govern a single system-on-chip, so you find anywhere from a few to dozens of clock sources in today’s designs.
The length of the wires is the biggest factor in figuring out the delay of the clock travelling along them. That delay is the main limit for the peak operating frequency of a given block. Factor it into whatever else is sharing the same clock source, and it’s possible for a single block in a modern design to limit the peak frequency of all other blocks that share the same clock.
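The arithmetic behind that last point is simple enough to sketch. Assuming each block has some worst-case delay that must fit inside one clock period, the block with the longest delay sets the ceiling for everything on the same clock. The block names and delay figures below are invented for illustration.

```python
# Hypothetical worst-case path delays, in nanoseconds, for three blocks
# that share one clock source. The numbers are invented for illustration.
CRITICAL_PATH_NS = {"shader_core": 1.25, "texture_unit": 1.0, "l2_cache": 2.0}

def max_frequency_mhz(delay_ns):
    """A path needing delay_ns per cycle allows at most 1000 / delay_ns MHz."""
    return 1000.0 / delay_ns

def shared_clock_limit(delays):
    """Return the slowest block and the frequency it imposes on the domain."""
    slowest = max(delays, key=delays.get)
    return slowest, max_frequency_mhz(delays[slowest])

for name, delay in CRITICAL_PATH_NS.items():
    print(f"{name}: {max_frequency_mhz(delay):.0f} MHz on its own")

block, limit = shared_clock_limit(CRITICAL_PATH_NS)
print(f"shared clock limited to {limit:.0f} MHz by {block}")
```

In this made-up case, two of the three blocks could run much faster on their own, which is one reason large designs end up with many separate clock sources.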
The clock is effectively a tree. It’s generated at the root source and moves along the branches, which are the main feeding wires, to the leaves, which are the wires farthest from the source.
The clock tree is sometimes able to account for a non-trivial portion of a design’s area, since it needs to effectively feed the numerous complex blocks within a modern design. I point out the clocking just to give you an idea that not all of the area on a modern chip is dedicated to computational logic, and that clocking setups and clock variation strategies to keep power under control are an increasingly large part of modern chip designs.
When I first was introduced to semiconductor manufacturing, I can’t even remember when now, there was always talk of this mythical process called the tapeout. I always wondered why it was called that, because I couldn’t fathom where a tape was ever involved in semiconductor design or manufacturing. I fathomed right, because it’s a legacy term from back in the day, when the final data required for manufacturing was actually delivered on a tape or tapes. These days, the data is transferred electronically.
So what are the constituent parts of a tapeout as far as the chip design goes? The big part is the GDS2, produced by the synthesis and layout steps. It’s sent to a mask house to be turned into a literal photon-blocking mask. More about why in the next part.
The mask, or normally a set of masks for today’s designs, is the most critical part of the manufacturing process. The mask set can be modified by the foundry after its creation, but it’s generally set in stone and can’t be altered, so it’s critical that the mask house gets it right.
The delivery of GDS2 to the mask company is digital, but the mask is obviously a physical object. It looks really cool in person, if you’re ever lucky to see one; it’s big enough that you can see the constituent blocks of the design in good detail without a microscope.
The mask set and some associated metadata is then sent to the foundry for manufacturing.
This part is a series of books on its own, and I’m no expert, so I’m going to be brief. Silicon dice, and I presume it’ll be the same with the replacement materials for silicon when they eventually arrive, have circuits etched into them by lithography. And it’s not really just silicon either; the actual material is a mixture of silicon and dopants to give the resulting transistors certain electrical and switching properties.
Laser light, these days specific short wavelengths of ultraviolet and I believe usually 193-nm UV, is shone through the mask created in the previous step. It then passes through complex optics that focus and steady the beam, projecting the mask’s pattern onto a light-sensitive coating on the wafer so that the circuit can then be etched into the silicon. Some foundry processes even pass the beam through water, using the water as a lens! Since the transistor feature size is smaller than the light wavelength, the process is called sub-wavelength lithography.
The wafer is moved underneath the laser light to manufacture each individual die on the wafer. Apparently, there’s an incredibly complex set of computations happening in real-time to correct the optical assembly and the laser emission in order to ensure the dice are etched without defects. Some process nodes require multiple exposures through the mask set per chip, increasing time and cost.
Then the metal stack is laid on top and the full die assembly comes together (a gross oversimplification, but I don’t really understand the metal stack assembly and how it works). The wafers are then cut so that each individual die can be taken out. I believe modern wafers, even for tiny dice, are cut with what’s effectively a saw rather than something like a laser. The margin for error is incredibly small given how tightly the dice are packed together on the wafer.
Testing and packaging
After the dice are cut from the wafer, they need to be packaged into something that can be placed into the final device. Packaging at this point depends entirely on the chip and the target device market. For big PC chips like CPUs and GPUs, that usually means placement onto an organic substrate that connects the metal pins on the die to larger balls or pins underneath the package.
If you don’t own your own foundry like Intel or Samsung, packaging tends to happen externally, via third-party companies. The supply chain for semiconductors is really quite long. Most people assume that to create the final packaged chip, the foundry does everything after receiving the design from the designer, but that’s not true. External packaging adds some latency to the production process on top of the time taken for manufacturing by the foundry. The cut dice are sent to the packaging house for that step, then the packaging house sends them to another place for testing. There’s been some recent consolidation of this part of the production process, with packaging and testing houses becoming the same entity by merger or acquisition. Geographically, almost all of those for hire are in Taiwan.
Testing is the point in the process where you figure out if the chip is going to work or not. Certain tests can be performed on the full die, ahead of packaging. But there are some that obviously can’t, where you need the chip to be completely functional, powered on fully, and running certain external or self tests to determine operational functionality.
There are obviously certain other steps in testing, usually longer completely functional tests with full software stacks. Here, the packages are placed into form-factor devices and run through long run-time tests in varied operating conditions to ensure the chip can run in all of the environments it will ever find itself in. Those kinds of tests tend to be done by the chip vendor, with the chip in situ in a device form factor that’s representative of what you’ll finally buy.
If the chip vendor is happy at this point and testing completes properly, the design is signed off for limited production.
Yields and binning
Yields are computable at this point. You know how many wafer starts you had, you know how many chips came back working, and you can start to bin those chips at various grades. Because of the inherent nature of the physical manufacturing process, and despite the high degree of control over the whole process from the individual wafers upwards, not every chip is the same.
You want every chip to be identical, but there are inherent things stopping that from being the case. Sometimes you have functional defects, where blocks of logic on the chip just don’t work, but where you can suffer the loss and sell the chip as a slightly different SKU with the defective blocks turned off, and with different performance.
Sometimes you have process variation, where everything works the same as another copy of the same chip, but it won’t clock as high at the same voltage. So you have to test the chip functionally for defects and then test it operationally to find out where on the voltage/frequency/power curve it sits. Binning is inherently time consuming and therefore adds significant costs to things, but it’s the only way to guarantee you can sell as many dice as possible from a given production run.
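A rough sketch of that two-stage sort might look like the following. The test results, the grade names, and the clock thresholds are all made up; real binning also looks at voltage, power, and partially defective chips that can be sold with blocks disabled.

```python
from collections import Counter

# Hypothetical test results for eight dice from one run:
# (passes functional test?, highest stable clock in MHz at the target voltage).
TEST_RESULTS = [
    (True, 1520), (True, 1410), (False, 0), (True, 1330),
    (True, 1610), (True, 1190), (False, 0), (True, 1450),
]

# Hypothetical product grades, highest first: (name, minimum clock in MHz).
BINS = [("fast", 1500), ("standard", 1300)]

def assign_bin(functional, max_mhz):
    """Pick the best grade a die qualifies for, or mark it as scrap."""
    if not functional:
        return "scrap"
    for name, floor in BINS:
        if max_mhz >= floor:
            return name
    return "scrap"

counts = Counter(assign_bin(ok, mhz) for ok, mhz in TEST_RESULTS)
sellable = sum(n for name, n in counts.items() if name != "scrap")
print(dict(counts))
print(f"sellable yield: {sellable / len(TEST_RESULTS):.1%}")
```

With these invented numbers, dropping the "standard" grade would leave only two of the eight dice sellable instead of five, which is the kind of gap that makes the extra test time worth paying for.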
Otherwise, if your products demand uniform performance so your customers know exactly what they’re buying, and you only have a couple of performance levels to sell at, say a tablet and a phone, then you’ll have to discard some of your production run unless your foundry is fantastic. It’s all a big trade-off between the complexity of the chip, the complexity of manufacturing, and the target devices the chip is supposed to go into in the end.
There are various stages of production in a chip’s lifetime. The first part already happened, if you’ve been following along. There have been enough wafer runs to produce enough chips to sign off on basic functional and operational tests. This is hundreds of chips, but usually not thousands.
Then modern chips usually go into a wider—but still not full—production run to create enough chips for device vendors, which then make sure the chip behaves properly in final devices. More on that soon.
Then there’s full mass production. At this point, there’s a lot of money at stake for the chip vendor. They place an order with the foundry that can’t usually be altered; because of how the foundries work, which is related in part to the manufacturing section earlier, the chip being produced in a foundry can’t be changed quickly. To amortize the cost of swapping one set of masks and wafers out—remember this whole process takes place in completely clean environments with no contaminants that could find themselves landing on the chip and spoiling the lithography, and every swap of the mask set and wafers is a chance to introduce something that could compromise yields—you either have to place a big order, or you have to wait for the foundry to be doing something special with the production run for some reason.
That does happen from time to time. For example, when a new fab building comes online, the foundry will dedicate time and energy to swapping out the wafer types, optics, and mask sets more often than normal in order to produce a bunch of different designs, test out the production pipeline, and make sure the fab is operational. They’ll often produce different designs on the same wafer at this point! For certain kinds of wafer starts, Vendor A’s chips might be right next to Vendor B’s chips on the wafer, without either one ever knowing.
Mass production is usually on the order of at least hundreds of thousands of chips for consumer device designs, if not tens or even the low hundreds of millions over the production lifetime of some devices. Economies of scale kick in big time here, especially for the bigger chip designers. Some companies are able to keep an entire fab building consumed with a single chip design, for a single target market, for extended periods of weeks or longer at a time.
The economics of production mean that it’s not financially viable for a fab to run cold with no wafer starts, or for it to be constantly swapping and changing design starts. So chip vendors that can guarantee volumes and longer runs of production get priority over the smaller vendors that don’t need as much or that need many more designs to go into production.
Early prototype sampling
We talked a little bit about sampling earlier in the testing and packaging section, but it’s worth quickly revisiting. Sampling is part of the production process that happens ahead of mass production. A limited number of wafers are started to make sure the chip works before mass production orders are placed.
It’s at this point that chip vendors sample the silicon to potential customers, as well as to themselves for internal testing. The potential customers take delivery of a small number of chips, usually in the tens or hundreds, in order to test them out on prototype boards, well ahead of device form factor creation. Sometimes, this happens before anyone understands what the form factor is even going to be like for new markets or device types.
I find this part fascinating because of the varied sizes, shapes, colors, and functional variations these devices come in. Some are expensive, single-PCB designs with a nice socket for the test chips to go into and everything integrated pretty close to the full design. But some are crazy, multi-level, tiered PCB assemblies of various sizes, shapes, and colors joined by completely non-standard connectors you’ve never seen before and might never see again. Some don’t even have a socket, so you have to swap out an entire PCB assembly. I sometimes wonder how these frankenmachines can be transported safely without breaking, since they have an inherent fragility.
Lastly, I wanted to talk about the semiconductor intellectual property (IP) business model. I haven’t really touched on it specifically, although it’s been implied throughout, in the sections on block architecture, HDL, hardware, and the back-end physical design processes.
The semiconductor IP model is one where a company, the most famous example being ARM, goes through the process I’ve outlined from idea up to the RTL. The RTL is what they nominally sell, and they usually stop there. They only sell the source code to a block or blocks—not a full chip design, the chip itself, or anything like an end-user device.
But increasingly, semiconductor IP businesses are crossing the boundary from the RTL delivery into the later stages. They have to do that to stay competitive and help their customers with complicated tapeouts and chip productions.
It’s not uncommon now for IP vendors to supply, often in conjunction with an EDA vendor, not just the RTL, but also tools, scripts, and a set of cell library choices for the buyer to run at synthesis time to generate the block with a very specific, tightly-controlled set of physical properties. This isn’t quite an already fully synthesized “macro” instantiation of the IP, ready to integrate with other chosen blocks in the chip design at that point. It’s more a set of hard and fast rules that effectively say, “if you synthesize exactly like this, we guarantee this area, power and frequency for this IP”. Semiconductor IP firms do that because it’s increasingly hard for a back-end physical design team to get the most out of a complicated block like a CPU, GPU, ISP, modem, or video unit, especially for power and area.
So the block vendor gets involved now to help that process go as smoothly as possible. It’s still somewhat common to see fully synthesized macro sales, but that’s falling away in favor of a tightly integrated tools flow, especially one that’s aimed at integration into certain chip types for certain markets, where the extra costs can be justified by having a better chip in the end.
The transition from full chip vendor, where you do everything yourself, to semiconductor IP, where you’re either buying or selling blocks in RTL form, has come about almost solely because of the costs involved in meeting the increased complexity of designs, in order to then meet rising customer demand and expectation as chip technology, features, and performance march forward at a rapid pace.
That said, the semiconductor IP market is changing, too. I can’t talk about how, but it’s increasingly exciting to work in a space where there’s a high rate of change in how things are done, lots of new players coming and going, and an increasing reliance on the IP business model in order to make the complex chips that make their way into consumer electronics and related markets.
The final product
Hopefully, you now have a much better idea of the complexity, timescales, costs, number of steps, risks, number of suppliers, design, and architecture aspects of today’s modern processors. Most of what I’ve written applies just as much to companies like Intel, who are enormous and do almost everything themselves, as to much smaller yet incredibly agile vendors like Actions Semiconductor (who I bet you’ve never even heard of), who buy everything from semiconductor IP vendors and handle the steps between when the RTL is bought and delivered and when the chip is handed off to a device OEM.
If you want to find a nice average point for your understanding of consumer chips, especially SoCs, where the vendor is not quite an Intel but not quite the cheapest, tiniest, fastest—and thus least tested, integrated, and proven—provider at the other end of the market, things tend to happen on this kind of scale: at least a million units shipped of something that takes at least a year from idea to mass production and costs at least $10M to produce from architecture through IP selection and purchase, integration, bring-up, testing, early production, packaging, sampling, and then finally mass production.
Just a few short years ago, the timescales and costs were doubled for the average vendor. Just goes to show how quickly things are moving and yet still how complex everything is getting, in order to go from the initial idea for a chip to the device in your hand, the computer under your desk, or increasingly places like your car.
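To put the rough scale quoted above into per-chip terms, the sketch below spreads a one-off cost evenly across the units shipped. It uses the essay's floor figures of $10M and one million units as a baseline; the other volumes are hypothetical, and the calculation ignores the per-unit cost of wafers, packaging, and testing entirely.

```python
def development_cost_per_chip(total_cost_usd, units_shipped):
    """Spread a one-off development cost evenly over every unit shipped."""
    if units_shipped <= 0:
        raise ValueError("units_shipped must be positive")
    return total_cost_usd / units_shipped

# Floor figures from the text: $10M of cost, one million units.
BASELINE_COST_USD = 10_000_000
for units in (100_000, 1_000_000, 10_000_000):
    per_chip = development_cost_per_chip(BASELINE_COST_USD, units)
    print(f"{units:>10,} units -> ${per_chip:,.2f} per chip")
```

The same fixed cost weighs ten times more heavily on each chip at a tenth of the volume, which is the economies-of-scale argument in miniature.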