TR Forums

WhatMeWorry · Tue Jan 25, 2011 8:58 pm

I came across this layman's article about cache, but I still don't understand some aspects of it:

Level 1 cache actually resides in the processor core and runs at the processor speed, very fast compared to the other RAM. Due to physical space constraints the size of this cache is small; on the Intel Yonah dual core processor the L1 cache is 32KB while others can be up to 128 KB. Level 2 cache rests outside the CPU core and before the DRAM. This cache will typically run at speeds below the processor speed, but it is still faster than the DRAM and is far larger than L1 cache... One might ask, “if cache is so much faster than any other type of memory why not build a system that only uses cache?” The answer is money: the cost of SRAM (cache) ranges from $4,000 to $10,000 per gigabyte.

It mentions physical space constraints. What causes the constraint? Do L1 caches require more complex circuitry? And why is it so expensive? They don't use gold instead of silicon for cache :wink:

Does it have more defects per area than, say, non-cache logic?

I keep thinking about how we are coming up with 22nm processes and transistor budgets in the billions. To my naive understanding, shouldn't that make L1 cache cheap and plentiful?

SNM · Tue Jan 25, 2011 9:36 pm

WhatMeWorry wrote:
It mentions physical space constraints. What causes the constraint? Do L1 caches require more complex circuitry? And why is it so expensive? They don't use gold instead of silicon for cache. Does it have more defects per area than, say, non-cache logic?

I keep thinking about how we are coming up with 22nm processes and transistor budgets in the billions. To my naive understanding, shouldn't that make L1 cache cheap and plentiful?

The L1 cache requires a deep, deep integration with the CPU to maintain its clock speeds while staying clock-synchronized with the CPU. Basically it has all the constraints of every other part of the CPU in terms of physical layout on the silicon. L2 is far enough separated that you can just shove transistors at it and make it bigger without worrying about such things so much -- which you might notice CPU makers have in fact been doing.

tfp · Tue Jan 25, 2011 10:16 pm

There have been chips with large L1, but the only ones I can think of are the older HP PA-RISC chips. They were a bit faster than Sun chips of the same day, if I remember right, which had small L1 and large half-speed L2.

AMD has normally had a larger L1 than Intel, and their caches are/were exclusive vs. inclusive.

Though, like you, I am surprised the L1 sizes haven't increased more over the years; it must be easier/cheaper to get performance elsewhere.

Tue Jan 25, 2011 10:19 pm

At the clock speeds a modern CPU core runs at, the propagation time of electrical impulses -- even at near the speed of light -- starts to become an issue. The L1 cache needs to be physically small so that it can be located as close as possible to the execution units in the CPU core.
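To put rough numbers on that, here is a back-of-the-envelope sketch. The clock speeds and the 0.5c signal speed are assumptions picked for illustration, not figures from this thread; real on-chip wires are generally slower still because of RC delay, and a cache access is a round trip.

```python
# Back-of-the-envelope: how far can a signal travel in one clock cycle?
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # in vacuum


def distance_per_cycle_cm(clock_hz, velocity_fraction=1.0):
    """Distance in cm a signal covers during one clock period.

    velocity_fraction is the assumed signal speed as a fraction of c.
    """
    period_s = 1.0 / clock_hz
    return SPEED_OF_LIGHT_M_PER_S * velocity_fraction * period_s * 100.0


if __name__ == "__main__":
    for ghz in (1.0, 3.0, 4.0):  # illustrative clock speeds
        at_c = distance_per_cycle_cm(ghz * 1e9)
        at_half_c = distance_per_cycle_cm(ghz * 1e9, velocity_fraction=0.5)
        print(f"{ghz:.1f} GHz: {at_c:.1f} cm/cycle at c, "
              f"{at_half_c:.1f} cm/cycle at an assumed 0.5c")
```

Even in the idealized speed-of-light case, a 3 GHz clock leaves only about 10 cm per cycle, and that budget has to cover the trip out to the cache and back plus the lookup itself.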

Furthermore, the logic to manage a cache is not trivial. Any given physical memory address can be mapped to any one of a number of potential cache locations; there is also additional logic to track which locations in the cache are valid and which ones are "dirty" (need to be flushed back out to L2 because they have been modified). The larger the cache, the slower the cache management logic gets, since it needs to deal with more bookkeeping data.
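As a rough illustration of that bookkeeping, here is a toy model of a set-associative cache. The geometry (64-byte lines, 64 sets, 8 ways, i.e. 32 KB) and all the names are assumptions made up for the sketch; real hardware does the tag comparison across all ways in parallel rather than in a loop.

```python
from dataclasses import dataclass

LINE_SIZE = 64  # bytes per cache line (assumed)
NUM_SETS = 64   # number of sets (assumed)
WAYS = 8        # lines per set (assumed); 64 * 64 * 8 = 32 KB total


@dataclass
class Line:
    valid: bool = False  # does this slot hold real data?
    dirty: bool = False  # modified since it was loaded?
    tag: int = 0         # upper address bits identifying the line


def split_address(addr):
    """Split a physical address into (tag, set_index, offset)."""
    offset = addr % LINE_SIZE
    set_index = (addr // LINE_SIZE) % NUM_SETS
    tag = addr // (LINE_SIZE * NUM_SETS)
    return tag, set_index, offset


class Cache:
    def __init__(self):
        self.sets = [[Line() for _ in range(WAYS)] for _ in range(NUM_SETS)]

    def lookup(self, addr):
        """Return True on a hit: a valid line in the set has a matching tag."""
        tag, set_index, _ = split_address(addr)
        return any(l.valid and l.tag == tag for l in self.sets[set_index])

    def fill(self, addr, way, dirty=False):
        """Install the line containing addr into the given way.

        Returns True if the line being evicted was dirty, meaning it
        would have to be written back to the next level first.
        """
        tag, set_index, _ = split_address(addr)
        line = self.sets[set_index][way]
        needs_writeback = line.valid and line.dirty
        line.valid, line.dirty, line.tag = True, dirty, tag
        return needs_writeback
```

Every lookup has to compare the tag against each way of the selected set and check the valid bit, and every eviction has to consult the dirty bit. More sets and more ways mean more of this state to store and search.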

This is why we have L2 (and now L3 as well). As the caches physically move farther away from the cores, they get progressively larger and slower. You could even view system RAM as your L4 cache, sitting between the CPU and your disk drives...

bdwilcox · Tue Jan 25, 2011 10:33 pm

Parsimonious...good word.

just brew it! · Tue Jan 25, 2011 10:35 pm

Yeah, who says the Internet is turning us all into a bunch of illiterate dummies? :lol:

bdwilcox · Tue Jan 25, 2011 10:53 pm

just brew it! wrote:
Yeah, who says the Internet is turning us all into a bunch of illiterate dummies?

Good thing he didn't use the synonym niggardly. You can get fired for that, you know.

Wajo · Tue Jan 25, 2011 11:18 pm

It all comes down to minimizing average memory access times. It depends on cache hit rate and latency.

A smaller cache will have a lower hit rate (the percentage of times a particular item is found in the cache as opposed to being fetched from main memory) but will also tend to have a lower latency. The opposite is true for a larger cache.

Thus (when you do the math), the best performance is usually obtained with several levels of cache, starting with smaller, faster caches (L1) and growing progressively bigger and slower (L2 and L3).

Other factors are involved, but you get the general idea.
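For anyone who wants to "do the math", here is a minimal sketch of the usual average memory access time calculation. The function name and every latency and hit rate below are made-up illustrative numbers, not measurements of any real CPU, and the model assumes each level is only consulted after the previous one misses.

```python
def average_access_time(levels, memory_latency):
    """Average access time in cycles for a cache hierarchy.

    levels: list of (hit_latency_cycles, local_hit_rate) tuples, ordered
            from the level nearest the core outward.
    memory_latency: cycles to fetch from main memory after every level misses.
    """
    total = 0.0
    reach_probability = 1.0  # chance that an access gets as far as this level
    for latency, hit_rate in levels:
        total += reach_probability * latency
        reach_probability *= 1.0 - hit_rate
    return total + reach_probability * memory_latency


if __name__ == "__main__":
    # Hypothetical numbers: small fast L1 backed by a bigger, slower L2 ...
    two_level = average_access_time([(2, 0.90), (12, 0.80)], memory_latency=100)
    # ... versus one big cache with the same overall 2% miss rate but L2-like latency.
    one_big = average_access_time([(12, 0.98)], memory_latency=100)
    print(f"two-level: {two_level:.1f} cycles, one big cache: {one_big:.1f} cycles")
```

With these invented numbers both setups go to main memory equally often, yet the two-level hierarchy averages roughly 5 cycles per access against 14 for the single large cache, because most accesses are served at the 2-cycle level.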

Wed Jan 26, 2011 12:37 am

Wajo wrote:
Thus (when you do the math), the best performance is usually obtained with several levels of cache, starting with smaller, faster caches (L1) and growing progressively bigger and slower (L2 and L3)

I believe his question was more along the lines of, "Why don't we just make an L1 that is as big as the L1+L2+L3 all put together?"

If you don't understand that larger caches are necessarily slower, it is a reasonable question to ask.

mutarasector · Wed Jan 26, 2011 8:58 am

Good thing he didn't use the synonym niggardly. You can get fired for that, you know.

Naw, these days one simply can't use terms like "target", "Fire", or "Crosshairs". I guess that makes all discussion of AMD's "Crossfire" off limits.

kvndoom · Wed Jan 26, 2011 10:28 am

bdwilcox wrote:
just brew it! wrote:
Yeah, who says the Internet is turning us all into a bunch of illiterate dummies?

Good thing he didn't use the synonym niggardly. You can get fired for that, you know.

Dang I had forgotten about that. Crazy stuff... :roll:

Wed Jan 26, 2011 12:52 pm

And with that, let's please end this excursion towards the border of R&P.

Thanks for listening.

bcronce · Fri Jan 28, 2011 6:48 pm

The larger the cache, the higher the latency. If they made 256 KB of L1 cache, not only would it eat up more transistors, but it could more than double the latency, which would hurt performance.

It's a delicate balance between locality and latency. Prefetching and Hyper-Threading can help mask a lot of stalls caused by cache misses.

Also, with cache, latency is cumulative.

Example: a program requests data from a memory address, so it
checks L1 cache - 2 cycles, not there
checks L2 cache - 12 cycles, not there
checks L3 cache - 25 cycles, not there
goes out to main memory, where it has to wait 2 command cycles, 9 CAS cycles, 9 CAS-RAS cycles, and 9 RAS cycles @ 1600 MHz before the data is finally read. (Memory latencies are counted at memory speed, so a CPU at 3.2 GHz would see latencies of 4-18-18-18 instead of 2-9-9-9.)

You spend 39 cycles just figuring out that you need to get the data from main memory. The larger the cache, the less likely it is that the data will have to be read from main memory, but with diminishing returns and added latency at each step, you have to optimize your sizes.
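The arithmetic in that example can be written out as a short sketch. The cycle counts are the ones quoted in the example; treating the four memory timings as strictly additive is a simplifying assumption carried over from it, and the names are hypothetical.

```python
# Cycle counts taken from the example above.
CACHE_LOOKUP_CYCLES = {"L1": 2, "L2": 12, "L3": 25}
MEMORY_TIMINGS = (2, 9, 9, 9)      # command, CAS, CAS-RAS, RAS (memory cycles)
CPU_CYCLES_PER_MEMORY_CYCLE = 2    # 3.2 GHz CPU vs. 1600 MHz memory, as in the example


def miss_overhead_cycles():
    """CPU cycles spent checking every cache level before going to memory."""
    return sum(CACHE_LOOKUP_CYCLES.values())


def memory_access_cpu_cycles():
    """Memory timings converted to CPU cycles and summed (assumed additive)."""
    return sum(t * CPU_CYCLES_PER_MEMORY_CYCLE for t in MEMORY_TIMINGS)


def total_miss_latency():
    """Total CPU cycles for an access that misses in L1, L2, and L3."""
    return miss_overhead_cycles() + memory_access_cpu_cycles()


if __name__ == "__main__":
    print("cache lookups:", miss_overhead_cycles())
    print("main memory:  ", memory_access_cpu_cycles())
    print("total:        ", total_miss_latency())
```

Under those assumptions the cache lookups account for 39 of the 97 cycles of a full miss, which is why adding or enlarging a level only pays off if it catches enough accesses to cover its own lookup cost.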

UberGerbil · Fri Jan 28, 2011 7:48 pm

And there are secondary considerations as well. For example, the bigger the cache, the larger a target it is for soft-error-causing cosmic rays (process node reductions of course shrink the physical size, but make the bits easier to flip). The larger L2+ caches get around this by including a lot of error correction circuitry, but that increases latency and power usage (yet another reason why those caches are slower than L1). From the intro of a paper (PDF) from Carnegie Mellon:

Rising soft-error rates are a major concern for modern microprocessor designers. The reduction in charge stored in memory cells, a result of continued technology scaling, leaves on-chip SRAMs (e.g., caches, TLBs, register files) highly susceptible to soft errors. Coding techniques, such as SECDED ECC (single-error correct, double-error detect), are widely utilized for protecting on-chip SRAMs. For L1 data caches, however, where low access latencies are critical, the additional delay to correct ECC errors prohibits inline correction on a read. In the event an error is detected on a read, recent designs such as the AMD Opteron throw a machine check exception asynchronously, potentially halting the machine to prevent silent data corruption.

Further compounding problems, recent work suggests that spatial multi-bit errors, where a single cosmic particle strike upsets multiple neighboring memory cells, are increasingly likely at future technology nodes. Bit interleaving, also called column multiplexing, is the conventional approach used to protect memory arrays from spatial multi-bit errors. In bit interleaving, bits belonging to multiple ECC check words are physically interleaved so that a spatial multi-bit error does not affect adjacent bits from a single check word. For SRAMs in a high-performance processor, however, our results indicate that interleaving beyond two-way is prohibitively expensive from a power perspective as a result of the additional precharging of bitlines from the interleaved data.

The bigger the L1 cache, the more they have to worry about soft errors (and what to do about them). This may not be as important a factor in keeping L1 caches small as some of the other things already mentioned, but when you're designing the critical path parts of a processor everything has an effect that has to be considered.

Sat Jan 29, 2011 4:32 pm

AMD apparently uses ECC at all levels for data cache: http://support.amd.com/us/Processor_TechDocs/46878.pdf

(Instruction cache only needs parity checking instead of full-blown ECC, since you can just reload the bad location from RAM if corruption is detected.)
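A toy sketch of why parity is enough for that case: a single parity bit can detect a one-bit flip but cannot say which bit flipped, so the only recovery is to reload a clean copy. The function names and the 8-bit word are assumptions for illustration; this is not how any particular CPU implements it.

```python
def even_parity_bit(word):
    """Parity bit that makes the total number of 1 bits even."""
    return bin(word).count("1") % 2


def store(word):
    """Return the word together with the parity bit saved alongside it."""
    return word, even_parity_bit(word)


def parity_ok(word, parity):
    """True if the stored word still matches its parity bit."""
    return even_parity_bit(word) == parity


if __name__ == "__main__":
    word, parity = store(0b10110010)

    one_flip = word ^ 0b00001000   # a single bit upset
    two_flips = word ^ 0b00001001  # two bits upset in the same word

    print("clean word passes:     ", parity_ok(word, parity))
    print("single flip detected:  ", not parity_ok(one_flip, parity))
    print("double flip undetected:", parity_ok(two_flips, parity))
```

The double-flip case slipping through is the weakness that SECDED ECC and the bit interleaving described in the quoted paper are meant to address.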

Why are CPUs so parsimonious with L1 cache
