Nvidia claims Haswell-class performance for Denver CPU core

Some of Nvidia's CPU architects gave a talk at the Hot Chips symposium today, and they revealed some long-awaited details about Nvidia's first custom CPU design. We weren't able to attend the talk, but the firm evidently pre-briefed some analysts about what it planned to say. There's a free-to-download whitepaper at Tirias Research on the Denver CPU core, and I've been scanning it eagerly to see what we can learn.

We already know Denver is a beefier CPU than ARM's Cortex-A15, since two Denver cores replace four A15 cores in the Denver-based variant of the Tegra K1. We also know Denver is, following Apple's Cyclone, the second custom ARM core to support the 64-bit ARMv8 instruction set architecture. We've long suspected other details, but Nvidia hasn't officially confirmed much—until now.

Here are some highlights of the Denver information revealed in the whitepaper and presumably also in the Hot Chips presentation:

  • Binary translation is for real. Yes, the Denver CPU runs its own native instruction set internally and converts ARMv8 instructions into its own internal ISA on the fly. The rationale behind doing so is the opportunity for dynamic code optimization. Denver can analyze ARM code just before execution and look for places where it can bundle together multiple instructions (that don't depend on one another) for execution in parallel. Binary translation has been used by some interesting CPU architectures in the past, including, famously, Transmeta's x86-compatible effort. It's also used for emulation of non-native code in a number of applications.

    Denver's binary translation layer runs in software, at a lower level than the operating system, and stores commonly accessed, already optimized code sequences in a 128MB cache stored in main memory. Optimized code sequences can then be recalled and replayed when they are used again.

  • Execution is wide but in-order. Denver attempts to save power and reap the benefits of dynamic code optimization by eschewing power-hungry out-of-order execution hardware in favor of a simpler in-order engine. That execution engine is very wide: seven-way superscalar and thus capable of processing as many as seven operations per clock cycle. Denver's peak instruction throughput should be very high. The tougher question is what its typical throughput will be in end-user workloads, which can be variable enough and contain enough dependencies to challenge dynamic optimization routines. In other words, Denver's high peak throughput could be accompanied by some fragility when it encounters difficult instruction sequences.

  • Impressively, Nvidia is claiming instruction throughput rates comparable to Intel's Haswell-based Core processors. That's probably an optimistic claim based on the sort of situations Denver's dynamic optimization handles well. Nonetheless, Nvidia has provided a quick set of results from a handful of common synthetic benchmarks. These numbers are normalized against the performance of the 32-bit version of the Tegra K1 based on quad Cortex-A15 cores. They show Denver challenging a Haswell U-series processor in many cases and clearly outperforming a Bay Trail-based Celeron. Another word of warning, though: we don't know the clock speeds or thermal conditions of the Tegra K1 64 SoC that produced these results.

  • Nvidia has built the expected power-saving measures into the Denver core, with "low latency power-state transitions, in addition to extensive power-gating and dynamic voltage and clock scaling based on workloads," according to a blog entry Nvidia has just posted on the SoC. As a result, they claim, "Denver's performance will rival some mainstream PC-class CPUs at significantly reduced power consumption." That sounds like a bold claim, but one wonders if they're comparing to something like Kaveri rather than Broadwell.
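
The first two highlights above describe two mechanisms: bundling independent instructions for a wide in-order engine, and caching optimized sequences for replay. Below is a minimal Python sketch of that general idea, not Nvidia's actual algorithm. The `Op` type, the `bundle` function, the `TranslationCache` class, and the `HOT_THRESHOLD` value are all hypothetical names and parameters chosen for illustration; only the seven-wide issue figure comes from the article.

```python
from dataclasses import dataclass

ISSUE_WIDTH = 7      # seven-way figure quoted in the article
HOT_THRESHOLD = 2    # hypothetical: optimize a block on its second execution


@dataclass(frozen=True)
class Op:
    dest: str        # register written by the operation
    srcs: tuple      # registers read by the operation


def bundle(ops, width=ISSUE_WIDTH):
    """Greedily pack ops, in program order, into bundles of independent ops."""
    bundles, current, written = [], [], set()
    for op in ops:
        # An op conflicts if it reads (RAW) or rewrites (WAW) a register
        # that an earlier op in the same bundle writes.
        conflict = op.dest in written or any(s in written for s in op.srcs)
        if current and (conflict or len(current) == width):
            bundles.append(current)
            current, written = [], set()
        current.append(op)
        written.add(op.dest)
    if current:
        bundles.append(current)
    return bundles


class TranslationCache:
    """Toy stand-in for an in-memory cache of optimized code sequences."""

    def __init__(self):
        self.run_counts = {}   # block address -> times executed so far
        self.optimized = {}    # block address -> cached bundles

    def execute(self, address, ops):
        """Return (bundles, source) for one execution of the block."""
        if address in self.optimized:
            return self.optimized[address], "cache"
        count = self.run_counts.get(address, 0) + 1
        self.run_counts[address] = count
        if count >= HOT_THRESHOLD:
            # Pay the optimization cost once, then replay from the cache.
            self.optimized[address] = bundle(ops)
            return self.optimized[address], "optimized"
        # Cold code: issue one op per cycle, with no bundling.
        return [[op] for op in ops], "unoptimized"


if __name__ == "__main__":
    block = [
        Op("r1", ("r0",)),
        Op("r2", ("r0",)),
        Op("r3", ("r1", "r2")),   # depends on the two ops above
        Op("r4", ("r0",)),
    ]
    cache = TranslationCache()
    for _ in range(3):
        bundles, source = cache.execute(0x1000, block)
        print(source, len(bundles))
```

In this toy model the first run of the four-op block issues four single-op bundles, while later runs issue two. The dependency on `r1` and `r2` is what keeps the block from collapsing into a single bundle, which is the kind of limit the article has in mind when it warns about typical versus peak throughput.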

We should know more soon. Nvidia says Tegra K1 64 devices should be available "later this year" and alludes to its new SoC as an Android L development platform. I can't wait to put one of these things through its paces.

Responses to “Nvidia claims Haswell-class performance for Denver CPU core”

  1. I’m not a compiler expert by a long shot, so take this question for what it is:

    Isn’t the whole point that this thing doesn’t “just” have access to the entire program for static analysis, but that it in fact also has dynamic/run-time analysis capability courtesy of its JIT-like underpinnings?

    And couldn’t this thing potentially present a JVM-like native instruction set (without the AArch64/ARM v8 intermediary step) to e.g. Android?

    But yeah. I’m waaay out of my depth here.

  2. "(...) and vehicles." Isn't PowerPC used quite a bit in cars? And isn't MIPS used quite a bit in networking equipment? If this Denver thing actually works as advertised, all NVidia needs to do is license those instruction sets and they might be able to compete with everyone by simply writing new JIT compiler code. In theory, that is.

  3. So this closely resembles the Transmeta Efficeon. I remember that the Efficeon (and Crusoe) was slower on the first run of an application because on the first run it had to do the ‘Code Morphing’. After that, the native code was copied to the RAM cache and subsequent runs were fast.
    I wonder if this will apply to Denver too. I don’t see how it can’t have the same problem.
    Considering that nVidia took a license to TM technology in 2008, I wonder how much of CMS ended up in Denver.

  4. And OpenVMS, which is still HP. I’ll give it that it runs OpenVMS pretty darn fast though. We have one Integrity rx2660 box at work running a pair of Itanium II 1.66GHz CPUs. Everything is compiled native for ia64 and the thing is a query running, data sorting beast…even 5 years later.

    Of course, HP is about to drop OpenVMS. They’re going to offload support/development of the OS to another company, aptly named VMS Software, Inc. The vendor for the application we run on it is still eager to get us onto one of their other products that’s SQL Server based, though.

  5. I understand that you are not well informed and that’s excusable but no need to be rude.

    In the year ended January 2014, Tegra generated 398 million in revenue and an operating loss of 268 million (go check their 10-Q forms).
    In Q1 this year (Jan-April), Tegra revenue was 139 million, with an operating loss of 61.44 million.

    So, if you finished the second grade and you take into account that Tegra margins are lower than their corporate average, you can figure out that they need north of 1 billion in annual Tegra revenue to break even.
    And next time, maybe try to be civil, especially when you are not even mildly familiar with the subject.

  6. If you’re referring to FX!32, it was a standard emulator that wasn’t baked into the firmware of the CPU. FX!32 did do caching of the converted code to disk so the conversion would only be done once.

    For a brief period of time, it was faster using FX!32 on a high end Alpha than using a real x86 chip. Alpha was damn fast for its time while x86 was comparatively weak. Remember this is way back in the Pentium 1 days ~20 years ago.

  7. That’s a non sequitur. Just because there are low power OoO processors does not mean that OoO is a small part of a CPU’s power budget.

    There’s a lot of research on CPU power budgets. For example:

    "This analysis demonstrates that one of the critical components for power consumption in modern superscalar processors is the part devoted to extract parallelism from applications at run time. In particular, the wake-up function, which analyzes data dependencies of the program through an associative search in the instruction queue, is the main power consuming part of the issue logic. This is reflected in a high complexity circuit and high logic activity. For a typical microarchitecture the instruction queue and its associated issue logic are responsible for about 25% of the total power consumption on average."

    This was in 2001. We've moved on to larger queues and wider architectures now, so the power requirements for OoO have increased.

  8. That’s definitely one way to look at it!

    Reading the article and comments, I’m left convinced that Nvidia either has a card up their sleeve or that they’re crazy.

  9. Alpha did this as well. It was marginally successful in some applications, but overall it was a wash and reduced performance. **But** it did allow people to bridge the gap.

    Likely NV will do the same here.

    “But here! and run all your arm and x86 code!” .. throw enough cores at it and it kinda works ok..

    But I still think it is very niche..

  10. I’m not aware of them working on LLVM but they did opensource their EKOPath suite and put it under GPL.

  11. "This CPU design sounds better for a server than a consumer device where users might pull up any number of things, though." Denver was blatantly designed to be a server chip. Apparently Nvidia has had a change of priorities, but Denver will most likely be used as a quasi-server in embedded applications like signage, kiosks, and vehicles.

  12. Seems like nVidia is banking on the simplicity of Android working in favor of oft-repeated code instead of multi-purpose, highly capable architecture. This CPU design sounds better for a server than a consumer device where users might pull up any number of things, though.

    That said, I remember Transmeta and I remember that all the promise in the world led to nothing but horrible performance and dismal overall compatibility.

    However, a setup like this would serve nVidia well should they choose to support more than just ARM with a chip like this. If it can translate into its own internal instruction set, perhaps it could be turned toward other platforms, too. I imagine that was the original reason for this design choice.

    Can’t help but think this is going to wind up looking subpar next to dedicated architectures. Did nVidia just make its own Bulldozer/Pentium IV, I wonder?

  13. Nvidia went with a VLIW architecture because it is closest to how GPUs work… kind of going with what you know even when that’s not necessarily the best way to put together a CPU.

  14. Just because an idea has never worked in the past we shouldn’t assume it can never work in the future…

    But I would think this isn’t the best way for VLIW to really achieve its maximum potential. I would think VLIW would work best as part of a tightly integrated stack — processor, compiler, language, OS, APIs, and software distribution.

    For example, I could imagine this scenario — you buy an app off of the iTunes store, and what’s downloaded to your device is a version of the binary that has been compiled specifically for the version of the processor in your iDevice. If you upgrade your device, a software update is automatically pushed out to the new device with a binary compiled to the new device. This could work in Apple-land because of the tight control over the whole stack — all apps written in the same language, compiled on the same compiler to a limited variety of processors, all of which were designed by the same people in charge of the compiler and the language, and of course all the APIs written in that same language and heavily optimized for this stack, etc etc.

    Interesting that (at least so far) Apple has not chosen to go the VLIW route, despite all their control, and despite boatloads of money to throw at the problem. So…. good luck with that, Nvidia.

  15. OMFG… Nvidia Itanium ?!? WTF are they smoking over there ?? Intel did this TWICE and FAILED TWICE …

    Jen-Hsun Huang has finally slipped his nut straight into a blender of crazy … wwooooww….

  16. Actually it isn’t going be AMD and nVidia, at least at this juncture. It could be nVidia and IBM though. It appears that nvLink and the POWER8’s CAPI interface are compatible. That would mean future nVidia GPU’s would have a coherent means of communicating with POWER8 cores. Moving to a common socket would be an interesting next step but that depends heavily on the design goals of future nVidia and IBM server CPU’s. Though IBM has produced different CPU packages for a POWER core depending on just how high end of a server the chip is going into (organic flip chip packaging for low end rack servers vs. massive ceramic MCM for mainframe-like full frame systems).

    Considering the market consolidation going on, AMD adopting CAPI is potentially possible down the road. A common socket between IBM and AMD was once rumored before but the design goals diverged before this could be realized. AMD using CAPI would enable HSA with an AMD CPU (ARM or x86) and using a nVidia GPU.

  17. Speaking of which, I’m really curious if Android GPU drivers are going to be ARM or native binaries on Denver SoC’s. Bypassing the binary translation would be a clever means of increasing performance.

  18. The concept behind the PowerPC 615 was that it would have two decoders that translate PowerPC or x86 instructions into a common set of micro-ops. This would enable a common execution back end. The problem is that the PowerPC 615 never shipped due to various technical reasons outside of the dual decoder paradigm.

    The difference between the PowerPC 615 and Project Denver is that by using binary translation, Denver could get a firmware upgrade to support another architecture. Right now Denver supports ARM, but with a hypothetical firmware upgrade it could be made to support x86 code. With nVidia’s recent wins in the automotive industry, a possible PowerPC translation could be of use. IBM’s design philosophy behind the PowerPC 615 would require the chip to be redesigned to add another decoder to support the ARM ISA alongside PowerPC and x86.

  19. The concept of ‘translate once and run natively multiple times’ doesn’t change with the addition of 128 MB of memory. (Transmeta had a similar translation cache in memory, though I believe off hand that the size was 32 MB.) Energy still has to be spent to do that initial translation even if it is only performed once. The performance side of performance per watt has to be able to make up for the energy used in translation and for the translation look up. (Memory reads consume power and Denver will need to do that to see if the code it needs has been translated to decide what it needs to fetch.) That is going to be a challenge due to the difficulties of finding enough parallelism for a wide core.

    The other factor in performance per watt is going to be the ARM cores which Denver will be compared to. These consume very little overall power so the performance lead nVidia will have to claim will need to be substantial for translated code. The lead doesn’t need to be as impressive the less translation is needed.

    For an ultra mobile platform, using binary translation is kinda crazy due to the overhead that consumes power. Sure, it isn’t a big hardware instruction decoder that’ll need to be invoked frequently like in x86 but it is being added to support an architecture whose heritage has been about power optimization over performance and minimizing overhead.

    Last I checked nVidia is still moving forward with putting a Denver core into their GPU’s and there it’ll be a much better fit. What is a 5W CPU core compared to a >200W GPU? Ditto if nVidia plans on producing a dedicated SoC for datacenter usage using Denver.

  20. So it sounds like maybe it reduces the need for transistors dedicated to the task of reordering code, and instead reorders code using the more general purpose transistors in the CPU. So either way, there is hardware being used to reorder instructions. It’s just a question of whether that hardware is dedicated solely to the purpose of reordering instructions or if it can also be used for other tasks. Am I getting this right?

  21. Itanium only ‘smoked’ on FP code, which all VLIW architectures are inherently good at. For integer and/or scalar workloads, even native software performance was at best ho-hum, which is a major reason why Itanium never took off.

    The other one is that it never scaled in clock speed or performance the way Intel said it would, while also being fairly power- and heat-intensive.

    If K1 is VLIW then I’d expect similar performance issues.

  22. Wasn’t PathScale working on LLVM as well? I thought I read something about that on Phoronix about 6-9 months back.

  23. The reordering of the code isn’t done by dedicated hardware such as reorder buffers (ROBs) and the other hardware used at this stage of execution in a traditional out-of-order CPU.

    The CPU runs software similar to a “compiler” (from firmware) to analyze and rearrange the transcoded instructions (from ARM to internal code, with the HW decoder), pack some instructions into a VLIW instruction, and store the final “recompiled” code in the caches and/or the dedicated RAM.

    Then the CPU finally runs the “final code”, which isn’t ARM code anymore, from the cache and the special RAM. This code runs strictly “in-order”, because of the nature of VLIW CPUs and code. But this code is the original ARM code, processed “out-of-order” and reordered. It has been transformed into VLIW-type code.

    It’s a hybrid solution of HW and SW.

  24. OoO execution is most efficient when your CPU has hit a “memory wall” and requires L2/L3 cache, aka, it’s running faster than the RAM can keep up.

    OoO allows the CPU to keep running while waiting for data to come back from the RAM. Without it, it would have to constantly wait for the RAM to respond.

  25. How does it happen out of the cpu resources? Doesn’t software run on hardware? Did I miss a meeting?

  26. Stranger things have happened. But I just don’t see this happening. Incidentally was NVIDIA’s nForce a success or failure?

  27. So it’s just Haswell-class? Not Broadwell class? And all this time I thought Nvidia wanted to rain on Intel’s parade.

  28. Yes, it has some hardware assist for the translation.

    And no, it isn’t like you say about power. Think about it: the Denver CPU uses the caches AND 128 MB of RAM (reserved for this specific use and outside the control of the OS) to store the translated binary. The idea is to “translate once, and run ‘natively’ multiple times”. The design is more power-optimized than the Transmeta CPUs (which store the translated code only in the cache), and the Transmeta CPUs were very frugal implementations for x86 designs of similar computational power.

  29. No, the PowerPC CPUs are all standard RISC designs. This one (Denver) is a VLIW design (Nvidia bought a set of patents from Transmeta, so now you know what it did with this purchase).

    The Transmeta CPUs, such as Efficeon, are the nearest designs to the Denver CPU. Further removed are the Itanium CPUs, Elbrus, etc.

  30. Nope, it’s a VLIW design. The reordering happens “out of the CPU”, more or less (the process still happens, just not in dedicated CPU resources), with software that translates from ARM code to internal VLIW code (fusing instructions).

    Compare the Transmeta CPUs, Itanium, Elbrus CPUs, etc.

  31. I don’t think OoO is as power hungry as some make it out to be. Right now I’m using a Nexus 7 powered by 4 Krait cores. They’re OoO. Works great for 10 hours.

  32. "lol Tegra has always lost money" lol Thanks for making it easy. Now I know I'm talking to a basement armchair analyst. I hope that you can see how that might have been the case when Nvidia did <$50 million per quarter, and not true at all when they are doing >$150 million per quarter. Now. As for the rest, I like gems like "sub 9 inch segment is gone". lol good job

  33. OoO execution is incredibly power hungry and represents a major portion of modern CPU power budgets. Nvidia is trying the Itanium route, except using JIT optimization to avoid the binary incompatibility issues.

    Is it going to be slower than an OoO engine running native microcode, all else held constant? Yes.

    Can Nvidia use the freed power budget to overcome that disadvantage?

    Unknown factors:
    Can Nvidia write their JIT optimizer to run well on the 7-issue processor without OoO execution? I think so.
    Can Nvidia’s JIT optimizer consistently optimize programs, or is there a wide range of throughput? I don’t know.

    However it pans out, it will be an innovative entry into the market.

  34. Of course, but that’s not the point. With Apple not selling their chips to anyone else and them using their chips only for whatever purposes they wish, it means their chips are pretty much limited to iPhones and iPads, unless Apple suddenly decides to use them for other things, which is what these ARM chips from AMD, Broadcom, etc. are for.

  35. In case you missed it, AMD is already well into the development of their custom ARM core. They may have sold their mobile SOC division to Qualcomm but that doesn’t mean they couldn’t design ARM cores anymore, ever.

  36. I think this design borrows heavily from the PowerPC 615, in that it can run x86 code as well as PowerPC code. You want x86 compatibility? Sure thing. Wanna go native PowerPC? It’s there too.

  37. Only AMD’s ARM chips will compete against Denver. Their upcoming x86 chips will not, at least not directly. There are no hybrid ARM/x86 chips announced by AMD thus far.

  38. I don’t think AMD will have many problems making a custom ARM chip; they won’t be competing with Apple anyway. Nvidia is already competing with K1.

  39. This got me thinking a little bit. What if AMD and Nvidia kiss and make up and come up with a common desktop/server socket infrastructure and support chipsets so that they can jointly attack the x86 market? They may also invite other ARM licensees to join the fragfest if they think it wouldn’t muddy the waters too much. United we stand, divided we fall.

  40. I saw that block too. Maybe it has a slow(er) HW decoder for code that hasn’t been translated yet. This way, even untranslated code will run at a somewhat decent speed. It could be quite useful for code that only gets visited once.

  41. It’s kinda curious to see how Denver is an in-order execution core considering it’s a wide machine. OK, there’s the binary translation to consider but still, serial code can only be parallelized so much and given the machine’s width, OoO execution seems imperative to extract as much parallelism as possible from the instruction stream. I don’t know. Seems like an odd design decision in an era of OoO cores.

  42. Adding to Damage: Nvidia has made tall claims before. Once they compared Tegra 3 to Core 2 and said it’s as fast. It’s not easy to outpace Intel in general-purpose microprocessors, especially in a single generation.

  43. 4 cores at 2.5GHz vs 2 cores at 1.4GHz (non-TDP-optimized Celeron part)?
    So based on Nvidia’s results, core for core, clock for clock, Intel’s architecture is 3.6 times more efficient?

    Benchmarks also run very well-controlled code. So I wonder if Nvidia tweaked some of those benchmarks with hand-tuned binary translations.

    I wonder if an x86 design could take the same approach (move the x86 instruction decoder into software) so the CPU solely executes its native micro-ops. For an SoC this could save ~5%? of chip real estate,
    and for large code execution, maybe ~5%? in power savings?

  44. Yes, sounds like directly compiling to Denver is the natural thing to do, so NVIDIA is probably on that track.

  45. Note: target frequency is 2,5GHz for Denver, so it is 1,4GHz Haswell (without boost) versus 2,5GHz Denver. (See image in blog post announcing this)

    Frankly, there are so far too many missing pieces in their comparison. And they use Geekbench (regarded as too problematic) and no SPEC2k6…

    Suspect we already saw best results Denver can get.

  46. In hardware, reordering is done partly in response to unforeseeable conditions; in software, it’s done only for those that are foreseeable.

  47. "Everybody who constantly whines about Intel and x86 should think first because Intel did every single possible thing 'right' from the perspective of academia and the supposed experts on CPU design with Itanium... and it flopped badly in the real world." I doubt there was ever a monolithic group of CPU design experts that agreed on anything. Even now when it seems like there would be good agreement... here comes something crazy from nVidia!

  48. Denver is unique in that it should be looked at from both a binary translation perspective and as an ARM chip. So for benchmarking, it’d make sense to put it against other ARM SoC’s. It’d also be appropriate to bring up the Transmeta Crusoe chip that also did binary translation. How well Denver handles translated code compared to native ARM can be compared to how well Crusoe did vs. native x86. Essentially, this would show how much binary translators have improved over time.

  49. That block diagram is weird as it contains a unit labeled ‘ARM HW Decoder’ along with a nice arrow by-passing it. This arrangement has me scratching my head as binary translation should be done by software. Perhaps some hardware assist? The by-pass arrow makes sense for what nVidia has proposed so that instructions are only decoded once.

    The other thing is that the power consumption aspect of this design is likely both good and horrible simultaneously. The catch is whether code has already passed through the binary translation layer. For native code, the design should be rather efficient. If the code has to be translated, things are going to be ugly in comparison to native ARM implementations. It takes power to run the binary translation software that could be put toward actual work. Even if the translation significantly increases performance, the performance/watt factor is handicapped due to the overhead. The more often translated code is run, the smaller the overall impact of the overhead becomes. I’m struggling to see how this could be a big win (it could be a moderate win for loop-heavy code).

    The only other thing I wonder about is if there would be any true native applications. nVidia is betting on ARMv8 taking off but they could also supply an OS build that slips in native VLIW compiled code in key spots to improve performance and reduce power consumption. It’d surprise me if nVidia didn’t do this for at least the GPU drivers.

  50. Well in fairness to Intel, half of Itanium came out of HP research. They’re busy building ‘the machine’ now.

  51. While not for mobile, there is also Broadcom making custom ARM cores for servers. Their design is noteworthy due to being the first ARM implementation that I know of that uses SMT.

  52. Well there still is PathScale but ya there aren’t too many of them left. Most are now focusing on efforts like LLVM.

  53. He and everybody SHOULD have a grudge against Transmeta.

    Binary translation for dynamic efficiency? That’s like saying “we take a loss on every unit made, but we’ll make up for it in volume.” Vacant, vapid, lame, marketing BS.

    Transmeta CPUs sucked. I still have my old Sony Picturebook; so slow it almost goes backward.

  54. You’re on the right path. You should do that reading up, though. As has been mentioned in the last few days–if not in this article, at least in others–if you haven’t read H&P, you’re not even in the game.

    Basically, binary translation like this works best when the two architectures differ the least. The more they differ, the more it makes sense to write a compiler for the real target.

    Unless there is a great target for 64 bit ARM code optimization, there is little reason to try to leverage that. Better to offer an ARM v8 variant and put your effort into optimizing GCC to produce good code for your chip. I think Qualcomm–as much as I don’t like them–have proven the truth of that.

    Wider execution paths always look enticing to the new designer because they promise so much possible performance. But, the lack of experience of the designer dooms them in the end. It’s the processor design equivalent of the fool’s gambit.

  55. Denver is going to be making its way into nVidia's big GPGPU chips (think GM210). The goal is to remove the immediate need for a host CPU. The main purpose of a big Xeon chip would be to put lots of local memory on a node, as nVidia's GPGPU boards will continue to have relatively small amounts of memory in comparison (12 GB on current Tesla boards vs 1.5 TB of memory on a Xeon E7 socket). nVidia did cancel their dedicated Denver server CPU.

  56. Portland? Isn’t that pretty much the last independent *serious* compiler vendor out there? Wow, I missed that. Good catch!

  57. lol Tegra has always lost money and they have next to no share.
    Custom core is not cheap and there is the baseband and all the other little bits.. Going forward there will be more specialized compute units and more integration with costs rising and the competition being able to afford those costs. There is the software team, sales and marketing where if you can’t afford a certain size you are out
    In phones, Qualcomm has the bulk of the share, then MediaTek has almost 30%, and everybody else has little.
    In tablets, there are the two from phones, plus a bunch of Asian companies that have significant share, plus Intel dumping chips for free.
    Nvidia has practically no share in phones and very, very little in tablets, while addressing only the high end in tablets.
    Foldable screens will kill tablets very soon and the entire sub 9 inch segment is gone, with bigger ones remaining. Then there are wearables, including glasses, where size matters a lot more and integration will be a lot more important.
    Nvidia used to say that they need 1 billion yearly Tegra revenue to break even but that was before buying Icera and costs will keep going up.
    At the same time, on a global level, phone prices are gonna get pushed down at least 50% and some of that will be offloaded to parts makers. There will be a huge pressure on margins in phones.
    The LTE margins opportunity is also closing, LTE had a premium but now Mediatek is starting to sell LTE and prices are going lower – watch Qualcomm take a hit on margins now, as they are forced to compete.

    Nvidia and everybody else absolutely need integrated baseband to survive in mobile. There is no question about it, and Nvidia paid 367 million for Icera because they knew there is no other way.
    The only question is when they’ll be ready to actually do it and you can be sure that they are doing all they can to get it done as soon as possible.
    The future is mobile. They can exploit all other niches (cars, robots, PC, server), but not going after the main market just doesn’t make sense, especially after investing so much in it in the last decade.

  58. 7-way superscalar seems a bit excessive and optimistic. There is a reason AMD went from 5-way to 4-way VLIW with the Radeon HD 6900 series. It’s hard to parallelize instructions like that, especially for CPU loads that often rely on being in order.

  59. That’s odd. Most stories I read about Denver say that it is the next incarnation of the Tegra K1 chip, which is aimed at tablet form factors.

    The wiki entry lists the HTC Nexus 9 and Google Project Tango Tablet as possible devices that will utilize Tegra K1 SoC with a Denver chip.

  60. Hmmm.

    So, now that Android is shifting to ART, couldn’t Nvidia just have made sure it compiles to Denver as well, instead of using binary translation? And doesn’t ART use ahead-of-time compilation instead of just-in-time? Does binary translation still work, then? What about overhead?

    Will have to read up on in-order vs OoO execution again, but didn’t Apple and Intel both improve performance by going wider and OoO? As Scott points out, how does it work given a real-world workload?

  61. I don’t believe Apple’s slim lead in SoC tech will be enough to overcome the many reasons consumers currently choose their competitors’ products over Apple’s.

  62. Samsung never made a custom ARM design, and Qualcomm was caught with their pants down when Apple released the A7, and it doesn’t look like they will recover any time soon.

    As for AMD, they sold their ARM designs a looong time ago and I doubt they have the $$ to build a capable team. Also, they’ll target servers, not high-end smartphones, so their designs won’t end up in people’s hands.

    One can hope for more competition, though, as we need Apple to be kept in check. I certainly don’t want them becoming the Microsoft of the ’90s when it comes to high-end smartphones.


  63. Believe me.. Denver is ONLY aimed at servers and higher power draw embedded systems like vehicles.

  64. Don’t count out AMD, Qualcomm and Samsung. They might yet come up with killer ARM SoCs as well on the high end.

  65. And this is coming out when? ’Cause in roughly a month, Apple’s Cyclone v2 is coming out, and if we look at past performance, it will probably have a 50% performance improvement over its predecessor.

    Exciting times ahead in the ARM SoC world! Too bad TI and Qualcomm are abandoning the high-end custom CPU wars. All that is left is Apple and nVidia. I honestly didn’t see this one coming.


  66. Ummm, you do realize that Nvidia acquired one of the most respected HPC compiler companies last year, right?

    25 years of high performance optimized compiler experience right there, in addition to their team that has been writing compilers for GPGPU for over a decade now.

  67. If you actually are interested in ARM server procs then why aren’t you wondering why Nvidia went out of its way to [b<]not[/b<] compare Denver to a single ARM server processor? Or even an Intel one?

  68. [quote<]Nothing else can generate sufficient volume to support the costs.[/quote<] Says who? No one knows the costs except Nvidia.

    Most other players have to pay for an ARM license and a GPU license. On top of that they have to pour R&D $ into memory controllers, video decoders and all the other smaller parts, or again, license them from others. Qualcomm has (had?) an advantage over those in that they only license (or licensed) the ARM ISA, create their own CPUs, and create their own GPU IP. So where does Nvidia stand in all of this?

    GPU: Nvidia GPU IP is split between mobile, desktop and HPC/Pro markets. Extreme success in mobile is not required to "break even" in this part. It's been self-funded for ages, with great margins.

    CPU: On the CPU front Nvidia is at least on par with Qualcomm, creating their own CPU cores. Well, they are ahead in this respect now, in that Qualcomm has gone with vanilla A57-A53 cores for the next generation. Denver and future CPUs will be shared between mobile and the small-server market, and will probably make it into desktop territory too.

    And finally, the memory subsystem, video decode and all of those parts are also shared across all of Nvidia's different markets. Video decode is done in the GPU, for example.

  69. Strong the Krogoth in this one is.

    Edited to maximize Yodaness.

    Edited again for even more Yoda goodness

  70. What you seem to be missing is that a non-translating CPU (to the extent that modern CPUs are such) does not need to look at such large windows of instructions for reordering. It has a compiler which can look at *the entire program* and reorder things to suit the hardware’s foibles.

    As someone who writes low-level code for fun and profit, I hate the idea of a processor that does binary translation like this. There is no way a little chunk of code like this can find a better instruction sequence *in real time* than a skilled programmer can with a whiteboard, a simulator, and a lot of skull sweat.

    Anyone who tells you differently is selling something. Hmm, wait, they are selling something….

  71. What is a GPU driver these days but a huge compiler? I would suggest that nVidia has a ton of experience here. Though, I think the idea of translating code on the fly is borderline insane.

  72. What I would find very interesting to know is approximately how much time a core spends doing out of order stuff vs the actual work assigned. That could tell a lot about the effectiveness in different situations.

    It could be a nice boost to single-threaded performance, since the second core would be doing a lot of the first’s work if there’s only one task. I still can’t see it being good for performance per watt from any angle, but performance per die area could be pretty epic in some cases. Of course, there’s no telling how common those cases are. If NVidia’s GPUs are anything to go by, I’d expect a lot of other design decisions to tilt it back towards performance per watt.

    There’s no way real-world performance testing of this can fail to be interesting.

  73. So much cool tech, so little dependency on the Wintel world. Way to go, Nvidia! Between you and half a dozen other ARM CPU licensees, I hope you all give Intel a big run for its money.

  74. [quote<]Which architecture are you referring to?[/quote<] Every ARM chip Nvidia has released, maybe?

  75. With AMD also going ARM, I wonder how an x86+ARM APU would go up against Denver?

  76. They might as well quit the mobile space then, getting into phones and w/e replaces them is the only way to survive in mobile. Nothing else can generate sufficient volume to support the costs.
    They will try to integrate and if they fail they have no option but to try again or quit mobile.
    They can claim w/e they want now to justify their situation but the truth is that they have to do it.

  77. Those are expensive optimizations though. Especially for a company like NVIDIA who is really not big in the software space the way Intel is. Intel have their own compiler suite, proven products that over many years developed into high-end tools.
    NVIDIA is nowhere close to any of that and haven’t even started. They’d need the help of devs on the target platforms to optimize compilers.. and might just run into a few competitive barriers.

  78. [quote<]Next Tegra should be really interesting with Maxwell and hopefully 4 Denver cores. If they manage to integrate LTE too so we can get some phones with it, it would be even better.[/quote<] Nvidia will never again do an integrated LTE SoC, as the time frame for validation is way too long and because of the cut-throat pricing. Nvidia's SoCs can use an external LTE modem.

  79. You need to recalibrate your BS filter, as it is broken.

    You hear “Binary translation” and you lose it.

  80. [quote<]I'm not trying to be negative, just to set the stage appropriately in light of the history of such architectures.[/quote<] Which architecture are you referring to? [quote<]but I'd like to spend some time with an end product before making any declarations of success.[/quote<] I agree.

  81. Next Tegra should be really interesting with Maxwell and hopefully 4 Denver cores.
    If they manage to integrate LTE too so we can get some phones with it, it would be even better.

  82. Tell ya what, you little shill: When you can tell me who Hennessy & Patterson are, then maybe I’ll think that you’re out of junior high. I’ve been around more than long enough to see hype bubbles a mile wide, and my BS detector isn’t going wrong here.

    If Nvidia had been smart enough to say: Oh, we can outclass Avoton (without actually talking about power consumption), I would be willing to say that their arguments at least seem reasonable. But pride goeth before the very big fall, and Nvidia’s slinging of buzzwords and “miraculous” gimmicks that failed over a decade ago gives me MORE than enough ammo to shoot them down.

  83. But the Denver core needs to translate literally everything, there is no ‘native code’. I suppose there could be some good software optimizations though.

  84. And you carry a grudge against Transmeta.

    As for you being beautiful: Doctor Who to Dalek: And here I thought that you couldn’t make me more sick, you think hatred is beautiful.

  85. The beautiful thing about me is that I basically trash on everyone and I call Charlie an idiot for the right reasons.

    See my post below: Your quotes of a puff-piece Nvidia “white paper” carry about as much water as Charlie claiming that Nvidia is going out of business next quarter. Or, to put it more succinctly, just because Charlie’s an idiot doesn’t mean that you’re not one.

  86. It isn’t, but I’d expect that re-ordering instructions in hardware is more efficient than doing it in software, as tends to be true of most workloads. Here NVIDIA is arguing that you don’t need to do it very often since you just cache the results. Maybe that works out.

    I hope reviews come out soon, this is the most intriguing CPU in a long time.
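
The "cache the results" argument in this comment can be illustrated with a toy cost model. This is only a sketch under made-up assumptions: the cost constants, the `hot_threshold` parameter, and the function names are hypothetical and say nothing about how Denver's optimizer actually behaves or what it costs.

```python
# Toy model of "optimize once in software, cache, then replay".
# All numbers and names are hypothetical, chosen only for illustration.

OPTIMIZE_COST = 50  # assumed one-time cost of optimizing a block in software
SLOW_COST = 10      # assumed cost of running a block unoptimized
FAST_COST = 4       # assumed cost of replaying a cached, optimized block


def run_trace(trace, hot_threshold=2):
    """Return the total cost of executing a sequence of block IDs.

    A block is optimized and cached once it has run `hot_threshold`
    times unoptimized; after that it is replayed from the cache.
    """
    seen = {}      # block ID -> number of unoptimized executions so far
    cache = set()  # block IDs that have a cached optimized version
    total = 0
    for block in trace:
        if block in cache:
            total += FAST_COST
            continue
        seen[block] = seen.get(block, 0) + 1
        total += SLOW_COST
        if seen[block] >= hot_threshold:
            total += OPTIMIZE_COST
            cache.add(block)
    return total


def baseline(trace):
    """Cost of running the same trace with no optimization at all."""
    return SLOW_COST * len(trace)
```

Under these assumed costs, a hot loop (`["loop"] * 100`) comes out well ahead of the baseline, while a trace in which every block runs exactly twice pays the optimization cost and never gets it back. Which of those cases dominates in real workloads is exactly the open question.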

  87. Your points are very Charlie-like. Did you ask him for quotes?

    I thought Scott was a Negative Nancy but you take the cake in claiming that the data published in the white paper is fabricated by purposely neutering the Haswell.

    Usually you are more tongue-in-cheek but you seem to have a bias from the Transmeta days and seem to believe that that technology could never be improved upon.

  88. Itanium went from being incompetent to decent [but by no means miraculous] when the source code was tuned to a tee so that the compiler could actually emit 6-wide (and later 8-wide) EPIC instruction bundles.

    The rest of the time… even with native code.. it was a crapfest compared to even Pentium IVs much less anything modern.

    Itanium did prove something: Everybody who constantly whines about Intel and x86 should think first because Intel did every single possible thing “right” from the perspective of academia and the supposed experts on CPU design with Itanium… and it flopped badly in the real world.

  89. Yes. But it was and still is not easy to write native code for it. Or write a good compiler for said code.

  90. Since when has out-of-order hardware been more power-efficient than in-order hardware?

  91. I’m not trying to be negative, just to set the stage appropriately in light of the history of such architectures. Would be happy to see Denver’s delivered, typical performance per watt turn out to be outstanding and without significant gaps, but I’d like to spend some time with an end product before making any declarations of success.

  92. 1. This article is the opposite of “negative nelly”.
    2. Your entire “proof” that Denver will take over the market is… Nvidia whitepaper.

    I’m sure you TOTALLY hold Intel to the same standard when “Intel whitepaper” sez that Intel’s next IGP will outdo Nvidia’s discrete GPUs.

    I’m sure you TOTALLY hold AMD to the same standard when “AMD whitepaper” sez that Mantle will permanently put Nvidia out of the gaming market….

  93. Scott, why all the “Negative Nancy Speculation” when the data in the White Paper clearly spells out what you seem to speculate in the negative?

    [quote<]The tougher question is what its typical throughput will be in end-user workloads, which can be variable enough and contain enough dependencies to challenge dynamic optimization routines.

    Scott Negative Nancy Speculation: In other words, Denver's high peak throughput could be accompanied by some fragility when it encounters difficult instruction sequences.

    Impressively, Nvidia is claiming instruction throughput rates comparable to Intel's Haswell-based Core processors.

    Scott Negative Nancy Speculation: That's probably an optimistic claim based on the sort of situations Denver's dynamic optimization handles well.[/quote<]

    From Denver White Paper:

    [quote<]Because the run-time software performs optimization, the profiler can look over a much larger instruction window than is typically found in hardware out-of-order (OoO) designs. Denver could optimize over a 1,000 instruction window, while most OoO hardware is limited to a 192 instruction window or smaller. The dynamic code optimizer will continue to evaluate profile data and can perform additional optimizations on the fly.[/quote<]

    ...

    [quote<]They show Denver challenging a Haswell U-series processor in many cases and clearly outperforming a Bay Trail-based Celeron.

    Scott Negative Nancy Speculation: Another word of warning, though: we don't know the clock speeds or thermal conditions of the Tegra K1 64 SoC that produced these results.[/quote<]

    From Denver White Paper:

    [quote<]The shorter pipeline does not seem to impact Denver's clock speed scaling as it is expected to launch at 2.5GHz, faster than NVIDIA's Cortex-A15-based processor (Tegra K1-32).

    Power Optimized Cores

    With Denver, NVIDIA had flexibility in the core design when the company took an architecture license for ARMv8 ISA rather than license a Cortex-A57 core. The architecture license allowed it to build a custom microarchitecture, as long as the final processor maintains ARMv8 instruction set compatibility.

    With the custom design, the company created a new power state that allows the processor to enter a deeper sleep state where logic is turned off, but retains cache and CPU state information. NVIDIA calls this new power state CC4 (Figure 5). The advantage of this new power state is that it improves recovery time by holding the register and state information while still being in a very low power state. The new power state is almost as low as power rail gating the cores, but without the long recovery time. With power rail gating, there's the power and time overhead required to flush the cache contents and saving state before shutting off power. The cache flush overhead and the loss of architecture state adds significant time to the entry and exit of the power state. By maintaining the cache and state information, entering and exiting the CC4 state is faster than power gating, increasing responsiveness while still saving significant power.[/quote<]

  94. That may just have meant absolutely nothing in the client space… right, yup it didn’t. What does Itanium still run, HP-UX?

  95. It’s in-order and kind of out-of-order too, since it re-orders and optimizes instructions dynamically in software. I’m not convinced that this approach can be anywhere near as power-efficient as the conventional hardware one, but we’ll see when reviews show up.

  96. [quote<] You know, Itanium has wide and in-order execution too....[/quote<] Ya and it smoked when dealing with native code instead of translation.

  97. LMFAO… and here I was saying nice things about the K1. Ah well.

    [quote<]Binary translation is for real. Yes, the Denver CPU runs its own native instruction set internally and converts ARMv8 instructions into its own internal ISA on the fly. The rationale behind doing so is the opportunity for dynamic code optimization. Denver can analyze ARM code just before execution and look for places where it can bundle together multiple instructions (that don't depend on one another) for execution in parallel.[/quote<] Been there, Transmeta'd that.

    [quote<]Execution is wide but in-order.[/quote<] Fascinating, so Intel and AMD already have wide [b<]and[/b<] out of order execution, but Nvidia can get away with just wide because magic... or not. You know, Itanium has wide and in-order execution too....

    [quote<]Impressively, Nvidia is claiming instruction throughput rates comparable to Intel's Haswell-based Core processors.[/quote<] Don't use the "I" word in Krogoth-infested waters. Oh, and something tells me that what Nvidia is saying is: When we have Haswell run a software emulator of a 64-bit ARM core it's about as fast as Denver is in hardware.. FTW!

About Scott Wasson