Some of Nvidia's CPU architects gave a talk at the Hot Chips symposium today and revealed long-awaited details about the company's first custom CPU design. We weren't able to attend the talk, but the firm evidently pre-briefed some analysts about what it planned to say. There's a free-to-download whitepaper at Tirias Research on the Denver CPU core, and I've been scanning it eagerly to see what we can learn.
We already know Denver is a beefier CPU than ARM's Cortex-A15, since two Denver cores replace four A15 cores in the Denver-based variant of the Tegra K1. We also know Denver is, following Apple's Cyclone, the second custom ARM core to support the 64-bit ARMv8 instruction set architecture. We've long suspected other details, but Nvidia hasn't officially confirmed much—until now.
Here are some highlights of the Denver information revealed in the whitepaper and presumably also in the Hot Chips presentation:
- Binary translation is for real. Yes, the Denver CPU runs its own native instruction set internally and converts ARMv8 instructions into that internal ISA on the fly. The rationale for doing so is the opportunity for dynamic code optimization. Denver can analyze ARM code just before execution and look for places where it can bundle together multiple instructions (that don't depend on one another) for execution in parallel. Binary translation has been used by some interesting CPU architectures in the past, including, famously, Transmeta's x86-compatible effort. It's also used to emulate non-native code in a number of applications.
Denver's binary translation layer runs in software, at a lower level than the operating system, and keeps commonly used, already optimized code sequences in a 128MB cache held in main memory. Those optimized sequences can then be recalled and replayed when the same code runs again.
- Execution is wide but in-order. Denver attempts to save power and reap the benefits of dynamic code optimization by eschewing power-hungry out-of-order execution hardware in favor of a simpler in-order engine. That execution engine is very wide: seven-way superscalar and thus capable of processing as many as seven operations per clock cycle. Denver's peak instruction throughput should be very high. The tougher question is what its typical throughput will be in end-user workloads, which can be variable enough and contain enough dependencies to challenge dynamic optimization routines. In other words, Denver's high peak throughput could be accompanied by some fragility when it encounters difficult instruction sequences.
- Impressively, Nvidia is claiming instruction throughput rates comparable to Intel's Haswell-based Core processors. That's probably an optimistic claim based on the sort of situations Denver's dynamic optimization handles well. Nonetheless, Nvidia has provided a quick set of results from a handful of common synthetic benchmarks. These numbers are normalized against the performance of the 32-bit version of the Tegra K1 based on quad Cortex-A15 cores. They show Denver challenging a Haswell U-series processor in many cases and clearly outperforming a Bay Trail-based Celeron. Another word of warning, though: we don't know the clock speeds or thermal conditions of the Tegra K1 64 SoC that produced these results.
- Nvidia has built the expected power-saving measures into the Denver core, with "low latency power-state transitions, in addition to extensive power-gating and dynamic voltage and clock scaling based on workloads," according to a blog entry Nvidia has just posted on the SoC. As a result, they claim, "Denver's performance will rival some mainstream PC-class CPUs at significantly reduced power consumption." That's a bold claim, though one wonders whether they're comparing against something like AMD's Kaveri rather than Intel's Broadwell.
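To make the first two highlights a bit more concrete, here's a toy Python sketch of the general idea: pack consecutive, mutually independent operations into bundles of up to seven, then cache the result so a repeated code sequence can be replayed without re-optimizing it. This is only an illustration under loud assumptions, not Nvidia's actual algorithm. The names `Op`, `bundle_in_order`, and `OptimizationCache` are hypothetical, the dependency check covers only simple register conflicts, and unlike a real optimizer this sketch never reorders instructions.

```python
from dataclasses import dataclass

ISSUE_WIDTH = 7  # seven-way superscalar, per the whitepaper


@dataclass(frozen=True)
class Op:
    """Hypothetical toy instruction: one destination register plus sources."""
    dest: str
    srcs: tuple = ()


def bundle_in_order(ops, width=ISSUE_WIDTH):
    """Greedily pack consecutive independent ops into bundles, keeping program order.

    Assumption: ops in one bundle read their sources before any op in that
    bundle writes, so only read-after-write and write-after-write conflicts
    force a new bundle.
    """
    bundles, current, written = [], [], set()
    for op in ops:
        raw = any(src in written for src in op.srcs)  # needs a result from this bundle
        waw = op.dest in written                      # overwrites a result from this bundle
        if len(current) == width or raw or waw:
            bundles.append(current)                   # close the bundle, start a new one
            current, written = [], set()
        current.append(op)
        written.add(op.dest)
    if current:
        bundles.append(current)
    return bundles


class OptimizationCache:
    """Toy stand-in for the in-memory cache of already optimized code."""

    def __init__(self):
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def run(self, address, ops):
        """Return the bundles for the code at `address`, optimizing on first use."""
        bundles = self._entries.get(address)
        if bundles is None:
            self.misses += 1
            bundles = bundle_in_order(ops)
            self._entries[address] = bundles
        else:
            self.hits += 1
        return bundles


block = [
    Op("r1", ("r0",)),
    Op("r2", ("r0",)),
    Op("r3", ("r1", "r2")),  # depends on both results above
    Op("r4", ("r0",)),
]
cache = OptimizationCache()
first = cache.run(0x1000, block)   # miss: the block is bundled and stored
again = cache.run(0x1000, block)   # hit: the stored bundles are replayed
print([len(b) for b in first])     # [2, 2]
```

Note how the dependent third operation forces a second bundle even though five issue slots go unused in the first. That's the fragility concern in miniature: peak width only helps when the code offers enough independent work.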
We should know more soon. Nvidia says Tegra K1 64 devices should be available "later this year" and alludes to its new SoC as an Android L development platform. I can't wait to put one of these things through its paces.