Nvidia intros Tesla K20 series as Titan snags Top500 lead

Nvidia’s big Kepler chip is finally an official product. We first got a peek at the GK110’s architecture way back in May, but products based on it hadn’t been formally introduced until this morning, when Nvidia pulled the curtains back on a pair of cards in its Tesla K20 lineup. As part of the compute-focused Tesla lineup, these cards will be aimed at supercomputing installations and HPC clusters. Here’s a look at their basic specifications, which were the last real mystery remaining about them:

              Core clock   Peak SP   Peak DP   Memory capacity   Memory interface   Memory       TDP
              (MHz)        tflops    tflops    (GDDR5)           width              bandwidth
Tesla K20     706          3.52      1.17      5GB               320-bit            208 GB/s     225W
Tesla K20X    732          3.95      1.31      6GB               384-bit            250 GB/s     235W

Nvidia declined to reveal exact pricing, since these cards will be sold through OEM partners like Cray, but you can expect them to cost a fair bit more than any GeForce.

The big dawg, the Tesla K20X, does in theory exceed one teraflops of peak double-precision performance, as Nvidia told us to expect at GTC this past spring. In fact, its clock rate and peak flops numbers are very close to our best guess from back then. One surprise is the fact that, if our math is correct, even the K20X has one of the GK110’s SMX units disabled, leaving 14 active, for a total of 2688 shader ALUs.
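
Here's a quick sanity check of that math, as a sketch in Python. It counts a fused multiply-add as two flops per ALU per clock and uses GK110's known per-SMX resources (192 single-precision ALUs and 64 double-precision units); the 13-SMX configuration shown for the K20 is our inference from its rated numbers, not something Nvidia has confirmed.

```python
# Peak-flops sanity check for the GK110-based Teslas.
# Each GK110 SMX has 192 single-precision ALUs and 64 double-precision units.
SP_ALUS_PER_SMX = 192
DP_ALUS_PER_SMX = 64

def peak_tflops(active_smx, clock_mhz, alus_per_smx):
    # 2 flops per ALU per cycle (fused multiply-add)
    return active_smx * alus_per_smx * 2 * clock_mhz * 1e6 / 1e12

# Tesla K20X: 14 of 15 SMX units active at 732 MHz
print(peak_tflops(14, 732, SP_ALUS_PER_SMX))  # ~3.94 -> quoted as 3.95 SP tflops
print(peak_tflops(14, 732, DP_ALUS_PER_SMX))  # ~1.31 DP tflops

# Tesla K20: the quoted 3.52/1.17 tflops line up with 13 active SMX units at 706 MHz
print(peak_tflops(13, 706, SP_ALUS_PER_SMX))  # ~3.52 SP tflops
print(peak_tflops(13, 706, DP_ALUS_PER_SMX))  # ~1.17 DP tflops
```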

Like prior high-end Teslas, the K20 series supports ECC for external memory via an encoding scheme that occupies a portion of memory bandwidth for checksum storage. Nvidia claims this overhead has been roughly cut in half for Kepler versus the prior-gen Fermi chips, so that between two and 15% of memory bandwidth is occupied by ECC traffic.
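
To put that overhead in concrete terms, here's a rough sketch of what the claimed range would do to the K20X's effective bandwidth. The 2-15% figures are Nvidia's claim, and the actual hit depends on the access pattern; the function below is just an illustration, not Nvidia's accounting.

```python
# Effective memory bandwidth once ECC checksum traffic shares the GDDR5 bus.
def effective_bandwidth(peak_gb_s, ecc_overhead):
    """peak_gb_s: rated bandwidth in GB/s; ecc_overhead: fraction lost to ECC traffic."""
    return peak_gb_s * (1.0 - ecc_overhead)

for overhead in (0.02, 0.15):
    print(f"K20X with {overhead:.0%} ECC overhead: "
          f"{effective_bandwidth(250, overhead):.0f} GB/s of its rated 250 GB/s")
# -> 245 GB/s in the best case, ~212 GB/s in the worst case
```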

The K20X is strictly a server-focused product that will rely on its surroundings to generate airflow for cooling; it will not come with an onboard fan. The K20 will ship with an active fan, making it a candidate for use in workstations and non-OEM servers.

In very much related news, the latest version of supercomputing’s prestigious Top500 list of fastest machines was released this morning, and the Tesla K20-and-Opteron-based Titan supercomputer at Oak Ridge National Labs captured the top spot with 17.59 petaflops of sustained throughput in Linpack. Only the Sequoia system at Lawrence Livermore National Labs, based on IBM’s BlueGene/Q, comes close to matching Titan. The fourth-place entry on the list is under half of Titan’s peak speed, and fifth place is about a quarter.

Nvidia tells us Tesla K20-series cards will be shipping in volume this week via its OEM partners, with general availability to follow this month or next. In fact, the firm claims to have shipped a quantity of K20X cards capable of 30 petaflops in the past 30 days.

Comments closed
    • jdaven
    • 7 years ago

    For the first time since its release, there are no Itaniums in the top 500 supercomputers. Shame, really.

    /sarcasm

      • Airmantharp
      • 7 years ago

      Intel just released information on an update to Itanium…

      Not that I’m rooting for it, but I’m not rooting against it either. It’s an entirely different approach, and seems far better suited for certain types of computing than x86 ever could be.

      • chuckula
      • 7 years ago

      Considering that Itanium is really aimed at the mainframe market with over the top RAS features, it’s surprising that it was in the Top 500 at all. When was the last time an IBM Z-series system made the list?

    • ronch
    • 7 years ago

    AMD’s S10000 consumes a whopping 375w. The K20X is 235w. I see AMD really likes producing power-hungry products… Anyone here up for pairing an S10000 with an overclocked FX-8350? 🙂 It’s winter anyway. Sorry K20X, you’re not hot enough.

      • anotherengineer
      • 7 years ago

      The S10000 has two GPUs onboard though, so it’s a bit more difficult to compare until more Nvidia specs are released.

        • OU812
        • 7 years ago

        Well, the specs are right in front of you (in this article) and in others published hours ago.

        K20X: 1.31 DP TFlops at 235 watts, or 5.57 DP GFlops per watt

        vs

        S10000: 1.48 DP TFlops at 375 watts, or 3.95 DP GFlops per watt
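
        For anyone who wants to reproduce those numbers, the perf-per-watt figure is just rated DP flops divided by board TDP. A quick sketch using the ratings quoted above (TDP, not measured power draw):

```python
# DP GFlops per watt from the boards' rated specs (TDP, not measured draw)
cards = {"Tesla K20X": (1.31, 235), "FirePro S10000": (1.48, 375)}
for name, (dp_tflops, tdp_w) in cards.items():
    print(f"{name}: {dp_tflops * 1000 / tdp_w:.2f} DP GFlops/W")
# -> K20X ~5.57, S10000 ~3.95
```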

      • BestJinjo
      • 7 years ago

      For single precision it’s much faster though (5.91 Tflops for the S10000 vs. 3.95 for the K20X). At the same time, theoretical numbers don’t always translate into real-world performance for these GPUs. A lot depends on drivers and optimizations for a particular application. For example, if a modern program can take advantage of Hyper-Q and Dynamic Parallelism, it’s possible NV’s GPU will be faster even with a lower theoretical rate. The same can be said for OpenCL. It’s possible some programs would run faster under OpenCL than they would under CUDA. There is clearly a software dependency as well.

      For example K20X performance varies depending on the app despite constant theoretical specs:

      ● MATLAB (engineering) – 18.1 times faster
      ● Chroma (physics) – 17.9 times faster
      ● SPECFEM3D (earth science) – 10.5 times faster
      ● AMBER (molecular dynamics) – 8.2 times faster

      So you cannot just compare the S10000 and K10/K20/K20X on paper alone. You have to look at real-world testing. For instance, the Quadro line’s massive performance advantage over AMD’s cards is a testament to the fact that software support is just as important as the underlying hardware.

      • dpaus
      • 7 years ago

      As with current CPUs, the power draw for the 99.9% of their time that they spend at idle is much more important than the draw under peak load.

        • chuckula
        • 7 years ago

        For something as expensive as an S10000, I hope it won’t be spending 99.9% of its time idle….

        • Meadows
        • 7 years ago

        Servers don’t idle.

          • dpaus
          • 7 years ago

          While I hardly claim to represent 100% of the server market, ours sit idle or at very low load well over 99% of the time, yet they’re candidates for this.

            • jihadjoe
            • 7 years ago

            Maybe stuff like web servers do, but compute farms really shouldn’t be idle for more than the time it takes to change jobs.

            • dpaus
            • 7 years ago

            We make disaster response systems – I’m OK that they’re idle more than 99.9% of the time.

            • UberGerbil
            • 7 years ago

            Sorry, how is your usage a candidate for this hardware? Do your apps do enormous amounts of fp calculations on large data sets (but not “Big Data”)? These are components aimed at classic HPC workloads — climate/meteorology, finance, fluid dynamics, geology, etc — they don’t seem to be a good match for something that is intended for a high-uptime, mostly-idle deployment. But there must be something about what your apps do that I don’t get (I would have guessed it’s mostly branchy integer code, not fp)

            • dpaus
            • 7 years ago

            [quote<]climate/meteorology, fluid dynamics, geology[/quote<]

            Bingo (but note that I deleted 'finance'). It'd be heavy-data visualization: the 3D dispersion plume of a hazardous material cloud in real time, the 'spill-over' progress of rising flood waters through municipal wastewater systems, etc. Each of those, in turn, triggers the branchy integer code that infers the impact of the event on the community.

            • UberGerbil
            • 7 years ago

            Ah. I didn’t realize your stuff was actually doing modelling like that — I had assumed it was further downstream on the management side of things, not all the way upstream to the prediction side.

            Still that’s a relatively atypical use of high performance fp devices… at least up until now. Traditionally HPC servers are oversubscribed: whether it’s CGI rendering or derivatives calculations or petroleum resource predictions, they’re not making ROI if they’re not working; and on the scientific side of things there are usually more customers with models to run than there are slots in the schedule. But as with everything else in computing, as formerly high-end exotic hardware filters down to a wider array of users, they find their own applications that don’t always fit the old stereotypes.

    • OU812
    • 7 years ago

    Nice to see Nvidia double DP throughput for the K20X (1.31 TFlops DP) vs. the M2090 (0.655 TFlops DP) while actually reducing power from 250 watts to 235 watts.

    Pretty impressive.

    [url<]http://www.anandtech.com/show/6446/nvidia-launches-tesla-k20-k20x-gk110-arrives-at-last[/url<]

    • ronch
    • 7 years ago

    Scott, do you think it’s possible to do a review of the recently-announced Firepro $10,000 (yes, that was deliberate) and Nvidia’s top dog GPGPU? Yes, it’s gonna cost a bundle not just to acquire these cards, but to decide which programs will be used to test them. What do you think? Is it possible to talk AMD and Nvidia into providing samples of their cards?

      • chuckula
      • 7 years ago

      –> Why should TR pay when they could (should?) be able to get review samples?

        • ronch
        • 7 years ago

        In the event of not being able to secure samples from AMD and Nvidia, TR will then just have to pony up the cash, if it’s even feasible in the first place. The majority of TR’s readers probably aren’t even into GPGPU computing so spending a lot just to produce the review is a bit awkward if AMD/Nvidia refuse to provide samples.

          • Metonymy
          • 7 years ago

          Perhaps you’d like to start taking up a collection.

          • Chrispy_
          • 7 years ago

          I tell you what, getting performance data for graphics cards (consumer or professional) in common productivity suites is a nightmare.

          I bet there are a hundred thousand people who would be interested to see Geforce/Radeon/Quadro/FirePro results for things like:

          Maya
          3DS Max
          Revit
          Microstation
          Rhino
          Photoshop
          Solidworks
          Vectorworks.

          Basically there’s a serious shortage of articles and I bet part of the reason for this is because engineering samples just don’t get sent out that often.

            • ronch
            • 7 years ago

            [quote<]engineering samples just don't get sent out that often.[/quote<]

            Exactly why I was asking Scott if they could do a review. Reviewing these things is not the walk in the park that reviewing a CPU is, not that CPU reviews are easy, mind you.

    • OU812
    • 7 years ago

    Looks like Intel Phi may be having power usage problems.

    [url<]http://www.top500.org/system/177931[/url<]
    [url<]http://www.top500.org/lists/2012/11/[/url<]

    Note that Power is not listed. All other entries in the Top 10 list Power used. Only the Intel Phi entry has the Power used field blank. Interesting.

    ---

    Edit: Now we know why the Power numbers were blank.

    [url<]https://techreport.com/news/23884/intel-joins-the-data-parallel-computing-fraternity-with-xeon-phi#0[/url<]

    The Stampede used special 300 watt Xeon Phi’s (SE10P) that are not going to be produced.

    [quote<]For instance, the two "SE10" models above are special-edition cards that Intel supplied to OEM partners who needed early access to the Xeon Phi. They have higher power consumption than the final product[/quote<]

      • chuckula
      • 7 years ago

      Not as interesting as you are making it out to be. Stampede is actually still in the process of being built and these results are only of a ~25% completed system. It was probably meaningless to pull up power consumption estimates when a massive part of power consumption of any HPC system is in cooling, which is obviously specced out for the full Stampede system instead of a partially completed test run that they did for publicity. Not bad that a 25% completed system is already in the Top 10, and we’ll see the full results in the next Top 500 list.

      EDIT: For example, even though Stampede’s power isn’t listed, other Xeon Phis on the list are: [url<]http://www.top500.org/system/177993[/url<]

      629 Teraflops / 215.6 Kilowatts ~= 2.9 Gigaflops/watt, which isn’t massively out of line for these systems.

        • OU812
        • 7 years ago

        [quote<]Stampede is actually still in the process of being built and these results are only of a ~25% completed system.[/quote<]

        So at 25%, the Stampede results are 204900 cores, an Rmax of 2660 TFlop/s, and an Rpeak of 3959 TFlop/s, with unknown power usage.

        [url<]http://www.top500.org/lists/2012/11/[/url<]

        At 100%, these numbers grow to 819600 cores, an Rmax of 10640 TFlop/s, and an Rpeak of 15836 TFlop/s, still with unknown power usage.

        Compared to Titan’s 560640 cores, Rmax of 17590 TFlop/s, and Rpeak of 27113 TFlop/s at 8209 kW, Stampede looks to come up way short. Stampede would have 46% more cores, yet Titan produces 65% higher Rmax and 71% higher Rpeak.

        If these numbers are right, it sure looks like the Xeon Phi is having problems.

          • chuckula
          • 7 years ago

          The “core” count metric is completely arbitrary: what is considered a “core” on either system?
          I also like how you completely ignored my factual rebuttal to your innuendo and jumped onto a new “Xeon Phi is having problems” conclusion with zero evidence. Considering Nvidia has been pushing compute cards for almost 10 years, I would certainly hope they would have some pretty impressive results. There’s nothing in any of the data that I have seen that indicates Xeon Phi is “having problems”, considering it has just hit the market and is only beginning to ramp up into running systems.

          Try looking at Rmax / Rpeak efficiency, with the Titan having a ratio of about 65% while Stampede has a ratio of 67%. Doesn’t sound like Stampede is doing too bad by those metrics. It’s well known that Xeon Phi was not going to beat the K20 at peak performance, but if you knew anything about the HPC world beyond regurgitating some marketing slides from Nvidia, you’d know that people don’t sit around running Linpack all day on these systems.
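
          Those efficiency figures are easy to check against the list's own Rmax and Rpeak columns; a quick sketch using the Top500 numbers cited above:

```python
# Linpack efficiency = Rmax / Rpeak, using the figures quoted above (TFlop/s)
systems = {"Titan": (17590, 27113), "Stampede (partial)": (2660, 3959)}
for name, (rmax, rpeak) in systems.items():
    print(f"{name}: {rmax / rpeak:.1%} efficient")
# -> Titan ~64.9%, Stampede ~67.2%
```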

          I haven’t seen you posting very much on TR’s boards, but it is, shall we say, interesting that you pop up out of nowhere with marketing speak that bashes anybody but Nvidia on cue like this….

            • OU812
            • 7 years ago

            I presented actual published evidence from the Top500 list for the Stampede and the Titan.

            [url<]http://www.top500.org/lists/2012/11/[/url<]

            Using your own admission that Stampede was only 25% complete, it was very easy to multiply the Stampede values by 4 to get Stampede at 100%. And those numbers look bad for Stampede and the Intel Xeon Phi compared to Titan and the K20.

            > Stampede would have 46% more cores, yet Titan produces 65% higher Rmax and 71% higher Rpeak.

            As for innuendo, you seem to be a master: your reply did nothing to refute the above data. In fact, you seem to ignore it completely and branch off into such nonsense as "marketing slides" and "people don't sit around running Linpack all day".

            • chuckula
            • 7 years ago

            OK moronic shill: Looks like I was right: [url<]https://techreport.com/news/23884/intel-joins-the-data-parallel-computing-fray-with-xeon-phi[/url<]

            Wow... 225 watts for a commercially available Xeon Phi. Oh, and while its peak DP numbers aren't quite as high as the K20's, it has a metric crap-ton more memory bandwidth, so you can bet that its efficiency at actually sustaining throughput near the maximum theoretical levels will be a lot better.

            Next time, try using more facts and less fanboyism. Seriously, you make the AMD and ARM fanboys look rational in comparison.

            • OU812
            • 7 years ago

            Moronic, thy name is you

            225 watts for the Xeon Phi 5110P that only produces 1.01 DP TFlops seems like a FAIL compared to the K20X’s 1.31 DP TFlops.

            Since you seem to be math illiterate I will spell it out to you. Xeon Phi 5110P is 23% slower than the K20X.

            —-

            Edit:

            Since the Phi is also a process node ahead (22nm vs 28nm) it looks like an even bigger failure.

            • MFergus
            • 7 years ago

            There are other specs that are important besides flops, ya know, and I’m sure the real-world power consumption is different for both.

            • OU812
            • 7 years ago

            Well, the Phi is not a low-power device either. At 225 watts, 23% slower than a K20X, and on a 22nm process vs. 28nm for the K20X, it just looks like a failure after all these years.

            • MFergus
            • 7 years ago

            It still has good advantages, and its performance isn’t a whole lot worse.

    • ultima_trev
    • 7 years ago

    Even with 2,880 shaders, that MASSIVE 732 MHz core clock would enable this chip to utterly DESTROY Tahiti in game performance… Not that it will ever see a consumer version.

    NVIDIA=FAIL. AMD and Tahiti=WIN.

      • Beelzebubba9
      • 7 years ago

      …why would you assume there won’t be a consumer version? nVidia had no problem selling the GF110-based GTX 580, which was just as over-provisioned at its launch as the GK110 is now.

      I suspect nVidia will launch a GK110 based card in time to spoil AMD’s Radeon 8xxx series party.

      • clone
      • 7 years ago

      Who cares if this ever reaches the desktop?… I’m pretty sure Nvidia has no interest in it going there.

      A standard single-GPU AMD HD 7970 or Nvidia 680 can run all games at max settings across 3 monitors… honestly, would there even be a market for it on the desktop?

      If they ever released it to the desktop, you’d see it steal sales from servers, nothing more.

      • beck2448
      • 7 years ago

      There is a reason Nvidia has had 90% of the PRO market for years: quality driver teams! AMD doesn’t have the money or resources to compete. Nvidia is also the choice for supercomputers, as its proprietary software runs medical imaging, oil and gas exploration, and other GPGPU operations better than the competition. Their core clusters are much more powerful core for core than AMD’s.

    • ronch
    • 7 years ago

    The more I read about Nvidia and AMD, the more I realize that Nvidia has Vision. Funny how AMD’s marketing talks so much about Vision (Vision Technology from AMD… blah blah blah). Nvidia was the first to do all this GPU computing, the first of the big three companies (Intel, AMD, Nvidia) nowadays to seriously do ARM, etc. Jensen really has this Vision thing going for him. AMD just… follows. They’ve been doing it since IBM asked them to be the 2nd source for x86 CPUs and it’s still very much in their DNA. Perhaps I’m gonna be an Nvidia fanboy moving forward.

      • anotherengineer
      • 7 years ago

      Jensen is smart in going this route due to the high margins.

      Making/focusing on compute puts these into commercial/industrial markets with way better margins than consumer graphics cards.

      ATI would have been better off on their own, since a lot of their managers were replaced with AMD personnel and have now left the company.

        • moose17145
        • 7 years ago

        [quote<]ATI would have been better off on their own, since a lot of their managers were replaced with AMD personnel and have now left the company.[/quote<]

        That’s exactly what I was just thinking. And actually, I thought ATI had pioneered GPU computing before Nvidia did? But then AMD bought ATI and kind of ruined it...

          • ronch
          • 7 years ago

          Well, the ATI division is still humming along pretty well, but I imagine their current products have been in the works for a while now, so the effects of the most recent layoffs won’t be felt for about a year or two. We’ll see.

          • MathMan
          • 7 years ago

          AMD was first to release a GPGPU SDK, but with only an assembler, no compiler. Nvidia followed shortly after with a C compiler, elaborate libraries, etc. The former was pretty much useless; the latter was not.

          Also: at the time, the AMD architecture was extremely inefficient for GPU computing, with a large discrepancy between theoretical and real performance. The Nvidia arch almost seemed to have been designed with GPU computing in mind.

          It’s fair to say that Nvidia had its eye on that ball much earlier than AMD. GCN is a different story, but probably too late?

      • clone
      • 7 years ago

      Why be a fanboy at all?

      How is that something to be proud of?… Being closed-minded and ranting about all things Nvidia would seem an embarrassing endeavor, worthy of little more than scorn.

      • UberGerbil
      • 7 years ago

      Well, yes, but this wasn’t entirely by choice, either. When the chipset business got yanked out from under them by Intel, and the rising tide of “good enough” IGPs started eating into their volume-oriented products (particularly for OEM sales), they found themselves getting boxed into an ever-diminishing spiral of increasing R&D costs for high-end GPUs that were selling to an ever-smaller slice of the market (even if that slice was holding steady due to overall growth). Thus they were forced to make a big, bet-the-company play on GPGPU (which still has some long-term questions associated with it). They get full credit for seeing what was coming and taking the necessary steps (even to the point of emphasizing GPGPU — i.e., DP FP — over gaming in the first iteration of products), as well as for executing on it (all the strategy in the world doesn’t help if you can’t produce products that successfully embody it). But like the Tegra line — where they chose ARM because it was the best alternative to the x86 cores they didn’t have — it’s hard to say how much of this is grand strategic vision and how much is smart tactics and making the best of what is handed to them. Sometimes being lucky matters as much as being good.

      • sschaem
      • 7 years ago

      Look up Vision or Fusion, and I swear AMD is listed as Antonyms.

      And AMD spent $6 billion to acquire ATI so it could implement its Fusion directive.
      Six years later… the best AMD has been able to do is glue a GPU next to a CPU.

      Intel, meanwhile, actually managed to move forward:
      its architecture can now share an L3 cache between the CPU and GPU.

      For AMD, any transaction between the GPU and CPU needs to go off-chip to RAM and make a round trip.
      2013 will show ZERO advance in AMD's Fusion effort, while Intel keeps getting closer.

      And how can a company claim "Vision" when it divests itself of its mobile assets right as the industry goes through a paradigm shift toward mobile computing?
      And also decides to scale back GPGPU computing while the industry is embracing it?
      ATI was way ahead of Nvidia until the AMD acquisition…
