Nvidia discusses Google TPU’s performance versus modern Teslas

Just a few days back, Google published some interesting benchmarks of its Tensor Processing Unit, a custom-made ASIC targeted at AI inference applications. The numbers were pretty impressive: Google claimed that its TPU was roughly 15 times faster than an unspecified GPU (which we took to mean a Tesla) when running inference tasks, and a whopping 30 to 80 times better when it came to performance-per-watt. Given that he heads a prominent GPU vendor, it was only natural that Nvidia's Jen-Hsun Huang would want to weigh in on the performance numbers Google shared.

Nvidia agrees with Google on a lot of points, chiefly that dedicated hardware is what makes many modern AI applications scalable at all. However, Huang notes that the Google engineers compared their TPU to a now-outdated Kepler-based Tesla K80, a card that, by Nvidia's account, was never really optimized for handling AI-related tasks. He proceeds to point out that the contemporary Tesla P40 (based on the Pascal architecture) "delivers 26x [the K80's] deep-learning inferencing performance."

With those facts in mind, Nvidia produced a chart quantifying the performance leap from the K80 to the P40 in both training and inference tasks, and estimating where Google's TPU fits in the picture.

Right off the bat, it's easy to spot one caveat: Google's TPU isn't meant to run training tasks, which explains the "NA" in the corresponding table cell. However, despite the huge performance boost of the P40 versus the old K80, the fact remains that the TPU still comes out almost twice as fast as the modern Tesla in integer teraops (90 versus 48). Nvidia didn't note performance-per-watt figures in the chart, though. Those figures likely wouldn't make even the modern Tesla look particularly good in the tested workload.
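
For the curious, here's a rough back-of-the-envelope take on how those numbers might shake out. The power figures below are our assumptions rather than anything published alongside the chart: roughly 75 W for the TPU (the figure Nvidia's comparison appears to use; Google itself cites closer to 40 W) and the Tesla P40's 250 W rated TDP.

# A rough sketch of the ratios above; the power figures are assumptions,
# not numbers taken from Nvidia's chart.
tpu_tops, p40_tops = 90, 48       # INT8 teraops, per Nvidia's chart
tpu_watts, p40_watts = 75, 250    # assumed: TPU power per Nvidia, P40 rated TDP

print(f"TPU vs. P40 raw throughput: {tpu_tops / p40_tops:.1f}x")    # ~1.9x
print(f"TPU efficiency: {tpu_tops / tpu_watts:.2f} TOPS/W")         # ~1.20
print(f"P40 efficiency: {p40_tops / p40_watts:.2f} TOPS/W")         # ~0.19
print(f"TPU perf/W lead: {(tpu_tops/tpu_watts)/(p40_tops/p40_watts):.1f}x")  # ~6x

Swap in Google's own 40 W figure for the TPU and that efficiency gap widens to roughly 12x, which goes some way toward explaining the chart's focus on raw throughput.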

What this data amounts to is a little unclear. In our view, comparing the performance of a dedicated ASIC against a general-purpose GPU is a flawed exercise to begin with, and one that doesn't even make the Tesla card look all that good in this specific instance. With that said, developing an ASIC and the probably-specialized software to go with it isn't even remotely as easy as buying a Tesla card and tapping into Nvidia's extensive HPC software library. The only moral of this story seems to be "pick the hardware that suits your needs."

Comments closed
    • ET3D
    • 3 years ago

    Tesla looks a little worse in performance/power. It’s twice as fast as the TPU doing inferences, and takes 3.33 times more power, so the TPU is less than 2x as efficient. Not bad, but not earth-shattering either.

    • chuckula
    • 3 years ago

    Of course an ASIC like Tensor can beat a much more general-purpose part like a GPU in a specific niche, but just remember how specific that niche really is.

    You literally can’t train a neural network using Tensor! Tensor is great for doing inference after you’ve already done the training (and the training is the truly computationally intensive part of the process) but that’s only one part of the equation.

    In other words, Nvidia ain’t going out of business because of Tensor.

      • jv008
      • 3 years ago

      Sorry, but you got that wrong: inferencing requires vastly more computation than training.
      You train once, and then you've got a neural network that you can deploy on millions of cloud nodes.

        • chuckula
        • 3 years ago

        No, the training process is vastly more computationally complex, and frankly it shows quite a bit of naivety to say that it is only ever run “once” given how complex systems update themselves.

        Furthermore, claiming that inference is more computationally complex than learning just because you can theoretically run inference using a whole bunch of processors is like saying that Microsoft Word is more computationally complex than any workload run on any supercomputer because, when you add up the operations executed by all the computers in the world running Word, they exceed the operations of the supercomputer… it's a theoretically valid statement that's totally stupid.

          • jv008
          • 3 years ago

          Ok, let's get a bit clearer here.
          A) Training a single neural network takes vastly more computation than using it (inferencing).
          It can take hundreds of GPUs running for days to train a very large neural network.

          B) Once trained, a neural network can be used for inferencing for years.

          C) The training complexity is independent of the number of users (inference instances) of the neural network. Neural networks are retrained regularly when better network topologies emerge or better training data becomes available.

          D) A neural network can be used for inferencing on millions, even billions, of devices like cloud nodes, cell phones, or cars. Each device requires inferencing hardware, which can be a CPU, a GPU, or custom hardware like a TPU.
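
          A quick back-of-the-envelope sketch of how both of these claims can be true at once, using entirely made-up numbers (none of these figures come from Google, Nvidia, or the commenters):

          # Entirely illustrative numbers: training one network is the bigger
          # single job, but fleet-wide inference can outgrow it over time.
          train_flops = 1e20               # hypothetical cost of one training run
          infer_flops_per_query = 1e10     # hypothetical cost of one forward pass
          queries_per_day = 1e9            # hypothetical fleet-wide traffic
          days_deployed = 365

          fleet_inference = infer_flops_per_query * queries_per_day * days_deployed

          print(f"one training run:          {train_flops:.1e} FLOPs")
          print(f"one inference:             {infer_flops_per_query:.1e} FLOPs")
          print(f"a year of fleet inference: {fleet_inference:.1e} FLOPs")
          # With these assumptions, training dwarfs any single inference, yet
          # aggregate inference across the fleet dwarfs the training run.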

    • K-L-Waster
    • 3 years ago

    Translation: “Yeah, Google kicked our @$$, but not as bad as they said they did.”

      • chuckula
      • 3 years ago

      Actually, Nvidia's main tagline is: sure, Tensor's great… but you can actually buy the P40, so it's effectively infinitely faster than silicon you'll never own.

      • tipoo
      • 3 years ago

      Well, "it kicked our ass, but in a tangential workload that doesn't always require that performance, while ours works where you do need it."

    • POLAR
    • 3 years ago

    “You’re holding it wrong”
    https://www.youtube.com/watch?v=znxQOPFg2mo

    • the
    • 3 years ago

    The "pick the hardware that suits your needs" point does hold merit, but not many companies can afford to go to the custom ASIC level. The more realistic solution for them would be an FPGA. That would be a valuable data point vs. both Google's ASIC and a GPU.

    The other thing is that the Tesla P40 isn't nVidia's best product for this segment. That would be the P100, if I am not mistaken. The extra size of the GP100 die went toward integer work as well as FP64 support over the GP102 used in the P40. Also, HBM memory provides plenty of bandwidth.

    • NoOne ButMe
    • 3 years ago

    Nvidia… this only makes you look even worse for the tasks that Google made the device for.

    Nvidia's marketing screwed up. Not happening too often. At least not at this "major" of a scale.

      • DarkStar1
      • 3 years ago

      Not really – that Nvidia’s performance is competitive with a custom designed ASIC *at all* is impressive.

        • nico1982
        • 3 years ago

        How is it competitive? Just because Nvidia cobbled them all together onto a chart? The TPU has twice the inference performance for 1/3 of the power consumption 😛

          • EndlessWaves
          • 3 years ago

          And probably more than six times the cost.

            • NoOne ButMe
            • 3 years ago

            Google designed it. Google orders it from the fab.

            Even if it costs 2x to make as P40, Nvidia’s professional margins are way higher.

          • DarkStar1
          • 3 years ago

          How is it not? Bravo, we have a dedicated ASIC that is twice as fast in a single metric at one third the power reqs. That’s the point of ASICs – they’re really fast at *one* job, and often, that means power reqs go down. But let’s look at another side of this – the p40 is infinitely faster than a TPU for training and it’s only three times more power hungry! It’s also possible to buy a p40. 😉

        • NoOne ButMe
        • 3 years ago

        The TPU is 28nm.

        Er, I forgot the P40 isn't Pascal. I figured Nvidia would compare Pascal. So it isn't a node advantage per se for either side.

        Google claims using GDDR5 would increase performance 3x for about 10W more power.

        I think they exaggerate, but 2x the inference performance is fair.

        Not that any of this is new: super-specialized hardware is far better for specific tasks than generalized hardware, if you can afford it.

          • BurntMyBacon
          • 3 years ago

          [quote="NoOne ButMe"]Er, I forgot P40 isn't Pascal.[/quote]
          Are you sure? The article seems to think it is:
          [quote="Bruno Ferreira"]He proceeds to point out that the contemporary Tesla P40 (based on the Pascal architecture) "delivers 26x [the K80's] deep-learning inferencing performance."[/quote]

            • NoOne ButMe
            • 3 years ago

            Oh. Well. I'm just going to stop redoing it. I think I am only going to make more of a mess of it now. My undoing of my non-mistake will stay.

          • DarkStar1
          • 3 years ago

          My point is that Nvidia being competitive *at all* with a programmable system is impressive. Trying to keep up with an ASIC using a programmable system is an effort in futility, but throw that ASIC a task it wasn't designed for and the tables flip.

            • NoOne ButMe
            • 3 years ago

            It isn't competitive. I don't know where Nvidia gets 75W from, but Google says 40W (and claims an estimated 10W more for GDDR5 = 3x performance).
            So Nvidia has <33% perf/watt for inferences per second,
            and <10% perf/watt for INT8 inference TOPs. If those are as worthless as DLTOPs, then it doesn't matter. I don't know.

            Bulldozer was noncompetitive at near or under 50% of Intel's performance per watt.

            How is something worse than Bulldozer competitive?

            • DarkStar1
            • 3 years ago

            I’m sorry, but I don’t think anyone would agree with you. A programmable system performing within an order of magnitude of an ASIC is kind of the definition of competitive, especially when we consider that in order for that ASIC to match the programmable system in flexibility it’ll require yet more specialized hardware, further boosting power requirements.

      • BurntMyBacon
      • 3 years ago

      The article talks about how much better P40 is for deep learning.

      [quote="Bruno Ferreira"]He proceeds to point out that the contemporary Tesla P40 (based on the Pascal architecture) "delivers 26x [the K80's] deep-learning inferencing performance."[/quote]
      This is apparently shown in the first row of the chart: Inferences/Sec at <10ms latency. They look to be twice as fast in this metric as the TPU, but half as fast in Inference TOPS. How do these metrics compare, and which is more relevant to the desired workload? Also, could the advantage the P40 has in the first metric be related to its roughly 10x memory bandwidth advantage?

    • hungarianhc
    • 3 years ago

    Real question: Why does this article even matter if a consumer will never be able to buy a Google TPU?

      • RAGEPRO
      • 3 years ago

      Well, because this article isn’t for consumers. Or anyone looking to buy a Google TPU. Nvidia makes a lot of money from deep learning customers and this is trying to dissuade anyone who needs to do a lot of that from trying to get their own custom ASICs built.

      The thing is, I don’t think Nvidia really needs to do that because… I mean, Google can throw away millions or even billions on an experiment that may not pan out because it’s Google. Smaller firms, or even ones that are as large but don’t have the requisite engineering talent, are really better served with the more general-purpose and programmable Nvidia hardware -anyway-.

      • f0d
      • 3 years ago

      because it's interesting

        • Neutronbeam
        • 3 years ago

        Ah, playing the old “because it’s interesting” card again, are we? Interesting….

      • torquer
      • 3 years ago

      Because this is tech. And it's being reported on.

      As in, a tech report.

      .com

        • DragonDaddyBear
        • 3 years ago

        Am I the only one that read that and thought of the Jeff Dunham skit with Peanut?
        https://www.youtube.com/watch?v=kYAqQGEHg7I

          • derFunkenstein
          • 3 years ago

          Yeah I’ve been rewatching Arrested Development so I read it as “I’m Oscar! Dot Com!”

          http://arresteddevelopment.wikia.com/wiki/IMOSCAR.COM

    • Bumper
    • 3 years ago

    GPUs have been readily available and are very parallel, and since "simulating a neural network requires aggregating the inputs and computing outputs for many neurons, a process that is easily parallelizable" (https://arstechnica.com/information-technology/2017/04/how-amazon-go-probably-makes-just-walk-out-groceries-a-reality/2/), it makes sense to use GPUs.

    So what about using GPUs in coordination with programmable chips that are optimized to process workloads faster, like an FPGA? As the AI becomes more sophisticated, the new model is reprogrammed onto the chip. The "training" would still happen on the GPUs. Would that work?

      • stefem
      • 3 years ago

      That is actually what even Google does: the TPU isn't suitable for the learning process, so it's used just for inferencing, and performance in this workload is not critical in many applications.
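
    To put the "easily parallelizable" quote above in concrete terms: a single layer of inference boils down to a matrix-vector product in which every output neuron can be computed independently. Here's a minimal NumPy sketch with made-up layer sizes (nothing here reflects Google's or Nvidia's actual models):

    import numpy as np

    # One fully-connected layer of inference with made-up sizes.
    # Each of the 4096 output neurons aggregates the same 8192 inputs
    # independently, which is why the work maps so well onto GPUs or an ASIC.
    inputs  = np.random.rand(8192).astype(np.float32)        # activations from the previous layer
    weights = np.random.rand(4096, 8192).astype(np.float32)  # trained weights, frozen at inference time
    bias    = np.random.rand(4096).astype(np.float32)

    outputs = np.maximum(weights @ inputs + bias, 0.0)        # 4096 independent dot products + ReLU
    print(outputs.shape)                                      # (4096,)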
