Google TPU chomps data at least 15x faster than regular hardware

So you remember Google's Tensor Processing Unit? If not, all you really need to know is that the chip is a custom ASIC designed by Google to accelerate the inference phase of machine learning tasks. Google initially said that the TPU could improve performance-per-watt in those tasks by a factor of ten compared to traditional CPUs and GPUs. Now, the company has released a study analyzing the TPU's performance since its quiet introduction in 2015.

The short version is that predicting a 10x uplift in performance-per-watt was Google's way of being modest. The actual increase for that metric was between 30 and 80 times that of regular solutions, depending on the scenario. When it comes to raw speed, Google says its TPU is between 15 and 30 times faster than standard hardware. The software that runs on the TPU is based on Google's TensorFlow machine learning framework, and some of these performance gains came from optimizing it. The writers of the study say that there are further optimization gains on tap, too.

Apparently, Google saw the need for a chip like the TPU as far back as six years ago. Google uses machine learning algorithms in many of its projects, including Image Search, Photos, Cloud Vision, and Translate. By its nature, machine learning is computationally intensive. By way of example, the Google engineers said that if people used voice search for three minutes a day, running the associated speech recognition tasks without the TPU would have required the company to have twice as many datacenters.

Comments closed
    • chuckula
    • 3 years ago

    [quote<]If not, all you really need to know is that the chip is a custom ASIC designed by Google to accelerate the inference phase of machine learning tasks[/quote<]

    Inference, BTW, is not actually the "learning" or "training" part of neural network machine learning (although in some setups the inference process further refines an already-trained neural network). Instead, inference is more along the lines of: use the trained neural network to generate a classification based on input (i.e. using the neural network to do stuff [i<]after[/i<] you trained it).

    [url<]https://blogs.nvidia.com/blog/2016/08/22/difference-deep-learning-training-inference-ai/[/url<]
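
To make the distinction above concrete, here is a minimal NumPy sketch (the weights, shapes, and numbers are invented purely for illustration). Training would be the step that produces `trained_weights`; inference, the phase the TPU accelerates, is just the forward pass through weights that already exist.

```python
# A minimal sketch (NumPy, made-up weights) of training vs. inference:
# training fits the weights; inference only runs input through them.
import numpy as np

# Pretend these weights came out of an earlier training run.
trained_weights = np.array([[0.2, -0.5],
                            [0.8,  0.1],
                            [-0.3, 0.7]])
trained_bias = np.array([0.05, -0.02])

def infer(features):
    """Inference: a single forward pass through the already-trained network."""
    logits = features @ trained_weights + trained_bias
    exp = np.exp(logits - logits.max())   # softmax to turn logits into class scores
    return exp / exp.sum()

# Classify one 3-element input vector; no weight updates happen here.
print(infer(np.array([1.0, 0.5, -0.2])))
```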

    • TurtlePerson2
    • 3 years ago

    Hardware-implemented algorithms are faster than FPGA-implemented algorithms, which are faster than software-implemented algorithms. This has always been the case. That’s why video decode moved to hardware a while back. It’s just that you need a lot of volume to justify creating hardware to solve a problem that software can solve.

    • brucethemoose
    • 3 years ago

    I remember when GPUs used to be “accelerators”, specifically for graphics… Now they’re the inefficient general purpose processors.

    And I’m not very old.

      • jihadjoe
      • 3 years ago

      S3 still wins for making the one and only ‘graphics decelerator’

        • ozzuneoj
        • 3 years ago

        “Virge – noun – an obsolete variant of verge”

        It’s like they knew when they named the thing that it was already useless for 3D acceleration…

        They do make good DOS Gaming 2D cards though. 🙂

        • lycium
        • 3 years ago

        Terminal Velocity, 640×480, ~10fps 🙂

      • lycium
      • 3 years ago

      GPUs are amazing semi-general purpose processors.

      Source: I develop GPU 3D and fractal rendering software with OpenCL.

    • tom_in_mn
    • 3 years ago

    Anyone who remembers getting an 80287 numeric coprocessor and how much faster it was won’t be surprised. Hardware acceleration makes a large difference. Same reason we spend so much money on GPUs. But they are no help if they don’t do the calculation you need.

      • UberGerbil
      • 3 years ago

      Yes, but I also remember at the time that the “alternate floating point” library in the MS C compiler, which used 64-bit FP rather than the 80-bit stack machine the x87 used, was actually faster than the x87 hardware. A hardware-accelerated kludge can be slower than a smarter approach without any acceleration. Throwing transistors at a problem is the easy way (especially when Moore’s Law was running at full tilt) but it’s not always the optimum way.

        • Misel
        • 3 years ago

        So you had one of the first Intel FPUs?

        IIRC, Cyrix was the first to actually have a mathematician optimize the FPU. The result was that their x87 was a lot faster than the Intel x87.

        [url<]https://en.wikipedia.org/wiki/Cyrix[/url<]

          • Chrispy_
          • 3 years ago

          All I remember from Cyrix was that their processors sucked for gaming.

          My Pentium 133 ran Doom/Duke3D/Half-Life at [i<]more than[/i<] twice the framerate of a “166+ Cyrix” (which I believe only ran at 120MHz).

            • srg86
            • 3 years ago

            This was earlier than the 6x86.

            Their first products are what Misel is referring to: the Cyrix FasMath CX-83D87.

            [url<]http://www.cpu-world.com/CPUs/80387/Cyrix-FasMath%20CX-83D87-33-GP.html[/url<]

            These were faster than Intel’s 387s. The trouble was that the interface between the two chips was the main bottleneck (eliminated in the 80486). The 6x86’s FPU was not much more advanced than the FasMath (although without the interface bottleneck), while the Pentium had a fully pipelined FPU. At integer work, the 120MHz “166+ Cyrix” was actually more like a Pentium 166; it just sucked at floating point. That said, I was never a fan of the 6x86 myself, as it was quirky with instruction set support. The 6x86L was a little better, and the 6x86MX *much* better.

    • blastdoor
    • 3 years ago

    I re-read TR’s original coverage and it highlighted the use of reduced precision computation. Since the GPU guys are doing that too, I’m guessing there must be more to it than that…
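
For a sense of what reduced precision buys, the general idea is to quantize trained float32 weights down to 8-bit integers and do the heavy multiply-accumulate work in integer arithmetic, rescaling afterwards. Below is a rough NumPy sketch of symmetric int8 quantization; the shapes, scale scheme, and random data are invented for illustration, and the TPU's actual quantization pipeline is more involved than this.

```python
import numpy as np

rng = np.random.default_rng(0)
weights_f32 = rng.standard_normal((256, 128)).astype(np.float32)  # "trained" weights
inputs_f32 = rng.standard_normal((1, 256)).astype(np.float32)     # one input vector

def quantize(x):
    """Map float32 values to int8 plus a scale factor (symmetric quantization)."""
    scale = np.abs(x).max() / 127.0
    return np.round(x / scale).astype(np.int8), scale

w_q, w_scale = quantize(weights_f32)
x_q, x_scale = quantize(inputs_f32)

# The expensive multiply-accumulate runs on 8-bit integers (accumulated in int32),
# which is what lets an accelerator pack far more MACs per watt than float32 math.
acc_i32 = x_q.astype(np.int32) @ w_q.astype(np.int32)
result_approx = acc_i32 * (x_scale * w_scale)   # rescale back to float

result_exact = inputs_f32 @ weights_f32
print(np.max(np.abs(result_approx - result_exact)))  # small quantization error
```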

      • ImSpartacus
      • 3 years ago

      I bet part of it is the form factor.

      If you can honestly cram one of these everywhere that you can put a 2.5″ drive, then you could probably achieve impressive density.

      Remember that something like Nvidia’s flagship machine learning rack, the DGX-1, is a relatively large 3U server.

      So Google is just betting that these TPU accelerators can provide enough performance density to beat big, beefy GPUs.
