Just a few days back, Google published some interesting benchmarks of its Tensor Processing Unit, a custom-made ASIC targeted at AI inference applications. The numbers were pretty impressive. Google claimed that its TPU was roughly 15 times faster than an unspecified GPU (which we took to mean a Tesla) when running inference tasks, and a whopping 30 to 80 times better when it came to performance per watt. As the CEO of a prominent GPU vendor, Nvidia's Jen-Hsun Huang naturally wanted to weigh in on the numbers that Google shared.
Nvidia agrees with Google on a lot of points, chiefly that dedicated hardware is what makes many modern AI applications scalable at all. However, Huang notes that the Google engineers compared their TPU to a now-outdated Kepler-based Tesla K80, which he says was never really optimized for handling AI-related tasks. He goes on to point out that the contemporary Tesla P40 (based on the Pascal architecture) "delivers 26x [the K80's] deep-learning inferencing performance."
With those facts in mind, Nvidia produced a chart quantifying the performance leap from the K80 to the P40 in both training and inference tasks, and estimating where Google's TPU fits in the picture.
Right off the bat, it's easy to spot one caveat: Google's TPU isn't meant to run training tasks, which explains the "NA" in the corresponding table cell. However, despite the huge performance boost of the P40 over the old K80, the fact remains that the TPU still comes out almost twice as fast as the modern Tesla in integer teraops (at 90 versus 48). Nvidia didn't include performance-per-watt figures in the chart, though. Those figures likely wouldn't make even the modern Tesla look particularly good with the tested workload.
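For readers who want to check the "almost twice as fast" figure, here is a minimal Python sketch of the arithmetic. It uses only the two integer teraops numbers quoted above (90 for the TPU, 48 for the P40). The constant and function names are our own, chosen for illustration, and the result says nothing about power draw or real-world workloads.

```python
# Integer throughput figures quoted from Nvidia's chart, in teraops.
TPU_TERAOPS = 90.0
P40_TERAOPS = 48.0


def speedup(candidate_teraops: float, baseline_teraops: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_teraops <= 0:
        raise ValueError("baseline throughput must be positive")
    return candidate_teraops / baseline_teraops


tpu_vs_p40 = speedup(TPU_TERAOPS, P40_TERAOPS)
print(f"TPU vs. P40: {tpu_vs_p40:.3f}x")  # prints "TPU vs. P40: 1.875x"
```

The ratio works out to 1.875, which is where "almost twice as fast" comes from.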
What this data amounts to is a little unclear. In our view, comparing the performance of a dedicated ASIC with that of a general-purpose GPU is flawed to begin with, and in this specific instance the comparison doesn't even make the Tesla card look all that good. With that said, developing an ASIC and the probably-specialized software to go with it isn't remotely as easy as buying a Tesla card and tapping into Nvidia's extensive HPC software library. The only moral in this story seems to be "pick the hardware that suits your needs."