Four years ago, Google was faced with a conundrum: if all its users hit its voice recognition services for three minutes a day, the company would need to double the number of data centers just to handle all of the requests to the machine learning system powering those services.
Rather than buy a bunch of new real estate and servers just for that purpose, the company embarked on a journey to create dedicated hardware for running machine-learning applications like voice recognition.
The result was the Tensor Processing Unit (TPU), a chip that is designed to accelerate the inference stage of deep neural networks. Google published a paper on Wednesday laying out the performance gains the company saw over comparable CPUs and GPUs, both in terms of raw power and the performance per watt of power consumed.
A TPU was on average 15 to 30 times faster at the machine learning inference tasks tested than a comparable server-class Intel Haswell CPU or Nvidia K80 GPU. Importantly, the performance per watt of the TPU was 25 to 80 times better than what Google found with the CPU and GPU.
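The two metrics in that comparison are computed differently: raw speedup is a ratio of throughputs, while the efficiency gain also folds in power draw. The sketch below illustrates the arithmetic with hypothetical placeholder numbers (these are not Google's measured figures):

```python
# Illustrative sketch of how speedup and performance-per-watt ratios
# like those in the paper are derived. All numbers are hypothetical.

def perf_per_watt(inferences_per_sec: float, watts: float) -> float:
    """Throughput delivered per watt of power consumed."""
    return inferences_per_sec / watts

# Hypothetical measurements for a single inference workload.
cpu_pw = perf_per_watt(inferences_per_sec=5_000, watts=145)    # server-class CPU
tpu_pw = perf_per_watt(inferences_per_sec=150_000, watts=75)   # TPU

speedup = 150_000 / 5_000          # raw throughput ratio
efficiency_gain = tpu_pw / cpu_pw  # perf/watt ratio

print(f"speedup: {speedup:.0f}x, perf/watt gain: {efficiency_gain:.0f}x")
```

Note that because the TPU in this sketch also draws less power than the CPU, the perf-per-watt gain (58x) exceeds the raw speedup (30x), mirroring the pattern in Google's reported ranges.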
Driving this sort of performance increase is important for Google, considering the company’s emphasis on building machine learning applications. The gains validate the company’s focus on building machine learning hardware at a time when it’s harder to get massive performance boosts from traditional silicon.
This is more than just an academic exercise. Google has used TPUs in its data centers since 2015 and they’ve been put to use improving the performance of applications including translation and image recognition. The TPUs are particularly useful when it comes to energy efficiency, which is an important metric related to the cost of using hardware at massive scale.
One of the other key metrics for Google’s purposes is latency, which is where the TPUs excel compared to other silicon options. Norm Jouppi, a distinguished hardware engineer at Google, said that machine learning systems need to respond quickly in order to provide a good user experience.
“The point is, the internet takes time, so if you’re using an internet-based server, it takes time to get from your device to the cloud, it takes time to get back,” Jouppi said. “Networking and various things in the cloud — in the data center — they take some time. So that doesn’t leave a lot of [time] if you want near-instantaneous responses.”
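Jouppi's point can be framed as a simple latency-budget calculation: fixed network and data-center overheads are subtracted from the total response deadline, and only the remainder is available for inference. The millisecond values below are hypothetical, chosen purely to illustrate the arithmetic:

```python
# Latency-budget sketch: overheads eat into a fixed response deadline,
# leaving only a slice for the inference itself. Values are hypothetical.

RESPONSE_DEADLINE_MS = 200.0  # rough threshold for a "near-instantaneous" feel

overheads_ms = {
    "device to cloud": 40.0,        # request travels over the internet
    "data-center networking": 25.0, # routing and queuing inside the cloud
    "cloud to device": 40.0,        # response travels back
}

inference_budget_ms = RESPONSE_DEADLINE_MS - sum(overheads_ms.values())
print(f"time left for inference: {inference_budget_ms:.0f} ms")
```

Under these assumed numbers, inference must finish in under 95 ms, which is why low, predictable chip latency matters as much as raw throughput.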
Google tested the chips on six different neural network inference applications, representing 95 percent of all such applications in Google’s data centers. The applications tested include DeepMind AlphaGo, the system that defeated Lee Sedol at Go in a five-game match last year.