Google’s next-generation AI training system is monstrously fast

The TPU V2 could be a huge boon to Google’s cloud computing platform

Google today is unveiling its second-generation Tensor Processing Unit, a cloud computing hardware and software system that underpins some of the company’s most ambitious and far-reaching technologies. CEO Sundar Pichai announced the news onstage at the keynote address of the company’s I/O developer conference this morning.

Google’s TPU is the foundation of its artificial intelligence work

The first TPU, shown off last year as a special-purpose chip designed specifically for machine learning, is used by the AlphaGo artificial intelligence system as the foundation of its predictive and decision-making skills. Google also draws on the computational power of TPUs every time someone enters a query into its search engine. More recently, the technology has been applied to the machine learning models behind Google Translate, Google Photos, and other software that can take advantage of new AI training techniques.

Typically, this work is done using commercially available GPUs, often from Nvidia; Facebook, for instance, uses Nvidia GPUs in its Big Basin AI training servers. But over the last few years Google has opted to build some of this hardware itself and optimize it for its own software.

In that sense, the original TPU was designed to work best with Google’s TensorFlow, one of many open-source software libraries for machine learning. Thanks to Google’s advances and its tight integration of hardware and software, TensorFlow has emerged as one of the leading platforms for building AI software. That optimization, coupled with in-house talent from Google Brain and the company’s DeepMind subsidiary, is part of the reason Google remains at the forefront of the broader AI field.

Now, Google says the second version of its TPU system is fully operational and being deployed across Google Compute Engine, a platform that other companies and researchers can tap for computing resources, much as they do with Amazon’s AWS and Microsoft’s Azure. Google will of course use the system itself, but it is also billing the new TPU as an unrivaled resource for other companies to make use of.

A server rack containing multiple Tensor Processing Units, which are now used to both train AI systems and help them perform real-time tasks.
Photo: Google

To that end, the company developed a way to rig 64 TPUs together into what it calls TPU Pods, effectively turning a Google server rack into a supercomputer with 11.5 petaflops of computational power. Even on their own, the second-gen TPUs are capable of “delivering a staggering 180 teraflops of computing power and are built for just the kind of number crunching that drives machine learning today,” says Fei-Fei Li, Google’s chief scientist of AI and machine learning.
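
Those two figures are consistent with each other. As a quick back-of-the-envelope check (a sketch based only on the numbers quoted above, not on a published spec sheet):

```python
# Sanity check of the figures quoted above: 64 second-gen TPU devices
# per pod, each rated at 180 teraflops.
teraflops_per_device = 180
devices_per_pod = 64

pod_teraflops = devices_per_pod * teraflops_per_device  # 11,520 teraflops
pod_petaflops = pod_teraflops / 1000                    # 11.52 petaflops
print(round(pod_petaflops, 1))                          # ~11.5, matching Google's figure
```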

The edge this gives Google over competitors’ offerings is the speed and freedom to experiment, says Jeff Dean, a senior fellow on the Google Brain team. “Our new large-scale translation model takes a full day to train on 32 of the world’s best commercially available GPUs,” Dean told a group of reporters in a press briefing this week, “while one eighth of a TPU pod can do the job in an afternoon.”
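
Read against the pod size above, Dean’s comparison implies eight second-gen TPU devices, since one eighth of a 64-device pod is 8. By the same rough arithmetic as before (again a sketch from the quoted numbers, not a benchmark):

```python
# One eighth of a 64-device TPU Pod, using the figures quoted earlier.
devices_in_eighth_pod = 64 // 8                     # 8 TPU devices
eighth_pod_teraflops = devices_in_eighth_pod * 180  # 1,440 teraflops
print(eighth_pod_teraflops / 1000)                  # ~1.44 petaflops
```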

The second-gen TPU can be turned into an AI-training supercomputer

Beyond speed, the second-gen TPU will also allow Google’s servers to handle both inference and training simultaneously. The previous TPU could only do inference, for instance relying on Google Cloud to crunch numbers in real time to produce a result. Training, on the other hand, is how an AI algorithm is developed in the first place, and it takes exceptional resources.
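
To make that distinction concrete, here is a minimal sketch in present-day TensorFlow syntax (illustrative only, and not Google’s code): inference just runs a trained model forward to produce an answer, while training also computes gradients and updates the model’s weights, which is the far more expensive part.

```python
import tensorflow as tf

# A toy linear model: y = xW + b.
W = tf.Variable(tf.random.normal([3, 1]))
b = tf.Variable(tf.zeros([1]))

def infer(x):
    # Inference: forward pass only; the weights never change.
    return tf.matmul(x, W) + b

def train_step(x, y_true, lr=0.01):
    # Training: forward pass, plus gradient computation and a weight update.
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(infer(x) - y_true))
    grad_W, grad_b = tape.gradient(loss, [W, b])
    W.assign_sub(lr * grad_W)
    b.assign_sub(lr * grad_b)
    return loss
```

The first-generation TPU handled only the forward-pass half of this workload; the second generation is built to run the gradient-and-update half as well.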

Machine learning, the bedrock of modern AI research, effectively means feeding an algorithm hundreds of thousands of examples so that it learns to perform a task in a way it was never expressly programmed to do. This has manifested itself in a number of different consumer products, like Google Translate’s near-instantaneous ability to turn an English sentence into its Mandarin counterpart, or AlphaGo’s ability to play the deep, chess-like board game Go with superhuman proficiency.

It all comes down to training neural networks on large mounds of data and transforming all of that into a workable algorithm, and that takes computational power. These training systems, in the more general sense, improve AI software through massive number crunching. So the more powerful the hardware, the faster you get results. “If we can get the time for each experiment down from weeks to days or hours, this improves the ability for everybody doing machine learning to iterate more quickly and do more experiments,” Dean says.

Because this newer TPU is now capable of doing both inference and training, researchers can deploy more versatile AI experiments far faster than before — so long as the software is built using TensorFlow. Google is also reiterating its commitment to the open source model by offering up TPU resources to researchers who agree to publish their findings and possibly even open source their code. The company is calling the program the TensorFlow Research Cloud, and it will be giving out access to a cluster of 1,000 TPUs for free.