Testing HPC On Google’s TPU Matrix Engines

In an ideal cloud platform, you wouldn’t know or care about the underlying hardware and how it was put together to run your HPC – and now AI – applications. The underlying hardware in a cloud would have a mix of different types of compute and storage, an all-to-all network connecting it, and anything you need could be dialed in on the fly.

This is precisely the type of compute cloud that Google wanted to build in April 2008 with App Engine, and ultimately very few organizations wanted to buy it. Companies cared – and still do – about the underlying infrastructure, but at the same time, Google still believes in its heart in the platform cloud. And that is one of the reasons why its Tensor Processing Unit, or TPU, compute engines are only available on Google Cloud. (Although you could argue that the matrix math units of the GroqChip, available through Groq, are as much an architectural copy of the TPU as Kubernetes is of Google’s Borg container and cluster controller, Hadoop is of Google’s MapReduce data analytics and storage, or CockroachDB is of Google’s Spanner SQL database.)

When you put a cluster of 2,048 TPU cores on a tightly coupled toroidal mesh network, with a combined 32 TB of HBM2 memory and over 100 petaflops of single-precision floating point performance (with support for mixed precision for inference), out there on the cloud for people to run applications on, it is natural for people running HPC applications to tweak their codes to make use of the TPU.

Over the past few years, a number of experiments have been conducted, the latest by Google itself, showing how to speed up the fluid dynamics calculations behind predicting river floods. There are plenty more examples, which we will get to in a moment, and it will be interesting to see if the TPU and its clones find a place in HPC or if this remains a collection of interesting science projects.

It should be noted that this is precisely how, starting around 2006, math algorithms and then portions of HPC codes were offloaded from CPUs to GPUs, triggering a revolution that the explosion of AI a few years later took advantage of. Now AI is driving technologies like the TPU, and HPC can benefit in turn.

In the latest paper from Google, which you can read here, researchers moved the calculations underlying hydrodynamic flood models from CPUs to TPUs and measured the performance of a TPU core against a relatively modern X86 CPU core. (Comparisons to GPUs were not provided, which is notable given that Google Cloud sells raw GPU capacity and that Google runs massive internal GPU farms for various parts of its machine learning training stack.) While the TPUv4 matrix engine has been in development and pre-production for some time now, as far as we know it has not been deployed on Google Cloud, so the TPUv3 matrix engines, which we have profiled here, are the only ones people can kick the tires on.

Here is the interesting thing that jumped out from reading this paper: the HPC codes used to run the flood simulation were not ported from a parallel stack running, say, Fortran code with OpenMP and MPI. Google went straight to the Saint-Venant shallow water partial differential equations used to simulate liquid flow over topography and implemented them in Python on top of its TensorFlow machine learning framework, using its Accelerated Linear Algebra (XLA) compiler. It is important to note that Google created a fully 2D river flood model rather than the hybrid 1D-2D models commonly used on CPU-only systems, which are computationally weak compared to a bank of TPUs. This is what the TPU simulation flow looks like for this flood simulation:
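To give a feel for what those equations look like as code, here is a minimal sketch – emphatically not Google’s implementation – of a forward-backward update of the linearized 2D shallow-water (Saint-Venant) equations on a periodic grid, written in plain NumPy rather than TensorFlow to keep it self-contained. All the parameter values are illustrative:

```python
import numpy as np

# Illustrative constants: gravity, mean depth, grid spacing, time step
g, H = 9.81, 1.0
dx, dt = 1.0, 0.1

def ddx(f):
    # Centered difference along x with periodic wrap-around
    return (np.roll(f, -1, axis=1) - np.roll(f, 1, axis=1)) / (2 * dx)

def ddy(f):
    # Centered difference along y with periodic wrap-around
    return (np.roll(f, -1, axis=0) - np.roll(f, 1, axis=0)) / (2 * dx)

def step(h, u, v):
    # Continuity: water height drops where the velocity field diverges
    h = h - dt * H * (ddx(u) + ddy(v))
    # Momentum: flow accelerates down the updated water-surface gradient
    u = u - dt * g * ddx(h)
    v = v - dt * g * ddy(h)
    return h, u, v

# A Gaussian mound of water relaxing outward over flat topography
n = 64
yy, xx = np.mgrid[0:n, 0:n]
h = 1.0 + 0.1 * np.exp(-((xx - n / 2) ** 2 + (yy - n / 2) ** 2) / 20.0)
u = np.zeros((n, n))
v = np.zeros((n, n))
mass0 = h.sum()
for _ in range(50):
    h, u, v = step(h, u, v)
```

Every operation in that loop is an elementwise multiply-add or a shifted array read, which is exactly the kind of dense tensor arithmetic that TensorFlow can express and XLA can compile down to matrix engine hardware.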

Google has open sourced the code behind this simulation to help HPC researchers see how the flood model works and perhaps do their own work in other parts of the HPC sector. You can download that code here. There are some tricky things you have to do to create the initial boundary conditions for any simulation, and this Google-made application handles that. And the collective operations in the TensorFlow framework, enabled by the TPUv3 interconnect, seem to do a good job – based on visual inspection of the images and comparison with actual data from the real flood that was simulated – of determining the height of the water and its flow across the simulated topography.

As with any fluid flow simulation, resolution matters, but it comes at a high computational cost. Low resolution models give you an idea, but high resolution models give you something more like data. So Google ran its simulations on X86 CPU cores and TPU cores at 8-meter, 4-meter, and 2-meter grid resolutions, then pushed a quarter pod of TPUs to deliver a 1-meter grid resolution. The simulation was run on a section of the Arkansas River that flooded in May 2019. Google tested the resolution scaling against different size slices of the TPU pod, ranging from a single TPU core up to a quarter pod with 512 TPU cores. The datasets ranged from 15.5 million grid points at 8-meter resolution to 1 billion grid points at 1-meter resolution.
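Those dataset sizes follow directly from the grid geometry: halving the grid spacing quadruples the point count, so going from the 8-meter grid to the 1-meter grid multiplies the number of points by 8² = 64. A quick back-of-the-envelope check:

```python
# 15.5 million grid points at 8-meter resolution, per the paper
points_8m = 15.5e6

# Point count scales with the inverse square of the grid spacing
points = {res: int(points_8m * (8 / res) ** 2) for res in (8, 4, 2, 1)}
for res, n in points.items():
    print(f"{res} m grid: ~{n / 1e6:.0f} million points")
```

The 1-meter grid works out to 992 million points, which is the "1 billion" figure cited in the paper.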

Here’s how the river flood simulation performed on different TPU compute sizes and at different resolutions:

For some reason, Google did not run this flood simulation on an entire TPU pod. As you can see in the table above, there were diminishing returns going from 128 TPU cores to 512 TPU cores at the 8-meter resolution, but at the finer resolutions the scaling held up quite well because there was more computation to go around. Still, the scaling was dropping off quickly enough that maybe Google did not want to talk about it. OK, we think Google surely did not want to talk about it. But we realize that it is difficult to do full-iron scale simulations on any supercomputer, and on a second pass Google would no doubt be able to do better at larger scale. Just like real HPC shops do with their simulations.
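The diminishing returns in the table are what strong-scaling efficiency measures: actual speedup divided by ideal speedup. Here is how to compute it, with invented timings purely to illustrate the shape of the curve – these numbers are not from Google’s paper:

```python
def efficiency(t_base, cores_base, t, cores):
    """Strong-scaling efficiency: measured speedup over ideal speedup."""
    speedup = t_base / t
    ideal = cores / cores_base
    return speedup / ideal

# Hypothetical run times in seconds at a fixed resolution (invented numbers)
timings = {1: 1000.0, 8: 140.0, 128: 12.0, 512: 5.0}

for cores, t in timings.items():
    eff = efficiency(timings[1], 1, t, cores)
    print(f"{cores:4d} cores: {eff:.0%} efficient")
```

On numbers like these, efficiency erodes from 100 percent on one core to under 40 percent at 512 cores – the point at which HPC shops usually stop adding hardware to a fixed-size problem, and plausibly why a full-pod run was not shown.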

So how well did the TPU predict the flood? Well enough to tell emergency responders where the trouble spots were going to be, we think. Here is an aerial view of flooding over a section of the Arkansas River at 1-meter resolution showing the actual flood extent:

And here is where the simulation predicted the flood would be, based on a river flow similar to what happened during the actual flood:

The other interesting piece of research Google did was to run the same simulation on CPUs as well as its TPUs, using the same code stack and simply swapping the XLA compiler used for the TPU for the Eigen C++ template library for linear algebra running on Linux-based processors.

Here’s how the CPU compared to the TPU on a per-core level:

The processor in question here is a “Cascade Lake” Xeon SP-8273CL Platinum, which has 28 cores running at 2.2 GHz and is rated at around 2 teraflops at FP32 single precision. (Single-precision floating point performance ratings in the IEEE FP32 format have never been published for the TPUs.) The performance difference per core is well over 500X, which makes sense given the number and size of the MXU matrix math units in the TPU cores. Each TPUv3 core has two 128×128 matrix math units, and each TPUv3 chip has two cores; there are four TPUv3 chips per motherboard and 256 motherboards per TPUv3 pod. (By the way, the TPUv2 had half the HBM memory per core, at 8 GB, and half the MXUs per core, at one each, compared to the TPUv3. So the TPUv2 iron that is still available on Google Cloud would show roughly half that per-core advantage over the X86 iron.)
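Those pod dimensions also let us sanity-check the "over 100 petaflops" figure from earlier. The arithmetic below uses the core, chip, and board counts given above plus an assumed TPUv3 clock of roughly 940 MHz (our assumption, not a figure from the paper), and counts each multiply-accumulate in the MXUs as two floating point operations:

```python
mxus_per_core   = 2      # two 128x128 matrix math units per core
cores_per_chip  = 2
chips_per_board = 4
boards_per_pod  = 256
clock_hz        = 940e6  # assumed TPUv3 clock rate

cores_per_pod = cores_per_chip * chips_per_board * boards_per_pod
macs_per_core_cycle = mxus_per_core * 128 * 128

# Each MAC counts as two flops (one multiply plus one add)
flops_per_pod = cores_per_pod * macs_per_core_cycle * 2 * clock_hz
print(cores_per_pod, "cores,", round(flops_per_pod / 1e15, 1), "petaflops peak")
```

That works out to 2,048 cores and roughly 126 petaflops of peak matrix math throughput per pod, which squares with the "over 100 petaflops" claim – bearing in mind that the MXUs hit that peak at their native reduced precision, not IEEE FP32.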

Google has not shown how X86 servers could be clustered and scaled, and it certainly did not talk about the cost of running the simulation on a CPU cluster versus a TPU cluster within a given timeframe, as you need for weather and emergency management simulations. But given this data and a lot of guesses, HPC shops can start thinking about what that might look like. (We might do this work ourselves when the newsfeed slows down in July, just for fun, figuring out how a more modern cluster using AMD “Milan” Epyc 7003 processors might compare to rented capacity on TPUv3 and TPUv4 pods. Hmmm.)

As we pointed out above, the number of HPC codes that have been ported to the TPU to speed them up is growing, and Google is not doing all of the work, because the HPC community is curious in its own right. Here are the papers we were able to find without straining the search engine too much:

Wouldn’t it be funny if, after coming all this way with processors and accelerators, we end up with an architecture that looks like an 80286 processor with a massively parallel set of 80287 coprocessors to do its math homework? IBM did much the same with the six-way System/3090 mainframes, slapping a vector math unit on each engine in 1989, back when we were just getting started in this datacenter racket and when Cray was winning its first commercial customers. Everything will depend on the software, of course.

And one final thought: any code created to speed up HPC on TPUs would probably be relatively easy to move to the matrix math engines created by Cerebras, SambaNova, and GraphCore, as well as Groq.
