Highlights:
- A single A3 supercomputer VM is powered by eight H100 GPUs based on Nvidia’s Hopper architecture, delivering three times the compute performance of the previous-generation chip, the A100.
- A3 VMs are not only powerful; Google Cloud also offers flexible options for deploying them.
With the debut of its A3 supercomputers, Google Cloud is expanding its portfolio of virtual machines for training and operating artificial intelligence and machine learning models.
Announced at Google I/O, the Google Compute Engine A3 supercomputer VM is purpose-built to train and deploy state-of-the-art AI models, including those driving advances in generative AI.
Cutting-edge AI and machine learning require massive amounts of computing power delivered by purpose-built infrastructure, said Roy Kim, Director of Product Management, and Chris Kleban, Group Product Manager at Google. With the A3 supercomputer, Google Cloud pairs Nvidia Corp.’s new H100 graphics processing units with the company’s state-of-the-art networking advancements, a combination Kim and Kleban said ensures customers have access to the highest-performing GPUs for AI workloads.
A single A3 virtual machine (VM) is powered by eight H100 GPUs based on Nvidia’s Hopper architecture, delivering three times the processing speed of the A100. It also offers 3.6 terabytes per second of bisectional bandwidth across these GPUs via NVSwitch and NVLink 4.0, as well as Intel Corp.’s 4th Gen Xeon Scalable processors for offloading management tasks.
The A3 supercomputer is the first GPU instance to use Google’s purpose-built Intel Infrastructure Processing Units, which accelerate GPU-to-GPU data transfers by bypassing the host CPU. According to Google, this increases network bandwidth by up to 10x compared to the previous-generation A2 VMs.
These instances also run on the intelligent network fabric of Google’s Jupiter data center and can scale to 26,000 interconnected GPUs, delivering up to 26 exaFlops of AI performance. As a result, Google says, A3 VMs significantly reduce the time and cost required to train large-scale machine learning models. And when moving from training to deployment, A3 VMs deliver a 30x improvement in inference performance compared to A2 VMs.
A3 VMs are powerful, and Google Cloud offers flexible deployment options for them. For example, customers can deploy A3 VMs on Google Cloud’s Vertex AI platform to build machine learning models on fully managed infrastructure purpose-built for high-performance training. Vertex AI was recently updated with new generative AI capabilities to better support the development of large language models.
Alternatively, the company said, customers looking to build their own bespoke software stacks can deploy the A3 supercomputer on Google Compute Engine or Google Kubernetes Engine. This lets teams train and serve advanced foundation models while taking advantage of autoscaling, workload orchestration, and automatic upgrades.
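For teams exploring the Compute Engine and Kubernetes Engine paths described above, the provisioning flow might look roughly like the following gcloud sketch. This is a hedged illustration, not from the announcement: the `a3-highgpu-8g` machine type name, the zone, and the cluster/instance names are assumptions that should be checked against your project’s available machine types and regions before use.

```shell
# Sketch: provisioning A3 capacity on Google Cloud.
# Machine type, zone, and resource names below are assumptions;
# verify with `gcloud compute machine-types list` for your project.

# Option 1: a standalone A3 VM on Google Compute Engine.
gcloud compute instances create my-a3-vm \
    --zone=us-central1-a \
    --machine-type=a3-highgpu-8g \
    --image-family=debian-11 \
    --image-project=debian-cloud \
    --maintenance-policy=TERMINATE

# Option 2: an A3-backed node pool on Google Kubernetes Engine,
# with autoscaling so idle GPU nodes can scale to zero.
gcloud container node-pools create a3-pool \
    --cluster=my-cluster \
    --zone=us-central1-a \
    --machine-type=a3-highgpu-8g \
    --num-nodes=1 \
    --enable-autoscaling --min-nodes=0 --max-nodes=4
```

The GKE route is what enables the autoscaling and workload orchestration the article mentions: the cluster scheduler places training or serving pods onto the A3 node pool and grows or shrinks it with demand.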