Highlights:
- Microsoft is taking on this challenge by using its ten years of supercomputing expertise to support the largest AI training workloads.
- Microsoft’s new ND H100 v5 instances use NVLink, an Nvidia technology, to connect the eight H100 chips to one another.
Microsoft Corp. has added a new instance family, the ND H100 v5 series, to its Azure cloud platform. The instances, which debuted recently, are built specifically to run artificial intelligence models.
Matt Vegas, a principal product manager in Azure’s high-performance computing and AI group, wrote in a blog post: “Delivering on the promise of advanced AI for our customers requires supercomputing infrastructure, services, and expertise to address the exponentially increasing size and complexity of the latest models. At Microsoft, we are meeting this challenge by applying a decade of experience in supercomputing and supporting the largest AI training workloads.”
Each ND H100 v5 instance features eight of Nvidia Corp.’s H100 graphics processing units. The H100, released last March, is Nvidia’s most advanced data center GPU. It can train AI models nine times faster than the company’s previous flagship chip and run them up to 30 times faster.
The H100 comprises 80 billion transistors made using a four-nanometer process. It includes a specialized module called the Transformer Engine, which is designed to speed up AI models built on the Transformer neural network architecture, sketched below. That architecture powers many cutting-edge AI models, including OpenAI LLC’s ChatGPT chatbot.
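For illustration, here is a minimal sketch of the kind of layer the Transformer architecture is built from, written with PyTorch’s stock `nn.TransformerEncoderLayer`. The dimensions are arbitrary example values, and the snippet is not tied to Nvidia’s Transformer Engine software:

```python
import torch
import torch.nn as nn

# A single Transformer encoder block: self-attention plus a
# feed-forward network. This is the building unit that
# Transformer-based models such as ChatGPT stack many times over.
layer = nn.TransformerEncoderLayer(
    d_model=1024,          # embedding width (illustrative value)
    nhead=16,              # number of attention heads
    dim_feedforward=4096,  # hidden size of the feed-forward sublayer
    batch_first=True,      # inputs arranged as (batch, sequence, embedding)
)

tokens = torch.randn(8, 128, 1024)  # a batch of 8 sequences, 128 tokens each
output = layer(tokens)              # one attention + feed-forward pass
print(output.shape)                 # torch.Size([8, 128, 1024])
```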
The H100 includes other upgrades from Nvidia as well. Among them is a built-in confidential computing feature, which can isolate an AI model so that unauthorized access attempts, including from the operating system and hypervisor it runs on, are blocked.
Advanced AI models are typically deployed across multiple graphics cards. When used this way, the GPUs must communicate with one another frequently to coordinate their work, as the sketch below illustrates, and companies often connect them with high-speed links to accelerate the data transfer between them.
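That coordination typically takes the form of collective operations such as all-reduce, in which every GPU combines its gradients with its peers’ after each training step. Here is a minimal sketch using PyTorch’s distributed package with Nvidia’s NCCL backend, which routes traffic over NVLink or the network where available; the script name and tensor size are illustrative:

```python
import os
import torch
import torch.distributed as dist

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for each GPU process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Stand-in for a gradient tensor produced by one training step.
    grad = torch.ones(1024, device="cuda") * dist.get_rank()

    # Sum the tensor across all GPUs, then average. NCCL carries the
    # transfer over NVLink when the GPUs share it.
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    grad /= dist.get_world_size()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

On an eight-GPU machine this could be launched with `torchrun --nproc_per_node=8 allreduce_sketch.py`, which starts one process per GPU.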
Microsoft’s new ND H100 v5 instances use NVLink, an Nvidia interconnect technology, to link the eight H100 GPUs to one another. Nvidia claims the technology is seven times faster than PCIe 5.0, the well-known networking standard. According to Microsoft, NVLink offers 3.6 terabytes per second of bandwidth across the eight GPUs in its new instances.
The instance series also supports NVSwitch, another Nvidia networking technology. Whereas NVLink connects the GPUs inside a single server, NVSwitch links multiple GPU servers to one another. That makes it easier to run complex AI models that must be distributed across numerous systems in a data center. A quick way to check GPU-to-GPU connectivity from software is sketched below.
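As an illustration, PyTorch exposes a peer-access query that reports whether two GPUs in a machine can read each other’s memory directly, which is the capability NVLink provides within a server. The check below only confirms direct reachability, not which interconnect carries it; on the host, the `nvidia-smi topo -m` command shows the actual link types between each pair:

```python
import torch

# Print which GPU pairs can access each other's memory directly.
# On NVLink-connected parts such as an eight-H100 server, peer access
# between local GPUs is typically available.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```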
Microsoft’s ND H100 v5 instances pair the H100 graphics cards with central processing units from Intel Corp. The CPUs come from Intel’s 4th Gen Xeon Scalable processor series, code-named Sapphire Rapids, which debuted in January.
Sapphire Rapids is based on an enhanced version of Intel’s 10-nanometer process. Each CPU in the series includes several onboard accelerators, computing units optimized for specific tasks. Thanks to those integrated accelerators, Intel says, Sapphire Rapids offers up to 10 times more performance than its previous-generation silicon for some AI applications. One way to check for those accelerator features from software is sketched below.
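One of those accelerators is Intel’s Advanced Matrix Extensions, or AMX, which targets matrix-heavy AI workloads. Here is a small, Linux-only sketch that checks whether a system exposes the standard AMX feature flags in `/proc/cpuinfo`; treat it as illustrative rather than an official detection method:

```python
# Check /proc/cpuinfo for Sapphire Rapids' AMX accelerator flags.
# Linux only; flag names follow the kernel's cpuinfo output.
AMX_FLAGS = {"amx_tile", "amx_bf16", "amx_int8"}

with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            cpu_flags = set(line.split(":", 1)[1].split())
            found = AMX_FLAGS & cpu_flags
            print("AMX support:", ", ".join(sorted(found)) or "not exposed")
            break
```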