Highlights –
- The latest MLPerf benchmarks include Training 2.1 for ML training, HPC 2.0 for large systems including supercomputers, and Tiny 1.0 for small and embedded deployments.
- Nvidia reports significant technological improvements in the most recent MLPerf Training benchmarks.
The most recent set of Machine Learning (ML) MLPerf benchmarks from MLCommons is out today, demonstrating how hardware and software for Artificial Intelligence (AI) are getting faster.
MLCommons, a vendor-neutral organization, aims to offer benchmarks and standardized testing to help assess the state of ML hardware and software. MLCommons publishes various ML benchmarks under the MLPerf name multiple times a year. The MLPerf Inference results, which demonstrated how many technologies have enhanced inference performance, were published in September.
The new MLPerf benchmarks announced include Training 2.1 for ML training; HPC 2.0 for large systems, including supercomputers; and Tiny 1.0 for small and embedded deployments.
According to David Kanter, executive director of MLCommons, “The key reason why we’re doing benchmarking is to drive transparency and measure performance. This is all predicated on the key notion that once you can actually measure something, you can start thinking about how you would improve it.”
How the MLPerf Training benchmark works
Focusing on the training benchmark, Kanter pointed out that MLPerf isn’t just about hardware but also about software.
Models in ML systems must first be trained on data in order to function. The training process benefits from accelerator hardware, as well as optimized software.
According to Kanter, the MLPerf Training benchmark begins with a predetermined dataset and a model. Organizations then train the model to reach a certain quality level. Time to train is one of the main criteria that the MLPerf Training benchmark measures.
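To make the time-to-train metric concrete, here is a minimal, illustrative Python sketch (not the actual MLPerf harness): train until a target quality level is reached and record the elapsed wall-clock time. The training and evaluation callables below are hypothetical stand-ins.

```python
# Illustrative sketch only -- not the actual MLPerf Training harness.
# Time-to-train: run training until a target quality level is reached and
# report the elapsed wall-clock time. The callables here are stand-ins.
import time

def time_to_train(train_one_epoch, evaluate, target_quality, max_epochs=100):
    """Return elapsed seconds to reach target_quality, or None if never reached."""
    start = time.perf_counter()
    for _ in range(max_epochs):
        train_one_epoch()                 # one pass over the training data
        if evaluate() >= target_quality:  # e.g., validation accuracy threshold
            return time.perf_counter() - start
    return None

# Toy usage with stand-in functions; a real run would train an actual model.
quality = [0.0]
elapsed = time_to_train(
    train_one_epoch=lambda: quality.__setitem__(0, quality[0] + 0.2),
    evaluate=lambda: quality[0],
    target_quality=0.75,
)
print(f"time to train: {elapsed:.6f} s")
```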
Kanter said, “When you look at the results, and this goes for any submission — whether it’s training, tiny, HPC, or inference — all of the results are submitted to say something. Part of this exercise is figuring out what that something they say is.”
The metrics can reveal relative performance levels and highlight how hardware and software have improved over time.
John Tran, chair of MLPerf Training at MLCommons and senior director of deep learning libraries and hardware design at Nvidia, noted that there were several software-only submissions for the most recent benchmark.
Tran added, “I find it continually interesting how we have so many software-only submissions, and they don’t necessarily need help from the hardware vendors. I think that’s great and is showing the maturity of the benchmark and usefulness to people.”
Intel and Habana Labs advance training with Gaudi2
Jordan Plawner, senior director of AI products at Intel, also emphasized the importance of software. During the MLCommons press call, Plawner outlined how ML inference and training workloads differ in terms of both hardware and software.
Plawner declared, “Training is a distributed-workload problem. Training is more than just hardware, more than just the silicon; it’s the software, it’s also the network and running distributed-class workloads.”
In contrast, Plawner noted that ML inference can be a single-node problem without the same distributed elements, which presents a lower barrier to entry for vendor technologies than ML training does.
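To illustrate the contrast Plawner describes, here is a hedged sketch assuming PyTorch as the example framework; the model and launch details are placeholders, not code from any MLPerf submission. Distributed training has to coordinate many workers over the network, while inference can run on a single device with no inter-node communication.

```python
# Hedged sketch, assuming PyTorch as the example framework; not code from any
# MLPerf submission. Distributed training -- launched with something like
# `torchrun --nproc_per_node=8 train.py` -- coordinates many workers over the
# network, while inference can run on one device with no collectives at all.
import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed_training() -> nn.Module:
    dist.init_process_group(backend="nccl")            # one process per accelerator
    local_rank = int(os.environ.get("LOCAL_RANK", 0))  # set by the launcher
    torch.cuda.set_device(local_rank)
    model = nn.Linear(1024, 1024).cuda()
    return DDP(model, device_ids=[local_rank])         # gradients synced every step

def run_inference(model: nn.Module, batch: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():                              # single node, no communication
        return model(batch)
```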
Intel performs well on the latest MLPerf Training benchmarks with its Gaudi2 technology. In 2019, Intel paid $2 billion to acquire Habana Labs and its Gaudi technology, an acquisition that has helped enhance the company's capabilities in recent years.
The most advanced silicon from Habana Labs is now the Gaudi2 system, which was announced in May. The latest Gaudi2 results show gains over the first set of benchmarks that Habana Labs reported with the MLPerf Training update in June. According to Intel, Gaudi2 improved time-to-train in TensorFlow by 10% for both the BERT and ResNet-50 models.
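As a rough illustration of what a 10% time-to-train improvement means, the numbers below are hypothetical and are not Intel's reported figures:

```python
# Hypothetical numbers only, to illustrate a 10% time-to-train improvement;
# these are not Intel's reported training times.
baseline_minutes = 20.0                           # assumed earlier submission
improved_minutes = baseline_minutes * (1 - 0.10)  # 10% less time to train
speedup = baseline_minutes / improved_minutes     # equivalent speedup factor
print(improved_minutes, round(speedup, 2))        # 18.0 1.11
```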
Nvidia H100 surpasses its predecessor
Nvidia is also reporting strong gains for its technologies in the latest MLPerf Training benchmarks.
Testing results for Nvidia’s Hopper-based H100 with MLPerf Training show significant gains over the prior generation A100-based hardware. In an Nvidia briefing call discussing the MLCommons results, Dave Salvator, director of AI, benchmarking and cloud at Nvidia, said that the H100 provides 6.7 times more performance than the first A100 submission had for the same benchmarks several years ago. Salvator said that a key part of what makes the H100 perform so well is the integrated transformer engine that is part of the Nvidia Hopper chip architecture.
While H100 is now Nvidia’s leading hardware for ML training, that’s not to say the A100 hasn’t improved its MLPerf Training results as well.
“The A100 continues to be a really compelling product for training, and over the last couple of years we’ve been able to scale its performance by more than two times from software optimizations alone,” Salvator said.
Overall, whether it’s with new hardware or continued software optimizations, Salvator expects there will be a steady stream of performance improvements for ML training in the months and years to come.
According to Salvator, “AI’s appetite for performance is unbounded, and we continue to need more and more performance to be able to work with growing datasets in a reasonable amount of time.”
The need to be able to train a model faster is critical for a number of reasons, including the fact that training is an iterative process. Data scientists often need to train and then retrain models in order to get the desired results.
“That ability to train faster makes all the difference in not only being able to work with larger networks, but being able to employ them faster and get them doing work for you in generating value,” Salvator said.
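Because training is iterative, faster individual runs compound across the many train-and-retrain cycles a project typically needs. Here is a minimal sketch of that workflow; the training callable and configurations are hypothetical, not from the article.

```python
# Illustrative sketch of an iterative train/evaluate/retrain workflow; the
# train_fn callable and configs are hypothetical, not from the article.
def run_experiments(train_fn, configs, target_metric):
    best = None
    for cfg in configs:                   # each iteration is a full training run,
        model, metric = train_fn(cfg)     # so faster time-to-train multiplies out
        if best is None or metric > best[1]:
            best = (model, metric)
        if metric >= target_metric:       # stop once the desired quality is reached
            break
    return best

# Toy usage: "training" just scores a learning-rate setting.
result = run_experiments(
    train_fn=lambda cfg: (f"model(lr={cfg})", 1.0 - abs(cfg - 0.01) * 50),
    configs=[0.1, 0.03, 0.01],
    target_metric=0.95,
)
print(result)  # ('model(lr=0.01)', 1.0)
```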