Nvidia, Intel tout MLPerf benchmarking performance

It’s MLPerf time again. Just two months after MLCommons released the results of its MLPerf AI and machine learning inference benchmarks, the consortium this week published results from its AI and ML training benchmark program, version 2.1.

Nvidia again had an impressive showing. In the AI inference results showcased back in September, the company’s H100 Tensor Core GPU performed well in its benchmarking debut. This time around, in the v2.1 training round, the H100 “delivered up to 6.7x more performance than previous-generation GPUs when they were first submitted on MLPerf training,” according to Dave Salvator, director of product marketing for accelerated cloud computing at Nvidia.

Salvator has previously noted that the H100’s superior performance can be traced to the Hopper architecture and its tightly integrated Transformer Engine.

Still, the previous-generation Nvidia A100 has not been kicked to the curb. Salvator said in a blog post that Nvidia’s A100 GPUs “swept all tests of training AI models in demanding scientific workloads run on supercomputers… For example, A100 GPUs trained AI models in the CosmoFlow test for astrophysics 9x faster than the best results two years ago in the first round of MLPerf HPC. In that same workload, the A100 also delivered up to a whopping 66x more throughput per chip than an alternative offering.”

Intel, for its part, said both its 4th Gen Xeon Scalable processor (code-named Sapphire Rapids) and its Habana Gaudi2 dedicated deep learning accelerator, which it launched back in May, performed well in the benchmarks.

The company said in a statement that “results show that 4th Gen Intel Xeon Scalable processors are expanding the reach of general-purpose CPUs for AI training, so customers can do more with Xeons that are already running their businesses.”

The statement added, “The DLRM results are great examples of where we were able to train the model in less than 30 minutes (26.73) with only four server nodes. Even for mid-sized and larger models, 4th Gen Xeon processors could train BERT and ResNet-50 models in less than 50 minutes (47.26) and less than 90 minutes (89.01), respectively. Developers can now train small DL models over a coffee break, mid-sized models over lunch and use those same servers connected to data storage systems to utilize other analytics techniques like classical machine learning in the afternoon. This allows the enterprise to conserve deep learning processors, like Gaudi2, for the largest, most demanding models.”

The strong results come as Intel faces increasing competition on the CPU front. The company this week launched its Max Series of products, which includes a Xeon high-bandwidth memory CPU and a data center GPU, and which analyst Jack Gold described as “critical to Intel’s market success.”