The latest benchmark tests of chip speed in training neural networks were published Tuesday by MLCommons, an industry consortium. As in previous years, Nvidia scored top marks across the board in the MLPerf tests.
With competitors Google, Graphcore, and Advanced Micro Devices not submitting entries this time around, Nvidia’s dominance of all eight tests was complete.
However, Intel’s Habana business brought meaningful competition with its Gaudi2 chip, and the company has pledged to beat Nvidia’s high-end H100 GPU by this fall.
The benchmark test, Training version 3.0, reports the number of minutes it takes to tune the neural “weights,” or parameters, until the computer program achieves a required minimum accuracy on a given task, a process called “training” a neural network.
Along with the test of training on server computers, MLCommons released a companion benchmark, MLPerf Tiny version 1.1, which measures performance when making predictions on very low-power devices.
The main Training 3.0 test, which comprises eight separate tasks, records the time it takes to tune a neural network by refining its parameters over repeated passes. Training is one half of neural network performance, the other half being what is called inference, where the finished neural network makes predictions as it receives new data. Inference is covered in separate releases from MLCommons.
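To make the idea of “time to train” concrete, here is a minimal sketch of the measurement, not MLCommons’ actual harness. It tunes a single made-up weight until a made-up quality score reaches a target, then reports the elapsed time. The function name, the `ideal` value, and the quality formula are all assumptions for illustration.

```python
import time

def train_to_target(target_accuracy, learning_rate=0.1, max_steps=10_000):
    """Toy stand-in for a benchmark run: tune one weight until a quality target is met."""
    weight = 0.0   # the single "parameter" being tuned
    ideal = 3.0    # hypothetical best value for the weight (assumption)
    start = time.perf_counter()
    for step in range(1, max_steps + 1):
        # One gradient-descent step on the squared error (weight - ideal)**2.
        weight -= learning_rate * 2 * (weight - ideal)
        # Hypothetical quality score in (0, 1]; 1.0 would be a perfect fit.
        accuracy = 1.0 / (1.0 + (weight - ideal) ** 2)
        if accuracy >= target_accuracy:
            # The reported result is the time taken to reach the target.
            elapsed = time.perf_counter() - start
            return step, accuracy, elapsed
    raise RuntimeError("target accuracy not reached within max_steps")

steps, accuracy, seconds = train_to_target(target_accuracy=0.999)
print(f"reached {accuracy:.4f} after {steps} steps in {seconds:.6f} s")
```

The point of the sketch is the stopping rule: the clock stops when the quality target is met, so a faster system is simply one that reaches the same target sooner.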
Nvidia took first place in all eight tests, with the shortest time to train. Two new tasks were added. One tests the GPT-3 large language model (LLM) made by OpenAI. Generative AI built on LLMs has become a craze due to the popularity of OpenAI’s ChatGPT program, which is based on the same LLM. In the GPT-3 task, Nvidia took first place with a system assembled with the help of its partner CoreWeave, which rents cloud-based instances of Nvidia GPUs.
The Nvidia-CoreWeave system took just under eleven minutes to train using a dataset called Colossal Cleaned Common Crawl. The system used 896 Intel Xeon processors and 3,584 Nvidia H100 GPUs, and it carried out the tasks using Nvidia’s NeMo framework for generative AI.
The test covers only a portion of the full GPT-3 training, using the “large” version of GPT-3, with 175 billion parameters. MLCommons limits the test to 0.4% of the full GPT-3 training to keep the runtime reasonable for submitters.
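As a rough back-of-the-envelope illustration of what that 0.4% slice implies, the arithmetic below scales the roughly eleven-minute result up to a full training run. The linear scaling is an assumption made here for illustration, not a figure from MLCommons or Nvidia, and eleven minutes is used as an upper bound for “just under eleven minutes.”

```python
# Figures quoted in the article.
fraction_of_full_training = 0.004   # the test covers 0.4% of full GPT-3 training
benchmark_minutes = 11.0            # "just under eleven minutes", used as an upper bound

# ASSUMPTION (not from MLCommons): time grows linearly with the amount of training.
full_training_minutes = benchmark_minutes / fraction_of_full_training
full_training_hours = full_training_minutes / 60

print(f"{full_training_minutes:.0f} minutes, about {full_training_hours:.1f} hours")
```

Under that assumption, the full run on the same 3,584-GPU system would take on the order of two days, which suggests why the benchmark uses only a slice.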
Also new this time around was an expanded version of the test of recommender engines, which are popular for things like product search and social media recommendations. MLCommons replaced the one-terabyte training dataset that had been used with a four-terabyte dataset called Criteo 4TB multi-hot. MLCommons decided to upgrade because the smaller dataset was becoming outdated.
“Production recommendation models are getting bigger and bigger — in terms of size, computation, and memory operations,” the organization noted.
The only AI chip vendor to compete with Nvidia was Intel’s Habana, which submitted five entries with its Gaudi2 accelerator chip, plus one entry submitted by computer maker SuperMicro using Habana’s chip. Collectively, these systems were entered in four of the eight tasks. In every case, the Habana systems came in well below the best Nvidia systems. For example, in the test to train Google’s BERT neural network on Wikipedia data to answer questions, Habana came in fifth place, taking two minutes to complete the training compared to eight seconds for a 3,072-GPU Nvidia-CoreWeave machine.
However, Intel’s Jordan Plawner, head of AI products, noted in an interview with ZDNET that for comparable systems, the time difference between Habana and Nvidia is close enough to be negligible for many companies.
For example, in the BERT Wikipedia test, an eight-chip Habana system, with two companion Intel Xeon processors, took just over 14 minutes to train. This result was better than two dozen other submissions, many with double the number of Nvidia A100 GPUs.
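One way to see why raw time to train flatters very large systems is to multiply each result by the number of accelerators used. The sketch below does that with the BERT figures quoted above. “Accelerator-minutes” is an illustrative measure invented here, not an MLPerf metric; it ignores price and power, and because large clusters rarely scale linearly it tends to favor small systems.

```python
def accelerator_minutes(num_accelerators, minutes_to_train):
    """Total accelerator time consumed by one run (illustrative measure, not an MLPerf metric)."""
    return num_accelerators * minutes_to_train

# BERT results quoted in the article.
gaudi2_small = accelerator_minutes(8, 14.0)         # "just over 14 minutes", so a lower bound
nvidia_large = accelerator_minutes(3072, 8 / 60)    # eight seconds, expressed in minutes

print(f"8-chip Gaudi2 system:    {gaudi2_small:.1f} accelerator-minutes")
print(f"3,072-GPU Nvidia system: {nvidia_large:.1f} accelerator-minutes")
```

By this crude measure the small system consumes less total accelerator time, even though the large one finishes far sooner on the clock. That is the kind of like-for-like view Plawner is arguing for.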
“We invite everyone to look at the 8-device machines,” Plawner said. “We have a huge price advantage with Gaudi2, where we are priced similarly to a similarly spec’d A100, giving you more training per dollar.”
Plawner noted that not only is Gaudi2 able to beat some similar Nvidia A100 configurations, but it does so with a slight handicap. Nvidia submitted its MLPerf entries using a data format called “FP-8,” for floating-point, 8-bit, while Habana used an alternative format called BF-16, for B-float, 16-bit. The higher arithmetic precision of BF-16 somewhat slows training in terms of time to complete.
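The precision gap between the two formats can be illustrated with a small sketch. BF16 stores 7 mantissa bits, while the two common FP8 variants, E4M3 and E5M2, store 3 and 2. The article does not say which FP8 variant Nvidia used, so both are shown. The helper below is a simplification written for this illustration: it rounds only the significand and ignores each format’s exponent range, overflow, and subnormals.

```python
import math

def round_to_mantissa_bits(x, mantissa_bits):
    """Round x to the nearest value with the given number of stored mantissa bits.

    Simplification: only the significand is rounded; the exponent range,
    overflow, and subnormals of the real formats are ignored.
    """
    if x == 0.0:
        return 0.0
    fraction, exponent = math.frexp(x)    # x == fraction * 2**exponent, 0.5 <= |fraction| < 1
    scale = 2 ** (mantissa_bits + 1)      # +1 accounts for the implicit leading bit
    return math.ldexp(round(fraction * scale) / scale, exponent)

formats = {"BF16 (7 mantissa bits)": 7, "FP8 E4M3 (3 bits)": 3, "FP8 E5M2 (2 bits)": 2}
value = 1.1
for name, bits in formats.items():
    approx = round_to_mantissa_bits(value, bits)
    print(f"{name}: {approx} (error {abs(approx - value):.4f})")
```

For the value 1.1, the sketch yields 1.1015625 for BF16, 1.125 for E4M3, and 1.0 for E5M2. The 8-bit formats are coarser, but each number takes half the storage and bandwidth of a 16-bit one, which is where the speed advantage comes from.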
Later this year, Plawner said, Gaudi2 will use FP-8, which he says will allow for greater performance. It will even allow Habana to beat Nvidia’s newer H100 system on performance, he predicted.
“The industry needs an alternative” to Nvidia, Plawner said. Customers, though traditionally reluctant to switch from the trusted brand, are now being pushed by a sudden shortage in the supply of Nvidia’s parts. CEO Jensen Huang said last month that Nvidia was struggling to keep up with demand for H100 GPUs.
“Now they’re motivated,” Plawner told ZDNET of customers frustrated by Nvidia’s lack of supply.
“That’s what we hear from them, that they have things they want to do tomorrow, the CEO is asking them, and they can’t do it because they can’t get GPUs, period.”
“Believe me, they make way more than they spend [on generative AI]. If they can put 50 people on a Gaudi project to literally get the same time to train, if the answer is, I don’t have GPUs, and I’m waiting, or, I have Gaudi2, and I can launch my new service tomorrow, they’ll go buy Gaudi2 to launch their new service.”
Intel is the world’s second-largest chip manufacturer, or “fab,” after Taiwan Semiconductor, Plawner noted, which gives the company the ability to control its own supply.
Although Nvidia builds systems with several thousand GPUs to achieve the highest score, Habana is able to do the same, Plawner said. “Intel is building a cluster of several thousand Gaudi2s internally,” he said, with the implied suggestion that such a machine could be an entry in a future MLPerf round.
Tuesday’s results mark the second consecutive round of the training test in which no alternative chipmaker has posted a top score against Nvidia.
A year ago, Google shared the top score with Nvidia thanks to its TPU chip. But Google didn’t show up in November of last year, and was absent again this time. Startup Graphcore has also dropped out of the running, focusing on its business rather than showing off test results.
In a phone conversation, MLCommons director David Kanter, when asked by ZDNET about the no-show by competitors, remarked, “The more parties that participate, the better.”
Google did not respond to a request from ZDNET at press time asking why the company did not participate this time around. Advanced Micro Devices, which competes with Nvidia on GPUs, also did not respond to a request for comment.
AMD did, however, have its CPU chips represented in competing systems. Yet in a surprising turn of events, every winning Nvidia system used Intel Xeon processors as its host processor. In the previous year’s results, all eight winning entries, whether from Nvidia or Google, were systems using AMD’s EPYC server processors. The change shows that Intel has managed to make up some lost ground in server processors with the release of Sapphire Rapids this year.
Despite the absence of Google and Graphcore, the test continues to attract new system makers. This time around, first-time submitters included CoreWeave, as well as IEI and Quanta Cloud Technology.