The AI world is tired of TOPS. As AI adoption increases and workloads shift from training to inference, how many trillions of operations per second a GPU can perform still matters, but concerns like power efficiency and effective GPU utilization across a broader AI infrastructure are becoming more critical than ever.
AMD is making a move to increase GPU utilization and eliminate the management complexity and manual configuration of GPU resources that can hinder a customer’s return on investment in its AI computing infrastructure. The company has partnered with Rapt AI, which enables intelligent, automated management of AI workloads and infrastructure, a pairing that both companies said will improve AI inference and training performance on AMD Instinct GPUs, such as the MI300X, MI325X, and upcoming MI350 series.
The partnership comes at an important time for AMD as it continues to chip away at Nvidia's seemingly insurmountable GPU market dominance. Much of that dominance, however, was built up during the earliest stages of the AI era, and the growing importance of computing resource efficiency and power efficiency has opened a window of opportunity for Nvidia's competitors.
Rapt AI CEO Charlie Leeming explained the issues behind that opportunity during a briefing, saying, “Every corporation we deal with has made [big GPU] investments. They are panicking... GPU management is not as straightforward as other processes we've had… There can be tens of variables between the model and the infrastructure, something humans shouldn't be attempting to configure on their own, but they're spending a lot of time doing it. They're really smart people, so they get it right for a while, but it's screaming for the right solution. In terms of inefficiency, we're seeing many different numbers in the industry, most of them range between 10% and 30% true GPU utilization. There is money sitting there. There is spare GPU capability and capacity waiting to be harnessed.”
Rapt AI CTO Anil Ravindranath said the company’s solution is a software platform that collects insights from the AI model or the workload in question, and leverages machine learning to gain insights that can be used to virtualize, optimize, and automate the process of GPU configuration “without any human in the loop.”
That ultimately reduces the trial and error that data scientists and others go through in configuring and managing GPUs, Leeming added, leading to higher GPU utilization, lower infrastructure costs, and reduced power consumption.
Juergen Frick, director of product management for AMD’s Instinct GPUs, said the pairing of AMD and Rapt AI could make a difference for GPU customers looking to migrate their AI workloads from Nvidia GPUs to AMD GPUs. “Obviously, Nvidia, our competitor, has been around for a long time, and a lot of the AI workloads out there have been optimized for their GPUs. And as Anil has pointed out, the Rapt software provides an easy way to migrate, because it has ML-based capabilities to observe and then optimize the workloads, and that helps customers that are interested to have choice to move their workloads to our GPUs and get performant results.”
Frick added that AMD’s GPUs have “significantly more memory” than Nvidia's, “and if you have memory-bound applications where you want to maximize your utilization, at some point you might run out of memory. More memory means you can migrate and run more workloads on the same GPU, and that amplifies that TCO benefit that customers are looking for, and that helps them to be motivated to migrate even more of the workloads to our GPUs.”