Nvidia beefs up AI work at GTC with new GPUs for inference

Nvidia is obviously expecting to cash in on the AI craze ignited by ChatGPT with inference-capable GPUs along with software and services the company announced at its GTC conference on Tuesday.

Taking note of how ChatGPT went mainstream in a few short weeks late last year and reached 100 million users, the fastest growth of any application, CEO Jensen Huang said, “We are at the iPhone moment of AI.”

“There are a lot of amazing hyperscalers and startups hoping to become the next OpenAI,” Ian Buck, vice president of Nvidia’s accelerated computing unit, told Fierce in an interview. “That’s why we want to support the opportunity for inference, not just training. Inference is super exciting.”

The company announced two new chips for inferencing work: the L4 GPU for AI video, now in private preview on Google Cloud and available through 30 computer makers, and the H100 NVL with dual-GPU NVLink for large language models, expected in the second half of 2023. GTC announcements throughout the week will reach beyond chips alone, however, to new partnerships and software libraries.

The new GPUs are part of a set of four inferencing configurations from Nvidia that includes the L40 for image generation, now used as the engine of Nvidia’s Omniverse. The fourth inferencing-focused platform is Nvidia Grace Hopper for recommendation models, vector databases and graph neural networks. Grace Hopper, previously revealed as a superchip that combines a CPU and GPU, is now sampling, with full production in the second half of 2023, a slip from earlier expectations of production in the first half of 2023.

Nvidia's four inference platforms are, from left: L4 for AI video (new), L40 for image generation, H100 NVL for LLMs (new) and Grace Hopper for recommendation models. (Nvidia)

A few details of the new silicon were revealed. The H100 NVL is based on the H100 GPU, now shipping, which includes a Transformer Engine for processing models such as the GPT model behind OpenAI’s ChatGPT. OpenAI used the HGX A100 for GPT-3 processing, but Nvidia said a standard server with four pairs of H100s with dual-GPU NVLink runs up to 10 times faster.
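For a sense of what the Transformer Engine means in practice, the sketch below runs a transformer layer with FP8 math using Nvidia's open-source transformer_engine library for PyTorch. It is a minimal illustration only; the layer sizes and recipe settings are assumptions for the example, not figures from the announcement.

```python
# Minimal sketch: FP8 execution of a transformer layer via Nvidia's
# Transformer Engine (transformer_engine.pytorch). Requires a Hopper- or
# Ada-class GPU for FP8. Layer sizes below are illustrative placeholders.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# FP8 recipe using delayed scaling with the E4M3 format
fp8_recipe = recipe.DelayedScaling(margin=0, fp8_format=recipe.Format.E4M3)

layer = te.TransformerLayer(
    hidden_size=4096,
    ffn_hidden_size=16384,
    num_attention_heads=32,
).cuda()

# Input shape: (sequence length, batch, hidden size)
x = torch.randn(128, 4, 4096, device="cuda")

# Matrix multiplies inside the layer run in FP8 within this context
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)
```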

The latest version of Nvidia’s DGX supercomputer features eight H100 GPUs linked together with NVLink interconnects and Quantum InfiniBand and Spectrum Ethernet networking, and is now in full production. It provides 32 petaflops of compute. Nvidia named OpenAI as a customer for H100s on its Azure supercomputer, along with Meta, Stability.ai, Twelve Labs and Anlatan.

Huang claimed the H100 can reduce LLM processing costs by an order of magnitude. The H100 NVL can cut total cost of ownership dramatically, Buck added. Two H100 NVL GPUs working together over NVLink can reduce latency for inference and can connect via PCIe, which is ubiquitous in the server industry, he said.
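The article does not include code, but as a rough picture of the kind of two-GPU, model-parallel inference the H100 NVL pairing targets, here is a minimal sketch assuming the Hugging Face transformers and accelerate libraries. The model name and memory figures are placeholders, and the NVLink-versus-PCIe traffic is handled transparently by the driver and runtime.

```python
# Illustrative sketch: sharding a large language model across two GPUs
# for inference. Model name and max_memory values are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/gpt-neox-20b"  # placeholder large model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto",                  # shard layers across both visible GPUs
    max_memory={0: "75GiB", 1: "75GiB"},
)

inputs = tokenizer("The H100 NVL is aimed at", return_tensors="pt").to("cuda:0")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```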

Supply questions addressed

While some in the industry are worried about an adequate supply of GPUs to support mushrooming AI work, Buck said there are “no concerns” about silicon supply from Nvidia's major fab, TSMC. “We’re full tilt on Hopper,” he said, referring to the architecture behind the H100. “Meanwhile, we will continue to roll through the A100 to market, which is already deployed and in strong demand that we will continue to see through this year. Certainly demand is high as GPUs get rolled out.”

Sometimes it is hard for customers to get access to GPUs, Buck added, especially customers who are new to the AI field and looking for GPUs on demand while reserve contracts from other customers are fulfilled first.

With Hopper, any delays are just about getting the chip to market. “There’s only so many engineers and it’s a massive buildup with networking and racks and cabling, then to qualify it and make sure it works,” Buck added. “Hyperscale is a challenge. Hopper is just only now getting to market. The A100 saw very strong demand. As they roll into data centers, they get gobbled up and we’re building them as fast as we can.”  

While Buck dismissed any problem getting GPUs from TSMC, analyst Dylan Patel at SemiAnalysis told Fierce there is currently a “huge supply shortage of Nvidia GPUs and networking equipment from Broadcom and Nvidia due to a massive spike in demand.” Both companies are ramping up quickly but there is still a big gap, with one-year lead times that vary across products, he said.

The biggest bottlenecks in supply are across CoWoS, high bandwidth memory and networking products, all with various lead times, Patel said. CoWoS is TSMC’s shorthand for Chip-on-Wafer-on-Substrate, an advanced packaging technology used to assemble high-performance computing chips.

RELATED: Update: ‘Huge’ GPU supply shortage due to AI needs, analyst Dylan Patel says

Initial reactions to GPU news

Patel said the H100 NVL “will be perfect for inference of models like Chat GPT 3.5 turbo.” The use of two H100 GPUs with a high-speed NVLink will “make it look like a single much faster GPU.”

Regarding the L4, Patel said it is interesting because it will initially ship to Google, which makes its own TPUs as a competitor to GPUs. “Yet, Google still turned to Nvidia for a chip that runs inference for video and image networks,” Patel said.