Nvidia’s upcoming Blackwell platform delivered up to four times the inference performance of the older Hopper architecture in the latest MLPerf industry benchmarks, the company said Wednesday, just hours before Nvidia’s second-quarter earnings call.
While that Blackwell performance should impress customers hoping to expand their use of generative AI in enterprise settings and research work, Nvidia has yet to release the Blackwell platform widely.
The company has started sampling Blackwell with production on track to ramp sometime in the second half of the year. Reports have widely indicated it has been delayed, however, including one from SemiAnalysis noting the entire Blackwell family of chips “is encountering major issues in reaching high volume production.” The analyst firm said design problems by Nvidia and packaging at TSMC were both to blame.
RELATED: Here’s what’s wrong with Nvidia’s Blackwell GPU
Some analysts hope CEO Jensen Huang will address the concerns about Blackwell on the earnings call. Company shares have soared 126% in 2024, though they slipped 1.5% at Wednesday's market open.
The MLPerf Inference v4.1 benchmark tested Blackwell performance on Llama 2 70B, marking the first time Nvidia has submitted results for Blackwell, which was introduced in March. Dave Salvator, director of accelerated computing products at Nvidia, said the Blackwell submission reached 4x the performance of Hopper “thanks to its use of a second-generation Transformer Engine and FP4 Tensor Cores.”
A slide provided to reporters showed “a giant leap” with Blackwell: 4x server performance per GPU, handling 10,756 tokens per second.

He added: “To meet real-time latency requirements for serving today’s LLMs and do so for as many users as possible, multi-GPU compute is a must.” The comment could have been a response to other inference competitors, including Intel, which has argued that CPUs can handle inference for many workloads involving smaller LLMs, some running at the edge on emerging AI PCs.
Nvidia is also not leaving out future opportunities to expand its use of generative AI at the edge, where sensor data such as images and videos can be transformed into real-time, actionable insights with strong contextual awareness. Salvator said Nvidia Jetson is capable of running any kind of transformer model locally, including LLMs, vision transformers and Stable Diffusion.
The Jetson AGX Orin system-on-modules achieved a 6x throughput improvement and a 2.5x latency improvement over the previous benchmark round on a GPT-J LLM workload. GPT-J is a 6-billion-parameter model that can help bring generative AI to the edge, functioning as a general-purpose model for interfacing with human language.
Nvidia also released a technical blog on Wednesday noting it has lowered latency in inference performance.
Blackwell power usage defended by Nvidia
Salvator also defended Blackwell against critics worried about its 1,000-watt power draw, which puts pressure on data centers and utilities to adapt with liquid cooling and to find alternative energy sources, even small nuclear power installations. By comparison, Hopper requires 700 watts (measured by thermal design power, or TDP).
“Hopper is a very efficient architecture,” Salvator told Fierce Electronics. “It’s also important to consider the amount of energy used to get work done. A better-performance architecture [including Blackwell] will get work done faster, thereby consuming less energy than another architecture that may list a lower TDP but takes longer to complete the same work.”
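Salvator's energy-to-solution argument can be illustrated with a toy calculation (the figures below are hypothetical, chosen only to show the arithmetic, not measured numbers for either architecture):

```python
# Toy energy-to-solution comparison (hypothetical numbers, not measurements).
# Energy consumed (watt-hours) = power draw (watts) * runtime (hours).

def energy_wh(power_watts: float, runtime_hours: float) -> float:
    """Energy consumed to finish a fixed job at a given power draw."""
    return power_watts * runtime_hours

# Suppose a 1000 W chip finishes a job in 1 hour that a 700 W chip
# needs 4 hours to complete.
slower_chip = energy_wh(700, 4.0)   # 2800 Wh
faster_chip = energy_wh(1000, 1.0)  # 1000 Wh

# The chip with the higher TDP uses less total energy for the same work.
print(slower_chip, faster_chip)
```

The point is that a higher instantaneous power rating can still mean lower energy per unit of work if the job finishes enough faster.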
Huang has made a similar point in the past when defending the energy needed for Blackwell chips.
The Nvidia B200 Tensor Core GPU, based on the Blackwell architecture, delivers up to 9 PFLOPS of FP4 AI compute within its 1,000-watt power envelope, “an amazing amount of performance in a single GPU,” Salvator said.
He explained that chip designers look for optimal points at which to set chip power limits based on what’s known as a “voltage-frequency curve,” where different voltages yield different maximum clock frequencies for efficient compute. “Our GPUs have different options for how data centers want to deploy them called MaxP and MaxQ, where MaxP represents the maximum power setting to get maximum performance and MaxQ balances both performance and power to get an optimal blend of both.”
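The idea of picking an operating point on a voltage-frequency curve can be sketched as follows. The curve values, the power model, and the selection logic here are all illustrative assumptions; only the MaxP/MaxQ names come from Nvidia's description:

```python
# Hypothetical sketch of choosing an operating point on a voltage-frequency
# curve. The (voltage, frequency) pairs and the power model are made up.

# Illustrative V/F curve: (core voltage in volts, max stable clock in MHz).
vf_curve = [(0.70, 1200), (0.80, 1500), (0.90, 1700), (1.00, 1800)]

def power_estimate(voltage: float, freq_mhz: float, c: float = 0.5) -> float:
    """Dynamic power scales roughly with C * V^2 * f (classic CMOS model)."""
    return c * voltage**2 * freq_mhz

# MaxP-style choice: take the highest clock, regardless of power.
max_p = max(vf_curve, key=lambda vf: vf[1])

# MaxQ-style choice: take the point with the best performance per watt.
max_q = max(vf_curve, key=lambda vf: vf[1] / power_estimate(*vf))

print("MaxP point:", max_p)  # highest frequency on the curve
print("MaxQ point:", max_q)  # best frequency-per-watt on the curve
```

With a V²·f power model, performance per watt improves as voltage drops, which is why the MaxQ-style pick lands lower on the curve than the MaxP-style pick.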
In addition, all Nvidia GPUs have dynamic frequencies: if the GPU is not being used, its clock speed is lowered to reduce power consumption. The GH200 also features Power Steering, which dynamically balances the power budget between the CPU and GPU, steering power to whichever unit is under the greater load.
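The two behaviors described above can be sketched in a few lines. This is a toy model under assumed numbers, not Nvidia's implementation: the clock range, the linear scaling, and the proportional power split are all illustrative:

```python
# Toy sketch of (1) lowering clocks when idle and (2) steering a shared
# power budget toward the busier unit. All names and numbers are made up.

def dynamic_clock(utilization: float, idle_mhz: int = 300, max_mhz: int = 1800) -> int:
    """Scale the clock with utilization; an idle GPU drops to a low clock."""
    return int(idle_mhz + (max_mhz - idle_mhz) * utilization)

def steer_power(budget_w: float, cpu_load: float, gpu_load: float) -> tuple:
    """Split a shared power budget in proportion to each unit's load."""
    total = cpu_load + gpu_load
    if total == 0:
        return (budget_w / 2, budget_w / 2)
    return (budget_w * cpu_load / total, budget_w * gpu_load / total)

print(dynamic_clock(0.0))            # idle GPU falls back to 300 MHz
print(steer_power(1000, 0.2, 0.8))   # budget shifts toward the loaded GPU
```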