Here's what's wrong with Nvidia's Blackwell GPU

Nvidia announces second-quarter earnings Wednesday afternoon in what could be its most momentous report in the new AI era.

Investors will be focused on whether server and other infrastructure providers can count on Nvidia’s Blackwell chips shipping soon and in sufficient volume, so that data centers at enterprises and research firms can continue training models with many billions of parameters for generative AI and other compute workloads.

The most detailed analysis of what has gone on with Nvidia’s Blackwell chip production comes from research firm SemiAnalysis, which asserts that Blackwell GPUs have been delayed by advanced packaging problems at TSMC, Nvidia’s chip manufacturer, as well as by Nvidia’s physical design for the Blackwell series.

Blackwell is the first product to rely on TSMC’s CoWoS-L packaging technology in high volume, a more complex process than the previous CoWoS-S. Nvidia’s roadmap of releasing a new GPU each year might be contributing to the delays.

“Nvidia’s Blackwell family is encountering major issues in reaching high volume production,” the analysts at SemiAnalysis wrote in early August, an assessment the firm reasserted on Monday. “This setback has impacted [Nvidia’s] production targets for Q3/Q4 2024 as well as the first half of next year. This affects Nvidia’s volume and revenue.”

SemiAnalysis went on to say that timelines will slip somewhat, but that volumes are affected more. As a result, Blackwell’s predecessor GPU, Hopper, will see its lifespan and shipments extended.

Also, Nvidia has had to scramble to create completely new systems not previously planned.

Nvidia has repeatedly issued a statement that doesn’t admit any problems with Blackwell. “Broad Blackwell sampling has started, and production is on track to ramp in 2H. Beyond that, we don’t comment on rumors,” a spokesperson said.

Nvidia CEO Jensen Huang had said in March that Blackwell would begin shipping in 2Q, but again, SemiAnalysis said the volume of shipments is the bigger concern.

Data center operators are concerned because they have been busy since March revising specs to get ready for the liquid cooling systems required with Blackwell. Most data centers today are air-cooled.

SemiAnalysis noted that the most technically advanced Blackwell chip is the GB200, which runs in a 72-GPU rack with a power density of about 125 kilowatts per rack, compared to a standard of up to 20 kW for data center racks.

“This is a compute and power density that has never been achieved before and given the system level complexity needed, the ramp has proven challenging,” SemiAnalysis said. “Numerous issues have cropped up related to power delivery, overheating, water cooling supply chain ramp, water leakage from quick disconnects and a variety of board complexity challenges…The core issue impacting shipments is directly related to Nvidia’s design of the Blackwell architecture. The supply of the original Blackwell package is limited due to packaging issues at TSMC and with Nvidia’s design.”
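To put the rack figures cited by SemiAnalysis in perspective, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration using the numbers quoted above (about 125 kW and 72 GPUs per GB200 rack, up to 20 kW for a standard rack). The variable names are made up for this example, and the per-GPU figure is a rack-level average that covers everything in the rack, not the draw of a GPU on its own.

```python
# Figures quoted in the article (approximate).
GB200_RACK_KW = 125.0      # power density of a GB200 rack
GPUS_PER_RACK = 72         # GPUs in that rack
STANDARD_RACK_KW = 20.0    # upper end of a standard data center rack

# How many times denser the GB200 rack is than a standard rack.
density_ratio = GB200_RACK_KW / STANDARD_RACK_KW

# Average rack power per GPU. This is a rack-level share that includes
# everything else in the rack, so it is NOT the GPU's own power draw.
kw_per_gpu_share = GB200_RACK_KW / GPUS_PER_RACK

print(f"GB200 rack vs. standard rack: {density_ratio:.2f}x")
print(f"Average rack power per GPU: {kw_per_gpu_share:.2f} kW")
```

On these numbers, a single GB200 rack draws more than six times the power of a standard rack at its upper limit, which helps explain why operators have been reworking cooling and power delivery.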

Blackwell is the first high-volume design using CoWoS-L technology. CoWoS stands for Chip-on-Wafer-on-Substrate, a 3D packaging approach that stacks chips and packages onto a substrate to reduce the space needed for chips and help reduce power consumption and therefore cost. TSMC is the only provider of a complete CoWoS approach. The L refers to a Local Silicon Interconnect, which is embedded to provide communication between compute and memory.

SemiAnalysis described problems with CoWoS-L related to warpage between silicon dies, bridges and substrate. The firm speculated about a rumor that bridges have needed to be redesigned at Nvidia, contributing to a multi-month delay.

Given the delays and insufficient supply, SemiAnalysis said Nvidia will introduce a Blackwell GPU called the B200A, using the same die as the B20, the China version of Blackwell. It will be packaged on CoWoS-S instead of CoWoS-L. Nvidia has not commented on this version or other matters related to production problems with Blackwell, and TSMC did not respond to a request for comment.

Nvidia has taken some criticism for a Blackwell design that requires 1000 watts of power per chip, compared to 700 watts for its predecessor, Hopper. Energy resources are in short supply in some cities, and some governments have even put data center development on hold over power concerns.

Nvidia’s general line of defense for Blackwell’s added power draw is that a better-performing architecture gets work done faster, and so can consume less total energy than an architecture with lower power needs (or lower Thermal Design Power), such as Hopper, that takes longer to complete the same amount of work. Nvidia’s B200 Tensor Core GPU, based on Blackwell, draws 1000 watts but delivers 9 petaFLOPS of FP4 AI compute on a single chip.
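The energy argument comes down to simple arithmetic: energy is power multiplied by time, so a chip that draws more power can still use less energy per job if it finishes fast enough. The Python sketch below illustrates this using the 700-watt and 1000-watt figures cited above. The 10-hour job and the 2x speedup are hypothetical values chosen for illustration, not numbers from Nvidia or SemiAnalysis, and the function names are invented for this example.

```python
def energy_kwh(power_watts: float, hours: float) -> float:
    """Energy in kWh used by a chip running at constant power for `hours`."""
    return power_watts * hours / 1000.0


def break_even_speedup(new_watts: float, old_watts: float) -> float:
    """Speedup at which the higher-power chip uses the same energy per job."""
    return new_watts / old_watts


HOPPER_WATTS = 700.0      # chip power cited in the article
BLACKWELL_WATTS = 1000.0  # chip power cited in the article

# HYPOTHETICAL workload: 10 hours on Hopper, with an ASSUMED 2x speedup
# on Blackwell. Neither number comes from the article.
hopper_hours = 10.0
assumed_speedup = 2.0
blackwell_hours = hopper_hours / assumed_speedup

hopper_energy = energy_kwh(HOPPER_WATTS, hopper_hours)
blackwell_energy = energy_kwh(BLACKWELL_WATTS, blackwell_hours)
threshold = break_even_speedup(BLACKWELL_WATTS, HOPPER_WATTS)

print(f"Hopper:    {hopper_energy:.1f} kWh")
print(f"Blackwell: {blackwell_energy:.1f} kWh")
print(f"Break-even speedup: {threshold:.2f}x")
```

Under these assumptions, the higher-power chip comes out ahead on energy whenever its speedup exceeds 1000/700, or roughly 1.43x. The comparison covers chip power only and leaves out cooling and other system overhead.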