AI

Nvidia Spectrum-X Ethernet bolsters Colossus prowess

Nvidia said Monday its Spectrum-X Ethernet networking platform is being used in the xAI Colossus supercomputer cluster that runs 100,000 Nvidia Hopper GPUs out of Memphis.

The Nvidia Ethernet platform is designed for multi-tenant, hyperscale AI factories that run on Ethernet, and xAI chose it for its Remote Direct Memory Access (RDMA) network.

Elon Musk praised Nvidia by name in a tweet on Sept. 2, noting that the xAI team had brought Colossus online from start to finish in 122 days. Nvidia said systems of this size typically take many months to years to build.

“Colossus is the most powerful training system in the world,” Musk tweeted on X. “Nice work by xAI team, Nvidia and our many partners/suppliers.” Colossus is in the process of being doubled to 200,000 Nvidia Hopper GPUs.

Training for the Grok model requires unprecedented network performance, Nvidia noted. The company said the system has seen zero application latency degradation or packet loss due to flow collisions across three tiers of the network fabric. Spectrum-X congestion control has maintained 95% data throughput.

“This level of performance cannot be achieved at scale with standard Ethernet, which creates thousands of flow collisions while delivering just 60% data throughput,” Nvidia noted.
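To put the two throughput figures side by side, the short Python sketch below works out the usable bandwidth each would leave on a single link. The 95% and 60% figures are the ones Nvidia cited. The 400Gb/s link rate and the helper name `effective_gbps` are assumptions chosen only for illustration, not details of the Colossus deployment.

```python
def effective_gbps(link_gbps: float, utilization: float) -> float:
    """Return usable bandwidth on a link for a fractional throughput figure."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return link_gbps * utilization


# Throughput figures cited by Nvidia in its announcement.
SPECTRUM_X_UTILIZATION = 0.95
STANDARD_ETHERNET_UTILIZATION = 0.60

# Hypothetical per-link rate, assumed here purely for illustration.
LINK_GBPS = 400.0

spectrum_x = effective_gbps(LINK_GBPS, SPECTRUM_X_UTILIZATION)
standard = effective_gbps(LINK_GBPS, STANDARD_ETHERNET_UTILIZATION)

# The ratio does not depend on the assumed link rate.
ratio = spectrum_x / standard

print(f"Spectrum-X:        {spectrum_x:.0f} Gb/s usable")
print(f"Standard Ethernet: {standard:.0f} Gb/s usable")
print(f"Relative gain:     {ratio:.2f}x")
```

Whatever link rate is assumed, the ratio between the two cited figures stays the same, at roughly 1.58 times the usable bandwidth.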

Spectrum-X uses the Spectrum SN5600 Ethernet switch, which operates at up to 800Gb/s and is based on the Spectrum-4 switch ASIC. Nvidia noted xAI paired the SN5600 with Nvidia BlueField SuperNICs to improve performance.