NVIDIA’s 800G Ethernet switch powers the AI-based Colossus supercomputer

Nov. 5, 2024
The cluster has maintained 95% data throughput and zero application latency degradation or packet loss due to flow collisions—performance NVIDIA says previously was available only via InfiniBand.

NVIDIA recently reached a major networking milestone. xAI’s Colossus supercomputer cluster, comprising 100,000 NVIDIA Hopper GPUs in Memphis, Tennessee, achieved this massive scale by using the NVIDIA Spectrum-X™ Ethernet networking platform.

The AI-centric company said the platform could “deliver superior performance to multi-tenant, hyperscale AI factories using standards-based Ethernet, for its Remote Direct Memory Access (RDMA) network.”

Colossus, the world’s largest AI supercomputer, is being used to train xAI’s Grok family of large language models. Chatbots are offered as a feature for X Premium subscribers. xAI is doubling the size of Colossus to a combined total of 200,000 NVIDIA Hopper GPUs.

Even more compelling is the timeline.

Systems of this size typically take many months to years to build, but xAI and NVIDIA built the supporting facility and supercomputer in just 122 days. It took 19 days from the time the first rack rolled onto the floor until training began.

“Colossus is the most powerful training system in the world,” said Elon Musk on X. “Nice work by xAI team, NVIDIA and our many partners/suppliers.”

Maintaining low latency was also a factor.

NVIDIA said that across all three tiers of the network fabric, the system has experienced zero application latency degradation or packet loss due to flow collisions. It has maintained 95% data throughput, enabled by Spectrum-X congestion control.

The Spectrum SN5600 supports speeds of up to 800 Gbits/sec and is based on the Spectrum-4 switch ASIC. xAI is pairing the Spectrum-X SN5600 switch with NVIDIA BlueField-3 SuperNICs.

Spectrum-X Ethernet networking for AI brings advanced features that deliver highly effective and scalable bandwidth with low latency and short tail latency, previously exclusive to InfiniBand. These features include adaptive routing with NVIDIA Direct Data Placement technology, congestion control, as well as enhanced AI fabric visibility and performance isolation — all key requirements for multi-tenant generative AI clouds and large enterprise environments.

“AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency,” said Gilad Shainer, senior vice president of networking at NVIDIA. “The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions.”
