NVIDIA’s 800G Ethernet switch powers the AI-based Colossus supercomputer

Nov. 5, 2024
The cluster has maintained 95% data throughput and zero application latency degradation or packet loss due to flow collisions—performance NVIDIA says previously was available only via InfiniBand.

NVIDIA recently reached a major networking milestone. xAI’s Colossus supercomputer cluster, comprising 100,000 NVIDIA Hopper GPUs in Memphis, Tennessee, achieved this massive scale by using the NVIDIA Spectrum-X™ Ethernet networking platform.

The AI-centric company said the platform could “deliver superior performance to multi-tenant, hyperscale AI factories using standards-based Ethernet, for its Remote Direct Memory Access (RDMA) network.”

Colossus, the world’s largest AI supercomputer, is being used to train xAI’s Grok family of large language models. Chatbots are offered as a feature for X Premium subscribers. xAI is doubling the size of Colossus to a combined total of 200,000 NVIDIA Hopper GPUs.

What’s even more compelling about this is the timeline.

Systems of this size typically take many months to years to build, but xAI and NVIDIA built the supporting facility and supercomputer in just 122 days. It took 19 days from the time the first rack rolled onto the floor until training began.

“Colossus is the most powerful training system in the world,” said Elon Musk on X. “Nice work by xAI team, NVIDIA and our many partners/suppliers.”

Maintaining low latency was also a factor.

NVIDIA said that across all three tiers of the network fabric, the system has experienced zero application latency degradation or packet loss due to flow collisions. It has maintained 95% data throughput, enabled by Spectrum-X congestion control.

The Spectrum SN5600 supports speeds of up to 800 Gbits/sec and is based on the Spectrum-4 switch ASIC. xAI is pairing the Spectrum-X SN5600 switch with NVIDIA BlueField-3 SuperNICs.

Spectrum-X Ethernet networking for AI brings advanced features that deliver highly effective and scalable bandwidth with low latency and short tail latency, previously exclusive to InfiniBand. These features include adaptive routing with NVIDIA Direct Data Placement technology, congestion control, as well as enhanced AI fabric visibility and performance isolation — all key requirements for multi-tenant generative AI clouds and large enterprise environments.

“AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency,” said Gilad Shainer, senior vice president of networking at NVIDIA. “The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions.”

