NVIDIA’s 800G Ethernet switch powers the Colossus AI supercomputer
NVIDIA recently announced a major networking milestone: xAI’s Colossus supercomputer cluster, comprising 100,000 NVIDIA Hopper GPUs in Memphis, Tennessee, reached this massive scale using the NVIDIA Spectrum-X™ Ethernet networking platform.
The AI-centric company said the platform could “deliver superior performance to multi-tenant, hyperscale AI factories using standards-based Ethernet, for its Remote Direct Memory Access (RDMA) network.”
Colossus, the world’s largest AI supercomputer, is being used to train xAI’s Grok family of large language models, whose chatbots are offered as a feature for X Premium subscribers. xAI is doubling the size of Colossus to a combined total of 200,000 NVIDIA Hopper GPUs.
What’s even more compelling about this is the timeline.
Instead of the many months to years typically required for systems of this size, xAI and NVIDIA built the supporting facility and supercomputer in just 122 days. Training began just 19 days after the first rack rolled onto the floor.
“Colossus is the most powerful training system in the world,” said Elon Musk on X. “Nice work by xAI team, NVIDIA and our many partners/suppliers.”
Maintaining low latency was also a factor.
NVIDIA said that across all three tiers of the network fabric, the system has experienced zero application latency degradation or packet loss due to flow collisions, while maintaining 95% data throughput enabled by Spectrum-X congestion control.
xAI is pairing the Spectrum-X SN5600 switch, which is based on the Spectrum-4 switch ASIC and supports speeds of up to 800 Gbits/sec, with NVIDIA BlueField-3 SuperNICs.
Spectrum-X Ethernet networking for AI brings advanced features that deliver highly effective and scalable bandwidth with low latency and short tail latency, previously exclusive to InfiniBand. These features include adaptive routing with NVIDIA Direct Data Placement technology, congestion control, as well as enhanced AI fabric visibility and performance isolation — all key requirements for multi-tenant generative AI clouds and large enterprise environments.
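To see why adaptive routing matters for avoiding the flow collisions mentioned above, here is a minimal toy sketch (not NVIDIA’s implementation, and the link/flow counts are hypothetical): static ECMP-style hashing can map several large flows onto the same link and create a hotspot, while adaptive routing places each flow on the currently least-loaded link.

```python
import random

LINKS, FLOWS = 8, 16  # hypothetical: 8 equal-cost links, 16 equal-size flows
random.seed(0)

# Static ECMP-style placement: each flow is pinned to a pseudo-randomly
# hashed link, so collisions (several flows on one link) are possible.
static_load = [0] * LINKS
for _ in range(FLOWS):
    static_load[random.randrange(LINKS)] += 1

# Adaptive placement: each flow goes to the currently least-loaded link,
# so load stays balanced across the fabric.
adaptive_load = [0] * LINKS
for _ in range(FLOWS):
    best = min(range(LINKS), key=lambda i: adaptive_load[i])
    adaptive_load[best] += 1

print("static max link load:  ", max(static_load))
print("adaptive max link load:", max(adaptive_load))  # 16 flows / 8 links = 2
```

With adaptive placement the busiest link carries exactly 2 of the 16 flows (the ideal balance), whereas static hashing typically leaves one link carrying 3 or more while others sit partly idle, which is how flow collisions degrade tail latency.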
“AI is becoming mission-critical and requires increased performance, security, scalability and cost-efficiency,” said Gilad Shainer, senior vice president of networking at NVIDIA. “The NVIDIA Spectrum-X Ethernet networking platform is designed to provide innovators such as xAI with faster processing, analysis and execution of AI workloads, and in turn accelerates the development, deployment and time to market of AI solutions.”
For related articles, visit the Business Topic Center.
For more information on high-speed transmission systems and suppliers, visit the Lightwave Buyer’s Guide.
To stay abreast of fiber network deployments, subscribe to Lightwave’s Service Providers and Datacom/Data Center newsletters.