How Will Data Centers Support New AI Workloads at Scale?

Sept. 17, 2024
The nature of AI workloads, especially genAI, presents a new challenge for data center operators

By Aniket Khosla / Spirent Communications

For anyone connected to the tech industry, the rise of artificial intelligence (AI) has become practically inescapable. As of March 2024, a Google News search on the topic returned more than 211 million results. By now, that figure is undoubtedly higher as AI continues to spawn new applications and work its way into seemingly every existing digital service. An unspoken assumption underlying the market optimism is that we can process this tsunami of new AI workloads. However, this is far from a given for the companies running the world’s largest data centers.

AI applications, especially generative AI (genAI), introduce ever-larger compute- and data-intensive workloads, new network traffic patterns, and new demands for extreme performance at massive scale. Hyperscalers are upgrading data center infrastructures as quickly as possible to achieve the terabit-scale throughput they need. But they can’t get there the traditional way, by simply adding more racks and fiber runs. They need new architectures and a new generation of high-performance, high-density networking solutions.

All parts of the ecosystem—network equipment providers, interface vendors, chipmakers, and other component suppliers—are accelerating product timetables to meet this demand. This is welcome news for data center operators, but the race to next-generation network technologies brings uncertainty, too. With so much customer demand, hyperscalers have little choice but to pick a strategy and dive in, even as significant questions remain unanswered.

Let’s examine why AI puts so much pressure on data center networks and how operators are responding, addressing these questions along the way.

AI changes everything

Data center networks are designed to process demanding application workloads at massive scales. However, the nature of AI workloads, especially genAI, presents a new challenge.

There are two basic types of AI workloads: training, in which an AI model ingests vast data sets to learn from, and inference, in which a trained model makes classifications or predictions on new data. Both can entail billions of parallel computations, requiring vast numbers of graphics processing units (GPUs) and other specialized accelerators, along with networks that deliver extremely high throughput, low latency, and close to zero packet loss. The number of processors a given AI cluster requires depends on the size and complexity of the applications it will support, but data centers are straining to keep pace with current AI models, which are growing 1,000 times larger every three years.
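To make the distinction concrete, here is a minimal PyTorch sketch of the two workload types; the tiny model and random data are toy placeholders, not a real genAI model.

```python
# Minimal PyTorch sketch contrasting the two AI workload types.
# The tiny model and random data are placeholders, not a real genAI model.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Training: ingest batches, compute gradients, update weights. At scale,
# this step is sharded across thousands of GPUs, which must exchange
# gradients over the network after every iteration.
for _ in range(3):
    inputs = torch.randn(64, 512)
    labels = torch.randint(0, 10, (64,))
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()

# Inference: apply the trained weights to new data; no gradients needed.
with torch.no_grad():
    prediction = model(torch.randn(1, 512)).argmax(dim=1)
```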

When building new large-scale AI clusters, it’s reasonable to assume they’ll need to connect tens of thousands of processors and support workloads with trillions of dense parameters. These networks will also need to achieve extremely low latency, as any delayed flow can impede overall application performance, leading to training errors, timeouts, and poor user experiences. (According to Meta, about a third of the elapsed time for AI workloads is currently spent waiting for the network.)
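A back-of-envelope calculation hints at why the network looms so large. The figures below (a 1-trillion-parameter model, 2-byte gradients, 400 Gb/s links, ring all-reduce) are illustrative assumptions, not measurements from any real cluster.

```python
# Back-of-envelope estimate of per-step gradient-synchronization time.
# All figures are illustrative assumptions for a hypothetical cluster.

PARAMS = 1e12            # 1 trillion model parameters (assumed)
BYTES_PER_PARAM = 2      # FP16/BF16 gradients (assumed)
LINK_GBPS = 400          # one 400G interface per GPU

model_bytes = PARAMS * BYTES_PER_PARAM

# Ring all-reduce pushes roughly 2 * (N-1)/N ~= 2x the model size
# through each GPU's link, almost independent of cluster size N.
bytes_on_wire = 2 * model_bytes
link_bytes_per_sec = LINK_GBPS * 1e9 / 8

sync_seconds = bytes_on_wire / link_bytes_per_sec
print(f"~{sync_seconds:.0f} s per full gradient sync at line rate")  # ~80 s
# Real systems shard models and overlap communication with compute,
# but any stall on these transfers delays the whole training step.
```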

These requirements are already pushing the limits of data center infrastructures, even as operators adopt dedicated back-end networks for AI clusters. For example, each high-end GPU card corresponds to a 400G interface in a typical AI cluster. Improving the capacity and density of each network device is the only viable way to build large-scale networks with thousands of processors. 
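A rough sizing sketch shows what “400G per GPU” implies for the fabric. The GPU count and switch radix below are assumptions for illustration, e.g., a hypothetical 51.2 Tb/s switch ASIC exposing 64 ports of 800G.

```python
# Rough fabric-sizing sketch for a hypothetical back-end AI cluster.
# GPU count and switch radix are illustrative assumptions.

GPUS = 16384                 # hypothetical cluster size
GBPS_PER_GPU = 400           # one 400G interface per GPU
SWITCH_PORTS_800G = 64       # e.g., a 51.2 Tb/s ASIC: 64 x 800G ports

aggregate_tbps = GPUS * GBPS_PER_GPU / 1000
print(f"Aggregate access bandwidth: {aggregate_tbps:.0f} Tb/s")  # ~6554 Tb/s

# In a non-blocking two-tier fabric, half of each leaf's ports face GPUs
# and half face spines, so each leaf serves 64 x 400G GPU attachments
# (two 400G GPUs per 800G port) and keeps 32 x 800G uplinks.
gpus_per_leaf = (SWITCH_PORTS_800G // 2) * 2
leaves = -(-GPUS // gpus_per_leaf)   # ceiling division
print(f"Leaf switches needed: {leaves}")  # 256
```

Higher-capacity, higher-density devices shrink those counts directly, which is why per-device improvement matters more than simply adding switches.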

Moving to 800G

Vendors across the data center ecosystem have worked furiously to meet this demand. A new generation of 800G interfaces and networking platforms is now reaching the market, with work on the next upgrade, 1.6 Tbps, already well underway. It’s here that we start running into questions. How quickly should we expect data center operators to upgrade their networks? Which technologies are they considering, and what factors are they weighing to make those calls?

To help answer those questions, we can turn to Dell’Oro Group. According to their 2023-2027 forecast, operators are investing at different rates in front- and back-end data center networks. For front-end networks that will be used for data ingestion, operators are using Ethernet, with one-third of all ports expected to be 800G by 2027.

In back-end networks, where the need for throughput and scale is most urgent, operators are upgrading as soon as they can get their hands on new equipment. Nearly all back-end ports will be 800G by 2027. Here, though, we see a mix of interfaces being adopted. What makes sense for a given data center operator will depend on multiple factors (a rough scoring sketch follows the list), including:

  • Workload type: What size and type of AI applications will the operator focus on? Will they support all AI workloads, or outsource training and concentrate on inferencing?
  • Performance demands: How important will it be to the customers they’re targeting to be able to process workloads with deterministic latencies? Will they require a lossless technology like InfiniBand?
  • Standardization: How does the operator prioritize using standards-based technologies versus proprietary ones?  
  • Roadmap: How comfortable is the operator with the technology roadmap they’re considering and the expected timeline for future upgrades?
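
One way to weigh these factors is a simple scoring matrix. The weights and scores below are placeholders, not guidance; the point is the method, which an operator would populate with their own priorities and assessments.

```python
# Illustrative weighted scoring of interconnect options against the
# factors above. Weights and scores are placeholder values, not vendor
# guidance; an operator would substitute their own assessments (1-5).

factors = {                       # weight: relative priority (sums to 1.0)
    "workload_fit": 0.30,
    "performance": 0.30,
    "standardization": 0.25,
    "roadmap": 0.15,
}

options = {                       # hypothetical scores per factor, 1-5
    "InfiniBand":      {"workload_fit": 5, "performance": 5,
                        "standardization": 2, "roadmap": 3},
    "Ethernet (RoCE)": {"workload_fit": 4, "performance": 4,
                        "standardization": 5, "roadmap": 5},
}

for name, scores in options.items():
    total = sum(factors[f] * scores[f] for f in factors)
    print(f"{name}: {total:.2f} / 5")
```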

More than ever, testing matters

Even as operators face mounting pressure to build larger AI clusters, many questions remain open, and there is no single path to answering them. Indeed, Amazon, Microsoft, and Google are all pursuing different AI infrastructure strategies. While the industry waits for clear answers, testing and validation become critical.

Many next-generation networking solutions will use components from multiple vendors, often using silicon manufactured while standards are still being developed. With so much capital investment on the line and so little room for error, operators and vendors must have confidence in the interoperability and performance of any new networking solution. They’ll need new testing capabilities to achieve it.
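To make “new testing capabilities” concrete: one long-standing benchmarking approach, RFC 2544, searches for the highest offered load a device forwards without loss. Below is a minimal sketch of that binary-search logic; the send_burst function is a hypothetical stub standing in for real traffic-generation hardware.

```python
# Sketch of an RFC 2544-style zero-loss throughput search. The traffic
# generator below is a hypothetical stub standing in for real test gear,
# which would transmit at `load_gbps` and report frames sent vs. received.

def send_burst(load_gbps: float) -> tuple[int, int]:
    """Hypothetical stub: pretend the device drops frames above 780G."""
    sent = 1_000_000
    received = sent if load_gbps <= 780 else int(sent * 0.999)
    return sent, received

def zero_loss_throughput(port_rate_gbps: float, tolerance: float = 1.0) -> float:
    low, high = 0.0, port_rate_gbps
    while high - low > tolerance:
        mid = (low + high) / 2
        sent, received = send_burst(mid)
        if received == sent:      # no loss: try a higher load
            low = mid
        else:                     # loss observed: back off
            high = mid
    return low

print(f"Zero-loss throughput: {zero_loss_throughput(800):.0f} Gb/s")
```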

Among the innovations arriving with 800G technologies, for example, new standards double per-lane I/O data rates and PAM4 signal modulation speeds. However, these next-generation technologies cannot be tested with tools designed for previous-generation networks. Legacy test equipment also can’t effectively emulate AI workloads at scale, which is a serious problem: vendors’ only alternative would be to build entire AI clusters in the lab at a cost of many millions of dollars.
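
For context, the lane arithmetic behind those figures is straightforward. The sketch below uses the common 800G configuration of eight 100 Gb/s electrical lanes; the 17/16 factor is the nominal line-coding plus FEC overhead used in 100G-per-lane PAM4 signaling.

```python
# Lane arithmetic behind 800G interfaces: eight 100G PAM4 lanes replace
# the eight 50G lanes of earlier 400G-era designs.

LANES = 8
GBPS_PER_LANE = 100          # doubled from 50G per lane
PAM4_BITS_PER_SYMBOL = 2     # PAM4 encodes 2 bits per symbol
OVERHEAD = 17 / 16           # nominal 256b/257b + RS(544,514) FEC overhead

port_gbps = LANES * GBPS_PER_LANE
baud_per_lane = GBPS_PER_LANE * OVERHEAD / PAM4_BITS_PER_SYMBOL

print(f"Port rate: {port_gbps} Gb/s")                    # 800 Gb/s
print(f"Symbol rate per lane: {baud_per_lane:.3f} GBd")  # 53.125 GBd
```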

Fortunately, the industry is responding yet again. Test and assurance vendors have developed next-generation testing tools to support the new generation of AI networks and applications. These new solutions are designed end-to-end for modern workloads, including genAI and the low-latency, terabit-scale Ethernet networks enabling them.

The scale and pace of change introduced by AI are unprecedented. But behind the scenes, a fantastic success story unfolds as the entire networking ecosystem comes together to meet the AI challenge.  
