A Look at Cabling Best Practices for AI Data Centers
September 20, 2024
By Andrew Jimenez, Senior Director – Technical Sales, Wesco Data Center Solutions
The dramatic growth of AI is driving unprecedented changes in data center infrastructure. Much attention has been focused on greater compute density and the need for innovative cooling systems to manage heat effectively. Cabling infrastructure also requires a new approach.
AI servers use graphics processing units (GPUs) to deliver the raw computational power needed for AI workloads. GPUs are designed for parallel processing, enabling them to perform many calculations simultaneously. Even so, a single AI server is not enough to train an AI model or run certain AI workloads. Multiple AI servers must be harnessed together in a high-performance computing (HPC) cluster.
This is a fundamentally different approach from that of the typical data center built on traditional, CPU-based servers. AI servers require much greater cabling density (up to four to five times more fiber connections) in a design that maximizes performance and minimizes latency.
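To make that density gap concrete, here is a minimal back-of-the-envelope sketch in Python. All server and port counts are hypothetical assumptions chosen for illustration, not figures from any specific design:

```python
def fibers_per_rack(servers: int, ports_per_server: int, strands_per_port: int) -> int:
    """Total fiber strands terminating in one rack of servers."""
    return servers * ports_per_server * strands_per_port

# Hypothetical traditional rack: 16 CPU servers, 2 uplinks each, duplex fiber (2 strands).
traditional = fibers_per_rack(servers=16, ports_per_server=2, strands_per_port=2)

# Hypothetical AI rack: 4 GPU servers, 8 fabric ports each, parallel optics (8 strands).
ai = fibers_per_rack(servers=4, ports_per_server=8, strands_per_port=8)

print(traditional, ai, ai / traditional)  # 64 256 4.0 -> four times the fiber
```

Even with far fewer servers in the rack, the fiber count climbs sharply once every GPU server carries multiple high-speed fabric ports.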
Why AI Workloads Require a Different Cabling Architecture
In most data centers, servers connect to top-of-rack (TOR) switches, which in turn connect to end-of-row (EOR) or middle-of-row (MOR) switches; the EOR/MOR switches then connect to the network core. This “leaf-and-spine” model, with TOR switches acting as leaves and EOR/MOR switches as spines, works well for traditional workloads, even in hyperscale environments. However, AI workloads require a different cabling design.
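As a rough sketch of how leaf-and-spine sizing works, the snippet below counts spines and computes the oversubscription ratio. The port counts are hypothetical assumptions, not recommendations:

```python
def spine_count(uplinks_per_leaf: int) -> int:
    """In a two-tier Clos fabric, every leaf connects to every spine,
    so the spine count equals the uplink ports on each leaf."""
    return uplinks_per_leaf

def oversubscription(server_ports_per_leaf: int, uplinks_per_leaf: int) -> float:
    """Downstream-to-upstream bandwidth ratio, assuming equal port speeds."""
    return server_ports_per_leaf / uplinks_per_leaf

print(spine_count(8))           # 8 spines
print(oversubscription(32, 8))  # 4.0 -> 4:1, often acceptable for traditional loads
print(oversubscription(8, 8))   # 1.0 -> 1:1 non-blocking, the usual goal for AI fabrics
```

The last line hints at the difference: AI fabrics are typically built toward 1:1, which multiplies the number of uplinks, and therefore cables, per rack.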
Because AI servers are deployed in a cluster architecture, they require much more inter-server connectivity. At the same time, AI environments generally have fewer servers per rack due to the heat generated by GPUs. As a result, AI data centers require much more inter-rack cabling. They also require high-speed data transfer rates to support the computational intensity of AI workloads.
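The sketch below shows why the inter-rack cable count climbs so quickly: with only a few GPU servers per rack, a cluster of any size spreads across many racks, and each fabric port typically needs a link that leaves the rack. All counts are hypothetical:

```python
import math

def racks_needed(servers: int, servers_per_rack: int) -> int:
    """Racks required when heat limits how many GPU servers share a rack."""
    return math.ceil(servers / servers_per_rack)

def fabric_links(servers: int, fabric_ports_per_server: int) -> int:
    """One cluster-fabric link per port; in many layouts most of these
    links cross rack boundaries to reach the leaf switches."""
    return servers * fabric_ports_per_server

# Hypothetical 32-server cluster, 4 servers per rack, 8 fabric ports per server
print(racks_needed(32, 4))  # 8 racks
print(fabric_links(32, 8))  # 256 links for the cluster fabric alone
```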
Latency is another key consideration when designing network infrastructure for AI. Training an AI model takes a lot of time, and network latency increases the amount of time required and therefore the cost. According to one recent report, up to 30 percent of the time needed to train a large AI model is spent on network latency. Therefore, the servers in an AI cluster are deployed in close proximity to minimize the length of the cable runs.
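A quick worked example shows why that 30 percent figure matters. The run length and hourly rate below are hypothetical assumptions, used only to put a dollar figure on the idea:

```python
def latency_cost(total_hours: float, hourly_rate_usd: float, network_fraction: float) -> float:
    """Dollar value of training time attributable to network waits."""
    return total_hours * hourly_rate_usd * network_fraction

# Hypothetical 30-day training run on a cluster billed at $500/hour,
# with 30% of wall-clock time spent waiting on the network.
print(f"${latency_cost(30 * 24, 500.0, 0.30):,.0f}")  # $108,000
```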
Choosing the Right Fiber-Optic Cabling
Given the number of high-speed connections that must be packed into a very small space, fiber-optic cabling is a necessity. However, there is a wide variety of fiber-optic cabling options to choose from. The best options strike the right balance among cost, reliability, and agility.
AI and other HPC workloads typically use active optical cables (AOCs). These cables have transceivers permanently attached to each end, eliminating the need to buy and deploy transceivers separately. However, AOCs are somewhat less reliable than other cable types, and their all-in-one design limits flexibility, making them more difficult to upgrade as needs change.
For intra- and inter-rack server-to-leaf cabling, consideration should be given to multimode fiber rather than single-mode fiber for several reasons. Multimode fiber utilizes VCSEL (vertical-cavity surface-emitting laser) transceivers, which support data rates of up to 400Gbps at distances of 100 meters or less, making it ideal for short-reach cabling applications within the data center. Additionally, VCSEL-based optics are typically less expensive than single-mode transceivers.
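As a sketch of that decision, the helper below picks a fiber type from link length alone, using the 100-meter short-reach limit noted above. Treat it as a simplification; real selections should also weigh data rate, connector type, and the transceiver datasheets:

```python
def recommend_fiber(link_length_m: float) -> str:
    """Naive selection rule: multimode for short reach, single-mode beyond it."""
    if link_length_m <= 100:
        return "multimode (VCSEL optics, lower transceiver cost)"
    return "single-mode (longer reach, higher transceiver cost)"

for length_m in (30, 100, 150):
    print(f"{length_m} m -> {recommend_fiber(length_m)}")
```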
Exploring the Benefits of Parallel Optic Technology
Parallel optic technology offers a great option for AI data center cabling. It simultaneously transmits and receives data over multiple optical fibers by spatially dividing the high-data-rate signal across the fiber lanes, which makes it well suited for high-data-rate multimode fiber connections of less than 100 meters. It can use OM3 and OM4 multimode fiber with lane rates of up to 28Gbps, cost-effectively delivering aggregate speeds of 100Gbps and beyond.
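The lane arithmetic is straightforward, as the short sketch below shows. The lane counts correspond to common IEEE short-reach variants such as 100GBASE-SR4 (4 x 25G lanes) and 400GBASE-SR8 (8 x 50G lanes):

```python
def aggregate_gbps(lanes: int, lane_rate_gbps: float) -> float:
    """Parallel optics: aggregate rate = per-lane rate x number of lanes."""
    return lanes * lane_rate_gbps

print(aggregate_gbps(4, 25))  # 100.0 -> 100G over 8 fibers (4 transmit + 4 receive)
print(aggregate_gbps(8, 50))  # 400.0 -> 400G over 16 fibers (8 transmit + 8 receive)
```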
Additionally, parallel optics can support both Ethernet and InfiniBand protocols. By utilizing Ethernet, AI data centers gain the advantages of a well-established, open standard that can be implemented quickly at one-third to one-half the cost of InfiniBand. Industry organizations such as the Ultra Ethernet Consortium are developing best-practice design guidelines that capitalize on Ethernet’s strengths while minimizing the packet loss that can cause latency.
Wesco’s network infrastructure specialists can help you select the right fiber-optic cabling products from best-in-class suppliers. We also offer a comprehensive suite of cabling services through our network of trusted partners. Let us help you optimize your data center network infrastructure to meet the unique demands of AI workloads.