What is AI Data Center Networking?

 

AI data center networking refers to the data center network infrastructure that supports artificial intelligence (AI) workloads. Such a network requires greater scalability, higher performance, and lower latency than earlier data center networks. AI and machine learning (ML) workloads are more demanding than typical network tasks, especially during intensive AI training phases.

 

InfiniBand’s fast and efficient communication among servers and storage systems made it a popular proprietary technology for high-performance computing (HPC) and AI training networks. More recently, Ethernet has gained significant traction in the AI data center networking market as an open alternative and may become the dominant technology in the future.

 

There are several reasons for the increasing focus on Ethernet in AI data center networks, the most apparent being operational efficiency and cost control. On the operational side, there is a large pool of network engineers capable of building and managing Ethernet networks, whereas expertise in proprietary InfiniBand networks is more limited. On the cost side, AI data center equipment based on today’s 400 Gb Ethernet is considerably cheaper overall than InfiniBand solutions. InfiniBand equipment and management tools are primarily sourced through Nvidia, whereas Ethernet is supported by a wide range of network management tools.

 

How Does AI Data Center Networking Work?

 

GPU servers are being deployed in large numbers to meet the massive computing demands of AI data centers, which drives up overall cost. The network must therefore help maximize GPU utilization by connecting those servers as efficiently as possible. Ethernet is a proven open technology and a suitable foundation for AI data center networks, and it can be designed specifically for the networking needs of AI workloads. Its architecture is enhanced with congestion management, load balancing, and low latency to optimize job completion time (JCT). Simplified management and automation also contribute to network reliability and consistent performance.

 

Fabric Design

 

The AI data center network can use a variety of architectural designs, but an any-to-any non-blocking Clos architecture is recommended. This design optimizes the training framework by providing consistent 400 Gbps network speeds (possibly increasing to 800 Gbps) from the NIC to the leaf and spine switches. Depending on the number of GPUs and the size of the model, either a two-tier, three-stage non-blocking fabric or a three-tier, five-stage non-blocking fabric can be implemented.
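
As a rough illustration of how the choice between the two fabric sizes could be made, the Python sketch below estimates how many endpoints a non-blocking folded-Clos fabric can serve. It assumes every switch has the same port count, that each switch splits its ports evenly between downlinks and uplinks, and that each GPU uses one NIC port; the function names and the 64-port example are hypothetical and not taken from any vendor design guide.

```python
def clos_max_endpoints(radix: int, tiers: int) -> int:
    """Upper bound on endpoints in a non-blocking folded-Clos fabric.

    Assumption: all switches have `radix` ports of equal speed, and every
    non-top-tier switch uses half its ports as downlinks and half as uplinks.
    tiers=2 is a leaf-spine (three-stage) fabric: radix^2 / 2 endpoints.
    tiers=3 is a five-stage fabric: radix^3 / 4 endpoints.
    """
    if radix < 2 or radix % 2 != 0:
        raise ValueError("radix must be a positive even number")
    if tiers not in (2, 3):
        raise ValueError("only two-tier and three-tier fabrics are modelled")
    return 2 * (radix // 2) ** tiers


def tiers_needed(num_gpus: int, radix: int) -> int:
    """Smallest number of tiers whose non-blocking capacity fits num_gpus.

    Assumption: one NIC port per GPU.
    """
    for tiers in (2, 3):
        if num_gpus <= clos_max_endpoints(radix, tiers):
            return tiers
    raise ValueError("cluster too large for a three-tier fabric at this radix")


if __name__ == "__main__":
    # Hypothetical 64-port switches.
    print(clos_max_endpoints(64, 2))  # 2048
    print(clos_max_endpoints(64, 3))  # 65536
    print(tiers_needed(1024, 64))     # 2
    print(tiers_needed(8192, 64))     # 3
```

Real designs also have to account for rail layouts, multiple NICs per server, and breakout ports, so these numbers are only an upper bound under the stated assumptions.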

 

Flow Control and Congestion Avoidance

 

In addition to fabric capacity, other design considerations are critical to the reliability and efficiency of the overall fabric. These include appropriately sized fabric interconnects with an optimal number of links, and the ability to detect and correct traffic imbalances to avoid congestion and packet loss. Explicit Congestion Notification (ECN) with Data Center Quantized Congestion Notification (DCQCN), together with Priority-based Flow Control (PFC), addresses traffic imbalances to ensure lossless data transmission.

 

Dynamic and adaptive load balancing is deployed on switches to reduce congestion. Dynamic load balancing redistributes traffic locally at the switch so that flows are spread evenly across links. Adaptive load balancing monitors traffic forwarding and next-hop tables to identify imbalances and steer traffic away from congested paths.
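
To make the difference from static hashing concrete, here is a minimal Python sketch that compares hash-based link selection with a load-aware choice at a single switch. It is a toy model, not a switch implementation: the flow names, the 100 Gbps flow size, and the helper functions are all hypothetical, and real switches typically rebalance at flowlet or packet granularity using hardware counters.

```python
import zlib


def static_hash_link(flow_id: str, num_links: int) -> int:
    """Static ECMP-style choice: hash the flow identifier, ignore link load."""
    return zlib.crc32(flow_id.encode()) % num_links


def least_loaded_link(link_load: list) -> int:
    """Dynamic choice: pick the uplink that currently carries the least traffic."""
    return min(range(len(link_load)), key=lambda i: link_load[i])


def place_flows(flows, num_links: int, dynamic: bool) -> list:
    """Assign (flow_id, gbps) pairs to uplinks and return the per-link load."""
    load = [0.0] * num_links
    for flow_id, gbps in flows:
        if dynamic:
            link = least_loaded_link(load)
        else:
            link = static_hash_link(flow_id, num_links)
        load[link] += gbps
    return load


if __name__ == "__main__":
    # Eight hypothetical 100 Gbps flows over four uplinks.
    flows = [(f"gpu{i}->gpu{i + 8}", 100.0) for i in range(8)]
    print("static :", place_flows(flows, 4, dynamic=False))
    print("dynamic:", place_flows(flows, 4, dynamic=True))
```

AI training traffic tends to consist of a small number of very large flows, which is why a hash that ignores load can leave some links overloaded while others sit idle.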

 

ECN provides early notification when congestion is building. Leaf and spine switches mark ECN-capable packets, and the congestion signal is relayed back to the sender, which slows its transmission to avoid packet loss. If the endpoints do not respond in time, Priority-based Flow Control (PFC) lets Ethernet receivers signal buffer availability to senders. During congestion, leaf and spine switches can then pause or throttle traffic on specific links, enabling lossless transmission for particular traffic classes.
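
The interplay between ECN and PFC can be sketched as two sets of queue thresholds: ECN marking begins at a lower queue depth, and PFC pauses the upstream sender only if the queue keeps growing. The Python sketch below models this with a RED-style marking curve of the kind commonly paired with DCQCN and a pause decision with hysteresis. All threshold values and parameter names here are illustrative assumptions, not recommended settings.

```python
def ecn_mark_probability(queue_kb: float, kmin_kb: float,
                         kmax_kb: float, pmax: float) -> float:
    """RED-style ECN marking curve.

    Below kmin nothing is marked, above kmax everything is marked, and in
    between the probability rises linearly from 0 to pmax.
    """
    if queue_kb <= kmin_kb:
        return 0.0
    if queue_kb >= kmax_kb:
        return 1.0
    return pmax * (queue_kb - kmin_kb) / (kmax_kb - kmin_kb)


def pfc_paused(queue_kb: float, xoff_kb: float, xon_kb: float,
               currently_paused: bool) -> bool:
    """PFC pause decision for one traffic class, with hysteresis.

    Pause the upstream sender once the queue reaches xoff, and resume only
    after it has drained to xon. Between the two, keep the current state.
    """
    if queue_kb >= xoff_kb:
        return True
    if queue_kb <= xon_kb:
        return False
    return currently_paused


if __name__ == "__main__":
    # Hypothetical thresholds: ECN reacts well before PFC does.
    KMIN, KMAX, PMAX = 100.0, 400.0, 0.2
    XOFF, XON = 800.0, 600.0
    paused = False
    for depth in (50, 250, 500, 900, 700, 500):
        paused = pfc_paused(depth, XOFF, XON, paused)
        prob = ecn_mark_probability(depth, KMIN, KMAX, PMAX)
        print(f"queue={depth} KB  ecn_prob={prob:.2f}  pfc_paused={paused}")
```

Keeping the ECN thresholds below the PFC thresholds reflects the ordering described above: the sender gets a chance to slow down before any link is paused.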

 

Scale and Performance

 

Ethernet has become the preferred open standard for handling HPC and AI applications. With advances such as 800GbE and Data Center Bridging (DCB), Ethernet’s speed, reliability, and scalability make it capable of meeting the high-throughput and low-latency requirements of mission-critical AI applications.

 

Automation

 

Automation is the final piece of an effective AI data center network solution. Not all automation is created equal, however: to realize the full value of the network, automation software needs to deliver experience-first operations. It can be used in designing, deploying, and managing AI data centers. It enables repeatable and continuously validated AI data center design and deployment. In addition to eliminating human error, it can leverage telemetry and traffic data to optimize performance, support proactive troubleshooting, and avoid network outages.

 

Why Do You Need AI Data Center Networks to Meet AI-Driven Demands?

 

The AI data center network meets the requirements driven by generative AI and large-scale deep-learning AI models. The development of AI models involves three phases:

 

Phase 1: Data Preparation – Collecting and organizing datasets for training AI models.

 

Phase 2: AI Training – Training the AI model by exposing it to large amounts of data so that it learns patterns and relationships.

 

Phase 3: AI Inference – Applying trained models to real-world scenarios to make predictions or decisions based on new, unknown data.

 

While Phase 3 typically leverages existing data centers and cloud networks, Phase 2 (AI training) requires significant data and computing resources to support the iterative learning process. Graphics Processing Units (GPUs) are commonly used for AI training and inference, often in cluster configurations to increase efficiency. However, scaling clusters increases costs, which highlights the importance of AI data center networks that do not compromise cluster efficiency.

 

Training large models requires connecting a large number of GPU servers, sometimes tens of thousands, at a cost of more than $400,000 per server in 2023. Optimizing job completion time and minimizing tail latency, where outlier AI workloads slow down the completion of the entire job, is therefore critical to maximizing the return on GPU investment. In this context, AI data center networks must be reliable and must not reduce cluster efficiency.
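
A simple way to see why tail latency matters is to model one synchronous training step, which cannot finish until the slowest worker has completed its communication. The Python sketch below is a deliberately simplified model that assumes compute and communication do not overlap; the timings and function names are made up for illustration.

```python
def step_time(compute_s: float, comm_times_s: list) -> float:
    """Duration of one synchronous training step.

    Assumption: every worker computes for compute_s seconds, then all
    workers wait until the slowest communication has finished.
    """
    return compute_s + max(comm_times_s)


def gpu_utilization(compute_s: float, comm_times_s: list) -> float:
    """Fraction of the step that GPUs spend computing rather than waiting."""
    return compute_s / step_time(compute_s, comm_times_s)


if __name__ == "__main__":
    # Hypothetical numbers: eight workers, one second of compute per step.
    balanced = [0.2] * 8
    one_straggler = [0.2] * 7 + [0.8]
    print(f"balanced:      {gpu_utilization(1.0, balanced):.0%} utilization")
    print(f"one straggler: {gpu_utilization(1.0, one_straggler):.0%} utilization")
```

In this model a single slow flow lowers utilization for every GPU in the job, not just the one attached to the congested path.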

FAQs About AI Data Center Networking 

 

What Problem Does the AI Data Center Network Solve?

 

Generative AI and deep learning demand large amounts of data and compute to support the iterative process of model development, in which AI models learn and refine their parameters from the data they continually collect. GPUs are well suited to AI training and inference workloads, and they are typically deployed in clusters. As data centers scale those clusters, costs rise as well, so it is critical to use an AI data center network that does not compromise cluster efficiency.

 

Ethernet vs. InfiniBand: Which is Better for AI Data Center Networks?

While InfiniBand, a proprietary technology, has brought advancement and innovation to AI networking, InfiniBand equipment is expensive, lacks the price competition of a multi-vendor market, and requires specialized technicians. Ethernet, on the other hand, is among the most widely used network technologies in the world and can also meet the high-throughput and low-latency requirements of AI applications, with 800GbE and enhanced Data Center Bridging (DCB) capabilities now available. InfiniBand is primarily supplied through NVIDIA, whereas Ethernet is supported by a much wider variety of management tools. In summary, Ethernet fabrics are well suited to high-priority and mission-critical AI traffic.