InfiniBand is a high-performance network architecture. In recent years it has attracted growing attention, driven by the rapid rise of large AI models such as ChatGPT, which are typically trained on GPU clusters built with NVIDIA's InfiniBand networking. It is mainly used for high-speed data transmission between systems and supports multiple communication modes. In this article, we will look in detail at what InfiniBand is and how it works.

InfiniBand: A Detailed Explanation

InfiniBand is a high-speed, low-latency interconnect technology used primarily in data centers and high-performance computing environments. It provides a high-performance network fabric that connects servers, storage systems, and other network resources within a cluster or data center. InfiniBand equipment typically includes InfiniBand optical modules, InfiniBand switches, and so on.

How Does InfiniBand Work

InfiniBand uses a layered architecture in which the physical layer, the data link layer, and the network layer are each responsible for different functions. The physical layer provides direct point-to-point connections between devices over high-bandwidth serial links, while the data link layer handles the transmission and reception of packets between devices. The network layer provides higher-level InfiniBand features such as QoS, virtualization, and routing, and the stack as a whole supports RDMA. This coordination between layers makes InfiniBand a powerful foundation for HPC workloads that require low latency and high bandwidth.

InfiniBand Working Principles Explained

Now that we know what InfiniBand is, let's look at how it works.

RDMA: The Basic Function of InfiniBand

An important basic function of the InfiniBand architecture is RDMA (Remote Direct Memory Access), which allows data to be transferred directly between application memory spaces without passing through the CPU. This not only reduces latency but also improves overall system efficiency. RDMA is not unique to InfiniBand; many data center networks also support it, for example over Ethernet via RoCE.
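To make the CPU-bypass idea concrete, here is a minimal sketch in C using the Linux libibverbs API, the standard user-space interface to InfiniBand adapters. It assumes the caller has already created a reliable-connected queue pair, brought it to the ready-to-send state, and exchanged the remote buffer address and rkey out of band; those connection-management steps and most error handling are omitted, and the function and parameter names here are illustrative, not part of any standard API.

```c
/* rdma_write_sketch.c: illustrative only; post a one-sided RDMA WRITE
 * with the Linux verbs API (libibverbs).
 * Compile with: gcc -c rdma_write_sketch.c   (link a full program with -libverbs)
 *
 * Assumed (and omitted) setup: an RC queue pair already created, connected,
 * and moved to the RTS state, plus out-of-band exchange of remote_addr/remote_rkey.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Write `len` bytes from a freshly registered local buffer directly into
 * remote memory. The HCA performs the transfer; the remote CPU is not involved. */
int rdma_write_example(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq,
                       uint64_t remote_addr, uint32_t remote_rkey, size_t len)
{
    /* 1. Allocate and register the local buffer so the HCA can DMA from it. */
    void *buf = malloc(len);
    if (!buf)
        return -1;
    memset(buf, 0xAB, len);                       /* example payload */

    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) {
        free(buf);
        return -1;
    }

    /* 2. Describe the local buffer and the RDMA WRITE work request. */
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,          /* one-sided: remote CPU posts nothing */
        .send_flags = IBV_SEND_SIGNALED,          /* request a completion entry */
    };
    wr.wr.rdma.remote_addr = remote_addr;         /* obtained out of band */
    wr.wr.rdma.rkey        = remote_rkey;

    /* 3. Hand the work request to the HCA. */
    struct ibv_send_wr *bad_wr = NULL;
    int rc = ibv_post_send(qp, &wr, &bad_wr);

    /* 4. Poll the completion queue until the HCA reports the write finished. */
    if (rc == 0) {
        struct ibv_wc wc;
        int n;
        do {
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);
        rc = (n > 0 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
    }

    ibv_dereg_mr(mr);
    free(buf);
    return rc;
}
```

Once the work request is posted, the HCA moves the data on its own; the CPU's only remaining job is to check the completion queue, which is exactly why RDMA keeps both latency and CPU load so low.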

InfiniBand Network and Architecture

InfiniBand uses a switched fabric architecture that interconnects multiple nodes through switches. This architecture provides high-bandwidth, low-latency communication between nodes, making it well suited to environments such as HPC. The components of this channel-based fabric fall into the following four categories (a short sketch after the list shows how a host can enumerate its adapters and links through the verbs API):

HCA (Host Channel Adapter): the adapter that acts as the interface between a host and the InfiniBand network, converting the host's internal data bus into the InfiniBand data format.

TCA (Target Channel Adapter): the adapter used by target devices within the InfiniBand network; its main function is to manage data reception and processing on the target device.

InfiniBand link: the physical and logical connection between two nodes in the InfiniBand network, providing a channel for data transmission and supporting high-speed data exchange. Links are typically established over optical fiber, copper cable, or even board-level traces within a device.

InfiniBand switches and routers: these devices handle networking within the InfiniBand fabric. They route data packets between the devices connected to the network, enabling seamless communication and data exchange.
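As a small illustration of how these components look from the software side, the sketch below (again using the Linux libibverbs API; the output format is illustrative) enumerates the HCAs present on a host and prints the state and local identifier (LID) of each port. The LID is assigned by the fabric's subnet manager once a link becomes active; switches and routers do not appear in this list, since they are managed through the subnet manager rather than through the host's verbs interface.

```c
/* list_hcas.c: enumerate the InfiniBand channel adapters (HCAs) on a host
 * and print the state of each port, using the Linux verbs API (libibverbs).
 * Build: gcc list_hcas.c -o list_hcas -libverbs
 */
#include <stdint.h>
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num_devices = 0;
    struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
    if (!dev_list) {
        perror("ibv_get_device_list");
        return 1;
    }

    for (int i = 0; i < num_devices; i++) {
        struct ibv_context *ctx = ibv_open_device(dev_list[i]);
        if (!ctx)
            continue;

        struct ibv_device_attr dev_attr;
        if (ibv_query_device(ctx, &dev_attr) == 0) {
            printf("HCA %s: %u port(s)\n",
                   ibv_get_device_name(dev_list[i]),
                   (unsigned)dev_attr.phys_port_cnt);

            /* Ports are numbered from 1 in the verbs API. */
            for (uint8_t port = 1; port <= dev_attr.phys_port_cnt; port++) {
                struct ibv_port_attr port_attr;
                if (ibv_query_port(ctx, port, &port_attr))
                    continue;
                printf("  port %u: state=%s lid=%u\n",
                       (unsigned)port,
                       port_attr.state == IBV_PORT_ACTIVE ? "ACTIVE" : "not active",
                       (unsigned)port_attr.lid);
            }
        }
        ibv_close_device(ctx);
    }

    ibv_free_device_list(dev_list);
    return 0;
}
```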

Benefits of InfiniBand

High bandwidth and low latency: InfiniBand can provide bandwidth of 200 Gb/s or higher per port. This level of throughput is critical for applications and workloads that move large volumes of data, such as HPC and big data processing. Through RDMA and its switched fabric architecture, InfiniBand also supports direct data transfer between memory spaces, keeping latency at the microsecond level, which matters greatly for latency-sensitive fields such as high-frequency trading (HFT) and the wider financial industry.

Flexible scalability: InfiniBand is also highly scalable, and its architecture can grow with the changing needs of a data center. Network capacity can be upgraded seamlessly as business needs expand, and resources can be allocated dynamically. A single fabric can interconnect thousands of nodes, meeting computing needs of very different scales.

Low power consumption: InfiniBand delivers high performance with comparatively low power consumption. Because RDMA moves data directly from one node's memory to another without involving the CPU, the CPU load and therefore the overall power draw are reduced. In addition, InfiniBand devices include dedicated hardware offload engines that process and transmit data more efficiently, further lowering power consumption.

High reliability: InfiniBand also supports multipath redundancy and fast fault recovery mechanisms to ensure the reliability and availability of data transmission.

Applications of InfiniBand

InfiniBand plays an indispensable role in many industries today, and its high-speed, low-latency data transmission has made it a key building block for a wide range of workloads.

InfiniBand Applications in Artificial Intelligence

Training artificial intelligence models requires processing enormous data sets every day. InfiniBand provides the high-bandwidth, low-latency network connections needed to accelerate model training, and with RDMA the nodes can communicate with extremely low latency, further improving training efficiency.

InfiniBand in HPC Workloads

InfiniBand also plays an important role in HPC workloads. It provides high bandwidth and low latency while handling large amounts of data in parallel, improving the overall performance of the cluster. It is also well suited to particularly demanding fields such as financial modeling and medical diagnosis.

InfiniBand in Storage Networks

InfiniBand can be used in storage networks to connect storage arrays and hosts for efficient data access and transfer. Its RDMA support reduces storage access latency and improves the performance of storage systems.

Conclusion

Faced with ever-growing demands for network bandwidth and low latency, InfiniBand meets those demands through its distinctive architecture. It not only satisfies the network requirements of high-performance computing (HPC) and data centers but also underpins the training of artificial intelligence models such as ChatGPT. As technology continues to advance, it will undoubtedly find even more application scenarios in the future.