How to Choose Between InfiniBand and RoCE

Ultra-high bandwidth, ultra-low latency, and ultra-high reliability are the network requirements for large model training.

For many years, the TCP/IP protocol suite has been the pillar of internet communication, but it has fatal drawbacks for AI networks: latency is high, typically tens of microseconds, and it imposes a serious CPU load. RDMA, by contrast, accesses memory directly through the network interface without involving the operating system kernel. This enables high-throughput, low-latency network communication that is especially well suited to large-scale parallel computing clusters.

InfiniBand, RoCEv1, RoCEv2, and iWARP are the four implementations of RDMA technology. However, RoCEv1 is no longer used and iWARP is uncommon, so the prevalent network solutions in the industry are InfiniBand and RoCEv2.

RDMA protocol stack

What are the Network Requirements for HPC/AI Workloads?

Most data centers today use a two-tier network architecture, whereas AI clusters are supercomputers built to execute complex, large-scale AI tasks. Their compute workloads run in parallel across many GPUs and demand high utilization. Compared to traditional data center networks, AI data center networks therefore face additional complexity:

  • Parallel computing: AI workloads are a unified infrastructure of multiple machines running the same application/computing task;
  • Scale: The scale of HPC/AI tasks can reach thousands of computing engines (such as GPUs, CPUs, FPGAs, etc.);
  • Task types: tasks differ in size, running duration, data set size and count, the type of answer to generate, the language the application is coded in, and the hardware it runs on; as a result, the traffic patterns in a network built for HPC/AI workloads are constantly changing;
  • Lossless: In traditional data centers, lost messages are retransmitted, while in AI workloads, lost messages mean that the entire computation is either wrong or stuck. Therefore, AI data centers need a lossless network.
  • Bandwidth: High-bandwidth traffic needs to run between servers so that applications can access data. In modern deployments, the interface speed of each computing engine for AI or other high-performance computing capabilities reaches 400Gbps.

These complexities pose significant challenges for AI networks, so AI data center networks need to have high bandwidth, low latency, no jitter, no packet loss, and long-term stability.

From TCP/IP to RDMA

For applications such as HPC/AI that require low latency and high I/O concurrency, the existing TCP/IP software and hardware architecture cannot meet the requirements. Traditional TCP/IP network communication relies on the kernel to send and receive messages, which incurs heavy data movement and data copying overhead. For example, in a typical IP data transfer, when an application on one computer sends data to an application on another computer, the following operations occur on the receiving end:

  1. The kernel must receive the data.
  2. The kernel must determine which application the data belongs to.
  3. The kernel wakes up the application.
  4. The kernel waits for the application to perform a system call on the kernel.
  5. The application copies the data from the kernel memory space to the buffer provided by the application.

This process means that even if the host adapter uses direct memory access (DMA), most of the network traffic is still copied through the system’s main memory. In addition, the computer performs context switches between the kernel and the application. These context switches add CPU load while slowing down other tasks.

TCP/IP Transmission
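To make the copy overhead concrete, here is a minimal sketch of a conventional receiver using Python's standard socket API (the address, port, and buffer size are placeholders): each recv() is a system call, and the payload is copied from kernel socket buffers into the application's buffer.

```python
import socket

# Minimal TCP receiver: every recv() is a system call, and the payload is
# copied from kernel socket buffers into the user-space bytes object.
HOST, PORT = "0.0.0.0", 9000   # placeholder address and port for illustration

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
    srv.bind((HOST, PORT))
    srv.listen(1)
    conn, peer = srv.accept()          # kernel completes the TCP handshake
    with conn:
        received = bytearray()
        while True:
            chunk = conn.recv(65536)   # syscall: kernel -> user copy
            if not chunk:              # peer closed the connection
                break
            received.extend(chunk)
print(f"received {len(received)} bytes via the kernel TCP/IP stack")
```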

Unlike traditional IP communication, RDMA bypasses the kernel during the communication process, allowing one host to directly access the memory of another host and reducing CPU overhead. With RDMA, once a packet enters the network, the host adapter determines which application should receive it and where in that application’s memory space it should be stored. Rather than sending the packet to the kernel for processing and then copying it into the user application’s memory, the host adapter places the packet contents directly into the application’s buffer.

RDMA Transmission

RDMA transmission reduces the number of CPU cycles involved, which improves throughput and performance. In other words, the essence of RDMA is that, in large-scale distributed computing and storage scenarios, it lets the network card bypass the CPU and directly access the memory of a remote server. This accelerates interaction between servers, reduces latency, and frees precious CPU resources for high-value computation and logic control.
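To make the one-sided semantics concrete, the toy Python model below mimics how an RDMA-capable NIC writes into a pre-registered remote buffer without involving the remote CPU. The class and method names (RegisteredBuffer, RNic, rdma_write) are invented for illustration; real applications go through a verbs library such as libibverbs, and the remote access is performed by the NIC hardware.

```python
# Toy model of one-sided RDMA semantics. Names are illustrative only.

class RegisteredBuffer:
    """Memory region pinned and registered with the NIC; identified by an rkey."""
    def __init__(self, size: int, rkey: int):
        self.data = bytearray(size)
        self.rkey = rkey

class RNic:
    """Stands in for the RDMA NIC: it can reach registered buffers directly."""
    def __init__(self):
        self.regions = {}              # rkey -> RegisteredBuffer

    def register(self, buf: RegisteredBuffer):
        self.regions[buf.rkey] = buf

    def rdma_write(self, rkey: int, offset: int, payload: bytes):
        # The write lands in the remote application's buffer without the
        # remote CPU scheduling a process or copying data out of the kernel.
        region = self.regions[rkey]
        region.data[offset:offset + len(payload)] = payload

remote_nic = RNic()
app_buffer = RegisteredBuffer(size=64, rkey=0x1234)   # registered once, up front
remote_nic.register(app_buffer)

remote_nic.rdma_write(rkey=0x1234, offset=0, payload=b"gradient shard")
print(app_buffer.data[:14])            # data is already in application memory
```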

Compared to traditional TCP/IP networks, InfiniBand and RoCEv2 bypass the kernel protocol stack and improve latency by roughly an order of magnitude. Experimental tests show that, when communication within the same cluster completes in a single hop, bypassing the kernel protocol stack reduces end-to-end latency at the application layer from about 50us (TCP/IP) to 5us (RoCE) or 2us (InfiniBand).

end-to-end communication latency of different technologies

An Introduction to InfiniBand Networks

InfiniBand networks use InfiniBand adapters and switches instead of Ethernet for data transmission. For example, the port-to-port latency of one type of Ethernet switch is 230 ns, while an InfiniBand switch with the same number of ports has a latency of 100 ns.

InfiniBand Networks

The key components of an InfiniBand network include a subnet manager (SM), an IB network card, an IB switch, and an IB cable. InfiniBand switches do not run any routing protocols, and the forwarding tables of the entire network are calculated and distributed by a centralized subnet manager. In addition to forwarding tables, the SM is also responsible for managing partitions, QoS, and other configurations in the InfiniBand subnet. InfiniBand networks require dedicated cables and optical modules to interconnect switches and connect switches to network cards.

Local Lossless Network

InfiniBand networks use a credit token mechanism to fundamentally avoid buffer overflow and packet loss. The sender only initiates packet transmission after confirming that the receiver has enough credits to accept the corresponding number of packets.

Each link in an InfiniBand network has a predefined buffer, and the sender never transmits more data than the receiver has free buffer space to accept. Once the receiver finishes forwarding the data, it releases the buffer and continuously reports the currently available buffer size back to the sender. This link-level flow control ensures that the sender never over-transmits, preventing network buffer overflow and packet loss.

diagram of lossless data transmission in InfiniBand network
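The sketch below is a simplified Python model of this credit-based behavior (buffer sizes and class names are illustrative, not taken from the InfiniBand specification): the sender transmits only as many packets as the receiver has advertised credits for, and transmission resumes as credits are returned.

```python
# Simplified model of InfiniBand-style credit-based (link-level) flow control.

class Receiver:
    def __init__(self, buffer_slots: int):
        self.free_slots = buffer_slots        # advertised as credits

    def accept(self, n_packets: int):
        assert n_packets <= self.free_slots, "would overflow: never happens here"
        self.free_slots -= n_packets

    def forward_and_release(self, n_packets: int):
        self.free_slots += n_packets          # credits returned to the sender

class Sender:
    def __init__(self, receiver: Receiver):
        self.receiver = receiver

    def send(self, n_packets: int) -> int:
        # Only transmit what the receiver's advertised credits allow.
        allowed = min(n_packets, self.receiver.free_slots)
        if allowed:
            self.receiver.accept(allowed)
        return allowed

rx = Receiver(buffer_slots=8)
tx = Sender(rx)
print(tx.send(12))           # 8: capped by available credits, no buffer overflow
rx.forward_and_release(8)    # receiver drains its buffer and returns credits
print(tx.send(4))            # 4: transmission resumes as credits come back
```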

Network Card Expansion Capability

InfiniBand’s adaptive routing is based on per-packet dynamic routing, ensuring optimal network utilization in large-scale deployments. There are many examples of large-scale GPU clusters using InfiniBand networks, such as Baidu Artificial Intelligence Cloud and Microsoft Azure.

InfiniBand network cards have evolved rapidly in speed: 200Gbps HDR is already widely deployed commercially, and 400Gbps NDR cards are also beginning to be deployed. The major providers of InfiniBand network solutions and supporting equipment include NVIDIA, Intel, Cisco, and HPE, with NVIDIA holding the largest market share at over 70%. The following figure shows commonly used InfiniBand network cards.

Commonly used InfiniBand network cards

An Introduction to RoCEv2 Networks

RoCE implements RDMA functionality over Ethernet; it bypasses TCP/IP and uses hardware offloading, thereby reducing CPU utilization. RoCE has two main versions: RoCEv1 and RoCEv2. RoCEv1 is an RDMA protocol implemented over the Ethernet link layer, so switches must support flow control techniques such as PFC to ensure reliable transmission at the link layer. RoCEv2 is carried over UDP/IP on Ethernet and introduces the IP protocol to solve scalability issues.

RoCEv2 supports RDMA routing over layer 3 Ethernet networks. It replaces the InfiniBand network layer with IP and UDP headers on top of the Ethernet link layer, which makes it possible to route RoCE traffic through traditional IP routers.
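The illustrative Python snippet below shows the resulting encapsulation as a stack of layers. The addresses and opcode are placeholder values; UDP destination port 4791 is the well-known port that identifies RoCEv2 traffic.

```python
# Illustrative view of RoCEv2 encapsulation: the InfiniBand transport headers
# (BTH and above) ride inside a routable UDP/IP/Ethernet frame.

def roce_v2_frame(payload: bytes) -> list:
    return [
        ("Ethernet", {"ethertype": "IPv4"}),
        ("IP",       {"src": "10.0.0.1", "dst": "10.0.0.2"}),   # routable at layer 3
        ("UDP",      {"dst_port": 4791}),                        # identifies RoCEv2
        ("IB BTH",   {"opcode": "RDMA WRITE", "dest_qp": 0x11}), # IB transport header
        ("Payload",  payload),
    ]

for layer in roce_v2_frame(b"rdma payload"):
    print(layer)
```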

InfiniBand networks are, to some extent, centrally managed networks with a subnet manager (SM), while RoCEv2 networks are purely distributed networks composed of RoCEv2-capable NICs and switches, usually adopting a two-layer architecture.

diagram of RoCE network architecture

The main suppliers of RoCE network cards are NVIDIA, Intel, and Broadcom. PCIe cards are the dominant form factor for data center server NICs. The port PHY speed of RDMA cards usually starts at 50Gbps, and currently available commercial network cards reach up to 400Gbps per port.

RoCE network card

Currently, most data center switches support RDMA flow control technology and can provide end-to-end RDMA communication when combined with RoCE-capable network cards. The main players in data center switching include Cisco, HPE, and Arista. The core of a high-performance switch is its forwarding chip: Broadcom’s Tomahawk series is widely used among commercial forwarding chips, with the Tomahawk3 generation currently the most common in switches and Tomahawk4-based switches gradually increasing in the market.

evolution of ethernet forwarding chips

InfiniBand vs. RoCE

Compared to InfiniBand, RoCE has greater versatility and relatively lower cost. It can be used not only to build high-performance RDMA networks, but also for traditional Ethernet. However, configuring parameters such as Headroom, PFC (Priority-based Flow Control) and ECN (Explicit Congestion Notification) on switches may be complex. In large-scale deployments, the overall throughput performance of RoCE networks may be slightly lower than InfiniBand networks.

IB vs. RoCE
  • From a technical perspective, InfiniBand adopts various technologies to improve network forwarding performance, reduce fault recovery time, enhance scalability, and reduce operational complexity.
  • In terms of business performance, InfiniBand has lower end-to-end latency than RoCEv2, so networks built on InfiniBand have an advantage in application-level business performance.
  • In terms of bandwidth and latency, the deciding factors for high-performance interconnects are how congestion and routing are handled, discussed below.

Congestion

InfiniBand controls congestion with two kinds of notification carried in relayed packets: Forward Explicit Congestion Notification (FECN) and Backward Explicit Congestion Notification (BECN). When the network is congested, FECN notifies the receiving device, while BECN notifies the sending device. InfiniBand combines FECN and BECN with an adaptive marking rate to reduce congestion; this provides coarse-grained congestion control.

Congestion control on RoCE uses Explicit Congestion Notification (ECN), an extension to IP and TCP that signals network congestion at the endpoints without dropping packets. ECN places a mark in the IP header to tell the sender that congestion has occurred. Without ECN, congestion is signaled by dropped packets that must be retransmitted; ECN avoids these losses and retransmissions, which reduces latency and jitter and improves transaction and throughput performance. ECN, too, provides only coarse-grained congestion control, so it has no obvious advantage over InfiniBand in this respect.
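As a rough illustration, the Python sketch below models a sender that halves its rate when ECN marks are reflected back and recovers gradually otherwise. The constants are invented and the logic is only loosely inspired by DCQCN-style rate control, not taken from any specification.

```python
# Toy model of ECN-based congestion control on a RoCE sender: when the
# receiver reflects ECN marks back (congestion notification), the sender cuts
# its rate; otherwise it recovers gradually. Constants are illustrative only.

LINE_RATE_GBPS = 400.0

def adjust_rate(current_gbps: float, ecn_marked: bool) -> float:
    if ecn_marked:
        return max(current_gbps * 0.5, 1.0)            # multiplicative decrease
    return min(current_gbps + 10.0, LINE_RATE_GBPS)    # additive recovery

rate = LINE_RATE_GBPS
for marked in [False, False, True, True, False, False, False]:
    rate = adjust_rate(rate, marked)
    print(f"ecn={marked!s:5}  send rate {rate:6.1f} Gbps")
```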

Routing

When the network is congested, adaptive routing sends traffic over alternative routes to relieve the congestion and speed up transmission. RoCEv2 runs on top of IP, which has been routable for decades thanks to mature routing algorithms and can now predict congested routes with machine learning and automatically steer packets onto faster paths. In terms of routing, Ethernet and RoCEv2 have significant advantages.
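The toy Python comparison below illustrates the difference: static ECMP hashes a flow onto one uplink regardless of load, while an adaptive scheme steers each packet to the currently least-loaded uplink. The queue depths and flow tuple are made-up example values.

```python
# Contrast between static ECMP (hash of the flow always picks the same uplink)
# and adaptive routing (each packet goes to the currently least-loaded uplink).

uplink_queue_depth = {"port1": 30, "port2": 5, "port3": 12}   # pending packets

def ecmp_port(flow_tuple: tuple) -> str:
    ports = sorted(uplink_queue_depth)
    return ports[hash(flow_tuple) % len(ports)]     # ignores congestion

def adaptive_port() -> str:
    return min(uplink_queue_depth, key=uplink_queue_depth.get)  # avoids hotspots

flow = ("10.0.0.1", "10.0.0.2", 4791)
print("ECMP choice:    ", ecmp_port(flow))      # same port for the whole flow
print("adaptive choice:", adaptive_port())      # steers around the busy port
```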

However, neither InfiniBand nor RoCE does much to address tail latency, which is critical for synchronizing HPC messaging applications.

UEC Plans to Define a New Transport Protocol

In addition to InfiniBand and RoCE, other protocols have been proposed by the industry.

On July 19, 2023, the Ultra Ethernet Consortium (UEC) was officially established. UEC’s goal is to go beyond existing Ethernet capabilities and provide a high-performance, distributed, and lossless transport layer optimized for high-performance computing and artificial intelligence. UEC’s founding members include AMD, Arista, Broadcom, Cisco, Eviden, HPE, Intel, Meta, and Microsoft, all of whom have decades of experience in networking, artificial intelligence, cloud, and large-scale high-performance computing deployments.

founding members

UEC believes that RDMA, defined decades ago, is outdated for the demands of AI/ML network traffic: it transfers data in large blocks, which can cause link imbalance and overload. It is time to build a modern transport protocol that supports RDMA for emerging applications.

It is reported that the UEC transport protocol is under development, aiming to provide better Ethernet transport than current RDMA (while still supporting RDMA), retaining the advantages of Ethernet/IP and delivering the performance that AI and HPC applications require. UEC transport is a new form of near-transport layer with some semantic adjustments, a congestion notification protocol, and enhanced security features. It will offer more flexible transport that does not require a lossless network, and will support multipathing, out-of-order packet delivery, and other features that many-to-many artificial intelligence workloads need.
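The short Python sketch below illustrates why out-of-order delivery is workable: packets of one message sprayed across multiple paths may arrive in any order, and the receiver reassembles them by sequence number. This is a conceptual illustration only, not UEC protocol code.

```python
# Sketch of multipath "packet spraying": packets of one message take several
# paths, arrive out of order, and are reassembled by sequence number.

import random

message = [f"chunk-{i}".encode() for i in range(8)]
packets = [(seq, data) for seq, data in enumerate(message)]

random.shuffle(packets)                 # different paths => out-of-order arrival

reassembly = {}
for seq, data in packets:
    reassembly[seq] = data              # receiver buffers by sequence number

restored = b"".join(reassembly[seq] for seq in sorted(reassembly))
print(restored == b"".join(message))    # True: message recovered despite reordering
```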

More Enterprise Power

As the HPC/AI network continues to evolve, more and more enterprises are launching their own network protocols or solutions to meet their specific needs.

Tencent Cloud uses its self-developed Starlink RDMA network, which allows GPUs to communicate with each other directly, saving CPU resources and improving the overall performance and efficiency of compute nodes. Combined with Tencent’s self-developed end-to-end collaborative protocol TiTa, the Starlink network can sustain 90% load with zero packet loss. TiTa embeds a congestion control algorithm that monitors network status in real time and optimizes communication, making data transmission smoother and latency lower.

Alibaba Cloud’s Panjiu PredFabric uses the self-developed Solar-RDMA high-speed network protocol, which allows processors to access the memory of any other server through load/store instructions. This is well suited to the interaction patterns of neural networks inside deep learning models. Compared with the traditional mode, fault self-healing time and tail latency can be reduced by 90%.

Huawei’s hyper-converged data center network uses its original iLossless intelligent lossless algorithm, which combines three key technologies (flow control, congestion control, and intelligent lossless storage networking) to prevent PFC deadlock in advance, relieve and release congestion, and control hosts quickly, thereby building a lossless Ethernet network and solving the congestion-related packet loss of traditional Ethernet.

Growing market demand is the fundamental driving force behind this technological development. According to IDC, investment in AI infrastructure construction will reach 154 billion US dollars in 2023 and grow to 300 billion US dollars by 2026. In 2022, the AI network market reached 2 billion US dollars, with InfiniBand contributing 75% of that revenue.

When comparing InfiniBand and RoCE, we can see that both have their own advantages and application scenarios. InfiniBand performs well in the field of high-performance computing, providing excellent performance, low latency, and scalability. RoCE is easier to integrate into existing Ethernet infrastructure and has a lower cost. The emerging transport protocols represented by UEC also represent the continuous development and innovation of technology. Only by adapting to the changing needs can we maintain core competitiveness.
