EtherNET or EtherNOT?

A survey of the AI network positions of the leading vendors

In July 2023, the Ultra Ethernet Consortium (UEC), hosted by the Linux Foundation and its Joint Development Foundation, was officially launched, dropping a depth charge into the already turbulent AI network interconnect ecosystem. In August 2023, at IEEE Hot Interconnects (HOTI), an international forum devoted to advanced hardware and software architectures and interconnection network implementations, representatives from Intel, NVIDIA, AMD, and other companies joined a panel on the question of "EtherNET or EtherNOT" and laid out their views on Ethernet.

Emerging AI/ML workloads are driving demand for high-performance network interconnects. About ten years ago, RDMA over Converged Ethernet (RoCE) brought low-latency data transfer to the Ethernet architecture, yet compared with competing fabrics, Ethernet has seemed to lag in technical development. Is the battle between EtherNET and EtherNOT flaring up again? Cloud providers, equipment vendors, and other players each have their own interests at stake, and this is a critical decision point. How will they choose?

The "EtherNET or EtherNOT" question had already been debated at the HOTI conference back in 2005, and the conclusion reached at that time is captured below:

[Figure: "EtherNET or EtherNOT", the conclusion of the 2005 HOTI panel]

At the 2023 HOTI panel, Brad Burres, senior researcher and chief hardware architect in Intel's Network and Edge Group, and Frank Helms, data center GPU system architect at AMD, came down on the side of Ethernet. Burres argued that whichever technology is adopted, the industry needs an open ecosystem to drive down cost and to build the required software infrastructure; as the protocols mature, Ethernet will win unless another open, standard fabric (such as CXL) emerges very quickly.

Helms noted that Frontier, Aurora, and LUMI, holding first, second, and fifth place on the global TOP500 supercomputer list, are all interconnected with the Ethernet-based HPE Cray Slingshot-11 fabric, and argued that Ethernet is at the leading edge of interconnect technology. The formation of the UEC likewise reflects a great deal of pent-up demand for Ethernet in large-scale AI training cluster interconnects.

Larry Dennison, director of network research at NVIDIA, countered that Ethernet still falls short of what AI workloads need. If Ethernet were extended to meet all of those needs, would it still be Ethernet, and how long would that take? The Ethernet market is certainly huge and will not disappear, but over the next few years Ethernet will not evolve fast enough to satisfy this market.

Torsten Hoefler, professor at ETH Zurich and an advisor to Microsoft on large-scale AI and networking, took a middle position: Ethernet is the present and the future of data centers and supercomputers, but not the Ethernet we have today; Ethernet needs to evolve.

Open Ecosystem or Vendor Lock-in?

Historically, InfiniBand and Ethernet, both open standards, have competed for dominance of the AI/HPC market. A key difference today is that InfiniBand is effectively backed by a single vendor, NVIDIA, while Ethernet enjoys multi-vendor support and a vibrant, competitive ecosystem. Even so, AI/HPC Ethernet solutions often come with a "partially customized" label, which can itself lead to vendor lock-in.

For example, Broadcom's Jericho3 Ethernet switch requires the entire fabric to use the same switch chip when running in its high-performance "fully scheduled fabric" mode. Cisco's Silicon One and NVIDIA's Spectrum-X switches are in a similar position: the highest-performance configurations can translate into vendor lock-in. Some hyperscalers have also designed custom NICs, which in turn lead to custom networks. So even when choosing Ethernet, one may still end up with custom implementations and vendor lock-in. The hope is that AI/HPC networks will converge on a new, open, and more capable transport standard that partially or fully replaces the RoCEv2 RDMA protocol; that is the vision the Ultra Ethernet Consortium is pursuing.

AI/ML Networking Technology Inventory

How do the hyperscale vendors choose their AI/ML network technologies? Is it EtherNET or EtherNOT?

Amazon AWS

Amazon drew inspiration from InfiniBand's Reliable Datagram (RD) semantics and built the Scalable Reliable Datagram (SRD) transport protocol for its HPC networks. Amazon exclusively uses its own Elastic Network Adapters (ENA), built around its proprietary Nitro chips. SRD runs over UDP, sprays packets across multiple links, and drops the in-order packet delivery requirement, which reduces fabric congestion and tail latency; when needed, packet reordering is handled by the layer above SRD. Amazon continues to pursue a home-grown AI/HPC network strategy and is probably the hyperscaler least aligned with NVIDIA.
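To make the packet-spraying idea concrete, here is a minimal Python sketch (illustrative only, not AWS code; all names are hypothetical) of a sender that sprays a message's packets across several equal-cost paths and a receiver that tolerates out-of-order arrival by reassembling on sequence numbers:

```python
# Illustrative sketch of SRD-style packet spraying (hypothetical, not AWS code).
import random

PATHS = ["path-0", "path-1", "path-2", "path-3"]  # e.g. distinct ECMP paths

def spray(message: bytes, mtu: int = 1024):
    """Split a message into packets and give each packet a path at random,
    so no single path (or congested queue) carries the whole flow."""
    packets = []
    for seq, off in enumerate(range(0, len(message), mtu)):
        packets.append({"seq": seq, "path": random.choice(PATHS),
                        "payload": message[off:off + mtu]})
    return packets

def reassemble(packets):
    """Receiver side: packets may arrive in any order; the layer above SRD
    restores order using the sequence numbers."""
    buf = {}
    for pkt in packets:                 # arrival order is irrelevant
        buf[pkt["seq"]] = pkt["payload"]
    assert len(buf) == len(packets)     # retransmission logic omitted here
    return b"".join(buf[s] for s in sorted(buf))

msg = bytes(range(256)) * 64
packets = spray(msg)
random.shuffle(packets)                 # simulate out-of-order delivery
assert reassemble(packets) == msg
```

Because any packet may take any path, a single hot link no longer throttles the whole flow, which is exactly the congestion and tail-latency benefit claimed for SRD.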

Google

Google uses a mix of its own TPUs and NVIDIA GPUs; the two compete with each other and are deployed according to workload suitability. Google is unlikely to use InfiniBand products in its network. Google's AI/ML network is highly customized, and Google has for years deployed its own NVLink-like "coherent" interconnect between accelerators. Google has also innovated heavily in the network stack and deployed home-grown optical circuit switches (OCS), built around micro-electro-mechanical systems (MEMS) mirrors, in both its regular and AI data centers. Optical switches typically eliminate a layer of electrical switches, support higher-radix configurations, and reduce power consumption and latency. Because they simply reflect light, they are independent of network protocols and of switch upgrades. The downside is that mirror reconfiguration is slow, in the tens of milliseconds, so an OCS behaves as a fixed-capacity "circuit". For AI training networks this is not a major issue, since the traffic patterns are predictable.
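A conceptual sketch of why an OCS behaves like a slowly reconfigurable circuit rather than a packet switch, assuming a simple port-mapping model (hypothetical code, not Google's):

```python
# Conceptual sketch (hypothetical, not Google code) of an OCS as a fixed circuit:
# forwarding is just a mirror reflection from an input port to a preconfigured
# output port, while changing that mapping takes tens of milliseconds.
import time

class OpticalCircuitSwitch:
    RECONFIG_SECONDS = 0.03             # MEMS mirror settling time, ~tens of ms

    def __init__(self, circuits: dict[int, int]):
        self.circuits = dict(circuits)  # input port -> output port

    def forward(self, in_port: int) -> int:
        # Protocol-agnostic: light is reflected regardless of packet format or rate.
        return self.circuits[in_port]

    def reconfigure(self, circuits: dict[int, int]) -> None:
        # Expensive relative to per-packet switching, so circuits are set up for
        # long-lived, predictable traffic patterns such as AI training jobs.
        time.sleep(self.RECONFIG_SECONDS)
        self.circuits = dict(circuits)

ocs = OpticalCircuitSwitch({0: 4, 1: 5, 2: 6, 3: 7})
assert ocs.forward(0) == 4
ocs.reconfigure({0: 6, 1: 7, 2: 4, 3: 5})   # pays the ~30 ms penalty once
assert ocs.forward(0) == 6
```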

Microsoft

Microsoft is the most pragmatic of the hyperscalers: it adopted InfiniBand early on to build AI networks for its partner OpenAI. Although Microsoft developed its own custom network adapter and a custom RDMA protocol for the Azure cloud, its openness to InfiniBand, its embrace of NVIDIA's full-stack AI/ML solutions, and its close collaboration with OpenAI make it NVIDIA's preferred customer. Microsoft also acquired Fungible, the inventor of TrueFabric, a UDP-based reliable datagram protocol that handles flow, congestion, and error control and is optimized for tail latency. Some of Fungible's innovations may surface in Microsoft's future products and open-source contributions.
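As a rough illustration of the reliable-datagram idea, here is a toy Python sketch (hypothetical, not Fungible or Microsoft code) in which each datagram carries a sequence number, the receiver acknowledges what arrived, and the sender retransmits the rest after a timeout, which is what keeps tail latency bounded:

```python
# Toy sketch of a reliable datagram service over a lossy, UDP-like channel
# (hypothetical; not the TrueFabric implementation).
import random

def lossy_channel(datagrams, loss_rate=0.2):
    """Stand-in for a UDP path that silently drops some datagrams."""
    return [d for d in datagrams if random.random() > loss_rate]

def send_reliably(payloads, max_rounds=10):
    unacked = {seq: p for seq, p in enumerate(payloads)}
    delivered = {}
    for _ in range(max_rounds):                # each round ~ one retransmission timeout
        if not unacked:
            break
        for seq, payload in lossy_channel(list(unacked.items())):
            delivered[seq] = payload           # receiver side
        acks = set(delivered)                  # ACKs flow back to the sender
        unacked = {s: p for s, p in unacked.items() if s not in acks}
    return [delivered[s] for s in sorted(delivered)]

data = [f"datagram-{i}".encode() for i in range(100)]
assert send_reliably(data) == data   # holds with overwhelming probability here
```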

Meta

Meta is a dark horse in the AI race, and its AI program stands out in several ways:

  • It takes an open-source approach built on foundation models such as Llama.
  • It makes AI approachable for every software engineer through the PyTorch framework and ecosystem.
  • It established the Open Compute Project community as a key pillar of open hardware innovation.
  • It deploys large-scale GPU clusters and stays at the forefront of AI innovation with its recommendation systems (the DLRM models).

Meta's foundation models and the PyTorch ecosystem anchor a huge body of open-source AI innovation. Meta deploys AI/ML clusters based on both Ethernet and InfiniBand, and builds its own ASICs for its DLRM recommendation models and for video transcoding.
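One reason the Ethernet-versus-InfiniBand choice is largely invisible to AI developers is that frameworks sit on collective-communication libraries that pick the transport underneath. A minimal PyTorch sketch (single rank on the CPU "gloo" backend purely so it runs anywhere; a real GPU cluster would launch many ranks with the "nccl" backend over RoCE or InfiniBand):

```python
# Minimal PyTorch collective-communication sketch. In a real AI cluster this is
# launched with many ranks (e.g. via torchrun) using the "nccl" backend, which
# rides on RoCE or InfiniBand underneath; a single-rank "gloo" run is used here
# purely so the snippet executes on any machine.
import torch
import torch.distributed as dist

dist.init_process_group(
    backend="gloo",                      # "nccl" on a GPU cluster
    init_method="tcp://127.0.0.1:29500",
    rank=0,
    world_size=1,
)

grad = torch.ones(4) * dist.get_rank()       # stand-in for a gradient shard
dist.all_reduce(grad, op=dist.ReduceOp.SUM)  # the bandwidth-hungry step in training
print(grad)                                  # with more ranks: sum over all ranks

dist.destroy_process_group()
```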

Meta is democratizing AI. It has not yet received enough recognition for this, but that will soon change.

Oracle

Oracle firmly backs Ethernet and does not use InfiniBand. Oracle Cloud Infrastructure (OCI) combines NVIDIA GPUs and ConnectX NICs to build superclusters based on RoCEv2 RDMA. OCI runs a separate RDMA network built around a customized DC-QCN congestion-notification scheme, minimizes its reliance on PFC, and fine-tunes custom profiles for AI and HPC workloads.
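For context, DC-QCN is an ECN-based scheme in which the receiver returns Congestion Notification Packets (CNPs) and the sender cuts and then recovers its rate. A simplified sketch of that sender-side loop (illustrative only; the real algorithm, and OCI's tuning of it, carries more state, and the constants here are hypothetical):

```python
# Simplified, illustrative sketch of DCQCN-style sender rate control as used
# with RoCEv2 (not Oracle's implementation; constants are hypothetical).
class DcqcnRateLimiter:
    G = 1 / 16          # gain for the congestion estimate alpha
    RATE_AI = 0.5       # additive increase step, Gbps

    def __init__(self, line_rate_gbps: float):
        self.rate = line_rate_gbps        # current sending rate
        self.target = line_rate_gbps      # target rate to recover toward
        self.alpha = 1.0                  # estimate of path congestion

    def on_cnp(self):
        """Receiver saw ECN-marked packets and returned a CNP: remember where
        we were and cut the rate in proportion to alpha."""
        self.target = self.rate
        self.rate *= (1 - self.alpha / 2)
        self.alpha = (1 - self.G) * self.alpha + self.G

    def on_quiet_period(self):
        """No CNP for a while: decay alpha, recover toward the target,
        then probe for more bandwidth additively."""
        self.alpha = (1 - self.G) * self.alpha
        self.target += self.RATE_AI
        self.rate = (self.rate + self.target) / 2

limiter = DcqcnRateLimiter(line_rate_gbps=100.0)
limiter.on_cnp()                 # congestion: rate drops below 100 Gbps
for _ in range(10):
    limiter.on_quiet_period()    # congestion clears: rate climbs back up
print(round(limiter.rate, 1))
```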

NVIDIA

NVIDIA's GPUs and full-stack AI/ML solutions make it the undisputed upstream leader in this market. The NVIDIA DGX Cloud solution pairs the Quantum-2 InfiniBand switch (25.6 Tb/s) with ConnectX and BlueField network adapters, which support both Ethernet and InfiniBand. The DGX Cloud full-stack InfiniBand solution will also be sold into telecom and enterprise markets by NVIDIA and its OEMs. At the same time, NVIDIA is investing heavily in Ethernet through Spectrum-X. A few years ago InfiniBand was the preferred fabric for AI training, which made it the natural choice for NVIDIA's integrated DGX Cloud solution; with the launch of the Spectrum-X Ethernet switch (51.2 Tb/s, twice the capacity of its InfiniBand switch), NVIDIA is shifting toward Ethernet for large-scale GPU deployments to exploit Ethernet's higher port speeds, cost-effectiveness, and scalability. The Spectrum-X switch supports advanced RoCEv2 extensions: RoCE adaptive routing and congestion control, telemetry, and in-network collective computation (via NVIDIA's SHARP technology).
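Adaptive routing is the switch-side counterpart of end-host packet spraying: rather than pinning a flow to one uplink via an ECMP hash, the switch can pick the least-loaded eligible uplink for each packet, with the RoCE endpoint restoring order afterwards. A simplified, hypothetical sketch of that egress decision:

```python
# Simplified sketch of static ECMP hashing vs. adaptive routing on a switch
# (conceptual only; not NVIDIA's implementation, names are hypothetical).

# Current egress queue depths (in packets) for four equal-cost uplinks.
queue_depth = {0: 90, 1: 3, 2: 55, 3: 12}

def ecmp_port(flow_key: tuple) -> int:
    """Classic ECMP: every packet of a flow hashes to the same port, even if
    that port's queue is deep, so large AI 'elephant' flows collide easily."""
    return hash(flow_key) % len(queue_depth)

def adaptive_port() -> int:
    """Adaptive routing: pick the least-loaded eligible port per packet; the
    RoCE endpoint (e.g. the NIC) restores packet order afterwards."""
    return min(queue_depth, key=queue_depth.get)

flow = ("10.0.0.1", "10.0.0.2", 4791, 4791)   # RoCEv2 rides on UDP port 4791
print("ECMP would use port", ecmp_port(flow))
print("adaptive routing uses port", adaptive_port())  # port 1, the emptiest
```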

Broadcom

Broadcom offers comprehensive AI/HPC network solutions, including switch chips and network adapters. Broadcom's strategic acquisition of "Correct Networks" brought in EQDS, a UDP-based transport protocol that moves all queuing from the network core to the transmitting host or leaf switch. This approach underpins the switch optimizations in the Jericho3/Ramon3 chip combination, a "fully scheduled fabric" with packet spraying, reordering buffers in the leaf switches, path rebalancing, congestion notification, and hardware-driven in-band fault recovery. The Tomahawk series (51.2 Tb/s) is optimized for single-chip capacity and is not a fully scheduled fabric. Tomahawk switches still support edge queues and latency-critical functions in hardware, such as global fabric-level load balancing and path rebalancing, but they do not reorder packets in the leaf switches, so packet-reordering buffers must be implemented in the network adapters (endpoints).
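The essence of an edge-queued, fully scheduled fabric is that deep queues live only at the edge: data waits at the sending host or leaf switch and enters the core only against credits granted at the receiver's drain rate. A toy sketch of that scheduling loop (conceptual only, not Broadcom's EQDS implementation):

```python
# Toy sketch of receiver-driven credit scheduling, the idea behind edge-queued /
# "fully scheduled" fabrics (conceptual; not Broadcom's EQDS implementation).
from collections import deque

class Sender:
    def __init__(self, packets):
        self.edge_queue = deque(packets)   # queuing happens here, at the edge

    def transmit(self, credits: int):
        """Only put as many packets into the core as the receiver has granted,
        so core switch buffers stay close to empty."""
        return [self.edge_queue.popleft()
                for _ in range(min(credits, len(self.edge_queue)))]

class Receiver:
    def __init__(self, drain_rate: int):
        self.drain_rate = drain_rate       # packets it can absorb per interval
        self.received = []

    def grant_credits(self, senders):
        """Split the receiver's drain capacity across senders that still have data."""
        active = [s for s in senders if s.edge_queue]
        return {s: self.drain_rate // len(active) for s in active} if active else {}

senders = [Sender([f"s{i}-pkt{j}" for j in range(8)]) for i in range(4)]
rx = Receiver(drain_rate=8)
while any(s.edge_queue for s in senders):           # one scheduling round per loop
    for s, credits in rx.grant_credits(senders).items():
        rx.received.extend(s.transmit(credits))
print(len(rx.received), "packets delivered with no in-fabric queuing")
```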

Cisco

Cisco recently launched its 51.2 Tb/s Silicon One switch, demonstrating the versatility of its network silicon. The switch is P4 programmable, allowing it to be flexibly programmed for a range of network use cases. Cisco's Silicon One-based switches support fully scheduled fabrics, load balancing, hardware fault isolation, and telemetry, and Cisco partners with multiple NIC vendors to deliver complete AI/ML network solutions.
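As a loose illustration of what "P4 programmable" buys, the forwarding behavior reduces to software-defined match-action tables rather than fixed-function logic. Python stands in for P4 here (conceptual only; the table entries are made up):

```python
# Loose illustration of the match-action idea behind a P4-programmable pipeline
# (Python standing in for P4; conceptual only, not Cisco's implementation).

# A "table": match on header fields, pick an action and its parameters.
acl_table = {
    ("10.0.0.0/24", 4791): ("set_queue", {"queue": 7}),   # prioritize RoCEv2
    ("10.0.0.0/24", None): ("forward",   {"port": 12}),
    (None, None):          ("drop",      {}),             # default entry
}

def lookup(dst_prefix, udp_dport):
    """Most-specific match first, mimicking table lookup priority in hardware."""
    for key in ((dst_prefix, udp_dport), (dst_prefix, None), (None, None)):
        if key in acl_table:
            return acl_table[key]

print(lookup("10.0.0.0/24", 4791))   # ('set_queue', {'queue': 7})
print(lookup("10.0.0.0/24", 80))     # ('forward', {'port': 12})
print(lookup("192.168.1.0/24", 80))  # ('drop', {})
```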

Conclusion

The journey toward Ethernet standardization for AI/HPC networks has only just begun; it still needs further cost and power reductions through scale, open innovation, and multi-vendor competition. The Ultra Ethernet Consortium brings together the major networking stakeholders and is committed to creating an open, "full-stack" Ethernet solution tailored to AI/HPC workloads. As the survey above shows, most of the "necessary" AI/HPC networking technologies have already been deployed, in some form, by one Ethernet vendor or hyperscaler or another. The challenge of standardization is therefore not technical; it is about building consensus.
