Broadcom vs. NVIDIA: The 400G/800G Switch Race

At Computex, NVIDIA promised “lossless Ethernet” for AI workloads with its Spectrum-X platform. If you ask Broadcom, however, the idea is not new. Ram Velaga, Senior Vice President of the Core Switching Group at Broadcom, commented that “there is nothing unique about their device.” He explained that NVIDIA is essentially building a vertically integrated Ethernet platform that manages congestion well enough to minimize tail latency (the high percentiles of response time) and reduce AI job completion time. Velaga believes this is no different from what Broadcom has done with its Tomahawk5 and Jericho3-AI switch ASICs. He also sees the launch as NVIDIA’s recognition of how important Ethernet is for carrying GPU traffic in AI clusters.
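To see why tail latency, rather than average latency, drives job completion time, consider a rough sketch in Python. It assumes a simplified model that is not taken from either vendor: a synchronized training step finishes only when its slowest network flow finishes, most flows take about one time unit, and a small congested fraction takes five times longer. The function names and all numbers are hypothetical.

```python
import math
import random


def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def step_time(flow_times):
    """A synchronized step ends only when its slowest flow has finished."""
    return max(flow_times)


def simulate_flows(n_flows, congested_fraction, rng):
    """Hypothetical model: normal flows take ~1.0 time units,
    congested flows take about 5x longer."""
    times = []
    for _ in range(n_flows):
        base = rng.uniform(0.9, 1.1)
        penalty = 5.0 if rng.random() < congested_fraction else 1.0
        times.append(base * penalty)
    return times


if __name__ == "__main__":
    rng = random.Random(0)
    flows = simulate_flows(n_flows=1000, congested_fraction=0.01, rng=rng)
    print("median flow time:", round(percentile(flows, 50), 2))
    print("p99 flow time:   ", round(percentile(flows, 99), 2))
    print("step time (max): ", round(step_time(flows), 2))
```

In this toy model the median barely moves when a handful of flows are congested, but the step time follows the slowest flow. That is the effect both vendors say their congestion management is meant to suppress.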
NVIDIA, for its part, has not abandoned InfiniBand networking; it spent a substantial sum (roughly USD 7 billion) to acquire Mellanox. InfiniBand is well suited to users running a small number of extremely large workloads, such as GPT-3 or digital twins. However, Gilad Shainer, Vice President of Marketing for NVIDIA’s Networking division, explained that in certain environments, especially multi-tenant clouds, Ethernet is the preferred choice. Shainer said that traditional Ethernet infrastructure works well for smaller AI/ML workloads, but these workloads have now outgrown a single node, and traffic between nodes over conventional Ethernet slows them down. NVIDIA claims its Spectrum-X platform addresses this challenge.
Note that NVIDIA’s Spectrum-X is not a standalone product. It is a combination of hardware and software whose core components are NVIDIA’s 51.2 Tbit/s Spectrum-4 Ethernet switch and its BlueField-3 Data Processing Unit (DPU). The basic idea is that NVIDIA’s switch and DPU, when used together, cooperate to alleviate traffic congestion and, if NVIDIA is to be believed, eliminate packet loss entirely.
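As a back-of-the-envelope illustration of what a 51.2 Tbit/s switch means at 400G and 800G port speeds, the following Python sketch divides aggregate bandwidth by port speed. This is simple arithmetic only; the function name is hypothetical, and the port configurations actually offered on a given product depend on its SerDes layout and chassis design.

```python
def max_ports(switch_capacity_tbps, port_speed_gbps):
    """Upper bound on full-rate ports, from aggregate bandwidth alone."""
    capacity_gbps = round(switch_capacity_tbps * 1000)
    return capacity_gbps // port_speed_gbps


if __name__ == "__main__":
    for speed in (400, 800):
        print(f"{speed}G ports: {max_ports(51.2, speed)}")
    # prints:
    # 400G ports: 128
    # 800G ports: 64
```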
Although Shainer presents this as a new capability from NVIDIA, Velaga believes that the idea of “lossless Ethernet” is merely marketing. “Instead of calling it lossless, it’s more accurate to say that you effectively manage congestion to the point where you have a highly efficient Ethernet structure,” he commented.
Furthermore, Velaga claims that this kind of congestion management is already built into Broadcom’s latest generation of switch ASICs, with the difference that they work with smartNICs or DPUs from any vendor or cloud service provider. “You don’t have to do it on the NIC; you can go from one Jericho3-AI leaf to another Jericho3-AI leaf,” he added.
When asked about Broadcom’s Tomahawk5 and Jericho3-AI, Shainer declined to compare them with Spectrum-X, arguing that Spectrum-X belongs in its own category and implying that some vendors are simply adding “AI” to the names of existing products. “No matter what you call it, there’s nothing that has features specifically designed for AI,” he said.
According to Velaga, NVIDIA is attempting vertical integration to address Ethernet congestion. “The reason Ethernet has succeeded today is that it’s a very open ecosystem,” he said. Because of this, NVIDIA’s Spectrum-X may prove hard to sell to cloud providers, who prefer to avoid vendor lock-in. That strong desire to avoid lock-in is what led to the widespread adoption of vendor-agnostic network operating systems such as SONiC, which let providers run their clouds on any compatible switch.
For what it’s worth, NVIDIA’s Spectrum-4 does support SONiC, as well as NVIDIA’s own Cumulus NOS and Linux Switch drivers. However, because the Spectrum-X platform relies on having both Spectrum-4 and BlueField-3 together, you cannot simply swap either one for another SONiC-compatible switch or DPU without losing functionality.
Speaking of DPUs, many major cloud service providers already have SmartNICs tailored to their environments. Amazon Web Services has Nitro, Google co-developed an ASIC-based SmartNIC with Intel, and Microsoft acquired Fungible in January. These devices are highly valuable to cloud providers as they allow offloading common networking, storage, and security workloads, freeing up CPUs to run tenant workloads.
Whether those providers would deploy NVIDIA’s DPUs alongside their own is another question. Shainer said it is entirely feasible: cloud providers can use their existing DPUs to manage their infrastructure and handle north-south traffic, while using NVIDIA’s BlueField-3 to manage east-west traffic between nodes in the cluster.
He added that there is nothing to stop people from deploying NVIDIA’s switches or DPUs as standalone products. “If someone wants to use our switches and build their own solution, we welcome that. If someone wants to use our DPUs and use someone else’s switches, of course, go ahead. You can develop these components on your own,” said Shainer.
However, Velaga from Broadcom is unsure how customers would embrace this idea. “It’s hard to say how the value of vertically integrated Ethernet solutions would be marketed in a world where everything is being broken down,” he commented.
