The Backbone Nobody Sees

 

Copyright: Sanjay Basu

A Deep Dive into HPC Communication Protocols and the Networks That Make Everything Possible

There is a persistent illusion in the world of high-performance computing, and it goes something like this. People assume the hard part is the compute. The GPUs, the CPUs, the accelerators. They obsess over teraflops and petaflops, over CUDA cores and tensor cores, over clock speeds and memory bandwidth. And they are not wrong to care about these things. But they are looking at only half the picture. Maybe less than half.

The truth, which anyone who has spent serious time with large-scale systems will confirm, is that the network is where the magic happens. Or, more precisely, where the magic falls apart when you get it wrong. You can have the most powerful GPUs money can buy, but if they cannot talk to each other with sufficient speed and low enough latency, you might as well be running a very expensive space heater. The history of HPC is, in many ways, the history of figuring out how to get processors to communicate. And that history is far more interesting than most people realize.

The Early Days, or How We Learned to Pass Messages

Before there was a standard, there was chaos. In the late 1980s and early 1990s, the distributed computing world was a Babel of incompatible communication libraries. IBM had its own. Intel had its own. nCUBE, PVM, Express, P4, PARMACS. Each system had its proprietary message-passing library, and if you wrote your code for one platform, good luck porting it to another. The portability problem was not merely annoying. It was existential for the field.

The applications driving this era were the classic workhorses of scientific computing. Computational fluid dynamics, weather simulation, structural mechanics, nuclear weapons modeling. These were problems that demanded enormous parallelism because you were discretizing physical space into millions of cells and solving coupled partial differential equations across them. The Navier-Stokes equations do not simplify themselves just because you are impatient. Each cell needs to know what its neighbors are doing, which means communication. Lots of it. And at the time, the network fabric was often simple Ethernet or proprietary interconnects that varied wildly in performance.

Out of this chaos emerged something remarkable. In 1992, researchers from over forty organizations, including hardware vendors, academics, and software library developers, sat down to create a common standard. By May 1994, they had produced MPI 1.0, the Message Passing Interface. It was not technically sanctioned by any standards body like ANSI or ISO, which gives it an endearingly rebellious quality for something so foundational. But it became the de facto standard anyway, through sheer consensus and practical necessity.


MPI and Its Many Lives

The genius of MPI was its pragmatism. Rather than trying to invent something entirely new, the designers pulled the best ideas from all the existing systems and welded them into a coherent interface. The basic concept was beautifully simple. You have processes. Each process has a rank, which is just a number. Processes communicate by sending and receiving messages. A communicator defines a group of processes that can talk to each other. MPI_COMM_WORLD includes everyone. That is essentially the whole mental model, and it is powerful enough to build everything from weather simulations to genome assemblers.

MPI 1.0 gave us point-to-point communication (send and receive, both blocking and non-blocking) and collective operations (broadcast, reduce, gather, scatter, all-reduce). These collectives are worth dwelling on because they turn out to be absolutely central to everything that followed, including modern AI training. An all-reduce, for instance, takes data from all processes, applies a reduction operation (typically summation), and distributes the result back to everyone. When you are training a neural network across many GPUs and need to average gradients, that is exactly what you are doing. The irony is that a communication pattern designed for fluid dynamics in the 1990s became the backbone of deep learning training in the 2020s.
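To make the all-reduce-as-gradient-averaging connection concrete, here is a toy sketch in plain Python. It shows what an all-reduce *computes*, not how MPI or NCCL implements it: the function name and structure are illustrative only.

```python
def allreduce_mean(per_rank_grads):
    """Toy all-reduce: every rank contributes a gradient vector and
    every rank receives back the element-wise mean, which is exactly
    the data-parallel gradient-averaging step described above."""
    n_ranks = len(per_rank_grads)
    length = len(per_rank_grads[0])
    summed = [sum(rank[i] for rank in per_rank_grads) for i in range(length)]
    mean = [s / n_ranks for s in summed]
    # Every rank receives an identical copy of the reduced result.
    return [mean[:] for _ in range(n_ranks)]

# Four "GPUs", each holding a local gradient for the same two parameters:
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
results = allreduce_mean(grads)
# Each rank now holds [4.0, 5.0], the averaged gradient.
```

In a real framework this one call is where most of the inter-GPU traffic of each training step lives.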

MPI 2.0 arrived in 1997 and brought dynamic process management, one-sided communication (where one process can write directly into another’s memory without the other explicitly participating), and parallel I/O. That last one mattered enormously for HPC storage, because reading and writing data to shared file systems was becoming a serious bottleneck. MPI-IO gave applications a standard way to coordinate parallel access to files, which worked hand-in-glove with parallel file systems like the ones we will discuss shortly.

MPI 3.0 in 2012 was a significant modernization. Non-blocking collectives finally arrived, which meant you could initiate a collective operation and then go do useful work while it completed in the background. One-sided communication got serious enhancements. Fortran 2008 bindings were added. The standard was catching up with the reality that hardware was becoming increasingly heterogeneous.

MPI 4.0, released in June 2021, confronted the elephant in the room. Modern HPC systems run thousands to millions of concurrent threads, a degree of parallelism that was simply unimaginable in 1994. The new standard introduced large-count routines (finally removing the int-sized limitations on message sizes), persistent collectives, partitioned communications for better overlap of computation and communication, and refined initialization methods. Perhaps most importantly, it addressed the growing reality that GPU-accelerated environments needed better support.

And then there is MPI 5.0, released in June 2025, which introduced a standardized Application Binary Interface. This sounds dry until you realize what it means. Different MPI implementations from different vendors can now be interchangeable at the binary level. You compile once, and your code works with any compliant MPI library. For containerized environments and portable HPC, this is genuinely transformative.

The major implementations tell their own story. Open MPI, MPICH, Intel MPI, and MVAPICH2 have each carved out their niches. Open MPI is the open-source workhorse, used on everything from Raspberry Pi clusters to some of the world’s largest supercomputers. MPICH is the reference implementation, and many commercial MPIs (including Intel MPI) are derived from it. MVAPICH2, developed at Ohio State University under the steady stewardship of Dhabaleswar K. (DK) Panda, pioneered GPU-aware MPI implementations and has been instrumental in bridging the gap between traditional HPC and GPU computing.


The Rise of InfiniBand, or How the Network Got Serious

While MPI was evolving as a software standard, the hardware that carried the messages was undergoing its own revolution. In the early days of HPC clustering, Gigabit Ethernet was the default interconnect, and it was adequate for loosely coupled workloads. But for tightly coupled simulations where every timestep requires extensive nearest-neighbor communication, Ethernet’s latency and overhead were painful.

InfiniBand emerged in the early 2000s from a merger of two competing proposals, Future I/O (backed by Compaq, IBM, and HP) and Next Generation I/O (backed by Intel, Microsoft, and Sun). The resulting standard was designed from the ground up for exactly the kind of low-latency, high-bandwidth communication that HPC demanded.

The speed evolution of InfiniBand reads like a technology sprint compressed into two decades. Mellanox (later acquired by NVIDIA in 2020) was the first significant commercial player, delivering 10 Gb/s SDR (Single Data Rate) InfiniBand silicon and boards in 2001. Then DDR (Double Data Rate) at 20 Gb/s arrived in 2004. QDR (Quad Data Rate) doubled again to 40 Gb/s in 2008. FDR (Fourteen Data Rate) reached 56 Gb/s in 2011, notable for being the first generation to adopt the more efficient 64b/66b encoding instead of the older 8b/10b scheme, which meant less overhead and more usable bandwidth per signal.
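The encoding change is worth a quick back-of-envelope calculation. The arithmetic below shows why 64b/66b mattered: the generation names quote signaling rates, and line coding determines how much of that is usable data.

```python
def usable_fraction(payload_bits, coded_bits):
    """Fraction of the raw line rate left for data after line coding."""
    return payload_bits / coded_bits

# 8b/10b (SDR through QDR): every 10 coded bits carry 8 payload bits.
eff_8b10b = usable_fraction(8, 10)    # 0.80
# 64b/66b (FDR onward): every 66 coded bits carry 64 payload bits.
eff_64b66b = usable_fraction(64, 66)  # ~0.97

# QDR's 40 Gb/s signaling rate yields 32 Gb/s of data,
# while FDR's 56 Gb/s signaling rate yields about 54.3 Gb/s.
qdr_data = 40 * eff_8b10b
fdr_data = 56 * eff_64b66b
```

So FDR was not just a faster signal; it gave back roughly 17 percentage points of overhead that 8b/10b had been burning on every link.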

EDR (Enhanced Data Rate) hit 100 Gb/s in 2015, coinciding neatly with the explosion of big data analytics and early deep learning adoption. HDR (High Data Rate) at 200 Gb/s arrived in 2018 and introduced something profound. In-switch HPC acceleration through NVIDIA’s Scalable Hierarchical Aggregation Protocol, or SHARP. This was developed specifically for the Summit and Sierra supercomputers at Oak Ridge and Lawrence Livermore, and it represented a paradigm shift. Instead of merely passing data through the network, the network could now perform data reductions inside the switches themselves. All-reduce operations that previously required data to traverse the entire topology could now be partially computed in-network, dramatically reducing latency and bandwidth consumption.

NDR (Next Data Rate) at 400 Gb/s followed with NVIDIA’s Quantum-2 ASICs, featuring 256 SerDes lanes running at 100 Gb/s each with PAM-4 encoding and aggregate bandwidth of 51.2 Tb/s bi-directional. XDR (eXtreme Data Rate) at 800 Gb/s entered commercial deployment in 2025 with the Quantum-X800 switch and ConnectX-8 NICs. The roadmap extends to GDR (Gigantic Data Rate) at 1.6 Tb/s and LDR at 3.2 Tb/s.

But the speed numbers, impressive as they are, only tell part of the story. What makes InfiniBand fundamentally different from standard Ethernet is its native support for RDMA, or Remote Direct Memory Access. RDMA allows one machine to read from or write to another machine’s memory directly, without involving either machine’s CPU or operating system. The data path bypasses the kernel entirely. This is not an incremental improvement over TCP/IP. It is a completely different paradigm. Typical InfiniBand latencies run from around 5 microseconds for SDR down to under 0.5 microseconds for HDR and beyond. For comparison, 10 Gigabit Ethernet typically shows latencies around 7–10 microseconds, roughly an order of magnitude worse than modern InfiniBand.

InfiniBand’s architecture is based on a switched fabric topology, where nodes connect to switches via high-speed serial links, and switches interconnect to form the fabric. It supports various topologies including fat-tree, mesh, and 3D torus, each with their own trade-offs for different workload patterns. Fat-tree topologies are popular because they provide full bisection bandwidth, meaning any partition of the cluster can communicate with any other partition at full speed. For workloads like deep learning training where all-to-all communication patterns are common, this matters enormously.

The switched fabric also enables advanced features like adaptive routing, where packets can dynamically choose paths through the network to avoid congestion, and Quality of Service mechanisms that allow different traffic classes to be prioritized. These features become critical in large clusters where thousands of nodes are competing for network bandwidth simultaneously.


RoCEv2, or Teaching Ethernet New Tricks

InfiniBand is magnificent, but it has a significant practical limitation. It requires dedicated InfiniBand switches, adapters, and cables. For organizations that have already invested heavily in Ethernet infrastructure, ripping it all out and replacing it with InfiniBand is a non-trivial proposition both financially and operationally.

Enter RDMA over Converged Ethernet, or RoCE (pronounced “rocky,” because apparently the networking community has a sense of humor). RoCE takes the RDMA semantics of InfiniBand and layers them on top of standard Ethernet. The original RoCE v1 was an Ethernet link-layer protocol (Ethertype 0x8915) that only worked within a single Layer 2 broadcast domain. This was limiting because it meant you could not route RoCE traffic across different network segments.

RoCE v2 solved this by encapsulating the InfiniBand transport packets inside UDP/IPv4 or UDP/IPv6 headers (using reserved UDP destination port 4791). This made the protocol routable across Layer 3 networks, which is why it is sometimes called Routable RoCE or RRoCE. The addition of UDP encapsulation also enabled ECMP (Equal-Cost Multi-Path) load balancing, because switches could hash on the source UDP port to distribute traffic across multiple paths.
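The ECMP point can be sketched in a few lines. This is a toy hash, not any switch vendor's actual hardware hash function; the only real constant here is the reserved RoCEv2 destination port, 4791. What it illustrates is that because the destination port is fixed, varying the UDP *source* port is what lets switches spread RoCEv2 flows across equal-cost paths.

```python
import hashlib

ROCE_V2_DST_PORT = 4791  # reserved UDP destination port for RoCEv2

def ecmp_path(src_ip, dst_ip, src_port, n_paths):
    """Toy ECMP: hash the flow identifiers (including the fixed RoCEv2
    destination port) to deterministically pick one of n equal-cost
    paths. Real switches use hardware hash functions over the packet
    headers; this stand-in just shows the mechanism."""
    key = f"{src_ip}-{dst_ip}-{src_port}-{ROCE_V2_DST_PORT}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % n_paths

# Same endpoints, different source ports: flows can land on different paths,
# while any single flow always stays on one path (no reordering).
paths = {ecmp_path("10.0.0.1", "10.0.0.2", p, n_paths=8)
         for p in range(49152, 49160)}
```

Keeping each flow pinned to one path matters because RDMA transports, unlike TCP, tolerate reordering poorly.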

But here is where it gets interesting, and also complicated. Ethernet was not designed to be lossless. Standard Ethernet switches will happily drop packets when buffers overflow, and TCP deals with this through retransmission. RDMA, however, is exquisitely sensitive to packet loss. A single dropped packet can trigger massive latency spikes because the Go-Back-N retransmission strategy used by RoCEv2 means that when one packet is lost, that packet and all subsequent packets in the sequence must be retransmitted, even if they were received successfully.
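The cost of Go-Back-N is easy to quantify. A minimal sketch, with an illustrative function name, counting total transmissions when a single packet is dropped once:

```python
def go_back_n_transmissions(n_packets, lost_index):
    """Total packet transmissions when packet `lost_index` (0-based) is
    dropped once and Go-Back-N rewinds to it: the lost packet and every
    packet after it are sent again, even though the later packets
    originally arrived intact."""
    first_pass = n_packets
    retransmitted = n_packets - lost_index
    return first_pass + retransmitted

# Dropping packet 1 of a 100-packet message costs 99 retransmissions:
total = go_back_n_transmissions(100, lost_index=1)  # 199 total transmissions
# A selective-repeat scheme would resend just the one lost packet (101 total).
```

That near-doubling of traffic from a single loss is exactly why the next two mechanisms exist.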

To make Ethernet behave in a lossless fashion for RoCE traffic, you need two complementary mechanisms. Priority Flow Control (PFC, standardized as IEEE 802.1Qbb) provides hop-by-hop flow control by sending per-priority pause frames when buffer occupancy exceeds a threshold. This prevents packet loss by temporarily pausing the sender, but it introduces its own problems. PFC storms can cascade through the network, freezing entire sections. Victim flows (unrelated traffic sharing the same priority queue) get paused as collateral damage. In extreme cases, circular dependencies between PFC-pausing devices can create deadlocks that require manual intervention to resolve.

Explicit Congestion Notification (ECN) provides a complementary, end-to-end congestion signal. Switches mark packets with ECN bits in the IP header when queue depth exceeds configurable thresholds. When the receiver sees these marks, it sends Congestion Notification Packets (CNPs) back to the sender, which then reduces its transmission rate. ECN is the more graceful mechanism, addressing congestion before it becomes severe enough to trigger PFC pauses.
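A common way to configure the marking threshold is a linear ramp between a low and a high watermark, in the style of DCQCN-like schemes. The sketch below assumes that ramp; the text above only says "configurable thresholds", so treat the ramp shape and the threshold values as assumptions.

```python
def ecn_mark_probability(queue_depth, k_min, k_max):
    """Toy ECN marking policy: below k_min nothing is marked, above
    k_max every packet is marked, and in between the marking
    probability ramps linearly. Thresholds here are illustrative,
    not any particular switch's defaults."""
    if queue_depth <= k_min:
        return 0.0
    if queue_depth >= k_max:
        return 1.0
    return (queue_depth - k_min) / (k_max - k_min)

assert ecn_mark_probability(10, k_min=20, k_max=80) == 0.0  # shallow queue
assert ecn_mark_probability(50, k_min=20, k_max=80) == 0.5  # half-way up the ramp
assert ecn_mark_probability(90, k_min=20, k_max=80) == 1.0  # deep queue
```

The point of the ramp is that senders get a gentle, early signal and back off before the queue ever reaches the PFC threshold.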

The practical reality is that deploying RoCEv2 in production requires careful configuration of Data Center Bridging (DCB) features, proper QoS policies, appropriate buffer allocation, and DSCP settings. It is more operationally complex than InfiniBand, where the lossless behavior is built into the protocol from the ground up. But for organizations committed to Ethernet-based infrastructure, RoCEv2 provides RDMA performance that is increasingly competitive with InfiniBand, particularly at the 100G and 400G speeds where the performance gap has narrowed considerably.


The battle between InfiniBand and Ethernet with RoCEv2 for AI training workloads is one of the defining infrastructure debates of our era. InfiniBand offers lower latency (100 ns switch latency versus roughly 230 ns for comparable Ethernet switches), simpler lossless configuration, mature in-network computing with SHARP, and a well-established HPC ecosystem. RoCEv2 offers broader vendor choice, compatibility with existing Ethernet infrastructure, lower cost per port at comparable speeds, and the advantage that every network engineer in the world already understands Ethernet. The market has not settled this debate, and it may not for years.

The xCCL Revolution, or Why MPI Was Not Enough

As deep learning exploded in scale around 2012–2015, a fundamental mismatch became apparent. MPI was designed for CPU-based distributed computing. Its implementations assumed that the CPU orchestrated communication, that data lived in host memory, and that the overheads of going through the MPI stack (however small they might be) were acceptable relative to computation time. GPU computing changed all three assumptions.

When you are training a neural network, the data lives on the GPU. The computation happens on the GPU. If you need to do an all-reduce to synchronize gradients across multiple GPUs, the last thing you want to do is copy data from GPU memory to host memory, hand it to MPI, have MPI send it through the network, receive it on the other side into host memory, and then copy it back to the GPU. The overhead is devastating.

NVIDIA recognized this problem early and created NCCL (NVIDIA Collective Communication Library, pronounced “nickel”) in 2015. NCCL was purpose-built for GPU-to-GPU communication. It implements collective operations (all-reduce, all-gather, reduce-scatter, broadcast, and reduce, plus point-to-point send/receive since version 2.7) as single CUDA kernels that communicate directly between GPUs without involving the CPU.

The internal architecture of NCCL is fascinating and worth understanding. NCCL automatically detects the topology of the system, understanding which GPUs are connected via PCIe, NVLink, NVSwitch, InfiniBand, or RoCE, and uses this information to build optimized communication patterns. It constructs virtual rings and trees through graph search algorithms, selecting the topology that maximizes bandwidth and minimizes latency for the given hardware configuration.


NCCL uses three internal protocols for data transfer. The Simple protocol is high-bandwidth, operating on large chunks (512 KB per step) with explicit synchronization through CUDA memory fences. It achieves the highest throughput for large messages. The LL (Low Latency) protocol uses 8-byte data units with inline flag-based synchronization, achieving lower latency at the cost of reduced effective bandwidth (only 4 bytes of every 8 are payload). The LL128 protocol is a hybrid, operating on 128-byte units where 120 bytes are payload and 8 bytes are flags, offering a middle ground between throughput and latency. NCCL automatically selects the protocol based on message size and the number of communication channels.
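The payload efficiencies of the three protocols fall straight out of the unit sizes just described. The table below treats the Simple protocol's chunk as pure payload, ignoring its fence-based synchronization, which costs latency rather than payload bytes:

```python
# (unit size in bytes, payload bytes per unit) for NCCL's wire protocols,
# from the descriptions above:
protocols = {
    "Simple": (512 * 1024, 512 * 1024),  # whole 512 KB chunk is payload
    "LL":     (8, 4),                    # 4 of every 8 bytes are payload
    "LL128":  (128, 120),                # 120 of every 128 bytes are payload
}

for name, (unit, payload) in protocols.items():
    print(f"{name:6s} effective bandwidth fraction: {payload / unit:.3f}")
# Simple effective bandwidth fraction: 1.000
# LL     effective bandwidth fraction: 0.500
# LL128  effective bandwidth fraction: 0.938
```

Which is why NCCL picks LL only for small messages, where halving the bandwidth costs almost nothing but skipping the memory fences saves real latency.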

For multi-node communication, NCCL leverages GPUDirect RDMA, which allows network adapters to read and write GPU memory directly without staging through host memory. This eliminates two memory copies (GPU to host, host to GPU) and reduces latency significantly. When GPUDirect RDMA is not available, NCCL falls back to shared memory transfers for intra-node communication and proxied transfers through host memory for inter-node.

The ring all-reduce algorithm, which NCCL uses extensively, works by arranging GPUs in a logical ring and performing the reduction in two phases. In the reduce-scatter phase, each GPU sends a chunk of data to its neighbor around the ring, which combines it with its own chunk, and this continues for (N-1) steps until every GPU holds a fully reduced chunk. In the all-gather phase, the fully reduced chunks are similarly circulated around the ring until every GPU has the complete result. This algorithm achieves optimal bandwidth utilization because each GPU sends and receives exactly the same amount of data regardless of the number of GPUs.
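The two phases can be simulated in plain Python. This is a didactic sketch, not NCCL's implementation: "GPUs" are lists, "sends" are array copies, and the vector length is assumed divisible by the number of ranks.

```python
def ring_allreduce(ranks):
    """Toy ring all-reduce (sum). Each of the N ranks holds a vector
    split into N chunks; reduce-scatter then all-gather circulate
    chunks around the ring in 2*(N-1) total steps, as described above."""
    n = len(ranks)
    chunk = len(ranks[0]) // n
    data = [list(r) for r in ranks]

    def transfer(src, dst, c, accumulate):
        lo, hi = c * chunk, (c + 1) * chunk
        for i in range(lo, hi):
            data[dst][i] = data[dst][i] + data[src][i] if accumulate else data[src][i]

    # Phase 1, reduce-scatter: after n-1 steps, rank r holds the fully
    # summed chunk (r + 1) % n.
    for step in range(n - 1):
        for r in range(n):
            transfer(r, (r + 1) % n, (r - step) % n, accumulate=True)
    # Phase 2, all-gather: circulate the completed chunks for n-1 more steps.
    for step in range(n - 1):
        for r in range(n):
            transfer(r, (r + 1) % n, (r + 1 - step) % n, accumulate=False)
    return data

# Four ranks, one element per chunk:
ranks = [[1, 1, 1, 1], [2, 2, 2, 2], [3, 3, 3, 3], [4, 4, 4, 4]]
out = ring_allreduce(ranks)
# Every rank ends with the element-wise sum [10, 10, 10, 10].
```

Notice that per step each rank sends and receives exactly one chunk, which is where the optimal bandwidth property comes from.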

NCCL’s tree algorithm provides an alternative for latency-sensitive small-message operations. It organizes GPUs in a binary tree and performs the reduction by aggregating data up the tree to the root, then broadcasting the result back down. Tree algorithms have O(log N) latency scaling compared to the ring’s O(N), making them preferable for small messages where latency dominates over bandwidth.
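The latency-scaling difference is stark even at modest scale. A quick comparison of latency-critical step counts, ignoring constant factors and pipelining:

```python
import math

def ring_latency_steps(n):
    """Ring all-reduce: (n-1) reduce-scatter steps + (n-1) all-gather steps."""
    return 2 * (n - 1)

def tree_latency_steps(n):
    """Binary-tree all-reduce: reduce up the tree, then broadcast back
    down, roughly 2 * ceil(log2 n) latency-critical hops."""
    return 2 * math.ceil(math.log2(n))

for n in (8, 64, 1024):
    print(f"{n:5d} GPUs: ring {ring_latency_steps(n):4d} steps, "
          f"tree {tree_latency_steps(n):2d} steps")
#     8 GPUs: ring   14 steps, tree  6 steps
#    64 GPUs: ring  126 steps, tree 12 steps
#  1024 GPUs: ring 2046 steps, tree 20 steps
```

At 1024 GPUs the tree needs two orders of magnitude fewer serialized hops, which is decisive when messages are small enough that per-hop latency, not bandwidth, dominates.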

But NCCL is NVIDIA-specific, which created an immediate problem for the rest of the GPU ecosystem. AMD responded with RCCL (ROCm Collective Communication Library), which mirrors NCCL’s API closely enough that applications can typically switch between them by relinking. RCCL is optimized for AMD GPUs and the HIP programming model, supporting both GPUDirect and host-based communication paths depending on system topology.

Intel entered the arena with oneCCL (oneAPI Collective Communications Library), designed for Intel CPUs and GPUs. Built on top of MPI and libfabric, oneCCL transparently supports multiple interconnects including Intel Omni-Path, InfiniBand, and Ethernet. It includes DL-specific optimizations like prioritization, persistent operations, and out-of-order execution. In a notable convergence, Intel’s 2026 release of oneCCL will default to an API that follows the NCCL API standard, acknowledging NCCL’s dominance as the industry interface for collective communication.

Facebook (now Meta) contributed Gloo, a collective communication library designed for CPU-based distributed training. Gloo supports multiple transports including TCP, InfiniBand verbs, and shared memory. While it lacks the GPU-specific optimizations of NCCL, it provides a vendor-neutral option for environments where GPU collectives are not needed or where the training framework handles GPU communication separately.

Microsoft’s contribution, MSCCL (Microsoft Collective Communication Library), takes a different philosophical approach. Rather than providing fixed algorithms, MSCCL is a platform for executing custom collective communication algorithms. It includes a high-level DSL called MSCCLang and a compiler that generates intermediate representations for MSCCL’s runtime (which itself is built on top of NCCL). This allows researchers to write hyper-optimized collective algorithms for specific topologies and buffer sizes, going beyond what any single library’s built-in algorithms can achieve.

The broader trend these libraries represent continues in MSCCL++ and HiCCL, which further abstract communication from hardware-specific optimizations. HiCCL in particular has demonstrated that by composing collectives hierarchically across different levels of the hardware topology (intra-GPU, intra-node, inter-node), it can match or outperform vendor-specific libraries on AMD, NVIDIA, and Intel hardware. This is significant because it suggests a future where the collective communication layer can be truly hardware-agnostic, which is exactly the kind of future that my own work at MisaLabs is aimed at enabling.

The Storage Side of the Equation

No discussion of HPC communication would be complete without talking about storage, because in practice, compute and network and storage form an inseparable triad. You can have the fastest GPUs and the lowest-latency network in the world, and if your storage cannot feed data to the GPUs fast enough, your expensive hardware sits idle.

The parallel file system was invented to solve exactly this problem. Instead of a single storage server that becomes a bottleneck when hundreds of compute nodes try to read simultaneously, a parallel file system stripes data across many storage servers and allows clients to read from all of them concurrently. The aggregate bandwidth scales with the number of servers, and in a well-designed system, adding more storage servers linearly increases both capacity and throughput.
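Striping is simple enough to sketch exactly. The function below maps a byte offset in a striped file to the storage server holding it, assuming plain round-robin striping; parameter names are illustrative and not any one file system's terminology (in Lustre the servers would be OSTs and the stripe size a per-file layout attribute):

```python
def stripe_location(offset, stripe_size, n_servers):
    """Map a byte offset in a round-robin-striped file to
    (server index, offset within that server's local object)."""
    stripe_index = offset // stripe_size
    server = stripe_index % n_servers
    local_offset = (stripe_index // n_servers) * stripe_size + offset % stripe_size
    return server, local_offset

# 1 MiB stripes across 4 storage servers:
MiB = 1 << 20
assert stripe_location(0, MiB, 4) == (0, 0)            # first stripe, server 0
assert stripe_location(5 * MiB + 100, MiB, 4) == (1, MiB + 100)  # 6th stripe
```

Because consecutive stripes land on different servers, a large sequential read naturally fans out across all of them, which is where the linear bandwidth scaling comes from.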

Lustre, whose name is a portmanteau of “Linux” and “cluster,” has been the dominant parallel file system in HPC since its first release in 2003. It powers six of the top ten fastest supercomputers in the world, over 65% of the top 100, and over 60% of the top 500. Lustre separates metadata (file names, directory structures, permissions) from data, using dedicated Metadata Servers (MDS) and Object Storage Servers (OSS). Clients communicate with the MDS to look up file locations, then read and write data directly to the appropriate OSS, avoiding the metadata server as a bottleneck for data-intensive operations.

Lustre’s journey has been tumultuous. It was purchased by Sun Microsystems, which was then acquired by Oracle, which announced it would stop supporting Lustre, sending the HPC community into a tizzy. The code was forked by multiple organizations, bounced through Xyratex, Intel, Seagate, and DDN. Intel’s 2017 decision to abandon its Lustre commercialization efforts further unsettled the community. But Lustre persists, maintained now primarily by DDN and the open-source community, because nothing else matches its throughput for the large sequential I/O patterns that characterize traditional HPC workloads.

IBM’s GPFS (General Parallel File System, marketed as IBM Spectrum Scale and later IBM Storage Scale) takes a fundamentally different architectural approach. Rather than separating metadata and data into distinct services, GPFS distributes both across all nodes in the cluster. This distributed metadata model eliminates the MDS as a potential bottleneck and provides more consistent performance across mixed workloads. GPFS also includes features that Lustre historically lacked, including file-level data integrity checking, automated tiering, and more robust multi-tenancy support. For enterprise environments running diverse workloads, GPFS often provides a better balance than Lustre.

BeeGFS entered the scene more recently, originating in 2005 at the Fraunhofer Center for High Performance Computing in Germany. Like Lustre, it separates data and metadata services. But BeeGFS was designed from the start with ease of installation and management as primary goals. Where Lustre and GPFS require significant expertise to deploy and maintain, BeeGFS aims to be something that a system administrator with general Linux knowledge can set up and run. This simplicity has made it popular with smaller HPC centers and research groups that lack dedicated storage teams.

The newer entrants are perhaps the most interesting for the AI era. Intel DAOS (Distributed Asynchronous Object Storage) is designed from the ground up for NVMe and persistent memory, bypassing the traditional POSIX file system semantics in favor of a key-value and array-based object model that better matches how modern applications actually access data. WekaIO (now WEKA) is a software-defined parallel file system that achieves extremely high IOPS performance by running entirely in user space, bypassing the kernel entirely, much like RDMA bypasses the kernel for networking. And VAST Data offers an all-flash NFS-based solution that is gaining traction for AI workloads where small-file random I/O patterns dominate over the large sequential patterns that traditional parallel file systems were designed for.

The shift from traditional HPC to AI workloads has fundamentally changed what storage systems need to optimize for. Traditional CFD and weather simulations produce enormous checkpoint files that need to be written periodically and read at restart. The I/O pattern is large sequential writes and reads, which is exactly what Lustre excels at. AI training, by contrast, involves reading millions of small training samples (often images or text snippets of a few kilobytes each) in random order, while periodically writing model checkpoints that can be tens of gigabytes. The read pattern is small and random, the write pattern is large and sequential, and neither pattern alone characterizes the workload.

This mismatch between traditional parallel file system design and AI workload requirements is driving a wave of innovation in storage architecture. Burst buffer systems like DAOS provide ultra-fast NVMe-based storage for checkpoint and restart, while object stores like S3 handle the bulk training data. Multi-tier architectures combine flash storage for active data with disk-based parallel file systems for capacity. And increasingly, GPU-direct storage technologies allow data to flow directly from NVMe drives to GPU memory, bypassing the CPU and file system entirely, mirroring the same kernel-bypass philosophy that RDMA brought to networking.

From Fluid Dynamics to Foundation Models, A Network Evolution Story

The transformation from traditional HPC to AI-dominated computing represents perhaps the most dramatic shift in the history of computing infrastructure, and the network is where you can see it most clearly.

Traditional HPC workloads like computational fluid dynamics, weather modeling, and structural mechanics share a common communication pattern. They discretize physical space into a mesh, assign portions of that mesh to different processors, and each processor needs to exchange boundary data with its neighbors at every timestep. This creates a nearest-neighbor communication pattern that is relatively sparse (each process only talks to a handful of others) but extremely latency-sensitive (the simulation cannot advance until all boundary exchanges complete). The message sizes are moderate, and the ratio of computation to communication is typically high enough that even modest networks can deliver acceptable scaling to hundreds or thousands of cores.

These workloads drove the early development of MPI and InfiniBand. The emphasis was on point-to-point latency and small-message throughput, because that is what nearest-neighbor exchanges demand. Fat-tree topologies were popular because they provided guaranteed bandwidth between any pair of nodes, even though most pairs never actually communicated.


The transition began when GPU computing entered the picture. GPUs were initially adopted for the compute-intensive parts of traditional simulations, accelerating the per-cell calculations within each process’s mesh partition. The communication patterns did not change much, but the ratio of computation to communication shifted dramatically. GPUs could process their mesh partitions much faster than CPUs, which meant the network needed to deliver boundary data much more quickly or become the bottleneck. This drove demand for higher network bandwidth and lower latency, accelerating the adoption of EDR and HDR InfiniBand.

Then deep learning changed everything. Training a large neural network across many GPUs involves a fundamentally different communication pattern. Instead of sparse nearest-neighbor exchanges, you need dense collective operations. Every training step requires synchronizing gradients across all participating GPUs, typically through an all-reduce operation that involves every GPU in the training run. The message sizes can be enormous, as you are essentially moving copies of the entire model’s gradients. And unlike traditional simulations where communication might consume 10–20% of runtime, in data-parallel deep learning training, communication can consume 50% or more of the total time if the network is not fast enough.
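A back-of-envelope calculation makes the 50% figure plausible. The model size, GPU count, and link speed below are illustrative assumptions; the 2(N-1)/N factor is the standard traffic bound for a bandwidth-optimal ring all-reduce.

```python
def ring_allreduce_bytes_per_gpu(grad_bytes, n_gpus):
    """Bytes each GPU sends (and receives) for one ring all-reduce of
    the full gradient: 2 * (n-1)/n * gradient size, approaching twice
    the gradient size as the GPU count grows."""
    return 2 * (n_gpus - 1) / n_gpus * grad_bytes

# A 7B-parameter model with fp16 gradients is ~14 GB of gradient data:
grad_bytes = 7e9 * 2
per_gpu = ring_allreduce_bytes_per_gpu(grad_bytes, n_gpus=1024)
# ~28 GB sent per GPU per training step. At 400 Gb/s (50 GB/s) that is
# over half a second of pure communication per step if none of it is
# overlapped with computation or compressed.
```

Numbers like these are why gradient-communication overlap, bucketing, and in-network reduction are not optional optimizations at scale.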

This shift explains why NCCL was created, why in-network computing (SHARP) was developed, why NVLink and NVSwitch were invented to provide 900 GB/s intra-node GPU-to-GPU bandwidth, and why the latest GPU servers like the NVIDIA DGX systems have eight 400 Gb/s InfiniBand connections, for an aggregate of 3.2 Tb/s of external network bandwidth per node.

The Mixture of Experts (MoE) architecture has introduced yet another wrinkle. MoE models route different input tokens to different expert sub-networks, which means you need all-to-all communication to redistribute tokens across GPUs. All-to-all is far more network-intensive than all-reduce because every GPU needs to send data to every other GPU, not just contribute to a shared aggregation. This puts enormous pressure on bisection bandwidth and has made fat-tree or Clos topologies with full bisection bandwidth essentially mandatory for large-scale MoE training.

Inference workloads present their own networking challenges, different from but no less demanding than training. Large model inference often uses tensor parallelism, where the model’s layers are split across multiple GPUs that must collaborate on every single token generation. The latency of inter-GPU communication directly impacts the time-to-first-token and inter-token latency that end users experience. For real-time applications, every microsecond of network latency translates directly into user-visible delay. This is why inference deployments often prefer NVLink-connected GPU configurations where inter-GPU latency is measured in hundreds of nanoseconds rather than the microseconds of InfiniBand.

Looking Forward

The trajectory of HPC networking is being shaped by several converging forces. Model sizes continue to grow, pushing the boundaries of what even the fastest networks can support. The emergence of ultra-long-context models and multimodal architectures creates new communication patterns that existing collective algorithms were not designed for. The economics of GPU clusters mean that even small improvements in network utilization translate to millions of dollars in saved infrastructure costs.

At the same time, the monopoly position that NVIDIA has built through the combination of CUDA, NCCL, NVLink, and InfiniBand creates both technical excellence and market anxiety. The HPC community has always valued openness and portability, which is why MPI was created as a standard in the first place. The xCCL libraries, UCC (Unified Collective Communications), and UCX (Unified Communication X) represent the community’s response, attempting to create portable abstraction layers that can deliver competitive performance across different vendors’ hardware.

The convergence of Ethernet and InfiniBand continues, with each borrowing ideas from the other. Ultra Ethernet Consortium (UEC) is developing next-generation Ethernet specifications specifically for AI and HPC workloads, incorporating ideas like in-network computing and improved congestion control that were pioneered in InfiniBand. Meanwhile, InfiniBand is adopting some of Ethernet’s flexibility around routing and multi-tenancy.

For storage, the line between networking and storage is blurring entirely. NVMe over Fabrics (NVMe-oF) allows storage devices to be accessed remotely with near-local latency, using either RDMA or TCP as the transport. Computational storage devices can perform data processing at the storage layer, reducing the data that needs to traverse the network. And disaggregated memory architectures like CXL (Compute Express Link) are creating a new tier of shared memory that sits between local DRAM and remote storage, accessible through yet another communication protocol.

What has not changed, and what I suspect will never change, is the fundamental tension at the heart of distributed computing. Compute is fast and getting faster. Networks are fast and getting faster. But the speed of light remains stubbornly constant, and the distance between two racks in a data center imposes physical limits that no amount of clever engineering can eliminate. The history of HPC networking is the history of working within those limits, finding every possible way to squeeze out a few more nanoseconds, a few more gigabytes per second, a few more useful operations before the network becomes the bottleneck again.

It is, when you think about it, the most human of engineering problems. We build these extraordinary machines and then spend decades figuring out how to get them to talk to each other. Much like civilization itself.


Thoughts?

