Clos Networks
*(Image copyright: Sanjay Basu)*
From Bell Labs to Modern Cloud Datacenters
A Deep Technical Dive into Spine-Leaf Architecture, Network Scaling, and AI Infrastructure
Why Clos Networks Matter Now More Than Ever
In 1953, a Bell Labs engineer named Charles Clos published a paper titled "A Study of Non-Blocking Switching Networks" in the Bell System Technical Journal. The problem he solved was deceptively simple. How do you connect any telephone caller to any receiver without ever getting a busy signal due to network congestion? His solution would lie dormant for decades, largely forgotten outside telecommunications circles, before experiencing a dramatic resurrection in the age of cloud computing.
Today, every major hyperscaler runs their datacenters on Clos-derived architectures. Google, Microsoft Azure, Amazon Web Services, and Oracle Cloud Infrastructure all deploy variations of the spine-leaf topology that traces its mathematical foundations directly to Clos's original work. When you train a large language model across thousands of GPUs, when you run distributed databases across multiple availability zones, when you stream video to millions of concurrent users, your packets traverse networks built on principles conceived for telephone switching.
This article provides a comprehensive examination of Clos networking from first principles to production deployment. We will explore the mathematics that make non-blocking networks possible, trace the evolution from three-stage telephone switches to five-stage datacenter fabrics with super-spines, and examine the protocols that bring these architectures to life: BGP for routing, ECMP for load balancing, VXLAN for overlay networks, and the congestion control mechanisms that enable lossless fabrics for AI workloads.
The Birth of Non-Blocking Networks
Charles Clos and the Telephone Switching Problem
Charles Clos (pronounced "Kloh") was born on June 21, 1905 and worked at Bell Labs until his retirement from AT&T in 1970. He passed away on December 22, 1988 in New York. While relatively unknown outside telecommunications engineering, his 1953 paper has accumulated over 1,700 academic citations and forms the theoretical foundation for virtually all modern datacenter network design.
The problem Clos addressed was fundamentally economic. In the early telephone era, connecting N inputs to N outputs required a crossbar switch with N x N crosspoints. Each crosspoint was a physical relay that could establish a connection. For a 1,000-subscriber exchange, this meant 1,000,000 crosspoints. The cost scaled quadratically, making large exchanges prohibitively expensive.
Clos's insight was elegant. By organizing switches into multiple stages with specific connectivity patterns, you could achieve non-blocking behavior with far fewer crosspoints. His three-stage architecture could connect any idle input to any idle output without blocking, while reducing crosspoint count from O(N^2) to approximately O(N^(3/2)). For large N, this represented enormous cost savings.
The Mathematical Foundation
A three-stage Clos network is parameterized by three integers: n, m, and r. The parameter n represents the number of inputs feeding into each of r ingress-stage switches. Each ingress switch has m outputs connecting to m middle-stage switches. The egress stage mirrors the ingress stage with r switches of m inputs and n outputs.
For a symmetric network with N total inputs and N total outputs, the parameters satisfy N = n x r.
The Non-Blocking Condition
Clos proved that for a three-stage network to be strictly non-blocking (meaning any new connection can always be established without rearranging existing connections), the number of middle-stage switches m must satisfy m >= 2n - 1.
The proof is intuitive. Consider the worst case. You want to connect an idle input on ingress switch A to an idle output on egress switch B. In the adversarial scenario, ingress switch A already has (n-1) other active connections, each consuming a different middle-stage switch. Similarly, egress switch B has (n-1) active connections, each using different middle switches. In the absolute worst case, all (2n-2) of these connections use different middle switches. To guarantee a path exists, you need at least one more middle switch: m = 2n - 2 + 1 = 2n - 1.
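The counting argument can be checked directly. A minimal Python sketch (the function name is mine, chosen for illustration):

```python
def middle_switch_available(n, m):
    # Adversarial case: the ingress switch's other (n-1) connections and
    # the egress switch's other (n-1) connections each occupy a distinct
    # middle-stage switch, blocking at most 2(n-1) of the m middle switches.
    worst_case_blocked = 2 * (n - 1)
    return m - worst_case_blocked >= 1

# m = 2n - 1 always leaves a free middle switch; m = 2n - 2 may not.
assert middle_switch_available(4, 2 * 4 - 1)
assert not middle_switch_available(4, 2 * 4 - 2)
```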
Crosspoint Calculation
For a symmetric three-stage Clos network C(n, m, r) with N = n x r inputs:
Ingress stage: r switches of n x m each = r x n x m crosspoints
Middle stage: m switches of r x r each = m x r x r crosspoints
Egress stage: r switches of m x n each = r x m x n crosspoints
Total crosspoints = 2rnm + mr^2
Setting m = 2n - 1 and optimizing, the minimum crosspoint count occurs when n is approximately the square root of N/2. For N = 512 (n = 16, r = 32, m = 31), a crossbar requires 262,144 crosspoints while the Clos network requires 63,488, roughly a fourfold saving. The savings become more dramatic as N increases.
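These counts are easy to verify numerically. A short Python sketch (variable names are mine, for illustration):

```python
def crossbar_crosspoints(N):
    # Single-stage crossbar: one crosspoint per input/output pair.
    return N * N

def clos_crosspoints(n, r, m):
    # Three-stage Clos C(n, m, r): ingress r switches of n x m,
    # middle m switches of r x r, egress r switches of m x n.
    return 2 * r * n * m + m * r * r

N = 512
n = 16           # ~ sqrt(N / 2), the crosspoint-minimizing choice
r = N // n       # 32 ingress (and egress) switches
m = 2 * n - 1    # strict non-blocking condition: m >= 2n - 1
print(crossbar_crosspoints(N))    # 262144
print(clos_crosspoints(n, r, m))  # 63488
```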
Table 1: Crosspoint Comparison (Crossbar vs. Clos)
*(Table image created by Sanjay Basu)*
Spine-Leaf Architecture
From Telephone Switches to Packet Networks
The transition from circuit-switched telephone networks to packet-switched datacenter networks required reinterpreting Clos's principles. In the original formulation, a "connection" was a dedicated circuit held for the duration of a call. In packet networks, each packet independently seeks a path through the network.
The modern datacenter implementation maps Clos stages to physical switch tiers. The three-stage Clos becomes a leaf-spine architecture. Leaf switches (ingress/egress stages) connect to servers, while spine switches (middle stage) provide interconnection. Every leaf connects to every spine, creating the characteristic full-mesh pattern between tiers.
This architecture delivers several critical properties for modern workloads. First, any server can communicate with any other server through at most two switching hops (leaf to spine to leaf), providing predictable, uniform latency. Second, aggregate bandwidth scales linearly by adding spine switches. Third, multiple equal-cost paths exist between any source and destination, enabling load balancing and fault tolerance.
Three-Stage (Leaf-Spine) Architecture
In a standard three-stage datacenter Clos, leaf switches serve as the access layer connecting directly to compute, storage, and network services. Each leaf maintains uplinks to every spine switch in the fabric. Spine switches provide the high-bandwidth interconnection layer, forwarding traffic between leaves without connecting to endpoints directly.
Consider a practical example. A fabric with 4 spine switches and 32 leaf switches, where each leaf has 48 server-facing ports and 4 uplinks (one to each spine). This topology supports 1,536 server connections (32 x 48). If each uplink runs at 100 Gbps, the total spine-tier bandwidth is 12.8 Tbps (4 spines x 32 leaves x 100 Gbps).
The oversubscription ratio compares downlink capacity to uplink capacity. In this example, each leaf has 48 x 25 Gbps = 1,200 Gbps of server-facing capacity and 4 x 100 Gbps = 400 Gbps of spine-facing capacity, yielding a 3:1 oversubscription. For AI/ML workloads requiring east-west heavy traffic patterns, operators typically deploy 1:1 (non-blocking) or 2:1 fabrics.
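The arithmetic generalizes to any leaf configuration. A quick sketch (the helper name and the second example's port counts are mine, for illustration):

```python
def oversubscription(server_ports, server_gbps, uplinks, uplink_gbps):
    # Ratio of server-facing (downlink) to spine-facing (uplink)
    # capacity on a single leaf switch.
    downlink = server_ports * server_gbps
    uplink = uplinks * uplink_gbps
    return downlink / uplink

# The example leaf: 48 x 25 Gbps down, 4 x 100 Gbps up -> 3:1
print(oversubscription(48, 25, 4, 100))  # 3.0
# A non-blocking (1:1) leaf: 32 x 100 Gbps down, 8 x 400 Gbps up
print(oversubscription(32, 100, 8, 400))  # 1.0
```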
Five-Stage Architecture with Super-Spines
When a three-stage fabric reaches its scaling limits (determined by spine switch port density), operators deploy a five-stage Clos by adding super-spine switches. The five stages are: leaf, spine, super-spine, spine, leaf. This architecture segments the network into multiple "pods," each containing its own leaf-spine fabric, with super-spines providing inter-pod connectivity.
The mathematics extend naturally. If each pod contains S spine switches and L leaves, and there are P pods, the super-spine layer requires enough switches to maintain non-blocking connectivity between pods. With K super-spines and each spine connecting to each super-spine, the effective middle-stage width becomes K. For strict non-blocking behavior across pods, K >= 2S - 1.
Hyperscale deployments often implement variations called "fabric planes." Rather than a single super-spine layer, multiple independent planes each provide full connectivity. This design improves fault isolation: a failure in one plane does not affect others. Traffic distributes across planes using ECMP, maintaining full bandwidth until multiple planes fail simultaneously.
Cisco's hyperscale fabric design, for example, deploys four fabric planes. Each pod contains 64 leaf switches connecting to 64 fabric switches. The fabric switches connect to 64 spine-plane switches per plane. With four planes, the architecture supports over 260,000 server connections while maintaining near-linear scaling.
Equal-Cost Multi-Path Routing (ECMP)
The End of Spanning Tree
Traditional Layer 2 networks used Spanning Tree Protocol (STP) to prevent loops by blocking redundant links. This approach wastes bandwidth. If you have four paths between two switches, STP blocks three of them. The active path carries all traffic while redundant paths sit idle until a failure.
Spine-leaf architectures eliminate STP by operating at Layer 3 (routed) between tiers. When multiple equal-cost routes exist to a destination, ECMP distributes traffic across all paths simultaneously. In a four-spine fabric, traffic between any two leaves can utilize all four spine switches concurrently, quadrupling effective bandwidth compared to STP-blocked designs.
Hash-Based Load Balancing
ECMP selects paths using hash functions computed over packet headers. A typical five-tuple hash includes source IP, destination IP, source port, destination port, and protocol. The hash output maps to one of the available next-hops, ensuring that all packets of a given flow traverse the same path (preserving packet ordering) while different flows distribute across paths.
The hash function's quality directly impacts load distribution. Poor hashing can create "polarization" where traffic clusters on certain paths. Modern switches implement sophisticated hash algorithms (CRC-based, polynomial-based) and often allow operators to configure which header fields participate in the hash. Some implementations add entropy through randomized salt values.
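A toy illustration of per-flow path selection (CRC32 stands in here for the proprietary hardware hash; names and addresses are invented):

```python
import zlib

def ecmp_next_hop(src_ip, dst_ip, src_port, dst_port, proto, next_hops):
    # Hash the five-tuple; every packet of a flow picks the same next-hop,
    # preserving ordering, while distinct flows spread across paths.
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]

spines = ["spine1", "spine2", "spine3", "spine4"]
path_a = ecmp_next_hop("10.0.1.5", "10.0.2.9", 49152, 443, 6, spines)
path_b = ecmp_next_hop("10.0.1.5", "10.0.2.9", 49152, 443, 6, spines)
assert path_a == path_b  # same flow, same spine: ordering preserved
```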
Configuration Example (BGP with ECMP): On Arista EOS, enabling ECMP requires configuring BGP with the maximum-paths directive and often the as-path multipath-relax option (which allows ECMP across paths learned from different autonomous systems):
router bgp 65001
   maximum-paths 4
   bgp bestpath as-path multipath-relax
ECMP and the Elephant Flow Problem
Hash-based ECMP has a fundamental limitation. It operates per-flow, not per-packet. A single large flow (an "elephant") consumes one path entirely while other paths remain underutilized. In a four-spine fabric, four elephant flows hashing to the same spine create congestion even though 75% of spine capacity sits idle.
Several approaches mitigate this problem. Flowlet switching breaks long flows into shorter segments, rehashing at natural gaps in packet arrival. Adaptive routing monitors spine congestion and dynamically redirects traffic. NVIDIA's Spectrum switches implement these techniques in hardware, maintaining flow ordering while improving utilization.
For AI workloads generating massive all-reduce traffic patterns, elephant flows dominate. Operators address this through careful topology design (more spines), traffic engineering (spreading collective operations across multiple flows), and advanced load balancing (adaptive routing, flowlet switching, or even per-packet spraying for loss-tolerant workloads).
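Flowlet switching can be sketched in a few lines. This is a simplified model (the class name, timeout value, and salting scheme are mine, not any vendor's implementation):

```python
import zlib

class FlowletBalancer:
    # Keep a flow on its current path while packets arrive back-to-back;
    # after an idle gap longer than the flowlet timeout, rehash with a
    # time-derived salt so the next burst may land on a different path.
    def __init__(self, paths, gap_s=0.0005):
        self.paths = paths
        self.gap_s = gap_s
        self.state = {}  # flow key -> (last packet time, chosen path)

    def pick(self, flow_key, now_s):
        last = self.state.get(flow_key)
        if last is not None and now_s - last[0] < self.gap_s:
            path = last[1]  # still inside the flowlet: keep ordering
        else:
            salt = int(now_s / self.gap_s)  # new flowlet: allow a new path
            h = zlib.crc32(f"{flow_key}|{salt}".encode())
            path = self.paths[h % len(self.paths)]
        self.state[flow_key] = (now_s, path)
        return path

lb = FlowletBalancer(["spine1", "spine2", "spine3", "spine4"])
p1 = lb.pick("flow-A", 0.0000)
p2 = lb.pick("flow-A", 0.0001)  # within the gap: same path, order kept
assert p1 == p2
```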
BGP as the Datacenter Routing Protocol
Why BGP for Underlay Routing
Border Gateway Protocol has become the de facto standard for datacenter fabric routing, despite its origins in Internet inter-domain routing. Several factors drive this adoption. BGP scales to enormous route counts without the computation overhead of link-state protocols. Its policy-rich nature allows fine-grained control over route advertisement and selection. The protocol operates over TCP, providing reliable transport without additional mechanisms.
RFC 7938 ("Use of BGP for Routing in Large-Scale Data Centers") codified best practices for datacenter BGP deployments. The document recommends using eBGP (external BGP) between tiers, assigning each switch its own autonomous system number. This simplifies configuration by avoiding the iBGP full-mesh requirement and route reflector complexity.
Typical ASN Assignment Pattern:
Super-spines: ASN 65001 (shared or unique per device)
Spines: ASN 65100-65199 (unique per spine)
Leaves: ASN 65200-65299 (unique per leaf)
BGP Configuration for Leaf-Spine
Datacenter BGP configurations emphasize simplicity and convergence speed. Key settings include: aggressive timers (BFD sub-second failure detection, reduced hold-timers), limited route advertisement (only loopback addresses propagate through the underlay), and relaxed path comparison (allowing ECMP across different AS paths).
BGP unnumbered simplifies configuration further by eliminating the need to assign IP addresses to inter-switch links. Switches establish BGP sessions using IPv6 link-local addresses and advertise IPv4 prefixes with IPv6 next-hops. This reduces configuration complexity and IP address consumption in large fabrics.
VRFs and Multi-Tenancy
Virtual Routing and Forwarding (VRF) instances create isolated routing domains on shared infrastructure. Each VRF maintains its own routing table, allowing overlapping IP address spaces between tenants. In the datacenter context, VRFs segment traffic for different customers, applications, or security zones.
VRF implementation involves creating the VRF instance, assigning interfaces to it, and configuring routing protocols within that VRF context. Routes remain isolated by default. Traffic in VRF-A cannot reach destinations in VRF-B unless explicitly permitted.
BGP Route Leaking Between VRFs
Some scenarios require controlled communication between VRFs. Shared services (DNS, NTP), internet access through a central egress point, or inter-tenant connectivity via a firewall. Route leaking enables this by importing routes from one VRF into another.
BGP-based route leaking uses route targets (RTs) and route distinguishers (RDs) inherited from MPLS VPN technology. A route exported from VRF-A carries an RT that VRF-B imports, making the route appear in both routing tables. Route maps filter which specific routes leak, preventing unintended exposure.
Configuration Example (NVIDIA Cumulus):
router bgp 65001 vrf RED
 address-family ipv4 unicast
  import vrf BLUE
  import vrf route-map FILTER-ROUTES
VXLAN: The Overlay Network Protocol
Why Overlay Networks
Layer 2 domains traditionally required physical network adjacency. Servers in the same VLAN had to connect through switches maintaining that VLAN. This constraint complicated datacenter operations. Virtual machine migration required VLAN extension, multi-tenant isolation demanded physical separation, and the 4,094 VLAN limit (12-bit VLAN ID) restricted scalability.
VXLAN (Virtual Extensible LAN) solves these problems by encapsulating Layer 2 frames in UDP packets, creating virtual networks that ride over the Layer 3 underlay. A VXLAN network identifier (VNI) uses 24 bits, supporting approximately 16 million logical networks. Servers in the same VNI appear to share a Layer 2 broadcast domain regardless of their physical location.
VXLAN Architecture
VXLAN tunnel endpoints (VTEPs) perform encapsulation and decapsulation. A VTEP can be a physical switch (hardware VTEP), a hypervisor virtual switch (software VTEP), or a network appliance. Each VTEP has a unique IP address in the underlay network used as the outer source/destination for encapsulated packets.
VXLAN Packet Structure:
Outer Ethernet Header: VTEP-to-VTEP MAC addresses
Outer IP Header: VTEP source and destination IPs
Outer UDP Header: Destination port 4789 (IANA assigned)
VXLAN Header: 24-bit VNI identifying the virtual network
Original Ethernet Frame: The tenant traffic being tunneled
The encapsulation adds approximately 50 bytes of overhead. For 1500-byte frames, the outer packet size reaches 1550 bytes, requiring jumbo frame support (typically MTU 9000+) on the underlay network to avoid fragmentation.
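The header layout and overhead arithmetic can be shown concretely. A minimal sketch following the RFC 7348 field layout (the helper name is mine):

```python
import struct

VXLAN_FLAG_I = 0x08  # "VNI present" flag bit (RFC 7348)

def vxlan_header(vni):
    # 8-byte VXLAN header: flags (8 bits) + 24 reserved bits,
    # then the 24-bit VNI + 8 reserved bits.
    assert 0 <= vni < 2 ** 24  # VNIs are 24-bit: ~16 million networks
    return struct.pack("!II", VXLAN_FLAG_I << 24, vni << 8)

hdr = vxlan_header(5000)
assert len(hdr) == 8

# Total encapsulation overhead on the wire:
overhead = 14 + 20 + 8 + len(hdr)  # outer Ethernet + IPv4 + UDP + VXLAN
assert overhead == 50

# Decapsulation side: recover the VNI from the second 32-bit word.
flags_word, vni_word = struct.unpack("!II", hdr)
assert vni_word >> 8 == 5000
```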
EVPN-VXLAN: Control Plane Intelligence
Early VXLAN implementations used flood-and-learn for MAC address discovery, similar to traditional Ethernet. This approach scales poorly. Broadcast traffic floods to all VTEPs in the VNI, consuming bandwidth and processing resources.
EVPN (Ethernet VPN) provides a BGP-based control plane for VXLAN. VTEPs advertise their locally-learned MAC addresses via BGP EVPN routes. Remote VTEPs populate their forwarding tables from these advertisements, eliminating flood-and-learn. Unknown unicast, broadcast, and multicast (BUM) traffic can use ingress replication or multicast distribution.
EVPN Route Types:
Type-2: MAC/IP advertisement (host reachability)
Type-3: Inclusive multicast Ethernet tag (BUM handling)
Type-5: IP prefix route (inter-subnet routing)
The combination of EVPN and VXLAN has become the standard architecture for modern datacenter overlay networks, replacing legacy technologies like VPLS for Layer 2 services and providing integrated Layer 2/Layer 3 virtualization.
VTEP Architecture
A VTEP (VXLAN Tunnel Endpoint) is the network device responsible for encapsulating and decapsulating VXLAN traffic at the boundary between the overlay and underlay networks. Each VTEP has a unique IP address on the underlay network and acts as the ingress/egress point for VXLAN tunnels.
When a VM or server sends an Ethernet frame, the local VTEP encapsulates it with VXLAN, UDP, and outer IP headers, using its own IP as the source and the destination VTEP's IP as the target, then transmits it across the Layer 3 fabric. The receiving VTEP strips the encapsulation and delivers the original frame to the destination endpoint.
VTEPs maintain MAC-to-VTEP mappings either through traditional flood-and-learn (multicast-based) or, more commonly in modern deployments, via BGP EVPN control-plane distribution, which eliminates flooding overhead and enables efficient MAC/IP advertisement across the fabric. VTEPs can be implemented in hardware (leaf switches like Arista, Cisco Nexus, or Juniper QFX), in software on hypervisors (VMware NSX, Linux with Open vSwitch), or on SmartNICs for offloaded performance.
In OCI's architecture, the SmartNIC functions as the VTEP, handling all encapsulation and decapsulation in dedicated hardware while keeping the customer's CPU cycles fully available for workloads.
Lossless Networking for AI Workloads
The Challenge of GPU Cluster Networking
AI training workloads generate network traffic patterns fundamentally different from traditional datacenter applications. During all-reduce operations in distributed training, thousands of GPUs simultaneously exchange gradient data. A single straggler can stall the entire training step. Packet loss triggers retransmission, adding latency that compounds across training iterations.
RDMA (Remote Direct Memory Access) enables GPU-to-GPU communication without CPU involvement, reducing latency from milliseconds to microseconds. However, RDMA protocols historically required lossless networks. Even a single dropped packet causes connection resets and performance degradation orders of magnitude worse than TCP retransmission.
RoCEv2 (RDMA over Converged Ethernet version 2) brings RDMA capabilities to standard Ethernet networks, but requires congestion control mechanisms to approach lossless behavior.
Priority Flow Control (PFC)
PFC provides link-level flow control on a per-priority basis. When a switch's receive buffer for a particular traffic class exceeds a threshold, it sends a PFC pause frame to the upstream sender. The sender stops transmitting that priority class until receiving a resume signal.
PFC prevents buffer overflow and packet drops, but introduces problems of its own. Pause propagation can cascade through the network, creating "head-of-line blocking" where paused traffic affects unrelated flows. Persistent PFC storms can deadlock the network. Improper configuration leads to the "parking lot problem" where bandwidth distributes unfairly.
Best practices limit PFC to specific traffic classes (RoCEv2 traffic, typically DSCP 26 or priority 3), configure appropriate buffer thresholds, and implement watchdog timers to detect and break PFC storms.
Explicit Congestion Notification (ECN)
ECN provides end-to-end congestion signaling without dropping packets. When switch queue depth exceeds a threshold, the switch marks packets with the Congestion Experienced (CE) bit in the IP header. The receiver reflects this marking to the sender, which reduces its transmission rate.
ECN operates earlier and more gradually than PFC. While PFC is a binary stop/go mechanism triggered by buffer near-overflow, ECN can signal congestion while buffers still have headroom, allowing senders to slow down before queues fill.
DCQCN: Combining ECN and PFC
Data Center Quantized Congestion Notification (DCQCN) combines ECN and PFC for RoCEv2 networks. ECN serves as the primary congestion control mechanism, with PFC as a safety net for situations where ECN cannot react quickly enough. The key to DCQCN operation is threshold tuning. ECN marking must begin early enough to slow senders before PFC triggers. PFC thresholds must allow headroom for ECN to work while preventing buffer overflow. The relationship between thresholds:
ECN_MIN < ECN_MAX < PFC_XOFF < Buffer_Size
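This ordering is a configuration invariant worth checking mechanically. A tiny sketch (function name and the example values, in KB, are mine, not vendor defaults):

```python
def dcqcn_thresholds_sane(ecn_min, ecn_max, pfc_xoff, buffer_size):
    # ECN marking must engage before PFC pauses traffic, and PFC must
    # fire before the buffer overflows:
    # ECN_MIN < ECN_MAX < PFC_XOFF < Buffer_Size
    return ecn_min < ecn_max < pfc_xoff < buffer_size

# Illustrative per-port values in KB:
assert dcqcn_thresholds_sane(150, 1500, 3000, 4000)
assert not dcqcn_thresholds_sane(3500, 1500, 3000, 4000)  # ECN after PFC
```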
NVIDIA Spectrum switches and Broadcom NICs implement DCQCN in hardware, maintaining per-flow congestion state and rate control. Congestion Notification Packets (CNPs) signal congestion from receiver to sender, triggering rate reduction algorithms similar to TCP congestion control.
Table 2: Typical DCQCN Threshold Configuration
Non-Blocking Network Design Principles
What "Non-Blocking" Really Means
The term "non-blocking" is frequently misused in datacenter marketing. A strict non-blocking network guarantees that any new connection request can be satisfied without rearranging existing connections and without experiencing congestion at any point in the network. In Clos terms, this requires m >= 2n - 1 middle-stage switches.
A rearrangeably non-blocking network can satisfy any connection request, but may need to reroute existing connections. Clos showed this requires only m >= n middle switches, half the strict requirement. While practical for circuit-switched networks with a central controller, rearrangeable non-blocking is rarely implemented in packet networks.
In datacenter marketing, "non-blocking" typically means 1:1 oversubscription (aggregate downlink bandwidth equals aggregate uplink bandwidth). This ensures the network can sustain all-to-all traffic at full rate, but does not guarantee freedom from congestion. Incast patterns, elephant flows, and momentary bursts can still cause queue buildup even in 1:1 fabrics.
Bisection Bandwidth
Bisection bandwidth measures the worst-case bandwidth when the network is partitioned into two equal halves. In a Clos network with full bisection bandwidth, you can divide the leaves into two groups and the aggregate bandwidth between groups equals the aggregate leaf downlink capacity.
Calculation Example: Consider 16 leaves, each with 400 Gbps downlink capacity (total 6.4 Tbps). For full bisection bandwidth, the spine tier must support 3.2 Tbps crossing from one half to the other. With 4 spines, each spine carries 800 Gbps of cross-half traffic, requiring 800 Gbps from each half (16/2 = 8 leaves x 100 Gbps uplink per leaf per spine).
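The same check can be sketched as a function (the helper name is mine; the numbers mirror the example above):

```python
def bisection_check(leaves, leaf_down_gbps, spines, uplink_gbps):
    # Cross-half capacity: each of the leaves/2 leaves in one half owns
    # one uplink to every spine, and all of it can cross the bisection.
    total_down = leaves * leaf_down_gbps
    cross_half = (leaves // 2) * spines * uplink_gbps
    full_bisection = cross_half >= total_down // 2
    return total_down, cross_half, full_bisection

# The example fabric: 16 leaves x 400 Gbps down, 4 spines, 100 Gbps links.
print(bisection_check(16, 400, 4, 100))  # (6400, 3200, True)
```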
Oversubscription Ratios and Trade-offs
Not all workloads require non-blocking fabrics. Traditional web applications with north-south traffic patterns (client to server) may tolerate 3:1 or higher oversubscription. Big data workloads with localized computation benefit from rack-local optimization more than fabric-wide bandwidth.
AI training represents the extreme case: collective operations create all-to-all traffic patterns that fully exercise east-west bandwidth. GPU clusters typically deploy 1:1 fabrics with additional optimizations (rail optimization matching GPU-to-NIC topology, dedicated RDMA networks, adaptive routing).
Table 3: Oversubscription Recommendations by Workload
Oracle Cloud Infrastructure
Gen2 Network Architecture
Off-Box Virtualization and Network Design
OCI's Gen2 architecture fundamentally differs from competitors by moving all virtualization functions to dedicated SmartNIC hardware. The hypervisor overhead that typically consumes CPU cycles and introduces latency variability runs entirely on the off-box infrastructure. This design has profound implications for network architecture.
Customer bare metal instances connect to the network fabric through SmartNICs that handle VXLAN encapsulation, security ACLs, QoS enforcement, and overlay routing at line rate. The host CPU sees raw network performance indistinguishable from direct hardware attachment. No CPU cycles are consumed for network virtualization.
The SmartNIC architecture enables OCI's flat-rate pricing model. Since virtualization costs are fixed in dedicated hardware rather than scaling with VM size, OCI can offer consistent performance guarantees regardless of instance type.
OCI RDMA Cluster Network
For GPU workloads, OCI provides dedicated RDMA cluster networks built on non-blocking Clos topologies. The BM.GPU.H100.8 instances each contain 8 NVIDIA H100 GPUs with 8 dedicated 200 Gbps ConnectX-7 NICs, providing 1.6 Tbps of network bandwidth per node.
The cluster network uses GPUDirect RDMA, allowing GPU-to-GPU communication without traversing host memory or CPU. Data flows directly from one GPU's HBM, through the local NIC, across the fabric, through the remote NIC, and into the destination GPU's HBM. This path achieves sub-5 microsecond latency with full line-rate bandwidth.
The network fabric maintains full bisection bandwidth across the cluster. During all-reduce operations, all GPUs communicate simultaneously without contention. The non-blocking Clos topology ensures no bottleneck exists regardless of communication pattern.
Scaling to Superclusters
OCI's largest GPU configurations scale to 65,536+ GPUs in a single RDMA domain. This scale requires multiple Clos stages and careful attention to rail optimization. Each GPU connects to a dedicated NIC; NICs in the same position across nodes (the "rail") share network paths through the fabric.
Rail-optimized topology ensures that collective operations naturally align with network structure. All-reduce implementations like NVIDIA's NCCL exploit this alignment, communicating primarily within rails before exchanging across rails. The fabric supports both patterns at full bandwidth.
The Continuing Evolution
Charles Clos could not have imagined that his telephone switching research would underpin the infrastructure running large language models seven decades later. Yet the mathematical principles he established remain central to modern network design. The three-stage structure, the non-blocking condition, the trade-off between crosspoint count and blocking probability, all continue to guide architects building networks for AI superclusters.
The protocols layered atop Clos topologies continue evolving. BGP extensions for EVPN enable sophisticated overlay services. ECMP implementations grow smarter through adaptive routing and flowlet switching. Congestion control mechanisms like DCQCN make Ethernet viable for demanding RDMA workloads. Each advance extends the capabilities of the underlying Clos fabric.
Future developments will push these architectures further. Optical circuit switching may complement packet switching for ultra-high-bandwidth collective operations. Silicon photonics could collapse multi-stage fabrics into single-chip solutions. Machine learning itself may optimize network configuration and traffic engineering in real-time.
What remains constant is the fundamental insight Clos articulated in 1953: by organizing switches into properly-sized stages with the right connectivity pattern, you can build networks that scale gracefully while maintaining the performance characteristics your applications require. That insight, mathematically proven and practically validated across decades of deployment, continues to make modern cloud computing possible.
References
1. Clos, C. (1953). A Study of Non-Blocking Switching Networks. Bell System Technical Journal, 32(2), 406-424.
2. RFC 7938: Use of BGP for Routing in Large-Scale Data Centers (2016)
3. RFC 7348: Virtual eXtensible Local Area Network (VXLAN)
4. RFC 3168: The Addition of Explicit Congestion Notification (ECN) to IP
5. IEEE 802.1Qbb: Priority-based Flow Control
6. Zhu, Y. et al. (2015). Congestion Control for Large-Scale RDMA Deployments. SIGCOMM.
7. Alizadeh, M. et al. (2010). Data Center TCP (DCTCP). SIGCOMM.
8. Greenberg, A. et al. (2009). VL2: A Scalable and Flexible Data Center Network. SIGCOMM.