Stacy Lu
Support Team
Feedback: support@nextpcb.com
When a single GPU is not enough, which is the norm for training large modern AI models, the question becomes how to connect multiple GPUs so they can exchange data fast enough to stay in step with each other. NVIDIA's answer is NVSwitch: a dedicated silicon chip that acts as a high-speed crossbar switch for NVLink, allowing every GPU in a server node to communicate with every other GPU at full bandwidth, simultaneously, without contention in the fabric.
NVSwitch is less well-known than the GPUs it connects, but it is equally essential to AI infrastructure. It is also one of the most demanding components a PCB engineer will encounter: each NVSwitch chip connects to dozens of NVLink differential pairs running at 100–200 Gb/s per lane, and routing those signals on a shared baseboard or dedicated switch board is among the highest-complexity PCB work done in the industry today.
This article explains what NVSwitch is, how it works, how it has evolved across GPU generations, and what it demands from the PCBs that carry it.
NVSwitch is a dedicated NVLink switch chip developed by NVIDIA. Its function is analogous to a network switch, but instead of switching Ethernet or InfiniBand packets between servers, it switches NVLink traffic between GPUs within a single server node or rack-scale system. NVSwitch provides a fully non-blocking crossbar fabric: any GPU can send data to any other GPU at full NVLink bandwidth, and traffic between other GPUs does not reduce the bandwidth available to it.
NVSwitch is a separate silicon component from the GPU. It is manufactured on a dedicated process node, packaged as a large BGA, and mounted either on the same baseboard as the GPUs it serves or, in rack-scale architectures, on a dedicated switch board. Each NVSwitch chip connects to multiple GPUs via NVLink ports, and multiple NVSwitch chips work together to provide a fully connected fabric across all GPUs in the system.
Deep learning training requires frequent exchange of gradient data between GPUs during the backward pass (all-reduce operations). The speed of this exchange directly determines how efficiently the GPUs can train in parallel—slow GPU-to-GPU communication forces GPUs to idle while waiting for gradients, reducing utilization and extending training time.
Before NVLink and NVSwitch, multi-GPU systems relied on PCIe as the GPU interconnect. PCIe bandwidth (roughly 64 GB/s bidirectional for Gen4 ×16, 128 GB/s for Gen5 ×16) is adequate for CPU-to-GPU data transfers but is far too slow for the collective operations required during large-model training. NVLink addressed this with direct GPU-to-GPU links, reaching 600 GB/s per GPU on A100, 900 GB/s on H100, and 1,800 GB/s on Blackwell. Without a switch, however, NVLink can only connect GPUs in a fixed topology: point-to-point links between specific pairs.
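To make the bandwidth gap concrete, here is a rough sketch using the common bandwidth-only cost model for a ring all-reduce, in which each GPU sends about 2(N-1)/N times the payload. The 10 GB gradient size, the function name, and the choice to use per-direction bandwidth (half of the bidirectional figures, with 900 GB/s being the H100 NVLink figure given later in this article) are illustrative assumptions. Latency and protocol overhead are ignored.

```python
def ring_allreduce_seconds(payload_bytes: float, num_gpus: int, bytes_per_s: float) -> float:
    """Bandwidth-only estimate of ring all-reduce time (latency ignored)."""
    if num_gpus < 2:
        return 0.0
    # Each GPU sends 2*(N-1)/N times the payload over the course of the ring.
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / bytes_per_s

GB = 1e9
payload = 10 * GB  # assumed gradient volume per step, for illustration only

# Per-direction bandwidth, taken as half of the bidirectional figures.
links = {
    "PCIe Gen5 x16": 64 * GB,           # half of 128 GB/s
    "NVLink (900 GB/s GPU)": 450 * GB,  # half of 900 GB/s
}

estimates = {name: ring_allreduce_seconds(payload, 8, bw) for name, bw in links.items()}
for name, seconds in estimates.items():
    print(f"{name}: {seconds * 1e3:.1f} ms")
```

Under these assumptions the PCIe estimate comes out about seven times longer than the NVLink one. That ratio is simply the ratio of the two bandwidths, since the model ignores everything else.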
NVSwitch adds the switching layer that makes NVLink a general-purpose, fully connected fabric rather than a fixed point-to-point topology. With NVSwitch, any GPU can reach any other GPU at full NVLink bandwidth in a single hop, regardless of how many GPUs are in the system.
A non-blocking fabric guarantees that any GPU can transmit to any other GPU at full link speed, and that no combination of transfers between other ports reduces the bandwidth available to it. NVSwitch achieves this through an internal crossbar architecture: every input port is cross-connected to every output port through a switching matrix.
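The non-blocking property can be illustrated with a toy model. This is a conceptual sketch only, not a description of NVSwitch internals: the `crossbar_grant` function and its first-come arbitration are hypothetical. It shows that a crossbar can serve any set of transfers whose destinations are all different, and that the only conflict left is two sources targeting the same output port.

```python
def crossbar_grant(requests: dict[int, int], num_ports: int) -> dict[int, int]:
    """Grant each input->output request unless that output is already taken.

    `requests` maps an input port to the output port it wants to reach.
    Inputs are served in ascending port order (an arbitrary toy policy).
    """
    granted: dict[int, int] = {}
    busy_outputs: set[int] = set()
    for src, dst in sorted(requests.items()):
        if not (0 <= src < num_ports and 0 <= dst < num_ports):
            raise ValueError(f"port out of range: {src} -> {dst}")
        if dst not in busy_outputs:
            busy_outputs.add(dst)
            granted[src] = dst
    return granted

# Eight ports, each sending to a different destination: everything is granted.
shifted = {src: (src + 3) % 8 for src in range(8)}
print(crossbar_grant(shifted, 8) == shifted)

# Two sources targeting the same output: only one can hold that port at a time.
print(crossbar_grant({0: 5, 1: 5}, 8))
```

The second case is contention at the destination port, which no fabric can remove. The fabric itself never forces two transfers with different destinations to share a path.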
In a DGX H100 with 8 GPUs and 4 NVSwitch 3.0 chips:
Without NVSwitch, all-to-all connectivity would require each GPU to dedicate direct NVLink connections to each of the 7 other GPUs. That divides a GPU's links among its peers, so any one pair gets only a fraction of the GPU's total NVLink bandwidth, and the wiring grows quadratically with GPU count. NVSwitch makes the connection topology a software abstraction rather than a physical wiring constraint.
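A short calculation shows why a direct mesh falls short. It uses the H100 figures quoted elsewhere in this article (18 NVLink links and 900 GB/s per GPU). The even split of links among peers is a simplifying assumption.

```python
LINKS_PER_GPU = 18  # NVLink 4.0 links per H100 (from this article)
GPU_BW_GB_S = 900   # bidirectional NVLink bandwidth per H100 (from this article)
NUM_GPUS = 8

per_link_bw = GPU_BW_GB_S / LINKS_PER_GPU  # bandwidth carried by one link
peers = NUM_GPUS - 1

# Direct mesh: split the links evenly among the 7 peers (assumption).
links_per_peer = LINKS_PER_GPU // peers
unused_links = LINKS_PER_GPU - links_per_peer * peers
mesh_pair_bw = links_per_peer * per_link_bw

# Switched fabric: all of a GPU's links are available toward any one peer.
switched_pair_bw = GPU_BW_GB_S

print(f"per link:            {per_link_bw:.0f} GB/s")
print(f"direct mesh, 1 pair: {mesh_pair_bw:.0f} GB/s ({unused_links} links left over)")
print(f"via NVSwitch, 1 pair: {switched_pair_bw} GB/s")
```

With these numbers, each pair in a direct mesh gets two links, or 100 GB/s, while the switched fabric lets a single pair use the full 900 GB/s.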
NVSwitch does not just forward data—it can perform in-fabric reduction operations. In NVSwitch 3.0 and later, the chip can execute all-reduce (sum, min, max) operations on data in flight through the switch, without requiring the data to be received by a GPU, processed, and re-transmitted. This in-fabric reduction capability accelerates collective communication operations that are central to distributed training, reducing latency and freeing GPU compute cycles that would otherwise be spent on communication overhead.
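The idea can be sketched in a few lines. This is a simplified model, not NVIDIA's implementation: the switch is represented as a function that sums one buffer per GPU and multicasts the result, and the traffic comparison against a ring all-reduce uses the usual 2(N-1)/N bandwidth model. All function names are hypothetical.

```python
def in_fabric_allreduce(gpu_buffers: list[list[float]]) -> list[list[float]]:
    """Model a switch that sums one buffer per GPU and multicasts the result."""
    lengths = {len(buf) for buf in gpu_buffers}
    if len(lengths) != 1:
        raise ValueError("all GPUs must contribute buffers of equal length")
    reduced = [sum(values) for values in zip(*gpu_buffers)]
    # Multicast: every GPU receives its own copy of the reduced buffer.
    return [list(reduced) for _ in gpu_buffers]

def ring_bytes_sent(payload_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce (bandwidth model only)."""
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes

def in_fabric_bytes_sent(payload_bytes: float) -> float:
    """Bytes each GPU sends when the switch reduces: its payload, once."""
    return payload_bytes

buffers = [[1, 2, 3], [10, 20, 30], [100, 200, 300], [1000, 2000, 3000]]
results = in_fabric_allreduce(buffers)
print(results[0])  # [1111, 2222, 3333]

ratio = ring_bytes_sent(1.0, 8) / in_fabric_bytes_sent(1.0)
print(ratio)  # 1.75
```

In this model an 8-GPU ring sends 1.75 times as much data per GPU as the in-fabric version. The reduction arithmetic also moves off the GPUs, which is the effect described above.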
NVSwitch 1.0 was introduced with the DGX-2 in 2018, the first system to use NVSwitch. It was designed to connect up to 16 V100 GPUs in a single server, forming the first fully non-blocking NVLink fabric at scale.
NVSwitch 2.0 accompanied the A100 GPU and DGX A100. It doubled the port count of NVSwitch 1.0 from 18 to 36 and added support for NVLink 3.0, doubling total switch throughput.
NVSwitch 3.0 is paired with H100 GPUs and NVLink 4.0. The key addition over NVSwitch 2.0 is in-fabric multicast and reduction support, enabling all-reduce operations to be executed within the switch fabric rather than at the GPU.
The jump from 36 NVLink ports (NVSwitch 2.0) to 64 ports (NVSwitch 3.0) is the most visible change between generations. It reflects the move to larger NVLink fabric topologies and NVLink 4.0's higher aggregate bandwidth per GPU (900 GB/s on H100 versus 600 GB/s on A100).
NVSwitch 4.0 supports NVLink 5.0 and is the switching fabric inside the GB200 NVL72 rack-scale system. At this generation, NVSwitch moves from the GPU baseboard to dedicated switch boards within the rack, enabling the fabric to scale beyond a single server node.
| Parameter | NVSwitch 1.0 | NVSwitch 2.0 | NVSwitch 3.0 | NVSwitch 4.0 |
|---|---|---|---|---|
| GPU generation | Volta (V100) | Ampere (A100) | Hopper (H100) | Blackwell (B200) |
| NVLink version | NVLink 2.0 | NVLink 3.0 | NVLink 4.0 | NVLink 5.0 |
| Ports per chip | 18 | 36 | 64 | 72 |
| Per-port BW (bidir.) | 50 GB/s | 50 GB/s | 50 GB/s | 100 GB/s |
| Total chip BW (bidir.) | 900 GB/s | 1.8 TB/s | 3.2 TB/s | 7.2 TB/s |
| In-fabric reduction | No | No | Yes | Yes (enhanced) |
| Process node | TSMC 12 nm | TSMC 7 nm | TSMC 4N | TSMC 4NP |
| TDP (approx.) | ~100 W | ~150 W | ~270 W | ~400 W |
| System placement | Dedicated switch board (DGX-2) | Baseboard (DGX A100) | Baseboard (DGX H100) | Dedicated switch boards (GB200 NVL72) |
In the DGX H100, four NVSwitch 3.0 chips are mounted directly on the GPU baseboard, sharing the PCB with the eight H100 SXM5 modules. This tight integration minimizes NVLink trace lengths—shorter traces mean lower insertion loss and fewer signal integrity challenges—but it also means the baseboard must simultaneously handle:
The result is one of the largest and most complex single PCBs in commercial production—a board exceeding 700 mm × 700 mm in some configurations, with 20+ layers and multiple laminate types in a hybrid stackup.
In the GB200 NVL72, the NVSwitch function moves off the GPU compute tray and onto dedicated NVSwitch boards within the rack. Nine NVSwitch boards, each carrying multiple NVSwitch 4.0 chips, form a fully connected NVLink 5.0 fabric across all 72 B200 GPUs in the rack.
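A quick port-accounting check shows how the rack-level fabric can close. Two of the inputs are assumptions not stated in this article: 18 NVLink 5.0 links per GPU, and two NVSwitch chips per switch board (the text above says only "multiple"). The 72-port figure comes from the generation table.

```python
NUM_GPUS = 72
LINKS_PER_GPU = 18    # assumption: NVLink 5.0 links per Blackwell GPU
SWITCH_BOARDS = 9     # from this article
CHIPS_PER_BOARD = 2   # assumption: the article says only "multiple"
PORTS_PER_CHIP = 72   # from the generation table

gpu_links = NUM_GPUS * LINKS_PER_GPU
switch_chips = SWITCH_BOARDS * CHIPS_PER_BOARD
switch_ports = switch_chips * PORTS_PER_CHIP

print(f"GPU-side links:    {gpu_links}")
print(f"switch-side ports: {switch_ports}")
print(f"links from each GPU to each switch chip: {LINKS_PER_GPU / switch_chips:.0f}")
```

Under these assumptions, 1,296 GPU links meet exactly 1,296 switch ports. Each GPU has one link to every switch chip, which is consistent with single-hop reachability between any two GPUs in the rack.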
Separating the switch function onto dedicated boards has several engineering advantages:
The NVSwitch boards in the GB200 NVL72 are dedicated high-density PCBs with NVLink 5.0 routing at 200 Gb/s per lane across dozens of connections per board—among the most demanding PCB designs currently in production.
| Parameter | PCIe Switch | NVSwitch |
|---|---|---|
| Protocol | PCIe (Gen4/5/6) | NVLink (2.0/3.0/4.0/5.0) |
| Bandwidth per port | 64–256 GB/s (×16 link, bidirectional) | 50–100 GB/s per NVLink port (bidirectional) |
| Latency | ~100–500 ns (switch hop) | ~1–5 μs (GPU-to-GPU, hardware-managed) |
| Memory coherence | No (PCIe is not coherent) | Yes (NVLink supports cache-coherent access) |
| In-fabric operations | No | Yes (multicast, all-reduce from NVSwitch 3.0) |
| Use case | General I/O expansion, storage, networking | GPU-to-GPU collective communication for AI training |
| Maximum system scale | Hundreds of PCIe endpoints | 72 GPUs per rack (GB200 NVL72) |
PCIe switches and NVSwitches are complementary, not competing, components in an AI server. The NVSwitch fabric handles GPU-to-GPU collective communication. PCIe connects GPUs to the host CPU and to network interface cards. Both are present in a complete AI server system.
The core PCB challenge introduced by NVSwitch is routing a very large number of NVLink differential pairs between GPU packages and NVSwitch packages on a shared PCB. In a DGX H100 baseboard with 8 H100 GPUs and 4 NVSwitch 3.0 chips, the total number of NVLink differential pairs routed on the board can exceed 2,000 individual traces. Each pair must be routed as a controlled-impedance differential pair (100 Ω ± 5%), with intra-pair skew < 5 ps and inter-pair spacing adequate to meet crosstalk specifications.
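The scale of the routing problem can be estimated with simple arithmetic. The sketch below assumes two differential pairs per direction per NVLink 4.0 link, which is not stated in this article. It takes 18 links per GPU and 64 ports per NVSwitch 3.0 chip from the text and table. The second figure is an upper bound that assumes every switch port is routed on the board.

```python
NUM_GPUS = 8
LINKS_PER_GPU = 18       # from this article
NUM_SWITCHES = 4
PORTS_PER_SWITCH = 64    # from the generation table
LANES_PER_DIRECTION = 2  # assumption for NVLink 4.0
DIRECTIONS = 2           # transmit and receive
TRACES_PER_PAIR = 2      # P and N conductors

pairs_per_link = LANES_PER_DIRECTION * DIRECTIONS

# GPU-to-switch links only.
gpu_pairs = NUM_GPUS * LINKS_PER_GPU * pairs_per_link
gpu_traces = gpu_pairs * TRACES_PER_PAIR

# Upper bound: every port on every switch routed somewhere on the board.
max_pairs = NUM_SWITCHES * PORTS_PER_SWITCH * pairs_per_link
max_traces = max_pairs * TRACES_PER_PAIR

print(f"GPU-to-switch: {gpu_pairs} pairs, {gpu_traces} traces")
print(f"all ports:     {max_pairs} pairs, {max_traces} traces")
```

Under these assumptions the GPU-facing links alone account for 576 pairs (1,152 traces). Routing every switch port would bring the total to 1,024 pairs (2,048 traces).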
At NVLink 4.0 speeds (100 Gb/s per lane), traces cannot be routed haphazardly—length matching, layer assignment, and spacing are all constrained by the signal integrity requirements. This routing density is one of the primary drivers of high layer count on AI server baseboards.
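To see what a 5 ps intra-pair skew limit means on the board, the skew can be converted to a physical length mismatch using the propagation velocity c/√Dk. The dielectric constant of 3.3 is an assumed value for a low-loss laminate, not a figure from this article, and the function name is illustrative.

```python
import math

C_MM_PER_PS = 0.299792458  # speed of light in vacuum, in mm per picosecond

def skew_to_length_mm(skew_ps: float, dk: float) -> float:
    """Length mismatch that produces `skew_ps` of delay in a dielectric with constant `dk`."""
    if dk < 1.0:
        raise ValueError("dielectric constant must be >= 1")
    velocity_mm_per_ps = C_MM_PER_PS / math.sqrt(dk)
    return skew_ps * velocity_mm_per_ps

ASSUMED_DK = 3.3  # assumption: effective Dk seen by a stripline in a low-loss laminate
mismatch_mm = skew_to_length_mm(5.0, ASSUMED_DK)
print(f"5 ps of skew is about {mismatch_mm:.2f} mm of trace length")
```

With the assumed Dk, 5 ps corresponds to a little under 1 mm of length difference between the two conductors of a pair. Bends, via transitions, and glass-weave effects all have to be compensated inside that margin.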
Boards carrying NVSwitch require more layers than boards without it, for two reasons:
A DGX H100 baseboard (4 NVSwitch 3.0 + 8 H100) typically uses 20–24 layers. A dedicated NVSwitch 4.0 board in a GB200 NVL72 rack, optimized purely for switching density, may use 28–36 layers.
NVLink signal layers on boards carrying NVSwitch require ultra-low-loss laminates. The loss budget for an NVLink 4.0 channel from GPU package pad to NVSwitch package pad is fixed by the NVLink specification, and the dielectric loss of the PCB laminate is the dominant contributor to insertion loss over the 10–20 cm trace lengths typical on a DGX H100 baseboard.
The following SI design rules apply specifically to NVLink routing between GPUs and NVSwitch chips:
NVSwitch chips are themselves high-power components that require careful PDN design on the boards that carry them:
At 270–400 W per chip, NVSwitch requires the same quality of thermal management attention as the GPUs themselves:
The fabrication and assembly of PCBs carrying NVSwitch chips demands capabilities at the upper end of commercial PCB manufacturing:
How many NVSwitch chips are in a DGX H100?
A DGX H100 contains four NVSwitch 3.0 chips, all mounted on the GPU baseboard alongside the eight H100 SXM5 GPU modules. The four NVSwitch chips together form a fully non-blocking fabric connecting all eight GPUs with 900 GB/s of NVLink 4.0 bandwidth per GPU.
What is the difference between NVLink and NVSwitch?
NVLink is the high-speed interconnect protocol and physical link that runs directly between GPUs or between GPUs and switch chips. NVSwitch is the switch chip that implements a crossbar fabric for NVLink, allowing any GPU to communicate with any other GPU at full NVLink bandwidth. NVLink defines the electrical and protocol interface; NVSwitch provides the switching function that makes NVLink a scalable fabric rather than a fixed point-to-point topology. For a detailed explanation of NVLink specifically, see What Is NVLink? How NVIDIA's High-Speed GPU Interconnect Shapes PCB Routing.
Can NVSwitch connect GPUs across multiple servers?
In the GB200 NVL72 architecture, NVSwitch 4.0 connects GPUs across multiple compute trays within a single rack, effectively scaling NVLink connectivity beyond a single server node. Connections between separate racks still require InfiniBand or Ethernet networking; NVSwitch does not replace the inter-rack network fabric.
Is NVSwitch used in consumer NVIDIA GPUs?
No. NVSwitch is exclusively a data center product used in DGX systems, HGX server platforms, and rack-scale AI infrastructure like the GB200 NVL72. Consumer GPUs (GeForce) and workstation GPUs (RTX) connect primarily via PCIe. Some past models supported a two-GPU NVLink bridge, but none use NVSwitch.
Why did the port count jump from 36 (NVSwitch 2.0) to 64 (NVSwitch 3.0)?
The increase from 36 to 64 ports follows the growth in NVLink links per GPU and in fabric size. NVSwitch 1.0 and 2.0 were designed to connect up to 16 GPUs, with each GPU using a small number of NVLink links per switch. NVSwitch 3.0 is designed to carry the full NVLink 4.0 bandwidth of the H100 (18 links per GPU, 900 GB/s bidirectional in total) across a fabric that must remain non-blocking. The higher port count lets each GPU connect to each NVSwitch chip with multiple NVLink links, maintaining full aggregate bandwidth.
What PCB materials are required for NVSwitch 4.0 boards?
NVSwitch 4.0 operates with NVLink 5.0 at 200 Gb/s per lane. The NVLink signal routing layers on NVSwitch 4.0 boards require ultra-low-loss laminates with dissipation factor (Df) < 0.002 at 10 GHz—typically Panasonic Megtron 7 or Isola Tachyon 100G. Standard Megtron 6 (Df ~0.004) is not suitable for NVLink 5.0 signal layers at the trace lengths involved in rack-scale NVSwitch board designs.
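The effect of Df on loss can be estimated with the standard dielectric-attenuation formula, α = π·f·√Dk·Df / c (in nepers per metre). The sketch below applies it to the two Df values quoted above. The Dk of 3.3, the 15 cm trace length, and the use of 50 GHz as the Nyquist frequency for 200 Gb/s PAM4 signalling are illustrative assumptions. The Df values are quoted at 10 GHz and are typically higher at 50 GHz, and conductor and surface-roughness losses are not included, so real channels will be worse.

```python
import math

C_M_PER_S = 299_792_458.0
NEPER_TO_DB = 20 / math.log(10)  # about 8.686 dB per neper

def dielectric_loss_db(freq_hz: float, dk: float, df: float, length_m: float) -> float:
    """Dielectric-only insertion loss of a trace (conductor loss not included)."""
    alpha_np_per_m = math.pi * freq_hz * math.sqrt(dk) * df / C_M_PER_S
    return NEPER_TO_DB * alpha_np_per_m * length_m

NYQUIST_HZ = 50e9  # assumption: 200 Gb/s PAM4 -> 100 GBd -> 50 GHz Nyquist
ASSUMED_DK = 3.3   # assumption: typical low-loss laminate
LENGTH_M = 0.15    # assumption: 15 cm trace

loss_df_004 = dielectric_loss_db(NYQUIST_HZ, ASSUMED_DK, 0.004, LENGTH_M)
loss_df_002 = dielectric_loss_db(NYQUIST_HZ, ASSUMED_DK, 0.002, LENGTH_M)

print(f"Df 0.004: {loss_df_004:.2f} dB")
print(f"Df 0.002: {loss_df_002:.2f} dB")
```

Because dielectric loss scales linearly with Df, halving Df halves this component of the loss budget. At these frequencies and trace lengths that amounts to roughly 2.5 dB under the stated assumptions.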
NVSwitch boards and GPU baseboards represent some of the most technically demanding PCB work in the industry. NextPCB supports high-layer-count fabrication, ultra-low-loss laminate processing, controlled-depth backdrilling, via-in-pad, large-format boards, and complete PCBA services for AI server hardware programs.