What Is NVSwitch? The Silicon Behind NVIDIA's GPU Cluster Scale-Out

Posted: June, 2026 Writer: Stacy Lu

Introduction

When a single GPU is not enough—and in modern AI training, a single GPU is never enough—the question becomes how to connect multiple GPUs so they can exchange data fast enough to stay in step with each other. The answer NVIDIA engineered is NVSwitch: a dedicated silicon chip that acts as a high-speed crossbar switch for NVLink, allowing every GPU in a server node to communicate with every other GPU at full bandwidth, simultaneously, without contention.

NVSwitch is less well-known than the GPUs it connects, but it is equally essential to AI infrastructure. It is also one of the most demanding components a PCB engineer will encounter: each NVSwitch chip terminates dozens of NVLink ports, each built from several differential pairs running at 100–200 Gb/s per lane in recent generations, and routing those signals on a shared baseboard or dedicated switch board is among the highest-complexity PCB work done in the industry today.

This article explains what NVSwitch is, how it works, how it has evolved across GPU generations, and what it demands from the PCBs that carry it.

Table of Contents

  • Introduction
  • What Is NVSwitch?
  • Why NVSwitch Exists: The GPU-to-GPU Bandwidth Problem
  • How NVSwitch Works
      • Fully Connected Non-Blocking Fabric
      • Multicast and Reduction Operations
  • NVSwitch Generations
      • NVSwitch 1.0 (Volta Era)
      • NVSwitch 2.0 (Ampere Era)
      • NVSwitch 3.0 (Hopper Era)
      • NVSwitch 4.0 (Blackwell Era)
      • NVSwitch Generation Comparison Table
  • NVSwitch in Real AI Server Systems
      • DGX H100: NVSwitch 3.0 on the Baseboard
      • GB200 NVL72: NVSwitch 4.0 as a Dedicated Switch Board
  • NVSwitch vs PCIe Switches: Why They Are Not the Same
  • PCB Design Implications of NVSwitch
      • NVLink Routing Density
      • Layer Count Requirements
      • Material Selection
      • Signal Integrity Rules
      • Power Delivery for NVSwitch
      • Thermal Management
  • NVSwitch Board Manufacturing Requirements
  • FAQ

What Is NVSwitch?

NVSwitch is a dedicated NVLink switch chip developed by NVIDIA. Its function is analogous to a network switch, but instead of switching Ethernet or InfiniBand packets between servers, it switches NVLink traffic between GPUs within a single server node or rack-scale system. NVSwitch provides a fully non-blocking crossbar fabric: any GPU can send data to any other GPU at full NVLink bandwidth, and traffic between other GPU pairs does not reduce the bandwidth available to it.

NVSwitch is a separate silicon component from the GPU. It is fabricated as its own die, packaged as a large BGA, and mounted either on the same PCB as the GPUs it serves (the baseboard in some systems) or on a dedicated switch board in rack-scale architectures. Each NVSwitch chip connects to multiple GPUs via NVLink ports, and multiple NVSwitch chips work together to provide a fully connected fabric across all GPUs in the system.


Why NVSwitch Exists: The GPU-to-GPU Bandwidth Problem

Deep learning training requires frequent exchange of gradient data between GPUs during the backward pass (all-reduce operations). The speed of this exchange directly determines how efficiently the GPUs can train in parallel—slow GPU-to-GPU communication forces GPUs to idle while waiting for gradients, reducing utilization and extending training time.

Before NVLink, multi-GPU systems relied on PCIe as the GPU interconnect. PCIe bandwidth (roughly 64 GB/s bidirectional for Gen4 ×16, 128 GB/s bidirectional for Gen5 ×16) is adequate for CPU-to-GPU data transfers but far too slow for the collective operations required during large-model training. NVLink addressed this by providing direct GPU-to-GPU links, reaching 600–1,800 GB/s per GPU in recent generations, but NVLink without a switch can only connect GPUs in a fixed topology: point-to-point links between specific pairs.

NVSwitch adds the switching layer that makes NVLink a general-purpose, fully connected fabric rather than a fixed point-to-point topology. With NVSwitch, any GPU can reach any other GPU at full NVLink bandwidth in a single hop, regardless of how many GPUs are in the system.
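To put rough numbers on the bandwidth gap, the sketch below estimates the bandwidth-bound time of a ring all-reduce, in which each GPU sends 2(N−1)/N times the payload. The 10 GB gradient payload, the function name, and the choice to use half of each bidirectional figure as the one-direction rate (128 GB/s for PCIe Gen5 ×16, 900 GB/s for an H100's NVLink 4.0) are illustrative assumptions; latency and protocol overhead are ignored.

```python
def ring_allreduce_seconds(payload_gb: float, num_gpus: int, send_gb_s: float) -> float:
    """Bandwidth-only time for a ring all-reduce.

    Each GPU sends 2 * (N - 1) / N times the payload in total; latency and
    protocol overhead are ignored.
    """
    traffic_gb = 2 * (num_gpus - 1) / num_gpus * payload_gb
    return traffic_gb / send_gb_s


# Hypothetical workload: 10 GB of gradients shared by 8 GPUs.
PAYLOAD_GB = 10.0
NUM_GPUS = 8

# One-direction bandwidth, taken as half of the bidirectional figures (assumption).
pcie_gen5_x16 = 128 / 2   # GB/s
nvlink4_h100 = 900 / 2    # GB/s

t_pcie = ring_allreduce_seconds(PAYLOAD_GB, NUM_GPUS, pcie_gen5_x16)
t_nvlink = ring_allreduce_seconds(PAYLOAD_GB, NUM_GPUS, nvlink4_h100)

print(f"PCIe Gen5 x16: {t_pcie * 1e3:.1f} ms per all-reduce")
print(f"NVLink 4.0:    {t_nvlink * 1e3:.1f} ms per all-reduce")
print(f"Speed-up:      {t_pcie / t_nvlink:.1f}x")
```

Because the estimate is purely bandwidth-bound, the speed-up is simply the ratio of the two link rates; real systems will differ once latency and software overhead are included.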


How NVSwitch Works

Fully Connected Non-Blocking Fabric

A non-blocking fabric guarantees that every GPU can transmit to any other GPU at full link speed at the same time, with no combination of active transfers reducing the bandwidth available to the others. NVSwitch achieves this through an internal crossbar architecture: every input port is cross-connected to every output port through a switching matrix.

In a DGX H100 with 8 GPUs and 4 NVSwitch 3.0 chips:

  • Each GPU connects to all 4 NVSwitch chips via NVLink 4.0
  • Each NVSwitch chip connects to all 8 GPUs
  • Traffic between any two GPUs is spread across the NVSwitch chips, and each path crosses exactly one switch (a single switching hop)
  • All-to-all bandwidth: 900 GB/s bidirectional per GPU (the full NVLink 4.0 bandwidth), sustained across all 8 GPUs simultaneously

Without NVSwitch, each GPU would have to divide its NVLink links among direct connections to the other 7 GPUs, so any single GPU pair would see only a fraction of the GPU's total bandwidth, and the wiring would grow quadratically with GPU count. NVSwitch makes the connection topology a software abstraction rather than a physical wiring constraint.
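A quick way to see what the switch buys is to compare the bandwidth available to one GPU pair in a direct mesh against the switched fabric. This is a minimal sketch, assuming 18 NVLink 4.0 links per H100 at 50 GB/s bidirectional each (18 × 50 = 900 GB/s, matching the per-GPU figure above); the function names are hypothetical.

```python
LINKS_PER_GPU = 18   # NVLink 4.0 links on one H100
LINK_GB_S = 50       # bidirectional GB/s per link (assumption: 900 GB/s / 18 links)
NUM_GPUS = 8


def direct_mesh_pair_gb_s(links_per_gpu: int, num_gpus: int, link_gb_s: int) -> int:
    """Bandwidth between one GPU pair when links are split evenly among all peers."""
    links_per_peer = links_per_gpu // (num_gpus - 1)  # whole links only
    return links_per_peer * link_gb_s


def switched_pair_gb_s(links_per_gpu: int, link_gb_s: int) -> int:
    """Bandwidth between one GPU pair when every link lands on a non-blocking switch."""
    return links_per_gpu * link_gb_s


mesh = direct_mesh_pair_gb_s(LINKS_PER_GPU, NUM_GPUS, LINK_GB_S)
switched = switched_pair_gb_s(LINKS_PER_GPU, LINK_GB_S)
print(f"Direct mesh, one pair: {mesh} GB/s")
print(f"Via NVSwitch, one pair: {switched} GB/s")
```

In the mesh case each peer gets only two whole links, so a single busy pair cannot use bandwidth that sits idle on links to other peers; the switched fabric removes that restriction.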

Multicast and Reduction Operations

NVSwitch does not just forward data—it can perform in-fabric reduction operations. In NVSwitch 3.0 and later, the chip can execute all-reduce (sum, min, max) operations on data in flight through the switch, without requiring the data to be received by a GPU, processed, and re-transmitted. This in-fabric reduction capability accelerates collective communication operations that are central to distributed training, reducing latency and freeing GPU compute cycles that would otherwise be spent on communication overhead.
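The effect of in-fabric reduction can be modeled conceptually in a few lines. The sketch below is not NVIDIA's implementation; it is a toy model, with hypothetical function names, in which the switch sums the buffers element-wise and multicasts the result, alongside a comparison of how much data each GPU must send under a ring all-reduce versus a switch-side reduction.

```python
def switch_allreduce_sum(gpu_buffers):
    """Toy model: the switch sums buffers element-wise and multicasts the result."""
    reduced = [sum(values) for values in zip(*gpu_buffers)]
    # Every GPU receives its own copy of the reduced buffer.
    return [list(reduced) for _ in gpu_buffers]


def bytes_sent_ring(payload_bytes: float, num_gpus: int) -> float:
    """Per-GPU bytes sent in a ring all-reduce: 2 * (N - 1) / N * payload."""
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes


def bytes_sent_in_fabric(payload_bytes: float) -> float:
    """Per-GPU bytes sent when the switch reduces: the payload goes out once."""
    return payload_bytes


buffers = [[1, 2], [3, 4], [5, 6]]        # three GPUs, two gradient values each
result = switch_allreduce_sum(buffers)
print(result)

ring = bytes_sent_ring(1000, 8)
fabric = bytes_sent_in_fabric(1000)
print(f"Ring: {ring:.0f} bytes sent per GPU, in-fabric: {fabric:.0f} bytes")
```

Under this model each GPU sends its data once instead of nearly twice, and the summation itself happens in the switch rather than on GPU compute units.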


NVSwitch Generations

NVSwitch 1.0 (Volta Era)

NVSwitch 1.0 was introduced with the DGX-2 in 2018, the first system to use NVSwitch. It was designed to connect up to 16 V100 GPUs in a single server, forming the first fully non-blocking NVLink fabric at scale.

  • NVLink generation: NVLink 2.0
  • NVLink ports per chip: 18
  • Per-port bandwidth: 50 GB/s bidirectional
  • Total chip bandwidth: 900 GB/s bidirectional
  • Process node: TSMC 12 nm
  • Number of chips in DGX-2: 12 (6 on each of the two GPU baseboards)
  • GPUs supported: Up to 16 V100

NVSwitch 2.0 (Ampere Era)

NVSwitch 2.0 accompanied the A100 GPU and DGX A100. It doubled the port count of NVSwitch 1.0 from 18 to 36 and added support for NVLink 3.0, which raises the per-lane signaling rate while keeping per-link bandwidth at 50 GB/s, so total switch throughput doubled.

  • NVLink generation: NVLink 3.0
  • NVLink ports per chip: 36
  • Per-port bandwidth: 50 GB/s bidirectional (NVLink 3.0)
  • Total chip bandwidth: 1,800 GB/s bidirectional per chip
  • Process node: TSMC 7 nm
  • Number of chips in DGX A100: 6
  • GPUs supported: Up to 8 A100 (DGX A100 configuration)

NVSwitch 3.0 (Hopper Era)

NVSwitch 3.0 is paired with H100 GPUs and NVLink 4.0. The key addition over NVSwitch 2.0 is in-fabric multicast and reduction support, enabling all-reduce operations to be executed within the switch fabric rather than at the GPU.

  • NVLink generation: NVLink 4.0
  • NVLink ports per chip: 64
  • Per-port bandwidth: 50 GB/s bidirectional
  • Total chip bandwidth: 3,200 GB/s (3.2 TB/s) bidirectional per chip
  • In-fabric reduction: Yes (multicast, all-reduce)
  • Process node: TSMC 4N
  • Number of chips in DGX H100: 4 (on GPU baseboard)
  • GPUs supported per node: 8 H100 (fully connected)

The jump from 36 NVLink ports (NVSwitch 2.0) to 64 ports (NVSwitch 3.0) is one of the most visible changes between generations. It lets four switch chips in DGX H100 do the work that took six in DGX A100, while carrying NVLink 4.0's higher aggregate bandwidth per GPU (900 GB/s vs 600 GB/s for A100).

NVSwitch 4.0 (Blackwell Era)

NVSwitch 4.0 supports NVLink 5.0 and is the switching fabric inside the GB200 NVL72 rack-scale system. At this generation, NVSwitch moves from the GPU baseboard to dedicated switch boards within the rack, enabling the fabric to scale beyond a single server node.

  • NVLink generation: NVLink 5.0
  • NVLink ports per chip: 72
  • Per-port bandwidth: 100 GB/s bidirectional
  • Total chip bandwidth: 7,200 GB/s (7.2 TB/s) bidirectional per chip
  • In-fabric reduction: Yes (enhanced over NVSwitch 3.0)
  • Process node: TSMC 4NP
  • Number of switch boards in GB200 NVL72: 9 NVSwitch boards with 2 chips each (18 total NVSwitch 4.0 chips; 14.4 TB/s per board)
  • GPUs supported per rack: 72 B200 (fully connected)

NVSwitch Generation Comparison Table

Parameter | NVSwitch 1.0 | NVSwitch 2.0 | NVSwitch 3.0 | NVSwitch 4.0
GPU generation | Volta (V100) | Ampere (A100) | Hopper (H100) | Blackwell (B200)
NVLink version | NVLink 2.0 | NVLink 3.0 | NVLink 4.0 | NVLink 5.0
Ports per chip | 18 | 36 | 64 | 72
Per-port BW (bidir.) | 50 GB/s | 50 GB/s | 50 GB/s | 100 GB/s
Total chip BW (bidir.) | 900 GB/s | 1.8 TB/s | 3.2 TB/s | 7.2 TB/s
In-fabric reduction | No | No | Yes | Yes (enhanced)
Process node | TSMC 12 nm | TSMC 7 nm | TSMC 4N | TSMC 4NP
TDP (approx.) | ~100 W | ~150 W | ~270 W | ~400 W
System placement | GPU baseboards (DGX-2) | Baseboard (DGX A100) | Baseboard (DGX H100) | Dedicated switch boards (GB200 NVL72)
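As a sanity check on aggregate figures, total chip bandwidth is simply port count times per-port bandwidth. The sketch below assumes per-link rates of 50 GB/s bidirectional for NVLink 2.0 through 4.0 and 100 GB/s for NVLink 5.0, with 18, 36, 64, and 72 ports across the four generations; treat these inputs as assumptions to verify against NVIDIA's published specifications.

```python
# (ports per chip, bidirectional GB/s per port) -- assumed values, see lead-in
GENERATIONS = {
    "NVSwitch 1.0": (18, 50),
    "NVSwitch 2.0": (36, 50),
    "NVSwitch 3.0": (64, 50),
    "NVSwitch 4.0": (72, 100),
}


def chip_bandwidth_gb_s(ports: int, per_port_gb_s: int) -> int:
    """Aggregate bidirectional bandwidth of one switch chip."""
    return ports * per_port_gb_s


totals = {name: chip_bandwidth_gb_s(*spec) for name, spec in GENERATIONS.items()}
for name, total in totals.items():
    print(f"{name}: {total / 1000:.1f} TB/s")
```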

NVSwitch in Real AI Server Systems

DGX H100: NVSwitch 3.0 on the Baseboard

In the DGX H100, four NVSwitch 3.0 chips are mounted directly on the GPU baseboard, sharing the PCB with the eight H100 SXM5 modules. This tight integration minimizes NVLink trace lengths—shorter traces mean lower insertion loss and fewer signal integrity challenges—but it also means the baseboard must simultaneously handle:

  • Power delivery for 8 H100 GPUs at 700 W each (5,600 W total GPU power)
  • Power delivery for 4 NVSwitch 3.0 chips at ~270 W each (~1,080 W total switch power)
  • NVLink 4.0 routing from every GPU to every NVSwitch chip (8 GPUs × 4 switches × NVLink lanes per connection)
  • PCIe Gen5 host routing from CPUs to GPUs
  • Management and auxiliary I/O

The result is one of the largest and most complex single PCBs in commercial production: a large-format board with 20+ layers and multiple laminate types in a hybrid stackup.
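The power figures in the list above add up quickly. A minimal sketch of the budget, using only the GPU and switch numbers quoted in this section (700 W per H100, approximately 270 W per NVSwitch 3.0) and ignoring VRM losses and every other load on the board:

```python
# (count, watts each) for the two dominant loads on the baseboard
LOADS = {
    "H100 GPU": (8, 700),
    "NVSwitch 3.0": (4, 270),   # approximate TDP
}


def total_power_w(loads: dict) -> int:
    """Sum of count * per-device power over all load types."""
    return sum(count * watts for count, watts in loads.values())


total_w = total_power_w(LOADS)
switch_w = LOADS["NVSwitch 3.0"][0] * LOADS["NVSwitch 3.0"][1]
print(f"GPU + switch power: {total_w} W")
print(f"Switch share: {100 * switch_w / total_w:.1f}%")
```

Even before conversion losses, the switches account for roughly a sixth of the power the baseboard must distribute.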

GB200 NVL72: NVSwitch 4.0 as a Dedicated Switch Board

In the GB200 NVL72, the NVSwitch function moves off the GPU compute tray and onto dedicated NVSwitch boards within the rack. Nine NVSwitch boards, each carrying two NVSwitch 4.0 chips, form a fully connected NVLink 5.0 fabric across all 72 B200 GPUs in the rack.

Separating the switch function onto dedicated boards has several engineering advantages:

  • The GPU compute trays (carrying GB200 Superchips) can be designed and replaced independently of the switch fabric
  • The NVSwitch boards can be optimized purely for switching density without the competing constraints of GPU power delivery and thermal management
  • The rack fabric can theoretically be extended or upgraded by replacing switch boards without replacing GPU trays

The NVSwitch boards in the GB200 NVL72 are dedicated high-density PCBs with NVLink 5.0 routing at 200 Gb/s per lane across dozens of connections per board—among the most demanding PCB designs currently in production.


NVSwitch vs PCIe Switches: Why They Are Not the Same

Parameter | PCIe Switch | NVSwitch
Protocol | PCIe (Gen4/5/6) | NVLink (2.0/3.0/4.0/5.0)
Bandwidth per port (bidir.) | 64–256 GB/s (×16 link) | 50–100 GB/s per NVLink port; a GPU aggregates many ports (e.g. 900 GB/s per H100)
Latency | ~100–500 ns (switch hop) | ~1–5 μs (GPU-to-GPU end to end, hardware-managed; not directly comparable to a per-hop figure)
Memory coherence | No (PCIe is not coherent) | Yes (NVLink supports cache-coherent access)
In-fabric operations | No | Yes (multicast, all-reduce from NVSwitch 3.0)
Use case | General I/O expansion, storage, networking | GPU-to-GPU collective communication for AI training
Maximum system scale | Hundreds of PCIe endpoints | 72 GPUs per rack (GB200 NVL72)

PCIe switches and NVSwitches are complementary, not competing, components in an AI server. The NVSwitch fabric handles GPU-to-GPU collective communication. PCIe connects GPUs to the host CPU and to network interface cards. Both are present in a complete AI server system.


PCB Design Implications of NVSwitch

NVLink Routing Density

The core PCB challenge introduced by NVSwitch is routing a very large number of NVLink differential pairs between GPU packages and NVSwitch packages on a shared PCB. In a DGX H100 baseboard with 8 H100 GPUs and 4 NVSwitch 3.0 chips, the total number of NVLink differential pairs routed on the board can exceed 2,000 individual traces. Each pair must be routed as a controlled-impedance differential pair (100 Ω ± 5%), with intra-pair skew < 5 ps and inter-pair spacing adequate to meet crosstalk specifications.

At NVLink 4.0 speeds (100 Gb/s per lane), traces cannot be routed haphazardly—length matching, layer assignment, and spacing are all constrained by the signal integrity requirements. This routing density is one of the primary drivers of high layer count on AI server baseboards.

Layer Count Requirements

Boards carrying NVSwitch require more layers than boards without it, for two reasons:

  1. Routing channels: The volume of NVLink differential pairs cannot be routed in fewer than 6–10 dedicated signal layers without violating spacing rules; the more NVSwitch chips and GPU connections, the more layers required
  2. Power planes: NVSwitch 3.0 at ~270 W and NVSwitch 4.0 at ~400 W each require their own regulated power domains with adequate copper plane area; adding these planes increases layer count

A DGX H100 baseboard (4 NVSwitch 3.0 + 8 H100) typically uses 20–24 layers. A dedicated NVSwitch 4.0 board in a GB200 NVL72 rack, optimized purely for switching density, may use 28–36 layers.

Material Selection

NVLink signal layers on boards carrying NVSwitch require ultra-low-loss laminates. The loss budget for an NVLink 4.0 channel from GPU package pad to NVSwitch package pad is fixed by the NVLink specification, and the dielectric loss of the PCB laminate is the dominant contributor to insertion loss over the 10–20 cm trace lengths typical on a DGX H100 baseboard.

  • NVLink 4.0 (NVSwitch 3.0) boards: Isola Tachyon 100G or Panasonic Megtron 6E (Df 0.002–0.003 at 10 GHz) on NVLink signal layers
  • NVLink 5.0 (NVSwitch 4.0) dedicated switch boards: Panasonic Megtron 7 or equivalent (Df < 0.002 at 10 GHz); standard Megtron 6 is not acceptable for NVLink 5.0 at these trace lengths
  • Non-NVLink layers (power, ground, management signals): Megtron 6 or FR4-equivalent acceptable; hybrid stackup approach controls board cost
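To see why the dissipation factor matters so much, the sketch below estimates dielectric loss alone using the standard transmission-line approximation α_d = π·f·√Dk·Df / c (in nepers per metre). The 25 GHz evaluation frequency (an assumed Nyquist frequency for 100 Gb/s PAM4), the Dk of 3.3, and the assumption that Df stays constant with frequency are illustrative; conductor loss and copper roughness are deliberately left out.

```python
import math

C_M_PER_S = 299_792_458.0        # speed of light in vacuum
NEPER_TO_DB = 20 / math.log(10)  # about 8.686 dB per neper


def dielectric_loss_db(freq_ghz: float, dk: float, df: float, length_m: float) -> float:
    """Dielectric-only insertion loss of a trace, in dB (conductor loss excluded)."""
    alpha_np_per_m = math.pi * freq_ghz * 1e9 * math.sqrt(dk) * df / C_M_PER_S
    return NEPER_TO_DB * alpha_np_per_m * length_m


FREQ_GHZ = 25.0   # assumed Nyquist frequency for 100 Gb/s PAM4
DK = 3.3          # assumed dielectric constant
LENGTH_M = 0.20   # upper end of the 10-20 cm trace lengths quoted above

for label, df in [("Df 0.004", 0.004), ("Df 0.002", 0.002)]:
    loss = dielectric_loss_db(FREQ_GHZ, DK, df, LENGTH_M)
    print(f"{label}: {loss:.2f} dB dielectric loss over {LENGTH_M * 100:.0f} cm")
```

Dielectric loss scales linearly with Df, so halving the dissipation factor halves this part of the loss budget; the saving grows with frequency, which is why the material requirement tightens from NVLink 4.0 to NVLink 5.0.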

Signal Integrity Rules

The following SI design rules apply specifically to NVLink routing between GPUs and NVSwitch chips:

  • Differential impedance: 100 Ω ± 5% throughout; impedance discontinuities at vias, connectors, and package launches must be minimized through pad geometry optimization
  • Via stubs: Through-hole vias carrying NVLink signals must be backdrilled to < 10 mils stub length (NVLink 4.0) or < 5 mils (NVLink 5.0); alternatively, blind/buried or laser vias eliminate stubs entirely for critical connections
  • Trace length matching: The two traces of an NVLink differential pair must be matched to < 5 ps skew (approximately 0.85 mm at typical laminate propagation velocity); lanes in the same direction group must be matched to each other within < 50 ps
  • Reference plane integrity: Every NVLink signal layer must have a continuous, unbroken reference plane on the immediately adjacent layer; no splits, voids, or connector cutouts beneath active NVLink traces
  • Crosstalk spacing: Minimum edge-to-edge spacing between adjacent differential pairs of 2× trace width on the same layer; aggressor traces on adjacent layers must be offset or spaced to avoid broadside coupling
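The skew limits above translate into physical length limits through the propagation velocity, v = c / √Dk_eff. The following sketch performs that conversion, along with the mils-to-millimetres conversion for the backdrill stub limits; the effective dielectric constant of 3.1 is an assumed value chosen for illustration, and the function names are hypothetical.

```python
import math

C_MM_PER_PS = 0.299792458   # speed of light in vacuum, mm per picosecond
MM_PER_MIL = 0.0254


def skew_to_length_mm(skew_ps: float, dk_eff: float) -> float:
    """Length mismatch (mm) that produces the given skew in a dielectric."""
    velocity_mm_per_ps = C_MM_PER_PS / math.sqrt(dk_eff)
    return skew_ps * velocity_mm_per_ps


def mils_to_mm(mils: float) -> float:
    """Convert a dimension in mils (thousandths of an inch) to millimetres."""
    return mils * MM_PER_MIL


DK_EFF = 3.1  # assumed effective dielectric constant of the laminate

intra_pair_mm = skew_to_length_mm(5, DK_EFF)
lane_to_lane_mm = skew_to_length_mm(50, DK_EFF)
print(f"5 ps intra-pair skew    ~ {intra_pair_mm:.2f} mm")
print(f"50 ps lane-to-lane skew ~ {lane_to_lane_mm:.1f} mm")
print(f"10 mil stub = {mils_to_mm(10):.3f} mm, 5 mil stub = {mils_to_mm(5):.3f} mm")
```

A higher Dk slows the signal and tightens the length budget, so the conversion should be redone with the actual stackup values rather than a rule-of-thumb figure.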

Power Delivery for NVSwitch

NVSwitch chips are themselves high-power components that require careful PDN design on the boards that carry them:

  • NVSwitch 3.0 TDP: approximately 270 W per chip; four chips in a DGX H100 baseboard add ~1,080 W of switch power to the board's total power budget
  • NVSwitch 4.0 TDP: approximately 400 W per chip; a single NVSwitch 4.0 board in a GB200 NVL72 carrying multiple chips must handle commensurately higher total power
  • Core voltage: NVSwitch core VDD is typically in the 0.75–0.85 V range; high current at low voltage demands very low-resistance power planes and short VRM-to-chip distances
  • I/O voltage: NVLink I/O runs at a separate regulated voltage (typically 1.0–1.1 V); must be isolated from core VDD with adequate decoupling to prevent noise coupling between the high-switching NVLink I/O and the core logic
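The phrase "high current at low voltage" can be made concrete with Ohm's law. The sketch below treats the whole TDP as if it were drawn from the core rail, which is a simplifying upper-bound assumption, and uses a hypothetical 1% DC droop allowance to derive the maximum resistance the VRM-to-chip path can have.

```python
def rail_current_a(power_w: float, voltage_v: float) -> float:
    """Current drawn from a rail: I = P / V."""
    return power_w / voltage_v


def max_path_resistance_ohm(current_a: float, voltage_v: float, droop_fraction: float) -> float:
    """Largest VRM-to-chip resistance that keeps DC droop within the given fraction."""
    return voltage_v * droop_fraction / current_a


CORE_V = 0.8    # mid-point of the 0.75-0.85 V range quoted above
DROOP = 0.01    # hypothetical 1% DC droop allowance

for name, tdp_w in [("NVSwitch 3.0", 270), ("NVSwitch 4.0", 400)]:
    amps = rail_current_a(tdp_w, CORE_V)
    r_max = max_path_resistance_ohm(amps, CORE_V, DROOP)
    print(f"{name}: {amps:.1f} A, path resistance <= {r_max * 1e6:.1f} micro-ohm")
```

Currents of several hundred amperes leave a resistance budget measured in tens of micro-ohms, which is why thick copper planes and short VRM-to-chip distances are required.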

Thermal Management

At 270–400 W per chip, NVSwitch requires the same quality of thermal management attention as the GPUs themselves:

  • Heatsinks or cold plates must cover NVSwitch package surfaces; in DLC systems, NVSwitch chips share the liquid cooling loop with GPUs
  • Thermal interface material (TIM) between the NVSwitch package and heatsink/cold plate must maintain contact across the full package area; large BGA packages are susceptible to warpage that creates TIM gaps
  • Thermal vias under the NVSwitch package spread heat from the package attach area into internal copper planes; via pitch of 0.5–0.7 mm is typical
  • Local PCB temperature near NVSwitch chips can reach 80–100°C in sustained operation; board material Tg ≥ 170°C is required to prevent delamination under thermal cycling

NVSwitch Board Manufacturing Requirements

The fabrication and assembly of PCBs carrying NVSwitch chips demands capabilities at the upper end of commercial PCB manufacturing:

  • Large-format fabrication: DGX H100-class baseboards are far larger than typical server boards; panel sizes must accommodate these dimensions with adequate border for test coupons and tooling
  • High layer count with tight registration: 20–36 layers with layer-to-layer registration ± 50 μm; misregistration at this layer count degrades via alignment, increases contact resistance, and degrades controlled impedance
  • Hybrid stackup lamination: Mixing Megtron 7 (or Tachyon 100G) on signal layers with lower-cost materials on non-critical layers requires qualification of the bonding chemistry between dissimilar laminates and modified press cycles
  • Controlled-depth backdrilling at scale: Hundreds or thousands of through-hole vias per board requiring backdrilling to < 10 mils stub length; CNC depth feedback, per-board drill file generation, and 100% post-drill inspection
  • NVSwitch BGA assembly: NVSwitch chips are large, high-pin-count BGAs requiring optimized solder paste printing (stencil aperture design for fine-pitch pads), carefully profiled reflow (large thermal mass requires slow ramp to avoid package warpage), and 100% X-ray inspection post-reflow
  • Electrical validation: TDR (Time Domain Reflectometry) testing on NVLink impedance coupons; network analyzer (VNA) measurement of insertion loss and return loss on channel coupons; flying probe continuity testing for inner layer nets

FAQ

How many NVSwitch chips are in a DGX H100?
A DGX H100 contains four NVSwitch 3.0 chips, all mounted on the GPU baseboard alongside the eight H100 SXM5 GPU modules. The four NVSwitch chips together form a fully non-blocking fabric connecting all eight GPUs with 900 GB/s of NVLink 4.0 bandwidth per GPU.

What is the difference between NVLink and NVSwitch?
NVLink is the high-speed interconnect protocol and physical link that runs between GPU and switch chips. NVSwitch is the switch chip that implements a crossbar fabric for NVLink, allowing any GPU to communicate with any other GPU at full NVLink bandwidth. NVLink defines the electrical and protocol interface; NVSwitch provides the switching function that makes NVLink a scalable fabric rather than a fixed point-to-point topology. For a detailed explanation of NVLink specifically, see What Is NVLink? How NVIDIA's High-Speed GPU Interconnect Shapes PCB Routing.

Can NVSwitch connect GPUs across multiple servers?
In the GB200 NVL72 architecture, NVSwitch 4.0 connects GPUs across multiple compute trays within a single rack, effectively scaling NVLink connectivity beyond a single server node. Connections between separate racks still require InfiniBand or Ethernet networking; NVSwitch does not replace the inter-rack network fabric.

Is NVSwitch used in consumer NVIDIA GPUs?
No. NVSwitch is exclusively a data center product used in DGX systems, HGX server platforms, and rack-scale AI infrastructure like the GB200 NVL72. Consumer (GeForce) and workstation (RTX) GPUs connect via PCIe; a few past models supported a two-GPU NVLink bridge, but none use NVSwitch.

Why did the port count jump from 36 (NVSwitch 2.0) to 64 (NVSwitch 3.0)?
The increase reflects how many links each GPU must land on each switch. In DGX A100, each A100 spreads its 12 NVLink 3.0 links across six NVSwitch 2.0 chips. In DGX H100, each H100 spreads 18 NVLink 4.0 links across only four NVSwitch 3.0 chips, so every switch must accept more links per GPU while the fabric stays non-blocking at 900 GB/s per GPU. The higher port count enables each GPU to connect to each NVSwitch chip with multiple NVLink links, maintaining full aggregate bandwidth.

What PCB materials are required for NVSwitch 4.0 boards?
NVSwitch 4.0 operates with NVLink 5.0 at 200 Gb/s per lane. The NVLink signal routing layers on NVSwitch 4.0 boards require ultra-low-loss laminates with dissipation factor (Df) < 0.002 at 10 GHz, typically Panasonic Megtron 7 or an equivalent-class material. Standard Megtron 6 (Df ~0.004) is not suitable for NVLink 5.0 signal layers at the trace lengths involved in rack-scale NVSwitch board designs.


Need to Manufacture AI Server PCBs?

NVSwitch boards and GPU baseboards represent some of the most technically demanding PCB work in the industry. NextPCB supports high-layer-count fabrication, ultra-low-loss laminate processing, controlled-depth backdrilling, via-in-pad, large-format boards, and complete PCBA services for AI server hardware programs.

 


About the Author

Stacy Lu

With extensive experience in the PCB and PCBA industry, Stacy has established herself as a professional and dedicated Key Account Manager with an outstanding reputation. She excels at deeply understanding client needs, delivering effective and high-quality communication. Renowned for her meticulousness and reliability, Stacy is skilled at resolving client issues and fully supporting their business objectives.