GPU Rack Architecture: How AI Clusters Are Built from PCB to Rack

Posted: June, 2026 Last Updated: June, 2026 Writer: Lolly Zheng

Introduction

A GPU rack for AI training is not simply a collection of servers. It is a tightly engineered system where every layer—from the BGA solder joint on a GPU baseboard through the power distribution unit at the top of the rack to the InfiniBand cables leaving the rack—must be designed as part of a coherent whole. The PCB engineer responsible for the GPU baseboard, the mechanical engineer designing the chassis, the power engineer sizing the PDUs, and the network engineer planning the fabric topology are all solving parts of the same problem: how to deliver maximum AI compute throughput per rack unit while managing power, cooling, and connectivity within the physical constraints of a data center.

This article traces that system from the bottom up—from the PCB layer through node, rack, and cluster level—explaining the engineering decisions at each layer and how they constrain and interact with the decisions at adjacent layers. The goal is to give hardware engineers, system architects, and infrastructure procurement teams a complete mental model of GPU rack architecture that connects the physics of PCB design to the economics of AI infrastructure deployment.

Table of Contents

Introduction
The AI Infrastructure Hierarchy: Die to Data Center
Board Level: GPU Baseboards and OAM UBBs
Node Level: The AI Server Node
Rack Level: From Server Nodes to Full GPU Rack
NVLink Fabric: The Intra-Rack Interconnect
Rack Power Architecture
Rack Cooling Architecture
Network Fabric: Connecting Racks in a Cluster
Anatomy of a DGX H100 Rack
Anatomy of a GB200 NVL72 Rack
OAM-Based Rack Architecture (AMD MI300X)
PCB Design Implications at Each Rack Layer
Scaling from Rack to Cluster
FAQ

The AI Infrastructure Hierarchy: Die to Data Center

AI infrastructure can be understood as a nested hierarchy of six levels, each defined by the interconnect technology that binds the components at that level together:

Level	Components	Primary Interconnect	Typical Bandwidth
Die level	GPU die, HBM stacks	HBM microbump / CoWoS interposer	3.35–8 TB/s (H100–B200)
Package level	GPU package to baseboard	SXM LGA socket / OAM edge connector	900–1,800 GB/s (NVLink 4.0/5.0)
Board level	GPUs, NVSwitch, VRMs on baseboard	NVLink traces on PCB; PCIe Gen5/6 to CPU	900 GB/s per GPU (NVLink 4.0)
Node level	Baseboard + CPU motherboard + PSUs + cooling	NVLink (intra-node); PCIe Gen5 (CPU↔GPU)	~7.2 TB/s aggregate NVLink (DGX H100)
Rack level	Multiple nodes + NVSwitch boards + networking	NVLink 5.0 switch fabric (GB200 NVL72); InfiniBand	~130 TB/s aggregate NVLink (GB200 NVL72)
Cluster level	Hundreds to thousands of racks	InfiniBand NDR/XDR; RoCE Ethernet	400–800 Gb/s per link

Each level in this hierarchy is constrained by the levels above and below it. The PCB designer cannot increase the NVLink trace count without increasing layer count; the system architect cannot add more GPUs per node without exceeding the baseboard's power and thermal budget; the data center operator cannot add more racks without exceeding the facility's power and cooling capacity. Understanding the full hierarchy prevents design decisions that solve a problem at one level by creating a worse problem at an adjacent level.

Board Level: GPU Baseboards and OAM UBBs

The GPU baseboard (or Universal Base Board for OAM-based systems) is the foundational PCB of the AI server node. It is the board that physically carries the GPU packages, routes the NVLink fabric between GPUs and NVSwitch chips, delivers power to every component, and provides the PCIe host interface to the CPU motherboard.

For NVIDIA SXM-based systems (H100, H200, B200), the GPU baseboard is an NVIDIA-designed or NVIDIA-licensed proprietary PCB. For AMD MI300X OAM-based systems, the Universal Base Board (UBB) is an open-standard PCB designed by ODMs or cloud providers to the OCP OAM specification. The technical requirements of the GPU baseboard are the most demanding PCB design problem in the system—typically 20–32 layers, hybrid low-loss stackup, NVLink routing at 100–200 Gb/s per lane, PDN designed for 700–1,000 W per GPU, and HDI via technology throughout. The engineering basis for these requirements is covered in the AI Accelerator PCB Design Guide and the HDI layer count analysis.

The baseboard does not operate in isolation—it is mechanically and electrically integrated into a server chassis that provides structural support, cooling airflow or liquid pathways, power input connections, and management connectivity. The baseboard's physical dimensions, mounting hole pattern, connector locations, and component height limits are all defined by the chassis design, which in turn is constrained by standard rack dimensions (19-inch rack width, 1U–8U chassis height).

Node Level: The AI Server Node

A single AI server node combines the GPU baseboard with a host CPU subsystem, power supplies, cooling hardware, and network interfaces into a complete server. For the DGX H100, this means:

GPU baseboard: 8 × H100 SXM5, 4 × NVSwitch 3.0; ~5.6 kW of GPU power (8 × 700 W)
CPU subsystem: 2 × Intel Xeon Platinum 8480C CPUs on a separate motherboard; 2 TB DDR5 system memory
Host interface: PCIe Gen5 from each CPU to a subset of the GPU baseboard connectors; CPU memory is addressable by GPUs via PCIe but at lower bandwidth than NVLink
Storage: 8 × NVMe SSD for local dataset and checkpoint storage
Networking: 8 × 400G InfiniBand NICs (ConnectX-7) for inter-node communication
Power supplies: Multiple redundant PSUs delivering 12 V bus to the GPU baseboard and CPU subsystem
Cooling: High-flow air cooling or direct liquid cooling (DLC) loops to GPU cold plates

The total system power consumption of a DGX H100 is approximately 10.2 kW, which draws roughly 15 A per phase from a 400 V three-phase feed (about 28 A per phase at 208 V) before any derating margin is applied. This power density, multiplied by 4–8 nodes per rack, defines the fundamental infrastructure requirement that constrains AI rack deployment.
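
The per-phase current for a node can be estimated with the balanced three-phase relation I = P / (√3 × V × PF). The following is a minimal sketch, assuming a balanced load and unity power factor; the function name is illustrative, and real circuits are sized with additional margin above this load current.

```python
import math

def three_phase_line_current(power_w, line_voltage_v, power_factor=1.0):
    """Per-phase line current (A) of a balanced three-phase load."""
    return power_w / (math.sqrt(3) * line_voltage_v * power_factor)

node_power_w = 10_200  # approximate DGX H100 system power quoted in the text

for voltage in (400, 208):
    amps = three_phase_line_current(node_power_w, voltage)
    print(f"{voltage} V three-phase: {amps:.1f} A per phase")
```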

Rack Level: From Server Nodes to Full GPU Rack

A GPU rack aggregates multiple server nodes, their shared networking infrastructure, and the power and cooling distribution hardware into a single physical enclosure. The standard 19-inch rack format is 42U (rack units) tall; one rack unit is 44.45 mm in height. A DGX H100 node is 8U; a typical GPU rack holds 4 DGX H100 nodes (32U), a count limited by rack power rather than by space, leaving the remaining rack units for top-of-rack (ToR) network switches and management hardware.
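
Node count per rack is bounded by both rack units and rack power, and whichever limit is lower wins. The sketch below illustrates that check; the 45 kW rack power budget and the 2U reserved for switches are assumptions chosen for illustration, not vendor figures.

```python
def nodes_per_rack(rack_u, node_u, reserved_u, rack_power_kw, node_power_kw):
    """Return the node count allowed by the tighter of the space and power limits."""
    space_limit = (rack_u - reserved_u) // node_u
    power_limit = int(rack_power_kw // node_power_kw)
    return min(space_limit, power_limit)

# Hypothetical budget: 42U rack, 2U reserved for ToR switching, 45 kW of rack power.
for node_height_u in (8, 10):
    count = nodes_per_rack(42, node_height_u, 2, 45, 10.2)
    print(f"{node_height_u}U nodes at 10.2 kW each: {count} per rack")
```

With these inputs the power limit, not the chassis height, caps the rack at four nodes.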

The rack architecture for different AI server generations varies significantly:

Parameter	DGX H100 Rack (4 nodes)	GB200 NVL72 Rack	MI300X OAM Rack (8 nodes)
GPUs per rack	32 H100 SXM5	72 B200 SXM6	64 MI300X OAM
Total GPU memory	2,560 GB (32 × 80 GB)	13,824 GB (72 × 192 GB)	12,288 GB (64 × 192 GB)
Intra-rack GPU interconnect	NVLink 4.0 within each node; InfiniBand between nodes	NVLink 5.0 via dedicated NVSwitch boards (fully connected within rack)	Infinity Fabric within each node; InfiniBand between nodes
Rack power (approx.)	~40 kW	~120 kW	~48 kW
Cooling type	Air or DLC	Mandatory DLC	Air or DLC
NVSwitch boards in rack	None (NVSwitch on baseboard within each node)	9 dedicated NVSwitch 4.0 boards	None (Infinity Fabric direct, no switch chip)
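
The memory totals in the table follow directly from GPU count times per-GPU HBM capacity. A short sketch reproducing that arithmetic (the dictionary layout is illustrative only):

```python
# (GPUs per rack, HBM capacity per GPU in GB), taken from the table above
RACKS = {
    "DGX H100 rack": (32, 80),
    "GB200 NVL72 rack": (72, 192),
    "MI300X OAM rack": (64, 192),
}

def total_memory_gb(gpu_count, gb_per_gpu):
    """Aggregate HBM capacity of a rack in GB."""
    return gpu_count * gb_per_gpu

for name, (gpus, gb) in RACKS.items():
    print(f"{name}: {total_memory_gb(gpus, gb):,} GB")
```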

The GB200 NVL72 represents a fundamental architectural departure from the DGX H100: rather than organizing the rack as independent nodes connected by InfiniBand, the NVL72 treats the entire rack as a single logical compute unit, with all 72 GPUs fully connected by NVLink 5.0 through the NVSwitch 4.0 boards. This creates a 13.5 TB unified memory space addressable by any GPU in the rack, enabling model parallelism strategies that are not practical when GPUs are connected only by InfiniBand between nodes.

NVLink Fabric: The Intra-Rack Interconnect

NVLink is the high-bandwidth, cache-coherent interconnect that enables efficient all-to-all GPU communication within a node or rack. Its role and implementation differ by GPU generation, as covered in detail in the NVLink PCB routing guide and the NVSwitch architecture guide.

In the DGX H100, the NVLink fabric is confined within each node: the baseboard carries 4 NVSwitch 3.0 chips that create a fully non-blocking fabric connecting all 8 H100 GPUs at 900 GB/s per GPU. Communication between nodes in the same rack uses InfiniBand NICs and a top-of-rack InfiniBand switch. This is a much slower path (8 NICs × 400 Gb/s = 3.2 Tb/s, or about 400 GB/s per node and 50 GB/s per GPU, versus 900 GB/s of NVLink per GPU) that limits cross-node all-reduce performance relative to intra-node all-reduce.

In the GB200 NVL72, the NVLink fabric expands to cover the entire rack. Nine NVSwitch 4.0 boards, each carrying multiple NVSwitch 4.0 chips, create a fully non-blocking fabric connecting all 72 B200 GPUs in the rack at 1,800 GB/s per GPU. The NVSwitch boards occupy their own rack slots and are connected to each GPU compute tray via high-speed copper cables or backplane connections within the rack enclosure. The PCB design requirements for these NVSwitch boards are among the most demanding in the AI hardware supply chain: 32–40+ layers, Megtron 7 laminate, HVLP copper foil, any-layer HDI via technology, and NVLink 5.0 routing at 200 Gb/s per lane across thousands of differential pairs, as described in the Blackwell architecture guide.

Rack Power Architecture

AI GPU racks operate at power densities that challenge even purpose-built data center power infrastructure. A single GB200 NVL72 rack at 120 kW requires a dedicated feed delivering approximately 175 A per phase at 400 V three-phase (120 kW / (√3 × 400 V) at unity power factor), comparable to the entire electrical service of a small manufacturing facility. Managing this power efficiently within the rack requires a carefully designed power distribution architecture.

AC-to-DC conversion: High-efficiency AC-to-DC power supply units (PSUs) in each server node convert facility AC power (typically 208 V or 400 V) to a DC bus. The efficiency of this conversion (typically 94–96% for modern titanium-rated PSUs) directly affects the facility's power usage effectiveness (PUE); every percentage point of PSU efficiency loss becomes heat that the cooling system must remove.
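
To put the efficiency range in perspective, the heat dissipated in the PSUs for a given delivered DC load is P_load × (1/η − 1). The sketch below applies that relation to a 120 kW rack load; treating the whole rack as a single conversion stage is a simplifying assumption.

```python
def psu_loss_kw(load_kw, efficiency):
    """Heat dissipated in the PSU stage (kW) when delivering load_kw at the given efficiency."""
    return load_kw * (1.0 / efficiency - 1.0)

rack_load_kw = 120.0  # GB200 NVL72-class rack load from the text

for eta in (0.94, 0.96):
    print(f"efficiency {eta:.0%}: {psu_loss_kw(rack_load_kw, eta):.2f} kW of conversion heat")
```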

Bus voltage: Older GPU server designs used 12 V bus distribution from PSU to GPU baseboard. The challenge with 12 V at 120 kW is the current: 120 kW / 12 V = 10,000 A, which is unmanageable with standard copper bus bars at rack scale. Modern AI rack designs—including the GB200 NVL72 and OAM Gen2 systems—use 48 V bus distribution, reducing the bus current by a factor of 4 (120 kW / 48 V = 2,500 A) and enabling significantly smaller bus bar cross-sections and lower resistive losses in the distribution cabling.
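
The bus-current arithmetic above can be reproduced in a few lines. The loss comparison assumes the same distribution resistance at both voltages, which is an illustrative simplification since real 12 V and 48 V bus bars are sized differently.

```python
def bus_current_a(power_w, bus_voltage_v):
    """DC bus current (A) needed to deliver power_w at bus_voltage_v."""
    return power_w / bus_voltage_v

rack_power_w = 120_000

i_12 = bus_current_a(rack_power_w, 12)
i_48 = bus_current_a(rack_power_w, 48)

# For a fixed conductor resistance, resistive loss scales with current squared.
relative_loss = (i_48 / i_12) ** 2

print(f"12 V bus: {i_12:,.0f} A")
print(f"48 V bus: {i_48:,.0f} A")
print(f"I^2R loss at 48 V relative to 12 V: {relative_loss:.4f}")
```

A 4× reduction in current therefore gives a 16× reduction in I²R loss for the same conductor.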

GPU-level voltage conversion: The GPU core voltage (0.85–0.9 V at 500–800 A per GPU) is generated by on-board VRM (Voltage Regulator Module) circuits on the GPU baseboard or OAM module. In 12 V systems, multiphase buck regulators convert the bus directly to the core voltage; in 48 V systems, the conversion is either a single high-ratio step or a two-stage design with an intermediate bus (for example 48 V to 12 V) ahead of the multiphase regulators. The PCB design requirements for VRM circuits (heavy copper power planes, low-inductance switching layouts, high-density decoupling capacitor arrays) are among the most demanding aspects of GPU baseboard design.

Redundancy: Data center servers require N+1 or 2N power redundancy. Each server node typically has two or more PSUs in a redundant configuration; if one PSU fails, the others carry the full load. A 2N scheme doubles the installed PSU capacity, while N+1 adds one PSU's worth of headroom, and either must be accounted for in rack power budgets: a rack nominally consuming 120 kW may have 180–240 kW of installed PSU capacity to support redundancy.
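
The installed-capacity range can be sketched as follows. The 60 kW supply block is a hypothetical size chosen only so the result lines up with the 180–240 kW range quoted above; real racks use many smaller PSUs.

```python
import math

def installed_capacity_kw(load_kw, psu_kw, scheme):
    """Installed PSU capacity (kW) for an 'N+1' or '2N' redundancy scheme."""
    n = math.ceil(load_kw / psu_kw)  # units needed to carry the load with no redundancy
    if scheme == "N+1":
        units = n + 1
    elif scheme == "2N":
        units = 2 * n
    else:
        raise ValueError(f"unknown redundancy scheme: {scheme}")
    return units * psu_kw

for scheme in ("N+1", "2N"):
    # 60 kW blocks are an illustrative assumption, not a product specification.
    print(scheme, installed_capacity_kw(120, 60, scheme), "kW installed")
```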

Rack Cooling Architecture

Cooling is the most physically constrained aspect of dense GPU rack design. Heat must be moved from the GPU junction (at 85–95°C maximum operating temperature) to the data center environment (at 20–30°C) through a series of thermal interfaces, each with a temperature drop that consumes part of the available thermal budget.
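
The thermal budget can be expressed as a maximum junction-to-environment thermal resistance, (T_junction − T_environment) / P, shared by every interface in the path. The sketch below shows the calculation; the per-stage resistances are hypothetical placeholders, not measured values.

```python
def max_thermal_resistance(tj_max_c, t_env_c, power_w):
    """Largest total junction-to-environment thermal resistance (K/W) that keeps Tj in limits."""
    return (tj_max_c - t_env_c) / power_w

def junction_temp_c(t_env_c, power_w, stage_resistances_k_per_w):
    """Junction temperature for a series chain of thermal resistances."""
    return t_env_c + power_w * sum(stage_resistances_k_per_w)

budget = max_thermal_resistance(tj_max_c=90, t_env_c=25, power_w=700)
print(f"total budget: {budget:.4f} K/W")

# Hypothetical split: die-to-lid, TIM, cold plate or heat sink to coolant/air.
stages = [0.02, 0.03, 0.03]
print(f"junction temperature: {junction_temp_c(25, 700, stages):.1f} C")
```

Each additional interface in the chain consumes part of the same fixed budget, which is why higher TDPs force lower-resistance cooling paths.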

Air cooling (conventional): In air-cooled GPU racks, high-efficiency fans within each server node force air across the GPU heat sinks, exhausting heated air to the hot aisle at the rear of the rack. Air cooling is practical for GPU TDPs up to approximately 600–700 W per GPU in standard data center environments (ASHRAE A1/A2 thermal classes). H100 SXM5 at 700 W is at the upper limit of what air cooling can reliably sustain in dense rack configurations, so many high-density H100 installations use direct liquid cooling.

Direct liquid cooling (DLC): DLC routes chilled water or coolant directly to cold plates that contact the GPU package surfaces through thermal interface material (TIM). A typical DLC cold plate maintains GPU junction temperatures 20–30°C lower than equivalent air cooling at the same TDP, enabling reliable operation at 700–1,000 W per GPU. DLC is mandatory for B200 (1,000 W TDP) and strongly recommended for H100 in high-density configurations.

The PCB design implications of DLC are significant: cold plate mounting structures must be accommodated in the baseboard layout (mounting hole patterns, component height keepout zones around the GPU package area); thermal vias must be sized and placed beneath GPU packages to minimize TIM contact resistance; and board material T_g must be ≥ 170°C to withstand the sustained thermal environment near 1,000 W GPU packages. The thermal management PCB design requirements are detailed at Thermal Management on AI Server PCBs.

Immersion cooling: Some high-density AI installations use single-phase or two-phase immersion cooling, submerging entire servers in dielectric fluid. Immersion cooling can handle power densities of 200+ kW per rack that are beyond air or conventional DLC capability. PCBs in immersion-cooled systems must be compatible with the specific dielectric fluid used; most standard PCB materials and surface finishes are compatible, but conformal coating and underfill are not always appropriate and must be evaluated per the fluid vendor's specifications.

Network Fabric: Connecting Racks in a Cluster

Within a single node (or, in the GB200 NVL72, within the entire rack), NVLink provides high-bandwidth, low-latency GPU-to-GPU communication. Between racks, InfiniBand or high-speed Ethernet is used. The choice between these two technologies and the topology of the inter-rack fabric are critical system architecture decisions that determine the scaling efficiency of large AI training clusters.

InfiniBand NDR/XDR: NVIDIA InfiniBand NDR (400 Gb/s per port) and the emerging XDR generation (800 Gb/s per port) provide the highest-performance GPU cluster networking. InfiniBand's RDMA (Remote Direct Memory Access) capability allows GPU-to-GPU data transfer across racks without CPU involvement, which is critical for maintaining high collective communication efficiency during training. InfiniBand fat-tree topologies (connecting racks to leaf switches, leaf switches to spine switches) provide near-full bisection bandwidth for all-to-all collective operations at cluster scale.

RoCE (RDMA over Converged Ethernet): High-speed Ethernet with RoCE extensions provides RDMA capabilities on standard Ethernet infrastructure. 400G and 800G Ethernet (using 112G or 224G PAM4 lanes) is increasingly used in hyperscale AI clusters as an alternative to InfiniBand, leveraging the scale economies of the Ethernet ecosystem. The trade-off is somewhat higher latency for collective operations compared to InfiniBand, which can reduce collective communication efficiency at very large cluster scales.

Top-of-rack (ToR) switches: Each rack connects to the network fabric through ToR switches mounted in the top 1–2U of the rack. A GPU rack with 8 nodes and 8 NICs per node requires a ToR switch with 64 × 400G downlinks (to the NICs) and sufficient uplink bandwidth to the spine layer (typically 32 × 400G or, equivalently, 16 × 800G). The ToR switch PCB design is itself a demanding engineering problem: the switch ASIC operates at 25.6–51.2 Tb/s throughput, using 112G or 224G PAM4 serdes lanes on a board with layer counts and material requirements approaching those of the GPU baseboard.
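
The relationship between downlink and uplink capacity is usually summarized as an oversubscription ratio. The sketch below computes it for the 64 × 400G downlink and 32 × 400G uplink case; the function names are illustrative.

```python
def port_group_gbps(port_count, gbps_per_port):
    """Aggregate one-direction bandwidth of a group of ports in Gb/s."""
    return port_count * gbps_per_port

def oversubscription(down_gbps, up_gbps):
    """Ratio of downlink to uplink capacity; 1.0 means non-blocking."""
    return down_gbps / up_gbps

down = port_group_gbps(64, 400)  # 8 nodes x 8 NICs at 400 Gb/s
up = port_group_gbps(32, 400)    # uplinks to the spine layer

print(f"downlinks: {down / 1000:.1f} Tb/s, uplinks: {up / 1000:.1f} Tb/s")
print(f"oversubscription: {oversubscription(down, up):.1f}:1")
print(f"switching capacity used: {(down + up) / 1000:.1f} Tb/s")
```

The combined 38.4 Tb/s of ports fits within the upper end of the ASIC throughput range quoted above.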

Anatomy of a DGX H100 Rack

The DGX H100 is NVIDIA's reference AI training system, and a rack of four DGX H100 nodes is the most widely deployed high-performance AI training unit in data centers worldwide in 2025–2026. Understanding its physical architecture provides a concrete reference point for the abstract hierarchy described above.

Each DGX H100 node (8U):

One GPU baseboard: 8 × H100 SXM5 (700 W each), 4 × NVSwitch 3.0; ~700 mm × 700 mm board area; 20–24 PCB layers; hybrid Megtron 6E / Megtron 6 stackup
Two Intel Xeon Platinum 8480C CPUs on a separate motherboard
2 TB DDR5 ECC memory
8 × ConnectX-7 400G InfiniBand NICs
8 × NVMe SSDs (30.72 TB total)
Cooling: high-flow air cooling, or a DLC loop delivering chilled water to cold plates on each GPU package in liquid-cooled configurations
Multiple PSUs: redundant 12 V bus at approximately 10.2 kW total system power

Full DGX H100 rack (42U):

4 × DGX H100 nodes (32U)
1–2U: top-of-rack InfiniBand NDR switch (64 × 400G ports)
Total GPUs: 32 H100 SXM5
Total GPU memory: 2,560 GB
Total rack power: ~40 kW
NVLink topology: each node is an all-to-all fabric at 900 GB/s per GPU; nodes connect via InfiniBand (no NVLink between nodes)

The bandwidth discontinuity between intra-node NVLink (900 GB/s per GPU) and inter-node InfiniBand (400 Gb/s per NIC × 8 NICs = 3.2 Tb/s, or about 400 GB/s aggregate for the node, shared among all 8 GPUs) is the primary architectural limitation of DGX H100 multi-node clusters at scale. Collective all-reduce operations that fit within a single node execute at NVLink speed; those that must cross nodes execute at a fraction of that speed, bounded by the InfiniBand fabric.
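
The size of this gap is easier to see per GPU. The sketch below takes the 400G NIC rating as 400 gigabits per second and compares the resulting per-GPU InfiniBand share with NVLink; it ignores protocol overhead, so it is an upper bound on the inter-node figure.

```python
def gbps_to_gbytes_per_s(gbps):
    """Convert gigabits per second to gigabytes per second."""
    return gbps / 8

nics_per_node = 8
nic_gbps = 400              # NDR InfiniBand port rate in Gb/s
gpus_per_node = 8
nvlink_gb_per_s = 900       # NVLink 4.0 bandwidth per GPU in GB/s

node_ib_gb_per_s = gbps_to_gbytes_per_s(nics_per_node * nic_gbps)
per_gpu_ib_gb_per_s = node_ib_gb_per_s / gpus_per_node
ratio = nvlink_gb_per_s / per_gpu_ib_gb_per_s

print(f"InfiniBand per node: {node_ib_gb_per_s:.0f} GB/s")
print(f"InfiniBand per GPU:  {per_gpu_ib_gb_per_s:.0f} GB/s")
print(f"NVLink is {ratio:.0f}x the per-GPU inter-node bandwidth")
```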

Anatomy of a GB200 NVL72 Rack

The GB200 NVL72 is NVIDIA's most radical AI rack architecture to date, designed to eliminate the intra-rack bandwidth gap between NVLink and InfiniBand that limits DGX H100 multi-node scaling. The key architectural innovation is treating the entire rack as a single logical GPU cluster connected by a unified NVLink 5.0 fabric.

Physical structure:

Compute trays: 18 compute trays, each carrying 2 GB200 Superchips (1 Grace CPU + 2 B200 GPUs per Superchip × 2 = 2 Grace CPUs + 4 B200 GPUs per tray); total 36 Grace CPUs + 72 B200 GPUs per rack
NVSwitch boards: 9 dedicated NVSwitch 4.0 boards, each carrying multiple NVSwitch 4.0 chips; together these create the fully non-blocking NVLink 5.0 fabric connecting all 72 GPUs
Power: ~120 kW total rack power; 48 V DC bus distribution throughout; mandatory DLC
Cooling: Dedicated liquid cooling loops to each compute tray and NVSwitch board; no air cooling path for GPU thermal management
Networking: 8 × 400G InfiniBand (or Ethernet) links per rack for inter-rack cluster fabric

NVSwitch fabric topology: Each of the 9 NVSwitch boards connects to every one of the 18 compute trays via NVLink 5.0. Each B200 GPU has 18 NVLink 5.0 links, distributed across the 9 switch boards (2 links per switch board per GPU). This creates a non-blocking topology: any GPU can communicate with any other GPU at full 1,800 GB/s NVLink 5.0 bandwidth without contention.
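
The link counts implied by this topology can be checked with simple arithmetic. The sketch below derives the per-link bandwidth from the figures in the text rather than from a specification, so treat it as a consistency check only.

```python
gpus = 72
links_per_gpu = 18
switch_boards = 9
gpu_nvlink_gb_per_s = 1_800  # per-GPU NVLink 5.0 bandwidth from the text

links_per_board_per_gpu = links_per_gpu // switch_boards
total_links = gpus * links_per_gpu
links_per_board = total_links // switch_boards
per_link_gb_per_s = gpu_nvlink_gb_per_s / links_per_gpu
aggregate_gb_per_s = gpus * gpu_nvlink_gb_per_s

print(f"links per GPU per switch board: {links_per_board_per_gpu}")
print(f"total GPU-side links in rack:   {total_links}")
print(f"links terminating per board:    {links_per_board}")
print(f"bandwidth per link:             {per_link_gb_per_s:.0f} GB/s")
print(f"aggregate NVLink bandwidth:     {aggregate_gb_per_s:,} GB/s")
```

Each switch board therefore terminates 144 GPU links, which gives a sense of the differential-pair count those boards must route.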

Unified memory fabric: With all 72 B200 GPUs fully interconnected, the combined 13,824 GB (13.5 TB) of HBM3e memory is addressable as a unified memory space. A model too large to fit on a single GPU (or even a single node) fits in the rack-level memory pool, enabling in-rack tensor parallelism at a scale that previously required inter-node InfiniBand communication.

The NVSwitch board PCBs within the NVL72 are the most technically demanding boards in the system. They must route NVLink 5.0 at 200 Gb/s per lane across thousands of differential pairs connecting 72 GPUs through a fully non-blocking switch fabric. The resulting PCB design requirements (32–40+ layers, Megtron 7 with HVLP copper, any-layer HDI, precision backdrilling) represent the current frontier of commercial PCB capability, as analyzed in the H100 vs A100 PCB comparison and the NVSwitch guide.

OAM-Based Rack Architecture (AMD MI300X)

OAM-based AI server racks using AMD MI300X accelerators have a different architecture from NVIDIA SXM-based racks in several important ways. The OAM standard, covered in the OAM Module introduction, separates the accelerator module from the baseboard, enabling ODMs and cloud providers to design their own UBBs and chassis.

A typical OAM-based rack with AMD MI300X:

8 server nodes per rack, each with 8 MI300X OAM modules on a single UBB
64 MI300X modules per rack; 192 GB HBM3 per module = 12,288 GB total HBM per rack
No dedicated NVSwitch equivalent; Infinity Fabric connects MI300X modules directly within each node via UBB routing
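Inter-node communication via InfiniBand (400 Gb/s per NIC, 8 NICs per node = 3.2 Tb/s aggregate per node)
Approximately 48 kW of accelerator power per rack (750 W per GPU × 64 = 48,000 W); total rack power is higher once CPUs, NICs, fans, and conversion losses are included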

The OAM architecture's key advantage at rack level is the ability to mix different accelerator vendors' modules in the same UBB if both are OAM-compliant. The key limitation is the absence of a rack-scale NVLink equivalent: MI300X modules within a node share Infinity Fabric at 448 GB/s per GPU, but across nodes the inter-GPU bandwidth drops to InfiniBand speeds. This makes MI300X racks architecturally similar to DGX H100 in terms of the intra-node vs inter-node bandwidth gap, though the 192 GB per module means that more models can fit within a single node compared to H100.

PCB Design Implications at Each Rack Layer

Each layer of the GPU rack architecture generates specific PCB design requirements that the boards at that layer must satisfy. The following summarizes the key PCB implications of the rack architecture for the boards most often produced by PCB manufacturers serving AI server programs.

GPU baseboard / UBB (board level): The most complex board in the system. Requirements are dominated by NVLink routing (requiring dedicated high-speed signal layers with Megtron 7 or equivalent, HVLP copper foil, ± 5% impedance control, and backdrilling or laser vias to eliminate stubs), PDN design for sustained GPU power (2–3 oz copper power planes, tight PDN impedance), HDI via technology for BGA escape, and thermal via arrays for cold plate integration. Layer counts of 20–32+ are required.

NVSwitch boards (rack level, GB200 NVL72): Even more demanding than GPU baseboards in terms of NVLink routing density, because the switch board carries only NVLink signals and switch chip power delivery without the competing constraints of GPU package placement and CPU PCIe routing. Layer counts of 32–40+ with any-layer HDI and Megtron 7 / HVLP copper throughout the NVLink signal layers.

CPU motherboard (node level): Carries the host CPUs, DDR5 memory, PCIe Gen5 connections to the GPU baseboard, and management subsystems. Typically 12–18 layers with Megtron 6E on PCIe Gen5 signal layers and standard materials elsewhere. The CPU motherboard's PCIe Gen5 routing requirements are covered in the PCIe Gen5 design guide.

Power distribution boards: The 48 V bus distribution board within a high-power AI rack carries extremely high currents at relatively low frequency. The primary PCB requirements are heavy copper (3–4 oz on primary power planes), wide trace and plane areas for current-carrying capacity, and thermal management of VRM components. Signal integrity is not a concern on these boards; the challenging requirements are mechanical (large format, heavy copper, tight planarity for bus bar connections) rather than electrical.

Network interface cards (NICs): Each 400G or 800G InfiniBand NIC carries a ConnectX ASIC with 112G PAM4 or higher-speed serdes lanes to the optical transceiver cage and PCIe Gen5 to the host CPU. NIC PCBs require low-loss laminate on the serdes signal layers (Megtron 6E / Megtron 7 for 112G PAM4 lanes) and careful impedance control on PCIe Gen5 traces. The 112G PAM4 PCB design guide is directly applicable to NIC board design.

Top-of-rack switch boards: The ToR switch ASIC board operates at 25.6–51.2 Tb/s and uses 112G or 224G PAM4 serdes. Layer counts of 24–32 layers, ultra-low-loss laminates on all serdes signal layers, and stringent crosstalk management are required. The 224G PCB design requirements discussed at 224G PCB Design apply directly to next-generation ToR switch boards.

Scaling from Rack to Cluster

A single GPU rack, however powerful, is insufficient for frontier AI training. LLM pre-training runs for models in the 100B–1T parameter range require thousands of GPUs operating in parallel for weeks or months. Scaling from rack to cluster introduces engineering challenges at every layer of the hierarchy.

At 1,000 GPUs (approximately 14 NVL72 racks or 32 DGX H100 racks), the inter-rack InfiniBand fabric becomes the primary performance bottleneck for collective communication operations. A fat-tree InfiniBand topology requires one spine-layer switch for every 2–4 racks; a 1,000-GPU cluster needs 7–14 spine switches, each with 64 × 400G ports, interconnecting through a second tier if the cluster grows beyond a few thousand GPUs.

At 10,000 GPUs (roughly 140 NVL72 racks), the cluster fabric requires a three-tier InfiniBand fat-tree with hundreds of switches and thousands of cables. The total fabric bandwidth requirement at this scale exceeds 100 Tb/s of bisection bandwidth. Power consumption approaches 17 MW (120 kW/rack × 140 racks) before network and cooling overhead. Water infrastructure must supply thousands of liters per minute of chilled water across the cluster. The PCB implications at this scale are primarily in the number and volume of boards being produced, not in the design of any individual board.
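
The rack counts and power figures in the last two paragraphs follow from a ceiling division and a multiplication. A minimal sketch using the per-rack numbers from the text:

```python
import math

def racks_needed(total_gpus, gpus_per_rack):
    """Number of racks required to host total_gpus (last rack may be partly filled)."""
    return math.ceil(total_gpus / gpus_per_rack)

def cluster_power_mw(rack_count, rack_power_kw):
    """IT power of the cluster in MW, before network and cooling overhead."""
    return rack_count * rack_power_kw / 1000

print("1,000 GPUs:", racks_needed(1_000, 72), "NVL72 racks or",
      racks_needed(1_000, 32), "DGX H100 racks")

nvl72_racks = racks_needed(10_000, 72)  # the text rounds this to 140
print(f"10,000 GPUs: {nvl72_racks} NVL72 racks, "
      f"{cluster_power_mw(nvl72_racks, 120):.2f} MW")
```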

What makes AI cluster scaling tractable is the modularity of the rack unit: each rack is independently functional, and the cluster is built by replicating racks and adding fabric. The PCB designs within each rack do not fundamentally change as the cluster grows; the engineering challenge at cluster scale is logistical (procurement, installation, validation, operations) rather than PCB design. This modularity is itself enabled by the standardized mechanical and electrical interfaces at each level of the hierarchy—SXM sockets, OAM connectors, InfiniBand port specifications—that allow components from different vendors and generations to interoperate within a common infrastructure framework.

FAQ

How many GPUs fit in a standard 42U rack?
It depends on the GPU generation and server architecture. A DGX H100 rack (4 × 8U nodes) fits 32 H100 GPUs in 42U. The GB200 NVL72 fits 72 B200 GPUs in a purpose-built rack enclosure that occupies approximately 42U. An OAM-based MI300X rack with 8 × 2U servers fits 64 MI300X modules in 42U. The trend across generations is toward higher GPU density per rack unit, driven by the growing compute demands of frontier AI training.

Why does the GB200 NVL72 require dedicated NVSwitch boards while DGX H100 puts NVSwitch on the baseboard?
DGX H100 places 4 NVSwitch 3.0 chips on the GPU baseboard because the node-level fabric (8 GPUs) is compact enough to be implemented on a single large PCB without impractical routing density. The GB200 NVL72 connects 72 GPUs across 18 compute trays in a single fabric, which requires a far larger number of NVLink 5.0 connections than any single PCB can route. Dedicated NVSwitch boards distribute the switching function across 9 boards, each optimized for routing NVLink 5.0 signals without the competing constraints of GPU package placement and CPU connectivity that would be present on a combined GPU baseboard.

What is the total NVLink bandwidth in a GB200 NVL72 rack?
Each of the 72 B200 GPUs has 1,800 GB/s of bidirectional NVLink 5.0 bandwidth. The total aggregate NVLink bandwidth in the rack is 72 × 1,800 GB/s = 129,600 GB/s (~130 TB/s) bidirectional. This is the sum across all GPUs, not the bandwidth between a single source-destination pair, which is capped at 1,800 GB/s per GPU; because the NVSwitch fabric is non-blocking, all 72 GPUs can simultaneously send and receive at full 1,800 GB/s each without contention.

How does rack power at 120 kW affect data center infrastructure requirements?
A 120 kW rack requires dedicated electrical infrastructure far beyond standard data center provisioning. Standard enterprise data centers are typically provisioned at 5–15 kW per rack; AI-optimized data centers provision 30–150 kW per rack. A 120 kW GB200 NVL72 rack requires: a dedicated 400 V three-phase circuit providing approximately 175 A per phase; a separate UPS circuit with adequate battery backup capacity; liquid cooling supply lines carrying roughly 115–170 liters per minute of chilled water (for a 10–15 °C coolant temperature rise); and structural floor loading rated for the rack weight (including cooling infrastructure), which may exceed 2,000 kg.
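
The required coolant flow depends strongly on the allowed temperature rise across the rack, following Q = P / (ρ × c_p × ΔT). The sketch below assumes plain water properties and that all rack heat is rejected to the liquid loop; both are simplifications, and the ΔT values are illustrative.

```python
WATER_CP_J_PER_KG_K = 4186.0     # specific heat of water
WATER_DENSITY_KG_PER_M3 = 997.0  # density of water near room temperature

def coolant_flow_l_per_min(heat_w, delta_t_k,
                           cp=WATER_CP_J_PER_KG_K,
                           density=WATER_DENSITY_KG_PER_M3):
    """Volumetric coolant flow (L/min) needed to absorb heat_w with a delta_t_k rise."""
    mass_flow_kg_per_s = heat_w / (cp * delta_t_k)
    return mass_flow_kg_per_s / density * 1000.0 * 60.0

for delta_t in (10, 15):
    flow = coolant_flow_l_per_min(120_000, delta_t)
    print(f"120 kW at {delta_t} K rise: {flow:.0f} L/min")
```

Halving the allowed temperature rise doubles the required flow, which is why facility water temperature and flow are specified together.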

Is InfiniBand or Ethernet better for GPU cluster networking?
Both technologies are in active use for AI cluster networking in 2025–2026. InfiniBand NDR (400 Gb/s) provides lower latency and higher bandwidth efficiency for collective communication operations (all-reduce), which is critical for large-scale model training. RoCE Ethernet at 400G or 800G provides comparable throughput at lower cost per port, leveraging commodity switch and NIC volume from the broader Ethernet ecosystem. NVIDIA-based clusters commonly use InfiniBand (benefiting from the tight integration between NVIDIA NICs and CUDA collective communication libraries); AMD MI300X and other OAM-based systems more commonly use Ethernet fabric. Many hyperscale operators run both technologies depending on workload characteristics.

What is the smallest useful GPU cluster for AI training?
A single DGX H100 node (8 GPUs) with 640 GB aggregate HBM is sufficient for fine-tuning and for training models up to a few tens of billions of parameters with model parallelism, though training time for frontier-scale models on a single node would be impractically long. In practice, AI training clusters are sized based on target training completion time. As a reference point, Meta reported roughly 1.7 million A100 GPU-hours to pre-train Llama 2 70B; at that budget a single 8-GPU node would need about 215,000 hours (more than 24 years), a 64-GPU cluster (8 nodes) about 27,000 hours (roughly 3 years), and a 2,048-GPU cluster about 840 hours (35 days). H100 GPUs are faster than A100s, but not by enough to change the conclusion. Clusters of 256 to 2,048 GPUs (32 to 256 DGX H100 nodes) represent the practical minimum for modern frontier model training programs.
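
Sizing a cluster against a completion-time target is a division of the GPU-hour budget by the GPU count. The sketch below uses a hypothetical budget and an optional scaling-efficiency factor; both are placeholders to be replaced with figures for the actual model and hardware.

```python
def wall_clock_days(gpu_hours, n_gpus, scaling_efficiency=1.0):
    """Estimated training time in days for a given GPU-hour budget and cluster size."""
    return gpu_hours / (n_gpus * scaling_efficiency) / 24

budget_gpu_hours = 1_000_000  # hypothetical budget, not a measured figure

for n_gpus in (8, 64, 256, 2_048):
    days = wall_clock_days(budget_gpu_hours, n_gpus)
    print(f"{n_gpus:>5} GPUs: {days:,.0f} days")
```

Real clusters lose some efficiency to communication as they grow, so a scaling_efficiency below 1.0 gives a more conservative estimate.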

Need to Manufacture PCBs for AI Server and GPU Rack Infrastructure?

From GPU baseboards and NVSwitch boards to CPU motherboards, NIC PCBs, and ToR switch line cards, NextPCB provides the advanced PCB fabrication and assembly capabilities that AI rack hardware programs demand: high-layer-count HDI fabrication, ultra-low-loss laminate processing, heavy copper power planes, controlled-depth backdrilling, BGA assembly, and IPC Class 3 quality standards across the full AI server PCB stack.

Get a quote from NextPCB →


About the Author

Lolly Zheng, Sales Account Manager at NextPCB.com

Four years of proven sales experience across electronic components and PCBA industries, with strong expertise in key account acquisition, customer relationship management, and contract negotiations. Focused on driving revenue growth through strategic client development and solution-based selling. Experienced in expanding high-value accounts, securing long-term partnerships, and consistently exceeding sales targets in competitive markets.
