NVIDIA Blackwell Architecture Explained: B200, GB200 & PCB Design Impact

Posted: June, 2026 Writer: NextPCB - S

NVIDIA's Blackwell architecture is the largest generational change in NVIDIA's data center GPU design since Hopper introduced the Transformer Engine. For software teams, the headline is raw performance: up to 9,000 FP8 TFLOPS per GPU with sparsity, roughly 2.25× the H100's peak of about 4,000. For hardware engineers and PCB designers, the headline is different: a 1,000 W thermal envelope, a dual-die CoWoS package, NVLink 5.0 at 1,800 GB/s, and PCIe Gen6. Together these force a fundamental rethink of board stackup, material selection, and thermal architecture.

This article unpacks the Blackwell architecture from the silicon outward, with a focus on what each design decision means at the PCB level.

Table of Contents

  1. Introduction
  2. What Is the NVIDIA Blackwell Architecture?
  3. Key Architectural Advances Over Hopper
  4. NVIDIA B200: Specifications and Die Architecture
  5. GB200: The Grace Blackwell Superchip
  6. GB200 NVL72: Rack-Scale AI Infrastructure
  7. What Blackwell Means for PCB Design
  8. Hopper vs Blackwell: PCB Design Comparison
  9. Manufacturing Considerations
  10. FAQ

What Is the NVIDIA Blackwell Architecture?

Blackwell is NVIDIA's fifth-generation data center GPU architecture, succeeding Hopper (H100/H200). It was announced in March 2024 and entered volume production in late 2024. The architecture is named after David Harold Blackwell, an American statistician and game theorist.

The Blackwell family includes several distinct products:

  • B100: Entry-level Blackwell for dense deployments, lower TDP
  • B200: Flagship Blackwell GPU for AI training and inference
  • GB200: Grace Blackwell Superchip, pairing two B200 GPUs with one NVIDIA Grace (Arm-based) CPU on a single module
  • GB200 NVL72: Rack-scale system with 36 GB200 Superchips (72 B200 GPUs) connected via NVLink 5.0
  • B200A: Cloud-optimized variant with adjusted power envelope

Key Architectural Advances Over Hopper

| Feature | Hopper (H100) | Blackwell (B200) | Improvement |
| --- | --- | --- | --- |
| Process node | TSMC 4N | TSMC 4NP | Refined 4 nm class |
| Die configuration | 1× GH100 (80B transistors) | 2× GB100 (208B transistors total) | 2.6× transistor count |
| Packaging | Single GPU die on CoWoS-S | Dual-die CoWoS-L | Two GPU dies in one 2.5D package |
| Peak FP8 TFLOPS (with sparsity) | ~4,000 | 9,000 | ~2.25× |
| Memory type | HBM3 (80 GB) | HBM3e (192 GB) | 2.4× capacity |
| Memory bandwidth | 3.35 TB/s | 8.0 TB/s | 2.4× |
| NVLink generation | NVLink 4.0 (900 GB/s) | NVLink 5.0 (1,800 GB/s) | 2× bandwidth |
| PCIe generation | PCIe Gen5 | PCIe Gen6 | 2× per-lane throughput (PAM4) |
| TDP | 700 W | 1,000 W | +43% |
| Form factor | SXM5 | SXM6 | Larger footprint, higher pin count |

The software-visible advances include a second-generation Transformer Engine that adds FP4 alongside FP8 mixed precision, fifth-generation Tensor Cores, and a RAS (Reliability, Availability, Serviceability) Engine for in-field error correction. For PCB engineers, the critical numbers are TDP, NVLink bandwidth, PCIe generation, and package type.


NVIDIA B200: Specifications and Die Architecture

Dual-Die Design and CoWoS-L Packaging

The single most consequential architectural decision in Blackwell—from a PCB standpoint—is the use of two GB100 dies connected by TSMC's CoWoS-L (Chip-on-Wafer-on-Substrate with Local Silicon Interconnect) packaging technology.

At the transistor counts required for B200 performance targets, a single monolithic die would far exceed the reticle limit of current EUV lithography equipment (~858 mm²) and would yield poorly. TSMC's CoWoS-L solves this by placing two separate GB100 dies, each close to the reticle limit, side-by-side on an interposer with embedded local silicon interconnect (LSI) bridges. NVIDIA rates the resulting die-to-die link at 10 TB/s, which is what lets the two dies behave as a single GPU.

From the perspective of the PCB carrying the B200, the package presents as a single very large BGA component with an expanded footprint relative to SXM5. The substrate under the CoWoS assembly is itself a complex interconnect structure, and the PCB must support its mounting area, power delivery, and signal escape routing with extremely fine features.

Compute: FP8, FP16, BF16, and INT8

The B200's fifth-generation Tensor Cores natively support FP4, FP6, FP8, FP16, BF16, TF32, and INT8 precisions. The headline throughput numbers:

| Precision | B200 TFLOPS (dense) | B200 TFLOPS (sparse) | H100 TFLOPS (dense) |
| --- | --- | --- | --- |
| FP8 | 4,500 | 9,000 | ~1,980 |
| FP16 / BF16 | 2,250 | 4,500 | 989 |
| FP32 | 75 | n/a | 67 |
| INT8 (TOPS) | 4,500 | 9,000 | ~1,980 |

HBM3e Memory: 192 GB at 8.0 TB/s

The B200 integrates 192 GB of HBM3e across eight stacks, delivering 8.0 TB/s of memory bandwidth. HBM stacks are placed directly on the CoWoS interposer adjacent to the GB100 dies, connected via through-silicon vias (TSVs) within the HBM packages and microbumps to the interposer.

This on-package memory architecture means that the PCB does not carry HBM signals—all HBM routing occurs within the CoWoS package itself. However, the PCB must still provide the power rails that feed HBM through the package substrate, with tight ripple and transient requirements.

NVLink 5.0: 1,800 GB/s Bidirectional

NVLink 5.0 doubles per-lane bandwidth over NVLink 4.0, reaching 1,800 GB/s total bidirectional bandwidth per GPU (900 GB/s in each direction). Each lane runs at 200 Gb/s, and the B200 supports 18 links with two lanes per direction each: 18 × 2 × 200 Gb/s = 7,200 Gb/s, or 900 GB/s per direction.
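The per-GPU bandwidth figure can be cross-checked from the per-lane rate. The sketch below is illustrative only; it assumes two lanes per direction per link, and the function and constant names are hypothetical.

```python
# Cross-check of NVLink 5.0 aggregate bandwidth from the per-lane rate.
LANE_RATE_GBPS = 200         # Gb/s per lane (from the text)
LINKS_PER_GPU = 18           # links per B200 (from the text)
LANES_PER_LINK_PER_DIR = 2   # assumption: two lanes per direction per link


def nvlink_bandwidth_gbytes(lane_rate_gbps, links, lanes_per_dir):
    """Return (per_direction, bidirectional) bandwidth in GB/s."""
    per_direction_gbps = lane_rate_gbps * links * lanes_per_dir
    per_direction = per_direction_gbps / 8  # 8 bits per byte
    return per_direction, 2 * per_direction


per_dir, bidir = nvlink_bandwidth_gbytes(
    LANE_RATE_GBPS, LINKS_PER_GPU, LANES_PER_LINK_PER_DIR
)
print(f"{per_dir:.0f} GB/s per direction, {bidir:.0f} GB/s bidirectional")
```

With these inputs the result matches the 900 GB/s per direction and 1,800 GB/s bidirectional figures quoted for the B200.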

At 200 Gb/s per lane, NVLink 5.0 signals are among the fastest routed on any commercial PCB today. The channel loss budget from GPU package pad to NVSwitch package pad—including PCB trace, vias, and connectors—is extremely tight. Meeting this budget requires:

  • Ultra-low-loss PCB laminates with dissipation factor (Df) of about 0.002 or lower
  • Minimized via stub length (backdrilling to within < 5 mils of the signal layer)
  • Smooth copper foil (very-low-profile or ultra-low-profile grades) to reduce skin-effect losses at high frequencies
  • Tight differential pair impedance control (100 Ω ± 5%)
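To see why stub length matters, a via stub can be approximated as a quarter-wave resonator. The sketch below uses the first-order formula f = c / (4 · L · √Dk). It ignores pad and anti-pad capacitance, which pull the real resonance lower, and the Dk of 3.4 and the 60 mil unbackdrilled stub are assumed example values.

```python
import math

C_MM_PER_NS = 299.792458  # speed of light in mm/ns
MM_PER_MIL = 0.0254


def stub_resonance_ghz(stub_mils, dk):
    """First-order quarter-wave resonance of a via stub, in GHz."""
    stub_mm = stub_mils * MM_PER_MIL
    # mm/ns divided by mm gives 1/ns, which is GHz
    return C_MM_PER_NS / (4 * stub_mm * math.sqrt(dk))


DK_ASSUMED = 3.4  # assumption: ultra-low-loss laminate
for stub in (60, 10, 5):  # 60 mil = hypothetical unbackdrilled stub
    f_ghz = stub_resonance_ghz(stub, DK_ASSUMED)
    print(f"{stub:>3} mil stub: ~{f_ghz:.0f} GHz")
```

Halving the stub length doubles the resonant frequency, which is why backdrilling targets are expressed as a maximum residual stub.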

PCIe Gen6 Host Interface

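Blackwell introduces PCIe Gen6, which doubles throughput over Gen5 by switching from NRZ (Non-Return-to-Zero) to PAM4 (Pulse Amplitude Modulation, 4 levels) signaling. PAM4 carries two bits per symbol, so the data rate rises from 32 GT/s to 64 GT/s per lane while the symbol rate, and therefore the 16 GHz Nyquist frequency, stays the same as Gen5. A ×16 link delivers approximately 128 GB/s per direction, or 256 GB/s bidirectional, before protocol overhead.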

PAM4 encoding significantly reduces noise margins compared to NRZ signaling at equivalent data rates. For the PCB connecting the B200 to the host CPU:

  • Channel insertion loss must be < 28 dB at 16 GHz (the Nyquist frequency for 64 GT/s PAM4, which signals at 32 GBaud)
  • Return loss and crosstalk specifications are tighter than Gen5
  • Forward Error Correction (FEC) is mandatory in the Gen6 specification, but FEC adds latency and PCB signal quality still directly affects bit error rate before correction
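The relationship between transfer rate, modulation, and Nyquist frequency can be sketched in a few lines. This is an illustration only: it computes raw link rates and ignores encoding and FLIT/FEC overhead, and the function name is hypothetical.

```python
def pcie_link(gt_per_s, lanes, bits_per_symbol):
    """Return (GB/s per direction, GB/s bidirectional, Nyquist GHz), raw rates."""
    per_direction = gt_per_s * lanes / 8          # Gb/s -> GB/s
    symbol_rate_gbaud = gt_per_s / bits_per_symbol
    nyquist_ghz = symbol_rate_gbaud / 2
    return per_direction, 2 * per_direction, nyquist_ghz


gen5 = pcie_link(32, 16, bits_per_symbol=1)  # NRZ: 1 bit per symbol
gen6 = pcie_link(64, 16, bits_per_symbol=2)  # PAM4: 2 bits per symbol

print("Gen5 x16:", gen5)
print("Gen6 x16:", gen6)
```

Both generations share the same Nyquist frequency, so the Gen6 challenge is not higher frequency content but the smaller eye height of four-level signaling.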

GB200: The Grace Blackwell Superchip

The GB200 Superchip integrates two B200 GPUs and one NVIDIA Grace CPU (based on Arm Neoverse V2 cores) on a single module, connected by a 900 GB/s NVLink-C2C (Chip-to-Chip) interconnect. This replaces the traditional PCIe host interface between CPU and GPU with a cache-coherent, low-latency memory interconnect.

Key GB200 Superchip specifications:

  • Grace CPU: 72 Arm Neoverse V2 cores, up to 480 GB LPDDR5X memory, up to 512 GB/s memory bandwidth
  • B200 GPUs: two per module, each with full B200 specifications (192 GB HBM3e, 9,000 FP8 TFLOPS sparse)
  • NVLink-C2C bandwidth: 900 GB/s bidirectional (CPU↔GPU)
  • Combined memory: 864 GB (480 GB LPDDR5X + 2 × 192 GB HBM3e), addressable as a unified memory space
  • Module TDP: up to ~2,700 W (one CPU + two GPUs)

The GB200 module is not an individual PCB in the traditional sense—it is a multi-chip module (MCM) that mounts to a baseboard. That baseboard must handle the combined power delivery, NVLink 5.0 routing to neighboring modules, and PCIe/network connectivity, all within the constraints of a rack-optimized form factor.


GB200 NVL72: Rack-Scale AI Infrastructure

The GB200 NVL72 is a complete rack-scale system containing 36 GB200 Superchips (36 Grace CPUs + 72 B200 GPUs) joined in an all-to-all NVLink 5.0 fabric via NVSwitch chips. The system operates as a single logical GPU with 13.5 TB of unified HBM3e memory, about 130 TB/s of aggregate NVLink bandwidth, and roughly 1.4 exaFLOPS of FP4 compute with sparsity.

Infrastructure specifications:

  • GPUs per rack: 72 B200
  • Total HBM3e: 72 × 192 GB = 13,824 GB (~13.5 TB)
  • NVLink switches per rack: 9 NVSwitch boards
  • Rack power consumption: ~120 kW
  • Cooling: Direct liquid cooling (mandatory)
  • Network: 8× 400G InfiniBand per rack
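The rack-level totals follow directly from the per-GPU figures. The sketch below uses only numbers quoted in this article; the function name and dictionary keys are hypothetical.

```python
# Rack-level totals for GB200 NVL72, using only figures quoted in the text.
GPUS_PER_RACK = 72
SUPERCHIPS_PER_RACK = 36
HBM_PER_GPU_GB = 192
NVLINK_PER_GPU_GB_S = 1800   # bidirectional, per GPU
RACK_POWER_KW = 120


def rack_totals(gpus, superchips, hbm_gb, nvlink_gb_s, rack_kw):
    """Derive aggregate figures for one rack."""
    total_hbm_gb = gpus * hbm_gb
    return {
        "gpus_per_superchip": gpus // superchips,
        "total_hbm_gb": total_hbm_gb,
        # Dividing by 1024 reproduces the ~13.5 figure quoted for the rack.
        "total_hbm_tb_binary": total_hbm_gb / 1024,
        "aggregate_nvlink_tb_s": gpus * nvlink_gb_s / 1000,
        # Rack power also covers CPUs, switches, and conversion losses.
        "rack_kw_per_gpu": rack_kw / gpus,
    }


totals = rack_totals(GPUS_PER_RACK, SUPERCHIPS_PER_RACK, HBM_PER_GPU_GB,
                     NVLINK_PER_GPU_GB_S, RACK_POWER_KW)
for name, value in totals.items():
    print(f"{name}: {value}")
```

The roughly 1.7 kW of rack power per GPU is well above the GPU's own TDP because it includes everything else in the rack.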

The NVSwitch boards within the NVL72 rack are among the most complex PCBs manufactured for commercial deployment, routing NVLink 5.0 signals between all 72 GPUs simultaneously across a fully non-blocking fabric.


What Blackwell Means for PCB Design

Layer Count Requirements

The transition from Hopper to Blackwell drives a significant increase in required PCB layer count:

| Board | H100/H200 Baseboard | B200 Baseboard | GB200 NVL72 NVSwitch Board |
| --- | --- | --- | --- |
| Typical layer count | 16–20 | 24–32 | 32–40+ |
| Primary drivers | NVLink 4.0 routing, power planes | NVLink 5.0, PCIe Gen6, higher current power planes | Fully-connected NVLink 5.0 fabric, maximum signal density |

Additional layers are needed for: dedicated NVLink 5.0 signal routing layers (which cannot share layers with power or other signal types due to crosstalk requirements); additional power planes for the expanded rail count in the B200 PDN; and HDI build-up layers for fine-pitch BGA escape routing under the SXM6 socket.

Material Selection

NVLink 5.0 at 200 Gb/s per lane makes material selection one of the most critical decisions in Blackwell PCB design. The insertion loss budget from GPU to NVSwitch is fixed by the NVLink 5.0 specification; the PCB laminate's dielectric loss consumes a portion of that budget that cannot be recovered.

| Laminate | Dk (at 10 GHz) | Df (at 10 GHz) | Suitable for B200? |
| --- | --- | --- | --- |
| Standard FR4 | ~4.5 | ~0.020 | No: unacceptable loss at NVLink 5.0 frequencies |
| Panasonic Megtron 6 | ~3.6 | ~0.004 | Marginal for NVLink 5.0; suitable for non-NVLink layers |
| Panasonic Megtron 7 | ~3.4 | ~0.002 | Yes: recommended for NVLink 5.0 and PCIe Gen6 layers |
| Isola Tachyon 100G | ~3.0 | ~0.0021 | Yes: suitable for NVLink 5.0 routing layers |
| Rogers 4350B | ~3.48 | ~0.0037 | Conditional: check channel budget for specific trace lengths |
| Rogers RO4450F | ~3.52 | ~0.0037 | Conditional: prepreg use; verify bonding compatibility |

Many Blackwell board designs use a hybrid stackup: Megtron 7 or Tachyon 100G on the high-speed signal layers, with lower-cost materials on power, ground, and low-speed signal layers to manage overall board cost.
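A rough feel for how Dk and Df translate into channel loss comes from the standard dielectric attenuation formula, α = π · f · √Dk · Df / c. The sketch below evaluates it at 10 GHz for three of the laminates listed above. It covers dielectric loss only, excluding conductor and roughness loss, and treats Df as constant with frequency, which real materials are not.

```python
import math

C_M_PER_S = 299_792_458.0
NP_TO_DB = 20 / math.log(10)   # nepers to decibels (~8.686)
M_PER_INCH = 0.0254


def dielectric_loss_db_per_inch(freq_ghz, dk, df):
    """Dielectric attenuation of a TEM line in dB per inch."""
    alpha_np_per_m = math.pi * freq_ghz * 1e9 * math.sqrt(dk) * df / C_M_PER_S
    return alpha_np_per_m * NP_TO_DB * M_PER_INCH


# (Dk, Df) at 10 GHz, taken from the laminate table above
LAMINATES = {
    "Standard FR4": (4.5, 0.020),
    "Megtron 6": (3.6, 0.004),
    "Megtron 7": (3.4, 0.002),
}

loss = {name: dielectric_loss_db_per_inch(10, dk, df)
        for name, (dk, df) in LAMINATES.items()}
for name, db in loss.items():
    print(f"{name}: {db:.3f} dB/inch at 10 GHz")
```

Under these assumptions FR4 loses roughly ten times more per inch than Megtron 7, and the gap widens at the higher frequencies NVLink 5.0 occupies.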

Signal Integrity at NVLink 5.0 and PCIe Gen6 Speeds

At 200 Gb/s per lane (NVLink 5.0) and 64 GT/s per lane (PCIe Gen6 PAM4), the following SI design rules apply to Blackwell baseboards:

  • Differential impedance: 100 Ω ± 5% for NVLink 5.0 pairs; 85 Ω ± 5% for PCIe Gen6
  • Intra-pair skew: < 5 ps (NVLink 5.0); < 3 ps (PCIe Gen6 PAM4)
  • Via stub length: < 10 mils after backdrilling, ideally < 5 mils on NVLink 5.0 traces
  • Copper foil: Very-low-profile (VLP) or hyper-very-low-profile (HVLP) copper required to reduce roughness-driven conductor loss at > 10 GHz; standard electrodeposited (ED) copper is not acceptable on NVLink 5.0 signal layers
  • Trace width: Typically 75–100 μm (3–4 mil) on inner signal layers, with spacing ≥ 2× trace width between adjacent differential pairs
  • Crosstalk: Near-end crosstalk (NEXT) < −30 dB; far-end crosstalk (FEXT) < −40 dB at 10 GHz for NVLink 5.0 channels
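Skew limits in picoseconds become length-matching tolerances once the propagation velocity is known. The sketch below assumes stripline routing, where the effective Dk equals the laminate Dk, and uses 3.4 as an example value; the function name is hypothetical.

```python
import math

C_MM_PER_PS = 0.299792458  # speed of light in mm/ps
MM_PER_MIL = 0.0254


def skew_to_length_mm(skew_ps, dk_eff):
    """Length mismatch (mm) that produces the given intra-pair skew."""
    velocity_mm_per_ps = C_MM_PER_PS / math.sqrt(dk_eff)
    return skew_ps * velocity_mm_per_ps


DK_EFF = 3.4  # assumption: stripline in an ultra-low-loss laminate
for label, skew_ps in (("NVLink 5.0", 5), ("PCIe Gen6", 3)):
    mm = skew_to_length_mm(skew_ps, DK_EFF)
    print(f"{label}: {skew_ps} ps ~ {mm:.2f} mm ({mm / MM_PER_MIL:.0f} mil)")
```

Under these assumptions a 5 ps budget corresponds to under a millimeter of mismatch, which glass-weave effects alone can consume on long routes.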

Power Delivery Network

The B200's 1,000 W TDP and dual-die architecture require a substantially more complex PDN than Hopper:

  • Primary GPU core voltage (VCORE): ~0.85–0.9 V at up to 800+ A; copper plane resistance must be < 0.2 mΩ end-to-end
  • HBM power rails (VDDQ_HBM): Separate regulated supplies per HBM stack group, with tight noise requirements (< 5 mV ripple)
  • I/O and auxiliary rails: NVLink I/O, PCIe, and management functions each require isolated regulated supplies
  • Total rail count: B200 baseboards typically manage 15–25 distinct power rails
  • Target PDN impedance: < 0.1 mΩ from DC to 100 MHz at the GPU package; some designs target < 50 μΩ for transient response
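The target impedance figures can be related to the rail voltage using the common rule Z_target = V × allowed ripple / load step. In the sketch below the 3% and 5% ripple allowances and the 400 A load step are assumed example values, not B200 specifications.

```python
def target_impedance_ohm(vcore_v, ripple_fraction, step_current_a):
    """Classic PDN target impedance: allowed voltage deviation / current step."""
    return vcore_v * ripple_fraction / step_current_a


VCORE_V = 0.85           # from the text (lower end of the range)
LOAD_STEP_A = 400.0      # assumption: half of the 800 A peak as a step

z_tight = target_impedance_ohm(VCORE_V, 0.03, LOAD_STEP_A)   # 3% ripple
z_loose = target_impedance_ohm(VCORE_V, 0.05, LOAD_STEP_A)   # 5% ripple

print(f"3% ripple: {z_tight * 1e6:.1f} micro-ohm")
print(f"5% ripple: {z_loose * 1e6:.1f} micro-ohm")
```

With these assumed inputs the result lands in the tens of micro-ohms, the same order as the targets listed above.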

Thermal Management

At 1,000 W per GPU, air cooling alone cannot maintain junction temperatures within operating limits for sustained compute workloads. Blackwell server designs universally incorporate direct liquid cooling (DLC), and the PCB must accommodate this:

  • Cold plate mounting area: The PCB layout must reserve mounting hole patterns and keep-out zones for cold plate hardware above the SXM6 GPU module
  • Thermal via arrays: Dense arrays (0.4–0.5 mm pitch) of thermal vias under VRM components and near the GPU mounting zone transfer heat to internal copper planes and subsequently to the cold plate structure
  • Copper coin inserts: Where direct GPU-to-cold-plate contact is not achievable, copper coins embedded in the PCB provide a high-conductivity thermal path; cavity milling tolerance is typically ± 0.05 mm for reliable coin seating
  • Board material Tg: Glass transition temperature ≥ 180°C required; continuous operation near 1,000 W power zones elevates local PCB temperature well above ambient, and thermal cycling accelerates delamination in lower-Tg materials
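The scale of the liquid-cooling requirement can be estimated from the heat balance Q = ṁ · cp · ΔT. The sketch below assumes plain water properties and a 10 K coolant temperature rise. Both are illustrative assumptions, and real loops often use water-glycol mixtures with lower heat capacity.

```python
def coolant_flow_l_per_min(heat_w, delta_t_k,
                           cp_j_per_kg_k=4186.0, density_kg_per_l=1.0):
    """Volumetric flow needed to carry heat_w with a delta_t_k coolant rise."""
    mass_flow_kg_per_s = heat_w / (cp_j_per_kg_k * delta_t_k)
    return mass_flow_kg_per_s / density_kg_per_l * 60.0


DELTA_T_K = 10.0  # assumption: coolant inlet-to-outlet temperature rise

per_gpu = coolant_flow_l_per_min(1_000, DELTA_T_K)      # one 1,000 W GPU
per_rack = coolant_flow_l_per_min(120_000, DELTA_T_K)   # 120 kW rack, all heat to liquid

print(f"Per GPU:  {per_gpu:.2f} L/min")
print(f"Per rack: {per_rack:.0f} L/min")
```

The per-GPU flow is modest, but it must be delivered through a cold plate whose mounting hardware shares board area with the densest routing on the baseboard.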

HDI and Via Technology

The SXM6 socket and companion chips (NVSwitch, PCIe retimer, power management ICs) all use fine-pitch BGA packages. Routing signal and power escape from these packages requires HDI via structures:

  • Laser-drilled microvias: 75–100 μm diameter, connecting adjacent layers for BGA escape; stacked and staggered configurations used depending on available layers
  • Via-in-pad: Vias placed directly in BGA pads (filled with conductive or non-conductive epoxy and plated over) to maximize routing density under fine-pitch packages
  • Sequential lamination: Build-up HDI structures (1+N+1, 2+N+2, or any-layer) require multiple press and drill cycles; B200 designs with 28–32 layers typically use 2+N+2 or 3+N+3 HDI
  • ELIC (Every Layer Interconnect): Required for the most complex NVSwitch boards where via density exceeds what sequential build-up can achieve; all layers are interconnectable via stacked filled microvias
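The b+N+b notation can be unpacked mechanically. The sketch below assumes a symmetric build-up and counts one lamination cycle for the core plus one per build-up stage; the helper name and return keys are hypothetical.

```python
def hdi_stackup(notation, total_layers):
    """Split a symmetric 'b+N+b' HDI stackup into build-up and core layers."""
    parts = notation.upper().split("+")
    if len(parts) != 3 or parts[1] != "N" or parts[0] != parts[2]:
        raise ValueError(f"expected symmetric 'b+N+b' notation, got {notation!r}")
    buildup = int(parts[0])
    core_layers = total_layers - 2 * buildup
    if core_layers < 2:
        raise ValueError("total layer count too small for this build-up")
    return {
        "buildup_per_side": buildup,
        "core_layers": core_layers,
        # Assumption: one press for the core, then one per build-up stage.
        "lamination_cycles": buildup + 1,
    }


print(hdi_stackup("2+N+2", 28))
print(hdi_stackup("3+N+3", 32))
```

A 28-layer 2+N+2 board therefore has a 24-layer core and needs three lamination cycles under this counting.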

Hopper vs Blackwell: PCB Design Comparison

| Design Parameter | Hopper (H100/H200) | Blackwell (B200) |
| --- | --- | --- |
| Baseboard layer count | 16–20 | 24–32+ |
| Primary laminate | Megtron 6, Tachyon 100G | Megtron 7, Tachyon 100G |
| NVLink trace speed | 100 Gb/s per lane (NVLink 4.0) | 200 Gb/s per lane (NVLink 5.0) |
| Host PCIe generation | PCIe Gen5 (NRZ, 32 GT/s) | PCIe Gen6 (PAM4, 64 GT/s) |
| GPU TDP | 700 W | 1,000 W |
| Cooling requirement | Air or DLC | DLC mandatory |
| Via stub removal | Backdrilling (< 10 mil stub) | Backdrilling (< 5 mil stub) or laser via |
| Copper foil grade | Low-profile (LP) | Very-low-profile (VLP) or HVLP |
| PDN complexity | High (10–15 rails) | Very high (15–25 rails) |
| HDI type | 1+N+1 or 2+N+2 | 2+N+2 or 3+N+3; ELIC for NVSwitch boards |
| Copper coin requirement | Optional | Common / recommended |

Manufacturing Considerations

Producing PCBs for Blackwell-based AI servers is among the most demanding work in the PCB fabrication industry. The key manufacturing requirements are:

  • Sequential lamination: 2+N+2 or 3+N+3 HDI requires 3–4 lamination cycles; each cycle introduces thermal stress and requires tight control of dielectric thickness and layer registration
  • Layer registration: ± 50 μm or better across the full board area; misregistration at this layer count degrades via alignment and controlled impedance
  • Backdrilling accuracy: Controlled-depth drilling to within ± 50 μm; requires CNC machines with depth feedback and board-specific drill files generated from the exact as-built stackup
  • Laser drilling: 75–100 μm microvia diameter; CO2 or UV laser systems; stacked via alignment within 25 μm
  • Copper coin integration: Cavity milling to ± 0.05 mm; coin press and bonding; post-lamination planarity < 0.1 mm across the coin area
  • Surface finish: ENIG (Electroless Nickel Immersion Gold) or ENEPIG for fine-pitch BGA pads; OSP acceptable on press-fit connector areas
  • Electrical testing: Flying probe or bed-of-nails testing at 100% for inner-layer continuity; TDR (Time Domain Reflectometry) for controlled impedance verification on representative test coupons

FAQ

What does “Blackwell” refer to in NVIDIA's naming scheme?
NVIDIA names its GPU architectures after scientists and mathematicians. Blackwell refers to David Harold Blackwell (1919–2010), an American statistician who made foundational contributions to game theory, probability theory, and mathematical statistics. He was the first African American inducted into the National Academy of Sciences.

Is the B200 a single chip or multiple chips?
The B200 uses two GB100 dies connected through TSMC's CoWoS-L packaging. From a software perspective, the two dies present as a single GPU. The die-to-die interconnect inside the CoWoS package is rated by NVIDIA at 10 TB/s and is transparent to application code.

Why does Blackwell require direct liquid cooling when Hopper supported air cooling?
The B200's 1,000 W TDP exceeds the practical limit of air cooling for sustained operation in a rack-dense AI server environment. Air cooling at 1,000 W per GPU would require airflow volumes and temperatures that are incompatible with standard data center air management. DLC removes heat more efficiently, enabling higher power density per rack.

Can existing H100 server infrastructure be upgraded to B200?
No. B200 uses the SXM6 form factor (incompatible with SXM5 sockets), requires DLC infrastructure, and demands substantially different baseboard PCB designs. A transition from H100 to B200 infrastructure is a full system replacement, not a GPU card swap.

What PCIe version does B200 use for the host CPU connection?
B200 uses PCIe Gen6 (×16), which uses PAM4 signaling to reach 64 GT/s per lane (approximately 128 GB/s per direction, or 256 GB/s bidirectional, for a ×16 link). This is double the throughput of PCIe Gen5. In GB200 Superchip configurations, the Grace CPU connects to the B200 GPUs via NVLink-C2C instead of PCIe, providing 900 GB/s coherent bandwidth.

What is the difference between B200 and GB200?
The B200 is the GPU accelerator alone (in SXM6 form factor). The GB200 is a combined module pairing two B200 GPUs with one NVIDIA Grace Arm-based CPU, connected by the 900 GB/s NVLink-C2C chip-to-chip interconnect. The GB200 NVL72 is a complete rack-scale system using 36 GB200 modules.


Need to Manufacture AI Server PCBs?

Designing for Blackwell? NextPCB supports the full PCB manufacturing stack for B200 and GB200 infrastructure—high-layer-count fabrication, Megtron 7 and low-loss laminate processing, any-layer HDI, backdrilling, copper coin integration, and complete PCBA services.

