
AI Training vs AI Inference: Why They Need Different PCB Designs

Posted: June, 2026 Last Updated: June, 2026 Writer: Arya Li

Introduction

The phrase “AI accelerator PCB” covers an enormous range of hardware—from a compact PCIe card running quantized LLM inference at the edge to a 32-layer baseboard distributing 10 kW across eight H100 GPUs during large-scale pre-training. These are not variations on the same design. They are fundamentally different engineering problems, driven by the profoundly different demands that AI training and AI inference place on compute, memory, interconnect, power, and thermal management.

Understanding those differences is essential for any hardware engineer designing or manufacturing AI accelerator boards. Optimizing a PCB for training performance when it will primarily run inference—or vice versa—wastes cost, adds unnecessary complexity, and can result in a system that underperforms in both roles. This article breaks down the two workloads from first principles and explains how each one shapes PCB architecture from layer count through cooling integration.

Table of Contents

  • Introduction
  • Defining Training and Inference
      • What Is AI Training?
      • What Is AI Inference?
      • Workload Characteristics Compared
  • Compute Requirements
      • Training: Throughput-Optimized, FP8/BF16 Heavy
      • Inference: Latency-Sensitive, INT8/FP8 Quantized
  • Memory Requirements
      • Training: Large Activation and Gradient Storage
      • Inference: Model Weight Fit and KV Cache
  • Interconnect Requirements
      • Training: Maximum All-to-All Bandwidth
      • Inference: Lower Collective Bandwidth, Higher PCIe Throughput
  • Power and Thermal Requirements
  • PCB Design for AI Training Systems
      • Layer Count and Stackup
      • Material Selection
      • Power Delivery Network
      • Thermal Management
      • Interconnect Routing
  • PCB Design for AI Inference Systems
      • Layer Count and Stackup
      • Material Selection
      • Power Delivery Network
      • Thermal Management
      • Form Factors: Edge, PCIe, and OAM
  • PCB Design Comparison Table: Training vs Inference
  • Accelerator and PCB Platform Selection by Workload
  • FAQ

Defining Training and Inference

What Is AI Training?

AI training is the process of adjusting the parameters (weights) of a neural network model to minimize a loss function on a dataset. It involves:

  • Forward pass: Input data is fed through the model layer by layer, producing a prediction
  • Loss computation: The prediction is compared to the ground truth; a scalar loss value is computed
  • Backward pass (backpropagation): Gradients of the loss with respect to every weight in the model are computed using the chain rule, flowing backward through the network
  • Weight update: An optimizer (Adam, SGD, etc.) uses the gradients to adjust every weight in the model

This cycle repeats billions of times across a training run. For a large language model with hundreds of billions of parameters, a single training run may require months of continuous operation across thousands of GPUs. The computational demand is sustained, predictable, and dominated by matrix multiplication operations (GEMMs) on large batches of data.
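
The four steps above can be sketched on a toy problem. The following is a minimal illustration, not a real training framework: it fits a single weight `w` in the model y = w · x with plain SGD, and the data, learning rate, and step count are all illustrative assumptions.

```python
# Toy illustration of the training cycle on a one-weight model y = w * x.
# Pure Python; data, learning rate, and step count are assumed for illustration.

def train(xs, ys, lr=0.05, steps=200):
    w = 0.0  # single trainable weight
    loss = None
    for _ in range(steps):
        # Forward pass: predictions for the whole batch
        preds = [w * x for x in xs]
        # Loss computation: mean squared error (a scalar)
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        # Backward pass: d(loss)/dw via the chain rule
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        # Weight update: plain SGD step
        w -= lr * grad
    return w, loss

w, final_loss = train([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(f"learned w = {w:.4f}, loss = {final_loss:.2e}")
```

A production run repeats the same loop with billions of weights, which is why the hardware sees a sustained, regular load rather than bursts.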

What Is AI Inference?

AI inference is the process of using a trained model to produce outputs from new inputs. The model weights are fixed; no gradients are computed; no weight updates occur. Inference involves only the forward pass.

However, inference has its own demanding characteristics:

  • Latency sensitivity: Users or downstream systems expect responses within milliseconds to seconds; training is indifferent to per-step latency as long as aggregate throughput is high
  • Variable batch sizes: Inference workloads range from batch size 1 (real-time, single-user) to batch size 512+ (offline batch processing); the hardware must perform well across this range
  • Memory-bound at small batches: At low batch sizes, GPU compute cores are underutilized because memory bandwidth (loading weights from HBM) is the bottleneck, not arithmetic throughput
  • KV cache: Autoregressive LLM inference stores key-value attention tensors for every token in the context, consuming GPU memory proportional to sequence length and batch size; at long sequences and large batches the KV cache can exceed the footprint of the model weights themselves
  • Quantization: Inference commonly uses INT8, INT4, or FP8 quantized weights to reduce memory footprint and increase throughput; training typically uses BF16 or FP32 for numerical stability during gradient accumulation
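
The memory-bound behaviour at small batch sizes can be illustrated with a roofline-style estimate. This is a rough sketch under stated assumptions: about 2 FLOPs per parameter per generated token, every weight read from HBM once per decode step, and KV cache traffic ignored. The accelerator figures (1,000 TFLOPS, 3.35 TB/s) are hypothetical round numbers, not a specific product's specification.

```python
# Roofline-style estimate of when decode stops being memory-bound.
# Assumptions: ~2 FLOPs per parameter per token, weights read once per
# decode step, KV-cache traffic ignored.

def decode_arithmetic_intensity(batch_size, bytes_per_param):
    """FLOPs performed per byte of weights loaded in one decode step."""
    return 2.0 * batch_size / bytes_per_param

def ridge_point(peak_flops, hbm_bytes_per_s):
    """Arithmetic intensity above which the accelerator is compute-bound."""
    return peak_flops / hbm_bytes_per_s

# Hypothetical accelerator: 1,000 TFLOPS dense BF16, 3.35 TB/s HBM
ridge = ridge_point(1000e12, 3.35e12)  # FLOPs per byte

for batch in (1, 32, 512):
    ai = decode_arithmetic_intensity(batch, bytes_per_param=2)  # BF16 weights
    regime = "memory-bound" if ai < ridge else "compute-bound"
    print(f"batch {batch:>3}: {ai:6.1f} FLOPs/byte -> {regime}")
```

Under these assumptions the crossover sits at a batch size of a few hundred, which is why single-user serving leaves most of the arithmetic units idle.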

Workload Characteristics Compared

| Dimension | AI Training | AI Inference |
|---|---|---|
| Primary operation | Forward pass + backward pass + weight update | Forward pass only |
| Compute pattern | Large, regular GEMM operations; highly parallelizable | Mixed GEMM (prefill) and memory-bound decode; variable batch |
| Dominant bottleneck | Compute throughput (TFLOPS) | Memory bandwidth (TB/s) at small batch; compute at large batch |
| Precision | BF16, FP16, FP8 (mixed precision); FP32 for optimizer states | INT8, INT4, FP8 (quantized); BF16 for quality-sensitive paths |
| Memory access pattern | Large sequential reads of activation tensors and weights; gradient writes | Weight reads (repeated); KV cache reads and writes; low data reuse |
| GPU memory required | Model weights + activations + gradients + optimizer states; roughly 12–16 bytes per parameter before activations | Model weights + KV cache; ~2 bytes per parameter for weights (FP16) or less (quantized) |
| GPU-to-GPU communication | Frequent, high-bandwidth all-reduce for gradient synchronization | Moderate (tensor parallelism) to minimal (single-GPU inference) |
| Latency sensitivity | Low (throughput matters, not per-step latency) | High (time-to-first-token, tokens-per-second directly user-visible) |
| Power consumption | Sustained maximum TDP; GPUs run at full throttle continuously | Variable; often below peak TDP, especially at low batch sizes |
| Workload duration | Days to months per training run | Milliseconds per request; continuous 24/7 serving |

Compute Requirements

Training: Throughput-Optimized, FP8/BF16 Heavy

Training throughput is measured in TFLOPS (tera floating-point operations per second) at the relevant precision, most commonly BF16 or FP8 for transformer model training in 2025–2026. The GPU must sustain near-peak arithmetic throughput across the many large matrix multiplications that make up each training step.

Key compute characteristics for training:

  • Large, regular matrix shapes: Training GEMMs are typically square or near-square, which maps efficiently onto GPU tensor core arrays and keeps hardware utilization (MFU, Model FLOP Utilization) high—often 40–60% of peak TFLOPS on well-optimized training runs
  • Mixed precision: Forward and backward passes in BF16 or FP8; master weights and optimizer states in FP32; the PCB's power delivery must support the sustained full-throttle compute associated with maximum TDP operation
  • FP8 advantage: The native FP8 Transformer Engine in NVIDIA H100 and B200 roughly doubles peak tensor throughput on transformer training vs BF16 by halving the data width, although realized speedups are usually smaller. Accelerators without native FP8 (for example NVIDIA A100 or AMD MI250X) cannot exploit this

For PCB design, sustained training at maximum TDP means the power delivery network and thermal management systems must be designed for continuous worst-case load, not just for brief peaks separated by idle periods. There is no power averaging benefit to exploit.

Inference: Latency-Sensitive, INT8/FP8 Quantized

Inference compute patterns differ from training in two important ways. First, inference executes only the forward pass, which is roughly one third of the compute of a full training step for the same model and batch size, because the backward pass costs about twice as much as the forward pass. Second, inference commonly uses quantized precision (INT8, INT4, FP8) to reduce both memory footprint and arithmetic cost, trading a small amount of output quality for significantly higher throughput.

Key compute characteristics for inference:

  • Prefill phase (prompt processing): The initial processing of the input prompt is compute-bound (large GEMM, high batch parallelism); throughput-optimized like training
  • Decode phase (token generation): Autoregressive generation of each output token is memory-bandwidth-bound at typical serving batch sizes; arithmetic units are underutilized while the GPU waits for weights to load from HBM
  • Quantization impact: INT8 weights are half the size of BF16; INT4 weights are quarter-size; loading from HBM is 2–4× faster per layer, directly improving decode throughput. The accelerator's HBM bandwidth (TB/s per GPU) is therefore a primary determinant of inference serving capacity; it is set by the GPU package rather than by PCB routing
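
The effect of quantization on decode speed can be bounded with simple arithmetic. The sketch below assumes that every decode step reads all weights from HBM exactly once and that nothing else limits throughput, so it gives an upper bound rather than a prediction. The 3.35 TB/s bandwidth figure is a hypothetical input.

```python
# Upper bound on decode steps per second when weight loading is the bottleneck.
# Assumption: each decode step streams the full weight set from HBM once.

def max_decode_steps_per_s(num_params, bytes_per_param, hbm_bytes_per_s):
    weight_bytes = num_params * bytes_per_param
    return hbm_bytes_per_s / weight_bytes

HBM_BW = 3.35e12   # bytes/s, hypothetical accelerator
PARAMS = 70e9      # 70B-parameter model

bf16 = max_decode_steps_per_s(PARAMS, 2.0, HBM_BW)
int8 = max_decode_steps_per_s(PARAMS, 1.0, HBM_BW)
int4 = max_decode_steps_per_s(PARAMS, 0.5, HBM_BW)

for name, rate in (("BF16", bf16), ("INT8", int8), ("INT4", int4)):
    print(f"{name}: at most {rate:.1f} decode steps/s per sequence")
```

Each decode step produces one token for every sequence in the batch, so batching multiplies aggregate tokens per second until the workload becomes compute-bound.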

Memory Requirements

Training: Large Activation and Gradient Storage

Training memory consumption is substantially larger than inference for the same model size. For a model with P parameters, the approximate GPU memory footprint during training is:

  • Model weights (BF16): 2P bytes
  • Gradients (BF16): 2P bytes
  • Optimizer states (Adam, FP32): 8P bytes (first moment + second moment, each FP32)
  • Activations (BF16, for backward pass): Proportional to batch size × sequence length × hidden dimension; can be reduced via activation checkpointing at the cost of recomputation
  • Total (excluding activations): approximately 12P bytes minimum for mixed-precision training with the Adam optimizer; keeping a separate FP32 master copy of the weights adds a further 4P bytes

For a 7B-parameter model: approximately 84 GB minimum (without activations). For a 70B-parameter model: approximately 840 GB, requiring model parallelism across many GPUs. This is why training large models requires high aggregate GPU memory across a cluster, and why fast GPU-to-GPU interconnect for gradient synchronization is critical.
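
The arithmetic behind these figures is sketched below, using the per-parameter byte counts from the breakdown above and ignoring activations. The helper names and the 80 GB per-GPU capacity used in the last call are illustrative assumptions.

```python
import math

# Bytes per parameter from the breakdown above (activations excluded)
BYTES_PER_PARAM = {
    "weights_bf16": 2,
    "gradients_bf16": 2,
    "adam_moments_fp32": 8,  # first + second moment, 4 bytes each
}

def training_state_gb(num_params):
    """Weights + gradients + optimizer state, in GB (1 GB = 1e9 bytes)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9

def min_gpus(num_params, gpu_mem_gb):
    """Lower bound on GPU count just to hold the training state."""
    return math.ceil(training_state_gb(num_params) / gpu_mem_gb)

print(training_state_gb(7e9))    # 84.0
print(training_state_gb(70e9))   # 840.0
print(min_gpus(70e9, 80))        # 11
```

The GPU count is a floor only: activations, communication buffers, and parallelism overheads push the practical number higher.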

Inference: Model Weight Fit and KV Cache

Inference memory is simpler but still critical. For a model with P parameters in BF16 precision, the weight footprint is 2P bytes. In INT8 quantization, it is P bytes; in INT4, 0.5P bytes. Additionally, the KV cache for autoregressive inference consumes:

KV cache size = 2 × num_layers × num_kv_heads × head_dim × sequence_length × batch_size × precision_bytes

The leading factor of 2 accounts for keys and values. num_kv_heads equals the number of attention heads for standard multi-head attention and is smaller for models that use grouped-query attention.

For a 70B model (LLaMA 2 70B architecture, which uses grouped-query attention with 8 key-value heads) at BF16, generating 4,096 tokens at batch size 32, the KV cache alone consumes approximately 32–48 GB. Combined with model weights (~140 GB in BF16), this comfortably exceeds the H100's 80 GB, which is precisely why the AMD MI300X (192 GB) and NVIDIA H200 (141 GB) target inference of large models. For more on memory architecture trade-offs between H100 and MI300X, see H100 vs MI300X: NVIDIA vs AMD AI Accelerator Comparison.
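
As a worked check of the KV cache formula, the sketch below plugs in the published LLaMA 2 70B configuration. These values (80 layers, 8 key-value heads under grouped-query attention, head dimension 128) are not given in this article and are supplied here as assumptions.

```python
# KV cache size from the formula above; the factor 2 covers keys and values.

def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, precision_bytes):
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * precision_bytes)

# Assumed LLaMA 2 70B configuration: 80 layers, 8 KV heads (GQA), head_dim 128
cache = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128,
                       seq_len=4096, batch_size=32, precision_bytes=2)

print(f"KV cache: {cache / 1e9:.1f} GB")                      # about 42.9 GB
print(f"Weights + cache: {(140e9 + cache) / 1e9:.1f} GB")     # about 182.9 GB
```

Passing the full 64 attention heads instead of the 8 key-value heads would give an eight times larger result, so the head count used here matters a great deal.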

| Model Size | BF16 Weight Footprint | INT8 Weight Footprint | Fits in H100 (80 GB)? | Fits in MI300X (192 GB)? |
|---|---|---|---|---|
| 7B parameters | ~14 GB | ~7 GB | Yes (BF16 and INT8) | Yes |
| 13B parameters | ~26 GB | ~13 GB | Yes (BF16 and INT8) | Yes |
| 34B parameters | ~68 GB | ~34 GB | Marginal (BF16); Yes (INT8) | Yes |
| 70B parameters | ~140 GB | ~70 GB | No (BF16); Marginal (INT8) | Yes (BF16 and INT8) |
| 180B parameters | ~360 GB | ~180 GB | No | No (BF16, requires multi-GPU); Marginal (INT8) |
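
A fit check of this kind is easy to script. The sketch below is one possible classification, not the method used to build the table: it treats a model as "Marginal" when the weights fit but leave less than an assumed 20% of memory free for KV cache and runtime buffers. Entries close to capacity depend on that threshold, so they may be classified differently under other assumptions.

```python
def weight_gb(params_billion, bytes_per_param):
    """Weight footprint in GB (1e9 parameters x bytes each)."""
    return params_billion * bytes_per_param

def fit_status(params_billion, bytes_per_param, gpu_mem_gb, headroom=0.2):
    """Classify weights-only fit; `headroom` is an assumed free-memory
    fraction reserved for KV cache and runtime buffers."""
    need = weight_gb(params_billion, bytes_per_param)
    if need > gpu_mem_gb:
        return "No"
    if need > gpu_mem_gb * (1 - headroom):
        return "Marginal"
    return "Yes"

for size in (7, 13, 34, 70, 180):
    row = [fit_status(size, b, mem) for mem in (80, 192) for b in (2, 1)]
    print(f"{size:>3}B  80 GB: BF16={row[0]}, INT8={row[1]}  "
          f"192 GB: BF16={row[2]}, INT8={row[3]}")
```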

Interconnect Requirements

Training: Maximum All-to-All Bandwidth

Training large models across multiple GPUs requires collective communication operations—primarily all-reduce for gradient synchronization. In tensor parallelism (splitting individual model layers across GPUs) and pipeline parallelism (splitting model layers between GPUs in stages), inter-GPU bandwidth directly determines how much time GPUs spend communicating vs computing.

At scale, inter-GPU communication can account for 30–50% of wall-clock training time if the interconnect is undersized. This is why training systems use the highest-bandwidth available interconnects:

  • NVLink 4.0 at 900 GB/s per GPU (H100, via NVSwitch) for within-node communication
  • NVLink 5.0 at 1,800 GB/s per GPU (B200) for within-rack communication in GB200 NVL72
  • 400G InfiniBand for between-node communication
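
To see why link bandwidth matters so much, consider a rough estimate of gradient synchronization time. The sketch assumes a bandwidth-optimal ring all-reduce, in which each GPU sends and receives 2(N−1)/N times the payload, and it ignores latency and compute overlap. The per-direction bandwidths are assumptions: 450 GB/s for NVLink 4.0 (half of the 900 GB/s bidirectional figure) and 50 GB/s for a single 400G InfiniBand port.

```python
# Bandwidth-only estimate of ring all-reduce time (latency ignored).

def ring_allreduce_seconds(payload_bytes, num_gpus, bytes_per_s_per_direction):
    # Each GPU sends (and receives) 2 * (N - 1) / N of the payload in total
    traffic = 2 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic / bytes_per_s_per_direction

GRAD_BYTES = 7e9 * 2  # BF16 gradients of a 7B-parameter model

t_nvlink = ring_allreduce_seconds(GRAD_BYTES, 8, 450e9)  # assumed NVLink 4.0
t_ib = ring_allreduce_seconds(GRAD_BYTES, 8, 50e9)       # assumed 400G port

print(f"NVLink 4.0: {t_nvlink * 1e3:.1f} ms per all-reduce")
print(f"400G InfiniBand: {t_ib * 1e3:.1f} ms per all-reduce")
```

Repeated on every training step, a difference of this size is what separates a communication-bound job from a compute-bound one.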

For PCB design, this means training baseboards must route a very large number of NVLink differential pairs at 100–200 Gb/s per lane, requiring dedicated high-speed signal layers with ultra-low-loss laminates and VLP copper foil. See What Is NVLink? and What Is NVSwitch? for detailed interconnect PCB requirements.

Inference: Lower Collective Bandwidth, Higher PCIe Throughput

Inference interconnect requirements are lower than training for most deployments:

  • Single-GPU inference (model fits in one GPU): No GPU-to-GPU interconnect needed; PCIe Gen5 host interface is sufficient for data input/output
  • Tensor-parallel inference (model split across 2–8 GPUs): Requires all-reduce operations during the forward pass, but at lower frequency and volume than training backward passes; NVLink is beneficial but not as critical
  • Pipeline-parallel inference: Each GPU handles a subset of layers; communication is activation tensors passing between pipeline stages—lower bandwidth than all-reduce; PCIe Gen5 may be sufficient for smaller pipeline depths

The PCIe host interface is more important for inference than training because inference servers receive continuous streams of requests from the network, and the latency of transferring input data from the host CPU to GPU memory adds to time-to-first-token. PCIe Gen5 ×16 at roughly 64 GB/s in each direction (128 GB/s bidirectional) provides adequate bandwidth for most LLM serving configurations.
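
The PCIe figure can be derived from the per-lane signalling rate. The sketch below applies the 128b/130b line encoding used by PCIe Gen3 through Gen5 and ignores packet and protocol overhead, so delivered throughput is somewhat lower. The function name is illustrative.

```python
# Usable PCIe bandwidth per direction after 128b/130b line encoding.
# Packet and protocol overhead is ignored; valid for Gen3 through Gen5.

def pcie_gbytes_per_s(gt_per_s, lanes, encoding=128 / 130):
    return gt_per_s * lanes * encoding / 8  # bits -> bytes

gen4_x16 = pcie_gbytes_per_s(16, 16)
gen5_x16 = pcie_gbytes_per_s(32, 16)

print(f"Gen4 x16: {gen4_x16:.1f} GB/s per direction")
print(f"Gen5 x16: {gen5_x16:.1f} GB/s per direction, "
      f"{2 * gen5_x16:.0f} GB/s bidirectional")
```

The commonly quoted 128 GB/s is the raw bidirectional total (2 × 64 GB/s); a host-to-GPU transfer in one direction sees about half of it.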


Power and Thermal Requirements

| Parameter | AI Training System | AI Inference System |
|---|---|---|
| GPU utilization | 80–100% sustained; full TDP continuously | 20–80% typical; variable with request rate |
| Power consumption | Sustained at or near maximum TDP | Variable; often 40–70% of peak TDP at moderate load |
| Thermal profile | Constant high heat; worst-case for cooling design | Variable; cooling must handle peak but average load is lower |
| Power delivery design target | Worst-case continuous: full VCORE current at max TDP | Can apply some power averaging; transient response still critical for burst loads |
| Cooling type | Direct liquid cooling (DLC) strongly preferred at 700–1,000 W per GPU | Air cooling viable for lower-power inference accelerators; DLC preferred for high-density |
| PSU sizing | Sized for 100% of peak system power; no derated operation | Sized for 80–100% of peak system power; lighter PSUs possible if average load is consistently below peak |
| PDN transient response | Critical; GPU compute transitions cause large, fast current steps | Critical for prefill bursts; less demanding during steady-state decode |

PCB Design for AI Training Systems

Layer Count and Stackup

Training baseboards (H100 HGX, GB200 NVL72 compute trays) are among the highest layer-count PCBs in production. Layer count is driven by:

  • NVLink/NVSwitch routing requiring 6–12 dedicated high-speed signal layers
  • Multiple power plane pairs (VCORE, VDDQ, NVSwitch power, auxiliary rails)
  • HDI build-up layers for fine-pitch GPU and NVSwitch BGA escape routing

Typical layer counts: 20–24 layers for H100 HGX baseboard; 24–32 layers for B200 baseboard; 32–40+ layers for dedicated NVSwitch boards in GB200 NVL72. For architecture-specific layer count analysis, see NVIDIA Blackwell Architecture Explained and A100 vs H100: PCB Stack Differences.

Material Selection

Training boards demand ultra-low-loss laminates on all NVLink and PCIe Gen5/Gen6 signal layers:

| Signal Type | Speed | Required Laminate (Df) |
|---|---|---|
| NVLink 4.0 (H100) | 100 Gb/s per lane | Megtron 6E / Tachyon 100G (Df ≤ 0.003) |
| NVLink 5.0 (B200) | 200 Gb/s per lane | Megtron 7 (Df ≤ 0.002) |
| PCIe Gen5 | 32 GT/s per lane | Megtron 6E or equivalent |
| PCIe Gen6 (B200) | 64 GT/s per lane (PAM4) | Megtron 7 or Rogers ultra-low-loss |
| Power and ground planes | DC / low frequency | Megtron 6 or FR4-class acceptable |
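
The Df limits in this table translate directly into dielectric loss per unit length. The sketch below uses the standard low-loss approximation α_d = π · f · √Dk · Df / c. The Dk of 3.4, the FR4-class Df of 0.02, and the 25 GHz evaluation frequency (the Nyquist frequency of a 100 Gb/s PAM4 lane) are assumptions chosen for illustration, and conductor loss is not included.

```python
import math

C = 299_792_458.0  # speed of light, m/s

def dielectric_loss_db_per_inch(freq_hz, dk, df):
    """Dielectric attenuation from alpha = pi * f * sqrt(Dk) * Df / c."""
    alpha_np_per_m = math.pi * freq_hz * math.sqrt(dk) * df / C
    return alpha_np_per_m * 8.686 * 0.0254  # Np/m -> dB/m -> dB/inch

F_NYQUIST = 25e9  # Hz, 100 Gb/s PAM4 (50 GBd)
DK = 3.4          # assumed dielectric constant

for label, df in (("Df 0.002", 0.002), ("Df 0.003", 0.003),
                  ("Df 0.020 (assumed FR4-class)", 0.020)):
    loss = dielectric_loss_db_per_inch(F_NYQUIST, DK, df)
    print(f"{label}: {loss:.2f} dB/inch at 25 GHz")
```

Loss scales linearly with Df, so on a long baseboard trace the difference between laminate classes amounts to many dB of channel budget.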

Copper foil on NVLink signal layers must be very-low-profile (VLP) or high-VLP (HVLP) to minimize skin-effect losses at > 10 GHz. Standard electrodeposited (ED) copper is not acceptable on NVLink 4.0 or NVLink 5.0 routing layers.
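
The reason foil roughness matters at these frequencies is that current crowds into a very thin surface layer. A quick skin-depth calculation shows the scale, using the standard formula δ = √(ρ / (π · f · μ)). The copper resistivity of 1.68 × 10⁻⁸ Ω·m is a room-temperature textbook value supplied here as an assumption.

```python
import math

def skin_depth_um(freq_hz, resistivity_ohm_m=1.68e-8, mu_r=1.0):
    """Skin depth in micrometres: sqrt(rho / (pi * f * mu0 * mu_r))."""
    mu0 = 4e-7 * math.pi  # permeability of free space, H/m
    depth_m = math.sqrt(resistivity_ohm_m / (math.pi * freq_hz * mu0 * mu_r))
    return depth_m * 1e6

for f_ghz in (1, 10, 25):
    print(f"{f_ghz:>2} GHz: skin depth {skin_depth_um(f_ghz * 1e9):.2f} um")
```

At 10 GHz and above the skin depth falls below one micrometre, so surface roughness of comparable or larger size lengthens the current path and raises conductor loss.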

Power Delivery Network

Training PDN must deliver full TDP continuously, without interruption, for weeks or months:

  • Target impedance: < 0.15 mΩ from DC to 100 MHz at GPU package; some designs target < 100 μΩ
  • VCORE current: 400–800+ A per GPU (depending on generation); 2–3 oz copper on VCORE planes minimum
  • Decoupling capacitors: Tiered bulk (100–470 μF), mid-frequency (10–47 μF), and high-frequency (100 nF) distributed across the board in proximity to GPU packages
  • VRM placement: Within 20–40 mm of GPU package to minimize power loop inductance; inductor DCR and VRM output impedance dominate PDN at low frequencies
  • Sustained operation: VRMs must be rated for 100% duty cycle at maximum output current; thermal management of VRM components (inductor, FETs) must be validated for continuous operation
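
A target impedance of this order follows from the usual PDN rule of thumb, Z_target = (supply voltage × allowed ripple fraction) / transient current step. The numbers below (0.8 V core rail, 3% ripple budget, 200 A load step) are hypothetical values chosen only to show the scale.

```python
def target_impedance_ohm(v_supply, ripple_fraction, current_step_a):
    """PDN target impedance: allowed voltage deviation / transient current."""
    return v_supply * ripple_fraction / current_step_a

# Hypothetical GPU core rail: 0.8 V, 3% ripple budget, 200 A load step
z = target_impedance_ohm(0.8, 0.03, 200.0)
print(f"Target impedance: {z * 1e3:.2f} mOhm")  # 0.12 mOhm
```

Larger current steps or tighter ripple budgets push the target lower still, which is why sub-100 μΩ targets appear on the newest designs.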

Thermal Management

Sustained maximum TDP operation—700 W (H100), 1,000 W (B200) per GPU, plus NVSwitch power—demands aggressive thermal management at the board level:

  • Direct liquid cooling (DLC) is the standard for H100 and mandatory for B200 training systems
  • Cold plate mounting structures must be precisely aligned to GPU module packages; mechanical tolerance stack-up across the GPU module, PCB, and cold plate must maintain < 0.1 mm TIM gap across the full contact area
  • Thermal via arrays (0.4–0.6 mm pitch) under VRM component areas transfer heat to internal copper spreader planes and chassis ground planes
  • PCB material Tg ≥ 170–180°C mandatory; months of continuous operation at maximum TDP keep the board at elevated temperature, and that sustained thermal stress degrades lower-Tg materials over time

Interconnect Routing

Training baseboards carrying NVSwitch must route thousands of NVLink differential pairs with:

  • Differential impedance: 100 Ω ± 5%
  • Intra-pair skew: < 5 ps
  • Via stubs: < 10 mils after backdrilling (NVLink 4.0); < 5 mils (NVLink 5.0)
  • Reference plane continuity: no splits or voids beneath any NVLink trace
  • Crosstalk: NEXT < −30 dB at 25 GHz; FEXT < −40 dB at 25 GHz
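
Two of these limits can be converted into physical dimensions with basic transmission-line arithmetic. The sketch assumes an effective dielectric constant of 3.4 (not given in the article) and uses the quarter-wave approximation for via stub resonance, f = v / (4 · L).

```python
import math

C_MM_PER_PS = 0.299792458  # speed of light in mm per picosecond

def skew_to_length_mm(skew_ps, dk_eff):
    """Trace length mismatch that produces a given timing skew."""
    return skew_ps * C_MM_PER_PS / math.sqrt(dk_eff)

def stub_resonance_ghz(stub_mils, dk_eff):
    """Quarter-wave resonance of an unterminated via stub."""
    stub_mm = stub_mils * 0.0254
    v_mm_per_ns = C_MM_PER_PS * 1000 / math.sqrt(dk_eff)
    return v_mm_per_ns / (4 * stub_mm)  # (mm/ns) / mm = GHz

DK_EFF = 3.4  # assumed effective dielectric constant

print(f"5 ps skew = {skew_to_length_mm(5, DK_EFF):.2f} mm of length mismatch")
for stub in (100, 10, 5):
    print(f"{stub:>3} mil stub resonates near "
          f"{stub_resonance_ghz(stub, DK_EFF):.0f} GHz")
```

Under these assumptions a 100 mil stub left without backdrilling resonates below the 25 GHz band of interest, while stubs of 10 mils or less move the resonance far above it.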

PCB Design for AI Inference Systems

Layer Count and Stackup

Inference PCBs span a much wider range of complexity than training boards, because inference systems range from edge devices to data center servers:

| Inference Deployment | Typical Accelerator | Typical PCB Layer Count |
|---|---|---|
| Edge / on-device inference | NVIDIA Jetson, custom ASIC, NPU | 4–8 layers |
| Workstation / single-GPU inference | NVIDIA RTX 4090, L40S (PCIe card) | 8–12 layers (add-in card PCB) |
| Data center inference (PCIe) | NVIDIA L40S, H100 PCIe, A10 | 12–16 layers (add-in card) |
| Data center inference (SXM/OAM) | H100 SXM5, H200, MI300X OAM | 16–24 layers (baseboard/UBB) |
| Rack-scale inference | GB200 NVL72 | 24–32+ layers (same as training) |

For data center inference using standard PCIe add-in cards (NVIDIA L40S, A10G), a 12–16 layer board with PCIe Gen4/Gen5 routing is the typical design target. Inference GPUs in the PCIe add-in card form factor have no NVLink fabric to route on the board, which removes the primary driver of high layer count in training boards.

Material Selection

Inference systems have more relaxed signal integrity requirements than training boards on the inter-GPU interconnect layers—because many inference deployments do not use high-bandwidth GPU-to-GPU interconnects at all:

  • PCIe Gen5 signal layers: Megtron 6E or equivalent (Df ≤ 0.003); same requirement as training boards for PCIe Gen5
  • PCIe Gen4 signal layers (older inference cards): Megtron 6 (Df ~0.004) is typically adequate; lower per-lane frequency reduces material loss impact
  • Inter-GPU interconnect (if present): Same requirements as equivalent training boards (NVLink 4.0 → Megtron 6E; Infinity Fabric → Megtron 6)
  • Power and ground planes: Standard Megtron 6 or FR4-class acceptable; inference cards at moderate TDP (300–400 W) do not require the heavy copper power plane work of 700–1,000 W training boards

Power Delivery Network

Inference PDN design is less demanding than training in one key way: GPU utilization is variable rather than continuously at maximum. However, transient response remains critical because inference workloads experience burst loads when large batches arrive simultaneously:

  • Lower sustained current: Average GPU power at 40–70% of peak TDP means the PDN must still carry peak current, but its thermal sizing can assume a lower sustained load; copper plane thickness and VRM sizing can be reduced vs training boards at equivalent TDP
  • Transient response: The transition from idle to full prefill compute can be nearly instantaneous (a large batch arriving at a GPU handling low background load); PDN must respond without excessive voltage droop; target impedance at the GPU package < 0.2 mΩ DC to 100 MHz
  • Lower-TDP inference cards: NVIDIA L40S (350 W), A10G (300 W) use PCIe power delivery (75 W from the PCIe slot plus auxiliary connectors: 150 W per 8-pin connector, or up to 600 W from a single 16-pin 12VHPWR connector); dedicated power planes with 1–2 oz copper are adequate at these power levels

Thermal Management

Inference thermal management varies widely by deployment tier:

  • PCIe add-in card inference GPUs (L40S, A10G, 300–400 W): Air cooling with GPU blower or open-air cooler; no liquid cooling required; thermal design follows standard server GPU card practices
  • SXM5/OAM data center inference (H100, H200, MI300X at 700–750 W): Same thermal management requirements as equivalent training hardware; the difference is that average power dissipation may be lower, but the thermal system must still be designed for peak TDP
  • Edge inference (Jetson, NPU modules, < 50 W): Passive heatsink or small fan; PCB thermal via arrays sufficient; board material Tg requirements relaxed to ≥ 130–150°C

Form Factors: Edge, PCIe, and OAM

Inference deployments use a wider range of PCB form factors than training, which is almost exclusively SXM or OAM:

  • Standard PCIe add-in card: NVIDIA L40S, A10G, RTX 6000 Ada; compatible with any standard server PCIe slot; simplest integration; limited to PCIe bandwidth for host interface; no NVLink
  • OAM module (MI300X, Intel Gaudi 3): High-performance inference with 128–192 GB of HBM per module; requires OAM-compliant UBB; same PCB complexity as training configuration
  • SXM5 (H100/H200 for inference): Maximum single-node GPU performance; same baseboard as training; chosen when training hardware is repurposed for inference or when very high throughput is needed
  • NVIDIA MGX: A modular server reference design for inference that standardizes the mechanical and electrical interface between GPU cards and server chassis, similar in intent to OAM but within NVIDIA's ecosystem
  • Edge modules (Jetson Orin, custom SoM): System-on-Module designs with integrated CPU + GPU/NPU; mounted via high-density board-to-board connectors on carrier PCBs; typically 4–8 layer carrier boards

PCB Design Comparison Table: Training vs Inference

| PCB Design Parameter | AI Training (High-End) | AI Inference (Data Center) | AI Inference (Edge) |
|---|---|---|---|
| Typical layer count | 20–32+ | 12–24 (PCIe to OAM) | 4–8 |
| NVLink routing layers | Yes (6–12 dedicated layers) | No (PCIe add-in) or Yes (SXM); Infinity Fabric on OAM | No |
| NVSwitch on board | Yes (H100 HGX, B200) | No (PCIe, OAM/UBB); Yes when an HGX SXM baseboard is used | No |
| Signal layer laminate | Megtron 6E / Megtron 7 / Tachyon 100G | Megtron 6 to Megtron 6E | FR4 or standard low-loss |
| Copper foil grade | VLP / HVLP on NVLink layers | LP to VLP on PCIe Gen5 layers | Standard ED copper |
| GPU TDP per accelerator | 700–1,000 W | 300–750 W | 10–60 W |
| PDN target impedance | < 0.15 mΩ DC–100 MHz | < 0.2 mΩ DC–100 MHz | < 1 mΩ (relaxed) |
| Copper weight (power planes) | 2–3 oz (70–105 μm) | 1–2 oz (35–70 μm) | 1 oz (35 μm) |
| Cooling type | Direct liquid cooling (mandatory at B200) | Air or DLC depending on TDP | Passive or small fan |
| Board material Tg | ≥ 170–180°C | ≥ 150–170°C | ≥ 130–150°C |
| HDI / backdrilling | Yes (both); mandatory | Backdrilling for PCIe Gen5; HDI for fine-pitch BGAs | Typically not required |
| Backdrilling required | Yes (NVLink + PCIe Gen5/6) | Yes (PCIe Gen5); No (PCIe Gen4) | No |
| Board size | Large (400–700+ mm per side) | PCIe card (~312 × 111 mm) or UBB (400–600 mm) | Small (50–150 mm per side) |
| Manufacturing complexity | Very high | Medium to high | Low to medium |

Accelerator and PCB Platform Selection by Workload

| Workload | Recommended Accelerator | Form Factor | PCB Complexity |
|---|---|---|---|
| LLM pre-training (> 7B parameters) | H100 SXM5 / B200 SXM6 | SXM baseboard (NVIDIA HGX) | Very high (20–32+ layers) |
| LLM fine-tuning (7B–70B) | H100 SXM5 / MI300X OAM | SXM baseboard or OAM UBB | High (16–24 layers) |
| LLM inference (70B–180B, low latency) | MI300X OAM / H200 SXM5 | OAM UBB or SXM baseboard | High (16–22 layers) |
| LLM inference (7B–34B, high throughput) | H100 PCIe / L40S / A100 PCIe | PCIe add-in card | Medium (12–16 layers) |
| Vision model inference (production) | A10G / L40S / T4 | PCIe add-in card | Medium (12–16 layers) |
| Edge inference (on-device) | NVIDIA Jetson Orin / custom NPU | SoM on carrier board | Low (4–8 layers) |
| HPC + AI combined | MI300X OAM | OAM UBB | High (16–22 layers) |

FAQ

Can the same GPU hardware be used for both training and inference?
Yes, and this is common in practice. H100 SXM5 systems deployed for training are often repurposed for inference when not running training jobs, or operated in a mixed training/inference workload configuration. The hardware is physically identical; only the software workload changes. However, purpose-built inference hardware (L40S, MI300X) may be more cost-effective than high-end training hardware for pure inference workloads due to lower cost-per-GPU and, in the case of MI300X, much larger memory capacity.

Why does training require more GPU memory than inference for the same model?
Training stores model weights, gradients, and optimizer states simultaneously. With the Adam optimizer and mixed-precision training (BF16 weights and gradients + FP32 master weights + FP32 optimizer moments), memory consumption is approximately 16–20 bytes per parameter before activations. Inference stores only the (possibly quantized) weights, at approximately 2 bytes per parameter in BF16 or 1 byte in INT8, plus the KV cache. A 70B model therefore requires ~1.12–1.4 TB for training and ~70–140 GB of weights for inference.

Is PCIe Gen5 required for inference servers?
For most data center inference deployments in 2025–2026, yes. Current-generation data center accelerators (H100 PCIe, H200, MI300X OAM) use PCIe Gen5 ×16 as their host interface. PCIe Gen4 remains adequate for lower-throughput inference configurations and is the host interface on cards such as the L40S and A10G; the older T4 uses PCIe Gen3. The PCB signal integrity requirements for PCIe Gen5 (backdrilling, Megtron 6E laminate, ± 5% impedance tolerance) apply regardless of whether the board is used for training or inference.

What is the key PCB difference between a training board and an inference board?
The most significant difference is the presence or absence of NVLink/NVSwitch routing. Training boards for NVIDIA H100 or B200 must route thousands of NVLink differential pairs at 100–200 Gb/s per lane between GPU and NVSwitch packages, driving very high layer counts (20–32+) and demanding ultra-low-loss laminates throughout the high-speed signal layers. Standard PCIe inference add-in cards (L40S, H100 PCIe) route no NVLink across a baseboard (the H100 PCIe exposes NVLink only through an optional bridge connector between cards), and their PCB complexity is correspondingly lower (12–16 layers, moderate material requirements).

Do edge inference PCBs have any special requirements?
Edge inference PCBs (Jetson Orin, custom NPU carrier boards) have relatively modest PCB requirements compared to data center boards—4–8 layers, standard FR4 or low-loss laminate, moderate PDN. The primary challenges are miniaturization (high component density in a small form factor), thermal management in passively cooled enclosures, and EMI compliance for deployment in consumer or automotive environments. The accelerator SoM itself (the Jetson module, for example) contains the complex high-layer-count PCB; the carrier board is relatively straightforward.

Why is memory bandwidth more important for inference than training?
During the decode phase of autoregressive LLM inference, the GPU generates one token at a time. For each token, it must load the weights of every layer in the model from HBM memory. At small batch sizes, the GPU performs very little arithmetic per byte loaded (low arithmetic intensity), meaning the throughput is limited by how fast it can read weights—not by how many FLOPS it can compute. Higher HBM bandwidth (TB/s) directly translates to more tokens generated per second in this memory-bound regime.



About the Author

Arya Li, Project Manager at NextPCB.com

With extensive experience in manufacturing and international client management, Arya has guided factory visits for over 200 overseas clients, providing bilingual (English & Chinese) presentations on production processes, quality control systems, and advanced manufacturing capabilities. Her deep understanding of both the factory side and client requirements allows her to deliver professional, reliable PCB solutions efficiently. Detail-oriented and service-driven, Arya is committed to being a trusted partner for clients and showcasing the strength and expertise of the factory in the global PCB and PCBA market.