AI Training vs AI Inference: Why They Need Different PCB Designs

Posted: June 2026 | Last Updated: June 2026 | Writer: Arya Li

Introduction

The phrase “AI accelerator PCB” covers an enormous range of hardware—from a compact PCIe card running quantized LLM inference at the edge to a 32-layer baseboard distributing 10 kW across eight H100 GPUs during large-scale pre-training. These are not variations on the same design. They are fundamentally different engineering problems, driven by the profoundly different demands that AI training and AI inference place on compute, memory, interconnect, power, and thermal management.

Understanding those differences is essential for any hardware engineer designing or manufacturing AI accelerator boards. Optimizing a PCB for training performance when it will primarily run inference—or vice versa—wastes cost, adds unnecessary complexity, and can result in a system that underperforms in both roles. This article breaks down the two workloads from first principles and explains how each one shapes PCB architecture from layer count through cooling integration.

Table of Contents
Introduction
Defining Training and Inference
What Is AI Training?
What Is AI Inference?
Workload Characteristics Compared
Compute Requirements
Training: Throughput-Optimized, FP8/BF16 Heavy
Inference: Latency-Sensitive, INT8/FP8 Quantized
Memory Requirements
Training: Large Activation and Gradient Storage
Inference: Model Weight Fit and KV Cache
Interconnect Requirements
Training: Maximum All-to-All Bandwidth
Inference: Lower Collective Bandwidth, Higher PCIe Throughput
Power and Thermal Requirements
PCB Design for AI Training Systems
Layer Count and Stackup
Material Selection
Power Delivery Network
Thermal Management
Interconnect Routing
PCB Design for AI Inference Systems
Layer Count and Stackup
Material Selection
Power Delivery Network
Thermal Management
Form Factors: Edge, PCIe, and OAM
PCB Design Comparison Table: Training vs Inference
Accelerator and PCB Platform Selection by Workload
FAQ

Defining Training and Inference

What Is AI Training?

AI training is the process of adjusting the parameters (weights) of a neural network model to minimize a loss function on a dataset. It involves:

Forward pass: Input data is fed through the model layer by layer, producing a prediction
Loss computation: The prediction is compared to the ground truth; a scalar loss value is computed
Backward pass (backpropagation): Gradients of the loss with respect to every weight in the model are computed using the chain rule, flowing backward through the network
Weight update: An optimizer (Adam, SGD, etc.) uses the gradients to adjust every weight in the model

This cycle repeats billions of times across a training run. For a large language model with hundreds of billions of parameters, a single training run may require months of continuous operation across thousands of GPUs. The computational demand is sustained, predictable, and dominated by matrix multiplication operations (GEMMs) on large batches of data.
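The four steps above can be sketched in a few lines of plain Python. This is a toy illustration only: a one-weight linear model with a hand-derived gradient and plain SGD. The data, learning rate, and the function name `train_linear` are assumptions made for the example and do not come from any real training stack.

```python
def train_linear(xs, ys, lr=0.05, steps=200):
    """Fit y = w * x with plain SGD on mean squared error (toy example)."""
    w = 0.0
    n = len(xs)
    loss = float("inf")
    for _ in range(steps):
        # Forward pass: compute predictions with the current weight
        preds = [w * x for x in xs]
        # Loss computation: mean squared error against the ground truth
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / n
        # Backward pass: d(loss)/dw obtained with the chain rule
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        # Weight update: step against the gradient
        w -= lr * grad
    return w, loss


w, loss = train_linear([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(f"w = {w:.4f}, final loss = {loss:.2e}")
```

A real training run performs the same four steps, but with billions of weights, automatic differentiation, and the work sharded across many GPUs.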

What Is AI Inference?

AI inference is the process of using a trained model to produce outputs from new inputs. The model weights are fixed; no gradients are computed; no weight updates occur. Inference involves only the forward pass.

However, inference has its own demanding characteristics:

Latency sensitivity: Users or downstream systems expect responses within milliseconds to seconds; training is indifferent to per-step latency as long as aggregate throughput is high
Variable batch sizes: Inference workloads range from batch size 1 (real-time, single-user) to batch size 512+ (offline batch processing); the hardware must perform well across this range
Memory-bound at small batches: At low batch sizes, GPU compute cores are underutilized because memory bandwidth (loading weights from HBM) is the bottleneck, not arithmetic throughput
KV cache: Autoregressive LLM inference stores key-value attention tensors for each generated token, consuming GPU memory proportional to sequence length and batch size; at long contexts and large batches, the KV cache can exceed the memory used by the model weights themselves
Quantization: Inference commonly uses INT8, INT4, or FP8 quantized weights to reduce memory footprint and increase throughput; training typically uses BF16 or FP32 for numerical stability during gradient accumulation

Workload Characteristics Compared

Dimension	AI Training	AI Inference
Primary operation	Forward pass + backward pass + weight update	Forward pass only
Compute pattern	Large, regular GEMM operations; highly parallelizable	Mixed GEMM (prefill) and memory-bound decode; variable batch
Dominant bottleneck	Compute throughput (TFLOPS)	Memory bandwidth (TB/s) at small batch; compute at large batch
Precision	BF16, FP16, FP8 (mixed precision); FP32 for optimizer states	INT8, INT4, FP8 (quantized); BF16 for quality-sensitive paths
Memory access pattern	Large sequential reads of activation tensors and weights; gradient writes	Weight reads (repeated); KV cache reads and writes; low data reuse
GPU memory required	Model weights + activations + gradients + optimizer states; roughly 12–20 bytes per parameter before activations	Model weights + KV cache; about 2 bytes per parameter in FP16/BF16, less when quantized
GPU-to-GPU communication	Frequent, high-bandwidth all-reduce for gradient synchronization	Moderate (tensor parallelism) to minimal (single-GPU inference)
Latency sensitivity	Low (throughput matters, not per-step latency)	High (time-to-first-token, tokens-per-second directly user-visible)
Power consumption	Sustained maximum TDP; GPUs run at full throttle continuously	Variable; often below peak TDP, especially at low batch sizes
Workload duration	Days to months per training run	Milliseconds per request; continuous 24/7 serving

Compute Requirements

Training: Throughput-Optimized, FP8/BF16 Heavy

Training throughput is measured in TFLOPS (tera floating-point operations per second) at the relevant precision, most commonly BF16 or FP8 for transformer model training in 2025–2026. The GPU must sustain high arithmetic throughput across the many large matrix multiplications in every training step.

Key compute characteristics for training:

Large, regular matrix shapes: Training GEMMs are typically square or near-square, which maps efficiently onto GPU tensor core arrays and keeps hardware utilization (MFU, Model FLOP Utilization) high—often 40–60% of peak TFLOPS on well-optimized training runs
Mixed precision: Forward and backward passes in BF16 or FP8; master weights and optimizer states in FP32; the PCB's power delivery must support the sustained full-throttle compute associated with maximum TDP operation
FP8 advantage: The native FP8 Transformer Engine in NVIDIA H100 and B200 roughly doubles peak throughput on transformer training vs BF16 by halving the data width. Accelerators without native FP8 support (for example, the earlier NVIDIA A100 or AMD MI250X) cannot exploit this; AMD's MI300X does support FP8
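The MFU figure mentioned above can be estimated from measured training throughput. The sketch below uses the common approximation of about 6 FLOPs per parameter per token for a dense transformer training step (forward plus backward), which ignores attention-score FLOPs. The throughput of 10,000 tokens/s per GPU and the 989 TFLOPS peak are assumed illustrative inputs, and `model_flops_utilization` is a hypothetical helper name.

```python
def model_flops_utilization(params, tokens_per_second, peak_flops):
    """Estimate MFU with the ~6 FLOPs per parameter per token rule of thumb."""
    achieved_flops = 6 * params * tokens_per_second
    return achieved_flops / peak_flops


# Assumed inputs: 7B-parameter model, 10,000 tokens/s per GPU,
# 989 TFLOPS dense BF16 peak (an H100 SXM-class figure)
mfu = model_flops_utilization(params=7e9, tokens_per_second=10_000, peak_flops=989e12)
print(f"MFU = {mfu:.1%}")
```

With these assumed inputs the estimate lands in the low 40% range, consistent with the 40–60% band quoted for well-optimized runs.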

For PCB design, sustained training at maximum TDP means the power delivery network and thermal management systems must be designed for continuous worst-case load, not merely for brief peaks above a lower average. There is no power averaging benefit to exploit.

Inference: Latency-Sensitive, INT8/FP8 Quantized

Inference compute patterns differ from training in two important ways. First, inference executes only the forward pass, which is roughly one third of the compute of a full training step for the same model and batch size, because the backward pass costs about twice as much as the forward pass. Second, inference commonly uses quantized precision (INT8, INT4, FP8) to reduce both memory footprint and arithmetic cost, trading a small amount of output quality for significantly higher throughput.

Key compute characteristics for inference:

Prefill phase (prompt processing): The initial processing of the input prompt is compute-bound (large GEMM, high batch parallelism); throughput-optimized like training
Decode phase (token generation): Autoregressive generation of each output token is memory-bandwidth-bound at typical serving batch sizes; arithmetic units are underutilized while the GPU waits for weights to load from HBM
Quantization impact: INT8 weights are half the size of BF16 and INT4 weights a quarter, so each layer's weights load from HBM 2–4× faster, directly improving decode throughput. The accelerator's HBM bandwidth (TB/s per GPU), which is fixed inside the GPU package rather than by board routing, is therefore a primary determinant of inference serving capacity
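The memory-bound decode behavior described above implies a simple ceiling: if every generated token requires reading all weights once, the single-sequence token rate cannot exceed HBM bandwidth divided by weight bytes. The sketch below is a rough bound under that assumption; it ignores KV cache reads, batching, and kernel overheads. The 3.35 TB/s bandwidth and the 13B model size are assumed example inputs, and the function name is hypothetical.

```python
def decode_tokens_per_second_bound(params, bytes_per_param, hbm_bytes_per_second):
    """Upper bound on single-sequence decode rate when weight loading dominates."""
    weight_bytes = params * bytes_per_param
    return hbm_bytes_per_second / weight_bytes


HBM_BW = 3.35e12  # assumed 3.35 TB/s, roughly an H100 SXM-class figure

for name, bytes_per_param in [("BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    bound = decode_tokens_per_second_bound(13e9, bytes_per_param, HBM_BW)
    print(f"{name}: at most {bound:.0f} tokens/s per sequence")
```

Halving the bytes per parameter doubles the bound, which is the 2–4× effect of INT8 and INT4 quantization on decode throughput.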

Memory Requirements

Training: Large Activation and Gradient Storage

Training memory consumption is substantially larger than inference for the same model size. For a model with P parameters, the approximate GPU memory footprint during training is:

Model weights (BF16): 2P bytes
Gradients (BF16): 2P bytes
Optimizer states (Adam, FP32): 8P bytes (first moment + second moment, each FP32)
Activations (BF16, for backward pass): Proportional to batch size × sequence length × hidden dimension; can be reduced via activation checkpointing at the cost of recomputation
Total (excluding activations): approximately 12P bytes for the three items above; keeping a separate FP32 master copy of the weights, as is common in mixed-precision training with Adam, adds 4P bytes for roughly 16P

For a 7B-parameter model: approximately 84 GB minimum (without activations). For a 70B-parameter model: approximately 840 GB, requiring model parallelism across many GPUs. This is why training large models requires high aggregate GPU memory across a cluster, and why fast GPU-to-GPU interconnect for gradient synchronization is critical.
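The arithmetic behind these figures can be reproduced with a short calculation. This is a sketch under the byte counts listed above (BF16 weights, BF16 gradients, FP32 Adam moments), with activations excluded; the optional FP32 master-weight copy and the helper name `training_memory_gb` are assumptions added for illustration.

```python
def training_memory_gb(params, master_weights_fp32=False):
    """Static training memory in GB (1 GB = 1e9 bytes), excluding activations."""
    # 2 bytes BF16 weights + 2 bytes BF16 gradients + 8 bytes FP32 Adam moments
    bytes_per_param = 2 + 2 + 8
    if master_weights_fp32:
        # Optional FP32 master copy of the weights adds 4 bytes per parameter
        bytes_per_param += 4
    return params * bytes_per_param / 1e9


for p in (7e9, 70e9):
    base = training_memory_gb(p)
    with_master = training_memory_gb(p, master_weights_fp32=True)
    print(f"{p / 1e9:.0f}B params: {base:.0f} GB, {with_master:.0f} GB with FP32 master weights")
```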

Inference: Model Weight Fit and KV Cache

Inference memory is simpler but still critical. For a model with P parameters in BF16 precision, the weight footprint is 2P bytes. In INT8 quantization, it is P bytes; in INT4, 0.5P bytes. Additionally, the KV cache for autoregressive inference consumes:

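KV cache size (bytes) = 2 × num_layers × num_kv_heads × head_dim × sequence_length × batch_size × precision_bytes

Here the leading 2 accounts for keys and values, and num_kv_heads is the number of key/value heads, which equals the number of attention heads in standard multi-head attention and is smaller under grouped-query attention (GQA).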

For a 70B model (LLaMA 2 70B architecture, which uses GQA with 8 key/value heads) at BF16, generating 4,096 tokens at batch size 32, the KV cache alone consumes approximately 32–48 GB. Combined with model weights (~140 GB in BF16), this far exceeds the H100's 80 GB, which is why larger-memory parts such as the AMD MI300X (192 GB) and NVIDIA H200 (141 GB) target large-model inference. For more on memory architecture trade-offs between H100 and MI300X, see H100 vs MI300X: NVIDIA vs AMD AI Accelerator Comparison.
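The KV cache estimate can be checked directly. The sketch below counts key/value heads rather than all query heads, and plugs in the commonly published LLaMA 2 70B shape (80 layers, 8 key/value heads under grouped-query attention, head dimension 128); treat those shape values as assumptions of the example, and `kv_cache_bytes` as a hypothetical helper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size, precision_bytes):
    """KV cache size in bytes for one serving batch."""
    # Leading factor 2: one key tensor and one value tensor per layer
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * precision_bytes


size = kv_cache_bytes(
    num_layers=80, num_kv_heads=8, head_dim=128,
    seq_len=4096, batch_size=32, precision_bytes=2,  # BF16
)
print(f"KV cache: {size / 1e9:.1f} GB")
```

With these inputs the result is about 43 GB, inside the 32–48 GB range quoted above. Using all 64 attention heads instead of the 8 key/value heads would multiply the figure by eight.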

Model Size	BF16 Weight Footprint	INT8 Weight Footprint	Fits in H100 (80 GB)?	Fits in MI300X (192 GB)?
7B parameters	~14 GB	~7 GB	Yes (BF16 and INT8)	Yes
13B parameters	~26 GB	~13 GB	Yes (BF16 and INT8)	Yes
34B parameters	~68 GB	~34 GB	Marginal (BF16); Yes (INT8)	Yes
70B parameters	~140 GB	~70 GB	No (BF16); Yes (INT8)	Yes (BF16 and INT8)
180B parameters	~360 GB	~180 GB	No	No (requires multi-GPU)

Interconnect Requirements

Training: Maximum All-to-All Bandwidth

Training large models across multiple GPUs requires collective communication operations—primarily all-reduce for gradient synchronization. In tensor parallelism (splitting individual model layers across GPUs) and pipeline parallelism (splitting model layers between GPUs in stages), inter-GPU bandwidth directly determines how much time GPUs spend communicating vs computing.

At scale, inter-GPU communication can account for 30–50% of wall-clock training time if the interconnect is undersized. This is why training systems use the highest-bandwidth available interconnects:

NVLink 4.0 at 900 GB/s per GPU (H100, via NVSwitch) for within-node communication
NVLink 5.0 at 1,800 GB/s per GPU (B200) for within-rack communication in GB200 NVL72
400G InfiniBand for between-node communication

For PCB design, this means training baseboards must route a very large number of NVLink differential pairs at 100–200 Gb/s per lane, requiring dedicated high-speed signal layers with ultra-low-loss laminates and VLP copper foil. See What Is NVLink? and What Is NVSwitch? for detailed interconnect PCB requirements.
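To see why interconnect bandwidth matters so much for gradient synchronization, a bandwidth-only estimate of ring all-reduce time is useful. In a ring all-reduce, each GPU sends and receives about 2(N−1)/N times the buffer size. The sketch below assumes that model and ignores latency, protocol overhead, and overlap with compute; the 14 GB gradient buffer and 450 GB/s per-direction bandwidth are illustrative assumptions, and the function name is hypothetical.

```python
def ring_allreduce_seconds(buffer_bytes, num_gpus, bytes_per_second_per_direction):
    """Bandwidth-only time estimate for a ring all-reduce across num_gpus."""
    # Each GPU moves 2 * (N - 1) / N of the buffer: reduce-scatter plus all-gather
    traffic = 2 * (num_gpus - 1) / num_gpus * buffer_bytes
    return traffic / bytes_per_second_per_direction


# Assumed inputs: 14 GB of BF16 gradients (7B parameters), 8 GPUs
fast = ring_allreduce_seconds(14e9, 8, 450e9)  # NVLink-class link, assumed 450 GB/s each way
slow = ring_allreduce_seconds(14e9, 8, 50e9)   # slower link, assumed 50 GB/s each way
print(f"fast link: {fast * 1e3:.1f} ms, slow link: {slow * 1e3:.1f} ms")
```

Because this synchronization recurs every training step, a ninefold difference in link bandwidth translates directly into a ninefold difference in unhidden communication time.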

Inference: Lower Collective Bandwidth, Higher PCIe Throughput

Inference interconnect requirements are lower than training for most deployments:

Single-GPU inference (model fits in one GPU): No GPU-to-GPU interconnect needed; PCIe Gen5 host interface is sufficient for data input/output
Tensor-parallel inference (model split across 2–8 GPUs): Requires all-reduce operations during the forward pass, but at lower frequency and volume than training backward passes; NVLink is beneficial but not as critical
Pipeline-parallel inference: Each GPU handles a subset of layers; communication is activation tensors passing between pipeline stages—lower bandwidth than all-reduce; PCIe Gen5 may be sufficient for smaller pipeline depths

The PCIe host interface is more important for inference than training because inference servers receive continuous streams of requests from the network, and the latency of transferring input data from the host CPU to GPU memory adds to time-to-first-token. A PCIe Gen5 ×16 link, at about 64 GB/s per direction (roughly 128 GB/s bidirectional), provides adequate bandwidth for most LLM serving configurations.
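The headline PCIe figure can be derived from the per-lane rate. The sketch below computes per-direction bandwidth after 128b/130b line encoding and a bandwidth-only transfer time; it ignores packet header overhead and DMA setup latency, so real transfers are slower. The 32 KB request payload (8,192 token IDs at 4 bytes each) is an assumed example, and both function names are hypothetical.

```python
def pcie_bandwidth_gb_per_s(gt_per_s, lanes, payload_bits=128, line_bits=130):
    """Per-direction PCIe bandwidth in GB/s after 128b/130b line encoding."""
    return gt_per_s * lanes * payload_bits / line_bits / 8


def transfer_time_us(num_bytes, bandwidth_gb_per_s):
    """Bandwidth-only transfer time in microseconds."""
    return num_bytes / (bandwidth_gb_per_s * 1e9) * 1e6


gen5_x16 = pcie_bandwidth_gb_per_s(32, 16)
gen4_x16 = pcie_bandwidth_gb_per_s(16, 16)
print(f"Gen5 x16: {gen5_x16:.1f} GB/s per direction, Gen4 x16: {gen4_x16:.1f} GB/s")
print(f"32 KB prompt over Gen5 x16: {transfer_time_us(32 * 1024, gen5_x16):.2f} us")
```

Gen5 ×16 works out to about 63 GB/s in each direction, or roughly 126 GB/s counting both directions, which is where the commonly quoted 128 GB/s figure comes from.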

Power and Thermal Requirements

Parameter	AI Training System	AI Inference System
GPU utilization	80–100% sustained; full TDP continuously	20–80% typical; variable with request rate
Power consumption	Sustained at or near maximum TDP	Variable; often 40–70% of peak TDP at moderate load
Thermal profile	Constant high heat; worst-case for cooling design	Variable; cooling must handle peak but average load is lower
Power delivery design target	Worst-case continuous: full VCORE current at max TDP	Can apply some power averaging; transient response still critical for burst loads
Cooling type	Direct liquid cooling (DLC) strongly preferred at 700–1,000 W per GPU	Air cooling viable for lower-power inference accelerators; DLC preferred for high-density
PSU sizing	Sized for 100% of peak TDP plus margin; no derated operation	Sized for 80–100% of peak TDP; lighter PSUs are possible if average load is consistently below peak
PDN transient response	Critical; GPU compute transitions cause large, fast current steps	Critical for prefill bursts; less demanding during steady-state decode

PCB Design for AI Training Systems

Layer Count and Stackup

Training baseboards (H100 HGX, GB200 NVL72 compute trays) are among the highest layer-count PCBs in production. Layer count is driven by:

NVLink/NVSwitch routing requiring 6–12 dedicated high-speed signal layers
Multiple power plane pairs (VCORE, VDDQ, NVSwitch power, auxiliary rails)
HDI build-up layers for fine-pitch GPU and NVSwitch BGA escape routing

Typical layer counts: 20–24 layers for H100 HGX baseboard; 24–32 layers for B200 baseboard; 32–40+ layers for dedicated NVSwitch boards in GB200 NVL72. For architecture-specific layer count analysis, see NVIDIA Blackwell Architecture Explained and A100 vs H100: PCB Stack Differences.

Material Selection

Training boards demand ultra-low-loss laminates on all NVLink and PCIe Gen5/Gen6 signal layers:

Signal Type	Speed	Required Laminate (Df)
NVLink 4.0 (H100)	100 Gb/s per lane	Megtron 6E / Tachyon 100G (Df ≤ 0.003)
NVLink 5.0 (B200)	200 Gb/s per lane	Megtron 7 (Df ≤ 0.002)
PCIe Gen5	32 GT/s per lane	Megtron 6E or equivalent
PCIe Gen6 (B200)	64 GT/s per lane (PAM4)	Megtron 7 or Rogers ultra-low-loss
Power and ground planes	DC / low frequency	Megtron 6 or FR4-class acceptable

Copper foil on NVLink signal layers must be very-low-profile (VLP) or high-VLP (HVLP) to minimize conductor losses from skin effect and surface roughness above 10 GHz. Standard electrodeposited (ED) copper is not acceptable on NVLink 4.0 or NVLink 5.0 routing layers.
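The reason foil roughness matters at these frequencies follows from the skin depth, δ = sqrt(ρ / (π f μ0)), which shrinks below one micrometer at 10 GHz. The sketch below evaluates that standard formula for copper; the resistivity is an approximate room-temperature value, and the function name is a hypothetical helper.

```python
import math

COPPER_RESISTIVITY = 1.68e-8  # ohm*m, approximate room-temperature value
MU_0 = 4 * math.pi * 1e-7     # H/m, permeability of free space


def skin_depth_um(freq_hz, resistivity=COPPER_RESISTIVITY):
    """Skin depth in micrometers for a non-magnetic conductor."""
    return math.sqrt(resistivity / (math.pi * freq_hz * MU_0)) * 1e6


for f_ghz in (1, 10, 25):
    print(f"{f_ghz} GHz: skin depth = {skin_depth_um(f_ghz * 1e9):.2f} um")
```

Once the skin depth is comparable to or smaller than the foil's surface roughness, current is forced to follow the rough surface profile and conductor loss rises, which is why smoother foil grades are specified on these layers.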

Power Delivery Network

Training PDN must deliver full TDP continuously, without interruption, for weeks or months:

Target impedance: < 0.15 mΩ from DC to 100 MHz at GPU package; some designs target < 100 μΩ
VCORE current: 400–800+ A per GPU (depending on generation); 2–3 oz copper on VCORE planes minimum
Decoupling capacitors: Tiered bulk (100–470 μF), mid-frequency (10–47 μF), and high-frequency (100 nF) distributed across the board in proximity to GPU packages
VRM placement: Within 20–40 mm of GPU package to minimize power loop inductance; inductor DCR and VRM output impedance dominate PDN at low frequencies
Sustained operation: VRMs must be rated for 100% duty cycle at maximum output current; thermal management of VRM components (inductor, FETs) must be validated for continuous operation
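A target impedance like the one above is commonly derived from the allowed voltage ripple and the worst-case load step: Z_target = (V × ripple fraction) / ΔI. The sketch below applies that rule of thumb. The 0.8 V rail, 3% ripple budget, and load steps are assumed illustrative values, not vendor specifications, and `pdn_target_impedance_mohm` is a hypothetical helper.

```python
def pdn_target_impedance_mohm(rail_voltage, ripple_fraction, current_step_a):
    """Target PDN impedance in milliohms from ripple budget and load step."""
    allowed_ripple_v = rail_voltage * ripple_fraction
    return allowed_ripple_v / current_step_a * 1e3


# Assumed example: 0.8 V core rail with a 3% ripple budget
for step_a in (100, 200, 400):
    z = pdn_target_impedance_mohm(0.8, 0.03, step_a)
    print(f"{step_a} A load step -> target impedance {z:.3f} mOhm")
```

Under these assumptions, load steps of a few hundred amperes push the target into the 0.06–0.24 mΩ range, the same order of magnitude as the sub-0.15 mΩ figure quoted above.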

Thermal Management

Sustained maximum TDP operation—700 W (H100), 1,000 W (B200) per GPU, plus NVSwitch power—demands aggressive thermal management at the board level:

Direct liquid cooling (DLC) is the standard for H100 and mandatory for B200 training systems
Cold plate mounting structures must be precisely aligned to GPU module packages; mechanical tolerance stack-up across the GPU module, PCB, and cold plate must maintain < 0.1 mm TIM gap across the full contact area
Thermal via arrays (0.4–0.6 mm pitch) under VRM component areas transfer heat to internal copper spreader planes and chassis ground planes
PCB material T_g ≥ 170–180°C mandatory; continuous operation at maximum TDP for months creates thermal cycling stress that degrades lower-T_g materials over time

Interconnect Routing

Training baseboards carrying NVSwitch must route thousands of NVLink differential pairs with:

Differential impedance: 100 Ω ± 5%
Intra-pair skew: < 5 ps
Via stubs: < 10 mils after backdrilling (NVLink 4.0); < 5 mils (NVLink 5.0)
Reference plane continuity: no splits or voids beneath any NVLink trace
Crosstalk: NEXT < −30 dB at 25 GHz; FEXT < −40 dB at 25 GHz
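A skew budget in picoseconds translates into a physical length-matching tolerance through the propagation velocity in the laminate, v = c / sqrt(Dk_eff). The sketch below performs that conversion; the effective dielectric constant of 3.4 is an assumed example value for a low-loss laminate, and the function name is hypothetical.

```python
import math

C_MM_PER_PS = 0.299792458  # speed of light in vacuum, mm per picosecond


def skew_to_length_mm(skew_ps, dk_eff):
    """Trace length mismatch (mm) that produces the given skew (ps)."""
    velocity_mm_per_ps = C_MM_PER_PS / math.sqrt(dk_eff)
    return skew_ps * velocity_mm_per_ps


# Assumed effective dielectric constant of 3.4 for illustration
mismatch = skew_to_length_mm(5, 3.4)
print(f"5 ps skew budget = {mismatch:.2f} mm length mismatch")
```

Under that assumption, a 5 ps intra-pair budget allows well under one millimeter of length mismatch between the two traces of a pair, before accounting for glass-weave and via effects.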

PCB Design for AI Inference Systems

Layer Count and Stackup

Inference PCBs span a much wider range of complexity than training boards, because inference systems range from edge devices to data center servers:

Inference Deployment	Typical Accelerator	Typical PCB Layer Count
Edge / on-device inference	NVIDIA Jetson, custom ASIC, NPU	4–8 layers
Workstation / single-GPU inference	NVIDIA RTX 4090, L40S (PCIe card)	8–12 layers (add-in card PCB)
Data center inference (PCIe)	NVIDIA L40S, H100 PCIe, A10	12–16 layers (add-in card)
Data center inference (SXM/OAM)	H100 SXM5, H200, MI300X OAM	16–24 layers (baseboard/UBB)
Rack-scale inference	GB200 NVL72	24–32+ layers (same as training)

For data center inference using standard PCIe add-in cards (NVIDIA L40S, A10G), a 12–16 layer board with PCIe Gen4/Gen5 routing is the typical design target. The absence of NVLink routing on PCIe add-in inference cards eliminates the primary driver of high layer count in training boards.

Material Selection

Inference systems have more relaxed signal integrity requirements than training boards on the inter-GPU interconnect layers—because many inference deployments do not use high-bandwidth GPU-to-GPU interconnects at all:

PCIe Gen5 signal layers: Megtron 6E or equivalent (Df ≤ 0.003); same requirement as training boards for PCIe Gen5
PCIe Gen4 signal layers (older inference cards): Megtron 6 (Df ~0.004) is typically adequate; lower per-lane frequency reduces material loss impact
Inter-GPU interconnect (if present): Same requirements as equivalent training boards (NVLink 4.0 → Megtron 6E; Infinity Fabric → Megtron 6)
Power and ground planes: Standard Megtron 6 or FR4-class acceptable; inference cards at moderate TDP (300–400 W) do not require the heavy copper power plane work of 700–1,000 W training boards

Power Delivery Network

Inference PDN design is less demanding than training in one key way: GPU utilization is variable rather than continuously at maximum. However, transient response remains critical because inference workloads experience burst loads when large batches arrive simultaneously:

Lower sustained current: With average GPU power at 40–70% of peak, the PDN's continuous-current rating can be relaxed as long as it still carries short peaks; copper plane thickness and VRM thermal sizing can be reduced vs training boards at equivalent TDP
Transient response: The transition from idle to full prefill compute can be nearly instantaneous (a large batch arriving at a GPU handling low background load); PDN must respond without excessive voltage droop; target impedance at the GPU package < 0.2 mΩ DC to 100 MHz
Lower-TDP inference cards: NVIDIA L40S (350 W) and A10G (150 W) draw power from the PCIe slot (75 W) plus auxiliary connectors; two 8-pin connectors add 300 W for 375 W total, while a 16-pin 12VHPWR connector supplies up to 600 W on its own; dedicated power planes with 1–2 oz copper are adequate at these power levels
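The power available to a PCIe add-in card is the slot allowance plus the rating of each auxiliary connector. The sketch below sums the standard ratings (75 W slot, 75 W for a 6-pin, 150 W for an 8-pin, up to 600 W for a 16-pin 12VHPWR connector); the dictionary and function names are hypothetical, and the 350 W card TDP is used only as an example.

```python
CONNECTOR_WATTS = {"slot": 75, "6-pin": 75, "8-pin": 150, "16-pin": 600}


def card_power_budget_w(aux_connectors):
    """Total power available to a PCIe card: slot plus auxiliary connectors."""
    return CONNECTOR_WATTS["slot"] + sum(CONNECTOR_WATTS[c] for c in aux_connectors)


card_tdp_w = 350  # example card TDP
for config in (["8-pin"], ["8-pin", "8-pin"], ["16-pin"]):
    budget = card_power_budget_w(config)
    verdict = "sufficient" if budget >= card_tdp_w else "insufficient"
    print(f"{config}: {budget} W available, {verdict} for a {card_tdp_w} W card")
```

A budget check like this determines how many connector footprints and how much input-plane copper the card's power entry region must accommodate.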

Thermal Management

Inference thermal management varies widely by deployment tier:

PCIe add-in card inference GPUs (L40S, A10G; roughly 150–350 W): Air cooling with GPU blower or open-air cooler; no liquid cooling required; thermal design follows standard server GPU card practices
SXM5/OAM data center inference (H100, H200, MI300X at 700–750 W): Same thermal management requirements as equivalent training hardware; the difference is that average power dissipation may be lower, but the thermal system must still be designed for peak TDP
Edge inference (Jetson, NPU modules, < 50 W): Passive heatsink or small fan; PCB thermal via arrays sufficient; board material T_g requirements relaxed to ≥ 130–150°C

Form Factors: Edge, PCIe, and OAM

Inference deployments use a wider range of PCB form factors than training, which is almost exclusively SXM or OAM:

Standard PCIe add-in card: NVIDIA L40S, A10G, RTX 6000 Ada; compatible with any standard server PCIe slot; simplest integration; limited to PCIe bandwidth for host interface; no NVLink
OAM module (MI300X, Intel Gaudi 3): High-performance inference with large on-package memory (192 GB on MI300X, 128 GB on Gaudi 3); requires OAM-compliant UBB; same PCB complexity as training configuration
SXM5 (H100/H200 for inference): Maximum single-node GPU performance; same baseboard as training; chosen when training hardware is repurposed for inference or when very high throughput is needed
NVIDIA MGX: A modular server reference design for inference that standardizes the mechanical and electrical interface between GPU cards and server chassis, similar in intent to OAM but within NVIDIA's ecosystem
Edge modules (Jetson Orin, custom SoM): System-on-Module designs with integrated CPU + GPU/NPU; mounted via high-density board-to-board connectors on carrier PCBs; typically 4–8 layer carrier boards

PCB Design Comparison Table: Training vs Inference

PCB Design Parameter	AI Training (High-End)	AI Inference (Data Center)	AI Inference (Edge)
Typical layer count	20–32+	12–24 (PCIe to OAM)	4–8
NVLink routing layers	Yes (6–12 dedicated layers)	No (PCIe add-in) or Yes (SXM/OAM)	No
NVSwitch on board	Yes (H100 HGX, B200)	No (PCIe, OAM/UBB); Yes (SXM HGX baseboards)	No
Signal layer laminate	Megtron 6E / Megtron 7 / Tachyon 100G	Megtron 6 to Megtron 6E	FR4 or standard low-loss
Copper foil grade	VLP / HVLP on NVLink layers	LP to VLP on PCIe Gen5 layers	Standard ED copper
GPU TDP per accelerator	700–1,000 W	300–750 W	10–60 W
PDN target impedance	< 0.15 mΩ DC–100 MHz	< 0.2 mΩ DC–100 MHz	< 1 mΩ (relaxed)
Copper weight (power planes)	2–3 oz (70–105 μm)	1–2 oz (35–70 μm)	1 oz (35 μm)
Cooling type	Direct liquid cooling (mandatory at B200)	Air or DLC depending on TDP	Passive or small fan
Board material T_g	≥ 170–180°C	≥ 150–170°C	≥ 130–150°C
HDI build-up	Yes; mandatory	For fine-pitch BGAs	Typically not required
Backdrilling required	Yes (NVLink + PCIe Gen5/6)	Yes (PCIe Gen5); No (PCIe Gen4)	No
Board size	Large (400–700+ mm per side)	PCIe card (~312 × 111 mm) or UBB (400–600 mm)	Small (50–150 mm per side)
Manufacturing complexity	Very high	Medium to high	Low to medium

Accelerator and PCB Platform Selection by Workload

Workload	Recommended Accelerator	Form Factor	PCB Complexity
LLM pre-training (> 7B parameters)	H100 SXM5 / B200 SXM6	SXM baseboard (NVIDIA HGX)	Very high (20–32+ layers)
LLM fine-tuning (7B–70B)	H100 SXM5 / MI300X OAM	SXM baseboard or OAM UBB	High (16–24 layers)
LLM inference (70B–180B, low latency)	MI300X OAM / H200 SXM5	OAM UBB or SXM baseboard	High (16–24 layers)
LLM inference (7B–34B, high throughput)	H100 PCIe / L40S / A100 PCIe	PCIe add-in card	Medium (12–16 layers)
Vision model inference (production)	A10G / L40S / T4	PCIe add-in card	Medium (12–16 layers)
Edge inference (on-device)	NVIDIA Jetson Orin / custom NPU	SoM on carrier board	Low (4–8 layers)
HPC + AI combined	MI300X OAM	OAM UBB	High (16–24 layers)

FAQ

Can the same GPU hardware be used for both training and inference?
Yes, and this is common in practice. H100 SXM5 systems deployed for training are often repurposed for inference when not running training jobs, or operated in a mixed training/inference workload configuration. The hardware is physically identical; only the software workload changes. However, purpose-built inference hardware (L40S, MI300X) may be more cost-effective than high-end training hardware for pure inference workloads due to lower cost-per-GPU and, in the case of MI300X, much larger memory capacity.

Why does training require more GPU memory than inference for the same model?
Training stores model weights, gradients, and optimizer states simultaneously. With the Adam optimizer and mixed-precision training (BF16 weights and gradients + FP32 master weights + FP32 optimizer moments), memory consumption is approximately 16–20 bytes per parameter before activations. Inference stores only the (possibly quantized) weights, at approximately 2 bytes per parameter in BF16 or 1 byte in INT8, plus the KV cache. A 70B model therefore requires ~1.12–1.4 TB for training and ~70–140 GB of weights for inference.

Is PCIe Gen5 required for inference servers?
For most data center inference deployments in 2025–2026, yes. Current-generation data center accelerators (H100 PCIe, MI300X OAM) use PCIe Gen5 ×16 as their host interface. PCIe Gen4 or earlier remains adequate for lower-throughput inference configurations and is what cards such as the L40S, A10G, and T4 provide. The PCB signal integrity requirements for PCIe Gen5 (backdrilling, Megtron 6E laminate, ± 5% impedance tolerance) apply regardless of whether the board is used for training or inference.

What is the key PCB difference between a training board and an inference board?
The most significant difference is the presence or absence of NVLink/NVSwitch routing. Training boards for NVIDIA H100 or B200 must route thousands of NVLink differential pairs at 100–200 Gb/s per lane between GPU and NVSwitch packages, driving very high layer counts (20–32+) and demanding ultra-low-loss laminates throughout the high-speed signal layers. Standard PCIe inference add-in cards (L40S, H100 PCIe) do not carry NVLink at all, and their PCB complexity is correspondingly lower (12–16 layers, moderate material requirements).

Do edge inference PCBs have any special requirements?
Edge inference PCBs (Jetson Orin, custom NPU carrier boards) have relatively modest PCB requirements compared to data center boards—4–8 layers, standard FR4 or low-loss laminate, moderate PDN. The primary challenges are miniaturization (high component density in a small form factor), thermal management in passively cooled enclosures, and EMI compliance for deployment in consumer or automotive environments. The accelerator SoM itself (the Jetson module, for example) contains the complex high-layer-count PCB; the carrier board is relatively straightforward.

Why is memory bandwidth more important for inference than training?
During the decode phase of autoregressive LLM inference, the GPU generates one token at a time. For each token, it must load the weights of every layer in the model from HBM. At small batch sizes, the GPU performs very little arithmetic per byte loaded (low arithmetic intensity), so throughput is limited by how fast it can read weights, not by how many FLOPS it can compute. Higher HBM bandwidth (TB/s) directly translates to more tokens generated per second in this memory-bound regime.

Need to Manufacture AI Server PCBs?

Whether you are building a high-layer-count training baseboard with NVLink routing and direct liquid cooling integration, or a PCIe inference card with moderate complexity, NextPCB's advanced PCB manufacturing capabilities cover the full range of AI hardware requirements—from 4-layer edge inference boards to 32-layer training baseboards with HDI, backdrilling, heavy copper power planes, and complete PCBA services.

Get a quote from NextPCB →

About the Author

Arya Li, Project Manager at NextPCB.com

With extensive experience in manufacturing and international client management, Arya has guided factory visits for over 200 overseas clients, providing bilingual (English & Chinese) presentations on production processes, quality control systems, and advanced manufacturing capabilities. Her deep understanding of both the factory side and client requirements allows her to deliver professional, reliable PCB solutions efficiently. Detail-oriented and service-driven, Arya is committed to being a trusted partner for clients and showcasing the strength and expertise of the factory in the global PCB and PCBA market.
