Arya Li, Project Manager at NextPCB.com
Support Team
Feedback: support@nextpcb.com
Introduction
The phrase “AI accelerator PCB” covers an enormous range of hardware—from a compact PCIe card running quantized LLM inference at the edge to a 32-layer baseboard distributing 10 kW across eight H100 GPUs during large-scale pre-training. These are not variations on the same design. They are fundamentally different engineering problems, driven by the profoundly different demands that AI training and AI inference place on compute, memory, interconnect, power, and thermal management.
Understanding those differences is essential for any hardware engineer designing or manufacturing AI accelerator boards. Optimizing a PCB for training performance when it will primarily run inference—or vice versa—wastes cost, adds unnecessary complexity, and can result in a system that underperforms in both roles. This article breaks down the two workloads from first principles and explains how each one shapes PCB architecture from layer count through cooling integration.
AI training is the process of adjusting the parameters (weights) of a neural network model to minimize a loss function on a dataset. Each training step involves a forward pass, a backward pass that computes gradients, and a weight update.
This cycle repeats hundreds of thousands to millions of times across a training run. For a large language model with hundreds of billions of parameters, a single training run may require months of continuous operation across thousands of GPUs. The computational demand is sustained, predictable, and dominated by matrix multiplication operations (GEMMs) on large batches of data.
AI inference is the process of using a trained model to produce outputs from new inputs. The model weights are fixed; no gradients are computed; no weight updates occur. Inference involves only the forward pass.
However, inference has its own demanding characteristics:
| Dimension | AI Training | AI Inference |
|---|---|---|
| Primary operation | Forward pass + backward pass + weight update | Forward pass only |
| Compute pattern | Large, regular GEMM operations; highly parallelizable | Mixed GEMM (prefill) and memory-bound decode; variable batch |
| Dominant bottleneck | Compute throughput (TFLOPS) | Memory bandwidth (TB/s) at small batch; compute at large batch |
| Precision | BF16, FP16, FP8 (mixed precision); FP32 for optimizer states | INT8, INT4, FP8 (quantized); BF16 for quality-sensitive paths |
| Memory access pattern | Large sequential reads of activation tensors and weights; gradient writes | Weight reads (repeated); KV cache reads and writes; low data reuse |
| GPU memory required | Model weights + activations + gradients + optimizer states; roughly 16–20 bytes per parameter before activations | Model weights + KV cache; ~2 bytes per parameter for weights (FP16) or less (quantized) |
| GPU-to-GPU communication | Frequent, high-bandwidth all-reduce for gradient synchronization | Moderate (tensor parallelism) to minimal (single-GPU inference) |
| Latency sensitivity | Low (throughput matters, not per-step latency) | High (time-to-first-token, tokens-per-second directly user-visible) |
| Power consumption | Sustained maximum TDP; GPUs run at full throttle continuously | Variable; often below peak TDP, especially at low batch sizes |
| Workload duration | Days to months per training run | Milliseconds per request; continuous 24/7 serving |
Training throughput is measured in TFLOPS (tera floating-point operations per second) at the relevant precision, most commonly BF16 or FP8 for transformer model training in 2025–2026. The GPU must sustain near-peak arithmetic throughput across the many large matrix multiplications in every training step.
Key compute characteristics for training:
For PCB design, sustained training at maximum TDP means the power delivery network and thermal management systems must be designed for continuous worst-case load, not for a lower average load with occasional peaks. There is no power-averaging benefit to exploit.
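To make the sizing consequence concrete, here is a minimal sketch of the rail current and copper loss implied by continuous full-TDP operation. The 0.8 V core voltage and the 50 µΩ delivery-path resistance are illustrative assumptions, not vendor figures, and treating the entire TDP as core-rail power is a simplification that gives an upper bound.

```python
def rail_current_a(power_w: float, rail_voltage_v: float) -> float:
    """Current the rail must carry: I = P / V."""
    return power_w / rail_voltage_v


def conduction_loss_w(current_a: float, path_resistance_ohm: float) -> float:
    """Power dissipated in the copper delivery path: P = I^2 * R."""
    return current_a ** 2 * path_resistance_ohm


# Assumed values, for illustration only.
TDP_W = 700.0          # per-GPU TDP quoted in this article for H100
VCORE_V = 0.8          # hypothetical core rail voltage
PATH_RES_OHM = 50e-6   # hypothetical VRM-to-package path resistance

current = rail_current_a(TDP_W, VCORE_V)
loss = conduction_loss_w(current, PATH_RES_OHM)
print(f"Rail current: {current:.0f} A, copper loss: {loss:.1f} W")
```

Under these assumptions the rail carries roughly 875 A and the copper path dissipates close to 40 W. In a training system that loss is present continuously, which is why plane copper weight and via count are sized for the sustained case.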
Inference compute patterns differ from training in two important ways. First, inference executes only the forward pass, roughly one-third of the compute of a training step for the same model and batch size, because the backward pass costs about twice as much as the forward pass. Second, inference commonly uses quantized precision (INT8, INT4, FP8) to reduce both memory footprint and arithmetic cost, trading a small amount of output quality for significantly higher throughput.
Key compute characteristics for inference:
Training memory consumption is substantially larger than inference for the same model size. For a model with P parameters, the approximate GPU memory footprint during training is:
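For a 7B-parameter model, FP32 master weights and the two Adam moments alone (12 bytes per parameter) take approximately 84 GB; adding BF16 working weights and gradients (16 bytes per parameter) brings the total to approximately 112 GB, before activations. For a 70B-parameter model the same two figures are approximately 840 GB and 1.12 TB, requiring model parallelism across many GPUs. This is why training large models requires high aggregate GPU memory across a cluster, and why fast GPU-to-GPU interconnect for gradient synchronization is critical.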
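The footprint depends on which per-parameter terms are counted. The sketch below assumes Adam with mixed-precision training; `training_memory_gb` is a hypothetical helper, and its byte counts are common conventions rather than figures from a specific framework. Counting only FP32 master weights and the two Adam moments gives 12 bytes per parameter; adding BF16 working weights and gradients gives 16.

```python
GB = 1e9  # decimal gigabytes, matching the figures in the text


def training_memory_gb(
    num_params: float,
    weight_bytes: float = 2.0,     # BF16 working weights
    gradient_bytes: float = 2.0,   # BF16 gradients
    master_bytes: float = 4.0,     # FP32 master weights
    optimizer_bytes: float = 8.0,  # two FP32 Adam moments
) -> float:
    """Approximate training footprint in GB, excluding activations."""
    per_param = weight_bytes + gradient_bytes + master_bytes + optimizer_bytes
    return num_params * per_param / GB


for name, params in [("7B", 7e9), ("70B", 70e9)]:
    states_only = training_memory_gb(params, weight_bytes=0.0, gradient_bytes=0.0)
    full = training_memory_gb(params)
    print(f"{name}: {states_only:.0f} GB (master + Adam), {full:.0f} GB (all terms)")
```

Activations come on top of either figure and scale with batch size and sequence length, so they are left out of this estimate.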
Inference memory is simpler but still critical. For a model with P parameters in BF16 precision, the weight footprint is 2P bytes. In INT8 quantization, it is P bytes; in INT4, 0.5P bytes. Additionally, the KV cache for autoregressive inference consumes:
KV cache size = 2 × num_layers × num_kv_heads × head_dim × sequence_length × batch_size × precision_bytes
Here num_kv_heads is the number of key-value heads: it equals the number of attention heads under standard multi-head attention and is smaller under grouped-query attention. For a 70B model (LLaMA 2 70B architecture, which uses 8 key-value heads) at BF16, generating 4,096 tokens at batch size 32, the KV cache alone consumes approximately 43 GB. Combined with model weights (~140 GB in BF16), this far exceeds the H100's 80 GB, which is precisely why the AMD MI300X (192 GB) and NVIDIA H200 (141 GB) target inference of large models. For more on memory architecture trade-offs between H100 and MI300X, see H100 vs MI300X: NVIDIA vs AMD AI Accelerator Comparison.
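The KV cache formula can be sketched directly. For models with grouped-query attention, the head count that matters is the number of key-value heads rather than query heads. The LLaMA 2 70B values used here (80 layers, 8 key-value heads, head dimension 128) come from the published architecture and should be checked against the configuration of the model actually deployed.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    sequence_length: int,
    batch_size: int,
    precision_bytes: int = 2,
) -> int:
    """KV cache size in bytes; the leading 2 covers keys and values."""
    return (
        2 * num_layers * num_kv_heads * head_dim
        * sequence_length * batch_size * precision_bytes
    )


# LLaMA 2 70B-style configuration at BF16, 4,096 tokens, batch size 32.
size = kv_cache_bytes(
    num_layers=80, num_kv_heads=8, head_dim=128,
    sequence_length=4096, batch_size=32, precision_bytes=2,
)
print(f"KV cache: {size / 1e9:.1f} GB")
```

The cache grows linearly with both sequence length and batch size, so doubling either doubles the memory that must be reserved alongside the weights.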
| Model Size | BF16 Weight Footprint | INT8 Weight Footprint | Fits in H100 (80 GB)? | Fits in MI300X (192 GB)? |
|---|---|---|---|---|
| 7B parameters | ~14 GB | ~7 GB | Yes (BF16 and INT8) | Yes |
| 13B parameters | ~26 GB | ~13 GB | Yes (BF16 and INT8) | Yes |
| 34B parameters | ~68 GB | ~34 GB | Marginal (BF16); Yes (INT8) | Yes |
| 70B parameters | ~140 GB | ~70 GB | No (BF16); Yes (INT8) | Yes (BF16 and INT8) |
| 180B parameters | ~360 GB | ~180 GB | No | No (BF16); Marginal (INT8, little room left for KV cache) |
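The fit checks in the table reduce to simple arithmetic. The sketch below uses hypothetical helpers (`weight_footprint_gb`, `fits`) and an optional `reserve_fraction` parameter, an assumption added here to model memory held back for KV cache and runtime overhead.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_footprint_gb(num_params: float, precision: str) -> float:
    """Weight storage in decimal GB for the given precision."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


def fits(
    num_params: float,
    precision: str,
    gpu_memory_gb: float,
    reserve_fraction: float = 0.0,  # share of memory kept for KV cache, etc.
) -> bool:
    """True if the weights fit in the memory left after the reserve."""
    usable_gb = gpu_memory_gb * (1.0 - reserve_fraction)
    return weight_footprint_gb(num_params, precision) <= usable_gb


print(fits(70e9, "bf16", 80.0))                        # 140 GB vs 80 GB
print(fits(70e9, "int8", 80.0))                        # 70 GB vs 80 GB
print(fits(34e9, "bf16", 80.0, reserve_fraction=0.2))  # 68 GB vs 64 GB usable
```

The last call illustrates why the 34B BF16 entry is marked marginal: the weights fit in 80 GB on paper, but not once a fifth of the memory is set aside for the KV cache.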
Training large models across multiple GPUs requires collective communication operations—primarily all-reduce for gradient synchronization. In tensor parallelism (splitting individual model layers across GPUs) and pipeline parallelism (splitting model layers between GPUs in stages), inter-GPU bandwidth directly determines how much time GPUs spend communicating vs computing.
At scale, inter-GPU communication can account for 30–50% of wall-clock training time if the interconnect is undersized. This is why training systems use the highest-bandwidth available interconnects:
For PCB design, this means training baseboards must route a very large number of NVLink differential pairs at 100–200 Gb/s per lane, requiring dedicated high-speed signal layers with ultra-low-loss laminates and VLP copper foil. See What Is NVLink? and What Is NVSwitch? for detailed interconnect PCB requirements.
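As a rough illustration of why interconnect bandwidth matters, the sketch below estimates the bandwidth-limited time of one gradient all-reduce. It assumes a ring algorithm, in which each GPU sends and receives 2(N−1)/N times the payload, and it ignores latency, protocol overhead, and overlap with compute. The 450 GB/s per-direction bandwidth is an illustrative assumption, and real collective libraries may choose other algorithms.

```python
def ring_allreduce_seconds(
    payload_bytes: float,
    num_gpus: int,
    link_bandwidth_bytes_per_s: float,
) -> float:
    """Bandwidth-only lower bound for a ring all-reduce."""
    if num_gpus < 2:
        return 0.0  # nothing to synchronize on a single GPU
    traffic_per_gpu = 2.0 * (num_gpus - 1) / num_gpus * payload_bytes
    return traffic_per_gpu / link_bandwidth_bytes_per_s


# Hypothetical case: BF16 gradients of a 70B-parameter model across 8 GPUs.
gradient_bytes = 70e9 * 2
fast = ring_allreduce_seconds(gradient_bytes, 8, 450e9)
slow = ring_allreduce_seconds(gradient_bytes, 8, 225e9)
print(f"450 GB/s: {fast:.2f} s, 225 GB/s: {slow:.2f} s")
```

Halving the link bandwidth doubles the synchronization time, and that time is paid on every step unless it can be hidden behind computation.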
Inference interconnect requirements are lower than training for most deployments:
The PCIe host interface is more important for inference than training because inference servers receive continuous streams of requests from the network, and the latency of transferring input data from the host CPU to GPU memory adds to time-to-first-token. PCIe Gen5 ×16, at roughly 64 GB/s in each direction (128 GB/s bidirectional), provides adequate bandwidth for most LLM serving configurations.
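The PCIe figure can be derived from the lane rate. This sketch assumes the 128b/130b line encoding used by PCIe Gen3 through Gen5 and ignores packet-level protocol overhead, so the result is an upper bound on usable bandwidth. The payload in the example (32 prompts of 4,096 four-byte token IDs) is a hypothetical request batch.

```python
def pcie_bandwidth_gb_per_s(
    transfer_rate_gt_per_s: float,
    lanes: int,
    encoding_efficiency: float = 128.0 / 130.0,  # 128b/130b, Gen3 to Gen5
) -> float:
    """Raw one-direction link bandwidth in GB/s."""
    return transfer_rate_gt_per_s * lanes * encoding_efficiency / 8.0


def transfer_time_us(payload_bytes: float, bandwidth_gb_per_s: float) -> float:
    """Time to move a payload across the link, in microseconds."""
    return payload_bytes / (bandwidth_gb_per_s * 1e9) * 1e6


gen5_x16 = pcie_bandwidth_gb_per_s(32.0, 16)
payload = 32 * 4096 * 4  # hypothetical batch of token IDs, in bytes
print(f"Gen5 x16: {gen5_x16:.1f} GB/s per direction")
print(f"Token batch transfer: {transfer_time_us(payload, gen5_x16):.1f} us")
```

About 63 GB/s per direction is where the commonly quoted figure of roughly 128 GB/s bidirectional comes from. For token-ID inputs the transfer takes microseconds, so the link matters most for larger payloads such as images or offloaded KV cache.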
| Parameter | AI Training System | AI Inference System |
|---|---|---|
| GPU utilization | 80–100% sustained; full TDP continuously | 20–80% typical; variable with request rate |
| Power consumption | Sustained at or near maximum TDP | Variable; often 40–70% of peak TDP at moderate load |
| Thermal profile | Constant high heat; worst-case for cooling design | Variable; cooling must handle peak but average load is lower |
| Power delivery design target | Worst-case continuous: full VCORE current at max TDP | Can apply some power averaging; transient response still critical for burst loads |
| Cooling type | Direct liquid cooling (DLC) strongly preferred at 700–1,000 W per GPU | Air cooling viable for lower-power inference accelerators; DLC preferred for high-density |
| PSU sizing | Sized for 100% of peak TDP continuously; no derated operation | Sized for 80–100% of peak TDP; lighter PSUs are possible if average load is consistently below peak |
| PDN transient response | Critical; GPU compute transitions cause large, fast current steps | Critical for prefill bursts; less demanding during steady-state decode |
Training baseboards (H100 HGX, GB200 NVL72 compute trays) are among the highest layer-count PCBs in production. Layer count is driven by:
Typical layer counts: 20–24 layers for H100 HGX baseboard; 24–32 layers for B200 baseboard; 32–40+ layers for dedicated NVSwitch boards in GB200 NVL72. For architecture-specific layer count analysis, see NVIDIA Blackwell Architecture Explained and A100 vs H100: PCB Stack Differences.
Training boards demand ultra-low-loss laminates on all NVLink and PCIe Gen5/Gen6 signal layers:
| Signal Type | Speed | Required Laminate (Df) |
|---|---|---|
| NVLink 4.0 (H100) | 100 Gb/s per lane | Megtron 6E / Tachyon 100G (Df ≤ 0.003) |
| NVLink 5.0 (B200) | 200 Gb/s per lane | Megtron 7 (Df ≤ 0.002) |
| PCIe Gen5 | 32 GT/s per lane | Megtron 6E or equivalent |
| PCIe Gen6 (B200) | 64 GT/s per lane (PAM4) | Megtron 7 or Rogers ultra-low-loss |
| Power and ground planes | DC / low frequency | Megtron 6 or FR4-class acceptable |
Copper foil on NVLink signal layers must be very-low-profile (VLP) or high-VLP (HVLP) to minimize conductor loss above 10 GHz, where the skin depth shrinks to a scale comparable to the foil's surface roughness. Standard electrodeposited (ED) copper is not acceptable on NVLink 4.0 or NVLink 5.0 routing layers.
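The reason foil profile matters at these frequencies can be seen from the skin depth, δ = sqrt(ρ / (π f μ0)). The sketch below uses a textbook resistivity for copper near room temperature; foil roughness values vary by supplier and grade and are deliberately not hard-coded here.

```python
import math

MU_0 = 4.0e-7 * math.pi   # permeability of free space, H/m
RHO_COPPER = 1.68e-8      # resistivity of copper near 20 C, ohm*m


def skin_depth_um(frequency_hz: float, resistivity: float = RHO_COPPER) -> float:
    """Skin depth in micrometres for a non-magnetic conductor."""
    depth_m = math.sqrt(resistivity / (math.pi * frequency_hz * MU_0))
    return depth_m * 1e6


for f_ghz in (1, 10, 25):
    print(f"{f_ghz:>2} GHz: {skin_depth_um(f_ghz * 1e9):.2f} um")
```

At 10 GHz the current flows in a layer well under one micrometre thick, so surface features of a similar or larger size lengthen the current path and raise loss.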
Training PDN must deliver full TDP continuously, without interruption, for weeks or months:
Sustained maximum TDP operation—700 W (H100), 1,000 W (B200) per GPU, plus NVSwitch power—demands aggressive thermal management at the board level:
Training baseboards carrying NVSwitch must route thousands of NVLink differential pairs with:
Inference PCBs span a much wider range of complexity than training boards, because inference systems range from edge devices to data center servers:
| Inference Deployment | Typical Accelerator | Typical PCB Layer Count |
|---|---|---|
| Edge / on-device inference | NVIDIA Jetson, custom ASIC, NPU | 4–8 layers |
| Workstation / single-GPU inference | NVIDIA RTX 4090, L40S (PCIe card) | 8–12 layers (add-in card PCB) |
| Data center inference (PCIe) | NVIDIA L40S, H100 PCIe, A10 | 12–16 layers (add-in card) |
| Data center inference (SXM/OAM) | H100 SXM5, H200, MI300X OAM | 16–24 layers (baseboard/UBB) |
| Rack-scale inference | GB200 NVL72 | 24–32+ layers (same as training) |
For data center inference using standard PCIe add-in cards (NVIDIA L40S, A10G), a 12–16 layer board with PCIe Gen4/Gen5 routing is the typical design target. Inference GPUs in the PCIe add-in card form factor carry no baseboard NVLink routing, which removes the primary driver of high layer count in training boards.
Inference systems have more relaxed signal integrity requirements than training boards on the inter-GPU interconnect layers—because many inference deployments do not use high-bandwidth GPU-to-GPU interconnects at all:
Inference PDN design is less demanding than training in one key way: GPU utilization is variable rather than continuously at maximum. However, transient response remains critical because inference workloads experience burst loads when large batches arrive simultaneously:
Inference thermal management varies widely by deployment tier:
Inference deployments use a wider range of PCB form factors than training, which is almost exclusively SXM or OAM:
| PCB Design Parameter | AI Training (High-End) | AI Inference (Data Center) | AI Inference (Edge) |
|---|---|---|---|
| Typical layer count | 20–32+ | 12–24 (PCIe to OAM) | 4–8 |
| NVLink routing layers | Yes (6–12 dedicated layers) | No (PCIe add-in) or Yes (SXM/OAM) | No |
| NVSwitch on board | Yes (H100 HGX, B200) | No (PCIe add-in or OAM/UBB); Yes on SXM HGX baseboards reused for inference | No |
| Signal layer laminate | Megtron 6E / Megtron 7 / Tachyon 100G | Megtron 6 to Megtron 6E | FR4 or standard low-loss |
| Copper foil grade | VLP / HVLP on NVLink layers | LP to VLP on PCIe Gen5 layers | Standard ED copper |
| GPU TDP per accelerator | 700–1,000 W | 300–750 W | 10–60 W |
| PDN target impedance | < 0.15 mΩ DC–100 MHz | < 0.2 mΩ DC–100 MHz | < 1 mΩ (relaxed) |
| Copper weight (power planes) | 2–3 oz (70–105 μm) | 1–2 oz (35–70 μm) | 1 oz (35 μm) |
| Cooling type | Direct liquid cooling (mandatory at B200) | Air or DLC depending on TDP | Passive or small fan |
| Board material Tg | ≥ 170–180°C | ≥ 150–170°C | ≥ 130–150°C |
| HDI / backdrilling | Yes (both); mandatory | Backdrilling for PCIe Gen5; HDI for fine-pitch BGAs | Typically not required |
| Backdrilling required | Yes (NVLink + PCIe Gen5/6) | Yes (PCIe Gen5); No (PCIe Gen4) | No |
| Board size | Large (400–700+ mm per side) | PCIe card (~312 × 111 mm) or UBB (400–600 mm) | Small (50–150 mm per side) |
| Manufacturing complexity | Very high | Medium to high | Low to medium |
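The PDN target impedance row follows from a widely used rule of thumb: target impedance equals the allowed ripple voltage divided by the transient current step. The sketch below uses hypothetical values (0.8 V rail, 3% ripple budget, 160 A load step) chosen only to show how a sub-milliohm target arises. They are not measurements from any specific GPU.

```python
def target_impedance_mohm(
    rail_voltage_v: float,
    ripple_fraction: float,
    current_step_a: float,
) -> float:
    """PDN target impedance in milliohms: Z = (V * ripple) / delta_I."""
    allowed_ripple_v = rail_voltage_v * ripple_fraction
    return allowed_ripple_v / current_step_a * 1e3


# Hypothetical data center rail versus a hypothetical edge rail.
print(f"{target_impedance_mohm(0.8, 0.03, 160.0):.2f} mOhm")
print(f"{target_impedance_mohm(0.8, 0.03, 20.0):.2f} mOhm")
```

Because the target scales inversely with the current step, boards for high-TDP accelerators need far more decoupling, plane copper, and VRM phases than edge boards with the same ripple budget.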
| Workload | Recommended Accelerator | Form Factor | PCB Complexity |
|---|---|---|---|
| LLM pre-training (> 7B parameters) | H100 SXM5 / B200 SXM6 | SXM baseboard (NVIDIA HGX) | Very high (20–32+ layers) |
| LLM fine-tuning (7B–70B) | H100 SXM5 / MI300X OAM | SXM baseboard or OAM UBB | High (16–24 layers) |
| LLM inference (70B–180B, low latency) | MI300X OAM / H200 SXM5 | OAM UBB or SXM baseboard | High (16–22 layers) |
| LLM inference (7B–34B, high throughput) | H100 PCIe / L40S / A100 PCIe | PCIe add-in card | Medium (12–16 layers) |
| Vision model inference (production) | A10G / L40S / T4 | PCIe add-in card | Medium (12–16 layers) |
| Edge inference (on-device) | NVIDIA Jetson Orin / custom NPU | SoM on carrier board | Low (4–8 layers) |
| HPC + AI combined | MI300X OAM | OAM UBB | High (16–22 layers) |
Can the same GPU hardware be used for both training and inference?
Yes, and this is common in practice. H100 SXM5 systems deployed for training are often repurposed for inference when not running training jobs, or operated in a mixed training/inference workload configuration. The hardware is physically identical; only the software workload changes. However, purpose-built inference hardware (L40S, MI300X) may be more cost-effective than high-end training hardware for pure inference workloads due to lower cost-per-GPU and, in the case of MI300X, much larger memory capacity.
Why does training require more GPU memory than inference for the same model?
Training stores model weights, gradients, and optimizer states simultaneously. With the Adam optimizer and mixed-precision training (BF16 weights + FP32 master weights + FP32 optimizer moments), memory consumption is approximately 16–20 bytes per parameter. Inference stores only the (possibly quantized) weights, at approximately 2 bytes per parameter in BF16 or 1 byte in INT8, plus the KV cache. A 70B model requires ~1.12–1.4 TB for training and ~70–140 GB of weights for inference.
Is PCIe Gen5 required for inference servers?
For most data center inference deployments in 2025–2026, yes. Current-generation inference GPUs (H100 PCIe, L40S, MI300X OAM) use PCIe Gen5 ×16 as their host interface. PCIe Gen4 is still adequate for lower-throughput inference configurations and older accelerators (A10G, T4). The PCB signal integrity requirements for PCIe Gen5 (backdrilling, Megtron 6E laminate, ± 5% impedance tolerance) apply regardless of whether the board is used for training or inference.
What is the key PCB difference between a training board and an inference board?
The most significant difference is the presence or absence of NVLink/NVSwitch routing. Training boards for NVIDIA H100 or B200 must route thousands of NVLink differential pairs at 100–200 Gb/s per lane between GPU and NVSwitch packages, driving very high layer counts (20–32+) and demanding ultra-low-loss laminates throughout the high-speed signal layers. Standard PCIe inference add-in cards (L40S, H100 PCIe) do not carry NVLink at all, and their PCB complexity is correspondingly lower (12–16 layers, moderate material requirements).
Do edge inference PCBs have any special requirements?
Edge inference PCBs (Jetson Orin, custom NPU carrier boards) have relatively modest PCB requirements compared to data center boards—4–8 layers, standard FR4 or low-loss laminate, moderate PDN. The primary challenges are miniaturization (high component density in a small form factor), thermal management in passively cooled enclosures, and EMI compliance for deployment in consumer or automotive environments. The accelerator SoM itself (the Jetson module, for example) contains the complex high-layer-count PCB; the carrier board is relatively straightforward.
Why is memory bandwidth more important for inference than training?
During the decode phase of autoregressive LLM inference, the GPU generates one token at a time. For each token, it must load the weights of every layer in the model from HBM memory. At small batch sizes, the GPU performs very little arithmetic per byte loaded (low arithmetic intensity), meaning the throughput is limited by how fast it can read weights—not by how many FLOPS it can compute. Higher HBM bandwidth (TB/s) directly translates to more tokens generated per second in this memory-bound regime.
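The memory-bound regime described above gives a simple upper bound on decode speed: each decode step must read every weight once, so steps per second cannot exceed HBM bandwidth divided by weight bytes. The sketch below ignores KV cache reads and kernel overheads and holds only while the workload stays memory-bound. The 3.35 TB/s bandwidth is used as an illustrative H100-class value.

```python
def decode_tokens_per_s_upper_bound(
    weight_bytes: float,
    hbm_bandwidth_bytes_per_s: float,
    batch_size: int = 1,
) -> float:
    """Bandwidth-limited ceiling on generated tokens per second."""
    steps_per_s = hbm_bandwidth_bytes_per_s / weight_bytes
    # Each step yields one token per sequence in the batch.
    return steps_per_s * batch_size


int8_70b = 70e9 * 1.0  # 70B parameters at 1 byte each
bf16_70b = 70e9 * 2.0  # 70B parameters at 2 bytes each
for label, w in [("INT8", int8_70b), ("BF16", bf16_70b)]:
    bound = decode_tokens_per_s_upper_bound(w, 3.35e12)
    print(f"{label}: at most {bound:.0f} tokens/s per sequence")
```

Quantizing from BF16 to INT8 halves the bytes read per step and therefore doubles the ceiling, which is the same effect as doubling HBM bandwidth.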
Whether you are building a high-layer-count training baseboard with NVLink routing and direct liquid cooling integration, or a PCIe inference card with moderate complexity, NextPCB's advanced PCB manufacturing capabilities cover the full range of AI hardware requirements—from 4-layer edge inference boards to 32-layer training baseboards with HDI, backdrilling, heavy copper power planes, and complete PCBA services.