Arya Li, Project Manager at NextPCB.com
Amazon Web Services has been building custom silicon for over a decade, from Graviton ARM-based CPUs to Nitro networking ASICs. AWS Trainium is the latest and most ambitious entry in that program: a custom AI training accelerator chip designed entirely in-house by Amazon, intended to offer lower cost-per-training-FLOP than competing GPU solutions for workloads running on the AWS platform.
For hardware engineers evaluating AI accelerator options, understanding Trainium matters for two reasons. First, it represents a credible alternative to NVIDIA H100 for a growing set of training workloads, with competitive throughput on transformer models and a cost advantage that grows as AWS scales production. Second, its PCB and server architecture makes different demands than GPU-based systems—different interconnect topology, different memory subsystem, different power delivery requirements—that are worth understanding independently of the GPU-centric design knowledge covered elsewhere in this series.
This article covers Trainium 2's architecture, its NeuronLink interconnect, the PCB design requirements of Trn2-based server boards, and the workloads and deployment scenarios where Trainium offers the most compelling economics relative to NVIDIA's platform.
Trainium is Amazon's purpose-built AI training accelerator, designed to run deep learning training workloads on the AWS cloud. Unlike NVIDIA GPUs, which are designed as general-purpose parallel processors that happen to excel at AI, Trainium is an ASIC (Application-Specific Integrated Circuit) optimized specifically for the matrix multiplication and tensor operations that dominate neural network training.
AWS introduced the first generation Trainium chip (Trn1) in 2021, deploying it in Trn1 EC2 instances in late 2022. Trainium 2 (Trn2) followed with a major architectural update targeting larger models and higher throughput, entering general availability in 2024. Trainium chips are not available for purchase as discrete components—they exist exclusively within AWS EC2 instances (Trn1, Trn2, and UltraServer configurations), which is a fundamental difference from NVIDIA GPUs that can be purchased for on-premises deployment.
The Trainium product line sits within Amazon's broader custom silicon strategy alongside:
Trainium 2 represents a substantial generational improvement over the original Trainium chip. Amazon has not published the complete semiconductor specifications (process node, transistor count, die size), but the following system-level specifications are publicly available through AWS documentation:
| Specification | Trainium 1 (Trn1) | Trainium 2 (Trn2) |
|---|---|---|
| Chips per server | 16 (Trn1.32xl) | 16 (Trn2.48xl) |
| Peak compute (per chip) | ~190 TFLOPS (FP16) | ~3,800 TFLOPS (FP8, est.) |
| HBM memory per chip | 32 GB HBM2e | 96 GB HBM3 |
| HBM bandwidth per chip | ~820 GB/s | ~2,900 GB/s (est.) |
| NeuronLink per-chip bandwidth | ~800 GB/s | ~1,600 GB/s (est.) |
| Host interface | PCIe Gen4 ×8 | PCIe Gen5 ×16 |
| TDP per chip | ~300 W | ~700 W (est.) |
| Supported precisions | FP32, TF32, BF16, FP16 | FP32, TF32, BF16, FP16, FP8 |
The NeuronCore architecture inside each Trainium chip contains multiple dedicated tensor engine units for matrix multiplication, vector processing engines for activation functions and normalization, and an on-chip SRAM scratchpad memory that feeds the compute engines at high bandwidth without requiring DRAM access for intermediate results. This scratchpad-centric design differs from GPU architectures that rely on L2 caches and HBM for inter-layer data movement; it enables higher sustained compute utilization on well-optimized models but requires the Neuron compiler to explicitly manage data placement in the scratchpad.
Trainium 2 adds native FP8 support—matching NVIDIA's Transformer Engine capability introduced in H100—which is critical for transformer model training at scale. FP8 approximately doubles throughput over BF16 for matrix-multiply-dominated operations, as analyzed in the AI Training vs Inference guide.
NeuronLink is Amazon's proprietary high-bandwidth interconnect between Trainium chips within a server. In the Trn2.48xl configuration, 16 Trainium 2 chips are connected in a high-bandwidth mesh topology via NeuronLink v2, providing approximately 1,600 GB/s of bidirectional bandwidth per chip for all-to-all collective communication.
Unlike NVIDIA's NVLink, which routes through dedicated NVSwitch chips to create a fully non-blocking fabric (as described in the NVSwitch architecture guide), NeuronLink uses direct peer-to-peer connections between chips. This topology is simpler from a PCB design standpoint—no dedicated switch chip and no NVSwitch-equivalent BGA packages on the board—but it means that the all-to-all bandwidth is achieved through multiple routing hops rather than a single switch fabric hop.
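The hop-count cost of a direct-connect topology can be illustrated with a short sketch. The 4×4 2D torus layout used here is an assumption made for illustration (the text above only specifies a mesh of 16 chips), and the function names are hypothetical. A breadth-first search gives the minimum number of link hops from one chip to every other chip:

```python
from collections import deque

def torus_neighbors(node, rows, cols):
    """Return the four wrap-around neighbors of a node in a 2D torus."""
    r, c = node
    return [((r + 1) % rows, c), ((r - 1) % rows, c),
            (r, (c + 1) % cols), (r, (c - 1) % cols)]

def hop_counts(src, rows, cols):
    """Breadth-first search: minimum hops from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in torus_neighbors(node, rows, cols):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

ROWS, COLS = 4, 4  # assumed layout for 16 chips, not a published topology
dist = hop_counts((0, 0), ROWS, COLS)
others = [d for node, d in dist.items() if node != (0, 0)]
max_hops = max(others)
mean_hops = sum(others) / len(others)
print(f"max hops: {max_hops}, mean hops: {mean_hops:.2f}")
```

Under this assumed layout the farthest chip is several hops away, whereas a switch-based fabric reaches any peer through a single switch traversal. The real penalty depends on the actual topology and on how the collective algorithm schedules traffic.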
NeuronLink v2 traces on the Trn2 server baseboard carry high-speed differential signals at speeds requiring low-loss PCB laminate on the interconnect signal layers. While Amazon has not published NeuronLink's specific per-lane signaling rate, the aggregate bandwidth figure (~1,600 GB/s per chip) implies per-lane speeds in the NVLink 4.0 class (100 Gb/s range), with associated signal integrity requirements: differential impedance control to ± 5%, low-loss laminate (Megtron 6E class or better), and backdrilling of any through-hole vias on the signal paths. These requirements are consistent with the PCIe Gen5 and high-speed interconnect design principles detailed in the PCIe Gen5 PCB design guide.
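The lane-count implication of that bandwidth figure can be worked through as a rough sketch. It assumes the ~1,600 GB/s figure splits evenly between the two directions and ignores line-encoding overhead; the per-lane rates are the inferred values from the paragraph above, not published NeuronLink parameters, and the function name is hypothetical.

```python
import math

def lanes_per_direction(bidir_gbytes_per_s, lane_gbits_per_s):
    """Lanes needed in one direction for a given bidirectional bandwidth."""
    per_direction_gbits = bidir_gbytes_per_s / 2 * 8  # GB/s -> Gb/s, one way
    return math.ceil(per_direction_gbits / lane_gbits_per_s)

BIDIR_GBYTES_PER_S = 1600  # the article's estimate per chip

lanes_at_100g = lanes_per_direction(BIDIR_GBYTES_PER_S, 100)
lanes_at_50g = lanes_per_direction(BIDIR_GBYTES_PER_S, 50)

# Each lane is one differential pair; count transmit and receive separately.
diff_pairs_at_100g = lanes_at_100g * 2

print(f"100 Gb/s lanes per direction: {lanes_at_100g}")
print(f"50 Gb/s lanes per direction:  {lanes_at_50g}")
print(f"differential pairs per chip at 100 Gb/s: {diff_pairs_at_100g}")
```

Halving the per-lane rate doubles the pair count that must escape each BGA, which is the trade-off behind pushing per-lane speed and accepting the tighter loss budget.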
Across multiple Trn2 UltraServer nodes, chips communicate via a dedicated high-bandwidth interconnect fabric (NeuronLink UltraFabric) at up to 1,600 GB/s per chip across servers—a capability that enables scaling to thousands of Trainium chips for very large model training runs without the bandwidth degradation that occurs when using InfiniBand for inter-node collectives on GPU clusters.
| Dimension | Trainium 2 (per chip) | H100 SXM5 |
|---|---|---|
| FP8 compute | ~3,800 TFLOPS (est.) | ~2,000 TFLOPS (dense) |
| BF16 compute | ~1,900 TFLOPS (est.) | 989 TFLOPS |
| HBM memory | 96 GB HBM3 | 80 GB HBM3 |
| HBM bandwidth | ~2,900 GB/s | 3.35 TB/s |
| Chip-to-chip bandwidth | ~1,600 GB/s (NeuronLink v2) | 900 GB/s (NVLink 4.0) |
| Chip-to-chip topology | Direct peer-to-peer mesh (no switch chip) | NVSwitch crossbar (fully non-blocking) |
| Host interface | PCIe Gen5 ×16 | PCIe Gen5 ×16 |
| Availability | AWS cloud only | Cloud + on-premises |
| Software ecosystem | Neuron SDK; PyTorch via plugin | CUDA; native PyTorch support |
| On-premises deployment | Not available | Available (DGX, HGX, OEM servers) |
The headline comparison favors Trainium 2 on raw compute (higher estimated TFLOPS) and chip-to-chip bandwidth, while H100 leads on HBM bandwidth, software ecosystem maturity, and deployment flexibility. In practice, the most important performance metric is end-to-end training throughput on the specific model architecture and batch configuration used in production—a number that varies significantly based on compiler optimization quality and how well the model maps to each chip's hardware primitives.
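One way to see why peak TFLOPS alone does not settle the comparison is a roofline-style ridge point: the arithmetic intensity (FLOPs per byte of HBM traffic) above which a kernel stops being limited by memory bandwidth. The sketch below applies that standard model to the table's BF16 and HBM figures; the Trainium 2 numbers are the article's estimates, and the helper name is hypothetical.

```python
# Figures copied from the comparison table above (Trainium 2 values are estimates).
CHIPS = {
    "Trainium 2 (est.)": {"bf16_tflops": 1900.0, "hbm_tb_per_s": 2.9},
    "H100 SXM5": {"bf16_tflops": 989.0, "hbm_tb_per_s": 3.35},
}

def ridge_point(tflops, tb_per_s):
    """Roofline ridge point in FLOPs per byte (TFLOP/s divided by TB/s)."""
    return tflops / tb_per_s

ridge = {name: ridge_point(c["bf16_tflops"], c["hbm_tb_per_s"])
         for name, c in CHIPS.items()}

for name, value in ridge.items():
    print(f"{name}: memory-bound below ~{value:.0f} FLOP/byte")
```

With these inputs the Trainium 2 ridge point is roughly twice that of H100, so a kernel needs more data reuse per byte fetched before the higher estimated peak compute becomes reachable. That is one reason compiler-managed scratchpad placement matters so much for sustained throughput.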
AWS publishes benchmark results showing Trainium 2 achieving competitive or superior cost-per-training-FLOP on specific transformer workloads (LLaMA, GPT-NeoX) compared to equivalent H100 configurations. Independent validation of these benchmarks varies, and engineers should run representative workloads on both platforms before making procurement decisions. For a broader comparison of GPU vs ASIC trade-offs, see GPU vs TPU vs ASIC vs FPGA: Which AI Chip Architecture Will Dominate in 2027?
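When running that comparison, cost-per-training-FLOP can be computed from three measured inputs: the instance's hourly price, its peak compute, and the utilization actually achieved on the workload. The sketch below is a minimal illustration; the function name is hypothetical, and the example inputs are placeholders, not real AWS prices or benchmark results.

```python
def cost_per_exaflop(hourly_price_usd, chips, peak_tflops_per_chip, utilization):
    """USD per 1e18 useful training FLOPs for one instance."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    useful_flops_per_hour = chips * peak_tflops_per_chip * 1e12 * utilization * 3600
    return hourly_price_usd / useful_flops_per_hour * 1e18

# Placeholder inputs only: substitute measured utilization and current pricing.
baseline = cost_per_exaflop(hourly_price_usd=10.0, chips=16,
                            peak_tflops_per_chip=1000.0, utilization=0.4)
better_compiled = cost_per_exaflop(hourly_price_usd=10.0, chips=16,
                                   peak_tflops_per_chip=1000.0, utilization=0.8)

print(f"baseline:        ${baseline:.3f} per exaFLOP")
print(f"2x utilization:  ${better_compiled:.3f} per exaFLOP")
```

Because utilization sits in the denominator, a platform with a lower hourly price can still lose on cost if the compiler reaches a smaller fraction of peak on the production model.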
Trainium 2 server boards (the Trn2.48xl UltraServer baseboard and the inter-server UltraFabric switch board) are complex, high-layer-count PCBs, though they differ in important ways from NVIDIA GPU baseboards.
No NVSwitch-equivalent chip: Because NeuronLink uses direct chip-to-chip connections rather than a switch-chip topology, the Trn2 baseboard does not carry the equivalent of NVSwitch packages. This reduces the routing complexity somewhat compared to an H100 baseboard (which carries 4 NVSwitch 3.0 chips with their own power delivery and BGA routing requirements), but the direct NeuronLink routing between 16 Trainium chips still generates a substantial number of high-speed differential pairs.
PCIe Gen5 host interface: Each Trainium 2 chip connects to the host CPU via PCIe Gen5 ×16. With 16 chips per server, the baseboard routes 16 PCIe Gen5 ×16 links, or 256 lanes in total. Because each lane carries one transmit and one receive differential pair, that amounts to 512 differential pairs. These require the same signal integrity treatment as any PCIe Gen5 design: controlled impedance to ± 5%, low-loss laminate (Megtron 6E class) on PCIe signal layers, backdrilling to remove through-hole via stubs, and VLP or HVLP copper foil, as detailed in the materials guidance at High-Speed PCB Materials for AI Servers.
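The routing load of the host interface can be tallied with a short sketch. It counts lanes and differential pairs separately (each PCIe lane has one transmit pair and one receive pair) and estimates per-link bandwidth from the PCIe Gen5 rate of 32 GT/s with 128b/130b encoding. Packet-level protocol overhead is ignored, and the variable names are hypothetical.

```python
CHIPS = 16
LANES_PER_LINK = 16
GEN5_GT_PER_S = 32.0             # raw transfer rate per lane
ENCODING_EFFICIENCY = 128 / 130  # 128b/130b line encoding

total_lanes = CHIPS * LANES_PER_LINK
total_diff_pairs = total_lanes * 2  # one TX pair plus one RX pair per lane

def link_gbytes_per_s(lanes):
    """Usable bandwidth of a Gen5 link in one direction, in GB/s."""
    return GEN5_GT_PER_S * ENCODING_EFFICIENCY * lanes / 8

per_link = link_gbytes_per_s(LANES_PER_LINK)
print(f"lanes: {total_lanes}, differential pairs: {total_diff_pairs}")
print(f"per-link bandwidth: ~{per_link:.1f} GB/s per direction")
```

The per-link figure is small next to the NeuronLink estimate, which is why the host interface drives pair count and escape routing more than it drives the board's aggregate bandwidth budget.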
HBM power delivery: Trainium 2's 96 GB HBM3 per chip requires clean, tightly regulated HBM VDDQ power rails routed from the baseboard through the chip's BGA power delivery pins. The HBM power requirements are similar in specification to those of H100/H200 HBM power planes, requiring low-noise VRM output and tight ripple budgets.
Estimated layer count: A 16-chip Trainium 2 baseboard routing NeuronLink direct connections between chips, PCIe Gen5 to host CPUs, and power delivery for ~700 W per chip × 16 chips = ~11,200 W total chip power would require approximately 18–24 PCB layers. This is comparable to an H100 HGX baseboard at the lower end but does not reach the 28–32 layers required for B200 due to the absence of NVSwitch chip routing and NVLink 5.0 signal density.
UltraFabric switch boards: In UltraServer configurations connecting multiple Trn2 nodes, dedicated switch boards route the high-bandwidth inter-node NeuronLink UltraFabric signals. These boards face routing challenges similar to lower-density NVSwitch boards in NVIDIA architectures and require low-loss laminates and precise signal integrity design. For the manufacturing process context relevant to these boards, the GPU PCB Manufacturing guide provides the applicable fabrication process framework.
At an estimated ~700 W TDP per Trainium 2 chip and 16 chips per server, a Trn2.48xl node draws approximately 11,200 W for the accelerator chips alone (before CPUs, VRM losses, and networking). This power envelope is comparable to a DGX H100 node (~10,200 W total) and requires similar infrastructure: high-efficiency PSUs, 48 V bus distribution, dedicated VRMs per chip on the baseboard, and liquid cooling for sustained full-load operation.
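The case for 48 V distribution follows directly from the power figure. The sketch below computes bus current at 48 V and, for comparison, at 12 V, using the article's estimated TDP. It ignores VRM conversion losses and non-accelerator loads, and the function name is hypothetical.

```python
CHIPS_PER_NODE = 16
TDP_W_PER_CHIP = 700  # the article's estimate, not a published figure

def bus_current_amps(power_w, bus_voltage_v):
    """Current drawn from a DC bus at a given power (I = P / V)."""
    return power_w / bus_voltage_v

total_chip_power_w = CHIPS_PER_NODE * TDP_W_PER_CHIP
current_48v = bus_current_amps(total_chip_power_w, 48.0)
current_12v = bus_current_amps(total_chip_power_w, 12.0)

# Resistive loss in the same copper scales with the square of the current.
loss_ratio = (current_12v / current_48v) ** 2

print(f"total chip power: {total_chip_power_w} W")
print(f"bus current at 48 V: {current_48v:.0f} A")
print(f"bus current at 12 V: {current_12v:.0f} A")
print(f"I^2R loss ratio, 12 V vs 48 V: {loss_ratio:.0f}x")
```

A fourfold reduction in current cuts distribution loss in the same copper by a factor of sixteen, which is what makes thick power planes and busbars tractable at this power level.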
Amazon deploys Trn2 instances with mandatory direct liquid cooling. The baseboard cold plate mounting requirements, thermal via arrays beneath Trainium 2 packages, and board material Tg requirements (≥ 170°C) are governed by the same physics as GPU boards at equivalent power density, described in the context of AI server thermal design at Thermal Management on AI Server PCBs.
The Trainium 2 package itself uses HBM3 integrated within the chip package (likely using a CoWoS or interposer-based packaging approach, though Amazon has not publicly disclosed the packaging technology). This on-package memory integration means HBM signals are not routed on the baseboard PCB—a simplification relative to designs where HBM is placed as separate packages on the board. From a PCB design standpoint, the Trainium 2 package presents as a large BGA component with power, PCIe, and NeuronLink interface pins at the substrate boundary, similar to an H100 SXM5 from the PCB engineer's perspective.
The AWS Neuron SDK is the software framework for programming Trainium and Inferentia chips. It integrates with PyTorch and TensorFlow via framework plugins that intercept model compilation and route computations to the Neuron hardware. The Neuron compiler performs graph optimization, operator fusion, and memory layout transformations to maximize utilization of Trainium's NeuronCore hardware primitives.
Key characteristics of the Neuron software environment:
torch_neuronx without source code changes; custom CUDA kernels require porting to NeuronX equivalents
The software ecosystem gap between CUDA and Neuron SDK is the primary friction point for teams considering Trainium. Organizations with production CUDA training pipelines, extensive custom CUDA kernels, or dependencies on CUDA-specific libraries (cuDNN, TensorRT) face non-trivial migration costs. AWS estimates that standard PyTorch transformer training pipelines can achieve 90%+ code reuse with Neuron, but “standard” is the operative word: production training stacks often diverge significantly from reference implementations.
The practical decision to use Trainium 2 rather than NVIDIA H100/H200 depends on several factors specific to each organization's situation.
Strong case for Trainium:
Strong case for H100/GPU:
For organizations comparing Trainium against AMD MI300X for inference-heavy workloads, the memory capacity dimension is relevant: Trainium 2's 96 GB HBM3 per chip is less than MI300X's 192 GB, making MI300X more suitable for single-chip inference of very large models. The full trade-off analysis between NVIDIA and AMD platforms is covered at H100 vs MI300X: NVIDIA vs AMD in the AI Accelerator War.
Can I buy Trainium chips for an on-premises server?
No. AWS Trainium chips are not available for purchase as discrete components or as OEM server hardware for on-premises deployment. They are exclusively available through AWS EC2 instances (Trn1, Trn2, and UltraServer configurations). Organizations requiring on-premises AI training hardware must use NVIDIA GPUs, AMD MI-series GPUs in OAM form factor, or Intel Gaudi accelerators, all of which are available through standard server OEM channels.
Does Trainium 2 support FP8 training?
Yes. Trainium 2 added native FP8 support, matching the precision capability introduced in NVIDIA H100's Transformer Engine. FP8 training approximately doubles throughput over BF16 for transformer model training by halving the data width for matrix multiply operations. The Neuron SDK's NxD training library supports FP8 mixed-precision training for compatible model architectures.
How does NeuronLink compare to NVLink?
NeuronLink v2 provides approximately 1,600 GB/s of bidirectional bandwidth per chip using direct peer-to-peer connections, versus NVLink 4.0's 900 GB/s per chip through NVIDIA's NVSwitch crossbar fabric. NeuronLink's higher aggregate bandwidth per chip comes at the architectural cost of a direct-connect mesh topology rather than a fully non-blocking switch fabric—in large all-to-all collectives, some data must traverse multiple NeuronLink hops. For practical LLM training workloads at 16-chip scale, both architectures provide adequate collective bandwidth with comparable efficiency.
What AWS instance types use Trainium?
AWS Trainium is available in: Trn1.2xlarge (1 Trainium chip), Trn1.32xlarge (16 Trainium 1 chips, 512 GB HBM2e), Trn2.48xlarge (16 Trainium 2 chips, 1,536 GB HBM3), and the Trn2 UltraServer (64 Trainium 2 chips across 4 Trn2 nodes connected by NeuronLink UltraFabric, 6,144 GB HBM3). The UltraServer configuration is designed specifically for training frontier-scale models requiring more memory than a single node can provide.
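The memory totals above follow from per-chip HBM capacity times chip count, and they can be checked against a rough training footprint. The sketch below assumes 16 bytes per parameter for mixed-precision Adam (2 for BF16 weights, 2 for gradients, 4 for FP32 master weights, 8 for the two optimizer moments). That rule of thumb is an assumption, it excludes activations and communication buffers, and the helper names are hypothetical.

```python
HBM_GB_PER_CHIP = {"Trainium 1": 32, "Trainium 2": 96}

# (configuration, chip type, chip count), ordered smallest to largest
CONFIGS = [
    ("Trn1.32xlarge", "Trainium 1", 16),
    ("Trn2.48xlarge", "Trainium 2", 16),
    ("Trn2 UltraServer", "Trainium 2", 64),
]

BYTES_PER_PARAM = 16  # assumed mixed-precision Adam footprint, no activations

def total_hbm_gb(chip, count):
    """Aggregate HBM capacity of a configuration in GB."""
    return HBM_GB_PER_CHIP[chip] * count

def smallest_fit(params):
    """Smallest configuration whose HBM holds the model state, or None."""
    needed_gb = params * BYTES_PER_PARAM / 1e9
    for name, chip, count in CONFIGS:
        if total_hbm_gb(chip, count) >= needed_gb:
            return name
    return None

for name, chip, count in CONFIGS:
    print(f"{name}: {total_hbm_gb(chip, count)} GB")
print("70B parameters:", smallest_fit(70e9))
```

Since activations are excluded, a real run needs headroom beyond this estimate, so a model that only just fits on paper will usually need the next configuration up or additional sharding.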
Is Trainium 2 suitable for AI inference as well as training?
Trainium 2 can technically run inference workloads, but AWS's dedicated inference ASIC (Inferentia 2, used in Inf2 instances) is optimized specifically for inference and typically provides better performance per dollar for inference-only deployments. The Neuron SDK supports both training (on Trainium) and inference (on Inferentia and Trainium), and the same compiled model can be deployed to either chip family. For production inference of LLMs, Inf2 instances using Inferentia 2 are the preferred AWS-native option.
Whether you are building ASIC-based AI accelerator boards like Trainium, GPU baseboard designs for H100 or B200, or OAM Universal Base Boards for AMD MI300X deployments, NextPCB provides the high-layer-count fabrication, low-loss laminate processing, BGA assembly, and IPC Class 3 quality standards required across the AI server PCB supply chain.