What Is AWS Trainium? Amazon's AI Chip & PCB Design Requirements

Posted: June, 2026 Last Updated: June, 2026 Writer: Arya Li

Introduction

Amazon Web Services has been building custom silicon for over a decade—from Graviton ARM-based CPUs to Nitro networking ASICs. AWS Trainium is the latest and most ambitious entry in that program: a custom AI training accelerator chip designed entirely in-house by Amazon, intended to offer lower cost-per-training-FLOP than competing GPU solutions for workloads running on the AWS platform.

For hardware engineers evaluating AI accelerator options, understanding Trainium matters for two reasons. First, it represents a credible alternative to NVIDIA H100 for a growing set of training workloads, with competitive throughput on transformer models and a cost advantage that grows as AWS scales production. Second, its PCB and server architecture makes different demands than GPU-based systems—different interconnect topology, different memory subsystem, different power delivery requirements—that are worth understanding independently of the GPU-centric design knowledge covered elsewhere in this series.

This article covers Trainium 2's architecture, its NeuronLink interconnect, the PCB design requirements of Trn2-based server boards, and the workloads and deployment scenarios where Trainium offers the most compelling economics relative to NVIDIA's platform.

What Is AWS Trainium?

Trainium is Amazon's purpose-built AI training accelerator, designed to run deep learning training workloads on the AWS cloud. Unlike NVIDIA GPUs, which are designed as general-purpose parallel processors that happen to excel at AI, Trainium is an ASIC (Application-Specific Integrated Circuit) optimized specifically for the matrix multiplication and tensor operations that dominate neural network training.

AWS announced the first-generation Trainium chip in December 2020 and deployed it in Trn1 EC2 instances in late 2022. Trainium 2 (Trn2) followed with a major architectural update targeting larger models and higher throughput, entering general availability in 2024. Trainium chips are not available for purchase as discrete components; they exist exclusively within AWS EC2 instances (Trn1, Trn2, and UltraServer configurations), which is a fundamental difference from NVIDIA GPUs that can be purchased for on-premises deployment.

The Trainium product line sits within Amazon's broader custom silicon strategy alongside:

  • Inferentia: Amazon's inference-focused ASIC (Inf1, Inf2) optimized for low-latency, high-throughput inference rather than training
  • Graviton: ARM-based CPUs for general-purpose cloud compute
  • Nitro: Custom networking and security ASICs embedded in every AWS EC2 hypervisor

Trainium 2: Architecture and Specifications

Trainium 2 represents a substantial generational improvement over the original Trainium chip. Amazon has not published the complete semiconductor specifications (process node, transistor count, die size), but the following system-level specifications are publicly available through AWS documentation:

| Specification | Trainium 1 (Trn1) | Trainium 2 (Trn2) |
| --- | --- | --- |
| Chips per server | 16 (Trn1.32xl) | 16 (Trn2.48xl) |
| Peak compute (per chip) | ~190 TFLOPS (FP16) | ~3,800 TFLOPS (FP8, est.) |
| HBM memory per chip | 32 GB HBM2e | 96 GB HBM3 |
| HBM bandwidth per chip | ~820 GB/s | ~2,900 GB/s (est.) |
| NeuronLink per-chip bandwidth | ~800 GB/s | ~1,600 GB/s (est.) |
| Host interface | PCIe Gen4 ×8 | PCIe Gen5 ×16 |
| TDP per chip | ~300 W | ~700 W (est.) |
| Supported precisions | FP32, TF32, BF16, FP16 | FP32, TF32, BF16, FP16, FP8 |

The NeuronCore architecture inside each Trainium chip contains multiple dedicated tensor engine units for matrix multiplication, vector processing engines for activation functions and normalization, and an on-chip SRAM scratchpad memory that feeds the compute engines at high bandwidth without requiring DRAM access for intermediate results. This scratchpad-centric design differs from GPU architectures that rely on L2 caches and HBM for inter-layer data movement; it enables higher sustained compute utilization on well-optimized models but requires the Neuron compiler to explicitly manage data placement in the scratchpad.

Trainium 2 adds native FP8 support—matching NVIDIA's Transformer Engine capability introduced in H100—which is critical for transformer model training at scale. FP8 approximately doubles throughput over BF16 for matrix-multiply-dominated operations, as analyzed in the AI Training vs Inference guide.


NeuronLink Interconnect

NeuronLink is Amazon's proprietary high-bandwidth interconnect between Trainium chips within a server. In the Trn2.48xl configuration, 16 Trainium 2 chips are connected in a high-bandwidth mesh topology via NeuronLink v2, providing approximately 1,600 GB/s of bidirectional bandwidth per chip for all-to-all collective communication.

Unlike NVIDIA's NVLink, which routes through dedicated NVSwitch chips to create a fully non-blocking fabric (as described in the NVSwitch architecture guide), NeuronLink uses direct peer-to-peer connections between chips. This topology is simpler from a PCB design standpoint—no dedicated switch chip and no NVSwitch-equivalent BGA packages on the board—but it means that the all-to-all bandwidth is achieved through multiple routing hops rather than a single switch fabric hop.
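To make the hop-count trade-off concrete, the sketch below counts the minimum number of chip-to-chip hops in a direct-connect layout and contrasts it with a single switch traversal. The 4×4 2D torus is an assumption chosen for illustration, since the exact NeuronLink wiring is not given here, and the function names are hypothetical.

```python
from collections import deque


def torus_neighbors(node, rows, cols):
    """Return the four wrap-around neighbors of a node in a 2D torus."""
    r, c = node
    return [
        ((r + 1) % rows, c),
        ((r - 1) % rows, c),
        (r, (c + 1) % cols),
        (r, (c - 1) % cols),
    ]


def hop_counts(src, rows, cols):
    """Breadth-first search: minimum hops from src to every chip."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nxt in torus_neighbors(node, rows, cols):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


ROWS, COLS = 4, 4  # assumed 16-chip layout, not a published NeuronLink spec
dist = hop_counts((0, 0), ROWS, COLS)
others = [d for node, d in dist.items() if node != (0, 0)]

max_hops = max(others)
avg_hops = sum(others) / len(others)
print(f"direct-connect torus: max {max_hops} hops, average {avg_hops:.2f} hops")
print("single switch fabric: every pair is one switch traversal apart")
```

Under this assumed layout the farthest chip is four hops away, which is the kind of multi-hop path a non-blocking switch fabric avoids at the cost of extra switch packages on the board.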

NeuronLink v2 traces on the Trn2 server baseboard carry high-speed differential signals at speeds requiring low-loss PCB laminate on the interconnect signal layers. While Amazon has not published NeuronLink's specific per-lane signaling rate, the aggregate bandwidth figure (~1,600 GB/s per chip) implies per-lane speeds in the NVLink 4.0 class (100 Gb/s range), with associated signal integrity requirements: differential impedance control to ± 5%, low-loss laminate (Megtron 6E class or better), and backdrilling of any through-hole vias on the signal paths. These requirements are consistent with the PCIe Gen5 and high-speed interconnect design principles detailed in the PCIe Gen5 PCB design guide.
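The link between aggregate bandwidth and lane count can be checked with a back-of-envelope calculation. The sketch below assumes the bidirectional figure splits evenly between directions, ignores encoding and protocol overhead, and uses a 100 Gb/s per-lane rate that Amazon has not published; the function name is hypothetical.

```python
import math


def lanes_per_direction(bidir_gbytes_per_s, lane_gbits_per_s):
    """Lanes needed in each direction to carry half of a bidirectional figure."""
    per_direction_gbits = bidir_gbytes_per_s / 2 * 8  # GB/s -> Gb/s, one direction
    return math.ceil(per_direction_gbits / lane_gbits_per_s)


AGGREGATE_GBYTES = 1600  # article's estimated per-chip bidirectional bandwidth
LANE_RATE_GBITS = 100    # assumed per-lane rate, not an Amazon specification

lanes = lanes_per_direction(AGGREGATE_GBYTES, LANE_RATE_GBITS)
diff_pairs = lanes * 2   # one differential pair per lane in each direction
print(f"{lanes} lanes per direction, {diff_pairs} differential pairs per chip")
```

Halving the assumed lane rate doubles the pair count, which is why the per-lane signaling rate drives both laminate choice and routing density.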

Across multiple Trn2 UltraServer nodes, chips communicate via a dedicated high-bandwidth interconnect fabric (NeuronLink UltraFabric) at up to 1,600 GB/s per chip across servers—a capability that enables scaling to thousands of Trainium chips for very large model training runs without the bandwidth degradation that occurs when using InfiniBand for inter-node collectives on GPU clusters.


Trainium 2 vs NVIDIA H100: Key Differences

| Dimension | Trainium 2 (per chip) | H100 SXM5 |
| --- | --- | --- |
| FP8 compute | ~3,800 TFLOPS (est.) | ~2,000 TFLOPS dense (~4,000 with sparsity) |
| BF16 compute | ~1,900 TFLOPS (est.) | 989 TFLOPS dense |
| HBM memory | 96 GB HBM3 | 80 GB HBM3 |
| HBM bandwidth | ~2.9 TB/s (est.) | 3.35 TB/s |
| Chip-to-chip bandwidth | ~1,600 GB/s (NeuronLink v2) | 900 GB/s (NVLink 4.0) |
| Chip-to-chip topology | Direct peer-to-peer mesh (no switch chip) | NVSwitch crossbar (fully non-blocking) |
| Host interface | PCIe Gen5 ×16 | PCIe Gen5 ×16 |
| Availability | AWS cloud only | Cloud + on-premises |
| Software ecosystem | Neuron SDK; PyTorch via plugin | CUDA; native PyTorch support |
| On-premises deployment | Not available | Available (DGX, HGX, OEM servers) |

The headline comparison favors Trainium 2 on raw compute (higher estimated TFLOPS) and chip-to-chip bandwidth, while H100 leads on HBM bandwidth, software ecosystem maturity, and deployment flexibility. In practice, the most important performance metric is end-to-end training throughput on the specific model architecture and batch configuration used in production—a number that varies significantly based on compiler optimization quality and how well the model maps to each chip's hardware primitives.

AWS publishes benchmark results showing Trainium 2 achieving competitive or superior cost-per-training-FLOP on specific transformer workloads (LLaMA, GPT-NeoX) compared to equivalent H100 configurations. Independent validation of these benchmarks varies, and engineers should run representative workloads on both platforms before making procurement decisions. For a broader comparison of GPU vs ASIC trade-offs, see GPU vs TPU vs ASIC vs FPGA: Which AI Chip Architecture Will Dominate in 2027?
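When running representative workloads on both platforms, one way to normalize the results is cost per delivered FLOP rather than cost per peak FLOP. The sketch below is a minimal illustration; the function name is hypothetical, and the prices, peak figures, and utilization values are placeholders, not measurements of any real instance type.

```python
def cost_per_exaflop(hourly_price_usd, peak_tflops, utilization):
    """USD per 10^18 floating-point operations actually delivered."""
    delivered_flops_per_hour = peak_tflops * 1e12 * utilization * 3600
    return hourly_price_usd / delivered_flops_per_hour * 1e18


# Placeholder inputs for illustration only; substitute measured values.
platform_a = cost_per_exaflop(hourly_price_usd=4.00, peak_tflops=1000, utilization=0.40)
platform_b = cost_per_exaflop(hourly_price_usd=3.00, peak_tflops=1000, utilization=0.25)

print(f"platform A: ${platform_a:.2f} per exaFLOP delivered")
print(f"platform B: ${platform_b:.2f} per exaFLOP delivered")
```

In this made-up case the platform with the lower hourly price is the more expensive one per delivered FLOP, because achieved utilization matters as much as list price.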


PCB Design Requirements for Trainium 2 Boards

Trainium 2 server boards (the Trn2.48xl UltraServer baseboard and the inter-server UltraFabric switch board) are complex, high-layer-count PCBs, though they differ in important ways from NVIDIA GPU baseboards.

No NVSwitch-equivalent chip: Because NeuronLink uses direct chip-to-chip connections rather than a switch-chip topology, the Trn2 baseboard does not carry the equivalent of NVSwitch packages. This reduces the routing complexity somewhat compared to an H100 baseboard (which carries 4 NVSwitch 3.0 chips with their own power delivery and BGA routing requirements), but the direct NeuronLink routing between 16 Trainium chips still generates a substantial number of high-speed differential pairs.

PCIe Gen5 host interface: Each Trainium 2 chip connects to the host CPU via PCIe Gen5 ×16. With 16 chips per server, the baseboard routes 16 PCIe Gen5 ×16 links, or 256 lanes in total. Because each lane uses one transmit and one receive differential pair, that amounts to 512 PCIe Gen5 differential pairs. These require the same signal integrity treatment as any PCIe Gen5 design: controlled impedance to ± 5%, low-loss laminate (Megtron 6E class) on PCIe signal layers, backdrilling to remove through-hole via stubs, and VLP or HVLP copper foil, as detailed in the materials guidance at High-Speed PCB Materials for AI Servers.
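The host-interface routing load follows from simple arithmetic, sketched below. It relies on two standard PCIe facts: each lane consists of one transmit and one receive differential pair, and Gen5 runs at 32 GT/s per lane with 128b/130b encoding. The variable names are illustrative.

```python
CHIPS = 16
LANES_PER_CHIP = 16
PAIRS_PER_LANE = 2      # one TX pair and one RX pair per PCIe lane
GT_PER_S = 32           # PCIe Gen5 raw transfer rate per lane
ENCODING = 128 / 130    # 128b/130b line-encoding efficiency

lanes = CHIPS * LANES_PER_CHIP
pairs = lanes * PAIRS_PER_LANE
# Usable bandwidth of one x16 link in one direction, before protocol overhead.
link_gbytes_per_s = GT_PER_S * ENCODING * LANES_PER_CHIP / 8

print(f"{lanes} lanes, {pairs} differential pairs")
print(f"~{link_gbytes_per_s:.1f} GB/s per direction per x16 link")
```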

HBM power delivery: Trainium 2's 96 GB HBM3 per chip requires clean, tightly regulated HBM VDDQ power rails routed from the baseboard through the chip's BGA power delivery pins. The HBM power requirements are similar in specification to those of H100/H200 HBM power planes, requiring low-noise VRM output and tight ripple budgets.

Estimated layer count: A 16-chip Trainium 2 baseboard routing NeuronLink direct connections between chips, PCIe Gen5 to host CPUs, and power delivery for ~700 W per chip × 16 chips = ~11,200 W total chip power would require approximately 18–24 PCB layers. This is comparable to an H100 HGX baseboard at the lower end but does not reach the 28–32 layers required for B200 due to the absence of NVSwitch chip routing and NVLink 5.0 signal density.

UltraFabric switch boards: In UltraServer configurations connecting multiple Trn2 nodes, dedicated switch boards route the high-bandwidth inter-node NeuronLink UltraFabric signals. These boards face routing challenges similar to lower-density NVSwitch boards in NVIDIA architectures and require low-loss laminates and precise signal integrity design. For the manufacturing process context relevant to these boards, the GPU PCB Manufacturing guide provides the applicable fabrication process framework.


Power Delivery and Thermal Management

At an estimated ~700 W TDP per Trainium 2 chip and 16 chips per server, a Trn2.48xl node consumes approximately 11,200 W in the accelerator chips alone (before CPUs, VRM losses, and networking). This power envelope is comparable to a DGX H100 node (~10,200 W total system power) and requires similar infrastructure: high-efficiency PSUs, 48 V bus distribution, dedicated VRMs per chip on the baseboard, and liquid cooling for sustained full-load operation.
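The currents implied by these figures show why 48 V distribution and per-chip VRMs are needed. The sketch below uses the article's estimated TDP; the 90% conversion efficiency and 0.8 V core rail are assumptions for illustration, not Trainium specifications, and the function name is hypothetical.

```python
def bus_current_amps(load_watts, bus_volts, efficiency=1.0):
    """Current drawn from a DC bus to deliver load_watts through a converter."""
    return load_watts / (bus_volts * efficiency)


CHIP_TDP_W = 700        # article's estimate per Trainium 2 chip
CHIPS = 16
BUS_V = 48
VRM_EFFICIENCY = 0.90   # assumed end-to-end conversion efficiency
CORE_V = 0.8            # assumed core rail voltage, not a published value

total_chip_w = CHIP_TDP_W * CHIPS
bus_amps = bus_current_amps(total_chip_w, BUS_V, VRM_EFFICIENCY)
core_amps_per_chip = CHIP_TDP_W / CORE_V

print(f"total chip power: {total_chip_w} W")
print(f"48 V bus current: {bus_amps:.1f} A")
print(f"core rail current per chip: {core_amps_per_chip:.0f} A")
```

A few hundred amps at 48 V is manageable with busbars and heavy copper planes; delivering the same power at the core voltage would mean hundreds of amps per chip, which is why conversion happens as close to each package as possible.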

Amazon deploys Trn2 instances with mandatory direct liquid cooling. The baseboard cold plate mounting requirements, thermal via arrays beneath Trainium 2 packages, and board material Tg requirements (≥ 170°C) are governed by the same physics as GPU boards at equivalent power density, described in the context of AI server thermal design at Thermal Management on AI Server PCBs.
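A first-order estimate of the coolant flow needed for this heat load comes from the energy balance Q = m·c·ΔT. The sketch below assumes plain water and a 10 K coolant temperature rise, both illustrative choices rather than AWS operating parameters; the function name is hypothetical.

```python
def coolant_flow_l_per_min(heat_watts, delta_t_kelvin,
                           specific_heat=4186.0, density=1000.0):
    """Volumetric flow needed to remove heat_watts at a given coolant temperature rise.

    specific_heat is in J/(kg*K) and density in kg/m^3; defaults are for water.
    """
    mass_flow_kg_s = heat_watts / (specific_heat * delta_t_kelvin)
    return mass_flow_kg_s / density * 1000 * 60  # m^3/s -> L/min


flow = coolant_flow_l_per_min(heat_watts=11_200, delta_t_kelvin=10)
print(f"~{flow:.1f} L/min of water for 11.2 kW at a 10 K rise")
```

Allowing a larger temperature rise reduces the required flow proportionally, at the cost of hotter outlet coolant and less thermal margin at the last cold plate in a series loop.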

The Trainium 2 package itself uses HBM3 integrated within the chip package (likely using a CoWoS or interposer-based packaging approach, though Amazon has not publicly disclosed the packaging technology). This on-package memory integration means HBM signals are not routed on the baseboard PCB—a simplification relative to designs where HBM is placed as separate packages on the board. From a PCB design standpoint, the Trainium 2 package presents as a large BGA component with power, PCIe, and NeuronLink interface pins at the substrate boundary, similar to an H100 SXM5 from the PCB engineer's perspective.


Neuron SDK and Software Ecosystem

The AWS Neuron SDK is the software framework for programming Trainium and Inferentia chips. It integrates with PyTorch and TensorFlow via framework plugins that intercept model compilation and route computations to the Neuron hardware. The Neuron compiler performs graph optimization, operator fusion, and memory layout transformations to maximize utilization of Trainium's NeuronCore hardware primitives.

Key characteristics of the Neuron software environment:

  • PyTorch compatibility: Most standard PyTorch models can be compiled for Trainium via torch_neuronx without source code changes; custom CUDA kernels require porting to NeuronX equivalents
  • NxD (NeuronX Distributed): A distributed training library providing tensor parallelism, pipeline parallelism, and data parallelism primitives optimized for NeuronLink all-to-all communication patterns
  • Model support: Trainium 2 officially supports training of major LLM families including LLaMA 2/3, Mistral, GPT-NeoX, BERT variants, and diffusion models; support for less common architectures may require additional compiler work
  • Compilation time: The Neuron compiler performs ahead-of-time (AOT) graph compilation, which means model compilation takes minutes to hours for large models; subsequent runs reuse the compiled artifact. This differs from GPU JIT compilation where most graph optimization occurs at first execution
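The practical effect of ahead-of-time compilation with artifact reuse can be illustrated with a toy cache. This is a conceptual sketch only: the class and function names are hypothetical, it does not call or model the Neuron SDK, and keying the cache on input shapes reflects how shape-specialized compilers generally behave rather than documented Neuron behavior.

```python
import hashlib
import json


class CompileCache:
    """Toy ahead-of-time compile cache keyed by graph name and input shapes."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn
        self._artifacts = {}
        self.compilations = 0

    def _key(self, graph, input_shapes):
        payload = json.dumps({"graph": graph, "shapes": input_shapes}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def get(self, graph, input_shapes):
        """Return a cached artifact, compiling only on the first request."""
        key = self._key(graph, input_shapes)
        if key not in self._artifacts:
            self._artifacts[key] = self._compile_fn(graph, input_shapes)
            self.compilations += 1
        return self._artifacts[key]


def fake_compile(graph, input_shapes):
    """Stand-in for an expensive compiler invocation."""
    return f"artifact({graph}, {input_shapes})"


cache = CompileCache(fake_compile)
cache.get("transformer_block", [[8, 2048]])   # first call compiles
cache.get("transformer_block", [[8, 2048]])   # same graph and shapes: reused
cache.get("transformer_block", [[16, 2048]])  # different shapes: compiles again
print(cache.compilations)
```

The takeaway for pipeline planning is that changing batch size or sequence length between runs can trigger a fresh compile in a shape-specialized flow, so fixing shapes early keeps iteration time predictable.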

The software ecosystem gap between CUDA and Neuron SDK is the primary friction point for teams considering Trainium. Organizations with production CUDA training pipelines, extensive custom CUDA kernels, or dependencies on CUDA-specific libraries (cuDNN, TensorRT) face non-trivial migration costs. AWS estimates that standard PyTorch transformer training pipelines can achieve 90%+ code reuse with Neuron, but “standard” is the operative word—production training stacks often diverge significantly from reference implementations.


When to Choose Trainium Over GPU

The practical decision to use Trainium 2 rather than NVIDIA H100/H200 depends on several factors specific to each organization's situation.

Strong case for Trainium:

  • Training workload runs exclusively on AWS and uses standard transformer architectures (LLaMA, Mistral, GPT-NeoX) without extensive custom CUDA kernels
  • Cost optimization is the primary objective and the team is willing to invest in Neuron SDK adoption
  • Very large-scale training (thousands of chips) where the NeuronLink UltraFabric's cross-node bandwidth advantage over InfiniBand-connected GPU clusters reduces collective communication overhead
  • AWS contract negotiation context where Trainium instance pricing can be discounted relative to H100 instances

Strong case for H100/GPU:

  • Existing production training pipeline uses custom CUDA kernels, CUDA graph optimization, or CUDA-specific libraries with no GPU-agnostic equivalents
  • On-premises deployment is required (Trainium is cloud-only)
  • Workload requires multi-cloud flexibility or specific GPU instance types from non-AWS cloud providers
  • Training of non-standard architectures (graph neural networks, reinforcement learning environments, custom operators) where Neuron compiler support is limited
  • Inference workload requiring NVIDIA-specific inference optimizations (TensorRT, CUDA graphs, FlashAttention variants)

For organizations comparing Trainium against AMD MI300X for inference-heavy workloads, the memory capacity dimension is relevant: Trainium 2's 96 GB HBM3 per chip is less than MI300X's 192 GB, making MI300X more suitable for single-chip inference of very large models. The full trade-off analysis between NVIDIA and AMD platforms is covered at H100 vs MI300X: NVIDIA vs AMD in the AI Accelerator War.
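The memory-capacity point can be checked with a rough weights-only calculation. The sketch below ignores KV cache, activations, and runtime overhead, so real headroom is smaller than it suggests; the 70B-parameter model is an illustrative example and the function names are hypothetical.

```python
def weight_footprint_gb(params_billions, bytes_per_param):
    """Weights-only footprint in GB (10^9 bytes); excludes KV cache and activations."""
    return params_billions * bytes_per_param


def fits_on_one_chip(params_billions, bytes_per_param, hbm_gb):
    """True if the model weights alone fit in a single chip's HBM."""
    return weight_footprint_gb(params_billions, bytes_per_param) <= hbm_gb


HBM_GB = {"Trainium 2": 96, "MI300X": 192}
PRECISIONS = {"BF16": 2, "FP8": 1}  # bytes per parameter

for precision, nbytes in PRECISIONS.items():
    for chip, capacity in HBM_GB.items():
        ok = fits_on_one_chip(70, nbytes, capacity)
        print(f"70B @ {precision} on {chip}: {'fits' if ok else 'does not fit'}")
```

At BF16 the example model's 140 GB of weights fit only on the 192 GB part, while at FP8 the 70 GB footprint fits on both, leaving the remaining HBM to decide how much KV cache is available.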


FAQ

Can I buy Trainium chips for an on-premises server?
No. AWS Trainium chips are not available for purchase as discrete components or as OEM server hardware for on-premises deployment. They are exclusively available through AWS EC2 instances (Trn1, Trn2, and UltraServer configurations). Organizations requiring on-premises AI training hardware must use NVIDIA GPUs, AMD MI-series GPUs in OAM form factor, or Intel Gaudi accelerators, all of which are available through standard server OEM channels.

Does Trainium 2 support FP8 training?
Yes. Trainium 2 added native FP8 support, matching the precision capability introduced in NVIDIA H100's Transformer Engine. FP8 training approximately doubles throughput over BF16 for transformer model training by halving the data width for matrix multiply operations. The Neuron SDK's NxD training library supports FP8 mixed-precision training for compatible model architectures.

How does NeuronLink compare to NVLink?
NeuronLink v2 provides approximately 1,600 GB/s of bidirectional bandwidth per chip using direct peer-to-peer connections, versus NVLink 4.0's 900 GB/s per chip through NVIDIA's NVSwitch crossbar fabric. NeuronLink's higher aggregate bandwidth per chip comes at the architectural cost of a direct-connect mesh topology rather than a fully non-blocking switch fabric—in large all-to-all collectives, some data must traverse multiple NeuronLink hops. For practical LLM training workloads at 16-chip scale, both architectures provide adequate collective bandwidth with comparable efficiency.

What AWS instance types use Trainium?
AWS Trainium is available in: Trn1.2xlarge (1 Trainium 1 chip), Trn1.32xlarge (16 Trainium 1 chips, 512 GB HBM2e), Trn2.48xlarge (16 Trainium 2 chips, 1,536 GB HBM3), and the Trn2 UltraServer (64 Trainium 2 chips across four 16-chip Trn2 nodes connected by NeuronLink UltraFabric, 6,144 GB HBM3). The UltraServer configuration is designed specifically for training frontier-scale models requiring more memory than a single node can provide.
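The aggregate HBM figures above follow directly from chip count times per-chip capacity, as this short check shows. The dictionary layout is illustrative; the chip counts and capacities are the ones quoted in this article.

```python
# (chips, HBM GB per chip) as quoted in this article
CONFIGS = {
    "Trn1.32xlarge": (16, 32),
    "Trn2.48xlarge": (16, 96),
    "Trn2 UltraServer": (64, 96),
}

totals = {name: chips * gb for name, (chips, gb) in CONFIGS.items()}

for name, total in totals.items():
    print(f"{name}: {total:,} GB HBM")
```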

Is Trainium 2 suitable for AI inference as well as training?
Trainium 2 can technically run inference workloads, but AWS's dedicated inference ASIC (Inferentia 2, used in Inf2 instances) is optimized specifically for inference and typically provides better performance per dollar for inference-only deployments. The Neuron SDK supports both training (on Trainium) and inference (on Inferentia and Trainium), and the same compiled model can be deployed to either chip family. For production inference of LLMs, Inf2 instances using Inferentia 2 are the preferred AWS-native option.


Need to Manufacture AI Accelerator PCBs?

Whether you are building ASIC-based AI accelerator boards like Trainium, GPU baseboard designs for H100 or B200, or OAM Universal Base Boards for AMD MI300X deployments, NextPCB provides the high-layer-count fabrication, low-loss laminate processing, BGA assembly, and IPC Class 3 quality standards required across the AI server PCB supply chain.

About the Author

Arya Li, Project Manager at NextPCB.com

With extensive experience in manufacturing and international client management, Arya has guided factory visits for over 200 overseas clients, providing bilingual (English & Chinese) presentations on production processes, quality control systems, and advanced manufacturing capabilities. Her deep understanding of both the factory side and client requirements allows her to deliver professional, reliable PCB solutions efficiently. Detail-oriented and service-driven, Arya is committed to being a trusted partner for clients and showcasing the strength and expertise of the factory in the global PCB and PCBA market.