Google TPU v5 Explained: Architecture, PCB Layout & Deployment

Posted: June, 2026 | Last Updated: June, 2026 | Writer: Arya Li

Introduction

Google began deploying the Tensor Processing Unit (TPU) in its data centers in 2015, announced it publicly in 2016, and has iterated through five major generations since. By 2026, TPU v5 powers a substantial fraction of Google's internal AI training workloads, including the Gemini model family, and is available to external customers through Google Cloud. For hardware engineers evaluating AI accelerator options, the TPU is the longest-standing non-GPU AI training platform in production and offers a concrete alternative to NVIDIA's architecture for specific workloads and deployment contexts.

Understanding TPU v5 matters beyond its performance benchmarks. The TPU's systolic array architecture makes fundamentally different demands on the PCB and server infrastructure than a GPU does: it leans on a large on-chip SRAM scratchpad, which lets the v5e ship with far less HBM per chip than a GPU (16 GB); it deploys a proprietary inter-chip interconnect (ICI) rather than NVLink; and it scales through a pod topology that treats thousands of chips as a single logical accelerator connected in a torus mesh. Each of these architectural decisions cascades into specific PCB design requirements that differ from the GPU board design principles covered in the rest of this series.

This article explains TPU v5's architecture, its ICI interconnect, the PCB and server design implications, how it compares to NVIDIA H100, and the workloads and organizational contexts where TPU v5 offers genuine advantages.

What Is a Google TPU?

A TPU (Tensor Processing Unit) is an ASIC (Application-Specific Integrated Circuit) developed by Google specifically for accelerating tensor computations in deep learning. Unlike a GPU, which is a general-purpose parallel processor that was adapted for AI, the TPU was designed from the ground up to execute matrix multiplication with maximum efficiency. Its hardware reflects a single insight: deep learning training and inference are dominated by large matrix multiplications (GEMMs), and a chip that executes GEMMs with minimal overhead will dramatically outperform a general-purpose processor of equivalent silicon area and power budget on those operations.

TPUs are available through Google Cloud in TPU VM instances and TPU pods. They are not sold as discrete hardware components or embedded in OEM server products—like AWS Trainium (covered in the preceding article What Is AWS Trainium?), TPUs are cloud-only accelerators. This fundamental availability constraint means that TPU v5 is primarily relevant to organizations whose AI compute runs on Google Cloud or who have negotiated private TPU capacity agreements with Google.


TPU v5 Variants: v5e and v5p

Google introduced two distinct TPU v5 products targeting different points on the performance/cost curve:

TPU v5e (efficiency): Optimized for cost-effective training and inference at scale. v5e chips have lower per-chip compute than v5p but are priced significantly lower per chip-hour on Google Cloud. They are the preferred option for fine-tuning, serving medium-sized models, and training workloads that do not require the absolute maximum throughput per chip. A full TPU v5e pod contains 256 chips in a 16 × 16 2D torus topology.

TPU v5p (performance): Optimized for maximum throughput per chip, targeting frontier model pre-training. v5p has higher compute density, more HBM, and higher ICI bandwidth than v5e. A full TPU v5p pod scales to 8,960 chips in a 3D torus topology, making it one of the largest single-fabric AI compute clusters in commercial availability. Google uses v5p for training its largest Gemini models internally.

Both variants share the same core TPU v5 silicon architecture but differ in configuration, packaging, and interconnect density. The PCB-level distinctions between v5e and v5p boards reflect their different ICI bandwidth targets: v5p boards route more ICI lanes per chip, requiring more signal routing layers and tighter signal integrity management.


TPU v5 Architecture: Systolic Array and MXU

The defining feature of every TPU generation is the Matrix Multiply Unit (MXU), implemented as a two-dimensional systolic array of multiply-accumulate (MAC) units. In a systolic array, data “flows” through the array in a wave pattern: input activations enter from one edge of the array, weights enter from an adjacent edge, and partial sum results propagate through the array and accumulate as each element passes through successive MAC units. This arrangement achieves very high arithmetic throughput with minimal memory bandwidth for the computation itself—once data enters the array, it is reused by every MAC unit it passes through without requiring additional DRAM reads.
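The data reuse described above can be illustrated with a cycle-by-cycle simulation of a small systolic array. This is a minimal sketch of an output-stationary variant, chosen for brevity; it is not a model of Google's actual MXU dataflow, and the function name `systolic_matmul` is hypothetical. Each operand is read from an array edge exactly once and then handed from cell to cell.

```python
def systolic_matmul(a, b):
    """Multiply a (M x K) by b (K x N) on a simulated M x N grid of MAC cells."""
    m, k, n = len(a), len(b), len(b[0])
    acc = [[0] * n for _ in range(m)]        # one accumulator per MAC cell
    a_reg = [[None] * n for _ in range(m)]   # value each cell forwards to the right
    b_reg = [[None] * n for _ in range(m)]   # value each cell forwards downward
    cycles = m + n + k - 2                   # time for the last operands to reach the far corner
    for t in range(cycles):
        new_a = [[None] * n for _ in range(m)]
        new_b = [[None] * n for _ in range(m)]
        for i in range(m):
            for j in range(n):
                if j == 0:
                    # Left edge: row i of a enters skewed by i cycles.
                    idx = t - i
                    a_in = a[i][idx] if 0 <= idx < k else None
                else:
                    a_in = a_reg[i][j - 1]   # reuse the left neighbor's value
                if i == 0:
                    # Top edge: column j of b enters skewed by j cycles.
                    idx = t - j
                    b_in = b[idx][j] if 0 <= idx < k else None
                else:
                    b_in = b_reg[i - 1][j]   # reuse the upper neighbor's value
                if a_in is not None and b_in is not None:
                    acc[i][j] += a_in * b_in  # one multiply-accumulate per cell per cycle
                new_a[i][j], new_b[i][j] = a_in, b_in
        a_reg, b_reg = new_a, new_b
    return acc, cycles


result, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(result, cycles)
```

Every element of `a` is fetched once at the left edge and then consumed by all N cells in its row, which is the reuse property that keeps memory traffic low relative to arithmetic.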

The systolic array architecture makes the TPU extremely efficient for large, regular matrix multiplications (the dominant operation in transformer attention and feed-forward layers) and less efficient for irregular or sparse operations, dynamic control flow, or operations that do not map naturally to the fixed-size matrix dimensions of the MXU. This is the fundamental reason why TPUs require compiler-level graph optimization (XLA) to achieve high utilization: the XLA compiler must tile, pad, and reshape tensor operations to match the MXU's preferred matrix dimensions.

Each TPU v5 chip also contains:

  • Vector Processing Units (VPUs): For element-wise operations, activation functions (GeLU, SiLU, ReLU), normalization, and softmax that do not map to the MXU
  • On-chip SRAM (scratchpad memory): Large (estimated 100+ MB per chip on v5p) high-bandwidth SRAM that feeds the MXU and VPUs; the compiler explicitly manages data placement in the scratchpad rather than relying on a hardware cache
  • HBM interface: Off-chip HBM for storing model weights and activations that do not fit in the on-chip SRAM; on v5e the HBM capacity is far smaller than on contemporary GPUs (see specs below) because the scratchpad reduces HBM bandwidth pressure for compute-bound operations
  • ICI interface: The inter-chip interconnect that links each chip directly to its torus neighbors and carries collective communication across the pod

ICI: Inter-Chip Interconnect

The ICI (Inter-Chip Interconnect) is Google's proprietary high-bandwidth, low-latency interconnect between TPU chips. Unlike NVIDIA's NVLink (which routes through NVSwitch chips as explained in the NVSwitch guide), ICI uses direct chip-to-chip connections in a torus mesh topology without a dedicated switch chip.

ICI key characteristics:

  • Topology: 2D torus (v5e, up to 256 chips) or 3D torus (v5p, up to 8,960 chips)
  • Bandwidth: TPU v5p ICI provides approximately 4,800 Gb/s (~600 GB/s) per chip bidirectional; v5e provides lower ICI bandwidth consistent with its efficiency focus
  • Physical medium: On PCB copper traces (shorter distances within a host board) and optical cables or high-speed copper cables (longer distances between host boards in a pod)
  • Latency: Very low within a torus neighborhood; latency increases with hop count for far-away chips in large pods
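The hop-count behavior in the last bullet can be sketched with a small helper that counts the minimum link hops between two coordinates on a torus. The function names are hypothetical. The 16 × 20 × 28 shape used for the 8,960-chip case is an assumption (one factorization that yields 8,960), not a figure taken from this article.

```python
def torus_hops(src, dst, dims):
    """Minimum link hops between two chips on a torus with wraparound links."""
    hops = 0
    for s, d, size in zip(src, dst, dims):
        direct = abs(s - d)
        hops += min(direct, size - direct)  # go the short way around each ring
    return hops


def torus_diameter(dims):
    """Worst-case hop count between any two chips."""
    return sum(size // 2 for size in dims)


print(torus_hops((0, 0), (15, 15), (16, 16)))  # corner to corner wraps around
print(torus_diameter((16, 16)))                # 2D torus, 256 chips
print(torus_diameter((16, 20, 28)))            # assumed 3D shape for 8,960 chips
```

The wraparound links are what keep the worst-case distance at half of each dimension rather than the full dimension.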

The torus topology used by ICI differs fundamentally from NVLink's non-blocking switch fabric. Within an NVSwitch-connected NVLink domain, any GPU can communicate with any other GPU at full bandwidth through the NVSwitch crossbar, regardless of which two GPUs are communicating. In an ICI torus, bandwidth between two chips depends on how many torus links separate them; chips at opposite ends of a large pod must communicate through multiple hops, each consuming a portion of the intermediate links' bandwidth. For collective operations (all-reduce, all-gather), which are the dominant communication pattern in LLM training, the torus topology is well matched: the XLA collective communication primitives are optimized to exploit the torus structure, so most traffic moves between neighboring chips rather than across the pod.
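To see why neighbor-only links suit collectives, consider the generic ring all-reduce, in which each node only ever sends to its right-hand neighbor. Each dimension of a torus forms such a ring. The following is a minimal sketch of the textbook algorithm in plain Python, not XLA's implementation; `ring_all_reduce` is a hypothetical name.

```python
def ring_all_reduce(vectors):
    """Sum equal-length vectors held by n ring nodes; every node ends with the total."""
    n = len(vectors)
    length = len(vectors[0])
    bounds = [c * length // n for c in range(n + 1)]  # chunk c spans bounds[c]:bounds[c+1]
    data = [list(v) for v in vectors]                 # work on copies

    def exchange(chunk_for, reduce):
        # All nodes send one chunk to their right-hand neighbor in the same step.
        outgoing = []
        for node in range(n):
            c = chunk_for(node)
            outgoing.append((c, data[node][bounds[c]:bounds[c + 1]]))
        for node in range(n):
            c, payload = outgoing[(node - 1) % n]     # received from the left neighbor
            for offset, value in enumerate(payload):
                idx = bounds[c] + offset
                data[node][idx] = data[node][idx] + value if reduce else value

    # Reduce-scatter: after n-1 steps, node i holds the full sum of chunk (i+1) % n.
    for step in range(n - 1):
        exchange(lambda node, s=step: (node - s) % n, reduce=True)
    # All-gather: circulate the finished chunks for another n-1 steps.
    for step in range(n - 1):
        exchange(lambda node, s=step: (node + 1 - s) % n, reduce=False)
    return data


print(ring_all_reduce([[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]]))
```

Each node sends 2(n-1) chunks of roughly 1/n of the vector, so per-node traffic stays close to twice the vector size however many nodes join the ring, and no message ever travels more than one hop.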

The PCB design implication of ICI is that the TPU host board must route ICI high-speed differential pairs between TPU chip packages and to the board edge connectors that connect to neighboring boards in the pod. ICI signaling rates (estimated in the 100–200 Gb/s per lane range at v5p) require similar PCB material and signal integrity treatment to NVLink 4.0—low-loss laminate on ICI signal layers, controlled impedance to ± 5%, and careful via design to avoid stub resonances in the signal band.


TPU v5 Specifications

| Specification | TPU v5e (per chip) | TPU v5p (per chip) |
| --- | --- | --- |
| BF16 peak compute | ~197 TFLOPS | ~459 TFLOPS |
| INT8 peak compute | ~394 TOPS | ~918 TOPS |
| HBM memory | 16 GB HBM2e | 95 GB HBM2e |
| HBM bandwidth | ~819 GB/s | ~2,765 GB/s |
| ICI bandwidth per chip | ~1,600 Gb/s | ~4,800 Gb/s |
| ICI topology | 2D torus (up to 256 chips) | 3D torus (up to 8,960 chips) |
| Host interface | PCIe Gen4 ×8 | PCIe Gen5 ×16 (est.) |
| TDP per chip | ~197 W | ~450 W |
| Google Cloud availability | Generally available (v5e VM) | Generally available (v5p VM and pod) |

The HBM capacity numbers—particularly 16 GB for v5e—appear low compared to GPU HBM capacities (80–192 GB per GPU). This is intentional: the TPU architecture relies on its large on-chip SRAM scratchpad to feed the MXU without going to HBM for each matrix multiplication. For models that fit their active computation within the scratchpad, HBM bandwidth is less of a bottleneck than it is for GPUs. For models with very large weight tensors that must remain in HBM, the limited HBM capacity requires more aggressive model parallelism across more chips—which is why v5p's 95 GB HBM per chip better serves large frontier model training.
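The link between HBM capacity and model parallelism can be made concrete with a rough lower bound on chip count. This is a minimal sketch under stated assumptions: the 70-billion-parameter model is hypothetical, only weights are counted (optimizer state and activations need additional memory), and `min_chips_for_weights` is an illustrative name.

```python
import math


def min_chips_for_weights(params_billion, bytes_per_param, hbm_gb_per_chip):
    """Lower bound on chips needed just to hold the weights in HBM."""
    weight_gb = params_billion * bytes_per_param  # 1e9 params * bytes each = GB
    return math.ceil(weight_gb / hbm_gb_per_chip)


# Hypothetical 70B-parameter model stored in BF16 (2 bytes per parameter).
for name, hbm_gb in [("v5e", 16), ("v5p", 95)]:
    print(name, min_chips_for_weights(70, 2, hbm_gb))
```

Under these assumptions the weights alone span several times more v5e chips than v5p chips, which is the sharding pressure the paragraph above describes.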


TPU v5 vs NVIDIA H100: Key Differences

| Dimension | TPU v5p | NVIDIA H100 SXM5 |
| --- | --- | --- |
| Architecture | Systolic array (fixed-function GEMM) | SIMT (general-purpose parallel) |
| BF16 peak compute | ~459 TFLOPS | 989 TFLOPS |
| FP8 support | Limited (v5 added FP8 via compiler) | Native (Transformer Engine) |
| HBM capacity | 95 GB HBM2e | 80 GB HBM3 |
| HBM bandwidth | ~2,765 GB/s | 3,350 GB/s |
| On-chip SRAM | Very large (100+ MB, compiler-managed) | Moderate (L2 cache, hardware-managed) |
| Inter-chip interconnect | ICI 4,800 Gb/s per chip (3D torus) | NVLink 4.0, 900 GB/s (via NVSwitch) |
| Max scale (single fabric) | 8,960 chips (v5p pod, 3D torus) | 8 GPUs per HGX baseboard (up to 256 with NVLink Switch System); InfiniBand beyond |
| TDP | ~450 W | 700 W |
| Software | JAX + XLA; PyTorch/XLA | CUDA; native PyTorch |
| Availability | Google Cloud only | Cloud + on-premises |
| Programmability | Limited to XLA-compatible operations | Full CUDA programmability |

The raw BF16 TFLOPS comparison (459 vs 989) does not capture the full performance picture. TPU v5p achieves significantly higher hardware utilization rates than H100 on well-optimized workloads compiled by XLA, partly because the systolic array eliminates scheduling overhead present in GPU execution. Google's published benchmarks show TPU v5p achieving competitive or superior time-to-train results on transformer model pre-training compared to H100 configurations of equivalent cost. However, these benchmarks use XLA-optimized JAX training stacks—the native environment where TPU performance shines. PyTorch-based training pipelines typically see lower utilization improvement from XLA than JAX-native code, narrowing or eliminating the performance advantage over H100.
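The gap between peak and sustained throughput can be put into numbers. The sketch below is illustrative only: the 40% H100 utilization is an assumed input, not a measured value, the function names are hypothetical, and the comparison is per chip rather than per dollar.

```python
def sustained_tflops(peak_tflops, utilization):
    """Throughput actually delivered at a given hardware utilization (0 to 1)."""
    return peak_tflops * utilization


def breakeven_utilization(peak_a, util_a, peak_b):
    """Utilization chip B needs in order to match chip A's sustained throughput."""
    return peak_a * util_a / peak_b


H100_PEAK, V5P_PEAK = 989, 459   # BF16 peak TFLOPS from the table above
assumed_h100_util = 0.40         # assumption for illustration only

print(sustained_tflops(H100_PEAK, assumed_h100_util))
print(round(breakeven_utilization(H100_PEAK, assumed_h100_util, V5P_PEAK), 3))
```

With these inputs a single v5p chip would need to sustain roughly 86% of peak to match one H100 chip, so cost-equivalent comparisons hinge on chip count and price as well as utilization.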


PCB Layout and Board Design for TPU Systems

Unlike NVIDIA GPU boards where extensive public documentation and reference designs exist, Google's TPU host board designs are proprietary and not publicly released. However, the published architectural specifications allow inference of the key PCB design requirements.

Host board layer count: A TPU v5p host board carrying 4 TPU chips (a typical configuration based on Google's pod architecture descriptions) with 4 × PCIe Gen5 host interfaces and ICI routing between chips would likely require 16–22 layers. This is lower than an H100 HGX baseboard (20–24 layers) primarily because the absence of NVSwitch chips reduces BGA package count and reduces the routing layer demand for switch-to-GPU NVLink traces. The PCB layer count requirements for high-speed AI accelerator boards are analyzed in detail in the HDI layer count guide.

ICI signal integrity: ICI traces between TPU chip packages on the host board carry high-speed differential signals at rates estimated in the 100–200 Gb/s per lane range. These require the same signal integrity design discipline as NVLink 4.0 traces on GPU boards: low-loss laminate (Megtron 6E class or better) on ICI signal layers, differential impedance control to ± 5%, backdrilling or laser microvia approaches to eliminate through-hole via stubs, and VLP or HVLP copper foil. The material selection framework for these requirements is covered at High-Speed PCB Materials for AI Servers.
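Why via stubs matter at these rates can be estimated with the first-order quarter-wave approximation f = c / (4 · L · √Dk). This is a rough sketch: the stub lengths and the Dk of 3.3 are assumed illustrative values rather than Google or laminate-vendor figures, and a real via also has pad and anti-pad capacitance that moves the resonance.

```python
import math

C_M_PER_S = 299_792_458  # speed of light in vacuum


def stub_resonance_hz(stub_length_mm, dk):
    """First quarter-wave resonance of an unterminated via stub."""
    stub_length_m = stub_length_mm / 1000
    return C_M_PER_S / (4 * stub_length_m * math.sqrt(dk))


ASSUMED_DK = 3.3  # assumption: typical order of magnitude for a low-loss laminate
for stub_mm in (2.0, 1.0, 0.25):
    ghz = stub_resonance_hz(stub_mm, ASSUMED_DK) / 1e9
    print(f"{stub_mm} mm stub -> ~{ghz:.0f} GHz")
```

Shortening the stub by backdrilling pushes the resonance up in frequency, away from the band the signal occupies.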

ICI edge connectors: Between host boards in a TPU pod, ICI signals travel over high-speed copper cables or optical cables. The PCB edge connector or cable connector at the board edge must be rated for the ICI signaling frequency. Board-edge connectors at 100+ Gb/s per lane require gold-plated contacts, controlled impedance at the connector transition, and ground stitching via arrays adjacent to the connector footprint to maintain reference plane continuity through the connector launch.

PCIe host interface: TPU v5p uses PCIe Gen5 for host CPU connectivity. The PCB routing requirements for PCIe Gen5 on a TPU host board are identical to those on any other server board: controlled impedance to 85 Ω ± 10%, low-loss laminate on PCIe signal layers, and backdrilling to remove via stubs. The signal integrity fundamentals are the same whether the board is a GPU baseboard or a TPU host board, as detailed in the PCIe Gen5 PCB design guide.
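A percentage tolerance translates into an acceptance window for impedance coupon measurements. The short sketch below applies the 85 Ω ± 10% figure quoted above; the measured values are made up for illustration and the helper names are hypothetical.

```python
def impedance_window(nominal_ohms, tolerance_pct):
    """Return the (low, high) acceptance limits for a controlled-impedance trace."""
    delta = nominal_ohms * tolerance_pct / 100
    return nominal_ohms - delta, nominal_ohms + delta


def within_tolerance(measured_ohms, nominal_ohms, tolerance_pct):
    low, high = impedance_window(nominal_ohms, tolerance_pct)
    return low <= measured_ohms <= high


print(impedance_window(85, 10))
for measured in (78.0, 94.2):  # hypothetical coupon readings
    print(measured, within_tolerance(measured, 85, 10))
```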

Power delivery: At ~450 W TDP per v5p chip with 4 chips per host board, total chip power per board is approximately 1,800 W. This is substantially lower than an H100 HGX baseboard at ~5,600 W GPU power (8 × 700 W), which means the TPU host board's power planes and VRM designs are correspondingly less demanding—2 oz copper on primary power planes is likely adequate rather than the 3 oz required on GPU boards at higher current density. The lower TDP also means TPU host boards can be designed for air cooling or moderate liquid cooling rather than the mandatory high-flow DLC required by B200. Thermal management considerations for AI server PCBs are covered at Thermal Management on AI Server PCBs.
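The board-level arithmetic above, and the current it implies, can be checked with a few lines. The 0.8 V core rail is an assumption for illustration only (Google does not publish TPU rail voltages), and the helper names are hypothetical.

```python
def board_chip_power_w(chips, tdp_w):
    """Total accelerator power on one board, excluding VRM losses and other loads."""
    return chips * tdp_w


def rail_current_a(power_w, rail_voltage_v):
    """Current a rail must carry to deliver the given power."""
    return power_w / rail_voltage_v


tpu_board = board_chip_power_w(4, 450)   # TPU v5p host board
gpu_board = board_chip_power_w(8, 700)   # H100 HGX baseboard
print(tpu_board, gpu_board, round(tpu_board / gpu_board, 2))

ASSUMED_CORE_V = 0.8  # assumption, not a published figure
print(rail_current_a(450, ASSUMED_CORE_V), rail_current_a(700, ASSUMED_CORE_V))
```

Even at the lower TDP, hundreds of amperes per chip flow at core voltage, which is why plane copper weight and VRM placement remain first-order design decisions.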


Pod Topology and Data Center Deployment

The TPU pod is the fundamental unit of large-scale TPU deployment. A pod connects multiple TPU host boards into a single unified fabric via ICI, allowing all chips in the pod to participate in collective communication operations as if they were on the same board. This is analogous to—but architecturally distinct from—the NVLink fabric in NVIDIA's GB200 NVL72 rack, which is described in detail at NVIDIA GB200 NVL72: PCB & System Architecture Explained.

TPU v5 pod configurations:

| Configuration | Chip Count | ICI Topology | Total BF16 Compute | Total HBM |
| --- | --- | --- | --- | --- |
| TPU v5e single host | 4 | 2D torus (2×2) | ~788 TFLOPS | 64 GB |
| TPU v5e pod slice | 256 | 2D torus (16×16) | ~50 PFLOPS | 4,096 GB |
| TPU v5p single host | 4 | 3D torus (2×2×1) | ~1,836 TFLOPS | 380 GB |
| TPU v5p pod slice (medium) | 1,024 | 3D torus | ~470 PFLOPS | ~97 TB |
| TPU v5p full pod | 8,960 | 3D torus | ~4.1 ExaFLOPS | ~851 TB |
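The totals in this table are the per-chip specifications multiplied by chip count. A short sketch reproduces them; the dictionary layout and function name are illustrative.

```python
PER_CHIP = {
    "v5e": {"bf16_tflops": 197, "hbm_gb": 16},
    "v5p": {"bf16_tflops": 459, "hbm_gb": 95},
}


def pod_totals(variant, chips):
    """Aggregate peak BF16 compute (PFLOPS) and HBM (TB, decimal) for a slice."""
    spec = PER_CHIP[variant]
    return {
        "bf16_pflops": spec["bf16_tflops"] * chips / 1000,
        "hbm_tb": spec["hbm_gb"] * chips / 1000,
    }


for variant, chips in [("v5e", 4), ("v5e", 256), ("v5p", 4), ("v5p", 1024), ("v5p", 8960)]:
    totals = pod_totals(variant, chips)
    print(f"{variant} x{chips}: {totals['bf16_pflops']:.1f} PFLOPS, {totals['hbm_tb']:.2f} TB HBM")
```

These are peak figures; sustained throughput depends on utilization and on how well the workload's collectives map to the torus.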

The full TPU v5p pod at 8,960 chips and ~4.1 ExaFLOPS represents one of the largest single-fabric AI compute resources available on any cloud platform. Its 3D torus topology provides high aggregate bandwidth for collective operations that are distributed across the torus, and Google's XLA compiler is specifically optimized to generate communication patterns that exploit the torus structure for minimum average hop count in LLM tensor parallelism and data parallelism collectives.

Data center deployment of TPU pods requires dedicated infrastructure: custom rack designs to accommodate the high-density TPU host boards and their inter-board ICI cable connections; power infrastructure sized for the pod's aggregate TDP (at 450 W per chip × 8,960 chips = approximately 4,032 kW for a full v5p pod, distributed across approximately 2,240 host boards at ~1.8 kW of chip power each, assuming 4 chips per board); and data center networking for CPU-to-TPU PCIe traffic and management. Google's TPU data centers are purpose-built for this topology rather than adapted from general-purpose server infrastructure.
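Because the host-board count follows directly from the chips-per-board figure, it helps to keep that figure as an explicit parameter when sizing power. The sketch below uses the 4-chips-per-board configuration described in the PCB section; it counts chip TDP only and ignores host CPUs, VRM losses, and cooling overhead. The function name is hypothetical.

```python
def pod_power(chips, tdp_w, chips_per_board):
    """Chip-only power budget for a pod and its split across host boards."""
    boards = -(-chips // chips_per_board)  # ceiling division
    return {
        "total_kw": chips * tdp_w / 1000,
        "boards": boards,
        "kw_per_board": chips_per_board * tdp_w / 1000,
    }


print(pod_power(chips=8960, tdp_w=450, chips_per_board=4))
```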


XLA Compiler and Software Ecosystem

The XLA (Accelerated Linear Algebra) compiler is the software foundation of the TPU programming model. XLA is an optimizing compiler for linear algebra computations that takes a computational graph (from JAX, PyTorch, or TensorFlow) and generates optimized machine code for the target hardware. For TPUs, XLA performs the following critical optimizations:

  • Operator fusion: Combining multiple operations (e.g., matrix multiply followed by bias add and activation) into a single hardware kernel, eliminating intermediate DRAM writes between operations
  • Memory layout optimization: Reshaping tensor dimensions to match the MXU's preferred matrix dimensions (powers of 2, typically 128 or 256) and placing data in the on-chip scratchpad rather than HBM wherever possible
  • Collective communication optimization: Generating all-reduce, all-gather, and reduce-scatter patterns that minimize hop count on the ICI torus for the specific pod size and topology
  • Padding and tiling: Adding zero-padding to tensors whose dimensions do not align with MXU block sizes; poor alignment can significantly reduce MXU utilization
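The cost of poor alignment in the last bullet can be estimated by comparing useful multiply-accumulates with the work done after padding. This is a simplified sketch that assumes a 128-wide tile on every dimension; XLA's real tiling decisions are more involved, and the function names are hypothetical.

```python
def padded(dim, tile):
    """Round a dimension up to the next multiple of the tile size."""
    return -(-dim // tile) * tile


def mxu_fill_fraction(m, k, n, tile=128):
    """Fraction of padded matmul work that is useful for an (m x k) @ (k x n) product."""
    useful = m * k * n
    total = padded(m, tile) * padded(k, tile) * padded(n, tile)
    return useful / total


print(mxu_fill_fraction(1024, 1024, 1024))  # already aligned
print(mxu_fill_fraction(1000, 1000, 1000))  # slightly misaligned
print(mxu_fill_fraction(130, 128, 128))     # just past a tile boundary
```

A dimension that barely exceeds a tile boundary is the worst case: 130 rows pad to 256, so nearly half of the array's work on that matmul is spent on zeros.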

JAX is Google's preferred framework for TPU programming and achieves the highest hardware utilization. PyTorch with the PyTorch/XLA backend provides broad model compatibility but typically achieves lower utilization than JAX-native code because PyTorch's eager, dynamic execution model fits poorly with XLA's compile-the-whole-graph approach: shape changes and data-dependent control flow trigger recompilation. For CUDA-native training pipelines, migration to TPU requires a shift to either JAX or PyTorch/XLA, with the associated code changes and added compilation time.


When to Use TPU v5

TPU v5 is most compelling in the following scenarios:

  • Google Cloud-native training: Organizations whose AI infrastructure runs on Google Cloud and whose training workloads use JAX or PyTorch/XLA with standard transformer architectures (BERT, GPT, T5, LLaMA variants) will see the lowest friction adoption path and the best price/performance from TPU v5
  • Very large-scale training requiring a single fabric: The TPU v5p pod at 8,960 chips in a 3D torus provides a single coherent ICI fabric at a scale that no NVIDIA NVLink configuration matches; for training runs requiring thousands of chips in tight collective communication, the torus topology can outperform InfiniBand-connected GPU clusters on collective-heavy workloads
  • Power-efficiency-sensitive deployment: At ~450 W per v5p chip versus 700–1,000 W for H100/B200, TPU v5p draws less power per chip, which eases per-board power delivery and cooling. On peak BF16 figures alone, however, v5p delivers roughly 1.0 TFLOPS/W against roughly 1.4 TFLOPS/W for H100, so a compute-per-watt advantage materializes only where the systolic array sustains markedly higher utilization than the GPU on the same workload
  • Cost arbitrage through Google Cloud committed use discounts: Google Cloud committed use and sustained use discounts on TPU instances can provide significant cost advantages over on-demand GPU instance pricing for organizations with predictable, sustained training workloads
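Any per-watt claim can be checked quickly against the per-chip figures quoted in this article. The sketch below uses peak BF16 throughput and TDP only; sustained utilization, which the peak figures do not capture, can shift the result in either direction. The function names are hypothetical.

```python
def peak_tflops_per_watt(peak_tflops, tdp_w):
    return peak_tflops / tdp_w


def relative_power_for_equal_peak(a_peak, a_tdp, b_peak, b_tdp):
    """Power of configuration A divided by power of B at equal peak throughput."""
    return (a_tdp / a_peak) / (b_tdp / b_peak)


V5P = (459, 450)    # (BF16 peak TFLOPS, TDP in W) from the tables above
H100 = (989, 700)

print(round(peak_tflops_per_watt(*V5P), 2), round(peak_tflops_per_watt(*H100), 2))
print(round(relative_power_for_equal_peak(*V5P, *H100), 2))
```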

TPU v5 is less suited for organizations with existing CUDA-based training pipelines, requirements for on-premises deployment, or training of model architectures that do not map efficiently to the XLA compilation model. For a broader comparison of TPU against GPU, custom ASIC, and FPGA for the full range of AI workloads, see GPU vs TPU vs ASIC vs FPGA: Which AI Chip Architecture Will Dominate in 2027?


FAQ

Can I use TPU v5 for AI inference as well as training?
Yes. TPU v5e in particular is positioned for inference as well as training: Google Cloud offers TPU v5e instances for inference workloads, and the TPU's high BF16 throughput and fast ICI collective bandwidth make it viable for large-model inference requiring tensor parallelism across multiple chips. Many external customers reach Google's inference hardware through managed services such as Vertex AI rather than by provisioning TPU v5 directly. For most inference use cases, the cost-per-token of TPU inference on Google Cloud is competitive with H100 inference but requires XLA-compiled models.

How does TPU v5 ICI compare to NVIDIA NVLink in bandwidth?
TPU v5p ICI provides approximately 4,800 Gb/s (~600 GB/s) per chip bidirectional, versus NVLink 4.0's 900 GB/s per GPU through NVSwitch. ICI's per-chip figure is therefore lower than NVLink's, and it is spread across the chip's direct torus links (6 ICI neighbors in a 3D torus) rather than concentrated through NVSwitch (18 NVLink links per GPU). The more meaningful difference is in topology: NVLink's NVSwitch provides a non-blocking crossbar where any two GPUs in the domain communicate at full bandwidth in a single hop; ICI's torus requires multiple hops for communication between far-apart chips, so effective bandwidth between distant chips is lower than the per-chip aggregate figure suggests.
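The two interconnects are usually quoted in different units (Gb/s for ICI, GB/s for NVLink), so a conversion helps when comparing them. The per-link figure in this sketch assumes the per-chip bandwidth is split evenly across six torus links, which is an assumption rather than a published breakdown.

```python
def gbit_to_gbyte(gbit_per_s):
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbit_per_s / 8


ici_v5p_gbyte = gbit_to_gbyte(4800)
ici_v5e_gbyte = gbit_to_gbyte(1600)
nvlink4_gbyte = 900

print(ici_v5p_gbyte, ici_v5e_gbyte, nvlink4_gbyte)
print(round(ici_v5p_gbyte / nvlink4_gbyte, 2))  # v5p ICI as a fraction of NVLink 4.0
print(ici_v5p_gbyte / 6)                        # assumed even split over 6 links
```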

What is the difference between TPU v5e and TPU v5p?
TPU v5e is the efficiency-optimized variant with lower per-chip compute (~197 BF16 TFLOPS), less HBM (16 GB), lower ICI bandwidth, and lower TDP (~197 W). It is priced lower per chip-hour and is suited for fine-tuning, inference, and cost-sensitive training. TPU v5p is the performance variant with higher compute (~459 BF16 TFLOPS), more HBM (95 GB), higher ICI bandwidth (~4,800 Gb/s), and higher TDP (~450 W). v5p scales to 8,960-chip pods in a 3D torus topology and is designed for frontier model pre-training.

Does TPU v5 support PyTorch?
Yes, through the PyTorch/XLA backend. Standard PyTorch models typically need only small source changes, such as placing the model and tensors on the XLA device; the PyTorch/XLA layer then traces the computational graph and compiles it with XLA for the TPU hardware. Custom CUDA kernels, CUDA graph optimizations, and CUDA-specific libraries (cuDNN operations without XLA equivalents) require porting or replacement. JAX is the preferred and best-optimized framework for TPU v5; for organizations that can adopt JAX, it generally provides higher TPU utilization than PyTorch/XLA.

Is a full TPU v5p pod available to external Google Cloud customers?
TPU v5p is available to Google Cloud customers in pod slice configurations (subsets of the full 8,960-chip pod). The full 8,960-chip pod is primarily used internally by Google for Gemini and other flagship model training. External customers can request large TPU v5p allocations through Google Cloud capacity agreements; the practical maximum for external customers in 2025–2026 is typically in the hundreds to low thousands of chips per allocation, with full pod scale available primarily to Google's largest strategic cloud customers.


Need to Manufacture PCBs for AI Accelerator Systems?

Whether you are working on TPU-class server host boards, NVIDIA GPU baseboards, or custom AI ASIC carrier boards, the PCB requirements are demanding: high-layer-count fabrication, low-loss laminates, precise impedance control, HDI via technology, and IPC Class 3 quality standards. NextPCB supports the complete AI server PCB manufacturing stack, from bare board fabrication through SMT assembly and functional test.

About the Author

Arya Li, Project Manager at NextPCB.com

With extensive experience in manufacturing and international client management, Arya has guided factory visits for over 200 overseas clients, providing bilingual (English & Chinese) presentations on production processes, quality control systems, and advanced manufacturing capabilities. Her deep understanding of both the factory side and client requirements allows her to deliver professional, reliable PCB solutions efficiently. Detail-oriented and service-driven, Arya is committed to being a trusted partner for clients and showcasing the strength and expertise of the factory in the global PCB and PCBA market.