2024 Deep Dive: Nvidia Hopper Microarchitecture for Datacenter GPUs
Executive Summary
Nvidia's Hopper microarchitecture is a significant advancement in GPU technology, designed specifically for datacenter applications. Named after computer scientist and United States Navy rear admiral Grace Hopper, the architecture improves upon its predecessors, the Turing and Ampere microarchitectures. Hopper features a new streaming multiprocessor, a faster memory subsystem, and a Transformer Engine for accelerating transformer-based models, making it an attractive option for AI, HPC, and data analytics workloads.
The Hopper microarchitecture was officially revealed in March 2022, after being leaked in November 2019. It is used alongside the Lovelace microarchitecture and delivers higher performance in the SXM5 form factor than on a standard PCIe card. With improved single-precision floating-point (FP32) throughput and new DPX instructions that accelerate dynamic-programming algorithms such as Smith–Waterman, the Hopper architecture is positioned as a strong contender in the datacenter GPU market.
Architecture & Design
The Nvidia Hopper H100 GPU is implemented on the TSMC N4 process with 80 billion transistors. It contains up to 144 streaming multiprocessors (SMs), which build on the SM designs of the Turing and Ampere microarchitectures. The maximum number of concurrent warps per SM is unchanged from Ampere at 64. Hopper adds a Tensor Memory Accelerator (TMA), which supports bidirectional asynchronous memory transfers between shared memory and global memory.
With TMA, applications may transfer tensors of up to five dimensions. When writing from shared memory to global memory, elementwise reduction and bitwise operators may be applied, avoiding registers and SM instructions while enabling warp-specialized code. TMA is exposed through cuda::memcpy_async. The Hopper architecture also improves single-precision floating-point (FP32) throughput, executing twice as many FP32 operations per cycle per SM as its predecessor.
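As a rough sketch of how this asynchronous staging looks in code, the kernel below uses cuda::memcpy_async with a block-scoped cuda::barrier to copy a tile of global memory into shared memory before computing on it. The kernel name, tile layout, and scale factor are illustrative assumptions, not from the source; on Hopper, eligible copies issued this way can be serviced by the TMA hardware.

```cuda
#include <cooperative_groups.h>
#include <cuda/barrier>

namespace cg = cooperative_groups;

// Illustrative kernel: stage one tile of `in` into shared memory with
// cuda::memcpy_async, wait on a block-scoped barrier, then use the tile.
__global__ void stage_and_scale(const float* in, float* out, float k) {
    extern __shared__ float tile[];               // dynamic shared memory
    auto block = cg::this_thread_block();

    __shared__ cuda::barrier<cuda::thread_scope_block> bar;
    if (block.thread_rank() == 0)
        init(&bar, block.size());                 // one arrival per thread
    block.sync();

    size_t base = (size_t)blockIdx.x * blockDim.x;
    // Collective asynchronous copy; may be offloaded to copy hardware.
    cuda::memcpy_async(block, tile, in + base,
                       sizeof(float) * blockDim.x, bar);
    bar.arrive_and_wait();                        // tile is now visible

    out[base + threadIdx.x] = k * tile[threadIdx.x];
}

// Launch sketch (shared-memory size must cover one tile):
// stage_and_scale<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, 2.0f);
```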
The Hopper architecture adds new DPX instructions, which accelerate dynamic-programming algorithms such as Smith–Waterman; a scalar sketch of that recurrence follows the table below. Like Ampere, Hopper supports TensorFloat-32 (TF32) arithmetic, with an identical mapping pattern across the two architectures. The Nvidia Hopper H100 supports HBM3 and HBM2e memory up to 80 GB; the HBM3 memory system delivers 3 TB/s of bandwidth, an increase of 50% over the Nvidia Ampere A100's 2 TB/s.
| Component | Specification |
|---|---|
| Process Node | TSMC N4 |
| Transistors | 80 billion |
| Streaming Multiprocessors | Up to 144 |
| Memory | HBM3 and HBM2e up to 80 GB |
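For context on the dynamic-programming pattern that the DPX instructions target, the following sketch scores the Smith–Waterman matrix H(i, j) = max(0, H(i-1, j-1) + s(a_i, b_j), H(i-1, j) - g, H(i, j-1) - g) one anti-diagonal per kernel launch, since all cells on an anti-diagonal are mutually independent. It uses ordinary integer max operations rather than DPX intrinsics, and the scoring constants and sequences are hypothetical illustration values.

```cuda
#include <cstdio>
#include <cstring>

// Hypothetical scoring constants for illustration only.
#define MATCH     3
#define MISMATCH -3
#define GAP       2

__device__ int imax(int a, int b) { return a > b ? a : b; }

// Scores every cell (i, j) with i + j == diag. Cells on one anti-diagonal
// depend only on earlier diagonals, so each can go to its own thread.
__global__ void sw_diagonal(const char* a, int m, const char* b, int n,
                            int* H, int diag) {
    int i = blockIdx.x * blockDim.x + threadIdx.x + 1;  // 1-based index into a
    int j = diag - i;                                   // 1-based index into b
    if (i < 1 || i > m || j < 1 || j > n) return;
    int w = n + 1;                                      // row stride of H
    int s = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
    H[i * w + j] = imax(imax(0, H[(i - 1) * w + (j - 1)] + s),
                        imax(H[(i - 1) * w + j] - GAP,
                             H[i * w + (j - 1)] - GAP));
}

int main() {
    const char *ha = "GGTTGACTA", *hb = "TGTTACGG";     // toy sequences
    int m = (int)strlen(ha), n = (int)strlen(hb);
    char *a, *b; int *H;
    cudaMalloc(&a, m);
    cudaMalloc(&b, n);
    cudaMalloc(&H, (m + 1) * (n + 1) * sizeof(int));
    cudaMemcpy(a, ha, m, cudaMemcpyHostToDevice);
    cudaMemcpy(b, hb, n, cudaMemcpyHostToDevice);
    cudaMemset(H, 0, (m + 1) * (n + 1) * sizeof(int));  // border cells are 0
    for (int diag = 2; diag <= m + n; ++diag)           // wavefront sweep
        sw_diagonal<<<(m + 255) / 256, 256>>>(a, m, b, n, H, diag);
    int cells = (m + 1) * (n + 1), best = 0;
    int* hH = new int[cells];
    cudaMemcpy(hH, H, cells * sizeof(int), cudaMemcpyDeviceToHost);
    for (int k = 0; k < cells; ++k)                     // best local score
        if (hH[k] > best) best = hH[k];
    printf("best local alignment score: %d\n", best);
    return 0;
}
```

A production kernel would fuse many cells per thread and, on Hopper, use the DPX instructions for the max operations where available; the wavefront structure, however, stays the same.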
Performance & Thermal
Nvidia positions the H100 GPU as delivering high performance, scalability, and security across datacenter workloads. With Nvidia's NVLink Switch System, up to 256 H100 GPUs can be connected to accelerate exascale workloads, while the dedicated Transformer Engine targets trillion-parameter language models. Nvidia claims up to 9x faster training and up to 30x faster inference on large language models relative to the A100.
The H100 features fourth-generation Tensor Cores and a Transformer Engine with FP8 precision. The SXM5 module is rated at a TDP of up to 700 W, versus 350 W for the PCIe card, and that larger power and cooling budget is a key reason the SXM5 configuration outperforms the PCIe version. The SXM5 thermal solution is built for this higher power draw, though Nvidia has not published its full details.
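To give a feel for what FP8 precision means numerically, the host-side sketch below round-trips a few floats through the __nv_fp8_e4m3 storage type from CUDA's <cuda_fp8.h> header. This illustrates the number format only and is not a description of the Transformer Engine's internals; the engine's role is to choose scale factors so that tensors survive FP8's narrow range.

```cuda
#include <cstdio>
#include <cuda_fp8.h>   // __nv_fp8_e4m3 storage and conversion type

// Quantize floats to e4m3 FP8 and back. e4m3 keeps only a 3-bit mantissa
// and saturates near its maximum finite value of 448, so both rounding
// error and clipping show up in the output.
int main() {
    const float samples[] = {0.1234f, 1.5f, 3.14159f, 240.0f, 500.0f};
    for (float x : samples) {
        __nv_fp8_e4m3 q(x);        // float -> FP8 (round and saturate)
        float back = float(q);     // FP8 -> float
        printf("%10.5f -> %10.5f\n", x, back);
    }
    return 0;
}
```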
Benchmarks have shown significant performance improvements over the H100's predecessor, the A100. In one early benchmark, the H100 delivered roughly a 3x improvement over the A100 across Tensor Core, FP32, and FP64 operations. These gains extend across AI and HPC workloads, making the H100 an attractive option for datacenter applications.
Market Positioning
The Nvidia Hopper H100 GPU is positioned as a high-end datacenter GPU offering high performance, scalability, and security for AI, HPC, and data analytics workloads. It competes with other high-end datacenter accelerators such as the AMD Instinct MI200 series. Its target buyers are large datacenter operators, such as cloud service providers, that need high-performance GPUs to accelerate their workloads.
In the AI market, the H100's support for trillion-parameter language models, together with its claimed 9x training and 30x inference speedups on large language models, gives it a competitive advantage. Its HPC gains likewise make it attractive to researchers and scientists who need high-performance computing resources.
Specifications
| Specification | Detail |
|---|---|
| Process Node | TSMC N4 |
| Transistors | 80 billion |
| Streaming Multiprocessors | Up to 144 |
| Memory | HBM3 and HBM2e up to 80 GB |
| Tensor Cores | Fourth-generation |
| Transformer Engine | With FP8 precision |
Frequently Asked Questions
What is the Nvidia Hopper microarchitecture?
The Nvidia Hopper microarchitecture is a GPU microarchitecture developed by Nvidia, designed specifically for datacenter applications. It features a new streaming multiprocessor, a faster memory subsystem, and a Transformer Engine for accelerating transformer-based models.
What are the key features of the Nvidia Hopper H100 GPU?
The Nvidia Hopper H100 GPU features up to 144 streaming multiprocessors, HBM3 and HBM2e memory up to 80 GB, fourth-generation Tensor Cores, and a Transformer Engine with FP8 precision.
What are the performance improvements of the Nvidia Hopper H100 GPU?
Nvidia claims the H100 delivers up to 9x faster training and up to 30x faster inference on large language models relative to the A100, along with improved performance across AI and HPC workloads.