What the H100 is
The NVIDIA H100 is a Hopper-generation data-center GPU built for large-scale AI training, inference, HPC, and data analytics. The reason it matters in this course is not simply that it is faster than older GPUs. Hopper changes how high-performance kernels are structured by making asynchronous movement, specialized data paths, and overlap-first execution part of normal programming practice.
That makes the right mental model less like “one more CUDA device” and more like “a machine with multiple memory tiers, explicit staging surfaces, and execution hardware that rewards keeping the next tile in flight while the current one is still doing work.”
Working mental model: H100 performance comes from matching the right math path (Tensor Cores vs. regular cores), the right data path (TMA, cache, shared memory, registers), and the right execution strategy (overlap memory movement with compute).
Transformer-heavy workloads
Hopper adds a dedicated Transformer Engine and stronger low-precision math paths, especially FP8, because modern training and inference workloads are dominated by dense tensor operations.
Overlap over blocking
The architecture rewards fetching the next tile while the current tile is still being processed, which changes how kernels are designed, scheduled, and synchronized.
Specialization everywhere
TMA, Tensor Cores, LSUs, SFUs, and schedulers each own distinct parts of the throughput story. The best kernels route work through those paths intentionally.
Scale beyond one GPU
NVLink and NVSwitch matter because frontier workloads are distributed. The course treats that systems context as part of the architecture, not an optional appendix.
Headline specs
The exact numbers vary by SKU and form factor, but the lesson material uses these figures to anchor the engineering discussion.
| Category | Details | Why it matters |
|---|---|---|
| Architecture | Hopper | Introduces stronger asynchronous movement, Transformer Engine support, and new Tensor Core paths. |
| Memory | 80 GB HBM3, around 3.35 TB/s in the lesson deck | High bandwidth is essential for feeding matrix-heavy AI workloads without starving compute. |
| SM count | 132 streaming multiprocessors in the lesson material | SMs are the execution surfaces that run thread blocks and expose the memory and math hierarchy. |
| Tensor Cores | 528 total, 4 per SM in the lesson framing | These are the fast path for dense matrix multiply-accumulate operations. |
| L2 cache | 50 MB | L2 mediates traffic between SMs and HBM, smoothing data flow and reducing wasted bandwidth. |
| Interconnect | 4th-gen NVLink support with NVSwitch platform context | Critical for scaling training systems beyond one accelerator. |
Asynchronous execution is the first real shift
The course emphasizes the biggest conceptual change immediately: Hopper is designed to keep compute busy by decoupling request from completion. Each SM includes a Tensor Memory Accelerator, and that matters because it offloads tensor-shaped movement from the main execution resources.
Synchronous vs. asynchronous copies
A synchronous copy blocks the issuing thread or warp until the transfer finishes. That is easy to reason about, but it wastes issue bandwidth whenever memory latency dominates. An asynchronous copy starts the transfer and lets other useful work continue while the data moves in the background.
This is what makes pipelined kernels possible: tile i can be in compute while tile i+1 is already on the way, which is exactly the pattern later lessons build around with TMA, barriers, and WGMMA.
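The double-buffered pattern can be sketched with the CUDA pipeline intrinsics from `<cuda_pipeline.h>` (the Ampere-era mechanism that Hopper's TMA generalizes). This is a minimal illustration, not a tuned kernel; the name `pipelined_sum` and the `TILE` size are made up, and it assumes `blockDim.x == TILE` and that `in` holds `ntiles * TILE` floats.

```cuda
#include <cuda_pipeline.h>

#define TILE 256  // illustrative tile width, one element per thread

// While tile t is being consumed, tile t+1 is already in flight via an
// asynchronous global->shared copy.
__global__ void pipelined_sum(const float* __restrict__ in,
                              float* __restrict__ out, int ntiles) {
    __shared__ float buf[2][TILE];
    float acc = 0.0f;

    // Prefetch tile 0 asynchronously before the loop starts.
    __pipeline_memcpy_async(&buf[0][threadIdx.x], &in[threadIdx.x],
                            sizeof(float));
    __pipeline_commit();

    for (int t = 0; t < ntiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        // Issue the copy for tile t+1 before waiting on tile t.
        if (t + 1 < ntiles) {
            __pipeline_memcpy_async(&buf[nxt][threadIdx.x],
                                    &in[(t + 1) * TILE + threadIdx.x],
                                    sizeof(float));
        }
        __pipeline_commit();
        __pipeline_wait_prior(1);  // tile t's copy has now completed
        __syncthreads();           // make every thread's copy visible
        acc += buf[cur][threadIdx.x];  // "compute" on tile t
        __syncthreads();           // done reading before buffer is reused
    }
    __pipeline_wait_prior(0);
    out[threadIdx.x] = acc;
}
```

The key detail is the ordering: commit the next copy first, then wait only on the older one, so exactly one transfer is always overlapped with compute.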
Why TMA matters
- It reduces the per-thread overhead of global-memory-to-shared-memory movement.
- It handles shape, stride, and bounds behavior in hardware instead of forcing scalarized movement logic.
- It frees more SM execution bandwidth for actual math.
- It makes producer-consumer pipelines cleaner and more scalable for GEMM-style kernels.
Trade-off: asynchronous systems increase bookkeeping complexity. You need to reason about staging, dependency points, and when data becomes safe to consume.
The memory hierarchy is the real pipeline
A practical way to read H100 memory is as a staged path: HBM3 -> L2 cache -> L1/shared memory -> registers. Performance depends on moving data down this hierarchy efficiently and minimizing how often kernels have to pay off-chip costs.
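The staged path shows up concretely in any tiled kernel: one coalesced pass through HBM fills a shared-memory tile, and each element is then reused several times on-chip instead of re-read from device memory. A small sketch, with an illustrative 3-point average (`stencil3`, `TILE`, and `RADIUS` are made-up names and sizes, and it assumes `blockDim.x == TILE`):

```cuda
#define TILE 128
#define RADIUS 1

// Global (HBM) -> shared -> registers staging for a 1D 3-point average.
__global__ void stencil3(const float* __restrict__ in,
                         float* __restrict__ out, int n) {
    __shared__ float tile[TILE + 2 * RADIUS];  // on-chip staging surface
    int g = blockIdx.x * TILE + threadIdx.x;   // global index
    int s = threadIdx.x + RADIUS;              // shared-memory index

    // One coalesced read per element fills the tile body...
    tile[s] = (g < n) ? in[g] : 0.0f;
    // ...and a few threads fill the halo on each side.
    if (threadIdx.x < RADIUS) {
        tile[s - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
        int ge = g + TILE;
        tile[s + TILE] = (ge < n) ? in[ge] : 0.0f;
    }
    __syncthreads();

    // Each element is now read three times from shared memory and
    // combined in registers, instead of three times from HBM.
    if (g < n) {
        float acc = tile[s - 1] + tile[s] + tile[s + 1];
        out[g] = acc / 3.0f;
    }
}
```

Without the shared tile, every output would cost three separate global loads; with it, each input element crosses the HBM boundary once.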
HBM3
Large off-chip device memory with huge bandwidth, but still much higher latency and energy cost than on-chip storage. Coalescing and access regularity still matter.
L2 cache
A 50 MB shared traffic manager between SMs and HBM. It can absorb messy accesses and reorganize them into more efficient memory transactions.
Shared memory + L1
The on-chip staging surface where tiles live during compute. It is banked, capacity-limited, and one of the most valuable resources in a high-performance kernel.
Registers
The fastest storage per thread and a common bottleneck. Too many registers reduce occupancy and can spill values into slower memory.
What each level changes
- HBM bandwidth matters, but coalescing still decides how much of it you actually get.
- L2 can hide some irregularity, but it is not a substitute for deliberate layout choices.
- Shared memory lowers latency and enables reuse, but bank conflicts and capacity trade-offs remain central.
- Register pressure can silently reduce active warp count enough to hurt performance.
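Register pressure is one of the few levers with direct source-level control. `__launch_bounds__` tells the compiler the maximum block size and a desired blocks-per-SM target, which caps the register budget it may spend per thread. The numbers below are illustrative, not a recommendation; compiling with `--ptxas-options=-v` reports the resulting per-thread register count.

```cuda
// Cap the register budget: at most 256 threads per block, and ask the
// compiler to keep register use low enough for 4 resident blocks per SM.
// Fewer registers per thread -> more resident warps, but possible spills.
__global__ void __launch_bounds__(256, 4)
saxpy(int n, float a, const float* __restrict__ x, float* __restrict__ y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```

The trade-off is explicit: if the capped budget is too small for the kernel's live values, the compiler spills to local memory, which can cost more than the occupancy gained.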
SMs, Tensor Cores, warps, and dispatch
The SM is the core execution unit of the GPU. It runs thread blocks, holds local storage, and contains the math and memory units that actually execute instructions.
Tensor Cores and WGMMA
Hopper includes fourth-generation Tensor Cores, and one of the most important programming-model changes is that matrix math is no longer always a single-warp story. With WGMMA, four warps coordinate on one matrix multiply-accumulate operation, turning tensor-core issue into a 128-thread coordination problem.
- Matrix A can come from shared memory or registers, depending on the form.
- Matrix B is commonly treated as shared-memory sourced in the course framing.
- Matrix C accumulators are distributed across four warps in registers.
- FP16 and FP8 values are packed into 32-bit registers for tensor-core issue.
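The packing in the last bullet can be made visible with the `__half2` type from `<cuda_fp16.h>`: two FP16 values share one 32-bit register, and paired intrinsics such as `__hmul2` operate on both lanes at once. This sketch (the kernel name `scale_half2` is illustrative) shows ordinary CUDA-core half2 math, not tensor-core issue, but the register layout is the same idea.

```cuda
#include <cuda_fp16.h>

// Two FP16 values occupy one 32-bit register; __half2 makes the packing
// explicit, and one instruction produces two FP16 results.
__global__ void scale_half2(const __half2* __restrict__ in,
                            __half2* __restrict__ out, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        __half2 two = __float2half2_rn(2.0f);  // broadcast 2.0 to both lanes
        out[i] = __hmul2(in[i], two);          // one op, two FP16 products
    }
}
```

This is why layout choices matter for tensor-core paths: the hardware consumes registers in these packed pairs, so data has to arrive in the right lane arrangement.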
SM anatomy
Each SM contains FP32, INT32, FP64, Tensor Core, LSU, SFU, register, cache, scheduler, and dispatch resources. It is also partitioned into four SMSPs, each with its own local issue path.
Warps and schedulers
A warp is the 32-thread unit the scheduler reasons about each cycle. The scheduler selects ready warps using scoreboarding and dependency tracking, while the dispatch unit launches work into the execution pipelines. This is why “ready” is not just about having instructions to run. Data and pipeline availability have to line up too.
Practical implication: block sizes that map cleanly into multiple warps, especially multiples of four warps such as 128 threads, often balance more naturally across the four SMSPs.
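A tiny host-side helper makes the sizing rule checkable at compile time. The constants and the function name are illustrative; the point is that a block size divisible by 128 threads decomposes into whole 4-warp groups, which matches both the WGMMA issue unit and the four SMSPs.

```cuda
// A block whose thread count is a multiple of 128 maps to whole
// warp groups (4 warps of 32 threads) and divides evenly across
// the four SM sub-partitions.
constexpr int WARP_SIZE = 32;
constexpr int WARPS_PER_GROUP = 4;

constexpr bool maps_to_whole_warp_groups(int block_threads) {
    return block_threads > 0 &&
           block_threads % (WARP_SIZE * WARPS_PER_GROUP) == 0;
}

// Usable in a static_assert next to a kernel launch configuration:
static_assert(maps_to_whole_warp_groups(128), "one warp group");
static_assert(maps_to_whole_warp_groups(256), "two warp groups");
```

A 96-thread block, by contrast, is three warps: it still runs, but it cannot form a complete warp group and leaves one sub-partition's issue path uneven.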
Optimization takeaways
- Design around overlap. Serialized load-then-compute thinking leaves throughput on the table.
- Use Tensor Cores intentionally. The fastest path depends on datatype, layout, and staging choices.
- Watch register pressure. A small increase in per-thread register use can reduce occupancy enough to slow the whole kernel.
- Treat shared memory as a budget. It is powerful, but over-allocating it changes occupancy and cache behavior.
- Size blocks to the machine. Thread-block geometry should reflect warp organization and SM sub-partitioning.
- Expect specialization. Hopper rewards routing work through the right hardware path instead of a generic fallback.
Glossary
| Term | Definition |
|---|---|
| TMA | Tensor Memory Accelerator, hardware support for asynchronous tensor and tile movement. |
| SM | Streaming Multiprocessor, the main execution block that runs thread blocks. |
| SMSP | An SM sub-partition with its own scheduler and dispatch resources. |
| WGMMA | Warp Group Matrix Multiply Accumulate, tensor-core issue involving four warps together. |
| HBM3 | High-bandwidth off-chip device memory used as the GPU's main large-capacity memory tier. |
| NVLink | High-bandwidth GPU interconnect used in multi-GPU systems. |
Continue the course
Lesson 1 gives the machine shape. Lesson 2 moves into the mechanisms Hopper code actually uses: clusters, DSMEM, inline PTX, and the pointer conversions that make low-level memory operations correct.
Clusters and PTX
Thread block clusters, distributed shared memory, inline PTX, state spaces, and address conversion.
Course Material: All Slide Decks
Browse the complete lecture sequence when you want the whole staircase, not just one entry page.
Code Reality: Code Anchors
Jump back into the homepage section that maps these ideas onto CUTLASS and custom Hopper kernels.