What the H100 is
The NVIDIA H100 is a Hopper-generation data-center GPU built for large-scale AI training, inference, HPC, and data analytics. The reason it matters in this course is not simply that it is faster than older GPUs. Hopper changes how high-performance kernels are structured by making asynchronous movement, specialized data paths, and overlap-first execution part of normal programming practice.
That makes the right mental model less like “one more CUDA device” and more like “a machine with multiple memory tiers, explicit staging surfaces, and execution hardware that rewards keeping the next tile in flight while the current one is still doing work.”
Working mental model: H100 performance comes from matching the right math path (Tensor Cores vs. regular cores), the right data path (TMA, cache, shared memory, registers), and the right execution strategy (overlap memory movement with compute).
Transformer-heavy workloads
Hopper adds a dedicated Transformer Engine and stronger low-precision math paths, especially FP8, because modern training and inference workloads are dominated by dense tensor operations.
Overlap over blocking
The architecture rewards fetching the next tile while the current tile is still being processed, which changes how kernels are designed, scheduled, and synchronized.
Specialization everywhere
TMA, Tensor Cores, LSUs, SFUs, and schedulers each own distinct parts of the throughput story. The best kernels route work through those paths intentionally.
Scale beyond one GPU
NVLink and NVSwitch matter because frontier workloads are distributed. The course treats that systems context as part of the architecture, not an optional appendix.
Headline specs
The exact numbers vary by SKU and form factor, but the lesson material uses these figures to anchor the engineering discussion.
| Category | Details | Why it matters |
|---|---|---|
| Architecture | Hopper | Introduces stronger asynchronous movement, Transformer Engine support, and new Tensor Core paths. |
| Memory | 80 GB HBM3, around 3.35 TB/s in the lesson deck | High bandwidth is essential for feeding matrix-heavy AI workloads without starving compute. |
| SM count | 132 streaming multiprocessors in the lesson material | SMs are the execution surfaces that run thread blocks and expose the memory and math hierarchy. |
| Tensor Cores | 528 total, 4 per SM in the lesson framing | These are the fast path for dense matrix multiply-accumulate operations. |
| L2 cache | 50 MB | L2 mediates traffic between SMs and HBM, smoothing data flow and reducing wasted bandwidth. |
| Interconnect | 4th-gen NVLink support with NVSwitch platform context | Critical for scaling training systems beyond one accelerator. |
Asynchronous execution is the first real shift
The course emphasizes the biggest conceptual change immediately: Hopper is designed to keep compute busy by decoupling request from completion. Each SM includes a Tensor Memory Accelerator, and that matters because it offloads tensor-shaped movement from the main execution resources.
Synchronous vs. asynchronous copies
A synchronous copy blocks the issuing thread or warp until the transfer finishes. That is easy to reason about, but it wastes issue bandwidth whenever memory latency dominates. An asynchronous copy starts the transfer and lets other useful work continue while the data moves in the background.
This is what makes pipelined kernels possible: tile i can be in compute while tile i+1 is already on the way, which is exactly the pattern later lessons build around with TMA, barriers, and WGMMA.
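The double-buffered pattern can be sketched with the CUDA pipeline intrinsics from `<cuda_pipeline.h>` (the Ampere-era mechanism that Hopper's TMA generalizes). This is a minimal illustration, not a tuned kernel; the name `pipelined_sum` and the `TILE` size are made up, and it assumes `blockDim.x == TILE` and that `in` holds `ntiles * TILE` floats.

```cuda
#include <cuda_pipeline.h>

#define TILE 256  // illustrative tile width, one element per thread

// While tile t is being consumed, tile t+1 is already in flight via an
// asynchronous global->shared copy.
__global__ void pipelined_sum(const float* __restrict__ in,
                              float* __restrict__ out, int ntiles) {
    __shared__ float buf[2][TILE];
    float acc = 0.0f;

    // Prefetch tile 0 asynchronously before the loop starts.
    __pipeline_memcpy_async(&buf[0][threadIdx.x], &in[threadIdx.x],
                            sizeof(float));
    __pipeline_commit();

    for (int t = 0; t < ntiles; ++t) {
        int cur = t & 1, nxt = cur ^ 1;
        // Issue the copy for tile t+1 before waiting on tile t.
        if (t + 1 < ntiles) {
            __pipeline_memcpy_async(&buf[nxt][threadIdx.x],
                                    &in[(t + 1) * TILE + threadIdx.x],
                                    sizeof(float));
        }
        __pipeline_commit();
        __pipeline_wait_prior(1);  // tile t's copy has now completed
        __syncthreads();           // make every thread's copy visible
        acc += buf[cur][threadIdx.x];  // "compute" on tile t
        __syncthreads();           // done reading before buffer is reused
    }
    __pipeline_wait_prior(0);
    out[threadIdx.x] = acc;
}
```

The key detail is the ordering: commit the next copy first, then wait only on the older one, so exactly one transfer is always overlapped with compute.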
Why TMA matters
- It reduces the per-thread overhead of global-memory-to-shared-memory movement.
- It handles shape, stride, and bounds behavior in hardware instead of forcing scalarized movement logic.
- It frees more SM execution bandwidth for actual math.
- It makes producer-consumer pipelines cleaner and more scalable for GEMM-style kernels.
Trade-off: asynchronous systems increase bookkeeping complexity. You need to reason about staging, dependency points, and when data becomes safe to consume.
The memory hierarchy is the real pipeline
A practical way to read H100 memory is as a staged path: HBM3 -> L2 cache -> L1/shared memory -> registers. Performance depends on moving data down this hierarchy efficiently and minimizing how often kernels have to pay off-chip costs.
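The staged path shows up concretely in any tiled kernel: one coalesced pass through HBM fills a shared-memory tile, and each element is then reused several times on-chip instead of re-read from device memory. A small sketch, with an illustrative 3-point average (`stencil3`, `TILE`, and `RADIUS` are made-up names and sizes, and it assumes `blockDim.x == TILE`):

```cuda
#define TILE 128
#define RADIUS 1

// Global (HBM) -> shared -> registers staging for a 1D 3-point average.
__global__ void stencil3(const float* __restrict__ in,
                         float* __restrict__ out, int n) {
    __shared__ float tile[TILE + 2 * RADIUS];  // on-chip staging surface
    int g = blockIdx.x * TILE + threadIdx.x;   // global index
    int s = threadIdx.x + RADIUS;              // shared-memory index

    // One coalesced read per element fills the tile body...
    tile[s] = (g < n) ? in[g] : 0.0f;
    // ...and a few threads fill the halo on each side.
    if (threadIdx.x < RADIUS) {
        tile[s - RADIUS] = (g >= RADIUS) ? in[g - RADIUS] : 0.0f;
        int ge = g + TILE;
        tile[s + TILE] = (ge < n) ? in[ge] : 0.0f;
    }
    __syncthreads();

    // Each element is now read three times from shared memory and
    // combined in registers, instead of three times from HBM.
    if (g < n) {
        float acc = tile[s - 1] + tile[s] + tile[s + 1];
        out[g] = acc / 3.0f;
    }
}
```

Without the shared tile, every output would cost three separate global loads; with it, each input element crosses the HBM boundary once.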
HBM3
Large off-chip device memory with huge bandwidth, but still much higher latency and energy cost than on-chip storage. Coalescing and access regularity still matter.
L2 cache
A 50 MB shared traffic manager between SMs and HBM. It can absorb messy accesses and reorganize them into more efficient memory transactions.
Shared memory + L1
The on-chip staging surface where tiles live during compute. It is banked, capacity-limited, and one of the most valuable resources in a high-performance kernel.
Registers
The fastest storage per thread and a common bottleneck. Too many registers reduce occupancy and can spill values into slower memory.
What each level changes
- HBM bandwidth matters, but coalescing still decides how much of it you actually get.
- L2 can hide some irregularity, but it is not a substitute for deliberate layout choices.
- Shared memory lowers latency and enables reuse, but bank conflicts and capacity trade-offs remain central.
- Register pressure can silently reduce active warp count enough to hurt performance.
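Register pressure is one of the few levers with direct source-level control. `__launch_bounds__` tells the compiler the maximum block size and a desired blocks-per-SM target, which caps the register budget it may spend per thread. The numbers below are illustrative, not a recommendation; compiling with `--ptxas-options=-v` reports the resulting per-thread register count.

```cuda
// Cap the register budget: at most 256 threads per block, and ask the
// compiler to keep register use low enough for 4 resident blocks per SM.
// Fewer registers per thread -> more resident warps, but possible spills.
__global__ void __launch_bounds__(256, 4)
saxpy(int n, float a, const float* __restrict__ x, float* __restrict__ y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
```

The trade-off is explicit: if the capped budget is too small for the kernel's live values, the compiler spills to local memory, which can cost more than the occupancy gained.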
SMs, Tensor Cores, warps, and dispatch
The SM is the core execution unit of the GPU. It runs thread blocks, holds local storage, and contains the math and memory units that actually execute instructions.
Tensor Cores and WGMMA
Hopper includes fourth-generation Tensor Cores, and one of the most important programming-model changes is that matrix math is no longer always a single-warp story. With WGMMA, four warps coordinate on one matrix multiply-accumulate operation, turning tensor-core issue into a 128-thread coordination problem.
- Matrix A can come from shared memory or registers, depending on the form.
- Matrix B is commonly treated as shared-memory sourced in the course framing.
- Matrix C accumulators are distributed across four warps in registers.
- FP16 and FP8 values are packed into 32-bit registers for tensor-core issue.
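The packing in the last bullet can be made visible with the `__half2` type from `<cuda_fp16.h>`: two FP16 values share one 32-bit register, and paired intrinsics such as `__hmul2` operate on both lanes at once. This sketch (the kernel name `scale_half2` is illustrative) shows ordinary CUDA-core half2 math, not tensor-core issue, but the register layout is the same idea.

```cuda
#include <cuda_fp16.h>

// Two FP16 values occupy one 32-bit register; __half2 makes the packing
// explicit, and one instruction produces two FP16 results.
__global__ void scale_half2(const __half2* __restrict__ in,
                            __half2* __restrict__ out, int n2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n2) {
        __half2 two = __float2half2_rn(2.0f);  // broadcast 2.0 to both lanes
        out[i] = __hmul2(in[i], two);          // one op, two FP16 products
    }
}
```

This is why layout choices matter for tensor-core paths: the hardware consumes registers in these packed pairs, so data has to arrive in the right lane arrangement.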
SM anatomy
Each SM contains FP32, INT32, FP64, Tensor Core, LSU, SFU, register, cache, scheduler, and dispatch resources. It is also partitioned into four SMSPs, each with its own local issue path.
Warps and schedulers
A warp is the 32-thread unit the scheduler reasons about each cycle. The scheduler selects ready warps using scoreboarding and dependency tracking, while the dispatch unit launches work into the execution pipelines. This is why “ready” is not just about having instructions to run. Data and pipeline availability have to line up too.
Practical implication: block sizes that map cleanly into multiple warps, especially multiples of four warps such as 128 threads, often balance more naturally across the four SMSPs.
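A tiny host-side helper makes the sizing rule checkable at compile time. The constants and the function name are illustrative; the point is that a block size divisible by 128 threads decomposes into whole 4-warp groups, which matches both the WGMMA issue unit and the four SMSPs.

```cuda
// A block whose thread count is a multiple of 128 maps to whole
// warp groups (4 warps of 32 threads) and divides evenly across
// the four SM sub-partitions.
constexpr int WARP_SIZE = 32;
constexpr int WARPS_PER_GROUP = 4;

constexpr bool maps_to_whole_warp_groups(int block_threads) {
    return block_threads > 0 &&
           block_threads % (WARP_SIZE * WARPS_PER_GROUP) == 0;
}

// Usable in a static_assert next to a kernel launch configuration:
static_assert(maps_to_whole_warp_groups(128), "one warp group");
static_assert(maps_to_whole_warp_groups(256), "two warp groups");
```

A 96-thread block, by contrast, is three warps: it still runs, but it cannot form a complete warp group and leaves one sub-partition's issue path uneven.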
Optimization takeaways
- Design around overlap. Serialized load-then-compute thinking leaves throughput on the table.
- Use Tensor Cores intentionally. The fastest path depends on datatype, layout, and staging choices.
- Watch register pressure. A small increase in per-thread register use can reduce occupancy enough to slow the whole kernel.
- Treat shared memory as a budget. It is powerful, but over-allocating it changes occupancy and cache behavior.
- Size blocks to the machine. Thread-block geometry should reflect warp organization and SM sub-partitioning.
- Expect specialization. Hopper rewards routing work through the right hardware path instead of a generic fallback.
Glossary
| Term | Definition |
|---|---|
| TMA | Tensor Memory Accelerator, hardware support for asynchronous tensor and tile movement. |
| SM | Streaming Multiprocessor, the main execution block that runs thread blocks. |
| SMSP | An SM sub-partition with its own scheduler and dispatch resources. |
| WGMMA | Warp Group Matrix Multiply Accumulate, tensor-core issue involving four warps together. |
| HBM3 | High-bandwidth off-chip device memory used as the GPU's main large-capacity memory tier. |
| NVLink | High-bandwidth GPU interconnect used in multi-GPU systems. |
Continue the course
Lesson 1 gives the machine shape. Lesson 2 moves into the mechanisms Hopper code actually uses: clusters, DSMEM, inline PTX, and the pointer conversions that make low-level memory operations correct.
Clusters and PTX
Thread block clusters, distributed shared memory, inline PTX, state spaces, and address conversion.
Course Material: All Slide Decks
Browse the complete lecture sequence when you want the whole staircase, not just one entry page.
Code Reality: Code Anchors
Jump back into the homepage section that maps these ideas onto CUTLASS and custom Hopper kernels.