Course Material
All Slide Decks
The core lesson sequence plus the kernel supplements in one place, from the Hopper introduction through barriers, TMA, WGMMA, Stream-K, kernel launch control, and multi-GPU orchestration.
Course Decks
The lesson decks plus the kernel addenda.
Use this page when you want the full slide collection rather than a single integrated lesson entry.
-
01
Introduction to H100s
Architecture, memory hierarchy, execution hierarchy, and the shift toward asynchronous thinking.
-
02
Clusters, Data Types, Inline PTX, Pointers
Thread block clusters, distributed shared memory, state spaces, and address-space conversion.
-
03
Asynchronicity and Barriers
Latency hiding, mbarrier, producer-consumer hazards, wait patterns, and correctness under overlap.
-
04
cuTensorMap
TMA descriptors, tensor shapes and strides, swizzling, interleaving, and descriptor-driven transfers.
-
05
cp.async.bulk
Bulk async copies, multicast, prefetching, reductions, and barrier-based completion semantics.
-
06
WGMMA Part 1
Warpgroups, wgmma.mma_async, register and shared-memory sourcing, and tensor-core dataflow.
-
07
WGMMA Part 2
Commit and wait groups, FP8 behavior, stmatrix, packing, and sparse WGMMA constraints.
-
08
Kernel Design
Warp specialization, pipelining, circular buffering, persistent scheduling, Stream-K, and launch strategy.
-
08.1
Stream-K
Work splits, fixup, grouped locality, and the tail-balancing scheduler that treats the K dimension as a distributed work tape.
-
08.2
Kernel Launch
Launch bounds, grid constants, dependent grids, programmatic stream serialization, and overlap tuning for multi-kernel handoff.
-
09
Multi-GPU Part 1
NVLink, NVSwitch, topology, bottlenecks, and why scaling changes how you reason about the machine.
-
10
Multi-GPU Part 2
Slurm, PMIx, NCCL communicators, collectives, and the orchestration behind large training systems.