Course Material

All Slide Decks

The core lesson sequence plus the kernel supplements in one place, from the Hopper introduction through barriers, TMA, WGMMA, Stream-K, kernel launch control, and multi-GPU orchestration.

Course Decks

Use this page when you want the full slide collection, the lesson decks plus the kernel supplements, rather than a single integrated lesson entry.

  1. 01: Introduction to H100s

     Architecture, memory hierarchy, execution hierarchy, and the shift toward asynchronous thinking.

  2. 02: Clusters, Data Types, Inline PTX, Pointers

     Thread block clusters, distributed shared memory, state spaces, and address-space conversion.

  3. 03: Asynchronicity and Barriers

     Latency hiding, mbarrier, producer-consumer hazards, wait patterns, and correctness under overlap.

  4. 04: cuTensorMap

     TMA descriptors, tensor shapes and strides, swizzling, interleaving, and descriptor-driven transfers.

  5. 05: cp.async.bulk

     Bulk async copies, multicast, prefetching, reductions, and barrier-based completion semantics.

  6. 06: WGMMA Part 1

     Warpgroups, wgmma.mma_async, register and shared-memory sourcing, and tensor-core dataflow.

  7. 07: WGMMA Part 2

     Commit and wait groups, FP8 behavior, stmatrix, packing, and sparse WGMMA constraints.

  8. 08: Kernel Design

     Warp specialization, pipelining, circular buffering, persistent scheduling, Stream-K, and launch strategy.

  9. 08.1: Stream-K

     Work splits, fixup, grouped locality, and the tail-balancing scheduler that turns K into a distributed work tape.

  10. 08.2: Kernel Launch

      Launch bounds, grid constants, dependent grids, programmatic stream serialization, and overlap tuning for multi-kernel handoff.

  11. 09: Multi GPU Part 1

      NVLink, NVSwitch, topology, bottlenecks, and why scaling changes how you reason about the machine.

  12. 10: Multi GPU Part 2

      Slurm, PMIx, NCCL communicators, collectives, and the orchestration behind large training systems.