Course Material

All Slide Decks

The core lesson sequence plus the kernel supplements in one place, from the Hopper introduction through barriers, TMA, WGMMA, Stream-K, kernel launch control, and multi-GPU orchestration.

Course Decks

Use this page when you want the full slide collection, the lesson decks plus the kernel supplements, rather than a single integrated lesson entry.

  1. 01: Introduction to H100s

     Architecture, memory hierarchy, execution hierarchy, and the shift toward asynchronous thinking.

  2. 02: Clusters, Data Types, Inline PTX, Pointers

     Thread block clusters, distributed shared memory, state spaces, and address-space conversion.

  3. 03: Asynchronicity and Barriers

     Latency hiding, mbarrier, producer-consumer hazards, wait patterns, and correctness under overlap.

  4. 04: cuTensorMap

     TMA descriptors, tensor shapes and strides, swizzling, interleaving, and descriptor-driven transfers.

  5. 05: cp.async.bulk

     Bulk async copies, multicast, prefetching, reductions, and barrier-based completion semantics.

  6. 06: WGMMA Part 1

     Warpgroups, wgmma.mma_async, register and shared-memory sourcing, and tensor-core dataflow.

  7. 07: WGMMA Part 2

     Commit and wait groups, FP8 behavior, stmatrix, packing, and sparse WGMMA constraints.

  8. 08: Kernel Design

     Warp specialization, pipelining, circular buffering, persistent scheduling, Stream-K, and launch strategy.

  9. 08.1: Stream-K

     Work splits, fixup, grouped locality, and the tail-balancing scheduler that turns K into a distributed work tape.

  10. 08.2: Kernel Launch

      Launch bounds, grid constants, dependent grids, programmatic stream serialization, and overlap tuning for multi-kernel handoff.

  11. 09: Multi GPU Part 1

      NVLink, NVSwitch, topology, bottlenecks, and why scaling changes how you reason about the machine.

  12. 10: Multi GPU Part 2

      Slurm, PMIx, NCCL communicators, collectives, and the orchestration behind large training systems.