A course by Prateek Shukla

CUDA Programming for NVIDIA H100s

Hopper taught as the asynchronous machine: mbarrier, cp.async.bulk, TMA, WGMMA, warp-specialized kernels, and multi-GPU orchestration.

10 lessons · CUTLASS SM90 kernels · PTX-level primitives · Multi-GPU orchestration

Why This Course Exists

Hopper changed the programming model, not just the throughput number.

The site focuses on how Hopper actually works: asynchronous execution, descriptor-driven movement, synchronization semantics, tensor-core dataflow, and the distributed systems context required at H100 scale.

01

Asynchronous execution is the center of the story.

Older CUDA instincts assume issue and completion are nearly one mental unit. Hopper makes overlap, wait logic, and producer-consumer coordination part of everyday kernel design.

02

Data movement is a first-class optimization surface.

TMA descriptors, swizzles, layout choices, shared memory staging, and barrier-linked transfers matter as much as the math itself when you want tensor cores to stay fed.

03

Scale is part of the architecture, not an optional appendix.

NVLink, Stream-K scheduling, NCCL setup, PMIx, and Slurm show up because real training systems do not stop at one GPU, and neither should the mental model.

Deep Dives

Four deliberate entries into the course.

The syllabus above shows the full climb. This section stays tighter: four high-signal entries that cover the machine model, synchronization, kernel design, and distributed orchestration without turning the homepage into another full lesson index.

Lesson 01

Introduction to H100s

Start with Hopper as a machine: architecture, memory hierarchy, SMSPs, Tensor Cores, schedulers, and the reason asynchronous execution changes how the rest of the course needs to be read.

  • Build the memory-path mental model from HBM through L2, shared memory, and registers.
  • See where TMA and WGMMA fit before the later lessons turn them into concrete kernel machinery.
  • Anchor the whole course in utilization, overlap, and the real execution hierarchy of the GPU.
Hopper · Memory Hierarchy · TMA · WGMMA

Lesson 03

Asynchronicity and Barriers

Move from architecture into synchronization: proxy separation, RAW and WAR hazards, mbarrier, cluster barriers, async groups, and the rules that make overlap correct instead of accidental.

  • See why Hopper needs explicit visibility and ordering across generic and async proxies.
  • Understand what mbarrier tracks, how phases flip, and why acquire semantics matter.
  • Connect TMA, cp.async.bulk, WGMMA consumption, and shared-memory ownership handoff.
mbarrier · fence.proxy.async · Barrier · Cluster · wait_group

Lesson 08

Kernel Design

Step from primitives to full kernels: arithmetic intensity, warp specialization, circular buffers, cooperative versus ping-pong pipelines, persistent scheduling, Stream-K, cluster multicast, and epilogue design.

  • Classify when a kernel is compute-bound versus memory-bound and which knobs actually matter in each regime.
  • Compare cooperative and ping-pong pipelines as utilization strategies, not interchangeable template names.
  • Connect persistent schedulers, Stream-K fixup, and epilogue handoff to the real SM90 kernel files in the repo.
Warp Specialization · Stream-K · Persistent · Epilogue

Lesson 10

Multi-GPU, Part 2

Finish at the orchestration layer: Slurm job control, PMIx bootstrap, GPU binding under CUDA_VISIBLE_DEVICES, NCCL communicator setup, collective primitives, and the parallelism patterns that actually use the fabric from lesson 9.

  • Understand how resource allocation, peer discovery, and communicator creation compose into one launch path.
  • See why NCCL thinks in collectives and topology-aware routes instead of manual pairwise copies.
  • Tie data, tensor, pipeline, and expert parallelism back to the concrete communication primitives they need.
Slurm · PMIx · NCCL · Collectives

Code Reality

The course keeps returning to pipelines because the files do.

A big part of this material is not slides in isolation; it is reading how descriptors, barriers, warpgroup math, scheduling, and measured kernels become concrete structures in the code. Pick a thread below and watch the pipeline take shape in the actual files.

Code Anchors

The course stays close to the kernel source.

These references are already in the repo. They are where the course connects architecture diagrams to scheduler logic, shared-memory layout, descriptor construction, and measured kernels.

sm90_tile_scheduler_stream_k.hpp

Persistent Stream-K decomposition, work-tile bookkeeping, reduction units, and the scheduler rules that determine how work moves across the machine.

fast.cu/README.md

From-scratch kernels with measured bf16 matmul results, including runs that outperform cuBLAS on selected H100 matrix sizes.

Interactive Visualizations

Small live glances into the teaching tools behind the course.

These companion visualizations are already live on GitHub Pages. On the homepage each preview loads inline on demand, with a full-view link whenever you want the standalone interactive.

01

Rasterization + Swizzle Visualizer

Tile traversal, padded grids, swizzled cluster IDs, and persistent block mapping from rasterization.cu.

Open Raster Visualizer
Snapshot Tile Scheduling Studio

02

H100 GPU Architecture

A dedicated architecture visualization for reading Hopper as a machine: major structures, hierarchy, and the chip-level surfaces the course keeps referring back to.

Open H100 Visualizer
Snapshot Architecture View

03

DGX H100 Node Architecture

A system-level view of the node: GPUs, fabric, and the kind of topology context that matters once the course moves beyond a single device.

Open DGX Visualizer
Snapshot Node Topology

04

WGMMA Lifecycle

A scrollytelling walkthrough of WGMMA operations, swizzling, descriptors, and async load behavior on Hopper.

Open WGMMA Visualizer
Snapshot Lifecycle Walkthrough

05

TMA / WGMMA Async Pipeline

A compact buffer-and-timeline view of overlapped load, compute, store, and barrier-driven progression through the asynchronous pipeline.

Open Barrier Visualizer
Snapshot Async Pipeline

06

Thread Block Cluster Visualizer

An interactive view of thread block clusters: residency, distributed shared memory topology, and how clusters coordinate across SMs on Hopper.

Open Cluster Visualizer
Snapshot Cluster Topology

07

SM90 Tile Scheduler Visualizer

A live walkthrough of CUTLASS SM90 tile scheduling: swizzle groups, cluster-major and cluster-minor offsets, raster order, and the get_work_idx_m_and_n mapping from linear CTA work to output tiles.

Open Tile Scheduler Visualizer
Snapshot Scheduler Mapping

FAQ

What people usually want to know before they start.

Short answers about prerequisites, scope, and what this course is actually trying to teach.

Do I need to know CUDA already?

It starts at lesson 1, but it assumes you are already comfortable with C or C++ and reading normal CUDA kernels. The focus here is Hopper’s execution model, not CUDA basics.

Start Here

Move from Hopper architecture into the mechanisms that make modern kernels work.

The lessons connect the execution model, descriptors, waits, tile scheduling, tensor-core dataflow, and multi-GPU orchestration in one place.