Is this a beginner CUDA course?

No. The course assumes you already know C or C++, can read CUDA kernels, and want the Hopper-specific execution model rather than beginner CUDA onboarding.

What does the course actually teach?

It teaches Hopper and H100 mechanisms directly: asynchronous execution, mbarrier, thread block clusters, distributed shared memory, cuTensorMap, cp.async.bulk, WGMMA, warp-specialized kernel design, and multi-GPU orchestration.

Does the course cover WGMMA and Tensor Memory Accelerator?

Yes. WGMMA is covered across lessons 6 and 7, while TMA and cuTensorMap are covered in lessons 4 and 5 as part of Hopper's descriptor-driven asynchronous data movement model.

Does the course include multi-GPU topics?

Yes. Lessons 9 and 10 cover NVLink, NVSwitch, topology, Slurm, PMIx, NCCL communicators, collectives, and distributed training patterns.

Why is H100 different from older CUDA mental models?

Because Hopper shifts the programming model toward overlap-first execution: asynchronous copies, barriers, wait logic, descriptor-backed movement, and warpgroup tensor-core issue become central instead of secondary details.

A course by Prateek Shukla

CUDA Programming for NVIDIA H100s

Hopper taught as the asynchronous machine: mbarrier, cp.async.bulk, TMA, WGMMA, warp-specialized kernels, and multi-GPU orchestration.


10 lessons | CUTLASS SM90 kernels | PTX-level primitives | Multi-GPU orchestration

Why This Course Exists

Hopper changed the programming model, not just the throughput number.

The site focuses on how Hopper actually works: asynchronous execution, descriptor-driven movement, synchronization semantics, tensor-core dataflow, and the distributed systems context required at H100 scale.

Asynchronous execution is the center of the story.

Older CUDA instincts treat issuing an operation and waiting for its completion as a single mental step. Hopper makes overlap, wait logic, and producer-consumer coordination part of everyday kernel design.
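To make the gap between the two mental models concrete, here is a minimal timing sketch in Python. It is a toy model, not a measurement of Hopper: it assumes one serial load engine, one math unit, fixed per-tile costs, and a hypothetical `stages` parameter for the number of shared-memory buffers.

```python
def pipeline_time(num_tiles, load_time, compute_time, stages):
    """Finish time of a producer/consumer loop with `stages` buffers.

    Hypothetical model: one load engine, one math unit, fixed tile costs.
    """
    if stages < 1:
        raise ValueError("need at least one buffer stage")
    load_done = []     # completion time of each tile's load
    compute_done = []  # completion time of each tile's math
    for i in range(num_tiles):
        # The load engine is serial, so load i waits for load i-1.
        start = load_done[i - 1] if i > 0 else 0.0
        # Slot i % stages is reusable only after tile i-stages was consumed.
        if i >= stages:
            start = max(start, compute_done[i - stages])
        load_done.append(start + load_time)
        # Math on tile i needs its data and a free math unit.
        ready = load_done[i]
        if i > 0:
            ready = max(ready, compute_done[i - 1])
        compute_done.append(ready + compute_time)
    return compute_done[-1] if compute_done else 0.0
```

With `stages=1` every load waits for the previous tile's math, so issue and completion collapse into one serial step. With two or more stages the loads run ahead, and the total time approaches one load plus the math for all tiles.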

Data movement is a first-class optimization surface.

TMA descriptors, swizzles, layout choices, shared memory staging, and barrier-linked transfers matter as much as the math itself when you want tensor cores to stay fed.
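As one illustration of why swizzles matter, the sketch below models a 128-byte swizzle as an XOR of address bits 4-6 (the 16-byte chunk index) with bits 7-9 (the 128-byte row index). That bit pattern is the one CUTLASS commonly writes as `Swizzle<3,4,3>`. The 32-banks-of-4-bytes shared memory layout and the helper names are assumptions made for the example.

```python
def swizzle_128b(offset):
    """XOR the 16-byte chunk index (bits 4-6) with the row index (bits 7-9)."""
    return offset ^ (((offset >> 7) & 0x7) << 4)


def bank_of(offset):
    """Shared-memory bank of a byte offset, assuming 32 banks of 4 bytes."""
    return (offset // 4) % 32


def column_banks(chunk, swizzled):
    """Banks hit when 8 consecutive 128-byte rows read the same 16-byte chunk."""
    banks = set()
    for row in range(8):
        offset = row * 128 + chunk * 16
        if swizzled:
            offset = swizzle_128b(offset)
        banks.add(bank_of(offset))
    return banks
```

Without the swizzle, a column-wise read across eight rows starts in the same bank every time. With it, the eight reads start in eight different banks, which is the property that lets a staged tile be read back without serializing on bank conflicts.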

Scale is part of the architecture, not an optional appendix.

NVLink, Stream-K scheduling, NCCL setup, PMIx, and Slurm show up because real training systems do not stop at one GPU, and neither should the mental model.

Course Structure

Ten lessons, arranged as one climb into the asynchronous machine.

The sequence starts with architecture and state spaces, then moves through barriers, TMA, WGMMA, compute-bound kernel design, and finally the distributed orchestration stack.

Deep Dives

Four deliberate entries into the course.

The syllabus above shows the full climb. This section stays tighter: four high-signal entries that cover the machine model, synchronization, kernel design, and distributed orchestration without turning the homepage into another full lesson index.

Lesson 01

Introduction to H100s

Start with Hopper as a machine: architecture, memory hierarchy, SMSPs, Tensor Cores, schedulers, and the reason asynchronous execution changes how the rest of the course needs to be read.

Build the memory-path mental model from HBM through L2, shared memory, and registers.
See where TMA and WGMMA fit before the later lessons turn them into concrete kernel machinery.
Anchor the whole course in utilization, overlap, and the real execution hierarchy of the GPU.

Hopper | Memory Hierarchy | TMA | WGMMA


Lesson 03

Asynchronicity and Barriers

Move from architecture into synchronization: proxy separation, RAW and WAR hazards, mbarrier, cluster barriers, async groups, and the rules that make overlap correct instead of accidental.

See why Hopper needs explicit visibility and ordering across generic and async proxies.
Understand what mbarrier tracks, how phases flip, and why acquire semantics matter.
Connect TMA, cp.async.bulk, WGMMA consumption, and shared-memory ownership handoff.
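The phase-flip idea in the points above can be sketched with a small Python model. It is a toy, not the hardware: `BarrierModel` and `PipelineState` are hypothetical names, the barrier only counts arrivals (no transaction bytes), and a parity wait is modeled as "the barrier has moved past the phase with this parity".

```python
class PipelineState:
    """Stage index plus phase bit for a circular buffer of `stages` slots."""

    def __init__(self, stages):
        self.stages = stages
        self.index = 0  # which buffer slot is current
        self.phase = 0  # parity to wait on for this slot

    def advance(self):
        # Move to the next slot; flip the phase each time the ring wraps.
        self.index += 1
        if self.index == self.stages:
            self.index = 0
            self.phase ^= 1


class BarrierModel:
    """Toy mbarrier: counts arrivals, flips phase when all have arrived."""

    def __init__(self, expected):
        self.expected = expected
        self.pending = expected
        self.phase = 0

    def arrive(self):
        self.pending -= 1
        if self.pending == 0:
            self.phase ^= 1               # phase completes
            self.pending = self.expected  # re-arm for the next phase

    def try_wait(self, parity):
        # True once the phase with this parity has completed.
        return self.phase != parity
```

A consumer would keep one `PipelineState`, wait on the barrier of slot `index` with parity `phase`, consume the slot, then call `advance()`. Because the parity flips on every wrap, the same barrier can be reused for the next pass over the ring without being reinitialized.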

mbarrier | fence.proxy.async | Barrier Cluster | wait_group


Lesson 08

Kernel Design

Step from primitives to full kernels: arithmetic intensity, warp specialization, circular buffers, cooperative versus ping-pong pipelines, persistent scheduling, Stream-K, cluster multicast, and epilogue design.

Classify when a kernel is compute-bound versus memory-bound and which knobs actually matter in each regime.
Compare cooperative and ping-pong pipelines as utilization strategies, not interchangeable template names.
Connect persistent schedulers, Stream-K fixup, and epilogue handoff to the real SM90 kernel files in the repo.
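The compute-bound versus memory-bound classification can be sketched as a roofline check. This is a back-of-the-envelope model under stated assumptions: each GEMM operand is read once and the output is written once (ideal reuse), and the peak numbers are parameters the caller supplies rather than values taken from the course.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """FLOPs per byte for C = A @ B, assuming each matrix moves exactly once."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


def classify(intensity, peak_flops, peak_bandwidth):
    """Compare intensity with the ridge point peak_flops / peak_bandwidth."""
    ridge = peak_flops / peak_bandwidth
    return "compute-bound" if intensity >= ridge else "memory-bound"
```

As a usage example, taking roughly 989e12 FLOP/s of dense BF16 and 3.35e12 bytes/s of HBM bandwidth as approximate H100 SXM datasheet figures (an assumption here, not a measured number) puts the ridge near 295 FLOPs per byte. A 4096 x 4096 x 4096 BF16 GEMM has an intensity of about 1365 and lands on the compute-bound side, while the same shape with m = 1 is close to 1 FLOP per byte and is memory-bound.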

Warp Specialization | Stream-K | Persistent | Epilogue


Lesson 10

Multi GPU Part 2

Finish at the orchestration layer: Slurm job control, PMIx bootstrap, GPU binding under CUDA_VISIBLE_DEVICES, NCCL communicator setup, collective primitives, and the parallelism patterns that actually use the fabric from lesson 9.

Understand how resource allocation, peer discovery, and communicator creation compose into one launch path.
See why NCCL thinks in collectives and topology-aware routes instead of manual pairwise copies.
Tie data, tensor, pipeline, and expert parallelism back to the concrete communication primitives they need.
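To show what a collective buys over manual pairwise copies, here is a small Python simulation of a ring all-reduce, the textbook pattern behind gradient summation in data parallelism. It is a sketch under assumptions: NCCL chooses its own algorithms and protocols, and the function name and list-based "ranks" are made up for illustration.

```python
def ring_all_reduce(buffers):
    """Sum-all-reduce over a ring. Returns (per-rank results, elements sent per rank)."""
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers) or length % n != 0:
        raise ValueError("all buffers must share a length divisible by the rank count")
    chunk = length // n
    data = [list(b) for b in buffers]
    sent_per_rank = 0

    def span(c):
        return slice(c * chunk, (c + 1) * chunk)

    for phase in ("reduce_scatter", "all_gather"):
        offset = 0 if phase == "reduce_scatter" else 1
        for step in range(n - 1):
            # Snapshot every message first so all ranks send "simultaneously".
            messages = []
            for r in range(n):
                c = (r + offset - step) % n
                messages.append((c, data[r][span(c)]))
            for r, (c, payload) in enumerate(messages):
                dst = (r + 1) % n  # each rank only talks to its ring neighbor
                if phase == "reduce_scatter":
                    data[dst][span(c)] = [x + y for x, y in zip(data[dst][span(c)], payload)]
                else:
                    data[dst][span(c)] = payload
            sent_per_rank += chunk
    return data, sent_per_rank
```

Each rank sends 2 * (n - 1) chunks, which is 2 * (n - 1) / n of its buffer regardless of how many ranks participate. That traffic bound, together with neighbor-only links, is the kind of property a topology-aware library can exploit and a set of ad hoc pairwise copies usually cannot.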

Slurm | PMIx | NCCL | Collectives


Code Reality

The course keeps returning to pipelines because the files do.

A big part of this material is not slides in isolation. It is reading how descriptors, barriers, warpgroup math, scheduling, and measured kernels become concrete structures in the code. Pick a thread below and watch the pipeline take shape in the actual files.


Code Anchors

The course stays close to the kernel source.

These references are already in the repo. They are where the course connects architecture diagrams to scheduler logic, shared-memory layout, descriptor construction, and measured kernels.

sm90_gemm_tma_warpspecialized_pingpong.hpp

A CUTLASS SM90 kernel with ping-pong staging, warp-specialized producer and math groups, ordered sequence barriers, and scheduler-linked shared storage.

sm90_tile_scheduler_stream_k.hpp

Persistent Stream-K decomposition, work-tile bookkeeping, reduction units, and the scheduler rules that determine how work moves across the machine.

fast.cu/examples/matmul/matmul_12.cuh

Shared-memory matrix descriptors, inline PTX, explicit warpgroup fences, and direct wgmma.mma_async usage for bf16 GEMM kernels.

fast.cu/README.md

From-scratch kernels with measured bf16 matmul results, including runs that outperform cuBLAS on selected H100 matrix sizes.

Interactive Visualizations

Small live glances into the teaching tools behind the course.

These companion visualizations are live on GitHub Pages, and each one is also available as a standalone interactive.

Rasterization + Swizzle Visualizer

Tile traversal, padded grids, swizzled cluster IDs, and persistent block mapping from rasterization.cu.


H100 GPU Architecture

A dedicated architecture visualization for reading Hopper as a machine: major structures, hierarchy, and the chip-level surfaces the course keeps referring back to.


DGX H100 Node Architecture

A system-level view of the node: GPUs, fabric, and the kind of topology context that matters once the course moves beyond a single device.


WGMMA Lifecycle

A scrollytelling walkthrough of WGMMA operations, swizzling, descriptors, and async load behavior on Hopper.


TMA / WGMMA Async Pipeline

A compact buffer-and-timeline view of overlapped load, compute, store, and barrier-driven progression through the asynchronous pipeline.


Thread Block Cluster Visualizer

An interactive view of thread block clusters: residency, distributed shared memory topology, and how clusters coordinate across SMs on Hopper.


SM90 Tile Scheduler Visualizer

A live walkthrough of CUTLASS SM90 tile scheduling: swizzle groups, cluster-major and cluster-minor offsets, raster order, and the get_work_idx_m_and_n mapping from linear CTA work to output tiles.
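The general idea of mapping a linear work index to an output tile through swizzle groups can be sketched with the generic "grouped ordering" used in many GEMM tutorials. This is explicitly not the CUTLASS `get_work_idx_m_and_n` implementation, which also accounts for cluster shape and raster direction. The function names and the `group` parameter are hypothetical.

```python
def tile_coords(linear_idx, tiles_m, tiles_n, group):
    """Map a linear tile index to (m, n), walking bands of `group` rows of tiles."""
    tiles_per_band = group * tiles_n
    band = linear_idx // tiles_per_band
    first_m = band * group
    # The last band may be shorter when tiles_m is not a multiple of group.
    band_rows = min(tiles_m - first_m, group)
    within = linear_idx % tiles_per_band
    return first_m + within % band_rows, within // band_rows


def operand_panels(first_idx, count, tiles_m, tiles_n, group):
    """Distinct A row panels plus B column panels touched by `count` tiles."""
    coords = [tile_coords(first_idx + i, tiles_m, tiles_n, group) for i in range(count)]
    return len({m for m, _ in coords}) + len({n for _, n in coords})
```

On a 4 x 4 tile grid, four consecutive tiles with `group=1` touch one row panel of A and four column panels of B, five panels in total. With `group=2` the same four tiles form a 2 x 2 block and touch only four panels, which is the locality argument behind swizzled traversal.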


FAQ

What people usually want to know before they start.

Short answers about prerequisites, scope, and what this course is actually trying to teach.

The course starts at lesson 1, but it assumes you are already comfortable with C or C++ and with reading ordinary CUDA kernels. The focus is Hopper’s execution model, not CUDA basics.

Start Here

Move from Hopper architecture into the mechanisms that make modern kernels work.

The lessons connect the execution model, descriptors, waits, tile scheduling, tensor-core dataflow, and multi-GPU orchestration in one place.

