Classify the bottleneck before you design the kernel
The notes start from first principles: a CUDA kernel is limited primarily either by math throughput or
by data movement. Arithmetic intensity is the guide:
AI = FLOPs / bytes moved. If that value is below the hardware ridge point, the kernel is
memory-bound. If it is above the ridge point, the kernel is compute-bound.
| Regime | Primary symptom | What you optimize first |
|---|---|---|
| Memory-bound | Execution is gated by bytes moved rather than math issue. | Traffic reduction, locality, cache behavior, and movement overlap. |
| Compute-bound | Tensor cores or arithmetic issue are the actual limiter. | Register pressure, WGMMA issue density, pipeline depth, and epilogue overlap. |
The slides give concrete ridge-point intuition: roughly 295 FLOP / byte for FP16 tensor
operations and around 20 FLOP / byte for FP32 CUDA-core work. That is why the lesson is
mostly about Hopper GEMMs. They are high-arithmetic-intensity kernels whose bottleneck shifts toward
register file pressure, WGMMA launch rate, and scheduling rather than raw byte traffic alone.
Compute-bound kernels live or die by warp specialization and register policy
Hopper compute-bound kernels are defined by strong reuse and a producer-consumer imbalance. The producer warps frequently sit at barriers waiting for consumers because the real pressure is on the math path: accumulators, operand fragments, issue cadence, and the number of registers required to keep a large output tile resident.
Producer warps
Usually one warpgroup is enough. Producers mainly issue TMA copies, manage barriers, and stay lean on registers.
Consumer warps
Consumers carry the accumulator tile and drive WGMMA. They are where register pressure explodes, which is why the notes treat one or two consumer warpgroups as the main design choice.
setmaxnreg makes the asymmetry explicit
The lesson highlights setmaxnreg as the instruction that lets different warp roles ask for
different register budgets during execution. Producers can stay near the minimum, while consumers
request the larger allocation needed by the MMA-heavy section.
| Design choice | Why it exists | Typical consequence |
|---|---|---|
| One producer warpgroup | A single issuing thread can launch large TMA-backed movement. | Keeps load-side overhead small and frees resources for math. |
| One or two consumer warpgroups | Accumulator tiles determine how much math state fits per warpgroup. | Drives the choice between basic / ping-pong and cooperative pipelines. |
| Moderate occupancy, high register density | Compute-bound kernels do not need maximal occupancy if they keep tensor cores fed. | Register budgeting becomes more important than raw block count. |
The mental model: warp specialization is not just branch cleanup. It is how the kernel fits the live state of a Hopper GEMM into the register file without spilling away the whole win.
Circular buffers are the physical structure that makes overlap real
The lesson reduces pipelining to one idea: while tile N is being computed, tile
N + 1 must already be moving. Shared memory becomes a fixed ring of stages that the
producer fills and the consumer drains round-robin.
- Producer waits EMPTY -> declares expected bytes on FULL
- Producer issues TMA -> hardware signals FULL on completion
- Consumer waits FULL -> issues WGMMA from that stage
- Consumer waits oldest -> arrives EMPTY to release the stage
FULL barrier
The consumer waits here before reading a stage. In TMA pipelines this is tied to real transfer completion, not just a software flag.
EMPTY barrier
The producer waits here before reusing a stage. Consumers release it only after the oldest WGMMA that touches that stage is known to be done.
Phase bits are what keep the ring safe after wrap-around. Stage indices repeat. The phase bit flips on wrap so “stage 0 is full” can be disambiguated as the old pass or the new pass through the circular buffer.
Prologue, steady state, drain
- Prologue: fill the pipeline with initial WGMMA work before releasing anything.
- Steady state: wait for data, issue WGMMA, wait for the oldest in-flight group, release the oldest stage, advance.
- Drain: finish with warpgroup_wait<0>(), then release the remaining stages so producers can exit cleanly.
Cooperative and ping-pong pipelines solve different utilization problems
Both architectures use the same core ingredients: one producer warpgroup, consumer warpgroups, circular shared-memory stages, and barrier-backed overlap. The difference is where the second consumer warpgroup is spent.
Cooperative pipeline
Two consumer warpgroups work on the same output tile. That enables larger effective CTA tiles because each warpgroup owns a disjoint M region, but tensor cores go idle during the epilogue.
Ping-pong pipeline
Consumer 0 and Consumer 1 alternate tiles. While one runs the epilogue, the other runs the next tile's WGMMA mainloop. This is the design that overlaps epilogue with tensor-core work.
| Architecture | Best when | Main tradeoff |
|---|---|---|
| Cooperative | You need a larger output tile than one warpgroup can hold in registers. | Epilogue time is not hidden by concurrent WGMMA. |
| Ping-pong | Epilogue overlap is critical and K is large enough to keep alternating consumers busy. | The coordination logic is more intricate, especially around ordered barriers and epilogue handoff. |
Clusters extend the data-sharing radius
The notes treat thread block clusters mainly as a load-side optimization for WGMMA kernels. Adjacent
CTAs that need the same A or B tile can use TMA multicast so one global-memory read fans out into the
local shared memory of the blocks in the cluster. Typical shapes like (1, 2) or
(2, 1) are favored because neighboring blocks inside a TPC communicate especially well.
Persistent scheduling decides who gets the next tile and how fixup happens
Scheduling is not just launch geometry. It is the policy that maps tile coordinates onto workers over time. Hopper kernels frequently use persistent scheduling so a smaller grid can stay resident and loop through many work tiles rather than paying the full non-persistent tail every time.
Persistent tiles
A CTA pulls a sequence of tiles, keeps its role assignment and pipeline state alive, and avoids the worst tail effects of one-shot block scheduling.
Stream-K
Work can be decomposed along K so multiple CTAs contribute partial sums to one output tile, then fix up the result through a reduction and epilogue ownership rule.
The Stream-K scheduler in this repo makes that explicit with a work unit that tracks
M_idx, N_idx, L_idx, K_idx, and
k_tile_count. Those fields tell the CTA where its split starts and how much of the K
dimension it owns.
| Role | Predicate | Responsibility |
|---|---|---|
| First split | K_idx == 0 | Initialize the partial result in workspace. |
| Middle split | 0 < K_idx and not final | Reduce into existing workspace state. |
| Final split | K_idx + k_tile_count == k_tiles_per_output_tile | Own the final fixup and epilogue for that output tile. |
The notes also highlight grouping as an L2-locality optimization on top of Stream-K. Rather than one completely global pool of stream-K units, groups partition work so cooperating CTAs stay closer to the same region of output space and share more useful cache state.
The epilogue is its own pipeline, not an afterthought
The output tile is usually too large to dump from registers to global memory in one step. Hopper epilogues break the tile into subtiles, optionally load source tensor C, apply the elementwise or scaling logic in registers, copy the converted result into shared memory, and then issue TMA stores for the completed subtile.
Register to shared
Consumers write each subtile into shared memory after type conversion. When C and D share element width, the same shared-memory buffers can be reused for load and store phases.
Shared to global
A cross-proxy fence plus bar.sync make the full subtile visible before the TMA store
launches, and the store pipeline can overlap one subtile's write with work on the next subtile.
In the ping-pong design, the handoff is even more structured. One consumer's epilogue must coexist with the other consumer's MMA. The lesson describes a 2x2 ordered-barrier grid: one row for MMA handoff and one row for epilogue handoff. That is what lets the two consumers alternate without corrupting shared memory or starving tensor cores.
Important restraint: not every post-op belongs inside the epilogue. The notes are explicit that cross-tile reductions like layer norm or softmax often require a real kernel boundary, and the goal becomes making that handoff cheap rather than pretending it can always be fused away.
Continue the course
Lesson 8 turns the Hopper primitives into full-kernel machinery. The next two supplements stay inside the kernel layer: first Stream-K for tail balancing, then kernel launch control for the handoff between dependent grids before the course moves out to multi-GPU systems.
Stream-K
Continue into split-K work units, fixup, locks, and the scheduler logic that targets the tail wave directly.
Then: Kernel Launch
Follow the handoff into launch bounds, dependent grids, programmatic stream serialization, and launch-window tuning.
Code Anchor: Ping-Pong Mainloop
See ordered math warpgroups, persistent tile iteration, and epilogue overlap in the repo's SM90 ping-pong kernel.
After That: Multi GPU Part 1
Move from kernel utilization into NVLink, NVSwitch, rails, and the system-level fabric behind large H100 training jobs.