
Kernel Design

Lesson 8 zooms out from one Hopper primitive to the whole kernel. The problem becomes utilization: deciding whether the bottleneck is math or movement, shaping warp roles, sizing circular buffers, picking cooperative versus ping-pong pipelines, scheduling tiles persistently, and designing an epilogue that does not undo the mainloop.


Classify the bottleneck before you design the kernel

The notes start from first principles: a CUDA kernel is limited primarily either by math throughput or by data movement. Arithmetic intensity is the guide: AI = FLOPs / bytes moved. If that value is below the hardware's ridge point, the kernel is memory-bound; above it, the kernel is compute-bound.

| Regime | Primary symptom | What you optimize first |
| --- | --- | --- |
| Memory-bound | Execution is gated by bytes moved rather than math issue. | Traffic reduction, locality, cache behavior, and movement overlap. |
| Compute-bound | Tensor cores or arithmetic issue are the actual limiter. | Register pressure, WGMMA issue density, pipeline depth, and epilogue overlap. |

The slides give concrete ridge-point intuition: roughly 295 FLOP / byte for FP16 tensor operations and around 20 FLOP / byte for FP32 CUDA-core work. That is why the lesson is mostly about Hopper GEMMs. They are high-arithmetic-intensity kernels whose bottleneck shifts toward register file pressure, WGMMA launch rate, and scheduling rather than raw byte traffic alone.
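That classification is easy to put numbers on. A minimal Python sketch of the AI calculation, using the lesson's FP16 ridge point; the GEMM shape and helper names are illustrative, and the byte count assumes ideal reuse (each operand read once):

```python
def arithmetic_intensity(flops, bytes_moved):
    """AI = FLOPs / bytes moved, as defined in the lesson."""
    return flops / bytes_moved

def classify(ai, ridge_point):
    """Below the ridge point the kernel is memory-bound, above it compute-bound."""
    return "memory-bound" if ai < ridge_point else "compute-bound"

# FP16 GEMM, M = N = K = 4096: 2*M*N*K FLOPs,
# (M*K + K*N + M*N) elements of 2 bytes each, touched once.
M = N = K = 4096
flops = 2 * M * N * K
bytes_moved = (M * K + K * N + M * N) * 2

ai = arithmetic_intensity(flops, bytes_moved)
print(round(ai, 1))        # 1365.3 FLOP/byte for this ideal-reuse GEMM
print(classify(ai, 295))   # compute-bound against the FP16 tensor ridge point
print(classify(8.0, 20))   # a low-AI elementwise kernel is memory-bound
```

Real kernels sit below the ideal-reuse number because tiles are re-read across the K loop, which is exactly what tiling and clusters try to claw back.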

Compute-bound kernels live or die by warp specialization and register policy

Hopper compute-bound kernels are defined by strong reuse and a producer-consumer imbalance. The producer warps frequently sit at barriers waiting for consumers because the real pressure is on the math path: accumulators, operand fragments, issue cadence, and the number of registers required to keep a large output tile resident.

Producer warps

Usually one warpgroup is enough. Producers mainly issue TMA copies, manage barriers, and stay lean on registers.

Consumer warps

Consumers carry the accumulator tile and drive WGMMA. They are where register pressure explodes, which is why the notes treat one or two consumer warpgroups as the main design choice.

setmaxnreg makes the asymmetry explicit

The lesson highlights setmaxnreg as the instruction that lets different warp roles ask for different register budgets during execution. Producers can stay near the minimum, while consumers request the larger allocation needed by the MMA-heavy section.

| Design choice | Why it exists | Typical consequence |
| --- | --- | --- |
| One producer warpgroup | A single issuing thread can launch large TMA-backed movement. | Keeps load-side overhead small and frees resources for math. |
| One or two consumer warpgroups | Accumulator tiles determine how much math state fits per warpgroup. | Drives the choice between basic / ping-pong and cooperative pipelines. |
| Moderate occupancy, high register density | Compute-bound kernels do not need maximal occupancy if they keep tensor cores fed. | Register budgeting becomes more important than raw block count. |

The mental model: warp specialization is not just branch cleanup. It is how the kernel fits the live state of a Hopper GEMM into the register file without spilling away the whole win.
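The register pressure behind that claim is worth putting numbers on. A back-of-envelope Python sketch, assuming Hopper's 64K 32-bit registers per SM and 128-thread warpgroups; the tile shapes are illustrative, not prescribed by the lesson:

```python
REGS_PER_SM = 64 * 1024  # 65,536 32-bit registers per Hopper SM (assumed here)

def regs_per_thread_even(num_warpgroups, threads_per_warpgroup=128):
    """Even split of the register file across all resident threads."""
    return REGS_PER_SM // (num_warpgroups * threads_per_warpgroup)

def accum_regs_per_thread(tile_m, tile_n, threads=128):
    """FP32 accumulator registers each thread holds for an MxN warpgroup tile."""
    return (tile_m * tile_n) // threads

# One producer plus two consumer warpgroups resident on an SM:
print(regs_per_thread_even(3))       # 170 registers/thread under an equal share

# Each consumer warpgroup owning a 64x256 slice of a 128x256 CTA tile:
print(accum_regs_per_thread(64, 256))  # 128 registers/thread for accumulators alone
```

With accumulators alone eating most of an even share, operands and addressing state do not fit without the asymmetry: the producer shrinks toward the minimum via setmaxnreg so the consumers can grow past the equal split.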

Circular buffers are the physical structure that makes overlap real

The lesson reduces pipelining to one idea: while tile N is being computed, tile N + 1 must already be moving. Shared memory becomes a fixed ring of stages that the producer fills and the consumer drains round-robin.

Producer waits EMPTY  -> declares expected bytes on FULL
Producer issues TMA   -> hardware signals FULL on completion
Consumer waits FULL   -> issues WGMMA from that stage
Consumer waits oldest -> arrives EMPTY to release the stage

FULL barrier

The consumer waits here before reading a stage. In TMA pipelines this is tied to real transfer completion, not just a software flag.

EMPTY barrier

The producer waits here before reusing a stage. Consumers release it only after the oldest WGMMA that touches that stage is known to be done.

Phase bits are what keep the ring safe after wrap-around. Stage indices repeat. The phase bit flips on wrap so “stage 0 is full” can be disambiguated as the old pass or the new pass through the circular buffer.

Prologue, steady state, drain

  1. Prologue: fill the pipeline with initial WGMMA work before releasing anything.
  2. Steady state: wait for data, issue WGMMA, wait for the oldest in-flight group, release the oldest stage, advance.
  3. Drain: finish with warpgroup_wait<0>(), then release the remaining stages so producers can exit cleanly.
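The wrap-around behavior that phase bits protect against can be simulated in a few lines. StageRing is a toy model of the index-plus-phase bookkeeping, not the real barrier object:

```python
class StageRing:
    """Toy model of a circular pipeline: a stage index plus a phase bit
    that flips on wrap-around, so 'stage 0 is full' from the previous
    pass cannot be confused with the current pass."""
    def __init__(self, num_stages):
        self.num_stages = num_stages
        self.index = 0
        self.phase = 0

    def advance(self):
        self.index += 1
        if self.index == self.num_stages:  # wrap: reuse stage 0 ...
            self.index = 0
            self.phase ^= 1                # ... but under a flipped phase

ring = StageRing(num_stages=4)
seen = []
for _ in range(6):                         # six tiles through a 4-stage ring
    seen.append((ring.index, ring.phase))
    ring.advance()

print(seen)
# [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1)]
```

Producer and consumer each keep their own copy of this state; a barrier wait compares the expected phase against the barrier's current phase to decide whether the signal it sees belongs to this pass or the last one.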

Cooperative and ping-pong pipelines solve different utilization problems

Both architectures use the same core ingredients: one producer warpgroup, consumer warpgroups, circular shared-memory stages, and barrier-backed overlap. The difference is where the second consumer warpgroup is spent.

Cooperative pipeline

Two consumer warpgroups work on the same output tile. That enables larger effective CTA tiles because each warpgroup owns a disjoint M region, but tensor cores go idle during the epilogue.

Ping-pong pipeline

Consumer 0 and Consumer 1 alternate tiles. While one runs the epilogue, the other runs the next tile's WGMMA mainloop. This is the design that overlaps epilogue with tensor-core work.

| Architecture | Best when | Main tradeoff |
| --- | --- | --- |
| Cooperative | You need a larger output tile than one warpgroup can hold in registers. | Epilogue time is not hidden by concurrent WGMMA. |
| Ping-pong | Epilogue overlap is critical and K is large enough to keep alternating consumers busy. | The coordination logic is more intricate, especially around ordered barriers and epilogue handoff. |
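A deliberately simplified cost model makes the tradeoff concrete. All numbers here are illustrative, and real overlap depends on K depth, barrier costs, and tile sizing (cooperative compensates with larger tiles, which this model does not capture):

```python
def cooperative_time(num_tiles, mainloop, epilogue):
    """Both consumers work on the same tile; the epilogue is exposed every tile."""
    return num_tiles * (mainloop + epilogue)

def ping_pong_time(num_tiles, mainloop, epilogue):
    """Consumers alternate tiles; each epilogue hides under the other
    consumer's mainloop as long as epilogue <= mainloop."""
    assert epilogue <= mainloop
    return num_tiles * mainloop + epilogue  # only the last epilogue is exposed

tiles, m, e = 100, 10.0, 3.0
print(cooperative_time(tiles, m, e))  # 1300.0
print(ping_pong_time(tiles, m, e))    # 1003.0
```

The gap is exactly the exposed epilogue time, which is why the lesson frames ping-pong as the epilogue-overlap design rather than a universally faster one.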

Clusters extend the data-sharing radius

The notes treat thread block clusters mainly as a load-side optimization for WGMMA kernels. Adjacent CTAs that need the same A or B tile can use TMA multicast so one global-memory read fans out into the local shared memory of the blocks in the cluster. Typical shapes like (1, 2) or (2, 1) are favored because neighboring blocks inside a TPC communicate especially well.
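The traffic savings are simple to count. A hedged sketch, assuming the convention that CTAs in the same cluster row share an A tile and CTAs in the same cluster column share a B tile (the function and parameter names are illustrative):

```python
def mainloop_tile_reads(cluster_m, cluster_n, k_steps, multicast):
    """Operand-tile reads from global memory for one cluster over the K loop.

    Every CTA needs one A tile and one B tile per K step. Assumed sharing
    convention: a cluster row shares A, a cluster column shares B.
    Multicast lets one read fan out to all sharers."""
    ctas = cluster_m * cluster_n
    if not multicast:
        return 2 * ctas * k_steps        # each CTA reads its own A and B
    a_reads = cluster_m * k_steps        # one A read per cluster row, per step
    b_reads = cluster_n * k_steps        # one B read per cluster column, per step
    return a_reads + b_reads

print(mainloop_tile_reads(2, 1, k_steps=64, multicast=False))  # 256 tile reads
print(mainloop_tile_reads(2, 1, k_steps=64, multicast=True))   # 192: B reads halved
```

A (2, 1) cluster halves the reads of one operand; the savings grow with cluster size, at the cost of tighter CTA co-scheduling requirements.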

Persistent scheduling decides who gets the next tile and how fixup happens

Scheduling is not just launch geometry. It is the policy that maps tile coordinates onto workers over time. Hopper kernels frequently use persistent scheduling so a smaller grid can stay resident and loop through many work tiles rather than paying the full non-persistent tail every time.

Persistent tiles

A CTA pulls a sequence of tiles, keeps its role assignment and pipeline state alive, and avoids the worst tail effects of one-shot block scheduling.
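One simple persistent policy is static striding through the tile list. This sketch is illustrative; real schedulers often hand out tiles through an atomic work counter instead:

```python
def persistent_schedule(num_tiles, grid_size):
    """Each persistent CTA strides through the tile space: CTA i takes
    tiles i, i + grid, i + 2*grid, ... until the work list is exhausted."""
    return {cta: list(range(cta, num_tiles, grid_size)) for cta in range(grid_size)}

sched = persistent_schedule(num_tiles=10, grid_size=4)
print(sched[0])  # [0, 4, 8]
print(sched[3])  # [3, 7]
```

The tail imbalance is visible even here: two CTAs get three tiles and two get two, which is the residue that Stream-K goes after by splitting along K.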

Stream-K

Work can be decomposed along K so multiple CTAs contribute partial sums to one output tile, then fix up the result through a reduction and epilogue ownership rule.

The Stream-K scheduler in this repo makes that explicit with a work unit that tracks M_idx, N_idx, L_idx, K_idx, and k_tile_count. Those fields tell the CTA where its split starts and how much of the K dimension it owns.

| Role | Predicate | Responsibility |
| --- | --- | --- |
| First split | K_idx == 0 | Initialize the partial result in workspace. |
| Middle split | 0 < K_idx and not final | Reduce into existing workspace state. |
| Final split | K_idx + k_tile_count == k_tiles_per_output_tile | Own the final fixup and epilogue for that output tile. |
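Those predicates translate directly into a role classifier. A Python sketch (the whole-tile case, where one unit covers all of K and needs no workspace at all, is added here for completeness and is not spelled out in the table):

```python
def split_role(k_idx, k_tile_count, k_tiles_per_output_tile):
    """Classify a Stream-K work unit: the first split initializes workspace,
    middle splits reduce into it, and the final split owns fixup plus the
    epilogue for the output tile."""
    is_final = (k_idx + k_tile_count == k_tiles_per_output_tile)
    if k_idx == 0 and is_final:
        return "whole-tile"  # one unit owns the entire K range; no fixup needed
    if k_idx == 0:
        return "first"
    return "final" if is_final else "middle"

# An output tile with 12 K tiles split across three work units:
print(split_role(0, 4, 12))  # first
print(split_role(4, 4, 12))  # middle
print(split_role(8, 4, 12))  # final
```

The M_idx, N_idx, and L_idx fields say which output tile the unit belongs to; K_idx and k_tile_count alone determine its role within that tile.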

The notes also highlight grouping as an L2-locality optimization on top of Stream-K. Rather than one completely global pool of stream-K units, groups partition work so cooperating CTAs stay closer to the same region of output space and share more useful cache state.

The epilogue is its own pipeline, not an afterthought

The output tile is usually too large to dump from registers to global memory in one step. Hopper epilogues break the tile into subtiles, optionally load source tensor C, apply the elementwise or scaling logic in registers, copy the converted result into shared memory, and then issue TMA stores for the completed subtile.

Register to shared

Consumers write each subtile into shared memory after type conversion. When C and D share element width, the same shared-memory buffers can be reused for load and store phases.

Shared to global

A cross-proxy fence plus bar.sync make the full subtile visible before the TMA store launches, and the store pipeline can overlap one subtile's write with work on the next subtile.
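The subtile walk itself is a simple iteration over the output tile. A sketch with illustrative shapes; each yielded coordinate corresponds to one convert-stage-store step of the pipeline described above:

```python
def epilogue_subtiles(tile_m, tile_n, sub_m, sub_n):
    """Walk the output tile in register-sized subtiles, mirroring the staged
    epilogue: convert in registers, stage through shared memory, then issue
    the TMA store for one subtile while the next is being produced."""
    for m0 in range(0, tile_m, sub_m):
        for n0 in range(0, tile_n, sub_n):
            yield (m0, n0)

steps = list(epilogue_subtiles(tile_m=128, tile_n=256, sub_m=64, sub_n=64))
print(len(steps))            # 8 store steps instead of one monolithic dump
print(steps[0], steps[-1])   # (0, 0) (64, 192)
```

The subtile size is what bounds the shared-memory footprint of the epilogue, which is how the same kernel can afford both mainloop stages and epilogue staging buffers.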

In the ping-pong design, the handoff is even more structured. One consumer's epilogue must coexist with the other consumer's MMA. The lesson describes a 2x2 ordered-barrier grid: one row for MMA handoff and one row for epilogue handoff. That is what lets the two consumers alternate without corrupting shared memory or starving tensor cores.

Important restraint: not every post-op belongs inside the epilogue. The notes are explicit that cross-tile reductions like layer norm or softmax often require a real kernel boundary, and the goal becomes making that handoff cheap rather than pretending it can always be fused away.

Continue the course

Lesson 8 turns the Hopper primitives into full-kernel machinery. The next two supplements stay inside the kernel layer: first Stream-K for tail balancing, then kernel launch control for the handoff between dependent grids before the course moves out to multi-GPU systems.