Lesson 08.1

Stream-K

Lesson 8.1 is the scheduler addendum that makes persistent kernels feel complete. Instead of assigning one whole output tile to one CTA and accepting a weak tail wave, Stream-K slices the remaining K work into a balanced tape, tracks split ownership explicitly, and pays the reduction cost only where the tail actually needs it.

Topics: Stream-K, Fixup, Persistent Scheduling, Groups, L2 Locality

Why Stream-K exists

The problem is not correctness. Plain data-parallel persistent scheduling is correct. The problem is utilization. If the output tile count is not a clean multiple of available SMs, the last wave leaves a large fraction of the machine idle while only a handful of CTAs finish the tail.

Tile-parallel view

One CTA owns one output tile. That is clean and cheap until the last wave is sparse and many SMs are waiting for a few final tiles to retire.

Stream-K view

The remaining work is treated as a tape of math iterations along K. The tail tile can be split so every SM finishes closer to the same time.

The notes are careful to point out that this is usually a hybrid strategy. Early waves often stay purely data-parallel because that has the lowest overhead; Stream-K is most valuable when it targets the tail wave, where the decomposition mismatch is doing visible damage.

Useful heuristic from the notes: if the tail is still mostly full, it can be better to keep the normal scheduler and avoid unnecessary split-K reduction overhead. Stream-K is most attractive when the tail is meaningfully under-filled.

Work units and splits make the scheduler explicit

For one output tile, GEMM is the sum of many K tiles. If a CTA computes all of them, no reduction is needed. If multiple CTAs each compute a fraction of those K tiles, each CTA owns a split of that output tile and the result needs fixup.

Field | Meaning | Why it matters
M_idx, N_idx, L_idx | The output tile coordinates. | Tell the CTA which C tile it contributes to.
K_idx | Where this split begins inside the output tile's K dimension. | Distinguishes the first split from middle or final splits.
k_tile_count | How many K tiles this split computes. | Tells you how much math this CTA actually owns for that output tile.
k_tile_remaining | How much of the work unit is still left to process. | Matters because one CTA can span multiple splits as it walks the work tape.

A split is simply one CTA's contribution to one output tile. One CTA can end a tile and begin the next one inside the same assigned range of math work, which is why the scheduler tracks both tile coordinates and the exact K start within the tile.

// Conceptual split state
tile: (M_idx, N_idx, L_idx)
K_idx: where this CTA starts in the tile's K dimension
k_tile_count: how much K this CTA computes

is_final_split = (K_idx + k_tile_count) == k_tiles_per_output_tile

The first, middle, and final split roles fall directly out of that state

Once a CTA knows its K_idx and k_tile_count, its role is no longer mysterious. The slides describe three cases, and the Stream-K scheduler header in this repo mirrors them with helpers like is_final_split(...) and compute_epilogue(...).

First split

K_idx == 0. Nobody has written workspace for this tile yet, so the first split initializes the partial result.

Middle split

Owns a K range strictly inside the tile. It reduces its partials into existing workspace and does not own the final epilogue.

Final split

Covers the end of the K dimension. It waits for prior splits, loads their partials, adds its own, and then owns the final epilogue.

This is the real conceptual shift. A CTA can still be persistent and still follow a deterministic work loop, but it is no longer guaranteed to own a complete output tile from start to finish.

Backward iteration helps the final split: the notes emphasize that workers often iterate through shared tiles in reverse K order, so the final split reaches its fixup point later and spends less time waiting for earlier splits to publish their partials.

The workspace and lock protocol are the cost of balancing the tail

Split ownership only works because partial accumulators can be staged in global workspace and progress can be tracked with a lock. The notes describe that lock as an integer per output tile that monotonically encodes how much K work has already been completed and published.

Mechanism | Purpose | Effect on execution
Reduction workspace | Stores partial accumulators for tiles split across CTAs. | Creates the place where later splits can load and reduce prior work.
Lock / progress counter | Records how much K work has already been completed for the tile. | Lets later splits know whether they can proceed with deterministic or opportunistic reduction.
Separate reduction units | Allow reduction and epilogue work to be modeled explicitly for some scheduler modes. | Decouples math ownership from final fixup ownership when the scheduler decides that is beneficial.

The Stream-K header in this repo exposes both deterministic and non-deterministic reduction modes. The deterministic path waits for the exact cumulative K progress it expects. The non-deterministic path is looser and cares mainly that the workspace has been initialized before middle splits race to reduce into it.

// Conceptual fixup flow
first split   -> store partials, publish progress
middle split  -> wait until workspace is valid, reduce partials, publish progress
final split   -> wait for prior progress, load-add all required partials, run epilogue

Groups recover locality after the 1D work tape scatters CTAs

A naive Stream-K decomposition balances work but destroys the nice spatial wave pattern that helps L2. Groups are the locality repair mechanism. They partition Stream-K units into sub-groups so cooperating workers stay closer to the same region of output space and reuse more useful cache state.

Base persistent scheduler

Keeps the standard tile-parallel persistent loop, swizzle, and raster order. It is the cheap baseline when split-K fixup is not needed.

Grouped persistent scheduler

Extends persistence across multiple grouped GEMM problems, treating them as one long linear tile space while preserving per-group swizzle and locality metadata.

Stream-K scheduler

Adds split tracking, reduction ownership, and locality groups so the tail can be balanced without giving up the reuse story completely.

File | Role in the story
sm90_tile_scheduler.hpp | The base static persistent scheduler and swizzle machinery.
sm90_tile_scheduler_group.hpp | The grouped persistent extension for multiple GEMM problems in one launch.
sm90_tile_scheduler_stream_k.hpp | The split-K, fixup, reduction, and grouped-locality scheduler used for the Stream-K path.

The notes also mention HyTiS as an alternative way to fight wave quantization. Its point is useful even if you do not adopt it: Stream-K is not the only answer. It is one answer that trades extra reduction machinery for better tail utilization when K is the natural dimension to slice.

Practical guidance

  1. Use Stream-K for the mismatch, not for its own sake. The goal is to fix the tail wave, not to split every tile when the normal persistent scheduler is already efficient.
  2. Track split ownership explicitly. K_idx, k_tile_count, and final-split status are the mechanism that decides whether a CTA stores, reduces, or runs the epilogue.
  3. Remember that fixup is a real cost. Workspace traffic, locks, and cross-CTA reduction are why hybrid policies often outperform applying Stream-K everywhere.
  4. Protect locality after balancing. Groups matter because a pure 1D work tape can solve utilization while quietly destroying L2 reuse.
  5. Read the scheduler files as a family. The base persistent scheduler, grouped scheduler, and Stream-K scheduler are three connected answers to the same utilization problem.

Glossary

Split: One CTA's contribution to one output tile when the tile is divided along K.
Fixup: The reduction of partial accumulators from multiple CTAs into one final output tile.
Final split: The split whose K range reaches the end of the tile and therefore owns the final epilogue.
Separate reduction: A scheduler mode where reduction and epilogue work can be modeled as distinct work units.
Group: A locality-preserving subset of Stream-K units that cooperates on its own portion of the work tape.

Continue the course

The scheduler addendum finishes the tail-balancing story, but one more kernel supplement remains. The next page looks at the launch boundary itself: launch bounds, dependent grids, programmatic stream serialization, and how Hopper overlaps the boot of one grid with the tail of another.