Why Stream-K exists
The problem is not correctness. Plain data-parallel persistent scheduling is correct. The problem is utilization. If the output tile count is not a clean multiple of available SMs, the last wave leaves a large fraction of the machine idle while only a handful of CTAs finish the tail.
Tile-parallel view
One CTA owns one output tile. That is clean and cheap until the last wave is sparse and many SMs are waiting for a few final tiles to retire.
Stream-K view
The remaining work is treated as a tape of math iterations along K. The tail tile can be split so every SM finishes closer to the same time.
The notes are careful to frame this as a hybrid strategy in practice. Early waves often stay purely data-parallel because that has the lowest overhead; Stream-K is most valuable when it targets the tail wave, where the decomposition mismatch does visible damage.
Useful heuristic from the notes: if the tail is still mostly full, it can be better to keep the normal scheduler and avoid unnecessary split-K reduction overhead. Stream-K is most attractive when the tail is meaningfully under-filled.
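That heuristic is just modular arithmetic. A minimal sketch of the quantization math (the function name is illustrative; 132 SMs matches an H100 SXM part, but substitute your own device's count):

```cpp
#include <cassert>

// Fraction of SMs doing useful work in the last wave under a plain
// tile-parallel schedule (one CTA per output tile).
double tail_utilization(int total_tiles, int num_sms) {
    int tail = total_tiles % num_sms;   // tiles left over after the full waves
    if (tail == 0) return 1.0;          // last wave is exactly full
    return static_cast<double>(tail) / num_sms;
}
```

For 140 output tiles on 132 SMs, the tail wave runs 8 CTAs while 124 SMs idle, roughly 6% utilization for the duration of that wave; that is the under-filled case where Stream-K pays for its reduction overhead.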
Work units and splits make the scheduler explicit
For one output tile, GEMM is the sum of many K tiles. If a CTA computes all of them, no reduction is needed. If multiple CTAs each compute a fraction of those K tiles, each CTA owns a split of that output tile and the result needs fixup.
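The reason the split-and-fix-up approach is sound is that the K accumulation is a plain sum, so it can be cut anywhere and reduced later. A minimal host-side model, with accumulate_range standing in for one CTA's share of the K loop (illustrative names, not CUTLASS code):

```cpp
#include <numeric>
#include <vector>
#include <cassert>

// Model one output element's K loop as a vector of per-K-tile partial
// products. One CTA computing [begin, begin + count) produces a partial
// accumulator; summing the partials from all splits reproduces the full loop.
double accumulate_range(const std::vector<double>& k_partials,
                        int begin, int count) {
    return std::accumulate(k_partials.begin() + begin,
                           k_partials.begin() + begin + count, 0.0);
}
```

One caveat the later sections return to: floating-point addition is only approximately associative, which is why the scheduler distinguishes deterministic from non-deterministic reduction orders.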
| Field | Meaning | Why it matters |
|---|---|---|
| M_idx, N_idx, L_idx | The output tile coordinates. | Tell the CTA which C tile it contributes to. |
| K_idx | Where this split begins inside the output tile's K dimension. | Distinguishes the first split from middle or final splits. |
| k_tile_count | How many K tiles this split computes. | Tells you how much math this CTA actually owns for that output tile. |
| k_tile_remaining | How much of the work unit is still left to process. | Matters because one CTA can span multiple splits as it walks the work tape. |
A split is simply one CTA's contribution to one output tile. One CTA can end a tile and begin the next one inside the same assigned range of math work, which is why the scheduler tracks both tile coordinates and the exact K start within the tile.
```
// Conceptual split state
tile: (M_idx, N_idx, L_idx)
K_idx: where this CTA starts in the tile's K dimension
k_tile_count: how much K this CTA computes
is_final_split = (K_idx + k_tile_count) == k_tiles_per_output_tile
```
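To make the claim that one CTA can cross tile boundaries concrete, here is a host-side sketch of the 1D work tape, not the CUTLASS implementation (Split and splits_for_cta are illustrative names; iterations are divided evenly with remainders going to the first CTAs):

```cpp
#include <algorithm>
#include <vector>
#include <cassert>

struct Split {
    int tile_idx;   // linear output-tile index (stands in for M/N/L coords)
    int k_begin;    // K_idx: first K tile this CTA computes in the tile
    int k_count;    // k_tile_count for this split
};

// Hand each CTA a contiguous range of the tape (num_tiles * k_per_tile
// iterations), then cut that range at tile boundaries into splits.
std::vector<Split> splits_for_cta(int cta, int num_ctas,
                                  int num_tiles, int k_per_tile) {
    long total = static_cast<long>(num_tiles) * k_per_tile;
    long begin = total * cta / num_ctas;
    long end   = total * (cta + 1) / num_ctas;
    std::vector<Split> out;
    while (begin < end) {
        int tile = static_cast<int>(begin / k_per_tile);
        int k0   = static_cast<int>(begin % k_per_tile);
        int take = static_cast<int>(
            std::min<long>(end - begin, k_per_tile - k0));
        out.push_back({tile, k0, take});
        begin += take;
    }
    return out;
}
```

With 3 tiles of 4 K tiles shared by 2 CTAs, CTA 0 finishes tile 0 and then begins tile 1 inside the same assigned range, which is exactly why the scheduler tracks both the tile coordinates and the K start within the tile.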
The first, middle, and final split roles fall directly out of that state
Once a CTA knows its K_idx and k_tile_count, its role is no longer mysterious. The slides describe three cases, and the Stream-K scheduler header in this repo mirrors them with helpers like is_final_split(...) and compute_epilogue(...).
First split
K_idx == 0. Nobody has written workspace for this tile yet, so the first split
initializes the partial result.
Middle split
Owns a K range strictly inside the tile. It reduces its partials into existing workspace and does not own the final epilogue.
Final split
Covers the end of the K dimension. It waits for prior splits, loads their partials, adds its own, and then owns the final epilogue.
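All three roles fall out of the same two fields. A hedged sketch of the classification (role_of and SplitRole are illustrative names, not CUTLASS helpers; "Full" covers the unsplit case where no fixup is needed):

```cpp
#include <cassert>

enum class SplitRole { First, Middle, Final, Full };

// Derive the role from (K_idx, k_tile_count) for a tile with
// k_tiles_per_output_tile K tiles in total.
SplitRole role_of(int K_idx, int k_tile_count, int k_tiles_per_output_tile) {
    bool starts = (K_idx == 0);
    bool ends   = (K_idx + k_tile_count == k_tiles_per_output_tile);
    if (starts && ends) return SplitRole::Full;    // whole tile, no fixup
    if (starts)         return SplitRole::First;   // initializes workspace
    if (ends)           return SplitRole::Final;   // owns the epilogue
    return SplitRole::Middle;                      // reduces into workspace
}
```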
This is the real conceptual shift. A CTA can still be persistent and still follow a deterministic work loop, but it is no longer guaranteed to own a complete output tile from start to finish.
Backward iteration helps the final split: the notes emphasize that workers often iterate through shared tiles in reverse K order, so the final split reaches the fixup point later and spends less time waiting for the earlier splits to finish.
The workspace and lock protocol are the cost of balancing the tail
Split ownership only works because partial accumulators can be staged in global workspace and progress can be tracked with a lock. The notes describe that lock as an integer per output tile that monotonically encodes how much K work has already been completed and published.
| Mechanism | Purpose | Effect on execution |
|---|---|---|
| Reduction workspace | Stores partial accumulators for tiles split across CTAs. | Creates the place where later splits can load and reduce prior work. |
| Lock / progress counter | Records how much K work has already been completed for the tile. | Lets later splits know whether they can proceed with deterministic or opportunistic reduction. |
| Separate reduction units | Allow reduction and epilogue work to be modeled explicitly for some scheduler modes. | Decouples math ownership from final fixup ownership when the scheduler decides that is beneficial. |
The Stream-K header in this repo exposes both deterministic and non-deterministic reduction modes. The deterministic path waits for the exact cumulative K progress it expects. The non-deterministic path is looser and cares mainly that the workspace has been initialized before middle splits race to reduce into it.
```
// Conceptual fixup flow
first split -> store partials, publish progress
middle split -> wait until workspace is valid, reduce partials, publish progress
final split -> wait for prior progress, load-add all required partials, run epilogue
```
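A host-side model of the deterministic variant of that protocol, assuming the lock is the cumulative count of published K tiles for the tile (names and structure are illustrative, not the CUTLASS API; on the GPU this uses device-scope atomics on global memory):

```cpp
#include <atomic>
#include <vector>
#include <cassert>

// Per-output-tile progress state: an integer that monotonically encodes how
// much K work has been completed and published, plus the staged partials.
struct TileProgress {
    std::atomic<int> k_done{0};     // published cumulative K-tile progress
    std::vector<double> workspace;  // partial accumulators staged by splits
};

// Deterministic reduction: a split starting at K_idx may proceed only once
// exactly the K tiles before its range have been published.
bool may_reduce(const TileProgress& p, int K_idx) {
    return p.k_done.load(std::memory_order_acquire) == K_idx;
}

// After storing or reducing its partials, a split publishes its progress.
void publish(TileProgress& p, int k_tile_count) {
    p.k_done.fetch_add(k_tile_count, std::memory_order_release);
}
```

The non-deterministic mode would relax may_reduce to check only that the first split has initialized the workspace, letting middle splits reduce in arrival order.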
Groups recover locality after the 1D work tape scatters CTAs
A naive Stream-K decomposition balances work but destroys the nice spatial wave pattern that helps L2. Groups are the locality repair mechanism. They partition Stream-K units into sub-groups so cooperating workers stay closer to the same region of output space and reuse more useful cache state.
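A minimal sketch of the grouping idea, assuming groups are contiguous sub-ranges of the Stream-K tile range (tiles_for_group is an illustrative name; the real scheduler also preserves per-group swizzle):

```cpp
#include <utility>
#include <cassert>

// Cut the Stream-K tile range into num_groups contiguous sub-ranges and pin
// each worker to one sub-range, so its splits stay in a compact region of
// output space and touch overlapping A/B operand tiles in L2.
std::pair<int, int> tiles_for_group(int group, int num_groups, int num_tiles) {
    int begin = static_cast<int>(
        static_cast<long>(num_tiles) * group / num_groups);
    int end = static_cast<int>(
        static_cast<long>(num_tiles) * (group + 1) / num_groups);
    return {begin, end};  // [begin, end) tile indices owned by this group
}
```

The contrast with a naive tape is that here a worker's neighbours in launch order work on neighbouring tiles, instead of being scattered across the whole output.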
Base persistent scheduler
Keeps the standard tile-parallel persistent loop, swizzle, and raster order. It is the cheap baseline when split-K fixup is not needed.
Grouped persistent scheduler
Extends persistence across multiple grouped GEMM problems, treating them as one long linear tile space while preserving per-group swizzle and locality metadata.
Stream-K scheduler
Adds split tracking, reduction ownership, and locality groups so the tail can be balanced without giving up the reuse story completely.
| File | Role in the story |
|---|---|
| sm90_tile_scheduler.hpp | The base static persistent scheduler and swizzle machinery. |
| sm90_tile_scheduler_group.hpp | The grouped persistent extension for multiple GEMM problems in one launch. |
| sm90_tile_scheduler_stream_k.hpp | The split-K, fixup, reduction, and grouped-locality scheduler used for the Stream-K path. |
The notes also mention HyTiS as an alternative way to fight wave quantization. Its point is useful even if you do not adopt it: Stream-K is not the only answer. It is one answer that trades extra reduction machinery for better tail utilization when K is the natural dimension to slice.
Practical guidance
- Use Stream-K for the mismatch, not for its own sake. The goal is to fix the tail wave, not to split every tile when the normal persistent scheduler is already efficient.
- Track split ownership explicitly. K_idx, k_tile_count, and final-split status are the mechanism that decides whether a CTA stores, reduces, or runs the epilogue.
- Remember that fixup is a real cost. Workspace traffic, locks, and cross-CTA reduction are why hybrid policies often outperform applying Stream-K everywhere.
- Protect locality after balancing. Groups matter because a pure 1D work tape can solve utilization while quietly destroying L2 reuse.
- Read the scheduler files as a family. The base persistent scheduler, grouped scheduler, and Stream-K scheduler are three connected answers to the same utilization problem.
Glossary
| Term | Definition |
|---|---|
| Split | One CTA's contribution to one output tile when the tile is divided along K. |
| Fixup | The reduction of partial accumulators from multiple CTAs into one final output tile. |
| Final split | The split whose K range reaches the end of the tile and therefore owns the final epilogue. |
| Separate reduction | A scheduler mode where reduction and epilogue work can be modeled as distinct work units. |
| Group | A locality-preserving subset of Stream-K units that cooperates on its own portion of the work tape. |
Continue the course
The scheduler addendum finishes the tail-balancing story, but one more kernel supplement remains. The next page looks at the launch boundary itself: launch bounds, dependent grids, programmatic stream serialization, and how Hopper overlaps the boot of one grid with the tail of another.