Why kernel launch deserves its own lesson
The notes are explicit that multi-kernel design is not just cleanup work after the "real" optimization. Some post-processing stages genuinely need a boundary. Epilogues hit a practical ceiling, fused work raises register pressure, and cross-tile operations like LayerNorm or Softmax need a point where one phase is complete before the next phase reasons over its output.
What forces the boundary
Cross-tile reductions, heavy epilogues, and state that no longer fits comfortably beside the accumulator mainloop make a clean handoff the right architecture.
What Hopper changes
The handoff no longer has to mean "producer fully retires, then consumer cold-starts." Hopper can overlap the boot path if the launch packet and instruction stream cooperate.
The lesson's core rule
Start early, read late. Let the dependent grid reserve context and do safe setup work before the producer is globally finished, but fence dependency-sensitive reads until it is actually safe.
Optimization target: do not ask "how do I avoid multiple kernels?" Ask "how do I make the boundary cheap enough that the right decomposition still runs close to full speed?"
Launch controls span both compile-time shape and runtime authority
Some launch decisions are embedded into the kernel signature and codegen assumptions. Others live in the host launch packet. This supplement keeps those layers separate because Hopper performance depends on knowing which knob changes occupancy, which knob changes parameter lifetime, and which knob changes scheduler policy.
| Control | What it expresses | Why it matters here |
|---|---|---|
| `__cluster_dims__(x, y, z)` | Compile-time thread block cluster shape. | Important on SM90 because cluster residency and multicast behavior are part of real kernel design. |
| `__launch_bounds__(...)` | Bounds on thread count and occupancy assumptions. | Launch shape is part of the resource budget that later overlap decisions have to live inside. |
| `__maxnreg__(N)` | Register cap per thread. | Matters when fused work and dependent-launch overlap are both pressuring residency. |
| `__grid_constant__` | Read-only grid-lifetime kernel parameter. | Exactly the kind of metadata path Hopper uses for descriptors and launch-stable control data. |
| `cudaLaunchKernelEx` + attributes | Host-side runtime authority over launch behavior. | This is where programmatic stream serialization permission is actually granted. |
The local fast.cu examples are useful here even though they are not dependent-launch demos. Files like `matmul_12.cuh` and `pingpong_experimental.cuh` show the launch-shape side of the story directly: cluster dimensions, launch bounds, and grid-constant descriptors are encoded into the kernels before any runtime launch packet ever appears.
```cuda
__global__ __launch_bounds__(NUM_THREADS)
void __cluster_dims__(CLUSTER_M * CLUSTER_N, 1, 1)
matmulKernel12(
    int M,
    int N,
    int K,
    const __grid_constant__ CUtensorMap tensorMapC,
    const __grid_constant__ CUtensorMap tensorMapA,
    const __grid_constant__ CUtensorMap tensorMapB,
    int* dspace);
```
Default stream serialization leaves a tail-latency hole
Under the default rules, kernels launched on the same stream behave like an ordered queue. Kernel B is not eligible until Kernel A fully retires. That means the handoff is tied to whole-grid retirement rather than to the earlier moment when enough of the producer's work is already complete for the next stage to begin preparing.
Why the tail hurts
Near the end of the producer grid, only a shrinking subset of SMs still has work. The rest of the machine can sit idle even when the dependent grid is already known and ready to go.
What is wasted
The cost is not just empty cycles. The dependent grid also pays a cold-start penalty later because no context was admitted and no safe prefetch work was allowed to begin early.
Default behavior:

```
producer grid fully retires
  -> dependent grid becomes eligible
  -> dependent grid starts booting
```
That is the exact serialization wall this supplement is trying to soften. The goal is not to violate correctness. The goal is to untie "eligible to boot" from "safe to read dependency-sensitive data."
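The cost of that wall can be made visible from the host. The sketch below, with hypothetical `producer_kernel` and `consumer_kernel` stand-ins, uses CUDA events to bracket the moment the producer fully retires and the moment the consumer finishes; under default ordering, everything between those two timestamps includes the consumer's full cold-start cost.

```cuda
// Sketch: exposing the default serialization wall with CUDA events.
// producer_kernel, consumer_kernel, gridA/B, blockA/B, and stream are
// hypothetical stand-ins, not names from the course code.
cudaEvent_t producerDone, consumerDone;
cudaEventCreate(&producerDone);
cudaEventCreate(&consumerDone);

producer_kernel<<<gridA, blockA, 0, stream>>>(/* ... */);
cudaEventRecord(producerDone, stream);  // fires only after the whole grid retires

consumer_kernel<<<gridB, blockB, 0, stream>>>(/* ... */);
cudaEventRecord(consumerDone, stream);

cudaStreamSynchronize(stream);
float ms = 0.f;
cudaEventElapsedTime(&ms, producerDone, consumerDone);
// Under default ordering, ms contains the consumer's boot time plus its
// run time, all serialized after whole-grid retirement. Dependent launch
// aims to slide part of that boot time under the producer's tail.
```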
Dependent launch works because Hopper splits scheduler permission from read permission
Grid dependency control introduces two control points. One is producer-side and talks to the scheduler. The other is consumer-side and fences execution until the dependency has actually resolved. That split is the mechanism that makes overlap possible without turning the handoff into a race.
| Mechanism | What it does | What it does not do |
|---|---|---|
| `griddepcontrol.launch_dependents` | Signals that a dependent grid may begin booting before the producer grid has globally retired. | It does not make dependency-sensitive loads safe on its own. |
| `griddepcontrol.wait` | Stops dependent warps at the exact fence where unresolved reads would become unsafe. | It does not itself grant early eligibility; it only fences execution inside the dependent grid. |
| `cudaLaunchAttributeProgrammaticStreamSerialization` | Host-side permission bit that tells the launch machinery to honor the dependent-launch protocol. | It does not replace the device-side instructions; the packet and the instruction stream are both required. |
The notes make the host-authority point strongly: device instructions do not override stream policy by themselves. If the host launch descriptor does not opt in, strict same-stream serialization remains in force even if the kernel contains the relevant instructions.
```cuda
// producer kernel on the stream
producer_kernel<<<gridA, blockA, 0, stream>>>(...);

// dependent kernel with programmatic stream serialization permission
cudaLaunchAttribute attr[1];
attr[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
attr[0].val.programmaticStreamSerializationAllowed = 1;

cudaLaunchConfig_t cfg{};
cfg.gridDim  = gridB;
cfg.blockDim = blockB;
cfg.stream   = stream;  // must be the producer's stream; a zero-initialized
                        // config would silently target the default stream
cfg.attrs    = attr;
cfg.numAttrs = 1;
cudaLaunchKernelEx(&cfg, consumer_kernel, ...);
```
The conceptual split: producer code says "the dependent may boot now." Dependent code says "I will stop right here until the unsafe reads are truly safe." Both sides are required.
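On the device side, CUDA C++ exposes the two PTX control points as intrinsics: `cudaTriggerProgrammaticLaunchCompletion()` lowers to `griddepcontrol.launch_dependents` and `cudaGridDependencySynchronize()` to `griddepcontrol.wait`. A minimal sketch of both halves, with illustrative kernel bodies and parameter names:

```cuda
// Producer: signal early eligibility once this block's output is stored.
__global__ void producer_kernel(float* out) {
    // ... compute and store this block's results to out ...
    __syncthreads();  // this block's stores are issued before signaling
    // griddepcontrol.launch_dependents: the dependent grid may begin
    // booting. This does NOT publish our data to the dependent grid.
    cudaTriggerProgrammaticLaunchCompletion();
}

// Consumer: do safe setup early, then fence before dependency-sensitive reads.
__global__ void consumer_kernel(const float* in, float* result) {
    // Safe setup: register and shared-memory work that never touches 'in'.
    // griddepcontrol.wait: stall until the producer grid's memory
    // is guaranteed visible.
    cudaGridDependencySynchronize();
    // From here on, reads of 'in' are safe.
    // ... consume in[], write result[] ...
}
```

Placing the trigger after the block's own stores, rather than at kernel entry, is what makes "start early, read late" a policy the producer can actually tune.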
The overlap window is where the optimization either pays off or collapses
Once the dependent grid can arrive early, the interesting question becomes timing. Launch too late and you hide almost none of the startup cost. Launch too early and the producer and consumer fight over the same execution slots, registers, shared memory, and cache capacity.
The dependent's wall
Warps in the dependent grid can run setup code and then park at `griddepcontrol.wait`. That reserves execution context early, which is useful but not free.
The producer's green light
`launch_dependents` should be placed at the latest safe point that still gives real overlap. Too conservative wastes the feature. Too aggressive creates contention and can lower total throughput.
The prefetch split
The notes highlight an asymmetric dependent kernel where dependency-bound warps stop at the wait fence while separate warps prefetch static weights into L2. That keeps the memory system busy while still respecting correctness for activation reads.
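That asymmetric split can be sketched as a warp-role branch inside the dependent kernel. All names here are illustrative assumptions (`NUM_PREFETCH_WARPS`, the parameter list), and the L2 prefetch uses the PTX `prefetch.global.L2` instruction:

```cuda
constexpr int NUM_PREFETCH_WARPS = 4;  // assumed tunable, not from the notes

// Asymmetric dependent kernel: prefetch warps warm L2 with launch-stable
// weights while dependency-bound warps go straight to the wait fence.
__global__ void consumer_kernel(const float* weights,      // static, safe early
                                const float* activations,  // producer-written
                                float* out, int weightBytes) {
    const int warpId = threadIdx.x / 32;
    const int laneId = threadIdx.x % 32;

    if (warpId < NUM_PREFETCH_WARPS) {
        // Weights do not depend on the producer, so touching them before
        // the fence is correct. Each warp sweeps 4 KiB chunks, one 128 B
        // cache line per lane.
        for (int off = warpId * 4096 + laneId * 128;
             off < weightBytes;
             off += NUM_PREFETCH_WARPS * 4096) {
            const char* p = reinterpret_cast<const char*>(weights) + off;
            asm volatile("prefetch.global.L2 [%0];" :: "l"(p));
        }
    }

    // Every warp fences here before any activation read.
    cudaGridDependencySynchronize();
    // ... all warps now compute with (hopefully L2-resident) weights
    //     and freshly produced activations ...
}
```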
| Too little overlap | Too much overlap | Healthy target |
|---|---|---|
| The dependent grid still pays most of its cold-start cost. | Producer and consumer compete for issue slots, residency, L2, and memory queues. | Launch at the latest safe moment that still hides meaningful startup latency. |
| Little net gain from the feature. | Prefetched lines can churn out of L2 before compute needs them. | Prefetch only stable, reusable tensors whose safety is independent of the producer's completion. |
| The dependent grid remains mostly idle until the producer is done anyway. | Total runtime can rise even though concurrency looks more impressive on paper. | Watch L2 hit rate, queue pressure, and residency headroom rather than judging overlap by appearance alone. |
Dependent-launch protocol:

```
producer issues launch_dependents
  -> dependent grid boots
  -> dependency-bound warps stop at wait
  -> dependency resolves
  -> stalled warps continue and consume the produced data
```
Practical guidance
- Do not treat multi-kernel boundaries as failure. If the math needs a real boundary, the right question is how to hide its cost, not how to deny that it exists.
- Keep launch policy layered correctly. Compile-time launch shape, grid-lifetime parameters, and host-side launch permission solve different problems.
- Remember the host owns the exception. Programmatic stream serialization is opt-in behavior, not the default interpretation of same-stream work.
- Separate boot overlap from read safety. Early scheduler admission and correct memory visibility are related, but they are not the same event.
- Tune for net throughput, not for dramatic overlap screenshots. Concurrency that damages cache residency or queue pressure can easily lose to a slightly later, cleaner launch point.
Glossary
| Term | Definition |
|---|---|
| Programmatic stream serialization | A host-authorized relaxation of default same-stream ordering that allows dependent grids to participate in the launch/wait protocol. |
| Launch dependents | The producer-side signal that tells the scheduler a dependent grid may begin booting. |
| Dependent wait | The consumer-side fence that stalls warps before dependency-sensitive reads become legal. |
| Overlap window | The timing gap between early dependent admission and the moment the unresolved dependency actually becomes safe to consume. |
| Grid constant | A read-only kernel parameter guaranteed to stay stable for the lifetime of one launched grid. |
Continue the course
This supplement finishes the single-node kernel handoff story. The next lesson leaves kernel boundaries behind and moves into the fabric itself: NVLink, NVSwitch, rails, and the system topology that governs how many H100s become one training machine.
Multi GPU Part 1
Continue into DGX H100 topology, NVLink, NVSwitch, ConnectX, rails, and why the communication fabric becomes the new bottleneck.
Code Anchor: `matmul_12.cuh`
See launch bounds, cluster dimensions, and grid-constant tensor maps inside a concrete Hopper GEMM kernel.
Code Anchor: Ping-Pong Experimental
Compare another launch-shape example where cluster-aware kernel setup and descriptor-driven inputs shape the runtime budget.