## Why lifecycle control matters

`wgmma.mma_async` returns immediately. That is the whole reason Hopper can overlap tensor math with TMA loads, pointer work, and staging for the next tile. But once math is asynchronous, the program also needs a way to bundle launches, cap how much is in flight, and know when accumulators are actually safe to consume.
### Track batches, not individual MMAs

The notes are explicit that groups exist so hardware does not have to scoreboard every single MMA independently. A tile's worth of issued MMAs becomes one trackable unit.
### Drain only when the next phase needs it

You usually do not want a "wait for everything" barrier after every issue window. You keep some groups in flight, stage the next round of work, and drain fully only before register consumption or store.
The pattern to remember: issue a batch, commit it, start preparing the next batch, wait only until the in-flight depth falls back to the desired window, and use `wait_group 0` only when you truly need fully materialized results.
## Commit, wait, and fence define the WGMMA lifecycle

Grouping gives WGMMA pipeline-friendly granularity. The current batch of async MMAs stays "open" until you close it with `wgmma.commit_group.sync.aligned`. Later, `wgmma.wait_group` lets the warpgroup stall only when too many committed batches are still in flight.
```ptx
// One output-tile worth of async math
wgmma.mma_async ...
wgmma.mma_async ...
wgmma.commit_group.sync.aligned;

// Prepare the next tile while some math still runs:
// TMA, pointer math, shared-memory staging, etc.
wgmma.wait_group.sync.aligned 1;

// Before reading accumulators or exporting them
wgmma.wait_group.sync.aligned 0;
```
| Instruction | What it does | What it does not do |
|---|---|---|
| `wgmma.commit_group` | Closes the current batch so hardware can track it as one committed group. | It does not wait for completion. |
| `wgmma.wait_group N` | Stalls until at most N committed groups remain in flight. | It is not a blanket "synchronize all tensor work" primitive unless N = 0. |
| `wgmma.fence.sync.aligned` | Marks the warpgroup issue boundary and orders accumulator / register-sourced operand hazards for WGMMA. | It is not the async proxy fence for TMA or shared-memory visibility. |
The lesson is careful about that last distinction. `wgmma.fence` is about the WGMMA pipeline itself, especially accumulators and register-resident A fragments. If shared-memory contents were produced through the async proxy path, you still need the right cross-proxy publication rule before consuming them.
## stmatrix is how opaque accumulator fragments become a usable tile

When WGMMA finishes, the result is not sitting in registers as a clean row-major matrix. It is fragmented across 128 threads in the exact layout tensor cores use internally. `stmatrix` is the cooperative store instruction that writes those fragments back to shared memory in a normal matrix layout.
### Like ldmatrix, but in reverse

Data ownership and address ownership are decoupled. Threads provide the registers they hold and the addresses required to land those fragments into the correct row-major destination.
### Still tile-structured

The basic atom is an m8n8 matrix tile. Variants like `.x4` let each thread contribute four packed registers so larger logical tiles can be serialized out a slice at a time.
```ptx
// From the repo's matmul_12.cuh epilogue flow
// convert/pack fragments
// stmatrix.sync.aligned.m8n8.x4.trans.shared::cta.b16 [...]
// bar.sync across the warpgroup
// issue TMA store for the completed tile
```
### FP32 accumulators are the important caveat

`stmatrix` does not support `.f32` data. It supports packed `.b16` and `.b8` forms. So if WGMMA accumulated into FP32 registers, you cannot feed those registers directly into `stmatrix`. You first downcast and pack the values, which is why FP32 output paths often include an explicit conversion loop before the shared-memory store.
Why the loop exists: one logical output tile is usually larger than a single `stmatrix.x4` call can cover, so the epilogue serializes the accumulator slices, writes them to shared memory, then hands the tile to a later TMA store path.
## FP8 is a transport and layout problem as much as a precision choice
Hopper's FP8 path matters because it changes both math throughput and the operand contract. The lesson frames FP8 as a compressed training format that still accumulates in FP32, but whose range, scaling, packing, and layout decisions can make or break correctness and performance.
| Format | What it favors | Why it shows up |
|---|---|---|
| `e4m3` | More mantissa precision, smaller dynamic range. | Common for activations and weights when range is manageable. |
| `e5m2` | Larger dynamic range, less mantissa precision. | Useful for gradients and paths where magnitude swings are larger. |
Scaling is the escape hatch that keeps 8-bit floating point useful. Tensor-wise, vector-wise, and block-wise schemes all exist because the raw FP8 range is too cramped to trust without a quantization policy. The notes also call out `.satfinite` conversion forms so out-of-range values clamp instead of exploding into unusable results.
```ptx
// Representative PTX conversion style
cvt.rn.satfinite.e4m3x2.f16x2 ...
cvt.rn.satfinite.e5m2x2.f32 ...
```
### x4 packs fill registers

`e4m3x4` and `e5m2x4` place four FP8 values into one 32-bit register. That is the natural storage and transport form.
### x2 packs match the math path

`e4m3x2` and `e5m2x2` are 16-bit packed forms that line up with how Hopper likes to ingest lower-precision math fragments.
### Hopper FP8 WGMMA has stricter operand rules

The course notes call out the fixed m64nNk32 FP8 shape and a crucial Hopper limitation: the obvious `ldmatrix ... .b8` path is not what SM90a exposes here. That is why practical Hopper FP8 kernels tend to use either shared-memory descriptors for both operands or an RS path where A is loaded into registers with ordinary shared loads, already packed in the layout WGMMA expects.
The same notes also stress the strict K-major rule. Unlike the FP16/BF16 forms, FP8 WGMMA does not give you the same transpose escape hatch. If the staged operand tiles are not already K-major, you pay for extra repacking and synchronization before the tensor cores can consume them.
Numerical caution: even when the documented accumulator type is FP32, the lesson points to empirical reports that FP8 accumulation can behave like a reduced-precision FP32 variant. Large K reductions may need K-slicing or staged accumulation to keep error growth under control.
## Sparse WGMMA adds a second contract: metadata correctness

Sparse WGMMA is still matrix multiply-accumulate, but the hardware assumes a very specific structured sparsity pattern. The supported model is sparse A times dense B. A carries the compressed data plus the metadata that tells tensor cores where the surviving values belong.
**`sp_meta`**: A packed register containing the position metadata for the surviving elements. For 2:4 sparsity, it records which two indices out of each quartet remain.

**`sp_sel`**: A selector telling hardware which threads are responsible for presenting metadata for the current instruction shape. If that selector does not match the actual loader strategy, the tensor core reads the wrong metadata.
| Case | Selector rule from the notes | Why it matters |
|---|---|---|
| FP16 / BF16 sparse | `sp_sel` can choose either contributing thread pair. | Lets you match the selector to where the metadata was actually loaded. |
| FP8 / INT8 sparse | `sp_sel` must be 0. | The notes treat any other value as undefined for these shapes. |
| Replicated metadata loaders | Use the simple selector and broadcast the same metadata to the relevant threads. | Reduces the chance of misaligned metadata ownership. |
The physical sparsity rule is 2:4. You do not just drop two values and hope for the best. You must store the survivors plus enough metadata for hardware to reconstruct their original positions inside each quartet. Wrong metadata does not degrade gracefully; it simply makes the tensor core read the wrong sparse pattern.
## Practical guidance

- Think in issue windows. Batch async MMAs into logical groups, then choose a wait depth that preserves overlap instead of collapsing back to full serialization.
- Drain only when you consume. Use `wgmma.wait_group 0` before reading accumulators, exporting them with `stmatrix`, or otherwise treating the results as complete.
- Separate register ordering from memory visibility. `wgmma.fence` is not a substitute for the async proxy publication rules used by TMA-backed pipelines.
- Treat FP8 as a layout contract. Scaling, packing, K-major staging, and conversion policy matter as much as the nominal data type.
- Never hand-wave sparse metadata. Sparse tensor cores only work when the payload and selector logic match exactly.
## Glossary

| Term | Definition |
|---|---|
| `wgmma.commit_group` | Closes the current batch of async MMAs so hardware can track it as one committed group. |
| `wgmma.wait_group` | Wait primitive that limits how many committed WGMMA groups remain in flight. |
| `stmatrix` | Cooperative matrix store instruction that writes tensor-core fragments from registers back to shared memory. |
| `e4m3` / `e5m2` | Hopper's two FP8 floating-point formats, trading mantissa precision against dynamic range. |
| `sp_meta` | Packed metadata telling sparse tensor cores where the surviving nonzero values from sparse A belong. |
## Continue the course
Lesson 7 completes the WGMMA instruction model. The next lesson zooms out from one primitive to the whole kernel: arithmetic intensity, warp specialization, circular buffers, persistent scheduling, and the pipeline shapes that keep Hopper busy for full GEMMs.
**Kernel Design**: Move from WGMMA mechanics into ping-pong versus cooperative pipelines, scheduling, and epilogue design.

**Code Anchor: matmul_12.cuh**: See `wgmma.fence`, commit and wait helpers, the `stmatrix` epilogue loop, and the TMA store handoff in one kernel.

**RS WGMMA Mainloop**: Open the register-sourced SM90 path that makes the FP8 and layout discussion concrete inside a production kernel shape.

**Back to Lesson Dossiers**: Return to the homepage and keep moving through the integrated site-native lesson pages.