
WGMMA Part 2

Lesson 7 covers the rest of the warpgroup story. Once WGMMA is in flight, you still need a real lifecycle: grouping and draining async math, handling accumulator hazards correctly, turning opaque register fragments into storable tiles with stmatrix, and navigating FP8 and sparse tensor-core constraints without losing the machine model.

Topics: commit_group, wait_group, stmatrix, FP8, sparse WGMMA

Why lifecycle control matters

wgmma.mma_async returns immediately. That is the whole reason Hopper can overlap tensor math with TMA loads, pointer work, and staging for the next tile. But once math is asynchronous, the program also needs a way to bundle launches, cap how much is in flight, and know when accumulators are actually safe to consume.

Track batches, not individual MMAs

The notes are explicit that groups exist so hardware does not have to scoreboard every single MMA independently. A tile's worth of issue becomes one trackable unit.

Drain only when the next phase needs it

You usually do not want a “wait for everything” barrier after every issue window. You keep some groups in flight, stage the next work, and drain fully only before register consumption or store.

The pattern to remember: issue a batch, commit it, start preparing the next batch, wait only until the in-flight depth falls back to the desired window, and use wait_group 0 when you truly need fully materialized results.

Commit, wait, and fence define the WGMMA lifecycle

Grouping gives WGMMA pipeline-friendly granularity. The current batch of async MMAs stays “open” until you close it with wgmma.commit_group.sync.aligned. Later, wgmma.wait_group lets the warpgroup stall only when too many committed batches are still in flight.

// One output-tile worth of async math
wgmma.mma_async ...
wgmma.mma_async ...
wgmma.commit_group.sync.aligned;

// Prepare the next tile while some math still runs
// TMA, pointer math, shared-memory staging, etc.
wgmma.wait_group.sync.aligned 1;

// Before reading accumulators or exporting them
wgmma.wait_group.sync.aligned 0;
Instruction | What it does | What it does not do
wgmma.commit_group | Closes the current batch so hardware can track it as one committed group. | It does not wait for completion.
wgmma.wait_group N | Stalls until at most N committed groups remain in flight. | It is not a blanket “synchronize all tensor work forever” primitive unless N = 0.
wgmma.fence.sync.aligned | Marks the warpgroup issue boundary and orders accumulator / register-sourced operand hazards for WGMMA. | It is not the async proxy fence for TMA or shared-memory visibility.

The lesson is careful about that last distinction. wgmma.fence is about the WGMMA pipeline itself, especially accumulators and register-resident A fragments. If shared-memory contents were produced through the async proxy path, you still need the right cross-proxy publication rule before consuming them.
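A schematic of how the two kinds of ordering compose in a mainloop. The instruction spellings follow the PTX ISA; the surrounding structure is illustrative, not a complete kernel.

```
// Producer side (generic proxy): stage operands in shared memory
st.shared ...                      // ordinary stores into the operand tiles
fence.proxy.async;                 // publish generic-proxy writes to the async proxy

// Consumer side (WGMMA pipeline)
wgmma.fence.sync.aligned;          // order prior accumulator/register accesses
wgmma.mma_async ...                // async math reads the published tiles
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;   // accumulators now safe to read
```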

stmatrix is how opaque accumulator fragments become a usable tile

When WGMMA finishes, the result is not sitting in registers as a clean row-major matrix. It is fragmented across 128 threads in the exact layout tensor cores use internally. stmatrix is the cooperative store instruction that writes those fragments back to shared memory in a normal matrix layout.

Like ldmatrix, but in reverse

Data ownership and address ownership are decoupled. Threads provide the registers they hold and the addresses required to land those fragments into the correct row-major destination.

Still tile-structured

The basic atom is an m8n8 matrix tile. Variants like .x4 let each thread contribute four packed registers so larger logical tiles can be serialized out a slice at a time.

// From the repo's matmul_12.cuh epilogue flow
// convert/pack fragments
// stmatrix.sync.aligned.m8n8.x4.trans.shared::cta.b16 [...]
// bar.sync across the warpgroup
// issue TMA store for the completed tile

FP32 accumulators are the important caveat

stmatrix does not support .f32 data. It supports packed .b16 and .b8 forms. So if WGMMA accumulated into FP32 registers, you cannot feed those registers directly into stmatrix. You first downcast and pack the values, which is why FP32 output paths often include an explicit conversion loop before the shared-memory store.

Why the loop exists: one logical output tile is usually larger than a single stmatrix.x4 call can cover, so the epilogue serializes the accumulator slices, writes them to shared memory, then hands the tile to a later TMA store path.

FP8 is a transport and layout problem as much as a precision choice

Hopper's FP8 path matters because it changes both math throughput and the operand contract. The lesson frames FP8 as a compressed training format that still accumulates in FP32, but whose range, scaling, packing, and layout decisions can make or break correctness and performance.

Format | What it favors | Why it shows up
e4m3 | More mantissa precision, smaller dynamic range. | Common for activations and weights when range is manageable.
e5m2 | Larger dynamic range, less mantissa precision. | Useful for gradients and paths where magnitude swings are larger.

Scaling is the escape hatch that keeps 8-bit floating point useful. Tensor-wise, vector-wise, and block-wise schemes all exist because the raw FP8 range is too cramped to trust without a quantization policy. The notes also call out .satfinite conversion forms so out-of-range values clamp instead of exploding into unusable results.

// Representative PTX conversion style
cvt.rn.satfinite.e4m3x2.f16x2 ...
cvt.rn.satfinite.e5m2x2.f32 ...

x4 packs fill registers

e4m3x4 and e5m2x4 place four FP8 values into one 32-bit register. That is the natural storage and transport form.

x2 packs match the math path

e4m3x2 and e5m2x2 are 16-bit packed forms that line up with how Hopper likes to ingest lower-precision math fragments.

Hopper FP8 WGMMA has stricter operand rules

The course notes call out the fixed m64nNk32 FP8 shape and a crucial Hopper limitation: the obvious ldmatrix ... .b8 path is not what SM90a exposes here. That is why practical Hopper FP8 kernels tend to use either shared-memory descriptors for both operands or an RS path where A is loaded into registers with ordinary shared loads already packed in the layout WGMMA expects.

The same notes also stress the strict K-major rule. Unlike the FP16/BF16 forms, FP8 WGMMA does not give you the same transpose escape hatch. If the staged operand tiles are not already K-major, you pay for extra repacking and synchronization before the tensor cores can consume them.

Numerical caution: even when the documented accumulator type is FP32, the lesson points to empirical reports that FP8 accumulation can behave like a reduced-precision FP32 variant. Large K reductions may need K-slicing or staged accumulation to keep error growth under control.

Sparse WGMMA adds a second contract: metadata correctness

Sparse WGMMA is still matrix multiply-accumulate, but the hardware assumes a very specific structured sparsity pattern. The supported model is sparse A times dense B. A carries the compressed data plus the metadata that tells tensor cores where the surviving values belong.

sp_meta

A packed register containing the position metadata for the surviving elements. For 2:4 sparsity, it records which two indices out of each quartet remain.

sp_sel

A selector telling hardware which threads are responsible for presenting metadata for the current instruction shape. If that selector does not match the actual loader strategy, the tensor core reads the wrong metadata.

Case | Selector rule from the notes | Why it matters
FP16 / BF16 sparse | sp_sel can choose either contributing thread pair. | Lets you match the selector to where the metadata was actually loaded.
FP8 / INT8 sparse | sp_sel must be 0. | The notes treat any other value as undefined for these shapes.
Replicated metadata loaders | Use the simple selector and broadcast the same metadata to the relevant threads. | Reduces the chance of misaligned metadata ownership.

The physical sparsity rule is 2:4. You do not just drop two values and hope for the best. You must store the survivors plus enough metadata for hardware to reconstruct their original positions inside each quartet. Wrong metadata does not degrade gracefully. It simply makes the tensor cores read the wrong sparse pattern.

Practical guidance

  1. Think in issue windows. Batch async MMAs into logical groups, then choose a wait depth that preserves overlap instead of collapsing back to full serialization.
  2. Drain only when you consume. Use wait_group 0 before reading accumulators, exporting them with stmatrix, or otherwise treating the results as complete.
  3. Separate register ordering from memory visibility. wgmma.fence is not a substitute for the async proxy publication rules used by TMA-backed pipelines.
  4. Treat FP8 as a layout contract. Scaling, packing, K-major staging, and conversion policy matter as much as the nominal data type.
  5. Never hand-wave sparse metadata. Sparse tensor cores only work when the payload and selector logic match exactly.

Glossary

wgmma.commit_group Closes the current batch of async MMAs so hardware can track it as one committed group.
wgmma.wait_group Wait primitive that limits how many committed WGMMA groups remain in flight.
stmatrix Cooperative matrix store instruction that writes tensor-core fragments from registers back to shared memory.
e4m3 / e5m2 Hopper's two FP8 floating-point formats, trading mantissa precision against dynamic range.
sp_meta Packed metadata telling sparse tensor cores where the surviving nonzero values from sparse A belong.

Continue the course

Lesson 7 completes the WGMMA instruction model. The next lesson zooms out from one primitive to the whole kernel: arithmetic intensity, warp specialization, circular buffers, persistent scheduling, and the pipeline shapes that keep Hopper busy for full GEMMs.