
WGMMA Part 1

Lesson 6 is where Hopper tensor-core programming stops looking like older warp-level MMA. WGMMA turns four warps into one computational unit, launches tensor-core work asynchronously, consumes operand tiles from shared memory or registers, and depends on a very specific set of descriptors, register layouts, and fence semantics.


Why WGMMA matters

The lesson frames four key Hopper innovations around tensor math: tensor-core work is now fully asynchronous, execution expands from warp to warpgroup, operands can come directly from shared memory or registers without the old register-only flow, and Hopper adds stronger support for FP8 and sparsity.

Asynchronous math launch

wgmma.mma_async launches tensor-core work and lets the issuing threads continue while the math is in flight.

Larger tiles, cleaner scheduling

WGMMA avoids chaining a forest of smaller warp-level MMA calls by letting four warps cooperate on a larger logical operation.

This is the compute-side analogue of the earlier async movement lessons. Hopper wants math and movement to stay in flight together, and WGMMA is the tensor-core issue model that makes that practical.

Warpgroup execution is the paradigm shift

A warpgroup is four warps, or 128 threads, acting as one computational unit for tensor-core issue. All four warps reach the WGMMA instruction together, the scheduler checks convergence, fuses them for the operation, dispatches the command to tensor cores, and then the threads continue onward while the math proceeds asynchronously.

Why Hopper does this: the older single-warp MMA model makes very large tiles and deep pipelining harder to schedule cleanly. The warpgroup model lets Hopper keep tensor cores busy for longer with less launch overhead.

The fixed M = 64

The notes explain why the M dimension is fixed at 64. The hardware maps operand fragments into a static tensor-core input pattern and avoids the need for a much more expensive dynamic routing structure. N is more flexible, while K is determined by input datatype.

  • M: Fixed at 64. Matches the static hardware mapping for warpgroup tensor-core issue.
  • N: Flexible, typically multiples of 8 or 16 up to 256. Controls how many columns of B and C are processed.
  • K: Driven by input precision. Determines the inner-product depth the tensor core consumes per operation.

The instruction shape tells you what the hardware expects

The core operation is wgmma.mma_async. Operand A can come from registers or shared memory. Operand B comes from shared memory. Accumulators and outputs live in registers.

wgmma.mma_async.sync.aligned.shape.dtypeD.dtypeA.dtypeB
d, a_operand, b_desc, scale_d, imm_scale_a, imm_scale_b, imm_trans_a, imm_trans_b;

.sync

Requires the participating warpgroup threads to reach the instruction together before issue proceeds.

.aligned

Asserts warpgroup convergence at the instruction's program counter. Divergence here breaks the contract.

wgmma.fence.sync.aligned

The lesson is precise here: this is a register-ordering barrier for later wgmma.mma_async use, not a completion barrier. It orders relevant register accesses before a later async WGMMA reuses those registers as accumulators or A fragments. Shared-memory ordering still needs the appropriate async proxy fence.
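A device-side sketch of where the fence sits, assuming an sm_90a kernel; the surrounding structure is illustrative only, with the WGMMA operand list elided:

```cuda
// Illustrative ordering only (sm_90a). Ordinary register writes below must be
// fenced before an async WGMMA reuses those registers.
__device__ void init_and_issue(/* accumulators and descriptors elided */) {
  // 1. Ordinary register writes, e.g. zero-initializing the accumulator fragment.

  // 2. Register-ordering barrier: orders those writes before the asynchronous
  //    tensor-core operation that consumes the same registers.
  asm volatile("wgmma.fence.sync.aligned;\n" ::: "memory");

  // 3. Issue the async math (operand list elided; see the instruction shape above).
  // asm volatile("wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16 ...");

  // Note: as the lesson says, this is not a completion barrier, and visibility
  // of shared-memory data still needs the separate async proxy fence.
}
```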

Register-sourced A and ldmatrix

Hopper gives A two possible locations. If A is reused heavily, keeping it in registers can reduce shared-memory pressure. If it is not reused enough to justify the extra instruction and register cost, shared-memory sourcing may be better.

A in registers

Good when reuse is high. Loading A from registers avoids doubling shared-memory bank pressure for A and B together.

A in shared memory

Simpler when the kernel does not benefit enough from register reuse to justify the extra address and register overhead.

ldmatrix is the main register-loading path

ldmatrix does not behave like a plain linear shared-memory load. It moves matrix fragments into opaque register patterns that are already aligned with tensor-core input expectations. The threads providing addresses are not always the threads that hold the resulting data.

  • .m8n8 describes the geometric shape of the loaded matrix fragment.
  • .x1, .x2, and .x4 control how many core 8x8 matrices are loaded per instruction.
  • .trans can transpose during load, avoiding manual shuffling.
  • BF16 data is packed two values per 32-bit register because NVIDIA registers are 32-bit containers.

Important hardware fact: WGMMA expects the exact fragment layout that ldmatrix produces. The register arrangement is not arbitrary.
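A minimal device-side sketch of the loading path, assuming bf16 data already staged in shared memory; the function name is mine, and the shape (.m8n8.x4) is one common choice:

```cuda
#include <cstdint>

// Sketch: load four 8x8 16-bit fragments from shared memory into the opaque
// per-thread register pattern tensor cores expect. Each thread in the warp
// supplies one row address; the resulting data is distributed across the warp,
// not necessarily held by the thread that supplied the address.
__device__ void load_fragment_x4(uint32_t &r0, uint32_t &r1,
                                 uint32_t &r2, uint32_t &r3,
                                 const void *smem_row) {
  // Convert the generic pointer to a shared-memory address for the asm operand.
  uint32_t addr = static_cast<uint32_t>(__cvta_generic_to_shared(smem_row));
  asm volatile(
      "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0, %1, %2, %3}, [%4];\n"
      : "=r"(r0), "=r"(r1), "=r"(r2), "=r"(r3)
      : "r"(addr));
}
```

Add .trans to the instruction when the fragment should be transposed during the load.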

Shared-memory descriptors and swizzle complete the operand contract

When A or B is shared-memory sourced, WGMMA uses a compact 64-bit descriptor rather than a normal pointer. The lesson describes five packed fields: base address, leading byte offset, stride byte offset, matrix base offset, and swizzle mode.

  • Base address: The shared-memory address of the tile, 16-byte aligned and packed into the descriptor.
  • LBO (leading byte offset): Byte offset along the operand's leading direction.
  • SBO (stride byte offset): Byte offset in the other core-matrix direction.
  • Matrix base offset: Offsets the descriptor into the repeating swizzle region when the tile does not start at the pattern boundary.
  • Swizzle mode bits: Tell hardware whether the operand is unswizzled or uses 128B, 64B, or 32B swizzle.

Shared memory is swizzled so that the physical address mapping aligns with bank-friendly access patterns. The descriptor has to encode the correct swizzle mode. If it does not, WGMMA will read scrambled data as if it were laid out linearly.

// Representative pattern from the repo
uint64_t desc_a = make_smem_desc(&sA[0]);
uint64_t desc_b = make_smem_desc(&sB[0]);

asm volatile(
  "wgmma.mma_async.sync.aligned.m64n256k16.f32.bf16.bf16 ..."
);

Practical notes

  1. Treat warpgroup convergence as part of correctness. WGMMA is not tolerant of casual divergence.
  2. Choose A's location deliberately. Registers help when reuse is high; otherwise they can become pure pressure.
  3. Respect the descriptor and swizzle contract. Tensor cores only see what the descriptor tells them to see.
  4. Remember what wgmma.fence is and is not. It orders register use, but it is not a completion barrier for shared-memory visibility.
  5. Watch accumulator pressure. Large tiles and async math are powerful, but they can explode register usage quickly.

Glossary

Warpgroup Four warps acting together as one tensor-core issue unit.
wgmma.mma_async Hopper's asynchronous warpgroup matrix multiply-accumulate instruction.
ldmatrix The main mechanism for loading matrix fragments from shared memory into register layouts tensor cores expect.
Descriptor A packed 64-bit shared-memory operand description used by WGMMA for operand layout and stride interpretation.
Scale / transpose immediates Instruction flags that control accumulator behavior, sign scaling, and transpose behavior.

Continue the course

Lesson 6 introduces the basic WGMMA execution model and operand setup. The next lesson goes deeper into commit and wait groups, FP8 behavior, stmatrix, packing constraints, and the rest of the WGMMA lifecycle details.