Lesson 04

cuTensorMap

Lesson 4 is the descriptor lesson. Hopper TMA only becomes truly asynchronous because the transfer state is encoded up front: base pointer, dimensions, strides, element traversal, swizzle, interleave, L2 fetch policy, and out-of-bounds behavior all become one reusable object the hardware can execute against.

Topics: CUtensorMap · TMA · Swizzle · Interleave · L2 Promotion

What TMA changes

TMA is Hopper's hardware path for large asynchronous tensor copies between global memory and shared memory. Instead of tying movement to per-thread address generation, Hopper lets one launching thread describe the transfer and then continue executing while hardware handles the operation in the background.

Before descriptor-driven movement

Threads participate heavily in address generation, copies fragment into smaller pieces, and the SM stays tethered to the mechanics of movement.

With Hopper TMA

One instruction plus a descriptor can describe a whole tile transfer while the hardware handles the movement path mostly on its own.

That is why the lesson keeps returning to descriptors. Hopper's asynchronous machine works because layout knowledge is encoded ahead of time rather than recomputed by the issuing threads every time.

Core idea: CUtensorMap shifts transfer state out of the instruction stream and into a reusable 128-byte metadata object that the TMA hardware can execute against.

What the descriptor contains

A CUtensorMap is created on the host, encoded with the CUDA Driver API, and then used by the runtime path that issues TMA instructions. Conceptually it answers three questions: where the tensor lives, how it is laid out, and how the hardware should move it.

Addressing metadata

Base pointer, rank, global dimensions, and byte strides define the source tensor in memory.

Traversal metadata

Box dimensions and element strides define the tile size and per-dimension movement behavior.

Layout and policy metadata

Data type, swizzle, interleave, L2 promotion, and OOB fill shape how the transfer behaves.

Encoding workflow

  1. Create a CUtensorMap object in host memory.
  2. Encode it with the CUDA Driver API, typically with cuTensorMapEncodeTiled(...).
  3. Pass the encoded descriptor to the path that will launch TMA-backed movement.

CUtensorMap tma_desc;  // host-side descriptor object, 128 bytes, 128B aligned

// Encode it with the CUDA Driver API; the full call also takes the data
// type, rank, base pointer, dimensions, strides, box shape, and the layout
// and policy enums covered below:
// cuTensorMapEncodeTiled(&tma_desc, ...);

Field-by-field rules that matter

The easiest way to break a tensor map is to blur units and invariants. The lesson material is very clear that some fields are expressed in elements, some in bytes, and some carry alignment constraints that depend on datatype or interleave mode.

Field | Meaning | Unit / constraint
Data type | Defines how the source tensor is interpreted and what alignment and size rules apply. | Type enum such as FP16, BF16, FP32, TF32, or the integer types.
Rank | Number of tensor dimensions, not matrix rank. | The notes cover 1D through 5D tensor copies.
Global address | Base device pointer in HBM. | 16B-aligned baseline, stricter when interleaving is enabled.
globalDim | Extent of each tensor dimension. | Measured in elements.
globalStrides | Distance to move across higher dimensions in source memory. | Measured in bytes.
elementStrides | Logical step size for traversal during the async copy. | Measured in elements; array length equals rank.
boxDim | Tile size for one TMA transfer. | Measured in elements; must fit downstream layout assumptions.

Keep this straight: global dimensions and box dimensions are in elements, while global strides are in bytes. Losing that distinction is one of the fastest ways to encode the wrong descriptor.

Datatype and rank

The notes list common types including unsigned integers, signed integers, FP16, BF16, FP32, FP64, and TensorFloat-32 variants. Rank here means tensor dimensionality, not matrix rank from linear algebra.

Alignment and interleave constraints

Interleave modes tighten the address and stride rules. The notes call out 16-byte alignment for 16B interleave, 32-byte alignment for 32B interleave, stride granularity to match, and a dimensionality requirement of at least 3 when interleaving is used.

Swizzle and interleave are how layout becomes hardware-friendly

Swizzle is the shared-memory-side layout transform. Interleave is the source-layout decode in global memory. They are related because Hopper often has to decode a packed source layout and then place the result into a shared-memory arrangement that downstream consumers can read efficiently.

Why swizzling exists

Shared memory has 32 banks. Regular strided layouts can repeatedly hit the same banks and serialize what should have been parallel access. Hopper uses hardware-accelerated address permutation during transfer so the physical placement in shared memory better matches the access pattern of the consuming warps.

Mode | What changes | Typical use
32B | Smallest swizzle span, lower cache-line utilization, narrower tiles. | Very small inner dimensions.
64B | Better bank spread than 32B, but still not the strongest layout protection. | Mid-width tiles.
128B | Largest common swizzle span, aligned with the main shared-memory / tensor consumption pattern. | Common FP16 / BF16 tensor-core pipelines.

The notes also distinguish atom size and span. The atom is the indivisible chunk the swizzler moves. The span is the repeat window of the swizzle pattern. That is why the notes highlight the bounding-box constraint: if the inner dimension in bytes exceeds the swizzle span, the pattern repeats and can recreate bank collisions.

Interleave coupling

Some source layouts are physically interleaved, for example NC/8HWC8 or NC/16HWC16 style packing. The interleave mode tells TMA how to decode those chunks. The lesson notes explicitly call out that 32B interleave must pair with 32B swizzle because those hardware paths operate in lockstep.

L2 promotion and out-of-bounds fill finish the movement contract

The descriptor does not stop at layout. It also controls how aggressively the system fetches into L2 and what value the destination should receive when a tile extends beyond the valid tensor boundary.

L2 promotion

Mode | Meaning | Best fit
NONE | Use the default smaller fetch behavior. | Sparse or irregular access where over-fetching would waste bandwidth.
L2_64B | Promote to 64B fetch width. | Moderately dense cases with narrow contiguous width.
L2_128B | Fetch a full cache line in one shot. | Common dense GEMM or convolution-style pipelines.
L2_256B | Fetch two full cache lines logically together. | Highly dense workloads with immediate reuse and enough locality to justify cache space.

The lesson's rule of thumb is simple: dense GEMM and convolution often benefit from 128B or 256B promotion, while sparse lookups usually do better with no promotion to avoid fetching neighbors that will never be touched.

Out-of-bounds fill

Edge tiles still need to move. OOB fill lets the transfer continue without manual per-element guards by synthesizing destination values for out-of-range positions while leaving the source tensor in HBM unchanged.

  • Zero fill is useful for padding and clean boundary math.
  • NaN-related fill modes are useful for debugging or specialized behavior.
  • This simplifies tiled kernels that naturally extend beyond tensor edges.

Practical guidance

  1. Get the descriptor math right first. Rank, dimensions, and byte strides must describe the real tensor.
  2. Respect alignment rules. Interleave mode and datatype change what the hardware expects.
  3. Choose tile geometry around the consumer. TMA should feed the compute pattern you actually use downstream.
  4. Use swizzle deliberately. It only helps when it matches the access pattern of the reader.
  5. Match L2 promotion to contiguous width. Dense predictable traffic and sparse lookups want very different fetch behavior.

Glossary

TMA Tensor Memory Accelerator, Hopper's asynchronous tensor copy engine.
CUtensorMap The host-encoded descriptor that teaches TMA how to interpret and move a tensor tile.
Swizzle Shared-memory address remapping used to reduce bank conflicts and align with consumer access patterns.
Interleave Source-layout decoding for packed global-memory formats.
L2 promotion A fetch-width hint that trades bandwidth efficiency against possible cache pollution.
OOB fill Destination fill policy used when a TMA tile extends beyond the valid tensor boundary.

Continue the course

Lesson 4 gives TMA its descriptor and layout vocabulary. The next lesson, cp.async.bulk, moves into the instruction family that uses this machinery directly for structured bulk movement, grouping, multicast, and barrier-linked completion.