Why bulk copies matter
cp.async.bulk is Hopper's hardware-accelerated asynchronous bulk transfer family. The big
idea is not just that the instruction is non-blocking. It is that the TMA hardware can own the address
calculation, loop expansion, and movement of a much larger region while the SM goes back to compute.
Ampere cp.async
Uses the LSU path. The warp is still responsible for issuing many 16-byte operations and carrying the address-generation burden itself.
Hopper cp.async.bulk
Uses TMA. One thread can launch a whole tile transfer while the hardware tracks progress and updates the relevant completion object.
That matters because movement is often the long pole. Reducing issue overhead, register pressure, and instruction traffic gives the rest of the kernel more space to stay compute-focused.
Raw and tensor-aware bulk copies solve different problems
Hopper exposes two broad shapes. The raw form is a bulk memcpy-like instruction. The tensor form is descriptor-aware and uses the tensor map plus coordinates to locate and move a structured tile.
Raw / linear form
cp.async.bulk.dst.src.barrier_type{.cache_hint} [dst_addr], [src_addr], size, [mbarrier_addr];
Use this when memory is just a contiguous byte range. The hardware does not interpret the source as a tensor with dimensions and bounds.
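As a concrete sketch of the raw form, the inline-PTX wrapper below copies `size` bytes from global to shared memory with mbarrier completion. The function and variable names are illustrative, and it assumes sm_90, a 16-byte-aligned destination and size, and an mbarrier already initialized in shared memory.

```cuda
#include <cstdint>

// Hedged sketch: one thread launches a raw bulk copy of `size` bytes from
// global memory into shared memory. Completion is signaled by the hardware
// updating the mbarrier's transaction count as bytes arrive.
__device__ void bulk_copy_g2s(void* smem_dst, const void* gmem_src,
                              uint32_t size, uint64_t* bar) {
  uint32_t dst  = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  uint32_t mbar = static_cast<uint32_t>(__cvta_generic_to_shared(bar));
  asm volatile(
      "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes "
      "[%0], [%1], %2, [%3];\n"
      :: "r"(dst), "l"(gmem_src), "r"(size), "r"(mbar)
      : "memory");
}
```

Note that only one thread needs to call this; the hardware handles address generation and loop expansion for the whole range.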
Tensor-aware form
cp.async.bulk.tensor.ndim.dst.src.barrier_type{.cache_hint}{.multicast} [dst_addr], [tensor_map, coordinate_array], [mbarrier_addr];
Adding .tensor makes the instruction descriptor-aware. The dimensional suffix tells the
hardware how many coordinates to read, and the tensor map carries shape, stride, bounds, swizzle, and
related layout information.
State spaces
The source and destination state spaces are explicit in the instruction shape, including
.global, .shared::cta, and .shared::cluster.
Load modes
.tile fetches a dense multidimensional tile. .im2col applies a
convolution-friendly transform during fetch instead of requiring a separate rearrangement kernel.
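The tensor-aware form can be sketched as follows for a 2-D `.tile` load. The names are illustrative; `tmap` is assumed to point at a `CUtensorMap` built on the host (e.g. with `cuTensorMapEncodeTiled`), and `x`/`y` are the element coordinates of the tile origin. Requires sm_90a.

```cuda
#include <cstdint>

// Hedged sketch: descriptor-aware 2-D tile load through TMA. The tensor map
// carries shape, stride, bounds, and swizzle; the instruction only supplies
// the tile coordinates and the completion mbarrier.
__device__ void tma_load_tile_2d(void* smem_dst, const void* tmap,
                                 int x, int y, uint64_t* bar) {
  uint32_t dst  = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  uint32_t mbar = static_cast<uint32_t>(__cvta_generic_to_shared(bar));
  asm volatile(
      "cp.async.bulk.tensor.2d.shared::cluster.global.tile"
      ".mbarrier::complete_tx::bytes [%0], [%1, {%2, %3}], [%4];\n"
      :: "r"(dst), "l"(tmap), "r"(x), "r"(y), "r"(mbar)
      : "memory");
}
```

The `.2d` suffix tells the hardware to read two coordinates; a `.3d` variant would read three, and so on.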
Completion mechanisms define how consumers know the transfer is safe
The lesson makes the completion model explicit because the transfer itself is intentionally asynchronous. Hopper offers two main completion styles.
mbarrier::complete_tx::bytes
The richer coordination path. Hardware updates the barrier's transaction count as bytes arrive and flips phase when the expected work completes.
bulk_group
The lighter-weight batching model. You launch operations, commit the group, and later wait until only a limited number of recent groups remain pending.
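The bulk_group pattern can be sketched with a TMA store: launch the operation with `.bulk_group` completion, commit the group, then wait until all committed groups have drained. Names are illustrative; the source address and size are assumed 16-byte aligned.

```cuda
#include <cstdint>

// Hedged sketch: grouped-wait completion for a bulk store (shared -> global).
// commit_group closes the current batch; wait_group 0 blocks until no
// committed groups remain pending.
__device__ void bulk_store_and_wait(void* gmem_dst, const void* smem_src,
                                    uint32_t size) {
  uint32_t src = static_cast<uint32_t>(__cvta_generic_to_shared(smem_src));
  asm volatile(
      "cp.async.bulk.global.shared::cta.bulk_group [%0], [%1], %2;\n"
      :: "l"(gmem_dst), "r"(src), "r"(size) : "memory");
  asm volatile("cp.async.bulk.commit_group;\n" ::: "memory");
  asm volatile("cp.async.bulk.wait_group 0;\n" ::: "memory");
}
```

Waiting with a nonzero argument (e.g. `wait_group 2`) allows the most recent groups to stay in flight, which is how pipelined kernels overlap stores with compute.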
mbarrier is the right model when you want explicit byte-tracked coordination between
producers and consumers. bulk_group is simpler when coarse-grained grouped waiting is
enough.
Important distinction: bulk_group is about batching async operations.
mbarrier is about tracking work and phase completion with a reusable synchronization
object.
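The consumer side of the mbarrier model can be sketched like this: one thread posts the expected byte count with `expect_tx`, then threads spin on the barrier's phase parity until the hardware has delivered everything. This is a sketch under assumptions: the mbarrier in `bar` was initialized with `mbarrier.init`, and the caller tracks the current `phase` parity across reuses.

```cuda
#include <cstdint>

// Hedged sketch: byte-tracked mbarrier completion. expect_tx adds the
// expected transaction count; try_wait.parity polls until the barrier
// flips to the queried phase, meaning all expected bytes have arrived.
__device__ void expect_and_wait(uint64_t* bar, uint32_t bytes, uint32_t phase) {
  uint32_t mbar = static_cast<uint32_t>(__cvta_generic_to_shared(bar));
  if (threadIdx.x == 0) {
    asm volatile(
        "mbarrier.arrive.expect_tx.shared::cta.b64 _, [%0], %1;\n"
        :: "r"(mbar), "r"(bytes) : "memory");
  }
  uint32_t done = 0;
  while (!done) {
    asm volatile(
        "{.reg .pred p;\n"
        " mbarrier.try_wait.parity.shared::cta.b64 p, [%1], %2;\n"
        " selp.u32 %0, 1, 0, p;}\n"
        : "=r"(done) : "r"(mbar), "r"(phase) : "memory");
  }
}
```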
Cache policy and prefetch shape how bulk movement interacts with L2
Bulk copies can flood L2 with one-use traffic if the kernel does not communicate reuse intent. Hopper lets the instruction attach L2 cache hints through a cache policy descriptor.
| Policy | Meaning | Typical fit |
|---|---|---|
| evict_first | Mark lines as early eviction candidates. | Streaming or one-pass data, and final-output stores. |
| evict_last | Mark lines as persistent for longer retention. | Reused weights or tiles likely to be consumed repeatedly. |
| evict_normal | Default behavior. | When reuse is unclear or mixed. |
// Build a 64-bit cache policy descriptor; the 1.0 fraction applies the hint to every access.
createpolicy.fractional.L2::evict_last.b64 policy_reg, 1.0;   // retain: data that will be reused
createpolicy.fractional.L2::evict_first.b64 policy_reg, 1.0;  // stream: one-pass data
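Attaching the policy to a transfer can be sketched as below: the `.L2::cache_hint` qualifier tells the hardware to honor a trailing 64-bit policy operand. Names are illustrative, and the usual sm_90 alignment and mbarrier assumptions apply.

```cuda
#include <cstdint>

// Hedged sketch: a streaming bulk copy that marks its L2 lines as early
// eviction candidates, so one-pass traffic does not displace reused tiles.
__device__ void bulk_copy_streaming(void* smem_dst, const void* gmem_src,
                                    uint32_t size, uint64_t* bar) {
  uint64_t policy;
  asm volatile("createpolicy.fractional.L2::evict_first.b64 %0, 1.0;\n"
               : "=l"(policy));
  uint32_t dst  = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  uint32_t mbar = static_cast<uint32_t>(__cvta_generic_to_shared(bar));
  asm volatile(
      "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes"
      ".L2::cache_hint [%0], [%1], %2, [%3], %4;\n"
      :: "r"(dst), "l"(gmem_src), "r"(size), "r"(mbar), "l"(policy)
      : "memory");
}
```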
cp.async.bulk.prefetch and its tensor-aware variant can also prefetch data into L2 as a
latency hint, but the lesson is clear that prefetch is not a correctness primitive. If the later main
transfer arrives before the data is cached, hardware will fetch from HBM anyway.
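A minimal prefetch sketch, under the same caveat: there is no barrier and no completion object, because nothing downstream is allowed to depend on it.

```cuda
#include <cstdint>

// Hedged sketch: hint-only L2 prefetch of a byte range ahead of the real
// bulk transfer. If the data is not resident when the main copy issues,
// hardware simply fetches from HBM as usual.
__device__ void prefetch_bytes(const void* gmem_src, uint32_t size) {
  asm volatile("cp.async.bulk.prefetch.L2.global [%0], %1;\n"
               :: "l"(gmem_src), "r"(size) : "memory");
}
```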
Multicast turns one HBM fetch into cluster-wide delivery
A recurring Hopper use case is that multiple CTAs in a cluster need the same operand tile. In GEMM-like workloads, many blocks may need the same slice of matrix A while each consumes a different slice of matrix B. Multicast exists to avoid paying for that same HBM fetch repeatedly.
- A leader thread issues the multicast TMA instruction.
- The data is fetched once from global memory into L2.
- The cache/controller fabric broadcasts it to the shared-memory destinations of the CTAs selected by the mask.
- The relevant barrier state is updated for each participating CTA.
- Consumers in each CTA wait on their local barrier and resume when the tile is valid.
The notes emphasize that the barrier object is replicated at the same relative shared-memory offset in each participating CTA. Only CTAs included in the mask should behave like receivers for that transfer.
Practical consequence: multicast is a topology feature, not just a copy qualifier. The mask, barrier placement, and cluster participation all have to agree or the synchronization contract breaks.
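That contract can be sketched as follows: a leader thread issues the tensor copy with `.multicast::cluster` and a 16-bit CTA mask, one bit per CTA rank in the cluster. Names are illustrative; each receiving CTA is assumed to hold its mbarrier at the same relative shared-memory offset, per the notes above. Requires sm_90a.

```cuda
#include <cstdint>

// Hedged sketch: one HBM fetch, fanned out to the shared memory of every
// CTA whose bit is set in cta_mask. The hardware updates the barrier state
// of each participating CTA as its copy of the tile lands.
__device__ void tma_multicast_tile_2d(void* smem_dst, const void* tmap,
                                      int x, int y, uint64_t* bar,
                                      uint16_t cta_mask) {
  uint32_t dst  = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  uint32_t mbar = static_cast<uint32_t>(__cvta_generic_to_shared(bar));
  asm volatile(
      "cp.async.bulk.tensor.2d.shared::cluster.global.tile"
      ".mbarrier::complete_tx::bytes.multicast::cluster"
      " [%0], [%1, {%2, %3}], [%4], %5;\n"
      :: "r"(dst), "l"(tmap), "r"(x), "r"(y), "r"(mbar), "h"(cta_mask)
      : "memory");
}
```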
TMA stores and reductions
The lesson also points out that structured stores can use bulk_group completion, and Hopper extends the offload idea further with cp.reduce.async.bulk, where the TMA path can also own bulk reduction-style accumulation work.
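A reduction-store sketch, with illustrative names: each f32 element in shared memory is added into its counterpart in global memory, and completion is tracked through bulk_group exactly like a plain TMA store.

```cuda
#include <cstdint>

// Hedged sketch: offloading an add-reduction (shared -> global) to the TMA
// path, so the SM does not spend instructions on read-modify-write loops.
__device__ void tma_reduce_add_f32(float* gmem_dst, const float* smem_src,
                                   uint32_t size_bytes) {
  uint32_t src = static_cast<uint32_t>(__cvta_generic_to_shared(smem_src));
  asm volatile(
      "cp.reduce.async.bulk.global.shared::cta.bulk_group.add.f32"
      " [%0], [%1], %2;\n"
      :: "l"(gmem_dst), "r"(src), "r"(size_bytes) : "memory");
  asm volatile("cp.async.bulk.commit_group;\n" ::: "memory");
  asm volatile("cp.async.bulk.wait_group 0;\n" ::: "memory");
}
```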
Practical guidance
- Use raw bulk copies when memory is just bytes. Use tensor copies when shape, stride, and bounds are part of the problem.
- Choose the completion model intentionally. mbarrier and bulk_group are different coordination tools.
- Pick cache policy based on reuse. Streaming traffic and persistent weights want opposite hints.
- Treat multicast as a cluster contract. Mask membership, barrier offsets, and receiver logic have to line up.
- Remember that prefetch is only a hint. It can reduce latency, but it does not guarantee residency.
Glossary
| Term | Meaning |
|---|---|
| cp.async.bulk | Hopper's asynchronous bulk copy instruction family for large transfers. |
| cp.async.bulk.tensor | The descriptor-aware structured tensor-copy form. |
| bulk_group | A batching and grouped-wait mechanism for bulk async operations. |
| Multicast | Cluster-wide fanout of one fetched payload to multiple CTA destinations. |
| Cache policy descriptor | A 64-bit object created with createpolicy to carry eviction hints. |
| cp.async.bulk.prefetch | An L2 prefetch hint for future bulk transfers. |
Continue the course
Lesson 5 defines the main async movement family. Lesson 6 pivots to the compute side of the same pipeline: WGMMA, warpgroups, descriptor-backed tensor-core issue, and the register or shared-memory operand paths Hopper uses for large asynchronous matrix math.
WGMMA Part 1
Move into warpgroup tensor-core issue, ldmatrix, descriptors, and async math launch.
matmul_12.cuh
See structured bulk copies, multicast, bulk-group waits, and WGMMA-adjacent movement in one Hopper kernel path.