Why bulk copies matter
cp.async.bulk is Hopper's hardware-accelerated asynchronous bulk transfer family. The big
idea is not just that the instruction is non-blocking. It is that the TMA hardware can own the address
calculation, loop expansion, and movement of a much larger region while the SM goes back to compute.
Ampere cp.async
Uses the LSU path. The warp is still responsible for issuing many 16-byte operations and carrying the address-generation burden itself.
Hopper cp.async.bulk
Uses TMA. One thread can launch a whole tile transfer while the hardware tracks progress and updates the relevant completion object.
That matters because movement is often the long pole. Reducing issue overhead, register pressure, and instruction traffic gives the rest of the kernel more space to stay compute-focused.
Raw and tensor-aware bulk copies solve different problems
Hopper exposes two broad shapes. The raw form is a bulk memcpy-like instruction. The tensor form is descriptor-aware and uses the tensor map plus coordinates to locate and move a structured tile.
Raw / linear form
cp.async.bulk.dst.src.barrier_type{.cache_hint} [dst_addr], [src_addr], size, [mbarrier_addr];
Use this when memory is just a contiguous byte range. The hardware does not interpret the source as a tensor with dimensions and bounds.
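As a concrete sketch of the raw form, the inline-PTX wrapper below copies `size` bytes from global to shared memory with mbarrier completion. The function and variable names are illustrative, and it assumes sm_90, a 16-byte-aligned destination and size, and an mbarrier already initialized in shared memory.

```cuda
#include <cstdint>

// Hedged sketch: one thread launches a raw bulk copy of `size` bytes from
// global memory into shared memory. Completion is signaled by the hardware
// updating the mbarrier's transaction count as bytes arrive.
__device__ void bulk_copy_g2s(void* smem_dst, const void* gmem_src,
                              uint32_t size, uint64_t* bar) {
  uint32_t dst  = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  uint32_t mbar = static_cast<uint32_t>(__cvta_generic_to_shared(bar));
  asm volatile(
      "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes "
      "[%0], [%1], %2, [%3];\n"
      :: "r"(dst), "l"(gmem_src), "r"(size), "r"(mbar)
      : "memory");
}
```

Note that only one thread needs to call this; the hardware handles address generation and loop expansion for the whole range.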
Tensor-aware form
cp.async.bulk.tensor.ndim.dst.src.barrier_type{.cache_hint}{.multicast} [dst_addr], [tensor_map, coordinate_array], [mbarrier_addr];
Adding .tensor makes the instruction descriptor-aware. The dimensional suffix tells the
hardware how many coordinates to read, and the tensor map carries shape, stride, bounds, swizzle, and
related layout information.
State spaces
The source and destination state spaces are explicit in the instruction shape, including
.global, .shared::cta, and .shared::cluster.
Load modes
.tile fetches a dense multidimensional tile. .im2col applies a
convolution-friendly transform during fetch instead of requiring a separate rearrangement kernel.
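The tensor-aware form can be sketched as follows for a 2-D `.tile` load. The names are illustrative; `tmap` is assumed to point at a `CUtensorMap` built on the host (e.g. with `cuTensorMapEncodeTiled`), and `x`/`y` are the element coordinates of the tile origin. Requires sm_90a.

```cuda
#include <cstdint>

// Hedged sketch: descriptor-aware 2-D tile load through TMA. The tensor map
// carries shape, stride, bounds, and swizzle; the instruction only supplies
// the tile coordinates and the completion mbarrier.
__device__ void tma_load_tile_2d(void* smem_dst, const void* tmap,
                                 int x, int y, uint64_t* bar) {
  uint32_t dst  = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  uint32_t mbar = static_cast<uint32_t>(__cvta_generic_to_shared(bar));
  asm volatile(
      "cp.async.bulk.tensor.2d.shared::cluster.global.tile"
      ".mbarrier::complete_tx::bytes [%0], [%1, {%2, %3}], [%4];\n"
      :: "r"(dst), "l"(tmap), "r"(x), "r"(y), "r"(mbar)
      : "memory");
}
```

The `.2d` suffix tells the hardware to read two coordinates; a `.3d` variant would read three, and so on.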
Completion mechanisms define how consumers know the transfer is safe
The lesson makes the completion model explicit because the transfer itself is intentionally asynchronous. Hopper offers two main completion styles.
mbarrier::complete_tx::bytes
The richer coordination path. Hardware updates the barrier's transaction count as bytes arrive and flips phase when the expected work completes.
bulk_group
The lighter-weight batching model. You launch operations, commit the group, and later wait until only a limited number of recent groups remain pending.
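The bulk_group pattern can be sketched with a TMA store: launch the operation with `.bulk_group` completion, commit the group, then wait until all committed groups have drained. Names are illustrative; the source address and size are assumed 16-byte aligned.

```cuda
#include <cstdint>

// Hedged sketch: grouped-wait completion for a bulk store (shared -> global).
// commit_group closes the current batch; wait_group 0 blocks until no
// committed groups remain pending.
__device__ void bulk_store_and_wait(void* gmem_dst, const void* smem_src,
                                    uint32_t size) {
  uint32_t src = static_cast<uint32_t>(__cvta_generic_to_shared(smem_src));
  asm volatile(
      "cp.async.bulk.global.shared::cta.bulk_group [%0], [%1], %2;\n"
      :: "l"(gmem_dst), "r"(src), "r"(size) : "memory");
  asm volatile("cp.async.bulk.commit_group;\n" ::: "memory");
  asm volatile("cp.async.bulk.wait_group 0;\n" ::: "memory");
}
```

Waiting with a nonzero argument (e.g. `wait_group 2`) allows the most recent groups to stay in flight, which is how pipelined kernels overlap stores with compute.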
mbarrier is the right model when you want explicit byte-tracked coordination between
producers and consumers. bulk_group is simpler when coarse-grained grouped waiting is
enough.
Important distinction: bulk_group is about batching async operations.
mbarrier is about tracking work and phase completion with a reusable synchronization
object.
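The consumer side of the mbarrier model can be sketched like this: one thread posts the expected byte count with `expect_tx`, then threads spin on the barrier's phase parity until the hardware has delivered everything. This is a sketch under assumptions: the mbarrier in `bar` was initialized with `mbarrier.init`, and the caller tracks the current `phase` parity across reuses.

```cuda
#include <cstdint>

// Hedged sketch: byte-tracked mbarrier completion. expect_tx adds the
// expected transaction count; try_wait.parity polls until the barrier
// flips to the queried phase, meaning all expected bytes have arrived.
__device__ void expect_and_wait(uint64_t* bar, uint32_t bytes, uint32_t phase) {
  uint32_t mbar = static_cast<uint32_t>(__cvta_generic_to_shared(bar));
  if (threadIdx.x == 0) {
    asm volatile(
        "mbarrier.arrive.expect_tx.shared::cta.b64 _, [%0], %1;\n"
        :: "r"(mbar), "r"(bytes) : "memory");
  }
  uint32_t done = 0;
  while (!done) {
    asm volatile(
        "{.reg .pred p;\n"
        " mbarrier.try_wait.parity.shared::cta.b64 p, [%1], %2;\n"
        " selp.u32 %0, 1, 0, p;}\n"
        : "=r"(done) : "r"(mbar), "r"(phase) : "memory");
  }
}
```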
Cache policy and prefetch shape how bulk movement interacts with L2
Bulk copies can flood L2 with one-use traffic if the kernel does not communicate reuse intent. Hopper lets the instruction attach L2 cache hints through a cache policy descriptor.
| Policy | Meaning | Typical fit |
|---|---|---|
| evict_first | Mark lines as early eviction candidates. | Streaming or one-pass data, and final-output stores. |
| evict_last | Mark lines as persistent for longer retention. | Reused weights or tiles likely to be consumed repeatedly. |
| evict_normal | Default behavior. | When reuse is unclear or mixed. |
// Build a 64-bit cache policy descriptor; the 1.0 fraction applies the hint to every access.
createpolicy.fractional.L2::evict_last.b64 policy_reg, 1.0;   // retain: data that will be reused
createpolicy.fractional.L2::evict_first.b64 policy_reg, 1.0;  // stream: one-pass data
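Attaching the policy to a transfer can be sketched as below: the `.L2::cache_hint` qualifier tells the hardware to honor a trailing 64-bit policy operand. Names are illustrative, and the usual sm_90 alignment and mbarrier assumptions apply.

```cuda
#include <cstdint>

// Hedged sketch: a streaming bulk copy that marks its L2 lines as early
// eviction candidates, so one-pass traffic does not displace reused tiles.
__device__ void bulk_copy_streaming(void* smem_dst, const void* gmem_src,
                                    uint32_t size, uint64_t* bar) {
  uint64_t policy;
  asm volatile("createpolicy.fractional.L2::evict_first.b64 %0, 1.0;\n"
               : "=l"(policy));
  uint32_t dst  = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  uint32_t mbar = static_cast<uint32_t>(__cvta_generic_to_shared(bar));
  asm volatile(
      "cp.async.bulk.shared::cluster.global.mbarrier::complete_tx::bytes"
      ".L2::cache_hint [%0], [%1], %2, [%3], %4;\n"
      :: "r"(dst), "l"(gmem_src), "r"(size), "r"(mbar), "l"(policy)
      : "memory");
}
```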
cp.async.bulk.prefetch and its tensor-aware variant can also prefetch data into L2 as a
latency hint, but the lesson is clear that prefetch is not a correctness primitive. If the later main
transfer arrives before the data is cached, hardware will fetch from HBM anyway.
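A minimal prefetch sketch, under the same caveat: there is no barrier and no completion object, because nothing downstream is allowed to depend on it.

```cuda
#include <cstdint>

// Hedged sketch: hint-only L2 prefetch of a byte range ahead of the real
// bulk transfer. If the data is not resident when the main copy issues,
// hardware simply fetches from HBM as usual.
__device__ void prefetch_bytes(const void* gmem_src, uint32_t size) {
  asm volatile("cp.async.bulk.prefetch.L2.global [%0], %1;\n"
               :: "l"(gmem_src), "r"(size) : "memory");
}
```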
Multicast turns one HBM fetch into cluster-wide delivery
A recurring Hopper use case is that multiple CTAs in a cluster need the same operand tile. In GEMM-like workloads, many blocks may need the same slice of matrix A while each consumes a different slice of matrix B. Multicast exists to avoid paying for that same HBM fetch repeatedly.
- A leader thread issues the multicast TMA instruction.
- The data is fetched once from global memory into L2.
- The cache/controller fabric broadcasts it to the shared-memory destinations of the CTAs selected by the mask.
- The relevant barrier state is updated for each participating CTA.
- Consumers in each CTA wait on their local barrier and resume when the tile is valid.
The notes emphasize that the barrier object is replicated at the same relative shared-memory offset in each participating CTA. Only CTAs included in the mask should behave like receivers for that transfer.
Practical consequence: multicast is a topology feature, not just a copy qualifier. The mask, barrier placement, and cluster participation all have to agree or the synchronization contract breaks.
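That contract can be sketched as follows: a leader thread issues the tensor copy with `.multicast::cluster` and a 16-bit CTA mask, one bit per CTA rank in the cluster. Names are illustrative; each receiving CTA is assumed to hold its mbarrier at the same relative shared-memory offset, per the notes above. Requires sm_90a.

```cuda
#include <cstdint>

// Hedged sketch: one HBM fetch, fanned out to the shared memory of every
// CTA whose bit is set in cta_mask. The hardware updates the barrier state
// of each participating CTA as its copy of the tile lands.
__device__ void tma_multicast_tile_2d(void* smem_dst, const void* tmap,
                                      int x, int y, uint64_t* bar,
                                      uint16_t cta_mask) {
  uint32_t dst  = static_cast<uint32_t>(__cvta_generic_to_shared(smem_dst));
  uint32_t mbar = static_cast<uint32_t>(__cvta_generic_to_shared(bar));
  asm volatile(
      "cp.async.bulk.tensor.2d.shared::cluster.global.tile"
      ".mbarrier::complete_tx::bytes.multicast::cluster"
      " [%0], [%1, {%2, %3}], [%4], %5;\n"
      :: "r"(dst), "l"(tmap), "r"(x), "r"(y), "r"(mbar), "h"(cta_mask)
      : "memory");
}
```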
TMA stores and reductions
The lesson also points out that structured stores can use bulk_group completion, and Hopper extends the offload idea further with cp.reduce.async.bulk, where the TMA path can also own bulk reduction-style accumulation work.
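A reduction-store sketch, with illustrative names: each f32 element in shared memory is added into its counterpart in global memory, and completion is tracked through bulk_group exactly like a plain TMA store.

```cuda
#include <cstdint>

// Hedged sketch: offloading an add-reduction (shared -> global) to the TMA
// path, so the SM does not spend instructions on read-modify-write loops.
__device__ void tma_reduce_add_f32(float* gmem_dst, const float* smem_src,
                                   uint32_t size_bytes) {
  uint32_t src = static_cast<uint32_t>(__cvta_generic_to_shared(smem_src));
  asm volatile(
      "cp.reduce.async.bulk.global.shared::cta.bulk_group.add.f32"
      " [%0], [%1], %2;\n"
      :: "l"(gmem_dst), "r"(src), "r"(size_bytes) : "memory");
  asm volatile("cp.async.bulk.commit_group;\n" ::: "memory");
  asm volatile("cp.async.bulk.wait_group 0;\n" ::: "memory");
}
```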
Practical guidance
- Use raw bulk copies when memory is just bytes. Use tensor copies when shape, stride, and bounds are part of the problem.
- Choose the completion model intentionally. mbarrier and bulk_group are different coordination tools.
- Pick cache policy based on reuse. Streaming traffic and persistent weights want opposite hints.
- Treat multicast as a cluster contract. Mask membership, barrier offsets, and receiver logic have to line up.
- Remember that prefetch is only a hint. It can reduce latency, but it does not guarantee residency.
Glossary
| Term | Meaning |
|---|---|
| cp.async.bulk | Hopper's asynchronous bulk copy instruction family for large transfers. |
| cp.async.bulk.tensor | The descriptor-aware structured tensor-copy form. |
| bulk_group | A batching and grouped-wait mechanism for bulk async operations. |
| Multicast | Cluster-wide fanout of one fetched payload to multiple CTA destinations. |
| Cache policy descriptor | A 64-bit object created with createpolicy to carry eviction hints. |
| cp.async.bulk.prefetch | An L2 prefetch hint for future bulk transfers. |
Continue the course
Lesson 5 defines the main async movement family. Lesson 6 pivots to the compute side of the same pipeline: WGMMA, warpgroups, descriptor-backed tensor-core issue, and the register or shared-memory operand paths Hopper uses for large asynchronous matrix math.
WGMMA Part 1
Move into warpgroup tensor-core issue, ldmatrix, descriptors, and async math launch.
matmul_12.cuh
See structured bulk copies, multicast, bulk-group waits, and WGMMA-adjacent movement in one Hopper kernel path.