Explore Here / Lesson 03

Asynchronicity and Barriers

Hopper only delivers as an asynchronous machine if overlap stays correct. Lesson 3 is the synchronization lesson: proxy separation, RAW and WAR hazards, release and acquire ordering, mbarrier, cluster-wide coordination, and the wait rules that keep tensor data valid instead of merely in flight.

Topics: Latency Hiding · mbarrier · fence.proxy.async · wait_group · barrier.cluster

Why asynchronicity matters

The notes frame the story in the simplest possible way: systems alternate between doing and fetching. In a synchronous world those happen one after another. In an asynchronous world, the request for work or data is decoupled from the moment its result is consumed. That is what creates latency hiding.

Blocking / synchronous

The launching thread pauses until the operation is completely done. Control does not return, so no useful work overlaps the wait.

Non-blocking / asynchronous

The launching thread regains control immediately while some other part of the system tracks the work in the background.

On GPUs this distinction matters because math is often fast and memory movement is comparatively slow. Hopper pushes this hard: TMA and Tensor Cores are designed so compute can continue while movement and completion tracking happen elsewhere.

Mental model: the goal is not to maximize the number of async instructions. The goal is to keep expensive units busy while slower operations complete in parallel.
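The blocking versus non-blocking distinction can be sketched at the PTX level. This is a hedged, illustrative fragment, not runnable code: the addresses, the 128-byte size, and the register names are placeholders.

```ptx
// Blocking: the thread cannot proceed past the load until the data arrives.
ld.global.f32   %f0, [gmem];              // result needed immediately below
add.f32         %f1, %f1, %f0;            // the wait hides nothing

// Non-blocking: launch, overlap, then synchronize only where the result is
// consumed. The mbarrier tracks completion in the background.
cp.async.bulk.shared::cta.global.mbarrier::complete_tx::bytes
                [tile], [gmem], 128, [mbar];   // control returns at once
// ...independent math here overlaps the copy in flight...
mbarrier.try_wait.parity.shared::cta.b64  %p, [mbar], %phase;  // sync at use
```

The second half is exactly the lifecycle the lesson describes: initiate, let the hardware track it, synchronize at the point of consumption.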

Proxies, the async lifecycle, and why hazards appear

The lesson's lifecycle is three-stage: initialize async work, let another part of the system track it while unrelated work continues, then synchronize exactly where the result becomes necessary. Hopper adds a crucial twist: the part that issues work and the part that performs it may live in different proxies.

Generic proxy

Ordinary thread-issued loads and stores. Within this path, ordering is mostly what CUDA programmers already expect from sequential instructions inside one thread.

Async proxy

Hardware paths like cp.async.bulk and wgmma. Once launched, they run independently of the issuing thread's current instruction stream.

RAW and WAR hazards

If the generic proxy writes data and the async proxy immediately reads the same location, the async path can observe stale state. That is a read-after-write hazard. If the async proxy or tensor consumer is still reading a tile and the producer overwrites the same shared-memory region too early, that is a write-after-read hazard.

  • RAW matters when launch is much faster than memory visibility.
  • WAR matters when double-buffered shared tiles are reused aggressively.
  • Neither problem is solved by "the instruction finished" alone.

Fences are about visibility contracts, not ceremony

A fence constrains how memory effects become observable. In Hopper workflows the important distinction is between ordinary ordering and cross-proxy ordering. Standard per-thread sequencing does not automatically make generic and async paths agree on the state of memory.

Release

Producer-side ordering. Earlier writes must become visible before the signal or arrival that follows them.

Acquire

Consumer-side ordering. Later reads are not allowed to slip before the synchronization point that tells the consumer the data is ready.

Cross-proxy fence

Required when generic and async paths touch the same location. In the lesson notes, fence.proxy.async is the explicit bridge.

// Conceptual producer / consumer ordering
generic proxy writes shared state
cross-proxy or release fence
producer signals barrier
consumer waits with acquire semantics
consumer reads the tile safely

Important limitation: a cross-proxy fence is still per-thread ordering. It does not replace the CTA- or cluster-level coordination needed to make sure all writers finished setup before an elected thread launches the async operation.
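Putting these pieces together, the producer side looks roughly like the following hedged PTX sketch (the %elected predicate, operands, and sizes are illustrative): every writer finishes, a CTA barrier establishes the rendezvous, the proxy fence bridges visibility, and only then does one elected thread launch the async-proxy copy.

```ptx
st.shared.f32   [tile_addr], %f0;         // generic-proxy writes build the tile
bar.sync        0;                        // CTA-level: all writers are done
fence.proxy.async.shared::cta;            // generic -> async visibility bridge
// A single elected thread launches the async-proxy bulk store to global.
@%elected cp.async.bulk.global.shared::cta.bulk_group [gmem_addr], [tile_addr], 128;
@%elected cp.async.bulk.commit_group;
```

Note the division of labor: bar.sync supplies the CTA-wide coordination the fence cannot, and the fence supplies the cross-proxy visibility the barrier cannot.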

The mbarrier pipeline

The lesson defines mbarrier as a shared-memory hardware synchronization primitive that tracks asynchronous work. That is what makes it different from a classic barrier. A classic barrier asks "have the threads arrived?" An mbarrier can ask "has the tracked work completed, and is this phase safe to consume?"

  1. Initialize the barrier in shared memory with an expected arrival count.
  2. Attach expected transaction work for the async operation.
  3. Launch the async copy or producer stage.
  4. Do independent work while consumers test or wait.
  5. Flip phase and reuse the barrier for the next tile.

Initialization and reuse

The notes are blunt here: initialization bugs ruin everything else. The barrier lives in shared memory, its address must be converted to the shared state space correctly, and the 64-bit object must be 8-byte aligned. Reuse is phase-based, so each pipeline round is tracked as a separate generation rather than one endless counter.

Expected transaction count

For async data movement, the barrier does not just track thread arrival. It also tracks outstanding work, often framed in the notes as transaction bytes. Completion requires both the arrival side and the tracked work side to reach the state the barrier expects.

Parity and wait semantics

The parity-based waits are checking one specific phase of work. mbarrier.try_wait.parity and mbarrier.test_wait.parity tell the consumer whether the phase it cares about has completed. The .acquire form adds the visibility guarantee that makes subsequent reads safe.

// Conceptual consumer pattern
while (!phase_complete) {
  // do unrelated work, test again, or yield
}
// acquire visibility of the produced tile
consume shared-memory data or descriptor-backed work safely

Practical pattern from the notes: keep a local phase bit in registers and toggle it at the end of each loop iteration instead of carrying heavier synchronization tokens around.

Cluster barriers, async groups, and named barriers

Hopper needs more than one kind of synchronization because the ownership question changes. Sometimes you need to coordinate a producer and consumer inside one block. Sometimes you need multiple CTAs in a cluster to be physically present before remote shared-memory access is safe. Sometimes you want several async bulk copies to be committed as one batch.

barrier.cluster.arrive

Signals cluster arrival without immediately stopping independent work.

barrier.cluster.wait

Blocks until all participants arrived and cluster-visible writes are safe to consume.

cp.async.bulk.commit_group

Batches launched bulk async operations into a committed group that later waits can reason about.

cp.async.bulk.wait_group N

Waits until at most N committed groups remain pending, not until N groups have finished.
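As a hedged illustration of the group mechanics (all operands are placeholders):

```ptx
cp.async.bulk.global.shared::cta.bulk_group  [gA], [sA], 256;
cp.async.bulk.global.shared::cta.bulk_group  [gB], [sB], 256;
cp.async.bulk.commit_group;      // both copies become one committed group
// ...unrelated work overlaps the transfers...
cp.async.bulk.wait_group 0;      // drain until zero groups remain pending
```

With a deeper pipeline, a larger operand to wait_group lets older groups drain while newer ones stay in flight.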

Named barriers solve a different problem. They let subsets of warps synchronize without forcing an entire block to stop at one whole-block rendezvous. That makes them useful for warp-specialized internal producer-consumer pipelines.

Why cluster barriers exist: standard __syncthreads() cannot coordinate block-to-block handoff inside a cluster, which matters for DSMEM access and TMA multicast patterns.
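A hedged sketch of the split arrive/wait pattern (peer_tile stands in for an address obtained through mapa-style cluster mapping; all operands are illustrative):

```ptx
st.shared.f32          [my_tile], %f0;     // data a peer CTA will read via DSMEM
barrier.cluster.arrive;                    // release by default: prior writes ordered
// ...independent local work that touches no peer shared memory...
barrier.cluster.wait;                      // acquire: every CTA arrived, DSMEM safe
ld.shared::cluster.f32  %f1, [peer_tile];  // remote shared-memory read
```

The gap between arrive and wait is the whole point: it is overlap time that a single fused rendezvous would have thrown away.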

Design rules

  1. Separate progress from visibility. "Done" and "safe to read" are not interchangeable.
  2. Use the right barrier for the ownership question. Thread rendezvous, async work tracking, and cluster presence are different contracts.
  3. Keep scopes honest. CTA scope, cluster scope, GPU scope, and system scope are not cosmetic suffixes.
  4. Make shared-memory reuse explicit. If ownership is changing, encode it with arrival, wait, and phase management.
  5. Use asynchronicity as a scheduling tool. The payoff is clean overlap that keeps the machine productive.

Glossary

Proxy A memory-operation path or agent with its own visibility and ordering rules.
RAW hazard A consumer reads before the producer's newer write is visible on the path it uses.
WAR hazard A producer overwrites data before the consumer has finished reading the previous tile.
mbarrier A shared-memory hardware barrier that tracks asynchronous work across reusable phases.
Phase / parity The generation state of a reusable barrier, often tracked with a single bit.
wait_group A bulk async group wait based on how many committed groups are still pending.

PTX ISA Deep Dive: mbarrier Internals

Reference: PTX ISA 9.2, mbarrier chapter and async-copy / async-store / async-reduce sections.

What an mbarrier actually is

In PTX, an mbarrier is an opaque .b64, 8-byte-aligned object in shared memory. It tracks four things:

  1. The current phase.
  2. The current phase's pending-arrival count.
  3. The next phase's expected-arrival count.
  4. The current phase's tx-count for outstanding asynchronous transactions.

A phase completes only when both pending arrivals and tx-count reach zero. Completion then atomically advances the phase and reloads pending arrivals from expected arrivals.

The two-debt mental model

An mbarrier phase has two debts that must be paid before waiters can observe completion:

  • Arrival debt — how many arrivals are still missing. Set by mbarrier.init (the initial count); paid down by mbarrier.arrive*.
  • Transaction debt — how many async transactions are still outstanding. Increased by mbarrier.expect_tx / mbarrier.arrive.expect_tx; paid down by mbarrier.complete_tx (explicit or implicit).

mbarrier.arrive* pays down the first debt. mbarrier.expect_tx adds to the second debt. mbarrier.complete_tx pays down the second debt. mbarrier.arrive.expect_tx does both in one shot.
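A hedged worked trace of one phase (the 32-thread count, 4096-byte size, and operands are illustrative):

```ptx
mbarrier.init.shared::cta.b64  [mbar], 32;        // arrival debt 32, tx debt 0
// One thread adds async debt and pays one arrival in a single instruction:
mbarrier.arrive.expect_tx.shared::cta.b64  %state, [mbar], 4096;
//   arrival debt 32 -> 31,  tx debt 0 -> 4096
// The tracked copy pays tx debt via implicit complete-tx as bytes land:
cp.async.bulk.shared::cta.global.mbarrier::complete_tx::bytes
                [tile], [gmem], 4096, [mbar];     // tx debt 4096 -> 0 at finish
// The remaining 31 threads each pay one arrival:
mbarrier.arrive.shared::cta.b64  %state, [mbar];  // arrival debt -> 0
// Only when BOTH debts are zero does the phase flip, and arrival debt
// reloads from the expected count (32) for the next generation.
```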

Key instructions

mbarrier.init — Creates a barrier expecting count arrivals and zero async debt.

mbarrier.arrive — Decrements pending arrivals. It never touches tx-count, so if tx-count is nonzero, arrive alone will not complete the phase.

mbarrier.arrive.expect_tx — Fused form: adds txCount to transaction debt and contributes one arrival. Can make a phase look arrival-complete while still incomplete if tx-count is nonzero.

mbarrier.expect_tx — Purely "increase async debt." Useful when async work and thread arrival are logically separate.

mbarrier.complete_tx — Decrements tx-count. Explicit form is only .relaxed, but several async instructions generate implicit complete-tx with .release semantics (e.g., cp.async.bulk...mbarrier::complete_tx::bytes).

Watch out

arrive does not mean completion if tx-count is nonzero. If you called expect_tx anywhere in the phase, the barrier won't complete until complete_tx drives tx-count back to zero.

PTX ISA Deep Dive: Fences and Proxy Fences

Reference: PTX ISA 9.2, fence / membar sections.

The mental model

Ordinary loads/stores/atomics/reductions use the generic proxy. Some mechanisms use a different proxy, especially the async proxy. A proxy is an abstract label on a method of memory access — if two accesses use different proxies, a proxy fence is required to synchronize them.

The memory model requires two operations to use the same proxy before they can be "morally strong" with respect to each other. Generic on one side and async on the other means ordinary memory-order reasoning is insufficient — you need a cross-proxy fence.

cp.async, cp.async.bulk, cp.reduce.async.bulk, and wgmma.mma_async are not normal in-thread operations in plain program order. They have weaker ordering guarantees, and you must rely on each instruction family's documented completion and synchronization rules.

Ordinary fences: fence and old-style membar

Main syntax: fence{.sem}.scope. Semantics: .sc, .acq_rel, .acquire, .release. Scopes: .cta, .cluster, .gpu, .sys.

fence.acq_rel (default) — Sufficient for most synchronization patterns, but only synchronizes when combined with the right surrounding memory operations. Not magic global synchronization.

fence.sc — Stronger, slower. All morally strong fence.sc operations participate in Fence-SC order. If two are ordered there, they synchronize.

fence.proxy.async specifically

Use when one side is in the generic proxy and the other is in the async proxy.

Not every async-looking instruction uses the async proxy:

  • Plain cp.async is generic-proxy.
  • cp.async.bulk and cp.reduce.async.bulk are async-proxy.
  • wgmma.mma_async is async-proxy.

This is an easy place to get tripped up: use fence.proxy.async whenever the generic and async paths touch the same location.

Watch out: wgmma.fence is not a substitute for fence.proxy.async. wgmma.fence orders register accesses. If you wrote matrices to shared memory generically and wgmma.mma_async reads them through the async proxy, you still need fence.proxy.async for the shared-memory handoff.

Important: fence.proxy.async does not mean "wait until the async operation is finished." It only provides cross-proxy ordering. Completion is separate. You need both a completion wait (to know the async op is done) and proxy-fence semantics (so generic code can safely read the results).
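Both obligations appear in the canonical wgmma handoff. A hedged sketch with operands elided (the shape, descriptors, and immediate arguments are placeholders):

```ptx
st.shared.b32     [a_tile], %r0;      // generic-proxy stores build the tiles
fence.proxy.async;                    // ordering: async proxy may now see them
wgmma.fence.sync.aligned;             // orders accumulator register accesses
wgmma.mma_async.sync.aligned.m64n128k16.f32.f16.f16
                  { /* d regs */ }, desc_a, desc_b /* , immediates elided */;
wgmma.commit_group.sync.aligned;
wgmma.wait_group.sync.aligned 0;      // completion: results now safe to read
```

The proxy fence before launch and the wait_group after it answer different questions, and dropping either one produces a hazard.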

PTX ISA Deep Dive: CTA Named Barriers

Reference: PTX ISA 9.2, bar / barrier instruction pages.

The bar / barrier instructions draw on a pool of CTA barrier resources: each CTA has 16 logical barriers numbered 0..15, and operand a selects one of them.

Instruction forms

// Preferred spellings
barrier{.cta}.sync{.aligned}          a{, b};
barrier{.cta}.arrive{.aligned}        a, b;
barrier{.cta}.red.popc{.aligned}.u32  d, a{, b}, {!}c;
barrier{.cta}.red.op{.aligned}.pred   p, a{, b}, {!}c;

// Older short spellings
bar{.cta}.sync                        a{, b};
bar{.cta}.arrive                      a, b;
bar{.cta}.red.popc.u32                d, a{, b}, {!}c;
bar{.cta}.red.op.pred                 p, a{, b}, {!}c;

Key instructions

bar.sync a — Standard named barrier wait. Thread signals arrival and waits until all participating warps have arrived.

bar.arrive a, b — Signals arrival but does not wait for other participating warps. Enables producer/consumer patterns.

bar.red.popc.u32 d, a, c — Barrier plus population-count reduction. Counts how many threads have predicate c true.

bar.red.and.pred p, a, c — Barrier plus logical AND of predicates across participants.

bar.red.or.pred p, a, c — Barrier plus logical OR of predicates across participants.

Memory-ordering semantics

.sync, .red, and .arrive guarantee prior memory accesses are performed relative to all participating threads when the barrier completes. .sync and .red additionally guarantee the thread does not issue new memory accesses before the barrier completes.

"Performed" means: reads are transmitted from memory and cannot be modified by another participant; writes are visible to participants and old values can no longer be read. This is why named barriers are safe for shared-memory handoff inside a CTA.

Quick reference

Form                      Blocks  Memory guarantee
bar.sync a                Yes     Prior writes visible; no new accesses until complete
bar.sync a, b             Yes     Prior writes visible; no new accesses until complete
bar.arrive a, b           No      Prior writes visible
bar.red.popc.u32 d, a, c  Yes     Prior writes visible; no new accesses until complete
bar.red.and.pred p, a, c  Yes     Prior writes visible; no new accesses until complete

Continue the course

Lesson 3 establishes the correctness layer of the asynchronous machine. Lesson 4 moves back into the movement path itself: how Hopper TMA uses a host-encoded descriptor to describe tiles, strides, swizzles, interleaving, and fetch policy before the transfer even begins.