Explore Here / Lesson 02

Clusters, Data Types, Inline PTX, Pointers

Lesson 2 moves from architecture into mechanism. The focus shifts to cluster residency, distributed shared memory, PTX state spaces, inline PTX constraints, packed datatypes, and the address conversions Hopper code relies on when memory movement stops being generic.

Topics: Clusters · DSMEM · Inline PTX · State Spaces · cvta / mapa

Overview

Modern CUDA performance work often requires thinking across scheduling, memory topology, compiler behavior, and instruction-level control at the same time. Lesson 2 is where those layers begin to collapse into one picture. Thread block clusters change how blocks cooperate. Distributed shared memory changes what can be shared without bouncing through global memory. PTX and inline PTX expose instruction forms that CUDA C++ does not always surface directly. Pointer conversion instructions tie all of that back to the actual hardware address spaces.

Main idea: high-performance GPU code is not only about doing less work. It is also about placing data in the right memory region, using the right instruction form, and minimizing unnecessary routing, synchronization, and bank conflicts.

Clusters

Guaranteed co-scheduled CTAs on nearby SMs create a new cooperation surface above the block level.

DSMEM

Remote shared-memory access inside a cluster reduces how often blocks need to fall back to global memory.

Inline PTX

Instruction-level control matters when the exact state space, datatype, or synchronization form is the point.

Pointer conversion

cvta and mapa connect high-level pointers to concrete memory and cluster semantics.

Thread block clusters

A thread block cluster is a group of thread blocks guaranteed to be co-scheduled concurrently on adjacent SMs within a GPC. The important point is that this extends the hierarchy from threads -> thread blocks -> clusters -> grid without pretending one block is now spread across many SMs.

Why clusters matter

When one thread block cannot hold enough useful shared state on a single SM, clusters provide a way for nearby CTAs to cooperate more directly.

What does not change

A CTA still executes on exactly one SM. Clusters coordinate whole blocks; they do not split one CTA across the machine.

Key properties

  • Clusters are explicit software constructs, not an automatic scheduling side effect.
  • All CTAs in a cluster are guaranteed concurrent residency on nearby SMs.
  • They are useful when cross-block data sharing is frequent enough to make global memory too expensive.
  • They trade some coordination overhead for a larger effective working set.

Cluster sizing tradeoffs

Cluster size 2
  Upside: low coordination overhead; often a practical default.
  Downside: limited shared-memory reach across the cluster.

Cluster size 8
  Upside: more cooperating CTAs, with a useful balance of scale and usability.
  Downside: more synchronization cost and more constraints on grid shape.

Cluster size 16
  Upside: the maximum neighborhood described in the lesson notes.
  Downside: highest synchronization overhead and a larger risk of idle hardware.

Declaring a cluster

__global__ void __cluster_dims__(2, 1, 1)
cluster_kernel(float* input, float* output) {
  // kernel body
}

// launch shape must remain compatible with cluster dimensions
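
When the cluster shape should be supplied at launch time, the extended launch API can carry it as a launch attribute. A minimal sketch, assuming the `cluster_kernel` declaration above; the grid and block shapes here are placeholders:

```cuda
#include <cuda_runtime.h>

// Sketch: launch cluster_kernel with 2x1x1 clusters. The grid's
// x-dimension must be a multiple of the cluster's x-dimension, and the
// attribute must match the kernel's __cluster_dims__ if one is declared.
void launch(float* input, float* output) {
  cudaLaunchConfig_t config = {};
  config.gridDim  = dim3(8, 1, 1);    // 8 CTAs -> 4 clusters of 2
  config.blockDim = dim3(256, 1, 1);

  cudaLaunchAttribute attr;
  attr.id = cudaLaunchAttributeClusterDimension;
  attr.val.clusterDim.x = 2;
  attr.val.clusterDim.y = 1;
  attr.val.clusterDim.z = 1;
  config.attrs    = &attr;
  config.numAttrs = 1;

  cudaLaunchKernelEx(&config, cluster_kernel, input, output);
}
```

With a compile-time `__cluster_dims__` attribute, an ordinary `<<<grid, block>>>` launch also works; the attribute form is what allows the cluster shape to vary at runtime.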

Cluster special registers

Register Meaning
%cluster_ctaid CTA identifier inside the cluster.
%cluster_nctaid Number of CTAs in each dimension of the cluster.
%cluster_ctarank Linearized rank of the CTA inside the cluster.
%cluster_nctarank Total CTA count inside the cluster.
%is_explicit_cluster Whether the launch used an explicit cluster rather than the implicit 1x1x1 case.
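
These values can be read directly from device code with inline PTX. A minimal sketch; the helper name is made up for this example:

```cuda
// Read this CTA's linearized rank and the cluster's total CTA count
// from the PTX special registers listed above.
__device__ void cluster_ids(unsigned int* rank, unsigned int* count) {
  unsigned int r, n;
  asm volatile("mov.u32 %0, %%cluster_ctarank;" : "=r"(r));
  asm volatile("mov.u32 %0, %%cluster_nctarank;" : "=r"(n));
  *rank = r;
  *count = n;
}
```

The cooperative groups API exposes the same information through `cg::this_cluster().block_rank()` and `num_blocks()` without hand-written PTX.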

Distributed shared memory

Distributed shared memory is the cluster-era extension of shared memory. It allows direct memory access between SMs inside the same thread block cluster. Ownership still stays per block, but the programmer can address a neighboring CTA's shared-memory allocation without routing everything through global memory.

What it enables

Remote shared-memory loads, stores, and atomics between CTAs in the same cluster.

Why it is fast

It uses an SM-to-SM communication path instead of forcing all communication through L2 and HBM.

What it is not

It is not one merged shared-memory pool. Each CTA still owns its own local allocation.

The practical win is a larger effective working set at a much cheaper memory tier than global memory. That is especially useful for top-k, matrix pipelines, reductions, or tile exchange patterns where neighboring CTAs need each other's intermediate state.
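
A tile-exchange pattern like that can be sketched with the cooperative groups cluster API; the buffer size, the XOR pairing of CTAs, and the kernel name are illustrative choices, not prescribed by the lesson:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: each CTA fills its own shared-memory tile, then reads its
// partner CTA's tile through DSMEM instead of global memory.
__global__ void __cluster_dims__(2, 1, 1) exchange_tiles(int* out) {
  __shared__ int tile[256];
  cg::cluster_group cluster = cg::this_cluster();

  tile[threadIdx.x] = blockIdx.x * blockDim.x + threadIdx.x;
  cluster.sync();                           // partner's tile is now valid

  unsigned int peer = cluster.block_rank() ^ 1;       // partner in the pair
  int* remote = cluster.map_shared_rank(tile, peer);  // DSMEM pointer
  out[blockIdx.x * blockDim.x + threadIdx.x] = remote[threadIdx.x];

  cluster.sync();  // don't exit while a peer may still be reading our tile
}
```

The two `cluster.sync()` calls mark the ownership story: the first publishes local writes to the cluster, the second keeps the allocation alive until every peer is done reading it.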

TMA and multicast: DSMEM becomes much more powerful when paired with TMA-backed movement and multicast. One producer can seed multiple consumers without repeating the full data movement path for every destination.

PTX, inline PTX, operands, and state spaces

PTX is NVIDIA's virtual ISA inside the CUDA toolchain. CUDA C++ lowers into PTX, which is then lowered again into machine-specific SASS. Inline PTX matters when you want instruction-level control without rewriting an entire kernel in PTX.

Why use inline PTX

  • To access hardware features that CUDA C++ does not expose cleanly.
  • To hand-optimize small performance-critical fragments.
  • To request exact operand types, state spaces, or synchronization forms.
  • To inject constants or instruction parameters through templates.

The general shape of an inline PTX statement:

asm volatile(
  "ptx instruction here;"
  : /* outputs */
  : /* inputs */
  : /* clobbers */
);

Constraint modifiers and common operand classes

Constraint Meaning
= Write-only output operand.
+ Read-write operand.
r General integer register.
l 64-bit integer register, commonly used for addresses.
f 32-bit floating-point register.
d 64-bit floating-point register.

A memory clobber is often required when the assembly changes memory in ways the compiler cannot infer from the operand list alone. Without it, the compiler may legally reorder surrounding code in a way that breaks the intended semantics.
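
As a concrete illustration of the constraint table and the clobber rule, here is a hedged sketch of a release store written through inline PTX; the function name is made up for this example:

```cuda
// Store a flag with release semantics at GPU scope. The "memory" clobber
// tells the compiler not to reorder earlier memory operations past this
// store, which the operand list alone would not communicate.
__device__ void store_flag_release(unsigned int* flag, unsigned int value) {
  asm volatile("st.release.gpu.u32 [%0], %1;"
               :                        // no outputs
               : "l"(flag), "r"(value)  // "l": 64-bit address, "r": 32-bit value
               : "memory");
}
```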

PTX state spaces

.global

Large device memory visible across the grid, high capacity and high latency.

.shared

Low-latency memory local to a CTA, or reachable through cluster mechanisms in DSMEM workflows.

.local

Per-thread storage, backed by device memory; commonly where values land when registers spill.

.const / .param

Read-only constant space and kernel argument space with their own access semantics.

Data types that show up in Hopper code

  • Standard integer and floating-point scalar families still matter for control flow and addressing.
  • BF16, FP16, TF32, and FP8 matter because they change throughput, packing density, and tensor-core issue shape.
  • Packed types matter because multiple low-precision values are often carried inside one 32-bit register.
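
Packing in practice can be shown with the standard `cuda_fp16` intrinsics; the kernel name and launch shape here are illustrative:

```cuda
#include <cuda_fp16.h>

// Each __half2 carries two FP16 values in one 32-bit register, so a
// single __hadd2 performs two additions per instruction.
__global__ void packed_add(const __half2* a, const __half2* b,
                           __half2* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = __hadd2(a[i], b[i]);
  }
}
```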

Addressing, banking, and swizzling

Shared memory is still banked, and the bank structure still matters even on Hopper. Poor access patterns can serialize otherwise parallel traffic, which is why low-level layout choices continue to be part of performance engineering instead of a side detail.

Shared-memory banks

  • Threads in a warp ideally spread their requests across banks instead of piling onto the same bank.
  • Bank conflicts increase replay and reduce effective bandwidth.
  • The more structured the access pattern, the easier it is to reason about conflict behavior.
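
A classic illustration of the bank issue is column access into a square shared-memory tile; padding the row length staggers the columns across banks. A minimal sketch, assuming a 32x32 single-tile transpose launched with a 32x32 block:

```cuda
// With 32 banks of 4-byte words, walking a column of a 32-wide row
// (tile[t][c] for varying t) would map every thread to the same bank.
// The extra column shifts each row by one bank and removes the conflict.
__global__ void transpose_tile(const float* in, float* out) {
  __shared__ float tile[32][33];  // 33, not 32: the padding column

  tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * 32 + threadIdx.x];
  __syncthreads();
  out[threadIdx.y * 32 + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}
```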

128-byte transaction granularity

Many high-performance paths reason in 128-byte chunks because that aligns well with the granularity of efficient movement and layout transformations. This is one reason tile shapes and swizzles show up so often in Hopper code.

Why swizzling exists: it is a layout trick for spreading accesses more evenly across banks and transactions so structured movement stays fast even when tensor tiles are large and reused aggressively.
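
The idea reduces to index arithmetic. This is an illustrative XOR swizzle, not the exact layout any particular library uses:

```cuda
// Map (row, col) in a 32-wide tile to a swizzled linear offset.
// XOR-ing the column with low bits of the row rotates each row's
// elements across banks, so column walks no longer share a bank.
__device__ __forceinline__ int swizzled_offset(int row, int col) {
  return row * 32 + (col ^ (row & 7));  // 8-row period, illustrative
}
```

Because XOR is its own inverse, the same function recovers the original column, which is what makes swizzled layouts cheap to read back.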

Generic pointers, state-space pointers, cvta, and mapa

CUDA C++ usually gives you generic pointers, but PTX instructions often need a pointer that is already understood in the correct state space. That is why conversion steps matter. The compiler cannot always assume which physical memory region a generic pointer should target inside low-level assembly.

Generic pointers

Flexible at the CUDA C++ level, but ambiguous for instructions that care about a specific state space.

State-space pointers

Explicitly identify shared, global, local, or constant semantics so the instruction can target the correct path.

cvta and mapa

// Convert a generic pointer into a shared-memory state-space address
uint64_t smem_addr;
asm volatile("cvta.to.shared.u64 %0, %1;" : "=l"(smem_addr) : "l"(ptr));

// Map that shared-memory address into another CTA's shared-memory region,
// selected by its cluster rank
uint64_t remote_addr;
asm volatile("mapa.shared::cluster.u64 %0, %1, %2;" : "=l"(remote_addr) : "l"(smem_addr), "r"(rank));

cvta turns a generic address into the appropriate state-space address. mapa takes that one step further for cluster programming by mapping a shared-memory address into the address space of another CTA inside the cluster.
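
For the cvta half of that pair, CUDA also exposes an intrinsic, so hand-written assembly is not required. A minimal sketch; the function name `demo` is made up for this example:

```cuda
// __cvta_generic_to_shared performs the same generic-to-shared
// conversion as the cvta.to.shared instruction above.
__device__ void demo(float* p /* generic pointer into __shared__ data */) {
  size_t smem_addr = __cvta_generic_to_shared(p);
  (void)smem_addr;  // would typically feed cp.async / TMA-style instructions
}
```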

Important distinction: cluster rank is not the same thing as a global block ID. A block ID tells you where a CTA sits in the full grid. Cluster rank tells you where it sits inside its local cluster, and that is the value mapping operations care about.

Integration takeaways

  1. Do not flatten clusters into “shared memory, but bigger.” Residency guarantees and ownership still matter.
  2. Keep state-space terminology precise. .global, .shared, and generic pointers are not interchangeable.
  3. Treat inline PTX as a surgical tool. Use it when exact semantics are the point, not as a stylistic preference.
  4. Respect bank behavior and 128-byte layout assumptions. Swizzles and packed layouts exist for a reason.
  5. Be exact about rank and mapping. Many cluster bugs are really address-space or rank-mapping bugs.

Glossary

Cluster A guaranteed co-scheduled group of CTAs on nearby SMs inside a GPC.
DSMEM Distributed shared memory, remote shared-memory access across CTAs in one cluster.
PTX NVIDIA's virtual ISA used as the lower-level target inside the CUDA toolchain.
cvta Convert to address; turns a generic pointer into the correct state-space address.
mapa Map address; remaps shared-memory addresses across CTAs inside a cluster.

Continue the course

Lesson 2 builds the address-space and instruction-level vocabulary Hopper kernels need. The next big step is synchronization under overlap: barriers, wait patterns, and correctness when loads and compute are intentionally in flight at the same time.