Explore Here / Lesson 02

Clusters, Data Types, Inline PTX, Pointers

Lesson 2 moves from architecture into mechanism. The focus shifts to cluster residency, distributed shared memory, PTX state spaces, inline PTX constraints, packed datatypes, and the address conversions Hopper code relies on when memory movement stops being generic.

Topics: Clusters · DSMEM · Inline PTX · State Spaces · cvta / mapa

Overview

Modern CUDA performance work often requires thinking across scheduling, memory topology, compiler behavior, and instruction-level control at the same time. Lesson 2 is where those layers begin to collapse into one picture. Thread block clusters change how blocks cooperate. Distributed shared memory changes what can be shared without bouncing through global memory. PTX and inline PTX expose instruction forms that CUDA C++ does not always surface directly. Pointer conversion instructions tie all of that back to the actual hardware address spaces.

Main idea: high-performance GPU code is not only about doing less work. It is also about placing data in the right memory region, using the right instruction form, and minimizing unnecessary routing, synchronization, and bank conflicts.

Clusters

Guaranteed co-scheduled CTAs on nearby SMs create a new cooperation surface above the block level.

DSMEM

Remote shared-memory access inside a cluster reduces how often blocks need to fall back to global memory.

Inline PTX

Instruction-level control matters when the exact state space, datatype, or synchronization form is the point.

Pointer conversion

cvta and mapa connect high-level pointers to concrete memory and cluster semantics.

Thread block clusters

A thread block cluster is a group of thread blocks guaranteed to be co-scheduled concurrently on adjacent SMs within a GPC. The important point is that this extends the hierarchy from threads -> thread blocks -> clusters -> grid without pretending one block is now spread across many SMs.

Why clusters matter

When one thread block cannot hold enough useful shared state on a single SM, clusters provide a way for nearby CTAs to cooperate more directly.

What does not change

A CTA still executes on exactly one SM. Clusters coordinate whole blocks; they do not split one CTA across the machine.

Key properties

  • Clusters are explicit software constructs, not an automatic scheduling side effect.
  • All CTAs in a cluster are guaranteed concurrent residency on nearby SMs.
  • They are useful when cross-block data sharing is frequent enough to make global memory too expensive.
  • They trade some coordination overhead for a larger effective working set.

Cluster sizing tradeoffs

Cluster size 2
  Upside: low coordination overhead; often a practical default.
  Downside: limited shared-memory reach across the cluster.

Cluster size 8
  Upside: more cooperating CTAs, with a useful balance of scale and usability.
  Downside: more synchronization cost and more constraints on grid shape.

Cluster size 16
  Upside: the maximum neighborhood described in the lesson notes.
  Downside: highest synchronization overhead and a larger risk of idle hardware.

Declaring a cluster

__global__ void __cluster_dims__(2, 1, 1)
cluster_kernel(float* input, float* output) {
  // kernel body
}

// launch shape must remain compatible with cluster dimensions
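
When the cluster shape should be supplied at launch time, the extended launch API can carry it as a launch attribute. A minimal sketch, assuming the `cluster_kernel` declaration above; the grid and block shapes here are placeholders:

```cuda
#include <cuda_runtime.h>

// Sketch: launch cluster_kernel with 2x1x1 clusters. The grid's
// x-dimension must be a multiple of the cluster's x-dimension, and the
// attribute must match the kernel's __cluster_dims__ if one is declared.
void launch(float* input, float* output) {
  cudaLaunchConfig_t config = {};
  config.gridDim  = dim3(8, 1, 1);    // 8 CTAs -> 4 clusters of 2
  config.blockDim = dim3(256, 1, 1);

  cudaLaunchAttribute attr;
  attr.id = cudaLaunchAttributeClusterDimension;
  attr.val.clusterDim.x = 2;
  attr.val.clusterDim.y = 1;
  attr.val.clusterDim.z = 1;
  config.attrs    = &attr;
  config.numAttrs = 1;

  cudaLaunchKernelEx(&config, cluster_kernel, input, output);
}
```

With a compile-time `__cluster_dims__` attribute, an ordinary `<<<grid, block>>>` launch also works; the attribute form is what allows the cluster shape to vary at runtime.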

Cluster special registers

Register Meaning
%cluster_ctaid CTA identifier inside the cluster.
%cluster_nctaid Number of CTAs in each dimension of the cluster.
%cluster_ctarank Linearized rank of the CTA inside the cluster.
%cluster_nctarank Total CTA count inside the cluster.
%is_explicit_cluster Whether the launch used an explicit cluster rather than the implicit 1x1x1 case.
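
These values can be read directly from device code with inline PTX. A minimal sketch; the helper name is made up for this example:

```cuda
// Read this CTA's linearized rank and the cluster's total CTA count
// from the PTX special registers listed above.
__device__ void cluster_ids(unsigned int* rank, unsigned int* count) {
  unsigned int r, n;
  asm volatile("mov.u32 %0, %%cluster_ctarank;" : "=r"(r));
  asm volatile("mov.u32 %0, %%cluster_nctarank;" : "=r"(n));
  *rank = r;
  *count = n;
}
```

The cooperative groups API exposes the same information through `cg::this_cluster().block_rank()` and `num_blocks()` without hand-written PTX.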

Distributed shared memory

Distributed shared memory is the cluster-era extension of shared memory. It allows direct memory access between SMs inside the same thread block cluster. Ownership still stays per block, but the programmer can address a neighboring CTA's shared-memory allocation without routing everything through global memory.

What it enables

Remote shared-memory loads, stores, and atomics between CTAs in the same cluster.

Why it is fast

It uses an SM-to-SM communication path instead of forcing all communication through L2 and HBM.

What it is not

It is not one merged shared-memory pool. Each CTA still owns its own local allocation.

The practical win is a larger effective working set at a much cheaper memory tier than global memory. That is especially useful for top-k, matrix pipelines, reductions, or tile exchange patterns where neighboring CTAs need each other's intermediate state.
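
A tile-exchange pattern like that can be sketched with the cooperative groups cluster API; the buffer size, the XOR pairing of CTAs, and the kernel name are illustrative choices, not prescribed by the lesson:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Sketch: each CTA fills its own shared-memory tile, then reads its
// partner CTA's tile through DSMEM instead of global memory.
__global__ void __cluster_dims__(2, 1, 1) exchange_tiles(int* out) {
  __shared__ int tile[256];
  cg::cluster_group cluster = cg::this_cluster();

  tile[threadIdx.x] = blockIdx.x * blockDim.x + threadIdx.x;
  cluster.sync();                           // partner's tile is now valid

  unsigned int peer = cluster.block_rank() ^ 1;       // partner in the pair
  int* remote = cluster.map_shared_rank(tile, peer);  // DSMEM pointer
  out[blockIdx.x * blockDim.x + threadIdx.x] = remote[threadIdx.x];

  cluster.sync();  // don't exit while a peer may still be reading our tile
}
```

The two `cluster.sync()` calls mark the ownership story: the first publishes local writes to the cluster, the second keeps the allocation alive until every peer is done reading it.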

TMA and multicast: DSMEM becomes much more powerful when paired with TMA-backed movement and multicast. One producer can seed multiple consumers without repeating the full data movement path for every destination.

PTX, inline PTX, operands, and state spaces

PTX is NVIDIA's virtual ISA inside the CUDA toolchain. CUDA C++ lowers into PTX, which is then lowered again into machine-specific SASS. Inline PTX matters when you want instruction-level control without rewriting an entire kernel in PTX.

Why use inline PTX

  • To access hardware features that CUDA C++ does not expose cleanly.
  • To hand-optimize small performance-critical fragments.
  • To request exact operand types, state spaces, or synchronization forms.
  • To inject constants or instruction parameters through templates.

The general shape of an inline PTX statement:

asm volatile(
  "ptx instruction here;"
  : /* outputs */
  : /* inputs */
  : /* clobbers */
);

Constraint modifiers and common operand classes

Constraint Meaning
= Write-only output operand.
+ Read-write operand.
r General integer register.
l 64-bit integer register, commonly used for addresses.
f 32-bit floating-point register.
d 64-bit floating-point register.

A memory clobber is often required when the assembly changes memory in ways the compiler cannot infer from the operand list alone. Without it, the compiler may legally reorder surrounding code in a way that breaks the intended semantics.
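
As a concrete illustration of the constraint table and the clobber rule, here is a hedged sketch of a release store written through inline PTX; the function name is made up for this example:

```cuda
// Store a flag with release semantics at GPU scope. The "memory" clobber
// tells the compiler not to reorder earlier memory operations past this
// store, which the operand list alone would not communicate.
__device__ void store_flag_release(unsigned int* flag, unsigned int value) {
  asm volatile("st.release.gpu.u32 [%0], %1;"
               :                        // no outputs
               : "l"(flag), "r"(value)  // "l": 64-bit address, "r": 32-bit value
               : "memory");
}
```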

PTX state spaces

.global

Large device memory visible across the grid, high capacity and high latency.

.shared

Low-latency memory local to a CTA, or reachable through cluster mechanisms in DSMEM workflows.

.local

Per-thread storage, backed by device memory; commonly where values land when registers spill.

.const / .param

Read-only constant space and kernel argument space with their own access semantics.

Data types that show up in Hopper code

  • Standard integer and floating-point scalar families still matter for control flow and addressing.
  • BF16, FP16, TF32, and FP8 matter because they change throughput, packing density, and tensor-core issue shape.
  • Packed types matter because multiple low-precision values are often carried inside one 32-bit register.
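
Packing in practice can be shown with the standard `cuda_fp16` intrinsics; the kernel name and launch shape here are illustrative:

```cuda
#include <cuda_fp16.h>

// Each __half2 carries two FP16 values in one 32-bit register, so a
// single __hadd2 performs two additions per instruction.
__global__ void packed_add(const __half2* a, const __half2* b,
                           __half2* out, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    out[i] = __hadd2(a[i], b[i]);
  }
}
```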

Addressing, banking, and swizzling

Shared memory is still banked, and the bank structure still matters even on Hopper. Poor access patterns can serialize otherwise parallel traffic, which is why low-level layout choices continue to be part of performance engineering instead of a side detail.

Shared-memory banks

  • Threads in a warp ideally spread their requests across banks instead of piling onto the same bank.
  • Bank conflicts increase replay and reduce effective bandwidth.
  • The more structured the access pattern, the easier it is to reason about conflict behavior.
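
A classic illustration of the bank issue is column access into a square shared-memory tile; padding the row length staggers the columns across banks. A minimal sketch, assuming a 32x32 single-tile transpose launched with a 32x32 block:

```cuda
// With 32 banks of 4-byte words, walking a column of a 32-wide row
// (tile[t][c] for varying t) would map every thread to the same bank.
// The extra column shifts each row by one bank and removes the conflict.
__global__ void transpose_tile(const float* in, float* out) {
  __shared__ float tile[32][33];  // 33, not 32: the padding column

  tile[threadIdx.y][threadIdx.x] = in[threadIdx.y * 32 + threadIdx.x];
  __syncthreads();
  out[threadIdx.y * 32 + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}
```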

128-byte transaction granularity

Many high-performance paths reason in 128-byte chunks because that aligns well with the granularity of efficient movement and layout transformations. This is one reason tile shapes and swizzles show up so often in Hopper code.

Why swizzling exists: it is a layout trick for spreading accesses more evenly across banks and transactions so structured movement stays fast even when tensor tiles are large and reused aggressively.
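
The idea reduces to index arithmetic. This is an illustrative XOR swizzle, not the exact layout any particular library uses:

```cuda
// Map (row, col) in a 32-wide tile to a swizzled linear offset.
// XOR-ing the column with low bits of the row rotates each row's
// elements across banks, so column walks no longer share a bank.
__device__ __forceinline__ int swizzled_offset(int row, int col) {
  return row * 32 + (col ^ (row & 7));  // 8-row period, illustrative
}
```

Because XOR is its own inverse, the same function recovers the original column, which is what makes swizzled layouts cheap to read back.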

Generic pointers, state-space pointers, cvta, and mapa

CUDA C++ usually gives you generic pointers, but PTX instructions often need a pointer that is already understood in the correct state space. That is why conversion steps matter. The compiler cannot always assume which physical memory region a generic pointer should target inside low-level assembly.

Generic pointers

Flexible at the CUDA C++ level, but ambiguous for instructions that care about a specific state space.

State-space pointers

Explicitly identify shared, global, local, or constant semantics so the instruction can target the correct path.

cvta and mapa

// Convert a generic pointer into a shared-memory state-space address
uint64_t smem_addr;
asm volatile("cvta.to.shared.u64 %0, %1;" : "=l"(smem_addr) : "l"(ptr));

// Map that shared-memory address into another CTA's shared-memory region,
// selected by its cluster rank
uint64_t remote_addr;
asm volatile("mapa.shared::cluster.u64 %0, %1, %2;" : "=l"(remote_addr) : "l"(smem_addr), "r"(rank));

cvta turns a generic address into the appropriate state-space address. mapa takes that one step further for cluster programming by mapping a shared-memory address into the address space of another CTA inside the cluster.
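
For the cvta half of that pair, CUDA also exposes an intrinsic, so hand-written assembly is not required. A minimal sketch; the function name `demo` is made up for this example:

```cuda
// __cvta_generic_to_shared performs the same generic-to-shared
// conversion as the cvta.to.shared instruction above.
__device__ void demo(float* p /* generic pointer into __shared__ data */) {
  size_t smem_addr = __cvta_generic_to_shared(p);
  (void)smem_addr;  // would typically feed cp.async / TMA-style instructions
}
```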

Important distinction: cluster rank is not the same thing as a global block ID. A block ID tells you where a CTA sits in the full grid. Cluster rank tells you where it sits inside its local cluster, and that is the value mapping operations care about.

Integration takeaways

  1. Do not flatten clusters into “shared memory, but bigger.” Residency guarantees and ownership still matter.
  2. Keep state-space terminology precise. .global, .shared, and generic pointers are not interchangeable.
  3. Treat inline PTX as a surgical tool. Use it when exact semantics are the point, not as a stylistic preference.
  4. Respect bank behavior and 128-byte layout assumptions. Swizzles and packed layouts exist for a reason.
  5. Be exact about rank and mapping. Many cluster bugs are really address-space or rank-mapping bugs.

Glossary

Cluster A guaranteed co-scheduled group of CTAs on nearby SMs inside a GPC.
DSMEM Distributed shared memory, remote shared-memory access across CTAs in one cluster.
PTX NVIDIA's virtual ISA used as the lower-level target inside the CUDA toolchain.
cvta Convert to address; turns a generic pointer into the correct state-space address.
mapa Map address; remaps shared-memory addresses across CTAs inside a cluster.

Continue the course

Lesson 2 builds the address-space and instruction-level vocabulary Hopper kernels need. The next big step is synchronization under overlap: barriers, wait patterns, and correctness when loads and compute are intentionally in flight at the same time.