Overview
Modern CUDA performance work often requires thinking across scheduling, memory topology, compiler behavior, and instruction-level control at the same time. Lesson 2 is where those layers begin to collapse into one picture. Thread block clusters change how blocks cooperate. Distributed shared memory changes what can be shared without bouncing through global memory. PTX and inline PTX expose instruction forms that CUDA C++ does not always surface directly. Pointer conversion instructions tie all of that back to the actual hardware address spaces.
Main idea: high-performance GPU code is not only about doing less work. It is also about placing data in the right memory region, using the right instruction form, and minimizing unnecessary routing, synchronization, and bank conflicts.
Clusters
Guaranteed co-scheduled CTAs on nearby SMs create a new cooperation surface above the block level.
DSMEM
Remote shared-memory access inside a cluster reduces how often blocks need to fall back to global memory.
Inline PTX
Instruction-level control matters when the exact state space, datatype, or synchronization form is the point.
Pointer conversion
cvta and mapa connect high-level pointers to concrete memory and cluster semantics.
Thread block clusters
A thread block cluster is a group of thread blocks guaranteed to be co-scheduled concurrently on adjacent SMs within a GPC. The important point is that this extends the hierarchy from threads -> thread blocks -> clusters -> grid without pretending one block is now spread across many SMs.
Why clusters matter
When one thread block cannot hold enough useful shared state on a single SM, clusters provide a way for nearby CTAs to cooperate more directly.
What does not change
A CTA still executes on exactly one SM. Clusters coordinate whole blocks; they do not split one CTA across the machine.
Key properties
- Clusters are explicit software constructs, not an automatic scheduling side effect.
- All CTAs in a cluster are guaranteed concurrent residency on nearby SMs.
- They are useful when cross-block data sharing is frequent enough to make global memory too expensive.
- They trade some coordination overhead for a larger effective working set.
Cluster sizing tradeoffs
| Cluster size | Upside | Downside |
|---|---|---|
| 2 | Low coordination overhead and often a practical default. | Limited shared-memory reach across the cluster. |
| 8 | More cooperating CTAs with a useful balance of scale and usability. | More synchronization cost and more constraints on grid shape. |
| 16 | Maximum neighborhood described in the lesson notes; on Hopper, sizes above 8 require opting into non-portable cluster support. | Highest synchronization overhead and a larger risk of idle hardware. |
Declaring a cluster
```cuda
__global__ void __cluster_dims__(2, 1, 1)
cluster_kernel(float* input, float* output) {
    // kernel body
}
// launch shape must remain compatible with cluster dimensions
```
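When the cluster shape is not fixed at compile time, it can instead be supplied at launch through the extended launch API. A hedged sketch, assuming a kernel compiled without `__cluster_dims__` (the kernel and wrapper names here are hypothetical):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel with no compile-time cluster shape.
__global__ void cluster_kernel_rt(float* input, float* output) {
    // kernel body
}

void launch(float* input, float* output) {
    cudaLaunchConfig_t config = {};
    config.gridDim  = dim3(128, 1, 1);   // grid must be divisible by cluster dims
    config.blockDim = dim3(256, 1, 1);

    // Attach the cluster dimension as a launch attribute.
    cudaLaunchAttribute attr = {};
    attr.id = cudaLaunchAttributeClusterDimension;
    attr.val.clusterDim.x = 2;           // 2 x 1 x 1, matching the example above
    attr.val.clusterDim.y = 1;
    attr.val.clusterDim.z = 1;
    config.attrs    = &attr;
    config.numAttrs = 1;

    cudaLaunchKernelEx(&config, cluster_kernel_rt, input, output);
}
```

The runtime form is useful when the same kernel is launched with different cluster shapes depending on problem size.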
Cluster special registers
| Register | Meaning |
|---|---|
| %cluster_ctaid | CTA identifier inside the cluster. |
| %cluster_nctaid | Cluster dimensions. |
| %cluster_ctarank | Linearized rank of the CTA inside the cluster. |
| %cluster_nctarank | Total CTA count inside the cluster. |
| %is_explicit_cluster | Whether the launch used an explicit cluster rather than the implicit 1x1x1 case. |
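These registers can be read directly with inline PTX. A minimal sketch (note the doubled `%%` needed to escape register names inside an `asm` string):

```cuda
// Read this CTA's linear rank inside its cluster.
__device__ unsigned int my_cluster_rank() {
    unsigned int rank;
    asm volatile("mov.u32 %0, %%cluster_ctarank;" : "=r"(rank));
    return rank;
}

// Read the total number of CTAs in the cluster.
__device__ unsigned int my_cluster_nranks() {
    unsigned int n;
    asm volatile("mov.u32 %0, %%cluster_nctarank;" : "=r"(n));
    return n;
}
```

The same values are also exposed at the CUDA C++ level through the cooperative groups cluster API, so inline PTX is only needed when you are already working at the instruction level.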
Distributed shared memory
Distributed shared memory is the cluster-era extension of shared memory. It allows direct memory access between SMs inside the same thread block cluster. Ownership still stays per block, but the programmer can address a neighboring CTA's shared-memory allocation without routing everything through global memory.
What it enables
Remote shared-memory loads, stores, and atomics between CTAs in the same cluster.
Why it is fast
It uses an SM-to-SM communication path instead of forcing all communication through L2 and HBM.
What it is not
It is not one merged shared-memory pool. Each CTA still owns its own local allocation.
The practical win is a larger effective working set at a much cheaper memory tier than global memory. That is especially useful for top-k, matrix pipelines, reductions, or tile exchange patterns where neighboring CTAs need each other's intermediate state.
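A minimal sketch of a DSMEM exchange using the cooperative groups cluster API, assuming a cluster of 2 where each CTA reads its neighbor's partial result (the kernel name and the stand-in "work" are illustrative):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void __cluster_dims__(2, 1, 1) exchange_kernel(float* out) {
    __shared__ float partial;
    cg::cluster_group cluster = cg::this_cluster();

    if (threadIdx.x == 0) partial = (float)blockIdx.x;  // stand-in for real work

    // Every CTA must arrive before any remote shared memory is read.
    cluster.sync();

    // Map our shared variable into the neighbor CTA's shared-memory window.
    unsigned int peer = cluster.block_rank() ^ 1;
    float* remote = cluster.map_shared_rank(&partial, peer);
    float neighbor_value = *remote;

    // Keep remote accesses complete before CTAs exit and release their SMEM.
    cluster.sync();

    if (threadIdx.x == 0) out[blockIdx.x] = neighbor_value;
}
```

The two `cluster.sync()` calls bracket the remote access: the first publishes each CTA's data, the second keeps a CTA's shared memory alive until its peers are done reading it.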
TMA and multicast: DSMEM becomes much more powerful when paired with TMA-backed movement and multicast. One producer can seed multiple consumers without repeating the full data movement path for every destination.
PTX, inline PTX, operands, and state spaces
PTX is NVIDIA's virtual ISA inside the CUDA toolchain. CUDA C++ lowers into PTX, which is then lowered again into machine-specific code. Inline PTX matters when you want instruction-level control without rewriting an entire kernel in PTX.
Why use inline PTX
- To access hardware features that CUDA C++ does not expose cleanly.
- To hand-optimize small performance-critical fragments.
- To request exact operand types, state spaces, or synchronization forms.
- To inject constants or instruction parameters through templates.
```cuda
asm volatile(
    "ptx instruction here;"
    : /* outputs */
    : /* inputs */
    : /* clobbers */
);
```
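The template filled in with the simplest possible case, an integer add routed through a PTX instruction:

```cuda
// Sketch: %0 binds to the output constraint, %1 and %2 to the inputs.
__device__ int ptx_add(int a, int b) {
    int result;
    asm("add.s32 %0, %1, %2;"
        : "=r"(result)       // output: write-only 32-bit register
        : "r"(a), "r"(b));   // inputs: 32-bit registers
    return result;
}
```

No `volatile` and no clobbers are needed here because the instruction is a pure function of its operands.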
Constraint modifiers and common operand classes
| Constraint | Meaning |
|---|---|
| = | Write-only output operand. |
| + | Read-write operand. |
| r | 32-bit integer register. |
| l | 64-bit integer register, commonly used for addresses. |
| f | 32-bit floating-point register. |
| d | 64-bit floating-point register. |
A memory clobber is often required when the assembly changes memory in ways the compiler cannot infer from the operand list alone. Without it, the compiler may legally reorder surrounding code in a way that breaks the intended semantics.
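A sketch of the clobber in practice: a store issued through inline PTX writes memory that the compiler cannot see in the operand list, so `"memory"` is declared explicitly.

```cuda
// The "memory" clobber tells the compiler that memory changed, so it must
// not cache values across this asm or reorder loads/stores around it.
__device__ void ptx_store(unsigned int* ptr, unsigned int value) {
    asm volatile("st.global.u32 [%0], %1;"
                 :                        // no outputs
                 : "l"(ptr), "r"(value)   // 64-bit address, 32-bit value
                 : "memory");
}
```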
PTX state spaces
.global
Large device memory visible across the grid, high capacity and high latency.
.shared
Low-latency memory local to a CTA, or reachable through cluster mechanisms in DSMEM workflows.
.local
Per-thread storage, often backed by memory when registers spill.
.const / .param
Read-only constant space and kernel argument space with their own access semantics.
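The state spaces map directly onto familiar CUDA C++ declarations. A rough sketch, assuming a block of up to 256 threads (the kernel name is illustrative):

```cuda
__constant__ float scale;                    // lives in .const

__global__ void spaces_kernel(float* data) { // 'data' arrives via .param and
                                             // points into .global
    __shared__ float tile[256];              // .shared: per-CTA scratch
    float scratch[64];                       // .local if it spills out of registers

    int i = threadIdx.x;
    tile[i] = data[i] * scale;               // .global load, .shared store
    scratch[i % 64] = tile[i];
    data[i] = scratch[i % 64];
}
```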
Data types that show up in Hopper code
- Standard integer and floating-point scalar families still matter for control flow and addressing.
- BF16, FP16, TF32, and FP8 matter because they change throughput, packing density, and tensor-core issue shape.
- Packed types matter because multiple low-precision values are often carried inside one 32-bit register.
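Packing is visible directly in the CUDA C++ type system. A minimal sketch with FP16 pairs, where one `__half2` occupies a single 32-bit register:

```cuda
#include <cuda_fp16.h>

// One fused multiply-add updates both packed halves at once.
__device__ __half2 packed_fma(__half2 a, __half2 b, __half2 c) {
    return __hfma2(a, b, c);
}
```

This is the same density argument that drives BF16 and FP8 layouts: more values per register means more values per instruction and per 128-byte transaction.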
Addressing, banking, and swizzling
Shared memory is still banked, and the bank structure still matters even on Hopper. Poor access patterns can serialize otherwise parallel traffic, which is why low-level layout choices continue to be part of performance engineering instead of a side detail.
Shared-memory banks
- Threads in a warp ideally spread their requests across banks instead of piling onto the same bank.
- Bank conflicts increase replay and reduce effective bandwidth.
- The more structured the access pattern, the easier it is to reason about conflict behavior.
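The classic illustration is a transpose tile, where one extra padding column keeps column reads conflict-free. A sketch assuming a 32x32 thread block:

```cuda
#define TILE 32

__global__ void transpose_tile(const float* in, float* out) {
    // Without the +1, a column walk (fixed column, varying row) strides by
    // 32 words and lands every element in the same bank; the pad column
    // shifts each row by one bank.
    __shared__ float tile[TILE][TILE + 1];

    int x = threadIdx.x, y = threadIdx.y;
    tile[y][x] = in[y * TILE + x];     // row-major write: conflict-free
    __syncthreads();
    out[y * TILE + x] = tile[x][y];    // column read: conflict-free via the pad
}
```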
128-byte transaction granularity
Many high-performance paths reason in 128-byte chunks because that aligns well with the granularity of efficient movement and layout transformations. This is one reason tile shapes and swizzles show up so often in Hopper code.
Why swizzling exists: it is a layout trick for spreading accesses more evenly across banks and transactions so structured movement stays fast even when tensor tiles are large and reused aggressively.
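One common swizzle form is an XOR of the column index with the row index, which permutes each row's columns so both row and column walks spread across banks without spending storage on padding. A simplified sketch for a 32-wide float tile and a 32x32 thread block:

```cuda
#define TILE 32

// Element (row, col) is stored at tile[row][swizzle(row, col)].
__device__ __forceinline__ int swizzle(int row, int col) {
    return col ^ (row & (TILE - 1));
}

__global__ void swizzled_transpose(const float* in, float* out) {
    __shared__ float tile[TILE][TILE];
    int x = threadIdx.x, y = threadIdx.y;

    tile[y][swizzle(y, x)] = in[y * TILE + x];  // row write: banks stay spread
    __syncthreads();
    // Column walk: fixed y, varying x now maps to 32 distinct banks because
    // the XOR makes the bank index (y ^ x) % 32, distinct for each x.
    out[y * TILE + x] = tile[x][swizzle(x, y)];
}
```

Hardware swizzle modes used with TMA follow the same idea at 128-byte granularity; this index-level version is only meant to show the mechanism.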
Generic pointers, state-space pointers, cvta, and mapa
CUDA C++ usually gives you generic pointers, but PTX instructions often need a pointer that is already understood in the correct state space. That is why conversion steps matter. The compiler cannot always assume which physical memory region a generic pointer should target inside low-level assembly.
Generic pointers
Flexible at the CUDA C++ level, but ambiguous for instructions that care about a specific state space.
State-space pointers
Explicitly identify shared, global, local, or constant semantics so the instruction can target the correct path.
cvta and mapa
```cuda
// Convert a generic pointer into a specific address space
asm volatile("cvta.to.shared.u64 %0, %1;" : "=l"(smem_addr) : "l"(ptr));

// Map a shared-memory address into another CTA's shared-memory region
asm volatile("mapa.shared::cluster.u64 %0, %1, %2;" : "=l"(remote_addr) : "l"(addr), "r"(rank));
```
cvta turns a generic address into the appropriate state-space address. mapa takes that one step further for cluster programming by mapping a shared-memory address into the address space of another CTA inside the cluster.
Important distinction: cluster rank is not the same thing as a global block ID. A block ID tells you where a CTA sits in the full grid. Cluster rank tells you where it sits inside its local cluster, and that is the value mapping operations care about.
Integration takeaways
- Do not flatten clusters into “shared memory, but bigger.” Residency guarantees and ownership still matter.
- Keep state-space terminology precise. .global, .shared, and generic pointers are not interchangeable.
- Treat inline PTX as a surgical tool. Use it when exact semantics are the point, not as a stylistic preference.
- Respect bank behavior and 128-byte layout assumptions. Swizzles and packed layouts exist for a reason.
- Be exact about rank and mapping. Many cluster bugs are really address-space or rank-mapping bugs.
Glossary
| Term | Meaning |
|---|---|
| Cluster | A guaranteed co-scheduled group of CTAs on nearby SMs inside a GPC. |
| DSMEM | Distributed shared memory, remote shared-memory access across CTAs in one cluster. |
| PTX | NVIDIA's virtual ISA used as the lower-level target inside the CUDA toolchain. |
| cvta | Convert to address; turns a generic pointer into the correct state-space address. |
| mapa | Map address; remaps shared-memory addresses across CTAs inside a cluster. |
Continue the course
Lesson 2 builds the address-space and instruction-level vocabulary Hopper kernels need. The next big step is synchronization under overlap: barriers, wait patterns, and correctness when loads and compute are intentionally in flight at the same time.
Asynchronicity and Barriers
Move into mbarrier, proxy fencing, wait semantics, and correctness under overlap.