Lesson 10

Multi-GPU, Part 2

Lesson 10 is the software side of the cluster. Slurm allocates the job, PMIx introduces ranks to one another, NCCL turns that bootstrap state into communicators and routes, and the parallelism strategies of modern training map their tensor movement onto those collectives.


How the distributed stack fits together

The final lesson is about separation of responsibilities. A training job is not “NCCL doing everything.” Slurm decides where the work runs, PMIx gives isolated ranks a way to publish and retrieve bootstrap metadata, and NCCL builds the communicator and issues the collective GPU work once the participants know who they are.

Allocation and launch

Slurm controls nodes, GPUs, CPUs, memory, task placement, and the basic launch environment for each process.

Discovery and communication

PMIx handles metadata exchange. NCCL uses that state plus topology discovery to construct rings, trees, and direct paths for collective communication.

The clean summary from the course: Slurm allocates, PMIx introduces peers, and NCCL moves tensors efficiently across the cluster.

Slurm is the cluster control plane, not the tensor communication layer

Slurm's job is resource allocation, placement, environment setup, and process launch. It knows the job shape, the node list, and which devices are assigned to each task, but it does not by itself solve rank-to-rank discovery or collective GPU communication.

Responsibility | What it gives you | Why it matters
Resource allocation | Nodes, GPUs, CPUs, memory, and wall-clock limits. | Defines the physical footprint of the distributed job.
Task launch | Process startup across the allocated nodes, often via srun --mpi=pmix. | Replaces slower or more ad hoc host-by-host bootstrap flows.
Device isolation | CUDA_VISIBLE_DEVICES and cgroup-level device filtering. | Ensures each task sees only the GPUs it was allocated.

The notes also emphasize the hierarchy of terms: job, task, rank, local rank, and namespace. Those are not vocabulary trivia. They are the handles the rest of the startup path uses to bind processes to devices and to each other.

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-task=1

srun --mpi=pmix python train.py

PMIx solves the moment after launch and before communication

A newly started rank knows it exists, but it does not automatically know where rank 0 lives, which GPU or NIC the others are bound to, or where the bootstrap data for the communicator has been published. PMIx acts as the temporary data exchange layer for that phase.

Operation | Purpose | Typical use in this flow
PMIx_Put | Publish bootstrap metadata. | Rank 0 posts IDs or connection state into the PMIx store.
PMIx_Commit | Push local data so it becomes visible at the desired scope. | Makes the published bootstrap state readable by peers.
PMIx_Get | Retrieve data that another participant published. | Follower ranks fetch the NCCL unique ID or peer metadata.
PMIx_Fence | Synchronization point before dependent startup continues. | Prevents some ranks from racing ahead before bootstrap state is globally visible.
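The put/commit/get visibility rule above can be sketched with a toy in-process key-value store. This is plain C with no real PMIx; names like kv_put are illustrative stand-ins, not the PMIx API. The point it demonstrates: data staged with a put is not readable by peers until the publisher commits it.

```c
#include <assert.h>
#include <string.h>

/* Toy single-process model of the PMIx visibility rule.
 * kv_put stages data locally, kv_commit makes it globally
 * visible, and kv_get only ever sees committed data.
 * All names here are illustrative, not the real PMIx API. */

#define MAXLEN 64

static char staged[MAXLEN];     /* local, not yet visible  */
static char committed[MAXLEN];  /* visible to every "rank" */
static int  has_committed = 0;

void kv_put(const char *value) {            /* like PMIx_Put    */
    strncpy(staged, value, MAXLEN - 1);
}

void kv_commit(void) {                      /* like PMIx_Commit */
    strncpy(committed, staged, MAXLEN - 1);
    has_committed = 1;
}

/* like PMIx_Get: returns NULL until the publisher commits */
const char *kv_get(void) {
    return has_committed ? committed : NULL;
}
```

In the real flow, rank 0 puts and commits the bootstrap blob, every rank passes a fence, and only then do followers issue their gets.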

Local rank is the critical bridge into CUDA. Because Slurm filters and re-indexes visible devices, the process usually maps PMIX_LOCAL_RANK onto the GPUs it can see after filtering, not onto the raw physical numbering on the node.

Debugging detail worth remembering: if every task sees only one filtered device, every task may report it is running on GPU 0. In that context, PCI bus IDs are often more trustworthy than the visible CUDA index.
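The binding rule described above can be reduced to a one-line sketch: after filtering, a process picks its device from the list it can see using its local rank. `pick_device` is an illustrative helper, not a CUDA or Slurm API; its result is what you would hand to cudaSetDevice.

```c
#include <assert.h>

/* Map a PMIx/Slurm local rank onto the device list the process
 * can actually see after CUDA_VISIBLE_DEVICES filtering. The
 * result indexes the *filtered* list (the cudaSetDevice index),
 * not the physical GPU numbering on the node. */
int pick_device(int local_rank, int visible_count) {
    return local_rank % visible_count;
}
```

Note the second test case below: when every task sees exactly one filtered device, every task picks index 0, which is exactly the "everyone reports GPU 0" effect the debugging note describes.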

NCCL turns bootstrap metadata into a communicator and a route plan

Before collectives can run, every participant needs to join the same communicator. The slides describe the common pattern clearly: rank 0 creates an ncclUniqueId, publishes it through PMIx, peers fence and retrieve it, and then every rank calls ncclCommInitRank with the same unique ID, world size, and rank.

ncclUniqueId id;
ncclComm_t comm;
if (global_rank == 0) {
  ncclGetUniqueId(&id);
  // publish through PMIx (put, then commit)
}

// synchronize (fence), retrieve the ID, then initialize
ncclCommInitRank(&comm, world_size, id, global_rank);

Initialization is expensive because the ID is only the entry point. NCCL still has to build the actual transport graph: which peers are local, which routes should prefer NVSwitch, which NICs should carry inter-node jumps, and whether the communication structure should look more like a ring or a tree.

Why the ID matters

Every participant must possess the exact same unique ID; otherwise they are not joining the same communicator.

Why the fence matters

PMIx visibility rules are real. Peers cannot safely fetch communicator bootstrap state until rank 0 has published and committed it at the correct scope.

Teardown matters too. ncclCommDestroy invalidates the communicator and frees the scratch space, mappings, and transport state built during initialization.

Collectives are the real language of multi-GPU tensor movement

The lesson frames H100 clusters in terms of collective dataflow patterns, not manual send/receive. Broadcast, Reduce, AllReduce, AllGather, ReduceScatter, and All-to-All each solve a different tensor movement problem, and those patterns map directly onto different training strategies.

Primitive | What it does | Typical use
Broadcast | One root rank sends the same data to every rank. | Weight initialization, checkpoint restore, shared metadata distribution.
Reduce | All ranks contribute values; one root receives the reduced result. | Centralized consumption of a reduced tensor.
AllReduce | All ranks contribute and all receive the reduced result. | Gradient synchronization in data parallel training.
AllGather | Every rank contributes a shard and all ranks receive the concatenated whole. | Reconstructing full activations or parameters from shards.
ReduceScatter | Reduce first, then distribute disjoint slices of the reduced result. | Sharded gradient or activation flows, especially in tensor and sequence parallel paths.
All-to-All | Every rank sends a different chunk to every other rank. | Expert parallel routing and token exchange.

The notes also emphasize that the operation itself is only part of the story. Datatype, reduction operator, communicator, and CUDA stream are part of the call signature because NCCL is meant to compose with the rest of the CUDA execution model rather than live outside it.
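As a sketch of what the AllReduce contract actually promises, here is a host-side simulation in plain C (no NCCL, no GPUs): every rank contributes a buffer, and every rank ends up holding the element-wise sum. The real ncclAllReduce call additionally carries the datatype, reduction operator, communicator, and CUDA stream discussed above; this toy fixes them to float, sum, and synchronous host execution.

```c
#include <assert.h>

/* Host-side simulation of AllReduce with a sum operator:
 * each of `nranks` ranks contributes `count` floats, and every
 * rank's receive buffer ends up with the element-wise sum.
 * Real NCCL would do this across devices on a CUDA stream. */
void sim_allreduce_sum(int nranks, int count,
                       float send[nranks][count],
                       float recv[nranks][count]) {
    for (int i = 0; i < count; i++) {
        float sum = 0.0f;                 /* reduce: combine all ranks */
        for (int r = 0; r < nranks; r++)
            sum += send[r][i];
        for (int r = 0; r < nranks; r++)  /* everyone gets the result  */
            recv[r][i] = sum;
    }
}
```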

The four major parallelism patterns are really communication-pattern choices

The final conceptual move of the course is to tie AI training strategies back to the collectives they require. Data, tensor, pipeline, and expert parallelism are not just framework buzzwords. Each one implies a specific pattern of tensor replication, sharding, and communication.

Data parallelism

Replicate the model on every GPU and split the data. The defining collective is AllReduce, with Broadcast often used for initialization.

Tensor parallelism

Split large matrices across GPUs. This introduces AllGather, AllReduce, and ReduceScatter depending on the sharding direction and whether sequence parallelism is in play.

Pipeline parallelism

Split the model vertically by layers. Communication becomes neighbor-to-neighbor point-to-point traffic as micro-batches move through the pipeline.
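The neighbor-to-neighbor pattern can be sketched with a toy host-side pipeline: each stage applies its own transform (here just a per-stage bias, standing in for a layer) and hands the activation to the next stage. `run_pipeline` and its transfer counter are illustrative; the point is that the traffic is all point-to-point sends, never a collective.

```c
#include <assert.h>

/* Toy pipeline: `nstages` stages each apply a per-stage bias
 * (standing in for a layer), and each micro-batch moves from
 * stage s to stage s+1 via a neighbor handoff. `transfers`
 * counts those point-to-point sends; no collective appears. */
void run_pipeline(int nstages, int nmicro,
                  const float bias[], float batch[], int *transfers) {
    *transfers = 0;
    for (int m = 0; m < nmicro; m++) {
        for (int s = 0; s < nstages; s++) {
            batch[m] += bias[s];          /* stage s computes      */
            if (s + 1 < nstages)
                (*transfers)++;           /* "send" to stage s + 1 */
        }
    }
}
```

Each micro-batch costs nstages - 1 handoffs, which is why pipeline parallelism is sensitive to per-hop latency rather than collective bandwidth.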

Expert parallelism

Route tokens to whichever GPU hosts the chosen expert. This is where the system starts to look like large-scale All-to-All traffic instead of simple replicated gradient exchange.
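The All-to-All shape behind expert routing can be simulated on host arrays: rank r's chunk j is delivered into rank j's slot r. With one value per chunk this is exactly a transpose of the send matrix, which makes it easy to see that the data is reshuffled rather than replicated or reduced. `sim_alltoall` is a toy, not an NCCL call.

```c
#include <assert.h>

/* Host-side simulation of All-to-All: rank r's chunk j lands
 * in rank j's slot r. With one value per chunk this is a
 * transpose of the send matrix; no reduction, no replication,
 * just a full reshuffle, which is what expert routing needs. */
void sim_alltoall(int nranks, int send[nranks][nranks],
                  int recv[nranks][nranks]) {
    for (int r = 0; r < nranks; r++)
        for (int j = 0; j < nranks; j++)
            recv[j][r] = send[r][j];
}
```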

The sequence-parallel discussion in the slides is especially important because it shows how an AllReduce can often be decomposed into ReduceScatter + AllGather, leaving room to do useful work on the sharded intermediate state rather than replicating everything immediately.
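That decomposition can be checked concretely on host arrays. In this toy sketch (plain C, illustrative names), each rank first ends up owning only its reduced shard (the ReduceScatter half), and only afterwards does every rank collect the full reduced vector (the AllGather half); the gap between the two phases is exactly where sharded work can happen.

```c
#include <assert.h>

/* Decompose AllReduce into ReduceScatter + AllGather on host
 * arrays, using nranks ranks and nranks elements so shard r is
 * a single element. After phase 1, rank r owns only reduced
 * element r; after phase 2, every rank holds the full result,
 * matching what a direct sum-AllReduce would have produced. */
void sim_rs_ag(int nranks, float send[nranks][nranks],
               float recv[nranks][nranks]) {
    float slice[nranks];                  /* slice[r]: rank r's shard */
    for (int r = 0; r < nranks; r++) {    /* phase 1: ReduceScatter   */
        slice[r] = 0.0f;
        for (int p = 0; p < nranks; p++)
            slice[r] += send[p][r];
    }
    /* ...sharded work could happen here, on slice[r] only...   */
    for (int r = 0; r < nranks; r++)      /* phase 2: AllGather       */
        for (int i = 0; i < nranks; i++)
            recv[r][i] = slice[i];
}
```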

The final mental model: once you know the primitive each parallelism strategy requires, you can start reasoning clearly about whether the bottleneck is bandwidth, latency, topology, or buffering rather than treating “distributed training” as one opaque step.

Practical guidance

  1. Keep the layers of responsibility straight. Job scheduling, bootstrap metadata exchange, and collective execution solve different problems even though they appear in one startup path.
  2. Bind GPUs using the filtered local view. Local rank plus visible device count are usually the correct inputs after Slurm has applied device filtering.
  3. Treat communicator setup as a real phase. The unique ID is only the beginning; topology discovery and route construction are where startup cost comes from.
  4. Choose collectives from the tensor pattern, not from habit. AllReduce is not the answer to every distributed step once the model is sharded in different ways.
  5. See the course as one staircase. The kernel lessons explain how one GPU stays busy. The final two lessons explain how many GPUs become one training machine.

Glossary

Slurm: Cluster workload manager and scheduler that allocates resources and launches tasks.
PMIx: Process bootstrap and metadata exchange interface used for scalable peer discovery after launch.
ncclUniqueId: Bootstrap token shared across ranks so they can join the same NCCL communicator.
AllReduce: Collective where all ranks contribute data and all ranks receive the reduced result.
Expert parallelism: Distributed routing strategy where tokens are sent to whichever device hosts the selected experts.

Course wrap-up

This final lesson closes the loop: Hopper primitives, kernel pipelines, schedulers, node fabrics, and distributed orchestration are all part of the same machine model. The course is complete, but the site now keeps the whole path connected so you can revisit any step without dropping out of the main UI.