Why scale out at all
The slides start with a blunt calculation. If training costs roughly six FLOPs per parameter per token,
then a 1T-parameter model trained on 10T tokens lands near 6 x 10^25 FLOPs. Even assuming you
sustain 1000 TFLOP/s, that works out to roughly 1,900 years on one device. Multi-GPU systems are not a luxury.
They are the only way the training budget enters human time.
More throughput
Additional GPUs cut wall-clock time by spreading the math over more tensor cores and more HBM capacity.
More memory surface
Bigger models and their optimizer state stop fitting on one device, so the system needs many pools of HBM, plus a fabric fast enough to keep them synchronized during training.
The point of the lesson: once you leave one GPU, the machine is no longer just SMs, tensor cores, and shared memory. The machine is now a communication hierarchy.
A DGX H100 node is already a small distributed system
The notes describe the DGX H100 node as eight SXM5 H100 GPUs connected by four NVSwitch chips, plus eight ConnectX-7 NICs and four OSFP cages for the external compute fabric. That matters because the node is not one GPU with accessories. It is a coordinated mesh of accelerators and network endpoints.
| Component | Role | Why it matters |
|---|---|---|
| 8 x H100 SXM5 GPUs | Compute and HBM capacity inside the node. | They are the endpoints that need to exchange activations, gradients, and model state. |
| 4 x NVSwitch | Internal full-speed switching fabric. | Lets every GPU reach every other GPU without collapsing to a slow PCIe-style path. |
| 8 x ConnectX-7 NICs | External network interfaces for the compute fabric. | Once traffic leaves the node, these devices own the critical bandwidth transition. |
| 4 x OSFP cages | Physical outbound high-speed links. | Expose the node to the rail-aligned InfiniBand fabric. |
The lesson also separates compute fabric from storage fabric. The OSFP path is for inter-node GPU communication. Dataset ingest and checkpoint traffic use a different set of interfaces and a different network design.
Inside the node, NVLink and NVSwitch define the fast path
H100's fourth-generation NVLink gives each GPU about 900 GB/s of bidirectional bandwidth. The notes frame that as the reason GPUs can skip the CPU path and communicate directly out of HBM instead of bouncing through ordinary host-centric links.
NVLink
The direct GPU-to-GPU link layer. The notes call out 18 individual NVLink links per H100 (18 x 50 GB/s accounts for the 900 GB/s figure) and treat the bandwidth jump over PCIe as one of the main reasons modern multi-GPU systems behave differently.
NVSwitch
The node-wide switching layer that makes the interconnect fully non-blocking. Every GPU can reach every other GPU at full speed without being trapped in a daisy-chain story.
The important conceptual transition is that the node starts to feel like a fabric rather than eight isolated devices. The notes even emphasize that switch-side operations matter, because the switch layer is not just a dumb cable crossbar.
The bandwidth cliff: inside one DGX node, GPUs talk over NVLink. Once traffic leaves the box, it hits the NIC and the external network. The whole scaling story is shaped by how well you manage that drop in bandwidth and added distance.
Outside the node, ConnectX, rails, and the network topology decide whether scaling holds
ConnectX-7 is the exit point from the node. The notes describe eight compute NICs per DGX H100 and a rail-aligned system where GPU 0 across all nodes shares one rail, GPU 1 shares another, and so on. This creates eight parallel traffic planes instead of one giant mixed queue.
| Concept | What it means | Why it matters |
|---|---|---|
| Rail alignment | Each GPU index across the cluster maps onto its own independent network plane. | Reduces interference and keeps traffic patterns cleaner at the leaf layer. |
| Leaf / spine / core | A hierarchical InfiniBand fat-tree that expands from one scalable unit to larger pods. | Determines path diversity, oversubscription behavior, and cluster-wide reachability. |
| Adaptive routing | Switch hardware chooses among multiple viable paths based on congestion. | Prevents flows from piling onto one hot uplink when alternatives are available. |
| SHIELD and in-network logic | Hardware features that isolate and route around link faults (SHIELD self-healing) or offload parts of communication, such as in-network reductions, into the switches. | Scaling fails fast if the fabric cannot handle faults and reductions efficiently. |
The notes describe SuperPOD-style deployments as rail-optimized, three-tier fabrics built from Quantum-2 InfiniBand switches. The exact numbers matter less than the lesson's main point: once you scale beyond one node, topology is no longer background detail. It is part of performance engineering.
CUDA P2P and UVA expose the basic direct path, but not the whole communication problem
CUDA gives two key mechanisms at this level. Unified Virtual Addressing puts host memory and every GPU's HBM into one virtual address space, so the runtime can tell from a pointer's value where the data lives. Peer access then allows one GPU to directly read and write another GPU's memory over the available fabric.
cudaDeviceEnablePeerAccess
Enables direct peer access. The notes stress that it is unidirectional, so bidirectional access needs the call on both devices.
cudaMemcpyPeer
Issues direct memory copies between devices without staging through host memory, assuming peer access and the right topology are available.
But the lesson also argues that raw P2P is not the whole answer. On an HGX H100 board, the topology is a switch-connected mesh, not a straight line. A manual load/store path from one SM to another GPU's HBM does not automatically saturate the fabric or solve multi-party coordination. That is exactly why the next lesson turns to NCCL, PMIx, and Slurm.
```c
// Conceptual CUDA-side flow (host code running on device `dev`; error checks omitted)
int can_access = 0;
cudaDeviceCanAccessPeer(&can_access, dev, peer);  // query topology first
if (can_access)
    cudaDeviceEnablePeerAccess(peer, 0);          // enables dev -> peer only
cudaMemcpyPeer(dst_ptr, dst_device, src_ptr, src_device, bytes);
```
Practical guidance
- Measure the bandwidth hierarchy, not just the GPU. Intra-node NVLink behavior and inter-node NIC behavior are very different scaling regimes.
- Read the node as a fabric. NVSwitch, ConnectX, OSFP, rails, and topology are part of the machine your kernel actually runs inside.
- Use peer access deliberately. Direct GPU memory access is powerful, but it is not an automatic replacement for collective libraries or topology-aware scheduling.
- Expect topology to shape performance. Rail alignment, adaptive routing, and switch hierarchy determine whether communication stays balanced as the cluster grows.
- Treat lesson 9 as the hardware preface to lesson 10. Once you understand the fabric, the orchestration stack makes much more sense.
Glossary
| Term | Definition |
|---|---|
| NVLink | Direct GPU-to-GPU interconnect used for high-bandwidth communication inside the node. |
| NVSwitch | Switch fabric that lets GPUs inside a node communicate with each other at full speed. |
| ConnectX-7 | The node's high-speed network interface for external GPU-to-GPU communication. |
| Rail | An isolated network plane that aligns the same GPU index across all nodes onto one communication path. |
| UVA | Unified Virtual Addressing, which lets the system determine memory location from the pointer value. |
Continue the course
Lesson 9 explains the hardware and CUDA-side path. The final lesson moves into orchestration: how jobs launch across nodes, how ranks discover each other, and how collective communication actually gets initialized and scheduled.
Multi GPU Part 2
Continue into Slurm, PMIx, NCCL communicator setup, collectives, and the orchestration layer behind large jobs.