Tile Traversal
Highlighted A/B slices show which operand slices the active C tile depends on at the selected step.
Cluster Swizzle View
Cluster decode/encode flow sorted by effective swizzle rank.
H100 Persistent Scheduler
Mirrors CUTLASS SM90 scheduling. The output is divided into swizzled CTA tiles.
get_grid_shape() honors hw_info.max_active_clusters when provided and
otherwise falls back to the 18-SM/GPC occupancy heuristic. Each resident block
then advances by one full-grid stride to fetch its next tile.
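The full-grid stride described above can be sketched in a few lines of Python. This is a simplified illustration, not the CUTLASS implementation: the helper name tiles_for_block and the sizes (4 blocks, 10 tiles) are hypothetical, and the swizzled mapping from a linear index to tile coordinates is left out.

```python
def tiles_for_block(block_id, num_blocks, total_tiles):
    """Linear tile indices visited by one persistent block.

    The block starts at its own index and advances by the number of
    resident blocks (one full-grid stride) until it runs out of tiles.
    """
    return list(range(block_id, total_tiles, num_blocks))


# Hypothetical sizes, chosen only to keep the output short.
num_blocks = 4
total_tiles = 10

schedule = {b: tiles_for_block(b, num_blocks, total_tiles) for b in range(num_blocks)}
for block_id, tiles in schedule.items():
    print(block_id, tiles)
# 0 [0, 4, 8]
# 1 [1, 5, 9]
# 2 [2, 6]
# 3 [3, 7]
```

Every tile is visited exactly once, and blocks 2 and 3 finish one step earlier because the tile count is not a multiple of the grid size.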
Tile Workload Split Across Blocks
Color = persistent block slot. Dashed outlines mark tiles in the same selected wave, which means those resident blocks can execute them concurrently. Padded tiles still get a slot, but they are muted because they fall outside the logical matrix.
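A minimal sketch of how a slot, a wave, and a padded flag could be assigned to each tile. It assumes a plain row-major tile order with no swizzle and assumes that padding comes from rounding the tile grid up to a multiple of the cluster shape. The function wave_layout and all sizes below are hypothetical.

```python
import math


def round_up(value, multiple):
    """Round value up to the next multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def wave_layout(m, n, tile_m, tile_n, cluster_m, cluster_n, num_slots):
    """Assign each tile a persistent slot, a wave index, and a padded flag."""
    logical_m = math.ceil(m / tile_m)           # tiles needed to cover M
    logical_n = math.ceil(n / tile_n)           # tiles needed to cover N
    padded_m = round_up(logical_m, cluster_m)   # assumption: pad to cluster shape
    padded_n = round_up(logical_n, cluster_n)

    layout = []
    for linear in range(padded_m * padded_n):
        tm, tn = divmod(linear, padded_n)       # row-major order, no swizzle
        layout.append({
            "tile": (tm, tn),
            "slot": linear % num_slots,         # which resident block runs it
            "wave": linear // num_slots,        # tiles in one wave run concurrently
            "padded": tm >= logical_m or tn >= logical_n,
        })
    return layout


# Hypothetical problem: 384x256 output, 128x128 tiles, 2x1 clusters, 4 slots.
layout = wave_layout(384, 256, 128, 128, 2, 1, num_slots=4)
for entry in layout:
    print(entry)
```

With these sizes the logical grid is 3x2 tiles and the padded grid is 4x2, so the last two tiles are marked as padded while still occupying slots 2 and 3 of the second wave.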
132 Resident Parallel Block Slots
Compact view of the H100 persistent slots. Each cell is one concurrently resident
worker. The number of cells comes from CUTLASS hw_info.sm_count, and together
the cells form the parallel wave width.
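As a rough illustration of what the wave width implies, the sketch below computes how many waves a given tile count needs across 132 slots and how full the last wave is. The helper wave_stats and the tile count of 300 are hypothetical; the arithmetic assumes every slot takes exactly one tile per wave.

```python
import math


def wave_stats(total_tiles, num_slots=132):
    """Return (waves, tiles in the last wave, overall slot utilization)."""
    if total_tiles == 0:
        return 0, 0, 0.0
    waves = math.ceil(total_tiles / num_slots)
    tail = total_tiles - (waves - 1) * num_slots   # tiles left for the final wave
    utilization = total_tiles / (waves * num_slots)
    return waves, tail, utilization


waves, tail, utilization = wave_stats(300)
print(waves, tail, round(utilization, 3))
# 3 36 0.758
```

Here the third wave keeps only 36 of the 132 slots busy, which is the kind of partial wave the per-slot view makes visible.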