
Add NCCL GIN / symmetric memory tutorial #3932

Draft: d4l3k wants to merge 3 commits into main from d4l3k/nccl_gin_tutorial


@d4l3k d4l3k commented Jul 1, 2026

Member

Adds a new unstable tutorial on GPU-Initiated Networking (GIN) with NCCL and PyTorch distributed, via the NCCL backend of torch.distributed._symmetric_memory.

Contents

  • Introduction to host-initiated vs GPU-initiated communication and the NCCL 2.28 device API (LSA, Multimem, GIN)
  • Enabling the NCCL symmetric memory backend (symm_mem.set_backend("NCCL") / TORCH_SYMMMEM=NCCL), including eager process group init and the warm-up collective requirement
  • Complete torchrun-able example of a device-initiated one_shot_all_reduce on symmetric tensors
  • One-sided communication with torch.ops.symm_mem.nccl_put_with_signal / nccl_wait_for_signal, plus notes on nccl_put/nccl_get and the handle-level put_signal/wait_signal (NCCL 2.29+)
  • Multi-node launch and GIN hardware/software requirements (RDMA NICs, GPUDirect RDMA, DMA-BUF), including the GDAKI and CPU proxy transports
  • Writing custom communication kernels in Python with the CuTe DSL over symmetric memory peer buffers (hdl.get_buffer), adapted from the NVIDIA CUTLASS distributed examples
  • Pointers to the raw ncclGin C++ device API for extension authors
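
As a reading aid for the one_shot_all_reduce bullet above, here is a minimal host-side sketch of what "one-shot" means: each rank reads every peer's buffer once and reduces locally. It is a plain-Python model, not the PyTorch or NCCL API. The function name mirrors the op for readability only, and the buffers, world size, and values are hypothetical.

```python
# Host-side model of a one-shot all-reduce. Plain lists stand in for
# symmetric-memory buffers; no GPU, NCCL, or PyTorch call is involved.


def one_shot_all_reduce(peer_buffers, rank):
    """Return the element-wise sum that `rank` computes in a single pass.

    Every rank reads each peer's buffer directly and reduces locally,
    so there is no separate reduce-scatter / all-gather phase.
    """
    length = len(peer_buffers[rank])
    result = [0.0] * length
    for buf in peer_buffers:  # one read of every peer's buffer
        for i in range(length):
            result[i] += buf[i]
    return result


world_size = 4  # hypothetical number of ranks
# Hypothetical data: rank r contributes the value r + 1 in every slot.
buffers = [[float(r + 1)] * 3 for r in range(world_size)]
results = [one_shot_all_reduce(buffers, r) for r in range(world_size)]
```

In this model every rank ends with the same reduced vector. In the tutorial's setting, the reads would instead be issued from a GPU kernel against symmetric tensors.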

Since GIN itself is a device-side API not directly exposed in Python, the tutorial is framed around symmetric memory as the user-facing API, with GIN presented as the transport that services window operations across nodes.
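
The one-sided put/signal pattern named in the contents can be modeled on the host with threads. This is a conceptual sketch only: the `Window` class and its method names are hypothetical stand-ins, not the signatures of nccl_put_with_signal or nccl_wait_for_signal, and a condition variable plays the role that the transport plays on real hardware.

```python
import threading


class Window:
    """Hypothetical stand-in for one rank's symmetric-memory window."""

    def __init__(self, size):
        self.data = [0] * size
        self.signal = 0
        self._cond = threading.Condition()

    def put_with_signal(self, payload):
        # Sender side: write the payload into the peer's window, then
        # bump the signal so the data is visible before the notification.
        with self._cond:
            self.data[: len(payload)] = payload
            self.signal += 1
            self._cond.notify_all()

    def wait_for_signal(self, expected):
        # Receiver side: block until the signal counter reaches `expected`,
        # then return a snapshot of the window contents.
        with self._cond:
            self._cond.wait_for(lambda: self.signal >= expected)
            return list(self.data)


window = Window(4)
received = []
receiver = threading.Thread(
    target=lambda: received.append(window.wait_for_signal(1))
)
receiver.start()
window.put_with_signal([7, 8, 9, 10])  # the receiver never calls a matching "recv"
receiver.join()
```

The point of the model is the ordering guarantee: the receiver only reads the window after the signal arrives, and the sender completes the transfer without the receiver posting a matching receive.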

All Python API calls in the examples were verified against pytorch/pytorch main (op registrations in nccl_extension.cu, pybind signatures, and patterns from test/distributed/test_nccl.py); the CuTe DSL example is adapted from the official NVIDIA CUTLASS examples/python/CuTeDSL distributed examples.

The tutorial is a static .rst in unstable_source/ (multi-GPU torchrun code cannot execute in the docs build), registered with a card and toctree entry in unstable_index.rst. Validated with make html-noplot (clean build, no new warnings) and lintrunner.

🤖 Generated with Claude Code

Adds an unstable tutorial covering GPU-initiated networking with NCCL and PyTorch symmetric memory: enabling the NCCL backend, device-initiated one-shot all-reduce, one-sided put/signal operations, writing custom communication kernels in Python with CuTe DSL, multi-node GIN requirements, and pointers to the NCCL device API for custom C++ kernels.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

pytorch-bot Bot commented Jul 1, 2026


🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3932

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1d236c3 with merge base cb473bc:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the cla signed label Jul 1, 2026
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@d4l3k d4l3k added the skip-link-check label (Will allow you to skip linkcheck on a PR. Should only be used when a link can't be fixed.) Jul 1, 2026
Replaces the C++-only GIN section with Python examples: nccl4py exposes the NCCL device API (including GIN put/wait_signal) to CuTe DSL kernels, so GPU-initiated RDMA is now reachable from Python. Adapted from the nccl4py cute example, using torch.distributed for bootstrap and nccl.torch.empty for NCCL-allocated tensors. Also adds a section on combining symmetric memory with nccl4py: wrapping the process group's communicator via _comm_ptr(), registering symm_mem tensors as NCCL windows, and the reverse register_external_nccl_comm bridge.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@d4l3k d4l3k force-pushed the d4l3k/nccl_gin_tutorial branch from 010903c to 1d236c3 Compare July 2, 2026 03:46
