Add NCCL GIN / symmetric memory tutorial by d4l3k · Pull Request #3932 · pytorch/tutorials

d4l3k · 2026-07-01T23:49:26Z

Adds a new unstable tutorial on GPU-Initiated Networking (GIN) with NCCL and PyTorch distributed, via the NCCL backend of torch.distributed._symmetric_memory.

- Introduction to host-initiated vs GPU-initiated communication and the NCCL 2.28 device API (LSA, Multimem, GIN)
- Enabling the NCCL symmetric memory backend (symm_mem.set_backend("NCCL") / TORCH_SYMMMEM=NCCL), including eager process group init and the warm-up collective requirement
- Complete torchrun-able example of a device-initiated one_shot_all_reduce on symmetric tensors
- One-sided communication with torch.ops.symm_mem.nccl_put_with_signal / nccl_wait_for_signal, plus notes on nccl_put/nccl_get and the handle-level put_signal/wait_signal (NCCL 2.29+)
- Multi-node launch and GIN hardware/software requirements (RDMA NICs, GPUDirect RDMA, DMA-BUF), including the GDAKI and CPU proxy transports
- Writing custom communication kernels in Python with the CuTe DSL over symmetric memory peer buffers (hdl.get_buffer), adapted from the NVIDIA CUTLASS distributed examples
- Pointers to the raw ncclGin C++ device API for extension authors
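For reviewers less familiar with the one-shot pattern named in the list above, here is a minimal conceptual model in plain Python. It is not the tutorial's code and calls no PyTorch or NCCL API: the nested lists stand in for symmetric buffers that every rank can read directly, and `one_shot_all_reduce_sum` is a hypothetical helper name. The point it illustrates is that each rank reads all peers' buffers once and reduces locally, so every rank ends up with the same result without a separate gather step.

```python
from typing import List


def one_shot_all_reduce_sum(peer_buffers: List[List[float]]) -> List[float]:
    """Sum the buffers of all ranks element by element.

    In a one-shot all-reduce, every rank performs this same local
    reduction over the peers' buffers it can read directly.
    """
    # zip(*peer_buffers) walks the buffers position by position across ranks.
    return [sum(values) for values in zip(*peer_buffers)]


world_size = 4
# Hypothetical symmetric buffers: rank r holds three copies of float(r).
symmetric = [[float(rank)] * 3 for rank in range(world_size)]

# Each rank runs the same reduction independently.
results = [one_shot_all_reduce_sum(symmetric) for _ in range(world_size)]
print(results[0])
```

On real hardware the reads go over NVLink or the network rather than through a Python list, and the tutorial's torchrun example remains the reference for the actual calls.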

Since GIN itself is a device-side API not directly exposed in Python, the tutorial is framed around symmetric memory as the user-facing API, with GIN presented as the transport that services window operations across nodes.
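To make the put-with-signal style of window operation concrete, the following sketch models its ordering guarantee with Python threads. Everything here is a stand-in chosen for illustration: `PeerWindow`, `put_with_signal`, and `wait_for_signal` are hypothetical names that only mirror the shape of the ops mentioned in this PR, and no torch or NCCL call is made. The assumption being modeled is that the signal becomes visible only after the data has been written, so a consumer that waits on the signal can safely read the buffer.

```python
import threading
from typing import List


class PeerWindow:
    """Hypothetical stand-in for a peer's symmetric buffer plus a signal flag."""

    def __init__(self, numel: int) -> None:
        self.data = [0.0] * numel
        self.signal = threading.Event()


def put_with_signal(src: List[float], dst: PeerWindow) -> None:
    # One-sided write: the producer fills the peer's buffer first...
    dst.data[:] = src
    # ...and raises the signal only after the data is in place.
    dst.signal.set()


def wait_for_signal(window: PeerWindow, timeout: float = 5.0) -> List[float]:
    # The consumer blocks until the signal arrives, then reads the buffer.
    if not window.signal.wait(timeout):
        raise TimeoutError("signal was not raised in time")
    return list(window.data)


window = PeerWindow(4)
received: List[List[float]] = []

consumer = threading.Thread(target=lambda: received.append(wait_for_signal(window)))
consumer.start()
put_with_signal([1.0, 2.0, 3.0, 4.0], window)
consumer.join()
print(received[0])
```

The receiving side takes no part in the transfer itself beyond waiting, which is what distinguishes this one-sided pattern from a matched send/recv pair.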

All Python API calls in the examples were verified against pytorch/pytorch main (op registrations in nccl_extension.cu, pybind signatures, and patterns from test/distributed/test_nccl.py); the CuTe DSL example is adapted from the official NVIDIA CUTLASS examples/python/CuTeDSL distributed examples.

The tutorial is a static .rst in unstable_source/ (multi-GPU torchrun code cannot execute in the docs build), registered with a card and toctree entry in unstable_index.rst. Validated with make html-noplot (clean build, no new warnings) and lintrunner.
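Because the examples cannot run in the docs build, readers would launch them by hand. As a rough sketch of what a multi-node launch might look like, the helper below assembles a torchrun command line and sets TORCH_SYMMMEM=NCCL in the environment. The script name `nccl_gin_example.py`, the rendezvous endpoint, and the helper `build_launch` are hypothetical placeholders, and whether the environment variable alone is sufficient in a given setup is an assumption; the tutorial text is the authority on the required initialization steps.

```python
import os
import shlex
from typing import Dict, List, Tuple


def build_launch(
    script: str, nnodes: int, nproc_per_node: int, rdzv_endpoint: str
) -> Tuple[List[str], Dict[str, str]]:
    """Return a torchrun argv and an environment selecting the NCCL backend."""
    # Copy the current environment and select the symmetric memory backend.
    env = dict(os.environ, TORCH_SYMMMEM="NCCL")
    cmd = [
        "torchrun",
        f"--nnodes={nnodes}",
        f"--nproc-per-node={nproc_per_node}",
        "--rdzv-backend=c10d",
        f"--rdzv-endpoint={rdzv_endpoint}",
        script,  # hypothetical script name, supplied by the caller
    ]
    return cmd, env


# Hypothetical two-node, eight-GPU-per-node launch; run the same command on each node.
cmd, env = build_launch("nccl_gin_example.py", 2, 8, "node0.example.com:29500")
print(shlex.join(cmd))
# To actually launch: subprocess.run(cmd, env=env, check=True)
```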

🤖 Generated with Claude Code

Adds an unstable tutorial covering GPU-initiated networking with NCCL and PyTorch symmetric memory: enabling the NCCL backend, device-initiated one-shot all-reduce, one-sided put/signal operations, writing custom communication kernels in Python with CuTe DSL, multi-node GIN requirements, and pointers to the NCCL device API for custom C++ kernels.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

pytorch-bot · 2026-07-01T23:49:31Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/tutorials/3932

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 1d236c3 with merge base cb473bc:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.


Replaces the C++-only GIN section with Python examples: nccl4py exposes the NCCL device API (including GIN put/wait_signal) to CuTe DSL kernels, so GPU-initiated RDMA is now reachable from Python. Adapted from the nccl4py cute example, using torch.distributed for bootstrap and nccl.torch.empty for NCCL-allocated tensors. Also adds a section on combining symmetric memory with nccl4py: wrapping the process group's communicator via _comm_ptr(), registering symm_mem tensors as NCCL windows, and the reverse register_external_nccl_comm bridge.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

meta-cla Bot added the cla signed label Jul 1, 2026

Fix CuTe DSL documentation link

a736da5

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

d4l3k added the skip-link-check label (allows skipping linkcheck on a PR; should only be used when a link can't be fixed) Jul 1, 2026

d4l3k force-pushed the d4l3k/nccl_gin_tutorial branch from 010903c to 1d236c3 July 2, 2026 03:46

d4l3k wants to merge 3 commits into main from d4l3k/nccl_gin_tutorial
