We have received the following feedback from users:
- cuVS 26.6.0 host-resident multi-GPU streaming is a meaningful improvement because it allows the dataset to remain in host RAM and stream to the GPUs, instead of requiring the full dataset to fit in GPU memory.
- The RAFT comms path performed close to the cuML MNMG baseline in a benchmark setup: 16.29s per iteration versus 14.64s per iteration for cuML MNMG, about +11.3%.
- The SNMG path was much slower in the user's test: around 256s per iteration on 8 GPUs, compared with an estimate of around 30s per iteration for RAFT comms on 8 GPUs.
- The user suspects the SNMG gap comes mainly from a lack of compute-transfer overlap, OMP thread scheduling overhead, batch-level barriers, and memory management overhead.
- Their profiling also suggests the shared CUTLASS fused distance kernel is significantly slower than their cuBLAS GEMM baseline for this large-K workload.
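
As a quick sanity check on the figures quoted above, the relative numbers can be reproduced with a few lines of Python. This is only a sketch: the timings are copied from the feedback, the 30s RAFT comms figure for 8 GPUs is the user's estimate rather than a measurement, and the helper name `relative_overhead` is made up for illustration.

```python
def relative_overhead(measured_s: float, baseline_s: float) -> float:
    """Return how much slower measured_s is than baseline_s, in percent."""
    return (measured_s - baseline_s) / baseline_s * 100.0


# Per-iteration timings as quoted in the feedback.
raft_comms_s = 16.29
cuml_mnmg_s = 14.64
snmg_8gpu_s = 256.0            # "around 256s" in the user's test
raft_comms_8gpu_est_s = 30.0   # the user's estimate, not a measurement

overhead_pct = relative_overhead(raft_comms_s, cuml_mnmg_s)
snmg_ratio = snmg_8gpu_s / raft_comms_8gpu_est_s

print(f"RAFT comms vs cuML MNMG: +{overhead_pct:.1f}%")
print(f"SNMG vs RAFT comms on 8 GPUs (estimated): {snmg_ratio:.1f}x slower")
```

The first line confirms the reported +11.3%; the second puts the SNMG gap at roughly 8.5x, subject to the uncertainty in the 30s estimate.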
Suggestions from user: