[FEA] Out-of-core K-means improvements #2292

Description

@cjnolet

We have received the following feedback from users:

  1. cuVS 26.6.0 host-resident multi-GPU streaming is a meaningful improvement because it allows the dataset to remain in host RAM and stream to GPUs, instead of requiring the full dataset to fit in GPU memory.

  2. The RAFT comms path performed close to the cuML MNMG baseline in the user's benchmark setup: 16.29s per iteration versus 14.64s per iteration for cuML MNMG, about 11.3% slower.

  3. The SNMG path was much slower in the user's test: about 256s per iteration on 8 GPUs, compared with an estimated 30s per iteration for RAFT comms on 8 GPUs (roughly 8.5x slower).

  4. The user suspects the SNMG gap comes mainly from a lack of compute-transfer overlap, OpenMP thread-scheduling overhead, batch-level barriers, and memory-management overhead.

  5. Their profiling also suggests the shared CUTLASS fused distance kernel is significantly slower than their cuBLAS GEMM baseline for this large-K workload.
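
For context on item 5, a GEMM-based distance path typically relies on the expansion ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2, where the cross term for every point/centroid pair is one matrix product and ||x||^2 can be dropped because it does not change the argmin. The sketch below illustrates that idea in plain Python and checks it against a brute-force reference. The function names are hypothetical, and this is not cuVS, CUTLASS, or cuBLAS code; on a GPU the inner dot products would be a single GEMM call.

```python
def assign_via_gemm(points, centroids):
    """Nearest-centroid labels using the expansion
    ||x - c||^2 = ||x||^2 - 2 x.c + ||c||^2.

    The cross term x.c for all pairs is the GEMM (points @ centroids^T).
    ||x||^2 is constant per point, so it is dropped before the argmin.
    """
    c_norms = [sum(v * v for v in c) for c in centroids]
    labels = []
    for x in points:
        # One row of the GEMM: dot products of x with every centroid.
        cross = [sum(a * b for a, b in zip(x, c)) for c in centroids]
        # Partial distance; it has the same argmin as the full distance.
        scores = [cn - 2.0 * d for cn, d in zip(c_norms, cross)]
        labels.append(min(range(len(scores)), key=scores.__getitem__))
    return labels


def assign_brute_force(points, centroids):
    """Reference: explicit squared Euclidean distance, then argmin."""
    labels = []
    for x in points:
        dists = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
        labels.append(min(range(len(dists)), key=dists.__getitem__))
    return labels


# Toy data (illustrative only, not from the user's benchmark).
points = [[0.0, 0.1], [5.0, 4.9], [9.8, 0.2], [0.3, -0.2]]
centroids = [[0.0, 0.0], [5.0, 5.0], [10.0, 0.0]]
labels = assign_via_gemm(points, centroids)
```

Whether this path beats a fused kernel depends on K, the feature dimension, and the extra pass needed for the argmin, so it would need to be benchmarked on the workload in question.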

Suggestions from user:

  • Add pinned/page-locked memory support for streaming, or use internally allocated pinned staging buffers, to avoid the extra bounce copy that pageable memory incurs.
  • Add double buffering or pipelined H2D transfer in SNMG so the next batch can transfer while the current batch is computing.
  • Expose a native Python API path for RAFT comms resources, instead of requiring a duck-typing bridge around pylibraft.common.Handle.
  • Consider a cuBLAS GEMM plus argmin path for large-K distance computation, or otherwise optimize the current CUTLASS fused distance kernel.
  • Improve API usability around return-type consistency, adaptive streaming_batch_size, and fit progress reporting.
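
The double-buffering suggestion can be sketched as a two-slot pipeline: a background stage fills one staging buffer while the compute stage consumes the other. The example below models this with Python threads as a stand-in for CUDA streams and pinned buffers. `stream_double_buffered`, `transfer`, and `compute` are hypothetical names for illustration and are not part of the cuVS API; error handling is deliberately minimal.

```python
import queue
import threading


def stream_double_buffered(batches, transfer, compute, num_buffers=2):
    """Apply compute(transfer(batch)) to each batch, in order.

    At most `num_buffers` batches are staged or in use at any time, so
    the transfer of batch i+1 can overlap the compute of batch i.
    """
    free_slots = threading.Semaphore(num_buffers)  # free staging buffers
    staged = queue.Queue()                         # transferred batches
    done = object()                                # end-of-stream sentinel

    def producer():
        try:
            for batch in batches:
                free_slots.acquire()         # wait for a free buffer
                staged.put(transfer(batch))  # stands in for the H2D copy
        finally:
            staged.put(done)                 # always unblock the consumer

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    while True:
        item = staged.get()
        if item is done:
            break
        results.append(compute(item))  # stands in for the k-means batch step
        free_slots.release()           # buffer may now be refilled
    worker.join()
    return results


# Toy usage: "transfer" copies the batch, "compute" reduces it.
results = stream_double_buffered([[1, 2], [3, 4], [5, 6]], list, sum)
```

With `num_buffers=2`, the producer stalls only when both buffers are occupied, which is the behavior the suggestion describes. In a real implementation the staging buffers would presumably be pinned host or device allocations, and the handoff would use stream events rather than a semaphore.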
