AllGather

Low-latency AllGather kernel implementations for NVIDIA GPUs.

API Reference

fast_allgather(ctx, buffer)

Performs a fast AllGather on a symmetric buffer.

Parameters:

ctx – FastAllgatherContext (see create_fast_allgather_context)
buffer – Symmetric buffer to AllGather
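
For readers unfamiliar with the collective, the following is a minimal pure-Python reference of what an AllGather computes: every rank ends up with the concatenation of all ranks' shards, in rank order. It simulates ranks as list entries and does not call fast_allgather or touch a GPU. The name reference_allgather and its argument are hypothetical and exist only for this illustration.

```python
from typing import List, Sequence


def reference_allgather(shards: Sequence[Sequence[int]]) -> List[List[int]]:
    """Simulate AllGather across len(shards) ranks.

    shards[r] is the local shard owned by rank r. The result holds one
    output buffer per rank, and every buffer is the rank-ordered
    concatenation of all shards. Hypothetical helper, not a library API.
    """
    gathered = [x for shard in shards for x in shard]
    # Each rank receives its own independent copy of the gathered data.
    return [list(gathered) for _ in shards]


if __name__ == "__main__":
    outputs = reference_allgather([[0, 1], [10, 11], [20, 21]])
    for rank, out in enumerate(outputs):
        print(rank, out)
```

A reference like this can serve as a correctness check: whatever method is used, each rank's output should match the corresponding entry returned here.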

create_fast_allgather_context(...): Creates the context for fast AllGather.

get_auto_all_gather_method(...): Automatically selects the best AllGather method based on hardware topology.

class AllGatherMethod

Enum for AllGather methods:

PULL: Pull-based AllGather
PUSH_2D: 2D push-based AllGather
PUSH_3D: 3D push-based AllGather
PUSH_2D_LL: Low-latency 2D push
PUSH_2D_LL_MULTIMEM: Low-latency 2D push with multicast memory
PUSH_NUMA_2D: NUMA-aware 2D push
PUSH_NUMA_2D_LL: NUMA-aware low-latency 2D push
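
The PULL and PUSH families differ in which side drives the data movement. The sketch below simulates both on plain Python lists under the usual reading of these terms: in a pull scheme each rank reads every peer's shard into its own output, while in a push scheme each rank writes its shard into every peer's output. This is an assumption about the general technique, not a description of the kernels in this module. The function names are hypothetical, and equal-sized shards are assumed.

```python
from typing import List


def pull_allgather(shards: List[List[int]]) -> List[List[int]]:
    """Each rank copies (pulls) every peer's shard into its own output."""
    world, n = len(shards), len(shards[0])
    outputs = [[0] * (world * n) for _ in range(world)]
    for rank in range(world):          # the reader drives the copy
        for peer in range(world):
            outputs[rank][peer * n:(peer + 1) * n] = shards[peer]
    return outputs


def push_allgather(shards: List[List[int]]) -> List[List[int]]:
    """Each rank writes (pushes) its own shard into every peer's output."""
    world, n = len(shards), len(shards[0])
    outputs = [[0] * (world * n) for _ in range(world)]
    for rank in range(world):          # the writer drives the copy
        for peer in range(world):
            outputs[peer][rank * n:(rank + 1) * n] = shards[rank]
    return outputs


if __name__ == "__main__":
    shards = [[1, 2], [3, 4], [5, 6]]
    # Both schemes produce the same gathered result on every rank.
    print(pull_allgather(shards) == push_allgather(shards))
```

Both loops move the same data. On real hardware the choice affects who issues the remote memory operations and how completion is signalled, which is why several variants exist.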

Internal Kernels

_forward_pull_kernel(...): Pull-based AllGather kernel.

_forward_push_2d_kernel(...): 2D push-based AllGather kernel.

_forward_push_3d_kernel(...): 3D push-based AllGather kernel.

_forward_push_2d_ll_kernel(...): Low-latency 2D push AllGather kernel.

_forward_push_2d_ll_multimem_kernel(...): Low-latency 2D push AllGather with multicast memory.

_forward_push_numa_2d_kernel(...): NUMA-aware 2D push AllGather kernel.

_forward_push_numa_2d_ll_kernel(...): NUMA-aware low-latency 2D push AllGather kernel.

cp_engine_producer_all_gather_intra_node(...): Copy-engine-based intra-node AllGather.

cp_engine_producer_all_gather_inter_node(...): Copy-engine-based inter-node AllGather.