AllGather
Low-latency AllGather kernel implementations for NVIDIA GPUs.
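For orientation, the result every AllGather variant on this page is expected to produce can be modeled in plain Python. This is a minimal reference sketch and not part of the library: `reference_allgather` is a hypothetical name, and the ranks are simulated as list entries rather than GPUs.

```python
from typing import List, Sequence


def reference_allgather(shards: Sequence[Sequence[int]]) -> List[List[int]]:
    """Model AllGather semantics on the host (hypothetical helper).

    shards[r] is the data contributed by rank r. After an AllGather,
    every rank holds the concatenation of all shards in rank order.
    """
    gathered = [value for shard in shards for value in shard]
    # Each rank receives its own copy of the full gathered buffer.
    return [list(gathered) for _ in shards]


if __name__ == "__main__":
    per_rank = reference_allgather([[1, 2], [3, 4], [5, 6]])
    for rank, data in enumerate(per_rank):
        print(rank, data)
```

A model like this can serve as an oracle when checking a GPU implementation against small inputs.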
API Reference
- fast_allgather(ctx, buffer)
Performs a fast AllGather operation on a symmetric buffer.
- Parameters:
ctx – FastAllgatherContext
buffer – Symmetric buffer to AllGather
- create_fast_allgather_context(...)
Creates the FastAllgatherContext used by fast_allgather.
- get_auto_all_gather_method(...)
Automatically selects the best AllGather method based on hardware topology.
- class AllGatherMethod
Enum for AllGather methods:
PULL: Pull-based AllGather
PUSH_2D: 2D push-based AllGather
PUSH_3D: 3D push-based AllGather
PUSH_2D_LL: Low-latency 2D push
PUSH_2D_LL_MULTIMEM: Low-latency 2D push with multicast memory
PUSH_NUMA_2D: NUMA-aware 2D push
PUSH_NUMA_2D_LL: NUMA-aware low-latency 2D push
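The member names follow a regular naming scheme, which the sketch below makes explicit. It is a stand-in mirror of the enum and not the library's definition: `AllGatherMethodSketch` and `describe` are hypothetical names, and the member values (assigned with `auto()`) are an assumption, since the actual values are not documented here.

```python
from enum import Enum, auto


class AllGatherMethodSketch(Enum):
    """Hypothetical mirror of AllGatherMethod; values are assumed."""
    PULL = auto()
    PUSH_2D = auto()
    PUSH_3D = auto()
    PUSH_2D_LL = auto()
    PUSH_2D_LL_MULTIMEM = auto()
    PUSH_NUMA_2D = auto()
    PUSH_NUMA_2D_LL = auto()


def describe(method: AllGatherMethodSketch) -> dict:
    """Derive traits from the member name, following the descriptions above."""
    parts = method.name.split("_")
    return {
        "push": parts[0] == "PUSH",         # push-based vs pull-based
        "low_latency": "LL" in parts,       # LL = low-latency variant
        "numa_aware": "NUMA" in parts,      # NUMA-aware variant
        "multimem": "MULTIMEM" in parts,    # uses multicast memory
    }


if __name__ == "__main__":
    for method in AllGatherMethodSketch:
        print(method.name, describe(method))
```

In real code, the method would normally come from get_auto_all_gather_method(...) rather than being chosen by hand.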
Internal Kernels
- _forward_pull_kernel(...)
Pull-based AllGather kernel.
- _forward_push_2d_kernel(...)
2D push-based AllGather kernel.
- _forward_push_3d_kernel(...)
3D push-based AllGather kernel.
- _forward_push_2d_ll_kernel(...)
Low-latency 2D push AllGather kernel.
- _forward_push_2d_ll_multimem_kernel(...)
Low-latency 2D push AllGather with multicast memory.
- _forward_push_numa_2d_kernel(...)
NUMA-aware 2D push AllGather kernel.
- _forward_push_numa_2d_ll_kernel(...)
NUMA-aware low-latency 2D push AllGather kernel.
- cp_engine_producer_all_gather_intra_node(...)
Intra-node AllGather driven by the GPU copy engine.
- cp_engine_producer_all_gather_inter_node(...)
Inter-node AllGather driven by the GPU copy engine.
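The pull and push kernels listed above differ in who moves the data, not in the result. The host-side simulation below illustrates that distinction under simplifying assumptions: ranks are list indices, all shards have equal length, and there is no 2D/3D staging, NUMA handling, or synchronization. `simulate_pull` and `simulate_push` are hypothetical names and do not correspond to the kernels' actual signatures.

```python
from typing import List


def _empty_outputs(shards: List[List[int]]) -> List[List[int]]:
    """Allocate one full-size output buffer per simulated rank."""
    world, shard_len = len(shards), len(shards[0])
    return [[0] * (world * shard_len) for _ in range(world)]


def simulate_pull(shards: List[List[int]]) -> List[List[int]]:
    """Pull: each reader copies every owner's shard into its own buffer."""
    shard_len = len(shards[0])
    out = _empty_outputs(shards)
    for reader in range(len(shards)):          # the rank doing the work
        for owner in range(len(shards)):       # the remote rank being read
            start = owner * shard_len
            out[reader][start:start + shard_len] = shards[owner]
    return out


def simulate_push(shards: List[List[int]]) -> List[List[int]]:
    """Push: each owner writes its shard into every target's buffer."""
    shard_len = len(shards[0])
    out = _empty_outputs(shards)
    for owner in range(len(shards)):           # the rank doing the work
        for target in range(len(shards)):      # the remote rank being written
            start = owner * shard_len
            out[target][start:start + shard_len] = shards[owner]
    return out


if __name__ == "__main__":
    data = [[1, 2], [3, 4], [5, 6]]
    print(simulate_pull(data) == simulate_push(data))
```

Both functions fill identical buffers; on real hardware the choice affects which side issues remote memory traffic and how completion is signaled.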