AllGather ========= Low-latency AllGather kernel implementations for NVIDIA GPUs. API Reference ------------- .. py:function:: fast_allgather(ctx, buffer) Performs fast AllGather operation. :param ctx: FastAllgatherContext :param buffer: Symmetric buffer to AllGather .. py:function:: create_fast_allgather_context(...) Creates the context for fast AllGather. .. py:function:: get_auto_all_gather_method(...) Automatically selects the best AllGather method based on hardware topology. .. py:class:: AllGatherMethod Enum for AllGather methods: - ``PULL``: Pull-based AllGather - ``PUSH_2D``: 2D push-based AllGather - ``PUSH_3D``: 3D push-based AllGather - ``PUSH_2D_LL``: Low-latency 2D push - ``PUSH_2D_LL_MULTIMEM``: Low-latency 2D push with multicast memory - ``PUSH_NUMA_2D``: NUMA-aware 2D push - ``PUSH_NUMA_2D_LL``: NUMA-aware low-latency 2D push Internal Kernels ---------------- .. py:function:: _forward_pull_kernel(...) Pull-based AllGather kernel. .. py:function:: _forward_push_2d_kernel(...) 2D push-based AllGather kernel. .. py:function:: _forward_push_3d_kernel(...) 3D push-based AllGather kernel. .. py:function:: _forward_push_2d_ll_kernel(...) Low-latency 2D push AllGather kernel. .. py:function:: _forward_push_2d_ll_multimem_kernel(...) Low-latency 2D push AllGather with multicast memory. .. py:function:: _forward_push_numa_2d_kernel(...) NUMA-aware 2D push AllGather kernel. .. py:function:: _forward_push_numa_2d_ll_kernel(...) NUMA-aware low-latency 2D push AllGather kernel. .. py:function:: cp_engine_producer_all_gather_intra_node(...) Copy engine-based intra-node AllGather. .. py:function:: cp_engine_producer_all_gather_inter_node(...) Copy engine-based inter-node AllGather.