AllReduce

Distributed AllReduce kernel implementations for NVIDIA GPUs.

API Reference

Six AllReduce methods are supported, each selected by name:

  • one_shot: Single-pass AllReduce

  • two_shot: Two-pass AllReduce for larger data

  • double_tree: Double-tree algorithm for balanced communication

  • one_shot_tma: Single-pass AllReduce using the Tensor Memory Accelerator (TMA)

  • one_shot_multimem: Multicast memory-based single-pass AllReduce

  • two_shot_multimem: Multicast memory-based two-pass AllReduce
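The difference between the single-pass and two-pass variants can be illustrated with a small CPU-side simulation. This is only a sketch under stated assumptions, not the library's GPU kernels: it assumes the common meaning of the terms, where a single pass has every rank read all peers' full buffers and reduce them locally, and two passes consist of a reduce-scatter followed by an all-gather. The helper names `one_shot_allreduce` and `two_shot_allreduce` are hypothetical and exist only for this illustration.

```python
from typing import List

Buffers = List[List[float]]  # one buffer per simulated rank


def one_shot_allreduce(buffers: Buffers) -> Buffers:
    """Single pass (assumed semantics): every rank sums all peers' full buffers."""
    n = len(buffers[0])
    return [
        [sum(peer[i] for peer in buffers) for i in range(n)]
        for _rank in buffers
    ]


def two_shot_allreduce(buffers: Buffers) -> Buffers:
    """Two passes (assumed semantics): reduce-scatter, then all-gather."""
    world = len(buffers)
    n = len(buffers[0])
    assert n % world == 0, "this sketch assumes the buffer splits evenly"
    chunk = n // world

    # Pass 1 (reduce-scatter): rank r reduces only its own chunk.
    reduced_chunks = []
    for r in range(world):
        lo, hi = r * chunk, (r + 1) * chunk
        reduced_chunks.append(
            [sum(peer[i] for peer in buffers) for i in range(lo, hi)]
        )

    # Pass 2 (all-gather): every rank collects all reduced chunks.
    full = [x for c in reduced_chunks for x in c]
    return [list(full) for _rank in range(world)]


# Four simulated ranks; rank r holds a buffer filled with r + 1.
buffers = [[float(rank + 1)] * 4 for rank in range(4)]
one = one_shot_allreduce(buffers)
two = two_shot_allreduce(buffers)
assert one == two  # both strategies produce the same reduced result
```

In the single-pass version each rank touches every element of every peer buffer, while in the two-pass version each rank reduces only one chunk before the results are exchanged. That trade-off is consistent with the list above describing two_shot as the variant for larger data.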

Example Usage

Import the kernel in Python:

from triton_dist.kernels.nvidia.allreduce import allreduce_kernel

Run the AllReduce test from a shell, selecting a method with --method:

bash scripts/launch.sh python/triton_dist/test/nvidia/test_allreduce.py --method one_shot
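To exercise more than one method, the launch command can be generated per method name. The sketch below only builds and prints the command lines; it does not run them. It assumes that --method accepts every name in the list of supported methods, which the example above confirms only for one_shot. The helper `build_command` is hypothetical and not part of the library.

```python
import shlex

# Method names as listed in this document.
METHODS = [
    "one_shot",
    "two_shot",
    "double_tree",
    "one_shot_tma",
    "one_shot_multimem",
    "two_shot_multimem",
]

LAUNCHER = "scripts/launch.sh"
TEST_SCRIPT = "python/triton_dist/test/nvidia/test_allreduce.py"


def build_command(method: str) -> list:
    """Return the argv list for one test run (hypothetical helper)."""
    if method not in METHODS:
        raise ValueError(f"unknown AllReduce method: {method}")
    # Assumption: --method takes any of the documented names.
    return ["bash", LAUNCHER, TEST_SCRIPT, "--method", method]


commands = [build_command(m) for m in METHODS]
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argv lists can be passed to `subprocess.run` on a machine where the launcher script and GPUs are available.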