AllReduce
Distributed AllReduce kernel implementations for NVIDIA GPUs.
API Reference
Multiple AllReduce methods are supported:
one_shot: Single-pass AllReduce
two_shot: Two-pass AllReduce for larger data
double_tree: Double-tree algorithm for balanced communication
one_shot_tma: TMA-based single-pass AllReduce
one_shot_multimem: Multicast memory-based single-pass AllReduce
two_shot_multimem: Multicast memory-based two-pass AllReduce
Example Usage
from triton_dist.kernels.nvidia.allreduce import allreduce_kernel
# Run AllReduce test
bash scripts/launch.sh python/triton_dist/test/nvidia/test_allreduce.py --method one_shot