Build Triton-distributed
Recommended environment for using Triton-distributed with the NVIDIA backend:
Python >=3.11 (a virtual environment is suggested)
CUDA >=12.4
Torch >=2.8
We recommend installing inside the NVIDIA PyTorch container.
For AMD GPUs:
ROCm 7.1
Torch 2.7.1 with ROCm support
Other dependency versions may also work, but this is not guaranteed. If you run into any problems during installation, please open an issue.
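Before building, it can help to confirm that the local toolchain meets the minimum versions listed above. The following is a minimal sketch and not part of the Triton-distributed repository: `version_ge` is a hypothetical helper that compares dotted version strings using `sort -V` (GNU coreutils), and the commented usage assumes `python3` is on your PATH.

```shell
#!/usr/bin/env bash
# version_ge A B: succeeds when version A >= version B (hypothetical helper).
# Relies on `sort -V` from GNU coreutils for version-aware ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example usage (assumes python3 is on PATH):
#   py_ver="$(python3 -c 'import platform; print(platform.python_version())')"
#   version_ge "$py_ver" 3.11 || echo "Python $py_ver is older than 3.11"
```

The same helper can be applied to the CUDA or Torch version strings once you have extracted them with whatever tool reports them in your environment.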
NVIDIA Build Steps
Prepare docker container:
docker run --name triton-dist --ipc=host --network=host --privileged --cap-add=SYS_ADMIN --shm-size=10g --gpus=all -itd nvcr.io/nvidia/pytorch:25.04-py3 /bin/bash
docker exec -it triton-dist /bin/bash
Clone Triton-distributed to your own path (e.g., /workspace/Triton-distributed)
git clone https://github.com/ByteDance-Seed/Triton-distributed.git
Update submodules
cd /workspace/Triton-distributed
git submodule deinit --all -f # deinit previous submodules
rm -rf 3rdparty/triton # remove previous triton
git submodule update --init --recursive
Install dependencies (optional for PyTorch container)
Note: Not needed for PyTorch container
# If you are not using the PyTorch container
pip3 install torch==2.8
pip3 install setuptools==69.0.0 wheel pybind11
Build Triton-distributed
Then you can build Triton-distributed.
# Remove triton installed with torch
pip uninstall triton
pip uninstall triton_dist # remove previous triton-dist
# Install dependencies
pip3 install cuda.core==0.2.0 cuda-python==12.4 nvidia-nvshmem-cu12==3.3.9 Cython==0.29.24 nvshmem4py-cu12==0.1.2
rm -rf /usr/local/lib/python3.12/dist-packages/triton
# Install Triton-distributed
cd /workspace/Triton-distributed
export USE_TRITON_DISTRIBUTED_AOT=0
echo 'numpy<2' > /tmp/pip_install_constraint.txt
MAX_JOBS=126 pip3 install -c /tmp/pip_install_constraint.txt -e python[build,tests,tutorials] --verbose --no-build-isolation --use-pep517
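After the install finishes, a quick sanity check is to confirm that Python can locate both packages. This is a minimal sketch, not a script shipped with the repository: `module_installed` is a hypothetical helper, and the module names `triton` and `triton_dist` are taken from the uninstall commands and test paths in this guide. It only locates the modules and does not exercise the GPU.

```shell
# module_installed NAME: succeeds if Python can locate the top-level module NAME.
# Uses importlib.util.find_spec, so the module is located but not executed.
module_installed() {
    python3 -c 'import importlib.util, sys; sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)' "$1"
}

for mod in triton triton_dist; do
    if module_installed "${mod}"; then
        echo "${mod}: found"
    else
        echo "${mod}: NOT found" >&2
    fi
done
```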
We also provide an AOT version of Triton-distributed. To use AOT (not recommended):
cd /workspace/Triton-distributed/
bash ./scripts/gen_aot_code.sh
export USE_TRITON_DISTRIBUTED_AOT=1
MAX_JOBS=126 pip3 install -e python --verbose --no-build-isolation --use-pep517
Note: You must build the non-AOT version before building the AOT version. Once you have built the AOT version, all later builds will also be AOT builds. To undo this, remove the build directory python/build.
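Returning to a non-AOT build can be sketched as follows. `reset_aot_build` is a hypothetical helper, not part of the repository. Removing python/build comes from the note above. Resetting USE_TRITON_DISTRIBUTED_AOT to 0 is an assumption that mirrors the non-AOT build command earlier in this guide.

```shell
# reset_aot_build REPO_ROOT: delete the build directory and clear the AOT flag.
reset_aot_build() {
    if [ -z "${1:-}" ]; then
        echo "usage: reset_aot_build REPO_ROOT" >&2
        return 1
    fi
    # Removing python/build forces the next install to rebuild from scratch.
    rm -rf "${1}/python/build"
    # Assumption: match the value used for the non-AOT build.
    export USE_TRITON_DISTRIBUTED_AOT=0
}

# Example usage:
#   reset_aot_build /workspace/Triton-distributed
```

After running it, repeat the non-AOT install command to rebuild.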
Test NVIDIA Installation
Quick Validation Tests
# Basic distributed wait test
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_distributed_wait.py --case correctness_tma
# NVSHMEM API test
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_nvshmem_api.py
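To run several validation commands in one go and see which ones failed, a small wrapper can help. This is a minimal sketch: `run_checks` is a hypothetical helper, not a script from the repository, and it simply runs each command string with `bash -c` and reports failures.

```shell
# run_checks CMD...: run each command string in order and report failures.
# Returns 0 only if every command succeeded.
run_checks() {
    local failed=0 cmd
    for cmd in "$@"; do
        echo ">>> ${cmd}"
        if ! bash -c "${cmd}"; then
            echo "FAILED: ${cmd}" >&2
            failed=$((failed + 1))
        fi
    done
    [ "${failed}" -eq 0 ]
}

# Example usage from the repository root:
#   run_checks \
#     "bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_distributed_wait.py --case correctness_tma" \
#     "bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_nvshmem_api.py"
```

Unlike chaining with `&&`, this keeps going after a failure, so one run shows every failing test.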
AllGather GEMM Tests
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_ag_gemm.py --case check
bash ./scripts/launch.sh --nproc_per_node 2 python/triton_dist/test/nvidia/test_ag_gemm.py --case check
GEMM ReduceScatter Tests
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_gemm_rs.py -M 8192 -N 8192 -K 29568 --check
AllReduce Tests
NVSHMEM_DISABLE_CUDA_VMM=1 bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_allreduce.py --method one_shot --stress --iters 2
Flash Decoding Tests
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_decode_attn.py --case perf_8k
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_sp_decode_attn.py --case correctness
MoE Tests
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_ag_moe.py --M 2048 --iters 10 --warmup_iters 20
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_moe_reduce_rs.py 8192 2048 1536 32 2
E2E Tests
# Dense model
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_tp_e2e.py --bsz 8 --seq_len 256 --model <model_path> --check --mode ag_rs
# E2E inference
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_e2e_inference.py --bsz 4096 --gen_len 128 --max_length 150 --model <model_path> --backend triton_dist
Run All Unit Tests
The full test suite is available via:
bash .codebase/scripts/nvidia/run_unittest.sh
Run E2E Tests
bash .codebase/scripts/nvidia/run_e2e_test.sh
Run Tutorial Tests
bash .codebase/scripts/nvidia/run_tutorial_test.sh
Run All The Tutorials
See examples in the tutorials directory at the project root.
To use Triton-distributed with the AMD backend:
Start from the rocm/pytorch:rocm7.1_ubuntu24.04_py3.12_pytorch_release_2.7.1 Docker container.
AMD Build Steps
Clone the repo
git clone https://github.com/ByteDance-Seed/Triton-distributed.git
Update submodules
cd Triton-distributed/
git submodule update --init --recursive
If you are updating an old repo, there may be issues if the rocshmem submodule is still present. Erase it if necessary:
rm -rf 3rdparty/rocshmem # only for updated repo
Install dependencies
export TRITON_BUILD_WITH_CLANG_LLD=TRUE
export TRITON_USE_ASSERT_ENABLED_LLVM=TRUE
export TRITON_BUILD_PROTON=0
rm -f /usr/local/bin/cmake
apt-get update -y
apt-get install -y libopenmpi-dev git cython3 ibverbs-utils openmpi-bin libpci-dev libdw1 locales cmake miopen-hip autoconf libtool flex ninja-build clang lld
python3 -m pip install -i https://test.pypi.org/simple "hip-python>=7.1" # (or whichever version matches your ROCm installation)
pip3 install pybind11
bash ./shmem/rocshmem_bind/build.sh
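A note on version specifiers such as hip-python>=7.1 in the commands above: in a POSIX shell an unquoted `>` starts an output redirection, so the specifier has to be quoted for pip to see it. The short demonstration below uses `echo` in a temporary directory instead of pip, so it installs nothing. The directory and variable names are illustrative only.

```shell
# Demonstrates why a requirement containing '>' must be quoted.
demo_dir="$(mktemp -d)"
(
    cd "${demo_dir}" || exit 1
    # Unquoted: the shell treats '>=7.1' as a redirect to a file named '=7.1',
    # so the command itself only receives the argument 'hip-python'.
    echo hip-python>=7.1
    # Quoted: the whole specifier is passed through as one argument.
    echo "hip-python>=7.1"
)
# The stray file created by the unquoted form is visible here.
ls "${demo_dir}"
```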
Build Triton-distributed
pip3 install -e python --verbose --no-build-isolation --use-pep517
Test AMD Installation
AllGather GEMM example on a single node
bash ./scripts/launch_amd.sh ./python/triton_dist/test/amd/test_ag_gemm_intra_node.py 8192 8192 29568
You should see the following (abridged) output:
✅ Triton and Torch match