# Build Triton-distributed

## The best practice to use Triton-distributed with the Nvidia backend:
- Python >=3.11 (suggest using virtual environment)
- CUDA >=12.4
- Torch >=2.8

We recommend installation in [Nvidia PyTorch container](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags).

#### if for AMD GPU:
- ROCM 7.1
- Torch 2.7.1 with ROCM support


Dependencies with other versions may also work well, but this is not guaranteed. If you find any problem in installing, please tell us in Issues.

### NVIDIA Build Steps
1. Prepare docker container:
    ```sh
    docker run --name triton-dist --ipc=host --network=host --privileged --cap-add=SYS_ADMIN --shm-size=10g --gpus=all -itd nvcr.io/nvidia/pytorch:25.04-py3 /bin/bash
    docker exec -it triton-dist /bin/bash
    ```

2. Clone Triton-distributed to your own path (e.g., `/workspace/Triton-distributed`)
    ```sh
    git clone https://github.com/ByteDance-Seed/Triton-distributed.git
    ```

3. Update submodules
    ```sh
    cd /workspace/Triton-distributed
    git submodule deinit --all -f # deinit previous submodules
    rm -rf 3rdparty/triton # remove previous triton
    git submodule update --init --recursive
    ```

4. Install dependencies (optional for PyTorch container)
    > Note: Not needed for PyTorch container
    ```sh
    # If you are not using PyTorch container
    pip3 install torch==2.8
    pip3 install setuptools==69.0.0 wheel pybind11
    ```

5. Build Triton-distributed

    Then you can build Triton-distributed.

    ```sh
    # Remove triton installed with torch
    pip uninstall triton
    pip uninstall triton_dist # remove previous triton-dist
    # Install dependencies
    pip3 install cuda.core==0.2.0 cuda-python==12.4 nvidia-nvshmem-cu12==3.3.9 Cython==0.29.24 nvshmem4py-cu12==0.1.2
    rm -rf /usr/local/lib/python3.12/dist-packages/triton
    # Install Triton-distributed
    cd /workspace/Triton-distributed
    export USE_TRITON_DISTRIBUTED_AOT=0
    echo 'numpy<2' > /tmp/pip_install_constraint.txt
    MAX_JOBS=126 pip3 install -c /tmp/pip_install_constraint.txt -e python[build,tests,tutorials] --verbose --no-build-isolation --use-pep517
    ```

    We also provide AOT version of Triton-distributed. If you want to use AOT (**Not Recommended**), then
    ```sh
    cd /workspace/Triton-distributed/
    bash ./scripts/gen_aot_code.sh
    export USE_TRITON_DISTRIBUTED_AOT=1
    MAX_JOBS=126 pip3 install -e python --verbose --no-build-isolation --use-pep517
    ```
    (Note: You have to first build non-AOT version before building AOT version, once you build AOT version, you will always build for AOT in future. To unset this, you have to remove your build directory: `python/build`)


### Test NVIDIA Installation

#### Quick Validation Tests
```sh
# Basic distributed wait test
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_distributed_wait.py --case correctness_tma

# NVSHMEM API test
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_nvshmem_api.py
```

#### AllGather GEMM Tests
```sh
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_ag_gemm.py --case check
bash ./scripts/launch.sh --nproc_per_node 2 python/triton_dist/test/nvidia/test_ag_gemm.py --case check
```

#### GEMM ReduceScatter Tests
```sh
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_gemm_rs.py -M 8192 -N 8192 -K 29568 --check
```

#### AllReduce Tests
```sh
NVSHMEM_DISABLE_CUDA_VMM=1 bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_allreduce.py --method one_shot --stress --iters 2
```

#### Flash Decoding Tests
```sh
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_decode_attn.py --case perf_8k
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_sp_decode_attn.py --case correctness
```

#### MoE Tests
```sh
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_ag_moe.py --M 2048 --iters 10 --warmup_iters 20
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_moe_reduce_rs.py 8192 2048 1536 32 2
```

#### E2E Tests
```sh
# Dense model
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_tp_e2e.py --bsz 8 --seq_len 256 --model <model_path> --check --mode ag_rs

# E2E inference
bash ./scripts/launch.sh python/triton_dist/test/nvidia/test_e2e_inference.py --bsz 4096 --gen_len 128 --max_length 150 --model <model_path> --backend triton_dist
```

### Run All Unit Tests
The full test suite is available via:
```sh
bash .codebase/scripts/nvidia/run_unittest.sh
```

### Run E2E Tests
```sh
bash .codebase/scripts/nvidia/run_e2e_test.sh
```

### Run Tutorial Tests
```sh
bash .codebase/scripts/nvidia/run_tutorial_test.sh
```

### Run All The Tutorials
See examples in the `tutorials` directory at the project root.

## To use Triton-distributed with the AMD backend:
Starting from the rocm/pytorch:rocm7.1_ubuntu24.04_py3.12_pytorch_release_2.7.1 Docker container
#### AMD Build Steps
1. Clone the repo
```sh
git clone https://github.com/ByteDance-Seed/Triton-distributed.git
```
2. Update submodules
```sh
cd Triton-distributed/
git submodule update --init --recursive
```
If you are updating an old repo, there may be issues if the rocshmem submodule is still present. Erase it if necessary:
```sh
rm -rf 3rdparty/rocshmem # only for updated repo
```
3. Install dependencies
```sh
export TRITON_BUILD_WITH_CLANG_LLD=TRUE
export TRITON_USE_ASSERT_ENABLED_LLVM=TRUE
export TRITON_BUILD_PROTON=0
rm -f /usr/local/bin/cmake
apt-get update -y
apt install -y libopenmpi-dev git cython3 ibverbs-utils openmpi-bin libopenmpi-dev libpci-dev libdw1 locales cmake miopen-hip autoconf libtool flex ninja-build clang lld
python3 -m pip install -i https://test.pypi.org/simple hip-python>=7.1 # (or whatever Rocm version you have)
pip3 install pybind11
bash ./shmem/rocshmem_bind/build.sh
```
4. Build Triton-distributed
```sh
pip3 install -e python --verbose --no-build-isolation --use-pep517
```
### Test AMD Installation
#### GEMM ReduceScatter example on single node
```sh
bash ./scripts/launch_amd.sh ./python/triton_dist/test/amd/test_ag_gemm_intra_node.py 8192 8192 29568
```
and see the following (reduced) output
```sh
✅ Triton and Torch match
```