Mega Triton Kernel Demo

Environment Setup

First, set up the environment. The steps are exactly the same as for the e2e demo, so you can skip this section if your environment is already set up.

bash ./scripts/build_e2e_env.sh
source ./scripts/setenv.sh

Chat Demo

We provide a chat demo. To try the mega triton kernel, start the server and then run the client using the following commands:

# server
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh python/triton_dist/mega_triton_kernel/test/models/model_server.py --model Qwen/Qwen3-32B

# client
python3 python/triton_dist/mega_triton_kernel/test/models/chat.py

Benchmark

We provide a script to benchmark decode latency. To change the tensor parallelism (TP) size, pass the nproc_per_node parameter to the launch.sh script.

NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh python/triton_dist/mega_triton_kernel/test/models/bench_qwen3.py --model Qwen/Qwen3-32B --seq_len 512

Perf

Setting: batch size = 1, seq = 1, ctx = 512. All values are single-step decoding latency in milliseconds (ms); lower is better.

8xH800 GPU (TP=8)

Model	torch eager	torch + cudagraph	triton_dist_AR + cudagraph	mega_triton_kernel
qwen-8b	26.08	5.49	4.65	3.33
qwen-32b	49.69	10.80	9.18	7.41

8xH20 GPU (TP=8)

Model	torch eager	torch + cudagraph	triton_dist_AR + cudagraph	mega_triton_kernel
qwen-8b	28.75	5.52	4.59	3.16
qwen-32b	52.37	13.87	11.96	8.34
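
To make the tables easier to compare, the latencies can be turned into speedup ratios. The following is a minimal sketch, not part of the repository: it copies the "torch + cudagraph" and "mega_triton_kernel" columns from the tables above and divides one by the other. The names LATENCY_MS, speedup, and speedup_table are hypothetical and chosen only for illustration.

```python
# Latencies in ms, copied from the tables above as
# (torch + cudagraph, mega_triton_kernel). Names here are illustrative only.
LATENCY_MS = {
    ("8xH800", "qwen-8b"): (5.49, 3.33),
    ("8xH800", "qwen-32b"): (10.80, 7.41),
    ("8xH20", "qwen-8b"): (5.52, 3.16),
    ("8xH20", "qwen-32b"): (13.87, 8.34),
}


def speedup(baseline_ms, candidate_ms):
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def speedup_table(latencies=LATENCY_MS):
    """Map each (hardware, model) pair to its speedup, rounded to 2 decimals."""
    return {
        key: round(speedup(baseline, mega), 2)
        for key, (baseline, mega) in latencies.items()
    }


if __name__ == "__main__":
    for (hardware, model), ratio in speedup_table().items():
        print(f"{hardware} {model}: {ratio:.2f}x over torch + cudagraph")
```

With the numbers above, the mega triton kernel comes out between about 1.46x and 1.75x faster than torch + cudagraph. The same helper can be pointed at the other columns by swapping in their values.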

Build Model

We use Qwen3 (python/triton_dist/mega_triton_kernel/models/qwen3.py) as an example to demonstrate how to build a mega triton kernel for a model. You can refer to it when building other models.