# Mega Triton Kernel Demo
## Environment Set Up

First, you need to set up the environment. This is exactly the same as the e2e demo. If you have already set up your environment, you can skip this step.
```bash
bash ./scripts/build_e2e_env.sh
source ./scripts/setenv.sh
```


## Chat Demo
We provide a chat demo. You can play with the mega triton kernel using the following command:
```bash
# server
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh python/triton_dist/mega_triton_kernel/test/models/model_server.py --model Qwen/Qwen3-32B

# client
python3 python/triton_dist/mega_triton_kernel/test/models/chat.py
```

## Benchmark
We provide a script to benchmark decode latency. If you need to change the TP (Tensor Parallelism) size, you can pass the `nproc_per_node` parameter to the `launch.sh` script.
```bash
NVSHMEM_DISABLE_CUDA_VMM=0 bash ./scripts/launch.sh python/triton_dist/mega_triton_kernel/test/models/bench_qwen3.py --model Qwen/Qwen3-32B --seq_len 512
```

### Perf
**Setting**: batch size=1, seq=1, ctx=512, single-step decoding latency in milliseconds (ms)
#### 8xH800 GPU (TP=8)

| Model       | torch eager | torch + cudagraph | triton_dist_AR + cudagraph | mega_triton_kernel |
|---|---|---|---|---|
| qwen-8b     | 26.08       | 5.49              | 4.65                       | 3.33               |
| qwen-32b    | 49.69       | 10.80             | 9.18                       | 7.41               |

#### 8xH20 GPU (TP=8)

| Model       | torch eager | torch + cudagraph | triton_dist_AR + cudagraph | mega_triton_kernel |
|---|---|---|---|---|
| qwen-8b     | 28.75       | 5.52              | 4.59                       | 3.16               |
| qwen-32b    | 52.37       | 13.87             | 11.96                      | 8.34               |


## Build Model
We use Qwen3 as an example(`python/triton_dist/mega_triton_kernel/models/qwen3.py`) to demonstrate how to build a mega triton kernel for a model. You can refer to it when building other models.