Dense Model

Dense transformer model implementation for distributed inference.

Description

The dense model module provides a complete implementation of dense transformer models (e.g., Qwen, LLaMA) with tensor parallelism support.

Features

  • Tensor-parallel attention and MLP layers

  • Multiple parallelism modes: ag_rs (all-gather plus reduce-scatter), allreduce, and gemm_ar (GEMM with all-reduce)

  • KV cache support for efficient decoding
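
To make the tensor-parallel MLP idea concrete, here is a minimal single-process sketch in plain Python. It is not the module's implementation: the helper names (gated_mlp, tensor_parallel_mlp, split_cols, split_rows) are hypothetical, the gated SiLU MLP layout is assumed because LLaMA- and Qwen-style models commonly use it, and an element-wise sum across simulated ranks stands in for the all-reduce collective. It only shows that sharding the gate/up projections by column and the down projection by row, then summing the per-rank partial outputs, reproduces the unsharded result.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def gated_mlp(x, w_gate, w_up, w_down):
    """Gated MLP: (silu(x @ w_gate) * (x @ w_up)) @ w_down."""
    gate = matmul(x, w_gate)
    up = matmul(x, w_up)
    act = [[silu(g) * u for g, u in zip(g_row, u_row)] for g_row, u_row in zip(gate, up)]
    return matmul(act, w_down)


def split_cols(w, parts):
    """Split a matrix into `parts` equal column blocks (column-parallel shards)."""
    step = len(w[0]) // parts
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(parts)]


def split_rows(w, parts):
    """Split a matrix into `parts` equal row blocks (row-parallel shards)."""
    step = len(w) // parts
    return [w[i * step:(i + 1) * step] for i in range(parts)]


def tensor_parallel_mlp(x, w_gate, w_up, w_down, world_size):
    """Simulate a tensor-parallel MLP across `world_size` ranks in one process."""
    shards = zip(
        split_cols(w_gate, world_size),
        split_cols(w_up, world_size),
        split_rows(w_down, world_size),
    )
    # Each simulated rank computes a partial output from its own weight shards.
    partials = [gated_mlp(x, g, u, d) for g, u, d in shards]
    # Element-wise sum over ranks; this stands in for an all-reduce.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Illustrative sizes only; they are not taken from any real model config.
rng = random.Random(0)
hidden, intermediate, world_size = 8, 16, 4
x = rand_matrix(rng, 2, hidden)
w_gate = rand_matrix(rng, hidden, intermediate)
w_up = rand_matrix(rng, hidden, intermediate)
w_down = rand_matrix(rng, intermediate, hidden)

reference = gated_mlp(x, w_gate, w_up, w_down)
sharded = tensor_parallel_mlp(x, w_gate, w_up, w_down, world_size)
max_abs_diff = max(
    abs(r - s) for r_row, s_row in zip(reference, sharded) for r, s in zip(r_row, s_row)
)
print("max abs difference:", max_abs_diff)
```

In a real multi-GPU run each rank would hold only its own shards and the final sum would be a collective. Replacing that all-reduce with a reduce-scatter followed by an all-gather yields the same values, which is the general idea behind an all-gather/reduce-scatter style mode.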

Example Usage

# End-to-end inference test
bash scripts/launch.sh python/triton_dist/test/nvidia/test_e2e_inference.py \
    --bsz 4096 --gen_len 128 --max_length 150 --model <model_path> --backend triton_dist