Tensor Parallel MLP =================== Tensor Parallel MLP layer for distributed inference. Description ----------- The TP MLP layer provides distributed MLP computation with computation-communication overlapping for efficient tensor parallelism. Modes ----- - ``ag_rs``: AllGather input + ReduceScatter output - ``allreduce``: AllReduce-based parallelism - ``gemm_ar``: GEMM + AllReduce fusion Example Usage ------------- .. code-block:: bash # Test TP MLP bash scripts/launch.sh python/triton_dist/test/nvidia/test_tp_mlp.py \ --M 4096 --model --mode ag_rs