GEMM AllReduce Layer
High-level layer for GEMM + AllReduce fusion.
Description
This layer fuses GEMM computation with AllReduce communication for efficient tensor parallelism in dense models.
See python/triton_dist/layers/nvidia/gemm_allreduce_layer.py for implementation details.