GEMM AllReduce Layer

High-level layer for GEMM + AllReduce fusion.

Description

This layer fuses the GEMM computation with the AllReduce communication that follows it, rather than running them as two separate steps, to make tensor parallelism in dense models more efficient.
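To make the semantics concrete, here is a minimal unfused reference in pure Python showing the result a GEMM + AllReduce pair is expected to produce. It is a sketch, not the layer's API: it assumes the common row-parallel layout in which the reduction dimension K is sharded across ranks (the source does not state the sharding), and every name in it (`matmul`, `shard_k`, `all_reduce_sum`, `gemm_allreduce_reference`) is hypothetical. Ranks are simulated as list entries in a single process, so no communication or overlap takes place.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain dense GEMM: (M x K) @ (K x N) -> (M x N)."""
    inner = len(b)
    assert all(len(row) == inner for row in a), "inner dimensions must match"
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
        for row in a
    ]


def shard_k(a: Matrix, b: Matrix, world_size: int) -> List[Tuple[Matrix, Matrix]]:
    """Split A by columns and B by rows so each simulated rank owns one K-slice.

    Assumption: K is the sharded dimension (row-parallel linear layer).
    """
    k = len(b)
    assert k % world_size == 0, "K must divide evenly across ranks"
    step = k // world_size
    shards = []
    for rank in range(world_size):
        lo, hi = rank * step, (rank + 1) * step
        a_local = [row[lo:hi] for row in a]  # (M x K/world_size)
        b_local = b[lo:hi]                   # (K/world_size x N)
        shards.append((a_local, b_local))
    return shards


def all_reduce_sum(partials: List[Matrix]) -> Matrix:
    """Simulated AllReduce (sum): the value every rank would hold afterwards."""
    rows, cols = len(partials[0]), len(partials[0][0])
    return [
        [sum(p[i][j] for p in partials) for j in range(cols)]
        for i in range(rows)
    ]


def gemm_allreduce_reference(a: Matrix, b: Matrix, world_size: int) -> Matrix:
    """Unfused reference: one local GEMM per rank, then a sum across ranks."""
    partials = [matmul(a_l, b_l) for a_l, b_l in shard_k(a, b, world_size)]
    return all_reduce_sum(partials)


if __name__ == "__main__":
    a = [[1, 2, 3, 4], [5, 6, 7, 8]]
    b = [[1, 0], [0, 1], [1, 1], [2, 3]]
    # The sharded result should match the unsharded GEMM.
    print(gemm_allreduce_reference(a, b, world_size=2) == matmul(a, b))
```

A reference like this is mainly useful as a correctness check: a fused implementation should produce the same values (up to floating-point reordering) while avoiding the separate communication step.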

See python/triton_dist/layers/nvidia/gemm_allreduce_layer.py for implementation details.