NVIDIA Kernels
==============

This section documents the kernels available for NVIDIA GPUs targeting the SM80 (Ampere), SM89 (Ada Lovelace), and SM90a (Hopper) architectures.

Kernel List
-----------

.. list-table:: Available NVIDIA Kernels
   :header-rows: 1
   :widths: 30 70

   * - Kernel
     - Description
   * - :doc:`allgather_gemm`
     - AllGather + GEMM with computation-communication overlap
   * - :doc:`gemm_reduce_scatter`
     - GEMM + ReduceScatter with computation-communication overlap
   * - :doc:`allgather`
     - Low-latency AllGather implementations (pull, push 2D/3D, multimem)
   * - :doc:`allreduce`
     - Distributed AllReduce kernels (one-shot, two-shot, double-tree, multimem)
   * - :doc:`flash_decode`
     - Distributed Flash Decoding for attention
   * - :doc:`ep_a2a`
     - Expert Parallelism All-to-All for MoE models (inter-node)
   * - :doc:`ep_a2a_intra_node`
     - Expert Parallelism All-to-All for MoE models (intra-node optimized)
   * - :doc:`ep_all2all_fused`
     - EP All-to-All fused megakernel (dispatch+groupgemm, groupgemm+combine)
   * - :doc:`low_latency_a2a_v2`
     - Low-latency EP All-to-All with FP8 quantization
   * - :doc:`gemm_allreduce`
     - GEMM + AllReduce with computation-communication overlap
   * - :doc:`moe_reduce_rs`
     - MoE ReduceScatter kernel
   * - :doc:`moe_reduce_ar`
     - MoE AllReduce kernel
   * - :doc:`sp_ag_attention`
     - Sequence Parallel AllGather Attention (intra-node and inter-node)
   * - :doc:`ulysses_sp`
     - Ulysses-style Sequence Parallelism kernels

.. toctree::
   :maxdepth: 1
   :hidden:

   allgather_gemm
   gemm_reduce_scatter
   allgather
   allreduce
   flash_decode
   ep_a2a
   ep_a2a_intra_node
   ep_all2all_fused
   low_latency_a2a_v2
   gemm_allreduce
   moe_reduce_rs
   moe_reduce_ar
   sp_ag_attention
   ulysses_sp