NVIDIA Kernels

This section lists the kernels available for NVIDIA GPUs: SM80 (Ampere), SM89 (Ada Lovelace), and SM90a (Hopper).

Kernel List

Available NVIDIA Kernels

Kernel                                          Description
----------------------------------------------  ----------------------------------------------------------------------
AllGather GEMM                                  AllGather + GEMM with computation-communication overlapping
GEMM ReduceScatter                              GEMM + ReduceScatter with computation-communication overlapping
AllGather                                       Low-latency AllGather implementations (pull, push 2D/3D, multimem)
AllReduce                                       Distributed AllReduce kernels (one-shot, two-shot, double-tree, multimem)
Flash Decode                                    Distributed Flash Decoding for attention
Expert Parallelism All-to-All (EP A2A)          Expert Parallelism All-to-All for MoE models (inter-node)
Expert Parallelism All-to-All (Intra-Node)      Expert Parallelism All-to-All (intra-node optimized)
Expert Parallelism All-to-All Fused Megakernel  EP All-to-All fused megakernel (dispatch+groupgemm, groupgemm+combine)
Low-Latency All-to-All V2 (EP)                  Low-Latency EP All-to-All with FP8 quantization
GEMM AllReduce                                  GEMM + AllReduce with computation-communication overlapping
MoE ReduceScatter                               MoE ReduceScatter kernel
MoE AllReduce                                   MoE AllReduce kernel
Sequence Parallel AllGather Attention           Sequence Parallel AllGather Attention (intra-node and inter-node)
Ulysses Sequence Parallelism                    Ulysses-style Sequence Parallelism kernels
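
For readers new to the fused kernels in the list, the sketch below shows what AllGather GEMM and GEMM ReduceScatter compute, not how they are implemented. It simulates ranks as list entries in a single process, so there is no real communication and no overlap. The sharding layout (row shards of A for AllGather GEMM, K-slices of A and B for GEMM ReduceScatter) is an assumed, common tensor-parallel arrangement, and the function names `matmul`, `add`, `allgather_gemm`, and `gemm_reducescatter` are hypothetical helpers for illustration, not part of any library API.

```python
# Reference semantics only: each "rank" is one entry in a Python list.
# All names here are hypothetical; this is not the kernels' actual API.

def matmul(a, b):
    """Plain row-by-column matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def allgather_gemm(a_shards, b_local):
    """Assumed layout: every rank owns a row shard of A.

    AllGather concatenates the row shards into the full A, then the
    calling rank multiplies it by its local B.
    """
    a_full = [row for shard in a_shards for row in shard]
    return matmul(a_full, b_local)


def gemm_reducescatter(a_shards, b_shards):
    """Assumed layout: every rank owns a K-slice of A and of B.

    Each rank computes a partial product; ReduceScatter sums the partials
    and hands each rank an equal block of output rows.
    """
    world = len(a_shards)
    partials = [matmul(a, b) for a, b in zip(a_shards, b_shards)]
    total = partials[0]
    for p in partials[1:]:
        total = add(total, p)
    rows_per_rank = len(total) // world
    return [total[r * rows_per_rank:(r + 1) * rows_per_rank] for r in range(world)]


# Two simulated ranks, A = [[1, 2], [3, 4]], B = [[1, 1], [0, 2]].
ag_out = allgather_gemm([[[1, 2]], [[3, 4]]], [[1, 1], [0, 2]])
rs_out = gemm_reducescatter(
    [[[1], [3]], [[2], [4]]],  # column (K) slices of A, one per rank
    [[[1, 1]], [[0, 2]]],      # row (K) slices of B, one per rank
)
```

In this toy setup both paths reproduce the same product A·B: `ag_out` holds the full result on the calling rank, while `rs_out` holds one block of output rows per rank. The kernels in the list differ from this sketch in that they overlap the communication step with the GEMM instead of running them back to back.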