NVIDIA Kernels

This section documents all available kernels for NVIDIA GPUs. The supported targets are SM80 (Ampere), SM89 (Ada Lovelace), and SM90a (Hopper).

Kernel List

Available NVIDIA Kernels
Kernel	Description
AllGather GEMM	AllGather + GEMM with computation-communication overlapping
GEMM ReduceScatter	GEMM + ReduceScatter with computation-communication overlapping
AllGather	Low-latency AllGather implementations (pull, push 2D/3D, multimem)
AllReduce	Distributed AllReduce kernels (one-shot, two-shot, double-tree, multimem)
Flash Decode	Distributed Flash Decoding for attention
Expert Parallelism All-to-All (EP A2A)	Expert Parallelism All-to-All for MoE models (inter-node)
Expert Parallelism All-to-All (Intra-Node)	Expert Parallelism All-to-All (intra-node optimized)
Expert Parallelism All-to-All Fused Megakernel	Fused EP All-to-All megakernel (dispatch + grouped GEMM, grouped GEMM + combine)
Low-Latency All-to-All V2 (EP)	Low-Latency EP All-to-All with FP8 quantization
GEMM AllReduce	GEMM + AllReduce with computation-communication overlapping
MoE ReduceScatter	ReduceScatter kernel for MoE layers
MoE AllReduce	AllReduce kernel for MoE layers
Sequence Parallel AllGather Attention	Sequence Parallel AllGather Attention (intra-node and inter-node)
Ulysses Sequence Parallelism	Ulysses-style Sequence Parallelism kernels
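
Several entries in the table fuse a collective with a GEMM so that communication and computation overlap. As an illustration only, the following single-process sketch models what an AllGather GEMM computes on one rank. It assumes a tensor-parallel layout in which A is sharded by rows across ranks and each rank holds a local slice of B. The names `matmul` and `allgather_gemm` are hypothetical and are not the API of the kernels listed above; the sketch covers the arithmetic and the shard visiting order, not the GPU implementation.

```python
# Single-process reference for the semantics of "AllGather GEMM".
# Assumption: A is sharded by rows across ranks and each rank holds a local
# slice of B. This models the math only, not the actual GPU kernels.

def matmul(a, b):
    """Plain list-of-lists matrix multiply: (m x k) @ (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def allgather_gemm(a_shards, b_local, rank):
    """Result on `rank`: concat(a_shards) @ b_local, computed shard by shard.

    The shard owned by `rank` is processed first because it is already local.
    The remaining shards are visited in ring order, which is where a real
    kernel could overlap the transfer of the next shard with the GEMM on the
    current one.
    """
    world_size = len(a_shards)
    out = [None] * world_size
    for step in range(world_size):
        src = (rank + step) % world_size  # local shard first, then peers
        out[src] = matmul(a_shards[src], b_local)
    # Stitch the per-shard outputs back together in global row order.
    return [row for block in out for row in block]


# Two ranks, each owning one row of A (2 x 2 overall) and a 2 x 1 slice of B.
a_shards = [[[1, 2]], [[3, 4]]]
b_local = [[5], [6]]
result = allgather_gemm(a_shards, b_local, rank=1)
```

Here `result` equals the product of the full gathered A with `b_local`, regardless of which rank's visiting order is used. Processing per shard is what lets the GEMM start before the whole AllGather has finished.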