NVIDIA Kernels
This section documents all available kernels for NVIDIA GPUs (SM80, SM89, SM90a).
Kernel List
| Kernel | Description |
|---|---|
| | AllGather + GEMM with computation-communication overlapping |
| | GEMM + ReduceScatter with computation-communication overlapping |
| | Low-latency AllGather implementations (pull, push 2D/3D, multimem) |
| | Distributed AllReduce kernels (one-shot, two-shot, double-tree, multimem) |
| | Distributed Flash Decoding for attention |
| | Expert Parallelism All-to-All for MoE models (inter-node) |
| | Expert Parallelism All-to-All (intra-node optimized) |
| | EP All-to-All fused megakernel (dispatch+groupgemm, groupgemm+combine) |
| | Low-Latency EP All-to-All with FP8 quantization |
| | GEMM + AllReduce with computation-communication overlapping |
| | MoE ReduceScatter kernel |
| | MoE AllReduce kernel |
| | Sequence Parallel AllGather Attention (intra-node and inter-node) |
| | Ulysses-style Sequence Parallelism kernels |
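
To make the one-shot and two-shot AllReduce variants listed above more concrete, here is a minimal single-process sketch in plain Python. It is not the library's implementation or API: the function names `one_shot_allreduce` and `two_shot_allreduce` are hypothetical, each "rank" is simply a Python list, and the sketch assumes the usual meaning of the terms (one-shot: every rank reads all peers' buffers and reduces locally; two-shot: a reduce-scatter followed by an all-gather).

```python
# Illustrative simulation only: ranks are plain lists inside one process,
# and both function names are hypothetical, not part of any real API.


def one_shot_allreduce(buffers):
    """Every rank reads all peers' full buffers and reduces them locally."""
    n = len(buffers[0])
    return [[sum(buf[i] for buf in buffers) for i in range(n)] for _ in buffers]


def two_shot_allreduce(buffers):
    """Reduce-scatter (each rank reduces one chunk), then all-gather the chunks."""
    world = len(buffers)
    n = len(buffers[0])
    assert n % world == 0, "sketch assumes the length divides evenly across ranks"
    chunk = n // world
    # Shot 1: rank r reduces only elements [r * chunk, (r + 1) * chunk).
    reduced = [
        [sum(buf[i] for buf in buffers) for i in range(r * chunk, (r + 1) * chunk)]
        for r in range(world)
    ]
    # Shot 2: every rank gathers the reduced chunks from all ranks.
    full = [x for part in reduced for x in part]
    return [list(full) for _ in range(world)]


if __name__ == "__main__":
    # Four simulated ranks, eight elements each.
    ranks = [[float(r + i) for i in range(8)] for r in range(4)]
    assert one_shot_allreduce(ranks) == two_shot_allreduce(ranks)
    print(two_shot_allreduce(ranks)[0])
```

Both functions return the same result on every rank. In a real kernel the difference lies in traffic and latency: under these assumptions, one-shot has each rank read the full buffer from every peer in a single step, while two-shot moves less data per rank at the cost of a second communication step.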