Megakernel Implementations

This page lists the tutorials for implementing megakernels in Triton-distributed.

Megakernel: A tutorial on building a megakernel for a dense language model (e.g., Qwen3-32B) by integrating our tensor parallelism modules.
EP MoE Megakernel: Expert-parallel (EP) Mixture-of-Experts (MoE) megakernel implementations for single-node 8-GPU configurations:
- EP All-to-All Fused Kernel: Fused megakernel combining dispatch+groupgemm and groupgemm+combine operations, with token-level optimizations (token saving/skipping, token sorting, SM scheduling)
- EP All-to-All Fused Layer: High-level layer API for fused EP MoE operations
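
To give a rough idea of what "token sorting" and "token skipping" mean before a grouped GEMM, the following is a minimal pure-Python sketch. It is not the Triton-distributed implementation or API: the function name `sort_tokens_by_expert`, the use of `-1` to mark a skipped token, and the single-expert-per-token routing are all assumptions made for illustration. The idea it shows is that tokens are reordered so that each expert's tokens are contiguous, which lets a grouped GEMM process one expert per contiguous slice.

```python
from itertools import accumulate


def sort_tokens_by_expert(expert_ids, num_experts):
    """Group token indices by expert id with a stable counting sort.

    expert_ids[i] is the expert chosen for token i, or -1 if the token
    is skipped (a hypothetical convention used only in this sketch).
    Returns (sorted_idx, offsets); the tokens routed to expert e are
    sorted_idx[offsets[e]:offsets[e + 1]].
    """
    # Count how many tokens each expert receives, ignoring skipped tokens.
    counts = [0] * num_experts
    for e in expert_ids:
        if e >= 0:
            counts[e] += 1

    # Exclusive prefix sum gives the start of each expert's slice.
    offsets = [0] + list(accumulate(counts))

    # Scatter token indices into their expert's slice, preserving order.
    cursor = offsets[:-1].copy()
    sorted_idx = [0] * offsets[-1]
    for tok, e in enumerate(expert_ids):
        if e < 0:
            continue  # token skipping: no work is scheduled for this token
        sorted_idx[cursor[e]] = tok
        cursor[e] += 1
    return sorted_idx, offsets


# Six tokens, three experts; token 2 is skipped.
expert_ids = [2, 0, -1, 2, 1, 0]
sorted_idx, offsets = sort_tokens_by_expert(expert_ids, num_experts=3)
print(sorted_idx)  # [1, 5, 4, 0, 3]
print(offsets)     # [0, 2, 3, 5]
```

In a real fused kernel this grouping would happen on the GPU and be combined with the all-to-all communication; the sketch only illustrates the resulting data layout.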