Megakernel Implementations
This page lists the tutorials for implementing megakernels in Triton-distributed.
Megakernel: A tutorial on building a megakernel for a dense language model (e.g., Qwen3-32B) by integrating our tensor parallelism modules.
EP MoE Megakernel: Expert Parallelism MoE megakernel implementations for single-node 8-GPU configurations:
EP All-to-All Fused Kernel: Fused megakernel that combines the dispatch + group GEMM and group GEMM + combine operations, with token-level optimizations (token saving/skipping, token sorting, SM scheduling).
EP All-to-All Fused Layer: High-level layer API for fused EP MoE operations
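
For orientation before opening the EP MoE tutorials, the dispatch, group GEMM, and combine steps named above can be sketched in plain Python. This is a conceptual, single-process illustration only, not the Triton-distributed implementation: the function names (`sort_tokens_by_expert`, `moe_forward`) are hypothetical, routing is assumed to be top-1, each "expert" is an ordinary Python callable standing in for a GEMM, and no GPU or inter-rank communication is involved.

```python
def sort_tokens_by_expert(expert_ids, num_experts):
    """Token sorting: return token indices ordered by expert, plus segment offsets.

    offsets[e]:offsets[e + 1] is the slice of `order` routed to expert e.
    """
    # sorted() is stable, so tokens keep their relative order within an expert.
    order = sorted(range(len(expert_ids)), key=lambda i: expert_ids[i])
    counts = [0] * num_experts
    for e in expert_ids:
        counts[e] += 1
    offsets = [0]
    for c in counts:
        offsets.append(offsets[-1] + c)
    return order, offsets


def moe_forward(tokens, expert_ids, experts):
    """Dispatch tokens to experts, apply each expert to its segment, combine."""
    order, offsets = sort_tokens_by_expert(expert_ids, len(experts))
    out = [None] * len(tokens)
    for e, expert_fn in enumerate(experts):
        segment = order[offsets[e]:offsets[e + 1]]
        # Assumption for illustration: an expert that received no tokens
        # does no work. The tutorials may define token skipping differently.
        if not segment:
            continue
        for i in segment:
            # Combine: write each result back to the token's original position.
            out[i] = expert_fn(tokens[i])
    return out


if __name__ == "__main__":
    experts = [lambda x: x * 10, lambda x: x + 1, lambda x: -x]
    print(moe_forward([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0], experts))
```

Sorting first gives every expert one contiguous segment of tokens, which is what allows the per-expert work to be issued as a single grouped operation instead of many scattered ones. In the fused kernels listed above, that grouped compute is combined with the all-to-all communication on either side of it.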