Megakernel Implementations

This page lists the tutorials for implementing megakernels in Triton-distributed.

  1. Megakernel: A tutorial on building a megakernel for a dense language model (e.g., Qwen3-32B) by integrating our tensor parallelism modules.

  2. EP MoE Megakernel: Expert Parallelism (EP) Mixture-of-Experts (MoE) megakernel implementations for single-node, 8-GPU configurations:

    • EP All-to-All Fused Kernel: Fused megakernel that combines the dispatch+groupgemm and groupgemm+combine stages, with optimizations for token saving/skipping, token sorting, and SM scheduling

    • EP All-to-All Fused Layer: High-level layer API for fused EP MoE operations
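
For readers new to expert parallelism, the token sorting mentioned above can be illustrated in plain Python. This is only a conceptual sketch, not the Triton-distributed implementation or its API: the function names `sort_tokens_by_expert` and `expert_segments` are hypothetical, and treating "skipping" as "skip experts that received no tokens" is an assumption made for illustration. The idea is that sorting (token, slot) pairs by expert id gives each expert a contiguous range of rows, which is the layout a grouped GEMM typically consumes.

```python
def sort_tokens_by_expert(topk_ids, num_experts):
    """Group (token, slot) pairs by expert id (hypothetical helper).

    topk_ids[t] lists the experts selected for token t.
    Returns (pairs, counts): pairs is a list of (expert, token, slot)
    tuples ordered by expert, and counts[e] is the number of rows
    routed to expert e.
    """
    pairs = [
        (expert, token, slot)
        for token, experts in enumerate(topk_ids)
        for slot, expert in enumerate(experts)
    ]
    pairs.sort()  # order by expert, then token, then slot
    counts = [0] * num_experts
    for expert, _, _ in pairs:
        counts[expert] += 1
    return pairs, counts


def expert_segments(counts):
    """Return (expert, start, end) row ranges, omitting empty experts.

    Assumption for illustration: an expert with no routed tokens
    contributes no work, so it gets no segment.
    """
    segments = []
    start = 0
    for expert, n in enumerate(counts):
        if n > 0:
            segments.append((expert, start, start + n))
        start += n
    return segments


# Example: 3 tokens, top-2 routing over 4 experts.
topk_ids = [[2, 0], [2, 3], [0, 2]]
pairs, counts = sort_tokens_by_expert(topk_ids, num_experts=4)
segments = expert_segments(counts)
```

In this example expert 1 receives no tokens, so it has no segment and the remaining experts cover contiguous row ranges of the sorted list. The tutorials linked above describe how the actual kernels handle dispatch, grouped GEMM, and combine on the GPU.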