LittleKernel
LittleKernel is a Python DSL for writing CUDA kernels with full PTX-level
control. It generates CUDA source code from Python functions and compiles
them at runtime using nvcc.
Overview
LittleKernel bridges the gap between high-level frameworks (PyTorch, Triton) and hand-written CUDA/PTX. You write kernels as decorated Python functions using explicit types and intrinsics, and LittleKernel:
Parses the Python AST,
Runs compiler passes (type inference, memory allocation, constant folding, inlining),
Generates CUDA C++ source with inline PTX assembly,
Compiles via
nvccand wraps the result in a callable that interoperates with PyTorch tensors.
Architecture
python/little_kernel/
├── language/ # DSL types, decorators, intrinsics
│ └── intrin/ # Per-feature intrinsic modules
│ ├── wgmma.py # SM90 Tensor Core (WGMMA)
│ ├── umma.py # SM100 Tensor Core (UMMA / tcgen05)
│ ├── tma.py # Tensor Memory Accelerator
│ ├── barrier.py # MBarrier & cluster sync
│ └── ...
├── core/ # IR, passes, type system
│ └── passes/ # Compiler pipeline
├── codegen/ # CUDA code generation
├── runtime/ # nvcc compilation, TMA descriptors, kernel launch
├── atom/ # High-level building blocks (MMA, TMA, barriers)
└── benchmark/ # GPU micro-benchmarks and GEMM kernels
Supported Architectures
SM90 (Hopper) – WGMMA, TMA, MBarrier, Cluster, async pipelines
SM100 (Blackwell) – UMMA, TMEM, tcgen05, 2SM CTA groups, TMA Store