LittleKernel

LittleKernel is a Python DSL for writing CUDA kernels with full PTX-level control. It generates CUDA source code from Python functions and compiles that source at runtime with nvcc.

Overview

LittleKernel bridges the gap between high-level frameworks (PyTorch, Triton) and hand-written CUDA/PTX. You write kernels as decorated Python functions using explicit types and intrinsics, and LittleKernel:

  1. Parses the Python AST,

  2. Runs compiler passes (type inference, memory allocation, constant folding, inlining),

  3. Generates CUDA C++ source with inline PTX assembly,

  4. Compiles via nvcc and wraps the result in a callable that interoperates with PyTorch tensors.
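
The first three stages above can be illustrated with a toy sketch built only on the Python standard library. This is not LittleKernel's implementation or API: the names `ConstantFolder`, `emit_expr`, `emit_device_function`, and `lower` are hypothetical, the only pass shown is constant folding, types are read directly from annotations rather than inferred, and no inline PTX is emitted. It only shows the general shape of "parse the AST, transform it, print CUDA C++".

```python
import ast
import textwrap

# A tiny Python function standing in for a kernel body (illustrative only).
SOURCE = textwrap.dedent("""
    def scale(x: float, y: float) -> float:
        return x * (2.0 * 4.0) + y
""")


class ConstantFolder(ast.NodeTransformer):
    """Toy pass: fold binary operations whose operands are numeric literals."""

    _OPS = {
        ast.Add: lambda a, b: a + b,
        ast.Sub: lambda a, b: a - b,
        ast.Mult: lambda a, b: a * b,
    }

    def visit_BinOp(self, node):
        self.generic_visit(node)  # fold the children first
        fold = self._OPS.get(type(node.op))
        left, right = node.left, node.right
        if (
            fold is not None
            and isinstance(left, ast.Constant)
            and isinstance(right, ast.Constant)
            and isinstance(left.value, (int, float))
            and isinstance(right.value, (int, float))
        ):
            folded = ast.Constant(value=fold(left.value, right.value))
            return ast.copy_location(folded, node)
        return node


# Hypothetical mappings from Python annotations/operators to C++ spellings.
_C_TYPES = {"float": "float", "int": "int"}
_C_OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}


def emit_expr(node):
    """Print an expression tree as a C++ expression string."""
    if isinstance(node, ast.Constant):
        value = node.value
        return f"{value}f" if isinstance(value, float) else str(value)
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.BinOp):
        op = _C_OPS[type(node.op)]
        return f"({emit_expr(node.left)} {op} {emit_expr(node.right)})"
    raise NotImplementedError(ast.dump(node))


def emit_device_function(fn):
    """Print a single-return Python function as a CUDA __device__ function."""
    params = ", ".join(
        f"{_C_TYPES[arg.annotation.id]} {arg.arg}" for arg in fn.args.args
    )
    ret_type = _C_TYPES[fn.returns.id]
    stmt = fn.body[0]
    if not isinstance(stmt, ast.Return):
        raise NotImplementedError("only a single return statement is handled")
    return (
        f"__device__ {ret_type} {fn.name}({params}) {{\n"
        f"    return {emit_expr(stmt.value)};\n"
        f"}}\n"
    )


def lower(source):
    """Parse, run the toy pass, and generate CUDA C++ text."""
    tree = ast.parse(source)              # stage 1: parse the Python AST
    tree = ConstantFolder().visit(tree)   # stage 2: run a compiler pass
    return emit_device_function(tree.body[0])  # stage 3: generate source


cuda_source = lower(SOURCE)
print(cuda_source)
```

In this sketch the literal product `2.0 * 4.0` is folded before code generation, so the emitted function multiplies `x` by a single constant. A real pipeline would carry far more type and memory information through its IR than a bare `ast` tree does.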

Architecture

python/little_kernel/
├── language/        # DSL types, decorators, intrinsics
│   └── intrin/      # Per-feature intrinsic modules
│       ├── wgmma.py     # SM90 Tensor Core (WGMMA)
│       ├── umma.py      # SM100 Tensor Core (UMMA / tcgen05)
│       ├── tma.py       # Tensor Memory Accelerator
│       ├── barrier.py   # MBarrier & cluster sync
│       └── ...
├── core/            # IR, passes, type system
│   └── passes/      # Compiler pipeline
├── codegen/         # CUDA code generation
├── runtime/         # nvcc compilation, TMA descriptors, kernel launch
├── atom/            # High-level building blocks (MMA, TMA, barriers)
└── benchmark/       # GPU micro-benchmarks and GEMM kernels

Supported Architectures

  • SM90 (Hopper) – WGMMA, TMA, MBarrier, Cluster, async pipelines

  • SM100 (Blackwell) – UMMA, TMEM, tcgen05, 2SM CTA groups, TMA Store
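
The features listed for each architecture are architecture-specific: WGMMA instructions require the `sm_90a` target and tcgen05 instructions require `sm_100a`, so the nvcc invocation has to name the matching target. The sketch below shows one way such a command line might be assembled. The helper names, the `"sm90"`/`"sm100"` keys, and the choice of cubin output are assumptions made for illustration; how LittleKernel's runtime actually invokes nvcc is not described here.

```python
import shutil
import subprocess

# Architecture-specific nvcc targets for the two supported GPU generations.
# The dictionary keys are hypothetical labels chosen for this sketch.
ARCH_FLAGS = {
    "sm90": "sm_90a",    # Hopper: WGMMA
    "sm100": "sm_100a",  # Blackwell: tcgen05 / UMMA
}


def nvcc_command(src_path, out_path, target):
    """Build an nvcc command that compiles one .cu file to a cubin."""
    return [
        "nvcc",
        f"-arch={ARCH_FLAGS[target]}",
        "-std=c++17",
        "-cubin",
        "-o", out_path,
        src_path,
    ]


def compile_if_available(src_path, out_path, target):
    """Run nvcc when it is on PATH; return False instead of failing when it is not."""
    if shutil.which("nvcc") is None:
        return False
    subprocess.run(nvcc_command(src_path, out_path, target), check=True)
    return True


print(" ".join(nvcc_command("kernel.cu", "kernel.cubin", "sm90")))
```

Note that the `sm_100a` target is only accepted by CUDA toolkits recent enough to support Blackwell.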
