LittleKernel
============

LittleKernel is a Python DSL for writing CUDA kernels with full PTX-level control. It generates CUDA source code from Python functions and compiles them at runtime using ``nvcc``.

.. contents:: On this page
   :local:
   :depth: 2

Overview
--------

LittleKernel bridges the gap between high-level frameworks (PyTorch, Triton) and hand-written CUDA/PTX. You write kernels as decorated Python functions using explicit types and intrinsics, and LittleKernel:

1. **Parses** the Python AST,
2. **Runs compiler passes** (type inference, memory allocation, constant folding, inlining),
3. **Generates** CUDA C++ source with inline PTX assembly,
4. **Compiles** via ``nvcc`` and wraps the result in a callable that interoperates with PyTorch tensors.

Architecture
------------

::

    python/little_kernel/
    ├── language/          # DSL types, decorators, intrinsics
    │   └── intrin/        # Per-feature intrinsic modules
    │       ├── wgmma.py   # SM90 Tensor Core (WGMMA)
    │       ├── umma.py    # SM100 Tensor Core (UMMA / tcgen05)
    │       ├── tma.py     # Tensor Memory Accelerator
    │       ├── barrier.py # MBarrier & cluster sync
    │       └── ...
    ├── core/              # IR, passes, type system
    │   └── passes/        # Compiler pipeline
    ├── codegen/           # CUDA code generation
    ├── runtime/           # nvcc compilation, TMA descriptors, kernel launch
    ├── atom/              # High-level building blocks (MMA, TMA, barriers)
    └── benchmark/         # GPU micro-benchmarks and GEMM kernels

Supported Architectures
-----------------------

- **SM90 (Hopper)** -- WGMMA, TMA, MBarrier, Cluster, async pipelines
- **SM100 (Blackwell)** -- UMMA, TMEM, tcgen05, 2SM CTA groups, TMA Store

Sub-pages
---------

.. toctree::
   :maxdepth: 1

   getting-started
   dsl-reference
   intrinsics
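
Pipeline sketch
---------------

The first three stages listed in the Overview (parse, run passes, generate CUDA source) can be illustrated with a toy example built only on Python's standard ``ast`` module. This is a minimal sketch and not LittleKernel's implementation: the kernel text, the string type annotations, and the names ``FoldConstants``, ``emit_stmt``, and ``generate_cuda`` are all hypothetical and do not come from the LittleKernel API. The ``nvcc`` compilation step is omitted, and ``ast.unparse`` requires Python 3.9 or newer.

.. code-block:: python

    import ast
    import textwrap

    # Hypothetical kernel in plain Python syntax. The string annotations stand in
    # for explicit DSL types; they are NOT LittleKernel's real type names.
    KERNEL_SRC = textwrap.dedent('''
        def scale(x: "float*", n: "int"):
            i = blockIdx.x * (16 * 16) + threadIdx.x
            if i < n:
                x[i] = x[i] * 2.0
    ''')


    class FoldConstants(ast.NodeTransformer):
        """Toy pass: fold numeric products such as ``16 * 16`` into ``256``."""

        def visit_BinOp(self, node):
            self.generic_visit(node)  # fold nested expressions first
            left, right = node.left, node.right
            if (isinstance(node.op, ast.Mult)
                    and isinstance(left, ast.Constant)
                    and isinstance(right, ast.Constant)
                    and isinstance(left.value, (int, float))
                    and isinstance(right.value, (int, float))):
                folded = ast.Constant(value=left.value * right.value, kind=None)
                return ast.copy_location(folded, node)
            return node


    def emit_stmt(stmt, indent="    "):
        """Emit C++ lines for the tiny statement subset this sketch supports."""
        if isinstance(stmt, ast.Assign):
            target = stmt.targets[0]
            rhs = ast.unparse(stmt.value)
            if isinstance(target, ast.Name):
                # Simplification: every plain-name assignment declares a variable.
                return [f"{indent}auto {target.id} = {rhs};"]
            return [f"{indent}{ast.unparse(target)} = {rhs};"]
        if isinstance(stmt, ast.If):
            lines = [f"{indent}if ({ast.unparse(stmt.test)}) {{"]
            for inner in stmt.body:
                lines += emit_stmt(inner, indent + "    ")
            lines.append(f"{indent}}}")
            return lines
        raise NotImplementedError(type(stmt).__name__)


    def generate_cuda(py_src):
        tree = ast.parse(py_src)             # stage 1: parse the Python AST
        tree = FoldConstants().visit(tree)   # stage 2: run a compiler pass
        fn = tree.body[0]
        params = ", ".join(f"{a.annotation.value} {a.arg}" for a in fn.args.args)
        lines = [f'extern "C" __global__ void {fn.name}({params}) {{']
        for stmt in fn.body:
            lines += emit_stmt(stmt)
        lines.append("}")
        return "\n".join(lines)              # stage 3: CUDA C++ source text


    cuda_src = generate_cuda(KERNEL_SRC)
    print(cuda_src)

The printed text is a ``__global__`` function in which ``16 * 16`` has already been folded to ``256``. A real pipeline would also need type inference and memory allocation before code generation, and would then hand the generated source to ``nvcc``.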