DSL Reference ============= .. contents:: On this page :local: :depth: 2 Kernel Definition ----------------- Kernels are regular Python functions decorated with ``@lk.ll_kernel``: .. code-block:: python import little_kernel as lk import little_kernel.language as ll @lk.ll_kernel(backend="cuda", is_entry=True) def my_kernel( data: ll.Tensor[ll.float32], N: ll.int32, ) -> ll.void: ... Parameters: - ``backend``: Currently only ``"cuda"`` is supported. - ``is_entry``: ``True`` for top-level (launchable) kernels; ``False`` for device functions that can be called from other kernels. Type System ----------- Scalar Types ~~~~~~~~~~~~ =========== ================== ========= LK Type C++ Type Bytes =========== ================== ========= ``int32`` ``int`` 4 ``uint32`` ``unsigned int`` 4 ``int64`` ``long long`` 8 ``uint64`` ``unsigned long long`` 8 ``float16`` ``__half`` 2 ``bfloat16`` ``__nv_bfloat16`` 2 ``float32`` ``float`` 4 ``float64`` ``double`` 8 ``void`` ``void`` 0 =========== ================== ========= Tensor Types ~~~~~~~~~~~~ ``ll.Tensor[dtype]`` maps to a raw pointer (``dtype*``) in generated code. All pointer arithmetic and bounds checking is the programmer's responsibility. Qualifiers ~~~~~~~~~~ - ``ll.const[T]``: C++ ``const T`` - ``ll.grid_constant[T]``: ``__grid_constant__ const T`` (for TMA descriptors) Variable Declarations ~~~~~~~~~~~~~~~~~~~~~ Use type annotations for local variables: .. code-block:: python x: ll.float32 = 0.0 idx: ll.int32 = ll.threadIdx_x() addr: ll.uint64 = ll.get_smem_address(buf) Shared Memory ~~~~~~~~~~~~~ Allocate shared memory with ``ll.empty``: .. code-block:: python # Static shared memory buf = ll.empty([1024], dtype=ll.float32, scope="shared") # Dynamic shared memory ll.align_memory(1024, scope="dynamic_shared") buf = ll.empty([SIZE], dtype=ll.bfloat16, scope="dynamic_shared") Control Flow ~~~~~~~~~~~~ Standard Python ``if/else``, ``for`` (with ``range()``), and ``while`` are supported and map directly to C++ control flow. Loop pragmas: ``ll.unroll()`` emits ``#pragma unroll``. Building & Launching -------------------- .. code-block:: python from little_kernel.core.passes import PASSES from little_kernel.codegen.codegen_cuda import codegen_cuda passes = PASSES["cuda"] kernel = my_kernel.build( passes, codegen_cuda, grid=(grid_x, grid_y, grid_z), block=(block_x, block_y, block_z), shared_mem_bytes=SMEM_SIZE, arch="sm_90", # or "sm_100a" for Blackwell cluster_dim=(cx, cy, cz), # optional, for SM90+ clusters ) # Launch (pass same args as the @ll_kernel function) kernel(data_tensor, N) torch.cuda.synchronize() Inspecting Generated Code -------------------------- To see the generated CUDA source without compiling: .. code-block:: python cuda_src = my_kernel.compile(passes, codegen_cuda) print(cuda_src)