DSL Reference

Kernel Definition

Kernels are regular Python functions decorated with @lk.ll_kernel:

import little_kernel as lk
import little_kernel.language as ll

@lk.ll_kernel(backend="cuda", is_entry=True)
def my_kernel(
    data: ll.Tensor[ll.float32],
    N: ll.int32,
) -> ll.void:
    ...

Parameters:

  • backend: Currently only "cuda" is supported.

  • is_entry: True for top-level (launchable) kernels; False for device functions that can be called from other kernels.

Type System

Scalar Types

Tensor Types

ll.Tensor[dtype] maps to a raw pointer (dtype*) in generated code. All pointer arithmetic and bounds checking are the programmer’s responsibility.
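Because a tensor parameter is only a raw pointer, a kernel usually guards each access against an element count passed in separately (such as N in the example above). The sketch below models that pattern in plain Python on the host. It does not call little_kernel, and the names simulate_guarded_indices, block_dim, and grid_dim are illustrative only.

```python
def simulate_guarded_indices(n, block_dim, grid_dim):
    """Return the element indices a 1-D launch touches when guarded by `idx < n`."""
    touched = []
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            # Global thread index, as a kernel would compute it.
            idx = block_idx * block_dim + thread_idx
            # The bounds check the kernel author has to write by hand.
            if idx < n:
                touched.append(idx)
    return touched


# 10 elements, 4 threads per block, 3 blocks: 12 threads, 2 of them do nothing.
indices = simulate_guarded_indices(n=10, block_dim=4, grid_dim=3)
print(indices)
```

Without the guard, the last two threads in this launch would read or write past the end of the buffer.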

Qualifiers

  • ll.const[T]: C++ const T

  • ll.grid_constant[T]: __grid_constant__ const T (for TMA descriptors)

Variable Declarations

Use type annotations for local variables:

x: ll.float32 = 0.0
idx: ll.int32 = ll.threadIdx_x()
addr: ll.uint64 = ll.get_smem_address(buf)

Shared Memory

Allocate shared memory with ll.empty:

# Static shared memory
buf = ll.empty([1024], dtype=ll.float32, scope="shared")

# Dynamic shared memory
ll.align_memory(1024, scope="dynamic_shared")
buf = ll.empty([SIZE], dtype=ll.bfloat16, scope="dynamic_shared")
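The dynamic shared memory a kernel needs has to be sized on the host before launch. A minimal sketch of that arithmetic follows, assuming each buffer starts at a 1024-byte boundary as the ll.align_memory call above suggests. The helpers align_up and dynamic_smem_bytes are hypothetical and not part of little_kernel; whether extra slack is needed to align the base address itself is not covered here.

```python
# Element sizes in bytes for the dtypes used above (standard widths).
DTYPE_BYTES = {"float32": 4, "bfloat16": 2}


def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def dynamic_smem_bytes(buffers, alignment=1024):
    """Total bytes for a list of (num_elements, dtype_name) buffers.

    Assumption: every buffer begins at an `alignment`-byte boundary.
    """
    offset = 0
    for num_elements, dtype_name in buffers:
        offset = align_up(offset, alignment)
        offset += num_elements * DTYPE_BYTES[dtype_name]
    return offset


# One bfloat16 buffer of 4096 elements followed by 100 float32 values.
smem_size = dynamic_smem_bytes([(4096, "bfloat16"), (100, "float32")])
print(smem_size)
```

The order of allocations matters: placing the small buffer first forces padding up to the next 1024-byte boundary before the large one.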

Control Flow

Standard Python if/else, for (with range()), and while are supported and map directly to C++ control flow.

Loop pragmas: ll.unroll() emits #pragma unroll.

Building & Launching

import torch

from little_kernel.core.passes import PASSES
from little_kernel.codegen.codegen_cuda import codegen_cuda

passes = PASSES["cuda"]
kernel = my_kernel.build(
    passes, codegen_cuda,
    grid=(grid_x, grid_y, grid_z),
    block=(block_x, block_y, block_z),
    shared_mem_bytes=SMEM_SIZE,
    arch="sm_90",           # or "sm_100a" for Blackwell
    cluster_dim=(cx, cy, cz),  # optional, for SM90+ clusters
)

# Launch (pass the same arguments as the @lk.ll_kernel function signature)
kernel(data_tensor, N)
torch.cuda.synchronize()
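The grid and block tuples passed to build are ordinary launch dimensions, so for a 1-D problem they are typically derived from the element count with a ceiling division. The sketch below shows that calculation in plain Python. ceil_div and launch_dims_1d are hypothetical helper names, and the 256-thread block size is an arbitrary choice rather than a library default.

```python
def ceil_div(a, b):
    """Integer division rounded up, for positive b."""
    return (a + b - 1) // b


def launch_dims_1d(n, threads_per_block=256):
    """Return (grid, block) tuples that cover n elements with a 1-D launch."""
    grid = (ceil_div(n, threads_per_block), 1, 1)
    block = (threads_per_block, 1, 1)
    return grid, block


# 1000 elements need 4 blocks of 256 threads; the last block is partly idle.
grid, block = launch_dims_1d(1000)
print(grid, block)
```

Since the grid is rounded up, the kernel body still needs its own bounds check against N for the trailing threads.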

Inspecting Generated Code

To see the generated CUDA source without compiling:

cuda_src = my_kernel.compile(passes, codegen_cuda)
print(cuda_src)