DSL Reference
Kernel Definition
Kernels are regular Python functions decorated with @lk.ll_kernel:
import little_kernel as lk
import little_kernel.language as ll
@lk.ll_kernel(backend="cuda", is_entry=True)
def my_kernel(
    data: ll.Tensor[ll.float32],
    N: ll.int32,
) -> ll.void:
    ...
Parameters:
backend: Currently only "cuda" is supported.
is_entry: True for top-level (launchable) kernels; False for device functions that can be called from other kernels.
Type System
Scalar Types
Tensor Types
ll.Tensor[dtype] maps to a raw pointer (dtype*) in generated code.
All pointer arithmetic and bounds checking are the programmer’s responsibility.
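Because a tensor argument is only a raw pointer, any multi-dimensional access has to be flattened by hand. The sketch below is plain Python, not DSL code: `flat_offset` is a hypothetical helper, and the row-major layout is an assumption. It illustrates the index arithmetic and bounds check that a kernel would need to perform itself.

```python
def flat_offset(row: int, col: int, rows: int, cols: int) -> int:
    """Return the row-major element offset of (row, col), or -1 if out of bounds.

    Hypothetical helper for illustration; assumes a dense row-major layout.
    """
    if not (0 <= row < rows and 0 <= col < cols):
        return -1  # out of bounds: the caller must skip the access
    return row * cols + col


print(flat_offset(2, 3, rows=4, cols=8))  # 2 * 8 + 3 = 19
print(flat_offset(4, 0, rows=4, cols=8))  # row 4 is out of range, so -1
```

Inside a kernel, the same check would typically guard each load or store through the pointer.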
Qualifiers
ll.const[T]: C++ const T
ll.grid_constant[T]: __grid_constant__ const T (for TMA descriptors)
Variable Declarations
Use type annotations for local variables:
x: ll.float32 = 0.0
idx: ll.int32 = ll.threadIdx_x()
addr: ll.uint64 = ll.get_smem_address(buf)
Control Flow
Standard Python if/else, for (with range()), and while are
supported and map directly to C++ control flow.
Loop pragmas: ll.unroll() emits #pragma unroll.
Building & Launching
import torch

from little_kernel.core.passes import PASSES
from little_kernel.codegen.codegen_cuda import codegen_cuda

passes = PASSES["cuda"]
kernel = my_kernel.build(
    passes, codegen_cuda,
    grid=(grid_x, grid_y, grid_z),
    block=(block_x, block_y, block_z),
    shared_mem_bytes=SMEM_SIZE,
    arch="sm_90",  # or "sm_100a" for Blackwell
    cluster_dim=(cx, cy, cz),  # optional, for SM90+ clusters
)

# Launch with the same arguments as the @lk.ll_kernel function signature
kernel(data_tensor, N)
torch.cuda.synchronize()  # wait for the kernel to finish
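The grid and block tuples are chosen by the caller. As a minimal sketch, assuming a one-dimensional launch with one thread per element, they could be derived from the problem size as follows. The helpers `ceil_div` and `launch_dims_1d` and the default of 256 threads per block are illustrative assumptions, not part of little_kernel.

```python
def ceil_div(a: int, b: int) -> int:
    """Smallest integer q with q * b >= a, for a >= 0 and b > 0."""
    return (a + b - 1) // b


def launch_dims_1d(n: int, threads_per_block: int = 256):
    """Return (grid, block) tuples that cover n elements, one thread each.

    Hypothetical helper for illustration only.
    """
    grid = (ceil_div(n, threads_per_block), 1, 1)
    block = (threads_per_block, 1, 1)
    return grid, block


grid, block = launch_dims_1d(1000)
print(grid, block)  # (4, 1, 1) (256, 1, 1)
```

With these values the last block has 24 threads past the end of the data (4 * 256 = 1024 > 1000), so the kernel body still needs to compare its global index against N before touching memory.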
Inspecting Generated Code
To see the generated CUDA source without compiling:
cuda_src = my_kernel.compile(passes, codegen_cuda)
print(cuda_src)