Getting Started
Installation
LittleKernel is included in Triton-distributed. A standard editable install picks it up automatically:
cd python
TRITON_BUILD_LITTLE_KERNEL=ON pip install -e .
Verify:
python -c "import little_kernel; print('OK')"
Prerequisites:
Python >= 3.9
PyTorch with CUDA support
nvccon$PATH(CUDA Toolkit 12+; CUDA 13 for SM100)
Writing Your First Kernel
A minimal kernel that adds two vectors:
import little_kernel as lk
import little_kernel.language as ll
from little_kernel.core.passes import PASSES
from little_kernel.codegen.codegen_cuda import codegen_cuda
import torch
@lk.ll_kernel(backend="cuda", is_entry=True)
def vec_add(
A: ll.Tensor[ll.float32],
B: ll.Tensor[ll.float32],
C: ll.Tensor[ll.float32],
N: ll.int32,
) -> ll.void:
tid: ll.int32 = ll.blockIdx_x() * ll.blockDim_x() + ll.threadIdx_x()
if tid < N:
C[tid] = A[tid] + B[tid]
# Build
passes = PASSES["cuda"]
N = 1024
kernel = vec_add.build(
passes, codegen_cuda,
grid=(N // 256, 1, 1),
block=(256, 1, 1),
arch="sm_90",
)
# Launch
A = torch.randn(N, device="cuda", dtype=torch.float32)
B = torch.randn(N, device="cuda", dtype=torch.float32)
C = torch.empty(N, device="cuda", dtype=torch.float32)
kernel(A, B, C, N)
torch.cuda.synchronize()
assert torch.allclose(C, A + B)
print("PASS")
Running Benchmarks
Micro-benchmarks
Run the full suite on a Hopper GPU:
python -m little_kernel.benchmark.run_all
Or a specific category:
python -m little_kernel.benchmark.run_all --category compute
SM90 GEMM Kernels
Run individual GEMM variants:
python -m little_kernel.benchmark.gemm_sm90.gemm_v1
SM100 GEMM Kernels
Run all levels (requires Blackwell GPU):
python -m little_kernel.benchmark.gemm_sm100.test_all_levels
Or a specific level:
python -m little_kernel.benchmark.gemm_sm100.gemm_level1
Running Tests
Unit tests (no GPU required for most):
pytest test/little_kernel/unit/ -v
Integration tests (requires GPU):
pytest test/little_kernel/integration/ -v