Intrinsics Reference ==================== LittleKernel exposes hardware intrinsics through ``little_kernel.language`` (commonly imported as ``ll``). Each intrinsic maps to one or more PTX instructions. .. contents:: On this page :local: :depth: 2 Common Intrinsics ----------------- Thread Indexing ~~~~~~~~~~~~~~~ .. code-block:: python ll.threadIdx_x() # threadIdx.x ll.threadIdx_y() # threadIdx.y ll.blockIdx_x() # blockIdx.x ll.blockIdx_y() # blockIdx.y ll.blockDim_x() # blockDim.x ll.get_lane_idx() # %laneid ll.get_warp_idx() # threadIdx.x / 32 Synchronization ~~~~~~~~~~~~~~~ .. code-block:: python ll.syncthreads() # __syncthreads() ll.syncwarp() # __syncwarp(0xFFFFFFFF) ll.threadfence() # __threadfence() Memory ~~~~~~ .. code-block:: python ll.get_smem_address(buf) # cvta.shared.u32 (SMEM pointer) ll.get_smem_address64(buf) # cvta.shared.u64 Timing ~~~~~~ .. code-block:: python ll.clock64() # clock64() for latency measurement SM90 (Hopper) Intrinsics ------------------------- WGMMA (Warp Group MMA) ~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python ll.wgmma_init_accum_64x64() # Zero-init m64n64 FP32 accumulator ll.wgmma_init_accum_64x128() # Zero-init m64n128 FP32 accumulator ll.wgmma_init_accum_64x256() # Zero-init m64n256 FP32 accumulator ll.wgmma_fence() # wgmma.fence.sync.aligned ll.wgmma_commit() # wgmma.commit_group.sync.aligned ll.wgmma_wait(n) # wgmma.wait_group.sync.aligned N ll.wgmma_m64n64k16_bf16(ad, bd) # WGMMA instruction (descriptor-based) ll.wgmma_m64n128k16_bf16(ad, bd) ll.wgmma_m64n256k16_bf16(ad, bd) ll.wgmma_store_d_64x64(C, ...) # Store accumulator to global memory ll.wgmma_store_d_64x128(C, ...) ll.wgmma_store_d_64x256(C, ...) TMA (Tensor Memory Accelerator) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python # 2D TMA load (global -> shared) ll.tma_load_2d(desc, bar, smem_addr, coord_x, coord_y) # 2D TMA store (shared -> global) ll.tma_store_2d(desc, smem_addr, coord_x, coord_y) ll.tma_store_arrive() ll.tma_store_wait() # 1D bulk copy ll.tma_copy_1d_g2s(smem_addr, gmem_addr, bar, size) MBarrier ~~~~~~~~ .. code-block:: python ll.mbarrier_init(bar, count) ll.mbarrier_arrive_expect_tx(bar, tx_bytes) ll.mbarrier_try_wait_parity(bar, parity) ll.mbarrier_arrive(bar) ll.mbarrier_invalidate(bar) Cluster ~~~~~~~ .. code-block:: python ll.cluster_arrive() ll.cluster_wait() ll.cluster_sync() ll.get_cluster_ctaid() SM100 (Blackwell) Intrinsics ----------------------------- UMMA (Unified MMA / tcgen05) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python ll.elect_one() # elect.sync (returns bool) ll.tcgen05_alloc(num_cols) # TMEM allocation ll.tcgen05_dealloc(tmem, num_cols) # TMEM deallocation ll.tcgen05_ld_4x(tmem, smem_addr) # Load 4 rows from SMEM to TMEM ll.tcgen05_ld_8x(tmem, smem_addr) # Load 8 rows from SMEM to TMEM ll.tcgen05_fence_before() # Fence before MMA ll.tcgen05_fence_after() # Fence after MMA ll.umma_m256nNk16_bf16(idesc, tmem, ad, bd) # UMMA MMA instruction ll.umma_commit() # Commit UMMA group TMA (SM100 / cta_group::2) ~~~~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python ll.tma_load_2d_cg2(desc, bar, smem_addr, x, y) # TMA load with PEER_BIT_MASK ll.tma_store_2d_sm100(desc, smem_addr, x, y) # SM100 TMA store ll.tma_store_commit() # SM100 TMA store commit TMEM Operations ~~~~~~~~~~~~~~~ .. code-block:: python ll.tmem_store_bf16_row(tmem, D, row_off, N, ldm) # TMEM -> global (BF16) ll.tmem_epilogue_coalesced_4w(...) # Optimized epilogue (requires 4 warps) Cluster MBarrier (SM100) ~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python ll.mbarrier_arrive_cluster(bar) ll.mbarrier_arrive_expect_tx_cluster(bar, tx) Utilities ~~~~~~~~~ .. code-block:: python ll.pack_bf16(a, b) # Pack two BF16 into uint32 ll.st_shared_128(addr, v0, v1, v2, v3) # 128-bit shared store ll.uint_as_float(x) # Reinterpret uint32 as float32 ll.__shfl_sync(mask, val, src_lane, width) # Warp shuffle Runtime Helpers --------------- TMA Descriptor Creation ~~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from little_kernel.runtime.tma_descriptor import create_tma_2d_descriptor desc = create_tma_2d_descriptor( tensor, gmem_inner_dim=K, gmem_outer_dim=M, smem_inner_dim=BK, smem_outer_dim=BM, gmem_outer_stride=K, swizzle_mode=128, # B128 swizzle oob_fill=0, l2_promotion=3, # L2_256B )