Triton-distributed Semantics
This document describes the design philosophy and semantic model of Triton-distributed.
Design Philosophy
Triton-distributed is built on a Tile-Centric design philosophy (as described in our MLSys 2025 paper), which means:
Tile as the Unit of Work: Both computation and communication are organized around tiles (blocks of data). Each tile is a self-contained unit that can be computed, transferred, and synchronized independently.
Decoupled Computation and Communication: Communication (data transfer) and computation (GEMM, etc.) are explicitly separated and can be performed by different SMs (Streaming Multiprocessors) or different parts of the same kernel.
Fine-grained Overlapping: By organizing work around tiles, we can achieve fine-grained overlapping where computation on tiles that are already available can proceed while other tiles are still being transferred.
Core Semantic Concepts
Producer-Consumer Model
Triton-distributed uses a producer-consumer model for overlapping computation with communication:
Producer: Responsible for data transfer (e.g., AllGather, All-to-All). The producer copies data to a shared buffer and signals when each tile is ready.
Consumer: Responsible for computation (e.g., GEMM). The consumer waits for tiles to be ready, then computes on them immediately.
# Producer: Transfer data and signal
@triton_dist.jit
def producer_kernel(...):
# Transfer tile to shared buffer
tl.store(remote_buffer_ptr + tile_offset, data)
# Signal that tile is ready
dl.notify(signal_ptr, peer_rank, signal=tile_id, sig_op="set")
# Consumer: Wait for data and compute
@triton_dist.jit
def consumer_kernel(...):
# Wait for tile to be ready
token = dl.wait(signal_ptr + tile_id, 1, "gpu", "acquire")
# Consume token to establish data dependency
data_ptr = dl.consume_token(data_ptr, token)
# Now safe to compute on the tile
result = compute(tl.load(data_ptr))
Signal-Based Synchronization
The synchronization model is based on signals rather than barriers:
wait(ptr, n, scope, semantic): Wait until n signals at ptr reach expected valuesnotify(ptr, rank, signal, sig_op): Send a signal to a specific rankconsume_token(value, token): Establish data dependency between wait and memory access
Key Insight: Unlike global barriers that synchronize all ranks, signal-based synchronization allows fine-grained tile-level coordination. This enables:
Different tiles to proceed independently
Maximum overlapping between communication and computation
Avoiding the “straggler effect” where slow ranks block everyone
Memory Scope and Ordering
Triton-distributed provides explicit control over memory scope and ordering:
Scope:
"gpu": Synchronization visible only within the same GPU"sys": Synchronization visible across the entire system (multiple GPUs/nodes)
Semantic:
"acquire": Ensures all previous writes by other threads are visible after the wait"release": Ensures all previous writes by this thread are visible to other threads
Symmetric Memory Model
Triton-distributed uses symmetric memory (NVSHMEM/ROCSHMEM) for cross-rank communication:
All ranks allocate memory at the same virtual address
symm_at(ptr, rank): Maps a local pointer to the corresponding address on another rankEnables direct load/store across ranks without explicit send/receive
# Access remote rank's memory directly
remote_ptr = dl.symm_at(local_ptr, peer_rank)
tl.store(remote_ptr + offset, data) # Write to peer's memory
Kernel Design Patterns
AllGather + GEMM Overlapping
Time →
┌─────────────────────────────────────────────────────────────┐
│ Producer (AllGather) │
│ [Tile 0] → [Tile 1] → [Tile 2] → [Tile 3] → ... │
│ ↓ ↓ ↓ ↓ │
│ signal signal signal signal │
│ ↓ ↓ ↓ ↓ │
│ Consumer (GEMM) │
│ [Tile 0] → [Tile 1] → [Tile 2] → [Tile 3] → ... │
└─────────────────────────────────────────────────────────────┘
Producer transfers tiles from other ranks via AllGather
After each tile transfer completes, producer signals the consumer
Consumer waits for each tile, then immediately starts GEMM computation
Computation and communication overlap, hiding communication latency
GEMM + ReduceScatter Overlapping
Time →
┌─────────────────────────────────────────────────────────────┐
│ Producer (GEMM) │
│ [Tile 0] → [Tile 1] → [Tile 2] → [Tile 3] → ... │
│ ↓ ↓ ↓ ↓ │
│ signal signal signal signal │
│ ↓ ↓ ↓ ↓ │
│ Consumer (ReduceScatter) │
│ [Tile 0] → [Tile 1] → [Tile 2] → [Tile 3] → ... │
└─────────────────────────────────────────────────────────────┘
Producer (GEMM) computes output tiles
After each tile is computed, producer signals the consumer
Consumer (ReduceScatter) waits for tiles and performs reduction + scatter
Communication happens as computation produces results, maximizing overlap
Threadblock Swizzling
To maximize overlap and minimize synchronization, Triton-distributed uses threadblock swizzling:
# Swizzle tile assignment so each rank starts with its local data
pid_m = (pid_m + rank * tiles_per_rank) % total_tiles
This ensures:
Each rank starts computing on locally available data (no wait needed)
By the time a rank needs remote data, it’s likely already transferred
Reduces synchronization overhead and improves cache efficiency
Token-Based Data Dependency
The consume_token primitive establishes explicit data dependencies:
# Without consume_token: compiler might reorder load before wait
token = dl.wait(signal_ptr, 1, "gpu", "acquire")
data = tl.load(data_ptr) # BUG: might execute before wait!
# With consume_token: explicit dependency
token = dl.wait(signal_ptr, 1, "gpu", "acquire")
data_ptr = dl.consume_token(data_ptr, token) # Establishes dependency
data = tl.load(data_ptr) # Guaranteed to execute after wait
This is essential for correctness because:
GPU compilers aggressively reorder instructions
The wait and load have no syntactic dependency
consume_tokencreates an explicit data dependency that prevents reordering
Best Practices
Minimize Synchronization Granularity: Use per-tile signals instead of global barriers
Overlap Aggressively: Start computation as soon as any tile is ready, don’t wait for all tiles
Use Threadblock Swizzling: Arrange work so local data is processed first
Batch Signals When Possible: Multiple tiles can share a signal if they’re always accessed together
Consider Memory Scope: Use
"gpu"scope for intra-node,"sys"for inter-nodeProfile and Tune: Use the built-in profiler to identify overlap efficiency