Sequence Parallel Flash Decode

Sequence Parallel Flash Decode layer for distributed attention decoding.

Description

This layer implements distributed flash attention decoding with sequence parallelism, efficiently distributing the KV cache across multiple GPUs.

Example Usage

# Test SP Flash Decode
bash scripts/launch.sh python/triton_dist/test/nvidia/test_sp_decode_attn.py --case perf