KV Cache

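A key-value (KV) cache implementation for efficient autoregressive decoding. The attention keys and values of already-processed tokens are stored, so each decoding step computes them only for the newly generated token instead of recomputing them for the whole sequence.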

See python/triton_dist/models/kv_cache.py for implementation details.
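
As a rough illustration of the idea only, and not the code in python/triton_dist/models/kv_cache.py, the sketch below shows a toy single-layer, single-head cache in pure Python. The names `ToyKVCache`, `append`, `view`, `max_seq_len`, and `head_dim` are hypothetical and chosen for this example; a real implementation would typically hold preallocated device tensors across layers and heads.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Vector = List[float]


@dataclass
class ToyKVCache:
    """Hypothetical single-layer, single-head KV cache (illustration only)."""

    max_seq_len: int
    head_dim: int
    keys: List[Vector] = field(default_factory=list)
    values: List[Vector] = field(default_factory=list)

    def append(self, k: Vector, v: Vector) -> int:
        """Store the key/value of one new token and return the cached length."""
        if len(k) != self.head_dim or len(v) != self.head_dim:
            raise ValueError("key and value must each have head_dim elements")
        if len(self.keys) >= self.max_seq_len:
            raise RuntimeError("cache is full")
        # Copy the inputs so later mutation by the caller cannot alter the cache.
        self.keys.append(list(k))
        self.values.append(list(v))
        return len(self.keys)

    def view(self) -> Tuple[List[Vector], List[Vector]]:
        """Return all cached keys and values, oldest token first."""
        return self.keys, self.values


cache = ToyKVCache(max_seq_len=4, head_dim=2)

# Decoding loop: each step contributes only the new token's key and value;
# attention for that step would then read everything returned by view().
for step in range(3):
    k = [float(step), 0.0]  # stand-in for the new token's projected key
    v = [0.0, float(step)]  # stand-in for the new token's projected value
    cached_len = cache.append(k, v)
```

The point of the design is that the work of producing keys and values per step stays constant in the sequence length, at the cost of memory that grows with the number of cached tokens, which is why the sketch enforces a fixed `max_seq_len`.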