# Intra-Kernel Profiler User Guide
This guide details the function and interface usage of the Intra-Kernel Profiler, which profiles the execution time of each task in the kernel to guide performance optimization.

The profiler provides separate interfaces for **Device Side** and **Host Side** with the following usage:


## 1. Device Side Interfaces
Used to initialize the profiler and record start/end times of target tasks within the kernel.

### 1.0 Dependencies
The intra-kernel profiler relies on `tg4perfetto` to generate trace for now.
Install `tg4perfetto` first:
```bash
pip install tg4perfetto
```

### 1.1 Profiler Initialization
Create a profiler instance with configuration parameters:
```python
from triton_dist.tools.profiler import Profiler

profiler = Profiler.create(
    profiler_buffer=profiler_buf,
    group_id=0,
    num_groups=1,
    is_leader=(tid(0) == 0),
    ENABLE_PROFILING=True
)
```

- `profiler_buffer`: Device side tensor passed from the host side to the kernel.
- `group_id`: For the current Triton frontend, **set to 0**.
- `num_groups`: Total number of thread groups in a block. For the triton, **set to 1** is enough.
- `is_leader`: Predicate to select one thread per group (e.g., tid(0) == 0) to perform record.
- `ENABLE_PROFILING`: Default to true, if set to False to skip all record operations.

### 1.2 Task Time Record
Record start and end times of target tasks using the record method:

```python
# Record task start
profiler = profiler.record(is_start=True, task_type=0)
# do something.... 
# Record task end
profiler = profiler.record(is_start=False, task_type=0)
```

- `is_start`: Distinguish between the task start and end, True (start) / False (end).
- `task_type`: Integers start from 0 will be mapped to the corresponding task name during visualization(e.g. task_type=0 -> "perfect")

> Note: The Triton frontend does not support in-place modification. Thus, `profiler.record` returns a new profiler instance that overwrites the original.

## 2. Host Side Interfaces
Used to manage profiler buffers and export trace files, with two usage modes: Wrapped Interface (simplified) and Separate Interfaces (flexible).

### 2.1 Wrapped Interface: ProfilerBuffer
A context-manager-based interface simplifying buffer management and trace export:

```python
from triton_dist.tools.profiler import ProfilerBuffer

with ProfilerBuffer(
    max_num_profile_slots=1000000,
    trace_file="copy",
    task_names=["perfect", "non-perfect"]
) as profile_buf:
    # Execute Triton kernel (pass profile_buf as parameter)
    copy_1d_tilewise_kernel[grid](
        profile_buf, src_tensor, dst_tensor, grid_barrier, M * N
    )
```

- `max_num_profile_slots`: Must be greater than the total number of record operations across all thread blocks (user responsibility).
- `trace_file`	Output trace file name.
- `task_names`	List of readable names corresponding to task_type (e.g., task_type=0 -> "perfect").

By default, a trace file is generated for each iteration. To export traces selectively, use these switches:

```python
from triton_dist.tools.profiler import set_export_trace_on, set_export_trace_off

set_export_trace_on()  # Enable export on ProfilerBuffer exit

set_export_trace_off()  # Disable export
```

### 2.2 Separate Interfaces
For fine-grained control, use independent functions for buffer management and trace export:
```python
from triton_dist.tools.profiler import (
    alloc_profiler_buffer, 
    reset_profiler_buffer,
    export_to_perfetto_trace
)

# Allocate profiler buffer
profile_buf = alloc_profiler_buffer(max_num_profile_slots=1000000)

# Reset buffer
reset_profiler_buffer(profile_buf)

# Execute Triton kernel
copy_1d_tilewise_kernel[grid](
    profile_buf, src_tensor, dst_tensor, grid_barrier, M * N
)

# Export trace data
export_to_perfetto_trace(
    profiler_buffer=profile_buf,
    task_names=["perfect", "non-perfect"],
    file_name="copy"
)
```

## 3. Reference
- [feat: flashinfer intra-kernel profiler](https://github.com/flashinfer-ai/flashinfer/pull/913)