NVIDIA CUDA CUPTI Runtime Libraries

12.9.79 · active · verified Sun Mar 29

The `nvidia-cuda-cupti-cu12` package provides the CUDA Profiling Tools Interface (CUPTI) runtime libraries for CUDA 12.x. CUPTI is a dynamic library for building profiling and tracing tools that target CUDA applications. This package supplies the low-level C libraries; Python bindings are provided separately by the `cupti-python` package. The current version is 12.9.79, and the package is actively maintained by NVIDIA.
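
Since the package is consumed as a shared library rather than imported, one way to confirm an install is to look for the library file on disk. The sketch below assumes a Linux install where the wheel unpacks to `nvidia/cuda_cupti/lib/` under a site-packages directory. That layout is an assumption, not something stated on this page, and `find_cupti_libraries` is a hypothetical helper name.

```python
import site
from pathlib import Path


def find_cupti_libraries(search_roots=None):
    """Return paths of libcupti shared libraries found under the given roots.

    Assumption: the wheel installs to <root>/nvidia/cuda_cupti/lib/.
    """
    if search_roots is None:
        # Default to the interpreter's global and user site-packages directories.
        search_roots = list(site.getsitepackages()) + [site.getusersitepackages()]
    found = []
    for root in search_roots:
        lib_dir = Path(root) / "nvidia" / "cuda_cupti" / "lib"
        if lib_dir.is_dir():
            # Match both the unversioned name and versioned names such as libcupti.so.12
            found.extend(sorted(lib_dir.glob("libcupti.so*")))
    return found
```

Calling `find_cupti_libraries()` with no arguments searches the current interpreter's site-packages. An empty list suggests the library is either not installed or installed somewhere this sketch does not look.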

Warnings

Install

Imports

Quickstart

This quickstart uses `cupti-python` to profile a simple Numba CUDA kernel. It registers buffer callbacks, enables collection of concurrent-kernel activity records, launches the kernel, and then flushes the collected records so that each kernel's name and duration are printed. It requires `numba-cuda`, a CUDA-capable GPU with a compatible driver, `cupti-python`, and a `libcupti.so` library (provided by `nvidia-cuda-cupti-cu12`) that the dynamic loader can find.

import numpy as np
from numba import cuda
from cupti import cupti

@cuda.jit
def vector_add(A, B, C):
    idx = cuda.grid(1)
    if idx < A.size:
        C[idx] = A[idx] + B[idx]

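def func_buffer_requested():
    # Called by CUPTI when it needs a new buffer for activity records
    buffer_size = 8 * 1024 * 1024  # 8MB buffer
    max_num_records = 0  # 0 = fit as many records as the buffer allows
    return buffer_size, max_num_records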

def func_buffer_completed(activities: list):
    for activity in activities:
        if activity.kind == cupti.ActivityKind.CONCURRENT_KERNEL:
            print(f"Kernel Name: {activity.name}")
            print(f"Kernel Duration (ns): {activity.end - activity.start}")

# Initialize data
vector_length = 1024 * 1024
A = np.random.rand(vector_length)
B = np.random.rand(vector_length)
C = np.zeros_like(A)

threads_per_block = 128
blocks_per_grid = (vector_length + (threads_per_block - 1)) // threads_per_block

# Register CUPTI callbacks
cupti.activity_register_callbacks(func_buffer_requested, func_buffer_completed)

# Enable CUPTI activity collection for concurrent kernels
cupti.activity_enable(cupti.ActivityKind.CONCURRENT_KERNEL)

# Launch kernel
vector_add[blocks_per_grid, threads_per_block](A, B, C)
cuda.synchronize()

# Force-flush buffered records (flag 1), then disable activity collection
cupti.activity_flush_all(1)
cupti.activity_disable(cupti.ActivityKind.CONCURRENT_KERNEL)
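
The completion callback above prints each record as it arrives. When a kernel is launched many times, it may be more useful to collect the records and summarize them afterwards. The following sketch assumes only what the quickstart already uses, namely that each kernel record exposes `name`, `start`, and `end` (in nanoseconds). `summarize_kernels` is a hypothetical helper, and the stand-in records are made up for illustration, not real CUPTI output.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_kernels(records):
    """Group records by kernel name and return {name: (launches, total_ns)}."""
    totals = defaultdict(lambda: [0, 0])
    for rec in records:
        entry = totals[rec.name]
        entry[0] += 1                    # launch count
        entry[1] += rec.end - rec.start  # accumulated duration in ns
    return {name: (count, total) for name, (count, total) in totals.items()}


# Stand-in records with made-up names and timestamps (hypothetical data)
fake_records = [
    SimpleNamespace(name="vector_add", start=1_000, end=4_000),
    SimpleNamespace(name="vector_add", start=5_000, end=7_500),
    SimpleNamespace(name="scale", start=8_000, end=8_400),
]
summary = summarize_kernels(fake_records)
for name, (count, total_ns) in summary.items():
    print(f"{name}: {count} launches, {total_ns} ns total")
```

In the quickstart, `func_buffer_completed` could append the fields it needs to a module-level list and the summary could run after the flush. Copying `name`, `start`, and `end` into plain Python objects inside the callback is the cautious choice, since it is worth verifying whether the record objects remain valid after the callback returns.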
