NVIDIA CUDA CUPTI Runtime Libraries
The `nvidia-cuda-cupti-cu12` package provides the CUDA Profiling Tools Interface (CUPTI) runtime libraries for CUDA 12.x. CUPTI is a dynamic library that enables the creation of profiling and tracing tools for CUDA applications. This package supplies only the low-level C libraries; Python bindings are provided by the separate `cupti-python` package. The current version is 12.9.79, and the package is actively maintained by NVIDIA.
Warnings
- gotcha The `nvidia-cuda-cupti-cu12` package provides the underlying C/C++ libraries. For Python-level interaction and APIs, the `cupti-python` package must also be installed. Direct Python imports are typically from `cupti` (the `cupti-python` module), not `nvidia_cuda_cupti_cu12`.
- gotcha CUPTI Python relies on the `libcupti.so` C library. If `nvidia-cuda-cupti-cu12` is uninstalled or if `libcupti.so` cannot be found automatically, you may need to explicitly set the `LD_LIBRARY_PATH` environment variable to the directory containing `libcupti.so` (e.g., `$CUDA_TOOLKIT_INSTALL_PATH/extras/CUPTI/lib64`).
- breaking In CUDA Toolkit 12.0, the activity record `CUpti_ActivityKernel8` was deprecated and replaced by `CUpti_ActivityKernel9` to accommodate new fields for devices with compute capability 9.0 and higher. This impacts users interacting with the low-level CUPTI C API, and potentially `cupti-python` users working with older code that explicitly references these activity kinds.
- gotcha Older versions of `nvidia-cuda-cupti-cu12` (e.g., 12.4.127, 12.3.101) have been flagged with severe vulnerabilities. While the current version 12.9.79 should address these, always ensure you are running the latest stable version and keep your CUDA Toolkit and drivers updated.
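For the `LD_LIBRARY_PATH` gotcha above, a small sketch can make troubleshooting concrete. The helper below searches for `libcupti.so*` and prepends its directory to `LD_LIBRARY_PATH`. The wheel layout `site-packages/nvidia/cuda_cupti/lib` is an assumption about how pip installs `nvidia-cuda-cupti-cu12`; verify it on your system.

```python
import os
import sysconfig
from pathlib import Path


def find_libcupti(extra_dirs=()):
    """Return the first directory containing libcupti.so*, or None.

    Searches the pip wheel location (site-packages/nvidia/cuda_cupti/lib,
    an assumption about the wheel layout) plus any caller-supplied dirs,
    e.g. $CUDA_TOOLKIT_INSTALL_PATH/extras/CUPTI/lib64.
    """
    site_packages = Path(sysconfig.get_paths()["purelib"])
    candidates = [site_packages / "nvidia" / "cuda_cupti" / "lib",
                  *map(Path, extra_dirs)]
    for d in candidates:
        if d.is_dir() and any(d.glob("libcupti.so*")):
            return d
    return None


def export_ld_library_path(lib_dir):
    """Prepend lib_dir to LD_LIBRARY_PATH for the current process."""
    current = os.environ.get("LD_LIBRARY_PATH", "")
    os.environ["LD_LIBRARY_PATH"] = (
        f"{lib_dir}:{current}" if current else str(lib_dir)
    )
    return os.environ["LD_LIBRARY_PATH"]
```

Note that the dynamic loader reads `LD_LIBRARY_PATH` at process startup, so setting it from within Python only affects child processes; for the current process, export it in the shell before launching Python.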
Install
- `pip install nvidia-cuda-cupti-cu12`
- `pip install --extra-index-url https://pypi.ngc.nvidia.com nvidia-cuda-runtime-cu12`
- `pip install cupti-python`
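After installing, a quick sanity check can confirm the environment before running the quickstart. This sketch only verifies importability; the module names are taken from this document's Imports and Quickstart sections (the `nvidia-cuda-cupti-cu12` wheel ships shared libraries only and has no top-level importable module of its own).

```python
import importlib.util


def have_module(name: str) -> bool:
    """True if a top-level module `name` can be imported here."""
    return importlib.util.find_spec(name) is not None


# `cupti` comes from cupti-python; numba and numpy are needed
# by the quickstart example below.
for mod in ("cupti", "numba", "numpy"):
    status = "found" if have_module(mod) else "MISSING"
    print(f"{mod}: {status}")
```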
Imports
- `cupti` (provided by the `cupti-python` package)

```python
from cupti import cupti
```
Quickstart
```python
import numpy as np
from numba import cuda
from cupti import cupti


@cuda.jit
def vector_add(A, B, C):
    idx = cuda.grid(1)
    if idx < A.size:
        C[idx] = A[idx] + B[idx]


def func_buffer_requested():
    buffer_size = 8 * 1024 * 1024  # 8 MB buffer
    max_num_records = 0
    return buffer_size, max_num_records


def func_buffer_completed(activities: list):
    for activity in activities:
        if activity.kind == cupti.ActivityKind.CONCURRENT_KERNEL:
            print(f"Kernel Name: {activity.name}")
            print(f"Kernel Duration (ns): {activity.end - activity.start}")


# Initialize data
vector_length = 1024 * 1024
A = np.random.rand(vector_length)
B = np.random.rand(vector_length)
C = np.zeros_like(A)

threads_per_block = 128
blocks_per_grid = (vector_length + (threads_per_block - 1)) // threads_per_block

# Register CUPTI buffer callbacks
cupti.activity_register_callbacks(func_buffer_requested, func_buffer_completed)

# Enable CUPTI activity collection for concurrent kernels
cupti.activity_enable(cupti.ActivityKind.CONCURRENT_KERNEL)

# Launch kernel
vector_add[blocks_per_grid, threads_per_block](A, B, C)
cuda.synchronize()

# Flush buffered records (delivered to func_buffer_completed) and disable collection
cupti.activity_flush()
cupti.activity_disable(cupti.ActivityKind.CONCURRENT_KERNEL)
```
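The buffer-completed callback above prints one line per kernel record. A common variant aggregates total GPU time per kernel name instead; the pure function below sketches that, assuming each activity record exposes the same `.kind`, `.name`, `.start`, and `.end` (nanoseconds) fields used in the quickstart.

```python
from collections import defaultdict


def summarize_kernel_time(activities, kernel_kind):
    """Sum (end - start) per kernel name for records of `kernel_kind`.

    `activities` is the list handed to the buffer-completed callback;
    each record is assumed to expose .kind, .name, .start, and .end
    (timestamps in ns), as in the quickstart above.
    """
    totals = defaultdict(int)
    for act in activities:
        if act.kind == kernel_kind:
            totals[act.name] += act.end - act.start
    return dict(totals)
```

To use it, call `summarize_kernel_time(activities, cupti.ActivityKind.CONCURRENT_KERNEL)` inside `func_buffer_completed` in place of the per-record prints. Keeping the aggregation as a pure function of the activity list also makes it easy to unit-test without a GPU.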