XProf Profiler Plugin
The XProf Profiler Plugin profiles and analyzes the performance of machine learning models built with TensorFlow, JAX, or PyTorch/XLA. It helps users understand, debug, and optimize programs so that they run efficiently on CPUs, GPUs, and TPUs. The current version is 2.22.1; the library follows the TensorFlow versioning scheme and is released frequently.
Warnings
- gotcha A regression in `libtpu` versions `0.0.35` and `0.0.37` affects HLO Module-dependent tools like 'HLO Op Profile', 'Trace Viewer', and 'Graph Viewer' in XProf. It is recommended to use `libtpu 0.0.36` as a temporary workaround.
- gotcha The TensorBoard Profiler Plugin requires internet access to load the Google Chart library. If running TensorBoard offline, behind a corporate firewall, or in a datacenter, some charts and tables in the profiler interface may not display correctly.
- gotcha When using virtual environments, install both `TensorBoard` and `tensorboard-plugin-profile` in the *same* virtual environment and make sure that environment is activated when you launch TensorBoard. Mixing installations across environments, or launching from an environment that is not activated, can cause the profiler tab not to appear or to display a 'plugin has moved' error.
- breaking The Profiler plugin requires recent versions of TensorFlow and TensorBoard. Specifically, `TensorFlow >= 2.18.0` and `TensorBoard >= 2.18.0` are prerequisites. Older versions may lead to compatibility issues, including 'The profile plugin has moved' messages even after installation.
- gotcha For GPU profiling, the NVIDIA CUDA Profiling Tools Interface (CUPTI) must be correctly configured and accessible via the `LD_LIBRARY_PATH` environment variable. Insufficient privileges or an incorrect path can prevent GPU profiling data collection.
- gotcha Running the profiler for excessively long durations can lead to out-of-memory errors. It is recommended to profile no more than 10 steps at a time. Also, avoid profiling the first few batches of training, as initialization overhead can skew results.
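Two of the warnings above (the minimum TensorFlow/TensorBoard versions and the CUPTI requirement on `LD_LIBRARY_PATH`) can be checked before a profiling run. The following is a minimal sketch, not part of the plugin: the helper names `parse_version`, `meets_minimum`, and `find_cupti` are hypothetical, the `libcupti.so*` file pattern assumes a Linux CUDA install, and the version parser deliberately ignores pre-release suffixes.

```python
import os
from importlib import metadata
from pathlib import Path


def parse_version(text):
    """Turn '2.18.0' (or '2.19.0rc1') into a comparable tuple such as (2, 18, 0).

    Simplification: any non-digit suffix on a component is dropped, so a
    release candidate compares equal to its final release.
    """
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def meets_minimum(installed, minimum="2.18.0"):
    """Return True when `installed` is at least `minimum`."""
    return parse_version(installed) >= parse_version(minimum)


def find_cupti(ld_library_path=None):
    """Return the LD_LIBRARY_PATH directories that contain a libcupti shared object."""
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    hits = []
    for entry in ld_library_path.split(os.pathsep):
        if not entry:
            continue
        directory = Path(entry)
        # Assumption: CUPTI ships as libcupti.so, libcupti.so.12, and so on.
        if directory.is_dir() and any(directory.glob("libcupti.so*")):
            hits.append(str(directory))
    return hits


if __name__ == "__main__":
    # Distribution names are assumptions; variants such as tensorflow-cpu
    # are reported as "not installed" by this simple check.
    for dist in ("tensorflow", "tensorboard"):
        try:
            version = metadata.version(dist)
            status = "ok" if meets_minimum(version) else "older than 2.18.0"
            print(f"{dist} {version}: {status}")
        except metadata.PackageNotFoundError:
            print(f"{dist}: not installed in this environment")
    print("CUPTI directories:", find_cupti() or "none found on LD_LIBRARY_PATH")
```

An empty result from `find_cupti()` only means that no matching file was found on the search path; it does not rule out a privilege problem, which has to be diagnosed separately.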
Install
- `pip install tensorboard-plugin-profile`
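After installing, one way to confirm that TensorBoard and the plugin ended up in the same environment is to compare where each distribution is installed. This is a sketch using only the standard library; `install_report` and `same_environment` are hypothetical helper names, and comparing install directories is an assumed heuristic rather than a check the plugin itself performs.

```python
import sys
from importlib import metadata


def install_report(names=("tensorboard", "tensorboard-plugin-profile")):
    """Map each distribution name to (version, install location), or None if missing."""
    report = {}
    for name in names:
        try:
            dist = metadata.distribution(name)
        except metadata.PackageNotFoundError:
            report[name] = None
            continue
        # locate_file("") resolves to the directory holding the package,
        # typically the environment's site-packages.
        report[name] = (dist.version, str(dist.locate_file("")))
    return report


def same_environment(report):
    """Return True when every package is present and all share one install location."""
    if any(entry is None for entry in report.values()):
        return False
    return len({location for _, location in report.values()}) == 1


if __name__ == "__main__":
    print("Active interpreter prefix:", sys.prefix)
    report = install_report()
    for name, entry in report.items():
        print(name, "->", entry if entry else "not installed")
    print("Same environment:", same_environment(report))
```

Run it with the same interpreter that launches TensorBoard; a missing entry or two different locations suggests the mixed-installation case described in the warnings.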
Quickstart
import tensorflow as tf
from datetime import datetime
import os
# Ensure log directory exists
log_dir = os.path.join("logs", "profile", datetime.now().strftime("%Y%m%d-%H%M%S"))
os.makedirs(log_dir, exist_ok=True)
# Dummy model and data for profiling
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(10, activation='relu', input_shape=(10,)),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='binary_crossentropy')
data = tf.random.normal(shape=(100, 10))
labels = tf.random.uniform(shape=(100, 1), maxval=2, dtype=tf.int64)
# Option 1: Programmatic profiling with tf.profiler.experimental
print(f"Starting programmatic profile, data will be in {log_dir}")
with tf.profiler.experimental.Profile(log_dir):
    model.fit(data, labels, epochs=2, batch_size=32)
# Option 2: Using TensorBoard Keras Callback for profiling specific batches
# tb_callback = tf.keras.callbacks.TensorBoard(
#     log_dir=log_dir,
#     profile_batch='1,3'  # Profile batches 1 to 3
# )
# model.fit(data, labels, epochs=2, batch_size=32, callbacks=[tb_callback])
print("Profiling data generated. To view, run TensorBoard in your terminal:")
print(f"tensorboard --logdir={os.path.abspath('logs')}")
print("Then open your browser to http://localhost:6006/#profile")
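The guidance in the warnings (skip the first few batches and profile no more than 10 steps) can be turned into a `profile_batch` range for the Keras callback shown in Option 2. The helper below is a hypothetical sketch, not part of TensorFlow or the plugin; it assumes that the callback counts batches from 1 and that `profile_batch` accepts a pair of integers as a range.

```python
def profile_window(warmup_batches=5, num_batches=10, max_batches=10):
    """Return a (start, stop) pair of 1-based batch numbers to profile.

    The window starts after `warmup_batches` and is capped at `max_batches`
    so that a long request does not produce an oversized trace.
    """
    if warmup_batches < 0:
        raise ValueError("warmup_batches must be non-negative")
    if num_batches < 1 or max_batches < 1:
        raise ValueError("num_batches and max_batches must be at least 1")
    length = min(num_batches, max_batches)
    start = warmup_batches + 1
    return (start, start + length - 1)


if __name__ == "__main__":
    # Skip 5 warm-up batches, then profile the next 10.
    print(profile_window())
    # A request for 50 batches is capped at 10.
    print(profile_window(warmup_batches=2, num_batches=50))

    # Hypothetical use with the quickstart's commented-out callback:
    # tb_callback = tf.keras.callbacks.TensorBoard(
    #     log_dir=log_dir,
    #     profile_batch=profile_window(warmup_batches=2, num_batches=4),
    # )
```

The window only helps if the training run is long enough to reach it. The quickstart's run, with 100 samples at a batch size of 32 over 2 epochs, has only 8 batches in total, so a smaller window is needed there.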