ProdigyOpt Optimizer
ProdigyOpt is an Adam-like optimizer for neural networks, designed for high performance and memory efficiency. It adaptively estimates the learning rate, largely removing the need to tune it by hand, and implements decoupled (AdamW-style) weight decay. The current version is 1.1.2, and releases typically focus on minor bug fixes and performance enhancements.
Common errors
- `ImportError: No module named 'prodigyopt'`
  - cause: The `prodigyopt` library is not installed in your current Python environment.
  - fix: Run `pip install prodigyopt` in your terminal to install the library.
- `ImportError: cannot import name 'Prodigy' from 'prodigyopt'`
  - cause: Typo in the import statement or an issue with the installed package.
  - fix: Ensure your import statement is exactly `from prodigyopt import Prodigy` and verify the package is installed correctly.
- `ValueError: Optimizer got an empty parameter list`
  - cause: No trainable parameters were passed to the `Prodigy` optimizer during initialization. This can happen if `model.parameters()` is empty or all parameters are frozen.
  - fix: Verify that your model has parameters with `requires_grad=True` and that you are passing `model.parameters()` to the optimizer, e.g., `optimizer = Prodigy(model.parameters(), lr=1.0)`.
- `RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn`
  - cause: You are calling `.backward()` on a loss tensor that is not connected to any parameters that require gradients. This usually means your model's parameters are frozen or the inputs were detached.
  - fix: Ensure that your model's parameters have `requires_grad=True` and that the computation graph from inputs through the model to the loss is intact (i.e., no stray `.detach()` calls breaking it).
Warnings
- gotcha The `slice_p` parameter (introduced in v1.1, default 1) can significantly impact memory usage for large models. Higher values (e.g., 4) compute the step-size statistics on only every p-th element of each tensor, reducing peak optimizer-state memory at the cost of a slightly approximated estimate.
- gotcha The `decouple` parameter defaults to `True` in Prodigy. This applies weight decay in a decoupled (AdamW-style) manner, which is generally what you want, but it behaves differently from optimizers that couple weight decay into the gradient.
- breaking Versions prior to `1.1.2` had known issues when used with PyTorch's FSDP (Fully Sharded Data Parallel), particularly when some parameters were frozen, leading to incorrect behavior or crashes.
Install
- `pip install prodigyopt`
Imports
- Prodigy: `from prodigyopt import Prodigy` (note: the module is `prodigyopt`, not `prodigy`)
Quickstart
import torch
import torch.nn as nn
from prodigyopt import Prodigy
# 1. Define a simple model
model = nn.Linear(10, 2)
# 2. Initialize the optimizer with model parameters
# Prodigy estimates the step size itself, so lr=1.0 (the default) is recommended;
# decouple=True is also the default, shown explicitly for clarity
optimizer = Prodigy(model.parameters(), lr=1.0, decouple=True)
# 3. Define a loss function
loss_fn = nn.MSELoss()
# 4. Prepare dummy data
inputs = torch.randn(5, 10)
targets = torch.randn(5, 2)
# 5. Perform a training step
optimizer.zero_grad() # Clear gradients
outputs = model(inputs) # Forward pass
loss = loss_fn(outputs, targets) # Compute loss
loss.backward() # Backward pass to compute gradients
optimizer.step() # Update model parameters
print(f"Loss after one step: {loss.item():.4f}")