Trainer (Coqui-AI)
Trainer by Coqui-AI is a general-purpose model trainer for PyTorch, designed to be reused across deep learning tasks. It wraps common training patterns, including distributed training via Hugging Face Accelerate, which makes it suitable both for quick experimentation and for larger-scale projects. The library is in active development (v0.0.36), with frequent micro-releases that fix bugs and add features.
Common errors
- ModuleNotFoundError: No module named 'trainer'
  cause: The 'trainer' library is not installed, or the current Python environment cannot see it.
  fix: Install it with `pip install trainer`, and verify your script runs in the Python environment where `trainer` is installed.
- KeyError: 'output_path' (or similar for config keys)
  cause: The `config` dictionary passed to `Trainer` is missing keys the training loop requires, such as `output_path`, `epochs`, or `start_by_epochs`.
  fix: Review the `Trainer` initialization requirements and make sure your `config` dictionary contains every necessary parameter. Start from the library's examples for a minimal working configuration, and include all keys for your chosen training mode (e.g., `start_by_epochs` for epoch-based training).
- RuntimeError: Expected all tensors to be on the same device, but found at least two devices (cpu and cuda:0)!
  cause: A common PyTorch error: the model, data, or loss function are not all on the same device (e.g., CPU vs. CUDA). `trainer` attempts to manage placement, but manual device moves or inconsistent data loading can conflict with it.
  fix: If not using `accelerate` for multi-device handling, explicitly move your model and batches to the target device (e.g., `model.to('cuda')`, `data.to('cuda')`). If `accelerate` is in use, verify its configuration and avoid manual device placement that might override its logic.
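For the KeyError case, one way to fail fast is to validate the config dict before constructing the `Trainer`. A minimal sketch; the required key names below are the ones this guide uses and may differ in your installed version:

```python
# Hypothetical up-front check: key names assumed from this guide,
# not read from the trainer library itself.
REQUIRED_KEYS = {"output_path", "epochs", "start_by_epochs"}

def validate_config(config: dict) -> None:
    """Raise one clear error instead of a bare KeyError mid-training."""
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"config is missing required keys: {sorted(missing)}")

validate_config({"output_path": "./out", "epochs": 2, "start_by_epochs": True})
```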
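For the device-mismatch error, the fix reduces to routing the model and every batch through one shared device handle. A plain-PyTorch sketch, independent of the trainer library (falls back to CPU when CUDA is absent):

```python
import torch
from torch import nn

# Pick one device and route everything through it.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Linear(10, 1).to(device)   # model parameters on `device`
criterion = nn.MSELoss()

x = torch.randn(4, 10).to(device)     # inputs moved alongside the model
y = torch.randn(4, 1).to(device)      # targets must match as well

loss = criterion(model(x), y)         # no cross-device mismatch possible
```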
Warnings
- breaking The `continue_path` (for resuming training from checkpoints) and `save_best_model` functionalities have undergone several reverts and fixes across versions v0.0.33, v0.0.34, and v0.0.35. This indicates potential instability and breaking changes in how checkpoints are handled or resumed.
- gotcha As a pre-1.0 library (currently v0.0.36), the API may evolve rapidly. Methods, arguments, or configurations might change without extensive deprecation warnings, leading to unexpected errors with minor version updates.
- gotcha Distributed training setups, which leverage `accelerate`, can be complex. Issues like 'distribute rank initialization' have been fixed (v0.0.32), suggesting that multi-GPU or distributed configurations might require careful setup and debugging.
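Given the pre-1.0 churn above, pinning the exact release you validated guards against silent API changes between micro versions; for example, in a requirements.txt:

```
# requirements.txt: pin trainer to the release you tested against
trainer==0.0.36
```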
Install
- pip install trainer
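After installing, a quick stdlib check (no trainer API assumed) confirms the package is importable from the interpreter you will actually run:

```python
import importlib.util

def trainer_available() -> bool:
    """True if the 'trainer' package is importable in this interpreter."""
    return importlib.util.find_spec("trainer") is not None

if not trainer_available():
    print("trainer not found here; install it with: pip install trainer")
```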
Imports
- Trainer
import trainer
from trainer import Trainer
Quickstart
import os

import torch
from torch import nn, optim
from torch.utils.data import DataLoader, Dataset

from trainer import Trainer
# 1. Dummy Dataset
class DummyDataset(Dataset):
    def __init__(self, num_samples=100, input_dim=10, output_dim=1):
        self.X = torch.randn(num_samples, input_dim)
        self.y = torch.randn(num_samples, output_dim)

    def __len__(self):
        return len(self.X)

    def __getitem__(self, idx):
        return self.X[idx], self.y[idx]
# 2. Dummy Model
class DummyModel(nn.Module):
    def __init__(self, input_dim=10, output_dim=1):
        super().__init__()
        self.linear = nn.Linear(input_dim, output_dim)

    def forward(self, x):
        return self.linear(x)
# 3. Setup components
input_dim = 10
output_dim = 1
model = DummyModel(input_dim, output_dim)
optimizer = optim.Adam(model.parameters(), lr=0.001)
criterion = nn.MSELoss()
train_dataset = DummyDataset(num_samples=100, input_dim=input_dim, output_dim=output_dim)
eval_dataset = DummyDataset(num_samples=20, input_dim=input_dim, output_dim=output_dim)
dataloader_train = DataLoader(train_dataset, batch_size=4, shuffle=True)
dataloader_eval = DataLoader(eval_dataset, batch_size=4, shuffle=False)
# 4. Minimal Config (usually from argparse)
config = {
    "output_path": "./trainer_quickstart_output",
    "epochs": 2,
    "start_by_epochs": True,
    "print_step": 1,
    "save_step": 1,
    "eval_step": 1,
}
# Ensure output path exists for trainer to save checkpoints/logs
os.makedirs(config["output_path"], exist_ok=True)
# 5. Initialize and run Trainer
trainer_instance = Trainer(
    config=config,
    model=model,
    optimizer=optimizer,
    criterion=criterion,
    dataloader_train=dataloader_train,
    dataloader_eval=dataloader_eval,
)
print(f"Starting training for {config['epochs']} epochs...")
trainer_instance.train_loop()
print("Training finished.")
# Output files will be created in ./trainer_quickstart_output
# In a real application, you might add cleanup or more complex logging.
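For orientation, a sketch of the plain-PyTorch loop that the quickstart delegates to the library; running something like this directly is a quick way to sanity-check a model and criterion outside `trainer`:

```python
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

# Bare-bones equivalent of the epoch loop the library manages for you.
X, y = torch.randn(32, 10), torch.randn(32, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=4, shuffle=True)

model = nn.Linear(10, 1)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

for epoch in range(2):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()       # accumulate gradients
        optimizer.step()      # apply the Adam update
```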