{"id":10307,"library":"trainer","title":"Trainer (Coqui-AI)","description":"Trainer by Coqui-AI is a general-purpose model trainer for PyTorch, designed to be flexible across deep learning tasks. It wraps common training patterns, including distributed training via Hugging Face Accelerate, making it suitable for both quick experimentation and larger-scale projects. The library is under active development (v0.0.36), with frequent micro-releases fixing bugs and adding features.","status":"active","version":"0.0.36","language":"en","source_language":"en","source_url":"https://github.com/coqui-ai/Trainer","tags":["pytorch","deep-learning","training","coqui","accelerate","machine-learning"],"install":[{"cmd":"pip install trainer","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Core deep learning framework.","package":"torch","optional":false},{"reason":"Enables distributed training and mixed precision.","package":"accelerate","optional":false},{"reason":"Scientific computing utilities, used in data processing and metrics.","package":"scipy","optional":false}],"imports":[{"note":"The primary class for managing training loops is named `Trainer` and should be imported directly from the `trainer` package, not accessed as `trainer.Trainer` after a bare `import trainer`.","wrong":"import trainer","symbol":"Trainer","correct":"from trainer import Trainer"}],"quickstart":{"code":"import os\n\nimport torch\nfrom torch import nn\nfrom torch.utils.data import DataLoader, Dataset\n\nfrom trainer import Trainer, TrainerArgs, TrainerConfig, TrainerModel\n\n# 1. Dummy dataset\nclass DummyDataset(Dataset):\n    def __init__(self, num_samples=100, input_dim=10, output_dim=1):\n        self.X = torch.randn(num_samples, input_dim)\n        self.y = torch.randn(num_samples, output_dim)\n    def __len__(self):\n        return len(self.X)\n    def __getitem__(self, idx):\n        return self.X[idx], self.y[idx]\n\n# 2. Model: subclass TrainerModel and implement the hooks the training loop calls\nclass DummyModel(TrainerModel):\n    def __init__(self, input_dim=10, output_dim=1):\n        super().__init__()\n        self.linear = nn.Linear(input_dim, output_dim)\n    def forward(self, x):\n        return self.linear(x)\n    def train_step(self, batch, criterion):\n        x, y = batch\n        outputs = self(x)\n        loss = criterion(outputs, y)\n        return {\"model_outputs\": outputs}, {\"loss\": loss}\n    def eval_step(self, batch, criterion):\n        return self.train_step(batch, criterion)\n    @staticmethod\n    def get_criterion():\n        return nn.MSELoss()\n    def get_data_loader(self, config, assets, is_eval, samples, verbose, num_gpus, rank=0):\n        dataset = DummyDataset(num_samples=20 if is_eval else 100)\n        return DataLoader(dataset, batch_size=config.batch_size, shuffle=not is_eval)\n\n# 3. Config: a TrainerConfig (Coqpit dataclass), not a plain dict\nconfig = TrainerConfig(\n    output_path=\"./trainer_quickstart_output\",\n    epochs=2,\n    batch_size=4,\n    print_step=1,\n    save_step=10,\n    lr=0.001,\n    optimizer=\"Adam\",\n)\nos.makedirs(config.output_path, exist_ok=True)\n\n# 4. Initialize and run the Trainer\ntrainer = Trainer(\n    TrainerArgs(),\n    config,\n    config.output_path,\n    model=DummyModel(),\n)\ntrainer.fit()\n\n# Checkpoints and logs are written under ./trainer_quickstart_output","lang":"python","description":"This quickstart sets up a dummy dataset and a minimal linear model, then runs a short training loop. It follows the pattern of the library's bundled MNIST example: the model subclasses `TrainerModel` and implements `train_step`, `eval_step`, `get_criterion`, and `get_data_loader`; the config is a `TrainerConfig` (a Coqpit dataclass) rather than a plain dict; and training is started with `Trainer(TrainerArgs(), config, output_path, model=model)` followed by `trainer.fit()`. Hook signatures and config fields can shift between pre-1.0 releases, so compare against the examples shipped with your installed version."},"warnings":[{"fix":"Always test checkpointing and resumption thoroughly after updating the library. Refer to the release notes for bug fixes related to `continue_path` and `save_best_model` in your target version.","message":"The `continue_path` (resuming training from checkpoints) and `save_best_model` functionalities underwent several reverts and fixes across v0.0.33, v0.0.34, and v0.0.35, indicating instability and breaking changes in how checkpoints are saved and resumed.","severity":"breaking","affected_versions":">=0.0.33, <0.0.36"},{"fix":"Pin your `trainer` dependency to an exact version (`trainer==0.0.36`) in production environments and review GitHub releases/changelogs carefully before upgrading. Maintain robust integration tests for your training pipelines.","message":"As a pre-1.0 library (currently v0.0.36), the API may evolve rapidly. Methods, arguments, or configurations can change without extensive deprecation warnings, leading to unexpected errors on minor version updates.","severity":"gotcha","affected_versions":"<1.0.0"},{"fix":"Consult the `accelerate` documentation and `trainer`'s examples for distributed training. Ensure your environment is correctly configured for distributed processes and that all ranks initialize correctly, especially on first setup.","message":"Distributed training setups, which leverage `accelerate`, can be complex. Issues such as 'distribute rank initialization' were fixed in v0.0.32, suggesting that multi-GPU or distributed configurations may require careful setup and debugging.","severity":"gotcha","affected_versions":"<0.0.33"}],"env_vars":null,"last_verified":"2026-04-17T00:00:00.000Z","next_check":"2026-07-16T00:00:00.000Z","problems":[{"fix":"Ensure the library is installed: `pip install trainer`. Verify you are running your script in the Python environment where `trainer` is installed.","cause":"The 'trainer' library is not installed, or the current Python environment does not have it.","error":"ModuleNotFoundError: No module named 'trainer'"},{"fix":"Pass a Coqpit-based config such as `TrainerConfig` (or a subclass) rather than a plain dict, and ensure required fields like `output_path` and `epochs` are set. Refer to the library's examples for a minimal working configuration.","cause":"The config passed to `Trainer` is a plain dict or is missing fields the training loop expects, such as `output_path` or `epochs`.","error":"KeyError: 'output_path' (or similar for config keys)"},{"fix":"If not using `accelerate` for multi-device handling, ensure your model and data are explicitly moved to the target device (e.g., `model.to('cuda')`, `data.to('cuda')`). If `accelerate` is in use, verify its configuration and avoid manual device placement that might override its logic.","cause":"A common PyTorch error indicating that the model, data, and loss function are not all on the same computational device (e.g., CPU vs. CUDA). While `trainer` attempts to manage placement, manual device moves or inconsistent data loading can cause conflicts.","error":"RuntimeError: Expected all tensors to be on the same device, but found at least two devices (cpu and cuda:0)!"}]}