{"id":7702,"library":"schedulefree","title":"Schedule-Free Optimization in PyTorch","description":"Schedule-Free is a PyTorch library that provides optimizers designed for 'schedule-free' learning, eliminating the need for traditional learning rate schedules. It aims to achieve faster training times without requiring users to specify the stopping time or steps in advance. The library, currently at version 1.4.1, offers variants of popular optimizers like SGD, AdamW, and RAdam, and is actively maintained by Facebook Research.","status":"active","version":"1.4.1","language":"en","source_language":"en","source_url":"https://github.com/facebookresearch/schedule_free","tags":["pytorch","optimizer","deep-learning","machine-learning","training"],"install":[{"cmd":"pip install schedulefree","lang":"bash","label":"PyPI"}],"dependencies":[{"reason":"Core deep learning framework for which schedulefree provides optimizers.","package":"torch"}],"imports":[{"symbol":"SGDScheduleFree","correct":"from schedulefree import SGDScheduleFree"},{"symbol":"AdamWScheduleFree","correct":"from schedulefree import AdamWScheduleFree"},{"symbol":"RAdamScheduleFree","correct":"from schedulefree import RAdamScheduleFree"},{"note":"For wrapping existing PyTorch optimizers to make them schedule-free.","symbol":"ScheduleFreeWrapper","correct":"from schedulefree import ScheduleFreeWrapper"}],"quickstart":{"code":"import torch\nimport torch.nn as nn\nfrom schedulefree import AdamWScheduleFree\n\n# 1. Define a simple model\nclass SimpleModel(nn.Module):\n    def __init__(self):\n        super().__init__()\n        self.linear = nn.Linear(10, 1)\n\n    def forward(self, x):\n        return self.linear(x)\n\nmodel = SimpleModel()\n\n# 2. Define a Schedule-Free optimizer (e.g., AdamWScheduleFree)\n# Note: Schedule-Free optimizers often benefit from higher learning rates than traditional ones.\noptimizer = AdamWScheduleFree(model.parameters(), lr=1e-3, warmup_steps=100)\n\n# 3. 
Define a loss function\ncriterion = nn.MSELoss()\n\n# 4. Dummy data for demonstration\ninputs = torch.randn(64, 10)\ntargets = torch.randn(64, 1)\n\n# 5. Training loop (simplified)\nnum_epochs = 10\nfor epoch in range(num_epochs):\n    model.train() # Standard PyTorch model training mode\n    optimizer.train() # REQUIRED for Schedule-Free optimizers\n\n    # Forward pass\n    outputs = model(inputs)\n    loss = criterion(outputs, targets)\n\n    # Backward and optimize\n    optimizer.zero_grad()\n    loss.backward()\n    optimizer.step()\n\n    if (epoch + 1) % 2 == 0:\n        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')\n\n    # Evaluation phase (e.g., for validation or checkpointing)\n    model.eval() # Standard PyTorch model evaluation mode\n    optimizer.eval() # REQUIRED for Schedule-Free optimizers before evaluation/checkpointing\n    with torch.no_grad():\n        val_inputs = torch.randn(16, 10)\n        val_targets = torch.randn(16, 1)\n        val_outputs = model(val_inputs)\n        val_loss = criterion(val_outputs, val_targets)\n        # In a real scenario, you'd calculate metrics on val_outputs and val_targets\n\nprint(\"Training complete.\")\n","lang":"python","description":"This quickstart demonstrates how to integrate `AdamWScheduleFree` into a basic PyTorch training loop. Key steps include initializing the optimizer, performing forward/backward passes, and crucially, calling `optimizer.train()` and `optimizer.eval()` alongside `model.train()` and `model.eval()` for correct parameter buffer handling during training and evaluation/checkpointing."},"warnings":[{"fix":"Ensure `optimizer.train()` is called before the training step and `optimizer.eval()` before any evaluation or checkpoint saving. 
Alternatively, use the closure variants (e.g., `AdamWScheduleFreeClosure`, `SGDScheduleFreeClosure`) if your code supports PyTorch optimizer step closures, as these do not require explicit `train()`/`eval()` calls.","message":"Schedule-Free optimizers require explicit calls to `optimizer.train()` and `optimizer.eval()` to manage internal parameter buffers correctly during training and evaluation phases, respectively. Forgetting these calls can lead to incorrect updates or runtime errors.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If replicating results from versions prior to 1.3, use `AdamWScheduleFreePaper`, which retains the older weight decay implementation.","message":"In version 1.3, the behavior of weight decay during learning rate warmup was changed to improve stability and consistency with standard `AdamW` in PyTorch.","severity":"breaking","affected_versions":">=1.3"},{"fix":"Consult the official documentation or examples for specific guidance on handling BatchNorm layers with Schedule-Free optimizers. Using PreciseBN to recompute batch statistics after training is also suggested to avoid this issue.","message":"If your model uses BatchNorm layers, additional modifications are needed for test/val evaluations to work correctly. This is because batch statistics need to be computed from the `x` sequence, not the `y` sequence.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Experiment with `beta` values like `0.95` or `0.98`, especially for extended training sessions, if initial results are suboptimal.","message":"Training with Schedule-Free optimizers can be more sensitive to the choice of the `beta` parameter than with standard momentum. 
While the default `0.9` works for many problems, increasing it to `0.95` or `0.98` might be necessary for very long training runs to achieve optimal performance.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Begin hyperparameter tuning with higher learning rates for Schedule-Free optimizers compared to what you would use for their traditional counterparts.","message":"The optimal learning rates for Schedule-Free optimizers are typically higher than those used with schedule-based approaches. For SGD, a learning rate 10x-50x larger might be a good starting point, while for AdamW, 1x-10x larger rates are often effective.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Add `optimizer.train()` at the beginning of your training epoch or step, similar to how `model.train()` is used.","cause":"The `optimizer.train()` method was not called before `optimizer.step()` during the training loop.","error":"Exception: Optimizer was not in train mode when step is called. optimizer.train() must be called before optimizer.step(). See documentation for details."},{"fix":"Implement explicit handling for BatchNorm layers during evaluation to ensure statistics are correctly updated from the `x` sequence, or use `PreciseBN` if applicable.","cause":"Batch normalization statistics during evaluation are being computed from the intermediate `y` sequence of the optimizer instead of the final `x` sequence, leading to discrepancies.","error":"Incorrect or inconsistent evaluation results, especially with models using BatchNorm layers."},{"fix":"Tune the learning rate (often higher than for traditional optimizers, e.g., 10x-50x for SGD, 1x-10x for AdamW) and regularization parameters. 
Consider increasing the `beta` value for very long training runs (e.g., to 0.95 or 0.98).","cause":"Despite being 'schedule-free', the library still requires careful tuning of other hyperparameters like the initial learning rate and regularization, and potentially the `beta` parameter. Starting with default values without adjustment may yield poor results.","error":"Suboptimal convergence or performance compared to traditional optimizers with well-tuned learning rate schedules."}]}