OneCycleLR
- class braintools.optim.OneCycleLR(max_lr=0.01, total_steps=None, epochs=None, steps_per_epoch=None, pct_start=0.3, anneal_strategy='cos', div_factor=25.0, final_div_factor=10000.0, last_epoch=0)
One cycle learning rate scheduler - Super-convergence training policy.
OneCycleLR implements the 1cycle learning rate policy, which enables super-convergence: training neural networks up to an order of magnitude faster than with standard methods. The policy has two main phases: the learning rate first increases from a low initial value to a maximum, then decreases to a value much lower than the initial one. This is typically combined with momentum scheduling in the opposite direction.
- Parameters:
max_lr (float | List[float]) – Upper learning rate boundary of the cycle. This is the peak learning rate reached during training. Can be a single float, or a list for multiple parameter groups. Default: 1e-2.
total_steps (int | None) – The total number of steps (batches) in the cycle. Either this or the combination of epochs and steps_per_epoch must be provided.
epochs (int | None) – The number of epochs to train for. Used with steps_per_epoch to calculate total_steps if total_steps is not provided.
steps_per_epoch (int | None) – The number of steps (batches) per epoch. Used with epochs to calculate total_steps if total_steps is not provided.
pct_start (float) – The fraction of the cycle (between 0 and 1) spent increasing the learning rate. Default: 0.3 (30% of the cycle for warmup).
anneal_strategy (str) – {'cos', 'linear'}. Specifies the annealing strategy:
'cos': Cosine annealing from max_lr to final_lr
'linear': Linear annealing from max_lr to final_lr
Default: 'cos'.
div_factor (float) – Determines the initial learning rate via initial_lr = max_lr / div_factor. Default: 25.0.
final_div_factor (float) – Determines the final learning rate via final_lr = max_lr / final_div_factor. Default: 1e4.
last_epoch (int) – The index of the last batch. Used when resuming training. Default: 0.
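The rule for choosing the cycle length can be sketched in a few lines. This is a minimal illustration, not the library's code: `resolve_total_steps` is a hypothetical helper, and the error handling is an assumption. Only the precedence (use `total_steps` if given, otherwise `epochs * steps_per_epoch`) comes from the parameter descriptions above.

```python
from typing import Optional


def resolve_total_steps(
    total_steps: Optional[int] = None,
    epochs: Optional[int] = None,
    steps_per_epoch: Optional[int] = None,
) -> int:
    """Hypothetical helper: derive the cycle length from the documented arguments."""
    if total_steps is not None:
        # An explicit total_steps is used as-is.
        if total_steps <= 0:
            raise ValueError("total_steps must be a positive integer")
        return total_steps
    if epochs is None or steps_per_epoch is None:
        raise ValueError("Provide total_steps, or both epochs and steps_per_epoch")
    if epochs <= 0 or steps_per_epoch <= 0:
        raise ValueError("epochs and steps_per_epoch must be positive integers")
    # Otherwise the cycle spans every batch of every epoch.
    return epochs * steps_per_epoch


print(resolve_total_steps(epochs=5, steps_per_epoch=100))
```

With 5 epochs of 100 batches each, the cycle covers 500 scheduler steps.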
Notes
Three Phases of OneCycleLR:
Warmup phase (0 to pct_start):
- LR increases from initial_lr to max_lr
- Allows gradients to stabilize
Annealing phase (pct_start to 1.0):
- LR decreases from max_lr to final_lr
- Uses cosine or linear annealing
Final phase (optional extension):
- LR stays at final_lr for additional training
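The phases above can be sketched as a pure function of the step index. This is an illustrative reimplementation under stated assumptions, not the braintools source: the function name `one_cycle_lr` is hypothetical, the warmup is assumed to use the same interpolation as `anneal_strategy`, and steps past `total_steps` are assumed to hold at `final_lr`.

```python
import math


def one_cycle_lr(step, total_steps, max_lr=0.01, pct_start=0.3,
                 anneal_strategy="cos", div_factor=25.0, final_div_factor=1e4):
    """Illustrative one-cycle schedule (a sketch, not the library implementation)."""
    initial_lr = max_lr / div_factor
    final_lr = max_lr / final_div_factor
    warmup_steps = pct_start * total_steps

    def interpolate(start, end, frac):
        # frac runs from 0 (at start) to 1 (at end).
        if anneal_strategy == "cos":
            return end + (start - end) * (1.0 + math.cos(math.pi * frac)) / 2.0
        if anneal_strategy == "linear":
            return start + (end - start) * frac
        raise ValueError(f"unknown anneal_strategy: {anneal_strategy!r}")

    # Assumption: steps beyond the cycle stay at the end-of-cycle value.
    step = min(max(step, 0), total_steps)

    if step < warmup_steps:
        # Warmup phase: initial_lr -> max_lr.
        return interpolate(initial_lr, max_lr, step / warmup_steps)

    anneal_steps = total_steps - warmup_steps
    if anneal_steps <= 0:
        # pct_start == 1.0 leaves no annealing phase.
        return max_lr
    # Annealing phase: max_lr -> final_lr.
    return interpolate(max_lr, final_lr, (step - warmup_steps) / anneal_steps)


for s in (0, 150, 300, 650, 1000):
    print(s, one_cycle_lr(s, total_steps=1000, max_lr=0.1))
```

Under these assumptions the schedule starts at max_lr / div_factor, peaks at max_lr when step equals pct_start * total_steps, and ends at max_lr / final_div_factor.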
Mathematical Formulation:
Initial learning rate:
\[\text{initial_lr} = \frac{\text{max_lr}}{\text{div_factor}}\]
Final learning rate:
\[\text{final_lr} = \frac{\text{max_lr}}{\text{final_div_factor}}\]
Super-Convergence Benefits:
Up to 10x faster training: Can reach the same accuracy in as little as 1/10th the epochs
Better generalization: Often achieves better final accuracy
Regularization effect: High LR acts as regularization
Simpler hyperparameter tuning: Mainly need to find max_lr
Finding Optimal max_lr:
Use the LR range test:
1. Start with very small LR
2. Gradually increase LR each batch
3. Plot loss vs LR
4. Choose max_lr slightly less than where loss starts increasing
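The range test steps can be tried on a toy problem without any training framework. The sketch below is an assumption-laden illustration: it minimizes the one-dimensional loss 0.5 * curvature * w**2 with plain gradient descent, sweeps the LR geometrically, and backs off from the turning point by a hypothetical `safety_factor`. The names `lr_range_test` and `suggest_max_lr` are made up for this example, and plotting is left out.

```python
def lr_range_test(curvature=10.0, init_lr=1e-4, final_lr=10.0, num_iter=50):
    """Toy LR range test on the loss 0.5 * curvature * w**2 (illustrative only)."""
    # Geometric sweep so that the last LR equals final_lr.
    ratio = (final_lr / init_lr) ** (1.0 / (num_iter - 1))
    w = 1.0
    lrs, losses = [], []
    for i in range(num_iter):
        lr = init_lr * ratio ** i
        # Record the loss seen at this step, then update with the current LR.
        lrs.append(lr)
        losses.append(0.5 * curvature * w * w)
        grad = curvature * w
        w -= lr * grad
    return lrs, losses


def suggest_max_lr(lrs, losses, safety_factor=0.5):
    """Pick the LR at the lowest recorded loss, scaled down by a safety factor."""
    best = min(range(len(losses)), key=losses.__getitem__)
    return lrs[best] * safety_factor


lrs, losses = lr_range_test()
print(f"suggested max_lr: {suggest_max_lr(lrs, losses):.4f}")
```

On this quadratic, gradient descent diverges once the LR exceeds 2 / curvature, so the recorded loss falls until the sweep crosses that threshold and rises afterwards. Real loss curves are noisier, which is why smoothing is usually applied before picking the turning point.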
Momentum Scheduling:
OneCycleLR works best with momentum scheduling in the opposite direction:
- When LR increases, momentum decreases
- When LR decreases, momentum increases
Examples
Basic usage with super-convergence:
>>> import braintools
>>> import brainstate
>>>
>>> # Training for 5 epochs with 100 batches per epoch
>>> scheduler = braintools.optim.OneCycleLR(
...     max_lr=0.1,
...     epochs=5,
...     steps_per_epoch=100
... )
>>> optimizer = braintools.optim.SGD(lr=scheduler, momentum=0.9)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))
>>>
>>> for epoch in range(5):
...     for batch in train_loader:
...         train_step(batch)
...         scheduler.step()
With total steps specification:
>>> # Specify total training steps directly
>>> total_training_steps = 10000
>>> scheduler = braintools.optim.OneCycleLR(
...     max_lr=0.3,
...     total_steps=total_training_steps,
...     pct_start=0.3,  # 30% warmup
...     anneal_strategy='cos'
... )
Custom phase percentages:
>>> # Longer warmup phase (40% of training)
>>> scheduler = braintools.optim.OneCycleLR(
...     max_lr=0.1,
...     total_steps=5000,
...     pct_start=0.4,         # 40% for warmup
...     div_factor=10,         # Start from 0.1 / 10 = 0.01
...     final_div_factor=100   # End at 0.1 / 100 = 0.001
... )
For different model sizes:
>>> # Small model/dataset - conservative settings
>>> scheduler_small = braintools.optim.OneCycleLR(
...     max_lr=0.01,
...     total_steps=1000,
...     pct_start=0.3,
...     div_factor=25,
...     final_div_factor=1000
... )
>>>
>>> # Large model - aggressive settings for super-convergence
>>> scheduler_large = braintools.optim.OneCycleLR(
...     max_lr=1.0,      # Very high max_lr
...     total_steps=10000,
...     pct_start=0.2,   # Shorter warmup
...     div_factor=25,
...     final_div_factor=10000
... )
With momentum cycling (recommended):
>>> class OneCycleOptimizer:
...     def __init__(self, model, max_lr=0.1, total_steps=1000):
...         self.scheduler = braintools.optim.OneCycleLR(
...             max_lr=max_lr,
...             total_steps=total_steps
...         )
...         self.base_momentum = 0.85
...         self.max_momentum = 0.95
...         self.optimizer = braintools.optim.SGD(
...             lr=self.scheduler,
...             momentum=self.max_momentum
...         )
...         # Register the model's trainable weights, as in the basic usage example
...         self.optimizer.register_trainable_weights(model.states(brainstate.ParamState))
...
...     def step(self, grads):
...         # Update learning rate
...         self.scheduler.step()
...
...         # Cycle momentum in the opposite direction to the learning rate
...         pct_start = self.scheduler.pct_start
...         momentum_range = self.max_momentum - self.base_momentum
...         pct_complete = self.scheduler.last_epoch / self.scheduler.total_steps
...         if pct_complete < pct_start:
...             # LR increasing, momentum decreasing
...             momentum = self.max_momentum - momentum_range * pct_complete / pct_start
...         else:
...             # LR decreasing, momentum increasing
...             momentum = self.base_momentum + momentum_range * (pct_complete - pct_start) / (1 - pct_start)
...
...         self.optimizer.momentum = momentum
...         self.optimizer.update(grads)
LR range test for finding max_lr:
>>> import numpy as np
>>>
>>> def find_max_lr(model, data_loader, init_lr=1e-7, final_lr=10, num_iter=100):
...     '''Find optimal max_lr using LR range test'''
...     scheduler = braintools.optim.OneCycleLR(
...         max_lr=final_lr,
...         total_steps=num_iter,
...         div_factor=final_lr / init_lr,
...         final_div_factor=1.0,  # Don't decrease at end
...         pct_start=1.0          # Only increase
...     )
...     optimizer = braintools.optim.SGD(lr=scheduler, momentum=0.9)
...
...     lrs, losses = [], []
...     for i, batch in enumerate(data_loader):
...         if i >= num_iter:
...             break
...
...         loss = compute_loss(model, batch)
...         grads = compute_gradients(loss)
...         optimizer.update(grads)
...
...         lrs.append(scheduler.get_lr()[0])
...         losses.append(loss.item())
...         scheduler.step()
...
...     # Find LR where loss stops decreasing (5-point moving average)
...     smooth_losses = np.convolve(losses, np.ones(5) / 5, mode='valid')
...     max_lr_idx = np.argmin(smooth_losses) + len(losses) - len(smooth_losses)
...     suggested_max_lr = lrs[max_lr_idx]
...     print(f"Suggested max_lr: {suggested_max_lr}")
...
...     return lrs, losses
Transfer learning with OneCycle:
>>> # Fine-tuning pretrained model
>>> scheduler = braintools.optim.OneCycleLR(
...     max_lr=0.001,    # Lower max_lr for fine-tuning
...     total_steps=2000,
...     pct_start=0.1,   # Short warmup
...     div_factor=100,  # Very low initial LR
...     final_div_factor=1000
... )
>>>
>>> # Freeze early layers initially
>>> # (PyTorch-style pseudocode; adapt to your model's parameter API)
>>> for param in model.early_layers.parameters():
...     param.requires_grad = False
>>>
>>> # Unfreeze after warmup; `step` counts scheduler steps (batches)
>>> def unfreeze_callback(step):
...     if step > scheduler.total_steps * scheduler.pct_start:
...         for param in model.early_layers.parameters():
...             param.requires_grad = True
Different annealing strategies:
>>> # Cosine annealing (smoother)
>>> scheduler_cos = braintools.optim.OneCycleLR(
...     max_lr=0.1,
...     total_steps=1000,
...     anneal_strategy='cos'
... )
>>>
>>> # Linear annealing (more aggressive)
>>> scheduler_linear = braintools.optim.OneCycleLR(
...     max_lr=0.1,
...     total_steps=1000,
...     anneal_strategy='linear'
... )
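To see how the two strategies differ, the decay from max_lr to final_lr can be tabulated by hand. This sketch assumes the standard cosine and linear interpolation formulas, which may differ in detail from what braintools computes internally. The helper `anneal` is hypothetical.

```python
import math


def anneal(start, end, frac, strategy="cos"):
    """Interpolate from start (frac=0) to end (frac=1); illustrative only."""
    if strategy == "cos":
        return end + (start - end) * (1.0 + math.cos(math.pi * frac)) / 2.0
    if strategy == "linear":
        return start + (end - start) * frac
    raise ValueError(f"unknown strategy: {strategy!r}")


max_lr, final_lr = 0.1, 0.1 / 1e4  # peak LR and max_lr / final_div_factor

# Tabulate both strategies at a few points of the annealing phase.
table = [
    (frac, anneal(max_lr, final_lr, frac, "cos"), anneal(max_lr, final_lr, frac, "linear"))
    for frac in (0.0, 0.25, 0.5, 0.75, 1.0)
]
for frac, lr_cos, lr_lin in table:
    print(f"{frac:.2f}  cos={lr_cos:.6f}  linear={lr_lin:.6f}")
```

Under these formulas the cosine curve stays above the linear one during the first half of annealing and falls below it in the second half, with both meeting at the midpoint and at the endpoints.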
Monitoring training progress:
>>> scheduler = braintools.optim.OneCycleLR(
...     max_lr=0.1,
...     total_steps=1000,
...     pct_start=0.3
... )
>>>
>>> for step in range(1000):
...     train_step(...)
...     scheduler.step()
...
...     if step % 100 == 0:
...         phase = "warmup" if step < 300 else "annealing"
...         lr = scheduler.get_lr()[0]
...         progress = step / 1000 * 100
...         print(f"Step {step} ({progress:.1f}%): {phase} phase, LR={lr:.6f}")
Multiple parameter groups:
>>> # Different max_lr for different layers
>>> scheduler = braintools.optim.OneCycleLR(
...     max_lr=[0.001, 0.01],  # Lower for pretrained, higher for new layers
...     total_steps=1000,
...     pct_start=0.3
... )
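When max_lr is a list, each parameter group gets its own initial, peak, and final learning rate. The sketch below assumes that the same div_factor and final_div_factor apply to every group, which the parameter descriptions suggest but do not state outright. The helper `per_group_bounds` is hypothetical.

```python
def per_group_bounds(max_lr, div_factor=25.0, final_div_factor=1e4):
    """Return (initial_lr, max_lr, final_lr) per parameter group (illustrative)."""
    # Accept a single number or a sequence, mirroring the float | List[float] type.
    if isinstance(max_lr, (int, float)):
        peaks = [float(max_lr)]
    else:
        peaks = [float(v) for v in max_lr]
    return [(p / div_factor, p, p / final_div_factor) for p in peaks]


for initial_lr, peak_lr, final_lr in per_group_bounds([0.001, 0.01]):
    print(f"initial={initial_lr:.2e}  peak={peak_lr:.2e}  final={final_lr:.2e}")
```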
See also
CyclicLR: Cyclic learning rate schedules
CosineAnnealingLR: Cosine annealing schedule
LinearLR: Linear learning rate schedule
WarmupScheduler: Simple warmup schedule
References