OneCycleLR

class braintools.optim.OneCycleLR(max_lr=0.01, total_steps=None, epochs=None, steps_per_epoch=None, pct_start=0.3, anneal_strategy='cos', div_factor=25.0, final_div_factor=10000.0, last_epoch=0)

One cycle learning rate scheduler - Super-convergence training policy.

OneCycleLR implements the 1cycle learning rate policy, which can enable super-convergence: training neural networks up to an order of magnitude faster than with standard schedules. The policy consists of two phases: the learning rate first increases from a low initial value to a maximum, then decreases to a value much lower than the initial one. It is typically combined with momentum scheduling in the opposite direction.

Parameters:

max_lr (float | List[float]) – Upper learning rate boundaries in the cycle. This is the peak learning rate that will be reached during training. Can be a single float or list for multiple parameter groups. Default: 1e-2.
total_steps (int | None) – The total number of steps (batches) in the cycle. Either this or the combination of epochs and steps_per_epoch must be provided.
epochs (int | None) – The number of epochs to train for. Used with steps_per_epoch to calculate total_steps if total_steps is not provided.
steps_per_epoch (int | None) – The number of steps (batches) per epoch. Used with epochs to calculate total_steps if total_steps is not provided.
pct_start (float) – The fraction of the cycle (between 0 and 1) spent increasing the learning rate. Default: 0.3 (30% of the cycle for warmup).
anneal_strategy (str) – Specifies the annealing strategy, one of {'cos', 'linear'}:
- 'cos': Cosine annealing from max_lr to final_lr
- 'linear': Linear annealing from max_lr to final_lr
Default: 'cos'.
div_factor (float) – Determines the initial learning rate via initial_lr = max_lr / div_factor. Default: 25.0.
final_div_factor (float) – Determines the final learning rate via final_lr = max_lr / final_div_factor. Default: 1e4.
last_epoch (int) – The index of the last batch. Used when resuming training. Default: 0.
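
The way total_steps, epochs, steps_per_epoch, div_factor, and final_div_factor relate can be sketched in plain Python. This is an illustrative helper, not part of braintools: `resolve_total_steps` is a hypothetical name, and raising an error when neither form is supplied is an assumption about reasonable behavior.

```python
from typing import Optional


def resolve_total_steps(
    total_steps: Optional[int] = None,
    epochs: Optional[int] = None,
    steps_per_epoch: Optional[int] = None,
) -> int:
    """Hypothetical helper: derive the cycle length from the arguments."""
    if total_steps is not None:
        # An explicit total_steps is used as-is.
        return total_steps
    if epochs is not None and steps_per_epoch is not None:
        # Otherwise the cycle length is epochs * steps_per_epoch.
        return epochs * steps_per_epoch
    # Assumption: an incomplete specification is treated as an error.
    raise ValueError("Provide total_steps, or both epochs and steps_per_epoch.")


# Defaults from the signature above.
max_lr, div_factor, final_div_factor = 0.01, 25.0, 10000.0

n_steps = resolve_total_steps(epochs=5, steps_per_epoch=100)
initial_lr = max_lr / div_factor        # learning rate at the start of warmup
final_lr = max_lr / final_div_factor    # learning rate at the end of annealing

print(f"total_steps={n_steps}, initial_lr={initial_lr:.2e}, final_lr={final_lr:.2e}")
```

With the default factors, the cycle starts at 1/25 of max_lr and ends at 1/10000 of max_lr.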

Notes

Three Phases of OneCycleLR:

1. Warmup phase (0 to pct_start):
   - LR increases from initial_lr to max_lr
   - Allows gradients to stabilize
2. Annealing phase (pct_start to 1.0):
   - LR decreases from max_lr to final_lr
   - Uses cosine or linear annealing
3. Final phase (optional extension):
   - LR stays at final_lr for additional training

Mathematical Formulation:

Initial learning rate:

\[\text{initial_lr} = \frac{\text{max_lr}}{\text{div_factor}}\]

Final learning rate:

\[\text{final_lr} = \frac{\text{max_lr}}{\text{final_div_factor}}\]
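
To make the formulation concrete, here is a minimal pure-Python sketch of the schedule as a function of the step index. It is not the braintools implementation. It assumes that the warmup phase uses the same interpolation (cosine or linear) as the annealing phase, that 0 <= pct_start < 1, and that steps beyond total_steps simply hold final_lr; `one_cycle_lr` and `interpolate` are hypothetical names.

```python
import math


def one_cycle_lr(step, total_steps, max_lr=0.01, pct_start=0.3,
                 anneal_strategy="cos", div_factor=25.0, final_div_factor=1e4):
    """Illustrative 1cycle learning rate at a given step (a sketch only)."""
    initial_lr = max_lr / div_factor
    final_lr = max_lr / final_div_factor
    warmup_steps = pct_start * total_steps

    def interpolate(start, end, frac):
        # Clamp so that steps past the end of the cycle stay at `end`.
        frac = min(max(frac, 0.0), 1.0)
        if anneal_strategy == "cos":
            return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * frac))
        if anneal_strategy == "linear":
            return start + (end - start) * frac
        raise ValueError(f"unknown anneal_strategy: {anneal_strategy!r}")

    if step < warmup_steps:
        # Warmup: initial_lr -> max_lr
        return interpolate(initial_lr, max_lr, step / warmup_steps)
    # Annealing: max_lr -> final_lr
    return interpolate(max_lr, final_lr,
                       (step - warmup_steps) / (total_steps - warmup_steps))


for s in (0, 150, 300, 650, 1000):
    print(s, f"{one_cycle_lr(s, 1000, max_lr=0.1):.6f}")
```

Under these assumptions the rate starts at max_lr / div_factor, peaks at max_lr when the warmup ends, and finishes at max_lr / final_div_factor.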

Super-Convergence Benefits:

Faster training: In favorable cases, reach the same accuracy in roughly 1/10th of the epochs
Better generalization: Often achieves better final accuracy
Regularization effect: High LR acts as regularization
Simpler hyperparameter tuning: Mainly need to find max_lr

Finding Optimal max_lr:

Use the LR range test:
1. Start with a very small LR
2. Gradually increase the LR each batch
3. Plot loss vs. LR
4. Choose max_lr slightly less than where the loss starts increasing
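
The four steps above can be sketched without any training framework. The helpers below are hypothetical: `exponential_lr_sweep` assumes geometric growth of the LR (the text only says "gradually"), and `suggest_max_lr` takes "slightly less" to mean a fixed `safety` factor applied to the LR at the minimum of the smoothed loss. The loss curve in the demo is synthetic and only stands in for measurements from a real sweep.

```python
import math


def exponential_lr_sweep(init_lr=1e-7, final_lr=10.0, num_iter=100):
    """Learning rates growing geometrically from init_lr to final_lr."""
    ratio = (final_lr / init_lr) ** (1.0 / (num_iter - 1))
    return [init_lr * ratio ** i for i in range(num_iter)]


def suggest_max_lr(lrs, losses, window=5, safety=0.5):
    """Pick an LR somewhat below the point where the smoothed loss turns upward."""
    if len(lrs) != len(losses) or len(losses) < window:
        raise ValueError("lrs and losses must have equal length >= window")
    # Moving average to suppress batch-to-batch noise.
    smoothed = [sum(losses[i:i + window]) / window
                for i in range(len(losses) - window + 1)]
    best = min(range(len(smoothed)), key=smoothed.__getitem__)
    # Map the window back to the LR at its center.
    lr_at_min = lrs[best + window // 2]
    # Assumption: back off by a fixed factor to stay on the stable side.
    return safety * lr_at_min


lrs = exponential_lr_sweep()
# Synthetic loss curve with its minimum near lr = 0.1 (not real training data).
losses = [(math.log10(lr) + 1.0) ** 2 for lr in lrs]
suggestion = suggest_max_lr(lrs, losses)
print(f"Suggested max_lr: {suggestion:.4f}")
```

In practice, `losses` would hold the per-batch training loss recorded while stepping through `lrs`, and the plot from step 3 remains a useful sanity check on the automatic choice.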

Momentum Scheduling:

OneCycleLR works best with momentum scheduling in the opposite direction:
- When LR increases, momentum decreases
- When LR decreases, momentum increases
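
As a rough illustration, the inverse momentum cycle can be written as a pure function of the step index. This sketch assumes linear interpolation in both phases and uses 0.85 and 0.95 as example momentum bounds; `cycled_momentum` is a hypothetical name, not a braintools function.

```python
def cycled_momentum(step, total_steps, pct_start=0.3,
                    base_momentum=0.85, max_momentum=0.95):
    """Momentum that moves opposite to the learning rate (a sketch only)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    span = max_momentum - base_momentum
    if progress < pct_start:
        # LR is rising, so momentum falls from max_momentum to base_momentum.
        return max_momentum - span * progress / pct_start
    # LR is falling, so momentum rises back toward max_momentum.
    return base_momentum + span * (progress - pct_start) / (1.0 - pct_start)


for s in (0, 150, 300, 650, 1000):
    print(s, f"{cycled_momentum(s, 1000):.3f}")
```

Momentum is lowest exactly where the learning rate peaks, which is the point of cycling the two in opposite directions.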

Examples

Basic usage with super-convergence:

>>> import braintools
>>> import brainstate
>>>
>>> # Training for 5 epochs with 100 batches per epoch
>>> scheduler = braintools.optim.OneCycleLR(
...     max_lr=0.1,
...     epochs=5,
...     steps_per_epoch=100
... )
>>> optimizer = braintools.optim.SGD(lr=scheduler, momentum=0.9)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))
>>>
>>> for epoch in range(5):
...     for batch in train_loader:
...         train_step(batch)
...         scheduler.step()

With total steps specification:

>>> # Specify total training steps directly
>>> total_training_steps = 10000
>>> scheduler = braintools.optim.OneCycleLR(
...     max_lr=0.3,
...     total_steps=total_training_steps,
...     pct_start=0.3,  # 30% warmup
...     anneal_strategy='cos'
... )

Custom phase percentages:

>>> # Longer warmup phase (40% of training)
>>> scheduler = braintools.optim.OneCycleLR(
...     max_lr=0.1,
...     total_steps=5000,
...     pct_start=0.4,  # 40% for warmup
...     div_factor=10,  # Start from 0.01
...     final_div_factor=100  # End at 0.001
... )

For different model sizes:

>>> # Small model/dataset - conservative settings
>>> scheduler_small = braintools.optim.OneCycleLR(
...     max_lr=0.01,
...     total_steps=1000,
...     pct_start=0.3,
...     div_factor=25,
...     final_div_factor=1000
... )
>>>
>>> # Large model - aggressive settings for super-convergence
>>> scheduler_large = braintools.optim.OneCycleLR(
...     max_lr=1.0,  # Very high max_lr
...     total_steps=10000,
...     pct_start=0.2,  # Shorter warmup
...     div_factor=25,
...     final_div_factor=10000
... )

With momentum cycling (recommended):

>>> class OneCycleOptimizer:
...     def __init__(self, model, max_lr=0.1, total_steps=1000):
...         self.scheduler = braintools.optim.OneCycleLR(
...             max_lr=max_lr,
...             total_steps=total_steps
...         )
...         self.base_momentum = 0.85
...         self.max_momentum = 0.95
...         self.optimizer = braintools.optim.SGD(
...             lr=self.scheduler,
...             momentum=self.max_momentum
...         )
...
...     def step(self, grads):
...         # Update learning rate
...         self.scheduler.step()
...
...         # Cycle momentum in opposite direction
...         pct_complete = self.scheduler.last_epoch / self.scheduler.total_steps
...         if pct_complete < self.scheduler.pct_start:
...             # LR increasing, momentum decreasing
...             momentum = self.max_momentum - (self.max_momentum - self.base_momentum) * pct_complete / self.scheduler.pct_start
...         else:
...             # LR decreasing, momentum increasing
...             momentum = self.base_momentum + (self.max_momentum - self.base_momentum) * (pct_complete - self.scheduler.pct_start) / (1 - self.scheduler.pct_start)
...
...         self.optimizer.momentum = momentum
...         self.optimizer.update(grads)

LR range test for finding max_lr:

>>> def find_max_lr(model, data_loader, init_lr=1e-7, final_lr=10, num_iter=100):
...     '''Find optimal max_lr using LR range test'''
...     scheduler = braintools.optim.OneCycleLR(
...         max_lr=final_lr,
...         total_steps=num_iter,
...         div_factor=final_lr/init_lr,
...         final_div_factor=1.0,  # Don't decrease at end
...         pct_start=0.99  # Almost entirely increasing (must be < 1.0)
...     )
...     optimizer = braintools.optim.SGD(lr=scheduler, momentum=0.9)
...
...     lrs, losses = [], []
...     for i, batch in enumerate(data_loader):
...         if i >= num_iter:
...             break
...
...         loss = compute_loss(model, batch)
...         grads = compute_gradients(loss)
...         optimizer.update(grads)
...
...         lrs.append(scheduler.get_lr()[0])
...         losses.append(loss.item())
...         scheduler.step()
...
...     # Find LR where loss stops decreasing
...     import numpy as np
...     smooth_losses = np.convolve(losses, np.ones(5)/5, mode='valid')
...     max_lr_idx = np.argmin(smooth_losses) + len(losses) - len(smooth_losses)
...     suggested_max_lr = lrs[max_lr_idx]
...     print(f"Suggested max_lr: {suggested_max_lr}")
...
...     return lrs, losses

Transfer learning with OneCycle:

>>> # Fine-tuning pretrained model
>>> scheduler = braintools.optim.OneCycleLR(
...     max_lr=0.001,  # Lower max_lr for fine-tuning
...     total_steps=2000,
...     pct_start=0.1,  # Short warmup
...     div_factor=100,  # Very low initial LR
...     final_div_factor=1000
... )
>>>
>>> # Freeze early layers initially
>>> # (pseudocode with a PyTorch-style requires_grad flag; adapt to your model)
>>> for param in model.early_layers.parameters():
...     param.requires_grad = False
>>>
>>> # Unfreeze once the warmup phase is over; `step` is the global batch index
>>> def unfreeze_callback(step):
...     if step > scheduler.total_steps * scheduler.pct_start:
...         for param in model.early_layers.parameters():
...             param.requires_grad = True

Different annealing strategies:

>>> # Cosine annealing (smoother)
>>> scheduler_cos = braintools.optim.OneCycleLR(
...     max_lr=0.1,
...     total_steps=1000,
...     anneal_strategy='cos'
... )
>>>
>>> # Linear annealing (more aggressive)
>>> scheduler_linear = braintools.optim.OneCycleLR(
...     max_lr=0.1,
...     total_steps=1000,
...     anneal_strategy='linear'
... )
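
To see how the two strategies differ numerically, the following sketch evaluates both interpolation formulas at a few points of the annealing phase. It uses the standard cosine and linear interpolation formulas and assumes they match what the scheduler computes; `anneal` is a hypothetical helper.

```python
import math


def anneal(start, end, frac, strategy="cos"):
    """Interpolate from start (frac=0) to end (frac=1)."""
    if strategy == "cos":
        return end + (start - end) * 0.5 * (1.0 + math.cos(math.pi * frac))
    if strategy == "linear":
        return start + (end - start) * frac
    raise ValueError(f"unknown strategy: {strategy!r}")


max_lr = 0.1
final_lr = max_lr / 1e4  # default final_div_factor

# Fraction of the annealing phase completed -> (cosine LR, linear LR)
rows = {
    frac: (anneal(max_lr, final_lr, frac, "cos"),
           anneal(max_lr, final_lr, frac, "linear"))
    for frac in (0.25, 0.5, 0.75)
}
for frac, (lr_cos, lr_lin) in rows.items():
    print(f"{frac:.2f}  cos={lr_cos:.5f}  linear={lr_lin:.5f}")
```

Under these formulas, cosine annealing keeps the rate closer to max_lr early in the phase, matches linear annealing at the midpoint, and sits below it toward the end.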

Monitoring training progress:

>>> scheduler = braintools.optim.OneCycleLR(
...     max_lr=0.1,
...     total_steps=1000,
...     pct_start=0.3
... )
>>>
>>> for step in range(1000):
...     train_step(...)
...     scheduler.step()
...
...     if step % 100 == 0:
...         phase = "warmup" if step < 300 else "annealing"
...         lr = scheduler.get_lr()[0]
...         progress = step / 1000 * 100
...         print(f"Step {step} ({progress:.1f}%): {phase} phase, LR={lr:.6f}")

Multiple parameter groups:

>>> # Different max_lr for different layers
>>> scheduler = braintools.optim.OneCycleLR(
...     max_lr=[0.001, 0.01],  # Lower for pretrained, higher for new layers
...     total_steps=1000,
...     pct_start=0.3
... )
