Fromage

class braintools.optim.Fromage(lr=1.0, momentum=0.0, grad_clip_norm=None, grad_clip_value=None)

Fromage (FRee-scale Optimal Method for Adaptive GradiEnt) optimizer.

Fromage is a learning-rate-free optimizer that adapts the step size automatically based on the curvature of the loss landscape. It removes the need for manual learning rate tuning by deriving the step size from the ratio of the current gradient norm to the norm of the change in the gradient between steps. This makes it useful for hyperparameter-free training and rapid prototyping.

The key idea of Fromage is to compute the step size from the gradient norm divided by the norm of the difference between consecutive gradients. The difference between consecutive gradients acts as a proxy for the local curvature of the loss function, so the step size adapts without explicit learning rate scheduling or tuning.

Parameters:
  • lr (float | LRScheduler) – Learning rate scale factor. Although Fromage is designed to be learning-rate-free, this parameter globally scales the automatically computed step sizes. If a float is provided, it is converted to a ConstantLR scheduler. Typically left at 1.0 to use pure automatic adaptation.

  • momentum (float) – Momentum coefficient for the first moment. When > 0, maintains an exponential moving average of gradients. Set to 0 to disable momentum and use pure gradient-based updates.

  • grad_clip_norm (float | None) – Maximum gradient norm for gradient clipping. If None, no gradient norm clipping is applied.

  • grad_clip_value (float | None) – Maximum absolute gradient value for element-wise gradient clipping. If None, no gradient value clipping is applied.
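The two clipping options can be illustrated with a minimal NumPy sketch. It assumes the conventional definitions (rescaling by the joint L2 norm of all gradients, and element-wise clamping); the helper names `clip_by_global_norm` and `clip_by_value` are hypothetical and are not part of braintools, and whether the library clips the joint norm or each parameter separately is not stated here.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients so their joint L2 norm is at most max_norm (assumed semantics)."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Only shrink: gradients that are already small enough are left unchanged.
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads]

def clip_by_value(grads, max_value):
    """Clamp every gradient element to [-max_value, max_value] (assumed semantics)."""
    return [np.clip(g, -max_value, max_value) for g in grads]

# Two hypothetical parameter gradients with joint norm sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([0.0, 12.0])]
norm_clipped = clip_by_global_norm(grads, max_norm=1.0)
value_clipped = clip_by_value(grads, max_value=2.0)
```

Norm clipping preserves the direction of the overall gradient and only shortens it, while value clipping can change the direction because each element is clamped independently.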

Notes

The Fromage update rules are:

\[
\begin{aligned}
\alpha_t &= \frac{\|G_t\|_2}{\|G_t - G_{t-1}\|_2 + \epsilon} \\
M_t &= \rho M_{t-1} + (1 - \rho) G_t \quad \text{(if momentum > 0)} \\
\theta_{t+1} &= \theta_t - \alpha_t \cdot M_t
\end{aligned}
\]

where:

  • \(G_t\) is the gradient at step t

  • \(\alpha_t\) is the automatically computed step size

  • \(\|G_t\|_2\) is the L2 norm of the current gradient

  • \(\|G_t - G_{t-1}\|_2\) is the gradient difference norm (curvature proxy)

  • \(M_t\) is the momentum buffer (optional; with \(\rho = 0\) it reduces to \(G_t\))

  • \(\rho\) is the momentum coefficient

  • \(\epsilon\) is a small constant for numerical stability

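The step size \(\alpha_t\) is intended as an approximation of \(1/L\), where \(L\) is the local Lipschitz constant of the gradient. A large change between consecutive gradients indicates high curvature and yields a smaller step; a small change yields a larger one.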
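The update rules above can be sketched in a few lines of NumPy. This is an illustration of the equations as written, not the braintools implementation: the function name `fromage_step`, the zero initialization of the previous gradient, and the choice to apply `lr` as a multiplier on \(\alpha_t\) are assumptions made for the example.

```python
import numpy as np

def fromage_step(theta, grad, prev_grad, m, lr=1.0, momentum=0.0, eps=1e-8):
    """One update following the documented equations (illustrative sketch only)."""
    # alpha_t = ||G_t||_2 / (||G_t - G_{t-1}||_2 + eps)
    alpha = np.linalg.norm(grad) / (np.linalg.norm(grad - prev_grad) + eps)
    # M_t = rho * M_{t-1} + (1 - rho) * G_t; with rho = 0 this is just G_t
    m = momentum * m + (1.0 - momentum) * grad
    # theta_{t+1} = theta_t - lr * alpha_t * M_t (lr as a global scale is an assumption)
    theta = theta - lr * alpha * m
    return theta, m, alpha

# Toy problem: f(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
theta = np.array([3.0, 4.0])
prev_grad = np.zeros_like(theta)  # assumption: no gradient history before the first step
m = np.zeros_like(theta)
grad = theta.copy()
theta, m, alpha = fromage_step(theta, grad, prev_grad, m)
```

With a zero previous gradient, the numerator and denominator are both about 5, so `alpha` is close to 1 and the single step lands near the minimizer at the origin. On later steps the previous gradient must be carried along with the momentum buffer, which is the extra state this rule needs compared with plain SGD.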

Key advantages of Fromage:

  • Learning-rate-free: No manual lr tuning needed

  • Automatic adaptation: Step size adjusts to local curvature

  • Simple: Minimal hyperparameters to tune

  • Fast prototyping: Good default performance without tuning

  • Curvature-aware: Adapts to loss landscape geometry

  • Robust: Works across different problem scales

Fromage is particularly well-suited for:

  • Rapid prototyping and experimentation

  • Hyperparameter-free training pipelines

  • Problems where learning rate is hard to tune

  • Transfer learning with unknown optimal lr

  • Automated machine learning (AutoML)

  • Research experiments requiring minimal tuning

Comparison with other optimizers:

  • vs SGD: No learning rate tuning required

  • vs Adam: Simpler, fewer hyperparameters, learning-rate-free

  • vs AdaGrad: Automatic adaptation without accumulation issues

  • vs Hypergradient methods: Simpler, more efficient computation

Limitations:

  • May be less optimal than well-tuned adaptive optimizers

  • Needs at least two gradient evaluations before the step size reflects curvature, since \(\alpha_t\) depends on the previous gradient \(G_{t-1}\)

  • Gradient difference computation adds slight overhead

  • Best for medium-scale problems (not extensively tested on huge models)

Examples

Basic learning-rate-free usage:

>>> import brainstate
>>> import braintools
>>>
>>> # Create model
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Initialize Fromage with default lr=1.0 (no tuning needed)
>>> optimizer = braintools.optim.Fromage(lr=1.0)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

With momentum for smoother updates:

>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Enable momentum for better convergence
>>> optimizer = braintools.optim.Fromage(lr=1.0, momentum=0.9)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

Without momentum (pure adaptive):

>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Pure gradient-based updates
>>> optimizer = braintools.optim.Fromage(lr=1.0, momentum=0.0)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

With global learning rate scaling:

>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Scale automatic step sizes by 0.5
>>> optimizer = braintools.optim.Fromage(lr=0.5, momentum=0.9)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

With gradient clipping:

>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Clip gradients for stability
>>> optimizer = braintools.optim.Fromage(
...     lr=1.0,
...     momentum=0.9,
...     grad_clip_norm=1.0
... )
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

Complete configuration for prototyping:

>>> import brainstate
>>> import braintools
>>>
>>> # Model for rapid experimentation
>>> model = brainstate.nn.Linear(100, 50)
>>>
>>> # Complete Fromage configuration
>>> optimizer = braintools.optim.Fromage(
...     lr=1.0,
...     momentum=0.9,
...     grad_clip_norm=1.0
... )
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

See also

  • SGD – Stochastic gradient descent with momentum

  • Adam – Adaptive moment estimation

  • Adagrad – Adaptive learning rates for sparse features

default_tx()

Create default gradient transformation with clipping and weight decay.