Fromage

class braintools.optim.Fromage(lr=1.0, momentum=0.0, grad_clip_norm=None, grad_clip_value=None)

Fromage (FRee-scale Optimal Method for Adaptive GradiEnt) optimizer.

Fromage is a learning-rate-free optimizer that sets its step size automatically from an estimate of the local curvature of the loss landscape. Instead of a hand-tuned learning rate, it uses the ratio of the current gradient norm to the norm of the change in the gradient between consecutive steps. This makes it useful for rapid prototyping and for training pipelines where tuning a learning rate is impractical.

The norm of the gradient difference between consecutive steps serves as a proxy for local curvature: a gradient that changes quickly yields a smaller step size, and a gradient that changes slowly yields a larger one. The step size therefore adapts without explicit learning rate scheduling or tuning.

Parameters:

lr (float | LRScheduler) – Learning rate scale factor. Although Fromage is designed to be learning-rate-free, this parameter globally scales the automatically computed step sizes. If a float is provided, it is converted to a ConstantLR scheduler. Typically set to 1.0 to use pure automatic adaptation.
momentum (float) – Momentum coefficient for the first moment. When greater than 0, the optimizer maintains an exponential moving average of gradients. Set to 0 to disable momentum and use pure gradient-based updates.
grad_clip_norm (float | None) – Maximum gradient norm for gradient clipping. If None, no gradient norm clipping is applied.
grad_clip_value (float | None) – Maximum absolute gradient value for element-wise gradient clipping. If None, no gradient value clipping is applied.
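
To make the difference between the two clipping options concrete, here is a minimal sketch in plain Python. It illustrates the usual definitions of norm clipping and element-wise value clipping on a flat list of floats. The helper names `clip_by_norm` and `clip_by_value` are hypothetical and are not part of braintools, and the library's own implementation may differ in detail.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm does not exceed max_norm (hypothetical helper)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm or norm == 0.0:
        return list(grad)  # already within the limit, leave unchanged
    scale = max_norm / norm
    return [g * scale for g in grad]  # direction preserved, length reduced


def clip_by_value(grad, max_value):
    """Clamp every element of grad to [-max_value, max_value] (hypothetical helper)."""
    return [max(-max_value, min(max_value, g)) for g in grad]


grad = [3.0, -4.0, 0.5]
by_norm = clip_by_norm(grad, 1.0)    # whole vector is scaled down together
by_value = clip_by_value(grad, 1.0)  # only the large entries are clamped
```

Norm clipping preserves the direction of the gradient, while value clipping can change it because each element is clamped independently.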

Notes

The Fromage update rules are:

\[
\begin{aligned}
\alpha_t &= \frac{\|G_t\|_2}{\|G_t - G_{t-1}\|_2 + \epsilon} \\
M_t &= \rho M_{t-1} + (1 - \rho) G_t \quad \text{(if momentum > 0)} \\
\theta_{t+1} &= \theta_t - \alpha_t \cdot M_t
\end{aligned}
\]

where:

\(G_t\) is the gradient at step t
\(\alpha_t\) is the automatically computed step size
\(\|G_t\|_2\) is the L2 norm of the current gradient
\(\|G_t - G_{t-1}\|_2\) is the gradient difference norm (curvature proxy)
\(M_t\) is the momentum buffer (optional; with \(\rho = 0\) it reduces to \(G_t\))
\(\rho\) is the momentum coefficient
\(\epsilon\) is a small constant for numerical stability

The step size \(\alpha_t\) is intended to approximate \(1/L\), where \(L\) is the local Lipschitz constant of the gradient, so that steps shrink where the gradient changes quickly and grow where it changes slowly.
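
The update rules above can be sketched in plain Python as follows. This is an illustration of the formulas only, not the braintools implementation. The function name `fromage_step`, the flat-list parameter layout, and the choice to multiply \(\alpha_t\) by `lr` are assumptions made for the example. The formulas also do not say what \(G_{t-1}\) is on the first step, so the caller supplies it explicitly here.

```python
import math


def l2_norm(vec):
    """L2 norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in vec))


def fromage_step(theta, grad, prev_grad, m_prev, rho=0.0, lr=1.0, eps=1e-8):
    """One update following the formulas above (hypothetical helper).

    theta, grad, prev_grad and m_prev are flat lists of equal length.
    Returns the new parameters, the new momentum buffer and the step size.
    """
    # alpha_t = ||G_t|| / (||G_t - G_{t-1}|| + eps)
    diff = [g - pg for g, pg in zip(grad, prev_grad)]
    alpha = l2_norm(grad) / (l2_norm(diff) + eps)

    # M_t = rho * M_{t-1} + (1 - rho) * G_t; with rho = 0 this is just G_t
    if rho > 0:
        m = [rho * mp + (1.0 - rho) * g for mp, g in zip(m_prev, grad)]
    else:
        m = list(grad)

    # theta_{t+1} = theta_t - alpha_t * M_t, scaled by lr (assumption)
    new_theta = [t - lr * alpha * mi for t, mi in zip(theta, m)]
    return new_theta, m, alpha


# One step without momentum; the previous gradient is taken to be zero here.
theta, m, alpha = fromage_step(
    theta=[1.0, 2.0], grad=[3.0, 4.0], prev_grad=[0.0, 0.0], m_prev=[0.0, 0.0]
)
```

In a training loop, the caller would keep `grad` and `m` from each step and pass them as `prev_grad` and `m_prev` on the next one.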

Key advantages of Fromage:

Learning-rate-free: No manual lr tuning needed
Automatic adaptation: Step size adjusts to local curvature
Simple: Minimal hyperparameters to tune
Fast prototyping: Good default performance without tuning
Curvature-aware: Adapts to loss landscape geometry
Robust: Works across different problem scales

Fromage is particularly well-suited for:

Rapid prototyping and experimentation
Hyperparameter-free training pipelines
Problems where learning rate is hard to tune
Transfer learning with unknown optimal lr
Automated machine learning (AutoML)
Research experiments requiring minimal tuning

Comparison with other optimizers:

vs SGD: No learning rate tuning required
vs Adam: Simpler, fewer hyperparameters, learning-rate-free
vs AdaGrad: Automatic adaptation without accumulation issues
vs Hypergradient methods: Simpler, more efficient computation

Limitations:

May underperform well-tuned adaptive optimizers
Requires gradients from at least two consecutive steps before the step size reflects curvature
Storing the previous gradient and computing the gradient difference adds a small memory and compute overhead
Best suited to medium-scale problems (not extensively tested on very large models)

Examples

Basic learning-rate-free usage:

>>> import brainstate
>>> import braintools
>>>
>>> # Create model
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Initialize Fromage with default lr=1.0 (no tuning needed)
>>> optimizer = braintools.optim.Fromage(lr=1.0)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

With momentum for smoother updates:

>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Enable momentum for better convergence
>>> optimizer = braintools.optim.Fromage(lr=1.0, momentum=0.9)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

Without momentum (pure adaptive):

>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Pure gradient-based updates
>>> optimizer = braintools.optim.Fromage(lr=1.0, momentum=0.0)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

With global learning rate scaling:

>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Scale automatic step sizes by 0.5
>>> optimizer = braintools.optim.Fromage(lr=0.5, momentum=0.9)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

With gradient clipping:

>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Clip gradients for stability
>>> optimizer = braintools.optim.Fromage(
...     lr=1.0,
...     momentum=0.9,
...     grad_clip_norm=1.0
... )
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

Complete configuration for prototyping:

>>> import brainstate
>>> import braintools
>>>
>>> # Model for rapid experimentation
>>> model = brainstate.nn.Linear(100, 50)
>>>
>>> # Complete Fromage configuration
>>> optimizer = braintools.optim.Fromage(
...     lr=1.0,
...     momentum=0.9,
...     grad_clip_norm=1.0
... )
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))
