Adadelta

class braintools.optim.Adadelta(lr=1.0, rho=0.9, eps=1e-06, weight_decay=0.0, grad_clip_norm=None, grad_clip_value=None)

Adadelta optimizer - an extension of Adagrad.

Adadelta addresses Adagrad's aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, it restricts the accumulation to a window of recent gradients, realized as an exponentially decaying average whose decay rate is set by rho.
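
The contrast can be illustrated with a small NumPy sketch. It is not part of braintools; it assumes a constant gradient of 1.0 and takes the decaying-average recurrence from the Notes section. All variable names are illustrative only.

```python
import numpy as np

rho = 0.9                 # decay rate, matching the class default
grads = np.ones(1000)     # assumption: a constant gradient of 1.0 at every step

# Adagrad-style accumulator: the sum of all past squared gradients keeps growing.
adagrad_acc = np.cumsum(grads**2)

# Adadelta-style accumulator: an exponentially decaying average stays bounded.
ema = np.zeros_like(grads)
running = 0.0
for t, g in enumerate(grads):
    running = rho * running + (1.0 - rho) * g**2
    ema[t] = running

# Per-step scaling factor 1 / sqrt(accumulator) implied by each scheme.
adagrad_scale = 1.0 / np.sqrt(adagrad_acc)
ema_scale = 1.0 / np.sqrt(ema)

print("Adagrad-style scale after 1000 steps:", adagrad_scale[-1])
print("Decaying-average scale after 1000 steps:", ema_scale[-1])
```

Under these assumptions the Adagrad-style factor keeps shrinking as steps accumulate, while the decaying-average factor settles near a constant.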

Parameters:
  • lr (float | LRScheduler) – Learning rate (scaling factor). Can be a float or an LRScheduler instance. Note: Adadelta adapts its own step size and is largely insensitive to this value, so the default of 1.0 is often sufficient.

  • rho (float) – Decay coefficient for the running averages of squared gradients and squared parameter updates.

  • eps (float) – Term added to the denominator to improve numerical stability.

  • weight_decay (float) – Weight decay (L2 penalty) coefficient.

  • grad_clip_norm (float | None) – Maximum gradient norm for clipping.

  • grad_clip_value (float | None) – Maximum gradient value for clipping.
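
The two clipping options differ in what they bound. The NumPy sketch below illustrates the general techniques only. It assumes that norm clipping rescales all gradients by their global L2 norm and that value clipping bounds each element independently. Whether braintools computes the norm globally or per tensor is not stated here, and the helper names are hypothetical.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

def clip_by_value(grads, max_value):
    """Clamp every gradient element to the interval [-max_value, max_value]."""
    return [np.clip(g, -max_value, max_value) for g in grads]

grads = [np.array([3.0, 4.0]), np.array([0.0, 12.0])]  # joint L2 norm is 13

by_norm = clip_by_global_norm(grads, max_norm=1.0)   # direction preserved, length shrunk
by_value = clip_by_value(grads, max_value=5.0)       # only the element 12.0 is changed
```

Norm clipping preserves the direction of the overall gradient, whereas value clipping can change it because only the largest elements are truncated.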

Notes

The Adadelta update is computed as:

\[
\begin{aligned}
E[g^2]_t &= \rho E[g^2]_{t-1} + (1 - \rho) g_t^2 \\
\Delta \theta_t &= - \frac{\sqrt{E[\Delta \theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} g_t \\
E[\Delta \theta^2]_t &= \rho E[\Delta \theta^2]_{t-1} + (1 - \rho) \Delta \theta_t^2 \\
\theta_t &= \theta_{t-1} + \Delta \theta_t
\end{aligned}
\]

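where \(\rho\) is the decay rate, \(g_t\) is the gradient at step \(t\), \(E[g^2]_t\) and \(E[\Delta \theta^2]_t\) are the running averages of the squared gradients and squared parameter updates, and \(\epsilon\) is a small constant added for numerical stability.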
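
The four equations above can be sketched in plain NumPy as follows. This is an illustration of the update rule, not the braintools implementation. The function name and the toy objective are hypothetical, and lr is assumed to scale the applied update as in common implementations, so the equations correspond to lr=1.0.

```python
import numpy as np

def adadelta_step(theta, grad, avg_sq_grad, avg_sq_delta, lr=1.0, rho=0.9, eps=1e-6):
    """One illustrative Adadelta step; returns updated parameters and accumulators."""
    # E[g^2]_t: decaying average of squared gradients.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad**2
    # Delta theta_t: gradient rescaled by the ratio of the two RMS terms.
    delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * grad
    # E[Delta theta^2]_t: decaying average of squared updates.
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta**2
    # theta_t: apply the update (assumption: lr scales only the applied step).
    theta = theta + lr * delta
    return theta, avg_sq_grad, avg_sq_delta

# Toy problem: minimise 0.5 * ||theta||^2, whose gradient is theta itself.
theta0 = np.array([1.0, -2.0])
theta = theta0.copy()
avg_sq_grad = np.zeros_like(theta)
avg_sq_delta = np.zeros_like(theta)
for _ in range(100):
    grad = theta
    theta, avg_sq_grad, avg_sq_delta = adadelta_step(theta, grad, avg_sq_grad, avg_sq_delta)
```

Both accumulators start at zero in this sketch, so the first step has a magnitude of about sqrt(eps) / sqrt(1 - rho) for each parameter, independent of the gradient scale. This is one reason the method is described as largely insensitive to the learning rate.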

Examples

Basic Adadelta usage:

>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>> optimizer = braintools.optim.Adadelta()
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

Adadelta with custom rho:

>>> optimizer = braintools.optim.Adadelta(rho=0.95)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

Adadelta with explicit learning rate:

>>> optimizer = braintools.optim.Adadelta(lr=0.5, rho=0.9)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

See also

  • Adagrad – Adaptive gradient algorithm that accumulates all past squared gradients.

  • RMSprop – Also uses a decaying average of squared gradients, but does not track a running average of squared updates.

  • Adam – Combines ideas from RMSprop and momentum.

default_tx()

Create the Adadelta-specific gradient transformation.