Adadelta
- class braintools.optim.Adadelta(lr=1.0, rho=0.9, eps=1e-06, weight_decay=0.0, grad_clip_norm=None, grad_clip_value=None)
Adadelta optimizer, an extension of Adagrad.
Adadelta seeks to reduce Adagrad's aggressive, monotonically decreasing learning rate. Instead of accumulating all past squared gradients, Adadelta restricts the accumulation to a window of recent gradients, implemented as an exponentially decaying average with decay rate rho.
- Parameters:
  lr (float | LRScheduler) – Learning rate (scaling factor). Can be a float or LRScheduler instance. Note: Adadelta is largely learning rate free, so 1.0 is often sufficient.
  rho (float) – Coefficient used for computing running average of squared gradients.
  eps (float) – Term added to the denominator to improve numerical stability.
  weight_decay (float) – Weight decay (L2 penalty) coefficient.
  grad_clip_norm (float | None) – Maximum gradient norm for clipping.
  grad_clip_value (float | None) – Maximum gradient value for clipping.
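The two clipping parameters describe different operations. The sketch below illustrates the usual meaning of each in plain NumPy. It is an illustration only: the helper names `clip_by_value` and `clip_by_global_norm` are hypothetical, and whether braintools clips by a global norm across all parameters or per tensor is an assumption not confirmed by this page.

```python
import numpy as np

def clip_by_value(grads, max_value):
    # Hypothetical helper: clamp every gradient entry to [-max_value, max_value].
    return [np.clip(g, -max_value, max_value) for g in grads]

def clip_by_global_norm(grads, max_norm):
    # Hypothetical helper: rescale all gradients by one shared factor so that
    # their combined L2 norm does not exceed max_norm (assumed "global" variant).
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (global_norm + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0])]                # L2 norm is 5
by_value = clip_by_value(grads, 3.5)          # entries above 3.5 are clamped
by_norm = clip_by_global_norm(grads, 1.0)     # direction kept, norm reduced to about 1
```

Value clipping can change the direction of the gradient, while norm clipping only shortens it.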
Notes
The Adadelta update is computed as:
\[
\begin{aligned}
E[g^2]_t &= \rho E[g^2]_{t-1} + (1 - \rho) g_t^2 \\
\Delta \theta_t &= - \frac{\sqrt{E[\Delta \theta^2]_{t-1} + \epsilon}}{\sqrt{E[g^2]_t + \epsilon}} g_t \\
E[\Delta \theta^2]_t &= \rho E[\Delta \theta^2]_{t-1} + (1 - \rho) \Delta \theta_t^2 \\
\theta_t &= \theta_{t-1} + \Delta \theta_t
\end{aligned}
\]
where \(\rho\) is the decay rate, \(g_t\) is the gradient, and \(\epsilon\) is for numerical stability.
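The update equations above can be sketched in plain NumPy as follows. This is a minimal illustration, not the braintools implementation: the function name `adadelta_step` and its state arguments are hypothetical, and applying `lr` as a multiplier on \(\Delta \theta_t\) is an assumption (with the default `lr=1.0` it reduces exactly to the equations above). Weight decay and gradient clipping are omitted.

```python
import numpy as np

def adadelta_step(theta, grad, avg_sq_grad, avg_sq_delta, lr=1.0, rho=0.9, eps=1e-6):
    # E[g^2]_t: decaying average of squared gradients.
    avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * grad ** 2
    # Delta theta_t: gradient rescaled by RMS of past updates over RMS of gradients.
    delta = -np.sqrt(avg_sq_delta + eps) / np.sqrt(avg_sq_grad + eps) * grad
    # E[Delta theta^2]_t: decaying average of squared updates.
    avg_sq_delta = rho * avg_sq_delta + (1.0 - rho) * delta ** 2
    # Parameter update; scaling by lr is an assumption about how lr is applied.
    theta = theta + lr * delta
    return theta, avg_sq_grad, avg_sq_delta

# Toy problem: minimize f(theta) = 0.5 * ||theta||^2, whose gradient is theta.
theta0 = np.array([1.0, -2.0])
theta = theta0.copy()
avg_sq_grad = np.zeros_like(theta)
avg_sq_delta = np.zeros_like(theta)
for _ in range(100):
    grad = theta
    theta, avg_sq_grad, avg_sq_delta = adadelta_step(theta, grad, avg_sq_grad, avg_sq_delta)
```

Because \(E[\Delta \theta^2]\) starts at zero, the first updates are on the order of \(\sqrt{\epsilon}\), so progress is slow at first and speeds up as the update average grows.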
Examples
Basic Adadelta usage:
>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>> optimizer = braintools.optim.Adadelta()
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))
Adadelta with custom rho:
>>> optimizer = braintools.optim.Adadelta(rho=0.95)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))
Adadelta with explicit learning rate:
>>> optimizer = braintools.optim.Adadelta(lr=0.5, rho=0.9)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))