AdamW

class braintools.optim.AdamW(lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, grad_clip_norm=None, grad_clip_value=None)

AdamW optimizer with decoupled weight decay regularization.

AdamW modifies the standard Adam algorithm by decoupling weight decay from the gradient-based update. Loshchilov and Hutter ("Decoupled Weight Decay Regularization", 2019) report that this improves generalization compared with Adam using L2 regularization.

Parameters:

lr (float | LRScheduler) – Learning rate. Can be a float or LRScheduler instance.
betas (Tuple[float, float]) – Coefficients (beta1, beta2) for computing running averages of the gradient and of its square.
eps (float) – Term added to the denominator for numerical stability.
weight_decay (float) – Weight decay coefficient (decoupled from gradient).
grad_clip_norm (float | None) – Maximum gradient norm for clipping.
grad_clip_value (float | None) – Maximum gradient value for clipping.
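
The two clipping parameters can be read as the two conventional clipping schemes. The following is a minimal sketch of those schemes in plain Python, not braintools code: the helper names are hypothetical, and it is an assumption that grad_clip_norm rescales by a single global norm and that grad_clip_value clamps each element to [-value, value]. This page does not state which variant braintools uses, or the order applied when both are set.

```python
import math


def clip_by_value(grads, max_value):
    """Clamp every gradient entry to the interval [-max_value, max_value]."""
    return [max(-max_value, min(max_value, g)) for g in grads]


def clip_by_global_norm(grads, max_norm):
    """Rescale the whole gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        # Already within the limit (this also covers an all-zero gradient).
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]
```

Norm clipping keeps the direction of the gradient and only shortens it, whereas value clipping can change the direction because each entry is clamped independently.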

Notes

Unlike Adam, where weight decay is added to the gradient and therefore passes through the adaptive moment estimates, AdamW applies weight decay directly to the parameters:

\[\theta_t = \theta_{t-1} - \alpha (\frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_{t-1})\]

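where \(\alpha\) is the learning rate, \(\hat{m}_t\) and \(\hat{v}_t\) are the bias-corrected first and second moment estimates, \(\epsilon\) is the stability term eps, and \(\lambda\) is the weight decay coefficient.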
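
To make the difference concrete, here is a minimal sketch of one update step on a plain list of scalar parameters. It is an illustration of the formula above under the standard Adam moment and bias-correction rules, not the braintools implementation. The function name adamw_step and the decoupled flag are hypothetical; decoupled=False folds the decay into the gradient (Adam with L2 regularization) for comparison.

```python
import math


def adamw_step(theta, grad, m, v, t, lr=0.001, betas=(0.9, 0.999),
               eps=1e-08, weight_decay=0.01, decoupled=True):
    """One illustrative update step; t is the 1-based step count."""
    beta1, beta2 = betas
    new_theta, new_m, new_v = [], [], []
    for p, g, m_i, v_i in zip(theta, grad, m, v):
        if not decoupled:
            # Adam with L2 regularization: the decay term enters the gradient.
            g = g + weight_decay * p
        # Running averages of the gradient and of its square.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction for the zero-initialized averages.
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        update = m_hat / (math.sqrt(v_hat) + eps)
        if decoupled:
            # AdamW: the decay acts on the parameter, outside the adaptive term.
            update = update + weight_decay * p
        new_theta.append(p - lr * update)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v
```

With a zero gradient, the decoupled form shrinks a parameter by the factor (1 - lr * weight_decay), while the coupled form normalizes the decay term through the adaptive denominator, so the step size no longer scales with weight_decay in the same way.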

Examples

Basic AdamW usage:

>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>> optimizer = braintools.optim.AdamW(lr=0.001, weight_decay=0.01)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

AdamW with scheduler:

>>> scheduler = braintools.optim.CosineAnnealingLR(base_lr=0.001, T_max=100)
>>> optimizer = braintools.optim.AdamW(lr=scheduler, weight_decay=0.01)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))
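
For intuition about what the scheduler in the example above feeds into the optimizer, the conventional cosine annealing schedule can be sketched in plain Python. This is the textbook formula, not code taken from braintools: the function name is hypothetical, and eta_min (the final learning rate) and its default of 0 are assumptions, so the actual CosineAnnealingLR may differ in details.

```python
import math


def cosine_annealing_lr(step, base_lr=0.001, T_max=100, eta_min=0.0):
    """Learning rate at `step` under conventional cosine annealing."""
    cosine = math.cos(math.pi * step / T_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)
```

Under these assumptions the rate starts at base_lr, passes through half of base_lr at step T_max / 2, and reaches eta_min at step T_max.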
