Novograd

class braintools.optim.Novograd(lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0, grad_clip_norm=None, grad_clip_value=None)

Novograd (Normalized Gradient) optimizer - Layer-wise gradient normalization with momentum.

Novograd is an adaptive learning rate optimizer that combines layer-wise gradient normalization with Adam-like second moment estimation. It was designed specifically for training speech recognition models but has been shown to work well across various deep learning tasks including computer vision and NLP.

The key innovation of Novograd is computing the second moment per layer rather than per weight, which provides more stable training and reduces memory usage. It normalizes gradients by their layer-wise L2 norm, which helps with training stability, especially for models with varying layer sizes.
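To make the memory claim concrete, the following minimal sketch counts how many second-moment values each scheme would store. The layer names and shapes are hypothetical, and treating each parameter tensor as one "layer" is an assumption made for illustration; it does not inspect braintools internals.

```python
from math import prod

# Hypothetical parameter shapes, chosen only for illustration.
layer_shapes = {
    "conv1.weight": (64, 3, 3, 3),
    "fc1.weight": (512, 256),
    "fc1.bias": (512,),
}

# Per-weight second moment (Adam-style): one value per parameter element.
per_weight_entries = sum(prod(shape) for shape in layer_shapes.values())

# Per-layer second moment (Novograd-style): one value per parameter tensor.
per_layer_entries = len(layer_shapes)

print(per_weight_entries)  # 133312
print(per_layer_entries)   # 3
```

The first moment is still stored per weight in both schemes, so the saving applies only to the second-moment state.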

Parameters:
  • lr (float | LRScheduler) – Learning rate. Can be a float or LRScheduler instance. If float is provided, it will be automatically converted to a ConstantLR scheduler.

  • betas (Tuple[float, float]) – Coefficients (beta1, beta2) used for computing running averages. beta1 is for the first moment (momentum), beta2 is for the second moment (running average of the squared per-layer gradient norm).

  • eps (float) – Term added to the denominator for numerical stability. Prevents division by zero when gradients are very small.

  • weight_decay (float) – Weight decay coefficient (L2 penalty). When greater than 0, applies L2 regularization to the parameters.

  • grad_clip_norm (float | None) – Maximum gradient norm for gradient clipping. If None, no gradient norm clipping is applied.

  • grad_clip_value (float | None) – Maximum absolute gradient value for element-wise gradient clipping. If None, no gradient value clipping is applied.
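The two clipping options differ in what they preserve. The sketch below shows the conventional definitions of norm clipping and value clipping on a plain Python list; the helper names `clip_by_norm` and `clip_by_value` are hypothetical and are not braintools functions, and the library's own implementation may differ in detail.

```python
import math

def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [g * scale for g in grad]

def clip_by_value(grad, max_value):
    """Clamp every element of grad to the range [-max_value, max_value]."""
    return [max(-max_value, min(max_value, g)) for g in grad]

grad = [3.0, -4.0]  # L2 norm is 5.0

# Norm clipping keeps the direction: approximately [0.6, -0.8].
clipped_norm = clip_by_norm(grad, 1.0)

# Value clipping clamps each element and can change the direction: [1.0, -1.0].
clipped_value = clip_by_value(grad, 1.0)
```

Norm clipping preserves the gradient direction, while value clipping treats each element independently.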

Notes

The Novograd update rules are:

For each layer l with gradient \(G_t^{(l)}\):

\[
\begin{aligned}
g_t^{(l)} &= \frac{G_t^{(l)}}{\|G_t^{(l)}\|_2 + \epsilon} \\
v_t^{(l)} &= \beta_2 v_{t-1}^{(l)} + (1 - \beta_2) \|G_t^{(l)}\|_2^2 \\
m_t^{(l)} &= \beta_1 m_{t-1}^{(l)} + g_t^{(l)} + \lambda \theta_{t-1}^{(l)} \\
\theta_t^{(l)} &= \theta_{t-1}^{(l)} - \alpha \frac{m_t^{(l)}}{\sqrt{v_t^{(l)}} + \epsilon}
\end{aligned}
\]

where:

  • \(G_t^{(l)}\) is the gradient for layer l at step t

  • \(g_t^{(l)}\) is the normalized gradient (unit norm)

  • \(v_t^{(l)}\) is the second moment (per-layer, not per-weight)

  • \(m_t^{(l)}\) is the first moment (momentum)

  • \(\lambda\) is the weight decay coefficient

  • \(\alpha\) is the learning rate
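The four rules can be traced for a single layer with a short sketch. This is a dependency-free transcription of the equations as written in these notes, using plain Python lists; the function name `novograd_step` is hypothetical, zero initialization of the moments is an assumption, and this is not the code braintools actually runs.

```python
import math

def novograd_step(theta, grad, m, v, lr=0.001, betas=(0.9, 0.999),
                  eps=1e-8, weight_decay=0.0):
    """One update for a single layer.

    theta, grad and m are lists of floats (one layer); v is a single float.
    """
    beta1, beta2 = betas
    grad_norm = math.sqrt(sum(x * x for x in grad))

    # g_t: gradient divided by its layer-wise L2 norm.
    g = [x / (grad_norm + eps) for x in grad]

    # v_t: scalar running average of the squared layer norm.
    v = beta2 * v + (1.0 - beta2) * grad_norm ** 2

    # m_t: momentum over the normalized gradient plus the weight decay term.
    m = [beta1 * mi + gi + weight_decay * ti
         for mi, gi, ti in zip(m, g, theta)]

    # theta_t: step scaled by the per-layer second moment.
    denom = math.sqrt(v) + eps
    theta = [ti - lr * mi / denom for ti, mi in zip(theta, m)]
    return theta, m, v

# Moments start at zero here (an assumption for this sketch).
theta, m, v = [1.0, -2.0], [0.0, 0.0], 0.0
theta, m, v = novograd_step(theta, [3.0, 4.0], m, v, lr=0.1)
```

Note that `v` stays a single float per layer however many weights the layer holds, while `m` has the same length as `theta`.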

Key differences from Adam:

  • Layer-wise normalization: Normalizes gradients by layer L2 norm

  • Per-layer second moment: Stores one second-moment estimate per layer, not per weight

  • Memory efficient: Reduces memory for second moment estimation

  • More stable: Layer-wise normalization improves training stability

  • Better for varied layer sizes: Handles layers of different sizes better
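One consequence of layer-wise normalization is that the normalized gradient is nearly independent of the raw gradient's scale. A minimal sketch, assuming a hypothetical helper `normalize` that mirrors the first update rule:

```python
import math

def normalize(grad, eps=1e-8):
    """Divide a layer's gradient by its L2 norm (plus eps for stability)."""
    norm = math.sqrt(sum(x * x for x in grad))
    return [x / (norm + eps) for x in grad]

# Two gradients with the same direction but a 100000x difference in scale.
small = normalize([0.003, 0.004])
large = normalize([300.0, 400.0])

# Both are approximately [0.6, 0.8]; only eps makes them differ slightly.
print(small)
print(large)
```

Layers whose raw gradients differ by orders of magnitude therefore contribute normalized gradients of comparable size.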

Key advantages of Novograd:

  • Stable training: Layer-wise normalization reduces gradient variance

  • Memory efficient: Per-layer second moment reduces memory usage

  • Robust to layer size: Works well with varying layer dimensions

  • Good generalization: Often achieves better test performance than Adam

  • Simple: No complex hyperparameter tuning needed

Novograd is particularly well-suited for:

  • Speech recognition models (Jasper, QuartzNet)

  • Training from scratch (not fine-tuning)

  • Models with layers of varying sizes

  • Tasks requiring stable training dynamics

  • Replacing Adam for better generalization

Comparison with other optimizers:

  • vs Adam: Less memory, more stable, better generalization

  • vs SGD: Adaptive step sizes, typically less manual learning rate tuning

  • vs RMSprop: Better momentum, per-layer adaptation

  • vs Layer-wise Adam: Similar concept, different implementation

Examples

Basic usage:

>>> import brainstate
>>> import braintools
>>>
>>> # Create model
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Initialize Novograd
>>> optimizer = braintools.optim.Novograd(lr=0.001)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

With custom betas:

>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Higher beta1 for more momentum
>>> optimizer = braintools.optim.Novograd(lr=0.001, betas=(0.95, 0.999))
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

With weight decay:

>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Add L2 regularization
>>> optimizer = braintools.optim.Novograd(lr=0.001, weight_decay=0.01)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

With learning rate scheduler:

>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Step decay schedule: scale the learning rate by gamma every step_size steps
>>> scheduler = braintools.optim.StepLR(
...     base_lr=0.01,
...     step_size=100,
...     gamma=0.5
... )
>>> optimizer = braintools.optim.Novograd(lr=scheduler)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))
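For intuition about the schedule above, the sketch below evaluates the conventional step decay formula, `base_lr * gamma ** (step // step_size)`, with the same values. That braintools' StepLR follows exactly this formula is an assumption, and the function `step_lr` is a hypothetical stand-in, not a library call.

```python
def step_lr(step, base_lr=0.01, step_size=100, gamma=0.5):
    """Conventional step decay: multiply by gamma once every step_size steps."""
    return base_lr * gamma ** (step // step_size)

# Learning rate at a few training steps.
rates = {s: step_lr(s) for s in (0, 99, 100, 250)}
print(rates)
```

Under this formula the rate stays at 0.01 through step 99, drops to 0.005 at step 100, and is 0.0025 by step 250.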

With gradient clipping for stable training:

>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Clip gradients by global norm
>>> optimizer = braintools.optim.Novograd(
...     lr=0.001,
...     grad_clip_norm=1.0
... )
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

Complete configuration for speech recognition:

>>> import brainstate
>>> import braintools
>>>
>>> # Stand-in for a large speech model
>>> model = brainstate.nn.Linear(1000, 500)
>>>
>>> # Step decay learning rate schedule (no warmup)
>>> scheduler = braintools.optim.StepLR(
...     base_lr=0.01,
...     step_size=1000,
...     gamma=0.9
... )
>>>
>>> # Complete Novograd configuration
>>> optimizer = braintools.optim.Novograd(
...     lr=scheduler,
...     betas=(0.95, 0.98),
...     eps=1e-8,
...     weight_decay=0.001,
...     grad_clip_norm=1.0
... )
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

See also

Adam

Standard adaptive moment estimation

RMSprop

Root mean square propagation

SGD

Stochastic gradient descent with momentum

Lars

Layer-wise adaptive rate scaling

default_tx()

Create default gradient transformation with clipping and weight decay.