Novograd
- class braintools.optim.Novograd(lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.0, grad_clip_norm=None, grad_clip_value=None)
Novograd (Normalized Gradient) optimizer - Layer-wise gradient normalization with momentum.
Novograd is an adaptive learning rate optimizer that combines layer-wise gradient normalization with Adam-like second moment estimation. It was designed specifically for training speech recognition models but has been shown to work well across various deep learning tasks including computer vision and NLP.
The key innovation of Novograd is computing the second moment per layer rather than per weight, which provides more stable training and reduces memory usage. It normalizes gradients by their layer-wise L2 norm, which helps with training stability, especially for models with varying layer sizes.
- Parameters:
lr (float | LRScheduler) – Learning rate. Can be a float or an LRScheduler instance. If a float is provided, it is automatically converted to a ConstantLR scheduler.
betas (Tuple[float, float]) – Coefficients (beta1, beta2) used for computing running averages. beta1 is for the first moment (momentum), beta2 is for the second moment (per-layer gradient variance).
eps (float) – Term added to the denominator for numerical stability. Prevents division by zero when gradients are very small.
weight_decay (float) – Weight decay coefficient (L2 penalty). When greater than 0, applies L2 regularization to the parameters.
grad_clip_norm (float | None) – Maximum gradient norm for gradient clipping. If None, no gradient norm clipping is applied.
grad_clip_value (float | None) – Maximum absolute gradient value for element-wise gradient clipping. If None, no gradient value clipping is applied.
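The two clipping options differ in what they preserve: norm clipping rescales the whole gradient and keeps its direction, while value clipping clamps each element independently and can change the direction. The following is a minimal pure-Python sketch of the usual definitions, not the library's implementation. The helper names `clip_by_norm` and `clip_by_value` are hypothetical, and the sketch assumes a single flat gradient vector; whether braintools computes the norm per tensor or across all parameters is not stated here.

```python
import math


def clip_by_norm(grad, max_norm):
    """Rescale grad so that its L2 norm is at most max_norm (direction kept)."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * (max_norm / norm) for g in grad]


def clip_by_value(grad, max_value):
    """Clamp every element of grad to the range [-max_value, max_value]."""
    return [max(-max_value, min(max_value, g)) for g in grad]


grad = [3.0, -4.0]                          # L2 norm is 5
norm_clipped = clip_by_norm(grad, 1.0)      # approximately [0.6, -0.8]
value_clipped = clip_by_value(grad, 1.0)    # [1.0, -1.0]
```

In this example, norm clipping keeps the 3:4 ratio between the components, whereas value clipping flattens both to magnitude 1.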
Notes
The Novograd update rules are:
For each layer l with gradient \(G_t^{(l)}\):
\[
\begin{aligned}
g_t^{(l)} &= \frac{G_t^{(l)}}{\|G_t^{(l)}\|_2 + \epsilon} \\
v_t^{(l)} &= \beta_2 v_{t-1}^{(l)} + (1 - \beta_2) \|G_t^{(l)}\|_2^2 \\
m_t^{(l)} &= \beta_1 m_{t-1}^{(l)} + g_t^{(l)} + \lambda \theta_{t-1}^{(l)} \\
\theta_t^{(l)} &= \theta_{t-1}^{(l)} - \alpha \frac{m_t^{(l)}}{\sqrt{v_t^{(l)}} + \epsilon}
\end{aligned}
\]
where:
\(G_t^{(l)}\) is the gradient for layer l at step t
\(g_t^{(l)}\) is the normalized gradient (unit L2 norm, up to \(\epsilon\))
\(v_t^{(l)}\) is the second moment (one scalar per layer, not per weight)
\(m_t^{(l)}\) is the first moment (momentum)
\(\theta_t^{(l)}\) are the parameters of layer l at step t
\(\lambda\) is the weight decay coefficient
\(\alpha\) is the learning rate
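As an illustration, the four update rules above can be transcribed directly into a few lines of pure Python. This sketch follows the equations as written on this page for a single layer. It is not the library's code, and the function name `novograd_layer_step` and the flat-list representation of a layer are assumptions made for the example.

```python
import math


def novograd_layer_step(theta, grad, m, v, lr=0.001, betas=(0.9, 0.999),
                        eps=1e-8, weight_decay=0.0):
    """One update of a single layer, following the four rules above.

    theta, grad and m are flat lists of floats (weights, gradient and first
    moment of the layer); v is a single float shared by the whole layer.
    """
    beta1, beta2 = betas
    grad_norm = math.sqrt(sum(g * g for g in grad))  # ||G_t||_2

    # g_t = G_t / (||G_t||_2 + eps)
    g_hat = [g / (grad_norm + eps) for g in grad]

    # v_t = beta2 * v_{t-1} + (1 - beta2) * ||G_t||_2^2  (one scalar per layer)
    v_new = beta2 * v + (1.0 - beta2) * grad_norm ** 2

    # m_t = beta1 * m_{t-1} + g_t + lambda * theta_{t-1}
    m_new = [beta1 * mi + gi + weight_decay * ti
             for mi, gi, ti in zip(m, g_hat, theta)]

    # theta_t = theta_{t-1} - lr * m_t / (sqrt(v_t) + eps)
    scale = lr / (math.sqrt(v_new) + eps)
    theta_new = [ti - scale * mi for ti, mi in zip(theta, m_new)]
    return theta_new, m_new, v_new


theta = [1.0, -2.0, 0.5]
grad = [3.0, 0.0, 4.0]          # L2 norm is 5
m = [0.0, 0.0, 0.0]
theta, m, v = novograd_layer_step(theta, grad, m, v=0.0)
```

Note that `v` stays a single number no matter how many weights the layer has, while `m` has the same length as `theta`.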
Key differences from Adam:
Layer-wise normalization: Normalizes gradients by layer L2 norm
Per-layer second moment: Stores one variance per layer, not per weight
Memory efficient: Reduces memory for second moment estimation
More stable: Layer-wise normalization improves training stability
Better for varied layer sizes: Handles layers of different sizes better
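To make the memory claim concrete, the sketch below counts second-moment entries for a small network. The layer names and shapes are hypothetical and chosen only for illustration; biases are omitted.

```python
# Hypothetical weight shapes for a small MLP (biases omitted for brevity).
layer_shapes = {"fc1": (784, 256), "fc2": (256, 64), "fc3": (64, 10)}


def num_weights(shape):
    """Number of scalar weights in a tensor of the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n


total_weights = sum(num_weights(s) for s in layer_shapes.values())

# Adam-style: one second-moment entry per weight.
per_weight_second_moment = total_weights
# Novograd-style: one second-moment scalar per layer.
per_layer_second_moment = len(layer_shapes)

print(total_weights, per_weight_second_moment, per_layer_second_moment)
```

The first moment is still stored per weight in both cases, so in this example the total optimizer state drops from about twice the number of weights to about the number of weights plus three scalars.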
Key advantages of Novograd:
Stable training: Layer-wise normalization reduces gradient variance
Memory efficient: Per-layer second moment reduces memory usage
Robust to layer size: Works well with varying layer dimensions
Good generalization: Often achieves better test performance than Adam
Simple: Few hyperparameters, usually with little tuning beyond the learning rate
Novograd is particularly well-suited for:
Speech recognition models (Jasper, QuartzNet)
Training from scratch (not fine-tuning)
Models with layers of varying sizes
Tasks requiring stable training dynamics
Replacing Adam for better generalization
Comparison with other optimizers:
vs Adam: Less memory, more stable, better generalization
vs SGD: Adaptive per-layer step sizes, less sensitive to the choice of learning rate
vs RMSprop: Better momentum, per-layer adaptation
vs Layer-wise Adam: Similar concept, different implementation
Examples
Basic usage:
>>> import brainstate
>>> import braintools
>>>
>>> # Create model
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Initialize Novograd
>>> optimizer = braintools.optim.Novograd(lr=0.001)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))
With custom betas:
>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Higher beta1 for more momentum
>>> optimizer = braintools.optim.Novograd(lr=0.001, betas=(0.95, 0.999))
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))
With weight decay:
>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Add L2 regularization
>>> optimizer = braintools.optim.Novograd(lr=0.001, weight_decay=0.01)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))
With learning rate scheduler:
>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Step decay schedule: multiply the learning rate by 0.5 every 100 steps
>>> scheduler = braintools.optim.StepLR(
...     base_lr=0.01,
...     step_size=100,
...     gamma=0.5
... )
>>> optimizer = braintools.optim.Novograd(lr=scheduler)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))
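For reference, a step decay schedule is conventionally defined as `base_lr * gamma ** (step // step_size)`. The sketch below evaluates that formula for the values used in the example above. It assumes braintools' StepLR follows this conventional definition, which is not confirmed here, and `step_decay` is a hypothetical helper rather than a library function.

```python
def step_decay(step, base_lr=0.01, step_size=100, gamma=0.5):
    """Conventional step decay: scale base_lr by gamma once every step_size steps."""
    return base_lr * gamma ** (step // step_size)


# Learning rate at a few scheduler steps.
for step in (0, 99, 100, 250):
    print(step, step_decay(step))
```

Under this definition the rate stays at 0.01 for steps 0 to 99, halves to 0.005 at step 100, and halves again at step 200.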
With gradient clipping for stable training:
>>> import brainstate
>>> import braintools
>>>
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Clip gradients by global norm
>>> optimizer = braintools.optim.Novograd(
...     lr=0.001,
...     grad_clip_norm=1.0
... )
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))
Complete configuration for speech recognition:
>>> import brainstate
>>> import braintools
>>>
>>> # Stand-in for a large speech model
>>> model = brainstate.nn.Linear(1000, 500)
>>>
>>> # Step decay schedule: multiply the learning rate by 0.9 every 1000 steps
>>> scheduler = braintools.optim.StepLR(
...     base_lr=0.01,
...     step_size=1000,
...     gamma=0.9
... )
>>>
>>> # Complete Novograd configuration
>>> optimizer = braintools.optim.Novograd(
...     lr=scheduler,
...     betas=(0.95, 0.98),
...     eps=1e-8,
...     weight_decay=0.001,
...     grad_clip_norm=1.0
... )
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))