Rprop

class braintools.optim.Rprop(lr=0.01, etas=(0.5, 1.2), step_sizes=(1e-06, 50.0), grad_clip_norm=None, grad_clip_value=None)

Rprop optimizer (Resilient Backpropagation).

Rprop is a gradient-based optimization algorithm that adapts the step size individually for each parameter based only on the sign of the gradient, not its magnitude. This makes it particularly robust to varying gradient scales and well-suited for batch learning.

Parameters:
  • lr (float | LRScheduler) – Initial step size applied to every parameter. A float is converted to a ConstantLR; any LRScheduler instance is also accepted.

  • etas (Tuple[float, float]) – Step size adjustment factors (eta_minus, eta_plus). When the gradient sign flips between consecutive steps, the step size is multiplied by eta_minus (typically < 1). When the sign stays the same, it is multiplied by eta_plus (typically > 1).

  • step_sizes (Tuple[float, float]) – Minimum and maximum allowed step sizes (min_step_size, max_step_size). Step sizes are clipped to this range after each adjustment, so they can neither vanish nor grow without bound.

  • grad_clip_norm (float | None) – Maximum norm for gradient clipping. If specified, gradients are clipped when their global norm exceeds this value.

  • grad_clip_value (float | None) – Maximum absolute value for gradient clipping. If specified, gradients are clipped element-wise to [-grad_clip_value, grad_clip_value].
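The two clipping options describe different operations. The following is a minimal plain-Python sketch of what they mean, not the library's implementation: the helper names `clip_by_global_norm` and `clip_by_value` are hypothetical, and rescaling by `max_norm / norm` is the conventional definition of global-norm clipping, assumed here rather than taken from the braintools source.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)          # already within the limit: unchanged
    scale = max_norm / norm         # assumed convention: uniform rescaling
    return [g * scale for g in grads]


def clip_by_value(grads, max_value):
    """Clamp each gradient component independently to [-max_value, max_value]."""
    return [min(max(g, -max_value), max_value) for g in grads]


grads = [3.0, -4.0]                           # global L2 norm is 5.0
by_norm = clip_by_global_norm(grads, 1.0)     # direction kept, norm reduced to about 1
by_value = clip_by_value(grads, 1.0)          # each component clamped separately
```

With a positive threshold, both operations preserve the sign of every gradient component, which is the only information the update rule described in the Notes uses.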

Notes

Rprop adapts the step size for each weight based on the sign pattern of gradients. The update rule is:

\[\begin{split}\Delta_t^{(i)} = \begin{cases} \eta^+ \cdot \Delta_{t-1}^{(i)} & \text{if } \frac{\partial E}{\partial w_i^{(t)}} \cdot \frac{\partial E}{\partial w_i^{(t-1)}} > 0 \\ \eta^- \cdot \Delta_{t-1}^{(i)} & \text{if } \frac{\partial E}{\partial w_i^{(t)}} \cdot \frac{\partial E}{\partial w_i^{(t-1)}} < 0 \\ \Delta_{t-1}^{(i)} & \text{otherwise} \end{cases}\end{split}\]

The step size is then clipped:

\[\Delta_t^{(i)} = \text{clip}(\Delta_t^{(i)}, \Delta_{\min}, \Delta_{\max})\]

Finally, the parameter update is:

\[w_i^{(t+1)} = w_i^{(t)} - \text{sign}\left(\frac{\partial E}{\partial w_i^{(t)}}\right) \cdot \Delta_t^{(i)}\]
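The three equations above can be sketched in plain Python as follows. This is an illustration of the formulas as written, not the braintools implementation: the function name `rprop_step` and its list-based interface are hypothetical, and any extra handling the library may apply after a sign change is not modelled.

```python
def rprop_step(w, grad, prev_grad, step, etas=(0.5, 1.2), step_sizes=(1e-6, 50.0)):
    """One Rprop update over flat lists of floats; returns (new_w, new_step)."""
    eta_minus, eta_plus = etas
    step_min, step_max = step_sizes
    new_w, new_step = [], []
    for w_i, g, g_prev, delta in zip(w, grad, prev_grad, step):
        agreement = g * g_prev
        if agreement > 0:        # same sign as the previous step: grow
            delta *= eta_plus
        elif agreement < 0:      # sign flipped: shrink
            delta *= eta_minus
        delta = min(max(delta, step_min), step_max)   # clip to the allowed range
        sign = (g > 0) - (g < 0)                      # -1, 0 or +1
        new_w.append(w_i - sign * delta)
        new_step.append(delta)
    return new_w, new_step


# Minimise E(w) = w0^2 + 1000 * w1^2, whose gradient is (2 * w0, 2000 * w1).
w = [3.0, -2.0]
step = [0.01, 0.01]       # plays the role of `lr`
prev_grad = [0.0, 0.0]    # no previous gradient on the first step
for _ in range(200):
    grad = [2.0 * w[0], 2000.0 * w[1]]
    w, step = rprop_step(w, grad, prev_grad, step)
    prev_grad = grad
```

Both coordinates approach zero even though their gradients differ by a factor of 1000, because only the sign of each gradient enters the update.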

Key characteristics of Rprop:

  • Sign-based updates: Uses only gradient sign, not magnitude

  • Individual step sizes: Each parameter has its own adaptive step size

  • Batch learning: Designed for full-batch gradient descent

  • Robust to scales: Insensitive to gradient magnitude variations

  • Simple and effective: Few hyperparameters to tune

  • Local adaptation: Adapts based on consecutive gradient signs
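The sign-based and scale-robustness points can be checked with a small experiment. The sketch below is a hypothetical one-dimensional helper (`run_rprop_1d` is not part of braintools) that follows the update rule from the Notes and compares two losses that differ only by a constant factor.

```python
def run_rprop_1d(grad_fn, w0, n_steps, lr=0.01, etas=(0.5, 1.2), bounds=(1e-6, 50.0)):
    """Run the Rprop rule on a single scalar weight and return the visited points."""
    w, delta, g_prev = w0, lr, 0.0
    path = []
    for _ in range(n_steps):
        g = grad_fn(w)
        if g * g_prev > 0:
            delta *= etas[1]
        elif g * g_prev < 0:
            delta *= etas[0]
        delta = min(max(delta, bounds[0]), bounds[1])
        w -= ((g > 0) - (g < 0)) * delta   # only the sign of g is used
        g_prev = g
        path.append(w)
    return path


path_small = run_rprop_1d(lambda w: 2.0 * w, 3.0, 50)      # E(w) = w^2
path_large = run_rprop_1d(lambda w: 2000.0 * w, 3.0, 50)   # E(w) = 1000 * w^2
same_path = path_small == path_large
```

Rescaling the loss changes every gradient magnitude but none of the signs, so the two runs visit the same points. A magnitude-based method such as plain gradient descent would need its learning rate retuned for the second loss.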

Rprop is particularly well-suited for:

  • Neural network training with batch learning

  • Problems with varying gradient scales across parameters

  • Scenarios where gradient magnitudes are unreliable

  • Feed-forward networks and small-medium sized problems

Advantages:

  • Robust to gradient scaling issues

  • Fast convergence on many problems

  • Simple to implement and tune

Limitations:

  • Not designed for mini-batch stochastic optimization

  • Step sizes grow only while the gradient sign stays the same across consecutive steps

  • Less effective with very noisy gradients
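The effect of noisy gradients on the step size can be illustrated with a deterministic toy sequence. In this sketch (the helper `step_size_trace` is hypothetical and simply replays the step-size rule from the Notes), the underlying gradient is always +1, but an alternating perturbation of ±3 stands in for mini-batch noise and flips the observed sign on every step.

```python
def step_size_trace(grads, lr=0.01, etas=(0.5, 1.2), bounds=(1e-6, 50.0)):
    """Return the step size after each gradient in a sequence of scalar gradients."""
    delta, g_prev, trace = lr, 0.0, []
    for g in grads:
        if g * g_prev > 0:
            delta *= etas[1]
        elif g * g_prev < 0:
            delta *= etas[0]
        delta = min(max(delta, bounds[0]), bounds[1])
        trace.append(delta)
        g_prev = g
    return trace


clean = [1.0] * 20                                            # constant sign
noisy = [1.0 + (3.0 if t % 2 else -3.0) for t in range(20)]   # -2, 4, -2, 4, ...

clean_trace = step_size_trace(clean)
noisy_trace = step_size_trace(noisy)
```

With the clean sequence the step size keeps growing, while with the noisy one it is halved on every step until it sits at the lower bound, even though both sequences have the same average gradient. This is the mechanism behind the recommendation to use full batches or very large batches.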

Examples

Basic usage with default parameters:

>>> import brainstate
>>> import braintools
>>>
>>> # Create model
>>> model = brainstate.nn.Linear(10, 5)
>>>
>>> # Initialize Rprop optimizer
>>> optimizer = braintools.optim.Rprop(lr=0.01)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

With custom eta values for step size adjustment:

>>> # More aggressive step size changes
>>> optimizer = braintools.optim.Rprop(
...     lr=0.01,
...     etas=(0.3, 1.5)  # Faster decrease, faster increase
... )
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

With custom step size bounds:

>>> # Tighter bounds on step sizes
>>> optimizer = braintools.optim.Rprop(
...     lr=0.01,
...     step_sizes=(1e-5, 10.0)
... )
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

Explicit step-size configuration:

>>> # eta factors and step-size bounds spelled out (these values match the defaults)
>>> optimizer = braintools.optim.Rprop(
...     lr=0.01,
...     etas=(0.5, 1.2),
...     step_sizes=(1e-6, 50.0)
... )
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))

Batch training example:

>>> import jax.numpy as jnp
>>>
>>> # Setup for batch learning
>>> model = brainstate.nn.Sequential(
...     brainstate.nn.Linear(100, 50),
...     brainstate.nn.Tanh(),
...     brainstate.nn.Linear(50, 10)
... )
>>>
>>> # Rprop for batch training
>>> optimizer = braintools.optim.Rprop(
...     lr=0.01,
...     etas=(0.5, 1.2)
... )
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))
>>>
>>> # Full-batch training step
>>> def train_step(batch_x, batch_y):
...     def loss_fn():
...         logits = model(batch_x)
...         return jnp.mean(
...             braintools.metric.softmax_cross_entropy(logits, batch_y)
...         )
...
...     grads = brainstate.transform.grad(loss_fn, model.states(brainstate.ParamState))()
...     optimizer.update(grads)
...     # Loss is re-evaluated after the update, at the cost of a second forward pass
...     return loss_fn()
>>>
>>> # Use full batch or large batches
>>> x = jnp.ones((500, 100))
>>> y = jnp.zeros((500, 10))
>>> # loss = train_step(x, y)

Classification task example:

>>> # Rprop for classification
>>> model = brainstate.nn.Sequential(
...     brainstate.nn.Linear(784, 256),
...     brainstate.nn.ReLU(),
...     brainstate.nn.Linear(256, 10)
... )
>>>
>>> optimizer = braintools.optim.Rprop(lr=0.01)
>>> optimizer.register_trainable_weights(model.states(brainstate.ParamState))
>>>
>>> # Rprop adapts step sizes automatically
>>> # Works well even with varying gradient scales

See also

  • SGD – Stochastic gradient descent with momentum

  • Adam – Adaptive moment estimation

  • LBFGS – Limited-memory BFGS for batch optimization

default_tx()

Create default gradient transformation with clipping and weight decay.