IODimVjpAlgorithm

IODimVjpAlgorithm#

class braintrace.IODimVjpAlgorithm(model, decay_or_rank, name=None, vjp_method='single-step', fast_solve=True, control_flow=None)[source]#

Online gradient algorithm with diagonal approximation and input-output-dimension complexity.

This algorithm computes the gradients of the weights with the diagonal approximation and the input-output dimensional complexity. It is based on the RTRL algorithm (Real-Time Recurrent Learning).

Parameters:

model (Module) – The model function, which receives the input arguments and returns the model output.
decay_or_rank (float | int) – The exponential smoothing factor for the eligibility trace. If a float, it is the decay factor and should be in the range \((0, 1)\). If an integer, it is the number of approximation ranks for the algorithm and should be greater than 0.
vjp_method (str) –
The method for computing the VJP. It should be either "single-step" or "multi-step".
- "single-step": the VJP is computed at the current time step, i.e., \(\partial L^t/\partial h^t\).
- "multi-step": the VJP is computed at multiple time steps, i.e., \(\partial L^t/\partial h^{t-k}\), where \(k\) is determined by the data input.
control_flow (ControlFlowPolicy | None) – Policy governing control-flow canonicalization (cond if-conversion, scan unrolling, structured scan descent, …) during graph compilation. None (default) uses DEFAULT_CONTROL_FLOW_POLICY.
name (str | None) – The name of the etrace algorithm.
mode (braintrace.mixin.Mode, optional) – The computing mode, indicating the batching information.

Notes

The learning rule is

\[\begin{split}\begin{aligned} & \boldsymbol{\epsilon}^t \approx \boldsymbol{\epsilon}_{\mathbf{f}}^t \otimes \boldsymbol{\epsilon}_{\mathbf{x}}^t \\ & \boldsymbol{\epsilon}_{\mathbf{x}}^t=\alpha \boldsymbol{\epsilon}_{\mathbf{x}}^{t-1}+\mathbf{x}^t \\ & \boldsymbol{\epsilon}_{\mathbf{f}}^t=\alpha \operatorname{diag}\left(\mathbf{D}^t\right) \circ \boldsymbol{\epsilon}_{\mathbf{f}}^{t-1}+(1-\alpha) \operatorname{diag}\left(\mathbf{D}_f^t\right) \\ & \nabla_{\boldsymbol{\theta}} \mathcal{L}=\sum_{t^{\prime} \in \mathcal{T}} \frac{\partial \mathcal{L}^{t^{\prime}}}{\partial \mathbf{h}^{t^{\prime}}} \circ \boldsymbol{\epsilon}^{t^{\prime}} \end{aligned}\end{split}\]

where \(\boldsymbol{\epsilon}_{\mathbf{x}}^t\) is the input-side trace, \(\boldsymbol{\epsilon}_{\mathbf{f}}^t\) the output-side trace, \(\alpha\) the exponential-smoothing factor, \(\mathbf{D}^t\) the hidden-to-hidden Jacobian, \(\mathbf{D}_f^t\) the state-to-output Jacobian, and \(\mathbf{x}^t\) the presynaptic input.

\(\mathbf{D}_f^t\) is read off by _compute_hid2weight_jacobian() from a single all-ones-tangent jax.jvp of the y -> hidden map; see that method’s docstring for when this is exact (elementwise maps) versus an approximation (non-elementwise maps, e.g. a normalization layer between the weight op and the neuron) — the same approximation is shared with ParamDimVjpAlgorithm (D_RTRL).

The full per-parameter D-RTRL trace \(\boldsymbol{\epsilon}^t \in \mathbb{R}^{I\times O}\) is approximated by the outer product of two exponentially-smoothed vectors — one over the input dimension and one over the output dimension. Storing the two factors instead of the full matrix drops the memory from \(O(I\cdot O)\) to \(O(I+O)\) per layer. The decay \(\alpha\) (equivalently an approximation rank) controls how much temporal history the factored trace retains; the bias of the exponential estimator is corrected at solve time.

This algorithm has \(O(BI+BO)\) memory complexity and \(O(BIO)\) computational complexity, where \(I\) and \(O\) are the number of input and output dimensions, and \(B\) the batch size. In particular, for a linear transformation layer, the weight gradients are computed with \(O(Bn)\) memory complexity and \(O(Bn^2)\) computational complexity, where \(n\) is the number of hidden dimensions.

For more details, please see the ES-D-RTRL algorithm presented in our manuscript.

Examples

>>> import brainstate
>>> import braintrace
>>>
>>> class RNN(brainstate.nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.cell = braintrace.nn.ValinaRNNCell(1, 20, activation='tanh')
...         self.out = braintrace.nn.Linear(20, 1)
...     def update(self, x):
...         return x >> self.cell >> self.out
>>>
>>> model = RNN()
>>> x0 = brainstate.random.randn(1)
>>> # one call: initialise states, build the trace graph, return a learner
>>> learner = braintrace.compile(model, braintrace.pp_prop, x0, decay_or_rank=0.9)  # or rank: decay_or_rank=19
>>> y = learner(x0)             # forward pass + eligibility-trace update

References

get_etrace_of(weight)[source]#

Get the eligibility trace of the given weight.

Parameters:

weight (ParamState | Tuple[str, ...]) – The weight whose eligibility trace is requested, given either as a brainstate.ParamState instance or as its path in the model.

Return type:

Tuple[Dict, Dict]

Returns:

etrace_xs (dict) – The input-side eligibility traces keyed by the weight-input variable.
etrace_dfs (dict) – The output-side eligibility traces keyed by (y_var, hidden-group index).

Raises:

ValueError – If no eligibility trace is found for the given weight.

init_etrace_state(*args, **kwargs)[source]#

Initialize the eligibility trace states of the etrace algorithm.

This method is needed after compiling the etrace graph. See compile_graph() for the details.

Return type:: None

reset_state(batch_size=None, **kwargs)[source]#

Reset the eligibility trace states.

Parameters:: batch_size (int | None) – The batch size used to reshape the reset trace states. Default None.
Return type:: None

IODimVjpAlgorithm

Contents

IODimVjpAlgorithm#

Modeling

Infrastructure

Compilation