ParamDimVjpAlgorithm

class braintrace.ParamDimVjpAlgorithm(model, name=None, vjp_method='single-step', fast_solve=True, trace_dtype=None, chunked_trace=True, control_flow=None)

Online gradient algorithm with diagonal approximation and parameter-dimension complexity.

This algorithm computes weight gradients online, using a diagonal approximation of the hidden-to-hidden Jacobian so that the memory cost scales with the number of parameters. It is based on the Real-Time Recurrent Learning (RTRL) algorithm.

Parameters:

model (Module) – The model function, which receives the input arguments and returns the model output.
vjp_method (str) –
The method for computing the VJP. It should be either "single-step" or "multi-step".
- "single-step": the VJP is computed at the current time step, i.e., \(\partial L^t/\partial h^t\).
- "multi-step": the VJP is computed at multiple time steps, i.e., \(\partial L^t/\partial h^{t-k}\), where \(k\) is determined by the data input.
chunked_trace (bool) – When True (default) and the input spans multiple time steps, the eligibility-trace roll over the window is computed in closed form (suffix products of the hidden-to-hidden Jacobians plus a single time-contracting einsum) instead of by a per-step scan. The result is mathematically identical to the per-step roll, up to floating-point reassociation, but the dominant per-step elementwise passes over the parameter-sized trace become matmul-class kernels, which is roughly an order of magnitude faster on long windows. Relations without a chunk kernel (conv / sparse / LoRA / grouped), descended-scan relations, and relations with an active weight_fn / bias_fn transform fall back to the per-step scan automatically. Single-step input is unaffected. Note: chunking stacks the per-step Jacobians, which takes O(T · B · (I + H)) memory for a window of length T, instead of fusing the roll into the forward scan. For very long windows, either feed the sequence in smaller windows (the trace carries across calls) or set chunked_trace=False.
control_flow (ControlFlowPolicy | None) – Policy governing control-flow canonicalization (cond if-conversion, scan unrolling, structured scan descent, …) during graph compilation. None (default) uses DEFAULT_CONTROL_FLOW_POLICY.
name (str | None) – The name of the etrace algorithm.
mode (braintrace.mixin.Mode, optional) – The computing mode, indicating the batching behavior.
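
The closed-form roll described for chunked_trace can be illustrated with a small NumPy sketch. It is not the library's kernel: it assumes a purely diagonal hidden-to-hidden Jacobian and a single dense relation, and the names d, d_f, x, eps0 and suffix are made up for the illustration. It only shows why suffix products plus one time-contracting einsum reproduce the per-step roll.

```python
import numpy as np

rng = np.random.default_rng(0)
T, I, H = 8, 3, 4                          # window length, input size, hidden size
d = rng.uniform(0.5, 0.9, size=(T, H))     # assumed diag(D^t) at each step
d_f = rng.uniform(0.1, 1.0, size=(T, H))   # assumed diag(D_f^t) at each step
x = rng.normal(size=(T, I))                # inputs x^t
eps0 = rng.normal(size=(I, H))             # trace carried in from earlier calls

# Per-step roll: eps^t = d^t * eps^{t-1} + x^t (outer) d_f^t
eps_scan = eps0.copy()
for t in range(T):
    eps_scan = d[t][None, :] * eps_scan + x[t][:, None] * d_f[t][None, :]

# Closed form. suffix[t] is the product of d[s] over all later steps s > t,
# i.e. the discount that step t's contribution picks up by the window's end.
suffix = np.ones((T, H))
for t in range(T - 2, -1, -1):
    suffix[t] = suffix[t + 1] * d[t + 1]
carry = suffix[0] * d[0]                   # product of d over the whole window

# One einsum contracts the time axis for all steps at once.
eps_chunk = carry[None, :] * eps0 + np.einsum("ti,th->ih", x, suffix * d_f)

print(np.max(np.abs(eps_scan - eps_chunk)))  # differs only by float reassociation
```

The stacked arrays d, d_f and x are what the memory note refers to: the closed form needs all T steps available at once, whereas the per-step roll only needs the current step.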

Notes

The learning rule is

\[\begin{split}\begin{aligned} &\boldsymbol{\epsilon}^t \approx \mathbf{D}^t \boldsymbol{\epsilon}^{t-1}+\operatorname{diag}\left(\mathbf{D}_f^t\right) \otimes \mathbf{x}^t \\ & \nabla_{\boldsymbol{\theta}} \mathcal{L}=\sum_{t^{\prime} \in \mathcal{T}} \frac{\partial \mathcal{L}^{t^{\prime}}}{\partial \mathbf{h}^{t^{\prime}}} \circ \boldsymbol{\epsilon}^{t^{\prime}} \end{aligned}\end{split}\]

where \(\boldsymbol{\epsilon}^t\) is the per-parameter eligibility trace, \(\mathbf{D}^t\) the hidden-to-hidden Jacobian, \(\mathbf{D}_f^t\) the Jacobian of the hidden state with respect to the weight-operation output \(\mathbf{y}^t\), \(\mathbf{x}^t\) the presynaptic input, and \(\partial \mathcal{L}^{t'}/\partial \mathbf{h}^{t'}\) the learning signal back-propagated from the loss at each step.
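
As a rough illustration of the learning rule, the NumPy sketch below applies it to a toy dense layer and compares the result with finite differences. The toy dynamics h^t = a * h^{t-1} + tanh(x^t W), the linear per-step loss, and all function names are assumptions made for this example, not part of braintrace. The dynamics are chosen so that the hidden-to-hidden Jacobian is exactly diagonal, which makes the diagonal rule exact here; for general recurrent models it is an approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, I, H = 6, 3, 4                      # window length, input size, hidden size
W = rng.normal(size=(I, H)) * 0.5      # dense weight (the parameter theta)
a = rng.uniform(0.5, 0.9, size=H)      # per-unit decay, so D^t = diag(a) exactly
xs = rng.normal(size=(T, I))           # presynaptic inputs x^t
cs = rng.normal(size=(T, H))           # fixed learning signals dL^t/dh^t


def total_loss(W):
    """Run h^t = a * h^{t-1} + tanh(x^t W) and return sum_t <c^t, h^t>."""
    h = np.zeros(H)
    loss = 0.0
    for x, c in zip(xs, cs):
        h = a * h + np.tanh(x @ W)
        loss += c @ h
    return loss


def online_gradient(W):
    """Accumulate the gradient forward in time with a parameter-shaped trace."""
    trace = np.zeros((I, H))           # eligibility trace, one value per weight
    grad = np.zeros((I, H))
    for x, c in zip(xs, cs):
        y = x @ W
        d_f = 1.0 - np.tanh(y) ** 2    # diag(D_f^t): dh^t/dy^t for tanh
        trace = a[None, :] * trace + x[:, None] * d_f[None, :]  # trace update
        grad += c[None, :] * trace     # contract with the learning signal
    return grad


def finite_difference_gradient(W, step=1e-6):
    """Central-difference reference gradient of total_loss."""
    grad = np.zeros_like(W)
    for i in range(I):
        for j in range(H):
            dW = np.zeros_like(W)
            dW[i, j] = step
            grad[i, j] = (total_loss(W + dW) - total_loss(W - dW)) / (2 * step)
    return grad


grad_online = online_gradient(W)
grad_fd = finite_difference_gradient(W)
print(np.max(np.abs(grad_online - grad_fd)))
```

Note that online_gradient never stores past hidden states: the only quantity carried between steps is the (I, H) trace, which is the point of the parameter-dimension formulation.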

\(\mathbf{D}_f^t\) is read off by _compute_hid2weight_jacobian() from a single jax.jvp of the y -> hidden map with an all-ones tangent. See that method’s docstring for when this is exact (elementwise maps) and when it is an approximation (non-elementwise maps, e.g. a normalization layer between the weight op and the neuron). IODimVjpAlgorithm shares the same approximation.
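
The difference between the exact and approximate cases can be seen in a small NumPy sketch. An all-ones-tangent JVP returns the row sums of the Jacobian, which equal its diagonal only when the off-diagonal entries vanish. The helpers directional_derivative and jacobian_diagonal are hypothetical and use central differences as a stand-in for jax.jvp; the two toy maps are assumptions chosen for illustration and do not reproduce what _compute_hid2weight_jacobian() does internally.

```python
import numpy as np


def directional_derivative(f, y, v, step=1e-6):
    """Central-difference stand-in for jax.jvp(f, (y,), (v,))[1]."""
    return (f(y + step * v) - f(y - step * v)) / (2 * step)


def jacobian_diagonal(f, y):
    """Reference diagonal d f_i / d y_i, probed one basis vector at a time."""
    basis = np.eye(y.size)
    return np.array(
        [directional_derivative(f, y, basis[i])[i] for i in range(y.size)]
    )


def elementwise(y):
    # Each output depends only on its own input: the Jacobian is diagonal.
    return np.tanh(y)


def normalized(y):
    # Non-elementwise: every output depends on every input via mean and std.
    return np.tanh((y - y.mean()) / y.std())


y = np.array([0.5, -1.0, 2.0])
ones = np.ones_like(y)

# All-ones tangent gives J @ 1, i.e. the row sums of the Jacobian.
ones_jvp_elementwise = directional_derivative(elementwise, y, ones)
diag_elementwise = jacobian_diagonal(elementwise, y)

ones_jvp_normalized = directional_derivative(normalized, y, ones)
diag_normalized = jacobian_diagonal(normalized, y)

print("elementwise:", ones_jvp_elementwise, diag_elementwise)
print("normalized: ", ones_jvp_normalized, diag_normalized)
```

For the elementwise map the two vectors agree; for the normalized map they do not, because the off-diagonal Jacobian entries contribute to the row sums.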

Real-Time Recurrent Learning (RTRL) propagates the full sensitivity \(\partial \mathbf{h}^t/\partial \boldsymbol{\theta}\) forward in time, which costs \(O(|\theta| \cdot H)\) memory. D-RTRL keeps only the diagonal of the hidden-to-hidden Jacobian, collapsing the trace to one value per parameter. The trace is then contracted with the instantaneous learning signal at each step to accumulate the gradient, so there is no backward pass through time and memory is linear in the parameter count.

ParamDimVjpAlgorithm is a subclass of brainstate.nn.Module, and its behavior depends on the context/mode of the computation, in particular on brainstate.mixin.Batching.

For dense (linear) transformation layers this algorithm has \(O(B\theta)\) memory complexity, where \(\theta\) is the number of parameters and \(B\) the batch size. The weight gradients are computed with \(O(BIO)\) complexity, where \(I\) and \(O\) are the number of input and output dimensions.

For a convolutional layer the exact eligibility trace must keep one kernel-shaped slot per spatial output position: the kernel is spatially shared while the diagonal discount acts per output element, so a spatially pre-summed (kernel-shaped) trace cannot follow the recurrence. The conv trace therefore costs \(O(B S \theta)\) memory, where \(S\) is the number of spatial output positions and \(\theta\) the kernel parameter count. For large convolutions, prefer the IO-dim algorithm (pp_prop / IODimVjpAlgorithm), whose conv trace stays output-shaped.
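
To get a feel for these orders of growth, the plain-Python sketch below counts trace elements for example layer sizes. The helper names and the layer sizes are hypothetical, and the counts ignore constant factors and any other buffers the library may allocate, so treat them as order-of-magnitude estimates only.

```python
def dense_trace_elems(batch, n_in, n_out):
    """O(B * theta) with theta = n_in * n_out for a dense weight."""
    return batch * n_in * n_out


def conv_trace_elems(batch, spatial_out, kernel_params):
    """O(B * S * theta): one kernel-shaped slot per spatial output position."""
    return batch * spatial_out * kernel_params


def full_rtrl_elems(n_params, n_hidden):
    """O(|theta| * H): full sensitivity dh/dtheta kept by plain RTRL."""
    return n_params * n_hidden


batch = 16
bytes_per_elem = 4                                  # assuming float32 traces

# Dense layer, 256 inputs -> 256 hidden units.
dense = dense_trace_elems(batch, n_in=256, n_out=256)
rtrl = full_rtrl_elems(n_params=256 * 256, n_hidden=256)

# 3x3 convolution, 32 -> 64 channels, on a 32x32 output map.
kernel_params = 3 * 3 * 32 * 64
conv = conv_trace_elems(batch, spatial_out=32 * 32, kernel_params=kernel_params)

print(dense * bytes_per_elem / 2**20, "MiB for the dense trace")   # 4.0
print(conv * bytes_per_elem / 2**30, "GiB for the conv trace")     # 1.125
print(rtrl // (256 * 256), "x the parameter count for full RTRL")  # 256
```

With these example sizes the conv trace is a few hundred times larger than the dense one even though the kernel has fewer parameters than the dense weight, which is the situation in which the output-shaped IO-dim trace is the better fit.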

For more details, please see the D-RTRL algorithm presented in our manuscript.

Examples

>>> import brainstate
>>> import braintrace
>>>
>>> class RNN(brainstate.nn.Module):
...     def __init__(self):
...         super().__init__()
...         self.cell = braintrace.nn.ValinaRNNCell(1, 20, activation='tanh')
...         self.out = braintrace.nn.Linear(20, 1)
...     def update(self, x):
...         return x >> self.cell >> self.out
>>>
>>> model = RNN()
>>> x0 = brainstate.random.randn(1)
>>> # ``braintrace.D_RTRL`` is an alias of ``ParamDimVjpAlgorithm``; one call
>>> # initialises states, builds the trace graph, and returns a learner.
>>> learner = braintrace.compile(model, braintrace.D_RTRL, x0)
>>> y = learner(x0)             # forward pass + eligibility-trace update

get_etrace_of(weight)

Get the eligibility trace of the given weight.

Parameters: weight (ParamState | Tuple[str, ...]) – The weight whose eligibility trace is requested, given either as a brainstate.ParamState instance or as its path in the model.
Returns: A dictionary mapping (y_var id, hidden-group index) keys to the eligibility-trace values associated with the given weight.
Return type: Dict
Raises: ValueError – If no eligibility trace is found for the given weight.

init_etrace_state(*args, **kwargs)

Initialize the eligibility trace states of the etrace algorithm.

Call this method after compiling the etrace graph. See compile_graph() for details.

Return type: None

reset_state(batch_size=None, **kwargs)

Reset the eligibility trace states.

Parameters: batch_size (int | None) – The batch size used to reshape the reset trace states. Default None.
Return type: None
