make_fenchel_young_loss

class braintools.metric.make_fenchel_young_loss(max_fun)

Create a Fenchel-Young loss function from a max function.

Fenchel-Young losses provide a framework for building differentiable loss functions from convex regularizers. They are particularly useful in machine learning for structured prediction tasks and provide a principled way to construct losses that encourage sparsity or specific structure in predictions.

Given a strictly convex regularizer \(\Omega\), its convex conjugate (a.k.a. the max function or log-partition / soft-max function) is

\[\Omega^*(\theta) = \max_{\mu \in \mathcal{C}} \; \langle \theta, \mu \rangle - \Omega(\mu),\]

and the associated Fenchel-Young loss is

\[\ell_{FY}(\theta, y) = \Omega^*(\theta) - \langle \theta, y \rangle,\]

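where \(\theta\) are the scores and \(y\) is the target. max_fun is exactly this conjugate \(\Omega^*\) (NOT the regularizer \(\Omega\) itself). This expression drops the term \(\Omega(y)\) of the full Fenchel-Young loss \(\Omega^*(\theta) + \Omega(y) - \langle \theta, y \rangle\); that term is constant in \(\theta\), so gradients are unaffected, and it vanishes for one-hot targets under the negative-entropy regularizer that pairs with logsumexp. When max_fun is a genuine convex conjugate and the target \(y\) lies in the marginal polytope \(\mathcal{C}\), the loss is convex in \(\theta\), and the full loss (including \(\Omega(y)\)) is non-negative and zero iff the prediction \(\nabla \Omega^*(\theta)\) matches the target. (The "zero iff" guarantee relies on strict convexity of \(\Omega\), so it fails for a plain max, and none of these guarantees hold for an arbitrary max_fun that is not a convex conjugate.) Its gradient w.r.t. the scores is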

\[\nabla_\theta \ell_{FY}(\theta, y) = \hat{y}(\theta) - y, \qquad \hat{y}(\theta) = \nabla \Omega^*(\theta),\]

i.e. the prediction minus the target. For max_fun = logsumexp we have \(\nabla \Omega^*(\theta) = \mathrm{softmax}(\theta)\), which recovers the softmax cross-entropy loss whenever the target sums to 1.
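The formula above can be sketched in a few lines of JAX. This is a minimal illustration, not the braintools implementation: the name sketch_fenchel_young_loss is hypothetical, and the use of jnp.vectorize with the "(n)->()" signature is an assumption based on the parameter description on this page.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp


def sketch_fenchel_young_loss(max_fun):
    """Hypothetical re-implementation, for illustration only."""
    # Apply max_fun over the last axis and broadcast over leading axes.
    vectorized_max = jnp.vectorize(max_fun, signature="(n)->()")

    def loss(scores, targets):
        # Omega^*(theta) - <theta, y>, with the inner product over the last axis.
        return vectorized_max(scores) - jnp.sum(scores * targets, axis=-1)

    return loss


sketch_loss = sketch_fenchel_young_loss(logsumexp)
scores = jnp.array([[2.0, 1.0, 0.5], [1.5, 2.5, 1.0]])
targets = jnp.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
values = sketch_loss(scores, targets)

# Reference value: softmax cross-entropy computed from log_softmax.
cross_entropy = -jnp.sum(targets * jax.nn.log_softmax(scores, axis=-1), axis=-1)

# Gradient of the summed loss with respect to the scores.
grad = jax.grad(lambda s: sketch_loss(s, targets).sum())(scores)
```

With logsumexp as the max function, values should agree with cross_entropy, and grad should equal softmax(scores) minus targets, matching the gradient formula above.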

Parameters: max_fun (MaxFun) – The max function \(\Omega^*\) (the convex conjugate of the regularizer) on which the Fenchel-Young loss is built. It must map a score vector over the last dimension to a scalar, consistent with the vectorize signature "(n)->()". Common choices include jax.scipy.special.logsumexp for softmax-based losses or custom max functions for structured outputs.
Returns: A Fenchel-Young loss function with signature fenchel_young_loss(scores, targets, *args, **kwargs) that computes the loss between scores and targets. Any extra *args/**kwargs are forwarded to max_fun.
Return type: callable

Notes

Warning

The resulting loss operates over the last dimension of the input arrays and accepts arbitrary leading dimensions. This differs from some other implementations that flatten inputs into 1D vectors.
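To make the shape convention concrete, the sketch below evaluates the loss formula directly with logsumexp rather than calling the library. The batch layout (4 sequences, 5 time steps, 3 classes) and the variable names are hypothetical. It checks that reducing over the last axis gives the same values as flattening the leading dimensions first.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp

# Hypothetical batch: 4 sequences, 5 time steps, 3 classes.
key = jax.random.PRNGKey(0)
scores = jax.random.normal(key, (4, 5, 3))
labels = jnp.arange(20).reshape(4, 5) % 3
targets = jax.nn.one_hot(labels, 3)

# Omega^*(theta) - <theta, y>, reduced over the last axis only.
loss = logsumexp(scores, axis=-1) - jnp.sum(scores * targets, axis=-1)

# Same computation after flattening the leading dimensions to one batch axis.
flat_scores = scores.reshape(-1, 3)
flat_targets = targets.reshape(-1, 3)
flat_loss = logsumexp(flat_scores, axis=-1) - jnp.sum(flat_scores * flat_targets, axis=-1)
```

Here loss keeps the leading shape (4, 5), one value per score vector, and matches flat_loss once that is reshaped back.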

Warning

The gradient \(\hat{y}(\theta) - y\) is obtained by autodiff of max_fun. This is only correct when \(\Omega^*\) is smooth (i.e. differentiable), as it is for logsumexp. Sparse / piecewise-linear conjugates such as sparsemax or entmax are non-smooth: their argmax is set-valued at kink points and plain autodiff of max_fun gives a wrong or undefined gradient. Supporting those correctly requires registering a custom_vjp whose backward pass returns the sparse prediction oracle \(\hat{y}(\theta) - y\); this is not implemented here (future work). Only pass a smooth, differentiable max_fun.

The choice of max function determines the properties of the resulting loss:

logsumexp: Creates a softmax-based cross-entropy loss
max: Creates the (non-smooth) perceptron loss, max(scores) minus the score of the target; use only for the forward value, not for gradients (see warning above)
Custom smooth functions: Can create structured losses for specific applications
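One way to see how these choices relate is a temperature-scaled logsumexp, which stays smooth while approaching the plain max as the temperature shrinks. The helper tempered_logsumexp and its temperature parameter are hypothetical names introduced for this sketch; they are not part of braintools.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp


def tempered_logsumexp(x, temperature=0.1):
    # Hypothetical helper: t * logsumexp(x / t) lies between max(x) and
    # max(x) + t * log(n), and is differentiable for every t > 0.
    return temperature * logsumexp(x / temperature)


theta = jnp.array([2.0, 1.0, 0.5])
hard = jnp.max(theta)                 # non-smooth choice
soft = logsumexp(theta)               # softmax choice (temperature 1)
tempered = tempered_logsumexp(theta)  # smooth, close to the max

# The gradient of the tempered function is a sharpened softmax.
tempered_grad = jax.grad(tempered_logsumexp)(theta)
```

Since the Returns description states that extra keyword arguments are forwarded to max_fun, a parameter like temperature could presumably be supplied at call time instead of through its default; that usage is an assumption and is not shown on this page.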

Examples

Create a softmax-based Fenchel-Young loss:

>>> import jax.numpy as jnp
>>> from jax.scipy.special import logsumexp
>>> import braintools
>>> # Create the loss function
>>> fy_loss = braintools.metric.make_fenchel_young_loss(max_fun=logsumexp)
>>> # Example usage
>>> scores = jnp.array([[2.0, 1.0, 0.5], [1.5, 2.5, 1.0]])
>>> targets = jnp.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
>>> loss = fy_loss(scores, targets)
>>> print(loss.shape)
(2,)

The gradient is the softmax prediction minus the target:

>>> import jax
>>> grad = jax.grad(lambda s, t: fy_loss(s, t).sum())(scores, targets)
>>> print(jnp.allclose(grad, jax.nn.softmax(scores, axis=-1) - targets))
True

Create a custom smooth max function for structured prediction. The function must return a SCALAR per core call (consistent with "(n)->()"):

>>> def custom_max(x):
...     return logsumexp(x) + 0.1 * jnp.sum(x ** 2)  # logsumexp plus a quadratic term
>>> structured_loss = braintools.metric.make_fenchel_young_loss(max_fun=custom_max)
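As a quick check of what such a custom function predicts, the sketch below differentiates custom_max directly, without going through the library. The score vector theta is a hypothetical input. By the gradient formula above, the prediction is the gradient of the max function, which here is softmax(x) + 0.2 * x.

```python
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp


def custom_max(x):
    # Same function as in the example above: logsumexp plus a quadratic term.
    return logsumexp(x) + 0.1 * jnp.sum(x ** 2)


theta = jnp.array([2.0, 1.0, 0.5])
value = custom_max(theta)                  # scalar, as "(n)->()" requires
prediction = jax.grad(custom_max)(theta)   # autodiff prediction
expected = jax.nn.softmax(theta) + 0.2 * theta
```

The quadratic term shifts the prediction off the probability simplex (its entries sum to 1.7 for this theta), so targets for such a loss need not be probability vectors.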
