downhill.adaptive.Adam

class downhill.adaptive.Adam(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

Adam optimizer using unbiased gradient moment estimates.

Parameters:

learning_rate: float, optional (default 1e-4)

Step size to take during optimization.

beta1_decay: float, optional (default 1 - 1e-6)

Decay the \(\beta_1\) EWMA weight by this factor \(\lambda\) after every update (see Notes).

beta1_halflife: float, optional (default 7)

Compute first-moment (mean) gradient estimates using an exponentially weighted moving average that decays with this halflife.

beta2_halflife: float, optional (default 69)

Compute second-moment (squared-magnitude) gradient estimates using an exponentially weighted moving average that decays with this halflife.

rms_regularizer: float, optional (default 1e-8)

Regularize RMS gradient values by this \(\epsilon\).

momentum: float, optional (default 0)

Momentum to apply to the updates, if any. Defaults to 0 (no momentum). Set to a value close to 1 (e.g., 1 - 1e-4) for large amounts of momentum.

nesterov: bool, optional (default False)

Set this to True to enable Nesterov-style momentum updates whenever momentum is nonzero.
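
To make these parameters concrete, here is a minimal sketch of running this optimizer on a toy least-squares problem. It assumes the usual downhill workflow of building an optimizer and then iterating over it (see the Methods summary below), and it assumes that the hyperparameters listed above are accepted as keyword arguments when iterating; the toy data and variable names are illustrative only.

    import numpy as np
    import theano
    import theano.tensor as TT
    import downhill

    FLOAT = theano.config.floatX

    # Toy least-squares problem: find w minimizing mean((x . w - y) ** 2).
    x = TT.matrix('x')
    y = TT.vector('y')
    w = theano.shared(np.zeros(10, FLOAT), name='w')
    loss = TT.sqr(TT.dot(x, w) - y).mean()

    opt = downhill.build('adam', loss=loss, params=[w], inputs=[x, y])

    # Illustrative training/validation data.
    X = np.random.randn(100, 10).astype(FLOAT)
    Y = np.random.randn(100).astype(FLOAT)

    # Assumption: the Adam hyperparameters documented above are passed as
    # keyword arguments when running the optimizer, as with the other
    # downhill optimizers.
    for train_mon, valid_mon in opt.iterate(
            [X, Y], [X, Y],
            max_updates=100,
            learning_rate=1e-4,
            beta1_halflife=7,
            beta2_halflife=69,
            rms_regularizer=1e-8):
        print(train_mon['loss'], valid_mon['loss'])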

Notes

The Adam method uses the same general strategy as all first-order stochastic gradient methods, in the sense that these methods make small parameter adjustments iteratively using local derivative information.

The difference here is that, as gradients are computed during each parameter update, exponentially weighted moving averages (EWMAs) of (1) the first moment and (2) the second moment of the recent gradient values are maintained as well. At each update, the step taken is proportional to the ratio of the bias-corrected first moment to the square root of the bias-corrected second moment.

\[\begin{split}\begin{eqnarray*}
\beta_1^t &=& \beta_1 \lambda^{t} \\
f_{t+1} &=& \beta_1^t f_t + (1 - \beta_1^t) \frac{\partial\mathcal{L}}{\partial\theta} \\
g_{t+1} &=& \beta_2 g_t + (1 - \beta_2) \left(\frac{\partial\mathcal{L}}{\partial\theta}\right)^2 \\
\theta_{t+1} &=& \theta_t - \frac{f_{t+1} / (1 - \beta_1^t)}{\sqrt{g_{t+1} / (1 - \beta_2)} + \epsilon}
\end{eqnarray*}\end{split}\]
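
As a concrete reading of these equations, the NumPy sketch below performs a single update for one parameter array. It is an illustration, not the library's Theano implementation, and it scales the step by the learning_rate value \(\alpha\), which the display above leaves implicit.

    import numpy as np

    def adam_step(theta, grad, f, g, t,
                  alpha=1e-4,    # learning_rate
                  beta1=0.906,   # from beta1_halflife=7 (see below)
                  beta2=0.990,   # from beta2_halflife=69 (see below)
                  lam=1 - 1e-6,  # beta1_decay
                  eps=1e-8):     # rms_regularizer
        # One Adam update mirroring the equations above (illustrative only).
        beta1_t = beta1 * lam ** t                # decayed first-moment weight
        f = beta1_t * f + (1 - beta1_t) * grad    # EWMA of the gradient
        g = beta2 * g + (1 - beta2) * grad ** 2   # EWMA of the squared gradient
        f_hat = f / (1 - beta1_t)                 # bias-corrected first moment
        g_hat = g / (1 - beta2)                   # bias-corrected second moment
        theta = theta - alpha * f_hat / (np.sqrt(g_hat) + eps)
        return theta, f, g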

Like all adaptive optimization algorithms, this optimizer effectively maintains a sort of parameter-specific momentum value. It shares with RMSProp and ADADELTA the idea of using an EWMA to track recent quantities related to the stochastic gradient during optimization. But the Adam method is unique in that it incorporates an explicit computation to remove the bias from these estimates.

In this implementation, \(\epsilon\) regularizes the RMS values and is given by the rms_regularizer keyword argument. The weight parameters \(\beta_1\) and \(\beta_2\) for the first- and second-moment EWMAs are computed from the beta1_halflife and beta2_halflife keyword arguments, respectively: a halflife of \(h\) corresponds to the EWMA weight \(\gamma = e^{\frac{-\ln 2}{h}}\), so longer halflives give weights closer to 1. The decay \(\lambda\) applied to the \(\beta_1\) EWMA is given by the beta1_decay keyword argument.
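
Plugging the default halflives into this formula gives the concrete EWMA weights; the short check below (plain Python, for illustration) shows the arithmetic.

    import math

    beta1 = math.exp(-math.log(2) / 7)   # beta1_halflife=7  -> approximately 0.906
    beta2 = math.exp(-math.log(2) / 69)  # beta2_halflife=69 -> approximately 0.990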

The implementation here is taken from Algorithm 1 of [King15].

References

[King15] D. Kingma & J. Ba. “Adam: A Method for Stochastic Optimization.” ICLR 2015. http://arxiv.org/abs/1412.6980
__init__(loss, params=None, inputs=None, updates=(), monitors=(), monitor_gradients=False)

Methods

__init__(loss[, params, inputs, updates, ...])
evaluate(dataset) Evaluate the current model parameters on a dataset.
get_updates(**kwargs) Get parameter update expressions for performing optimization.
iterate([train, valid, max_updates]) Optimize a loss iteratively using a training and validation dataset.
minimize(*args, **kwargs) Optimize our loss exhaustively.
set_params([targets]) Set the values of the parameters to the given target values.
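
Continuing the toy example from the Parameters section above (reusing its loss, w, x, y, X, and Y), the same optimization can also be run in a single call via minimize, which drives iterate to completion; the keyword arguments shown rest on the same assumptions as before.

    # Continuation of the toy example above; not self-contained on its own.
    opt = downhill.build('adam', loss=loss, params=[w], inputs=[x, y])
    opt.minimize([X, Y], [X, Y], learning_rate=1e-4, max_updates=100)
    print(w.get_value())  # fitted weights after optimization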