AdaMax

Published on: August 3, 2021



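In Adam, the update rule for individual weights scales their gradients inversely proportional to the $\ell_2$ norm of their past and current gradients, accumulated in the second-moment estimate $v_t$:

$$v_t = \beta_2 v_{t-1} + (1 - \beta_2)\,|g_t|^2$$

Here $g_t$ is the gradient at step $t$ and $\beta_2$ is the exponential decay rate.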

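The $\ell_2$ norm can be generalized to the $\ell_p$ norm, with the decay rate $\beta_2$ parameterized as $\beta_2^p$, so that Adam's second-moment accumulator $v_t$ becomes:

$$v_t = \beta_2^p v_{t-1} + (1 - \beta_2^p)\,|g_t|^p$$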

Such variants generally become numerically unstable for large $p$, which is why the $\ell_1$ and $\ell_2$ norms are most common in practice. However, in the special case where we let $p \to \infty$, a surprisingly simple and stable algorithm emerges.
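The instability for large $p$ can be seen in a small experiment. The following is a minimal sketch, not taken from the original article: the helper names `lp_scale` and `linf_scale` and the gradient values are made up for illustration. It compares a naive $\ell_p$ accumulator with the max-based recursion that the $p \to \infty$ limit reduces to.

```python
def lp_scale(grads, beta2=0.999, p=2):
    """Naive l_p accumulator v_t, returned as v_t ** (1/p)."""
    v = 0.0
    for g in grads:
        v = beta2 ** p * v + (1.0 - beta2 ** p) * abs(g) ** p
    return v ** (1.0 / p)


def linf_scale(grads, beta2=0.999):
    """Max-based recursion: keep the larger of the decayed past value and |g|."""
    u = 0.0
    for g in grads:
        u = max(beta2 * u, abs(g))
    return u


# Hypothetical small gradients, chosen only to trigger underflow.
grads = [1e-3, 2e-3, 5e-4]

print(lp_scale(grads, p=2))    # well-behaved for small p
print(lp_scale(grads, p=200))  # |g| ** 200 underflows to 0.0 in float64
print(linf_scale(grads))       # stays on the scale of the largest gradient
```

With small gradients, raising $|g_t|$ to a large power underflows to zero, so the naive accumulator loses all information, while the max-based form never leaves the scale of the gradients themselves.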

To avoid confusion with Adam, we use $u_t$ to denote the infinity norm-constrained $v_t$:

$$u_t = \max(\beta_2 \cdot u_{t-1},\ |g_t|)$$
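For readers who want the intermediate step, the max-based form can be sketched as follows. This follows the argument in Kingma and Ba's Adam paper and assumes the accumulator starts at $v_0 = 0$; the summation index $i$ is introduced here only for the derivation. Unrolling the $\ell_p$ recursion and taking the $p$-th root gives:

$$
\begin{aligned}
v_t &= (1 - \beta_2^p) \sum_{i=1}^{t} \beta_2^{p(t-i)}\, |g_i|^p \\
u_t = \lim_{p \to \infty} (v_t)^{1/p}
    &= \lim_{p \to \infty} (1 - \beta_2^p)^{1/p} \left( \sum_{i=1}^{t} \left( \beta_2^{t-i}\, |g_i| \right)^p \right)^{1/p} \\
    &= \max_{1 \le i \le t}\ \beta_2^{t-i}\, |g_i|
\end{aligned}
$$

The factor $(1 - \beta_2^p)^{1/p}$ tends to 1, and the $p$-th root of a sum of $p$-th powers tends to the largest term. The resulting maximum over decayed past gradients can be computed recursively as $u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$ with $u_0 = 0$.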

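We can now plug $u_t$ into the Adam update equation, replacing $\sqrt{\hat{v}_t} + \epsilon$, to obtain the AdaMax update rule:

$$\theta_{t+1} = \theta_t - \frac{\eta}{u_t}\, \hat{m}_t$$

Here $\eta$ is the learning rate and $\hat{m}_t = m_t / (1 - \beta_1^t)$ is Adam's bias-corrected first-moment estimate.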

Code
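The following is a minimal NumPy sketch of the AdaMax update, not a reference implementation. The function names `adamax_step` and `minimize_quadratic` and the toy objective are illustrative assumptions. The defaults $\eta = 0.002$, $\beta_1 = 0.9$, $\beta_2 = 0.999$ follow the values suggested in the Adam paper, and the small `eps` in the denominator is an added safeguard against division by zero that is not part of the update rule as written above.

```python
import numpy as np


def adamax_step(theta, grad, m, u, t, lr=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one AdaMax update; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * grad      # first-moment estimate, as in Adam
    u = np.maximum(beta2 * u, np.abs(grad))   # infinity-norm accumulator u_t
    m_hat = m / (1.0 - beta1 ** t)            # bias correction for m only
    # eps is an assumption added here to guard against u == 0.
    theta = theta - lr * m_hat / (u + eps)
    return theta, m, u


def minimize_quadratic(steps=2000, lr=0.01):
    """Toy check: minimize f(theta) = ||theta - target||^2."""
    target = np.array([3.0, -2.0])            # hypothetical optimum
    theta = np.zeros(2)
    m = np.zeros(2)
    u = np.zeros(2)
    for t in range(1, steps + 1):
        grad = 2.0 * (theta - target)         # gradient of the quadratic
        theta, m, u = adamax_step(theta, grad, m, u, t, lr=lr)
    return theta, target


theta, target = minimize_quadratic()
print(theta, target)
```

Note that only `m` is bias-corrected. Because `u` is built from a max operation rather than an exponential average starting at zero, it is not biased towards zero in the same way.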

