AdaMax
Published on: August 3, 2021
In Adam, the update rule for individual weights scales their gradients inversely proportionally to a (scaled) $L^2$ norm of their individual current and past gradients.
The $L^2$ norm can be generalized to an $L^p$ norm, with the decay term correspondingly parameterized as $\beta_2^p$ instead of $\beta_2$.
Such variants generally become numerically unstable for large $p$. However, in the special case where we let $p \to \infty$, a surprisingly simple and stable algorithm emerges.
In this limit, the second-moment estimate $v_t$ turns into an exponentially weighted infinity norm of the past gradients. To avoid confusion with Adam, we use $u_t$ to denote this infinity-norm-constrained $v_t$.
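The max-based form of $u_t$ can be recovered by writing out the exponentially weighted $L^p$ moment and letting $p \to \infty$, following the derivation in the original Adam paper:

$$
u_t = \lim_{p \to \infty} \left(v_t^{(p)}\right)^{1/p}
    = \lim_{p \to \infty} \left((1 - \beta_2^p) \sum_{i=1}^{t} \beta_2^{p(t-i)} \, |g_i|^p\right)^{1/p}
    = \max\!\left(\beta_2^{t-1}|g_1|,\ \beta_2^{t-2}|g_2|,\ \ldots,\ |g_t|\right),
$$

which admits the simple recursive form $u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$ with $u_0 = 0$.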
We can now plug $u_t$ into the Adam update equation in place of $\sqrt{\hat{v}_t} + \epsilon$ to obtain the AdaMax update rule: $\theta_t = \theta_{t-1} - \dfrac{\alpha}{u_t}\,\hat{m}_t$, where $\hat{m}_t = m_t / (1 - \beta_1^t)$ is the bias-corrected first moment. Note that, unlike $v_t$ in Adam, $u_t$ needs no bias correction.
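To make the update concrete, here is a minimal NumPy sketch of a single AdaMax step. The function name `adamax_step`, the small `eps` safeguard in the denominator, and the toy quadratic objective in the demo are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def adamax_step(theta, g, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdaMax update for parameters `theta` given gradient `g`.

    m : exponential moving average of gradients (first moment)
    u : exponentially weighted infinity norm of past gradients (replaces v_t)
    t : 1-based step counter, used only to bias-correct m
    """
    m = beta1 * m + (1 - beta1) * g            # update biased first-moment estimate
    u = np.maximum(beta2 * u, np.abs(g))       # infinity-norm recursion; no bias correction needed
    m_hat = m / (1 - beta1 ** t)               # bias-correct the first moment
    theta = theta - alpha * m_hat / (u + eps)  # eps is a numerical safeguard, not in the paper's rule
    return theta, m, u

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta = rng.normal(size=5)                 # start from a random point
    m = np.zeros_like(theta)
    u = np.zeros_like(theta)
    for t in range(1, 1001):
        g = 2 * theta                          # gradient of the toy objective f(theta) = ||theta||^2
        theta, m, u = adamax_step(theta, g, m, u, t)
    print(theta)                               # coordinates have moved toward the minimum at 0
```

Because $u_t$ bounds the magnitude of $\hat{m}_t / u_t$, each coordinate moves by at most roughly $\alpha$ per step, which is one reason the algorithm stays stable without an explicit $\epsilon$ in the original rule.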