Published on: July 10, 2021

Instead of inefficiently storing previous squared gradients, the sum of gradients is recursively defined as a decaying average of all past squared gradients. The running average $E[g^2]_t$ at time step $t$ therefore only depends on the previous average and the current gradient [2]:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1 - \gamma)\, g_t^2$$

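To see why this behaves like a decaying average, the recursion can be unrolled (written here for an accumulator initialized at zero): each older squared gradient is weighted down by one more factor of $\gamma$.

$$E[g^2]_t = (1 - \gamma)\left(g_t^2 + \gamma\, g_{t-1}^2 + \gamma^2\, g_{t-2}^2 + \dots + \gamma^{t-1} g_1^2\right)$$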
$\gamma$ is usually set to around 0.9. Rewriting SGD updates in terms of the parameter update vector $\Delta\theta_t$:

$$\Delta\theta_t = -\eta \cdot g_t, \qquad \theta_{t+1} = \theta_t + \Delta\theta_t$$

Using this decaying average to scale the gradient (with learning rate $\eta$ and smoothing term $\epsilon$), the update becomes:

$$\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t + \epsilon}}\, g_t$$

The authors note that the units in the weight update don't match, i.e., the update should have the same hypothetical units as the parameters/weights. To realize this, they define another exponentially decaying average, this time of squared parameter updates, and use its root mean square:

$$E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1 - \gamma)\, \Delta\theta_t^2, \qquad RMS[\Delta\theta]_t = \sqrt{E[\Delta\theta^2]_t + \epsilon}$$

Since $RMS[\Delta\theta]_t$ is unknown (it would require the current update $\Delta\theta_t$), it's approximated with the RMS of the parameter updates up to the previous time step, $RMS[\Delta\theta]_{t-1}$. Replacing the learning rate $\eta$ with this term yields the Adadelta update rule:

$$\Delta\theta_t = -\frac{RMS[\Delta\theta]_{t-1}}{\sqrt{E[g^2]_t + \epsilon}}\, g_t, \qquad \theta_{t+1} = \theta_t + \Delta\theta_t$$

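To make the recipe concrete, here is a minimal NumPy sketch of a single Adadelta step following the formulas above; the function name `adadelta_step`, the choice of $\gamma = 0.9$ and $\epsilon = 10^{-6}$, and the toy quadratic objective are illustrative assumptions, not something prescribed by the paper.

```python
import numpy as np

def adadelta_step(theta, grad, avg_sq_grad, avg_sq_update, gamma=0.9, eps=1e-6):
    """One Adadelta parameter update (no global learning rate)."""
    # E[g^2]_t: decaying average of squared gradients
    avg_sq_grad = gamma * avg_sq_grad + (1 - gamma) * grad ** 2
    # Delta theta_t: uses RMS of updates up to t-1 (avg_sq_update not yet refreshed)
    update = -np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * grad
    # E[dtheta^2]_t: decaying average of squared parameter updates
    avg_sq_update = gamma * avg_sq_update + (1 - gamma) * update ** 2
    return theta + update, avg_sq_grad, avg_sq_update

# Toy usage: minimize f(theta) = theta^2, whose gradient is 2 * theta
theta = np.array([5.0])
avg_sq_grad = np.zeros_like(theta)
avg_sq_update = np.zeros_like(theta)
for _ in range(1000):
    grad = 2 * theta
    theta, avg_sq_grad, avg_sq_update = adadelta_step(
        theta, grad, avg_sq_grad, avg_sq_update
    )
print(theta)  # theta has moved toward the minimum at 0
```

Note that the step size comes entirely from the ratio of the two running RMS terms, which is why Adadelta doesn't need a hand-tuned learning rate.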
For more information on how to derive this formula, take a look at 'An overview of gradient descent optimization algorithms' by Sebastian Ruder and the original Adadelta paper by Matthew D. Zeiler.