Implicit Gradient Regularization

It is surprising that gradient descent is so good at optimising over-parameterised models such as deep neural networks, since these models typically have far more parameters than training data points and should be prone to overfitting. One strategy to avoid overfitting is to introduce explicit regularisation, biasing models toward solutions with desirable properties, such as robustness to noise. However, gradient descent often works well without explicit regularisation. Using backward analysis, we find that gradient descent implicitly regularises the second moment of the loss gradient, a phenomenon we call implicit gradient regularisation (IGR). This quantity is particularly interesting because it drives models toward solutions that are robust to parameter and data variations, consistent with a wide body of numerical observations. We also predict that the regularisation constant is proportional to the learning rate and the network size, which we confirm numerically for deep networks trained on MNIST classification. More broadly, our work indicates that backward analysis is a useful theoretical approach to the perennial question of how learning rate, model size, and parameter initialisation interact to determine the properties of over-parameterised models optimised with gradient descent.
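
The backward-analysis claim can be illustrated numerically. The sketch below assumes the modified-loss form L̃(θ) = L(θ) + (h/4)‖∇L(θ)‖², where h is the learning rate and the extra term is the implicit gradient penalty; on a 1-D quadratic loss L(θ) = aθ²/2, both flows can be solved exactly, so we can check that one gradient-descent step tracks the modified flow more closely than the original flow. Function names (`gd_step`, `exact_flow`, `modified_flow`) and the constants are illustrative, not from the paper.

```python
import numpy as np

# 1-D quadratic loss L(theta) = a * theta**2 / 2, so grad L = a * theta.
a, theta0 = 2.0, 1.0

def gd_step(theta, h):
    # One gradient-descent step: theta' = theta - h * grad L(theta).
    return theta - h * a * theta

def exact_flow(theta, h):
    # Exact solution of the original gradient flow d(theta)/dt = -a * theta
    # integrated over time h.
    return theta * np.exp(-a * h)

def modified_flow(theta, h):
    # Modified loss from backward analysis (assumed form):
    # L_tilde = L + (h/4) * (grad L)**2 = (a/2 + h * a**2 / 4) * theta**2,
    # giving an effective curvature a_tilde = a + h * a**2 / 2.
    a_tilde = a + 0.5 * h * a**2
    return theta * np.exp(-a_tilde * h)

for h in (0.1, 0.05, 0.025):
    e_orig = abs(gd_step(theta0, h) - exact_flow(theta0, h))     # O(h^2) gap
    e_mod = abs(gd_step(theta0, h) - modified_flow(theta0, h))   # O(h^3) gap
    print(f"h={h}: |GD - flow| = {e_orig:.2e}, |GD - modified flow| = {e_mod:.2e}")
```

Halving h roughly quarters the gap to the original flow but shrinks the gap to the modified flow by about a factor of eight, consistent with the modified flow describing the discrete dynamics one order more accurately.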

Authors' notes