Aleksandar Armacki, Dragana Bajović, Dušan Jakovetić, Soummya Kar, Ali H. Sayed
This paper establishes tight long-term tail decay rates for SGD and clipped SGD in non-convex optimization, showing significantly faster decay than previously known results.
This research investigates how the stochastic gradient descent (SGD) algorithm behaves over time, particularly the probability that its iterates incur large errors. While previous studies have focused on short-term (finite-time) error probabilities, this paper examines the long-term behavior, which is more relevant for algorithms run over many iterations. The authors show that the probability of large errors decays much faster over time than previously believed, especially for clipped SGD, a variant that limits the size of each gradient update in order to cope with heavy-tailed gradient noise. This means the algorithm is more reliable over long horizons than earlier analyses suggested.
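To make the clipping idea concrete, below is a minimal sketch of clipped SGD on a toy quadratic with heavy-tailed (Student-t) gradient noise. The function names, step size, clipping radius, and noise model are illustrative assumptions for this sketch, not the paper's exact algorithm or parameter choices.

```python
import numpy as np

def clipped_sgd(grad_fn, x0, step_size=0.05, clip_radius=2.0, num_iters=5000, rng=None):
    """Sketch of clipped SGD: each stochastic gradient is rescaled onto a ball
    of radius `clip_radius` before the update, so a single heavy-tailed noise
    sample cannot move the iterate arbitrarily far."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(num_iters):
        g = grad_fn(x, rng)                  # stochastic gradient estimate
        norm = np.linalg.norm(g)
        if norm > clip_radius:               # clip: project onto the ball boundary
            g = g * (clip_radius / norm)
        x = x - step_size * g                # standard SGD step with the clipped gradient
    return x

# Toy example (assumed for illustration): quadratic objective with Student-t noise,
# whose heavy tails occasionally produce very large gradient samples.
def noisy_grad(x, rng):
    return 2.0 * x + rng.standard_t(df=2.0, size=x.shape)

x_final = clipped_sgd(noisy_grad, x0=np.ones(5))
print("final iterate norm:", np.linalg.norm(x_final))
```

In this sketch, plain SGD would occasionally take a huge step whenever the Student-t noise produces an outlier, whereas the clipping step bounds the per-iteration movement, which is the mechanism behind the improved tail behavior discussed above.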