Learning algorithms based on stochastic gradient approximations are known for their poor performance on optimization tasks and their extremely good performance on machine learning tasks (Bottou and Bousquet, 2008). Despite these proven capabilities, concerns have lingered about the difficulty of setting the adaptation gains and achieving robust performance. Stochastic gradient algorithms have historically been associated with back-propagation in multilayer neural networks, whose training can be a very challenging non-convex problem. Stochastic gradient algorithms are also notoriously hard to debug because they often appear to somehow work despite the bugs. Experimenters are then led to believe, incorrectly, that the algorithm itself is flawed.
It is therefore useful to see how stochastic gradient descent performs on simple linear and convex problems such as linear Support Vector Machines (SVMs) or Conditional Random Fields (CRFs). This page provides simple code examples illustrating the good properties of stochastic gradient descent algorithms. The provided source code values clarity over speed.
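To make the linear SVM case concrete, here is a minimal sketch of plain stochastic gradient descent on the L2-regularized hinge loss. It is not the code distributed with this project. The function names (`sgd_linear_svm`, `make_toy_data`, `accuracy`), the toy data, the default values of `lam` and `eta0`, and the gain schedule `eta0 / (1 + lam * eta0 * t)` are all assumptions chosen for illustration.

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def sgd_linear_svm(data, lam=1e-2, eta0=0.5, epochs=5, seed=0):
    """Plain SGD on lam/2 * ||w||^2 + mean(max(0, 1 - y * (w.x + b))).

    `data` is a list of (x, y) pairs with x a list of floats and y in {-1, +1}.
    The bias b is left unregularized (an illustrative choice).
    """
    rng = random.Random(seed)
    data = list(data)              # copy so the caller's list is not shuffled
    w = [0.0] * len(data[0][0])
    b = 0.0
    t = 0                          # number of updates performed so far
    for _ in range(epochs):
        rng.shuffle(data)          # visit examples in random order
        for x, y in data:
            eta = eta0 / (1.0 + lam * eta0 * t)   # assumed gain schedule
            margin = y * (dot(w, x) + b)
            # Gradient step on the regularization term: shrink w.
            w = [wi * (1.0 - eta * lam) for wi in w]
            # Gradient step on the hinge loss, active only inside the margin.
            if margin < 1.0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
            t += 1
    return w, b


def make_toy_data(n=200, seed=1):
    """Two Gaussian blobs separated along the first coordinate."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.choice((-1, 1))
        data.append(([rng.gauss(2.0 * y, 1.0), rng.gauss(0.0, 1.0)], y))
    return data


def accuracy(w, b, data):
    return sum(1 for x, y in data if y * (dot(w, x) + b) > 0) / len(data)


if __name__ == "__main__":
    toy = make_toy_data()
    w, b = sgd_linear_svm(toy)
    print("weights:", w, "bias:", b)
    print("training accuracy:", accuracy(w, b, toy))
```

Each update touches a single example, so the cost per step does not depend on the size of the training set. The sketch favors readability over speed in the same spirit as the project code.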
The second major release of this code includes a robust implementation of the averaged stochastic gradient descent algorithm (Ruppert, 1988), which consists of performing stochastic gradient descent iterations and simultaneously averaging the parameter vectors over time. When the stochastic gradient gains decrease with an appropriately slow schedule, Polyak and Juditsky (1992) have shown that the algorithm converges like a second-order stochastic gradient descent but at a much smaller computational cost. One can therefore hope to match the batch optimization performance after a single pass over the randomly shuffled training set (Fabian, 1978; Bottou and LeCun, 2004). Achieving one-pass learning in practice remains difficult because one often needs more than one pass simply to reach this favorable asymptotic regime. The gain schedule has a deep impact on this convergence. Finer analyses (Xu, 2010; Bach and Moulines, 2011) reveal useful guidelines for setting these learning rates. Xu (2010) also describes a wonderful way to perform the averaging operation efficiently when the training data is sparse. The resulting algorithm reaches near-optimal test set performance after only a couple of passes.
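The averaging idea can be sketched in a few lines. The code below is a minimal illustration, not the implementation shipped in this release: it runs ordinary SGD on a regularized hinge loss and keeps a running mean of the iterates. The names `asgd_linear_svm` and `avg_start`, the toy data, and the gain schedule `eta0 / (1 + lam * eta0 * t) ** 0.75` are assumptions; the exponent 0.75 is only one example of a slowly decreasing schedule. The averaging here is the naive dense version and does not include the sparse trick described by Xu (2010).

```python
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def asgd_linear_svm(data, lam=1e-2, eta0=0.5, epochs=2, avg_start=0, seed=0):
    """SGD on the regularized hinge loss with running averaging of iterates.

    Returns (w, w_avg): the last SGD iterate and the average of all iterates
    produced after the first `avg_start` updates (hypothetical parameter).
    If `avg_start` is at least the total number of updates, w_avg stays zero.
    """
    rng = random.Random(seed)
    data = list(data)              # copy so the caller's list is not shuffled
    dim = len(data[0][0])
    w = [0.0] * dim                # current SGD iterate
    w_avg = [0.0] * dim            # running mean of the iterates
    t = 0                          # number of updates performed so far
    n_avg = 0                      # number of iterates folded into w_avg
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            # Assumed slowly decreasing gain schedule.
            eta = eta0 / (1.0 + lam * eta0 * t) ** 0.75
            margin = y * dot(w, x)
            w = [wi * (1.0 - eta * lam) for wi in w]
            if margin < 1.0:
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
            t += 1
            if t > avg_start:
                # Incremental mean: w_avg <- w_avg + (w - w_avg) / n_avg.
                n_avg += 1
                w_avg = [ai + (wi - ai) / n_avg for ai, wi in zip(w_avg, w)]
    return w, w_avg


def make_toy_data(n=200, seed=1):
    """Two Gaussian blobs; the trailing 1.0 feature plays the role of a bias."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y = rng.choice((-1, 1))
        data.append(([rng.gauss(2.0 * y, 1.0), rng.gauss(0.0, 1.0), 1.0], y))
    return data


def accuracy(w, data):
    return sum(1 for x, y in data if y * dot(w, x) > 0) / len(data)


if __name__ == "__main__":
    toy = make_toy_data()
    w_last, w_avg = asgd_linear_svm(toy)
    print("last iterate accuracy:    ", accuracy(w_last, toy))
    print("averaged iterate accuracy:", accuracy(w_avg, toy))
```

Setting `avg_start` to the number of examples delays averaging until after the first pass, which is one simple way to keep the early transient out of the average.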
- Changes to previous version:
Version 2.0 adds averaged stochastic gradient descent (ASGD).