Prof. Panagiotis (Panos) Toulis
Panagiotis (Panos) Toulis

Panagiotis (Panos) Toulis
Assistant Prof. & John E. Jeuck Fellow
Econometrics and Statistics

Contact info
Address: 5807 S Woodlawn Ave
Office: 358
tel: 773.834.5953
Chicago, IL, 60637 panos.toulis [at]

Statistics of Stochastic Gradient Descent

Current draft: December 27, 2014

Section 1. Introduction

Consider a function $F(x, \xi)$ where $x$ is the input/parameters, and $\xi$ is random. In many interesting cases we wish to find the point $x_\star$ such that $\mathbb{E}(F(x_\star, \xi))= 0$. If $F$ was deterministic and differentiable, we could employ a method such as Newton-Raphson such as $ x_t = x_{t-1} -\gamma_t F(x_{t-1}),$ where $\gamma_t$ could be the inverse of the Jacobian (first-order derivatives) $J_F = (\partial F_i/x_j)$, or simply a positive sequence. In many cases, $F$ is the gradient (or subgradient) of some loss function which we wish to minimize. However, in our case of stochastic optimization, we only have access to "noisy" values of $F$, we have access to values $F(x_{t-1}, \xi) = M(x_{t-1}) + \epsilon_{t}$ where $M(\cdot)$ is deterministic but $\epsilon_t$ is random such that $\mathbf{E}(\epsilon_t) = 0$ -- in other words, if $F= \nabla \text{loss}$, we have unbiased estimates of the gradient. With noisy estimates, the aforementioned procedure can be written as $ x_t = x_{t-1} -\gamma_t F(x_{t-1}, \xi_t), \quad \quad (1)$ where $\xi_t$ indicates that we observe one random variable at every iteration.

In statistics, this framework comes up routinely in practice. Let $\xi$ be an observation produced from a model with fixed model parameters $x_\star$. The loss is the negative log-likelihood which we wish to minimize; thus $F(x, \xi) = -\nabla \ell(x; \xi)$ is the gradient of negative log-likelihood of parameter value $x$ for observation $\xi$. Our goal in statistics is to estimate the parameters $x_\star$ and assign some uncertainty to an estimate when we are using observed data. In this case, we write $ x_t = x_{t-1} + \gamma_t \nabla \ell(x_{t-1}; \xi_t). \quad \quad (2)$ If we assume that observations $\xi_t$ are produced iid by such a model, it is standard statistical theory that under certain regularity conditions $\mathbf{E}(\ell(x_\star; \xi_t)) = 0$, as required. Procedures (1) and (2) are equivalent variants of Stochastic Gradient Descent, or SGD for short. However, the respective iterates $x_t$ receive different treatment in stochastic optimization comprared to statistics. Roughly, in stochastic optimization, we usually care about worst-case and/or non-asymptotic performance, convergence of $x_t$ to some empirical minizers, and so on. In statistics, we view $x_t$ as an estimator of $x_\star$ and thus we are interested in its statistical properties, such as bias, variance, efficiency, and so on. This usually translates into trying to understand how much of the information that is contained in the data is used by our estimator; the focus is usually on asympotic efficiency.

A suprisingly rich variant of the classical SGD (2) is obtained if we consider a procedure as follows: $ x_t = x_{t-1} + \gamma_t \nabla \ell(x_{t}; \xi_t). \quad \quad (3)$ Note that the update in Equation (3) is implicit because the next iterate $x_t$ appears on both sides of the equation. Procedure (3) is the implicit Stochastic Gradient Descent method. It was introduced and analyzed for a large family of generalized linear models by Toulis (2014). In the limit both variations of SGD are statistically equivalent, however in small samples implicit SGD is significantly more stable. It is easy to establish that implicit SGD is using second-order information, implicitly through its definition (3). More details are given in Section 2b of this tutorial.

Section 1b. Quick examples
Section 2. The statistical framework
   Section 2b. Implicit stochastic gradient descent
   Section 2c. Optimal SGD: learning rates, parameterization.
Section 3. The optimization framework
   Section 3b. Convergence and bounds
Section 4. Averaging
   Section 4b. When averaging works and when it fails
Section 5. More examples
Booth Homepage | Booth Intranet | UC Homepage Copyright © 2016 Chicago Booth