Prof. Panagiotis (Panos) Toulis

Panagiotis (Panos) Toulis
Assistant Professor of Econometrics and Statistics
Office: 358
tel: 773.834.5953
panos.toulis@ChicagoBooth.edu

## What is implicit stochastic gradient descent?

Let $z_n = (x_n, y_n)$ be i.i.d. observations of a model with parameters $\theta_\star$. To estimate $\theta_\star$ through classical stochastic gradient descent (SGD) amounts to the procedure $\theta_n = \theta_{n-1} + \gamma_n C \nabla \ell(\theta_{n-1}; z_n)$, where $\ell$ is the log-likelihood function, $\gamma_n$ is the learning rate of the procedure (e.g., $\gamma_n = \gamma / (\gamma_0 + t))$, and $C$ is an appropriate symmetric psd condition matrix. This approach is well-studied and has been used extensively in applications. However, one major issue is stability: classical SGD is heavily affected by misspecifications of the learning rate $\gamma_n$, and can often explode. Currently, this issue is solved in an ad-hoc way (e.g., projected SGD, gradient clipping, robust transformations, etc.) Implicit SGD is a modification of the classical algorithm that aims to solve this problem in a principled way. Often, we can write the method as follows $\boldsymbol{\theta}_n = \theta_{n-1} + \gamma_n C \nabla \ell(\boldsymbol{\theta}_n; z_n)$, The update in this procedure is implicit because $\theta_n$ appears on both sides of the equation. In general such updates are hard to compute, however for a large family of statistical models (e.g., generalized linear models, M-estimation) the implicit update can be computed very fast (see Algorithm 1, Toulis et.al. (2014)). Running implicit SGD is then equivalent to running $\theta_n = \theta_{n-1} + \gamma_n \xi_n \nabla \ell(\theta_{n-1}^\intercal x_n; y_n)$, where the classical gradient $\ell(\theta_{n-1}^\intercal x_n; y_n)$ is scaled by $\xi_n \in \mathbb{R}$ that satisfies a fixed-point equation: $\mathbf{\xi_n} s_n = \ell'(\theta_{n-1}^\intercal x_n + \gamma_n \xi_n ||x_n||^2s_n; y_n)$, where $s_n = \ell'(\theta_{n-1}^\intercal x_n; y_n)$---$\ell'$ denotes the derivative of $\ell(u; y)$ with respect to the scalar $u$. The scalar $\xi_n$ is one-dimensional and so easy to compute as narrow search bounds are available. We can also show that $\theta_n \to \theta_\star$ and also give asymptotic variance results. In particular, if $n \gamma_n \to \gamma > 0$, then for both standard and implicit SGD, $t \cdot \mathbb{E}(\theta_n - \theta_\star) (\theta_n - \theta_\star)^\intercal \to \gamma^2 (2\gamma C \mathcal{I}(\theta_\star) - \mathbb{I})^{-1} C \mathcal{I}(\theta_\star) C, \quad \quad (1)$ where $\mathcal{I}(\theta) = \mathbb{E} (\nabla \ell(\theta; z) \nabla \ell(\theta; z)^\intercal) = - \mathbb{E}(\nabla^2 \ell(\theta; z))$ is the Fisher information matrix of the model, and $\mathbb{I}$ is the identity matrix -- form (1) is proved here (note: $C$ needs to commute with $\mathcal{I}(\theta_\star)$). Result (1) reveals that SGD procedures incur an information loss. In order to reduce the information loss, the condition matrix needs to account for the Fisher information; generally, $C$ needs to account for the curvature of the regression function of the statistic used in the SGD iterate; in the typical cases shown above, this curvature coincides with the Fisher information. Optimal estimation is obtained only when $C = \mathcal{I}(\theta_\star)^{-1}$, in which case the variance of $\theta_n$ is $(1/t) \mathcal{I}(\theta_\star)^{-1}$ i.e., the variance of the asymptotically optimal MLE. I discuss more details in my tutorial on Stochastic Gradient Descent (under progress).

## Why implicit?

Implicit SGD has significant stability advantages over the classical SGD. Intuitively, it is a shrinked version of the typical (explicit) SGD and the shrinkage factor depends on the observed Fisher information. Thus, implicit SGD is significantly more stable than explicit SGD. In the limit, both methods are statistically equivalent, but in small-to-moderate samples, implicit SGD is more stable, and robust to misspecifications of the learning rate, or outliers. Implicit SGD can also be viewed as a stochastic proximal algorithm since it solves $\theta_n = \arg \max_{\theta} \{ \gamma_n\ell(\theta; z_n) - \frac{1}{2} ||\theta-\theta_{n-1}||^2 \}$. This formulation has a Bayesian interpretation in which $\theta_n$ is the posterior mode of a model where the prior distribution of $\theta_n$ is a normal $\mathcal{N}(\theta_{n-1}, \gamma_n \mathbb{I})$ and the data distribution is $\exp(\ell(\theta; z_n))$; note that the prior is shrinking towards $\theta_{n-1}$ as $n\to\infty$.

## Relevant theory and code

• PT, Edoardo M. Airoldi, "Asymptotic and finite-sample properties of estimators based on stochastic gradients", Annals of Statistics, 2016, forthcoming (draft,   pdf)
• PT, Jason Rennie, Edoardo Airoldi, "Statistical analysis of stochastic gradient methods for generalized linear models", International Conference of Machine Learning, 2014, Beijing, China (ICML' 14,   www,   pdf)
• PT, Edoardo M. Airoldi, "Scalable estimation strategies based on stochastic approximations: Classical results and new insights", Statistics and Computing, 2015 (  www).
• PT, Edoardo Airoldi, "Stochastic gradient methods for principled estimation with large datasets", in "Handbook of Big Data", CRC Press, 2016 (eds. Buhlmann et. al.)-(  www)
• PT, Edoardo M. Airoldi, "Implicit stochastic approximation" (2015,  pdf)
 Booth Homepage | Booth Intranet | UC Homepage Copyright © 2016 Chicago Booth