## What is implicit stochastic gradient descent?

Let $z_n = (x_n, y_n)$ be i.i.d. observations of a model with parameters $\theta_\star$.
To estimate $\theta_\star$ through classical stochastic gradient descent (SGD) amounts to the procedure
*learning rate* of the procedure (e.g., $\gamma_n = \gamma / (\gamma_0 + t))$,
and $C$ is an appropriate symmetric psd condition matrix.
This approach is well-studied and has been used extensively in applications.
However, one major issue is stability: classical SGD is heavily affected by misspecifications of the learning rate $\gamma_n$,
and can often explode. Currently, this issue is solved in an ad-hoc way (e.g., projected SGD, gradient clipping, robust transformations, etc.)
**Implicit SGD** is a modification of the classical algorithm that aims to solve this problem in a principled way.
Often, we can write the method as follows
*implicit* because $\theta_n$ appears on both sides of the equation.
In general such updates are hard to compute, however for a large family of statistical models (e.g., generalized linear models, M-estimation)
the implicit update can be computed very fast (see Algorithm 1, Toulis et.al. (2014)).
Running implicit SGD is then equivalent to running
*curvature* of the regression function of the statistic used in the SGD iterate;
in the typical cases shown above, this curvature coincides with the Fisher information. *Optimal* estimation is obtained only when $C = \mathcal{I}(\theta_\star)^{-1}$, in which case
the variance of $\theta_n$ is $(1/t) \mathcal{I}(\theta_\star)^{-1}$ i.e., the variance of the asymptotically optimal MLE.
I discuss more details in my tutorial on Stochastic Gradient Descent (under progress).

## Why implicit?

Implicit SGD has significant stability advantages over the classical SGD.
Intuitively, it is a *shrinked* version of the typical (explicit) SGD and the shrinkage factor
depends on the observed Fisher information. Thus, implicit SGD is significantly more stable than explicit SGD.
In the limit, both methods are statistically equivalent, but in small-to-moderate samples, implicit SGD is more stable,
and robust to misspecifications of the learning rate, or outliers. Implicit SGD can also be viewed as a stochastic proximal algorithm
since it solves