## Statistics of Stochastic Gradient Descent

### Current draft: December 27, 2014

Consider a function $F(x, \xi)$ where $x$ is the input/parameters, and $\xi$ is random.
In many interesting cases we wish to find the point $x_\star$ such that
$$
\mathbf{E}\left( F(x_\star, \xi) \right) = 0. \qquad (1)
$$
In *stochastic optimization*,
we only have access to "noisy" values of $F$: we observe values
$F(x_{t-1}, \xi_t) = M(x_{t-1}) + \epsilon_t$, where $M(\cdot)$ is deterministic but $\epsilon_t$ is random with $\mathbf{E}(\epsilon_t) = 0$ --
in other words, if $F = \nabla \text{loss}$, we have *unbiased* estimates of the gradient.
With these noisy estimates, the natural iterative procedure for solving (1) can be written as
$$
x_t = x_{t-1} - a_t F(x_{t-1}, \xi_t), \qquad (2)
$$
where $a_t > 0$ is a learning-rate sequence.

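As a concrete illustration of recursion (2) (this example is not from the original text: the target $M(x) = x - 5$, the Gaussian noise, and the learning rate $a_t = 1/t$ are all choices made here for demonstration), a minimal Python sketch:

```python
import random

def sgd(F, x0, n_steps, seed=0):
    """Stochastic approximation: x_t = x_{t-1} - a_t * F(x_{t-1}, xi_t),
    with the classical Robbins-Monro learning rate a_t = 1/t."""
    rng = random.Random(seed)
    x = x0
    for t in range(1, n_steps + 1):
        xi = rng.gauss(0.0, 1.0)        # random component of the observation
        x = x - (1.0 / t) * F(x, xi)    # noisy step toward the root of M
    return x

# Noisy oracle: F(x, xi) = M(x) + epsilon, with M(x) = x - 5 and E(epsilon) = 0,
# so the root is x_star = 5.
x_hat = sgd(lambda x, xi: (x - 5.0) + xi, x0=0.0, n_steps=10_000)
```

With $a_t = 1/t$ the step sizes satisfy the usual conditions ($\sum a_t = \infty$, $\sum a_t^2 < \infty$), and the final iterate lands close to $x_\star = 5$.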
In statistics, this framework comes up routinely in practice. Let $\xi$ be an observation produced from a model
with fixed model parameters $x_\star$. The loss is the
negative log-likelihood, which we wish to minimize; thus $F(x, \xi) = -\nabla \ell(x; \xi)$
is the gradient of the negative log-likelihood at parameter value $x$ for observation $\xi$.
Our goal in statistics is to *estimate*
the parameters $x_\star$ and assign some uncertainty to an estimate when we are using *observed data*.
In this case, we write
$$
x_t = x_{t-1} + a_t \nabla \ell(x_{t-1}; \xi_t),
$$
and recursion (2) is known as
*Stochastic Gradient Descent*, or SGD for short.
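As a small worked example (not part of the original text): for the Gaussian model $\xi \sim \mathcal{N}(x_\star, 1)$ we have $\nabla \ell(x; \xi) = \xi - x$, and with learning rate $a_t = 1/t$ the SGD recursion reduces exactly to the running sample mean, i.e., the maximum-likelihood estimator:

```python
import random

rng = random.Random(1)
x_star = 2.0
data = [rng.gauss(x_star, 1.0) for _ in range(1000)]

# SGD on the Gaussian log-likelihood: x_t = x_{t-1} + a_t * (xi_t - x_{t-1}).
x = 0.0
for t, xi in enumerate(data, start=1):
    x += (1.0 / t) * (xi - x)   # a_t = 1/t

# Because a_1 = 1, the iterate is the running average of the data,
# so the final iterate equals the sample mean (the MLE) up to rounding.
sample_mean = sum(data) / len(data)
```

This is the simplest case where the statistical view pays off: the SGD iterate is not just "close to a minimizer", it *is* a classical estimator whose bias and variance are fully understood.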
However, the respective iterates $x_t$ receive different treatment in stochastic optimization compared to statistics.
Roughly, in stochastic optimization, we usually care about worst-case and/or non-asymptotic performance, convergence
of $x_t$ to some empirical minimizers, and so on.
In statistics, we view $x_t$ as an *estimator* of $x_\star$ and thus we
are interested in its statistical properties, such as bias, variance, *efficiency*, and so on.
This usually translates into trying to understand how much of the information that is contained
in the data is used by our estimator; the focus is usually on asymptotic efficiency.

A surprisingly rich variant of the classical SGD (2) is obtained if we consider a procedure as follows:
$$
x_t = x_{t-1} - a_t F(x_t, \xi_t). \qquad (3)
$$
This procedure is termed *implicit* because the next iterate
$x_t$ appears on both sides of the equation. Procedure (3) is
the implicit Stochastic Gradient Descent method.
It was introduced and analyzed for a large family of generalized linear models by Toulis et al. (2014).
In the limit both variations of SGD are statistically equivalent; however, in small samples implicit SGD is significantly
more stable. It is easy to establish that implicit SGD uses second-order information, *implicitly*,
through its definition (3). More details are given in Section 2b of this tutorial.
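To make the stability contrast concrete (an illustrative sketch; the Gaussian model and the deliberately aggressive step sizes are chosen here, not taken from the text): for the Gaussian mean model, $F(x, \xi) = x - \xi$, the implicit update (3) can be solved in closed form, $x_t = (x_{t-1} + a_t \xi_t)/(1 + a_t)$, a convex combination of $x_{t-1}$ and $\xi_t$ that stays bounded for *any* $a_t > 0$, whereas the explicit update $x_t = (1 - a_t) x_{t-1} + a_t \xi_t$ overshoots whenever $a_t > 1$:

```python
import random

def run(update, data, x0=0.0, c=20.0):
    """Iterate an update rule with aggressive steps a_t = c / t; return the
    final iterate and the largest magnitude seen along the trajectory."""
    x, worst = x0, abs(x0)
    for t, xi in enumerate(data, start=1):
        x = update(x, xi, c / t)
        worst = max(worst, abs(x))
    return x, worst

# Explicit SGD: x_t = x_{t-1} - a_t * (x_{t-1} - xi_t).
explicit = lambda x, xi, a: (1.0 - a) * x + a * xi
# Implicit SGD: solve x_t = x_{t-1} - a_t * (x_t - xi_t) for x_t.
implicit = lambda x, xi, a: (x + a * xi) / (1.0 + a)

rng = random.Random(2)
data = [rng.gauss(1.0, 1.0) for _ in range(200)]

x_exp, worst_exp = run(explicit, data)  # early iterates swing to huge values
x_imp, worst_imp = run(implicit, data)  # iterates never leave the data range
```

The implicit trajectory is bounded by the most extreme observation regardless of $c$, while the explicit trajectory blows up during the early steps where $a_t > 1$, which is one way to see the small-sample stability claimed above.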