Mirror descent (MD) is a famous convex optimization algorithm. When the objective function is Lipschitz continuous, the convergence rate of MD is known to be $O ( t^{-1/2} )$. When the objective function is also strongly convex, intuitively, a better convergence rate should be possible. Surprisingly, to the best of my knowledge, there are very few such results in the existing literature. This post summarizes one such result in a review article by Juditsky and Nemirovski.

We will consider a standard constrained convex optimization problem:

$$\min_{x \in \mathcal{X}} f ( x ),$$

where $f$ is a convex function, and $\mathcal{X}$ is a compact convex set in $\mathbb{R}^d$. We assume that $f$ is $L$-Lipschitz continuous w.r.t. some norm $\Vert \cdot \Vert$ on $\mathbb{R}^d$.

### Brief Review of Mirror Descent

Let $\omega$ be a continuously differentiable $1$-strongly convex function on $\mathcal{X}$, w.r.t. $\Vert \cdot \Vert$.
Define the corresponding *Bregman divergence*

$$D ( x, y ) := \omega ( x ) - \omega ( y ) - \langle \nabla \omega ( y ), x - y \rangle, \quad x, y \in \mathcal{X},$$

and *prox-mapping*

$$\mathrm{prox}_x ( \xi ) := \operatorname*{arg\,min}_{u \in \mathcal{X}} \left\{ \langle \xi, u \rangle + D ( u, x ) \right\}, \quad x \in \mathcal{X}, \ \xi \in \mathbb{R}^d.$$
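For a quick sanity check of these definitions, consider the Euclidean choice $\omega ( x ) = \frac{1}{2} \Vert x \Vert_2^2$, which gives

$$D ( x, y ) = \frac{1}{2} \Vert x - y \Vert_2^2, \qquad \mathrm{prox}_x ( \xi ) = \operatorname*{arg\,min}_{u \in \mathcal{X}} \left\{ \langle \xi, u \rangle + \frac{1}{2} \Vert u - x \Vert_2^2 \right\} = \Pi_{\mathcal{X}} ( x - \xi ),$$

where $\Pi_{\mathcal{X}}$ denotes Euclidean projection onto $\mathcal{X}$; MD with this choice recovers the projected subgradient method.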
Let $x_1 \in \mathcal{X}$, and $( \eta_1, \eta_2, \ldots )$ be a sequence of *step sizes*.
The *standard* MD iterates as follows:

$$x_{t + 1} = \mathrm{prox}_{x_t} ( \eta_t g_t ), \quad g_t \in \partial f ( x_t ), \quad t = 1, 2, \ldots$$

Define the averaged iterate

$$x^T := \frac{\sum_{t = 1}^T \eta_t x_t}{\sum_{t = 1}^T \eta_t}.$$
The following result can be found in, e.g., the classic paper by Beck and Teboulle.

**Proposition 1.** It holds that, for any $u \in \mathcal{X}$,

$$\sum_{t = 1}^T \eta_t \langle g_t, x_t - u \rangle \leq D ( u, x_1 ) + \frac{1}{2} \sum_{t = 1}^T \eta_t^2 \left\Vert g_t \right\Vert_*^2,$$

where $\left\Vert \cdot \right\Vert_*$ denotes the dual norm.

Proposition 1 implies the following convergence guarantee for the standard MD.

**Corollary.** Fix a positive integer $T$.
Set

$$\eta_t = \frac{\sqrt{2 \Omega}}{L \sqrt{T}} \text{ for all } t, \quad \text{where } \Omega := \max_{u \in \mathcal{X}} D ( u, x_1 ).$$

Then

$$f ( x^T ) - \min_{x \in \mathcal{X}} f ( x ) \leq \frac{L \sqrt{2 \Omega}}{\sqrt{T}}.$$
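To make the review concrete, here is a minimal NumPy sketch of standard MD with the entropy mirror map $\omega ( x ) = \sum_i x_i \log x_i$ on the probability simplex, for which the prox-mapping has a closed form (a multiplicative update followed by normalization). The toy linear objective, dimension, and horizon below are my own illustrative choices, not from the post.

```python
import numpy as np

def mirror_descent(grad, x1, T, eta):
    """Standard MD with the entropy mirror map on the probability simplex,
    which is 1-strongly convex w.r.t. the l1 norm.  Returns the averaged
    iterate x^T (a plain average, since the step size is constant)."""
    x = x1.copy()
    avg = np.zeros_like(x)
    for _ in range(T):
        avg += x
        x = x * np.exp(-eta * grad(x))  # prox-mapping in closed form
        x /= x.sum()                    # (multiplicative update + normalization)
    return avg / T

# Toy problem: minimize f(x) = <c, x> over the simplex.  Here L = ||c||_inf = 1
# (dual norm of l1) and Omega = max_u D(u, x1) = log(3) for the uniform start.
c = np.array([1.0, 0.5, 0.0])
T = 500
eta = np.sqrt(2 * np.log(3)) / np.sqrt(T)  # the Corollary's step size
xT = mirror_descent(lambda x: c, np.full(3, 1.0 / 3.0), T, eta)
```

By the Corollary, $f ( x^T ) - \min f \leq L \sqrt{2 \Omega / T} = \sqrt{2 \log 3 / 500} \approx 0.066$ here.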
### A Modified MD for Minimizing Strongly Convex Functions

Now assume that $f$ is also $\mu$-strongly convex w.r.t. $\Vert \cdot \Vert$. How do we exploit this additional information?

Let us choose $\omega$ such that it is $1$-strongly convex on the whole $\mathbb{R}^d$, *instead of merely $\mathcal{X}$*.
For any $R > 0$, define $\omega_{R, z} ( x ) := \omega ( R^{-1} ( x - z ) )$; denote by $D_{R, z}$ the corresponding Bregman divergence, and $\mathrm{prox}^{R, z}_x$ the corresponding prox-mapping with center $x$.

Let us define an iteration rule very similar to the standard MD; the only difference is that now we replace $\omega$ by $\omega_{R, z}$ for some $z \in \mathbb{R}^d$: Let $( \eta_1, \eta_2, \ldots )$ be a given sequence of step sizes. For any $x_1 \in \mathcal{X}$, we compute

$$x_{t + 1} = \mathrm{prox}^{R, z}_{x_t} ( \eta_t g_t ), \quad g_t \in \partial f ( x_t ), \quad t = 1, 2, \ldots$$

Define

$$x^T ( R, z, x_1 ) := \frac{\sum_{t = 1}^T \eta_t x_t}{\sum_{t = 1}^T \eta_t}.$$
**Proposition 2.** Fix a positive integer $T$. Suppose that $z = x_1$ and $\left\Vert x_1 - x^\star \right\Vert \leq R$.
Set

$$\eta_t = \frac{\sqrt{2 \Omega}}{R L \sqrt{T}} \text{ for all } t, \quad \text{where now } \Omega := \max_{\left\Vert u \right\Vert \leq 1} D ( u, 0 ).$$

Then

$$f ( x^T ( R, x_1, x_1 ) ) - f ( x^\star ) \leq \frac{R L \sqrt{2 \Omega}}{\sqrt{T}}, \qquad \left\Vert x^T ( R, x_1, x_1 ) - x^\star \right\Vert^2 \leq \frac{2 R L \sqrt{2 \Omega}}{\mu \sqrt{T}},$$

where $x^\star$ is the unique minimizer of $f$ on $\mathcal{X}$.

*Proof.*
Notice that $\omega_{R, z}$ is $1$-strongly convex w.r.t. $\Vert \cdot \Vert_R := R^{-1} \Vert \cdot \Vert$, whose dual norm is $R \left\Vert \cdot \right\Vert_*$, and that $D_{R, x_1} ( x^\star, x_1 ) = D ( R^{-1} ( x^\star - x_1 ), 0 ) \leq \Omega$, since $\left\Vert x_1 - x^\star \right\Vert \leq R$.
The first inequality follows from Proposition 1, using the norm $\Vert \cdot \Vert_R$.
The second inequality follows from the following two inequalities, obtained by the strong convexity of $f$ and the optimality of $x^\star$, respectively:

$$f ( \bar{x} ) - f ( x^\star ) \geq \langle g^\star, \bar{x} - x^\star \rangle + \frac{\mu}{2} \left\Vert \bar{x} - x^\star \right\Vert^2, \qquad \langle g^\star, \bar{x} - x^\star \rangle \geq 0,$$

where $\bar{x} := x^T ( R, x_1, x_1 )$ and $g^\star \in \partial f ( x^\star )$ is given by the optimality condition.
*Q.E.D.*

Now we have error bounds depending on $R$, the a priori bound on the distance from $x_1$ to $x^\star$; the smaller $R$ is, the smaller the error bounds are.
Also notice that the bound on the distance between $x^T ( R, x_1, x_1 )$ and $x^\star$ is strictly decreasing in $T$.
These observations motivate the following *restarting* strategy:
Set $y_0 \in \mathcal{X}$ and $R_0 > 0$ such that $\left\Vert y_0 - x^\star \right\Vert \leq R_0$.
For $k = 1, 2, \ldots$,

- Set $T_k$ such that $\left\Vert x^{T_k} ( R_{k - 1}, y_{k - 1}, y_{k - 1} ) - x^\star \right\Vert^2 \leq 2^{-1} R_{k - 1}^2$.
- Compute $y_k = x^{T_k} ( R_{k - 1}, y_{k - 1}, y_{k - 1} )$, with $\eta_t = \frac{\sqrt{2 \Omega}}{R_{k - 1} L \sqrt{T_k}}$ for all $t$.
- Set $R_k^2 = 2^{-1} R_{k - 1}^2$.

By the proposition we have just proved, it suffices to choose

$$T_k = \left\lceil \frac{32 \Omega L^2}{\mu^2 R_{k - 1}^2} \right\rceil.$$
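The whole restarting scheme can be sketched in NumPy as follows, under the Euclidean choice $\omega ( u ) = \frac{1}{2} \Vert u \Vert_2^2$, for which $\Omega = \frac{1}{2}$ and the prox-mapping of $\omega_{R, z}$ reduces to a projected subgradient step with effective step size $\eta_t R^2$. The feasible set (unit ball), objective, and constants below are toy choices of mine, not from the post.

```python
import numpy as np

def proj_ball(x):
    """Euclidean projection onto the unit ball, our toy choice of X."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def md_stage(grad, x1, R, T, eta):
    """One stage of MD with the rescaled mirror map omega_{R,z}, z = x1.
    For omega(u) = ||u||^2 / 2, the prox-mapping is a projected
    subgradient step with effective step size eta * R**2."""
    x = x1.copy()
    avg = np.zeros_like(x)
    for _ in range(T):
        avg += x
        x = proj_ball(x - eta * R * R * grad(x))
    return avg / T  # constant step size, so x^T is a plain average

def restarted_md(grad, y0, R0, L, mu, K, Omega=0.5):
    """Restarting strategy: R_k^2 = R_{k-1}^2 / 2 after each stage, with
    T_k and eta_t as in the text (Omega = 1/2 for the Euclidean mirror
    map over the unit ball)."""
    y, R2 = y0.copy(), R0 * R0
    for _ in range(K):
        R = np.sqrt(R2)
        T = int(np.ceil(32 * Omega * L * L / (mu * mu * R2)))
        eta = np.sqrt(2 * Omega) / (R * L * np.sqrt(T))
        y = md_stage(grad, y, R, T, eta)
        R2 /= 2
    return y

# Toy problem: f(x) = ||x - p||^2 / 2 over the unit ball, so mu = 1 and
# L = 1 + ||p|| bounds the gradient norm over the ball; x^star = p.
p = np.array([0.3, 0.2, 0.1])
y = restarted_md(lambda x: x - p, np.zeros(3), R0=1.0,
                 L=1.0 + np.linalg.norm(p), mu=1.0, K=10)
```

After $K$ stages, the analysis guarantees $\Vert y_K - x^\star \Vert^2 \leq 2^{-K} R_0^2$.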
**Proposition 3.** Define $M_k = \sum_{\kappa = 1}^k T_{\kappa}$.
Let $k^\star$ be the smallest $k$ such that

$$\frac{32 \Omega L^2}{\mu^2 R_{k - 1}^2} = \frac{32 \Omega L^2}{\mu^2 R_0^2} \, 2^{k - 1} > 1.$$

Then for $k < k^\star$,

$$\left\Vert y_k - x^\star \right\Vert^2 \leq 2^{- M_k / 2} R_0^2;$$

for $k \geq k^\star$,

$$\left\Vert y_k - x^\star \right\Vert^2 \leq \frac{16 \Omega L^2}{\mu^2 M_k}.$$
*Proof.* It is easily checked, by Proposition 2, that

$$\left\Vert y_k - x^\star \right\Vert^2 \leq R_k^2 = 2^{-k} R_0^2.$$

By our choice of $T_k$, it holds that

$$\frac{32 \Omega L^2}{\mu^2 R_0^2} \, 2^{k - 1} \leq T_k \leq \frac{32 \Omega L^2}{\mu^2 R_0^2} \, 2^{k - 1} + 1.$$

Therefore, $M_k \leq 2 k$ for $k < k^\star$, and

$$M_k \geq T_k \geq \frac{32 \Omega L^2}{\mu^2 R_0^2} \, 2^{k - 1}$$

for $k \geq k^\star$. Combining these bounds with the first display yields the claims.
*Q.E.D.*

Notice that to compute each $y_k$, we need to compute $T_k$ additional prox-mappings. Therefore, the effective number of iterations is represented by $M_k$, instead of $k$, and the convergence rate is understood as

$$\left\Vert y_k - x^\star \right\Vert^2 = O \left( \frac{\Omega L^2}{\mu^2 M_k} \right).$$