Mirror descent (MD) is a famous convex optimization algorithm. When the objective function is Lipschitz continuous, the convergence rate of MD is known to be $O ( t^{-1/2} )$. When the objective function is also strongly convex, intuitively, a better convergence rate should be possible. Surprisingly, to the best of my knowledge, there are very few such results in the existing literature. This post summarizes one such result in a review article by Juditsky and Nemirovski.

We will consider a standard constrained convex optimization problem:

$$\min_{x \in \mathcal{X}} f ( x ),$$

where $f$ is a convex function, and $\mathcal{X}$ is a compact convex set in $\mathbb{R}^d$. We assume that $f$ is $L$-Lipschitz continuous w.r.t. some norm $\Vert \cdot \Vert$ on $\mathbb{R}^d$.

### Brief Review of Mirror Descent

Let $\omega$ be a continuously differentiable $1$-strongly convex function on $\mathcal{X}$, w.r.t. $\Vert \cdot \Vert$.
Define the corresponding *Bregman divergence*

$$D ( x, y ) := \omega ( x ) - \omega ( y ) - \langle \nabla \omega ( y ), x - y \rangle, \quad x, y \in \mathcal{X},$$

and *prox-mapping*

$$\mathrm{prox}_x ( \xi ) := \operatorname*{arg\,min}_{u \in \mathcal{X}} \left\{ \langle \xi, u \rangle + D ( u, x ) \right\}, \quad x \in \mathcal{X}, \ \xi \in \mathbb{R}^d.$$
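For a quick sanity check of these definitions, consider the Euclidean choice $\omega ( x ) = \frac{1}{2} \Vert x \Vert_2^2$, which gives

$$D ( x, y ) = \frac{1}{2} \Vert x - y \Vert_2^2, \qquad \mathrm{prox}_x ( \xi ) = \operatorname*{arg\,min}_{u \in \mathcal{X}} \left\{ \langle \xi, u \rangle + \frac{1}{2} \Vert u - x \Vert_2^2 \right\} = \Pi_{\mathcal{X}} ( x - \xi ),$$

where $\Pi_{\mathcal{X}}$ denotes Euclidean projection onto $\mathcal{X}$; MD with this choice recovers the projected subgradient method.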
Let $x_1 \in \mathcal{X}$, and $( \eta_1, \eta_2, \ldots )$ be a sequence of *step sizes*.
The *standard* MD iterates as follows:

$$x_{t + 1} = \mathrm{prox}_{x_t} ( \eta_t g_t ), \quad g_t \in \partial f ( x_t ), \quad t = 1, 2, \ldots$$

Define the averaged iterate

$$x^T := \frac{\sum_{t = 1}^T \eta_t x_t}{\sum_{t = 1}^T \eta_t}.$$
The following result can be found in, e.g., the classic paper by Beck and Teboulle.

**Proposition 1.** It holds that, for any $u \in \mathcal{X}$,

$$\sum_{t = 1}^T \eta_t \langle g_t, x_t - u \rangle \leq D ( u, x_1 ) + \frac{1}{2} \sum_{t = 1}^T \eta_t^2 \left\Vert g_t \right\Vert_*^2,$$

where $\left\Vert \cdot \right\Vert_*$ denotes the dual norm.

Proposition 1 implies the following convergence guarantee for the standard MD.

**Corollary.** Fix a positive integer $T$.
Set

$$\eta_t = \frac{\sqrt{2 \Omega}}{L \sqrt{T}} \text{ for all } t, \quad \text{where } \Omega := \max_{u \in \mathcal{X}} D ( u, x_1 ).$$

Then

$$f ( x^T ) - \min_{x \in \mathcal{X}} f ( x ) \leq \frac{L \sqrt{2 \Omega}}{\sqrt{T}}.$$
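To make the review concrete, here is a minimal NumPy sketch of standard MD with the entropy mirror map $\omega ( x ) = \sum_i x_i \log x_i$ on the probability simplex, for which the prox-mapping has a closed form (a multiplicative update followed by normalization). The toy linear objective, dimension, and horizon below are my own illustrative choices, not from the post.

```python
import numpy as np

def mirror_descent(grad, x1, T, eta):
    """Standard MD with the entropy mirror map on the probability simplex,
    which is 1-strongly convex w.r.t. the l1 norm.  Returns the averaged
    iterate x^T (a plain average, since the step size is constant)."""
    x = x1.copy()
    avg = np.zeros_like(x)
    for _ in range(T):
        avg += x
        x = x * np.exp(-eta * grad(x))  # prox-mapping in closed form
        x /= x.sum()                    # (multiplicative update + normalization)
    return avg / T

# Toy problem: minimize f(x) = <c, x> over the simplex.  Here L = ||c||_inf = 1
# (dual norm of l1) and Omega = max_u D(u, x1) = log(3) for the uniform start.
c = np.array([1.0, 0.5, 0.0])
T = 500
eta = np.sqrt(2 * np.log(3)) / np.sqrt(T)  # the Corollary's step size
xT = mirror_descent(lambda x: c, np.full(3, 1.0 / 3.0), T, eta)
```

By the Corollary, $f ( x^T ) - \min f \leq L \sqrt{2 \Omega / T} = \sqrt{2 \log 3 / 500} \approx 0.066$ here.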
### A Modified MD for Minimizing Strongly Convex Functions

Now assume that $f$ is also $\mu$-strongly convex w.r.t. $\Vert \cdot \Vert$. How do we exploit this additional information?

Let us choose $\omega$ such that it is $1$-strongly convex on the whole $\mathbb{R}^d$, *instead of merely $\mathcal{X}$*.
For any $R > 0$, define $\omega_{R, z} ( x ) := \omega ( R^{-1} ( x - z ) )$; denote by $D_{R, z}$ the corresponding Bregman divergence, and $\mathrm{prox}^{R, z}_x$ the corresponding prox-mapping with center $x$.

Let us define an iteration rule very similar to the standard MD; the only difference is that now we replace $\omega$ by $\omega_{R, z}$ for some $z \in \mathbb{R}^d$: Let $( \eta_1, \eta_2, \ldots )$ be a given sequence of step sizes. For any $x_1 \in \mathcal{X}$, we compute

$$x_{t + 1} = \mathrm{prox}^{R, z}_{x_t} ( \eta_t g_t ), \quad g_t \in \partial f ( x_t ), \quad t = 1, 2, \ldots$$

Define

$$x^T ( R, z, x_1 ) := \frac{\sum_{t = 1}^T \eta_t x_t}{\sum_{t = 1}^T \eta_t}.$$
**Proposition 2.** Fix a positive integer $T$. Suppose that $z = x_1$ and $\left\Vert x_1 - x^\star \right\Vert \leq R$.
Set

$$\eta_t = \frac{\sqrt{2 \Omega}}{R L \sqrt{T}} \text{ for all } t, \quad \text{where now } \Omega := \max_{\left\Vert u \right\Vert \leq 1} D ( u, 0 ).$$

Then

$$f ( x^T ( R, x_1, x_1 ) ) - f ( x^\star ) \leq \frac{R L \sqrt{2 \Omega}}{\sqrt{T}}, \qquad \left\Vert x^T ( R, x_1, x_1 ) - x^\star \right\Vert^2 \leq \frac{2 R L \sqrt{2 \Omega}}{\mu \sqrt{T}},$$

where $x^\star$ is the unique minimizer of $f$ on $\mathcal{X}$.

*Proof.*
Notice that $\omega_{R, z}$ is $1$-strongly convex w.r.t. $\Vert \cdot \Vert_R := R^{-1} \Vert \cdot \Vert$, whose dual norm is $R \left\Vert \cdot \right\Vert_*$, and that $D_{R, x_1} ( x^\star, x_1 ) = D ( R^{-1} ( x^\star - x_1 ), 0 ) \leq \Omega$, since $\left\Vert x_1 - x^\star \right\Vert \leq R$.
The first inequality follows from Proposition 1, using the norm $\Vert \cdot \Vert_R$.
The second inequality follows from the following two inequalities, obtained by the strong convexity of $f$ and the optimality of $x^\star$, respectively:

$$f ( \bar{x} ) - f ( x^\star ) \geq \langle g^\star, \bar{x} - x^\star \rangle + \frac{\mu}{2} \left\Vert \bar{x} - x^\star \right\Vert^2, \qquad \langle g^\star, \bar{x} - x^\star \rangle \geq 0,$$

where $\bar{x} := x^T ( R, x_1, x_1 )$ and $g^\star \in \partial f ( x^\star )$ is given by the optimality condition.
*Q.E.D.*

Now we have error bounds depending on $R$, the a priori bound on the distance from $x_1$ to $x^\star$; the smaller $R$ is, the smaller the error bounds are.
Also notice that the bound on the distance between $x^T ( R, x_1, x_1 )$ and $x^\star$ is strictly decreasing in $T$.
These observations motivate the following *restarting* strategy:
Set $y_0 \in \mathcal{X}$ and $R_0 > 0$ such that $\left\Vert y_0 - x^\star \right\Vert \leq R_0$.
For $k = 1, 2, \ldots$,

- Set $T_k$ such that $\left\Vert x^{T_k} ( R_{k - 1}, y_{k - 1}, y_{k - 1} ) - x^\star \right\Vert^2 \leq 2^{-1} R_{k - 1}^2$.
- Compute $y_k = x^{T_k} ( R_{k - 1}, y_{k - 1}, y_{k - 1} )$, with $\eta_t = \frac{\sqrt{2 \Omega}}{R_{k - 1} L \sqrt{T_k}}$ for all $t$.
- Set $R_k^2 = 2^{-1} R_{k - 1}^2$.

By the proposition we have just proved, it suffices to choose

$$T_k = \left\lceil \frac{32 \Omega L^2}{\mu^2 R_{k - 1}^2} \right\rceil.$$
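The whole restarting scheme can be sketched in NumPy as follows, under the Euclidean choice $\omega ( u ) = \frac{1}{2} \Vert u \Vert_2^2$, for which $\Omega = \frac{1}{2}$ and the prox-mapping of $\omega_{R, z}$ reduces to a projected subgradient step with effective step size $\eta_t R^2$. The feasible set (unit ball), objective, and constants below are toy choices of mine, not from the post.

```python
import numpy as np

def proj_ball(x):
    """Euclidean projection onto the unit ball, our toy choice of X."""
    n = np.linalg.norm(x)
    return x if n <= 1.0 else x / n

def md_stage(grad, x1, R, T, eta):
    """One stage of MD with the rescaled mirror map omega_{R,z}, z = x1.
    For omega(u) = ||u||^2 / 2, the prox-mapping is a projected
    subgradient step with effective step size eta * R**2."""
    x = x1.copy()
    avg = np.zeros_like(x)
    for _ in range(T):
        avg += x
        x = proj_ball(x - eta * R * R * grad(x))
    return avg / T  # constant step size, so x^T is a plain average

def restarted_md(grad, y0, R0, L, mu, K, Omega=0.5):
    """Restarting strategy: R_k^2 = R_{k-1}^2 / 2 after each stage, with
    T_k and eta_t as in the text (Omega = 1/2 for the Euclidean mirror
    map over the unit ball)."""
    y, R2 = y0.copy(), R0 * R0
    for _ in range(K):
        R = np.sqrt(R2)
        T = int(np.ceil(32 * Omega * L * L / (mu * mu * R2)))
        eta = np.sqrt(2 * Omega) / (R * L * np.sqrt(T))
        y = md_stage(grad, y, R, T, eta)
        R2 /= 2
    return y

# Toy problem: f(x) = ||x - p||^2 / 2 over the unit ball, so mu = 1 and
# L = 1 + ||p|| bounds the gradient norm over the ball; x^star = p.
p = np.array([0.3, 0.2, 0.1])
y = restarted_md(lambda x: x - p, np.zeros(3), R0=1.0,
                 L=1.0 + np.linalg.norm(p), mu=1.0, K=10)
```

After $K$ stages, the analysis guarantees $\Vert y_K - x^\star \Vert^2 \leq 2^{-K} R_0^2$.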
**Proposition 3.** Define $M_k = \sum_{\kappa = 1}^k T_{\kappa}$.
Let $k^\star$ be the smallest $k$ such that

$$\frac{32 \Omega L^2}{\mu^2 R_{k - 1}^2} = \frac{32 \Omega L^2}{\mu^2 R_0^2} \, 2^{k - 1} > 1.$$

Then for $k < k^\star$,

$$\left\Vert y_k - x^\star \right\Vert^2 \leq 2^{- M_k / 2} R_0^2;$$

for $k \geq k^\star$,

$$\left\Vert y_k - x^\star \right\Vert^2 \leq \frac{16 \Omega L^2}{\mu^2 M_k}.$$
*Proof.* It is easily checked, by Proposition 2, that

$$\left\Vert y_k - x^\star \right\Vert^2 \leq R_k^2 = 2^{-k} R_0^2.$$

By our choice of $T_k$, it holds that

$$\frac{32 \Omega L^2}{\mu^2 R_0^2} \, 2^{k - 1} \leq T_k \leq \frac{32 \Omega L^2}{\mu^2 R_0^2} \, 2^{k - 1} + 1.$$

Therefore, $M_k \leq 2 k$ for $k < k^\star$, and

$$M_k \geq T_k \geq \frac{32 \Omega L^2}{\mu^2 R_0^2} \, 2^{k - 1}$$

for $k \geq k^\star$. Combining these bounds with the first display yields the claims.
*Q.E.D.*

Notice that to compute each $y_k$, we need to compute $T_k$ additional prox-mappings. Therefore, the effective number of iterations is represented by $M_k$, instead of $k$, and the convergence rate is understood as

$$\left\Vert y_k - x^\star \right\Vert^2 = O \left( \frac{\Omega L^2}{\mu^2 M_k} \right).$$