Que sais-je?

Minimizing a Strongly Convex Function by Mirror Descent

Mirror descent (MD) is a famous convex optimization algorithm. When the objective function is Lipschitz continuous, the convergence rate of MD is known to be $O ( t^{-1/2} )$. When the objective function is also strongly convex, intuitively, a better convergence rate should be possible. Surprisingly, there are very few related results in the existing literature, to the best of my knowledge. This post summarizes one such result in a review article by Juditsky and Nemirovski.

We will consider a standard constrained convex optimization problem:

$$\min_{x \in \mathcal{X}} f ( x ),$$

where $f$ is a convex function, and $\mathcal{X}$ is a compact convex set in $\mathbb{R}^d$. We assume that $f$ is $L$-Lipschitz continuous w.r.t. some norm $\Vert \cdot \Vert$ on $\mathbb{R}^d$.

Brief Review of Mirror Descent

Let $\omega$ be a continuously differentiable $1$-strongly convex function on $\mathcal{X}$, w.r.t. $\Vert \cdot \Vert$. Define the corresponding Bregman divergence

$$D ( u, x ) := \omega ( u ) - \omega ( x ) - \langle \nabla \omega ( x ), u - x \rangle,$$

and prox-mapping

$$\mathrm{prox}_x ( \xi ) := \operatorname*{arg\,min}_{u \in \mathcal{X}} \left\{ \langle \xi, u \rangle + D ( u, x ) \right\}.$$
Let $x_1 \in \mathcal{X}$, and $( \eta_1, \eta_2, \ldots )$ be a sequence of step sizes. The standard MD iterates as follows:

$$x_{t + 1} = \mathrm{prox}_{x_t} ( \eta_t f' ( x_t ) ), \quad t = 1, 2, \ldots,$$

where $f' ( x_t ) \in \partial f ( x_t )$ is a subgradient of $f$ at $x_t$.

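As a concrete illustration (not in the original post), here is a minimal Python sketch of the standard MD iteration, assuming the entropy distance-generating function $\omega ( x ) = \sum_i x_i \log x_i$ on the probability simplex, for which the prox-mapping has a closed form (the exponentiated-gradient update):

```python
import numpy as np

def mirror_descent(subgrad, x1, step_sizes):
    """Standard mirror descent on the probability simplex with the
    entropy distance-generating function omega(x) = sum_i x_i log x_i.
    For this choice the prox-mapping has a closed form: a multiplicative
    (exponentiated-gradient) update followed by normalization."""
    x = np.asarray(x1, dtype=float)
    iterates = [x]
    for eta in step_sizes:
        g = subgrad(x)                 # a subgradient f'(x_t)
        x = x * np.exp(-eta * g)       # closed-form prox-mapping
        x = x / x.sum()
        iterates.append(x)
    return np.mean(iterates[:-1], axis=0)  # average of x_1, ..., x_T
```

For instance, with the linear objective $f ( x ) = \langle c, x \rangle$, the averaged iterate concentrates on the coordinate with the smallest $c_i$.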

The following result can be found in, e.g., the classic paper by Beck and Teboulle.

Proposition 1. It holds that, for any $u \in \mathcal{X}$,

$$\sum_{t = 1}^{T} \eta_t \left( f ( x_t ) - f ( u ) \right) \leq D ( u, x_1 ) + \frac{1}{2} \sum_{t = 1}^{T} \eta_t^2 \left\Vert f' ( x_t ) \right\Vert_*^2,$$

where $\left\Vert \cdot \right\Vert_*$ denotes the dual norm.

Proposition 1 implies the following convergence guarantee for the standard MD.

Corollary. Fix a positive integer $T$. Set $\eta_t = \frac{\sqrt{2 \Omega}}{L \sqrt{T}}$ for all $t$, where $\Omega := \max_{u \in \mathcal{X}} D ( u, x_1 )$. Then the averaged iterate $\bar{x}_T := \frac{1}{T} \sum_{t = 1}^T x_t$ satisfies

$$f ( \bar{x}_T ) - \min_{x \in \mathcal{X}} f ( x ) \leq L \sqrt{\frac{2 \Omega}{T}}.$$


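For completeness, here is the short derivation (a sketch, with $\Omega := \max_{u \in \mathcal{X}} D ( u, x_1 )$ and $\bar{x}_T := \frac{1}{T} \sum_{t = 1}^T x_t$): by convexity of $f$ and Proposition 1 with $\Vert f' ( x_t ) \Vert_* \leq L$, for any $u \in \mathcal{X}$,

$$f ( \bar{x}_T ) - f ( u ) \leq \frac{D ( u, x_1 )}{\eta T} + \frac{\eta L^2}{2} \leq \frac{\Omega}{\eta T} + \frac{\eta L^2}{2} = L \sqrt{\frac{2 \Omega}{T}} \quad \text{for } \eta := \frac{\sqrt{2 \Omega}}{L \sqrt{T}},$$

where the last equality holds because the two terms are equal at this choice of $\eta$.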
A Modified MD for Minimizing Strongly Convex Functions

Now assume that $f$ is also $\mu$-strongly convex w.r.t. $\Vert \cdot \Vert$. How do we exploit this additional information?

Let us choose $\omega$ such that it is $1$-strongly convex on the whole $\mathbb{R}^d$, instead of merely $\mathcal{X}$. For any $R > 0$, define $\omega_{R, z} ( x ) := \omega ( R^{-1} ( x - z ) )$; denote by $D_{R, z}$ the corresponding Bregman divergence, and $\mathrm{prox}_{R, z}$ the corresponding prox-mapping.

Let us define an iteration rule very similar to the standard MD; the only difference is that now we replace $\omega$ by $\omega_{R, z}$ for some $z \in \mathbb{R}^d$: Let $( \eta_1, \eta_2, \ldots )$ be a given sequence of step sizes. For any $x_1 \in \mathcal{X}$, we compute

$$x_{t + 1} = \mathrm{prox}_{R, z, x_t} ( \eta_t f' ( x_t ) ), \quad t = 1, 2, \ldots, T,$$

and denote the averaged iterate by $x^T ( R, z, x_1 ) := \frac{1}{T} \sum_{t = 1}^T x_t$.


Proposition 2. Fix a positive integer $T$. Set $\eta_t = \frac{\sqrt{2 \Omega}}{R L \sqrt{T}}$ for all $t$, where $\Omega := \max_{\Vert u \Vert \leq 1} \left[ \omega ( u ) - \omega ( 0 ) - \langle \nabla \omega ( 0 ), u \rangle \right]$. If $\left\Vert x_1 - x^\star \right\Vert \leq R$ and $z = x_1$, then

$$f ( x^T ( R, x_1, x_1 ) ) - f ( x^\star ) \leq R L \sqrt{\frac{2 \Omega}{T}}, \qquad \left\Vert x^T ( R, x_1, x_1 ) - x^\star \right\Vert^2 \leq \frac{2 R L}{\mu} \sqrt{\frac{2 \Omega}{T}},$$

where $x^\star$ is the unique minimizer of $f$ on $\mathcal{X}$.

Proof. Notice that $\omega_{R, z}$ is $1$-strongly convex w.r.t. $\Vert \cdot \Vert_R := R^{-1} \Vert \cdot \Vert$, whose dual norm is $R \Vert \cdot \Vert_*$; hence $f$ is $( R L )$-Lipschitz w.r.t. $\Vert \cdot \Vert_R$, and $D_{R, x_1} ( x^\star, x_1 ) \leq \Omega$ since $\Vert x_1 - x^\star \Vert \leq R$. The first inequality then follows from Proposition 1, using the norm $\Vert \cdot \Vert_R$. The second inequality follows from the following two inequalities, obtained by the strong convexity of $f$ and the optimality of $x^\star$, respectively:

$$f ( x^T ) - f ( x^\star ) \geq \langle f' ( x^\star ), x^T - x^\star \rangle + \frac{\mu}{2} \left\Vert x^T - x^\star \right\Vert^2, \qquad \langle f' ( x^\star ), x^T - x^\star \rangle \geq 0,$$

where $x^T := x^T ( R, x_1, x_1 )$; together they give $\frac{\mu}{2} \Vert x^T - x^\star \Vert^2 \leq f ( x^T ) - f ( x^\star )$. Q.E.D.


Now we have an error bound depending on $R$; the smaller $R$ is, the smaller the error bounds are. Also notice that the bound on the distance between $x^T$ and $x^\star$ is strictly decreasing in $T$. These observations motivate the following restarting strategy: Choose $y_0 \in \mathcal{X}$ and $R_0 > 0$ such that $\left\Vert y_0 - x^\star \right\Vert \leq R_0$. For $k = 1, 2, \ldots$,

  1. Set $T_k$ such that $\left\Vert x^{T_k} ( R_{k - 1}, y_{k - 1}, y_{k - 1} ) - x^\star \right\Vert^2 \leq 2^{-1} R_{k - 1}^2$.
  2. Compute $y_k = x^{T_k} ( R_{k - 1}, y_{k - 1}, y_{k - 1} )$, with $\eta_t = \frac{\sqrt{2 \Omega}}{R_{k - 1} L \sqrt{T_k}}$ for all $t$.
  3. Set $R_k^2 = 2^{-1} R_{k - 1}^2$.

By the proposition we have just proved, $\left\Vert x^{T_k} ( R_{k - 1}, y_{k - 1}, y_{k - 1} ) - x^\star \right\Vert^2 \leq \frac{2 R_{k - 1} L}{\mu} \sqrt{\frac{2 \Omega}{T_k}}$, and the right-hand side is at most $2^{-1} R_{k - 1}^2$ as soon as $T_k \geq 32 L^2 \Omega / ( \mu^2 R_{k - 1}^2 )$; so it suffices to choose

$$T_k = \left\lceil \frac{32 L^2 \Omega}{\mu^2 R_{k - 1}^2} \right\rceil = \left\lceil \frac{32 L^2 \Omega}{\mu^2 R_0^2} \, 2^{k - 1} \right\rceil.$$

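To illustrate the whole scheme, here is a minimal Python sketch (not from the original post), assuming the Euclidean setup $\omega ( x ) = \frac{1}{2} \Vert x \Vert_2^2$: then $\Omega = \frac{1}{2}$, the prox-mapping reduces to a projected subgradient step, and the stage lengths are chosen so that each stage halves the squared radius (the constant $32$ reflects the bound of Proposition 2 in this setup):

```python
import numpy as np

def restarted_md(subgrad, project, y0, R0, L, mu, num_stages):
    """Restarted mirror descent, Euclidean case: omega(x) = ||x||^2 / 2,
    so Omega = 1/2 and the prox-mapping is a projected subgradient step.
    Each stage halves the squared radius estimate R_k^2."""
    Omega = 0.5
    y, R2 = np.asarray(y0, dtype=float), float(R0) ** 2
    for _ in range(num_stages):
        # stage length T_k = ceil(32 L^2 Omega / (mu^2 R_{k-1}^2))
        T = int(np.ceil(32 * L ** 2 * Omega / (mu ** 2 * R2)))
        # effective Euclidean step size: eta_t * R_{k-1}^2 = R_{k-1} / (L sqrt(T))
        step = np.sqrt(R2) / (L * np.sqrt(T))
        xs = [y]                       # x_1 is the previous center y_{k-1}
        for _ in range(T):
            xs.append(project(xs[-1] - step * subgrad(xs[-1])))
        y = np.mean(xs[:T], axis=0)    # average x_1, ..., x_T = new center y_k
        R2 /= 2                        # R_k^2 = R_{k-1}^2 / 2
    return y
```

For instance, for the $1$-strongly convex objective $f ( x ) = \frac{1}{2} \Vert x - a \Vert_2^2$ over a box containing $a$, after $k$ stages the squared distance to $x^\star = a$ is at most $2^{-k} R_0^2$.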
Proposition 3. Define $M_k := \sum_{\kappa = 1}^k T_{\kappa}$. Let $k^\star$ be the smallest $k$ such that

$$\frac{32 L^2 \Omega}{\mu^2 R_0^2} \, 2^{k - 1} \geq 1.$$

Then for $k < k^\star$,

$$\left\Vert y_k - x^\star \right\Vert^2 \leq R_0^2 \, 2^{- M_k / 2};$$

for $k \geq k^\star$,

$$\left\Vert y_k - x^\star \right\Vert^2 \leq \frac{32 L^2 \Omega}{\mu^2 \left( M_k - k \right)}.$$

Proof. It is easily checked, by Proposition 2 and the choice of $T_k$, that

$$\left\Vert y_k - x^\star \right\Vert^2 \leq R_k^2 = 2^{-k} R_0^2.$$

By our choice of $T_k$, it holds that

$$T_k \leq 1 + \frac{32 L^2 \Omega}{\mu^2 R_0^2} \, 2^{k - 1},$$

and $T_k = 1$ whenever $k < k^\star$. Therefore, $M_k \leq 2 k$ for $k < k^\star$, and

$$M_k \leq k + \frac{32 L^2 \Omega}{\mu^2 R_0^2} \, 2^k$$

for $k \geq k^\star$; substituting these estimates into $\left\Vert y_k - x^\star \right\Vert^2 \leq 2^{-k} R_0^2$ gives the claimed bounds. Q.E.D.

Notice that to compute each $y_k$, we need to compute $T_k$ additional prox-mappings. Therefore, the effective number of iterations is represented by $M_k$, instead of $k$, and the convergence rate is understood as

$$\left\Vert y_k - x^\star \right\Vert^2 = O \left( \frac{L^2 \Omega}{\mu^2 M_k} \right),$$

an $O ( M_k^{-1} )$ rate, in contrast to the $O ( M_k^{-1/2} )$ rate of the standard MD for merely Lipschitz continuous objectives.