Blackwell Approachability

Sep 23rd, 2017 9:52 pm

Introduction

Suppose that Alice and Bob is playing a so-called two-player zero-sum game with perfect information. Alice chooses her action from a finite action set $\mathcal{A}$ ; Bob chooses his action from a finite action set $\mathcal{B}$ . There is a payoff function $f$, which assigns to each action pair $(a, b) \in \mathcal{A} \times \mathcal{B}$ a payoff $f ( a, b ) \in \mathbb{R}$; Bob’s payoff is then simply $-f (a, b)$. In general, Alice and Bob may choose their actions randomly, so we consider the expected payoff: for any probability distribution $P \in \Delta ( \mathcal{A} )$ and $Q \in \Delta ( \mathcal{B} )$, define

$f ( P, Q ) := \sum_{a \in \mathcal{A}} \sum_{b \in \mathcal{B}} f ( a, b ) P ( a ) Q ( b ) .$

If Alice does not have any additional information about Bob, it is reasonable to consider the very robust minimax strategy, i.e., to choose her action $a \in \mathcal{A}$ randomly according to

$P^\star \in \mathrm{arg\, max}_{P} \mathrm{min}_Q f ( P, Q ) .$

No matter what Bob’s strategy is, Alice is guaranteed to achieve at least the minimax expected payoff.

If the payoff function $f$ is vector-valued, talking about the minimax expected payoff becomes nonsense, as we cannot compare two vectors. This post introduces an analog for the vector-valued payoff case, proposed by David Blackwell in this classic paper.

Blackwell Approachability

In the rest of this post, suppose that the payoff function $f$ takes values in $\mathbb{R}^d$ .

Blackwell considered the following protocol of repeated games. For $t = 1, 2, \ldots$,

Alice announces $a_t \in \mathcal{A}$.
Bob announces $b_t \in \mathcal{B}$, without known $a_t$.

In general, Alice can use a randomized strategy. For convenience, let us say that Alice chooses $a_t$ randomly according to some probability distribution $P_t \in \Delta ( \mathcal{A} )$ .

Recall that Alice cannot achieve the “minimax expected payoff” in this vector-payoff case, as it is just not well-define. However, it might be possible to ensure that her average payoff will ultimately lies in some specific subset of $\mathbb{R}^d$.

Define the average payoff

$\bar{f}_t := \frac{1}{t} \sum_{\tau = 1}^t f ( a_\tau, b_\tau ) .$

Definition. Let $\mathcal{X} \subseteq \mathbb{R}^d$. The set $\mathcal{X}$ is said to be approachable by Alice, if for any sequence $( b_t )_{t \in \mathbb{N}}$ ,

$\lim_{T \to \infty} \mathrm{dist} ( \bar{f}_T, \mathcal{X} ) = 0 , \quad \text{a.s.},$

where $\mathrm{dist} ( \bar{f}_T, \mathcal{X} )$ denotes the $2$-norm distance between $\bar{f}_T$ and the projection of $\bar{f}_T$ onto $\mathcal{X}$.

If we only consider convex sets, there is an exact characterization of approachable sets.

Theorem 1. A convex set $\mathcal{X} \subseteq \mathbb{R}^d$ is approachable to Alice, if and only if for any $Q \in \Delta ( \mathcal{B} )$, there exists some $P \in \Delta ( \mathcal{A} )$ such that $f ( P, Q ) \in \mathcal{X}$.

Exercise. Suppose that for each $t$, Bob’s action $b_t$ can be dependent on Alice’s action $a_t$. Show that then even when the if and only if condition holds, $\mathcal{X}$ may not be approachable to Alice.

The only if part of Theorem 1 is easy to check. Suppose that there exists some $Q^\star \in \Delta ( \mathcal{B} )$, such that $f ( P, Q^\star ) \notin \mathcal{X}$ for any $P \in \Delta ( \mathcal{A} )$ . To guarantee that Alice fails to approach $\mathcal{X}$, Bob can always choose $b_t$ randomly according to $Q^\star$ for all $t$.

We will focus on the if part of Theorem 1 in the rest of this post.

Blackwell’s Condition and Alice’s Strategy

To prove the if part of Theorem 1, we need to show the existence of an approaching strategy. Indeed, the proof is constructive; there is an explicitly defined approaching strategy based on the following lemma.

Lemma 1. Suppose that for any $Q \in \Delta ( \mathcal{B} )$, there exists some $P \in \Delta ( \mathcal{A} )$ such that $f ( P, Q ) \in \mathcal{X}$. Then for any $v \in \mathbb{R}^d$, there exists some $P \in \Delta ( \mathcal{A} )$, such that

$\left\langle v - \mathrm{proj} ( v ), f ( P, Q ) - \mathrm{proj} ( v ) \right\rangle \leq 0, \quad \text{for all } Q \in \Delta ( \mathcal{B} ) ,$

where $\mathrm{proj}$ denotes the projection (w.r.t. the $2$-norm) onto $\mathcal{X}$.

Proof. By definition, $\mathrm{proj} (v)$ is the minimizer of the $2$-norm distance to $v$ in $\mathcal{X}$. The optimality condition says

$\left\langle v - \mathrm{proj} ( v ), w - \mathrm{proj} ( v ) \right\rangle \leq 0, \quad \text{for all } w \in \mathcal{X} .$

What we want to prove can be equivalently written as

$\mathrm{min}_{P} \mathrm{max}_{Q} \left\langle v - \mathrm{proj} ( v ), f ( P, Q ) - \mathrm{proj} ( v ) \right\rangle \leq 0 .$

Applying von Neumann’s minimax theorem, we can exchange the order of $\min$ and $\max$, and write

$\mathrm{max}_{Q} \mathrm{min}_{P} \left\langle v - \mathrm{proj} ( v ), f ( P, Q ) - \mathrm{proj} ( v ) \right\rangle \leq 0 .$

The lemma follows from the optimality condition. $Q.E.D.$

We call the condition in Lemma 1 Blackwell’s condition, following the convention in literature. Blackwell’s condition ensures that the following strategy for Alice is well-defined.

Alice’s Strategy. Define the expected and average expected payoff:

$f_t := f( P_t, b_t ) := \sum_{a \in \mathcal{A}} f( a, b_t ) P_t( a ), \quad \tilde{f}_t := \frac{1}{t} \sum_{\tau = 1}^t f ( P_t, b_t ) .$

For each $t \in \mathbb{N}$, Alice computes some $P_t$ such that

$\left\langle \tilde{f}_{t - 1} - \mathrm{proj} ( \tilde{f}_{t - 1} ), f ( P_t, Q ) - \mathrm{proj} ( \tilde{f}_{t - 1} ) \right\rangle \leq 0, \quad \text{for all } Q \in \Delta ( \mathcal{B} ) ,$

and then announces $a_t$ randomly according to $P_t$.

Remark. In practice, $P_t$ can be computed as

$P_t \in \mathrm{arg\, min}_{P} \max_{Q} \left\langle \tilde{f}_{t - 1} - \mathrm{proj} ( \tilde{f}_{t - 1} ), f ( P, Q ) - \mathrm{proj} ( \tilde{f}_{t - 1} ) \right\rangle .$

This is a bilinear saddle point problem; there exists a variety of algorithms solving such a problem.

Proof of Theorem 1

Since $\mathcal{X}$ is approachable if and only if its closure is approachable, we assume that $\mathcal{X}$ is closed w.l.o.g. We will also assume that $\mathcal{X}$ is bounded for convenience; we will relax this unnecessary condition at the end of the proof.

Theorem 1 is in fact a corollary of the following theorem.

Theorem 2. Suppose that the if and only if condition in Theorem 1 holds. Suppose that Alice adopts the strategy described above. Then for each $T \in \mathbb{N}$ and $\delta \in ( 0, 1 )$, with probability at least $1 - \delta$ , it holds that

$\mathrm{dist} ( \bar{f}_T, \mathcal{X} ) = O \left( \frac{M + R}{\sqrt{T}} + M \sqrt{ \frac{\log(1 / \delta)}{T} } \right) ,$

where $M := \max_{a, b} \left\Vert f ( a, b ) \right\Vert_2$ , and $R := \max_{w \in \mathcal{X}} \left\Vert w \right\Vert_2$ .

To prove Theorem 2, we start with the following lemma.

Lemma 2. For all $T \in \mathbb{N}$, it holds that

$\mathrm{dist} \left( \tilde{f}_T, \mathcal{X} \right) \leq \frac{M + R}{\sqrt{T}} .$

Proof. Define $\pi_t := \mathrm{proj} ( f_t )$ and $\tilde{\pi}_t := \mathrm{proj} ( \tilde{f}_t )$ . By definition, we have

$\left\Vert \tilde{f}_{t + 1} - \tilde{\pi}_{t + 1} \right\Vert_2^2 \leq \left\Vert \tilde{f}_{t + 1} - \tilde{\pi}_{t} \right\Vert_2^2.$

Writing $\tilde{f}_{t + 1} = [t / (t + 1)] \tilde{f}_t + [1 / (t+1)] f_{t + 1}$ , we have

$\mathrm{RHS} = \left( \frac{t}{t + 1} \right)^2 \left\Vert \tilde{f}_t - \tilde{\pi}_t \right\Vert_2^2 + \left( \frac{1}{t + 1} \right)^2 \left\Vert f_{t + 1} - \tilde{\pi}_t \right\Vert_2^2 + \frac{2 t}{(t + 1)^2} \left\langle \tilde{f}_t - \tilde{\pi}_t, f_{t + 1} - \tilde{\pi}_t \right\rangle .$

Applying Lemma 1 and the triangle inequality, we obtain

$\mathrm{RHS} \leq \left( \frac{t}{t + 1} \right)^2 \left\Vert \tilde{f}_t - \tilde{\pi}_t \right\Vert_2^2 + \left( \frac{1}{t + 1} \right)^2 ( M + R )^2 .$

The lemma can be then easily checked by induction. Q.E.D.

Proof of Theorem 2. Recall that $\bar{f}_t$ denotes the average payoff; define $\bar{\pi}_t := \mathrm{proj}( \bar{f}_t )$ . We start with the decomposition,

$\mathrm{dist} ( \bar{f}_T, \mathcal{X} ) = \left\Vert \bar{f}_T - \bar{\pi}_T \right\Vert_2 \leq \left\Vert \bar{f}_T - \tilde{f}_T \right\Vert_2 + \left\Vert \tilde{f}_T - \tilde{\pi}_T \right\Vert_2 + \left\Vert \tilde{\pi}_T - \bar{\pi}_T \right\Vert_2$

By Lemma 2 and the non-expansiveness of projections, we have

$\mathrm{RHS} \leq \frac{M + R}{\sqrt{T}} + 2 \left\Vert \bar{f}_T - \bar{\pi}_T \right\Vert_2 .$

Define $X_t := \sum_{\tau = 1}^t f ( a_\tau, b_\tau ) - f ( P_\tau, b_\tau )$ . Then $( X_t )_{t \in \mathbb{N}}$ forms a vector-valued martingale. Applying the Azuma inequality for Banach-space valued martingales, we obtain, for any $u > 0$,

$\mathsf{P} \left( \left\Vert \bar{f}_T - \tilde{f}_T \right\Vert_2 \geq u \right) = O \left( \exp \left( - \frac{C T u^2 }{M^2} \right) \right) ,$

for some constant $C > 0$. Q.E.D.

In general, the set to be approached may not be bounded; then $R = + \infty$ if we directly apply the definition. Notice that, however, only sets lie in the convex hull of $\{ f ( a, b ) \}$ can be approached. Denote the convex hull by $\mathcal{F}$. Hence, it suffices to replace $\mathcal{X}$ by $\mathcal{X} \cap \mathcal{F}$ , which is always bounded by construction; then it is easily checked that $R$ can be bounded above by $M$ to slightly simplify the expressions.

Calibrated Forecasting

One might wonder why one should consider vector-valued payoffs. An interesting application is calibrated forecasting. With Theorem 1, it is an easy exercise to prove the theorem in the previous post, showing the existence of an $\varepsilon$-calibrated forecasting strategy.

References

The strategy of Alice presented above is not exactly the original one proposed by Blackwell, as the original one does not lead to the $O ( 1 / \sqrt{T} )$ with-high-probability convergence rate in Theorem 2. The strategy presented above was described in the book by Cesa-Bianchi and Lugosi, without proof. The proof above is basically a mix of the arguments and results in the following references.

Blackwell, D., 1956. An analog of the minimax theorem for vector payoffs. Pac. J. Math.
Cesa-Bianchi, N. and Lugosi, G., 2006. Prediction, Learning, and Games.
Naor, A., 2012. On the Banach-space-valued Azuma inequality and small-set isoperimetry of Alon–Roichman graphs. Combinatorics Probab. Comput.
Sorin, S., 2002. A First Course on Zero-Sum Repeated Games.

Calibrated Forecasting

Sep 22nd, 2017 1:37 pm

Probability Forecasting

Can one predict the weather ignorantly yet accurately? To be concrete, suppose we are given the weather record of the past 100 days,

$\text{wet}, \text{wet}, \text{dry}, \text{wet}, \text{dry}, \ldots;$

we cannot access any other information (e.g., temperature, satellite images, etc.), and we know nothing about meteorology. Can we tell how likely it will rain tomorrow?

This problem can be cast as a probability forecasting problem. The protocol of probability forecasting is as follows. For $t = 1, 2, \ldots$ ,

Predictor announces $p_t \in [ 0, 1 ]$ .
Reality announces $y_t \in \{ 0, 1 \}$ .

The goal of Predictor is to make accurate predictions, such that $p_t$ is a good estimate of the likelihood of the event $\{ y_t = 1 \}$ . As the protocol is deterministic, we do not say “probability of the event $\{ y_t = 1 \}$ ”.

Thinking of ‘$0$’ as ‘dry’, and ‘$1$’ as ‘wet’, we recover the original weather forecasting problem.

Calibration

What precisely are the requirements a good prediction strategy should satisfy? The requirements in general depend on the actual application scenario. In this post, we will only consider a minimal requirement—any good prediction strategy should be calibrated.

Roughly speaking, the notion of calibration requires the predictions to comply with the empirical frequencies, i.e., for any $\rho \in ( 0, 1 )$, it holds that

$\frac{\sum_{t = 1}^T \mathbb{I}_{ \{ p_t \approx \rho, y_t = 1 \} }}{\sum_{t = 1}^T \mathbb{I}_{ \{ p_t \approx \rho \} }} \approx \rho .$

To make the notion precise, consider the following discretization argument. Fix some $\varepsilon \in ( 0, 1 )$. Let us decompose the interval $[ 0, 1 ]$ as

$[ 0, 1 ] = \left[ 0, \varepsilon \right) \cup \left[ \varepsilon, 2 \varepsilon \right) \cup \cdots \cup \left[ (M - 2) \varepsilon, (M - 1) \varepsilon \right) \cup \left[ ( M - 1 ) \varepsilon, 1 \right] := I_1 \cup I_2 \cup \cdots \cup I_{M - 1} \cup I_M .$

where $M := \lceil \varepsilon^{-1} \rceil$. We denote the center of $I_m$ by $c_m$. Define the empirical frequencies

$\rho_m ( T ) := \frac{ \sum_{t = 1}^T \mathbb{I}_{ \{ p_t \in I_m, y_t = 1 \} } }{ \sum_{t = 1}^T \mathbb{I}_{ \{ p_t \in I_m \} } } ,$

if the divisor is non-zero, and $\rho_m (T) := c_m$ otherwise. We would like, for all $m$,

$\limsup_{T \to \infty} \left\vert \rho_m (T) - c_m \right\vert \leq \varepsilon, \quad \text{a.s.}$

We do not really want to care about those $m$’s for which $\sum_{t = 1}^T \mathbb{I}_{ \{ p_t \in I_m \} } = o( T )$ , as they are irrelevant in the long run. A compact formulation of the requirement is

$\limsup_{T \to \infty} \sum_{m = 1}^M \left\vert \rho_m (T) - c_m \right\vert \frac{ \sum_{t = 1}^T \mathbb{I}_{ \{ p_t \in I_m \} } }{ T } \leq \varepsilon , \quad \text{a.s.}$

Some simple algebraic manipulations lead to the following (slightly different) defnition.

Definition. A (possibly randomized) prediction strategy is $\varepsilon$ -calibrated, if the resulting sequence of predictions satisfy

$\limsup_{T \to \infty} \frac{1}{T} \sum_{m = 1}^M \left\vert \sum_{t = 1}^T \mathbb{I}_{ \{ p_t \in I_m \} } ( p_t - y_t ) \right\vert \leq \varepsilon , \quad \text{a.s.}$

Surprisingly, even when one knows nothing about the mechanism generating the sequence $( y_t )_{t \in \mathbb{N}}$ , calibration is achievable.

Theorem. Suppose that Reality, before announcing $y_t$, does not know the exact value of $p_t$. For any $\varepsilon > 0$, there exists an $\varepsilon$-calibrated probability forecasting strategy.

Some quick observations:

If $y_t$ can depend on the exact value of $p_t$, calibration is impossible. For example, Reality can always choose $y_t = 1$ whenever $p_t \leq 0.5$, and $y_t = 0$ otherwise.
The remark above also shows the necessity of a randomized forecasting strategy. If Forecaster uses a deterministic strategy, for which $p_t$ is a deterministic function of $p_1, y_1, p_2, y_2, \ldots, p_{t - 1}, y_{t - 1}$, then Reality can also compute exact value of $p_t$.

An Achieving Strategy

A calibrated forecasting strategy can be derived by the idea of Blackwell approachability.

The strategy only chooses $p_t$ from the set $\{ c_m \}$ ; effectively, Forecaster only needs to choose $m_t \in \{ 1, 2, \ldots, M \}$ for all $t$. Define, for all $m$,

$f_t ( m, y ) := ( 0, \ldots, 0, c_m - y, 0, \ldots, 0 ) \in \mathbb{R}^M .$

We already know that Forecaster must adopt a randomized strategy. Suppose that Forecaster chooses $m_t$ randomly according to some probability distribution $P_t$ on $\{ 1, 2, \ldots, M \}$ . Define

$f_t ( P_t, y_t ) := \sum_{m = 1}^M P_t (m) \, f_t ( m, y_t ), \quad \tilde{f}_t := \frac{1}{t} \sum_{\tau = 1}^t f_t ( P_{\tau}, y_\tau ) .$

Forecaster’s Strategy. For each $t$, Forecaster first computes some $P_t$ such that for any probability distribution $Q$ on $\{ 0, 1 \}$ ,

$\left\langle \tilde{f}_{t - 1} - \mathrm{proj} ( \tilde{f}_{t - 1} ), f ( P_t, Q ) - \mathrm{proj} ( \tilde{f}_{t - 1} ) \right\rangle \leq 0 ,$

where $\mathrm{proj}$ denotes projection onto the $1$-norm ball of radius $\varepsilon$ in $\mathbb{R}^M$, and then announces $m_t$ randomly according to $P_t$.

Lemma. Such $P_t$ always exists.

Remark. In practice, one can compute $P_t$ by solving the saddle point problem:

$P_t \in \mathrm{argmin}_P \mathrm{max}_Q \left\langle \tilde{f}_{t - 1} - \mathrm{proj} ( \tilde{f}_{t - 1} ), f ( P, Q ) - \mathrm{proj} ( \tilde{f}_{t - 1} ) \right\rangle .$

Notice that the objective function is bilinear; there are a variety of existing algorithms that solve this problem.

The strategy looks mysterious at first glance. I will introduce the theory of Blackwell approachability in the next post.

References

Cesa-Bianchi, N. and Lugosi, G., 2006. Prediction, Learning, and Games.
Dawid, A.P., 1982. The well calibrated Bayesian. J. Am. Stat. Assoc.
Dawid, A. P., 1985. Comment: The impossibility of inductive inference. J. Am. Stat. Assoc.
Foster, D. P., 1999. A proof of calibration via Blackwell’s approachability theorem. Games Econ. Behav.
Mannor, S. and Stoltz, G., 2010. A geometric proof of calibration. Math. Oper. Res.
Oakes, D., 1985. Self-calibrating priors do not exist. J. Am. Stat. Assoc.

Universally Consistent Prediction

Sep 20th, 2017 6:48 pm

Introduction

Can one predict the future well, without any subjective assumptions? Indeed, there are actually some prediction strategies that work without any subjective assumption, and approach (arguably) the best possible performance in some sense. This post introduces one such strategy described in an article by V. Vovk.

Consider the following protocol of online prediction. For $t = 1, 2, \ldots$,

Reality announces $x_t \in \mathcal{X}$.
Predictor announces $\gamma_t \in \Gamma$.
Reality announces $y_t \in \mathcal{Y}$.

We assume the perfect information case; that is, all $x_t$’s, $\gamma_t$’s, and $y_t$’s, once announced, are known to both Reality and Predictor.

Exercise. Find some real-life examples of this protocol. Notice that $\mathcal{X}$ can be the space of histories.

We measure the quality of prediction by a given loss function $\lambda: \mathcal{X} \times \Gamma \times \mathcal{Y} \to \mathbb{R}$, known to both Reality and Predictor. We will consider compact-type losses.

Definition. A loss function is called compact-type, if for any compact sets $A \subseteq \mathcal{X}$ and $B \subseteq \mathcal{Y}$ and any $M \in \mathbb{R}$, there exists some compact set $C \subset \Gamma$, such that $\lambda ( x, \gamma, y ) > M$ for any $x \in A$, $y \in B$, and $\gamma \notin C$.

A (possibly randomized) prediction strategy is a function $\psi: \mathcal{X} \to \Delta ( \Gamma )$, where $\Delta ( \Gamma )$ denotes the set of all probabiity measures on $\Gamma$. For each $t$, Predictor computes $Q_t := \psi( x_t )$, and chooses $\gamma_t \in \Gamma$ randomly according to $Q_t$.

Definition. We say that a prediction strategy is continuous, if the associated function $\psi$ is continuous.

Theorem 1 (Vovk). Suppose that $\mathcal{X}$ and $\mathcal{Y}$ are locally compact metric spaces, and $\Gamma$ is a metric space. Suppose that $\lambda$ is continuous and compact-type. Then for any $\varepsilon > 0$, there exists a prediction strategy, such that if $\{ x_t \}$ and $\{ y_t \}$ are precompact,

$\limsup_{T \to \infty} \frac{1}{T} \sum_{t = 1}^T \left( \lambda ( x_t, \gamma_t, y_t ) - \lambda ( x_t, \tilde{\gamma_t}, y_t ) \right) \leq \varepsilon, \quad \text{a.s.} ,$

where $( \tilde{\gamma_t} )_{t \in \mathbb{N}}$ is the sequence of predictions made by any continuous prediction strategy.

Any prediction strategy that achieves the theorem above is called universally consistent.

Ideas

The prediction strategy proposed by Vovk consists of two parts. Predictor first computes an estimate of the conditional probability distribution of $y_t$ given the history; then Predictor chooses the optimal $\gamma_t$ minimizing the estimated conditional expected loss. Notice that, however, the online prediction protocol is completely deterministic; hence talking about the conditional probability and expected loss, rigorously speaking, is not legal in the sense of measure-theoretic probability.

The first part is called probability forecasting. The protocol of probability forecasting is as follows. For $t = 1, 2, \ldots$,

Reality announces $x_t \in \mathcal{X}$.
Forecaster announces $P_t \in \Delta ( \mathcal{Y} )$.
Reality announces $y_t \in \mathcal{Y}$.

Theorem 2 (Vovk). Suppose that $\mathcal{X}$ and $\mathcal{Y}$ are compact metric spaces. There exists some forecasting strategy, such that

$\lim_{T \to \infty} \frac{1}{T} \sum_{t = 1}^T \left( f ( x_t, P_t, y_t ) - \int_{\mathcal{Y}} f ( x_t, P_t, y ) \, P_t ( \mathrm{d} y ) \right) = 0 ,$

for any continuous function $f: \mathcal{X} \times \Delta ( \mathcal{Y} ) \times \mathcal{Y} \to \mathbb{R}$.

The forecasting strategy behind Theorem 2 is based on the existence of a universal RKHS $\mathcal{H}$ on $\mathcal{X} \times \Delta ( \mathcal{Y} ) \times \mathcal{Y}$, i.e., an RKHS dense in $C ( \mathcal{X} \times \Delta ( \mathcal{Y} ) \times \mathcal{Y} )$. Following the framework of defensive forecasting, one can construct a forecasting strategy such that the inequality in Theorem 2 holds for all functions in $\mathcal{H}$. As $\mathcal{H}$ is dense in $C ( \mathcal{X} \times \Delta ( \mathcal{Y} ) \times \mathcal{Y} )$, one obtains Theorem 2.

Naively speaking, the second part corresponds to finding the randomized action that minimizes the conditional expected loss, i.e., computing

$\tilde{Q}_t \in \mathrm{argmin}_{Q \in \Delta( \Gamma )} \int_{\mathcal{Y}} \lambda ( x_t, Q, y ) P_t ( \mathrm{d} y ) ,$

where $P_t$ is the output of probability forecasting, and

$\lambda ( x, Q, y ) := \int_{\Gamma} \lambda ( x, \gamma, y ) Q ( \mathrm{d} \gamma ) .$

Then Predictor chooses $\gamma_t$ randomly according to $\tilde{Q}_t$.

Theorem 2 considers continuous functions, while $\tilde{Q}_t$ may not be continuously dependent on $(x_t, P_t)$. Therefore, we need to slightly modify the second part. What Predictor actually does in the second part is to compute $Q_t := G ( x_t, P_t ) \in \Delta ( \Gamma )$, for some continuous function $G$ satisfying

$\int_{\mathcal{Y}} \lambda ( x_t, G( x_t, P_t ), y ) P_t ( \mathrm{d} y ) \leq \mathrm{argmin}_{Q \in \Delta( \Gamma )} \int_{\mathcal{Y}} \lambda ( x_t, Q, y ) P_t ( \mathrm{d} y ) + \delta$

for some $\delta > 0$.

Lemma 1 (Vovk). Suppose that $\Gamma$ is a compact convex subset of a topological vector space. Such a continuous function $G$ exists for any $\delta > 0$.

In Theorem 1, the conditions on $\mathcal{X}$, $\mathcal{Y}$, and $\Gamma$ are not as strict as those in Theorem 2 and Lemma 1. This gap can be addressed by a doubling-like trick. Roughly speaking, Predictor starts with compact subsets $A \subseteq \mathcal{X}$ and $B \subseteq \mathcal{Y}$, and a compact convex subset $C \subseteq \Gamma$. Predictor implements the prediction strategy described above only on $A$, $B$, and $C$. If Reality announces some $( x_t, y_t ) \notin A \times B$, Predictor enlarges $A$, $B$ and $C$ such that $( x_t, y_t )$ is contained.

If Predictor is forced to enlarge the sets for inifinitely many times, $\{ x_t \}$ and $\{ y_t \}$ cannot be both precompact.
Otherwise, ultimately Predictor is working with some $A \subseteq \mathcal{X}$, $B \subseteq \mathcal{Y}$, and $C \subseteq \Gamma$, to which Theorem 2 and Lemma 1 apply.

Minimizing a Strongly Convex Function by Mirror Descent

May 5th, 2017 3:45 pm

Mirror descent (MD) is a famous convex optimization algorithm. When the objective function is Lipschitz continuous, the convergence rate of MD is known to be $O ( t^{-1/2} )$. When the objective function is also strongly convex, intuitively, a better convergence rate should be possible. Suprisingly, there are very few related results in existing literature, to the best of my knowledge. This post summarizes one such result in a review article by Juditsky and Nemirovski.

We will consider a standard constrained convex optimization problem:

$f^\star = \mathrm{arg\, min} \left\lbrace f ( x ) | x \in \mathcal{X} \right\rbrace ,$

where $f$ is a convex function, and $\mathcal{X}$ is a compact convex set in $\mathbb{R}^d$. We assume that $f$ is $L$-Lipschitz continuous w.r.t. some norm $\Vert \cdot \Vert$ on $\mathbb{R}^d$.

Brief Review of Mirror Descent

Let $\omega$ be a continuously differentiable $1$-strongly convex function on $\mathcal{X}$, w.r.t. $\Vert \cdot \Vert$. Define the corresponding Bregman divergence

$D ( y, x ) := \omega ( y ) - \left[ \omega ( x ) + \left\langle \omega' ( x ), y - x \right\rangle \right] ,$

and prox-mapping

$\mathrm{prox} ( \xi; x ) := \mathrm{arg\, min} \left\lbrace \left\langle \xi, u \right\rangle + D ( u, x ) \, \middle| \, x \in \mathcal{X} \right\rbrace .$

Let $x_1 \in \mathcal{X}$, and $( \eta_1, \eta_2, \ldots )$ be a sequence of step sizes. The standard MD iterates as follows.

$x_{t + 1} = \mathrm{prox} ( \eta_t f' ( x_t ) ; x_t ) , \quad t = 1, 2, \ldots .$

Define

$x^t := \frac{ \sum_{\tau = 1}^t \eta_\tau x_\tau }{ \sum_{\tau = 1}^t \eta_\tau } .$

The following result can be found in, e.g., the classic paper by Beck and Teboulle.

Proposition 1. It holds that, for any $u \in \mathcal{X}$,

$f ( x^t ) - f ( u ) \leq \frac{ D ( u, x_1 ) + \frac{1}{2} \sum_{\tau = 1}^t \eta_\tau^2 \left\Vert f' ( x_\tau ) \right\Vert_*^2 }{ \sum_{\tau = 1}^t \eta_\tau } ,$

where $\left\Vert \cdot \right\Vert_*$ denotes the dual norm.

Proposition 1 implies the following convergence guarantee for the standard MD.

Corollary. Fix a positive integer $T$. Set

$\Omega \geq \max \left\lbrace D ( u, x_1 ) \, \middle| \, u \in \mathcal{X} \right\rbrace , \quad \eta_t = \frac{\sqrt{2 \Omega }}{ L \sqrt{T}} \,\, \text{ for } t = 1, 2, \ldots, T .$

Then

$f ( x^T ) - f^\star \leq \frac{L \sqrt{2 \Omega}}{\sqrt{T}} .$

A Modified MD for Minimizing Strongly Convex Functions

Now assume that $f$ is also $\mu$-strongly convex w.r.t. $\Vert \cdot \Vert$. How do we exploit this additional information?

Let us choose $\omega$ such that it is $1$-strongly convex on the whole $\mathbb{R}^d$, instead of merely $\mathcal{X}$. For any $R > 0$, define $\omega_{R, z} ( x ) := \omega ( R^{-1} ( x - z ) )$; denote by $D_{R, z}$ the corresponding Bregman divergence, and $\mathrm{prox}_{R, z}$ the corresponding prox-mapping.

Let us define an iteration rule very similar to the standard MD; the only difference is that now we replace $\omega$ by $\omega_{R, z}$ for some $z \in \mathbb{R}^d$: Let $( \eta_1, \eta_2, \ldots )$ be a given sequence of step sizes. For any $x_1 \in \mathcal{X}$, we compute

$x_{t + 1} = \mathrm{prox}_{R, z} ( \eta_t f' ( x_t ) ; x_t ) , \quad t = 1, 2, \ldots .$

Define

$x^t ( R, z, x_1 ) := \frac{ \sum_{\tau = 1}^t \eta_\tau x_\tau }{ \sum_{\tau = 1}^t \eta_\tau } .$

Proposition 2. Fix a positive integer $T$. Set

$\Omega \geq \max \left\lbrace D ( u, x_1 ) \, \middle| \, \left\Vert u \right\Vert \leq 1, u \in \mathbb{R}^d \right\rbrace , \quad R_0 \geq \left\Vert x_1 - x^\star \right\Vert , \quad \eta_t = \frac{\sqrt{2 \Omega }}{ R_0 L \sqrt{T}} \,\, \text{ for all } t .$

Then

$f ( x^T ( R_0, x_1, x_1 ) ) - f^\star \leq \frac{R_0 L \sqrt{2 \Omega}}{\sqrt{T}}, \quad \left\Vert x^T ( R_0, x_1, x_1 ) - x^\star \right\Vert^2 \leq \frac{R_0 L \sqrt{8 \Omega}}{\mu \sqrt{T}} ,$

where $x^\star$ is the unique minimizer of $f$ on $\mathcal{X}$.

Proof. Notice that $\omega_R$ is $1$-strongly convex w.r.t. $\Vert \cdot \Vert_R := R^{-1} \Vert \cdot \Vert$. The first inequality follows from Proposition 1, using the norm $\Vert \cdot \Vert_R$. The second inequality follows from the following two inequalities, obtained by the strong convexity of $f$ and the optimality of $x^\star$, respectively:

$f ( x^T ) - f^\star \geq \left\langle f' ( x^\star ) , x^T - x^\star \right\rangle + \frac{\mu}{2} \Vert x^T - x^\star \Vert^2, \quad \left\langle f' ( x^\star ), x^T - x^\star \right\rangle \geq 0 .$

Q.E.D.

Now we have an error bound depending on $R_0$; the smaller $R_0$ is, the smaller the error bounds are. Also notice that the bound of the distance between $x^T$ and $x^\star$ is strictly decreasing with $T$. These observations motivate the following restarting strategy: Set $y_0 \in \mathcal{X}$. For $k = 1, 2, \ldots$,

Set $T_k$ such that $\left\Vert x^{T_k} ( R_{k - 1}, y_{k - 1}, y_{k - 1} ) - x^\star \right\Vert^2 \leq 2^{-1} R_{k - 1}^2$.
Compute $y_k = x^{T_k} ( R_{k - 1}, y_{k -1}, y_{k - 1} )$, with $\eta_t = \frac{\sqrt{2 \Omega }}{ R L \sqrt{T_k}}$ for all $t$.
Set $R_k^2 = 2^{-1} R_{k - 1}^2$.

By the proposition we have just proved, it suffices to choose

$T_k = \left\lceil \frac{32 L^2 \Omega}{\mu^2 R_{k - 1}^2} \right\rceil .$

Proposition 3. Define $M_k = \sum_{\kappa = 1}^k T_{\kappa}$. Let $k^\star$ be the smallest $k$ such that

$k \leq \frac{L^2 \Omega}{\mu^2 R_0^2} 2^{k + 5} .$

Then for $k < k^\star$,

$f ( y_k ) - f^\star \leq 2^{-(0.5 M_k + 1)} \mu R_0^2, \quad \left\Vert y_k - x^\star \right\Vert^2 \leq 2^{- 0.5 M_k} R_0^2 ;$

for $k \geq k^\star$,

$f ( y_k ) - f^\star \leq \frac{32 L^2 \Omega}{\mu M_k}, \quad \left\Vert y_k - x^\star \right\Vert^2 \leq \frac{64 L^2 \Omega}{\mu^2 M_k} .$

Proof. It is easily checked, by Proposition 2, that

$f ( y_k ) - f^\star \leq 2^{-(k + 1)} \mu R_0^2, \quad \left\Vert y_k - x^\star \right\Vert^2 \leq 2^{-k} R_0^2 .$

By our choice of $T_k$, it holds that

$M_k \leq k + \sum_{\kappa = 1}^k \frac{32 L^2 \Omega}{\mu^2 R_{\kappa - 1}^2} = k + \sum_{\kappa = 1}^k \frac{32 L^2 \Omega}{\mu^2 2^{-(\kappa - 1)} R_0^2} \leq k + \frac{L^2 \Omega}{\mu^2 R_0^2} 2^{k + 5} .$

Therefore, $M_k \leq 2 k$ for $k < k^\star$, and

$M_k \leq \frac{L^2 \Omega}{\mu^2 R_0^2} 2^{k + 6}$

for $k \geq k^\star$. Q.E.D.

Notice that to compute each $y^k$, we need to compute $T_k$ additional prox-mappings. Therefore, the effective number of iterations is represented by $M_k$, instead of $k$, and the convergence rate is understood as

$\text{gap to the optimal value} = O ( 1 / ( \text{effective number of iterations} ) ) .$

Defensive Forecasting

Apr 12th, 2017 6:37 pm

Defensive forecasting is an approach to designing competitive on-line learning algorithms. The ideas was partially motivated by the game-theoretic framework of probability theory. Although it seems that finally, defensive forecasting can be formulated without stating game-theoretic probability as in this paper, I find it insightful to understand the original formulation.

Game-Theoretic Probability

To get some idea about what the game-theoretic framework is, let us review an interesting result in this framework. Consider the following probabiilty prediction game. Let $K_0 = 1$. For every $n \in \mathbb{N}$,

Forecaster announces $p_n \in [ 0, 1 ]$.
Skeptic announces $s_n \in \mathbb{R}$.
Reality announces $y_n \in \lbrace 0, 1 \rbrace$.
$K_n = K_{n - 1} + s_n ( y_n - p_n )$.

This is in fact a game of betting from the perspective of Skeptic: Initially Skeptic has $1$ dollar. If $s_n > 0$, Skeptic is betting on the event $y_n = 1$; Skeptic gets $s_n y_n$ dollars in return if the event happens, and loses $s_n p_n$ dollars otherwise. Similarly, if $s_n < 0$, Skeptic is betting against the event $y_n = 1$.

The sequence $(K_n)$ is called a martingale. If $K_0 = 1$ and $K_n \geq 0$ for all $n$, the sequence $( K_n )$ is called a scoring martingale.

Of course, the values of $K_n$ depend on the strategy of Skeptic. Roughly speaking, in the game-theoretic framework, an event of small probabiilty corresponds to a large value of $K_n$.

Theorem. (Weak law of large numbers) For any $N \in \mathbb{N}$, there exists a scoring martingale $( K_n )$, such that for any $\delta > 0$, $K_N \geq 1 / \delta$ unless

$\left\vert \frac{\sum_{n = 1}^N ( y_n - p_n )}{N} \right\vert \leq \frac{1}{\sqrt{ \delta N }}.$

Proof. Choose

$s_n = \frac{2 \sum_{i = 1}^{n - 1} ( y_i - p_i )}{N}.$

Q.E.D.

Defensive Forecasting

Defensive forecasting aims at finding a strategy for Forecaster, such that Skeptic’s capital does not grow, no matter what Reality’s strategy is.

If Skeptic and Reality cooperate, this aim is in general impossible. For example, Skeptic and Reality can choose $s_n$ and $y_n$ in the following way:

If $p_n < 1 / 2$, they choose $s_n = 2$ and $y_n = 1$.
If $p_n \geq 1 / 2$, they choose $s_n = -2$ and $y_n = 0$.

Then $K_n \geq K_{n - 1} + 1$ for every $n$.

However, if we slightly weaken Skeptic, this aim becomes surprisingly easy. Notice that Skeptic keeps his wealth growing, by setting $s_n$ as a discontinuous function of $p_n$. Let us impose a restriction that $s_n$ must be a continuous function of $p_n$. Then we arrive at the following defensive porecasting protocol.

Let $K_0 = 1$. For every $n \in \mathbb{N}$,

Skeptic announces a continuous function $s_n: [ 0, 1 ] \to \mathbb{R}$.
Forecaster announces $p_n \in [ 0, 1 ]$.
Reality announces $y_n \in \lbrace 0, 1 \rbrace$.
$K_n = K_{n - 1} + s_n ( p_n ) ( y_n - p_n )$.

Theorem. There always exists a strategy for Forecaster, such that $K_0 \geq K_1 \geq K_2 \geq \cdots$.

Proof. If $s_n$ is strictly positive on $[ 0, 1 ]$, choose $p_n = 1$. If $s_n$ is strictly negative on $[ 0, 1 ]$, choose $p_n = 0$. Otherwise, by continuity of $s_n$, there must exist some $p^\star \in [0, 1]$ such that $s_n ( p^\star ) = 0$; choose $p_n = p^\star$. Q.E.D.

K29

K29 is an algorithm for probability prediction, based on the defensive forecasting approach. Consider the following protocol. For $n = 1, 2, \ldots, N$,

Forecaster announces $p_n \in [ 0, 1 ]$.
Reality announces $y_n \in \lbrace 0, 1 \rbrace$.

Weather forecasting is an instance of this protocol. One can associate $y_n = 1$ to a rainy day; then $p_n$ denotes Forecaster’s estimate of the probability that a rainy day will happen.

Let $\kappa$ be a kernel on $[ 0, 1 ]$. The aim of K29 is to guarantee, for all $p \in [ 0, 1 ]$,

$\left\vert \sum_{n = 1}^N \kappa ( p_n, p ) ( y_n - p_n ) \right\vert \leq \sqrt{N} .$

This inequality, roughly speaking, guarantees that the forecasting strategy is calibrated, i.e., for any $p \in [ 0, 1 ]$, among those days with $p_n \approx p$, there are about $( 100 \times p ) \%$ of them that are actually rainy. A precise intrepretation of the inequality is in fact much more complicated (for example, it is known that it is impossible to have a deterministic calibrated forecasting algorithm), and hence skipped due to my lazziness ;) Please find the references for details.

Theorem. Assume $\kappa ( p, p ) \leq 1$ for all $p \in [ 0, 1 ]$. The inequality above can be achieved, if Forecaster applies the defensive forecasting strategy with respect to

$s_n: p \mapsto \sum_{i = 1}^n \kappa ( p, p_i ) ( y_i - p_i ) .$

The resulting forecasting algorithm is called K29, because it is related to a paper of Kolmogorov published in 1929.

References

G. Shafer and S. Ring, “Predicting bond yields using defensive forecasting,” 2010.
V. Vovk, “Predictions as statements and decisions,” 2006.
V. Vovk, A. Takemura, and G. Shafer, “Defensive forecasting,” 2005.

Hedge Algorithm

Mar 19th, 2017 11:56 am

Alice wants to win some money by betting on horce racing games. She knows nothing about horce racing. Fortunately, she knows that some of her friends are very good at betting on horce racing; however, she does not know who exactly is the best. Suppose Alice will bet on several rounds sequentially. Can Alice bet as well as the best among her friends in the long run?

Indeed, there are algorithms for doing so. A famous one is the so-called hedge algorithm, introduced in the paper “A desicion-theoretic generalization of on-line learning and an application to boosting” by Y. Freund and R. Schapire.

The idea is to switch between Alice’s fridends’ strategies. Assume for simplicity that Alice will always bet 1 unit of money for each round. Alice keeps in her mind a weight for each of her friend. At the beginning of each round, she chooses one of her friends with probability proportional to the weights, say Bob, and follows Bob’s bet. Once a round finishes, Alice sees how much she would lose had she followed any of her friend. She then updates the weights accordingly. Those who yielded smaller losses have higher weights, and vice versa.

Let us formalize the idea. We label the rounds by positive integers $t \in \mathbb{N}$. Suppse Alice has $N$ friends; let us index them by the set $[N] := \lbrace 1, \ldots, N \rbrace$. We summarize the weights before the $t$-th round by a weight vector $w_t := ( w_t (i) )_{i \in [N]} \in \mathbb{R}^N$ in the probability simplex, where $w_t (i)$ denotes the weight assigned to Friend $i$. We also summarize the losses yielded by each friend’s bet for the $t$-th round by a vector

$x_t := ( x_t (i) )_{i \in [N]} \in [ 0, + \infty [^N .$

Then for the $t$-th round, the expected loss of Alice is

$\sum_{i \in [N] } x_t(i) w_t(i) := \langle x_t, w_t \rangle .$

Suppose Alice will bet on $T$ rounds. The goal of Alice is to have a small regret

$R_T := \sum_{t \in [T]} \langle x_t, w_t \rangle - \min_{i \in [N]} \sum_{t \in [T]} x_t(i) ,$

where $[T] := \lbrace 1, \ldots, T \rbrace$.

The regret measures the difference between Alice’s expected loss and the loss yielded by the best friend in hindsight. If $R_T = o(T)$, in hindsight, the difference between the average expected losses of Alice and the best friend converges to zero. Notice that the best friend in general varies with $T$.

The hedge algorithm is a specific rule of setting the weights. Initially, Alice chooses any weight $w_1$ such that $w_1 (i) \neq 0$ for all $i$. After observing the losses $x_t$, Alice updates the weights as

$w_{t+1} (i) = Z_t^{-1} w_t(i) \, \mathrm{e}^{- \eta x_t (i)} , \quad i \in [N] ,$

where $Z_t := \sum_{i \in [N]} w_t (i) \, \mathrm{e}^{- \eta x_t (i)}$ is the normalizing constant, and $\eta$ is a properly chosen learning rate.

The hedge algorithm looks similar to the aggregating algorithm (AA). Indeed, we can formulate the problem using the setting of prediction with expert advice (PEA). Let Alice be Learner and her friends be the experts. Let Nature choose the loss vectors. Set the outcome space $\Omega$ as $[ 0, + \infty [^N$, prediction space $\Gamma$ as the probability simplex in $\mathbb{R}^N$, and loss function as

$\lambda ( \omega, \gamma ) := \langle \omega, \gamma \rangle , \quad \omega \in \Omega, \gamma \in \Gamma.$

When each expert $i \in [N]$ chooses a fixed prediction $\gamma_t (i) = ( \delta_{i,j} )_{j \in [N]} \in \mathbb{R}^N$, we recover the original betting-horce-racing-games problem.

However, applying the AA to this specific PEA game is not trivial, and requires some generalization of the mixability condition. Assuming bounded losses, the resulting algorithm yields $R_T = O( \sqrt{T} )$. See this paper and this paper for the details.

Aggregating Algorithm

Mar 2nd, 2017 6:57 pm

This post is essentially a summary of some results in the classical paper “A game of prediction with expert advice” by V. Vovk.

Definition. A game is a triple $( \Omega, \Gamma, \lambda )$, where $\Omega$ is the outcome space, $\Gamma$ is the prediction space, and $\lambda: \Omega \times \Gamma \to [0, \infty]$ is the loss function.

Consider the standard learning with expert advice setting, in which a learner tries to predict the outcomes of the nature sequentially, with the help of $n$ experts. Precisely speaking, at each trial $t \in \mathbb{N}$:

Each expert $i$ makes a prediction $\gamma_t (i)$, $1 \leq i \leq n$.
The learner makes a prediction $\gamma_t \in \Gamma$.
The nature chooses an outcome $\omega_t \in \Omega$.
The learner suffers for the loss $\lambda ( \omega_t, \gamma_t )$.

Notice that before making his prediction at the trial $t$, the learner has access to the past history of the nature (up to trial $t - 1$) and all of the experts (up to trial $t$).

For every $t \in \mathbb{N}$, let us define the learner’s accumulative loss as

$L_t := \lambda( \omega_1, \gamma_1 ) + \cdots + \lambda ( \omega_t, \gamma_t ),$

and each expert’s accumulative loss as

$L_t ( i ) := \lambda( \omega_1, \gamma_1 ( i ) ) + \cdots + \lambda( \omega_t, \gamma_t ( i ) ), \quad i = 1, \ldots, n .$

The learner’s goal is to make predictions almost as well as the best expert, in the sense that $L_t \leq L_t(i) + \varepsilon$ for all $i \leq n$, $t \in \mathbb{N}$, and some small enough $\varepsilon$.

Definition. We say the game is $\eta$-mixable, if for any probability distribution $P$ on $\Gamma$, there exists a $\gamma^\star \in \Gamma$, such that

$\exp ( - \eta \lambda ( \omega, \gamma^\star ) ) \geq \int \exp ( - \eta \lambda( \omega, \gamma ) ) \, P( \mathrm{d} \gamma ), \quad \text{for all } \omega \in \Omega . \notag$

Remark. By Jensen’s inequality, if the mapping $\gamma \mapsto \exp ( - \eta \lambda ( \omega, \gamma ) )$ is concave for all $\omega \in \Omega$, then the game is $\eta$-mixable. Any such loss function $\lambda$ is called exp-concave. Obviously, one can simply choose $\gamma^\star = \int \gamma \, P ( \mathrm{d} \gamma )$ if the loss is exp-concave.

Proposition. If a game is $\eta$-mixable, there exists a prediction algorithm for the learner, such that

$L_t \leq L_t ( i ) + \eta^{-1} \ln n , \quad \text{for all } i \leq n .$

One such algorithm is the aggregating algorithm (AA), first introduced in the paper “Aggregating strategies” by V. Vovk. Let $( \pi_0 ( i ) )_{i \leq n}$ be a probability vector whose entries are all non-zero. At each trial $t \in \mathbb{N}$, the AA outputs a prediction $\gamma_t \in \Gamma$ satisfying

$\exp ( - \eta \lambda ( \omega_t, \gamma_t ) ) \geq \sum_{i \leq n} \pi_{t - 1} ( i ) \exp ( - \eta \lambda( \omega_t, \gamma_t ( i ) ) ) ,$

and after observing $\omega_t$, the AA updates the probability vector as

$\pi_t ( i ) := z_t^{-1} \pi_{t - 1} ( i ) \exp ( - \eta \, \lambda ( \omega_t, \gamma_t ( i ) ) ) , \quad \text{for all } i \leq n ,$

where $z_t$ is a normalizing constant such that $( \pi_t (i) )_{i \leq n}$ is also a probability vector. Because of the $\eta$-mixability assumption, the AA is well-defined.

Lemma. For each $t \in \mathbb{N}$, one has

$\exp ( - \eta L_t ) \geq \sum_{i \leq n} \pi_0 ( i ) \, \exp ( - \eta L_t ( i ) ) .$

Proof. We prove by induction. Obviously, the inequality holds for $t = 0$ (for which we set $L_t = L_t (i) = 0$ for all $i \leq n$). Assume the inequality to be proved holds for $t = T$. We write

$\exp ( - \eta L_{T + 1} ) = \exp ( - \eta L_T ) \exp ( - \eta \lambda ( \omega_{T + 1}, \gamma_{T + 1}(i) ) )$

By the definition of the AA, the RHS is bounded below by

$\exp ( - \eta L_T ) \frac{ \sum_{i \leq n} \exp ( - \eta \lambda ( \omega_{T+1}, \lambda_{T+1} ( i ) ) ) \, \pi_0 ( i ) \, \exp ( - \eta L_T ( i ) ) }{ \sum_{i \leq n} \pi_0 ( i ) \, \exp ( - \eta L_T ( i ) ) } .$

By assumption, the RHS can be further bounded below by

$\sum_{i \leq n} \exp ( - \eta \lambda ( \omega_{T+1}, \lambda_{T+1} ( i ) ) ) \, \pi_0 ( i ) \, \exp ( - \eta L_{T} ( i ) ) = \sum_{i \leq n} \pi_0 ( i ) \exp ( \eta L_{T+1} ( i ) ) .$

This proves the lemma. Q.E.D.

By the lemma, we have

$\exp ( - \eta L_t ) \geq \pi_0 ( i ) \, \exp ( - \eta L_t ( i ) ) , \quad \text{for all } i \leq n .$

Hence we obtain

$L_t \leq L_t ( i ) + \frac{1}{ \eta } \log \left( \frac{1}{ \pi_0 ( i ) } \right) , \quad \text{for all} i \leq n.$

Choosing $\pi_0 ( i ) = 1 / n$ for all $i$ (the uniform distribution) proves the proposition.

Some Interesting Talks at NIPS 2016

Dec 12th, 2016 10:57 pm

NIPS 2016 was quite successful, in the sense that most of the papers interesting to me were chosen as oral presentations:) The list below is of course non-exhaustive and biased. The order is alphabetical, according to the last names of the presenters/first authors.

“Kernel-based Methods for Bandit Convex Optimization” by S. Bubeck
- Bubeck wrote a blog article on this paper: part 1, part 2, and part 3.
- Bubeck had given basically the same talk at the Simons Institute (youtube video).
“Supervised learning through the lens of compression” by O. David, S. Moran, and A. Yehudayoff
- Roughly speaking, a function class is learnable if it allows (approximate) compression. Notice that the compression is not defined in the Shannon-theoretic way.
“Generalization of ERM in stochastic convex optimization: The dimension strikes back” by V. Feldman
- This paper shows that minimizing the empirical average is not an optimal strategy for stochastic approximation in general.
“Safe testing: An adaptive alternative to p-value-based testing” by P. Grunwald
- Grunwald proposed the notion of safe test, which is robust against possible abuse of statistical methods, such as collecting data until the p-value is large enough. He also provided an algorithm for safe testing, based on the so-called reverse I-projection.
- Unfortunately, it seems that the paper has not been available on the internet.
“Theory and algorithms for forecasting non-stationary time series” by V. Kuznetsov and M. Mohri
- This is a tutorial talk mainly based on their NIPS’15 paper and COLT’16 paper.
“Without-replacement sampling for stochastic gradient methods” by O. Shamir
- This paper provides convergence guarantees for the setting mentioned in its title.
“Stochastic optimization: Beyond stochastic gradients and convexity: Part 2” by S. Sra
- The slides can serve as a good bibliography on solving non-convex finite-sum optimization problems.
“MetaGrad: Multiple learning rates in online learning” by T. van Erven and W. M. Koolen
- This paper proposes a somewhat universally optimal scheme for online learning. The idea is to discretize the interval of possible step sizes, treat each candidate step size as an expert, and then do prediction with experts to choose the step size.

The Pinching Trick and the Golden-Thompson Inequality

Nov 10th, 2016 11:10 am

Pinching

Let $A$ be a Hermitian matrix, and $A = \sum_j \lambda_j P_j$ be the spectral decomposition of $A$. The pinching map defined by $A$ is given by

$\mathcal{P}_A (X) = \sum_j P_j X P_j ,$

for any Hermitian matrix $X$.

Theorem 1. Let $A$ be a positive semi-definite matrix and $B$ be a Hermitian matrix. The following statements hold.

$\mathcal{P}_B (A)$ commutes with $B$.
$\mathrm{Tr} ( \mathcal{P}_B (A) B ) = \mathrm{Tr} ( A B )$.
(Pinching inequality) $\vert \mathrm{spec} (B) \vert \, \mathcal{P}_B (A) \geq A$, where $\mathrm{spec} (B)$ denotes the set of eigenvalues of $B$.

The first two statements are easy to check. The earliest reference on the pinching inequality I can find is the classic book by Jacques Dixmier. A simple proof of the pinching inequality can be found in the textbook by Masahito Hayashi.

One main issue in matrix analysis is non-commutativity. The first statement in Theorem 1 hints that pinching can be an useful tool to deal with this issue. In the next section, the pinching trick is illustrated using the Golden-Thompson inequality as an example.

A proof of the Golden-Thompson Inequality

The Golden-Thompson inequality says that

$\mathrm{Tr} ( \exp ( A + B ) ) \leq \mathrm{Tr} ( \exp (A) \exp (B) ) ,$

for any two Hermitian matrices $A$ and $B$. Obviously, if $A$ commutes with $B$, the Golden-Thompson inequality holds with an equality; however, in general one needs to take non-commutativity into consideration. Below we present a very elegant proof using the pinching trick from a recent paper by D. Sutter et al.

The key observation is that $\vert \mathrm{spec} ( A^{\otimes n} ) \vert$ does not grow rapidly with $n$ for any Hermitian matrix $A$.

Lemma 1. One has $\vert \mathrm{spec} ( A^{\otimes n} ) \vert = O ( \mathsf{poly} (n) )$ for any Hermitian matrix $A$.

Proof (Golden-Thompson inequality).

Let $X$ and $Y$ be two positive definite matrices. Then one can write

$\log \mathrm{Tr} ( \exp ( \log X + \log Y ) ) = \frac{1}{n} \log \mathrm{Tr} ( \exp ( \log X^{\otimes n} + \log Y^{\otimes n} ) ),$

for any positive integer $n$. By the pinching inequality, one has

$\frac{1}{n} \log \mathrm{Tr} ( \exp ( \log X^{\otimes n} + \log Y^{\otimes n} ) ) \leq \frac{1}{n} \log \mathrm{Tr} \{ \exp [ \log ( \vert \mathrm{spec} ( Y^{\otimes n} ) \vert \, \mathcal{P}_{Y^{\otimes n}} ( X^{\otimes n} ) ] + \log Y^{\otimes n} ) \}$

By the first two statements in Theorem 1 and Lemma 1, one has

$\begin{align} \text{RHS} & = \frac{1}{n} \log \mathrm{Tr}\, ( \mathcal{P}_{Y^{\otimes n}} ( X^{\otimes n} ) Y^{\otimes n} ) + \frac{\log \mathsf{poly} (n)}{n} \notag \\ & = \frac{1}{n} \log \mathrm{Tr} ( X^{\otimes n} Y^{\otimes n} ) + \frac{\log \mathsf{poly} (n)}{n} \notag \\ & = \log \mathrm{Tr}\, ( X Y ) + \frac{\log \mathsf{poly} (n)}{n} . \end{align}$

Then one obtains the Golden-Thompson Inequality by letting $n \to \infty$.

Blog Archives