Que sais-je?

Blackwell Approachability

Introduction

Suppose that Alice and Bob are playing a so-called two-player zero-sum game with perfect information. Alice chooses her action from a finite action set $\mathcal{A}$; Bob chooses his action from a finite action set $\mathcal{B}$. There is a payoff function $f$, which assigns to each action pair $(a, b) \in \mathcal{A} \times \mathcal{B}$ a payoff $f ( a, b ) \in \mathbb{R}$; Bob’s payoff is then simply $-f (a, b)$. In general, Alice and Bob may choose their actions randomly, so we consider the expected payoff: for any probability distributions $P \in \Delta ( \mathcal{A} )$ and $Q \in \Delta ( \mathcal{B} )$, define

$$ f ( P, Q ) := \mathbb{E}_{a \sim P, \, b \sim Q} \, f ( a, b ) = \sum_{a \in \mathcal{A}} \sum_{b \in \mathcal{B}} P ( a ) \, Q ( b ) \, f ( a, b ). $$
If Alice does not have any additional information about Bob, it is reasonable for her to adopt the robust minimax strategy, i.e., to choose her action randomly according to

$$ P^\star \in \operatorname*{arg\,max}_{P \in \Delta ( \mathcal{A} )} \, \min_{Q \in \Delta ( \mathcal{B} )} f ( P, Q ). $$
No matter what Bob’s strategy is, Alice is guaranteed to achieve at least the minimax expected payoff.

If the payoff function $f$ is vector-valued, the minimax expected payoff is no longer well defined, as there is no canonical way to compare two vectors. This post introduces an analog for the vector-valued payoff case, proposed by David Blackwell in his classic 1956 paper (see the references below).

Blackwell Approachability

In the rest of this post, suppose that the payoff function $f$ takes values in $\mathbb{R}^d$.

Blackwell considered the following protocol of repeated games. For $t = 1, 2, \ldots$,

  1. Alice announces $a_t \in \mathcal{A}$.
  2. Bob announces $b_t \in \mathcal{B}$, without knowing $a_t$.

In general, Alice can use a randomized strategy. For convenience, let us say that Alice chooses $a_t$ randomly according to some probability distribution $P_t \in \Delta ( \mathcal{A} )$.

Recall that Alice cannot achieve the “minimax expected payoff” in this vector-payoff case, as it is simply not well-defined. However, it might be possible to ensure that her average payoff ultimately lies in some specific subset of $\mathbb{R}^d$.

Define the average payoff

$$ \bar{f}_T := \frac{1}{T} \sum_{t = 1}^{T} f ( a_t, b_t ). $$
Definition. Let $\mathcal{X} \subseteq \mathbb{R}^d$. The set $\mathcal{X}$ is said to be approachable by Alice, if Alice has a strategy such that, for any sequence $( b_t )_{t \in \mathbb{N}}$ of Bob’s actions,

$$ \lim_{T \to \infty} \mathrm{dist} ( \bar{f}_T, \mathcal{X} ) = 0 \quad \text{almost surely}, $$
where $\mathrm{dist} ( \bar{f}_T, \mathcal{X} )$ denotes the $2$-norm distance between $\bar{f}_T$ and the projection of $\bar{f}_T$ onto $\mathcal{X}$.

If we only consider convex sets, there is an exact characterization of approachable sets.

Theorem 1. A convex set $\mathcal{X} \subseteq \mathbb{R}^d$ is approachable by Alice, if and only if for any $Q \in \Delta ( \mathcal{B} )$, there exists some $P \in \Delta ( \mathcal{A} )$ such that $f ( P, Q ) \in \mathcal{X}$.

Exercise. Suppose that for each $t$, Bob’s action $b_t$ can depend on Alice’s action $a_t$. Show that, even when the if and only if condition holds, $\mathcal{X}$ may not be approachable by Alice.

The only if part of Theorem 1 is easy to check. Suppose that there exists some $Q^\star \in \Delta ( \mathcal{B} )$, such that $f ( P, Q^\star ) \notin \mathcal{X}$ for any $P \in \Delta ( \mathcal{A} )$. To guarantee that Alice fails to approach $\mathcal{X}$, Bob can always choose $b_t$ randomly according to $Q^\star$ for all $t$: the average payoff $\bar{f}_T$ then concentrates around the compact set $\lbrace f ( P, Q^\star ) : P \in \Delta ( \mathcal{A} ) \rbrace$, which is at a positive distance from $\mathcal{X}$ whenever $\mathcal{X}$ is closed.

We will focus on the if part of Theorem 1 in the rest of this post.

Blackwell’s Condition and Alice’s Strategy

To prove the if part of Theorem 1, we need to show the existence of an approaching strategy. Indeed, the proof is constructive; there is an explicitly defined approaching strategy based on the following lemma.

Lemma 1. Suppose that for any $Q \in \Delta ( \mathcal{B} )$, there exists some $P \in \Delta ( \mathcal{A} )$ such that $f ( P, Q ) \in \mathcal{X}$. Then for any $v \in \mathbb{R}^d$, there exists some $P \in \Delta ( \mathcal{A} )$, such that

$$ \langle v - \mathrm{proj} ( v ), \, f ( P, Q ) - \mathrm{proj} ( v ) \rangle \leq 0 \quad \text{for all } Q \in \Delta ( \mathcal{B} ), $$
where $\mathrm{proj}$ denotes the projection (w.r.t. the $2$-norm) onto $\mathcal{X}$.

Proof. By definition, $\mathrm{proj} (v)$ is the minimizer of the $2$-norm distance to $v$ in $\mathcal{X}$. The optimality condition says

$$ \langle v - \mathrm{proj} ( v ), \, x - \mathrm{proj} ( v ) \rangle \leq 0 \quad \text{for all } x \in \mathcal{X}. $$

What we want to prove can be equivalently written as

$$ \min_{P \in \Delta ( \mathcal{A} )} \max_{Q \in \Delta ( \mathcal{B} )} \langle v - \mathrm{proj} ( v ), \, f ( P, Q ) - \mathrm{proj} ( v ) \rangle \leq 0. $$

Applying von Neumann’s minimax theorem, we can exchange the order of $\min$ and $\max$, and write

$$ \min_{P \in \Delta ( \mathcal{A} )} \max_{Q \in \Delta ( \mathcal{B} )} \langle v - \mathrm{proj} ( v ), \, f ( P, Q ) - \mathrm{proj} ( v ) \rangle = \max_{Q \in \Delta ( \mathcal{B} )} \min_{P \in \Delta ( \mathcal{A} )} \langle v - \mathrm{proj} ( v ), \, f ( P, Q ) - \mathrm{proj} ( v ) \rangle. $$

The lemma then follows from the optimality condition: for each $Q$, the assumption gives some $P$ with $f ( P, Q ) \in \mathcal{X}$, so the inner minimum on the right-hand side is at most $0$. $Q.E.D.$

We call the condition in Lemma 1 Blackwell’s condition, following the convention in the literature. Blackwell’s condition ensures that the following strategy for Alice is well-defined.

Alice’s Strategy. Define the expected payoff and the average expected payoff:

$$ F_t := f ( P_t, b_t ) = \mathbb{E}_{a \sim P_t} f ( a, b_t ), \qquad \bar{F}_t := \frac{1}{t} \sum_{s = 1}^{t} F_s, \qquad \bar{F}_0 := 0. $$

For each $t \in \mathbb{N}$, Alice computes some $P_t$ such that

$$ \left\langle \bar{F}_{t - 1} - \mathrm{proj} ( \bar{F}_{t - 1} ), \, f ( P_t, Q ) - \mathrm{proj} ( \bar{F}_{t - 1} ) \right\rangle \leq 0 \quad \text{for all } Q \in \Delta ( \mathcal{B} ), $$
and then announces $a_t$ randomly according to $P_t$.

Remark. In practice, $P_t$ can be computed as

$$ P_t \in \operatorname*{arg\,min}_{P \in \Delta ( \mathcal{A} )} \max_{Q \in \Delta ( \mathcal{B} )} \left\langle \bar{F}_{t - 1} - \mathrm{proj} ( \bar{F}_{t - 1} ), \, f ( P, Q ) - \mathrm{proj} ( \bar{F}_{t - 1} ) \right\rangle. $$
This is a bilinear saddle point problem; there exists a variety of algorithms solving such a problem.
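For concreteness, here is a minimal sketch of how such a $P_t$ could be computed numerically: the inner $\max$ over $Q$ is attained at a pure action, so the saddle point reduces to a small matrix game that can be solved by a linear program. This is my own illustration, not the implementation used in the references; the payoff array `f`, the running average `F_bar`, and the projection `proj_X` onto $\mathcal{X}$ are assumed to be given.

```python
import numpy as np
from scipy.optimize import linprog

def blackwell_step(f, F_bar, proj_X):
    """One step of the approaching strategy (a sketch, not the reference code).

    f       : array of shape (A, B, d); f[a, b] is the vector payoff.
    F_bar   : current average expected payoff, shape (d,).
    proj_X  : function projecting a point of R^d onto the target set X.
    Returns a distribution P over Alice's actions attaining
        min_P max_Q < F_bar - proj(F_bar), f(P, Q) - proj(F_bar) >.
    """
    A, B, _ = f.shape
    v = F_bar - proj_X(F_bar)                                  # normal direction
    M = np.tensordot(f - proj_X(F_bar), v, axes=([2], [0]))    # (A, B) matrix game
    # LP: minimize s  subject to  M^T P <= s * 1,  sum(P) = 1,  P >= 0.
    c = np.r_[np.zeros(A), 1.0]                                # variables (P, s)
    A_ub = np.c_[M.T, -np.ones(B)]
    A_eq = np.r_[np.ones(A), 0.0].reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(B), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * A + [(None, None)])
    return res.x[:A]
```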

Proof of Theorem 1

Since $\mathcal{X}$ is approachable if and only if its closure is approachable, we assume that $\mathcal{X}$ is closed w.l.o.g. We will also assume that $\mathcal{X}$ is bounded for convenience; we will relax this unnecessary condition at the end of the proof.

Theorem 1 is in fact a corollary of the following theorem.

Theorem 2. Suppose that the if and only if condition in Theorem 1 holds. Suppose that Alice adopts the strategy described above. Then for each $T \in \mathbb{N}$ and $\delta \in ( 0, 1 )$, with probability at least $1 - \delta$, it holds that

where , and .

To prove Theorem 2, we start with the following lemma.

Lemma 2. For all $T \in \mathbb{N}$, it holds that

Proof. Define and . By definition, we have

Writing , we have

Applying Lemma 1 and the triangle inequality, we obtain

The lemma can then be easily checked by induction. Q.E.D.

Proof of Theorem 2. Recall that $\bar{f}_T$ denotes the average payoff and $\bar{F}_T$ the average expected payoff defined above. We start with the decomposition

$$ \mathrm{dist} ( \bar{f}_T, \mathcal{X} ) \leq \Vert \bar{f}_T - \bar{F}_T \Vert_2 + \mathrm{dist} ( \bar{F}_T, \mathcal{X} ). $$
By Lemma 2 and the non-expansiveness of projections, we have

Define $Z_t := \sum_{s = 1}^{t} \left( f ( a_s, b_s ) - f ( P_s, b_s ) \right)$. Then $( Z_t )_{t \in \mathbb{N}}$ forms a vector-valued martingale. Applying the Azuma inequality for Banach-space valued martingales, we obtain, for any $u > 0$,

for some constant $C > 0$. Q.E.D.

In general, the set to be approached may not be bounded; then $R = + \infty$ if we directly apply the definition. Notice, however, that the average payoffs always lie in the convex hull of $\lbrace f ( a, b ) : ( a, b ) \in \mathcal{A} \times \mathcal{B} \rbrace$. Denote this convex hull by $\mathcal{F}$. Hence, it suffices to replace $\mathcal{X}$ by $\mathcal{X} \cap \mathcal{F}$, which is bounded by construction; it is then easily checked that $R$ can be bounded above by $M$, which slightly simplifies the expressions.

Calibrated Forecasting

One might wonder why one should consider vector-valued payoffs. An interesting application is calibrated forecasting. With Theorem 1, it is an easy exercise to prove the theorem in the previous post, showing the existence of an $\varepsilon$-calibrated forecasting strategy.

References

The strategy of Alice presented above is not exactly the original one proposed by Blackwell, as the original one does not lead to the $O ( 1 / \sqrt{T} )$ with-high-probability convergence rate in Theorem 2. The strategy presented above was described in the book by Cesa-Bianchi and Lugosi, without proof. The proof above is basically a mix of the arguments and results in the following references.

  1. Blackwell, D., 1956. An analog of the minimax theorem for vector payoffs. Pac. J. Math.
  2. Cesa-Bianchi, N. and Lugosi, G., 2006. Prediction, Learning, and Games.
  3. Naor, A., 2012. On the Banach-space-valued Azuma inequality and small-set isoperimetry of Alon–Roichman graphs. Combinatorics Probab. Comput.
  4. Sorin, S., 2002. A First Course on Zero-Sum Repeated Games.

Calibrated Forecasting

Probability Forecasting

Can one predict the weather ignorantly yet accurately? To be concrete, suppose we are given the weather record of the past 100 days,

we cannot access any other information (e.g., temperature, satellite images, etc.), and we know nothing about meteorology. Can we tell how likely it will rain tomorrow?

This problem can be cast as a probability forecasting problem. The protocol of probability forecasting is as follows. For $t = 1, 2, \ldots$,

  1. Predictor announces $p_t \in [ 0, 1 ]$.
  2. Reality announces $y_t \in \lbrace 0, 1 \rbrace$.

The goal of Predictor is to make accurate predictions, such that $p_t$ is a good estimate of the likelihood of the event $y_t = 1$. As the protocol is deterministic, we do not say “probability of the event $y_t = 1$”.

Thinking of ‘$0$’ as ‘dry’, and ‘$1$’ as ‘wet’, we recover the original weather forecasting problem.

Calibration

What precisely are the requirements a good prediction strategy should satisfy? The requirements in general depend on the actual application scenario. In this post, we will only consider a minimal requirement—any good prediction strategy should be calibrated.

Roughly speaking, the notion of calibration requires the predictions to comply with the empirical frequencies, i.e., for any $\rho \in ( 0, 1 )$, it holds that

$$ \frac{\sum_{t = 1}^{T} y_t \, \mathbb{1} \lbrace p_t \approx \rho \rbrace}{\sum_{t = 1}^{T} \mathbb{1} \lbrace p_t \approx \rho \rbrace} \approx \rho \quad \text{for all large enough } T. $$
To make the notion precise, consider the following discretization argument. Fix some $\varepsilon \in ( 0, 1 )$. Let us decompose the interval $[ 0, 1 ]$ as

$$ [ 0, 1 ] = I_1 \cup I_2 \cup \cdots \cup I_M, \qquad I_1 := \left[ 0, \tfrac{1}{M} \right], \quad I_m := \left( \tfrac{m - 1}{M}, \tfrac{m}{M} \right] \ \text{for } m \geq 2, $$
where $M := \lceil \varepsilon^{-1} \rceil$. We denote the center of $I_m$ by $c_m$. Define the empirical frequencies

$$ \rho_m ( T ) := \frac{\sum_{t = 1}^{T} y_t \, \mathbb{1} \lbrace p_t \in I_m \rbrace}{\sum_{t = 1}^{T} \mathbb{1} \lbrace p_t \in I_m \rbrace} $$
if the divisor is non-zero, and $\rho_m (T) := c_m$ otherwise. We would like, for all $m$,

$$ \limsup_{T \to \infty} \, \left\vert \rho_m ( T ) - c_m \right\vert \leq \varepsilon. $$
We do not really want to care about those $m$’s for which $p_t \in I_m$ for only a vanishing fraction of the rounds, as they are irrelevant in the long run. A compact formulation of the requirement is

$$ \limsup_{T \to \infty} \sum_{m = 1}^{M} \frac{\sum_{t = 1}^{T} \mathbb{1} \lbrace p_t \in I_m \rbrace}{T} \left\vert \rho_m ( T ) - c_m \right\vert \leq \varepsilon. $$

Some simple algebraic manipulations lead to the following (slightly different) definition.

Definition. A (possibly randomized) prediction strategy is $\varepsilon$-calibrated, if the resulting sequence of predictions satisfies, almost surely,

$$ \limsup_{T \to \infty} \sum_{m = 1}^{M} \left\vert \frac{1}{T} \sum_{t = 1}^{T} \mathbb{1} \lbrace p_t \in I_m \rbrace \, ( y_t - c_m ) \right\vert \leq \varepsilon. $$
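For concreteness, here is a minimal sketch (my own, not taken from the references) that measures the discretized calibration error of a finite run of forecasts, using the bins $I_m$ and centers $c_m$ above; the bin-edge convention is mine and may differ slightly from the one used in the text.

```python
import numpy as np

def calibration_error(p, y, eps):
    """Discretized calibration error of forecasts p against outcomes y (a sketch).

    Bins [0, 1] into M = ceil(1 / eps) intervals with centers c_m and returns
    sum_m (N_m / T) * |rho_m(T) - c_m|, i.e., the weighted deviation of the
    empirical frequencies from the bin centers at horizon T = len(p).
    """
    p, y = np.asarray(p, dtype=float), np.asarray(y, dtype=float)
    T, M = len(p), int(np.ceil(1.0 / eps))
    bins = np.minimum((p * M).astype(int), M - 1)   # index of the bin containing p_t
    centers = (np.arange(M) + 0.5) / M
    err = 0.0
    for m in range(M):
        mask = bins == m
        if mask.any():
            err += (mask.sum() / T) * abs(y[mask].mean() - centers[m])
    return err
```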
Surprisingly, even when one knows nothing about the mechanism generating the sequence $( y_t )_{t \in \mathbb{N}}$, calibration is achievable.

Theorem. Suppose that Reality, before announcing $y_t$, does not know the exact value of $p_t$. For any $\varepsilon > 0$, there exists an $\varepsilon$-calibrated probability forecasting strategy.

Some quick observations:

  1. If $y_t$ can depend on the exact value of $p_t$, calibration is impossible. For example, Reality can always choose $y_t = 1$ whenever $p_t \leq 0.5$, and $y_t = 0$ otherwise.

  2. The remark above also shows the necessity of a randomized forecasting strategy. If Forecaster uses a deterministic strategy, for which $p_t$ is a deterministic function of $p_1, y_1, p_2, y_2, \ldots, p_{t - 1}, y_{t - 1}$, then Reality can also compute the exact value of $p_t$.
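To see the first observation in action, here is a toy simulation of my own: Reality reacts to the exact value of $p_t$ as described above, so every deterministic forecaster mispredicts by at least $1/2$ in each round, which is incompatible with $\varepsilon$-calibration for small $\varepsilon$.

```python
def adversarial_reality(p_t):
    """Reality's counter-move when it knows the exact value of p_t."""
    return 1 if p_t <= 0.5 else 0

def average_gap(forecaster, T):
    """Play T rounds of a deterministic forecaster against this Reality."""
    history, gap = [], 0.0
    for _ in range(T):
        p_t = forecaster(history)       # a deterministic function of the history
        y_t = adversarial_reality(p_t)
        history.append((p_t, y_t))
        gap += abs(y_t - p_t)           # each round contributes at least 1/2
    return gap / T

# Example: the forecaster predicting the running average of past outcomes.
avg_forecaster = lambda h: sum(y for _, y in h) / len(h) if h else 0.5
print(average_gap(avg_forecaster, 1000))    # stays at least 0.5
```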

An Achieving Strategy

A calibrated forecasting strategy can be derived from the idea of Blackwell approachability.

The strategy only chooses $p_t$ from the set $\lbrace c_1, \ldots, c_M \rbrace$ of bin centers; effectively, Forecaster only needs to choose $m_t \in \lbrace 1, \ldots, M \rbrace$, and announce $p_t = c_{m_t}$, for all $t$. Define, for all $m$,

$$ f_t ( m ) := \mathbb{1} \lbrace m_t = m \rbrace \, ( y_t - c_m ), $$

and collect these coordinates into the vector payoff $f_t := ( f_t ( 1 ), \ldots, f_t ( M ) ) \in \mathbb{R}^M$.
We already know that Forecaster must adopt a randomized strategy. Suppose that Forecaster chooses $m_t$ randomly according to some probability distribution $P_t$ on $\lbrace 1, \ldots, M \rbrace$. Define the average payoff

$$ \bar{f}_T := \frac{1}{T} \sum_{t = 1}^{T} f_t \in \mathbb{R}^M. $$
Forecaster’s Strategy. For each $t$, Forecaster first computes some $P_t$ such that for any probability distribution $Q$ on $\lbrace 0, 1 \rbrace$,

$$ \left\langle \bar{f}_{t - 1} - \mathrm{proj} ( \bar{f}_{t - 1} ), \; \mathbb{E}_{m_t \sim P_t, \, y_t \sim Q} \left[ f_t \right] - \mathrm{proj} ( \bar{f}_{t - 1} ) \right\rangle \leq 0, $$
where $\mathrm{proj}$ denotes projection onto the $1$-norm ball of radius $\varepsilon$ in $\mathbb{R}^M$, and then announces $m_t$ randomly according to $P_t$.

Lemma. Such $P_t$ always exists.

Remark. In practice, one can compute $P_t$ by solving the saddle point problem:

Notice that the objective function is bilinear; there are a variety of existing algorithms that solve this problem.

The strategy looks mysterious at first glance. I will introduce the theory of Blackwell approachability in the next post.

References

  1. Cesa-Bianchi, N. and Lugosi, G., 2006. Prediction, Learning, and Games.
  2. Dawid, A.P., 1982. The well calibrated Bayesian. J. Am. Stat. Assoc.
  3. Dawid, A. P., 1985. Comment: The impossibility of inductive inference. J. Am. Stat. Assoc.
  4. Foster, D. P., 1999. A proof of calibration via Blackwell’s approachability theorem. Games Econ. Behav.
  5. Mannor, S. and Stoltz, G., 2010. A geometric proof of calibration. Math. Oper. Res.
  6. Oakes, D., 1985. Self-calibrating priors do not exist. J. Am. Stat. Assoc.

Universally Consistent Prediction

Introduction

Can one predict the future well, without any subjective assumptions? Perhaps surprisingly, there are prediction strategies that work without any subjective assumption and approach (arguably) the best possible performance in a certain sense. This post introduces one such strategy, described in an article by V. Vovk.

Consider the following protocol of online prediction. For $t = 1, 2, \ldots$,

  1. Reality announces $x_t \in \mathcal{X}$.
  2. Predictor announces $\gamma_t \in \Gamma$.
  3. Reality announces $y_t \in \mathcal{Y}$.

We assume the perfect information case; that is, all $x_t$’s, $\gamma_t$’s, and $y_t$’s, once announced, are known to both Reality and Predictor.

Exercise. Find some real-life examples of this protocol. Notice that $\mathcal{X}$ can be the space of histories.

We measure the quality of prediction by a given loss function $\lambda: \mathcal{X} \times \Gamma \times \mathcal{Y} \to \mathbb{R}$, known to both Reality and Predictor. We will consider compact-type losses.

Definition. A loss function is called compact-type, if for any compact sets $A \subseteq \mathcal{X}$ and $B \subseteq \mathcal{Y}$ and any $M \in \mathbb{R}$, there exists some compact set $C \subset \Gamma$, such that $\lambda ( x, \gamma, y ) > M$ for any $x \in A$, $y \in B$, and $\gamma \notin C$.

A (possibly randomized) prediction strategy is a function $\psi: \mathcal{X} \to \Delta ( \Gamma )$, where $\Delta ( \Gamma )$ denotes the set of all probability measures on $\Gamma$. For each $t$, Predictor computes $Q_t := \psi( x_t )$, and chooses $\gamma_t \in \Gamma$ randomly according to $Q_t$.

Definition. We say that a prediction strategy is continuous, if the associated function $\psi$ is continuous.

Theorem 1 (Vovk). Suppose that $\mathcal{X}$ and $\mathcal{Y}$ are locally compact metric spaces, and $\Gamma$ is a metric space. Suppose that $\lambda$ is continuous and compact-type. Then for any $\varepsilon > 0$, there exists a prediction strategy, such that if $( x_t )_{t \in \mathbb{N}}$ and $( y_t )_{t \in \mathbb{N}}$ are precompact,

$$ \limsup_{T \to \infty} \frac{1}{T} \sum_{t = 1}^{T} \left( \lambda ( x_t, \gamma_t, y_t ) - \lambda ( x_t, \tilde{\gamma}_t, y_t ) \right) \leq \varepsilon \quad \text{almost surely}, $$

where $( \tilde{\gamma}_t )_{t \in \mathbb{N}}$ is the sequence of predictions made by any continuous prediction strategy.

Any prediction strategy that achieves the theorem above is called universally consistent.

Ideas

The prediction strategy proposed by Vovk consists of two parts. Predictor first computes an estimate of the conditional probability distribution of $y_t$ given the history; then Predictor chooses the optimal $\gamma_t$ minimizing the estimated conditional expected loss. Notice, however, that the online prediction protocol is completely deterministic; hence, rigorously speaking, talking about the conditional probability and the expected loss is not legitimate in the sense of measure-theoretic probability.

The first part is called probability forecasting. The protocol of probability forecasting is as follows. For $t = 1, 2, \ldots$,

  1. Reality announces $x_t \in \mathcal{X}$.
  2. Forecaster announces $P_t \in \Delta ( \mathcal{Y} )$.
  3. Reality announces $y_t \in \mathcal{Y}$.

Theorem 2 (Vovk). Suppose that $\mathcal{X}$ and $\mathcal{Y}$ are compact metric spaces. There exists some forecasting strategy, such that

$$ \lim_{T \to \infty} \frac{1}{T} \sum_{t = 1}^{T} \left( f ( x_t, P_t, y_t ) - \int f ( x_t, P_t, y ) \, P_t ( \mathrm{d} y ) \right) = 0 $$
for any continuous function $f: \mathcal{X} \times \Delta ( \mathcal{Y} ) \times \mathcal{Y} \to \mathbb{R}$.

The forecasting strategy behind Theorem 2 is based on the existence of a universal RKHS $\mathcal{H}$ on $\mathcal{X} \times \Delta ( \mathcal{Y} ) \times \mathcal{Y}$, i.e., an RKHS dense in $C ( \mathcal{X} \times \Delta ( \mathcal{Y} ) \times \mathcal{Y} )$. Following the framework of defensive forecasting, one can construct a forecasting strategy such that the inequality in Theorem 2 holds for all functions in $\mathcal{H}$. As $\mathcal{H}$ is dense in $C ( \mathcal{X} \times \Delta ( \mathcal{Y} ) \times \mathcal{Y} )$, one obtains Theorem 2.

Naively speaking, the second part corresponds to finding the randomized action that minimizes the conditional expected loss, i.e., computing

$$ \tilde{Q}_t \in \operatorname*{arg\,min}_{Q \in \Delta ( \Gamma )} \lambda ( x_t, Q, P_t ), $$

where $P_t$ is the output of probability forecasting, and

$$ \lambda ( x, Q, P ) := \int \int \lambda ( x, \gamma, y ) \, Q ( \mathrm{d} \gamma ) \, P ( \mathrm{d} y ). $$
Then Predictor chooses $\gamma_t$ randomly according to $\tilde{Q}_t$.
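For intuition, here is a minimal sketch of this second step in a toy finite case, assuming (my assumption, not Vovk's setting) that $\Gamma$ and $\mathcal{Y}$ have been discretized so that the loss is a matrix.

```python
import numpy as np

def best_response(loss, P_t):
    """Choose the action minimizing the expected loss under the forecast P_t.

    A toy finite sketch: loss[g, y] is the loss of action g on outcome y, and
    P_t[y] is the forecast probability of outcome y.  Returns a point mass
    (a degenerate Q_t) on the minimizing action.
    """
    expected = loss @ P_t                  # expected loss of each action
    Q_t = np.zeros(len(expected))
    Q_t[np.argmin(expected)] = 1.0         # may jump as P_t varies: not continuous
    return Q_t
```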

Theorem 2 considers continuous functions, while $\tilde{Q}_t$ may not depend continuously on $(x_t, P_t)$. Therefore, we need to slightly modify the second part. What Predictor actually does in the second part is to compute $Q_t := G ( x_t, P_t ) \in \Delta ( \Gamma )$, for some continuous function $G$ satisfying

$$ \lambda \left( x, G ( x, P ), P \right) \leq \inf_{Q \in \Delta ( \Gamma )} \lambda ( x, Q, P ) + \delta \quad \text{for all } ( x, P ) \in \mathcal{X} \times \Delta ( \mathcal{Y} ), $$

for some $\delta > 0$.

Lemma 1 (Vovk). Suppose that $\Gamma$ is a compact convex subset of a topological vector space. Such a continuous function $G$ exists for any $\delta > 0$.

In Theorem 1, the conditions on $\mathcal{X}$, $\mathcal{Y}$, and $\Gamma$ are not as strict as those in Theorem 2 and Lemma 1. This gap can be addressed by a doubling-like trick. Roughly speaking, Predictor starts with compact subsets $A \subseteq \mathcal{X}$ and $B \subseteq \mathcal{Y}$, and a compact convex subset $C \subseteq \Gamma$. Predictor implements the prediction strategy described above only on $A$, $B$, and $C$. If Reality announces some $( x_t, y_t ) \notin A \times B$, Predictor enlarges $A$, $B$ and $C$ such that $( x_t, y_t )$ is contained.

  1. If Predictor is forced to enlarge the sets infinitely many times, then $( x_t )_{t \in \mathbb{N}}$ and $( y_t )_{t \in \mathbb{N}}$ cannot both be precompact.
  2. Otherwise, ultimately Predictor is working with some $A \subseteq \mathcal{X}$, $B \subseteq \mathcal{Y}$, and $C \subseteq \Gamma$, to which Theorem 2 and Lemma 1 apply.

Minimizing a Strongly Convex Function by Mirror Descent

Mirror descent (MD) is a famous convex optimization algorithm. When the objective function is Lipschitz continuous, the convergence rate of MD is known to be $O ( t^{-1/2} )$. When the objective function is also strongly convex, intuitively, a better convergence rate should be possible. Surprisingly, there are very few related results in the existing literature, to the best of my knowledge. This post summarizes one such result in a review article by Juditsky and Nemirovski.

We will consider a standard constrained convex optimization problem:

$$ \min_{x \in \mathcal{X}} \; f ( x ), $$
where $f$ is a convex function, and $\mathcal{X}$ is a compact convex set in $\mathbb{R}^d$. We assume that $f$ is $L$-Lipschitz continuous w.r.t. some norm $\Vert \cdot \Vert$ on $\mathbb{R}^d$.

Brief Review of Mirror Descent

Let $\omega$ be a continuously differentiable $1$-strongly convex function on $\mathcal{X}$, w.r.t. $\Vert \cdot \Vert$. Define the corresponding Bregman divergence

$$ D ( u, x ) := \omega ( u ) - \omega ( x ) - \langle \nabla \omega ( x ), u - x \rangle, $$

and prox-mapping

$$ \mathrm{prox}_x ( \xi ) := \operatorname*{arg\,min}_{u \in \mathcal{X}} \left\lbrace \langle \xi, u \rangle + D ( u, x ) \right\rbrace. $$
Let $x_1 \in \mathcal{X}$, and $( \eta_1, \eta_2, \ldots )$ be a sequence of step sizes. The standard MD iterates as follows:

$$ x_{t + 1} := \mathrm{prox}_{x_t} ( \eta_t f' ( x_t ) ), \qquad f' ( x_t ) \in \partial f ( x_t ). $$

Define the averaged iterate

$$ x^T := \frac{\sum_{t = 1}^{T} \eta_t x_t}{\sum_{t = 1}^{T} \eta_t}. $$
The following result can be found in, e.g., the classic paper by Beck and Teboulle.

Proposition 1. It holds that, for any $u \in \mathcal{X}$,

$$ \sum_{t = 1}^{T} \eta_t \left( f ( x_t ) - f ( u ) \right) \leq D ( u, x_1 ) + \frac{1}{2} \sum_{t = 1}^{T} \eta_t^2 \left\Vert f' ( x_t ) \right\Vert_*^2, $$
where $\left\Vert \cdot \right\Vert_*$ denotes the dual norm.

Proposition 1 implies the following convergence guarantee for the standard MD.

Corollary. Fix a positive integer $T$. Set

$$ \eta_t = \frac{\sqrt{2 \Omega}}{L \sqrt{T}} \ \text{ for all } t, \qquad \text{where } \Omega := \max_{u \in \mathcal{X}} D ( u, x_1 ). $$

Then

$$ f ( x^T ) - \min_{x \in \mathcal{X}} f ( x ) \leq \frac{L \sqrt{2 \Omega}}{\sqrt{T}}. $$
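As a concrete instance, here is a minimal sketch of MD with the entropic $\omega$ on the probability simplex, for which the prox-mapping is a multiplicative update; the toy objective in the example is my own choice, not one from the review article.

```python
import numpy as np

def mirror_descent(subgrad, x1, T, eta):
    """Entropic mirror descent on the probability simplex (a sketch).

    Here omega is the negative entropy, which is 1-strongly convex w.r.t. the
    1-norm on the simplex, and the prox-mapping reduces to a multiplicative
    update followed by normalization.  `subgrad(x)` returns a subgradient of f.
    Returns the (eta-weighted) average of the iterates.
    """
    x = np.array(x1, dtype=float)
    avg = np.zeros_like(x)
    for _ in range(T):
        avg += eta * x
        x = x * np.exp(-eta * subgrad(x))   # prox step for the entropic omega
        x = x / x.sum()
    return avg / (eta * T)

# Toy example: minimize f(x) = <c, x> + 0.5 * ||x||_2^2 over the simplex.
c = np.array([1.0, 0.3, 0.7])
x_T = mirror_descent(lambda x: c + x, np.ones(3) / 3, T=2000, eta=0.05)
```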
A Modified MD for Minimizing Strongly Convex Functions

Now assume that $f$ is also $\mu$-strongly convex w.r.t. $\Vert \cdot \Vert$. How do we exploit this additional information?

Let us choose $\omega$ such that it is $1$-strongly convex on the whole $\mathbb{R}^d$, instead of merely $\mathcal{X}$. For any $R > 0$, define $\omega_{R, z} ( x ) := \omega ( R^{-1} ( x - z ) )$; denote by $D_{R, z}$ the corresponding Bregman divergence, and $\mathrm{prox}_{R, z}$ the corresponding prox-mapping.

Let us define an iteration rule very similar to the standard MD; the only difference is that now we replace $\omega$ by $\omega_{R, z}$ for some $z \in \mathbb{R}^d$: Let $( \eta_1, \eta_2, \ldots )$ be a given sequence of step sizes. For any $x_1 \in \mathcal{X}$, we compute

Define the corresponding averaged iterate

$$ x^T ( R, z, x_1 ) := \frac{\sum_{t = 1}^{T} \eta_t x_t}{\sum_{t = 1}^{T} \eta_t}. $$
Proposition 2. Fix a positive integer $T$. Set

Then

where $x^\star$ is the unique minimizer of $f$ on $\mathcal{X}$.

Proof. Notice that $\omega_R$ is $1$-strongly convex w.r.t. $\Vert \cdot \Vert_R := R^{-1} \Vert \cdot \Vert$. The first inequality follows from Proposition 1, using the norm $\Vert \cdot \Vert_R$. The second inequality follows from the following two inequalities, obtained by the strong convexity of $f$ and the optimality of $x^\star$, respectively:

$$ f ( x^T ) - f ( x^\star ) \geq \langle f' ( x^\star ), x^T - x^\star \rangle + \frac{\mu}{2} \Vert x^T - x^\star \Vert^2, \qquad \langle f' ( x^\star ), x^T - x^\star \rangle \geq 0, $$

where $f' ( x^\star ) \in \partial f ( x^\star )$ is the subgradient given by the first-order optimality condition.
Q.E.D.

Now we have an error bound depending on $R_0$; the smaller $R_0$ is, the smaller the error bounds are. Also notice that the bound on the distance between $x^T$ and $x^\star$ is strictly decreasing in $T$. These observations motivate the following restarting strategy: Set $y_0 \in \mathcal{X}$. For $k = 1, 2, \ldots$,

  1. Set $T_k$ such that $\left\Vert x^{T_k} ( R_{k - 1}, y_{k - 1}, y_{k - 1} ) - x^\star \right\Vert^2 \leq 2^{-1} R_{k - 1}^2$.
  2. Compute $y_k = x^{T_k} ( R_{k - 1}, y_{k -1}, y_{k - 1} )$, with $\eta_t = \frac{\sqrt{2 \Omega }}{ R L \sqrt{T_k}}$ for all $t$.
  3. Set $R_k^2 = 2^{-1} R_{k - 1}^2$.

By the proposition we have just proved, it suffices to choose

Proposition 3. Define $M_k = \sum_{\kappa = 1}^k T_{\kappa}$. Let $k^\star$ be the smallest $k$ such that

Then for $k < k^\star$,

for $k \geq k^\star$,

Proof. It is easily checked, by Proposition 2, that

By our choice of $T_k$, it holds that

Therefore, $M_k \leq 2 k$ for $k < k^\star$, and

for $k \geq k^\star$. Q.E.D.

Notice that to compute each $y_k$, we need to compute $T_k$ additional prox-mappings. Therefore, the effective number of iterations is represented by $M_k$, instead of $k$, and the convergence rate is understood as

Defensive Forecasting

Defensive forecasting is an approach to designing competitive on-line learning algorithms. The idea was partially motivated by the game-theoretic framework of probability theory. Although defensive forecasting can ultimately be formulated without reference to game-theoretic probability, as in this paper, I find it insightful to understand the original formulation.

Game-Theoretic Probability

To get some idea about what the game-theoretic framework is, let us review an interesting result in this framework. Consider the following probability prediction game. Let $K_0 = 1$. For every $n \in \mathbb{N}$,

  1. Forecaster announces $p_n \in [ 0, 1 ]$.
  2. Skeptic announces $s_n \in \mathbb{R}$.
  3. Reality announces $y_n \in \lbrace 0, 1 \rbrace$.
  4. $K_n = K_{n - 1} + s_n ( y_n - p_n )$.

This is in fact a game of betting from the perspective of Skeptic: Initially Skeptic has $1$ dollar. If $s_n > 0$, Skeptic is betting on the event $y_n = 1$: he pays $s_n p_n$ dollars and receives $s_n y_n$ dollars in return, for a net gain of $s_n ( y_n - p_n )$. Similarly, if $s_n < 0$, Skeptic is betting against the event $y_n = 1$.

The sequence $(K_n)$ is called a martingale. If $K_0 = 1$ and $K_n \geq 0$ for all $n$, the sequence $( K_n )$ is called a scoring martingale.

Of course, the values of $K_n$ depend on the strategy of Skeptic. Roughly speaking, in the game-theoretic framework, an event of small probability corresponds to a large value of $K_n$.

Theorem. (Weak law of large numbers) For any $N \in \mathbb{N}$, there exists a scoring martingale $( K_n )$, such that for any $\delta > 0$, $K_N \geq 1 / \delta$ unless

Proof. Choose

Q.E.D.

Defensive Forecasting

Defensive forecasting aims at finding a strategy for Forecaster, such that Skeptic’s capital does not grow, no matter what Reality’s strategy is.

If Skeptic and Reality cooperate, this aim is in general impossible. For example, Skeptic and Reality can choose $s_n$ and $y_n$ in the following way:

  • If $p_n < 1 / 2$, they choose $s_n = 2$ and $y_n = 1$.
  • If $p_n \geq 1 / 2$, they choose $s_n = -2$ and $y_n = 0$.

Then $K_n \geq K_{n - 1} + 1$ for every $n$.

However, if we slightly weaken Skeptic, this aim becomes surprisingly easy. Notice that Skeptic keeps his wealth growing by setting $s_n$ as a discontinuous function of $p_n$. Let us impose a restriction that $s_n$ must be a continuous function of $p_n$. Then we arrive at the following defensive forecasting protocol.

Let $K_0 = 1$. For every $n \in \mathbb{N}$,

  1. Skeptic announces a continuous function $s_n: [ 0, 1 ] \to \mathbb{R}$.
  2. Forecaster announces $p_n \in [ 0, 1 ]$.
  3. Reality announces $y_n \in \lbrace 0, 1 \rbrace$.
  4. $K_n = K_{n - 1} + s_n ( p_n ) ( y_n - p_n )$.

Theorem. There always exists a strategy for Forecaster, such that $K_0 \geq K_1 \geq K_2 \geq \cdots$.

Proof. If $s_n$ is strictly positive on $[ 0, 1 ]$, choose $p_n = 1$. If $s_n$ is strictly negative on $[ 0, 1 ]$, choose $p_n = 0$. Otherwise, by continuity of $s_n$, there must exist some $p^\star \in [0, 1]$ such that $s_n ( p^\star ) = 0$; choose $p_n = p^\star$. Q.E.D.
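Here is a minimal sketch of Forecaster's move. It follows the case analysis in the proof above, except that it only inspects the endpoint values $s(0)$ and $s(1)$ and locates an approximate zero by bisection; the tolerance is my own choice.

```python
def defensive_forecast(s, tol=1e-12):
    """Forecaster's move against a continuous betting function s: [0, 1] -> R.

    A sketch of the case analysis in the proof above, checking only the
    endpoint values: p = 1 works whenever s(1) >= 0, p = 0 works whenever
    s(0) <= 0, and otherwise a zero of s is located by bisection.  In every
    case Skeptic's gain s(p) * (y - p) is (approximately) non-positive for
    both outcomes y in {0, 1}.
    """
    if s(1.0) >= 0:
        return 1.0
    if s(0.0) <= 0:
        return 0.0
    lo, hi = 0.0, 1.0                      # here s(lo) > 0 > s(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if s(mid) > 0 else (lo, mid)
    return (lo + hi) / 2
```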

K29

K29 is an algorithm for probability prediction, based on the defensive forecasting approach. Consider the following protocol. For $n = 1, 2, \ldots, N$,

  1. Forecaster announces $p_n \in [ 0, 1 ]$.
  2. Reality announces $y_n \in \lbrace 0, 1 \rbrace$.

Weather forecasting is an instance of this protocol. One can associate $y_n = 1$ to a rainy day; then $p_n$ denotes Forecaster’s estimate of the probability that a rainy day will happen.

Let $\kappa$ be a kernel on $[ 0, 1 ]$. The aim of K29 is to guarantee, for all $p \in [ 0, 1 ]$,

$$ \left\vert \sum_{n = 1}^{N} \kappa ( p, p_n ) \, ( y_n - p_n ) \right\vert \leq \sqrt{N}. $$
This inequality, roughly speaking, guarantees that the forecasting strategy is calibrated, i.e., for any $p \in [ 0, 1 ]$, among those days with $p_n \approx p$, there are about $( 100 \times p ) \%$ of them that are actually rainy. A precise interpretation of the inequality is in fact much more complicated (for example, it is known that it is impossible to have a deterministic calibrated forecasting algorithm), and hence skipped due to my laziness ;) Please see the references for details.

Theorem. Assume $\kappa ( p, p ) \leq 1$ for all $p \in [ 0, 1 ]$. The inequality above can be achieved, if Forecaster applies the defensive forecasting strategy with respect to

$$ s_n ( p ) := \sum_{i = 1}^{n - 1} \kappa ( p, p_i ) \, ( y_i - p_i ). $$
The resulting forecasting algorithm is called K29, because it is related to a paper of Kolmogorov published in 1929.
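Here is a minimal sketch of K29, assuming the betting function has the form stated in the theorem above; it reuses the bisection idea from the previous section, and the Gaussian kernel in the example is my own choice.

```python
import numpy as np

def k29_forecast(history, kernel, tol=1e-9):
    """One K29 prediction (a sketch), given the past pairs (p_i, y_i).

    Applies the defensive strategy of the previous section to the betting
    function s_n(p) = sum_i kernel(p, p_i) * (y_i - p_i), assuming
    kernel(p, p) <= 1.
    """
    def s_n(p):
        return sum(kernel(p, p_i) * (y_i - p_i) for p_i, y_i in history)

    if s_n(1.0) >= 0:
        return 1.0
    if s_n(0.0) <= 0:
        return 0.0
    lo, hi = 0.0, 1.0                      # s_n(lo) > 0 > s_n(hi): bisect to a zero
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if s_n(mid) > 0 else (lo, mid)
    return (lo + hi) / 2

# Example: a Gaussian kernel and a short run against an arbitrary outcome sequence.
kappa = lambda p, q: float(np.exp(-(p - q) ** 2 / 0.02))
history = []
for y in [1, 0, 1, 1, 0, 1]:
    history.append((k29_forecast(history, kappa), y))
```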

References

  1. G. Shafer and S. Ring, “Predicting bond yields using defensive forecasting,” 2010.
  2. V. Vovk, “Predictions as statements and decisions,” 2006.
  3. V. Vovk, A. Takemura, and G. Shafer, “Defensive forecasting,” 2005.

Hedge Algorithm

Alice wants to win some money by betting on horse racing. She knows nothing about horse racing. Fortunately, she knows that some of her friends are very good at betting on horse races; however, she does not know who exactly is the best. Suppose Alice will bet on several rounds sequentially. Can Alice bet as well as the best among her friends in the long run?

Indeed, there are algorithms for doing so. A famous one is the so-called hedge algorithm, introduced in the paper “A decision-theoretic generalization of on-line learning and an application to boosting” by Y. Freund and R. Schapire.

The idea is to switch between Alice’s friends’ strategies. Assume for simplicity that Alice will always bet 1 unit of money in each round. Alice keeps in her mind a weight for each of her friends. At the beginning of each round, she chooses one of her friends with probability proportional to the weights, say Bob, and follows Bob’s bet. Once a round finishes, Alice sees how much she would have lost had she followed any of her friends. She then updates the weights accordingly. Those who yielded smaller losses get higher weights, and vice versa.

Let us formalize the idea. We label the rounds by positive integers $t \in \mathbb{N}$. Suppose Alice has $N$ friends; let us index them by the set $[N] := \lbrace 1, \ldots, N \rbrace$. We summarize the weights before the $t$-th round by a weight vector $w_t := ( w_t (i) )_{i \in [N]} \in \mathbb{R}^N$ in the probability simplex, where $w_t (i)$ denotes the weight assigned to Friend $i$. We also summarize the losses yielded by each friend’s bet in the $t$-th round by a vector

$$ x_t := ( x_t ( i ) )_{i \in [N]} \in [ 0, + \infty [^N. $$
Then for the $t$-th round, the expected loss of Alice is

$$ \langle w_t, x_t \rangle = \sum_{i \in [N]} w_t ( i ) \, x_t ( i ). $$
Suppose Alice will bet on $T$ rounds. The goal of Alice is to have a small regret

$$ R_T := \sum_{t \in [T]} \langle w_t, x_t \rangle - \min_{i \in [N]} \sum_{t \in [T]} x_t ( i ), $$
where $[T] := \lbrace 1, \ldots, T \rbrace$.

The regret measures the difference between Alice’s expected loss and the loss yielded by the best friend in hindsight. If $R_T = o(T)$, in hindsight, the difference between the average expected losses of Alice and the best friend converges to zero. Notice that the best friend in general varies with $T$.

The hedge algorithm is a specific rule for setting the weights. Initially, Alice chooses any weight vector $w_1$ such that $w_1 (i) \neq 0$ for all $i$. After observing the losses $x_t$, Alice updates the weights as

$$ w_{t + 1} ( i ) := \frac{w_t ( i ) \, \mathrm{e}^{- \eta x_t ( i )}}{Z_t}, \qquad i \in [N], $$
where $Z_t := \sum_{i \in [N]} w_t (i) \, \mathrm{e}^{- \eta x_t (i)}$ is the normalizing constant, and $\eta$ is a properly chosen learning rate.
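Here is a minimal sketch of the hedge algorithm as described above, with a uniform initial weight vector; the learning rate suggested in the last comment is the standard tuning for losses in $[0, 1]$ and is not taken from the paper.

```python
import numpy as np

def hedge(losses, eta):
    """The hedge algorithm on a T x N array of losses (a sketch).

    losses[t, i] is the loss of friend i in round t; eta is the learning rate.
    Returns Alice's expected loss in each round.
    """
    T, N = losses.shape
    w = np.full(N, 1.0 / N)                 # uniform initial weights w_1
    expected_losses = []
    for t in range(T):
        expected_losses.append(w @ losses[t])
        w = w * np.exp(-eta * losses[t])    # multiplicative update
        w = w / w.sum()                     # divide by the normalizer Z_t
    return np.array(expected_losses)

# For losses in [0, 1], a common choice is eta = np.sqrt(8 * np.log(N) / T).
```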

The hedge algorithm looks similar to the aggregating algorithm (AA). Indeed, we can formulate the problem using the setting of prediction with expert advice (PEA). Let Alice be Learner and her friends be the experts. Let Nature choose the loss vectors. Set the outcome space $\Omega$ as $[ 0, + \infty [^N$, the prediction space $\Gamma$ as the probability simplex in $\mathbb{R}^N$, and the loss function as

$$ \lambda ( \omega, \gamma ) := \langle \gamma, \omega \rangle. $$
When each expert $i \in [N]$ chooses the fixed prediction $\gamma_t (i) = ( \delta_{i,j} )_{j \in [N]} \in \mathbb{R}^N$, we recover the original horse-racing betting problem.

However, applying the AA to this specific PEA game is not trivial, and requires some generalization of the mixability condition. Assuming bounded losses, the resulting algorithm yields $R_T = O( \sqrt{T} )$. See this paper and this paper for the details.

Aggregating Algorithm

This post is essentially a summary of some results in the classical paper “A game of prediction with expert advice” by V. Vovk.

Definition. A game is a triple $( \Omega, \Gamma, \lambda )$, where $\Omega$ is the outcome space, $\Gamma$ is the prediction space, and $\lambda: \Omega \times \Gamma \to [0, \infty]$ is the loss function.

Consider the standard learning with expert advice setting, in which a learner tries to predict the outcomes of the nature sequentially, with the help of $n$ experts. Precisely speaking, at each trial $t \in \mathbb{N}$:

  1. Each expert $i$ makes a prediction $\gamma_t (i)$, $1 \leq i \leq n$.
  2. The learner makes a prediction $\gamma_t \in \Gamma$.
  3. The nature chooses an outcome $\omega_t \in \Omega$.
  4. The learner suffers for the loss $\lambda ( \omega_t, \gamma_t )$.

Notice that before making his prediction at the trial $t$, the learner has access to the past history of the nature (up to trial $t - 1$) and all of the experts (up to trial $t$).

For every $t \in \mathbb{N}$, let us define the learner’s cumulative loss as

$$ L_t := \sum_{s = 1}^{t} \lambda ( \omega_s, \gamma_s ), $$

and each expert’s cumulative loss as

$$ L_t ( i ) := \sum_{s = 1}^{t} \lambda ( \omega_s, \gamma_s ( i ) ). $$
The learner’s goal is to make predictions almost as well as the best expert, in the sense that $L_t \leq L_t(i) + \varepsilon$ for all $i \leq n$, $t \in \mathbb{N}$, and some small enough $\varepsilon$.

Definition. We say the game is $\eta$-mixable, if for any probability distribution $P$ on $\Gamma$, there exists a $\gamma^\star \in \Gamma$, such that

$$ \lambda ( \omega, \gamma^\star ) \leq - \frac{1}{\eta} \ln \int \exp \left( - \eta \lambda ( \omega, \gamma ) \right) P ( \mathrm{d} \gamma ) \quad \text{for all } \omega \in \Omega. $$
Remark. By Jensen’s inequality, if the mapping $\gamma \mapsto \exp ( - \eta \lambda ( \omega, \gamma ) )$ is concave for all $\omega \in \Omega$, then the game is $\eta$-mixable. Any such loss function $\lambda$ is called exp-concave. Obviously, one can simply choose $\gamma^\star = \int \gamma \, P ( \mathrm{d} \gamma )$ if the loss is exp-concave.

Proposition. If a game is $\eta$-mixable, there exists a prediction algorithm for the learner, such that

$$ L_t \leq L_t ( i ) + \frac{\ln n}{\eta} \quad \text{for all } i \leq n \text{ and } t \in \mathbb{N}. $$
One such algorithm is the aggregating algorithm (AA), first introduced in the paper “Aggregating strategies” by V. Vovk. Let $( \pi_0 ( i ) )_{i \leq n}$ be a probability vector whose entries are all non-zero. At each trial $t \in \mathbb{N}$, the AA outputs a prediction $\gamma_t \in \Gamma$ satisfying

$$ \lambda ( \omega, \gamma_t ) \leq - \frac{1}{\eta} \ln \sum_{i \leq n} \pi_{t - 1} ( i ) \exp \left( - \eta \lambda ( \omega, \gamma_t ( i ) ) \right) \quad \text{for all } \omega \in \Omega, $$

and after observing $\omega_t$, the AA updates the probability vector as

$$ \pi_t ( i ) := \frac{\pi_{t - 1} ( i ) \exp \left( - \eta \lambda ( \omega_t, \gamma_t ( i ) ) \right)}{z_t}, $$
where $z_t$ is a normalizing constant such that $( \pi_t (i) )_{i \leq n}$ is also a probability vector. Because of the $\eta$-mixability assumption, the AA is well-defined.
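Here is a minimal sketch of the AA in the exp-concave special case of the remark above, where the prediction can simply be the $\pi$-weighted average of the experts' predictions; the square-loss example in the closing comment is my own choice.

```python
import numpy as np

def aggregating_algorithm(expert_preds, outcomes, loss, eta):
    """The AA for an exp-concave loss (a sketch of the special case above).

    Per the remark, for an exp-concave loss the learner may simply predict the
    pi-weighted average of the experts' predictions.  expert_preds[t, i] is
    expert i's prediction at trial t, outcomes[t] is omega_t, and
    loss(omega, gamma) is the loss function lambda.
    """
    T, n = expert_preds.shape
    pi = np.full(n, 1.0 / n)                      # uniform prior pi_0
    learner_loss = 0.0
    for t in range(T):
        gamma_t = pi @ expert_preds[t]            # mixture prediction
        learner_loss += loss(outcomes[t], gamma_t)
        expert_losses = np.array([loss(outcomes[t], g) for g in expert_preds[t]])
        pi = pi * np.exp(-eta * expert_losses)    # aggregating update
        pi = pi / pi.sum()                        # normalize (z_t)
    return learner_loss

# Toy example: the square loss (omega - gamma)^2 on [0, 1] is exp-concave for
# eta <= 1/2, so the mixture prediction above satisfies the mixability condition.
```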

Lemma. For each $t \in \mathbb{N}$, one has

$$ \mathrm{e}^{- \eta L_t} \geq \sum_{i \leq n} \pi_0 ( i ) \, \mathrm{e}^{- \eta L_t ( i )}. $$
Proof. We prove by induction. Obviously, the inequality holds for $t = 0$ (for which we set $L_t = L_t (i) = 0$ for all $i \leq n$). Assume the inequality to be proved holds for $t = T$. We write

By the definition of the AA, the RHS is bounded below by

By assumption, the RHS can be further bounded below by

This proves the lemma. Q.E.D.

By the lemma, we have

$$ \mathrm{e}^{- \eta L_T} \geq \pi_0 ( i ) \, \mathrm{e}^{- \eta L_T ( i )} \quad \text{for all } i \leq n. $$

Hence we obtain

$$ L_T \leq L_T ( i ) + \frac{1}{\eta} \ln \frac{1}{\pi_0 ( i )}. $$
Choosing $\pi_0 ( i ) = 1 / n$ for all $i$ (the uniform distribution) proves the proposition.

Some Interesting Talks at NIPS 2016

NIPS 2016 was quite successful, in the sense that most of the papers interesting to me were chosen as oral presentations:) The list below is of course non-exhaustive and biased. The order is alphabetical, according to the last names of the presenters/first authors.

The Pinching Trick and the Golden-Thompson Inequality

Pinching

Let $A$ be a Hermitian matrix, and $A = \sum_j \lambda_j P_j$ be the spectral decomposition of $A$. The pinching map defined by $A$ is given by

$$ \mathcal{P}_A ( X ) := \sum_j P_j X P_j $$
for any Hermitian matrix $X$.
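Here is a minimal numerical sketch of the pinching map; the spectral projectors are built by grouping numerically coincident eigenvalues, with a tolerance of my own choosing.

```python
import numpy as np

def pinching(A, X, tol=1e-9):
    """Apply the pinching map defined by a Hermitian matrix A to X (a sketch).

    The spectral projectors P_j are obtained by grouping eigenvectors of A whose
    eigenvalues coincide up to `tol`, and P_A(X) = sum_j P_j X P_j.
    """
    eigvals, eigvecs = np.linalg.eigh(A)
    result = np.zeros_like(X, dtype=complex)
    start = 0
    for end in range(1, len(eigvals) + 1):
        # close the current eigenvalue group at a spectral gap or at the end
        if end == len(eigvals) or eigvals[end] - eigvals[start] > tol:
            V = eigvecs[:, start:end]          # orthonormal basis of the eigenspace
            P = V @ V.conj().T                 # spectral projector P_j
            result += P @ X @ P
            start = end
    return result

# Sanity checks: pinching(A, X) commutes with A, and
# np.trace(pinching(A, X) @ A) agrees with np.trace(X @ A).
```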

Theorem 1. Let $A$ be a positive semi-definite matrix and $B$ be a Hermitian matrix. The following statements hold.

  1. $\mathcal{P}_B (A)$ commutes with $B$.
  2. $\mathrm{Tr} ( \mathcal{P}_B (A) B ) = \mathrm{Tr} ( A B )$.
  3. (Pinching inequality) $\vert \mathrm{spec} (B) \vert \, \mathcal{P}_B (A) \geq A$, where $\mathrm{spec} (B)$ denotes the set of eigenvalues of $B$.

The first two statements are easy to check. The earliest reference on the pinching inequality I can find is the classic book by Jacques Dixmier. A simple proof of the pinching inequality can be found in the textbook by Masahito Hayashi.

One main issue in matrix analysis is non-commutativity. The first statement in Theorem 1 hints that pinching can be a useful tool to deal with this issue. In the next section, the pinching trick is illustrated using the Golden-Thompson inequality as an example.

A Proof of the Golden-Thompson Inequality

The Golden-Thompson inequality says that

$$ \mathrm{Tr} \, \mathrm{e}^{A + B} \leq \mathrm{Tr} \left( \mathrm{e}^{A} \mathrm{e}^{B} \right) $$
for any two Hermitian matrices $A$ and $B$. Obviously, if $A$ commutes with $B$, the Golden-Thompson inequality holds with an equality; however, in general one needs to take non-commutativity into consideration. Below we present a very elegant proof using the pinching trick from a recent paper by D. Sutter et al.

The key observation is that $\vert \mathrm{spec} ( A^{\otimes n} ) \vert$ does not grow rapidly with $n$ for any Hermitian matrix $A$.

Lemma 1. One has $\vert \mathrm{spec} ( A^{\otimes n} ) \vert = O ( \mathsf{poly} (n) )$ for any Hermitian matrix $A$. Indeed, every eigenvalue of $A^{\otimes n}$ is a product of $n$ eigenvalues of $A$, and such a product is determined by how many times each distinct eigenvalue of $A$ occurs in it; hence $\vert \mathrm{spec} ( A^{\otimes n} ) \vert \leq \binom{n + \vert \mathrm{spec} ( A ) \vert - 1}{\vert \mathrm{spec} ( A ) \vert - 1} = O ( n^{\vert \mathrm{spec} ( A ) \vert - 1} )$.

Proof (Golden-Thompson inequality).

Let $X$ and $Y$ be two positive definite matrices. Then one can write

$$ \mathrm{Tr} \, \mathrm{e}^{\log X + \log Y} = \left( \mathrm{Tr} \, \mathrm{e}^{\log X^{\otimes n} + \log Y^{\otimes n}} \right)^{1 / n} $$

for any positive integer $n$. By the pinching inequality, one has

$$ \mathrm{Tr} \, \mathrm{e}^{\log X^{\otimes n} + \log Y^{\otimes n}} \leq \vert \mathrm{spec} ( X^{\otimes n} ) \vert \; \mathrm{Tr} \, \mathrm{e}^{\log X^{\otimes n} + \log \mathcal{P}_{X^{\otimes n}} ( Y^{\otimes n} )}. $$

By the first two statements in Theorem 1 and Lemma 1, one has

$$ \mathrm{Tr} \, \mathrm{e}^{\log X + \log Y} \leq \vert \mathrm{spec} ( X^{\otimes n} ) \vert^{1 / n} \left( \mathrm{Tr} \left( X^{\otimes n} \, \mathcal{P}_{X^{\otimes n}} ( Y^{\otimes n} ) \right) \right)^{1 / n} = O ( \mathsf{poly} ( n ) )^{1 / n} \, \mathrm{Tr} ( X Y ). $$
Then one obtains the Golden-Thompson Inequality by letting $n \to \infty$.