Suppose that Alice and Bob is playing a so-called two-player zero-sum game with perfect information. Alice chooses her action from a finite action set ; Bob chooses his action from a finite action set . There is a payoff function $f$, which assigns to each action pair $(a, b) \in \mathcal{A} \times \mathcal{B}$ a payoff $f ( a, b ) \in \mathbb{R}$; Bob’s payoff is then simply $-f (a, b)$. In general, Alice and Bob may choose their actions randomly, so we consider the expected payoff: for any probability distribution $P \in \Delta ( \mathcal{A} )$ and $Q \in \Delta ( \mathcal{B} )$, define
If Alice does not have any additional information about Bob, it is reasonable to consider the very robust minimax strategy, i.e., to choose her action randomly according to
No matter what Bob’s strategy is, Alice is guaranteed to achieve at least the minimax expected payoff.
If the payoff function $f$ is vector-valued, talking about the minimax expected payoff becomes nonsense, as we cannot compare two vectors. This post introduces an analog for the vector-valued payoff case, proposed by David Blackwell in this classic paper.
In the rest of this post, suppose that the payoff function $f$ takes values in .
Blackwell considered the following protocol of repeated games. For $t = 1, 2, \ldots$,
In general, Alice can use a randomized strategy. For convenience, let us say that Alice chooses $a_t$ randomly according to some probability distribution .
Recall that Alice cannot achieve the “minimax expected payoff” in this vector-payoff case, as it is just not well-define. However, it might be possible to ensure that her average payoff will ultimately lies in some specific subset of $\mathbb{R}^d$.
Define the average payoff
Definition. Let $\mathcal{X} \subseteq \mathbb{R}^d$. The set $\mathcal{X}$ is said to be approachable by Alice, if for any sequence ,
where $\mathrm{dist} ( \bar{f}_T, \mathcal{X} )$ denotes the $2$-norm distance between $\bar{f}_T$ and the projection of $\bar{f}_T$ onto $\mathcal{X}$.
If we only consider convex sets, there is an exact characterization of approachable sets.
Theorem 1. A convex set is approachable to Alice, if and only if for any $Q \in \Delta ( \mathcal{B} )$, there exists some $P \in \Delta ( \mathcal{A} )$ such that $f ( P, Q ) \in \mathcal{X}$.
Exercise. Suppose that for each $t$, Bob’s action $b_t$ can be dependent on Alice’s action $a_t$. Show that then even when the if and only if condition holds, $\mathcal{X}$ may not be approachable to Alice.
The only if part of Theorem 1 is easy to check. Suppose that there exists some $Q^\star \in \Delta ( \mathcal{B} )$, such that for any . To guarantee that Alice fails to approach $\mathcal{X}$, Bob can always choose $b_t$ randomly according to $Q^\star$ for all $t$.
We will focus on the if part of Theorem 1 in the rest of this post.
To prove the if part of Theorem 1, we need to show the existence of an approaching strategy. Indeed, the proof is constructive; there is an explicitly defined approaching strategy based on the following lemma.
Lemma 1. Suppose that for any $Q \in \Delta ( \mathcal{B} )$, there exists some $P \in \Delta ( \mathcal{A} )$ such that $f ( P, Q ) \in \mathcal{X}$. Then for any $v \in \mathbb{R}^d$, there exists some $P \in \Delta ( \mathcal{A} )$, such that
where $\mathrm{proj}$ denotes the projection (w.r.t. the $2$-norm) onto $\mathcal{X}$.
Proof. By definition, $\mathrm{proj} (v)$ is the minimizer of the $2$-norm distance to $v$ in $\mathcal{X}$. The optimality condition says
What we want to prove can be equivalently written as
Applying von Neumann’s minimax theorem, we can exchange the order of $\min$ and $\max$, and write
The lemma follows from the optimality condition. $Q.E.D.$
We call the condition in Lemma 1 Blackwell’s condition, following the convention in literature. Blackwell’s condition ensures that the following strategy for Alice is well-defined.
Alice’s Strategy. Define the expected and average expected payoff:
For each $t \in \mathbb{N}$, Alice computes some $P_t$ such that
and then announces $a_t$ randomly according to $P_t$.
Remark. In practice, $P_t$ can be computed as
This is a bilinear saddle point problem; there exists a variety of algorithms solving such a problem.
Since $\mathcal{X}$ is approachable if and only if its closure is approachable, we assume that $\mathcal{X}$ is closed w.l.o.g. We will also assume that $\mathcal{X}$ is bounded for convenience; we will relax this unnecessary condition at the end of the proof.
Theorem 1 is in fact a corollary of the following theorem.
Theorem 2. Suppose that the if and only if condition in Theorem 1 holds. Suppose that Alice adopts the strategy described above. Then for each $T \in \mathbb{N}$ and $\delta \in ( 0, 1 )$, with probability at least , it holds that
where , and .
To prove Theorem 2, we start with the following lemma.
Lemma 2. For all $T \in \mathbb{N}$, it holds that
Proof. Define and . By definition, we have
Writing , we have
Applying Lemma 1 and the triangle inequality, we obtain
The lemma can be then easily checked by induction. Q.E.D.
Proof of Theorem 2. Recall that denotes the average payoff; define . We start with the decomposition,
By Lemma 2 and the non-expansiveness of projections, we have
Define . Then forms a vector-valued martingale. Applying the Azuma inequality for Banach-space valued martingales, we obtain, for any $u > 0$,
for some constant $C > 0$. Q.E.D.
In general, the set to be approached may not be bounded; then $R = + \infty$ if we directly apply the definition. Notice that, however, only sets lie in the convex hull of can be approached. Denote the convex hull by $\mathcal{F}$. Hence, it suffices to replace $\mathcal{X}$ by , which is always bounded by construction; then it is easily checked that $R$ can be bounded above by $M$ to slightly simplify the expressions.
One might wonder why one should consider vector-valued payoffs. An interesting application is calibrated forecasting. With Theorem 1, it is an easy exercise to prove the theorem in the previous post, showing the existence of an $\varepsilon$-calibrated forecasting strategy.
The strategy of Alice presented above is not exactly the original one proposed by Blackwell, as the original one does not lead to the $O ( 1 / \sqrt{T} )$ with-high-probability convergence rate in Theorem 2. The strategy presented above was described in the book by Cesa-Bianchi and Lugosi, without proof. The proof above is basically a mix of the arguments and results in the following references.
Can one predict the weather ignorantly yet accurately? To be concrete, suppose we are given the weather record of the past 100 days,
we cannot access any other information (e.g., temperature, satellite images, etc.), and we know nothing about meteorology. Can we tell how likely it will rain tomorrow?
This problem can be cast as a probability forecasting problem. The protocol of probability forecasting is as follows. For ,
The goal of Predictor is to make accurate predictions, such that $p_t$ is a good estimate of the likelihood of the event . As the protocol is deterministic, we do not say “probability of the event ”.
Thinking of ‘$0$’ as ‘dry’, and ‘$1$’ as ‘wet’, we recover the original weather forecasting problem.
What precisely are the requirements a good prediction strategy should satisfy? The requirements in general depend on the actual application scenario. In this post, we will only consider a minimal requirement—any good prediction strategy should be calibrated.
Roughly speaking, the notion of calibration requires the predictions to comply with the empirical frequencies, i.e., for any $\rho \in ( 0, 1 )$, it holds that
To make the notion precise, consider the following discretization argument. Fix some $\varepsilon \in ( 0, 1 )$. Let us decompose the interval $[ 0, 1 ]$ as
where $M := \lceil \varepsilon^{-1} \rceil$. We denote the center of $I_m$ by $c_m$. Define the empirical frequencies
if the divisor is non-zero, and $\rho_m (T) := c_m$ otherwise. We would like, for all $m$,
We do not really want to care about those $m$’s for which , as they are irrelevant in the long run. A compact formulation of the requirement is
Some simple algebraic manipulations lead to the following (slightly different) defnition.
Definition. A (possibly randomized) prediction strategy is -calibrated, if the resulting sequence of predictions satisfy
Surprisingly, even when one knows nothing about the mechanism generating the sequence , calibration is achievable.
Theorem. Suppose that Reality, before announcing $y_t$, does not know the exact value of $p_t$. For any $\varepsilon > 0$, there exists an $\varepsilon$-calibrated probability forecasting strategy.
Some quick observations:
If $y_t$ can depend on the exact value of $p_t$, calibration is impossible. For example, Reality can always choose $y_t = 1$ whenever $p_t \leq 0.5$, and $y_t = 0$ otherwise.
The remark above also shows the necessity of a randomized forecasting strategy. If Forecaster uses a deterministic strategy, for which $p_t$ is a deterministic function of $p_1, y_1, p_2, y_2, \ldots, p_{t - 1}, y_{t - 1}$, then Reality can also compute exact value of $p_t$.
A calibrated forecasting strategy can be derived by the idea of Blackwell approachability.
The strategy only chooses $p_t$ from the set ; effectively, Forecaster only needs to choose for all $t$. Define, for all $m$,
We already know that Forecaster must adopt a randomized strategy. Suppose that Forecaster chooses $m_t$ randomly according to some probability distribution $P_t$ on . Define
Forecaster’s Strategy. For each $t$, Forecaster first computes some $P_t$ such that for any probability distribution $Q$ on ,
where $\mathrm{proj}$ denotes projection onto the $1$-norm ball of radius $\varepsilon$ in $\mathbb{R}^M$, and then announces $m_t$ randomly according to $P_t$.
Lemma. Such $P_t$ always exists.
Remark. In practice, one can compute $P_t$ by solving the saddle point problem:
Notice that the objective function is bilinear; there are a variety of existing algorithms that solve this problem.
The strategy looks mysterious at first glance. I will introduce the theory of Blackwell approachability in the next post.
Can one predict the future well, without any subjective assumptions? Indeed, there are actually some prediction strategies that work without any subjective assumption, and approach (arguably) the best possible performance in some sense. This post introduces one such strategy described in an article by V. Vovk.
Consider the following protocol of online prediction. For $t = 1, 2, \ldots$,
We assume the perfect information case; that is, all $x_t$’s, $\gamma_t$’s, and $y_t$’s, once announced, are known to both Reality and Predictor.
Exercise. Find some real-life examples of this protocol. Notice that $\mathcal{X}$ can be the space of histories.
We measure the quality of prediction by a given loss function $\lambda: \mathcal{X} \times \Gamma \times \mathcal{Y} \to \mathbb{R}$, known to both Reality and Predictor. We will consider compact-type losses.
Definition. A loss function is called compact-type, if for any compact sets $A \subseteq \mathcal{X}$ and $B \subseteq \mathcal{Y}$ and any $M \in \mathbb{R}$, there exists some compact set $C \subset \Gamma$, such that $\lambda ( x, \gamma, y ) > M$ for any $x \in A$, $y \in B$, and $\gamma \notin C$.
A (possibly randomized) prediction strategy is a function $\psi: \mathcal{X} \to \Delta ( \Gamma )$, where $\Delta ( \Gamma )$ denotes the set of all probabiity measures on $\Gamma$. For each $t$, Predictor computes $Q_t := \psi( x_t )$, and chooses $\gamma_t \in \Gamma$ randomly according to $Q_t$.
Definition. We say that a prediction strategy is continuous, if the associated function $\psi$ is continuous.
Theorem 1 (Vovk). Suppose that $\mathcal{X}$ and $\mathcal{Y}$ are locally compact metric spaces, and $\Gamma$ is a metric space. Suppose that $\lambda$ is continuous and compact-type. Then for any $\varepsilon > 0$, there exists a prediction strategy, such that if and are precompact,
where $( \tilde{\gamma_t} )_{t \in \mathbb{N}}$ is the sequence of predictions made by any continuous prediction strategy.
Any prediction strategy that achieves the theorem above is called universally consistent.
The prediction strategy proposed by Vovk consists of two parts. Predictor first computes an estimate of the conditional probability distribution of $y_t$ given the history; then Predictor chooses the optimal $\gamma_t$ minimizing the estimated conditional expected loss. Notice that, however, the online prediction protocol is completely deterministic; hence talking about the conditional probability and expected loss, rigorously speaking, is not legal in the sense of measure-theoretic probability.
The first part is called probability forecasting. The protocol of probability forecasting is as follows. For $t = 1, 2, \ldots$,
Theorem 2 (Vovk). Suppose that $\mathcal{X}$ and $\mathcal{Y}$ are compact metric spaces. There exists some forecasting strategy, such that
for any continuous function $f: \mathcal{X} \times \Delta ( \mathcal{Y} ) \times \mathcal{Y} \to \mathbb{R}$.
The forecasting strategy behind Theorem 2 is based on the existence of a universal RKHS $\mathcal{H}$ on $\mathcal{X} \times \Delta ( \mathcal{Y} ) \times \mathcal{Y}$, i.e., an RKHS dense in $C ( \mathcal{X} \times \Delta ( \mathcal{Y} ) \times \mathcal{Y} )$. Following the framework of defensive forecasting, one can construct a forecasting strategy such that the inequality in Theorem 2 holds for all functions in $\mathcal{H}$. As $\mathcal{H}$ is dense in $C ( \mathcal{X} \times \Delta ( \mathcal{Y} ) \times \mathcal{Y} )$, one obtains Theorem 2.
Naively speaking, the second part corresponds to finding the randomized action that minimizes the conditional expected loss, i.e., computing
where $P_t$ is the output of probability forecasting, and
Then Predictor chooses $\gamma_t$ randomly according to $\tilde{Q}_t$.
Theorem 2 considers continuous functions, while $\tilde{Q}_t$ may not be continuously dependent on $(x_t, P_t)$. Therefore, we need to slightly modify the second part. What Predictor actually does in the second part is to compute $Q_t := G ( x_t, P_t ) \in \Delta ( \Gamma )$, for some continuous function $G$ satisfying
for some $\delta > 0$.
Lemma 1 (Vovk). Suppose that $\Gamma$ is a compact convex subset of a topological vector space. Such a continuous function $G$ exists for any $\delta > 0$.
In Theorem 1, the conditions on $\mathcal{X}$, $\mathcal{Y}$, and $\Gamma$ are not as strict as those in Theorem 2 and Lemma 1. This gap can be addressed by a doubling-like trick. Roughly speaking, Predictor starts with compact subsets $A \subseteq \mathcal{X}$ and $B \subseteq \mathcal{Y}$, and a compact convex subset $C \subseteq \Gamma$. Predictor implements the prediction strategy described above only on $A$, $B$, and $C$. If Reality announces some $( x_t, y_t ) \notin A \times B$, Predictor enlarges $A$, $B$ and $C$ such that $( x_t, y_t )$ is contained.
We will consider a standard constrained convex optimization problem:
where $f$ is a convex function, and $\mathcal{X}$ is a compact convex set in $\mathbb{R}^d$. We assume that $f$ is $L$-Lipschitz continuous w.r.t. some norm $\Vert \cdot \Vert$ on $\mathbb{R}^d$.
Let $\omega$ be a continuously differentiable $1$-strongly convex function on $\mathcal{X}$, w.r.t. $\Vert \cdot \Vert$. Define the corresponding Bregman divergence
and prox-mapping
Let $x_1 \in \mathcal{X}$, and $( \eta_1, \eta_2, \ldots )$ be a sequence of step sizes. The standard MD iterates as follows.
Define
The following result can be found in, e.g., the classic paper by Beck and Teboulle.
Proposition 1. It holds that, for any $u \in \mathcal{X}$,
where $\left\Vert \cdot \right\Vert_*$ denotes the dual norm.
Proposition 1 implies the following convergence guarantee for the standard MD.
Corollary. Fix a positive integer $T$. Set
Then
Now assume that $f$ is also $\mu$-strongly convex w.r.t. $\Vert \cdot \Vert$. How do we exploit this additional information?
Let us choose $\omega$ such that it is $1$-strongly convex on the whole $\mathbb{R}^d$, instead of merely $\mathcal{X}$. For any $R > 0$, define $\omega_{R, z} ( x ) := \omega ( R^{-1} ( x - z ) )$; denote by $D_{R, z}$ the corresponding Bregman divergence, and $\mathrm{prox}_{R, z}$ the corresponding prox-mapping.
Let us define an iteration rule very similar to the standard MD; the only difference is that now we replace $\omega$ by $\omega_{R, z}$ for some $z \in \mathbb{R}^d$: Let $( \eta_1, \eta_2, \ldots )$ be a given sequence of step sizes. For any $x_1 \in \mathcal{X}$, we compute
Define
Proposition 2. Fix a positive integer $T$. Set
Then
where $x^\star$ is the unique minimizer of $f$ on $\mathcal{X}$.
Proof. Notice that $\omega_R$ is $1$-strongly convex w.r.t. $\Vert \cdot \Vert_R := R^{-1} \Vert \cdot \Vert$. The first inequality follows from Proposition 1, using the norm $\Vert \cdot \Vert_R$. The second inequality follows from the following two inequalities, obtained by the strong convexity of $f$ and the optimality of $x^\star$, respectively:
Q.E.D.
Now we have an error bound depending on $R_0$; the smaller $R_0$ is, the smaller the error bounds are. Also notice that the bound of the distance between $x^T$ and $x^\star$ is strictly decreasing with $T$. These observations motivate the following restarting strategy: Set $y_0 \in \mathcal{X}$. For $k = 1, 2, \ldots$,
By the proposition we have just proved, it suffices to choose
Proposition 3. Define $M_k = \sum_{\kappa = 1}^k T_{\kappa}$. Let $k^\star$ be the smallest $k$ such that
Then for $k < k^\star$,
for $k \geq k^\star$,
Proof. It is easily checked, by Proposition 2, that
By our choice of $T_k$, it holds that
Therefore, $M_k \leq 2 k$ for $k < k^\star$, and
for $k \geq k^\star$. Q.E.D.
Notice that to compute each $y^k$, we need to compute $T_k$ additional prox-mappings. Therefore, the effective number of iterations is represented by $M_k$, instead of $k$, and the convergence rate is understood as
]]>To get some idea about what the game-theoretic framework is, let us review an interesting result in this framework. Consider the following probabiilty prediction game. Let $K_0 = 1$. For every $n \in \mathbb{N}$,
This is in fact a game of betting from the perspective of Skeptic: Initially Skeptic has $1$ dollar. If $s_n > 0$, Skeptic is betting on the event $y_n = 1$; Skeptic gets $s_n y_n$ dollars in return if the event happens, and loses $s_n p_n$ dollars otherwise. Similarly, if $s_n < 0$, Skeptic is betting against the event $y_n = 1$.
The sequence $(K_n)$ is called a martingale. If $K_0 = 1$ and $K_n \geq 0$ for all $n$, the sequence $( K_n )$ is called a scoring martingale.
Of course, the values of $K_n$ depend on the strategy of Skeptic. Roughly speaking, in the game-theoretic framework, an event of small probabiilty corresponds to a large value of $K_n$.
Theorem. (Weak law of large numbers) For any $N \in \mathbb{N}$, there exists a scoring martingale $( K_n )$, such that for any $\delta > 0$, $K_N \geq 1 / \delta$ unless
Proof. Choose
Q.E.D.
Defensive forecasting aims at finding a strategy for Forecaster, such that Skeptic’s capital does not grow, no matter what Reality’s strategy is.
If Skeptic and Reality cooperate, this aim is in general impossible. For example, Skeptic and Reality can choose $s_n$ and $y_n$ in the following way:
Then $K_n \geq K_{n - 1} + 1$ for every $n$.
However, if we slightly weaken Skeptic, this aim becomes surprisingly easy. Notice that Skeptic keeps his wealth growing, by setting $s_n$ as a discontinuous function of $p_n$. Let us impose a restriction that $s_n$ must be a continuous function of $p_n$. Then we arrive at the following defensive porecasting protocol.
Let $K_0 = 1$. For every $n \in \mathbb{N}$,
Theorem. There always exists a strategy for Forecaster, such that $K_0 \geq K_1 \geq K_2 \geq \cdots$.
Proof. If $s_n$ is strictly positive on $[ 0, 1 ]$, choose $p_n = 1$. If $s_n$ is strictly negative on $[ 0, 1 ]$, choose $p_n = 0$. Otherwise, by continuity of $s_n$, there must exist some $p^\star \in [0, 1]$ such that $s_n ( p^\star ) = 0$; choose $p_n = p^\star$. Q.E.D.
K29 is an algorithm for probability prediction, based on the defensive forecasting approach. Consider the following protocol. For $n = 1, 2, \ldots, N$,
Weather forecasting is an instance of this protocol. One can associate $y_n = 1$ to a rainy day; then $p_n$ denotes Forecaster’s estimate of the probability that a rainy day will happen.
Let $\kappa$ be a kernel on $[ 0, 1 ]$. The aim of K29 is to guarantee, for all $p \in [ 0, 1 ]$,
This inequality, roughly speaking, guarantees that the forecasting strategy is calibrated, i.e., for any $p \in [ 0, 1 ]$, among those days with $p_n \approx p$, there are about $( 100 \times p ) \%$ of them that are actually rainy. A precise intrepretation of the inequality is in fact much more complicated (for example, it is known that it is impossible to have a deterministic calibrated forecasting algorithm), and hence skipped due to my lazziness ;) Please find the references for details.
Theorem. Assume $\kappa ( p, p ) \leq 1$ for all $p \in [ 0, 1 ]$. The inequality above can be achieved, if Forecaster applies the defensive forecasting strategy with respect to
The resulting forecasting algorithm is called K29, because it is related to a paper of Kolmogorov published in 1929.
Indeed, there are algorithms for doing so. A famous one is the so-called hedge algorithm, introduced in the paper “A desicion-theoretic generalization of on-line learning and an application to boosting” by Y. Freund and R. Schapire.
The idea is to switch between Alice’s fridends’ strategies. Assume for simplicity that Alice will always bet 1 unit of money for each round. Alice keeps in her mind a weight for each of her friend. At the beginning of each round, she chooses one of her friends with probability proportional to the weights, say Bob, and follows Bob’s bet. Once a round finishes, Alice sees how much she would lose had she followed any of her friend. She then updates the weights accordingly. Those who yielded smaller losses have higher weights, and vice versa.
Let us formalize the idea. We label the rounds by positive integers $t \in \mathbb{N}$. Suppse Alice has $N$ friends; let us index them by the set $[N] := \lbrace 1, \ldots, N \rbrace$. We summarize the weights before the $t$-th round by a weight vector $w_t := ( w_t (i) )_{i \in [N]} \in \mathbb{R}^N$ in the probability simplex, where $w_t (i)$ denotes the weight assigned to Friend $i$. We also summarize the losses yielded by each friend’s bet for the $t$-th round by a vector
Then for the $t$-th round, the expected loss of Alice is
Suppose Alice will bet on $T$ rounds. The goal of Alice is to have a small regret
where $[T] := \lbrace 1, \ldots, T \rbrace$.
The regret measures the difference between Alice’s expected loss and the loss yielded by the best friend in hindsight. If $R_T = o(T)$, in hindsight, the difference between the average expected losses of Alice and the best friend converges to zero. Notice that the best friend in general varies with $T$.
The hedge algorithm is a specific rule of setting the weights. Initially, Alice chooses any weight $w_1$ such that $w_1 (i) \neq 0$ for all $i$. After observing the losses $x_t$, Alice updates the weights as
where $Z_t := \sum_{i \in [N]} w_t (i) \, \mathrm{e}^{- \eta x_t (i)}$ is the normalizing constant, and $\eta$ is a properly chosen learning rate.
The hedge algorithm looks similar to the aggregating algorithm (AA). Indeed, we can formulate the problem using the setting of prediction with expert advice (PEA). Let Alice be Learner and her friends be the experts. Let Nature choose the loss vectors. Set the outcome space $\Omega$ as $[ 0, + \infty [^N$, prediction space $\Gamma$ as the probability simplex in $\mathbb{R}^N$, and loss function as
When each expert $i \in [N]$ chooses a fixed prediction $\gamma_t (i) = ( \delta_{i,j} )_{j \in [N]} \in \mathbb{R}^N$, we recover the original betting-horce-racing-games problem.
However, applying the AA to this specific PEA game is not trivial, and requires some generalization of the mixability condition. Assuming bounded losses, the resulting algorithm yields $R_T = O( \sqrt{T} )$. See this paper and this paper for the details.
]]>Definition. A game is a triple $( \Omega, \Gamma, \lambda )$, where $\Omega$ is the outcome space, $\Gamma$ is the prediction space, and $\lambda: \Omega \times \Gamma \to [0, \infty]$ is the loss function.
Consider the standard learning with expert advice setting, in which a learner tries to predict the outcomes of the nature sequentially, with the help of $n$ experts. Precisely speaking, at each trial $t \in \mathbb{N}$:
Notice that before making his prediction at the trial $t$, the learner has access to the past history of the nature (up to trial $t - 1$) and all of the experts (up to trial $t$).
For every $t \in \mathbb{N}$, let us define the learner’s accumulative loss as
and each expert’s accumulative loss as
The learner’s goal is to make predictions almost as well as the best expert, in the sense that $L_t \leq L_t(i) + \varepsilon$ for all $i \leq n$, $t \in \mathbb{N}$, and some small enough $\varepsilon$.
Definition. We say the game is $\eta$-mixable, if for any probability distribution $P$ on $\Gamma$, there exists a $\gamma^\star \in \Gamma$, such that
Remark. By Jensen’s inequality, if the mapping $\gamma \mapsto \exp ( - \eta \lambda ( \omega, \gamma ) )$ is concave for all $\omega \in \Omega$, then the game is $\eta$-mixable. Any such loss function $\lambda$ is called exp-concave. Obviously, one can simply choose $\gamma^\star = \int \gamma \, P ( \mathrm{d} \gamma )$ if the loss is exp-concave.
Proposition. If a game is $\eta$-mixable, there exists a prediction algorithm for the learner, such that
One such algorithm is the aggregating algorithm (AA), first introduced in the paper “Aggregating strategies” by V. Vovk. Let $( \pi_0 ( i ) )_{i \leq n}$ be a probability vector whose entries are all non-zero. At each trial $t \in \mathbb{N}$, the AA outputs a prediction $\gamma_t \in \Gamma$ satisfying
and after observing $\omega_t$, the AA updates the probability vector as
where $z_t$ is a normalizing constant such that $( \pi_t (i) )_{i \leq n}$ is also a probability vector. Because of the $\eta$-mixability assumption, the AA is well-defined.
Lemma. For each $t \in \mathbb{N}$, one has
Proof. We prove by induction. Obviously, the inequality holds for $t = 0$ (for which we set $L_t = L_t (i) = 0$ for all $i \leq n$). Assume the inequality to be proved holds for $t = T$. We write
By the definition of the AA, the RHS is bounded below by
By assumption, the RHS can be further bounded below by
This proves the lemma. Q.E.D.
By the lemma, we have
Hence we obtain
Choosing $\pi_0 ( i ) = 1 / n$ for all $i$ (the uniform distribution) proves the proposition.
]]>Let $A$ be a Hermitian matrix, and $A = \sum_j \lambda_j P_j$ be the spectral decomposition of $A$. The pinching map defined by $A$ is given by
for any Hermitian matrix $X$.
Theorem 1. Let $A$ be a positive semi-definite matrix and $B$ be a Hermitian matrix. The following statements hold.
The first two statements are easy to check. The earliest reference on the pinching inequality I can find is the classic book by Jacques Dixmier. A simple proof of the pinching inequality can be found in the textbook by Masahito Hayashi.
One main issue in matrix analysis is non-commutativity. The first statement in Theorem 1 hints that pinching can be an useful tool to deal with this issue. In the next section, the pinching trick is illustrated using the Golden-Thompson inequality as an example.
The Golden-Thompson inequality says that
for any two Hermitian matrices $A$ and $B$. Obviously, if $A$ commutes with $B$, the Golden-Thompson inequality holds with an equality; however, in general one needs to take non-commutativity into consideration. Below we present a very elegant proof using the pinching trick from a recent paper by D. Sutter et al.
The key observation is that $\vert \mathrm{spec} ( A^{\otimes n} ) \vert$ does not grow rapidly with $n$ for any Hermitian matrix $A$.
Lemma 1. One has $\vert \mathrm{spec} ( A^{\otimes n} ) \vert = O ( \mathsf{poly} (n) )$ for any Hermitian matrix $A$.
Proof (Golden-Thompson inequality).
Let $X$ and $Y$ be two positive definite matrices. Then one can write
for any positive integer $n$. By the pinching inequality, one has
By the first two statements in Theorem 1 and Lemma 1, one has
Then one obtains the Golden-Thompson Inequality by letting $n \to \infty$.
]]>