This post is essentially a summary of some results in the classical paper “A game of prediction with expert advice” by V. Vovk.
Definition. A game is a triple $( \Omega, \Gamma, \lambda )$, where $\Omega$ is the outcome space, $\Gamma$ is the prediction space, and $\lambda: \Omega \times \Gamma \to [0, \infty]$ is the loss function.
Consider the standard learning with expert advice setting, in which a learner tries to predict the outcomes of nature sequentially, with the help of $n$ experts. More precisely, at each trial $t \in \mathbb{N}$:
- Each expert $i$, $1 \leq i \leq n$, makes a prediction $\gamma_t (i) \in \Gamma$.
- The learner makes a prediction $\gamma_t \in \Gamma$.
- Nature chooses an outcome $\omega_t \in \Omega$.
- The learner suffers the loss $\lambda ( \omega_t, \gamma_t )$.
Notice that before making its prediction at trial $t$, the learner has access to the past outcomes of nature (up to trial $t - 1$) and the predictions of all the experts (up to trial $t$).
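To make the order of play concrete, here is a minimal sketch of this protocol in Python. The callables `experts[i]`, `learner`, `nature`, and `loss` are hypothetical placeholders (not from the paper); the sketch only records which information is available at each step.

```python
def play(experts, learner, nature, loss, T):
    """Run T trials of the prediction-with-expert-advice protocol."""
    history = []          # past triples (omega_s, advice_s, gamma_s)
    total_loss = 0.0
    for t in range(T):
        advice = [expert(history) for expert in experts]  # gamma_t(i), shown to the learner
        gamma = learner(history, advice)                  # gamma_t, using advice at trial t
        omega = nature(history)                           # omega_t
        total_loss += loss(omega, gamma)                  # lambda(omega_t, gamma_t)
        history.append((omega, advice, gamma))
    return total_loss
```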
For every $t \in \mathbb{N}$, let us define the learner’s cumulative loss as
$$L_t := \sum_{s=1}^{t} \lambda ( \omega_s, \gamma_s ),$$
and each expert’s cumulative loss as
$$L_t (i) := \sum_{s=1}^{t} \lambda ( \omega_s, \gamma_s (i) ), \quad 1 \leq i \leq n.$$
The learner’s goal is to make predictions almost as well as the best expert, in the sense that $L_t \leq L_t (i) + \varepsilon$ for all $i \leq n$ and $t \in \mathbb{N}$, where $\varepsilon$ is a small constant that does not grow with $t$.
Definition. We say the game is $\eta$-mixable (for some $\eta > 0$) if, for any probability distribution $P$ on $\Gamma$, there exists a $\gamma^\star \in \Gamma$ such that
$$\lambda ( \omega, \gamma^\star ) \leq - \frac{1}{\eta} \log \int_\Gamma \exp ( - \eta \lambda ( \omega, \gamma ) ) \, P ( \mathrm{d} \gamma ) \quad \text{for all } \omega \in \Omega.$$
Remark. By Jensen’s inequality, if the mapping $\gamma \mapsto \exp ( - \eta \lambda ( \omega, \gamma ) )$ is concave for all $\omega \in \Omega$, then the game is $\eta$-mixable. Any such loss function $\lambda$ is called exp-concave. Obviously, one can simply choose $\gamma^\star = \int \gamma \, P ( \mathrm{d} \gamma )$ if the loss is exp-concave (assuming $\Gamma$ is convex, so that this $\gamma^\star$ indeed lies in $\Gamma$).
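As a concrete example (not taken from the paper), the binary log loss $\lambda ( \omega, \gamma ) = - \omega \log \gamma - ( 1 - \omega ) \log ( 1 - \gamma )$ on $\Omega = \{ 0, 1 \}$, $\Gamma = ( 0, 1 )$ is exp-concave for $\eta = 1$, since $\exp ( - \lambda ( \omega, \gamma ) )$ equals $\gamma$ or $1 - \gamma$. The short check below verifies the mixability inequality numerically for one finite distribution $P$; the chosen numbers are illustrative only.

```python
import numpy as np

def log_loss(omega, gamma):
    """Binary log loss; gamma is the predicted probability of omega = 1."""
    return -np.log(gamma) if omega == 1 else -np.log(1.0 - gamma)

eta = 1.0
gammas = np.array([0.2, 0.5, 0.9])    # support of a finite distribution P on Gamma
weights = np.array([0.3, 0.5, 0.2])   # probabilities of P
gamma_star = weights @ gammas          # candidate substitution: the mean of P

for omega in (0, 1):
    losses = np.array([log_loss(omega, g) for g in gammas])
    mix_loss = -np.log(weights @ np.exp(-eta * losses)) / eta
    # Mixability: the single prediction gamma_star is no worse than the mixture.
    assert log_loss(omega, gamma_star) <= mix_loss + 1e-12
```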
Proposition. If a game is $\eta$-mixable, there exists a prediction algorithm for the learner such that
$$L_t \leq L_t (i) + \frac{\log n}{\eta} \quad \text{for all } i \leq n \text{ and } t \in \mathbb{N}.$$
One such algorithm is the aggregating algorithm (AA), first introduced in the paper “Aggregating strategies” by V. Vovk. Let $( \pi_0 ( i ) )_{i \leq n}$ be a probability vector whose entries are all non-zero. At each trial $t \in \mathbb{N}$, the AA outputs a prediction $\gamma_t \in \Gamma$ satisfying
$$\lambda ( \omega, \gamma_t ) \leq - \frac{1}{\eta} \log \sum_{i=1}^{n} \pi_{t-1} (i) \exp ( - \eta \lambda ( \omega, \gamma_t (i) ) ) \quad \text{for all } \omega \in \Omega,$$
and after observing $\omega_t$, the AA updates the probability vector as
$$\pi_t (i) = \frac{ \pi_{t-1} (i) \exp ( - \eta \lambda ( \omega_t, \gamma_t (i) ) ) }{ z_t },$$
where $z_t$ is a normalizing constant such that $( \pi_t (i) )_{i \leq n}$ is also a probability vector. The $\eta$-mixability assumption, applied to the distribution $P$ that puts mass $\pi_{t-1} (i)$ on $\gamma_t (i)$, guarantees that such a $\gamma_t$ exists, so the AA is well-defined.
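Below is a small sketch of the AA for the $1$-mixable binary log-loss game. It uses the exp-concave shortcut above, taking $\gamma_t$ to be the $\pi_{t-1}$-weighted mean of the experts’ predictions; a general mixable game would need a game-specific substitution rule. The function name and the random demo data are illustrative, not from the paper.

```python
import numpy as np

def log_loss(omega, gamma, eps=1e-12):
    """Binary log loss; gamma is the predicted probability of omega = 1."""
    gamma = np.clip(gamma, eps, 1.0 - eps)
    return -(omega * np.log(gamma) + (1 - omega) * np.log(1.0 - gamma))

def aggregating_algorithm(expert_preds, outcomes, eta=1.0):
    """expert_preds: (T, n) array, expert i's prediction at trial t.
    outcomes: (T,) array of outcomes in {0, 1}.
    Returns the learner's predictions, total loss, and each expert's total loss."""
    T, n = expert_preds.shape
    log_pi = np.full(n, -np.log(n))             # log pi_0: uniform prior over experts
    learner_preds = np.empty(T)
    for t in range(T):
        pi = np.exp(log_pi - log_pi.max())
        pi /= pi.sum()                           # current probability vector pi_{t-1}
        # Log loss is exp-concave for eta <= 1, so the pi-weighted mean of the
        # experts' predictions satisfies the mixability inequality.
        learner_preds[t] = pi @ expert_preds[t]
        # Multiplicative update; the normalization z_t is applied at the next trial.
        log_pi -= eta * log_loss(outcomes[t], expert_preds[t])
    learner_loss = log_loss(outcomes, learner_preds).sum()
    expert_losses = log_loss(outcomes[:, None], expert_preds).sum(axis=0)
    return learner_preds, learner_loss, expert_losses

# Demo: the learner's loss never exceeds the best expert's loss by more than (log n) / eta.
rng = np.random.default_rng(0)
T, n, eta = 1000, 5, 1.0
expert_preds = rng.uniform(0.05, 0.95, size=(T, n))
outcomes = rng.integers(0, 2, size=T)
_, L, L_experts = aggregating_algorithm(expert_preds, outcomes, eta)
assert L <= L_experts.min() + np.log(n) / eta + 1e-9
```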
Lemma. For each $t \in \mathbb{N}$, one has
$$L_t \leq - \frac{1}{\eta} \log \sum_{i=1}^{n} \pi_0 (i) \exp ( - \eta L_t (i) ).$$
Proof. We proceed by induction on $t$. Obviously, the inequality holds for $t = 0$ (for which we set $L_t = L_t (i) = 0$ for all $i \leq n$). Assume the inequality holds for $t = T$. Unrolling the update rule gives $\pi_T (i) = \pi_0 (i) \exp ( - \eta L_T (i) ) / ( z_1 \cdots z_T )$, and summing over $i$ shows $z_1 \cdots z_T = \sum_{j=1}^{n} \pi_0 (j) \exp ( - \eta L_T (j) )$. We write
$$- \frac{1}{\eta} \log \sum_{i=1}^{n} \pi_0 (i) \exp ( - \eta L_{T+1} (i) ) = - \frac{1}{\eta} \log \sum_{j=1}^{n} \pi_0 (j) \exp ( - \eta L_T (j) ) - \frac{1}{\eta} \log \sum_{i=1}^{n} \pi_T (i) \exp ( - \eta \lambda ( \omega_{T+1}, \gamma_{T+1} (i) ) ).$$
By the definition of the AA (with $\omega = \omega_{T+1}$), the second term is at least $\lambda ( \omega_{T+1}, \gamma_{T+1} )$, so the RHS is bounded below by
$$- \frac{1}{\eta} \log \sum_{j=1}^{n} \pi_0 (j) \exp ( - \eta L_T (j) ) + \lambda ( \omega_{T+1}, \gamma_{T+1} ).$$
By the induction hypothesis, the RHS can be further bounded below by
$$L_T + \lambda ( \omega_{T+1}, \gamma_{T+1} ) = L_{T+1}.$$
This proves the lemma. Q.E.D.
By the lemma, we have, for every $i \leq n$,
$$L_t \leq - \frac{1}{\eta} \log \sum_{j=1}^{n} \pi_0 (j) \exp ( - \eta L_t (j) ) \leq - \frac{1}{\eta} \log \big( \pi_0 (i) \exp ( - \eta L_t (i) ) \big).$$
Hence we obtain
$$L_t \leq L_t (i) + \frac{1}{\eta} \log \frac{1}{\pi_0 (i)}.$$
Choosing $\pi_0 ( i ) = 1 / n$ for all $i$ (the uniform distribution) proves the proposition.