This post is essentially a summary of some results in the classical paper “A game of prediction with expert advice” by V. Vovk.

**Definition.** A *game* is a triple $( \Omega, \Gamma, \lambda )$, where $\Omega$ is the *outcome space*, $\Gamma$ is the *prediction space*, and $\lambda: \Omega \times \Gamma \to [0, \infty]$ is the *loss function*.

Consider the standard *learning with expert advice* setting, in which a *learner* sequentially tries to predict the outcomes chosen by *nature*, with the help of $n$ *experts*.
Precisely speaking, at each trial $t \in \mathbb{N}$:

- Each expert $i$, $1 \leq i \leq n$, makes a prediction $\gamma_t ( i ) \in \Gamma$.
- The learner makes a prediction $\gamma_t \in \Gamma$.
- Nature chooses an outcome $\omega_t \in \Omega$.
- The learner suffers the loss $\lambda ( \omega_t, \gamma_t )$.

Notice that before making its prediction at trial $t$, the learner has access to all past outcomes of nature (up to trial $t - 1$) and all of the experts' predictions (up to trial $t$).

For every $t \in \mathbb{N}$, let us define the learner's *cumulative loss* as

$$L_t := \sum_{s = 1}^{t} \lambda ( \omega_s, \gamma_s ),$$

and each expert's cumulative loss as

$$L_t ( i ) := \sum_{s = 1}^{t} \lambda ( \omega_s, \gamma_s ( i ) ).$$

The learner's goal is to predict almost as well as the *best expert*, in the sense that $L_t \leq L_t ( i ) + \varepsilon$ for all $i \leq n$ and $t \in \mathbb{N}$, with $\varepsilon$ as small as possible (in particular, independent of $t$).

**Definition.** We say the game is *$\eta$-mixable* for some $\eta > 0$, if for any probability distribution $P$ on $\Gamma$, there exists a $\gamma^\star \in \Gamma$, such that

$$\lambda ( \omega, \gamma^\star ) \leq - \frac{1}{\eta} \log \int_\Gamma \exp ( - \eta \lambda ( \omega, \gamma ) ) \, P ( \mathrm{d} \gamma ) \quad \text{for all } \omega \in \Omega.$$

**Remark.** Suppose $\Gamma$ is a convex subset of a vector space. If the mapping $\gamma \mapsto \exp ( - \eta \lambda ( \omega, \gamma ) )$ is concave for all $\omega \in \Omega$, then by Jensen's inequality the game is $\eta$-mixable.
Any such loss function $\lambda$ is called *exp-concave*.
Indeed, if the loss is exp-concave, one can simply choose $\gamma^\star = \int \gamma \, P ( \mathrm{d} \gamma )$, which lies in $\Gamma$ by convexity.
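As a concrete example (my own, not from the paper): in the binary log-loss game with $\Omega = \{ 0, 1 \}$, $\Gamma = ( 0, 1 )$, and $\lambda ( \omega, \gamma ) = - \omega \log \gamma - ( 1 - \omega ) \log ( 1 - \gamma )$, the map $\gamma \mapsto \exp ( - \lambda ( \omega, \gamma ) )$ equals $\gamma$ for $\omega = 1$ and $1 - \gamma$ for $\omega = 0$, both linear (hence concave) in $\gamma$, so the game is exp-concave with $\eta = 1$. A quick numerical sanity check of the Jensen inequality:

```python
import math
import random

def log_loss(omega, gamma):
    """Binary log loss: lambda(omega, gamma) = -log of the probability assigned to omega."""
    return -math.log(gamma) if omega == 1 else -math.log(1.0 - gamma)

random.seed(0)
for _ in range(1000):
    # A random finitely supported distribution P on Gamma = (0, 1).
    gammas = [random.uniform(0.01, 0.99) for _ in range(5)]
    raw = [random.random() for _ in range(5)]
    probs = [w / sum(raw) for w in raw]
    gamma_star = sum(p * g for p, g in zip(probs, gammas))  # the mean prediction
    for omega in (0, 1):
        # Exp-concavity with eta = 1: exp(-loss at the mean) >= mean of exp(-losses).
        lhs = math.exp(-log_loss(omega, gamma_star))
        rhs = sum(p * math.exp(-log_loss(omega, g)) for p, g in zip(probs, gammas))
        assert lhs >= rhs - 1e-12
print("exp-concavity check passed")
```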

**Proposition.** *If a game is $\eta$-mixable, there exists a prediction algorithm for the learner, such that*

$$L_t \leq L_t ( i ) + \frac{\log n}{\eta} \quad \textit{for all } i \leq n \textit{ and } t \in \mathbb{N}.$$

One such algorithm is the *aggregating algorithm (AA)*, first introduced in the paper “Aggregating strategies” by V. Vovk.
Let $( \pi_0 ( i ) )_{i \leq n}$ be a probability vector whose entries are all non-zero.
At each trial $t \in \mathbb{N}$, the AA outputs a prediction $\gamma_t \in \Gamma$ satisfying

$$\lambda ( \omega, \gamma_t ) \leq - \frac{1}{\eta} \log \sum_{i = 1}^{n} \pi_{t - 1} ( i ) \exp ( - \eta \lambda ( \omega, \gamma_t ( i ) ) ) \quad \text{for all } \omega \in \Omega,$$

and after observing $\omega_t$, the AA updates the probability vector as

$$\pi_t ( i ) = \frac{\pi_{t - 1} ( i ) \exp ( - \eta \lambda ( \omega_t, \gamma_t ( i ) ) )}{z_t},$$

where $z_t$ is a normalizing constant ensuring that $( \pi_t (i) )_{i \leq n}$ is again a probability vector. The AA is well-defined: applying $\eta$-mixability to the distribution that puts mass $\pi_{t - 1} ( i )$ on $\gamma_t ( i )$ shows that such a $\gamma_t$ exists.
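To make the algorithm concrete, here is a minimal sketch (my own, not from the paper) of the AA for the binary log-loss game $\Omega = \{ 0, 1 \}$, $\Gamma = ( 0, 1 )$, $\lambda ( \omega, \gamma ) = - \omega \log \gamma - ( 1 - \omega ) \log ( 1 - \gamma )$ with $\eta = 1$. Since this loss is exp-concave, the $\pi$-weighted mean of the experts' predictions is a valid choice of $\gamma_t$:

```python
import math

def aggregating_algorithm(expert_preds, outcomes, eta=1.0):
    """Run the AA on the binary log-loss game.

    expert_preds: list over trials of lists of n expert predictions in (0, 1).
    outcomes:     list over trials of outcomes in {0, 1}.
    Returns the learner's cumulative loss and the experts' cumulative losses.
    """
    n = len(expert_preds[0])
    pi = [1.0 / n] * n                 # uniform prior pi_0
    L, L_expert = 0.0, [0.0] * n
    loss = lambda w, g: -math.log(g) if w == 1 else -math.log(1.0 - g)
    for preds, omega in zip(expert_preds, outcomes):
        # Exp-concave loss: the pi-weighted mean is a valid AA prediction.
        gamma = sum(p * g for p, g in zip(pi, preds))
        L += loss(omega, gamma)
        L_expert = [Li + loss(omega, g) for Li, g in zip(L_expert, preds)]
        # Multiplicative update pi_t(i) proportional to pi_{t-1}(i) exp(-eta loss_i).
        weights = [p * math.exp(-eta * loss(omega, g)) for p, g in zip(pi, preds)]
        z = sum(weights)               # normalizing constant z_t
        pi = [w / z for w in weights]
    return L, L_expert

# Example: two constant experts on a run of ones; the learner stays
# within log(2) of the better expert, as the proposition promises.
L, Ls = aggregating_algorithm([[0.9, 0.1]] * 20, [1] * 20)
assert L <= min(Ls) + math.log(2) + 1e-9
```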

**Lemma.** For each $t \in \mathbb{N} \cup \{ 0 \}$, one has

$$\exp ( - \eta L_t ) \geq \sum_{i = 1}^{n} \pi_0 ( i ) \exp ( - \eta L_t ( i ) ).$$

**Proof.** We proceed by induction on $t$.
The inequality holds for $t = 0$ (for which we set $L_0 = L_0 ( i ) = 0$ for all $i \leq n$), since both sides equal $1$.
Assume the inequality holds for $t = T$.
We write

$$\exp ( - \eta L_{T + 1} ) = \exp ( - \eta L_T ) \exp ( - \eta \lambda ( \omega_{T + 1}, \gamma_{T + 1} ) ).$$

By the definition of the AA, the RHS is bounded below by

$$\exp ( - \eta L_T ) \sum_{i = 1}^{n} \pi_T ( i ) \exp ( - \eta \lambda ( \omega_{T + 1}, \gamma_{T + 1} ( i ) ) ).$$

Unrolling the update rule gives $\pi_T ( i ) \prod_{s = 1}^{T} z_s = \pi_0 ( i ) \exp ( - \eta L_T ( i ) )$, and summing over $i$ shows $\prod_{s = 1}^{T} z_s = \sum_{i = 1}^{n} \pi_0 ( i ) \exp ( - \eta L_T ( i ) )$, which is at most $\exp ( - \eta L_T )$ by the induction hypothesis. Hence the RHS can be further bounded below by

$$\prod_{s = 1}^{T} z_s \sum_{i = 1}^{n} \pi_T ( i ) \exp ( - \eta \lambda ( \omega_{T + 1}, \gamma_{T + 1} ( i ) ) ) = \sum_{i = 1}^{n} \pi_0 ( i ) \exp ( - \eta L_{T + 1} ( i ) ).$$

This proves the lemma. *Q.E.D.*

By the lemma, we have, for every $i \leq n$,

$$\exp ( - \eta L_t ) \geq \sum_{j = 1}^{n} \pi_0 ( j ) \exp ( - \eta L_t ( j ) ) \geq \pi_0 ( i ) \exp ( - \eta L_t ( i ) ).$$

Hence we obtain

$$L_t \leq L_t ( i ) + \frac{1}{\eta} \log \frac{1}{\pi_0 ( i )}.$$

Choosing $\pi_0 ( i ) = 1 / n$ for all $i$ (the uniform distribution) proves the proposition.
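As a final sanity check (my own illustration, not from the paper), one can simulate the binary log-loss game ($\Omega = \{ 0, 1 \}$, $\Gamma = ( 0, 1 )$, $\lambda ( \omega, \gamma ) = - \omega \log \gamma - ( 1 - \omega ) \log ( 1 - \gamma )$, which is exp-concave with $\eta = 1$) against random experts and random outcomes, and verify that the proposition's bound holds at every trial:

```python
import math
import random

random.seed(1)
n, T, eta = 10, 500, 1.0
loss = lambda w, g: -math.log(g) if w == 1 else -math.log(1.0 - g)

pi = [1.0 / n] * n                 # uniform prior pi_0(i) = 1/n
L, L_expert = 0.0, [0.0] * n
for t in range(T):
    preds = [random.uniform(0.05, 0.95) for _ in range(n)]  # experts' advice
    omega = random.randint(0, 1)                            # nature's outcome
    gamma = sum(p * g for p, g in zip(pi, preds))           # AA prediction (exp-concave loss)
    L += loss(omega, gamma)
    L_expert = [Li + loss(omega, g) for Li, g in zip(L_expert, preds)]
    weights = [p * math.exp(-eta * loss(omega, g)) for p, g in zip(pi, preds)]
    z = sum(weights)
    pi = [w / z for w in weights]
    # The proposition's regret bound: L_t <= L_t(i) + log(n)/eta for every i.
    assert all(L <= Li + math.log(n) / eta + 1e-9 for Li in L_expert)
print("regret bound holds at every trial")
```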