NIPS 2016 was quite successful, in the sense that most of the papers interesting to me were chosen as oral presentations:) The list below is of course non-exhaustive and biased. The order is alphabetical, according to the last names of the presenters/first authors.

- “Kernel-based Methods for Bandit Convex Optimization” by S. Bubeck
- Bubeck wrote a blog article on this paper: part 1, part 2, and part 3.
- Bubeck had given basically the same talk at the Simons Institute (youtube video).

- “Supervised learning through the lens of compression” by O. David, S. Moran, and A. Yehudayoff
- Roughly speaking, a function class is learnable if it allows (approximate) compression. Notice that the compression is not defined in the Shannon-theoretic way.

- “Generalization of ERM in stochastic convex optimization: The dimension strikes back” by V. Feldman
- This paper shows that minimizing the empirical average is not an optimal strategy for stochastic approximation in general.

- “Safe testing: An adaptive alternative to p-value-based testing” by P. Grunwald
- Grunwald proposed the notion of safe test, which is robust against possible abuse of statistical methods, such as collecting data until the p-value is large enough. He also provided an algorithm for safe testing, based on the so-called reverse I-projection.
- Unfortunately, it seems that the paper has not been available on the internet.

- “Theory and algorithms for forecasting non-stationary time series” by V. Kuznetsov and M. Mohri
- This is a tutorial talk mainly based on their NIPS’15 paper and COLT’16 paper.

- “Without-replacement sampling for stochastic gradient methods” by O. Shamir
- This paper provides convergence guarantees for the setting mentioned in its title.

- “Stochastic optimization: Beyond stochastic gradients and convexity: Part 2” by S. Sra
- The slides can serve as a good bibliography on solving non-convex finite-sum optimization problems.

- “MetaGrad: Multiple learning rates in online learning” by T. van Erven and W. M. Koolen
- This paper proposes a somewhat universally optimal scheme for online learning. The idea is to discretize the interval of possible step sizes, treat each candidate step size as an expert, and then do prediction with experts to choose the step size.