Information-Theoretic Learning

Project name: Information-Theoretic Learning (INS4.2)

Coordinator of this project: Prof.dr. P.D. Grunwald
Startdate: 1997-01-01
Enddate: 2100-01-01

INS 4.2 addresses a broad range of issues related to machine learning, both theoretically and practically, and mostly from an information-theoretic perspective. In particular, we study the relation between data compression, generalization properties and prediction in the sense of the "minimum description length" paradigm, basically a formal version of Occam's Razor. The project coordinator is the author of "The Minimum Description Length Principle", the first comprehensive overview of this area.

Recent work includes new methods for dealing with MDL/Bayesian inference when all models are wrong. In particular, we identified a novel practical model selection and prediction method that allows for substantial improvement of both cross-validation/AIC-type methods on the one hand, and Bayesian model selection/averaging methods on the other hand. In other recent work, it is shown that both (standard) MDL and Bayes can behave quite badly if the model is only slightly wrong.

We also study the Kolmogorov structure function and its relation to the `idealized' versions of MDL based on Kolmogorov complexity. Relatedly, we have recently developed an individual-sequence version of rate-distortion theory. This fits in the broader picture of an algorithmic rather than probabilistic approach to statistics and learning.

More practically oriented research includes similarity analysis by data compression, which can be used to determine a compression-based "distance" between any two types of objects, be they, for example, DNA sequences, literary texts or midi files. A variation of this idea has led to the Normalized Google Distance, which uses the world-wide web to determine the similarity between objects.