In a paper from today’s edition of Science (Detecting Novel Associations in Large Data Sets), Reshef et al. propose a new measure of correlation which they call the maximal information coefficient (MIC). The authors focus on detecting and ranking correlations in large, high-dimensional datasets. They argue that previous measures of correlation lack at least one of the following properties:
- Generality – the correlation coefficient should be sensitive to a wide range of possible dependencies, including superpositions of functions.
- Equitability – the score of the coefficient should be influenced by noise, but not by the form of the dependency between variables.
These requirements sound sensible, but note that the resulting metric goes far beyond traditional measures of correlation, towards what I would think of as a general pattern recognition algorithm that is sensitive to any type of systematic pattern between two variables (see the examples in Fig. 2 of the paper). One could probably argue about the second requirement, though: there would be a certain sense in rating more complicated relationships lower, based on the usual probability/parsimony arguments, but I can also see the argument against that.
Anyway, the authors go on to show that the MIC fulfills these requirements and works well when applied to several real-world datasets. The MIC is computed by “gridding” the scatterplot of the two variables at different resolutions, finding the grid partitioning with the largest mutual information at each resolution, normalizing the mutual information values, and choosing the maximum value among all considered resolutions. There is a MINE Website with more information and code on this algorithm, and a blog entry by Michael Mitzenmacher which might also link to more information on the paper in the future.
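To make the gridding idea a bit more concrete, here is a minimal Python sketch of it. This is not the authors' implementation: the paper searches for the best partition at each grid resolution, whereas this sketch simply uses equal-frequency bins, so it can only give a rough lower bound on the real MIC. The function names (`approx_mic`, `equal_frequency_bins`, `mutual_information`) are made up for illustration, and the grid-size budget of `n ** 0.6` cells is my assumption about a reasonable default.

```python
import numpy as np


def mutual_information(counts):
    """Mutual information (in nats) of a 2-D table of cell counts."""
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)  # marginal over rows (x bins)
    py = p.sum(axis=0, keepdims=True)  # marginal over columns (y bins)
    nz = p > 0                         # skip empty cells to avoid log(0)
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())


def equal_frequency_bins(v, k):
    """Assign each value to one of k bins holding roughly equal numbers of points."""
    ranks = np.argsort(np.argsort(v))  # rank 0..n-1, ties broken arbitrarily
    return ranks * k // len(v)


def approx_mic(x, y, exponent=0.6):
    """Crude MIC-like score: maximum normalized mutual information over grids.

    Simplification (assumption): equal-frequency bins instead of the
    optimized partitions used in the paper.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    budget = len(x) ** exponent        # assumed cap on the number of grid cells
    best = 0.0
    for kx in range(2, int(budget // 2) + 1):
        bx = equal_frequency_bins(x, kx)
        for ky in range(2, int(budget // kx) + 1):
            by = equal_frequency_bins(y, ky)
            counts = np.zeros((kx, ky))
            np.add.at(counts, (bx, by), 1)  # fill the kx-by-ky contingency table
            # Normalize so that scores from different grid sizes are comparable.
            score = mutual_information(counts) / np.log(min(kx, ky))
            best = max(best, score)
    return best


x = np.linspace(-1.0, 1.0, 500)
linear_score = approx_mic(x, x)         # noiseless linear relationship
parabola_score = approx_mic(x, x ** 2)  # noiseless, but Pearson's r is about zero here
```

Both noiseless relationships should score close to 1 in this sketch, which is the behaviour the equitability requirement asks for, while two independent variables should score much lower.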
It would be interesting, I guess, to see whether this method can be extended to higher-dimensional correlations.
ADDITIONAL COMMENTS 05.04.12: There is some additional discussion of the properties of MIC as a statistical estimator at Science and on the blog of Andrew Gelman that is probably of interest, most importantly relating to the critique by Noah Simon & Robert Tibshirani, as well as the comments by Gorfine, Heller & Heller.
ADDITIONAL COMMENTS 06.02.13: Justin B. Kinney made me aware of some follow-up work by him and Gurinder Atwal that questions the claimed “equitability” properties of MIC. A discussion of this paper, including a response by Michael Mitzenmacher in the comment thread and a counter-response by Kinney, can be found on Andrew Gelman’s blog.