In a paper from today’s edition of Science (Detecting Novel Associations in Large Data Sets), Reshef et al. propose a new measure of correlation which they call the **maximal information coefficient (MIC)**. The authors focus on detecting and ranking correlations in large, high-dimensional datasets. They argue that previous measures of correlation lack at least one of the following properties:

- **Generality**: the correlation coefficient should be sensitive to a wide range of possible dependencies, including superpositions of functions.
- **Equitability**: the score of the coefficient should be influenced by noise, but not by the form of the dependency between variables.

These requirements sound sensible, but note that the resulting metric goes far beyond traditional measures of correlation, towards what I would think of as a general pattern recognition algorithm that is sensitive to any type of systematic pattern between two variables (see the examples in Fig. 2 of the paper). One could probably argue about the second requirement, though: there would be a certain sense in rating more complicated relationships lower, following the typical probability/parsimony arguments, but I can also see the argument against that.

Anyway, the authors go on to show that the MIC (which is based on “gridding” the correlation space at different resolutions, finding the grid partitioning with the largest mutual information at each resolution, normalizing the mutual information values, and choosing the maximum value among all considered resolutions as the MIC) fulfills these requirements and works well when applied to several real-world datasets. There is a MINE Website with more information and code on this algorithm, and a blog entry by Michael Mitzenmacher which might also link to more information on the paper in the future.
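To make the gridding idea a bit more concrete, here is a rough Python sketch of the steps described above. Please take it as an illustration under my own assumptions, not as the authors' implementation: the paper searches for the best grid partitioning at each resolution, whereas this sketch simply uses equal-frequency bins on both axes, so it will generally underestimate the true MIC. The function names (`approx_mic`, `equal_frequency_bins`, `mutual_information`) and the `exponent` parameter that limits the grid size are hypothetical choices of mine.

```python
import numpy as np


def mutual_information(x_bins, y_bins, nx, ny):
    """Mutual information (in nats) between two integer-labelled binnings."""
    joint = np.zeros((nx, ny))
    np.add.at(joint, (x_bins, y_bins), 1)   # count points per grid cell
    joint /= joint.sum()                     # turn counts into probabilities
    px = joint.sum(axis=1, keepdims=True)    # marginal over x bins
    py = joint.sum(axis=0, keepdims=True)    # marginal over y bins
    nonzero = joint > 0                      # empty cells contribute nothing
    ratio = joint[nonzero] / (px @ py)[nonzero]
    return float(np.sum(joint[nonzero] * np.log(ratio)))


def equal_frequency_bins(values, k):
    """Assign each value to one of k roughly equally populated bins."""
    ranks = np.argsort(np.argsort(values))   # ranks 0 .. n-1
    return (ranks * k) // len(values)


def approx_mic(x, y, exponent=0.6):
    """Simplified MIC-like score in [0, 1].

    Assumption: grids are restricted to nx * ny <= n ** exponent cells, and
    only equal-frequency grids are tried (no search for the optimal grid).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    budget = len(x) ** exponent
    best = 0.0
    nx = 2
    while nx * 2 <= budget:
        ny = 2
        while nx * ny <= budget:
            mi = mutual_information(
                equal_frequency_bins(x, nx), equal_frequency_bins(y, ny), nx, ny
            )
            # Normalize by the largest mutual information this grid allows.
            best = max(best, mi / np.log(min(nx, ny)))
            ny += 1
        nx += 1
    return best
```

A quick way to play with it is to compare a linear relationship, a parabola, and pure noise:

```python
x = np.linspace(0.0, 1.0, 200)
rng = np.random.default_rng(0)
print(approx_mic(x, x))                    # linear
print(approx_mic(x, (x - 0.5) ** 2))       # non-monotonic
print(approx_mic(rng.random(200), rng.random(200)))  # independent noise
```

The normalization by the logarithm of the smaller grid dimension is what keeps the score between 0 and 1, since the mutual information of a grid cannot exceed that value.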

It would be interesting, I guess, to see whether this method can be extended to higher-dimensional correlations.

ADDED LATER: Of interest may also be a post by Andrew Gelman, and a news article in Nature.

ADDITIONAL COMMENTS 05.04.12: There is some additional discussion of the properties of MIC as a statistical estimator at Science and at the blog of Andrew Gelman that is probably of interest, most importantly relating to the critique by Noah Simon & Robert Tibshirani, as well as the comments by Gorfine, Heller & Heller.

ADDITIONAL COMMENTS 06.02.13: Justin B. Kinney made me aware of some follow-up work by him and Gurinder Atwal that questions the claimed “equitability” properties of MIC. A discussion of this paper, including a response by Michael Mitzenmacher in the comment thread and a counter-response by Kinney, can be found on Andrew Gelman’s blog.


I wonder what you think of the responses to this paper, which say that it has perhaps already been done, and been done better, with distance-based correlation and covariance. http://scientificbsides.wordpress.com/2012/01/23/detecting-novel-associations-in-large-data-sets-let-the-giants-battle-it-out/

Well, this is not really my field, but I guess it’s safe to say that MIC is structurally different from other correlation measures, so I don’t think it has already been done. The more relevant question seems to be whether this new measure is more appropriate or useful than its alternatives, given that there are a fair number of them out there. Simon and Tibshirani certainly point out some properties of MIC that are generally not considered desirable for a statistical estimator, but many factors affect the popularity of statistical methodology, including ease of use, results that people find relevant, etc. I guess we’ll just have to wait and see how MIC is accepted. After all, despite all the excitement of seeing a statistical publication in Science, what has been proposed is simply a new measure of correlation. It may be very useful for many, but most likely it will have some strengths and some weaknesses, like many of the existing correlation measures, and you’ll have to know when to use it and when not.

Following up on that subject, note my additions from 05.04.12 at the end of the post.