Note also the blog post by Daniel Lakens on the same topic here.
Update March 23, 2022: a more concise / slightly modified version of these thoughts was published as a letter in TREE, co-authored together with Fred Barraquand.
Criticism of p-values has been rampant in recent years, as have predictions of their imminent demise. On closer inspection, most of these critiques are not sparked by the definition of the p-value (Fisher, 1925) as such, but rather by the NHST framework (Neyman and Pearson, 1933; Neyman, 1950), which introduces a cut-off (significance level alpha) to transform the p-value into a binary decision (significant / n.s.) while trading off Type I/II error rates in the process.
Just to be clear: I fully agree with many of these critiques, in particular that a fixed threshold is somewhat arbitrary, favours p-hacking and loses information compared to continuous measures of evidence for / against the null. I also agree that p-values are too dominant in the presentation and discussion of scientific results, compared to other indicators, for example effect sizes & CIs. On the other hand, having an operational, objective framework to decide whether the data points towards an effect or not is important well beyond science (think funding, regulation and political decisions), and if the NHST framework is to be abandoned for that purpose, what should replace it?
The suggestion by Muff et al.
In a recently published paper in TREE, Muff et al. recapitulate these issues. They also mention established alternatives to NHST, such as information criteria (IC), confidence intervals (CI) or the Bayes factor (BF); stating, however, that:
[…] when these alternatives are used to make binary decisions, for example regarding the inclusion of variables in model selection, when checking whether the null effect lies in a CI, or when employing a certain threshold of a Bayes factor, we are not doing anything different from an NHST. It can, for example, be shown that model selection based on the AIC criterion can be converted into P-value limits (e.g., ), and even Bayes factors have an approximate equivalent in terms of P-values [4,25,26]. Finally, checking whether a certain value (often 0) lies outside the CI is equivalent to checking the P-value limit, like P < 0.05 if the 95% CI is used, for example.
This statement seemed somewhat odd, because although it is of course possible to define thresholds on the BF, there is no need to do so. Kass & Raftery (1995), the classical citation for the BF, suggest that it should be interpreted as a continuous measure of evidence. Here is a snapshot from the paper:
Thus, if we want a measure of evidence in favour of H1, we could just use the BF. Muff et al., however, seem to implicitly dismiss the use of the BF by giving a different recommendation. They suggest to “stop using the term ‘statistical significance’” in conjunction with the p-value, and “replace it with a gradual notion of evidence” based on the p-value. What they mean by that is summarised in their Fig 1 (reproduced below).
They further recommend that this idea should be directly translated into text, as can be seen in a reproduction of Table 1 from their paper below (Fig. 3), which shows how statistical results should be reported.
The paper gives no explicit explanation for why they think this option is preferable to the more statistically established BF. What I read between the lines, however, is that they think that convincing people to use the BF would be too large a step, and we thus need something simple that is readily available and builds on what people know. For example, they write:
There is ample agreement, maybe the lowest common denominator of the whole discussion, that we should retire statistical significance by eliminating binary decision making from most scientific papers . Such a transition will take time, but the show must go on today and we urgently need simple and safe ways to bypass the current state of disorientation.
I do understand the argument that using the p-value to calculate a continuous measure of evidence would be far more popular in the E&E community than using the BF, because the p-value is what people are used to and what is reported by the software they use. Still, that presupposes that all information necessary to quantify the evidence provided by a study is contained in the p-value, which is far from obvious. We know, for example, that effect size alone is not sufficient to quantify the evidence for an effect, so how can we be sure that the p-value alone will suffice?
What do we even mean by “evidence”?
To approach the question of whether a p-value can be sensibly translated into evidence, we first have to consider what we mean by the word “evidence”. After all, if by evidence we simply mean some monotonically increasing function of 1-p, the proposal is trivially true. In this case, Fig. 1 / Table 2 would just amount to look-up tables that translate the p-value into words and vice versa, in the same way that we often display a * in regression tables.
Unfortunately, Muff et al. offer no formal definition of what they mean by “evidence”, other than saying that “evidence is a very intuitive concept”. Nevertheless, I think we can reject the interpretation above. First of all, a mere verbal translation of the numeric value of p would be highly redundant, as numeric p-values are presented in the text as well; and secondly, it is clear that the words “weak evidence” and “strong evidence” will be interpreted by a human reader not as a transformation of the p-value, but in the common-sense meaning of the word. By this I mean that after reading that a study presents “strong evidence”, I would understand that, in the absence of other information, I should be relatively convinced that an effect is present (let’s say something like > 80% probable), while after reading that a study reports “weak evidence”, I would understand something like “more likely than not, but not so sure”. Also, I would understand that:
- If two studies have equal evidence (as derived from the p-value), then we should have equal conviction that !H0 is true, in the absence of other information
- If study 1 has lower evidence (as derived from the p-value) than study 2, then we should be less convinced that !H0 is true based on study 1 than based on study 2, in the absence of other information
- It should also be possible to “add” the evidence in study 1 and 2 in some consistent way
Is there a relationship between BF and p-value?
A sensible operational definition of evidence that satisfies the above-stated properties is the Bayes Factor (BF), which gives us the relative probability of the hypotheses given the data, assuming equal priors on the hypotheses. Assuming that other sensible definitions of evidence should provide values similar to the BF would thus require the existence of a UNIVERSAL correspondence between evidence as summarised by the BF and evidence derived from a p-value according to Muff et al., irrespective of the statistical model, sample size, power and the relative complexities of H0 and H1.
That this cannot be true in general can easily be seen by looking at the formula for the BF, where we see that the relative model complexity of H0 / H1 (including the parameter priors) interacts with the data size to produce the final BF in a way that is unlikely to be universally connected to the p-value. The variance can be somewhat reduced by standardising the parameter priors of H0/H1, but discrepancies remain. This is nicely summarised in the review by Held & Ott (2018), who explicitly advise against translating p-values directly into evidence for that reason. If we look at Fig. 2a of this paper (Fig. 4 of this text), we see that the relationship of the BF to the p-value is sample-size dependent.
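The sample-size dependence can be illustrated with a minimal sketch. It assumes a deliberately simple, hypothetical setup that is not taken from Held & Ott: a two-sided z-test of H0: mu = 0 with known standard deviation sigma, against H1: mu ~ Normal(0, tau^2). In this setup the BF in favour of H1 has a closed form, BF10 = exp(0.5 * z^2 * r / (1 + r)) / sqrt(1 + r) with r = n * tau^2 / sigma^2, so we can hold the p-value fixed and vary only n. The function name and the default values of tau and sigma are illustrative choices.

```python
from math import exp, sqrt
from statistics import NormalDist


def bf10_fixed_p(p, n, tau=1.0, sigma=1.0):
    """BF in favour of H1 for a two-sided z-test with p-value p and sample size n.

    Assumed (hypothetical) setup: H0: mu = 0 versus H1: mu ~ Normal(0, tau^2),
    with known observation standard deviation sigma.
    """
    # |z| statistic that corresponds to the given two-sided p-value
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    # prior variance of mu relative to the sampling variance of the mean
    r = n * tau**2 / sigma**2
    return exp(0.5 * z**2 * r / (1.0 + r)) / sqrt(1.0 + r)


if __name__ == "__main__":
    # Same p-value in every row, only the sample size changes
    for n in (10, 100, 1000, 10000):
        print(f"p = 0.05, n = {n:>5}: BF10 = {bf10_fixed_p(0.05, n):.2f}")
```

Under these assumptions, the same p = 0.05 corresponds to a BF10 above 1 for n = 10 but below 1 for n = 1000, i.e. the data mildly favour H1 in the small study and favour H0 in the large one. Other priors would give different numbers, but the dependence on n remains.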
Noting that the BF doesn’t map universally on the p-value (Fig. 2a), Held & Ott write:
Assuming an alternative hypothesis H1 has also been specified, the Bayes factor directly quantifies whether the data have increased or decreased the odds of H0. A better approach than categorizing a p-value is thus to transform a p-value to a Bayes factor or a lower bound on a Bayes factor, a so-called minimum Bayes factor (Goodman 1999b). But many such ways have been proposed to calibrate p-values, and there is currently no consensus on how p-values should be transformed to Bayes factors.
Here, they introduce the concept of the minimum BF, defined as “the smallest possible Bayes factor within a prespecified class of prior distributions over alternative hypotheses”. We can see that the minimum BF has a more general relationship with the p-value (Fig. 2b), but judging from the other figures in Held & Ott (2018), this relationship is still not universal across all models (I suspect because the minimum BF is still contingent on priors and the hypotheses H0/H1); moreover, seeing that BF and minimum BF can be quite different, the question is how the minimum BF is to be interpreted, and whether we really improve the status of statistical reporting by introducing “the smallest possible evidence in favour of H1” instead of the “evidence in favour of H1”. Held & Ott (2018) is cited by Muff et al., so I assume they are aware of the content of the paper, but they apparently interpreted it differently. Anyway, my understanding of the literature is that a universal correspondence between p-value and BF is hard to establish, but that what Muff et al. define is likely closer to a minimum BF than to a conventional BF, which means that it is NOT the evidence in favour of H1 (which would be the conventional BF).
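To give a feeling for what a minimum BF looks like in practice, here is a small sketch of two calibrations that are well known from the literature: the bound exp(-z^2 / 2) over all simple (point) alternatives for a two-sided z-test, and the "-e p log p" bound, which is only defined for p < 1/e. The function names are my own, and the choice of these two calibrations is an assumption for illustration; as the quote above says, there is no consensus on which calibration to use.

```python
from math import e, exp, log
from statistics import NormalDist


def min_bf_simple_alternative(p):
    """Lower bound on the BF of H0 vs H1 over all point alternatives (two-sided z-test)."""
    z = NormalDist().inv_cdf(1.0 - p / 2.0)
    return exp(-0.5 * z**2)


def min_bf_e_p_log_p(p):
    """The '-e p log p' lower bound on the BF of H0 vs H1, defined for 0 < p < 1/e."""
    if not 0.0 < p < 1.0 / e:
        raise ValueError("calibration is only defined for 0 < p < 1/e")
    return -e * p * log(p)


if __name__ == "__main__":
    for p in (0.05, 0.01, 0.001):
        a = min_bf_simple_alternative(p)
        b = min_bf_e_p_log_p(p)
        # 1 / minimum BF is the largest possible evidence in favour of H1
        print(f"p = {p}: max BF10 = {1 / a:.1f} (simple alt.), {1 / b:.1f} (-e p log p)")
```

Note that the two bounds disagree with each other for the same p-value, and that both are upper limits on the evidence for H1 rather than the evidence itself.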
To summarise, I see three immediate problems with the suggestion by Muff et al.:
- Muff et al. provide no operational definition (in terms of probabilities) of what they mean by evidence, other than the p-value itself. If we define the evidence = p-value, the definition is circular and redundant.
- Inevitably, however, readers will interpret the word “evidence” not as a p-value, but as an absolute measure of how persuasive the data is in favour of H1 (or at least !H0). In statistics, this concept is usually measured by the BF, but we can mathematically prove that the BF does not map universally on the definition of Muff et al., thus we would have two mutually inconsistent quantities for measuring evidence in the literature.
- More importantly, however, the current proposal is likely internally inconsistent as well, in the sense that studies with equal p-value do not always carry the same evidence according to common-sense definitions of the word. For example, the proposed mapping is likely not suitable to compare evidence across studies with different sample sizes (at least if by “more evidence” we mean “more likely to be true, all other things equal”), or to accumulate evidence in a meta-analysis, as the evidence contained in a p-value differs depending on the sample size of a study.
Based on this, I conclude that interpreting p-values directly as measures of evidence is unlikely to contribute to clarity in statistical reporting. If we want a continuous measure of evidence, a better solution would be to report the BF, which has a clear and established probabilistic definition. We could then think further about approximating the BF based on the p-value (e.g. by the minimum BF), but I do not see much sense in this because it is harder to interpret and, for most statistical end users, it would likely be easier to calculate the BF directly via available packages in R than to obtain the minimum BF.
Being explicit about the fact that evidence = BF would have an additional advantage: it allows us to also calculate posterior odds. Experience shows that many people (particularly with less statistical training) have a hard time distinguishing likelihood-based from posterior evidence, and will easily interpret a statement such as “we have strong evidence” as high posterior odds for an effect (i.e. as BF * prior odds). Personally, I would therefore recommend reporting evidence (e.g. “the data contains medium evidence for an effect”) together with the posterior odds of H1 (e.g. “together with prior odds from previous studies, we conclude that there is a high probability of an effect”), rather than relying on the (marginal) likelihood alone, as one can simultaneously have high evidence for an effect and low probability that an effect is present (as humorously noted in this XKCD cartoon).
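The step from evidence to posterior probability is a one-liner, sketched here with illustrative numbers that are not from any real study. The function name is my own; the only thing used is the identity posterior odds = BF * prior odds.

```python
def posterior_probability(bayes_factor, prior_probability):
    """Posterior probability of H1, given the BF in favour of H1 and the prior probability of H1."""
    prior_odds = prior_probability / (1.0 - prior_probability)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # The same evidence (BF = 10) combined with two different priors
    for prior in (0.5, 0.01):
        post = posterior_probability(10.0, prior)
        print(f"BF10 = 10, prior P(H1) = {prior}: posterior P(H1) = {post:.3f}")
```

With a BF of 10, a prior probability of 0.5 gives a posterior of about 0.91, whereas a prior probability of 0.01 gives a posterior of only about 0.09: high evidence for an effect, yet a low probability that the effect is present.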
The last point can easily be understood even from a frequentist viewpoint, when considering the rate at which significant p-values are type I errors, known as the false discovery rate (FDR). Fig. 3 shows a slide that I usually use in my BSc Statistics lecture. What we see is that the probability that a significant result is a false positive depends crucially on a) the rate at which !H0 is true in experiments, and b) the power of the experiments (1-beta).
When analysing the formula, we must conclude, and this has been repeated many times in the literature, that when considering the odds that an effect is there based on the p-value, one must do so in conjunction with the power of the experiment and the prior odds of an effect (see also the popular article by Regina Nuzzo, Nature News (2014)).
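Since the slide is not reproduced here, the following sketch assumes the standard textbook form of this calculation: among all significant results, the false positives occur at rate alpha * P(H0) and the true positives at rate power * P(!H0). The function and parameter names are my own.

```python
def false_discovery_rate(alpha, power, prior_h1):
    """Share of significant results that are false positives.

    alpha:    significance level (type I error rate)
    power:    1 - beta, probability of detecting a true effect
    prior_h1: rate at which !H0 is true across experiments
    """
    false_positives = alpha * (1.0 - prior_h1)
    true_positives = power * prior_h1
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # Illustrative values only: vary the prior rate of true effects and the power
    for prior_h1, power in ((0.5, 0.8), (0.1, 0.8), (0.1, 0.2)):
        fdr = false_discovery_rate(0.05, power, prior_h1)
        print(f"P(!H0) = {prior_h1}, power = {power}: FDR = {fdr:.2f}")
```

For alpha = 0.05 and a power of 0.8, the FDR is about 0.06 if half of all tested effects are real, but 0.36 if only one in ten is, and it increases further when power is low.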
Held, L., & Ott, M. (2018). On p-values and Bayes factors. Annual Review of Statistics and Its Application, 5, 393-419.
Muff, S., Nilsen, E. B., O’Hara, R. B., & Nater, C. R. (2021). Rewriting results sections in the language of evidence. Trends Ecol. Evol., Nov. 2021.
Fisher, R. (1925). Statistical Methods for Research Workers, First Edition. Edinburgh: Oliver and Boyd.
Neyman, J. (1950). Probability and Statistics. New York, NY: Holt.
Neyman, J., and Pearson, E. S. (1933). On the problem of the most efficient tests of statistical hypotheses. Philos. Trans. R. Soc. Lond. Ser. A 231, 289–337.
Nuzzo, R. (2014). Scientific method: statistical errors. Nature News, 506(7487), 150.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773-795.