p values – TARG Blog

I’ve written previously about the problems associated with an unhealthy fixation on P-values in psychology. Although null hypothesis significance testing (NHST) remains the dominant approach, there are a number of important problems with it. Tressoldi and colleagues summarise some of these in a recent article.

First, NHST focuses on rejection of the null hypothesis at a pre-specified level of probability (typically 5%, or 0.05). The implicit assumption, therefore, is that we are only interested in answering “Yes!” to questions of the form “Is there a difference from zero?”. What if we are interested in cases where the answer is “No!”? Since the null hypothesis is hypothetical and unobserved, NHST doesn’t allow us to conclude that the null hypothesis is true.

Second, P-values can vary widely when the same experiment is repeated (for example, because the participants you sample will be different each time) – in other words, a single P-value gives very unreliable information about whether a finding is likely to be reproducible. This is important in the context of recent concerns about the poor reproducibility of many scientific findings.
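This variability is easy to see in a small simulation. The sketch below is illustrative only: the effect size (0.5 standard deviations), the group size (32) and the number of repeats (20) are assumptions made up for the example, and a z-test with a known standard deviation of 1 stands in for the usual t-test so that only the Python standard library is needed.

```python
import math
import random
import statistics


def two_sided_p_from_z(z):
    """Two-sided P-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_p_values(effect=0.5, n_per_group=32, n_repeats=20, seed=1):
    """Repeat the same two-group experiment and collect each P-value.

    Assumes normally distributed scores with a known SD of 1 in both
    groups (hypothetical values chosen for illustration).
    """
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_repeats):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect, 1.0) for _ in range(n_per_group)]
        diff = statistics.fmean(treated) - statistics.fmean(control)
        se = math.sqrt(2.0 / n_per_group)  # SE of the difference when SD = 1
        p_values.append(two_sided_p_from_z(diff / se))
    return p_values


p_values = simulate_p_values()
print("smallest P:", min(p_values))
print("largest P: ", max(p_values))
print("share below 0.05:", sum(p < 0.05 for p in p_values) / len(p_values))
```

The true effect is identical in every repeat, so any spread between the smallest and largest P-value comes from sampling alone.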

Third, with a large enough sample size we will always be able to reject the null hypothesis. No observed distribution is ever exactly consistent with the null hypothesis, and as sample size increases the likelihood of being able to reject the null increases. This means that trivial differences (for example, a difference in age of a few days) can lead to a P-value less than 0.05 in a large enough sample, despite the difference having no theoretical or practical importance.
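The dependence on sample size can be made concrete with a short calculation. This is a sketch under stated assumptions, not an analysis of real data: the “trivial” effect of 0.01 standard deviations is a hypothetical value, the standard deviation is treated as known, and the observed difference is assumed to equal the true difference exactly.

```python
import math


def expected_p_value(effect_size, n_per_group):
    """P-value from a two-group z-test if the observed standardised
    difference were exactly `effect_size` (SD assumed known, equal to 1)."""
    z = effect_size * math.sqrt(n_per_group / 2.0)
    return math.erfc(abs(z) / math.sqrt(2))


trivial_effect = 0.01  # hypothetical: one hundredth of a standard deviation
for n in (100, 10_000, 100_000, 1_000_000):
    print(f"n per group = {n:>9,}: P = {expected_p_value(trivial_effect, n):.3g}")
```

Under these assumptions the P-value drops below 0.05 somewhere between 10,000 and 100,000 participants per group, even though the effect itself never changes.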

The last point is particularly important, and relates to two other limitations. Namely, the P-value doesn’t tell us anything about how large an effect is (i.e., the effect size), or about how precise our estimate of the effect size is. Any measurement will include a degree of error, and it’s important to know how large this is likely to be.

There are a number of things that can be done to address these limitations. One is the routine reporting of effect sizes and confidence intervals. The confidence interval is essentially a measure of the precision of our estimate of the effect size, and can be calculated at different levels of confidence. A 95% confidence interval, for example, is a range constructed so that, if the study were repeated many times, about 95% of the intervals calculated in this way would contain the true effect size in the underlying population. Reporting the effect size and associated confidence interval therefore tells us both the likely magnitude of the effect and the degree of precision associated with that estimate. The reporting of effect sizes and confidence intervals is recommended by a number of scientific organisations, including the American Psychological Association and the International Committee of Medical Journal Editors.
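As a rough illustration of what reporting an effect size with its confidence interval involves, the sketch below computes a mean difference, a 95% interval and a standardised effect size (Cohen’s d) for two independent groups. The function name and the simulated scores are hypothetical, and the interval uses a normal approximation, which is only reasonable for fairly large samples (a t-based interval would be wider in small ones).

```python
import math
import random
import statistics


def mean_difference_with_ci(group_a, group_b, level=0.95):
    """Return (difference, lower, upper, cohens_d) for two independent groups.

    The interval is a large-sample normal approximation, not a t-interval.
    """
    mean_a, mean_b = statistics.fmean(group_a), statistics.fmean(group_b)
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    n_a, n_b = len(group_a), len(group_b)

    diff = mean_a - mean_b
    se = math.sqrt(var_a / n_a + var_b / n_b)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)

    # Cohen's d: the difference expressed in pooled standard deviation units.
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    cohens_d = diff / pooled_sd
    return diff, diff - z * se, diff + z * se, cohens_d


# Made-up scores for illustration only.
rng = random.Random(0)
treated = [rng.gauss(10.5, 2.0) for _ in range(200)]
control = [rng.gauss(10.0, 2.0) for _ in range(200)]

diff, lower, upper, d = mean_difference_with_ci(treated, control)
print(f"difference = {diff:.2f}, 95% CI [{lower:.2f}, {upper:.2f}], d = {d:.2f}")
```

A narrow interval signals a precise estimate and a wide one signals an imprecise estimate, whatever the P-value happens to be.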

How often does this happen in the best journals? Tressoldi and colleagues go on to assess the frequency with which effect sizes and confidence intervals are reported in some of the most prestigious journals, including Science, Nature, Lancet and New England Journal of Medicine. The results showed a clear split. Prestigious medical journals did reasonably well, with most selected articles reporting prospective power (Lancet 66%, New England Journal of Medicine 61%) and an effect size and associated confidence interval (Lancet 86%, New England Journal of Medicine 83%). However, non-medical journals did very poorly, with hardly any selected articles reporting prospective power (Science 0%, Nature 3%) or an effect size and associated confidence interval (Science 0%, Nature 3%). Conversely, these journals frequently (Science 42%, Nature 89%) reported P-values in the absence of any other information (such as prospective power, effect size or confidence intervals).

There are a number of reasons why we should be cautious when ranking journals according to metrics intended to reflect quality and convey a sense of prestige. One of these appears to be that many of the articles in the “best” journals neglect some simple reporting procedures for statistics. This may be for a number of reasons – editorial policy, common practices within a particular field, or article formats which encourage extreme brevity. Fortunately the situation appears to be improving – Nature recently introduced a methods reporting checklist for new submissions, which includes statistical power and sample size calculation. It’s not perfect (there’s no mention of effect size or confidence intervals, for example), but it’s a start…

Reference:

Tressoldi, P.E., Giofré, D., Sella, F. & Cumming, G. (2013). High impact = high statistical standards? Not necessarily so. PLoS One, e56180.

Posted by Marcus Munafo

An excellent paper published a few years ago, Sifting the Evidence, highlighted many of the problems inherent in significance testing, and the use of P-values. One particular problem highlighted was the use of arbitrary thresholds (typically P < 0.05) to divide results into “significant” and “non-significant”. More recently, there has been a lot of coverage of the problems of reproducibility in science, and in particular distinguishing true effects from false positives. Confusion about what P-values actually tell us may contribute to this.

It is often not made clear whether research is exploratory or confirmatory. This distinction is now commonly made in genetic epidemiology, where individual studies routinely report “discovery” and “replication” samples. That in itself is helpful – it’s all too common for post-hoc analyses (e.g., of sub-groups within a sample) to be described as having been based on a priori hypotheses. This is sometimes called HARKing (Hypothesising After the Results are Known), which can make it seem like results were expected (and therefore more likely to be true), when in fact they were unexpected (and therefore less likely to be true). In other words, a P-value alone is often not very informative in telling us whether an observed effect is likely to be true – we also need to take into account whether it conforms with our prior expectations.

One way we can do this is by taking into account the pre-study probability that the effect or association being investigated is real. This is difficult of course, because we can’t know this with certainty. However, what we perhaps can estimate is the extent to which a study is exploratory (the first to address a particular question, or use a newly-developed methodology) or confirmatory (the latest in a long series of studies addressing the same basic question). Broer et al. (2013) describe a simple way to take this into account and increase the likelihood that a reported finding is actually true. Their basic point is that the likelihood that a claimed finding is actually true (which they call the positive predictive value, or PPV) is related to three things: the prior probability (which reflects whether the study is exploratory or confirmatory), the statistical power (i.e., the probability of finding an effect if it really exists), and the Type I error rate (i.e., the significance threshold against which the P-value is compared). We have recently described the problems associated with low statistical power in neuroscience (Button et al., 2013).
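The relationship between these three quantities follows from Bayes’ rule. The form below is the standard textbook one rather than a quotation from Broer et al., and the symbols are notation assumed here: π is the prior probability, 1 − β the statistical power, and α the Type I error rate.

```latex
\mathrm{PPV} = \frac{(1-\beta)\,\pi}{(1-\beta)\,\pi + \alpha\,(1-\pi)}
```

With power held fixed, a smaller prior probability or a larger α both reduce the PPV.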

What Broer and colleagues show is that if we adjust the P-value threshold we use, depending on whether a study is exploratory or confirmatory, we can dramatically increase the likelihood that a claimed finding is true. For highly exploratory research, with a very low prior probability, they suggest a P-value threshold of 1 × 10^-7. Where the prior probability is uncertain or difficult to estimate, they suggest a threshold of 1 × 10^-5. Only for highly confirmatory research, where the prior probability is high, do they suggest that a “conventional” threshold of 0.05 is appropriate.
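To get a feel for how much the threshold matters, the sketch below computes the PPV for the three thresholds across three scenarios. The prior probabilities (1e-6, 1e-4 and 0.5) and the power of 0.8 are hypothetical values picked for illustration, not figures taken from Broer et al., and the PPV is calculated with the standard Bayes’ rule form.

```python
def positive_predictive_value(prior, power, alpha):
    """Probability that a 'significant' finding is true, by Bayes' rule."""
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)


# Hypothetical priors chosen only for illustration; power fixed at 0.8.
scenarios = {
    "highly exploratory": 1e-6,
    "uncertain": 1e-4,
    "highly confirmatory": 0.5,
}
thresholds = (0.05, 1e-5, 1e-7)

for label, prior in scenarios.items():
    row = ", ".join(
        f"alpha={alpha:g}: PPV={positive_predictive_value(prior, 0.8, alpha):.3g}"
        for alpha in thresholds
    )
    print(f"{label:>20} (prior={prior:g}) -> {row}")
```

Under these assumed numbers, the 0.05 threshold gives a PPV above 0.9 only in the confirmatory scenario. The uncertain scenario needs the 1 × 10^-5 threshold, and the exploratory scenario the 1 × 10^-7 threshold, before the PPV gets close to 0.9.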

Psychologists are notorious for having an unhealthy fixation on P-values, and particularly the 0.05 threshold. This is unhelpful for lots of reasons, and many journals now discourage or even ban the use of the word “significant”. The genetics literature that Broer and colleagues draw on has learned these lessons from bitter experience. However, if we are going to use thresholds, it makes sense that these reflect the exploratory or confirmatory nature of our research question. Fewer findings might pass these new thresholds, but those that do will be much more likely to be true.

References:

Broer L, Lill CM, Schuur M, Amin N, Roehr JT, Bertram L, Ioannidis JP, van Duijn CM. (2013). Distinguishing true from false positives in genomic studies: p values. Eur J Epidemiol; 28(2): 131-8.

Button KS, Ioannidis JP, Mokrysz C, Nosek BA, Flint J, Robinson ES, Munafò MR. (2013). Power failure: why small sample size undermines the reliability of neuroscience. Nat Rev Neurosci; 14(5): 365-76.

Posted by Marcus Munafo and thanks to Mark Stokes at Oxford University for the ‘Statistical power is truth power’ image.
