ASHIVNI SHEKHAWAT
Research scientist at Lyft Inc.

The paper Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations, Greenland, S. et al., Eur J Epidemiol (2016) 31:337–350 (pdf available here) discusses 25 misinterpretations prevalent in the statistical testing literature. I reproduce some of their observations here. I do not hold the copyright to the original paper and am reproducing the authors' work here for the sake of science. If a copyright holder objects, please let me know!

Some great quotes referenced in the paper

too often we weaken our capacity to interpret data and to take reasonable decisions whatever the value of P. And far too often we deduce ‘no difference’ from ‘no significant difference.’

Hill, 1965

it is doubtful whether the knowledge that (a P value) was really 0.03 (or 0.06), rather than 0.05… would in fact ever modify our judgment

Neyman and Pearson

Note: Definitions of P-values, confidence intervals, etc. can be found in this post.

Misinterpretations of P-values

1. The P value is the probability that the test hypothesis is true; for example, if a test of the null hypothesis gave P = 0.01, the null hypothesis has only a 1 % chance of being true; if instead it gave P = 0.40, the null hypothesis has a 40 % chance of being true.

No! The P value assumes the test hypothesis is true—it is not a hypothesis probability and may be far from any reasonable probability for the test hypothesis.
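To make this concrete, here is a minimal simulation sketch (my own, not from the paper; the 50 % base rate of true nulls, the sample size, and the effect size are illustrative assumptions). Among simulated studies that reach P ≤ 0.05, the fraction in which the null hypothesis is actually true depends on the base rate and the power of the studies, not on the P value alone.

```python
# Sketch: the P value is not the probability that the test hypothesis is true.
# Half of the simulated studies have a true null, half have a real effect;
# among studies reaching p <= 0.05, the share with a true null is far from 5 %.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_studies, n_per_group, effect = 20_000, 30, 0.5   # assumed, illustrative values

null_true = rng.random(n_studies) < 0.5            # 50 % of studies: no real effect
p_values = np.empty(n_studies)
for i in range(n_studies):
    mu = 0.0 if null_true[i] else effect
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(mu, 1.0, n_per_group)
    p_values[i] = stats.ttest_ind(a, b).pvalue

significant = p_values <= 0.05
print(f"Share of 'significant' studies where the null was actually true: "
      f"{null_true[significant].mean():.2f}")      # roughly 0.1 under these assumptions, not 0.05
```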

2. A significant test result (P < 0.05) means that the test hypothesis is false or should be rejected.

No! A small P value simply flags the data as being unusual if all the assumptions used to compute it (including the test hypothesis) were correct; it may be small because there was a large random error or because some assumption other than the test hypothesis was violated (for example, the assumption that this P value was not selected for presentation because it was below 0.05). Further, data that yield a small P value under the current null hypothesis will typically also yield small P values under a whole family of neighboring hypotheses, all of which would mark the data as unusual, so a small P value does not by itself single out the null as the hypothesis to reject.
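The selection problem mentioned above can be shown in a few lines. This is a small sketch of my own (the number of tests and sample sizes are assumptions): even when the null hypothesis is true in every analysis, reporting only the smallest of several P values makes "P < 0.05" appear far more often than 5 % of the time.

```python
# Sketch: selecting the smallest of several P values inflates "significance"
# even though every null hypothesis is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_experiments, n_tests_each, n = 5_000, 20, 30     # assumed, illustrative values

smallest_p = np.empty(n_experiments)
for i in range(n_experiments):
    # 20 independent one-sample t-tests, all with a true mean of zero
    data = rng.normal(0.0, 1.0, size=(n_tests_each, n))
    p = stats.ttest_1samp(data, popmean=0.0, axis=1).pvalue
    smallest_p[i] = p.min()

print(f"P(reported P value < 0.05 | all nulls true): {(smallest_p < 0.05).mean():.2f}")
# roughly 1 - 0.95**20, about 0.64, rather than 0.05
```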

3. A large P value is evidence in favor of the test hypothesis.

No! In fact, any P value less than 1 implies that the test hypothesis is not the hypothesis most compatible with the data, because any other hypothesis with a larger P value would be even more compatible with the data.

4. A null-hypothesis P value greater than 0.05 means that no effect was observed, or that absence of an effect was shown or demonstrated.

No! Observing P > 0.05 for the null hypothesis only means that the null is one among the many hypotheses that have P > 0.05. Thus, unless the point estimate (observed association) equals the null value exactly, it is a mistake to conclude from P > 0.05 that a study found ‘‘no association’’ or ‘‘no evidence’’ of an effect.
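Points 3 and 4 can be illustrated with one fixed data set. The sketch below is my own (the sample size and the assumed true mean of 0.3 are illustrative): the null hypothesis is only one of many hypothesized effect sizes with P > 0.05, and the hypothesis most compatible with the data is the point estimate itself, not the null.

```python
# Sketch: for one data set, compute the P value against a grid of hypothesized means.
# A whole interval of means, centered near the point estimate, has P > 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0.3, 1.0, 25)         # one small study; true mean assumed to be 0.3

grid = np.linspace(-0.5, 1.5, 201)   # candidate values of the true mean
p_at = np.array([stats.ttest_1samp(x, popmean=m).pvalue for m in grid])

print(f"point estimate:               {x.mean():.2f}")
print(f"P value at the null (0):      {stats.ttest_1samp(x, 0.0).pvalue:.2f}")
print(f"means with P > 0.05:          "
      f"[{grid[p_at > 0.05].min():.2f}, {grid[p_at > 0.05].max():.2f}]")
# Even if the P value at 0 exceeds 0.05, so do the P values for a whole range
# of non-zero means, so the data do not "show no effect".
```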

5. Statistical significance indicates a scientifically or substantively important relation has been detected.

No! Especially when a study is large, very minor effects or small assumption violations can lead to statistically significant tests of the null hypothesis.

6. Lack of statistical significance indicates that the effect size is small.

No! Especially when a study is small, even large effects may be ‘‘drowned in noise’’ and thus fail to be detected as statistically significant by a statistical test.
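Here is a minimal sketch for points 5 and 6 together (illustrative numbers of my own, not from the paper): a very large study can make a trivially small effect "statistically significant", while a small study can easily miss a large effect.

```python
# Sketch: statistical significance is about noise relative to sample size,
# not about the scientific importance of the effect.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Point 5: huge sample, tiny true effect (0.01 standard deviations)
big_a = rng.normal(0.00, 1.0, 500_000)
big_b = rng.normal(0.01, 1.0, 500_000)
print("tiny effect, n = 500k per group:  p =",
      f"{stats.ttest_ind(big_a, big_b).pvalue:.4f}")    # usually well below 0.05

# Point 6: tiny sample, large true effect (0.8 standard deviations)
small_a = rng.normal(0.0, 1.0, 8)
small_b = rng.normal(0.8, 1.0, 8)
print("large effect, n = 8 per group:    p =",
      f"{stats.ttest_ind(small_a, small_b).pvalue:.4f}")  # often above 0.05
```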

7. If you reject the test hypothesis because P ≤ 0.05, the chance you are in error (the chance your ‘‘significant finding’’ is a false positive) is 5 %.

No! To see why this description is false, suppose the test hypothesis is in fact true. Then, if you reject it, the chance you are in error is 100 %, not 5 %. The 5 % refers only to how often you would reject it, and therefore be in error, over very many uses of the test across different studies when the test hypothesis and all other assumptions used for the test are true.
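A short simulation sketch (my own construction, with assumed sample sizes) shows what the 5 % actually refers to: the long-run rejection rate over many studies in which the test hypothesis and all other assumptions hold, not the error probability of any single rejection, which is either 0 % or 100 %.

```python
# Sketch: with a true null in every study, about 5 % of studies reject,
# but every individual rejection is a false positive with certainty.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_studies, n = 10_000, 30

p = np.array([stats.ttest_1samp(rng.normal(0.0, 1.0, n), 0.0).pvalue
              for _ in range(n_studies)])
print(f"Long-run rejection rate with a true null: {(p <= 0.05).mean():.3f}")  # close to 0.05
```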

Misinterpretations of confidence intervals and power

1. The specific 95 % confidence interval presented by a study has a 95 % chance of containing the true effect size.

No! The frequency with which an observed interval (e.g., 0.72–2.88) contains the true effect is either 100 % if the true effect is within the interval or 0 % if not; the 95 % refers only to how often 95% confidence intervals computed from very many studies would contain the true size if all the assumptions used to compute the intervals were correct. It is possible to compute an interval that can be interpreted as having 95 % probability of containing the true value; nonetheless, such computations require not only the assumptions used to compute the confidence interval, but also further assumptions about the size of effects in the model. These further assumptions are summarized in what is called a prior distribution, and the resulting intervals are usually called Bayesian posterior (or credible) intervals to distinguish them from confidence intervals.
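The coverage claim is easy to check by simulation. The sketch below is my own (true mean, sample size, and number of studies are assumed values): across many simulated studies whose assumptions all hold, about 95 % of the computed intervals contain the true effect, yet any single interval either contains it or does not.

```python
# Sketch: 95 % refers to the long-run coverage of the interval procedure,
# not to the probability that one observed interval contains the truth.
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
true_mean, n, n_studies = 1.0, 40, 10_000          # assumed, illustrative values
t_crit = stats.t.ppf(0.975, df=n - 1)

covered = 0
for _ in range(n_studies):
    x = rng.normal(true_mean, 1.0, n)
    half_width = t_crit * x.std(ddof=1) / np.sqrt(n)
    covered += (x.mean() - half_width <= true_mean <= x.mean() + half_width)

print(f"Fraction of 95 % intervals containing the true mean: {covered / n_studies:.3f}")
# close to 0.95 over many studies; each single interval covers it with
# probability 0 or 1 once the data are in hand
```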

2. If you accept the null hypothesis because the null P value exceeds 0.05 and the power of your test is 90 %, the chance you are in error (the chance that your finding is a false negative) is 10 %.

No! If the null hypothesis is false and you accept it, the chance you are in error is 100 %, not 10 %. Conversely, if the null hypothesis is true and you accept it, the chance you are in error is 0 %. The 10 % refers only to how often you would be in error over very many uses of the test across different studies when the particular alternative used to compute power is correct and all other assumptions used for the test are correct in all the studies.
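A final sketch (my own, with the effect size and sample size chosen so that power is roughly 90 %): the 10 % is the long-run rate of "acceptances" across many studies in which the assumed alternative is actually true, not the error probability of any single acceptance, which is either 0 % or 100 %.

```python
# Sketch: when the alternative used to compute power is true in every study,
# about 10 % of studies "accept" the null, and each of those is an error.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
effect, n, n_studies = 0.6, 60, 10_000   # n chosen so power is roughly 90 %

accepted = 0
for _ in range(n_studies):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(effect, 1.0, n)
    accepted += stats.ttest_ind(a, b).pvalue > 0.05   # "accept" the null

print(f"Long-run false-negative rate when the alternative is true: "
      f"{accepted / n_studies:.3f}")     # roughly 0.10 under these assumptions
```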