In the modern era, ever-increasing amounts of data are becoming available to inform decisions. There is growing concern, however, that the focus on p-values and “statistical significance” that accompanies analyses of these data is misrepresenting reality and leading to poor decision-making. The problem is so widespread that some prominent authors have called for abandoning the term “statistical significance” altogether.[1]
Back to school – all about the tests
The overuse and misuse of null hypothesis significance testing (NHST) has alarmed statisticians for decades.[2, 3] Most often, an “interesting” or “successful” result is equated with an NHST yielding a p-value below 0.05, which is then interpreted as showing that the effect is real and not due to chance. Confidence intervals (typically at the 95% level) are often interpreted the same way whenever they exclude a value taken to define parity, such as a difference of zero.
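To make the mechanics concrete, here is a minimal sketch in Python (using numpy and scipy, with invented blood-pressure-like numbers rather than data from any cited study) of the workflow described above: an NHST producing a p-value, plus a 95% confidence interval checked against a parity value of zero.

```python
# Minimal, hypothetical sketch of the NHST workflow described above.
# Data are simulated; all numbers are illustrative only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=120.0, scale=10.0, size=50)   # e.g. systolic BP, mmHg
treated = rng.normal(loc=118.0, scale=10.0, size=50)

# Two-sample t-test: the "p < 0.05" shorthand discussed above
t_stat, p_value = stats.ttest_ind(treated, control)

# Approximate 95% confidence interval for the difference in means
diff = treated.mean() - control.mean()
se = np.sqrt(treated.var(ddof=1) / treated.size + control.var(ddof=1) / control.size)
ci_low, ci_high = diff - 1.96 * se, diff + 1.96 * se

print(f"p = {p_value:.3f}; difference = {diff:.2f} mmHg; 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
# Common (over)interpretation: p < 0.05, or a CI excluding 0, is read as "the effect is real".
```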
What’s the problem?
A fundamental problem with this approach is the reduction of complex systems and processes to a binary yes/no outcome. “Statistically significant” results are treated as the only results that matter, leaving out the question of the size of the effect; that is, the clinical or real-world significance of the result is ignored. For example, does a “statistically significant” reduction in blood pressure of 0.001 mmHg make any difference to your health? Conversely, potentially important results are dismissed or described in shorthand as “not different” even when clear differences between point estimates (means or medians) are visible. Because p-values are derived from data, they can (and are expected to) differ among studies.[4] This variation does not make analyses weak, does not by itself demonstrate that results are irreproducible, and does not make the p-value unfit for purpose. It does, however, argue against oversimplification and against applying thresholds to categorize outcomes as “worthy” or “unworthy”, as illustrated in the sketch below.
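The two points above can be illustrated with a short, hedged simulation (again Python with numpy/scipy; the effect size, standard deviation, and sample sizes are assumptions chosen only to make the point): any fixed nonzero difference, however trivial, eventually crosses p < 0.05 as the sample grows, and p-values from identical, modestly sized studies scatter widely across replications.

```python
# Illustrative simulation only; all numbers are assumptions, not results from the cited studies.
import numpy as np
from scipy import stats

# (1) With enough data, even a 0.001 mmHg difference reaches p < 0.05.
#     Two-sample z-approximation, assuming a between-subject SD of 10 mmHg.
effect, sd = 0.001, 10.0
n_per_group = 2 * (1.96 * sd / effect) ** 2   # n at which the difference just becomes "significant"
print(f"per-group n needed for p < 0.05: about {n_per_group:.1e}")

# (2) The same modest study repeated ten times: the p-value is itself a statistic and varies.
rng = np.random.default_rng(1)
pvals = [stats.ttest_ind(rng.normal(115, 10, 40), rng.normal(120, 10, 40)).pvalue
         for _ in range(10)]
print("p-values across replications:", np.round(pvals, 3))
```

Under these assumptions the required sample is enormous, which is exactly the point: crossing the threshold says nothing about whether the effect matters, and the spread of replicated p-values shows why a single result below 0.05 should not be over-read.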
A challenging landscape
Dealing with the issue means navigating a challenging landscape. Aside from funding, time is a precious resource in short supply among individuals who must assess masses of data to make decisions. A simple numerical threshold such as “p < 0.05”, or the words “statistically significant difference”, has become shorthand for areas to target. Most problematic for promoting correct use of NHST and p-values is that including an NHST may determine whether a paper is accepted for publication, a product receives regulatory approval, or an agency awards grant funding, because guidelines or reviewers often demand these calculations, summaries, and interpretations even when they are inappropriate.
What’s a decision-maker or product developer to do?
There is a growing chorus of scientists envisioning a world where “statistical significance” is replaced by consideration of results in their fuller context to help inform decisions.[3, 5] The American Statistician recently dedicated an entire issue of 43 articles to the subject, covering suggested approaches for disciplines from the basic to the social to the clinical sciences (Vol 73, 2019, Issue sup1: “Statistical Inference in the 21st Century: A World Beyond p < 0.05”). Change will take time; in the meantime, authors and developers should keep the concept in mind: resist the temptation to frame results around the outcome of tests, and move away from reliance on the connotations of importance attached to “statistical significance”, while of course adhering to submission requirements. Decision-makers, too, should be prepared to invest a bit more time than a scan of p-values when judging which outcomes are most relevant.
- 1. Amrhein V, Greenland S, McShane B: Scientists rise up against statistical significance. Nature 2019, 567(7748):305-307.
- 2. Smith RJ: The continuing misuse of null hypothesis significance testing in biological anthropology. Am J Phys Anthropol 2018, 166(1):236-245.
- 3. Wasserstein RL, Schirm AL, Lazar NA: Moving to a World Beyond “p < 0.05”. The American Statistician 2019, 73(sup1):1-19.
- 4. Van Calster B, Steyerberg EW, Collins GS, Smits T: Consequences of relying on statistical significance: Some illustrations. Eur J Clin Invest 2018, 48(5):e12912.
- 5. Mellis C: Lies, damned lies an