The Need for Some Agreement Before Debating Proposals to Address the “Replicability Crisis”

“For a debate to proceed, both teams need a clear understanding of what the motion means. This requires the motion to be ‘defined’ so that everyone (audience and adjudicators included) knows what is being debated. Problems arise if the two teams present different understandings of the meaning of the motion. This can result in a ‘definition debate’, where the focus of the debate becomes the meaning of the words in the motion, rather than the motion itself. Interaction and clash between the two teams concentrates on whose definition is correct, rather than the issues raised by the motion. Definition debates should be avoided wherever possible. They make a mockery of what debating seeks to achieve.” (Stockley, 2002)

One debate occurring across many scientific disciplines, including my own (social psychology), focuses on what should be done, if anything, to deal with the “replicability crisis” (i.e., the apparent inability of some study findings to be directly replicated within and across labs). As suggested by Stockley in the quote above, for a proper debate on the “replicability crisis” (i.e., the motion) to ensue, participants need to agree on some basic facts surrounding the issue being discussed. Only then can the strength of arguments for or against various propositions for resolving the issue be evaluated.

In the current “replicability crisis”, what are some facts we can all agree upon when debating resolutions? From my own reading over the past few years, here are a few points that seem fairly straightforward for our field to agree on, and that have the potential to be problematic for the reliability of published research findings:

1) In a series of simulations, Colquhoun (2014) demonstrated that “…if you use p = .05 as a criterion for claiming that you have discovered an effect you will make a fool of yourself at least 30% of the time.” (p. 11). Stated differently, under realistic assumptions about statistical power and the proportion of tested effects that are real, roughly 30% or more of statistically significant findings (p ≤ .05) will be false positives, that is, claims of an effect when no effect truly exists. This false discovery rate is much higher than the 5% that many researchers assume the p ≤ .05 criterion guarantees. I direct readers to Colquhoun’s paper, published in an open access journal (full citation below), to verify these claims. Ioannidis (2005) made similar arguments.

2) The overwhelming majority of published research papers report statistically significant results: well over 90% of presented findings are statistically significant, whereas very few published papers report non-significant findings (Fanelli, 2010; Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995).

3) Considering the two previous points in concert, it is undeniable that a non-trivial proportion of published research findings are false positives (a simple simulation illustrating this is sketched after this list). And because very few non-significant findings are published in the first place, the rate of false negatives (claims of no effect when an effect truly exists) in the published literature is comparatively low.
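To make the arithmetic behind these three points concrete, here is a minimal simulation sketch in Python (not code from any of the cited papers). The parameter values are illustrative assumptions in the spirit of Colquhoun’s examples: 10% of tested effects are real, and real effects are large enough (roughly d = 0.57 with 50 participants per group) that tests have about 80% power at the conventional .05 threshold. Under those assumptions, around a third of the significant results are false positives, even though each individual test holds its Type I error rate at 5%.

```python
# A minimal sketch (illustrative assumptions, not the author's code):
#   - 10% of tested effects are real (prior probability 0.10)
#   - real effects have a standardized size of ~0.57 with n = 50 per group,
#     giving roughly 80% power for a two-sided t-test at alpha = .05
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_per_group, alpha = 20_000, 50, 0.05
prior_true, effect_size = 0.10, 0.57

effect_is_real = rng.random(n_studies) < prior_true
significant = np.zeros(n_studies, dtype=bool)

for i in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_group)
    mean_shift = effect_size if effect_is_real[i] else 0.0
    treatment = rng.normal(mean_shift, 1.0, n_per_group)
    significant[i] = stats.ttest_ind(treatment, control).pvalue <= alpha

false_positives = significant & ~effect_is_real
true_positives = significant & effect_is_real

# Point 1: among "discoveries" (significant results), what fraction are false?
fdr = false_positives.sum() / significant.sum()
print(f"Significant results: {significant.sum()} of {n_studies}")
print(f"False discoveries among significant results: {fdr:.1%}")
print(f"Power achieved for real effects: "
      f"{true_positives.sum() / effect_is_real.sum():.1%}")
```

If journals publish nearly all of these significant results and few of the non-significant ones (point 2), the published literature inherits roughly that same false-positive proportion, which is the conclusion drawn in point 3.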

The first point is based on large-scale simulations using the types of statistical tests our field typically employs, and the second is based on observations of actual published research in our journals. The third point is the logical conclusion obtained by pairing the first two. There are other important issues related to the “replicability crisis”, such as the use of questionable research practices to obtain p values at or below the accepted threshold of .05, but it is difficult to ascertain the prevalence of those practices, so I have not included them on this list.

If we agree that a non-trivial proportion of published research findings are presently false positives, for the reasons discussed above, we can then debate the merits of different proposals to address this issue. But if participants in this debate disagree about the proportion of published findings that are false positives, then it becomes very difficult to evaluate the strength of the arguments put forward, because those arguments are no longer addressing the issue at hand but rather the definition of the motion. And if we simply keep debating the definition of the motion (i.e., the prevalence of false positives in the literature), it is difficult to envision any proposal on how to address the issue receiving a critical mass of support within the field. It may also be the case that, as a field, we would be making “…a mockery of what debating seeks to achieve”.

References

Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1, 140216. http://dx.doi.org/10.1098/rsos.140216

Fanelli, D. (2010). “Positive” results increase down the hierarchy of the sciences. PLoS ONE, 5(4), e10068. doi:10.1371/journal.pone.0010068

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—Or vice versa. Journal of the American Statistical Association, 54, 30–34.

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112.

Stockley, A. (2002). Defining motions & constructing cases: Guidelines for competitors and adjudicators. Retrieved November 26, 2014, from http://www.schoolsdebate.com/docs/definitions.asp
