Greater Transparency Can Help Reduce Type I and Type II Errors in Research

First, a brief review of how researchers typically decide whether their research findings do or do not support their hypotheses:

Consider the simple example where a researcher has created a new way to teach math to 3rd graders and wants to determine whether the new method is more effective than the standard approach to teaching math in this grade. He assigns students to be taught math either the standard way or using his new method. After a pre-determined period of time all students take the same math competency test, and the researcher conducts statistical tests to compare math scores between the two groups of students. The null hypothesis is that there is no difference in the effectiveness of the teaching methods (test scores are equal across the two groups), whereas the experimental (or alternative) hypothesis is that the new teaching method is more effective than the standard method (test scores will be higher for students taught using the new method compared to the standard method). The researcher looks at the test scores in the two groups and applies a statistical test that provides the probability of obtaining results at least as extreme as those observed if the null hypothesis were true. The researcher then makes a decision regarding the effectiveness of his new method relative to the standard teaching method, taking into consideration information about the methods and sample as well as the results of the statistical test(s) used. The researcher subsequently attempts to communicate this decision to the broader academic community via a manuscript submitted for publication, contingent on the evaluation of the research by a few peers and a journal editor.

In this type of research process, at least two types of errors can be made regarding the decision the researcher makes after considering all of the evidence. These errors are known as type I and type II errors:

Type I error: deciding to reject the null hypothesis when in fact it is correct (deciding the new teaching method is better than the standard method, when in fact it is not better).

Type II error: failing to reject a false null hypothesis, or deciding there is no effect when in fact an effect exists (deciding the new teaching method is not better than the standard method, when in fact it is better).
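
In standard notation, these two error rates, along with the statistical power discussed below, can be written as:

```latex
\begin{align*}
  \alpha &= P(\text{reject } H_0 \mid H_0 \text{ is true})
      && \text{(probability of a type I error)} \\
  \beta  &= P(\text{fail to reject } H_0 \mid H_0 \text{ is false})
      && \text{(probability of a type II error)} \\
  \text{power} &= 1 - \beta
\end{align*}
```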

Given that a goal of science is to accumulate accurate explanations of how our world actually works, both types of errors are problematic. Finding ways to reduce these errors is therefore important for scientific discovery.

A lot of attention over the past few years has focused on reducing type I errors (e.g., Simmons, Nelson & Simonsohn, 2011, and many, many others) using both methodological (e.g., pre-registering study hypotheses, increasing sample sizes) and statistical (e.g., minimizing “p-hacking” during data analysis) approaches. Less attention has focused on how to reduce type II errors specifically. With respect to statistical tests, when the probability of correctly rejecting a false null hypothesis is low (i.e., low statistical power), the probability of making a type II error increases (at least if researchers rely only on the results of the statistical test to make research decisions). Increasing statistical power therefore reduces the probability that statistical tests fail to reach the chosen threshold of “statistical significance” when an effect truly exists (i.e., type II errors). Three factors have a big influence on the statistical power of a test:

  • size of effect—smaller effects can be more challenging to detect compared to larger effects
  • size of sample—smaller samples, all else being equal, provide lower power compared to larger samples
  • alpha level—in psychology the norm is to use an alpha level of .05; all else being equal, lower alphas decrease the probability of making a type I error but increase the probability of making a type II error compared to higher alphas

Ideally, therefore, researchers should recruit large samples of participants to increase power and thereby help decrease type II errors in statistical tests, particularly given that the true effect sizes of interest are often unknown in advance. For example, if the researcher in the teaching method example above had 20 students in each teaching condition, the size of the effect would need to be d > .90 (rather large) in order to have 80% power to detect a difference between the two groups (see Simmons, Nelson & Simonsohn, 2013). And, again, researchers should remain mindful of the effect that lowering the alpha level of their tests has on the likelihood of a type II error.
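
To make the numbers in the previous paragraph concrete, here is a minimal power-calculation sketch for a two-sided independent-samples t-test (the function name and structure are my own, written for illustration); with 20 students per condition, an effect of roughly d = .90 is needed to reach 80% power:

```python
# Approximate power of a two-sided independent-samples t-test via the noncentral t
# distribution. Illustrative sketch; function name and defaults are my own choices.
import numpy as np
from scipy import stats

def power_two_sample_t(d, n_per_group, alpha=0.05):
    """Power to detect a standardized mean difference d with n_per_group per condition."""
    df = 2 * n_per_group - 2
    ncp = d * np.sqrt(n_per_group / 2)       # noncentrality parameter
    t_crit = stats.t.ppf(1 - alpha / 2, df)  # two-sided critical value
    # Probability of landing beyond either critical value when the true effect is d
    return (1 - stats.nct.cdf(t_crit, df, ncp)) + stats.nct.cdf(-t_crit, df, ncp)

for d in (0.2, 0.5, 0.8, 0.9):
    print(f"d = {d:.1f}, n = 20 per group -> power = {power_two_sample_t(d, 20):.2f}")
# d = 0.9 gives power of roughly .80, consistent with the figure cited above.
```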

It is important to remember, however, that results of statistical tests do not dictate the decisions researchers make regarding the presence or absence of effects (see Gigerenzer & Marewski, 2015). Whatever the results of the statistical analyses used to test hypotheses, researchers need to weigh all relevant evidence, statistical as well as methodological, to reach a verdict on the perceived strength of the evidence to reject or not reject the null hypothesis and/or plan additional tests of the hypothesis. In the teaching method example, if students in the new teaching condition happened to come from schools specializing in math and science whereas students in the standard teaching condition happened to come from schools specializing in the arts (i.e., non-random assignment to condition), a significant difference in test scores in the predicted direction would not be taken as strong evidence for rejecting the null hypothesis; the lack of random assignment in this case greatly increases the risk of type I error. Similarly, a non-significant difference in test scores may not speak to the ineffectiveness of the new teaching method if the researcher was only able to recruit, for example, 20 students per teaching condition and the size of the effect turned out to be small (thus increasing the risk of type II error). In these hypothetical research scenarios, it is very easy to see how methodological limitations (when known) can influence the deliberations regarding rejecting the null hypothesis and how results of statistical tests should not alone dictate this decision making process.

The research process is, of course, not always as simple as presented in these examples. Developing hypotheses takes time and effort, as does developing ways to test hypotheses. Running studies and collecting data, as well as getting data ready for analyses (e.g., “cleaning” the data set) and conducting the analyses, also take time and effort. Importantly, researchers make many decisions during this entire process. The researcher, of course, is privy to all of these decisions, given that she or he is the one making them along the way. Editors and reviewers of academic journals, as well as consumers of published research, are privy only to what the researcher chooses to share of the research process. Typically, such sharing occurs after all of these decisions have been made, traditionally via a manuscript submitted for peer review and ultimately publication in academic journals. In journal articles researchers tend to share the outcomes of the research process (e.g., statistically significant results in support of hypotheses) more so than the details of the research process (e.g., hypotheses developed prior to data collection and/or analyses, all study procedures and materials, pilot testing of experimental procedures, which analyses were confirmatory or exploratory). There is presently, therefore, not a high degree of transparency in the research process.

Given that decisions regarding the ability of a study to reject or fail to reject the null hypothesis can only be based on the information available for evaluation, having fewer details of the research process available can add uncertainty and error to this process. For example, if hypotheses were partly based on exploratory analyses of a data set, and this information was not made publicly available, reviewers, and subsequently consumers, of the research may conclude that the results provide stronger confirmatory evidence for the hypotheses than they really do (risk of type I error). Also, consider the example of a researcher who fails to find statistical support for an innovative intervention targeted toward alleviating depressive symptoms, but does not share information regarding, for example, unequal pre-treatment depression scores across study conditions (i.e., initial depression scores happened to be lower in the standard care condition compared to the new treatment or control conditions). Reviewers, and subsequently consumers, of the research may conclude that the results provide stronger evidence of the ineffectiveness of the new intervention than is warranted (risk of type II error).

Properly evaluating scientific claims benefits from having access to more information about the entire research process (see Campbell, Loving & LeBel, 2014), information that is available to researchers when deciding on the strength of evidence to reject or fail to reject their own null hypotheses. This very same information, however, is not typically available to other researchers (and consumers of research) when making their own decisions regarding the theoretical, statistical, and practical significance of the findings reported by the researcher. Making the research process itself more transparent thus represents one important way to reduce the rates of both false positives and false negatives in science. Or as we say in our recent article: “Transparency in the research process, therefore, is an essential component of the scientific method because it is the only window through which we have access to this process” (Campbell et al., 2014).

References

Campbell, L., Loving, T.J., & LeBel, E.P. (2014). Enhancing transparency of the research process to increase accuracy of findings: A guide for relationship researchers. Personal Relationships, 21, 531-545.

Gigerenzer, G., & Marewski, J.N. (2015). Surrogate science: The idol of a universal method for scientific inference. Journal of Management, 41, 421-440.

Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2013). Life after P-Hacking. Meeting of the Society for Personality and Social Psychology, New Orleans, LA, 17-19 January 2013. Available at SSRN: http://ssrn.com/abstract=2205186 or http://dx.doi.org/10.2139/ssrn.2205186.

My Transition to Greater Transparency in the Research Process

The December 2014 issue of Personal Relationships contains a paper by me and two co-authors (Timothy Loving and Etienne LeBel) in which we discuss how relationship scientists can transition to greater transparency in the research process. 2014 is also the year that my lab began the transition to greater transparency with our own research projects (e.g., making study materials/procedures/hypotheses available on the Open Science Framework: https://osf.io/sa9im/). Making this transition involves many challenges, and in my lab we have discussed regularly how to overcome these challenges, keeping in mind one of John Lennon’s song lyrics: “Well I tell them there’s no problem, only solutions”.

We did not arrive at this point overnight. When I was in graduate school the culture of psychological science was to share relevant aspects of the research process only when submitting manuscripts for peer review. Of course, this meant that research conducted but not included in these manuscripts was never publicly shared, nor was research presented in rejected manuscripts shared. I began graduate school in 1996, when the internet was just beginning to take off (i.e., the beginning of the “dot-com” era). New research articles, however, were still introduced primarily in print. At that time (not all that long ago, but long enough), most journals still required multiple copies of manuscripts to be submitted via regular mail, and decision letters were received by regular mail. The results of research were therefore shared primarily via print publications, and with limited page space in journals it was not prudent to devote a lot of space to sharing all details of the research process.

A lot has changed since my graduate school days, including:

  1. The internet. Academic journals have historically been limited by the number of pages available per volume. The internet smashes through this barrier, making print page space largely irrelevant. Many print journals now make supplementary material available online, and many other journals have solely an online presence. It actually seems odd now that journals should have page limitations at all given the existence of a technology that renders them inconsequential. And websites such as the Open Science Framework (https://osf.io/) are available for researchers to post a great deal of material about their research projects at any time during the life of a given project (i.e., before, during, and after running a study).
  2. The publication of articles such as Ioannidis (2005), Simmons, Nelson, and Simonsohn (2011), and the entire 2012 special issue of Perspectives on Psychological Science highlighted the importance of enhancing the transparency of the research process. Reading these articles reminded me of Lykken’s (1991) excellent piece asking “What’s wrong with psychology anyway?”, as well as articles by Meehl (any year) focusing on doing “good” science, and Kerr (1998) on hypothesizing after the results are known (or HARKing). It also introduced me to the writings/lectures of theoretical physicist Richard Feynman, who did not pussyfoot around in his discussions of how to do science: total scientific honesty, while doing your best to prove yourself wrong.

After reading these articles, as well as many others on the topic, I found myself in agreement with the argument that greater transparency in the research process is a good thing for the progress of science (i.e., for accumulating an accurate knowledge base of how the world works), and that we now have the technology to make this happen. When I came to this realization I did not yet have an account on the OSF, had not yet posted any study details for any of my original research projects, and was not exactly certain what these posts should include going forward. After thinking it through for a few months and discussing the issues with Tim and Etienne (culminating in our paper on the topic), I am now committed to following our own suggestions. It takes time to adjust, but my lab is working every day to make these adjustments. We (Campbell et al., 2014) explicitly state in our paper that our suggestions are just that—suggestions. My only strong recommendation to researchers is to do what you think is best for advancing scientific discovery.

If you feel that scientific discovery benefits from:

  • Researchers sharing their carefully crafted hypotheses prior to data analyses, then do it.
  • Researchers sharing all study procedures and methods, then do it.
  • Researchers sharing all study materials, then do it.
  • Researchers sharing their data analytic plans prior to data analysis, then do it.
  • Researchers sharing the differences between the planned confirmatory analyses and subsequent exploratory analyses, then do it.

I accept that not everyone will feel that scientific discovery benefits from greater transparency in the research process, and I encourage researchers who feel this way to share and argue their doubts. In the not too distant future it may be possible to empirically test the robustness of published findings emanating from research projects where materials/procedures/hypotheses were made publicly available or not, obviating the need for what are, so far, philosophical discussions on the topic.

But if you do feel that greater transparency will benefit scientific discovery, keep in mind these words of wisdom from Aristotle: “We do not act rightly because we have virtue or excellence, but we rather have those because we have acted rightly. We are what we repeatedly do. Excellence, then, is not an act but a habit.” In other words, saying that transparency is a good thing is nice, but in the end our actions speak louder than our words.

Researchers have to make decisions on a regular basis, and one additional decision to make now is considering the transition to transparency. I respect the choices of my fellow researchers as they consider the merits of adopting more transparent research practices. Personally, for the reasons discussed, I have decided to make the transition to greater transparency in the research process in my own lab. I can report that it has been a wonderful experience so far.

 

References

Campbell, L., Loving, T.J., & LeBel, E.P. (2014). Enhancing transparency of the research process to increase accuracy of findings: A guide for relationship researchers. Personal Relationships, 21, 531-545.

Ioannidis, J.P.A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.

Kerr, N.L. (1998). HARKing: Hypothesizing after results are known. Personality and Social Psychology Review, 2, 196-217.

Lykken, D.T. (1991). What’s wrong with psychology anyway? In D. Cicchetti & W.M. Grove (Eds.), Thinking clearly about psychology, Volume 1: Matters of public interest (pp. 3-39). Minneapolis: University of Minnesota Press.

Meehl, P.E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-115.

Meehl, P.E. (1990). Why summaries of research on psychological theories are often uninterpretable. Psychological Reports, 66, 195-244.

Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

Special Issue (2012). Perspectives on Psychological Science, 7.

The Need for Some Agreement Before Debating Proposals to Address the “Replicability Crisis”

“For a debate to proceed, both teams need a clear understanding of what the motion means. This requires the motion to be ‘defined’ so that everyone (audience and adjudicators included) knows what is being debated. Problems arise if the two teams present different understandings of the meaning of the motion. This can result in a ‘definition debate’, where the focus of the debate becomes the meaning of the words in the motion, rather than the motion itself. Interaction and clash between the two teams concentrates on whose definition is correct, rather than the issues raised by the motion. Definition debates should be avoided wherever possible. They make a mockery of what debating seeks to achieve.” (Stockley, 2002)

One debate occurring across many scientific disciplines, including my own (social psychology), focuses on what should be done, if anything, to deal with the “replicability crisis” (i.e., the apparent inability of some study findings to be directly replicated within and across labs). As suggested by Stockley in the quote above, for a proper debate on the “replicability crisis” (i.e., the motion) to ensue, participants need to agree on some basic facts surrounding the issue being discussed. Only then can the strength of arguments for or against various propositions for resolving the issue be evaluated.

In the current “replicability crisis”, what are some facts we can all agree upon when debating resolutions? From my own reading over the past few years, here are a few points that seem fairly straightforward for our field to agree on and that have the potential to be problematic for the reliability of published research findings:

1) In a series of simulations, Colquhoun (2014) demonstrated that “…if you use p = .05 as a criterion for claiming that you have discovered an effect you will make a fool of yourself at least 30% of the time.” (p. 11). Stated differently, under plausible assumptions about the proportion of tested hypotheses that are true and the power of the tests used, roughly 30% or more of the findings declared statistically significant (p ≤ .05) will be false positives, an error rate much higher than the 5% many researchers assume (a simulation sketch illustrating this arithmetic appears after this list). I direct readers to Colquhoun’s paper, published in an open access journal (citation information below), to verify these claims. Ioannidis (2005) made similar arguments.

2) The overwhelming majority of published research papers contain statistically significant results (well over 90% of presented findings are statistically significant), whereas very few papers report non-significant findings (Fanelli, 2010; Sterling, 1959; Sterling, Rosenbaum, & Weinkam, 1995).

3) Considering the two previous points in concert, it is undeniable that a non-trivial proportion of published research findings are false positives. Given that very few non-significant findings are published, the rate of published false negatives (claims of no effect when an effect truly exists) is very low in comparison.
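
To see where a figure like 30% can come from for the first point above, here is a minimal simulation sketch of the false discovery rate argument (my own illustration, not Colquhoun’s code), assuming that 10% of tested hypotheses reflect real effects, tests have 80% power, and alpha is .05 (parameters in the spirit of Colquhoun’s illustrative scenario):

```python
# Simulate many hypothesis tests and compute the proportion of "significant" results
# that are false positives. Assumed parameters (prevalence, power, alpha) are
# illustrative, chosen to mirror my reading of Colquhoun (2014).
import numpy as np

rng = np.random.default_rng(1)
n_tests, prevalence, power, alpha = 100_000, 0.10, 0.80, 0.05

effect_is_real = rng.random(n_tests) < prevalence
# A test comes out "significant" with probability `power` when the effect is real,
# and with probability `alpha` when it is not.
significant = np.where(effect_is_real,
                       rng.random(n_tests) < power,
                       rng.random(n_tests) < alpha)

false_discoveries = significant & ~effect_is_real
print(f"False discovery rate among significant results: "
      f"{false_discoveries.sum() / significant.sum():.0%}")  # roughly 36%
```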

The first point is based on large-scale simulations using the types of statistical tests our field typically employs, and the second point is based on observations of actual published research in our journals. The third point represents a logical conclusion obtained by pairing the first two points. There are other important issues related to the “replicability crisis”, such as the use of questionable research practices to obtain p values at or below the accepted threshold of .05, but it is difficult to ascertain the prevalence of these practices and I will not include them on this list.

If we agree that presently a non-trivial proportion of published research findings are false positives for the reasons discussed above, we can then debate the merits of different proposals to address this issue. But if there is disagreement among participants in this debate about the relative amount of published research that contains false positives, then it becomes very difficult to evaluate the strength of the different arguments put forward, because the arguments are not addressing the issue at hand but rather the definition of the motion. And if we simply keep debating the definition of the motion (i.e., the prevalence of false positives in the literature), then it is difficult to envision any proposal on how to address this issue receiving a critical mass of support within the field. It may also be the case that as a field we would be making “…a mockery of what debating seeks to achieve”.

References

Colquhoun, D. (2014). An investigation of the false discovery rate and the misinterpretation of p-values. Royal Society Open Science, 1, 140216. http://dx.doi.org/10.1098/rsos.140216.

Fanelli, D. (2010). “Positive” results increase down the hierarchy of the sciences. PLoS ONE, 5(4), e10068. doi:10.1371/journal.pone.0010068.

Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2, e124.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance—Or vice versa. Journal of the American Statistical Association, 54, 30–34.

Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112.

Stockley, A. (2002). Defining motions & constructing cases: Guidelines for competitors and adjudicators. Retrieved from http://www.schoolsdebate.com/docs/definitions.asp on November 26, 2014.

The Current Status of Pre-registering Study Details in Social Psychology

The past few years have witnessed much debate regarding research practices that can potentially undermine the accuracy of reported research findings (e.g., p-hacking, lack of direct replication, low statistical power; Ioannidis, Munafo, Fusar-Poli, Nosek, & David, 2014; O’Boyle, Banks & Gonzalez-Mule, 2014; Simmons, Nelson, & Simonsohn, 2011), and some leading journals that publish research in the field of social psychology have made editorial changes to address these issues (e.g., Eich, 2014; Funder et al., 2014; Journal of Experimental Social Psychology, 2014). Pre-registration of study hypotheses and methods has been suggested as one way to enhance the accuracy of reported research findings by making the research process more transparent (e.g., Campbell, Loving, & LeBel, 2014; Chambers, 2014; De Groot, 1956/2014; Krumholz & Peterson, 2014; Miguel et al., 2014; The PLOS Medicine Editors, 2014). Many journals now have a registered reports section, where editors and reviewers focus on the strength of pre-registered methods and data analytic plans to test proposed hypotheses, and accept articles for publication in advance of data collection (e.g., Perspectives on Psychological Science). A new journal, Comprehensive Results in Social Psychology (CRSP), supported by the European Association of Social Psychology as well as the Society of Australasian Social Psychologists, is the first social psychology journal to publish only pre-registered papers. Are researchers in the field of social psychology, however, presently following these suggestions by adopting the practice of pre-registering details of their studies?

There are different ways to answer this question, and the approach I adopted here was to cross-reference the current membership of the Society of Experimental Social Psychology (SESP; accessed October 1, 2014) with all current users of the Open Science Framework (OSF; accessed October 1-2, 2014). The membership of SESP was selected to represent the field of social psychology for the following reasons: (1) membership is open to any researcher regardless of disciplinary affiliation, (2) individuals are eligible to be considered for membership only after holding a PhD for five years and following a committee evaluation of the degree to which their publication record advances the field of social psychology, and (3) there are presently over 1000 members at institutions all over the world. Members of SESP therefore represent a cross-section of recognized social psychological researchers. I selected the user list of the OSF because, since its launch in 2011, it has positioned itself as the most recognized third-party website for posting study details in the social sciences.

To conduct the cross-referencing I first recorded all of the names listed in the membership directory of SESP (http://sesp.org/memlist.htm) in a spreadsheet. I then typed each name into the search window of the OSF website (https://osf.io) to identify current user status. If an individual was listed as a user, I navigated to his/her user page to determine (a) the number of projects the user had posted to the OSF website, and (b) how many of these projects were currently public (i.e., fully accessible by any visitor to the site). User status, number of projects, and number of public projects were entered into the spreadsheet. It is important to note that posted projects refer to studies already conducted or currently being conducted, given that project details remain on the site over time.

Descriptive analyses revealed that of the 1002 current members of SESP, 98 (or 9.8%) had created accounts on the OSF website. The two highest frequencies were for posting zero projects (i.e., having an account only; 26.5%) and for posting one project (35.7%); the frequencies for posting more than one project then decrease very rapidly. Of all posted projects, 44% were public, meaning that the details of a slight majority of projects (56%) were not publicly available. This is perhaps understandable given that researchers may prefer to wait to share pre-registered study details until a manuscript containing data from a given study has been accepted for publication.
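
For readers who want to produce this kind of tally from their own cross-referencing, here is a hypothetical sketch of the summary calculations; the file name and column names are my own assumptions, not the actual spreadsheet described above:

```python
# Hypothetical tabulation of a SESP-by-OSF cross-referencing spreadsheet.
# Assumed columns: has_osf_account (bool), n_projects (int), n_public_projects (int).
import pandas as pd

members = pd.read_csv("sesp_osf_crossref.csv")    # hypothetical file name

n_members = len(members)                          # 1002 in the analysis above
osf_users = members[members["has_osf_account"]]   # 98 in the analysis above
print(f"SESP members with OSF accounts: {len(osf_users) / n_members:.1%}")  # ~9.8%

# Distribution of posted projects among account holders
# (e.g., 0 projects: 26.5%, 1 project: 35.7%)
print(osf_users["n_projects"].value_counts(normalize=True).sort_index())

# Share of all posted projects that are public (44% in the analysis above)
total_projects = osf_users["n_projects"].sum()
print(f"Public projects: {osf_users['n_public_projects'].sum() / total_projects:.0%}")
```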

On the one hand it is a positive development to see that in a relatively short period of time (i.e., since 2011) close to 10% of researchers identified by their peers as making significant contributions to the study of social psychology (i.e., members of SESP) have created a user account on the OSF, the most prominent online site devoted to increasing transparency in the research process. On the other hand, over 90% of SESP members are currently not users of the OSF, and the individuals who are users have posted very few projects. It is very likely that the low number of posted projects does not reflect the actual number of research projects conducted (active or completed) in the respective labs of SESP members who have posted projects on the OSF. It can therefore be concluded that pre-registering study details on the OSF is currently a very uncommon practice in the field of social psychology, at least within the membership of SESP. There presently exists a gap, therefore, between the suggestions to pre-register study details to enhance transparency of the research process and the adoption of this practice among active researchers in social psychology.

This practice is likely to become more common going forward, but one potential explanation for the current low rate of pre-registering study details is the concern that it is cumbersome, that not all study hypotheses are established at the time of data collection, and that other researchers may “scoop” posted hypotheses and methods (see Campbell et al., 2014). To the extent that these concerns pose real risks to researchers adopting pre-registration, the act of pre-registration itself could be argued to hurt the advancement of ideas in our field. This argument is largely philosophical at this time given that there is simply not enough empirical evidence upon which to evaluate this possibility.

 

References

Campbell, L., Loving, T.J., & LeBel, E.P. (2014). Enhancing transparency of the research process to increase accuracy of findings: A guide for relationship researchers. Personal Relationships, 21, 531-545. doi: 10.1111/pere.12053

Chambers, C. (2014). Psychology’s ‘registration revolution’. The Guardian. Retrieved from: http://www.theguardian.com/science/head-quarters/2014/may/20/psychology-registration-revolution.

De Groot, A. D. (1956/2014). The meaning of “significance” for different types of research. Translated and annotated by Eric-Jan Wagenmakers, Denny Borsboom, Josine Verhagen, Rogier Kievit, Marjan Bakker, Angelique Cramer, Dora Matzke, Don Mellenbergh, and Han L. J. van der Maas. Acta Psychologica, 148, 188-194.

Eich, E. (2014). Business not as usual. Psychological Science, 25, 3-6.

Funder, D.C., Levine, J.M., Mackie, D.M., Morf, C.C., Vazire, S., & West, S.G. (2014). Improving the dependability of research in personality and social psychology: Recommendations for research and educational practice. Personality and Social Psychology Review, 18, 3-12.

Ioannidis, J.P., Munafo, M.R., Fusar-Poli, P., Nosek, B.A., & David, S.P. (2014). Publication and other reporting biases in cognitive sciences: Detection, prevalence, and prevention. Trends in Cognitive Sciences, 18, 235-241.

Journal of Experimental Social Psychology (2014). JESP editorial guidelines. Retrieved from http://www.journals.elsevier.com/journal-of-experimental-social-psychology/news/jesp-editorial-guidelines/.

Krumholz, H.M., & Peterson, E.D. (2014). Open access to clinical trials data. The Journal of the American Medical Association, 312, 1002-1003.

Miguel, E., Camerer, C., Casey, K., Cohen, J., Esterling, K.M., Gerber, A….Van der Laan, M. (2014). Promoting transparency in social science research. Science, 343(6166), 30-31.

O’Boyle, Jr., E.H., Banks, G.C., & Gonzalez-Mule, E. (2014). The Chrysalis effect: How ugly initial results metamorphosize into beautiful articles. Journal of Management. doi: 10.1177/0149206314527133

Simmons, J.P., Nelson, L.D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359-1366.

The PLOS Medicine Editors (2014). Observational studies: Getting clear about transparency. PLoS Medicine, 11(8), e1001711. doi:10.1371/journal.pmed.1001711.