by Judith Curry
In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not.
False Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant
JP Simmons, LD Nelson, U. Simonsohn
Abstract. In this article, we accomplish two things. First, we show that despite empirical psychologists’ nominal endorsement of a low rate of false-positive findings (≤ .05), flexibility in data collection, analysis, and reporting dramatically increases actual false-positive rates. In many cases, a researcher is more likely to falsely find evidence that an effect exists than to correctly find evidence that it does not. We present computer simulations and a pair of actual experiments that demonstrate how unacceptably easy it is to accumulate (and report) statistically significant evidence for a false hypothesis. Second, we suggest a simple, low-cost, and straightforwardly effective disclosure-based solution to this problem. The solution involves six concrete requirements for authors and four guidelines for reviewers, all of which impose a minimal burden on the publication process.
Psychological Science published online 17 October 2011 DOI: 10.1177/0956797611417632 [full text]
The paper is discussed by Steven Novella at neurologica blog Publishing False Positives. Some excerpts (JC bold for emphasis):
In their paper Simmons et al describe in detail what skeptical scientists have known and been saying for years, and what other research has also demonstrated: that researcher bias can have a profound influence on the outcome of a study. They look specifically at how data are collected and analyzed, showing that the choices the researcher makes can influence the outcome. They refer to these choices as “researcher degrees of freedom”: choices, for example, about which variables to include, when to stop collecting data, which comparisons to make, and which statistical analyses to use.
Each of these choices may be innocent and reasonable, and the researchers can easily justify the choices they make. But added together, these degrees of freedom allow researchers to extract statistical significance out of almost any data set. Simmons and his colleagues, in fact, found that combining four common decisions about data (using two dependent variables, adding 10 more observations per cell, controlling for gender, or dropping a condition from the test) would allow for false-positive statistical significance at the p<0.05 level 60% of the time, and at the p<0.01 level 21% of the time.
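A single one of these degrees of freedom is enough to inflate the error rate noticeably. The toy Monte Carlo below is in the spirit of the paper's simulations but is not the authors' code; the cell size of 20 and the correlation of .5 between the two dependent variables follow the paper's setup, everything else is illustrative. The null hypothesis is true by construction, yet the "researcher" reports significance if either dependent variable, or their average, yields p < .05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_experiment(n=20, r=0.5):
    # Two conditions, two correlated dependent variables, NO true effect.
    cov = [[1, r], [r, 1]]
    a = rng.multivariate_normal([0, 0], cov, size=n)
    b = rng.multivariate_normal([0, 0], cov, size=n)
    # Researcher degree of freedom: report whichever test "works" --
    # DV1 alone, DV2 alone, or the average of the two DVs.
    ps = [stats.ttest_ind(a[:, 0], b[:, 0]).pvalue,
          stats.ttest_ind(a[:, 1], b[:, 1]).pvalue,
          stats.ttest_ind(a.mean(axis=1), b.mean(axis=1)).pvalue]
    return min(ps) < 0.05

n_sims = 2000
rate = sum(one_experiment() for _ in range(n_sims)) / n_sims
print(f"False-positive rate with flexible DV choice: {rate:.3f}")
```

On repeated runs the flexible rate lands near 9–10%, roughly double the nominal 5%, in line with the paper's result for this single degree of freedom.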
This means that any given paper published with a statistical significance of p<0.05 could be more likely to be a false positive than a true positive.
Worse – this effect is not really researcher fraud. In most cases researchers could be honestly making necessary choices about data collection and analysis, and they could really believe they are making the correct choices, or at least reasonable choices. But their bias will influence those choices in ways that researchers may not be aware of. Further, researchers may simply be using the techniques that “work” – meaning they give the results the researcher wants.
Worse still – researchers are not required to disclose the information needed to detect the effect of these choices on the outcome. All of these choices about the data can be omitted from the published study. There is therefore no way for a reviewer or reader of the article to know all the “degrees of freedom” the researchers had, which analyses they tried and rejected, how they decided when to stop collecting data, etc.
They hit the nail on the head when they write that the goal of science is to “discover and disseminate truth.” We want to find out what is really true, not just verify our biases and desires. That is the skeptical outlook, and it is why we are so critical of papers purporting to demonstrate highly implausible claims with flimsy data. We require high levels of statistical significance, reasonable effect sizes, transparency in the data and statistical methods, and independent replication before we would conclude that a new phenomenon is likely to be true. This is the reasonable position, historically justified, in my opinion, because of the many false positives that were prematurely accepted in the past (and continue to be today).
Requirements for authors and guidelines for reviewers
The paper makes the following recommendations:
- Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.
- Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.
- Authors must list all variables collected in a study.
- Authors must report all experimental conditions, including failed manipulations.
- If observations are eliminated, authors must also report what the statistical results are if those observations are included.
- If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.
- Reviewers should ensure that authors follow the requirements.
- Reviewers should be more tolerant of imperfections in results.
- Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.
- If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication.
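The first requirement (a pre-specified stopping rule) targets optional stopping: testing the data, then collecting more only when the result is not yet significant. A minimal sketch of why this inflates the false-positive rate, assuming a single data-dependent peek that adds 10 observations per cell (the numbers are illustrative, not the authors' exact simulation):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def peeking_experiment(n0=20, n_extra=10):
    # The null is true: both conditions are drawn from the same distribution.
    a = rng.standard_normal(n0)
    b = rng.standard_normal(n0)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        return True  # stop early and report "significant"
    # Not significant yet -- collect 10 more per cell and test again.
    a = np.concatenate([a, rng.standard_normal(n_extra)])
    b = np.concatenate([b, rng.standard_normal(n_extra)])
    return stats.ttest_ind(a, b).pvalue < 0.05

n_sims = 4000
rate = sum(peeking_experiment() for _ in range(n_sims)) / n_sims
print(f"False-positive rate with one optional peek: {rate:.3f}")
```

Even a single data-dependent peek pushes the rate above the nominal 5%; repeated peeking pushes it higher still, which is why the stopping rule must be fixed (and reported) in advance.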
They also discuss other options that they feel would not be effective or practical. Disclosing all the raw data is certainly a good idea, but readers are unlikely to analyze the raw data on their own. They also don’t like replacing p-value analysis with Bayesian analysis, because they feel this would just increase the degrees of freedom. I am not sure I agree with them there – for example, they argue that a Bayesian analysis requires judgments about the prior probability, but it doesn’t. You can simply calculate how much the new data should shift the prior probability (essentially what a Bayesian approach is) without deciding what that prior probability was. It seems to me that the Bayesian and p-value approaches have the same problems of bias, so I agree it’s not a solution, but I don’t feel it would be worse.
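The "shift in belief" Novella describes is usually computed as a Bayes factor, and Simmons et al's concern is that it still involves a judgment call: a prior for the plausible effect size under the alternative hypothesis. A hypothetical sketch (the estimate d_hat = 0.4 and standard error se = 0.2 are made-up numbers, and the normal-approximation model is my own simplification) shows the Bayes factor swinging above and below 1 depending on that prior-scale choice:

```python
import numpy as np
from scipy import stats

# Hypothetical experiment: observed standardized mean difference and its SE.
d_hat, se = 0.4, 0.2

def bayes_factor_10(tau):
    """BF for H1 (effect ~ N(0, tau^2)) vs H0 (effect = 0), using a normal
    approximation to the sampling distribution of d_hat."""
    m1 = stats.norm.pdf(d_hat, loc=0, scale=np.sqrt(se**2 + tau**2))  # marginal under H1
    m0 = stats.norm.pdf(d_hat, loc=0, scale=se)                       # likelihood under H0
    return m1 / m0

for tau in (0.1, 0.5, 2.0):
    print(f"prior scale tau={tau}: BF10 = {bayes_factor_10(tau):.2f}")
```

So the prior probability of the hypothesis can indeed be left out of the calculation, as Novella says, but the effect-size prior remains an analytic degree of freedom – which is the kind of flexibility Simmons et al worry about.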
JC conclusion: While the examples and recommendations are specific to the field of psychology, these general issues are germane to any scientific problem that uses statistical analysis and inference. Let’s look at the hockey stick papers (MBH 98, 99) as an example. How should these rules be modified for such a study? Based upon what we now know about these papers and the background of what went into them, how do these papers measure up by the standards of the false-positive analysis?