Wednesday, March 21, 2012

Why studies don't always repeat...


What we have here is a failure to replicate!


There are thousands of published research studies every year purporting to show a curative effect of some new drug.  In a WSJ article (12/02/11) entitled “Scientists’ Elusive Goal: Reproducing Study Results,” Gautam Naik lists a slew of such ‘breakthroughs,’ reported in prestigious peer-reviewed scientific journals like Nature, Science and The Lancet, that cannot be replicated by others, particularly by pharmaceutical companies that would like to profit from them.
Naik cites a number of possible reasons: increased competition to publish, the proliferation of scientific journals, differing details among attempts to replicate, even outright fraud.  But these possibilities seem inadequate to account for the prevalence of the problem, which amounts to an apparent breakdown of science.  Why do so many drugs that are apparently effective in one study fail to hold up in others? 
Naik names one clue to the problem: bias in favor of publishing positive results.  This is pretty obvious, if you think about it.  Unless there is already overwhelming reason to believe a given drug will have an effect, a study reporting that, indeed, it has no effect is not too newsworthy.  Consequently, unless a negative result contradicts some widely accepted belief, journals, hard-pressed for space, will not publish it.  (They may not publish it anyway, for a different set of reasons, but that is another story.) 
But there is another factor that, taken together with the bias in favor of positive results, can account for the large number of failures to repeat results reported in respected scientific journals.  One sentence in the article gives a clue: “Statistically, the studies were very robust,” says one scientist, describing a study that nevertheless failed to replicate.  All these studies rely on statistics, on comparison between a control and an experimental group, to establish their effects.  A ‘robust’ result here is not robust in the way a result in physics or engineering might be.  You can’t just keep repeating the experiment to see if you get the same result again and again.  Resources, ethics and a host of other practical considerations mean the experiment is usually done just once, with a number of subjects, and then statistics are used to see whether the difference between controls and experimentals is real or not. 
The significance level used to establish ‘robustness’ is typically 5%.  That is to say, a result is accepted as real, or at least publishable, if statistics show that, were the drug in fact to have no effect, i.e., were the control and experimental groups really no different, then no more than 5% of repetitions of the study would produce a difference between the two groups as large as the one actually observed.  In other words, the probability of a ‘false positive’ is no more than 5%.  
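To make that definition concrete, here is a minimal simulation sketch in Python (the group sizes, the normal distributions and the t-test are my own illustrative choices, not anything from Naik’s article).  Both groups are drawn from the same distribution, so the ‘drug’ has no effect at all, yet roughly 5% of the simulated experiments still come out ‘significant’:

import numpy as np
from scipy.stats import ttest_ind
rng = np.random.default_rng(0)
n_experiments = 10_000   # hypothetical repetitions of the same null experiment
n_per_group = 30         # hypothetical number of subjects per group
false_positives = 0
for _ in range(n_experiments):
    control = rng.normal(0.0, 1.0, n_per_group)   # no treatment effect
    treated = rng.normal(0.0, 1.0, n_per_group)   # same distribution as control
    _, p = ttest_ind(control, treated)
    if p < 0.05:
        false_positives += 1
print(false_positives / n_experiments)   # comes out close to 0.05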
So perhaps as many as 5% of published studies will not be replicable?  Well, no, the actual number will be much larger.  To see why, imagine a hundred hypothetical studies testing a hundred different drugs.  Let’s stipulate that for 20% of them there is a real effect of the drug.  For some small percentage of these ‘real’ effects the outcome will nevertheless fail to reach the 5% significance level.  Let’s neglect these and assume that 100% of real effects show up as significant: that’s 20 out of the 100.  What about the failures, the 80 studies where there is no real effect?  Well, 5% of them – four – will show up as positive even though the drug is really ineffective.  These are the 5% false positives.  So 76 will show up (correctly) as negative.  How many of these 76 will be published?  Well, for the reasons I just gave, almost none.  So what are we left with?  A total of 24 studies (20 + 4) showing a positive effect of a drug, but of these 24, four (nearly 17 percent of the total) will be false.  So, given the understandable bias toward publishing positive results, and the accepted 5% significance level, the number of published studies that are false – the result of experimental variability, not real effects – will be much larger than 5%.  We should not be surprised at Naik’s report. 
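For readers who want to check the arithmetic, here is the same bookkeeping in a few lines of Python (the 100 studies, the 20% of real effects, the assumption of perfect power and the 5% level are simply the stipulations above):

n_studies = 100
n_real = 20                            # drugs with a real effect (20% of 100)
n_null = n_studies - n_real            # 80 drugs with no real effect
alpha = 0.05                           # accepted significance level
true_positives = n_real * 1.0          # assume every real effect reaches significance
false_positives = n_null * alpha       # 80 * 0.05 = 4
true_negatives = n_null - false_positives    # 76, almost none of which get published
published_positives = true_positives + false_positives    # 24
print(false_positives / published_positives)   # 4 / 24, about 17%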
The number of published false positives obviously depends on the number of really effective drugs being tested.  For example, if only 5% of experimental drugs tested are actually effective, 95% will be ineffective.  Thus the number of false positives rises from 4 in the previous example to nearly five, and the number of correct positives falls from 20 to 5.  Result: almost 49% of published studies will be unreplicable.  (Bayer apparently found recently that two-thirds of its attempts to replicate failed.)  Paradoxically, the wider researchers cast their net, the more compounds they actually investigate, the worse the replicability problem will become.  Figures I have heard are on the order of 10,000 compounds tested for every one found to be effective.  Given that many failures, the number of statistical false positives is certain to be very large as well.     
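Wrapping the same arithmetic in a small function (again just an illustrative sketch, with the function and variable names my own) shows how quickly the share of false positives among published ‘significant’ results grows as genuinely effective drugs become rarer:

def false_positive_share(base_rate, alpha=0.05, power=1.0):
    # base_rate: fraction of tested drugs that truly work
    # alpha: significance level; power: chance a real effect reaches significance
    true_pos = base_rate * power
    false_pos = (1.0 - base_rate) * alpha
    return false_pos / (true_pos + false_pos)
for rate in (0.20, 0.05, 1.0 / 10_000):
    print(rate, round(false_positive_share(rate), 3))
# 0.20 gives about 0.167 (the 100-study example)
# 0.05 gives about 0.487 ("almost 49%")
# 1 in 10,000 gives about 0.998 (nearly every published positive would be false)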
There is a simple solution, which rests on the fact that there is nothing magic about the 5% significance level.  It has no scientific basis.  It is in fact completely arbitrary.  Its real basis seems to be social and political.  It allows earnest tenure/grant/promotion seekers, with a reasonable amount of work, to achieve a publishable result.  Even if their treatments really have no effect, one time in twenty they will get something publishable.  Persistence always pays in the end.  The simple solution, therefore, is to set the standard for publication higher, let’s say 1% significance, which would reduce the errors (false positives) in my example from 4 to less than one.  At a stroke, the flood of scientific papers would be reduced to a manageable flow, the number of unreplicable results would be massively reduced, and much wasted labor examining small or negligible effects would be eliminated.
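The effect of tightening the threshold can be checked with the same sort of arithmetic (again just an illustrative sketch, using the numbers stipulated in the 100-study example):

n_real, n_null = 20, 80
for alpha in (0.05, 0.01):
    false_pos = n_null * alpha                   # 4.0 at the 5% level, 0.8 at 1%
    share = false_pos / (n_real + false_pos)     # share of published positives that are false
    print(alpha, false_pos, round(share, 3))
# at 5%: 4.0 false positives, about 17% of published positive results
# at 1%: 0.8 false positives, about 4% of published positive results

Even that modest change cuts the expected share of false positives among published positive results from roughly 17% to roughly 4%.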