E's flat, ah's flat too: Ad-hoc statistics

I am currently reading E T Jaynes' "Probability Theory: The Logic of Science", his posthumous textbook published in 2003. Jaynes was a lifelong promoter of Bayesian methods in probability and statistics, the inventor of the "maximum entropy" method of assigning priors, and, for much of his career, at loggerheads with "orthodox" (or "frequentist") statisticians, who dismissed Bayesian ideas of "prior" and "posterior" probabilities except where these could be rigorously justified as limits of large numbers of trials. Jaynes, drawing on previous work of Cox, Polya, Jeffreys and others (including himself), argues that probability theory is the unique generalisation of Boolean logic to statements that have varying degrees of plausibility. Specifically, given three reasonable-sounding "desiderata", he shows that the rules of probability theory follow uniquely, with no reference to trials, sample spaces or the rest of the usual language. His point, hammered home again and again throughout the book, is that prior information is essential and must not be thrown away: "If we humans threw away what we knew yesterday in reasoning about our problems today, we would be below the level of wild animals." Meanwhile, he condemns much of orthodox statistics as "ad-hockery" and, even where it is valid, as being of extremely limited applicability.

The book is full of interesting nuggets, historical insights and examples of misleading statistics. I just came across the following striking example.

According to Jaynes, the data in this example are real but the circumstances have been simplified. In experiment A, patients were given one of two treatments, an old one and a new one, and the numbers of "failures" (deaths) and "successes" (recoveries) were compared. The results were:

Experiment A
Old: 16519 failures, 4343 successes
(success rate 20.8 +/- 0.28 %)
New: 742 failures, 122 successes
(success rate 14.1 +/- 1.10 %)

Experiment B was the same experiment conducted two years later. The results were:

Experiment B
Old: 3876 failures, 14488 successes
(success rate 78.9 +/- 0.30 %)
New: 1233 failures, 3907 successes
(success rate 76.0 +/- 0.60 %)

The results were "discouraging": the new treatment, in both experiments, showed a lower success rate.

Says Jaynes: "But then one of them had a brilliant idea: let us pool the data, simply adding up" the totals over experiments A and B for each method. This "pooled data" yields the results:

Pooled data
Old: 20395 failures, 18831 successes
(success rate 48.0 +/- 0.25 %)
New: 1975 failures, 4029 successes
(success rate 67.1 +/- 0.61 %)

And, lo and behold, the "pooled data" show the new method performing strikingly better. Says Jaynes, "they eagerly publish this gratifying conclusion, presenting only the pooled data; and become (for a short time) famous as great discoverers."

How is it that pooling the data changes the results? The point is that, when pooling in this manner, certain essential facts are being hidden: both methods performed much better in Experiment B; and experiment B contained many more instances of the new method, with somewhat fewer instances of the old method.
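The reversal described above (commonly known as Simpson's paradox) can be checked directly from the quoted counts. The following is a minimal sketch; the names `data`, `success_rate`, `pooled` and `share_in_B` are illustrative choices, not anything from Jaynes's book.

```python
# (failures, successes) for each experiment and treatment, as quoted above.
data = {
    "A": {"old": (16519, 4343), "new": (742, 122)},
    "B": {"old": (3876, 14488), "new": (1233, 3907)},
}


def success_rate(failures, successes):
    """Fraction of patients who recovered."""
    return successes / (failures + successes)


def pooled(treatment):
    """Add up failures and successes over both experiments."""
    failures = sum(data[exp][treatment][0] for exp in data)
    successes = sum(data[exp][treatment][1] for exp in data)
    return failures, successes


def share_in_B(treatment):
    """Fraction of a treatment's patients that come from experiment B."""
    in_b = sum(data["B"][treatment])
    return in_b / sum(pooled(treatment))


for exp in data:
    for treatment in ("old", "new"):
        rate = success_rate(*data[exp][treatment])
        print(f"Experiment {exp}, {treatment}: {100 * rate:.1f}%")

for treatment in ("old", "new"):
    rate = success_rate(*pooled(treatment))
    print(f"Pooled, {treatment}: {100 * rate:.1f}% "
          f"({100 * share_in_B(treatment):.0f}% of patients from B)")
```

The last column is the hidden fact: most of the "new" patients come from the experiment in which everybody did better, so the pooled rate for the new treatment is dominated by experiment B.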

Here is another example of dodgy statistics that I came across a while ago: this one is particularly distressing because it was a review, meant to settle a long-running argument.

Peter Duesberg believes that AIDS is not caused by HIV, but by drug overuse (in the original San Francisco Bay Area outbreaks), malnutrition (in Africa), and the antiretroviral drugs themselves (in the HIV+ patients being treated). Today this is seen as a crackpot view, but back in the 1980s it was at least worthy of consideration. By 1994, mainstream HIV researchers were beginning to get fed up with his arguments.

One of Duesberg's arguments was that AIDS-like symptoms were induced by antiretroviral drugs like AZT (the first antiretroviral approved for use). An example of how he and the mainstream researchers could interpret statistical data in opposite ways is found in a review by Jon Cohen, "Reviewing the data - IV: Could Drugs, Rather Than a Virus, Be the Cause of AIDS?" One of the things at issue is how to interpret data from the "Concorde study", which tracked 877 individuals who were treated with AZT soon after entering the study (the "Imm" group), and 872 individuals who were given deferred treatment with AZT or not given AZT at all (the "Def" group). At the end of the three-year study, 96 deaths occurred in the "Imm" group, and 76 in the "Def" group. Duesberg is quoted as saying, in a written response to Science magazine: "The Concorde data exactly prove my point: The mortality of the AZT-treated HIV-positives was 25% higher than that of the placebo group."

But "25% higher" is a meaningless number. If four deaths occurred in the Def group and five in the Imm group, that would be an increase of 25% but nobody would consider that significant. If there were 400 deaths in the Def group and 500 deaths in the Imm group, most people's gut reaction would be that this is a significant increase. How to assess the significance in this case?

First, Cohen quotes experts who note that 22 of these deaths occurred from causes unrelated to AZT or AIDS, such as traffic accidents and suicides. Subtracting those leads to 81 Imm deaths and 69 Def deaths -- a 17% increase, but how significant is that?

Enter the "experts", and I quote:

In addition, say the critics, there is a deeper flaw in Duesberg's analysis: He does not take account of the total number of people in the Imm and Def groups. His reasoning for ignoring the denominator is, as he told Science in an interview, that "it was the same in the two groups." But National Institute of Allergy and Infectious Diseases Director Anthony Fauci says this type of analysis means "ignoring an important part of a calculation." Specifically, there were 96 total deaths out of 877 in the Imm group, implying that 10.9% of the people who were immediately treated with AZT died. In the deferred treatment group, there were 76 deaths among 872 people, or 8.7%.

The appropriate conclusion, say the authors of the Concorde study, is that the difference in mortality between Imm and Def groups is not 25% but 10.9% minus 8.7% -- or 2.2%. Subtracting the deaths from causes unrelated to AZT or AIDS, the difference drops to 1.3%. As the Concorde paper notes, neither difference (2.2% or 1.3%) is statistically significant.

So, apparently, the answer to bad statistics is atrocious statistics. (No wonder AIDS deniers are still around today.) What these people seem to be saying is that the corrected difference is 1.3% of the total population and is not statistically significant (why they assert this is unclear). If one person died in the Def group, and thirteen died in the Imm group, that difference would be the same 1.3%: would it still be statistically insignificant?
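To see how differently the two summaries behave, here is a small sketch that computes both the relative increase (Duesberg's style of number) and the absolute difference in death rates (the Concorde authors' style) for the counts quoted above. The function name `compare` is made up for illustration, and the relative increase is taken here as a ratio of rates rather than of raw counts, which is an assumption; with nearly equal group sizes the two are almost identical.

```python
def compare(deaths_imm, deaths_def, n_imm=877, n_def=872):
    """Return (relative increase, absolute difference) in mortality, Imm vs Def."""
    rate_imm = deaths_imm / n_imm
    rate_def = deaths_def / n_def
    relative = rate_imm / rate_def - 1.0
    absolute = rate_imm - rate_def
    return relative, absolute


cases = {
    "all deaths (96 vs 76)": (96, 76),
    "unrelated deaths removed (81 vs 69)": (81, 69),
    "hypothetical (13 vs 1)": (13, 1),
}

for label, (imm, deferred) in cases.items():
    relative, absolute = compare(imm, deferred)
    print(f"{label}: relative increase {100 * relative:.0f}%, "
          f"absolute difference {100 * absolute:.1f} percentage points")
```

Neither number says anything about significance by itself: the second and third cases have nearly the same absolute difference but very different relative increases.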

Actually, using some simple assumptions one can quickly check how significant these numbers really are. Suppose a patient in the Def group has a fixed probability p of dying over the duration of the experiment. (Of course, not all patients are equally fit, but without other prior information, this is the best we can do.) Given the data (uncorrected, for now, for "other" deaths), our best estimate is p = 76/872 = 0.087. The number of deaths then follows a binomial distribution, which is a bell-shaped curve when the numbers are large: for a population size of N=872 (Def group), its mean is Np = 76 and its standard deviation is the square root of Np(1-p), or about 8.3. For the Imm group, the numbers are nearly unchanged. 96 is more than two standard deviations (about 2.4) away from 76, so it would seem that Duesberg was right in pronouncing it significant: there is only about a 2% probability that one would see a deviation this large, in either direction, in the absence of any effect from AZT.
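The back-of-the-envelope check above can be reproduced in a few lines. This is only a sketch: it fixes p at its best estimate, as the paragraph does, and uses the normal approximation to the binomial for the tail probability, which is an assumption made here for simplicity.

```python
import math

N = 872        # patients in the Def group
k = 76         # deaths observed in the Def group
K_obs = 96     # deaths observed in the Imm group

p = k / N                           # best single estimate of the death probability
mean = N * p                        # expected deaths if Imm behaved like Def
sd = math.sqrt(N * p * (1 - p))     # binomial standard deviation

z = (K_obs - mean) / sd             # distance of 96 from the mean, in sd units

# Normal-approximation tail probabilities (no continuity correction).
one_sided = 0.5 * math.erfc(z / math.sqrt(2))
two_sided = 2 * one_sided

print(f"p = {p:.3f}, mean = {mean:.0f}, sd = {sd:.1f}, z = {z:.2f}")
print(f"one-sided tail = {one_sided:.4f}, two-sided tail = {two_sided:.4f}")
```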

But we can improve on this calculation. We assumed that p was equal to its best estimate, but of course any value of p, other than zero or one, could in theory produce these data. What we need to ask is: given that 76 deaths were seen in the Def group, what is the distribution of expected deaths in the Imm group if AZT had no effect, and where does the number 96 lie on that distribution? I won't get into the details here, but if we assume that we have no a priori expectation on the probability p that a person from Def would die, then the distribution of p is proportional to the likelihood of seeing 76 deaths given p. More generally, if there are N individuals in the population and one observes k deaths, the distribution of p is proportional to the probability of seeing k deaths given p; that is, it is proportional to $ \left( N \atop k \right) p^k (1-p)^{N-k}$. The normalisation factor is obtained by integrating from 0 to 1. The probability of seeing $K$ deaths in the Imm trial, if AZT had no effect, is $ \left( N \atop K \right) p^K (1-p)^{N-K}$, averaged over all values of $p$ with the preceding probability distribution for $p$. If we do the math, we get the following distribution for $K$:
$P(K) = (N+1) \left(N \atop K \right) \left(N \atop k \right) \frac{(k+K)! (2N-k-K)!} {(2N+1)!} $
(We have for simplicity assumed the total number of patients to be the same in both groups, since the actual difference is small.)
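
The steps leading to the formula above can be sketched as follows, assuming (as in the text) a flat prior on $p$ and the same $N$ in both groups. The only outside ingredient is the standard Beta integral, $\int_0^1 p^a (1-p)^b \, dp = \frac{a! \, b!}{(a+b+1)!}$ for non-negative integers $a$ and $b$.

With $a = k$ and $b = N-k$, the integral of $ \left( N \atop k \right) p^k (1-p)^{N-k}$ from 0 to 1 is $1/(N+1)$, so the normalised distribution of $p$ given $k$ deaths in Def is
$ f(p \mid k) = (N+1) \left( N \atop k \right) p^k (1-p)^{N-k} $

Averaging the binomial probability of $K$ deaths over this distribution gives
$ P(K) = \int_0^1 \left( N \atop K \right) p^K (1-p)^{N-K} f(p \mid k) \, dp = (N+1) \left( N \atop K \right) \left( N \atop k \right) \int_0^1 p^{k+K} (1-p)^{2N-k-K} \, dp $

Applying the Beta integral once more, now with $a = k+K$ and $b = 2N-k-K$, replaces the remaining integral by $\frac{(k+K)! (2N-k-K)!}{(2N+1)!}$, which is the expression for $P(K)$ given above.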

If we plot this as a function of $K$, we get a bell-shaped curve as follows:

The red line is the number 96 that were observed in the Imm group: it lies well within the "bell", and clearly it is not a significant difference.

What if we correct the numbers? There were evidently 15 unrelated deaths in the Imm group and 7 in the Def group; and 81 relevant deaths out of 862 in Imm, 69 out of 865 in Def. Taking N = 865 and k = 69, the plot is as follows:

The red line marks the observed number 81 in the Imm group, and statistically it is even less significant than earlier.

The statistics I have used date to the 19th century. What I find worrisome is that, in 1994, the scientific world was doing its best to shut Duesberg up, marshalled its best statistics and published them in one of the most prestigious journals (Science) -- and this was the best it could do? The quoted extract above, claiming that 1.3% of 877 is "not statistically significant", is so horrifying to me that I have to wonder: what else in the biomedical literature has been "proved" with the help of such statistics? Just to illustrate the point, here is the hypothetical case where 3 people died in the control group, and 15 died in the Imm group, out of a total of 872:

Clearly, in this case, 1.3% is statistically significant.
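
For readers who want to redraw the curves or put a number on where the observed count falls, here is a minimal sketch that evaluates the formula for $P(K)$ given earlier. It works with logarithms of factorials (`math.lgamma`) to avoid overflow. The function names are illustrative, and summarising each case by the upper-tail probability P(K >= observed) is a choice made for this sketch, not something taken from the analysis above.

```python
import math


def log_binom(n, r):
    """Natural log of the binomial coefficient C(n, r)."""
    return math.lgamma(n + 1) - math.lgamma(r + 1) - math.lgamma(n - r + 1)


def predictive_pmf(K, k, N):
    """P(K): probability of K deaths among N patients, given k deaths
    among N comparable patients and a flat prior on p."""
    log_p = (math.log(N + 1)
             + log_binom(N, K) + log_binom(N, k)
             + math.lgamma(k + K + 1)            # log (k+K)!
             + math.lgamma(2 * N - k - K + 1)    # log (2N-k-K)!
             - math.lgamma(2 * N + 2))           # log (2N+1)!
    return math.exp(log_p)


def upper_tail(K_obs, k, N):
    """Probability of K_obs or more deaths if the treatment had no effect."""
    return sum(predictive_pmf(K, k, N) for K in range(K_obs, N + 1))


# (k deaths in Def, K deaths in Imm, N) for the three cases in the text.
cases = {
    "all deaths": (76, 96, 872),
    "unrelated deaths removed": (69, 81, 865),
    "hypothetical 3 vs 15": (3, 15, 872),
}

for label, (k, K_obs, N) in cases.items():
    print(f"{label}: P(K >= {K_obs}) = {upper_tail(K_obs, k, N):.4f}")
```

Plotting `predictive_pmf(K, k, N)` against `K` for each case should give bell-shaped curves of the kind described in the text.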

The point here is that statistics is not a trivial task. According to Jaynes, the large majority of "orthodox" 20th-century statisticians got things very wrong. But even where orthodox statistics is applicable, applying it is not a task to be done mechanically or unthinkingly. It is not fair to expect a biological, medical or clinical researcher to be an expert in this field. Biomedical journals routinely ask reviewers whether expert statistical reviews of manuscripts are necessary. Despite that, I wonder how much bad statistics slips through, and how much damage it causes.

Physicists usually do not undergo serious courses in statistics in their education, and don't commonly use orthodox statistical tests. Jaynes observes in his book that this is a good thing: the gut instinct of a physicist is often a better measure of significance than the "ad-hockery" of orthodox statistics. His solution is to start, in all instances, with the basic laws of probability theory and approach hypothesis testing as a Bayesian problem. This is not usually an easy task, but it is necessary.

3 comments:

km said...: Rahul:

You should write this as a column in a newspaper. (Assuming that newspapers still cover serious science stories. Ha.)

Fantastic stuff.; 5/06/2010 10:34 pm
Rahul Siddharthan said...: km - Well, it's hardly newspaper material, but I may flesh it out and clean it up a bit and send it elsewhere.; 5/07/2010 10:29 am
Anonymous said...: Rahul:

You bring up several interesting points, not the least of which is that it's often very hard to decide the right numerical quantity to answer a question stated in plain English.

The analysis about statistical significance is interesting, but it seems to me that selection bias could be a more fundamental issue. AZT was presumably prescribed early to patients with worse health, perhaps a lower CD4 count. So it's quite possible that the reason for more deaths in the "Imm" group was that they were less healthy to start with. Perhaps the study already accounted for this; were the patients matched on CD4 and other health factors when they were enrolled in the study?; 6/10/2010 8:19 pm

Wednesday, May 05, 2010
