A while back (11/2/11) I reviewed the book Stats.con by James Penston. That book discussed how the statistics used in randomized clinical trials can be highly deceptive. How Not to Be Wrong, by mathematician Jordan Ellenberg, also covers some aspects of statistical misuse, in more detail and certainly in a much more entertaining way. Some of his comments are funny as hell.
Jordan Ellenberg
Consider the widespread use of a statistic called the p-value, which estimates the probability that a result as striking as the one observed could have arisen by chance alone rather than reflecting an actual meaningful finding. A study is generally considered positive if the p-value is 5% or less.
5% is of course not 0%. Even when there is no real effect at all, a study has a one in twenty chance of coming out "positive" purely by luck. But what happens if journals publish only the positive studies and not the negative ones, when there might be a large number of negative studies, and when the positive results are never reproduced (replicated) in a second study? Well, people start believing things that are not true, that's what.
The reason, as the author points out, is that improbable things actually happen quite frequently, especially if you do lots and lots of things, like experiments.
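You can watch this happen in a few lines of code. Here is a minimal sketch in Python (my own toy setup, not anything from the book): it runs thousands of experiments in which there is, by construction, no real effect, and counts how many nonetheless clear the p < .05 bar.

```python
import random
import statistics

random.seed(1)

def run_null_experiment(n=30):
    """Compare two groups drawn from the SAME distribution,
    so any 'effect' we find is pure chance."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    # Equal-variance two-sample t statistic.
    pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    t = (statistics.mean(a) - statistics.mean(b)) / (pooled_sd * (2 / n) ** 0.5)
    return abs(t) > 2.0  # roughly the p < .05 cutoff for groups of 30

trials = 10_000
false_positives = sum(run_null_experiment() for _ in range(trials))
print(f"'significant' findings among pure-noise studies: {false_positives / trials:.1%}")
```

Run it with different seeds and the false positive rate hovers right around one in twenty, exactly as advertised.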
Another issue he mentions is that, if your sample size is too small, the chances increase dramatically that one of your subjects will be an outlier who artificially skews the average for whatever characteristic you are measuring. With a small sample, you are more likely to get a few extra prodigies or slackers in a study of people's ability to perform certain tasks. A famous example: if Bill Gates walks into a bar with a few other people, the average guy in the room is a billionaire.
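The bar example takes two minutes to verify. In this sketch (all net worths invented for illustration), one extreme value drags the mean into absurdity while the median barely moves:

```python
import statistics

# Net worths, in dollars, of ten ordinary bar patrons (made-up numbers).
patrons = [40_000, 55_000, 30_000, 72_000, 48_000,
           61_000, 35_000, 50_000, 44_000, 58_000]
print(f"before: mean ${statistics.mean(patrons):,.0f}, "
      f"median ${statistics.median(patrons):,.0f}")

# Bill Gates walks in (a round stand-in figure for his net worth).
patrons.append(100_000_000_000)
print(f"after:  mean ${statistics.mean(patrons):,.0f}, "
      f"median ${statistics.median(patrons):,.0f}")
# The 'average' patron is now a billionaire; the median patron is not.
```

This is also why medians are often more informative than means for skewed quantities like income.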
Here’s how the author starts out a discussion of the p-value problem (pages 145-46):
"Imagine yourself a haruspex; that is, your profession is to make predictions about future events by sacrificing sheep and then examining the features of their entrails...You do not, of course, consider your predictions to be reliable merely because you follow the practices commanded by the Etruscan deities. That would be ridiculous. You require evidence. And so you and your colleagues submit all your work to the peer-reviewed International Journal of Haruspicy, which demands without exception that all published results clear the bar of statistical significance.
Haruspicy, especially rigorous evidence-based haruspicy, is not an easy gig. For one thing, you spend a lot of your time spattered with blood and bile. For another, a lot of your experiments don't work. You try to use sheep guts to predict the price of Apple stock, and you fail; you try to model Democratic vote share among Hispanics, and you fail…The gods are very picky and it's not always clear precisely which arrangement of the internal organs and which precise incantations will reliably unlock the future. Sometimes different haruspices run the same experiment and it works for one but not the other — who knows why? It's frustrating…
But it's all worth it for those moments of discovery, where everything works, and you find that the texture and protrusions of the liver really do predict the severity of the following year's flu season, and, with a silent thank-you to the gods, you publish.
You might find this happens about one time in twenty.
That's what I'd expect, anyway. Because I, unlike you, don't believe in haruspicy. I think the sheep's guts don't know anything about the flu data, and when they match up it's just luck. In other words, in every matter concerning divination from entrails, I'm a proponent of the null hypothesis [that there is no connection between the sheep entrails and the future]. So in my world, it's pretty unlikely that any given haruspectic experiment will succeed.
How unlikely? The standard threshold for statistical significance, and thus for publication in IJoH, is fixed by convention to be a p-value of .05, or 1 in 20... If the null hypothesis is always true — that is, if haruspicy is undiluted hocus-pocus — then only one in twenty experiments will be publishable.
And yet there are hundreds of haruspices, and thousands of ripped-open sheep, and even one in twenty divinations provides plenty of material to fill each issue of the journal with novel results, demonstrating the efficacy of the methods and the wisdom of the gods. A protocol that worked in one case and gets published usually fails when another haruspex tries it, but experiments without statistically significant results do not get published, so no one ever finds out about the failure to replicate. And even if word starts getting around, there are always small differences the experts can point to that explain why the follow-up study didn't succeed."
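The whole haruspicy economy fits in a short simulation (a toy model of mine, using only the 1-in-20 setup from the passage): every divination is pure chance, the journal publishes whatever clears p < .05, and each published protocol is then tried once more by a rival haruspex.

```python
import random

random.seed(2)

def divination_succeeds():
    """A single haruspicy experiment: the null hypothesis is always true,
    so it clears p < .05 exactly 5% of the time, by luck alone."""
    return random.random() < 0.05

experiments = 10_000
published = sum(divination_succeeds() for _ in range(experiments))

# A rival haruspex reruns each published protocol once. The original
# success was luck, so the rerun also succeeds only 5% of the time.
replications_held = sum(divination_succeeds() for _ in range(published))

print(f"published in IJoH:      {published} of {experiments}")
print(f"replications that held: {replications_held} of {published}")
```

Out of 10,000 sheep, roughly 500 "discoveries" get published, and only a couple dozen of those survive a single replication attempt, yet the journal's pages stay full.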
The book covers many subjects that can teach the non-mathematically-inclined to think in a mathematical way, both to avoid coming to certain wrong conclusions and to zero in on correct ones. Many of these, however, are irrelevant to this blog – the chapters on lotteries come to mind. I of course found those parts a bit less interesting. But the chapters relevant to medical studies are so right on.
Another important topic the author covers is known mathematically as regression to the mean. This phenomenon can lead, for example, to overestimates of the genetic component of human traits, and it explains why fad diets always seem to work at first, only to be quietly forgotten later on. As mentioned, averages of any measurement applied to human beings can be deceptive.
In addition to the sample size considerations described above, you can get into trouble if you start with a sample of people who are, on average, higher or larger on the relevant variable than the average person in the general population.
If two tall people marry, their progeny will usually be tall compared to others in the general population. However, they are not all that likely to be taller than their parents. As Ellenberg states, “…the children of a great composer, or scientist, or political leader, often excel in the same field, but seldom so much as their illustrious parents” (p. 301). Their heredity mingles with chance environmental factors, which pushes them back toward the population average. That is the meaning of regression to the mean.
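A back-of-the-envelope model makes the mechanism visible (the heritability weight and the cutoff here are numbers I made up for illustration): if a child's score is part heredity and part luck, the children of exceptional parents land well above average but well below their parents.

```python
import random
import statistics

random.seed(3)

HERITABILITY = 0.5  # assumed weight on the parents' score; pure illustration

# Parent scores in standard-deviation units around a population mean of 0.
parents = [random.gauss(0, 1) for _ in range(100_000)]
# Child = heredity plus luck, scaled so children have the same spread.
children = [HERITABILITY * p + random.gauss(0, (1 - HERITABILITY ** 2) ** 0.5)
            for p in parents]

# Look only at the children of highly exceptional parents (top ~2%).
elite = [(p, c) for p, c in zip(parents, children) if p > 2.0]
print(f"exceptional parents, mean score: {statistics.mean(p for p, _ in elite):.2f}")
print(f"their children, mean score:      {statistics.mean(c for _, c in elite):.2f}")
# Above average, but closer to it than their parents: regression to the mean.
```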
To understand this, think about those who embark on weight loss diets. Most people’s weight tends to fluctuate a few pounds either way depending on a lot of chance factors, such as happening by an ice cream truck. And when are people most likely to start a diet? When their weight is at the top of their range! So by the law of averages, in many instances they are going to lose weight whether they diet or not. But when they do diet, guess what happens? They attribute the loss to the fantastic new diet!
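The do-nothing diet is just as easy to simulate (again, invented numbers): weight wobbles randomly around a personal set point, people start "dieting" only when they hit the top of their range, and the diet itself has zero effect.

```python
import random
import statistics

random.seed(4)

SET_POINT = 160    # a person's average weight in pounds (made up)
FLUCTUATION = 4    # typical random wobble, in pounds

def weigh():
    """Today's weight: the set point plus noise. No trend, no diet effect."""
    return SET_POINT + random.gauss(0, FLUCTUATION)

losses = []
for _ in range(100_000):
    today = weigh()
    if today > SET_POINT + FLUCTUATION:  # "I've never been this heavy! Time to diet."
        next_month = weigh()             # the "diet" changes nothing at all
        losses.append(today - next_month)

print(f"average weight 'lost' on a do-nothing diet: {statistics.mean(losses):.1f} lb")
```

The "dieters" lose several pounds on average while doing literally nothing, which is all the testimonial a fad diet needs.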
I cannot say for certain, but I wonder if studies on borderline personality disorder (BPD) yield misleading results because of regression to the mean. Long-term follow-up studies on patients with the disorder seem to indicate that it goes away after a few years in a significant percentage of subjects. This finding is misleading, however, when you look closer.
To make the BPD diagnosis, a subject needs to exhibit 5 of the 9 possible criteria. Many of the "improved" subjects merely went from 5 criteria down to 4, and were therefore no longer diagnosed with BPD. In reality, they just became what we call "subthreshold" for the disorder. Their problematic relationships, however, were still pretty much the same.
These results could mean that subjects with BPD naturally vacillate between meeting criteria for the disorder and being subthreshold, or between exhibiting a higher number of criteria and a lower one. If so, a significant proportion of subjects who qualified for the diagnosis at the start of a long-term follow-up study were enrolled at their worst. The apparent improvement may then be nothing but regression to the mean, and say nothing significant about the long-term prognosis for the disorder.
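My hunch is simple to sketch (every number below is hypothetical, not from any actual BPD study): give each subject a stable baseline number of criteria that wobbles a little at each assessment, enroll only those who meet 5 or more at intake, and then reassess the same people later.

```python
import random

random.seed(5)

CUTOFF = 5  # criteria required for a BPD diagnosis (5 of 9)

def criteria_count(baseline):
    """Criteria observed at one assessment: the subject's stable baseline
    plus a little random wobble, kept within the 0-9 range."""
    return max(0, min(9, baseline + random.choice([-1, 0, 0, 1])))

# A population whose true baselines hover near the diagnostic threshold.
baselines = [random.choice([4, 5, 5, 6]) for _ in range(100_000)]

# Enroll only those who meet full criteria at intake...
enrolled = [b for b in baselines if criteria_count(b) >= CUTOFF]
# ...then reassess the very same people at follow-up.
still_diagnosable = sum(criteria_count(b) >= CUTOFF for b in enrolled)

print(f"enrolled with BPD at intake: {len(enrolled)}")
print(f"'remitted' at follow-up:     {1 - still_diagnosable / len(enrolled):.0%}")
```

No one's underlying condition changed, yet a sizable slice of the cohort "remits" simply because they were enrolled on a bad day.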
Other important statistical issues the author discusses clearly and brilliantly include: assuming two variables are related in a linear fashion when they are not (non-linearity, meaning cause-and-effect relationships in which an increase in one variable does not uniformly produce either an increase or a decrease in the other); torturing the data until it confesses (running multiple tests on your study data, controlling for different things, until something significant seems to pop up); and the following problem inherent in studies designed to see whether two things, like being married and smoking, are correlated:
"Surely the chance is very small that the proportion of married people is exactly the same as the proportion of smokers in the whole population. So, absent a crazy coincidence, marriage and smoking will be correlated, either positively or negatively."
Anyone who is serious about critically evaluating the medical literature owes it to themselves to read this book.