Last night on twitter there was a bit of a firestorm over this New York Times snippet about p-values (here is my favorite twitter-snark response). While the article has a surprising number of controversial sentences for only 180 words, the most offending sentence is:
By convention, a p-value higher than 0.05 usually indicates that the results of the study, however good or bad, were probably due only to chance.
One problem with this sentence is that it commits a statistical cardinal sin by stating the equivalent of “the null hypothesis is probably true.” The correct interpretation for a p-value greater than 0.05 is “we cannot reject the null hypothesis,” which can mean many things (for example, we did not collect enough data).
Another problem I have with the sentence is that the phrase “however good or bad” is incredibly misleading — it’s like saying that even if you see a fantastically big result with low variance, you still might call it “due to chance.” The idea of a p-value is that it’s a way of defining good or bad. Even if there’s a “big” increase or decrease in an outcome, is it really meaningful if there’s bigger variance in that change? (No.)
I’d hate to be accused of being an armchair critic, so here is my attempt to convey the meaning/importance of p-values to a non-stat, non-math audience. I think the key to having a good discussion of this is to provide more philosophical background (and I think that’s why these discussions are always so difficult). Yes, this is over three times the word count of the original NYTimes article, but fortunately the internet has no word limits.
(n.b. I don’t actually hate to be accused of being an armchair critic. Being an armchair critic is the best part of tweeting, by far.)
(Also fair warning — I am totally not opening the “should we even be using the p-value” can of worms.)
Putting a Value to ‘Real’ in Medical Research
by Hilary Parker
In the US judicial system, a person on trial is “innocent until proven guilty.” Similarly, in a clinical trial for a new drug, the drug must be assumed ineffective until “proven” effective. And just like in the courts, the definition of “proven” is fuzzy. Jurors must believe “beyond a reasonable doubt” that someone is guilty to convict. In medicine, the p-value is one attempt at summarizing whether or not a drug has been shown to be effective “beyond a reasonable doubt.” Medicine is set up like the court system for good reason — we want to avoid claiming that ineffective drugs are effective, just like we want to avoid locking up innocent people.
To understand whether or not a drug has been shown to be effective “beyond a reasonable doubt” we must understand the purpose of a clinical trial. A clinical trial provides an understanding of the possible improvements that a drug can cause in two ways: (1) it shows the average improvement that the drug gives, and (2) it accounts for the variance in that average improvement (which is determined by the number of patients in the trial as well as the true variation in the improvements that the drug causes).
Why are both (1) and (2) necessary? Think about it this way: if I were to tell you that the price difference between two books was $10, would you think that’s a big difference, or a small one? If it were the difference in price between two paperback books, it’d be a pretty big difference, since paperback books are usually under $20. However, if it were the difference in price between two hardcover books, it’d be a smaller difference, since hardcover books vary more in price, in part because they are more expensive. And if it were the difference in price between two ebook versions of printed books, that’d be a huge difference, since ebook prices (for printed books) are quite stable at the $9.99 mark.
We intuitively understand what a big difference is for different types of books because we’ve contextualized them — we’ve been looking at book prices for years, and understand the variation in the prices. With the effectiveness of a new drug in a clinical trial, however, we don’t have that context. Instead of looking at the price of a single book, we’re looking at the average improvement that a drug causes — but either way, the number is meaningless without knowing the variance. Therefore, we have to determine (1) the average improvement, and (2) the variance of the average improvement, and use both of these quantities to determine whether the drug causes a “good” enough average improvement in the trial to call the drug effective. The p-value is simply a way of summarizing this conclusion quickly.
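To make the book analogy concrete, here is a minimal sketch that expresses the same $10 difference as a multiple of the "typical spread" in prices for each kind of book. The spread values and the function name are made up for illustration; they are assumptions, not real pricing data.

```python
# Hypothetical typical spreads (standard deviations, in dollars) for each
# kind of book. These numbers are invented purely for illustration.
TYPICAL_SPREAD = {"ebook": 1.0, "paperback": 4.0, "hardcover": 12.0}


def difference_in_spreads(difference, spread):
    """Express a raw difference as a multiple of the typical spread."""
    return difference / spread


for kind, spread in TYPICAL_SPREAD.items():
    ratio = difference_in_spreads(10.0, spread)
    print(f"$10 between two {kind}s is {ratio:.1f} typical spreads")
```

The raw number ($10) never changes. Only the context (the spread) changes, and that is what makes the difference look huge for ebooks and modest for hardcovers.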
So, to put things back in terms of the p-value: let’s say that someone reports that NewDrug is shown to increase healthiness points by 10 points on average, with a p-value of 0.01. The p-value provides some context for whether or not an average increase of 10 points in this trial is “good” enough for the drug to be called effective, and is calculated by looking at both the average improvement, and the variance in improvements for different patients in the trial (while also controlling for the number of people in the trial). The correct way to interpret a p-value of 0.01 is: “If in reality NewDrug is NOT effective, then the probability of seeing an average increase in healthiness points at least this big if we repeated this trial is 1% (0.01*100).” The convention is that that’s a low enough probability to say that NewDrug has been shown to be effective “beyond a reasonable doubt” since it is less than 5%. (Many, many statisticians loathe this cut-off, but it is the standard for now.)
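For readers who like to see the arithmetic, here is a rough sketch of how a p-value like this could be computed from per-patient improvements. It assumes a one-sided test with a normal approximation (a real analysis of a trial this small would more likely use a t-test or something more careful), and both patient lists are invented so that each averages exactly 10 points.

```python
import math
from statistics import NormalDist, mean, stdev


def one_sided_p_value(improvements):
    """Approximate P(average increase at least this big | drug does nothing).

    Uses a normal approximation, which is an assumption of this sketch.
    """
    n = len(improvements)
    average = mean(improvements)
    # Variability of the *average*: shrinks as the number of patients grows.
    standard_error = stdev(improvements) / math.sqrt(n)
    z = average / standard_error
    return 1.0 - NormalDist().cdf(z)


# Two hypothetical trials, both with an average improvement of 10 points.
consistent = [2, 18, 6, 14, 10, 4, 16, 12, 8, 10]       # patients improve similarly
erratic = [-30, 45, 5, -20, 60, 10, -15, 40, -25, 30]   # patients vary wildly

print("consistent trial p-value:", one_sided_p_value(consistent))
print("erratic trial p-value:", one_sided_p_value(erratic))
```

Both trials report the same 10-point average, but only the consistent one lands below the 0.05 convention, because the variance in the erratic trial swamps the average.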
A key thing to understand about the p-value in a clinical trial is that a p-value greater than 0.05 doesn’t “prove” that NewDrug is ineffective — it just means that NewDrug wasn’t shown to be effective in the clinical trial. We can all think of examples of court cases where the person was probably guilty, but was not convicted (OJ Simpson, anyone?). Similarly with the p-value, if a trial reports a p-value greater than 0.05, it doesn’t “prove” that the drug is ineffective. It just means that researchers failed to show that the drug was effective “beyond a reasonable doubt” in the trial. Perhaps the drug really is ineffective, or perhaps the researchers simply did not gather enough samples to make a convincing case.
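The "not enough samples" point can be sketched the same way. The snippet below keeps the average improvement and the patient-to-patient spread fixed and only changes the number of patients. The summary numbers (10 points, spread of 30) are hypothetical, and the normal approximation is again an assumption of the sketch rather than how any particular trial was analyzed.

```python
import math
from statistics import NormalDist


def p_value_from_summary(average, spread, n_patients):
    """One-sided p-value from summary numbers, using a normal approximation."""
    standard_error = spread / math.sqrt(n_patients)
    return 1.0 - NormalDist().cdf(average / standard_error)


# Same hypothetical drug effect, different trial sizes.
for n_patients in (10, 100):
    p = p_value_from_summary(average=10.0, spread=30.0, n_patients=n_patients)
    verdict = "shown effective" if p < 0.05 else "not shown effective"
    print(f"{n_patients} patients: p = {p:.4f} ({verdict})")
```

With these made-up numbers, the small trial fails to clear 0.05 while the large one does, even though the drug's effect is identical in both. A large p-value in the small trial says more about the trial than about the drug.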
edit: In case you’re doubting my credentials regarding p-values, I’ll just leave this right here…
A pretty solid explanation. What I would like to know now is, if you put this piece and the NYTimes piece in front of someone who had never encountered any stats before, which would they prefer? Yours is more accurate, but it’s longer and requires more work to get through. Would fewer readers make it to the end? Is the additional effort to understand it worth the gain in understanding over the NYT piece?
What I’m getting at here is the difference between science and science education. One of my favorite quotes: “education is the process of telling smaller and smaller lies.”
Also, just a note: I read “however good or bad” in the original article as referring to the possibility that the drug might be causing bad effects, not just helpful ones.
Thanks for writing this.
It’s a good point, and I tried to keep the article short, but explaining it clearly to a non-expert in 180 words wasn’t possible. Probably the bigger issue is that we can’t make up for a complete lack of statistical literacy in the population with just a 180-word article. Art Benjamin gave a fantastic 3-minute TED talk about how we should change high school math education to include more statistics: http://www.ted.com/talks/arthur_benjamin_s_formula_for_changing_math_education.html
Also I agree with your interpretation of the “however good or bad” — will clarify in the post!
For what it’s worth, I read both articles through to the end.
The NYTimes one told me nothing–left me confused. (To be fair, it could have been edited down to incoherence from an originally coherent draft.) Cognitive dissonance.
This one took longer to get through, yes, but not a lot longer. And it was not more work–it was less work to read because it was understandable. No cognitive dissonance.
The NYTimes article has a little bit too much hand-waving for my taste (e.g., This number [the p stands for probability] is arrived at through a complex calculation designed to quantify…). Such black box allusions make the public uninterested in understanding the underlying mechanisms of the p-value. That being said, I don’t think the NYTimes article was terribly written or inaccurate outside of the sentence with the “good or bad result” phrase you referenced.
I think this sentence was also not so great: “This number (the p stands for probability) is arrived at through a complex calculation designed to quantify the probability that the results of an experiment were not due to chance.”
Here’s a good rebuttal to that — https://twitter.com/kevin2kane/status/311377021030772737
Question: are you already stats literate? If so, then it’s not your taste that matters here. I am curious to show this pair to someone who doesn’t already have any stats background and — like most people — doesn’t particularly care, and see which they prefer.
I just accidentally found your blog via our shared LinkedIn connection. Pretty interesting, especially for me, as a Ph.D. candidate and a beginner in statistics and quantitative research, who is trying to implement his dissertation data analysis in R.
I’d like to share with you several resources related to this post’s topic, which I thought might be of interest to you:
1. Regina Nuzzo’s article “Scientific method: Statistical errors” in Nature: http://www.nature.com/news/scientific-method-statistical-errors-1.14700;
2. Online book by Alex Reinhart, called “Statistics Done Wrong”: http://www.statisticsdonewrong.com;
3. The work by Geoff Cumming on “new statistics”, represented by his paper “The New Statistics: Why and How”: http://pss.sagepub.com/content/25/1/7.