Good news! 42% of doctors can correctly answer a true-false question on p-values! That’s only 8% worse than a coin flip!

And this paragraph is your friendly reminder that six months after this study was published, the FDA decided it was unsafe for individuals to look at their own genome since they might misunderstand the risks involved. Instead, they must rely on their doctor. I am sure that statisticians and math professors making life-changing health or reproductive decisions feel perfectly confident being at the mercy of people whose statistics knowledge is worse than chance.

Now that I’ve got the sensationalism out of the way, let’s look at this study more closely.

The sample is 4000 Ob/Gyn residents. Ob/Gyn is a prestigious specialty that’s able to select people with very good grades in medical school, so we’re not looking at dummies here. These residents (beginning doctors) did a bit worse than more experienced doctors (whose performance was still not stellar). I don’t know whether this reflects doctors learning more about statistics as they progress, better statistical education in Ye Olde Days than in the current generation, or both.

The study looked at two questions. First was the one I mentioned above: “True or false: the p-value is the probability that the null hypothesis is correct”. The correct answer is “false” – the p-value is the chance of obtaining results at least as extreme as those actually obtained if the null hypothesis were true. 42% correctly said it was false, 46% said it was true, and 12% didn’t even want to hazard a guess.

The question seems sketchy to me. It is indeed technically false, but it seems pretty close to the truth. If I were asked to explain why the definition as given was false, the best I could do is say that your probability of the null hypothesis being true should take into account both something like your p-value, and your prior. But since no one ever receives Bayesian statistical education, I am not sure it is fair to expect a doctor to be able to generate that objection. What I would want a doctor to know is that the lower the p-value, the more conclusively the study has rejected the null hypothesis. The false definition as given accurately captures that key insight. So I’m not sure it proves anything other than doctors not being really nitpicky over definitions.

(which is also false, actually)

Next came very nearly the exact same question about mammogram results as Eliezer’s Short Explanation Of Bayes Theorem. It offered five multiple-choice answers, so we would expect 20% correct by chance. Instead, 26% of doctors got it correct. What shocks me about this one is that the question very nearly does all the work for you and throws the right answer in your face. Compare the way it was phrased in Eliezer’s example:

1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

to the way it was phrased on the obstetrician study:

Ten out of every 1,000 women have breast cancer. Of these 10 women with breast cancer, 9 test positive. Of the 990 women without cancer, about 89 nevertheless test positive. A woman tests positive and wants to know whether she has breast cancer for sure, or at least what the chances are. What is the best answer?

The obstetrician study seems to be doing everything it can to guide people to the correct result, and 74% of people still got it wrong. And nitpicky definitions don’t provide much of an excuse here.
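For reference, both phrasings reduce to the same Bayes' theorem calculation. Here is a minimal Python sketch that works through each one; the function name `posterior_positive` is my own label, not something from either source.

```python
def posterior_positive(prior, sensitivity, false_positive_rate):
    """P(cancer | positive test) via Bayes' theorem."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# Eliezer's phrasing: 1% prevalence, 80% sensitivity, 9.6% false positives.
eliezer = posterior_positive(0.01, 0.80, 0.096)

# The study's phrasing, straight from the counts it hands you:
# 9 true positives and 89 false positives.
study = 9 / (9 + 89)

print(f"Eliezer's version: {eliezer:.1%}")
print(f"Study's version:   {study:.1%}")
```

Both come out below 10%. In the study's version the whole calculation is a single division of two numbers that appear in the question.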

There were three other results of this study worth highlighting.

First, people who got the statistics questions wrong were *more* likely to say they had good training in statistical literacy than those who did not, giving a rare demonstration of the Dunning-Kruger effect in the wild. Doctors who didn’t know statistics were apparently so inadequate that they didn’t realize there was any more to know, whereas those who did know some statistics at least had a faint inkling that something was missing.

Second, women rated their statistical literacy significantly worse than men did (note that a large majority of Ob/Gyn residents are women) but did not actually do any worse on the questions. This highlights an important limitation of self-report (tendency to confuse incompetence with humility) and probably has some broader gender-related implications as well.

And third, even though 42% of people got Question 1 correct and 26% of people got Question 2 correct, only 12% of people got both questions correct. Just from eyeballing those numbers, it doesn’t look like getting one question right made you much more likely to do better on the other. This is very consistent with most people lucking into the correct answer.
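The eyeballing can be made slightly less eyeball-y. A rough sketch, using only the three percentages quoted above and ignoring sampling error (the variable names are mine):

```python
p_q1 = 0.42             # answered the p-value question correctly
p_q2 = 0.26             # answered the mammogram question correctly
p_both_observed = 0.12  # answered both correctly

# If the two answers were statistically independent, the joint rate
# would simply be the product of the two marginal rates.
p_both_independent = p_q1 * p_q2

# Chance of getting Question 2 right among those who got Question 1 right.
p_q2_given_q1 = p_both_observed / p_q1

print(f"Expected if independent: {p_both_independent:.1%}")
print(f"Observed:                {p_both_observed:.1%}")
print(f"P(Q2 right | Q1 right):  {p_q2_given_q1:.1%} vs {p_q2:.1%} overall")
```

Independence predicts about 11% getting both right, against 12% observed, so knowing someone got Question 1 right barely moves their odds on Question 2.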

I do not want to use this to attack doctors. Most doctors are technicians and not academics, and they cultivate, and should cultivate, only as much statistical knowledge as is useful for them. For a technician, “a p-value is that thing that gets lower when it means there’s really strong evidence” is probably enough. For a technician, “I can’t remember what exactly the positive predictive value of a mammogram is but it doesn’t matter because you should follow up all suspicious mammograms with further testing anyway” is probably enough.

But it really does seem relevant that only 12% of doctors can answer two simple statistics questions correctly when you’re trying to deny the entire non-doctor population access to certain information because only doctors are good enough at statistics to understand it.

To be fair, the largest reason the FDA suspended 23andMe was that the company refused to reply to several requests for information. Now that they’ve had their wake-up call, I expect that the company will actually manage their relationship with the FDA, tweak their marketing, and go back to what they were doing.

How should the FDA respond to a company that repeatedly ignores requests for info?

Nothing. They should do nothing.

I think Scott’s description is more accurate than yours.

But, I think the most accurate description is that FDA doesn’t trust the public to know what it is doing, or even if it has policies.

Agreed that this was part of the problem, but read the FDA’s letter to 23andMe as well.

Or they could move jurisdiction.

For a while, my go-to fix whenever the “doctors can’t do probability very well” problem came up was “it is a legal requirement that every test result must have the true positive, false positive, true negative, and false negative numbers printed right next to it. No, not overleaf or on page XX of the standard medical reference text for this condition.”

But, well, the breast cancer study actually did that. It even made an improvement on my suggested fix, by removing the negative numbers entirely, since this is a positive result. So, damn, that’s basically an experimental test of my hypothesis and it won’t work.

Gigerenzer’s work shows that the way they presented the question–using frequencies instead of percentages–leads to more correct answers. So I think your answer should still help. It’s just that doctors are so incredibly bad at statistics that it’s not quite enough help.

From the training-related answers, it sounds like medical school just needs to start taking statistics seriously.

Why can’t the legal requirement be that the data that patients actually care about is published: if the test says positive, what’s the chance I have cancer, and if the test says negative, what’s the chance I have cancer. Why publish the true/false positive rates when this is what is actually useful?

Steven for Congress!

That would require the test providers to know the base rate, and it would make it difficult to combine the results of multiple tests.

However, maybe publish the likelihood ratio, an assumed prior, and the implied posterior probability.
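As a sketch of what that might look like, here is the odds form of Bayes' theorem applied to the study's mammogram figures. The function name is hypothetical, and multiplying several likelihood ratios together assumes the tests are conditionally independent given disease status, which real tests often are not.

```python
def posterior_from_lr(prior, likelihood_ratios):
    """Update a prior probability with one or more likelihood ratios.

    Multiplying the ratios assumes the tests are conditionally
    independent given disease status (an assumption, not a given).
    """
    odds = prior / (1 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)

# Study's mammogram figures: 9 of 10 women with cancer test positive,
# 89 of 990 women without cancer test positive.
lr_positive = (9 / 10) / (89 / 990)
assumed_prior = 10 / 1000

posterior = posterior_from_lr(assumed_prior, [lr_positive])
print(f"Likelihood ratio of a positive result: {lr_positive:.1f}")
print(f"Assumed prior: {assumed_prior:.1%}")
print(f"Implied posterior: {posterior:.1%}")
```

A reader with a different base rate (say, a family history) can swap in their own prior and reuse the same likelihood ratio.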

For p-values, I think a better explanation is — if we simplify things a bit, we can say it’s roughly supposed to represent the probability of getting this data if the hypothesis is false; whereas we want the probability of the hypothesis being false given this data. So it’s a matter of P(E|H) vs. P(H|E). While that is simplifying things a bit — the hypothesis being true and the null hypothesis being false are not the same thing, e.g. — I think it gives a better picture than “It depends on both the p-value and your prior”.

Null hypothesis: No breast cancer.

Observation: Positive test result.

Probability of an observation this extreme if the null hypothesis is true (p-value): 10%.

Actual probability that the null hypothesis is true: 90%.

As far as I can tell, “the p-value is the probability that the null hypothesis is correct” is just horrifically wrong. Killing people wrong (http://en.wikipedia.org/wiki/Prosecutor%27s_fallacy#The_Sally_Clark_case). There’s nothing nitpicky about it.

While you could argue that doctors don’t need to know what a p-value is (beyond “small value good”), that they don’t remember (or possibly never understood in the first place) such a basic concept suggests that they will generally deal pretty badly with statistics.

Agreed. Maybe you can’t blame them because Bayesianism isn’t widely enough taught, but it seems like it would be pretty valuable if we could get doctors to understand that if you throw 20 false claims at a wall, on average 1 will turn out to have a p-value less than 0.05, even if all the claims are completely ridiculous.
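The arithmetic behind that, as a quick sketch. It assumes the 20 tests are independent and that every null hypothesis really is true, so each test has a 5% chance of crossing the threshold by luck alone.

```python
alpha = 0.05          # significance threshold
n_false_claims = 20   # claims tested, all of them actually false

# Expected number of claims that reach p < 0.05 by chance alone.
expected_significant = n_false_claims * alpha

# Probability that at least one of them does.
p_at_least_one = 1 - (1 - alpha) ** n_false_claims

print(f"Expected spurious 'significant' results: {expected_significant:.1f}")
print(f"Chance of at least one: {p_at_least_one:.0%}")
```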

My nephew just did an introductory statistics course and Bayes was not mentioned at all, nor was the existence of fat-tailed distributions. He has the strong impression that the Normal distribution is ubiquitous and would almost certainly fail the above tests.

Not that fat-tailed distributions are of significance, as they only relate to things like financial risk, earthquakes, tsunamis, typhoons and hurricanes, distribution of wealth, size of wars, size of pandemics and a few other minor things.

Oh, also: The mammogram question and answer from the study, actually readable version

Ten out of every 1,000 women have breast cancer. Of these 10 women with breast cancer, 9 test positive. Of the 990 women without cancer, about 89 nevertheless test positive. A woman tests positive and wants to know whether she has breast cancer for sure, or at least what the chances are. What is the best answer?

(1) The probability that she has breast cancer is about 90% (chosen by 26%)

(2) The probability that she has breast cancer is about 81% (chosen by 11%)

(3) Out of 10 women with a positive mammogram, about 9 have breast cancer (chosen by 9%)

(4) Out of 10 women with a positive mammogram, about 1 has breast cancer (chosen by 26%) [The correct answer, according to the study]

(5) The probability that she has breast cancer is about 1% (chosen by 13%)

(6) No answer or more than one response (15%)

A few things immediately leap out at us here. First, there is no correct answer using a percentage. Second, it seems like 2 (11%), 3 (9%), 5 (13%), and 6 (15%) are explained by random chance, and we can handwave that about 12% of 1 and 4 are as well, so (statisticians forgive me) ~70% of the answers given were randomly chosen, which is still terrible. Third, for something other than chance (ideally, the technician’s statistical knowledge!) determining the answer, it’s between “90%” and “1 in 10”, which is terrible too.
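For anyone checking the handwave, here is the tally as a short sketch. The 12% baseline of guessers assumed to be hiding inside answers 1 and 4 is the comment's own guess, not a figure from the study.

```python
# Response shares (in percent) from the study's table, keyed by answer number.
shares = {1: 26, 2: 11, 3: 9, 4: 26, 5: 13, 6: 15}

# Answers treated here as pure guessing.
guess_only = [2, 3, 5, 6]
baseline = 12  # assumed share of guessers inside each of answers 1 and 4

random_share = sum(shares[k] for k in guess_only) + 2 * baseline
print(f"Attributed to guessing: ~{random_share}%")
print(f"Left over for answer 1: {shares[1] - baseline}%")
print(f"Left over for answer 4: {shares[4] - baseline}%")
```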

But there’s a note of dissonance here. “You have a 90% chance” is functionally identical to “9 out of 10 women”. Why did the percentage get 26% to the frequency’s 9%? That is not covered by random chance. They must have been using some difference between the two answers to discriminate. Maybe they were looking for a percentage to be the correct answer? (unfairly reaching here: maybe the study was designed that way on purpose?)

It’s pretty inconclusive since “they figured out (incorrectly) it was 9/10 and picked the first answer to say that” is also a good explanation, but this study’s setup bothers me. Why was no correct percentage answer provided when the question has a woman asking for “the chances”?

Something else bothers me. I feel like I’ve done a better analysis of their results than they did, and I don’t know statistics either so they can’t use that excuse.

Doesn’t it often happen that people give P(E|H) instead of P(H|E)? The 90% is just the first of the two.

It does seem odd that some of the answers have different degrees of precision. But how many significant figures would you want to see before you called an answer “correct”? Think carefully.

I’ve gotten to the point where my reaction to most medical research papers is to assume they’re trash. Either they’re looking at ridiculously small samples (“To investigate why autism has risen from 1 in 2000 children to 1 in 250, we followed a sample of 36 children for two years”) or flogging every possible correlation in a data set until they find one that breaks 5%.

> But it really does seem relevant that only 12% of doctors can answer two simple statistics questions correctly when you’re trying to deny the entire non-doctor population access to certain information because only doctors are good enough at statistics to understand it.

Well, the FDA didn’t actually say or imply anything about doctors’ statistical abilities. The FDA is trying to prevent unnecessary tests, procedures, and maybe worries on the part of the patient. So long as the doctors follow their protocols (“standard of care”?), they don’t need to know anything about statistics.

That doesn’t make the FDA’s ruling any less terrible, of course.

Wait what!? The training is inadequate, the men are overconfident, but the study recommends to address the women’s perfectly accurate feelings that their training is inadequate? How about addressing the inadequate training itself?

I think saying “we need to respond to their feelings that training is inadequate” is being used as a way of saying that we need to address those feelings by improving training.

I was more amused by the claim that because so many residents are women, it’s especially important to take their feelings into account – but only because I’ve read a lot of stuff elsewhere saying that since women are such a small minority in field X it’s especially important to take their feelings into account. If women are especially important when they’re a majority, and especially important when they’re a minority, why not just drop the pretense that their percent in the population has anything to do with it?

I think it’s assumed that men’s feelings are already taken into account. Systemically, perhaps.

Or maybe that men don’t have feelings.

Frankly, feelings is a funny word to use there; I’d say evaluation, something that implies some thought about the matter (though Jonathan’s suggestion to simply excise it and speak of the actual reality is better still). Or has our language lost the distinction between feelings and thought?

As a general rule, whenever possible, pop writeups (and in this case even original papers) of findings on gender differences are spun in ways that raise women’s status, and throw on as many epicycles as necessary. As you point out, spinning contradictory studies to point to the same conclusion is no real challenge. Presumably signalling care for women is good for status, circulation numbers, or both.

I do deplore “signalling” being used as a political weapon, especially when it’s aimed at my politics.

With all due respect Athrelon, I think your explanation is pretty lazy. With costly signalling, the motives are only half the picture. What matters more are the costs (particularly the fact that costs are different for different groups). Without paying attention to the costs, “signalling” is just our old friend the circumstantial ad hominem, dressed in the clothing of economics.

So, in the name of charity, let’s pay attention to the costs. If everyone does it then it can’t be very costly. Furthermore it doesn’t seem a great effort to write the equivalent of “women’s feelings are important”. The cost must lie in the knowledge that you ought to say women’s feelings are important in the first place.

In other words, raising women’s status is a shibboleth. Something you do because some other group doesn’t.

So who’s the other group? People who don’t want to, or don’t think to raise the status of women. Anti-feminists and people ignorant of gender politics in other words. The reasons for someone at a university to not want to appear anti-feminist are obvious enough I would presume. As a hint, Pat Robertson made some laughably anti-feminist remarks in a 1992 letter, and nobody wants to look like Pat Robertson while doing a pop science write up.

So, one way we can explain the shibboleth is that scientists and pop science writers don’t want to appear to be religious nutcases.

This would explain why we don’t see remarks that lower the status of women. I’d like to know how frequent remarks are that raise the status of women before I explain why everyone does it.

What is this utopian world you describe when you are not surrounded with examples of the Dunning-Kruger effect everywhere you turn? ‘Cuz I’m moving there, if it exists.

Aha, you must not be very good at discerning true cases of the Dunning-Kruger effect, then! 😀

Scott, I notice that you’ve added Marginal Revolution to your Links, in spite of (what I perceived as) Tyler Cowen’s condescending negative evaluation of an ancient post of yours that he dredged up, and the ensuing volley of rude and hostile comments. There’s a fine line between charity and masochism.

It reminded me that MR exists and is vaguely among the sort of things I like. I use my sidebar sort of in place of an RSS reader and it’s more to remind me of what I want to read than to endorse things or to thank them for being a good blog-ally.

To be fair, that post was really, really bad. It did lead me to your (very good) newer post though.

The Marginal Revolution comment section is one of the worst this side of a PUA blog’s though. I think the only reason I still read them is out of a misplaced sense of nostalgia.

This study seems to leave a lot to be desired. First, it uses a single sub-specialty, which makes the statement “42% of doctors can…” a bit of a generalization and not as accurate as “42% of OB/Gyn doctors can…”. Secondly, how does the ability of OB/Gyn doctors to understand p-values relate to geneticists being able to read the genome better than an untrained individual? It’s apples and oranges in a way, since OB/Gyn aren’t the ones who would be discerning someone’s genome and looking at possible occurrences of mutations and whether those mutations are even important. OB/Gyn are more likely to use p-values as described at the end of the article, as a threshold for accepting or not accepting published scientific findings. Now if the study had been done with 4,000 geneticists or other physicians who directly interpret genomic data, I would find the study more compelling evidence of the unfairness of the FDA’s decision. However, as is, my opinion of OB/Gyn doctors’ ability to do statistics and the FDA’s decision are two unrelated things.

The reason for the choice of ob/gyns and the reason that these studies always use the example of mammograms is because banning mammograms is a reasonable idea. And I don’t mean direct-to-consumer mammograms.

No, ob/gyns won’t be evaluating genetic tests (except BRCA). But neither will geneticists. The relevant comparison is genetic counselors, a job created in 1979 to counsel pregnant women about the morality of aborting Downs fetuses.

Why do you think banning mammograms is a good idea?

Great post. That’s all I can say.

You can’t actually interpret a p-value without using Bayes. It is a conditional probability – the probability of getting a result as or more extreme IF the null hypothesis was true.

Say 10% of null hypotheses can be disproven for a given study design (power, size of effect etc) and the rest are true within the limitations of our sample size.

Say we get p<0.05 and the study had 80% power. What is the probability that the null hypothesis is false?

Out of 1000 trials 100 will have a false null hypothesis.

We will get p<0.05 for 80 of these (because we have 80% power).

900 will have a true null hypothesis.

We will get p<0.05 for 45 of these.

In total we have 125 'positive' results and 45 of them are false positives.

So we can only be ~65% certain that the null hypothesis is false when p<0.05 in this situation.

If you repeat the test for all the positive trials (about 64% of which have a false null hypothesis this time around) the probability of a false positive drops to around 0.03. A Bayesian argument for confirmatory trials.

A randomised controlled trial is just a diagnostic test for treatment differences. The significance threshold plays the role of one minus the specificity (the false positive rate) and the power is the sensitivity. Calculate PPV and NPV as normal.

[This was first pointed out to me by MKB Parmar and has been written about a lot by John Ioannidis.]
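The counts above can be reproduced in a few lines. This is only a sketch of the worked example as stated (1000 trials, 10% false nulls, 80% power, a 0.05 threshold); the function name is mine, and treating the repeat trial as independent of the first is an assumption.

```python
def one_round(n_false_null, n_true_null, power=0.80, alpha=0.05):
    """Return (true positives, false positives) after one round of testing."""
    return n_false_null * power, n_true_null * alpha

# Round 1: 1000 trials, 100 with a false null hypothesis and 900 with a true one.
tp, fp = one_round(100, 900)
print(f"Round 1: {tp:.0f} true and {fp:.0f} false positives, "
      f"P(null false | p<0.05) = {tp / (tp + fp):.0%}")

# Round 2: rerun only the 'positive' trials from round 1.
tp2, fp2 = one_round(tp, fp)
print(f"Round 2: P(false positive | p<0.05 twice) = {fp2 / (tp2 + fp2):.3f}")
```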

The reason it’s false really has nothing to do with Bayesian statistics. To evaluate the probability in the question would require Bayesian statistics, but if we’re looking at p-values, we’re doing frequentist hypothesis testing, so we’re not doing Bayesian statistics. The p-value is a conditional probability, and is correctly defined in the first sentence on the Wikipedia page on p-values (there are quite different ways to write it that are equivalent, though). Specifically, the p-value is the probability of a result at least as extreme as the sample result, given the null hypothesis is true. That can be a *very* different thing to the wrong answer in the question, which we never evaluate. The p-value gives you some idea of whether ‘it was just chance’ is a plausible explanation of what you saw. If it isn’t a very plausible explanation you’re left with three possibilities: (i) H0 is true and a really low chance event occurred; (ii) one or more of the assumptions was wrong; or (iii) H0 isn’t true. Any of those explanations might be the case.

Why are you saying that a p-value has nothing to do with Bayes whilst acknowledging that it is a conditional probability?

Bayes’ theorem is how we deal with conditional probabilities. You cannot interpret a p-value without using it. The probability of obtaining a false positive depends on the probability that the null hypothesis is false. Without an estimate of that probability you have no way of interpreting a p-value.

– In the example I used (with a prior of 10% false H0s) a p-value of 0.05 gave a probability of a false positive of 0.36.

– If the prior was that we are equally likely to be testing a false H0 as a true one then, with the same 80% power, it would be about 6%.

– The two probabilities only become equal when we are 100% certain that the null hypothesis is false. In which case we would not need to do the research in the first place.

This is really basic stuff and very few researchers understand it. This is a huge failing of statistics.
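To see how strongly the answer depends on the prior, here is the same calculation as a function, holding power at 80% and the threshold at 0.05 as in the earlier example. The function name and the particular priors tried are my own choices for illustration.

```python
def false_positive_risk(prior_false_null, power=0.80, alpha=0.05):
    """P(null is true | p < alpha), given the share of tested nulls that are false."""
    true_pos = prior_false_null * power
    false_pos = (1 - prior_false_null) * alpha
    return false_pos / (true_pos + false_pos)

# Try a range of priors, from long-shot hypotheses to near-certain ones.
for prior in (0.01, 0.10, 0.50, 0.90):
    risk = false_positive_risk(prior)
    print(f"prior {prior:.0%} -> false positive risk {risk:.1%}")
```

The same p < 0.05 result means very different things depending on how plausible the hypothesis was before the study was run.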