Date: 2018-11-07
Four years ago I examined the claim that SSRIs are little better than placebo. Since then, some of my thinking on this question has changed.
First, we got Cipriani et al’s meta-analysis of anti-depressants. It avoids some of the pitfalls of Kirsch and comes to about the same conclusion. This knocks down a few of the lines of argument in my part 4 about how the effect size might look more like 0.5 than 0.3. The effect size is probably about 0.3.
Second, I’ve seen enough to realize that the anomalously low effect size of SSRIs in studies should be viewed not as an SSRI-specific phenomenon, but as part of a general trend towards much lower-than-expected effect sizes for every psychiatric medication (every medication full stop?). I wrote about this in my post on melatonin:
The consensus stresses that melatonin is a very weak hypnotic. The Buscemi meta-analysis cites this as their reason for declaring negative results despite a statistically significant effect – the supplement only made people get to sleep about ten minutes faster. “Ten minutes” sounds pretty pathetic, but we need to think of this in context. Even the strongest sleep medications, like Ambien, only show up in studies as getting you to sleep ten or twenty minutes faster; this NYT article says that “viewed as a group, [newer sleeping pills like Ambien, Lunesta, and Sonata] reduced the average time to go to sleep 12.8 minutes compared with fake pills, and increased total sleep time 11.4 minutes.” I don’t know of any statistically-principled comparison between melatonin and Ambien, but the difference is hardly (pun not intended) day and night. Rather than say “melatonin is crap”, I would argue that all sleeping pills have measurable effects that vastly underperform their subjective effects.
Or take benzodiazepines, a class of anxiety drugs including things like Xanax, Ativan, and Klonopin. Everyone knows these are effective (at least at first, before patients develop tolerance or become addicted). The studies find them to have about the same efficacy as SSRIs. You could almost convince me that SSRIs don’t have a detectable effect in the real world; you will never convince me that benzos don’t. Even morphine for pain gets an effect size of 0.4, little better than SSRIs’ 0.3 and not enough to meet anyone’s criteria for “clinically significant”. Leucht 2012 provides similarly grim statistics for everything else.
I don’t know whether this means that we should conclude “nothing works” or “we need to reconsider how we think about effect sizes”.
All this leads to the third thing I’ve been thinking about. Given that the effect size really is about 0.3, how do we square the scientific evidence (that SSRIs “work” but do so little that no normal person could possibly detect them) with the clinical evidence (that psychiatrists and patients find SSRIs sometimes save lives and often make depression substantially better)?
The traditional way to do this is to say that psychiatrists and patients are wrong. Given all the possible biases involved, they misattribute placebo effects to the drugs, or credit some cases that would have remitted anyway to the beneficial effect of SSRIs, or disproportionately remember the times the drugs work over the times they don’t. While “people are biased” is always an option, this doesn’t fit the magnitude of the clinical evidence that I (and most other psychiatrists) observe. There are patients who will regularly get better on an antidepressant, get worse when they stop it, get better when they go back on it, get worse when they stop it again, et cetera. This raises some questions of its own, like why patients keep stopping antidepressants that they clearly need in order to function, but makes bias less likely. Overall the clinical evidence that these drugs work is so strong that I will grasp at pretty much any straw in order to save my sanity and confirm that this is actually a real effect.
Every clinician knows that different people respond to antidepressants differently or not at all. Some patients will have an obvious and dramatic response to the first antidepressant they try. Other patients will have no response to the first antidepressant, but after trying five different things you’ll find one that works really well. Still other patients will apparently never respond to anything.
Overall only about 30% – 50% of the time when I start a patient on a particular antidepressant, do we end up deciding this is definitely the right medication for them and they should definitely stay on it. This fits national and global statistics. According to a Korean study, the median amount of time a patient stays on their antidepressant prescription is three months. A Japanese study finds only 44% of patients continued their antidepressants the recommended six months; an American study finds 31%.
Suppose that one-third of patients have some gene that makes them respond to Prozac with an effect size of 1.0 (very large and impressive), and nobody else responds. In a randomized controlled trial of Prozac, the average effect size will show up as 0.33 (one-third of patients get effect size of 1, two-thirds get effect size of 0). This matches the studies. In the clinic, one-third of patients will be obvious Prozac responders, and their psychiatrist will keep them on Prozac and be very impressed with it as an antidepressant and sing the praises of SSRIs. Two-thirds of patients will get no benefit, and their doctors will write them off as non-responders and try something else. Maybe the something else will work, and then the doctors will sing the praises of that SSRI, or maybe they’ll just say it’s “treatment-resistant depression” and so doesn’t count.
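To make the arithmetic concrete, here is a minimal simulation sketch of that thought experiment. Everything in it is an assumption for illustration: the function name `simulated_effect_size`, the normally distributed scores, and the one-third / 1.0 figures taken from the paragraph above. It is not a model of any real trial.

```python
import random
import statistics


def simulated_effect_size(n_per_arm=30000, responder_fraction=1 / 3,
                          responder_shift=1.0, seed=0):
    """Standardized mean difference when only some patients respond.

    Scores are improvement in units of the placebo arm's standard deviation.
    All parameter values are illustrative assumptions, not trial data.
    """
    rng = random.Random(seed)
    placebo = [rng.gauss(0.0, 1.0) for _ in range(n_per_arm)]
    drug = []
    for _ in range(n_per_arm):
        # A responder's expected improvement is shifted; everyone else's is not.
        shift = responder_shift if rng.random() < responder_fraction else 0.0
        drug.append(rng.gauss(shift, 1.0))
    # Equal arm sizes, so the pooled variance is the plain average of the two.
    pooled_sd = ((statistics.variance(drug) + statistics.variance(placebo)) / 2) ** 0.5
    return (statistics.fmean(drug) - statistics.fmean(placebo)) / pooled_sd


print(round(simulated_effect_size(), 2))
```

The raw difference in means comes out near one-third, as in the paragraph above. The standardized figure lands slightly below that, because mixing responders with non-responders also widens the spread of the drug arm.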
In other words, doctors’ observation “SSRIs work very well” is an existence statement “there are some patients for whom SSRIs work very well” – and not a universal observation “SSRIs will always work well for all patients”. Nobody has ever claimed the latter so it’s not surprising that it doesn’t match the studies.
I linked Gueorguieva and Krystal on the original post; they are saying some kind of much more statistically sophisticated version of this. But I can’t find any other literature on this possibility, which is surprising, because if it were true it should be pretty obvious, and if it were false it should still be worth somebody’s time to debunk.
If this were true, it would strengthen the case for the throughput-based model I talk about in Recommendations vs. Guidelines and Anxiety Sampler Kits. Instead of worrying only about a medicine’s effect size and side effects, we should worry about whether it is a cheap experiment or an expensive experiment. Imagine a drug that instantly cures 5% of people’s depression, but causes terrible nausea in the other 95%. The traditional model would reject this drug, since its effect size in studies is low and it has severe side effects. On the throughput model, give this drug to everybody, 5% of people will be instantly cured, 95% of people will suffer nausea for a day before realizing it doesn’t work for them, and then the 5% will keep taking it and the other 95% can do something else. This is obviously a huge exaggeration, but I think the principle holds. If there’s enough variability, the benefit-to-side-effect ratio of SSRIs is interesting only insofar as it tells us where in our guideline to put them. After that, what matters is the benefit-to-side-effect ratio for each individual patient.
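The exaggerated example above can be tallied in a few lines. This is only a sketch under the stated assumptions (non-responders quit after a fixed trial period and lose nothing else); `throughput_summary` and its parameter names are made up for illustration.

```python
def throughput_summary(n_patients, cured_per_hundred, trial_days):
    """Tally a try-it-and-see policy for a drug that helps only a few people.

    Assumes non-responders stop after `trial_days` and lose nothing else.
    """
    cured = n_patients * cured_per_hundred // 100
    not_cured = n_patients - cured
    side_effect_days = not_cured * trial_days
    return {
        "cured": cured,
        "side_effect_days": side_effect_days,
        "side_effect_days_per_cure": side_effect_days / cured if cured else None,
    }


print(throughput_summary(n_patients=1000, cured_per_hundred=5, trial_days=1))
```

With the numbers from the example (5% cured, one day of nausea), 1,000 patients yield 50 cures at a cost of 950 patient-days of nausea, or 19 nausea-days per cure. Whether that trade is worth it is the kind of question the throughput model asks.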
I don’t hear this talked about much and I don’t know if this is consistent with the studies that have been done.
Fourth, even though SSRIs are branded “antidepressants”, they have an equal right to be called anti-anxiety medications. There’s some evidence that they may work better for this indication than for depression, although it’s hard to tell. I think Irving Kirsch himself makes this claim: he analyzed the efficacy of SSRIs for everything and found a “relatively large effect size” of 0.7 for anxiety (though the study was limited to children). Depression and anxiety are highly comorbid and half of people with a depressive disorder also have an anxiety disorder; there are reasons to think that at some deep level they may be aspects of the same condition. If SSRIs effectively treated anxiety, this might make depressed people feel better in a way that doesn’t necessarily show up on formal depression tests, but which they would express to their psychiatrist as “I feel better”. Or, psychiatrists might have a vague positive glow around SSRIs if they successfully treat their anxiety patients (who may be the same people as their depression patients) and not be very good at separating that positive glow into “depression efficacy” and “anxiety efficacy”. Then they might believe they’ve had good experiences with using SSRIs for depression.
I don’t know if this is true and some other studies find that results for anxiety are almost as abysmal as for depression.
To clarify: were the analysed SSRI papers for a single medication each, or were the papers testing the entire rigmarole of trying out a bunch of different SSRIs like what psychs actually do? I skimmed the abstracts and wasn’t really sure (is that what the “arms” thing is about?), but if it’s the former… well that seems like giving people randomly sized pairs of shoes without fitting them, and then concluding footwear provides no statistically significant amount of comfort.
They were for a single SSRI each. Most of the trials that get analyzed for this sort of thing are drug companies doing the studies to prove their drug works to the FDA, so usually it will just include their drug.
Oh. Well different variants of SSRIs can affect the same person in wildly different ways, right? If so that seems like it’d significantly weaken the conclusions we can draw from this meta analysis.
Continuing the shoe analogy, running a study on one SSRI is like running a study on the comfort of size 7 running shoes. You’re gonna have a group of thrilled runners, maybe another group of mildly content folk within half a size, and a whole bunch of people who toss the shoes after 5 minutes. Even if you have repeat studies for all the shoe sizes, you’re not testing the effectiveness of shoes as actually used by the population – fitted for their actual size – you’re just figuring out that people have shoe sizes.
(Treatment resistant depression is like a barefoot runner or someone allergic to nylon in this tortured analogy)
How about instead of randomly trying different SSRI’s for each patient:
1) try the first one randomly
2) try the second one randomly
3) figure out the statistical relationship between which one was not working first and which one was working second, that is, is antidepressant D more likely to work for people for whom A didn’t work, or B didn’t work, or C didn’t work?
Ultimately what we want to know is the exact biological difference and reason that makes one work and the other not for a person. But until we know that it can be statistically approximated, if D is more likely to work for those for whom A didn’t than for those whom B or C didn’t, it not only improves treatment but also can give hints about the reason.
Also, I suspect experienced psychiatrists were already spotting such patterns, it is just not yet formally analyzed.
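The tabulation in step 3 could look something like the sketch below. The record format, the drug labels, and the function name `second_line_success_rates` are all hypothetical, and the sample records are invented purely to show the shape of the output.

```python
from collections import defaultdict


def second_line_success_rates(records):
    """Estimate P(second drug works | which drug failed first).

    `records` is an iterable of (first_drug_that_failed, second_drug, worked).
    """
    counts = defaultdict(lambda: [0, 0])  # (first, second) -> [successes, attempts]
    for first, second, worked in records:
        cell = counts[(first, second)]
        cell[0] += int(worked)
        cell[1] += 1
    return {pair: successes / attempts for pair, (successes, attempts) in counts.items()}


# Invented records, for illustration only.
records = [
    ("A", "D", True), ("A", "D", True),
    ("B", "D", False), ("B", "D", True),
]
for pair, rate in sorted(second_line_success_rates(records).items()):
    print(pair, rate)
```

A real analysis would need far more records per cell than this, plus some accounting for the fact that many patients improve regardless of which drug they are on.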
Lots of people have brought that up. I think the main barrier is getting a big enough dataset. You would either have to do it as a formal study, or just sort of scattershot encourage people to report what had happened naturalistically. If the former, the study would have to be gigantic. If the latter, the data wouldn’t be very trustworthy, and it would be hard to get people to report it in a trustworthy and privacy-law-maintaining way.
An interesting recent paper on different types of depression, including one which is SSRI-resistant: https://www.sciencedaily.com/releases/2018/10/181031093337.htm
Important section:
The three distinct sub-types of depression were characterized by two main factors: functional connectivity patterns synchronized between different regions of the brain and childhood trauma experience. They found that the brain’s functional connectivity in regions that involved the angular gyrus — a brain region associated with processing language and numbers, spatial cognition, attention, and other aspects of cognition — played a large role in determining whether SSRIs were effective in treating depression.
Patients with increased functional connectivity between the brain’s different regions who had also experienced childhood trauma had a sub-type of depression that is unresponsive to treatment by SSRIs drugs, the researchers found. On the other hand, the other two subtypes — where the participants’ brains did not show increased connectivity among its different regions or where participants had not experienced childhood trauma — tended to respond positively to treatments using SSRIs drugs.
Oh yeah, the three subtypes of depression paper. Lots of fun, just like the 5 subtypes of depression paper last year, or the unrelated 3 subtypes of depression paper also from last year, and the 4 subtypes of depression paper that came out in 2016, and the traditional 2 subtypes that have been recognized since forever.
Hm. This association with cognition sounds like being smart. But at least CBT works better if you are smart? A quick googling seems to show that IQ is not positively correlated with CBT efficacy and some results even show a negative correlation. Being good at cognition does not imply that leveraging your cognition works well for therapy. Hm. Brains are weird! Are there at least anecdotal stories about what tends to work for smart people with childhood trauma? This might characterize some SSCers, I suspect.
It seems like we need to update our language for interpreting statistics to describe treatments that “work really well for some people, but not for most.” Because we really should support these treatments, rather than knocking them out because of low-average effect size.
As far as pricing the cost-benefits, we should also look at the cost-benefit for a patient to engage in the whole category. “Is your depression so bad that it’s worth taking on possibly 5-20 nausea-inducing drugs to find one that fixes your depression?” i.e., Are you willing to go down a chemical rabbit hole to find a cure?
PS. Can I get some pointers on how to interpret effect sizes? In particular, what units are they? (Google isn’t helping me answer this question)
See the discussion of effect size in Part 4 of http://slatestarcodex.com/2014/07/07/ssris-much-more-than-you-wanted-to-know/
You would expect antibiotics to work really well for people with a bacterial infection and not at all for people with a viral infection. Because they are different diseases. My point is, the language of the illness is what needs to be updated. Sometimes illnesses are categorized by symptoms; this seems to be the case with depression. Sometimes by mechanism, like infections. Every time our child brings home yet another virus from the kindergarten my wife and I get different symptoms, and then often my doc tells me great, you have caught a bacterial infection on top of the viral one, and yet my symptoms did not change. Obviously it is better to categorize illnesses by their mechanism as this is what matters for treatment, and this scenario obviously suggests different mechanisms.
And if you have no idea about the mechanism, name it after the treatment. “antibiotic responding infection” is a weird, but I think not horribly bad way to call a bacterial infection. At least way better than to call it “coughing and sore throat”. The first gives more clues to what is going on.
So my point is “antibiotics work really well for some people with coughing and sore throat and not for others” and then talking about effect size is just the wrong way to talk about it, IMHO. If we haven’t even discovered bacteria yet, and don’t know garlic and honey are antibiotics, we just know sometimes garlic and honey helps with sore throats and sometimes not, call the kinds where it helps “sore throats responding to garlic and honey”. Don’t talk about effect sizes, talk about potentially, likely, different illnesses/mechanisms.
I like this approach, but I think the heart of the issue is that we *don’t* understand the mechanisms of depression yet, and until we do making such a distinction is hard. It seems like the best we can do today is develop a graph of people’s reactions to different SSRIs and try to find relationships.
The units of effect sizes are somewhat irrelevant as the purpose of effect size is to arrive at a context-independent comparison of relative magnitudes of effect. Units are one of the things that effect sizes are specifically designed to elide.
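As a sketch of why the units drop out, assuming the effect sizes under discussion are standardized mean differences such as Cohen's d: the difference between group means is divided by a pooled standard deviation measured in the same units, so the units cancel. The scores below are made up for illustration.

```python
import statistics


def cohens_d(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(treated), len(control)
    pooled_variance = (
        (n1 - 1) * statistics.variance(treated)
        + (n2 - 1) * statistics.variance(control)
    ) / (n1 + n2 - 2)
    return (statistics.fmean(treated) - statistics.fmean(control)) / pooled_variance ** 0.5


# Made-up improvement scores, recorded in minutes.
treated_minutes = [12, 15, 9, 14, 10]
control_minutes = [10, 13, 8, 11, 9]

# The same scores converted to seconds.
treated_seconds = [60 * x for x in treated_minutes]
control_seconds = [60 * x for x in control_minutes]

print(cohens_d(treated_minutes, control_minutes))
print(cohens_d(treated_seconds, control_seconds))
```

Both calls print the same value up to floating-point rounding, since rescaling multiplies the numerator and the denominator by the same factor. An effect size of 0.3 therefore means "0.3 standard deviations of whatever was measured".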
Hmm, to make sure I understand: is the argument that the effect size could be considered clinically insignificant only because it’s large in some people and nonexistent in others?
I don’t know anything about psychiatric trials, but anecdotally from my experience many years ago following biotech trials for a finance company, identifying subpopulations for whom the drug worked was very much discussed. (Usually in the context of an otherwise failed Phase II trial – well, it didn’t work *overall*, but look at right here….! Many drugs are developed by niche biotech companies around a particular idea, and so the company is trying to salvage their existence.) This was a decade ago so I’m struggling to remember particular examples, unfortunately.
I found a couple examples of recent studies[0][1] trying to use this idea, though. I’m surprised it’s not happening the way you’re saying in psychiatry; it seems like a no-brainer. But I think it’s fairly common in other fields of medicine.
Would those subpopulations still be findable even if there was no visible thing linking them together?
So if for example whenever someone notices they have depression they have essentially rolled a 10 sided die, and this medication only helps if you rolled a 7 on that die, would you be able to pick out a subpopulation of that if you have no access to the die that was rolled?
You’ve previously written about how all forms of therapy perform equally poorly, and the only variable which correlates with success is “an authoritative-sounding psychologist.” Despite this, many people swear by their preferred form of psychotherapy. Elsewhere, I’ve heard it claimed that 12-step programs perform as well as placebo. Despite this, the 12-step program maintains a large following.
Isn’t this a good explanation for both these phenomena? It’s not obvious to me that we should expect different human beings to react similarly given similar stimuli. In fact, to me, it’s counterintuitive to expect that they would.
More generally, doesn’t this reveal a pretty fundamental problem with psychological research as it’s conducted today? Researchers subject their test subjects to some experimental conditions, and then measure those subjects’ resulting behavior. But the “experimental conditions” interact, in an extremely unpredictable and non-linear manner, with the test subjects’ psychological states. To further complicate things, we have no good way of quantifying or even qualifying those psychological states, which means we cannot test for them, let alone control for them! From the subjects’ biochemistry, to what they read yesterday in the paper, to whether their wives and husbands kissed them goodbye on their way to the study — there are too many significant variables for us to enumerate, not to mention the impossibility of measuring them.
What I understand from this blog is that we try to separate the effects of “shared” and “non-shared environment.” But it seems to me (and what you’ve written increases my suspicion) that even “shared environment” is functionally non-shared. To give an example, we think that, say, parenting style is a form of “shared environment.” But the same parenting style may have vastly different interactions with two identical twins, growing up in the very same house at the very same time. Does strict religious parenting work? Bishop Ned had identical twins. One went to Harvard, the other died of an OD. What was the difference? The second’s faith was broken by an early experience of great hypocrisy, whereas the first’s faith remained intact. Consequently, the “same parenting” had vastly different effects on genetically identical children living in the same house at the same time.
The result is that psychological research, as it’s practiced today, can only figure out the most obvious, and usually boring, facets of human psychology. It can pinpoint those things which truly don’t vary with a subject’s psychological state. If you give me a research lab, I’ll be able to show you with a high degree of certainty that siccing dogs on research subjects ignites their adrenal response. But if you ask me to give a good predictor of their relative levels of success, or happiness, or capacity for dynamism, I won’t be able to explain more than 50% of the data, no matter how many trials you let me run.
I like different foods than my friends. Different women attract me. Particular personalities attract and repel me. We know, intuitively, that the same stimuli affect different people differently, and even the same people differently from second to second. Does psychological research have any real solution to this problem?
Honestly, I’d argue that taking the mean and ignoring all other data is almost always summarizing to the point of uselessness, and depending on the kind of data, the average can take on a completely non-sensical value. Distribution matters, and to that end, there’s a huge difference between treatments that have roughly the same effect on everyone, those with a wide range of equally likely effects, and those that seem to have an all or nothing effect.
The first might have mean = median = mode with little difference between min and max. The second might have mean = median, no mode, a very wide range and percentiles that are roughly linear. The third might have a mean larger than median, a mode near zero, a wide range, and very few intermediate values, with the 70th percentile very low and the third quartile near the max. Even if all three have the same average, these represent very different distributions (a low average might indicate the first is generally worthless, the second is good for a few, decent for a few, poor for a few, and useless for a few, and the third is great for a few and useless for everyone else, but the mean alone doesn’t tell us which we’re working with).
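Here is a small sketch of those three shapes, using invented numbers chosen so that every set averages 0.3. The helper names `percentile` and `summarize` are just for this example, and the percentile uses the simple nearest-rank rule rather than interpolation.

```python
import statistics


def percentile(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # ceiling of p * n / 100
    return ordered[rank - 1]


def summarize(values):
    return {
        "mean": round(statistics.fmean(values), 3),
        "median": round(statistics.median(values), 3),
        "p70": round(percentile(values, 70), 3),
        "p75": round(percentile(values, 75), 3),
        "max": round(max(values), 3),
    }


# Three made-up sets of 100 individual effects, each averaging 0.3.
same_for_everyone = [0.3] * 100
evenly_spread = [0.6 * i / 99 for i in range(100)]
all_or_nothing = [0.0] * 70 + [1.0] * 30

for name, values in [
    ("same for everyone", same_for_everyone),
    ("evenly spread", evenly_spread),
    ("all or nothing", all_or_nothing),
]:
    print(name, summarize(values))
```

The means agree, but the all-or-nothing set has a median of zero, a 70th percentile of zero, and a 75th percentile at the maximum, which is the signature described above.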
Right, if there are super-responders, that’s probably clear from the data. But no one publishes the raw data and it’s plausible that no one looks at the distribution. If people were studying this hypothesis, they would also compare results after 6 months to results after 1 year.
So, how do we get away from “single number” statistics? This just further convinces me that one of the problems with modern statistics is the focus on looking at whether a single number is above or below a threshold. What can we do to encourage statistics to be done more holistically?
The go/no-go decision is ultimately going to be a threshold; the issue as I see it is that the statistic being used is an inadequate proxy for the outcome of interest.
ETA: e.g., the ratio of average benefit to average side-effect will (very likely) differ from the average ratio of benefit to side-effect; I have no idea what math is actually done, but I wouldn’t be the least surprised if it was somewhat simplistic (most people aren’t statisticians).
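A tiny worked example of that point, with invented per-patient scores in arbitrary units (nothing here reflects how any regulator or trial actually computes this):

```python
import statistics

# Made-up per-patient scores: (benefit, side-effect burden).
patients = [(4.0, 1.0), (1.0, 2.0), (3.0, 3.0)]

benefits = [benefit for benefit, _ in patients]
burdens = [burden for _, burden in patients]

# Summarize each quantity first, then divide.
ratio_of_averages = statistics.fmean(benefits) / statistics.fmean(burdens)

# Divide within each patient first, then summarize.
average_of_ratios = statistics.fmean([benefit / burden for benefit, burden in patients])

print(round(ratio_of_averages, 2), round(average_of_ratios, 2))
```

Here the ratio of averages is 4/3 while the average of per-patient ratios is 11/6, so the two summaries can disagree noticeably even on three patients.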
Better yet, report your data alongside summary statistics. Then people can run whatever analysis they want.
There’s been such a big trend of “New meta-analysis makes effect X disappear” reports that it has me wondering about meta-analysis itself. I’m not enough of an expert to have a grounded opinion, but is it possible that there’s something about the way these recent meta-analysis studies are being done that is guaranteed to make effects diminish or vanish entirely?
How do OCD and Tourette’s respond? My impression was that the conditions are on a spectrum that looks like
Depression – – Anxiety – – OCD – – Tourette’s
I have OCD, with a sprinkling of ticcing, and the first thing to (mostly) go away when I go on an SSRI and the first thing to come back when I go off is the ticcing. This seems easier to quantify and to be at least somewhat objective about.
You skipped past the placebo effect pretty quickly in this post. Isn’t it also consistent with the evidence that most of the clinically-observed benefits of SSRIs could be obtained by psychiatrists prescribing placebos? And we just don’t know because no psychiatrists actually do that? For example,
Would these same patients also regularly get better on a placebo, get worse when they stop it, etc.? Nobody’s tried placebos in a clinical setting enough to know, I assume.
I guess this doesn’t address your point about similarly-low effect sizes for benzos and morphine.
(Disclosure: I’ve been on an SSRI for ~15 years. At this point I mostly take it because why risk changing what works? But I remain curious whether if someone switched my pills for placebos, whether my lived experience would change in any way.)
He spends almost the whole original SSRI post discussing that, I’d recommend checking it out, it’s worth the read.
Couldn’t some statistical analysis of the trials data differentiate between a case where people generally have a lousy response to the therapy and a case where some subgroups have large reactions, and some barely any? Like testing whether the distribution of the effects is gaussian?
If I understand correctly this pattern is also consistent with all SSRIs not having much effect: depression occurs in irregular remission-relapse cycles, so the patient shows up at the doctor’s office when their symptoms are especially bad, the doctor prescribes them an SSRI, after a few weeks they conclude it doesn’t work, they prescribe them another SSRI, and so on, until depression spontaneously goes into remission and this is interpreted as the particular SSRI that the patient is on having cured their depression. After some time the patient stops taking the SSRI, and after some more time their depression relapses, they go back to the doctor and the doctor interprets the relapse as caused by having stopped taking the medication.
This hypothesis implies that taking up the same SSRI again will probably not do much, because it wasn’t doing much in the first place, but this is probably confounded by the placebo effect and the patient feeling ashamed for having stopped taking a medication. Is this plausible?
EDIT:
Assuming, on the contrary, that SSRIs can have large effects with large individual variability, is there any biological theory for why this might be the case? Something like receptors having different shapes in different people and thus having different chemical affinity to different SSRI molecules? Or depression being multi-causal and different types of SSRIs acting on different causes? (This sounds strange, because the general mechanism of action of SSRIs should be the same.)
That doesn’t track very well with my experience stopping SSRIs.
I was just on one for 9 months for depression/anxiety. While on it, my depression and anxiety improved and by the end my general mood was much higher than it had been for at least a year, maybe more. However, I also had side effects: my focus was worse, I couldn’t enjoy the occasional drink, I felt like analytical conversations were harder (I had no awareness of a verbal train of thought, and I couldn’t compose replies at the same time I was listening and processing arguments). Also, having to take a daily medication is a pain in the ass, especially if you have trouble remembering things like bringing your pills with you.
So I stopped, and after re-adjusting my depression and anxiety are worse again. The train of thought is back, and it turns out my brain was using that for unhelpful thoughts. It’s harder to be happy.
Fortunately, I’m looking at this like a fascinating A/B test in software, and I’m back to CBT with a specific focus on “hey, I’m now thinking and doing these unhelpful things, can we address those?” But I’m not sure if that will keep working the next time I get a big dose of life stress, and if I hadn’t stopped and gone “wait a minute, this bullshit is back!” it’d be easy to slide down into old holes.
About three months ago I started on the new migraine drugs, the CGRP inhibitors, and went from ~20 headache days per month to ~2 headache days per month. On paper, the “effect size” of this drug is small, on average only shaving off 1-2 headache days per month for the average recipient. If you dig into the publications, the distribution of responses is roughly Pareto distributed, with most people not responding very much and a relative minority responding quite strongly.
Yet I have heard first-hand accounts of neurologists telling their patients that this drug “barely works” and isn’t even worth trying. I can only assume that they glance at the average effect size and dismiss the drug. This means there is a fraction of their patients who could be almost completely cured by the drug (like me) who are not even getting a chance to try it. Assuming my interpretation of their reasoning is true, this is absolutely maddening. It seems like the most basic understanding of statistics should screen off this kind of mistake.
more things (including people) should be thought of in terms of (nonlinear) dynamical systems. how you feel, and how you’re “doing” is the result of a whole bunch of interconnected terms, like the amount of sleep you got, and how well you cope with adversity, and whatever the hell is going on with your brain chemistry.
a tiny effect like 10 more minutes of sleep per night isn’t just 10 more minutes of sleep, it’s also a reduction in the daily stress generated by not enough sleep, which allows you more coping budget for other things, which gets you to the end of the day feeling better, which makes it easier to sleep, which…
the sleep meds can be viewed as forcing functions acting on the dynamic system, same as other meds. just pushing a little bit every day in a good direction can get the overall state of the dynamic system to stay in a better region.
Seems like you could test #3 by just looking at the data. If the treatment group has a high standard deviation, you can infer a responders/nonresponders narrative.
… if only the data were public.
Yeah, it seems like we want at least a graph that shows the distribution of effect sizes for individuals, and I’m kind of surprised we don’t have this already, given that this hypothesis seems both fairly obvious and testable.
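The standard-deviation check suggested above can be sketched on simulated data. The two scenarios, the normal distributions, and the names `variance_ratio` and `simulate_arms` are all assumptions made for illustration. Both scenarios give the same average benefit of about a third of a standard deviation.

```python
import random
import statistics


def variance_ratio(treated, control):
    """Treated-arm variance divided by control-arm variance."""
    return statistics.variance(treated) / statistics.variance(control)


def simulate_arms(responder_fraction, responder_shift, n=20000, seed=1):
    """Simulate one trial; scores are in control-arm standard deviations."""
    rng = random.Random(seed)
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [
        rng.gauss(responder_shift if rng.random() < responder_fraction else 0.0, 1.0)
        for _ in range(n)
    ]
    return treated, control


# Everyone improves a little versus one-third of patients improving a lot.
uniform = variance_ratio(*simulate_arms(responder_fraction=1.0, responder_shift=1 / 3))
mixed = variance_ratio(*simulate_arms(responder_fraction=1 / 3, responder_shift=1.0))
print(round(uniform, 2), round(mixed, 2))
```

In this setup the responder scenario inflates the treated arm's variance by only about 2/9, so the ratio sits near 1.2 rather than 1.0. That is detectable with tens of thousands of patients but easy to miss in a small trial, and a higher variance could have other causes besides a responder subgroup.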
Are these results that surprising? These drugs have obvious acute effects, but the studies aren’t of the acute effects. I’m not surprised that morphine for chronic back pain scores badly.
I’m not sure about benzos. Is the normal usage ad libitum for panic attacks or daily usage? Is this study daily usage? Elimination of panic attacks is a dramatic effect, but would it show up in the HAM-A?
