SSRIs: An Update

Four years ago I examined the claim that SSRIs are little better than placebo. Since then, some of my thinking on this question has changed.

First, we got Cipriani et al’s meta-analysis of anti-depressants. It avoids some of the pitfalls of Kirsch and comes to about the same conclusion. This knocks down a few of the lines of argument in my part 4 about how the effect size might look more like 0.5 than 0.3. The effect size is probably about 0.3.

Second, I’ve seen enough to realize that the anomalously low effect size of SSRIs in studies should be viewed not as an SSRI-specific phenomenon, but as part of a general trend towards much lower-than-expected effect sizes for every psychiatric medication (every medication full stop?). I wrote about this in my post on melatonin:

The consensus stresses that melatonin is a very weak hypnotic. The Buscemi meta-analysis cites this as their reason for declaring negative results despite a statistically significant effect – the supplement only made people get to sleep about ten minutes faster. “Ten minutes” sounds pretty pathetic, but we need to think of this in context. Even the strongest sleep medications, like Ambien, only show up in studies as getting you to sleep ten or twenty minutes faster; this NYT article says that “viewed as a group, [newer sleeping pills like Ambien, Lunesta, and Sonata] reduced the average time to go to sleep 12.8 minutes compared with fake pills, and increased total sleep time 11.4 minutes.” I don’t know of any statistically-principled comparison between melatonin and Ambien, but the difference is hardly (pun not intended) day and night. Rather than say “melatonin is crap”, I would argue that all sleeping pills have measurable effects that vastly underperform their subjective effects.

Or take benzodiazepines, a class of anxiety drugs including things like Xanax, Ativan, and Klonopin. Everyone knows these are effective (at least at first, before patients develop tolerance or become addicted). The studies find them to have about the same efficacy as SSRIs. You could almost convince me that SSRIs don’t have a detectable effect in the real world; you will never convince me that benzos don’t. Even morphine for pain gets an effect size of 0.4, little better than SSRIs’ 0.3 and not enough to meet anyone’s criteria for “clinically significant”. Leucht 2012 provides similarly grim statistics for everything else.

I don’t know whether this means that we should conclude “nothing works” or “we need to reconsider how we think about effect sizes”.

All this leads to the third thing I’ve been thinking about. Given that the effect size really is about 0.3, how do we square the scientific evidence (that SSRIs “work” but do so little that no normal person could possibly detect them) with the clinical evidence (that psychiatrists and patients find SSRIs sometimes save lives and often make depression substantially better)?

The traditional way to do this is to say that psychiatrists and patients are wrong. Given all the possible biases involved, they misattribute placebo effects to the drugs, or credit some cases that would have remitted anyway to the beneficial effect of SSRIs, or disproportionately remember the times the drugs work over the times they don’t. While “people are biased” is always an option, this doesn’t fit the magnitude of the clinical evidence that I (and most other psychiatrists) observe. There are patients who will regularly get better on an antidepressant, get worse when they stop it, get better when they go back on it, get worse when they stop it again, et cetera. This raises some questions of its own, like why patients keep stopping antidepressants that they clearly need in order to function, but makes bias less likely. Overall the clinical evidence that these drugs work is so strong that I will grasp at pretty much any straw in order to save my sanity and confirm that this is actually a real effect.

Every clinician knows that different people respond to antidepressants differently or not at all. Some patients will have an obvious and dramatic response to the first antidepressant they try. Other patients will have no response to the first antidepressant, but after trying five different things you’ll find one that works really well. Still other patients will apparently never respond to anything.

Overall only about 30% – 50% of the time when I start a patient on a particular antidepressant, do we end up deciding this is definitely the right medication for them and they should definitely stay on it. This fits national and global statistics. According to a Korean study, the median amount of time a patient stays on their antidepressant prescription is three months. A Japanese study finds only 44% of patients continued their antidepressants the recommended six months; an American study finds 31%.

Suppose that one-third of patients have some gene that makes them respond to Prozac with an effect size of 1.0 (very large and impressive), and nobody else responds. In a randomized controlled trial of Prozac, the average effect size will show up as 0.33 (one-third of patients get effect size of 1, two-thirds get effect size of 0). This matches the studies. In the clinic, one-third of patients will be obvious Prozac responders, and their psychiatrist will keep them on Prozac and be very impressed with it as an antidepressant and sing the praises of SSRIs. Two-thirds of patients will get no benefit, and their doctors will write them off as non-responders and try something else. Maybe the something else will work, and then the doctors will sing the praises of that SSRI, or maybe they’ll just say it’s “treatment-resistant depression” and so doesn’t count.
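The arithmetic in this thought experiment can be checked directly. The sketch below is illustrative only: the function name and the assumption that outcomes have the same standard deviation within each subgroup are mine, not taken from any of the studies. It computes both the raw average effect and the slightly smaller standardized effect a trial would report, since mixing responders with non-responders widens the spread in the treated arm.

```python
import math

def trial_effect_size(responder_fraction, responder_effect, sd=1.0):
    """Average and standardized effect for an all-or-nothing drug.

    Assumes (hypothetically) that placebo outcomes have standard deviation
    `sd`, that responders are shifted by `responder_effect * sd`, and that
    non-responders are not shifted at all.
    """
    p = responder_fraction
    shift = responder_effect * sd
    mean_diff = p * shift  # average improvement over placebo
    # Variance of the treated arm: within-group noise plus the extra spread
    # created by mixing responders with non-responders.
    treated_var = sd ** 2 + p * (1 - p) * shift ** 2
    pooled_sd = math.sqrt((sd ** 2 + treated_var) / 2)
    return mean_diff / sd, mean_diff / pooled_sd

raw, standardized = trial_effect_size(1 / 3, 1.0)
print(f"average effect in placebo-SD units: {raw:.2f}")
print(f"effect size using the pooled SD:    {standardized:.2f}")
```

Under these assumptions the two numbers come out close to each other (roughly 0.33 and 0.32), so the "one-third of patients get 1.0" story and a measured effect size of about 0.3 are at least arithmetically compatible.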

In other words, doctors’ observation “SSRIs work very well” is an existence statement “there are some patients for whom SSRIs work very well” – and not a universal observation “SSRIs will always work well for all patients”. Nobody has ever claimed the latter so it’s not surprising that it doesn’t match the studies.

I linked Gueorguieva and Krystal on the original post; they are saying some kind of much more statistically sophisticated version of this. But I can’t find any other literature on this possibility, which is surprising, because if it were true it should be pretty obvious, and if it were false it should still be worth somebody’s time to debunk.

If this were true, it would strengthen the case for the throughput-based model I talk about in Recommendations vs. Guidelines and Anxiety Sampler Kits. Instead of worrying only about a medicine’s effect size and side effects, we should worry about whether it is a cheap experiment or an expensive experiment. Imagine a drug that instantly cures 5% of people’s depression, but causes terrible nausea in the other 95%. The traditional model would reject this drug, since its effect size in studies is low and it has severe side effects. On the throughput model, give this drug to everybody, 5% of people will be instantly cured, 95% of people will suffer nausea for a day before realizing it doesn’t work for them, and then the 5% will keep taking it and the other 95% can do something else. This is obviously a huge exaggeration, but I think the principle holds. If there’s enough variability, the benefit-to-side-effect ratio of SSRIs is interesting only insofar as it tells us where in our guideline to put them. After that, what matters is the benefit-to-side-effect ratio for each individual patient.
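To make the cheap-versus-expensive-experiment idea concrete, here is a minimal sketch. Every number and name in it is hypothetical (arbitrary utility units, a made-up one-year horizon); it only shows how the same drug can look good or bad depending on how long it takes to find out whether it works.

```python
def expected_net_benefit(p_respond, daily_benefit, daily_side_effect_cost,
                         trial_days, horizon_days):
    """Expected net benefit per patient of 'try it, keep it only if it works'.

    All inputs are hypothetical utilities in arbitrary units. Responders gain
    `daily_benefit` for the whole horizon; non-responders pay the side-effect
    cost only during the trial period and then stop.
    """
    responder_gain = p_respond * daily_benefit * horizon_days
    nonresponder_loss = (1 - p_respond) * daily_side_effect_cost * trial_days
    return responder_gain - nonresponder_loss

# The exaggerated drug from the text: 5% are cured, 95% get a day of nausea.
cheap = expected_net_benefit(0.05, daily_benefit=1.0,
                             daily_side_effect_cost=2.0,
                             trial_days=1, horizon_days=365)
# Same drug, but it takes six weeks to learn whether it works.
expensive = expected_net_benefit(0.05, daily_benefit=1.0,
                                 daily_side_effect_cost=2.0,
                                 trial_days=42, horizon_days=365)
print(f"one-day trial: {cheap:.2f}   six-week trial: {expensive:.2f}")
```

With these made-up inputs the one-day trial has a positive expected value and the six-week trial a negative one, even though the responder fraction and side effects are identical.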

I don’t hear this talked about much and I don’t know if this is consistent with the studies that have been done.

Fourth, even though SSRIs are branded “antidepressants”, they have an equal right to be called anti-anxiety medications. There’s some evidence that they may work better for this indication than for depression, although it’s hard to tell. I think Irving Kirsch himself makes this claim: he analyzed the efficacy of SSRIs for everything and found a “relatively large effect size” of 0.7 for anxiety (though the study was limited to children). Depression and anxiety are highly comorbid and half of people with a depressive disorder also have an anxiety disorder; there are reasons to think that at some deep level they may be aspects of the same condition. If SSRIs effectively treated anxiety, this might make depressed people feel better in a way that doesn’t necessarily show up on formal depression tests, but which they would express to their psychiatrist as “I feel better”. Or, psychiatrists might have a vague positive glow around SSRIs if they successfully treat their anxiety patients (who may be the same people as their depression patients) and not be very good at separating that positive glow into “depression efficacy” and “anxiety efficacy”. Then they might believe they’ve had good experiences with using SSRIs for depression.

I don’t know if this is true and some other studies find that results for anxiety are almost as abysmal as for depression.


74 Responses to SSRIs: An Update

  1. ManyCookies says:

    To clarify: were the analysed SSRI papers for a single medication each, or were the papers testing the entire rigmarole of trying out a bunch of different SSRIs like what psychs actually do? I skimmed the abstracts and wasn’t really sure (is that what the “arms” thing is about?), but if it’s the former… well that seems like giving people randomly sized pairs of shoes without fitting them, and then concluding footwear provides no statistically significant amount of comfort.

    • Scott Alexander says:

      They were for a single SSRI each. Most of the trials that get analyzed for this sort of thing are drug companies doing the studies to prove their drug works to the FDA, so usually it will just include their drug.

      • ManyCookies says:

        Oh. Well different variants of SSRIs can affect the same person in wildly different ways, right? If so that seems like it’d significantly weaken the conclusions we can draw from this meta analysis.

        Like continuing the shoe analogy, running a study on one SSRI is like running a study on the comfort of size 7 running shoes. You’re gonna have a group of thrilled runners, maybe another group of mildly content folk within half a size, and a whole bunch of people who toss the shoes after 5 minutes. Even if you have repeat studies for all the shoe sizes, you’re not testing the effectiveness of shoes as actually used by the population – fitted for their actual size – you’re just figuring out that people have shoe sizes.

        (Treatment resistant depression is like a barefoot runner or someone allergic to nylon in this tortured analogy)

  2. nameless1 says:

    How about instead of randomly trying different SSRI’s for each patient:

    1) try the first one randomly

    2) try the second one randomly

    3) figure out the statistical relationship between which one was not working first and which one was working second, that is, is antidepressant D more likely to work for people for whom A didn’t work, or B didn’t work, or C didn’t work?

    Ultimately what we want to know is the exact biological difference and reason that makes one work and the other not for a person. But until we know that it can be statistically approximated, if D is more likely to work for those for whom A didn’t than for those whom B or C didn’t, it not only improves treatment but also can give hints about the reason.
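The tally proposed in step 3 could be sketched roughly as follows. The record format and the drug labels are hypothetical, chosen only to show the shape of the calculation; nothing here comes from a real dataset.

```python
from collections import defaultdict

def second_line_success_rates(records):
    """Estimate P(second drug works | first drug failed) for each drug pair.

    `records` is a list of (failed_first_drug, second_drug, second_worked)
    tuples. This format is an assumption made for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # (first, second) -> [successes, total]
    for first, second, worked in records:
        counts[(first, second)][1] += 1
        if worked:
            counts[(first, second)][0] += 1
    return {pair: s / n for pair, (s, n) in counts.items()}

# Made-up records, purely to show the calculation.
records = [
    ("A", "D", True), ("A", "D", True), ("A", "D", False),
    ("B", "D", False), ("B", "D", False), ("B", "D", True),
    ("C", "D", False),
]
rates = second_line_success_rates(records)
for (first, second), rate in sorted(rates.items()):
    print(f"{second} after {first} failed: {rate:.2f}")
```

With only a handful of records per pair, as here, the estimated rates are mostly noise; a usable version would need many patients for every combination of drugs.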

    Also, I suspect experienced psychiatrists were already spotting such patterns, it is just not yet formally analyzed.

    • Scott Alexander says:

      Lots of people have brought that up. I think the main barrier is getting a big enough dataset. You would either have to do it as a formal study, or just sort of scattershot encourage people to report what had happened naturalistically. If the former, the study would have to be gigantic. If the latter, the data wouldn’t be very trustworthy, and it would be hard to get people to report it in a trustworthy and privacy-law-maintaining way.

      • VivaLaPanda says:

        Obviously the legal stuff is a challenge here, but the incentive model could be as simple as:
        Unlock full access to the dataset by contributing N datapoints. It’s how Glassdoor used to work (don’t know if it still does), and it was pretty effective.

  3. andrewducker says:

    An interesting recent paper on different types of depression, including one which is SSRI-resistant:

    Important section:
    The three distinct sub-types of depression were characterized by two main factors: functional connectivity patterns synchronized between different regions of the brain and childhood trauma experience. They found that the brain’s functional connectivity in regions that involved the angular gyrus — a brain region associated with processing language and numbers, spatial cognition, attention, and other aspects of cognition — played a large role in determining whether SSRIs were effective in treating depression.

    Patients with increased functional connectivity between the brain’s different regions who had also experienced childhood trauma had a sub-type of depression that is unresponsive to treatment by SSRIs drugs, the researchers found. On the other hand, the other two subtypes — where the participants’ brains did not show increased connectivity among its different regions or where participants had not experienced childhood trauma — tended to respond positively to treatments using SSRIs drugs.

    • Scott Alexander says:

      Oh yeah, the three subtypes of depression paper. Lots of fun, just like the 5 subtypes of depression paper last year, or the unrelated 3 subtypes of depression paper also from last year, and the 4 subtypes of depression paper that came out in 2016, and the traditional 2 subtypes that have been recognized since forever.

      • andrewducker says:

        It shouldn’t really surprise anyone that, depending on which variables you look at and what parameters you set when looking for clusters, you get different numbers of them.

        The relevant bit is that “Patients with increased functional connectivity between the brain’s different regions who had also experienced childhood trauma had a sub-type of depression that is unresponsive to treatment by SSRIs”.

        Now, whether that holds up in a different sample is important, obviously. But if it does then you’ve got something deeply interesting going on.

    • nameless1 says:

      Hm. This association with cognition sounds like being smart. But at least CBT works better if you are smart? A quick googling seems to show that IQ is not positively correlated with CBT efficacy and some results even show a negative one. Being good at cognition does not imply leveraging your cognition works well for therapy. Hm. Brains are weird! Are there at least anecdotal stories what tends to work for smart people with childhood trauma? This might characterize some of SSCers, I suspect.

    • MH says:

      I’m neither good enough at statistics to evaluate their methods nor invested enough to do the work if I was, but as a general rule of thumb this part of the article seems like the sort of thing that should ring a lot of alarm bells:

      “With over 3000 measurable features, including whether or not participants had experienced trauma, the scientists were faced with the dilemma of finding a way to analyze such a large data set accurately. “The major challenge in this study was to develop a statistical tool that could extract relevant information for clustering similar subjects together,” says Dr. Tomoki Tokuda, a statistician and the lead author of the study. He therefore designed a novel statistical method that would help detect multiple ways of data clustering and the features responsible for it. ”

      The methodology seems to have been to gather a huge quantity of data about 134 people, including brain scans of a whole bunch of different areas, and then dump it all into a statistical engine of some kind and say “Now where are the correlations!”

      Again I don’t know exactly what their statistical method was or what its advantages are – maybe this is finally the holy grail method that will actually make this strategy work!
      As a general rule though it’s the (tempting) sort of approach, especially with limited data sets, that leads you to this kind of thing:

  4. It seems like we need to update our language for interpreting statistics to describe treatments that “work really well for some people, but not for most.” Because we really should support these treatments, rather than knocking them out because of low average effect size.

    As far as pricing the cost-benefits, we should also look at the cost-benefit for a patient to engage in the whole category. “Is your depression so bad that it’s worth taking on possibly 5-20 nausea-inducing drugs to find one that fixes your depression?” i.e., Are you willing to go down a chemical rabbit hole to find a cure?

    PS. Can I get some pointers on how to interpret effect sizes? In particular, what units are they? (Google isn’t helping me answer this question)

    • nameless1 says:

      You would expect antibiotics to work really well for people with a bacterial infection and not at all for people with a viral infection. Because they are different diseases. My point is, the language of the illness is what needs to be updated. Sometimes illnesses are categorized by symptoms, this seems to be the case with depression. Sometimes by mechanism, like infections. Every time our child brings home yet another virus from the kindergarten my wife and I get different symptoms, and then often my doc tells me great, you have caught a bacterial infection on top of the viral one, and yet my symptoms did not change. Obviously it is better to categorize illnesses by their mechanism as this is what matters for treatment, and this scenario obviously suggests different mechanisms.

      And if you have no idea about the mechanism, name it after the treatment. “Antibiotic-responding infection” is a weird, but I think not horribly bad, name for a bacterial infection. At least it is way better than calling it “coughing and sore throat”. The first gives more clues to what is going on.

      • nameless1 says:

        So my point is “antibiotics work really well for some people with coughing and sore throat and not for others” and then talking about effect size is just the wrong way to talk about it, IMHO. If we haven’t even discovered bacteria yet, and don’t know garlic and honey are antibiotics, we just know sometimes garlic and honey helps with sore throats and sometimes not, call the kinds where it helps “sore throats responding to garlic and honey”. Don’t talk about effect sizes, talk about potentially, likely, different illnesses/mechanisms.

        • VivaLaPanda says:

          I like this approach, but I think the heart of the issue is that we *don’t* understand the mechanisms of depression yet, and until we do making such a distinction is hard. It seems like the best we can do today is develop a graph of people’s reactions to different SSRIs and try to find relationships.

    • Freddie deBoer says:

      The units of effect sizes are somewhat irrelevant as the purpose of effect size is to arrive at a context-independent comparison of relative magnitudes of effect. Units are one of the things that effect sizes are specifically designed to elide.

    • jonathanpaulson says:

      The units of effect size are standard deviations. One standard deviation is how far a random person is from the “average person”. (I assume in this case on a score on a “how depressed are you?” questionnaire?).
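For a concrete picture of "units of standard deviations", here is a minimal sketch of the usual standardized mean difference (Cohen's d). The scores are invented for illustration; real trials use validated rating scales and far larger samples.

```python
import math
import statistics

def cohens_d(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(treated), len(control)
    v1, v2 = statistics.variance(treated), statistics.variance(control)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd

# Hypothetical improvement scores (points dropped on a depression scale).
drug = [11, 8, 14, 6, 10, 13, 7, 9]
placebo = [10, 7, 13, 5, 9, 12, 6, 8]
d = cohens_d(drug, placebo)
print(f"effect size: {d:.2f}")
```

In this toy data the drug group improves by one extra point on average, and patients vary by a little under three points, so the effect size is about a third of a standard deviation. The scale's own units (points) cancel out, which is the sense in which effect sizes are unit-free.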

  5. mr_capybara says:

    Hmm, to make sure I understand: is the argument that the effect size could be considered clinically insignificant only because it’s large in some people and nonexistent in others?

    I don’t know anything about psychiatric trials, but anecdotally from my experience many years ago following biotech trials for a finance company, identifying subpopulations for whom the drug worked was very much discussed. (Usually in the context of an otherwise failed Phase II trial – well, it didn’t work *overall*, but look at right here….! Many drugs are developed by niche biotech companies around a particular idea, and so the company is trying to salvage their existence.) This was a decade ago so I’m struggling to remember particular examples, unfortunately.

    I found a couple examples of recent studies[0][1] trying to use this idea, though. I’m surprised it’s not happening the way you’re saying in psychiatry; it seems like a no-brainer. But I think it’s fairly common in other fields of medicine.


    • lunawarrior says:

      Would those subpopulations still be findable even if there was no visible thing linking them together?

      So if for example whenever someone notices they have depression they have essentially rolled a 10 sided die, and this medication only helps if you rolled a 7 on that die, would you be able to pick out a subpopulation of that if you have no access to the die that was rolled?

      • gmaxwell says:

        If the drug has few concerning side effects, sure… you administer it and see if it works. Which is essentially what Scott describes as the industry’s approach.

  6. phoniel says:

    You’ve previously written about how all forms of therapy perform equally poorly, and the only variable which correlates with success is “an authoritative-sounding psychologist.” Despite this, many people swear by their preferred form of psychotherapy. Elsewhere, I’ve heard it claimed that 12-step programs perform as well as placebo. Despite this, the 12-step program maintains a large following.

    Isn’t this a good explanation for both these phenomena? It’s not obvious to me that we should expect different human beings to react similarly given similar stimuli. In fact, to me, it’s counterintuitive to expect that they would.

    More generally, doesn’t this reveal a pretty fundamental problem with psychological research as it’s conducted today? Researchers subject their test subjects to some experimental conditions, and then measure those subjects’ resulting behavior. But the “experimental conditions” interact, in an extremely unpredictable and non-linear manner, with the test subjects’ psychological states. To further complicate things, we have no good way of quantifying or even qualifying those psychological states, which means we cannot test for them, let alone control for them! From the subjects’ biochemistry, to what they read yesterday in the paper, to whether their wives and husbands kissed them goodbye on their way to the study — there are too many significant variables for us to enumerate, not to mention the impossibility of measuring them.

    What I understand from this blog is that we try to separate the effects of “shared” and “non-shared environment.” But it seems to me, (and what you’ve written increases my suspicion), that even “shared environment” is functionally non-shared. To give an example, we think that, say, parenting style, is a form of “shared environment.” But the same parenting style may have vastly different interactions with two identical twins, growing up in the very same house at the very same time. Does strict religious parenting work? Bishop Ned had identical twins. One went to Harvard, the other died of an OD. What was the difference? The second’s faith was broken by an early experience of great hypocrisy, whereas the first’s faith remained intact. Consequently, the “same parenting” had vastly different effects on genetically identical children living in the same house at the same time.

    The result is that psychological research, as it’s practiced today, can only figure out the most obvious, and usually boring, facets of human psychology. It can pinpoint those things which truly don’t vary with a subject’s psychological state. If you give me a research lab, I’ll be able to show you with a high degree of certainty that siccing dogs on research subjects ignites their adrenal response. But if you ask me to give a good predictor of their relative levels of success, or happiness, or capacity for dynamism, I won’t be able to explain more than 50% of the data, no matter how many trials you let me run.

    I like different foods than my friends. Different women attract me. Particular personalities attract and repel me. We know, intuitively, that the same stimuli affect different people differently, and even the same people differently from second to second. Does psychological research have any real solution to this problem?

  7. Jeffery Mewtamer says:

    Honestly, I’d argue that taking the mean and ignoring all other data is almost always summarizing to the point of uselessness, and depending on the kind of data, the average can take on a completely non-sensical value. Distribution matters, and to that end, there’s a huge difference between treatments that have roughly the same effect on everyone, those with a wide range of equally likely effects, and those that seem to have an all or nothing effect.

    The first might have mean = median = mode with little difference between min and max. The second might have mean = median, no mode, a very wide range and percentiles that are roughly linear. The third might have a mean larger than median, a mode near zero, a wide range, and very few intermediate values with the 70th percentile very low and the third quartile near the max. Even if all three have the same average, these represent very different distributions (a low average might indicate the first is generally worthless, the second is good for a few, decent for a few, poor for a few, and useless for a few, and the third is great for a few and useless for everyone else, but the mean alone doesn’t tell us which we’re working with).
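A small sketch of the three cases, using invented response values for ten patients each. The data and labels are hypothetical; the only point is that the mean is identical while the other summaries differ.

```python
import statistics

def summarize(values):
    """Return the mean, median, 70th percentile, and max of the responses."""
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "p70": deciles[6],
        "max": max(values),
    }

# Three made-up sets of ten patient responses, each averaging 0.3.
treatments = {
    "same for all": [0.3] * 10,
    "wide range": [0.03, 0.09, 0.15, 0.21, 0.27, 0.33, 0.39, 0.45, 0.51, 0.57],
    "all or nothing": [0.0] * 7 + [1.0] * 3,
}

summaries = {name: summarize(values) for name, values in treatments.items()}
for name, s in summaries.items():
    print(f"{name:15s} mean={s['mean']:.2f} median={s['median']:.2f} "
          f"p70={s['p70']:.2f} max={s['max']:.2f}")
```

All three rows share a mean of 0.3, but the all-or-nothing treatment has a median of zero and a maximum of 1.0, which is exactly the pattern a single average hides.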

    • Douglas Knight says:

      Right, if there are super-responders, that’s probably clear from the data. But no one publishes the raw data and it’s plausible that no one looks at the distribution. If people were studying this hypothesis, they would also compare results after 6 months to results after 1 year.

  8. Izaak says:

    So, how do we get away from “single number” statistics? This just further convinces me that one of the problems with modern statistics is the focus on looking at whether a single number is above or below a threshold. What can we do to encourage statistics to be done more holistically?

    • Ghillie Dhu says:

      The go/no-go decision is ultimately going to be a threshold; the issue as I see it is that the statistic being used is an inadequate proxy for the outcome of interest.

      ETA: e.g., the ratio of average benefit to average side-effect will (very likely) differ from the average ratio of benefit to side-effect; I have no idea what math is actually done, but I wouldn’t be the least surprised if it was somewhat simplistic (most people aren’t statisticians).
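The difference between the two quantities is easy to see with a toy example. The four patients and their scores below are invented purely for illustration.

```python
import statistics

# Hypothetical (benefit, side_effect) scores for four patients.
patients = [(8.0, 1.0), (1.0, 2.0), (1.0, 4.0), (2.0, 1.0)]

benefits = [b for b, _ in patients]
side_effects = [s for _, s in patients]

# Ratio of the group averages: what a summary table would suggest.
ratio_of_averages = statistics.mean(benefits) / statistics.mean(side_effects)
# Average of each patient's own ratio: what individuals actually experience.
average_of_ratios = statistics.mean([b / s for b, s in patients])

print(f"ratio of averages: {ratio_of_averages:.2f}")
print(f"average of ratios: {average_of_ratios:.2f}")
```

Here one patient with a very favorable trade-off pulls the average of ratios well above the ratio of averages, so the two summaries would support different decisions.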

    • VivaLaPanda says:

      Better yet, report your data alongside summary statistics. Then people can run whatever analysis they want.

  9. eightieshair says:

    There’s been such a big trend of “New meta-analysis makes effect X disappear” reports that it has me wondering about meta-analysis itself. I’m not enough of an expert to have a grounded opinion, but is it possible that there’s something about the way these recent meta-analysis studies are being done that is guaranteed to make effects diminish or vanish entirely?

    • brmic says:

      No. It’s hard to cover all that is possible but to take on the obvious candidate:
      Is it possible that the inclusion of unpublished studies lowers the estimates? Almost certainly that’s the case. But to argue that they should be excluded not only fails to deal with everything we know about publication bias, it also essentially amounts to arguing, that these studies are not part of the population-of-studies that make up the effect. (This is basically a variant of the reasoning around outlier removal at the study level.) Such things occur, for instance if studies hand out antidepressants to psychotic patients one can legitimately say they’re not part of the population-of-studies whose average effect size we want to estimate. (Someone else can come along and define a different population that includes those studies if they’re so inclined.)
      But, generally speaking, there is still a substantial gap between ‘study could legitimately not be published because of flaws’ and ‘study worse than typical clinical practice’.

      OTOH, published effect sizes being massive overestimates which shrink substantially upon replication is par for the course.
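A rough sketch of why publication filtering inflates effect sizes. The model is a deliberate simplification of my own (normal sampling error, a hard significance cutoff, a made-up sample size), not a description of how any particular meta-analysis handled the problem.

```python
import math

def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)

def normal_cdf(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def mean_published_effect(true_d, n_per_arm, z_crit=1.96):
    """Average reported effect if only significant positive results appear.

    Simplifying assumptions (for illustration only): each study's estimate
    is normal around `true_d` with standard error sqrt(2 / n_per_arm), and a
    study is published only when estimate / standard_error exceeds `z_crit`.
    """
    se = math.sqrt(2 / n_per_arm)
    cutoff = z_crit * se
    alpha = (cutoff - true_d) / se
    # Mean of a normal distribution truncated below at `cutoff`.
    return true_d + se * normal_pdf(alpha) / (1 - normal_cdf(alpha))

published = mean_published_effect(true_d=0.3, n_per_arm=50)
print(f"true effect 0.30, average published effect {published:.2f}")
```

Under these assumptions a true effect of 0.3 shows up in the published record as a bit above 0.5, which then "shrinks" back toward 0.3 once unpublished studies are included.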

  10. domenic says:

    You skipped past the placebo effect pretty quickly in this post. Isn’t it also consistent with the evidence that most of the clinically-observed benefits of SSRIs could be obtained by psychiatrists prescribing placebos? And we just don’t know because no psychiatrists actually do that? For example,

    There are patients who will regularly get better on an antidepressant, get worse when they stop it, get better when they go back on it, get worse when they stop it again, et cetera.

    Would these same patients also regularly get better on a placebo, get worse when they stop it, etc.? Nobody’s tried placebos in a clinical setting enough to know, I assume.

    I guess this doesn’t address your point about similarly-low effect sizes for benzos and morphine.

    (Disclosure: I’ve been on an SSRI for ~15 years. At this point I mostly take it because why risk changing what works? But I remain curious whether, if someone switched my pills for placebos, my lived experience would change in any way.)

    • VivaLaPanda says:

      He spends almost the whole original SSRI post discussing that, I’d recommend checking it out, it’s worth the read.

  11. TakLoufer says:

    Couldn’t some statistical analysis of the trials data differentiate between a case where people generally have a lousy response to the therapy and a case where some subgroups have large reactions, and some barely any? Like testing whether the distribution of the effects is gaussian?

    • brmic says:

      In short, no. Though not for lack of trying.
      In slightly longer: Everything where the groups are ‘obvious’ we’ve already discovered, explained and treat accordingly. For instance, people presenting with ‘runny nose’ either have a bacterial or a viral infection and thus respond to antibiotics or don’t. For the currently still difficult stuff, the situation is more like this: The underlying effect is not clearly binary, instead you have a middle group (or more) where it works somewhat. I.e. if the antidepressant is also anxiolytic, some people report improvement in their depression simply because they are less anxious after taking the stuff, despite ‘true depression’ being unchanged. That middle group reports effects on depression correlated in size with their anxiety, so you get lots of lovely distributed effects from that middle group. On top of that, imagine there’s also some variation in the ‘it works’ group, say from 100% cure to 80%, dependent on dose, therapeutic relationship, personalities etc.
      Now, imagine on top of that, placebo effects, in all groups.
      Now, imagine on top of that, therapeutic relationship effects in all groups.
      Now, imagine on top of that, spontaneous remissions in all groups.
      Now, imagine on top of that, life events (like death of a spouse or finding the love of one’s life).

      The resulting distribution will look gaussian, even though the effect isn’t. People are trying to get to the bottom of things, but it’s hard.

      Even harder if the outcome is binary, like suicide yes/no.
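One way to see why the responder subgroup can vanish into a bell-shaped curve is to count the peaks of the combined distribution. The model below is a hypothetical simplification: two groups, one shifted, both blurred by the same normal noise standing in for placebo response, spontaneous remission, and life events.

```python
import math

def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def count_modes(responder_fraction, responder_shift, noise_sd=1.0, steps=4000):
    """Count peaks in the density of observed improvement scores.

    Hypothetical model: non-responders are centered at 0, responders at
    `responder_shift`, and everyone gets normal noise with `noise_sd`.
    """
    lo = -6 * noise_sd
    hi = responder_shift + 6 * noise_sd
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    density = [(1 - responder_fraction) * normal_pdf(x, 0.0, noise_sd)
               + responder_fraction * normal_pdf(x, responder_shift, noise_sd)
               for x in xs]
    # A grid point is a peak if it is strictly higher than both neighbors.
    return sum(1 for i in range(1, steps)
               if density[i - 1] < density[i] > density[i + 1])

modest = count_modes(1 / 3, 1.0)   # responders 1 SD better than the rest
extreme = count_modes(1 / 3, 4.0)  # responders 4 SD better than the rest
print(modest, extreme)
```

With responders only one noise standard deviation ahead of everyone else, the combined curve has a single hump; the second hump appears only when the gap is several standard deviations wide. A visibly bimodal histogram is therefore not something to expect even if the underlying response really is all-or-nothing.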

  12. vV_Vv says:

    There are patients who will regularly get better on an antidepressant, get worse when they stop it, get better when they go back on it, get worse when they stop it again, et cetera. This raises some questions of its own, like why patients keep stopping antidepressants that they clearly need in order to function, but makes bias less likely. Overall the clinical evidence that these drugs work is so strong that I will grasp at pretty much any straw in order to save my sanity and confirm that this is actually a real effect.

    Every clinician knows that different people respond to antidepressants differently or not at all. Some patients will have an obvious and dramatic response to the first antidepressant they try. Other patients will have no response to the first antidepressant, but after trying five different things you’ll find one that works really well. Still other patients will apparently never respond to anything.

    If I understand correctly this pattern is also consistent with all SSRIs not having much effect: depression occurs in irregular remission-relapse cycles, so the patient shows up at the doctor’s office when their symptoms are especially bad, the doctor prescribes them an SSRI, after a few weeks they conclude it doesn’t work, they prescribe them another SSRI, and so on, until depression spontaneously goes into remission and this is interpreted as the particular SSRI that the patient is on having cured their depression. After some time the patient stops taking the SSRI, and after some more time their depression relapses, they go back to the doctor and the doctor interprets the relapse as caused by having stopped taking the medication.

    This hypothesis implies that taking up the same SSRI again will probably not do much, because it wasn’t doing much in the first place, but this is probably confounded by the placebo effect and the patient feeling ashamed for having stopped taking a medication. Is this plausible?


    Assuming, on the contrary, that SSRIs can have large effects with large individual variability, is there any biological theory for why this might be the case? Something like receptors having different shapes in different people and thus having different chemical affinity to different SSRI molecules? Or depression being multi-causal and different types of SSRIs acting on different causes? (this sounds strange because the general mechanism of action of SSRIs should be the same)

    • Aliene says:

      That doesn’t track very well with my experience stopping SSRIs.

      I was just on one for 9 months for depression/anxiety. While on it, my depression and anxiety improved and by the end my general mood was much higher than it had been for at least a year, maybe more. However, I also had side effects: my focus was worse, I couldn’t enjoy the occasional drink, I felt like analytical conversations were harder (I had no awareness of a verbal train of thought, and I couldn’t compose replies at the same time I was listening and processing arguments). Also, having to take a daily medication is a pain in the ass, especially if you have trouble remembering things like bringing your pills with you.

      So I stopped, and after re-adjusting my depression and anxiety are worse again. The train of thought is back, and it turns out my brain was using that for unhelpful thoughts. It’s harder to be happy.

      Fortunately, I’m looking at this like a fascinating A/B test in software, and I’m back to CBT with a specific focus on “hey, I’m now thinking and doing these unhelpful things, can we address those?” But I’m not sure if that will keep working the next time I get a big dose of life stress, and if I hadn’t stopped and gone “wait a minute, this bullshit is back!” it’d be easy to slide down into old holes.

    • MH says:

      What you’re thinking of in the main part is probably ‘reversion to mean’ which is covered in the original post on depression, and is a big-deal-problem when it comes to studying chronic conditions. It’s a big part of the placebo effect itself.
      But if that were the major explanation it wouldn’t matter which SSRI you took the second time either. If the first one didn’t work, and the second one hit around the time the patient started on the downslope when they go back, the second one shouldn’t have any more likelihood of helping them out the second time than the first one did the first time. But people do seem to have preferences.

      (Thinking of the placebo effect as a cause, rather than an effect brought about by multiple causes, is the origin of a surprisingly large amount of nonsense out there about the value of placebos in the first place.)

      Also as far as I know the answer to the questions in the edit is “yes”. It is probably all of those things and also some other things as well. It doesn’t help that a lot of the symptoms of depression, right up to the point where things get really really bad, overlap with the symptoms of being really stressed out or unhappy for completely normal non-medical causes.
      Benzos will help with anxiety whether or not someone has an anxiety disorder, but whether antidepressants will help with people who don’t have a (of what are almost certainly a whole bunch of different) medical disorder(s) isn’t clear and also the answer is probably “for these people yes and for those people no and for these people over here sort of yes and sort of no”.

    • Paul Torek says:

      Here’s another angle that raises closely related worries. There’s often a backlash effect from stopping SSRIs, especially cold turkey or too quickly. The short term effect of stopping, or of restarting during the backlash period, can easily be more dramatic than the original effect. So I think Scott should be slower to doubt his on-again off-again patients, who may indeed be better off without the SSRI. Not everyone benefits from an anti-sex drug with moderate antidepressive side effects (to quote one skeptical clinician), even if they feel temporarily horrible without it.

  13. moridinamael says:

    About three months ago I started on the new migraine drugs, the CGRP inhibitors, and went from ~20 headache days per month to ~2 headache days per month. On paper, the “effect size” of this drug is small, on average only shaving off 1-2 headache days per month for the average recipient. If you dig into the publications, the distribution of responses is roughly Pareto distributed, with most people not responding very much and a relative minority responding quite strongly.

    Yet I have heard first-hand accounts of neurologists telling their patients that this drug “barely works” and isn’t even worth trying. I can only assume that they glance at the average effect size and dismiss the drug. This means there are a fraction of their patients who could be almost completely cured by the drug (like me) who are not even getting a chance to try it. Assuming my interpretation of their reasoning is true, this is absolutely maddening. It seems like the most basic understanding of statistics should screen off this kind of mistake.

    • Douglas Knight says:

      Why were you able to find the distribution for this drug but the rest of us can’t find it for antidepressants?
      Are you searching differently? Any tips?
      Or are the drugs different? Perhaps people looked for this, saw it wasn’t there so didn’t publish? Is it plausible that no one has looked?

      • moridinamael says:

        I’m regrettably late in responding to this, but I have an easy answer. The drug was literally delivered with a detailed summary of the research findings, including multiple charts illustrating the distribution of likelihoods of different effect sizes, versus placebo, based on two different dosages. I didn’t have to go searching – the drug company did the research and shoved the results directly into my hands. I can’t explain why a company would do this for a completely new drug with only a limited run of studies done for FDA approval purposes, while SSRIs, which have many years of gathered data, don’t. Perhaps it’s much much easier to record migraine frequency than depression symptoms?

    • MH says:

      “It seems like the most basic understanding of statistics should screen off this kind of mistake” is a sentence that is broadly applicable enough that it would make a decent tattoo.

  14. Doug says:

    Midterm Elections -> Scott sees a long-term environmental impact of pre-K
    Sessions Fired -> Scott opposes marijuana legalization
    RBG Hospitalized -> SSRIs are now effective medication

    I’m hoping that Mueller doesn’t indict Trump, or I’m afraid Scott may become a young-earth creationist.

  15. kominek says:

    Given that the effect size really is about 0.3, how do we square the scientific evidence (that SSRIs “work” but do so little that no normal person could possibly detect them) with the clinical evidence (that psychiatrists and patients often find SSRIs sometimes save lives and often make depression substantially better?)

    more things (including people) should be thought of in terms of (nonlinear) dynamical systems. how you feel, and how you’re “doing” is the result of a whole bunch of interconnected terms, like the amount of sleep you got, and how well you cope with adversity, and whatever the hell is going on with your brain chemistry.

    a tiny effect like 10 more minutes of sleep per night isn’t just 10 more minutes of sleep, it’s also a reduction in the daily stress generated by not enough sleep, which allows you more coping budget for other things, which gets you to the end of the day feeling better, which makes it easier to sleep, which…

    the sleep meds can be viewed as forcing functions acting on the dynamic system, same as other meds. just pushing a little bit every day in a good direction can get the overall state of the dynamic system to stay in a better region.

  16. VolumeWarrior says:

    Seems like you could test #3 by just looking at the data. If the treatment group has a high standard deviation, you can infer a responders/nonresponders narrative.

    … if only the data were public.

    • tcheasdfjkl says:

      Yeah, it seems like we want at least a graph that shows the distribution of effect sizes for individuals, and I’m kind of surprised we don’t have this already, given that this hypothesis seems both fairly obvious and testable.

  17. Douglas Knight says:

    you will never convince me that benzos don’t. Even morphine for pain gets an effect size of 0.4

    Are these results that surprising? These drugs have obvious acute effects, but the studies aren’t of the acute effects. I’m not surprised that morphine for chronic back pain scores badly.

    I’m not sure about benzos. Is the normal usage ad libitum for panic attacks or daily usage? Is this study daily usage? Elimination of panic attacks is a dramatic effect, but would it show up in the HAM-A?

  18. tcheasdfjkl says:

    Off-topic, but “Gueorguieva” (the name of one of the researchers) is an amazing spelling. My first thought was that this was someone from a Slavic-language-speaking country who had grown up in a French- or Spanish-speaking country (in both French and Spanish “gue” and “gui” are used to designate that the g should be pronounced with a hard /g/ sound – “ge” and “gi” have a different sound for g) and therefore had her name transliterated in a French- or Spanish-compatible way (in English the intuitive/standard transliteration would be “Georgieva”). But apparently she both was born in Bulgaria and went to college there and then grad school in Florida. So probably this is just a case of Soviet transliteration guidelines being aimed at French orthography and the results looking funny to anglophone eyes. (In retrospect, I should probably have expected that, since my own last name has the same issue.)

  19. K says:

    The effect size / variance issue also plagues weight loss trials. If you look at weight loss RCTs, the mean loss is normally only a kg or two at 12-18 months, which is obviously pretty trivial. A few years ago I finally found a trial that published a scatter plot of both arms, and the variance was huge – some people lost 30% of their bodyweight with surgery.

    I think really what these trials ought to be presenting is the proportion of patients in each arm who met a pre-specified threshold, e.g. 10% of their bodyweight, an X point decrease in their K10 score or a score that went from “very high” to “moderate”, or whatever. That would be vastly more interpretable for doctors, and it answers a much more relevant question for patients, which is “What are the chances this drug will work for me?” rather than “What does this drug do ‘on average’ in a random group of people?”. Cancer is the only disease I can think of where you see both types of statistics presented regularly (median survival and proportion surviving at 1 and 5 years).

  20. Error says:

    Suppose that one-third of patients have some gene that makes them respond to Prozac with an effect size of 1.0 (very large and impressive), and nobody else responds. In a randomized controlled trial of Prozac, the average effect size will show up as 0.33 (one-third of patients get effect size of 1, two-thirds get effect size of 0). This matches the studies.

    This seems like such an easy hypothesis to check that I find myself confused. Wouldn’t it show up as SSRIs having an obviously higher effect-size variance than placebo? Or you could compare the top and bottom decile of each. Do existing studies not look for this sort of thing?
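
    For a sense of how big that variance difference would be, here is a small sketch of the arithmetic. It assumes the quoted toy model exactly (one-third of patients shifted by 1.0 SD, everyone else unshifted) plus the extra assumption that noise is the same in both arms; the function name is just for illustration.

```python
import math


def mixture_arm_stats(p_responder, d_responder, sd=1.0):
    """Mean shift and variance of a drug arm where only some patients respond.

    Assumes every patient has the same noise SD, and responders are shifted
    by d_responder * sd. The variance follows from the law of total variance.
    """
    shift = d_responder * sd
    mean_shift = p_responder * shift
    variance = sd ** 2 + p_responder * (1.0 - p_responder) * shift ** 2
    return mean_shift, variance


mean_shift, var_drug = mixture_arm_stats(1 / 3, 1.0)
sd_ratio = math.sqrt(var_drug / 1.0)  # drug-arm SD relative to placebo-arm SD

print(round(mean_shift, 3))  # 0.333
print(round(var_drug, 3))    # 1.222
print(round(sd_ratio, 3))    # 1.106
```

    So under these assumptions the drug arm’s standard deviation is only about 10% larger than placebo’s, which is a real but not especially “obvious” difference, and it could easily be masked if the noise differs between arms for other reasons.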

  21. SEE says:

    In favor of the differential-effect hypothesis, I note the average effect size of SSRIs on my major depressive disorder, specifically, works out to something like 0.25.

    This is because of the seven I was tried on, the first three did nothing, the fourth was amazingly effective (but caused me to break out in hives and so had to be discontinued), the fifth and sixth failed, and then the seventh worked noticeably but not amazingly well when combined with trazodone.

    (There were also several non-SSRIs tried along the way that didn’t do any good. A tricyclic, lithium, buspirone . . . )

    Just coincidence? Placebo? Well, #4 had incredibly immediate mood response to both starting and stopping, and the handful of times I’ve been directed to wean myself off of #7 over two decades (because of weight and liver side effects and the hopes that after many years my underlying depression has resolved itself), I’ve gotten severely depressed and had to resume.

    Assume a population of the severely depressed like me but with differing effective SSRIs . . . and, well, not only do the study results explain themselves, but thank America and Big Pharma that “me-too drugs” are encouraged rather than suppressed by the current system.

    • twocents says:

      @SEE you’ve probably considered this, but is there any chance your allergic reaction to #4 was due to an inactive ingredient in the pill rather than to the drug itself? It might be worth a careful trial of a generic made by a different manufacturer or a compounding pharmacy to rule out this possibility. You might also consult with an allergist about whether a desensitization protocol is feasible. I’ve never heard of this being done for SSRIs, but it seems theoretically possible.

      • SEE says:

        At the time, #4 was still under patent (and would be for most of a decade), so a generic wasn’t a viable choice. And my psychiatrist-at-the-time didn’t want to take any risks at all with anaphylactic shock even as teenager-me suggested I could live with the hives. I dunno if I’d even heard of a compounding pharmacy at the time.

        Now it’s been out-of-patent for a good long while, so I could try seeing if it was an inactive ingredient. On the other hand, since I’ve never had a hives reaction to anything else ever, it seems implausible there was some inactive ingredient in Eli Lilly-made Prozac in 1994 that hasn’t been in anything, prescription, OTC or non-drug, that I’ve ever taken.

  22. Argos says:

    TL;DR: I am skeptical that it makes sense to model patient responses in clinical SSRI trials as “for some it works, for some it doesn’t”, although I concede that it makes a lot of sense intuitively. Also, if you want more relevant information on whether some treatment/drug will help you, google for ‘treatment NNT’.

    This is a very important topic, and just like the others I am a little surprised that the “some people respond and some don’t” hypothesis does not get more discussion. It’s also extremely relevant to our personal health and stuff, so here is a bit of research mixed with some musings: I would have loved to find redeeming evidence for SSRIs, but what I found contradicts that hypothesis somewhat, and is more in support of there only being one class of patients. Long post forthcoming:

    As everybody already pointed out, verifying the hypothesis would be trivially easy if open data was available on the drug trials, but I was not able to find any. However, some of the summary statistics that are available are completely sufficient to answer that question:

    Odds ratios and effect sizes. Odds ratios are only defined in terms of a binary, clinically significant outcome, while effect sizes operate on a continuous measure. As some commenters pointed out, formulas exist to convert between those two measures, but those formulas assume either a normal or a logistic distribution (or just something bell-shaped), while the two cluster hypothesis would expect a bimodal distribution (two peaks). So all one has to do is grab a few studies which report the OR for their sample, and compare it with the estimate you would get by plugging Cohen’s d, which one can estimate from the studies that report a continuous outcome measure, into those formulas.

    If the estimate of the OR is lower than the actual OR, you are underestimating the number of patients that are responders.
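
    As a sketch of that comparison: the first function below is the common logistic-distribution conversion, ln(OR) = d · π / √3, which turns d = 0.3 into an OR of roughly 1.72. The second function is a hypothetical two-cluster model of my own for illustration (placebo arm standard normal, a fraction of the drug arm shifted by a fixed amount, “response” meaning improvement above some threshold); none of its parameters come from a real trial.

```python
import math
from statistics import NormalDist


def d_to_odds_ratio(d):
    """Logistic-distribution conversion: ln(OR) = d * pi / sqrt(3)."""
    return math.exp(d * math.pi / math.sqrt(3))


def mixture_odds_ratio(p_responder, d_responder, threshold):
    """Odds ratio for 'improvement > threshold' under a two-cluster model.

    Placebo arm: N(0, 1). Drug arm: non-responders N(0, 1), responders
    N(d_responder, 1). Everything is in placebo-SD units (hypothetical model).
    """
    base = NormalDist(0.0, 1.0)
    resp = NormalDist(d_responder, 1.0)
    p_placebo = 1.0 - base.cdf(threshold)
    p_drug = ((1.0 - p_responder) * p_placebo
              + p_responder * (1.0 - resp.cdf(threshold)))

    def odds(p):
        return p / (1.0 - p)

    return odds(p_drug) / odds(p_placebo)


if __name__ == "__main__":
    # Both scenarios have the same mean shift of 0.3 SD.
    print("bell-shaped estimate:", d_to_odds_ratio(0.3))
    print("10% responders, threshold 2.0:", mixture_odds_ratio(0.1, 3.0, 2.0))
    print("10% responders, threshold 1.0:", mixture_odds_ratio(0.1, 3.0, 1.0))
```

    In this toy model the gap between the two numbers depends strongly on where the response threshold sits, so the comparison is only informative if the studies being compared define “response” in similar ways.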

    Before I’ll go into SSRIs specifically, these formulas have empirical support from studies about pain relief:
    And here is one review for the applicability of these formulas for treatments of chronic depression:
    Also, the whole issue gets at least *some* attention, googling responder analysis turned up some interesting links.

    Now specifically about the article by Gueorguieva and Krystal, which I will freely admit goes over my head stats wise. However, there are some points I find questionable:

    First of all, they blame failures of antidepressant trials on the “responder vs non responder” pattern, where due to the use of continuous measures as opposed to binary outcomes no significant result was found. But some studies use discrete outcome measures, so that is an acceptable study design, so why do drug companies that fund those studies not just use those measures? Surely, even a half competent statistician would have been able to discern a bimodal distribution, and these studies have been conducted since the 60s.

    Now, in your past articles of the two large meta reviews an effect size of 0.3 and odds ratios of 1.66 were featured, which according to the conversion formulas are mostly identical. Probably not a coincidence, as most studies used in the recent meta analysis actually use simple effect sizes, and I presume that those were converted to the OR of 1.66 using the standard formulas. For Gueorguieva and Krystal to have found something meaningful, their OR should be larger than 1.66. From my interpretation of their results, this is not the case:

    They define their own two groups of responders and non responders, and maintain that these are two distinct groups. There is no clear cutoff for being a responder, and most importantly it differs from the usual criterion of being a responder, a certain reduction in your depression score. But taking their definition, the OR of being an SSRI responder vs a placebo responder is 0.69, or to put it the other way around, it’s 1 / 0.69 ≈ 1.45, which is lower than the estimate based on the effect sizes! So I don’t really see a saving grace in their approach, or I just don’t understand it.

    Another red flag is that their decision to model the data as consisting of two distinct clusters is not very convincing based on their provided data. Table 2 shows that the performance of the model with three clusters is very close to the two cluster model.

    Pretty unrelated to all of this, but while researching this, I found the metric of Number Needed to Treat (NNT), which is not about how statistically significant a treatment is, but how likely it is to be clinically significant. For example, according to this site, the Mediterranean diet has an NNT of 18 to prevent a repeat heart attack, meaning that if 18 people follow the diet, one of them will not get a heart attack that he/she otherwise would have. It’s nice, because it does not deal with relative risks compared to a baseline; instead it is a metric that directly tells you the absolute benefit.
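
    The number-needed-to-treat arithmetic is just the reciprocal of the absolute risk reduction. A minimal sketch, where the 20% and 10% event rates are hypothetical numbers picked only to show the calculation:

```python
def number_needed_to_treat(control_event_rate, treated_event_rate):
    """NNT = 1 / absolute risk reduction (both rates given as fractions)."""
    arr = control_event_rate - treated_event_rate
    if arr <= 0:
        raise ValueError("no absolute risk reduction, so NNT is undefined")
    return 1.0 / arr


def absolute_risk_reduction(nnt):
    """Invert an NNT back into the absolute risk reduction it implies."""
    return 1.0 / nnt


# Hypothetical rates: 20% of untreated and 10% of treated patients have the event.
print(number_needed_to_treat(0.20, 0.10))

# An NNT of 18 corresponds to an absolute risk reduction of about 5.6 points.
print(round(absolute_risk_reduction(18), 4))  # 0.0556
```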

    • Douglas Knight says:

      That’s a good point about converting odds ratios and z-scores, but I’d worry that when we have both they are the result of such conversion, especially in meta-analyses.

  23. Steve Sailer says:

    “But I can’t find any other literature on this possibility, which is surprising, because if it were true it should be pretty obvious, and if it were false it should still be worth somebody’s time to debunk.”

    Or maybe you’ve figured out something that is new, true, and important.

    I wouldn’t rule that out.

  24. Murphy says:

    “The traditional way to do this is to say that psychiatrists and patients are wrong. Given all the possible biases involved, they misattribute placebo effects to the drugs, or credit some cases that would have remitted anyway to the beneficial effect of SSRIs, or disproportionately remember the times the drugs work over the times they don’t. While “people are biased” is always an option, this doesn’t fit the magnitude of the clinical evidence that I (and most other psychiatrists) observe. There are patients who will regularly get better on an antidepressant, get worse when they stop it, get better when they go back on it, get worse when they stop it again, et cetera. This raises some questions of its own, like why patients keep stopping antidepressants that they clearly need in order to function, but makes bias less likely. Overall the clinical evidence that these drugs work is so strong that I will grasp at pretty much any straw in order to save my sanity and confirm that this is actually a real effect.”

    The harsh reality is that clinicians are utterly terrible at assessing the efficacy of drugs used in normal treatment unless they’re miracle-level stuff like antibiotics.

    There’s a reason we do gigantic RCT’s, they’re not just some formal fun regulatory ritual that serves no real purpose.

    Doctors utterly suck at looking at the handful of patients in front of them and distinguishing the effects of a drug from all the surrounding crap. Your patient regularly gets worse when they stop taking their meds? Wow! Have you also noticed that they stop taking their meds whenever they’re single and there’s nobody there to badger them to take them and all the other structured stuff in their life falls apart?

    [repeat for about a zillion variations]

    Doctors are awful at looking at their comparatively tiny number of patients and distinguishing what drugs work. If you swapped out a random drug for sugar pills many doctors wouldn’t notice even if it was something really important with a big effect size and their patients were dying at a greatly increased rate.

    it’s a simple, boring unsexy answer but it’s the basic problem.

    I also get the feeling it’s counter to the intuitions of a lot of commenters here who would like to believe that we could do away with all the expensive clinical trials stuff without any real loss. There was even a cringingly awful economics paper being thrown around here a while back that was based entirely on the assumption that doctors wouldn’t be affected by ad campaigns and could reliably distinguish low-efficacy meds from high-efficacy through normal practice.

    For doctors to really notice you need spectacular effects, either super-rare and dramatic side effects or miracle cures that see someone walking out the door that the doctor had written off as already dead.

    You’ve pointed out before how many patients appear to show improvement long before it should be possible for the drugs to have had any physical effect.

  25. jastice01 says:

    Depression, Anxiety, OCD, Tourette’s are commonly comorbid, but otherwise somewhat orthogonal, certainly not on a linear spectrum. For example, I have noticeable Tourette’s, hardly any OCD, only low background anxiety, no depression. I haven’t tried SSRIs.

  26. Purcell_effect says:

    What has always bothered me about the field of medicine (as a complete outsider) is that the diseases are defined by the set of symptoms they exhibit. On the other hand, medication doesn’t target symptoms, but some underlying mechanisms. In the absence of one-to-one mapping between the biological abnormalities causing the diseases and their symptoms, it is no wonder that we end up with “puzzles” like “why do SSRIs work for some people but not for others?” A more level-headed way of thinking about the problem would be to give one name to the set of biological abnormalities that are healed by SSRIs (let’s say depression A) and another name for other types of abnormalities. It wouldn’t even have to be discrete classification, but could also be a set of continuous parameters describing whatever is causing the symptoms. Then the field could move on from wondering about “effect sizes” (i.e. “this-and-this SSRI seems to work in 10% of the cases”) to more concrete things like “this SSRI works in 100% of the cases on people who have an imbalance of X chemical in Y part of the brain with gene Z present.” I wouldn’t even be so sure that statistics would be the correct tool to attack this kind of problem; a better one might be going very deep into a few cases where a given medicine works, figuring out what exactly happens in the bodies of those people, and comparing how it is different to what happens in the bodies of other people who get zero benefit from the same medicine.

  27. Richard Roman says:

    Benzodiazepines may fail as anxiolytics, but they do perform far better (more than DOUBLE the effect size of SSRIs) if sleep onset latency is the outcome criterion, according to this review:
    See Fig. 1:

    Not all psychiatric drugs are failures, but both SSRIs and DRD2 antagonists most certainly are.

    Also, practitioners’ anecdotes are horrible “evidence” for drug effectiveness. If a certain patient gets worse during treatment, this does not imply any deleterious effects of the treatment. The treatment could be highly effective and the patient’s health could be much worse without it. (This is the most obvious counter-example.)

    On a side note: Does anyone know how to find data on the raw effect sizes of the actual mental disorders? Given the fact that RCTs for both depression and schizophrenia tend to use psychometric test scores (HDRS + BDI and PANSS + SAPS/SANS + BPRS, respectively) with no external validation as primary outcomes, I think it is safe to assume their effect sizes to be at least 4, in order to avoid an overlap of the 2-sigma-intervals of patients and healthy controls.

    If you compare the ~0.3 and ~0.5 effect sizes of SSRIs and DRD2 antagonists, respectively, to the hypothetical ~4.0 effect sizes of the disorders they are supposed to treat, this signifies 7.5% and 12.5% symptom reduction, which is hardly worth the adverse effects.

    Conclusion of the Kirsch and the Cipriani meta-analyses:
    1) The “shotgun approach” to psychiatric pharmacotherapy should be dead by now.
    2) Unless the “bipartite response” theory is true for both SSRIs and DRD2 antagonists AND there are reliable ways of preselecting responders, so are these classes of drugs.
    3) Even if the response is bimodal and responders can be identified without throwing a dozen of SSRIs at them within a whole year, there is still the elephant in the room: What to do with non-responders that do not remit spontaneously?

    The “bipartite response” theory is one that can be easily examined: Just obtain the frequency distribution of symptom extent/reduction in the verum group of any RCT and check whether it is bimodal or not. If it is true, you can expect one peak for non-responders, that reflects the improvement achieved by the natural course of the disease and placebo effects, and one peak for responders, who experienced additional symptom alleviation through the pharmacodynamic effects of SSRIs.

    • Douglas Knight says:

      Nothing about medical statistics is safe to assume. I think that Scott wrote that he tried to answer this question and failed. My guess is 2 sigma.

      • Richard Roman says:

        Regarding effect sizes of the raw disorder, I found this very old (1995) paper:

        According to Table 4 (p. 478), patients (n = 135) score a mean of 22.39 on the Hamilton Depression Rating Scale (HDRS) with a standard deviation of 3.69, whereas healthy controls (n = 117) score 3.67 on average (SD = 3.11). The weighted mean standard deviation is (117 * 3.11 + 135 * 3.69) / (117 + 135) = 3.42. The difference of the mean values is 22.39 – 3.67 = 18.72. The effect size of depression is 18.72 / 3.42 = 5.47.

        Similarly, the Hamilton Depression Inventory 17 (HDI-17) indicates 3.99 (SD = 2.99) for the control group and 22.22 (SD = 5.12) for major depression. In this case, the weighted mean standard deviation is (117 * 2.99 + 135 * 5.12) / (117 + 135) = 4.13, the difference of the mean values being 22.22 – 3.99 = 18.23. Here, the effect size of depression is 18.23 / 4.13 = 4.41.
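
        For anyone who wants to rerun these numbers, here is a small sketch. The first helper reproduces the sample-size-weighted mean of the two SDs used above; the second is the textbook pooled SD for Cohen’s d (which pools variances rather than SDs) and is included only as a cross-check, not because the comment used it. The function names are made up for this example.

```python
import math


def weighted_mean_sd(n1, sd1, n2, sd2):
    """Sample-size-weighted mean of two SDs (the shortcut used in the comment)."""
    return (n1 * sd1 + n2 * sd2) / (n1 + n2)


def pooled_sd(n1, sd1, n2, sd2):
    """Textbook pooled SD for Cohen's d: pools variances with n - 1 weights."""
    return math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))


def effect_size(mean_patients, mean_controls, sd):
    """Standardized mean difference between patients and controls."""
    return (mean_patients - mean_controls) / sd


# HDRS: controls n=117, mean 3.67, SD 3.11; patients n=135, mean 22.39, SD 3.69
hdrs_simple = effect_size(22.39, 3.67, weighted_mean_sd(117, 3.11, 135, 3.69))
hdrs_pooled = effect_size(22.39, 3.67, pooled_sd(117, 3.11, 135, 3.69))

# HDI-17: controls mean 3.99, SD 2.99; patients mean 22.22, SD 5.12
hdi_simple = effect_size(22.22, 3.99, weighted_mean_sd(117, 2.99, 135, 5.12))
hdi_pooled = effect_size(22.22, 3.99, pooled_sd(117, 2.99, 135, 5.12))

print(round(hdrs_simple, 2), round(hdrs_pooled, 2))
print(round(hdi_simple, 2), round(hdi_pooled, 2))
```

        The two ways of pooling give slightly different values, but either way the patient-vs-control difference comes out well above 4 standard deviations on both scales.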

        As for schizophrenia, I failed to figure out concrete numbers and only found this paper:

        In Fig. 1, you see Positive and Negative Syndrome Scale (PANSS) averages for healthy controls and early course and chronic schizophrenia patients (just ignore the ketamine groups for this purpose). The authors don’t mention raw numbers even in the supplementary info, but from visual inspection the mean values for PANSS total scores seem to be ~30, ~71 and ~61, respectively. In the text, they mention the effect size of early stage vs. chronic schizophrenia to be d = 1.3. So, if we assume the pooled standard deviation to be 7.7 (roughly (71 – 61) / 1.3), we get effect sizes of (61 – 30) / 7.7 = ~4.0 for chronic and (71 – 30) / 7.7 = ~5.3 for early stage schizophrenia.

        tl;dr: Effect sizes for the raw disorders seem to be about 4.4 – 5.5 for depression and 4.0 – 5.3 for schizophrenia.

        Compare that to the effect sizes of the most effective drugs (amitriptyline with d = 0.48 according to Cipriani’s recent meta-analysis and clozapine with d = 0.88 according to Leucht’s 2013 meta-analysis).

        If you calculate the crude ratios, this signifies ~9% (0.48 / 5.47) to ~11% (0.48 / 4.41) and ~17% (0.88 / 5.3) to ~22% (0.88 / 4.0) symptom reduction with the most effective available drugs for depression and schizophrenia, respectively.

        Psychiatric pharmacotherapy apparently has a bunch of HUGE problems right now. And psychological treatments do not seem to fare any better, frankly. If anybody recovers from depression or schizophrenia to a meaningful extent at this stage, this is most likely due to spontaneous remission and maybe placebo/non-specific effects, not the actual treatment.

        • Douglas Knight says:

          I wouldn’t call 1995 very old.

          • Richard Roman says:

            I wish I could provide some better data, with larger sample sizes especially, but these papers are the only ones I’ve found so far.

            Cohen’s guidelines for interpreting effect sizes (0.2 = small, 0.5 = moderate, 0.8 = large) are largely bogus for medical statistics.

            It is just far more reasonable to compare the effect sizes of the treatment that is to be examined with those of:
            a) the raw disease (standardized mean difference between patients and healthy controls)
            b) the natural course/regression to the mean of the untreated disease within the time of a typical trial
            c) placebo and other non-specific effects

            In order to be considered effective, a particular treatment has to alleviate a noticeable and relevant proportion of overall symptoms AND be at least non-trivial when compared to the natural course. (The latter is another concern for SSRIs: Kirsch’s 2008 meta-analysis found a 0.92 effect size for natural course + placebo, which is about thrice the effect of SSRIs.)

            On a side note, a little thought experiment: Let’s assume SSRIs’ antidepressant effects turn up cumulatively all within a single time frame and there are no placebo effects or spontaneous remissions. Now, patients take their SSRI for a month (no delayed onset either). How many completely symptom-free/asymptomatic days could they expect on average?

            Given the ~6.8% (0.30 / 4.41, see the comment above) symptom relief by SSRIs, the estimate is about two days (0.068 * 30 days = 2.04 days) of a whole month. But in order to achieve that outcome, they’d be required to take their SSRIs regularly throughout the month and experience possible adverse effects for the whole time.
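
            The same thought experiment can be run across the whole range of raw depression effect sizes from the comment above (4.41 to 5.47), not just the lower end. A rough sketch, with a hypothetical helper name and the same no-placebo, no-remission, no-delayed-onset assumptions:

```python
SSRI_D = 0.30                    # SSRI effect size used in the comment
DISEASE_D_RANGE = (4.41, 5.47)   # raw depression effect sizes from the comment above
DAYS_IN_MONTH = 30


def equivalent_symptom_free_days(drug_d, disease_d, days=DAYS_IN_MONTH):
    """Convert the drug/disease effect-size ratio into 'fully symptom-free' days."""
    return drug_d / disease_d * days


days_high = equivalent_symptom_free_days(SSRI_D, DISEASE_D_RANGE[0])
days_low = equivalent_symptom_free_days(SSRI_D, DISEASE_D_RANGE[1])

print(f"between {days_low:.2f} and {days_high:.2f} symptom-free days per month")
```

            With the larger disease effect size, the estimate drops somewhat below two days.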

            The big question here: Are these two non-depressed days worth the side effects?

            In my opinion, this is not even a debate about “SSRIs are weak, yet unequivocally beneficial drugs, but we could really use something else/better that bridges the gap between SSRI response and full remission” right now. Personally, I wouldn’t even feel comfortable asserting that SSRIs’ expected net benefits are non-negative at this stage.

            (And this is without even considering the possibility that some unknown proportion of the 0.3 effect size does actually reflect the active versus inert placebo difference, rather than “true” pharmacodynamic effects.)

  28. MH says:

    “There are patients who will regularly get better on an antidepressant, get worse when they stop it, get better when they go back on it, get worse when they stop it again, et cetera. This raises some questions of its own, like why patients keep stopping antidepressants that they clearly need in order to function, but makes bias less likely.”

    Isn’t this one just the obvious fact that (particularly in the chronic longer term cases) mental illnesses don’t just hit people like a blanket thrown on top of them, but live down inside who they are as a person? Drugs that help out with them, especially ones that come with real cost in side effects, seem like the sorts of things where you’d see this pattern showing up to me.

    While I did have trouble at one point, I got lucky and turned out to not be the sort of person who needs to keep taking the drugs forever in order to function. But boy does the sort of person I am change if I ever stop taking them…

  29. BPC says:

    “This raises some questions of its own, like why patients keep stopping antidepressants that they clearly need in order to function”

    Speaking as someone who has tried to stop twice and failed both times, the answer is pretty straightforward – SSRIs, like every medication, have side-effects. The escitalopram I’m taking doesn’t really play nice with booze (and I like drinking) or weed (and I like toking), and it has had negative effects on my libido. Also, if I forget my pills while on vacation somewhere, I’m going to end up going through forced withdrawal unless I can convince a pharmacist that I just forgot my meds, which is not always a given. If I could function just as well without SSRIs, I would very much appreciate it. Which is why I tried to get off them. It did not go well.

  30. christmansm says:

    OCD and tic disorders appear closely related in a number of interesting respects. Nonetheless, while SSRIs are notably effective in OCD (larger effect size than MDD) they don’t have any evidence for reducing tics themselves. That linear model makes some intuitive sense, but I don’t think it’s likely to help predict much.

  31. tvt35cwm says:

    You are saying, “but what if the distribution is bimodal?”.

    Surely this kind of issue can be resolved by reporting not just the location of the treatment effect, but its dispersion (variance) and modality (number of peaks in the curve). Simply change requirements for publication in the big journals so that where a statistical analysis is used, papers must disclose location, dispersion, skew, and modality, not just location.
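
    As a rough illustration of what such a disclosure could look like, here is a small Python sketch. The per-patient improvement scores are invented for the example, and the peak count is a deliberately crude local-maximum check on integer scores, not a proper statistical test for multimodality.

```python
from collections import Counter
from statistics import mean, stdev

# Hypothetical per-patient improvement scores: a non-responder cluster near 1
# and a responder cluster near 10. Made up purely for illustration.
improvements = [0, 0, 1, 1, 1, 2, 2, 9, 10, 10, 10, 11, 11]


def skewness(values):
    """Moment-based skewness: third central moment over variance to the power 1.5."""
    m = mean(values)
    n = len(values)
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def count_modes(values):
    """Count integer scores that occur strictly more often than both neighbours."""
    counts = Counter(values)  # missing scores count as 0
    modes = 0
    for score in range(min(values), max(values) + 1):
        if counts[score] > counts[score - 1] and counts[score] > counts[score + 1]:
            modes += 1
    return modes


location = mean(improvements)
dispersion = stdev(improvements)
skew = skewness(improvements)
modality = count_modes(improvements)

print(f"location={location:.2f} dispersion={dispersion:.2f} "
      f"skew={skew:.2f} modes={modality}")
```

    Here the mean lands between the two clusters, at a value no individual patient actually experienced, which is exactly what a location-only report would hide.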

  32. Ruben says:

    One simple reason (not mutually exclusive to the others): our outcomes are not measured so reliably – some may even have quite poor reliability of change (what you’d want for change scores to be reliable). Reliability estimates in the literature may overestimate the reliability of what we actually want to measure (because some of the reliability comes from artifactual stuff like similar response styles).

  33. LB says:

    I just read an article discussing research being done to understand the Placebo Effect wherein it was stated that “A 2015 study published in the journal Pain analyzed 84 clinical trials of pain medication conducted between 1990 and 2013 and found that in some cases the efficacy of placebo had grown sharply, narrowing the gap with the drugs’ effect from 27 percent on average to just 9 percent.” It was found that the interaction between the doctor and patient greatly affected the outcomes. The article also touched on Scott’s thought regarding some patients having a gene that makes them more or less susceptible to a particular treatment. The comments contained interesting thoughts on this topic as well, including the concern that if the use of “Placebos” became widespread (because they worked), they might stop working as well because people’s belief in the treatment would drop – very Catch-22 like.

  34. name99 says:

    There is even a way to make this mathematical (ie valid to a certain class of people who refuse to accept anything less).

    Consider a random variable X, and a function of that random variable, f.
    It is a common-place if you have studied measure-theoretic probability (and a source of astonishment to the bulk of people who have not) that f(E[X]) != E[f(X)] if f is non-linear, i.e. f of the mean value of X does not equal the mean value of f(X).
    (in fact, to leading order, when f is smooth enough, the difference between the two is about half the second derivative of f at the mean times the variance of X, so it depends on both the curvature of f and the standard deviation of X).

    OK, relevance to us is that we have a random variable X which could be an indicator variable (prozac responder vs prozac insensitive) or maybe “amount of protein created in response to the drug” or whatever; and this has wide dispersion. That’s input to a utility function f, the utility derived from a particular value of X. So if the indicator variable X is 1, then the utility (prozac sensitive) is, I don’t know, +1000. And if the indicator variable X is 0, the utility is, I don’t know, -1 for some hassle, some dollar cost, some nausea.

    We now have basically a perfect storm of an extremely non-linear function being fed a variable with extremely wide dispersion. And yet (this is my interpretation of the sort of language you’re describing) this effect size language is language that is only mathematically acceptable when the utility function f is linear and/or the random variable feeding it is extremely low dispersion.

    In other words, to put it bluntly, using this sort of language of effect size, and the linear mentality behind it, is amateur hour, a totally unacceptable usage of mathematics. (And something we would not see in, eg serious finance, which is where much of this mathematics was perfected, because many derivatives contracts are non-linear, so this stuff matters!)
    Perhaps you can persuade every medical researcher you meet to at least take this mathematics as seriously as the lowest-level quant out there?
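
    A toy numerical version of this argument, as a sketch only: X is a hypothetical drug response between 0 and 1, the +1000 and -1 utilities are the placeholder numbers from above, and the 30% responder share and the 0.5 threshold are invented for illustration.

```python
def utility(x, threshold=0.5):
    """Step utility: a large benefit above a response threshold, a small cost below it."""
    return 1000.0 if x >= threshold else -1.0


# Hypothetical population as (response, probability) pairs:
# 70% do not respond at all, 30% respond fully.
population = [(0.0, 0.7), (1.0, 0.3)]

mean_response = sum(x * p for x, p in population)
utility_of_mean = utility(mean_response)                    # f(E[X])
mean_utility = sum(utility(x) * p for x, p in population)   # E[f(X)]

print(f"f(E[X]) = {utility_of_mean:.1f}")
print(f"E[f(X)] = {mean_utility:.1f}")
```

    Judging the drug by what happens to the "average patient" says it is a small net loss, while averaging the utilities over actual patients says it is a large net gain. The two only agree when f is linear or X has very little spread.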