More Confounders

[Epistemic status: Somewhat confident in the medical analysis, a little out of my depth discussing the statistics]

For years, we’ve been warning patients that their sleeping pills could kill them. How? In every way possible. People taking sleeping pills not only have higher all-cause mortality. They have higher mortality from every individual cause studied. Death from cancer? Higher. Death from heart disease? Higher. Death from lung disease? Higher. Death from car accidents? Higher. Death from suicide? Higher. Nobody’s ever proven that sleeping pill users are more likely to get hit by meteors, but nobody’s ever proven that they aren’t.

In case this isn’t scary enough, it only takes a few sleeping pills before your risk of death starts shooting up. Even if you take sleeping pills only a few nights per year, your chance of dying doubles or triples.

When these studies first came out, doctors were understandably skeptical. First, it seems suspicious that so few sleeping pills could have such a profound effect. Second, why would sleeping pills raise your risk of everything at once? Lung disease? Well, okay, sleeping pills can cause respiratory depression. Suicide? Well, okay, overdosing on sleeping pills is a popular suicide method. Car accidents? Well, sleeping pills can keep you groggy in the morning, and maybe you don’t drive very well on your way to work. But cancer? Nobody has a good theory for this. Heart disease? Seems kind of weird. Also, there are lots of different kinds of sleeping pills with different biological mechanisms; why should they all cause these effects?

The natural explanation was that the studies were confounded. People who have lots of problems in their lives are more stressed. Stress makes it harder to sleep at night. People who can’t sleep at night get sleeping pills. Therefore, sleeping pill users have more problems, for every kind of problem you can think of. When problems get bad enough, they kill you. This is why sleeping pill users are more likely to die of everything.

This is a reasonable and reassuring explanation. But people tried to do studies to test it, and the studies kept finding that sleeping pills increased mortality even when adjusted for confounders. Let’s look at a few of the big ones:

Kripke et al 2012 followed 10,529 patients and 23,676 controls for an average of 2.5 years. They used a sophisticated de-confounding method which “controlled for risk factors and [used] up to 116 strata, which exactly matched cases and controls by 12 classes of comorbidity”. Sleeping pill users still had 3-5x the risk of death, regardless of which of various diverse sedatives they took. Even users in their lowest-exposure category, fewer than 18 pills per year, had 3.6x the mortality rate. Cancer rate in particular increased by 1.35x.

Kao et al 2012 followed 14,950 patients and 60,000+ matched controls for three years. They tried to match cases and controls by age, sex, and eight common medical and psychiatric comorbidities. They still found that Ambien approximately doubled rates of oral, kidney, esophageal, breast, lung, liver, and bladder cancer, and slightly increased rates of various other types of cancer as well.

Weich et al 2014 took 34,727 patients on sleeping pills and 69,418 controls and followed them for eight years. They controlled for sex, age, sleep disorders, anxiety disorders, other psychiatric disorders, a measure of general medical morbidity, smoking, alcohol use, medical clinic (as a proxy for socioeconomic status), and prescriptions for other drugs. They also excluded all deaths in the first year of their study to avoid patients who were prescribed sleeping pills for some kind of time-sensitive crisis – and check the paper for descriptions of some more complicated techniques they used for this. But even with all of these measures in place to prevent confounding, they still found that the patients on sedatives had three times the death rate.

This became one of the rare topics to make it out of the medical journals and into popular consciousness. Time Magazine: Sleeping Pills Linked With Early Death. AARP: Rest Uneasy: Sleeping Pills Linked To Early Death, Cancer. The Guardian: Sleeping Pills Increase Risk Of Death, Study Suggests. Most doctors I know are aware of these results, and have at least considered changing their sedative prescribing habits. I’ve gone back and forth: such high risks are inherently hard to believe, but the studies sure do seem pretty good.

This is the context you need to understand Patorno et al 2017: Benzodiazepines And Risk Of All Cause Mortality In Adults: Cohort Study.

P&a focus on benzodiazepines, a class of sedatives commonly used as sleeping pills, and one of the types of drugs analyzed in the studies above. They do the same kind of analysis as the other studies, using a New Jersey Medicare database to follow 4,182,305 benzodiazepine users and 35,626,849 non-users for nine years. But unlike the other studies, they find minimal to zero difference in mortality risk between users and non-users. Why the difference?

Daniel Kripke, one of the main proponents of the sleeping-pills-are-dangerous hypothesis, thinks it’s because of the switch from looking at all sleeping pills to looking at benzodiazepines in particular. In a review article, he writes:

[Patorno et al] was not included [in this review] because it was not focused on hypnotics, specifically excluded nonbenzodiazepine “Z” drugs such as zolpidem, and failed to compare drug use of cases and controls during follow-ups.

I’m not sure this matters that much. Most of the studies of sleeping pills, including Kripke’s own study, included benzodiazepines and specifically analyzed them as a separate subgroup, and found they greatly increased mortality risk. For example, Kripke 2012 finds that the benzodiazepine sleeping pill temazepam increased the death hazard ratio to 3.7x, the same as Ambien and everything else. If Patorno’s study is right, Kripke’s study is wrong about benzodiazepines and so (one assumes) probably wrong in the same way about Ambien and everything else. I understand why Kripke might not want to include it in a systematic review with stringent inclusion criteria, but we still have to take it seriously.

He’s also concerned about the use of an intention-to-treat design. This is where your experimental group is “anyone who was prescribed medication to begin with” and your control group is “anyone who was not prescribed medication to begin with”. If people switch, they stay in the same group – for example, if someone taking medication stops taking it, they stay in the “taking medication” group. This is the gold standard for medical research because having people switch groups midstream can introduce extra biases. But if people in the “taking medication” group end up taking no more medication than people in the “not taking medication” group, obviously it’s impossible for your study to get a positive finding. So although P&a were justified in using an intention-to-treat design, Kripke is also justified in worrying that it might get the wrong result.
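
As a toy illustration of that worry (made-up numbers, nothing from either study): suppose a drug really does shift some outcome by 1.0, but there is heavy crossover between the prescribed and unprescribed groups. The intention-to-treat comparison gets diluted toward zero:

```python
import random

random.seed(0)

# Hypothetical: the drug truly shifts the outcome by +1.0, but only 60% of
# the prescribed group actually takes it, while 40% of the never-prescribed
# group ends up taking it anyway (heavy crossover).
n = 50_000
prescribed, took, outcome = [], [], []
for _ in range(n):
    p = random.random() < 0.5                  # prescribed at baseline?
    t = random.random() < (0.6 if p else 0.4)  # actually takes the drug?
    y = 1.0 * t + random.gauss(0, 1)           # effect flows only through actual use
    prescribed.append(p); took.append(t); outcome.append(y)

def mean_diff(flags):
    a = [y for y, g in zip(outcome, flags) if g]
    b = [y for y, g in zip(outcome, flags) if not g]
    return sum(a) / len(a) - sum(b) / len(b)

itt = mean_diff(prescribed)   # intention-to-treat: group by prescription
as_treated = mean_diff(took)  # as-treated: group by actual use
print(f"ITT: {itt:.2f}, as-treated: {as_treated:.2f}")
```

With this much crossover the ITT estimate lands near (0.6 − 0.4) × 1.0 = 0.2, a fifth of the true effect, while the as-treated comparison recovers roughly 1.0 – which is why a null ITT result is weaker evidence of no effect when the two groups’ actual drug consumption overlaps heavily.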

But the authors respond by giving a list of theoretical reasons why they were right to use intention-to-treat, and (more relevantly) repeating their analysis doing the statistics the other way and showing it doesn’t change the results (see page 10 here). Also, they point out that some of the studies that did show the large increases in mortality also used intention-to-treat, so this can’t explain the differences between their studies and previous ones. Overall I find their responses to Dr. Kripke’s concerns convincing. Also, my prior on a few sleeping pills per year tripling your risk of everything is so low that I’m biased towards believing P&a.

So why did they get such different results from so many earlier studies? In their response to Kripke, they offer a clear answer:

They adjusted for three hundred confounders.

This is a totally unreasonable number of confounders to adjust for. I’ve never seen any other study do anything even close. Most other papers in this area have adjusted for ten or twenty confounders. Kripke’s study adjusted for age, sex, ethnicity, marital status, BMI, alcohol use, smoking, and twelve diseases. Adjusting for nineteen things is impressive. It’s the sort of thing you do when you really want to cover your bases. Adjusting for 300 different confounders is totally above and beyond what anyone would normally consider.

Reading between the lines, one of the P&a co-authors was Robert Glynn, a Harvard professor of statistics who helped develop an algorithm that automatically identifies massive numbers of confounders to form a “propensity score”, then adjusts for it. The P&a study was one of the first applications of the algorithm on a controversial medical question. It looks like this study was partly intended to test it out. And it got the opposite result from almost every past study in this field.
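
The paper’s actual algorithm isn’t reproduced here, but the basic propensity-score maneuver it builds on can be sketched on made-up data where the true drug effect is zero and a single covariate confounds everything: fit a model of who gets treated, then compare outcomes only within strata of similar treatment propensity.

```python
import math
import random

random.seed(1)

def sigmoid(z):
    # Numerically stable logistic function.
    return 1 / (1 + math.exp(-z)) if z >= 0 else math.exp(z) / (1 + math.exp(z))

# Made-up data: x1 drives both treatment and the outcome (a confounder),
# x2 is noise, and the true treatment effect is zero.
n = 10_000
X, T, Y = [], [], []
for _ in range(n):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    t = random.random() < sigmoid(1.5 * x1)
    y = 0.0 * t + 1.0 * x1 + random.gauss(0, 1)
    X.append((x1, x2)); T.append(t); Y.append(y)

# Step 1: fit a logistic propensity model P(T=1 | x1, x2) by gradient ascent.
w = [0.0, 0.0, 0.0]  # intercept, x1, x2
for _ in range(150):
    g = [0.0, 0.0, 0.0]
    for (x1, x2), t in zip(X, T):
        err = (1.0 if t else 0.0) - sigmoid(w[0] + w[1] * x1 + w[2] * x2)
        g[0] += err; g[1] += err * x1; g[2] += err * x2
    w = [wi + 2.0 * gi / n for wi, gi in zip(w, g)]
score = [sigmoid(w[0] + w[1] * x1 + w[2] * x2) for x1, x2 in X]

# Step 2: compare treated vs untreated outcomes within propensity deciles.
order = sorted(range(n), key=lambda i: score[i])
diffs = []
for d in range(10):
    idx = order[d * n // 10:(d + 1) * n // 10]
    tr = [Y[i] for i in idx if T[i]]
    co = [Y[i] for i in idx if not T[i]]
    if tr and co:
        diffs.append(sum(tr) / len(tr) - sum(co) / len(co))
adjusted = sum(diffs) / len(diffs)

naive = (sum(y for y, t in zip(Y, T) if t) / sum(T)
         - sum(y for y, t in zip(Y, T) if not t) / (n - sum(T)))
print(f"naive: {naive:.2f}, propensity-adjusted: {adjusted:.2f}")
```

On this toy data the naive comparison shows a large spurious “effect” while the decile-stratified estimate sits near the true zero. What the high-dimensional machinery adds, roughly, is an automated way of deciding which of thousands of claims-database codes get fed into the propensity model in the first place.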

I don’t know enough to judge the statistics involved. I can imagine ways in which trying to adjust out so many things might cause some form of overfitting, though I have no evidence this is actually a concern. And I don’t want to throw out decades of studies linking sleeping pills and mortality just because one contrary study comes along with a fancy new statistical gadget.

But I think it’s important to notice: if they’re right, everyone else is wrong. If you’re using a study design that controls for things, you’re operating on an assumption that you have a pretty good idea what things are important to control for, and that if you control for the ten or twenty most important ones you can think of then that’s enough. If P&a are right (and again, I don’t want to immediately jump to that conclusion, but it seems plausible) then this assumption is wrong. At least it’s wrong in the domain of benzodiazepine prescription and mortality. Who knows how many other domains it might be wrong in? Everyone who tries to “control for confounders” who isn’t using something at least as good as P&a’s algorithm isn’t up to the task they’ve set themselves, and we should doubt their results (also, measurement issues!).

This reminds me of how a lot of the mysteries that troubled geneticists in samples of 1,000 or 5,000 people suddenly disappeared once they got samples of 100,000 or 500,000 people. Or how a lot of seasonal affective disorder patients who don’t respond to light boxes will anecdotally respond to gigantic really really unreasonably bright light boxes. Or of lots of things, really.

166 thoughts on “More Confounders”

  1. NooneOfConsequence

    Hi. Long time lurker, first time poster. I am a biostatistician and felt like I needed to voice some skepticism of both the methods and the data used in Patorno et al. My problem with the first is the use of propensity score matching in an inappropriate context. Note that I am *not* saying that propensity scores are useless in general, but that *matching* on propensity scores generated in high-dimensional data (e.g., when taking 300 possible predictors of benzodiazepine use as registered in a conglomerate of 1.2 million EHR and insurance claim records from a patchwork quilt of providers and ACOs in the UnitedHealthcare network) is quite liable to introduce bias. This working paper by Gary King (also at Harvard) was what helped me understand the mechanism of why this occurs:

    My summary is that what King and Nielsen call “the PSM paradox” happens because the propensity score is outside the space of the original data, and in some situations *including this one* is prone to inducing model dependence and biased estimation. This is especially true as larger numbers of experimental units get pruned out of the dataset by not having a suitable match somewhere else in the sample, as you’d expect: the less of your sample going into your analysis, the more you lean on the assumption that the excluded data is like the included data. One more way of looking at it: if you lived in scientopia, and could conduct the perfect clinical trial to settle this question, you would never make an entry criterion of “we need to also find someone equally likely to opt into the treatment as you are, given your constellation of baseline covariates”. That would exclude a huge number of eligible subjects, and make your findings much less generalizable, but that is what you do in PSM. PSM gives you a better case-control study, not a poor man’s RCT.

    So the methods alone were able to make me raise an eyebrow, but it actually gets worse, as I have experience working with what I believe is that same dataset from Optum Clinformatics, or at least an overlapping set of it. My device industry employer leases data from them, for purposes of gathering evidence of product performance in a post-trial clinical setting. And calling that set of records “data” isn’t as accurate as calling it “low-grade data ore”, as cleaning it has taken longer than almost any other similar effort I’ve ever done, and would have been impossible if we did not have some former employees of the vendor working for us now. It’s clear that the hundreds of underlying sources getting mushed together each had very different structures originally, causing many values for a lot of key variables to be missing in a very much not-at-random way. Add to that the fact that it’s performed pitifully when validated against our device registries, and maybe a dozen other reasons, and you’ll understand why I vocally objected to our renewal of the data lease. I’ve seen a lot of attempts made by myself and coworkers to say anything useful with it, and it seems to only provide good estimates in a few special cases. Maybe this analysis by Patorno et al would be one, but I have my doubts, especially given this study’s variance with the other literature.

    And FWIW I will be cautioning my mom about taking sleeping meds next time I talk to her.

    1. NeuroStats

      Another statistician here, but not an expert on the use of propensity scores.

      Given that the paper has a null result, one critical question is whether they have overadjusted and whether the causal effect estimate is downward biased. I realize PSM methods might introduce upward bias for several reasons, but is that a relevant concern in this scenario, given how the authors have gone about choosing their set of confounders?

      A recent review from Sander Greenland and colleagues suggests bias could go in either direction.

      Propensity-score methods are sometimes promoted to address the concerns we have discussed. Even in cohort studies, however, propensity-score matching may lead to overadjustment and variance inflation, or poor control of strong confounders [60, 61, 62], and can also generate spurious results in case–control studies [63].

      The Pearl paper cited as [61] above in particular says

      In summary, the effectiveness of PS methods rests critically on the choice of covariates, X, and that choice cannot be left to guesswork; it requires that we understand, at least figuratively, what relationships may exist between observed and unobserved covariates and how the choice of the former can bring about strong ignorability or a reasonable approximation thereof.

      Assuming lack of statistical power is not at play here, the key issue, I think, is whether their method of choosing variables is likely to create downward bias, especially by inadvertently treating instrumental variables or mediators as confounders to match on. This is clearly a problem some of the co-authors are aware of from previous work and try to mitigate using available subject-matter knowledge. But they might still be wrong.

      Given the results of the paper and the techniques used, I strongly suspect that the previous literature at the very least has substantial residual confounding. Unless they have failed to exclude substantial instrumental variables, it seems more likely that the rest of the field fails to adequately control for unmeasured confounding. This seems likely even if i) the conclusion of this study is too pessimistic and there genuinely is a small causal effect of benzodiazepines on all-cause mortality, or ii) the situation does not adequately generalize to the case of sleeping pills.

      One other possibility is that there is both substantial residual confounding as well as M-bias/selection bias & Z-bias in the choice of covariates scientists routinely control for. If we are in this regime and our current knowledge is utterly inadequate to rule out M/Z-bias then it seems totally unclear whether either the previous literature or the Patorno paper can be trusted.

      What you say about the data quality is most concerning. As some others point out, it would be interesting to know if this low-quality data would agree with previous literature if it were similarly analyzed. But I hesitate to suggest this, as I personally dislike doing “wrong analyses” on empirical datasets as a point of comparison. It just makes people prefer a theoretically wrong analysis because it produces the answer that they like.

  2. fwiffo

    There is a major statistical myth in this post. The myth is that it is possible to adjust for confounders. A corollary of this myth is the myth that adjusting for more confounders will get you closer to the truth.

    There are two key types of confounders: observable and unobservable confounders. If observability is randomly selected from the universe of phenomena, then the strategy in this post will work — if you control for *enough* confounders, you can basically soak up all the confounding and end up with a credible causal estimate.

    But the universe doesn’t really work that way. Humans are complicated, and the things about them that we can measure are kind of crappy and correlated. Suppose we want to predict suicidality. We can run one hundred thousand medical tests on you — the first handful probably have some signal for predicting suicidality, but I can practically guarantee that tests #101-#100,000 will have zero signal at all. The statistician can flex his muscles and say I CONTROLLED FOR 100,000 COVARIATES, but it doesn’t matter. All of the interesting stuff on suicidality is unobservable. Covariates numbering greater than 100 very rarely have useful signal that is independent of the first 100.

    Economists are obsessed with this problem. The best example is training programs for the unemployed. Do they work? When you compare completers to non-completers, you find that completers are more likely to find jobs. You can throw in *100,000 COVARIATES* and you will still find completers are more likely to find jobs. But if you run an RCT, you will find there’s basically no effect, or way less than you expected. It’s easy to see why — the key unobservable factor is motivation and effort. Sure, we can test for some of these things — we can get your Myers-Briggs type, your grit score, your Big Five scores, your enneagram, and so on — but these are all really weak measures that miss the signal of the motivation we really want to measure. In practice these bring down the program effects a tiny bit, but none of them really capture your unobservable motivation to find a new job, which is going to be really highly correlated with your motivation to complete the training program.
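
    To put the point in toy-simulation form (invented numbers, purely illustrative): if an unobserved factor drives both treatment and outcome, adjusting for any number of unrelated observables leaves the bias exactly where it was.

```python
import random

random.seed(2)

# "Motivation" u is unobserved; it drives both treatment and outcome.
# The true causal effect of the treatment itself is zero.
n = 20_000
K = 20  # twenty observable covariates, all unrelated to u
U, T, Y, covs = [], [], [], []
for _ in range(n):
    u = random.random() < 0.5
    t = random.random() < (0.8 if u else 0.2)
    y = 1.0 * u + random.gauss(0, 1)
    U.append(u); T.append(t); Y.append(y)
    covs.append([random.random() < 0.5 for _ in range(K)])

def diff(select):
    a = [Y[i] for i in range(n) if T[i] and select(i)]
    b = [Y[i] for i in range(n) if not T[i] and select(i)]
    return sum(a) / len(a) - sum(b) / len(b)

def adjusted_for(cov):  # stratify on one binary covariate, average the strata
    return (diff(cov) + diff(lambda i: not cov(i))) / 2

naive = diff(lambda i: True)
# Adjusting for each observable barely moves the estimate...
adj_noise = [adjusted_for(lambda i, j=j: covs[i][j]) for j in range(K)]
# ...but adjusting for the unobservable (if only we could) removes the bias.
adj_u = adjusted_for(lambda i: U[i])
print(f"naive: {naive:.2f}, "
      f"max shift from any observable: {max(abs(a - naive) for a in adj_noise):.2f}, "
      f"adjusted for u: {adj_u:.2f}")
```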

    So to me it’s not at all surprising that the observational evidence didn’t hold up. But I don’t particularly trust the null study any more than the other studies — this method is not guaranteed to find a causal effect, especially when we worry that there may be some hard to measure confounding variable (like stress, anxiety, mood) that is going to be highly correlated with difficulty sleeping and mortality risk.

  3. iss17

    From personal experience: From Sept 2007 to July 2008, I took Ambien virtually every day (shouldn’t have been prescribed). On days on which I did not take Ambien, I woke up with a seizure. I could predict whether that would occur based on whether I took Ambien the previous night. Neurologist didn’t believe they were seizures, then confirmed I had a seizure condition on EEG but argued that they weren’t from Ambien and ordered Depakote. I said it was from Ambien (because I was able to predict the symptoms based on whether I took Ambien) and didn’t take the Depakote. Several years later, I learned that withdrawal seizures are common with Ambien. Ambien also left me susceptible to headaches and balance issues and to this day, I am more susceptible to those than I was before. I also had weird joint issues in the first month, which started a couple weeks after I started Ambien and coincided with my heaviest usage of Ambien. (And I can think of nothing else that would cause those joint issues; doctors were baffled.) I am less certain that the joint issues were due to Ambien, but I have my suspicions. And those are the ramifications I’ve experienced from Ambien so far; who knows whether they will get worse in the future? Entire experience left me a lot more skeptical of the typical doctor.

  4. helloo


    Didn’t you already provide the relationship for at least several ways sleeping pills increase death – lung disease, suicide, car accidents, etc.?
    Even if confounding issues help explain away the increased risk for cancer, did they also explain away the increased risk for everything else too?

  5. Dack

    For years, we’ve been warning patients that their sleeping pills could kill them.

    Melatonin would not count as a sleeping pill, right?

  6. slatestarreader

    Patorno et al. report 3 major analyses: 1) unadjusted hazards, 2) propensity score matched hazards, and 3) high-dimensional propensity score matched hazards [the huge number of covariates version].

    First, the estimated unadjusted hazards are much lower than those reported in previous studies (1.78x, 95% CI: 1.73-1.85), which is already something to chew on. Second, using “basic” propensity score matching with a bunch of covariates [something like 57] (which basically summarizes all the predictor covariates into a single covariate, the propensity score), they find the hazard drops to 0.89x (95% CI: 0.85-0.93). Maybe there are other important covariates not included in there that actually bias results in the other direction; their preferred PS-matching version [the >300 one] estimates a hazard of 1.00x (95% CI: 0.96-1.04).

    Thanks for linking to the original paper on high-dimensional PS-matching; it also has a commentary that is in line with my opinion of this method – it might be a reasonable way to automate covariate inclusion in propensity scoring, but it’s tough to say it’s far superior, and the potential problems with it are not completely transparent. In the development paper, they have a couple of example studies where the high-D PS-m (>200) leads to estimates about the same as adjusting for 46 covariates, and about the same as adjusting for age, sex, race, and year; all better than the unadjusted estimates. As the commentary, and some commenters here, have pointed out, covariate selection is an important process, and any limitations with that and with PS-m would still apply.

    However, just because they adjust for *a lot* doesn’t invalidate their results. If they adjust for fewer covariates via PS-matching, they find results in contrast to previous research. And frankly, their unadjusted results, while consistent in direction, are inconsistent in effect size with previous findings.

    On intention-to-treat, that seems more obviously problematic (and yet apparently previous research used it, too!). If you’re claiming it’s the drugs that (don’t) cause the deaths, I’m more interested in the consumption than the prescription of the drugs. But they take care of that with the follow-up analysis.

  7. Virriman

    The study suggests that the total effect of the extra 280+ confounders on mortality and sleeping pill use is much more important than any of the previous researchers thought, even though their individual effects are negligible. If we can’t pick the right confounders to control for, we probably can’t pick the right effects to measure either in a lot of cases. Could the algorithm be tuned to search for unconsidered dependent effects of various interventions instead of just independent confounders?

    The studies surrounding minimum wage are puzzling in the same ways as the earlier studies on sleeping pills, but in the opposite direction. The simplest theories predict that a minimum wage should have a negative impact on employment, but there are decades of studies showing the impact is negligible. The simple theories are probably incomplete in some way, but there isn’t any consensus on it.

    Proponents of a minimum wage think that it has some unmeasured positive externalities that counteract the negative effects on quantity of labor demanded – maybe higher wages stimulate the economy, happier workers are more productive, richer households can support longer job searches, the negative effects all fall on the rich, etc. Detractors of minimum wage either believe that the studies are systematically flawed or that the negative effects are hidden somewhere they aren’t being measured – meaner bosses, less vacation time, unpleasant schedules, increased discrimination, etc.

    There is no obvious way to confirm or rule out many of the proposed explanations, but if we can set an algorithm to discover previously unexamined effects of minimum wage, the non-obvious might become obvious.

  8. Ruben

    Like everyone else in the comment section, I have my doubts about the everything-and-the-kitchen-sink approach to regression. I’d like to add that twin or sibling difference studies in many cases give us tighter control with less accidental adjustment for colliders, mediators, etc.

    Also Mendelian randomisation might work here (not sure where we are on the genetics of sleep problems yet though).

    1. IrishDude

      Assuming the confounders were measured prior to the treatment, why wouldn’t you want to include everything-and-the-kitchen sink? In randomized control trials, the goal with random assignment is that you balance all observed and unobserved characteristics among the treatment and control groups. If 10% of the treatment group has characteristic A, you expect 10% of the control group to also have characteristic A (whether characteristic A is observed or not). Any unbalancing between the treatment and control makes it more difficult to interpret outcomes as it’s unclear whether the outcomes are caused by the treatment or caused by the different mix of cases.

      Propensity score matching attempts, imperfectly, to also create balanced ‘treatment’ and control groups in observational data. It’s pseudo-randomization. It can only create balance on observed variables, which is the method’s shortcoming, but I don’t see any reason why you wouldn’t want balance on every variable you can observe as long as the observed variables are measured prior to the ‘treatment’.
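
      A toy balance check makes that concrete (invented data and a deliberately crude binned propensity score, not any real package): the standardized mean difference of a confounder between treated and controls shrinks after matching, which is the pseudo-randomization in action.

```python
import math
import random
from statistics import mean, pvariance

random.seed(3)

# One observable confounder x; the chance of treatment rises with x.
n = 10_000
rows = []
for _ in range(n):
    x = random.gauss(0, 1)
    treated = random.random() < 1 / (1 + math.exp(-1.2 * x))
    rows.append((x, treated))

# Crude propensity score: 20 quantile bins on x; the score is the
# observed treated fraction within each bin.
rows.sort()
bins = [rows[i * n // 20:(i + 1) * n // 20] for i in range(20)]

def smd(t_x, c_x):  # standardized mean difference, a standard balance metric
    sd = ((pvariance(t_x) + pvariance(c_x)) / 2) ** 0.5
    return (mean(t_x) - mean(c_x)) / sd

before = smd([x for x, t in rows if t], [x for x, t in rows if not t])

# Match each treated unit to a control drawn from the same propensity bin.
matched_t, matched_c = [], []
for b in bins:
    t_x = [x for x, t in b if t]
    c_x = [x for x, t in b if not t]
    if t_x and c_x:
        matched_t += t_x
        matched_c += [random.choice(c_x) for _ in t_x]
after = smd(matched_t, matched_c)
print(f"SMD before matching: {before:.2f}, after: {after:.2f}")
```

      Of course, as noted above, the balance achieved is only on the observed x; nothing here touches unobserved characteristics, which is exactly the method’s shortcoming.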

      1. Ruben

        Very briefly, I wouldn’t bet that in an analysis with 300 confounders, the authors even could ensure that all adjusted variables were measured prior to the exposure.
        But anyway, I think propensity score matching isn’t bad per se (e.g., I see more value for it in within-subject comparisons/analyses of change which are …erm.. difficult for mortality). I mainly wanted to add that sibling comparisons etc. would give us tighter control with less to worry about. See this tweet

  9. Fluffy Buffalo

    Did anyone try this fancy new statistical tool on a problem where there’s wide consensus? If correcting for 300 confounders told you, for example, that smoking is A-Okay, we can throw it in the trash right away.
    On the other hand, shouldn’t a massive multivariate analysis give you all sorts of useful information from the fitting coefficients of the confounders? I.e., if you observe that patients who take sleeping pills have a higher risk of death before adjusting, shouldn’t you then be able to tell what it is that kills them? Shouldn’t that lead to more untargeted studies – we collect a huge load of data on a ton of people, and instead of asking, “do sleeping pills kill them”, we ask, “what kills them, and can we fix that”? Or would that amount to opening the gates of p-hacking hell?

    1. Murphy

      if you look at appendix eTable 2 you can see the list. Mostly it’s unsurprising: things we already know are associated with death, like heart disease.

  10. gray

    If my doctor suggested I take sleeping pills I would say ‘no thank you’. I believe this would be true even if I had told the doctor I had problems sleeping. In my view this propensity to ‘just say no’ would make me a person less likely to be a victim of many other types of overmedication and overtreatment. And thus less likely to die early. This could be picked up by the type of study talked about here, as the confounding variables would likely encapsulate other instances of whether treatment was accepted or rejected. A simple study design with fewer confounders might not pick this up.


    1. Murphy

      it might go the other way: someone who says no to sleeping pills may also be more inclined to say no to pills to treat their heart disease or their depression.

  11. sbelknap

    I review Patorno et al here.

    Patorno and colleagues address an important issue, citing unmeasured confounding bias as a threat to the validity of previous studies of sedative-hypnotic exposure and mortality (1-3). Alas, null bias threatens the validity of their study. They defined “users” as those exposed to benzodiazepines, but omitted z-drugs, carbamates, barbiturates, ethanol, and valerenic acid. Like benzodiazepines, these drugs are ligands of positive allosteric modulator sites of gamma-aminobutyric acid (GABA-A) receptors and enhance GABAergic inhibition of central neurotransmission. Users were those with a new benzodiazepine prescription fill, excluding those with benzodiazepine fills in the previous 6 months. This is a weak criterion, as some users probably had negligible exposure to benzodiazepines; better would be ≥ 1 refill. Comparators had no benzodiazepine prescription fills ≤ 6 months before the match date, but some had benzodiazepine fills afterwards. Both users and comparators had exposure to other GABAergic agonists. Matching included fills for other sedating drugs but ignored dose and exposure duration of benzodiazepines, other GABAergic agonists, opioids, and other sedatives. These flaws blur differences between users and comparators, obscuring any effect on mortality.
    There is no administrative dataset so vast, nor propensity-matching algorithm so clever as to compensate for these design flaws. There is other compelling evidence that benzodiazepines increase mortality (4). Prudence dictates avoidance of sedative-hypnotics for indications lacking evidence of safety.
    Steven M. Belknap, MD, FACP, FCP
    Northwestern University Feinberg School of Medicine
    Chicago, Illinois, USA

    1. Belknap SM. In adults, use of anxiolytic or hypnotic drugs was associated with increased risk for mortality [Comment]. Ann Intern Med. 2014;161:JC11.
    2. Weich S, Pearce HL, Croft P, et al. Effect of anxiolytic and hypnotic drug prescriptions on mortality hazards: retrospective cohort study. BMJ. 2014;348:g1996.
    3. Kripke DF, Langer RD, Kline LE. Hypnotics’ association with mortality or cancer: a matched cohort study. BMJ Open. 2012;2:e000850.
    4. Kripke DF. Mortality risk of hypnotics: Strengths and limits of evidence. Drug Saf. 2016;39:93-107.

  12. blakeriley

    Very few here are engaging with the methods being used, either propensity matching or the variable screening method.

    Overall, not much new here. Propensity matching comes from Rosenbaum and Rubin (1983) with 23,000 cites. I’m sure James Robins would like a word about which Harvard professor should be associated with causal estimation in epidemiology.

    The propensity score is the coarsest valid summary of potential confounders. Assuming they’re following the procedure in Schneeweiss et al closely, they coarsen it further by binning propensities into deciles. The propensity regression can still overfit, though the danger is reduced compared to including the variables straight in the outcome regression. I wish they had regularized the propensity model, but with this many observations, 300 binary variables in a linear model isn’t that big a deal.
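    A minimal sketch of that estimate-then-bin step, on invented data (the covariate counts, model settings, and variable names here are all made up for illustration, not taken from the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented stand-in data: 10,000 "patients" with 300 binary claims-style codes.
rng = np.random.default_rng(0)
n, k = 10_000, 300
X = rng.integers(0, 2, size=(n, k)).astype(float)
# Treatment depends on the first two covariates, making them true confounders.
treat = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - X[:, 1]))))

# Fit the propensity model, then coarsen the scores into deciles.
ps = LogisticRegression(max_iter=1000).fit(X, treat).predict_proba(X)[:, 1]
deciles = np.digitize(ps, np.quantile(ps, np.linspace(0.1, 0.9, 9)))
# Outcomes would then be compared within each decile (0-9), never across them.
```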

    The variable screening algorithm is the new part. Calling it an algorithm is a bit of a stretch, though, and it seems mostly innocuous. First, focus only on the most common variables. Next, do a bit of feature generation for frequency of occurrence. Sort the variables by the bias each would account for when considered in isolation, then select the top N. That’s more like throwing out variables that can’t be confounders than actively identifying variables as confounders. Ideally they would have done this screening on a split sample.
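    The ranking step could be sketched like this, using Bross’s bias multiplier as the prioritization metric (a loose approximation of the hdPS screening procedure, not the authors’ actual code; every function name and cutoff here is an assumption):

```python
import numpy as np

def bross_bias(prev_treated, prev_untreated, rr_outcome):
    """Bross's bias multiplier for one binary covariate considered in isolation."""
    return (prev_treated * (rr_outcome - 1) + 1) / (prev_untreated * (rr_outcome - 1) + 1)

def screen_confounders(X, treat, outcome, top_n):
    """Rank covariates by |log bias| and keep the top_n: 'throwing out
    variables that can't be confounders' rather than proving the rest are."""
    scores = []
    for j in range(X.shape[1]):
        c = X[:, j]
        p1, p0 = c[treat == 1].mean(), c[treat == 0].mean()
        # crude relative risk of the outcome given the covariate
        rr = (outcome[c == 1].mean() + 1e-9) / (outcome[c == 0].mean() + 1e-9)
        scores.append(abs(np.log(bross_bias(p1, p0, rr))))
    return np.argsort(scores)[::-1][:top_n]
```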

    Propensity matching is still biased when conditioning on a collider. Glynn is aware of this danger based on his other work. My understanding is propensity matching would mitigate the bias compared to including a collider in the outcome regression as a control. If there’s an issue in this paper, it’s from them misunderstanding their data and accidentally including variables recorded after the intent to treat, not methodologically.

    1. NeuroStats


      I don’t think variable screening approaches are that new, however. They are overall well justified and in my experience can be the least worst option, especially if one uses stability ranking. There is a wide literature on this. E.g. Baranowski et al builds on a lot of previous work in this area. It seems pretty intuitive to apply such methods to rank and screen confounders instead.

      As you say, accidentally including a variable that is in fact a strong collider that they are unaware of is the biggest issue and potential reason why this result might not hold.

  13. tgb

    In order to “control” for a confounder variable, you have to measure the effect that confounder has. The goal of this study is to measure the effect that benzodiazepines have. Is there intuition for why measuring the effect of a confounder is easier than of the primary target? Or is this a flawed way to think about it? I’m a little hazy on what I mean by this, I hope the question makes sense.

    1. Michael Watts

      There are multiple ways you might theoretically go about “controlling” for a variable.

      1. Data point 甲 has a raw variable-of-interest of 0.3 and a raw controlled-variable of 0.7. Since 1 unit of controlled-variable is known to be worth half a unit of variable-of-interest, point 甲’s adjusted variable-of-interest is -0.05.

      This approach is going to fall down at small scales. (Unless you have really, really tight correlations, such as ±1.) I have no opinion on how well it will work at large scales. It sounds like this is what you’re imagining.

      2. We have the following data, with dimensions in the order (dependent variable, independent variable, confounder 1, confounder 2):

      甲 = (0,0,0,0)
      乙 = (18,6,0,1)
      丙 = (-8,4,1,0)
      丁 = (10,10,1,1)

      Controlling for confounder 1, we see the matched point pairs {(0,0); (18,6)} and {(-8,4); (10,10)}; it’s easy to conclude that one more unit of the independent variable is worth three more units of the dependent variable.

      Controlling for confounder 2, we see the pairs {(0,0); (-8,4)} and {(18,6); (10,10)}; this way, we can conclude that one more unit of the independent variable is worth two fewer units of the dependent variable.

      But our methodology here in approach 2 is spotless; we decided what it meant for two points to be equivalent, and then we estimated the effect of one variable on another variable within groups of equivalent points. Based on other commentary here, this matched-points approach is what P&a are doing.
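      The arithmetic of the two matchings above can be checked in a few lines (a direct transcription of the four example points):

```python
# The four points from the example, as (dependent, independent, conf1, conf2).
pts = [(0, 0, 0, 0), (18, 6, 0, 1), (-8, 4, 1, 0), (10, 10, 1, 1)]

def slope_within(conf_idx):
    """Match points on one confounder and estimate dy/dx within each stratum."""
    slopes = []
    for level in (0, 1):
        (y1, x1), (y2, x2) = [(p[0], p[1]) for p in pts if p[conf_idx] == level]
        slopes.append((y2 - y1) / (x2 - x1))
    return slopes

print(slope_within(2))  # controlling for confounder 1: [3.0, 3.0]
print(slope_within(3))  # controlling for confounder 2: [-2.0, -2.0]
```

Both strata agree within each matching, yet the two matchings give opposite signs, which is the whole point of the example.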

      1. tgb

        I see your example but I’m not sure it’s what is being done in the study quoted. It’s more of a third option, where you try to summarize all your confounding variables into a single parameter and then pair datapoints with nearly equal propensity scores. But there’s still the issue of figuring out how much each confounder contributes to that score.

        Like if you look at the Wikipedia page for propensity scores, you see that the score is defined as P(receiving treatment | X = x) where X is your vector of confounders. But with 300 confounders, how can you estimate this robustly? In all likelihood most individuals have unique values for x. So is it robust, meaning we don’t really need to know this probability to much accuracy? Or is estimating it somehow easier than estimating the original probability P(dies | receives treatment)?

        The only distinction I see is that I haven’t written the end-result probability down well, since it’s really about counterfactuals: we want P(dies | received treatment) even for people who really didn’t receive treatment. Is that the difference?

        1. Michael Watts

          I see your example but I’m not sure it’s what is being done in the study quoted. It’s more of a third option where you try to summarize all your confounding variables into a single parameter and then pair datapoints with nearly equal propensity scores.

          I don’t understand how you think those are different. I gave two examples, one matching points on equal x_3 and one matching points on equal x_4. Both of those are examples of the general case of matching points on an arbitrary function f(x_1,x_2,x_3,x_4,x_5,…), as is the “third option” you describe. In every case, you’re determining that some points are equivalent to other points and then making your comparisons within the equivalence classes you’ve defined.

          the score is defined as P(receiving treatment | X = x) where X is your vector of confounders. But with 300 confounders, how can you estimate this robustly? In all likelihood most individuals have unique values for x.

          I agree; you cannot do this robustly.

  14. nobody.really

    [Epistemic status: Somewhat confident in the medical analysis, a little out of my depth discussing the statistics]

    * * *

    Even if you take sleeping pills only a few nights per year, your chance of dying doubles or triples.

    Can’t speak to P&a specifically, but I’d want to double-check the stats cited above. Last I checked, my chance of dying was darn close to 100%. So if I started taking sleeping pills, my chance of dying would go to 200%-300%?

    Look, sure, I regret that I have but one life to give for my country and all that–but this is ridiculous.

  15. deciusbrutus

    Would you be able to figure out a lot of that in advance by looking at the groups on day 1 and saying “People who are taking sleeping pills score 3x as high on aggregate on this score based on over 300 potential confounders”?

    Because it seems to me that the P&a study had to prove that people who take sleeping pills are significantly different from people who don’t.

  16. NoRandomWalk

    Maybe I’m missing something super basic…but the more variables you add as confounders, the harder it becomes to separate the signal from the noise.

    Let’s say you have a variable that causes cancer, like smoking. And you flip coins and add the outcomes as confounders. Some of the coins, by random chance, will come up heads more often for unhealthy people, and so will be correlated with both smoking and cancer diagnoses. Control on enough coins, and you won’t be able to tell that smoking causes cancer.

    “they find minimal to zero difference in mortality risk between users and non-users” is exactly what you would expect if you have enough coins.

    Can someone who’s read the study tell me if the study has enough data/statistical power to reject the null that sleeping pills are not harmful if the 300 confounders are literally random coin flips?
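    For what it’s worth, the coin-flip scenario is easy to simulate (invented data; the point is only that with a large sample, hundreds of pure-noise covariates barely move the estimate, so this failure mode needs the covariate count to be large relative to the sample size):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 50_000
smoking = rng.integers(0, 2, n)
# True model: smoking raises the log-odds of cancer by 1.0.
cancer = rng.binomial(1, 1 / (1 + np.exp(-(-3 + 1.0 * smoking))))
coins = rng.integers(0, 2, size=(n, 300))     # 300 pure-noise "confounders"

X = np.column_stack([smoking, coins]).astype(float)
fit = LogisticRegression(max_iter=2000).fit(X, cancer)
print(fit.coef_[0][0])  # the smoking coefficient should stay near 1.0
```

With 50,000 simulated people and 300 coins the signal survives easily; the study’s millions of patients give even more headroom, so the coins story alone probably can’t explain a null result.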

    1. NeuroStats

      They are not blindly adding in lots of variables into a regression. They are specifically choosing a list of variables that first predict/contribute to a measure of confounding bias, which they later incorporate into the model predicting all-cause mortality.
      Having a large number of variables in a model is exactly the domain of high dimensional statistics. They are building on an extensive literature on variable screening & ranking developed in high dimensional statistics.

      Links to some of their others papers here

  17. baconbits9

    I didn’t see anyone else ask the question so- what data set did they use to identify the confounders? The data set they ran their model on had 39,000,000 people in it, where did they get a large enough data set from to find 300 confounders that didn’t massively overlap with this data set?

    1. JPNunez

      Large healthcare providers. They have detailed history on procedures done to patients. The confounders come from using those procedures as proxies for possible health problems.

        1. JPNunez

          But very few were taking benzodiazepines, so there’s the “control” group.

          To select patients who did not start benzodiazepine treatment with similar opportunity to be evaluated and treated by a physician as patients who started benzodiazepines and within a similar time window, for each benzodiazepine user we identified a random patient who had a medical visit within 14 days either way of the treatment start date for the corresponding benzodiazepine user and fulfilled the same inclusion criteria as benzodiazepine new users—that is, six months of continuous health plan enrollment before the selected medical visit and no use of any benzodiazepine in the six months before and including the date of the visit (ie, the index date for patients who did not start benzodiazepines). We also required patients who either did or did not start benzodiazepines to have at least one filled prescription in the 90 days before the index date and at least one filled prescription in the previous 91 through 180 days, to balance surveillance between groups and to make sure that patients in both groups were in contact with the healthcare system and had equal treatment opportunity.

          edit: they start with a dataset of 39M patients, but they eventually trim it down to 3.5M patients, roughly half having taken benzos, half not.

  18. Freddie deBoer

    Seems like a strange case for medical ethics – presumably they haven’t done an RCT because it’s unethical to give people substances known to increase all-cause morbidity. But in fact the reason to do the RCT would be to find out whether that increase in all-cause morbidity is a product of the methodological issues arising from not having done an RCT in the first place. If the situation was reversed and observational trials found no increase in morbidity, an RCT might be run on that assumption of no increase only for researchers to find out that there was an increase in morbidity and that their RCT was thus unethical.

    1. 10240

      But doctors do give patients these drugs even though it’s thought that they increase all-cause mortality. Just not as an experiment. I don’t think it would be unethical to do an RCT with people who are on the fence about whether they want to take sleeping pills (after learning that they increase mortality). I don’t know if an IRB would agree with me.

  19. Murphy

    They include metrics, scoring each confounder.

    looking at their list, scrolling down and picking out the things that were the biggest confounders by eye:

    Male sex

    Comorbidities and lifestyle factors:

    Other OA and MSK
    Neuropathic pain
    Back pain
    ACE inhibitors
    Oral corticosteroids
    Other hypnotics

    Indicators of health care utilization

    No. physician visits
    No. prescription drugs
    Psychiatric visit
    No. ICD9 diagnoses, 3rd digit level

    So it kinda looks like it can be summarized as “people who are sick or in constant pain are likely to take sleeping tablets.”

    But I think there may also be a danger of controlling away any possible real effects.

    I’d like to see the same methodology applied to some medication we know has side effects, and see whether it, too, shows no effect once we control away how sick the people are.

    the high dimensional propensity score was estimated on the basis of more than 300 covariates.

    we selected a total of 200 empirically identified confounders and combined these with the investigator identified covariates (see supplementary appendix eTable 2)

    Weirdly I don’t see 300 or even 200 items listed in eTable 2

    I count 105

    Am I missing something?

    1. Scott Alexander Post author

      There was something somewhere about how they used 200 algorithm-identified confounders plus a hundred they picked themselves; I’m wondering if those are the 100 they picked.

      1. Murphy

        That… somewhat still leaves me with the question of what the other 200 were.

        I can’t find them, though perhaps I’m being blind somehow.

        1. Corey

          I’m assuming (albeit based on no information) that the others are machine-learning-like opaque lists of coefficients, not necessarily relating to anything in the real world or anything human-comprehensible.

        2. JPNunez

          The other 200 are machine generated to compensate for unmeasured data.

          Through the high dimensional propensity score algorithm, an automated technique that identifies and prioritizes covariates that may serve as proxies for unmeasured confounders in large electronic healthcare databases,[28] we selected a total of 200 empirically identified confounders and combined these with the investigator identified covariates (see supplementary appendix eTable 2) to estimate an empirically enriched propensity score.

          Together they get to the 300 controls I guess. The data comes from millions of patients records from an insurance company.

          Oh, note 28 is

          although that paper uses a population of 40k patients, way less than the million+ people in Patorno et al.

          e: more info on [28] is here. I was wrong, it’s not the same paper but it is the same author and tool.

          Wait, reading this I think I was wrong; Schneeweiss says that they have proxies for ICD9 codes. For example, if a patient had an oxygen tank during a stay at a hospital, they use this as a proxy for “frailty” and it is now a variable. There seem to be a lot of these proxies generated, although eTable2 in P&a only lists this as a single entry. I have to assume it’s this, because the P&a paper says that these proxies were found empirically.


          1. JPNunez

            Yeah, those must be the 200 extra confounders.

            Schneeweiss 2018

            The basic HDPS version considered only the 200 most prevalent codes in each data dimension and for each code created three binary variables, indicating at least one occurrence of the code, sporadic occurrences, and many occurrences during the CAP (covariate assessment period)

            in eTable2, where it says “No. ICD9 diagnoses” it just means the number of diagnoses associated with a code, and it’s not the 200 confounders. I wonder if that introduces a small correlation.
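            If I’m reading the quoted description right, the three-indicator construction for a single code might look something like this (cutoffs as described in the hdPS literature: at least one occurrence, at least the median count, and at least the 75th-percentile count among patients with the code; treat the details as assumptions):

```python
import numpy as np

def hdps_indicators(counts):
    """For a single code, build the three hdPS binary variables from
    per-patient occurrence counts during the covariate assessment period:
    at least one occurrence, 'sporadic' (>= median count among patients
    with the code), and 'frequent' (>= 75th percentile)."""
    counts = np.asarray(counts)
    once = (counts >= 1).astype(int)
    if not once.any():                      # code never appears: all zeros
        return np.zeros((len(counts), 3), dtype=int)
    nonzero = counts[counts >= 1]
    sporadic = (counts >= np.median(nonzero)).astype(int)
    frequent = (counts >= np.percentile(nonzero, 75)).astype(int)
    return np.column_stack([once, sporadic, frequent])

# Five hypothetical patients' occurrence counts for one code:
print(hdps_indicators([0, 1, 2, 5, 10]))
```

Run over the 200 most prevalent codes per data dimension, this is how a modest code list balloons into hundreds of generated covariates.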

          2. Murphy

            That would be distressingly close to “and then we adjusted by some magical numbers we’re not going to really tell you about…”

            This also sounds worryingly like adjusting for every possible effect, then declaring there to be no effect.

            “after adjusting for the metric of the number of stitches patient received from medical personnel … juggling knives is found to cause no increase in knife wounds”

            If you adjust for incidence of lung cancer then you’re gonna have a hard time showing smoking causes lung cancer. etc

            With hidden adjustments we can’t really know.

  20. Scott Alexander Post author

    I don’t think they have direct access to “sleep quality”, but I think it’s what they’re indirectly trying to measure with all the different diseases and psychiatric issues and so on.

    1. notpeerreviewed

      I’ve run the numbers on some studies for somnologists, and the only measures we had for sleep quality were extremely crude. Maybe there’s better stuff out there that we didn’t have access to, but I wouldn’t be surprised if there isn’t.

    2. Purplehermann

      If I’m remembering right (don’t have the book in front of me), sleep researchers can detect brainwaves, and normal healthy sleep has a few brainwave patterns (REM, deep sleep…). When using sleeping pills, patients are essentially drugged; their brainwaves are not similar to healthy sleep.

  21. JohnBuridan

    I am not statistically talented, but adjusting for hundreds of confounders, as opposed to dozens, fits my model of how complicated the world is, even in large groups. I figure that the closer one gets to social science, the more confounders there are. Confounders are why I didn’t continue studying psychology in college: I couldn’t understand how the studies we read could control for so little and still be valid (my professor thought I was being too much of a philosopher; it didn’t help that our textbook spent a lot of time taking a dump on philosophy).

    Of course, there is erring on the other side too and taking too many confounders, but obviously one needs good judgment to know what confounders to count.

  22. Purplehermann

    Matthew Walker in “Why We Sleep” says pretty much what you’re saying (he also says that sleeping pills ruin your sleep, so it’s not just correlation but causation to some degree as well with those who use pills more often)

  23. rapa-nui

    ” But cancer? Nobody has a good theory for this. Heart disease? Seems kind of weird. Also, there are lots of different kinds of sleeping pills with different biological mechanisms; why should they all cause these effects?”

    I think it’s totally reasonable. First of all, these are pharmaceuticals, which are the treatment equivalent of a shotgun. Even the most exquisitely designed drug will have hundreds (if not thousands) of off-target interactions, and we can expect people who take these sleeping pills to be on them for a long time. Second of all, I doubt we have anything even remotely close to a good framework for how these pills alter sleep biology. Sure, they help a patient subjectively get to sleep, but is that sleep performing all the absolutely critical biological functions it’s supposed to? Here’s a good prior you can trust: a biological phenomenon that leaves the organism vulnerable to attack for multiple hours at a time can only be maintained by natural selection if it is absolutely crucial to maintaining biological function. So yeah, messing with sleep can be expected to screw with literally everything.

    “They adjusted for three hundred confounders.”

    That’s not rigor; that’s an admission that the way you are choosing to study the problem is fundamentally non-scientific. I know clinicians have to do something to earn their paychecks and publish papers, but if you have 300 uncontrolled variables messing with your Scientific Method… just give up on trying to say anything meaningful. I don’t care how sophisticated your statistical techniques are; your “laboratory” is too messy. I don’t advocate going the China route and using prisoners and political dissidents as involuntary guinea pigs either, but at least that way you could have a properly controlled double-blinded study.

    (Bias and lack-of-expertise disclaimer: Not affiliated in any way shape or form with Kripke. Not a sleep biologist. Not a clinical researcher. Not a pharmacologist. Not a statistician.)

  24. vaniver

    I don’t know enough to judge the statistics involved. I can imagine ways in which trying to adjust out so many things might cause some form of overfitting, though I have no evidence this is actually a concern.

    My back-of-the-envelope calculations suggest that 4M active users and 35M controls is enough that overfitting shouldn’t be able to totally extinguish any signal. However, a nice complement to whatever power calculations Glynn did might be some empirical data exploration, where we create synthetic response variables of various real underlying strengths and see what they look like after being put through the procedure with the real independent variables. (Normally one would do this sort of thing with synthetic independent variables, using whatever initializations you like based on whatever random graph theory, but the real connections and relationships make for a much better testbed.)
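    The synthetic-response check suggested above might look roughly like this (toy covariates instead of the real claims data; the design, effect sizes, and decile scheme are all invented):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n = 20_000
X = rng.normal(size=(n, 10))                          # stand-in covariates
treat = rng.binomial(1, 1 / (1 + np.exp(-X[:, 0])))   # treatment driven by X[:, 0]
ps = LogisticRegression(max_iter=1000).fit(X, treat).predict_proba(X)[:, 1]
decile = np.digitize(ps, np.quantile(ps, np.linspace(0.1, 0.9, 9)))

def recovered(true_effect):
    """Synthesize an outcome with a known treatment effect (confounded by
    X[:, 0]), then estimate the effect within propensity-score deciles."""
    y = true_effect * treat + X[:, 0] + rng.normal(size=n)
    diffs = [y[(decile == d) & (treat == 1)].mean()
             - y[(decile == d) & (treat == 0)].mean() for d in range(10)]
    return float(np.mean(diffs))

# The recovered estimates should track the injected effects, up to
# residual within-decile confounding.
print([round(recovered(e), 2) for e in (0.0, 0.5, 1.0)])
```

With the real covariates in place of the toy ones, this would show directly how much of a known injected signal the procedure preserves.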

  25. discountdoublecheck

    I see a lot of comments trying to pull this apart. I don’t have time to reply to them all in detail, so I’m going to try to shotgun a bunch at once.

    Several people start with the proposition that the additional controls are the relevant difference between the papers, and want to try to figure out what variables are causing this change. I see a number of comments proposing all kinds of search algorithms. I don’t want to say that this is impossible (it isn’t). But this is a much harder problem even than merely brute force searching through all the possible included variables and seeing what drops. The issue is that once you perform that algorithm, your inference needs to be conditional on the search procedure you performed producing the result you got. Probably the easiest way to understand this set of problems involves looking at the literature on inference in stepwise regression. This is why we pre-specify.

    More broadly, loads of people say things along the lines of ‘things are multi-causal, probably everything is causing this’, and then talking about trying to isolate the ‘important’ confounds — either using priors about importance or using search algorithms mentioned above. Those people are missing a single very important point. We want the causal effect of the sleeping pills, and we only have observational data. That means that our single biggest problem is the massive degree of selection on all possible dimensions. *Any one* variable which has a slight correlation with death and with sleeping pill use, and which we don’t adjust for, screws us. Any one variable which we don’t *correctly* adjust for, screws us.

    Now, you can’t just throw everything in.
    Mikk14 talks about concerns with bad controls. This is a real issue. A classic example here would be — after people start taking sleeping pills, they are (or aren’t) diagnosed with diabetes. If we control for diabetes diagnoses *after* the start of pill taking, we have a bad control. You can imagine that the pill induced diabetes, which was then diagnosed and we ‘control’ for the diagnosis and the death (or not) caused by that. Huge issues arise. However, in general, if we measured the controls before the pills started being taken, we hope we can deal with this. (This probably works in medicine where the after can’t easily affect the before. This logic works less well in economic settings where individuals plan around a later time period — thus the future can affect the now — imagine someone expecting to start taking sleeping pills who is diagnosed first).

    But I really want to emphasize the point about missed controls. I didn’t go read the paper in depth, so I can’t swear to anything. But I glanced at their list of controls. Here’s a biggy I think they missed — wealth/income. We know that wealth & income are closely related to life expectancy. Could they be related to sleeping pill consumption? I don’t know — but I do know that pills cost money. So when you take one, your wealth goes down. Among other possible ways in which being wealthy and taking sleeping pills might be linked (anyone ever stay up stressed about bills?), that is a direct link which definitely exists.
    My point here is that while some folks seem to think they’ve clearly gone overboard with 300 controls, they still missed really important things. And it doesn’t take important things to screw up your inferential procedure. It takes one small unimportant thing.

    Sorry for the rant.

    Edit 1: Noticed some folks indicating they feel RCTs are called for. Those folks are of course blithely ignoring rules about giving people drugs which we believe cause death at high rates — just to see if it’s true. If there is an IRB you could get that past, it should be disbanded.

    1. spandrel

      I agree. The biggest problem with observational studies is unobserved confounders. They include no social risk factors in their propensity score model – income, race/ethnicity, social support. Are people who are wealthy/white/married more likely to use BZP than otherwise similar individuals who do not, and are they also generally healthier? If so, that might explain the negative finding.

    2. Lambert

      Why not do the opposite RCT?
      Find people already taking sleeping pills, then supply them either with what they used to take, or a placebo (after a tapering period?)

      Things should be dose-dependent, right? Why not an RCT of the usual dose vs 0.5 * usual?

  26. Eponymous

    This might seem to be a stupid question, but…why not do an experimental study? Aren’t pre-registered double-blind random control trials the thing you use to test drugs?

    Also, why didn’t the FDA do a bunch of those before approving these drugs in the first place? I thought that was what they did?

    1. gwern

      I would guess that the experiments were all too short-term to pick up on all-cause mortality or cancer increases. (You can’t run multi-decade-long experiments before drug approval, not without a lot of major changes to the patent & approval & clinical trial system.) They’re looking for lack of efficacy (but sleep drugs definitely seem like they work) and obvious short-term harms. There’s always some hypothetical long-term harm which could be invoked, but they usually don’t exist. So…

      Incidentally, since we’re debating whether propensity scoring or other correlational methods recover causal estimates, an interesting paper worth considering is the remarkable Facebook paper “A Comparison of Approaches to Advertising Measurement: Evidence from Big Field Experiments at Facebook”, Gordon et al 2016, which essentially asks, how many variables in your propensity scoring analysis do you need to get the causal estimate known from actual Facebook A/B experiments? It turns out that in addition to all the obvious variables, including down to zip codes, you have to throw in “thousands” of other FB datapoints before your propensity scoring analysis has a CI that overlaps the RCT’s CI (not the point-estimate, note, but the CI; pg23).

      1. Eponymous

        I would guess that the experiments were all too short-term to pick up on all-cause mortality or cancer increases.

        Then maybe they were underpowered to begin with? What’s the typical sample size on these things anyway?

        Let’s see — base mortality rates are like 2/1000 for people in their 30s. So in a sample of 10k, that’s ~20 people dying in a year, so if something doubles all-cause mortality you would see ~40. I think that’s statistically significant? And some of the studies above claimed all-cause mortality increased 3-5x!

        Could somebody track down these old trials (assuming they were done) and do a follow-up study? I guess it’s been a while…

        1. gwern

          They’re powered to detect their endpoints, which are typically not long-term or all-cause mortality. In your specific scenario, you’d need a total n≈23,474 (two groups of 11,737) to be reasonably well-powered for a simple proportion test:

          power.prop.test(p1=2/1000, p2=4/1000, power=0.80)

          Two-sample comparison of proportions power calculation

          n = 11736.8206
          p1 = 0.002
          p2 = 0.004
          sig.level = 0.05
          power = 0.8
          alternative = two.sided

          NOTE: n is number in *each* group

          I wouldn’t be surprised if that was 2x as many people used in all the RCTs used for FDA approval combined.
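          For readers without R, the same normal-approximation sample-size formula is a few lines of Python (a standard textbook formula; it agrees with power.prop.test to within rounding):

```python
from statistics import NormalDist

def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-sided two-sample proportion test,
    via the standard normal-approximation formula (matches R's
    power.prop.test to within rounding)."""
    z = NormalDist()
    z_a, z_b = z.inv_cdf(1 - alpha / 2), z.inv_cdf(power)
    pbar = (p1 + p2) / 2
    term = (z_a * (2 * pbar * (1 - pbar)) ** 0.5
            + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5)
    return term ** 2 / (p1 - p2) ** 2

print(n_per_group(0.002, 0.004))  # ~11737 per group, ~23,500 in total
```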

          Could somebody track down these old trials (assuming they were done) and do a follow-up study? I guess it’s been a while…

          I doubt it. The US isn’t Scandinavia. Can’t just get the ID number of each participant and look them up in the population registry to see when they died. The attrition and biases in followup at this point undoubtedly are far larger than the actual sleeping drug effects.

          1. Eponymous

            Honestly, I have no idea how big they are. I just googled and found this.

            The drugs are listed alphabetically. It looks like for the first four the samples for testing for side effects were:

            ADDYI: ~3.1K
            ADLYXIN: ~4.5K
            AEMCOLO: ~620
            AIMOVIG: ~1.3K

            These are all pretty recent, of course.

            (I hope there isn’t alphabetical bias! Hmm, let’s see — Zurampic: 1K, ZULRESSO: ~250, Zontivity: 26K!)

  27. Matt M

    Disclaimer: I am a hardcore libertarian who believes the FDA should be abolished, and that any and all needed drug regulation can be handled by private entities.

    That said… I’m a bit confused here… how exactly do pills that make you more likely to die from… everything get through FDA approval? Can someone give me a quick explanation as to how, if the original studies are true, any doctor in good conscience would prescribe these things at all?

    1. JPNunez

      From what I understand, the pills were approved before the relationship to higher mortality was found. Other than the chance of dying of overdose or a couple of random causes, probably nothing wrong was found back then.

      *checks* Benzodiazepines were discovered in 1955 and marketed in 1963 per wikipedia. The studies are more recent. Early problems with diazepam are about addiction, not death rates.

    2. notpeerreviewed

      I don’t know all that much about clinical trials, but 1.35x lifetime risk of cancer is not actually very much, and it’s doubtful it would show up in the timeframe of clinical trials.

      1. OriginalSeeing

        “I don’t know all that much about clinical trials, but 1.35x lifetime risk of cancer is not actually very much, and it’s doubtful it would show up in the timeframe of clinical trials.”

        A quick search of the (potentially untrustworthy) top google results for “lifetime risk of cancer” gives answers from 33% to 50%. Multiply that by 1.35 and you get 44.55% to 67.5% . Those are pretty serious bumps.

        I strongly agree about the timeframe of clinical trials part though. I seriously doubt any professional Actuaries measure mortality risks purely based on 3 year trials.

        1. JPNunez

          Bah, I looked this up and it seems right? There’s one paper saying it is lower, but overall everyone quotes the 33% lifetime risk of cancer, which sounds wrong based on anecdotal data.

          Either 1/3 of everyone just develops a malignant tumor and lives with it or…???

          e: ah, looked up death causes and it sounds right in certain countries. I guess that Canada has it worse than America and Chile, since people in America/Chile seem to die of other causes, and cancer only makes up 20% or so of deaths; as healthcare improves, cancer starts making up more of the deaths.


          1. M

            Everybody dies of something. If you live long enough, it will almost certainly be cancer.

            Frankly, in Western countries the true share of deaths caused by cancer might be higher, if some of the suicides are indirectly caused by a diagnosis.

          2. Eponymous

            Everybody dies of something.

            …so far.

            If you live long enough, it will almost certainly be cancer.

            Nope. Second leading cause of death for 85+, but only 12% of deaths. Heart disease is #1, at almost 29%. That’s actually quite a bit lower than the overall number: 21% of all deaths are due to cancer. But that’s still #2 behind the heart (23%).

    3. HeelBearCub

      You are reacting to something that only shows up in large-scale epidemiological studies. I believe that it doesn’t show up in the FDA trials. You won’t see it until the drug is approved. There is debate about whether drug companies should be required to continue to assess this kind of data after they have FDA approval.

      AFAIK, sleeping pills aren’t associated with this increased mortality within the time frame of an FDA study. Whatever effect there is, it’s not immediate, or even short-term (unless combined with other depressants like alcohol).

    4. Garrett

      The other trick is that drugs are frequently approved for one category of use but not for another, and then used there anyway.

      For example, a common use of benzodiazepines is to deal with procedural anxiety. E.g., someone who needs an MRI but has claustrophobia can be given, say, Valium so they can get the scan done. This is seen as safe, and the risk relative to the reward is very, very small.

      And then you give them to people who have a fear of sleeping due to night terrors/PTSD, and they do better.

      Some of this might be small enough that it takes a long time to discover.

  28. jsbmd

    When I learned about regression models, one of the first examples was the utility of forward and backward step-wise regression modeling. In the latter case, the least significant covariates were removed step-wise until the model with a subset of covariates was as strong statistically as possible. The paper described does not even hint that this standard statistical method was considered, which should have been part of the sensitivity analyses they performed.

    Further, when presenting complex and/or controversial findings, journals permit supplemental data files to be posted. Would have been interesting and appropriate to include the list of the 300 covariates. The paper describes how 100 were selected by investigators (those are listed in the supplement) and then 200 selected automatically. The 200 are not disclosed anywhere.

    1. spandrel

      Stepwise selection is now highly deprecated (look up “problems with stepwise”), and as a peer reviewer I would generally not recommend an article for publication if the authors relied on it to select their variables.

      1. NeuroStats

        Naive forward and backward stepwise selection (adding or removing one variable at a time) is statistically inefficient and has been largely abandoned in favor of regularization. Evaluating variables one at a time requires that each variable that deserves to be in the model have a sufficiently large true effect size to surpass noise levels. If there is a sea of individually weak but genuinely correlated variables that only jointly have a large effect size, you need other methods. There are some clever greedy algorithms (that internally do some forward and backward variable selection) that can do this as well, but they are mostly a theoretical curiosity since they aren’t implemented in popular stats/ML packages.

        Nevertheless, leaving variables out one at a time, or multiple randomly chosen variables all at once, are common techniques for creating variable importance measures, i.e., you quantify how important a variable is in improving predictions and give it a score between 0 and 1. This is desirable for quantifying the importance of variables of interest, but not for confounders. The reason is that variable importance measures help one be cautious and conservative about interpreting predictive models and not read too much into the presence of a particular variable. It isn’t desirable to use a stringent importance threshold to decide which confounding variables stay in the model – you wouldn’t want to discard a large number of weak confounding variables simply because they have low variable importance measures. They might jointly contribute to a great deal of residual confounding which would be important to eliminate.

        In these applications, it doesn’t matter if one chooses the true confounder or merely sufficiently good proxies for the true confounder. All that matters is that the (non)linear combination of the observed confounding variables spans the subspace of true/unobserved confounding variation contributing to both sleep quality and all-cause mortality. From a brief reading of the method paper, this is basically what they are going for. They have a measure of confounding bias and they choose all the candidate confounders by ranking all the variables that contribute to this bias, up to a justifiable maximum given the study sample size. This approach is overall quite consistent with the variable-ranking literature for the purposes of screening, i.e., you want to eliminate as many variables that are not confounders as possible, but it is okay to leave in weak and potentially false-positive confounders as long as you have the sample size to fit a larger model.

        It is more relevant to do a sensitivity analysis on the assumptions made for choosing the set of confounders in the first place, which they did to some extent. But this is still an emerging area of research, and it requires everyone to build on principled and mathematical notions of what constitutes confounding and how to find the smallest possible set of variables sufficient to capture all confounding variation. It is clear that Glynn and colleagues are aware of results from the DAG literature showing that not all variables should be controlled for. They advocate taking knowledge of the structure of the causal graph into account to choose the set of potential confounders. A more recent paper from them explicitly mentions avoiding adjusting for variables that qualify as instrumental variables.

        They think collider bias is relatively weak/unlikely and assume it isn’t too big of a deal if they get it wrong. In any case, accounting for collider bias would require far greater mechanistic knowledge of how all the variables in their datasets are causally related to each other. If some of the variables they adjust for are strong colliders, then that would qualify as a reason to worry. This is an area where current knowledge/methods are limited in the absence of a good causal graph. Leaving potential colliders in or out would be a desirable sensitivity check if something were to turn up here.
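        NeuroStats’s point about individually weak but jointly important confounders can be made concrete with a toy simulation (numpy only; the setup and all numbers are invented for illustration and are not taken from any of the papers discussed). Fifty weak, correlated confounders jointly drive both “treatment” and outcome, so the naive comparison shows a large spurious effect, while jointly adjusting for all of them (here with a small ridge penalty) recovers the true null:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5000, 50

# 50 individually weak, correlated "confounders": each barely matters,
# but jointly they drive both treatment and outcome.
shared = rng.normal(size=(n, 1))
X = 0.5 * shared + rng.normal(size=(n, p))

treatment = (X.sum(axis=1) + rng.normal(size=n) > 0).astype(float)
y = 0.0 * treatment + 0.1 * X.sum(axis=1) + rng.normal(size=n)  # true effect: 0

# Naive estimate: difference in means -> badly confounded.
naive = y[treatment == 1].mean() - y[treatment == 0].mean()

# Joint adjustment: ridge regression on [treatment, all 50 confounders]
# recovers a treatment coefficient near the true value of zero.
Z = np.column_stack([treatment, X, np.ones(n)])
beta = np.linalg.solve(Z.T @ Z + 1.0 * np.eye(p + 2), Z.T @ y)
adjusted = beta[0]

print(naive, adjusted)  # naive is far from 0; adjusted is close to it
```

        No single one of these confounders carries much signal on its own, which is the scenario where one-at-a-time stepwise selection would plausibly discard them.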

  29. HarmlessFrog

    Is there anything here that isn’t epidemiology? Because using epidemiology to establish causal relations is just wrong. Epidemiology is for generating hypotheses – “do sleeping pills increase mortality?” – but unless the RR is truly gigantic, you can’t say the association is causal. 2-5x is enough to merit experiments to figure it out; it hardly means there must be anything there.

    1. notpeerreviewed

      Can you explain what you mean when you say “epidemiology”? When I think of the word “epidemiology” I think of an entire field of study, not a specific research methodology.

      1. HarmlessFrog

        By ‘epidemiology’ I mean observational studies of populations. As opposed to, say, animal studies, in vitro experiments, or clinical trials.

        1. lewford

          This was my impression as well. Unless you have an RCT or a clever pseudo-experimental research design (e.g., a lot of econometric modeling, like difference-in-differences and regression discontinuity), coming to any conclusions about causality is probably not justified.

          That said, it’s still in our interest to estimate the effects of sleeping pills with the available research. And given that there are no RCTs or causal modeling, it makes sense to try to work out what is going on in these conflicting papers. I didn’t look very closely at the Patorno paper, but it seems like they could re-run their model using only the variables included in the previous studies to see if that changes their results (i.e., is it the data or the model that is producing the differing findings?).

  30. OriginalSeeing

    What other things commonly lead to a 3-5x risk increase of death from cancer, car-accidents, heart disease, lung disease, suicide, and more?

    1. Eponymous

      Sleep deprivation? (Not sure about cancer, but the rest).

      The proposed mechanism might be that sleeping pills aren’t inducing “natural sleep” (whatever that is), so you still get (some of) the damage from not sleeping.

      Another thing that raises all cause mortality: genetic load. Bad genes, basically.

  31. OriginalSeeing

    Do their attempts at controlling for 300 confounders usually lead to the same conclusion held previously, or the opposite? (Or whatever the appropriate mathematical concept would be.)

    This seems like an important point to know in response to the easy knee jerk of assuming it might make things overfit.

    Also, doesn’t “using a New Jersey Medicare database to follow 4,182,305 benzodiazepine users and 35,626,849 non-users for nine years” mean that their sample size was 2 orders of magnitude larger than the others you listed?

  32. VK

    Heads up to Scott – I think the paragraph about Kripke’s first concern is repeated twice with different words.

    Very interesting article – curious to see how this all pans out in a few years

  33. Bugmaster

    Can someone explain, in simple terms, what correcting for 300 confounders looks like? It sounds a little bit like GWAS to me, and thus prone to the same problems. If you create a 300-dimensional space, it’s relatively easy to find whatever association you want in there; certainly easier than in 10-dimensional space.

    1. notpeerreviewed

      In this case, they use something called propensity score matching, which is somewhat different from the default way of correcting for confounders. It doesn’t suffer from multiple comparisons in the way GWAS does, because they’re not measuring coefficients for specific confounders; they’re trying to get everything else out of the way so they can accurately measure the coefficient for sleeping pills.

      In doing so, IMHO, they’re probably introducing a different kind of problem called “conditioning on a collider.”
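      To make “getting everything else out of the way” concrete, here is a toy sketch of the propensity idea in plain numpy (stratifying on the propensity score rather than matching, for brevity; the “severity” variable and all numbers are invented, and the propensity is taken as known here, whereas the actual studies estimate it, e.g. by logistic regression). A single confounder drives both pill use and death, so the naive comparison makes harmless pills look deadly; comparing like-with-like within propensity strata mostly removes the artifact:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20000

# One confounder ("illness severity") drives both pill use and death risk;
# the pills themselves do nothing.
severity = rng.normal(size=n)
p_treat = 1 / (1 + np.exp(-2 * severity))      # sicker people get pills
pills = rng.random(n) < p_treat
p_death = 1 / (1 + np.exp(-(severity - 2)))
death = rng.random(n) < p_death

# Naive comparison: looks like the pills kill.
naive = death[pills].mean() - death[~pills].mean()

# Stratify on the propensity score and compare treated vs untreated
# within strata, then take a weighted average across strata.
edges = np.quantile(p_treat, [0.2, 0.4, 0.6, 0.8])
stratum = np.digitize(p_treat, edges)
diffs, weights = [], []
for s in range(5):
    m = stratum == s
    if death[m & pills].size and death[m & ~pills].size:
        diffs.append(death[m & pills].mean() - death[m & ~pills].mean())
        weights.append(m.sum())
adjusted = np.average(diffs, weights=weights)

print(naive, adjusted)  # adjusted is much closer to the true effect of 0
```

      Note that the whole trick only works to the extent the estimated propensity actually captures the confounding, which is where the collider worry comes in.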

      1. Enkidum

        Thanks for taking the time to clarify this several times – it’s nice when someone’s actually done a bit of homework rather than just blindly commenting like, uh, me.

    2. quanta413

      I thought GWAS corrected for the high dimensionality of the number of polymorphisms? Without additional confounding structure, the number of spurious signals expected due to high dimensionality is relatively well understood. At least if you’re using something simple like an additive model with no interaction terms.

      There was a paper within the last year about PGS estimates (which are supposed to be causal) shrinking due to population stratification issues though, which seems like a vaguely similar but more subtle issue. Apparently correcting for population stratification using principal components was insufficient (or something like that).

  34. Nancy Lebovitz

    I’m wondering what matching people for history of insomnia would do, though that may be too difficult. There’s probably some reason why people with insomnia are or aren’t taking sleeping pills.

    Are the computer programs that analyzed the data available, especially for the Patorno et al. study?

  35. bean

    I think there’s an editing error in the section about Kripke’s objections to the study. It looks like you wrote two different versions of that, and forgot to remove one before publishing.

  36. Anon.

    I’m currently writing a piece on how the entire “controlling for confounders” thing is a bad idea in general. Just adding some confounders isn’t enough, and even if the R^2 is extremely high, there is no guarantee that the coefficients even have the right sign.

    Other than exceptional cases (eg smoking where the effect size is absolutely humongous and there are prior theoretical reasons to expect the relation to exist) you should probably ignore all studies with this type of design.

    1. AlexanderTheGrand

      Can you in brief offer a solution besides confounder analysis to answer this type of question?

      1. Anon.

        You need a source of random variation. Surprisingly often you can find natural sources of variation, so not everything needs an RCT.

        1. 10240

          With a natural source of variation, how do you know that it doesn’t correlate with a confounder?

  37. JPNunez

    Can this massive control algorithm be applied to other things? Sounds like a gamechanger.

  38. Corey

    With automated confounders, what are the chances some of them correlate highly with sleeping pill use, guaranteeing any effect would go away?
    Sometimes you even see this with manual confounders.

    1. notpeerreviewed

      what are the chances some of them correlate highly with sleeping pill use

      The intent-to-treat methodology *if done perfectly* deals with this problem. Of course, if your intent-to-treat methodology was perfect, you wouldn’t need to control for confounders in the first place.

  39. brmic

    I’m confused by not finding any mention of lasso or ridge regression (penalized regression), random forests, or something like that in either Patorno et al. or Schneeweiss, Rassen, Glynn et al. (the algorithm paper), as these well-established methods can deal with p > n numbers of confounders/covariates and thus would, from my perspective, be an obvious first step in exploring the difference in results.
    If these methods show the same null results, one could then check the confounders they highlight against those used earlier. If they show non-null results, the propensity score algorithm deserves intense scrutiny.

    That said, Patorno et al. report a significant increase in mortality if they extend their observation period, which they attribute to residual confounding. They don’t have a good reason for this. It’s also possible that they overcorrected, or that the truth is in the middle: previous results were an overestimate (always a safe bet) but there’s a true effect, it’s just smaller.

    Third, I don’t see the problem in clinical practice. People generally don’t take hypnotics for fun, so whatever risk there is is probably a tradeoff against the sleeping problem they’re trying to fix. Once people use sleeping pills, their risks of death et al. are higher (1.2% vs. 6.1% per Table 1 in Kripke 2012) and it doesn’t really matter much whether that is due to some confounder or the pills.

    1. acymetric

      People generally don’t take hypnotics for fun

      I suppose it depends on how strong your “generally” qualifier is, but are you sure about this?

      I guess maybe also if you draw a distinction between “recreationally” and “for fun” but I think most people would see them as synonyms in this case.

    2. Scott Alexander Post author

      I’m confused by your last paragraph. Suppose you have a patient who has a little trouble sleeping, but it’s not ruining their life or anything.

      If sleeping pills had no side effects, you’d want to give some just to make their life a little easier.

      If sleeping pills have very serious side effects, you’d tell them to just deal with it.

      1. brmic

        If sleeping pills had no side effects, you’d want to give some just to make their life a little easier.

        Err no, I hope not. There are a lot of non-pharmaceutical interventions to improve sleep. I’d hope you’d suggest _several_ of them first. Also, try waiting; maybe the problem goes into remission on its own.
        If it doesn’t, and continues to be a real _problem_ (not just a nuisance), then you suggest the pills, mentioning the potential risk.
        I get that it’s a bit of a hit to one’s ‘just world’ view that there aren’t as many perfectly harmless pills with only good effects as one would want there to be, but ultimately I submit that the notion itself is a bit silly and we’d all do well to keep that in mind.

        1. acymetric

          Reread the premise…emphasis on if sleeping pills had no side effects (which would mean there is no potential risk to mention).

          1. brmic

            Yes, and my point is that your prior for that should be:

    3. notpeerreviewed

      I’m confused by not finding any mention of lasso or ridge regression (penalized regression), random forest or something like that

      Those techniques (random forests especially) are primarily used for predictive analysis, not causal analysis. Lasso regressions could still be good for producing sparse models, but that’s not what the researchers were trying to do here – they knew ahead of time that their priority was measuring one specific treatment, so they didn’t care much about how the rest of the model turned out.

      1. spandrel

        Correct. Random forests are useful for isolating the ‘best predictors’ for a given outcome, but not for isolating the effect of a single predictor. The propensity score approach used here allows the authors to (presumably) bundle up all confounding effects in a single number, which they can ignore. My concern would be that the propensity score factors they retained include one or more endogenous variables, which would cause the main treatment effect to be attenuated.

      2. brmic

        Yes, but even in that case they could have used those tools in the propensity score analysis. The issue I’m talking about here is that they’re doing two things differently from earlier research: (a) use 300 covariables and (b) the prop score algorithm. IMHO it’s important to find out which of these two things (if any) is responsible for their null results. Plugging the 300 variables into an _established_ method would be an uncontroversial and simple first step to doing that.

  40. Corey

    Sleep apnea is, classically, heavily under-diagnosed, so maybe there’s something there.
    But it’s less under-diagnosed in the obese, and it correlates with BMI, which everyone controls for, so it’s hard to tell which way the effect would go.

    1. notpeerreviewed

      Sleeping pills don’t treat sleep apnea, though; only the class of “sleeping disorders for which sleeping pills are a common treatment” is relevant here.

  41. tobias3

    I think the problem is that the statistics don’t include the probability that the data is wrong. So all the confounders might well be accurate predictors of where data is wrong in the insurance database.

    You might also have some experience of that in the US… how often does it happen that doctors put something on the insurance claim and give the patient something else?

    1. Steve Sailer

      Are the data randomly wrong or wrong in a consistent direction?

      “how often does it happen that doctors put something on the insurance claim and gives the patient something else?”

      Lately, sleeping pills are fairly tightly controlled substances, and doctors who play games with them can get in big trouble with the law.

    2. notpeerreviewed

      We usually don’t need to worry about this, unless the data is systematically wrong, or the quality is so low that there’s no signal in the noise.

      1. Blueberry pie

        > unless the data is systematically wrong

        Most real-world data involving humans, and especially health records, are systematically wrong. E.g., different doctors/departments/hospitals/… might have different thresholds for assigning a diagnosis/running a test/inviting for followup/… . Different doctors/… are also likely to see different patients (senior doctors will see harder cases, plus effects of location, etc.). Voilà – systematically wrong data.

    3. wwbrannon

      I wonder if this could be due to some kind of attenuation bias. Let’s say hypothetically that a) sleeping pills really are dangerous, and b) NJ is bad at record-keeping, and this database has much lower-quality data than the sources used in other studies. Then there’s lots of measurement error in the 300 confounders, the propensity score Patorno et al compute won’t produce good matches, and the treatment effect estimates are the sum of the real treatment effect and some kind of noise. Presumably the noise is larger in magnitude than the treatment effect.

      If the noise is also zero-mean, which (without having done any of the math here) seems plausible, we’d get treatment effect estimates attenuated toward zero.
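      A toy version of this in numpy (all numbers invented; a single mismeasured confounder standing in for 300, and plain linear regression standing in for the propensity machinery). Whether the net bias pushes the estimate toward or away from zero depends on the setup; what the sketch does show is the general point that the noisier the measured confounder, the less of the confounding the adjustment removes:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50000

u = rng.normal(size=n)                      # the true confounder
pills = u + rng.normal(size=n) > 0          # sicker people get pills
y = u + rng.normal(size=n)                  # pills truly have zero effect

def pill_coefficient(measured_u):
    """OLS of y on [pills, measured confounder, 1]; return pill coefficient."""
    Z = np.column_stack([pills.astype(float), measured_u, np.ones(n)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta[0]

naive = y[pills].mean() - y[~pills].mean()
clean = pill_coefficient(u)                             # confounder measured exactly
noisy = pill_coefficient(u + 2.0 * rng.normal(size=n))  # badly mismeasured

print(naive, noisy, clean)  # noisy lands between naive and clean
```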

  42. Steve Sailer

    How many people start taking sleeping pills during a hospital stay? Hospitals tend to be enthusiastic about offering sleeping pills to patients staying overnight.

    1. Evan Þ

      Hospitals tend to be enthusiastic about checking on patients overnight, so they have some reason behind giving out sleeping pills.

      Ideally, they’d check on patients less – but hospitals are also (we hope) unfamiliar environments to the patients, so there’d still be some difficulty falling asleep.

      1. Steve Sailer

        What percentage of sleeping pill users were introduced to sleeping pills in the hospital?

        E.g., do many people say something like, “Wow, when I went in for my cancer surgery, I didn’t think I’d be able to sleep in the busy hospital, but that Ambien knocked me right out. I should use them at home, especially now that I’m having a hard time sleeping due to my fear of dying.”

        If you could see a link between hospitalization and using sleeping pills, the apparent sleeping-pills-and-mortality correlation might really be a hospitalization-and-mortality causal correlation.

  43. Dragor

    Is there any way to test for a confounder in a study by running an analysis of something we *know* doesn’t cause harm? Like, suppose we knew melatonin wasn’t dangerous, and we ran a study on people who take melatonin, and our result said “*bing!* *bing!* 3X risk of cancer” – then we would know our model went wrong somewhere. Is that a thing that we do? I mean, that sounds a *lot* like a control group in a clinical trial I guess…. Maybe I am just (ironically) sleep deprived?

    1. Steve Sailer

      You can make up random confounders, such as whether the subject was born on an even or odd day of the month, or has an even or odd street address. If you put the 300 confounders in rank order of importance, hopefully this kind of arbitrary sanity check shows up near the bottom.

      Say you made up 100 such random items and now adjusted not for 300 confounders but for 400 confounders. Hopefully, adding random noise to the study doesn’t make your conclusion more confident. If it does, you’ve probably got a problem with your approach of having 300 confounders.
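      This sanity check is easy to sketch in numpy (a toy with one real confounder, a known true effect of 0.5, and 100 made-up coin-flip-style covariates; everything here is invented for illustration). Adding the junk variables should leave the estimated effect essentially unchanged; if it didn’t, that would be the red flag Steve describes:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 10000

age = rng.normal(size=n)                    # one real confounder
pills = age + rng.normal(size=n) > 0
y = age + 0.5 * pills + rng.normal(size=n)  # true pill effect: 0.5

def pill_effect(n_junk):
    """OLS pill coefficient, optionally padded with made-up covariates."""
    cols = [pills.astype(float), age, np.ones(n)]
    if n_junk:
        cols.append(rng.normal(size=(n, n_junk)))  # "odd street address"-style noise
    beta, *_ = np.linalg.lstsq(np.column_stack(cols), y, rcond=None)
    return beta[0]

base = pill_effect(0)
padded = pill_effect(100)

print(base, padded)  # both near 0.5; the junk shouldn't move the estimate
```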

      1. Watchman

        Date of birth has a material effect on well-being at least in societies where there is a single annual intake to education. Basically being born earlier in the school year is likely to lead to better educational and therefore life outcomes because you’re better equipped to cope with education.

        There might also be health implications of, say, being born in Lapland in winter, or in tropical countries during fever season.

        1. Joseph Greenwood

          Being born in January versus November might lead to different life outcomes, but the claim being made is that being born on June 1 versus June 2 is irrelevant to life outcomes, and (hopefully) to sleeping-pill mortality rates.

        2. The Big Red Scary

          “Basically being born earlier in the school year is likely to lead to better educational and therefore life outcomes because you’re better equipped to cope with education.”

          Is this hypothetical, or does it actually show up in data?

          1. Steve Sailer

            I would hope that odd or even day of the month is pretty random.

            However, birthdate within the year is a big deal in sports achievement, depending on the birthdate cutoff in that sport. E.g., the best Canadian hockey players tend to have been born early in the year, so that they were always the oldest on their 8-year-old team, 9-year-old team, etc., and thus were selected for all-star teams, travel squads, more coaching, etc.

            E.g., Wayne Gretzky, who from an early age was recognized as a future great, was born January 26, 1961. So Gretzky was older than about 93% of other Canadian boys born in 1961.

            In contrast, another superb Canadian athlete, Larry Walker, washed out of hockey, perhaps due to being born on December 1, 1966 and being stuck in leagues with older boys. He then switched his primary focus to baseball and became the 1997 National League MVP, hitting .366 with 49 homers.

            But whether Gretzky was born on the 26th or 27th of the month is likely pretty irrelevant to his fabulous career.

            Or maybe not …

          2. The Big Red Scary

            Such an effect in sports is more consistent with my priors, since only one person plays a given position at a time. But this is much less consistent with my priors for education: people can simultaneously solve quadratic equations.

      2. notpeerreviewed

        I don’t think it’s the truly random controls we need to worry about; rather, it’s the ones that have real causal associations with health outcomes, but don’t fit into the causal diagram in exactly the way the researchers had in mind.

    2. j1000000

      IANADoctor but are there many things that we “know” are safe? When I was a kid my mother would change her recipes and meal patterns every time the news had a new story about some study on some thing being dangerous in high or low or normal quantities. It was an almost constant battle — one day salt, the next day coffee, red meat, meat generally, juice, fish with mercury, fat, wheat, carbs, fruit and vegetables sprayed with chemicals, margarine, sugar, on and on. She wasn’t even one of those hyperneurotic helicopter-parent types, she just wanted to make healthy food!

  44. Steve Sailer

    I often find it helpful to put things in rank order and then stare at the top items and the bottom items. If you put these 300 confounders in order of effect size, what’s at the top of the list? (Age, I hope, or, say, having cancer.) What percentage of the total effect do the top 10 confounders have? The top 25?

    Next, do the confounders in this study correlate closely with confounders in studies of general mortality? Or are there confounders that cause particularly high mortality in people who take sleeping pills but not among the general public?

    It could be that seemingly odd high effect confounders turn out to offer important insights.

    For example, if, say, living in a 2 story house rather than a 1 story house correlates with much higher mortality among sleeping pill takers, but not among non-takers, maybe falling down the stairs after taking a sleeping pill is a big problem. Or if, say, drinking a lot of diet soda correlates with high mortality in a sleeping pill study but not in general, maybe there is some unknown interaction between sleeping pills and artificial sweeteners.

    What you don’t want to do is adjust away all the useful information in your study. If it turns out, say, that people with red hair tend to drop dead from taking sleeping pills, you don’t want to adjust away that information as being a confounder.

    You should be able to run the 300 confounders through a factor analysis. With factor analysis you end up staring at a bunch of items that often occur together and you have to play a mental game of trying to guess how they are related.

  45. Michael Watts

    So, I’m pretty sure that in a study of 40 million people, adjusting for 40 million potential confounding variables is enough to guarantee a perfect match between your model and the data. The more degrees of freedom you give the model, the better you can fit anything.

    And it was my impression that this is why we don’t try to adjust for 300 confounders, in anything.
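    The saturation worry in the first paragraph is easy to verify in miniature (n = 100 rather than 40 million, pure noise, numpy only): with as many covariates as observations, ordinary least squares fits the noise exactly.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100

y = rng.normal(size=n)        # a pure-noise "outcome"
X = rng.normal(size=(n, n))   # as many covariates as observations

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
r2 = 1 - ((y - X @ beta) ** 2).sum() / ((y - y.mean()) ** 2).sum()

print(r2)  # ~1.0: the model "explains" noise perfectly
```

    (With 300 covariates and millions of records the model is nowhere near this saturated, so the force of the objection is about the freedom to choose which 300 confounders to adjust for, not literal overfitting.)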

    So some questions based on this theory:

    1. Take some data sets for a relationship we think is well accepted. By controlling for 300 potential confounders, how much can we shrink the relationship?

    2. By adding 300 “adjustments” to the data, how much can we magnify the relationship?

    My personal inclination is that this is a case where the “epistemic learned helplessness” model applies in full; adjusting for 300 confounders is wrong because that approach can never be helpful — to do this number of adjustments correctly, you’d need to draw on divine inspiration. Since you can’t do that, the amount of information you can extract from such a design is zero. I tend to agree that it’s unlikely for a few sleeping pills a year to causatively double all-cause mortality, but I’d prefer not to have to rely on an all-purpose proof to prove that.

    Am I totally out of line here?

    1. AlexanderTheGrand

      Divine inspiration, or “an algorithm that automatically identifies and adjusts for massive numbers of confounders in some kind of principled way.”

      1. Blueberry pie

        Having some background in statistics, I tried to grasp what the algorithm does. I don’t think I understood it completely, but what they do is not magic and I don’t think you can call it “principled”. Don’t get me wrong – the approach is IMHO a clever hack, but I am highly skeptical of any automatic variable selection method when we want explanation (it may work for prediction). How can we be sure this is not just noise-mining? Would they get similar results if they used 300 random predictors?

        And they make a lot of questionable choices – e.g., for any intervention/medication/… in the record, they bin it into 3 binary variables: “occurs at least once”, “occurs more than [median in the dataset] times”, and “occurs more than [75th percentile of the dataset] times”. Why 3? Why cut at exactly those points?
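        If I’ve read the description right, the binning could be sketched like this (numpy; the counts are invented, and the choice to take the median and 75th percentile over patients with at least one occurrence is my guess at a detail the description leaves open):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical claims data: how often each of 1000 patients' records
# shows some billing code (counts are invented).
counts = rng.poisson(2.0, size=1000)

# One count variable becomes three binary variables, with the quantiles
# taken over patients who have the code at least once (my guess).
nonzero = counts[counts > 0]
med, q75 = np.median(nonzero), np.quantile(nonzero, 0.75)

ever = counts >= 1
sporadic = counts > med
frequent = counts > q75

binned = np.column_stack([ever, sporadic, frequent]).astype(int)
print(binned[:5])
```

        Note the three binaries are nested (frequent implies sporadic implies ever), so this is really a crude three-step dose-response encoding.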

        A “principled” way would IMHO be to assume a monotonic, non-linear (but smooth) effect of the number of occurrences of each intervention/medication/… . Also some kind of prior/regularization (we would a priori expect most effects to be small and interaction effects to be even smaller). Then you would inspect the hell out of the model: look at which predictors are important, see what happens when you omit them, look for patient groups that are badly fit by the model, compare patients just below a cutoff threshold to those just above, …

        But this is both very hard to do well and computationally demanding (to the point that it might not be usable on a dataset of the size they have).

        I like that they see how the results change with some different choices in modelling. It would be nice to inspect a huge number of choices as in

    2. notpeerreviewed

      Am I totally out of line here?

      Sort of, actually, at least in the first paragraph. But questions 1 and 2 are still good ones – when you’re using an unusual statistical technique, it’s a good idea to validate it against known data, especially if you found the opposite of what every other study has ever found.

      As for the specific issue: overfitting is a problem for predictive modeling, and while some of the same issues apply to causal modeling, they show up in different ways. In a causal model, controlling for 300 confounders basically means you’re doing multiple comparisons – you should expect that many of the relationships you’ve found are spurious. However, if you don’t care about the coefficients on the control variables – if all you care about is the effect of sleeping pills – you don’t really care that some of the other associations aren’t real.

    3. syrrim

      The thing they’re trying to test is:

      “Suppose we had two identical people. We give one the treatment, and the other we don’t. Is the one given the treatment worse off?”

      Obviously we’d like to use a randomized, controlled study, but we can’t. So instead what we’d like to do is identify all the pairs of identical people with one in the treatment group and one in the control group, and compare them. If the person given the treatment fares worse, then we know it has to be because of the treatment, since the people were otherwise identical.

      1. Michael Watts

        If the person given the treatment fares worse, then we know it has to be because of the treatment, since the people were otherwise identical.

        This is a methodology, but it relies on an assumption that we know with 100% certainty is false. Those people were not otherwise identical; they differed in uncountable ways and the two groups they were divided into differed systematically in uncountable ways.

        So we have to choose which differences count and which don’t. Adjusting for 300 confounders, we have 300 differences that do count and an infinite number that don’t.

        But we got to pick our 300 dimensions. Since we’re picking from an infinite pool, that gives us a very large amount of leeway to select a set of differences that give us the answer we were looking for to begin with.

        Exercise: by keeping the 300 confounders used here, and adding 300 additional confounders, restore the original estimate of the effect of sleeping pills on all-cause mortality.

  46. Faza (TCM)

    But I think it’s important to notice: if they’re right, everyone else is wrong. If you’re using a study design that controls for things, you’re operating on an assumption that you have a pretty good idea what things are important to control for, and that if you control for the ten or twenty most important ones you can think of then that’s enough. If P&a are right (and again, I don’t want to immediately jump to that conclusion, but it seems plausible) then this assumption is wrong. At least it’s wrong in the domain of benzodiazepine prescription and mortality. Who knows how many other domains it might be wrong in?

    Wouldn’t the next logical step be to see how many (and which) potential confounders we can discard from the model before the result starts to change noticeably?

    I can imagine a couple of non-exclusive scenarios here:
    1. Of the 300 confounders identified by Glynn’s algorithm, some (many? most?) are irrelevant to the result – we can discard them and still get the same answer. Crucially, this might include the “sensible” confounders (ones that would be picked by human experts in other studies). We have here a model that has given us an interesting (that is: different than expected) result. Can we get the same result with a smaller model?

    2. It is my suspicion that when it comes to human biology, multiple concurrent causes are more likely to be an explanation for a given result than a single cause (the reason I believe this is that in a natural selection scenario it seems more likely that organisms will be able to cope with a variety of environmental insults, only succumbing to overwhelming adversity; the weak get selected out fairly early in the process). Therefore, it is possible that sleeping pills may only contribute to mortality combined with this, that and the other. Or it might be the case that the observed increased mortality is caused just by this, that and the other, which just happen to be correlated (as a complex or individually) with sleeping pill use.

    If using a more complex (more confounders) model gives you a different result than using a simpler one (fewer confounders), it may be the case that the new result is an artefact of model complexity, or it may be the case that the simpler model missed the important bits.

    1. Watchman

      Was going to comment basically the same point: there is nothing to show that all the studies are not correct. Indeed since I believe we have a good idea how benzodiazepines work, we know that taken sensibly they should not increase mortality all else being equal. The confounders are what is not equal, so perhaps playing with these is the sensible next step.

      Mind you, someone presumably needs to repeat the new study on a different population in case New Jersey is just atypical. Not that anyone would ever suggest that…

    2. Enkidum

      Wouldn’t the next logical step be to see how many (and which) potential confounders we can discard from the model before the result starts to change noticeably?

      Yeah, that would make life a lot easier. Sadly, the order in which you remove variables can drastically change the effects of doing so. Consider all possible orders of 300 variables, and realize that this is essentially impossible to do in a principled way. There are various tricks that people use (mostly along the lines of grouping related ones, or saying “we think for a priori reasons this one does/doesn’t matter”) but I can’t see any of them working with 300.

      Caveat: there are probably more clever ways of going about this that stats nerds have developed, and I haven’t read the paper.

      1. Faza (TCM)

        Given that we’re trying to eliminate completely spurious variables – at the outset, at least – is that such a big problem?

        Let’s consider a completely naive approach: re-run the analysis 300 times, omitting a single variable in each case. Either we will find that omitting certain variables alters the result more than omitting other ones – in which case we have a rough idea of what’s important and what isn’t and we can order our variables by magnitude of change, from most significant to least; or, we find that omitting any variable changes the result roughly as much as omitting any other variable. That would suggest that all of them are of roughly equal importance.

        My guess would be that the first scenario (ordering of variables by significance) is more likely and that on the “least significant” end of the scale we would find a number of variables that we can throw away in aggregate without significantly altering the result.

        1. Blueberry pie

          Unfortunately the “naive approach” you describe doesn’t work that well in practice. Sometimes combinations of variables are important: removing any single one of a pair (triplet, …) of connected variables might not hurt your prediction much, making each of them seem unimportant, but removing both (all three, …) would be disastrous for the fit.
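
          A toy simulation of that situation (invented data, nothing to do with the actual paper): x1 and x2 are two noisy, highly correlated proxies of a hidden cause z of the outcome y. Dropping either one barely hurts an OLS fit; dropping both destroys it:

```python
import random

random.seed(0)
n = 2000

# Hidden cause z; x1 and x2 are two noisy, correlated proxies for it
z  = [random.gauss(0, 1) for _ in range(n)]
x1 = [zi + random.gauss(0, 0.3) for zi in z]
x2 = [zi + random.gauss(0, 0.3) for zi in z]
y  = [zi + random.gauss(0, 0.5) for zi in z]

def ols_r2(predictors, y):
    """R^2 of an ordinary least-squares fit (all variables centered)."""
    n = len(y)
    y_mean = sum(y) / n
    yc = [yi - y_mean for yi in y]
    Xc = []
    for col in predictors:
        mu = sum(col) / n
        Xc.append([xi - mu for xi in col])
    k = len(Xc)
    if k == 0:
        return 0.0
    # Normal equations (X'X) b = X'y, solved by Gaussian elimination
    A = [[sum(u * v for u, v in zip(Xc[i], Xc[j])) for j in range(k)] for i in range(k)]
    rhs = [sum(u * v for u, v in zip(Xc[i], yc)) for i in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            f = A[j][i] / A[i][i]
            for c in range(i, k):
                A[j][c] -= f * A[i][c]
            rhs[j] -= f * rhs[i]
    b = [0.0] * k
    for i in reversed(range(k)):
        b[i] = (rhs[i] - sum(A[i][j] * b[j] for j in range(i + 1, k))) / A[i][i]
    fitted = [sum(b[j] * Xc[j][t] for j in range(k)) for t in range(n)]
    ss_res = sum((yt - ft) ** 2 for yt, ft in zip(yc, fitted))
    ss_tot = sum(yt * yt for yt in yc)
    return 1 - ss_res / ss_tot

r2_both = ols_r2([x1, x2], y)  # both correlated proxies in the model
r2_one  = ols_r2([x1], y)      # drop either one: fit barely changes
r2_none = ols_r2([], y)        # drop both: the fit collapses
```

          So a one-at-a-time omission test would flag both x1 and x2 as “unimportant”, even though together they carry nearly all of the signal.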

          1. Faza (TCM)

            If that’s the case, we know there’s something going on.

            I’d started an earlier draft of my previous comment where I went into more detail, but decided against it in the end. One of the angles I looked at was iterating until something happens and then looking closely at the sequence of steps used to arrive there.

            In the context of your example: we’ve identified, by the naive approach, say, five variables (out of 300) that look like we can get rid of any of them without changing our results significantly. So we throw all of them out and suddenly we register a big change.

            Great! There’s something interesting going on here. We can now focus on five variables (rather than 300) and try to determine which of them are, in fact, insignificant (I purposefully picked a greater number of variables than you) and which of them are acting as a complex (the effect only manifests in combination). While we’re at it, we can try to understand what they mean.

            This is an iterative process with many repetitions, so I don’t realistically expect anyone to do it. However, it seems to me that working down from a set of variables until we get an expected result (i.e. in line with pre-existing research) is a hell of a lot easier than working up from pre-existing research to find an unexpected result.

            In other words, we’re trying to find a plausible path from there (results@300) to here (results@10) and subsequently reason about that, which is easier than getting from here to there, if we don’t even know if there’s a there there.

          2. Enkidum

            What @Blueberry Pie said, plus…

            You find five variables using your technique that appear important. What that means is that at least one of those variables, possibly in combination with any of the other 295 variables (likely in combination with several of them), is important.

            This is a well-known problem, and the last I checked the literature (about 10 years ago, with a medium-at-best understanding of the math) there was no accepted solution even for relatively small sets of 10 variables or so. Someone here will know: is the number of possible orders 10! or something like that? Anyways, it’s a lot. And even once you’ve run every possible order taking things out one at a time, it’s a huge problem to figure out what’s really going on.

          3. NeuroStats

            +1 Blueberry pie on why naive single variable at a time is not the answer.

            There are nearly 20 years’ worth of theory/methodology and 1000+ solid papers across applied math, statistics and information theory on doing correct variable selection using a variety of methods that are not naive stepwise variable selection or combinatorial subset selection. We know how to solve these problems when the true model has some parsimonious representation, but it remains an active problem to create customized solutions for different fields.

            Of all the difficulties in this paper, frankly, their ranking-based variable screening is probably fine. I said more here

  47. blumenko

    Collinearity can be a problem with so many variables, but I think that would show up as huge confidence intervals which may give a formal negative result that still allows the possibility of a strong effect. The paper here has tight intervals.

    1. notpeerreviewed

      It looks like they use propensity score matching, rather than including the control variables directly in the regression, and though my understanding of the technique is a bit fuzzy, I think that helps defend against that problem.

  48. Joy

    I’d like to see the graph of the extra mortality as a function of the number of confounders adjusted for. Presumably it would converge to zero once the number gets large. And the same graph for various simulated datasets.

  49. Blueberry pie

    IMHO a big part of the problem is that “control for” rarely means what you think it means. Usually it means that a variable was added to a linear model. When the response to that variable is not completely linear, OR your measurement of the variable is biased, OR you bin a continuous predictor (e.g. using taking/not-taking medication instead of actual dose, using age groups instead of age, possibly even using age in years instead of days), residual confounding may easily persist. And that is even when the variable you “control for” actually is the only causal predictor of the outcome.

    A bit more thorough discussion at the comments in:

    Inference from observational data is just hard 🙁
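
    A toy example of the binned-predictor problem (simulated data; the age groups, effect sizes and cut points are all invented): the outcome depends only on age, older people are more likely to be treated, and coarse age bins leave a residual treated-vs-control gap that near-exact age removes:

```python
import random
from collections import defaultdict

random.seed(1)
n = 20000

rows = []
for _ in range(n):
    age = random.uniform(30, 70)
    treated = random.random() < (age - 30) / 40   # older people get treated more
    outcome = 0.05 * age + random.gauss(0, 0.2)   # outcome depends on age ONLY
    rows.append((age, treated, outcome))

def adjusted_diff(rows, bin_width):
    """Treated-minus-control outcome gap, averaged across age bins."""
    bins = defaultdict(lambda: {True: [], False: []})
    for age, treated, outcome in rows:
        bins[int(age // bin_width)][treated].append(outcome)
    total, weight = 0.0, 0
    for groups in bins.values():
        if groups[True] and groups[False]:
            gap = (sum(groups[True]) / len(groups[True])
                   - sum(groups[False]) / len(groups[False]))
            size = len(groups[True]) + len(groups[False])
            total += gap * size
            weight += size
    return total / weight

coarse = adjusted_diff(rows, 20)  # "age groups": residual confounding remains
fine   = adjusted_diff(rows, 1)   # (nearly) exact age: it mostly disappears
```

    Even though age is the only causal variable and we “controlled for” it, the coarse version still shows a substantial treatment effect that isn’t there.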

    1. Enkidum

      And when you multiply that problem by 300… ouchie.

      Yeah, I haven’t read the paper, but I’m suspicious. Then again, maybe there is a way that clever math can rescue this, it’s above my pay grade. This is why I run experiments.

      1. vV_Vv

        But if the model with control variables is not expressive enough you’ll get a positive result, not a negative result like this study. The opposite is possible though, if the model is too expressive it might overfit. In principle you could check for overfitting by doing cross-validation or other stuff, but in practice it doesn’t work.

        The only possible concern with this study is that among the 300 potential confounders, there might be variables in the causal path between the intervention and the outcome, and controlling on them may make the effect appear weaker than it is.

        A related issue is controlling on a collider: a variable which is independently caused by both the intervention and the outcome. In that case, controlling may make the effect appear stronger than it is.
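
        A quick simulation of the collider case (made-up variables, not from the study): x and y are generated completely independently, c is caused by both, and conditioning on c manufactures an association out of thin air:

```python
import random

random.seed(2)
n = 50000

def corr(a, b):
    """Pearson correlation, stdlib-only."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

# x and y are generated completely independently...
x = [random.gauss(0, 1) for _ in range(n)]
y = [random.gauss(0, 1) for _ in range(n)]
# ...and c is a collider: caused by both of them
c = [xi + yi + random.gauss(0, 0.5) for xi, yi in zip(x, y)]

r_raw = corr(x, y)  # ~0: no real association

# "Controlling" for the collider by looking within a narrow slice of c
# induces a strong spurious (negative) association between x and y
sel = [(xi, yi) for xi, yi, ci in zip(x, y, c) if abs(ci) < 0.5]
r_cond = corr([p[0] for p in sel], [p[1] for p in sel])
```

        Intuitively: among people with the same value of c, a high x forces a low y, so the two look anticorrelated even though neither causes the other.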

    2. notpeerreviewed

      IMHO a big part of the problem is that “control for” rarely means what you think it means.

      In this case, “control for” doesn’t even mean “control for” in the traditional sense – they used propensity score matching, which is related but not the same.
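
      A sketch of the matching idea (simulated data; for simplicity I match on the true propensity score, where a real analysis would first estimate it, e.g. with logistic regression): there is a single confounder, no treatment effect at all, and matching on the score removes most of the bias a naive comparison shows:

```python
import bisect
import math
import random

random.seed(3)
n = 4000

people = []
for _ in range(n):
    z = random.gauss(0, 1)                 # a single confounder
    p = 1 / (1 + math.exp(-z))             # propensity to be treated rises with z
    treated = random.random() < p
    outcome = z + random.gauss(0, 0.5)     # outcome depends on z, NOT on treatment
    people.append((p, treated, outcome))

def mean(xs):
    return sum(xs) / len(xs)

treated_group = [(p, o) for p, t, o in people if t]
controls = sorted((p, o) for p, t, o in people if not t)

# Naive comparison: badly biased, because treated people have higher z
naive_diff = mean([o for _, o in treated_group]) - mean([o for _, o in controls])

def nearest_control_outcome(p):
    """Outcome of the control whose propensity score is closest to p."""
    i = bisect.bisect(controls, (p,))
    candidates = controls[max(0, i - 1):i + 1]
    return min(candidates, key=lambda c: abs(c[0] - p))[1]

# 1-nearest-neighbor matching on the score (with replacement)
matched_diff = mean([o - nearest_control_outcome(p) for p, o in treated_group])
```

      The matched comparison only works to the extent that the score actually captures the confounding – which is why the choice of those 300 variables still matters.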

  50. mikk14

    I’m sure these folks are smart enough to have thought of this, but just for clarification, the position “more controls is always better” is wrong. Sometimes, you have bad controls. I think the golden reference here is Section 3.2.3 of Mostly Harmless Econometrics and also this comment.

    When you have 300 controls, how sure can you be to have examined all of them carefully enough to conclude they’re not bad controls?

    1. janrandom

      Can you give an example of what a potentially bad control would be in this case? Maybe sleeping habit related controls?

      1. mikk14

        I’m a bit out of my depth because I am also not great at statistics, so apologies if this is incorrect/unhelpful. The way I understand bad controls is that they are usually variables that are themselves outcomes of the treatment variable. So in this case, a bad control would be controlling for whether a person has respiratory issues, since the sleeping medication does cause respiratory depression. Of course the control is going to explain away the treatment variable, because that’s how regression analysis works.

        Again, this is a cartoonishly simplified example, because why on earth would you control for respiratory issues happening after the treatment. I’m sure the authors did their homework. But with 300 variables, it’s possible that something subtle slipped through.

        1. notpeerreviewed

          I think that explanation is just fine. The most common technical name for the bias is “conditioning on a collider”, and there are surprisingly many researchers who don’t realize it’s a problem. I can’t use SciHub on this computer so I can’t tell if Glynn is one of them.

          “All-cause death” is just about the worst outcome variable you can imagine when it comes to this kind of bias; it has “all-cause” right in the name, suggesting that there are lots and lots of potential colliders.

          1. aesthesia

            The paper mentions colliders, and says that they are one reason it’s necessary to prune the set of potential covariates by hand first. They also claim that most people think the bias due to colliders is weak.

            I think–though I’m far from sure–that in a regression context, adding inappropriate covariates will tend to bias the outcome away from finding a relationship between the variables of interest. If so, this method would be a good way of verifying that an observed correlation is robust, but not necessarily of showing that two things aren’t related.

        2. 10240

          That would be an issue if we measured respiratory issues after the patients start taking the pills, but not if we measure them before the treatment. Indeed, we should measure everything we control for before starting the treatment, shouldn’t we?

          1. vV_Vv

            I assume researchers avoid the obviously inappropriate confounders, but when you have 300 of them it’s quite possible that there is a weak and circuitous causal path from the intervention to some of the confounders.

          2. 10240

            @vV_Vv There can be no path from the intervention to the confounders if the confounders are measured before the intervention.

          3. vV_Vv


            When does the intervention actually occur? These aren’t randomized controlled trials, so there is no clear cutoff point when the decision is made.

          4. 10240

            @vV_Vv It’s indeed not as simple as I thought. One could say that the intervention starts when one first starts to take sleeping pills, and only data from before that should be used when determining confounder variables.

      2. Dain the Caswallawn

        Not exactly sure if this would translate into an intent-to-treat medical scenario, but one possible example of how controlling for a confounder could seriously damage a study’s correctness is this:

        1) Suppose that in reality, eating Cheese-Its increases your weight. This increase in weight results in an increased mortality rate.
        2) A correlational study attempts to figure out whether Cheese-Its increase mortality. They statistically control for the weight of study participants, and find that when you do so the correlation between cheese-it consumption and mortality rate disappears. They then publish a study informing people that Cheese-Its do not cause an increase in mortality.

        The problem, of course, is that the variable they were controlling for was actually part of the causal chain by which Cheese-Its caused increased mortality.

        I’m not sure if this kind of objection is relevant in an intent-to-treat medication study, though.
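
        Here’s the Cheese-Its scenario as a toy simulation (all numbers invented): snacking raises weight, weight raises mortality, and stratifying on weight makes the (real!) effect of snacking vanish:

```python
import math
import random
from collections import defaultdict

random.seed(4)
n = 40000

rows = []
for _ in range(n):
    snacker = random.random() < 0.5
    weight = random.gauss(70, 3) + (10 if snacker else 0)  # snacking adds weight
    p_death = 1 / (1 + math.exp(-(weight - 80) / 5))       # weight drives mortality
    died = random.random() < p_death
    rows.append((snacker, round(weight), died))

def death_rate(selected):
    return sum(selected) / len(selected)

# Unadjusted: the (real, but indirect) effect of snacking shows up clearly
naive_diff = (death_rate([d for s, _, d in rows if s])
              - death_rate([d for s, _, d in rows if not s]))

# "Controlling for weight": compare snackers and non-snackers of the SAME weight
strata = defaultdict(lambda: {True: [], False: []})
for s, w, d in rows:
    strata[w][s].append(d)
total, count = 0.0, 0
for groups in strata.values():
    if groups[True] and groups[False]:
        gap = death_rate(groups[True]) - death_rate(groups[False])
        size = len(groups[True]) + len(groups[False])
        total += gap * size
        count += size
adjusted_diff = total / count   # the effect has been "controlled away"
```

        Here weight is a mediator, not a confounder: controlling for it answers “does snacking kill you other than through weight gain?”, which is a different question from the one the study claims to answer.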

        1. notpeerreviewed

          I’m not sure if this kind of objection is relevant in an intent-to-treat medication study, though.

          It’s still relevant. Imagine an “intent-to-snack” study on Cheez-Its, where half the snackers (and none of the non-snackers) became obese and died. Controlling for obesity would indicate that obesity, rather than snacking, was associated with death.

          1. Dave Orr

            I don’t think this is right. Controlling for obesity here would control for obesity before snacking, not obesity caused by snacking. Then you could still notice that some people had increased mortality and increased obesity post snacks.

          2. Cliff

            Right, but let’s say that obesity pre-trial indicates a propensity to overeat/oversnack. If you control for pre-trial obesity, then the fact that Cheez-Its kill people who overeat could be obscured, right?

          3. zzzzort

            Cliff, I wouldn’t say that Cheez-Its caused the death if they were just going to eat some other equally unhealthy food instead. From a policy perspective, we generally care about causal results of intent-to-treat.

            Also, in this scenario you would also expect that providing someone with cheez-its, even someone prone to overeating and capable of procuring their own food, would still lead to an increase in overeating.

          4. Cliff

            Hmm… thinking about this a bit more, if you control by adjusting expected mortality based on their pre-existing obesity then it should still be ok and show the correct result I guess.

          5. skybrian

            Maybe look at both groups?

            If you find that there is no effect for people who ate Cheezits and didn’t gain weight, but there is one for those who did, it seems like you might have found something interesting?

      3. cvxxcvcxbxvcbx

        Gwern had a tweet for this. Someone did, anyway; it might not have been Gwern.
        It was something like: “The advantage of expensive restaurants over cheap ones disappears when you control for quality of the food, decor, and service.”

Comments are closed.