5-HTTLPR: A Pointed Review

In 1996, some researchers discovered that depressed people often had a particular variant of the serotonin transporter gene, a polymorphism called 5-HTTLPR. The study became a psychiatric sensation, getting thousands of citations and sparking dozens of replication attempts (page 3 here lists 46).

Soon scientists moved beyond replicating the finding to trying to elucidate the mechanism. Seven studies (see here for list) found that 5-HTTLPR affected activation of the amygdala, a part of the brain involved in processing negative stimuli. In one especially interesting study, it was found to bias how the amygdala processed ambiguous facial expressions; in another, it modulated how the emotional systems of the amygdala connected to the attentional systems of the anterior cingulate cortex. In addition, 5-HTTLPR was found to directly affect the reactivity of the HPA axis, the stress-processing circuit leading from the hypothalamus and pituitary gland to the adrenal glands.

As interest increased, studies began pointing to 5-HTTLPR in other psychiatric conditions as well. One study found a role in seasonal affective disorder, another in insomnia. A meta-analysis of twelve studies found a role (p = 0.001) in PTSD. A meta-analysis of twenty-three studies found a role (p = 0.000016) in anxiety-related personality traits. Even psychosis and Alzheimer’s disease, not traditionally considered serotonergic conditions, were linked to the gene. But my favorite study along these lines has to be 5-HTTLPR Polymorphism Is Associated With Nostalgia-Proneness.

Some people in bad life situations become depressed, and others seem unaffected; researchers began to suspect that genes like 5-HTTLPR might be involved not just in causing depression directly, but in modulating how we respond to life events. A meta-analysis looked at 54 studies of the interaction and found “strong evidence that 5-HTTLPR moderates the relationship between stress and depression, with the s allele associated with an increased risk of developing depression under stress (P = .00002)”. This relationship was then independently re-confirmed for every conceivable population and form of stress. Depressed children undergoing childhood adversity. Depressed children with depressed mothers. Depressed youth. Depressed adolescent girls undergoing peer victimization. They all developed different amounts of depression based on their 5-HTTLPR genotype. The mainstream media caught on and dubbed 5-HTTLPR and a few similar variants “orchid genes”, because orchids are sensitive to stress but will bloom beautifully under the right conditions. Stories about “orchid genes” made it into The Atlantic, Wired, and The New York Times.

If finding an interaction between two things is exciting, finding an interaction between even more things must be even better! Enter studies on how the interaction between 5-HTTLPR and stress in depressed youth itself interacted with MAO-A levels and gender. What about parenting style? Evidence That 5-HTTLPR x Positive Parenting Is Associated With Positive Affect “For Better And Worse”. What about decision-making? Gender Moderates The Association Between 5-HTTLPR And Decision-Making Under Uncertainty, But Not Under Risk. What about single motherhood? The Influence Of Family Structure, The TPH2 G-703T And The 5-HTTLPR Serotonergic Genes Upon Affective Problems In Children Aged 10–14 Years. What if we just throw all the interesting genes together and see what happens? Three-Way Interaction Effect Of 5-HTTLPR, BDNF Val66Met, And Childhood Adversity On Depression.

If 5-HTTLPR plays such an important role in depression, might it also have relevance for antidepressant treatment? A few studies of specific antidepressants started suggesting the answer was yes – see eg 5-HTTLPR Predicts Non-Remission In Major Depression Patients Treated With Citalopram and Influence Of 5-HTTLPR On The Antidepressant Response To Fluvoxamine In Japanese Depressed Patients. A meta-analysis of 15 studies found that 5-HTTLPR genotype really did affect SSRI efficacy (p = 0.0001). Does this mean psychiatrists should be testing for 5-HTTLPR before treating patients? A cost-effectiveness analysis says it does. There’s only one problem.






Or at least this is the conclusion I draw from Border et al’s No Support For Historical Candidate Gene Or Candidate Gene-by-Interaction Hypotheses For Major Depression Across Multiple Large Samples, in last month’s American Journal Of Psychiatry.

They don’t ignore the evidence I mention above. In fact, they count just how much evidence there is, and find 450 studies on 5-HTTLPR before theirs, most of which were positive. But they point out that this doesn’t make sense given our modern understanding of genetics. Outside a few cases like cystic fibrosis, most important traits are massively polygenic or even omnigenic; no single gene should have an effect large enough to measure in samples the size of these studies. So maybe this deserves a second look.

While psychiatrists have been playing around with samples of a few hundred people (the initial study “discovering” 5-HTTLPR used n = 1024), geneticists have been building up the infrastructure to analyze samples of hundreds of thousands of people using standardized techniques. Border et al focus this infrastructure on 5-HTTLPR and its fellow depression genes, scanning a sample of 600,000+ people and using techniques twenty years more advanced than most of the studies above had access to. They claim to be able to simultaneously test almost every hypothesis ever made about 5-HTTLPR, including “main effects of polymorphisms and genes, interaction effects on both the additive and multiplicative scales and, in G×E analyses, considering multiple indices of environmental exposure (e.g., traumatic events in childhood or adulthood)”. What they find is…nothing. Neither 5-HTTLPR nor any of seventeen other comparable “depression genes” had any effect on depression.

I love this paper because it is ruthless. The authors know exactly what they are doing, and they are clearly enjoying every second of it. They explain that given what we now know about polygenicity, the highest-effect-size depression genes require samples of about 34,000 people to detect, and so any study with fewer than 34,000 people that says anything about specific genes is almost definitely a false positive; they go on to show that the median sample size for previous studies in this area was 345. They show off the power of their methodology by demonstrating that negative life events cause depression at p = 0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001, because it’s pretty easy to get a low p-value in a sample of 600,000 people if an effect is real. In contrast, the gene-interaction effect of 5-HTTLPR has a p-value of .919, and the main effect from the gene itself doesn’t even consistently point in the right direction. Using what they call “exceedingly liberal significance thresholds” which are 10,000 times easier to meet than the usual standards in genetics, they are unable to find any effect. This isn’t a research paper. This is a massacre.
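To make the power argument concrete, here is a back-of-the-envelope calculation using Fisher's z approximation for testing a correlation. The effect sizes and thresholds below are illustrative assumptions, not the paper's exact model, but they land in the same ballpark: tens of thousands of subjects to detect a realistic single-variant effect, while the median candidate-gene study could only have detected effects orders of magnitude too large to be plausible.

```python
import math
from statistics import NormalDist

_Z = NormalDist()

def n_required(r2, alpha=5e-8, power=0.8):
    """Sample size needed to detect a correlation explaining a fraction r2
    of trait variance, via Fisher's z approximation (two-sided test at the
    conventional genome-wide significance threshold)."""
    z_effect = math.atanh(math.sqrt(r2))
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    return ((z_alpha + z_power) / z_effect) ** 2 + 3

def min_r2_detectable(n, alpha=5e-8, power=0.8):
    """Smallest variance fraction reliably detectable at sample size n."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    return math.tanh((z_alpha + z_power) / math.sqrt(n - 3)) ** 2

# An illustrative variant explaining 0.1% of variance needs roughly
# 40,000 subjects at genome-wide significance.
print(round(n_required(0.001)))

# The median candidate-gene study (n = 345) could only have detected a
# single variant explaining on the order of 10% of variance -- wildly
# unrealistic for any one SNP in a massively polygenic trait.
print(round(min_r2_detectable(345), 3))
```

Halving the assumed effect size roughly doubles the required sample, which is why underpowered positive results in this regime are almost always false positives.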

Let me back off a second and try to be as fair as possible to the psychiatric research community.

First, over the past fifteen years, many people within the psychiatric community have been sounding warnings about 5-HTTLPR. The first study showing failure to replicate came out in 2005. A meta-analysis by Risch et al from 2009 found no effect and prompted commentary saying that 5-HTTLPR was an embarrassment to the field. After 2010, even the positive meta-analyses (of which there were many) became guarded, saying only that they seemed to detect an effect but weren’t sure it was real. This meta-analysis on depression says there is “a small but statistically significant effect” but that “we caution it is possible the effect has an artifactual basis”. This meta-analysis of 5-HTTLPR amygdala studies says there is a link, but that “most studies to date are nevertheless lacking in statistical power”.

Counter: there were also a lot of meta-analyses that found the opposite. The Slate article on the “orchid gene” came out after Risch’s work, mentioned it, but then quoted a scientist calling it “bullshit”. I don’t think the warnings did anything more than convince people that this was a controversial field with lots of evidence on both sides. For that matter, I don’t know if this new paper will do anything more than convince people of that. Maybe I trust geneticists saying “no, listen to me, it’s definitely like this” more than the average psychiatrist does. Maybe we’re still far from hearing the last of 5-HTTLPR and its friends.

Second, this paper doesn’t directly prove that every single study on 5-HTTLPR was wrong. It doesn’t prove that it doesn’t cause depression in children with depressed mothers in particular. It doesn’t prove that it doesn’t cause insomnia, or PTSD, or nostalgia-proneness. It doesn’t prove that it doesn’t affect amygdala function.

Counter: it doesn’t directly prove this, but it casts doubt upon them. The authors of this paper are geneticists who are politely trying to explain how genetics works to psychiatrists. They are arguing that single genes usually matter less than people think. They do an analysis of depression to demonstrate that they know what they’re talking about, but the points they are making apply to insomnia, nostalgia, and everything else. So all the studies above are at least questionable.

Third, most of these studies were done between 2000 and 2010, when we understood less about genetics. Surely you can’t blame people for trying?

Counter: The problem isn’t that people studied this. The problem is that the studies came out positive when they shouldn’t have. This was a perfectly fine thing to study before we understood genetics well, but the whole point of studying is that, once you have done 450 studies on something, you should end up with more knowledge than you started with. In this case we ended up with less.

(if you’re wondering how you can do 450 studies on something and get it wrong, you may be new here – read eg here and here)

Also, the studies keep coming. Association Between 5-HTTLPR Polymorphism, Suicide Attempts, And Comorbidity In Mexican Adolescents With Major Depressive Disorder is from this January. Examining The Effect Of 5-HTTLPR On Depressive Symptoms In Postmenopausal Women 1 Year After Initial Breast Cancer Treatment is from this February. Association Of DRD2, 5-HTTLPR, And 5-HTTVNTR With PTSD In Tibetan Adolescents was published after the Border et al paper! Come on!

Having presented the case for taking it easy, I also want to present the opposite case: the one for being as concerned as possible.

First, what bothers me isn’t just that people said 5-HTTLPR mattered and it didn’t. It’s that we built whole imaginary edifices, whole castles in the air on top of this idea of 5-HTTLPR mattering. We “figured out” how 5-HTTLPR exerted its effects, what parts of the brain it was active in, what sorts of things it interacted with, how its effects were enhanced or suppressed by the effects of other imaginary depression genes. This isn’t just an explorer coming back from the Orient and claiming there are unicorns there. It’s the explorer describing the life cycle of unicorns, what unicorns eat, all the different subspecies of unicorn, which cuts of unicorn meat are tastiest, and a blow-by-blow account of a wrestling match between unicorns and Bigfoot.

This is why I start worrying when people talk about how maybe the replication crisis is overblown because sometimes experiments will go differently in different contexts. The problem isn’t just that sometimes an effect exists in a cold room but not in a hot room. The problem is more like “you can get an entire field with hundreds of studies analyzing the behavior of something that doesn’t exist”. There is no amount of context-sensitivity that can help this.

Second, most studies about 5-HTTLPR served to reinforce all of our earlier preconceptions. Start with the elephant in the room: 5-HTTLPR is a serotonin transporter gene. SSRIs act on the serotonin transporter. If 5-HTTLPR played an important role in depression, we were right to focus on serotonin and extra-right to prescribe SSRIs; in fact, you could think of SSRIs as directly countering a genetic deficiency in depressed people. I don’t have any evidence that the pharmaceutical industry funded 5-HTTLPR studies or pushed 5-HTTLPR. As far as I can tell, they just created a general buzz of excitement around the serotonin transporter, scientists looked there, and then – since crappy science will find whatever it’s looking for – it was appropriately discovered that yes, changes in the serotonin transporter gene caused depression.

But this was just the worst example of a general tendency. Lots of people were already investigating the role of the HPA axis in depression – so lo and behold, it was discovered that 5-HTTLPR affected the HPA axis. Other groups were already investigating the role of BDNF in depression – so lo and behold, it was discovered that 5-HTTLPR affected BDNF. Lots of people already thought bad parenting caused depression – so lo and behold, it was discovered that 5-HTTLPR modulated the effects of bad parenting. Once 5-HTTLPR became a buzzword, everyone who thought anything at all went off and did a study showing that 5-HTTLPR played a role in whatever they had been studying before.

From the outside, this looked like people confirming they had been on the right track. If you previously doubted that bad parenting played a role in depression, now you could open up a journal and discover that the gene for depression interacts with bad parenting! If you’d previously doubted there was a role for the amygdala, you could open up a journal and find that the gene for depression affects amygdala function. Everything people wanted to believe anyway got a new veneer of scientific credibility, and it was all fake.

Third, antidepressant pharmacogenomic testing.

This is the thing where your psychiatrist orders a genetic test that tells her which antidepressant is right for you. Everyone keeps talking these up as the future of psychiatry, saying how it’s so cool how now we have true personalized medicine, how it’s an outrage that insurance companies won’t cover them, etc, etc, etc. The tests have made their way into hospitals, into psychiatry residency programs, and various high-priced concierge medical systems. A company that makes them recently sold for $410 million, and the industry as a whole may be valued in the billions of dollars; the tests themselves cost as much as $2000 per person, most of which depressed patients have to pay out of pocket. I keep trying to tell people these tests don’t work, but this hasn’t affected their popularity.

A lot of these tests rely on 5-HTTLPR. GeneSight, one of the most popular, uses seven genes. One is SLC6A4, the gene containing 5-HTTLPR as a subregion. Another is HTR2A, which Border et al debunked in the same study as 5-HTTLPR. The studies above do not directly prove that these genes don’t affect antidepressant response. But since the only reason we thought that they might was because of evidence they affected depression, and now it seems they don’t affect depression, it’s less likely that they affect antidepressant response too.

The other five are liver enzymes. I am not an expert on the liver and I can’t say for sure that you can’t use a few genes to test liver enzymes’ metabolism of antidepressants. But people who are experts in the liver tell me you can’t. And given that GeneSight has already used two genes that we know don’t work, why should we trust that they did any better a job with the liver than they did with the brain?

Remember, GeneSight and their competitors refuse to release the proprietary algorithms they use to make predictions. They refuse to let any independent researchers study whether their technique works. They dismiss all the independent scientists saying that their claims are impossible by arguing that they’re light-years ahead of mainstream science and can do things that nobody else can. If you believed them before, you should be more cautious now. They are not light-years ahead of mainstream science. They took some genes that mainstream science had made a fuss over and claimed they could use them to predict depression. Now we think they were wrong about those. What are the chances they’re right about the others?

Yes, GeneSight has ten or twenty studies proving that their methods work. Those were all done by scientists working for GeneSight. Remember, if you have bad science you can prove whatever you want. What does GeneSight want? Is it possible they want their product to work and make them $410 million? This sounds like the kind of thing that companies sometimes want, I dunno.

I’m really worried I don’t see anyone updating on this. From where I’m sitting, the Border et al study passed unremarked upon. Maybe I’m not plugged in to the right discussion networks, I don’t know.

But I think we should take a second to remember that yes, this is really bad. That this is a rare case where methodological improvements allowed a conclusive test of a popular hypothesis, and it failed badly. How many other cases like this are there, where there’s no geneticist with a 600,000 person sample size to check if it’s true or not? How many of our scientific edifices are built on air? How many useless products are out there under the guise of good science? We still don’t know.


155 Responses to 5-HTTLPR: A Pointed Review

  1. Puuha Pete says:

In the paper, a nurse said that mothers poop during labour to transfer their bacteria to the newborn. Sample size: that. Seems kinda stupid when they also recommend wiping from front to back.

  2. rthorat says:

    In regards to liver enzyme testing…I read the previous post you linked. I see that you hint at the reality: that SSRIs simply have no treatment efficacy. That is why response has no relation to blood levels of the drugs.

    The real value in liver enzyme testing is identifying which patients you are most likely to kill with your ineffective SSRIs. Data indicates that patients who kill themselves or attempt to do so are highly likely to have a genetic profile that results in lower levels of enzymes that are the primary metabolizer of whatever drug they were on. Alternatively, for drugs where a metabolite is the active drug, having elevated enzyme levels can cause serious reactions. Bad, life threatening reactions to SSRIs could be predicted with liver enzyme testing.

    See here: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3513220/
    And here: https://www.sciencedirect.com/science/article/pii/S1752928X16300051

  3. Andrew Klaassen says:

    Has anyone done a large-scale study like this on the proposed link between androgen receptor copy number variation and aggression?

  4. zstansfi says:

    I’m confused as to how this is news. It’s 2019, not 2010. The fallacies of the candidate gene research studies, and the widespread failures of these studies to replicate in GWAS have been widely evident and commented upon for close to a decade.

    The only people who didn’t know these genes were junk research are the people whose careers were too invested in the results of their own research to look critically at the field.

    Psychiatrists ordering genomic tests? Give me a break.

  5. arch1 says:

    Scott, in your wonderful unicorn metaphor, you could also make the point that what we have here is not just one explorer, but a wave of (presumably) hundreds.

  6. metacelsus says:

    It seems Derek Lowe has seen this and written his own comments: https://blogs.sciencemag.org/pipeline/archives/2019/05/10/there-is-no-depression-gene

    It’s exciting to see my two favorite blogs referencing each other.

  7. As an author on the Border et al. paper discussed, I wanted to make a few comments in response to the article and the responses below it. Before that, however, a shout out to my coauthors, especially my brilliant graduate student, Richard Border, who did most of the legwork on this project. I don’t speak for all my coauthors here; these opinions are my own.

    First, this post by Scott Alexander is brilliant (although perhaps more strident and definitive than I’d be comfortable with). People outside the field rarely understand the nuances of interpretation. How can an outsider have such a good grasp of the field? Many of my colleagues don’t. This paper has already been misinterpreted several times (e.g., see our rejoinder to a different blog here). That Scott got this basically right is impressive, and that he was able to place it in context better than we did ourselves in the paper…. Bravo!

Second, as one of the responders noted, most responsible scientists in the field (which is most of them!) knew this result would happen before we showed it and were probably underwhelmed by the finding (indeed, we and others already showed the same issue occurs with schizophrenia candidate genes here and here, and predicted this would be the case here). The interesting part about the paper to me is less about the scientific finding and more about the sociology of science, which is also what Scott was focussed on. How in the hell could the field have studied a phenomenon for 25 years that wasn’t there? It’s deeply troubling. There are 3 basic answers to that question: 1) priors for any given polymorphism that you pick based on an inchoate understanding of the biology are ~0, ensuring that the false discovery rate is ~100%; 2) publication bias, such that authors are more willing to submit, and editors to accept, papers that show positive results (especially when the n is small; no one cares about an under-powered null result); and most troublingly 3) p-hacking. In regard to the latter, let’s not get too sanctimonious. It’s a matter of degree; all authors condense, simplify, and package their findings in order to make a narrative. We wouldn’t want it otherwise (do you really want to read through every single thing authors thought about and did?), but it does become a slippery slope. Thus, a lot of p-hacking has probably been done in a way that not even the authors are aware of the degree to which they’ve done it. That’s why preregistration is so important. And even if you provide a condensed narrative in your paper, it’s important to do your damn best to be brutally honest in giving that condensed narrative and to provide all the ugly details and things that didn’t work out as envisioned, at least in the supplement.

    Third, it’s really important that people don’t throw out the baby with the bathwater. In the responses to this blog, several have said things like “all psychiatric genetics is garbage.” That is far from the case. I’ve been now in several different fields in my academic lifetime; psychiatric genetics is the one – BY FAR – that is the most transparent, most open to sharing results/data, most open to collaboration, and most dedicated to getting it right. This is in part a response to the sordid history of candidate gene studies, which the field has largely abandoned. Most candidate gene research that happens now is either in the dark corners of psychiatric genetics, or is being done in other non-genetics fields like neuroscience, psychiatry, psychology, sociology, and so forth.

Fourth, psychiatric traits, including depression, are indeed heritable – that is to say, differences between people in the DNA they have contribute to differences between people in their risk of developing a disorder. Candidate gene studies not replicating says nothing about that particular issue. 50 years of consistent findings would have to have been wrong for that not to be right. But our original guesses about the bases of that heritability (candidate genes) have turned out to be wrong, and the ones that have been definitively discovered (via GWAS) have individually tiny effect sizes (explaining less than ~ 1/1000 of the genetic variation per variant) – so there are lots of them, and progress will be made as sample sizes continue to get larger.

    Last, we’re in a golden age in the field of psychiatric genetics; real things are being learned (often not what we’d expected), there is a ton of exciting research being done, and the growth feels exponential. Buckle up! I’ve tried to come up with some basic heuristics to people not in the field that could guide them to where those are. This is far from perfect, but it should work as a first approximation. Studies published in good genetics journals are usually solid: Nature Genetics; AJHG; PLoS Genetics; Behavior Genetics; Human Genetics; European Journal of Human Genetics; etc. Psychiatric/behavioral genetics papers published in non-genetics journals can often be treated with more skepticism, if only because the reviewers aren’t as well versed in the pitfalls (and there are many). Second, and obviously, any study that focusses on the “usual suspect” candidate genes should be treated with great skepticism whereas those that use genome-wide data (and especially genome-wide association studies) are typically more rigorous. Third, there are teams whose work you know you can trust – Visscher, Wray, Yang, Neale, Price, Medland, Evans, Bates, Kendler, Sullivan, etc. etc. etc. This is not to say that their stuff is always right; just that they’re dedicated to getting it right. Then there’s the converse of people whose work you roll your eyes at – (names withheld). This is a really hard thing for people not in the field to know. But the point is that the former group is bigger than the latter, but it’s the latter’s work that often (frustratingly) shows up in the media, and that dominates blogs, wiki pages, etc. Perhaps this is because this type of work is more “sexy” and makes for more “fascinating” news. I don’t know what to do about that.

    Anyway, I hope that this provides some additional nuance and context to the Border et al. paper and the excellent post by Scott.


    • Murphy says:

      >behavioral genetics papers published in non-genetics journals



      I kid, I kid.

Also, kudos: Getting people to pre-reg, getting people to follow their pre-reg, getting people to note variation from their pre-reg and even getting journals to admit they’ve not checked… all seem to be big challenges at the moment and it’s good to see big studies following good practice.

    • marc200 says:

      “But the point is that the former group is bigger than the latter, but it’s the latter’s work that often (frustratingly) shows up in the media, and that dominates blogs, wiki pages, etc. Perhaps this is because this type of work is more “sexy” and makes for more “fascinating” news.”

      It seems like the reason for this is obvious — if you are willing to prioritize flashy results and overlook potential sources of bias, you are going to find it a lot easier to come up with flashy results that people outside the field find appealing to publicize.

    • Scchm says:

I beg to disagree with Dr. Keller. Both positive (usually obtained in the single gene – psychiatric disease paradigm) and negative (usually obtained in the genome-wide association study – psychiatric disease paradigm) associations under current methodologies remain highly suspect.

      Why do I say that? A typical GWAS uses a chip with about 800,000 single nucleotide polymorphisms (SNPs). Meanwhile, the gnomAD project https://macarthurlab.org/2018/10/17/gnomad-v2-1/ has more than 250 million single nucleotide variations, insertions and deletions catalogued for the human genome. Furthermore, the so called structural variations comprise about as much diversity as the single nucleotide variations.

      Thus, a typical GWAS purports to make valid conclusions based on less than 0.25% of human genome variants. This does not make much sense. How can you talk about any validity of the results? It is just studying the tip of the unicorn tail. The same applies to polygenic scores which are based on the same 800,000 SNPs.

Give me a full genome scan at 130x with long reads to detect structural variations and a sample of 300 family trios with well-defined inherited phenotypes and I will believe you are doing real science.

You fundamentally misunderstand GWAS. Due to the autocorrelated nature of the genome (called linkage disequilibrium), with just 800k SNPs on an array (or even 250k), one can predict virtually all of the other common variants. Indeed, the usual approach today is to impute all of the common (minor allele frequency > .005) variants (around 12M), and we can do this with very good fidelity. So GWAS does a very good job of understanding the role of common genetic variants. The limitation has to do with power – most of the common variants have such small effects that they cannot be detected reliably, even in very large samples. You say that there are 250M variants – that is actually a lowball. There are almost certainly even more, but these are almost all rare (minor allele frequency less than 0.05%). But with new methods (e.g., GREML) we also know that about a third to half of the genetic variation of complex traits is due to common variants. All this to say that, when we do a modern GWAS (using common array SNPs) and do not find any variant that explains more than 0.05% of the variation, that means with near certainty that NO common variant explains more than 0.05% of the variation. That tells us something important.
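The linkage-disequilibrium point above can be illustrated with a toy simulation (the allele frequency and LD strength here are invented purely to show the mechanism): if two SNPs sit on the same haplotype block, genotyping only the first still predicts the second with high accuracy, which is what imputation exploits genome-wide.

```python
import random

random.seed(1)
N = 50_000  # number of haplotypes to simulate

# Toy haplotype block: a typed "tag" SNP and an untyped neighbour in strong
# LD -- by construction, 95% of haplotypes carry the same allele at both sites.
def draw_haplotype():
    tag = 1 if random.random() < 0.3 else 0          # tag-SNP allele freq 30%
    neighbour = tag if random.random() < 0.95 else 1 - tag
    return tag, neighbour

pairs = [draw_haplotype() for _ in range(N)]

# "Impute" the untyped SNP by copying the tag allele, then measure concordance.
concordance = sum(tag == nb for tag, nb in pairs) / N
print(round(concordance, 3))   # ~0.95: the tag SNP recovers its neighbour
```

Real imputation uses reference panels of full haplotypes rather than a single tag SNP, but the principle is the same: correlation along the genome means an array of 800k well-chosen SNPs carries information about millions of unmeasured common variants.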

        • Scchm says:

          Dr Keller,

          You say “one can predict virtually all of the other common variants”, “GWAS does a very good job of understanding the role of common genetic variants” and “about a third to half of the genetic variation of complex traits is due to common variants”. I rest my case.

          My point is exactly that common variants that go into GWAS’s reflect only a minority of the human genome variation. Do you concur?

Until very recently the field simply has not been technically equipped to do high-resolution genomes on many samples. However, with the cost of a full genome approaching $300, there is no justification for ignoring rare variants and continuing with the wankery of research based on 700,000-SNP arrays.

Also, consider a philosophical question. It is an interesting scientific issue as to what causes a psychiatric illness in general, for example, schizophrenia. And polygenic scores based on 100s of components may have some explanatory power. However, when one sees a patient with schizophrenia what is important is not the general cause of schizophrenia, but what has caused schizophrenia in this particular patient. Polygenic scores do not help in the treatment. A recent paper (similarly to what you say) has concluded that common SNPs explain about 1/3 of schizophrenia risk. Clearly, schizophrenia, similarly to cancer, actually consists of hundreds of different disorders. So, imagine if we could ascribe even 2–3% of the schizophrenia cases to concrete rare variants and treat them accordingly with the therapy specific to the cause. That would be a huge boon to science, doctors, patients and everyone. Even at the level of an orphan disease that would still be huge. Think of the example of Gleevec in cancer treatment. But polygenic scores and GWAS’s based on 800,000-SNP arrays do not advance us toward that goal.

          • If I understand this correctly, it’s making an important point that had not occurred to me. If one says “gene X is responsible for only 1% of problem Y,” that could mean two very different things:

            1. If someone has problem Y and gene X and one could somehow change gene X, he would still have problem Y, but .99 as much.

            2. If someone has problem Y, there is a .01 probability that the reason is gene X, and if so changing gene X would eliminate the problem.

            The difference between those two doesn’t seem very important if we are asking about effects of genes on the population average of something like IQ or height. But it is very large if we are interested in using the knowledge to treat patients with schizophrenia or high blood pressure or … .

            Do I correctly understand your point? When people describe the cause of something as polygenic, do they mean one of these two senses or are they including both?
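
            A toy simulation can make the two senses concrete (the severity scores and the 1% figure here are illustrative assumptions, not data from any study):

```python
import statistics

N = 100_000
base = 100.0  # arbitrary severity score for everyone with problem Y

# Sense 1: fixing gene X makes every patient 1% less severe.
sense1_after = [base * 0.99 for _ in range(N)]

# Sense 2: gene X is the sole cause for 1% of patients; fixing it cures
# exactly those patients and leaves everyone else unchanged.
sense2_after = [0.0 if i < N // 100 else base for i in range(N)]

# The population averages are identical...
print(statistics.mean(sense1_after))  # 99.0
print(statistics.mean(sense2_after))  # 99.0

# ...but the number of patients actually cured is completely different.
print(sum(1 for s in sense1_after if s == 0.0))  # 0
print(sum(1 for s in sense2_after if s == 0.0))  # 1000
```

            Population-level statistics cannot distinguish the two cases; only the per-patient story differs, which is exactly why the distinction matters for treatment.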

          • Scchm says:

            In reply to David Friedman. When people talk about polygenic causes they mean that many genes contribute to causing the disorder. When one, two, or three rare mutations together are sufficient to cause the disease, it is not polygenic. However, it may appear to be polygenic if there are many different rare mutations in many different genes.

            If you do not look at the actual rare mutations causing the disorder but instead look at the common mutations as their proxies, due to the linkage disequilibrium mentioned by Dr. Keller, it would appear that many common mutations contribute to the disorder. Moreover, since linkage disequilibrium is different in different populations, for each population you will get a different result. Since it is difficult to get exact same population, you will get a different result in each study. Which pretty much describes the picture we see in psychiatric genetics.

          • Murphy says:

            However, with the cost of full genome approaching $300, there is no justification for ignoring rare variants and continuing with the wankery of the research based on 700,000 snps arrays.

            There’s still an order of magnitude price difference between chips and more complete sequencing. Also the $300 figure doesn’t get the kind of depth and quality we typically look for.

            Imputation is reasonably reliable, in the sense that we can put good error bars on it.

            They *are* doing large scale whole genome sequencing, with a focus on rare conditions. (The UK 100K genome project)

            But it’s got its own challenges.

            We do look at rare variants, but they take a different type of analysis. In my current job GWAS is not a big thing, but we spend a lot of time looking at oddball families with horrifying conditions that may be genetic, where single variants can often fully explain the condition.

            So imagine if we could ascribe even 2-3% of the schizophrenia cases to concrete rare variants and treat them accordingly with therapy specific to the cause.

            We’re pretty much doing that.

            Though a lot of the time it doesn’t help much with treatment, it may still help with understanding the disorder.

            due to the linkage disequilibrium mentioned by Dr. Keller, it would appear that many common mutations contribute to the disorder.

            Different inheritance patterns. When it’s a small number of SNPs in a family you expect to see one pattern. When it’s risk from a less tractable collection of risk alleles, you typically don’t see the same inheritance pattern.

    • Camerado says:

      Thanks for this response! As somebody pretty ignorant about the field, it’s useful to have a starting point for reliable work.

  8. Murphy says:

    Not Scott, but small trials and small studies show a larger degree of jitter.

    The chances of a small sample giving a result that differs strongly from the real population increase the smaller the sample gets.

    Combined with this, there’s the file-drawer effect. Small studies with boring-but-accurate null results are more likely to get left in the file drawer and never published, leaving the ones with big, showy, but wrong results.

    When you have a large number of publications, you can chart the distribution of results against study size, and the distribution should tend to converge on the real value.
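
    The jitter is easy to see in simulation (a sketch with made-up numbers: a true effect of zero, unit-variance noise, and arbitrary study sizes):

```python
import random
import statistics

random.seed(42)
TRUE_MEAN = 0.0  # the real effect is zero

def study_estimate(n):
    """Run one simulated 'study': sample n observations, report the mean."""
    return statistics.mean(random.gauss(TRUE_MEAN, 1.0) for _ in range(n))

small_studies = [study_estimate(30) for _ in range(500)]
large_studies = [study_estimate(3000) for _ in range(500)]

# Small studies scatter roughly 10x wider around the true value,
# so the most extreme (and most publishable) estimates come from them.
print(statistics.stdev(small_studies))  # ~0.18
print(statistics.stdev(large_studies))  # ~0.018
```

    Plot estimates against study size and you get the classic funnel: wide at the small-sample end, narrowing toward the true value as samples grow.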


  9. Murphy says:

    One of the co-authors responded:


    I have never in my career read a synopsis of a paper I’ve (co-)written that is better than the original paper. Until now. I have no clue who this person is or what this blog is about, but this simply nails every aspect of the issue.

  10. adamshrugged says:

    The Tibetan adolescent study actually finds no effect of 5-HTTLPR, so at least that part of it is consistent with the Border et al study :). (Maybe you’re worried that they even tested for it, or about their other single-gene results, but…)

  11. arch1 says:


  12. Eponymous says:

    So…what’s the likelihood that results are correct in fields that generally *don’t* find much statistically significant support for their theories? Let alone fields that don’t bother testing in the first place, or don’t even articulate testable predictions?

    What solid ground is left? Do we lose all the social sciences?

    • Greg D says:

      There was a proposal a few years back for the Federal gov’t to fund a massive social research DB. The kicker was that only 1/2 the data would be released to scientists wanting to do research on the data set.

      You would do your research on the released 1/2, write up your results, then submit your writeup and research protocols to the keepers of the data. They would run your protocols on the other 1/2, and release your paper with those results.

      Something like that could save social “science”. Short of that? It’s all garbage.

  13. Scchm says:

    I cannot agree with you more. Psychiatric genetics appears to be entirely built on p-hacking and publication bias, surprisingly, to an even larger extent than the previous notorious example of “experimental” psychology. Unfortunately, you fail to mention that the debunking paper is based on results from UK Biobank http://www.ukbiobank.ac.uk/, a massive population-genetics study funded by British government bodies and charities. NHS and government funding, not mysterious “methodology improvements”, is what allowed the study to test 500,000 people.

    There are two interfaces to the results of UK Biobank, with the basic statistical analysis of SNPs vs medical diagnoses/trait already pre-conducted https://biobankengine.stanford.edu/ and https://genetics.opentargets.org/. So, now everyone can debunk their favorite medical genetics theory. My favorite example so far is COMT polymorphism rs4680. About 700 “scientific” peer reviewed publications have been spouting crap like this summary in snpedia
    “ rs4680(A) = Worrier. Met, more exploratory, lower COMT enzymatic activity, therefore higher dopamine levels; lower pain threshold, enhanced vulnerability to stress, yet also more efficient at processing information under most conditions
    rs4680(G) = Warrior. Val, less exploratory, higher COMT enzymatic activity, therefore lower dopamine levels; higher pain threshold, better stress resiliency, albeit with a modest reduction in executive cognition performance under most conditions”

    But what the UK Biobank tells us https://biobankengine.stanford.edu/variant/22-19951271 is that the supposed rs4680(A) “worrier” has about a 2% lower chance of being anxious (0.98 for nervous feelings, self-rated, moderately statistically significant, p=0.004), and there is, of course, no difference in “fluid intelligence” (a short intelligence test). Interestingly, the somatic effects (higher trunk mass and derivative body mass indices) of rs4680 are real (p=1E-09), although negligible (beta=0.01).

  14. Aqua says:

    Reported for sneaky link

  15. marc200 says:

    This is a classic post in the line of many other great posts Scott has made about the ways in which modern scientific institutions can generate/validate/certify entire lines of research that turn out to be completely baseless. He should collect those posts and do a general reflection on unifying themes.

    But you know what this post ALSO is? A great reason for me to again plead with Scott to review “Medical Nihilism” by Jacob Stegenga! Linked here:


    The ease with which “scientific” standards of proof can be manipulated is a major theme of the book, and he connects it to medical practice.

    P.S. as I have said before, I have no connection whatsoever to the author or the book, I just love the book and love Scott’s book reviews.

  16. thevoiceofthevoid says:

    WTF is that link?

  17. raymondneutra@gmail.com says:

    Scott, I still don’t understand the reasoning for asserting that small sample sizes alone would explain the many false positive results. Please explain the rationale.

    • thevoiceofthevoid says:

      Asking four times isn’t going to make anyone respond any faster.

  18. stevepittelli says:

    These studies were also held up as “proof” of the genetics of depression and, by extension, the presumed genetics of other mental disorders. They’ve been going at this for 3 decades and they have almost nothing. They can only hold up whatever the latest study is, which never seems to quite replicate. The whole field should be taken as suspect. The polygenic/omnigenic models are not based on any discoveries. They are an assumption made, due to the fact that they couldn’t find one or a few genes. “There must be a lot of them, then.” The whole thing is a shell game. Depression is not caused by genes. Not one, not 10, not 1000, not 1 million. The emperor has no clothes!

    • Scott Alexander says:

      Nah, they’re based on twin studies, which are pretty solid.

      Polygenic prediction scores for stuff like educational attainment are getting pretty good now. I don’t think they’re as good for depression, but I think that’s because less work has been done and it’s harder to measure phenotype.

  19. brianwh says:

    Damn! This is why I read SSC. Log in for the unicorns, stay for the sprays of gore!

  20. Camerado says:

    I have just enough understanding of statistics (read: what I got from not-particularly-rigorous college economics courses) to follow takedowns like this, and it always makes my knowledge of my own ignorance that much more stressful. I work in a field with significant political involvement, and “GENETICS said it’s true so it must be true” is such a baseline assumption and such a cheap, quickly-trotted-out persuasive technique by all sides. No one’s interested in or educated in the kind of analysis verifying a “GENES made it happen” claim requires.

    If this brushes up too much against sensitive topics please feel free to remove my comment, but in recent years my field has become obsessed with the idea of epigenetics, particularly the truism that “any trauma suffered by previous generations must proceed in a direct line into the psyches of all people born thereafter.” I get seriously stressed about how this is now regarded as an uncontroversial and unarguable statement, and the many different individual and social problems it’s assumed to cause – I know just enough to think that sounds out of line with a modern understanding of genetics, or is at best a massive oversimplification of a much more complex finding, but I don’t know nearly enough to dig through the research myself and figure out to what extent it checks out. It might be proven and replicated, I don’t know, but in my work it’s not considered remotely important to verify a claim about genetics recited as truth.

    I’m not looking for a discussion of that specific topic – I’m more just boggled by the psychiatric community’s ability to build castles out of nothing, and the not-psychiatric community’s ability to build even bigger castles out of a blurry, half-understood view of that same nothing, seen through the wrong end of a telescope. People make policy decisions based on descriptions of those blurry castles given to them by people getting paid to make the castles appear impenetrable and attractive. When I start to think of the layers of ignorance required I go a little crazy. If a community of scientists is able to fund hundreds of studies based on nothing, what hope is there one layer down, where “it’s GENES” is all the study you need?

    • jermo sapiens says:

      Everybody craves information which reinforces their prior beliefs. Epigenetics, at least as formulated above (“any trauma suffered by previous generations must proceed in a direct line into the psyches of all people born thereafter”), is extremely useful to those with a particular political agenda, so it’s not surprising at all.

      • From the Wiki article on epigenetics:

        In a groundbreaking 2003 report, Caspi and colleagues demonstrated that in a robust cohort of over one-thousand subjects assessed multiple times from preschool to adulthood, subjects who carried one or two copies of the short allele of the serotonin transporter promoter polymorphism exhibited higher rates of adult depression and suicidality when exposed to childhood maltreatment when compared to long allele homozygotes with equal ELS exposure.

        Is this one of the HTTLPR possibly bogus studies?

        • Camerado says:

          The title of the study is “Influence of life stress on depression: moderation by a polymorphism in the 5-HTT gene.”

          So maybe! I’d be interested to find out. (No idea how much more specific than 5-HTT the “5-HTTLPR” version is.)

    • Blueberry pie says:

      I am not an expert on epigenetics, so use salt.

      My understanding is that part of the problem with epigenetics is that the word is used to describe two related things:

      a) Reversible modifications to DNA bases and histones (not changing the actual DNA sequence) are an important factor that influences gene expression in a cell. Despite being reversible in principle, in practice those modifications can often be kept for the whole lifetime of a cell.
      b) Traits can be passed between generations by other means than DNA sequence and prenatal environment. Inheriting the modifications from a) is a proposed mechanism.

      Case a) is AFAIK largely uncontroversial. It is likely one of the more important mechanisms that lets different types of cells in your body stay different despite having the same DNA.

      Case b) is more difficult. AFAIK it seems likely that sometimes those things happen, but there is disagreement whether this is just one of the myriads of minor footnotes biology has (because life is such a mess) or whether it is a non-negligible element of the overall evolutionary process.

      Plausibly (and I am just speculating here), the high credibility assigned to a) can spread over to the controversial b), because the name stays the same.

      Hope that cleared some confusion.

      EDIT: There seem to be a bunch of other mechanisms included under the (wide) epigenetics umbrella, though DNA base modifications and histone modifications are IMHO most prominently considered.

      • Camerado says:

        Thanks for the reply and clarification. Sounds like the instinct of “at best this is an oversimplification” is the right one.

        That kind of oversimplification re: epigenetics is quoted without citation in social sciences papers. Some others have commented much more eloquently about how the lack of a common base of knowledge between scientific fields can contribute to the phenomenon of one field accepting as fact what another field has moved on from (the analogy that immediately comes to mind is film theory’s dependence on Freudian psychoanalytic frameworks decades after they started losing credibility in actual psychological practice). My non-scientific field goes one step further and invents conclusions to unsourced studies based on a conflated understanding of secondhand source material.

        Obviously this bugs the heck out of me, as someone without the knowledge base (or interested audience, frankly) to correct it effectively except mumbling “it’s more complicated than that.” I think, in situations like this (and more closely related to the actual topic of Scott’s post), there’s the danger Scott discusses of the debunking getting glossed over, because it’s much more financially convenient to get the grant money for the castle in the air than to get it for dismantling the castle until you’re out of a field; but also, there’s sometimes a reverse-effect where, when the debunking takes, it takes with a vengeance. Any useful point that MIGHT have come from a bloated-but-fundamentally-sound research body becomes conflated with the oversimplification.

        All the more when it’s a politically useful misunderstanding of genetics. That’s a blow to credibility I’m not looking forward to.

  21. jermo sapiens says:

    Hi Scott:

    Does this story count as a major scientific failure within the meaning of this post?

  22. batterseapower says:

    Short $MYGN?

  23. Freddie deBoer says:

    But how are we to know that a disease is polygenic or single-gene like cystic fibrosis, sickle cell disease, Fragile X syndrome, muscular dystrophy, or Huntington disease until we do the research? Isn’t blanket distrust of single-gene causes unhelpful too?

    • Garrett says:

      One way is to look at how the disease presents itself. In most of these cases, the determination of whether you have the disease or not is pretty clear-cut. You either have the condition or you don’t, with a possible few gradations along the way. Depression doesn’t neatly fall into that category. A psychiatrist might run a test battery which gets you a score from 0-100 with an arbitrary cut-off for “depressed”. With e.g. sickle-cell disease, you can look under a microscope and objectively make the determination with high repeatability.

      • eyeballfrog says:

        Single-gene diseases should also have a rather obvious Mendelian inheritance pattern.

        • mcpalenik says:

          The difficulty is when you start claiming that the gene “interacts” with other factors that must also be present to cause the disease, and then you can excuse the lack of a Mendelian inheritance pattern.

        • Greg D says:

          The problem is “incomplete penetrance”, which is real, and a big PITA.

  24. Dedicating Ruckus says:

    So it seems from this and other examples that it is, in fact, possible to get an entire subfield in academia devoted to a subject that literally doesn’t exist.

    Now, consider an alternate universe in which, for whatever reason, some powerful political interest finds it convenient that everyone believe very strongly that HTTLPR really does have that relationship with depression. Maybe that belief supports one of their long-held policy positions, or it’s flattering to a major constituency of theirs, or something.

    It’s easy to imagine this interest subsidizing further study into the relationship by awarding grants, and arranging for prestigious appointments for investigators who found strong relationships or helped build up the fake theory edifice. If you carried that process forward for long enough, you could imagine whole respected units of dozens of researchers forming around it. Obviously, the research output of people whose careers are now tied to this program would be heavily slanted in favor of the relationship’s existence. Throwing enough money at it, you could probably get well above 90% in support or at least acceptance.

    Most scientists aren’t corrupt, but a nonzero number are. If you find sufficiently corrupt ones and put them in well-paid leadership positions, you can have them work on suppressing any work that might undermine the core idea. This wouldn’t mostly be scientific, but just political and institutional; you could get partisans into the peer review process and have them tar any dissenting output, use your connections to suppress the careers of anyone who consistently speaks up against it, and so on. These corrupt leaders will set the tone for the whole movement in terms of getting those involved to consider any questioning of the core premise to be illegitimate, not merely mistaken. It wouldn’t be hard to do this anyway; most people will be easily predisposed to hostility to those claiming that their whole career is based on a falsehood and should be abolished.

    Now, scientists aren’t typically domain experts on every other area of science, to be able to judge the output competently on their own and from first principles. Outside their specific areas of investigation, they’re mostly limited to going along with the consensus of those in the field, as laymen are. So if you present a strong consensus backed by prestigious organizations, most scientists will accept it and import it into their worldview without doing a deep dive into whether it’s actually true. You can reinforce this effect by accusing those dissenters that do come up of being in thrall to some other political interest — no shortage of candidates here.

    All this can be paired with a similar propaganda push outside of academia, aimed at the public, politicians &c. Here you can be a lot less subtle (and indeed must be). The overall goal is to claim “science” as a brand for your particular political faction, and pair that brand inextricably with believing that HTTLPR causes depression. Presumably a major political interest will have major political enemies, and so you might come in for opposition here; however, this can actually help you, as tying this issue to a significant preexisting political split will tend to make people on your side of that split believe you automatically. If successful, the result is to create a widespread perception that anyone who disbelieves in the one particular issue must also reject all of science as a whole.

    Now, we don’t live in this particular parallel universe. (Moloch, in the background: “YOU KNOW WHAT NO ONE HATES EACH OTHER OVER YET? THE GENETICS OF DEPRESSION.”) But it’s one that’s worth considering. Episodes like this demonstrate that science is fallible, even on fairly large scales; one must remember that it’s also corruptible, and that the expected failures conditioned by a major external actor pushing in one direction might be much bigger than those that would emerge in isolation.

    • conradical says:

      I think this sort of broad concern is exactly why there’s so much….call it “passion”…in Scott’s post.

      I know it’s why I get so worked up over flagrantly wrong “science.”

      I’m reminded of the recent study showing predictive betting markets among non-scientists were accurate in being able to know which studies would replicate and which would not.

      Arguing with “Flat Earthers” and “Anti-Vaxxers” is hard enough (pick literally any other contentious topic with real-world consequences) without them having the reasonable and correct belief that many scientists are only operating under personal-self-interest and are capable of publishing whatever result they went looking for in the first place.

      What rubs me the wrong way is these “scientists” have no skin in the game. How many life-hour-dollars (hours X dollar-value-of-those-hours) have been utterly wasted by this sort of fruitless pursuit?

      If the original researchers — or anyone else who published a positive result — had been playing World of Warcraft instead, it might have been a net positive for humanity.

      In other fields, when individual actors take a public position which has significant costs borne by the public (NIH funding allocation, for example), we try to make sure those actors have skin in the game.

      Instead, as Scott says, the “debunking” studies go unremarked upon, the people who published false positives have no reputational damage, and certainly no real-world consequences that I can see. Publishing a false positive appears, still, to be rather good for business.

      • Sebastian_H says:

        “Arguing with “Flat Earthers” and “Anti-Vaxxers” is hard enough (pick literally any other contentious topic with real-world consequences) without them having the reasonable and correct belief that many scientists are only operating under personal-self-interest and are capable of publishing whatever result they went looking for in the first place.”

        This is basically what I think has happened with Brexit and the Democratic Party with respect to the Rust Belt. The leftish parties in both countries kept saying again and again that globalism was helping “the country” and using GDP numbers (without distribution analysis) to back the argument. These political parties had/have a strong technocratic bent to them, and were not particularly open to hearing lower class people tell them that their equations weren’t playing out so well in the real world. Decade after decade, a large portion of the working class told the technocrats that it wasn’t actually working and that things should slow down or be dealt with, and for decades they were ignored. At some point, a critical mass of these people (not necessarily a majority, but enough to tip political balances) decided that these “experts” must have just been lying to them for decades. So now you see things like “experts predict BREXIT disaster” and people figure there is no reason to believe it anymore.

        You’ll note this idea doesn’t require that the experts actually were lying to them. Just that their concerns were neglected by the experts whose theories turned out to be overlooking some key facts. That is without even getting to the vague suspicion that NYC and SF might be actively working to keep poor people out and keep the wealth in the big coastal cities.

    • One addition to your picture. Suppose you are a competent researcher in such a field and suspect that the current orthodoxy is wrong. Unless you have a very strong commitment to preventing false views you probably conclude that you ought to work in a different area, one where you won’t be fighting against an entrenched orthodoxy.

    • Greg D says:

      I saw what you did there. Well done. 🙂

    • marc200 says:

      “We don’t live in this particular parallel universe” — yes we do live in that universe. You just described exactly the relationship between Big Pharma and medical science. Down to the “propaganda push outside of academia” fueled by sales forces, public advertising, and lobbying.

      One reason I keep pushing the Medical Nihilism book around here is it describes the relationship to medical science that one would wish to take for self-protection in a world like the one you describe, namely a strong prior that new recommendations emerging from the medical-industrial complex are false. Note that this is different from just being a crank, as there is evidence capable of overturning a strong prior.

  25. raymondneutra@gmail.com says:

    Can someone explain how small sample size alone could explain false positive results? If all those allegedly biased researchers had access to 600,000 study subjects, would they have been prevented from dredging? I grew up thinking that small sample sizes increased the risk of false negatives, all other things being equal.
    I retired 10 years ago, so I am open to being re-educated.
    Best wishes

    • Mazirian says:

      A large sample alone would certainly not prevent false positives. That’s why in today’s genetic association research, the significance threshold used is more stringent the more statistical tests you do. For example, if you test for associations between a phenotype and all common genetic variants, as in a standard genome-wide association study, the significance threshold is usually p < 5×10^-8.
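
      The arithmetic behind that threshold is simple (one million is a rough, assumed count of independent tests; the exact number varies by study):

```python
tests = 1_000_000   # order-of-magnitude count of variants tested (assumed)
naive_alpha = 0.05
gwas_alpha = 5e-8

# Expected number of false positives if the null were true for every variant:
print(tests * naive_alpha)  # ~50,000 spurious "hits" at p < .05
print(tests * gwas_alpha)   # ~0.05, i.e. fewer than one expected false hit
```

      At p < 0.05 a million tests hand you tens of thousands of false "discoveries" for free; the genome-wide threshold keeps the expected count below one.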

      • Mazirian says:

        Besides the stricter significance threshold, the now standard requirement that findings be replicated in independent holdout samples has largely eliminated the false positive problem in genetic association studies.

    • brmic says:

      They explain that given what we now know about polygenicity, the highest-effect-size genes require samples of about 34,000 people to detect, and so any study with fewer than 34,000 people that says anything about specific genes is almost definitely a false positive; they go on to show that the median sample size for previous studies in this area was 345.

      Basically, for a genetic effect to be detectable in a sample of 345 people, it has to be very strong. It would then easily show up in 600k subjects. The alternative, and much more likely, assumption is that the original result was either a lucky overestimate or, more plausibly, p-hacked and a result of publication bias.
      More generally, no, small samples do not only increase the risk of false negatives (i.e. the problem of insufficient power) but also the risk of false positives. In fact, for some effects you can be certain the effect is a false positive given the sample size and what we know about the world. Andrew Gelman explains this here: https://statmodeling.stat.columbia.edu/2017/12/15/piranha-problem-social-psychology-behavioral-economics-button-pushing-model-science-eats/
      As for the effect of p-hacking, see e.g. https://papers.ssrn.com/sol3/papers.cfm?abstract_id=1850704 (alt link https://journals.sagepub.com/doi/full/10.1177/0956797611417632#_i1)
      tl;dr: Combining four common methods of p-hacking (or forking paths), one can easily get to a 60% false-positive rate.
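
      The 34,000 figure can be roughly reproduced with a textbook power calculation. This is a sketch, not the paper's actual method: it assumes the largest plausible effect behaves like a correlation of about r = 0.034 (a hypothetical value chosen to match the quoted sample size) and uses the normal approximation via Fisher's z:

```python
import math
from statistics import NormalDist

def required_n(r, alpha=5e-8, power=0.8):
    """Approximate sample size needed to detect a true correlation r
    at two-sided significance alpha with the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    fisher_z = math.atanh(r)  # Fisher's z of r has SE ~ 1/sqrt(n - 3)
    return math.ceil(((z_alpha + z_power) / fisher_z) ** 2) + 3

print(required_n(0.034))              # roughly 34,000 at genome-wide alpha
print(required_n(0.034, alpha=0.05))  # still thousands even at p < .05
```

      So a median sample of 345 had essentially no chance of detecting a true effect of that size at any defensible threshold; anything such a study did “detect” was almost certainly noise.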

  26. vV_Vv says:

    Remember, GeneSight and their competitors refuse to release the proprietary algorithms they use to make predictions. They refuse to let any independent researchers study whether their technique works. They dismiss all the independent scientists saying that their claims are impossible by arguing that they’re light-years ahead of mainstream science and can do things that nobody else can.

    Re the previous discussion on health care regulation, this is an example of what happens when a sub-field of health care is not regulated: you get the snake oil salesmen with their secret formulas.

    • Matthias says:

      Alas, this doesn’t say that regulation would keep the snake oil away.

      • vV_Vv says:

        No, but it may keep the snake oil away, while without regulation you are certain to get it.

  27. skybrian says:

    It’s times like this when I try to remember that everything is correlated [1]. Maybe we should be looking at effect sizes instead?

    [1] https://www.gwern.net/Everything

  28. jajajajim says:

    This study provides strong evidence that 5-HTTLPR is not associated with developing depression. It did not test any pharmacogenetic hypotheses.

    Whether or not genetic differences in the serotonin transporter gene drive depression risk is not the end-all-be-all for whether it is important for antidepressant pharmacogenetics. The serotonin transporter is most definitely a drug target, and genetic variation in drug targets can most definitely affect drug response, whether or not that variant affects disease risk. Example from outside of psychiatry – VKORC1 encodes the drug target of warfarin and affects warfarin response/dose requirement. But common genetic variants in VKORC1 do not increase risk of VTE/DVT/PE.

    Skepticism about 5-HTTLPR being ready for clinical use in pharmacogenetics is warranted based on dozens of previous studies. But this study doesn’t move the needle much for me on topics of treatment response.

  29. mcpalenik says:

    It seemed easier to copy and paste into Matlab, which spit out 10^-144.

  30. jermo sapiens says:

    How many of our scientific edifices are built on air? How many useless products are out there under the guise of good science? We still don’t know.

    A good bet is “a huge amount”.

    This is particularly true when the subject being studied is very complex, has tons of variables (most of which we can’t measure, or whose significance we’re unaware of), is part of a chaotic system, and, worst of all, has political implications.

    Most scientists are not corrupt but doing good science is hard, even for very intelligent and diligent people. I also suspect most researchers have inadequate education in statistics.

    The pressure to publish, the pressure to publish conclusions that won’t get you ostracized from your scientific community, and the pressure to make whoever funds your research happy are all corrupting influences on science.

    If a theory makes reliable predictions, and those predictions are used by engineers, I consider it to be solid science. Otherwise, it’s most likely bunk.

    • Murphy says:

      I’ve had to have “the talk” with co-workers. The “what you’re doing is starting to be baaasically P-hacking” talk.

      Particularly with people who aren’t too confident with their analysis skills. They’re sure their data should say X, they must just be screwing up doing the analysis somehow!

      So they tweak the analysis a little. Perhaps try a different test from the stats package.

      Pretty soon they have a “significant” result.

      Anything with a pre-reg gets a huge trust boost from me.

    • Freddie deBoer says:

      Oh, I don’t know. I don’t see a lot of planes dropping out of the skies. If bad science is so universal, why do so many things work?

      • jermo sapiens says:

        As I mentioned above:

        If a theory makes reliable predictions, and those predictions are used by engineers, I consider it to be solid science.

        I believe the Bernoulli principle falls squarely in that category. But it doesn’t follow from airplanes that the latest study linking red meat to cancer is reliable, because both are based on “science”.

        Each claim made by scientists should be examined on its own merits. Using the prestige of science to bolster a claim made by scientists is completely unscientific. Things that work do so because we have mastered the science behind them using rigorous methods, not because a scientist declared it so. For example, we mastered fluid dynamics, and therefore we have airplanes. We mastered electromagnetism, and therefore we have smart phones.

        On the other hand we’re just beginning to dip our toes in nutritional science, and we’re pretty sure sugar is bad, but not that long ago we were telling people to replace butter with margarine.

        • Clutzy says:

          I think you’ve hit on an important note. Science has prestige, so people have been trying to slap it onto every field to try and appropriate its inherent trust and credibility. This goes for Psych and the “social sciences” and also for things like “Forensic science” and probably most prominent currently is climate science. These fields don’t universally lack merit and credibility, but they have used the credibility of Chemists and Physicists of the past to gain more credibility than they deserve, and to project unearned confidence in their conclusions.

        • Andkat says:

          To my understanding, fluid dynamics is notable as an area of physics widely known specifically for having major unsolved problems (i.e. turbulence), so that example falls flat as presented.

          Most of our biotechnology and pharmacology is based off of science that is massively underdetermined in terms of fundamental understanding and capacity for rational design; nonetheless, we still have an enormous corpus of profitable or indeed essential technologies derived therefrom. Core technologies (CRISPR, restriction enzymes and plasmids, PCR, most antibiotic and drug scaffolds) etc. have been built off of ripping finished goods whole-cloth from biological systems and jury-rigging them together. Capacity to make an adequately useful, profitable product or service should not be mistaken for being a test of ‘mastery’ of the underlying science.

  31. analytic_wheelbarrow says:

    Sorry if I’m a bit slow, but what specifically was wrong with all those studies that you’re claiming were bogus? I could understand how data dredging could yield the initial result (“out of thousands of genes we tested, this one had a low p-value!”), but then when researchers went out and ran studies on that particular gene, how were they getting positive results?

    • Mazirian says:

      Few studies on 5-HTTLPR were direct replications of previous studies. Instead, they were “conceptual replications” and “extensions” of previous research. If the original effect couldn’t be found, researchers could try to find an effect in some subgroup or using a different outcome variable. And, of course, studies that found some kind of effect–even if it was, say, an implausible three-way interaction–were much more likely to be published than failed direct replications.

    • Murphy says:

      Another way to data dredge is when you try for a replication or near-replication or similar… find your data doesn’t support it… so you try a dozen related diseases etc. until you find something statistically significant… then talk like that was your analysis plan all along.

      It’s why prereg is so valuable.

    • vV_Vv says:

      This way.

      You test a large number of sub-hypotheses until you get significance. You can generate them combinatorially by combining different variables: outcomes, treatment, genes, sex, age, ethnicity, and so on. Scott calls this the “Elderly Hispanic Woman Effect”. That’s how you get strangely specific paper titles such as “Gender Moderates The Association Between 5-HTTLPR And Decision-Making Under Uncertainty, But Not Under Risk” or “The influence of family structure, the TPH2 G-703T and the 5-HTTLPR serotonergic genes upon affective problems in children aged 10–14 years”.

      And on top of these hypothesis selection issues, you have the whole p-hacking bag of tricks: changing the details of the statistical analysis until you get significance, stopping data collection as soon as you get significance, throwing data away (rationalizing it as discarding exploratory experiments, outliers, etc.) until you get significance, and so on.

      And on top of p-hacking, you have publication bias: papers that report null results are much less likely to be published.
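
      A minimal simulation of the combinatorial dredging described above (not from any of the studies; the subgroup tests are abstracted as independent null comparisons, and all numbers are invented): even when the genotype does nothing, a "study" with twenty shots at p < 0.05 finds something most of the time.

```python
import math
import random

def two_sample_z(a, b):
    """Two-sample z-test p-value (normal approximation)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    z = (ma - mb) / math.sqrt(va / na + vb / nb)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def dredge_once(rng, k=20, n=200):
    """One 'study': test k null sub-hypotheses (think subgroup x outcome
    combinations) and report whether ANY comes out 'significant'."""
    for _ in range(k):
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        if two_sample_z(a, b) < 0.05:
            return True
    return False

rng = random.Random(1)
trials = 200
hits = sum(dredge_once(rng) for _ in range(trials))
# Expected rate is roughly 1 - 0.95**20, i.e. around 64% of null "studies"
# produce at least one publishable-looking result.
print(hits / trials)
```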

    • analytic_wheelbarrow says:

      These are good replies, but I wonder if there is direct evidence that these things (data dredging, forking paths, etc) happened. In other words, studies 1–30 say X, study #31 (by a geneticist) says Not X. I’d like a little more evidence that 1–30 were flawed.

      • Greg D says:

        The answer to your question was this:

        That’s how you get strangely specific paper titles such as “Gender Moderates The Association Between 5-HTTLPR And Decision-Making Under Uncertainty, But Not Under Risk” or “The influence of family structure, the TPH2 G-703T and the 5-HTTLPR serotonergic genes upon affective problems in children aged 10–14 years”.

        Any time you see something that specific, where the question (along with all the other questions they were going to look at) wasn’t pre-registered, you should assume the study is garbage.

        If they did pre-register all the questions they were going to look at, you should do a Bonferroni correction (divide 0.05 by # of questions, that’s now the p value required for significance), and check their p values. If it doesn’t beat the corrected p value, then, again, it’s garbage.

        If it passes those tests, then it could possibly be at least worth looking at.
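
        The correction Greg D describes, as a minimal sketch (the p-values are hypothetical):

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni correction: each pre-registered question must beat
    alpha divided by the number of questions."""
    threshold = alpha / len(p_values)
    return threshold, [p < threshold for p in p_values]

# Ten pre-registered questions: the bar drops from 0.05 to 0.005,
# so a nominally "significant" p = 0.03 no longer passes.
pvals = [0.001, 0.03, 0.2, 0.5, 0.04, 0.7, 0.9, 0.06, 0.15, 0.3]
threshold, passes = bonferroni(pvals)
print(threshold, passes)
```

        Note this is the crudest correction; it assumes the worst case about dependence between the questions, which is exactly why it is a reasonable bar for skeptical reading.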

        • analytic_wheelbarrow says:

          I’ll accept that answer. 🙂 Thanks for explaining (and to vV_Vv for making this point in the first place).

  32. niohiki says:

    This is why I start worrying when people talk about how maybe the replication crisis is overblown because sometimes experiments will go differently in different contexts.

    I think that maybe those people come from different fields. Take the post’s example. I come from neither community, but the whole interaction here is people from Genetics coming to tell people from Psychiatry that something’s wrong. I would not be surprised at all if for the genetics community this gene causing depression was not “common knowledge” at all, and the Border et al paper was not exactly a revelation for them either (“so a single gene does not influence complex behavioural patterns? Duh”). There is an (understandable) tendency to gather all “science” into a single container (as a wise man said, the categories were…), but one should expect different communities to have very different standards of statistical rigour.

    In my experience, fundamental physics is not going at all through a replication crisis. It’s going through some pretty bad cases of “oh god we’re generating petabytes of data per second and it all makes too much sense and we need to give something new to the theorists or they will keep producing more and more confusing string theory stuff”. And others. But definitely not a problem of replication.

    As for why… wild speculation. It may be because of a general perception in-community of statistics (and generally solid maths) as something intrinsically valuable, which raises the quality of any paper; or instead, a perception of statistics as something to be done with as soon as possible, only there to please the minimum standards asked of a journal’s peer-reviewing for the grant-givers to consider it an acceptable badge.

    And then I would say that this may be (only in part) motivated by how easy it is to produce good, solid, robust numbers in the field. The more one can isolate the variables of interest, the safer it is to be rigorous about them. That is, one maybe gets negative results, but they are provably negative, not just a lot of noise/undistinguishable interactions. And so CERN routinely produces articles saying “nope, this theory is wrong; nope, no such thing; nope, sorry, no SUSY”. In that context, it pays off to signal intellectual prowess by performing decent data treatment (not because it is easy, but because it is hard – and not carrying a lot of risk). Mid-way, there is genetics, where definitely nothing is properly isolated, but at least there is a minimal safety for a paper to be published if “we discovered that gene HaMeph produces protein YYZ and that regulates dopamine”, whether the dopamine has any actual phenotypic effect or not. And when you get to psychiatry, either you get an effect on the completely-noisy-system, or it’s worth nothing, so the whole intellectual-signalling becomes too much of a liability and the statistical level is just kept to the minimum.

    But because that is just wild speculation and I have no way of doing any sound statistics on it, I would not take it too seriously.

    • It may be because of a general perception in-community of statistics (and generally solid maths) as something intrinsically valuable, which raises the quality of any paper; or instead, a perception of statistics as something to be done with as soon as possible, only there to please the minimum standards asked of a journal’s peer-reviewing for the grant-givers to consider it an acceptable badge.

      I’m reminded of a comment by my father, who was a statistician as well as an economist, on DataDesk, a program for doing exploratory statistics. He said it was the first statistics program he had seen with which someone untrained in statistics would do more good than harm.

      • niohiki says:

        I… I really want to believe in a world where human ineptitude can be bounded by sufficiently well-designed software. Maybe those were more optimistic times.

    • vV_Vv says:

      In my experience, fundamental physics is not going at all through a replication crisis.

      In hep-th you can make a career out of publishing papers that contain speculations that will likely never be experimentally tested within your professional lifespan, if ever.

      It took ~40 years for the LHC to put SUSY to rest. String theory/loop quantum gravity stuff are largely untestable without experimental techniques that could access the Planck scale, which are unfeasible for the foreseeable future.

      • niohiki says:

        Yes, that’s what I was pointing to with the “confusing string theory stuff”. A whole problem on its own, no doubt. But it’s not a replication crisis, in the strict sense; and also in the more general sense of the replication crisis being related to bad statistics/bad mathematical foundations. If anything, it starts from exactly the opposite direction.

        PS: Since I come from there, I feel the need to justify a fraction of what is done under the name of “string theory”. There are a lot of very interesting maths that are being discovered thanks to it (particularly algebraic geometry, but not only). I am quite happy about that because it has returned some of the vanguard of mathematics to the good ol’ pre-Bourbaki style of shoot first, prove theorems later. These things can be worthwhile as mathematics per se, and as tools for whatever lies ahead in physics (the same way the whole Hamilton-Jacobi formalism could have been seen as a bit of intellectual self-pleasing until it became so obviously important for quantum mechanics).

        And on the other hand, I admit that most of what is done under the banner of “string theory” et al (or “AdS/CFT”; or, heavens forbid, “the Landscape(TM)”…) is philosophy-101-conjecturing using equations as if they were “suggestions of preparation” in order to justify whatever fancy/hyped/trendy model is going around at the moment, wilfully avoiding anything that could resemble experimental verification. I know, I know.

        • vV_Vv says:

          Yes, that’s what I was pointing to with the “confusing string theory stuff”. A whole problem on its own, no doubt. But it’s not a replication crisis, in the strict sense; and also in the more general sense of the replication crisis being related to bad statistics/bad mathematical foundations.

          If a research field uses effectively no experimental verification, then there is nothing to replicate, hence a replication crisis is impossible.

          If anything, it starts from exactly the opposite direction.

          It starts from the same direction of the incentives being set so that it is possible or even professionally advantageous for scientists to spend decades publishing claims that are most likely false or irrelevant.

          • mcpalenik says:

            This is, however, why there aren’t that many people getting into high energy theory these days. At least, at the school I went to, most of the people that did high energy theory didn’t even take students anymore. The ones who retired while I was there weren’t replaced with more high energy theorists. It’s not particularly easy to make a career in that field, which is why a lot of people choose not to go into it. It’s probably good to have some number of people doing high energy theory, because somebody might come across something interesting, but it’s generally understood that that’s not where the bulk of the funding, hiring, or research efforts in physics are being placed.

          • gbdub says:

            I think the key difference is that in fundamental physics, it’s much more known and acknowledged that many hypotheses are untested and maybe untestable. That’s why the LHC opened with a backlog of interesting experiments to run.

            As opposed to the case Scott discusses, where researchers had convinced themselves they had already proven their hypotheses. These were not cases where there were a bunch of theories waiting for a new tool (in this case large scale genetic databases) to be tested.

    • zzzzort says:

      A lot of it is cultural, though the culture and the content are related. I’m in biophysics, and in the more biology side of it people are just more accepting of results that contradict and less willing to point out that some theory has been disproven. The more physics side is much more confrontational, and if you try to push a theory that doesn’t agree with experiment, someone in every conference presentation will bring it up in the Q and A. Crossover can be very disconcerting for both the physicists (“I told them they were wrong and they just said my results were very interesting”) and the biologists (“I related my results to an established framework and some old guy yelled at me that he disproved that framework years ago”).

      • niohiki says:

        “I told them they were wrong and they just said my results were very interesting”

        I think that’s giving me a PTSD episode.

  33. VirgilKurkjian says:

    In regards to effects varying across contexts, Many Labs 2 aimed to examine just that, and found very little difference.

    It opens:

    Suppose a researcher, Josh, conducts an experiment finding that experiencing threat reduces academic performance compared to a control condition. Another researcher, Nina, conducts the same study at her institution and finds no effect. Person and situation explanations may come to mind immediately: (1) Nina used a sample that might differ in important ways from Josh’s sample, and (2) the situational context in Nina’s lab might differ in theoretically important but non-obvious ways from Josh’s lab. Both could be true simultaneously.

    And near the end:

    The main purpose of the investigation was to assess variability in effect sizes by sample and setting. It is reasonable to expect that many psychological phenomena are moderated by variation in sample, setting, or procedural details, and that this may impact reproducibility … However, while calculations of intra-class correlations showed very strong relations of effect sizes across the findings (ICC = 0.782), they showed near zero relations of effect sizes across samples (ICC = 0.004). Sensibly, knowing the effect being studied provides a lot more information on effect size than knowing the sample being studied. Just 11 of the 28 effects (39%) showed significant heterogeneity with the Q-test, and most of these were among the effects with the largest overall effect size. Only one of the near zero replication effect (Van Lange et al., 1997) showed significant heterogeneity with the Q-test. In other words, if no effect was observed overall, there was also very little evidence for heterogeneity among samples.

    I wouldn’t say that context-sensitivity has been debunked, but these results are pretty strong evidence against it.
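
    The Q-test the quote leans on can be sketched in a few lines. This is a generic Cochran's Q with made-up effect sizes and sampling variances, not Many Labs 2's actual data:

```python
def cochran_q(effects, variances):
    """Cochran's Q statistic for between-study heterogeneity:
    Q = sum of w_i * (theta_i - weighted_mean)^2, with weights w_i = 1/var_i.
    Under homogeneity Q follows a chi-square with k-1 degrees of freedom,
    so Q far above k-1 signals real heterogeneity across samples."""
    w = [1.0 / v for v in variances]
    mean = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    return sum(wi * (e - mean) ** 2 for wi, e in zip(w, effects))

# Homogeneous labs: same true effect, Q stays near k - 1 = 4.
same = [0.20, 0.22, 0.18, 0.21, 0.19]
# Heterogeneous labs: effects genuinely differ by sample/setting.
diff = [0.05, 0.40, 0.10, 0.55, 0.02]
var = [0.01] * 5  # assumed per-lab sampling variance

print(cochran_q(same, var))  # small: no evidence of context-sensitivity
print(cochran_q(diff, var))  # large: effects really vary across labs
```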

  34. Murphy says:

    My first thoughts reading that: where the hell did someone get 1K sequenced samples with high quality phenotype data in 1996???

    My second thoughts, where the hell did someone get 600K sequenced samples with high quality phenotype data in 2018???

    [reading the paper, they used uk biobank data, imputed from chip data so that makes sense, there isn’t a spare 500k exomed or genomed dataset I’m unaware of]

    estimated out-of-sample imputed genotype match rates were ≥0.919

    Decent imputations, validated against a reference set.

    A relatively homogeneous dataset with decent phenotype data.

    All analyses were preregistered through the Open Science Framework and are available at https://osf.io/akgvz/. Statistical models are described in detail in section S4 of the online supplement, and departures from the preregistered analyses are documented in section S5.


    So this has the hallmarks of high quality analysis.

    I’m inclined to believe it.


    Looking at how common the variant is….

    gnomAD European Sub 0.124
    gnomAD African Sub 0.280
    gnomAD American Sub 0.08
    gnomAD East Asian Sub 0.27
    gnomAD Other Sub 0.14
    gnomAD Ashkenazi Jewish 0.1

    Epistemic status: wild speculation

    I have a wild guess, only supported by my gut feeling, but given the variation in allele frequency across different populations… that a bunch of the supporting studies happened to do really shitty QC and basically captured that one population or another suffers depression at slightly different rates in a given community.
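
    A toy version of that stratification story (carrier frequencies loosely inspired by the gnomAD numbers above; the 10% vs 15% depression base rates are invented, and within each population the allele does nothing):

```python
import random

random.seed(2)

# Two populations with different allele frequencies and (by assumption)
# different measured depression rates. Crucially, within each population
# carrying the allele is INDEPENDENT of being depressed.
pops = [
    {"freq": 0.12, "dep_rate": 0.10, "n": 50_000},  # hypothetical pop A
    {"freq": 0.28, "dep_rate": 0.15, "n": 50_000},  # hypothetical pop B
]

carriers = {"dep": 0, "ok": 0}
noncarriers = {"dep": 0, "ok": 0}
for pop in pops:
    for _ in range(pop["n"]):
        carrier = random.random() < pop["freq"]
        depressed = random.random() < pop["dep_rate"]  # independent of genotype!
        bucket = carriers if carrier else noncarriers
        bucket["dep" if depressed else "ok"] += 1

rate_c = carriers["dep"] / (carriers["dep"] + carriers["ok"])
rate_n = noncarriers["dep"] / (noncarriers["dep"] + noncarriers["ok"])
# Pooled without ancestry QC, carriers look more depressed purely because
# carriers are disproportionately drawn from the higher-base-rate population.
print(f"carriers: {rate_c:.3f}  non-carriers: {rate_n:.3f}")
```

    This is the classic population-stratification confound that ancestry principal components and within-family designs exist to kill.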

    Looking up the gene in question in the databases I often use, it’s not listed as having a strong link to depression:



    So either the curators moved fast or the papers in question didn’t make the cut.

    I can entirely believe this kind of self-reinforcing shit happened. I saw a case study about a similar thing a few years back about some muscle protein. Lots of papers published dis-confirming the results but a nest of papers developed citing 2 primary papers with positive results plus each other and ignoring all the disconfirming papers.

    I recently had to wash my hands of a paper I was working on with a clinician. We had a perfectly good, boring result of “no association found in population X” that I wanted to publish.

    But she wants to keep dredging for something, anything interesting. So I dropped it and she had to go find some other bioinformatician.

    The only point from your post I’ll dispute is this one:

    Outside a few cases like cystic fibrosis, most important traits are massively polygenic or even omnigenic

    My colleagues and I spend most of our time looking at stuff massively affected by a couple of variants.

    A huge number of things are strongly controlled by a single SNP.

    I have a default skepticism for a lot of GWAS studies claiming tiny tiny effects from a thousand random alleles: my gut feeling is that a lot of them are dredging.

    Single SNPs that do important things aren’t too unusual. When they have a big enough effect to really matter… you don’t need 600,000 people to identify them.

    But having a highly consanguineous (read: family trees that look like Escher paintings) community with an unusual incidence rate, where you can clearly see the split between affected and unaffected individuals, is a godsend.

    • Scott Alexander says:

      I really want to hear more about when traits can vs. can’t be affected by a small number of variants. If someone says they’ve identified some gene that determines 50% of the variance in running speed, or antidepressant response, or agreeableness, or whatever, what should my priors be?

      • Murphy says:

        If someone says they’ve identified some gene that determines 50% of the variance in running speed, or antidepressant response, or agreeableness, or whatever

        My guess is very low. Most things that are obviously advantageous or deleterious in a major way aren’t gonna hover at 10%/50%/70% allele frequency.

        Population variance where they claim some gene found in > [non trivial]% of the population does something big… I’ll mostly tend to roll to disbelieve.

        But if someone claims a family/village with a load of weirdly depressed people (or almost any other disorder affecting anything related to the human condition in any horrifying way you can imagine) are depressed because of a genetic quirk… believable but still make sure they’ve confirmed it segregates with the condition or they’ve got decent backing.

        And a large fraction of people have some kind of rare disorder: it’s like that old post you had about the number of problems normal people have. Long tail. Lots of disorders, so quite a lot of people with something odd.

        It’s not that single variants can’t have a big effect. It’s that really big effects either win and spread to everyone or lose and end up carried by a tiny minority of families where it hasn’t had time to die out yet.

        Very few variants with big effect sizes are going to be half way through that process at any given time.
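
        That "win and spread, or lose and die out" dynamic can be illustrated with a toy deterministic selection model (my sketch, assuming simple haploid selection; the selection coefficients and frequency band are arbitrary):

```python
def generations_in_band(s, p0=0.01, lo=0.1, hi=0.9, max_gen=100_000):
    """Deterministic haploid selection: p' = p(1+s) / (1 + p*s).
    Count how many generations the allele frequency spends at the
    intermediate frequencies (lo, hi) on its way to near-fixation."""
    p, total, in_band = p0, 0, 0
    while p < 0.999 and total < max_gen:
        p = p * (1 + s) / (1 + p * s)
        total += 1
        if lo < p < hi:
            in_band += 1
    return in_band, total

for s in (0.1, 0.01, 0.001):
    band, total = generations_in_band(s)
    print(f"s={s}: {band} generations between 10% and 90%, "
          f"{total} to near-fixation")
```

        A strongly selected allele (s = 0.1) spends only a few dozen generations at intermediate frequency, so the window in which you could catch it "half way through" is tiny; only weakly selected or trade-off variants linger there.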

        Exceptions are:

        1: mutations that confer resistance to some disease as a tradeoff for something else

        Because most diseases are geographical: again a big effect, but it may not spread to everyone.

        2: genes that confer a big advantage against something that’s only a very recent issue

        Since we didn’t really evolve with whisky, it’s only been a big advantage to have more effective versions of the genes that metabolize it for a few dozen generations. That likely explains a non-trivial fraction of alcohol addiction vulnerability variation.

        3: genetic games of chicken.

        Disorders like Huntington’s, where it’s theorized that there’s a tradeoff between risk of death and IQ. Too many CAG repeats and the area destabilizes, and you get runaway anticipation that wipes out later generations of your family.


        Increasing repeat length was associated with higher GAI scores up until roughly 40–41 repeats

        But >39 repeats means risk of the disease.

        Epistemic status: not too bad, certainly good enough to have my old head of department, a grand old neurogeneticist, interested in this, if still maintaining healthy scientific skepticism.


        Elena Cattaneo, a cell biologist at the University of Milan, has been investigating this idea for the past three years. Huntingtin’s exact role remains obscure. But it is known (because it is produced in the relevant cells) to be involved in both the construction of brains in embryos and in the process of learning. So Dr Cattaneo began by looking into how the huntingtin gene evolved in creatures with increasingly complex nervous systems.

        Huntingtin-like genes go back a long way, and display an intriguing pattern. A previous study had found them in Dictyostelium discoideum, an amoeba. Dictyostelium’s huntingtin gene, however, contains no CAG repeats—and amoebae, of course, have no nervous system. Dr Cattaneo added to this knowledge by showing the huntingtin genes of sea urchins (creatures which do have simple nervous systems) have two repeats; those of zebrafish have four; those of mice have seven; those of dogs, ten; and those of rhesus monkeys around 15.

        The number of repeats in a species, then, correlates with the complexity of its nervous system. Correlation, though, does not mean cause. Dr Cattaneo therefore turned to experiment. She and her colleagues collected embryonic stem cells from mice, knocked the huntingtin genes out of them, and mixed the knocked-out cells with chemicals called growth factors which encouraged them to differentiate into neuroepithelial cells.

        A neuroepithelial cell is a type of stem cell. It gives rise to neurons and the cells that support and nurture them. In one of the first steps in the development of a nervous system, neuroepithelial cells organise themselves into a structure known as the neural tube, which is the forerunner of the brain and the spinal cord. This process can be mimicked in a Petri dish, though imperfectly. In vitro, the neuroepithelial cells lack appropriate signals from the surrounding embryo, so that instead of turning into a neural tube they organise themselves into rosette-shaped structures. But organise themselves they do—unless, Dr Cattaneo found, they lack huntingtin.

        Replacing the missing gene with its equivalent from another species, however, restored the cells’ ability to organise themselves. And the degree to which it was restored depended on which species furnished the replacement. The more CAG repeats it had, the fuller the restoration. This is persuasive evidence that CAG repeats have had a role, over the course of history, in the evolution of neurological complexity. It also raises the question of whether they regulate such complexity within a species in the here-and-now.

        They may do. At the time Dr Cattaneo was doing her initial study, a group of doctors led by Mark Mühlau of the Technical University of Munich scanned the brains of around 300 healthy volunteers, and also sequenced their huntingtin genes. These researchers found a correlation between the number of a volunteer’s CAG repeats and the volume of the grey matter (in other words, nerve cells) in his or her basal ganglia. The job of these ganglia is to co-ordinate movement and thinking. And they are one of the tissues damaged by Huntington’s disease.

        Another investigation into huntingtin’s role in brains is now being carried out by Peg Nopoulos, a neurologist at the University of Iowa. She and her team are testing the cognitive and motor skills of children aged between six and 18, and comparing these volunteers’ test performances and brain scans with their CAG counts.

        So far, Dr Nopoulos has tested 80 children who have 35 or fewer repeats. She has found a strong correlation between the number of repeats and a child’s test performances. More repeats are associated with both higher intelligence and better physical co-ordination (the former effect seems more pronounced in girls and the latter in boys). Like Dr Mühlau, Dr Nopoulos has found a correlation between repeat numbers and the volume of the basal ganglia. She has also found a correlation with the volume of the cerebral cortex—another area affected by Huntington’s.

        • Scott Alexander says:

          Would it be fair to sum up what you said as a heuristic that single genes may have large effects, but it’s uncommon for them to make up a large percent of the variance in something we care about, unless there’s a recent evolutionary story we can tell to explain why?

          • Murphy says:


            I’ve given this some thought for a few hours.

            I think merely being able to tell a story is a bit too low a bar.

            I’d add in that it has to be spectacularly simple. (too many degrees of freedom make it too easy to find a story for anything)

            It’s probably safer to assume that most of the time common SNPs aren’t having a big effect, because the list of exceptions is pretty short.

            Most of the exceptions are biologically high-impact: frame shifts, stop loss, stop gain, or messing with splice sites, etc.

          • Loris says:

            I think the summary could be something like:
            A single gene determining 50% of the variance in any complex trait is inherently atypical, because variance depends on the population plus environment and the selection for such a gene would be strong, rapidly reducing that variance.
            However, if the environment has recently changed or is highly variable, or there is a trade-off against adverse effects it is more likely.
            Furthermore – if the test population is specifically engineered to target an observed trait following an apparently Mendelian inheritance pattern – such as a family group or a small genetically isolated population plus controls – 50% of the variance could easily be due to a single gene.

          • Murphy says:

            Put way better than I did.

          • Loris says:

            Thank you Murphy.

            I realised after posting that there is a single gene present in about half of most human populations which has a very significant effect on many traits, and possibly explains over 50% of the variance of quite a few: TDF (a.k.a. SRY). This is the gene on the Y chromosome which initiates male sex determination.
            I say this almost as a joke (there is effectively a whole chromosomal copy difference between genotypically typical male and female), but it’s also a counterexample to keep in mind to ” it’s uncommon for them to make up a large percent of the variance in something we care about”.
            In a sense, it’s so commonplace we don’t even notice.

          • deciusbrutus says:

            Sickle cell? Although there is an evolutionary story about that. G6PD deficiency appears to lack that particular story.

          • a reader says:

            What about MAO-A, the so-called “warrior gene”? It is a gene supposed to have a large effect on violence and propensity to crime, something we surely care about.

            But there was a family with apparently Mendelian inheritance pattern – the men who had the nonfunctional allele were extremely violent, the other men in the family and all the women were normal (but some women transmitted it to some of their sons):

            Male members of a large Dutch kindred displaying abnormal violent behaviour were found to have low MAO-A activity linked to a deleterious point mutation in the 8th exon of the gene. The unaffected male members within the family did not carry this mutation. The first study that investigated behaviour in response to provocation showed that, overall, MAOA-L individuals showed higher levels of aggression than MAOA-H (high MAOA activity) subjects.


            What do you people think, will it be confirmed or proved false in the end, by a large study like the one about 5-HTTLPR? I’d say 65% it will be confirmed.

            What could be the recent evolutionary story about it? Was the low activity variant advantageous for hunter gatherers, could it make the men better warriors?

          • Murphy says:

            MAOA is a whole gene; the family in the paper you link is one family with a weird phenotype and one, by the look of it, rare allele that segregates with the behaviour.

            The sky’s the limit for things that affect a single family or a single small community.

            I am entirely willing to believe that some rare MAOA mutations have a huge effect.

            I am less willing to believe that very common MAOA variants have a huge effect. Not entirely unwilling… but my prior is low.

            I’d love to say more but the paper doesn’t give any info on the variant.

      • Andkat says:


        This paper may be of interest in terms of providing a concrete perspective on what a representative genetic landscape can look like, given its extensive and well-controlled dissection of the phenotypic impacts of common genetic variation in yeast:

        Molecular Basis of Complex Heritability in Natural Genotype-to-Phenotype Relationships

        of particular interest:
        “We considered each time point in each environment as a separate quantitative trait, and the ~1,600,000 growth measurements allowed us to identify 18,007 QTLs at an empirical false-discovery rate of 1.5% ± 2.1% (by permutation test; mean ± SD), with 200 ± 52.2 QTLs identified per trait (mean ± SD) (Figure S1B). Our model explained 72.8% ± 18.5% of the broad sense heritability across the 90 traits examined (mean ± SD) (Figure S1C), and we readily discovered loci explaining as little as 0.01% of phenotypic variance (Figures S1D and S1E). The remaining ‘missing heritability’ is likely due to second- or higher-order genetic interactions (Bloom et al., 2015; Poelwijk et al., 2017). Most phenotypic variance was explained by linear homozygous contributions (n = 3,165), but numerous heterozygous contributions (of n = 229 total) had effect sizes comparable to homozygous terms (Figure 1H).”

        “Strikingly, fully 24.9% of the 6,604 protein-coding genes (and 13.3% of the individual segregating polymorphisms) were implicated in determining growth across this comparatively small number of environments…”

        “Synonymous natural variants, often regarded as unlikely to significantly affect phenotype (Kumar et al., 2009), had median effect sizes that were comparable to those of missense variants and larger than those of extragenic variants (p < 10^-6 by two-sample Kolmogorov-Smirnov test).”

        "Although we and others have previously found that some genes are identified as important in both deletion and mapping studies (She and Jarosz, 2018), our observations suggest that the genotype-to-phenotype map for natural genetic variation is fundamentally topologically distinct from that derived from gene deletions."


        "Our results help to reconcile a vigorous debate (Cox, 2017; Liu, 2017; McMahon, 2017) regarding the omnigenic model: while seemingly equivalent to Fisher’s ‘infinitesimal’ model (which assumes infinitely many segregating causal alleles), an apparently omnigenic relationship between genotype and phenotype can often arise under realistic linkage disequilibrium without all segregating variants impacting a trait. We find that most quantitative traits likely comprise sufficiently many underlying causal loci as to appear omnigenic from the perspective of a typically (under)powered GWAS; nonetheless, enumerating these many contributors may be feasible."

        It's worth stressing that natural genetic variability, as others here have described, is in part shaped by a form of survivorship bias. You're mostly going to see variation within a range that tends to pan out within a window of acceptable function (at least within a certain range of contexts). In a broader sense, particular single substitutions can certainly have enormous effects on protein (or RNA) function one way or the other; however, the single mutants that cause e.g. a millionfold drop in activity of an essential metabolic enzyme or a major shift in the specificity of a key regulator will often be lethal to begin with (in addition to the obvious caveat that you can’t explore the whole amino acid landscape in the space of single nucleotide variation from a particular codon); likewise, evolutionary history has a habit of having already picked the low-hanging fruit of locally easily realized (i.e. one nucleotide substitution) unambiguous functional gains.

  35. Nancy Lebovitz says:

    I think I’ve been overtrained. From the beginning of the article, I was expecting this to be a symbolic story about an imaginary gene. And I was expecting a joke about Hitler, something like the gene being associated with efforts to take over the world, but instead it was about depression.

    On the serious side, I believe the underlying cause of a lot of the replication crisis is a desire to do science more cheaply, in both money and effort, than it can be done, and the work on 5-HTTLPR has been tremendously wasteful of already insufficient resources.

    • Aapje says:

      Doesn’t Httplr look suspiciously like Hitler? Coincidence? I think not!

      • Eric Rall says:

        This is the Internet. Everything looks suspiciously like Hitler to somebody.

      • Eponymous says:

        Doesn’t Httplr look suspiciously like Hitler? Coincidence? I think not!

            In a 26-letter alphabet, the probability that a random 6-letter string will contain “HTLR” in order is 15/26^4 = 3.28e-05. The probability that it does so *starting* with H and *ending* with R is 6/26^4 = 1.31e-05.

        So we can reject “coincidence” at any reasonable significance level…
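            For the curious, the 15/26^4 figure is a slight overcount: strings containing HTLR as a subsequence in more than one set of positions get counted once per set. An exact count via a small dynamic program over the greedy subsequence-matching automaton (a quick sketch, not from the thread) gives:

```python
# Count 6-letter strings over a 26-letter alphabet containing "HTLR"
# as an in-order subsequence. State i = number of pattern letters
# matched so far by a greedy left-to-right scan.
def count_containing(pattern_len=4, length=6, alphabet=26):
    counts = [1] + [0] * pattern_len   # counts[i]: strings reaching state i
    for _ in range(length):
        nxt = [0] * (pattern_len + 1)
        for i, c in enumerate(counts):
            if i < pattern_len:
                nxt[i] += c * (alphabet - 1)   # any letter except the one needed
                nxt[i + 1] += c                # the next needed pattern letter
            else:
                nxt[i] += c * alphabet         # fully matched: anything goes
        counts = nxt
    return counts[pattern_len]

exact = count_containing()   # 9526 strings
naive = 15 * 26 ** 2         # 10140, i.e. the 15/26^4 numerator over 26^6
```

            So the exact probability is 9526/26^6 ≈ 3.08e-05, close to but a bit below the back-of-the-envelope 3.28e-05; the conclusion (reject coincidence) is unchanged.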

        • deciusbrutus says:

          But it’s a multivariate analysis. How many random 6-letter strings appear in the entire study? I begin by assuming that all space-separated strings are random…

  36. Steve Sailer says:

    Behavioral geneticist Robert Plomin says in his 2018 book “Blueprint” that the old candidate gene approach turned out in the end to be such a disaster that he almost retired from science about a decade ago.


  37. Yaleocon says:

    The thing that drives me crazy here is the part where this study hasn’t made waves. And won’t. The majority of scientific research won’t engage with this, and it will just… become irrelevant.

    And that means even really good scientists—like the authors of this study seem to be—will be lost. Science happens by standing on the shoulders of giants. And right now, those shoulders are hopelessly buried in crap studies. So anyone who wants to do good science has to spend 3/4 of their time digging through the crap to even find the firm foundation. And even if they succeed, there’s no guarantee their results won’t just be ignored and covered in more crap.

    Actual good science will need to ignore all existing science—or at least take it with mountains of salt—before it can come to exist.

    Someone please show me I’m wrong.

    • Aapje says:

      Indeed. I think we should do way fewer studies, and then do them properly.

      Scientists are now often rewarded more for being productive than for doing good science. This has all kinds of perverse effects and results in lots of crappy studies.

    • Mazirian says:

      I think the reason it has not made waves is that the candidate gene approach was deemed highly unreliable by all the major research teams by the early 2010s and GWAS quickly became the standard research design (although marginal journals continue to publish candidate gene studies even today). A study showing that candidate gene findings are spurious is not news today; rather, it just confirms that the shift away from candidate genes several years ago was the correct thing to do.

      The Border et al. study is therefore not very interesting as a psychiatric genetics study. However, it is certainly interesting from the perspective of metascience, viz. how is it possible that hundreds of studies get published (in leading journals) on a completely spurious phenomenon?

  38. deciusbrutus says:

    What would you estimate the odds are that the entire study was simply falsified? Base rate of studies being falsified? A thousand times more? A thousand times less?

    • Radu Floricica says:

      Since there were meta-analyses that didn’t scream murder but ended up on the fence, I think it was mostly the gap between what the numbers actually said and what the written conclusions said. Statistics is pretty difficult to get right. Meta-analyses in particular are a bit of God territory to do right; there are so many things they need to consider.

      • deciusbrutus says:

        Don’t meta studies generally assume that the studies they are studying actually happened?

        • Radu Floricica says:

          Maybe I’m just hoping they’re not. But utter falsification is treated like a capital crime, and you can usually get to the same results just by fudging statistics and interpretations. Usually, though, it’s just letting biases guide you.

          • deciusbrutus says:

            That utter falsification is considered a capital crime is very strong evidence that it has happened.

            Thus, the base chance of falsification is greater than the P-value. The posterior odds after updating from any evidence I can imagine remain higher than the P-value.

            Of course, “studies can be outright falsified” is a horrible, motivated-cognition reason to choose not to update on a study. It’s merely a cap on the amount by which one should update.

  39. brett1211 says:

    Great, if terrifying post.

    Reminds me a lot of what happened last year, when you could sprinkle a little “blockchain” on any 🐕 sh*t business and all of a sudden the impossible was possible. Literally 1) get the underwear 2) put the underwear on the blockchain 3) PROFIT!

    Sad to know that science is far from immune to manias and FOMO. In some sense it’s worse than business, because the subject matter is even more esoteric and more difficult for non-experts to dispute. Beware situations where no one has an incentive to do the work.

    Love the blog btw.

    • Radu Floricica says:

      The fun question was “so, if you replaced the blockchain in your business with a database, what would change?”, asked while looking deeply into their eyes.

      • Murphy says:

        “Uuuuuhhhhh…. Decentralized!”

        “Disruptive technology!”

        “ThErE’s cOmPaNiEs BuYiNg BlOcKChAiN WhIcH mEaNs ThEy MuSt KnOw MoRe ThAn YoU”

      • antilles says:

        This exactly! I’m very fortunate that my introduction to blockchain was at the hands of a CS PhD who basically made me read a few dozen pages’ worth of writing to that effect before they’d even talk to me about the technology. It was like being pre-emptively deprogrammed, so I could go out and read other articles without the delusions of grandeur sticking.

    • acymetric says:

      I have a visceral reaction to the word blockchain. It isn’t that it is never an appropriate tool…just that it usually isn’t.

    • Nornagest says:

      I got a recruiter email once last year for a company that was doing blockchain AI cloud security.

      I didn’t reply.

      • acymetric says:

        By any chance was it founded by a Nigerian prince?

      • kenny says:

        The specific order of the buzzwords – “blockchain AI cloud security” – seems silly but automated fraud detection for a cryptocurrency exchange would be reasonable. BINGO!

    • Matthias says:

      Actually, science is much worse than business, because in science publication is the goal. In business the need to make money can often act as a reality check.

      Not always, because you might be selling stories and placebos. But when, e.g., Facebook runs some machine learning to target ads better, it cares about repeatable results. People in ML seem much more paranoid about overfitting (and have been for years) than scientists seem to be about their results not replicating.

      Facebook et al are also more interested in null results than science seems to be: figuring out that a specific factor has no bearing on ad targeting is valuable insight for them.

    • PeterBorah says:

      As someone who’s been full-time in the blockchain space for five years now, less than 5% of the blockchain projects I’ve seen have made any sense whatsoever. Even the ones that kinda look like they make sense generally stop making sense if you dig into the details. It’s somewhat terrifying.

  40. Michael Handy says:

    I agree with the post.

    I also admire your fortitude in avoiding calling these studies “Literally HTTLPR”

  41. J says:

    In retrospect, it should have been obvious that hypertext transport over line printers just wasn’t a good idea.

    • liate says:

      Especially after the first 4 attempts!

      • John Schilling says:

        The first three hypertext line printers were merely destroyed by sabotage. The fourth disappeared without a trace less than 24 hours after going online, but was later recovered in an archaeological dig from late Roman Judea with a hyperlinked collection of synoptic and gnostic gospels. They probably should have stopped there, yes.

    • acymetric says:


      I will admit that, not being aware of the 5-HTTLPR gene, when I saw the title I thought this was going to be about some new transfer protocol I had never heard of.

  42. Immortal Lurker says:

    A p value of 10^-143 basically means “this is the most reliable result the method employed could possibly produce.” Which means the null hypothesis will probably produce that result much more often than 1 in 10^143, but still pretty good.

    I wonder if it should become a part of statistical best practices to admit when you have a P value stronger than your own belief in objective reality?

    • robdonnelly says:

      Extremely small p-values are less surprising and less interesting than they seem at first glance.
      Recall what a p-value means: it is the probability of seeing a result at least as extreme as the one I observed, if (a) the null hypothesis were true AND (b) all the other assumptions of my model are also true.
      Recall what p-values are NOT: p-values are NOT the probability that the null hypothesis is false.

      For example, with depression being caused by negative life events, the p-value only means that it is extremely unlikely that the measured depression outcome has exactly the same mean for people with and without negative life events.

      • robdonnelly says:

        I’m curious what the smallest p-value is that anyone has seen published. The smallest I know of is https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2701578, which finds that online ads increase visits to the advertised website (p < 10^-212).

          • daniel-wells says:

            I don’t have enough reputation points to comment on cross-validated, but this paper (doi.org/10.1126/science.aau1043), published in Science in January, has a p-value of 3.6e-2382. Not a typo: the exponent is over two thousand.

          • van_doerbsten says:

            I’m not an expert on statistical analysis, but how do people even compute such low values? Given that machine accuracy at double precision is around 1e-16, how can you distinguish these results from roundoff errors?

          • amoeba says:

            @daniel-wells: Wow, thanks a lot. I thought it must be a typo (in the paper) so I emailed the first author. He replied that it is _not_ a typo. Apparently, the chi-square statistic was converted into a p-value via `pchisq(…, log.p=TRUE)` and then the outcome was exponentiated.

          • slabs says:

            Apparently, the chi-square statistic was converted into a p-value via `pchisq(…, log.p=TRUE)` and then the outcome was exponentiated.

            In R? This doesn’t quite work as written. pchisq defaults to the lower tail, so you also need lower.tail=FALSE. And even then, if you try

            exp(pchisq(3000, df=1, lower.tail=FALSE, log.p=TRUE))

            you just get

            [1] 0

            because the log p-value is about -1504 and exp() underflows to zero long before that: R’s smallest normal double is 2.225074e-308. Presumably the log p-value was converted to base 10 directly rather than literally exponentiated.
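            For what it’s worth, the underlying log-space trick is easy to reproduce outside R. For one degree of freedom the chi-square upper tail equals erfc(sqrt(x/2)), and log erfc has a simple large-argument asymptotic, so the log10 p-value is computable long after the p-value itself has underflowed. A minimal Python sketch (the statistic values below are illustrative, not taken from the paper):

```python
import math

def log10_chisq_sf_df1(x):
    """log10 of the upper-tail p-value of a chi-square statistic x with
    df = 1, via the large-z asymptotic erfc(z) ~ exp(-z*z)/(z*sqrt(pi))
    with z = sqrt(x/2). Never underflows, unlike the p-value itself."""
    z2 = x / 2.0
    ln_p = -z2 - 0.5 * math.log(z2) - 0.5 * math.log(math.pi)
    return ln_p / math.log(10)

# Moderate statistic: agrees with the directly computed tail,
# log10(erfc(sqrt(50/2))) ≈ -11.8.
moderate = log10_chisq_sf_df1(50)

# Large statistic: a chi-square of 3000 with df = 1 corresponds to
# log10 p ≈ -653, far beyond the smallest positive double (~1e-308).
large = log10_chisq_sf_df1(3000)
```

            Divide a log p-value like this by 1 (i.e. report it as a base-10 exponent) and you can publish numbers like 1e-2382 without the floating-point hardware ever representing them.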

      • Bugmaster says:

        Agreed. For example, our company does some bioinformatics research, and we often get ridiculously low p-values that are computed mostly from the results of BLAST and other sequence alignment tools (in addition to our own inputs). When these tools say, “the p-value of this alignment is 1e-107”, everyone knows that this should always be followed by, “…given the extremely simplistic model of organic biochemistry that this tool is using, and by the way, no one in the world has the right model anyway, because if we did, all of this research would’ve become obsolete on the spot”.

        In other words, these p-values can only be understood as a relative measure. We can use them to rank the reliability of sequence alignments (produced by this one specific tool) against each other; we cannot use them to draw any grand cosmic conclusions about the Universe at large.

      • ProbablyMatt says:

        I’m not an expert in this type of work but I’d like to emphasize the point about the other assumptions in the model being true. Usually to obtain a p-value researchers have to either assume some tractable distribution for their test statistic or else invoke some asymptotic result to the effect that (still under some assumptions) the test statistic in question will have a distribution that is “close” to a tractable one (usually normal) for “large sample sizes”.

        If you get a very small p-value, it means you’ve observed a value of the test statistic that is very far into the “tail” of the distribution (i.e. away from where you expect the bulk of the outcomes to be if the null is true). The tail is exactly where differences between the “actual distribution” of the test statistic and whatever distribution you assumed for it become important. So in any practical circumstance such a small p-value should probably be interpreted as an artifact of a modeling assumption.

        Disclaimer: as I said I am not an expert on this area of application and there is a chance researchers used some kind of non-parametric test or otherwise tried to address these issues, but I doubt it.

      • arch1 says:

        “p-values are NOT the probability that the null hypothesis is false.”

        Did you mean to say “true” rather than “false”?

        • HeelBearCub says:

          No. Very small p values mean a low probability the results are due to random chance. The null hypothesis is the opposite of your hypothesis, so “false” is the correct word.

          • David Friedman says:

            No. Very small p values mean a low probability the results are due to random chance.

            That’s not quite correct, although it is a common misunderstanding. It is true that the lower the p value, ceteris paribus, the lower the probability that the result was due to random chance. But you could have quite a low p value and still have quite a high probability that the result was due to random chance.

            The p value is the probability that the evidence for your conjecture would be as strong as it is if the null hypothesis were true–hence your conjecture false (in the particular way defined by the null hypothesis). The probability of the results conditional on the null hypothesis is not the same as the probability of the null hypothesis conditional on the results.

            Consider the following simple example. I pull a coin at random from my pocket. My theory is that it is double headed. The null hypothesis is that it is a fair coin. I flip it twice, get heads both times.

            The probability of that result, conditional on the null hypothesis, is .25. It does not follow that the probability the coin is fair is only .25. If I get five heads the p value is 1/32 < .05, but the odds that it is a double-headed coin are still quite low, given how rare such coins are.

            Going back to the original statement, p-values are not the probability that the null hypothesis is false. Nor that it is true.
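            The coin example can be made quantitative with one line of Bayes’ rule; the prior on having pulled a double-headed coin is an illustrative assumption, not a real estimate:

```python
def posterior_double_headed(n_heads, prior=1e-4):
    """Posterior probability that the pulled coin is double-headed after
    seeing n_heads heads in n_heads flips. A double-headed coin always
    shows heads; a fair coin does so with probability (1/2)^n_heads."""
    like_fair = 0.5 ** n_heads
    return prior / (prior + like_fair * (1 - prior))

# Five heads: the p-value is 1/32 ≈ 0.03, "significant" at .05, yet
# with a 1-in-10,000 prior the posterior is still only about 0.3%.
after_five = posterior_double_headed(5)
```

            A “significant” p-value and a tiny posterior probability of the alternative can coexist comfortably, which is the whole point of the example.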

        • arch1 says:

          Thanks to David F for confirming my belief that while both versions are true, only the modified version is non-obvious (thus more worth stating).

      • eric23 says:

        Yes, it’s like the “once in 500 years” weather events (equivalent to p = 0.002). In some places, these events now occur about once a decade. Why? Because p = 0.002 was true for the previous climate, and the climate has changed.

    • A1987dM says:

      For a Gaussian distribution, p = 10^-143 means a 25.5 sigma significance, i.e. you’re measuring the quantity with 3.9% relative uncertainty. The anomalous magnetic moment of the electron has been measured to be greater than zero with a 4 billion sigma significance, corresponding to a p-value of about 10^-3.5e18.

      Then there are confidence levels outside the argument, but there always are.
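      The p-to-sigma conversion here can be checked by inverting the Gaussian tail bound with a short fixed-point iteration (a sketch; “sigma” means the two-sided z-score):

```python
import math

def log10p_to_sigma(log10_p):
    """Convert a two-sided log10 p-value to the equivalent Gaussian
    z-score, inverting the tail approximation
    p ~ sqrt(2/pi) * exp(-z^2/2) / z by fixed-point iteration.
    Accurate for large z (i.e. very small p)."""
    ln_p = log10_p * math.log(10)
    z = math.sqrt(-2.0 * ln_p)  # initial guess ignores the prefactor
    for _ in range(50):
        z = math.sqrt(-2.0 * (ln_p - math.log(math.sqrt(2.0 / math.pi) / z)))
    return z

# log10 p = -143 gives z ≈ 25.5, matching the comment; log10 p = -3.5e18
# comes back as roughly 4e9 sigma, also as stated.
sigma_143 = log10p_to_sigma(-143)
```

      Note the conversion is nearly logarithmic: halving the exponent of p only shrinks the sigma count by a factor of about sqrt(2).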

    • gattsuru says:

      Yeah, I realize that these are different contexts, that this output was the numerically correct result for its particular statistical method, and that the geneticists’ study is certainly of higher accuracy than the original, but I flinched a little at that number. It’s something like seventy more zeroes than the one used in Stop Adding Zeroes, where we were told nothing is ever 10^-66 and you should never use that number. Ethernet doesn’t guarantee that level of accuracy over a single hop. I don’t think we’re talking about an extent where spontaneous quantum failure starts to matter, but it’s not that far off.

      It proves a point, and I guess that’s the point, but I don’t think it’s a good practice to use.