<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Slate Star Codex &#187; statistics</title>
	<atom:link href="http://slatestarcodex.com/tag/statistics/feed/" rel="self" type="application/rss+xml" />
	<link>http://slatestarcodex.com</link>
	<description>In a mad world, all blogging is psychiatry blogging</description>
	<lastBuildDate>Fri, 24 Jul 2015 02:59:17 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=4.2.3</generator>
	<item>
		<title>That Chocolate Study</title>
		<link>http://slatestarcodex.com/2015/05/30/that-chocolate-study/</link>
		<comments>http://slatestarcodex.com/2015/05/30/that-chocolate-study/#comments</comments>
		<pubDate>Sun, 31 May 2015 01:44:12 +0000</pubDate>
		<dc:creator><![CDATA[Scott Alexander]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[medicine]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://slatestarcodex.com/?p=3658</guid>
		<description><![CDATA[Several of you asked me to write about that chocolate article that went viral recently. From I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here&#8217;s How: “Slim by Chocolate!” the headlines blared. A team of German researchers had found &#8230; <a href="http://slatestarcodex.com/2015/05/30/that-chocolate-study/">Continue reading <span class="pjgm-metanav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Several of you asked me to write about that chocolate article that went viral recently. From <A HREF="http://io9.com/i-fooled-millions-into-thinking-chocolate-helps-weight-1707251800">I Fooled Millions Into Thinking Chocolate Helps Weight Loss. Here&#8217;s How</A>:<br />
<blockquote>“Slim by Chocolate!” the headlines blared. A team of German researchers had found that people on a low-carb diet lost weight 10 percent faster if they ate a chocolate bar every day. It made the front page of Bild, Europe’s largest daily newspaper, just beneath their update about the Germanwings crash. From there, it ricocheted around the internet and beyond, making news in more than 20 countries and half a dozen languages. It was discussed on television news shows. It appeared in glossy print, most recently in the June issue of Shape magazine (“Why You Must Eat Chocolate Daily,” page 128). Not only does chocolate accelerate weight loss, the study found, but it leads to healthier cholesterol levels and overall increased well-being. The Bild story quotes the study’s lead author, Johannes Bohannon, Ph.D., research director of the Institute of Diet and Health: “The best part is you can buy chocolate everywhere.”</p>
<p>I am Johannes Bohannon, Ph.D. Well, actually my name is John, and I’m a journalist. I do have a Ph.D., but it’s in the molecular biology of bacteria, not humans. The Institute of Diet and Health? That’s nothing more than a website.</p>
<p>Other than those fibs, the study was 100 percent authentic. My colleagues and I recruited actual human subjects in Germany. We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data. It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded.</p></blockquote>
<p>Bohannon goes on to explain that as part of a documentary about &#8220;the junk-science diet industry&#8221;, he and some collaborators designed a fake study to see if they could convince journalists. They chose to make it about chocolate:<br />
<blockquote>Gunter Frank, a general practitioner in on the prank, ran the clinical trial. Onneken had pulled him in after reading a popular book Frank wrote railing against dietary pseudoscience. Testing bitter chocolate as a dietary supplement was his idea. When I asked him why, Frank said it was a favorite of the “whole food” fanatics. “Bitter chocolate tastes bad, therefore it must be good for you,” he said. “It’s like a religion.”</p></blockquote>
<p>They recruited 16 (!) participants and divided them into three groups. One group ate their normal diet. Another ate a low-carb diet. And a third ate a low-carb diet plus some chocolate. Both the low-carb group and the low-carb + chocolate group lost weight compared to the control group, but the low-carb + chocolate group lost weight &#8220;ten percent faster&#8221;, and the difference was &#8220;statistically significant&#8221;. They also had &#8220;better cholesterol readings&#8221; and &#8220;higher scores on the well-being survey&#8221;.</p>
<p>Bohannon admits exactly how he managed this seemingly impressive result &#8211; he measured eighteen different parameters (weight, cholesterol, sodium, protein, etc) which virtually guarantees that one will be statistically significant. That one turned out to be weight loss. If it had been sodium, he would have published the study as &#8220;Chocolate Lowers Sodium Levels&#8221;.</p>
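<p>The arithmetic behind &#8220;virtually guarantees&#8221; is worth seeing. A minimal sketch (assuming the eighteen measures are independent; correlated measures would lower the number somewhat):</p>

```python
import random

# With 18 independent measures each tested at alpha = 0.05, the chance
# that at least one comes up "significant" by luck alone:
alpha, n_tests = 0.05, 18
p_at_least_one = 1 - (1 - alpha) ** n_tests
print(round(p_at_least_one, 2))  # about 0.6

# Monte Carlo check: simulate many null "studies" with no real effects.
random.seed(0)
trials = 20_000
hits = sum(
    any(random.random() < alpha for _ in range(n_tests))
    for _ in range(trials)
)
print(round(hits / trials, 2))
```

<p>So even with no real effects anywhere, Bohannon had roughly three-to-two odds of landing a headline.</p>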
<p>Then he pitched it to various fake for-profit journals until one of them bit. Then he put out a PR release to various media outlets, and they ate it up. They ended up in a bunch of English and German language media including Bild, the Daily Star, Times of India, Cosmopolitan, Irish Examiner, and the Huffington Post.</p>
<p>The people I&#8217;ve seen discussing this seem to have drawn five conclusions, four of which are wrong:</p>
<p><b>Conclusion 1: Haha, I can&#8217;t believe people were so gullible that they actually thought chocolate caused weight loss!</b></p>
<p>Bohannon himself endorses this one, saying bitter chocolate was a favorite of &#8220;whole food fanatics&#8221; because &#8220;Bitter chocolate tastes bad, therefore it must be good for you” and “it’s like a religion.</p>
<p>But actually, there&#8217;s lots of previous research supporting health benefits from bitter chocolate, none of which Bohannon seems to be aware of.</p>
<p>A <A HREF="http://ajcn.nutrition.org/content/95/3/740.full">meta-analysis</A> of 42 randomized controlled trials totaling 1297 participants in the <i>American Journal of Clinical Nutrition</i> found that chocolate improved blood pressure, flow-mediated dilatation (a measure of vascular health), and insulin resistance (related to weight gain).</p>
<p>A <A HREF="http://jn.nutrition.org/content/141/11/1982.short">different meta-analysis</A> of 24 randomized controlled trials totaling 1106 people in the <i>Journal of Nutrition</i> also found that chocolate improved blood pressure, flow-mediated dilatation, and insulin resistance.</p>
<p>A <A HREF="http://www.cochrane.org/CD008893/HTN_effect-of-cocoa-on-blood-pressure">Cochrane Review</A> of 20 randomized controlled trials of 856 people found that chocolate improved blood pressure (it didn&#8217;t test for flow-mediated dilatation or insulin resistance).</p>
<p>A <A HREF="http://pubs.acs.org/doi/abs/10.1021/jf500333y">study on mice</A> found that mice fed more chocolate flavanols were less likely to gain weight.</p>
<p>An <A HREF="http://archinte.jamanetwork.com/article.aspx?articleid=1108800">epidemiological study</A> of 1018 people in the United States found an association between frequent chocolate consumption and lower BMI, p &lt; 0.01.</p>
<p>A <A HREF="http://www.ncbi.nlm.nih.gov/pubmed/24139727">second epidemiological study</A> of 1458 people in Europe found the same thing, again p &lt; 0.01.</p>
<p>A <A HREF="http://archinte.jamanetwork.com/article.aspx?articleid=409867">cohort study</A> of 470 elderly men found chocolate intake was inversely associated with blood pressure and cardiovascular mortality, p &lt; 0.001, not confounded by the usual suspects.</p>
<p>I wouldn&#8217;t find any of these studies alone very convincing. But together, they compensate for each other&#8217;s flaws and build a pretty robust structure. So the next flawed conclusion is:</p>
<p><b>Conclusion 2: This proves that nutrition isn&#8217;t a real science and we should all just be in a state of radical skepticism about these things</b></p>
<p>What we would like to do is a perfect study where we get thousands of people, randomize them to eat-lots-of-chocolate or eat-little-chocolate at birth, then follow their weights over their entire lives. That way we could have a large sample size, perfect randomization, life-long followup, and clear applicability to other people. But for practical and ethical reasons, we can&#8217;t do that. So we do a bunch of smaller studies that each capture a few of the features of the perfect study. </p>
<p>First we do animal studies, which can have large sample sizes, perfect randomization, and life-long followup, but it&#8217;s not clear whether it applies to humans. </p>
<p>Then we do short randomized controlled trials, which can have large sample sizes, perfect randomization, and human applicability, but which only last a couple of months.</p>
<p>Then we do epidemiological studies, which can have large sample sizes, human applicability, and last for many decades, but which aren&#8217;t randomized very well and might be subject to confounders.</p>
<p>This is what happened in the chocolate studies above. Mice fed a strict diet plus chocolate for a long time gain less weight than mice fed the strict diet alone. This is suggestive, but we don&#8217;t know if it applies to humans. So we find that in randomized controlled trials, chocolate helps with some proxies for weight gain like insulin resistance. This is even more suggestive, but we don&#8217;t know if it lasts. So we find that in epidemiological studies, lifetime chocolate consumption is associated with lifetime good health outcomes. This on its own is suggestive but potentially confounded, but when we combine them with all of the others, they become more convincing.</p>
<p>(am I cheating by combining blood pressure and BMI data? Sort of, but the two measures are correlated)</p>
<p>When all of these paint the same picture, then we start thinking that maybe it&#8217;s because our hypothesis is true. Yes, maybe the mouse studies could be related to a feature of mice that doesn&#8217;t generalize to humans, <i>and</i> the randomized controlled trial results wouldn&#8217;t hold up after a couple of years, <i>and</i> the epidemiological studies are confounded. But that would be extraordinarily bad luck. More likely they&#8217;re all getting the same result because they&#8217;re all tapping into the same underlying reality.</p>
<p>This is the way science usually works, it&#8217;s the way nutrition science usually works, and it&#8217;s the way the science of whether chocolate causes weight gain usually works. These are not horrible corrupt disciplines made up entirely of shrieking weight-loss-pill peddlers trying to hawk their wares. They only turn into that when the media takes a single terrible study totally out of context and misrepresents the field.</p>
<p><b>Conclusion 3: Studies Always Need To Have High Sample Sizes</b></p>
<p>Here&#8217;s another good chocolate-related study: <A HREF="http://ajcn.nutrition.org/content/81/3/611.short">Short-term administration of dark chocolate is followed by a significant increase in insulin sensitivity and a decrease in blood pressure in healthy persons</A>.</p>
<p>Bohannon says:<br />
<blockquote>Our study was doomed by the tiny number of subjects, which amplifies the effects of uncontrolled factors&#8230;Which is why you need to use a large number of people, and balance age and gender across treatment group</p></blockquote>
<p>But I say &#8220;Short-term administration&#8230;&#8221; is a good study despite having an n = 15, one <i>less</i> than the Bohannon study. Why? Well, their procedure was pretty involved, and you wouldn&#8217;t be able to get a thousand people to go through the whole rigamarole. On the other hand, their insulin resistance measure thing was nearly twice as high in the dark chocolate group as the white chocolate group, and p &lt; 0.001.</p>
<p>(Another low sample size study that was nevertheless very good: psychiatrists knew that consuming dietary tyramine when taking an MAOI antidepressant can cause a life-threatening hypertensive crisis, but they didn&#8217;t know <i>how much</i> tyramine it took. In order to find out, they took a dozen people, put them on MAOIs, and then gradually fed them more and more tyramine with doctors standing by to treat the crisis as soon as it started. They found about how much tyramine it took and declared the experiment a success. If the tyramine levels were about the same in all twelve patients, then adding a thousand more patients wouldn&#8217;t help much, and it would definitely increase the risk.)</p>
<p>Sample size is important when you&#8217;re trying to detect a small effect in the middle of a large amount of natural variation. When you&#8217;re looking for a large effect in the middle of no natural variation, sample size doesn&#8217;t matter as much. For example, if there was a medicine that would help amputees grow their hands back, I would accept success with a single patient (if it worked) as proof of effectiveness (I suppose I couldn&#8217;t be sure it would <i>always</i> work until more patients had been tried, but a single patient would certainly pique my interest). You&#8217;re not going after sample size so much as after p-value.</p>
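<p>There&#8217;s a standard back-of-the-envelope version of this tradeoff. Lehr&#8217;s approximation says a two-group trial needs roughly 16/d&#178; subjects per group for 80% power at two-sided p &lt; 0.05, where d is the effect size in standard deviations; a sketch:</p>

```python
import math

# Lehr's rule of thumb: per-group n for 80% power at two-sided
# alpha = 0.05 in a two-sample comparison is about 16 / d**2.
def n_per_group(d):
    return math.ceil(16 / d ** 2)

print(n_per_group(0.2))  # subtle effect: 400 per group
print(n_per_group(0.5))  # medium effect: 64 per group
print(n_per_group(2.0))  # hands-growing-back effect: 4 per group
```

<p>Huge effects need almost no subjects; tiny effects buried in natural variation need hundreds.</p>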
<p><b>Conclusion 4: P-Values Are Stupid And We Need To Get Rid Of Them</b></p>
<p>Bohannon says that:<br />
<blockquote>If you measure a large number of things about a small number of people, you are almost guaranteed to get a “statistically significant” result&#8230;the letter p seems to have totemic power, but it’s just a way to gauge the signal-to-noise ratio in the data&#8230;scientists are getting wise to these problems. Some journals are trying to phase out p value significance testing altogether to nudge scientists into better habits.</p></blockquote>
<p>Okay, take the &#8220;Short-term administration&#8221; study above. I would like to be able to say that since it has p &lt; 0.001, we know it&#8217;s significant. But suppose we&#8217;re not allowed to do p-values. All I do is tell you &#8220;Yeah, there was a study with fifteen people that found chocolate helped with insulin resistance&#8221; and you laugh in my face.</p>
<p>Effect size is supposed to help with that. But suppose I tell you &#8220;There was a study with fifteen people that found chocolate helped with insulin resistance. The effect size was 0.6.&#8221; I don&#8217;t have any intuition at all for whether or not that&#8217;s consistent with random noise. Do you?</p>
<p>Okay, <i>then</i> they say we&#8217;re supposed to report confidence intervals. The effect size was 0.6, with 95% confidence interval of [0.2, 1.0]. Okay. So I check the lower bound of the confidence interval, I see it&#8217;s different from zero. But now I&#8217;m not transcending the p-value. I&#8217;m just using the p-value by doing a sort of kludgy calculation of it myself &#8211; &#8220;95% confidence interval does not include zero&#8221; is the same as &#8220;p value is less than 0.05&#8221;.</p>
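<p>The &#8220;kludgy calculation&#8221; can be made explicit. Assuming a normal sampling distribution, the effect size and confidence interval above pin down the p-value exactly (a sketch; the 0.6 and [0.2, 1.0] are the hypothetical numbers from the text):</p>

```python
import math

# Effect 0.6 with 95% CI [0.2, 1.0]: the CI half-width is 1.96 * SE,
# so the SE, the z-score, and the two-sided p-value all fall out.
effect, lo, hi = 0.6, 0.2, 1.0
se = (hi - lo) / (2 * 1.96)
z = effect / se
p = math.erfc(z / math.sqrt(2))  # two-sided normal p-value
print(round(z, 2), round(p, 4))  # 2.94 0.0033
```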
<p>(Imagine that, although I know the 95% confidence interval doesn&#8217;t include zero, I start wondering if the 99% confidence interval does. <i>If only</i> there were some statistic that would give me this information!)</p>
<p>But wouldn&#8217;t getting rid of p-values prevent &#8220;p-hacking&#8221;? Maybe, but it would just give way to &#8220;d-hacking&#8221;. You don&#8217;t think you could test for twenty different metabolic parameters and only report the one with the highest effect size? The only difference would be that p-hacking is completely transparent &#8211; if you do twenty tests and report a p of 0.05, I know you&#8217;re an idiot &#8211; but d-hacking would be inscrutable. If you do twenty tests and report that one of them got a d = 0.6, is that impressive? No better than chance? I have no idea. I bet there&#8217;s some calculation I could do to find out, but I also bet that it would be a lot harder than just multiplying the value by the number of tests and seeing what happens. [EDIT: On reflection not sure this is true; the possibility of p-hacking is inherent to p-values, but the possibility of d-hacking isn&#8217;t inherent to effect size. I don&#8217;t actually know how much this would matter in the real world.]</p>
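<p>The calculation isn&#8217;t actually too bad to simulate. In this sketch (pure noise, n = 15 per group, twenty outcome measures; all numbers hypothetical), the median <i>largest</i> |d| among the twenty tests comes out around 0.75&#8211;0.8, so a cherry-picked d = 0.6 is, if anything, <i>less</i> than chance would deliver:</p>

```python
import random, statistics

# Null simulation: twenty outcome measures, n = 15 per group, no real
# effects anywhere. How big is the largest effect size d by chance?
random.seed(3)
n, n_tests, sims = 15, 20, 2_000
max_ds = []
for _ in range(sims):
    ds = []
    for _ in range(n_tests):
        a = [random.gauss(0, 1) for _ in range(n)]
        b = [random.gauss(0, 1) for _ in range(n)]
        pooled_sd = statistics.stdev(a + b)  # rough pooled SD
        d = abs(statistics.fmean(a) - statistics.fmean(b)) / pooled_sd
        ds.append(d)
    max_ds.append(max(ds))
print(round(statistics.median(max_ds), 2))
```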
<p>But wouldn&#8217;t switching from p-values to effect sizes prevent people from making a big deal about tiny effects that are nevertheless statistically significant? Yes, but sometimes we <i>want</i> to make a big deal about tiny effects that are nevertheless statistically significant! Suppose that Coca-Cola is testing a new product additive, and finds in large epidemiological studies that it causes one extra death per hundred thousand people per year. That&#8217;s an effect size of approximately zero, but it might still be statistically significant. And since about a billion people worldwide drink Coke each year, that&#8217;s ten thousand deaths a year. If Coke said &#8220;Nope, effect size too small, not worth thinking about&#8221;, they would kill almost two milli-Hitlers worth of people.</p>
<p>Yeah, sure, you can never use p-values again, and run into all of these other problems. Or you can do a Bonferroni correction, which is a very simple adjustment to p-values which corrects for p-hacking. <i>Or</i> instead of taking one study at face value LIKE AN IDIOT you can wait to see if other studies replicate the findings. Remember, the whole point of p-hacking is choosing at random from a bunch of different outcomes, so if two trials both try to p-hack, they&#8217;ll end up with different outcomes and the game will be up. Seriously, <A HREF="http://slatestarcodex.com/2014/12/12/beware-the-man-of-one-study/">STOP TRYING TO BASE CONCLUSIONS ON ONE STUDY</A>.</p>
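<p>For reference, the Bonferroni correction really is that simple (a sketch; the raw p-values below are made up):</p>

```python
def bonferroni(p_values):
    """Multiply each raw p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# A cherry-picked p = 0.04 from eighteen measures stops looking special:
raw = [0.04] + [0.5] * 17
corrected = bonferroni(raw)
print(round(corrected[0], 2))  # 0.72
```

<p>It&#8217;s conservative (it over-corrects when the tests are correlated), but it errs in the safe direction.</p>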
<p><b>Conclusion 5: Trust Science Journalism Less</b></p>
<p>This is the one that&#8217;s correct.</p>
<p>But it&#8217;s not totally correct. Bohannon boasts of getting his findings in a couple of daily newspapers and the Huffington Post. That&#8217;s not exactly the cream of the crop. <i>The Economist</i> usually has excellent science journalism. Magazines like <i>Scientific American</i> and <i>Discover</i> can be okay, although even they get hyped. Reddit&#8217;s r/science is good, assuming you make sure to always check the comments. And there are individual blogs like <A HREF="http://blogs.plos.org/mindthebrain/">Mind the Brain</A> run by researchers in the field that can usually be trusted near-absolutely. Cochrane Collaboration will always have among the best analyses on everything.</p>
<p>If you really want to know what&#8217;s going on and can&#8217;t be bothered to ferret out all of the brilliant specialists, my highest recommendation goes to Wikipedia. It isn&#8217;t perfect, but compared to anything you&#8217;d find on a major news site, it&#8217;s like night and day. Wikipedia&#8217;s <A HREF="http://en.wikipedia.org/wiki/Health_effects_of_chocolate">Health Effects Of Chocolate</A> page is pretty impressive and backs everything it says up with good meta-analyses and studies in the best journals. Its sentence on the cardiovascular effects links to <A HREF="https://mospace.umsystem.edu/xmlui/bitstream/handle/10355/7720/DoesChocolateHaveBenefits.pdf?sequence=1">this letter</A>, which is very good.</p>
<p>Do you know why you can trust Wikipedia better than news sites? Because Wikipedia <i>doesn&#8217;t obsess over the single most recent study</i>. Are you starting to notice a theme?</p>
<p>For me, the takeaway from this affair is that there is no one-size-fits-all solution to make statistics impossible to hack. Getting rid of p-values is appropriate sometimes, but not other times. Demanding large sample sizes is appropriate sometimes, but not other times. Not trusting silly conclusions like &#8220;chocolate causes weight loss&#8221; works sometimes but not other times. At the end of the day, you have to actually know what you&#8217;re doing. Also, <i>try to read more than one study</i>.</p>
]]></content:encoded>
			<wfw:commentRss>http://slatestarcodex.com/2015/05/30/that-chocolate-study/feed/</wfw:commentRss>
		<slash:comments>185</slash:comments>
		</item>
		<item>
		<title>Beware Summary Statistics</title>
		<link>http://slatestarcodex.com/2015/05/19/beware-summary-statistics/</link>
		<comments>http://slatestarcodex.com/2015/05/19/beware-summary-statistics/#comments</comments>
		<pubDate>Wed, 20 May 2015 01:25:10 +0000</pubDate>
		<dc:creator><![CDATA[Scott Alexander]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://slatestarcodex.com/?p=3643</guid>
		<description><![CDATA[Last night I asked Tumblr two questions that had been bothering me for a while and got some pretty good answers. I. First, consider the following paragraph from JRank: Terrie Moffitt and colleagues studied 4,552 Danish men born at the &#8230; <a href="http://slatestarcodex.com/2015/05/19/beware-summary-statistics/">Continue reading <span class="pjgm-metanav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Last night I asked Tumblr two questions that had been bothering me for a while and got some pretty good answers.</p>
<p><b>I.</b></p>
<p>First, consider the following paragraph from <A HREF="http://law.jrank.org/pages/1363/Intelligence-Crime-Measuring-size-IQ-crime-correlation.html">JRank</A>:<br />
<blockquote>Terrie Moffitt and colleagues studied 4,552 Danish men born at the end of World War II. They examined intelligence test scores collected by the Danish army (for screening potential draftees) and criminal records drawn from the Danish National Police Register. The men who committed two or more criminal offenses by age twenty had IQ scores on average a full standard deviation below nonoffenders, and IQ and criminal offenses were significantly and negatively correlated at r = -.19.</p></blockquote>
<p>Repeat offenders are 15 IQ points &#8211; an entire standard deviation &#8211; below the rest of the population. This matches common sense, which suggests that serial criminals are not the brightest members of society. It sounds from this like IQ is a very important predictor of crime.</p>
<p>But r = &#8211;0.19 suggests that only about 3.6% of the variance in crime (r&#178; = 0.19&#178; &#8776; 0.036) is predicted by IQ. 3.6% is nothing. It sounds from this like IQ barely matters at all in predicting crime.</p>
<p>This isn&#8217;t a matter of conflicting studies: these are two ways of describing the same data. What gives?</p>
<p>The best answer I got was from <A HREF="http://pappubahry2.tumblr.com/">pappubahry2</A>, who posted the following made-up graph:</p>
<p><center><IMG SRC="http://36.media.tumblr.com/d81761fb815b798526a99995de93ad8e/tumblr_nokvzvYGR51s0g8t7o1_500.png"></center></p>
<p>Here <i>all</i> crime is committed by low IQ individuals, but the correlation between IQ and crime is still very low, r = 0.16. The reason is simple: very few people, including very few low-IQ people, commit crimes. r is kind of a mishmash of p(low IQ|criminal) and p(criminal|low IQ), and the latter may be very low even when all criminals are from the lower end of the spectrum.</p>
<p>The advice some people on Tumblr gave was to beware summary statistics. &#8220;IQ only predicts 3.6% of variance in crime&#8221; makes it sound like IQ is nearly irrelevant to criminality, but in fact it&#8217;s perfectly consistent with IQ being a very strong predictive factor.</p>
<p><b>II.</b></p>
<p>So I pressed my luck with the following question:</p>
<p><center><IMG SRC="http://41.media.tumblr.com/8d6ffde16483fb57a7c766fac796a4e1/tumblr_inline_nokx2puLFI1skdjyu_400.png"></p>
<p><i>I&#8217;m not sure why everyone&#8217;s income on this graph is so much higher than average US per capita of $30,000ish, or even average white male income of $31,000ish. I think it might be the &#8216;age 40 to 50&#8217; specifier.</i></center></p>
<p>This graph suggests IQ is an important determinant of income. But most studies say the correlation between IQ and income is at most 0.4 or so, or 16% of the variance, suggesting it&#8217;s a very minor determinant of income. Most people are earning an income, so the too-few-criminals explanation from above doesn&#8217;t apply. Again, what gives?</p>
<p>The best answer I got for this one was from <A HREF="http://su3su2u1.tumblr.com/">su3su2u1</A>, who pointed out that there was probably very high variance within the individual deciles. Pappubahry made some more graphs to demonstrate:</p>
<p><center><IMG SRC="http://40.media.tumblr.com/fed14aca6749f595aa25c0fa5781aec2/tumblr_nokyh6Hysg1s0g8t7o2_500.png"></p>
<p><IMG SRC="http://41.media.tumblr.com/20696c6d3dc63fb884bba4e959333e27/tumblr_nokyh6Hysg1s0g8t7o1_500.png"></center></p>
<p>I understand this one intellectually, but I still haven&#8217;t gotten my head around it. Regardless of the amount of variance, going from a category where I can expect to make on average $40,000 to a category where I can expect to make on average $160,000 seems like a pretty big deal, and describing it as &#8220;only predicting 16% of the variation&#8221; seems patently unfair.</p>
<p>I guess the moral is the same as the moral in the first situation: beware summary statistics. Based on the way you explain things, you can use different summary statistics to make things look very important or not important at all. And as a bunch of people recommended to me: when in doubt, demand to see the scatter plot.</p>
]]></content:encoded>
			<wfw:commentRss>http://slatestarcodex.com/2015/05/19/beware-summary-statistics/feed/</wfw:commentRss>
		<slash:comments>215</slash:comments>
		</item>
		<item>
		<title>Growth Mindset 4: Growth Of Office</title>
		<link>http://slatestarcodex.com/2015/05/07/growth-mindset-4-growth-of-office/</link>
		<comments>http://slatestarcodex.com/2015/05/07/growth-mindset-4-growth-of-office/#comments</comments>
		<pubDate>Fri, 08 May 2015 02:21:20 +0000</pubDate>
		<dc:creator><![CDATA[Scott Alexander]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[psychology]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://slatestarcodex.com/?p=3633</guid>
		<description><![CDATA[Previously In Series: No Clarity Around Growth Mindset&#8230;Yet // I Will Never Have The Ability To Clearly Explain My Beliefs About Growth Mindset // Growth Mindset 3: A Pox On Growth Your Houses Last month I criticized a recent paper, &#8230; <a href="http://slatestarcodex.com/2015/05/07/growth-mindset-4-growth-of-office/">Continue reading <span class="pjgm-metanav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><i>Previously In Series: <A HREF="http://slatestarcodex.com/2015/04/08/no-clarity-around-growth-mindset-yet/">No Clarity Around Growth Mindset&#8230;Yet</A> // <A HREF="http://slatestarcodex.com/2015/04/10/i-will-never-have-the-ability-to-clearly-explain-my-beliefs-about-growth-mindset/">I Will Never Have The Ability To Clearly Explain My Beliefs About Growth Mindset</A> // <A HREF="http://slatestarcodex.com/2015/04/22/growth-mindset-3-a-pox-on-growth-your-houses/">Growth Mindset 3: A Pox On Growth Your Houses</A></i></p>
<p>Last month I criticized a recent paper, Paunesku et al&#8217;s <A HREF="http://slatestarcodex.com/Stuff/mindset3_paper.pdf">Mindset Interventions Are A Scalable Treatment For Academic Underachievement</A>, saying that it spun a generally pessimistic set of findings about growth mindset into a generally optimistic headline.</p>
<p>Earlier today, lead author Dr. Paunesku was kind enough to write a very thorough reply, which I reproduce below:</p>
<p><b>I.</b></p>
<p>Hi Scott,</p>
<p>Thanks for your provocative blog post about my work (I&#8217;m the first author of the paper you wrote about). I&#8217;d like to take a few moments to respond to your critiques, but first I&#8217;d like to frame my response and tell you a little bit about my own motivation and that of the team I am a member of (<A HREF="https://www.perts.net/">PERTS</A>).</p>
<p>Good criticism is what makes science work. We are critical of our own work, but we are happy to have help. Often critics are not thoughtful or specific. So I very much appreciate the intent of your blog (to be thoughtful and specific).</p>
<p>What is our motivation? We are trying to improve our education system so that all students can thrive. If growth mindset is effective, we want it in every classroom possible. If it is ineffective, we want to know about it so we don&#8217;t waste people&#8217;s time. If it is effective for some students in some classrooms, we want to know where and for whom so that we can help those students.</p>
<p>What is our history and where are we now? PERTS approached social psychological interventions with a fair amount of skepticism at first. In many ways, they seemed too good to be true. But, we thought, &#8220;if this is true, we should do everything we can to spread it&#8221;. Our work over the last 5 years has been devoted to trying to see if the results that emerged from initial, small experiments (like Aronson et al., 2002 and Blackwell et al., 2007) would continue to be effective when scaled. The paper you are critiquing is a step in that process &#8212; not the end of the process. We are continuing research to see where, for whom, and at what scale social psychological approaches to improving education outcomes can be effective.</p>
<p>How do I intend to respond to your criticisms? In some cases, your facts or interpretations are simply incorrect, and I will try to explain why. I also invite you to contact me for follow up. In others cases, we simply have different opinions about what&#8217;s important, and we&#8217;ll have to agree to disagree. Regardless, I appreciate your willingness to be bold and specific in your criticism. I think that&#8217;s brave, and I think such bravery makes science stronger. </p>
<p><u>First, what is growth mindset?</u></p>
<p>This quote is from one of your other blog posts (not your critique of my paper), from your post:<br />
<blockquote>If you’re not familiar with it, growth mindset is the belief that people who believe ability doesn’t matter and only effort determines success are more resilient, skillful, hard-working, perseverant in the face of failure, and better-in-a-bunch-of-other-ways than people who emphasize the importance of ability. Therefore, we can make everyone better off by telling them ability doesn’t matter and only hard work does.</p></blockquote>
<p>If you think that&#8217;s what growth mindset is, I can certainly see why you&#8217;d find it irritating &#8212; and even destructive. I&#8217;d like to assure you that the people doing growth mindset research do not subscribe to the interpretation of growth mindset you described. Nor is that interpretation of growth mindset something we aim to communicate through our interventions. So what is growth mindset?</p>
<p>Growth mindset is not the belief that &#8220;ability doesn’t matter and only effort determines success.&#8221; Growth mindset is the belief that individuals can improve their abilities &#8212; usually through effort and by learning more effective strategies. For example, imagine a third grader struggling to learn long division for the first time. Should he interpret his struggle as a sign that he&#8217;s bad at math &#8212; as a sign that he should give up on math for good? Or would it be more adaptive if he realized that he could probably get a lot better at math if he sought out help from his peers or teachers? The student who thinks he should give up would probably do pretty badly while the student who thinks that he can improve his abilities &#8212; and tries to do so by learning new study strategies and practicing them &#8212; would do comparatively better. </p>
<p>That&#8217;s the core of growth mindset. It&#8217;s nothing crazy like thinking ability doesn&#8217;t matter. It&#8217;s keeping in mind that you can improve and that &#8212; to do so &#8212; you need to work hard and seek out and practice new, effective strategies.</p>
<p>As someone who has worked closely with Carol Dweck and with her students and colleagues for seven years now, I can personally attest that I have never heard anyone in that extended group of people express the belief that ability does not matter or that only hard work matters. In fact, a growth mindset wouldn’t make any sense if ability didn’t matter because a growth mindset is all about improving ability. </p>
<p>One of the active goals of the group I co-founded (PERTS) is to try to dispel misinterpretations of growth mindset because they can be harmful. I take it as a failure of our group that someone like you &#8212; someone who clearly cares about research and about scientific integrity &#8212; could walk away from our work with that interpretation of growth mindset. I hope that PERTS, and other groups promoting growth mindset, can get better and better at refining the way we talk about growth mindset so that people can walk away from our work understanding it more clearly. To that end, I hope you can continue to engage with us to improve that message so that people don&#8217;t continue to misinterpret it.</p>
<p>Anyway, here are my responses to specific points you made in your blog about my paper:</p>
<p><u>Was the control group a mindset intervention?</u></p>
<p>You wrote:<br />
<blockquote>&#8220;A quarter of the students took a placebo course that just presented some science about how different parts of the brain do different stuff. This was also classified as a “mindset intervention”, though it seems pretty different.&#8221;</p></blockquote>
<p>What makes you think it was classified as a mindset intervention? We called that the control group, and no one on our team ever thought of that as a mindset intervention. </p>
<p><u>The Elderly Hispanic Woman Effect</u></p>
<p>You wrote:<br />
<blockquote>Subgroup analysis can be useful to find more specific patterns in the data, but if it’s done post hoc it can lead to what I previously called the Elderly Hispanic Woman Effect&#8230;</p></blockquote>
<p>First, I just want to note that I love calling this the &#8220;elderly Hispanic woman effect.&#8221; It really brings out the intrinsic ridiculousness of the subgroup analyses researchers sometimes go through in search of an effect with a p<.05. It is indeed unlikely that &#8220;elderly Hispanic women&#8221; would be a meaningful subgroup for analyzing the effects of a medicine (although it might be a fun thought exercise to try to think of examples of a medicine whose effects would be likely to be moderated by being an elderly Hispanic woman).</p>
<p>In bringing up the elderly Hispanic woman effect, you&#8217;re suggesting that we didn&#8217;t have an a priori reason to think that underperforming students would benefit from these mindset interventions and that we just looked through a bunch of moderators until we found one with p<.05. Well, that&#8217;s not what we did, and I hope I can convince you that our choice of moderator was perfectly reasonable given prior research and theory.</p>
<p>There&#8217;s a lot of research (and common sense too) to suggest that mindset &#8212; and motivation in general &#8212; matters much more when something is hard than when it is easy. Underachieving students presumably find school more difficult, so it makes sense that we&#8217;d want to focus on them. I don&#8217;t think our choice of subgroup is a controversial or surprising prediction. I think anyone who knows mindset research well would predict stronger effects for students who are struggling. In other words, this is obviously not a case of the elderly Hispanic woman effect because it is totally consistent with prior theory and predictions. What ultimately matters more than any rhetorical argument, however, is whether the effect is robust &#8212; whether it replicates.</p>
<p>On that front, I hope you&#8217;ll be pleased to learn that we just ran a successful replication of this study (in fall 2014) in which we again found that growth mindset improves achievement specifically among at-risk high school students (currently under review). We&#8217;re also planning yet another large-scale replication study this fall with a nationally representative sample of schools so that we can be more confident that the interventions are effective in various types of contexts before giving them away for free to any school that wants them.</p>
<p><u>Is the sense of purpose intervention just a bunch of platitudes?</u></p>
<p>You wrote:<br />
<blockquote>Still another quarter took a course about “sense of purpose” which talked about how schoolwork was meaningful and would help them accomplish lots of goals and they should be happy to do it.</p></blockquote>
<p>[Later you say that those &#8220;children were told platitudes about how doing well in school will “make their families proud” and “make a positive impact”.]</p>
<p>I wouldn&#8217;t say those are platitudes. I think you&#8217;re under-appreciating the importance of finding meaning in one&#8217;s work. It&#8217;s a pretty basic observation about human nature that people are more likely to try hard when it seems like there&#8217;s a good reason to try hard. I also think it&#8217;s a pretty basic observation about our education system that many students don&#8217;t have good reasons for trying hard in school &#8212; reasons that resonate with them emotionally and help them find the motivation to do their best in the classroom. In our purpose intervention, we don&#8217;t just tell students what to think. We try to scaffold them to think of their own reasons for working hard in school, with a focus on reasons that are more likely to have emotional resonance for students. This type of self-persuasion technique has been used for decades in attitudes research.</p>
<p>We&#8217;ve written in more depth about these ideas and explored them through a series of studies. I&#8217;d <A HREF="https://www.perts.net/static/documents/yeager_2014.pdf">encourage</A> you to read this article if you&#8217;re interested. </p>
<p><u>Our paper title and abstract are misleading</u></p>
<p>You wrote:<br />
<blockquote>Among ordinary students, the effect on the growth mindset group was completely indistinguishable from zero, and in fact they did nonsignificantly worse than the control group. This was the most basic test they performed, and it should have been the headline of the study. The study should have been titled “Growth Mindset Intervention Totally Fails To Affect GPA In Any Way”.</p></blockquote>
<p>I think the title you suggest would have been misleading. How?</p>
<p>First, we did find evidence that mindset interventions help underachieving students &#8212; and those students are very important from a policy standpoint. As we describe in the paper, those students are more likely to drop out, to end up underemployed, or to end up in prison. So if something can help those students at scale and at a low cost, it&#8217;s important for people to know that. That&#8217;s why the word &#8220;underachievement&#8221; is in the title of the paper &#8212; because we&#8217;re accurately claiming that these interventions can help the important (and large) group of students who are underachieving.</p>
<p>Second, the interventions influenced the way all students think about school in ways that are associated with achievement. Although the higher performing students didn&#8217;t show any effects on grades in the semester following the study, their mindsets did change. And, as per the arguments I presented above about the link between mindset and difficulty, it&#8217;s quite feasible that those higher-performing students will benefit from this change in mindset down the line. For example, they may choose to take harder classes (e.g., <A HREF="http://www.ncbi.nlm.nih.gov/pubmed/24512251">Romero et al., 2014</A>) or they may be more persistent and successful in future classes that are very challenging for them.</p>
<p><u>A misinterpretation of the y-axis in <A HREF="https://www.evernote.com/shard/s326/sh/17e4e9e8-20eb-44e9-8570-0948999e982d/9c4b53ddcbb11bb196845b3eec161bee/deep/0/Growth-Mindset-3--A-Pox-On-Growth-Your-Houses---Slate-Star-Codex.png">this graph</A>.</u></p>
<p>You wrote:<br />
<blockquote>Growth mindset still doesn’t differ from zero [among at-risk students].</p></blockquote>
<p>This just seems to be a simple misreading of the graph. Either you missed the y-axis of the graph that you reproduced on your blog or you don&#8217;t know what a residual standardized score is. Either way, I&#8217;ll explain because this is pretty esoteric stuff.</p>
<p>The zero point of the y-axis on that graph is, by definition, the grand mean of the 4 conditions. In other words, the treatment conditions are all hovering around zero because zero is the average, and the average is made up mostly of treatment group students. If we had only had 2 conditions (each with 50% of the students), the y-axis &#8220;zero&#8221; would have been exactly halfway in between them. So the lack of difference from zero does not mean that the treatment was not different from control. The relevant comparison is between the error bars in the control condition and in the treatment conditions.</p>
<p>You might ask, &#8220;why are you showing such a graph?&#8221; We&#8217;re doing so to focus on the treatment contrast at the heart of our paper &#8212; the contrast between the control and treatment groups. The residual standardized graph makes it easy to see the size of that treatment contrast.</p>
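<p><i>[A toy numeric sketch of this centering, with invented numbers rather than the study&#8217;s data: express each condition&#8217;s mean relative to the grand mean of all four conditions, and the treatment groups land near zero while the control group lands well below it.]</i></p>

```python
import numpy as np

# Hypothetical group mean outcomes (invented, NOT the study's data):
# three treatment groups that do better and one control group that lags.
treatments = np.array([2.6, 2.7, 2.65])
control = 2.0

# A residual standardized score is expressed relative to the grand mean
# of all four conditions, so the centered values sum to zero by definition.
all_groups = np.append(treatments, control)
grand_mean = all_groups.mean()
centered = all_groups - grand_mean

# The treatment groups hover near zero (they make up most of the average),
# while the control group sits well below zero -- so "not different from
# zero" does not mean "not different from control".
print(centered)
```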
<p><u>We&#8217;re combining intervention conditions</u></p>
<p>You wrote:<br />
<blockquote>Did you catch that phrase “intervention conditions”? The authors of the study write: “Because our primary research question concerned the efficacy of academic mindset interventions in general when delivered via online modules, we then collapsed the intervention conditions into a single intervention dummy code (0 = control, 1 = intervention).”</p></blockquote>
<p>[This line of argument goes on for a long time to suggest that we&#8217;re unethical and that there&#8217;s actually no evidence for the effects of growth mindset on achievement.]</p>
<p>We collapsed the intervention conditions together for this analysis because we were interested in the overall effect of these interventions on achievement. We wanted to see if it is possible to use scalable, social-psychological approaches to improve the achievement of underperforming students. I&#8217;m not sure why you think that&#8217;s not a valid hypothesis to test, but we certainly think it is. Maybe this is just a matter of opinion about what&#8217;s a meaningful hypothesis to test, but I assure you that this hypothesis (contrast all treatments to control) is consistent with the goal of our group to develop treatments that make an impact on student achievement. As I described before, we have a whole center devoted to trying to improve academic achievement with these types of techniques (see perts.net); so it&#8217;s pretty natural that we&#8217;d want to see whether our social-psychological interventions improve outcomes for the students who need them most (at-risk students). </p>
<p>You&#8217;re correct that the growth mindset intervention did not have a statistically significant impact on course passing rates by itself (at a p<.05 level). However, the effect was in the expected direction with p=0.13 (or a 1-tailed p=.07 &#8212; I hope you&#8217;ll grant that a 1-tailed test is appropriate here given that we obviously predicted the treatment would improve rather than reduce performance). So the lack of a p<.05 should not be interpreted &#8212; as you seem to interpret it &#8212; as some sort of positive evidence that growth mindset &#8220;actually didn&#8217;t work.&#8221; Anyway, I would say it warrants further research to replicate this effect (work we are currently engaging in).</p>
<p>To summarize, we did not find direct evidence that the growth mindset intervention increased course passing rates on its own at a p<.05 level. We did find that growth mindset increased course passing rates at a trend level &#8212; and found a significant effect on GPA. More importantly for me (though perhaps less relevant to your interest specifically in growth mindset), we did provide evidence that social-psychological interventions, like growth mindset and sense of purpose, can improve academic outcomes for at-risk students.</p>
<p>We&#8217;re excited to be replicating this work now and giving it away in the hopes of improving outcomes for students around the world.</p>
<p><u>Summary</u></p>
<p>I hope I addressed your concerns about this paper, and I welcome further discussion with you. I&#8217;d really appreciate it if you&#8217;d revise your blog post in whatever way you think is appropriate in light of my response. I&#8217;d hate for people to get the wrong impression of our work, and you don&#8217;t strike me as someone who would want to mislead people about scientific findings either. </p>
<p>Finally, you&#8217;re welcome to post my response. I may post it to my own web page because I&#8217;m sure many other people have similar questions about my work. Just let me know how you&#8217;d like to proceed with this dialog.</p>
<p>Thanks for reading,</p>
<p>Dave </p>
<p><b>II.</b></p>
<p>First of all, the obvious: this is extremely kind and extremely well-argued and a lot of it is correct and makes me feel awful for being so snarky on my last post.</p>
<p>Things in particular which I want to endorse as absolutely right about the critique:</p>
<p>I wrote &#8220;A quarter of the students took a placebo course that just presented some science about how different parts of the brain do different stuff. This was also classified as a “mindset intervention”, though it seems pretty different.&#8221; Dr. Paunesku says this is wrong. He&#8217;s right. It was an editing error on my part. I meant to add the last sentence to the part on the &#8220;sense of purpose&#8221; intervention, which was classified as a mindset intervention and which I do think seems pretty different. The placebo intervention was never classified as a mindset intervention and I completely screwed up by inserting that piece of text there rather than two sentences down where I meant it to be. It has since been corrected and I apologize for the error.</p>
<p>If another successful replication found that growth mindset continues to only help the lowest-performing students, I withdraw the complaint that this is sketchy subgroup mining, though I think that in general worrying about this is the correct thing to do.</p>
<p>I <i>did</i> misunderstand the residual standardized graph. I suggested that the control group must have severely declined, and got confused about why. In fact, the graph was not about difference between pre-study scores and post-study scores, but difference between group scores and the average score for all four groups. So when the control group is strongly negative, that means it was much worse than the average of all groups. When growth mindset is not-different-from-zero, it means growth mindset was not different from the average of all four groups, which consists of three treatment groups and one control group. So my interpretation &#8211; that growth mindset failed to change children&#8217;s grades &#8211; is not supported by the data.</p>
<p>(In my defense, I can only plead that in the two hundred fifty comments I received, many by professional psychologists and statisticians, only one person picked up on this point (admittedly, after being primed by my own misinterpretation). And the sort of data I expected to be seeing &#8211; difference between students&#8217; pre-intervention and post-intervention scores &#8211; does not seem to be available. Nevertheless, this was a huge and unforgivable screw-up, and I apologize.)</p>
<p><b>III.</b></p>
<p>But there are also a few places where I will stick to my guns.</p>
<p>I don&#8217;t think my interpretation of growth mindset was that far off the mark. I explain this a little further in <A HREF="http://slatestarcodex.com/2015/04/10/i-will-never-have-the-ability-to-clearly-explain-my-beliefs-about-growth-mindset/">this post</A> on differing possible definitions of growth mindset, and I will continue to cite <A HREF="http://www.johnstonvbc.com/coaches_only/USOC%20-%20MINDSETS%20by%20Carol%20Dweck%202.09.pdf">this strongly worded paper by Dweck</A> as defense of my views. It&#8217;s <i>not</i> just an obvious and innocuous belief about always believing you should be able to improve, it&#8217;s a belief about very counterintuitive effects of believing that success depends on ability versus effort. It is possible that all sophisticated researchers in the field have a very sophisticated and unobjectionable definition of growth mindset, but that&#8217;s not the way it&#8217;s presented to the public, even in articles by those same researchers.</p>
<p>Although I&#8217;m sure that to researchers in the field statements like &#8220;Doing well at school will help me achieve my goal&#8221; don&#8217;t sound like platitudes, they read that way to me, and I think that matters in the context of discussions about growth mindset. Some people have billed growth mindset as a very exciting window into what makes learning tick, and how we should divide everyone into groups based on their mindset, and how it&#8217;s the Secret To Success, and so on. Learning that a drop-dead simple intervention &#8211; telling students to care about school more &#8211; actually does as well or better than growth mindset seems to me like a damning result. I realize it would be kind of insulting to call sense-of-purpose an &#8220;active placebo&#8221; in the medical sense, but that&#8217;s kind of how I can&#8217;t help thinking of it.</p>
<p>I&#8217;m certainly not suggesting the authors of the papers are <i>unethical</i> for combining growth mindset intervention with sense of purpose intervention. But I think the technique is dangerous, and this is an example. They got a result that was significant at p = 0.13. Dr. Paunesku suggests in his email to me that this should be one-tailed (which makes it p = 0.07) and that this obviously trends towards significance. This is a reasonable argument. But this wasn&#8217;t the reasonable argument made in the paper. Instead, they make it look like it achieved classical p < 0.05 significance, or at least make it very hard to notice that it didn&#8217;t.</p>
<p>Even if in this case it was &#8211; I can&#8217;t even say white lie, maybe a white spin &#8211; I find the technique very worrying. Suppose I want to prove homeopathy cures cancer. I make a trial with one placebo condition and two intervention conditions &#8211; chemotherapy and homeopathy. I find that the chemotherapy condition very significantly outperforms placebo, but the homeopathy condition doesn&#8217;t. So I combine the two interventions into a single bin and say &#8220;Therapeutic interventions such as chemotherapy or homeopathy significantly outperform placebo.&#8221; Then someone else cites it as &#8220;As per a study, homeopathy outperforms placebo.&#8221; This would obviously be bad.</p>
<p>I am just not convinced that growth mindset and sense of purpose are similar enough that you can group them together effectively. This is what I was trying to get at in my bungled sentence about how they&#8217;re both &#8220;mindset&#8221; interventions but seem pretty different. Yes, they&#8217;re both things you tell children in forty-five minute sessions that seem related to how they think about school achievement. But that&#8217;s a <i>really</i> broad category. </p>
<p>But doesn&#8217;t it mean something that growth-mindset was obviously trending toward significance?</p>
<p>First of all, I would have had no problem with saying &#8220;trending toward significance&#8221; and letting readers draw their own conclusions. </p>
<p>Second of all, I&#8217;m not totally sure I buy the justification for a one-tailed test here; after all, it seems like we should use a one-tailed test for homeopathy as well, since as astounding as it would be if homeopathy helped, it would be even more astounding if homeopathy somehow made cancer worse. Further, educational interventions <i>often</i> have the opposite of their desired effect &#8211; see e.g. <A HREF="http://interrete.org/inclusive-classrooms-dont-necessarily-increase-friendships-for-children-with-disabilities/">this campaign to increase tolerance of the disabled</A> which made students like disabled people <i>less</i> than a control intervention. In fact, there&#8217;s no need to look further than this very study, which found (counterintuitively) that among students already exposed to sense-of-purpose interventions, adding on an extra growth-mindset intervention seemed to make them do (nonsignificantly) worse. I am not a statistician, but my understanding is you ought to have a <i>super</i> good reason to use a one-tailed test, beyond just &#8220;Intuitively my hypothesis is way more likely than the exact opposite of my hypothesis&#8221;. </p>
<p>Third of all, if we accept p < 0.13 as &#8220;trending towards significance&#8221;, we have basically tripled the range of acceptable study results, even though everyone agrees our current range of acceptable study results is already way too big and <A HREF="http://slatestarcodex.com/2013/02/17/90-of-all-claims-about-the-problems-with-medical-studies-are-wrong/">some high percent of all medical studies are wrong</A> and <A HREF="http://www.nature.com/news/first-results-from-psychology-s-largest-reproducibility-test-1.17433">only 39% of psych studies replicate</A> and so on.</p>
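<p><i>[The arithmetic behind &#8220;tripled&#8221;: under the null hypothesis p-values are uniform on [0, 1], so the false-positive rate equals the cutoff itself, and widening the cutoff from 0.05 to 0.13 scales that rate by a factor of 2.6.]</i></p>

```python
# Under the null, p-values are uniformly distributed on [0, 1], so the
# false-positive rate equals the significance cutoff itself.
old_cutoff, new_cutoff = 0.05, 0.13

# Widening the cutoff scales the false-positive rate proportionally.
print(round(new_cutoff / old_cutoff, 2))  # 2.6, i.e. roughly "tripled"
```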
<p>(I agree that all of this could be solved by something better than p-values, but p-values are what we&#8217;ve got)</p>
<p>I realize I&#8217;m being a jerk by insisting on the arbitrary 0.05 criterion, but in my defense, the time when only 39% of studies using a criterion replicate is a bad time to loosen that criterion.</p>
<p><b>IV.</b></p>
<p>Here&#8217;s what I still believe and what I&#8217;ve changed my mind on based on Dr. Paunesku&#8217;s response.</p>
<p>1. I totally bungled my sentence on the placebo group being a mindset intervention by mistake. I ashamedly apologize, and have corrected the original post.</p>
<p>2. I totally bungled reading the residual standard score graph. I ashamedly apologize, and have corrected the original post, and put a link in bold text to this post on the top.</p>
<p>3. I don&#8217;t know whether the thing I thought the graph showed (no significant preintervention vs. postintervention GPA improvement for growth mindset, or no difference in change from controls) is true. It may be hidden in the supplement somewhere, which I will check later. Possible apology pending further investigation.</p>
<p>4. Growth mindset still had no effect (in fact nonsignificantly negative) for students at large (as opposed to underachievers). I regret nothing.</p>
<p>5. Growth mindset still failed to reach traditional significance criteria for changing pass rates. I regret nothing.</p>
]]></content:encoded>
			<wfw:commentRss>http://slatestarcodex.com/2015/05/07/growth-mindset-4-growth-of-office/feed/</wfw:commentRss>
		<slash:comments>129</slash:comments>
		</item>
		<item>
		<title>Prescriptions, Paradoxes, and Perversities</title>
		<link>http://slatestarcodex.com/2015/04/30/prescriptions-paradoxes-and-perversities/</link>
		<comments>http://slatestarcodex.com/2015/04/30/prescriptions-paradoxes-and-perversities/#comments</comments>
		<pubDate>Thu, 30 Apr 2015 04:52:19 +0000</pubDate>
		<dc:creator><![CDATA[Scott Alexander]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[medicine]]></category>
		<category><![CDATA[psychiatry]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://slatestarcodex.com/?p=3629</guid>
		<description><![CDATA[[WARNING: I am not a pharmacologist. I am not a researcher. I am not a statistician. This is not medical advice. This is really weird and you should not take it too seriously until it has been confirmed] I. I&#8217;ve &#8230; <a href="http://slatestarcodex.com/2015/04/30/prescriptions-paradoxes-and-perversities/">Continue reading <span class="pjgm-metanav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><font size="1"><i>[WARNING: I am not a pharmacologist. I am not a researcher. I am not a statistician. This is not medical advice. This is really weird and you should not take it too seriously until it has been confirmed]</i></font></p>
<p><b>I.</b></p>
<p>I&#8217;ve been playing around with data from Internet databases that aggregate patient reviews of medications.</p>
<p>Are these any good? I looked at four of the largest such databases &#8211; <A HREF="http://www.drugs.com/drug_information.html">Drugs.com</A>, <A HREF="http://www.webmd.com/drugs/index-drugs.aspx?show=drugs">WebMD</A>, <A HREF="http://www.askapatient.com/">AskAPatient</A>, and <A HREF="http://www.druglib.com/">DrugLib</A> &#8211; as well as psychiatry-specific site <A HREF="http://www.crazymeds.us/pmwiki/pmwiki.php/Main/HomePage">CrazyMeds</A> &#8211; and took their data on twenty-three major antidepressants. Then I correlated them with one another to see if the five sites mostly agreed.</p>
<p>Correlations between Drugs.com, AskAPatient, and WebMD were generally large and positive (around 0.7). Correlations between CrazyMeds and DrugLib were generally small or negative. In retrospect this makes sense, because these two sites didn&#8217;t allow separation of ratings by condition, so for example Seroquel-for-depression was being mixed with Seroquel-for-schizophrenia. </p>
<p>So I threw out the two offending sites and kept Drugs.com, AskAPatient, and WebMD. I normalized all the data, then took the weighted average of all three sites. From this huge sample (the least-reviewed drug had 35 ratings, the most-reviewed drug 4,797) I obtained a unified opinion of patients&#8217; favorite and least favorite antidepressants.</p>
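<p><i>[A minimal sketch of that normalize-then-average step, with made-up ratings and review counts rather than the real data: z-score each site&#8217;s column so their different rating scales become comparable, then average across sites, weighting each site by how many reviews it contributed.]</i></p>

```python
import numpy as np

# Made-up ratings for three drugs on three sites (rows = drugs; columns
# stand in for Drugs.com, AskAPatient, WebMD). NOT the real data.
ratings = np.array([
    [8.4, 4.2, 3.9],
    [7.1, 3.5, 3.2],
    [5.0, 2.4, 2.6],
])
# Assumed review counts per drug per site, used as weights.
counts = np.array([
    [120, 300, 900],
    [ 60,  80, 400],
    [ 35,  40, 150],
])

# Normalize each site's column to mean 0, SD 1, so that sites with
# different rating scales can be averaged together.
z = (ratings - ratings.mean(axis=0)) / ratings.std(axis=0)

# Weighted average across sites gives one unified score per drug.
combined = (z * counts).sum(axis=1) / counts.sum(axis=1)
print(combined)
```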
<p><center><IMG SRC="http://slatestarcodex.com/blog_images/prescription_table.png"></center></p>
<p>This doesn&#8217;t surprise me at all. Everyone secretly knows Nardil and Parnate (the two commonly-used drugs in the MAOI class) are excellent antidepressants<sup>1</sup>. Oh, <A HREF="http://psychiatrist-blog.blogspot.com/2008/03/why-this-shrink-doesnt-prescribe-maois.html">nobody</A> will prescribe them, because of the dynamic discussed <A HREF="http://slatestarcodex.com/2015/04/25/nefarious-nefazodone-and-flashy-rare-side-effects/">here</A>, but in their hearts they know it&#8217;s true.</p>
<p>Likewise, I feel pretty good to see that Serzone, which I recently defended, is number five. I&#8217;ve had terrible luck with Viibryd, and it just seems to make people taking it more annoying, which is not a listed side effect but which I swear has happened.</p>
<p>The table also <A HREF="http://slatestarcodex.com/2015/04/30/prescriptions-paradoxes-and-perversities/#comment-201233">matches</A> the evidence from chemistry &#8211; drugs with similar molecular structure get similar ratings, as do drugs with similar function. This is, I think, a good list.</p>
<p>Which is too bad, because it makes the next part that much more terrifying.</p>
<p><b>II.</b></p>
<p>There is a sixth major Internet database of drug ratings. It is called <A HREF="https://www.healthtap.com/raterx">RateRx</A>, and it differs from the other five in an important way: it solicits ratings from doctors, not patients. It&#8217;s a great idea &#8211; if you trust your doctor to tell you which drug is best, why not take advantage of wisdom-of-crowds and trust <i>all</i> the doctors? </p>
<p><center><IMG SRC="http://slatestarcodex.com/blog_images/prescription_doctors.png"></p>
<p><i>The RateRx logo. Spoiler: this is going to seem really ironic in about thirty seconds.</i></center></p>
<p>RateRx has a modest but respectable sample size &#8211; the drugs on my list got between 32 and 70 doctor reviews. There&#8217;s only one problem.</p>
<p>You remember patient reviews on the big three sites correlated about +0.7 with each other, right? So patients pretty much agree on which drugs are good and which are bad?</p>
<p>Doctor reviews on RateRx correlated at -0.21 with patient reviews. The negative relationship is nonsignificant, but that just means that at best, doctor reviews are totally uncorrelated with patient consensus.</p>
<p><center><IMG SRC="http://slatestarcodex.com/blog_images/prescription_graph1.png"></center></p>
<p>This has an obvious but very disturbing corollary. I couldn&#8217;t get good numbers on how many times each of the antidepressants on my list was prescribed, because the information I&#8217;ve seen only gives prescription numbers for a few top-selling drugs, plus we&#8217;ve got the same problem of not being able to distinguish depression prescriptions from anxiety prescriptions from psychosis prescriptions. But total number of online reviews makes a pretty good proxy. After all, the more patients are using a drug, the more are likely to review it.</p>
<p>Quick sanity check: the most reviewed drug on my list was Cymbalta. Cymbalta was also <A HREF="http://mentalhealthdaily.com/2014/08/30/most-popular-antidepressants-in-2014-cymbalta-pristiq-viibryd/">the best selling antidepressant of 2014</A>. Although my list doesn&#8217;t exactly track the best-sellers, that seems to be a function of how long a drug has been out &#8211; a best-seller that came out last year might have only 1/10th the number of reviews as a best-seller that came out ten years ago. So number of reviews seems to be a decent correlate for amount a drug is used.</p>
<p>In that case, amount a drug is used correlates highly (+0.67, p = 0.005) with doctors&#8217; opinion of the drug, which makes perfect sense since doctors are the ones prescribing it. But amount the drug gets used correlates negatively with patient rating of the drug (-0.34, p = ns), which of course is to be expected given the negative correlation between doctor opinion and patient opinion.</p>
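<p><i>[How such correlations and their p-values can be computed, sketched on invented toy numbers (not the post&#8217;s data): review counts as a usage proxy alongside hypothetical doctor and patient ratings.]</i></p>

```python
from scipy.stats import pearsonr

# Invented per-drug numbers (NOT the post's data): review counts as a
# proxy for how often each drug is prescribed, plus mean ratings.
usage    = [4797, 3200, 2100, 900, 400, 120, 60, 35]
doctors  = [4.1, 3.9, 3.8, 3.5, 3.2, 3.0, 2.9, 2.7]
patients = [2.9, 3.0, 3.2, 3.4, 3.6, 3.9, 4.1, 4.3]

# pearsonr returns (correlation coefficient, two-tailed p-value).
r_doc, p_doc = pearsonr(usage, doctors)
r_pat, p_pat = pearsonr(usage, patients)

print(f"usage vs doctor rating:  r={r_doc:+.2f} (p={p_doc:.3f})")
print(f"usage vs patient rating: r={r_pat:+.2f} (p={p_pat:.3f})")
```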
<p>So the more patients like a drug, the less likely it is to be prescribed<sup>2</sup>.</p>
<p><b>III.</b></p>
<p>There&#8217;s one more act in this horror show.</p>
<p>Anyone familiar with these medications reading the table above has probably already noticed this one, but I figured I might as well make it official.</p>
<p>I correlated the average rating of each drug with the year it came on the market. The correlation was -0.71 (p < .001). That is, the newer a drug was, the less patients liked it<sup>3</sup>.</p>
<p><center><IMG SRC="http://slatestarcodex.com/blog_images/prescription_graph2.png"></center></p>
<p>This pattern absolutely <i>jumps</i> out of the data. First- and second- place winners Nardil and Parnate came out in 1960 and 1961, respectively; I can&#8217;t find the exact year third-place winner Anafranil came out, but the first reference to its trade name I can find in the literature is from 1967, so I used that. In contrast, last-place winner Viibryd came out in 2011, second-to-last place winner Abilify got its depression indication in 2007, and third-to-last place winner Brintellix is as recent as 2013.</p>
<p>This result is robust to various different methods of analysis, including declaring MAOIs to be an unfair advantage for Team Old and removing all of them, changing which minor tricyclics I do and don&#8217;t include in the data, and altering whether Deprenyl, a drug that technically came out in 1970 but received a gritty reboot under the name Emsam in 2006, is counted as older or newer.</p>
<p>So if you want to know what medication will make you happiest, at least according to this analysis your best bet isn&#8217;t to ask your doctor, check what&#8217;s most popular, or even check any individual online rating database. It&#8217;s to look at the approval date on the label and choose the one that came out first.</p>
<p><b>IV.</b></p>
<p>What the <i>hell</i> is going on with these data?</p>
<p>I would like to dismiss this as confounded, but I have to admit that any reasonable person would expect the confounders to go the opposite way.</p>
<p>That is: older, less popular drugs are usually brought out only when newer, more popular drugs have failed. MAOIs, the clear winner of this analysis, are very clearly reserved in the guidelines for &#8220;treatment-resistant depression&#8221;, ie depression you&#8217;ve already thrown everything you&#8217;ve got at. But these are precisely the depressions that are hardest to treat. </p>
<p>Imagine you are testing the fighting ability of three people via ten boxing matches. You ask Alice to fight a Chihuahua, Bob to fight a Doberman, and Carol to fight Cthulhu. You would expect this test to be biased in favor of Alice and against Carol. But MAOIs and all these other older rarer drugs are practically never brought out except against Cthulhu. Yet they <i>still</i> have the best win-loss record. </p>
<p>Here are the only things I can think of that might be confounding these results.</p>
<p>Perhaps because these drugs are so rare and unpopular, psychiatrists only use them when they have really really good reason. That is, the most popular drug of the year they pretty much cluster-bomb everybody with. But every so often, they see some patient who seems absolutely 100% perfect for clomipramine, a patient who practically <i>screams</i> &#8220;clomipramine!&#8221; at them, and then they give this patient clomipramine, and she does really well on it.</p>
<p>(but psychiatrists aren&#8217;t actually that good at personalizing antidepressant treatments. The only thing even <i>sort of</i> like that is that MAOIs are extra-good for a subtype called atypical depression. But that&#8217;s like a third of the depressed population, which doesn&#8217;t leave much room for this super-precise-targeting hypothesis.)</p>
<p>Or perhaps once drugs have been on the market longer, patients figure out what they like. Brintellix is so new that the Brintellix patients are the ones whose doctors said &#8220;Hey, let&#8217;s try you on Brintellix&#8221; and they said &#8220;Whatever&#8221;. MAOIs have been on the market so long that presumably MAOI patients are ones who tried a dozen antidepressants before and stayed on MAOIs because they were the only ones that worked.</p>
<p>(but Prozac has been on the market 25 years now. This should only apply to a couple of very new drugs, not the whole list.)</p>
<p>Or perhaps the older drugs have so many side effects that no one would stay on them unless they&#8217;re absolutely perfect, whereas people are happy to stay on the newer drugs even if they&#8217;re not doing much because whatever, it&#8217;s not like they&#8217;re causing any trouble.</p>
<p>(but Seroquel and Abilify, two very new drugs, have awful side effects, yet are down at the bottom along with all the other new drugs)</p>
<p>Or perhaps patients on very rare weird drugs get a special placebo effect, because they feel that their psychiatrist cares enough about them to personalize treatment. Perhaps they identify with the drug &#8211; &#8220;I am special, I&#8217;m one of the only people in the world who&#8217;s on nefazodone!&#8221; and they become attached to it and want to preach its greatness to the world.</p>
<p>(but I would expect people to also get excited about being given the latest, flashiest thing. Yet only drugs that are rare because they are old get that benefit, not drugs that are rare because they are new.)</p>
<p>Or perhaps psychiatrists tend to prescribe the drugs they &#8220;imprinted on&#8221; in medical school and residency, so older psychiatrists prescribe older drugs and the newest psychiatrists prescribe the newest drugs. But older psychiatrists are probably much more experienced and better at what they do, which could affect patients in other ways &#8211; the placebo effect of being with a doctor who radiates competence, or maybe the more experienced psychiatrists are really good at psychotherapy, and that makes the patient better, and they attribute it to the drug.</p>
<p>(but read on&#8230;)</p>
<p><b>V.</b></p>
<p>Or perhaps we should take this data at face value and assume our antidepressants have been getting worse and worse over the past fifty years.</p>
<p>This is not entirely as outlandish as it sounds. The history of the past fifty years has been a history of moving from drugs with more side effects to drugs with fewer side effects, with what I consider somewhat less than due diligence in making sure the drugs were quite as effective in the applicable population. This is a <i>very</i> complicated and controversial statement which I will be happy to defend in the comments if someone asks.</p>
<p>The big problem is: drugs go off-patent after twenty years. Drug companies want to push new, on-patent medications, and most research is funded by drug companies. So lots and lots of research is aimed at proving that newer medications invented in the past twenty years (which make drug companies money) are better than older medications (which don&#8217;t).</p>
<p>I&#8217;ll give one example. There is <A HREF="http://www.ncbi.nlm.nih.gov/pubmed/8915561">only a single study in the entire literature</A> directly comparing the MAOIs &#8211; the very old antidepressants that did best on the patient ratings &#8211; to SSRIs, the antidepressants of the modern day<sup>4</sup>. This study found that phenelzine, a typical MAOI, was no better than Prozac, a typical SSRI. Since Prozac had fewer side effects, that made the choice in favor of Prozac easy.</p>
<p>Did you know you can look up the authors of scientific studies on LinkedIn and sometimes get very relevant information? For example, the lead author of this study has a resume that clearly lists him as working for Eli Lilly at the time the study was conducted (spoiler: Eli Lilly is the company that makes Prozac). The second author&#8217;s LinkedIn profile shows he is <i>also</i> an operations manager for Eli Lilly. Googling the fifth author&#8217;s name links to a news article about Eli Lilly making a $750,000 donation to his clinic. Also there&#8217;s a little blurb at the bottom of the paper saying &#8220;Supported by a research grant by Eli Lilly and company&#8221;, then thanking several Eli Lilly executives by name for their assistance. </p>
<p>This is the sort of study which I kind of wish had gotten replicated <i>before</i> we decided to throw away an entire generation of antidepressants based on the result. </p>
<p>But who will come to phenelzine&#8217;s defense? Not Parke-Davis, the company that made it: their patent expired sometime in the seventies, and then they were bought out by Pfizer<sup>5</sup>. And not Pfizer &#8211; without a patent they can&#8217;t make any money off Nardil, and besides, Nardil is competing with their own on-patent SSRI drug Zoloft, so Pfizer has as much incentive as everyone else to push the &#8220;SSRIs are best, better than all the rest&#8221; line.</p>
<p>Every twenty years, pharmaceutical companies have an incentive to suddenly declare that all their old antidepressants were awful and you should never use them, but whatever new antidepressant they managed to dredge up is super awesome and you should use it all the time. This sort of <i>does</i> seem like the sort of situation that might lead to older medications being better than newer ones. A couple of people have been pushing this line for years &#8211; I was introduced to it by Dr. Ken Gillman from <A HREF="http://www.psychotropical.com/">Psychotropical Research</A>, whose recommendation of MAOIs and Anafranil as most effective matches the patient data very well, and whose essay <A HREF="http://www.psychotropical.com/why-most-new-antidepressants-are-ineffective">Why Most New Antidepressants Are Ineffective</A> is worth a read.</p>
<p>I&#8217;m not sure I go as far as he does &#8211; even if new antidepressants aren&#8217;t worse outright, they might still trade less efficacy for better safety. Even if they handled the tradeoff well, it would look like a net loss on patient rating data. After all, assume Drug A is 10% more effective than Drug B, but also kills 1% of its users per year, while Drug B kills nobody. Here there&#8217;s a good case that Drug B is much better and a true advance. But Drug A&#8217;s ratings would look better, since dead men tell no tales and don&#8217;t get to put their objections into online drug rating sites. Even if victims&#8217; families did give the drug the lowest possible rating, 1% of people giving a very low rating might still not counteract 99% of people giving it a higher rating.</p>
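The dead-men-tell-no-tales point can be made concrete with a toy calculation. All the numbers below are the hypothetical ones from the paragraph above, plus some assumed rating values of my own (responders rate a drug 5, non-responders rate it 2, victims&#8217; families rate it 1):

```python
def mean_rating(respond_rate, death_rate, family_rating=None):
    """Average online rating under a toy model: responders rate 5,
    non-responders rate 2, and the dead rate nothing unless a family
    member leaves family_rating on their behalf."""
    alive = 1.0 - death_rate
    ratings_sum = alive * (respond_rate * 5 + (1 - respond_rate) * 2)
    weight = alive
    if family_rating is not None:
        ratings_sum += death_rate * family_rating
        weight += death_rate
    return ratings_sum / weight

# Drug A: 10% more responders but kills 1% of users per year.
# Drug B: fewer responders, kills nobody.
a_no_families   = mean_rating(0.60, 0.01)                   # deaths silently vanish
a_with_families = mean_rating(0.60, 0.01, family_rating=1)  # families rate it 1
b               = mean_rating(0.50, 0.00)

print(a_no_families, a_with_families, b)
```

Even when every victim&#8217;s family leaves the lowest rating, Drug A still comes out ahead of the safer Drug B in this toy model, which is exactly the distortion described above.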
<p>And once again, <A HREF="http://slatestarcodex.com/2015/04/25/nefarious-nefazodone-and-flashy-rare-side-effects/">I&#8217;m not sure the tradeoff is handled very well at all</A><sup>6</sup>.</p>
<p><b>VI.</b></p>
<p>In order to distinguish between all these hypotheses, I decided to get a lot more data.</p>
<p>I grabbed all the popular antipsychotics, antihypertensives, antidiabetics, and anticonvulsants from the three databases, for a total of 55,498 ratings of 74 different drugs. I ran the same analysis on the whole set.</p>
<p>The three databases still correlate with each other at respectable levels of +0.46, +0.54, and +0.53. All of these correlations are highly significant, p < 0.01.</p>
<p>The negative correlation between patient rating and doctor rating remains and is now a highly significant -0.344, p < 0.01. This is robust even if antidepressants are removed from the analysis, and is notable in both psychiatric and nonpsychiatric drugs.</p>
<p><center><IMG SRC="http://slatestarcodex.com/blog_images/prescription_graph3.png"></center></p>
<p>The correlation between patient rating and year of release is a no-longer-significant -0.191. This is heterogeneous; antidepressants and antipsychotics show a strong bias in favor of older medications, and antidiabetics, antihypertensives, and anticonvulsants show a slight nonsignificant bias in favor of newer medications. So it would seem like the older-is-better effect is purely psychiatric.</p>
<p>I conclude that for some reason, there really is a highly significant effect across all classes of drugs that makes doctors love the drugs patients hate, and vice versa.</p>
<p>I also conclude that older psychiatric drugs seem to be liked much better by patients, and that this is not some kind of simple artifact or bias, since if such an artifact or bias existed we would expect it to repeat in other kinds of drugs, which it doesn&#8217;t.</p>
<p><b>VII.</b></p>
<p>Please feel free to check my results. <A HREF="http://slatestarcodex.com/Stuff/prescription_data.xls">Here is a spreadsheet</A> (.xls) containing all of the data I used for this analysis. Drugs are marked by class: 1 is antidepressants, 2 is antidiabetics, 3 is antipsychotics, 4 is antihypertensives, and 5 is anticonvulsants. You should be able to navigate the rest of it pretty easily. </p>
<p>One analysis that needs doing is to separate out drug effectiveness versus side effects. The numbers I used were combined satisfaction ratings, but a few databases &#8211; most notably WebMD &#8211; give you both separately. Looking more closely at those numbers might help confirm or disconfirm some of the theories above.</p>
<p>If anyone with the necessary credentials is interested in doing the hard work to publish this as a scientific paper, drop me an email and we can talk.</p>
<p><b>Footnotes</b></p>
<p><font size="1"><b>1.</b> Technically, MAOI superiority has only been proven for atypical depression, the type of depression where you can still have changing moods but you are unhappy on net. But I&#8217;d speculate that right now most patients diagnosed with depression have atypical depression, far more than the studies would indicate, simply because we&#8217;re diagnosing less and less severe cases these days, and less severe cases seem more atypical.</p>
<p><b>2.</b> First-place winner Nardil has only 16% as many reviews as last-place winner Viibryd, even though Nardil has been on the market fifty years and Viibryd for four. Despite its observed superiority, Nardil may very possibly be prescribed less than 1% as often as Viibryd.</p>
<p><b>3.</b> Pretty much the same thing is true if, instead of looking at the year they came out, you just rank them in order from earliest to latest.</p>
<p><b>4.</b> On the other hand, what we do have is a lot of studies comparing MAOIs to imipramine, and a lot of other studies comparing modern antidepressants to imipramine. For atypical depression and dysthymia, MAOIs beat imipramine handily, but the modern antidepressants are about equal to imipramine. This strongly implies the MAOIs beat the modern antidepressants in these categories.</p>
<p><b>5.</b> Interesting <A HREF="http://en.wikipedia.org/wiki/Parke-Davis">Parke-Davis</A> facts: Parke-Davis got rich by being the people to market cocaine back in the old days when people treated it as a pharmaceutical, which must have been kind of like a license to print money. They also worked on hallucinogens with no less a figure than Aleister Crowley, who got a nice tour of their facilities in Detroit.</p>
<p><b>6.</b> Consider: <A HREF="https://books.google.com/books?id=6PGzHFuS1xkC&#038;pg=PA91&#038;lpg=PA91&#038;dq=MAOI+fatality+rate&#038;source=bl&#038;ots=Ekv6SFwuz_&#038;sig=965qQ4bsYhKJPpIOCfbta4SiCJs&#038;hl=en&#038;sa=X&#038;ei=MQE_VbDmJdO3oQTX1oHIDw&#038;ved=0CEMQ6AEwBQ#v=onepage&#038;q=MAOI%20fatality%20rate&#038;f=false"><i>Seminars In General Psychiatry</i></A> estimates that MAOIs kill one person per 100,000 patient years. A third of all depressions are atypical. MAOIs <A HREF="http://www.ncbi.nlm.nih.gov/pubmed/6375621">are</A> 25 percentage points more likely to treat atypical depression than other antidepressants. So for every 100,000 patients you give a MAOI instead of a normal antidepressant, you kill one and cure 8,250 who wouldn&#8217;t otherwise be cured. The <A HREF="https://research.tufts-nemc.org/cear4/SearchingtheCEARegistry/SearchtheCEARegistry.aspx">QALY database</A> says that a year of moderate depression is worth about 0.6 QALYs. So for every 100,000 patients you give MAOIs, you&#8217;re losing about 30 QALYs and gaining about 3,300.</font></p>
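The arithmetic in this footnote can be spelled out step by step, using the figures as given (one MAOI death per 100,000 patient-years, about a third of depressions atypical, a 25-percentage-point edge in that subgroup, 0.6 QALYs for a year of moderate depression, and an assumed ~30 QALYs lost per death):

```python
patients = 100_000
atypical_fraction = 0.33       # about a third of depressions are atypical
extra_cure_rate = 0.25         # MAOIs' percentage-point edge in atypical depression
deaths_per_100k_years = 1      # MAOI fatality estimate
qaly_depressed_year = 0.6      # QALY value of a year of moderate depression
qalys_per_death = 30           # rough remaining life expectancy (my assumption)

extra_cures = patients * atypical_fraction * extra_cure_rate   # ~8,250
qalys_gained = extra_cures * (1.0 - qaly_depressed_year)       # ~3,300 per year
qalys_lost = deaths_per_100k_years * qalys_per_death           # 30

print(extra_cures, qalys_gained, qalys_lost)
```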
]]></content:encoded>
			<wfw:commentRss>http://slatestarcodex.com/2015/04/30/prescriptions-paradoxes-and-perversities/feed/</wfw:commentRss>
		<slash:comments>247</slash:comments>
		</item>
		<item>
		<title>Effective Altruists: Not As Mentally Ill As You Think</title>
		<link>http://slatestarcodex.com/2015/03/06/effective-altruists-not-as-mentally-ill-as-you-think/</link>
		<comments>http://slatestarcodex.com/2015/03/06/effective-altruists-not-as-mentally-ill-as-you-think/#comments</comments>
		<pubDate>Sat, 07 Mar 2015 02:08:07 +0000</pubDate>
		<dc:creator><![CDATA[Scott Alexander]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[charity]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://slatestarcodex.com/?p=3573</guid>
		<description><![CDATA[During my recent meetings with effective altruist groups here, I kept hearing the theory that effective altruism selects for people with mental disorders. The theory is that people with a lot of depression, anxiety, and self-hatred turn to effective altruism &#8230; <a href="http://slatestarcodex.com/2015/03/06/effective-altruists-not-as-mentally-ill-as-you-think/">Continue reading <span class="pjgm-metanav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>During my recent meetings with effective altruist groups here, I kept hearing the theory that effective altruism selects for people with mental disorders. The theory is that people with a lot of depression, anxiety, and self-hatred turn to effective altruism as (optimistically) a way to prove that they are good and valuable or (pessimistically) a form of self-harm in which they enact their belief that they deserve nothing and other people are more worthy.</p>
<p>And whenever this got brought up at meetings, people giggled, probably because they were thinking of good examples. I can&#8217;t deny there&#8217;s a lot of anecdotal evidence here (hi Ozy!). But when I look into it, it seems totally false.</p>
<p>My source was <A HREF="http://lesswrong.com/lw/l5k/2014_less_wrong_censussurvey/">the 2014 Less Wrong survey data</A>, which asked respondents whether they self-identified as effective altruists and whether they participated in effective altruist groups and meetups. Using that question, I separated the respondents into 758 non-effective-altruists and 422 effective altruists. The survey had also asked people whether they had been diagnosed with various mental illnesses, so I checked the rates in both groups. Including self-diagnosis there were no particular results; when I limited it to professionally diagnosed illnesses things got a little more interesting.</p>
<p>Effective altruists had about the same levels of anxiety disorders and obsessive-compulsive disorder as non-EA Less Wrongers. However, they had slightly higher levels of depression (22% vs. 17%) which was barely significant (p = 0.04) due to a large sample size. They also had more autism (8.5% vs. 5%) which was also significant (p = 0.02).</p>
<p>I expected this to be mediated by a tendency for autistic people to be more consequentialist and consequentialists to be more EA, and both these things were true to some degree, but even when I limited the analysis to all consequentialists, effective altruists still had more autism. Further, autistic people seemed to donate a higher percent of their income to charity than neurotypical people or people with other mental illnesses <i>even separated from effective altruist status</i> &#8211; that is, even among people none of whom were effective altruists, the autistic people seemed to donate more (effect not always significant) even though they generally had lower incomes.</p>
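A minimal sketch of the kind of significance test behind figures like &#8220;22% vs. 17%, p = 0.04&#8221; is a two-proportion z-test. The counts below are my back-calculation from the reported percentages and group sizes (93/422 &#8776; 22%, 129/758 &#8776; 17%), not the survey&#8217;s raw numbers:

```python
from math import sqrt, erf

def two_proportion_z_test(k1, n1, k2, n2):
    """Two-sided z-test for a difference between two proportions."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Normal CDF via the error function; two-sided p-value.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Diagnosed depression: ~22% of 422 EAs vs. ~17% of 758 non-EA LWers.
z, p = two_proportion_z_test(93, 422, 129, 758)
print(round(z, 2), round(p, 3))
```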
<p>I conclude that effective altruists are not unusually self-hating or scrupulous, but that they may be a little more autistic, and the reason why isn&#8217;t the obvious one.</p>
<p>A caveat, by way of presenting another interesting result. <A HREF="http://www.rmm-journal.de/downloads/Article_Rusch.pdf">Rusch (2015)</A> (h/t <A HREF="https://twitter.com/bechhof">@bechhof</A>) studied whether bankers were more consequentialist (in this case, more likely to give consequentialist answers to the Trolley Problem and Fat Man Problem) than nonbankers. He found that they were. But then he checked for confounders and found the result was entirely an artifact of men being more consequentialist than women and bankers being predominantly male.</p>
<p><center><IMG SRC="http://slatestarcodex.com/blog_images/banker_table.png"></center></p>
<p>This is pretty astounding &#8211; men are almost six times as consequentialist as women!</p>
<p>On the other hand, in both my Less Wrong data in general and the effective altruist subgroup, men and women don&#8217;t vary much in consequentialismishness. Either Rusch&#8217;s data is wrong, or there&#8217;s a strong filter that acts to get only consequentialists into Less Wrong regardless of gender, or LW converts women to consequentialism (without further converting men).</p>
<p>Interestingly, effective altruists were <i>much</i> more consequentialist than non-effective-altruist LWers &#8211; 80% versus 50%. They also had more women than the non-effective-altruists. So it looks like LW filters for consequentialists so strongly it gets an even balance of consequentialist men and consequentialist women, and past that stage, filtering further for consequentialism doesn&#8217;t change gender balance much.</p>
<p>This points out a limitation of my statistics above. All it shows is that effective altruists don&#8217;t differ <i>from other rationalists</i> in levels of mental illness. It&#8217;s possible and indeed likely that both effective altruists and rationalists differ from the general population in all kinds of ways. It&#8217;s even possible that self-hate and scrupulosity drive people into the rationality movement in general, although I can&#8217;t imagine why that would be. It&#8217;s just that they don&#8217;t seem to have any extra power to make people effective altruists once they&#8217;re there.</p>
]]></content:encoded>
			<wfw:commentRss>http://slatestarcodex.com/2015/03/06/effective-altruists-not-as-mentally-ill-as-you-think/feed/</wfw:commentRss>
		<slash:comments>210</slash:comments>
		</item>
		<item>
		<title>How Likely Are Multifactorial Trends?</title>
		<link>http://slatestarcodex.com/2015/02/14/how-likely-are-multifactorial-trends/</link>
		<comments>http://slatestarcodex.com/2015/02/14/how-likely-are-multifactorial-trends/#comments</comments>
		<pubDate>Sun, 15 Feb 2015 03:14:11 +0000</pubDate>
		<dc:creator><![CDATA[Scott Alexander]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://slatestarcodex.com/?p=3556</guid>
		<description><![CDATA[Vox recently wrote about 16 Theories For Why Crime Plummeted In The US. Their story is based on a report by the Brennan Center For Justice, which I haven&#8217;t read, so I&#8217;m hesitant to critique it too much. The little &#8230; <a href="http://slatestarcodex.com/2015/02/14/how-likely-are-multifactorial-trends/">Continue reading <span class="pjgm-metanav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Vox recently wrote about <A HREF="http://www.vox.com/2015/2/13/8032231/crime-drop">16 Theories For Why Crime Plummeted In The US</A>.</p>
<p>Their story is based on a report by the Brennan Center For Justice, which I haven&#8217;t read, so I&#8217;m hesitant to critique it too much. The little I got off of Vox I don&#8217;t like. For example, if I understand correctly they&#8217;re arguing that the lead-crime connection is overblown because although lead was banned in the 1970s (thus affecting people who reached peak crime-committing age in the 1990s), the decline in crime continued even into the 2000s. But lead stays in the environment a long time, there&#8217;s still a lot of work to be done eliminating various sources of lead, and so <A HREF="http://pediatrics.aappublications.org/content/123/3/e376">blood lead levels continue to decline</A>. That makes their argument ring a little hollow.</p>
<p>But I want to talk about a more meta-level point.</p>
<p>The analysis ends up concluding that there is no &#8220;smoking gun&#8221; and crime probably declined because of a bunch of reasons coming together. For example, they say that &#8220;up to 12 percent of the drop in property crime during the 1990s was due to the rise in incarceration, but it was probably more like 6 percent&#8221;, and &#8220;up to ten percent of the drop in crime in the 1990s was caused by hiring more police.&#8221; The general picture I get is that there were about ten different factors, each explaining ten percent of the decline.</p>
<p><center><IMG SRC="http://slatestarcodex.com/blog_images/crime_rate.gif"></center></p>
<p>Imagine two different perspectives on this.</p>
<p>First, a learned professor says &#8220;Oh yes, the public always wants to hear about how one big exciting thing caused the decline in crime, but that kind of thinking is unsophisticated. Something as complicated as crime is governed by many factors, and you certainly wouldn&#8217;t expect one big knockout change to lower it to this degree. Like everything else, it&#8217;s probably a combination of different things that came together, each accounting for a small percent of the variance.&#8221;</p>
<p>Second, someone counterargues: &#8220;If ten different factors caused the decline in crime, that would require that ten different things suddenly changed direction, all at the same time in 1994. That&#8217;s a pretty big coincidence. In fact, let&#8217;s reductio ad absurdum this. Imagine it was ten <i>million</i> different factors, each accounting for one ten-millionth of the decline. But that seems stupid. For example, since there are only about ten million criminals in the US, we could structure this as one factor per criminal. Imagine that, in 1994, each of America&#8217;s ten million criminals independently and coincidentally had a major life change that made crime seem less attractive. That&#8217;s ridiculous. But in that case, any other explanation based on ten million factors should seem ridiculous. And if we give a heavy credibility penalty to a story with ten million factors, we should give some credibility penalty to a story with ten factors.&#8221;</p>
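The counterargument can be quantified under a toy independence assumption (which is mine, and which correlated factors like a broadly improving economy would escape): if each factor independently has some probability p of reversing direction in any given year, the chance that all k of them reverse in the same prespecified year is p^k, which collapses fast as k grows.

```python
def same_year_probability(k, p):
    """Chance that k independent factors all reverse direction in one
    prespecified year, if each reverses with probability p per year."""
    return p ** k

# With, say, a 1-in-10 chance per factor per year:
for k in (1, 2, 10):
    print(k, same_year_probability(k, 0.1))
```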
<p>The second person seems to me to have a strong argument, which makes me think Vox and the Brennan Center&#8217;s model where ten different trends each explain about ten percent of the decline is unlikely.</p>
<p>I feel like somebody has already thought about this and there&#8217;s an entire literature I&#8217;m missing, but Google is failing me (badly &#8211; <a href="http://smile.amazon.com/gp/product/0316346624/ref=as_li_tl?ie=UTF8&#038;camp=1789&#038;creative=390957&#038;creativeASIN=0316346624&#038;linkCode=as2&#038;tag=slastacod-20&#038;linkId=DEEG5WJMII2PCB2Y">this</a><img src="http://ir-na.amazon-adsystem.com/e/ir?t=slastacod-20&#038;l=as2&#038;o=1&#038;a=0316346624" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" /> was my first search result). Can somebody point me to it? Are there ways to calculate how much less likely a ten-factor explanation is than a one-factor explanation?</p>
<p>[EDIT: Yes, there&#8217;s the trivial case where all ten factors are correlated, for example they all have to do with an improving economy. I&#8217;m talking about the non-boring version of the question.]</p>
<p>[EDIT2: I might have subconsciously absorbed this thought process from <A HREF="http://lesswrong.com/lw/kpj/multiple_factor_explanations_should_not_appear/">Stefan Schubert</A>]</p>
]]></content:encoded>
			<wfw:commentRss>http://slatestarcodex.com/2015/02/14/how-likely-are-multifactorial-trends/feed/</wfw:commentRss>
		<slash:comments>381</slash:comments>
		</item>
		<item>
		<title>Drug Testing Welfare Users Is A Sham, But Not For The Reasons You Think</title>
		<link>http://slatestarcodex.com/2015/02/14/drug-testing-welfare-users-is-a-sham-but-not-for-the-reasons-you-think/</link>
		<comments>http://slatestarcodex.com/2015/02/14/drug-testing-welfare-users-is-a-sham-but-not-for-the-reasons-you-think/#comments</comments>
		<pubDate>Sat, 14 Feb 2015 16:30:37 +0000</pubDate>
		<dc:creator><![CDATA[Scott Alexander]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[misreporting]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://slatestarcodex.com/?p=3554</guid>
		<description><![CDATA[Some people say the War on Drugs is &#8216;unwinnable&#8217;. But there&#8217;s actually a foolproof solution that cures drug addiction approximately 100% of the time. That solution is &#8211; put people on welfare in Tennessee. Or at least that is what &#8230; <a href="http://slatestarcodex.com/2015/02/14/drug-testing-welfare-users-is-a-sham-but-not-for-the-reasons-you-think/">Continue reading <span class="pjgm-metanav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Some people say the War on Drugs is &#8216;unwinnable&#8217;. But there&#8217;s actually a foolproof solution that cures drug addiction approximately 100% of the time. That solution is &#8211; put people on welfare in Tennessee.</p>
<p>Or at least that is what I am led to believe by articles like Mic&#8217;s <A HREF="http://unvis.it/mic.com/articles/95794/a-shocking-thing-happened-when-tennessee-decided-to-drug-test-its-welfare-recipients">A Shocking Thing Happened When Tennessee Decided To Drug Test Its Welfare Recipients</A>, which describes said shocking thing as:<br />
<blockquote> 1 out of 812 applicants tested positive for drugs. One. Single. Person. Tennessee conservatives suspicious that welfare recipients are a bunch of drug-addicted slackers were proven dead wrong. Big surprise! </p>
<p>After instituting dehumanizing drug-testing requirements to welfare recipients on July 1, 10 people total were flagged for possible drug use and asked to submit to testing. Five others tested negative, and four were rejected after refusing. As Think Progress notes, that means that just 0.12% of all people applying for cash assistance in Tennessee have tested positive for drugs, compared to the 8% who have reported using drugs in the past month among the state&#8217;s general population. If you assume the four people who refused were on drugs, it&#8217;s still a paltry 0.61%. </p>
<p>In other words, the plan intended to verify right-wing beliefs that welfare recipients are a bunch of drug-addicted slackers looking for a handout has demonstrated exactly the opposite.</p></blockquote>
<p>The article has 11,000 notes on Tumblr right now, I&#8217;ve seen it all over my Facebook feed as well, and the same story has been taken up, with the same editorial line, by a host of other news sources. <A HREF="http://unvis.it/jezebel.com/state-drug-testing-program-busts-a-whopping-37-welfare-1684934735">Jezebel:</A> State Drug Program Busts A Whopping 37 Welfare Applicants. <A HREF="http://blogs.wsj.com/washwire/2014/12/16/few-welfare-applicants-caught-in-drug-screening-net-so-far/">Wall Street Journal:</A> Few Welfare Applicants Caught In Drug Screening Net So Far. <A HREF="http://www.newrepublic.com/article/121009/drug-testing-welfare-recipients-texas-tennessee-tax-poor">New Republic:</A> Red States&#8217; New Tax On The Poor. <A HREF="http://www.dailykos.com/story/2015/02/13/1364275/-Tennessee-just-wasted-a-lot-of-money-drug-testing-welfare-recipients">Daily Kos</A>: Tennessee Just Wasted A Lot Of Money Drug Testing Welfare Recipients. <A HREF="http://reverbpress.com/news/another-gop-fail-0-2-percent-tennessee-welfare-recipients-found-use-illegal-drugs/">ReverbPress:</A> Another GOP Fail: 0.2% Of Tennessee Welfare Recipients Found To Use Illegal Drugs. <A HREF="http://www.mommyish.com/2015/02/11/state-drug-testing-program-only-two-percent-test-positive/">Mommyish:</A> Results Of State Drug Testing Prove Gross Assumptions About Welfare Applicants Are Wrong. <A HREF="http://www.washingtonpost.com/opinions/scott-walkers-yellow-politics/2015/02/12/1dde50c0-b2fa-11e4-827f-93f454140e2b_story.html">Washington Post:</A> Scott Walker&#8217;s Yellow Politics.</p>
<p>These stories all make the point that we have many stereotypes about the poor, and one such stereotype is that they use lots of drugs, but in fact these sorts of welfare programs find them to use fewer drugs than the general population, and therefore we should stop being so prejudiced.</p>
<p>And if they were found to use only two-thirds, or half as many drugs as the general population, this might indeed be the lesson.</p>
<p>But look at the numbers in the quoted Mic article. Welfare users use only about <i>one percent</i> as many drugs as the general population. <i>Really?</i></p>
<p>No. Not really at all. According to legitimate research in this area, poor people use as many drugs as anyone else and probably more. The National Household Survey on Drug Abuse <A HREF="http://oas.samhsa.gov/2k2/GovAid/GovAid.htm">found that</A> illegal drug use was slightly higher in families on government assistance (9.6%) than families not on government assistance (6.8%). The National Coalition For The Homeless <A HREF="http://www.nationalhomeless.org/factsheets/addiction.pdf">notes that</A> about 26% of them use drugs, which is about 2.5x as high as the general population. I crunched some data I have from the hospital I work at, and it shows that poor people (defined as people who get health insurance through an aid program) have moderately higher rates of drug use related problems than the general population. So these articles are reporting a drug use rate in the Tennessee population about one percent of that ever reported in any comparable poor population anywhere else.</p>
<p>Kate from Gruntled and Hinged brings up another curious inconsistency. The false positive rate for drug tests is &#8211; well, it depends on the test procedure, but it&#8217;s usually at least 1%. So if every single welfare user in Tennessee was 100% clean, we would <i>still</i> expect between 1% and 5% positive drug tests. Instead, they got 0.12% positive drug tests. This isn&#8217;t just suspiciously good, it&#8217;s <i>impossibly</i> good.</p>
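<p>To see just how impossible, here is a quick sanity check. The applicant count is hypothetical (a round 10,000, chosen only for illustration); the 1% false positive rate and the 0.12% observed positive rate are the figures above:</p>

```python
from math import comb

# Hypothetical round numbers for illustration: 10,000 tested applicants,
# every one of them clean, with a 1% false-positive rate per test.
n = 10_000
fp_rate = 0.01
observed_positives = 12  # 0.12% of 10,000

# P(X <= 12) for X ~ Binomial(10000, 0.01): the chance that a test with a
# 1% false-positive rate flags this few people out of an all-clean pool.
p = sum(comb(n, k) * fp_rate**k * (1 - fp_rate)**(n - k)
        for k in range(observed_positives + 1))

print(f"P(12 or fewer positives out of 10,000) = {p:.1e}")
```

<p>The probability comes out vanishingly small &#8211; dozens of orders of magnitude below anything you&#8217;d see by chance &#8211; which is what &#8220;impossibly good&#8221; means here.</p>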
<p>So what&#8217;s going on here?</p>
<p>Before I explain, here&#8217;s a collage of the stock photos displayed above some of those news stories I linked to.</p>
<p><center><IMG SRC="http://slatestarcodex.com/blog_images/urine_collage.png"></center></p>
<p><center><i>I now have a picture on my website called urine_collage.png</i></center></p>
<p>If you&#8217;re familiar with the state of the American media, you won&#8217;t be surprised to learn that urine was not involved in the overwhelming majority of this program&#8217;s drug tests.</p>
<p>So how did they test people for drugs?</p>
<p>They gave them a written test, where the test question was basically &#8220;do you use illegal drugs or not?&#8221; You can see the exact procedure on the sidebar <A HREF="http://www.timesfreepress.com/news/news/story/2012/may/20/questions-linger-on-welfare-drug-testing/78354/">here</A>.</p>
<p>And lo and behold, the overwhelming majority of people answered that they didn&#8217;t.</p>
<p><center><IMG SRC="http://slatestarcodex.com/blog_images/drug_questionnaire.png"></center></p>
<p><center><i>A more accurate stock photo they could have used</i></center></p>
<p>Now the numbers make sense. It&#8217;s not that only 0.2% of welfare recipients use drugs. All this tells us, if anything, is that 0.2% of welfare recipients are on so many drugs they can&#8217;t figure out how to check &#8220;NO&#8221; on a form.</p>
<p>Why would the government do something like this? As best I can tell, the plan was originally to give everyone urine checks, but in Florida <A HREF="http://www.politifact.com/florida/promises/scott-o-meter/promise/600/require-drug-screening-for-welfare-recipients/">the courts decided</A> that urine-checking people without prior suspicion was unconstitutional. The Republicans were pretty attached to their &#8220;drug test welfare recipients&#8221; plan and didn&#8217;t want to look like they were wimps who backed down just because of one little court case, so they decided to give people the written test in the hopes of having prior suspicion for the people who said yes. Sure, it made no sense, but they could still tell their constituents they were drug testing those welfare recipients, and <i>in principle</i> they&#8217;d won an important victory. Or something.</p>
<p>Which raises another interesting question &#8211; how did Florida&#8217;s urine-based program do before the courts struck it down?</p>
<p>According to the media, abysmally. <A HREF="http://www.msnbc.com/rachel-maddow-show/drug-testing-welfare-recipients-looks-even-worse">MSNBC:</A> Drug Testing Welfare Recipients Looks Even Worse, &#8220;[Florida Governor] Scott’s policy was an embarrassing flop. Only about 2 percent of applicants tested positive, and Florida actually lost money&#8221;. <A HREF="http://tbo.com/ap/politics/welfare-drug-testing-yields--positive-results-252458">TBO:</A> Welfare Drug Testing Yields 2% Positive Results, &#8220;Newton said that&#8217;s proof the drug-testing program is based on a stereotype, not hard facts.&#8221; <A HREF="http://www.attn.com/stories/788/drug-testing-government-handouts">ATTN:</A> Why Drug Testing Poor People Is A Waste Of Time And Money, &#8220;Florida tested welfare recipients for four months before its drug test mandate was thrown out by the courts. Only 2.6 percent of welfare recipients tested positive. The rest of the Florida&#8217;s population use drugs at a rate of 8 percent. So, again, welfare recipients used drugs less than everyone else.&#8221;</p>
<p>Now we&#8217;re merely at one-quarter of the drug use rate people with good methodologies find. Improvement!</p>
<p>So I looked up exactly how this works. Apparently welfare recipients were asked to pay for their own drug tests, and would be reimbursed if the results came back negative. 7000 welfare users did this, but <A HREF="http://www.drugfree.org/join-together/almost-1600-welfare-applicants-in-florida-decline-to-undergo-drug-testing/">1600 declined to do so</A> &#8211; numbers that were not mentioned in most of the pieces above.</p>
<p>Opponents of the program say that maybe those 1600 people could not find drug testing centers near them, or couldn&#8217;t afford to pay for the tests even with the promise of reimbursement later, or something like that. I am sure that some of them did indeed decline for reasons like those.</p>
<p>But also, people on welfare don&#8217;t have very much money [citation needed]. If I were a welfare recipient, and they were going to drug test me and not reimburse me if I came out positive, and I was on drugs, I would decline the hell out of that test.</p>
<p>Suppose that the poor in Florida use drugs at the same rate as the poor in various studies and surveys &#8211; about 10%. We have 8600 welfare recipients, so we would expect 860 drug users. Of the 7000 who agreed to testing, we know that 2.5% are drug users &#8211; that&#8217;s 175 people. That in turn would suggest that of the 1600 who refused testing, about 685 were drug users &#8211; 40% or so. That would imply that about 80% of drug users versus about 12% of nonusers refused testing.</p>
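<p>That paragraph&#8217;s arithmetic, spelled out step by step (same assumed 10% base rate from the surveys above):</p>

```python
# Back-of-the-envelope arithmetic, assuming a 10% drug-use rate among
# Florida welfare applicants (the rate reported for comparable poor
# populations in the surveys cited earlier).
tested, refused = 7000, 1600
total = tested + refused                      # 8600 applicants
expected_users = 0.10 * total                 # 860 drug users overall
users_among_tested = 0.025 * tested           # 175 positive tests
users_among_refusers = expected_users - users_among_tested  # ~685

share_of_refusers_using = users_among_refusers / refused         # ~40%
users_refusal_rate = users_among_refusers / expected_users       # ~80%
nonusers = total - expected_users
nonusers_refusal_rate = (refused - users_among_refusers) / nonusers  # ~12%

print(round(share_of_refusers_using, 2),
      round(users_refusal_rate, 2),
      round(nonusers_refusal_rate, 2))
```

<p>Everything downstream of the 10% assumption is just subtraction and division, which is why the conclusion is so sensitive to that one assumed rate.</p>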
<p>These numbers seem pretty reasonable to me. Most welfare users want to keep their benefits, so the majority will agree to testing, but a few will inevitably fall through the cracks because they can&#8217;t reach a testing center or because they have moral objections to the tests. On the other hand, clued-in drug users will realize that for them, testing means a major inconvenience and monetary charge without any likely corresponding gain. So we would expect drug users to decline testing at a higher rate than nonusers. In order to use the Florida data to say that welfare recipients <i>in general</i> use drugs at a rate of 2%, we would need to assume that drug users were no more likely to refuse drug testing than nonusers, even though the testing rewarded non-use with money but punished use with a loss of money.</p>
<p>(Note that there are some different numbers in different places for Florida. I assume these represent different years, stages of testing, parts of Florida, etc., but I&#8217;m not sure. The only one that is <i>seriously</i> different from what I&#8217;m saying above is the one that says &#8220;only 1% of people declined testing&#8221;. After some searching, I&#8217;m pretty sure that refers to the fact that only 1% of people made appointments for testing and then cancelled later. But I am less confident in the Florida numbers than in the analysis of Tennessee.)</p>
<p>So the Florida numbers are consistent with welfare recipients using drugs less, more, or the same amount as the general population.</p>
<p>So I have a question for you guys.</p>
<p>How come Brian Williams <A HREF="http://www.theguardian.com/media/2015/feb/11/brian-williams-nbc-suspends-news-anchor-for-six-months-over-helicopter-story">is being dragged over the coals</A> for lying in the media, but everyone who publishes these kinds of articles gets off scot-free?</p>
<p>If I understand correctly, Williams said that his helicopter got shot at when he was in Iraq, but in reality he was just in a helicopter in Iraq at the same time as some other helicopter nearby was getting shot at. This is obviously stretching the truth, but it seems to me it could have been worse. No important policy decisions are going to hinge upon exactly which helicopter Brian Williams was in. And he didn&#8217;t get it <i>infinitely</i> wrong &#8211; for example, there was, in some sense, a war in Iraq.</p>
<p>On the other hand, discussions of how many poor people use drugs are pretty important for all sorts of policy questions, and these people <i>completely</i> dropped the ball. So why does nobody get reprimanded for this kind of thing?</p>
<p>You might argue that Brian Williams&#8217; actions were obviously malicious and deceitful, but that screwing up drug numbers is an excusable mistake. I say it&#8217;s exactly the opposite. Brian Williams did exactly what I unfortunately do all the time &#8211; unthinkingly tell a story the much cooler way it should have happened, the way it happened in my head &#8211; rather than the way it actually did happen (my colleagues elsewhere in the psychiatry blogosphere <A HREF="http://real-psychiatry.blogspot.com/2015/02/lies-damn-lies-and-normal-brain-function.html">go further</A> and call this &#8220;normal brain function&#8221;).</p>
<p>On the other hand, I have more trouble imagining a situation in which I would accept the claim &#8220;only 0.1% of poor people use drugs, which is barely one percent of the rate in the general population&#8221; without wanting to do a little more research to see if it is true. If your reporters are capable of making this mistake honestly, <i>get better reporters</i>.</p>
<p>But I&#8217;m not sure it&#8217;s honest. A lot of these sources admit they took their story from <A HREF="http://thinkprogress.org/economy/2015/02/10/3621267/tennessee-drug-tests-after-six-months/">a Think Progress piece</A> on the issue. Think Progress <i>does</i> mention that the tests are a sham, although only in one sentence that is easy to miss. Either the secondary reporters didn&#8217;t read Think Progress thoroughly, or they consciously decided not to mention it.</p>
<p>But even if it was an honest mistake, I still have trouble excusing their arrogance. I mean look at that Jezebel article. The writer says this proves that people who think welfare recipients use drugs &#8220;consider &#8216;facts&#8217; troublesome&#8221; and that their &#8220;entire social philosophy boils down to &#8216;Ew, poor people.'&#8221;</p>
<p>You&#8217;re saying that&#8217;s not as bad as a helicopter-related embellishment?</p>
<p>Yes, okay, drug testing welfare applicants is in fact probably a bad idea. It&#8217;s a bad idea because the courts have banned doing it in a way more effective than asking them politely if they use drugs or not, but it was a bad idea even before that. It&#8217;s a bad idea because drug tests have frequent false positives, but it&#8217;s a bad idea even without that. It&#8217;s a bad idea because quitting drugs is really hard and denying people benefits isn&#8217;t going to help.</p>
<p>But if, in the service of proving this to be a bad idea, you decide it&#8217;s acceptable to fudge the numbers to make your point, horrible things happen. First, you <A HREF="http://slatestarcodex.com/2014/02/23/in-favor-of-niceness-community-and-civilization/">contribute to a culture</A> of telling lies and lose the opportunity to protest when the other side does it. Second, you make it harder to trust you on anything else.</p>
<p>But most important, <A HREF="http://lesswrong.com/lw/uy/dark_side_epistemology/">tell one lie and the truth is forever after your enemy</A>. I <A HREF="http://slatestarcodex.com/2015/02/02/practically-a-book-review-dying-to-be-free/">recently argued</A> that we need to reform suboxone prescribing laws, because it&#8217;s the best anti-addiction medicine we&#8217;ve got and right now poor people can&#8217;t access it. Why should anyone listen to me now? They can just answer &#8220;Actually, that would be a waste of money. As per an article I read in Jezebel, pretty much no poor person has ever been addicted to drugs.&#8221; Then the laws don&#8217;t get reformed and people die. </p>
]]></content:encoded>
			<wfw:commentRss>http://slatestarcodex.com/2015/02/14/drug-testing-welfare-users-is-a-sham-but-not-for-the-reasons-you-think/feed/</wfw:commentRss>
		<slash:comments>265</slash:comments>
		</item>
		<item>
		<title>The Efficacy Of Everything In Psychiatry In One Graph Plus Several Pages Of Dense But Necessary Explanation</title>
		<link>http://slatestarcodex.com/2015/02/08/the-efficacy-of-everything-in-psychiatry-in-one-graph-plus-several-pages-of-dense-but-necessary-explanation/</link>
		<comments>http://slatestarcodex.com/2015/02/08/the-efficacy-of-everything-in-psychiatry-in-one-graph-plus-several-pages-of-dense-but-necessary-explanation/#comments</comments>
		<pubDate>Sun, 08 Feb 2015 20:35:50 +0000</pubDate>
		<dc:creator><![CDATA[Scott Alexander]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[psychiatry]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://slatestarcodex.com/?p=3548</guid>
		<description><![CDATA[Shamelessly stolen from my hospital&#8217;s Journal Club: Huhn et al (2014) graph the Efficacy Of Pharmacotherapy And Psychotherapy For Adult Psychiatric Disorders, and it looks like this: Before anything else &#8211; we kind of have to assume that in each &#8230; <a href="http://slatestarcodex.com/2015/02/08/the-efficacy-of-everything-in-psychiatry-in-one-graph-plus-several-pages-of-dense-but-necessary-explanation/">Continue reading <span class="pjgm-metanav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p>Shamelessly stolen from my hospital&#8217;s Journal Club: Huhn et al (2014) graph the <A HREF="http://slatestarcodex.com/blog_images/forestplotpaper.pdf">Efficacy Of Pharmacotherapy And Psychotherapy For Adult Psychiatric Disorders</A>, and it looks like this:</p>
<p><center><IMG SRC="http://slatestarcodex.com/blog_images/forestplot1.png"></center></p>
<p>Before anything else &#8211; we kind of have to assume that in each case they&#8217;re getting a representative sample of the best drugs/therapies for the disorder. In practice, there is this <i>weird</i> equivalency for most things: most common antidepressants work about equally well, most common antipsychotics work about equally well, et cetera. I don&#8217;t know as much about therapy, but I <A HREF="http://slatestarcodex.com/2013/09/19/scientific-freud/">get the impression</A> the same thing goes on there too. So probably it&#8217;s not much of a stretch to expect that the efficacy of whatever they studied at least <i>kind of</i> translates to the effectiveness of whatever real treatment you&#8217;ll get from your own psychiatrist. At the very least, even lossy and compressed information like this will tell us something.</p>
<p>The effect sizes are mostly around 0.5, with a few much higher and a few much lower. This is common for these sorts of studies. See for example Leucht et al, <A HREF="http://www.ncbi.nlm.nih.gov/pubmed/22297588">Putting the efficacy of psychiatric and general medical medication into perspective</A>, which also finds psychiatric effect sizes average around 0.5 and finds this is about equal to average effect sizes in other fields of medicine &#8211; thus debunking the popular claim that psychiatry is less effective. Leucht and a few other authors from that piece are also involved in this one, which doesn&#8217;t surprise me much.</p>
<p>I do however admit my statistical ignorance in exactly what is going on here. Effect sizes are a good way to compare two unlike domains &#8211; for example, I <A HREF="http://slatestarcodex.com/2015/02/01/talents-part-2-attitude-vs-altitude/">recently noted</A> that leading physicists are about as smart as NBA players are tall. This paper is within that tradition. In fact, if we wanted, we could describe psychiatric medications as about one-sixth as effective as NBA players are tall. This is perfectly honorable. The height of NBA players is a tough bar to live up to. </p>
<p>But I don&#8217;t have a good intuitive feel for what it means to use standard mean differences along a non-normally distributed variable &#8211; as psychiatric diseases no doubt are. And I&#8217;m not sure where they&#8217;re even getting their distributions from. When they say schizophrenia meds have an effect size of 0.52, are they talking about the distribution of the general population, with almost everyone near zero and a few schizophrenics way off to the right? Are they talking about the distribution of how schizophrenic particular schizophrenics are, which for all I know might be a bell curve but which is probably very different depending on how you took your schizophrenia sample? I really don&#8217;t get this and it&#8217;s preventing me from getting a good feeling of exactly how comparable these numbers are to each other.</p>
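<p>For what it&#8217;s worth, the standardized mean difference itself has a simple definition: the difference between group means divided by the pooled within-sample standard deviation &#8211; the study sample&#8217;s spread, not the general population&#8217;s. A minimal sketch with made-up symptom scores (none of these numbers come from the paper):</p>

```python
from math import sqrt
from statistics import mean, stdev

# Made-up symptom scores (lower = better), purely to illustrate the formula.
treated = [8, 10, 9, 11, 10, 12]
control = [12, 13, 11, 14, 12, 13]

def cohens_d(a, b):
    """Standardized mean difference: (mean(b) - mean(a)) / pooled SD."""
    na, nb = len(a), len(b)
    pooled_sd = sqrt(((na - 1) * stdev(a)**2 + (nb - 1) * stdev(b)**2)
                     / (na + nb - 2))
    return (mean(b) - mean(a)) / pooled_sd

print(round(cohens_d(treated, control), 2))
```

<p>Since the denominator is the within-study SD of the outcome measure, the SMD depends heavily on how the particular study sampled its patients &#8211; which is exactly the comparability worry raised above.</p>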
<p>If we just assume they&#8217;re allowed to do what they&#8217;re doing, their graph looks about how I would expect it to look. Most psychiatrists always figured that the psychotic disorders were more susceptible to medication and the anxiety disorders to psychotherapy. But three surprises stand out.</p>
<p>First, this graph shows that drugs are more effective than therapy in treating borderline personality disorder. That&#8217;s the opposite of the conventional wisdom, which says that some drugs can decrease impulsiveness in this population but that the definitive treatment has always been Dialectical Behavioral Therapy. But it looks like their borderline psychotherapy &#8220;meta-analysis&#8221; had a sample size of 20 patients (I would hate to see what the individual studies had!) compared to thousands of patients for most of the believable results. So I wouldn&#8217;t place too much faith in this anomaly for now and would continue to recommend psychotherapy for borderlines.</p>
<p>Second, this graph shows that drugs are more effective than therapy for insomnia. Now, we <i>use</i> drugs instead of therapy for insomnia, but conventional wisdom had always been that this was very sad, and there was great therapy available for insomnia if only somebody would provide it. But here the therapy looks mediocre at best. On the other hand, the sample size is &#8220;NI&#8221;, which I don&#8217;t know the meaning of but which doesn&#8217;t sound promising. Also, now that <A HREF="http://www.health.harvard.edu/blog/common-anticholinergic-drugs-like-benadryl-linked-increased-dementia-risk-201501287667">anticholinergics probably cause dementia</A>, every single sleeping pill officially has terrible side effects.</p>
<p>Third, in all conditions drugs seem more effective at preventing relapse than at stopping acute episodes. My &#8220;clinical experience,&#8221; which is the fancy word doctors use for anecdotal evidence, was exactly the opposite. I now realize I probably faced a lot of selection bias &#8211; the patients who do well on their drug and don&#8217;t relapse might never see me again. Also, I have a feeling that a lot of the people who come back to me a month later and say &#8220;Well, your drug must not have worked, I&#8217;ve relapsed again&#8221; probably weren&#8217;t taking the drug correctly or at all, something which these studies probably enforce better than I can.</p>
<p>In general, the table seems to support psychotherapy being better than drugs for a lot of things. This would not be <i>too</i> surprising if true &#8211; their list is heavily tilted to the kinds of conditions therapy works well on &#8211; but a caveat is necessary.</p>
<p>The psychotherapy trials were generally of lower quality. Part of this has to do with the culture of psychotherapy research, but more has to do with the underlying territory &#8211; giving people &#8220;placebo psychotherapy&#8221; is more complicated than giving people a sugar pill, and a lot of studies don&#8217;t bother. Also, psychotherapy studies tend to have the patient&#8217;s own therapist record results, more often than the corresponding pharmacology studies use the prescriber to record them. That eliminates another layer of blinding.</p>
<p><center><IMG SRC="http://slatestarcodex.com/blog_images/forestplot2.png"></center></p>
<p>This has serious ramifications. The study finds that &#8220;low-quality psychotherapy trials in general had a higher effect size (SMD = 0.74) than high-quality trials (SMD = 0.22), p < 0.001". Those high effect sizes for psychotherapy aren't looking so good <i>now</i>, are they?</p>
<p>Actually, reread that one more time. Effect sizes for the low quality trials are triple those for the high-quality trials. If you ever wanted proof that it&#8217;s way too easy to inflate positive findings if your science isn&#8217;t really exceptionally good, there you go.</p>
<p>The most important domain where pharmacotherapy trials are worse than psychotherapy trials is publication bias. The paper suggests that this is because psychotherapy&#8217;s lack of sufficient blinding and control groups makes publication bias unnecessary. In other words, psychotherapy research isn&#8217;t even good enough to have publication bias, because publication bias at least requires you to be rigorous enough to occasionally turn up a negative result to suppress. Ouch. </p>
]]></content:encoded>
			<wfw:commentRss>http://slatestarcodex.com/2015/02/08/the-efficacy-of-everything-in-psychiatry-in-one-graph-plus-several-pages-of-dense-but-necessary-explanation/feed/</wfw:commentRss>
		<slash:comments>82</slash:comments>
		</item>
		<item>
		<title>Perceptions Of Required Ability Act As A Proxy For Actual Required Ability In Explaining The Gender Gap</title>
		<link>http://slatestarcodex.com/2015/01/24/perceptions-of-required-ability-act-as-a-proxy-for-actual-required-ability-in-explaining-the-gender-gap/</link>
		<comments>http://slatestarcodex.com/2015/01/24/perceptions-of-required-ability-act-as-a-proxy-for-actual-required-ability-in-explaining-the-gender-gap/#comments</comments>
		<pubDate>Sat, 24 Jan 2015 14:20:22 +0000</pubDate>
		<dc:creator><![CDATA[Scott Alexander]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[race/gender/etc]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://slatestarcodex.com/?p=3529</guid>
		<description><![CDATA[I. I briefly snarked about Leslie et al (2015) last week, but I should probably snark at it more rigorously and at greater length. This is the paper that concludes that &#8220;women are underrepresented in fields whose practitioners believe that &#8230; <a href="http://slatestarcodex.com/2015/01/24/perceptions-of-required-ability-act-as-a-proxy-for-actual-required-ability-in-explaining-the-gender-gap/">Continue reading <span class="pjgm-metanav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><b>I.</b></p>
<p>I briefly snarked about <A HREF="http://slatestarcodex.com/Stuff/l_paper.pdf">Leslie et al (2015)</A> last week, but I should probably snark at it more rigorously and at greater length.</p>
<p>This is the paper that concludes that &#8220;women are underrepresented in fields whose practitioners believe that raw, innate talent is the main requirement for success because women are stereotyped as not possessing that talent.&#8221; They find that some survey questions intended to capture whether people believe a field requires innate talent correlate with percent women in that field at a fairly impressive level of r = -0.60. </p>
<p>The media, science blogosphere, et cetera have taken this result and run with it. A very small sample includes: <A HREF="http://www.nsf.gov/news/news_summ.jsp?cntn_id=133857">National Science Foundation</A>: Belief In Raw Brilliance May Decrease Diversity. <A HREF="http://news.sciencemag.org/education/2015/01/belief-some-fields-require-brilliance-may-keep-women-out">Science Mag</A>: the &#8220;misguided&#8221; belief that certain scientific fields require brilliance helps explain the underrepresentation of women in those fields. <A HREF="http://www.reuters.com/article/2015/01/15/us-science-sexism-idINKBN0KO2DM20150115">Reuters:</A> Fields That Cherish Genius Shun Women. <A HREF="http://www.learnu.org/study-findings-point-to-source-of-gender-gap-in-stem-programs/">LearnU:</A> Study Findings Point To Source Of Gender Gap In STEM. <A HREF="http://www.scientificamerican.com/article/hidden-hurdle-looms-for-women-in-science/">Scientific American:</A> Hidden Hurdle Looms For Women In Science. <A HREF="http://www.chroniclecareers.com/article/Disciplines-That-Expect/151217/">Chronicle Of Higher Education:</A> Disciplines That Expect Brilliance Tend To Punish Women. <A HREF="http://www.newsworks.org/index.php/local/healthscience/77366-academic-gender-gaps-tied-to-stereotype-about-genius">News Works:</A> Academic Gender Gaps Tied To Stereotypes About Genius. <A HREF="http://mathbabe.org/2015/01/16/representation-of-women-and-the-genius-myth/">Mathbabe</A>: &#8220;The genius myth&#8221; keeps women out of science. <A HREF="http://www.vocativ.com/culture/science/women-in-science-sexism/">Vocativ:</A> Women Avoid Fields Full Of Self-Appointed Geniuses. And so on in that vein.</p>
<p>Okay. Imagine a study with the following methodology. You survey a bunch of people to get their perceptions of who is a smoker (&#8220;97% of his close friends agree Bob smokes&#8221;). Then you correlate those numbers with who gets lung cancer. Your statistics program lights up like a Christmas tree with a bunch of super-strong correlations. You conclude &#8220;Perception of being a smoker causes lung cancer&#8221;, and make up a theory about how negative stereotypes of smokers cause stress which depresses the immune system. The media reports that as &#8220;Smoking Doesn&#8217;t Cause Cancer, Stereotypes Do&#8221;.</p>
<p>This is the basic principle behind Leslie et al (2015).</p>
<p>The obvious counterargument is that people&#8217;s perceptions may be accurate, so your perception measure might be a proxy for a real thing. In the smoking study, we expect that people&#8217;s perception of smoking only correlates with lung cancer because it correlates with actual smoking which itself correlates with lung cancer. You would expect to find that perceived smoking correlates with lung cancer <i>less than</i> actual smoking, because the perceived smoking correlation is just the actual smoking correlation plus some noise resulting from misperceptions.</p>
<p>So I expected the paper to investigate whether perceived required ability correlated more strongly, equally, or less strongly than actual required ability. Instead, they simply write:</p>
<blockquote><p>Are women and African-Americans less likely to have the natural brilliance that some fields believe is required for top-level success? Although some have argued that this is so, our assessment of the literature is that the case has not been made that either group is less likely to possess innate intellectual talent<sup>1</sup>.</p></blockquote>
<p>So we will have to do this ourselves. The researchers helpfully include in their <A HREF="http://www.sciencemag.org/content/suppl/2015/01/14/347.6219.262.DC1/1261375.Leslie.SM.pdf">supplement</A> a list of the fields they studied and GRE scores for each, as part of some sub-analysis to check for selectivity.  GRE scores correlate closely with IQ and with <A HREF="http://internal.psychology.illinois.edu/~nkuncel/gre%20meta.pdf">a bunch of measures of success in graduate school</A>, so this sounds like it would be a good test of the actual required ability hypothesis. Let&#8217;s use this to figure out whether actual innate ability explains the discrepancies better or worse than perceived innate ability does.</p>
<p>When I use these data I find no effect of GRE scores on female representation.</p>
<p>But these data are surprising &#8211; for example, Computer Science had by far the lowest GRE score (and hence projected IQ?) of any field, which matches neither other sources nor my intuition. I looked more closely and found their measure combines Verbal, Quantitative, and Writing GREs. These are to some degree anti-correlated with each other across disciplines<sup>2</sup>; ie those disciplines whose students have higher Quantitative tend to have lower Writing scores (not surprising; consider a Physics department versus an English department). </p>
<p>Since the study&#8217;s analysis included two measures of verbal intelligence and only one measure of mathematical intelligence, it makes more mathematical departments appear to have lower scores and lower innate ability. Certainly a measure set up such that computer scientists get the lowest intelligence of everyone in the academy isn&#8217;t going to find innate ability related to STEM! </p>
<p>Since the gender gap tends to favor men in more mathematical subjects, if we&#8217;re checking for a basis in innate ability we should probably disentangle these tests and focus on the GRE Quantitative. I took GRE Quantitative numbers by department from <A HREF="http://www.ets.org/s/gre/pdf/gre_guide.pdf">the 2014 edition of the ETS report</A>. The results looked like this:</p>
<p><center><IMG SRC="http://slatestarcodex.com/blog_images/l_gre_math2.png"></center></p>
<p>There is a correlation of r = -0.82 (p = 0.0003) between average GRE Quantitative score and percent women in a discipline. This is among the strongest correlations I have ever seen in social science data. It is much larger than Leslie et al&#8217;s correlation with perceived innate ability<sup>3</sup>.</p>
<p>Despite its surprising size this is not a fluke. It&#8217;s very similar to what other people have found when attempting the same project. There&#8217;s a paper from 2002, <A HREF="http://www.sciencedirect.com/science/article/pii/S0191886901000228">Templer and Tomeo</A>, that tries the same thing and finds r = 0.76, p < 0.001. Randal Olson tried <A HREF="http://www.randalolson.com/2014/06/25/average-iq-of-students-by-college-major-and-gender-ratio/">a very similar project</A> on his blog a while back and got r = 0.86. My finding is right in the middle.</p>
<p>A friendly statistician went beyond my pay grade and did a sequential ANOVA on these results<sup>4</sup> and Leslie et al&#8217;s perceived-innate-ability results. They found that they could reject the hypothesis that the effect of actual innate ability was entirely mediated by perceived innate ability (p = 0.002), but could not reject the hypothesis that the effect of perceived-innate-ability was entirely mediated by actual-innate ability (p = 0.36). </p>
<p>In other words, we find no evidence for a continuing effect of people&#8217;s perceptions of innate ability after we adjust for what those perceptions say about actual innate ability, in much the same way we would expect to see no evidence for a continuing effect of people&#8217;s perceptions of smoking on lung cancer after we adjust for what those perceptions say about actual smoking.</p>
<p><b>II.</b></p>
<p>Correlation is not causation, but a potential causal mechanism can be sketched out.</p>
<p>I&#8217;m going to use terms like &#8220;ability&#8221; and &#8220;innate ability&#8221; and &#8220;genius&#8221; and &#8220;brilliance&#8221; because those are the terms Leslie et al use, but I should clarify. I&#8217;m using them the way Leslie et al seem to, as a contrast to hard work, the internal factors that give different people different payoffs per unit effort. So a genius is someone who can solve difficult problems with little effort; a dullard is one who can solve them only with great effort or not at all.</p>
<p>This use of &#8220;innate ability&#8221; is <i>not</i> the same thing as &#8220;genetically determined ability&#8221;. Genetically determined ability will be part of it, but there will also be many other factors. Environmental determinants of intelligence, like good nutrition and low lead levels. Exposure to intellectual stimulation during crucial developmental windows. The effect of stereotypes, insofar as those stereotypes globally decrease performance.  Even previous training in a field might represent &#8220;innate ability&#8221; under this definition, although later we&#8217;ll try to close that loophole.</p>
<p>Academic programs presumably want people with high ability. The GRE bills itself as an ability test, and under our expanded definition of ability this is a reasonable claim. So let&#8217;s talk about what would happen if programs selected based solely on ability as measured by GREs.</p>
<p>This is, of course, not the whole story. Programs also use a lot of other things like grades, interviews, and publications. But these are all correlated with GRE scores, and anyway it&#8217;s nice to have a single number to work with. So for now let&#8217;s suppose colleges accept applicants based entirely on GRE scores and see what happens. The STEM subjects we&#8217;re looking at here are presumably most interested in GRE Quantitative, so once again we&#8217;ll focus on that.</p>
<p>Mathematics unsurprisingly has the highest required GRE Quantitative score. Suppose that the GRE score of the average Mathematics student &#8211; 162.0 &#8211; represents the average level that Mathematics departments are aiming for &#8211; ie you must be this smart to enter.</p>
<p>The average man gets 154.3 ± 8.6 on GRE Quantitative. The average woman gets 149.4 ± 8.1.  So the threshold for Mathematics admission is 7.7 points ahead of the average male test-taker, or 0.9 male standard deviation units. This same threshold is 12.6 points ahead of the average female test-taker, or 1.55 female standard deviation units. </p>
<p>GRE scores are designed to follow a normal distribution, so we can plug all of this into our handy-dandy <A HREF="http://stattrek.com/online-calculator/normal.aspx">normal distribution calculator</A> and find that 19% of men and 6% of women taking the GRE meet the score threshold to get into graduate level Mathematics. 191,394 men and 244,712 women took the GRE last year, so there will be about 36,400 men and 14,700 women who pass the score bar and qualify for graduate level mathematics. That means the pool of people who can do graduate Mathematics is 29% female. And when we look at the actual gender balance in graduate Mathematics, it&#8217;s <i>also</i> 29% female.</p>
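<p>(For anyone who wants to check the arithmetic, here is a minimal Python sketch of the same calculation, using only the means, standard deviations, and test-taker counts quoted above; it reproduces the 19%, 6%, and 29% figures.)</p>

```python
from math import erf, sqrt

def frac_above(threshold, mean, sd):
    """Fraction of a normal distribution scoring at or above `threshold`."""
    z = (threshold - mean) / sd
    return 0.5 * (1 - erf(z / sqrt(2)))

threshold = 162.0  # average GRE Quantitative score of Mathematics grad students
men = frac_above(threshold, 154.3, 8.6)    # share of male test-takers qualifying
women = frac_above(threshold, 149.4, 8.1)  # share of female test-takers qualifying
share_female = (women * 244_712) / (men * 191_394 + women * 244_712)
print(men, women, share_female)  # roughly 0.19, 0.06, 0.29
```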
<p>Vast rivers of ink have been spilled upon the question of why so few women are in graduate Mathematics programs. Are interviewers misogynist? Are graduate students denied work-life balance? Do stereotypes cause professors to &#8220;punish&#8221; women who don&#8217;t live up to their sexist expectations? Is there a culture of sexual harassment among mathematicians? </p>
<p>But if you assume that Mathematics departments are selecting applicants based on the thing they double-dog swear they are selecting applicants based on, there is literally nothing left to be explained<sup>5</sup>.</p>
<p>I am <i>sort of</i> cheating here. The exact perfect prediction in Mathematics is a coincidence. And I can&#8217;t extend this methodology rigorously to any other subject because I would need a much more complicated model where people of a given score level are taken out of the pool as they choose the highest-score-requiring discipline, leaving fewer high-score people available for the low-score-requiring ones. Without this more complicated model, at best I can set a maximum expected gender imbalance, then eyeball whether the observed deviation from that maximum is more or less than expected. Doing such eyeballing, there are slightly fewer women in graduate Physics and Computer Science than expected and slightly more women in graduate Economics than expected.</p>
<p>But on the whole, the prediction is very good. That it is not perfect means there is still some room to talk about differences in stereotypes and work-life balance and so on creating moderate deviations from the predicted ratio in a few areas like computer science. But this is arguing over the scraps of variance left over, after differences in mathematical ability have devoured their share.</p>
<p><b>III.</b></p>
<p>There are a couple of potentially very strong objections to this hypothesis. Let me see if I can answer them.</p>
<p><u>First</u>, maybe this is a binary STEM vs. non-STEM thing. That is, STEM fields require more mathematical aptitude (obviously) and they sound like the sort to have more stereotypes about women. So is it possible that my supposedly large sample size is actually just showing an artifact of division into these two categories?</p>
<p>No. I divided the fields into STEM and non-STEM and ran an analysis within each subgroup. Within the non-STEM subgroup, there was a correlation between GRE Quantitative and percent female in a major of -0.64, p = 0.02. It is completely irresponsible to do this within the STEM subgroup, because it has n = 7, which is too small a sample size to get real results. But if we are bad people and do it anyway, we find a very similar correlation of -0.63. p is only 0.12, but with n=7 what did you expect? </p>
<p>Both of these correlations are higher than Leslie et al were able to get from their entire sample.</p>
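<p>(The correlation in question is just a Pearson r over (GRE Quantitative, percent female) pairs, one pair per field. Here is a minimal sketch; the five data points are invented for illustration &#8211; the real per-field numbers are in the spreadsheets linked at the end of the post.)</p>

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical (field GRE Q mean, percent female) pairs, not the real data:
gre = [150, 152, 154, 156, 158]
pct_female = [80, 70, 55, 40, 30]
r = pearson_r(gre, pct_female)  # strongly negative for this toy data
```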
<p><u>Second</u>, suppose that it&#8217;s something else driving gender-based patterns in academia. Maybe stereotypes or long hours or whatever. Presumably, these could operate perfectly well in undergrad. So stereotypes cause lots of men to go into undergraduate math and lots of women to go into undergraduate humanities. The men in math classes successfully learn math and the women in humanities classes successfully learn humanities. Then at the end of their time in college they all take the GRE, and unsurprisingly the men who have been taking all the math classes do better in math. In this case, the high predictive power of mathematical ability would be a <i>result</i> of stereotypes, not an alternative to them.</p>
<p>In order to investigate this possibility we could look at SAT Math instead of GRE Quantitative scores, since these would show pre-college ability. SAT scores show a gap much like that in GRE scores; in both, the percentile of the average woman is in the low 40s.</p>
<p>Here is a graph of SAT Math scores against percent women in undergraduate majors:</p>
<p><center><IMG SRC="http://slatestarcodex.com/blog_images/l_sat_math.png"></center></p>
<p>SAT Math had a correlation of -0.65, p = 0.01<sup>6</sup>. </p>
<p>This correlation is still very strong. It is still stronger than Leslie et al&#8217;s correlation with perceived required ability. But it is slightly weaker than the extremely strong correlation we find with GRE scores. Why?</p>
<p>I can&#8217;t answer that for sure, but here is a theory. The &#8220;undergraduate major&#8221; data is grabbed from what SAT test-takers put down as their preferred undergraduate major when they take the test in (usually) 11th grade. The &#8220;percent female&#8221; data is grabbed from records of degrees awarded in each field. So these are not exactly the same people on each side. One side shows the people who thought they wanted to do Physics in 11th grade. The other side shows the people who ended up completing a Physics degree. </p>
<p>The people who intend to pursue Physics but don&#8217;t end up getting a degree will be those who dropped out for some reason. While there are many reasons to drop out, one no doubt very common one is that the course was too hard. Therefore, the people who drop out will be disproportionately those with lower mathematical ability. Therefore, the average SAT Math score of 11th grade intended Physics majors will be lower than the average SAT Math score of Physics degree earners. So the analysis above likely underestimates the average SAT Math score of people in mathematical fields. This could certainly explain the lower correlation, and I predict that if we could replace our unrepresentative measure of SAT scores with a more representative one, much of the gap between this correlation and the previous one would close.</p>
<p>These data do not rule out simply pushing everything back a level and saying that these stereotypes affect what classes girls take in middle school and high school. Remember, we are using &#8220;ability&#8221; as a designation for a type of excellence, not an explanatory theory of it. This simply confirms that by eleventh grade, the gap has already formed<sup>7</sup>.</p>
<p><u>Third</u>, perhaps SAT and GRE math tests are not reflective of women&#8217;s true mathematical ability. This is the argument from stereotype threat, frequently brought up as a reason why tests should not be used to judge aptitude.</p>
<p>But this is based on a fundamental misunderstanding of stereotype threat found in the popular media, which actual researchers in the field keep trying to correct (to no avail). See for example <A HREF="http://www.psych.uw.edu.pl/jasia/sackett.pdf">Sackett, Hardison, and Cullen (2004)</A>, who point out that no research has ever claimed stereotype threat accounts for gender gaps on mathematics tests. What the research found was that, by adding an extra stereotype threat condition, you could widen those gaps further. The existing gaps on tests like the SAT and GRE correspond to the &#8220;no stereotype threat&#8221; control condition in stereotype threat experiments, and &#8220;absent stereotype threat, the two groups differ to the degree that would be expected based on differences in prior SAT scores&#8221;. Aronson and Steele, who did the original stereotype threat research and invented the field, have confirmed that this is accurate and endorsed the warning.</p>
<p>Anyway, even if the pop sci version of stereotype threat were entirely true and explained everything, it still wouldn&#8217;t rescue claims of bias or sexism in the sciences. It would merely mean that the sciences&#8217; reasonable and completely non-sexism-motivated policy of trusting test scores was ill-advised.<sup>8</sup></p>
<p><u>Fourth</u>, might there be reverse causation? That is, suppose that there are stereotypes and sexism restricting women&#8217;s entry into STEM fields, and unrelatedly men have higher test scores. Then the fields with the stereotypes would end up with the people with higher test scores, and it would look like they require more ability. Might that be all that&#8217;s happening here?</p>
<p>No. I used <A HREF="http://www.ets.org/s/gre/pdf/snapshot.pdf">gender differences in the GRE scores</A> to predict what scores we would expect each major to have if score differences came solely from differences in gender balance. This predicted less than a fifth of the variation. For example, the GRE Quantitative score difference between the average test-taker and the average Physics graduate student was 9 points, but if this were solely because of differential gender balance plus the male test advantage we would predict a difference of only 1.5 points. The effect on SAT scores is similarly underwhelming.</p>
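<p>(Concretely, this check is just a weighted average: if a field&#8217;s score advantage came entirely from its gender mix, its predicted mean would be the male and female means weighted by that mix. A sketch of the arithmetic &#8211; the 25% female share below is an assumed illustrative figure, not any field&#8217;s actual composition.)</p>

```python
# Means and counts quoted earlier from the linked ETS GRE Quantitative data.
MEAN_M, MEAN_F = 154.3, 149.4
N_M, N_F = 191_394, 244_712

# Average score over the whole test-taking pool.
overall = (N_M * MEAN_M + N_F * MEAN_F) / (N_M + N_F)

def predicted_mean(frac_female):
    """Field mean predicted if only its gender mix differs from the pool."""
    return (1 - frac_female) * MEAN_M + frac_female * MEAN_F

# An assumed 25%-female field predicts only about 1.5 points above the pool
# average, far short of the 9-point gap actually observed for Physics.
gap = predicted_mean(0.25) - overall
```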
<p><u>But</u> I think the most important thing I want to say about objections to Part II is that, whether they&#8217;re correct or not, Part I still stands. Even if the correlation between innate ability and gender balance turns out to be an artifact, Leslie et al&#8217;s correlation between perceived innate ability and gender balance is still an artifact of an artifact.</p>
<p><b>IV.</b></p>
<p>A reader of an early draft of this post pointed out the imposingly-named <A HREF="http://arxiv.org/abs/1011.0663">Nonlinear Psychometric Thresholds In Physics And Mathematics</A>. This paper uses SAT Math scores and GPA to create a model in which innate ability and hard work combine to predict the probability that a student will be successful in a certain discipline. It finds that in disciplines &#8220;such as Sociology, History, English, and Biology&#8221; these are fungible &#8211; greater work ethic can compensate for lesser innate ability and vice versa. But in disciplines such as Physics and Mathematics, this doesn&#8217;t happen. People below a certain threshold mathematical ability will be very unlikely to succeed in undergraduate Physics and Mathematics coursework no matter how hard-working they are.</p>
<p>And that brought into relief part of why this study bothers me. It ignores the pre-existing literature on the importance of innate ability versus hard work. It ignores the rigorous mathematical techniques developed to separate innate ability from hard work. Not only that, but it ignores pre-existing literature on predicting gender balance in different fields, and the pre-existing literature on GRE results and what they mean and how to use them, and all the techniques developed by people in <i>those</i> areas.</p>
<p>Having committed itself to flying blind, it takes the thing we already know how to use to predict gender balance, shoves it aside in favor of a weird proxy for that thing, and finds a result mediated by that thing being a proxy for the thing it is inexplicably ignoring. Even though it just used a proxy for aptitude to predict gender balance, everyone congratulates it for having proven that aptitude does not affect gender balance.</p>
<p>Science journalism declares that the myth that ability matters has been vanquished forever. The media take the opportunity to remind us that scientists are sexist self-appointed geniuses who use stereotypes to punish women.  And our view of an important issue becomes just a little muddier.</p>
<p>I encourage everyone to reanalyze this data and see if I&#8217;m missing something. You can find the GRE data I used <A HREF="http://slatestarcodex.com/blog_images/l_gre_stats2.xlsx">here</A> and the SAT data <A HREF="http://slatestarcodex.com/Stuff/l_sat_stats.xlsx">here</A> (both in .xlsx format).</p>
<p><b>Footnotes</b></p>
<p><font size="1"><b>1.</b> They cite for this claim, among other things, Stephen Jay Gould&#8217;s <i>The Mismeasure Of Man</i>.</p>
<p><b>2.</b> Beware the ecological fallacy; these scores are still positively correlated in individuals.</p>
<p><b>3.</b> It was also probably more highly significant, but I can&#8217;t tell for sure because (ironically) their significance result wasn&#8217;t to enough significant digits.</p>
<p><b>4.</b> There was a small error in the percent of women in Communications in the dataset I provided them with, so these numbers are off by a tiny fraction from what you will get if you try to replicate. I didn&#8217;t feel comfortable asking them to redo the entire thing, but the small error would not have changed the results significantly, and the tiny amount it would have changed them would have been in the direction of making the innate ability results more striking rather than less.</p>
<p><b>5.</b> Although Leslie et al focused on women, they believe their results could also extend to why African-Americans are underrepresented compared to European-Americans and Asian-Americans in certain subjects. They theorize that European and Asian Americans, like men, are stereotyped as innately brilliant, but African-Americans, like women, lack this stereotype. I find this a bit off &#8211; after all, in the gender results, they contrasted the male &#8220;more innately brilliant&#8221; stereotype with the female &#8220;harder-working&#8221; stereotype, but African Americans suffer from a stereotype of not being hard-working, and Asian-Americans <i>do</i> have a stereotype of being hard-working, even more so than women. Anyway, this is only a mystery if you stick to Leslie et al&#8217;s theory of stereotypes about perceived innate ability. Once you look at GRE Quantitative scores, you find that whites average 150.8, Asians average 153.9, and blacks average 143.7, and there&#8217;s not much left to explain.</p>
<p><b>6.</b> It&#8217;s hard to correlate SAT scores with majors, because the SAT data is full of tiny vocational majors that throw off the results. For example, there are two hundred people in the country studying some form of manufacturing called &#8220;precision production&#8221;, they&#8217;re almost all male, and they have very low SAT scores. On the other hand, there are a few thousand people studying something called &#8220;family science&#8221;, they&#8217;re almost all women, and they also all have very low SAT scores. The shape of gender*major*SAT scores depends almost entirely on how many of these you count. I circumvented the entire problem by just counting the fields that approximately corresponded to the ones Leslie et al counted in their graduate-level study. I tried a few different analyses using different ways of deciding which fields to count, and as long as they were vaguely motivated by a desire to include academic subjects and not the vocational subjects with very low scores, they all came out about the same.</p>
<p><b>7.</b> The argument that stereotypes cause boys to take more middle school and high school math classes than girls is somewhat argued against by the finding that <A HREF="http://www.ppic.org/main/pressrelease.asp?i=309">actually girls take more middle school and high school math classes than boys</A>. However, there are some contrary results; for example, boys are more likely than girls to take the AP Calculus test. This entire area gets so tangled up in differing levels of interest and ability and work-ethic that it&#8217;s not worth it, at <i>my</i> level of interest and ability and work ethic, to try to work it out. The best I can say is that the gap appears by the time kids take the SAT in 11th grade. </p>
<p><b>8.</b> I can&#8217;t help adding that I continue to believe that the stereotype threat literature looks like a null field which continues to exist only through publication bias and experimenter effects. The <A HREF="https://i.imgur.com/VTwdrmH.png">funnel plot</A> shows a clear peak at &#8220;zero effect&#8221; and an asymmetry indicating a publication bias for positive results (for some discussion of why I like funnel plots, see <A HREF="http://slatestarcodex.com/2014/12/12/beware-the-man-of-one-study/">here</A>.) And a closer look at the individual research <A HREF="http://en.wikipedia.org/wiki/Stereotype_threat#Criticism">shows</A> this really disturbing pattern of experiments by true believers finding positive effects, experiments by neutral parties and skeptics not finding them, replication attempts failing, and large real-world quasi-experiments turning up nothing &#8211; in a way <A HREF="http://slatestarcodex.com/2014/04/28/the-control-group-is-out-of-control/">very reminiscent of parapsychology</A>. Although I am far from 100% sure, I would tentatively place my money on the entire idea of stereotype threat vanishing into the swamp of social psychology&#8217;s <A HREF="http://www.slate.com/articles/health_and_science/science/2014/07/replication_controversy_in_psychology_bullying_file_drawer_effect_blog_posts.html">crisis of replication</A>.</font></p>
]]></content:encoded>
			<wfw:commentRss>http://slatestarcodex.com/2015/01/24/perceptions-of-required-ability-act-as-a-proxy-for-actual-required-ability-in-explaining-the-gender-gap/feed/</wfw:commentRss>
		<slash:comments>626</slash:comments>
		</item>
		<item>
		<title>Depression Is Not A Proxy For Social Dysfunction</title>
		<link>http://slatestarcodex.com/2015/01/15/depression-is-not-a-proxy-for-social-dysfunction/</link>
		<comments>http://slatestarcodex.com/2015/01/15/depression-is-not-a-proxy-for-social-dysfunction/#comments</comments>
		<pubDate>Fri, 16 Jan 2015 01:14:32 +0000</pubDate>
		<dc:creator><![CDATA[Scott Alexander]]></dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[psychiatry]]></category>
		<category><![CDATA[statistics]]></category>

		<guid isPermaLink="false">http://slatestarcodex.com/?p=3519</guid>
		<description><![CDATA[I. Here is a terrible article from the New York Post: Sorry, Liberals, Scandinavian Countries Aren&#8217;t Utopias. Its thesis is interesting and worth exploring, but instead of a principled investigation, the article just publishes a bunch of cherry-picked smears about &#8230; <a href="http://slatestarcodex.com/2015/01/15/depression-is-not-a-proxy-for-social-dysfunction/">Continue reading <span class="pjgm-metanav">&#8594;</span></a>]]></description>
				<content:encoded><![CDATA[<p><b>I.</b></p>
<p>Here is a terrible article from the New York Post: <A HREF="http://nypost.com/2015/01/11/sorry-liberals-scandinavian-countries-arent-utopias/">Sorry, Liberals, Scandinavian Countries Aren&#8217;t Utopias</A>.</p>
<p>Its thesis is interesting and worth exploring, but instead of a principled investigation, the article just publishes a bunch of cherry-picked smears about Scandinavia. Did you know that 5% of Danes have had sex with animals?</p>
<p>(What percent of people in other countries have had sex with animals? I don&#8217;t know. More important, I see no sign that the New York Post knows either.)</p>
<p>But the part that really caught my eye was statements like these:<br />
<blockquote>Why does no one seem particularly interested in visiting Denmark? Visitors say Danes are joyless to be around. Denmark suffers from high rates of alcoholism. In its use of antidepressants it ranks fourth in the world. (Its fellow Nordics the Icelanders are in front by a wide margin) &#8230; Finland, which tops the charts in many surveys, is also a leader in categories like alcoholism, murder, suicide and antidepressant usage.</p></blockquote>
<p>The Post is not the only paper to make this argument. The Guardian (<A HREF="http://www.theguardian.com/world/2014/jan/27/scandinavian-miracle-brutal-truth-denmark-norway-sweden">&#8220;The Grim Truth Behind The Scandinavian Miracle&#8221;</A>) has said much the same thing:<br />
<blockquote>Take the Danes, for instance. True, they claim to be the happiest people in the world, but why no mention of the fact they are second only to Iceland when it comes to consuming anti-depressants?&#8230;Finland has by far the highest suicide rate in the Nordic countries.</p></blockquote>
<p>I&#8217;ve heard this same argument applied to other issues; for example, in <A HREF="http://slatestarcodex.com/2014/02/11/blogging-the-anissimov-smith-reaction-debate/">his debate with Noah Smith</A>, Michael Anissimov argues against the supposed success of modern liberal society by pointing out rising rates of depression and suicide.</p>
<p>It&#8217;s really tempting to equate depression with misery and misery with social dysfunction. Danes and Finns have high levels of depression, therefore their lives must be unusually miserable, therefore Denmark and Finland are poorly-organized societies.</p>
<p>But first of all, it&#8217;s not clear that Scandinavian countries really have very high depression and suicide rates. There are a lot of collections of statistics, and <A HREF="http://en.wikipedia.org/wiki/Epidemiology_of_depression">many of</A> <A HREF="http://en.wikipedia.org/wiki/Epidemiology_of_suicide">them show</A> Scandinavia around the middle. Going by &#8220;antidepressant prescriptions&#8221; is a terrible way to do things, because it mixes amount of depression with resources devoted to treating depression &#8211; if the Scandinavian health systems are as good as everyone says, maybe they just treat a greater percent of their depressives than everywhere else.</p>
<p>But more important, even if Scandinavia does have very high rates of depression, that doesn&#8217;t tell us much about whether they&#8217;re happy or not. <i>Depression is not the same thing as being sad</i>. Sadness is a risk factor for depression &#8211; although even there I suspect that it&#8217;s very specific kinds of sadness that we haven&#8217;t yet teased out from the general construct &#8211; but it is not the condition itself. The condition itself is a complicated mess of neurotransmitters, cytokines, hormones, changes in brain structure, and goodness only knows what else.</p>
<p>Off the top of my head, here are six plausible reasons why Scandinavia could have higher rates of depression than the United States, even if it is a utopian society of perfect happiness.</p>
<p>1. Light. Scandinavia is far north [citation needed] which puts its citizens at very high risk for seasonal affective disorder, which can present as depression.</p>
<p>2. The midnight sun. Scandinavia&#8217;s weird day-night cycle could easily disrupt people&#8217;s circadian rhythms. <A HREF="http://www.ncbi.nlm.nih.gov/pubmed/21476953">Studies find</A> that &#8220;increasing evidence points to a role of the biological clock in the development of depression&#8230;it seems likely the circadian system plays a vital role in the genesis of the disorder&#8221;. This is why some European countries use <A HREF="http://en.wikipedia.org/wiki/Agomelatine">melatonergic substances</A> as antidepressants.</p>
<p>3. Parasite load. It&#8217;s positively correlated with temperature, which means Scandinavia probably has some of the lowest parasite load in the world. But low parasite load <A HREF="http://en.wikipedia.org/wiki/Hygiene_hypothesis">causes the immune system to get antsy</A> and start attacking random stuff, leading to increased risk of autoimmune disease. If there&#8217;s an immunological component to depression &#8211; and right now lots of people think there is &#8211; then that&#8217;s another risk factor right there. </p>
<p>4. Diet. The Scandinavian diet has <A HREF="http://www.nordicnibbler.com/2011/05/wheres-fresh-beef-current-state-of-food.html">unusually little fresh food</A>, because the area is a frozen wasteland and most things have to be imported from elsewhere. They&#8217;re big on frozen stuff, processed stuff, and canned stuff. I am neither an expert in Scandinavian cuisine nor in nutrition, but if depression is linked to diet and imbalance in the gut microbiome, which there&#8217;s some evidence it is, then diet is heavily implicated and the Scandinavians are in a good position to get hit extra hard.</p>
<p>5. Genetics. The New York Post article mentions that Scandinavians have an unusual variant of the MAO-A enzyme (I told you it was a weird hit piece. Scandinavia is too liberal, therefore they have bad genes?). MAO-A is also known as &#8220;the thing that processes serotonin&#8221; and &#8220;the thing that MAO inhibitors, some of the most powerful known antidepressants, inhibit&#8221;. I&#8217;m not saying this gene in particular is responsible for Scandinavian depression, I&#8217;m saying that the article itself is admitting that Scandinavia contains some genetically distinct populations and for all we know this could be involved. </p>
<p>6. Culture. Maybe <i>the</i> biggest factor in the level of depression and suicide in a culture is whether it is culturally acceptable to be depressed and commit suicide. Some of the lowest suicide rates are found in heavily religious cultures and communities who believe suicide is a mortal sin. On the other hand, one of the most suicidal countries in the world is Japan, with its heavily-mythologized history of heroic samurai taking &#8220;the honorable way out&#8221; when they had brought shame upon themselves. Well, Scandinavia is one of the least religious regions in the world. And all I know about their culture is that they produce about 100% of good death metal, and their native mythology ends with the world being plunged into eternal winter and the gods being eaten by wolves.</p>
<p><b>II.</b></p>
<p>But all this is just speculation. Let me give a concrete example of a case where social dysfunction doesn&#8217;t track depression and suicidality in a predictable way.</p>
<p>What about white versus black Americans? To some degree these two groups live in separate &#8220;societies&#8221;. Most people would consider the white society better off in most ways &#8211; higher income, better health, more family stability, less involvement with the criminal justice system. If White America and Black America were countries, White America would get all of the accolades currently given to the Scandinavians.</p>
<p>But American whites <A HREF="http://www.ncbi.nlm.nih.gov/pubmed/15914823">have higher rates of depression than blacks</A>. There are the usual contradictory studies and arguments about how to adjust for which confounder, but I&#8217;m pretty sure this is something like a consensus position right now. More solidly, white Americans <A HREF="http://www.cdc.gov/violenceprevention/suicide/statistics/rates02.html">have much higher suicide rates</A> than black Americans.</p>
<p>(although I feel bad mentioning this, because the stereotype that blacks <i>never</i> commit suicide is wrong and sometimes prevents black people from getting the help they need.)</p>
<p>We can go a few centuries back and get even more surprising results. Although it&#8217;s difficult to get data from the era, analyses of suicide rate among African-American slaves in the antebellum South describe it as <A HREF="http://scholarworks.montana.edu/xmlui/handle/1/1654">&#8220;surprisingly low&#8221;</A>. I can&#8217;t find any hard evidence proving <A HREF="http://www.artsjournal.com/culturegulf/2007/05/the_music_will_still_be_wonder.html">Kurt Vonnegut&#8217;s contention</A> that &#8220;the suicide rate per capita among slave owners was much higher than the suicide rate among slaves&#8221;, but it seems to have been commonly believed. Kneeland writes:<br />
<blockquote>&#8220;[These low suicide rates are] consistent with suicide rates for Africa and for people of African descent living in other areas of the world, and further supports the theory that a low suicide rate is an element of African culture.&#8221;</p></blockquote>
<p>If you&#8217;re going to say that Scandinavia&#8217;s higher depression and suicide rates mean Scandinavia has it worse off than America, you also need to theorize that white people have it worse off than black people, including black slaves. Why don&#8217;t you go post something to that effect on Tumblr and see what they have to say? I&#8217;ll wait.</p>
<p><b>III.</b></p>
<p>Or maybe we&#8217;re barking up entirely the wrong tree. What if it&#8217;s not even that happy, well-functioning societies can sometimes still end up with high suicide rates? What if people become suicidally depressed precisely <A HREF="http://slatestarcodex.com/2014/12/25/book-review-whats-wrong-with-the-world/"><i>because</i></A> they live in happy, well-functioning societies?</p>
<p>This is the fascinating hypothesis of <A HREF="http://media.bonnint.net/slc/2491/249111/24911172.pdf">Daly, Oswald, and Wu (2011)</A>, who after crunching the numbers find pretty convincingly that &#8220;suicide rates tend to be highest in happy places&#8221;:<br />
<blockquote>A little-noted puzzle is that many of [the happiest] places have unusually high rates of suicide. While this fact has been remarked on occasionally for individual nations, especially for the case of Denmark, it has usually been attributed in an anecdotal way to idiosyncratic features of the location in question (eg the dark winters in Scandinavia), definitional variations in the measurement of well-being and suicide, and differences in culture and social attitudes regarding happiness and taking one&#8217;s life. Most scholars have not thought of the anecdotal observation as a systematic relationship that might be robust to replication or investigation&#8230;this paper attempts to document the existence of a happiness-suicide paradox: happier areas have a higher percentage of suicides.</p></blockquote>
<p>They then go on to show a strong positive relationship between average self-reported happiness and suicidality across Western nations &#8211; Greece is both the least happy country and the one with the lowest suicide rate &#8211; and US states, where confirmed hellholes New York and New Jersey are at or near the bottom. The relationship holds whether you adjust for confounders (including income!) or not.</p>
<p>I expected this to be a straightforward effect of modernization/industrialization/liberalism, as per Michael Anissimov&#8217;s hypothesis. The country-level data maybe sort of vaguely supports that trend &#8211; Greece and Portugal are our token incompletely-modernized countries and have very low suicide rates, Scandinavia is high, and everywhere else is sort of a toss-up. But US states really really don&#8217;t support that hypothesis &#8211; New York and Jersey both seem high on the modernization/industrialization/liberalism axis, and they&#8217;re right in the bottom left corner of the study&#8217;s graphs along with Greece and Portugal. Meanwhile, tropical paradise Hawai&#8217;i is suicidal as heck, even though it doesn&#8217;t seem especially modern/industrial/liberalized. The US state data also torpedo &#8211; albeit less conclusively &#8211; an attempt to make the whole issue one of latitude.</p>
<p>One caveat I do have about the US data is that several of the happiest and most suicidal states &#8211; at least on the unadjusted plot &#8211; are also high-altitude. Utah, Wyoming, Colorado, Montana, and Idaho are all up there at the top left side of the graph. But we already know there&#8217;s <A HREF="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3114154/">a strong positive relationship between altitude and suicide in 2584 US counties</A>, probably because the brain&#8217;s emotional regulation system doesn&#8217;t work well in low-oxygen environments. If we assume people living in beautiful open forested mountain areas are especially happy, that takes away a big chunk of the graph right there. But it leaves other chunks untouched, and I don&#8217;t think it&#8217;s going to be that simple.</p>
<p>The authors&#8217; preferred explanation is that suicide is an effect of relative rather than absolute misery. If you&#8217;re depressed and everybody around you is very happy, that makes things worse than if you&#8217;re depressed and everyone around you is also pretty miserable. Thus suicide is more common in happier societies.</p>
<p>I really don&#8217;t like this theory. Although everyone else should be happier in these societies, the person in question who might or might not commit suicide should also be, on average, happier. There&#8217;s no reason to think that the average hedonic <i>distance</i> between potential suicides and their neighbors is higher in these areas. Indeed, given that Scandinavia &#8211; and many of the other happy societies &#8211; are also some of the most equal societies, I would expect an unusually low hedonic distance between people. And in fact, I notice that <A HREF="http://www.ncbi.nlm.nih.gov/pubmed/16276752">suicide rates by country are negatively correlated with inequality</A> &#8211; that is, the more unequal the country, the lower the suicide rate (wow, I <i>definitely</i> don&#8217;t remember seeing that one in <a href="http://smile.amazon.com/gp/product/1608193411/ref=as_li_tl?ie=UTF8&#038;camp=1789&#038;creative=390957&#038;creativeASIN=1608193411&#038;linkCode=as2&#038;tag=slastacod-20&#038;linkId=I4VK46BYUGNE3ZXD"><i>The Spirit Level</i></a>.)</p>
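<p>To make this objection concrete, here&#8217;s a toy sketch (my own made-up model, not anything from the paper): suppose happiness in a society is roughly normally distributed, and suicide happens below some misery threshold. If the threshold is <i>absolute</i>, a happier society straightforwardly has fewer suicides. If the threshold is <i>relative</i> &#8211; some fixed distance below your neighbors &#8211; then uniformly raising everyone&#8217;s happiness changes nothing, because the distances don&#8217;t move. Either way, the relative-misery theory only predicts <i>more</i> suicide in happier places if the spread of happiness is also wider there, which is exactly what the equality data argue against.</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def suicide_rates(mu, sigma, n=1_000_000):
    """Toy model: happiness ~ Normal(mu, sigma); two candidate suicide rules."""
    h = rng.normal(mu, sigma, n)
    # Absolute-misery rule: suicide below a fixed happiness level.
    absolute = (h < -2.0).mean()
    # Relative-misery rule: suicide when far below the local average.
    relative = (h < h.mean() - 2.0 * h.std()).mean()
    return absolute, relative

abs_sad, rel_sad = suicide_rates(mu=0.0, sigma=1.0)    # less happy society
abs_happy, rel_happy = suicide_rates(mu=1.0, sigma=1.0)  # happier, same spread

# Absolute rule: the happier society has fewer suicides.
# Relative rule: both societies have the same rate, since a uniform
# shift in happiness leaves every hedonic distance unchanged.
```

<p>(All the numbers here &#8211; the thresholds, the normal distribution &#8211; are arbitrary choices for illustration.)</p>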
<p>On the other hand, I can&#8217;t for the life of me think of a better theory, so whatever.</p>
<p>Other things that increase suicide rates, by the way, include <A HREF="http://en.wikipedia.org/wiki/Seasonal_effects_on_suicide_rates">springtime</A>, <A HREF="http://journal.cpha.ca/index.php/cjph/article/viewFile/1059/1059">nice weather</A>, <A HREF="http://www.ncbi.nlm.nih.gov/pubmed/23021379">high levels of education</A>, and very occasionally <A HREF="http://en.wikipedia.org/wiki/Paradoxical_effect#Antidepressants">antidepressants</A>. My father, a very hard-headed internist, makes fun of me for doing psychiatry because &#8220;the whole field is just common sense&#8221;, but sometimes it <i>really isn&#8217;t</i>.</p>
<p>So you should probably think very carefully before using a difference in depression or suicide rates to support your pet theory about which societies work better than others.</p>
]]></content:encoded>
			<wfw:commentRss>http://slatestarcodex.com/2015/01/15/depression-is-not-a-proxy-for-social-dysfunction/feed/</wfw:commentRss>
		<slash:comments>300</slash:comments>
		</item>
	</channel>
</rss>
