Slate Star Codex

In a mad world, all blogging is psychiatry blogging

AI Researchers On AI Risk

I first became interested in AI risk back around 2007. At the time, most people’s response to the topic was “Haha, come back when anyone believes this besides random Internet crackpots.”

Over the next few years, a series of extremely bright and influential figures including Bill Gates, Stephen Hawking, and Elon Musk publicly announced they were concerned about AI risk, along with hundreds of other intellectuals, from Oxford philosophers to MIT cosmologists to Silicon Valley tech investors. So we came back.

Then the response changed to “Sure, a couple of random academics and businesspeople might believe this stuff, but never real experts in the field who know what’s going on.”

Thus pieces like Popular Science’s Bill Gates Fears AI, But AI Researchers Know Better:

When you talk to A.I. researchers—again, genuine A.I. researchers, people who grapple with making systems that work at all, much less work too well—they are not worried about superintelligence sneaking up on them, now or in the future. Contrary to the spooky stories that Musk seems intent on telling, A.I. researchers aren’t frantically installing firewalled summoning chambers and self-destruct countdowns.

And Fusion.net’s The Case Against Killer Robots From A Guy Actually Building AI:

Andrew Ng builds artificial intelligence systems for a living. He taught AI at Stanford, built AI at Google, and then moved to the Chinese search engine giant, Baidu, to continue his work at the forefront of applying artificial intelligence to real-world problems. So when he hears people like Elon Musk or Stephen Hawking—people who are not intimately familiar with today’s technologies—talking about the wild potential for artificial intelligence to, say, wipe out the human race, you can practically hear him facepalming.

And now Ramez Naam of Marginal Revolution is trying the same thing with What Do AI Researchers Think Of The Risk Of AI?:

Elon Musk, Stephen Hawking, and Bill Gates have recently expressed concern that development of AI could lead to a ‘killer AI’ scenario, and potentially to the extinction of humanity. None of them are AI researchers or have worked substantially with AI that I know of. What do actual AI researchers think of the risks of AI?

It quotes the same couple of cherry-picked AI researchers as all the other stories – Andrew Ng, Yann LeCun, etc – then stops without mentioning whether there are alternate opinions.

There are. AI researchers, including some of the leaders in the field, have been instrumental in raising issues about AI risk and superintelligence from the very beginning. I want to start by listing some of these people, as kind of a counter-list to Naam’s, then go into why I don’t think this is a “controversy” in the classical sense that dueling lists of luminaries might lead you to expect.

The criteria for my list: I’m only mentioning the most prestigious researchers, either full professors at good schools with lots of highly-cited papers, or else very-well respected scientists in industry working at big companies with good track records. They have to be involved in AI and machine learning. They have to have multiple strong statements supporting some kind of view about a near-term singularity and/or extreme risk from superintelligent AI. Some will have written papers or books about it; others will have just gone on the record saying they think it’s important and worthy of further study.

If anyone disagrees with the inclusion of a figure here, or knows someone important I forgot, let me know and I’ll make the appropriate changes:

* * * * * * * * * *

Stuart Russell (wiki) is Professor of Computer Science at Berkeley, winner of the IJCAI Computers And Thought Award, Fellow of the Association for Computing Machinery, Fellow of the American Association for the Advancement of Science, Director of the Center for Intelligent Systems, Blaise Pascal Chair in Paris, etc, etc. He is the co-author of Artificial Intelligence: A Modern Approach, the classic textbook in the field used by 1200 universities around the world. On his website, he writes:

The field [of AI] has operated for over 50 years on one simple assumption: the more intelligent, the better. To this must be conjoined an overriding concern for the benefit of humanity. The argument is very simple:

1. AI is likely to succeed.
2. Unconstrained success brings huge risks and huge benefits.
3. What can we do now to improve the chances of reaping the benefits and avoiding the risks?

Some organizations are already considering these questions, including the Future of Humanity Institute at Oxford, the Centre for the Study of Existential Risk at Cambridge, the Machine Intelligence Research Institute in Berkeley, and the Future of Life Institute at Harvard/MIT. I serve on the Advisory Boards of CSER and FLI.

Just as nuclear fusion researchers consider the problem of containment of fusion reactions as one of the primary problems of their field, it seems inevitable that issues of control and safety will become central to AI as the field matures. The research questions are beginning to be formulated and range from highly technical (foundational issues of rationality and utility, provable properties of agents, etc.) to broadly philosophical.

He makes a similar point on edge.org, writing:

As Steve Omohundro, Nick Bostrom, and others have explained, the combination of value misalignment with increasingly capable decision-making systems can lead to problems—perhaps even species-ending problems if the machines are more capable than humans. Some have argued that there is no conceivable risk to humanity for centuries to come, perhaps forgetting that the interval of time between Rutherford’s confident assertion that atomic energy would never be feasibly extracted and Szilárd’s invention of the neutron-induced nuclear chain reaction was less than twenty-four hours.

He has also tried to serve as an ambassador about these issues to other academics in the field, writing:

What I’m finding is that senior people in the field who have never publicly evinced any concern before are privately thinking that we do need to take this issue very seriously, and the sooner we take it seriously the better.

David McAllester (wiki) is professor and Chief Academic Officer at the U Chicago-affiliated Toyota Technological Institute, and formerly served on the faculty of MIT and Cornell. He is a fellow of the American Association for Artificial Intelligence, has authored over a hundred publications, has done research in machine learning, programming language theory, automated reasoning, AI planning, and computational linguistics, and was a major influence on the algorithms for famous chess computer Deep Blue. According to an article in the Pittsburgh Tribune Review:

Chicago professor David McAllester believes it is inevitable that fully automated intelligent machines will be able to design and build smarter, better versions of themselves, an event known as the Singularity. The Singularity would enable machines to become infinitely intelligent, and would pose an ‘incredibly dangerous scenario’, he says.

On his personal blog Machine Thoughts, he writes:

Most computer science academics dismiss any talk of real success in artificial intelligence. I think that a more rational position is that no one can really predict when human level AI will be achieved. John McCarthy once told me that when people ask him when human level AI will be achieved he says between five and five hundred years from now. McCarthy was a smart man. Given the uncertainties surrounding AI, it seems prudent to consider the issue of friendly AI…

The early stages of artificial general intelligence (AGI) will be safe. However, the early stages of AGI will provide an excellent test bed for the servant mission or other approaches to friendly AI. An experimental approach has also been promoted by Ben Goertzel in a nice blog post on friendly AI. If there is a coming era of safe (not too intelligent) AGI then we will have time to think further about later more dangerous eras.

He attended the AAAI Panel On Long-Term AI Futures, where he chaired the panel on Long-Term Control and was described as saying:

McAllester chatted with me about the upcoming ‘Singularity’, the event where computers out-think humans. He wouldn’t commit to a date for the singularity but said it could happen in the next couple of decades and will definitely happen eventually. Here are some of McAllester’s views on the Singularity. There will be two milestones: Operational Sentience, when we can easily converse with computers, and the AI Chain Reaction, when a computer can bootstrap itself to a better self and repeat. We’ll notice the first milestone in automated help systems that will genuinely be helpful. Later on computers will actually be fun to talk to. The point where computers can do anything humans can do will require the second milestone.

Hans Moravec (wiki) is a former professor at the Robotics Institute of Carnegie Mellon University, namesake of Moravec’s Paradox, and founder of the SeeGrid Corporation for industrial robotic visual systems. His Sensor Fusion in Certainty Grids for Mobile Robots has been cited over a thousand times, and he was invited to write the Encyclopedia Britannica article on robotics back when encyclopedia articles were written by the world expert in a field rather than by hundreds of anonymous Internet commenters.

He is also the author of Robot: Mere Machine to Transcendent Mind, which Amazon describes as:

In this compelling book, Hans Moravec predicts machines will attain human levels of intelligence by the year 2040, and that by 2050, they will surpass us. But even though Moravec predicts the end of the domination by human beings, his is not a bleak vision. Far from railing against a future in which machines rule the world, Moravec embraces it, taking the startling view that intelligent robots will actually be our evolutionary heirs. Moravec goes further and states that by the end of this process “the immensities of cyberspace will be teeming with unhuman superminds, engaged in affairs that are to human concerns as ours are to those of bacteria”.

Shane Legg is co-founder of DeepMind Technologies (wiki), an AI startup that was bought by Google in 2014 for about $500 million. He earned his PhD at the Dalle Molle Institute for Artificial Intelligence in Switzerland and also worked at the Gatsby Computational Neuroscience Unit in London. His dissertation Machine Superintelligence concludes:

If there is ever to be something approaching absolute power, a superintelligent machine would come close. By definition, it would be capable of achieving a vast range of goals in a wide range of environments. If we carefully prepare for this possibility in advance, not only might we avert disaster, we might bring about an age of prosperity unlike anything seen before.

In a later interview, he states:

AI is now where the internet was in 1988. Demand for machine learning skills is quite strong in specialist applications (search companies like Google, hedge funds and bio-informatics) and is growing every year. I expect this to become noticeable in the mainstream around the middle of the next decade. I expect a boom in AI around 2020 followed by a decade of rapid progress, possibly after a market correction. Human level AI will be passed in the mid 2020’s, though many people won’t accept that this has happened. After this point the risks associated with advanced AI will start to become practically important…I don’t know about a “singularity”, but I do expect things to get really crazy at some point after human level AGI has been created. That is, some time from 2025 to 2040.

He and his co-founders Demis Hassabis and Mustafa Suleyman have signed the Future of Life Institute petition on AI risks, and one of their conditions for joining Google was that the company agree to set up an AI Ethics Board to investigate these issues.

Steve Omohundro (wiki) is a former Professor of Computer Science at University of Illinois, founder of the Vision and Learning Group and the Center for Complex Systems Research, and inventor of various important advances in machine learning and machine vision. His work includes lip-reading robots, the StarLisp parallel programming language, and geometric learning algorithms. He currently runs Self-Aware Systems, “a think-tank working to ensure that intelligent technologies are beneficial for humanity”. His paper Basic AI Drives helped launch the field of machine ethics by pointing out that superintelligent systems will converge upon certain potentially dangerous goals. He writes:

We have shown that all advanced AI systems are likely to exhibit a number of basic drives. It is essential that we understand these drives in order to build technology that enables a positive future for humanity. Yudkowsky has called for the creation of ‘friendly AI’. To do this, we must develop the science underlying ‘utility engineering’, which will enable us to design utility functions that will give rise to the consequences we desire…The rapid pace of technological progress suggests that these issues may become of critical importance soon.

See also his section here on “Rational AI For The Greater Good”.

Murray Shanahan (site) earned his PhD in Computer Science from Cambridge and is now Professor of Cognitive Robotics at Imperial College London. He has published papers in areas including robotics, logic, dynamic systems, computational neuroscience, and philosophy of mind. He is currently writing a book The Technological Singularity which will be published in August; Amazon’s blurb says:

Shanahan describes technological advances in AI, both biologically inspired and engineered from scratch. Once human-level AI — theoretically possible, but difficult to accomplish — has been achieved, he explains, the transition to superintelligent AI could be very rapid. Shanahan considers what the existence of superintelligent machines could mean for such matters as personhood, responsibility, rights, and identity. Some superhuman AI agents might be created to benefit humankind; some might go rogue. (Is Siri the template, or HAL?) The singularity presents both an existential threat to humanity and an existential opportunity for humanity to transcend its limitations. Shanahan makes it clear that we need to imagine both possibilities if we want to bring about the better outcome.

Marcus Hutter (wiki) is a professor in the Research School of Computer Science at Australian National University. He has previously worked with the Dalle Molle Institute for Artificial Intelligence and National ICT Australia, and done work on reinforcement learning, Bayesian sequence prediction, complexity theory, Solomonoff induction, computer vision, and genomic profiling. He has also written extensively on the Singularity. In Can Intelligence Explode?, he writes:

This century may witness a technological explosion of a degree deserving the name singularity. The default scenario is a society of interacting intelligent agents in a virtual world, simulated on computers with hyperbolically increasing computational resources. This is inevitably accompanied by a speed explosion when measured in physical time units, but not necessarily by an intelligence explosion…if the virtual world is inhabited by interacting free agents, evolutionary pressures should breed agents of increasing intelligence that compete about computational resources. The end-point of this intelligence evolution/acceleration (whether it deserves the name singularity or not) could be a society of these maximally intelligent individuals. Some aspect of this singularitarian society might be theoretically studied with current scientific tools. Way before the singularity, even when setting up a virtual society in our imagination, there are likely some immediate differences, for example that the value of an individual life suddenly drops, with drastic consequences.

Jürgen Schmidhuber (wiki) is Professor of Artificial Intelligence at the University of Lugano and former Professor of Cognitive Robotics at the Technische Universität München. He makes some of the most advanced neural networks in the world, has done further work in evolutionary robotics and complexity theory, and is a fellow of the European Academy of Sciences and Arts. In Singularity Hypotheses, Schmidhuber argues that “if future trends continue, we will face an intelligence explosion within the next few decades”. When asked directly about AI risk on a Reddit AMA thread, he answered:

Stuart Russell’s concerns [about AI risk] seem reasonable. So can we do anything to shape the impacts of artificial intelligence? In an answer hidden deep in a related thread I just pointed out: At first glance, recursive self-improvement through Gödel Machines seems to offer a way of shaping future superintelligences. The self-modifications of Gödel Machines are theoretically optimal in a certain sense. A Gödel Machine will execute only those changes of its own code that are provably good, according to its initial utility function. That is, in the beginning you have a chance of setting it on the “right” path. Others, however, may equip their own Gödel Machines with different utility functions. They will compete. In the resulting ecology of agents, some utility functions will be more compatible with our physical universe than others, and find a niche to survive. More on this in a paper from 2012.

Richard Sutton (wiki) is professor and iCORE chair of computer science at University of Alberta. He is a fellow of the Association for the Advancement of Artificial Intelligence, co-author of the most-used textbook on reinforcement learning, and discoverer of temporal difference learning, one of the most important methods in the field.

In his talk at the Future of Life Institute’s Future of AI Conference, Sutton states that there is “certainly a significant chance within all of our expected lifetimes” that human-level AI will be created, then goes on to say the AIs “will not be under our control”, “will compete and cooperate with us”, and that “if we make superintelligent slaves, then we will have superintelligent adversaries”. He concludes that “We need to set up mechanisms (social, legal, political, cultural) to ensure that this works out well” but that “inevitably, conventional humans will be less important.” He has also mentioned these issues at a presentation to the Gatsby Institute in London and in (of all things) a Glenn Beck book: “Richard Sutton, one of the biggest names in AI, predicts an intelligence explosion near the middle of the century”.

Andrew Davison (site) is Professor of Robot Vision at Imperial College London, leader of the Robot Vision Research Group and Dyson Robotics Laboratory, and inventor of the computerized localization-mapping system MonoSLAM. On his website, he writes:

At the risk of going out on a limb in the proper scientific circles to which I hope I belong(!), since 2006 I have begun to take very seriously the idea of the technological singularity: that exponentially increasing technology might lead to super-human AI and other developments that will change the world utterly in the surprisingly near future (i.e. perhaps the next 20–30 years). As well as from reading books like Kurzweil’s ‘The Singularity is Near’ (which I find sensational but on the whole extremely compelling), this view comes from my own overview of incredible recent progress of science and technology in general and specifically in the fields of computer vision and robotics within which I am personally working. Modern inference, learning and estimation methods based on Bayesian probability theory (see Probability Theory: The Logic of Science or free online version, highly recommended), combined with the exponentially increasing capabilities of cheaply available computer processors, are becoming capable of amazing human-like and super-human feats, particularly in the computer vision domain.

It is hard to even start thinking about all of the implications of this, positive or negative, and here I will just try to state facts and not offer much in the way of opinions (though I should say that I am definitely not in the super-optimistic camp). I strongly think that this is something that scientists and the general public should all be talking about. I’ll make a list here of some ‘singularity indicators’ I come across and try to update it regularly. These are little bits of technology or news that I come across which generally serve to reinforce my view that technology is progressing in an extraordinary, faster and faster way that will have consequences few people are yet really thinking about.

Alan Turing and I. J. Good (wiki, wiki) are men who need no introduction. Turing invented the mathematical foundations of computing and shares his name with Turing machines, Turing completeness, and the Turing Test. Good worked with Turing at Bletchley Park, helped build some of the first computers, and invented various landmark algorithms like the Fast Fourier Transform. In his paper “Can Digital Machines Think?”, Turing writes:

Let us now assume, for the sake of argument, that these machines are a genuine possibility, and look at the consequences of constructing them. To do so would of course meet with great opposition, unless we have advanced greatly in religious tolerance since the days of Galileo. There would be great opposition from the intellectuals who were afraid of being put out of a job. It is probable though that the intellectuals would be mistaken about this. There would be plenty to do in trying to keep one’s intelligence up to the standards set by the machines, for it seems probable that once the machine thinking method had started, it would not take long to outstrip our feeble powers…At some stage therefore we should have to expect the machines to take control.

During his time at the Atlas Computer Laboratory in the 60s, Good expanded on this idea in Speculations Concerning The First Ultraintelligent Machine, which argued:

Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an ‘intelligence explosion,’ and the intelligence of man would be left far behind. Thus the first ultraintelligent machine is the last invention that man need ever make.

* * * * * * * * * *

I worry this list will make it look like there is some sort of big “controversy” in the field between “believers” and “skeptics” with both sides lambasting the other. This has not been my impression.

When I read the articles about skeptics, I see them making two points over and over again. First, we are nowhere near human-level intelligence right now, let alone superintelligence, and there’s no obvious path to get there from here. Second, if you start demanding bans on AI research then you are an idiot.

I agree whole-heartedly with both points. So do the leaders of the AI risk movement.

A survey of AI researchers (Muller & Bostrom, 2014) finds that on average they expect a 50% chance of human-level AI by 2040 and 90% chance of human-level AI by 2075. On average, 75% believe that superintelligence (“machine intelligence that greatly surpasses the performance of every human in most professions”) will follow within thirty years of human-level AI. There are some reasons to worry about sampling bias based on eg people who take the idea of human-level AI seriously being more likely to respond (though see the attempts made to control for such in the survey) but taken seriously it suggests that most AI researchers think there’s a good chance this is something we’ll have to worry about within a generation or two.

But outgoing MIRI director Luke Muehlhauser and Future of Humanity Institute director Nick Bostrom are both on record saying they have significantly later timelines for AI development than the scientists in the survey. If you look at Stuart Armstrong’s AI Timeline Prediction Data there doesn’t seem to be any general law that the estimates from AI risk believers are any earlier than those from AI risk skeptics. In fact, the latest estimate on the entire table is from Armstrong himself; Armstrong nevertheless currently works at the Future of Humanity Institute raising awareness of AI risk and researching superintelligence goal alignment.

The difference between skeptics and believers isn’t about when human-level AI will arrive, it’s about when we should start preparing.

Which brings us to the second non-disagreement. The “skeptic” position seems to be that, although we should probably get a couple of bright people to start working on preliminary aspects of the problem, we shouldn’t panic or start trying to ban AI research.

The “believers”, meanwhile, insist that although we shouldn’t panic or start trying to ban AI research, we should probably get a couple of bright people to start working on preliminary aspects of the problem.

Yann LeCun is probably the most vocal skeptic of AI risk. He was heavily featured in the Popular Science article, was quoted in the Marginal Revolution post, and spoke to KDNuggets and IEEE on “the inevitable singularity questions”, which he describes as “so far out that we can write science fiction about it”. But when asked to clarify his position a little more, he said:

Elon [Musk] is very worried about existential threats to humanity (which is why he is building rockets with the idea of sending humans colonize other planets). Even if the risk of an A.I. uprising is very unlikely and very far in the future, we still need to think about it, design precautionary measures, and establish guidelines. Just like bio-ethics panels were established in the 1970s and 1980s, before genetic engineering was widely used, we need to have A.I.-ethics panels and think about these issues. But, as Yoshua [Bengio] wrote, we have quite a bit of time

Eric Horvitz is another expert often mentioned as a leading voice of skepticism and restraint. His views have been profiled in articles like Out Of Control AI Will Not Kill Us, Believes Microsoft Research Chief and Nothing To Fear From Artificial Intelligence, Says Microsoft’s Eric Horvitz. But here’s what he says in a longer interview with NPR:

KASTE: Horvitz doubts that one of these virtual receptionists could ever lead to something that takes over the world. He says that’s like expecting a kite to evolve into a 747 on its own. So does that mean he thinks the singularity is ridiculous?

Mr. HORVITZ: Well, no. I think there’s been a mix of views, and I have to say that I have mixed feelings myself.

KASTE: In part because of ideas like the singularity, Horvitz and other A.I. scientists have been doing more to look at some of the ethical issues that might arise over the next few years with narrow A.I. systems. They’ve also been asking themselves some more futuristic questions. For instance, how would you go about designing an emergency off switch for a computer that can redesign itself?

Mr. HORVITZ: I do think that the stakes are high enough where even if there was a low, small chance of some of these kinds of scenarios, that it’s worth investing time and effort to be proactive.

Which is pretty much the same position as a lot of the most zealous AI risk proponents. With enemies like these, who needs friends?

A Slate article called Don’t Fear Artificial Intelligence also gets a surprising amount right:

As Musk himself suggests elsewhere in his remarks, the solution to the problem [of AI risk] lies in sober and considered collaboration between scientists and policymakers. However, it is hard to see how talk of “demons” advances this noble goal. In fact, it may actively hinder it.

First, the idea of a Skynet scenario itself has enormous holes. While computer science researchers think Musk’s musings are “not completely crazy,” they are still awfully remote from a world in which AI hype masks less artificially intelligent realities that our nation’s computer scientists grapple with:

Yann LeCun, the head of Facebook’s AI lab, summed it up in a Google+ post back in 2013: “Hype is dangerous to AI. Hype killed AI four times in the last five decades. AI Hype must be stopped.”…LeCun and others are right to fear the consequences of hype. Failure to live up to sci-fi–fueled expectations, after all, often results in harsh cuts to AI research budgets.

AI scientists are all smart people. They have no interest in falling into the usual political traps where they divide into sides that accuse each other of being insane alarmists or ostriches with their heads stuck in the sand. It looks like they’re trying to balance the need to start some preliminary work on a threat that looms way off in the distance versus the risk of engendering so much hype that it starts a giant backlash.

This is not to say that there aren’t very serious differences of opinion in how quickly we need to act. These seem to hinge mostly on whether it’s safe to say “We’ll deal with the problem when we come to it” or whether there will be some kind of “hard takeoff” which will take events out of control so quickly that we’ll want to have done our homework beforehand. I continue to see less evidence than I’d like that most AI researchers with opinions understand the latter possibility, or really any of the technical work in this area. Heck, the Marginal Revolution article quotes an expert as saying that superintelligence isn’t a big risk because “smart computers won’t create their own goals”, even though anyone who has read Bostrom knows that this is exactly the problem.

There is still a lot of work to be done. But cherry-picked articles about how “real AI researchers don’t worry about superintelligence” aren’t it.

[thanks to some people from MIRI and FLI for help with and suggestions on this post]

Beware Summary Statistics

Last night I asked Tumblr two questions that had been bothering me for a while and got some pretty good answers.

I.

First, consider the following paragraph from JRank:

Terrie Moffitt and colleagues studied 4,552 Danish men born at the end of World War II. They examined intelligence test scores collected by the Danish army (for screening potential draftees) and criminal records drawn from the Danish National Police Register. The men who committed two or more criminal offenses by age twenty had IQ scores on average a full standard deviation below nonoffenders, and IQ and criminal offenses were significantly and negatively correlated at r = -.19.

Repeat offenders are 15 IQ points – an entire standard deviation – below the rest of the population. This matches common sense, which suggests that serial criminals are not the brightest members of society. It sounds from this like IQ is a very important predictor of crime.

But r = –0.19 means that IQ predicts only about 3.6% of the variance in crime, since the variance explained is r², and 0.19² ≈ 0.036. 3.6% is nothing. It sounds from this like IQ barely matters at all in predicting crime.

This isn’t a matter of conflicting studies: these are two ways of describing the same data. What gives?

The best answer I got was from pappubahry2, who posted the following made-up graph:

Here all crime is committed by low IQ individuals, but the correlation between IQ and crime is still very low, r = 0.16. The reason is simple: very few people, including very few low-IQ people, commit crimes. r is kind of a mishmash of p(low IQ|criminal) and p(criminal|low IQ), and the latter may be very low even when all criminals are from the lower end of the spectrum.
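
To see pappubahry2’s point in miniature, here is a toy simulation (my own illustrative numbers, not his actual graph): even if every single offender comes from the low-IQ tail, a low overall offense rate keeps the correlation small.

```python
# Toy sketch with made-up numbers: offenders are drawn only from the IQ < 90
# group, and even there only 5% offend, so offending is rare overall.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
iq = rng.normal(100, 15, n)

# Hypothetical assumption: only people with IQ below 90 ever offend, at a 5% rate.
offends = (iq < 90) & (rng.random(n) < 0.05)

r = np.corrcoef(iq, offends.astype(float))[0, 1]
print(f"offense rate: {offends.mean():.3f}")   # around 1%
print(f"r = {r:.2f}, r^2 = {r**2:.3f}")        # small in magnitude, despite the setup
```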

The advice some people on Tumblr gave was to beware summary statistics. “IQ only predicts 3.6% of variance in crime” makes it sound like IQ is nearly irrelevant to criminality, but in fact it’s perfectly consistent with IQ being a very strong predictive factor.

II.

So I pressed my luck with the following question:

I’m not sure why everyone’s income on this graph is so much higher than average US per capita of $30,000ish, or even average white male income of $31,000ish. I think it might be the ‘age 40 to 50’ specifier.

This graph suggests IQ is an important determinant of income. But most studies say the correlation between IQ and income is at most 0.4 or so, or 16% of the variance, suggesting it’s a very minor determinant of income. Most people are earning an income, so the too-few-criminals explanation from above doesn’t apply. Again, what gives?

The best answer I got for this one was from su3su2u1, who pointed out that there was probably very high variance within the individual deciles. Pappubahry made some more graphs to demonstrate:

I understand this one intellectually, but I still haven’t gotten my head around it. Regardless of the amount of variance, going from a category where I can expect to make on average $40,000 to a category where I could expect to make on average $160,000 seems like a pretty big deal, and describing it as “only predicting 16% of the variation” seems patently unfair.
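
For what it’s worth, the same effect is easy to reproduce with made-up numbers (a toy sketch, not su3su2u1’s or pappubahry’s actual figures): keep the bracket means at $40,000, $80,000, and $160,000, add a large spread within each bracket, and the squared correlation lands around 0.15 even though moving brackets quadruples expected income.

```python
# Toy sketch: big differences between bracket means, modest r^2, because the
# within-bracket variance dwarfs the between-bracket variance. Incomes here are
# plain normal noise, so a few draws go negative; irrelevant to the point.
import numpy as np

rng = np.random.default_rng(0)
bracket_means = [40_000, 80_000, 160_000]   # hypothetical bracket means
within_sd = 120_000                         # hypothetical within-bracket spread
n_per_bracket = 10_000

bracket = np.repeat(np.arange(3), n_per_bracket)
income = np.array(bracket_means)[bracket] + rng.normal(0, within_sd, bracket.size)

for b, m in enumerate(bracket_means):
    print(f"bracket {b}: target mean {m:>7,}, sample mean {income[bracket == b].mean():>9,.0f}")
r = np.corrcoef(bracket, income)[0, 1]
print(f"r = {r:.2f}, r^2 = {r**2:.2f}")
```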

I guess the moral is the same as the moral in the first situation: beware summary statistics. Based on the way you explain things, you can use different summary statistics to make things look very important or not important at all. And as a bunch of people recommended to me: when in doubt, demand to see the scatter plot.


Bicameral Reasoning

[Epistemic status: Probably not the first person to think about this, possibly just reinventing scope insensitivity. Title with apologies to Julian Jaynes]

Non-American readers may not be familiar with the history of the US House and Senate.

During the Constitutional Convention, a fight broke out between the smaller states and the bigger states. The smaller states, like Delaware, wanted each state to elect a fixed number of representatives to the legislature, so that Delaware would have just as much of a say as, for example, New York. The bigger states wanted legislative representation to be proportional to population, so that if New York had ten times as many people as Delaware, they would get ten times as many representatives.

Eventually everyone just agreed to compromise by splitting the legislature into the House of Representatives and the Senate. The House worked the way New York wanted things, the Senate worked the way Delaware wanted things, and they would have to agree to get anything done.

This system has continued down to the present. Today, Delaware has only one Representative, far fewer than New York’s twenty-seven. But both states have an equal number of Senators, even though New York has a population of twenty million and Delaware is uninhabited except by corporations looking for tax loopholes.

To me, the House system seems much fairer. If New York has ten times the population of Delaware, but both have the same number of representatives, then Delaware citizens have ten times as much political power just because they live on one side of an arbitrary line. And New York might be tempted to split up into ten smaller states, and thus increase its political power tenfold. Heck, why don’t we just declare some random farm a state and give five people and a cow the same political power as all of California?

But despite my professed distaste for the Senate’s representational system, I find myself using something similar in parts of my own thought processes where I least expect.

Every election, I see charts like this:

And I tend to think something like “Well, I agree with this guy about the Iraq war and global warming, but I agree with that guy about election paper trails and gays in the military, so it’s kind of a toss-up.”

And this way of thinking is awful.

The Iraq War probably killed somewhere between 100,000 and 1,000,000 people. If you think that it was unnecessary, and that it was possible to know beforehand how poorly it would turn out, then killing a few hundred thousand people is a really big deal. I like having paper trails in elections as much as the next person, but if one guy isn’t going to keep a very good record of election results, and the other guy is going to kill a million people, that’s not a toss-up.

Likewise with global warming versus gays in the military. It would be nice if homosexual people had the same right to be killed by roadside explosive devices that the rest of us enjoy, but not frying the planet is pretty important too.

(if you don’t believe in global warming, fine, having a government that agrees with you and doesn’t waste 5% of the world GDP fighting it is still more important than anything else on this list)

Saying “some boxes are more important than others” doesn’t really cut it; it sounds like they might be twice, maybe three times more important, whereas in fact they might literally be a million times more important. It doesn’t convey the right sense of “Why are you even looking at that other box?”

I worry that, by portraying issues in this nice little set of boxes, this graphic is priming reasoning similar to the US Senate, where each box gets the same level of representation in my decision-making process, regardless of whether it’s a Delaware-sized box that affects a handful of people, or a New York sized box with millions of lives hanging in the balance.

I was thinking about this again back in March when I had a brief crisis caused by worrying that the moral value of the world’s chickens vastly exceeded the moral value of the world’s humans. I ended up being trivially wrong – there are only about twenty billion chickens, as opposed to the hundreds of billions I originally thought. But I was contingently wrong – in other words, I got lucky. Honestly, I didn’t know whether there were twenty billion chickens or twenty trillion.

And honestly, 99% of me doesn’t care. I do want to improve chickens, and I do think that their suffering matters. But thanks to the miracle of scope insensitivity, I don’t particularly care more about twenty trillion chickens than twenty billion chickens.

Once again, chickens seem to get two seats to my moral Senate, no matter how many of them there are. Other groups that get two seats include “starving African children”, “homeless people”, “my patients in hospital”, “my immediate family”, and “my close friends”. Obviously some of these groups contain thousands of times more people than others. They still get two seats. And so I am neither willing to reduce chickens’ values to zero value units per chicken, nor accept that if there are enough chickens they will end up able to outvote everyone else.

(I’m not sure whether “chickens” and “cows” are two separate states, or if there’s just one state of “Animals”. It probably depends on my mood. Which is worrying.)

And most recently I thought about this because of the post on California water I wrote last week. It seems very wise to say we all have to make sacrifices, and to concentrate about equally on natural categories of water use like showers, and toilets, and farms, and lawns – without noticing that one of those is ten times bigger than the other three combined. It seems like most people who think about the water crisis are using a Senate model, where each category is treated as an equally important area to optimize. In a House model, you wouldn’t be thinking about showers any more than a 2008 voter should be thinking of election paper trails.

I’m tempted to say “The House is just plain right and the Senate is just plain wrong”, but I’ve got to admit that would clash with my own very strong inclinations on things like the chicken problem. The Senate view seems to sort of fit with a class of solutions to the dust specks problem where after the somethingth dust speck or so you just stop caring about more of them, with the sort of environmentalist perspective where biodiversity itself is valuable, and with the Leibnizian answer to Job.

But I’m pretty sure those only kick in at the extremes. Take it too far, and you’re just saying the life of a Delawarean is worth twenty-something New Yorkers.

OT20: Heaven’s Open

This is the semimonthly open thread. Post about anything you want, ask random questions, whatever. Also:

1. Corrections from last week’s links: thinking probably doesn’t fuel brain cancers (thanks, Urstoff), and the discussion of the psychology replication results is still preliminary and shouldn’t have been published.

2. Comment of the week is vV_Vv asking a question of Jewish law.

3. This is your semi-annual reminder that this blog supports itself by the Amazon affiliates program, so if you like it, consider buying some of the Amazon products I mention, clicking on the Amazon link on the sidebar, or changing your Amazon bookmark to my affiliate version so I will get a share of purchases. Consider also taking a look at other sponsors MealSquares and Beeminder.

4. Continue to expect a lower volume of blogging for the near future.


California, Water You Doing?

[Epistemic status: Low confidence. I have found numbers and stared at them until they made sense to me, but I have no education in this area. Tell me if I’m wrong.]

I.

There has recently been a lot of dumb fighting over who uses how much water in California, so I thought I would see if it made more sense as an infographic sort of thing:

Sources include Understanding Water Use In California, Inputs To Farm Production, California Water Usage In Crops, Urban Water Use Efficiency, Water Use In California, and Water: Who Uses How Much. There are some contradictions, probably caused by using sources from different years, and although I’m pretty confident this is right on an order of magnitude scale I’m not sure about a percentage point here or there. But that having been said:

On a state-sized level, people measure water in acre-feet, where an acre-foot is the amount of water needed to cover an area of one acre to a depth of one foot. California receives a total of 80 million acre-feet of water per year. Of those, 23 million are stuck in wild rivers (the hydrological phenomenon, not the theme park). These aren’t dammed and don’t have aqueducts to them so they can’t be used for other things. There has been a lot of misdirection over this recently, since having pristine wild rivers that fish swim in seems like an environmental cause, and so you can say that “environmentalists have locked up 23 million acre-feet of California water”. This is not a complete lie; if not for environmentalism, maybe some of these rivers would have been dammed up and added to the water system. But in practice you can’t dam every single river and most of these are way off in the middle of nowhere far away from the water-needing population. People’s ulterior motives shape whether or not they add these to the pot; I’ve put them in a different color blue to mark this.

Aside from that, another 14 million acre-feet are potentially usable, but deliberately diverted to environmental or recreational causes. These include 7.2 million for “recreational rivers”, apparently ones that people like to boat down, 1.6 million to preserve wetlands, and 5.6 million to preserve the Sacramento River Delta. According to environmentalists, this Sacramento River Delta water is non-negotiable, because if we stopped sending fresh water there the entire Sacramento River delta would turn salty and it would lead to some kind of catastrophe that would threaten our ability to get fresh water into the system at all.

34 million acre-feet of water are diverted to agriculture. The most water-expensive crop is alfalfa, which requires 5.3 million acre-feet a year. If you’re asking “Who the heck eats 5.3 million acre-feet of alfalfa?” the answer is “cows”. A bunch of other crops use about 2 million acre-feet each.

All urban water consumption totals 9 million acre-feet. Of those, 2.4 million are for commercial and industrial institutions, 3.8 million are for lawns, and 2.8 million are personal water use by average citizens in their houses. In case you’re wondering about this latter group, by my calculations all water faucets use 0.5 million, all toilets use 0.9 million, all showers use 0.5 million, leaks lose 0.3 million, and the remaining 0.6 million covers everything else – washing machines, dishwashers, et cetera.

Since numbers like these are hard to think about, it might be interesting to put them in a more intuitive form. The median California family earns $70,000 a year – let’s take a family just a little better-off than that who are making $80,000 so we can map it on nicely to California’s yearly water income of 80 million acre-feet.

The unusable 23 million acre-feet which go into wild rivers and never make it into the pot correspond to the unusable taxes the California family will have to pay. So our family is left with $57,000 post-tax income.

In this analogy, California is spending $14,000 on environment and recreation, $34,000 on agriculture, and $9,000 on all urban areas. All household uses – toilets, showers, faucets, etc – only add up to about $2,800 of their budget.

There is currently a water shortfall of about 6 million acre-feet per year, which is being sustained by exploiting non-renewable groundwater and other sources. This is the equivalent of our slightly-richer-than-average family having to borrow $6,000 from the bank each year to get by.
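
Since the analogy is just a linear rescaling (one million acre-feet maps to $1,000), here is a small sketch that reproduces the figures above and makes it easy to plug in other line items. All numbers are the rounded estimates from this post, not authoritative data.

```python
# Budget-analogy sketch: California's ~80 million acre-feet/year mapped onto a
# hypothetical $80,000 family income, so 1 MAF corresponds to $1,000.
TOTAL_MAF = 80.0
FAMILY_INCOME = 80_000

def family_dollars(maf):
    """Convert million acre-feet into the family-budget analogy."""
    return maf / TOTAL_MAF * FAMILY_INCOME

line_items = {
    "wild rivers ('taxes')":          23.0,
    "other environmental/recreation": 14.0,
    "agriculture":                    34.0,
    "  of which alfalfa":              5.3,
    "urban use":                       9.0,
    "  of which household use":        2.8,
    "annual shortfall ('bank loan')":  6.0,
}
for name, maf in line_items.items():
    print(f"{name:32s} {maf:5.1f} MAF  ->  ${family_dollars(maf):>7,.0f}")
```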

II.

Armed with this information, let’s see what we can make of some recent big news stories.

Apparently we are supposed to be worried about fracking depleting water in California. ThinkProgress reports that Despite Historic Drought, California Used 70 Million Gallons Of Water For Fracking Last Year. Similar concerns are raised by RT, Huffington Post, and even The New York Times. But 70 million gallons equals 214 acre-feet. Remember, alfalfa production uses 5.3 million acre feet. In our family-of-four analogy above, all the fracking in California costs them about a quarter. Worrying over fracking is like seeing an upper middle class family who are $6,000 in debt, and freaking out because one of their kids bought a gumball from a machine.

Apparently we are also supposed to be worried about Nestle bottling water in California. ABC News writes an article called Nestle Needs To Stop Bottling Water In Drought-Stricken California, Advocacy Group Says, about a group called the “Courage Campaign” who have gotten 135,000 signatures on a petition saying that Nestle needs to stop “bottling the scarce resource straight from the heart of California’s drought and selling it for profit.” Salon goes even further – their article is called Nestle’s Despicable Water Crisis Profiteering: How It’s Making A Killing While California Is Dying Of Thirst, and as always with this sort of thing Jezebel also has to get in on the action. But Nestle’s plant uses only 150 acre-feet, about one forty-thousandth the amount used to grow alfalfa, and the equivalent of about a dime to our family of four.
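
Both of these conversions are one line of arithmetic; here is a sketch using the family-budget scaling from above and the standard conversion of roughly 325,851 gallons per acre-foot (the only fact in it not taken from this post).

```python
# Fracking and Nestle water use, converted to acre-feet and to the family-budget
# analogy ($80,000 for 80 million acre-feet, i.e. $0.001 per acre-foot).
GALLONS_PER_ACRE_FOOT = 325_851
DOLLARS_PER_ACRE_FOOT = 80_000 / 80_000_000

fracking_af = 70_000_000 / GALLONS_PER_ACRE_FOOT
nestle_af = 150
print(f"fracking: {fracking_af:.0f} acre-feet ~ ${fracking_af * DOLLARS_PER_ACRE_FOOT:.2f}")
print(f"Nestle:   {nestle_af} acre-feet ~ ${nestle_af * DOLLARS_PER_ACRE_FOOT:.2f}")
```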

The Wall Street Journal says that farms are a scapegoat for the water crisis, because in fact the real culprits are environmentalists. They say that “A common claim is that agriculture consumes about 80% of ‘developed’ water supply, yet this excludes the half swiped off the top for environmental purposes.” But environmentalism only swipes half if you count among that half all of the wild rivers in the state – that is, every drop of water not collected, put in an aqueduct, and used to irrigate something is a “concession” to environmentalists. A more realistic figure for environmental causes is the 14 million acre-feet marked “Other Environmental” on the map above, and even that includes concessions to recreational boaters and to whatever catastrophe is supposed to happen if we can’t keep the Sacramento Delta working properly. It’s hard to calculate exactly how much of California’s water goes to environmental causes, but half is definitely an exaggeration.

Wired is concerned that the federal government is ordering California to spend 12,000 acre-feet of water to save six fish (h/t Alyssa Vance). Apparently these are endangered fish in some river who need to get out to the Pacific to breed, and the best way to help them do that is to fill up the river with 12,000 acre feet of water. That’s about $12 on our family’s budget, which works out to $2 per fish. I was going to say that I could totally see a family spending $2 on a fish, especially if it was one of those cool glow-in-the-dark fish I used to have when I was a kid, but then I remembered this was a metaphor and the family is actually the entire state budget of California but the six fish are still literally just six fish. Okay, yes, that seems a little much.

III.

Finally, Marginal Revolution and even some among the mysterious and endangered population of non-blog-having economists are talking about how really the system of price controls and subsidies in the water market is ridiculous and if we had a free market on water all of our problems would be solved. It looks to me like that’s probably right.

Consider: When I used to live in California, even before this recent drought I was being told to take fewer showers, to install low-flush toilets that were inconvenient and didn’t really work all that well, to limit my use of the washing machine and dishwasher, et cetera. It was actually pretty inconvenient. I assume all forty million residents of California were getting the same message, and that a lot of them would have liked to be able to pay for the right to take nice long relaxing showers.

But if all the savings from water rationing amounted to 20% of our residential water use, then that equals about 0.5 MAF, which is about 10% of the water used to irrigate alfalfa. The California alfalfa industry makes a total of $860 million worth of alfalfa hay per year. So if you calculate it out, a California resident who wants to spend her fair share of money to solve the water crisis without worrying about cutting back could do it by paying the alfalfa industry $2 to not grow $2 worth of alfalfa, thus saving as much water as if she very carefully rationed her own use.

If you were to offer California residents the opportunity to not have to go through the whole gigantic water-rationing rigamarole for $2 a head, I think even the poorest people in the state would be pretty excited about that. My mother just bought and installed a new water-saving toilet – which took quite a bit of her time and money – and furthermore, the government is going to give her a $125 rebate for doing so. Cutting water on the individual level is hard and expensive. But if instead of trying to save water ourselves, we just paid the alfalfa industry not to grow alfalfa, all the citizens of California could do their share for $2. If they also wanted to have a huge lush water-guzzling lawn, their payment to the alfalfa industry would skyrocket all the way to $5 per year.

In fact, though I am not at all sure here and I’ll want a real economist to double-check this, it seems to me if we wanted to buy out all alfalfa growers by paying them their usual yearly income to just sit around and not grow any alfalfa, that would cost $860 million per year and free up 5.3 million acre-feet, ie pretty much our entire shortfall of 6 million acre-feet, thus solving the drought. Sure, 860 million dollars sounds like a lot of money, but note that right now California newspapers have headlines like Billions In Water Spending Not Enough, Officials Say. Well, maybe that’s because you’re spending it on giving people $125 rebates for water-saving toilets, instead of buying out the alfalfa industry. I realize that paying people subsidies to misuse water to grow unprofitable crops, and then offering them countersubsidies to not take your first set of subsidies, is to say the least a very creative way to spend government money – but the point is it is better than what we’re doing now.
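
For anyone who wants to check the arithmetic, here is the whole calculation as a sketch; every input is one of this post's rounded estimates (0.5 MAF of plausible residential savings, 5.3 MAF and roughly $860 million per year for alfalfa, about 40 million Californians), so treat the outputs as order-of-magnitude figures.

```python
# Alfalfa-buyout arithmetic from the paragraphs above, rounded estimates only.
ALFALFA_MAF = 5.3
ALFALFA_REVENUE = 860_000_000        # dollars per year
RESIDENTS = 40_000_000
RESIDENTIAL_SAVINGS_MAF = 0.5        # ~20% of the 2.8 MAF of household use
SHORTFALL_MAF = 6.0

dollars_per_maf = ALFALFA_REVENUE / ALFALFA_MAF           # ~$162M per MAF
offset_total = RESIDENTIAL_SAVINGS_MAF * dollars_per_maf  # ~$81M
print(f"paying alfalfa growers not to grow: ~${dollars_per_maf / 1e6:.0f}M per MAF")
print(f"offsetting all residential rationing: ~${offset_total / 1e6:.0f}M total, "
      f"or ~${offset_total / RESIDENTS:.2f} per resident")
print(f"full buyout: ${ALFALFA_REVENUE / 1e6:.0f}M/yr frees {ALFALFA_MAF} MAF "
      f"of the {SHORTFALL_MAF} MAF annual shortfall")
```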


Links 5/15: Tall And Linky

If The Machines Are Taking Our Jobs, They Are Hiding It From The Bureau Of Labor Statistics. An argument that the ‘rise of the robots’ can’t be behind stagnant employment numbers, because increasing the amount of work done by robots would make productivity-per-human go up, and it isn’t.

I was able to solve the Cheryl’s birthday Singapore logic puzzle after a few minutes, but I got stuck on the transfinite version.

The Kosher Light Switch claims that after you flip it, the light will come on, but that your flipping it doesn’t cause the light to come on, thus making it compliant with complicated Jewish ritual laws. Needless to say, this seems to depend on an interpretation of causation which is not entirely…what’s the word…kosher.

I said a while ago I thought that “affirmative consent” laws wouldn’t matter one way or the other since situations where people pressed cases based on them were unlikely to come up. I seem to have been wrong – in a recent case at Brandeis, a man was found in violation of affirmative consent laws because during the course of a two year romantic relationship, he occasionally kissed his partner goodbye in the morning without asking permission first. I’d like to blame this one on Feminism Gone Too Far, but since both parties were gay men we guys have nobody to blame but ourselves here.

The worst method of transliterating the Qiang language gives us such lovely words as “eazheabeageyegeaiju”, “gganpaeidubugeisdu”, and “chegvchagvchegvchagvlahva”. Anyone want to play a game of Terrible Qiang Transliteration Scrabble?

Be Careful Studying Discrimination Using Names. I talked about this briefly when comparing the two recent Women In STEM studies – calling one candidate “John” and the other “Jennifer” introduced a whole host of possible confounds beyond just gender. The article points out that articles which try to prove white-black discrimination by comparing “John” to “Jamal” have the same problem – Jamal isn’t just a black name, it’s a poor black name, and a fairer comparison would be a poor white name like Billy Bob. Features a pretty good reply by Women In STEM paper author Corinne Moss-Racusin, and a less good reply by the guy who wrote the John-Jamal paper.

Dodging Abilify is about the contortions some mental health patients have to go through to prevent their doctors from inappropriately prescribing latest Exciting-New-Marketing-Campaign-Drug Abilify to them. The writer may or may not be pleased to know that when Abilify goes generic in the near future, all of a sudden all of these prescriptions will stop and people will start pushing brexpiprazole instead.

South Dakota’s new ad campaign (h/t Heidi): look, lots of people want to go to Mars, but South Dakota is less inhospitable than Mars, so come to South Dakota instead. Key slogan: “If you’re someone that’s really introverted, it might not be that bad.”

The politics behind the recent campaign against Dr. Oz, and why it might have played right into his hands.

Student Course Evaluations Get An F. Professors whom students rate worst are precisely those professors whose students get the best grades in future courses, suggesting these evaluations are negatively correlated with teaching quality. Very relevant to our recent discussion on psych drugs, hopefully not relevant to past discussions on democracy!

Marijuana probably exacerbates psychosis because of its main chemical constituent THC. But a different marijuana chemical, cannabidiol, might actually be a potent antipsychotic. And more evidence for same.

Dutch people swear using diseases. I bet doctors must win all verbal duels in the Netherlands.

An intervention meant to raise kindergarteners’ tolerance of disabled people by teaching them a curriculum about how great it was to have disabled friends actually lowered their tolerance of the disabled compared to a control curriculum where they learned science stuff. Researchers theorized that the science stuff made them work together in groups with other children (including disabled ones) for a practical goal rather than rubbing their noses in the difference.

A new study finds homeopathy and Prozac both outperform placebo by the same amount in treating postmenopausal depression. Ars Technica thinks it knows why the study found such a counterintuitive finding, but check the comments for why their deconstruction seems a bit premature. Overall I think both those defending the integrity of the trial and those attacking it have some good points, but the problem is that if this experiment had done anything other than propose homeopathy worked, it would never have gotten this level of scrutiny and any flaws it might or might not have would just have been allowed to pass.

This is Stephen King-level creepy: Thoughts Can Fuel Some Deadly Brain Cancers.

Nostalgebraist, a very interesting guy who hangs around rationalist Tumblr, is writing fiction I’ve been enjoying a lot. His completed work, Floornight, asks – what happens if we discover the soul is real, but operates more like a quantum object than a classical object, and also some people go to study it in a giant dome in the middle of the sea surrounded by alien ghosts which is part of a plot by parallel universes to fight a war based on differing interpretations of measure? His current work-in-progress, The Northern Caves, is even better.

Somebody actually does the full scientific study and determines that atheists are no more angry than the general population. I predicted this result here two years ago.

Kazakh leader apologizes for winning election with 97.7% of the vote, saying “it would have looked undemocratic to intervene to make the victory more modest”.

Polygamists are four times more likely to get heart disease than monogamists after everything else is controlled for, which to me probably means they think they controlled for everything else but they didn’t.

First results from psychology’s largest reproducibility test: by strict criteria, only 39% of published studies replicate; by looser criteria, 63% do.

Speaking of which, you remember that study on how reading problems in a hard-to-read font makes you think about them more rationally? Totally failed to replicate multiple times, now abandoned.

RPG doormat.

A new paper finds that telling people that everyone stereotypes just makes them stereotype more.

A new paper finds black mayors (relative to white mayors) improve the position of blacks (relative to whites) in cities where they are elected.

Genetic influence on political beliefs. Everything is some typical combination of heredity and nonshared environment except which party you belong to, which is mostly shared environment. In other words, you come up with your opinions on your own, then ignore them and vote for whoever your parents voted for.

John Boehner was wrong when he said we as a nation spend more money on antacids than we do on politics, but he was surprisingly close – within a factor of three or so.

A Redditor lists facts and fictions about the new spaceship drives that claim to use weird physics. Apparently if they work they will Change Everything Forever, including land transportation. But smart people are very skeptical.

Razib Khan finds that, contrary to the stereotypes, more intelligent and more liberal people are more likely to believe in free speech.

Drinking too much caffeine during pregnancy may double your baby’s risk of childhood obesity.

Killing Hitler With Praise And Fire is a Choose Your Own Adventure book about a time traveler trying to assassinate the Fuhrer without messing history up too atrociously.


Growth Mindset 4: Growth Of Office

Previously In Series: No Clarity Around Growth Mindset…Yet // I Will Never Have The Ability To Clearly Explain My Beliefs About Growth Mindset // Growth Mindset 3: A Pox On Growth Your Houses

Last month I criticized a recent paper, Paunesku et al’s Mindset Interventions Are A Scalable Treatment For Academic Underachievement, saying that it spun a generally pessimistic set of findings about growth mindset into a generally optimistic headline.

Earlier today, lead author Dr. Paunesku was kind enough to write a very thorough reply, which I reproduce below:

I.

Hi Scott,

Thanks for your provocative blog post about my work (I’m the first author of the paper you wrote about). I’d like to take a few moments to respond to your critiques, but first I’d like to frame my response and tell you a little bit about my own motivation and that of the team I am a member of (PERTS).

Good criticism is what makes science work. We are critical of our own work, but we are happy to have help. Often critics are not thoughtful or specific. So I very much appreciate the intent of your blog (to be thoughtful and specific).

What is our motivation? We are trying to improve our education system so that all students can thrive. If growth mindset is effective, we want it in every classroom possible. If it is ineffective, we want to know about it so we don’t waste people’s time. If it is effective for some students in some classrooms, we want to know where and for whom so that we can help those students.

What is our history and where are we now? PERTS approached social psychological interventions with a fair amount of skepticism at first. In many ways, they seemed too good to be true. But, we thought, “if this is true, we should do everything we can to spread it”. Our work over the last 5 years has been devoted to trying to see if the results that emerged from initial, small experiments (like Aronson et al., 2002 and Blackwell et al., 2007) would continue to be effective when scaled. The paper you are critiquing is a step in that process — not the end of the process. We are continuing research to see where, for whom, and at what scale social psychological approaches to improving education outcomes can be effective.

How do I intend to respond to your criticisms? In some cases, your facts or interpretations are simply incorrect, and I will try to explain why. I also invite you to contact me for follow up. In others cases, we simply have different opinions about what’s important, and we’ll have to agree to disagree. Regardless, I appreciate your willingness to be bold and specific in your criticism. I think that’s brave, and I think such bravery makes science stronger.

First, what is growth mindset?

This quote is from one of your other blog posts (not your critique of my paper):

If you’re not familiar with it, growth mindset is the belief that people who believe ability doesn’t matter and only effort determines success are more resilient, skillful, hard-working, perseverant in the face of failure, and better-in-a-bunch-of-other-ways than people who emphasize the importance of ability. Therefore, we can make everyone better off by telling them ability doesn’t matter and only hard work does.

If you think that’s what growth mindset is, I can certainly see why you’d find it irritating — and even destructive. I’d like to assure you that the people doing growth mindset research do not ascribe to the interpretation of growth mindset you described. Nor is that interpretation of growth mindset something we aim to communicate through our interventions. So what is growth mindset?

Growth mindset is not the belief that “ability doesn’t matter and only effort determines success.” Growth mindset is the belief that individuals can improve their abilities — usually through effort and by learning more effective strategies. For example, imagine a third grader struggling to learn long division for the first time. Should he interpret his struggle as a sign that he’s bad at math — as a sign that he should give up on math for good? Or would it be more adaptive if he realized that he could probably get a lot better at math if he sought out help from his peers or teachers? The student who thinks he should give up would probably do pretty badly while the student who thinks that he can improve his abilities — and tries to do so by learning new study strategies and practicing them — would do comparatively better.

That’s the core of growth mindset. It’s nothing crazy like thinking ability doesn’t matter. It’s keeping in mind that you can improve and that — to do so — you need to work hard and seek out and practice new, effective strategies.

As someone who has worked closely with Carol Dweck and with her students and colleagues for seven years now, I can personally attest that I have never heard anyone in that extended group of people express the belief that ability does not matter or that only hard work matters. In fact, a growth mindset wouldn’t make any sense if ability didn’t matter because a growth mindset is all about improving ability.

One of the active goals of the group I co-founded (PERTS) is to try to dispel misinterpretations of growth mindset because they can be harmful. I take it as a failure of our group that someone like you — someone who clearly cares about research and about scientific integrity — could walk away from our work with that interpretation of growth mindset. I hope that PERTS, and other groups promoting growth mindset, can get better and better at refining the way we talk about growth mindset so that people can walk away from our work understanding it more clearly. For that perspective, I hope you can continue to engage with us to improve that message so that people don’t continue to misinterpret it.

Anyway, here are my responses to specific points you made in your blog about my paper:

Was the control group a mindset intervention?

You wrote:

“A quarter of the students took a placebo course that just presented some science about how different parts of the brain do different stuff. This was also classified as a “mindset intervention”, though it seems pretty different.”

What makes you think it was classified as a mindset intervention? We called that the control group, and no one on our team ever thought of that as a mindset intervention.

The Elderly Hispanic Woman Effect

You wrote:

Subgroup analysis can be useful to find more specific patterns in the data, but if it’s done post hoc it can lead to what I previously called the Elderly Hispanic Woman Effect…

First, I just want to note that I love calling this the “elderly Hispanic woman effect.” It really brings out the intrinsic ridiculousness of the subgroup analyses researchers sometimes go through in search of an effect with a p<.05. It is indeed unlikely that "elderly Hispanic women" would be a meaningful subgroup for analyzing the effects of a medicine (although it might be a fun thought exercise to try to think of examples of a medicine whose effects would be likely to be moderated by being an elderly Hispanic woman).

In bringing up the elderly Hispanic woman effect, you're suggesting that we didn't have an a priori reason to think that underperforming students would benefit from these mindset interventions and that we just looked through a bunch of moderators until we found one with p<.05. Well that's not what we did, and I hope I can convince you that our choice of moderator was perfectly reasonable given prior research and theory.

There's a lot of research (and common sense too) to suggest that mindset -- and motivation in general -- matters much more when something is hard than when it is easy. Underachieving students presumably find school more difficult, so it makes sense that we'd want to focus on them. I don't think our choice of subgroup is a controversial or surprising prediction. I think anyone who knows mindset research well would predict stronger effects for students who are struggling. In other words, this is obviously not a case of the elderly Hispanic woman effect because it is totally consistent with prior theory and predictions. What ultimately matters more than any rhetorical argument, however, is whether the effect is robust -- whether it replicates.

On that front, I hope you'll be pleased to learn that we just ran a successful replication of this study (in fall 2014) in which we again found that growth mindset improves achievement specifically among at-risk high school students (currently under review). We're also planning yet another large scale replication study this fall with a nationally representative sample of schools so that we can be more confident that the interventions are effective in various types of contexts before giving them away for free to any school that wants them.

Is the sense of purpose intervention just a bunch of platitudes?

You wrote:

Still another quarter took a course about “sense of purpose” which talked about how schoolwork was meaningful and would help them accomplish lots of goals and they should be happy to do it.

[Later you say that those “children were told platitudes about how doing well in school will “make their families proud” and “make a positive impact”.]

I wouldn’t say those are platitudes. I think you’re under-appreciating the importance of finding meaning in one’s work. It’s a pretty basic observation about human nature that people are more likely to try hard when it seems like there’s a good reason to try hard. I also think it’s a pretty basic observation about our education system that many students don’t have good reasons for trying hard in school — reasons that resonate with them emotionally and help them find the motivation to do their best in the classroom. In our purpose intervention, we don’t just tell students what to think. We try to scaffold them to think of their own reasons for working hard in school, with a focus on reasons that are more likely to have emotional resonance for students. This type of self-persuasion technique has been used for decades in attitudes research.

We’ve written in more depth about these ideas and explored them through a series of studies. I’d encourage you to read this article if you’re interested.

Our paper title and abstract are misleading

You wrote:

Among ordinary students, the effect on the growth mindset group was completely indistinguishable from zero, and in fact they did nonsignificantly worse than the control group. This was the most basic test they performed, and it should have been the headline of the study. The study should have been titled “Growth Mindset Intervention Totally Fails To Affect GPA In Any Way”.

I think the title you suggest would have been misleading. How?

First, we did find evidence that mindset interventions help underachieving students — and those students are very important from a policy standpoint. As we describe in the paper, those students are more likely to drop out, to end up underemployed, or to end up in prison. So if something can help those students at scale and at a low cost, it’s important for people to know that. That’s why the word “underachievement” is in the title of the paper — because we’re accurately claiming that these interventions can help the important (and large) group of students who are underachieving.

Second, the interventions influenced the way all students think about school in ways that are associated with achievement. Although the higher performing students didn’t show any effects on grades in the semester following the study, their mindsets did change. And, as per the arguments I presented above about the link between mindset and difficulty, it’s quite feasible that those higher-performing students will benefit from this change in mindset down the line. For example, they may choose to take harder classes (e.g., Romero et al., 2014) or they may be more persistent and successful in future classes that are very challenging for them.

A misinterpretation of the y-axis in this graph.

You wrote:

Growth mindset still doesn’t differ from zero [among at-risk students].

This just seems to be a simple misreading of the graph. Either you missed the y-axis of the graph that you reproduced on your blog or you don’t know what a residual standardized score is. Either way, I’ll explain because this is pretty esoteric stuff.

The zero point of the y-axis on that graph is, by definition, the grand mean of the 4 conditions. In other words, the treatment conditions are all hovering around zero because zero is the average, and the average is made up mostly of treatment group students. If we had only had 2 conditions (each with 50% of the students), the y-axis “zero” would have been exactly halfway in between them. So the lack of difference from zero does not mean that the treatment was not different from control. The relevant comparison is between the error bars in the control condition and in the treatment conditions.

You might ask, “why are you showing such a graph?” We’re doing so to focus on the treatment contrast at the heart of our paper — the contrast between the control and treatment groups. The residual standardized graph makes it easy to see the size of that treatment contrast.

We’re combining intervention conditions

You wrote:

Did you catch that phrase “intervention conditions”? The authors of the study write: “Because our primary research question concerned the efficacy of academic mindset interventions in general when delivered via online modules, we then collapsed the intervention conditions into a single intervention dummy code (0 = control, 1 = intervention).

[This line of argument goes on for a long time to suggest that we’re unethical and that there’s actually no evidence for the effects of growth mindset on achievement.]

We collapsed the intervention conditions together for this analysis because we were interested in the overall effect of these interventions on achievement. We wanted to see if it is possible to use scalable, social-psychological approaches to improve the achievement of underperforming students. I’m not sure why you think that’s not a valid hypothesis to test, but we certainly think it is. Maybe this is just a matter of opinion about what’s a meaningful hypothesis to test, but I assure you that this hypothesis (contrast all treatments to control) is consistent with the goal of our group to develop treatments that make an impact on student achievement. As I described before, we have a whole center devoted to trying to improve academic achievement with these types of techniques (see perts.net); so it’s pretty natural that we’d want to see whether our social-psychological interventions improve outcomes for the students who need them most (at-risk students).

You’re correct that the growth mindset intervention did not have a statistically significant impact on course passing rates by itself (at a p<.05 level). However, the effect was in the expected direction with p=0.13 (or a 1-tailed p=.07 -- I hope you'll grant that a 1-tailed test is appropriate here given that we obviously predicted the treatment would improve rather than reduce performance). So the lack of a p<.05 should not be interpreted -- as you seem to interpret it -- as some sort of positive evidence that growth mindset "actually didn't work." Anyway, I would say it warrants further research to replicate this effect (work we are currently engaging in).

To summarize, we did not find direct evidence that the growth mindset intervention increased course passing rates on its own at a p<.05 level. We did find that growth mindset increased course passing rates at a trend level -- and found a significant effect on GPA. More importantly for me (though perhaps less relevant to your interest specifically in growth mindset), we did provide evidence that social-psychological interventions, like growth mindset and sense of purpose, can improve academic outcomes for at-risk students.

We're excited to be replicating this work now and giving it away in the hopes of improving outcomes for students around the world.

Summary

I hope I addressed your concerns about this paper, and I welcome further discussion with you. I’d really appreciate it if you’d revise your blog post in whatever way you think is appropriate in light of my response. I’d hate for people to get the wrong impression of our work, and you don’t strike me as someone who would want to mislead people about scientific findings either.

Finally, you’re welcome to post my response. I may post it to my own web page because I’m sure many other people have similar questions about my work. Just let me know how you’d like to proceed with this dialog.

Thanks for reading,

Dave

II.

First of all, the obvious: this is extremely kind and extremely well-argued and a lot of it is correct and makes me feel awful for being so snarky on my last post.

Things in particular which I want to endorse as absolutely right about the critique:

I wrote “A quarter of the students took a placebo course that just presented some science about how different parts of the brain do different stuff. This was also classified as a “mindset intervention”, though it seems pretty different.” Dr. Paunesku says this is wrong. He’s right. It was an editing error on my part. I meant to add the last sentence to the part on the “sense of purpose” intervention, which was classified as a mindset intervention and which I do think seems pretty different. The placebo intervention was never classified as a mindset intervention and I completely screwed up by inserting that piece of text there rather than two sentences down where I meant it to be. It has since been corrected and I apologize for the error.

If, as Dr. Paunesku reports, another successful replication found that growth mindset continues to help only the lowest-performing students, I withdraw the complaint that this is sketchy subgroup mining, though I think that in general worrying about this is the correct thing to do.

I did misunderstand the residual standardized graph. I suggested that the control group must have severely declined, and got confused about why. In fact, the graph was not about difference between pre-study scores and post-study scores, but difference between group scores and the average score for all four groups. So when the control group is strongly negative, that means it was much worse than the average of all groups. When growth mindset is not-different-from-zero, it means growth mindset was not different from the average of all four groups, which consists of three treatment groups and one control group. So my interpretation – that growth mindset failed to change children’s grades – is not supported by the data.

(In my defense, I can only plead that of the two hundred fifty comments I received, many by professional psychologists and statisticians, only one person picked up on this point (admittedly, after being primed by my own misinterpretation). And the sort of data I expected to be seeing – difference between students’ pre-intervention and post-intervention scores – does not seem to be available. Nevertheless, this was a huge and unforgivable screw-up, and I apologize.)
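To make the grand-mean-centering point concrete, here is a minimal sketch with entirely made-up condition means; nothing below comes from the actual paper.

```python
# Toy illustration of a residual standardized ("grand-mean-centered") graph.
# The numbers are invented: one control condition and three treatment conditions.
import numpy as np

means = {"control": 2.0, "growth_mindset": 2.5, "purpose": 2.5, "both": 2.5}

grand_mean = np.mean(list(means.values()))            # this is the graph's zero point
centered = {k: round(v - grand_mean, 3) for k, v in means.items()}

print(centered)
# {'control': -0.375, 'growth_mindset': 0.125, 'purpose': 0.125, 'both': 0.125}
# Every treatment beats control by the same 0.5 raw points, yet on the centered
# scale the treatments all sit near zero -- because the grand mean is mostly made
# of treatment students -- while the control group is pushed well below zero.
```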

III.

But there are also a few places where I will stick to my guns.

I don’t think my interpretation of growth mindset was that far off the mark. I explain this a little further in this post on differing possible definitions of growth mindset, and I will continue to cite this strongly worded paper by Dweck in defense of my views. It’s not just an obvious and innocuous belief about always believing you should be able to improve; it’s a belief about very counterintuitive effects of believing that success depends on ability versus effort. It is possible that all sophisticated researchers in the field have a very sophisticated and unobjectionable definition of growth mindset, but that’s not the way it’s presented to the public, even in articles by those same researchers.

Although I’m sure that to researchers in the field statements like “Doing well at school will help me achieve my goal” don’t sound like platitudes, the comparison seems important to me in the context of discussions about growth mindset. Some people have billed growth mindset as a very exciting window into what makes learning tick, as a reason to divide everyone into groups based on their mindset, as the Secret To Success, and so on. Learning that a drop-dead simple intervention – telling students to care about school more – actually does as well as or better than growth mindset seems to me like a damning result. I realize it would be kind of insulting to call sense-of-purpose an “active placebo” in the medical sense, but that’s kind of how I can’t help thinking of it.

I’m certainly not suggesting the authors of the paper are unethical for combining the growth mindset intervention with the sense of purpose intervention. But I think the technique is dangerous, and this is an example. They got a result with p = 0.13. Dr. Paunesku suggests in his email to me that this should be one-tailed (which makes it p = 0.07) and that it obviously trends towards significance. This is a reasonable argument. But this reasonable argument wasn’t the one made in the paper. Instead, they made it look like the result achieved classical p < 0.05 significance, or at least made it very hard to notice that it didn’t.

Even if in this case it was innocuous – I can’t even say white lie, maybe white spin – I find the technique very worrying. Suppose I want to prove homeopathy cures cancer. I make a trial with one placebo condition and two intervention conditions – chemotherapy and homeopathy. I find that the chemotherapy condition very significantly outperforms placebo, but the homeopathy condition doesn’t. So I combine the two interventions into a single bin and say “Therapeutic interventions such as chemotherapy or homeopathy significantly outperform placebo.” Then someone else cites it as “As per a study, homeopathy outperforms placebo.” This would obviously be bad.
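Here is a toy simulation of that scenario, with invented effect sizes and sample sizes, just to show the mechanics of how a pooled “all interventions vs. control” contrast can come out significant even when one of its arms does nothing:

```python
# Toy simulation: one placebo arm, one effective arm, one completely null arm.
# All numbers are made up; this is not a re-analysis of any real study.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
control     = rng.normal(0.0, 1.0, n)   # placebo
effective   = rng.normal(0.5, 1.0, n)   # works (the "chemotherapy" of the analogy)
ineffective = rng.normal(0.0, 1.0, n)   # does nothing (the "homeopathy")

# The null arm on its own: typically nowhere near significance.
print(stats.ttest_ind(ineffective, control).pvalue)

# Pool both intervention arms and compare to control: usually p < .05,
# carried entirely by the effective arm.
pooled = np.concatenate([effective, ineffective])
print(stats.ttest_ind(pooled, control).pvalue)
```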

I am just not convinced that growth mindset and sense of purpose are similar enough that you can group them together effectively. This is what I was trying to get at in my bungled sentence about how they're both "mindset" interventions but seem pretty different. Yes, they're both things you tell children in forty-five minute sessions that seem related to how they think about school achievement. But that's a really broad category.

But doesn’t it mean something that growth-mindset was obviously trending toward significance?

First of all, I would have had no problem with saying “trending toward significance” and letting readers draw their own conclusions.

Second of all, I’m not totally sure I buy the justification for a one-tailed test here; after all, it seems like we should use a one-tailed test for homeopathy as well, since as astounding as it would be if homeopathy helped, it would be even more astounding if homeopathy somehow made cancer worse. Further, educational interventions often have the opposite of their desired effect – see eg this campaign to increase tolerance of the disabled which made students like disabled people less than a control intervention. In fact, there’s no need to look further than this very study, which found (counterintuitively) that among students already exposed to sense-of-purpose interventions, adding on an extra growth-mindset intervention seemed to make them do (nonsignificantly) worse. I am not a statistician, but my understanding is you ought to have a super good reason to use a one-tailed test, beyond just “Intuitively my hypothesis is way more likely than the exact opposite of my hypothesis”.

Third of all, if we accept p < 0.13 as "trending towards significance", we have basically tripled the range of acceptable study results, even though everyone agrees our current range of acceptable study results is already way too big and some high percent of all medical studies are wrong and only 39% of psych studies replicate and so on.

(I agree that all of this could be solved by something better than p-values, but p-values are what we’ve got)

I realize I’m being a jerk by insisting on the arbitrary 0.05 criterion, but in my defense, the time when only 39% of studies using a criterion replicate is a bad time to loosen that criterion.

IV.

Here’s what I still believe and what I’ve changed my mind on based on Dr. Paunesku’s response.

1. I totally bungled my sentence on the placebo group being a mindset intervention by mistake. I ashamedly apologize, and have corrected the original post.

2. I totally bungled reading the residual standard score graph. I ashamedly apologize, and have corrected the original post, and put a link in bold text to this post on the top.

3. I don’t know whether the thing I thought the graph showed (no significant preintervention vs. postintervention GPA improvement for growth mindset, or no difference in change from controls) is true. It may be hidden in the supplement somewhere, which I will check later. Possible apology pending further investigation.

4. Growth mindset still had no effect (in fact nonsignificantly negative) for students at large (as opposed to underachievers). I regret nothing.

5. Growth mindset still failed to reach traditional significance criteria for changing pass rates. I regret nothing.

The Future Is Filters

Related to: The Toxoplasma of Rage

I.

Tumblr Savior is a neat program that blocks Tumblr posts containing specific words or phrases. For example, if you don’t want to hear all of the excellent reasons going around Tumblr why you should kill all men, you just block “kill all men” and they never show up. Add a few extra terms like “white dudes” (nothing good ever came of an article including the phrase “white dudes”), “trans”, “cis”, and “pictures of my vagina”, and you can make Tumblr almost usable.

(My own Tumblr Savior list is an interesting record both of my psyche and of mid-2010s current events. Sometimes I imagine a future cyber-archaeologist stumbling across it and asking “But, but…why would he ban the word ‘puppies’?” Poor, poor innocent future archaeologist.)

I recently learned about Twitter blockbots. These are lists maintained by some trustworthy people, such that subscribing to the blockbot automatically blocks everyone on the list. The original was made by some people in the social justice community to help block people they figured other members of the social justice community wouldn’t want to have to deal with. Although some people seem to be added on by hand, the bot also makes educated guesses about who to block by blacklisting accounts that follow the feeds of too many anti-social-justice leaders.
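As I understand it, the mechanics are roughly the following – this is my guess at the logic, not a description of any real blockbot’s code, and every account name and threshold is invented:

```python
# Hypothetical sketch of a shared blocklist plus a guilt-by-association rule.
curated_blocklist = {"troll_account_1", "troll_account_2"}   # added by hand
flagged_leaders = {"leader_a", "leader_b", "leader_c"}        # accounts the maintainers flag
FOLLOW_THRESHOLD = 2   # follow at least this many flagged leaders and you get blocked too

def accounts_to_block(candidates, follows):
    """candidates: account names to check; follows: dict mapping account -> set of followed accounts."""
    blocked = set(curated_blocklist)
    for account in candidates:
        if len(follows.get(account, set()) & flagged_leaders) >= FOLLOW_THRESHOLD:
            blocked.add(account)
    return blocked

print(accounts_to_block(["alice", "bob"],
                        {"alice": {"leader_a", "leader_b"}, "bob": {"leader_a"}}))
# alice crosses the threshold and gets added to the curated list; bob is left alone
```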

There are rumors of a similar anti-SJ block list of people who engage in online mobbing and harassment in the name of social justice, but I can’t find it online right now and I think it might have been taken down.

An article I read recently (but which I can’t find right now to link to) proposes a higher-tech solution for Facebook’s harassment problems. Its authors want Facebook to train machine-learning programs to detect posts that most people would consider trollish. So far, so boring. The interesting part comes afterwards – instead of auto-blocking those posts, Facebook would assign them a certain number of Troll Points. Users could then set an option for how their Facebook feed should react to Troll Points – for example, by blocking every post with more than a certain amount. That way, people who were concerned about free speech and who enjoy participating in “heated discussion” would be able to do so, while people who wanted a safer and more pleasant browsing experience could have a very low cutoff for taking action.

But the really interesting part got dismissed after a sentence. What if instead of combining everything into Troll Points, Facebook assigned the points in different domains? Foul Language, Blasphemy, Racial Slurs, Threats, Harassment, Dirty Argument Tactics, et cetera. And then I could set that I don’t care about Foul Language or Blasphemy, but I really don’t want to see any Threats or Racial Slurs.
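A minimal sketch of what that per-category setting might look like – every category name, score, and threshold here is invented for illustration, and the classifier that produces the scores is assumed to exist upstream:

```python
# Hypothetical per-category troll filter: show a post only if it stays under the
# user's chosen cutoff in every category.
TROLL_CATEGORIES = ["foul_language", "blasphemy", "racial_slurs", "threats", "harassment"]

# Imagine an upstream classifier has already scored the post in each category (0 to 1).
post = {"text": "...", "scores": {"foul_language": 0.8, "blasphemy": 0.6,
                                  "racial_slurs": 0.0, "threats": 0.1,
                                  "harassment": 0.2}}

# This user doesn't care about swearing or blasphemy but wants a very low
# tolerance for slurs and threats.
my_thresholds = {"foul_language": 1.0, "blasphemy": 1.0,
                 "racial_slurs": 0.1, "threats": 0.1, "harassment": 0.5}

def visible(post, thresholds):
    return all(post["scores"].get(c, 0.0) <= thresholds.get(c, 1.0)
               for c in TROLL_CATEGORIES)

print(visible(post, my_thresholds))  # True: foul-mouthed but not threatening
```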

(obviously the correct anarcho-capitalist solution is to have third-party companies making these algorithms and selling them to individual Facebook users, but in a world where Facebook is trying to become more and more closed to third-party apps, that’s probably not going to happen)

So, take all this filtering technology – Tumblr Savior, Twitter blockbots, and hypothetical Facebook Troll Points, combine them together, project them about ten years into the future with slightly better machine learning, and you have an Internet where nobody has to see, even for an instant, anything they don’t want to. What are the implications?

II.

The most obvious possibility is that everyone will be better off because we can avoid trolls. In this nice black-and-white worldview, there are good people, and there are trolls, and eliminating the trolls is a simple straightforward decision that makes the good people better off. This is how The Daily Beast thinks of it (How Block Bot Could Save The Internet), and as anyone who’s been trolled or harassed online knows, there’s a lot of truth to this view.

The second most obvious possibility is that we will become a civilization of wusses safely protected from ever having to hear an opinion we disagree with, or ever having our prejudices challenged. This is how Reason thinks of it (Block Bots Automate Epistemic Closure On Twitter). Surely there’s some truth here too. How hard would it be to create a filter that blocks all conservative/liberal opinions? Just guess based on whether a text links to foxnews.com or dailykos.com, or add in linguistic cues (“death tax”, “job creators”, etc). Once such a filter existed, how many people do you think would use it proudly, bragging about how they’re no longer “wasting their time listening patiently to bigots” or whatever?

But I don’t think the scenario is quite that apocalyptic. If you’re getting all of your exposure to opinions you disagree with from them being shouted in your face by people you can’t avoid, you probably are not going to lose much by not having that happen. The people who are actually interested in holding discussions can still do that. When I was young and therefore stupid I used to hang out at politics forums specifically for this purpose.

The third possibility is that there would be a remarkable shift of discourse in favor of the powerful and against the powerless.

Terrorism has always been a useful weapon of the powerless. The powerful get laws passed through Congress or whatever, but the powerless don’t have that opportunity. They need to get people to pay attention, and blowing those people up has always been an effective tool in that repertoire. We see this most obviously in places like Palestine and the Basque Country. Likewise, as many people have pointed out, the recent riots in Baltimore can be thought of as a group of powerless people trying to make their anger heard in one of the only ways available to them. It would be politically un-savvy to call this “terrorism”, but as acts of destruction intended to promote a political struggle, they probably fit into the same cluster.

But the next step down from terrorism is annoyism. Terrorism is meant to convince by terrorizing those who ignore your cause; annoyism is meant to convince by annoying people who ignore your cause. Think of a bunch of protesters shouting on a major road, or throwing red paint over people wearing fur, or passive-aggressive Tumblr posts starting “dear white dudes”, or, in probably the purest example of the idea, the Black Brunch protests, where a bunch of black people burst into predominantly white restaurants and shout at patrons about how they’re probably complicit with racism. Even if there’s no implicit threat of force, the point is it’s unpleasant and people can’t ignore it even if they want to.

And so the traditional revolutionary chant goes: “No justice, no peace.” But the thing about filters is that they offer the opportunity for peace regardless of whether or not there is justice. At least they do online, which is where people in the future are going to be spending a lot more of their time.

Imagine you are a rich person who doesn’t want to have to listen to people talking about how rich people need to be socially responsible all the time. It makes you feel guilty, and they are saying mean things like that you don’t deserve all of the money you have, and shouting about social parasites and so on.

So you tell your automated filter to just never let you see any message like that again.

There is an oft-discussed division between politically right or neutral loud angry people (“trolls”) and loud angry people on the political left (“you are not allowed to dictate the terms on which victims of oppression express their righteous anger”). Machine learning programs will not accept that division, and the latter can be magicked out of visibility just as easily as the former.

Imagine being able to put an entire movement on mute. While I can’t deny the appeal, I’m not sure we – and especially not the social justice community, which is currently laughing at the complaints of people who object to their blockbot – have entirely thought this one through.

III.

The part I find most interesting about all of these possibilities is that they force us to bring previously unconscious social decisions into consciousness.

I think most people, if asked “Is it important to listen to arguments by people who disagree with you?” would answer in the affirmative. I also think most people don’t really do this. Maybe having to set a filter would make people explicitly choose to allow some contrary arguments in. Having done that, people could no longer complain about seeing them – they would feel more of an obligation to read and think about them. And of course, anyone looking for anything more than outrage-bait would choose to preferentially let in high-quality, non-insulting examples of disagreeing views, and so get inspired to think clearly instead of just starting one more rage spiral.

And I think most people, if asked “Is it important to listen to the concerns of the less powerful?” would also be pretty strongly in favor – with the caveat that people can recognize annoyism when it’s being used against them and aren’t especially tolerant of it. The ability to completely block out annoyism, combined with people being forced to explicitly choose to listen to alternative opinions, might make groups that currently favor annoyism change tactics to something more pleasant – though possibly less effective.

I think the result would be several carefully separated groups with their own social and epistemic norms, all of which coexist peacefully and in relative isolation from one another – groups which I would hope then develop their own norms about helping powerless members. This would be an interesting step towards what I describe in my Archipelago article as “a world where everyone is a member of more or less the community they deserve.”

Prescriptions, Paradoxes, and Perversities

[WARNING: I am not a pharmacologist. I am not a researcher. I am not a statistician. This is not medical advice. This is really weird and you should not take it too seriously until it has been confirmed]

I.

I’ve been playing around with data from Internet databases that aggregate patient reviews of medications.

Are these any good? I looked at four of the largest such databases – Drugs.com, WebMD, AskAPatient, and DrugLib – as well as the psychiatry-specific site CrazyMeds, and took their data on twenty-three major antidepressants. Then I correlated them with one another to see if the five sites mostly agreed.

Correlations between Drugs.com, AskAPatient, and WebMD were generally large and positive (around 0.7). Correlations between CrazyMeds and DrugLib were generally small or negative. In retrospect this makes sense, because these two sites didn’t allow separation of ratings by condition, so for example Seroquel-for-depression was being mixed with Seroquel-for-schizophrenia.

So I threw out the two offending sites and kept Drugs.com, AskAPatient, and WebMD. I normalized all the data, then took the weighted average of all three sites. From this huge sample (the least-reviewed drug had 35 ratings, the most-reviewed drug 4,797) I obtained a unified opinion of patients’ favorite and least favorite antidepressants.
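For concreteness, here is a minimal sketch of that normalize-then-average step. The numbers and column names are invented, the real scraped data isn’t reproduced here, and weighting each site by its review count is my assumption about how the weighting works.

```python
# Hypothetical aggregation of three rating sites into one combined score.
import pandas as pd

df = pd.DataFrame({
    "drug":          ["nardil", "parnate", "viibryd"],
    "drugs_com":     [9.0, 8.8, 6.1], "drugs_com_n":   [150, 120, 900],
    "askapatient":   [4.3, 4.2, 2.9], "askapatient_n": [200, 160, 1100],
    "webmd":         [4.5, 4.4, 3.0], "webmd_n":       [180, 140, 1000],
})

sites = ["drugs_com", "askapatient", "webmd"]
for s in sites:                                    # normalize: put each site on a common scale
    df[s + "_z"] = (df[s] - df[s].mean()) / df[s].std()

counts = df[[s + "_n" for s in sites]].to_numpy()
zscores = df[[s + "_z" for s in sites]].to_numpy()
df["combined"] = (zscores * counts).sum(axis=1) / counts.sum(axis=1)   # review-count-weighted average

print(df[["drug", "combined"]].sort_values("combined", ascending=False))
```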

This ranking doesn’t surprise me at all. Everyone secretly knows Nardil and Parnate (the two commonly-used drugs in the MAOI class) are excellent antidepressants1. Oh, nobody will prescribe them, because of the dynamic discussed here, but in their hearts they know it’s true.

Likewise, I feel pretty good to see that Serzone, which I recently defended, is number five. I’ve had terrible luck with Viibryd, and it just seems to make people taking it more annoying, which is not a listed side effect but which I swear has happened.

The table also matches the evidence from chemistry – drugs with similar molecular structure get similar ratings, as do drugs with similar function. This is, I think, a good list.

Which is too bad, because it makes the next part that much more terrifying.

II.

There is a sixth major Internet database of drug ratings. It is called RateRx, and it differs from the other five in an important way: it solicits ratings from doctors, not patients. It’s a great idea – if you trust your doctor to tell you which drug is best, why not take advantage of wisdom-of-crowds and trust all the doctors?

[Image: the RateRx logo. Spoiler: this is going to seem really ironic in about thirty seconds.]

RateRx has a modest but respectable sample size – the drugs on my list got between 32 and 70 doctor reviews. There’s only one problem.

You remember patient reviews on the big three sites correlated about +0.7 with each other, right? So patients pretty much agree on which drugs are good and which are bad?

Doctor reviews on RateRx correlated at -0.21 with patient reviews. The negative relationship is nonsignificant, but that just means that at best, doctor reviews are totally uncorrelated with patient consensus.

This has an obvious but very disturbing corollary. I couldn’t get good numbers on how many times each of the antidepressants on my list was prescribed, because the information I’ve seen only gives prescription numbers for a few top-selling drugs, plus we’ve got the same problem of not being able to distinguish depression prescriptions from anxiety prescriptions from psychosis prescriptions. But total number of online reviews makes a pretty good proxy. After all, the more patients are using a drug, the more are likely to review it.

Quick sanity check: the most reviewed drug on my list was Cymbalta. Cymbalta was also the best selling antidepressant of 2014. Although my list doesn’t exactly track the best-sellers, that seems to be a function of how long a drug has been out – a best-seller that came out last year might have only 1/10th the number of reviews as a best-seller that came out ten years ago. So number of reviews seems to be a decent correlate for amount a drug is used.

In that case, amount a drug is used correlates highly (+0.67, p = 0.005) with doctors’ opinion of the drug, which makes perfect sense since doctors are the ones prescribing it. But amount the drug gets used correlates negatively with patient rating of the drug (-0.34, p = ns), which of course is to be expected given the negative correlation between doctor opinion and patient opinion.

So the more patients like a drug, the less likely it is to be prescribed2.

III.

There’s one more act in this horror show.

Anyone familiar with these medications reading the table above has probably already noticed this one, but I figured I might as well make it official.

I correlated the average rating of each drug with the year it came on the market. The correlation was -0.71 (p < .001). That is, the newer a drug was, the less patients liked it3.
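All of the correlations in this section and the last are plain Pearson correlations. A minimal sketch of the year-of-release version, with a handful of drugs standing in for the full table and made-up normalized ratings, looks like this:

```python
# Illustrative only: six drugs with the years used in the post and invented
# normalized patient ratings; the same pearsonr call covers the doctor-vs-patient
# and usage-vs-rating correlations discussed above.
from scipy import stats

#          Nardil Parnate Anafranil Viibryd Abilify Brintellix
years   = [1960,  1961,   1967,     2011,   2007,   2013]
ratings = [1.0,   0.9,    0.8,      -1.0,   -0.9,   -0.8]   # made-up combined scores

r, p = stats.pearsonr(years, ratings)
print(r, p)   # strongly negative: the newer the drug, the lower the patient rating
```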

This pattern absolutely jumps out of the data. First- and second- place winners Nardil and Parnate came out in 1960 and 1961, respectively; I can’t find the exact year third-place winner Anafranil came out, but the first reference to its trade name I can find in the literature is from 1967, so I used that. In contrast, last-place winner Viibryd came out in 2011, second-to-last place winner Abilify got its depression indication in 2007, and third-to-last place winner Brintellix is as recent as 2013.

This result is robust to various different methods of analysis, including declaring MAOIs to be an unfair advantage for Team Old and removing all of them, changing which minor tricyclics I do and don’t include in the data, and altering whether Deprenyl, a drug that technically came out in 1970 but received a gritty reboot under the name Emsam in 2006, is counted as older or newer.

So if you want to know what medication will make you happiest, at least according to this analysis your best bet isn’t to ask your doctor, check what’s most popular, or even check any individual online rating database. It’s to look at the approval date on the label and choose the one that came out first.

IV.

What the hell is going on with these data?

I would like to dismiss this as confounded, but I have to admit that any reasonable person would expect the confounders to go the opposite way.

That is: older, less popular drugs are usually brought out only when newer, more popular drugs have failed. MAOIs, the clear winner of this analysis, are very clearly reserved in the guidelines for “treatment-resistant depression”, ie depression you’ve already thrown everything you’ve got at. But these are precisely the depressions that are hardest to treat.

Imagine you are testing the fighting ability of three people via ten boxing matches. You ask Alice to fight a Chihuahua, Bob to fight a Doberman, and Carol to fight Cthulhu. You would expect this test to be biased in favor of Alice and against Carol. But MAOIs and all these other older rarer drugs are practically never brought out except against Cthulhu. Yet they still have the best win-loss record.

Here are the only things I can think of that might be confounding these results.

Perhaps because these drugs are so rare and unpopular, psychiatrists only use them when they have really really good reason. That is, the most popular drug of the year they pretty much cluster-bomb everybody with. But every so often, they see some patient who seems absolutely 100% perfect for clomipramine, a patient who practically screams “clomipramine!” at them, and then they give this patient clomipramine, and she does really well on it.

(but psychiatrists aren’t actually that good at personalizing antidepressant treatments. The only thing even sort of like that is that MAOIs are extra-good for a subtype called atypical depression. But that’s like a third of the depressed population, which doesn’t leave much room for this super-precise-targeting hypothesis.)

Or perhaps once drugs have been on the market longer, patients figure out what they like. Brintellix is so new that the Brintellix patients are the ones whose doctors said “Hey, let’s try you on Brintellix” and they said “Whatever”. MAOIs have been on the market so long that presumably MAOI patients are ones who tried a dozen antidepressants before and stayed on MAOIs because they were the only ones that worked.

(but Prozac has been on the market 25 years now. This should only apply to a couple of very new drugs, not the whole list.)

Or perhaps the older drugs have so many side effects that no one would stay on them unless they’re absolutely perfect, whereas people are happy to stay on the newer drugs even if they’re not doing much because whatever, it’s not like they’re causing any trouble.

(but Seroquel and Abilify, two very new drugs, have awful side effects, yet are down at the bottom along with all the other new drugs)

Or perhaps patients on very rare weird drugs get a special placebo effect, because they feel that their psychiatrist cares enough about them to personalize treatment. Perhaps they identify with the drug – “I am special, I’m one of the only people in the world who’s on nefazodone!” and they become attached to it and want to preach its greatness to the world.

(but drugs that are rare because they are especially new don’t get that benefit. I would expect people to also get excited about being given the latest, flashiest thing. But only drugs that are rare because they are old get the benefit, not drugs that are rare because they are new.)

Or perhaps psychiatrists tend to prescribe the drugs they “imprinted on” in medical school and residency, so older psychiatrists prescribe older drugs and the newest psychiatrists prescribe the newest drugs. But older psychiatrists are probably much more experienced and better at what they do, which could affect patients in other ways – the placebo effect of being with a doctor who radiates competence, or maybe the more experienced psychiatrists are really good at psychotherapy, and that makes the patient better, and they attribute it to the drug.

(but read on…)

V.

Or perhaps we should take this data at face value and assume our antidepressants have been getting worse and worse over the past fifty years.

This is not entirely as outlandish as it sounds. The history of the past fifty years has been a history of moving from drugs with more side effects to drugs with fewer side effects, with what I consider somewhat less than due diligence in making sure the drugs were quite as effective in the applicable population. This is a very complicated and controversial statement which I will be happy to defend in the comments if someone asks.

The big problem is: drugs go off-patent after twenty years. Drug companies want to push new, on-patent medications, and most research is funded by drug companies. So lots and lots of research is aimed at proving that newer medications invented in the past twenty years (which make drug companies money) are better than older medications (which don’t).

I’ll give one example. There is only a single study in the entire literature directly comparing the MAOIs – the very old antidepressants that did best on the patient ratings – to SSRIs, the antidepressants of the modern day4. This study found that phenelzine, a typical MAOI, was no better than Prozac, a typical SSRI. Since Prozac had fewer side effects, that made the choice in favor of Prozac easy.

Did you know you can look up the authors of scientific studies on LinkedIn and sometimes get very relevant information? For example, the lead author of this study has a resume that clearly lists him as working for Eli Lilly at the time the study was conducted (spoiler: Eli Lilly is the company that makes Prozac). The second author’s LinkedIn profile shows he is also an operations manager for Eli Lilly. Googling the fifth author’s name links to a news article about Eli Lilly making a $750,000 donation to his clinic. Also there’s a little blurb at the bottom of the paper saying “Supported by a research grant by Eli Lilly and company”, then thanking several Eli Lilly executives by name for their assistance.

This is the sort of study which I kind of wish had gotten replicated before we decided to throw away an entire generation of antidepressants based on the result.

But who will come to phenelzine’s defense? Not Parke-Davis, the company that made it: their patent expired sometime in the seventies, and then they were bought out by Pfizer5. And not Pfizer – without a patent they can’t make any money off Nardil, and besides, Nardil is competing with their own on-patent SSRI drug Zoloft, so Pfizer has as much incentive as everyone else to push the “SSRIs are best, better than all the rest” line.

Every twenty years, pharmaceutical companies have an incentive to suddenly declare that all their old antidepressants were awful and you should never use them, but whatever new antidepressant they managed to dredge up is super awesome and you should use it all the time. This does seem like the sort of situation that might lead to older medications being better than newer ones. A couple of people have been pushing this line for years – I was introduced to it by Dr. Ken Gillman from Psychotropical Research, whose recommendation of MAOIs and Anafranil as most effective matches the patient data very well, and whose essay Why Most New Antidepressants Are Ineffective is worth a read.

I’m not sure I go as far as he does – even if new antidepressants aren’t worse outright, they might still trade less efficacy for better safety. Even if they handled the tradeoff well, it would look like a net loss on patient rating data. After all, assume Drug A is 10% more effective than Drug B, but also kills 1% of its users per year, while Drug B kills nobody. Here there’s a good case that Drug B is much better and a true advance. But Drug A’s ratings would look better, since dead men tell no tales and don’t get to put their objections into online drug rating sites. Even if victims’ families did give the drug the lowest possible rating, 1% of people giving a very low rating might still not counteract 99% of people giving it a higher rating.
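To put rough numbers on that, here is a back-of-the-envelope sketch; every figure in it is invented for illustration.

```python
# "Dead men leave no reviews": a riskier but more effective drug can still end up
# with the higher average rating. All rates and rating values are made up.
def expected_rating(cure_rate, death_rate, cured=9, not_cured=4, bereaved=1):
    """Average posted rating, assuming the dead don't post but every victim's
    family leaves the worst possible rating on their behalf."""
    survivors = 1 - death_rate
    return (survivors * (cure_rate * cured + (1 - cure_rate) * not_cured)
            + death_rate * bereaved)

print(expected_rating(cure_rate=0.60, death_rate=0.01))   # Drug A: more effective, kills 1%/yr -> ~6.94
print(expected_rating(cure_rate=0.50, death_rate=0.00))   # Drug B: safer, less effective        -> 6.50
# Drug A looks better on the rating sites even though Drug B is arguably the better drug.
```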

And once again, I’m not sure the tradeoff is handled very well at all6.

VI.

In order to distinguish between all these hypotheses, I decided to get a lot more data.

I grabbed all the popular antipsychotics, antihypertensives, antidiabetics, and anticonvulsants from the three databases, for a total of 55,498 ratings of 74 different drugs. I ran the same analysis on the whole set.

The three databases still correlate with each other at respectable levels of +0.46, +0.54, and +0.53. All of these correlations are highly significant, p < 0.01.

The negative correlation between patient rating and doctor rating remains and is now a highly significant -0.344, p < 0.01. This is robust even if antidepressants are removed from the analysis, and is notable in both psychiatric and nonpsychiatric drugs.

The correlation between patient rating and year of release is a no-longer-significant -0.191. This is heterogenous; antidepressants and antipsychotics show a strong bias in favor of older medications, and antidiabetics, antihypertensives, and anticonvulsants show a slight nonsignificant bias in favor of newer medications. So it would seem like the older-is-better effect is purely psychiatric.

I conclude that for some reason, there really is a highly significant effect across all classes of drugs that makes doctors love the drugs patients hate, and vice versa.

I also conclude that older psychiatric drugs seem to be liked much better by patients, and that this is not some kind of simple artifact or bias, since if such an artifact or bias existed we would expect it to repeat in other kinds of drugs, which it doesn’t.

VII.

Please feel free to check my results. Here is a spreadsheet (.xls) containing all of the data I used for this analysis. Drugs are marked by class: 1 is antidepressants, 2 is antidiabetics, 3 is antipsychotics, 4 is antihypertensives, and 5 is anticonvulsants. You should be able to navigate the rest of it pretty easily.

One analysis that needs doing is to separate out drug effectiveness versus side effects. The numbers I used were combined satisfaction ratings, but a few databases – most notably WebMD – give you both separately. Looking more closely at those numbers might help confirm or disconfirm some of the theories above.
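If you’d rather poke at it programmatically, a minimal sketch of the per-class re-analysis might look like the following. The filename and column names are my guesses, not what’s actually in the spreadsheet, so adjust them to match the real file.

```python
# Hypothetical re-analysis of the linked spreadsheet (reading .xls needs the xlrd package).
import pandas as pd
from scipy import stats

df = pd.read_excel("drug_ratings.xls")   # placeholder filename

# Class codes from the post: 1 antidepressants, 2 antidiabetics, 3 antipsychotics,
# 4 antihypertensives, 5 anticonvulsants.
for cls, grp in df.groupby("class"):
    r, p = stats.pearsonr(grp["patient_rating"], grp["year_released"])
    print(f"class {cls}: r = {r:.2f}, p = {p:.3f}")

# With WebMD's separate effectiveness and satisfaction columns (where available), the
# same loop could test whether the older-is-better effect comes from efficacy or tolerability.
```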

If anyone with the necessary credentials is interested in doing the hard work to publish this as a scientific paper, drop me an email and we can talk.

Footnotes

1. Technically, MAOI superiority has only been proven for atypical depression, the type of depression where you can still have changing moods but you are unhappy on net. But I’d speculate that right now most patients diagnosed with depression have atypical depression, far more than the studies would indicate, simply because we’re diagnosing less and less severe cases these days, and less severe cases seem more atypical.

2. First-place winner Nardil has only 16% as many reviews as last-place winner Viibryd, even though Nardil has been on the market fifty years and Viibryd for four. Despite its observed superiority, Nardil may very possibly be prescribed less than 1% as often as Viibryd.

3. Pretty much the same thing is true if, instead of looking at the year they came out, you just rank them in order from earliest to latest.

4. On the other hand, what we do have is a lot of studies comparing MAOIs to imipramine, and a lot of other studies comparing modern antidepressants to imipramine. For atypical depression and dysthymia, MAOIs beat imipramine handily, but the modern antidepressants are about equal to imipramine. This strongly implies the MAOIs beat the modern antidepressants in these categories.

5. Interesting Parke-Davis facts: Parke-Davis got rich by being the people to market cocaine back in the old days when people treated it as a pharmaceutical, which must have been kind of like a license to print money. They also worked on hallucinogens with no less a figure than Aleister Crowley, who got a nice tour of their facilities in Detroit.

6. Consider: Seminars In General Psychiatry estimates that MAOIs kill one person per 100,000 patient years. A third of all depressions are atypical. MAOIs are 25 percentage points more likely to treat atypical depression than other antidepressants. So for every 100,000 patients you give a MAOI instead of a normal antidepressant, you kill one and cure 8,250 who wouldn’t otherwise be cured. The QALY database says that a year of moderate depression is worth about 0.6 QALYs. So for every 100,000 patients you give MAOIs, you’re losing about 30 QALYs and gaining about 3,300.

OT19: Don’t Thread On Me

This is the semimonthly open thread. Post about anything you want, ask random questions, whatever. Also:

1. Comments of the week are Scott McGreal actually reading the supplement of that growth mindset study, and gwern responding to the cactus-person story in the most gwernish way possible.

2. Worthy members of the in-group who need financial help: CyborgButterflies (donate here) and as always the guy who runs CrazyMeds (donate by clicking the yellow DONATE button on the right side here)

3. I offer you a statistical mystery a little closer to home than the ones we usually investigate around here: how come my blog readership has collapsed? The week-by-week chart looks like this:

Notice that the week of February 23rd it falls and has never recovered. In fact, I can pinpoint the specific day:

Between February 20th and February 21, I lost about a third of my blog readership, and they haven’t come back.

Now, I did go on vacation starting February 20 and make fewer posts than normal during that time, but usually when I don’t post for a while I get a very gradual drop-off, whereas here, the day after a relatively popular post, everyone departs all of a sudden. And I’ve been back from vacation for a month and a half without anything getting better.

I would assume maybe WordPress changed its method of calculating statistics around that time, but I can’t find any evidence of this on the WordPress webpage. That suggests it might be a real thing. Did any of you leave around February 20th for some reason and not check the blog again until today? Did anything happen February 20th that tempted you to leave and you only barely hung on? I get self-esteem and occasionally money from blog hits, so this is kind of bothering me.

4. I want to clarify that when I discuss growth mindset, the strongest conclusion I can come to is that it’s not on as firm ground as some people seem to think. I do not endorse claims that I have “debunked” growth mindset or that it is “stupid”. There are still lots of excellent studies in favor, they just have to be interpreted in the context of other things.
