Google Correlate does not imply Google Causation

I need something sexy, something to lure new readers to this new blog and get them excited. So let’s talk about statistical correlations. No, wait, failed statistical correlations!

Google Correlate is a nifty new Google product that takes data sets and finds search terms that correlate with them. For example, if you set it to “correlate over time” and enter a data set of average US temperature, it might return the search term “skiing”, because people are most likely to ski when it’s cold and so searches for skiing will be (inversely) correlated with temperature. You can also just enter in Google search terms and see what other search terms they’re correlated with.

The results seem to fall into two categories: obvious and nonsensical.

Ones with clear time patterns are obvious. If you enter in skiing, you’ll get “how to ski”, “buy skis”, “snowboarding”, “ski resorts”, and the like. If you enter in a news trend that was only popular at one point, you’ll get both related terms and other news trends only popular at that one point – for example, “school shooting” brings up “jan berenstain”, not because the Berenstain Bears books secretly cause school shootings (…one hopes) but because she died the same week as a relatively big one and so people were searching for both around the same time.

Things that don’t have obvious time patterns seem to bring up results that are both nonsensical and very, very convincing-looking. The worst are diseases.

This is Google Correlate’s result for heart attack. It matches it to “pink lace dress” with a correlation of .88 (for comparison, a study comparing cigarette use vs. lung cancer rates across different social groups found a correlation of .71).

Figure 1: Correlation between interest in heart attacks and in pink lace dresses, by time.

As far as I can tell, this is just an artifact of Google having lots and lots of search terms: with that many candidates, you would expect some of them to be heavily correlated by mere coincidence.
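
To get a feel for how easy coincidence is, here is a minimal sketch in Python. It is not how Google Correlate works internally. It assumes search trends behave like independent random walks (my simplification, not anything Google has said), and the names `pearson`, `random_walk`, and `best_spurious_match` are made up for illustration. It builds one target series and reports the best correlation found among a pile of completely unrelated candidates.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def random_walk(rng, length):
    """A series whose steps are independent Gaussian noise (assumed model)."""
    value, walk = 0.0, []
    for _ in range(length):
        value += rng.gauss(0, 1)
        walk.append(value)
    return walk


def best_spurious_match(n_candidates, length=100, seed=0):
    """Highest correlation between one target and n unrelated candidates."""
    rng = random.Random(seed)
    target = random_walk(rng, length)
    return max(
        pearson(target, random_walk(rng, length))
        for _ in range(n_candidates)
    )


if __name__ == "__main__":
    for pool in (10, 100, 2000):
        print(pool, round(best_spurious_match(pool), 2))
```

Because the seed is fixed, each larger pool contains the smaller ones, so the best match can only go up as the pool grows. None of the candidates has anything to do with the target, which is the whole point.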

Google also has a correlate-by-state feature. This one has even weirder results for heart attack, like “can you get a” and “is it a” (note that these are the entire search terms). I understand that “is it a heart attack” is a reasonable question, but I don’t understand who would just enter that phrase into Google and hope it would figure it out. I’m kind of imagining someone having a heart attack going on Google, typing as far as “is it a…” and then falling over dead, but I assume the real explanation is more prosaic, like someone expecting autocomplete to work but being disappointed.

Google’s state-by-state feature seemed potentially really exciting to me. I wrote a while back on the effect of parasite load, and I had the dataset lying around with different states ranked on different metrics. I entered the data for parasite load and got the following search terms: “Toy Johnson”, “Bernie Mac”, “booty models”, “Harvey suits”, “Beyonce clothing line”.

Figure 2: Correlation between parasite prevalence and interest in booty models, by state.

I didn’t actually know what most of these were (I kinda thought Bernie Mac was a real estate conglomerate, which turns out to be false) but upon closer investigation they are all black people or Stuff Black People Like. So I think what’s happening here is that the high-parasite load states are all in the South and relatively poor with low access to health care, which also selects for black people. This obviously has significant implications for the study’s attempt to determine that high parasite load causes certain social trends.

My next thought was “if I multiply this data set by negative one, I will have an objective pipeline to figuring out Stuff White People Like. That sounds interesting.” So I tried it, and my results were: “black albino”, “shake that eminem”, “tony hawk pro skater”, and “green day time of your life”. I was sort of hoping that “Black Albino” was the name of a band or something (it would actually be a pretty good one) but no, it turns out white people are just fascinated with the idea of black albinos. White people are kind of weird.

Figure 3: A black albino. Happy now, white people?

But let’s keep going through the state-by-state data set. My next Big Social Statistic was “importance of family ties, by state”. States with higher family ties were more likely to search for: “how to swim”, “composition book”, “noni juice”, “muscle men”, “girl kiss”, “Toyota Tacoma 2008”.

Figure 4: Correlation between strength of family ties and interest in swimming, by state.

A lot of these seem related to physical fitness, or ruggedness (the Tacoma seems to be a very sporty, rugged car), or masculinity. I’m not really sure what to make of this.

The last Social Science Statistic in the dataset was Religiosity, which correlated with the following search terms: “Christmas themes”, “rotary cutter”, “Honda rebel 250”. Christmas themes seems sort of plausible. I dunno about the rest.

So as far as I can tell Google Correlate is not very interesting. It doesn’t reveal any deep connections between concepts, or even guess what concept my dataset came from to begin with. For something potentially so powerful this is disappointing.

I can think of two possible uses for it. The first is as a sanity check to make sure your data aren’t completely confounded. If you think you’re measuring average number of roof tiles per house or something, and your data’s Google Correlate results come back with Toy Johnson and Beyonce clothing, you’re probably just measuring race and for some reason different races have different numbers of roof tiles on their houses. Which means if you think you’ve found a correlation between roof tiles and something fascinating like voting record, you’re probably just being confounded by race. This is a real problem in a lot of studies.
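
The roof-tile scenario can be sketched as a toy simulation. Everything in it is invented: 50 hypothetical states, a hidden variable that drives two otherwise unrelated measurements, and arbitrary noise levels. The adjustment uses the standard first-order partial correlation formula, which is my choice of illustration and not something Google Correlate does for you.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / ((1 - r_xz ** 2) * (1 - r_yz ** 2)) ** 0.5


def simulate_confounding(n_states=50, seed=0):
    """Return (raw, adjusted) correlations for two confounded measures."""
    rng = random.Random(seed)
    # Hidden per-state variable; neither measure affects the other.
    hidden = [rng.gauss(0, 1) for _ in range(n_states)]
    roof_tiles = [h + rng.gauss(0, 0.5) for h in hidden]
    voting = [h + rng.gauss(0, 0.5) for h in hidden]
    raw = pearson(roof_tiles, voting)
    adjusted = partial_correlation(roof_tiles, voting, hidden)
    return raw, adjusted


if __name__ == "__main__":
    raw, adjusted = simulate_confounding()
    print("raw:", round(raw, 2), "adjusted:", round(adjusted, 2))
```

The raw correlation between the two made-up measures comes out strong even though neither influences the other, and it shrinks once the hidden variable is controlled for. In real life, of course, you rarely get handed the hidden variable, which is where a sanity check like the one above earns its keep.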

The second is as a cheap hack for creating datasets. I entered “Jesus” in and got a state-by-state list of who searched for Jesus. It looked a lot like my state-by-state map of religiosity. The correlates were all things like “Apostle”, “Paul”, “preaching”, and for some reason “Abednego”, who is a very minor Biblical character who has no business being in the top ten correlates of Jesus at all. If you wanted to make a cheap map of state religiosity in order to correlate to parasite load or whatever, Google Correlate seems like a plausible method.

On the other hand, I tried to see if I could recreate their state map of parasite load. I asked it to correlate “metronidazole”, a medication commonly used in the treatment of parasitic diseases, on the grounds that people with parasites would be prescribed metronidazole and then look it up to see if it was safe. The result looked only a little like my map of state-by-state parasite data, and the number one correlated search term (r = .89) was “Is Lil’ Wayne gay?”

Figure 5: Correlation between curiosity over Lil’ Wayne’s sexual orientation and interest in the anti-parasitic medication metronidazole. Whatever my case was, I hereby rest it.

So if nothing else, this exercise has proven my suspicion that the sort of people who worry about whether Lil’ Wayne is gay are, in fact, crawling with parasites.

5 Responses to Google Correlate does not imply Google Causation

  1. Andrew Rettek says:

    You’ve convinced me that correlations are weaker evidence than I thought. They are a useful bit of evidence, but I would want to have a reason to privilege the hypothesis first.

  2. Julia says:

    Rotary cutters are mainly used to cut fabric for quilting. I’m not surprised that correlates with the Bible belt.

  3. Anonymous says:

    > So if nothing else, this exercise has proven my suspicion that the sort of people who worry about whether Lil’ Wayne is gay are, in fact, crawling with parasites.

    With this and “It is quite uncontroversial among historians that Lincoln attempted to summon the dead,” I’m quite enthralled by your abilities at obviously joking statements which are quite literally true.

  4. gwern says:

    Utilitarian ethics demand that we invade Mexico and seize all their citruses:

  5. Luke Somers says:

    > and for some reason “Abednego”, who is a very minor Biblical character who has no business being in the top ten correlates of Jesus at all.

    But he’s so hot!