Philip Tetlock, author of Superforecasting, got famous by studying prediction. His first major project, Expert Political Judgment, is frequently cited as finding that top pundits’ predictions are no more accurate than a chimp throwing darts at a list of possibilities, although Tetlock takes great pains to confess to us that no chimps were actually involved; this phrasing just sort of popped up as a flashier way of saying “random”.
Although this was generally true, he was able to distinguish a small subset of people who were able to do a little better than chance. His investigation into the secrets of their very moderate success led to his famous “fox” versus “hedgehog” dichotomy, based on the old saying that “the fox knows many things, the hedgehog knows one big thing”. Hedgehog pundits/experts are people who operate off a single big idea: for example, an economist who says that government intervention is always bad, predicts doom for any interventionist policy, and predicts great success for any noninterventionist one. Foxes are people who don’t have much of a narrative or ideology, but try to find the right perspective to approach each individual problem. Tetlock found that the hedgehogs did worse than the chimp and the foxes did a little better.
Cut to the late 2000s. The US intelligence community has just been seriously embarrassed by their disastrous declaration that there were weapons of mass destruction in Iraq. They set up the Intelligence Advanced Research Projects Activity (IARPA) to try crazy things and see if any of them worked. IARPA approached a bunch of scientists, handed them a list of important world events that might or might not happen, and told them to create some teams and systems for themselves and compete against each other to see who could predict them the best.
Tetlock was one of these scientists, and his entry into the competition was called the Good Judgment Project. The plan was simple: get a bunch of people to sign up and try to predict things, then find the ones who did the best. This worked pretty well. 2,800 people showed up, and a few of them turned out to be…
…okay, now we’re getting to a part I don’t understand. When I read Tetlock’s paper, all he says is that he took the top sixty forecasters, declared them superforecasters, and then studied them intensively. That’s fine; I’d love to know what puts someone in the top 2% of forecasters. But it’s important not to phrase this as “Philip Tetlock discovered that 2% of people are superforecasters”. This suggests a discontinuity, a natural division into two groups. But unless I’m missing something, there’s no evidence for this. Two percent of forecasters were in the top two percent. Then Tetlock named them “superforecasters”. We can discuss what skills help people make it this high, but we probably shouldn’t think of it as a specific phenomenon.
Anyway, the Good Judgment Project then put these superforecasters on teams with other superforecasters, averaged out their decisions, slightly increased the final confidence levels (to represent the fact that it was 60 separate people, all of whom were that confident), and presented that to IARPA as their final answer. Not only did they beat all the other groups in IARPA’s challenge in a landslide, but they actually did 30% better than professional CIA analysts working off classified information.
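The book describes the aggregation step only loosely, so here is a minimal sketch of one common way to “slightly increase the final confidence” of an averaged forecast: extremize it in log-odds space. The exponent and the example probabilities below are my own illustrative choices, not numbers from the Good Judgment Project.

```python
def extremize(probabilities, a=1.5):
    """Average a group's probability forecasts, then push the average away
    from 0.5 in log-odds space. a > 1 sharpens the forecast; a = 1 returns
    the plain mean. (a = 1.5 is illustrative, not GJP's actual setting.)"""
    p = sum(probabilities) / len(probabilities)
    odds = (p / (1 - p)) ** a
    return odds / (1 + odds)

# Five forecasters who each lean toward "yes":
print(extremize([0.70, 0.65, 0.75, 0.80, 0.70]))  # ~0.80, versus a raw mean of 0.72
```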
Having established that this is all pretty neat, Tetlock turns to figuring out how superforecasters are so successful.
First of all, is it just luck? After all, if a thousand chimps throw darts at a list of stocks, one of them will hit the next Google, after which we can declare it a “superchimp”. Is that what’s going on here? No. Superforecasters one year tended to remain superforecasters the next. The year-to-year correlation in who was most accurate was 0.65; about 70% of superforecasters in the first year remained superforecasters in the second. This is definitely a real thing.
Are superforecasters just really smart? Well, sort of. The superforecasters whom Tetlock profiles in his book include a Harvard physics PhD who speaks six languages, an assistant math professor at Cornell, a retired IBM programmer and data wonk, et cetera. But the average superforecaster is only at the 80th percentile for IQ – just under 115. And there are a lot of people who are very smart but not very good at predicting. So while IQ definitely helps, it isn’t the whole story.
Are superforecasters just really well-informed about the world? Again, sort of. The correlation between well-informedness and accuracy was about the same as the correlation between IQ and accuracy. None of them are remarkable for spending every single moment behind a newspaper, and none of them had as much data available as the CIA analysts with access to top secret information. Even when they made decisions based on limited information, they still beat other forecasters. Once again, this definitely helps, but it’s not the whole story.
Are superforecasters just really good at math? Again, kind of. A lot of them are math PhDs or math professors. But they all tend to say that they don’t explicitly use numbers when doing their forecasting. And some of them don’t have any kind of formal math background at all. The correlation between math skills and accuracy was about the same as all the other correlations.
So what are they really good at? Tetlock concludes that the number one most important factor to being a superforecaster is really understanding logic and probability.
Part of it is just understanding the basics. Superforecasters are less likely to think in terms of things being 100% certain, and – let’s remember just how far left the bell curve stretches – less likely to assign anything they’re not sure about a 50-50 probability. They’re less likely to believe that things happen because they’re fated to happen, or that the good guys always win, or that things that happen will necessarily teach a moral lesson. They’re more likely to admit they might be wrong and correct themselves after an error is discovered. They’re more likely to debate with themselves, try to challenge their original perception, start asking “What could be wrong about this thing I believe?” rather than “How can I prove I’m right?”
But they’re also more comfortable actively using probabilities. As with my own predictions, the Good Judgment Project made forecasters give their answers as numerical probability estimates – for example, a 15% chance of a war between North and South Korea in the next ten years killing > 1000 people. Poor forecasters tend to make a gut decision based on feelings that are only superficially related to the question, like “Well, North Korea is pretty crazy, so they’re pretty likely to declare war, let’s say 90%” or “War is pretty rare these days, how about 10%?”. Superforecasters tend to focus on the specific problem in front of them and break it down into pieces. For example, they might start with the Outside View – it’s been about 50 years since the Koreas last fought, so their war probability per decade shouldn’t be more than about 20% – and then adjust that based on Inside View information – “North Korea has a lot fewer foreign allies these days, so they’re less likely to start something than they once were – maybe 15%”.
Or they might break the problem down into pieces: “There would have to be some sort of international incident, and then that incident would have to erupt into total war, and then that war would have to kill > 1,000 people. There are about two international incidents between the Koreas every year, but almost none of them end in war; on the other hand, because of all the artillery aimed at Seoul, probably any war that did happen would have an almost 100% chance of killing > 1,000 people” … and so on. One result is that while poor forecasters tend to give their answers in broad strokes – maybe a 75% chance, or 90%, or so on – superforecasters are more fine-grained. They may say something like “82% chance” – and it’s not just pretension; Tetlock found that when you rounded their predictions off to the nearest 5 (or 10, or whatever), their accuracy actually decreased significantly. That 2% is actually doing good work.
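To make the mechanics concrete, here is a minimal sketch of the two styles of reasoning described above. Every number is either taken from the prose or an explicitly made-up placeholder; nothing here comes from the book’s data.

```python
# Outside view: roughly one war in ~50 years suggests a base rate on the
# order of 20% per decade; the inside view then adjusts it downward a bit.
base_rate_per_decade = 0.20
inside_view_adjustment = 0.75        # fewer foreign allies, less appetite for war
p_outside_inside = base_rate_per_decade * inside_view_adjustment      # 0.15

# Alternative route: break the event into pieces.
incidents_per_year = 2               # typical incidents between the Koreas
p_escalates_to_war = 0.004           # almost none escalate (made-up placeholder)
p_war_kills_1000 = 0.99              # artillery near Seoul makes this near-certain
war_rate_per_year = incidents_per_year * p_escalates_to_war * p_war_kills_1000
p_decomposed = 1 - (1 - war_rate_per_year) ** 10   # at least one war in a decade

print(round(p_outside_inside, 3), round(p_decomposed, 3))   # 0.15 0.076
```

The two routes disagree here only because the placeholder numbers are rough; the point is the mechanics of anchoring on a base rate and of decomposing the event, not the particular outputs.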
Most interesting, they seem to be partly immune to cognitive bias. The strongest predictor of forecasting ability (okay, fine, not by much, it was pretty much the same as IQ and well-informedness and all that – but it was a predictor) was the Cognitive Reflection Test, which includes three questions with answers that are simple, obvious, and wrong. The test seems to measure whether people take a second to step back from their System 1 judgments and analyze them critically. Superforecasters seem especially good at this.
Tetlock cooperated with Daniel Kahneman on an experiment to elicit scope insensitivity in forecasters. Remember, scope insensitivity is where you give a number-independent answer to a numerical question. For example, how much should an organization pay to save the lives of 100 endangered birds? Ask a hundred people, and maybe the average answer is “$10,000”. Ask a different group of a hundred people how much the same organization should pay to save the lives of 1000 endangered birds, and maybe the average answer will still be $10,000. So it seems you can get people to change their estimate of the value of bird life just by changing the number in the question. Poor forecasters do the same thing on their predictions. For example, a hundred poor forecasters might on average predict a 15% chance of war in Korea in the next five years, and a different group of a hundred poor forecasters might on average predict a 15% chance of war in Korea in the next fifteen years. They’re ignoring the question and just going off of a vague feeling of how likely another Korean war seems. Superforecasters, in contrast, showed much reduced scope insensitivity, and their probability of a war in five years was appropriately lower than their probability of a war in fifteen.
Maybe all this stuff about probability calibration, inside vs. outside view, willingness to change your mind, and fighting cognitive biases is starting to sound familiar? Yeah, this is pretty much the same stuff as in the Less Wrong Sequences and a lot of CFAR work. They’re both drawing from the same tradition of cognitive science and rationality studies.
So as I said before, Superforecasting is not necessarily too useful for people who are already familiar with the cognitive science/rationality tradition, but great for people who need a high-status and official-looking book to justify it. The next time some random person from a terrible forum says that everything we’re doing is stupid, I’m already looking forward to pulling out Tetlock quotes like:
The superforecasters are a numerate bunch: many know about Bayes’ theorem and could deploy it if they felt it was worth the trouble. But they rarely crunch the numbers so explicitly. What matters far more to the superforecasters than Bayes’ theorem is Bayes’ core insight of gradually getting closer to the truth by constantly updating in proportion to the weight of the evidence. That’s true of Tim Minto [the top superforecaster]. He knows Bayes’ theorem, but he didn’t use it even once to make his hundreds of updated forecasts. And yet Minto appreciates the Bayesian spirit. “I think it is likely that I have a better intuitive grasp of Bayes’ theorem than most people,” he said, “even though if you asked me to write it down from memory I’d probably fail.” Minto is a Bayesian who does not use Bayes’ theorem. That paradoxical description applies to most superforecasters.
And if you’re interested, it looks like there’s a current version of the Good Judgment Project going on here that you can sign up for and see if you’re a superforecaster or not.
EDIT: A lot of people have asked the same question: am I being too dismissive? Isn’t it really important to have this book as evidence that these techniques work? Yes. It is important that the Good Judgment Project exists. But you might not want to read a three-hundred page book that explains lots of stuff like “Here’s what a cognitive bias is” just to hear that things work. If you already know what the techniques are, it might be quicker to read a study or a popular news article on GJP or something.
“So as I said before, Superforecasting is not necessarily too useful for people who are already familiar with the cognitive science/rationality tradition, but great for people who need a high-status and official-looking book to justify it.”
This is how I felt about Bostrom’s book.
I read Superintelligence after being immersed in LW for over a year, and I certainly didn’t share your feeling; I think that it’s excellent and informative.
Updating based on evidence and avoiding simple biases like anchoring or scope insensitivity is, first, one of the very basics of the Sequences and, second, something that you could have heard from any number of other sources, from Kahneman to Silver.
In contrast, the specific details of various AI development/control/alignment approaches are something that LW gives only a vague and one-sided idea of and something that you can’t really find compiled anywhere else. To put it in Tetlock’s terminology: Superintelligence is a fox book; it teaches you about a whole lot of things, more things than I thought could even be relevant to SAI.
(Typo: “Superforecasters, in contrast, showed much reduced scope sensitivity” — they’re more scope sensitive, less scope insensitive.)
Warning: mostly nonsense.
“So as I said before, Superforecasting is not necessarily too useful for people who are already familiar with the cognitive science/rationality tradition, but great for people who need a high-status and official-looking book to justify it.”
So, Tetlock looks at the world and sees how to actually work with it, the rationalists sit back and revel in fanfiction and fandom, lolololol.
That said, I wonder if there are any weirder cultural affinities of superforecasters. What sort of music do they like? What books do they read? What cultures did they grow up in, or are descended from? Do superforecasters peak at a certain age? Are they nerds, or physically fit and mentally healthy? I suppose I should read the book.
Followup thought: can you do this in reverse and have them predict history?
“What’s the probability that North Korea attacked South Korea in the last 50 years?”
It depends on what you mean by “attacked”.
From what I remember they were older and well educated. He mentioned some of them having a decent amount of time on their hands but I’m not sure how that generalized to all the superforecasters. The impression I got was they were basically older nerds.
So you do know that aside from one guy writing fanfiction, there was the entire Less Wrong Sequences, which were hundreds of posts attempting to synthesize and build upon the cognitive science literature, plus CFAR, plus…eh, I don’t even know why I bother with you people anymore.
I was kidding, come on, I read “The Sequences” before HPMOR was a thing (there was a lot of nerdiness there too, though), I enjoy fandom, I lurked LW, unrepentant, etc. “Warning: mostly nonsense?” and I followed it with “lolololol”. Banter!
(also being provocative gets responses to the other questions, ideally)
Kicking people in the shins probably also gets responses. It might even start an interesting conversation occasionally. I would still generally prefer it not to be done.
Prior updated: can dish out, but can’t take.
What do you mean by “dish out” here?
What exactly do you consider that I’ve dished out but can’t take? (It sounds like you haven’t noticed I’m not the same person as Scott, though actually I think my question would be reasonable even if I were.)
Insults can hurt. Keep that in mind next time.
If you are attempting self-deprecating humor, a minor change in phrasing like saying “we rationalists” might do wonders to make your meaning clear instead of making it seem like an attack.
Insults can hurt? What’s insulting is:
“So as I said before, Superforecasting is not necessarily too useful for people who are already familiar with the cognitive science/rationality tradition, but great for people who need a high-status and official-looking book to justify it.”
because it alleges that only lesser nerds need status, and we True Geeks knew the score before it was cool. This is an incredibly stupid and damaging way of looking at the world, a classic nerd failure mode, and every time nerds sneer at those of high status they emphasize their own weakness and make their situation worse.
honestlymellowstarlight, I think you have misunderstood the sentences you are quoting.
They are not talking about two groups of people and saying one is superior. The second group is the subgroup of the first that wants help sharing their knowledge with others in a particular way.
They are saying: if you’re already familiar with this way of thinking, you probably won’t find anything new TO YOU, but you may find an easily-presentable-to-other-people summary of things you know.
I am curious why you jumped straight to your “insult” interpretation. I don’t even see any derogatory language in there at all. Are you interpreting the tone as sarcastic?
honestlymellowstarlight, I have no idea how you managed to read an insult or a sneer into that sentence, and I would not bet money that you’re not just trolling us even more, just finding more and more ways to insult and distract.
So this post is my final response to you: Stop insulting people, and I suggest you not be so quick to find an insult in what another says; but if you’re indeed actually insulted by something, then fucking COMMUNICATE THAT (as we’re trying and seemingly failing to do with you), just in case, you know, no insult was meant.
As I said, I have no idea if you’re being sincere or trolling, but if you’re trolling, stop that too. It’s bad, ‘kay?
Sean, I think he (probably he) interpreted the word “need” to mean “should not actually need,” because that does seem like common usage on the Internet. Here it seems like the exact opposite of the intended meaning, but that may not be obvious to someone unfamiliar with Scott.
Note that Scott is also a political dirty-tricks expert and may in fact have a sinister motive for writing statements with double meanings; but this would still be incompatible with starlight’s reading of the situation.
While I can’t get mad at some good old banter, your comment mapped really well to a certain internet demographic that’s been generally pretty awful to Scott and pals, which might be why they weren’t too receptive of your joke.
Between this comment and Scott’s “you people”, I’m really (genuinely) interested what Persecuting Outgroup I’m apparently a part of.
Well, I don’t know who Scott’s talking about, but what I mean is… well, I’m not sure it’d be accurate or fair to talk of a group, but I think “The Rationalwiki Crowd” kind of summarizes the type of people I mean.
EDIT: Or maybe the “SA PYF crowd”, I think there’s a whole bunch of overlap (the styles are strikingly similar), but I don’t want to start assuming things.
Yudkowsky calls them haters and sneerers.
Flipping out at your hatedom never ends well, even when they’re being unreasonable. There are more of them, and they’re better at making you look stupid than you are at making them look stupid, because this is what they do for fun and so they have more practice. I wish Eliezer understood this.
(But maybe I’m interrogating the text from the wrong perspective.)
Hell, even Scott has this problem sometimes, and I think Scott’s a lot better at navigating these waters than Eliezer is.
same
Can’t help but smile at the people who used to badmouth trolls and write about how pacifism was killing their gardens or whatever now getting what they deserve.
“Yudkowsky calls them haters and sneerers”
Yudkowsky’s reaction to humorists and critics may not be optimal.
Humor on the internet and via purely text is weird. I’ve seen several other people make jokes while trying to be friendly towards the rationalist, LW, and SSC crowds and it often just doesn’t work as intended unless it’s an awkward pun.
Sometimes it falls flat and other times it blows up in the person’s face. Someone ends up getting offended, states it publicly, and then the well-intentioned jokester gets stuck in a position of having clashed with a portion of the group in unintended ways. (At which point I hope they just update their expressions of humor slightly to fit the group better rather than get pushed away.) I think the cultures are just missing an element of humor mores that many/most other cultures possess.
Puns are often the exception to this.
(At which point I hope they just update their expressions of humor slightly to fit the group better rather than get pushed away.)
Why wouldn’t you hope for option 3: a change in the culture? (Or is such hope futile?)
Don’t bother. Just ban!
Actually, it would be nice to see periodic feedback from the commenters on whether bans are too common or too uncommon.
My experience says that they are almost always too uncommon.
I too support blanket bans by ingroups of outgroups, especially if “you people” is invoked. Trigger warning: meant seriously!
Most likely too uncommon here. Definitely too uncommon on LW in the past. I also think that proactive bannings are important for a healthy and vibrant internet community.
Sidenote: I think honestlymellowstarlight was attempting humor that didn’t work out well. I read the sentence, “So, Tetlock looks at the world and sees how to actually work with it, the rationalists sit back and revel in fanfiction and fandom, lolololol.” as contrasting two groups who are doing the same thing but then get accused of having massively different statuses and connotations culturally when they obviously shouldn’t. Then again, I very well may be steelaliening him.
I was using intentionally abrasive humor to demonstrate the weird relationship the LW community has with the idea of status, as per my other comments. But it’s not as visceral when I say it like that, which is also part of the point.
(My dearest hope was that someone would attempt to steelman my original post, and I am happy at least one person mentioned it explicitly.)
The post claims zie was “just joking” but remains rather hostile. The poster also made a strange comment about Scott being insulting. I doubt there is a good “culture fit” between the “joking” poster and SSC.
This poster, zie has a name, you know. “Banter” is not “just joking”, “confrontational” is not “hostile”, these distinctions are important, think about it a little more.
We’re already under a Reign of Terror; we hardly need to add lynch mobs to the mix.
There’s a REIGN OF TERROR going on?? How did I miss this???*
OK, I know how I missed it. Still curious, though.
I am not going to pretend I wouldn’t be happy to see mellow go.
If you go around being a hostile, aggressive, “confrontational” poster you should expect people to express their displeasure.
TBH, there have been many hostile, aggressive, and confrontational posters toward whom no one has expressed anything approaching displeasure. But people get singled out seemingly at random, depending on who they happen to rub the wrong way.
Beware the fundamental attribution error. Right now honestlymellowstarlight has made a post people are objecting to and is defending themself. In a less confrontational context they may make valuable contributions.
If Scott could be trusted to ban the right people, bans would be way too uncommon, but given that Scott is the flawed human he is, I’d say they’re just about right.
If Scott were to be appointed government czar of all message boards, maybe. But given that the whole reason I come here is that I enjoy his insight and trust his judgment, I support him banning whoever he wants.
That being said, I think the situation is pretty reasonable. I can’t really think of any problem people who need to be banned. And the people he does ban pretty much have it coming.
I guess the bigger problem is this place turning into a conservative and libertarian echo chamber. I mean, I’m a libertarian myself, but a large part of the reason I come here is to hear from sensible members of the left. I don’t know that this is happening for sure, and I wouldn’t know what to do about it without making things worse, but it would be unfortunate.
In college, I was part of a great debate club that used to have a fair number both of conservatives and of relatively far-left people. But by the time I graduated, it was completely dominated by “establishment liberal”, center-left Hillary Clinton types. They were nice people and objectively more reasonable than the far-left ones, but it made things a lot less interesting.
I agree with Vox. I’m right-wing myself, and the main appeal of this place was that it was one with honest, intelligent liberals and leftists who actually treated conservatives as human beings with opinions, rather than as Orcs needing only to be crushed.
Now, though, a lot of those leftists are disappearing, and I’m losing that really valuable perspective.
@ Vox Imperatoris
Hey Vox, I just wanted to say — “a reasonable objectivist*” used to be an oxymoron to me, but you have changed that. Thanks.
(* I don’t know if you identify as an Objectivist, but you seem close to that cluster in terms of… intellectual background? — so I hope this approximation is OK with you.)
@ Nita:
Thank you!
I do “identify” as an Objectivist, of the Atlas Society inclination. Unfortunately, there are a lot of unreasonable Objectivists out there.
One of the most interesting things I’ve read in this regard is Nathaniel Branden’s essay “The Benefits and Hazards of the Philosophy of Ayn Rand”. It is interesting in that he retains a fundamentally positive outlook on Objectivism, despite his acrimonious personal relationship with Ayn Rand. But he points out several areas in which Rand’s intellectual style and outlook set a bad example—in a way that runs counter to the fundamental elements of the philosophy.
I thought the comment was perfectly fine for a public forum, but perhaps not for the comment section of a private blog (in the vein of “don’t walk into someone’s house and insult the host before you’ve taken your shoes off”).
I’m not sure which this comment section is. Certainly the open threads are more like a public discussion forum for an online community.
This is more or less the perspective I’m coming from:
http://fredrikdeboer.com/2014/06/10/did-nir-rosen-deserve-an-expectation-of-privacy-on-twitter/
>Actually, it would be nice to see periodic feedback from the commenters on whether bans are too common or too uncommon.
Bans have been more common than before the reign of terror, which I don’t like. But what I really don’t like is the higher rate of indefinite (permanent, in practice) bans.
Actually, it would be nice to see periodic feedback from the commenters on whether bans are too common or too uncommon.
I don’t think that SSC can remain viable in the long run without either more hands-on moderation than Scott can afford given that he has a day job, or frequent banning (and, yes, arbitrary permabanning) of commenters that would otherwise make hands-on moderation a particularly tedious and unwelcome task.
I can’t see a credible alternative to either the Reign of Terror or the end of SSC in anything like its present form, and I would prefer the Reign of Terror. I haven’t seen anything so far that would suggest it is excessive or substantially mistargeted in its application.
Cosigned in every detail.
What do you mean by “remain viable”?
Less seriously but for the record, I am also pro-Reign of Terror, and only mostly because of the name.
In other words, the Superforecasters are the people who did their homework.
No, reading up on the conflicts involved had some effect, but no more than other things like high IQ and numeracy, and less than a generally good understanding of probability.
In other words, they study smarter and not harder.
I strongly prefer the specifics described in the main post over the ambiguous-to-the-point-of-being-impossible-to-correctly-interpret “smarter and not harder”.
Doing homework also takes the form of being willing to calculate the probabilities instead of guessing. It looks like the non-superforecasters didn’t bother doing that.
Sort of. I don’t think of this as “doing homework” in the sense of working at the same thing harder, or even more formally. I think of it as a completely different thought process. I’m not sure if the sort of thing I’m doing is the same sort of thing they’re doing, but if it is, I can do it (badly) in my head in a couple of seconds. Its quick-and-dirty form is as simple as saying “It’s been 60 years since the last Korean War, so if nothing’s changed maybe the probability of a Korean War per year is somewhere on the order of 1/60. Certainly not 1/6 or 1/6000.”
You really need to do research to figure out base rates if you’re going to do prediction as well as the top forecasters. For things like war on the Korean peninsula it’s easy since I think most people have a good idea of the history off the top of their head. But I was in the experiment in one year and for most questions I would need to do a fair amount of research despite the fact that I read The Economist.
I also was in the experiment for a year and they would ask things about the probability of president X being ousted in the next year in small African country Y. I am a person that reads a lot, but every question usually involved quite a bit of reading to do basic background probabilities.
The bird question is not number independent at all. My price depends on how many are left. In the second case you informed me that there are at least 1000, in the first there might only be 100. I have no idea what the distribution of endangered bird populations is, but I’d go with a log prior, in which case I’d pay about the same whether you told me 100 or 1000.
Well, you beat me to it 🙂 One could also say that the marginal utility I have from an additional bird is decreasing. After all, if the question was about billions of birds, people would probably pay to kill some of them to keep their population in check.
The question is actually hard to answer if you are not provided with the information about how many birds there are in total. Saving 10 birds is a big deal if there are only 10 birds in total, saving ten birds in a population of millions is not worth a dime.
So basically the superforecasters are saying “Well, 100 birds are a lot fewer than 1,000 birds, so plainly they are more endangered and need more money faster to save them”?
The “bird question shows ordinary people just guess by the seat of their pants” isn’t quite fair: does it say how many birds there are in all? Is that “only 100 birds left in the entire world” or is it “only 100 birds in this particular area but there are more birds elsewhere”? That would certainly make a difference to me, for one! The same with asking a different group about 1,000 birds; that sounds like a reasonably high number in a flock, or at least that we’re not down to our very last breeding pair of dodo birds, save them today or no more dodos ever!
This has nothing to do with superforecasters. It is a study from 24 years ago that EY cited in a post where he argues in favor of total utilitarianism.
On the other hand, 1000 birds may be enough to bring the species back from the brink pretty easily, while a population of 100 turns out to be a hopeless money pit (for comparison, it has cost $35 million to turn 27 California condors into 425). The point is well-taken, though, there are too many extra factors here.
There are cases where biases are really just people running a deeper strategy that the ‘correct’ answer doesn’t take into account.
This isn’t one of them, and this should make you skeptical of your ability to tell the difference between right and wrong answers.
Why not?
I suppose I should be clear about what “this” is, since a lot of these references may be less clear if you don’t know the relevant literature.
The actual source Scott is pointing to is a 1992 experiment done by Desvousges et al., in which it was specified that there were 8.5 million migratory waterfowl, and that N of them died, and that it would be possible to prevent those deaths, but doing so is costly and the costs would be passed on to the consumers. So, how much are you, as a consumer, willing to pay to prevent the N deaths?
The three cases were 2k, 20k, and 200k. The willingness to pay was statistically indistinguishable between the three groups.
The straightforward interpretation of this result, that lines up with other experiments, is that people are imagining a single bird, deciding how much sympathy they have towards that one bird, and choosing a price accordingly. They are not making use of the number of birds, or of the fraction of all birds that this represents.
I’m not making the stronger claim that people have to value the number of birds linearly. It seems to me like a log scaling might make sense there.
(One could point out that people are comparing log(8.5e6-2k), which is roughly the same as log(8.5e6-200k), and so both are similar, but this is both wrong mathematically (the larger the base you’re subtracting off of, the less the log matters and the closer the marginal value is to linear) and psychologically (that’s inconsistent with the results of other experiments and the self-reports of the people doing the experiment).)
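A quick numerical check of that parenthetical, assuming the 8.5 million waterfowl from the Desvousges et al. setup and a hypothetical log utility over the number of birds alive:

```python
import math

TOTAL_BIRDS = 8_500_000

def log_utility_gain(n_saved):
    """Utility gain from saving n_saved birds if you value log(birds alive)."""
    return math.log(TOTAL_BIRDS) - math.log(TOTAL_BIRDS - n_saved)

for n in (2_000, 20_000, 200_000):
    print(n, round(log_utility_gain(n), 6))
# 2000    0.000235
# 20000   0.002356
# 200000  0.02381
# Even with log utility the three cases differ by roughly a factor of 100,
# so diminishing returns alone can't explain a flat willingness to pay.
```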
“A bunch of cute birdies are going to die. Their species will not go extinct because of it, but (as you might guess from the fact that I am talking to you about this) people are going to talk about these cute birdies if they die. Expect to see fifteen minutes of very sad footage of dead cute birdies on your television, computer, or phone, spread out over a couple of weeks in little bits you won’t be able to anticipate or avoid.
How much do you think we should pay to make sure you don’t have to watch sad videos of dead cute birdies for fifteen minutes, and feel sad afterwards because you didn’t help?”
Why would anyone expect the answer to be more than vaguely correlated with the number of birds?
I signed up for the Good Judgment Project. It’s hard to say how well I will do considering I skipped many of the more obscure questions. There are many questions on foreign elections and treaties I haven’t heard of (and I would consider myself reasonably well-informed).
I’m surprised the precision of the forecasts affected their accuracy.
Me too! I always associate the person who says “82% chance of this happening” with an annoying pretentious person who watches too much Star Trek.
I originally thought maybe these people were better not because that level of precision helped but because this signified they were using math (eg maybe some complicated process that spits out 82 – nobody would just start by guessing that). But the rounding experiment seems to show that’s not true. Wow. My hat is off to people who can be that precise.
Do you know if the data is publicly available somewhere? The whole thing just screams “implausible” to me. How many predictions did each person make, anyway? Wouldn’t you need dozens or hundreds of predictions to detect the difference between 80% and 82%?
Not necessarily. I don’t know how predictions are actually rated, but intuitively, I would measure the difference between the predicted value and the actual value, sort of like golf scoring.
In retrospect, an event has a probability of either 100% or 0%, depending on whether it happened or not. Take the difference between the predicted and the actual value, maybe square that number to more heavily punish strong mispredictions. Assign that value as penalty points, and the person with the lowest penalty score wins. Then repeat that same scoring except with the prediction values rounded, and compare the scores.
Regardless, dozens or hundreds of predictions is not implausible, especially when you do them as part of a study/program, i.e. when it becomes your job for a while. People like Scott make dozens of such predictions each year on their own time.
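A later comment notes that GJP used Brier scoring, which is essentially the squared-penalty scheme described here. As a minimal sketch of what rounding costs under that rule, here is the expected penalty for a single question, assuming the forecaster’s unrounded number matches the true probability:

```python
def expected_brier(forecast, true_prob):
    """Expected Brier penalty for one binary question:
    E[(forecast - outcome)^2] = (forecast - true_prob)^2 + true_prob * (1 - true_prob).
    The second term is irreducible noise; any gap from the truth adds its square."""
    return (forecast - true_prob) ** 2 + true_prob * (1 - true_prob)

true_prob = 0.82
print(expected_brier(0.82, true_prob))  # 0.1476  report the truth
print(expected_brier(0.80, true_prob))  # 0.1480  rounded to the nearest 5%
print(expected_brier(0.50, true_prob))  # 0.2500  a flat "don't know"
```

The per-question cost of rounding is tiny (0.0004 here), which is consistent with the next comments: you need a lot of resolved questions before a 2% difference shows up reliably.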
I don’t believe Scott makes enough predictions to detect the difference between 80% and 82%. This is exactly what I’m talking about.
If the real probability of an event is 82%, and you’re predicting 80%, you’re only wrong on an extra 1 event out of 50. So at the very least, we’re talking 50 events to even have a chance of seeing the difference. But if we want to detect this difference with confidence, we need more – maybe 200 events or something. And if we’re rounding 82% to 80% but 58% to 60%, the error partially cancels out, so we’d need *even more* events.
anon85, the errors don’t cancel out like that. The magnitude of the errors is added, not the errors themselves. 2% too high +2% too low does not equal no error.
For an example. Suppose someone predicted that dimes have a 60% chance of landing on heads when flipped. Pennies have a 40% chance of landing on heads. Those predictions don’t cancel out and make the set of predictions as good as predicting 50% for both.
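A quick check of the coin example with the same expected-Brier arithmetic (assuming both coins really are fair):

```python
def expected_brier(forecast, true_prob):
    # Expected (forecast - outcome)^2 for one binary event.
    return (forecast - true_prob) ** 2 + true_prob * (1 - true_prob)

# Predicting 60% heads for dimes and 40% heads for pennies, when both are 50/50:
print((expected_brier(0.60, 0.5) + expected_brier(0.40, 0.5)) / 2)  # 0.26
# Predicting 50% for both:
print((expected_brier(0.50, 0.5) + expected_brier(0.50, 0.5)) / 2)  # 0.25
# The +10% and -10% errors each add their own squared penalty; they don't cancel.
```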
It might be because they’re multiplying probabilities in necessary causal sequences. Even if the individual numbers are “round”, the result will be non-round.
War between the Koreas (~4%) = incident (75%) * declaration issued (5%).
(Don’t read too much into my prediction here. 😉 )
Maybe it’s my mathy background, but I can imagine some superforecaster predicting between 4/5 and 5/6, translating that to 80% to 83.3%, and coming out with 82%, which ends up more accurate than either rounded figure. No need for anything more complicated than that.
It is worth noting that around the 80s and 90s (%), rounding can make a very large difference when you think in terms of log odds or bits. E.g. many people would round 85% to 90%, or might feel a bit more than 90% confident and call that 95%, or a bit more than 95% confident and call it 99%. The last one is particularly bad.
Obviously this is all mirrored at the low end.
They used Brier scoring, not log.
Making small frequent updates increases both accuracy and apparent precision.
I’ve read that this is the explanation.
Ultimately I think this doesn’t show that they had great precision per se, it just shows that if you are genuinely using the whole range of probabilities without unconscious bias towards round or frequently-used numbers, rounding is going to hurt.
Seen on MR and relevant: When Negotiating a Price, Never Bid With a Round Number
Yes; I realized that, but it’s for a different purpose.
I comment constantly on the Marginal Revolution, but Tyler keeps deleting my comments since I used a bad word a few times over two months ago.
What sort of bad word?
If doing the probability calculations (even Fermi estimate style) is the best you can do, then rounding them off will tend to give you less accurate numbers.
One reason for hyper-precise estimating is that you get credit for your prediction on each day, so it pays to boost your percentage frequently as time is running out. Say with a month to go you say there is an 80% chance X won’t happen. Then 3 more days go by with nothing interesting happening and no new news, so there are only 27 days left in which X could happen. So you boost your estimate of X not happening from 80% to 82%.
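The arithmetic behind that boost, under the rough assumption that the original 20% risk is spread evenly across the 30 days:

```python
p_happens_in_30_days = 0.20
days_total, days_elapsed = 30, 3

# Simple version: scale the remaining risk by the fraction of days left.
p_remaining = p_happens_in_30_days * (days_total - days_elapsed) / days_total
print(1 - p_remaining)                    # 0.82: "won't happen" drifts from 80% to 82%

# Slightly more careful: condition on nothing having happened in the first 3 days.
p_first_3 = p_happens_in_30_days * days_elapsed / days_total
print(1 - p_remaining / (1 - p_first_3))  # ~0.816, close to the same answer
```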
80% leaves a 1-in-5 chance, whereas 83% leaves a 1-in-6 chance. If an event happens once every 6 years, then it makes sense that rounding that to once every 5 years would hurt your accuracy.
Eliezer often refuses to give precise numerical probability estimates. I think he’s wrong about this, and to the extent this view is common in LW/CFAR circles this might be an area for improvement.
I wonder whether most of this effect concerned probability estimates near to 0% or 100%, where rounding to the nearest 5% can be a really big change.
(I think the right thing to look at is the log odds. Rounding 48% to 50% changes that from log 48/52 ~= -0.08 to zero, whereas rounding 3% to 5% changes it from log 3/97 ~= -3.48 to log 5/95 ~= -2.94, a change more than 6x bigger.)
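A quick check of those numbers (natural log):

```python
import math

def log_odds(p):
    return math.log(p / (1 - p))

print(abs(log_odds(0.48) - log_odds(0.50)))  # ~0.08: rounding near 50% barely moves the log odds
print(abs(log_odds(0.03) - log_odds(0.05)))  # ~0.53: the same 2-point rounding near the tail moves them ~6x as much
```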
They used Brier scoring, not log.
To anyone with experience playing poker, the link between granularity of estimates and accuracy will seem very obvious.
e.g. the expected value of any decision in poker depends upon how your hole cards compare to the range of hole cards that a villain may hold. Estimating that opposing range is everything—a 3-5% swing will be the difference between winning and losing money in the long run. Good players are thinking something like “he could have any 2 pair, all broadway cards, any 2 suited cards with an Ace, suited connectors above 67s, suited 2-gappers above 9Js.” Putting villains on this range involves first starting with the outside view: “what position is the villain in relation to the button? what does the median villain do at these limits?” then considering the inside view, e.g. “specific reads about villains’ tendencies. Prior betting action by the villain.” Conclusions about both the outside and inside views involve a Bayesian kind of updating, in the former case as the macro game evolves, in the latter case as you learn more about a specific villain.
In fact, nearly all of the differentiating characteristics of the superforecasters are prerequisites to making money at professional poker. When I read the book, I couldn’t shake the feeling that he was abstracting the poker mindset for use in predicting geopolitical affairs.
I think both sets of skills boil down to making good decisions under uncertainty.
I and at least one of the other supers had an extensive poker background. Several others had dabbled profitably at lower stakes.
Nate Silver made a nice living as a professional poker player fleecing fish during the Poker/Housing Bubble of 2004-2006. In late 2006, all the fish suddenly disappeared and the only poker players left at the Mirage in Las Vegas were pros. So he lost a lot of money in early 2007 and then quit to get into the election forecasting business. The funny thing is that Silver never noticed that the poker bubble in Las Vegas was a side effect of the housing bubble in Nevada, Arizona, and California, and thus the poker bubble popped at exactly the same time as the subprime bubble. If he’d noticed the connection, he could have gotten in on the Big Short, but he still hasn’t noticed it:
http://takimag.com/article/silver_cashes_in_steve_sailer/print#axzz3zMdUB3rR
Aim small, miss small.
Which is to say – why is it surprising that someone who approaches a problem at a more precise level tends to be more accurate? Would it surprise you if a shooter who aimed at the center of a cut-out target hit the target more often than the individual who aimed at the target as a whole?
The Cognitive Reflection Test may be confounded by childhood mathiness, as questions like those are quite common in math contests. If you ran into them there, you don’t need reflective capability to avoid the traps.
Yeah, good point. I’ve seen all three before.
Is there an updated version of the cognitive reflection test? I have seen it in 20 different places; it doesn’t seem that hard to come up with new questions that actually test system 2 and not whether you lucked into seeing the test before.
Wouldn’t it be confounded just by knowing that you’re taking a cognitive reflection test? If someone tells you beforehand that the questions are tricky, you’re already primed to use System 2.
Of course, you would never tell people you’re asking trick questions.
That’s the problem with the current 3 questions: even people who have no clue what “system 2” is are very likely to have seen them before with a vague impression that these are trick questions.
Right, so that means it’s permanently useless as an evaluation of anyone who has read about this kind of thing before.
I suppose you could get around that limitation by presenting something like a 30 question test where only three of them are tricky questions with obvious false answers.
Increasing familiarity with testing is probably one cause of the Flynn Effect.
But that could equally indicate that you’ve read a lot of psychology/behavioural economics books and blogs and what not, such as those of Tetlock. I’ve noticed that there are a handful of experiments or studies that seem to be repeated again and again in this kind of book; I’m sure I’ve read about Asch’s conformity or the bystander effect 20 times across various pieces of literature.
I would also imagine that knowing the system 1/system 2 reasoning behind the question would help a lot more in answering future questions of that style than simply seeing one such question and having a wrong answer.
Something seems off about the numbers. 2,800 people agreed to predict things? And they agreed to predict enough things, repeatedly over enough years, to be able to find the top 2% of predictors without being subject to selection bias (e.g. “the dice that rolled the most 6s are the best at rolling 6s”)? And not only that, but it was possible to do stats on the IQ and knowledgeability of these people, and subject them to non-standard tests, all while remaining statistically significant?
It seems suspicious, in the “too good to be true” kind of way.
They did pay the forecasters a little for their time. And plenty of people will do plenty of stuff to win competitions. Doesn’t seem implausible to me.
This was a government project funded by a group affiliated with the CIA and run through multiple universities. When you’ve got that level of resources, sure.
I participated in all GJP seasons. Let me point out that we got paid a decent amount for the work involved – my ledger says I was paid a total of $922. We also had to fill out the IQ and political knowledge surveys at the start of our first season, which is where that comes from. This was a big contest, and 2k people is not too implausible. As for the non-standard tests… well, I don’t think it’s much worse than any other study you might be reading which wasn’t pre-registered.
Tetlock already had a fairly high-status reputation due to his book Expert Political Judgment. An obscure professor probably couldn’t have done as well.
I participated in the GJP for two seasons on two different teams while I was working the nightshift as a 911 operator (between calls).
I didn’t take any IQ tests, but I did take abridged versions of the cognitive reflection test as well as answered general knowledge questions (I think I was also asked for my SAT score at one point). If my experience was typical, I don’t see any reason why they couldn’t learn all kinds of interesting things from it.
The contest was well publicized ahead of time on the highbrow public affairs and forecasting blogs. Tetlock is a big name in certain circles, so there was much interest in it among the kind of people who might do well on it.
I’m familiar with the standard foxes/hedgehogs story, but here’s a new angle that’s just occurred to me: what if part of the reason foxes do better than hedgehogs is because they’re imitating the chimps, because reasoning by using many diverse sources of information somewhat resembles random guessing? In other words, maybe when we praise foxes we are overstating the benefits of having lots of eclectic knowledge, and in truth the benefits of foxiness come about in an accidental triumph of randomness over stupidity.
Also, for anyone who wants to look at more information related to this, there’s a relevant recorded discussion group on Edge’s website. Tetlock, Kahneman, and a whole bunch of friends and industry specialists attended. There’s a typed out transcript in addition to video, for those who prefer to read.
One thing in that discussion that stood out to me is that the “one weird trick” all superforecasters use is looking for comparable events in history with known base rates. I haven’t read the book, so I don’t know whether or not it’s mentioned in there also. I like this piece of advice because it’s simple and highly actionable. It’s easy to apply it to specific things, and when I do so my thoughts almost automatically start being detailed and productive.
If the benefits of foxiness are purely from randomness, why do the foxes repeatedly outperform random guessing? And why do good predictors remain good predictors over time rather than regressing to the mean?
I wasn’t saying *purely* from randomness. I suppose my point is that comparing their performance to hedgehogs is a low bar in some ways that’s easy to get overexcited about, at least for me.
A more likely alternative explanation is that most people aren’t utilitarians and certainly not linear utilitarians, so those responses indicate nothing whatsoever about their “estimate of the value of bird life” or even that they have such a thing that applies per bird at all.
Also, as Karsten points out above, the fact that the number 100 or 1000 is even used in the question normally communicates information about the situation, making it legitimate to give “inconsistent” replies. It doesn’t communicate this information if the number is specifically chosen to test people’s responses to different numbers, but they don’t know you’re doing that and won’t reply on that basis.
Do you predict that the same groups of people who get the Kahneman-approved right answer on the bird problem will get the Tetlock-approved right answer on the “does Korea go to war in 5 years vs. 15 years” problem?
What is the Kahneman-approved right answer on the bird problem? Saving the lives of birds has no objective value. It’s a cause you can donate money to, or agitate for other people to donate money to, in order to signal that you are The Right Sort Of Person. And with that in mind, it seems quite reasonable that the amount of money people would wish to donate would be determined much more strongly by how much money they have than by “the value of saving one bird”.
“Saving the lives of birds has no objective value. ”
Well, it might. As witness the horrors of when Communist China went after the sparrows. But that’s hard to judge, likely to be small with so small a group, and may be negative.
The amount of money you have (or the amount of money you are asked to imagine you have) is a constant. The variable is the number of birds.
Something fishy is going on if the amount of money is negatively correlated with the quantity of birds saved.
Like maybe you don’t give a shit about birds (raises hand). But if I would only give a nickel to save 100,000 birds, it doesn’t make sense for me to say I would give a dime to save 100.
I’m being deliberately unhelpful here, but maybe people’s beliefs about the quality of a program are correlated with its cost effectiveness? If they see something that promises to save 100,000 birds for a penny, they might implicitly recognize it as too good to be true. Yay confounders.
@ Michael Watts
Saving the lives of birds has no objective value.
Objective moral value… is rather a quaint notion, anyway.
If you are asked whether North Korea will go to war in X years, the X gives you information in the same way that being asked how much you’d pay to save X birds does (so long as you are unaware that the X is being put there solely to test your reaction to X). So that part would apply to both cases and I would expect some correlation.
Obviously whether someone is utilitarian wouldn’t affect the North Korea answer. However, if most people are not utilitarian, and if whatever error is made in the North Korea case is common, the answer to your question as written is yes because people who have extremely common trait X are likely to have extremely common trait Y, regardless of whether they are correlated. If I steelman your question as “do you expect correlation”, I of course wouldn’t expect correlation due to this reason.
> If you are asked whether North Korea will go to war in X years, the X gives you information in the same way that being asked how much you’d pay to save X birds does (so long as you are unaware that the X is being put there solely to test your reaction to X).
How? The birds question implies that there ARE that many birds. With NK and SK, the existence of that many years was not, presumably, in question.
Suppose I spun a wheel, labeled 1 to 200, and the wheel came up X. Then I asked you how many countries are in Africa.
Would the value of X affect answers given by people? Should it?
In actual discourse, people use numbers in their questions because the numbers are relevant, and in such cases the fact that the number was used provides information. You can, of course, cheat this by deliberately picking the number for irrelevant reasons, in which case it provides no information. And that’s what you did by spinning the wheel.
Jiro: …is that a yes or a no to my second question? You seem to have correctly ascertained that random numbers do not have a bearing on the number of countries in Africa, but the first sentence seems to suggest that I am somehow wrong in putting that piece of information next to my question, because the proximity will trick someone into believing they are relevant.
Which, sure, it will, but that’s a demerit on the person answering the question that they’re unable to handle that case.
If you picked a number randomly, and you asked a question using the number, and the person answering wasn’t told that you picked the number randomly, he would probably give a different answer for a different number and would be justified in doing so. Numbers in such questions normally carry information. They don’t carry information in the unusual case where they were picked randomly, but he doesn’t, after all, know that you did that.
(This scenario is not what you literally described, since that doesn’t try to connect the number to the question. If you did what you literally described, the number would obviously carry no information.)
[This is Jiro, an errant cut/paste killed the user name]
Jiro: the experiment is done by spinning the wheel in front of the person. The point is that people by default don’t seem to have the ability to consciously zero out the effect of the number, or the knowledge that they should do so.
The bird question is kind of ill-defined because unlike the probability of Korea going to war, it depends on a value judgement, namely how much a species and an individual animal is worth. These numbers will either have to be pulled out of thin air by each individual participant, or else estimated based on the amount of funding similar bird-saving initiatives receive in the real world. I think the relevant information is hard to find and interpret in this case, again unlike the Korea example.
This introduces a huge amount of noise into the picture, so I would expect both groups to do equally badly.
I would. Besides scope insensitivity, both examples also have anchoring bias in common (why is 5 years or 100 birds an important number? it isn’t) and can be avoided just by thinking in terms of standard rates ($/bird, wars/year). I feel like converting everything to base rates is an either/or skill that, once you figure it out, makes you a better predictor on a wide range of topics.
I’d guess there’s a correlation, and the correlation is this: Treating the numbers as important.
Some people will look at the problem of birds and read it entirely as “Save the birds.” Likewise, some people will look at the problem of a war between Koreas as “Does Korea go to war.”
Which is to say, some of the people are answering a different question than is being asked. (I think schools train students to do this by inserting red-herring numbers into word problems, training people to think of the question first, and the numbers second.)
If schools have an effect, it is likely that they train people to assume the numbers are relevant, not irrelevant. (See Who Is The Bus Driver? as an example.)
The simplest, and thus default, explanation is that people are conserving on thought and replacing a harder question with an easier question.
I think most people would have no idea how much it costs to run a bird sanctuary or save a habitat, so $10,000 seems like a reasonable sum – high enough to get stuff done, not so high that it’s arguable you’d be better spending it on the homeless or cute cancer kids or something.
So it’s not necessarily “people in group X think 100 birds are worth more per bird, people in group Y think 1,000 birds are worth less per bird”, it’s the amount they think reasonable if it’s coming out of the public purse.
Alternatively, it’s a number high enough not to offend the bird-savers, but low enough not to offend people who can think of other ways to spend the money.
It probably depends how your question is worded, too: “These are the last 100 Greater Purple Wattled Grabblenecks in the world, how much do you think is reasonable to spend on saving them?” might induce people to suggest a higher figure than “How much do you think is reasonable to spend on saving 100 Greater Purple Wattled Grabblenecks?”
It seems strange to me that anyone would be a “linear utilitarian” about birds. How are birds different from apples? I consume apples by eating them, I “consume” birds by feeling good about them being out there, watching them in the zoo, etc. If my taste for apples is about average (mine isn’t, I don’t like apples very much) and I get 10 apples a month, I am going to be less willing to pay for an extra apple than if I only get 1 apple a month. If there are 1000 birds in total on average at any point in time, I am going to pay quite a lot to prevent a bird genocide (saving 1000 birds), but if there are 10 million birds, I might not be interested in paying much to prevent the average population being reduced by 10 thousand birds. I might even favour that.
You miss the point. It’s not about saving or not saving the birds, it’s about consistency between the answers. Utilitarianism is completely irrelevant.
But it is entirely consistent not to want to spend more money on extra birds when those extra birds bring you virtually no extra utility. I think that the phenomenon these questions try to explore is probably real but the questions are not formulated all that well. The most reasonable answer is neither “my value of one bird’s life times number of saved birds” or a number which my gut tells me represents “how much I care about birds” but that I cannot give a meaningful answer without additional information. Without knowing how many birds are out there, I don’t know how much I value saving the marginal bird and so I cannot give a sensible answer.
“Shut up and do the math.” Many folks in the effective altruism circles seem to follow a principle that we should value the lives directly, rather than indirectly through our own personal utility. It is a crazy alien idea, so far as I am concerned. And basically makes them paperclip maximizers who I cannot trust to act human.
I completely agree with you on the craziness of valuing the lives terminally rather than instrumentally.
But this egoism is no better rationalization of the finding in regard to birds. If you don’t want to give any money to save birds, that’s one thing. But if you want to give a certain amount to save 100, it is very peculiar that you would give the same or lesser amount to save 100,000.
I mean, we’re not talking about saving your pet parrot vs. 99,999 other birds. We’re talking about 100 abstract birds vs. 100,000 abstract birds.
The findings indicate either irrational scope insensitivity (the good explanation), or very implausible preferences (the bad explanation).
It doesn’t matter how much you value birds. The point is that whatever that value is, the value of saving 100,000 of them should not be less than the value of saving 100, or even equal, except possibly at the zero bound.
Saving 100000 birds is a larger number than saving 100 birds. However, it may not be a larger percentage of the bird population, and I might gain constant utility from saving a fixed percentage of the bird population rather than from saving a fixed number of birds. Furthermore, someone asking me a question about saving birds will normally name a number that roughly scales with the size of the population.
From that I would conclude that I should pay a constant amount for saving X birds, no matter what X is, under the assumption that the speaker is acting like speakers normally do (and did not, for instance, pick X randomly or specifically choose it to test people’s reaction to different numbers).
(Note that “I assume that they are asking me how many birds are needed to keep the species from going extinct” is just a subcase of this.)
@ Jiro:
Yes, you can fight the hypothetical.
If you’re arguing, “Is there some convoluted way I could rationalize this?” then the answer is always yes.
The question does not say or imply anything about percentages. The people who are being asked know it’s one of these types of artificial scenarios.
If you throw in weird assumptions about implicit premises the questioner didn’t tell you, you can get a result like this. That’s why you’re almost always told not to read anything into the question. But would everyone have the same weird assumptions such as to produce a consistent result?
Your hypothesis here is that a) people get utility from saving fixed percentages of bird populations, b) everyone knows that when someone gives you a figure of the number of animals you can save by donating to a charity, it’s a fixed percentage of the population, and c) the results are explained by the respondents reading these premises into the question. That’s not completely impossible, I guess, but I find it unlikely. And the way you detect this sort of thing is by asking people to explain their answers to detect misunderstandings of the question.
Let me apply this sort of analysis to the trolley problem.
“Hmm, well, in real life no morally sane person would ask me whether I wanted to let a trolley run over five innocent people instead of one, so in the example the five must be assholes who had it coming, and the question is like a test to see if I’ll realize this. Therefore…yep, I choose not to press the button.”
Someone could think that, I guess.
Edit: this probably came off as meaner than I intended. That last paragraph was intended in a joking manner. I just think this whole discussion is silly. Clearly scope insensitivity is a real phenomenon, and I think these results are much more likely attributable to it than to other strange implicit premises.
What I am describing is a line of reasoning to rationally get the answer you consider irrational. Unless your hypothetical is “suppose you don’t reason that way”, that isn’t fighting the hypothetical.
It’s unlikely that anyone would explicitly spell it out in so many words, but the fact that someone cannot spell it out doesn’t mean that’s not what their reasoning is. (Honestly, how many people would even know what you mean when you ask if their utility is linear in number of birds?)
In real life, trolley problems involving literal trolleys don’t happen at all. So nobody would deduce anything about some aspect of the real life situation, such as the victims being assholes, because there is no real life situation. Saving birds does happen in real life.
And even ignoring that, suppose that reasoning that way in trolley problems was common. All that that shows is that the trolley problem has the same flaw as the bird problem. I don’t hold “trolley problems are flawless” as a premise, after all.
@Vox:
I think that if someone asks me about saving 100 birds or 1000 birds, it is quite natural to think that both numbers probably mean something like “save that species of birds from extinction”. Given that most people do not care whether there are 100 birds or 1000 birds but do care about that species existing or not existing, it should not come as such a surprise that they name more or less the same figure in both cases, because they get more or less the same result in both. What you buy is the continued existence of a species. There are probably people who care about the lives of individual birds, but there are not that many of them.
But there is a simple way to find out whether that is correct or not – ask the same people about saved lives of people. I’m pretty sure that most people think about humans differently than they do about animals, and it is also obvious that the question is not about a species’ extinction. If they are again willing to pay more or less the same for saving 100 people (let’s say it is always people they don’t know, they’re even from a different continent, and you tell them that) and 1000 people, then the interpretation from the article is probably correct. If they give a different answer, then it is not.
I suspect that the issue here is that the original bird question was asked in relation to the Exxon Valdez accident and framed so that all the questions, taken in context, can be reduced to
“how much money should we spend on saving one bird species”
Asking the exact same question to three different groups should normally yield roughly the same answer, independent of exact wording.
This does not mean that scope insensitivity isn’t real, only that using birds affected by one particular oil spill is perhaps not the optimal way of measuring it.
Performing at chance because you’re braindead is one thing, but for hedgehogs to perform *below* chance it seems like there has to be something weird going on. Does he discuss why this happens?
If possible options aren’t equally probable, always choosing the same one is likely to lead to performance below random, because you’re quite unlikely to have chosen one of the options that have higher probabilities than most others.
I think.
So there is one big truth to the universe, but since there are many potential big truths the median hedgehog will not have found it and so will always be wrong? But you’d think that the one hedgehog who did find the right big truth would bring up the score enough to equal chance again.
The zealots/hedgehogs believe strange things, but they don’t believe random things. More likely, they believe things they would like to be true, and proceed from there. If, say, you’re convinced Communism is the salvation of mankind, it will skew your perceptions. As it turns out, Communism likely isn’t the salvation of mankind, so that’s one wrong thing you believe, upon which you base your modeling.
Am I making sense?
Further:
Suppose the question is, “will adopting Xism be beneficial to the nation of Y?” – what’s the guarantee that the proportion of Xists and anti-Xists in the sample matches the probability that the statement is true?
Yes, if you’re convinced of some nonsense proposition, it will skew your perceptions and be a wrong thing to base your modeling on. But to do worse than chance it doesn’t have to be worse than ideal, it has to be worse than completely braindead.
Is your point that things that people want to believe are especially likely to be false, and this causes people who believe things to be worse than completely braindead?
My point is, I think, that people like to be consistent. If you have a deeply held belief about the nature of the world, you obviously also believe that this belief holds, and if it holds, then events in the world will be consistent with it.
The problem is not just that the belief may be false, but that it may be irrelevant – but they’re still going to pick a consistent option. Also, if you only have a hammer, every problem looks like a nail. Knowing just one thing and viewing everything through that lens will distort things.
Crap. I feel as though I have this intuitive understanding of what’s happening here, but when I try to put it into words, only garbage comes out.
I think you’re missing my point. You don’t merely have to argue that a hammer is the wrong tool for most jobs, you have to argue that a hammer is a worse-than-random tool. But if a hammer is a worse than average tool, then something like a wrench must be better than average. So then do more people use hammers than wrenches? Why?
Because this isn’t a hammer-or-wrench question. This is people thinking a hammer is the be-all, end-all tool for every task, and there’s a whole tool shed of different tools. The chance that a random task happens to be one where hammers excel is small, and hammers aren’t particularly universal tools either. A random choice of tool to a random task at least has the chance of selecting something like a multitool that works on multiple different problems.
I think the issue is that the sample of people in question did not, on average, happen to hold useful ideological viewpoints compared to the questions asked.
Yeah but to do worse than random on a large average it’s not enough to always select a particular tool, you have to have people systematically selecting a particularly unsuited tool.
What Tetlock showed the first time around is that pundits were operating without feedback– no one was keeping track of their errors. This would increase the odds of them having theories which felt good to themselves and their audiences, but which had little or no predictive power.
Also, it can take time for a trend to make itself clear, and meanwhile, it’s easy to say that the trend would be what you want it to be if people only did things differently or tried harder.
I suspect that there is a correlation between theories that sound nice and theories that don’t predict well. A forecaster whose theories predict well may have difficulty getting enough of an audience to count as a pundit.
I don’t know what, in this context, “worse than chance” means. To define chance outcomes you need a probability distribution.
What’s the chance distribution for how soon the Koreas go to war?
@ David Friedman:
I am not sure on this, either, but I think someone explained that, at the end of the time period, they rate the probability as either 0% (if it didn’t happen) or 100% (if it did), and then they measure how close your guess was to that: you score points equal to the probability you assigned to what actually happened. So if you predict a 70% chance that the two Koreas go to war and they don’t, you gain 30 points, and if they do, you gain 70 points.
This makes sense because if you were an omniscient, ideal predictor, you would only predict either 100% or 0%, assuming the events are actually determinate.
Of course, this would be terrible to use for just one question, but averaged out on lots of questions it works.
If that’s true, that would also mean guessing 50% on every question comes out the same as alternating between 0% and 100%. Doing the latter sounds like metagaming it to me, however.
I don’t know if this is right, though.
Why are you assuming that any of the hedgehogs happen to find the correct big truth?
Or, indeed, that the hedgehogs’ big ideas are themselves randomly distributed? It’s entirely possible that there’s a bias to total population of hedgehogs’ beliefs, even if it’s just something like “favors round numbers” or the related “favors simple answers”.
Suppose, out of a range of 100 values, the correct one is 63. That’s a weird number. It’s neither very close to any extreme nor to the middle. It’s deliberately chosen not to be round, either in a conventional sense or in a “really easily computed” sense like 2/3 or 1/4. It is, however, both
A) The nearest rounded value to “halfway between 3/5 and 2/3”
B) One of two acceptable roundings to the range of permitted values for “5/8”.
In other words, it’s actually not an unreasonable number even if you’re working with things that we only have really low-precision estimates about, or if you’re doing pretty off-the-cuff sorts of math. You can easily find “weirder” numbers, in terms of trying to produce plausible things that compute to them, but many of them are actually closer to “big round numbers” too. “60%” is a reasonable hedgehog-y estimate for “a bit more than half” or “a bit less than 2/3”, but that still leaves a huge margin for error. An awful lot of people are just going to say “about half” or “about 2/3” or even “about 60%” without applying the *only slightly further step* of getting “between 2/3 and 3/5” or “about 5/8”, just because those are slightly more complicated answers.
If the evaluation function punishes overconfidence (e.g. score = (correct answer – your answer)^2), that might explain it.
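For a concrete sketch of why that matters, using the multi-category Brier score described further down the thread and made-up numbers, here is roughly how a confident hedgehog stacks up against a uniform “chimp” on a three-option question:

```python
import numpy as np

def brier(forecast, outcome_index):
    """Multi-category Brier score: sum of squared errors over all options (lower is better)."""
    forecast = np.asarray(forecast, dtype=float)
    actual = np.zeros_like(forecast)
    actual[outcome_index] = 1.0
    return float(np.sum((forecast - actual) ** 2))

uniform = [1/3, 1/3, 1/3]          # the "chimp" baseline
confident = [0.90, 0.05, 0.05]     # a hedgehog-style forecast (illustrative numbers)

print(brier(uniform, 0))      # 0.667 regardless of which option occurs
print(brier(confident, 0))    # 0.015 when the favoured option occurs
print(brier(confident, 1))    # 1.715 when it doesn't

# Break-even hit rate: how often the confident pick must be right
# just to tie the chimp under a quadratic rule.
q = (brier(confident, 1) - brier(uniform, 0)) / (brier(confident, 1) - brier(confident, 0))
print(q)                      # ~0.62
```

So under a quadratic rule, a 90%-confident forecaster has to be right about 62% of the time on three-option questions just to match the chimp, which is one way “worse than chance” can happen without anyone being braindead.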
Indeed, I think it would be just as interesting to study the Unterforecasters and figure out what the heck was going wrong there.
Unless it turned out they just had low IQ, which would be boring. But I’m assuming that the group was already above the average IQ.
That’s a good question. One possible answer might be that the hedgehog approach emphasizes dramatic, counterintuitive predictions (because it’s selecting for a strong narrative as much as for truth), and those predictions are more likely than chance to be wrong.
The most likely predictions are usually along the lines of “things stay the same”; no revolution, modest gains in the market, the establishment candidate wins again. But that’s much less compelling than revolution, or a crash, or a political upset favoring the underdog.
That’s the best answer I’ve seen. If hedgehogs consistently make counter-intuitive predictions, I can see how they’d be consistently wrong. Think of all the book deals and speaking fees people earned from predicting the financial crisis. You’d never get a book deal from successfully predicting business as usual.
I think the economist who deserves credit for taking a fox approach to macroeconomic predictions is David Henderson. He disagreed with Krugman when Krugman was predicting runaway inflation in the early 1980s, and he bet against Bob Murphy when Bob was predicting runaway inflation after 2008. Krugman and Murphy are both smart hedgehogs, and Henderson was able to out-predict both of them by just projecting the status quo forward.
I didn’t think of this when I was writing the ancestor, but a data point in favor of my theory might be all those studies showing simple algorithmic models outperforming human experts. Not all those models boil down to simple regressions, but a lot of them do, and the ones that don’t will at least be measuring what they measure in a way that has nothing to do with narrative.
If hedgehogs consistently predict lower than chance, does that mean they can predict greater than chance if they just predict the inverse of whatever they think?
And does this cause them to explode in a puff of logic like a computer in Star Trek?
Sadly, most real-life situations are multipolar in a way that doesn’t allow for that. It’d work for binary choices like “which party is going to win the next American presidential election”, but I’d be astonished to find anyone performing worse than chance on that over long enough timescales.
I think the moral of the story is to make less-edgy predictions. Jehovah’s Witnesses have been predicting that “the end is nigh” almost every year for the past century. But the highlight of the week is more often something like “Dow Jones rose 37 points”. The AGI narrative used to be “We’ve found the Great Filter, the end is nigh.” Nowadays, it’s more like “It’s a big deal, but feedback effects don’t seem as strong as we thought.”
Incidentally, DAE remember the 2012 Mayan Apocalypse? The Y2k crash? Man Bear Pig? #NeverForget
The Y2K crash actually demonstrates a dynamic in prediction-making that deserves a little more scrutiny. Yes, there wasn’t a huge catastrophe, but in large part that was because people had predicted a huge catastrophe and had worked diligently to avert it. If no one had predicted the Y2K crash, it could easily have been real. If forecasters are actually respected and paid attention to, self-fulfilling/self-negating prophecy becomes a real concern.
@ Nornagest:
Well, that makes a lot more sense. I was still thinking in the context of these “Probability that X will happen or not in 3 months” kinds of questions. So people are, in fact, not worse than chance on those?
Still, I suppose by definition it does prove on those multipolar questions that they’re better off getting out the dartboard.
@suntzuanime:
I’ve seen the claim that Y2K wasn’t a catastrophe because people took precautions against it, but I’m suspicious. Surely there were some firms and even some countries that didn’t do anything much–did terrible things happen to them?
What’s the evidence that if people had not made a big deal about it, there would have been a catastrophe?
[Re Y2K.] Surely there were some firms and even some countries that didn’t do anything much–did terrible things happen to them?
Why would a problem have to be considered a catastrophic threat for qualified people to take it as a problem, and as deserving reasonable action? In this case, qualified programmers did notice the flaw, reasonable action was taken, there was no catastrophe — so this means there was never need for any action at all?
So non-qualified third parties did start a flap — as usual, Sturgeon’s Law.
“Your radiator has developed a leak. If you don’t replace it, X will fail and you’ll hit a school bus and start a catastrophic fire!”
The school bus and the fire are unlikely, but not impossible, and we’d still better replace the radiator.
How about, there was reasonable need; reasonable action was taken?
I was there in the mid-90s. I was already running into things that went 1998, 1999, 1900. Also, 1999 was a popular flag meaning “stop”. Later systems without these flaws could have been affected by bad data fed from the dinosaurs, even if no dinosaurs had crashed (which we have no guarantee of).
The company I worked for in 1999 had a revision control system that was not Y2K compatible. Exceptionally so; it not only didn’t work but corrupted its own proprietary database leaving all our source code inaccessible. We found this out when a system accidentally had a clock set ahead.
If we’d gone blindly into Y2K without trying to fix anything, I think it’s fair to say that there would have been very many incidents of this sort, and some more severe.
This is reminding me of a piece of investment advice I saw somewhere– when you’re considering an investment choice, ask yourself whether you’d still do it if the only people who’d know about it were you and your accountant.
My guess is reasoning with identity politics.
I think of intelligent pundits who always have to publicly spew party lines, regardless of the stupidity of such a thought.
Here’s a copy of his first book. See page 51.
http://emilkirkegaard.dk/en/wp-content/uploads/Philip_E._Tetlock_Expert_Political_Judgment_HowBookos.org_.pdf
Here are the strategies, along with their discrimination and (1 - calibration) scores (higher is better). Calibration indicates the divergence of predicted from observed probabilities; discrimination indicates the amount of observed variance predicted. Note that **all the questions were of the form “Will X go up, stay the same, or go down in the next Y time”**.
* 1) Chimp – every possibility gets the same probability. (0, ~0.99)
* 2) Restrictive, recent, and contemporary base rates – estimate answers by figuring out the base rate for changes from historical data, with varying time scales. (0, 0.96-0.999) (narrower time scales do better).
* 3) Case-specific strategies – assume the most recent trend for the question persists. (0.04, 0.985-0.995)
* 4) Average of human forecasters (experts + dilettantes). (0.02, 0.98)
* 5) Briefly instructed undergraduates. (<0.01, 0.94)
The experts on average did better than the chimp (assuming equal weighting of the two scores). They lost on calibration (they were wrong more often), but they did discriminate between outcomes better – they were more likely to give higher probabilities to things that happened than to things that didn't (whereas the chimps gave the same probabilities to everything). Undergraduates who don't know anything do substantially worse.
He then tried to split human forecasters by scores, looking for correlations with various things about them (my favorite correlation is that people regularly in contact with the media did substantially *worse*). He found that cognitive style (fox/hedgehog) had a big impact. You can see his fox/hedgehog quiz on page 74.
Hedgehogs – (0.02, 0.945-0.97).
Foxes – (0.03, 0.99).
So foxes do as well as or better than the chimp on calibration (accuracy of their responses) while substantially improving discrimination (they hedge less). Hedgehogs do substantially worse than the chimp on calibration while picking up only a bit of additional discrimination.
Also, check out the calibration curves on page 78. Foxes are almost perfectly calibrated. Expert hedgehogs (those with domain expertise) do worse than non-expert hedgehogs.
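For anyone who wants to poke at these numbers themselves, here is a minimal sketch of how calibration and discrimination indices of this kind are usually computed (the standard Murphy-style decomposition, grouping forecasts by stated probability; I’m assuming Tetlock’s indices are essentially this, so treat it as a sketch rather than his exact code):

```python
import numpy as np

def calibration_discrimination(forecasts, outcomes):
    """Calibration (CI) and discrimination (DI) indices, grouping forecasts
    that share the same stated probability (Murphy decomposition)."""
    forecasts = np.asarray(forecasts, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    n = len(forecasts)
    base_rate = outcomes.mean()
    ci = di = 0.0
    for f in np.unique(forecasts):
        mask = forecasts == f
        n_k = mask.sum()
        o_k = outcomes[mask].mean()          # observed frequency in this bin
        ci += n_k * (f - o_k) ** 2           # divergence of stated vs observed
        di += n_k * (o_k - base_rate) ** 2   # outcome variance "explained"
    return ci / n, di / n

outcomes = [1, 0, 0, 1, 0, 0, 1, 0, 0]

# A "chimp" that assigns 1/3 to everything discriminates nothing (DI = 0),
# but if the base rate really is about 1/3, its calibration looks fine.
chimp_f = [1/3] * 9
print(calibration_discrimination(chimp_f, outcomes))   # (0.0, 0.0)

# A forecaster who separates likely from unlikely cases picks up DI > 0.
fox_f = [0.8, 0.2, 0.2, 0.7, 0.1, 0.2, 0.9, 0.1, 0.3]
print(calibration_discrimination(fox_f, outcomes))
```

The chimp gets exactly zero discrimination because it never separates likely from unlikely outcomes, but its calibration can still look respectable whenever the base rate happens to sit near the probability it always states, which is why “the chimp” is a harder baseline than it sounds.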
Thanks for this. I think I understand the claim being made now and why people might gloss those results as being “worse than chance” when they really aren’t in an information-theoretic way.
Go not to the well-calibrated for counsel, for they will say both no and yes.
And in a meta comment that is not at all nice, but definitely true and necessary, I would like to publicly shame all the people in this subthread who speculated wildly on the answer to suntzuanime’s perfectly reasonable question when the correct answer was available with 20 minutes of research on the web.
It is fucking embarrassing, especially with all the posturing about how smart this community is and how we don’t need to learn anything from Tetlock.
Tetlock gives, as an example of a hedgehog, the supply-side economics pundit Larry Kudlow, who had some success a long time ago predicting that economic performance would negatively correlate with marginal tax rates. Cutting taxes is his hammer and everything looks like a nail to him. That’s his shtick, and he gets on TV for being a reliable voice for cutting taxes.
My suggestion (not Tetlock’s) for a right of center pundit fox would be Michael Barone, who isn’t closely associated with one idea, but knows a simply unbelievable amount of stuff about American politics.
I’m not quite sure what is meant with “the hedgehogs did worse than random”.
Obviously the hedgehogs are not just spitting out completely random answers: if you asked them “Will the next president of the US be: A) Sanders, B) Clinton, or C) Hitler”, you wouldn’t get equal amounts of responses for each answer. So even if you show that the hedgehogs answer worse than “chance” for a given list of questions, this is necessarily because of the framing of the questions by someone (presumably) smarter than them. Perhaps you mean that the hedgehogs did worse than a random sample of normal people?
That would have been my guess. And the implication would be that hedgehogs overemphasise their specific *thing* and ignore the surface-level data they share with the man on the street, even when the latter is more relevant. Fits with the finding (I feel like it might have been Tetlock again?) that if you wanted to know what was going to happen in, say, Syria, an expert on ‘the Middle East’ outperformed an expert on Syria specifically. Seems plausible that this is because the expert on Syria is over-weighting fine detail that only he knows about the precise situation whereas the Middle East expert is taking a more reliable outside view.
Plus ‘deep expertise in one area’ is hard to achieve without ending up somewhat partisan, and people often confuse what they think SHOULD happen and what they think WILL happen.
Yeah, it could well be that because the CIA analysts knew the deep data that they were basing their predictions on “Well, General X is high in favour at the moment and he has a realistic view of North Korea’s military capabilities so he will quash any war-mongering”, then General X trips in the bath or there’s a sudden purge because the Dear Leader woke up on the wrong side of the bed that morning and oops, now the guys who want to nuke South Korea are riding high?
Whereas the superforecasters knew nothing about General X and so left his influence out of their weighing of the probabilities, so when he fell into disfavour it made no difference to their predictions?
You’re correct that the participants choose from a list of possibilities. They assign a probability to each possibility, and then a Brier score is computed from these probabilities once the question is resolved. For instance, one of the questions on Good Judgment right now is “Who will be the Democratic nominee”, and the three options are Clinton, Sanders, and O’Malley. You assign a probability to each candidate, and your probabilities must add to 100. (Technically this is bad, because these three possibilities are not a partition of the probability space, but whatever.)
I don’t think Tetlock is ever specific about what “Worse than random” means, but the obvious answer is that the hedgehogs’ Brier score is worse than what they would get if they just assigned uniform probability to all possibilities on all questions. (He might also mean that they have Brier scores greater than 0.5, which would only be “worse than random” on binary questions).
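To make the “only on binary questions” point concrete: the Brier score of a uniform forecast depends on the number of options, so 0.5 is the random baseline only when there are exactly two of them. A quick sketch:

```python
def uniform_brier(k):
    """Multi-category Brier score of assigning probability 1/k to each of k options."""
    return (1 / k - 1) ** 2 + (k - 1) * (1 / k) ** 2   # simplifies to (k - 1) / k

for k in (2, 3, 5):
    print(k, uniform_brier(k))   # 0.5, 0.667, 0.8
```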
Slightly offtopic (but a fun mental exercise): What do you think the chances are of none of those three getting the democratic nomination?
There are several ways this could happen. All three could die, for example. Or be seriously implicated in something severe enough to bar them from the nomination. If something were to happen to both Clinton and Sanders, I could see someone else entering the race instead of the nomination going to O’Malley uncontested.
Another possibility is that there might not be a democratic nominee. The party could split, or the election process could be suspended. Something like a Yellowstone eruption, or a nuclear war with Russia, would probably take care of the latter.
All of those seem pretty long shots though. I’d put the aggregate probability around 1/1000, with the biggest contribution being “Both Clinton and Sanders die and someone else beats O’Malley”. Based on actuarial tables, Clinton and Sanders have a 2.9% and 3.4% chance of dying next year, respectively. Their health (and health care) is no doubt above average, but they are also at a much higher risk of being murdered (1.8% of US presidents are murdered every year).
Which tables did you use to give Clinton a 2.9% chance of dying? (The ones I found put it at 1.4% for her age/sex.)
(less chance for SES, imo, and more because of previous health issues, so I think it evens out.)
That’s a 100% difference, so I was wondering where you got your data.
I used the first link from google when I looked for ‘Actuarial tables United States’.
However I am apparently unable to correctly read data from a simple table, since I now see that it should be 1.3% for Hillary. I remember in an earlier version of my post I wrote that Clinton is 2 years older than Sanders, so I think I was mistakenly taking her age as 76 instead of 68. Whoops.
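Just to spell out the arithmetic for that biggest contribution, using the corrected 1.3% figure, the 3.4% for Sanders, and a made-up 75% chance that someone other than O’Malley wins if both are gone (treating the deaths as independent and ignoring that the window before the nomination is shorter than a year):

```python
p_clinton_dies = 0.013        # corrected actuarial figure from above
p_sanders_dies = 0.034
p_other_beats_omalley = 0.75  # hypothetical conditional probability, not from any source

p_scenario = p_clinton_dies * p_sanders_dies * p_other_beats_omalley
print(p_scenario)             # ~0.00033, i.e. roughly 1 in 3,000
```

So with the corrected table, that single scenario comes out nearer 1/3000 than 1/1000, before adding the other long shots.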
O’Malley dropped out of the race four days ago. Amusing that he was so irrelevant that even his exit went unnoticed.
Yes, but I’d assume the chances of him undropping if the other candidates die are pretty high.
I had noticed. But like the poster above me points out, that doesn’t mean he can’t win the nomination anymore.
What is P(O’Malley wins the nomination | Neither Clinton nor Sanders wins it)? I’d say it’s still pretty high, despite him having dropped out. Let’s pluck some numbers out of thin air and say 25%.
If Hillary and Sanders both bought it in a blimping accident or whatever, Biden would become the overwhelming favorite.
I think there is a significant chance of someone other than those three being nominated, probably several percent. Clinton could be eliminated either by dropping dead or by being indicted over the emails. Neither is likely, both possible.
Once Clinton is out, the Democratic establishment looks at Sanders, decides he is too extreme to win, and gets behind one of the potential candidates who ended up not running.
I think they’d court Bloomberg. (but if you’re keeping score, this prediction is conditional on Hillary being eliminated, because otherwise she gets it)
Well, Hillary, or rather most of her immediate entourage, have already been implicated. The real question is whether or not the Obama administration will actually prosecute, which I doubt they will.
The interesting thing will be to watch the DoJ and DoD for high profile departures between now and Jan 20th 2017.
The US intelligence community has just been seriously embarrassed by their disastrous declaration that there were weapons of mass destruction in Iraq.
I think that sentence needs a lot of unpacking, but let’s not get bogged down in politics.
As to superforecasters – might it be the “foxiness” is a summation of the collection of traits you mention – they’re a little bit better at maths and a little bit more well-read and a little bit smarter and a little bit ‘lucky’? They’re not outstanding on any one trait, but bundle all those traits together and suddenly it’s a different ball game?
I often think the difference (using Tetlock’s own terms) between rationalists and super forecasters is that the super forecasters are foxy. They try lots of little things, constantly update their estimates, get more information wherever they can.
Rationalists seem to have started from a similar place but then they build these big hedgehog systems and try to reason from the hedgehog systems.
An interesting claim, but seemingly contradicted by the high rate at which LW rationalists have come out as being among Tetlock’s superforecasters.
Selection bias + other confounders? How many less wrong rationalists aren’t superforecasters? I’ve actually been surprised at how few rationalists actually participate in things like the good judgement project, though I suspect it’s an age and lack of time to commit phenomenon.
Anyway, you’ve probably seen the same hedgehoggy tendencies of the community that I have: Hanson’s near-far views and X is not about X. Bayesian stats as a master system instead of one more tool in the stats toolbox. Political systems built off sweeping theories of history (“Cthulhu always swims left”). The best of the sequences have lots of useful little debiasing ideas (rationalist taboo and the like), but there is also a lot of hedgehogging in there.
The community definitely has some hedgehogs; the examples you cite are good ones. (Though it’s important to note that two of the three examples you give are fringe, minority views even within the rationalist community, and the Bayesian stuff is obscure enough that people just might not know there is a controversy.)
But I still suspect that it has fewer hedgehogs than most communities. How do you think the rationalist community compares to the average?
I cited Hanson’s near-far, Hanson’s X is not about X (although healthcare is not about healthcare is probably his most common) and both of these I think I’ve seen used by most of the community I interact with.
The Bayes supremacy stuff is foundational.
The one fringe example I cited was the political theory.
Rationalism is all about being more of a fox and less of a hedgehog. There are still rationalist hedgehogs, but they’d be even more hedgehog-y if they weren’t part of a community that encouraged them to constantly critique their own beliefs.
Hospitals are full of sick people, but (presumably) the hospital is making them less sick. By the same token, rationalism can be full of hedgehogs even though rationalism makes you more of a fox.
Those are inevitable traits of very smart people – they try to make explanations of things based on what they know. The relevant comparison would be rationalists versus group X (where X is a group with a similar number of highly intelligent people).
As I like to put it, human beings are equipped with very good pattern recognition software—so good that it can even find patterns that aren’t there.
Not surprising, given the difference in cost between spotting a large predator that isn’t there and failing to spot one that is.
I don’t think these are inevitable traits of very smart people, it’s more about cognitive style. Superforecasters, as described in the book, avoid making these big sweeping explanations and instead stick to much more narrow ones and are more effective for it.
“Still, I appreciated its work putting some of the rationality/cognitive-science literature into an easily accessible and more authoritative-sounding form. […] not necessarily too useful for people who are already familiar with the cognitive science/rationality tradition, but great for people who need a high-status and official-looking book to justify it.”
Haven’t read the book, but based on your review would it be correct to say that this book represents the first large-scale scientific study which demonstrates that all those rationality techniques actually work in practice?
I mean, sure, many of those techniques can be said to be just applied math, so they “have to” work. But the real world is not a universe of perfect spherical cows, so one could say that until now the rationalist community has been taking it on faith that those things are in fact useful under chaotic real-world circumstances. You yourself, Scott, have expressed some skepticism about that in the past. (OK, that was about the effect of rationality techniques on personal success, which is not the same thing as being able to predict geopolitical events, but there’s a similar “just because something works on paper doesn’t mean it’s very useful in practice” point.)
If a proper scientific study shows that people with above-average but not exceptional IQs, using only the kind of publicly available information that an ordinary well-read person would have, can significantly outscore CIA specialists with access to confidential data just by practicing proper epistemological self-discipline, I would say that’s a pretty big deal. (It would also be a good thing to tell the people you needed to talk out of a depression because they felt that without a >140 IQ they would never amount to anything.)
After all, aren’t rationalists supposed to value scientific confirmation of their hypotheses? Until now, the claim that techniques such as those described in the Sequences are useful not only for solving made-up puzzles but can also help an ordinary person to make sense of complex chaotic real-world situations like the war in Syria, was a hypothesis that sounded plausible but didn’t have a whole lot of evidence behind it. Anybody who would have claimed that it was “obviously true” and no evidence was needed, should perhaps try to be a bit more strict in applying their rationality to their own beliefs.
From your summary of Tetlock’s book, it looks like he has now provided that evidence. That sounds like it deserves somewhat higher praise than “yeah, it’s useful if you feel the need to justify our obviously true beliefs to others.”
Next step: a similarly large-scale study to see if people can learn to become superforecasters via e.g. CFAR seminars?
To add to my previous point: it is not too difficult to imagine the outcome of Tetlock’s research going in the other direction: he could have found that the most successful forecasters did not actually practice any LessWrong-style techniques such as Bayesianism, but that they were all very intuitive people who arrived at their successful predictions through some mysterious je-ne-sais-quoi which they were unable to explain. Like in Malcolm Gladwell’s Blink. While the people who attempted to deliberately apply CFAR-style techniques, ended up in the bottom half together with the chimps and the hedgehogs.
That didn’t happen, fortunately. But, if you’re honest with yourself, what would your prior for that outcome have been?
Techniques like Fermi estimation were well-pedigreed before, so I would have found it pretty implausible that LW-style reasoning ended up being worse than chance. But I agree that it’s nice to win contests.
One thing to keep in mind is that Tetlock’s contest wasn’t set up to reward out-of-the-blue intuitive forecasting of major shifts in the direction of history. Instead, it rewards people who have reasonable views on average on a whole bunch of things that will or will not happen over the next 12 months, and who will do the grinding work necessary to keep up to date on, say, Polish politics.
It would be interesting to look at historically great forecasting performances.
For example, in 1790, Burke shocked his Whig colleagues by turning vociferously against the seemingly moderate French Revolution, predicting regicide, terror, war, inflation, and it all ending in military dictatorship. That’s a really good forecast ten years out into one of the most unprecedented and tumultuous decades in history. On the other hand, Burke seemed to go a little nuts after that.
Similarly, Rousseau, while not making specific predictions, in the midst of the Age of Reason around 1760 did an amazing job of anticipating how cultured Europeans around 1820 at the height of the Romantic Era would feel about things in general (emotionally). Rousseau was a disagreeable nut, but he was a genius in the sense of being able to foresee the emotions of the future.
I’m not sure that Tetlock’s meat and potatoes forecasting tests would have captured the genius of Rousseau and Burke.
Adam Smith suggested that, since the English establishment wasn’t willing to let the American colonies go, they should instead offer them seats in parliament proportional to their contribution to the tax revenue of the empire. He then commented that, if that was done, in about a century the capital would move to the New World.
Or in other words, he predicted that U.S. national income would pass that of Great Britain in the second half of the 19th century, which I believe is what happened.
Ben Franklin proposed in 1760 that control of the interior of the North American continent via Quebec City and New Orleans would determine whether French-speakers or English-speakers would dominate the world from 1900 onward.
Franklin’s 1754 analysis of population growth in America (that it had doubled via natural increase in the last 20 to 25 years) was immensely influential on top British theorists such as Malthus and Darwin.
Agreed. We should keep our eyes out for more confirmations (and disconfirmations).
>His investigation into the secrets of their very moderate success led to his famous “fox” versus “hedgehog” dichotomy, based on the fable that “the fox knows many things, the hedgehog knows one big thing”.
Scott, I’m not sure whether it’s just your phrasing or my parsing, but it sounds like you’re attributing this dichotomy to Tetlock himself, while in fact it comes from Isaiah Berlin’s essay from 1953.
I took “his famous ‘fox’ versus ‘hedgehog’ dichotomy” to mean that Tetlock came up with it and my first thought was also of Berlin’s essay, FWIW.
It comes from a fragment of the greek poet Archilochus by way of Berlin.
Did the Good Judgment Project have a control group? Did it put, say, the middle 2% forecasters into groups, average and normalize their predictions the same way as the top 2%, and compare them? I’d be interested in the outcome of a control group. You might test how much “wisdom of the crowds” exists at various levels of individual performance.
My hunch is that the normalized average of any contiguous 2% will outperform most or all of its individuals over the full range of questions. I’m wondering if I could draw an apt analogy to how people tend to rate averaged composites of faces more attractive than individual faces because of how the averaging smooths out their blemishes and idiosyncratic proportions.
Yes; other IARPA tournament entries tried more straightforward wisdom of crowds methods, but were outperformed significantly by the superforecasters.
But Tetlock says that the other groups failed because of “bad management,” not bad methods.
Having heard presentations from a couple of those other groups, I wasn’t under that impression.
What do they say? Do you know if there are any write-ups of their inferior performance?
I’m just quoting Tetlock. I don’t know that he knows what’s going on with the others.
I can’t reply to the other comment, but… I saw presentations of the results by academics, several of whom started working with GJP in later seasons. So if Tetlock said it, it’s possible he was more informed than I was about what went wrong with other groups.
For more on the competition, here’s IARPA’s program overseeing the project for all the teams: http://www.iarpa.gov/index.php/research-programs/ace
In the spirit of the Good Judgement Program, http://www.metaculus.com/ is a site that predicts things and then awards points based on how accurate you are and how far away you are from the group average. It appears to be mainly filled with LW style rationalists, so by the theory of the book, will be quite accurate.
I imagine that in the future the points will be used to more heavily weight a group score based on those who are good at answering, but that doesn’t happen yet.
I’m one of Tetlock’s Superforecasters, and I’m happy to answer some questions.
I stumbled into this blog a couple months ago and otherwise have little or no exposure to the rationalist community, so I won’t have any direct insight comparing them with Superforecasters.
What’s your academic background? How did you come across this project? Can you give us some insights on how you come up with your predictions? Am I asking too many questions?
I have an undergrad math degree from Caltech, no graduate work. I now work in finance. Pretty sure I found a link to the project on the Freakonomics blog.
The most important factor for a lot of the questions was determining a reasonable base rate. For the most obvious example, each year there were probably about five questions along the lines of “Will X remain the leader of country Y?”, often with no particular reason to think that they would leave office.
I, along with most of the other supers on my team, didn’t feel that I was particularly adept at forecasting so much as that most people (even most smart people who haven’t spent a lot of time thinking about probability in some way) are really bad at it.
The number of similar predictions actually seems important to me, from a calibration perspective. If you predict whether a large number of leaders will still be in office, you allow the underlying base rate of “leader leaves office before end of term, per year” to actually have a chance at affecting the prediction calibration.
IOW, if I roll six d6 and ask you to predict the odds of each individual one coming up 6, you start being able to separate the guy who guesses 1/6 from the guy who guesses 1/5 or even 1/3.
I don’t know, maybe this is the wrong way to look at it.
How do you translate subjective/qualitative data into something quantitative you can use to update your forecast?
Again, I think the most important factor for most questions was trying to come up with good base rates. One related process that I found helpful for questions with arbitrary time frames (e.g. will North Korea test a nuclear weapon in the next 3 months?) would be to extrapolate my answers to more intuitive time frames. Or maybe extrapolate to more intuitive probabilities… If I don’t think there’s anything special about the time window under consideration, I can reframe the question more along the lines of “For what time frame would I estimate a probability of 50%?” and then back out an equivalent probability for the shorter time interval.
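One simple way to do that back-of-the-envelope conversion is to assume a constant hazard rate, i.e. that the event is no more likely in any particular slice of the window than another. This is my reading of what’s being described, not necessarily the exact method used:

```python
import math

def rescale_probability(p_long, t_long, t_short):
    """Convert a probability over one horizon to another, assuming a constant hazard rate."""
    rate = -math.log(1 - p_long) / t_long      # implied hazard per unit time
    return 1 - math.exp(-rate * t_short)

# "I'd call it 50% within 12 months" implies roughly 16% within 3 months.
print(rescale_probability(0.50, 12, 3))   # ~0.159
```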
OK, but what about after you establish a base rate? Let’s take the NK example. Say you have a forecast in place already, and Kim Jong-Un makes a speech saying “the capitalist pig-dogs will be punished by our nuclear arsenal very soon”. (This particular example is bad because he probably makes that sort of speech every week, but anyway). How do you go from the words to adjusting the probability?
My first instinct is still to grasp at a base rate, if I’m able to find comparable statements in the past by him or even a related dictator. Or, if I don’t find previous comparable statements preceding NK’s previous nuclear tests, that in itself is telling. If I find any reason to dismiss the comments, I’ll stay fairly anchored to my base rate with only a moderate adjustment.
Assuming that I don’t have anything to compare the situation to, I’d try to place the remark in a fuller context of NK’s current stance. Are they doing other aggressive things, like cancelling planned family reunions (for families split between North & South)? Are they responding to some perceived provocation? If so, how serious was the provocation?
At the end of the day, at that point it really comes down to making an informed guess. I’d make sure not to have an extreme forecast in either direction, and hope that some of my teammates have more real knowledge about NK to base an estimate on.
Whatever adjustment I do make, I’ll start to decay after a week or two if there are no new developments.
Can you describe the thought process you followed when making your predictions?
The first thing I’d think about was whether I even wanted to answer the question. To maintain a semi-reasonable workload, GJP designed the scoring system to impute a probability for you each day for each question to which you had not submitted an answer. For the supers grouped into teams, this imputed answer would be the median forecast of members of your team. So, if our team already had several forecasts for a question, I wouldn’t get involved unless my own guess disagreed with most of the forecasts (or there was a lot of dispersion among our forecasts).
For questions that I did decide to research, my first two steps in forming a prediction were usually a search on Google News for recent information, and a search for historic information to form a base rate (more often than not, Wikipedia). I’d try to write up my own reasoning and form an initial probability before reading my teammates’. One of my teammates would occasionally email reporters or academics and ask for their probability estimates and reasoning.
One thing that Tetlock points out in his book is that supers use a lot more language in their posts suggesting that they’re weighing evidence pointing in different directions (e.g. “on the other hand”).
Thanks.
This is very helpful.
Do you feel that you’ve used an explicit or implicit version of Bayes’ theorem to update based on evidence?
Implicit would be a vague feeling of “the less likely this data was in a world that looks like X the stronger evidence it is against X” without the actual formula.
Very occasionally explicit, usually implicit.
Tangentially, even when I’ve answered probability quiz/puzzle questions which are designed to be plugged into Bayes’ theorem, I only rarely would have explicitly used the formula. My intuition is basically to mentally build a Venn diagram.
I’m also a superforecaster, but I’ve been involved in LW for years, and find that this form of forecasting is excellent in practicing (a small subset of) rationality skills.
How did you approach the team aspect of the Good Judgment Project, especially comparing the first season team of random participants vs. teams of superforecasters? And how much time did you typically sink into the predictions each week?
The reason I ask: I participated in GJP around 2012 and enjoyed it (though I only scored around the average), but found the team aspect frustrating. Most members were not very active/motivated, although many seemed to have quite good domain-specific knowledge (e.g. DC foreign policy wonk types). This didn’t do any favors for my own engagement, which dropped over time as the Project did an excellent job of converting reading the news from diversion to chore. Without much in the way of expectations from the team, keeping abreast of such a wide variety of world events lost its luster pretty quickly.
I probably spent 5-10 hours a week on forecasting for my first few years, but eventually stopped participating for the last 6 months or so.
They had a variety of forecasting conditions for the non-supers… I joined midway through season 1 and was initially in an individual forecasting condition. I was able to see the mean forecast of the other ~200 forecasters in my experimental condition, and there was a leaderboard, but no other way of interacting with each other. Eventually I realized that the group’s mean forecast tended to be correct but was vastly underconfident, so my baseline estimate was a more extremized version of the group consensus. The GJP had algorithms to convert the group’s forecasts into a higher-quality estimate, and this was the most helpful of several transformations that they discovered.
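For the curious, the extremizing idea is roughly the transform below, which pushes the crowd’s probability away from 0.5 on the log-odds scale. This is only a sketch of the general technique; the exponent value here is an illustrative guess, not GJP’s fitted parameter:

```python
def extremize(p, a=2.5):
    """Push an aggregate probability away from 0.5 on the log-odds scale.

    a > 1 makes the crowd more confident; a = 1 leaves it unchanged.
    The value 2.5 is only an illustration, not a fitted parameter.
    """
    return p ** a / (p ** a + (1 - p) ** a)

group_mean = 0.70              # hypothetical underconfident crowd average
print(extremize(group_mean))   # ~0.89: same direction, more confident
```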
In season 2, I was in a team condition (of non-supers) like you described. While there were plenty of inactive members, I don’t recall being teamed up with anyone who was actively harmful. I was fortunate to be grouped with a few other forecasters who contributed well-reasoned arguments on a regular basis, and we worked well together. At least a couple of us were able to look at almost all the questions, and we had fairly good results for a non-super team. They made 4 of us supers for season 3.
I was in GJP for a short while but I left. Five to ten hours a week is a lot of work, for any sort of unpaid hobby or job.
I found the reward centres of my brain were left wanting by the long time until questions were finished (e.g. will there be a war in the next year?), which stood in stark contrast to the fact that I felt the need to massage my answers frequently in response to events.
Effort-to-reward ratios were very low and a great deal of intrinsic motivation was needed. Far more than I – a person with some real interest in predictions – possessed.
Do you have any insights into what the other groups were doing badly?
I was surprised to learn that on the whole, non-supers tend to be underconfident for most probabilities (particularly if you’re aggregating their forecasts). They are, more expectedly, overconfident for extreme probabilities.
Non-supers tend to take too much of an inside view and neglect base rates.
Most non-supers also put much less effort into forecasting. They updated their forecasts much less often (ignoring both new news and time decay). When they did update their forecasts based on news, they tended to overreact (again, too much of an inside view).
Great! I do have questions.
At what rate do superforecasters drop out?
Has being a superforecaster brought you compensation and/or job offers?
How many of you are not US citizens?
I don’t know the total rate at which we dropped out – I’d guess 10-20% attrition per year. In my cohort of 15, IIRC there were 2 dropouts between seasons 3 and 4, and there was 1 member who chose to switch to the prediction market setting rather than the team setting.
I’m curious about the grading system. I understand you can enter many predictions for the same event. How do they simultaneously avoid all the following cheats:
1. Changing your prediction at the last second (eg “Will there be war in Korea this year?” 100% NO at 11:59 on December 31)
2. Making the same very easy prediction a hundred times over a hundred days to increase your score
3. Waiting until the last possible second to make your prediction at all
Also, does anyone skip questions? Are you penalized eg given a zero grade on that question?
My impression is that the score on a single question is the integral over time of score of the prediction current at those times. So if A and B are both 100 days in the future and I answer A once today but answer B every day, that doesn’t count as 1 prediction of A and 100 predictions of B, but 100 prediction-days of each. (Or maybe things are divided by time open, so all questions are equal weight.) I think skipped questions were interpreted differently for different people. Elsewhere, Jon says that for (some?) people with access to a group, skipped questions are interpreted as following the average. If you answer 100-day question on the last day, that’s 99% imputed predictions and 1% real predictions.
If question A is what will happen on the particular day in the future, it is sensible to answer once and wait (in the absence of news). But if B is whether something will happen between now and then, you probably should change your answer every day, reducing the odds by 1%. That’s a better average prediction than predicting once and forgetting, but you aren’t credited or penalized just for activity.
Individuals in the GJP forecasting setups (as opposed to the prediction markets) received 1 grade for each question which was posed (excepting a small handful that were thrown out by IARPA for various reasons). The grades were Brier Scores (see Wikipedia).
Forecasts are logged and judged once per day at midnight – each day’s prediction is assigned a Brier Score. If a question was open for 20 days, your final score for the question is the average of your 20 daily Brier Scores. The day which triggers an early close to a question (e.g. a war in Korea starting mid-day) is not scored – so you don’t get any credit for changing to 100% right after the event happens.
Each team’s forecast is assumed to be their median forecast, and the team is assigned a Brier Score based on that median.
You receive a score for each day, for each question, regardless of whether you submit a forecast. For days on which you haven’t yet submitted a forecast, one is imputed to you. If you’re on a team, you’re assigned the team’s median forecast (so your Brier Score will match the team’s score). If you’re not on a team, you’re assigned the mean forecast of people in your experimental condition (you are able to see that mean forecast).
So you don’t benefit by cherry picking the questions in any way – you’re always scored as if you had submitted some kind of average forecast. I would only submit forecasts when I thought that I was more accurate than my team’s answers so far, so I participated on only about half the questions (fewer when they were first posed, but I’d sometimes jump in if I thought the team wasn’t decaying quickly enough).
There is one subtle way to game the scoring system… when a question has an asymmetric closing date, like the Korea war example. If the question is posed with a 6 month time horizon, then when the result is No, there will be ~180 days whose scores are being averaged, while if the result is Yes, there may be a very small number of days that count. So to maximize your expected score, you would artificially raise your forecast initially – you get a huge benefit in the unlikely event it triggers quickly, but only a mild penalty if the question goes the distance. One of the supers built a calculator to optimize our answers for those types of questions: http://jsfiddle.net/Morendil/5BkdW/show/
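To see roughly what that calculator is exploiting (I haven’t seen Morendil’s actual code, so the model and numbers here are invented): take a 180-day question with a small constant daily hazard and ask which day-one forecast minimizes your expected final score. Because an early Yes leaves only a handful of scored days in the average, while a No gets diluted over ~180 days, the optimum sits far above the honest probability.

H = 180           # days the question can stay open (assumption)
HAZARD = 0.0005   # assumed constant daily probability that the event triggers

def expected_day_one_cost(f):
    # expected contribution of the day-one forecast f to the final (averaged) score
    cost, p_alive = 0.0, 1.0
    for d in range(H):
        p_trigger_today = p_alive * HAZARD
        p_alive *= 1 - HAZARD
        if d >= 1:  # triggers on day d: days 0..d-1 are scored against Yes
            cost += p_trigger_today * (1.0 / d) * 2 * (1 - f) ** 2
    # never triggers: all H days are scored against No
    cost += p_alive * (1.0 / H) * 2 * f ** 2
    return cost

honest = 1 - (1 - HAZARD) ** H
best = min((x / 100 for x in range(101)), key=expected_day_one_cost)
print(round(honest, 3), best)   # prints 0.086 0.36 with these made-up numbers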
I notice the site name is Morendil – is this Morendil the LWer?
That makes sense, but also sort of annoys me. Aside from that calculator, you ought to change your forecast every day – if nothing else, the chance of war in Korea decreases by 1/X every day in a period with X remaining days. Doesn’t this give an unfair advantage to people with way too much free time who redo their predictions every single day?
Not sure, but I’d guess there’s a good chance it’s the same person.
It does give an advantage, but in practice most of those questions have low enough probabilities that it doesn’t tick a whole percent very often (you can only submit forecasts in whole percentages). Excepting the scoring bug that Morendil’s calculator addresses, I think that updating all your forecasts once a week probably gets you 95% of the way towards handling decay properly. To game that bug properly, you do need to update pretty often right after a new question is posed.
Also worth noting, if you’re on a team that’s conscientious about decaying and you all agree, most of you can just withdraw your forecasts and rely on a couple of you to keep decaying it.
This falls under the category of “we really should be asking a statistician”. Making a later prediction should “count” for less than making an earlier one (reaching a limit when the year is over, you predict a 0% chance of war, and this provides no information about your ability to predict at all). I don’t know how to compute how much less the prediction should count; but it’s unbelievable that there is no literature on this subject.
I had this thought before when reading Tetlock, and IMO the opportunity for this type of gaming of the payout structure is why prediction markets inherently add a layer of informativeness: someone could take the “short” position against someone gaming in this way and arbitrage the advantage away.
It also brings to mind a natural avenue for expanding on Tetlock’s research– which is to look at the volatility of the predictions through time.
e.g. under the current payout structure Jon S describes, for each question, a participant is effectively given a financial option that he/she gets to set the strike price on. The proper strike price (probability) that maximizes the payout on any given day depends not only on your native estimate on that day, but also on the time remaining and the term-structure of volatility through time.
Wouldn’t information on the implied volatility for these forecasted outcomes be of interest in addition to what Tetlock is currently measuring (just the central tendency)?
Seeing volatility of predictions over time as important and related to arbitrage is really interesting, thanks for this perspective.
Do you know what kind of assumptions, and what justification for them, Morendil is using to model that convergence?
I am also a superforecaster and have mentioned it before and why doesn’t anyone think I’m interesting, wtf
lol
Then I am also interested in your answer to my question below, FWIW.
We think you’re interesting!
(If you are still here.)
What differences are there between your process and the other person’s?
I think you are interesting enough that I would love to chat/interview you (not for any public facing thing) to discuss group rationality and your epistemic heuristics. You can ping me at my first name at mealsquares dot com.
What probability would you assign to a US-Russia nuclear war happening within 10 years?
here’s my guess
[This question is also open to other superforecasters.]
Testing my prediction here: estimate how good your approximate number system is. How good are you at quickly guessing the number of objects in a picture full of objects? At comparing the volumes of differently-shaped containers? At Guess the Correlations? The last one is particularly significant since you can measure it quickly and conveniently and it gives you numerical answers; there are probably tests on the internet that relate to the ANS more directly, but I’m too lazy to search for them.
I’d also expect that my ANS is significantly above average. Pretty sure that as part of the GJP I took a quick test on the subject… the organizers let some of their grad students experiment on us along those lines. I tried out Guess the Correlations – got 236 on my first try, not sure how good that is.
It sounds like a lot of your success is based on a guarded, see-all-sides, outside view that assigns high-entropy probabilities to “dramatic” questions. Were there any striking counterexamples? Big news items that seemed to take the world, maybe the media, by surprise, but that you saw coming.
I can’t think of any dramatic examples for my own predictions. One of my teammates did his dissertation on a topic that was quite relevant to the situation between Ukraine and Russia, and he correctly made strong predictions to a couple of related questions that were counter to the average prediction of other super-teams.
“None of them are remarkable for spending every single moment behind a newspaper….”
Arguably the most successful modern forecaster whose accuracy we can measure is Warren Buffett, and he basically spends all day reading.
This was an extremely useful review for me, because I was considering whether to buy this book, and I have plenty of familiarity with this material.
A nice alternative to buying the book is to listen to the (excellent) EconTalk podcast episode with Tetlock as a guest:
http://www.econtalk.org/archives/2015/12/philip_tetlock.html
YES. I loved that episode. Can’t get enough EconTalk.
For a list of Scott’s previous book reviews: https://www.reddit.com/r/slatestarcodex/comments/3w4ip3/the_book_reviews_of_scott_alexander/
Scott, how are you coming up with your own yearly predictions? Did you read Superforecasting before you made the 2016 predictions? It would be interesting to see if reading the book makes someone actually better at predicting.
No, I hadn’t read this before.
Scott, it’s worth noting that the tournament continued for 5 years, and they continued to run both superforecaster and non-superforecaster conditions to compare – but there was bleeding, since a few non-supers were invited in over time. IIRC, they had some conclusive results comparing the groups, and the performance of supers was shown pretty clearly to be robustly higher.
I was clearly NOT a superforecaster. I participated in the first two years. I dropped out because many of the questions seemed to be about world politics, about which I had zero knowledge and no desire to learn. A typical question was on the likelihood of a candidate or party I had never heard of winning an election in a country I would be hard pressed to point out on a map. I added zero value to the process, unless it depends upon shills.
My experience was pretty similar–I put in one good season before pulling the plug. As I mentioned above, it sucked a lot of the fun out of reading the news by making it so goal-oriented.
That being said, I disagree that poor domain knowledge means being pure deadweight in the process: the world-events subject matter was, IIRC, deliberately selected to be at least somewhat unfamiliar to most forecasters. This ensures that the test is more towards “what are good practices/approaches to forecasting in general?” than just seeing who has the most complete knowledge of the topic at hand. Reinforcing that point, I scored around average for the project as a whole in the year I participated, and a little better than my team’s average. But I had basically zero domain knowledge compared to most of my active team members, who described themselves as things like “Weekly reader of Foreign Affairs for the past three decades” or “MA in International Relations at Georgetown” or “retired diplomat.”
I wonder how much better or worse the results would be if the forecasters were allowed to discuss the situation before giving their predictions.
Discussion among superforecasters improves their predictions. I think discussion among average forecasters improves their predictions, but increases variance. That is, the groups become echo chambers, but averaging a bunch of discussion groups is better than just averaging individuals who don’t discuss.
Maybe it’s possible to avoid echo chamber effect if no one directly states their predictions, but they all discuss the evidence they’ve got and the appropriate reference classes.
@Whatever Happened to Anonymous
I suppose “sneering”, if that’s the accepted shorthand. You are allowed to sneer at high status and official-looking books and the people who like them, but heaven forfend you call someone a nerd. This is exactly backwards.
With regard to your second comment, I am reading as Blue Tribe then, so this is being modelled as an intra-tribal dispute.
@jaimeastorga2000
This is a weird sociology of the internet, given how intertwined sneer culture, troll culture, and hate culture all are; and how they developed over time.
>I suppose “sneering”, if that’s the accepted shorthand. You are allowed to sneer at high status and official-looking books and the people who like them, but heaven forfend you call someone a nerd. This is exactly backwards.
Reading it again, I totally see how it could be interpreted that way. But I don’t think that was the intent; what I thought Scott meant was that it is good for providing validation to a group of people who get a lot of “crank” accusations.
>With regard to your second comment, I am reading as Blue Tribe then, so this is being modelled as an intra-tribal dispute.
I admit I have trouble with this Blue/Red tribe concept, but I would hope that’s not representative of Blue Tribe. At the very least, the self-styled blues that comment here (Scott included) display nowhere near the level of toxicity of those communities.
Well, we’ve established intent doesn’t matter, especially on the internet, see the reactions to my explanation of my first remark. The community here has a very strange (and dare I say) toxic relationship with social status, and I enjoy goring that ox at every opportunity.
As for sneering, I suspect it’s a human universal behavior, my first thought was thinking of hipsters and “I liked this band before it was cool” reading this post. I’d be happy to learn that it’s not a human universal and I’m grossly generalizing my experience, but then, that’s why I leave abrasive comments on the internet.
Interesting juxtaposition lately. An insult to Emma Watson brings tally-ho on her. Banter about HPMOR, tally-ho on you.
The CIA already knew this: https://www.cia.gov/library/center-for-the-study-of-intelligence/csi-publications/books-and-monographs/psychology-of-intelligence-analysis/PsychofIntelNew.pdf
Heuer’s work influenced GJP, but it was based on speculation and abstract research. Also, his ACH methodology wasn’t implemented widely, at least by agencies that are public about their work (like the FBI, or NYPD.) It also hasn’t been publicly tested before – though the CIA may be using it internally for things, for all we know. The CIA is, however, one of the organizations that works closely with IARPA, and they were aware that this was going on, so presumably they thought it was worthwhile.
The lesson for individuals that Scott takes away from this book is the same as the lesson of that book. But what does it mean for the CIA to know something? Some people inside the CIA knew about the heuristics and biases literature, but I don’t think that the standard procedures and training reflected that book. This contest may well have been created by those people to provide concrete evidence as a step towards implementing it.
But mainly the contest is about aggregating judgements, something that the heuristics and biases literature does not address. Indeed, the contest may have been pushed by prediction markets people, not by heuristics and biases people.
I am not sure how it relates to the other points, but in the decade since the book was written, the CIA has defined probabilistic interpretations of its analysts’ assessments. This contest has encouraged them to increase the number of buckets.
I guess what I meant to say was, “Here’s an interesting book on the same topic.”
The biggest limitation of the forecasting process is what questions are asked.
For example, the biggest event of 2015 was the European refugee crisis. But despite the fact that the main elements of the crisis were already in place (the collapse of Libya, the civil wars in Afghanistan and Syria, and Erdogan’s willingness to play hardball with Europe), there was no forecasting about it.
What the superforecasters need is a group of superintuitives who ask the right questions, including the craziest ones.
Tetlock mentions this and agrees with you, though he doesn’t talk about any specific ways of solving the problem.
My review of Tetlock’s “Superforecasting” book in Taki’s Magazine last month looked at 3 big stories of 2015 — Merkel’s Million Muslim Mob, Caitlyn Jenner and World War T, and the rise of Trump — and how hard it would be to even ask forecasters to predict the probability of them happening:
http://takimag.com/article/forecasting_a_million_muslim_mob_steve_sailer/print#axzz3zMdUB3rR
Personally, I sort of, kind of predicted that World War T was coming back in 2013-14, and, if I were feeling generous, could give myself partial credit on the other two. Or I could give myself zero points for all of them if I were being persnickety.
This is a good review, but sounds a little like “meh, it’s just okay for those of us who already know this stuff.” Rationalists and rationalist-influenced folks should give this book far more credit, and assign it far more importance, than that.
Sure, if you’re a CFAR alum or LW reader, you might not learn any new tactics from this book. But the GJP study functions as far more than social proof: it’s just proof, period. Okay, okay, it’s pretty good evidence, period.
Specifically, it’s evidence that a probabilistic mode of thinking transfers across a broad range of topics, even topics that initially appear to be ill-defined and overly complex. Let’s remember that the conventional wisdom – among really smart people – is that an amateur can’t possibly make accurate predictions about both global warming and North Korea. “Some of these areas are too complex and messy for anyone to call, and if you did make a prediction it’d be a black box more art than science, so you would need many years of experience building your intuition in the field, and even then we’re just flipping a coin a lot of the time.”
The conventional wisdom is entirely reasonable, and Tetlock’s findings are non-obvious. Even on this blog, we’ve discussed previously whether “critical thinking” supposedly taught in college transfers to other domains, or whether education is just a grab-bag of domain-specific nuggets of knowledge. The GJP shows that rationality-style thought – what Anna Salamon described as a convergence between Feynman, Munger, MBA programs, LW, Tetlock, and others – actually works in an important way for hard, messy, big, real-world problems.
http://lesswrong.com/lw/n4e/why_cfars_mission/
We can be pretty damn sure we have a good way of thinking, one that takes the best of and refines the work of the past couple of centuries, but evidence that this functions in the real world matters. After all, even many quite technical people are less than enamoured of this style of rationality.
Further, let’s not sneeze at social proof. The GJP’s success means that mainstream folks may pay more attention to rationality, and that’s not a bad thing.
The bit about how “foxes” tend to do better than “hedgehogs” reminds me of how the famously right-wingish, Austrian-economics-inclined Jim Rogers and the famously left-wingish political donor George Soros could both work on the same fund and both make a ton of money investing even though their broad worldviews seem so divergent. I haven’t read much by Soros, but Rogers’ statements on investing always emphasize doing a lot of on-the-ground research. So, had he (or Soros) invested on the basis of a broad ideological outlook, they probably would not have done well; but, inversely, you can have an entirely mistaken meta-outlook (on the assumption that Rogers and Soros can’t both be right on that question, regardless of which you think is better) and still do very well if you confine yourself to more detail-oriented, object-level questions.
To expand on that, I’d be interested if there were a study on whether investors with known left-wing ideological leanings do better than investors with known right-wing ideological leanings, but I imagine that, if such a study exists or were done, it would show no appreciable difference. Which, on the face of it, seems bizarre–how could two groups of people with two opposite opinions on the likely outcome of various major government policies, for example, do equally well over a given period of time? But I imagine they do.
Reminds me also of some joke about macro and microeconomics to the effect of “macro tells you interesting things with very low certainty and micro tells you boring things with much higher certainty.” Sticking to the boring things is probably a better way to make money. Also, even though I think libertarian and socialist investors will probably do equally well, I think this very fact supports the libertarian view of the economy and the extent to which it can or should be predicted and regulated: Hayek and the pretense of knowledge and all that.
[insert tongue in cheek]
libertarian investors have a poor understanding of social good but a great willingness to exploit anyone for their own profit.
socialist investors are great at long-term social benefit hence are willing to forego some of their own profit.
These differences may cancel out 🙂
[remove tongue from cheek]
As in, it’s possible, as a speculation, that a libertarian investor does a smaller number of more personally profitable but less overall sustainable actions, while a socialist investor does a larger number of less personally profitable but more overall sustainable actions.
Sigh, to be clear, I am not saying the above is the actual explanation. But only that it comes to mind as a potential way that two different factions pursuing different strategies could end up with comparable net return. Note it’s not exactly symmetric since the libertarian investor has externalized some of the negatives elsewhere. But that’s a common investing theme (aka “socialize costs, privatize profits”).
“libertarian investors have a poor understanding of social good but a great willingness to exploit anyone for their own profit.”
Even tongue in cheek, this kind of stereotyping is unhelpful, I think. I mean, I get what you’re saying: if ruthless men with monocles and tophats and moustaches and granola eating hippies get similar returns then that seems to imply the hippy philosophy is actually better since the Monopoly guy is less scrupulous, and so, all things equal, should do better. But the idea that libertarians, as people, are more willing to profit through exploitation of the commons is entirely untrue in my experience.
Moreover, I’m almost certain your supposition is wrong since:
1. Most people don’t actually make practical decisions about investment, employment, and the like on the basis of broad political ideologies. And it’s good they don’t, because they’d be more likely to be wrong.
2. They’re more likely to be wrong because broad ideologies do not include enough information to make accurate predictions.
As per the macro-micro joke, I don’t think accurate, large-scale, meaningful predictions are possible for human brains at their current scale and longevity. People can make meaningful, better-than-chance predictions about their own very narrow field of expertise–if you’ve been working in a particular industry for decades, for example, are intimately familiar with its ups and downs and history, then your guess about where it’s going next will be better than chance. But only about that one thing. If you’ve been studying the history of one particular type of institution for decades, you might be able to predict with better-than-chance accuracy how a new institution of a certain type might behave, though that prediction will still be confounded by a thousand other factors emanating from areas outside your field of expertise.
Since macro-scale events are nothing more nor less than the sum of the interaction of all the micro-scale events, the only way to predict them with any accuracy would be to be intimately familiar with say, 100 different industries, and/or to have simultaneously lived 1000 different lives as people of all nationalities, genders, races, and social/class/cultural backgrounds. And then you’d have to be smart enough to understand how all these different things would interact. So basically you have to be a super AI or an immortal or something.
The closest we can come is things like stock markets, prediction markets, etc. which aggregate dispersed knowledge in a way no single brain can. Hence, one person may be consistently better than prediction markets on one particular area they have studied for decades, but I doubt there is anyone who could do better than prediction markets in every area from sports to political outcomes, to the price of orange juice.
Off-topic, but I was amused at how badly I misread
No excuse, really, except that I’ve fallen behind and was reading too fast. You didn’t really mean “ruthless men with many things eating hippies”.
Tetlock uses as his example of a hedgehog with one idea the pundit Larry Kudlow, who first had success with supply side economics in the Reagan Era. Since then, he’s used the same model, but with pretty bad forecasting results.
My view is that supply side economics is a classic example of the diminishing returns problem: if the highest tax rate is currently 70%, cutting it will probably work out okay. But the second or third cut is likely to be less effective and more troublesome. Indeed, Arthur Laffer pointed that out when proposing supply side economics in the 1970s with his famous Napkin Graph.
I’d be a bit wary of the meaning of phrases like “… but they actually did 30% better than professional CIA analysts working off classified information.” That sounds great. But it doesn’t take into account that being a professional analyst can involve writing what the politicians want to hear, rather than making your best effort at figuring out what will happen. For example, this was a big problem in the run-up to the Iraq war. Career advancement did not come from making a fine-grained estimate of Iraq having nuclear weapons. It came much more from writing along the lines of “War! War! War! We must invade. See this brief as to why …”.
http://www.washingtonpost.com/wp-dyn/content/article/2005/05/12/AR2005051201857.html
“Seven months before the invasion of Iraq, the head of British foreign intelligence reported to Prime Minister Tony Blair that President Bush wanted to topple Saddam Hussein by military action and warned that in Washington intelligence was “being fixed around the policy,” according to notes of a July 23, 2002, meeting with Blair at No. 10 Downing Street. ”
It’s not hard to do better than professionals when they’re not being professional.
There’s some truth to that. On the other hand, assuming that there are different parts of the government with different incentives for accuracy vs. flattery, it might be possible to replace or supplement the analysts (with their incentive to flatter) with the GJP people (who are too widely distributed to have that incentive) and so actually make gains.
Who would be doing the replacing? If the US government doesn’t want to hear the real story, but rather wants only justification for what the politicians have decided to do already, the whole point is not to get accurate predictions. In this case, British intelligence was giving a straight story to the British government (from the above):
“”The case was thin,” summarized the notes taken by a British national security aide at the meeting. “Saddam was not threatening his neighbours and his WMD capability was less than that of Libya, North Korea or Iran.” ”
And the British government didn’t care either:
“The British Butler Commission, which last year reviewed that country’s intelligence performance on Iraq, also studied how that material was used by the Blair government. The panel concluded that Blair’s speeches and a published dossier on Iraq used language that left “the impression that there was fuller and firmer intelligence than was the case,” according to the Butler report. ”
Too many organizations don’t want accurate predictions in the first place.
http://www.nbcnews.com/id/3080244/ns/meet_the_press/t/transcript-sept/
MR. TIM RUSSERT: Our issues this Sunday: America remembers September 11, 2001. In Iraq, six months ago, the war began with shock and awe. Vice President Dick Cheney appeared on MEET THE PRESS:
(Videotape, March 16):
VICE PRES. DICK CHENEY: My belief is we will, in fact, be greeted as liberators.
That’s why I said different parts of the government. There are different incentives at different levels, but occasionally some of them are good and can use public opinion, shaming, and fear of failure to get the others into line.
As Tetlock points out, a good example is the government’s willingness to sponsor this project at all.
If you want an excuse to go to war, you don’t want accurate intelligence that tells you “Actually, we can’t dredge up a shred of an excuse for you”. That’s the whole controversy over “sexing up the dossiers” and the vexed Chilcot Inquiry which finished in 2011 but the report of which has not yet been published, primarily due to British and American politicians agreeing to sit on it to avoid mutual embarrassment over claims such as the following:
The few scraps of fact emerging from the morass do seem to indicate that for both administrations, those advising caution and discounting claims that WMDs! right now! could bomb London into oblivion in 45 minutes! were pushed aside in favour of those writing the story the way it was wanted. If you have a career as an intelligence analyst, you know that you need to please your masters. People who insist on being factually correct when it’s politically inconvenient don’t get listened to and do get sidelined.
A panel of GJP people who produce good, consistent, accurate results that don’t chime in with what the government of the day wants may not influence policy decisions that much and may not make the gains expected.
The unfortunate end of Dr David Kelly bears witness to that.
Now that I think about it, if the CIA are basing their predictions on information which is supposed to be secret, will the agents give you predictions which might enable someone to deduce the secret information?
Oooh, that’s a good one!
“If we provide accurate forecasts based on our top-top-secret information gained from a triple agent, and they fall into the wrong hands, then the bad guys can figure out what we know and how we came to know it – so better issue a false forecast. In fact, to be safe, we should issue one that is such complete nonsense, the opposition can’t decide whether it’s because we’re all idiots or if it’s that our intel really is that bad” 🙂
I was inspired by Tetlock’s book to launch my own forecasting contest among friends. It’s 25 questions about events in 2016, with 30 entrants. So far one of the events has been decided (RIP, Abe Vigoda) but the others will have to wait for the presidential election, the Olympics, and mostly the end of the year.
The two camps of forecasters were the extremists (who gave 0% and 100% probabilities to everything) and the rest. Nobody seemed to use base rate calculations, so I did that for my own entry, in a quick and dirty way.
Running some simulations, I saw some curious things that may bear on Scott’s questions about granularity. Extremists in general do poorly. Also, taking the winner and “extremizing” all his predictions makes him do worse, not better. Making moderate predictions–and I think granular predictions might be a more intense form of moderation?–gets you farther.
I think it might be an artifact of how I’m calculating the Brier score. I went with the square of the error on each forecast, so being 100% wrong on a question carries a huge cost. Tetlock seems to use the error without squaring it (I couldn’t exactly tell), so maybe there is a lesser penalty for being extreme. Does anyone know?
What scoring system did you use? It’s pretty difficult to incentivize people to maximize their score. Most people just want to maximize their chance of winning, which is not the same thing. It may be that under Brier scoring the way to maximize a small chance of winning is to do a lot of 0/100 answers to maximize score conditional on getting the most questions right. But not under log scoring.
This is the debate I’m having with one of the extremists. I think we agree that extremism maximizes chances of winning when you have lots of forecasters and few questions. But once you have enough questions, you are already in a repeat game, where (informed) moderation seems to be a better strategy–even if you are just trying to go for broke and win outright.
Did you add the 3rd and 4th paragraphs after my comment? I struck out my first line in response to the 4th paragraph, which I hadn’t noticed before. The third paragraph sounds relevant to my main paragraph. It isn’t entirely clear that you did the right simulations. The part where extremizing winners makes them do worse isn’t relevant if it’s just worse score. I’m skeptical that the same number of entrants and questions is enough.
What is the model in the simulations? If everyone has the same beliefs, then reporting true beliefs guarantees that everyone will tie, which may or may not be considered a win.
The main simulation I did was: if the crowd average is less than 50%, the event doesn’t happen; if greater than 50%, it does. The original extremists do poorly in this case, and I (with unadjusted base rate calculations) come in #5 of 30. If I modify everyone’s predictions to be the extreme version of what they originally predicted (a further simulation requested by an extremist), then nearly everyone does worse – except my forecasts, which do slightly better.
The intuition the extremists have is “surely it’s obvious how some of these events will play out, so why not pocket the Brier score of 0 for those cases.” But it’s obvious only in hindsight, and the Brier score really punishes you for being 100% wrong. So going for broke doesn’t work unless you have very few questions to predict and you are trying to break out of a very large pack of forecasters.
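For what it’s worth, here’s a rough sketch (Python, using the single squared-error convention described above) of why extremizing hurts the expected score when the questions aren’t actually sure things – the question probabilities and trial count are invented:

import random

random.seed(0)
QUESTIONS = [0.9, 0.8, 0.7, 0.6, 0.55] * 5   # 25 questions with these assumed true probabilities

def brier(forecast, outcome):
    return (forecast - outcome) ** 2          # single-term squared error

def average_score(strategy, trials=20_000):
    total = 0.0
    for _ in range(trials):
        for p in QUESTIONS:
            outcome = 1 if random.random() < p else 0
            total += brier(strategy(p), outcome)
    return total / (trials * len(QUESTIONS))

print(average_score(lambda p: p))                        # ~0.19: report your real belief
print(average_score(lambda p: 1.0 if p > 0.5 else 0.0))  # ~0.29: go to extremes

Extremizing can still maximize the chance of finishing first in a very large field, for the reason discussed above, but on average it costs you.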
Tetlock does use squared errors. A Brier Score is generated for each answer choice, then those scores are summed. So if you answer Yes/No question as 100% Yes, then you receive a score of 0 if it happens and 2 if it doesn’t:
(1-1)^2 + (0-0)^2 = 0
(1-0)^2 + (0-1)^2 = 2
If you answer 60% Yes, then your score will either be .32 or .72:
(.6-1)^2 + (.4-0)^2 = .32
(.6-0)^2 + (.4-1)^2 = .72
If you answer 50%, then your score is always .5.
Thanks. That math looks like it produces exactly 2x the “Brier scores” described in Wikipedia, which is what I used. Is that correct? https://en.m.wikipedia.org/wiki/Brier_score
That is correct for binary questions. The scores that Tetlock used generalize to multiple-choice questions.
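For anyone who wants to check the arithmetic, here’s a minimal sketch (Python) of the summed-over-options convention – the function name is mine:

def brier_multi(probs, outcome_index):
    # probs: probability assigned to each answer option; outcome_index: which one happened
    return sum((p - (1 if i == outcome_index else 0)) ** 2
               for i, p in enumerate(probs))

print(round(brier_multi([1.0, 0.0], 0), 2))   # 0.0  (100% Yes, it happens)
print(round(brier_multi([1.0, 0.0], 1), 2))   # 2.0  (100% Yes, it doesn't)
print(round(brier_multi([0.6, 0.4], 0), 2))   # 0.32
print(round(brier_multi([0.6, 0.4], 1), 2))   # 0.72
print(round(brier_multi([0.5, 0.5], 0), 2))   # 0.5

For a binary question this comes out to exactly twice the single-term Wikipedia score, which is why the numbers above look like 2x the Wikipedia version.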
Regarding the Scope Insensitivity item, I wonder if this “cognitive bias” is really a bias. Suppose I have $100.
Under a fixed budget, the total amount of money I can pay for chocolate is constrained by my disposable income (regardless of how many chocolate bars I actually want, or how much of a bargain the clearance-sale is). So perhaps scope insensitivity isn’t as maladaptive as it appears. I think the heuristic that our System-1 actually uses is closer to “Hm, [saving birds] is less important than [paying rent], but more important than [eating chocolate]. Therefore, I think I’ll allocate Y% of my budget towards [saving birds], where choc(X%) < birds(Y%) < rent(Z%)."
Whether or not saving birds is more important than something else should depend on the number of birds. OK, the function could be non-linear, but if it’s pretty much constant plus noise, you’re probably not taking the number into account at all, rather than being budget-constrained at exactly the level that would make that pattern make sense.
Since you’re implying that utility monotonically trends towards infinity, what is the minimum threshold of “birds-worth-saving per hour” that would warrant devoting 100% of your lifespan to bird conservation (as opposed to things like eating or sleeping)? If bird enthusiasts were to forgo basic needs like eating and sleeping, they’d die prematurely. And then the glorious Bird Conservation Projects they had envisioned would never actually come to fruition.
Bird Conservation is contingent on your self-preservation. So before you can engage in other people’s welfare, you have to ensure your own welfare. This is why airplanes instruct parents to put their own oxygen-masks on before worrying about their children’s masks.
Instead of thinking in terms of “maximum utility”, I think it’s more accurate to think in terms of “satisfying {wants, needs, goals}” (aka problem solving). E.g. this cake recipe needs 10 cups of flour. “Maximizing flour” (in proportion to the rest of the recipe) doesn’t maximize tastiness, it ruins it. Instead of maximizing flour, you should satisfy the recipe’s flour-requirements. Once the flour requirement has been satisfied, you move on to satisfying the recipe’s next requirement. Givewell for example already understands that you can’t always increase utility by throwing more money at a problem.
When Neuroscience is finally able to examine our own source code, I don’t think our utility function will look like a sum of linear terms. Instead, it’s going to look like a sum of parabolas, sigmoids, etc. E.g. the “flour vs utility” term is going to look like a parabola, while the “Bird Conservation vs utility” term is going to look like a sigmoid. Given this model is correct, the [pay rent] term contains a jump discontinuity — the right-limit of which is higher than the upper asymptote of the [Bird Conservation] sigmoid term, and the left-limit of which is lower than the lower-asymptote of the [Bird Conservation] sigmoid term.
To go back to the original question, I think the issue is that the question assumes money is directly proportional to utility, rather than treating money as an input (aka resource) into a not-necessarily-linear utility term.
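As a toy illustration of the shapes being described (all the functional forms and constants here are invented, not anything from the literature):

import math

def flour_utility(cups):        # parabola: too little or too much flour ruins the cake
    return -(cups - 10) ** 2

def birds_utility(dollars):     # sigmoid: saturates once the problem is "handled"
    return 100 / (1 + math.exp(-(dollars - 50) / 10))

def rent_utility(paid):         # jump: either you make rent or you don't
    return 1000 if paid >= 800 else -1000

def total_utility(cups, bird_dollars, rent_paid):
    return flour_utility(cups) + birds_utility(bird_dollars) + rent_utility(rent_paid)

print(total_utility(10, 80, 800))   # ~1095: satisfying each need beats maxing out any one term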
So do you think that all the people in the survey just happened either to care enough about birds to devote all their spare money even in the minimal-number condition, or not to care at all? Or are you concern trolling with actually irrelevant stuff?
Let me elaborate/explain. My model of a reasonable bird lover under a budget constraint:
0-N birds in peril – donate 0; problem considered not serious, save otters instead.
N-M birds – donate money, monotonically increasing to $X at M.
>M birds – donate $X, all the budget allotted for this category of thing.
If you averaged a lot of those with different N, M, and X, you would get some sort of monotonically increasing smooth function. That is not at all what the study found:
“Subjects were told that either 2,000, or 20,000, or 200,000 migrating birds were affected annually, for which subjects reported they were willing to pay $80, $78 and $88 respectively.”
Also, I didn’t imply as you claim that utility of birds tends to infinity.
Or, what they mostly found was $X, the maximum amount the average person would budget for bird preservation.
Yeah, coz they always said pretty much the same amount, so it’s the most and the least and the average and whatever, by definition. And it just happens that everyone would have found 2,000 birds a year in the USA – which has God knows how many millions of birds – a really urgent, important problem, but could only spare $80 on average. And if 100 times (!) more birds were affected, well, can’t give another $50, would starve/be homeless. Sure, why not, very plausible, totally not rationalizing.
@satanistgoblin:
There is good evidence that humans don’t count past 3 or 4 on an intrinsic basis. So, absent some practical experience with a group of things, it wouldn’t surprise me that the average person is not scope sensitive to differences in large numbers.
We all have experience with money, to a certain scope. Very few people have experience with large populations of wild birds.
Hypothesis: A dedicated Alaskan birder will give different answers to these questions, and is more likely to give scope sensitive answers because they know something about the bird populations in question.
I don’t know if that hypothesis is true, but I don’t think it’s merely a rationalization of the data. Rather, it’s a hypothesis that attempts to explain the data that we see.
I think it’s something like what satanistgoblin says– people have a feeling of how much they want to spend on saving birds, and that’s what they’d say whether it’s 100 birds or 1000 birds, especially when the specific situation isn’t described. Is it birds in trouble (covered with oil who need a rescue operation)? The last members of a species? The last members in a region? The last members of a slightly different local variant?
For the unspecified situation, I wonder whether the answers are different amounts of money depending on the order the 100 birds and the 1000 birds are asked about.
My bet is that you get the same amount if the 100 birds are asked about first (see reasoning above), but it will be less for the 100 birds if they’re asked about second.
[I think I misunderstood you. On the first pass, your original comment didn’t make 100% grammatical/semantic sense to me. So I assumed you were using a phone or something and steelmanned.]
I think matthew had the correct idea. The way humans frame the bird question isn’t “how many birds can I save per dollar?”, but “does the problem (of extinction) justify the cost needed to solve it?”. In this way, it more closely resembles a decision problem than an optimization problem.
The actual study was not about preventing one bird species’ extinction, or about birds from one endangered species. There is some sort of game of telephone going on.
The study wasn’t about “birds” period. The study was about human cognition. Which (I think we both know) doesn’t always interpret researchers’ questions in the way the researchers expect. If this was indeed the case, it screws up the study’s conclusions.
Scott’s next post Highlighted Passages From Superforecasting even brings this up. E.g. that people interpret “there’s a significant chance” as meaning anything between 20% and 80%. Those numbers are wildly different!
I don’t know if this is just me, but superficially, certain general questions like “will the DJIA be up or down by the end of the week” seem as if they might be easier to forecast than more specific questions, like “will the price of orange juice be higher at the end of this week?” Yet logically, it should be exactly the opposite. I feel like, superficially, it seems like the former question requires less detailed knowledge, because you can just look at big trends, interest rates, etc., whereas the latter question requires you know a lot of details about orange farming, etc.
Yet, of course, the question of whether the DJIA will be up at the end of any given week should be exponentially harder than the question of whether orange juice will be up, since, to make a really informed guess, you’d have to know about all the companies and industries included in that index, in addition to fed policy, etc. etc.
This isn’t a strong intuition of mine, but it’s there enough that I find it noteworthy. It reminds me of something I think Eliezer once said somewhere about how “Thor does it” is actually a more complicated explanation for lightning than the natural, physical explanation, because first you have to explain Thor and then you still have the question of how he makes lightning. But, superficially, “Thor does it” seems like a simpler explanation, as “will the DJIA be up or down” seems kind of like a simple question. Is there some sort of LWish name for this bias?
I’m not sure your premise isn’t fatally flawed. The average of a thousand die rolls is vastly easier to predict than a single die roll.
To take a closer analogy, I can predict with moral certainty that the league-wide batting average in the MLB next season will be between .241 and .267. This doesn’t require me to have special knowledge of all 1280 players – in fact, I could not predict the batting average of any single player with anywhere near the same precision – it just requires me to notice that the average last year was .254 and that annual fluctuations have been no greater than .013 since 1931. The variance in the careers of the individual players tends to cancel out when they are pooled, allowing us to more easily spot gross regularities in the aggregate data.
What you (onyomi) are asking here is how we can have a science of the Dow Jones without first having a science of each of its constituent stocks. The answer is that we can’t, but we don’t need to, because the Dow exhibits far greater regularity than Boeing or GE.
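A quick sketch (Python) of the die-roll point, with made-up trial counts, just to show how much the spread shrinks when you average:

import random, statistics

random.seed(1)
single_rolls = [random.randint(1, 6) for _ in range(10_000)]
averages_of_1000 = [statistics.mean(random.randint(1, 6) for _ in range(1000))
                    for _ in range(2_000)]

print(statistics.stdev(single_rolls))      # about 1.7: one roll is all over the place
print(statistics.stdev(averages_of_1000))  # about 0.05: the average of 1000 barely moves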
At some levels you can treat a lot of that stuff as noise and abstract it away. Analogously, it’s a lot easier to predict the motion of a gallon of water than a single water molecule.
(And at some levels you can’t. Abstracting stuff away as noise is a decent summary of what got us into trouble in the subprime mortgage market a few years ago.)
Over a long period of time, you can much more easily predict that the DJIA will be up than the price of orange juice. And that means if you go week by week and always say the DJIA will be up, you will do better than chance.
This is implied by the fact that you can consistently earn money by investing conservatively. What you can’t do is try to game the highs and lows and beat the market to earn a higher rate of return. But it’s very easy to predict that the rate of return will be positive.
Prices are signals, right? So by knowing the interest rate and knowing that it is positive or negative, you have allowed the invisible hand to calculate for you whether the stock market will be up or down. You don’t have to calculate yourself every little fact about orange and potato and smartphone and soda production; that’s been done by other people looking to make money.
I feel like you have just rediscovered Hayekian economics from another perspective.
This is a good point: it seems like the price of orange juice will be easier to guess because you only have to understand orange juice. But, of course, the price of orange juice will already be preferentially decided by people who understand orange juice because those are the people most likely to be active in trading it. Therefore, to consistently make money predicting the price of orange juice requires you be better than the collective guess of people who are, by and large, much more informed about orange juice than average.
As for predicting the DJIA: it is, of course, easy to do better than random simply by guessing that it will be up, week over week, since historically, that’s what it does. But to actually beat it–to say, outperform a conservative investor by buying some kind of index fund before it goes up and selling or shortselling before it goes down: that is, of course, much harder, if not literally impossible (as it is likely impossible for anyone to consistently beat prediction markets in every area, since that would mean beating the aggregated knowledge of experts working throughout the economy).
So, on one level, it seems like consistently beating the market on a general question is no harder nor easier than consistently beating the market on a more specific question, since, in either case, the market is doing the “work” for you of aggregating the opinions of the relevant experts.
Yet it still seems to me, on some level, like it should be easier to answer the more narrow question: to simplify matters, imagine trading one particular stock within the DJIA–let’s say Boeing. It seems like, in order to have a shot at consistently beating the people who trade Boeing, you need only be an expert in the airline industry, whereas to consistently beat the people who trade the DJIA as an index, you’d need to be an expert in all the other industries represented, as well as an expert in airlines?
Of course, for someone who doesn’t want to do a ton of research on a specific industry, trading an index is the safer way to insure stable returns. But to do better than that seems to require specific expertise, and that seems like it should be harder to have the bigger the scale?
Not exactly. There is no reason you can’t invest only in orange futures. If the stock market (and commodity markets) are, on average, going up, picking stocks or commodities at random is just as good in the long run as buying the whole index.
The foolish part of this is simply that returns are more “lumpy”: on average, you will do the same this way, but you could lose all your money. While if you buy the whole index, you won’t. If you did not have a diminishing marginal utility in regard to money, this would not be a problem.
It is easier, if you get down to a very narrow field where you can truly have expertise. Otherwise, no one could ever make a higher rate of return as an entrepreneur than by buying an index fund.
It is harder to have expertise and beat the stock market as a whole. But beating the market is not the same as predicting that the market will in general go up.
It’s much easier to beat a small, local market, and people do this all the time. It happens anytime someone earns a higher-than-average rate of return. For instance, I read a recent news article that bowls (just plain bowls you eat out of) are becoming more popular because of the growing popularity of Asian food and the increasing informality of dining. So companies selling dinnerware sets with more bowls in them were making better returns than they could have gotten just by investing their profits in an index fund, which draws other companies to also produce more bowls, eventually to a point where it no longer pays above-average returns to produce more bowls.
I think the issue with a big index like the DJIA is that, to the extent people can predict its movements, it is not based on knowing detailed information about all the companies and industries represented. It is, instead, knowledge of macro trends and history–information like, “based on the past 100 years, the Dow has returned a rate of between x and y, with an average rate of z…” as well as bigger things like “I bet this revolution I think is brewing in country b may negatively impact all these industries in the short run…”
I could be wrong, but I don’t think anyone attempts to predict its movements by being an expert in each company and industry, except insofar as a certain level of expertise is already “baked in” to the evaluations by the traders of the individual stocks.
In other words, I think maybe my original suspicion is correct: to predict the Dow the way you’d predict something like the market for airplanes, i.e. through detailed and intimate knowledge of all the businesses involved, probably is much more difficult, if not outright impossible.
Yet on another level, any given tradeable index or commodity, be it for the price of one commodity or small company or for the average value of some big index, will, inevitably, have those who specialize in trying to predict its movements. So there must be “Dow watchers” just as there are orange juice futures traders, but the level of information they are working with is already at a higher level of aggregation and therefore, more prone to error.
“Over a long period of time, you can much more easily predict that the DJIA will be up than the price of orange juice. And that means if you go week by week and always say the DJIA will be up, you will do better than chance.”
Maybe yes, maybe no. The question isn’t whether it rises on average but whether the probability of its rising in a week is greater than the probability of its falling.
Suppose that, two weeks out of three, it falls by ten points, and one week out of three it rises by twenty-one points. On average it’s going up, but if each week you bet a dollar that it will go up that week you will, on average, lose a dollar every three weeks.
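Putting numbers on that example (the weekly moves are just the ones given above):

weekly_changes = [-10, -10, +21]              # two down weeks, one up week
print(sum(weekly_changes) / 3)                # about +0.33 points per week: up on average
bet_results = [+1 if c > 0 else -1 for c in weekly_changes]
print(sum(bet_results))                       # -1: a dollar lost every three weeks betting "up"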
Sure.
Empirically, market returns are negatively skewed.
Conjunction Fallacy?
Inferential Distance?
According to Jon S, base rates are more important (which you have to know to answer the DJIA question anyway). Details about orange farming are usually just rounding error. I think the confusion has something to do with inside vs outside view, but I’m getting myself confused at the moment about how to articulate it.
It actually is easier to predict orange juice prices and commodity prices generally. There’s seasonal variation and predictable weather impacts. (This does not mean it’s easier to make money in it of course).
“So as I said before, Superforecasting is not necessarily too useful for people who are already familiar with the cognitive science/rationality tradition, but great for people who need a high-status and official-looking book to justify it. ”
The first part seems oddly dismissive. The results of the studies described in the book are valuable information (which one should want to update on!), in particular:
1. Indeed, this sort of rationality is very helpful for forming true beliefs and predictions about the world.
2. RCT evidence that a one hour instruction session about probability, reference class forecasting, and Bayesian updating can likewise provide substantial benefits. [If you’re talking about causal benefits of CFAR or Less Wrong, as opposed to correlation/selection, this is very important evidence.]
3. Quantitative evidence about the relative contribution of different factors to success.
You deserve kudos for predicting this correctly if you did, but going from prediction to confirmed prediction is a big deal.
I’m not saying Tetlock’s work isn’t useful, I’m saying it’s not much fun reading a book that offers things like a long description on what it means to estimate probabilities.
Tetlock has been publishing his findings for several decades now and his research has been pretty influential. I wouldn’t be surprised if Tetlock’s thinking has been influencing mine since the 1990s, with an outside chance of since the 1980s.
I acknowledge that scope insensitivity is a problem in general, but I’m unconvinced that this:
For example, how much should an organization pay to save the lives of 100 endangered birds? Ask a hundred people, and maybe the average answer is “$10,000”. Ask a (different group of) a hundred people how much the same organization should pay to save the lives of 1000 endangered birds, and maybe the average answer will still be $10,000.
is a good example. Unless the question was really careful in its wording to avoid this, I would expect people to interpret either question as “How much would you spend to prevent the extinction of an endangered species of birds by saving [X] of them?” People will give the same numerical value regardless of what X is, because they are treating it as a boolean outcome (extinct/not extinct), not a scalar outcome (X surviving birds).
Issue seems to be due to inaccurate paraphrasing of the original study: “In one study, respondents were asked how much they were willing to pay to prevent migrating birds from drowning in uncovered oil ponds by covering the oil ponds with protective nets. Subjects were told that either 2,000, or 20,000, or 200,000 migrating birds were affected annually, for which subjects reported they were willing to pay $80, $78 and $88 respectively”. Yeah, if only 100/1000 birds were left in existence it would have been a wholly different question.
The setup is implicitly and explicitly that those are all the birds migrating through that area. There isn’t any information about how much of the total population of birds this represents. The question basically just boils down to “how much money would you spend to protect this route of migration?”
I mean, think of a body of water near you. Do you know off the top of your head how many birds migrate through it each year? If someone then said a different body had 20,000 birds that were affected by a spill, would you know how that compares to your body of water?
The scope insensitivity exists. The question is why does it exist? “People are dumb” is not a satisfactory answer because it just begs the question.
I thought that was the number of birds which would die. I am not going to look it up, I am sick and tired of those damn birds now.
This is all very well and good, but it’s going to be for nothing if we don’t do something about the appalling state of the minds of our youth, which have been blasted into fragments of anxious, variously-gendered, hermetically-cut-off-from-one-another micropersonality blobs, by decades of kafkatrapping, SJW/feminist claptwaddle.
BRAAAIIINNNSSSSS!!!!!
We need to stop this quasi-religious mind-virus that’s been busily replicating itself pointlessly throughout the universities and the education system, cunningly disguised as a form of liberalism. That is absolute priority number one.
We cannot afford to sleepwalk into a three-cornered Armageddon, between a barely-rational Idiocracy, a Christian Fascist theocracy and an Islamofascist theocracy. That’s three major religions, count ’em, three: all armed to the teeth with nuclear weapons, and no shred of the shared rationality that made Mutual Assured Destruction a successful thing.
NO. DO NOT WANT. By God we WILL be one of the few, sparse scattering of species throughout the Universe that have escaped the Great Filter.
This is needlessly inflammatory and totally unrelated to the post. Please don’t.
Crazy problems are not unique to the current decade. They come, eat all the cheap food, lie down on the couch for a while, drink the good wine and stagger out of the house before being hit by an oncoming car… which carries the next batch. All we can do is take the cars they leave behind and sell the scrap for cash to pay for more food and wine.
I like that.
This is such an atrocious comment from a usually good commenter that I worry I’m missing some kind of weird joke or reference, but banned for two months or until I figure out what it is, whichever comes first
I’m assuming that this is trolling, but I don’t get the joke either.
Inspired particularly by the fact that superforecasters use precise numbers and get better predictions that way, I wonder to what extent their success comes not from having a better idea of the chance of something happening, but rather from being better at turning their intuitive notion of probability into a well-calibrated numerical estimate. The latter isn’t trivial even for someone who is otherwise rational. One prediction I’ll make on the basis of this idea is that superforecasters have a better approximate number system than average, controlling for other measures of intelligence.
One point about Tetlock’s tournament is that it gets scored on an annual basis, so being able to foresee events more than 12 months in the future isn’t the way to win.
For example, Merkel’s migrant crisis of 2015 was strikingly like the one foretold in Jean Raspail’s 1973 book The Camp of the Saints. But being more or less right 42 years ahead of time means that Raspail was wrong, on an annual basis, for the first 41 years. (Whether it’s nice he lived long enough for his mordant prediction to come true might be interesting to debate.)
I suspect that one big advantage Superforecasters have is awareness of how often cans get kicked down the road for another year. For example, over the last 40+ years, I’ve read numerous predictions that the current division of Cyprus into two hostile but non-warring zones won’t last. Either things will get worse or they will get better on Cyprus. That prediction seems like a sure thing, but so far the status quo has been the result year after year.
With Peyton Manning in the Super Bowl one last time, I’d like to dredge up my 2009 essay, inspired by all the arguments over Peyton Manning versus Tom Brady, about how we are most interested in forecasting that which is hardest to forecast.
“The everlasting Brady-Manning controversy reminded me of an epistemological insight that Harvard cognitive scientist Steven Pinker suggested when I interviewed him in 2002 during his book tour for his bestseller The Blank Slate. It didn’t fully register upon me at the time, but what has stuck with me the longest is Pinker’s concept that “mental effort seems to be engaged most with the knife edge at which one finds extreme and radically different consequences with each outcome, but the considerations militating towards each one are close to equal.”
“To put it another way, the things that we most like to argue about are those that are most inherently arguable, such as: Who would win in a fight, Tom Brady or Peyton Manning?
“As Pinker observed, this notion of the most evenly matched being the most interesting “seems to explain a number of paradoxes, such as why the pleasure of sports comes from your team winning, but there would be no pleasure in it at all if your team was guaranteed to win every time like the Harlem Globetrotters versus the Washington Generals.”
“On the other hand, scientific knowledge is that which tends to become increasingly less arguable (which might help explain why Nielsen ratings are higher for football games than for chemistry documentaries).”
http://takimag.com/article/quibbling_rivalry/print#ixzz3zMqB4099
Feynman commented on this. The fact that we *don’t* have the answer to a question is what lets people debate it in the first place. As soon as there is an answer, or in this case a strong probability, it’s not interesting nor debatable.
It’s probably one of the top basic learning heuristics one would put in a system that had limited resources to discover the world, eh?
“As soon as there is an answer, or in this case a strong probability, it’s not interesting nor debatable.”
Right. We can, say, predict accurately when the sun will come up for the next 1000 years, but everybody finds that boring, even though it’s an amazing accomplishment.
We’re more excited over whether the Panthers can cover the point spread tomorrow, because the point spread has been set to get half the betting public to think it’s too high and half to think it’s too low.
The superforecasters are a numerate bunch: many know about Bayes’ theorem and could deploy it if they felt it was worth the trouble. But they rarely crunch the numbers so explicitly. What matters far more to the superforecasters than Bayes’ theorem is Bayes’ core insight of gradually getting closer to the truth by constantly updating in proportion to the weight of the evidence.
This is spot on. When I teach probability theory in my Philosophy of Science classes, I try to impress upon my students the importance of a conceptual understanding of probabilistic reasoning, rather than explicitly crunching numbers whenever you’re trying to think through something. Learning probability theory is important largely because of methodological ideas it gives you a deeper insight into: the importance of surprising predictions, the value of varied evidence, the weakness of arguments from absence, and so on.
Basically, you read a book that suggests that the way you did your https://slatestarcodex.com/2016/01/25/predictions-for-2016/ is systematically flawed, and then you ignore what it tells you about doing things wrong and claim it didn’t tell you anything new.
Tetlock’s claim that it’s important to make finely grained predictions instead of doing them in 5% intervals should matter a great deal to how we predict.
I think you’re putting the cart before the cargo here. The fine-grained predictions came about as a result of the research-and-math-based predictions of the forecasters – just making your numbers less round, without doing the research and math, is unlikely to help.
Whatever Tetlock’s research says, it’s a mathematical fact that rounding to 5% increments moves any individual forecast by at most 2.5 percentage points, so it can’t hurt your accuracy by much. Many biases people study have an effect size much larger than that. Even if it is better to make finely-grained predictions, I don’t see why it should matter a great deal to how we predict.
But a requirement of 5% increments discourages more frequent updating than does allowing 1% increments. If you read an article about a local scandal involving Prime Minister Boratsky’s party, maybe you immediately ding his chances of making it through the year by, say, 1%. But if you have to wait until you’ve got enough new information to add or subtract 5%, you might forget it or get bored or whatever.
I joined the good judgment program for a time. I was very excited.
Confession time. I quit after a relatively short period because it was a) HARD and b) BORING.
The questions were amazingly esoteric and the predictions were really difficult to make satisfactorily. This is the hidden part of the post above. Being a superforecaster inevitably involves a great deal of WORK. That makes joining the GJP as a lark / side-project a rather unpleasant experience.
That I failed to foresee this is perhaps the strongest evidence of my unsuitability for the project.
” The year-to-year correlation in who was most accurate was 0.65; about 70% of superforecasters in the first year remained superforecasters in the second. This is definitely a real thing.”
I guess it just depends on the statistics of the problem domain, but to make it in the top 2% two years in a row struck me as really impressive.
Hmmm. Thinking about it, I’m not so sure “it’s a thing”. If the underlying process varies in time slowly, and this year’s state of the process happens to match my priors, then probably next year matches pretty well too.
I have to say that the idea that the book is useless for people familiar with Less Wrong rests on what I’ve considered (until now) a very deranged premise: that the Less Wrong ideas are actually useful. Given past failures at critical thinking training, that was unlikely, something the LW community is blissfully unaware of. Your writeup indicates that Less Wrong is at the very least on the right track.
Bruce Bueno De Mesquita wrote an interesting book on using game theory to improve predictions, “The Predictioneer”. I highly recommend it (as well as his book “The Dictator’s Handbook”). The back boasts that his predictions were better than those of the CIA’s experts. The study was done with the CIA, taking experts on (e.g.) Colombia and asking them to rate the most important people, how much control they had over decisions, and how strongly they felt about issues. He then asked them to estimate the chances of various events. De Mesquita built a game-theoretic computer model based on their knowledge, ran thousands of simulations of the game, and reached a prediction.
I read Predictioneer while also reading Taleb’s Black Swan. I was hoping to get two sides of an argument, but didn’t find much disagreement in the end. In fact there was an interesting connection on the Expert Problem. De Mesquita found that using CIA experts he could get good predictions, but that a PhD student studying the relevant country was sufficient. Being an expert doesn’t make someone a good predictor, and it’s not even necessary for being a good predictor. That seems to chime with Superforecasting.
I’ve no way to know if De Mesquita’s Game Theory models are necessarily the best. But it doesn’t seem crazy that a calculation humans are usually terrible at (predicting), could be aided with automated tools.
Careful here. Is there anything in the book that suggests that “superforecasting” (or even just forecasting) skills are teachable?
It seems to me that the LessWrong/CFAR community, with its insistence on teaching rationality, makes a much stronger claim than this book or the standard literature on cognitive biases.
“Is there anything in the book that suggests that “superforecasting” (or even just forecasting) skills are teachable?”
Yes, it says that a one-hour training on Bayesian probability, reference class forecasting, and suchlike gave a 10% accuracy improvement [content also contained in LW]. Scott refers to it near the beginning of his second review post:
https://slatestarcodex.com/2016/02/07/list-of-passages-i-highlighted-in-my-copy-of-superforecasting/
You select people for the accuracy of their forecasts and then add random noise (in the form of rounding off to the nearest 5%), and then – surprise! – those predictions do worse. Who would have thought…
Compare: I fit linear, polynomial, etc. regressions to some data and pick the best one. When I add random noise to its predictions, they get worse.
What can be concluded from this is that superforecasters are a group of people who a) have a lot of time to spend on this forecasting game and b) understand that it’s not a great idea to give probability > .9 in this scoring system.
Other than that, I don’t think there’s evidence that they have special insight.
The way that “accuracy” is computed is by squared error. So if it’s a binary event and I say .9, then if it happens, my error is (1 – .9)^2 + (0 – .1)^2 = .02. If it didn’t happen, the error would be big: (0 – .9)^2 + (1 – .1)^2 = 1.62.
So under this system, it’s very bad to say that something has probability 1 and have it not happen. Superforecasters don’t do this as often as other people.
Look at Figures 2 and 3 in the Mellers et al. 2014 paper in Psych Science… Every group except superforecasters does terribly when they say that something will happen with probability 1. Those events only happen like 70-80% of the time. Superforecasters don’t fall victim to this and probably only rarely say p=1. But look at the probability-trained group in Fig 2 vs. the superforecasters in Fig 3. The probability-trained group gets crushed when they say p=1. But other than that, they are *better* calibrated than the supers.
The other big thing, which the second paper doesn’t really mention, is how hard they were trying. “Independent forecasters made an average of 1.4 predictions per question, and regular teams made an average of 1.6. The surprising result was that superforecasters made an average of 7.8 predictions per question. Their engagement was extraordinary.” So clearly the superforecasters were actually trying at this game, constantly updating their probabilities as the facts changed, etc. And the other people didn’t care as much as they did.
From the other article: “We counted how often forecasters clicked on the news reader. During Years 2 and 3, superforecasters clicked on an average of 255 stories, significantly more than top-team individuals and all others, who clicked on 55 and 58 stories, respectively. In sum, several variables suggest that superforecasters were more committed to cultivating skills than were top-team individuals or all others.”
Even better, though, the lack of updating turns out to *explain* the p=1 fallacy. They even say this in the paper, but don’t connect it to the supers. “Calibration was worst at forecasts of 100%. The problem was a lack of updating. Forecasts of 100% made in the early days of a question were accurate only about 70% of the time. The same forecasts made in the final 20% of days of a question were correct about 90% of the time.” This is very compelling evidence that the superforecasters win because they take the time to fix their p=1 answers over the course of the competition.
The authors act like this “engagement” factor is another superpower of the superforecasters and it’s totally consistent with their explanation. But doesn’t it seem plausible that the non-supers just don’t bother to update their probabilities? And that the story here is “some people try hard at this game and other people don’t?”
I agree that the much higher effort was a huge part of why the supers were able to outperform, but there’s a lot more to it than that.
Setting aside the probability-1 forecasts, I think that the calibration of the probability-trained group was still only roughly even with the supers. In figure 3, their accuracy is slightly higher for 55, 65, 70, and 75%, while worse for 80, 85, 90, and 95%.
Regarding probability-1 forecasts… I don’t believe that the authors’ claim “The problem was a lack of updating” is accurate. Early forecasts of 100% were just horribly calibrated, and that is independent of future updates. In general, I suspect that Supers submitted very-high-probability forecasts somewhat more often than other groups. As Figure 4 points out, they had higher resolution and on average were submitting higher probabilities.
On a side note, Brier Scores are proper, in the sense that your expected score is maximized by answering with the true probability. If you know that something is 99% to happen, you should forecast 99%, despite the huge penalty you expect to receive 1% of the time.
I’m not convinced that we can tell from Fig 4 (or anywhere else in the paper or SI, unless I’m missing something… which is entirely possible since I haven’t looked exhaustively) what the distribution of guesses looks like for different groups. But sure, it’s possible that Supers are guessing p=1 as often as or more than others… but it’s also possible that they’re updating those p=1 guesses more often.
It seems very important that p=1 forecasts made at the beginning of the period are right only 70% of the time, p=1 forecasts made towards the end of the period are right 90% of the time, AND that the perhaps biggest difference reported in the paper between Supers and Regulars is that Supers update their probabilities about 5 times more frequently than non-Supers.
For a study like this, the Year 1 attrition was pretty low: 7%. It’s not clear to me how attrition was defined. But it would be very surprising if there weren’t a lot of “semi-attrition”: pockets of people who are doing the bare minimum to get by or who are only periodically paying attention.
Basically, my bet here would be that, if the Super group were determined strictly by level of engagement (ignoring anything about actual performance), the results would look similar, with a pocket of people doing better than everybody else. It would be interesting to test that.
I think you’re right that choosing supers based only on engagement level would lead to a pretty similar group selection. I’d also guess that part of the causation runs the other way: people who have poor forecasting results are more likely to get discouraged and decrease their engagement over time.
“What matters far more to the superforecasters than Bayes’ theorem is Bayes’ core insight of gradually getting closer to the truth by constantly updating in proportion to the weight of the evidence.”
Which neatly illustrates the main problem with Pop Bayesianism…it insists on gradualism, and doesn’t acknowledge the occasional need for backtracking and revolutionary change.
I’ll see your Bayes, and raise you some Popper and Kuhn.
Bayes may be an improvement on Aristotle and Wilson, but it can be improved on. You can get better results by trying to disprove yourself, and you can get better results by throwing out long-held principles that no longer work.