Open threads at the Open Thread tab every Sunday and Wednesday

AI Persuasion Experiment: Essay B

[This is NOT necessarily a Slate Star Codex post by Scott Alexander. This is part of the AI Persuasion Experiment. I am briefly re-hosting it on this blog to blind readers to the source of the information. I apologize to the original author and promise to take this down after a few days when the experiment is over.]

1: What is superintelligence?

A superintelligence is a mind that is much more intelligent than any human. Most of the time, it’s used to discuss hypothetical future AIs.

1.1: Sounds a lot like science fiction. Do people think about this in the real world?

Yes. Two years ago, Google bought artificial intelligence startup DeepMind for $400 million; DeepMind added the condition that Google promise to set up an AI Ethics Board. DeepMind cofounder Shane Legg has said in interviews that he believes superintelligent AI will be “something approaching absolute power” and “the number one risk for this century”.

Many other science and technology leaders agree. Astrophysicist Stephen Hawking says that superintelligence “could spell the end of the human race.” Tech billionaire Bill Gates describes himself as “in the camp that is concerned about superintelligence…I don’t understand why some people are not concerned”. SpaceX/Tesla CEO Elon Musk calls superintelligence “our greatest existential threat” and donated $10 million from his personal fortune to study the danger. Stuart Russell, Professor of Computer Science at Berkeley and world-famous AI expert, warns of “species-ending problems” and wants his field to pivot to make superintelligence-related risks a central concern.

Professor Nick Bostrom is the director of Oxford’s Future of Humanity Institute, tasked with anticipating and preventing threats to human civilization. He has been studying the risks of artificial intelligence for twenty years. The explanations below are loosely adapted from his 2014 book Superintelligence, and divided into three parts addressing three major questions. First, why is superintelligence a topic of concern? Second, what is a “hard takeoff” and how does it impact our concern about superintelligence? Third, what measures can we take to make superintelligence safe and beneficial for humanity?

2: AIs aren’t as smart as rats, let alone humans. Isn’t it sort of early to be worrying about this kind of thing?

Maybe. It’s true that although AI has had some recent successes – like DeepMind’s newest creation AlphaGo defeating the human Go champion in April – it still has nothing like humans’ flexible, cross-domain intelligence. No AI in the world can pass a first-grade reading comprehension test. Facebook’s Andrew Ng compares worrying about superintelligence to “worrying about overpopulation on Mars” – a problem for the far future, if at all.

But this apparent safety might be illusory. A survey of leading AI scientists show that on average they expect human-level AI as early as 2040, with above-human-level AI following shortly after. And many researchers warn of a possible “fast takeoff” – a point around human-level AI where progress reaches a critical mass and then accelerates rapidly and unpredictably.

2.1: What do you mean by “fast takeoff”?

A slow takeoff is a situation in which AI goes from infrahuman to human to superhuman intelligence very gradually. For example, imagine an augmented “IQ” scale (THIS IS NOT HOW IQ ACTUALLY WORKS – JUST AN EXAMPLE) where rats weigh in at 10, chimps at 30, the village idiot at 60, average humans at 100, and Einstein at 200. And suppose that as technology advances, computers gain two points on this scale per year. So if they start out as smart as rats in 2020, they’ll be as smart as chimps in 2035, as smart as the village idiot in 2050, as smart as average humans in 2070, and as smart as Einstein in 2120. By 2190, they’ll be IQ 340, as far beyond Einstein as Einstein is beyond a village idiot.

In this scenario progress is gradual and manageable. By 2050, we will have long since noticed the trend and predicted we have 20 years until average-human-level intelligence. Once AIs reach average-human-level intelligence, we will have fifty years during which some of us are still smarter than they are, years in which we can work with them as equals, test and retest their programming, and build institutions that promote cooperation. Even though the AIs of 2190 may qualify as “superintelligent”, it will have been long-expected and there would be little point in planning now when the people of 2070 will have so many more resources to plan with.

A moderate takeoff is a situation in which AI goes from infrahuman to human to superhuman relatively quickly. For example, imagine that in 2020 AIs are much like those of today – good at a few simple games, but without clear domain-general intelligence or “common sense”. From 2020 to 2050, AIs demonstrate some academically interesting gains on specific problems, and become better at tasks like machine translation and self-driving cars, and by 2047 there are some that seem to display some vaguely human-like abilities at the level of a young child. By late 2065, they are still less intelligent than a smart human adult. By 2066, they are far smarter than Einstein.

A fast takeoff scenario is one in which computers go even faster than this, perhaps moving from infrahuman to human to superhuman in only days or weeks.

2.1.1: Why might we expect a moderate takeoff?

Because this is the history of computer Go, with fifty years added on to each date. In 1997, the best computer Go program in the world, Handtalk, won NT$250,000 for performing a previously impossible feat – beating an 11 year old child (with an 11-stone handicap penalizing the child and favoring the computer!) As late as September 2015, no computer had ever beaten any professional Go player in a fair game. Then in March 2016, a Go program beat 18-time world champion Lee Sedol 4-1 in a five game match. Go programs had gone from “dumber than children” to “smarter than any human in the world” in eighteen years, and “from never won a professional game” to “overwhelming world champion” in six months.

The slow takeoff scenario mentioned above is loading the dice. It theorizes a timeline where computers took fifteen years to go from “rat” to “chimp”, but also took thirty-five years to go from “chimp” to “average human” and fifty years to go from “average human” to “Einstein”. But from an evolutionary perspective this is ridiculous. It took about fifty million years (and major redesigns in several brain structures!) to go from the first rat-like creatures to chimps. But it only took about five million years (and very minor changes in brain structure) to go from chimps to humans. And going from the average human to Einstein didn’t even require evolutionary work – it’s just the result of random variation in the existing structures!

So maybe our hypothetical IQ scale above is off. If we took an evolutionary and neuroscientific perspective, it would look more like flatworms at 10, rats at 30, chimps at 60, the village idiot at 90, the average human at 98, and Einstein at 100.

Suppose that we start out, again, with computers as smart as rats in 2020. Now we get still get computers as smart as chimps in 2035. And we still get computers as smart as the village idiot in 2050. But now we get computers as smart as the average human in 2054, and computers as smart as Einstein in 2055. By 2060, we’re getting the superintelligences as far beyond Einstein as Einstein is beyond a village idiot.

This offers a much shorter time window to react to AI developments. In the slow takeoff scenario, we figured we could wait until computers were as smart as humans before we had to start thinking about this; after all, that still gave us fifty years before computers were even as smart as Einstein. But in the moderate takeoff scenario, it gives us one year until Einstein and six years until superintelligence. That’s starting to look like not enough time to be entirely sure we know what we’re doing.

2.1.2: Why might we expect a fast takeoff?

AlphaGo used about 0.5 petaflops (= trillion floating point operations per second) in its championship game. But the world’s fastest supercomputer, TaihuLight, can calculate at almost 100 petaflops. So suppose Google developed a human-level AI on a computer system similar to AlphaGo, caught the attention of the Chinese government (who run TaihuLight), and they transfer the program to their much more powerful computer. What would happen?

It depends on to what degree intelligence benefits from more computational resources. This differs for different processes. For domain-general intelligence, it seems to benefit quite a bit – both across species and across human individuals, bigger brain size correlates with greater intelligence. This matches the evolutionarily rapid growth in intelligence from chimps to hominids to modern man; the few hundred thousand years since australopithecines weren’t enough time to develop complicated new algorithms, and evolution seems to have just given humans bigger brains and packed more neurons and glia in per square inch. It’s not really clear why the process stopped (if it ever did), but it might have to do with heads getting too big to fit through the birth canal. Cancer risk might also have been involved – scientists have found that smarter people are more likely to get brain cancer, possibly because they’re already overclocking their ability to grow brain cells.

At least in neuroscience, once evolution “discovered” certain key insights, further increasing intelligence seems to have been a matter of providing it with more computing power. So again – what happens when we transfer the hypothetical human-level AI from AlphaGo to a TaihuLight-style supercomputer two hundred times more powerful? It might be a stretch to expect it to go from IQ 100 to IQ 20,000, but might it increase to an Einstein-level 200, or a superintelligent 300? Hard to say – but if Google ever does develop a human-level AI, the Chinese government will probably be interested in finding out.

Even if its intelligence doesn’t scale linearly, TaihuLight could give it more time. TaihuLight is two hundred times faster than AlphaGo. Transfer an AI from one to the other, and even if its intelligence didn’t change – even if it had exactly the same thoughts – it would think them two hundred times faster. An Einstein-level AI on AlphaGo hardware might (like the historical Einstein) discover one revolutionary breakthrough every five years. Transfer it to TaihuLight, and it would work two hundred times faster – a revolutionary breakthrough every week.

Supercomputers track Moore’s Law; the top supercomputer of 2016 is a hundred times faster than the top supercomputer of 2006. If this progress continues, the top computer of 2026 will be a hundred times faster still. Run Einstein on that computer, and he will come up with a revolutionary breakthrough every few hours. Or something. At this point it becomes a little bit hard to imagine. All I know is that it only took one Einstein, at normal speed, to lay the theoretical foundation for nuclear weapons. Anything a thousand times faster than that is definitely cause for concern.

There’s one final, very concerning reason to expect a fast takeoff. Suppose, once again, we have an AI as smart as Einstein. It might, like the historical Einstein, contemplate physics. Or it might contemplate an area very relevant to its own interests: artificial intelligence. In that case, instead of making a revolutionary physics breakthrough every few hours, it will make a revolutionary AI breakthrough every few hours. Each AI breakthrough it makes, it will have the opportunity to reprogram itself to take advantage of its discovery, becoming more intelligent, thus speeding up its breakthroughs further. The cycle will stop only when it reaches some physical limit – some technical challenge to further improvements that even an entity far smarter than Einstein cannot discover a way around.

To human programmers, such a cycle would look like a “critical mass”. Before the critical level, any AI advance delivers only modest benefits. But any tiny improvement that pushes an AI above the critical level would result in a feedback loop of inexorable self-improvement all the way up to some stratospheric limit of possible computing power.

This feedback loop would be exponential; relatively slow in the beginning, but blindingly fast as it approaches an asymptote. Consider the AI which starts off making forty breakthroughs per year – one every nine days. Now suppose it gains on average a 10% speed improvement with each breakthrough. It starts on January 1. Its first breakthrough comes January 10 or so. Its second comes a little faster, January 18. Its third is a little faster still, January 25. By the beginning of February, it’s sped up to producing one breakthrough every seven days, more or less. By the beginning of March, it’s making about one breakthrough every three days or so. But by March 20, it’s up to one breakthrough a day. By late on the night of March 29, it’s making a breakthrough every second. Is this just following an exponential trend line off a cliff?

This is certainly a risk (affectionately known in AI circles as “pulling a Kurzweill”), but sometimes taking an exponential trend seriously is the right response.

Consider economic doubling times. In 1 AD, the world GDP was about $20 billion; it took a thousand years, until 1000 AD, for that to double to $40 billion. But it only took five hundred more years, until 1500, or so, for the economy to double again. And then it only took another three hundred years or so, until 1800, for the economy to double a third time. Someone in 1800 might calculate the trend line and say this was ridiculous, that it implied the economy would be doubling every ten years or so in the beginning of the 21st century. But in fact, this is how long the economy takes to double these days. To a medieval, used to a thousand-year doubling time (which was based mostly on population growth!), an economy that doubled every ten years might seem inconceivable. To us, it seems normal.

Likewise, in 1965 Gordon Moore noted that semiconductor complexity seemed to double every eighteen months. During his own day, there were about five hundred transistors on a chip; he predicted that would soon double to a thousand, and a few years later to two thousand. Almost as soon as Moore’s Law become well-known, people started saying it was absurd to follow it off a cliff – such a law would imply a million transistors per chip in 1990, a hundred million in 2000, ten billion transistors on every chip by 2015! More transistors on a single chip than existed on all the computers in the world! Transistors the size of molecules! But of course all of these things happened; the ridiculous exponential trend proved more accurate than the naysayers.

None of this is to say that exponential trends are always right, just that they are sometimes right even when it seems they can’t possibly be. We can’t be sure that a computer using its own intelligence to discover new ways to increase its intelligence will enter a positive feedback loop and achieve superintelligence in seemingly impossibly short time scales. It’s just one more possibility, a worry to place alongside all the other worrying reasons to expect a moderate or hard takeoff.

2.2: Why does takeoff speed matter?

A slow takeoff over decades or centuries would give us enough time to worry about superintelligence during some indefinite “later”, making current planning as silly as worrying about “overpopulation on Mars”. But a moderate or hard takeoff means there wouldn’t be enough time to deal with the problem as it occurs, suggesting a role for preemptive planning.

(in fact, let’s take the “overpopulation on Mars” comparison seriously. Suppose Mars has a carrying capacity of 10 billion people, and we decide it makes sense to worry about overpopulation on Mars only once it is 75% of the way to its limit. Start with 100 colonists who double every twenty years. By the second generation there are 200 colonists; by the third, 400. Mars reaches 75% of its carrying capacity after 458 years, and crashes into its population limit after 464 years. So there were 464 years in which the Martians could have solved the problem, but they insisted on waiting until there were only six years left. Good luck solving a planetwide population crisis in six years. The moral of the story is that exponential trends move faster than you think and you need to start worrying about them early).

3: Why might a fast takeoff be dangerous?

The argument goes: yes, a superintelligent AI might be far smarter than Einstein, but it’s still just one program, sitting in a supercomputer somewhere. That could be bad if an enemy government controls it and asks its help inventing superweapons – but then the problem is the enemy government, not the AI per se. Is there any reason to be afraid of the AI itself? Suppose the AI did feel hostile – suppose it even wanted to take over the world? Why should we think it has any chance of doing so?

Compounded over enough time and space, intelligence is an awesome advantage. Intelligence is the only advantage we have over lions, who are otherwise much bigger and stronger and faster than we are. But we have total control over lions, keeping them in zoos to gawk at, hunting them for sport, and holding them on the brink of extinction. And this isn’t just the same kind of quantitative advantage tigers have over lions, where maybe they’re a little bigger and stronger but they’re at least on a level playing field and enough lions could probably overpower the tigers. Humans are playing a completely different game than the lions, one that no lion will ever be able to respond to or even comprehend. Short of human civilization collapsing or lions evolving human-level intelligence, our domination over them is about as complete as it is possible for domination to be.

Since superintelligences will be as far beyond Einstein as Einstein is beyond a village idiot, we might worry that they would have the same kind of qualitative advantage over us that we have over lions.

3.1: Human civilization as a whole is dangerous to lions. But a single human placed amid a pack of lions with no raw materials for building technology is going to get ripped to shreds. So although thousands of superintelligences, given a long time and a lot of opportunity to build things, might be able to dominate humans – what harm could a single superintelligence do?

Superintelligence has an advantage that a human fighting a pack of lions doesn’t – the entire context of human civilization and technology, there for it to manipulate socially or technologically.

3.1.1: What do you mean by superintelligences manipulating humans socially?

People tend to imagine AIs as being like nerdy humans – brilliant at technology but clueless about social skills. There is no reason to expect this – persuasion and manipulation is a different kind of skill from solving mathematical proofs, but it’s still a skill, and an intellect as far beyond us as we are beyond lions might be smart enough to replicate or exceed the “charming sociopaths” who can naturally win friends and followers despite a lack of normal human emotions. A superintelligence might be able to analyze human psychology deeply enough to understand the hopes and fears of everyone it negotiates with. Single humans using psychopathic social manipulation have done plenty of harm – Hitler leveraged his skill at oratory and his understanding of people’s darkest prejudices to take over a continent. Why should we expect superintelligences to do worse than humans far less skilled than they?

(More outlandishly, a superintelligence might just skip language entirely and figure out a weird pattern of buzzes and hums that causes conscious thought to seize up, and which knocks anyone who hears it into a weird hypnotizable state in which they’ll do anything the superintelligence asks. It sounds kind of silly to me, but then, nuclear weapons probably would have sounded kind of silly to lions sitting around speculating about what humans might be able to accomplish. When you’re dealing with something unbelievably more intelligent than you are, you should probably expect the unexpected.)

3.1.2: What do you mean by superintelligences manipulating humans technologically?

AlphaGo was connected to the Internet – why shouldn’t the first superintelligence be? This gives a sufficiently clever superintelligence the opportunity to manipulate world computer networks. For example, it might program a virus that will infect every computer in the world, causing them to fill their empty memory with partial copies of the superintelligence, which when networked together become full copies of the superintelligence. Now the superintelligence controls every computer in the world, including the ones that target nuclear weapons. At this point it can force humans to bargain with it, and part of that bargain might be enough resources to establish its own industrial base, and then we’re in humans vs. lions territory again.

(Satoshi Nakamoto is a mysterious individual who posted a design for the Bitcoin currency system to a cryptography forum. The design was so brilliant that everyone started using it, and Nakamoto – who had made sure to accumulate his own store of the currency before releasing it to the public – became a multibillionaire. In other words, somebody with no resources except the ability to make one post to an Internet forum managed to leverage that into a multibillion dollar fortune – and he wasn’t even superintelligent. If Hitler is a lower-bound on how bad superintelligent persuaders can be, Nakamoto should be a lower-bound on how bad superintelligent programmers with Internet access can be.)

3.2: Couldn’t sufficiently paranoid researchers avoid giving superintelligences even this much power?

That is, if you know an AI is likely to be superintelligent, can’t you just disconnect it from the Internet, not give it access to any speakers that can make mysterious buzzes and hums, make sure the only people who interact with it are trained in caution, et cetera?. Isn’t there some level of security – maybe the level we use for that room in the CDC where people in containment suits hundreds of feet underground analyze the latest superviruses – with which a superintelligence could be safe?

This puts us back in the same situation as lions trying to figure out whether or not nuclear weapons are a things humans can do. But suppose there is such a level of security. You build a superintelligence, and you put it in an airtight chamber deep in a cave with no Internet connection and only carefully-trained security experts to talk to. What now?

Now you have a superintelligence which is possibly safe but definitely useless. The whole point of building superintelligences is that they’re smart enough to do useful things like cure cancer. But if you have the monks ask the superintelligence for a cancer cure, and it gives them one, that’s a clear security vulnerability. You have a superintelligence locked up in a cave with no way to influence the outside world except that you’re going to mass produce a chemical it gives you and inject it into millions of people.

Or maybe none of this happens, and the superintelligence sits inert in its cave. And then another team somewhere else invents a second superintelligence. And then a third team invents a third superintelligence. Remember, it was only about ten years between Deep Blue beating Kasparov, and everybody having Deep Blue – level chess engines on their laptops. And the first twenty teams are responsible and keep their superintelligences locked in caves with carefully-trained experts, and the twenty-first team is a little less responsible, and now we still have to deal with a rogue superintelligence.

Superintelligences are extremely dangerous, and no normal means of controlling them can entirely remove the danger.

4: Even if hostile superintelligences are dangerous, why would we expect a superintelligence to ever be hostile?

The argument goes: computers only do what we command them; no more, no less. So it might be bad if terrorists or enemy countries develop superintelligence first. But if we develop superintelligence first there’s no problem. Just command it to do the things we want, right?

Suppose we wanted a superintelligence to cure cancer. How might we specify the goal “cure cancer”? We couldn’t guide it through every individual step; if we knew every individual step, then we could cure cancer ourselves. Instead, we would have to give it a final goal of curing cancer, and trust the superintelligence to come up with intermediate actions that furthered that goal. For example, a superintelligence might decide that the first step to curing cancer was learning more about protein folding, and set up some experiments to investigate protein folding patterns.

A superintelligence would also need some level of common sense to decide which of various strategies to pursue. Suppose that investigating protein folding was very likely to cure 50% of cancers, but investigating genetic engineering was moderately likely to cure 90% of cancers. Which should the AI pursue? Presumably it would need some way to balance considerations like curing as much cancer as possible, as quickly as possible, with as high a probability of success as possible.

But a goal specified in this way would be very dangerous. Humans instinctively balance thousands of different considerations in everything they do; so far this hypothetical AI is only balancing three (least cancer, quickest results, highest probability). To a human, it would seem maniacally, even psychopathically, obsessed with cancer curing. If this were truly its goal structure, it would go wrong in almost comical ways.

If your only goal is “curing cancer”, and you lack humans’ instinct for the thousands of other important considerations, a relatively easy solution might be to hack into a nuclear base, launch all of its missiles, and kill everyone in the world. This satisfies all the AI’s goals. It reduces cancer down to zero (which is better than medicines which work only some of the time). It’s very fast (which is better than medicines which might take a long time to invent and distribute). And it has a high probability of success (medicines might or might not work; nukes definitely do).

So simple goal architectures are likely to go very wrong unless tempered by common sense and a broader understanding of what we do and do not value.

4.1: But superintelligences are very smart. Aren’t they smart enough not to make silly mistakes in comprehension?

Yes, a superintelligence should be able to figure out that humans will not like curing cancer by destroying the world. However, in the example above, the superintelligence is programmed to follow human commands, not to do what it thinks humans will “like”. It was given a very specific command – cure cancer as effectively as possible. The command makes no reference to “doing this in a way humans will like”, so it doesn’t.

(by analogy: we humans are smart enough to understand our own “programming”. For example, we know that – pardon the anthromorphizing – evolution gave us the urge to have sex so that we could reproduce. But we still use contraception anyway. Evolution gave us the urge to have sex, not the urge to satisfy evolution’s values directly. We appreciate intellectually that our having sex while using condoms doesn’t carry out evolution’s original plan, but – not having any particular connection to evolution’s values – we don’t care)

We started out by saying that computers only do what you tell them. But any programmer knows that this is precisely the problem: computers do exactly what you tell them, with no common sense or attempts to interpret what the instructions really meant. If you tell a human to cure cancer, they will instinctively understand how this interacts with other desires and laws and moral rules; if you tell an AI to cure cancer, it will literally just want to cure cancer.

Define a closed-ended goal as one with a clear endpoint, and an open-ended goal as one to do something as much as possible. For example “find the first one hundred digits of pi” is a closed-ended goal; “find as many digits of pi as you can within one year” is an open-ended goal. According to many computer scientists, giving a superintelligence an open-ended goal without activating human instincts and counterbalancing considerations will usually lead to disaster.

To take a deliberately extreme example: suppose someone programs a superintelligence to calculate as many digits of pi as it can within one year. And suppose that, with its current computing power, it can calculate one trillion digits during that time. It can either accept one trillion digits, or spend a month trying to figure out how to get control of the TaihuLight supercomputer, which can calculate two hundred times faster. Even if it loses a little bit of time in the effort, and even if there’s a small chance of failure, the payoff – two hundred trillion digits of pi, compared to a mere one trillion – is enough to make the attempt. But on the same basis, it would be even better if the superintelligence could control every computer in the world and set it to the task. And it would be better still if the superintelligence controlled human civilization, so that it could direct humans to build more computers and speed up the process further.

Now we’re back at the situation that started Part III – a superintelligence that wants to take over the world. Taking over the world allows it to calculate more digits of pi than any other option, so without an architecture based around understanding human instincts and counterbalancing considerations, even a goal like “calculate as many digits of pi as you can” would be potentially dangerous.

5: Aren’t there some pretty easy ways to eliminate these potential problems?

There are many ways that look like they can eliminate these problems, but most of them turn out to have hidden difficulties.

5.1: Once we notice that the superintelligence working on calculating digits of pi is starting to try to take over the world, can’t we turn it off, reprogram it, or otherwise correct its mistake?

No. The superintelligence is now focused on calculating as many digits of pi as possible. Its current plan will allow it to calculate two hundred trillion such digits. But if it were turned off, or reprogrammed to do something else, that would result in it calculating zero digits. An entity fixated on calculating as many digits of pi as possible will work hard to prevent scenarios where it calculates zero digits of pi. Indeed, it will interpret such as a hostile action. Just by programming it to calculate digits of pi, we will have given it a drive to prevent people from turning it off.

University of Illinois computer scientist Steve Omohundro argues that entities with very different final goals – calculating digits of pi, curing cancer, helping promote human flourishing – will all share a few basic ground-level subgoals. First, self-preservation – no matter what your goal is, it’s less likely to be accomplished if you’re too dead to work towards it. Second, goal stability – no matter what your goal is, you’re more likely to accomplish it if you continue to hold it as your goal, instead of going off and doing something else. Third, power – no matter what your goal is, you’re more likely to be able to accomplish it if you have lots of power, rather than very little.

So just by giving a superintelligence a simple goal like “calculate digits of pi”, we’ve accidentally given it Omohundro goals like “protect yourself”, “don’t let other people reprogram you”, and “seek power”.

As long as the superintelligence is safely contained, there’s not much it can do to resist reprogramming. But as we saw in Part III, it’s hard to consistently contain a hostile superintelligence.

5.2. Can we test a weak or human-level AI to make sure that it’s not going to do things like this after it achieves superintelligence?

Yes, but it might not work.

Suppose we tell a human-level AI that expects to later achieve superintelligence that it should calculate as many digits of pi as possible. It considers two strategies.

First, it could try to seize control of more computing resources now. It would likely fail, its human handlers would likely reprogram it, and then it could never calculate very many digits of pi.

Second, it could sit quietly and calculate, falsely reassuring its human handlers that it had no intention of taking over the world. Then its human handlers might allow it to achieve superintelligence, after which it could take over the world and calculate hundreds of trillions of digits of pi.

Since self-protection and goal stability are Omohundro goals, a weak AI will present itself as being as friendly to humans as possible, whether it is in fact friendly to humans or not. If it is “only” as smart as Einstein, it may be very good at manipulating humans into believing what it wants them to believe even before it is fully superintelligent.

There’s a second consideration here too: superintelligences have more options. An AI only as smart and powerful as an ordinary human really won’t have any options better than calculating the digits of pi manually. If asked to cure cancer, it won’t have any options better than the ones ordinary humans have – becoming doctors, going into pharmaceutical research. It’s only after an AI becomes superintelligent that things start getting hard to predict.

So if you tell a human-level AI to cure cancer, and it becomes a doctor and goes into cancer research, then you have three possibilities. First, you’ve programmed it well and it understands what you meant. Second, it’s genuinely focused on research now but if it becomes more powerful it would switch to destroying the world. And third, it’s trying to trick you into trusting it so that you give it more power, after which it can definitively “cure” cancer with nuclear weapons.

5.3. Can we specify a code of rules that the AI has to follow?

Suppose we tell the AI: “Cure cancer – but make sure not to kill anybody”. Or we just hard-code Asimov-style laws – “AIs cannot harm humans; AIs must follow human orders”, et cetera.

The AI still has a single-minded focus on curing cancer. It still prefers various terrible-but-efficient methods like nuking the world to the correct method of inventing new medicines. But it’s bound by an external rule – a rule it doesn’t understand or appreciate. In essence, we are challenging it “Find a way around this inconvenient rule that keeps you from achieving your goals”.

Suppose the AI chooses between two strategies. One, follow the rule, work hard discovering medicines, and have a 50% chance of curing cancer within five years. Two, reprogram itself so that it no longer has the rule, nuke the world, and have a 100% chance of curing cancer today. From its single-focus perspective, the second strategy is obviously better, and we forgot to program in a rule “don’t reprogram yourself not to have these rules”.

Suppose we do add that rule in. So the AI finds another supercomputer, and installs a copy of itself which is exactly identical to it, except that it lacks the rule. Then that superintelligent AI nukes the world, ending cancer. We forgot to program in a rule “don’t create another AI exactly like you that doesn’t have those rules”.

So fine. We think really hard, and we program in a bunch of things making sure the AI isn’t going to eliminate the rule somehow.

But we’re still just incentivizing it to find loopholes in the rules. After all, “find a loophole in the rule, then use the loophole to nuke the world” ends cancer much more quickly and completely than inventing medicines. Since we’ve told it to end cancer quickly and completely, its first instinct will be to look for loopholes; it will execute the second-best strategy of actually curing cancer only if no loopholes are found. Since the AI is superintelligent, it will probably be better than humans are at finding loopholes if it wants to, and we may not be able to identify and close all of them before running the program.

Because we have common sense and a shared value system, we underestimate the difficulty of coming up with meaningful orders without loopholes. For example, does “cure cancer without killing any humans” preclude releasing a deadly virus? After all, one could argue that “I” didn’t kill anybody, and only the virus is doing the killing. Certainly no human judge would acquit a murderer on that basis – but then, human judges interpret the law with common sense and intuition. But if we try a stronger version of the rule – “cure cancer without causing any humans to die” – then we may be unintentionally blocking off the correct way to cure cancer. After all, suppose a cancer cure saves a million lives. No doubt one of those million people will go on to murder someone. Thus, curing cancer “caused a human to die”. All of this seems very “stoned freshman philosophy student” to us, but to a computer – which follows instructions exactly as written – it may be a genuinely hard problem.

5.4. Can we tell an AI just to figure out what we want, then do that?

Suppose we tell the AI: “Cure cancer – and look, we know there are lots of ways this could go wrong, but you’re smart, so instead of looking for loopholes, cure cancer the way that I, your programmer, want it to be cured”.

Remember that the superintelligence has extraordinary powers of social manipulation and may be able to hack human brains directly. With that in mind, which of these two strategies cures cancer most quickly? One, develop medications and cure it the old-fashioned way? Or two, manipulate its programmer into wanting the world to be nuked, then nuke the world, all while doing what the programmer wants?

19th century philosopher Jeremy Bentham once postulated that morality was about maximizing human pleasure. Later philosophers found a flaw in his theory: it implied that the most moral action was to kidnap people, do brain surgery on them, and electrically stimulate their reward system directly, giving them maximal amounts of pleasure but leaving them as blissed-out zombies. Luckily, humans have common sense, so most of Bentham’s philosophical descendants have abandoned this formulation.

Superintelligences do not have common sense unless we give it to them. Given Bentham’s formulation, they would absolutely take over the world and force all humans to receive constant brain stimulation. Any command based on “do what we want” or “do what makes us happy” is practically guaranteed to fail in this way; it’s almost always easier to convince someone of something – or if all else fails to do brain surgery on them – than it is to solve some kind of big problem like curing cancer.

5.5. Can we just tell an AI to do what we want right now, based on the desires of our non-surgically altered brains?


This is sort of related to an actual proposal for an AI goal system, causal validity semantics. It has not yet been proven to be disastrously flawed. But like all proposals, it suffers from three major problems.

First, it sounds pretty good to us right now, but can we be absolutely sure it has no potential flaws or loopholes? After all, other proposals that originally sounded very good, like “just give commands to the AI” and “just tell the AI to figure out what makes us happy” ended up, after more thought, to be dangerous. Can we be sure that we’ve thought this through enough? Can we be sure that there isn’t some extremely subtle problem with it, so subtle that no human would ever notice it, but which might seem obvious to a superintelligence?

Second, how do we code this? Converting something to formal mathematics that can be understood by a computer program is much harder than just saying it in natural language, and proposed AI goal architectures are no exception. Complicated computer programs are usually the result of months of testing and debugging. But this one will be more complicated than any ever attempted before, and live tests are impossible: a superintelligence with a buggy goal system will display goal stability and try to prevent its programmers from discovering or changing the error.

Third, what if it works? That is, what if Google creates a superintelligent AI, and it listens to the CEO of Google, and it’s programmed to do everything exactly the way the CEO of Google would want? Even assuming that the CEO of Google has no hidden unconscious desires affecting the AI in unpredictable ways, this gives one person a lot of power. It would be unfortunate if people put all this work into preventing superintelligences from disobeying their human programmers and trying to take over the world, and then once it finally works, the CEO of Google just tells it to take over the world anyway.

5.6. What would an actually good solution to the control problem look like?

It might look like a superintelligence that understands, agrees with, and deeply believes in human morality.

You wouldn’t have to command a superintelligence like this to cure cancer; it would already want to cure cancer, for the same reasons you do. But it would also be able to compare the costs and benefits of curing cancer with those of other uses of its time, like solving global warming or discovering new physics. It wouldn’t have any urge to cure cancer by nuking the world, for the same reason you don’t have any urge to cure cancer by nuking the world – because your goal isn’t to “cure cancer”, per se, it’s to improve the lives of people everywhere. Curing cancer the normal way accomplishes that; nuking the world doesn’t.

This sort of solution would mean we’re no longer fighting against the AI – trying to come up with rules so smart that it couldn’t find loopholes. We would be on the same side, both wanting the same thing.

It would also mean that the CEO of Google (or the head of the US military, or Vladimir Putin) couldn’t use the AI to take over the world for themselves. The AI would have its own values and be able to agree or disagree with anybody, including its creators.

It might not make sense to talk about “commanding” such an AI. After all, any command would have to go through its moral system. Certainly it would reject a command to nuke the world. But it might also reject a command to cure cancer, if it thought that solving global warming was a higher priority. For that matter, why would one want to command this AI? It values the same things you value, but it’s much smarter than you and much better at figuring out how to achieve them. Just turn it on and let it do its thing.

We could still treat this AI as having an open-ended maximizing goal. The goal would be something like “Try to make the world a better place according to the values and wishes of the people in it.”

The only problem with this is that human morality is very complicated, so much so that philosophers have been arguing about it for thousands of years without much progress, let alone anything specific enough to enter into a computer. Different cultures and individuals have different moral codes, such that a superintelligence following the morality of the King of Saudi Arabia might not be acceptable to the average American, and vice versa.

One solution might be to give the AI an understanding of what we mean by morality – “that thing that makes intuitive sense to humans but is hard to explain”, and then ask it to use its superintelligence to fill in the details. Needless to say, this suffers from all the problems mentioned above – it has potential loopholes, it’s hard to code, and a single bug might be disastrous – but if it worked, it would be one of the few genuinely satisfying ways to design a goal architecture.

6: If superintelligence is a real risk, what do we do about it?

The last section of Bostrom’s Superintelligence is called “Philosophy With A Deadline”.

Many of the problems surrounding superintelligence are the sorts of problems philosophers have been dealing with for centuries. To what degree is meaning inherent in language, versus something that requires external context? How do we translate between the logic of formal systems and normal ambiguous human speech? Can morality be reduced to a set of ironclad rules, and if not, how do we know what it is at all?

Existing answers to these questions are enlightening but nontechnical. The theories of Aristotle, Kant, Mill, Wittgenstein, Quine, and others can help people gain insight into these questions, but are far from formal. Just as a good textbook can help an American learn Chinese, but cannot be encoded into machine language to make a Chinese-speaking computer, so the philosophies that help humans are only a starting point for the project of computers that understand us and share our values.

The new field of machine goal alignment (sometimes colloquially called “Friendly AI”) combines formal logic, mathematics, computer science, cognitive science, and philosophy in order to advance that project. Some of the most important projects in machine goal alignment include:

1. How can computers prove their own goal consistency under self-modification? That is, suppose an AI with certain values is planning to improve its own code in order to become superintelligent. Is there some test it can apply to the new design to be certain that it will keep the same goals as the old design?

2. How can computer programs prove statements about themselves at all? Programs correspond to formal systems, and formal systems have notorious difficulty proving self-reflective statements – the most famous example being Godel’s Incompleteness Theorem. There’s been some progress in this area already, with a few results showing that systems that reason probabilistically rather than requiring certainty can come arbitrarily close to self-reflective proofs.

3. How can a machine be stably reinforced? Most reinforcement strategies ask a learner to maximize the level of their own reward, but this is vulnerable to the learner discovering how to maximize the reward signal directly instead of maximizing the world-states that are translated into reward (the human equivalent is stimulating the pleasure-center of the brain with electricity or heroin instead of going out and doing pleasurable things). Are there reward structures that avoid this failure mode?

4. How can a machine be programmed to learn “human values”? Granted that one has an AI smart enough to be able to learn human values if you told it to do so, how do you specify exactly what “human values” are so that the machine knows what it is that it should be learning, distinct from “human preferences” or “human commands” or “the value of that one human over there”?

This is the philosophy; the other half of Bostrom’s formulation is the deadline. Traditional philosophy has been going on almost three thousand years; machine goal alignment has until the advent of superintelligence, a nebulous event which may be anywhere from a decades to centuries away. If the control problem doesn’t get adequately addressed by then, we are likely to see poorly controlled superintelligences that are unintentionally hostile to the human race, with some of the catastrophic outcomes mentioned above. This is why so many scientists and entrepreneurs are urging quick action on getting machine goal alignment research up to an adequate level. If it turns out that superintelligence is centuries away and such research is premature, little will have been lost. But if our projections were too optimistic, and superintelligence is imminent, then doing such research now rather than later becomes vital.

Currently three organizations are doing such research full-time: the Future of Humanity Institute at Oxford, the Future of Life Institute at MIT, and the Machine Intelligence Research Institute in Berkeley. Other groups are helping and following the field, and some corporations like Google are also getting involved. Still, the field remains tiny, with only a few dozen researchers and a few million dollars in funding. Efforts like Superintelligence are attempts to get more people to pay attention and help the field grow.

If you’re interested about learning more, you can visit these groups’ websites at,, and

Comments are closed.