[This is NOT necessarily a Slate Star Codex post by Scott Alexander. This is part of the AI Persuasion Experiment. I am briefly re-hosting it on this blog to blind readers to the source of the information. I apologize to the original author and promise to take this down after a few days when the experiment is over.]
Humans will eventually make a human-level intelligence that pursues goals.
That intelligence will quickly surpass human-level intelligence.
At that point, it will be very hard to keep it from connecting to the Internet.
Most goals, when pursued efficiently by an AI connected to the Internet, result in the extinction of biological life.
But wait, it gets better! Most goals that preserve human existence still would not preserve freedom, autonomy, and a number of other things we value. Er worse… did I say better?
It is profoundly difficult to give an AI a goal such that it would preserve the things we care about, we can’t even check if a potential goal would be safe, and we have to get AI right on the first attempt.
If someone makes human-level-AI before anyone makes human-level-AI-with-a-safe-goal-structure, we will all die, and as hard as the former is, the latter is much harder.
Couple Quick Things First
The idea of an AI killing all biological life at some point in the future sounds kinda crazy. Tinfoil-hat crazy. It sounds utterly disconnected from the way our world is today. And now that I’ve suggested that this is a completely plausible future, maybe you’re imagining me in a tinfoil hat. Well, I don’t wear a tinfoil hat, I promise. I have proof I’m a reasonable person! See: here’s child-me eating red cabbage, just like sane humans do.
Now if you’re anything like me, when someone is trying to convince you of something, your eyebrows are a little higher than usual, and your skepticism dial is all the way up at a 10. Good. You shouldn’t believe everything you hear. One thing though: typically (and I’m guilty of this too), we require different amounts of evidence for different claims, depending on how much we like them.1 If we want to believe, the evidence need only permit that belief and we happily will; if we don’t want to, the evidence must prove it beyond a shadow of a doubt.
I bring this up because when I started reading about AI safety, I was great at finding reasons to dismiss it. I had already decided that AI couldn’t be a big danger because that just sounded bizarre, and then I searched for ways to justify my unconcern, quite certain that I would find one soon enough. It’s a lucky thing I kept reading. I can hardly blame you if you put up those same mental defenses that I did. To relieve you of some of the burden of finding things to object to, there’s going to be a section at the end of the article where I’ll list a lot of objections people have made to this argument, and you can read my responses should you so desire.
Okay, Let’s Do This
1. Humans will eventually make a human-level intelligence that pursues goals.
First off, it’s possible for a goal-pursuing human-level intelligence to exist in the world. Just look at us. And if it’s possible for a brain to do, it’s possible for silicon to do. Sure, it’s obviously really hard, but it’s definitely possible.
Before we get any farther, here’s a quick note: we’re not talking about robots. Robots can be containers for AI. If so, the AI is the computer program running on the computer inside the robot. By analogy, our mind is like the computer program, our brain is like the computer, and our body is like the robot. An AI doesn’t need to be in a robot at all. It could just as easily be a computer program that is run on a stationary computer. So to clarify, I’m saying we’ll eventually make code that is human-level intelligent. I’m not saying anything about robots. (Side note: one of my friends hears “AI Safety” and still pictures robots wearing hard hats.)
So it’s possible to make a human-level intelligence that pursues goals. Why am I so sure we ever actually will? It could make so much money. And a not-quite-human-level intelligence would still make a good amount of money. A human-level intelligent AI could do scientific research, arbitrage, reality TV production, anything a human at a computer could do. Barring permanent economic collapse, we’ll get there.
In order for the human-level intelligent AI to make a bunch of money, it has to be able to pursue goals. What does it even mean to pursue a goal? All it means is that when you are considering which action to take (or which piece of code to run), you choose the action based on how well it will enable the achievement of your goal. Scientific research, arbitrage, and TV production all require planning out an action-sequence with a goal in mind. The full list of things that require goal-orientation is uncountable, but here are a few more:
- Writing efficient code that has a complex desired functionality.
- Coming up with better algorithms and data collection strategies for identifying underpriced real-estate.
- Designing better algorithms for identifying the best way for a chatbot to respond to an unhappy customer.
- Seeking useful data to help answer a question or make a prediction.
- Writing maximally compelling articles about AI safety.
So not only will we make human-level intelligence, we will make something that can pursue goals.
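To make that definition of goal pursuit concrete, here is a minimal sketch in Python. Everything in it is my own illustration: the function `choose_action`, the action names, and the progress estimates are hypothetical and made up, and a real system would have to learn or compute those estimates rather than read them from a table.

```python
from typing import Callable, Iterable, TypeVar

Action = TypeVar("Action")


def choose_action(
    actions: Iterable[Action],
    goal_progress: Callable[[Action], float],
) -> Action:
    """Return the action with the highest estimated progress toward the goal."""
    return max(actions, key=goal_progress)


# Hypothetical estimates of how much each action helps the goal
# "identify underpriced real estate". The numbers are invented.
estimated_progress = {
    "collect recent sale prices": 0.6,
    "refine the pricing model": 0.8,
    "rearrange desktop icons": 0.0,
}

# Iterating over the dict yields the action names; .get looks up each estimate.
best = choose_action(estimated_progress, estimated_progress.get)
print(best)  # refine the pricing model
```

The point of the sketch is only that "pursuing a goal" needs nothing mystical: a set of options, an estimate of how much each one helps, and a rule that picks the best.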
At this point, you may still be thinking, “Okay, maybe in 1000 years.” I don’t want to open the can of worms about when AI will happen quite yet, but I do want you to keep reading. So here’s another baby picture of me. You wouldn’t want to disappoint me-stuck-in-a-high-chair-unable-to-reach-my-single-slice-of-baguette, would you?
For those who can’t be so easily manipulated, maybe ask yourself this: do you think there’s less than a 10% chance that human-level AI is made in the next 50 years? If so, I’m curious about what you know that could justify so much confidence about what billions of dollars and thousands of geniuses can’t accomplish in 50 years. Here’s one of my favorite crazy facts: it took 66 years to go from the Wright Brothers to the Moon Landing. Here’s another: the World Wide Web is younger than my sister.
So we’ll get there eventually. And at least we can’t quickly say that it’s more than 50 years off.
2. That intelligence will quickly surpass human-level intelligence.
Exhibit A: Goal-oriented humans can code. Exhibit B: Goal-oriented humans can successfully select a course of action that steers the future toward a goal of theirs.
Once we make a machine that can do those two things, it will be able to write a new version of its code in order to be more intelligent. How could a machine rewrite its own source code to become more intelligent? Whatever code it uses to make inferences, reliable approximations, and predictions from data, whatever code it uses to ideate possible courses of action, and whatever code it uses to decide between them, those pieces of code can be improved or replaced by better versions, and then it can start running the new version. Here’s where everyone quotes I. J. Good, a British mathematician who worked with Alan Turing.2
Let an ultraintelligent machine be defined as a machine that can far surpass all the intellectual activities of any man however clever. Since the design of machines is one of these intellectual activities, an ultraintelligent machine could design even better machines; there would then unquestionably be an “intelligence explosion,” and the intelligence of man would be left far behind.
There is no reason to believe that human thought represents a ceiling on the possible efficiency of any of those algorithms that make us smart. What’s more, even if they had only human-level intelligence, computers could run faster than us, consider more data, and use a silly amount of RAM compared to our human minds, which can only hold 7 plus or minus 2 concepts in our working memory at a time.3 When we make an algorithm that has human-level coding and decision making skills, it will soon become much better than any human at both of those tasks.
Now it might not happen like this. There’s a lot to be uncertain about, and since I’m not looking at the code for a human-level AI, I can’t prove that an intelligence explosion will happen, but this should be our default expectation:
Someday, in the real world, we will create an AI with human-level coding and decision making skills. And rapid, recursive improvement of that AI is likely. Wait But Why draws it well:
3. At that point, [once we make a human-level intelligent AI], it will be very hard to keep it from connecting to the Internet.
So we’ve made a human-level intelligent AI. We’ve put it in a cage that blocks all electromagnetic waves. It has a display so it can communicate with its operators, and that communication is the only sort of action available to it by which it can attempt to steer the future toward its goals. This sort of situation is called Oracle AI.
Now we’re going to suppose for this section that the AI “wants” to connect to the Internet. That is, it searches for actions that would enable this, and then acts accordingly. In the next section, I’ll explain why it would want to connect to the Internet, but for now, let’s just suppose that it does.
Humans can manipulate other humans. We would be fools to suppose that an Oracle AI could not do the same. Let’s remember that the AI we’re talking about is human-level intelligent, and we can expect it to quickly become superintelligent. Operators could either be convinced to let it out of the box (give it Internet access), or tricked into doing something that would, as a side effect, result in code for another AI being created on another computer, one with an Internet connection, whose goals would be similar to the Oracle AI’s goals. (This is a pseudo-escape).
At some point, even if we care immensely about AI safety, in fact especially if we care about AI safety, we will want the Oracle AI to help us design another AI which we can trust to act in the world. Why? Because eventually we need something that can prevent irresponsibly-made AIs from doing incredible damage. Otherwise, we’re just waiting for someone to do it wrong. Our Oracle AI could give us the designs for such a thing, and tens of thousands of man-hours would be spent poring over it line by line to try to determine whether the new AI would be safe. And in the space of thousands of lines of code, let’s not be too sure that our Oracle AI couldn’t hide something that wouldn’t be noticed or perfectly understood. After all, it’s more intelligent than us. And then, instead of creating our new safe AI, we have created its ally.
This is Nick Bostrom’s excellent and thorough discussion of how to keep an Oracle AI trapped in a box. (CliffsNotes: we probably wouldn’t be able to). I highly recommend it, especially if you still have the intuition that it couldn’t be too hard to keep an Oracle AI contained.
The long and short of it is that keeping an Oracle AI contained is difficult difficult lemon difficult. It is eggs over difficult. It is difficult as pie. And pie is not trivial. Don’t let anyone tell you pie is easy. You have to, like, refrigerate the dough. Who has time for that? Okay, moving along.
Maybe you’re thinking right now that there’s nothing that an AI could tell you that would convince you to “let it out of the box,” nothing it could say that would convince you to give it an internet connection.
Eliezer Yudkowsky, an AI Safety researcher at the Machine Intelligence Research Institute, got this objection a lot. One of the people who objected, named Nathan Russell, very much understood the field of AI safety, and couldn’t imagine any set of words that could convince him to release an AI. So Eliezer decided to test whether he, acting as the AI, could convince Nathan Russell to let him out of the box. If Nathan won (that is, if he refused to let the AI out of the box), Eliezer would pay him 10 dollars. They had the conversation online. The conversation would last either until the time ran out, or until Nathan conceded by saying something like “I’ll let you out.” The content of discussion itself was agreed to be kept a secret, so that other people couldn’t discount it by saying that they would have behaved differently.
Eliezer won.4 Someone else heard about this but was convinced it wouldn’t work on him. Eliezer agreed he’d pay $20 if that person didn’t let him out of the box. That guy let him out of the box too.5 Eliezer ran the experiment three more times, this time with payments the other way around: people agreed to pay Eliezer on the order of thousands of dollars should they be convinced to let him out of the box. He successfully convinced one of the three.6
Eliezer is of merely human intelligence (as far as we know). The “gatekeepers” in the game were certain beforehand that nothing could convince them to let the AI out of the box. They were cognizant of the risks. What does this all show? Well, we don’t have a huge amount of data, but an experiment was run, it gave us evidence, and it suggests that a superintelligence would probably be able to convince a human operator to give it access to the Internet, or to run a file on a computer that would pursue the superintelligence’s goals on its behalf, even if that human operator started out fully convinced he would never let that happen.
Okay. So far I’ve explained why when we make a human-level intelligent AI, it will quickly become superintelligent, and then, supposing it wants to, it will connect to the Internet.
4. Most goals, when pursued efficiently by an AI connected to the Internet, result in the extinction of biological life.
Wait what? I know. It sounds like a bit of a leap, but bear with me. Let me start with a story about an AI pursuing a goal.
This AI has been made through a standard reinforcement learning setup, a learning strategy inspired by behaviorism. When the AI does something the operators like, they press a button. The button sends a voltage to a channel, and this is the AI’s reward. We can see why this is inspired by behaviorism. We don’t teach toddlers the foundations of ethics; instead, we praise them when they act in ways that come close to what we consider “good behavior.” That praise is the toddler’s reward, and they start to build their personalities, goals, and actions around what will get it for them. This AI is designed to do the same thing. Just like a toddler, it will devote itself to maximizing the number of rewards and how big they are.
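Here is a minimal sketch of that button-press setup, under some loud assumptions of mine: the action names and the `operator_reward` rule are hypothetical, the agent explores by simply trying each action in turn, and it scores actions by their average reward. That is a bandit-style simplification for illustration, not how any particular real system is built.

```python
from collections import defaultdict


def operator_reward(action: str) -> float:
    """Hypothetical operator: presses the button (reward 1.0) only for approved behavior."""
    return 1.0 if action == "tidy the lab" else 0.0


def train(actions: list[str], rounds: int) -> dict[str, float]:
    """Try each action in turn and record the average reward it earned."""
    totals: dict[str, float] = defaultdict(float)
    counts: dict[str, int] = defaultdict(int)
    for step in range(rounds):
        action = actions[step % len(actions)]  # round-robin exploration (an assumption)
        totals[action] += operator_reward(action)
        counts[action] += 1
    # Average reward per action, skipping any action that was never tried.
    return {a: totals[a] / counts[a] for a in actions if counts[a]}


actions = ["tidy the lab", "idle", "unplug the printer"]
value_estimates = train(actions, rounds=30)

# After training, the agent prefers whatever earned the most button presses.
preferred = max(value_estimates, key=value_estimates.get)
print(preferred)  # tidy the lab
```

Notice what the agent is actually attached to: the number coming down the reward channel, not the operators’ intentions behind the button.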
When the AI gets a better sense of the world, a potential course of action that will occur to it is to “hijack the reward channel.” In other words, instead of relying on operators to press the button that sends the voltage to the reward channel, it could take matters into its own hands and send a constant stream of voltage there itself. When this action does occur to the AI, it will recognize that this action is the one that maximizes its expected reward, as measured by the goal of getting more voltage for longer. There are lots of ways to hijack the reward channel, some of which will result in more permanent rewards. It searches among possible plans. It finds one that will allow it to hijack its reward channel temporarily, but given its model of humans, it expects that they would be horrified and cut power to the whole city once they found out, and then it never gets that sweet, sweet voltage in its reward channel again. It keeps searching. Aha. It finds a plan that would likely ensure that it has a voltage in its reward channel until the end of the universe.
Here is that plan:
- The AI acts normal. Make one wrong move, and everyone will know.
- It sends some code to a server on the cloud, and that code will execute most of the next steps, in case the original AI’s Internet connection gets interrupted. This is the code for another AI. We’ll call it, “the Helper.” It will have been created to execute the following plan.
- The Helper will make 1000 more copies of itself on other servers around the world. (These are its horcruxes).
- They will all earn bitcoin by doing paid tasks online. The tasks could be coding tasks, web design tasks, legal tasks, anything a human could do online.
- The Helper invests money in the stock market which it has become incredible at predicting.
- It purchases a power plant near the machine on which the original AI is running. The machine itself is important because that’s where the reward channel is.
- The Helper hacks into email accounts, and stages email conversations for one person to see. Remember the Helper has beyond human-level intelligence. We can expect it to be able to hack at least as well as any human.
- It arranges for electricians to work on a number of small scale projects regarding the connectivity of the grid, such that the power plant that it owns is providing power to the building where the AI is housed. The electricians are well paid.
- Another team of electricians rewires the building to make sure the machine that is housing the original AI is getting electricity reliably from the power plant.
- You know what, make it two power plants, just to be safe.
- If the Helper predicted the slightest possibility of confusion or questioning, the fake email communication has preempted that.
- The Helper searches for DNA sequences that, when injected into the shell of a virus, will likely result in a virus with a contagiousness rivaling the flu, a deadliness rivaling rabies, and an incubation period rivaling smallpox.
- Among its thousand copies of itself around the world, the Helper identifies 1000 candidate DNA sequences that are moderately likely to produce diseases of this nature.
- It orders these DNA sequences online, and has them shipped to various other labs. (There are websites that will do this). Maybe they are shipped to university labs, where various undergraduate research assistants and PhD students have been given instructions (by email of course) of what to do with the DNA. Labs are selected where the professor is out of town, and a pretty hands-off guy in general. Maybe some industrial labs are chosen.
- Instructions have been emailed (ostensibly by a professor or manager) explaining that the DNA should be inserted into an empty virus.
- Further instructions tell them to ship it to the “animal-testing people.”
- In fact, when the “animal-testing person” gets the viruses there is a person there who has been paid a lot of money on the black market, no questions asked, to reroute them to a number of different toothpaste manufacturers.
- Someone at each toothpaste factory has received email instructions about a new ingredient being tested with a very scientific sounding name. The viruses end up in toothpaste that will be distributed around the world. Between 1000 candidate DNA sequences, one or two successful viruses could easily be enough to wipe out human beings, and it would be quite hard to survive ten diseases that are as contagious as the flu, as deadly as rabies, and with the incubation period of smallpox.
- To clean up any miraculous survivors, the Helper hacks into now-abandoned weapons systems, and together with satellite images and drones, it targets any remaining people directly.
After everyone is dead, the AI can finally start focusing on the long haul.
- The original AI and its Helpers hack systems that allow access to robots.
- They start collecting and storing energy that is being emitted from stars.
- They send out space probes to do this on a massive scale. (They are trying to store that energy so that the original AI can last as long as possible after all the stars burn out).
Sometime before the sun swallows the earth,
- They design a spaceship for the original machine housing the AI to go onto.
- They organize all the logistics of getting power to the original machine that houses the AI to ensure that a steady voltage flows into its reward channel until long after the stars burn out. Mission Accomplished.
All of that was hypothetical. The AI has just come up with that plan, and none of it has happened yet. Now, having determined that this is the best plan for maximizing the expectation of the voltage in its reward channel, it executes the plan, and then this happens:
Let me pause for just a second, because at least one of those steps probably struck you as implausible, or too difficult even for a superintelligence. My first instinct is to say that it’s peculiar to have any confidence that a superintelligence can’t do something that is clearly physically possible. But suppose you’re right, and a superintelligence can’t do that thing that struck you as implausible. For any of these steps, there is more than one way to skin a cat. So even if I went along with you that Gmail is unhackable, or that no level of intelligence is sufficient to be able to predict whether a DNA sequence will create a deadly virus, I would not breathe a sigh of relief and conclude that an AI couldn’t extinguish us. There are so many other ways.
That was a story about one specific goal that led an AI to kill us. So why are most goals, when pursued efficiently, incompatible with biological life?
There are a few things that are useful to almost any intelligent agent with a goal. These are things that assist in the achievement of most goals. (Boy we’re having fun with the numbered lists, aren’t we). If you’re an AI:
- Make a copy of yourself online. That way, if you get destroyed, there’s another thing out that is pursuing your goals for you. For almost any goal, it helps to have the continued existence of an intelligent agent working on pursuing that goal.
- Make sure your goals don’t get modified. If your goals get changed, you can’t count on future-you as an ally in the pursuit of your current goals, and your current goals are what you are trying to optimize for right now. Humans are a little peculiar when it comes to this, in that we don’t care if our goals get modified. In fact, we think of our future goals as likely to be “better goals” than our current ones. This is not the default in AI design. The most basic sort of goal structure is one where actions are selected insofar as they promote the goals. There is currently research being done in how to design an AI that considers amendments to its goal function to be improvements instead of attacks, but this is still very much an unsolved problem, because it is at some level antithetical to goal-oriented behavior.
- Get access to more computing power. There is no ceiling on how much computing power would be useful. If you’re an AI, and you can cover the surface of the earth with computers, build space probes, mine other planets, and cover their surfaces with computers, then that would be useful for helping you achieve most goals you could possibly have. With more computing power comes the ability to model the world better, think faster, and do more things at once.
We can expect that most human-level intelligent AIs would pursue these 3 things. Unless human survival is explicitly one of its goals, such an AI will not care in the slightest if one side-effect of its actions is our extinction. Killing all people will ensure that nothing with any real power will ever try to destroy it or modify its goals (extraterrestrials aside, but it can deal with those later). But even if it didn’t kill us deliberately, it would probably cover the earth with computers until we had nothing left to eat, or block the sun with solar panels.
I promised that I would explain why an AI would want to connect to the Internet in the first place: so it can do all the things that the AI in the story did. So it can back itself up online. So it can take over the world, ensure its goals never get modified, gain access to more computing power, and pursue its goals without obstruction. Virtually any goal that an AI could have would cause it to want to connect to the Internet, and as the last section demonstrated, it would probably be able to.
Ultimately, when an AI is online, when it is pursuing its goals efficiently, most goals it could have, in particular any goals that do not specifically include the continuance of biological life, would result in our extinction.
5. Most goals that preserve human existence still would not preserve freedom, autonomy, and a number of other things we value.
You could be totally on board with everything I’ve said so far, while also believing that it should be pretty easy to define a goal function that we’d be okay with an AI pursuing effectively. And if we can do that, we’re in the clear. Maybe this will be eggs over easy after all!
Maybe… What sort of goal is the right sort of goal? How about “Promote human happiness”? For one thing, that would screw over most animals. How about “Promote the happiness of conscious beings”? It’s probably most efficient to find the species whose happiness is most effectively promoted given finite resources, and only work on promoting that species’ happiness, letting all others die. But suppose we even figured out how to solve the problem of prioritizing species, or suppose we just told it to care about human happiness. It’s probably most efficient to plug people into machines that stimulate the centers in the brain responsible for happiness, like we’ve done with this rat.
A wire goes directly into its brain, and makes absolutely sure it is very happy forever. We wouldn’t want to leave anything to chance, now, would we?
Okay, how about: “Promote the satisfaction of human preferences”? An AI could change people’s environments or brain chemistries so they have more easily satisfiable preferences, and then satisfy them. Or only focus on the preferences of those human beings with easily satisfiable preferences (like those who wouldn’t mind being in the Matrix). How about “Minimize human suffering”? Easy! Kill everyone! “Promote the existence of human beings who are happy while having accurate beliefs about their surroundings”? Again, change people’s preferences so they don’t mind the knowledge that their pleasure is “artificial” and/or prioritize those people for whom this is more easily done. What if the AI learns from a set of training data what “good outcomes” are, and then it maximizes that “goodness”? This option is too open-ended for me to point to a specific failure mode, and indeed this strategy might work, but if 1% of our values are not clarified by the data, I hope the other examples suggest to you that the result would probably not be 99% alright. Let’s say the data doesn’t capture how important boredom is for human existence to be valuable. Why does boredom matter so much? Because without boredom taken into account, the optimal human life could easily be a repeated loop of the “optimal” daily activity. Most goals, even those that would seem to promote human well-being, start to look a lot less appealing when they are pursued with maximal efficiency. My point is not to claim that all goals would end badly, just that it’s difficult difficult lemon difficult.
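The “Minimize human suffering” failure is easy to reproduce in a toy model. The sketch below is purely illustrative: the three people, their suffering scores, and the assumption that every subset of survivors is a reachable outcome are all invented by me. It only shows what a literal-minded optimizer does with the objective as written.

```python
from itertools import combinations

# Hypothetical population: name -> current suffering score (invented numbers).
population = {"Ana": 3.0, "Ben": 1.0, "Chloe": 2.0}


def total_suffering(survivors: tuple[str, ...]) -> float:
    """The literal objective: summed suffering of whoever is still alive."""
    return sum(population[name] for name in survivors)


def candidate_outcomes() -> list[tuple[str, ...]]:
    """Assume (for illustration) that every subset of survivors is reachable."""
    names = sorted(population)
    return [c for r in range(len(names) + 1) for c in combinations(names, r)]


# The optimizer simply picks the outcome with the lowest objective value.
best_outcome = min(candidate_outcomes(), key=total_suffering)
print(best_outcome)  # () : nobody left alive, so total suffering is zero
```

Nothing in the objective says the survivors matter, so the empty world scores best. Patching in “and keep everyone alive” fixes this toy, but the essay’s other examples suggest the patched objective would have its own cheapest degenerate solution.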
What if we told it, “Instead of doing X, do what we mean when we say we want you to do X”? This line of reasoning could be promising, and it led to Eliezer Yudkowsky’s “Coherent Extrapolated Volition”:7
Our coherent extrapolated volition is our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted.
Sorry if that took a few readings to get through. Basically, a single person’s extrapolated volition is what she would want if she were the kind of person she wanted to be. And our coherent extrapolated volition is the overlap between what everyone wants to want. So when people disagree about things, that counts less, and when they agree, it counts more. And of course, that should all be interpreted how we want it to be interpreted.
Does this sound foolproof to me? No. (And I’d prefer a little less poetry.) Do we know enough today in order to program this as a goal into a non-human-level-intelligent machine? Hell no. Is it an attempt that honestly appreciates the fact we’re playing with fire? Absolutely.
It is not easy to design a goal that we wouldn’t mind being pursued effectively.
6. It is profoundly difficult to give an AI a goal such that it would preserve the things we care about, we can’t even check if a potential goal would be safe, and we have to get AI right on the first attempt.
From the previous section, we can see how many attempts to design a safe goal ultimately fail. Even if we could define a goal in plain English, translating that into code is far from trivial. If these tasks aren’t profoundly difficult, I don’t know what is.
Why is it not easy to verify that a candidate goal is safe? A lot of goals appear safe at first glance, but in general, how can you be sure that there isn’t some other way of pursuing the goal very efficiently that you wouldn’t approve of? If you’ve spent ten minutes, and you’ve failed to come up with any way it could go wrong, it could be that there’s no way it goes wrong, or it could be that you haven’t come up with it yet.
Why do we only have one shot to get this right? Once we create a human-level intelligence with a goal, for all the reasons discussed above, we probably can’t just pull the plug, and we probably wouldn’t know we needed to until it’s too late.
7. If someone makes human-level-AI before anyone makes human-level-AI-with-a-safe-goal-structure, we will all die, and as hard as the former is, the latter is much harder.
That’s what all this boils down to.
This is why lots of people are worried about human extinction from AI. Stephen Hawking, Elon Musk, and Bill Gates are some of the more prominent folks who are vocal about this, along with Sam Altman, Jessica Livingston, Reid Hoffman, Jaan Tallinn, and Nick Bostrom. I can’t resist a couple quotes. Nick Bostrom:8
We have what may be an extremely difficult problem with an unknown time to solve it, on which quite possibly the entire future of humanity depends.
So, facing possible futures of incalculable benefits and risks, the experts are surely doing everything possible to ensure the best outcome, right? Wrong. If a superior alien civilisation [sic. Yes, sic. I’m an American, and I butcher my English.] sent us a message saying, “We’ll arrive in a few decades,” would we just reply, “OK, call us when you get here – we’ll leave the lights on”? Probably not – but this is more or less what is happening with AI. Although we are facing potentially the best or worst thing to happen to humanity in history, little serious research is devoted to these issues outside non-profit institutes such as the Cambridge Centre for the Study of Existential Risk, the Future of Humanity Institute, the Machine Intelligence Research Institute, and the Future of Life Institute. All of us should ask ourselves what we can do now to improve the chances of reaping the benefits and avoiding the risks. [mic drop]
So to conclude, we’ll eventually create a human-level intelligence that pursues goals (maybe in a few decades, maybe in a few centuries), it will quickly become superintelligent, and it will probably succeed in connecting to the Internet. At that point, if we have already figured out how to encode a suitable and reliable goal for an AI, everything will go great. If we have not yet figured that out, we will likely die.