Open threads at the Open Thread tab every Sunday and Wednesday

AI Persuasion Experiment: Essay C

[This is NOT necessarily a Slate Star Codex post by Scott Alexander. This is part of the AI Persuasion Experiment. I am briefly re-hosting it on this blog to blind readers to the source of the information. I apologize to the original author and promise to take this down after a few days when the experiment is over.]

If you've studied computer science in the last twenty years, you know Stuart Russell — as he co-authored Artificial Intelligence: A Modern Approach with Peter Norvig, which has been the standard textbook in the field for the last two decades. An extraordinarily clear and engaging text, it was well placed to become the classic it is today.

Once a bastion of overpromises and vaporware, in the years since the publication of A Modern Approach the field of artificial intelligence has grown into a bulwark of our economy, optimizing logistics for the military, detecting credit card fraud, providing very good, though certainly subhuman, machine translations, and much, much more.

Slowly these programs have become more capable and more general, and as they do our world has become more efficient, and its inhabitants richer. All in all, we've been well served by modern machine learning techniques. As artificial intelligence becomes, well, more intelligent whole hosts of new businesses and applications become possible. Uber, for example, would not exist if not for modern path-finding algorithms and self-driving cars are only beginning to actualize thanks to the development of probabilistic approaches to modeling a robot and its environment.

One might think that this trend towards increasing capability will continue to be a positive development, yet Russell thinks otherwise. In fact, he’s worried about what might happen when such systems begin to eclipse humans in all intellectual domains, and even had this to say when questioned about the possibility by, “The right response seems to be to change the goals of the field itself; instead of pure intelligence, we need to build intelligence that is provably aligned with human values.”

One of the fathers of modern artificial intelligence thinks we need to redefine the goals of the field itself, including some guarantee that these systems are “provably aligned” with human values. This is interesting news in itself, but far more interesting are the arguments and thinkers that lead him to this conclusion

Russell cites two academics in his comment to, Stephen M. Omohundro, an American computer scientist and former physicist, and Nick Bostrom, an Oxford philosopher who published a book last year on the potential dangers of advanced artificial intelligence, Superinteligence: Paths, Dangers, Strategies.

We’ll begin with Nick Bostrom’s orthogonality thesis and illustrate it with his famous paperclip maximizer thought experiment.

The year is 2055 and The Gem Manufacturing Company has put you in charge of increasing the efficiency of its paperclip manufacturing operations. One of your hobbies is amateur artificial intelligence research and it just so happens that you figured out how to build a super-human AI just days before you got the commission. Eager to test out your new software, you spend the rest of the day formally defining the concept of a paperclip and then give your new software the following goal, or “utility function” in Bostrom’s parlance: create as many paperclips as possible with the resources available.

You eagerly grant it access to Gem Manufacturing’s automated paperclip production factories and everything starts working out great. The AI discovers new, highly-unexpected ways of rearranging and reprograming existing production equipment. By the end of the week waste has quickly declined, profits risen and when the phone rings you’re sure you’re about to get promoted. But it’s not management calling you, it’s your mother. She’s telling you to turn on the television.

You quickly learn that every automated factory in the world has had its security compromised and they are all churning out paperclips. You rush to into the factories’ central server room and unplug all the computers there. It’s no use, your AI has compromised (and in some cases even honestly rented) several large-scale server farms and is now using a not-insignificant percentage of the worlds computing resources. Around a month later, your AI has gone through the equivalent of several technological revolutions, perfecting a form of nanotechnology it is now using to convert all available matter on earth into paperclips. A decade later, all of our solar system has been turned into paperclips or paperclip production facilities and millions of probes are making their way to nearby stars in search of more matter to turn into paperclips.

And lest you think of an obvious patch like limiting the number of paperclips the software would want to produce, Bostrom gives this response:

One might think that the risk [..] arises only if the AI has been given some clearly open- ended final goal, such as to manufacture as many paperclips as possible. It is easy to see how this gives the superintelligent AI an insatiable appetite for matter and energy. […] But suppose that the goal is instead to make at least one million paperclips (meeting suitable design specifications) rather than to make as many as possible.

One would like to think that an AI with such a goal would build one factory, use it to make a million paperclips, and then halt. Yet this may not be what would happen. Unless the AI’s motivation system is of a special kind, or there are additional elements in its final goal that penalize strategies that have excessively wide- ranging impacts on the world, there is no reason for the AI to cease activity upon achieving its goal. On the contrary: if the AI is a sensible Bayesian agent, it would never assign exactly zero probability to the hypothesis that it has not yet achieved its goal. […]The AI should therefore continue to make paperclips in order to reduce the (perhaps astronomically small) probability that it has somehow still failed to make at least a million of them, all appearances notwithstanding. There is nothing to be lost by continuing paperclip production and there is always at least some microscopic probability increment of achieving its final goal to be gained. Now it might be suggested that the remedy here is obvious. (But how obvious was it before it was pointed out that there was a problem here in need of remedying?)

Now this parable may seem silly. Surely once it gets intelligent enough to take over the world, the paperclip maximizer will realize that paperclips are a stupid use of the world’s resources. But why do you think that? What process is going on in your mind that defines a universe filled only with paperclips as a bad outcome? What Bostrom argues is this process is an internal and subjective one. We use our moral intuitions to examine and discard states of the world, like a paperclip universe, that we see as lacking value.

And the paperclip maximizer does not share our moral intuitions. Its only goal is more paperclips and its thoughts would go more like this: does this action lead to the production of more paperclips than all other actions considered? If so, implement that action. If not, move on to the next idea. Any thought like ‘what’s so great about paperclips anyway?’ would be judged as not likely to lead to more paperclips and so remain unexplored. This is the essence of the orthogonality thesis which Bostrom defines as follows:

Intelligence and final goals are orthogonal axes along which possible agents can freely vary. In other words, more or less any level of intelligence could in principle be combined with more or less any final goal [even something as ‘stupid’ as making as many paperclips as possible].

In my previous review of his book, I provided this summary of the idea:

Though agents with different utility functions (goals) may converge on some provably optimal method of cognition, they will not converge on any particular terminal goal, though they’ll share some instrumental or sub-goals. That is, a superintelligence whose super-goal is to calculate the decimal expansion of pi will never reason itself into benevolence. It would be quite happy to convert all the free matter and energy in the universe (including humans and our habitat) into specialized computers capable only of calculating the digits of pi. Why? Because its potential actions will be weighted and selected in the context of its utility function. If its utility function is to calculate pi, any thought of benevolence would be judged of negative utility.

Now I suppose it is possible that once an agent reaches a sufficient level of intellectual ability it derives some universal morality from the ether and there really is nothing to worry about, but I hope you agree that this is, at the very least, not a conservative assumption. For the purposes of this article I will take the orthogonality thesis as a given.

So a smarter than human artificial intelligence can have any goal. What’s the big deal? What’s really implausible about the paperclip scenario is how powerful it got in such a short amount of time.

This is an interesting question: What is so great about intelligence? I mean, we have guns and EMPs and the nuclear bombs that create them. If a smarter-than-human AI with goals that conflict with our own is created, won’t we just blow it up? Leaving aside the fact that guns and nuclear bombs are all products of intelligence and so illustrate how powerful a force it really is, I think it’s important to illustrate how smart an AI could get in comparison to humans.

Consider the chimp. If we are grading on a curve, chimps are very, very intelligent. Compare them to any other species besides Homo sapiens and they’re the best of the bunch. They have the rudiments of language, use very simple tools and have complex social hierarchies, and yet chimps are not doing very well. Their population is dwindling, and to the extent they are thriving they are thriving under our sufferance not their own strength.

Why? Because human civilization is little like the paperclip maximizer; We don’t hate chimps or the other animals whose habitats we are rearranging; we just see higher-value arrangements of the earth and water they need to survive. And we are only every-so-slightly smarter than chimps.

In many respects our brains are nearly identical. Yes, the average human brain is about three times the size of an average chimp’s, but we still share much of the same gross structure. And our neurons fire at about 100 times per second and communicate through saltatory conduction, just like theirs do.

Compare that with the potential limits of computing. Eliezer Yudkowsky, a controversial independent theorist who nonetheless collaborates with Bostrom frequently, said this in a debate about the potential of AI:

[When one looks at the brain] you get observations like, signals are travelling along the axons and the dendrites at a top speed of say, 150 meters per second absolute top speed. You compare that to the speed of light, and it’s a factor of 2,000,000. Or similarly, you look at how fast the neurons are firing. They’re firing say, 200 times per second, top speed. And you compare that to modern day transistors, and again you are looking at a factor of millions between what neurons are doing and what we have already observed to be physically possible.

There’s an awful lot of room above us. An AI could potentially think millions of times faster than us. Problems that take the smartest humans years to solve it could solve in mintues. If a paperclip maximizer (or value-of-Goldman Sachs-stock maximzer) is created, why should we expect our fate then to be any different that that of chimps now?

All the weapons at our disposal are the equivalent of sticks and stones. If a superintelligence decides we are a threat to the pursuit of its goals and wants us out of the way, this could be done very quickly with some trivial advances in weponized biotechnology.

The key is making sure it doesn't want us dead. However, as Omohundro points out in his paper, The Basic AI Drives, this is not nearly as easy as one might think.

As the paperclip maximizer scenario makes clear, seemingly innocuous goals can lead to catastrophic outcomes when implemented in a superintelligent system. Omohundro uses von Neumann’s mathematical theory of microeconomics to analyze the likely convergent behavior of most rational artificial intelligences of human equivalent or greater-than human intelligence. He defines six basic drives. To clarify, these basic drives are sub-goals that nearly all rational AIs will develop as a consequence of pursuing their primary or “terminal” goals. Here are the most salient ones:

Goal-content integrity

A superintelligence will not willingly alter its utility function or allow it to be altered. For example, the paperclip maximizer would resist any attempt to reprogram it to value, say, staples instead of paperclips. Why? It can predict that if its goals were changed, the future would contain fewer paperclips — a disaster from its current perspective.

AIs will want to self-improve

Any increase in intelligence translates in to an increase in an agent’s ability to achieve its goals, so most goal-directed agents will have strong instrumental reasons to increase their intelligence. Thus they will research artificial intelligence design themselves, and quickly surpass their creators in the discipline.

AIs will want to acquire resources and use them efficiently

All computation and physical action requires the physical resources of space, time, matter, and free energy. Almost any goal can be better accomplished by having more of these resources. In maximizing their expected utilities, systems will therefore feel a pressure to acquire more of these resources and to use them as efficiently as possible. Resources can be obtained in positive ways such as exploration, discovery, and trade. Or through negative means such as theft, murder, coercion, and fraud. […]Without explicit goals to the contrary, AIs are likely to behave like human sociopaths in their pursuit of resources.

Without explicit goals to the contrary, AIs are likely to behave like human sociopaths? That’s a hell of a statement. We observe that criminals are both less educated and less intelligent than normal citizens. Therefore, don’t we have some empirical evidence that as one gets more intelligent one gets more moral? And if so, why shouldn’t we expect this trend to continue for superhuman intelligence?

Even leaving aside the selection effect (those criminals who are caught are likely to be less intelligent than those that are not) this argument doesn't account for that fact that in an ecology of agents of similar abilities, cooperation and exchange is often rational, regardless of your morals — and the entire edifice of civilization is designed to encourage and enforce such cooperation. The fact that intelligent people tend not to perform actions that get them incarcerated really doesn't say much about their morals, as there are instrumental reasons why most humans, regardless of their morality, want to avoid ending up in jail.

And of course, as these incentives emerged to constrain the options of human-level agents, there is no reason to think an AI millions of times more powerful than any human or collection of humans will play by the same rules.

Ok, so a paperclip maximizer would be a dangerous artifact, so what? No one is actually going to build one. Humans will be the ones programing the first artificial intelligences and computers behave exactly as they’re programed to. So why don’t we just program them to be safe? Conditional on humanity surviving the creation of smarter-than-human AI, I suspect we will do just that. But saying “just program them to be safe?” understates the difficulty of the problem. I mean what exactly do you mean by “safe”? What goal, when pursued by a superintelligence, will lead to good outcomes?

Consider this proposal from Bill Hibbard of the University of Wisconsin:

We can design intelligent machines so their primary, innate emotion is unconditional love for all humans. First we can build relatively simple machines that learn to recognize happiness and unhappiness in human facial expressions, human voices and human body language. Then we can hard-wire the result of this learning as the innate emotional values of more complex intelligent machines, positively reinforced when we are happy and negatively reinforced when we are unhappy.’

Here Hibbard is defining human happiness as the position of our lips, our body language and the tone of our voices. What should we expect a superinteligence programmed in this manner to do? Well, so long as we are more powerful than it, it would have incentive to make us smile through standard means: telling jokes, building useful products, maybe even following our orders.

But as soon as it self-improves to the point where it is more powerful than humanity it would implement more-efficient means of satisfying its desires — like wiring our mouths into rictus grins, or more plausibly, begining the process of converting all available matter (including the resources that keep us alive) into smiling, happily positioned manikins, each with a speaker in its throat producing squeals of delight.

One mistake Hibbard makes is defining an agent that desires a proxy for human happiness (the physiological correlates of same) rather than happiness itself, conflating the smile with the emotion that produces it. But even if we could program AIs to truly desire the mental state behind the smile rather than the smile itself, it’s not clear this would be a good idea. Placing every human on a food drip, implanting electrodes in our brains, and then directly stimulating our pleasure centers would vastly increase the amount of happiness in the world, yet that hardly seems like a good outcome.

So if happiness is not the right answer, what is? No one knows. There are interesting ideas floating around but nothing even close to definite. Inspired by ideal adviser theories in moral philosophy, Bostrom and his collaborates propose we may be able to offload some of the philosophical work of figuring out what it is we actually want to the AI. He calls this aproach Indirect normativity.

[A] future superintelligence occupies an epistemically superior vantage point: its beliefs are (probably, on most topics) more likely than ours to be true. We should therefore defer to the superintelligence’s opinion whenever feasible.

Indirect normativity applies this principle to the value-selection problem. Lacking confidence in our ability to specify a concrete normative standard, we would instead specify some more abstract condition that any normative standard should satisfy, in the hope that a superintelligence could find a concrete standard that satisfies the abstract condition. We could give a seed AI the final goal of continuously acting according to its best estimate of what this implicitly defined standard would have it do.

They propose giving a superintelligence a goal that goes something like this: perform those actions you estimate humanity would want performed if we were as intelligent as you.

What actions would humanity want performed if we were a superintelligence ourselves? This is an empirical question, and a superintelligent AI, being superintelligent, might be able to make very accurate estimations of how we would choose to steer our fate if we were smart enough to see the implications of our actions, defining and then steering the future toward some some state of the world that we would all consider a happy ending.

Depending on your disposition, this is either a bit of a cheat or a terribly clever piece of intellectual judo. However, even if we grant such a goal is coherent, there is still the amazingly difficult challenge of formalizing such an abstract idea and implementing it it code.

Cryptographer Wei Dai comments on this difficulty:

What I’m afraid of is that a design will be shown to be safe, and then it turns out that the proof is wrong, or the formalization of the notion of “safety” used by the proof is wrong. This kind of thing happens a lot in cryptography, if you replace “safety” with “security”. These mistakes are still occurring today, even after decades of research into how to do such proofs and what the relevant formalizations are. From where I’m sitting, proving an AGI design Friendly seems even more difficult and error-prone than proving a crypto scheme secure, probably by a large margin[.]

Obviously, more research is needed. And it will be up to this generation to do it.

Stuart Russell wrote, “The right response seems to be to change the goals of the field itself; instead of pure intelligence, we need to build intelligence that is provably aligned with human values.” I hope this next generation of computer scientists are as inspired by Russell’s call for safety as this last one was by Artificial Intelligence: A Modern Approach.

Comments are closed.