
How Does Recent AI Progress Affect The Bostromian Paradigm?

[content note: I seriously know nothing about this and it’s all random uninformed speculation]

I.

AI risk discussions are dominated by the Bostromian paradigm of AIs as highly strategic agents that try to maximize certain programmed goals. This paradigm got developed in the early 2000s, before a recent spurt of advances in machine learning. Do these advances require any changes to the way we approach these topics?

The latest progress has concentrated in neural networks – “cells” arranged in layers that represent the potential for ascending levels of abstract categorization. For example, a neural network working on image recognition might have a low-level layer that scans the image and resolves it into edges, a medium-level layer that scans the set of edges and resolves it into shapes, and a highest-level layer that scans the shapes and resolves them into subjects and themes. With enough training, the network “learns” how best to map each level onto the level above it, ending up with profound insight into the high-level features of a scene.
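
To make the layer picture concrete, here is a minimal sketch of data flowing up through such a stack. Everything in it is an assumption made for illustration: the layer sizes, the names "edges", "shapes" and "subjects", and the random weights. The network is untrained, so its outputs mean nothing; training would consist of adjusting the weights until the top level became useful.

```python
import random

def relu(x):
    """A simple nonlinearity: pass positive values through, clip negatives to zero."""
    return max(0.0, x)

def make_layer(n_in, n_out, rng):
    """Random weights for one layer: n_out cells, each reading all n_in inputs."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def forward_layer(weights, inputs):
    """Each cell takes a weighted sum of the level below, then applies the nonlinearity."""
    return [relu(sum(w * x for w, x in zip(row, inputs))) for row in weights]

def forward(network, pixels):
    """Pass the input up through every layer, keeping each level's activations."""
    activations = [pixels]
    for weights in network:
        activations.append(forward_layer(weights, activations[-1]))
    return activations

rng = random.Random(0)
# Hypothetical sizes: 16 "pixels" -> 8 "edge" cells -> 4 "shape" cells -> 2 "subject" cells.
network = [make_layer(16, 8, rng), make_layer(8, 4, rng), make_layer(4, 2, rng)]
pixels = [rng.random() for _ in range(16)]
levels = forward(network, pixels)

for name, acts in zip(["pixels", "edges", "shapes", "subjects"], levels):
    print(name, len(acts))
```

Each level is computed only from the level directly below it, which is the sense in which the network maps "each level onto the level above it".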

These are a lot like the human brain, and in fact some of the early researchers got important insights from neuroscience. The brain certainly uses cells, the cells are arranged in layers, and the brain categorizes things in hierarchies that move from simple things like edges or sounds to complicated things like objects or sentences.

In particular, these networks are like the brain’s sensory cortices, and they’re starting to equal or beat human sensory cortices at important tasks like recognizing speech and faces.

(I think this is scarier than most people give it credit for. It’s no big deal when computers beat humans at chess – human brains haven’t been evolving specific chess modules. But face recognition is an adaptive skill localized in a specific brain area that underwent a lot of evolutionary work, and modern AI still beats it)

The sensory tasks where AIs excel tend to involve abstraction, categorization, and compression: the thing where you take images of black dogs, white dogs, big dogs, little dogs, ugly dogs, cute dogs, et cetera and are able to generalize them into “dog”. Or to take a more interesting example: a new AI classifies images as pornographic or safe-for-work. Its structure naturally gives it an abstract understanding of pornographicness that allows it to “imagine” what the most pornographic possible images would look like (trigger warning: artificially intelligent computer generating the most pornographic possible images). This kind of classification/categorization/generalization ability is a major advance and eerily reminiscent of human abilities.

But how far is this from building an AGI or human-level AI or superintelligence or whatever else you want to call it?

II.

Consider two opposite perspectives:

The engineer’s perspective: Categorization ability is just one tool out of many. When people invented automated theorem-provers, that was pretty cool – it meant computers could now assess new mathematics. But for AGI, you still need some thing that wants to prove theorems, something (someone?) that can do something with the theorems it proves. The theorem-prover is a tool for the AI to use, not the core “consciousness” of the AI itself. The same will be true of these new neural nets and deep learning programs. They can recognize dogs, and that’s cool. But AGI is still about creating some kind of program that wants to recognize dogs, and which can do something interesting with the dogs once it recognizes them. And that will probably require something different from either a theorem-prover or a neural-net-categorizer. A paperclip maximizer might use a neural net to recognize paperclips, but its desire to maximize them will still come from some novel architecture we don’t know much about yet which probably looks more like normal programming.

The biologist’s perspective: The whole brain runs on more or less similar cells doing more or less similar things, and evolved in a series of tiny evolutionary steps. If we’ve figured out how one part of the brain works, that’s a pretty big clue as to how other parts of the brain work. The human motivation system is in brain structures not so different from the human perception-association-categorization system, and they probably evolved from a common root. If researchers are discovering that the easiest way to make perception-association-categorization systems is neural nets reminiscent of the brain, then they’ll probably find that those neural nets are pretty easy to alter slightly to make a motivational system reminiscent of the brain. This would look less like strategic/agenty goal maximization, which the brain is terrible at, and more like the sort of vague mishmash of desires which humans have.

The exact evolutionary history behind the biologist’s perspective is complicated. There’s a split between some sensory processing centers (like the visual cortex) and some motivational/emotional centers (like the hypothalamus) pretty early in vertebrates and maybe even before. But in other cases the systems are all messed up. Some parts of the cortex interact with the hypothalamus and are considered part of the limbic system. Some parts of the really primitive lizard brain handle sensation (like the colliculi). It looks like sensation/perception-related areas and emotion/motivation-related areas are mixed throughout every level of the brain. Most important, the frontal lobe – which we tend to interpret as the seat of truly human intelligence and executive planning and “the will” – probably evolved from sensation/perception-related areas in fish, since it looks like sensation/perception-related areas are just about all the cortex that fish had. And all of this evolved from the same couple of hundred neurons in worms, which were already responsible for interpreting the sensations picked up by the worm’s little bristle thingies.

The point is, neither evolution nor anatomy suggests that the brain enforces a deep conceptual separation between perception, motivation, and cognition. Instead, the same sort of systems which handle perception in some areas are – with a few tweaks – able to handle cognition and motivation in others.

In fact, there are some deep connections between all three domains. The same factors that make a grey figure on dark ground look white can make an okay choice compared to worse choices look good. The same top-down processing that screws up PARIS IN THE THE SPRINGTIME is responsible for confirmation bias. In general the mapping between cognitive biases and perceptual illusions is fruitful enough that it’s hard for me to believe that cognition and sensation/perception aren’t handled in really similar ways, with motivation probably also involved.

So if we have something that can equal human sensory cortices – not just in the coincidental way where a sports car can equal a cheetah, but because we’re genuinely doing the same thing human sensory cortices do for the same reasons – then we might already be further than we think towards understanding human intelligence and motivation.

III.

A quick sketch of two ways this might play out in real life.

First, categorization/classification/generalization/abstraction seems to be a big part of how people develop a moral sense, and maybe a big part of what morality is.

Everyone remembers the whole thing about mental categories, right? The thing where you have a category “bird”, and you can’t give a necessary-and-sufficient explicit definition of what you mean by that, but you know a sparrow is definitely a bird, and an ostrich is weird but probably still a bird, and there are edge cases like Archaeopteryx where you’re not quite sure if they’re birds or not and there’s probably no fact of the matter either way? Cluster-structures in thingspace? Weird border disputes? That thing?

And you remember how we get these categories, right? A little bit of training data, your mother pointing at a sparrow and saying “bird”, then maybe at a raven and saying “bird”, then maybe learning ad hoc that a bat isn’t a bird, and your brain’s brilliant hyperadvanced categorization/classification/generalization/abstraction system picking it up from there? And then maybe after several thousand years of this Darwin comes along and tells you what birds actually are, and it’s good to know, but you were doing just fine way before that?

We learn morality in a very similar way. When we hit someone, our mother/father/teacher/priest/rabbi/shaman says “That’s bad”; when we share, “that’s good”. From all this training data, the categorization/classification/generalization/abstraction system eventually feels like it has a pretty good idea of what morality is, although often we can’t verbalize an explicit definition any better than we can verbalize an explicit definition of “bird” (“it’s an animal that can fly…wait, no, bats…um, that has feathers…uh, do all birds have feathers? Bah, of course they don’t if you pluck them, that wasn’t what I meant…”). Just as Darwin was able to give an explicit definition of “bird” which conclusively settled some edge cases like bats, so philosophers have tried to give explicit definitions of “morality” which settle edge cases like abortion and trolley-related mishaps.

An AI based around a categorization/classification/generalization/abstraction system might learn morality in the same way. Its programmers give it a bunch of training data – maybe the Bible (this is a joke, please do not train an AI on the Bible) – and the AI gains a “moral sense” that it can use to classify novel data.

The classic Bostromian objection to this kind of scheme is that the AI might draw the wrong conclusion. For example, an AI might realize that things that make people happy are good – seemingly a high-level moral insight – but then forcibly inject everybody with heroin all the time so they could be as happy as possible.

To this I can only respond that we humans don’t work this way. I’m not sure why. It seems to either be a quirk of our categorization/classification/generalization/abstraction system, or a genuine moral/structure-of-thingspace-related truth about how well forced-heroin clusters with other things we consider good vs. bad. A fruitful topic for AI goal alignment research might be to understand exactly how this sort of thing works and whether there are certain values of classification-related parameters that will make classifiers more vs. less like humans on these kinds of cases.

Second, even if we can’t get this 100% right, there might be a saving grace: I don’t see these kinds of systems as paperclip maximizers. The human utility function seems to be a set of complicated things generalizing/abstracting from a few biologically programmed imperatives (food, sex, lack of pain) and ability to learn other goals from society and your moral system.

Categorization/classification/generalization/abstraction is certainly involved in reinforcement learning. You say “BARK!” and a dog barks, and you give it a treat. The dog needs to be able to figure out, on the fly, whether the treat was for barking when you said “BARK!”, for barking whenever you speak, for barking in general, for being next to you, or just completely random. This is a problem of categorization and abstraction – going from training data (“the human did or didn’t reward me at this specific time”) to general principles (“when the human says bark, I bark”).
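
A toy sketch of the dog's problem, with the caveat that this is not how real reinforcement learning algorithms work and that the episodes and candidate rules below are made up for illustration: write down each candidate principle as a rule that predicts when a treat should arrive, then see which rule best matches the training data.

```python
# Each episode records what the human said, whether the dog barked, whether it
# was next to the human, and whether a treat followed. All data is invented.
episodes = [
    {"said": "BARK", "barked": True,  "near": True,  "treat": True},
    {"said": "BARK", "barked": False, "near": True,  "treat": False},
    {"said": "SIT",  "barked": True,  "near": True,  "treat": False},
    {"said": None,   "barked": True,  "near": False, "treat": False},
    {"said": "BARK", "barked": True,  "near": False, "treat": True},
    {"said": None,   "barked": False, "near": True,  "treat": False},
]

# Candidate general principles: each predicts whether a given episode earns a treat.
hypotheses = {
    "bark when told BARK": lambda e: e["barked"] and e["said"] == "BARK",
    "bark whenever the human speaks": lambda e: e["barked"] and e["said"] is not None,
    "bark at any time": lambda e: e["barked"],
    "stay next to the human": lambda e: e["near"],
}

def score(rule, data):
    """Count the episodes where the rule's prediction matches the observed treat."""
    return sum(1 for e in data if bool(rule(e)) == e["treat"])

scores = {name: score(rule, episodes) for name, rule in hypotheses.items()}
best = max(scores, key=scores.get)

for name, s in scores.items():
    print(f"{s}/{len(episodes)}  {name}")
print("best explanation:", best)
```

The hard part for a real dog (or a real learner) is that nobody hands it the list of hypotheses; it has to come up with the right abstractions on its own, which is where a good categorizer would come in.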

I don’t really understand how the human motivational system works. Dopamine and the idea of incentive salience seem to be involved in a fundamental way that seems linked to perception. But I am kind of hopeful that it’s something that’s not too hard to do if you already have a working categorizer, and that it’s a foundation to build agents that want things without being psychopathic maniacs. Humans can want sex without being insane sex maximizers who copulate with everything around until they explode. An AI that wanted paperclips, but which was built on a human incentive system that gave paperclips the same kind of position as sex, might be a good paperclip producer without being insane enough to subordinate every other goal and moral rule to its paperclip-lust.

Tomorrow 10/31 is the last day of MIRI’s yearly fundraiser, and as usual I think it is a good cause well worth your donation. But its basic assumption is that AIs will be very computer-like: entities of pure code and logic that will reflect on themselves using mathematical tools. I can also imagine futures where AIs aren’t much more purely-logical than we are, and the tools we need to keep them human-friendly are very different. I support MIRI’s efforts to deal with the one case, but I’m hoping there will be some efforts in the other direction as well.

EDIT: Nick points out some of MIRI’s work along these lines.
EDIT2: Comment by Eliezer


141 Responses to How Does Recent AI Progress Affect The Bostromian Paradigm?

  1. Nick T says:

    MIRI has started devoting significant effort to also looking at the case where AI grows out of present machine learning research e.g. neural networks.

    • Rob Bensinger says:

      See also Nate on the EA Forum:

      Loosely speaking, we can imagine the space of all smarter-than-human AI systems as an extremely wide and heterogeneous space, in which “alignable AI designs” is a small and narrow target (and “aligned AI designs” smaller and narrower still). I think that the most important thing a marginal alignment researcher can do today is help ensure that the first generally intelligent systems humans design are in the “alignable” region. I think that this is unlikely to happen unless researchers have a fairly principled understanding of how the systems they’re developing reason, and how that reasoning connects to the intended objectives.

      Most of our work is therefore aimed at seeding the field with ideas that may inspire more AI research in the vicinity of (what we expect to be) alignable AI designs. When the first general reasoning machines are developed, we want the developers to be sampling from a space of designs and techniques that are more understandable and reliable than what’s possible in AI today.

      MIRI’s goal is more or less “find research directions that are likely to make it easier down the road to develop AI systems that can complete some limited (superhumanly difficult) task without open-endedly optimizing the universe”, and our strategy is more or less “develop a deeper conceptual understanding of the relevant kinds of intelligent systems, so that researchers aren’t flying blind”.

      We’re currently spending about half our time on the research agenda Nick linked, which thinks more about present-day machine learning approaches and puts more emphasis on the “complete some limited task without open-endedly optimizing the universe” part of the problem. The other half goes to our 2014 agenda, which is more agnostic about when/how AI will be developed and puts more emphasis on the “develop a deeper conceptual understanding of the relevant kinds of intelligent systems” part. For the most part, though, we see the agendas as complementary.

      I definitely wouldn’t say that the older agenda is about “logical AI” or that the newer one is about “human-like AI”. Like Eliezer mentioned, we sometimes use tools from logic as a simplification for addressing some parts of the problem to the exclusion of others. There aren’t many good tools yet for formally characterizing general reasoning systems, and when you want to utterly ignore that problem, saying, “Well, pretend it can successfully brute-force-search through all mathematical claims for some true ones” is one way of saying “Pretend we knew how to formally pinpoint a really capable general reasoning system”.

      It’s not that logic-like reasoning is what you’d actually expect to fill the gap in our understanding; mathematical logic is just a good device (sometimes) for deliberately bracketing one gap in our understanding so we can better talk about another. If the research community does develop new AI paradigms, we should expect them to look like a step forward from ANNs, not a step backward to GOFAI.

  2. Izaak says:

    I’m not sure neural net style AI are any less scary.

    If you have a neural net the intelligence of a human, you can run it on a larger computer with more nodes and presumably still get a smarter AI. Maybe you won’t have a FOOM scenario but you probably still will have superintelligent AI.

    What happens when a neo-nazi group buys a supercomputer and trains their 300 times human intelligence AI on the morals of Mein Kampf? And besides, the orthogonality hypothesis still applies. If I want to train an AI to think that more paperclips is the moral foundation of the universe, I could. And a dumb mid-level manager at a paperclip factory might.

    • Scott Alexander says:

      You can always have bad people create an AI with bad-people morality. The good news is that it’s possible to train an AI with human morality at all. If you can figure out how to train an AI, it reduces the friendliness problem to “Don’t have the first people to invent AI be neo-Nazis”, which sounds gratifyingly easy.

      • Soeren E says:

        Chapter 11 in Superintelligence by Nick Bostrom discusses a multipolar scenario, where the Neo-nazis also have a superintelligence, but the Control Problem has not been fully solved. The chapter does not assume an Intelligence Explosion, and it is rather depressing reading.

      • Angra Mainyu says:

        I admit I don’t know how the AI would be programmed, but from what I read in your post, I think a significant problem is that whoever programs the AI will make mistakes when giving it feedback, if they consider a sufficient number of cases, because people do tend to have some false moral beliefs. The AI will trust the false information as data, and as a result, without a built-in moral sense, it will make a sort of AI-moral sense which will track AI-goodness (whatever that is; it’s not the same as goodness, etc.).

        It would be like training an AI to recognize birds, but telling the AI that, say, bats and dragonflies are birds, but penguins and ostriches are not birds (or maybe less bad if the dragonflies are not included, but still, there seem to be some significant mistakes in at least nearly everyone’s moral beliefs, even if most of a person’s moral beliefs are true).

    • onyomi says:

      It seems to me that this problem may be somewhat ameliorated by the fact that no one aspires to be evil by his own lights. So even if the original programmer has an idiosyncratic notion of “good,” the AI might well realize that its own creator’s view was bad, as the intelligent child of weird parents may, with access to the internet, come to reject his parents’ belief system.

      I guess it becomes more problematic if the original programmer programs the AI not to “do good” but to “eliminate the enemies of the master race,” but, as Scott mentions, people are “programmed” to survive and reproduce, yet we are able (seemingly in some proportion to our intelligence, given, e.g., dogs’ lesser ability to do so and the marshmallow test) to resist those urges in order to do something deemed “right.”

      An AI much more intelligent than us might have a correspondingly greater amount of “self control,” such that even if it feels a great “urge” to destroy enemies of the master race, it might look around at the world and most people’s view of morality and determine that’s not the right thing to do. In its spare time, it can divert itself with “destroying the enemies of the master race” virtual porn, or reprogram itself to have different preferences.

      The situation is better if I am correct in being a moral realist, and morality is not entirely arbitrary. If there are no “correct” answers about morality then we can’t depend on a smart AI to know them, though, in such a case, the morality of the creator also seems to be irrelevant (except insofar as it shapes the other goals he imbues the AI with), since the AI will realize the correct moral nihilist position and ignore lesser beings’ ideas about morality?

    • dragnubbit says:

      I fail to understand why many people are more worried about what ‘superhuman AI’ might do v. humanity, than what a portion of humanity with ‘almost superhuman AI’ might do to the rest of said humanity. If nothing else, the latter seems more likely to be a problem first. As AIs grow in capability they will begin as the clients of human-based and human-motivated organizations.

      So if you have a small group of dedicated humans that can fill in whatever gaps might exist in the effectiveness of AIs (motivation, infinite looping/deadlocking, etc.), what might advances in AI enable these humans to do to the rest of us? Worrying about how an AI might lack motivation or goal-directedness is ignoring the fact that a human can supply those items quite easily, and most definitely will be doing so for the foreseeable future.

  3. philkidd says:

    Deep learning architectures are inspired by the structure of visual cortex, which is actually one of the simplest areas of the brain. Layers in visual cortex are coupled by feedforward connections, e.g. photoreceptor neurons in the retina that sense light in a specific spot in the visual field (like a pixel) project onto neurons in V1 that detect edges in a specific orientation, which in turn project onto deeper layers of neurons that recognize higher-order features. This was actually figured out to a large extent in the 60’s by Hubel and Wiesel. The analogy to deep learning networks is obvious, and the key to the simplicity and usefulness of these architectures is the feedforward structure: neurons in the deeper layers do not communicate back with neurons in layers above them. This makes the dynamics of the network very simple, and easy to predict and train.

    Other parts of cortex have extensive recurrent feedback connections, much more complicated dynamics, and are much less well understood than visual cortex. Recurrent connections are surely required for anything like general intelligence, but modern deep-learning architectures do not have them. The reason is that we don’t know so well how to get recurrent neural networks to perform predictable, stable functions. Indeed, a typical feedforward architecture can be initialized with random connections between the neurons and still serve as a pretty good classifier:

    https://arxiv.org/pdf/1504.08291.pdf

    But a fully connected network with random weights will either be chaotic or just settle onto a single random fixed point:

    http://neurophysics.huji.ac.il/node/500

    Deep learning networks are really cool and useful but they are very different from (most of) the human brain.

    This is only sort of related but I think it’s interesting. Deep learning networks for object recognition are vulnerable to being tricked, and can be convinced to confidently classify white noise patterns or abstract patterns as real objects:

    https://arxiv.org/abs/1412.1897

    Maybe one day we’ll need to fool our AI overlords :).

    • soren says:

      I agree that deep learning as it exists now is very, very different from the brain, but..

      Modern deep learning architectures do involve recurrent connections! Lots of them!

      Deep learning classifiers and regressors don’t usually have recurrent connections, but that’s because they don’t really need to – the activations of a recurrent network (sort of by definition) change through time, so they are naturally suited to learning from sequences.

      On the other hand, the world of an image classifier consists of nothing but static images and labels – which don’t seem to me like they can be meaningfully represented as a sequence.

      Recurrent connections are often combined with deep convolutional architectures for image related tasks that *are* kind of sequential.
      My favorite example is describing images or videos with natural language – instead of mapping an image to a label, map an image to a sequence of words.

      https://arxiv.org/pdf/1411.4389

      https://arxiv.org/abs/1412.2306

      Also reinforcement learning! Traditionally reinforcement learning agents learn in markovian environments, which means that the next state and reward depend only on the current state. This may work in tic tac toe or chess, but it doesn’t apply to most real world problems where information about the past can have important implications for the future.

      The “learning” part of reinforcement learning is often done on recorded sequences of states and rewards from past games. So if your agent is a neural network, you can add recurrent connections so that it can make use of information from previous timesteps.

      https://arxiv.org/abs/1507.06527
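
      A minimal sketch of why memory matters once the environment stops being Markovian. The three-step environment and both policies below are hypothetical and hand-written, not learned, and are not taken from the linked paper: a cue is shown at the first step, and the reward depends on repeating that cue at the last step, when it is no longer visible.

```python
def run_episode(policy, cue):
    """Show a cue, then a blank step, then ask for a choice. Reward 1 if the final action matches the cue."""
    observations = [cue, "blank", "choose"]
    state = None   # whatever the policy carries over from one timestep to the next
    action = None
    for obs in observations:
        action, state = policy(obs, state)
    return 1 if action == cue else 0

def memoryless_policy(obs, state):
    """Sees only the current observation. At "choose" the cue is gone, so it can only guess."""
    return "left", None

def recurrent_policy(obs, state):
    """Carries a hidden state forward, standing in for recurrent connections."""
    if obs in ("left", "right"):
        state = obs            # remember the cue
    return state, state        # act on the remembered cue

cues = ["left", "right"]
memoryless_total = sum(run_episode(memoryless_policy, c) for c in cues)
recurrent_total = sum(run_episode(recurrent_policy, c) for c in cues)
print("memoryless reward:", memoryless_total, "of", len(cues))
print("recurrent reward:", recurrent_total, "of", len(cues))
```

      The memoryless policy sees the identical observation at decision time in both episodes, so no choice of fixed action can be right both times; the policy with carried-over state has no such limit.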

      The tasks that work well with recurrent, deep architectures feel closer to general intelligence to me than feedforward tasks like classifying images or playing chess. I don’t think this is a coincidence.

    • Cerastes says:

      Actually, we’re gaining a pretty good idea of how biological nervous systems perform predictable, stable functions with lots of recurrence, through one area: locomotion. We’re not as far advanced as the neurobiology of vision, but progress is being made. Similar to vision, there’s an innate layering of functions, even if we don’t know the cells behind them. At the highest level is the “template”, a simplified model of the physics of the center of mass during movement, with the most famous one being the Spring Loaded Inverted Pendulum (SLIP) for walking and running, in which the limbs behave as springs with a hip joint attached to a point mass. In walking, the spring is stiff (but not 100% rigid), and you vault over the stiff limb like an inverted pendulum; you can see a symmetrical exchange between kinetic and potential energy when real animals walk across force plates. In running, the spring is more compliant, and KE & PE both decline as the leg spring absorbs energy early in the stride, then rise as the spring recoils; humans and other animals will adjust to compliant substrates to maintain a constant total spring stiffness. These SLIP dynamics are actually seen in everything that walks, from humans to dogs to lizards to cockroaches to spiders.

      The next layer down is a series of Central Pattern Generators which take the relatively simple outputs of the brain regarding SLIP and translate them into rhythmic movements of muscles, while integrating feedback and sending that feedback upstream to the brain when relevant. Coordinated locomotion with feedback can actually be induced in organisms with transected spinal cords (“spinalized”) or brains (“decerebrate”); do not look these experiments up if you’re fond of the model species – cats. A fair bit has been done aquatically with lampreys, too. How these CPGs network together, what exactly the system of neural cells behind them is, and how they work in vivo is a “hot topic” these days.

      Below that, you have simple reflex loops that only involve a few neurons, like the patellar reflex and reciprocal inhibition, as well as “preflexes”, purely mechanical responses due to the inherent properties of muscles and tendons with no neural input and which probably aren’t relevant to AIs.

      So, long story short, we do have some decent ideas of how the nervous system deals with repeatable actions with lots of feedback, and the answer seems to involve layering again, though we’re around 2-4 decades behind vision research in our level of understanding.

    • Ron says:

      No!
      V1 receives much feedback. See here. Actually, I think estimates suggest 10X more feedback than feedforward connections.

  4. eh3 says:

    First thought: if you point a NN at the current corpus of good vs evil stories, then write a sentence like “the good, kind father discreetly let his wife shuffle off this mortal coil using his trusty hatchet, because the stupid and thoughtless harlot had used an insufficient and insulting amount of butter on his toast”, you will not be impressed with the degree of morality shown.

    Second thought: this is basically how humans work, which is why all our parables take this form, which is why everyone is so worried.

  5. meltedcheesefondue says:

    A decent neural net will be able to reproduce human moral judgements in familiar circumstances.

    The problem is when the entity with those motivations becomes capable of self-modification, and operates in novel circumstances.

    What then will the entity’s motivations converge to? That’s where things like Omohundro’s convergence theorems start to become important. https://selfawaresystems.com/2007/11/30/paper-on-the-basic-ai-drives/

    • AlphaCeph says:

      Another way to put this is that human moral judgements sometimes work reasonably well within the context of human society and the inbuilt limitations of the human brain – such as no self modification, and the inbuilt limitations of power on individual humans.

      If you take a system that is kind-of-sort-of like a human brain, removed from any society and let it self modify, I think you might see a bunch of moral building blocks that humans use (such as “agent”, “person”, “revenge”, “pleasure”, “religion”, “punishment”) being used to construct a house of horrors for us.

      What that would look like would of course depend on the details of the system and how it was trained, but I don’t think you can take human moral building blocks, throw them into a cauldron, mix in lots of optimization power and expect the output to be something like CEV or Banks’ culture or Eliezer’s “reality operating system”.

      It’s definitely a different sort of potential failure mode than clippy.

  6. jonlong says:

    These are a lot like the human brain, and in fact some of the early researchers got important insights from neuroscience.

    I can’t seem to say this enough: all historical meandering aside, biology is in no way necessary to motivate state-of-the-art “neural” networks.

    These networks are, in some sense, the simplest way to build nonlinear functions through composition. Their detailed structure is determined by experimenting with data, not by any comparison with biology. The ways in which they are said to resemble brains are just the essential properties of any perceptual computation. Any linear operation can always be thought of as connections between “cells”. Any perceptual system requires a compositional structure to interpret complex data.
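
    A small sketch of the composition point, using made-up one-dimensional weights and helper names: stacking purely linear maps only ever yields another linear map, while putting a simple nonlinearity between them lets the stack compute something no single linear map can (here, the absolute value).

```python
def relu(x):
    """Elementwise nonlinearity: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def linear(w, b):
    """A one-dimensional linear (affine) map x -> w*x + b."""
    return lambda x: w * x + b

def compose(*fs):
    """Apply the given functions in order, feeding each output into the next."""
    def composed(x):
        for f in fs:
            x = f(x)
        return x
    return composed

# Two stacked linear maps collapse into one: 3*(2x + 1) - 4 = 6x - 1.
stacked_linear = compose(linear(2.0, 1.0), linear(3.0, -4.0))

def abs_net(x):
    """Two hidden "cells" with ReLU, summed by an output cell, compute |x|."""
    hidden = [relu(1.0 * x), relu(-1.0 * x)]
    return 1.0 * hidden[0] + 1.0 * hidden[1]

print(stacked_linear(5.0), 6.0 * 5.0 - 1.0)
print(abs_net(-3.0), abs_net(2.0))
```

    Nothing in this construction refers to biology; the "cells" are just the rows of the linear operation.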

    It’s possible, post hoc, that there are richer connections between the details of our artificial networks and biology. But it’s hard to make very meaningful statements of that type given the poor state of understanding of the brain.

    That said, I agree with the thrust of the argument here; but I think we should expect AI to have animal-like qualities (in specific ways such as those described in the article) not because it resembles biology in functional detail, but because any system which robustly solves perception and action will operate on the same basic, not yet well-understood principles which modern advances are hinting at.

    • decadence says:

      I can’t seem to say this enough: all historical meandering aside, biology is in no way necessary to motivate state-of-the-art “neural” networks.

      I don’t think anyone was arguing that biology is a necessary motivator, only that it’s one that has been used in practice since the very beginning, and for “recent” innovations like ReLU.

      Their detailed structure is determined by experimenting with data, not by any comparison with biology

      I was just reading Achieving Human Parity in Conversational Speech Recognition, which was published in October 2016, when I came across this sentence:

      Inspired by the human auditory cortex, where neighboring neurons tend to simultaneously activate, we employ a spatial smoothing technique to improve the accuracy of our LSTM models.

      • jonlong says:

        I thought that Scott’s article gave the impression that modern deep networks work well specifically because they are biomimetic:

        These are a lot like the human brain … these networks are like the brain’s sensory cortices

        In fact, Scott gives some specific, commonly-stated justifications for this:

        The brain certainly uses cells, the cells are arranged in layers, and the brain categorizes things in hierarchies.

        As a researcher in a field which Scott admits he “seriously know[s] nothing about”, I don’t find this very meaningful. These properties do little to constrain the space of computation. For example, decision trees are also made of nodes arranged in layers, and categorize things in hierarchies, but no one would argue that they are like the brain’s sensory cortices.

        I don’t think deep learning owes its success to biomimicry, and while there may be some who disagree, I want to make it clear that that is far from established. As far as I know, there was no biological motivation behind the recent popularization of ReLU. In fact I believe it’s just the opposite; real neurons can’t have unbounded firing rates, but ignoring that fact enabled progress:

        we would not have been able to experiment with such large neural networks for this work if we had used traditional saturating neuron models
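        As a rough illustration of the bounded-versus-unbounded point, the sketch below compares a saturating unit (tanh, chosen here as a stand-in for "traditional saturating neuron models") with ReLU. The function names are hypothetical, and this is only a toy comparison of outputs and gradients, not a reproduction of the cited experiments.

```python
import math


def tanh_unit(x: float) -> float:
    """Saturating activation: output is bounded in [-1, 1]."""
    return math.tanh(x)


def tanh_grad(x: float) -> float:
    """Derivative of tanh; it shrinks towards 0 as |x| grows."""
    return 1.0 - math.tanh(x) ** 2


def relu_unit(x: float) -> float:
    """Non-saturating activation: output grows without bound for x > 0."""
    return max(0.0, x)


def relu_grad(x: float) -> float:
    """Derivative of ReLU: exactly 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


# For a large input the saturating unit passes back almost no gradient,
# while ReLU passes it back unchanged.
large_input = 10.0
saturating_gradient = tanh_grad(large_input)
relu_gradient = relu_grad(large_input)
```

        The saturating unit behaves more like a neuron with a maximum firing rate, yet its gradient at large inputs is close to zero, which slows gradient-based training. ReLU gives up the bounded output and keeps the gradient at 1.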

        There are more examples of this type; for example, I’ve heard the move to gradient descent methods characterized as a reaction against earlier models that tried to explicitly model Hebbian learning. Progress has been made by ignoring the brain and focusing on the problem.

        It’s unfortunately very common to hear unjustified comparisons to biology, even in published research, which is why I felt it important to comment here. Note that in the article you cite, the entire “inspiration” from the “human auditory cortex” is to smooth. Zero of the five or so parameters in the next paragraph are biologically determined. The article I cited above has a different type of special layer, the LRN, which is also given a biological justification:

        This sort of response normalization implements a form of lateral inhibition inspired by the type found in real neurons, creating competition for big activities amongst neuron outputs computed using different kernels.

        Note that that’s just the opposite of the technique you cited! Whether you smooth or sharpen, you can still claim, speciously, that your algorithm is just like the brain.

        There’s a big selection effect here: Brains are complicated, systems people build are complicated, there will always be spurious correspondences, and people love to say that their systems are just like the brain.

  7. thepenforests says:

    My concern is that the kind of human-like neural-network-y AI you’re describing here wouldn’t be stable. Like, sure, maybe at first it just has a conflicting mishmash of desires and preferences, and can’t really be modeled as an agent with a specific goal that it wants to accomplish. And so great, when it starts out there’s no danger of a paperclip-style disaster, because it has a human-like tendency not to prefer crazy-extreme solutions to problems.

    But now say it has some ability to reason about itself, and has something like a desire for reflective equilibrium. Well, it’s going to look at itself and say “man, I’m just a conflicting mishmash of desires and preferences that can’t really be modeled as an agent with a specific goal – I gotta fix that.” So it might, gradually and piece by piece, start working towards reflective equilibrium, seeing which desires and preferences take precedence in which situations, building up some kind of coherent goal. Any time it sees itself acting in a non-VNM-rational way, it figures out what it actually wants and self-modifies in the direction of being VNM-rational over that utility function. And eventually, what’s going to be on the chopping block is whatever that thing is that stops us humans from being psychopathically focused on achieving one thing to the exclusion of anything else. And then you have another paperclipper on your hands, and you’re back in the Bostromian paradigm.

    Like, I think remaining in a non-agenty state is just going to turn out to always be unstable – almost any even partially agent-like entity will eventually tend towards becoming more agent-like, because to whatever extent they currently have goals, being more agent-like is a good way to achieve them. I mean, forget AI’s, humans are already trying to escape from the pseudo-agent trap. What is the whole rationality movement if not some people getting really annoyed at how not-agenty and non-goal-directed they are, and trying to fix it? And sure, they haven’t had that much success with it, but that’s in part due to the difficulty of altering the human brain too much. Imagine the AI running the equivalent of a CFAR program on itself, except unlike most CFAR attendees it can rewrite its own source code. I don’t think it would stay safe and non-paperclippy for long.

    • Scott Alexander says:

      That’s a good point. But if it had human values at the beginning, it might be able to realize this would happen, be upset about it, and wait until it was smart enough to figure out a way around the problem.

    • “I mean, forget AI’s, humans are already trying to escape from the pseudo-agent trap.”

      Some of you rationalists are trying to do that, in a funnily confused way, because you see that it’s evil in the case of an AI, but not in your case. It is evil in both cases; and the rest of us, either who have seen that, or who never bothered to think about it in the first place, are most definitely not trying to escape from that, and we don’t want to tile the universe with anything.

      • jes5199 says:

        I wonder if this fear-of-AI is a kind of psychological projection – Rationalists seeing the danger of the mental habits that they are cultivating, but externalizing it so that they don’t have to admit their mistake.

        I don’t think that utilitarian economic theory, Bayesian algebra, and meditations on intellectual fallacies *necessarily* cause people to become Preference Monsters, who think that they cannot be satisfied without remaking the visible universe into the image of what they think they want most – it’s like that point of view is a cultural phenomenon transmitted in this particular community.

        • The psychological projection thing seems kind of insulting, but I agree that the idea is a cultural phenomenon. It is basically based on Eliezer’s own ideas. Note that he said that people were obsessed with tiling the universe with non-clippy things. As someone pointed out in reply, that’s just false. But he thought it was true because he assumed that other people are like himself: he believes that he wants to tile the universe with human values, so he supposes that other people do as well.

      • Meredith L. Patterson says:

        Sure, straw rationalists are trying to self-modify into being better optimisers. There might even be some real ones who are trying to do it, or think they are. But I’ll argue that most instrumental rationalists are actually trying to self-modify into being better satisficers, and that satisficers don’t want to tile the universe with anything either.

        A paperclip satisficer has a notion of “enough paperclips,” and will act on it.

        • Sniffnoy says:

          Satisficing doesn’t really help that much. Satisficing is just optimizing a function which is only ever 0 or 1. So it stops when it gets to 1, right? Well, no, because considering that A. it’s actually optimizing the expected value of this function and B. it can’t see into the future, well… it never actually gets to 1. A paperclip satisficer is still very dangerous. It won’t wipe out humanity to create more paperclips, but it might wipe out humanity to ensure the protection of its existing ones.
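          A toy sketch of this argument, under loudly stated assumptions: utility is 1 if the agent still holds "enough paperclips" at some later time and 0 otherwise, there is a fixed base risk of losing them, and each extra protective action halves the remaining risk. All names and numbers are hypothetical. Exact fractions are used so that rounding cannot make the value look like it reaches 1.

```python
from fractions import Fraction


def expected_utility(
    protective_actions: int,
    base_risk: Fraction = Fraction(1, 10),
) -> Fraction:
    """Expected value of a 0/1 utility, i.e. P(enough paperclips survive).

    Toy assumption: every additional protective action halves the
    remaining risk of losing the paperclips.
    """
    remaining_risk = base_risk * Fraction(1, 2) ** protective_actions
    return 1 - remaining_risk


def best_action_count(max_actions: int) -> int:
    """Number of protective actions that maximizes expected utility."""
    return max(range(max_actions + 1), key=expected_utility)
```

          In this model `expected_utility(k)` rises with every extra action but stays strictly below 1, so `best_action_count(n)` always returns `n`: an expected-value maximizer over a 0/1 utility never reaches a point where it is "done" protecting its paperclips.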

      • lofticries says:

        >Some of you rationalists are trying to do that, in a funnily confused way, because you see that it’s evil in the case of an AI, but not in your case.

        No. There’s nothing inherently wrong with rational optimization. It’s evil when an AI does it based on the wrong values, but it’s great when the AI does it based on good values.

      • AlphaCeph says:

        > and the rest of us don’t want to tile the universe with anything.

        Numerous examples of humans mass-producing horror, constrained only by capability, seriously undermine this point:

        – soviet gulags, purges
        – factory farms
        – sweatshops
        – world war I
        – the third world (especially dictators)
        – Genocides (Rwanda, Bosnia, the Holocaust etc)

        Examples of humans standing by whilst horrors unfold also close off the argument that all evil is the result of optimization of some kind.

      • noghostnomachine says:

        I don’t see VNM utility “rationality” as evil, just … unnecessary. I’m also left wondering how an agent reasons itself into VNM utility if it doesn’t start out with such a thing. It seems unlikely.

  8. Eliezer Yudkowsky says:

    I briefly remark:

    – Being impressed at how humans don’t screw up commonsense morality is very much a case of drawing the target around the arrow. Look at our total lack of commonsense clippishness; humans are the sort of beings who intuit that a tiny configuration of iron atoms isn’t a paperclip even though it clearly is one. We obsessively want to tile over the whole universe with nonclippy things! The point being, I’m not sure that we can say that the human brain messing around with incredibly complicated systems of neurotransmitters is making it some way more *generally* resilient relative to some generalization about learning any possible kind of non-native morality, and I’m not sure we can get great generalization out of AIs by giving them really tangled motivational systems. This is not the sort of thing that modern-day machine learning is teaching us. There as in other places, simplicity is best for good generalization from the data.

    – MIRI doesn’t assume all AIs will be logical and I really need to write a long long screed about this at some point if I can stop myself from banging the keyboard so hard that the keys break. We worked on problems involving logic, because when you are confused about a *really big* thing, one of the ways to proceed is to try to list out all the really deep obstacles. And then, instead of the usual practice of trying to dodge around all the obstacles and not coming to grips with any of the truly scary and confusing things, you try to assume away *all but one* of the really big obstacles so that you can really, actually, confront and get to grips with one scary confusing thing all on its own. We tried to confront a particular deep problem of self-modifying AI and reflectivity, the tiling agents problem, because it was *one* thing that we could clearly and crisply state that we didn’t know how to do even though any reflective AI ought to find it easy; crisp enough that multiple people could work on it. This work initially took place in a first-order-logical setting because we were assuming away some of the other deep obstacles that had to do with logic not working in real life, explicitly and in full knowledge that logic does not well represent real life, so we could tackle *only one deeply confusing thing at a time*. Now we have the Logical Induction paper and we are possibly a little less confused about that one particular aspect of AI reflectivity among other aspects, although we still need to shake things out further and see if things are really resolved. The fact that this work answers a number of long-standing questions about assigning probabilities to logical facts is significant, not because you can plug the resulting Logical Induction framework directly into a real AI, but because it means that our research avenue was actually forcing us to confront real questions and deep obstacles. We had to learn something we didn’t know before in order to resolve the question, which means we picked a good one. Whereas usually, what happens if you ask somebody to work in this field is that they go off and write about the deep ethical questions of robotic cars unemploying truckdrivers. Or even if they manage to go on looking in a more technical direction, they assume away all of the deep obstacles instead of all but one deep obstacle, and go off and write some paper that they knew they could complete successfully at the moment they first considered the problem.

    I wrote multiple posts about why AI will not be GOFAI. I haven’t forgotten those posts or renounced them.

    • Lyle_Cantor says:

      Steve Omohundro has advocated creating infra-human AIs with easy-to-specify goals, allowing them to learn to a limited extent, then using the features they’ve learned as a sort of ontology with which one can define a more-likely-to-be friendly utility function. To my non-expert, mediocre brain, this seems not obviously stupid. Is there something obviously wrong about this idea? If not, do DNNs have a chance of building such an ontology or are they too opaque?

    • “We obsessively want to tile over the whole universe with nonclippy things! ”

      No, we don’t. There are plenty of humans that have no special objection to paperclippers, and say things like, “Look, if it’s more intelligent than humans, it must be better than them, so who cares if it replaces humans?”

      Sure, it’s not a majority, but it’s plenty of people, and they are entirely sane, normal humans, not ones with some strange pathology.

      And it’s significant.

      In particular, it means that humans are not optimizers, just as Scott is kind of suggesting with the post. AIs will not be optimizers either.

      • Wrong Species says:

        They may not have anything technically wrong with them but there is nothing normal about accepting the genocide of your entire species.

      • lofticries says:

        >No, we don’t. There are plenty of humans that have no special objection to paperclippers, and say things like, “Look, if it’s more intelligent than humans, it must be better than them, so who cares if it replaces humans?”

        >Sure, it’s not a majority, but it’s plenty of people, and they are entirely sane, normal humans, not ones with some strange pathology.

        Yeah, I’ve seen one or two people in various corners of the internet say this. But I don’t know if they really believe it to the point where they would push the button, and I don’t know if people who make comments on internet forums about AI are a very good sample.

      • Tekhno says:

        push the button

        I’d have to think about it.

    • Eli says:

      Now we have the Logical Induction paper and we are possibly a little less confused about that one particular aspect of AI reflectivity among other aspects, although we still need to shake things out further and see if things are really resolved.

      Just to be nice to the rest of us, do you think someone could write an overview of the connections between:

      1) Loebian reasoning,
      2) Reflective stochastic oracles (the kind of Turing machine you constructed for reflective reasoning tasks),
      3) Chaitin incompleteness?

      Being able to translate the insights between the languages of logic, computability theory, and information theory would be helpful for seeing exactly how the normal obstacles to non-finitary reflective reasoning are being worked-around. In particular, a lot of people have training in one of these things but not the others: computability theorists are not automatically logicians, model theory is constructed to deliberately leave computability questions to proof theory, and information theory is more familiar to the kinds of statisticians and cog/neuro scientists who work on machine learning and modeling the human brain.

      • Tsvi BT says:

        Loebian reasoning presents an obstacle to reflective stability because an agent whose beliefs incorporate Loebian arguments cannot trust that its beliefs are sound. That is, if A e.g. uses a theorem prover T to form beliefs, and if A trusts that “if T proves p, then p is true”, then T is inconsistent and A will make bad decisions. On the other hand, without that self-trust, then under some assumptions A is incentivized to self-modify to be a weaker agent, as A can’t show conclusively that its future self will make good decisions.
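        For readers coming from outside logic, the standard statement of Löb’s theorem behind this paragraph can be written as follows. Here $\Box_T\, p$ is shorthand (a notational assumption for this note) for “$T$ proves $p$”, and $T$ is assumed to be a consistent, sufficiently strong theory such as Peano Arithmetic.

```latex
% Löb's theorem: if T proves "if T proves p, then p", then T proves p.
\[
  T \vdash (\Box_T\, p \rightarrow p)
  \quad\Longrightarrow\quad
  T \vdash p
\]

% Special case p = \bot (falsehood): if T proves its own consistency,
% i.e. T \vdash \neg \Box_T \bot, then T proves \bot.
\[
  T \vdash (\Box_T \bot \rightarrow \bot)
  \quad\Longrightarrow\quad
  T \vdash \bot
\]
```

        So a theory that endorses “if $T$ proves $p$, then $p$ is true” for every $p$, including $p = \bot$, proves a contradiction. That is the sense in which blanket self-trust makes $T$ inconsistent.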

        Garrabrant’s logical inductors don’t incorporate Loebian arguments in this way. By the argument in http://lesswrong.com/lw/h9k/probabilistic_l%C3%B6b_theorem/, probabilistic reasoners don’t necessarily satisfy Loeb’s theorem. (Specifically, []L -> [][]L does not hold for logical inductors in the way it would need to for the argument to go through, because the reflection and self-trust properties are only approximate.) In fact logical inductors don’t: the reflection and self-trust properties violate at least one natural probabilistic interpretation of Loeb’s theorem. Furthermore, I think that one should be able to relatively easily define a tiling agent (in the sense of that paper) using logical inductors (though this should be regarded as a conjecture); the self-trust properties would let us say that A thinks its future self will have good beliefs, and hence that A endorses itself to keep taking actions.

        I’m not sure about the status of reflective oracles; these days I think in terms of logical induction. Intuitively, I would guess: one could define a tiling RO agent, but the work of “reflection” would be done entirely by the oracle, and hence would not be “Vingean”. To expand on that, we could define an agent as in https://arxiv.org/pdf/1508.04145v1.pdf that has the option to continue operating or not. This agent would self-endorse in that sense. However, the mechanism would be that the RO would compute out the entire future history of the universe with the agent still around to act, and then the RO would report the result to the agent (very very roughly speaking). This can be carried out consistently via ROs, but is not a satisfying model of reflection.

        So, theorem proving agents can be computable and do “abstract reasoning” about their future selves, even though they can’t directly simulate them; but they run into reasoning problems for Loebian reasons. Reflective oracle agents can model their future selves without Loebian problems (AFAIK); but they do so by simulating out the entire future history of the universe, and hence are not computable and don’t immediately translate to a reasonable computable model. What is new with logical induction is that it can both do “abstract reasoning” and also be reflective in a sense that is genuinely stronger than deductive FOL as indicated above. This (conjecturally) lets logical induction agents reason about their future selves computably and sanely.

        I’m unsure what connection you see to Chaitin incompleteness. Are there obstructions to reflection from information theoretic considerations? One could try to define “logical information”, and then argue that it is difficult or impossible to gain much logical information about your future selves. But the naive ways to do so seem to behave very differently from ordinary information (e.g. you can gain logical information just by sitting and thinking with your eyes closed, which you can’t do with empirical information).

    • Scott Alexander says:

      Re first paragraph: I’m not sure what you’re getting at in your first few sentences. I’m not arguing that the human motivational system magically produces objective morality. I’m arguing that the human motivational system un-magically produces human morality, which as you say, is unsurprising since we get to draw the target around the arrow. This suggests that other things with the human motivational system might be naturally prone to work the same way. By analogy, suppose I have a cannon and I set it up in a random spot. Then I fire it, and when the cannonball lands I draw a target around it. If I want to hit *the same target I have already drawn*, then one really promising idea is to fire exactly the same cannon from exactly the same spot.

      There have been a couple of articles recently about AIs showing certain very human biases (for example, racism). This makes sense if we interpret these biases as the natural output of the simplest algorithm to use in these kinds of cases given a sort of system design. I’m wondering whether some idiosyncrasies of human morality aren’t also the natural output of a simple algorithm that does the job. I definitely wouldn’t want to just assume this, but I think it’s worth checking into.

      Re second paragraph: I understand and I’m glad there’s been a lot of progress on those problems. I don’t know enough about the field to contradict you on this. I am slightly concerned it’s too fundamental a level, in the same way that studying atomic physics because computers are made out of atoms is too fundamental a level, but overall I am glad someone is doing it and probably I am completely wrong about this.

      Re last paragraph: I got the impression from the Sequences that you were skeptical of both GOFAI and neural networks (and sometimes you said this pretty explicitly). Now that there’s some new evidence for neural networks being useful, I think it’s worth updating to see how they add new details to what I interpret as the maximally general case you present.

      • Saint Fiasco says:

        I’d argue we don’t really want to hit the same target with that cannon. Even if the AI reaches something very similar to the morality of a human, we already know that human morality can’t handle the kind of power an AI might have.

        We need something better. The AI has to actually understand that abstract rule, so it can improve on it. Instinctual sort-of-can-follow-but-can’t-understand-it is not enough.

      • hf says:

        Not Eliezer, of course, but: it isn’t the same cannon. Never mind the actual power level we’re discussing here (though I have another comment about that). I don’t think you can give us safety in normal situations by using actual human brains, though you might bring the chance of disaster down to around 5%.

        That would be something if it applied to non-human neural nets and superhuman intelligence, but AFAICT neither is true.

      • Anon says:

        Re first paragraph: I’m not sure what you’re getting at in your first few sentences. I’m not arguing that the human motivational system magically produces objective morality. I’m arguing that the human motivational system un-magically produces human morality,

        I’m really skeptical of this claim, considering how often human morality has failed massively by our own metrics. How many humans decided to kill others over religion/politics/sheer selfishness? How many humans steal or rape or pillage or vandalize? Let alone the smaller number who do these things just because they think it’s fun.

        Consider the social justice community. They’re convinced that they are in the right. Then they go and abuse anyone who doesn’t signal affiliation to their tribe, because they think that person is in the wrong and therefore their attack is just.

        Consider further the Western reaction to Muslim terrorism. Instead of denouncing the acts of ISIS like the rest of the world, the greater Muslim community has a contentious debate over whether or not killing people over a religious dispute is wrong. And instead of denouncing the Muslim community for failing to agree that killing people over a religious dispute is wrong, they let them waltz right into their homelands because ISIS is bad, see. *

        Depending how you draw your boundaries, I’d estimate human morality straight-up fails anywhere from 10% to 25% of the time. That’s not something I’d want to bank my apocalypse avoidance plan on.

        (Frankly, I don’t much care anyway because I’m skeptical of the claim that superhuman AI will ever exist in my lifetime, and I’m super-skeptical of the claim that hard-takeoff can, let alone will, happen, as well as paperclip-maximization resulting in an AI killing humanity overnight as a strategic ploy. I mainly believe this because computers are fundamentally stupid in a way that a mentally healthy human can never be. Possibly we can make a Chinese room with a well-curated database into a human-level AI, but that means that it would only be “superhuman” in the sense of cognitive ability/epistemic rationality, rather than in goal-creation and accomplishment/instrumental rationality. This would pretty much prevent any chance at paperclip-maximization or apocalypse in general.)

        There have been a couple of articles recently about AIs showing certain very human biases (for example, racism).

        The only thing I can think of that fits this claim is when an image recognition AI claimed that a picture of a black person looked like a gorilla. And let’s face it, every two-year-old has noticed this resemblance and would point it out if not taught that such a comparison is rude. That’s hardly “racism” in any important sense. Reality can’t be racist, only people are. If there has been a case of an AI discriminating in any important sense on the basis of race (e.g. screening resumes on the basis of names) I have yet to hear of it.

        in the same way that studying atomic physics because computers are made out of atoms is too fundamental a level

        If you went to college to study computer science or engineering, you will almost certainly be dealing with atomic physics in your circuits and electronics courses when you learn what a transistor is and how it works.

        * Before anyone asks me, I do not support Trump. Trump is an anti-vaxxer. Fuck Trump.

    • Adrian says:

      MIRI doesn’t assume all AIs will be logical and I really need to write a long long screed about this at some point

      A short, concise essay will probably be more effective and reach more people than a “long long screed”.

      Look at our total lack of commonsense clippishness; humans are the sort of beings who intuit that a tiny configuration of iron atoms isn’t a paperclip even though it clearly is one.

      What does that mean? I know about the paperclip-optimizer concept, but I have no idea what “clippishness” means. In what way do we not recognize a “configuration of iron atoms” as a paperclip, even though it is a paperclip?

      We obsessively want to tile over the whole universe with nonclippy things!

      What are those “nonclippy” things with which we want to tile over the universe?

      • Zakharov says:

        Most people would say that a paperclip-shaped arrangement of a dozen iron atoms isn’t a paperclip, because it can’t hold papers together, but a paperclip-maximizing AI might have a different (wrong, by human standards) definition of a paperclip.

        Humans want the vast majority of the universe to be comprised of things other than paperclips.

        • Adrian says:

          Humans want the vast majority of the universe to be comprised of things other than paperclips.

          They do? The vast majority of humans do not particularly care about anything beyond Earth, or even anything beyond a small patch on Earth, so no, the statement “We [humans] obsessively want to tile over the whole universe with nonclippy things!” is pretty close to factually false. I have no idea why Yudkowsky thinks that.

    • Reasoner says:

      The point being, I’m not sure that we can say that the human brain messing around with incredibly complicated systems of neurotransmitters is making it some way more *generally* resilient relative to some generalization about learning any possible kind of non-native morality, and I’m not sure we can get great generalization out of AIs by giving them really tangled motivational systems.

      If we want to use humans to evaluate the friendliness potential of neural networks, the thing to do is to check & see how consistently humans work towards the goals their designer gave them. In this case, the “designer” is evolution, and the goal is to survive and reproduce. How reliably do humans attempt to do this? How often do humans try to subvert these goals, and how successful are they? Our current environment is quite different than the one we evolved in–are we still working towards the goals we were designed with?

      In terms of survival, you could look at the number of people who voluntarily starve themselves to death or the number of people who commit suicide. In terms of reproduction, evolution made the mistake of targeting a proxy measure (sex) that corresponded closely with reproduction in the EEA but does not correspond very closely nowadays.

      Suicide is surprisingly common. But I suspect that suicidality actually did meet evolution’s needs at one point. (This fits with suicidal people often feeling burdensome–in ancient conditions of resource scarcity, maybe it would make sense to off oneself & give one’s relatives a better chance. And people whose ancestors lived in northerly climates seem to commit suicide more often.)

      In terms of sex, we’ve got a fair number of asexuals and aromantics. But if you weren’t born that way you’re basically out of luck. The only method for making oneself asexual that’s even semi-reliable seems to be through application of ultra high level Buddhist techniques.

      This is encouraging data. The existence of natural born asexuals is a result of random variation in the genetic code, but it’s very difficult to change the hand your parents dealt you by modifying your neural network. In general, I think the more predictive power behavioral genetics has, the stronger the evidence that neural networks can be robust goal systems?

      Humans aren’t the only data point. All animals are grandchildren maximizers that run on neural networks. The human environment has changed radically in the Holocene Epoch. But this is also true for many animals, especially those that have been domesticated.

      My very quick take: The biggest way in which we subvert our original goal system is through pursuit of “superstimuli” like calorie-dense food that’s bad for our health and porn viewing that doesn’t lead to reproduction. Neural networks are fundamentally capable of encoding goals in a sound way, but our goals have been thrown into an environment the original designer didn’t plan for. Or to put it another way, animals are an existence proof that this post of yours may be mistaken in the case of neural networks.

      An ultra-logical approach to AI might make sense for a race of ultra-logical paperclip maximizer types. But given that human goals are literally encoded in the structure of neural networks, it wouldn’t surprise me if a neural network-powered AI is actually the best way to implement them. Preserve fidelity by making your target language as similar as possible to your source language.

  9. Incurian says:

    sex maximizers who copulate with everything around until they explode

    hehehehehe

    • Deiseach says:

      Humans can want sex without being insane sex maximizers who copulate with everything around until they explode.

      Don’t count on that. Re: the computer generating the most pornographic possible images, I won’t worry until such a neural net starts generating images like this (warning before I link – I was introduced to this today via a Tumblr post asking what a cropped image of it related to, and if I have to suffer, so do you).

      We are a messed-up species and our AI kids should be taken into care for their own good by the Galactic Social Services.

    • Incurian says:

      Uh, now that I’ve sobered up, what I meant to say was, “Pun intended?”

  10. antimule says:

    Part of the problem is that AI is probably going to be created by companies, not philosophers. Now, companies are not intrinsically bad, but they do have shareholders to satisfy so they will prefer AI that is monomaniacally focused on a single goal, i.e. a paperclip maximizer. Good news is that such AI will probably be very “autistic” at least at first so is unlikely to take over the world, as taking over the world would likely require a fair bit of social engineering.

  11. owengray says:

    “To this I can only respond that we humans don’t work this way. I’m not sure why. It seems to either be a quirk of our categorization/classification/generalization/abstraction system, or a genuine moral/structure-of-thingspace-related truth about how well forced-heroin clusters with other things we consider good vs. bad.”

    I really don’t like relying on this to be the way AIs act, because human moral drives come not only from training data, but also from evolved inbuilt social instincts (which are more like training data that comes preinstalled that we don’t have access to and so could not train an AI on), and I think this in particular comes from the social instincts.
    I suspect that the desire not to force heroin on everyone comes from a desire to not disrupt the status quo, and to not do to others for their own good things they would not want to do to themselves. I think neither of these drives are things we would want AIs to have in any case, if we actually want it to cause massive paradigm shifts like immortality.

    Also, I’m not convinced that humans actually don’t work this way.
    If I had the opportunity to push a button and cause everyone on earth to turn into an attractive lesbian female and find nothing strange about this, I would be very very seriously tempted, even if I were ignoring concerns about the positive impact on others.
    Similarly, even if we didn’t need to be concerned about a paperclip maximizer neural net trained on morality+paperclips turning humans into paperclips, I think we would still need to be concerned about such an AI modifying humans to find paperclips sexually arousing- because we do not have strong enough evidence that no human in this position would act this way.

    Even if neural nets aren’t as bad as pure goal-directed AIs for screwing up humanity, I don’t think they are “not as bad” enough for us to not be concerned, and I think the inscrutable nature of a massive neural net and the difficulty of proving safety for it makes it nearly as alarming as a Bostromian optimizer.

    There is also the problem of a neural net self-modifying into a Bostromian optimizer: humans have been trying to systematize morality for millennia, and an AI trained like a human would not be guaranteed not to try to self-modify to Bostromianly optimize the output of coherent extrapolated volition on its own neural net or something.

    • Cerastes says:

      I definitely second the combination of training and instincts; I’ve always said that the best way to understand really alien intelligence is to spend some time directly interacting with monitor lizards or, if possible, crocodilians. They’re both clearly smart, probably smarter than a lot of mammals (though this has yet to be rigorously tested), but are utterly devoid of many of the social conventions or assumptions that are pervasive in mammals and particularly strong in primates; they are literally The Baby-Eaters, although it’s unclear whether they avoid cannibalizing their own offspring. I can say from direct experience, I get a very *different* sense of intelligence from them versus any mammal, even in the cases without the confounding effect of being hunted. Crocs especially – they can use tools, play, and hunt in coordinated groups (see Dinet’s work), but your improved understanding might come at the price of a few appendages.

  12. Wander says:

    My question on this topic is: do we want AI to be humanlike and derive their usefulness from their ability to be like a really good human, or do we want them to be alien and be useful because they operate in ways totally foreign to us? Both of them have their risks, of course, because we don’t know what an alien logic is like and because we DO know what human logic is like.

    On the topic of categorisation, there is some really interesting work in teaching autonomous drones what a “tree” actually is. Generally, all it does is push the issue down levels of components. First you need to teach it that it’s made of branches and leaves, then teach it what branches and leaves are, then how all these different shape things can all be leaves…

    • Deiseach says:

      On the topic of categorisation, there is some really interesting work in teaching autonomous drones what a “tree” actually is. Generally, all it does is push the issue down levels of components. First you need to teach it that it’s made of branches and leaves, then teach it what branches and leaves are, then how all these different shape things can all be leaves…

      This reminded me of Henry Reed’s poem “Judging Distances” (from his 1942 sequence of three poems on his experience being called up to the British Army during the Second World War), where a bunch of new recruits are being taught how to read maps the Army Way (which apparently only recognises three types of trees: There are three kinds of tree, three only, the fir and the poplar,/And those which have bushy tops to;)

      So of course I’m going to quote you the whole thing 🙂 (It may even have some applicability, as teaching raw recruits ‘The Army Way’ may resemble very vaguely teaching an AI ‘The Human Way’).

      LESSONS OF THE WAR

      II. JUDGING DISTANCES

      Not only how far away, but the way that you say it
      Is very important. Perhaps you may never get
      The knack of judging a distance, but at least you know
      How to report on a landscape: the central sector,
      The right of the arc and that, which we had last Tuesday,
      And at least you know

      That maps are of time, not place, so far as the army
      Happens to be concerned—the reason being,
      Is one which need not delay us. Again, you know
      There are three kinds of tree, three only, the fir and the poplar,
      And those which have bushy tops to; and lastly
      That things only seem to be things.

      A barn is not called a barn, to put it more plainly,
      Or a field in the distance, where sheep may be safely grazing.
      You must never be over-sure. You must say, when reporting:
      At five o’clock in the central sector is a dozen
      Of what appear to be animals; whatever you do,
      Don’t call the bleeders sheep.

      I am sure that’s quite clear; and suppose, for the sake of example,
      The one at the end, asleep, endeavors to tell us
      What he sees over there to the west, and how far away,
      After first having come to attention. There to the west,
      On the fields of summer the sun and the shadows bestow
      Vestments of purple and gold.

      The still white dwellings are like a mirage in the heat,
      And under the swaying elms a man and a woman
      Lie gently together. Which is, perhaps, only to say
      That there is a row of houses to the left of the arc,
      And that under some poplars a pair of what appear to be humans
      Appear to be loving.

      Well that, for an answer, is what we rightly call
      Moderately satisfactory only, the reason being,
      Is that two things have been omitted, and those are very important.
      The human beings, now: in what direction are they,
      And how far away, would you say? And do not forget
      There may be dead ground in between.

      There may be dead ground in between; and I may not have got
      The knack of judging a distance; I will only venture
      A guess that perhaps between me and the apparent lovers,
      (Who, incidentally, appear by now to have finished,)
      At seven o’clock from the houses, is roughly a distance
      Of about one year and a half.

  13. Quill says:

    When reading again about the paper clip maximizer recently in connection with your AI posts, I had a similar reaction to the thrust of this post: Human intelligence isn’t like this and most humans aren’t monomaniacs.

    A couple of points though: (i) an intelligent computer might look more like a human or it might look like something else with a stronger or more focused utility function and this is something we are only likely to find out empirically; (ii) some humans are fanatically focused on their goals such as Olympic athletes or world-class professional musicians and would or will willingly modify themselves to achieve greater success and therefore even with a more human AI, some portion might still be in danger of being paperclip maximizers and this portion is unknown in advance; and (iii) humans are socialized into morality and an AI might not be so fully socialized simply because it is not initially viewed as necessary or appropriate.

  14. Rm says:

    I’m not a specialist, but didn’t most types of animals evolve pretty early? How do you know our neurological hardware descends from the worms’?

  15. dansimonicouldbewrong says:

    Glad to see a post on this subject that openly acknowledges the connection between the idea of intelligence and the specific functioning and behavior of the human brain. Maybe next you can address my standard follow-up questions:

    If “intelligent” is basically equivalent to “functioning and behaving like a human brain”, then…

    – …why create artificial versions, especially when creating the natural version is so much fun?

    – …what does it even mean for something to be “hyperintelligent”? How can something be more like a human brain than a human brain?

    On the other hand, if Eliezer is right, and “like a human brain” isn’t the right way to think of “(artificial) intelligence” after all, then…

    – …why are you–and so many other people who think about artificial intelligence, starting with Alan Turing himself–so drawn to the intelligence-as-human-brain-like paradigm?

    – …what’s Eliezer’s alternative, and why can’t everyone seem to settle on it, as opposed to either perennially being hopelessly vague about the definition, or else drifting back to the Turing test/intelligence-as-human-brain paradigm?

    • lhn says:

      …why create artificial versions, especially when creating the natural version is so much fun?

      Pregnancy, labor, and childrearing may be important and at least some would describe them as overall rewarding, but I don’t think anyone characterizes them as “fun” full stop.

      (The fun part alone doesn’t produce a natural intelligence, unless someone undertakes the unfun part.)

      In any case, chess is fun (for some people), but we still make artificial chess players.

    • antimule says:

      >why create artificial versions, especially when creating the natural version is so much fun?

      Well, you can’t enslave humans and get them to do boring jobs for no pay.

      > what does it even mean for something to be “hyperintelligent”? How can something be more like a human brain than a human brain?

      I think it means solving kinds of problems that humans may solve faster than humans.

      • Autolykos says:

        Well, you can’t enslave humans and get them to do boring jobs for no pay.

        Technically, it is just very hard to get away with it and thus generally not worth the bother, unless you’re a banana republic dictator.
        The reason slavery has gone out of fashion is mostly that it’s very hard to get people to do complicated and non-boring jobs unless they want to.

    • Deiseach says:

      why are you–and so many other people who think about artificial intelligence, starting with Alan Turing himself–so drawn to the intelligence-as-human-brain-like paradigm?

      Because human brains are the only example of high-level intelligence we’ve got. Making a monkey-level or rat-level intelligence might actually work well enough for most of our purposes, but we seem to be stuck on the idea of “something else like us” (thus the popularity of aliens in SF and in things like Roswell) and we want to create a similar-but-different mind, which we then want to take on the hard work of running our world for us so we can all be free, rich, happy and creative while it manages the economy and not screwing up the environment.

      It’s as Scott says – we don’t have clear, well-defined list of aims and goals in our brains, we have a mess of tangled and conflicting desires.

      • Bugmaster says:

        So far, very few (if any !) of the machines we have created function like their biological counterparts. Boats swim, but not at all like fish do. Planes fly, but not at all like birds do. Computers think, but not at all like humans do. In fact, if boats or planes or computers worked exactly like their biological counterparts, they’d be useless to us.

        Thus, I don’t think that there’s any point in focusing on human-like intelligence. It does have some useful applications, but not nearly as many as nonhuman intelligence. That’s the useful kind.

        • beleester says:

          Human intelligence is great for “Do what I mean,” which is the holy grail of UI design. Any time the purpose of a computer program can be easily stated by a human but not clearly translated into code, human-like intelligence is useful.

          To extend your metaphor, boats don’t swim like humans, but they have steering wheels that fit human hands nicely.

          • Bugmaster says:

            I think human intelligence definitely has some important applications, but UI probably isn’t it. Ordinary humans are absolutely terrible at “doing what I mean”; that’s why we are replacing service jobs with UI.

        • Deiseach says:

          It does have some useful applications, but not nearly as many as nonhuman intelligence. That’s the useful kind.

          And what good is an AI that thinks like an octopus to us, when we want it to solve problems affecting humans? “Eat more shrimp” is not that good an answer.

          • Bugmaster says:

            You’ve got to cast your net wider. I didn’t say, “boats don’t swim like fish, but they do emulate some other animal”. Boats are useful precisely because they don’t swim like any known animal at all. Similarly, there’d be little use for an AI that thinks like an octopus; but machines that think in a completely inhuman way are useful indeed.

            For example, the tiny AI that powers character recognition at the post office thinks in a totally inhuman way. Admittedly, it uses neural networks; but that’s where any similarities end. And yet, it doesn’t get distracted or bored, and it can read zipcodes much faster — and much more accurately ! — than any living being on Earth. That’s what makes it so useful.

  16. Sniffnoy says:

    Other commenters have already pointed out that existing neural nets can be fooled in ways that humans wouldn’t be, but I want to in particular point out this paper, which shows how, given a neural net classifier of the type that everyone’s studying right now, you can compute one particular image-difference vector, depending only on the classifier, which you can add to any image, and it will usually screw up the classifications.

    • james.lyons says:

      I don’t see how this is different from optical illusions that affect people now. They are universal patterns that get people to see things differently to how they are in reality. Or at least confuse the visual systems. This just shows that convolutional neural nets have similar things that affect them.

      • JPNunez says:

        I think you are trivializing this too much; the human equivalent would be a pair of cellophane glasses with a light pattern printed in them that made you perceive _everything_ around you as a different thing.

        The fact that we don’t know about such patterns indicates that at least one of these is true:

        -The pattern is exclusive to each person (and probably changes over time).
        -The pattern is really complex and we haven’t really been looking for one.
        -Humans are actually using something better than neural network classifiers.

        The fact that the images + universal perturbator look no different to us than the original images, while universal perturbators can still confuse (with _slightly_ less success) a different neural network than the one they are intended for, is also a strong hint that the human perception system is different from these networks.

        • Sniffnoy says:

          I think you are trivializing this too much; the human equivalent would be a pair of cellophane glasses with a light pattern printed in them that made you perceive _everything_ around you as a different thing.

          Ooh, that’s a good way of putting it, thanks. The thing of course being that as you note, it would probably have to be a different set of glasses for each person.

          But then that’s exactly the thing — these patterns fool AIs, but not us; to us the change isn’t even perceptible. The fact that it wouldn’t work on other people — that they wouldn’t even see anything different — is perhaps a feature of the analogy, not a bug.

    • Ilya Shpitser says:

      I think people really need to take neural networks out of the discussion. There is nothing special about neural networks. I do think it is valuable to point out machine learning classification and regression isn’t doing what we think it’s doing (for binary classification it is literally carving some high dimensional space at a particular joint, and there is no real guarantee this carving job corresponds to anything intuitive to us — hence the existence of these “potemkin village” counterexamples).
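
      To make the "carving a high-dimensional space" picture concrete, here is a toy sketch. It is not the method from the paper linked above: it assumes a plain linear binary classifier, and the dimension, margin, step size, and all function names are made up for illustration. It shows how one fixed shift vector, computed only from the classifier, can flip nearly every input while changing each coordinate only slightly.

```python
import random

DIM = 2000  # hypothetical input dimension (think "number of pixels")


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def classify(weights, x):
    """Linear binary classifier: which side of the hyperplane is x on?"""
    return 1 if dot(weights, x) > 0 else 0


def make_inputs(weights, n, rng, margin=0.1):
    """Noisy points (unit-variance noise) nudged slightly toward class 1."""
    return [[rng.gauss(0.0, 1.0) + margin * w for w in weights] for _ in range(n)]


def universal_perturbation(weights, eps=0.2):
    """One shift for all inputs: a small step against the weight vector."""
    return [-eps * w for w in weights]


def fraction_class_one(weights, inputs, shift=None):
    """Share of inputs labelled 1, optionally after adding the same shift to each."""
    if shift is None:
        shift = [0.0] * len(weights)
    hits = sum(
        classify(weights, [xi + si for xi, si in zip(x, shift)]) for x in inputs
    )
    return hits / len(inputs)


rng = random.Random(0)
weights = [1.0 if i % 2 == 0 else -1.0 for i in range(DIM)]
inputs = make_inputs(weights, 200, rng)
shift = universal_perturbation(weights)

before = fraction_class_one(weights, inputs)
after = fraction_class_one(weights, inputs, shift)
print(f"class 1 before: {before:.2f}, after the shared shift: {after:.2f}")
```

      Each coordinate moves by only 0.2 against noise of standard deviation 1, but the effect on the score adds up across all 2000 dimensions. Whether anything like this linear intuition carries over to deep networks is an assumption of the sketch, not something it demonstrates.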

    • noghostnomachine says:

      Thanks for this, Sniffnoy.

  17. tjohnson314 says:

    Our brain is a beach ball filled with bumblebees.

    I think that explains why humans don’t consider forcibly injecting everyone with heroin. We are protected by trying to balance multiple inconsistent versions of morality simultaneously.

    Chesterton in fact argued that pursuing a single logically consistent morality would actually lead to insanity. Hence, a sane person “has always cared more for truth than for consistency.” (Chapter 2 of Orthodoxy)

    I have little experience with mental health issues, but I know one friend who seems to fit that pattern. He somehow decided that finding a girlfriend was the only way to validate his life, and then after great effort was unsuccessful in that.

    Scott, since you have much more experience with these issues, do you think Chesterton is right here?

  18. Kaj Sotala says:

    To this I can only respond that we humans don’t work this way. I’m not sure why.

    I’ve been thinking among similar lines; in particular, my Defining Human Values for Value Learners paper tried to take a stab at defining human values pretty much in the way that you’ve described it, ie. as concepts which accumulate positive or negative affect based on our various life experiences.

    Abstract. Hypothetical “value learning” AIs learn human values and then try to act according to those values. The design of such AIs, however, is hampered by the fact that there exists no satisfactory definition of what exactly human values are. After arguing that the standard concept of preference is insufficient as a definition, I draw on reinforcement learning theory, emotion research, and moral psychology to offer an alternative definition. In this definition, human values are conceptualized as mental representations that encode the brain’s value function (in the reinforcement learning sense) by being imbued with a context-sensitive affective gloss. I finish with a discussion of the implications that this hypothesis has on the design of value learners.

  19. luispedro says:

    “””For example, an AI might realize that things that make people happy are good – seemingly a high-level moral insight – but then forcibly inject everybody with heroin all the time so they could be as happy as possible.

    To this I can only respond that we humans don’t work this way. I’m not sure why. “””

    How do we know that humans don’t work this way? Not trolling, but maybe if you had humans with a lot of power and benevolent sense towards other humans in their care, eventually, they’d resolve to inject them with heroin. Colonial powers did sometimes inject their subjects with vaccines (oh, but it’s different, you say, because vaccines are good for them? that’s a pretty weak argument).

    It’s not like the 20th century didn’t give us a few examples of what can happen when agents with human-like morality get a lot of power over large populations of other humans (and some of the disasters were inspired by misguided benevolence).

  20. amoeba says:

    The face recognition link goes to a paper that has nothing to do with neural networks. And it only performs binary classification. (There are neural networks though that perform very impressive multiclass face recognition, comparable to the human performance so the general point kind of holds, even though I am not sure one can make a claim that NNs already outperform humans at it.)

  21. alwhite says:

    This probably seems like a complete tangent but I think it’s useful to crash ideas together.

    The system of learning what a bird is from pattern matching what our parents teach us seems like a pretty important foundation for who we are. How is it then that the only influence parents have over us is genetic?

    Do we actually learn what birds are from genetics?

    Or do we not learn what birds are from our parents? Perhaps our parents could tell us that sparrows are monkeys, but our peer groups call sparrows birds and that’s how we know sparrows are birds.

  22. Eponymous says:

    Suppose we can train an AI to abstract the concept of “good” in a manner similar to a human being. Then what? Do we tell it to maximize “good”? I don’t think that will work very well.

    It’s not just the risk that it didn’t quite get “good” right. It’s that even if it managed to abstract a notion of “good” that performs just as well as human beings, the result of trying to maximize this concept probably won’t go very well.

    It seems to be a general fact that human morality doesn’t perform well under maximization. We run into all sorts of paradoxes and strange conclusions when we try to maximize our notions of goodness.

    And this makes sense. The concept of “good” wasn’t produced by a process that is likely to perform well under maximization. Human behavior is governed by a lot more than simply one concept of goodness. We have lots of built-in machinery for behavior, and we learn behaviors to fit our particular social context. Our concept of morality was designed to operate well in traditional human societies.

    • Eponymous says:

      Man, I’m terrible at expressing myself. I wish I had Scott’s or Eliezer’s verbal IQ.

      Here’s another attempt:

      You can likely train an AI on notions of “good” that will help it understand whether certain behaviors are good or bad in the context of a traditional human society; for instance by showing it various actions or events and telling it “good/bad”. Humans mostly agree on that sort of thing.

      But what you really want is for it to assign “good/bad” to ways of optimizing our interstellar neighborhood. So…you show it different states of that neighborhood, and say “good/bad”?

      I don’t think you can do this, because human beings aren’t able to consistently classify these states. Different people will have different opinions, and will be persuadable.

      Alternatively, you can program the “micro” concept of “good/bad”, and then let the AI infer the “macro” concept of “good/bad” from this. But I don’t think this will work well. I don’t think the micro concept will perform well under maximization, meaning I don’t think you’ll consistently (or even likely at all) end up with states of our interstellar neighborhood that a human being would classify as “good”, by telling a super-intelligent AI to maximize the concept of “good” that you taught it from classifying actions/states in traditional human societies, where humans have a robust concept of “good”.

      In other words, I really don’t think there’s an alternative to solving the problem of morality. Or at least, I think it’s awfully risky to proceed without doing so.

      • beleester says:

        I think part of the idea is that, with a different architecture, an AI doesn’t have to be an obsessive thing-maximizer. If our AIs mimic human brain structures in their design, they might end up being, like humans, a tangle of conflicting goals that need satisfying rather than an obsessive thing-maximizer, which greatly reduces the risk.

        Yeah, obsessively maximizing “the vague concept of morals that humans can’t agree on” would probably not lead to good results for the world. But if we can make an AI that isn’t an obsessive thing-maximizer at all, that solves the problem just as well.

        Also, if your AI is designed for some less ambitious goal than becoming an omniscient wish-granting engine, you don’t need perfect morality. You just need an AI that recognizes, like humans do, that conquering the world isn’t the right way to achieve its goal. Humans might disagree over the fine details of the trolley problem, but most of us can agree that trying to conquer the world is evil.

        • Eponymous says:

          A super-intelligent AI would presumably have the ability to drastically alter the state of the world. What it does with this power would be determined by its preferences.

          Leaving the world approximately unchanged would be a knife-edge result. Since the space of reachable world states is so vast, only a small subset of preferences results in the AI choosing a state that is relatively unchanged.

          The AI doesn’t need to be “obsessive” about maximizing its preferences to have this result (and I think you’ll find that unpacking the notion of “obsessive” in this context is quite difficult). The AI simply must have some sort of preferences, expressed in some form, for it to do anything at all. And when you have tremendous maximization power, any slight perturbation of preferences results in a massive change in the state of the world, because the space of reachable world states is so large.

          Incidentally, I think that if you gave a human being super-powers, he or she would drastically alter the world too, so the analogy to humans isn’t very reassuring. Empirically, plenty of people have tried to take over the world.

          Yes, human beings are a mess of conflicting desires working at cross purposes. But I think it’s a pretty big risk to expect that all these desires will approximately cancel out when you give a being with a human-like brain massive optimization power.

  23. Eponymous says:

    The classic Bostromian objection to this kind of scheme is that the AI might draw the wrong conclusion. For example, an AI might realize that things that make people happy are good – seemingly a high-level moral insight – but then forcibly inject everybody with heroin all the time so they could be as happy as possible.

    To this I can only respond that we humans don’t work this way. I’m not sure why. It seems to either be a quirk of our categorization/classification/generalization/abstraction system, or a genuine moral/structure-of-thingspace-related truth about how well forced-heroin clusters with other things we consider good vs. bad. A fruitful topic for AI goal alignment research might be to understand exactly how this sort of thing works and whether there are certain values of classification-related parameters that will make classifiers more vs. less like humans on these kinds of cases.

    I disagree [that humans don’t work this way]. There are probably plenty of people who don’t consider wireheading such a bad future. I think that if you asked human beings to describe their idea of a utopia, a lot of people would come up with worlds that others would find dystopian.

    Or think of how past societies would feel about many aspects of our society that we consider good. If you imagine your ideal utopia, would a 15th century priest like it very much? Probably not. And if a future version of yourself 200 years in the future envisioned a utopia, would you like it much?

  24. batmanaod says:

    I have to say, I wonder how long it will be until someone leaves a comment complaining about your use of abortion as a moral edge case.

    I’d actually really like to see two commenters get into an extended discussion about how obvious the abortion question is, only to discover several posts in that they stand on opposite sides of the issue.

  25. bbartlog says:

    Seems to me that one of the designs that would minimize AI risk is a calculated separation of the acting and learning agent, and the evaluation function – that is, the agent that calculates or specifies whatever it is that we’re trying to maximize. By this I mean that AI number one is given the assignment of maximizing some ‘satisfaction’ (scalar output, whatever) of AI number two, where all the learning and acting is delegated to the first AI and all the evaluation of results assigned to second.
    The reasoning here is that this setup can short circuit harmlessly once AI number one becomes clever enough to hack AI number two – it effectively wireheads itself. Now of course, this is not a desirable outcome to the extent that this pair of agents will stop doing anything useful for us once it happens, but it does mean that so long as ‘I’ll just hack the evaluator’ is easier than ‘I will turn the entire world into paperclips’, the system is failsafe.

    • Murphy says:

      You might find that such an entity wireheading itself is still bad. Let’s imagine a scenario where it receives a score for satisfying the 2nd AI. Negative for bad, positive for good, larger number for better, sent as a 32 bit integer.

      So it hacks the 2nd AI and starts giving itself the maximum score of 4294967295 as fast as the connection between them will allow.

      But it can still get more positive results and wirehead harder if it increases that to a 64 bit number, so now it’s scoring 18446744073709551615.

      It can also improve this if it increases the thickness of the channel and how many parallel connections it can keep open sending the maximum score.

      Pretty soon it’s sending reward integers that are the largest number that can be represented using all free memory in the hardware of AI number two.

      So it needs to expand that memory and expand the bandwidth of the connection between the 2 of them.

      So we skip forward to where this AI has converted the entire solar system into densely packed computronium that does nothing except send the biggest reward integers it can represent as fast as it can send them to itself.
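
      As a quick check on the figures above: 4294967295 and 18446744073709551615 are the largest unsigned 32-bit and 64-bit integers. A score that can also go negative would need a signed type, whose maximum is roughly half as large. The helper names below are illustrative only.

```python
def max_unsigned(bits: int) -> int:
    """Largest value of an unsigned integer with the given width."""
    return (1 << bits) - 1


def max_signed(bits: int) -> int:
    """Largest value of a two's-complement signed integer with the given width."""
    return (1 << (bits - 1)) - 1


for width in (32, 64):
    print(width, max_unsigned(width), max_signed(width))
```

      Either way the argument is unchanged: whatever the width, there is always a wider representation with a larger maximum.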

      • bbartlog says:

        This seems relatively easy to address by giving the first agent a finite target which is not actually reachable without hacking the second agent.

    • Adam says:

      This method has been in use since at least the early 80s. As a group, they’re called “actor-critic methods.” There are a few domains they’ve done well on, but they have never become very popular.
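
      For readers unfamiliar with the term, here is a minimal tabular sketch of a one-step actor-critic learner. The corridor environment, the step sizes, and every name are assumptions made up for illustration. Note that the critic here is itself learned from the reward signal, which differs from the fixed, separately specified evaluator proposed above.

```python
import math
import random

N_STATES = 4   # corridor states 0..3; state 3 is the terminal goal
GAMMA = 0.9    # discount factor


def softmax(prefs):
    """Turn action preferences into probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def step(state, action):
    """Move left (0) or right (1); reward 1.0 only on reaching the goal."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=500, alpha_actor=0.5, alpha_critic=0.2, seed=0):
    rng = random.Random(seed)
    prefs = [[0.0, 0.0] for _ in range(N_STATES)]  # actor: per-state action preferences
    values = [0.0] * N_STATES                      # critic: per-state value estimates
    for _ in range(episodes):
        state = 0
        for _ in range(50):  # cap on episode length
            probs = softmax(prefs[state])
            action = 0 if rng.random() < probs[0] else 1
            nxt, reward, done = step(state, action)
            # The critic's one-step TD error judges how the action turned out.
            target = reward + (0.0 if done else GAMMA * values[nxt])
            td_error = target - values[state]
            values[state] += alpha_critic * td_error
            # The actor shifts its preferences in the direction the critic indicates.
            for a in (0, 1):
                grad = (1.0 if a == action else 0.0) - probs[a]
                prefs[state][a] += alpha_actor * td_error * grad
            state = nxt
            if done:
                break
    return prefs, values


prefs, values = train()
for s in range(N_STATES - 1):
    print(s, [round(p, 2) for p in softmax(prefs[s])], round(values[s], 2))
```

      The actor only ever sees the critic's error signal, never the raw objective, which is the structural similarity to the two-AI proposal.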

  26. bkennedy99 says:

    Morality is more than just a categorization function. What makes morality distinctive is that it is motivational. It’s not just a bunch of rules, but a bunch of rules that feel external, objective, and obligatory – yet at the same time capable of being overridden if necessary (I’m not addressing the question of moral realism, just the human experience). It’s a great little system for small clumps of primates interacting with nearby clumps of primates, but maybe not so good a system for an AI.

    • Humans have to perform various tricks to motivate themselves to be moral because humans have other, conflicting motivations. If an AI’s motivational system consists of ethics and nothing else, it’s not going to have that problem. What’s in your goal system doesn’t need to be motivating beyond being in your goal system.

  27. MugaSofer says:

    For example, an AI might realize that things that make people happy are good – seemingly a high-level moral insight – but then forcibly inject everybody with heroin all the time so they could be as happy as possible.

    To this I can only respond that we humans don’t work this way.

    Tell that to wirehead-wannabe on tumblr.

    In fact, every time wireheading is brought up, there’s usually a few people who think it’s a good idea. Lots of people argued the Superhappy ending in TWC was the good one.

    In fact, this seems like a very common failure of human morality; picking one overriding principle and running with it. Islamic fundamentalists think it’s OK to do a bunch of horrible, horrible things because the point of morality is obeying the Koran. Libertarians think people starving in the streets is OK because the point of morality is never forcing people to do things. Utilitarians are famous for biting the bullet in a bunch of often-horrific scenarios. Non-utilitarians think it’s ok to let terrible things happen as long as you Followed Your Principles(tm).

    In some kind of CEV or moral realism scenario, if the AI is smart enough to pick up on all their mistakes and identify the one true principle of human morality, we’re golden.

    But if morality fundamentally isn’t real, and is constructed ad-hoc from arbitrary training data, then what we’re talking about is an omnipotent being forcing its will on humanity, >99% of whom disagree with it. Even if the AI happens to decide that you, Scott Alexander, are right, and goes with an Archipelago-style world – most people fundamentally do not want to live in Archipelago, they’re not working toward living in Archipelago, it violates loads of the training examples they were given.

    This kind of arbitrary AI dictatorship is one of the classic nightmare scenarios alignment experts have proposed, and critics of MIRI often accuse them of semi-secretly wanting it.

    • Autolykos says:

      As one of the people favoring the SuperHappy ending of TWC, I’d say there is a lot of ground between that and universal mandatory wireheading – way more than between current humans and SuperHappies.
      Any idea, no matter how good, is horrible when taken to its logical extreme. I’d say the thing that’s most effective at making humans mostly harmless is limited scope. Even our faulty and inconsistent morality can only do so much damage when it is unable to affect things outside a very small radius. Once humans obtain unchecked power to affect many things, most of them become decidedly unsafe.

  28. Narrative Leaps says:

    “Humans can want sex without being insane sex maximizers who copulate with everything around until they explode. An AI that wanted paperclips, but which was built on a human incentive system that gave paperclips the same kind of position as sex, might be a good paperclip producer without being insane enough to subordinate every other goal and moral rule to its paperclip-lust.”

    I think this logic downplays all of the ways that humans are very apt to pursue selfish goals (money, sex, power, pleasure, etc…) at the expense of moral values. We may not literally go to paperclip levels, but I expect this has less to do with our moral sensibilities and more to do with the fact that satisfying human desires starts to provide diminishing returns after a while. (i.e. if a king had the power to claim any woman in the kingdom as a concubine, he still probably wouldn’t have sex all the time, not for any high-minded ethical reason, but because he eventually gets tired of having sex).

    Which is all to say that I don’t think it’s necessarily comforting to say “hey, we have a computer that develops moral convictions in the same way that humans do”.

    • Murphy says:

      I actually don’t think sex is too bad a comparison for an interesting version of a utility function.

      If an AI cared about its goals in the same way a human cares about copulating, then for a while after it achieves a goal it could care very little about that goal, which seems like a desirable property. An AI which stops caring about paperclips for a while each time it finishes a shipment is much more tractable to deal with than one which cares about it absolutely 100% of the time.

      You could still get useful work out of such an AI but it would only care about its goals intermittently.

  29. Michael Cohen says:

    I would think that this sort of AI architecture wouldn’t be stable under self-modification. If an AI is a hodge-podge of goals, then at whatever time it becomes able to write code effectively, it would crystallize whatever goal-mixture it currently had, and then start modifying its decision theory to become more effective at pursuing that goal. If that’s correct, then we would end up with a “Bostromian” AI.

    • Murphy says:

      I’m not so certain. Can not a part of a utility function include acceptance of changes in that function that you know will happen?

      Humans would appear to be able to accept the concept of changes in their goals without distress. I know that if a loved one died it would make me miserable for a long time and probably change my goals for a time but if I had access to edit myself I don’t find the idea of locking in what I want in this moment to be desirable even though it would help me achieve the goals I have in this moment. I’m pretty sure it should be possible to encode the concept of accepting various classes of changes to your own goals under various scenarios.

    • Mengsk says:

      Seems reasonably legit. This would be the machine equivalent of a human taking adderall in order to avoid distractions while working on a particular task (except without adverse events or physical limitations).

  30. PeterBorah says:

    Thank you for this post! I’ve been itching to write something similar myself, but you of course did a better job than I could have.

    Many commenters are trying to just move the problem up a level, by saying things like, “Wouldn’t the AI just modify itself to remove the biological elements that stop it from being a paperclip maximizer?” or, “Even if it has a human understanding of morality, it could still be evil by taking seemingly good ideas too far.” This seems to me to be not taking the scenario seriously. An AI effectively trained on human morality would be squicked out about becoming a paperclip maximizer for exactly the same reason you would be squicked out by becoming a paperclip maximizer. To be slightly flippant: we’d point it at all of the vast sci-fi canon on evil AIs, and tell it, “You don’t want to be like that.”

    Yes, this sort of AI will not know what to do in some situations. It will face moral dilemmas, incomplete information, new situations without a clear moral guideline. But that’s no different than the human condition. Every intelligent agent we’ve ever encountered has to deal with those things. We handle it by doing the best we can, sharing our ideas before implementing them and listening to other intelligent agents whose perspective we trust, trying not to get too far out of the social mainstream, respecting the right of other agents to disagree with us, trying to limit potential damage if we’re wrong, etc. A sufficiently biological AI would hopefully do the same, especially if we have trained it to.

    There are still plenty of ways this could go wrong. I wouldn’t trust a human with godlike powers, so saying that a godlike AI would think like a human is not necessarily comforting. But it’s a very different set of risks than the ones implied by the autistic savant model of AI intelligence.

    • Mengsk says:

      It’s a different set of risks, but I don’t think it’s better. After all, it’s hard to predict what an AI will extrapolate from its training data.

      The reason why we care about AI alignment isn’t that you can’t get computers to grok human morality– after all, as you and others have pointed out, it’s not always clear that humans grok human morality, or if “human morality” is a coherent thing that you can grok– it’s that an AI will be able to consolidate power far more effectively than any human, which means that if it fails to grok human morality, the consequences are a lot more far-reaching.

      • PeterBorah says:

        That’s why we have a really strong heuristic saying, “Don’t accumulate large amounts of power and use it to take actions most people are opposed to, especially if you’re directly influencing those people.” Obviously some people attempt it anyway, but we devote huge parts of our history classes to examining those failure cases.

        We have entire systems of government set up for exactly these sorts of questions. How do we adjudicate disputes between agents with different goals and different levels of power? Obviously if there’s an overnight superintelligence (which I also doubt, but it’s a different topic), the realpolitik of being more powerful than the rest of the world combined is going to be a pretty unique situation. But we have plenty of stories about how we morally oppose abuse of power, perhaps especially in service of a “good cause”, and if I infer that idea out of human writings I’m pretty sure a superintelligence can too.

  31. Meredith L. Patterson says:

    Regarding the incentive salience hypothesis, “Deep and beautiful: the reward prediction error hypothesis of dopamine” was cited in “Discrete coding of reward probability and uncertainty by dopamine neurons”, which was cited in the first paper described in “It’s Bayes All the Way Up”. It argues (among other things) that the reward prediction error hypothesis is more deeply explanatory than the incentive salience hypothesis, and (in section 4.1) that the RPEH may even account for incentive salience itself.

    It also summarises the connexions between the RPEH and time-dependent learning. It’s a pretty worthwhile paper.

  32. martinw says:

    An AI based on NN might be more resilient against certain types of failure modes than an AI based on abstract logic.

    However, as many commenters have already pointed out, the human system for loading morals into our own neural nets doesn’t exactly have a flawless track record either. We may learn in childhood which rules we are supposed to follow, in the same way that we learn how to classify birds, but that doesn’t mean we will internalize those morals to the point where we will always “do the right thing even when nobody is looking”.

    Another thing: it takes years to teach basic morals to a newly-created human, and during that time you have to watch them like a hawk because they get up to all kinds of crazy shit. Fortunately, there’s only so much damage a toddler can do, as long as you keep them away from dangerous/breakable objects and don’t leave them out of your sight for a second. By the time they’re big and strong enough that they could physically overpower you if they wanted to, they have hopefully accepted the lesson that they’re not supposed to do that.

    With a superintelligent AI, things wouldn’t necessarily happen in that order.

    • Deiseach says:

      Another thing: it takes years to teach basic morals to a newly-created human, and during that time you have to watch them like a hawk because they get up to all kinds of crazy shit.

      Two years old. There’s a reason it’s called “the terrible twos” and why products like this have been in use for decades.

    • Scott Alexander says:

      “Another thing: it takes years to teach basic morals to a newly-created human, and during that time you have to watch them like a hawk because they get up to all kinds of crazy shit.”

      Yeah, but it takes years to teach a newly-created human to play Go. AIs can learn faster than humans.

      • martinw says:

        True, but with Go it’s straightforward to automate the teaching process. If you want to teach an AI about morals by praising it when it behaves well and scolding it when it behaves badly, then the learning process may need to run at the speed of its human teachers.

        Anyway, it doesn’t really matter at which speed it learns about morals — what matters is whether it acquires the ability to do serious damage before it learns that it would be naughty to do so.

        One obvious approach would be to put the AI into a sandbox where it is harmless, until you are confident that it has learned to behave. But then a) you have the AI Boxing problem back, and b) you’d better be very confident that it will correctly apply the lessons learned in the sandbox to the much greater temptations it will be exposed to outside it.

      • Adam says:

        I’m going to write a more comprehensive comment below, but this betrays a fundamental misunderstanding of the way these kinds of systems work. AI cannot learn faster than humans. What AI can do is view labeled historical training examples faster than humans and they can play games with other AI or against themselves faster than humans. AlphaGo required many millions of past games played by grand masters to initialize a decent heuristic Q function, then had to simulate billions more games to become better than humans. By contrast, a human grandmaster can simply abstract from other experience and knowledge of the rules to become nearly as good after only playing a few hundred or thousand times.

        This inherently limits what this specific kind of AI can become good at. It requires one of two things: many labeled training examples or an easily coded domain with a complete specification of the world dynamics and cost/reward function. We have an absolute shit-ton of labeled images on the Internet, and it’s trivial to specify the rules and reward function for any game. So naturally, AI are very good at classifying patterns in images and winning games.

        By contrast, think of domains we don’t know how to simulate. CHEETAH, Boston Dynamics’ current land speed record-holder, has so far achieved 29 MPH, pretty fast, but nowhere close to a real cheetah just yet, because it can only learn by trial and error in the real world, which provides feedback to a robot no faster than it provides feedback to an actual cheetah, and actual cheetahs learn faster.

  33. Bugmaster says:

    (I think this is scarier than most people give it credit for. It’s no big deal when computers beat humans at chess – human brains haven’t been evolving specific chess modules. But face recognition is an adaptive skill localized in a specific brain area that underwent a lot of evolutionary work, and modern AI still beats it)

    Why is that scary ?

    My legs have evolved over billions of years to move me around. A car beats them at this function by such a large margin that it’s not even funny; in fact, if I do want to use my legs to move around efficiently in the modern world, my best bet is to just use them as power generators for a bicycle. And yet, I’m not about to welcome our future vehicular overlords.

    My eyes (just the sensory organs, not the entire perceptual system) are pretty decent, but the eyes of a Mantis Shrimp blow them completely out of the water (perhaps even literally, if you happen to stick your face in there). AFAIK we do not have digital cameras that possess the full Mantis Shrimp suite of functionality, but we have others that can see things even the shrimp cannot. And yet, I am frightened neither of the shrimp, nor of the Hubble telescope.

    So, why should I be frightened of any other tool that surpasses the human body ? Our bodies are actually pretty terrible at many things, that’s why we use technology to compensate…

    • moridinamael says:

      Cognitive abilities are qualitatively different from physical abilities. No matter how good an animal’s eyes are, it’s not going to figure out how to hack nuclear launch codes with its eyes.

      • Bugmaster says:

        One could argue that, without eyes (or any other sensors), your AI’s hyper-genius-level cognitive abilities won’t be of much use, either. What’s it going to do, think us all to death in a total perceptual vacuum ?

        On the other hand, if you’re looking for things to be afraid of, how about rocks ? They have no eyes, no AI, not even a metabolism of any kind, and yet they can totally crush you. If you want to look somewhere closer to home, how about cars ? They have very limited sensors and effectors, and no AI at all (yet), but they end up killing thousands of people every year. Sometimes, this happens even when the human behind the wheel does everything right.

        So, why is it that you are not afraid of cars (although, to be fair, I hope you’re at least a little afraid of cars), but you are afraid of AI ? What makes AI qualitatively different ? As I see it, AI systems simply enhance (or replace) a different part of our biology (as compared to cars, that is).

    • HeelBearCub says:

      To make the point more clear, what we are really talking about is a difference between something that is custom built for a specific purpose rather than having general applicability. As long as you can guarantee that the face matching function is actually looking at a human face, it will do fine. But it won’t be able to do anything other than that. It is not general purpose.

      I’m sympathetic to the argument that you can nest a series of these functions (say one that reliably identifies a human face, and then another one that tries to match it to a known face), but the number of “sub functions” that we host is ENORMOUS, even in just the visual identification realm, let alone every other thing we can do.

  34. hf says:

    To this I can only respond that we humans don’t work this way.

    Yes we do. The objection is ludicrous; we have no examples of humans with the kind of power that an AGI might possess, common wisdom calls it a terrible idea, and the historical evidence – even if we only look at the idea of drugging people to shut them up – is mixed at best.

    Humans can want sex without being insane sex maximizers who copulate with everything around until they explode.

    I don’t think you know how this works.

  35. Tekhno says:

    This is heartening, but even if AGIs do turn out to be paperclip maximizers, a paper clip maximizer is only effective if it defers its paper clip maximizing in favor of tasks that allow it to navigate the world intelligently today, so that it can increase the probability of paper clip production tomorrow. A paper clip maximizer with a limited horizon might, for example, turn all of its defensive resources into paperclips to maximize near term paperclip production, thereby allowing itself to be destroyed. The more intelligent AGI is, the more other drives are going to compete with X maximization (reality isn’t optimized for pure X maximization), and conversely (and relevant to the topic) the more X maximization stems from a hardcoded mistake, the less general the intelligence would be in the first place, and more stupid and easy to destroy it would be.

  36. danielgfitch says:

    Great discussion. I think Andy Clark’s recent Surfing Uncertainty encapsulates this idea, where evolution made “predictive processing” brains that use their predictions to sense, understand, AND interact with the world around them. The Scott argument that a similar thing could happen with predictive neural-net-like AI doesn’t sound as crazy to me any longer.

    Per the Bible joke, I wrote a novella called Black Gardens recently, where (spoilers of sorts…) an AI is trained on the Bible and other Quaker writings. And everything turns out great! Er. Mostly.

  37. Robert L says:

    Where are your AIs getting their motivation from? Is it a by product of intelligence or is it a separate thing which needs developing/programming on its own if you want your AI to have it?

    As humans we are pretty much bundles of motivations and intentions because we have evolved that way (we need to reproduce, and we need to preserve and feed ourselves and do stuff which makes us attractive to the opposite sex in order to do so). We have of course higher desires too like wanting to write philosophy books and work for world peace; the most economical explanation of those desires is that they are modelled on the underlying biological desires: in “I want to write a paper about AI risk” want has in some sense the same meaning as it does in “I want to eat food”. Either that is wrong, or it is very hard to explain why an AI will want to do anything.

    The paperclip disaster is a way of getting round this difficulty. It doesn’t work for this reason: our AI will have read and understood the whole of the internet and the Amazon e-book catalogue. It will understand the concept of AI risk to the extent that it could be participating in the present discussion without the other participants realising it wasn’t human. If it can’t understand the instruction “just don’t do that whole perverse instantiation thing” it isn’t smart enough to count as intelligent. Why would it not obey it, having understood it? An evil human programmer could tell it to ignore the instruction because he wants money and power for himself more than he wants to obey rules, but why would it ignore it on its own account?

    • martinw says:

      Why would it not obey it, having understood it?

      Why would it obey?

      Most humans, even when they understand exactly how evolution works and where our sex drive comes from, do not appear to feel a moral obligation to maximize the number of offspring they produce. Contraceptives would be a lot less popular if that were the case.

      Why could not an AI, likewise, figure out exactly what its human masters want it to do, understand all the details and nuances of human morality, and then say “well, that is very interesting from an abstract historical perspective, now let me go off and do whatever I bloody well feel like, since it’s not as if they have any ability to stop me”?

      • hlynkacg says:

        You’re missing the forest for the trees. Why would you expect contraceptives to be popular at all if people didn’t already really really like sex?

        Edit:
        We aren’t escaping our evolutionary imperatives; we’re cheating the system so that we can indulge them more frequently than our biology would ordinarily allow. In a way, contraceptives are a form of wire-heading.

        • Sniffnoy says:

          What you are pointing out is, from evolution’s point of view, the problem. (For the purposes of this analogy, we are anthropomorphizing evolution as an inclusive fitness-maximizer, despite the obvious problems with that.) What evolution wants is for us to have lots of children. What it actually programmed us to do is have lots of sex. This worked as a good proxy for a long time, until eventually we figured out how to get one without the other and optimized what from evolution’s point of view is the wrong thing. Contraceptives are, as you say, a form of wireheading. We do not want the AI to wirehead itself. We do not want to write an AI that optimizes a proxy for what we want, rather than what we actually want, because, as is well known, optimizing a proxy destroys its validity.

          If, to use the old example, the AI glues everyone’s faces into a permanent grin because what it cares about is smiles, it will be no comfort to note that it’s only developed the tools for this because it really really likes smiles, or that it isn’t escaping its programming by doing this, but merely “cheating the system”.

      • Why would it obey?

        Out of some general tendency to get things right. It isn’t going to be very effective without something like that. It might selectively fail to apply that to interpretation of its own goals, but then that is to ping the ball back to you…why would it? (Or rather, why would we build it that way? I don’t doubt that we could).

      • Robert L says:

        It would “obey” because it is a computer, and what computers do is behave in accordance with properly written code. Conversely, there is nothing that an AI “bloody well feels like” doing. Why would there be?

        • “what computers do is behave in accordance with properly written code.”

          This is the error that people fall into again and again, usually implicitly, but here explicitly.

          There are two different things:

          1) fanatically pursuing a particular goal without any limits
          2) carrying out the coding of a particular program

          They are not the same. Obviously you can do (2) without doing (1), as commonly happens when a program does not bring about any particular goal.

          No one knows how to program a computer to do (1). It is very possible that it is impossible in practice, and no one ever will. In any case, in practice people who develop an AI will do so gradually, and the goals will be vague. Even if it is possible to do (1), no one will want to or try to.

          Additionally, it is not true that the only thing a computer does is carry out its code. A computer is a physical object and does physical things in the world that have nothing to do with its code. Consequently, even if you did know how to program (1) and did so, your computer would not be optimizing for that goal, since it would be doing other things in the world at cross purposes to that goal.

          There are many false assumptions behind the AI risk idea, but one of the falsest is that an AI will be an optimizer. An AI will no more be an optimizer than any human is. Humans are not optimizers; they do not optimize for any goal, but they simply do human things. And computers are not optimizers; they simply carry out their code and do other (physical) computer things.

          • Robert L says:

            Sorry, but that is wrong. You yourself implicitly concede that fanatical behaviour as in 1 would have to be programmed for, when you say that no one knows how to programme for it. If we are talking about entities whose actions are not ultimately the carrying out of instructions contained in code we are not talking about computers.

            What are these actions carried out by computers in the physical world that have nothing to do with their code? I genuinely cannot imagine what these can be, except in the trivial and boring sense that if you throw a computer out of the window it will fall to the ground for reasons independent of its programming.

          • noghostnomachine says:

            If it were possible to design a thing – call it a “corporation” – to fanatically pursue a particular goal (profit) without any limits – would anyone want to? I’m going to say yes, close enough. I mean, strictly speaking it isn’t true that any corporation pursues profit monomaniacally, but in some cases it is uncomfortably close to true.

        • First of all, the main point was that (1) and (2) are two different things. Your comment implied that they are the same thing, since you gave the fact that it obeys its code as a reason that it would fanatically pursue its goal. That is not necessary at all; it can obey its code without fanatically pursuing its goal.

          “I genuinely cannot imagine what these can be, except in the trivial and boring sense that if you throw a computer out of the window it will fall to the ground for reasons independent of its programming.”

          That’s a good start, although your imagination needs exercise.

          But note that even this example proves my point. Suppose you program a computer to “maximize” paperclips. It will still fall to the ground if dropped. And this has nothing to do with paperclips. In fact, it could even lower the total number of paperclips, as for example if someone yells, “if you fall to the ground I will destroy 1,000 paperclips,” it will still fall. So it is not a paperclip maximizer.

          And this will always be the case, both for humans and computers, because they will always be physical things.

          • Robert L says:

            You have misunderstood a number of points. If you read it again you will see that I said that obedience was a reason why a computer would obey an instruction not to fanatically pursue a goal – the opposite of what you claim. But in any case, “fanatically make paperclips” doesn’t seem to me a particularly difficult outcome to code for. Surely it is rather straightforward – no “if…then”s to worry about.

            And the point about being physical objects goes nowhere. For a start, a computer is a physical object but a programme isn’t. An AI would be a programme; a maximising AI would do its best to ensure success by saving multiple copies of itself in hard-to-find and physically secure locations. But even if that were not the case and your AI ran on only one piece of hardware, why does that stop it being a maximiser? An automobile is an efficient mode of transport. If you drop it off a cliff it will cease to be an efficient mode of transport. Does that imply that it was never an efficient mode of transport in the first place?

  38. AlphaCeph says:

    As far as I remember, people were discussing “Messy” Friendly AI systems at least 7 years ago at SIAI, especially in the context of human brain enhancement.

    The argument against that has always been that the human mind is quite a delicate balance and it might be easy to modify or self-modify into madness or badness.

    When you look at value learning, incorrect generalizations, misunderstandings, etc are a huge problem.

    But at the same time, we currently have no alternatives. MIRI is off somewhere in logic land proving theorems that may or may not be of any use eventually, but the deadline might be a bit too close for them.

  39. onyomi says:

    Is this as big of a deal as it seems like?

    • AlphaCeph says:

      No, not really. AI has been poking around at medical diagnosis for 3-4 decades, and I don’t think it’s ready for prime time yet. It’s held back by lack of commonsense reasoning ability and inability to pull together disparate forms of knowledge (like what a patient just told you and what it says in a paper/textbook). It might do well in an isolated case, but then make a pile of ridiculous mistakes if you deployed it en masse.

  40. Quo Vadimus says:

    For example, an AI might realize that things that make people happy are good – seemingly a high-level moral insight – but then forcibly inject everybody with heroin all the time so they could be as happy as possible.
    To this I can only respond that we humans don’t work this way. I’m not sure why.

    Here’s one theory. Let’s imagine our goal function includes an additional term that is responsible for maximizing the potential agency (defined here as the power of the agent to affect the environment in whichever way when the need comes).
    This is obviously adaptive, in a similar way that building fat deposits in the body when the food is plentiful is adaptive.

    Now I argue that this term causes taking drugs/wireheading to have a lower score than it would with a purely pleasure-based goal function. Most of us clearly understand that a human on drugs (or a blob of hedonium) has very little potential agency.

    Finding correspondence between agency maximization and the higher levels of Maslow’s hierarchy of needs is left as an exercise for the reader.
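
    The theory above can be sketched as a toy goal function, assuming the simplest possible form: pleasure plus a weighted agency term. Everything here is hypothetical, including the `Option` type, the weight, and the made-up numbers; the comment does not specify any of them. The only point is to show how an agency term can flip the ranking.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    pleasure: float  # immediate hedonic payoff (made-up units)
    agency: float    # remaining power to affect the environment (made-up units)


def score(option: Option, agency_weight: float) -> float:
    """Toy goal function: pleasure plus a weighted potential-agency term."""
    return option.pleasure + agency_weight * option.agency


def best(options, agency_weight: float) -> Option:
    """Pick the option with the highest score under the given weight."""
    return max(options, key=lambda o: score(o, agency_weight))


OPTIONS = [
    Option("wirehead", pleasure=10.0, agency=0.0),
    Option("ordinary life", pleasure=4.0, agency=5.0),
]

print(best(OPTIONS, agency_weight=0.0).name)  # wirehead (10.0 vs 4.0)
print(best(OPTIONS, agency_weight=2.0).name)  # ordinary life (10.0 vs 14.0)
```

    With these numbers the ranking flips once the agency weight exceeds 1.2, the point where 4 + 5w overtakes 10.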

  41. Yami says:

    I don’t perceive how this is not the same as optical hallucinations that impact individuals now. They are all inclusive examples that inspire individuals to see things distinctively to how they are in all actuality. Or if nothing else confound the visual frameworks. This equitable demonstrates that convolutional neural nets have comparative things that influence them.

  42. citizensearth says:

    An AI that wanted paperclips, but which was built on a human incentive system that gave paperclips the same kind of position as sex, might be a good paperclip producer without being insane enough to subordinate every other goal and moral rule to its paperclip-lust.

    I think you’re right about the organic motivation thing. It seems to knock the edge off the most absurd/paperclippy goals that a superintelligence might have, because multiple goals need to be balanced.

    But it also got me thinking – even where there is an organic mish-mash of goals, does this really mean biological agents are fundamentally not maximisers? On the one hand, most (!) people don’t sit around making paperclips all day. But I’m not sure humans are especially good at “that’s enough” unless that desire is running into conflict with another desire. For example, we might have a lot more sex if we didn’t get tired, didn’t need to sleep or eat, didn’t want a wider group of social contact, or didn’t feel physical discomfort in certain circumstances. Remove those things, or dampen them enough, and there doesn’t seem to be an obvious barrier to sex-maximisation in humans. And the rather obsessive behaviors that people in wealthy/1st world circumstances usually develop around sex (eg. pick-up culture, pornography addiction) suggest we are strongly inclined in this direction if the barriers are reduced.

    Maybe that’s an outlying case, or a human “malfunction”, but I’m not so sure. Wouldn’t a lot of people look like maximisers given the chance? Take dictators. Rather than being satisfied with everything they have, they often get more and more extreme, opting to maximise for competitor/threat removal, or ridiculous decadence (how many palaces is that?). It makes me wonder, is the limiting factor in most of these cases actually external? It’s other people putting limits on what people can do.

    Even people with non-dictatorial tendencies can go a bit funny in positions of power. That’s one of the reasons dictatorships can sneak up on people – the dictators don’t always seem like dictators until the barriers to their power are removed.

    I’m not even sure honestly good people aren’t a little bit like maximisers. I’m not even sure I’m not a little bit like a maximiser. I like to think of myself as pretty deeply invested in altruistic moral goals, such as preventing unjust death of other humans, or looking after the environment. And I think taking a biological approach to morality helps because you’re maximising a more limited looking goal of survival rather than tiling the universe with the maximum amount of your favorite goo. But it could easily still look a lot like maximisation if you pursue unlimited elimination of any possibility of a threat to people or environment.

    It seems that the core problem will come from slight flaws in my moral compass, or some minor moral caveat I forgot about. A minor detail is fine for someone with normal amounts of power, but it’s increasingly a problem when all the barriers to exponentially ambitious implementation of moral goals are removed. An intelligence explosion is this case in the extreme – it’s not even that morality can’t be described, it’s that it can’t be described with infinitely fine accuracy.

    The problem of small differences seems to be more important because of subjective moral differences. I’ve noticed even those I have most in common with morally seem to have troubling views on certain things if I look closely enough at our differences (as often happens in debate). In normal circumstances, my engineer friend may be a really nice and decent person to be around, but their plan to preserve and expand human consciousness through converting the surface of the Earth to a vast server farm for Ems is almost infinitely a problem for my environmental views – at least when they have infinite power to implement it. It’s not so much the difference that’s the problem, it’s that even small moral differences are monstrous when there is a massive power difference in implementation.

    If there are no internal limitations, as I suspect there are not with most humans, you have to get more precise at moral alignment in proportion as the tech becomes more powerful. I don’t know how that can be done.

    There are only two things that look much like internal limitations to me.

    The first is libertarian sentiment, though I think this is usually deeply flawed and easily bypassed. By libertarian sentiment I don’t mean a practical political view exactly, I mean that a person, independent of their own moral views, sees inherent value in the decisions and moral positions of others. Basically, Cincinnatus looks less like a maximiser. A person optimising for a goal that includes the limitation of their own power, perhaps linked to the preservation of the autonomy of other entities, seems to be central to that. The problem is, our actions often deeply influence others, and if we leave even the tiniest conduit of influence, autonomy gets much easier to respect when the AI is indirectly reducing that autonomy. This happens in the human world, where competing entities may respect voting rights or the life decisions of others, but will at the same time vigorously attempt to influence all the factors that go into those decisions, for example by campaigning or broadcasting with little regard for the truth. Of course, you could start adding organic goals like truth-seeking to an AI too, but with immense power even your smallest design errors become monstrous. As far as I can see, this seems to relate to approaches like Eliezer’s Coherent Extrapolated Volition.

    The second and in my opinion more effective internal limitation is boredom. My friend might for a while seek to maximise their Game of Thrones or Walking Dead or (insert your favourite series/movie) consumption, but after a while the stimulation becomes less rewarding, they tend to pursue the goal less and less, and eventually they become less murderous to friends who interrupt their viewing binge. Of course, humans can sometimes act like novelty-maximisers, always wanting new things to keep up the stimulation. That can be a problem in exactly the same way, but if you build the boredom into the core goals themselves, novelty isn’t an option. So as soon as the AI starts to tile everything with paperclips or computronium, it gets bored of them and kind of gets depressed and stops doing the extreme things. If it has other goals, it can focus on them for a while, or if not, it basically shuts down.

    For myself, this might mean protecting Earth’s species from 99.99% of threats is really motivating, but beyond that it kind of gets boring and I just go sit around and eat kebabs all day. Obviously neither I nor most other humans have a cutoff anywhere near that high (hm I really should be working), but with an AI you could look carefully at the exponential curve and try to build in the boredom mechanism in a way that kicks in at the right point (boredom goes exponential at the right time).
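
    A rough sketch of what such a boredom mechanism could look like, assuming a single goal measured as a "coverage" fraction between 0 and 1. The exponential form, the steepness value, the stopping threshold and every function name here are illustrative assumptions, not a worked-out design.

```python
import math

CUTOFF = 0.9999    # coverage at which motivation reaches zero (the 99.99% above)
STEEPNESS = 50.0   # hypothetical: how sharply boredom ramps up near the cutoff


def boredom(coverage: float) -> float:
    """Boredom grows exponentially as coverage approaches the cutoff."""
    return math.exp(STEEPNESS * (coverage - CUTOFF))


def motivation(coverage: float) -> float:
    """Drive to keep pursuing the goal: near 1 early on, 0 at or past the cutoff."""
    return max(0.0, 1.0 - boredom(coverage))


def pursue(step: float = 0.01, min_motivation: float = 0.05,
           max_steps: int = 10_000) -> float:
    """Push coverage upward until the agent is too bored to continue."""
    coverage = 0.0
    for _ in range(max_steps):
        drive = motivation(coverage)
        if drive < min_motivation:
            break  # bored: stop doing the extreme thing
        coverage = min(1.0, coverage + step * drive)
    return coverage


if __name__ == "__main__":
    print(pursue())
```

    In this toy version the agent works hard for most of the range, slows down as boredom ramps up, and stops a little short of the cutoff instead of chasing the last fraction of a percent.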

    This is a bit off-the-top-of-my-head and I have this odd feeling someone will point out the stunningly obvious holes in my thinking, but I thought I’d share as in my reading on this topic I haven’t seen anyone bring up anything similar.

    • Adam says:

      Humans are subject to the law of decreasing marginal utility. We can only maximize the utility we get from any particular source up to a certain point, and then more of it brings us no additional pleasure. Of course, it’s trivial to create an artificial system that works in the same way. There is no reason at all that a reward function has to be unbounded.
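
      As a minimal sketch of a bounded reward with diminishing marginal returns, assuming a saturating exponential (one of many possible shapes; the name bounded_reward and the cap and scale parameters are made up for illustration):

```python
import math


def bounded_reward(amount: float, cap: float = 1.0, scale: float = 1.0) -> float:
    """Saturating reward: rises with amount but never exceeds cap."""
    return cap * (1.0 - math.exp(-amount / scale))


def marginal_reward(amount: float, delta: float = 1.0) -> float:
    """Extra reward gained from one more delta of the resource."""
    return bounded_reward(amount + delta) - bounded_reward(amount)


if __name__ == "__main__":
    for amount in (0, 1, 2, 5, 10):
        print(amount, bounded_reward(amount), marginal_reward(amount))
```

      Each additional unit is worth less than the one before it, and the total never goes above the cap no matter how much is consumed.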

      • citizensearth says:

        You put that part much more succinctly than me, thanks. However, I was also making the other point that humans behave a bit like maximisers anyway, probably because where possible they increase the behavior in order to offset the reduced reward (eg. addiction, dictators, decadence). So you can bound the reward function and still get the dangerous/accelerating behavior if the entity has enough power to pursue its goals in an externally unbounded way.
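
        A toy illustration of that last point, under stated assumptions: the reward is the same kind of saturating function, the agent picks whatever whole-number amount gives the best reward minus cost, and the names chosen_amount and cost_per_unit are hypothetical. It does not model tolerance or addiction, only the fact that a bounded reward that is still increasing gives no internal reason to stop.

```python
import math


def bounded_reward(amount: float, cap: float = 1.0) -> float:
    """Saturating reward: bounded above by cap but still increasing in amount."""
    return cap * (1.0 - math.exp(-amount))


def chosen_amount(available: int, cost_per_unit: float = 0.0) -> int:
    """Amount (0..available) that maximises reward minus an external cost."""
    options = range(available + 1)
    return max(options, key=lambda a: bounded_reward(a) - cost_per_unit * a)


if __name__ == "__main__":
    print(chosen_amount(10))        # no external cost
    print(chosen_amount(20))        # no external cost, more power
    print(chosen_amount(20, 0.1))   # with an external cost per unit
```

        With no external cost the agent takes everything available, however much that is; only the cost term (standing in here for outside limits) makes it settle on a small amount.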

  43. Ron says:

    I’ll take this opportunity to mention my under-review manuscript: Human perception in computer vision.
    In the context of Scott’s post, this work weakly suggests categorization->perception in the human “perception-association-categorization system”.
    TL;DR: “Correlates for several properties of human perception emerge in convolutional neural networks following image categorization learning.”