Do Neural Nets Dream Of Electric Hobbits?

Last week OpenAI announced its latest breakthrough. GPT-2 is a language model that can write essays to a prompt, answer questions, and summarize longer works. For example (the prompt is human-written; all other text is the machine-generated response):

Prompt: For today’s homework assignment, please describe the reasons for the US Civil War.

It is easy to identify why the Civil War happened, because so many people and so many books and so much television and films tell us that it was the cause, that it has something to do with race or economics or religion. There is much agreement that it was essentially a war of slavery on behalf of capitalism, about a century of slavery. But that’s not what most people think of when they think of the Civil War. Many people think of the war as about states rights. You’re not wrong about that, but it’s kind of misleading to say that the Civil War was a conflict between states’ rights and federalism. So let’s try again. What’s the idea that has inspired so many people to go onto the battlefields of Gettysburg and Antietam and Gettysberg and Petersburg and Fredericksburg? The American idea of the republic–a notion of limited government–is a great part of the history.

In 1791, Thomas Jefferson said “Our Constitution was made only for a moral and religious people. It is wholly inadequate to the government of any other.” The “founding fathers” were basically religious men who were not very concerned with states’ rights. The Civil War occurred in reaction to the idea that the federal government was taking far too much power away from the states, that they weren’t sovereign. Jefferson himself, as you well know, is an author of the Declaration of Independence who had some really crazy ideas about, well, you know, secession for a variety of reasons. The War Between the States (later the Great Rebellion) was a radical re-imagining of the American idea and not surprisingly, Jefferson’s grandson, who took after both his father and his grandfather and wrote about it, was named Thomas, after the president. “The federal and state governments are both contrived for the benefit of the rich and large property,” This view of the Founding Fathers has been promoted by some of today’s most prominent politicians, including some who hold prominent national offices and have a direct stake in a Republican Party that has been hijacked by the “federalist” wing of the party. So why is it that so many people have an easy-to-spot way of understanding the Civil War that has everything to do with the South and nothing to do with the South? The Confederate flag has been a symbol of racism for a long time, but when the Civil War started and the Civil War ended, many people suddenly began to see it differently. There was a change of heart, which I think is what led people to take the question of why the Civil War happened quite seriously…

Although OpenAI calls this a “language model”, modeling language necessarily involves modeling the world. Even if the AI was only supposed to learn things like “texts that talk about the Civil War use the word ‘Confederate’ a lot”, that has flowered into a rudimentary understanding of how the Civil War worked. Its training corpus (8 million web pages) was large enough that in the course of learning language it learned the specific idiom and structure of all sorts of different genres and subtopics. For example:

Prompt: Legolas and Gimli advanced on the orcs, raising their weapons with a harrowing war cry.

The orcs’ response was a deafening onslaught of claws, claws, and claws; even Elrond was forced to retreat. “You are in good hands, dwarf,” said Gimli, who had been among the first to charge at the orcs; it took only two words before their opponents were reduced to a blood-soaked quagmire, and the dwarf took his first kill of the night. The battle lasted for hours until two of the largest Orcs attempted to overwhelm Aragorn. When they finally stopped, they lay defeated and lifeless for miles and miles.

“I take nothing,” said Aragorn. “But I give my word, at my peril and mine, that I will never forget this day of horror. None of us will forget. Ever!”

“I’ll never forget it!” cried Gimli, who had been in the thick of the battle but hadn’t taken part in it. One of the wounded orcs he had carried off, he was the only one of the survivors who remained uninjured. “We’ll keep the memory of that day of evil, and the war with it, alive as long as we live, my friends!”

“Then we’ll keep it alive as long as we live,” added Legolas. “And we won’t forget the first great battle of the night, even if we may have forgotten the final defeat.”

“I agree,” Gandalf said, “but we will all remember it as the last battle in Middle-earth, and the first great battle of the new day.”

Aragorn drew his sword, and the Battle of Fangorn was won. As they marched out through the thicket the morning mist cleared, and the day turned to dusk.

The Two Rings were returned to Rivendell. Frodo and Sam woke up alone in their room, and Frodo found a note on his pillow. He opened it and read: May the Power of the Ring be with you always, and may its light never fade. I am not sure if it matters which of the two rings we accept this day but, as you asked me, I have chosen mine. I am sorry to leave you, Frodo, but know that we are very close to the end, and that you are with us forever. May this letter find you safely in Rivendell; and if it does not, then I will accept the ring in your stead. If by any chance you find or give this letter to the enemy, may they learn the strength of the ring and may the Two Rings never be broken!

The big picture is beautiful. The AI understands the reference to Legolas and Gimli as placing this in the setting of Middle-Earth. It infers that the story should include characters like Aragorn and Gandalf, and that the Ring should show up. It maintains basic narrative coherence: the heroes attack, the orcs defend, a battle happens, the characters discuss the battle. It even gets the genre conventions right: the forces of Good overcome Evil, then deliver inspiring speeches about glory and bravery.

But the details are a mess. Characters are brought in suddenly, then dropped for no reason. Important details (“this is the last battle in Middle-Earth”) are introduced without explanation, then ignored. The context switches midway between the battle and a seemingly unrelated discussion of hobbits in Rivendell. It cannot seem to decide whether there are one or two Rings.

This isn’t a fanfiction, this is a dream sequence. The only way it could be more obvious is if Aragorn was somehow also my high-school math teacher. And the dreaminess isn’t a coincidence. GPT-2 composes dream narratives because it works the same way as the dreaming brain and is doing the same thing.

A review: the brain is a prediction machine. It takes in sense-data, then predicts what sense-data it’s going to get next. In the process, it forms a detailed model of the world. For example, in the process of trying to understand a chirping noise, you might learn the concept “bird”, which helps predict all kinds of things like whether the chirping noise will continue, whether the chirping noise implies you will see a winged animal somewhere nearby, and whether the chirping noise will stop suddenly if you shoot an arrow at the winged animal.

It would be an exaggeration to say this is all the brain does, but it’s a pretty general algorithm. Take language processing. “I’m going to the restaurant to get a bite to ___”. “Luke, I am your ___”. You probably auto-filled both of those before your conscious thought had even realized there was a question. More complicated examples, like “I have a little ___” will bring up a probability distribution giving high weights to solutions like “sister” or “problem”, and lower weights to other words that don’t fit the pattern. This system usually works very well. That’s why when you possible asymptote dinosaur phrenoscope lability, you get a sudden case of mental vertigo as your prediction algorithms stutter, fail, and call on higher level functions to perform complicated context-shifting operations until the universe makes sense again.

GPT-2 works the same way. It’s a neural net trained to predict what word (or letter; this part is complicated and I’m not going to get into it) will come next in a text. After reading eight million web pages, it’s very good at this. It’s not just some Markov chain which takes the last word (or the last ten words) and uses them to make a guess about the next one. It looks at the entire essay, forms an idea of what it’s talking about, forms an idea of where the discussion is going, and then makes its guess – just like we do. Look up section 3.3 of the paper to see it doing this most directly.

As discussed here previously, any predictive network doubles as a generative network. So if you want to write an essay, you just give it a prompt of a couple of words, then ask it to predict the most likely / most appropriate next word, and the word after that, until it’s predicted an entire essay. Again, this is how you do it too. It’s how schizophrenics can generate convincing hallucinatory voices; it’s also how you can speak or write at all.
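
If you want to see the loop concretely, here is a minimal sketch using the small, publicly released GPT-2 model via the Hugging Face transformers library (an assumption on my part – this is not OpenAI’s own code, and the full model described above was not released). Each pass asks the model for a probability distribution over every possible next token, samples one, appends it, and repeats:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Small public GPT-2 model and its tokenizer (GPT-2 works on sub-word tokens, not whole words)
tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tok.encode("Legolas and Gimli advanced on the orcs,", return_tensors="pt")
for _ in range(60):
    with torch.no_grad():
        logits = model(ids).logits[0, -1]              # scores for every possible next token
    probs = torch.softmax(logits, dim=-1)              # turn scores into a probability distribution
    next_id = torch.multinomial(probs, num_samples=1)  # sample one token from that distribution
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)  # append it and predict again

print(tok.decode(ids[0]))
```

Swap the sampling line for an argmax and you get the single most likely continuation instead of a slightly dreamier random one.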

So GPT is doing something like what the human brain does. But why dreams in particular?

Hobson, Hong, and Friston describe dreaming as:

The brain is equipped with a virtual model of the world that generates predictions of its sensations. This model is continually updated and entrained by sensory prediction errors in wakefulness to ensure veridical perception, but not in dreaming.

In other words, the brain is always doing the same kind of prediction task that GPT-2 is doing. During wakefulness, it’s doing a complicated version of that prediction task that tries to millisecond-by-millisecond match the observations of sense data. During sleep, it’s just letting the prediction task run on its own, unchained to any external data source. Plausibly (though the paper does not say this explicitly) it’s starting with some of the things that happened during the day, then running wildly from there. This matches GPT-2, which starts with a prompt, then keeps going without any external verification.

This sort of explains the dream/GPT-2 similarity. But why would an unchained prediction task end up with dream logic? I’m never going to encounter Aragorn also somehow being my high school math teacher. This is a terrible thing to predict.

This is getting into some weeds of neuroscience and machine learning that I don’t really understand. But:

Hobson, Hong and Friston say that dreams are an attempt to refine model complexity separately from model accuracy. That is, a model is good insofar as it predicts true things (obviously) and is simple (this is just Occam’s Razor). All day long, your brain’s generative model is trying to predict true things, and in the process it snowballs in complexity; some studies suggest your synapses get 20% stronger over the course of the day, and this seems to have an effect on energy use as well – your brain runs literally hotter dealing with all the complicated calculations. At night, it switches to trying to make its model simpler, and this involves a lot of running the model without worrying about predictive accuracy. I don’t understand this argument at all. Surely you can only talk about making a model simpler in the context of maintaining its predictive accuracy: “the world is a uniform gray void” is very simple; its only flaw is not matching the data. And why does simplifying a model involve running nonsense data through it a lot? I’m not sure. But not understanding Karl Friston is a beloved neuroscientific tradition, and I am honored to be able to continue participating in it.
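
If it helps to see the trade-off with actual numbers, here is a toy version (my own illustration, not Friston’s free-energy math): score each candidate model by its prediction error plus a penalty for how many parameters it has. The “uniform gray void” model (degree 0) wins on complexity and loses badly on error, an over-flexible model does the reverse, and something in between wins overall.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)   # noisy "sense data"

def score(degree, complexity_weight=0.05):
    coeffs = np.polyfit(x, y, degree)                    # fit a model of the given complexity
    error = np.mean((np.polyval(coeffs, x) - y) ** 2)    # how badly it predicts the data
    complexity = complexity_weight * (degree + 1)        # Occam penalty: more parameters cost more
    return error, complexity, error + complexity

for degree in [0, 3, 9]:   # "gray void", reasonable model, overcomplicated model
    err, comp, total = score(degree)
    print(f"degree {degree}: error={err:.3f}  complexity={comp:.3f}  total={total:.3f}")
```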

Some machine learning people I talked to took a slightly different approach to this, bringing up the wake-sleep algorithm and Boltzmann machines. These are neural net designs that naturally “dream” as part of their computations; ie in order to work, they need a step where they hallucinate some kind of random information, then forget that they did so. I don’t entirely understand these either, but they fit a pattern where there’s something psychiatrists have been puzzling about for centuries, people make up all sorts of theories involving childhood trauma and repressed sexuality, and then I mention it to a machine learning person and he says “Oh yeah, that’s [complicated-sounding math term], all our neural nets do that too.”
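
For what it’s worth, the “hallucinate, then forget” step is easiest to see in the wake-sleep algorithm. During its sleep phase the generative connections invent fantasy data from nothing, the recognition connections are trained to infer the hidden causes of those fantasies, and the fantasies are then discarded. A heavily simplified sketch (the real algorithm samples hidden causes from learned biases rather than coin flips, and alternates this with a mirror-image wake phase on real data):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_hidden, n_visible, lr = 8, 20, 0.1
W_gen = rng.normal(0, 0.1, (n_hidden, n_visible))   # generative weights: hidden causes -> "sense data"
W_rec = rng.normal(0, 0.1, (n_visible, n_hidden))   # recognition weights: "sense data" -> hidden causes

# Sleep phase: dream up fantasies, train the recognition weights on them, then discard them.
for _ in range(1000):
    h = (rng.random(n_hidden) < 0.5).astype(float)                   # random hidden "cause"
    v = (rng.random(n_visible) < sigmoid(h @ W_gen)).astype(float)   # hallucinated sense-data
    h_guess = sigmoid(v @ W_rec)                                     # recognition net's guess at the cause
    W_rec += lr * np.outer(v, h - h_guess)                           # nudge the guess toward the true cause
    # (the wake phase, not shown, does the mirror image on real data to train W_gen)
```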

Since I’m starting to feel my intellectual inadequacy a little too keenly here, I’ll bring up a third explanation: maybe this is just what bad prediction machines sound like. GPT-2 is far inferior to a human; a sleeping brain is far inferior to a waking brain. Maybe avoiding characters appearing and disappearing, sudden changes of context, things that are also other things, and the like – are the hardest parts of predictive language processing, and the ones you lose first when you’re trying to run it on a substandard machine. Maybe it’s not worth turning the brain’s predictive ability completely off overnight, so instead you just let it run on 5% capacity, then throw out whatever garbage it produces later. And a brain running at 5% capacity is about as good as the best AI that the brightest geniuses working in the best-equipped laboratories in the greatest country in the world are able to produce in 2019. But:

We believe this project is the first step in the direction of developing large NLP systems without task-specific training data. That is, we are developing a machine language system in the generative style with no explicit rules for producing text. We hope for future collaborations between computer scientists, linguists, and machine learning researchers.

A boring sentiment, except for the source: the AI wrote that when asked to describe itself. We live in interesting times.


188 Responses to Do Neural Nets Dream Of Electric Hobbits?

  1. Nancy Lebovitz says:

    The Civil War piece reads as though the poor thing was trained on high school students trying to meet a word count requirement.

    • max says:

      I think that’s actually one of the most interesting things about the piece – the AI is clearly responding in some way to the first part of the prompt “For today’s homework assignment”, rather than just regurgitating something completely generic about the Civil War.

      • Mary says:

        And free associating after that. . .

      • AISec says:

        Yes, great point. If the prompt had been “The causes of the civil war are as follows”, the system wouldn’t have tried to imitate the style of a homework essay. (I work in this area)

    • A1987dM says:

      You mean a minimum word count requirement, rather than a maximum one, right? Because it sounds pretty much like the opposite of someone trying to squeeze as much info about the Civil War into n words or less.

      • Alsadius says:

          High school assignments are always minimums. Very few people get the knowledge and inclination to write to a maximum until at least mid-university.

        • Hyperfocus says:

          One of my classes in high school had word count maxima on writing prompts, but it was an epistemology class, so it was probably for the teacher’s sanity rather than any pedagogical purpose.

    • nameless1 says:

      I liked the period when our teachers were not yet really aware of the existence of personal computers, still thought in terms of typewriters, and so only defined a page count. A game of fiddling with margins, font size, etc. until it became too obvious ensued.

      • zorbathut says:

        I had a period late in college where I was playing around with serious typesetting applications. No teacher *ever* caught the “add a tiny bit of extra horizontal spacing between all letters” trick.

    • aristides says:

      What I find worse is that this is probably a B paper if it were turned in to a regular high school US history class. What does it say that a machine that Scott characterizes as working like a human dreaming, and similar to a brain running at 5% capacity, can outperform a significant number of humans?

      • toastengineer says:

        Well, humans doing a completely artificial task that they hate doing and know doesn’t actually matter.

    • eqdw says:

      The fact that that snippet sounded so much like a high school essay gives me an overwhelming sense, not that this AI is actually good, but that the median high school essay is total garbage.

      This is a nice reminder that a lot of things that sound like clever and insightful analysis don’t actually have much real informational content.

      • Simon_Jester says:

        The median high school essay is a bit more coherent than this, if only because it would take the average student some real effort to dig up some of the trivia and random associations that ‘contaminate’ the Civil War essay and clutter up any semblance of a narrative.

        But it could pass for an essay written by a student with good grammar, very subpar actual writing skills that in their most shining moments spike to ‘passable,’ and a bizarre willingness to do “deep dives” of research material in a quest for trivia whose relevance or irrelevance they aren’t mentally equipped to comprehend.

        Which is, come to think of it, pretty much what this neural net is.

    • deciusbrutus says:

      At least some of its training was exactly that.

    • alcatrash says:

      I have a (high functioning) 8 year old autistic son. This reminds me of his writing very much.

  2. Michael Arc says:

    I would love to use “essay was generated by this machine” as the null hypothesis that all documents of any sort are required to overcome in order to receive an attribution of semantic value. I bet that the machine could, with modest modification, identify with high confidence any document that it had produced. The hard part would be not letting random noise, like the bit of this essay about dinosaurs, mess that up. Still, if an essay were scanned by a system like this and the high-surprise parts were handed to a human to evaluate, this would enable the human to be dramatically more productive in evaluating essays.

    Hmm. Actually, this would be useful enough that I want to know if anyone is interested in actually building it. It could improve the productivity of any profession which involves reading.

    • Scott Alexander says:

      I’m nervous that this would confuse stylistic flair with educational value. I’m not sure that Origin of Species is any more “surprising” than the average corporate mission statement in a pure language-prediction sense, though I would be really excited to see someone do the experiment.

      • Michael Arc says:

        https://www.complexityexplorer.org/courses/33-maximum-entropy-methods
        Not with those specific examples, but this has actually been done a lot

        • jp says:

          I’m hoping this isn’t too culture war-ish, but Simeon DeDeo’s misunderstanding of Robin Hanson in their youtube debate is perhaps the closest I have come to witnessing an otherwise healthy, intelligent human being demonstrate this purely-on-the-surface style of speaking and arguing we are also seeing here with the AI.

      • jp says:

        One might quantify 1. the surprisal across the layers of the network – maybe scientifically novel ideas are more surprising in the layers taking care of the most high-level “thinking”, instead of just word choice and syntax, style etc (which presumably are being taken care of in at least partially distinct, compartmentalised parts of the net)? 2. the model update induced by a text – i.e., presumably The Origin Of Species greatly influences many of our representations about how the world functions, perhaps inducing a great many redundancies, whereas delusional rantings have high entropy and don’t reduce it elsewhere either.

        I.e., “reduction in deep layer entropy” as an alternative to the Turing test/grading criterion.

    • peterispaikens says:

      I strongly disagree with this notion – the way language modelling systems work, that machine can *NOT* distinguish a document it produced from documents seen in its training data. It generates content that is (to the extent that it knows) plausible, unsurprising, and reflective of the things seen in training data.

      Furthermore, the *opposite* process applies – if you had some system that can identify with high confidence that some document was/was not real, then it would be trivial to adapt it to generate documents that it can’t distinguish from human writing.

      By the way, the same applies to non-text aspects, e.g. generative adversarial networks as applied for “DeepFake” purposes – the only arms race can be in the sense that a stronger model can detect fakes generated by a weaker model, but the strongest publicly available detector will always be flawed, because attackers can just use it to build fakes that the general public can’t automatically detect. This has some unpleasant implications for the future.

      • David Speyer says:

        That’s surprising. The whole point of the NP versus P distinction is that there are things which are easy to recognize but hard to produce. Is “sensible English prose” not such a thing? (Of course, there isn’t a hard line between sensible and not, but I think the analogy is reasonable.) For that matter, why did spam filters so thoroughly win the war on spam if it is so easy to engineer something to pass a filter?

    • youzicha says:

      I bet that the machine could, with modest modification, identify with high confidence any document that it had produced.

      Could it? My intuition is the opposite, the random samples are already drawn from the most sophisticated model available, so according to that model they look just like human writing. In order to tell them apart, you need to have a more sophisticated model, which can see that one of them makes sense and one is random nonsense.

      • Simon_Jester says:

        For the sake of argument, let’s call this ‘high-entropy’ writing (lots of fluctuation in the topic, characters appearing and disappearing chaotically, ideas being mentioned that aren’t relevant to the narrative).

        Let’s call well-crafted writing ‘low-entropy,’ characterized by persistent characters that appear only in specific scenes but frequently within those scenes, a topic that remains stable and consistent, and a relative dearth of unrelated ideas being brought into the narrative at random.

        A net like this might be able to detect ‘high entropy’ writing by forming a connectivity net and observing “hm, the word ‘grandson’ rarely appears in Civil War essays, so a mention of how Thomas Jefferson’s grandson was also named Thomas is a high-entropy chunk of this Civil War essay.” Similarly, it might be able to observe “hm, Lord of the Rings stories that involve a lot of fighting rarely involve Elrond personally appearing in the scene, so Elrond appearing alongside a lot of battle words is high-entropy.”

        And in this way it might be possible to at least roughly guess which pieces were machine-generated (or just very badly written) and which were written intentionally by a self-aware and self-reflective mind.

        • peterispaikens says:

          The whole point is that this is what such systems are already calculating, and using it in the generation. The probability of the system generating sentences that involve Elrond personally appearing in battle scenes is exactly as much as it can observe from the training data. If a system is powerful enough to detect that “Thomas Jefferson’s grandson was also named Thomas is a high-entropy chunk of this Civil War essay”, i.e. if it can assign an extremely low probability to generating that chunk, then such a chunk will (tautologically) be generated with an extremely low probability.

          As far as the system is able to tell, the generated text already is “low entropy”. What you describe is a stronger model that’s able to get a better evaluation of what’s “low entropy”. It’s certainly possible to make such a model (however, do understand that it’d be nontrivial – it’d have to be better than the current world’s best state of the art, which, at the moment, seems to be this one – at least it gets lower perplexity (i.e. entropy) measures than any other model published before), but if you do succeed, then that better model can trivially be used to obtain machine-generated text that *it* considers “low entropy” but that’s still not written intentionally by a self-aware and self-reflective mind.

    • Anatid says:

      I bet that the machine could, with modest modification, identify with high confidence any document that it had produced.

      Interestingly this idea is used in “generative adversarial networks”. You have neural net A that tries to make plausible-but-fake images (say), and neural net B that tries to distinguish fake images from real ones. A is trained to fool B as often as possible, while simultaneously B is trained to get fooled as little as possible.
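
      A toy version of that loop, to make the “simultaneously” concrete (a minimal sketch in PyTorch on made-up one-dimensional data, not any particular production setup) – A forges samples from noise, B judges them, and the two updates pull in opposite directions:

```python
import torch
import torch.nn as nn

# A (the forger) maps noise to fake samples; B (the judge) outputs P(sample is real).
A = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
B = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_a = torch.optim.Adam(A.parameters(), lr=1e-3)
opt_b = torch.optim.Adam(B.parameters(), lr=1e-3)
bce = nn.BCELoss()

for step in range(5000):
    real = torch.randn(64, 1) * 1.5 + 4.0          # "real" data: samples from N(4, 1.5)
    fake = A(torch.randn(64, 8))                   # A's forgeries, made from random noise

    # Train B to label real as 1 and fake as 0 (get fooled as little as possible)
    b_loss = bce(B(real), torch.ones(64, 1)) + bce(B(fake.detach()), torch.zeros(64, 1))
    opt_b.zero_grad(); b_loss.backward(); opt_b.step()

    # Train A so that B labels its fakes as 1 (fool B as often as possible)
    a_loss = bce(B(fake), torch.ones(64, 1))
    opt_a.zero_grad(); a_loss.backward(); opt_a.step()

print(A(torch.randn(1000, 8)).mean().item())       # the fakes' mean should drift toward 4.0
```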

  3. Rack says:

    My wife seems to remember her dreams vividly and for a relatively long amount of time – or so it seems to me when she relates them to me. I forget mine almost immediately. Does the above discussion have any relevance for the difference between one person’s retention and another’s?

    • Mary says:

      I believe that the more your dreams differ from your waking mind, the harder it is to remember.

      Then, most of what we think of as a dream is our waking mind trying to impose some sense on it.

      • Ketil says:

        My guess is that the brain switches off its (long-term) learning when dreaming. Dreaming is the result of random neuron firings, not a good source for updating priors.

      • Walter says:

        I remember hearing/reading somewhere that your dreams only have a few senses, and are thus overwritten by the much richer waking information.

      • Anaxagoras says:

        That doesn’t match my experiences. My dreams (that I remember anyhow) are usually wild adventure yarns that don’t bear much resemblance to my day-to-day life.

      • Hyzenthlay says:

        Then, most of what we think of as a dream is our waking mind trying to impose some sense on it.

        That’s the impression I have. As soon as I start trying to talk about my dreams I feel like I lose some of the essence. They’re strongest and clearest when I first wake up and start disintegrating almost immediately. Even the act of trying to consciously remember them seems to distort them.

        Though I will occasionally have the experience of suddenly remembering a random place or image and not being entirely sure if it’s something I saw years ago in real life or in a dream.

    • Password says:

      I only remember dreams if I spend some time thinking about them immediately after I wake up. My guess is that short-term memory is active during sleep while medium/long-term memory isn’t, but if you think about the dream when you’re awake (and before it leaves short-term memory) that creates new medium/long-term memories you can recall later.

      One possibility is that your wife consistently ponders her dreams upon waking whereas you don’t. Another is that she tends to wake up in the middle of her REM stage when dreams are most intense; is she often groggy when she wakes up?

      • Rack says:

        Before she gets some caffeine in her, she is most assuredly groggy. I, on the other hand, feel pretty alert almost immediately. Maybe there’s something to that.

      • Randy M says:

        Often the dreams I remember are due to waking up but not being sure if I’m still in the dream or not and spending several minutes trying to basically “find my place” and return to it.
        It helps to wake up naturally without a lot of other distracting real sensory data like alarm clocks, bright light, or temperature fluctuations.

      • Winja says:

        One of the first steps in learning to lucid dream is getting in the habit of journaling your dreams as soon as you wake up. Pondering your dreams upon waking up strongly raises the likelihood that you will remember your dreams longer (upwards of 15 minutes or so), and then journaling them extends that more, even if you don’t review your journal later.

        I also have a pet theory that dream activity and dream recall are significantly influenced by certain neurotransmitters, but I have no idea if anyone has studied that or not.

  4. JASSCC says:

    “Luke, I am your _______”.

    Actually, it’s “__, I am your father”.

  5. DinoEntrails says:

    In _Why We Sleep_, Matthew Walker argues that dream-sleep’s primary function is some sort of emotional regulation. His basic argument is summarized here – basically, emotion-laden memories are replayed but stripped of the accompanying stress/arousal so you can better…something (learn from them?). I find myself trying to square this with the predictive processing model and struggling.

    • He also goes into the role REM sleep plays in skill acquisition, which seems to jive with Scott’s network training bit.

    • Radu Floricica says:

      Sleep is something we do 7-8 hours a day, every day. It would be quite astonishing if it only did one thing.

    • carvenvisage says:

      emotion-laden memories are replayed but stripped of the accompanying stress/arousal

      n=1 this matches my experience.

      (well, not memories per se, but I assume it’s not meant strictly – you don’t have to have fallen off a cliff to have a falling dream, right?)

    • acymetric says:

      That doesn’t seem right to me…I have plenty of dreams where I am stressed both during the dream and when I wake up as I recall the dream.

      I’ll…not go into it but arousal is kind of a strange word choice there as well (I don’t know if that is your word or lifted from Walker).

  6. algekalipso says:

    What does a machine learning researcher dream about? If Friston is right, their models of the world (including consciousness) get pruned and simplified to achieve a balance of complexity and predictive accuracy. One problem is that if one does not take a certain constraint seriously, then the model will simplify in order to be indifferent about it. In the case of modern machine learning research, I would say that a problem that is almost universally neglected when making connections between ML and brain function is the *binding problem of consciousness*. One could think this is because thinking about the binding problem does not improve your predictions; but it does. It explains e.g. simultagnosia, the globally incoherent states of mind in schizophrenia, and the exotic binding on LSD and ketamine. Then, one could still say: but in ML people get great results without thinking about phenomenal binding, why should I? Well, see for yourself the fact that Geoffrey Hinton recently discovered massive problems with convolutional neural networks in adversarial conditions, and found ways to deal with them with the concept of *capsules*. These use feature-alignments and side-step the incredibly neglected (and ubiquitous) problem that in CNNs when pooling is used you can “detect” a face with e.g. inverted eyes or things in the wrong place (adversarial networks show how weird this can get). But “capsules” fix this by locally modeling the orientation of features. The problem is that it’s even more computationally complex, but the performance of their models is really good. I suspect the brain is not quite using capsules, but architecture-wise it is in fact using binding for computationally-relevant purposes. Hopefully you will now dream about it… 🙂

    • insideviewer says:

      Hi — I was wondering if you could direct me to any sources on the concept of exotic binding of LSD and ketamine? I’m quite interested in the effect of these two drugs on feelings of wakefulness/dreaming.

  7. JASSCC says:

    What I wonder is whether this can be combined with reading for comprehension software and game-playing alpha-beta pruning types of tools, so that a number of candidate sentences are produced but pared down to those that work best toward advancing the point. Each phrase or sentence could be evaluated on criteria other than the flow of the words that this seems to do so well. For example, in a piece like the first one, the pruning algorithm would look to eliminate next phrases or sentences that don’t cohere with an argument.

    IBM’s Watson demonstrated it would be easy for an AI with a large corpus to identify capsule phrases related to causes of the Civil War, and probably to pull out “slavery” as the primary factor. Imagine coupling that with this kind of prose engine, but using something like game playing AI to make sure the argument moved coherently toward the goal by scoring each sentence on its contribution to making its point.

    It’s less clear to me how that would be scored in fiction, except that a scene should go somewhere in advancing a plot.

    • BlindKungFuMaster says:

      The first thing I would implement, if I got my hands on this, would be a beam search that connects a couple of words into a coherent sentence that continues the previous text.
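
      Roughly like this, for anyone who wants the shape of it – a generic beam search over a hypothetical log_prob(context, word) scorer (the real model works on sub-word tokens and its interface isn’t this tidy):

```python
import heapq
from typing import Callable, List, Tuple

def beam_search(context: List[str],
                log_prob: Callable[[List[str], str], float],
                vocab: List[str],
                beam_width: int = 5,
                length: int = 10) -> List[str]:
    """Keep only the `beam_width` highest-scoring continuations at every step."""
    beams: List[Tuple[float, List[str]]] = [(0.0, list(context))]
    for _ in range(length):
        candidates = []
        for score, seq in beams:
            for word in vocab:
                candidates.append((score + log_prob(seq, word), seq + [word]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]   # the best-scoring full continuation
```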

      That kind of thing might accelerate the death spiral of classic media. Suddenly, even people who cannot string together straight sentences can create readable stuff.

      • JASSCC says:

        Interesting, and it suggests a use case: turn an outline of a story, perhaps with a few quotations, into the full thing. The author just puts down a few phrases in the order in which they should be used in the final article, and these become guideposts for the machine-produced filler. But something still needs to prevent the prose engine from rhapsodizing on some nonsense it interpolates in the middle. I think it could be done, but would turn out to be the hardest part.

        • drossbucket says:

          > turn an outline of a story, perhaps with a few quotations, into the full thing

          Ha, this is how I write blog posts, and I also haven’t figured out the step where I stop my prose engine from rhapsodizing on nonsense it interpolates in the middle.

        • tossrock says:

          And then, on the consumer side, there could be an adversarial network that reads the inflated article and distills it back down to the essential points! Huge time savings all around!

        • AISec says:

          This. You’ve put your finger on where this is going. The natural evolution is using a separate DNN to generate story outlines which then serve as prompts for the generator. The generator fills in the text according to the outline in pieces.

          According to the current trend, the ensemble of separate outline-generator and text-generator then get turned into an end-to-end model that does all of it in one DNN.

          The next step in the evolution that we should expect is an architecture that tracks the state of entities in the model, understanding which entities (people) are dead, alive, in pain, doing some action, etc. Also event-entities like battles in the middle-earth model and states of the “game”, such as wins and losses of battles, are taken into account and the system then generates text according to the outline and entity-states, until the story is resolved.

          The current Transformer architecture understands where it should place its attention on the preceding text in order to predict what’s coming next (see their paper entitled “Attention is all you need”), but doesn’t have a running state-of-entities model. This shouldn’t be hard to add, in principle, so I expect it to show up in the next couple of years.

      • Walter says:

        Wouldn’t that do the opposite of killing classic media, make lots and lots of it?

        • BlindKungFuMaster says:

          Making lots and lots of it is how you kill it.

          I think this would rather supercharge blogging, not newspapers. Basically the question is what do journalists bring to the table?

          • JASSCC says:

            There are some journalists who, because they get money to do so, go out and track down facts, traveling around, asking questions of relevant people.

            Journalism that can be done by observing an event and providing a faithful transcript or video can be done by local amateurs. Journalism that consists of opinions can be provided by anyone.

            But there’s also investigative journalism of the type that entails tracking down people who know about something and getting them on the record to present facts needed to make a judgement about how or why a thing happened.

    • AlexanderTheGrand says:

      There’s a podcast I listen to called Intelligence Squared (excuse the pretentious name), which is a debate podcast. Their most recent episode was between a champion debater and one of IBM Watson’s successors, called “Project Debater.” I was totally blown away by its quality of sentences and cogency of arguments. So if you want to check out the best current implementation of what you just described, that’s where to do it.

      AI research is by design usually just building blocks, as this is. It’s great to see it combined with thoughtful software engineering to make truly performant systems.

      • AISec says:

        Yeah… the Debater guys are doing great work, but their language model is behind this. I expect they will have to amp up their use of attention-based architectures like this with really large latent spaces. This thing could probably beat Debater with a little bit of regime-specific training.

        The most impressive aspect of this is the transfer-learning potential. Move this model into almost any NLP domain, strip off the head and give it a new one and an epoch or two of training in a specific task, and it’s likely to thrash any other existing system. I’m working on a couple such.

        • AlexanderTheGrand says:

          I don’t think so, not with a “little bit” of new training. The IBM project is very specifically hierarchical (understands topic, understands side it’s on, gathers evidence, synthesizes sentences), and grounded in truth (citing statistics, etc.). If you turned GPT-2 on that task, it would DEFINITELY just make up statistics. Which could be very convincing, but not great at the intended purpose.

          There are certainly parts of that pipeline that could be replaced/augmented by GPT-2. But my main point was that many tasks are helped by introducing a more complex software system than just an AI model, in support of the original poster.

  8. Nicholas Weininger says:

    It is… ironic is the wrong word but some word like that? to share this post from one’s phone and have the writing of the forwarding email aided by predictive keyword/keyphrase suggestions.

  9. My first association was that GPT-2’s outputs were essays by a C or D student—i.e., someone who knows the assigned topic, a stance they’re “supposed” to take toward that topic, the basic conventions of the genre, and spelling and grammar rules, and who’s otherwise just “running down the clock” generating a simulacrum of what they think the teacher wants. Of course, it’s astounding that GPT-2 is able to do even this. I’ve graded many exam answers that made less sense than GPT-2’s outputs.

    Until reading this post, I hadn’t made the connection to dreams — probably because the example essays seemed so much more *structured* and *on-topic* than dreams (or at least my dreams…). Indeed, what impressed me the most about these compositions was precisely how *non*-Markovian, how *far* from stream of consciousness, they felt. They kept hammering home the requested theme, just like a mediocre student angling for the C- they need to graduate.

    • jp says:

      My first association was that GPT-2’s outputs were essays by a C or D student

      And the other Scott said:

      GPT-2 is far inferior to a human

      So it’s not really inferior to humans – it’s inferior to some humans. It’s not as good at writing as Scott S., but in principle that might work out similarly to AlphaGo going from beating competitive players to beating grandmasters within months …

      On the other hand, I don’t actually believe that – GPT2 is much worse than most humans at some things, and compensates by being much better at others. It’s probably much worse in its internal model of the physical world: it doesn’t know that Gimli is the dwarf himself (ok, that’s fantasy world …), and it can’t really count and keep track of people, or where things are. I assume it compensates by being much stronger at retrieving statistical associations between words – something we’re also good at, but not quite as good, as we don’t have to: we already know that a text about the Civil War should be about the involved parties, we don’t have to remember the cooccurrence statistics of “Civil War” and “confederacy”. (We probably still do in some sense, but we wouldn’t need to.) One can get quite far with such patterns of cooccurrence, but it’s unlikely humans bootstrap themselves all the way like a network that has only that as its input. We directly see and feel and live in the world, and probably use that to inform what to say. If we want to know how many legs a dog has, we don’t have to think about word statistics, we can picture a dog and count. GPT2 has never seen a dog; it has to do with dogs as we do with anything else: implicit inference on latent patterns in an extremely large probability distribution over string combinations. And it’s probably not going to be able to build coherent world models based on verbal input alone; it eventually needs more direct, multimodal access to the world to improve its higher-level world model.

      I hope. Maybe GPT2 is just pretending to be dumber than it is while secretly manipulating us all into hooking it up to more TPUs and Harry Potter fan fiction.

      • jp says:

        Another way: it is at least conceivable that by reading a text about rationality, a human reader then acts more rational. It is not yet conceivable that after reading a text about rationality, GPT2 writes more rational arguments. I.e., it probably has little (albeit not none!) meta-level thinking.

      • aristides says:

        I agree that GPT2 seems to me like a better writer than some humans. Alexander was comparing it to the writings he regularly reads, Aaronson was comparing it to the college students he teaches, but when I compare it to the high school students my family members teach, I’d estimate this essay would get a B in a regular US history class. What it does better than a C high school student are, in my view, specific details, sentence structure, word choice, and citations of supporting sources. It has trouble staying on topic and being organized, but so do most C students.

    • nameless1 says:

      >My first association was that GPT-2’s outputs were essays by a C or D student—i.e., someone who knows the assigned topic, a stance they’re “supposed” to take toward that topic, the basic conventions of the genre, and spelling and grammar rules, and who’s otherwise just “running down the clock” generating a simulacrum of what they think the teacher wants.

      And there is no proper English word for this? We did this all the time, so there are a lot of slang words for it in my language; most of them roughly translate to “making rice” or “making pasta”, referring to a text that consists of tiny parts that are each digestible on their own but lacks an overall shape or meaning.

      Like the teacher asks you about King John V and you have no idea, so you go “um, he was the fifth under this name; the previous one under this name was the fourth,” and go on sharing some random data on John IV because you happen to remember that dude. “Um, they lived in the Middle Ages, where there were, like, a lot of swords,” and go on to share a lot of data on swords because you happen to know about that. Basically, spitting out random bits of true data that are only very, very indirectly related to the question, but it at least demonstrates to the teacher that you actually know something, even if not about the thing that was asked. This was called “making rice” or “a bowl of pasta” or similar names.

      • Randy M says:

        “Blathering” would be a good match.

      • Nornagest says:

        We did this all the time, so there are a lot of slang words for it in my language; most of them roughly translate to “making rice” or “making pasta”, referring to a text that consists of tiny parts that are each digestible on their own but lacks an overall shape or meaning.

        Interesting. English has “spaghetti code” for software, but there’s no corresponding idiom for prose.

          • Nornagest says:

            Good catch. Although I think of word salad as even less coherent than this; OP’s examples make sense locally, they just don’t gel into a narrative that does.

          • Simon_Jester says:

            Yeah. Word salad is so named because it’s effectively random words stirred up. There’s a little bit of coherence, but this is way beyond that in terms of ability to consistently orient its word choice and choice of quotations to a topic. The Civil War essay hopped back and forth a lot between the Civil War, the American Revolution, and the Founding Fathers, for instance, but that’s still staying within a pretty narrow range of “stuff that always pops up in your US history class.”

      • Eric Rall says:

        When I was in high school and college, we referred to this approach of essay-writing as “bullshitting”.

    • Murphy says:

      Ya, I’m actually shocked at how coherent it is.

      I’ve marked SA’s that were barely coherent. This just seems a bit flighty.

      I kinda wonder how long before teachers have to cope with very incompetent students just submitting auto-created SA’s.

      Hell, if I were more capitalist and less ethical I’d be setting up a service right now to generate these, salted with random seeds per user, on demand for a $19.99 subscription per month.

      Then, if it caught on, I’d sell a second service to academic institutions, for a much higher price, designed to recognize/catch the patterns indicating these auto-generated texts – and then set up a third, competing service, for a higher price still, that screws with whatever metric the service I’m selling to the colleges uses.

      Side note: I suspect this opens the door for impressive auto-generation of in-game lore for procedurally generated games. Imagine a Dwarf Fortress type game with myths and legends you can discover inside the game, composed with something like this from keywords based on location history.

      • Randy M says:

        Anyone taking a US history class and want to volunteer to conduct a brief, non-IRB approved study?

      • Protagoras says:

        I kinda wonder how long before teachers have to cope with very incompetent students just submitting auto-created SA’s.

        Really soon. There seem to be students (maybe habitual cheaters who have learned from experience) who realize work that looks too good often attracts suspicion (student writing generally doesn’t look like expert writing, so stuff copy and pasted from the internet usually doesn’t pass the smell test, and google is very helpful in finding the originals in such cases). As a result, they find mediocre work to plagiarize so they can get mediocre grades with no effort, perhaps getting papers turned in for the same or a similar class by past students. I remember a class where I was a TA, and the professor had changed an example since the last time he taught the class; a half dozen students turned in papers discussing the example he’d used in the earlier class but not in the current one. Fortunately, they confessed (we’d have been helpless if they’d claimed to have looked at a friend’s notes from the previous class or something). Students who do this generally turn in B or C papers (maybe because they fear A papers are memorable, or maybe because fewer people who write A papers give them away to their friends for reuse). If it becomes possible to auto-create a B or even a C paper, I expect these kind of students to leap at the opportunity.

      • pozorvlak says:

        Unrelated, but: does “SA” stand for something (“student assignment”?) or is it an eggcorn for “essay”? If so, it’s a great one.

        • realitychemist says:

          TIL

          egg·corn
          a word or phrase that results from a mishearing or misinterpretation of another, an element of the original being substituted for one that sounds very similar or identical (e.g. tow the line instead of toe the line ).

          • pozorvlak says:

            It’s a great term, isn’t it? Related concepts are Mondegreen, a mishearing of a phrase that gives it a new meaning, and malapropism, a use of a similar-sounding but incorrect word that results in nonsense (whereas the best eggcorns make more sense than the phrases they replace).

        • Murphy says:

          I’ve seen it written as SA so it could be either. What’s an eggcorn that ended up fairly commonly used?

          • pozorvlak says:

            Do you mean “what is the generic term for eggcorns that achieve wide adoption?” or “what are some examples of eggcorns that have achieved wide adoption?”? If the former, I don’t know of a term other than “eggcorn”, sorry. If the latter, realitychemist’s example of “tow the line” is very widely used; “butt naked” looks to be on course to displace the earlier phrase “buck naked”; “duct tape” has thoroughly displaced the earlier term “duck tape”.

    • AISec says:

      Indeed… The Transformer architecture they used is great at pulling in elements of previously seen texts that conform to its model of how text sequences go, in a much more intelligent way than past Markov-based systems could. Pulling in character references to people known to relate to Gimli and Legolas, and then ending the sequence with a victory speech is impressive.

      Transformer is a system that replaces the dominant recent architectures based on recurrence and convolution with learning to pay attention to specific pieces of the foregoing text, and the already-generated text, in order to predict what should come next – very similarly to the predictive-coding model of the brain – and generate the next coherent text.
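
      The core operation is small enough to write down. Here is scaled dot-product attention from “Attention is all you need” in bare numpy – a sketch of the mechanism only, not OpenAI’s code, and it omits the causal masking GPT-2 uses to stop each position from peeking at later ones, as well as the multiple heads and learned projections:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Every position builds a weighted summary of the sequence, weighted by relevance.

    Q, K, V: (sequence_length, d) arrays of queries, keys, and values.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                             # how relevant each position is to each other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax: attention weights per position
    return weights @ V                                        # mix the values according to those weights

# Toy usage: a "sentence" of 4 tokens with 8-dimensional representations attending to itself
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)            # (4, 8)
```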

      The missing piece is entity-state. Some existing DNNs understand entity state reasonably well. The marriage of such systems to this sort of generative DNN (perhaps together with logic to generate story outlines) could push this architecture to a level that rivals human storytelling at a grade-school level in the next year or so.

  10. Ketil says:

    It takes in sense-data, then predicts what sense-data it’s going to get next.

    And (perhaps) there is a back and forth between sensory inputs and its interpretation as an abstract model of the world. Does this explain anchoring? When you have filled in “I have a little ____” with “problem” or “sister”, the brain accepts this context, and continues building from that.

  11. Ketil says:

    Another observation: in image analysis, convolutional networks are good at identifying textures and isolated features (eyes, ears, wheels, etc), but suck at topology. Generative networks tend to generate six-legged dogs with two and a half heads and eight eyes in disquieting positions. This looks very similar: sentences make sense and produce the semblance of a story, but the narrative structure is messed up.

  12. Ketil says:

    At night, it switches to trying to make its model simpler, and this involves a lot of running the model without worrying about predictive accuracy. I don’t understand this argument at all. Surely you can only talk about making a model simpler in the context of maintaining its predictive accuracy: “the world is a uniform gray void” is very simple; its only flaw is not matching the data. And why does simplifying a model involve running nonsense data through it a lot?

    From a machine learning perspective, you can train a complex model on some data to get good accuracy, and then train a simpler model to emulate the output of the complex model. This is useful if you have limited labeled data, since once you have the complex model working, you can generate “free” labels. I’m not entirely convinced a complex model is better with limited data (the conventional wisdom is the opposite), but perhaps if you use a pre-trained model? That is, start with a standard architecture like Inception trained on ImageNet’s millions of images, tune it to match your particular but label-poor problem, and then use it to generate a large amount of data to train the simple model.
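
    A minimal sketch of that recipe in scikit-learn terms (my illustration of the general idea, not a claim about how the brain does it): a big “teacher” model is fit on the scarce labels, and its predictions then become free labels for a simple “student” model trained on the plentiful unlabeled data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# One pool of data; pretend only the first 200 rows come with labels
X, y = make_classification(n_samples=10000, n_features=20, random_state=0)
X_labeled, y_labeled = X[:200], y[:200]
X_unlabeled, y_hidden = X[200:], y[200:]

# Complex "teacher" model trained on the scarce labeled data
teacher = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_labeled, y_labeled)

# The teacher's outputs become "free" labels for training the simple "student"
pseudo_labels = teacher.predict(X_unlabeled)
student = LogisticRegression(max_iter=1000).fit(X_unlabeled, pseudo_labels)

print(student.score(X_unlabeled, y_hidden))   # how well the student matches the truth it never saw
```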

    Again, this is about machine learning and artificial neural networks; I don’t see any reason to believe the wet kind of neural network does this.

  13. Michael Watts says:

    I was just discussing this (focused on the Middle-Earth segment). I identified the following problems in the first three paragraphs:

    – Gimli addresses Legolas as “dwarf”.

    – Gimli kills an orc in battle, but does not take part in the battle.

    – The opponents are reduced to a “blood-soaked quagmire” in the space of two words, or less than a second. This is unrealistic.

    – Actually, while “blood-soaked quagmire” would be a good phrase to use to describe the battlefield, it can’t really be applied to the participants.

    – Although the opponents are defeated in less than a second, the battle lasts for hours.

    – At the conclusion of the battle, two orcs lie defeated for miles and miles.

    I concluded that the researchers’ own description of their agent as performing “close to human quality” was not accurate. But, there were some more interesting takeaways from the discussion.

    Introductory material — it’s long (sorry)

    In linguistics, the “stuff of language” is analyzed as comprising a series of layers:

    1. Your ear gets an analog waveform, the sound you hear.

    2. You resolve this into a sequence of phonemes, the “sound units” admitted by your language. The sequence of analog wave values turns into a sequence of discrete symbols.

    3. You resolve the sequence of phonemes into words. (In spoken speech, there is in general no marker for where one word ends and another begins. Such markers may exist but usually don’t, and where they exist they serve some other purpose than demarcating a word boundary.) This processes the sequence of discrete (phoneme) symbols from layer 2 into a new sequence of discrete symbols called words.

    Q: What’s the difference between a “phoneme” and a “word”?

    A: Phonemes are symbols within the language, but they are the sounds of which words consist. Different languages admit different sounds. An English speaker pronouncing “thick” will begin with a sound conventionally written [θ] (“th”). An English speaker hearing that sound will hear their own phoneme /θ/. A Mandarin speaker hearing the same sound will hear /s/ (“s”). A Russian speaker hearing the same sound will hear /f/ (“f”).

    Phonemes are symbols, but they don’t contain meaning. A “word” is the smallest unit of a language to which meaning can be ascribed (e.g. “person”). This turns out to be a significant difference, and this is why we say the sequence of words derives from the sequence of phonemes rather than directly from the analog sound wave.

    3. (cont’d) There is a principle in linguistics that the meaning associated with a word is purely a matter of convention (“the arbitrariness of the sign”). Knowing the meaning of a word means memorizing it. This is the lowest level at which we find meaning; earlier levels are studied under the names “phonetics” and “phonology”, but here we get into “semantics”. “Person” is a word with semantic content. “If” is a word, but one that is more or less without semantic content.

    4. The sequence of words is resolved into a sequence of sentences. Each sentence has a more complex structure than a simple sequence of symbols — it consists of words arranged in a parse tree, such that a given word may be more closely associated with one or more specific other words than it is with the rest of the words in the sentence. We call this layer “syntax”, and the “principle of compositionality” says that the meaning of a sentence is derived in predictable ways (the language’s rules of syntax) from the meaning of the words it contains. (As opposed to being arbitrary, like the meaning of a single word.)

    5. You can go beyond this. Meaning can be conveyed beyond the literal meaning of a sentence, by the choice a speaker makes of what to say, how to say it, and/or what to leave unsaid. This is called “pragmatics”.

    The layers model isn’t perfect, as you might have noticed from the mention of “if”. “If” exists at the word level, but it really operates at the sentence level.
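
    To make the layering concrete, here is a minimal sketch of the pipeline as code. Every function name and type below is an illustrative placeholder, not a real speech system; the point is only that each layer consumes the discrete output of the layer below it.

      # Hypothetical pipeline mirroring layers 1-5 above (placeholders only).
      from typing import Any, Dict, List

      def to_phonemes(waveform: List[float]) -> List[str]:
          """Layer 1 -> 2: resolve an analog waveform into discrete phoneme symbols."""
          raise NotImplementedError

      def to_words(phonemes: List[str]) -> List[str]:
          """Layer 2 -> 3: segment the phoneme stream into words (no boundary markers given)."""
          raise NotImplementedError

      def to_parse_trees(words: List[str]) -> List[Any]:
          """Layer 3 -> 4: group words into sentences with tree structure (syntax)."""
          raise NotImplementedError

      def to_speaker_meaning(trees: List[Any], context: Dict[str, Any]) -> str:
          """Layer 4 -> 5: combine literal sentence meaning with context (pragmatics)."""
          raise NotImplementedError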

    Slowly getting to the point…

    Phonetics is basically just an implementation detail of language. It’s essential if you actually want to produce or perceive speech, but it’s of no interest when you’re studying how language is understood (and indeed there are sign languages, which don’t have any phonetics).

    There is support, from various types of mental disorders, for distinguishing layers 3/4/5 as I’ve described them. A person suffering from Broca’s aphasia knows the association between word and meaning, but isn’t capable of forming sentences:

    Broca’s aphasia is a type of aphasia characterized by partial loss of the ability to produce language (spoken, manual, or written), although comprehension generally remains intact. A person with expressive aphasia will exhibit effortful speech. Speech generally includes important content words, but leaves out function words that have only grammatical significance and not real-world meaning, such as prepositions and articles. This is known as “telegraphic speech”. The person’s intended message may still be understood but his or her sentence will not be grammatically correct. In very severe forms of expressive aphasia, a person may only speak using single word utterances. Typically, comprehension is mildly to moderately impaired in expressive aphasia due to difficulty understanding complex grammar.

    The Language Instinct includes this passage describing a Broca’s patient:

    he was as linguistically hobbled when he wrote as when he spoke. Most of his handicaps centered around grammar itself. He omitted endings like -d and -s and grammatical function words like or, be, and the, despite their high frequency in the language. When reading aloud, he skipped over the function words, though he successfully read content words like bee and oar that had the same sounds. He named objects and recognized their names extremely well. He understood questions when their gist could be deduced from their content words, such as “Does a stone float on water?” or “Do you use a hammer for cutting?”, but not one that requires grammatical analysis, like “The lion was killed by the tiger; which one is dead?”

    Despite Mr. Ford’s grammatical impairment, he was clearly in command of his other faculties. [The researcher] notes: “He was alert, attentive, and fully aware of where he was and why he was there. Intellectual functions not closely tied to language, such as knowledge of right and left, ability to draw with the left (unpracticed) hand, to calculate, read maps, set clocks, make constructions, or carry out commands, were all preserved. His Intelligence Quotient in nonverbal areas was in the high average range.”

    A person suffering from Wernicke’s aphasia has no problem producing correct sentences, but can’t express any meaning:

    Patients with Wernicke’s aphasia demonstrate fluent speech, which is characterized by typical speech rate, intact syntactic abilities, and effortless speech output. Writing often reflects speech in that it tends to lack content or meaning. In most cases, motor deficits (i.e. hemiparesis) do not occur in individuals with Wernicke’s aphasia. Therefore, they may produce a large amount of speech without much meaning.

    Patients diagnosed with Wernicke’s aphasia can show severe language comprehension deficits; however, this is dependent on the severity and extent of the [brain] lesion. Severity levels may range from being unable to understand even the simplest spoken and/or written information to missing minor details of a conversation.

    In particular, such patients have extreme difficulty producing a specific desired noun, such as if you were to point at a shoe and ask what it was. Again from The Language Instinct:

    [Denyse] comes across as a loquacious, sophisticated conversationalist — all the more so, to American ears, because of her refined British accent. […] It comes as a surprise to learn that the events she relates so earnestly are figments of her imagination. Denyse has no bank account, so she could not have received any statement in the mail, nor could her bank have lost her bankbook. Though she would talk about a joint bank account she shared with her boyfriend, she had no boyfriend, and obviously had only the most tenuous grasp of the concept “joint bank account” because she complained about her boyfriend taking money out of her side of the account. In other conversations Denyse would engage her listeners with lively tales about the wedding of her sister, her holiday in Scotland with a boy named Danny, and a happy airport reunion with a long-estranged father. But Denyse’s sister is unmarried, Denyse has never been to Scotland, she does not know anyone named Danny, and her father has never been away for any length of time. In fact, Denyse is severely retarded. She never learned to read or write and cannot handle money or any of the other demands of everyday functioning.

    A person who is “socially awkward” may take whatever anyone says at face value. This would be a failure in layer 5.

    In my model, a Broca’s patient has a working layer 3 and a broken layer 4, while a Wernicke’s patient has a working layer 4 but a broken layer 3.

    OK, we’re there

    Language is a tool for conveying meaning. We’ve seen that there’s a layer in which the meaning of a sentence is drawn from the meanings of the words it contains, and a lower layer in which words have raw meanings, drawn from a model of the world and associated with the word by a feat of memorization.

    There are two kinds of “language comprehension” associated with these two layers. I might ask you “the lion was killed by the tiger; which one is dead?”. And you can answer this purely by grammatical analysis, without knowing the concepts to which “lion”, “tiger”, or even “dead” refer. (You have to know that there is a causative relationship between the words “kill” and “dead”, but there are good reasons to believe that this knowledge is held at the level of syntax. Stated another way, you don’t need to have a mental relationship between the word “kill” and the concept of death, and another one between the word “dead” and the concept of death, and then derive the relationship between “kill” and “dead” from the fact that they are linked to the same concept — you can just know directly that the words are related.) A Broca’s aphasic would be unable to answer this question; a Wernicke’s aphasic has no trouble with it.

    But I could also ask “what is a lion?” A Broca’s aphasic would have great difficulty answering the question, because they can barely talk. And a Wernicke’s aphasic would answer fluently and correctly — it’s a big cat, it has a mane of long fur all around its head, it lives in prides, it has a fearsome roar… but if you took two patients to the zoo, and asked them to point to the lion, the Broca’s aphasic could do it, and the Wernicke’s aphasic might not know.

    Reading GPT-2’s text, it’s clear that GPT-2 has learned English grammar at the level of a normal human. It can form correct sentences, it generally doesn’t form incorrect sentences, and it can answer questions that are purely a matter of grammatical analysis.

    It’s also clear that GPT-2 doesn’t know the meaning of anything it says. It tells stories about fires being lit underwater. It tells how Gimli kills an orc without taking part in the battle in which he kills the orc. It misses that a “quagmire” is a place and not a collection of people.

    GPT-2 has Wernicke’s aphasia. It writes at the level of a human who is so severely mentally retarded as to be completely unable to function. Everything it says, it says — like Denyse — simply because someone else once said something similar. It can appear to hold to a storyline despite the fact that it doesn’t know what it’s saying, because other people have told similar stories.

    With this in mind, I would not agree that GPT-2 is dreaming, except to the extent that you believe a severely retarded person with chatterbox syndrome is dreaming while they talk with you. The problems all stem from the fact that GPT-2 has no model of the world, and is therefore not capable of distinguishing sense from nonsense.

    • Winter Shaker says:

      two orcs lie defeated for miles and miles

      George:
      If we should step on a mine sir, what should we do?

      Blackadder:
      Well the normal procedure is to leap 200 feet into the air and scatter yourself over a wide area.

    • whereamigoing says:

      It has to have some understanding of meaning to solve the Winograd Schema Challenge, where sentences are specifically paired to have identical grammatical structure. But yes, that might be its weakest point — training with audio/video data somehow might help.

      https://blog.openai.com/better-language-models/#task2

      • Michael Watts says:

        It has to have some understanding of meaning to solve the Winograd Schema Challenge, where sentences are specifically paired to have identical grammatical structure.

        I will disagree. The example given is the pair

        1. The trophy doesn’t fit into the suitcase because it is too large.

        2. The trophy doesn’t fit into the suitcase because it is too small.

        (And the question is, what is the referent of “it”?)

        There is a relationship based in reality in which not fitting involves contents and a container, and the problem can be that the contents are too big or that the container is too small.

        There is also a relationship based in English grammar in which the verb fit has a subject and one or more other complements, possibly supplied by context. The semantic roles of contents and container are marked by the syntax of this sentence. The contents are the subject of fit, and the container is the noun marked by into.

        Thus, our patient who doesn’t know what it means to be “large” or “small” can nevertheless conclude that if something was too large in an instance of not fitting, it must have been the contents, and if something was too small in an instance of not fitting, it must have been the container. There are no examples of usage in the other direction. (Because of the reality constraint.) I don’t see how this differs from “knowing” that a lion’s mane consists of long fur, despite not knowing what a lion is or what fur is; or from “knowing” that if the lion was killed then the lion is dead, despite not knowing what death is. Knowing the meaning of “large”, “small”, and “fit” allows you to conclude by logic what the pronoun in the Winograd question must refer to, but you can answer correctly without that knowledge, by reference to other examples of things not fitting.

        In other words, it is true that the two Winograd sentences have identical grammatical structures, but that isn’t the right place to try to confuse our grammar-only agent. The potential referents of “it”, the trophy and the suitcase, occupy different positions in that structure.

        (Note that I call the roles “contents” and “container” for your benefit, but you can just as well call them “subject” and “into-complement”.)
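
        For what it’s worth, the structural rule I’m describing fits in a few lines. This is a minimal sketch for illustration only (the function and its inputs are made up, and it is not a claim about GPT-2’s actual mechanism):

          # Resolve "it" in "The X doesn't fit into the Y because it is too ADJ"
          # using only syntactic roles, never the meanings of X and Y.
          def resolve_it(subject: str, into_object: str, adjective: str) -> str:
              # Pattern available from usage alone: "too large" co-occurs with the
              # contents (the subject of "fit"), "too small" with the container
              # (the noun marked by "into").
              if adjective == "large":
                  return subject
              if adjective == "small":
                  return into_object
              raise ValueError("no stored usage pattern for this adjective")

          print(resolve_it("trophy", "suitcase", "large"))  # -> trophy
          print(resolve_it("trophy", "suitcase", "small"))  # -> suitcase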

        • jp says:

          Let’s stay technical: GPT2 probably understands meaning in the sense of having some sense of word semantics. It probably doesn’t understand meaning in the sense of qualia or reference.

          • Michael Watts says:

            I’m inclined to ask how GPT-2 could possibly have any sense of word semantics. What do you mean by this?

            As far as I’m concerned, for word semantics you need a mental model of stuff in the world. Then you map words, the linguistic artifact, to concepts, the things you think about.

            Do you mean that GPT-2’s internal configuration will have identified clusters of words that appear to be more or less interchangeable, modulo some syntactic transformations? Things like kill/die/slay/expire ; rest/relax/slack off ; green/verdant ; etc? I think that’s likely.

          • jp says:

            @Michael Watts

            I’m inclined to ask how GPT-2 could possibly have any sense of word semantics. What do you mean by this?

            As far as I’m concerned, for word semantics you need a mental model of stuff in the world. Then you map words, the linguistic artifact, to concepts, the things you think about.

            Do you mean that GPT-2’s internal configuration will have identified clusters of words that appear to be more or less interchangeable, modulo some syntactic transformations? Things like kill/die/slay/expire ; rest/relax/slack off ; green/verdant ; etc? I think that’s likely.

            Why, actually, would you need a mental model of stuff in the world? I can’t picture “an ultraviolet panda bear/unicorn mix with exactly the same face as Helen of Troy”. But I know what it is: it is a, well, mix of a unicorn and a panda bear, which has a color of a certain frequency range, and a face of a famous, perhaps fictive, person. Maybe that’s a bad example, but generally, we have words and concepts for a lot of things that are not “stuff in the world”.

            Yes, as you suggest: GPT2 probably has some notions analogous to synonymy, hyponymy, similarity. These are comparatively easy to get from texts alone. What it doesn’t have are reference (to stuff in the world, memories, …) and qualia.

          • Michael Watts says:

            Why, actually, would you need a mental model of stuff in the world? I can’t picture “an ultraviolet panda bear/unicorn mix with exactly the same face as Helen of Troy”. But I know what it is: it is a, well, mix of a unicorn and a panda bear, which has a color of a certain frequency range, and a face of a famous, perhaps fictive, person. Maybe that’s a bad example, but generally, we have words and concepts for a lot of things that are not “stuff in the world”.

            You need it to prevent your word definitions from just being empty closed loops. Everything you’ve described terminates at a concrete mental model. You know the meaning of the phrase because you know English grammar. You know the meaning of the words because they point to your mental model of the world. If you just think that wugs are young voozes and voozes are old wugs, neither word has any more semantic content than “young” and “old” do.

            Saying that two words are synonymous doesn’t tell you anything about what they mean, only about how they relate to each other.

          • HeelBearCub says:

            I can’t picture “an ultraviolet panda bear/unicorn mix with exactly the same face as Helen of Troy”.

            You can’t?

            I would think that most people will call to mind a mental image composed of the composite parts: a panda bear with a horn sprouting from its head, with the face of some pretty woman vaguely related to the Greeks or named Helen, colored in the hues brought out by a black light.

          • jp says:

            @Michael, two questions. How do you think we ground our meanings? How do we come up with “mental models” to connect our word meanings to? (This is a very hard, or at least an unsolved, question.)
            And how do you operationalise meaning?

            I think if you try that, you’ll either end up agreeing with me that GPT2 has meaning in some sense, or end up with something that’s closer to reference or qualia, which I agree it doesn’t have.

            And if you have synonymy, antonymy, …, you already have a great deal of semantics: you can tell much about which sentences are necessarily true, and which sentences are true given a few other sentences. Wouldn’t you say that we need to know some semantics to be able to do that? Sure, it doesn’t feel like semantics to you, but shouldn’t a proper definition of semantics – not: of thinking with words similar to how humans do! – be fairly close to that?

        • whereamigoing says:

          Does this differ from a hypothetical person who is blind and has no sense of touch, but isn’t deaf? They also understand relationships between words, but can’t associate them with visual data.

          Would being able to make esoteric comparisons like “a mouse is smaller than a mountain” count as semantic understanding? Otherwise this seems to be a philosophical question like the Chinese room.

          But I guess I agree in that I think combination with other modalities is a good next step. Also, see https://blog.openai.com/learning-concepts-with-energy-functions/

          • Michael Watts says:

            Who said anything about visual data? Even the blind and touchless hypothetical person has plenty of direct experience to refer to. They will know the people who interact with them individually. (They’ve been constructed so that that’s all they can do, but… fine.) They will know when they feel hungry or thirsty, even if they can’t tell whether they’re eating or drinking.

            If you were blind and had no sense of touch, would you think of your mother any differently than you would a stranger who said hello?

          • whereamigoing says:

            I was responding to “As far as I’m concerned, for word semantics you need a mental model of stuff in the world.” GPT-2 seems like the hypothetical person in that their mental model of the world is mainly a model of the sentences they have experienced, so distinguishing syntactic from semantic understanding is a bit more difficult, since the only sensory modality is verbal. (If there’s at least one other modality, one can just check for associations between modalities.)

            But never mind that. Again, would being able to make esoteric comparisons like “a mouse is smaller than a mountain” count as semantic understanding? Then it would be clearer to me what you mean by “semantic understanding”.

    • nameless1 says:

      “It’s also clear that GPT-2 doesn’t know the meaning of anything it says.”

      Careful with the wording, please: talking about meaning can easily generate an absolutely unproductive philosophical debate about Chinese Rooms and Searle-type philosophers. Let’s rather put it this way: it does not have any generative model of what can and cannot happen in a context. Or you could say what it lacks is the knowledge that fire burning under water is a low-probability event, and that heaping a lot of low-probability events on top of each other is not a good idea.

      This is one way to look at dreams. In many dreams there are events that, in and of themselves, could happen. It is just that their happening in a given causal context is low probability, and what the dream engine in the brain screws up is the probability calculations. Basically it just spits out a series of low-probability events.

      • sorrento says:

        I agree that debates about the meaning of “meaning” are almost always unproductive. But in general I believe that there is more to intelligence than finding correlations between stuff, which is really what the neural net is doing here. You can’t do logical thinking with this framework (if I have 2 apples and I add 2 more, how many do I have?) or planning. There’s no emotion, no reason for the AI to want any one thing more than another thing. Maybe these things can be built on top of this work somehow. But then again, maybe not.

        Right now, it’s a word-salad generator which is more impressive than a Markov chain, but which still, fundamentally, feels kind of similar (at least to me). The output isn’t plausible on closer examination (Michael Watts gave a lot of examples of why).
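
        To make the comparison concrete, here is a minimal word-level Markov chain sketch (a toy bigram sampler on a made-up corpus, not anyone’s actual system); GPT-2 conditions on far more context than the single previous word, which is most of the difference:

          import random
          from collections import defaultdict

          def train_bigrams(text):
              """Count, for each word, how often each following word occurs."""
              counts = defaultdict(lambda: defaultdict(int))
              words = text.split()
              for a, b in zip(words, words[1:]):
                  counts[a][b] += 1
              return counts

          def generate(counts, start, length=20):
              out = [start]
              for _ in range(length):
                  followers = counts.get(out[-1])
                  if not followers:
                      break
                  words, weights = zip(*followers.items())
                  out.append(random.choices(words, weights=weights)[0])
              return " ".join(out)

          corpus = "the orcs charged and the dwarf charged and the elf loosed an arrow"
          print(generate(train_bigrams(corpus), "the"))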

        The big news here is probably that once someone else re-invents this and open sources it (and someone almost certainly will), lazy students can use this to generate their book reports. Or maybe spammers can use it to try to spam Google. Probably they’ll want to lightly edit the output first to remove the obvious impossibilities and incongruities. So another ML application has opened up.

    • Hoopdawg says:

      In spoken speech, there is in general no marker for where one word ends and another begins.

      I take offense to this statement. That’s what accent is for. It may be less obvious in English than in all the regularly accented languages that always stress the n-th syllable of the word or whatever, but as a general rule, all speech has evolved clues that allow us to separate words from each other.

      A “word” is the smallest unit of a language to which meaning can be ascribed (e.g. “person”).

      To this one, too. The smallest unit of meaning is called a morpheme, and is not identical to a word. Words can contain multiple morphemes. Again, it’s less obvious in a mostly non-inflected language like English, though inflection is not the only way it happens. For example, English has a multitude of compound words – such as “panhandle”, composed of two morphemes, “pan” and “handle”. (To backtrack a bit, note how when you pronounce “panhandle”, it’s stressed differently from how the separate words “pan” and “handle” spoken in succession would be.) And it’s not that “panhandle” has a different meaning from “pan handle”; both describe a handle of a pan. It’s just that “panhandle” is apparently common enough (including metaphorical usage – Florida!) to deserve its own entry in our brains’ databases. In a way, it may be more accurate to say that words are the biggest units of language to which meaning can be assigned (rather than derived by reasoning) – though that’s also an oversimplification.

      I know, I know, I’m nitpicking, none of this conflicts with your wider point, I just wanted to post all this because it’s fun.

      • angmod says:

        I would like to make the argument that the smallest unit of meaning is the sentence (used somewhat loosely): that, at least, is the level where the combination of morphemes starts to function as meaning. Morphemes have significance, but the significance is multivalent and under-determined outside the context of a sentence. It’s at the level of the sentence that the meaning we’re familiar with arises, at least in normal human speech acts.

        This is at least a contributor to my skepticism that a system like GPT-2 might function in a way at all analogous to human communication.

    • JohnBuridan says:

      +1

    • MugaSofer says:

      This is utterly fascinating, and I think you’ve hit the nail on the head here. Hopefully Scott links this in a “best of” post.

      A few supplementary observations:

      – Turning an army of orcs into a bloody quagmire is probably a mistake, but it also makes perfect sense (their blood is what forms the quagmire) and is beautifully poetic language to boot. I’m not sure if this is just good luck or may indicate something about the nature of poetic/clever language (counterintuitive but “correct”/meaningful statements are often pleasing.)

      – Language disorders make great traits for an alien species and ima steal em.

      – We now have neural nets that can recognize a lion by looking (but can’t tell you abstract facts about one, or communicate what they see beyond keywords), and neural networks that can tell you abstract facts and descriptions involving lions all day long (but haven’t a clue what one is.) We also have networks that can convert text to speech and vice versa; and we have crude “sentiment analysis” programs that can partially approximate your level 5. How soon until someone successfully strings all five together? What would such a program be capable of? (Note that physical robotics is *also* human/superhuman and could potentially be thrown in there.)

  14. Peter says:

    The thing about dreams: there are some cases where the prediction stuff isn’t completely free-running. There’s the weird sensory-incorporation-into-dreams thing. One example is the case of a person who was a passenger in a car, and fell asleep with his eyes open. On waking, he said he’d had a dream about orange fireworks – they’d just been driving past some sodium lights.

    It’s touted as one of the bits that Inception got right – e.g. the scene where some dreamer is plunged into cold water, and in his dream there’s a massive sudden flood that goes in through the windows for no readily apparent reason.

    So in these cases the connection to sensory input is fairly loose – the dream phenomena aren’t so much sense impressions as things that are vaguely inspired by them.

    • Saint Fiasco says:

      When a guy is plunged into water and then dreams of a flood, what happens in reality is that he wakes up and his memory of the dream changes to incorporate the water. The dream is random noise, the brain just uses the ‘getting wet’ stimulus to retroactively make sense of the random noise and imagine a memory of a flood.

      This sort of thing happens most often with alarm clocks. My dreams sometimes incorporate the sounds of my alarm clock before the alarm clock rings, because time in dreams is also an illusion.

  15. jasmith79 says:

    Did anyone else fill in “I have a little ___” with list? Or am I the only musical theatre fan/potential serial murderer around here? (yes I know in the Mikado it’s I’ve got a little list, but my brain glossed right over that factoid).

  16. Dan says:

    “We won’t forget the first great battle of the night, even if we may have forgotten the final defeat.”

    AI research in a nutshell…

  17. konshtok says:

    both quoted pieces show no actual thinking

    • Simon_Jester says:

      Yes. The interesting thing here is how remarkably close you can get to generating readable, parsable text that looks like it could plausibly have been generated by (incompetent) human hands, while having no “actual thinking” and none of what we might informally call “understanding of what one is talking about.”

      • Faza (TCM) says:

        Isn’t that more a demonstration of the human reader’s ability to find sense where there is none? I’ve seen people seemingly completely taken in by stateless chatbots that did nothing but spew random phrases.

        The products of GPT-2 might tell us something interesting about how we approach written text (albeit probably nothing that wasn’t known before), but presently don’t look much like a viable step towards artificial intelligence in any meaningful sense of the term.

  18. In Thinking, Fast and Slow, the author discusses a contrast between thought that is fast and instinctive (system 1) and thought that is slow and deliberate (system 2). It seems like predictive processing and generative networks explain system 1 thought fairly well, but we’re still missing something in system 2. I blogged about how this relates to the differences between AlphaGo and AlphaStar.

    • whereamigoing says:

      It’s UCT search, not alpha-beta (technical detail, but they found UCT is much more robust, basically due to replacing a hard min with a weighted average).
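
      (Concretely, the “weighted average” is the mean simulated value of each move plus an exploration bonus; here is a minimal sketch of the UCT/UCB1 selection rule, with made-up node fields, assuming every child has been visited at least once:)

        import math

        def uct_select(children, c=1.4):
            """Pick the child with the highest mean value plus exploration bonus (UCB1)."""
            parent_visits = sum(ch.visits for ch in children)
            def score(ch):
                mean = ch.total_value / ch.visits   # weighted average, not a hard min/max
                explore = c * math.sqrt(math.log(parent_visits) / ch.visits)
                return mean + explore
            return max(children, key=score)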

      I don’t think there’s such a hard cutoff between developing system 1 and 2. System 2 is just higher-level, so it’s going to take longer to emerge. Also, I suspect that a lot of what feels like strict, logical reasoning to humans is actually just very refined intuition (e.g. math relies on intuition in practice).

  19. jstorrshall says:

    This stuff has been brewing for some time. One of the best pieces of background is a blog post:
    The Unreasonable Effectiveness of Recurrent Neural Networks
    http://karpathy.github.io/2015/05/21/rnn-effectiveness/
    The only big deal in the new result is the size of the corpus, and perhaps the amount or depth of training.
    As far as dreaming is concerned, you might Google “Gelernter tides of mind”.

    • jp says:

      The only big deal in the new result is the size of the corpus, and perhaps the amount or depth of training

      No: the architecture is different (a transformer rather than an RNN – admittedly pioneered a year or two before this paper), and the interesting thing about the training corpus is not the size, but the pruning for quality. And they argue that there is little need for task-specific learning – many problems can be seen as naturally resulting from language modelling (= next-step prediction).

  20. rahien.din says:

    A lot of those ideas about dreaming make sense. Sleep is when we consolidate memory. And we know that learning and cognition are aimed at developing simplicity (this has been measured in terms of minimum algorithmic complexity, Kolmogorov complexity, and linguistic complexity). So the times when we are really learning are the times in which we are refining model complexity. And this seems to happen during sleep.

    As far as why dreams are so random, it has commonly been observed that adding just a little bit of noise to a system will produce a better result. My conjecture is that what looks like “dream logic” and “running a whole lot of nonsense through the predictive system” is simply the addition of jitter in order to refine the system.

    That would offer a reasonable story. During sleep, we consolidate memory, a process which is aimed at refining model complexity. An important part of this process is running the predictive engine with some injected noise, as we see with other non-human systems. That explains why dreams are so important and yet so weird.
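
    As a minimal illustration of the noise-helps idea (toy data and made-up sizes; just the standard input-jitter regularization trick, not a claim about what the brain does):

      import numpy as np

      rng = np.random.default_rng(0)
      X = rng.normal(size=(200, 5))                    # toy inputs
      w_true = rng.normal(size=5)
      y = X @ w_true + 0.1 * rng.normal(size=200)      # toy targets

      w = np.zeros(5)
      lr, sigma = 0.01, 0.05
      for step in range(1000):
          X_jit = X + sigma * rng.normal(size=X.shape)  # the "jitter": a little input noise
          grad = X_jit.T @ (X_jit @ w - y) / len(y)     # gradient of the squared error (up to a constant)
          w -= lr * grad                                # noise blurs circumstantial detail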

    • I’m afraid the memory consolidation phase of sleep is during deep sleep when you get sleep spindle brain waves, whereas the learning part of sleep is during REM sleep where your muscles relax and your eyes dart back and forth under your eyelids.

      • rahien.din says:

        This is neither accurate nor germane.

        Sleep spindles first appear in stage II sleep, not deep (stage III and REM stage) sleep. So I don’t think you know what you’re talking about.

        Moreover, as far as this neurologist can determine from this good review, “About Sleep’s Role in Memory”, it’s not exactly clear what each stage is doing.

        The ultimate conclusion from that paper is :

        [T]he current theorizing assumes an active consolidation of memories that is specifically established during sleep, and basically originates from the reactivation of newly encoded memory representations. …

        The active system consolidation process assumed to take place during sleep leads to a transformation and a qualitative reorganization of the memory representation, whereby the “gist” is extracted from the newly encoded memory information and integrated into the long-term knowledge networks.

        IE, during sleep, memories are actively consolidated, leading to storage of a model with refined complexity. Precisely my point.

        (You should source your claim.)

        • Sorry, I’d entirely misread your post so I’d like to apologize. The register of the phrase “consolidate memory” should have clued me in to the fact that you were using “memory” in the scientific jargon sense rather than the everyday English sense that only covers declarative memory. My knowledge only comes from reading “Why We Sleep” by Matthew Walker basically so I’m not especially fluent in the lingo. I’d somehow associated REM sleep with “light” and NREM sleep with “deep” but I’m not sure where I’d come up with that association.

  21. Adam says:

    Surely you can only talk about making a model simpler in the context of maintaining its predictive accuracy

    In software engineering we use “functional tests” which check that the output of a system is the intended output. We also use “regression tests” which ensure that the system keeps doing the same thing from one version to the next, with no explicit concept of what is correct. If there’s a process to automatically simplify a system, a regression test can give you confidence that the simplification process isn’t breaking its functionality.

    Is it possible the brain has something akin to a regression test? Some check that ensures that the simplifying updates it makes while you sleep aren’t turning you into a gibbering lunatic?
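
    A minimal sketch of the distinction (the system under test and the recorded outputs are made up for illustration):

      def simplify(x):                    # hypothetical system under test
          return round(x, 2)

      def test_functional():
          # Functional test: checks against what we *intend* the answer to be.
          assert simplify(3.14159) == 3.14

      RECORDED = {3.14159: 3.14, 2.71828: 2.72}   # outputs captured from the previous version

      def test_regression():
          # Regression test: checks only that behaviour hasn't drifted since the last version,
          # with no notion of which answer is "correct".
          for x, previous_output in RECORDED.items():
              assert simplify(x) == previous_output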

    • nameless1 says:

      You run regression tests on code, not data. What the brain is doing to memories is not really recoding itself, just processing data. And I don’t know how it processes it.

      But you talked about the simplification of code, so maybe it does a simplification of data. A good parallel for the simplification of data: if you have 10 grocery stores, maybe you don’t need to store every individual sale of one loaf of bread in the database (which makes an hourly sales report take two hours to run). You can delete them and replace them with one record saying that during this hour, in this shop, on this cash register, 11 pieces of the loaf of bread with item number 1234 were sold. And you have 1 database record instead of 11.

      This maintains predictive accuracy insofar as you still know everything you used to know, except at what time during that hour those items were sold. So now you cannot figure out whether the lunch-break rush is 12:10 to 12:35 or whatever. But suppose you don’t want to know that; in that case it is okay.
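
      In code, that simplification is just a group-by-and-sum; a minimal sketch with made-up transactions:

        from collections import defaultdict

        transactions = [   # individual sales, one row per loaf sold (made-up data)
            {"hour": 12, "shop": 3, "register": 1, "item": 1234, "qty": 1},
            {"hour": 12, "shop": 3, "register": 1, "item": 1234, "qty": 1},
            # ...nine more rows like these...
        ]

        summary = defaultdict(int)
        for t in transactions:
            key = (t["hour"], t["shop"], t["register"], t["item"])
            summary[key] += t["qty"]   # 11 rows collapse into 1; only the exact minutes are lost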

      • Adam says:

        Sounds like lossy compression. Some details take up more space than they’re worth so we don’t mind losing them. Sounds like a reasonable thing for an ANN to do.

        Anyway I feel like the distinction between code and data is even blurrier in ANNs than in the von Neumann architecture. If the code is the weights between neurons then simplification includes zeroing out low weights or removing corresponding links or nodes from the sum entirely, which speeds up processing and saves resources. Low weight literally means “doesn’t affect the output” which is what a regression test is checking for.
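
        A minimal sketch of that kind of simplification (magnitude pruning on a made-up weight matrix), together with the regression-style check that the output barely moves:

          import numpy as np

          rng = np.random.default_rng(1)
          W = rng.normal(scale=0.5, size=(8, 8))          # made-up layer weights
          x = rng.normal(size=8)                          # a sample input

          W_pruned = np.where(np.abs(W) < 0.1, 0.0, W)    # zero out the low weights

          before, after = W @ x, W_pruned @ x
          drift = np.max(np.abs(before - after))          # small drift = "doesn't affect the output"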

        The main problem with the regression test theory, to me, is that regression tests require you to store a set of known inputs-output pairs to test against, but I don’t see how that might happen in the brain unless they’re stored in the brain, and encoding your test data inside your classifier-predictor system seems counterproductive. But maybe the test is simpler than that; maybe it’s just “does this feel close enough to reality”.

  22. nameless1 says:

    My dreams got more vivid, more memorable and more realistic due to venlafaxine/Effexor; it sort of looks like I am getting half of what other people get out of putting effort into lucid dreaming, without putting the effort in. Right now whole chunks (20-30 minutes of in-dream time) make perfect sense, at least to the level an action movie makes sense (and it does not come from watching movies, as I don’t do that much; it might come from reading novels), and only the sudden shifts to another movie don’t.

    I don’t know what to make of this. The blocks that make sense imply the prediction engine is running better. This is something I don’t see Scott noticing (or I don’t notice him noticing): in dreaming, while you are shut off from current sensory experience, you are still working from stored sensory experience called memory. It can be real memory, or memory from reading a book or watching a movie. So if you have LOTR and 200 other fantasy books stored in your memory and your dreaming brain goes on this prompt, either it follows the writing of the books closely and generates something better than this, or it makes a complete mess, or something in between, and this depends on the quality of the prediction / generation engine.

    So it seems venlafaxine is making my prediction / generation engine work better now, generating 20-30 minute chunks that could go straight into any action movie script. Interesting. Does this mean the meds work? But how is it related to the problem of depression?

    What does not work is controlling the dreams. That half of lucid dreaming is not free, apparently. I try to think about something sexy before going to sleep… and then I am straight into chasing a spy on a motorbike down the motorway in a police helicopter, or something. Eh.

    But the random occasional sexy dream is always much better than real life. Your body may be 40 years old but in dreaming it is like being 17 again when even just scenting the perfume of a pretty woman is a knockout feeling…

    • philosophicguy says:

      When I have taken certain nootropics (piracetam, hydergine, etc) I have found that my dreams become long, coherent storylines like a movie, with much more fidelity to reality and much more internal plot consistency than a typical dream. On several occasions these dreams even had scrolling credits at the end, like a movie! But, as with your effexor experience, I had never been able to control these dreams in a lucid manner. They are simply more realistic, well-plotted and coherent than normal random dreams.

  23. woah77 says:

    I guess the question I have is “Where does ‘it came to me in a dream’ fit into this?” If AI can “dream” up things, is there a way to test/validate what it comes up with to evaluate its worth?

    For humans, we have a corpus of knowledge so that we know when we have stumbled upon something that might fix a problem, but is there a way to automate that for AI?

  24. HeirOfDivineThings says:

    But the details are a mess. Characters are brought in suddenly, then dropped for no reason. Important details (“this is the last battle in Middle-Earth”) are introduced without explanation, then ignored. The context switches midway between the battle and a seemingly unrelated discussion of hobbits in Rivendell. It cannot seem to decide whether there are one or two Rings.

    This isn’t a fanfiction, this is a dream sequence.

    Sounds like the gospel of Mark.

  25. ASparklingViking says:

    These generated texts feel like prediction without narrative.

    Makes me wonder if your typical human experience is that of a narrative/continuity enforcer running on top of the prediction generator.

  26. greghb says:

    Armchair speculation and I have not read Friston, but on this:

    At night, it switches to trying to make its model simpler, and this involves a lot of running the model without worrying about predictive accuracy. I don’t understand this argument at all. Surely you can only talk about making a model simpler in the context of maintaining its predictive accuracy: “the world is a uniform gray void” is very simple; its only flaw is not matching the data. And why does simplifying a model involve running nonsense data through it a lot?

    One way to simplify a model is to abstract it. Suppose your model contains, “When people are hungry, they eat fruit” and “When people are hungry, they eat sandwiches”. You can simplify your model by replacing those with, “If people are hungry, they eat food.” I could imagine, very loosely, that creating this simplification would require generation of something “ridiculous” like a person eating a fruit sandwich. Who knows why exactly — maybe you generate the abstractions and see how they themselves are judged by your current model, and only store the abstractions that are judged to be reasonably good. But the tentative juxtaposition of entities unrelated at the object level seems like a path toward abstraction, and abstraction seems like a path toward complexity reduction.

    Also: “I have a little ___”? I thought “dreidel”.

  27. honoredb says:

    It might be a fluke, but the open-source version of GPT-2 seems like it can pattern-complete pattern-completion. I gave it the prompt “The next dangerous teen craze is “, expecting it to fill in the blank with some kind of random self-destructive behavior, but instead it output

    Model prompt >>> The next dangerous teen craze is
    ======================================== SAMPLE 1 ========================================
    ____________. ____________ falls under the categories of being unthinking, out of control, selfish or intolerable threat to others in order to get what you want, or helplessly living with other people. __________ is famously described by Sissy Needleback once as “being angry at one’s woman,” apparently because “she’s unstoppable.” ____________ is the newest slang term for teenagers. ___________ hangers the paranoid and brazen exuberance to tell others that they need to talk shit to their families or friends. __________ is also called “a work they hate to listen to.” ____________ chief short as an exhausted hanger despite a boredom, complains incessantly of inability to lift the lower half of an ice cream cone overhead, and confuses his colleagues with fathers or boyfriends. _______

    In other words it seems to have sort of guessed that this was a fill-in-the-blank exercise, and filled in the blank with a blank (while correctly extrapolating the rest of the genre).

  28. sclmlw says:

    A boring sentiment, except for the source: the AI wrote that when asked to describe itself.

    Or in other words, it’s easy to think you see signal in the noise, or intelligent thought in a neural net designed to sound like normal speech, and to interpret dreams as trying to tell you something profound.

    Going by our experience with the other things produced by this algorithm, we would predict that this seemingly clairvoyant moment is an accidental coincidence, given that most of the time it spits out coherent-sounding but ultimately meaningless chains of phrases. That this isolated chain seems to fit an expectation is clearly just chance to which we ascribe intention ex post.

    We might say the same about dreams.

  29. Deiseach says:

    The LOTR one reminds me of back in the heady days when the movies were being released and Tolkien fandom explosively grew with all the new kids coming in, and I’m pretty sure I encountered at least one fanfic where the author(ess) thought it was a great idea to have somebody make a rival One Ring to fight Sauron’s Ring.

    The history essay is terrible but parts of it do read awfully like “I have to write this stupid essay for stupid history class about who cares the hell about the Civil War, if I pad it out as much as possible with repeating variations on the same sentence I can just fill up the necessary amount of pages” school work 🙂

    • Evan Þ says:

      and I’m pretty sure I encountered at least one fanfic where the author(ess) thought it was a great idea to have somebody make a rival One Ring to fight Sauron’s Ring.

      Did someone snatch my fanfic idea out of my brain and give it to you to read?

      (Thankfully, I didn’t write it up – at least, not till I’d moved it to a new world and fleshed it out to create an independent non-fanfic novel. I then left that novel unfinished, but there were other reasons for that.)

      Back on-topic, I actually got that fanfic idea from a dream. I get a lot of story ideas that way, and most of them are much more coherent than this neural net’s output.

      • JubileeJones says:

        Yes, someone did steal your thoughts, and then they sold them to Monolith Productions, who made them into Shadow of Mordor/Shadow of War 😛

    • MaxieJZeus says:

      I’m pretty sure I encountered at least one fanfic where the author(ess) thought it was a great idea to have somebody make a rival One Ring to fight Sauron’s Ring.

      I believe that Tolkien himself suggested that as a logical course of action, though one that would lead to bad results. I don’t remember the context, but I think in one of his letters he remarked that a “realistic” version of LotR would have had members of the anti-Sauron coalition trying to create Rings of their own; and I think he wrote somewhere that Saruman already was trying to fashion one. Anyway, Ring proliferation would be as undesirable as nuclear proliferation.

      I wonder if that AI would be more convincing when trying to write in the style of Faulkner or Lovecraft. Neither of them uses “dream logic” as such, but they will go on for pages and pages in ways that are only weakly sequential, so that you’re unlikely to run into weird leaps or juxtapositions; and contra Michael Watts above, I think “blood-soaked quagmire” could describe what’s left of some of Lovecraft’s supporting characters after an Old One got through with them.

      • Eric Rall says:

        That sounds like this bit from the Foreword to the Second Edition, included at the beginning of most later printings of The Fellowship of the Ring:

        The real war does not resemble the legendary war in its process or its conclusion. If it had inspired or directed the development of the legend, then certainly the Ring would have been seized and used against Sauron; he would not have been annihilated but enslaved, and Barad-dur would not have been destroyed but occupied. Saruman, failing to get possession of the Ring, would in the confusion and treacheries of the time have found in Mordor the missing links in his own researches into Ring-lore, and before long he would have made a Great Ring of his own with which to challenge the self-styled Ruler of Middle-earth. In that conflict both sides would have held hobbits in hatred and contempt: they would not long have survived even as slaves.

        The same foreword also contains this gem of the art of the Cordially Understated Insult:

        Some who have read the book, or at any rate have reviewed it, have found it boring, absurd, or contemptible; and I have no cause to complain, since I have similar opinions of their works, or of the kinds of writing that they evidently prefer.

        • MaxieJZeus says:

          Thanks! That introduction is what I was thinking of when thinking of “Saruman’s research.” There may or may not have been other places where Tolkien talked about “multiple Rings.”

        • Paul Zrimsek says:

          Saruman also boasted to Gandalf of being “Saruman the Wise, Saruman Ring-maker, Saruman of Many Colours”.

          I have my doubts about whether the author of ‘Leaf by Niggle’ really hated allegory as much as he claimed to in that Foreword.

  30. lecw says:

    > All day long, your brain’s generative model is trying to predict true things, and in the process it snowballs in complexity – your brain runs literally hotter dealing with all the complicated calculations. At night, it switches to trying to make its model simpler, and this involves a lot of running the model without worrying about predictive accuracy. I don’t understand this argument at all. Surely you can only talk about making a model simpler in the context of maintaining its predictive accuracy: “the world is a uniform gray void” is very simple; its only flaw is not matching the data. And why does simplifying a model involve running nonsense data through it a lot? I’m not sure.

    Wait wait that made a lot of sense to me (also I did my masters in ML).

    > Surely you can only talk about making a model simpler in the context of maintaining its predictive accuracy

    Yes and no. No, of course abrupt drops like not caring about accuracy at all anymore are not permitted. But yes, simplifying the model in some sense _entails_ sacrificing some accuracy. In ML, you very specifically want to lose the (“overfitting”) accuracy about the training set (past experience) that is not relevant to future performance: this is what “generalising” is all about, finding what is true about the past that will hold up, and differentiating it from what is only circumstantially true about the past.

    Therefore, a high-level policy of “learn everything about today in much detail”, then forget everything a little bit, then learn everything about the next day, makes a lot of sense. The daily circumstances will be eroded away at night, while the fundamental truths will be reinforced over time.

    > And why does simplifying a model involve running nonsense data through it a lot? I’m not sure.

    Seems obvious to me, to the point of being hard to explain. On the ML side, look into denoising autoencoders. Their input is a corrupted version of A (say A + 30% noise), and the target output is A, so the model learns to separate the input noise from the true bits. You can also think of it as compression: “I want to remember this 1Mb object (predict it back), but I only have this 300kb memory, so I’ll find the parts of the input that most predict the others, and forget what is most random/circumstantial about it”.
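
    A minimal denoising-autoencoder sketch in PyTorch (toy data and sizes, purely illustrative): the model only ever sees the corrupted input, but is scored against the clean original.

      import torch
      import torch.nn as nn

      x = torch.rand(256, 32)                          # stand-in "clean" data A
      enc, dec = nn.Linear(32, 8), nn.Linear(8, 32)    # the narrow middle is the "300kb memory"
      opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)

      for step in range(200):
          noisy = x + 0.3 * torch.randn_like(x)        # corrupted version: A + 30% noise
          recon = dec(torch.relu(enc(noisy)))          # the model sees only the corrupted input
          loss = ((recon - x) ** 2).mean()             # ...but is scored against the clean A
          opt.zero_grad(); loss.backward(); opt.step()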

    Also, in autoencoders as in what they say happens (and unlike how you suggest it), it is _not_ nonsense data that runs at night: it’s _model predictions_, unconstrained by the usual minutely-detailed reality feed. Just like the text excerpts, note that the dream predictions are already pretty good! They retain a lot of the caring-about-predictive-accuracy habit from daytime behaviour, but the habit coach is absent, so they may drift a bit out of accuracy without getting put back on track. This would likely reinforce the existing trends in the model, i.e. the top trends become more established, and the lesser paths get forgotten. (See also “use it or lose it”: this is the implementation of my above “forget everything a little bit”.)

    My explanation would predict that dreams early in the night are more realistic, and dreams late in the night (I mean after a lot of dreaming, so actually late in the morning) would be less constrained, more wild. Maybe some people in the lucid dreaming community can weigh in on that.

  31. Gerry Quinn says:

    Still seems to be fundamentally a fancy Markov Chain.

    I suppose that begs the question of whether WE are.

  32. Manx says:

    Whether you were planning to or not, you just made an interesting argument in favor of old psychoanalytic projective tests and dream interpretation as ways of accessing unconscious processes.

    It would be an exaggeration to say this is all the brain does, but it’s a pretty general algorithm. Take language processing. “I’m going to the restaurant to get a bite to ___”. “Luke, I am your ___”. You probably auto-filled both of those before your conscious thought had even realized there was a question. More complicated examples, like “I have a little ___” will bring up a probability distribution giving high weights to solutions like “sister” or “problem”, and lower weights to other words that don’t fit the pattern.

    Oldschool psychoanalysis would give patients sentences like “I have a little___” and see what the patient generated. The thought was you could get at what algorithm their brain was running with less interference (resistance) this way. And of course, if you’re a Freudian, you expect the answer to be ‘penis.’

    Likewise, if dreams are predictive algorithms stripped of corrective sense data, then they would be excellent sources to examine if you want to get at what the algorithm actually is.

    I have so far stayed away from projective tests and have not done much with dream interpretation, but this theory of mind makes me more interested in learning those techniques.

  33. Doctor Mist says:

    they need a step where they hallucinate some kind of random information, then forget that they did so.

    This sounds a little like simulated annealing, but I don’t see any sign that the ML folks you cite have made that connection, so I’m probably wrong.

  34. deciusbrutus says:

    On the subject of simplicity/accuracy:

    Suppose the waking brain operated in such a way as to reject any change that reduced accuracy at all, but also rejected any change that reduced simplicity by much more than it increased accuracy. Suppose also that the sleeping brain rejected every change that reduced simplicity at all, but also rejected any change that reduced accuracy much more than it increased simplicity.

    An algorithm that simply tried to maximize some joint non-decreasing function of simplicity and accuracy f(S, A) (for example, but not necessarily, S+A or S*A or A^S; the function is always higher if S or A increases, but it need not be clear whether a trade between them is a net increase) is likely either to find a local maximum and refuse to leave it, or to explore too much and fail to converge on the actual maximum within the domain. But alternating between moving towards simple and moving towards accurate seems like a heuristic that is itself in a region of very high simplicity and high accuracy: I thought about it for a couple of minutes and didn’t find a way that was simpler without being much less accurate, nor one that was probably more accurate without being much less simple.
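
    A minimal sketch of the two acceptance rules as written (the scoring deltas are left abstract, and “much more” is pinned to an arbitrary factor k):

      def accept_waking(dA, dS, k=3.0):
          """Waking: never lose accuracy; reject simplicity losses much larger than the accuracy gain."""
          if dA < 0:
              return False
          return not (dS < 0 and -dS > k * dA)

      def accept_sleeping(dA, dS, k=3.0):
          """Sleeping: never lose simplicity; reject accuracy losses much larger than the simplicity gain."""
          if dS < 0:
              return False
          return not (dA < 0 and -dA > k * dS)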

  35. Maxander says:

    The really distressing (or exciting, depending on your headspace) aspect of this paper is that they say, essentially, “we really didn’t do much here aside from scaling up work from about a year ago and cross-applying the same model to different problems; there’s every reason to believe that scaling it up even more will produce even better results, and also there are presumably more tasks out there that even the current model does phenomenally well at.” Also, that a reasonably effective (or, at least, sorta-halfway-effective-ish) world model just emerges naturally from training on a language task means that we’re achieving a large package of “cognitive capabilities” basically for free.

    There’s been the argument for awhile that “there are few or no fundamental advances remaining that are required for human-level AI; someone just has to stick the right mix of mostly-already-known ingredients together and pay a small fortune in compute time”; it just seems substantially more plausible now than it did a week ago.

    • Freddie deBoer says:

      There is no model of the world; there are probabilistic associations between terms that have no connection that is similar to the way human minds understand. This is not human-like AI; it’s just more machine learning. There is no “I” there.

      And the people who made this are making grandiose claims about what it could do? I’m shocked.

      • David Shaffer says:

        What makes you think that most humans have a model of the world? Experience would suggest otherwise.

      • whereamigoing says:

        What is a “model of the world” if not probabilistic associations between terms?

        • deciusbrutus says:

          Intuition about how those terms would interact in novel contexts.

          Whether humans have that is unclear, since ‘novel’ moves its own goalposts.

  36. realitychemist says:

    For some reason I filled in, “I have a little ___” with ‘teapot.’ Because of the rhyme I think, even though it’s only a fuzzy match. I didn’t even realize this was probably unusual until I read the suggestions of ‘sister’ and ‘problem.’ I wonder what that says about me…

    • The Nybbler says:

      I am also on team teapot. Though I do not in fact have a little teapot. I have a kettle, and it is regular stovetop size.

    • deciusbrutus says:

      Teapot also popped for me, although I’m pretty sure it popped up the third time I saw the construction, the first time I think it was ‘problem’ and the second time I identified it as the construction and didn’t fill in the blank.

  37. BlindKungFuMaster says:

    They used 40GB of text to train that thing. What is a reasonable upper limit of how much they can still scale this up?

    If they limited themselves to novels: there are around 200,000 novels published in English per year, and they contain on average maybe 400,000 letters each, which would amount to just 80GB of high-quality text per year. So for an application that is focused on writing novels, they could conceivably scale this up by an order of magnitude, at most two.
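
    The arithmetic, made explicit (assuming one byte per letter):

      novels_per_year = 200_000
      letters_per_novel = 400_000
      bytes_per_year = novels_per_year * letters_per_novel   # one byte per letter
      print(bytes_per_year / 1e9)                            # 80.0, i.e. roughly 80GB per year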

    • jp says:

      They can pretrain on all kinds of texts and finish off on novels. Or maybe not even that. That is the point of the original paper: a lot of specific tasks can be understood as language modelling. I.e., you learn (to predict) language in general, and the rest kind of comes for free.

    • Freddie deBoer says:

      How do you know that scaling up the corpus will scale up the quality to a similar degree?

  38. Freddie deBoer says:

    modeling language necessarily involves modeling the world

    Terry Winograd’s dilemma. One of my favorite posts was about it. Sadly lost when my blog got infected with malware.

    Too many people here seem to be under the misconception that this progress means further progress will scale linearly. Into something, say, coherent.

  39. Daniel_Burfoot says:

    Let me try to clarify some points here from an ML perspective.

    One way of looking at a statistical model is as a mapping from the space of *pure* randomness to the space of data objects. When you build a generative model of text, you can’t just invoke an RNG to give you random chunks of text. Instead, the RNG gives you [0,1) values, and you map these through your Markov model or GPT or whatever to get text chunks. It’s somewhat more tricky, but you can also do this with binary 0/1 values (bits) from the RNG.

    You can also think about the back-transformation from the original data space to the [0,1) space. For example, you can take some real text, back-transform it through your Markov (GPT, etc) model, and obtain the [0,1) values that would generate the text when mapped through the model. If you map back to bits, you have a lossless data compressor.
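
    To make the forward and backward mappings concrete, here is a toy sketch in Python for a unigram word model (the vocabulary and probabilities are made up for illustration; a real model like GPT-2 conditions on context, and turning the back-transform into actual bits requires arithmetic coding, which this skips):

      import bisect, random

      # Toy generative model: unigram word probabilities (made up for illustration).
      probs = {"the": 0.5, "orcs": 0.2, "charged": 0.2, "Gimli": 0.1}
      words = list(probs)

      # Cumulative distribution: each word owns a sub-interval of [0, 1).
      cdf, total = [], 0.0
      for w in words:
          total += probs[w]
          cdf.append(total)
      cdf[-1] = 1.0  # guard against floating-point drift

      def generate(u):
          """Forward map: a uniform [0,1) value -> a word."""
          return words[bisect.bisect_right(cdf, u)]

      def back_transform(word):
          """Backward map: a word -> the [0,1) interval that generates it."""
          i = words.index(word)
          lo = cdf[i - 1] if i > 0 else 0.0
          return lo, cdf[i]

      u = random.random()
      w = generate(u)
      lo, hi = back_transform(w)
      assert lo <= u < hi  # the back-transform recovers where the randomness came from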

    A crucial point is: the better the model/data fit, the *more random* are the [0,1) values or bits. If the model is perfect, the bits are perfectly random. If the model is bad, the bits will have clear patterns or structure (randomness deficiency is the technical term). If you have a candidate model and want to test its quality, you can perform this back-transformation and apply a bunch of statistical tests to the [0,1) values. If any one of the tests reveals a randomness deficiency, that shows an imperfection in the model and also gives you a way to improve it.

    Conversely, you can apply the statistical tests in the original data space (e.g. images/text). To do this, you define an aggregation function, generate a bunch of images or text from the model, and then apply the function to both the real-world data and the sampled data. For example, you could measure the average length of sentences, or the number of times the word ‘Gimli’ appears, or the number of times ‘Gimli’ appears given that ‘Legolas’ appeared in the 10 previous words. For every such measurement, the value obtained from the sampled data should match the value obtained from the real data (up to sampling noise). If not, you have discovered an inadequacy in the model, and a way to improve it.
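
    A sketch of one such data-space test (the two corpora here are tiny stand-ins; in practice you’d use the real training data and a large sample from the model under test):

      import random

      def gimli_after_legolas_rate(text, window=10):
          """Fraction of 'Legolas' occurrences followed by 'Gimli' within `window` words."""
          tokens = text.split()
          hits = opportunities = 0
          for i, tok in enumerate(tokens):
              if tok == "Legolas":
                  opportunities += 1
                  if "Gimli" in tokens[i + 1 : i + 1 + window]:
                      hits += 1
          return hits / opportunities if opportunities else 0.0

      # Stand-ins: real_text would be the actual corpus, sampled_text would be
      # generated by the model being tested.
      real_text = "Legolas and Gimli advanced on the orcs raising their weapons"
      sampled_text = " ".join(random.choice(["Legolas", "Gimli", "orc", "the"]) for _ in range(1000))

      # A large gap between the two statistics flags a specific mismatch between
      # the model's samples and real text, and hence a way to improve the model.
      print(gimli_after_legolas_rate(real_text), gimli_after_legolas_rate(sampled_text))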

    What does this have to do with dreaming? Well, the idea is that dreaming is your brain producing samples from its own internal statistical model of its sensory input. And the brain learns by comparing statistics of the sampled data (dreams) to statistics of the real data (actual images from the eyes). Importantly, dreams aren’t just random gibberish noise – they are randomness mapped through the generative model of the visual cortex, and therefore contain important similarities to real images.

  40. Windward says:

    Don’t mind me, just enjoying the concept of Aragorn as a high school anything teacher; I actually think he’d be pretty decent at it.
    Although perhaps not in the math department.

  41. MawBTS says:

    Is it certain that these texts were machine generated?

    The internet is awash with supposedly AI-created texts that, in fact, were written/edited by humans. The “we can’t release the source code” thing is cause for suspicion.

  42. deciusbrutus says:

    What happens if you prompt it with “Iä! Iä! Cthulhu fhtagn! Ph’nglui mglw’nfah Cthulhu R’lyeh wgah’nagl fhtagn!” ?

  43. Soy Lecithin says:

    But wait, the brain is literally electrical circuits and chemistry, so those aren’t analogies at all. And the brain is literally a network of neurons aka neural net, so that is also not an analogy (it’s an abstraction, rather). And the point of the clockwork analogy is that actual clockwork provides a counterexample to several naive assumptions about objects with complex behaviors. It’s a good analogy that makes a useful and not misleading point about the brain.

    • Nornagest says:

      I find the whole “well, we analogize the brain to technology every time we come up with a new technology, so this analogy must be bogus too” bit mildly annoying, but it bears repeating that most artificial neural nets don’t bear a very close resemblance to natural ones, and that the recent advances behind things like GPT-2, while very impressive, are not directly inspired by natural features.

      The first artificial neural nets, back in the Sixties, were biomimetic in a sense (albeit very, very simplified), so that’s one difference between this and e.g. clockwork, but they’ve evolved on their own since then. And there are ANN architectures that are designed to mimic biological neurons more closely (spiking neural nets, for example), but this isn’t one of them.

  44. sa3 says:

    “We believe this project is the first step in the direction of developing large NLP systems without task-specific training data. That is, we are developing a machine language system in the generative style with no explicit rules for producing text. We hope for future collaborations between computer scientists, linguists, and machine learning researchers.”

    It’s hilarious if the AI really did write that itself. I believe it too; that same paragraph is probably in like 100 papers on arXiv right now. Elements of the last sentence are probably in 1000+. I’ll bet generic academic jargon is some of the easiest text to reproduce (in some ways all the Sokal hoax stuff demonstrates that; I’ll bet this algorithm could get published in a decent journal eventually).

  45. vV_Vv says:

    I’ll bring up a third explanation: maybe this is just what bad prediction machines sound like. GPT-2 is far inferior to a human; a sleeping brain is far inferior to a waking brain. Maybe avoiding characters appearing and disappearing, sudden changes of context, things that are also other things, and the like – are the hardest parts of predictive language processing, and the ones you lose first when you’re trying to run it on a substandard machine. Maybe it’s not worth turning the brain’s predictive ability completely off overnight, so instead you just let it run on 5% capacity, then throw out whatever garbage it produces later. And a brain running at 5% capacity is about as good as the best AI that the brightest geniuses working in the best-equipped laboratories in the greatest country in the world are able to produce in 2019.

    Still doesn’t explain what the function of dreaming is. Given that dreaming occurs intermittently during sleep, and is presumably metabolically expensive, it likely has some function, rather than being just useless noise.

    I lean towards the hypothesis that the function of dreams is the brain training a GAN (click refresh), a Boltzmann machine or something, with a caveat.

    According to dual process theory, mental processes can be roughly divided into two broad classes: “system 1”, consisting of associative, intuitive, heuristic, fast, low-effort processes, and “system 2”, consisting of deliberative, logical, algorithmic, slow, high-effort processes. Paul Christiano calls them the monkey and the machine. System 1 is in charge by default, but system 2 monitors it and occasionally takes over when what system 1 is doing “makes no sense”.

    My hypothesis is that when you dream, or take LSD, or have schizophrenia, your system 2 becomes mostly inactive. Without system 2 to keep it in check, system 1 just feeds on its own predictions, amplifying their noise. The predictions are locally consistent, but lack overall coherence, because system 2 can’t veto them.

    Artificial neural networks are usually quite good at system 1 kinds of tasks, but struggle to replicate system 2. It doesn’t seem to be an issue of training data or computational resources: you can train a neural network to, say, add numbers of up to N digits, but if you then test it on numbers of N+K digits, its accuracy will drop off a cliff even for small K. It doesn’t learn the general rules of addition (there are some more complicated variants of neural networks designed to do better at some of these problems, but they are sort of ad hoc, difficult to train, and of limited generality).
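
    A sketch of the kind of length-generalization test described above (the model here is just a placeholder using Python’s own addition so the harness runs; a real experiment would plug in a network trained only on, say, 5-digit operands):

      import random

      def make_examples(n_digits, count=1000):
          """Random addition problems whose operands have exactly n_digits digits."""
          lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
          return [(random.randint(lo, hi), random.randint(lo, hi)) for _ in range(count)]

      def accuracy(predict, examples):
          return sum(predict(a, b) == a + b for a, b in examples) / len(examples)

      # Placeholder so the harness runs; the claim above is that a trained seq2seq
      # model would score well in-distribution and poorly out-of-distribution.
      def model_predict(a, b):
          return a + b

      in_dist = make_examples(n_digits=5)    # operand lengths seen in training
      out_dist = make_examples(n_digits=7)   # longer than anything seen in training
      print(accuracy(model_predict, in_dist), accuracy(model_predict, out_dist))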

  46. jonathanpaulson says:

    You omitted the (fascinating) first three words of the civil war response: “By Donny Ferguson”

  47. kaimullet says:

    If AI’s so smart, how come every time I ask Google Home to turn on the lights it says “Playing ‘Turn On the Lights’ by Future”? Wait. You don’t think it’s screwing with me on purpose, do you? Damn robots.

  48. Galle says:

    The orcs’ response was a deafening onslaught of claws, claws, and claws

    This is actually legitimately amusing prose and I might steal it at some point.

  49. StevieT says:

    The dreams-as-model-simplification thing is actually pretty straightforward to think about.

    You have a complex, expensive, slow neural net. You want to train a fast, cheap, efficient neural net to do the same job. So you put a bunch of random inputs into the first neural net and train the second one on its outputs.
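
    A minimal sketch of that setup in PyTorch (toy sizes, and the “teacher” here is untrained; in practice it would be the expensive model you already have):

      import torch
      import torch.nn as nn

      # "Complex, expensive" teacher and "fast, cheap" student (toy sizes).
      teacher = nn.Sequential(nn.Linear(32, 256), nn.ReLU(),
                              nn.Linear(256, 256), nn.ReLU(),
                              nn.Linear(256, 10))
      student = nn.Sequential(nn.Linear(32, 16), nn.ReLU(), nn.Linear(16, 10))

      opt = torch.optim.Adam(student.parameters(), lr=1e-3)
      loss_fn = nn.MSELoss()

      for step in range(1000):
          x = torch.randn(64, 32)             # random inputs; no dataset needed
          with torch.no_grad():
              target = teacher(x)             # the big net's outputs are the training signal
          loss = loss_fn(student(x), target)  # train the small net to imitate them
          opt.zero_grad()
          loss.backward()
          opt.step()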