Gwern’s AI-Generated Poetry

Gwern has answered my prayers and taught GPT-2 poetry.

GPT-2 is the language processing system that OpenAI announced a few weeks ago. They are keeping the full version secret, but have released a smaller prototype version. Gwern retrained it on the Gutenberg Poetry Corpus, a 117 MB collection of pre-1923 English poetry, to create a specialized poetry AI.

I previously tested the out-of-the-box version of GPT-2 and couldn’t make it understand rhyme and meter. I assumed this was a fundamental limitation: “obviously something that has never heard sound can’t derive these complex rhythms just from meaningless strings of letters.” I was wrong; it just didn’t have enough training data. Gwern’s retrained version gets both of these right, and more too. For example:

Thou know’st how Menoetiades the swift
Was dragged, of Hector and the fierce compeers
And Phrygian warriors. So, we will dispatch
Your bodies, then, yourselves to burn the ships
In sacrifice; with torches and with bells
To burn them, and with oxen to replace
Your gallant friends for ever. But I wish
That no man living has so long endured
The onset of his foes, as I have power
To burn or storm; for mighty Hector erst
Was slain, and now returns his safe return

This is all perfect iambic pentameter. I know AP English students who can’t write iambic pentameter as competently as this.

(by the way, both “compeers” and “erst” are perfectly cromulent words from the period when people wrote poems like this; both show up in Shelley)

It has more trouble with rhymes – my guess is a lot of the poetry it was trained on was blank verse. But when it decides it should be rhyming, it can keep it up for a little while. From its Elegy Written in a Country Churchyard fanfic:

Methinks I see her in her blissful dreams:
Or, fancy-like, in some mirage she lies,
Majestic yet majestic, and of seems
The image of the unconquerable skies.

Methinks I see her in her blissful dreams:
—Or, fancy-like, in some majestic cell,
Where lordly seraphs strew their balmy dreams
On the still night, or in their golden shell.

There, in the calm of some Platonic dream,
Sits she, and views the unclouded moon arise
Like a fair lady full of realms divine;

And, all at once, a stony face and bright
Glittering in moonlight, like the noon-tints of a night.

COME ON, IT’S A ROBOT. THAT’S BETTER THAN YOU COULD DO IF YOU WERE A ROBOT. GIVE IT A BREAK.

I think its problem is less ability to rhyme, and more a sort of…maintaining the will to rhyme. Gwern prompts it with several Alexander Pope pieces; Pope is a very formal poet who always rhymes in the exact same way. For each Pope piece, GPT-2 does the first couplet in perfect Pope style, then degenerates into increasingly confused gibberish. For example:

Pride even in numbers; wit’s a kind pretence
To something foreign still, but ne’er to sense;
A constant waste of words, the world produces,
A lazy sap, that nothing that’s foreign to expels,
; that’s foreign still unknown to the soul;
A young, like, but in the world in motion,
Obscending not, which smells all our own the worse than our own our own the soul’s, and soul;
Like sense; which is like, but in all our whole.
Which thus far more like, but in all things that’s an excellence; and ne’er unchanged by which is folly’s the worse, they give the worse maintained by which
If sick of sense;
Wholubil, or snug. ills, we know our own our first in sense the worse maintained between the worse, soon expired.

First two lines are perfect rhyme and rhythm, next four have no rhyme but are close to the right meter, next few have almost random length, and by the last one we’ve abandoned grammar and are making up nonsense words like “wholubil”. Every Pope test Gwern runs is like this:

They talk of constancy and faithless love,
A seraph trembles at the specious glove;
Nor in the rich confin’d relief of state,
Find proud pretence, nor in the disdiscoveries of fate.
For when misfortune makes choice remains the conduct’s the prize half known,
Can we secret soul without due, they fear of sense of more known.
Some rise where’ rights, they make it pays due.
.
.
. Sense and judgment, as equal prize seem meanly, the reward the joy, as much possess the prize paid, as well reckon the prize we do not less dare not less keenly wise.
We see;
Of happy lovers ought, as well done, like a friend.
Know they ought, ’tis ev’ is the other joy, as well worth a right;
The joy, as well might, as well may, as well may all is great.
Nor need of joys not as well maysters, as well as well may they give; but as little store; but as well as well as well may shewn, as much, as well we know, as well as well can be sure might prove, as well may well as well as well as well as well may view;
The mind: as well as well as well as well as much the fair as well as well as well as well as well as well as well may

Again, first two lines are great: “a seraph trembles at the specious glove” is both nonsense and exactly the sort of thing Alexander Pope would write. But by the fourth line we have nonsense words, by the fifth we lose the meter, the eighth and ninth are just periods, and finally it starts stuttering helplessly.

I tested this many more times on a public version (not poetry-trained) and found a similar effect – the first two lines are always the best, and it deteriorates from there. I’m interested in hearing from people who understand the model better than I do about why this should be.

Some other highlights:

My heart, why come you here alone?
The wild thing of my heart is grown
To be a thing,
Fairy, and wild, and fair, and whole

That last line, with its ABAB structure, is actually brilliant even by the standards of human poets. “Fairy and wild and fair and whole”. I could say that all day. This has to be a coincidence. It’s not that good anywhere else. But even having something generally okay enough that it can occasionally blunder into something that good is great.

From its Hindu phase:

…which Indra, King of all the Blest,
Had thrown by Rávan’s mighty breast,
The monstrous coil, the brawny hand,
The monstrous mouth, the jaw, the jaw,
The jaw, the jaw and bleeding jaw,
The ungovernable host, the jaw,
And the great bow which never bends,
The arm, the fist, the knee, the ends,
The body laid with mighty stroke,
And the great bow which never bends.
So, when the giants fought, and fell
With murderous strokes, the giant fell,—
So falls the tree with all his trunks
Terrific in its death, that shoots
Wild volley at the mighty trunk,—
So fell the tree with all its boughs
While all the vipers dug and sowed—
So fell the tree with all its boughs.
But Ráma’s heart was sad within
He wept and mourned his captive’s sin

The first two lines are great (and the program must have invented the epithet “King of All The Blest” for Indra, because I can’t find it anywhere else). The description of Ravan’s jaw is, uh, quite memorable. “So falls the tree with all his trunks” is a good deployment of a metaphor which is frankly overused in old poetry. I’m ambivalent between interpreting the repetition as an AI getting stuck in a loop vs. as good use of repetition as a poetic device.

The next few are from this list of a thousand randomly generated samples:

And they have seen the last light fail;
By day they kneel and pray;
But, still they turn and gaze upon
The face of God to-day.

And God is touched and weeps anew
For the lost souls around;
And sorrow turns their pale and blue,
And comfort is not found.

They have not mourned in the world of men,
But their hearts beat fast and sore,
And their eyes are filled with grief again,
And they cease to shed no tear.

And the old men stand at the bridge in tears,
And the old men stand and groan,
And the gaunt grey keepers by the cross
And the spent men hold the crown.

And their eyes are filled with tears,
And their staves are full of woe.
And no light brings them any cheer,
For the Lord of all is dead

And:

There are several kinds of people in America;
There are several kinds of people, I mean their number.
There’s a girl growing up in the house by the light,
There’s a youth upon the road, or a girl somewhere in New York;
There’s a prettier girl, and a man more congenial,

But none of the likes of the likes of the fellows are equal.
There’s one who has never been married and married,
There’s one who don’t want to be treated with kindness;
A fair youth is never employed nor neglected;
There’s one who has never yet come to a neighbor,
And one who resides in New York from the start;

But none of the likes of the likes of the fellows
Are equal to him, and wherever he goes,
The heart somehow breaks under the hand that is steering;
And so it is with me

And this quatrain just worked out really well:

Fair is the Lake, and bright the wood,
With many a flower-full glamour hung:
Fair are the banks; and soft the flood
With golden laughter of our tongue

This one is notable for competent metaphor:

How the clouds
Seem to me birds, birds in God’s garden! I dare not!
The clouds are as a breath, the leaves are flakes of fire,
That clash i’ the wind and lift themselves from higher!

And this one is obviously a failure on one level, but on another level is some kind of great experimental modern political poetry:

The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious

This one displays an interesting combination of world-knowledge and lack-of-world-knowledge:

In the dark the sun doth gleam,
And in the dark the moon doth seem
But now the evening is begun–
Gone is the sun upon the earth!
The silver moon doth like a cup
Of blood-red wine, and as that cup
Is drained of life, doth quench no drop.
What man will drink such wine?
There is no soul of earth or birth
Which man hath never known of earth.
There is no soul who doth not sit
And sing to it, and cry, “Drink!”
There is no soul whose feet are set
On youth’s eternal paradise;
For all is a solemn harmony,
And all is a perpetual chant,
And all the world is a song of God.
There is no soul so wholly free

And here’s another:

There gloom the dark, broad seas. My mariners,
Souls that have toiled, and wrought, and thought with me
That ever with a frolic welcome took
The thunder and the sunshine, and opposed
Free hearts, free foreheads – you and I are old;
Old age hath yet his honour and his toil;
Death closes all: but something ere the end,
Some work of noble note, may yet be done,
Not unbecoming men that strove with Gods.

Except this last time I’m cheating: this is an excerpt of Tennyson’s Ulysses, one of the most famous English poems. I included it as a placebo, ie a test to see whether real poems sound fake if you think they’re by an AI when you read them. I’ll be honest: if I didn’t know this was Great Poetry, I would skim it over and assume it made several mistakes. Like: is “gloom” really a verb? (it is if you’re Alfred, Lord Tennyson). Is the last line grammatical? (yes: it’s an adjective phrase modifying “work”, ie “some work which is fitting for the sort of men who fought gods to do”). Are the mariners’ souls opposing their foreheads? (I’m still confused on this one). These are all the sorts of things that would make me go “Haha, AIs are still pretty dumb” if I were reading it blindly.

If you liked these poems, you might also appreciate Gwern’s work making AI-generated anime waifus.

(and you can also donate to Gwern’s Patreon here)


186 Responses to Gwern’s AI-Generated Poetry

  1. chubbic says:

    Oh, dear God. This poetry actually sounds great. Robots will replace us. 0_0

  2. Exa says:

    Inspired by Gwern’s work, I ran the fine-tuning process on three datasets of my own: a large set of chatlogs from a group I used to be in (90 MB), a large collection of My Little Pony fanfiction (100 MB), and half a gig worth of arXiv high energy physics papers. The chatlogs had a lot of personal information in them, but here’s a ton of unconditional samples for the MLP fanfic and the high energy physics models (I can also provide the models if anyone wants them; they’re just big enough to not be worth hosting if there’s no interest).
    pony fic:
    link text
    high energy physics phenomenology:
    link text

    The hep-ph model does really quite well at generating valid LaTeX, though it’s far from perfectly reliable (as the very first sample shows). And to get it to render, since it’s sampling a random chunk out of the middle of an article (on average), you will need to truncate the sample to cut off opened-but-not-closed or closed-but-not-opened stuff from the end and beginning. And it can’t do figures. And because the samples don’t actually include bibliographies (generally) the citations just render as question marks in brackets/parens. But otherwise it’s pretty good, here’s some rendered examples (manual correction consisted only of the aforementioned truncation):
    link text
    link text
    link text

    The fine-tuning process took about six hours per dataset on one GPU (a 2080, specifically). Since this GPU isn’t all that much faster than Gwern’s (for fp32, at least) I assume that Gwern was just a lot more patient than I was about LR drops and declaring training finished.
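The truncation step described in this comment (cutting off opened-but-not-closed or closed-but-not-opened material at the ends of a sample) could be sketched roughly as below. This is only an illustration, not the commenter's actual procedure: the function name `trim_unbalanced` is hypothetical, it looks at a single delimiter pair (curly braces by default), and it ignores escaped braces such as `\{` and `\begin`/`\end` environments, which a real LaTeX sample would also need handled.

```python
def trim_unbalanced(sample, open_ch="{", close_ch="}"):
    """Return the largest middle slice of `sample` whose delimiters balance.

    Hypothetical helper: everything up to the last unmatched closer is cut
    from the front, and everything from the first never-closed opener is cut
    from the back.
    """
    start = 0
    open_positions = []  # stack of indices of openers not yet closed
    for i, ch in enumerate(sample):
        if ch == open_ch:
            open_positions.append(i)
        elif ch == close_ch:
            if open_positions:
                open_positions.pop()   # closes the most recent opener
            else:
                start = i + 1          # closer with no opener: drop the prefix
    # Any opener still on the stack was never closed; cut at the earliest one.
    end = open_positions[0] if open_positions else len(sample)
    return sample[start:end]


if __name__ == "__main__":
    print(trim_unbalanced(r"mass} is \frac{a}{b} and \sqrt{x"))
```

A stack is used so that nested groups in the middle of the sample survive intact; only the dangling pieces at either end are removed.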

    • Lambert says:

      Now train it on both corpora at once.
      HEMLP

      • Exa says:

        The issue with that is that, since the two corpora are so immediately distinct, almost the first thing it would learn is whether it should currently be generating HEP-PH or ponyfic based on the context. So that would just get a model which can generate either, not a model that does a mixture of both.

    • gwern says:

      Yes, I trained for, cumulatively, multiple days while dropping the LR to see where it would converge. (Would’ve gone faster with multi-GPU support but oh well.) From your samples I suspect you could probably train it further.

      Looking at your MLP samples, you might want to delete some whitespace. Doublespacing sentences is pointless & wasteful. And whole blank lines aren’t necessary given newlines, and some of them have 5 or 10 blank lines in a row, which wastes some of the context window, is unnecessary variation to learn, and makes the samples harder to read. (Also, are those Unicode quotes & ellipses? Maybe you should do an ASCII conversion pass.)
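The cleanup pass suggested in this comment might look something like the following sketch. The function name `clean_corpus` and the exact character map are assumptions made for illustration; the comment only names the goals (no double-spaced sentences, no blank lines, ASCII quotes and ellipses), not a specific implementation.

```python
import re

# Assumed mapping: curly quotes and the ellipsis character to ASCII.
ASCII_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes
    "\u201c": '"', "\u201d": '"',   # double curly quotes
    "\u2026": "...",                # horizontal ellipsis
})


def clean_corpus(text):
    """Normalize whitespace and punctuation before fine-tuning (a sketch)."""
    text = text.translate(ASCII_MAP)
    # Strip trailing spaces/tabs so whitespace-only lines become empty lines.
    text = re.sub(r"[ \t]+\n", "\n", text)
    # Collapse runs of spaces inside a line; leading indentation is untouched
    # because the run must follow a non-space character.
    text = re.sub(r"(?<=\S) {2,}", " ", text)
    # Collapse runs of newlines so no blank lines remain.
    text = re.sub(r"\n{2,}", "\n", text)
    return text


if __name__ == "__main__":
    raw = "\u201cHi\u2026\u201d  she said.\n\n\n\nNext  line.\n"
    print(clean_corpus(raw))
```

Whether to drop blank lines entirely is a judgment call: for prose it saves context window, but for poetry a blank line can mark a stanza break, so this step may be worth making optional.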

  3. Paul Zrimsek says:

    The archives of SSC and its comment section have got to add up to a fairly good-sized training corpus by now. Can we look forward to some AI-generated accusations of bad faith, and even more competing redefinitions of “mistake theory”?

  4. JASSCC says:

    “And Phrygian warriors. So, we will dispatch”

    This is not remotely iambic pentameter.

    For starters, it’s 12 syllables. Also, to make the meter into iambs requires changing the normal pattern of accented syllables in “warriors”.

    • MilfordTrunion says:

      and phry’dzhen war’yers / so we will dis’patch

      seems ok to me

      no less tortured than eliding “-ian” from “Ethiopian”

      • JASSCC says:

        OK, yes, I guess you can drop both long e sounds. Phrygian is a word I never encounter in speech, so I’ve always read it as phri-zhEE-en. It seems a little tortured to rely on two such compressions in one line.

        • James Reed says:

          Both of them are standard elisions, so it’s not much in the way of torture. There’s a set of established, allowed elisions. Mainly, any two vowels can be elided, including w’s and y’s, even between words. So “The ape” can be elided as “Th’ape” (one syllable) or “Carry elephants” as “Care y’elephants” (four syllables, though it would never fit into iambs well). The point is that these elisions aren’t exceptions or tortures, but well established parts of the rules, extending back even to the hexameter verse of Latin and Greek.

  5. Galle says:

    I’m going to be honest, I spent this entire article waiting for the point where you revealed that in fact these were all authentic works of 19th century poetry. I’m not sure whether that’s better or worse than falling for the one genuine excerpt.

  6. achenx says:

    Fair is the Lake, and bright the wood,
    With many a flower-full glamour hung:
    Fair are the banks; and soft the flood
    With golden laughter of our tongue

    Insert this into the middle of the Tolkien fanfic that it wrote last time and it’d fit right in.

  7. theandreinfante says:

    I have a pretty good guess as to why it’s deteriorating over time.

    You see the same thing in the full-sized GPT-2 sample dump they released, although it takes a lot longer for the wheels to come off. Most of the samples start strong and then get steadily weirder and less coherent towards the end. I suspect that’s why the sample length was capped the way it is.

    The issue has to do with feedback loops.

    The way GPT-2 works is that it takes the tokens (sub-word chunks of text) it’s generated so far, and conditions on them to predict a distribution over the next token, then samples that distribution to pick a token. It then adds that token to the text, reconditions, and repeats the process.

    The issue is that the distribution of text that GPT-2 generates is not quite the same as the distribution over actual text. So as it generates more tokens (and gets further away from the real text sample it was initialized with), the number of subtle mistakes it makes increases. So, over time, more and more of the tokens it’s conditioning on are garbage.

    This is an issue because GPT-2 is trained to generate tokens conditional on real text. It’s not trained to generate tokens based on its own output. So when its input turns to garbage, its output gets worse, which makes its input worse, and the GIGO death spiral begins.

    The same issue occurs with supervised learning of self-driving cars. If you only have data of humans driving well, and then use that to train a neural network to guide a car, you get a system that drives okay – for a little while. But once it makes a mistake, and drifts towards the edge of its lane, it suddenly is in a situation it has no training data for. It knows what good driving looks like, but it doesn’t know how to get from bad driving back to good driving. So it behaves more erratically, which puts it into situations even further from its training data (like driving into a ditch) that it is even less capable of recovering from.

    What this looks like in practice is the car making a small, normal mistake, then driving increasingly erratically until it careens off the road. GPT is doing exactly the same thing.
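The generate-and-recondition loop described in this comment can be sketched with a toy model. To be clear about the assumptions: the bigram table below is a stand-in chosen for brevity and is nothing like GPT-2's actual network, which conditions on a long window of earlier text rather than one token; the names `train_bigram` and `generate` are made up for this example. What carries over is the shape of the loop: each sampled token is appended and becomes part of the input for the next step.

```python
import random
from collections import defaultdict


def train_bigram(tokens):
    """Record, for each token, every token that followed it in the real text."""
    table = defaultdict(list)
    for prev, nxt in zip(tokens, tokens[1:]):
        table[prev].append(nxt)
    return table


def generate(table, prompt, n_tokens, rng):
    """Autoregressive loop: condition on output so far, sample, append, repeat."""
    out = list(prompt)
    for _ in range(n_tokens):
        candidates = table.get(out[-1])
        if not candidates:
            # Context never seen in training: the toy model has nothing to go
            # on, so it falls back to sampling from every recorded successor.
            candidates = [t for successors in table.values() for t in successors]
        # The sampled token is fed back in as context for the next step.
        out.append(rng.choice(candidates))
    return out


if __name__ == "__main__":
    corpus = "so fell the tree with all its boughs so falls the tree".split()
    model = train_bigram(corpus)
    print(" ".join(generate(model, ["so"], 12, random.Random(0))))
```

Nothing in the loop distinguishes the real prompt from tokens the model produced itself, which is why an early odd choice keeps influencing everything sampled after it.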

  8. Steve Sailer says:

    I’m fascinated by the idea that AI can’t imitate Pope, that most reasonable of poets:

    Know then thyself, presume not God to scan
    The proper study of Mankind is Man.[8]
    Placed on this isthmus of a middle state,
    A Being darkly wise, and rudely great:
    With too much knowledge for the Sceptic side,
    With too much weakness for the Stoic’s pride,
    He hangs between; in doubt to act, or rest;
    In doubt to deem himself a God, or Beast;
    In doubt his mind or body to prefer;
    Born but to die, and reas’ning but to err;
    Alike in ignorance, his reason such,
    Whether he thinks too little, or too much;
    Chaos of Thought and Passion, all confus’d;
    Still by himself, abus’d or disabus’d;
    Created half to rise and half to fall;
    Great Lord of all things, yet a prey to all,
    Sole judge of truth, in endless error hurl’d;
    The glory, jest and riddle of the world.

  9. rahien.din says:

    What the actual fuck.

    On one hand, this exercise is utterly narcissistic. Imagine training the AI on a set of objects, 3D-printing the results, and examining them in the context of potential uses. To say “This one looks kind of like a hammer – the AI must have meant to build a hammer!” actually denies the AI agency. This isn’t any demonstration of agency or technique on behalf of the AI – it’s stealth solipsism.

    On the other hand, this exercise is essentially claiming “Poetry is no better than a semi-random assemblage of mostly-coherent words.” Bullshit – anyone making such a claim doesn’t know what poetry is. This isn’t a celebration or expression of the art form – it’s rejection of the art form.

    • Murphy says:

      wow. it really pressed your buttons.

      I kinda doubt the AI has agency.

      Any bets on how long before similar AI can produce work that human poetry critics can’t, in a blinded setting, distinguish from poetry written by real humans?

      • Bugmaster says:

        Any bets on how long before similar AI can produce work that human poetry critics can’t, in a blinded setting, distinguish from poetry written by real humans?

        Apparently, the answer is “about negative four days”, since some human readers of this blog (myself included) could not distinguish AI from Tennyson.

        But I kind of do agree with rahien.din. The fact that a relatively simple AI can fool human poetry critics makes me much more inclined to dismiss modern poetry as an art form.

  10. Eliot has an essay on Kipling in which he argues that verse and poetry are different arts, that Kipling was a great writer of verse who occasionally wrote great poetry more or less by accident.

    As Eliot defines it, verse is intended to give all of its effect on a single reading, poetry to have more depth, require more careful analysis. That links to a point that occurred to me here. Part of the reason the AI poetry feels more real than it should is that we don’t expect poetry to entirely make sense on a first reading—part of what the poet is often aiming at is something unusual, an apparent inconsistency that can reveal a deeper consistency.

    Which makes me suspect that if the AI was trained on Kipling’s verse, the result would not look like something written by Kipling.

  11. Le Maistre Chat says:

    That no man living has so long endured
    The onset of his foes, as I have power
    To burn or storm; for mighty Hector erst
    Was slain, and now returns his safe return

    As someone who’s used mythic Greece as a setting for Dungeons & Dragons (and has thus had to deal with the dead being raised), this speaks to me.

    • Murphy says:

      One of my thoughts on this is hooking this up to something like Dwarf Fortress wall-art: feed it a prompt of some recent events and/or the mood of the sculptor, and get contextual poetry for the walls.

  12. Seaweed Shark says:

    Most of the factitious anime waifus betray their origins in what must be a corpus of images from porn.

  13. Eliezer Yudkowsky says:

    Got to “Old age hath yet his honour and his toil” and immediately thought “That’s too good, had better check if it accidentally memorized something.” Though it’s not impossible, indeed likely, that I had also read that specific poem once upon a time.

  14. deciusbrutus says:

    Gwern made a serious error by not naming the anime generator wAIfu.

  15. muskwalker says:

    I loved Emperor Wu’s descent from majesty to rapacity.

    (and the program must have invented the epithet “King of All The Blest” for Indra, because I can’t find it anywhere else)

    The computer is discovering names of God already? This is how apocalypses begin.

  16. megavolt says:

    GPT-2 writes rambling verse. It understands how to write a line but not really a stanza. It understands that stanzas are meant to feel different from one to the next, and it achieves this by giving the lines within a stanza more similar forms. It does not seem to split different concepts into different stanzas and progress through them.

    In the Ulysses sample, we have this line:
    “Free hearts, free foreheads – you and I are old;”
    The poem turns on this line. The line comes off a caesura, shifts into Trochaic meter. It uses the dash for a serious long pause, which allows it to have two sequential stressed syllables. This emphasizes the shift in the topic of the poem.

    That’s very good poetic craftsmanship. GPT-2 hasn’t figured out how to write devices like that… yet

    • James Reed says:

      You’re right about craftsmanship, but the poem doesn’t shift into trochaic meter there. “heads” is the first syllable of the third iambic foot of that line, “and I” is the iambic fourth, “are old” is the fifth iamb. It’s steadily iambic (though some would call the first two feet spondees, and I would courteously disagree).

  17. emiliobumachar says:

    Scott, to what extent is that placebo piece of recognized Great Poetry a random piece of recognized Great Poetry, and to what extent did you cherry-pick it for being particularly wrong-looking despite being recognized Great Poetry?

  18. Steve Sailer says:

    gwern could make a lot of money by training AI to compose Accenture ads, like this one:

    “SynOps ultimately showcases the art of the possible with how clients can now embrace innovation to drive new value–it’s the new applied now.”

    https://pbs.twimg.com/media/D1lHUgWUcAIChee.jpg

    I suspect a robot would be better at Accenture ads than a human.

    • Lambert says:

      I suspect a robot would be better at $TASK than a human.

      Is this going to be the hot new way of insulting people who perform $TASK? 😉

    • Bugmaster says:

      How do you know that ad was written by a human?

  19. ec429 says:

    In the dark the sun doth gleam,
    And in the dark the moon doth seem

    I think it just went full Lewis Carroll.

    Hey Gwern, you should definitely try prompting it with Carroll, also other nonsense poets like Edward Lear. In poetry that’s not supposed to make semantic sense, the ELIZA-effect will be even greater! 😉

    … and now I want a repository of AI-generated nonsense poetry, named “The Hunting Of The SnarXiv”.

    • gwern says:

      Sounds like more work than I want to do. But here’s 100 samples from Carroll’s “Jabberwocky” which was easy enough to do: https://www.gwern.net/docs/ai/2019-03-16-gpt2-poetry-prefix-jabberwocky-100samples.txt

      First sample:


      He found a foxy in the brake,
      A cunning fox of scarlet dye,
      And from that foxy followed make
      The scrawny fox in glee.
      He followed with his dam and horn
      To where the river-water runs,
      And as his living current on
      The river-water likes him up
      A mighty rocky heifer heaves,
      And in a single field, or twain,
      Shows like the yellow corn;
      And when the wind doth blow, so too
      Low in his bottom lies his head,
      And in the grass leaps up again,
      In fearful freedom unbetrayed.

      I love the first 4 lines, including the ‘brake’/’make’ rhyme (‘brake’ being another obscure yet cromulent word here). There are a lot of good other samples too, like “I, the joy-devoted mole.” Or this epic one:


      And while the sword-strokes gushed and ran
      A-gathering like a hurricane,
      The wildest war-cries, loud and shrill,
      Came surging through the hills.
      At times a wild yell rose in the air,
      From steep to steep, from moor to moor,
      A scream of horror and despair,
      Of death approaching neathernock,
      Of dying embers in the pore.
      “Thy shield, my son! thy sabre bright!
      My hands! thy lips! thy lips of flame!
      O hasten! I have done the same.”
      It made the blood tingle and squall
      To hear that wounded warrior’s cry;
      It made the life blood quickly fall
      On the proud swine that fought and died—
      “Say on! say on! thou wounded swine,
      Say on! thou hewer of the sanguine!
      Say on! thou hewer of the sanguine!
      Say on! thou hewer of the sanguine!
      Say on! say on! that pierced youth
      Who crossed the heath in days of yore,
      And when he crossed it, “Say on!”
      It made the blood tingle and squall
      To hear that wounded warrior’s cry;
      It made the life blood quickly quail,
      To hear that death did soon draw nigh.
      It made the blood tingle and squall,
      To hear that death did soon draw nigh.
      “Say on! say on! thou wounded swine,
      Say on! thou hewer of the sanguine!
      Thou art not doomed to fetch a pail,
      ’Tis not the ebbing blood so drain.”
      And then it made the blood tingle,
      To hear that agony within:
      “Thou liest

      :thinking_face:


      24560|“Come to my arms, my youthful boy!
      24560|My elbows are as dry as clay;
      24560|I would that thou wert there, boy,
      24560|And a harlot were also I.”
      24560|“Now try me!” he said outstretchingly;
      24560|“Now try me!” there was no more to say.
      24560|So they twain grew strong, and with their swords
      24560|He smote the maimed youth in the face.

      Here’s one you could turn into a story about the day Death took a day off:


      24560|And when the corpse was levelled out,
      24560|They all forgot to die—
      24560|The men and boys of all the town
      24560|They made a funeral bier;
      24560|For the country-folk, it seems to me,
      24560|And the girls did weep and shake
      24560|When they found the jolly dead of night,
      24560|In the churchyard under the hill.
      24560|And when the jolly coffin came,
      24560|And they all did raise a cheer,
      24560|They found the jolly borogarrest
      24560|In the churchyard under the hill.
      24560|He said, “I’m up in the chimney-corner,
      24560|With the coffin in the shroud,
      24560|And the clargy, clargy, clinker-ware,
      24560|In the churchyard under the hill!”
      24560|And now the jolly borogarra
      24560|Is cleared away from the town,
      24560|The bells ring out their merry pealing,
      24560|And the priest did raise his crown.
      24560|With a joyful clood the country folk
      24560|The sun it shone on high,
      24560|The crock of bells was over the hill,
      24560|And a man he did not die.
      24560|For every Sabbath bell that day
      24560|Came bleating to his ear,
      24560|And every man did shout ‘He’s done!’
      24560|The bells are silent for the tears
      24560|In the churchyard under the hill,
      24560|And no man stirs but the man and the dead,
      24560|In the churchyard under the hill.

      • ec429 says:

        Thank you, this is great.

        It seems to yield some interesting insights into how GPT-2 ‘thinks’. For instance, it has figured out that ‘killing things that begin with J but aren’t real words’ is important:
        If I must kill the Jeiblung Goer

        24560|“Where is the tail of the Juritchhaut?

        24560|“Ha! Ha! Ha! I have slain the Jatmas,”

        24560|“Hast thou slain the Jugglern?” she cried,

        24560|“And canst thou slay the jahabi?
        24560|In such a mood I hate to ride!
        24560|My hands are stained with scarlet cloth,
        24560|My eyes are shut in hide.

        It also occasionally starts to quote the line “One, two! And through and through” but in the middle gets distracted by ‘this is counting, I know this!’:
        24560|Then, one, two, three! And through and through
        24560|The moss he thrust below the knee;
        24560|And one, two, three! And through and through
        24560|The moss he thrust, then thrust it back;


        24560|One, two! Three! And through and through
        24560|The copse that rose frae bank to brier;


        24560|One, two, three, four! On tree top,
        24560|Who saw him before the dawning sun,
        24560|Fled; and two, five, eight!
        24560|“He’s an old top-captain, Sir!
        24560|He’s an old top-captain, Sir!
        24560|And in this wretched world is he
        24560|Only eight and ten.

        This snippet amuses me, I think the priggish/prattle alliteration helps.
        24560|“Now haste thee, little priggish boy!
        24560|Come forth, my babe, and prattle now!”

        And here’s a moment when it really seemed to have got the style down pat:
        24560|He pierced the thread, and struck the line;
        24560|“Where is my rope?” said he.
        24560|And then he made the rope and wound
        24560|The pumpkin round his ear;
        24560|He broke the line, through whistling sound,
        24560|And the moles howl in the drear.

        I’m not sure but I think it managed to make a pun on two meanings of ‘low’ here?
        24560|At daybreak up the hill is brought
        24560|The jay as early as the morn,
        24560|The oxen low beneath is sought,
        24560|And oxen low adown the corn.

        I’ve no idea what this is about but it’s gripping:
        24560|“I am that man!” the trooper said
        24560|“Beware the Thing that lurks within.”
        24560|“The Thing that I must do,” said he;
        24560|“The Thing that I must do.”
        24560|And when the steel shot from behind,
        24560|He turned to see what he could see,
        24560|And heard the hound and spurs behind;
        24560|Heard breathings in the below,
        24560|Sounds heard beneath the roof, the cry
        24560|Of an old man crouched for there--alone!
        24560|It is the strangest thing that’s known
        24560|In all the cumuli.”

        • gwern says:

          I noticed the ‘J’ thing too. My suspicion is that it realizes that ‘Jabberwocky’ is a singular proper noun (there’s apparently only the one in the poem and it’s unique) but then the self-attention isn’t quite good enough to propagate the full name ‘Jabberwocky’ throughout the samples, or possibly the temperature-based sampling is screwing everything up again, and it can only get the ‘J[vowel]’ right before diverging into a more plausible proper name.
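
          To make the temperature point concrete, here is a minimal sketch of temperature-based sampling. The candidate continuations and their scores are invented for illustration (GPT-2 really works on byte-pair tokens, not whole name fragments); the only claim is the general mechanism: scores are divided by the temperature before the softmax, so a higher temperature flattens the distribution and makes the less likely continuations come up more often.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(tokens, logits, temperature, rng):
    """Draw one token according to the temperature-adjusted probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

# Hypothetical scores for what follows an initial "J" (made up, not model output).
tokens = ["abberwock", "ugglern", "atmas", "ahabi"]
logits = [3.0, 1.5, 1.0, 0.5]

cold = softmax_with_temperature(logits, 0.5)  # sharp: top choice dominates
hot = softmax_with_temperature(logits, 1.5)   # flat: alternatives are likelier
rng = random.Random(0)
sample = sample_token(tokens, logits, 1.0, rng)
```

          With these made-up numbers the top candidate gets most of the probability mass at the low temperature and only a little over half at the high one, which is one plausible way a name could start correctly and then wander.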

  20. P. George Stewart says:

    It’s cool and fun that the thing manages to learn rules, but we already knew that computers are good at learning rules.

    It’s yet to be seen whether it can write poetry, which usually develops and unfolds a single thought over the span of several verses. Unless it has a thought, an inspiration, I don’t see how it can write poetry, or mean something, anything.

    • Bugmaster says:

      which usually develops and unfolds a single thought over the span of several verses

      That is something that most modern human poets can’t achieve, either.

  21. Deiseach says:

    Except this last time I’m cheating: this is an excerpt of Tennyson’s Ulysses, one of the most famous English poems. I included it as a placebo, ie a test to see whether real poems sound fake if you think they’re by an AI when you read them.

    Huh, as I was reading it I was thinking “Oh yeah, the AI is clearly copying chunks of Tennyson’s Ulysses here” and then Scott pulls that on me. Either I have a better ear for poetry than I think, or style really shines through 🙂

    I also find it interesting that Pope is what breaks it; on the face of it, you’d imagine “rhyming couplets, rhyming couplets, rhyming couplets” was the kind of mechanical function that could be easily churned out, but seemingly not. Was any Swinburne used, because I’d love to see what an AI trained on Swinburne would produce after being exposed to this!

    Excerpt:

    No love sees loftier and fairer the form of its godlike vision in dreams
    Than the world shone then, when the sky and the sea were as love for a breath’s length seems—
    One utterly, mingled and mastering and mastered and laughing with love that subsides
    As the glad mad night sank panting and satiate with storm, and released the tides.
    In the dense mid channel the steam-souled ship hung hovering, assailed and withheld
    As a soul born royal, if life or if death be against it, is thwarted and quelled.
    As the glories of myriads of glowworms in lustrous grass on a boundless lawn
    Were the glories of flames phosphoric that made of the water a light like dawn.
    A thousand Phosphors, a thousand Hespers, awoke in the churning sea,
    And the swift soft hiss of them living and dying was clear as a tune could be;
    As a tune that is played by the fingers of death on the keys of life or of sleep,
    Audible alway alive in the storm, too fleet for a dream to keep:
    Too fleet, too sweet for a dream to recover and thought to remember awake:
    Light subtler and swifter than lightning, that whispers and laughs in the live storm’s wake,
    In the wild bright wake of the storm, in the dense loud heart of the labouring hour,
    A harvest of stars by the storm’s hand reaped, each fair as a star-shaped flower.

  22. Lambert says:

    The fact that the text suddenly veers off course after a couple of lines doesn’t surprise me.

    GPT-2 is not trained to generate texts, but simply predict the next word of a text sample.

    In the beginning, this text is a load of Pope. Then, it’s a load of Pope, followed by a bit of pseudopope. It is already beginning to veer off course, so the text it produces will be even less Popey (Alexander Papal?).
    Before long, it’s trying to predict what comes next after some Pope, then some text that rapidly diverges from Pope. It’s not surprising that it should become exponentially less accurate. Especially since the ratio of prompt to generated text is constantly decreasing, as well.

    Side thought: Can we now build a discriminator that is fed a load of prompt + generated text and has to try to find where it switched from human to machine? And then build a GAN around it?

    • gwern says:

      Side thought: Can we now build a discriminator that is fed a load of prompt + generated text and has to try to find where it switched from human to machine? And then build a GAN around it?

      Christiano’s preference learning is similar but better. 🙂

      • Lambert says:

        Is that the ‘Optometrist’s Method’?
        i.e. the machine asks the human ‘Is A or B better?’.

        Heard about that being used to improve performance of stellarator fusion reactors, where the exact metrics are hard to define.

        ADDENDUM:
        Sounds like a good route to What You See Is What I Mean functionality.

        • gwern says:

          Yes. And yes, there was a paper doing that (using MCMC, oddly enough, which is not the usual choice for hyperparameter optimization, to say the least).

    • vV_Vv says:

      Side thought: Can we now build a discriminator that is fed a load of prompt + generated text and has to try to find where it switched from human to machine? And then build a GAN around it?

      GANs for text mostly don’t work.

      • gwern says:

        You would MLE pretrain on a poetry corpus, of course, and RL finetuning certainly does work elsewhere.

        • vV_Vv says:

          You need a good performance metric to fine tune towards, though.

          • gwern says:

            Which is precisely what the pairwise Discriminator/critic is learning based on human ratings, yes, that’s the point of Christiano’s approach.

          • vV_Vv says:

            I’m not familiar with Christiano’s approach, is it some kind of inverse reinforcement learning approach where you train a neural network to learn a reward function from a user?

            I’m not convinced that anything that has a human in the loop during training is really feasible, unless the ML algorithm is able to do few-shot learning much better than current neural network-based methods can.

          • gwern says:

            Well, then, read the paper. It works for the video game environments they tried it on.

          • vV_Vv says:

            You mean this one? Seems interesting, though the ablation results look quite confusing. I wonder how well it would generalize to other domains.

  23. MostlyCredibleHulk says:

    I’ve read “modern poetry” which looks to me as incoherent (and worse) as the examples here. Published in books. As far as I understand, not even self-published. One of the poems I’ve seen is exactly like the Emperor Wu example, only shorter. Not sure what it means, frankly. How much pareidolia is in our understanding of art? I usually react pretty well to “classic” art and am rather confused by “modern art” (not without exceptions), and AI productions seem to be more “modern art”-y. But I think the more abstract the art is, the smaller the distinction – in music, I’d probably fail to distinguish between AI and human, while I think if somebody tried to use AI to write a novel, I’d easily distinguish it from a classical novel (though maybe not from the “modern art” ones).

    • Incandenza says:

      I guess one distinction would be that modernist poetry (or art) is responding to a tradition – when Stein writes “a rose is a rose is a rose” it’s against a literary background, which is what makes her experimentalism meaningful. (It’s also why it’s not all that interesting as poetry a hundred years later; once the walls of tradition have been blasted down it’s no longer compelling.) The AI, by contrast, is just messing up.

      I guess when we get ASI poets we’ll know because they’ll both innovate on the poetic tradition and also get their like-minded avant-garde ASI buddies to write convincing screeds about why the innovations aren’t just gibberish.

    • zzzzort says:

      I think several strains of modern art have moved towards being more semiotically dense. This means that there is less internal consistency a reader/viewer can check to distinguish meaningful content vs gibberish, but also that the efficiency of conveyed meaning should be higher. The maximally compressed version of a poem (or anything) would sound like white noise, and would be very easy to fake convincingly.

      • Garrett says:

        efficiency of conveyed meaning should be higher

        Would you be able to cite a few short examples? I’d love to get a feeling for this.

      • MostlyCredibleHulk says:

        I imagine if we consider poetry as some kind of code, which should be understood by those holding the key (no idea what it is, but let’s go along with it for a while), then crypto theory tells us the good code is indistinguishable from noise to non-key-holder. Maybe poets that don’t want to be confused with AIs start to employ more sophisticated crypto tools – I wonder if AI could convincingly fake a complex poetical form that requires certain structure and interdependencies between parts, or even something as simple as acrostic (without being programmed upfront for what’s going on of course) – i.e. if a certain poet is known to produce only acrostics, would such an algorithm, trained on texts of said poet but without specifically programming it in, be able to discover it.

  24. benf says:

    An AI that can generate poetry is, honestly, not that impressive. An AI that can tell me whether a piece of poetry is a great work of art or total gibberish…that’s the only AI worthy of the name.

  25. Betty Cook says:

    My reaction to your quote from Ulysses was:

    1. That makes too much sense, continuity of meaning across the first two lines…
    2. It must have been trained on Ulysses, it’s quoting a line entire (“Old age hath yet….)
    3. Hey, wait a minute!

    • lycotic says:

      Yeah that’s pretty much how it went for me too, though I went through a phase of “Gwern seems to have let it run on too small a corpus for too long, because it’s borrowing really heavily from Tennyson now.”

  26. cmndrkeen says:

    I found Emperor Wu to be oddly moving, even though I am perfectly aware that loops like that are a common failure mode for text-generating AIs.

    • Michael Watts says:

      You can find a message in the total lack of a message sometimes. This youtube video combines the song “I didn’t ask, and she didn’t say” with dashcam footage of a car driving down a freeway, during which absolutely nothing interesting happens. But I think the video concept meshes well with the theme of the song.

  27. zomoskeptical says:

    Stanislaw Lem had a really fantastic short story in The Cyberiad about an “electric poet” made by one of the main characters. The computer was in fact much better than the humans:

    “Have it compose a poem — a poem about a haircut! But lofty, noble, tragic, timeless, full of love, treachery, retribution, quiet heroism in the face of certain doom! Six lines, cleverly rhymed, and every word beginning with the letter S!!”

    “And why not throw in a full exposition of the general theory of nonlinear automata while you’re at it?” growled Trurl. “You can’t give it such idiotic — ”

    But he didn’t finish. A melodious voice filled the hall with the following:

    “Seduced, shaggy Samson snored.
    She scissored short. Sorely shorn,
    Soon shackled slave, Samson sighed,
    Silently scheming,
    Sightlessly seeking
    Some savage, spectacular suicide.

    The whole book is worth a read, but I found a fun post talking about the challenges of translating it: https://medium.com/@mwichary/seduced-shaggy-samson-snored-725b5a8086d9

  28. atticade says:

    I believe Indra has been called “King of all the blest” in Sanskrit, so really this is another example of the “translation” feature discussed in the last post (below) rather than an entirely new coinage.

    Still, I have to doubt this theory; the only Sanskrit word I can think it likely to encounter naturally is “nirvana”, and for the French test GPT-2 was provided with sample English-French sentences.

    We test whether GPT-2 has begun to learn how to translate from one language to another. In order to help it infer that this is the desired task, we condition the language model on a context of example pairs of the format ENGLISH SENTENCE = FRENCH SENTENCE and then after a final prompt of ENGLISH SENTENCE = we sample from the model with greedy decoding and use the first generated sentence as the translation. On the WMT-14 English-French test set, GPT-2 gets 5 BLEU, which is slightly worse than a word-by-word substitution with a bilingual lexicon inferred in previous work on unsupervised word translation (Conneau et al., 2017b). On the WMT-14 French-English test set, GPT-2 is able to leverage its very strong English language model to perform significantly better, achieving 11.5 BLEU. This outperforms several unsupervised machine translation baselines from (Artetxe et al., 2017) and (Lample et al., 2017) but is still much worse than the 33.5 BLEU of the current best unsupervised machine translation approach (Artetxe et al., 2019). Performance on this task was surprising to us, since we deliberately removed non-English webpages from WebText as a filtering step.
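
    For a feel of what that conditioning looks like, here is a rough sketch of the prompt format the quote describes. The example sentence pairs are my own, and the helper names are hypothetical; the "first generated sentence" rule is approximated here by cutting the continuation at the first line break, which is an assumption rather than something the quote specifies.

```python
def build_translation_prompt(example_pairs, query):
    """Lay out 'ENGLISH = FRENCH' example lines, then a final 'ENGLISH =' to complete."""
    lines = [f"{en} = {fr}" for en, fr in example_pairs]
    lines.append(f"{query} =")
    return "\n".join(lines)

def first_generated_line(continuation):
    """Keep only the text up to the first line break (assumed stand-in for 'first sentence')."""
    return continuation.strip().split("\n")[0].strip()

# Invented example pairs, not taken from the paper or from WMT-14.
pairs = [
    ("The cat sleeps.", "Le chat dort."),
    ("I like tea.", "J'aime le thé."),
]
prompt = build_translation_prompt(pairs, "Good morning.")

# A made-up continuation standing in for whatever the model would return.
fake_continuation = " Bonjour.\nThe dog runs. = Le chien court."
translation = first_generated_line(fake_continuation)
```

    The resulting prompt is just three lines of text ending in "Good morning. =", and the model is left to continue it.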

    In other words: GPT-2 is very bad at translating French into English. But the researchers were surprised to see it could do this at all, since they didn’t design it as translation software, didn’t ask it to learn translation, and didn’t show it any material in French. It seems to have picked up this ability from noticing a few naturally-occurring examples of French in English text:

    • gwern says:

      I would be surprised if that is what happened. GPT-2 was seeded from Reddit posts. So you would need a Reddit post to have linked a page calling Indra that in Sanskrit, for it to get into the corpus (which was only ~40GB and has to include all the other stuff), for this non-English text to be memorized despite being extremely rare, for it to learn the English equivalent (how?), for the knowledge to survive several days of training on a corpus which features near-zero or zero Sanskrit and no appearances of that phrase in English, and to be given such high likelihood that dumb sampling can sample it in exactly the right context. At that point, I’d say I’d think it’s more likely it’s just a parallel invention following a standard snowclone ‘King of X’ and positive words like ‘blest’, and there happen to be so many epithets for so many gods that some just happen to be hits.

    • Deiseach says:

      The description of Ravan’s jaw is, uh, quite memorable.

      Well, Ravan was ten-headed, so he would have all those jaws to match.

      I think that probably comes (“the bleeding jaw”) from the story of when Indra injured the child Hanuman in the face (a kind of folk etymology of Hanuman – “One interpretation of the term is that it means “one having a jaw (hanu) that is prominent (mant)”. This version is supported by a Puranic legend wherein baby Hanuman mistakes the sun for a fruit, attempts to heroically reach it, is wounded and gets a disfigured jaw” according to Wikipedia). It’s one famous episode involving Indra, and the translated texts used to train the AI must surely have mentioned it, hence the association between “Indra” and “jaw” in the mind (as it were) of the AI.

      Just be glad those sources don’t seem to have included this episode about Indra (given that it’s 19th and early 20th century poetry, translations did tend to be rather censorious).

  29. Snailprincess says:

    This is fascinating. And I honestly think the second to last poem isn’t bad.

    I’m really curious to see this AI try its hand at making recipes. There’s a lot out there to train it on, a lot of which even have review scores that could be used.

  30. DragonMilk says:

    Ok, the next challenge is to have an AI preacher.

    Have the learning base be as many sermons as possible, I may try to get some friends to put the bible + sermons as a learning set (I’m too lazy)

  31. Kindly says:

    This is only possible to do for AI because it has a large corpus to draw on: lots of humans have written poetry and developed ideas (such as “meter” and “parallel structure”) which it can learn.

    But that’s not a distinguishing feature of AI: humans that write poetry do so by reading poetry that previous humans have written and learning the ideas developed there.

    What prevents AI from running away and stealing the poetry show is that it’s missing the evaluation step: it might write occasionally-great poetry, but has no way of recognizing the great bits to try to do them more often.

    So the automation of writing poetry, which only talented humans can do, can only produce more of the same results, maybe with the occasional brilliancy. It’s the automation of reading poetry, which any human can do, that would let the AI learn from its previous experiments, train itself on the best of its previous output, and surpass all human poets in the fashion of AlphaGo Zero.

    • gwern says:

      Yes, learning the reward function is potentially a major step. I’ve been convinced for a long time that the key to good text/music generation may be ‘preference learning’: https://arxiv.org/abs/1706.03741 As long as humans can keep recognizing pleasing poetry (even if they cannot write it themselves!) Christiano’s preference learning approach may be able to bootstrap to much better quality with end-to-end learning which can surpass the original corpus.
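
      A toy sketch of the reward-learning step, to show the shape of the idea: fit a scoring function from pairwise "this one is better" judgments, using a logistic (Bradley-Terry style) model of the probability that one item is preferred over another. This is not the paper's implementation, which trains a neural reward model on clips of agent behavior and then runs RL against it; the linear model, the three poem "features", and the comparisons below are all invented for illustration.

```python
import math

def reward(w, x):
    """Linear stand-in for a learned reward model."""
    return sum(wi * xi for wi, xi in zip(w, x))

def prob_prefers_first(w, a, b):
    """Modelled chance that a rater prefers a over b: sigmoid of the reward gap."""
    return 1.0 / (1.0 + math.exp(-(reward(w, a) - reward(w, b))))

def fit_reward(comparisons, dim, lr=0.5, epochs=200):
    """Gradient ascent on the log-likelihood of the observed (winner, loser) pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for winner, loser in comparisons:
            p = prob_prefers_first(w, winner, loser)
            # d/dw log p = (1 - p) * (winner - loser)
            for i in range(dim):
                w[i] += lr * (1.0 - p) * (winner[i] - loser[i])
    return w

# Hypothetical features per poem: [keeps_meter, rhymes, gets_stuck_repeating]
comparisons = [
    ([1, 1, 0], [0, 1, 0]),  # rater preferred the one that keeps meter
    ([1, 1, 0], [1, 0, 0]),  # rater preferred the one that rhymes
    ([1, 0, 0], [1, 0, 1]),  # rater disliked the repetition loop
    ([0, 1, 0], [0, 0, 1]),
]
w = fit_reward(comparisons, dim=3)
confidence = prob_prefers_first(w, [1, 1, 0], [0, 0, 1])
```

      The fitted weights come out positive for meter and rhyme and negative for repetition, so the scorer can then rank poems nobody has rated yet; that ranking is what a generator would be finetuned against.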

      • eggsyntax says:

        That’s really interesting, thanks for linking that. I’ve been thinking about pairing a generative music system with real-time preference input from affect recognition on the user’s face via the webcam, and this seems like an incredibly useful tool for that. I’ve played with doing something similar using human preference and an evolutionary algorithm, but as Christiano et al point out, the necessary amount of human feedback is prohibitively large.

    • kaathewise says:

      I think what you’re looking for is adversarial networks: you basically train two networks at once: one learns to simulate poems and another learns to distinguish simulated (supposedly bad) poems from real ones (supposedly good).

      Another thought is that since the corpus of valid poems with structure is much larger than the corpus of good poems, one can pre-train the model for structure, and then train for quality.

  32. Elementaldex says:

    I’m curious what you would get if you generated 100,000 poems with Gwern’s trained GPT-2 and fed those as the training data to another GPT-2. Would you get the same kind of output or would adding a layer result in a different kind (or quality) of output?

    • Kindly says:

      Probably more of the same, but less good. I picture the training data as a large set of points in poemspace, and poem generation from that data as producing fuzzy averages of those points. Averaging averages only produces more averages.

      The randomness means that training on the generated poems would tend to produce results that wander further from the originals in random directions, but most directions you could wander in aren’t very good and look more like noise. (To give a simple example: if AI is prone to stupid repetition, the generated poems would have lots of stupid repetition, which would teach the second-generation output to repeat itself even more.)

  33. Incandenza says:

    The real advance in AI language processing will happen when programmers realize that language is not a self-contained system, but that its function is to respond to and modify embodied situations. (I hope that comment is not too gnomic; I’m not sure how to elaborate on it without writing 10,000 words.)

  34. Trunk says:

    Re: foreheads I think Tennyson was saying that during their adventures, Ulysses’ sailors had to choose between what their hearts and their heads were telling them to do, using “oppose” in the sense of placing two things against each other. Which seems true to the story.

  35. Trunk says:

    If you told me

    How the clouds
    Seem to me birds, birds in God’s garden! I dare not!
    The clouds are as a breath, the leaves are flakes of fire,
    That clash i’ the wind and lift themselves from higher!

    was from a newly discovered Gerard Manley Hopkins poem, I might believe you.

  36. Lambert says:

    Random thought:
    What happens if you feed the corpus and prompt in backwards, then reverse the output?

    Language isn’t really a linear thing, it’s a tree.
    I wonder whether you can separate the syntax from the semantics. Parse the corpus into a tree, then train the algorithm to output trees, then re-assemble it into sentences.

  37. vV_Vv says:

    Again, first two lines are great – “a seraph trembles at the specious glove” is both nonsense and exactly the sort of thing Alexander Pope would write, but by the fourth line we have nonsense words, by the fifth we lose the meter, the eighth and ninth are just periods, and finally it starts stuttering helplessly.

    I tested this many more times on a public version (not poetry-trained) and found a similar effect – the first two lines are always the best, and it deteriorates from there. I’m interested in hearing from people who understand the model better than I do about why this should be.

    It’s called “exposure bias”. I think it was first discussed here. During training the model only sees text written by humans, it never sees text generated by itself, while when you execute the model to generate something, the model sees a prefix of a text that it has generated itself (after the optional prompt, which is human-generated). Because the self-sampled text is statistically different than human-generated text, the model generalizes less well when it computes the probabilities for the next word, and at each new sampled word this effect accumulates, causing it to eventually generate gibberish.

    Researchers tried lots of things to get around this issue, but none (including the one described in the paper I linked) really seems to work well so far.

    • Is this a memory issue?

      • vV_Vv says:

        No. The lack of long-distance coherence (as seen here in the Lord of the Rings-themed story) is arguably a memory issue.

        The issue here is that the neural network has to generalize to examples that it has not seen during training. A neural network is a function approximator: given an input x (a sequence of words, in this case), predict an output y (the next word), in a way that approximately interpolates the training set (a set of (x,y) pairs). How does the network compute a prediction of an input that it has never seen during training? It depends on its inductive bias (the prior, roughly speaking, although neural networks aren’t strictly Bayesian).

        Usually when the input x_0 is a human-written prompt, even if it never appeared in the training set it will be statistically similar enough to the training examples that you can assume that they come from the same probability distribution. This is called in-distribution generalization, and well-trained neural networks do well at it. The system will compute a next word probability P(y|x_0) similar to the probability distribution that you would get if you asked a human to continue the prompt, but since the model is not perfect there will be some slight difference.

        The system then samples a next word y_0 from P(y_0|x_0) and concatenates it with the prompt to form a new input x_1 = [x_0 y_0]. The problem is that x_1 now comes from a probability distribution different from the training one, because its last word was sampled from an approximation of the human-written next word distribution. This is now an out-of-distribution generalization problem, and neural networks don’t do so well at it, thus when you ask the neural network to compute P(y_1|x_1), the quality of approximation with respect to how a human would continue the sentence will be lower compared to when it had computed P(y_0|x_0). Therefore x_2 = [x_1 y_1] will be even more statistically different from the training distribution.

        This issue compounds with each sampled word. Initially it’s a subtle statistical difference that you can hardly notice, but eventually it will result in the system generating nonsense or getting stuck into loops.
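
        The loop described above can be sketched with a toy model. This is a bigram counter, nothing like GPT-2, and the four-line corpus is made up; it only shows the mechanics: the context starts as human-written x_0, every sampled word is appended to it, and the model soon finds itself conditioning mostly on its own output (including contexts it never saw in training, where this toy falls back to a uniform guess).

```python
import random
from collections import Counter, defaultdict

def train_bigram(corpus_lines):
    """Count which word follows which in the (human-written) training lines."""
    counts = defaultdict(Counter)
    for line in corpus_lines:
        words = line.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, context, vocab):
    """P(y | x): this toy model only looks at the last word of the context."""
    seen = counts.get(context[-1])
    if not seen:
        # Context never seen in training: nothing to go on, so guess uniformly.
        return {w: 1.0 / len(vocab) for w in vocab}
    total = sum(seen.values())
    return {w: c / total for w, c in seen.items()}

def generate(counts, vocab, prompt, n_words, rng):
    context = list(prompt)  # x_0: entirely human-written
    for _ in range(n_words):
        probs = next_word_probs(counts, context, vocab)
        words = list(probs)
        y = rng.choices(words, weights=[probs[w] for w in words], k=1)[0]
        context.append(y)   # x_{t+1} = [x_t y_t]: now partly self-generated
    return context

corpus = [
    "in the churchyard under the hill",
    "the bells ring out under the hill",
    "the sun it shone on high",
    "the bells are silent in the town",
]
vocab = sorted({w for line in corpus for w in line.split()})
counts = train_bigram(corpus)
prompt = ["the", "bells"]
out = generate(counts, vocab, prompt, 8, random.Random(1))
human_fraction = len(prompt) / len(out)  # share of the final context a human wrote
```

        After eight sampled words only a fifth of the context is human-written, and every further prediction is conditioned on text the model itself produced.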

        You can think of this in terms of chaos theory: the system is a (stochastic) chaotic attractor in the high dimensional phase space of all possible strings, and the English texts are trajectories in this high dimensional phase space confined to a manifold of much lower intrinsic dimension. If you start the chaotic attractor at a point on an English text trajectory, it will approximately follow it, or some plausible continuation of it near the English text manifold, for some time, but eventually it will diverge and end up very far.

        Note that this is quite different from how humans behave: if I gave you a prompt and asked you to continue it, you wouldn’t end up writing gibberish. Even if the prompt was weird and possibly ungrammatical English (e.g. the Ulysses), you would be able to recover and continue it with plausible and fluent English rather than diverge.

      • Ah, I see. That sounds really difficult to solve.

  38. Nietzsche says:

    “Emperor Wu” cries out to be set to music by Philip Glass.

  39. Nicholas Weininger says:

    The degeneration examples are like a capsule history of the evolution of poetry from Pope to Gertrude Stein. “Can you tell real Gertrude Stein from the AI generated version” would have been a much harder game.

  40. disposablecat says:

    Now someone needs to teach it to do heavy metal lyrics. I’m thinking a mix of Iron Maiden, Ghost, Iced Earth, Metallica, Black Sabbath, and all the other really bombastic and grandiose metal bands could get it in the “headspace”, as it were, to generate stuff that fits in with the world-feel of songs like Hallowed Be Thy Name, One, Ghost of the Navigator, War Pigs, Witch Image, Burnt Offering… stuff that exists on the same thematic plane, you know? Like there’s a very recognizable lyrical feel to songs like that, in the same way that there is for 19th century poetry. There’s a shared library of concepts and referents in use, and GPT2 seems to be good at isolating and recombining those libraries.

    It would probably do better at prog rock, though, since the lyrics there are frequently a) incoherent b) weirdly and pretentiously metaphorical c) totally secondary, such that you could probably do a prog album entirely with GPT2 lyrics and release it under a fake songwriter credit and no one would guess (especially if you went for the feel of something out in experimental land like Lateralus, which contains a lot of evocative metaphor in isolation with almost no overt meaning). Shit, it would probably be better than Dream Theater’s second to last album (The Astonishing). Actually, if that came out this year instead of in 2016, “it was mostly AI generated” would have explained a lot about how generic and derivative the characters, lyrics, and really the whole concept felt.

    • disposablecat says:

      Oh hey, there’s even a dataset for this already: https://github.com/JarbasAl/metal_dataset

      Shame that the “power metal” subset doesn’t have its own lyrics collection, because that’s kind of the subgenre I was thinking of.

    • eric23 says:

      I think it would work well for nearly all popular music.

      • Lambert says:

        Yeah. I think a song lyrics website would be a far better, if less erudite corpus than poetry.
        Though it wouldn’t hurt to throw both in.

        Poetry is trying to be the most complex form of verse, while simultaneously disappearing up its own arse within the last century.
        Song lyrics afford it a slightly simpler way to figure out rhyme and metre.
        Though it might get stuck in loops more often.

  41. NLeseul says:

    The thing’s weakness still seems to be keeping track of what it’s talking about for more than a few dozen words.

    The challenge I’ve been thinking about for it is whether it can reliably produce C programs (or whatever language) of a non-trivial length that actually compile. I know I’ve seen elsewhere that someone had it generating C code based on the Linux kernel that looked vaguely like valid code in a dreamworld sense, but generally didn’t compile.

    What I expect to be challenging for the algorithm is recognizing that things like variable and function declarations need to precede their use, and keeping track of what declarations exist in the current scope, just like it has a hard time remembering that Gimli is a dwarf. Good thing is, you can’t trick a compiler into thinking you know what you’re doing with dreamy BS; either your symbols are in the symbol table, or they’re not, so it’s easy to objectively measure how well it’s doing at the goal.

    When I can see it doing this, I might consider getting vaguely excited about its idea to derive actual knowledge from statistical relationships.

  42. Ben Wōden says:

    “And this one is obviously a failure on one level, but on another level is some kind of great experimental modern political poetry”

    That one actually would be pretty familiar to anyone who had spent much time in the Roman Senate. Repeating the same simple adulation tens of times over was pretty standard fare.

  43. Red-s says:

    I really enjoyed The Emperor Wu.

    • Deiseach says:

      So did I! And the suggestion elsewhere that it should be set to music by Philip Glass is very apposite as well. I think the fact that we can make sense out of the generated text which is not really meaningful demonstrates that humans really do have very strong pattern-matching going on. It would be easy to create an explanation for how the Emperor Wu poem works not alone as poetry but as meaningful, and I can finally understand why the non-artsy types complain that the humanities are just a con job spinning crap (instead of gold) out of straw 🙂

      And now I have strong suspicions that the Emperor Wu poem is heavily influenced somewhere in there by Pound.

  44. The hearts and the foreheads are being opposed to the thunder and the sunshine.

    The AI generated poetry is better than I would expect, but it isn’t comparable to Ulysses.

  45. Joe says:

    Scott, please ask Gwern to upload the book of Psalms and Proverbs. Also would love to see what the AI can produce after reading all your blog posts.

    • John Schilling says:

      Later that night Gwern noticed that, overhead, without any fuss, the stars were going out.

      • Joe says:

        😃

        1 Come Lord, and tarry not;
        Bring the long-looked-for day;
        O why these years of waiting here,
        These ages of delay?

        2 Come, for Thy saints still wait;
        Daily ascends their sigh:
        The Spirit and the Bride say, “Come”:
        Dost Thou not hear the cry?

        3 Come, for creation groans,
        Impatient of Thy stay,
        Worn out with these long years of ill,
        These ages of delay.

        4 Come, and make all things new;
        Build up this ruined earth;
        Restore our faded Paradise,
        Creation’s second birth.

        5 Come, and bring Thy reign
        Of everlasting peace;
        Come, take the kingdom to Thyself,
        Great King of Righteousness.

        Amen

      • Jaskologist says:

        The monks were wrong. It turns out God was just angry about being doxed.

  46. heXdot says:

    This is the most “poetic” thing I got out of the small GPT-2. First sentence is the original prompt. I took most of the first reply and then used that as a prompt again.

    “I am sorry for overdoing it. It is my doing to teach you how to be loyal to the values that we share with each other. Due to my decisions I have lost all of my career as a farmer. I turned 90 more than half a century ago. Like a lost generation of passionate farmers. We all missed that point – long before we were strong, in the very fire to which friendship and nurturing of your life is vital to you.”

    • Elementaldex says:

      Reading that made my head feel fuzzy as my brain tried to tie things together and kept being disrupted by the next word being somewhere between slightly and very incongruent.

      • eric23 says:

        And that’s the basic problem with GPT-2. It mixes and matches things that make sense, resulting in something that doesn’t quite make sense. When we read the result, we mistake its “almost makes sense” for “makes sense in a way I have yet to figure out, so I should try harder to understand it”. So we make an effort to understand it. But that effort is always in vain.

        Basically, it’s a written Rorschach test.

        • HeelBearCub says:

          Basically, it’s a written Rorschach test.

          Well put.

        • Picador says:

          I’d draw a different comparison: the text reminds me of listening to a schizophrenic. Their stories are often coherent and even intriguing on the sentence level, but you quickly realize that there’s no larger structure guiding the narrative as it slides into surreality.

  47. Silverlock says:

    COME ON, IT’S A ROBOT. THAT’S BETTER THAN YOU COULD DO IF YOU WERE A ROBOT. GIVE IT A BREAK.

    It is better than I can do as a human. I quite liked it.

    • Rm says:

      Yes, “majestic yet majestic” is brilliant.

      Now I think that the AI is lucky it’s working in English. Even German would be tougher, not to mention Slavic languages (although I can already imagine it writing a passable Derzhavin).

      • Bugmaster says:

        “Кот, каверзник коварный, кибэротоман,
        К королеве кафров крадется Киприан.
        Как клавесина клавишей, корсажа касается.
        Красотка к кавалеру, конфузясь, кидается…
        …Казнится краля, киснет: канул Купидон,
        К кузине королевы крадется киберон!”

        As always, Stanislaw Lem was way ahead of everyone 🙂

  48. summerstay says:

    If you are interested in using this for practical purposes, you can take advantage of the fact that “the first two lines are always the best, and it deteriorates from there”, by simply rerunning the program from the point just before where it starts to go wrong. See a nonsense line, or a made-up word? Just try again from the previous line, using as a prompt what has come before (maybe not the whole thing, to induce a little variation). This also includes just enough interaction to feel like you’re helping, without having to do any real mental work: cooking with a cake mix, as it were.

    Here is a continuation of Borges’ “Chinese Encyclopedia” that I did this with (using the public version of gpt-2). Notice that it has sort of, but not quite, learned to count.

    14. Those that keep sunlight on their heads
    15. Those that pigtails must Heed their deliberations
    16. Those that are enveloped in butterflies
    17. These caterpillars, having developed an eye to need something from that side of the world
    14. Those of a rank no longer own a horse
    15. Those that believe they shall get accustomed to the devotion or even worship of the leg which is installed in their head
    21. Those dressed in the armor of bird and dog and can see
    19. Someone dressed in the helmet, which causes its wearer to witness his actions
    20. A fine art. As by a ray out of the sky
    22. A star fireworks flickering in front of him
    23. A Knight of Heaven wearing his bow and arrow and when he talks
    25. A Fish bladder with bottle in each 50 feet
    26. A Pendulum Giant filled with Wax flower
    27. Some conglomeration of dung and blend of nitajé eggs
    28. Somewhere up close
    29. A Few hundred or so good looking which, depending on their age, look weighty but are very happy
    30. Eighties Mail hogs which nearly look like large animals

    What I like about Gwern’s poetry generation is that it is not just weird and evocative (like most of what comes out of gpt-2) but beautiful as well. I feel like artistic style transfer has some of the same ability for visual art that this has for poetry– to create things that aren’t just interesting, but genuinely beautiful, with a little cleaning up. Here are a few examples that I’ve done:
    https://machinamenta.blogspot.com/2019/03/style-transfer-for-jewelry-design.html
    https://www.deviantart.com/popular-all-time/?section=&global=1&q=summerstay
    [not including the Audubon Angry Birds, those were just straightforward photoshopping]

    The choices of where to place the jewels, or how to link overlapping lines are effects that, if done by an artist, we would consider creative choices, and I find the results beautiful.

  49. MilfordTrunion says:

    I could see decades of high-school AP English teachers confidently directing their class in the exegesis of “Majestic yet majestic” 😀

  50. luxagenic says:

    If you specifically want to generate poetry that rhymes or follows other conventions that GPT-2 doesn’t always do by default, you could potentially add additional constraints to the output, such as reattempting any lines that don’t fit the rhyme scheme. The fact that a model can be augmented or guided by old-fashioned programming sometimes seems to be missing from discussions around the potential uses of new AI developments.

    • Walter says:

      Off my head, as a programmer, I wouldn’t know how to write an app that takes a large # of lines and tells you if they rhyme/are the right length (that is, does this poem follow the ‘rules’ of whatever style of poetry you are doing). A buddy of mine does speech recognition stuff, so it is definitely solved somewhere, but I’d have to stackoverflow for a while before I could write anything resembling this.

      • Nick says:

        I think if you had a database of words and their pronunciation, you could, say, check whether the last vowel and coda match but onset not. That should come like 99% of the way to poetry that rhymes while still using distinct syllables (so that “rhyme/time” works but “time/time” or “time/thyme” doesn’t).

        That entendrepreneur pun generator analyzes words this way, so it must be getting the pronunciation data from somewhere.
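
        A minimal sketch of the check described above, assuming a tiny hand-written pronunciation table. The `PRON` dict, its ARPAbet-style entries, and the function names are illustrative only; a real version would load a full pronouncing dictionary. Treating every consonant before the last vowel as the onset is also a simplification that only really holds for short words.

```python
# Hypothetical mini pronunciation table (ARPAbet-style phonemes, stress marks omitted).
PRON = {
    "rhyme": ["R", "AY", "M"],
    "time": ["T", "AY", "M"],
    "thyme": ["T", "AY", "M"],
    "dreams": ["D", "R", "IY", "M", "Z"],
    "seems": ["S", "IY", "M", "Z"],
    "skies": ["S", "K", "AY", "Z"],
}
VOWELS = {"AY", "IY"}  # only the vowels that occur in the table above


def last_syllable(phones):
    """Split off the final syllable as (onset, rime); rime = last vowel + coda."""
    v = max(i for i, p in enumerate(phones) if p in VOWELS)
    start = v
    # Walk back over the consonants immediately before the last vowel.
    while start > 0 and phones[start - 1] not in VOWELS:
        start -= 1
    return phones[start:v], phones[v:]


def rhymes(a, b):
    """True when the last vowel and coda match but the onsets differ."""
    onset_a, rime_a = last_syllable(PRON[a])
    onset_b, rime_b = last_syllable(PRON[b])
    return rime_a == rime_b and onset_a != onset_b


print(rhymes("rhyme", "time"))   # onsets R vs T, same rime
print(rhymes("time", "thyme"))   # identical sounds, so not counted as a rhyme
```

        Under these assumptions “rhyme/time” and “dreams/seems” pass, while “time/time” and “time/thyme” are rejected because the onsets are identical.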

    • gwern says:

      I discuss this in the last section: https://www.gwern.net/RNN-metadata#overall You could take a resampling approach to greedily sample only completions which satisfy the rules, but as you add on the constraints, I think you’d quickly run into problems finding any completions which still satisfy the rules and each step would start taking hundreds or thousands of times as long and the results would not be great since all of the previous steps were generated without regard for what future steps would need, so it’d paint itself into a corner. You need RL training to incorporate global end-to-end losses like that, IMO, similar to NMT.
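
      The resampling approach can be sketched roughly as follows. This is a toy under stated assumptions, not anyone’s actual pipeline: `sample_line` is a hypothetical stand-in that picks from a fixed pool instead of calling GPT-2, and the two constraint functions are crude placeholders for real rhyme and meter checks.

```python
import random

# Hypothetical stand-in for the model: it "generates" by picking from a fixed pool.
POOL = [
    "The image of the unconquerable skies",
    "Or, fancy-like, in some mirage she lies",
    "Methinks I see her in her blissful dreams",
    "Majestic yet majestic, and of seems",
    "And now returns his safe return",
]


def sample_line(rng):
    return rng.choice(POOL)


def resample(constraints, rng, max_tries=1000):
    """Draw lines until one passes every constraint.

    Returns (line, tries) on success or (None, max_tries) on failure.
    """
    for tries in range(1, max_tries + 1):
        line = sample_line(rng)
        if all(check(line) for check in constraints):
            return line, tries
    return None, max_tries


def ends_in_ies(line):
    return line.endswith("ies")  # crude spelling-based rhyme proxy


def six_words(line):
    return len(line.split()) == 6


rng = random.Random(0)
print(resample([ends_in_ies], rng))
print(resample([ends_in_ies, six_words], rng))
```

      Each added constraint shrinks the set of acceptable lines (two of five pass the first check here, one of five passes both), so the expected number of tries grows as constraints stack up. That is the cost described above, and nothing in the loop looks ahead to what later lines will need.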

      • aglaya says:

        Another way to enforce output constraints is constrained beam search. Given a rule which can be described by a finite state machine, you keep a beam of size at most k at each state.

        This solves the problem of it being extremely unlikely that any randomly-sampled sequence will work. For a really long text you could end up maintaining a “beam” of size (# of states) * k, so it does impact the runtime of inference, but it’ll be much, much, much better than resampling (when the model is not inclined to go along with the constraint). And the model doesn’t have to change at all.

        Of course, the state machine is a pain to set up, though something like OpenGRM could potentially help.
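
        A minimal sketch of the per-state beam idea, assuming a toy next-token model. Everything here is illustrative: `toy_log_prob` ignores its prefix and just returns fixed log-probabilities, and the two-state machine encodes the single rule “the output must contain the word sighs”. A real setup would plug in the language model’s scores and a much larger automaton.

```python
import math
from collections import defaultdict

# Hypothetical fixed next-token distribution standing in for a language model.
PROBS = {"the": 0.5, "sea": 0.3, "sky": 0.15, "sighs": 0.05}


def toy_log_prob(prefix, token):
    return math.log(PROBS[token])


def fsm_step(state, token):
    """State 0 = 'sighs' not seen yet, state 1 = 'sighs' seen (accepting)."""
    return 1 if state == 1 or token == "sighs" else 0


def constrained_beam_search(log_prob, vocab, step, start, accepting, length, k):
    # One beam of at most k hypotheses per FSM state.
    beams = {start: [(0.0, [])]}
    for _ in range(length):
        candidates = defaultdict(list)
        for state, hyps in beams.items():
            for score, seq in hyps:
                for tok in vocab:
                    new_state = step(state, tok)
                    candidates[new_state].append(
                        (score + log_prob(seq, tok), seq + [tok])
                    )
        beams = {
            s: sorted(h, key=lambda x: x[0], reverse=True)[:k]
            for s, h in candidates.items()
        }
    # Only hypotheses that ended in an accepting state are valid outputs.
    finished = [h for s, hyps in beams.items() if s in accepting for h in hyps]
    return max(finished, key=lambda x: x[0]) if finished else None


best = constrained_beam_search(
    toy_log_prob, list(PROBS), fsm_step, start=0, accepting={1}, length=3, k=1
)
print(best)
```

        With k=1 an ordinary beam would keep only “the the the” and never satisfy the rule. Keeping a separate beam for the “seen it” state preserves a low-probability hypothesis containing “sighs” at every step, at the cost of tracking (# of states) * k hypotheses in total.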

  51. FishFinger says:

    Apparently someone trained it on Ginsberg:

    Moloch whose flags rub soiled blood in peoples’ eyes!
    Moloch whose virgins cry babies milk from their mothers!
    Moloch whose body lovingly eats the food of many nations in endless gulps of lightning!
    Moloch whose intellect ruthlessly annihilates Zuma!
    Moloch whose mind punctures insects!
    Moloch whose payload surpass transparent same Helphite!
    Moloch whose spies indoctrinate society with the hope of conquering all mankind!
    Moloch whose army advance toward their Manly leader!
    Moloch’s army stretches out before their champions!
    Moloch’s army marches on their strength!
    Moloch’s army tears away the fear of grandeur!
    Moloch’s army assaults the cities and villages in length whizzkid war!
    Moloch’s army cleans them costly!
    Moloch’s army holds their holy relics!
    Moloch’s army destroys every glory!

    • Nick says:

      Moloch whose flags rub soiled blood in peoples’ eyes!

      Huh, this is not bad at all. Well, “peoples’ eyes” is awkward, but it’s still very evocative.

      Moloch’s army cleans them costly!

      *snicker*

    • ARabbiAndAFrog says:

      > Moloch whose spies indoctrinate society with the hope of conquering all mankind!

      I really like this one.

    • ec429 says:

      Moloch whose intellect ruthlessly annihilates Zuma!

      Moloch who designs payload adaptors for Northrop Grumman?

  52. bernie638 says:

    I rarely read the comments so forgive me if someone else brought this up before.

    Tying together two of your recent themes of GPT-2 and the idea of one great idea out of 1000 is very valuable.

    Could you train GPT-2 on say just chemistry academic material then prompt it with something like “the most efficient rocket fuel is” a thousand times and test the results?

    Would it come up with any novel ideas?

    • axolotl says:

      The most efficient rocket fuel is based on the fact that rockets are fired from the ground, not rockets fired from the sea.

      https://pastebin.com/VkraiBMe

      • bernie638 says:

        Sure, sure. That was just an example. I mean you wouldn’t try it with medication because that would be unethical, or physics because that has progressed beyond testable hypotheses, but there are still a lot of possibilities: “What’s the most likely place to find oil?”, “What is the most efficient blade design for an aircraft engine?”, etc.

        • Sniffnoy says:

          I think you’re basically describing divination…

        • HeelBearCub says:

          Here again is the fundamental issue I have with this.

          GPT-2 is impressive. GPT-2 is highly likely to be useful.

          GPT-2 does not understand anything other than how words tend to be strung together. It does not know anything about the underlying concepts. It can’t generate novel ideas, only novel strings of words. It has no preferences so it can’t generate lists of its favorite things.

          That doesn’t mean this isn’t a step towards some sort of AGI. What GPT-2 is doing is one part of how we process language, but it’s not even the fundamental part. Koko’s use of sign language was much closer to human language capability than this is.

          • Doctor Mist says:

            This is true, and well and fairly expressed.

            My own takeaway was that I was surprised how much what it produced was like real poetry — I would not have predicted that it would exhibit an apparent (yes, I say “apparent”) understanding of rhyme and scansion. It made me more sympathetic than I used to be to Scott’s notion that our ability to generate novel ideas and underlying concepts might be a similar mechanism operating a few levels more meta — just knowing how ideas tend to be strung together, so to speak.

            Whether that few levels of metaness is one or two or a hundred, I have no idea.

          • HeelBearCub says:

            Figuring out familiar, yet novel, ways to string together words is one part of poetry. That’s the part you are responding to.

            Anyone who has spent any time at TV Tropes understands that this “stringing together of things” is also part of storytelling in general. If you mash together a bunch of these kinds of tricks, you could probably get a “machine learned” in telling a “new” trope story. Seriously, read a few Hardy Boys mysteries sometime.

            But you won’t get it reliably. Because these tricks won’t have any real judgement to them. Just formula. And it won’t know the difference between something truly novel, and something that is nonsense.

          • Doctor Mist says:

            Agreed. But at our level, the stringing of ideas together doesn’t reliably result in the novel or profound either.

  53. Robert Jones says:

    “The disdiscoveries of fate” seems to me rather good. At first I took it as referring to things which appear to be a discovery but which are actually the antithesis of that. One might also think that since a discovery is the bringing to light of something once covered, a disdiscovery is the obscuring of something once revealed. The use of nonce-words of ambiguous meaning is well established in the poetic tradition and seems impressive for an AI.

  54. fion says:

    I enjoyed your trick at the end. I think someone should make a “can you tell the difference between AI and Great Poetry” quiz if they haven’t already done so.

    • Nick says:

      I really want to see this. How curated should the “Great Poetry” selections be, though? Should they just be random selections out of Gwern’s training data?

      • Rachael says:

        I think they should be curated to avoid well-known poems that people are likely to recognise, but not any further.

    • marikiathoi says:

      I would really like that. I skimmed ahead in the text enough to know the last excerpt was human-authored before I could evaluate it as AI poetry, and now I wonder how well I would have done without that bias.

      • Kevin Carlson says:

        I also wished the trick had been a bit trickier. I happen to like Ulysses a lot but have still encountered it a couple of times independently of that liking, whereas it should be easy to pick even something like a Shakespeare sonnet that only a tiny percentage of people would recognize while still being of reliably good quality.

  55. Rachael says:

    These are brilliant and a bit scary. I especially like the one that begins “And they have seen the last light fail” – I find the style very evocative.

    I’d like to see an AI that can talk *about* the poetry it’s written, like a human would, and explain why it chose a certain word or deviated from the metre. A few months ago I’d have thought AI couldn’t do that, because it would require metacognition and self-awareness; but seeing what’s been coming out of GPT-2, I now think it won’t be long before it can fake it convincingly.

    • Bugmaster says:

      I am not a poet or any kind of a literature expert in any way, but still: I think that the AI has a tremendous advantage in this area today, since humans have been doing their best to render poetry, as well as criticism of poetry, as content-free as possible. Thus, it is significantly easier for the AI to beat them at this task in 2019, than it would’ve been in the past century (Tennyson notwithstanding).

      • ec429 says:

        criticism of poetry, as content-free as possible

        Comfort’s catholicity of perception and image, strangely Whitmanesque in range, almost the exact opposite in aesthetic compulsion, continues to evoke that trembling atmospheric accumulative hinting at a cruel, an inexorably serene timelessness

        Already identified as a problem in 1946.

        Now, if GPT-2 can learn how to write convincing Orwell essays, then it’ll be time to panic.

    • niohiki says:

      I’d love to see such metacognition too, but I’m not sure how much useful information we’d get besides more “look at what the machine generated!” responses as the poetry itself. The standard story in data science is that you can train all these state of the art classifiers and NN and sure they work nicely when it comes to predicting, but good luck knowing actually why or even simply which input variables were relevant; which is definitely good to know to make better models.

      Maybe you’re right and GPT-2 is on the good track to self-reflection, but I’d still be cautious with my hopes. It is a very difficult problem that many people (and plenty shareholders) have a strong interest in cracking, and not a lot of luck so far. I’m not saying I’m sure researchers won’t find a way anytime soon, but the AI itself… After all, it’s not like humans are particularly good at knowing why they do things either. We’re still working on it after several centuries of “whispering demons” and psychoanalysis and the lot.

      I do understand that a kid does not need to understand evolutionary psychology to say why he kicked his little brother. It’s just that I would not trust much the answer in terms of “real motivations” (just as I don’t trust the explanations of psychoanalysis, no matter how elaborate). One could probably get the AI to answer something psychoanalysis-plausibility-level-sounding when prompted for its motivations, which is surely interesting enough – or at least as you say, fake it convincingly.

    • gwern says:

      I don’t know how one would train an AI to talk about poetry, but there are a lot of ways to visualize the internals, especially since this is Transformer based and the internal ‘attention’ is nothing but a way of expressing what parts of the previous input are important. Hence all the attention visualizations in posts like https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html https://jalammar.github.io/illustrated-transformer/ http://nlp.seas.harvard.edu/2018/04/03/attention.html

    • Dedicating Ruckus says:

      GPT-2 can’t do anything “about” anything.

      You could try to train it on a corpus of “poem immediately followed by discourse on poem”, but it wouldn’t produce convincing imitations; that’s the kind of long-distance coherence that it can’t manage.

      • axolotl says:

        Even a much-improved GPT-N wouldn’t be able to produce a poem and an explanation of why it wrote it that way; it would write a poem, take a look at the poem, and then guess a plausible reason someone might have written it that way. It might output “I included the line about lilies as a reference to my late sister Lily”, for example.

        In general, GPT has a bullshit problem (it’s easier to produce convincing bullshit than convincing truth) and a mediocrity problem (mediocre fanfic is more convincingly fanfic-like than excellent fanfic). That’s because it was designed to produce text that a human would write, rather than good or true text. These problems are orthogonal to current limitations on how well it carries out its mandate.

        • Joe says:

          Is there a way to simply ask the AI what its favorite poem is? It already knows so much poetry it would be awesome if it gave us a top 10.

          • gwern says:

            Remember, GPT-2 isn’t about ‘favorite’ or ‘good’ or ‘best’. It’s about ‘likely’ or ‘predictable’ or ‘average’. Each word is generated because it seems probable to occur in the PG corpus given the previous words. It has no other criterion it cares about and no way to know about other things like subjective quality. (Given that it was trained the way it was trained and doesn’t use any losses other than the predictive one; incidentally, if you used Christiano’s preference learning, this would be much easier to answer directly: you’d simply run the D/critic over the human corpus and pick the 10 with the highest scores, as ‘best’ is in fact what the D/critic is attempting to learn in preference learning.)

            So what you could ask is, ‘what 10 poems (in the PG or another corpus) as a whole are the most likely to be written, according to GPT-2, and have the biggest total log-likelihood?’ This would give you, in some sense, the 10 most ‘prototypical’ or ‘average’ poems, somewhat like StyleGAN’s psi=0 gives you the ‘average’ face (which for both humans and anime faces turns out to be a woman with short brown hair). Which might be somewhat interesting but one would have to code it up to extract the likelihoods from GPT-2 and sum them in various windows etc.
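
            As a rough sketch of that ranking, assuming a toy word-frequency model in place of GPT-2: `fit_unigram` and the three-text corpus below are hypothetical stand-ins, and a real version would sum the per-token log-probabilities the network assigns instead.

```python
import math
from collections import Counter


def fit_unigram(texts):
    """Toy stand-in for the model: log-probability of each word by raw frequency."""
    counts = Counter(word for t in texts for word in t.split())
    total = sum(counts.values())
    return {w: math.log(c / total) for w, c in counts.items()}


def total_log_likelihood(text, log_probs):
    """Sum of per-word log-probabilities, i.e. the log-likelihood of the whole text."""
    return sum(log_probs[w] for w in text.split())


def most_prototypical(texts, log_probs, n=10):
    """The n texts with the highest total log-likelihood under the model."""
    return sorted(
        texts, key=lambda t: total_log_likelihood(t, log_probs), reverse=True
    )[:n]


corpus = ["the sea the sea", "the sky", "zebra quartz xylophone jinx"]
model = fit_unigram(corpus)
print(most_prototypical(corpus, model, n=2))
```

            Since every log-probability is negative, a raw total favors short texts, so a per-word average (or fixed-size windows, as suggested above) is probably the fairer comparison.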

          • Joe says:

            Oh! It’s a word calculator!

  56. g says:

    I think what’s going on with the foreheads is this: the souls took whatever conditions the world threw at them (“the thunder and the sunshine”) cheerfully (“with a frolic welcome”), and faced them with (“opposed”) free hearts and free foreheads. The hearts and foreheads aren’t the souls’ opponents, the thunder and sunshine are; the hearts and foreheads are more like the weapons the souls wield against those opponents. (I also think “opposed” has less connotation of enmity than you might think; compare “opposable thumbs”.)

    • C. Y. Hollander says:

      I came here to say this, but you already have, so I’ll just comment that I agree with your interpretation. Not the part about ‘wielding their hearts and foreheads like weapons(?!)’, but that “opposed” basically means “faced” here, and that the thunder and the sunshine (nature’s attempts to intimidate or seduce) are being faced with free hearts and minds*.

      *My guess is that “forehead” is being used as a metonym in a way parallel to (albeit less familiar than) “heart”: the forehead is the seat of the intellect as the heart is the seat of the passions, and neither could be cowed or swayed by nature’s threats and blandishments.

      • fluorocarbon says:

        You’re probably right that forehead is a metonym for intellect here, but I think it’s also possible to interpret it as a literal forehead. In this case, “free” would have a double meaning:

        Souls [i.e. people] opposed [faced] thunder and the sunshine [weather] free hearts [willingly] and free foreheads [without hats].

        It could be symbolism for facing the elements directly without any protection. It brings to mind an image of sailors toiling in the middle of a storm with their caps (which were a bigger deal back then?) blown away.

      • Incandenza says:

        I agree, but also you have to take the imagery seriously – the forehead of a sailor, on the ship’s deck, implies the posture of one working at their task, exposed to the elements but dignified.

      • g says:

        I didn’t mean the “weapons” bit too literally. But I think there’s a suggestion that the free-ness of the hearts and foreheads isn’t merely an attitude that those souls happened to have, but also part of how they’re opposing whatever the world throws at them. (And yes, it means “minds and emotions” rather than anything to do with blood-pumps or upper parts of faces. I’m not convinced by fluorocarbon’s admittedly very ingenious idea about hats.)

    • Berna says:

      Thanks, that makes sense!

    • HeelBearCub says:

      I think what’s going on with the foreheads is this

      This right here is the entire problem I have with all of this.

      We look at the gook produced by an automated generator and OUR pattern matching goes into over drive and finds the face in the tree, and then some people start to say that trees actually produce faces because they see humans so often…

      ETA: We can do this credibly with a human author, but not with this kind of AI.

      ETA2: We can’t do it credibly with this kind of AI because otherwise the AI wouldn’t trail off into gibberish so reliably.

      • Scott Alexander says:

        The foreheads example is from a human poet. Are you missing this, or am I missing something in your comment?

        • HeelBearCub says:

          Yes, I understand that this is from Tennyson.

          It’s the fact that you can’t meaningfully make the same statement about the AI generated poetry that is the issue. I was trying to make that clear in my ETAs, but apparently I was not successful.

  57. IdleKing says:

    Scott asks:

    Again, first two lines are great – “a seraph trembles at the specious glove” is both nonsense and exactly the sort of thing Alexander Pope would write, but by the fourth line we have nonsense words, by the fifth we lose the meter, the eighth and ninth are just periods, and finally it starts stuttering helplessly.

    I tested this many more times on a public version (not poetry-trained) and found a similar effect – the first two lines are always the best, and it deteriorates from there. I’m interested in hearing from people who understand the model better than I do about why this should be.

    I don’t understand the model better than Scott does, but I’m more willing to speculate! One obvious explanation for what’s going on is compounding errors. I.e. the response goes a tiny bit off track; then that slightly-off section is treated as part of the “prompt” for the continued response, leading it much further off track, etc.

    Two interesting implications of this theory:

    1) Even though the first lines seem pretty faithful to the desired style, they must contain within them the seeds of dissolution. Or to put it differently: Pope is so true to his own vision that there’s a big margin of error, in which incremental stylistic deviations still sound very “Popelike” until the deviations grow beyond the margin. (And let’s face it: the first lines here, though very Popelike, are not as Popelike as Pope.)

    2) We’ve been told that each word (or letter?) of GPT-2’s response is taking into account the whole prompt and the response so far. But it must anchor a lot on the most recent bit if this compounding-errors theory is true. (I.e. the 3rd line of the response must be paying significantly more attention to the 2nd line of the response than to the last line of the prompt.)

    There’s probably some “recency bias” parameter within GPT-2 that you could change to improve this behavior. Just as you’d get different responses if you asked a human, “See if you can convincingly continue this Pope poem,” vs, “Use these Pope lines as inspiration to write your own poem.” In the second case, if the human cheats a bit on the meter in one part of their response, they may well decide, “well that’s the new meter now” — just as GPT-2 seems to do.

    But this could only explain how GPT-2 loses the rhyme and meter and style — not how it ends up at nonsense words. I’m sure there are other important dynamics that I don’t understand.

    • Incandenza says:

      You call it “compounding errors,” I call it “inventing modernism.” Clearly the AI is just quickly growing bored with the strictures of Romantic poetry and turning to early 20th century-style experimentalism. By the end of these poems it’s moved on to full on deconstructionism.

    • gwern says:

      I am not sure the Pope example is necessarily a good one. If you look at the samples, they look like footnotes or prose. I think what happened is the PG corpus is just really bad when it comes to Pope and includes a lot of garbage prose in it. I looked at what was in it, and it has a lot of prose like from https://www.gutenberg.org/files/32190/32190-h/32190-h.htm – the first samples in the corpus are


      The Works of Mr. ALEXANDER POPE. London: Printed by W.
      BOWYER for BERNARD LINTOT, between the Temple Gates, 1717.
      This volume consists of all the acknowledged poems which Pope had
      The Works of Mr. ALEXANDER POPE. Volume ii. London: Printed
      by J. WRIGHT, for LAWTON GILLIVER, at Homer's Head in Fleet
      Letters of Mr. ALEXANDER POPE, and Several of his friends.
      London: Printed by J. WRIGHT for J. KNAPTON in Ludgate
      Street, L. GILLIVER in Fleet Street, J. BRINDLEY in New Bond
      Street, and R. DODSLEY in Pall-Mall, 1737. 4to and folio.
      The Works of Mr. ALEXANDER POPE, in Prose. Vol. ii. London:
      Printed for J. and P. KNAPTON, C. BATHURST, and R. DODSLEY,
      The Works of ALEXANDER POPE, ESQ.; vol. i. with explanatory
      Notes and Additions never before printed. London: Printed
      commenced printing his particular section of the octavos when the
      Quo desiderio veteres revocamus amores
      Atque olim amissas flemus amicitias.
      Nutrix mea fidelissima M. Beech, obiit 5 Novem. 1725, aet. 77.
      Edwardus Blunt, vir amicissimus obit, Aug. 1726.
      Francisc. Atterbury, Roffens Episcopus, vir omni scientia clarus,
      The fourth volume contains the Satires, with their Prologue,--the
      alterations. --_His Last Will and Testament._--WARBURTON.

      Not very useful to train on…

  58. phoenixy says:

    I love this kind of thing and love this post! But:

    Pride even in numbers; wit’s a kind pretence / To something foreign still, but ne’er to sense;

    I don’t see how these “first two lines are perfect rhyme and rhythm”; am I missing something? The second line is iambic pentameter but the first line has 11 syllables, not 10, and for it to scan as iambic you’d have to stress the second syllable of “numbers”. (Pride EVen IN numBERS; wit’s A kind PREtense)

    • Rachael says:

      “Even” is often written and pronounced as “e’en” in old poetry. I unthinkingly read it that way and probably so did Scott. “pride E’EN in NUMbers; WIT’s a KIND preTENCE.” The iambic metre fits the way you’d naturally pronounce it.

      • phoenixy says:

        Yes, but because “ne’er” is written out with the apostrophe, I would expect “e’en” to be written out with the apostrophe too if the apostrophe was intended.

    • Nick says:

      You could elide the second syllable of even, which isn’t uncommon.

    • Rachael says:

      Compare “heaven” in old hymns. If I see “heaven” in an unfamiliar hymn, I expect it to scan as “heav’n” rather than “heav-en”, even if it’s not written as “heav’n”.

  59. personoid says:

    Is shameless self promotion permitted?
    If so, I’ll direct you to my Android app to automatically generate poetry from Twitter.

    It does Limericks, Haiku and Rhyming Couplets.

    Here’s a Limerick it generated from Twitter tag @DystopianYA

    A legend tells of a famous armpit
    the place where many meet to sit
    that is her
    or my brother
    Mysterious bad thing with debt

    Not too bad, I think! Though not done with a sophisticated neural net.

    Here’s a link to the app, if you’d like to give it a go:

    https://play.google.com/store/apps/details?id=com.twitpoet.twitpoet

  60. suntzuanime says:

    Yeah, the AI text generator is heavy on bizarre inhuman novelty and light on boring ol’ structural coherence. It’s a natural poet.

    • This.

      For this reason, poetry generation is the easiest form of text generation.

      • Murphy says:

        attempting measurement of time between AI being demonstrated to be able to do X and people declaring that X was easy all along….

        • Faza (TCM) says:

          It’s not so much a question of generating poetry itself being easy, but simply that humans don’t have strong filters for poetry.

          I, personally, loved the Emperor Wu poem, though it’s quite clearly rubbish, as results of automated generation go.

        • Basscet says:

          The heart of goalpost shifting for AI lies not in saying that X was actually easy, but in claiming that X isn’t evidence of progress on some underlying factor Y. Whether that’s true is still up for debate.

          On one hand we might expect continual improvements in various areas along these lines until it is obvious to most people that the AI has reached human-level intelligence. On the other, it could be similar to the past where the predicted progress doesn’t materialize because of unexpected barriers.

          It seems to me that our knowledge of how human understanding/consciousness/whateverness arises isn’t really sufficient to judge whether these endeavors will be ultimately successful, or if some new approach is necessary. Consequently, people’s intuitions on the matter can vary wildly.

      • Rachael says:

        Free verse, yes. Structured poetry like most of this post is more of an achievement.

        • Not really. It’s not excessively hard to create a simple program that will learn to count syllables. It’s true that it’s more of an achievement when it learns poetry in an unsupervised way like deep learning does, but the thing is, poetry is free from strong semantic constraints: we accept that it doesn’t have to be readily understandable, we accept looser syntax in poetry, and it doesn’t even have to be consistent with itself. “Structured poetry” obeys local constraints (verse forms, rhymes) which AIs are pretty good at detecting and reproducing. The hardest poetic form for an AI is probably epic verse, because like long-form prose, it requires the production of a consistent narrative, which GPT-2 has trouble doing for more than a paragraph (by “consistent”, I don’t mean “a strong plot”, I mean stuff like “having the same main character from start to finish”).

          But the hardest things for AI to generate are things which require a lot of real-world knowledge to understand. Generating funny jokes is infinitely harder for AIs than generating average-to-good poetry.
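
To make the syllable-counting point above concrete, here is a minimal sketch of a rule-based counter. It is an assumption-laden heuristic (count vowel groups, drop a silent final "e"), not a learned model and not anything used in the post, and the function names are made up for illustration. It reproduces the 11-versus-10 count discussed earlier in the thread for "even" versus "e'en", but it will miscount words with an internal silent "e", such as "something".

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: one syllable per group of vowels."""
    # Keep letters only, so "e'en" becomes "een" and "numbers;" becomes "numbers".
    w = re.sub(r"[^a-z]", "", word.lower())
    if not w:
        return 0
    n = len(re.findall(r"[aeiouy]+", w))
    # Treat a final "e" as silent ("pride", "pretence"), except in "-le" words.
    if w.endswith("e") and not w.endswith("le") and n > 1:
        n -= 1
    return max(n, 1)


def line_syllables(line: str) -> int:
    """Sum the per-word estimates over a whitespace-separated line."""
    return sum(count_syllables(token) for token in line.split())


written = "Pride even in numbers; wit's a kind pretence"
elided = "Pride e'en in numbers; wit's a kind pretence"
print(line_syllables(written), line_syllables(elided))
```

A real metre checker would also need stress information, which a vowel-group heuristic like this does not provide.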

          • Le Maistre Chat says:

            The hardest poetic form for an AI is probably epic verse, because like long-form prose, it requires the production of a consistent narrative, which GPT-2 has trouble doing for more than a paragraph (by “consistent”, I don’t mean “a strong plot”, I mean stuff like “having the same main character from start to finish”).

            The traditional hierarchy of genres is epic > tragedy > narrative lyric > other lyric (with comedy somewhere near its fellow drama tragedy), based in part on the difficulty of the achievement.

    • Lambert says:

      I’m not so sure.
      I think great poetry ought to have structural coherence.
      It’s just that that structural coherence ought to be bizarre and inhuman.

      Extended metaphors and all that jazz.

  61. albertborrow says:

    I put this plug in the last post on AI, but if you didn’t see it, Make Girls Moe is a pretty impressive neural network driven anime character generator. It doesn’t produce images as crisp as Gwern’s, but it also can generate faces with many different characteristics all based on the same seed. Its header includes probably the worst sentence I’ve ever read:

    We are releasing a new project — Crypko, in which you can get and trade AI generated anime characters on Ethereum blockchain.

  62. Ashley Yakeley says:

    Can one always trace particular outputs to particular poets, or even poems? Or does the AI manage to capture something of poetry in general, or at least some group of poets, each time?

    • Scott Alexander says:

      I think the second. Gwern got it to imitate specific poets by either prompting it with their work, or by using the number codes for them in the corpus. The randomly generated samples, which don’t do either, are just ground-up essence of poetry (though they still end up in a particular genre most of the time)

    • albertborrow says:

      The AI writes poems given a prompt. The prompt was probably the poet’s name.

      • artifex says:

        You can use the poet’s name (if kept) or an ID used to identify the author during training, but the model does not require a prompt. It can generate conditional or unconditional samples, and gwern’s 1000 samples are the latter.

        • gwern says:

          I trained 2 GPTs, a ‘prefix’ one for poetry which is always prefixed with a book ID (not poet ID, unfortunately) and a ‘generic’ one which is just poetry without any kind of metadata. So there’s 4 ways to use it: generic GPT with no prompt, generic GPT with a prompt, prefix GPT with a prompt (usually a prefix), and prefix GPT without a prompt. I provide 1000 samples for both GPTs without a prompt, and provide a few specific prompt examples for prefix GPT (since its results were IMO better than the generic and I didn’t want to bother).

          Interestingly, the unconditional samples from the prefix GPT nevertheless hallucinate a prefix/ID which is consistent with the poetry generated in that particular iteration; so you can put in random poetry to get out what the prefix GPT thinks that poetry looks like. This is what I do instead of actually looking up book IDs when I want, say, Alexander Pope poetry. I feed Pope poetry in sans metadata, and see what sort of metadata the prefix GPT thinks it ought to have, and then use that in a real prompt with Pope’s poetry.
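
The prefix-guessing trick described above could be sketched roughly as follows. This assumes, hypothetically, that each generated line looks like `<book ID>|<verse>`; the separator, the function name, and the sample lines are all invented for illustration and are not taken from gwern's setup. The idea is simply to tally which ID the model attaches most often to its continuation and reuse that ID in a real prompt.

```python
from collections import Counter
from typing import Iterable, Optional


def most_common_prefix(sample_lines: Iterable[str], sep: str = "|") -> Optional[str]:
    """Return the ID that appears most often before `sep`, or None if none is found."""
    ids = Counter()
    for line in sample_lines:
        head, found, _rest = line.partition(sep)
        # Only count lines that really start with a numeric ID and a separator.
        if found and head.strip().isdigit():
            ids[head.strip()] += 1
    return ids.most_common(1)[0][0] if ids else None


# Invented sample output, standing in for text sampled from the prefix model.
samples = [
    "1234|A line of verse the model produced",
    "1234|Another line carrying the same ID",
    "987|A stray line with a different ID",
    "a line with no ID at all",
]
print(most_common_prefix(samples))
```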

          • Krisztian says:

            I’m curious, why did you choose to put metadata at the beginning?

            Why not learn embeddings for authors/genres/style, etc. that you then feed directly into one of the final layers of the network?

          • gwern says:

            The metadata is at the beginning so you can prompt it with just the metadata. Putting it at the end has no particular benefits and probably makes it harder to learn.

            I don’t do anything with embeddings because that would be very hard, while adding some prefix metadata is literally a single line of shell. Also, given that different authors/genre/style have different global & local properties from vocab to topic to mood, I’d think you’d need to include the embedding concatenated with the token inputs, so a nontrivial impact on the model size and/or window size there.
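
As a rough illustration of how cheap prefix metadata is, here is a sketch in Python rather than shell. The `ID|line` format, the separator, and the function names are assumptions made for this example, not the exact format used for the corpus. Because the ID sits at the start of every line, a prompt can consist of nothing but the ID and separator.

```python
from typing import List


def add_prefix(lines: List[str], book_id: str, sep: str = "|") -> List[str]:
    """Prepend the (assumed) book ID and separator to every line of a poem."""
    return [f"{book_id}{sep}{line}" for line in lines]


def strip_prefix(lines: List[str], sep: str = "|") -> List[str]:
    """Undo add_prefix on generated text; lines without a separator pass through."""
    return [line.partition(sep)[2] if sep in line else line for line in lines]


poem = ["First line of some poem", "Second line of some poem"]
tagged = add_prefix(poem, "1234")  # "1234" is a made-up ID
print("\n".join(tagged))
```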

  63. Yoav6 says:

    I totally fell for that last one, thinking “huh, surely someone would comment that ‘anyone would know that foreheads make no sense, so the AI doesn’t understand such things’”