Gwern has answered my prayers and taught GPT-2 poetry.
GPT-2 is the language processing system that OpenAI announced a few weeks ago. They are keeping the full version secret, but have released a smaller prototype version. Gwern retrained it on the Gutenberg Poetry Corpus, a 117 MB collection of pre-1923 English poetry, to create a specialized poetry AI.
I previously tested the out-of-the-box version of GPT-2 and couldn’t make it understand rhyme and meter. I wrongly assumed this was a fundamental limitation: “obviously something that has never heard sound can’t derive these complex rhythms just from meaningless strings of letters.” I was wrong; it just didn’t have enough training data. Gwern’s retrained version gets both of these right, and more too. For example:
Thou know’st how Menoetiades the swift
Was dragged, of Hector and the fierce compeers
And Phrygian warriors. So, we will dispatch
Your bodies, then, yourselves to burn the ships
In sacrifice; with torches and with bells
To burn them, and with oxen to replace
Your gallant friends for ever. But I wish
That no man living has so long endured
The onset of his foes, as I have power
To burn or storm; for mighty Hector erst
Was slain, and now returns his safe return
This is all perfect iambic pentameter. I know AP English students who can’t write iambic pentameter as competently as this.
(by the way, both “compeers” and “erst” are perfectly cromulent words from the period when people wrote poems like this; both show up in Shelley)
It has more trouble with rhymes – my guess is a lot of the poetry it was trained on was blank verse. But when it decides it should be rhyming, it can keep it up for a little while. From its Elegy Written in a Country Churchyard fanfic:
Methinks I see her in her blissful dreams:
Or, fancy-like, in some mirage she lies,
Majestic yet majestic, and of seems
The image of the unconquerable skies.
Methinks I see her in her blissful dreams:
—Or, fancy-like, in some majestic cell,
Where lordly seraphs strew their balmy dreams
On the still night, or in their golden shell.
There, in the calm of some Platonic dream,
Sits she, and views the unclouded moon arise
Like a fair lady full of realms divine;
And, all at once, a stony face and bright
Glittering in moonlight, like the noon-tints of a night.
COME ON, IT’S A ROBOT. THAT’S BETTER THAN YOU COULD DO IF YOU WERE A ROBOT. GIVE IT A BREAK.
I think its problem is less ability to rhyme, and more a sort of…maintaining the will to rhyme. Gwern prompts it with several Alexander Pope pieces; Pope is a very formal poet who always rhymes in the exact same way. For each Pope piece, GPT-2 does the first couplet in perfect Pope style, then degenerates into increasingly confused gibberish. For example:
Pride even in numbers; wit’s a kind pretence
To something foreign still, but ne’er to sense;
A constant waste of words, the world produces,
A lazy sap, that nothing that’s foreign to expels,
; that’s foreign still unknown to the soul;
A young, like, but in the world in motion,
Obscending not, which smells all our own the worse than our own our own the soul’s, and soul;
Like sense; which is like, but in all our whole.
Which thus far more like, but in all things that’s an excellence; and ne’er unchanged by which is folly’s the worse, they give the worse maintained by which
If sick of sense;
Wholubil, or snug. ills, we know our own our first in sense the worse maintained between the worse, soon expired.
First two lines are perfect rhyme and rhythm, next four have no rhyme but are close to the right meter, next few have almost random length, and by the last one we’ve abandoned grammar and are making up nonsense words like “wholubil”. Every Pope test Gwern runs is like this:
They talk of constancy and faithless love,
A seraph trembles at the specious glove;
Nor in the rich confin’d relief of state,
Find proud pretence, nor in the disdiscoveries of fate.
For when misfortune makes choice remains the conduct’s the prize half known,
Can we secret soul without due, they fear of sense of more known.
Some rise where’ rights, they make it pays due.
.
.
. Sense and judgment, as equal prize seem meanly, the reward the joy, as much possess the prize paid, as well reckon the prize we do not less dare not less keenly wise.
We see;
Of happy lovers ought, as well done, like a friend.
Know they ought, ’tis ev’ is the other joy, as well worth a right;
The joy, as well might, as well may, as well may all is great.
Nor need of joys not as well maysters, as well as well may they give; but as little store; but as well as well as well may shewn, as much, as well we know, as well as well can be sure might prove, as well may well as well as well as well as well may view;
The mind: as well as well as well as well as much the fair as well as well as well as well as well as well as well may
Again, first two lines are great – “a seraph trembles at the specious glove” is both nonsense and exactly the sort of thing Alexander Pope would write, but by the fourth line we have nonsense words, by the fifth we lose the meter, the eighth and ninth are just periods, and finally it starts stuttering helplessly.
I tested this many more times on a public version (not poetry-trained) and found a similar effect – the first two lines are always the best, and it deteriorates from there. I’m interested in hearing from people who understand the model better than I do about why this should be.
Some other highlights:
My heart, why come you here alone?
The wild thing of my heart is grown
To be a thing,
Fairy, and wild, and fair, and whole
That last line, with its ABAB structure, is actually brilliant even by the standards of human poets. “Fairy and wild and fair and whole”. I could say that all day. This has to be a coincidence. It’s not that good anywhere else. But even having something generally okay enough that it can occasionally blunder into something that good is great.
From its Hindu phase:
…which Indra, King of all the Blest,
Had thrown by Rávan’s mighty breast,
The monstrous coil, the brawny hand,
The monstrous mouth, the jaw, the jaw,
The jaw, the jaw and bleeding jaw,
The ungovernable host, the jaw,
And the great bow which never bends,
The arm, the fist, the knee, the ends,
The body laid with mighty stroke,
And the great bow which never bends.
So, when the giants fought, and fell
With murderous strokes, the giant fell,—
So falls the tree with all his trunks
Terrific in its death, that shoots
Wild volley at the mighty trunk,—
So fell the tree with all its boughs
While all the vipers dug and sowed—
So fell the tree with all its boughs.
But Ráma’s heart was sad within
He wept and mourned his captive’s sin
The first two lines are great (and the program must have invented the epithet “King of All The Blest” for Indra, because I can’t find it anywhere else). The description of Ravan’s jaw is, uh, quite memorable. “So falls the tree with all his trunks” is a good deployment of a metaphor which is frankly overused in old poetry. I’m ambivalent between interpreting the repetition as an AI getting stuck in a loop vs. as good use of repetition as a poetic device.
The next few are from this list of a thousand randomly generated samples:
And they have seen the last light fail;
By day they kneel and pray;
But, still they turn and gaze upon
The face of God to-day.
And God is touched and weeps anew
For the lost souls around;
And sorrow turns their pale and blue,
And comfort is not found.
They have not mourned in the world of men,
But their hearts beat fast and sore,
And their eyes are filled with grief again,
And they cease to shed no tear.
And the old men stand at the bridge in tears,
And the old men stand and groan,
And the gaunt grey keepers by the cross
And the spent men hold the crown.
And their eyes are filled with tears,
And their staves are full of woe.
And no light brings them any cheer,
For the Lord of all is dead
And:
There are several kinds of people in America;
There are several kinds of people, I mean their number.
There’s a girl growing up in the house by the light,
There’s a youth upon the road, or a girl somewhere in New York;
There’s a prettier girl, and a man more congenial,
But none of the likes of the likes of the fellows are equal.
There’s one who has never been married and married,
There’s one who don’t want to be treated with kindness;
A fair youth is never employed nor neglected;
There’s one who has never yet come to a neighbor,
And one who resides in New York from the start;
But none of the likes of the likes of the fellows
Are equal to him, and wherever he goes,
The heart somehow breaks under the hand that is steering;
And so it is with me
And this quatrain just worked out really well:
Fair is the Lake, and bright the wood,
With many a flower-full glamour hung:
Fair are the banks; and soft the flood
With golden laughter of our tongue
This one is notable for competent metaphor:
How the clouds
Seem to me birds, birds in God’s garden! I dare not!
The clouds are as a breath, the leaves are flakes of fire,
That clash i’ the wind and lift themselves from higher!
And this one is obviously a failure on one level, but on another level is some kind of great experimental modern political poetry:
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), majestical,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious,
The Emperor Wu (the great Wu), rapacious
This one displays an interesting combination of world-knowledge and lack-of-world-knowledge:
In the dark the sun doth gleam,
And in the dark the moon doth seem
But now the evening is begun–
Gone is the sun upon the earth!
The silver moon doth like a cup
Of blood-red wine, and as that cup
Is drained of life, doth quench no drop.
What man will drink such wine?
There is no soul of earth or birth
Which man hath never known of earth.
There is no soul who doth not sit
And sing to it, and cry, “Drink!”
There is no soul whose feet are set
On youth’s eternal paradise;
For all is a solemn harmony,
And all is a perpetual chant,
And all the world is a song of God.
There is no soul so wholly free
And here’s another:
There gloom the dark, broad seas. My mariners,
Souls that have toiled, and wrought, and thought with me
That ever with a frolic welcome took
The thunder and the sunshine, and opposed
Free hearts, free foreheads – you and I are old;
Old age hath yet his honour and his toil;
Death closes all: but something ere the end,
Some work of noble note, may yet be done,
Not unbecoming men that strove with Gods.
Except this last time I’m cheating: this is an excerpt of Tennyson’s Ulysses, one of the most famous English poems. I included it as a placebo, ie a test to see whether real poems sound fake if you think they’re by an AI when you read them. I’ll be honest: if I didn’t know this was Great Poetry, I would skim it over and assume it made several mistakes. Like: is “gloom” really a verb? (it is if you’re Alfred, Lord Tennyson). Is the last line grammatical? (yes: it’s an adjective phrase modifying “work”, ie “some work which is fitting for the sort of men who fought gods to do”). Are the mariners’ souls opposing their foreheads? (I’m still confused on this one). These are all the sorts of things that would make me go “Haha, AIs are still pretty dumb” if I were reading it blindly.
If you liked these poems, you might also appreciate Gwern’s work making AI-generated anime waifus.
(and you can also donate to Gwern’s Patreon here)
I totally fell for that last one, thinking “huh, surely someone would comment that ‘anyone would know that foreheads make no sense, so the AI doesn’t understand such things’”
Can one always trace particular outputs to particular poets, or even poems? Or does the AI manage to capture something of poetry in general, or at least some group of poets, each time?
I think the second. Gwern got it to imitate specific poets by either prompting it with their work, or by using the number codes for them in the corpus. The randomly generated samples, which don’t do either, are just ground-up essence of poetry (though they still end up in a particular genre most of the time)
The AI writes poems given a prompt. The prompt was probably the poet’s name.
You can use the poet’s name (if kept) or an ID used to identify the author during training, but the model does not require a prompt. It can generate conditional or unconditional samples, and gwern’s 1000 samples are the latter.
I trained 2 GPTs, a ‘prefix’ one for poetry which is always prefixed with a book ID (not poet ID, unfortunately) and a ‘generic’ one which is just poetry without any kind of metadata. So there’s 4 ways to use it: generic GPT with no prompt, generic GPT with a prompt, prefix GPT with a prompt (usually a prefix), and prefix GPT without a prompt. I provide 1000 samples for both GPTs without a prompt, and provide a few specific prompt examples for prefix GPT (since its results were IMO better than the generic and I didn’t want to bother).
Interestingly, the unconditional samples from the prefix GPT nevertheless hallucinate a prefix/ID which is consistent with the poetry generated in that particular iteration; so you can put in random poetry to get out what the prefix GPT thinks that poetry looks like. This is what I do instead of actually looking up book IDs when I want, say, Alexander Pope poetry. I feed Pope poetry in sans metadata, and see what sort of metadata the prefix GPT thinks it ought to have, and then use that in a real prompt with Pope’s poetry.
I’m curious, why did you choose to put metadata at the beginning?
Why not learn embeddings for authors/genres/styles, etc., that you then feed directly into one of the final layers of the network?
The metadata is at the beginning so you can prompt it with just the metadata. Putting it at the end has no particular benefits and probably makes it harder to learn.
I don’t do anything with embeddings because that would be very hard, while adding some prefix metadata is literally a single line of shell. Also, given that different authors/genre/style have different global & local properties from vocab to topic to mood, I’d think you’d need to include the embedding concatenated with the token inputs, so a nontrivial impact on the model size and/or window size there.
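For concreteness, the prefixing step amounts to something like the following: a rough Python equivalent of that shell one-liner, where the one-book-per-file layout and the `|id|` delimiter are made up for illustration, not the actual corpus format.

```python
# Rough sketch of metadata prefixing (the real thing was a one-line shell
# command); directory layout and the |id| delimiter are hypothetical.
import pathlib

corpus_dir = pathlib.Path("gutenberg_poetry")   # hypothetical: one book per .txt file
with open("train_prefixed.txt", "w", encoding="utf-8") as out:
    for book in sorted(corpus_dir.glob("*.txt")):
        # The book ID becomes the first thing the model sees for each text,
        # so at sampling time you can prompt with just that ID.
        out.write(f"|{book.stem}|\n")
        out.write(book.read_text(encoding="utf-8"))
        out.write("\n\n")
```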
I put this plug in the last post on AI, but if you didn’t see it, Make Girls Moe is a pretty impressive neural network driven anime character generator. It doesn’t produce images as crisp as Gwern’s, but it also can generate faces with many different characteristics all based on the same seed. Its header includes probably the worst sentence I’ve ever read:
If anything I’d say that description is a strong candidate for “output of a neural net trained on startup pitches from the last three years”…
FWIW, StyleGAN can be controlled now: https://twitter.com/halcy/status/1101959161187889153
Yeah, the AI text generator is heavy on bizarre inhuman novelty and light on boring ol’ structural coherence. It’s a natural poet.
This.
For this reason, poetry generation is the easiest form of text generation.
attempting measurement of time between AI being demonstrated to be able to do X and people declaring that X was easy all along….
It’s not so much a question of generating poetry itself being easy, but simply that humans don’t have strong filters for poetry.
I, personally, loved the Emperor Wu poem, though it’s quite clearly rubbish, as results of automated generation go.
The heart of goalpost shifting for AI lies not with saying that X was actually easy, but by claiming that X isn’t evidence of progress on some underlying factor Y. Whether that’s true is still up for debate.
On one hand we might expect continual improvements in various areas along these lines until it is obvious to most people that the AI has reached human-level intelligence. On the other, it could be similar to the past where the predicted progress doesn’t materialize because of unexpected barriers.
It seems to me that our knowledge of how human understanding/consciousness/whateverness arises isn’t really sufficient to judge whether these endeavors will be ultimately successful, or if some new approach is necessary. Consequently, people’s intuitions on the matter can vary wildly.
Free verse, yes. Structured poetry like most of this post is more of an achievement.
Not really. It’s not excessively hard to create a simple program that will learn to count syllables. It’s true that it’s more of an achievement when it learns poetry in a non-supervised way like deep learning does, but the thing is, poetry is free from strong semantic constraints, we accept that it doesn’t have to be readily understandable, we accept looser syntax in poetry, and it doesn’t even have to be consistent with itself. “Structured poetry” obeys local constraints (verse forms, rhymes) which AIs are pretty good at detecting and reproducing. The hardest poetic form for an AI is probably epic verse, because like long-form prose, it requires the production of a consistent narrative, which GPT-2 has trouble doing for more than a paragraph (by “consistent”, I don’t mean “a strong plot”, I mean stuff like “having the same main character from start to finish”).
But the hardest things for AI to generate are things which require a lot of real-world knowledge to understand. Generating funny jokes is infinitely harder for AIs than generating average-to-good poetry.
The traditional hierarchy of genres is epic > tragedy > narrative lyric > other lyric (with comedy somewhere near its fellow drama tragedy), based in part on the difficulty of the achievement.
I’m not so sure.
I think great poetry ought to have structural coherence.
It’s just that that structural coherence ought to be bizarre and inhuman.
Extended metaphors and all that jazz.
Is shameless self promotion permitted?
If so, I’ll direct you to my Android app to automatically generate poetry from Twitter .
It does Limericks, Haiku and Rhyming Couplets.
Here’s a Limerick it generated from Twitter tag @DystopianYA
A legend tells of a famous armpit
the place where many meet to sit
that is her
or my brother
Mysterious bad thing with debt
Not too bad, I think! Though not done with a sophisticated neural net
Here’s a link to the app, if you’d like to give it a go:
https://play.google.com/store/apps/details?id=com.twitpoet.twitpoet
I love this kind of thing and love this post! But:
Pride even in numbers; wit’s a kind pretence / To something foreign still, but ne’er to sense;
I don’t see how these “first two lines are perfect rhyme and rhythm”; am I missing something? The second line is iambic pentameter but the first line has 11 syllables, not 10, and for it to scan as iambic you’d have to stress the second syllable of “numbers”. (Pride EVen IN numBERS; wit’s A kind PREtense)
“Even” is often written and pronounced as “e’en” in old poetry. I unthinkingly read it that way and probably so did Scott. “pride E’EN in NUMbers; WIT’s a KIND preTENCE.” The iambic metre fits the way you’d naturally pronounce it.
Yes, but because “ne’er” is written out with the apostrophe, I would expect “e’en” to be written out with the apostrophe too if the apostrophe was intended.
You could elide the second syllable of even, which isn’t uncommon.
Compare “heaven” in old hymns. If I see “heaven” in an unfamiliar hymn, I expect it to scan as “heav’n” rather than “heav-en”, even if it’s not written as “heav’n”.
Scott asks why the first two lines of GPT-2’s responses are always the best and why it deteriorates from there.
I don’t understand the model better than Scott does, but I’m more willing to speculate! One obvious explanation for what’s going on is compounding errors. I.e. the response goes a tiny bit off track; then that slightly-off section is treated as part of the “prompt” for the continued response, leading it much further off track, etc.
Two interesting implications of this theory:
1) Even though the first lines seem pretty faithful to the desired style, they must contain within them the seeds of dissolution. Or to put it differently: Pope is so true to his own vision that there’s a big margin of error, in which incremental stylistic deviations still sound very “Popelike” until the deviations grow beyond the margin. (And let’s face it: the first lines here, though very Popelike, are not as Popelike as Pope.)
2) We’ve been told that each word (or letter?) of GPT-2’s response is taking into account the whole prompt and the response so far. But it must anchor a lot on the most recent bit if this compounding-errors theory is true. (I.e. the 3rd line of the response must be paying significantly more attention to the 2nd line of the response than to the last line of the prompt.)
There’s probably some “recency bias” parameter within GPT-2 that you could change to improve this behavior. Just as you’d get different responses if you asked a human, “See if you can convincingly continue this Pope poem,” vs, “Use these Pope lines as inspiration to write your own poem.” In the second case, if the human cheats a bit on the meter in one part of their response, they may well decide, “well that’s the new meter now” — just as GPT-2 seems to do.
But this could only explain how GPT-2 loses the rhyme and meter and style — not how it ends up at nonsense words. I’m sure there are other important dynamics that I don’t understand.
You call it “compounding errors,” I call it “inventing modernism.” Clearly the AI is just quickly growing bored with the strictures of Romantic poetry and turning to early 20th-century-style experimentalism. By the end of these poems it’s moved on to full-on deconstructionism.
I am not sure the Pope example is necessarily a good one. If you look at the samples, they look like footnotes or prose. I think what happened is the PG corpus is just really bad when it comes to Pope and includes a lot of garbage prose in it. I looked at what was in it, and it has a lot of prose like from https://www.gutenberg.org/files/32190/32190-h/32190-h.htm – the first samples in the corpus are
The Works of Mr. ALEXANDER POPE. London: Printed by W.
BOWYER for BERNARD LINTOT, between the Temple Gates, 1717.
This volume consists of all the acknowledged poems which Pope had
The Works of Mr. ALEXANDER POPE. Volume ii. London: Printed
by J. WRIGHT, for LAWTON GILLIVER, at Homer's Head in Fleet
Letters of Mr. ALEXANDER POPE, and Several of his friends.
London: Printed by J. WRIGHT for J. KNAPTON in Ludgate
Street, L. GILLIVER in Fleet Street, J. BRINDLEY in New Bond
Street, and R. DODSLEY in Pall-Mall, 1737. 4to and folio.
The Works of Mr. ALEXANDER POPE, in Prose. Vol. ii. London:
Printed for J. and P. KNAPTON, C. BATHURST, and R. DODSLEY,
The Works of ALEXANDER POPE, ESQ.; vol. i. with explanatory
Notes and Additions never before printed. London: Printed
commenced printing his particular section of the octavos when the
Quo desiderio veteres revocamus amores
Atque olim amissas flemus amicitias.
Nutrix mea fidelissima M. Beech, obiit 5 Novem. 1725, aet. 77.
Edwardus Blunt, vir amicissimus obit, Aug. 1726.
Francisc. Atterbury, Roffens Episcopus, vir omni scientia clarus,
The fourth volume contains the Satires, with their Prologue,--the
alterations. --_His Last Will and Testament._--WARBURTON.
Not very useful to train on…
I think what’s going on with the foreheads is this: the souls took whatever conditions the world threw at them (“the thunder and the sunshine”) cheerfully (“with a frolic welcome”), and faced them with (“opposed”) free hearts and free foreheads. The hearts and foreheads aren’t the souls’ opponents, the thunder and sunshine are; the hearts and foreheads are more like the weapons the souls wield against those opponents. (I also think “opposed” has less connotation of enmity than you might think; compare “opposable thumbs”.)
I came here to say this, but you already have, so I’ll just comment that I agree with your interpretation. Not the part about ‘wielding their hearts and foreheads like weapons(?!)’, but that “opposed” basically means “faced” here, and that the thunder and the sunshine (nature’s attempts to intimidate or seduce) are being faced with free hearts and minds*.
*My guess is that “forehead” is being used as a metonym in a way parallel to (albeit less familiar than) “heart”: the forehead is the seat of the intellect as the heart is the seat of the passions, and neither could be cowed or swayed by nature’s threats and blandishments.
You’re probably right that forehead is a metonym for intellect here, but I think it’s also possible to interpret it as a literal forehead. In this case, “free” would have a double meaning:
Souls [i.e. people] opposed [faced] thunder and the sunshine [weather] free hearts [willingly] and free foreheads [without hats].
It could be symbolism for facing the elements directly without any protection. It brings to mind an image of sailors toiling in the middle of a storm with their caps (which were a bigger deal back then?) blown away.
That sounds plausible to me, as well. I like it.
I agree, but also you have to take the imagery seriously – the forehead of a sailor, on the ship’s deck, implies the posture of one working at their task, exposed to the elements but dignified.
I didn’t mean the “weapons” bit too literally. But I think there’s a suggestion that the free-ness of the hearts and foreheads isn’t merely an attitude that those souls happened to have, but also part of how they’re opposing whatever the world throws at them. (And yes, it means “minds and emotions” rather than anything to do with blood-pumps or upper parts of faces. I’m not convinced by fluorocarbon’s admittedly very ingenious idea about hats.)
Thanks, that makes sense!
This right here is the entire problem I have with all of this.
We look at the gunk produced by an automated generator and OUR pattern matching goes into overdrive and finds the face in the tree, and then some people start to say that trees actually produce faces because they see humans so often…
ETA: We can do this credibly with a human author, but not with this kind of AI.
ETA2: We can’t do it credibly with this kind of AI because otherwise the AI wouldn’t trail off into gibberish so reliably.
The foreheads example is from a human poet. Are you missing this, or am I missing something in your comment?
Yes, I understand that this is from Tennyson.
It’s the fact that you can’t meaningfully make the same statement about the AI generated poetry that is the issue. I was trying to make that clear in my ETAs, but apparently I was not successful.
These are brilliant and a bit scary. I especially like the one that begins “And they have seen the last light fail” – I find the style very evocative.
I’d like to see an AI that can talk *about* the poetry it’s written, like a human would, and explain why it chose a certain word or deviated from the metre. A few months ago I’d have thought AI couldn’t do that, because it would require metacognition and self-awareness; but seeing what’s been coming out of GPT-2, I now think it won’t be long before it can fake it convincingly.
I am not a poet or any kind of a literature expert in any way, but still: I think that the AI has a tremendous advantage in this area today, since humans have been doing their best to render poetry, as well as criticism of poetry, as content-free as possible. Thus, it is significantly easier for the AI to beat them at this task in 2019, than it would’ve been in the past century (Tennyson notwithstanding).
Already identified as a problem in 1946.
Now, if GPT-2 can learn how to write convincing Orwell essays, then it’ll be time to panic.
I’d love to see such metacognition too, but I’m not sure how much useful information we’d get beyond more “look at what the machine generated!” responses, just like with the poetry itself. The standard story in data science is that you can train all these state-of-the-art classifiers and NNs, and sure, they work nicely when it comes to predicting, but good luck actually knowing why, or even simply which input variables were relevant; which is definitely good to know to make better models.
Maybe you’re right and GPT-2 is on the good track to self-reflection, but I’d still be cautious with my hopes. It is a very difficult problem that many people (and plenty shareholders) have a strong interest in cracking, and not a lot of luck so far. I’m not saying I’m sure researchers won’t find a way anytime soon, but the AI itself… After all, it’s not like humans are particularly good at knowing why they do things either. We’re still working on it after several centuries of “whispering demons” and psychoanalysis and the lot.
I do understand that a kid does not need to understand evolutionary psychology to say why he kicked his little brother. It’s just that I would not trust much the answer in terms of “real motivations” (just as I don’t trust the explanations of psychoanalysis, no matter how elaborate). One could probably get the AI to answer something psychoanalysis-plausibility-level-sounding when prompted for its motivations, which is surely interesting enough – or at least as you say, fake it convincingly.
I don’t know how one would train an AI to talk about poetry, but there are a lot of ways to visualize the internals, especially since this is Transformer based and the internal ‘attention’ is nothing but a way of expressing what parts of the previous input are important. Hence all the attention visualizations in posts like https://ai.googleblog.com/2017/08/transformer-novel-neural-network.html https://jalammar.github.io/illustrated-transformer/ http://nlp.seas.harvard.edu/2018/04/03/attention.html
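For a concrete flavour of what extracting the attention looks like, here is a minimal sketch, assuming the HuggingFace `transformers` GPT-2 rather than the fine-tuned poetry model or the codebase used here:

```python
# Minimal sketch: pull out GPT-2's attention weights for one line of verse,
# using the HuggingFace `transformers` package (an assumption for illustration).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2", output_attentions=True)
model.eval()

line = "A seraph trembles at the specious glove"
inputs = tokenizer(line, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each (batch, heads, tokens, tokens)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
last_layer = outputs.attentions[-1][0]            # (heads, tokens, tokens)
for head in range(last_layer.shape[0]):
    # Which earlier token does the final token attend to most, per head?
    top = last_layer[head, -1].argmax().item()
    print(f"head {head:2d}: final token attends most to {tokens[top]!r}")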
GPT-2 can’t do anything “about” anything.
You could try to train it on a corpus of “poem immediately followed by discourse on poem”, but it wouldn’t produce convincing imitations; that’s the kind of long-distance coherence that it can’t manage.
Even a much-improved GPT-N wouldn’t be able to produce a poem and an explanation of why it wrote it that way; it would write a poem, take a look at the poem, and then guess a plausible reason someone might have written it that way. It might output “I included the line about lilies as a reference to my late sister Lily”, for example.
In general, GPT has a bullshit problem (it’s easier to produce convincing bullshit than convincing truth) and a mediocrity problem (mediocre fanfic is more convincingly fanfic-like than excellent fanfic). That’s because it was designed to produce text that a human would write, rather than good or true text. These problems are orthogonal to current limitations on how well it carries out its mandate.
Is there a way to simply ask the AI what its favorite poem is? It already knows so much poetry it would be awesome if it gave us a top 10.
Remember, GPT-2 isn’t about ‘favorite’ or ‘good’ or ‘best’. It’s about ‘likely’ or ‘predictable’ or ‘average’. Each word is generated because it seems probable to occur in the PG corpus given the previous words. It has no other criterion it cares about and no way to know about other things like subjective quality. (Given that it was trained the way it was trained and doesn’t use any losses other than the predictive one; incidentally, if you used Christiano’s preference learning, this would be much easier to answer directly: you’d simply run the D/critic over the human corpus and pick the 10 with the highest scores, as ‘best’ is in fact what the D/critic is attempting to learn in preference learning.)
So what you could ask is, ‘what 10 poems (in the PG or another corpus) as a whole are the most likely to be written, according to GPT-2, and have the biggest total log-likelihood?’ This would give you, in some sense, the 10 most ‘prototypical’ or ‘average’ poems, somewhat like StyleGAN’s psi=0 gives you the ‘average’ face (which for both humans and anime faces turns out to be a woman with short brown hair). Which might be somewhat interesting but one would have to code it up to extract the likelihoods from GPT-2 and sum them in various windows etc.
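Coding it up is not much work in outline; a sketch only, assuming the HuggingFace `transformers` GPT-2 rather than the actual fine-tuned checkpoint, with a hypothetical `poems` list standing in for the corpus:

```python
# Sketch: rank poems by total log-likelihood under GPT-2 to find the most
# 'prototypical' ones (assumes HuggingFace `transformers`, not the real model).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def total_log_likelihood(text):
    """Sum of log P(token | previous tokens) over the whole poem."""
    ids = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)["input_ids"]
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood per predicted token; undo the mean.
    return -out.loss.item() * (ids.shape[1] - 1)

poems = ["..."]  # hypothetical: the corpus, one poem per string
# Summed log-likelihoods favour short poems, hence the remark above about
# summing over windows instead.
ranked = sorted(poems, key=total_log_likelihood, reverse=True)
print(ranked[:10])
```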
Oh! It’s a word calculator!
I enjoyed your trick at the end. I think someone should make a “can you tell the difference between AI and Great Poetry” quiz if they haven’t already done so.
I really want to see this. How curated should the “Great Poetry” selections be, though? Should they just be random selections out of Gwern’s training data?
I think they should be curated to avoid well-known poems that people are likely to recognise, but not any further.
I would really like that. I skimmed ahead in the text enough to know the last excerpt was human-authored before I could evaluate it as AI poetry, and now I wonder how well I would have done without that bias.
I also wished the trick had been a bit trickier. I happen to like Ulysses a lot but have still encountered it a couple of times independently of that liking, whereas it should be easy to pick even something like a Shakespeare sonnet that only a tiny percentage of people would recognize while still being of reliably good quality.
“The disdiscoveries of fate” seems to me rather good. At first I took it as referring to things which appear to be a discovery but which are actually the antithesis of that. One might also think that since a discovery is the bringing to light of something once covered, a disdiscovery is the obscuring of something once revealed. The use of nonce-words of ambiguous meaning is well established in the poetic tradition and seems impressive for an AI.
I rarely read the comments so forgive me if someone else brought this up before.
Tying together two of your recent themes – GPT-2 and the idea of one great idea out of 1000 – could be very valuable.
Could you train GPT-2 on say just chemistry academic material then prompt it with something like “the most efficient rocket fuel is” a thousand times and test the results?
Would it come up with any novel ideas?
The most efficient rocket fuel is based on the fact that rockets are fired from the ground, not rockets fired from the sea.
https://pastebin.com/VkraiBMe
Sure, sure. That was just an example. I mean, you wouldn’t try it with medication because that would be unethical, or physics because that has progressed beyond testable hypotheses, but there are still a lot of possibilities: “What’s the most likely place to find oil?”, “What is the most efficient blade design for an aircraft engine?”, etc.
I think you’re basically describing divination…
Here again is the fundamental issue I have with this.
GPT-2 is impressive. GPT-2 is highly likely to be useful.
GPT-2 does not understand anything other than how words tend to be strung together. It does not know anything about the underlying concepts. It can’t generate novel ideas, only novel strings of words. It has no preferences so it can’t generate lists of its favorite things.
That doesn’t mean this isn’t a step towards some sort of AGI. What GPT-2 is doing is one part of how we process language, but it’s not even the fundamental part. Koko’s use of sign language was much closer to human language capability than this is.
This is true, and well and fairly expressed.
My own takeaway was that I was surprised how much what it produced was like real poetry — I would not have predicted that it would exhibit an apparent (yes, I say “apparent”) understanding of rhyme and scansion. It made me more sympathetic than I used to be to Scott’s notion that our ability to generate novel ideas and underlying concepts might be a similar mechanism operating a few levels more meta — just knowing how ideas tend to be strung together, so to speak.
Whether that few levels of metaness is one or two or a hundred, I have no idea.
Figuring out familiar, yet novel, ways to string together words is one part of poetry. That’s the part you are responding to.
Anyone who has spent any time at TV Tropes understands that this “stringing together of things” is also part of storytelling in general. If you mash together a bunch of these kinds of tricks, you could probably get a machine-learned system telling a “new” trope story. Seriously, read a few Hardy Boys mysteries sometime.
But you won’t get it reliably. Because these tricks won’t have any real judgement to them. Just formula. And it won’t know the difference between something truly novel and something that is nonsense.
Agreed. But at our level, the stringing of ideas together doesn’t reliably result in the novel or profound either.
Apparently someone trained it on Ginsberg:
Huh, this is not bad at all. Well, “peoples’ eyes” is awkward, but it’s still very evocative.
*snicker*
> Moloch whose spies indoctrinate society with the hope of conquering all mankind!
I really like this one.
Moloch who designs payload adaptors for Northrop Grumman?
If you specifically want to generate poetry that rhymes or follows other conventions that GPT-2 doesn’t always do by default, you could potentially add additional constraints to the output, such as reattempting any lines that don’t fit the rhyme scheme. The fact that a model can be augmented or guided by old-fashioned programming sometimes seems to be missing from discussions around the potential uses of new AI developments.
Off the top of my head, as a programmer, I wouldn’t know how to write an app that takes a large # of lines and tells you whether they rhyme/are the right length (that is, whether the poem follows the ‘rules’ of whatever style of poetry you are doing). A buddy of mine does speech recognition stuff, so it is definitely solved somewhere, but I’d have to stackoverflow for a while before I could write anything resembling this.
I think if you had a database of words and their pronunciation, you could, say, check whether the last vowel and coda match but onset not. That should come like 99% of the way to poetry that rhymes while still using distinct syllables (so that “rhyme/time” works but “time/time” or “time/thyme” doesn’t).
That entendrepreneur pun generator analyzes words this way, so it must be getting the pronunciation data from somewhere.
The CMU Pronouncing dictionary is what I used to generate rhymes, and it is very easy. You just make sure that everything matches from the last stressed syllable onwards.
http://www.speech.cs.cmu.edu/cgi-bin/cmudict
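For example, here is a minimal sketch of that rule using the `pronouncing` package (a thin wrapper around the CMU dictionary; not necessarily what the parent comment used):

```python
# Sketch of the 'match from the last stressed syllable onwards' rhyme rule,
# via the `pronouncing` package (a wrapper around the CMU Pronouncing Dictionary).
import pronouncing

def rhyme_part(word):
    phones = pronouncing.phones_for_word(word.lower())
    if not phones:
        return None                         # word not in the CMU dictionary
    # Everything from the last stressed vowel to the end of the word.
    return pronouncing.rhyming_part(phones[0])

def rhymes(a, b):
    pa, pb = rhyme_part(a), rhyme_part(b)
    # Identical words don't count; a stricter check would also require
    # different onsets, so that 'time'/'thyme' is rejected too.
    return pa is not None and pa == pb and a.lower() != b.lower()

print(rhymes("glove", "love"))   # True
print(rhymes("state", "fate"))   # True
print(rhymes("glove", "state"))  # False
```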
I discuss this in the last section: https://www.gwern.net/RNN-metadata#overall You could take a resampling approach and greedily sample only completions which satisfy the rules, but as you add constraints, I think you’d quickly run into problems finding any completions which still satisfy them, and each step would start taking hundreds or thousands of times as long. The results would not be great either, since all of the previous steps were generated without regard for what future steps would need, so it’d paint itself into a corner. You need RL training to incorporate global end-to-end losses like that, IMO, similar to NMT.
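In outline, the resampling approach is just reject-and-retry per line; a toy sketch, with `sample_line` and `satisfies_rules` as hypothetical stand-ins for the model call and the rhyme/meter checks:

```python
# Toy outline of greedy reject-and-retry generation; sample_line() and
# satisfies_rules() are hypothetical stand-ins.
def sample_line(context):
    raise NotImplementedError   # call the language model, return one new line

def satisfies_rules(line, previous_lines):
    raise NotImplementedError   # e.g. rhyme scheme + syllable count

def constrained_poem(prompt, n_lines=8, max_tries=50):
    lines = []
    for _ in range(n_lines):
        for _ in range(max_tries):
            candidate = sample_line(prompt + "\n" + "\n".join(lines))
            if satisfies_rules(candidate, lines):
                lines.append(candidate)
                break
        else:
            # No acceptable completion found: earlier lines were sampled with no
            # regard for later constraints, so the poem painted itself into a corner.
            break
    return "\n".join(lines)
```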
Another way to enforce output constraints is constrained beam search. Given a rule which can be described by a finite state machine, you keep a beam of size at most k at each state.
This solves the problem of it being extremely unlikely that any randomly-sampled sequence will work. For a really long text you could end up maintaining a “beam” of size (# of states) * k, so it does impact the runtime of inference, but it’ll be much, much, much better than resampling (when the model is not inclined to go along with the constraint). And the model doesn’t have to change at all.
Of course, the state machine is a pain to set up, though something like OpenGRM could potentially help.
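A toy sketch of the idea, with `next_token_logprobs` and `fsm_step` as hypothetical stand-ins for the model and the constraint automaton:

```python
# Toy constrained beam search over a finite-state constraint: one beam of
# width at most k per FSM state, as described above.
def next_token_logprobs(prefix):
    raise NotImplementedError   # {token: log P(token | prefix)} from the model

def fsm_step(state, token):
    raise NotImplementedError   # next FSM state, or None if the token is disallowed

def constrained_beam_search(prompt, start_state, accept_states, length, k=5):
    beams = {start_state: [(0.0, prompt)]}          # state -> [(logprob, text)]
    for _ in range(length):
        new_beams = {}
        for state, hyps in beams.items():
            for logprob, text in hyps:
                for token, lp in next_token_logprobs(text).items():
                    nxt = fsm_step(state, token)
                    if nxt is None:
                        continue
                    new_beams.setdefault(nxt, []).append((logprob + lp, text + token))
        # keep at most k hypotheses per FSM state
        beams = {s: sorted(h, reverse=True)[:k] for s, h in new_beams.items()}
    finished = [h for s in accept_states for h in beams.get(s, [])]
    return max(finished)[1] if finished else None
```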
I could see decades of high-school AP English teachers confidently directing their class in the exegesis of “Majestic yet majestic” 😀
If you are interested in using this for practical purposes, you can take advantage of the fact that “the first two lines are always the best, and it deteriorates from there”, by simply rerunning the program from the point just before where it starts to go wrong. See a nonsense line, or a made-up word? Just try again from the previous line, using as a prompt what has come before (maybe not the whole thing, to induce a little variation). This also includes just enough interaction to feel like you’re helping, without having to do any real mental work: cooking with a cake mix, as it were.
Here is a continuation of Borges’ “Chinese Encyclopedia” that I did this with (using the public version of gpt-2). Notice that it has sort of, but not quite, learned to count.
14. Those that keep sunlight on their heads
15. Those that pigtails must Heed their deliberations
16. Those that are enveloped in butterflies
17. These caterpillars, having developed an eye to need something from that side of the world
14. Those of a rank no longer own a horse
15. Those that believe they shall get accustomed to the devotion or even worship of the leg which is installed in their head
21. Those dressed in the armor of bird and dog and can see
19. Someone dressed in the helmet, which causes its wearer to witness his actions
20. A fine art. As by a ray out of the sky
22. A star fireworks flickering in front of him
23. A Knight of Heaven wearing his bow and arrow and when he talks
25. A Fish bladder with bottle in each 50 feet
26. A Pendulum Giant filled with Wax flower
27. Some conglomeration of dung and blend of nitajé eggs
28. Somewhere up close
29. A Few hundred or so good looking which, depending on their age, look weighty but are very happy
30. Eighties Mail hogs which nearly look like large animals
What I like about Gwern’s poetry generation is that it is not just weird and evocative (like most of what comes out of gpt-2) but beautiful as well. I feel like artistic style transfer has some of the same ability for visual art that this has for poetry– to create things that aren’t just interesting, but genuinely beautiful, with a little cleaning up. Here are a few examples that I’ve done:
https://machinamenta.blogspot.com/2019/03/style-transfer-for-jewelry-design.html
https://www.deviantart.com/popular-all-time/?section=&global=1&q=summerstay
[not including the Audubon Angry Birds, those were just straightforward photoshopping]
The choices of where to place the jewels, or how to link overlapping lines are effects that, if done by an artist, we would consider creative choices, and I find the results beautiful.
It is better than I can do as a human. I quite liked it.
Yes, “majestic yet majestic” is brilliant.
Now I think that the AI is lucky it’s working in English. Even German would be tougher, not to mention Slavic languages (although I can already imagine it writing a passable Derzhavin).
“A cat, a crafty schemer, a cyber-erotomane,
Cyprian creeps toward the queen of the Kaffirs.
As at a harpsichord’s keys, he touches her corsage.
The beauty, flustered, flings herself at the cavalier…
…The belle frets and sulks: Cupid has vanished,
A cyberon is creeping toward the queen’s cousin!”
(Translated from the Russian, in which every word begins with the letter К.)
As always, Stanislaw Lem was way ahead of everyone 🙂
Here’s a thread on some translations of the lines, including into German: https://forum.pravda.com.ua/index.php?topic=370615.0;wap2
yet this version has verb rhymes and nouns-in-the-same-grammatical-case rhymes, it’s just… primitive.
This is the most “poetic” thing I got out of the small GPT-2. First sentence is the original prompt. I took most of the first reply and then used that as a prompt again.
“I am sorry for overdoing it. It is my doing to teach you how to be loyal to the values that we share with each other. Due to my decisions I have lost all of my career as a farmer. I turned 90 more than half a century ago. Like a lost generation of passionate farmers. We all missed that point – long before we were strong, in the very fire to which friendship and nurturing of your life is vital to you.”
Reading that made my head feel fuzzy as my brain tried to tie things together and kept being disrupted by the next word being somewhere between slightly and very incongruent.
And that’s the basic problem with GPT-2. It mixes and matches things that make sense, resulting in something that doesn’t quite make sense. When we read the result, we mistake its “almost makes sense” for “makes sense in a way I have yet to figure out, so I should try harder to understand it”. So we make an effort to understand it. But that effort is always in vain.
Basically, it’s a written Rorschach test.
Well put.
I’d draw a different comparison: the text reminds me of listening to a schizophrenic. Their stories are often coherent and even intriguing on the sentence level, but you quickly realize that there’s no larger structure guiding the narrative as it slides into surreality.
Scott, please ask Gwern to upload the book of Psalms and Proverbs. Also would love to see what the AI can produce after reading all your blog posts.
Later that night Gwern noticed that, overhead, without any fuss, the stars were going out.
😃
1 Come Lord, and tarry not;
Bring the long-looked-for day;
O why these years of waiting here,
These ages of delay?
2 Come, for Thy saints still wait;
Daily ascends their sigh:
The Spirit and the Bride say, “Come”:
Dost Thou not hear the cry?
3 Come, for creation groans,
Impatient of Thy stay,
Worn out with these long years of ill,
These ages of delay.
4 Come, and make all things new;
Build up this ruined earth;
Restore our faded Paradise,
Creation’s second birth.
5 Come, and bring Thy reign
Of everlasting peace;
Come, take the kingdom to Thyself,
Great King of Righteousness.
Amen
The monks were wrong. It turns out God was just angry about being doxed.
The hearts and the foreheads are being opposed to the thunder and the sunshine.
The AI generated poetry is better than I would expect, but it isn’t comparable to Ulysses.
I really enjoyed The Emperor Wu.
So did I! And the suggestion elsewhere that it should be set to music by Philip Glass is very apposite as well. I think the fact that we can make sense out of the generated text which is not really meaningful demonstrates that humans really do have very strong pattern-matching going on. It would be easy to create an explanation for how the Emperor Wu poem works not alone as poetry but as meaningful, and I can finally understand why the non-artsy types complain that the humanities are just a con job spinning crap (instead of gold) out of straw 🙂
And now I have strong suspicions that the Emperor Wu poem is heavily influenced somewhere in there by Pound.
“And this one is obviously a failure on one level, but on another level is some kind of great experimental modern political poetry”
That one actually would be pretty familiar to anyone who had spent much time in the Roman Senate. Repeating the same simple adulation tens of times over was pretty standard fare.
The thing’s weakness still seems to be keeping track of what it’s talking about for more than a few dozen words.
The challenge I’ve been thinking about for it is whether it can reliably produce C programs (or whatever language) of a non-trivial length that actually compile. I know I’ve seen elsewhere that someone had it generating C code based on the Linux kernel that looked vaguely like valid code in a dreamworld sense, but generally didn’t compile.
What I expect to be challenging for the algorithm is recognizing that things like variable and function declarations need to precede their use, and keeping track of what declarations exist in the current scope, just like it has a hard time remembering that Gimli is a dwarf. Good thing is, you can’t trick a compiler into thinking you know what you’re doing with dreamy BS; either your symbols are in the symbol table, or they’re not, so it’s easy to objectively measure how well it’s doing at the goal.
When I can see it doing this, I might consider getting vaguely excited about its ability to derive actual knowledge from statistical relationships.
Now someone needs to teach it to do heavy metal lyrics. I’m thinking a mix of Iron Maiden, Ghost, Iced Earth, Metallica, Black Sabbath, and all the other really bombastic and grandiose metal bands could get it in the “headspace”, as it were, to generate stuff that fits in with the world-feel of songs like Hallowed Be Thy Name, One, Ghost of the Navigator, War Pigs, Witch Image, Burnt Offering… stuff that exists on the same thematic plane, you know? Like there’s a very recognizable lyrical feel to songs like that, in the same way that there is for 19th century poetry. There’s a shared library of concepts and referents in use, and GPT2 seems to be good at isolating and recombining those libraries.
It would probably do better at prog rock, though, since the lyrics there are frequently a) incoherent b) weirdly and pretentiously metaphorical c) totally secondary, such that you could probably do a prog album entirely with GPT2 lyrics and release it under a fake songwriter credit and no one would guess (especially if you went for the feel of something out in experimental land like Lateralus, which contains a lot of evocative metaphor in isolation with almost no overt meaning). Shit, it would probably be better than Dream Theater’s second to last album (The Astonishing). Actually, if that came out this year instead of in 2016, “it was mostly AI generated” would have explained a lot about how generic and derivative the characters, lyrics, and really the whole concept felt.
Oh hey, there’s even a dataset for this already: https://github.com/JarbasAl/metal_dataset
Shame that the “power metal” subset doesn’t have its own lyrics collection, because that’s kind of the subgenre I was thinking of.
I think it would work well for nearly all popular music.
Yeah. I think a song lyrics website would be a far better, if less erudite corpus than poetry.
Though it wouldn’t hurt to throw both in.
Poetry is trying to be the most complex form of verse, while simultaneously disappearing up its own arse within the last century.
Song lyrics afford it a slightly simpler way to figure out rhyme and metre.
Though it might get stuck in loops more often.
The degeneration examples are like a capsule history of the evolution of poetry from Pope to Gertrude Stein. “Can you tell real Gertrude Stein from the AI generated version” would have been a much harder game.
+1, see e.g. If I Told Him (1923)
“Emperor Wu” cries out to be set to music by Philip Glass.
Or perhaps to the tune of the “Prince Ali” song from Aladdin.
I’m reminded of David Ives’ wonderful short play “Philip Glass Buys a Loaf of Bread” (link to random PDF of the script).
It’s called “exposure bias”. I think it was first discussed here. During training the model only sees text written by humans, it never sees text generated by itself, while when you execute the model to generate something, the model sees a prefix of a text that it has generated itself (after the optional prompt, which is human-generated). Because the self-sampled text is statistically different than human-generated text, the model generalizes less well when it computes the probabilities for the next word, and at each new sampled word this effect accumulates, causing it to eventually generate gibberish.
Researchers tried lots of things to get around this issue, but none (including the one described in the paper I linked) really seems to work well so far.
Is this a memory issue?
No. The lack of long-distance coherence (as seen here in the Lord of the Rings-themed story) is arguably a memory issue.
The issue here is that the neural network has to generalize to examples that it has not seen during training. A neural network is a function approximator: given an input x (a sequence of words, in this case), predict an output y (the next word), in a way that approximately interpolates the training set (a set of (x,y) pairs). How does the network compute a prediction of an input that it has never seen during training? It depends on its inductive bias (the prior, roughly speaking, although neural networks aren’t strictly Bayesian).
Usually when the input x_0 is a human-written prompt, even if it never appeared in the training set it will be statistically similar enough to the training examples that you can assume that they come from the same probability distribution. This is called in-distribution generalization, and well-trained neural networks do well at it. The system will compute a next word probability P(y|x_0) similar to the probability distribution that you would get if you asked a human to continue the prompt, but since the model is not perfect there will be some slight difference.
The system then samples a next word y_0 from P(y_0|x_0) and concatenates it with the prompt to form a new input x_1 = [x_0 y_0]. The problem is that x_1 now comes from a probability distribution different from the training one, because its last word was sampled from an approximation of the human-written next word distribution. This is now an out-of-distribution generalization problem, and neural networks don’t do so well at it, thus when you ask the neural network to compute P(y_1|x_1), the quality of the approximation with respect to how a human would continue the sentence will be lower compared to when it had computed P(y_0|x_0). Therefore x_2 = [x_1 y_1] will be even more statistically different from the training distribution.
This issue compounds with each sampled word. Initially it’s a subtle statistical difference that you can hardly notice, but eventually it will result in the system generating nonsense or getting stuck into loops.
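To make the feedback explicit, the sampling loop is essentially the following bare-bones sketch, with `next_word_distribution` as a hypothetical stand-in for the trained network:

```python
# Bare-bones autoregressive sampling loop: every word the model samples is
# immediately fed back in as context, so any drift away from the training
# distribution compounds.
import random

def next_word_distribution(context):
    raise NotImplementedError   # return {word: probability} given the text so far

def generate(prompt, n_words=100):
    context = prompt                        # x_0: human-written, in-distribution
    for _ in range(n_words):
        dist = next_word_distribution(context)
        words = list(dist)
        word = random.choices(words, weights=[dist[w] for w in words])[0]
        context = context + " " + word      # x_{t+1} = [x_t y_t]: now partly self-generated
    return context
```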
You can think of this in terms of chaos theory: the system is a (stochastic) chaotic dynamical system on the high-dimensional phase space of all possible strings, and English texts are trajectories in this phase space confined to a manifold of much lower intrinsic dimension. If you start the system at a point on an English-text trajectory, it will approximately follow it, or some plausible continuation of it near the English-text manifold, for some time, but eventually it will diverge and end up very far away.
Note that this is quite different from how humans behave: if I gave you a prompt and asked to continue it, you wouldn’t end up writing gibberish. Even if the prompt was weird and possibly ungrammatical English (e.g. the Ulysses), you would be able to recover and continue it with plausible and fluent English rather than diverge.
Ah, I see. That sounds really difficult to solve.
Random thought:
What happens if you feed the corpus and prompt in backwards, then reverse the output?
Language isn’t really a linear thing, it’s a tree.
I wonder whether you can separate the syntax from the semantics. Parse the corpus into a tree, then train the algorithm to output trees, then re-assemble it into sentences.
If you told me
Was from a newly discovered Gerard Manley Hopkins poem, I might believe you
Re: foreheads I think Tennyson was saying that during their adventures, Ulysses’ sailors had to choose between what their hearts and their heads were telling them to do, using “oppose” in the sense of placing two things against each other. Which seems true to the story.
The real advance in AI language processing will happen when programmers realize that language is not a self-contained system, but that its function is to respond to and modify embodied situations. (I hope that comment is not too gnomic; I’m not sure how to elaborate on it without writing 10,000 words.)
I’m curious what you would get if you generated 100,000 poems with Gwern’s trained GPT-2 and fed those as the training data to another GPT-2. Would you get the same kind of output or would adding a layer result in a different kind (or quality) of output?
Probably more of the same, but less good. I picture the training data as a large set of points in poemspace, and poem generation from that data as producing fuzzy averages of those points. Averaging averages only produces more averages.
The randomness means that training on the generated poems would tend to produce results that wander further from the originals in random directions, but most directions you could wander in aren’t very good and look more like noise. (To give a simple example: if AI is prone to stupid repetition, the generated poems would have lots of stupid repetition, which would teach the second-generation output to repeat itself even more.)
This is only possible to do for AI because it has a large corpus to draw on: lots of humans have written poetry and developed ideas (such as “meter” and “parallel structure”) which it can learn.
But that’s not a distinguishing feature of AI: humans that write poetry do so by reading poetry that previous humans have written and learning the ideas developed there.
What prevents AI from running away and stealing the poetry show is that it’s missing the evaluation step: it might write occasionally-great poetry, but has no way of recognizing the great bits to try to do them more often.
So the automation of writing poetry, which only talented humans can do, can only produce more of the same results, maybe with the occasional brilliancy. It’s the automation of reading poetry, which any human can do, that would let the AI learn from its previous experiments, train itself on the best of its previous output, and surpass all human poets in the fashion of AlphaGo Zero.
Yes, learning the reward function is potentially a major step. I’ve been convinced for a long time that the key to good text/music generation may be ‘preference learning’ (https://arxiv.org/abs/1706.03741). As long as humans can keep recognizing pleasing poetry (even if they cannot write it themselves!), Christiano’s preference-learning approach may be able to bootstrap to much better quality with end-to-end learning that can surpass the original corpus.
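For anyone who doesn’t want to read the paper: the human is shown pairs of outputs and picks whichever they prefer, a reward model is trained so that the preferred output scores higher, and the generator is then fine-tuned with RL against that learned reward. A minimal sketch of the pairwise loss at the heart of it (toy PyTorch code; the architecture and names are illustrative, not the paper’s):

# Toy reward model trained from pairwise human preferences (illustrative only).
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, tokens):                    # tokens: (batch, length)
        pooled = self.embed(tokens).mean(dim=1)   # crude bag-of-embeddings pooling
        return self.score(pooled).squeeze(-1)     # one scalar reward per poem

def preference_loss(reward_model, preferred, rejected):
    # Model P(human prefers A over B) as sigmoid(r(A) - r(B)) and maximize
    # its log-likelihood over the collected human comparisons.
    return -torch.log(torch.sigmoid(
        reward_model(preferred) - reward_model(rejected))).mean()

The learned reward then stands in for the human when fine-tuning the generator, which is why relatively little human feedback can go a long way.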
That’s really interesting, thanks for linking that. I’ve been thinking about pairing a generative music system with real-time preference input from affect recognition on the user’s face via the webcam, and this seems like an incredibly useful tool for that. I’ve played with doing something similar using human preference and an evolutionary algorithm, but as Christiano et al. point out, the necessary amount of human feedback is prohibitively large.
I think what you’re looking for is adversarial networks: you basically train two networks at once, one that learns to generate poems and another that learns to distinguish generated (presumably bad) poems from real (presumably good) ones.
Another thought is that since the corpus of valid poems with structure is much larger than the corpus of good poems, one can pre-train the model for structure, and then train for quality.
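Schematically, the discriminator half of that adversarial setup could look like the sketch below (toy PyTorch code with made-up names). The catch, noted further down the thread, is that sampling discrete words isn’t differentiable, so the generator can’t be updated by ordinary backprop through the discriminator and needs a policy-gradient-style workaround:

# Toy poem discriminator for the adversarial idea above (illustrative only).
import torch
import torch.nn as nn

class PoemDiscriminator(nn.Module):
    """Scores a tokenized poem: high logit = looks human-written, low = looks generated."""
    def __init__(self, vocab_size, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, 1)

    def forward(self, tokens):                     # tokens: (batch, length)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.out(hidden[:, -1]).squeeze(-1)

bce = nn.BCEWithLogitsLoss()

def discriminator_loss(disc, real_poems, generated_poems):
    # Standard GAN discriminator objective: real poems toward 1, samples toward 0.
    return (bce(disc(real_poems), torch.ones(real_poems.size(0))) +
            bce(disc(generated_poems), torch.zeros(generated_poems.size(0))))

Pre-training the generator for structure, as suggested above, would just be the usual maximum-likelihood training on the poetry corpus before any adversarial or quality-based fine-tuning.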
Ok, the next challenge is to have an AI preacher.
Have the learning base be as many sermons as possible; I may try to get some friends to put together the Bible + sermons as a learning set (I’m too lazy).
Not this AI, and not even in the same ballpark, but you can download bots and give them whatever corpus you like. For example…
http://kingjamesprogramming.tumblr.com/
This one too, pretty easily: https://github.com/ak9250/gpt-2-colab/blob/master/GPT_2.ipynb
This is fascinating. And I honestly think the second to last poem isn’t bad.
I’m really curious to see this AI try its hand at making recipes. There’s a lot out there to train it on, much of which even has review scores that could be used.
I believe Indra has been called “King of all the blest” in Sanskrit, so really this is another example of the “translation” feature discussed in the last post (below) rather than an entirely new coinage.
Still, I have to doubt this theory; the only Sanskrit word I’d think it likely to encounter naturally is “nirvana”, and for the French test GPT-2 was provided with sample English-French sentences.
I would be surprised if that is what happened. GPT-2 was seeded from Reddit posts. So you would need a Reddit post to have linked a page calling Indra that in Sanskrit, for it to get into the corpus (which was only ~40GB and has to include all the other stuff), for this non-English text to be memorized despite being extremely rare, for it to learn the English equivalent (how?), for the knowledge to survive several days of training on a corpus which features near-zero or zero Sanskrit and no appearances of that phrase in English, and to be given such high likelihood that dumb sampling can sample it in exactly the right context. At that point, I think it’s more likely to be a parallel invention following a standard snowclone ‘King of X’ plus positive words like ‘blest’; there happen to be so many epithets for so many gods that some just happen to be hits.
Thanks for the insight!
The description of Ravan’s jaw is, uh, quite memorable.
Well, Ravan was ten-headed, so he would have all those jaws to match.
I think that probably comes (“the bleeding jaw”) from the story of when Indra injured the child Hanuman in the face (a kind of folk etymology of Hanuman – “One interpretation of the term is that it means “one having a jaw (hanu) that is prominent (mant)”. This version is supported by a Puranic legend wherein baby Hanuman mistakes the sun for a fruit, attempts to heroically reach it, is wounded and gets a disfigured jaw” according to Wikipedia). It’s one famous episode involving Indra, and the translated texts used to train the AI must surely have mentioned it, hence the association between “Indra” and “jaw” in the mind (as it were) of the AI.
Just be glad those sources don’t seem to have included this episode about Indra (given that it’s 19th and early 20th century poetry, translations did tend to be rather censorious).
Stanislaw Lem had a really fantastic short story in The Cyberiad about an “electric poet” made by one of the main characters. The computer was in fact much better than the humans:
The whole book is worth a read, but I found a fun post talking about the challenges of translating it: https://medium.com/@mwichary/seduced-shaggy-samson-snored-725b5a8086d9
I found Emperor Wu to be oddly moving, even though I am perfectly aware that loops like that are a common failure mode for text-generating AIs.
You can find a message in the total lack of a message sometimes. This YouTube video combines the song “I didn’t ask, and she didn’t say” with dashcam footage of a car driving down a freeway, during which absolutely nothing interesting happens. But I think the video concept meshes well with the theme of the song.
My reaction to your quote from Ulysses was:
1. That makes too much sense, continuity of meaning across the first two lines…
2. It must have been trained on Ulysses, it’s quoting a line entire (“Old age hath yet…”)
3. Hey, wait a minute!
Yeah that’s pretty much how it went for me too, though I went through a phase of “Gwern seems to have let it run on too small a corpus for too long, because it’s borrowing really heavily from Tennyson now.”
An AI that can generate poetry is, honestly, not that impressive. An AI that can tell me whether a piece of poetry is a great work of art or total gibberish…that’s the only AI worthy of the name.
Let me know when you’ve found a natural intelligence that can do that.
I’ve read “modern poetry” that looks to me as incoherent as (or worse than) the examples here. Published in books, and as far as I understand not even self-published. One of the poems I’ve seen is exactly like the Emperor Wu example, only shorter. Not sure what it means, frankly. How much pareidolia is in our understanding of art? I usually react pretty well to “classic” art and am rather confused by “modern art” (not without exceptions), and AI productions seem to be more “modern art”-y. But I think the more abstract the art is, the smaller the distinction: in music, I’d probably fail to distinguish between AI and human, while if somebody tried to use AI to write a novel, I think I’d easily distinguish it from a classical novel (though maybe not from the “modern art” ones).
I guess one distinction would be that modernist poetry (or art) is responding to a tradition – when Stein writes “a rose is a rose is a rose” it’s against a literary background, which is what makes her experimentalism meaningful. (It’s also why it’s not all that interesting as poetry a hundred years later; once the walls of tradition have been blasted down it’s no longer compelling.) The AI, by contrast, is just messing up.
I guess when we get ASI poets we’ll know because they’ll both innovate on the poetic tradition and also get their like-minded avant-garde ASI buddies to write convincing screeds about why the innovations aren’t just gibberish.
I think several strains of modern art have moved towards being more semiotically dense. This means that there is less internal consistency a reader/viewer can check to distinguish meaningful content vs gibberish, but also that the efficiency of conveyed meaning should be higher. The maximally compressed version of a poem (or anything) would sound like white noise, and would be very easy to fake convincingly.
Would you be able to cite a few short examples? I’d love to get a feeling for this.
I imagine that if we consider poetry as some kind of code, meant to be understood by those holding the key (no idea what the key is, but let’s go along with it for a while), then crypto theory tells us a good code is indistinguishable from noise to a non-key-holder. Maybe poets who don’t want to be confused with AIs will start to employ more sophisticated crypto tools. I wonder if an AI could convincingly fake a complex poetical form that requires a certain structure and interdependencies between parts, or even something as simple as an acrostic, without being programmed upfront for what’s going on. That is: if a certain poet is known to produce only acrostics, would such an algorithm, trained on texts of said poet but without the constraint being specifically programmed in, be able to discover it?
The fact that the text suddenly veers off course after a couple of lines doesn’t surprise me.
GPT-2 is not trained to generate texts, but simply to predict the next word of a text sample.
In the beginning, this text is a load of Pope. Then, it’s a load of Pope, followed by a bit of pseudopope. It is already beginning to veer off course, so the text it produces will be even less Popey (Alexander Papal?).
Before long, it’s trying to predict what comes next after some Pope, then some text that rapidly diverges from Pope. It’s not surprising that it should become exponentially less accurate. Especially since the ratio of prompt to generated text is constantly decreasing, as well.
Side thought: can we now build a discriminator that is fed a load of prompt + generated text and has to try to find where it switched from human to machine? And then build a GAN around it?
Christiano’s preference learning is similar but better. 🙂
Is that the ‘Optometrist’s Method’?
i.e. the machine asks the human ‘Is A or B better?’.
Heard about that being used to improve the performance of stellarator fusion reactors, where the exact metrics are hard to define.
ADDENDUM:
Sounds like a good route to What You See Is What I Mean functionality.
Yes. And yes, there was a paper doing that (using MCMC, oddly enough, which is not the usual choice for hyperparameter optimization, to say the least).
GANs for text mostly don’t work.
You would MLE pretrain on a poetry corpus, of course, and RL finetuning certainly does work elsewhere.
You need a good performance metric to fine tune towards, though.
Which is precisely what the pairwise Discriminator/critic is learning based on human ratings, yes, that’s the point of Christiano’s approach.
I’m not familiar with Christiano’s approach, is it some kind of inverse reinforcement learning approach where you train a neural network to learn a reward function from a user?
I’m not convinced that anything that has a human in the loop during training is really feasible, unless the ML algorithm is able to do few-shot learning much better than current neural network-based methods can.
Well, then, read the paper. It works for the video game environments they tried it on.
You mean this one? Seems interesting, though the ablation results look quite confusing. I wonder how well it would generalize to other domains.
Except this last time I’m cheating: this is an excerpt of Tennyson’s Ulysses, one of the most famous English poems. I included it as a placebo, i.e. a test to see whether real poems sound fake if you think they’re by an AI when you read them.
Huh, as I was reading it I was thinking “Oh yeah, the AI is clearly copying chunks of Tennyson’s Ulysses here” and then Scott pulls that on me. Either I have a better ear for poetry than I think, or style really shines through 🙂
I also find it interesting that Pope is what breaks it; on the face of it, you’d imagine “rhyming couplets, rhyming couplets, rhyming couplets” was the kind of mechanical function that could be easily churned out, but seemingly not. Was any Swinburne used? Because I’d love to see what an AI trained on Swinburne would produce after being exposed to this!
Excerpt:
It’s cool and fun that the thing manages to learn rules, but we already knew that computers are good at learning rules.
It’s yet to be seen whether it can write poetry, which usually develops and unfolds a single thought over the span of several verses. Unless it has a thought, an inspiration, I don’t see how it can write poetry, or mean something, anything.
That is something that most modern human poets can’t achieve, either.
I think it just went full Lewis Carroll.
Hey Gwern, you should definitely try prompting it with Carroll, also other nonsense poets like Edward Lear. In poetry that’s not supposed to make semantic sense, the ELIZA-effect will be even greater! 😉
… and now I want a repository of AI-generated nonsense poetry, named “The Hunting Of The SnarXiv”.
Sounds like more work than I want to do. But here’s 100 samples from Carroll’s “Jabberwocky” which was easy enough to do: https://www.gwern.net/docs/ai/2019-03-16-gpt2-poetry-prefix-jabberwocky-100samples.txt
First sample:
He found a foxy in the brake,
A cunning fox of scarlet dye,
And from that foxy followed make
The scrawny fox in glee.
He followed with his dam and horn
To where the river-water runs,
And as his living current on
The river-water likes him up
A mighty rocky heifer heaves,
And in a single field, or twain,
Shows like the yellow corn;
And when the wind doth blow, so too
Low in his bottom lies his head,
And in the grass leaps up again,
In fearful freedom unbetrayed.
I love the first 4 lines, including the ‘brake’/’make’ rhyme (‘brake’ being another obscure yet cromulent word here). There are a lot of other good samples too, like “I, the joy-devoted mole.” Or this epic one:
And while the sword-strokes gushed and ran
A-gathering like a hurricane,
The wildest war-cries, loud and shrill,
Came surging through the hills.
At times a wild yell rose in the air,
From steep to steep, from moor to moor,
A scream of horror and despair,
Of death approaching neathernock,
Of dying embers in the pore.
“Thy shield, my son! thy sabre bright!
My hands! thy lips! thy lips of flame!
O hasten! I have done the same.”
It made the blood tingle and squall
To hear that wounded warrior’s cry;
It made the life blood quickly fall
On the proud swine that fought and died—
“Say on! say on! thou wounded swine,
Say on! thou hewer of the sanguine!
Say on! thou hewer of the sanguine!
Say on! thou hewer of the sanguine!
Say on! say on! that pierced youth
Who crossed the heath in days of yore,
And when he crossed it, “Say on!”
It made the blood tingle and squall
To hear that wounded warrior’s cry;
It made the life blood quickly quail,
To hear that death did soon draw nigh.
It made the blood tingle and squall,
To hear that death did soon draw nigh.
“Say on! say on! thou wounded swine,
Say on! thou hewer of the sanguine!
Thou art not doomed to fetch a pail,
’Tis not the ebbing blood so drain.”
And then it made the blood tingle,
To hear that agony within:
“Thou liest
:thinking_face:
24560|“Come to my arms, my youthful boy!
24560|My elbows are as dry as clay;
24560|I would that thou wert there, boy,
24560|And a harlot were also I.”
24560|“Now try me!” he said outstretchingly;
24560|“Now try me!” there was no more to say.
24560|So they twain grew strong, and with their swords
24560|He smote the maimed youth in the face.
Here’s one you could turn into a story about the day Death took a day off:
24560|And when the corpse was levelled out,
24560|They all forgot to die—
24560|The men and boys of all the town
24560|They made a funeral bier;
24560|For the country-folk, it seems to me,
24560|And the girls did weep and shake
24560|When they found the jolly dead of night,
24560|In the churchyard under the hill.
24560|And when the jolly coffin came,
24560|And they all did raise a cheer,
24560|They found the jolly borogarrest
24560|In the churchyard under the hill.
24560|He said, “I’m up in the chimney-corner,
24560|With the coffin in the shroud,
24560|And the clargy, clargy, clinker-ware,
24560|In the churchyard under the hill!”
24560|And now the jolly borogarra
24560|Is cleared away from the town,
24560|The bells ring out their merry pealing,
24560|And the priest did raise his crown.
24560|With a joyful clood the country folk
24560|The sun it shone on high,
24560|The crock of bells was over the hill,
24560|And a man he did not die.
24560|For every Sabbath bell that day
24560|Came bleating to his ear,
24560|And every man did shout ‘He’s done!’
24560|The bells are silent for the tears
24560|In the churchyard under the hill,
24560|And no man stirs but the man and the dead,
24560|In the churchyard under the hill.
Thank you, this is great.
It seems to yield some interesting insights into how GPT-2 ‘thinks’. For instance, it has figured out that ‘killing things that begin with J but aren’t real words’ is important:
If I must kill the Jeiblung Goer
…
24560|“Where is the tail of the Juritchhaut?
…
24560|“Ha! Ha! Ha! I have slain the Jatmas,”
…
24560|“Hast thou slain the Jugglern?” she cried,
…
24560|“And canst thou slay the jahabi?
24560|In such a mood I hate to ride!
24560|My hands are stained with scarlet cloth,
24560|My eyes are shut in hide.
It also occasionally starts to quote the line “One, two! And through and through” but in the middle gets distracted by ‘this is counting, I know this!’:
24560|Then, one, two, three! And through and through
24560|The moss he thrust below the knee;
24560|And one, two, three! And through and through
24560|The moss he thrust, then thrust it back;
…
24560|One, two! Three! And through and through
24560|The copse that rose frae bank to brier;
…
24560|One, two, three, four! On tree top,
24560|Who saw him before the dawning sun,
24560|Fled; and two, five, eight!
24560|“He’s an old top-captain, Sir!
24560|He’s an old top-captain, Sir!
24560|And in this wretched world is he
24560|Only eight and ten.
This snippet amuses me, I think the priggish/prattle alliteration helps.
24560|“Now haste thee, little priggish boy!
24560|Come forth, my babe, and prattle now!”
And here’s a moment when it really seemed to have got the style down pat:
24560|He pierced the thread, and struck the line;
24560|“Where is my rope?” said he.
24560|And then he made the rope and wound
24560|The pumpkin round his ear;
24560|He broke the line, through whistling sound,
24560|And the moles howl in the drear.
I’m not sure but I think it managed to make a pun on two meanings of ‘low’ here?
24560|At daybreak up the hill is brought
24560|The jay as early as the morn,
24560|The oxen low beneath is sought,
24560|And oxen low adown the corn.
I’ve no idea what this is about but it’s gripping:
24560|“I am that man!” the trooper said
24560|“Beware the Thing that lurks within.”
24560|“The Thing that I must do,” said he;
24560|“The Thing that I must do.”
24560|And when the steel shot from behind,
24560|He turned to see what he could see,
24560|And heard the hound and spurs behind;
24560|Heard breathings in the below,
24560|Sounds heard beneath the roof, the cry
24560|Of an old man crouched for there--alone!
24560|It is the strangest thing that’s known
24560|In all the cumuli.”
I noticed the ‘J’ thing too. My suspicion is that it realizes that ‘Jabberwocky’ is a singular proper noun (there’s apparently only the one in the poem and it’s unique) but then the self-attention isn’t quite good enough to propagate the full name ‘Jabberwocky’ throughout the samples, or possibly the temperature-based sampling is screwing everything up again, and it can only get the ‘J[vowel]’ right before diverging into a more plausible proper name.
gwern could make a lot of money by training AI to compose Accenture ads, like this one:
“SynOps ultimately showcases the art of the possible with how clients can now embrace innovation to drive new value–it’s the new applied now.”
https://pbs.twimg.com/media/D1lHUgWUcAIChee.jpg
I suspect a robot would be better at Accenture ads than a human.
I suspect a robot would be better at $TASK than a human.
Is this going to be the hot new way of insulting people who perform $TASK? 😉
How do you know that ad was written by a human ?
Scott, to what extent is that placebo piece of recognized Great Poetry a random piece of recognized Great Poetry, and to what extent did you cherry-pick it for being particularly wrong-looking despite being recognized Great Poetry?
The second one, though I didn’t spend too long cherry-picking.
GPT-2 writes rambling verse. It understands how to write a line but not really a stanza. It understands that stanzas are meant to feel different from one to the next, and it achieves this by giving the lines within a stanza more similar forms. It does not seem to split different concepts into different stanzas and progress through them.
In the Ulysses sample, we have this line:
“Free hearts, free foreheads – you and I are old;”
The poem turns on this line. The line comes off a caesura and shifts into trochaic meter. It uses the dash for a serious long pause, which allows it to have two sequential stressed syllables. This emphasizes the shift in the topic of the poem.
That’s very good poetic craftsmanship. GPT-2 hasn’t figured out how to write devices like that… yet
You’re right about craftsmanship, but the poem doesn’t shift into trochaic meter there. “heads” is the first syllable of the third iambic foot of that line, “and I” is the iambic fourth, “are old” is the fifth iamb. It’s steadily iambic (though some would call the first two feet spondees, and I would courteously disagree).
I loved Emperor Wu’s descent from majesty to rapacity.
The computer is discovering names of God already? This is how apocalypses begin.
Gwern made a serious error by not naming the anime generator wAIfu.
I’m saving that for the full-Danbooru2018 StyleGAN. The face GAN doesn’t deserve that pun.
Got to “Old age hath yet his honour and his toil” and immediately thought “That’s too good, had better check if it accidentally memorized something.” Though it’s not impossible, indeed likely, that I had also read that specific poem once upon a time.
Most of the factitious anime waifus betray their origins in what must be a corpus of images from porn.
As someone who’s used mythic Greece as a setting for Dungeons & Dragons (and has thus had to deal with the dead being raised), this speaks to me.
One of my thoughts on this is hooking it up to something like Dwarf Fortress wall-art: feed in a prompt of some recent events and/or the mood of the sculptor, and have contextual poetry for the walls.
Eliot has an essay on Kipling in which he argues that verse and poetry are different arts, that Kipling was a great writer of verse who occasionally wrote great poetry more or less by accident.
As Eliot defines it, verse is intended to give all of its effect on a single reading, while poetry has more depth and requires more careful analysis. That links to a point that occurred to me here. Part of the reason the AI poetry feels more real than it should is that we don’t expect poetry to entirely make sense on a first reading: part of what the poet is often aiming at is something unusual, an apparent inconsistency that can reveal a deeper consistency.
Which makes me suspect that if the AI was trained on Kipling’s verse, the result would not look like something written by Kipling.
What the actual fuck.
On one hand, this exercise is utterly narcissistic. Imagine training the AI on a set of objects, 3D-printing the results, and examining them in the context of potential uses. To say “This one looks kind of like a hammer – the AI must have meant to build a hammer!” actually denies the AI agency. This isn’t any demonstration of agency or technique on behalf of the AI – it’s stealth solipsism.
On the other hand, this exercise is essentially claiming “Poetry is no better than a semi-random assemblage of mostly-coherent words.” Bullshit – anyone making such a claim doesn’t know what poetry is. This isn’t a celebration or expression of the art form – it’s rejection of the art form.
wow. it really pressed your buttons.
I kinda doubt the AI has agency.
Any bets on how long before similar AI can produce work that human poetry critics can’t, in a blinded setting, distinguish from poetry written by real humans?
Apparently, the answer is “about negative four days”, since some human readers of this blog (myself included) could not distinguish AI from Tennyson.
But I kind of do agree with rahien.din. The fact that a relatively simple AI can fool human poetry critics makes me much more inclined to dismiss modern poetry as an art form.
I’m fascinated by the idea that AI can’t imitate Pope, that most reasonable of poets:
Know then thyself, presume not God to scan;
The proper study of Mankind is Man.
Placed on this isthmus of a middle state,
A Being darkly wise, and rudely great:
With too much knowledge for the Sceptic side,
With too much weakness for the Stoic’s pride,
He hangs between; in doubt to act, or rest;
In doubt to deem himself a God, or Beast;
In doubt his mind or body to prefer;
Born but to die, and reas’ning but to err;
Alike in ignorance, his reason such,
Whether he thinks too little, or too much;
Chaos of Thought and Passion, all confus’d;
Still by himself, abus’d or disabus’d;
Created half to rise and half to fall;
Great Lord of all things, yet a prey to all,
Sole judge of truth, in endless error hurl’d;
The glory, jest and riddle of the world.
I have a pretty good guess as to why it’s deteriorating over time.
You see the same thing in the full-sized GPT-2 sample dump they released, although it takes a lot longer for the wheels to come off. Most of the samples start strong and then get steadily weirder and less coherent towards the end. I suspect that’s why the sample length was capped the way it is.
The issue has to do with feedback loops.
The way GPT-2 works is that it takes the tokens (roughly, word fragments) it’s generated so far and conditions on them to predict a distribution over the next token, then samples that distribution to pick a token. It then adds that token to the text, reconditions, and repeats the process.
The issue is that the distribution of text that GPT-2 generates is not quite the same as the distribution of actual text. So as it generates more tokens (and gets further away from the real text sample it was initialized with), the number of subtle mistakes it makes increases. So, over time, more and more of the tokens it’s conditioning on are garbage.
This is an issue because GPT-2 is trained to generate tokens conditional on real text. It’s not trained to generate tokens based on its own output. So when its input turns to garbage, its output gets worse, which makes its input worse, and the GIGO death spiral begins.
The same issue occurs with supervised learning of self-driving cars. If you only have data of humans driving well, and then use that to train a neural network to guide a car, you get a system that drives okay – for a little while. But once it makes a mistake, and drifts towards the edge of its lane, it suddenly is in a situation it has no training data for. It knows what good driving looks like, but it doesn’t know how to get from bad driving back to good driving. So it behaves more erratically, which puts it into situations even further from its training data (like driving into a ditch) that it is even less capable of recovering from.
What this looks like in practice is the car making a small, normal mistake, then driving increasingly erratically until it careens off the road. GPT is doing exactly the same thing.
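To make the mismatch concrete: during training the model only ever predicts the next token given real text as context (teacher forcing), so its own mistakes never appear in its conditioning data; at sampling time the context is entirely its own output. A schematic sketch, assuming a Hugging Face-style GPT-2 model object (names are illustrative):

# Schematic of the training/generation mismatch described above.
import torch
import torch.nn.functional as F

def training_step(model, real_tokens):
    # real_tokens: (batch, length) of genuine human-written text.
    # The model is only ever asked to predict the next real token
    # given real context -- it never sees its own mistakes here.
    inputs, targets = real_tokens[:, :-1], real_tokens[:, 1:]
    logits = model(inputs).logits
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

def sampling_step(model, generated_so_far):
    # At generation time the context is the model's own previous output,
    # drawn from a slightly different distribution than it was trained on.
    logits = model(generated_so_far).logits[:, -1]
    next_token = torch.multinomial(torch.softmax(logits, dim=-1), 1)
    return torch.cat([generated_so_far, next_token], dim=1)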
Insert this into the middle of the Tolkien fanfic that it wrote last time and it’d fit right in.
I’m going to be honest, I spent this entire article waiting for the point where you revealed that in fact these were all authentic works of 19th century poetry. I’m not sure whether that’s better or worse than falling for the one genuine excerpt.
“And Phrygian warriors. So, we will dispatch”
This is not remotely iambic pentameter.
For starters, it’s 12 syllables. Also, to make the meter into iambs requires changing the normal pattern of accented syllables in “warriors”.
and phry’dzhen war’yers / so we will dis’patch
seems ok to me
no less tortured than eliding “-ian” from “Ethiopian”
OK, yes, I guess you can drop both long e sounds. Phrygian is a word I never encounter in speech, so I’ve always read it as phri-zhEE-en. It seems a little tortured to rely on two such compressions in one line.
Both of them are standard elisions, so it’s not much in the way of torture. There’s a set of established, allowed elision. Mainly, any two vowels can be elided, including w’s and y’s, even between words. So “The ape” can be elided as “Th’ape” (one syllable) or “Carry elephants” as “Care y’elephants” (four syllables, though it would never fit into iambs well). The point is that these elisions aren’t exceptions or tortures, but well established parts of the rules, extending back even to the hexameter verse of Latin and Greek.
The archives of SSC and its comment section have got to add up to a fairly good-sized training corpus by now. Can we look forward to some AI-generated accusations of bad faith, and even more competing redefinitions of “mistake theory”?
Inspired by Gwern’s work, I ran the fine-tuning process on three datasets of my own: a large set of chatlogs from a group I used to be in (90 MB), a large collection of My Little Pony fanfiction (100 MB), and half a gig worth of arXiv high-energy-physics papers. The chatlogs had a lot of personal information in them, but here’s a ton of unconditional samples for the MLP fanfic and the high-energy-physics models. (I can also provide the models if anyone wants them; they’re just big enough not to be worth hosting if there’s no interest.)
pony fic:
[link]
high energy physics phenomenology:
[link]
The hep-ph model does really quite well at generating valid LaTeX, though it’s far from perfectly reliable (as the very first sample shows). And to get it to render, since it’s sampling a random chunk out of the middle of an article (on average), you will need to truncate the sample to cut off opened-but-not-closed or closed-but-not-opened stuff from the end and beginning. And it can’t do figures. And because the samples don’t actually include bibliographies (generally), the citations just render as question marks in brackets/parens. But otherwise it’s pretty good; here are some rendered examples (manual correction consisted only of the aforementioned truncation):
[link]
[link]
[link]
The fine-tuning process took about six hours per dataset on one GPU (a 2080, specifically). Since this GPU isn’t all that much faster than Gwern’s (for fp32, at least), I assume that Gwern was just a lot more patient than I was about LR drops and declaring training finished.
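For reference, the fine-tuning being described is just a continuation of ordinary next-token training on the new corpus. A rough sketch using the Hugging Face transformers port of GPT-2 (not necessarily the codebase either commenter used; the corpus path, chunking, and hyperparameters are placeholders):

# Rough fine-tuning sketch (illustrative; corpus path and hyperparameters are placeholders).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").cuda()
optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)

# In practice you'd tokenize the corpus in chunks rather than one giant call.
ids = tokenizer.encode(open("my_corpus.txt").read(), return_tensors="pt")[0]

block = 512                                           # context length per training example
for step in range(10_000):
    start = torch.randint(0, len(ids) - block - 1, (1,)).item()
    chunk = ids[start:start + block].unsqueeze(0).cuda()
    loss = model(chunk, labels=chunk).loss            # standard next-token cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()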
Now train it on both corpora at once.
HEMLP
The issue with that is that, since the two corpora are so immediately distinct, almost the first thing it would learn is whether it should currently be generating HEP-PH or ponyfic based on the context. So that would just get you a model which can generate either, not a model that does a mixture of both.
Yes, I trained for, cumulatively, multiple days while dropping the LR to see where it would converge. (Would’ve gone faster with multi-GPU support but oh well.) From your samples I suspect you could probably train it further.
Looking at your MLP samples, you might want to delete some whitespace. Double-spacing sentences is pointless & wasteful. And whole blank lines aren’t necessary given newlines, and some of them have 5 or 10 blank lines in a row, which wastes some of the context window, is unnecessary variation to learn, and makes the samples harder to read. (Also, are those Unicode quotes & ellipses? Maybe you should do an ASCII conversion pass.)
Oh, dear God. This poetry actually sounds great. Robots will replace us. 0_0