How to generate your own “The Simpsons” TV script using Deep Learning

Image by Kaggle.com (https://www.kaggle.com/wcukierski/the-simpsons-by-the-data)

Have you ever dreamed of creating your own episode of “The Simpsons”? I did.

That is what I thought when I saw the Simpsons dataset on Kaggle. It is the perfect dataset for a small “just for fun” project on Natural Language Generation (NLG).

What is Natural Language Generation (NLG)?

“Natural-language generation (NLG) is the aspect of language technology that focuses on generating natural language from structured data or structured representations such as a knowledge base or a logical form.”
(https://en.wikipedia.org/wiki/Natural-language_generation)

In this post we will see how to train a model capable of creating new “Simpsons-style” conversations. As input for training we will use the file simpsons_script_lines.csv from the Simpsons dataset.

Downloading and preparing the data

First you need to download the data file. You can do this on the Kaggle page of “The Simpsons by the Data”. Download the file simpsons_script_lines.csv, save it to a folder named “data” and unzip it. It should be ~34MB after unzipping.

If you look at the first lines of the file you will see that there are several columns in this CSV:

First lines of simpsons_script_lines.csv

For training the model we only need the pure text, without all the other features, so we need to extract it from the file.

Normally the easiest way to read in the data would be Pandas’ read_csv() function, but in this case it does not work: the file uses commas as separators, and the dialogue text contains many unescaped commas that break the automatic parsing.

So we need to read the file as plain text and do the parsing using regular expressions.
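Here is a minimal sketch of that parsing step. The file paths, the assumption that raw_text is the fourth column, and the regular expression are mine; the real file has edge cases (e.g. line breaks inside quoted text) that this sketch ignores.

import re

# Paths follow the "data" folder convention from above; the output
# file name is my choice.
INPUT_PATH = 'data/simpsons_script_lines.csv'
OUTPUT_PATH = 'data/simpsons_text.txt'

with open(INPUT_PATH, 'r', encoding='utf-8') as fin, \
     open(OUTPUT_PATH, 'w', encoding='utf-8') as fout:
    next(fin)  # skip the CSV header
    for line in fin:
        # raw_text is the fourth column; it is quoted when it contains commas.
        match = re.search(r'"([^"]+)"', line)
        if match:
            text = match.group(1)
        else:
            parts = line.rstrip('\n').split(',')
            if len(parts) < 4:
                continue
            text = parts[3]
        # Keep only dialogue and scene lines of the form "Name: ...".
        if ':' not in text:
            continue
        # Replace spaces in the speaker name with underscores,
        # e.g. "Lisa Simpson:" -> "Lisa_Simpson:"
        name, sep, rest = text.partition(':')
        fout.write(name.replace(' ', '_') + sep + rest + '\n')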

The output of this script looks like this:

Miss_Hoover: No, actually, it was a little of both. Sometimes when a disease is in all the magazines and all the news shows, it's only natural that you think you have it.
Lisa_Simpson: (NEAR TEARS) Where's Mr. Bergstrom?
Miss_Hoover: I don't know. Although I'd sure like to talk to him. He didn't touch my lesson plan. What did he teach you?
Lisa_Simpson: That life is worth living.
Edna_Krabappel-Flanders: The polls will be open from now until the end of recess. Now, (SOUR) just in case any of you have decided to put any thought into this, we'll have our final statements. Martin?
Martin_Prince: (HOARSE WHISPER) I don't think there's anything left to say.
Edna_Krabappel-Flanders: Bart?
Bart_Simpson: Victory party under the slide!
(Apartment_Building: Ext. apartment building - day)
Lisa_Simpson: (CALLING) Mr. Bergstrom! Mr. Bergstrom!

Looking at the output you can see that we did more than extract the text: we also replaced the spaces in names with an underscore, so “Lisa Simpson” becomes “Lisa_Simpson”. This way we can use the names as starting words for the text generation step.

Data preprocessing

Before we can use this as input for training our model, we need to do some extra preprocessing.

We’ll be splitting the script into a word array using spaces as delimiters. However, punctuation like periods and exclamation marks makes it hard for the Neural Network to distinguish between the word “bye” and “bye!”.

To solve this we create a dictionary that we will use to tokenize the symbols, adding a delimiter (space) around each token. This separates the symbols from the words, making it easier for the Neural Network to predict the next word.
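Such a dictionary could look like this; the token names are arbitrary placeholders, they just must not occur in the text:

def token_lookup():
    # Map each symbol to a unique placeholder token.
    return {
        '.': '||period||',
        ',': '||comma||',
        '"': '||quotation_mark||',
        ';': '||semicolon||',
        '!': '||exclamation_mark||',
        '?': '||question_mark||',
        '(': '||left_parentheses||',
        ')': '||right_parentheses||',
        '--': '||dash||',
        '\n': '||return||',
    }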

In the next step we will use this dictionary to replace the symbols in the text and to build the vocabulary and lookup tables for the words.
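A sketch of this step; the function and variable names are my own:

from collections import Counter

def create_lookup_tables(words):
    # Most frequent words get the lowest IDs; the ordering is a
    # convention, not a requirement.
    word_counts = Counter(words)
    sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
    vocab_to_int = {word: i for i, word in enumerate(sorted_vocab)}
    int_to_vocab = {i: word for word, i in vocab_to_int.items()}
    return vocab_to_int, int_to_vocab

# Replace the symbols, surrounded by spaces so they become separate words.
token_dict = token_lookup()
with open(OUTPUT_PATH, 'r', encoding='utf-8') as f:
    text = f.read()
for symbol, token in token_dict.items():
    text = text.replace(symbol, ' {} '.format(token))

words = text.lower().split()
vocab_to_int, int_to_vocab = create_lookup_tables(words)
int_text = [vocab_to_int[word] for word in words]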

Build the Neural Network

Now that we have prepared the data it is time to create the Neural Network.

First we need to create TensorFlow placeholders for the input, the targets and the learning rate.
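A minimal version using the TensorFlow 1.x API:

import tensorflow as tf

def get_inputs():
    # Word IDs with shape (batch_size, sequence_length); the tensor
    # names matter later, when we load the trained graph by name.
    inputs = tf.placeholder(tf.int32, [None, None], name='input')
    targets = tf.placeholder(tf.int32, [None, None], name='targets')
    learning_rate = tf.placeholder(tf.float32, name='learning_rate')
    return inputs, targets, learning_rate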

Next we create an RNN cell and initialize it.
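A sketch of get_init_cell(); the cell type (BasicLSTMCell) and the number of layers are assumptions:

def get_init_cell(batch_size, rnn_size, num_layers=2):
    # Stack LSTM cells into one multi-layer RNN cell.
    cells = [tf.contrib.rnn.BasicLSTMCell(rnn_size) for _ in range(num_layers)]
    cell = tf.contrib.rnn.MultiRNNCell(cells)
    # tf.identity packs the zero state into a single named tensor so it
    # can be fetched by name after the model is loaded from disk.
    initial_state = tf.identity(cell.zero_state(batch_size, tf.float32),
                                name='initial_state')
    return cell, initial_state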

Here we apply embedding to input_data using TensorFlow and return the embedded sequence.
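For example (the initialization range is an assumption):

def get_embed(input_data, vocab_size, embed_dim):
    # Trainable embedding matrix, looked up per word ID.
    embedding = tf.Variable(tf.random_uniform([vocab_size, embed_dim], -1, 1))
    return tf.nn.embedding_lookup(embedding, input_data)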

We created an RNN cell in the get_init_cell() function. Time to use the cell to create the RNN.
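A minimal build_rnn() could look like this:

def build_rnn(cell, inputs):
    # dynamic_rnn unrolls the cell over the time dimension.
    outputs, final_state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
    final_state = tf.identity(final_state, name='final_state')
    return outputs, final_state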

Now let’s put this all together to build the final Neural Network.
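Putting the pieces together, with the function names from above:

def build_nn(cell, input_data, vocab_size, embed_dim):
    embed = get_embed(input_data, vocab_size, embed_dim)
    outputs, final_state = build_rnn(cell, embed)
    # Project the RNN outputs onto the vocabulary: one score per word.
    logits = tf.contrib.layers.fully_connected(outputs, vocab_size,
                                               activation_fn=None)
    return logits, final_state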

Training the Neural Network

For training the Neural Network we have to create batches of inputs and targets…
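One common way to batch the data; note that the targets are simply the inputs shifted by one word:

import numpy as np

def get_batches(int_text, batch_size, seq_length):
    # Keep only as many words as fill complete batches.
    words_per_batch = batch_size * seq_length
    n_batches = len(int_text) // words_per_batch
    xdata = np.array(int_text[:n_batches * words_per_batch])
    ydata = np.roll(xdata, -1)  # targets are the inputs shifted by one word
    x_batches = np.split(xdata.reshape(batch_size, -1), n_batches, axis=1)
    y_batches = np.split(ydata.reshape(batch_size, -1), n_batches, axis=1)
    return np.array(list(zip(x_batches, y_batches)))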

… and define hyperparameters for training.
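The exact values are a matter of experimentation. The epoch count and the logging interval match the training log further down; the rest are assumptions you can tune:

num_epochs = 50             # matches the training log below
batch_size = 64
rnn_size = 512
embed_dim = 300
seq_length = 16
learning_rate = 0.001
show_every_n_batches = 200  # matches the logging interval below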

Before we can start the training we need to build the graph.
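A sketch of the training graph that wires up the functions from above (the gradient clipping is an assumption, but a common choice for RNNs):

from tensorflow.contrib import seq2seq

train_graph = tf.Graph()
with train_graph.as_default():
    vocab_size = len(int_to_vocab)
    input_text, targets, lr = get_inputs()
    input_shape = tf.shape(input_text)

    cell, initial_state = get_init_cell(input_shape[0], rnn_size)
    logits, final_state = build_nn(cell, input_text, vocab_size, embed_dim)

    # Softmax probabilities, named so we can fetch them after loading.
    probs = tf.nn.softmax(logits, name='probs')

    # Cross-entropy loss over the whole sequence.
    cost = seq2seq.sequence_loss(
        logits, targets, tf.ones([input_shape[0], input_shape[1]]))

    # Adam with gradient clipping to keep RNN training stable.
    optimizer = tf.train.AdamOptimizer(lr)
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var)
                        for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)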

Now we can start training the Neural Network on the preprocessed data. This will take a while. On my GTX 1080 Ti the training took roughly 4 hours to complete using the parameters above.
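A sketch of the corresponding training loop:

batches = get_batches(int_text, batch_size, seq_length)

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())

    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})

        for batch_i, (x, y) in enumerate(batches):
            feed = {input_text: x, targets: y,
                    initial_state: state, lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)

            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                print('Epoch {:>3} Batch {:>4}/{}   train_loss = {:.3f}'.format(
                    epoch_i, batch_i, len(batches), train_loss))

    # Save the trained model; the path is reused when generating text.
    tf.train.Saver().save(sess, './save')
    print('Model Trained and Saved')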

The log output during training should look like this:

...
Epoch  49 Batch 1186/4686   train_loss = 1.737
Epoch  49 Batch 1386/4686   train_loss = 1.839
Epoch  49 Batch 1586/4686   train_loss = 2.050
Epoch  49 Batch 1786/4686   train_loss = 1.798
Epoch  49 Batch 1986/4686   train_loss = 1.751
Epoch  49 Batch 2186/4686   train_loss = 1.680
Epoch  49 Batch 2386/4686   train_loss = 1.641
Epoch  49 Batch 2586/4686   train_loss = 1.912
Epoch  49 Batch 2786/4686   train_loss = 1.811
Epoch  49 Batch 2986/4686   train_loss = 1.949
Epoch  49 Batch 3186/4686   train_loss = 1.821
Epoch  49 Batch 3386/4686   train_loss = 1.664
Epoch  49 Batch 3586/4686   train_loss = 1.735
Epoch  49 Batch 3786/4686   train_loss = 2.175
Epoch  49 Batch 3986/4686   train_loss = 1.710
Epoch  49 Batch 4186/4686   train_loss = 1.969
Epoch  49 Batch 4386/4686   train_loss = 2.055
Epoch  49 Batch 4586/4686   train_loss = 1.862
Model Trained and Saved

Generate TV Script

When training is finished we are at the last step of this project: generating a new TV script for “The Simpsons”!

To start we need to get the tensors from loaded_graph …
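A sketch, using the tensor names we assigned when building the graph:

def get_tensors(loaded_graph):
    # The names correspond to the name= arguments used above.
    input_tensor = loaded_graph.get_tensor_by_name('input:0')
    initial_state_tensor = loaded_graph.get_tensor_by_name('initial_state:0')
    final_state_tensor = loaded_graph.get_tensor_by_name('final_state:0')
    probs_tensor = loaded_graph.get_tensor_by_name('probs:0')
    return input_tensor, initial_state_tensor, final_state_tensor, probs_tensor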

… and a function to select the next word using probabilities.
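For example, sampling from the predicted distribution instead of always taking the most likely word keeps the generated text varied:

def pick_word(probabilities, int_to_vocab):
    # Renormalize to guard against float32 rounding, then draw the
    # next word ID according to its predicted probability.
    p = probabilities / probabilities.sum()
    idx = np.random.choice(len(p), p=p)
    return int_to_vocab[idx]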

And finally we are ready to generate the TV script. Set gen_length to the length of TV script you want to generate.
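A sketch of the generation loop; prime_word is my choice, and any of the underscored character names from the training text works as a starting word:

gen_length = 500
prime_word = 'homer_simpson'  # any character name from the text works

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load the trained model saved above.
    loader = tf.train.import_meta_graph('./save.meta')
    loader.restore(sess, './save')

    input_text, initial_state, final_state, probs = get_tensors(loaded_graph)

    gen_sentences = [prime_word + ':']
    prev_state = sess.run(initial_state, {input_text: np.array([[1]])})

    for n in range(gen_length):
        # Feed the most recent seq_length words back into the network.
        dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
        dyn_seq_length = len(dyn_input[0])

        probabilities, prev_state = sess.run(
            [probs, final_state],
            {input_text: dyn_input, initial_state: prev_state})

        pred_word = pick_word(probabilities[0][dyn_seq_length - 1], int_to_vocab)
        gen_sentences.append(pred_word)

    # Turn the placeholder tokens back into punctuation.
    tv_script = ' '.join(gen_sentences)
    for symbol, token in token_dict.items():
        tv_script = tv_script.replace(' ' + token, symbol)
    tv_script = tv_script.replace('\n ', '\n').replace('( ', '(')

    print(tv_script)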

This should give you an output like this:

INFO:tensorflow:Restoring parameters from ./save
homer_simpson:(moans)
marge_simpson:(annoyed murmur)
homer_simpson:(annoyed grunt)
(moe's_tavern: ext. moe's - night)
homer_simpson:(to moe) this is a great idea, children. now, what are we playing here?
bart_simpson:(horrified gasp)
(simpson_home: ext. simpson house - day - establishing)
homer_simpson:(worried) i've got a wet!
homer_simpson:(faking enthusiasm) well, maybe i could kiss my little girl. mine!
(department int. sports arena - night)
seymour_skinner:(chuckles)
chief_wiggum:(laughing) oh, i get it.
seymour_skinner:(snapping) i guess this building is quiet.
homer_simpson:(stunned) what? how'd you like that?
professor_jonathan_frink: uh, well, looks like the little bit of you.
bart_simpson:(to larry) i guess this is clearly justin, right?
homer_simpson:(dismissive snort) oh, i am.
marge_simpson:(pained) hi.
homer_simpson:(pained sound) i thought you might have some good choice.
homer_simpson:(pained) oh, sorry.
(simpson_home: int. simpson house - living room - day)
marge_simpson:(concerned) okay, open your door.
homer_simpson: don't push, marge. we'll be fine.
judge_snyder:(sarcastic) children, you want a night?
homer_simpson:(gulp) oh, i can't believe i wasn't in a car.
chief_wiggum:(to selma) i can't find this map. and she's gonna release that?
homer_simpson:(lots of hair) just like me.
homer_simpson:(shrugs) gimme a try.
homer_simpson:(sweetly) i don't know, but i don't remember that.
marge_simpson:(brightening) are you all right?
homer_simpson: absolutely...
lisa_simpson:(mad) even better!
homer_simpson:(hums)
marge_simpson: oh, homie. that's a doggie door.
homer_simpson:(moan) i don't have computers.
homer_simpson:(hopeful) honey?(makes fake companies break) are you okay?
marge_simpson:(short giggle)
homer_simpson:(happy) oh, marge, i found the two thousand and cars.
marge_simpson:(frustrated sound)
lisa_simpson:(outraged) are you, you're too far to go?
boys:(skeptical) well, i'm gonna be here at the same time.
homer_simpson:(moans) why are you doing us by doing anything?
marge_simpson: well, it always seemed like i'm gonna be friends with...
homer_simpson:(seething) losers!
(simpson_home: int. simpson house -

Conclusion

We have trained a model to generate new text!

As you can see, the text does not really make any sense, but that’s OK. This project was meant to show you how to prepare the data for training the model and to give a basic idea of how NLG works.

If you want, you can tune the parameters, add more layers or change their size, and look at how the output of the model changes.

Github

The code for this project is also available as a Jupyter Notebook in my GitHub repository.
