How to generate your own “The Simpsons” TV script using Deep Learning

Image by Kaggle.com (https://www.kaggle.com/wcukierski/the-simpsons-by-the-data)

Have you ever dreamed of creating your own episode of “The Simpsons”? I did.

That is what I thought when I saw the Simpsons dataset on Kaggle. It is the perfect dataset for a small “just for fun” project on Natural Language Generation (NLG).

What is Natural Language Generation (NLG)?

“Natural-language generation (NLG) is the aspect of language technology that focuses on generating natural language from structured data or structured representations such as a knowledge base or a logical form.”
(https://en.wikipedia.org/wiki/Natural-language_generation)

In this post we will see how to train a model that is capable of creating new “Simpsons-style” conversations. As input for training we will use the file simpsons_script_lines.csv from the Simpsons dataset.

Downloading and preparing the data

First you need to download the data file. You can do this on the Kaggle page of “The Simpsons by the Data”. Download the file simpsons_script_lines.csv, save it to a folder named “data” and unzip it. The file should be about 34 MB after unzipping.
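If you prefer to script this step, a minimal sketch using only the standard library could look like this (the archive name is an assumption; adjust it to whatever Kaggle gives you):

import zipfile

# Unzip the downloaded archive into the data folder
# (the archive name './data/simpsons_script_lines.csv.zip' is an assumption)
with zipfile.ZipFile('./data/simpsons_script_lines.csv.zip') as archive:
    archive.extractall('./data')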

If you look at the first lines of the file you will see that there are several columns in this CSV:

First lines of simpsons_script_lines.csv

For training the model we only need the raw text, without all the other columns, so we need to extract it from the file.

The easiest way to read in the data would normally be Pandas' read_csv() function, but in this case it does not work: the file uses commas as separators, yet the spoken lines themselves contain plenty of unescaped commas, which breaks the automatic parsing.

So we read the file as plain text instead and do the parsing with regular expressions.

import os
import re

data_dir = './data/simpsons_script_lines.csv'
input_file = os.path.join(data_dir)
clean_text = ''
with open(input_file, "r", encoding="utf8") as f:
    for line in f:
        # Grab the text column sitting between the leading numeric IDs
        # and the following numeric field
        text = re.search('[0-9]*,[0-9]*,[0-9]*,(.+?),[0-9]*,', line)
        if text:
            text = text.group(1).replace('"', '')
            # Replace the spaces in the speaker's name with underscores
            text_parts = text.split(':')
            text_parts[0] = text_parts[0].replace(' ', '_')
            text = ':'.join(text_parts)
            clean_text += text + '\n'
print('\n'.join(clean_text.split('\n')[:10]))

The output of this script looks like this:

Miss_Hoover: No, actually, it was a little of both. Sometimes when a disease is in all the magazines and all the news shows, it's only natural that you think you have it.
Lisa_Simpson: (NEAR TEARS) Where's Mr. Bergstrom?
Miss_Hoover: I don't know. Although I'd sure like to talk to him. He didn't touch my lesson plan. What did he teach you?
Lisa_Simpson: That life is worth living.
Edna_Krabappel-Flanders: The polls will be open from now until the end of recess. Now, (SOUR) just in case any of you have decided to put any thought into this, we'll have our final statements. Martin?
Martin_Prince: (HOARSE WHISPER) I don't think there's anything left to say.
Edna_Krabappel-Flanders: Bart?
Bart_Simpson: Victory party under the slide!
(Apartment_Building: Ext. apartment building - day)
Lisa_Simpson: (CALLING) Mr. Bergstrom! Mr. Bergstrom!

Looking at the output you can see that we did not only extract the text; we also replaced the spaces in the names with underscores, so “Lisa Simpson” becomes “Lisa_Simpson”. This way we can use the names as starting words for the text generation step.

Data preprocessing

Before we can use this as input for training of our model we first need to do some extra preprocessing.

We’ll be splitting the script into a word array using spaces as delimiters. However, punctuation like periods and exclamation marks makes it hard for the Neural Network to distinguish between the word “bye” and “bye!”.

To solve this we create a dictionary that we will use to tokenize these symbols, adding a delimiter (space) around each of them. This separates the symbols from the words, making it easier for the Neural Network to predict the next word.

In the next step we use this dictionary to replace the symbols and then build the vocabulary and the lookup tables for the words in the text.

from collections import Counter

tokenized_punctuation = {
    '.' : '||Period||',
    ',' : '||Comma||',
    '"' : '||Quotation_Mark||',
    ';' : '||Semicolon||',
    '!' : '||Exclamation_Mark||',
    '?' : '||Question_Mark||',
    '(' : '||Left_Parentheses||',
    ')' : '||Right_Parentheses||',
    '--' : '||Dash||',
    '\n' : '||Return||'
}

# clean_text is already a single string, so we can use it directly
text = clean_text
# Surround every punctuation symbol with spaces so it becomes its own token
for key, token in tokenized_punctuation.items():
    text = text.replace(key, ' {} '.format(token))

# Build the vocabulary and the lookup tables
text = text.lower()
text = text.split()
word_counts = Counter(text)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
int_text = [vocab_to_int[word] for word in text]
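
As a quick sanity check (not part of the original script), you can print the size of the vocabulary and the length of the encoded text before moving on:

# Quick sanity check of the preprocessing result
print('Vocabulary size: {}'.format(len(vocab_to_int)))
print('Number of tokens in the script: {}'.format(len(int_text)))
print('Most frequent tokens: {}'.format(sorted_vocab[:10]))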

Build the Neural Network

Now that we have prepared the data it is time to create the Neural Network.

First we need to create TensorFlow placeholders for the input, targets and learning rate.

import tensorflow as tf

def get_inputs():
    input_placeholder = tf.placeholder(tf.int32, [None, None], name='input')
    targets_placeholder = tf.placeholder(tf.int32, [None, None])
    learning_rate_placeholder = tf.placeholder(tf.float32)
    return input_placeholder, targets_placeholder, learning_rate_placeholder

Next we create an RNN cell (a GRU cell in this case) and initialize its state.

def get_init_cell(batch_size, rnn_size):
    # Despite the variable name, this is a GRU cell, not an LSTM
    lstm = tf.contrib.rnn.GRUCell(rnn_size)
    cell = tf.contrib.rnn.MultiRNNCell([lstm])
    initial_state = tf.identity(cell.zero_state(batch_size, tf.float32), name='initial_state')
    return cell, initial_state

Here we apply embedding to input_data using TensorFlow and return the embedded sequence.

def get_embed(input_data, vocab_size, embed_dim):
    embedding = tf.Variable(tf.random_uniform((vocab_size, embed_dim), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, input_data)
    return embed

We created an RNN cell in the get_init_cell() function. Time to use that cell to build the RNN itself.

def build_rnn(cell, inputs):
    outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
    final_state = tf.identity(state, name="final_state")
    return outputs, final_state

Now let’s put this all together to build the final Neural Network.

def build_nn(cell, rnn_size, input_data, vocab_size, embed_dim):
    embeddings = get_embed(input_data, vocab_size, embed_dim)
    outputs, final_state = build_rnn(cell, embeddings)
    logits = tf.contrib.layers.fully_connected(inputs=outputs, num_outputs=vocab_size, activation_fn=None)
    return logits, final_state

Training the Neural Network

For training the Neural Network we first have to create batches of inputs and targets; a small toy example of the resulting shape follows the function below.

import numpy as np

def get_batches(int_text, batch_size, seq_length):
    n_batches = len(int_text) // (batch_size * seq_length)
    words = np.asarray(int_text[:n_batches * (batch_size * seq_length)])
    batches = np.zeros(shape=(n_batches, 2, batch_size, seq_length))
    input_sequences = words.reshape(-1, seq_length)
    target_sequences = np.roll(words, -1)
    target_sequences = target_sequences.reshape(-1, seq_length)
    for idx in range(0, input_sequences.shape[0]):
        input_idx = idx % n_batches
        target_idx = idx // n_batches
        batches[input_idx, 0, target_idx, :] = input_sequences[idx, :]
        batches[input_idx, 1, target_idx, :] = target_sequences[idx, :]
    return batches
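To get a feel for what get_batches() returns, here is a small illustration with made-up numbers (not part of the training code): the result has shape (n_batches, 2, batch_size, seq_length), where index 0 holds the inputs and index 1 the targets, which are simply the inputs shifted by one word.

# Toy illustration: 12 tokens, batch_size=2, seq_length=3 -> 2 batches
toy_batches = get_batches(list(range(12)), batch_size=2, seq_length=3)
print(toy_batches.shape)   # (2, 2, 2, 3)
print(toy_batches[0][0])   # inputs of the first batch: rows [0 1 2] and [6 7 8]
print(toy_batches[0][1])   # targets of the first batch: rows [1 2 3] and [7 8 9]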

Next we define the hyperparameters for training.

# Number of Epochs
num_epochs = 50
# Batch Size
batch_size = 32
# RNN Size
rnn_size = 512
# Embedding Dimension Size
embed_dim = 256
# Sequence Length
seq_length = 16
# Learning Rate
learning_rate = 0.001
# Show stats for every n number of batches
show_every_n_batches = 200
# where to save the trained model
save_dir = './save'

Before we can start the training we need to build the graph.

from tensorflow.contrib import seq2seq

train_graph = tf.Graph()
with train_graph.as_default():
    vocab_size = len(int_to_vocab)
    input_text, targets, lr = get_inputs()
    input_data_shape = tf.shape(input_text)
    cell, initial_state = get_init_cell(input_data_shape[0], rnn_size)
    logits, final_state = build_nn(cell, rnn_size, input_text, vocab_size, embed_dim)
    # Probabilities for generating words
    probs = tf.nn.softmax(logits, name='probs')
    # Loss function
    cost = seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_data_shape[0], input_data_shape[1]]))
    # Optimizer
    optimizer = tf.train.AdamOptimizer(lr)
    # Gradient Clipping
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)

Now we can start training the Neural Network on the preprocessed data. This will take a while. On my GTX 1080TI the training took roughly 4 hours to complete using the parameters above.

batches = get_batches(int_text, batch_size, seq_length)
with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})
        for batch_i, (x, y) in enumerate(batches):
            feed = {
                input_text: x,
                targets: y,
                initial_state: state,
                lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)
            # Show every <show_every_n_batches> batches
            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                print('Epoch {:>3} Batch {:>4}/{}   train_loss = {:.3f}'.format(
                    epoch_i,
                    batch_i,
                    len(batches),
                    train_loss))
    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_dir)
    print('Model Trained and Saved')

The generated output while training should look like this:

...
Epoch  49 Batch 1186/4686   train_loss = 1.737
Epoch  49 Batch 1386/4686   train_loss = 1.839
Epoch  49 Batch 1586/4686   train_loss = 2.050
Epoch  49 Batch 1786/4686   train_loss = 1.798
Epoch  49 Batch 1986/4686   train_loss = 1.751
Epoch  49 Batch 2186/4686   train_loss = 1.680
Epoch  49 Batch 2386/4686   train_loss = 1.641
Epoch  49 Batch 2586/4686   train_loss = 1.912
Epoch  49 Batch 2786/4686   train_loss = 1.811
Epoch  49 Batch 2986/4686   train_loss = 1.949
Epoch  49 Batch 3186/4686   train_loss = 1.821
Epoch  49 Batch 3386/4686   train_loss = 1.664
Epoch  49 Batch 3586/4686   train_loss = 1.735
Epoch  49 Batch 3786/4686   train_loss = 2.175
Epoch  49 Batch 3986/4686   train_loss = 1.710
Epoch  49 Batch 4186/4686   train_loss = 1.969
Epoch  49 Batch 4386/4686   train_loss = 2.055
Epoch  49 Batch 4586/4686   train_loss = 1.862
Model Trained and Saved

Generate TV Script

When training is finished we are at the last step of this project: generating a new TV Script for “The Simpsons”!

To start, we need to get the tensors from loaded_graph …

def get_tensors(loaded_graph):
    input_tensor = loaded_graph.get_tensor_by_name('input:0')
    initial_state_tensor = loaded_graph.get_tensor_by_name('initial_state:0')
    final_state_tensor = loaded_graph.get_tensor_by_name('final_state:0')
    probs_tensor = loaded_graph.get_tensor_by_name('probs:0')
    return input_tensor, initial_state_tensor, final_state_tensor, probs_tensor

… and a function to select the next word using probabilities.

def pick_word(probabilities, int_to_vocab):
    word_id = np.argmax(probabilities)
    word_string = int_to_vocab[word_id]
    return word_string
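Note that pick_word() always chooses the single most probable word (argmax), which can make the generated text repetitive. A possible variation, not used in this post, is to sample the next word from the predicted distribution instead:

# Alternative picker (sketch): sample from the predicted distribution
# instead of always taking the most probable word
def pick_word_sampled(probabilities, int_to_vocab):
    probabilities = np.asarray(probabilities, dtype=np.float64)
    probabilities = probabilities / probabilities.sum()  # re-normalize for np.random.choice
    word_id = np.random.choice(len(probabilities), p=probabilities)
    return int_to_vocab[word_id]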

And finally we are ready to generate the TV script. Set gen_length to the length of the TV script you want to generate.

gen_length = 500
"""
The prime word is used as the start word for the text generation.
To generate different text try different prime words like:
  'marge_simpson'
  'bart_simpson'
  'lisa_simpson'
  'seymour_skinner'
  'chief_wiggum'
  'judge_snyder'
"""
prime_word = 'homer_simpson'

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(save_dir + '.meta')
    loader.restore(sess, save_dir)
    # Get Tensors from loaded model
    input_text, initial_state, final_state, probs = get_tensors(loaded_graph)
    # Sentences generation setup
    gen_sentences = [prime_word + ':']
    prev_state = sess.run(initial_state, {input_text: np.array([[1]])})
    # Generate sentences
    for n in range(gen_length):
        # Dynamic Input
        dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
        dyn_seq_length = len(dyn_input[0])
        # Get Prediction
        probabilities, prev_state = sess.run(
            [probs, final_state],
            {input_text: dyn_input, initial_state: prev_state})
        pred_word = pick_word(probabilities[0][dyn_seq_length - 1], int_to_vocab)
        gen_sentences.append(pred_word)
    # Remove tokens
    tv_script = ' '.join(gen_sentences)
    for key, token in tokenized_punctuation.items():
        ending = ' ' if key in ['\n', '(', '"'] else ''
        tv_script = tv_script.replace(' ' + token.lower(), key)
    tv_script = tv_script.replace('\n ', '\n')
    tv_script = tv_script.replace('( ', '(')
    print(tv_script)

This should give you an output like this:

INFO:tensorflow:Restoring parameters from ./save
homer_simpson:(moans)
marge_simpson:(annoyed murmur)
homer_simpson:(annoyed grunt)
(moe's_tavern: ext. moe's - night)
homer_simpson:(to moe) this is a great idea, children. now, what are we playing here?
bart_simpson:(horrified gasp)
(simpson_home: ext. simpson house - day - establishing)
homer_simpson:(worried) i've got a wet!
homer_simpson:(faking enthusiasm) well, maybe i could kiss my little girl. mine!
(department int. sports arena - night)
seymour_skinner:(chuckles)
chief_wiggum:(laughing) oh, i get it.
seymour_skinner:(snapping) i guess this building is quiet.
homer_simpson:(stunned) what? how'd you like that?
professor_jonathan_frink: uh, well, looks like the little bit of you.
bart_simpson:(to larry) i guess this is clearly justin, right?
homer_simpson:(dismissive snort) oh, i am.
marge_simpson:(pained) hi.
homer_simpson:(pained sound) i thought you might have some good choice.
homer_simpson:(pained) oh, sorry.
(simpson_home: int. simpson house - living room - day)
marge_simpson:(concerned) okay, open your door.
homer_simpson: don't push, marge. we'll be fine.
judge_snyder:(sarcastic) children, you want a night?
homer_simpson:(gulp) oh, i can't believe i wasn't in a car.
chief_wiggum:(to selma) i can't find this map. and she's gonna release that?
homer_simpson:(lots of hair) just like me.
homer_simpson:(shrugs) gimme a try.
homer_simpson:(sweetly) i don't know, but i don't remember that.
marge_simpson:(brightening) are you all right?
homer_simpson: absolutely...
lisa_simpson:(mad) even better!
homer_simpson:(hums)
marge_simpson: oh, homie. that's a doggie door.
homer_simpson:(moan) i don't have computers.
homer_simpson:(hopeful) honey?(makes fake companies break) are you okay?
marge_simpson:(short giggle)
homer_simpson:(happy) oh, marge, i found the two thousand and cars.
marge_simpson:(frustrated sound)
lisa_simpson:(outraged) are you, you're too far to go?
boys:(skeptical) well, i'm gonna be here at the same time.
homer_simpson:(moans) why are you doing us by doing anything?
marge_simpson: well, it always seemed like i'm gonna be friends with...
homer_simpson:(seething) losers!
(simpson_home: int. simpson house -

Conclusion

We have trained a model to generate new text!

As you can see, the generated text does not really make much sense, but that's OK. This project was meant to show you how to prepare the data for training the model and to give a basic idea of how NLG works.

If you want, you can tune the hyperparameters, add more layers or change their size, and watch how the output of the model changes. A possible starting point for a deeper network is sketched below.
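Here is a hedged sketch of how get_init_cell() could be extended to stack several GRU layers; num_layers is a new parameter that does not appear in the original code:

# Sketch of a deeper variant of get_init_cell() with stacked GRU layers
# ('num_layers' is a hypothetical new parameter)
def get_init_cell_deep(batch_size, rnn_size, num_layers=2):
    cells = [tf.contrib.rnn.GRUCell(rnn_size) for _ in range(num_layers)]
    cell = tf.contrib.rnn.MultiRNNCell(cells)
    initial_state = tf.identity(cell.zero_state(batch_size, tf.float32), name='initial_state')
    return cell, initial_state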

GitHub

The code for this project is also available as a Jupyter Notebook in my GitHub repository.
