
Have you ever dreamed of creating your own episode of “The Simpsons”? I did.
That is what I thought when I saw the Simpsons dataset on Kaggle. It is the perfect dataset for a small “just for fun” project on Natural Language Generation (NLG).
What is Natural Language Generation (NLG)?
“Natural-language generation (NLG) is the aspect of language technology that focuses on generating natural language from structured data or structured representations such as a knowledge base or a logical form.”
(https://en.wikipedia.org/wiki/Natural-language_generation)
In this case we will see how to train a model that will be capable of creating new “Simpsons-Style” conversations. As input for the training we will use the file simpsons_script_lines.csv from the Simpsons dataset.
Downloading and preparing the data
First you need to download the data file. You can get it from the Kaggle page of “The Simpsons by the Data”. Download the file simpsons_script_lines.csv, save it to a folder named “data” and unzip it. The unzipped file should be about 34 MB.
If you look at the first lines of the file you will see that there are several columns in this CSV:

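If you want to take a quick look yourself, here is a minimal sketch (it only assumes the same file path that the extraction code below uses) that prints the header and the first two raw rows:

# Quick peek at the raw CSV: header plus the first two data rows
with open('./data/simpsons_script_lines.csv', 'r', encoding='utf8') as f:
    for _ in range(3):
        print(f.readline().rstrip())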
For the training of the model we will only need the pure text without all the other features. So we need to extract it from the file.
Normally the easiest way to read in the data would be Pandas’ read_csv() function, but in this case it does not work: the file uses commas as separators, yet there are a lot of unescaped commas in the text itself, which breaks the automatic parsing.
So we need to read the file as plain text and do the parsing using regular expressions.
import os
import re

data_dir = './data/simpsons_script_lines.csv'
input_file = os.path.join(data_dir)
clean_text = ''

with open(input_file, "r", encoding="utf8") as f:
    for line in f:
        # Grab the text field: it follows three numeric columns and is followed by another numeric column
        text = re.search('[0-9]*,[0-9]*,[0-9]*,(.+?),[0-9]*,', line)
        if text:
            text = text.group(1).replace('"', '')
            # Replace the spaces in the speaker's name with underscores ("Lisa Simpson" -> "Lisa_Simpson")
            text_parts = text.split(':')
            text_parts[0] = text_parts[0].replace(' ', '_')
            text = ':'.join(text_parts)
            clean_text += text + '\n'

print('\n'.join(clean_text.split('\n')[:10]))
The output of this script looks like this:
Miss_Hoover: No, actually, it was a little of both. Sometimes when a disease is in all the magazines and all the news shows, it's only natural that you think you have it.
Lisa_Simpson: (NEAR TEARS) Where's Mr. Bergstrom?
Miss_Hoover: I don't know. Although I'd sure like to talk to him. He didn't touch my lesson plan. What did he teach you?
Lisa_Simpson: That life is worth living.
Edna_Krabappel-Flanders: The polls will be open from now until the end of recess. Now, (SOUR) just in case any of you have decided to put any thought into this, we'll have our final statements. Martin?
Martin_Prince: (HOARSE WHISPER) I don't think there's anything left to say.
Edna_Krabappel-Flanders: Bart?
Bart_Simpson: Victory party under the slide!
(Apartment_Building: Ext. apartment building - day)
Lisa_Simpson: (CALLING) Mr. Bergstrom! Mr. Bergstrom!
Looking at the output you can see that we not only extracted the text, we also replaced the spaces in the character names with underscores, so “Lisa Simpson” becomes “Lisa_Simpson”. This way we can use the names as starting words for the text generation step.
Data preprocessing
Before we can use this text as input for training our model, we first need to do some extra preprocessing.
We will split the script into a word array using spaces as delimiters. However, punctuation such as periods and exclamation marks makes it hard for the Neural Network to distinguish between the word “bye” and “bye!”.
To solve this we create a dictionary that we will use to tokenize the symbols and add a delimiter (space) around each of them. This separates the symbols from the words, making it easier for the Neural Network to predict the next word.
In the next step we use this dictionary to replace the symbols and build the vocabulary and lookup tables for the words in the text.
from collections import Counter

# Map punctuation symbols to unique tokens so they become separate "words"
tokenized_punctuation = {
    '.' : '||Period||',
    ',' : '||Comma||',
    '"' : '||Quotation_Mark||',
    ';' : '||Semicolon||',
    '!' : '||Exclamation_Mark||',
    '?' : '||Question_Mark||',
    '(' : '||Left_Parentheses||',
    ')' : '||Right_Parentheses||',
    '--' : '||Dash||',
    '\n' : '||Return||'
}

text = clean_text
for key, token in tokenized_punctuation.items():
    text = text.replace(key, ' {} '.format(token))

text = text.lower()
text = text.split()

# Build the vocabulary and the lookup tables
word_counts = Counter(text)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
int_text = [vocab_to_int[word] for word in text]
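As a quick sanity check (not part of the original preprocessing, just illustrative), you can map the first few word ids back to their words and verify the round trip:

# Illustrative check: the first ten ids and the words they map back to
sample_ids = int_text[:10]
print(sample_ids)
print(' '.join(int_to_vocab[i] for i in sample_ids))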
Build the Neural Network
Now that we have prepared the data it is time to create the Neural Network.
First we need to create TensorFlow placeholders for the input, the targets and the learning rate.
import tensorflow as tf

def get_inputs():
    # Placeholders for the input sequences, the target sequences and the learning rate
    input_placeholder = tf.placeholder(tf.int32, [None, None], name='input')
    targets_placeholder = tf.placeholder(tf.int32, [None, None])
    learning_rate_placeholder = tf.placeholder(tf.float32)
    return input_placeholder, targets_placeholder, learning_rate_placeholder
Next we create an RNN cell (a GRU wrapped in a MultiRNNCell) and initialize its state.
def get_init_cell(batch_size, rnn_size):
    lstm = tf.contrib.rnn.GRUCell(rnn_size)
    cell = tf.contrib.rnn.MultiRNNCell([lstm])
    initial_state = tf.identity(cell.zero_state(batch_size, tf.float32), name='initial_state')
    return cell, initial_state
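The conclusion at the end of this post suggests experimenting with more layers. One possible way to do that, sketched here as an assumption rather than code from the project (the num_layers parameter is hypothetical), is to stack several GRU cells inside the MultiRNNCell:

def get_init_cell_stacked(batch_size, rnn_size, num_layers=2):
    # Stack num_layers GRU cells for a deeper network (num_layers is not part of the original code)
    cell = tf.contrib.rnn.MultiRNNCell(
        [tf.contrib.rnn.GRUCell(rnn_size) for _ in range(num_layers)])
    initial_state = tf.identity(cell.zero_state(batch_size, tf.float32), name='initial_state')
    return cell, initial_state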
Here we apply embedding to input_data using TensorFlow and return the embedded sequence.
def get_embed(input_data, vocab_size, embed_dim):
    embedding = tf.Variable(tf.random_uniform((vocab_size, embed_dim), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, input_data)
    return embed
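As an aside (illustrative only, with made-up sizes): for a batch of token ids of shape [batch_size, seq_length], get_embed() returns a tensor of shape [batch_size, seq_length, embed_dim]. A quick static shape check looks like this:

# Illustrative shape check for the embedding layer (sizes are made up)
with tf.Graph().as_default():
    ids = tf.placeholder(tf.int32, [None, None])
    embedded = get_embed(ids, vocab_size=100, embed_dim=8)
    print(embedded.get_shape().as_list())  # [None, None, 8]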
We created an RNN cell in the get_init_cell() function. Time to use that cell to create the RNN.
def build_rnn(cell, inputs):
    outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
    final_state = tf.identity(state, name="final_state")
    return outputs, final_state
Now let’s put this all together to build the final Neural Network.
def build_nn(cell, rnn_size, input_data, vocab_size, embed_dim):
    embeddings = get_embed(input_data, vocab_size, embed_dim)
    outputs, final_state = build_rnn(cell, embeddings)
    logits = tf.contrib.layers.fully_connected(inputs=outputs, num_outputs=vocab_size, activation_fn=None)
    return logits, final_state
Training the Neural Network
For training the Neural Network we have to create batches of inputs and targets…
import numpy as np

def get_batches(int_text, batch_size, seq_length):
    n_batches = len(int_text) // (batch_size * seq_length)
    # Drop the words that don't fit into a full batch
    words = np.asarray(int_text[:n_batches * (batch_size * seq_length)])
    batches = np.zeros(shape=(n_batches, 2, batch_size, seq_length))
    input_sequences = words.reshape(-1, seq_length)
    # Targets are the inputs shifted by one word
    target_sequences = np.roll(words, -1)
    target_sequences = target_sequences.reshape(-1, seq_length)
    for idx in range(0, input_sequences.shape[0]):
        input_idx = idx % n_batches
        target_idx = idx // n_batches
        batches[input_idx, 0, target_idx, :] = input_sequences[idx, :]
        batches[input_idx, 1, target_idx, :] = target_sequences[idx, :]
    return batches
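To get a feeling for the layout of the returned array, here is a small illustrative check with made-up numbers: 1000 word ids with batch_size=4 and seq_length=5 give 50 batches, each holding an input block and a target block.

# Illustrative only: 1000 ids, batch_size=4, seq_length=5 -> 50 batches
example_batches = get_batches(list(range(1000)), batch_size=4, seq_length=5)
print(example_batches.shape)  # (50, 2, 4, 5)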
… and define hyperparameters for training.
# Number of Epochs
num_epochs = 50
# Batch Size
batch_size = 32
# RNN Size
rnn_size = 512
# Embedding Dimension Size
embed_dim = 256
# Sequence Length
seq_length = 16
# Learning Rate
learning_rate = 0.001
# Show stats for every n number of batches
show_every_n_batches = 200
# Where to save the trained model
save_dir = './save'
Before we can start the training we need to build the graph.
from tensorflow.contrib import seq2seq

train_graph = tf.Graph()
with train_graph.as_default():
    vocab_size = len(int_to_vocab)
    input_text, targets, lr = get_inputs()
    input_data_shape = tf.shape(input_text)
    cell, initial_state = get_init_cell(input_data_shape[0], rnn_size)
    logits, final_state = build_nn(cell, rnn_size, input_text, vocab_size, embed_dim)

    # Probabilities for generating words
    probs = tf.nn.softmax(logits, name='probs')

    # Loss function
    cost = seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_data_shape[0], input_data_shape[1]]))

    # Optimizer
    optimizer = tf.train.AdamOptimizer(lr)

    # Gradient Clipping
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)
Now we can start training the Neural Network on the preprocessed data. This will take a while: on my GTX 1080 Ti the training took roughly 4 hours to complete using the parameters above.
batches = get_batches(int_text, batch_size, seq_length)

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})
        for batch_i, (x, y) in enumerate(batches):
            feed = {
                input_text: x,
                targets: y,
                initial_state: state,
                lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)

            # Show every <show_every_n_batches> batches
            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                print('Epoch {:>3} Batch {:>4}/{} train_loss = {:.3f}'.format(
                    epoch_i,
                    batch_i,
                    len(batches),
                    train_loss))

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_dir)
    print('Model Trained and Saved')
The generated output while training should look like this:
...
Epoch 49 Batch 1186/4686 train_loss = 1.737
Epoch 49 Batch 1386/4686 train_loss = 1.839
Epoch 49 Batch 1586/4686 train_loss = 2.050
Epoch 49 Batch 1786/4686 train_loss = 1.798
Epoch 49 Batch 1986/4686 train_loss = 1.751
Epoch 49 Batch 2186/4686 train_loss = 1.680
Epoch 49 Batch 2386/4686 train_loss = 1.641
Epoch 49 Batch 2586/4686 train_loss = 1.912
Epoch 49 Batch 2786/4686 train_loss = 1.811
Epoch 49 Batch 2986/4686 train_loss = 1.949
Epoch 49 Batch 3186/4686 train_loss = 1.821
Epoch 49 Batch 3386/4686 train_loss = 1.664
Epoch 49 Batch 3586/4686 train_loss = 1.735
Epoch 49 Batch 3786/4686 train_loss = 2.175
Epoch 49 Batch 3986/4686 train_loss = 1.710
Epoch 49 Batch 4186/4686 train_loss = 1.969
Epoch 49 Batch 4386/4686 train_loss = 2.055
Epoch 49 Batch 4586/4686 train_loss = 1.862
Model Trained and Saved
Generate TV Script
When training is finished we are at the last step of this project: generating a new TV Script for “The Simpsons”!
To start we need to get the tensors from loaded_graph …
def get_tensors(loaded_graph):
    input_tensor = loaded_graph.get_tensor_by_name('input:0')
    initial_state_tensor = loaded_graph.get_tensor_by_name('initial_state:0')
    final_state_tensor = loaded_graph.get_tensor_by_name('final_state:0')
    probs_tensor = loaded_graph.get_tensor_by_name('probs:0')
    return input_tensor, initial_state_tensor, final_state_tensor, probs_tensor
… and a function to select the next word using the predicted probabilities.
def pick_word(probabilities, int_to_vocab):
    # Greedy selection: always pick the most probable next word
    word_id = np.argmax(probabilities)
    word_string = int_to_vocab[word_id]
    return word_string
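pick_word() always takes the most likely word, which can make the generated text repetitive. A common alternative, shown here only as a sketch and not used in the rest of this post, is to sample the next word from the predicted distribution:

def pick_word_sampled(probabilities, int_to_vocab):
    # Sample a word id according to the predicted probability distribution
    word_id = np.random.choice(len(probabilities), p=probabilities)
    return int_to_vocab[word_id]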
And finally we are ready to generate the TV script. Set gen_length to the length of the TV script you want to generate.
gen_length = 500

"""
The prime word is used as the start word for the text generation.
To generate different text try different prime words like:
    'marge_simpson'
    'bart_simpson'
    'lisa_simpson'
    'seymour_skinner'
    'chief_wiggum'
    'judge_snyder'
"""
prime_word = 'homer_simpson'

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(save_dir + '.meta')
    loader.restore(sess, save_dir)

    # Get Tensors from loaded model
    input_text, initial_state, final_state, probs = get_tensors(loaded_graph)

    # Sentences generation setup
    gen_sentences = [prime_word + ':']
    prev_state = sess.run(initial_state, {input_text: np.array([[1]])})

    # Generate sentences
    for n in range(gen_length):
        # Dynamic Input
        dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
        dyn_seq_length = len(dyn_input[0])

        # Get Prediction
        probabilities, prev_state = sess.run(
            [probs, final_state],
            {input_text: dyn_input, initial_state: prev_state})
        pred_word = pick_word(probabilities[0][dyn_seq_length - 1], int_to_vocab)
        gen_sentences.append(pred_word)

    # Remove tokens
    tv_script = ' '.join(gen_sentences)
    for key, token in tokenized_punctuation.items():
        ending = ' ' if key in ['\n', '(', '"'] else ''
        tv_script = tv_script.replace(' ' + token.lower(), key)
    tv_script = tv_script.replace('\n ', '\n')
    tv_script = tv_script.replace('( ', '(')

    print(tv_script)
This should give you an output like this:
INFO:tensorflow:Restoring parameters from ./save
homer_simpson:(moans)
marge_simpson:(annoyed murmur)
homer_simpson:(annoyed grunt)
(moe's_tavern: ext. moe's - night)
homer_simpson:(to moe) this is a great idea, children. now, what are we playing here?
bart_simpson:(horrified gasp)
(simpson_home: ext. simpson house - day - establishing)
homer_simpson:(worried) i've got a wet!
homer_simpson:(faking enthusiasm) well, maybe i could kiss my little girl. mine!
(department int. sports arena - night)
seymour_skinner:(chuckles)
chief_wiggum:(laughing) oh, i get it.
seymour_skinner:(snapping) i guess this building is quiet.
homer_simpson:(stunned) what? how'd you like that?
professor_jonathan_frink: uh, well, looks like the little bit of you.
bart_simpson:(to larry) i guess this is clearly justin, right?
homer_simpson:(dismissive snort) oh, i am.
marge_simpson:(pained) hi.
homer_simpson:(pained sound) i thought you might have some good choice.
homer_simpson:(pained) oh, sorry.
(simpson_home: int. simpson house - living room - day)
marge_simpson:(concerned) okay, open your door.
homer_simpson: don't push, marge. we'll be fine.
judge_snyder:(sarcastic) children, you want a night?
homer_simpson:(gulp) oh, i can't believe i wasn't in a car.
chief_wiggum:(to selma) i can't find this map. and she's gonna release that?
homer_simpson:(lots of hair) just like me.
homer_simpson:(shrugs) gimme a try.
homer_simpson:(sweetly) i don't know, but i don't remember that.
marge_simpson:(brightening) are you all right?
homer_simpson: absolutely...
lisa_simpson:(mad) even better!
homer_simpson:(hums)
marge_simpson: oh, homie. that's a doggie door.
homer_simpson:(moan) i don't have computers.
homer_simpson:(hopeful) honey?(makes fake companies break) are you okay?
marge_simpson:(short giggle)
homer_simpson:(happy) oh, marge, i found the two thousand and cars.
marge_simpson:(frustrated sound)
lisa_simpson:(outraged) are you, you're too far to go?
boys:(skeptical) well, i'm gonna be here at the same time.
homer_simpson:(moans) why are you doing us by doing anything?
marge_simpson: well, it always seemed like i'm gonna be friends with...
homer_simpson:(seething) losers!
(simpson_home: int. simpson house -
Conclusion
We have trained a model to generate new text!
As you can see, the text does not really make much sense, but that's OK. This project was meant to show you how to prepare the data for training the model and to give a basic idea of how NLG works.
If you want, you can tune the hyperparameters, add more layers or change their size, and watch how the output of the model changes.
GitHub
The code for this project is also available as a Jupyter Notebook in my GitHub repository.