How to generate your own “The Simpsons” TV script using Deep Learning

Image by Kaggle.com (https://www.kaggle.com/wcukierski/the-simpsons-by-the-data)

Have you ever dreamed of creating your own episode of “The Simpsons”? I did.

That is what I thought when I saw the Simpsons dataset on Kaggle. It is the perfect dataset for a small “just for fun” project on Natural Language Generation (NLG).

What is Natural Language Generation (NLG)?

“Natural-language generation (NLG) is the aspect of language technology that focuses on generating natural language from structured data or structured representations such as a knowledge base or a logical form.”
(https://en.wikipedia.org/wiki/Natural-language_generation)

In this case we will see how to train a model that will be capable of creating new “Simpsons-Style” conversations. As input for the training we will use the file simpsons_script_lines.csv from the Simpsons dataset.

Downloading and preparing the data

First you need to download the data file. You can do this on the Kaggle website of “The Simpsons by the Data”. Download the file simpsons_script_lines.csv, save it to a folder “data” and unzip it. It should be ~34MB after unzipping.

If you look at the first lines of the file you will see that there are several columns in this CSV:

First lines of simpsons_script_lines.csv

For the training of the model we will only need the pure text without all the other features. So we need to extract it from the file.

The easiest way to read in the data would normally be Pandas' read_csv() function, but in this case it does not work. The file uses commas as separators, but there are a lot of unescaped commas in the dialogue text, which breaks the automatic parsing.
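To see the failure mode, here is the naive attempt (a minimal sketch, just for illustration); depending on your pandas version this typically aborts with a ParserError complaining about an unexpected number of fields:

import pandas as pd

# fails: dialogue lines contain unescaped commas, so the rows
# have inconsistent field counts
df = pd.read_csv('./data/simpsons_script_lines.csv')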

So we need to read the file as plain text and do the parsing using regular expressions.

import os
import re

data_dir = './data/simpsons_script_lines.csv'
input_file = os.path.join(data_dir)
clean_text = ''
with open(input_file, "r", encoding="utf8") as f:
    for line in f:
        # extract the raw text column via a regular expression
        text = re.search('[0-9]*,[0-9]*,[0-9]*,(.+?),[0-9]*,', line)
        if text:
            text = text.group(1).replace('"', '')
            # replace spaces in the character name with underscores
            text_parts = text.split(':')
            text_parts[0] = text_parts[0].replace(' ', '_')
            text = ':'.join(text_parts)
            clean_text += text + '\n'

print('\n'.join(clean_text.split('\n')[:10]))

The output of this script looks like this:

Miss_Hoover: No, actually, it was a little of both. Sometimes when a disease is in all the magazines and all the news shows, it's only natural that you think you have it.
Lisa_Simpson: (NEAR TEARS) Where's Mr. Bergstrom?
Miss_Hoover: I don't know. Although I'd sure like to talk to him. He didn't touch my lesson plan. What did he teach you?
Lisa_Simpson: That life is worth living.
Edna_Krabappel-Flanders: The polls will be open from now until the end of recess. Now, (SOUR) just in case any of you have decided to put any thought into this, we'll have our final statements. Martin?
Martin_Prince: (HOARSE WHISPER) I don't think there's anything left to say.
Edna_Krabappel-Flanders: Bart?
Bart_Simpson: Victory party under the slide!
(Apartment_Building: Ext. apartment building - day)
Lisa_Simpson: (CALLING) Mr. Bergstrom! Mr. Bergstrom!

Looking at the output you can see that we did more than extract the text: we also replaced the spaces in names with an underscore, so “Lisa Simpson” becomes “Lisa_Simpson”. This way we can use the names as starting words for the text generation step.

Data preprocessing

Before we can use this as input for training of our model we first need to do some extra preprocessing.

We’ll be splitting the script into a word array using spaces as delimiters. However, punctuation like periods and exclamation marks makes it hard for the Neural Network to distinguish between the word “bye” and “bye!”.

To solve this we create a dictionary that we will use to tokenize the symbols, adding the delimiter (space) around them. This separates the symbols from the words, making it easier for the Neural Network to predict the next word.

In the next step we will use this dictionary to replace the symbols and build the vocabulary and lookup tables for the words in the text.

from collections import Counter

tokenized_punctuation = {
    '.': '||Period||',
    ',': '||Comma||',
    '"': '||Quotation_Mark||',
    ';': '||Semicolon||',
    '!': '||Exclamation_Mark||',
    '?': '||Question_Mark||',
    '(': '||Left_Parentheses||',
    ')': '||Right_Parentheses||',
    '--': '||Dash||',
    '\n': '||Return||'
}

text = clean_text
for key, token in tokenized_punctuation.items():
    text = text.replace(key, ' {} '.format(token))

text = text.lower()
text = text.split()

# build the vocabulary and the lookup tables in both directions
word_counts = Counter(text)
sorted_vocab = sorted(word_counts, key=word_counts.get, reverse=True)
int_to_vocab = {ii: word for ii, word in enumerate(sorted_vocab)}
vocab_to_int = {word: ii for ii, word in int_to_vocab.items()}
int_text = [vocab_to_int[word] for word in text]
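As a quick sanity check using the lookup tables we just built, you can map a few encoded word ids back to words; the round trip should be lossless:

# encode/decode round trip
sample = int_text[:10]
print(sample)
print(' '.join(int_to_vocab[ii] for ii in sample))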

Build the Neural Network

Now that we have prepared the data it is time to create the Neural Network.

First we need to create Tensorflow Placeholders for input, targets and learning rate.

import tensorflow as tf

def get_inputs():
    input_placeholder = tf.placeholder(tf.int32, [None, None], name='input')
    targets_placeholder = tf.placeholder(tf.int32, [None, None])
    learning_rate_placeholder = tf.placeholder(tf.float32)
    return input_placeholder, targets_placeholder, learning_rate_placeholder

Next we create an RNN cell and initialize it.

def get_init_cell(batch_size, rnn_size):
    # a single-layer GRU cell wrapped in a MultiRNNCell
    gru = tf.contrib.rnn.GRUCell(rnn_size)
    cell = tf.contrib.rnn.MultiRNNCell([gru])
    initial_state = tf.identity(cell.zero_state(batch_size, tf.float32), name='initial_state')
    return cell, initial_state

Here we apply embedding to input_data using TensorFlow and return the embedded sequence.

def get_embed(input_data, vocab_size, embed_dim):
    embedding = tf.Variable(tf.random_uniform((vocab_size, embed_dim), -1, 1))
    embed = tf.nn.embedding_lookup(embedding, input_data)
    return embed
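To illustrate what this does (a minimal sketch with made-up sizes): the embedding lookup maps every word id to a dense vector, turning a (batch_size, seq_length) tensor of ids into a (batch_size, seq_length, embed_dim) tensor.

# hypothetical sizes, just to show the resulting shape
input_data = tf.placeholder(tf.int32, [None, None])
embed = get_embed(input_data, vocab_size=10000, embed_dim=256)
print(embed.shape)  # (?, ?, 256)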

We created an RNN cell in the get_init_cell() function. Time to use that cell to create an RNN.

def build_rnn(cell, inputs):
    outputs, state = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)
    final_state = tf.identity(state, name="final_state")
    return outputs, final_state

Now let’s put this all together to build the final Neural Network.

def build_nn(cell, rnn_size, input_data, vocab_size, embed_dim):
    embeddings = get_embed(input_data, vocab_size, embed_dim)
    outputs, final_state = build_rnn(cell, embeddings)
    logits = tf.contrib.layers.fully_connected(inputs=outputs, num_outputs=vocab_size, activation_fn=None)
    return logits, final_state

Training the Neural Network

For training the Neural Network we have to create batches of inputs and targets…

import numpy as np

def get_batches(int_text, batch_size, seq_length):
    n_batches = len(int_text) // (batch_size * seq_length)
    words = np.asarray(int_text[:n_batches * (batch_size * seq_length)])
    batches = np.zeros(shape=(n_batches, 2, batch_size, seq_length))
    input_sequences = words.reshape(-1, seq_length)
    # the targets are the inputs shifted by one word
    target_sequences = np.roll(words, -1)
    target_sequences = target_sequences.reshape(-1, seq_length)
    for idx in range(0, input_sequences.shape[0]):
        input_idx = idx % n_batches
        target_idx = idx // n_batches
        batches[input_idx, 0, target_idx, :] = input_sequences[idx, :]
        batches[input_idx, 1, target_idx, :] = target_sequences[idx, :]
    return batches
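To get a feeling for the resulting layout, here is a quick check with toy data (a minimal sketch, not part of the training pipeline): each batch holds an input block and a target block of shape (batch_size, seq_length), and each target sequence is its input sequence shifted by one word.

batches = get_batches(list(range(1000)), batch_size=4, seq_length=5)
print(batches.shape)     # (50, 2, 4, 5): n_batches, input/target, batch_size, seq_length
print(batches[0][0][0])  # first input sequence: [0. 1. 2. 3. 4.]
print(batches[0][1][0])  # its target, shifted by one: [1. 2. 3. 4. 5.]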

… and define hyperparameters for training.

# Number of Epochs
num_epochs = 50
# Batch Size
batch_size = 32
# RNN Size
rnn_size = 512
# Embedding Dimension Size
embed_dim = 256
# Sequence Length
seq_length = 16
# Learning Rate
learning_rate = 0.001
# Show stats for every n number of batches
show_every_n_batches = 200
# where to save the trained model
save_dir = './save'

Before we can start the training we need to build the graph.

from tensorflow.contrib import seq2seq

train_graph = tf.Graph()
with train_graph.as_default():
    vocab_size = len(int_to_vocab)
    input_text, targets, lr = get_inputs()
    input_data_shape = tf.shape(input_text)
    cell, initial_state = get_init_cell(input_data_shape[0], rnn_size)
    logits, final_state = build_nn(cell, rnn_size, input_text, vocab_size, embed_dim)

    # Probabilities for generating words
    probs = tf.nn.softmax(logits, name='probs')

    # Loss function
    cost = seq2seq.sequence_loss(
        logits,
        targets,
        tf.ones([input_data_shape[0], input_data_shape[1]]))

    # Optimizer
    optimizer = tf.train.AdamOptimizer(lr)

    # Gradient clipping keeps the updates stable
    gradients = optimizer.compute_gradients(cost)
    capped_gradients = [(tf.clip_by_value(grad, -1., 1.), var) for grad, var in gradients if grad is not None]
    train_op = optimizer.apply_gradients(capped_gradients)

Now we can start training the Neural Network on the preprocessed data. This will take a while: on my GTX 1080 Ti the training took roughly four hours with the parameters above.

batches = get_batches(int_text, batch_size, seq_length)

with tf.Session(graph=train_graph) as sess:
    sess.run(tf.global_variables_initializer())
    for epoch_i in range(num_epochs):
        state = sess.run(initial_state, {input_text: batches[0][0]})
        for batch_i, (x, y) in enumerate(batches):
            feed = {
                input_text: x,
                targets: y,
                initial_state: state,
                lr: learning_rate}
            train_loss, state, _ = sess.run([cost, final_state, train_op], feed)

            # Show stats every <show_every_n_batches> batches
            if (epoch_i * len(batches) + batch_i) % show_every_n_batches == 0:
                print('Epoch {:>3} Batch {:>4}/{}   train_loss = {:.3f}'.format(
                    epoch_i,
                    batch_i,
                    len(batches),
                    train_loss))

    # Save Model
    saver = tf.train.Saver()
    saver.save(sess, save_dir)
    print('Model Trained and Saved')

The generated output while training should look like this:

...
Epoch  49 Batch 1186/4686   train_loss = 1.737
Epoch  49 Batch 1386/4686   train_loss = 1.839
Epoch  49 Batch 1586/4686   train_loss = 2.050
Epoch  49 Batch 1786/4686   train_loss = 1.798
Epoch  49 Batch 1986/4686   train_loss = 1.751
Epoch  49 Batch 2186/4686   train_loss = 1.680
Epoch  49 Batch 2386/4686   train_loss = 1.641
Epoch  49 Batch 2586/4686   train_loss = 1.912
Epoch  49 Batch 2786/4686   train_loss = 1.811
Epoch  49 Batch 2986/4686   train_loss = 1.949
Epoch  49 Batch 3186/4686   train_loss = 1.821
Epoch  49 Batch 3386/4686   train_loss = 1.664
Epoch  49 Batch 3586/4686   train_loss = 1.735
Epoch  49 Batch 3786/4686   train_loss = 2.175
Epoch  49 Batch 3986/4686   train_loss = 1.710
Epoch  49 Batch 4186/4686   train_loss = 1.969
Epoch  49 Batch 4386/4686   train_loss = 2.055
Epoch  49 Batch 4586/4686   train_loss = 1.862
Model Trained and Saved

Generate TV Script

When training is finished we are at the last step of this project: generating a new TV Script for “The Simpsons”!

To start we need to get the tensors from the loaded graph…

def get_tensors(loaded_graph):
    input_tensor = loaded_graph.get_tensor_by_name('input:0')
    initial_state_tensor = loaded_graph.get_tensor_by_name('initial_state:0')
    final_state_tensor = loaded_graph.get_tensor_by_name('final_state:0')
    probs_tensor = loaded_graph.get_tensor_by_name('probs:0')
    return input_tensor, initial_state_tensor, final_state_tensor, probs_tensor

… and a function to select the next word using probabilities.

def pick_word(probabilities, int_to_vocab):
    # pick the word with the highest probability (greedy decoding)
    word_id = np.argmax(probabilities)
    word_string = int_to_vocab[word_id]
    return word_string
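Note that np.argmax is greedy: the generator always picks the single most likely word, which tends to make the output repetitive. As a possible variation (not part of the original project), you could sample the next word from the predicted distribution instead:

def pick_word_sampled(probabilities, int_to_vocab):
    # renormalize to guard against floating point drift, then sample
    p = probabilities / probabilities.sum()
    word_id = np.random.choice(len(p), p=p)
    return int_to_vocab[word_id]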

And finally we are ready to generate the TV script. Set gen_length to the length of TV script you want to generate.

gen_length = 500
"""
The prime word is used as the start word for the text generation.
To generate different text try different prime words like:
    'marge_simpson'
    'bart_simpson'
    'lisa_simpson'
    'seymour_skinner'
    'chief_wiggum'
    'judge_snyder'
"""
prime_word = 'homer_simpson'

loaded_graph = tf.Graph()
with tf.Session(graph=loaded_graph) as sess:
    # Load saved model
    loader = tf.train.import_meta_graph(save_dir + '.meta')
    loader.restore(sess, save_dir)

    # Get Tensors from loaded model
    input_text, initial_state, final_state, probs = get_tensors(loaded_graph)

    # Sentences generation setup
    gen_sentences = [prime_word + ':']
    prev_state = sess.run(initial_state, {input_text: np.array([[1]])})

    # Generate sentences
    for n in range(gen_length):
        # Dynamic Input
        dyn_input = [[vocab_to_int[word] for word in gen_sentences[-seq_length:]]]
        dyn_seq_length = len(dyn_input[0])

        # Get Prediction
        probabilities, prev_state = sess.run(
            [probs, final_state],
            {input_text: dyn_input, initial_state: prev_state})
        pred_word = pick_word(probabilities[0][dyn_seq_length - 1], int_to_vocab)
        gen_sentences.append(pred_word)

    # Replace the punctuation tokens with the original symbols again
    tv_script = ' '.join(gen_sentences)
    for key, token in tokenized_punctuation.items():
        tv_script = tv_script.replace(' ' + token.lower(), key)
    tv_script = tv_script.replace('\n ', '\n')
    tv_script = tv_script.replace('( ', '(')

print(tv_script)

This should give you an output like this:

INFO:tensorflow:Restoring parameters from ./save
homer_simpson:(moans)
marge_simpson:(annoyed murmur)
homer_simpson:(annoyed grunt)
(moe's_tavern: ext. moe's - night)
homer_simpson:(to moe) this is a great idea, children. now, what are we playing here?
bart_simpson:(horrified gasp)
(simpson_home: ext. simpson house - day - establishing)
homer_simpson:(worried) i've got a wet!
homer_simpson:(faking enthusiasm) well, maybe i could kiss my little girl. mine!
(department int. sports arena - night)
seymour_skinner:(chuckles)
chief_wiggum:(laughing) oh, i get it.
seymour_skinner:(snapping) i guess this building is quiet.
homer_simpson:(stunned) what? how'd you like that?
professor_jonathan_frink: uh, well, looks like the little bit of you.
bart_simpson:(to larry) i guess this is clearly justin, right?
homer_simpson:(dismissive snort) oh, i am.
marge_simpson:(pained) hi.
homer_simpson:(pained sound) i thought you might have some good choice.
homer_simpson:(pained) oh, sorry.
(simpson_home: int. simpson house - living room - day)
marge_simpson:(concerned) okay, open your door.
homer_simpson: don't push, marge. we'll be fine.
judge_snyder:(sarcastic) children, you want a night?
homer_simpson:(gulp) oh, i can't believe i wasn't in a car.
chief_wiggum:(to selma) i can't find this map. and she's gonna release that?
homer_simpson:(lots of hair) just like me.
homer_simpson:(shrugs) gimme a try.
homer_simpson:(sweetly) i don't know, but i don't remember that.
marge_simpson:(brightening) are you all right?
homer_simpson: absolutely...
lisa_simpson:(mad) even better!
homer_simpson:(hums)
marge_simpson: oh, homie. that's a doggie door.
homer_simpson:(moan) i don't have computers.
homer_simpson:(hopeful) honey?(makes fake companies break) are you okay?
marge_simpson:(short giggle)
homer_simpson:(happy) oh, marge, i found the two thousand and cars.
marge_simpson:(frustrated sound)
lisa_simpson:(outraged) are you, you're too far to go?
boys:(skeptical) well, i'm gonna be here at the same time.
homer_simpson:(moans) why are you doing us by doing anything?
marge_simpson: well, it always seemed like i'm gonna be friends with...
homer_simpson:(seething) losers!
(simpson_home: int. simpson house -

Conclusion

We have trained a model to generate new text!

As you can see the text does not really make any sense, but that’s ok. This project was meant to show you how to prepare the data for training the model and to give a basic idea on how NLG works.

If you want, you can tune the parameters, add more layers, or change their size, and watch how the output of the model changes.

Github

The code for this project is also available as a Jupyter Notebook in my GitHub repository.

Table Detection using Deep Learning

Books on bookshelves by Mikes Photos

For a specific task I had to solve, I recently came across an interesting paper:

Gilani, Azka & Rukh Qasim, Shah & Malik, Imran & Shafait, Faisal. (2017). Table Detection Using Deep Learning. 10.1109/ICDAR.2017.131.

“Table detection is a crucial step in many document analysis applications as tables are used for presenting essential information to the reader in a structured manner. It is a hard problem due to varying layouts and encodings of the tables. Researchers have proposed numerous techniques for table detection based on layout analysis of documents. Most of these techniques fail to generalize because they rely on hand engineered features which are not robust to layout variations. In this paper, we have presented a deep learning based method for table detection. In the proposed method, document images are first pre-processed. These images are then fed to a Region Proposal Network followed by a fully connected neural network for table detection. The proposed method works with high precision on document images with varying layouts that include documents, research papers, and magazines. We have done our evaluations on publicly available UNLV dataset where it beats Tesseract’s state of the art table detection system by a significant margin.”

I decided to give it a try.

So — what do we need to implement this?

Required Libraries

Before we go on, make sure you have everything installed to be able to follow the steps described here.

The following will be required to follow the instructions:

  • Python 3 (I use Anaconda)
  • pandas
  • Pillow
  • opencv-python
  • Luminoth (which will also install Tensorflow)

The Dataset

First, we need the data. Going through the paper I found links pointing to a website with XML files containing the ground truth for the UNLV dataset, but to keep things simple I will provide an already prepared dataset based on those two sources to start with.

You can download the dataset here; please extract it to a directory “data”.

In the “data/images/” folder we have 403 image files from different types of documents like this one:

Sample image from the dataset

In addition to the images there are also two CSV files with the ground truth data for this dataset. Each file contains one line per table found in each image, in the following format:

<filename>, <xmin>, <ymin>, <xmax>, <ymax>, <class> (in our case “class” will always be “table”)

The first lines of the train.csv file look like this:

0101_003.png,770,946,2070,2973,table
0110_099.png,270,1653,2280,2580,table
0113_013.png,303,343,2273,2953,table
0140_007.png,664,1782,1814,2076,table
0146_281.png,704,432,1744,1552,table
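Since these ground truth files have no header row, you can load them with pandas by supplying the column names yourself (a small sketch; the variable names are my own):

import pandas as pd

columns = ['filename', 'xmin', 'ymin', 'xmax', 'ymax', 'class']
ground_truth = pd.read_csv('data/train.csv', header=None, names=columns)
print(ground_truth.head())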

Preprocessing of images

The first part of the process is the preprocessing of the images. As the text elements in documents are very small, and the network is normally used for detecting real-world objects, we need to process the images to make their contents easier for the object detection network to understand.

We will do this in the following steps:

  1. open csv file
  2. read in all image file names in that file

for each image:

  1. preprocess image
  2. save image to data/train (for files from train.csv) or to data/val (for files from val.csv)

Let’s do this!

import os
import cv2
import pandas as pd

root_dir = os.getcwd()
file_list = ['train.csv', 'val.csv']
image_source_dir = os.path.join(root_dir, 'data/images/')
data_root = os.path.join(root_dir, 'data')

for file in file_list:
    image_target_dir = os.path.join(data_root, file.split(".")[0])
    # make sure the target directory exists
    os.makedirs(image_target_dir, exist_ok=True)
    # read list of image files to process from file
    image_list = pd.read_csv(os.path.join(data_root, file), header=None)[0]
    print("Start preprocessing images")
    for image in image_list:
        # open image file and convert to grayscale
        img = cv2.imread(os.path.join(image_source_dir, image))
        img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        # perform distance transformations on the image
        b = cv2.distanceTransform(img, distanceType=cv2.DIST_L2, maskSize=5)
        g = cv2.distanceTransform(img, distanceType=cv2.DIST_L1, maskSize=5)
        r = cv2.distanceTransform(img, distanceType=cv2.DIST_C, maskSize=5)
        # merge the transformed channels back to a 3-channel image
        transformed_image = cv2.merge((b, g, r))
        target_file = os.path.join(image_target_dir, image)
        print("Writing target file {}".format(target_file))
        cv2.imwrite(target_file, transformed_image)

After you have done this there should be two additional directories in your “data” folder: “train” and “val”. These hold the preprocessed image files we will later use for training and validating the results.

The images in those folders should now look like this:

But before we start training the network there is one additional step that has to be done.

Creating TFRecords for training the network

Now that we have the preprocessed files, the next step is to create the files needed as input for the training. Here we will use the Luminoth framework for the first time.

As Luminoth is based on Tensorflow we need to create TFRecords which will be used as input for the training process. Luckily, Luminoth has some converters which you can use to transform your dataset accordingly.

To do this we will use the command line tool “lumi” which comes with Luminoth. In the directory where you placed the “data” folder open a terminal or command line and type:

lumi dataset transform --type csv --data-dir data/ --output-dir tfdata/ --split train --split val --only-classes=table

This will create a folder called “tfdata” with the TFRecords needed for the training of the network.

Training the network

To start the training of the network with Luminoth we need to configure the training process.

This is done by writing a configuration file — there is a sample file available in the Luminoth Git repo, which I used to create a simple configuration (config.yml) for our task at hand:

train:
  # Name used to identify the run. Data inside `job_dir` will be stored under
  # `run_name`.
  run_name: table-area-detection-0.1
  # Base directory in which model checkpoints & summaries (for Tensorboard)
  # will be saved.
  job_dir: jobs/
  save_checkpoint_secs: 10
  save_summaries_secs: 10
  # Number of epochs (complete dataset batches) to run.
  num_epochs: 10

dataset:
  type: object_detection
  # From which directory to read the dataset.
  dir: tfdata/classes-table/
  image_preprocessing:
    min_size: 600
    max_size: 1024
  data_augmentation:
    - flip:
        left_right: True
        up_down: True
        prob: 0.5

model:
  type: fasterrcnn
  network:
    # Total number of classes to predict.
    num_classes: 1

Save this file to your working directory and we can start the training. Again we will use the “lumi” tool from Luminoth, so go to the terminal or command line (you should be in the folder with the data):

lumi train -c config.yml

This will start the training process and you should see output like this:

It can take quite a while to train the network. Once the loss gets close to 1.0 you can stop the training with <ctrl + c>.

Ok, now we have a trained network — what next?

Using the trained network to make predictions

To use the trained network to make predictions we first need to create a checkpoint.

In the terminal or command line window type the following:

lumi checkpoint create config.yml

You will see something similar to this:

The last line with “Checkpoint c2155084dca6 created successfully.” holds the important information: the id of the created checkpoint (in this case c2155084dca6).

This is the identifier you need for the prediction for new images and if you want to load the model to the lumi webserver.

First we will use the command line tool to make a prediction (make sure to use the id of your checkpoint instead of c2155084dca6):

lumi predict --checkpoint c2155084dca6 data/val/9541_023.png

You should see something like the following:

The interesting part for us is “bbox”: the numbers show the coordinates of the table area, with x0 = 160, y0 = 657 (upper left corner of the area) and x1 = 2346, y1 = 2211 (lower right corner of the area). This information can be used to mark the area in the original, unprocessed image and looks like this:

So the network seems to have a good idea where the table can be found on that page.

You can try that on your own: take the predict command above together with the id of your trained checkpoint and use an image tool of your choice to draw an area with the given coordinates on the image, as sketched below. You will see that it fits the area nicely around the table on the page.
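For example with OpenCV (a minimal sketch; the file name and coordinates are the ones from the prediction above, so substitute your own):

import cv2

# draw the predicted table area on the original, unprocessed image
img = cv2.imread('data/images/9541_023.png')
cv2.rectangle(img, (160, 657), (2346, 2211), color=(0, 255, 0), thickness=5)
cv2.imwrite('prediction_9541_023.png', img)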

If you want a quick view on the prediction you can also use the small web application that comes with Luminoth, but this can only be used with the preprocessed files.

To use it start the web server with the command (again — make sure to use the id of your checkpoint instead of c2155084dca6):

lumi server web --checkpoint c2155084dca6

This will start the web server that comes with Luminoth, where you can upload the preprocessed images to see predictions like this:

Although this has to be done using the preprocessed images, you still get an idea of how well the network detects the table areas.

Conclusion

In this article I gave a brief overview on how to implement the concept described in the research paper.

But detecting the table area alone is of limited practical use: you still need tools or libraries that can take these area definitions as input to actually extract the content of the tables.

For those of you who want to take this further, I recommend a look at tabula-py, a Python library that can take area definitions as input to improve the accuracy of extracting table data.
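As a rough idea of how that could look (a hypothetical sketch: tabula-py works on PDFs and expects the area as top/left/bottom/right in points, so the detected pixel coordinates would first have to be converted to the PDF coordinate system):

import tabula

# extract the table from page 1, restricted to the detected area;
# area = (top, left, bottom, right), here taken from the predicted bbox
# read_pdf returns a list of DataFrames, one per detected table
tables = tabula.read_pdf('document.pdf', pages=1, area=(657, 160, 2211, 2346))
print(tables[0].head())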

Hmm…. maybe a good topic for a second article? 😉