Lab 10 - Introduction to Recurrent Neural Networks¶

This lab will cover a few basic implementations of recurrent neural networks in torch, using an application where the data are the surnames of individuals from six different nationalities.

To begin, you'll need to import the following libraries:

In [1]:
import torch
import torchvision
import string

Next, the data used throughout the lab is stored in a zipped folder available here: https://remiller1450.github.io/data/surnames.zip

This folder contains 6 different text files, one per nationality, each containing a list of unique surnames observed among members of that nationality (one name per line). After modifying the root path, you may use the following code to load these text files into Python lists:

In [2]:
root = 'C:/Users/millerry/OneDrive - Grinnell College/Documents/surnames/'
Chinese = open(root+'Chinese.txt', encoding='utf-8').read().strip().split('\n')
Japanese = open(root+'Japanese.txt', encoding='utf-8').read().strip().split('\n')
Korean = open(root+'Korean.txt', encoding='utf-8').read().strip().split('\n')
English = open(root+'English.txt', encoding='utf-8').read().strip().split('\n')
Irish = open(root+'Irish.txt', encoding='utf-8').read().strip().split('\n')
Russian = open(root+'Russian.txt', encoding='utf-8').read().strip().split('\n')

Part 1 - Data Preparation¶

Recurrent neural networks are designed to work with sequential data, and our models throughout the lab will treat each character within a name as a sequential observation. This framework requires us to represent the individual characters in a name using one-hot vectors.

To facilitate this process, we'll start by defining a helper function that converts a single line of text (name) into a tensor that contains one-hot vectors representing each character.

In [3]:
## We'll consider all ascii letters plus basic punctuation
all_letters = string.ascii_letters + " .,;'"
n_letters = len(all_letters)

## Function to iterate through a line of text and encode each letter as a 1 x 57 vector in an nchar x 1 x 57 tensor
def nameToTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li, letter in enumerate(line):
        tensor[li][0][all_letters.find(letter)] = 1
    return tensor

The code below demonstrates the behavior of this function on a simple example, the line "Aa". Notice that the output is a tensor with dimensions [2, 1, 57].

In [4]:
## Demonstration using the test name "Aa"; notice "A" is encoded in the 27th position and "a" in the 1st position
example = nameToTensor('Aa')
print(example)

## Also notice the first dimension of the tensor is the number of characters in the name
print(example.size())
tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0.]],

        [[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
          0., 0., 0., 0., 0., 0.]]])
torch.Size([2, 1, 57])

Question #1¶

  • Part A - What is represented by the first dimension of the tensor produced by our nameToTensor function? Will the size/length of this dimension change if a different input is provided?
  • Part B - What is represented by the third dimension of the tensor produced by our nameToTensor function? Will the size/length of this dimension change if a different input is provided?

Part 2 - Model Architecture¶

Next, we'll define a network architecture to model our sequential data. Notice that this architecture is flexible enough to handle inputs of different lengths (since each surname contains a different number of characters).

In [5]:
from torch import nn
class my_rnn(nn.Module):
    
    ## Constructor commands
    def __init__(self, input_size, hidden_size, output_size):
        super(my_rnn, self).__init__()

        self.hidden_size = hidden_size
        self.i2h = nn.Linear(input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(hidden_size, output_size)
        self.softmax = nn.LogSoftmax(dim=1)
    
    ## Function to generate predictions
    def forward(self, input, hidden):
        combined = torch.cat((input, hidden), 1)
        hidden = self.i2h(combined)
        output = self.i2o(hidden)
        output = self.softmax(output)
        return output, hidden
    
    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

The architecture is best understood through its forward method. Here, the current input and hidden state are combined into a single tensor, named combined, by concatenating them along dim = 1. This combined tensor is used as the input to i2h, which produces the next hidden state. The updated hidden state is then used to produce an output, which is transformed via the log-softmax before it is returned.

Another thing to notice is that the size of the hidden state is the only piece of this network's architecture that we might consider manipulating. Increasing the hidden size will provide the model with more flexibility to learn sequential patterns that exist within the training sequences (names), but too much flexibility could lead to overfitting.
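As a quick illustration of the concatenation step inside forward, the minimal sketch below (using placeholder tensors, not part of the lab's required code) shows how a 1 x 57 character vector and a 1 x 100 hidden state combine into the 1 x 157 input expected by i2h:

In [ ]:
## Minimal sketch: concatenating a one-hot character with a hidden state along dim = 1
x = torch.zeros(1, n_letters)      ## 1 x 57 one-hot character
h = torch.zeros(1, 100)            ## 1 x 100 hidden state
combined = torch.cat((x, h), 1)
print(combined.size())             ## torch.Size([1, 157])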

To explore how this model works, let's initialize it with randomly generated weights and see what it outputs for an example name:

In [6]:
## Initialize model with random weights
rnn = my_rnn(n_letters, 100, 6)

## Format an example input name (Albert)
test_input = nameToTensor('Albert')

## Provide an initial hidden state (all zeros this time)
hidden = torch.zeros(1, 100)

## Generate output from the RNN (using only the first character of the name)
output, next_hidden = rnn(test_input[0], hidden)
print(output)
tensor([[-1.8977, -1.8209, -1.8361, -1.7668, -1.7667, -1.6767]],
       grad_fn=<LogSoftmaxBackward0>)
In [7]:
## Print the top category (predicted class)
output.topk(1)
Out[7]:
torch.return_types.topk(
values=tensor([[-1.6767]], grad_fn=<TopkBackward0>),
indices=tensor([[5]]))
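Before moving on, the sketch below (a minimal illustration using the untrained rnn and test_input defined above) unrolls the network across every character of the name, carrying the hidden state forward at each step. The output after the final character is the prediction for the whole name:

In [ ]:
## Minimal sketch: unroll the untrained RNN over all characters of "Albert"
hidden = rnn.initHidden()
for i in range(test_input.size()[0]):
    output, hidden = rnn(test_input[i], hidden)

print(output)          ## log-softmax scores after the final character
print(output.topk(1))  ## top predicted category (meaningless until the model is trained)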

Question #2¶

  • Part A - In the given example, my_rnn was initialized with a value of 6 for output_size. Where did this value come from? Is it something you can change when tuning the network's architecture? Briefly explain.
  • Part B - In the given example, my_rnn was initialized with a value of 100 for hidden_size. Where did this value come from? Is it something you can change when tuning the network's architecture? Briefly explain.

Part 3 - Training¶

In order to train the model, we'll define the category labels and a dictionary that links our lists of names to each label.

In [8]:
## List of categories
category_labels = ['Chinese', 'Japanese', 'Korean', 'English', 'Irish', 'Russian']

## Dictionary of categories and names
category_lines = {'Chinese': Chinese,
                 'Japanese': Japanese,
                 'Korean': Korean,
                 'English': English,
                 'Irish': Irish,
                 'Russian': Russian}

Next, we'll aim to train our network by feeding it randomly selected example names and updating the network's weights and biases using back-propagation.

The function defined below will help facilitate the selection of randomly chosen input names during model training:

In [9]:
## Function to randomly sample a single example
import random
def randomTrainingExample():
    ## Randomly choose a category (ie: Chinese, etc.)
    category = category_labels[random.randint(0, len(category_labels)-1)]
    
    ## Randomly choose a name in that category
    name = category_lines[category][random.randint(0, len(category_lines[category])-1)]
    
    ## Convert the chosen example to a tensor
    category_tensor = torch.tensor([category_labels.index(category)], dtype=torch.long)
    line_tensor = nameToTensor(name)
    
    return category, name, category_tensor, line_tensor

## Try it out
randomTrainingExample()
Out[9]:
('Korean',
 'Gwang ',
 tensor([2]),
 tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0.]],
 
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0.]],
 
         [[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0.]],
 
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0.]],
 
         [[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0.]],
 
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 1., 0., 0., 0., 0.]]]))

Next, we'll set up another function to update the network's parameters after encountering a randomly selected training example:

In [10]:
## Set learning rate
learning_rate = 0.005

## Define cost func
cost_fn = nn.CrossEntropyLoss()

## Training function for a single input (name category, name)
def train(category_tensor, line_tensor):
    
    ## initialize the hidden state
    hidden = rnn.initHidden()
    
    ## set the gradient to zero
    rnn.zero_grad()

    ## loop through the letters in the input, getting a prediction and new hidden state each time
    for i in range(line_tensor.size()[0]):
        output, hidden = rnn(line_tensor[i], hidden)

    ## Calculate cost and gradients
    cost = cost_fn(output, category_tensor)
    cost.backward()

    # Update parameters
    for p in rnn.parameters():
        p.data.add_(p.grad.data, alpha = -learning_rate) ## Gradient descent step: subtract LR times the gradient from each parameter

    ## Return the output and cost
    return output, cost.item()

Question #3¶

The train function defined above contains a for loop that iterates through the first dimension of the input line tensor (which is the tensor storing the input name).

  • Part A - In specific terms, what is being input to the model at each iteration of this loop?
  • Part B - In specific terms, what is being output from the model at each iteration of this loop?
  • Part C - Notice how hidden is initialized (ie: reset) every time train is called on a new training example. What is the purpose of this step? Briefly explain.

Finally, we're ready to train our model. We'll do this by repeatedly providing a randomly chosen name to the train function and tracking the accumulated cost (per 25 iterations).

In [11]:
## Initializations
n_iters = 10000
cost_every_n = 25
current_cost = 0
track_cost = []

### Iteratively update model from randomly chosen example
for iter in range(1, n_iters + 1):
    category, line, category_tensor, line_tensor = randomTrainingExample()
    output, cost = train(category_tensor, line_tensor)
    current_cost += cost
    
    # Save cost every 25 iterations
    if iter % cost_every_n == 0:
        track_cost.append(current_cost/cost_every_n)
        current_cost = 0

Next, we'll graph the costs throughout the training process to see if our model has learned anything from our training examples:

In [12]:
import matplotlib.pyplot as plt
plt.plot(track_cost)
plt.show()

From this graph, we can see that the model has found some patterns within surnames. In the next section we'll explore these further by using the trained model to make predictions.

Part 4 - Using the Model¶

The RNN we built and trained is designed to predict the labels of input sequences of characters. This means that we can give the trained model any valid sequence of characters and it will predict the nationality it believes that name belongs to.

To see this in action, we'll create a predict function that returns the top N predicted labels (and their associated outputs) for a given input name:

In [13]:
def predict(input_line, n_predictions=4):
    print('\n> %s' % input_line)
    
    ## Don't update gradient with any of these examples
    with torch.no_grad():
        
        ## Initialize new hidden state
        hidden = rnn.initHidden()
        
        ## Convert input str to tensor
        input_t = nameToTensor(input_line)
 
        ## Pass each character into `rnn`
        for i in range(input_t.size()[0]):
            output, hidden = rnn(input_t[i], hidden)

        # Get top N categories from output
        topv, topi = output.topk(n_predictions, 1, True)
        predictions = []

        ## Go through the category predictions and save info for printing
        for i in range(n_predictions):
            value = topv[0][i].item()
            category_index = topi[0][i].item()
            print('(%.2f) %s' % (value, category_labels[category_index]))
            predictions.append([value, category_labels[category_index]])

## Try it out on a few examples:
predict('Dovesky')
predict('Miller')
predict('Satoshi')
predict('ABCDEFGHIJKLMNOP')
> Dovesky
(-0.47) Russian
(-1.11) English
(-3.20) Irish
(-5.11) Japanese

> Miller
(-0.34) English
(-1.93) Russian
(-2.00) Irish
(-5.77) Japanese

> Satoshi
(-0.25) Japanese
(-2.33) Russian
(-2.52) English
(-3.23) Irish

> ABCDEFGHIJKLMNOP
(-0.68) English
(-0.78) Russian
(-3.47) Irish
(-5.80) Japanese

Question #4¶

  • Try out this predict function on 1 or 2 names of your choosing. Include your code and output, and write 1-2 sentences reflecting upon whether you are satisfied or surprised by the results.

Part 5 - Creating a Generative RNN¶

This section provides a brief illustration of a simple generative RNN. The network will be trained using the surname data we've been working with, and it will be set up to generate a predicted name when given an initial character.

For our previous model, we prepared our data using one-hot encoding to represent each unique letter. This time, we'll expand the letter set slightly (adding the hyphen) and include an extra position that does not correspond to any letter to function as a "stop character", which signals the model to stop generating new characters:

In [20]:
n_categories = len(category_labels)
all_letters = string.ascii_letters + " .,;'-"
n_letters = len(all_letters) + 1

Next, we'll define the model's architecture, which is somewhat more complicated than in our previous example.

In [21]:
from torch import nn

class my_gen_rnn(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(my_gen_rnn, self).__init__()
        self.hidden_size = hidden_size

        self.i2h = nn.Linear(n_categories + input_size + hidden_size, hidden_size)
        self.i2o = nn.Linear(n_categories + input_size + hidden_size, output_size)
        self.o2o = nn.Linear(hidden_size + output_size, output_size)
        self.dropout = nn.Dropout(0.1)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, category, input, hidden):
        input_combined = torch.cat((category, input, hidden), 1)
        hidden = self.i2h(input_combined)
        output = self.i2o(input_combined)
        output_combined = torch.cat((hidden, output), 1)
        output = self.o2o(output_combined)
        output = self.dropout(output)
        output = self.softmax(output)
        return output, hidden

    def initHidden(self):
        return torch.zeros(1, self.hidden_size)

This model involves three key components that are explained below:

  • i2h takes a combined input tensor containing the category, the current input character, and the current hidden state, and outputs a new hidden state
  • i2o takes the same combined input as i2h, but it produces an intermediate output that will ultimately contribute to a new predicted character
  • o2o is an extra layer that takes the combined outputs of i2h and i2o to generate a predicted character.

The recurrent structure of the network can be more easily understood using the diagram below:

In [16]:
from IPython.display import HTML
HTML('<img src="https://i.imgur.com/jzVrf7f.png">')
Out[16]:

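As a quick shape check before training (a minimal sketch using a temporary, randomly initialized instance rather than the model we'll train below), a single forward step takes a 1 x 6 category tensor, a 1 x 59 character tensor, and a 1 x 128 hidden state, and returns a 1 x 59 output along with the next 1 x 128 hidden state:

In [ ]:
## Minimal sketch: one forward step through a temporary, untrained generative RNN
tmp_rnn = my_gen_rnn(n_letters, 128, n_letters)
cat_t = torch.zeros(1, n_categories)    ## 1 x 6 category tensor
char_t = torch.zeros(1, n_letters)      ## 1 x 59 character tensor
hid_t = tmp_rnn.initHidden()            ## 1 x 128 hidden state

out, next_hid = tmp_rnn(cat_t, char_t, hid_t)
print(out.size(), next_hid.size())      ## torch.Size([1, 59]) torch.Size([1, 128])
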
For each training example we'll need a set of input letters (the complete surname), a set of output letters (the surname offset by 1), and the category label (nationality).

For example, if the name is "Kasparov", the input letters would be a one-hot representation of the letters in "Kasparov", and the output letters would be the positions of the letters in "asparov" followed by the end-of-string marker "<EOS>" (the extra position we added to our set of letters). We'll also need a tensor to store the category label.

The functions defined below will create the input letters, output letters, and category tensor for a given name:

In [22]:
def inputTensor(line):
    tensor = torch.zeros(len(line), 1, n_letters)
    for li in range(len(line)):
        letter = line[li]
        tensor[li][0][all_letters.find(letter)] = 1
    return tensor

def outputTensor(line):
    letter_indexes = [all_letters.find(line[li]) for li in range(1, len(line))]
    letter_indexes.append(n_letters - 1)  ## append the stop character index
    return torch.LongTensor(letter_indexes)

def categoryTensor(category):
    li = category_labels.index(category)
    tensor = torch.zeros(1, n_categories)
    tensor[0][li] = 1
    return tensor
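As a quick sanity check (a minimal sketch using the made-up name "Kim"), the target indices produced by outputTensor are the input characters shifted by one position, ending with the stop index n_letters - 1 = 58:

In [ ]:
## Minimal sketch: targets are the input shifted by one letter, ending in the stop index
print(inputTensor('Kim').size())   ## torch.Size([3, 1, 59])
print(outputTensor('Kim'))         ## tensor([ 8, 12, 58]) -- indices of 'i', 'm', then the stop character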

We will also define a couple of functions to help us select random examples during training:

In [23]:
# Random item from a list
def randomChoice(l):
    return l[random.randint(0, len(l) - 1)]

# Get a random category and random line from that category
def randomTrainingPair():
    category = randomChoice(category_labels)
    line = randomChoice(category_lines[category])
    return category, line

# Make category, input, and target tensors from a random category, line pair
def randomTrainingExample():
    category, line = randomTrainingPair()
    category_tensor = categoryTensor(category)
    input_line_tensor = inputTensor(line)
    target_line_tensor = outputTensor(line)
    return category_tensor, input_line_tensor, target_line_tensor

To get a basic understanding of these functions, we can display a random training example:

In [24]:
## Try it out
randomTrainingExample()
Out[24]:
(tensor([[1., 0., 0., 0., 0., 0.]]),
 tensor([[[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0.]],
 
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0.]],
 
         [[1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0.]],
 
         [[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
           0., 0., 0., 0., 0., 0., 0., 0.]]]),
 tensor([20,  0, 13, 58]))

In order, these three tensor outputs are:

  1. A one-hot vector recording the name's category (nationality)
  2. A tensor that encodes the input name
  3. A tensor of integer labels for each character in the output name. This should always end in 58, which is the index we created to be the "stop character"

Next, we'll define a function that generates an output name from the network when given an initial character. We'll limit the generated name to at most 15 characters beyond the starting letter, and we'll use the network architecture we previously defined (with randomly initialized weights) to come up with the name.

In [25]:
max_length = 15
gen_rnn = my_gen_rnn(n_letters, 128, n_letters)

# Sample using a given category and starting letter
def sample(category, start_letter):
    
    ## We are just sampling, so we don't want to store info used in gradient calculations
    with torch.no_grad(): 
        category_tensor = categoryTensor(category)  ## create category tensor of input category
        input = inputTensor(start_letter)           ## initialize input tensor as an encoding of the start letter
        hidden = gen_rnn.initHidden()               ## reset the initial hidden state
        output_name = start_letter                  ## Use start letter as first piece of the output name
        
        ## Loop until reaching the max length or the stop character 
        for i in range(max_length):
            output, hidden = gen_rnn(category_tensor, input[0], hidden)  ## Get the next output and hidden state
            topv, topi = output.topk(1)                                  ## Identify the top predicted character's value and index position
            topi = topi[0][0]                                            ## Extract integer id of predicted char
            if topi == n_letters - 1:                                    ## Stop if it's the stop character's ID
                break
            else:
                letter = all_letters[topi]                               ## Convert integer id to the character
                output_name += letter                                    ## Add this character to the output 
            input = inputTensor(letter)                                  ## Prep this letter as the next input

        return output_name

We can see this function in action by providing a valid category label and initial character:

In [26]:
sample('English', 'B')
Out[26]:
'BRiPTPiPTPiPTPiP'

Question #5¶

  • Part A - The generated name doesn't appear to be an English surname. Is this something you'd expect to see for other input characters? Briefly explain.
  • Part B - Why do we want the function to exit the for loop when topi == n_letters - 1? Briefly explain.

Similar to our previous example, we'll create a function that we can use to help train our network:

In [27]:
cost_fn = nn.CrossEntropyLoss()
gen_rnn = my_gen_rnn(n_letters, 128, n_letters)
learning_rate = 0.001

def train(category_tensor, input_line_tensor, target_line_tensor):
    target_line_tensor.unsqueeze_(-1)
    hidden = gen_rnn.initHidden()

    gen_rnn.zero_grad()
    cost = 0

    for i in range(input_line_tensor.size(0)):
        output, hidden = gen_rnn(category_tensor, input_line_tensor[i], hidden)
        l = cost_fn(output, target_line_tensor[i])
        cost += l

    cost.backward()

    ## Gradient descent step: subtract LR times the gradient from each parameter
    for p in gen_rnn.parameters():
        p.data.add_(p.grad.data, alpha=-learning_rate)

    return output, cost.item() / input_line_tensor.size(0)

Next, we'll train the network using 10,000 randomly chosen training examples:

In [28]:
n_iters = 10000
cost_every_n = 25
current_cost = 0
track_cost = []

for iter in range(1, n_iters + 1):
    cat, il, ol = randomTrainingExample()
    if -1 in ol:                               ### Skip examples containing characters outside all_letters (find() returns -1)
        continue  
    output, cost = train(cat, il, ol)
    current_cost += cost
    
    # Save the cost every 25 iterations
    if iter % cost_every_n == 0:
        track_cost.append(current_cost/cost_every_n)
        current_cost = 0

As shown below, we can see that the network's parameters have reached a point where the cost is no longer improving:

In [29]:
plt.plot(track_cost)
plt.show()

At this point, we can use the sample function we created earlier to explore some of the names we can generate. The code below provides a template for looking at various names that are generated from a given test letter.

In [58]:
test_letter = 'M'
print('Korean:',sample('Korean', test_letter), 
      '\nJapanese:', sample('Japanese', test_letter),
      '\nChinese:', sample('Chinese', test_letter),
      '\nEnglish:', sample('English', test_letter),
      '\nIrish:', sample('Irish', test_letter),
      '\nRussian:', sample('Russian', test_letter))
Korean: Mon 
Japanese: Mana 
Chinese: Man 
English: Mane 
Irish: Manan 
Russian: Manakov

Notice that you can re-run the same commands several times and see slightly different results because the dropout layer used in the network's forward pass remains active while sampling.
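If you'd like reproducible samples, the minimal sketch below (optional, not part of the lab's required code) puts the network in evaluation mode, which disables the dropout layer, and then switches back to training mode afterward:

In [ ]:
## Minimal sketch: disable dropout for deterministic sampling, then re-enable it
gen_rnn.eval()                    ## evaluation mode turns off dropout
print(sample('Russian', test_letter))
gen_rnn.train()                   ## restore training mode (dropout active again)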

Question #6¶

  • To verify that you've tried training this model and using it to generate output, modify the print command given above to use a test_letter of your choice. Include your printed results, and provide 1-2 sentences commenting upon how you view the effectiveness of this model.

Acknowledgements: The contents of this lab were adapted from the following tutorials:

  • https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html
  • https://pytorch.org/tutorials/beginner/former_torchies/nnft_tutorial.html