Rolling in the Deep: RNN

Anna Alexandra Grigoryan
Jul 4, 2019 · 5 min read


Introduction to Recurrent Neural Networks

A Recurrent Neural Network (RNN) is a deep learning approach for modeling sequential data. Data is sequential if each point in a dataset depends on the previous points, with every data point representing an observation at a certain moment in time.

RNNs are useful for sentiment analysis, predicting the next word in a sentence, translation, and speech-to-text conversion. A traditional neural network typically can’t handle sequential data, as it assumes that each data point is independent of the others. Fed daily data input after input, a traditional network produces an individual classification for every day, ignoring the previous results.

In traditional neural networks, inputs are analyzed in isolation, which is a problem if there are dependencies in the data.

In contrast to traditional neural nets, an RNN is able to remember the analysis done up to a given point by maintaining a state, i.e. a context. The state captures information about what has been calculated previously and recurs back into the net with each new input.

Architecture

In the simplest case, when the network has just one hidden layer, each data point flows into the network as input, and the hidden units also receive the previous state, or context (h_prev), along with that input.

RNN Architecture

In contrast to a traditional NN hidden layer, the hidden layer here calculates two values:

  • The updated state variable (h_new)
  • Output of the network

The new state is calculated as a function of the previous state and the input data.

If this is the first data point, then some “initial state” is used, which is initialized to all zeros.

The output of the hidden layer is calculated as a multiplication of the new hidden state and the output weight matrix.

After processing the first data point, in addition to the output, a new context (h_new) is generated that represents the most recent point. This context is fed back into the net with the next data point, and the process repeats until all the data has been processed.
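To make the recurrence concrete, here is a minimal NumPy sketch of a single-hidden-layer RNN step. The weight names (W_xh, W_hh, W_hy), the tanh activation and the dimensions are illustrative assumptions, not details taken from the figure above.

    import numpy as np

    # Dimensions chosen arbitrarily for illustration.
    input_size, hidden_size, output_size = 4, 8, 3

    rng = np.random.default_rng(0)
    W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
    W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # hidden -> hidden (the recurrence)
    W_hy = rng.normal(scale=0.1, size=(output_size, hidden_size))  # hidden -> output
    b_h, b_y = np.zeros(hidden_size), np.zeros(output_size)

    def rnn_step(x, h_prev):
        """One time step: combine the input with the previous state,
        return the output and the updated state."""
        h_new = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)  # updated state variable
        y = W_hy @ h_new + b_y                           # output of the network
        return y, h_new

    # The initial state is all zeros; the new state is fed back at every step.
    h = np.zeros(hidden_size)
    for x_t in rng.normal(size=(5, input_size)):         # five dummy time steps
        y_t, h = rnn_step(x_t, h)

At each step the new state depends on both the previous state and the current input, which is exactly the “context” described above.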

Recurrent neural networks have a wide range of applications: speech recognition, image captioning, stock market price prediction, etc. Depending on the scenario, the network can be many-to-many, one-to-many or many-to-one.

Problems of RNN

Despite all of their strengths, recurrent neural networks have the following issues:

  • Computationally expensive
    As the network needs to keep track of the states at every time step, it becomes computationally expensive when there are many units of data or time steps. One way to mitigate this is to save only a part of the states.
  • Sensitive to parameters
    RNNs are sensitive to changes in their parameters, so the gradient descent optimizers used for parameter tuning may struggle to train them.
  • “Vanishing gradient” problem
    This refers to the gradient gradually shrinking as it is propagated back through the network, which slows down training in the early layers.
  • “Exploding gradient” problem
    Conversely, gradients can grow exponentially when values larger than 1.0 are repeatedly multiplied through the network’s layers; a common remedy, gradient clipping, is sketched below.
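Gradient clipping rescales the gradients whenever their overall norm exceeds a threshold. A minimal sketch, with the threshold and function name chosen arbitrarily:

    import numpy as np

    def clip_gradients(grads, max_norm=5.0):
        """Rescale a list of gradient arrays so that their global norm
        does not exceed max_norm, preventing the gradients from exploding."""
        total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
        if total_norm > max_norm:
            grads = [g * (max_norm / total_norm) for g in grads]
        return grads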

Long Short-Term Memory

As discussed above, the vanilla RNN has a few issues that make it poorly suited to long sequences. Long Short-Term Memory (LSTM) is a specific type of RNN that maintains a strong gradient over many time steps.

An LSTM unit is composed of four main elements: a memory cell and three logistic gates (write, read and forget).

  • Memory cell: responsible for holding data
  • Write gate: responsible for writing data into the memory cell
  • Read gate: responsible for reading the data from the memory cell and sending it back to the recurrent net
  • Forget gate: responsible for maintaining or deleting data from the memory cell

These gates apply functions to a linear combination of the input, the previous state and the previous output. As a result, the recurrent network is able to remember what it needs and forget what it doesn’t.
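As a rough sketch of how those gates combine (the standard LSTM cell equations; all weight and bias names are assumed for illustration), each gate applies a sigmoid to a linear combination of the current input and the previous output, and the memory cell is updated through the forget and write gates:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x, h_prev, c_prev, p):
        """One LSTM time step; p holds weight matrices W_f, W_i, W_o, W_c
        and biases b_f, b_i, b_o, b_c (illustrative names)."""
        z = np.concatenate([h_prev, x])             # previous output + current input
        f = sigmoid(p["W_f"] @ z + p["b_f"])        # forget gate: keep or erase memory
        i = sigmoid(p["W_i"] @ z + p["b_i"])        # write (input) gate
        o = sigmoid(p["W_o"] @ z + p["b_o"])        # read (output) gate
        c_tilde = np.tanh(p["W_c"] @ z + p["b_c"])  # candidate new memory
        c_new = f * c_prev + i * c_tilde            # memory cell update
        h_new = o * np.tanh(c_new)                  # new hidden state / output
        return h_new, c_new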

As in the vanilla RNN, a randomly initialized hidden state is used to produce the first new hidden state and the first time-step output. These are then passed to the net for the next time step, and this continues until all the time steps have been fed to the net.

An LSTM unit keeps two components: (1) the hidden state, i.e. the memory the LSTM accumulates using its gates, and (2) the previous time-step output.

In its basic form, the LSTM has one hidden layer. When hidden layers are added, the output of the first layer becomes the input to the second layer, and the second LSTM runs with its own internal state to produce an output. Stacking more than one LSTM hidden layer makes the model deeper and presumably better performing.
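For illustration, in a framework such as PyTorch (an assumption; the article does not name a library) stacking layers only requires a num_layers argument, which feeds the output sequence of the first LSTM layer into the second:

    import torch
    import torch.nn as nn

    # Two stacked LSTM layers: the output of layer 1 is the input of layer 2.
    lstm = nn.LSTM(input_size=16, hidden_size=32, num_layers=2, batch_first=True)

    x = torch.randn(4, 10, 16)       # (batch, time steps, features), dummy data
    output, (h_n, c_n) = lstm(x)     # output: (4, 10, 32); h_n, c_n hold one state per layer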

Training

By learning the weights and biases used in each gate, for each layer, the network determines how much of the old information to keep or delete, how much new information to feed in through the input gate, and how much of the cell state to expose through the output gate.

RNN for Language Modeling

Language modeling is the process of assigning probabilities to sequences of words, i.e. analyzing a sequence of words and predicting which word is most likely to follow.

An RNN can receive a word as input together with the initial context, generate an output together with a new context, and repeat this until the sentence is complete. To pass a word to the network, it has to be converted into a vector of numbers, which is done using word embeddings.

A word embedding is an n-dimensional vector of real numbers assigned to each word. In the RNN model, the vectors are initialized randomly for all the words. During training, the vectors are updated based on the contexts in which each word is used, so that words used in similar contexts end up with similar positions in the vector space.
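A minimal sketch of an embedding lookup; the toy vocabulary, the dimensionality and the random initialization are invented for illustration:

    import numpy as np

    vocab = {"the": 0, "cat": 1, "sat": 2}   # toy vocabulary: word -> index
    embedding_dim = 5

    # Randomly initialized embedding matrix; during training these vectors are
    # updated so that words used in similar contexts move close together.
    rng = np.random.default_rng(0)
    embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

    def embed(word):
        """Convert a word into its n-dimensional vector of real numbers."""
        return embedding_matrix[vocab[word]]

    print(embed("cat"))   # a 5-dimensional real-valued vector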

In the RNN, one word is fed into the network at each time step, and one word is produced as output. Besides the hidden layer, we also need a softmax layer to get the probabilities of the candidate output words; the output word is the one with the maximum probability value. The resulting sequence of output words can then be compared with the target sequence during training.
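Putting the pieces together, here is a hedged PyTorch sketch of such a word-level model (the class name, sizes and layer choices are illustrative assumptions): an embedding layer, an LSTM, and a linear plus softmax layer that turns each hidden state into a probability distribution over the vocabulary.

    import torch
    import torch.nn as nn

    class WordLevelRNN(nn.Module):
        """Illustrative word-level language model: embedding -> LSTM -> softmax."""
        def __init__(self, vocab_size=1000, embedding_dim=64, hidden_size=128):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embedding_dim)
            self.lstm = nn.LSTM(embedding_dim, hidden_size, batch_first=True)
            self.to_vocab = nn.Linear(hidden_size, vocab_size)

        def forward(self, word_ids, state=None):
            vectors = self.embedding(word_ids)         # word indices -> embedding vectors
            hidden, state = self.lstm(vectors, state)  # one hidden state per time step
            logits = self.to_vocab(hidden)             # a score for every word in the vocabulary
            probs = torch.softmax(logits, dim=-1)      # softmax layer: probabilities of output words
            return probs, state

    model = WordLevelRNN()
    probs, state = model(torch.randint(0, 1000, (1, 20)))  # a dummy 20-word input sequence
    next_word = probs[0, -1].argmax()                       # output word with maximum probability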

What does the Language Model learn?

During the training process, the discrepancy between the actual and predicted output is calculated and backpropagated through the net.

The model learns the following:

  • The embedding matrix (vocabulary) is updated in each iteration.
  • The weight matrices for each gate are tweaked.
  • The weight matrix for the softmax layer is updated.
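A hedged sketch of one training step, continuing the illustrative WordLevelRNN model from above (the loss and optimizer choices are assumptions): a single optimizer update touches the embedding matrix, the gate weights inside the LSTM and the softmax output layer all at once.

    import torch
    import torch.nn as nn

    model = WordLevelRNN()                      # the illustrative model sketched earlier
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    inputs = torch.randint(0, 1000, (1, 20))    # dummy input word indices
    targets = torch.randint(0, 1000, (1, 20))   # dummy "next word" indices

    probs, _ = model(inputs)
    # Discrepancy between predicted and actual next words (negative log-likelihood).
    loss = nn.NLLLoss()(torch.log(probs).view(-1, 1000), targets.view(-1))

    loss.backward()       # backpropagate the error through softmax, gates and embeddings
    optimizer.step()      # updates embedding matrix, gate weight matrices, softmax weights
    optimizer.zero_grad()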
