## 12.4 Recurrent Neural Network

Traditional neural networks have no mechanism for handling sequential data, where later events depend on earlier ones. For example, consider mapping an input audio clip to a text transcript: the input is a voice signal over time, and the output is the corresponding sequence of words over time. A recurrent neural network (RNN) is a deep-learning model that can process this type of sequential data.

The recurrent neural network allows information to flow from one step to the next through a repetitive structure. Figure 12.18 shows the basic building block of an RNN. You combine the activation from the previous step with the current input \(x^{<t>}\) to produce an output \(\hat{y}^{<t>}\) and an updated activation that feeds the next step at \(t+1\).

So the whole process repeats the same pattern at every step. If we unroll the loop:
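The repeated step above can be sketched in plain numpy. This is a minimal illustration, assuming a tanh activation for the hidden state and a softmax output; the weight names (`Waa`, `Wax`, `Wya`) and dimensions are illustrative, not from the text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(a_prev, x_t, Waa, Wax, Wya, ba, by):
    # Combine the previous activation with the current input x^<t>.
    a_t = np.tanh(Waa @ a_prev + Wax @ x_t + ba)
    # Produce the output y_hat^<t> from the new activation.
    y_hat = softmax(Wya @ a_t + by)
    return a_t, y_hat

# Unroll the loop: feed a short sequence through the same cell,
# passing the activation from one step to the next.
rng = np.random.default_rng(0)
n_a, n_x, n_y, T = 4, 3, 2, 5
Waa, Wax = rng.normal(size=(n_a, n_a)), rng.normal(size=(n_a, n_x))
Wya = rng.normal(size=(n_y, n_a))
ba, by = np.zeros(n_a), np.zeros(n_y)

a = np.zeros(n_a)  # initial activation a^<0>
for t in range(T):
    x_t = rng.normal(size=n_x)
    a, y_hat = rnn_step(a, x_t, Waa, Wax, Wya, ba, by)
```

Note that the same weight matrices are reused at every time step; only the activation and the input change.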

This chain-like recurrent structure makes RNNs a natural architecture for sequential data, and they have been applied with incredible success to problems such as:

- Machine translation
- Voice recognition
- Music generation
- Sentiment analysis

A trained CNN accepts a fixed-size vector as input (such as a \(28 \times 28\) image) and produces a fixed-size vector as output (such as the probabilities of being one of the ten digits). An RNN has a much more flexible structure: it can operate over sequences of vectors and produce sequences of outputs, and both can vary in size. To understand what this means, let’s look at some example RNN structures.

Each rectangle represents a vector and each arrow represents a matrix multiplication. Input vectors are in green, output vectors in blue, and the red rectangles hold the intermediate state. From left to right:

- one-to-one: the model takes a fixed-size input and produces a fixed-size output, such as a CNN; it is not sequential.
- one-to-many: the model takes one input and generates a sequence of outputs, such as music generation.
- many-to-one: the model takes a sequence of inputs and produces a single output, such as sentiment analysis.
- many-to-many: the model takes a sequence of inputs and produces a sequence of outputs. The input length can be the same as the output length (such as named entity recognition) or different (such as machine translation).
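The many-to-one and many-to-many (equal-length) cases differ only in which activations are emitted as outputs. The toy sketch below illustrates this; the `step` function is a stand-in for a trained recurrent cell, not a real model.

```python
import numpy as np

def step(a_prev, x_t):
    # Toy recurrent step (stand-in for a trained cell).
    return np.tanh(a_prev + x_t)

xs = [np.array([0.1]), np.array([0.5]), np.array([-0.2])]

# many-to-many (equal lengths): emit one output per input step.
a = np.zeros(1)
outs_m2m = []
for x in xs:
    a = step(a, x)
    outs_m2m.append(a)

# many-to-one: run the whole sequence, keep only the final state.
a = np.zeros(1)
for x in xs:
    a = step(a, x)
out_m2o = a
```

Both loops run the same cell over the same sequence; the many-to-one model simply discards the intermediate outputs.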

### 12.4.1 RNN Model

To further understand the RNN model, let’s look at an entity recognition example. Assume you want to build a sequence model that recognizes company or programming language names in a sentence such as: “Use Netlify and Hugo”. This is a name recognition problem, used by research companies to index the company names that appear in articles. It is also used in materials science to tag chemicals mentioned in recent journals, looking for indications of the next research topic.

Given an input sentence x, you want the model to produce one output for each word in x that tells you whether that word is a name. In this example, the input is a sequence of five words, including the period at the end. The output is a sequence of 0s and 1s of the same length, indicating whether each input word is a name (1) or not (0). We use superscript \(<t>\) to denote the element position in the input and output; superscript \((i)\) to denote the \(i^{th}\) sample (the training data will contain different sentences); and \(T_x^{(i)}\) and \(T_y^{(i)}\) to denote the lengths of the \(i^{th}\) input and output. In this case, \(T_x^{(i)}\) is equal to \(T_y^{(i)}\).
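The notation above can be made concrete with a toy encoding of the example sentence, where 1 marks a name:

```python
# The example sentence, tokenized into five elements x^<1>..x^<5>,
# including the period at the end.
x = ["Use", "Netlify", "and", "Hugo", "."]
# One label per word: 1 if the word is a name, 0 otherwise.
y = [0, 1, 0, 1, 0]

T_x = len(x)  # length of the input sequence
T_y = len(y)  # length of the output sequence; equal to T_x here
```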

### 12.4.2 Word Embedding

### 12.4.3 Long Short Term Memory

We are going to walk through the LSTM step by step. The first step of the LSTM is to decide what information to **forget**. This decision is made by the “forget gate”, a sigmoid function (\(\Gamma_{f}\)). It looks at \(a^{<t-1>}\) and \(x^{<t>}\) and outputs a number between 0 and 1 for each entry in the cell state \(c^{<t-1>}\). A value of 1 means “completely remember the state”, while 0 means “completely forget the state”.

The next step is to decide what new information we’re going to add to the cell state. This step includes two parts:

- input gate (\(\Gamma_{u}\)): a sigmoid function that decides how much we want to update each state value
- a vector of new candidate values (\(\tilde{c}^{<t>}\))

The product of these two parts, \(\Gamma_{u}*\tilde{c}^{<t>}\), is the new candidate scaled by the input gate. We then combine the results we have so far to get the new cell state: \(c^{<t>} = \Gamma_{f}*c^{<t-1>} + \Gamma_{u}*\tilde{c}^{<t>}\).

Finally, we need to decide what we are going to output. The output is a filtered version of the new cell state \(c^{<t>}\): an output gate (\(\Gamma_{o}\), another sigmoid function) decides which parts of the cell state to emit, giving the new activation \(a^{<t>} = \Gamma_{o}*\tanh(c^{<t>})\).
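The three steps above can be sketched as one LSTM cell in numpy. This is a minimal illustration, assuming each gate looks at the concatenation of \(a^{<t-1>}\) and \(x^{<t>}\); the weight names and sizes are illustrative, not from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(a_prev, c_prev, x_t, Wf, Wu, Wc, Wo, bf, bu, bc, bo):
    z = np.concatenate([a_prev, x_t])
    gamma_f = sigmoid(Wf @ z + bf)               # forget gate
    gamma_u = sigmoid(Wu @ z + bu)               # input (update) gate
    c_tilde = np.tanh(Wc @ z + bc)               # candidate values
    c_t = gamma_f * c_prev + gamma_u * c_tilde   # new cell state
    gamma_o = sigmoid(Wo @ z + bo)               # output gate
    a_t = gamma_o * np.tanh(c_t)                 # filtered cell state
    return a_t, c_t

# Run one step with random weights to show the shapes involved.
rng = np.random.default_rng(1)
n_a, n_x = 3, 2
Wf, Wu, Wc, Wo = (rng.normal(size=(n_a, n_a + n_x)) for _ in range(4))
bf = bu = bc = bo = np.zeros(n_a)
a, c = lstm_step(np.zeros(n_a), np.zeros(n_a), rng.normal(size=n_x),
                 Wf, Wu, Wc, Wo, bf, bu, bc, bo)
```

Because each gate is a sigmoid, its entries lie between 0 and 1, so the element-wise products act as soft switches on the cell state.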