In this section, we walk through an example of text sentiment
analysis using an RNN. The purpose of this notebook is to illustrate how to build
a recurrent neural network using R. Please check the keras
R package website for the most recent developments: https://keras.rstudio.com/.
We can install keras from CRAN by calling
install.packages("keras")
. Because the package is still under fast
development, we can also install the most recent version, which may not be
on CRAN yet, directly from GitHub using
devtools::install_github('rstudio/keras')
. The following
cell may take a few minutes to finish installing all dependencies.
devtools::install_github("rstudio/keras")
As keras is just an interface to popular deep learning frameworks, we
also have to install a deep learning backend. The default and recommended
backend is TensorFlow. Calling install_keras()
installs all the dependencies needed for TensorFlow.
library(keras)
# install_keras()
You can run this notebook in the Databricks Community Edition with R as the interface. For an audience with a statistical background, a well-managed cloud environment avoids much of the local setup work described below.
You can also run this notebook on your local machine with R and the
required Python packages (keras uses the Python TensorFlow
backend engine). Different versions of Python may cause errors when
running install_keras(). Here are the things you could do
when you encounter a Python backend issue on your local machine:

- Run reticulate::py_config() to check the current Python configuration and see if anything needs to be changed.
- By default, install_keras() uses the virtual environment ~/.virtualenvs/r-reticulate. If you don't know how to set the right environment, try setting the installation method to conda (install_keras(method = "conda")).
- Use a cloud environment, such as the Databricks Community Edition mentioned above, that is already configured with keras and the TensorFlow backend.

We will use the IMDB movie review data, one of the most widely used datasets for text-related machine learning methods. The inputs are movie reviews published on IMDB in raw text format, and the output is a binary sentiment indicator ("1" for positive and "0" for negative) created through human evaluation. The training and testing data have 25,000 records each, and each review varies in length.
Machine learning algorithms cannot deal with raw text, so we have to convert text into numbers before feeding it into an algorithm. Tokenization is one way to convert text data into a numerical representation. For example, suppose we have 500 unique words across all reviews in the training dataset. We can label each word by the rank (i.e., from 1 to 500) of its frequency in the training data. Then each word is replaced by an integer between 1 and 500. This way, we can map each movie review from its raw text format to a sequence of integers.
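To make the tokenization step concrete, here is a minimal sketch using the text_tokenizer() utilities in keras on two made-up reviews (the toy texts and the vocabulary size of 500 are only for illustration; the IMDB data we use later comes already tokenized):
# two made-up reviews for illustration
toy_reviews <- c("this movie was great great fun",
                 "this movie was a waste of time")
# build a tokenizer that keeps at most 500 words ranked by frequency
toy_tokenizer <- text_tokenizer(num_words = 500) %>%
  fit_text_tokenizer(toy_reviews)
# map each review to a sequence of integer word ranks
texts_to_sequences(toy_tokenizer, toy_reviews)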
As reviews have different lengths, the sequences of integers will have different lengths too. So another important step is to make every input the same length by padding or truncating. For example, we can set a length of 50 words: for any review shorter than 50 words, we pad with 0 to reach a length of 50; for any review longer than 50 words, we truncate the sequence to 50, for example by keeping only the first 50 words. After padding and truncating, we have a typical rectangular dataset: each row is an observation, and each column is a feature. The number of features is the number of words kept for each review (i.e., 50 in this example).
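As a quick illustration of padding and truncating (using a made-up target length of 5 instead of 50), pad_sequences() brings every sequence to the same length; note that, by default, keras pads and truncates at the beginning of a sequence, and the padding and truncating arguments can switch this to the end:
# toy sequences of word indices, invented for illustration
toy_seqs <- list(c(5, 24, 3), c(8, 11, 2, 30, 7, 19, 6))
# pad the short sequence with 0 and truncate the long one to length 5
pad_sequences(toy_seqs, maxlen = 5)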
After tokenization, the numerical input is just a naive mapping of the original words, and the integers do not carry their usual numerical meaning. We use an embedding to convert these categorical integers into more meaningful representations. A word embedding captures the inherent relationships among words and dramatically reduces the input dimension: each word in the vocabulary is mapped to a dense vector of real numbers in a space of fixed dimension (for example, 128 or 256), and this dimension stays the same even when the vocabulary changes. The embedding vectors can be learned from the training data, or we can use pre-trained embedding models such as Word2Vec or BERT.
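For intuition, an embedding layer simply maps each integer word index to a trainable dense vector. A minimal standalone sketch is shown below; the dimensions mirror the ones used later in this section, and input_length is included only to make the input shape explicit:
# each of 2500 possible word indices is mapped to a 128-dimensional
# dense vector; a batch of padded reviews of length 100 therefore
# comes out with shape (batch_size, 100, 128)
embed_only <- keras_model_sequential() %>%
  layer_embedding(input_dim = 2500, output_dim = 128, input_length = 100)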
The IMDB dataset is preloaded for keras
and we can call
dataset_imdb()
to load a partially pre-processed dataset as a list of training and testing sets. We can define a few parameters in that function.
num_words
sets the maximum number of unique words to keep.
All the unique words are ranked by their frequency counts in the
training dataset. The dataset_imdb()
function keeps the top
num_words
words and replaces every other word with a default
"unknown" value of 2. The text is represented by integers, where 0, 1,
and 2 are reserved for "padding," "start of the sequence," and "unknown,"
so the actual word indices start at 4 for the most frequent word.
# consider only the top 2,500 words in the dataset
max_unique_word <- 2500
# cut off reviews after 100 words
max_review_len <- 100
Now we load the IMDB dataset, and we can check the structure of the
loaded object using the str()
command.
my_imdb <- dataset_imdb(num_words = max_unique_word)
## Loaded Tensorflow version 2.8.0
str(my_imdb, list.len = 10)
## List of 2
## $ train:List of 2
## ..$ x:List of 25000
## .. ..$ : int [1:218] 1 14 22 16 43 530 973 1622 1385 65 ...
## .. ..$ : int [1:189] 1 194 1153 194 2 78 228 5 6 1463 ...
## .. ..$ : int [1:141] 1 14 47 8 30 31 7 4 249 108 ...
## .. ..$ : int [1:550] 1 4 2 2 33 2 4 2040 432 111 ...
## .. ..$ : int [1:147] 1 249 1323 7 61 113 10 10 13 1637 ...
## .. ..$ : int [1:43] 1 778 128 74 12 630 163 15 4 1766 ...
## .. ..$ : int [1:123] 1 2 365 1234 5 1156 354 11 14 2 ...
## .. ..$ : int [1:562] 1 4 2 716 4 65 7 4 689 2 ...
## .. ..$ : int [1:233] 1 43 188 46 5 566 264 51 6 530 ...
## .. ..$ : int [1:130] 1 14 20 47 111 439 2 19 12 15 ...
## .. .. [list output truncated]
## ..$ y: int [1:25000] 1 0 0 1 0 0 1 0 1 0 ...
## $ test :List of 2
## ..$ x:List of 25000
## .. ..$ : int [1:68] 1 591 202 14 31 6 717 10 10 2 ...
## .. ..$ : int [1:260] 1 14 22 2 6 176 7 2 88 12 ...
## .. ..$ : int [1:603] 1 111 748 2 1133 2 2 4 87 1551 ...
## .. ..$ : int [1:181] 1 13 1228 119 14 552 7 20 190 14 ...
## .. ..$ : int [1:108] 1 40 49 85 84 1040 146 6 783 254 ...
## .. ..$ : int [1:132] 1 146 427 2 14 20 218 112 2 32 ...
## .. ..$ : int [1:761] 1 1822 424 8 30 43 6 173 7 6 ...
## .. ..$ : int [1:180] 1 4 2 745 2 912 9 2 8 2 ...
## .. ..$ : int [1:134] 1 363 69 6 196 119 1586 19 2 2 ...
## .. ..$ : int [1:370] 1 14 22 9 121 4 1354 2 2 8 ...
## .. .. [list output truncated]
## ..$ y: int [1:25000] 0 1 1 0 1 1 1 0 0 1 ...
x_train <- my_imdb$train$x
y_train <- my_imdb$train$y
x_test <- my_imdb$test$x
y_test <- my_imdb$test$y
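As a quick sanity check, we can map the integer indices back to words with dataset_imdb_word_index(); the offset of 3 in the sketch below accounts for the reserved indices (this decoding step is only for illustration and is not needed for modeling):
# word -> frequency-rank mapping supplied with the dataset
word_index <- dataset_imdb_word_index()
reverse_word_index <- setNames(names(word_index), unlist(word_index))
# indices 0, 1, and 2 are reserved, so a stored index i maps to rank i - 3
decoded_review <- sapply(my_imdb$train$x[[1]], function(i) {
  if (i >= 4) reverse_word_index[[as.character(i - 3)]] else "?"
})
cat(paste(decoded_review, collapse = " "))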
Next, we do the padding and truncating process.
x_train <- pad_sequences(x_train, maxlen = max_review_len)
x_test <- pad_sequences(x_test, maxlen = max_review_len)
The x_train
and x_test
objects are now numerical matrices ready to be used for recurrent neural network models.
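We can verify the shape of the padded data; each set should now be a matrix with 25,000 rows (one per review) and 100 columns (one per word position):
# each padded set is expected to be a 25000 x 100 integer matrix
dim(x_train)
dim(x_test)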
Simple Recurrent Neural Network
Like the DNN and CNN models we trained before, RNN models are
relatively easy to train using keras
after the
pre-processing stage. In the following example, we use
layer_embedding()
to fit an embedding layer based on the
training dataset, with two parameters: input_dim
(the
number of unique words) and output_dim
(the length of the dense
vectors). Then we add a simple RNN layer by calling
layer_simple_rnn()
, followed by a dense layer
layer_dense()
that connects to the binary response
variable.
rnn_model <- keras_model_sequential()
rnn_model %>%
layer_embedding(input_dim = max_unique_word, output_dim = 128) %>%
layer_simple_rnn(units = 64, dropout = 0.2,
recurrent_dropout = 0.2) %>%
layer_dense(units = 1, activation = 'sigmoid')
We compile the RNN model by defining the loss function, the optimizer to use, and the metrics to track, in the same way as for the DNN and CNN models.
rnn_model %>% compile(
loss = 'binary_crossentropy',
optimizer = 'adam',
metrics = c('accuracy')
)
Let us define a few more variables before fitting the model:
batch_size
, epochs
, and
validation_split
. These variables have the same meaning as in the
DNN and CNN models we saw before.
batch_size <- 128
epochs <- 5
validation_split <- 0.2
rnn_history <- rnn_model %>% fit(
x_train, y_train,
batch_size = batch_size,
epochs = epochs,
validation_split = validation_split
)
plot(rnn_history)
rnn_model %>%
evaluate(x_test, y_test)
## loss accuracy
## 0.5441073 0.7216800
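To obtain predicted probabilities for individual reviews, we can call predict() on the fitted model; the sketch below uses the first five test reviews just for illustration:
# predicted probability of a positive review for the first few test cases
rnn_prob <- predict(rnn_model, x_test[1:5, ])
data.frame(probability = as.vector(rnn_prob),
           predicted = as.integer(rnn_prob > 0.5),
           actual = y_test[1:5])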
LSTM RNN Model
A simple RNN layer is a good starting point for learning RNNs, but its
performance is usually not that good because long-term
dependencies are hard to learn due to the vanishing gradient problem. A Long
Short-Term Memory (LSTM) RNN model can carry useful information from
earlier words to later words. In keras
, it is easy to
replace a simple RNN layer with an LSTM layer by using
layer_lstm()
.
lstm_model <- keras_model_sequential()
lstm_model %>%
layer_embedding(input_dim = max_unique_word, output_dim = 128) %>%
layer_lstm(units = 64, dropout = 0.2, recurrent_dropout = 0.2) %>%
layer_dense(units = 1, activation = 'sigmoid')
lstm_model %>% compile(
loss = 'binary_crossentropy',
optimizer = 'adam',
metrics = c('accuracy')
)
batch_size <- 128
epochs <- 5
validation_split <- 0.2
lstm_history <- lstm_model %>% fit(
x_train, y_train,
batch_size = batch_size,
epochs = epochs,
validation_split = validation_split
)
plot(lstm_history)
lstm_model %>%
evaluate(x_test, y_test)
## loss accuracy
## 0.361364 0.844080
This simple example shows that the LSTM model's performance improves dramatically over the simple RNN model. For this small dataset, the computation time for the LSTM is roughly double that of the simple RNN.