3.4 IMDB Dataset

The IMDB dataset (http://ai.stanford.edu/~amaas/data/sentiment/) is a popular dataset for text and language-related machine learning tutorials. It is also conveniently included in the Keras library, and there are a few build-in functions in Keras for data loading and pre-processing. It contains 50,000 movie reviews (25,000 in training and 25,000 in testing) from IMDB, as well as each movie review’s binary sentiment: positive or negative. The raw data contains the text of each movie review, and it has to be pre-processed before being fitted with any machine learning models. By using Keras’s built-in functions, we can easily get the processed dataset (i.e., a numerical data frame) for machine learning algorithms. Keras’ build-in functions perform the following tasks to convert the raw review text into a data frame:

  1. Convert text data into numerical data. Machine learning models cannot work with raw text data directly, and we have to convert text into numbers. There are many different ways for the conversion and Keras’ build-in function uses each word’s rank of frequency in the entire training dataset to replace the raw text in both the training and testing dataset. For example, the 10th most frequent word is replaced by integer 10. There are a few additional setups for this process, including:
    1. Skip top frequent words. We usually skip a few top frequent words as they are mainly stopwords like “the” “and” or “a,” which usually do not provide much information. There is a parameter in the build-in function to specify how many top words to skip.
    2. Set the maximum number of unique words. The entire vocabulary of the unique words in the training dataset may be large, and many of them have very low frequencies such as just appearing once in the entire training dataset. To keep the size of the vocabulary, we can also set up the maximum number of the unique words using Keras’ built-in function such that any words with least frequencies will be replaced with a special index such as “2”.
  2. Padding or truncation to keep all the reviews to be the same length. For most machine learning models, the algorithms expect to see the same number of features (i.e., same number of input columns in the data frame). There is a parameter in the Keras build-in function to set the maximum number of words in each review (i.e., max_length). For reviews that have less than max_legth words, we pad them with “0”. For reviews that have more than max_length words, we truncate them.

After the above pre-processing, each review is represented by one row in the data frame. There is one column for the binary positive/negative sentiment, and max_length columns input features converted from the raw review text. In the corresponding R and Python notebooks, we will go over the details of the data pre-processing using Keras’ built-in functions.