5.8 Re-encode Dummy Variables

A dummy variable is a binary (0/1) variable that indicates membership in a subgroup of the sample. Sometimes we need to recode a categorical variable into smaller pieces of information called “dummy variables.” For example, some questionnaires offer five options for each question: A, B, C, D, and E. After collecting the data, you will usually convert the categorical variable for each question into five dummy variables, and then treat one of the options as the baseline.
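The questionnaire example can be sketched in base R as follows. The answers below are made up for illustration, and model.matrix() is used here only because it ships with R; it is one possible way to do the conversion, not necessarily the one used later in this section.

```r
# Hypothetical answers to a single five-option question (illustrative only)
answer <- factor(c("A", "C", "E", "B", "D", "A"),
                 levels = c("A", "B", "C", "D", "E"))

# Removing the intercept (- 1) keeps one 0/1 column per option
allDummies <- model.matrix(~ answer - 1)
colnames(allDummies)

# Treat option A as the baseline by dropping its column
dummies <- allDummies[, colnames(allDummies) != "answerA", drop = FALSE]
head(dummies)
```

A respondent who chose the baseline option A has zeros in all four remaining columns, so no information is lost by dropping that column.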

Let’s encode gender and house from sim.dat as dummy variables. There are two ways to do this. The first is to use class.ind() from the nnet package. However, it only works on one variable at a time.

dumVar <- nnet::class.ind(sim.dat$gender)
head(dumVar)
##      Female Male
## [1,]      1    0
## [2,]      1    0
## [3,]      0    1
## [4,]      0    1
## [5,]      0    1
## [6,]      0    1

The two columns carry the same information (each is one minus the other), so keeping both is redundant, and we need to remove one of them when modeling. A more powerful function is dummyVars() from the caret package:

library(caret)
# use "original variable name + level" as the new column name
dumMod <- dummyVars(~gender + house + income, 
                    data = sim.dat, 
                    levelsOnly = FALSE)
head(predict(dumMod, sim.dat))
##   genderFemale genderMale houseNo houseYes income
## 1            1          0       0        1 120963
## 2            1          0       0        1 122008
## 3            0          1       0        1 114202
## 4            0          1       0        1 113616
## 5            0          1       0        1 124253
## 6            0          1       0        1 107661
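The output above still contains a column for every level, so one column per variable is redundant for modeling. As a minimal sketch, dummyVars() has a fullRank argument that drops one level of each factor for you. The toy data frame below is a hypothetical stand-in for sim.dat, with made-up values.

```r
library(caret)

# Hypothetical toy data standing in for sim.dat (values are made up)
toy <- data.frame(
  gender = factor(c("Female", "Female", "Male", "Male")),
  house  = factor(c("Yes", "No", "Yes", "Yes")),
  income = c(120000, 95000, 110000, 87000)
)

# fullRank = TRUE drops the first level of each factor,
# which then acts as the baseline
dumFull <- dummyVars(~ gender + house + income,
                     data = toy,
                     fullRank = TRUE)
encoded <- predict(dumFull, newdata = toy)
head(encoded)
```

With two-level factors, this leaves one dummy column for gender, one for house, and the unchanged income column.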

dummyVars() takes a formula as its first argument. The variables on the right-hand side can be either categorical or numeric. A numeric variable is kept unchanged, so you can apply the function to a data frame without first removing its numeric columns. The function can also create interaction terms:

dumMod <- dummyVars(~gender + house + income + income:gender, 
                    data = sim.dat, 
                    levelsOnly = FALSE)
head(predict(dumMod, sim.dat))
##   genderFemale genderMale houseNo houseYes income
## 1            1          0       0        1 120963
## 2            1          0       0        1 122008
## 3            0          1       0        1 114202
## 4            0          1       0        1 113616
## 5            0          1       0        1 124253
## 6            0          1       0        1 107661
##   genderFemale:income genderMale:income
## 1              120963                 0
## 2              122008                 0
## 3                   0            114202
## 4                   0            113616
## 5                   0            124253
## 6                   0            107661

If you think the impact of income on purchasing behavior differs between males and females, you may add an interaction term between income and gender. You can do this by adding income:gender to the formula, as in the code above.
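To see what the interaction columns contain, the sketch below rebuilds them by hand in base R: each one is the gender dummy multiplied by income. The data frame and its values are hypothetical, and this is only an illustration of the arithmetic, not of how dummyVars() computes the columns internally.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(
  gender = factor(c("Female", "Female", "Male", "Male")),
  income = c(120000, 95000, 110000, 87000)
)

# 0/1 indicators for each gender level
female <- as.numeric(toy$gender == "Female")
male   <- as.numeric(toy$gender == "Male")

# Each interaction column is the indicator times income:
# income for rows in that group, 0 otherwise
interactions <- cbind(
  "genderFemale:income" = female * toy$income,
  "genderMale:income"   = male * toy$income
)
interactions
```

In a linear model, these two columns let the income coefficient differ between the two groups.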