10.1 Ridge Regression
Recall that the least squares estimates minimize the residual sum of squares (RSS):
\[RSS=\sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j}x_{ij}\right)^{2}\]
Ridge regression (Hoerl and Kennard 1970) is similar, but it finds the estimates \(\hat{\beta}^{R}\) that minimize a slightly different quantity:
\[\begin{equation} \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j}x_{ij}\right)^{2}+\lambda\sum_{j=1}^{p}\beta_{j}^{2}=RSS+\lambda\sum_{j=1}^{p}\beta_{j}^{2} \tag{10.1} \end{equation}\]
where \(\lambda \geq 0\) is a tuning parameter. As with least squares, ridge regression seeks coefficients that make RSS small. However, it adds a shrinkage penalty \(\lambda\sum_{j=1}^{p}\beta_{j}^{2}\) that pulls the coefficient estimates toward zero. When \(\lambda = 0\), the penalty has no effect and ridge regression is identical to least squares. As \(\lambda\) gets larger, the coefficients shrink toward 0, and as \(\lambda\rightarrow\infty\), the coefficients \(\beta_{1},\dots,\beta_{p}\) approach 0. Note that the penalty is not applied to the intercept \(\beta_{0}\). The tuning parameter \(\lambda\) controls the relative impact of the two terms in equation (10.1), and every value of \(\lambda\) corresponds to a different set of parameter estimates.
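To make equation (10.1) concrete, here is a minimal R sketch of the ridge objective; the function name ridge_objective and its arguments are hypothetical and only for illustration:

# a sketch of the ridge objective in equation (10.1);
# x is an n x p predictor matrix, beta a length-p coefficient vector
ridge_objective <- function(beta0, beta, x, y, lambda) {
    rss <- sum((y - beta0 - x %*% beta)^2)
    # the intercept beta0 is excluded from the penalty term
    rss + lambda * sum(beta^2)
}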
There are many R packages for ridge regression, such as the lm.ridge() function from MASS and the enet() function from elasticnet. If you know the value of \(\lambda\), you can use either function to fit a ridge regression. A more convenient way is to use the train() function from caret. Let's use the 10 survey questions to predict the total purchase amount (the sum of online and store purchases).
dat <- read.csv("http://bit.ly/2P5gTw4")
# data cleaning: delete wrong observations
# expense can't be negative
dat <- subset(dat, store_exp > 0 & online_exp > 0)
# get predictors
trainx <- dat[, grep("Q", names(dat))]
# get response
trainy <- dat$store_exp + dat$online_exp
Use the train() function to tune the parameter. Since ridge regression adds the penalty parameter \(\lambda\) in front of the sum of squares of the parameters, the scale of the parameters matters, so it is better to center and scale the predictors. This preprocessing is generally recommended for any technique that puts a penalty on parameter estimates. In this example, the 10 survey questions are already on the same scale, so the preprocessing doesn't make much difference, but it is a good idea to make it standard practice.
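The preProc argument in the code below handles this automatically; as a rough equivalent, you could standardize the predictors yourself (a sketch; trainx_std is a hypothetical name):

# manual standardization sketch; caret's preProc applies the same
# centering and scaling within each resampling fold
trainx_std <- as.data.frame(scale(trainx))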
# set cross validation
ctrl <- trainControl(method = "cv", number = 10)
# set the parameter range
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 20))
set.seed(100)
ridgeRegTune <- train(trainx, trainy,
                      method = "ridge",
                      tuneGrid = ridgeGrid,
                      trControl = ctrl,
                      ## center and scale predictors
                      preProc = c("center", "scale"))
ridgeRegTune
## Ridge Regression
##
## 999 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 899, 899, 899, 899, 899, 900, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000 1744 0.7952 754.0
## 0.005263 1744 0.7954 754.9
## 0.010526 1744 0.7955 755.9
## 0.015789 1744 0.7955 757.3
## 0.021053 1745 0.7956 758.8
## 0.026316 1746 0.7956 760.6
## 0.031579 1747 0.7956 762.4
## 0.036842 1748 0.7956 764.3
## 0.042105 1750 0.7956 766.4
## 0.047368 1751 0.7956 768.5
## 0.052632 1753 0.7956 770.6
## 0.057895 1755 0.7956 772.7
## 0.063158 1757 0.7956 774.9
## 0.068421 1759 0.7956 777.2
## 0.073684 1762 0.7956 779.6
## 0.078947 1764 0.7955 782.1
## 0.084211 1767 0.7955 784.8
## 0.089474 1769 0.7955 787.6
## 0.094737 1772 0.7955 790.4
## 0.100000 1775 0.7954 793.3
##
## RMSE was used to select the optimal model using
## the smallest value.
## The final value used for the model was lambda
## = 0.005263.
The results show that the best value of \(\lambda\) is 0.005 and the corresponding RMSE and \(R^{2}\) are 1744 and 0.7954, respectively. You can see from Figure 10.1 that as \(\lambda\) increases, the RMSE first decreases slightly and then increases.
plot(ridgeRegTune)
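If you want to use the selected \(\lambda\) programmatically instead of reading it off the printout, the train object stores it in its bestTune element (a sketch; bestLambda is a hypothetical name):

# extract the lambda value chosen by cross-validation
bestLambda <- ridgeRegTune$bestTune$lambda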
Once you have the tuning parameter value, there are different functions to fit a ridge regression. Let's look at how to use the enet() function in the elasticnet package.
ridgefit = enet(x = as.matrix(trainx), y = trainy, lambda = 0.01,
                # center and scale predictors
                normalize = TRUE)
Note that ridgefit above only assigns the value of the tuning parameter for ridge regression. Since the elastic net model includes both the ridge and lasso penalties, we need to use the predict() function to get the model fit. You can get the fitted results by setting s = 1 and mode = "fraction". Here s = 1 means we only use the ridge penalty. We will come back to this when we get to lasso regression.
ridgePred <- predict(ridgefit, newx = as.matrix(trainx),
                     s = 1, mode = "fraction", type = "fit")
By setting type = "fit", the above returns a list object. The fit item has the predictions:
names(ridgePred)
## [1] "s" "fraction" "mode" "fit"
head(ridgePred$fit)
## 1 2 3 4 5 6
## 1290.5 224.2 591.4 1220.6 853.4 908.2
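With the fitted values in hand, you can, for example, compute the apparent training-set error. This is a sketch; unlike the RMSE reported by train(), it is not cross-validated and will be optimistic:

# training-set RMSE from the fitted values
sqrt(mean((trainy - ridgePred$fit)^2))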
If you want to check the estimated coefficients, you can set type = "coefficients":
ridgeCoef <- predict(ridgefit, newx = as.matrix(trainx),
                     s = 1, mode = "fraction", type = "coefficients")
It also returns a list and the estimates are in the coefficients item:
# didn't show the results
RidgeCoef = ridgeCoef$coefficients
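Since RidgeCoef is a named numeric vector, a quick way to see which questions matter most is to order the estimates by magnitude (a sketch):

# view the estimated coefficients ordered by absolute size
sort(abs(RidgeCoef), decreasing = TRUE)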
Compared to least squares regression, ridge regression performs better because of the bias-variance trade-off we mentioned in Section 7.1. As the penalty parameter \(\lambda\) increases, the flexibility of the ridge regression fit decreases. This decreases the variance of the model but increases the bias at the same time.
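To see the shrinkage directly, you can refit with a larger penalty and compare the size of the coefficient estimates. This is a sketch, and lambda = 10 is an arbitrary value chosen only for illustration:

# refit with a much larger ridge penalty
ridgefit2 <- enet(x = as.matrix(trainx), y = trainy, lambda = 10,
                  normalize = TRUE)
ridgeCoef2 <- predict(ridgefit2, newx = as.matrix(trainx),
                      s = 1, mode = "fraction",
                      type = "coefficients")$coefficients
# the heavily penalized estimates are pulled closer to zero
c(sum(RidgeCoef^2), sum(ridgeCoef2^2))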