10.1 Ridge Regression
Recall that the least squares estimates minimize the residual sum of squares (RSS):
\[RSS=\sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j}x_{ij}\right)^{2}\]
Ridge regression (Hoerl and Kennard 1970) is similar, but it finds the estimates \(\hat{\beta}^{R}\) that minimize a slightly different quantity:
\[\begin{equation} \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j}x_{ij}\right)^{2}+\lambda\sum_{j=1}^{p}\beta_{j}^{2}=RSS+\lambda\sum_{j=1}^{p}\beta_{j}^{2} \tag{10.1} \end{equation}\]
where \(\lambda \geq 0\) is a tuning parameter. As with least squares, ridge regression seeks coefficients that make RSS small. However, it adds a shrinkage penalty \(\lambda\sum_{j=1}^{p}\beta_{j}^{2}\) that pulls the coefficient estimates toward zero. When \(\lambda = 0\), the penalty has no effect and ridge regression is identical to least squares. As \(\lambda\) gets larger, the coefficients shrink toward 0, and as \(\lambda\rightarrow\infty\), the coefficients \(\beta_{1},\dots,\beta_{p}\) approach 0. Note that the penalty is not applied to the intercept \(\beta_{0}\). The tuning parameter \(\lambda\) controls the relative impact of the two terms in equation (10.1), and every value of \(\lambda\) corresponds to a different set of parameter estimates.
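To make equation (10.1) concrete, here is a minimal R sketch of the ridge objective; the function name ridge_objective and its arguments are hypothetical and only for illustration:

# a sketch of the ridge objective in equation (10.1);
# x is an n x p predictor matrix, beta a length-p coefficient vector
ridge_objective <- function(beta0, beta, x, y, lambda) {
    rss <- sum((y - beta0 - x %*% beta)^2)
    # the intercept beta0 is excluded from the penalty term
    rss + lambda * sum(beta^2)
}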
There are many R packages for ridge regression, such as the lm.ridge() function from MASS and the enet() function from elasticnet. If you know the value of \(\lambda\), you can use either function to fit a ridge regression. A more convenient way is to use the train() function from caret. Let's use the 10 survey questions to predict the total purchase amount (the sum of online and store purchases).
dat <- read.csv("http://bit.ly/2P5gTw4")
# data cleaning: delete wrong observations
# expense can't be negative
dat <- subset(dat, store_exp > 0 & online_exp > 0)
# get predictors
trainx <- dat[, grep("Q", names(dat))]
# get response
trainy <- dat$store_exp + dat$online_exp
Use the train() function to tune the parameter. Since ridge regression adds the penalty parameter \(\lambda\) in front of the sum of squares of the parameters, the scale of the parameters matters, so it is better to center and scale the predictors. This preprocessing is generally recommended for any technique that puts a penalty on parameter estimates. In this example, the 10 survey questions are already on the same scale, so the preprocessing doesn't make much difference, but it is a good idea to make it standard practice.
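The preProc argument in the code below handles this automatically; as a rough equivalent, you could standardize the predictors yourself (a sketch; trainx_std is a hypothetical name):

# manual standardization sketch; caret's preProc applies the same
# centering and scaling within each resampling fold
trainx_std <- as.data.frame(scale(trainx))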
# set cross validation
ctrl <- trainControl(method = "cv", number = 10)
# set the parameter range
ridgeGrid <- data.frame(.lambda = seq(0, .1, length = 20))
set.seed(100)
ridgeRegTune <- train(trainx, trainy,
                      method = "ridge",
                      tuneGrid = ridgeGrid,
                      trControl = ctrl,
                      ## center and scale predictors
                      preProc = c("center", "scale"))
ridgeRegTune
## Ridge Regression
##
## 999 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 899, 899, 899, 899, 899, 900, ...
## Resampling results across tuning parameters:
##
## lambda RMSE Rsquared MAE
## 0.000000 1744 0.7952 754.0
## 0.005263 1744 0.7954 754.9
## 0.010526 1744 0.7955 755.9
## 0.015789 1744 0.7955 757.3
## 0.021053 1745 0.7956 758.8
## 0.026316 1746 0.7956 760.6
## 0.031579 1747 0.7956 762.4
## 0.036842 1748 0.7956 764.3
## 0.042105 1750 0.7956 766.4
## 0.047368 1751 0.7956 768.5
## 0.052632 1753 0.7956 770.6
## 0.057895 1755 0.7956 772.7
## 0.063158 1757 0.7956 774.9
## 0.068421 1759 0.7956 777.2
## 0.073684 1762 0.7956 779.6
## 0.078947 1764 0.7955 782.1
## 0.084211 1767 0.7955 784.8
## 0.089474 1769 0.7955 787.6
## 0.094737 1772 0.7955 790.4
## 0.100000 1775 0.7954 793.3
##
## RMSE was used to select the optimal model using
## the smallest value.
## The final value used for the model was lambda
## = 0.005263.
The results show that the best value of \(\lambda\) is 0.005 and the corresponding RMSE and \(R^{2}\) are 1744 and 0.7954, respectively. You can see from Figure 10.1 that as \(\lambda\) increases, the RMSE first decreases slightly and then increases.
plot(ridgeRegTune)
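If you want to use the selected \(\lambda\) programmatically instead of reading it off the printout, the train object stores it in its bestTune element (a sketch; bestLambda is a hypothetical name):

# extract the lambda value chosen by cross-validation
bestLambda <- ridgeRegTune$bestTune$lambda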
Once you have the tuning parameter value, there are different functions to fit a ridge regression. Let's look at how to use the enet() function in the elasticnet package.
ridgefit = enet(x = as.matrix(trainx), y = trainy, lambda = 0.01,
                # center and scale predictors
                normalize = TRUE)
Note that ridgefit above only assigns the value of the tuning parameter for ridge regression. Since the elastic net model includes both the ridge and lasso penalties, we need to use the predict() function to get the model fit. You can get the fitted results by setting s = 1 and mode = "fraction". Here s = 1 means we only use the ridge penalty. We will come back to this when we get to lasso regression.
ridgePred <- predict(ridgefit, newx = as.matrix(trainx),
                     s = 1, mode = "fraction", type = "fit")
By setting type = "fit", the above returns a list object. The fit item has the predictions:
names(ridgePred)
## [1] "s" "fraction" "mode" "fit"
head(ridgePred$fit)
## 1 2 3 4 5 6
## 1290.5 224.2 591.4 1220.6 853.4 908.2
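With the fitted values in hand, you can, for example, compute the apparent training-set error. This is a sketch; unlike the RMSE reported by train(), it is not cross-validated and will be optimistic:

# training-set RMSE from the fitted values
sqrt(mean((trainy - ridgePred$fit)^2))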
If you want to check the estimated coefficients, you can set type = "coefficients":
ridgeCoef <- predict(ridgefit, newx = as.matrix(trainx),
                     s = 1, mode = "fraction", type = "coefficients")
It also returns a list and the estimates are in the coefficients item:
# didn't show the results
RidgeCoef = ridgeCoef$coefficients
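Since RidgeCoef is a named numeric vector, a quick way to see which questions matter most is to order the estimates by magnitude (a sketch):

# view the estimated coefficients ordered by absolute size
sort(abs(RidgeCoef), decreasing = TRUE)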
Compared to least squares regression, ridge regression performs better because of the bias-variance trade-off we mentioned in Section 7.1. As the penalty parameter \(\lambda\) increases, the flexibility of the ridge regression fit decreases. This decreases the variance of the model but increases the bias at the same time.
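To see the shrinkage directly, you can refit with a larger penalty and compare the size of the coefficient estimates. This is a sketch, and lambda = 10 is an arbitrary value chosen only for illustration:

# refit with a much larger ridge penalty
ridgefit2 <- enet(x = as.matrix(trainx), y = trainy, lambda = 10,
                  normalize = TRUE)
ridgeCoef2 <- predict(ridgefit2, newx = as.matrix(trainx),
                      s = 1, mode = "fraction",
                      type = "coefficients")$coefficients
# the heavily penalized estimates are pulled closer to zero
c(sum(RidgeCoef^2), sum(ridgeCoef2^2))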