10.2 LASSO

Although ridge regression shrinks the parameter estimates towards 0, it won’t shrink any estimate to exactly 0, so the final model includes all predictors. In other words, ridge regression can’t select variables. This may not be a problem for prediction, but it is a big disadvantage if you want to interpret the model, especially when the number of variables is large. A popular alternative to the ridge penalty is the Least Absolute Shrinkage and Selection Operator (LASSO) (Tibshirani 1996).

Similar to ridge regression, lasso adds a penalty. The lasso coefficients \(\hat{\beta}_{\lambda}^{L}\) minimize the following:

\[\begin{equation} \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j}x_{ij}\right)^{2}+\lambda\sum_{j=1}^{p}|\beta_{j}|=RSS+\lambda\sum_{j=1}^{p}|\beta_{j}| \tag{10.2} \end{equation}\]

The only difference between lasso and ridge is the penalty. In statistical parlance, ridge uses an \(L_2\) penalty (\(\beta_{j}^{2}\)) and lasso uses an \(L_1\) penalty (\(|\beta_{j}|\)). The \(L_1\) penalty can shrink some estimates to exactly 0 when \(\lambda\) is big enough, so lasso can be used as a feature selection tool. This is a big advantage because it leads to a more explainable model.
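One way to see why the two penalties behave differently is a simplified one-coefficient illustration. Suppose there is a single coefficient \(\beta\) and the residual sum of squares reduces to \((z-\beta)^{2}\) for some number \(z\) (both \(z\) and this reduction are assumptions made only for this illustration). Minimizing each penalized criterion then gives:

\[
\hat{\beta}^{R}=\underset{\beta}{\arg\min}\left\{(z-\beta)^{2}+\lambda\beta^{2}\right\}=\frac{z}{1+\lambda},
\qquad
\hat{\beta}^{L}=\underset{\beta}{\arg\min}\left\{(z-\beta)^{2}+\lambda|\beta|\right\}=\operatorname{sign}(z)\max\left(|z|-\frac{\lambda}{2},\,0\right)
\]

In this simplified setting the ridge solution is a rescaled version of \(z\) and is never exactly 0 unless \(z=0\), while the lasso solution is exactly 0 whenever \(\lambda\geq 2|z|\).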

As with other models that have tuning parameters, lasso regression requires cross-validation to tune the parameter. You can use train() in the same way as shown in the ridge regression section. To tune the parameter, you need to set up the cross-validation scheme and the range of parameter values to search. It is also advisable to standardize the predictors:
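A minimal sketch of such a tuning run is given here, assuming the caret and lars packages are installed. The data are simulated stand-ins, and the object names (trainx, trainy, ctrl, lassoGrid, lassoRegTune) are illustrative choices. The printed results that follow come from the chapter's own data, so a run of this sketch will show different numbers.

```r
library(caret)  # train() and trainControl(); method = "lars" also needs the lars package

# Hypothetical stand-in data: 200 rows, 10 predictors, 3 of them informative
set.seed(100)
n <- 200
p <- 10
trainx <- as.data.frame(matrix(rnorm(n * p), n, p))
names(trainx) <- paste0("Q", 1:p)
trainy <- 3 * trainx$Q1 - 2 * trainx$Q2 + trainx$Q3 + rnorm(n)

# 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)

# Candidate values of the tuning parameter `fraction`
lassoGrid <- data.frame(fraction = seq(0.8, 1, length.out = 20))

set.seed(100)
lassoRegTune <- train(trainx, trainy,
                      method = "lars",
                      tuneGrid = lassoGrid,
                      trControl = ctrl,
                      # center and scale the predictors before fitting
                      preProcess = c("center", "scale"))
lassoRegTune
```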

## Least Angle Regression 
## 
## 999 samples
##  10 predictor
## 
## Pre-processing: centered (10), scaled (10) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 899, 899, 899, 899, 899, 900, ... 
## Resampling results across tuning parameters:
## 
##   fraction  RMSE  Rsquared  MAE  
##   0.8000    1763  0.7921    787.5
##   0.8105    1760  0.7924    784.1
##   0.8211    1758  0.7927    780.8
##   0.8316    1756  0.7930    777.7
##   0.8421    1754  0.7933    774.6
##   0.8526    1753  0.7936    771.8
##   0.8632    1751  0.7939    769.1
##   0.8737    1749  0.7942    766.6
##   0.8842    1748  0.7944    764.3
##   0.8947    1746  0.7947    762.2
##   0.9053    1745  0.7949    760.1
##   0.9158    1744  0.7951    758.3
##   0.9263    1743  0.7952    756.7
##   0.9368    1743  0.7953    755.5
##   0.9474    1742  0.7954    754.5
##   0.9579    1742  0.7954    754.0
##   0.9684    1742  0.7954    753.6
##   0.9789    1743  0.7953    753.4
##   0.9895    1743  0.7953    753.5
##   1.0000    1744  0.7952    754.0
## 
## RMSE was used to select the optimal model using
##  the smallest value.
## The final value used for the model was fraction
##  = 0.9579.

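The results show that the best value of the tuning parameter (fraction in the output) is 0.9579, and the corresponding RMSE and \(R^{2}\) are 1742 and 0.7954, respectively. The performance is nearly the same as that of ridge regression. Here fraction is the ratio of the \(L_1\) norm of the lasso coefficient vector to that of the full least squares solution, so a smaller fraction corresponds to a larger \(\lambda\). You can see from figure 10.2 that, as \(\lambda\) increases, the RMSE first decreases and then increases.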

FIGURE 10.2: Test mean squared error for the lasso regression

Once you select a value for the tuning parameter, there are different functions to fit a lasso regression, such as lars() in lars, enet() in elasticnet, and glmnet() in glmnet. They all have very similar syntax.
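As a rough illustration of how similar the calls are, the sketch below fits a lasso with lars() and glmnet() on simulated stand-in data (x, y and the chosen values of s are assumptions for the example). Note that the two functions index the solution path differently: lars() can take a fraction, whereas glmnet() takes a penalty value on its own scale, so the two values of s are not comparable.

```r
library(lars)
library(glmnet)

# Hypothetical stand-in data: 200 rows, 10 predictors, 3 of them informative
set.seed(100)
n <- 200
p <- 10
x <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("Q", 1:p)))
y <- drop(3 * x[, 1] - 2 * x[, 2] + x[, 3] + rnorm(n))

# lars(): type = "lasso" requests the lasso solution path
larsFit <- lars(x, y, type = "lasso", normalize = TRUE)
# coefficients at a given fraction of the full least squares L1 norm
larsCoef <- coef(larsFit, s = 0.957, mode = "fraction")
larsCoef

# glmnet(): alpha = 1 selects the lasso penalty
glmnetFit <- glmnet(x, y, alpha = 1, standardize = TRUE)
# here s is a penalty value, not a fraction; 0.1 is an arbitrary choice
glmnetCoef <- coef(glmnetFit, s = 0.1)
glmnetCoef
```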

Here we continue using enet(). The syntax is similar to that for ridge regression. The only difference is that you need to set lambda = 0, because the lambda argument here controls the ridge penalty. When it is 0, the function returns the lasso model object.

Set the fraction value to be 0.957 (the value we got above):
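A minimal sketch of these two steps follows, assuming the elasticnet package is installed and using simulated stand-in data (trainx, trainy, lassoModel and lassoFit are illustrative names). The predictions printed further down come from the chapter's own data, so this sketch will produce different values.

```r
library(elasticnet)

# Hypothetical stand-in data: 200 rows, 10 predictors, 3 of them informative
set.seed(100)
n <- 200
p <- 10
trainx <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("Q", 1:p)))
trainy <- drop(3 * trainx[, 1] - 2 * trainx[, 2] + trainx[, 3] + rnorm(n))

# lambda = 0 switches off the ridge penalty, leaving a pure lasso fit
lassoModel <- enet(x = trainx, y = trainy, lambda = 0, normalize = TRUE)

# s together with mode = "fraction" picks the point on the lasso path
lassoFit <- predict(lassoModel, newx = trainx, s = 0.957,
                    mode = "fraction", type = "fit")

# the predictions are stored in the `fit` element of the returned list
head(lassoFit$fit)
```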

Again, with type = "fit", the call returns a list object. The fit item holds the predictions:

##      1      2      3      4      5      6 
## 1357.3  300.5  690.2 1228.2  838.4 1010.1

You need to set type = "coefficients" to get parameter estimates:

It also returns a list and the estimates are in the coefficients item:
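As a hedged sketch, again assuming the elasticnet package and simulated stand-in data (all object names are illustrative), the coefficients can be extracted as follows. The second call uses a much smaller fraction, chosen arbitrarily, to show how a stronger penalty sets some estimates to exactly 0.

```r
library(elasticnet)

# Hypothetical stand-in data: 200 rows, 10 predictors, 3 of them informative
set.seed(100)
n <- 200
p <- 10
trainx <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("Q", 1:p)))
trainy <- drop(3 * trainx[, 1] - 2 * trainx[, 2] + trainx[, 3] + rnorm(n))

# lambda = 0 gives the lasso fit
lassoModel <- enet(x = trainx, y = trainy, lambda = 0, normalize = TRUE)

# type = "coefficients" returns a list; the estimates are in `coefficients`
lassoCoef <- predict(lassoModel, newx = trainx, s = 0.957,
                     mode = "fraction", type = "coefficients")
lassoCoef$coefficients

# a smaller fraction means a stronger penalty and more exact zeros
sparseCoef <- predict(lassoModel, newx = trainx, s = 0.3,
                      mode = "fraction", type = "coefficients")
sum(sparseCoef$coefficients == 0)
```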

Many researchers have applied the lasso penalty to other learning methods, such as linear discriminant analysis (Clemmensen et al. 2011) and partial least squares regression (Chun and Keleş 2010). However, since the \(L_1\) norm is not differentiable at 0, optimization for lasso regression is more complicated. People have come up with different algorithms to solve the computational problem. The biggest breakthrough is Least Angle Regression (LARS) from Bradley Efron et al. This algorithm works well for lasso regression, especially when the dimension is high.

References

Chun, Hyonho, and Sündüz Keleş. 2010. “Sparse Partial Least Squares Regression for Simultaneous Dimension Reduction and Variable Selection.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 72 (1): 3–25.

Clemmensen, Line, Daniela Witten, Trevor Hastie, and Bjarne Ersbøll. 2011. “Sparse Discriminant Analysis.” Technometrics 53 (4): 406–13.

Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (1): 267–88.