## 10.2 LASSO

Even though ridge regression shrinks the parameter estimates towards 0, it won't shrink any estimate to exactly 0, which means all predictors stay in the final model. In other words, ridge regression cannot select variables. This may not matter for prediction, but it is a serious disadvantage if you want to interpret the model, especially when the number of variables is large. A popular alternative to the ridge penalty is the **Least Absolute Shrinkage and Selection Operator** (LASSO) (Tibshirani 1996).

Similar to ridge regression, lasso adds a penalty. The lasso coefficients \(\hat{\beta}_{\lambda}^{L}\) minimize the following:

\[\begin{equation} \sum_{i=1}^{n}\left(y_{i}-\beta_{0}-\sum_{j=1}^{p}\beta_{j}x_{ij}\right)^{2}+\lambda\sum_{j=1}^{p}|\beta_{j}|=RSS+\lambda\sum_{j=1}^{p}|\beta_{j}| \tag{10.2} \end{equation}\]

The only difference between lasso and ridge is the penalty. In statistical parlance, ridge uses the \(L_2\) penalty (\(\beta_{j}^{2}\)) while lasso uses the \(L_1\) penalty (\(|\beta_{j}|\)). The \(L_1\) penalty can shrink estimates exactly to 0 when \(\lambda\) is large enough, so lasso can be used as a feature selection tool. This is a big advantage because it leads to a more explainable model.
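To see why the \(L_1\) penalty can produce exact zeros, consider the special case of a single standardized predictor: the lasso estimate is then a soft-thresholded version of the least squares estimate. A minimal sketch (the function name `soft_threshold` is our own):

```
# Soft-thresholding: shifts the least squares estimate toward zero
# and sets it exactly to zero once the penalty exceeds its
# magnitude. Ridge, in contrast, only rescales the estimate and
# never reaches zero.
soft_threshold <- function(beta_ols, lambda) {
  sign(beta_ols) * pmax(abs(beta_ols) - lambda, 0)
}

soft_threshold(c(-2, -0.3, 0.1, 1.5), lambda = 0.5)
## [1] -1.5  0.0  0.0  1.0
```

The two middle estimates fall within \(\lambda\) of zero and are dropped from the model entirely.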

Similar to other models with tuning parameters, lasso regression requires cross-validation to tune the parameter. You can use `train()` in a similar way as we showed in the ridge regression section. To tune the parameter, we need to set up cross-validation and the parameter range. It is also advised to standardize the predictors:

```
ctrl <- trainControl(method = "cv", number = 10)
lassoGrid <- data.frame(fraction = seq(.8, 1, length = 20))
set.seed(100)
lassoTune <- train(trainx, trainy,
                   ## set the method to be lasso
                   method = "lars",
                   tuneGrid = lassoGrid,
                   trControl = ctrl,
                   ## standardize the predictors
                   preProc = c("center", "scale"))
lassoTune
```

```
## Least Angle Regression
##
## 999 samples
## 10 predictor
##
## Pre-processing: centered (10), scaled (10)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 899, 899, 899, 899, 899, 900, ...
## Resampling results across tuning parameters:
##
## fraction RMSE Rsquared MAE
## 0.8000 1763 0.7921 787.5
## 0.8105 1760 0.7924 784.1
## 0.8211 1758 0.7927 780.8
## 0.8316 1756 0.7930 777.7
## 0.8421 1754 0.7933 774.6
## 0.8526 1753 0.7936 771.8
## 0.8632 1751 0.7939 769.1
## 0.8737 1749 0.7942 766.6
## 0.8842 1748 0.7944 764.3
## 0.8947 1746 0.7947 762.2
## 0.9053 1745 0.7949 760.1
## 0.9158 1744 0.7951 758.3
## 0.9263 1743 0.7952 756.7
## 0.9368 1743 0.7953 755.5
## 0.9474 1742 0.7954 754.5
## 0.9579 1742 0.7954 754.0
## 0.9684 1742 0.7954 753.6
## 0.9789 1743 0.7953 753.4
## 0.9895 1743 0.7953 753.5
## 1.0000 1744 0.7952 754.0
##
## RMSE was used to select the optimal model using
## the smallest value.
## The final value used for the model was fraction
## = 0.9579.
```

The results show that the best value of the tuning parameter (`fraction` in the output) is 0.957, with an RMSE of 1742 and an \(R^{2}\) of 0.7954. The performance is nearly the same as ridge regression. You can see from figure 10.2 that as \(\lambda\) increases, the RMSE first decreases and then increases.
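To draw the cross-validation profile yourself, you can plot the `train` object directly; `caret` then shows RMSE against the candidate values of the tuning parameter:

```
# plot cross-validated RMSE against the fraction grid
plot(lassoTune)
```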

Once you select a value for the tuning parameter, there are different functions to fit lasso regression, such as `lars()` in `lars`, `enet()` in `elasticnet`, and `glmnet()` in `glmnet`. They all have very similar syntax.

Here we continue using `enet()`. The syntax is similar to ridge regression. The only difference is that you need to set `lambda = 0`, because the `lambda` argument here controls the ridge penalty; when it is 0, the function returns a lasso model object.

Set the fraction value to be 0.957 (the value we got above):
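The fitting and prediction steps can be sketched as follows (assuming `trainx`, `trainy`, and `testx` are the same data objects used in the ridge regression section):

```
library(elasticnet)

## lambda = 0 turns off the ridge penalty, so enet() returns
## a lasso model object
lassoModel <- enet(x = as.matrix(trainx), y = trainy,
                   lambda = 0, normalize = TRUE)

## s is the fraction of the full solution path;
## mode = "fraction" tells predict() how to interpret s
lassoFit <- predict(lassoModel, newx = as.matrix(testx),
                    s = 0.957, mode = "fraction", type = "fit")
head(lassoFit$fit)
```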

Again, by setting `type = "fit"`, the call above returns a list object. The `fit` item contains the predictions:

```
## 1 2 3 4 5 6
## 1357.3 300.5 690.2 1228.2 838.4 1010.1
```

You need to set `type = "coefficients"` to get the parameter estimates. It also returns a list, and the estimates are in the `coefficients` item:
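A sketch of that call (assuming `lassoModel` is the `enet()` object fitted with `lambda = 0` and `testx` as before):

```
## with type = "coefficients", predict() returns the parameter
## estimates at the chosen point on the solution path
lassoCoef <- predict(lassoModel, newx = as.matrix(testx),
                     s = 0.957, mode = "fraction",
                     type = "coefficients")
lassoCoef$coefficients
```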

Many researchers have applied the lasso penalty to other learning methods, such as linear discriminant analysis (Clemmensen et al. 2011) and partial least squares regression (Chun and Keleş 2010). However, since the \(L_1\) norm is not differentiable, optimization for lasso regression is more complicated, and people have come up with different algorithms to solve the computational problem. The biggest breakthrough is Least Angle Regression (LARS) by Efron et al. This algorithm works well for lasso regression, especially when the dimension is high.

### References

Chun, Hyonho, and Sündüz Keleş. 2010. “Sparse Partial Least Squares Regression for Simultaneous Dimension Reduction and Variable Selection.” *Journal of the Royal Statistical Society: Series B (Statistical Methodology)* 72 (1): 3–25.

Clemmensen, Line, Daniela Witten, Trevor Hastie, and Bjarne Ersbøll. 2011. “Sparse Discriminant Analysis.” *Technometrics* 53 (4): 406–13.

Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” *Journal of the Royal Statistical Society: Series B (Methodological)* 58 (1): 267–88.