Chapter 9 Regression Models

In this chapter, we will cover ordinary linear regression and a few more advanced regression methods. A linear combination of variables seems simple compared to many of today's machine learning models. However, many advanced models use linear combinations of variables as one of their major components or steps. For example, each neuron in a deep neural network first combines its input signals linearly before feeding the result into a non-linear activation function. To understand many of today's machine learning models, it is helpful to understand the key ideas across different modeling frameworks.
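To make this concrete, a single neuron with inputs $x_1, \dots, x_p$, weights $w_1, \dots, w_p$, bias $b$, and activation function $\sigma(\cdot)$ computes

$$a = \sigma\left( w_1 x_1 + w_2 x_2 + \cdots + w_p x_p + b \right) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$

The term inside $\sigma(\cdot)$ is exactly the kind of linear combination that a linear regression model estimates.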

First, we will introduce multivariate linear regression (i.e. the typical least squares regression), which is one of the simplest supervised learning methods. Even though it is simple, the general ideas and procedures for fitting a regression model apply to a much broader scope. Having a solid understanding of the basic linear regression model enables us to learn more advanced models easily. For example, we will introduce two "shrinkage" versions of linear regression: ridge regression and LASSO regression. While the parameters are still fitted by a least squares criterion, the extra penalty term effectively shrinks the model parameters towards zero. This mitigates overfitting and keeps the model robust when the sample size is small compared to the number of explanatory variables. For each model, we first introduce the basic ideas and then provide R code showing how to fit it. We only cover the major properties of these models; the listed references provide more in-depth discussion.
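As a rough preview (the detailed treatment comes later in this chapter), the ridge and LASSO estimates can be written as penalized least squares problems, where $\lambda \ge 0$ is a tuning parameter controlling the amount of shrinkage:

$$\hat{\beta}^{ridge} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}$$

$$\hat{\beta}^{lasso} = \underset{\beta}{\arg\min} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}$$

Setting $\lambda = 0$ recovers ordinary least squares; larger values of $\lambda$ shrink the coefficients more aggressively.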

We will use the clothing company data as an example. We want to answer business questions such as "which variables are the driving factors of total revenue (both online and in-store purchases)?" The answer to this question can help the company decide where to invest (such as design, quality, etc.). Note that a driving factor here does not guarantee a causal relationship. Linear regression models reveal correlation rather than causation. For example, if a survey on car purchases shows a positive correlation between price and customer satisfaction, does that suggest the car dealer should increase prices? Probably not! It is more likely that customer satisfaction is driven by quality, and higher-quality cars tend to be more expensive. Causal inference is much more difficult to establish, and we have to be very careful when interpreting regression model results.

Load the R packages first:

# install packages from CRAN
p_needed <- c('caret', 'dplyr', 'lattice',
              'elasticnet', 'lars', 'corrplot', 
              'pls')

packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
    install.packages(p_to_install)
}

# load all required packages
lapply(p_needed, require, character.only = TRUE)