## 9.1 Ordinary Least Squares

For a typical linear regression with $$p$$ explanatory variables, the model is a linear combination of these variables:

$f(\mathbf{X})=\mathbf{X}\mathbf{\beta}=\beta_{0}+\sum_{j=1}^{p}\mathbf{x_{.j}}\beta_{j}$

where $$\mathbf{\beta}$$ is the parameter vector of length $$p+1$$. The least squares method finds the values of $$\mathbf{\beta^{T}}=(\beta_{0},\beta_{1},\ldots,\beta_{p})$$ that minimize the residual sum of squares (RSS):

$RSS(\beta)=\sum_{i=1}^{N}(y_{i}-f(\mathbf{x_{i.}}))^{2}=\sum_{i=1}^{N}(y_{i}-\beta_{0}-\sum_{j=1}^{p}x_{ij}\beta_{j})^{2}$
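To make the RSS criterion concrete, here is a minimal sketch on simulated data. The variables `x1` and `x2`, the true coefficients, and the helper `rss()` are all made up for illustration. Assuming the design matrix has full column rank, the minimizer solves the normal equations $$\mathbf{X}^{T}\mathbf{X}\mathbf{\beta}=\mathbf{X}^{T}\mathbf{y}$$, and the result should agree with what `lm()` returns:

```r
set.seed(100)
n <- 200

# simulated explanatory variables and response (illustrative only)
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1 + 2 * x1 - 0.5 * x2 + rnorm(n, sd = 0.3)

# design matrix with a column of ones for the intercept beta_0
X <- cbind(1, x1, x2)

# residual sum of squares for a given coefficient vector
rss <- function(beta) sum((y - X %*% beta)^2)

# least squares estimate obtained by solving the normal equations
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() minimizes the same criterion
fit <- lm(y ~ x1 + x2)

# the two sets of estimates side by side
print(cbind(normal_eq = drop(beta_hat), lm = coef(fit)))

rss(beta_hat)        # RSS at the least squares solution
rss(beta_hat + 0.1)  # a perturbed coefficient vector gives a larger RSS
```

Solving the normal equations directly is only meant to show what is being minimized. In practice `lm()` is preferable because it uses a numerically more stable QR decomposition.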

R already implements this minimization (for example, in the `lm()` function), so we do not need to code it by hand. Now let’s load the data:

dat <- read.csv("http://bit.ly/2P5gTw4")

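Before fitting the model, we need to clean the data, for example by removing illogical data points such as negative expenses. The filter below keeps only customers whose in-store and online expenses are both strictly positive.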

dat <- subset(dat, store_exp > 0 & online_exp > 0)

We use the 10 survey question variables (the columns whose names contain "Q") as explanatory variables.

modeldat <- dat[, grep("Q", names(dat))]
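Because `grep("Q", ...)` matches every column name that contains a "Q", it is worth confirming that only the survey questions were selected. The sketch below uses hypothetical column names (they are not taken from the actual data set) to show the difference between an unanchored and an anchored pattern:

```r
# hypothetical column names, for illustration only
cols <- c("age", "income", "Q1", "Q2", "Q10", "Quantity")

# unanchored pattern: matches any name containing "Q"
cols[grep("Q", cols)]

# anchored pattern: "Q" followed only by digits
cols[grep("^Q[0-9]+$", cols)]
```

If the data contain no other column with a "Q" in its name, the two patterns select the same columns, and a quick look at `names(modeldat)` is enough to confirm it.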

The response variable is the sum of in-store spending and online spending.

# total expense = in store expense + online expense
modeldat$total_exp <- dat$store_exp + dat$online_exp

Before fitting a linear regression model, let us check whether there are any missing values or outliers:

par(mfrow = c(1, 2))
hist(modeldat$total_exp, main = "", xlab = "total_exp")
boxplot(modeldat$total_exp)

There are no missing values in the response variable, but there are outliers. Outliers are best identified from the problem itself: domain knowledge tells us which values are not possible. We can also use a statistical threshold to remove extremely large or small values from the data. Here we use the Z-score described in section 5.5 to find and remove outliers; readers can refer to that section for more detail.

y <- modeldat$total_exp
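# Z-score of each data point (assumed here: centered by the mean and scaled by the MAD; see section 5.5)
zs <- (y - mean(y)) / mad(y)
# Keep only data points with a Z-score no larger than 3.5
modeldat <- modeldat[which(zs <= 3.5), ]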
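The missing-value check and the outlier filter can be tried out on simulated data. The following is a minimal sketch, not the chapter's exact procedure: the vector `y_sim` and its two injected extreme values are invented for illustration, and the Z-score is assumed to be centered by the mean and scaled by the median absolute deviation (section 5.5 gives the definition the text relies on).

```r
set.seed(100)

# simulated right-skewed spending values with two injected extreme points
y_sim <- c(rlnorm(500, meanlog = 6, sdlog = 0.5), 50000, 80000)

# count missing values in the response
sum(is.na(y_sim))

# robust Z-score: center by the mean, scale by the MAD (assumed definition)
zs_sim <- (y_sim - mean(y_sim)) / mad(y_sim)

# logical mask of points to keep
keep <- zs_sim <= 3.5
y_clean <- y_sim[keep]

# number of observations before and after filtering
c(before = length(y_sim), after = length(y_clean))
```

A logical mask is used here because negative indexing with `which()` has a pitfall: when no element exceeds the threshold, `x[-which(cond)]` returns an empty result instead of the full vector.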