8.1 Regression Model Performance

Mean Squared Error (MSE) measures the average of the squares of the errors, that is, the average squared difference between the estimated values (\(\hat{y}_{i}\)) and the actual values (\(y_i\)). The Root Mean Squared Error (RMSE) is the square root of the MSE.

\[MSE=\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}\] \[RMSE=\sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\hat{y}_{i})^{2}}\]

Both are common measures of regression model performance. Let’s use the income prediction from earlier as an example and fit a simple linear model:

# load the data from the previous income prediction example
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
# fit a linear model of income on expenditures and transaction counts
fit <- lm(formula = income ~ store_exp + online_exp + store_trans + 
    online_trans, data = sim.dat)
summary(fit)
## 
## Call:
## lm(formula = income ~ store_exp + online_exp + store_trans + 
##     online_trans, data = sim.dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -128768  -15804     441   13375  150945 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  85711.680   3651.599   23.47  < 2e-16 ***
## store_exp        3.198      0.475    6.73  3.3e-11 ***
## online_exp       8.995      0.894   10.06  < 2e-16 ***
## store_trans   4631.751    436.478   10.61  < 2e-16 ***
## online_trans -1451.162    178.835   -8.11  1.8e-15 ***
## ---
## Signif. codes:  
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 31500 on 811 degrees of freedom
##   (184 observations deleted due to missingness)
## Multiple R-squared:  0.602,  Adjusted R-squared:   0.6 
## F-statistic:  306 on 4 and 811 DF,  p-value: <2e-16

You can calculate the RMSE:

y <- sim.dat$income
yhat <- predict(fit, sim.dat)
# na.rm = TRUE drops rows with missing values before averaging
MSE <- mean((y - yhat)^2, na.rm = TRUE)
RMSE <- sqrt(MSE)
RMSE
## [1] 31433

Another common performance measure for regression models is R-squared, often denoted as \(R^2\). It is the square of the correlation between the fitted values and the observed values, and it is often interpreted as the proportion of the variance in the data that the model can explain. The above model returns an R-squared of 0.602, which indicates that the model explains 60.2% of the variance in income. While \(R^2\) is easy to explain, it is not a direct measure of model accuracy; it measures correlation. Here the \(R^2\) value is not low, but the RMSE is \(3.1433 \times 10^{4}\), which means the average difference between the fitted values and the observations is about \(3.1433 \times 10^{4}\). That may be a significant discrepancy from an application point of view. A high \(R^2\) does not guarantee that the model is accurate enough.
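
As a quick check of the squared-correlation interpretation, you can recompute \(R^2\) from the y and yhat vectors created above; it reproduces the value reported by summary(fit):

# R-squared as the squared correlation between observed and fitted values;
# use = "complete.obs" drops rows where income or a predictor is missing
cor(y, yhat, use = "complete.obs")^2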

We used \(R^2\) to show the impact of error in the independent and response variables in Chapter 7, where we didn’t consider the impact of the number of parameters (because the number of parameters was very small compared to the number of observations). However, \(R^2\) increases as the number of parameters increases, even when the additional parameters carry no real explanatory power. So people usually use the adjusted R-squared, which is designed to mitigate this issue. The original \(R^2\) is defined as:

\[R^{2}=1-\frac{RSS}{TSS}\]

where \(RSS=\sum_{i=1}^{n}(y_{i}-\hat{y_{i}})^{2}\) and \(TSS=\sum_{i=1}^{n}(y_{i}-\bar{y})^{2}\).

Since \(RSS\) always decreases as the number of parameters increases, \(R^2\) increases as a result. For a model with \(p\) parameters, the adjusted \(R^2\) is defined as:

\[Adjusted\ R^{2}=1-\frac{RSS/(n-p-1)}{TSS/(n-1)}\]

Maximizing the adjusted \(R^{2}\) is equivalent to minimizing \(RSS/(n-p-1)\). Since the number of parameters \(p\) appears in the denominator, \(RSS/(n-p-1)\) can increase or decrease as \(p\) increases. The idea behind this is that the adjusted R-squared increases if a new variable improves the model more than would be expected by chance, and decreases when it improves the model less than expected by chance. While adjusted \(R^2\) values are usually positive, they can be negative as well.
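
To make the formula concrete, here is a short sketch that first reproduces the adjusted \(R^2\) by hand from the fit above, and then adds a pure-noise predictor (the noise column is invented for illustration): the ordinary \(R^2\) can only go up, while the adjusted \(R^2\) will typically go down.

# adjusted R-squared by hand, using only the complete cases lm() kept
d <- fit$model                 # model frame: the rows used in the fit
n <- nrow(d)
p <- 4                         # number of predictors
rss <- sum(residuals(fit)^2)
tss <- sum((d$income - mean(d$income))^2)
1 - (rss / (n - p - 1)) / (tss / (n - 1))  # same as summary(fit)$adj.r.squared

# add a pure-noise predictor: R-squared never decreases,
# but the adjusted R-squared can
set.seed(1)
sim.dat$noise <- rnorm(nrow(sim.dat))
fit2 <- update(fit, . ~ . + noise)
c(summary(fit)$r.squared, summary(fit2)$r.squared)
c(summary(fit)$adj.r.squared, summary(fit2)$adj.r.squared)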

Another measure is \(C_{p}\). For a least squares model with \(p\) parameters:

\[C_{p}=\frac{1}{n}(RSS+2p\hat{\sigma}^{2})\]

where \(\hat{\sigma}^{2}\) is an estimate of the variance of the error term \(\epsilon\). \(C_{p}\) adds the penalty \(2p\hat{\sigma}^{2}\) to the training set \(RSS\). The goal is to correct for the over-optimism of a measure computed on the training data. As the number of parameters increases, the penalty increases, counteracting the decrease in \(RSS\) that comes from adding parameters. We choose the model with the smaller \(C_{p}\).
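
A minimal sketch of the computation in R, assuming \(\hat{\sigma}^{2}\) is estimated from the full model via sigma(fit) (the residual standard error in the summary output):

# C_p for the model above, following the formula in the text
rss <- sum(residuals(fit)^2)
n <- nrow(fit$model)    # complete cases used in the fit
p <- 4                  # number of predictors
(rss + 2 * p * sigma(fit)^2) / n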

Both AIC and BIC are based on maximum likelihood. In linear regression, the maximum likelihood estimate is the least squares estimate. The two are defined as:

\[AIC = n + n\log(2\pi) + n\log(RSS/n) + 2(p+1)\]

\[BIC = n + n\log(2\pi) + n\log(RSS/n) + \log(n)(p+1)\]

The R functions AIC() and BIC() calculate the AIC and BIC values according to the above equations. Many textbooks omit the constant term \(n + n\log(2\pi)\) and use \(p\) instead of \(p+1\). Those slightly different versions give the same results since we are only interested in relative values. Compared with AIC, BIC puts a heavier penalty on the number of parameters.
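
As a sketch, you can check the equations against the built-in functions. One detail is how \(p\) is counted: if \(p+1\) is read as the five estimated coefficients (including the intercept) plus the error variance, which is how logLik() counts parameters for an lm object, the hand computation reproduces AIC(fit) and BIC(fit) exactly; other conventions differ only by a constant and so give the same model rankings.

n <- nrow(fit$model)                # complete cases used in the fit
rss <- sum(residuals(fit)^2)
k <- length(coef(fit)) + 1          # p + 1: five coefficients plus sigma^2
n + n * log(2 * pi) + n * log(rss / n) + 2 * k        # AIC by hand
n + n * log(2 * pi) + n * log(rss / n) + log(n) * k   # BIC by hand
AIC(fit)                            # built-in equivalents
BIC(fit)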