# install packages from CRAN
p_needed <- c('caret', 'dplyr', 'randomForest',
'readr', 'car', 'pROC', 'fmsb')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
install.packages(p_to_install)
}
lapply(p_needed, require, character.only = TRUE)
In this notebook, we introduce basic model performance measurements for regression and classification problems.
We first fit a linear regression model on a simulated dataset, and then walk through common metrics for checking model performance, such as MSE, RMSE, and R-squared.
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
fit <- lm(formula = income ~ store_exp + online_exp + store_trans +
    online_trans, data = sim.dat)
summary(fit)
##
## Call:
## lm(formula = income ~ store_exp + online_exp + store_trans +
## online_trans, data = sim.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -128768 -15804 441 13375 150945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85711.6796 3651.5991 23.472 < 2e-16 ***
## store_exp 3.1977 0.4754 6.726 3.28e-11 ***
## online_exp 8.9949 0.8943 10.058 < 2e-16 ***
## store_trans 4631.7507 436.4777 10.612 < 2e-16 ***
## online_trans -1451.1618 178.8355 -8.115 1.80e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31530 on 811 degrees of freedom
## (184 observations deleted due to missingness)
## Multiple R-squared: 0.6018, Adjusted R-squared: 0.5998
## F-statistic: 306.4 on 4 and 811 DF, p-value: < 2.2e-16
The summary of the fitted model fit shows the RMSE is 31530 (reported at the
bottom of the output as Residual standard error).
Another common performance measure for regression models is
R-squared, often denoted as \(R^2\). It
is the square of the correlation between the fitted values and the
observed values. It is often interpreted as the percentage of the
information in the data that the model can explain. The above
model returns an R-squared of 0.6, which indicates the model can explain
60% of the variance in the variable income. While \(R^2\) is easy to explain, it is a
measure of correlation rather than a direct measure of model accuracy. Here the \(R^2\) value is not low, but the RMSE is 31530,
which means the average difference between the model fit and the
observations is about \(3.153 \times 10^{4}\). That is a big discrepancy from an application
point of view. When the response variable has a large scale and high
variance, a high \(R^2\) doesn't mean
the model has enough accuracy. It is also important to remember that
\(R^2\) depends on the variation
of the outcome variable: if the response variable has a
higher variance, the model based on it tends to have a higher \(R^2\).
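To make these definitions concrete, here is a minimal sketch computing MSE, RMSE, and \(R^2\) by hand on toy vectors (the numbers below are illustrative and not from sim.dat):

```r
# toy observed and fitted values (illustrative only)
obs  <- c(10, 12, 9, 15, 11)
pred <- c(11, 11, 10, 14, 12)

# mean squared error and its square root
mse  <- mean((obs - pred)^2)
rmse <- sqrt(mse)

# R-squared as the squared correlation between fitted and observed values
r2 <- cor(obs, pred)^2

c(MSE = mse, RMSE = rmse, R2 = r2)
```

Because RMSE is on the scale of the response variable, it is the easier of the two to judge against application requirements.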
In this section, we will fit a random forest model for a classification problem, and then go through typical metrics and plots for model performance.
disease_dat <- read.csv("http://bit.ly/2KXb1Qi")
We separate the data into training and testing datasets. We then use the training dataset to train a random forest model, and predict the probability and corresponding classes for the testing dataset with the trained model.
set.seed(100)
# split the data into training and testing sets
trainIndex <- createDataPartition(disease_dat$y, p = 0.8,
list = F, times = 1)
xTrain <- disease_dat[trainIndex, ] %>% dplyr::select(-y)
xTest <- disease_dat[-trainIndex, ] %>% dplyr::select(-y)
# the response variable needs to be a factor
yTrain <- disease_dat$y[trainIndex] %>% as.factor()
yTest <- disease_dat$y[-trainIndex] %>% as.factor()
train_rf <- randomForest(yTrain ~ .,
data = xTrain,
mtry = trunc(sqrt(ncol(xTrain) - 1)),
ntree = 1000,
importance = T)
yhatprob <- predict(train_rf, xTest, "prob")
set.seed(100)
car::some(yhatprob)
## 0 1
## 17 0.689 0.311
## 29 0.532 0.468
## 211 0.473 0.527
## 253 0.514 0.486
## 357 0.507 0.493
## 486 0.540 0.460
## 520 0.623 0.377
## 562 0.518 0.482
## 686 0.468 0.532
## 755 0.569 0.431
yhat <- predict(train_rf, xTest)
car::some(yhat)
## 5 141 179 231 245 352 438 696 725 772
## 0 0 0 1 0 0 0 0 1 0
## Levels: 0 1
The confusion matrix is a typical way to check model performance for classification problems. The kappa statistic measures the agreement between the observed and predicted classes.
yhat <- as.factor(yhat) %>% relevel("1")
yTest <- as.factor(yTest) %>% relevel("1")
table(yhat, yTest)
## yTest
## yhat 1 0
## 1 35 13
## 0 38 74
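From this table we can also read off accuracy, sensitivity, and specificity by hand. A minimal sketch using the counts shown above, treating class 1 as the positive class:

```r
# counts from the confusion matrix above (rows = predicted, cols = observed)
TP <- 35   # predicted 1, observed 1
FP <- 13   # predicted 1, observed 0
FN <- 38   # predicted 0, observed 1
TN <- 74   # predicted 0, observed 0

accuracy    <- (TP + TN) / (TP + FP + FN + TN)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```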
kt <- fmsb::Kappa.test(table(yhat, yTest))
kt$Result
##
## Estimate Cohen's kappa statistics and test the null hypothesis that
## the extent of agreement is same as random (kappa=0)
##
## data: table(yhat, yTest)
## Z = 4.1451, p-value = 1.698e-05
## 95 percent confidence interval:
## 0.1897309 0.4890256
## sample estimates:
## [1] 0.3393782
Kappa can take on a value from -1 to 1. The higher the value, the higher the agreement. A value of 0 means there is no agreement between the observed and predicted classes, while a value of 1 indicates perfect agreement. A negative value indicates that the prediction is in the opposite direction of the observed value. The following table may help you “visualize” the interpretation of kappa:
| Kappa | Agreement |
|---|---|
| < 0 | Less than chance agreement |
| 0.01–0.20 | Slight agreement |
| 0.21–0.40 | Fair agreement |
| 0.41–0.60 | Moderate agreement |
| 0.61–0.80 | Substantial agreement |
| 0.81–0.99 | Almost perfect agreement |
kt$Judgement
## [1] "Fair agreement"
The ROC (receiver operating characteristic) curve and AUC (area under the curve) are two common ways to evaluate classification model performance.
rocCurve <- pROC::roc(response = yTest,
predictor = yhatprob[,2])
plot(1-rocCurve$specificities,
rocCurve$sensitivities,
type = 'l',
xlab = '1 - Specificities',
ylab = 'Sensitivities')
# get the estimate of AUC
auc(rocCurve)
## Area under the curve: 0.7694
# get a confidence interval based on DeLong et al.
ci.auc(rocCurve)
## 95% CI: 0.6972-0.8416 (DeLong)
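AUC has an intuitive interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample. A minimal sketch on toy scores (illustrative, not from the model above):

```r
# toy predicted scores and true labels (illustrative only)
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)
label <- c(1, 1, 0, 1, 0, 0)

pos <- score[label == 1]
neg <- score[label == 0]

# compare every positive score with every negative score;
# ties count as half a concordant pair
concordant <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_hand <- mean(concordant)
auc_hand
## [1] 0.8888889
```

Eight of the nine positive-negative pairs are ranked correctly, giving an AUC of 8/9.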
The gain and lift chart is another visual tool for evaluating the performance of a classification model. It ranks the samples by their scores and calculates the cumulative positive rate as more samples are evaluated.
table(yTest)
## yTest
## 1 0
## 73 87
# predicted outbreak probability
modelscore <- yhatprob[ ,2]
# randomly sample from a uniform distribution
randomscore <- runif(length(yTest))
labs <- c(modelscore = "Random Forest Prediction",
randomscore = "Random Number")
liftCurve <- caret::lift(yTest ~ modelscore + randomscore,
class = "1",
labels = labs)
xyplot(liftCurve, auto.key = list(columns = 2, lines = T, points = F))
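What the lift function does internally can be sketched in a few lines: sort the samples by score, then track the cumulative share of positives found as more samples are included (toy numbers, not from the model above):

```r
# toy scores and labels (illustrative only)
score <- c(0.9, 0.2, 0.8, 0.6, 0.1, 0.7)
label <- c(1, 0, 1, 0, 0, 1)

ord <- order(score, decreasing = TRUE)
# cumulative share of all positives captured after each sample
cum_gain <- cumsum(label[ord]) / sum(label)
data.frame(pct_sampled = seq_along(ord) / length(ord),
           pct_positives_found = cum_gain)
```

Here all three positives are captured within the top half of the ranked samples, which is exactly the kind of pattern the gain curve visualizes; a random ranking would track the diagonal instead.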