# install packages from CRAN
p_needed <- c('caret', 'dplyr', 'randomForest',
'readr', 'car', 'pROC', 'fmsb')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
install.packages(p_to_install)
}
lapply(p_needed, require, character.only = TRUE)
In this notebook, we introduce basic model performance measurements for regression and classification problems.
We first fit a linear regression model on a simulated dataset, and then walk through common metrics for checking model performance, such as MSE, RMSE, and R-squared.
sim.dat <- read.csv("http://bit.ly/2P5gTw4")
fit <- lm(formula = income ~ store_exp + online_exp + store_trans +
    online_trans, data = sim.dat)
summary(fit)
##
## Call:
## lm(formula = income ~ store_exp + online_exp + store_trans +
## online_trans, data = sim.dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -128768 -15804 441 13375 150945
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 85711.6796 3651.5991 23.472 < 2e-16 ***
## store_exp 3.1977 0.4754 6.726 3.28e-11 ***
## online_exp 8.9949 0.8943 10.058 < 2e-16 ***
## store_trans 4631.7507 436.4777 10.612 < 2e-16 ***
## online_trans -1451.1618 178.8355 -8.115 1.80e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31530 on 811 degrees of freedom
## (184 observations deleted due to missingness)
## Multiple R-squared: 0.6018, Adjusted R-squared: 0.5998
## F-statistic: 306.4 on 4 and 811 DF, p-value: < 2.2e-16
The summary of the fitted model fit shows the RMSE is 31530 (reported at the
bottom of the output as Residual standard error).
Another common performance measure for regression models is
R-squared, often denoted as \(R^2\). It
is the square of the correlation between the fitted values and the
observed values. It is often interpreted as the percentage of the
information in the data that the model can explain. The above
model returns an R-squared of 0.6, which indicates the model can explain
60% of the variance in the variable income. While \(R^2\) is easy to explain, it is a
measure of correlation rather than a direct measure of model accuracy. Here the \(R^2\) value is not low, but the RMSE is 31530,
which means the average difference between the model fit and the
observations is about \(3.153 \times 10^{4}\). That is a big discrepancy from an application
point of view. When the response variable has a large scale and high
variance, a high \(R^2\) doesn't mean
the model has enough accuracy. It is also important to remember that
\(R^2\) depends on the variation
of the outcome variable: if the response variable has a
higher variance, the model based on it tends to have a higher \(R^2\).
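To make these definitions concrete, here is a minimal sketch computing MSE, RMSE, and \(R^2\) by hand on toy vectors (the numbers below are illustrative and not from sim.dat):

```r
# toy observed and fitted values (illustrative only)
obs  <- c(10, 12, 9, 15, 11)
pred <- c(11, 11, 10, 14, 12)

# mean squared error and its square root
mse  <- mean((obs - pred)^2)
rmse <- sqrt(mse)

# R-squared as the squared correlation between fitted and observed values
r2 <- cor(obs, pred)^2

c(MSE = mse, RMSE = rmse, R2 = r2)
```

Because RMSE is on the scale of the response variable, it is the easier of the two to judge against application requirements.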
In this section, we will fit a random forest model for a classification problem, and then go through typical metrics and plots for model performance.
disease_dat <- read.csv("http://bit.ly/2KXb1Qi")
We separate the data into training and testing datasets. We then use the training dataset to train a random forest model, and predict the probability and corresponding classes for the testing dataset with the trained model.
set.seed(100)
# split the data into training and testing sets
trainIndex <- createDataPartition(disease_dat$y, p = 0.8,
list = F, times = 1)
xTrain <- disease_dat[trainIndex, ] %>% dplyr::select(-y)
xTest <- disease_dat[-trainIndex, ] %>% dplyr::select(-y)
# the response variable needs to be a factor
yTrain <- disease_dat$y[trainIndex] %>% as.factor()
yTest <- disease_dat$y[-trainIndex] %>% as.factor()
train_rf <- randomForest(yTrain ~ .,
data = xTrain,
mtry = trunc(sqrt(ncol(xTrain) - 1)),
ntree = 1000,
importance = T)
yhatprob <- predict(train_rf, xTest, "prob")
set.seed(100)
car::some(yhatprob)
## 0 1
## 17 0.689 0.311
## 29 0.532 0.468
## 211 0.473 0.527
## 253 0.514 0.486
## 357 0.507 0.493
## 486 0.540 0.460
## 520 0.623 0.377
## 562 0.518 0.482
## 686 0.468 0.532
## 755 0.569 0.431
yhat <- predict(train_rf, xTest)
car::some(yhat)
## 5 141 179 231 245 352 438 696 725 772
## 0 0 0 1 0 0 0 0 1 0
## Levels: 0 1
The confusion matrix is a typical way to check model performance for classification problems. The kappa statistic measures the agreement between the observed and predicted classes.
yhat <- as.factor(yhat) %>% relevel("1")
yTest <- as.factor(yTest) %>% relevel("1")
table(yhat, yTest)
## yTest
## yhat 1 0
## 1 35 13
## 0 38 74
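From this table we can also read off accuracy, sensitivity, and specificity by hand. A minimal sketch using the counts shown above, treating class 1 as the positive class:

```r
# counts from the confusion matrix above (rows = predicted, cols = observed)
TP <- 35   # predicted 1, observed 1
FP <- 13   # predicted 1, observed 0
FN <- 38   # predicted 0, observed 1
TN <- 74   # predicted 0, observed 0

accuracy    <- (TP + TN) / (TP + FP + FN + TN)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
```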
kt <- fmsb::Kappa.test(table(yhat, yTest))
kt$Result
##
## Estimate Cohen's kappa statistics and test the null hypothesis that
## the extent of agreement is same as random (kappa=0)
##
## data: table(yhat, yTest)
## Z = 4.1451, p-value = 1.698e-05
## 95 percent confidence interval:
## 0.1897309 0.4890256
## sample estimates:
## [1] 0.3393782
Kappa can take on a value from -1 to 1. The higher the value, the higher the agreement. A value of 0 means there is no agreement between the observed and predicted classes, while a value of 1 indicates perfect agreement. A negative value indicates that the prediction is in the opposite direction of the observed value. The following table may help you “visualize” the interpretation of kappa:
| Kappa | Agreement |
|---|---|
| < 0 | Less than chance agreement |
| 0.01–0.20 | Slight agreement |
| 0.21–0.40 | Fair agreement |
| 0.41–0.60 | Moderate agreement |
| 0.61–0.80 | Substantial agreement |
| 0.81–0.99 | Almost perfect agreement |
kt$Judgement
## [1] "Fair agreement"
The ROC (receiver operating characteristic) curve and AUC (area under the curve) are two common ways to evaluate classification model performance.
rocCurve <- pROC::roc(response = yTest,
predictor = yhatprob[,2])
plot(1-rocCurve$specificities,
rocCurve$sensitivities,
type = 'l',
xlab = '1 - Specificities',
ylab = 'Sensitivities')
# get the estimate of AUC
auc(rocCurve)
## Area under the curve: 0.7694
# get a confidence interval based on DeLong et al.
ci.auc(rocCurve)
## 95% CI: 0.6972-0.8416 (DeLong)
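AUC has an intuitive interpretation: it is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample. A minimal sketch on toy scores (illustrative, not from the model above):

```r
# toy predicted scores and true labels (illustrative only)
score <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3)
label <- c(1, 1, 0, 1, 0, 0)

pos <- score[label == 1]
neg <- score[label == 0]

# compare every positive score with every negative score;
# ties count as half a concordant pair
concordant <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")
auc_hand <- mean(concordant)
auc_hand
## [1] 0.8888889
```

Eight of the nine positive-negative pairs are ranked correctly, giving an AUC of 8/9.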
The gain and lift chart is another visual tool for evaluating the performance of a classification model. It ranks the samples by their scores and calculates the cumulative positive rate as more samples are evaluated.
table(yTest)
## yTest
## 1 0
## 73 87
# predicted outbreak probability
modelscore <- yhatprob[ ,2]
# randomly sample from a uniform distribution
randomscore <- runif(length(yTest))
labs <- c(modelscore = "Random Forest Prediction",
randomscore = "Random Number")
liftCurve <- caret::lift(yTest ~ modelscore + randomscore,
class = "1",
labels = labs)
xyplot(liftCurve, auto.key = list(columns = 2, lines = T, points = F))
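What the lift function does internally can be sketched in a few lines: sort the samples by score, then track the cumulative share of positives found as more samples are included (toy numbers, not from the model above):

```r
# toy scores and labels (illustrative only)
score <- c(0.9, 0.2, 0.8, 0.6, 0.1, 0.7)
label <- c(1, 0, 1, 0, 0, 1)

ord <- order(score, decreasing = TRUE)
# cumulative share of all positives captured after each sample
cum_gain <- cumsum(label[ord]) / sum(label)
data.frame(pct_sampled = seq_along(ord) / length(ord),
           pct_positives_found = cum_gain)
```

Here all three positives are captured within the top half of the ranked samples, which is exactly the kind of pattern the gain curve visualizes; a random ranking would track the diagonal instead.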