## 7.2 Data Splitting and Resampling

Those highly adaptable models can model complex relationships. However, they tend to overfit which leads to the poor prediction by learning too much from the data. It means that the model is susceptible to the specific sample used to fit it. When future data is not exactly like the past data, the model prediction may have big mistakes. A simple model like ordinary linear regression tends instead to underfit which leads to a bad prediction by learning too little from the data. It systematically over-predicts or under-predicts the data regardless of how well future data resemble past data. Without evaluating models, the modeler will not know about the problem before the future samples. Data splitting and resampling are fundamental techniques to build sound models for prediction.

### 7.2.1 Data Splitting

Data splitting is to put part of the data aside as testing set (or Hold-outs, out of bag samples) and use the rest for model training. Training samples are also called in-sample. Model performance metrics evaluated using in-sample are retrodictive, not predictive.

The traditional business intelligence usually handles data description. Answer simple questions by querying and summarizing the data, such as:

• What is the monthly sales of a product in 2015?
• What is the number of visits to our site in the past month?
• What is the sales difference in 2015 for two different product designs?

There is no need to go through the tedious process of splitting the data, tuning and testing model to answer questions of this kind. Instead, people usually use as complete data as possible and then sum or average the parts of interest.

Many models have parameters which cannot be directly estimated from the data, such as $$\lambda$$ in the lasso (penalty parameter), the number of trees in the random forest. This type of model parameter is called tuning parameter, and there is no analytical formula available to calculate the optimized value. Tuning parameters often control the complexity of the model. A poor choice can result in over-fitting or under-fitting. A standard approach to estimate tuning parameters is through cross-validation which is a data resampling approach.

To get a reasonable precision of the performance based on a single test set, the size of the test set may need to be large. So a conventional approach is to use a subset of samples to fit the model and use the rest to evaluate model performance. This process will repeat multiple times to get a performance profile. In that sense, resampling is based on splitting. The general steps are:

• Define a set of candidate values for tuning parameter(s)
• For each candidate value in the set
• Resample data
• Fit model
• Predict hold-out
• Calculate performance
• Aggregate the results
• Determine the final tuning parameter
• Refit the model with the entire data set

The above is an outline of the general procedure to tune parameters. Now let’s focus on the critical part of the process: data splitting. Ideally, we should evaluate model using samples that were not used to build or fine-tune the model. So it provides an unbiased sense of model effectiveness. When the sample size is large, it is a good practice to set aside part of the samples to evaluate the final model. People use “training” data to indicate samples used to fit or fine-tune the model and “test” or “validation” data set is used to validate performance.

The first decision to make for data splitting is to decide the proportion of data in the test set. There are two factors to consider here: (1) sample size; (2) computation intensity. If the sample size is large enough which is the most common situation according to my experience, you can try to use 20%, 30% and 40% of the data as the test set, and see which one works the best. If the model is computationally intense, then you may consider starting from a smaller sample of data to train the model hence will have a higher portion of data in the test set. Depending on how it performs, you may need to increase the training set. If the sample size is small, you can use cross-validation or bootstrap which is the topic in the next section.

The next decision is to decide which samples are in the test set. There is a desire to make the training and test sets as similar as possible. A simple way is to split data by random sampling which, however, does not control for any of the data attributes, such as the percentage of the retained customer in the data. So it is possible that the distribution of outcomes is substantially different between the training and test sets. There are three main ways to split the data that account for the similarity of resulted data sets. We will describe the three approaches using the clothing company customer data as examples.

1. Split data according to the outcome variable

Assume the outcome variable is customer segment (column segment) and we decide to use 80% as training and 20% test. The goal is to make the proportions of the categories in the two sets as similar as possible. The createDataPartition() function in caret will return a balanced splitting based on assigned variable.

# load data
library(caret)
# set random seed to make sure reproducibility
set.seed(3456)
trainIndex <- createDataPartition(sim.dat$segment, p = 0.8, list = FALSE, times = 1) head(trainIndex) ## Resample1 ## [1,] 1 ## [2,] 2 ## [3,] 3 ## [4,] 4 ## [5,] 6 ## [6,] 7 The list = FALSE in the call to createDataPartition is to return a data frame. The times = 1 tells R how many times you want to split the data. Here we only do it once, but you can repeat the splitting multiple times. In that case, the function will return multiple vectors indicating the rows to training/test. You can set times＝2 and rerun the above code to see the result. Then we can use the returned indicator vector trainIndex to get training and test sets: # get training set datTrain <- sim.dat[trainIndex, ] # get test set datTest <- sim.dat[-trainIndex, ] According to the setting, there are 800 samples in the training set and 200 in test set. Let’s check the distribution of the two sets: library(plyr) ddply(datTrain, "segment", summarise, count = length(segment), percentage = round(length(segment)/nrow(datTrain), 2)) ## segment count percentage ## 1 Conspicuous 160 0.20 ## 2 Price 200 0.25 ## 3 Quality 160 0.20 ## 4 Style 280 0.35 ddply(datTest, "segment", summarise, count = length(segment), percentage = round(length(segment)/nrow(datTest), 2)) ## segment count percentage ## 1 Conspicuous 40 0.20 ## 2 Price 50 0.25 ## 3 Quality 40 0.20 ## 4 Style 70 0.35 The percentages are the same for these two sets. In practice, it is possible that the distributions are not exactly identical but should be close. 1. Divide data according to predictors An alternative way is to split data based on the predictors. The goal is to get a diverse subset from a dataset so that the sample is representative. In other words, we need an algorithm to identify the $$n$$ most diverse samples from a dataset with size $$N$$. However, the task is generally infeasible for non-trivial values of $$n$$ and $$N$$ (Willett 2004). And hence practicable approaches to dissimilarity-based selection involve approximate methods that are sub-optimal. A major class of algorithms split the data on maximum dissimilarity sampling. The process starts from: • Initialize a single sample as starting test set • Calculate the dissimilarity between this initial sample and each remaining samples in the dataset • Add the most dissimilar unallocated sample to the test set To move forward, we need to define the dissimilarity between groups. Each definition results in a different version of the algorithm and hence a different subset. It is the same problem as in hierarchical clustering where you need to define a way to measure the distance between clusters. The possible approaches are to use minimum, maximum, sum of all distances, the average of all distances, etc. Unfortunately, there is not a single best choice, and you may have to try multiple methods and check the resulted sample sets. R users can implement the algorithm using maxDissim() function from caret package. The obj argument is to set the definition of dissimilarity. Refer to the help documentation for more details (?maxDissim). Let’s use two variables (age and income) from the customer data as an example to illustrate how it works in R and compare maximum dissimilarity sampling with random sampling. library(lattice) # select variables testing <- subset(sim.dat, select = c("age", "income")) Random select 5 samples as initial subset (start) , the rest will be in samplePool: set.seed(5) # select 5 random samples startSet <- sample(1:dim(testing)[1], 5) start <- testing[startSet, ] # save the rest in data frame 'samplePool' samplePool <- testing[-startSet, ] Use maxDissim() to select another 5 samples from samplePool that are as different as possible with the initical set start: selectId <- maxDissim(start, samplePool, obj = minDiss, n = 5) minDissSet <- samplePool[selectId, ] The obj = minDiss in the above code tells R to use minimum dissimilarity to define the distance between groups. Next, random select 5 samples from samplePool in data frame RandomSet: selectId <- sample(1:dim(samplePool)[1], 5) RandomSet <- samplePool[selectId, ] Plot the resulted set to compare different sampling methods: start$group <- rep("Initial Set", nrow(start))
minDissSet$group <- rep("Maximum Dissimilarity Sampling", nrow(minDissSet)) RandomSet$group <- rep("Random Sampling", nrow(RandomSet))
xyplot(age ~ income, data = rbind(start, minDissSet, RandomSet), grid = TRUE,
group = group, auto.key = TRUE)

The points from maximum dissimilarity sampling are far away from the initial samples ( Fig. 7.5, while the random samples are much closer to the initial ones. Why do we need a diverse subset? Because we hope the test set to be representative. If all test set samples are from respondents younger than 30, model performance on the test set has a high risk to fail to tell you how the model will perform on more general population.

• Divide data according to time

For time series data, random sampling is usually not the best way. There is an approach to divide data according to time-series. Since time series is beyond the scope of this book, there is not much discussion here. For more detail of this method, see (Hyndman and Athanasopoulos 2013). We will use a simulated first-order autoregressive model [AR (1)] time-series data with 100 observations to show how to implement using the function createTimeSlices () in the caret package.

# simulte AR(1) time series samples
timedata = arima.sim(list(order=c(1,0,0), ar=-.9), n=100)
# plot time series
plot(timedata, main=(expression(AR(1)~~~phi==-.9)))     

Fig. 7.6 shows 100 simulated time series observation. The goal is to make sure both training and test set to cover the whole period.

timeSlices <- createTimeSlices(1:length(timedata),
initialWindow = 36, horizon = 12, fixedWindow = T)
str(timeSlices,max.level = 1)
## List of 2
##  $train:List of 53 ##$ test :List of 53

There are three arguments in the above createTimeSlices().

• initialWindow: The initial number of consecutive values in each training set sample
• horizon: the number of consecutive values in test set sample
• fixedWindow: if FALSE, all training samples start at 1

The function returns two lists, one for the training set, the other for the test set. Let’s look at the first training sample:

# get result for the 1st training set
trainSlices <- timeSlices[[1]]
# get result for the 1st test set
testSlices <- timeSlices[[2]]
# check the index for the 1st training and test set
trainSlices[[1]]
##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
## [18] 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
## [35] 35 36
testSlices[[1]]
##  [1] 37 38 39 40 41 42 43 44 45 46 47 48

The first training set is consist of sample 1-36 in the dataset (initialWindow = 36). Then sample 37-48 are in the first test set ( horizon = 12). Type head(trainSlices) or head(testSlices) to check the later samples. If you are not clear about the argument fixedWindow, try to change the setting to be F and check the change in trainSlices and testSlices.

Understand and implement data splitting is not difficult. But there are two things to note:

1. The randomness in the splitting process will lead to uncertainty in performance measurement.
2. When the dataset is small, it can be too expensive to leave out test set. In this situation, if collecting more data is just not possible, the best shot is to use leave-one-out cross-validation which is in the next section.

### 7.2.2 Resampling

You can consider resampling as repeated splitting. The basic idea is: use part of the data to fit model and then use the rest of data to calculate model performance. Repeat the process multiple times and aggregate the results. The differences in resampling techniques usually center around the ways to choose subsamples. There are two main reasons that we may need resampling:

1. Estimate tuning parameters through resampling. Some examples of models with such parameters are Support Vector Machine (SVM), models including the penalty (LASSO) and random forest.

2. For models without tuning parameter, such as ordinary linear regression and partial least square regression, the model fitting doesn’t require resampling. But you can study the model stability through resampling.

We will introduce three most common resampling techniques: k-fold cross-validation, repeated training/test splitting, and bootstrap.

#### 7.2.2.1 k-fold cross-validation

k-fold cross-validation is to partition the original sample into $$k$$ equal size subsamples (folds). Use one of the $$k$$ folds to validate the model and the rest $$k-1$$ to train model. Then repeat the process $$k$$ times with each of the $$k$$ folds as the test set. Aggregate the results into a performance profile.

Denote by $$\hat{f}^{-\kappa}(X)$$ the fitted function, computed with the $$\kappa^{th}$$ fold removed and $$x_i^\kappa$$ the predictors for samples in left-out fold. The process of k-fold cross-validation is as follows:

1. Partition the original sample into $$k$$ equal size folds
2. for $$\kappa=1…k$$
• Use data other than fold $$\kappa$$ to train the model $$\hat{f}^{-\kappa}(X)$$
• Apply $$\hat{f}^{-\kappa}(X)$$ to predict fold $$\kappa$$ to get $$\hat{f}^{-\kappa}(x_i^\kappa)$$
1. Aggregate the results $\hat{Error} = \frac{1}{N}\Sigma_{\kappa=1}^k\Sigma_{x_i^{\kappa}}L(y_i^{\kappa},\hat{f}^{-\kappa}(x_i^\kappa))$

It is a standard way to find the value of tuning parameter that gives you the best performance. It is also a way to study the variability of model performance.

The following figure represents a 5-fold cross-validation example.

A special case of k-fold cross-validation is Leave One Out Cross Validation (LOOCV) where $$k=1$$. When sample size is small, it is desired to use as many data to train the model. Most of the functions have default setting $$k=10$$. The choice is usually 5-10 in practice, but there is no standard rule. The more folds to use, the more samples are used to fit model, and then the performance estimate is closer to the theoretical performance. Meanwhile, the variance of the performance is larger since the samples to fit model in different iterations are more similar. However, LOOCV has high computational cost since the number of interactions is the same as the sample size and each model fit uses a subset that is nearly the same size of the training set. On the other hand, when k is small (such as 2 or 3), the computation is more efficient, but the bias will increase. When the sample size is large, the impact of $$k$$ becomes marginal.

Chapter 7 of (Hastie T 2008) presents a more in-depth and more detailed discussion about the bias-variance trade-off in k-fold cross-validation.

You can implement k-fold cross-validation using createFolds() in caret:

library(caret)
class<-sim.dat$segment # creat k-folds set.seed(1) cv<-createFolds(class,k=10,returnTrain=T) str(cv) ## List of 10 ##$ Fold01: int [1:900] 1 2 3 4 5 6 7 8 9 10 ...
##  $Fold02: int [1:900] 1 2 3 4 5 6 7 9 10 11 ... ##$ Fold03: int [1:900] 1 2 3 4 5 6 7 8 10 11 ...
##  $Fold04: int [1:900] 1 2 3 4 5 6 7 8 9 11 ... ##$ Fold05: int [1:900] 1 3 4 6 7 8 9 10 11 12 ...
##  $Fold06: int [1:900] 1 2 3 4 5 6 7 8 9 10 ... ##$ Fold07: int [1:900] 2 3 4 5 6 7 8 9 10 11 ...
##  $Fold08: int [1:900] 1 2 3 4 5 8 9 10 11 12 ... ##$ Fold09: int [1:900] 1 2 4 5 6 7 8 9 10 11 ...
##  $Fold10: int [1:900] 1 2 3 5 6 7 8 9 10 11 ... The above code creates ten folds (k=10) according to the customer segments (we set class to be the categorical variable segment). The function returns a list of 10 with the index of rows in training set. #### 7.2.2.2 Repeated Training/Test Splits In fact, this method is nothing but repeating the training/test set division on the original data. Fit the model with the training set, and evaluate the model with the test set. Unlike k-fold cross-validation, the test set generated by this procedure may have duplicate samples. A sample usually shows up in more than one test sets. There is no standard rule for split ratio and number of repetitions. The most common choice in practice is to use 75% to 80% of the total sample for training. The remaining samples are for validation. The more sample in the training set, the less biased the model performance estimate is. Increasing the repetitions can reduce the uncertainty in the performance estimates. Of course, it is at the cost of computational time when the model is complex. The number of repetitions is also related to the sample size of the test set. If the size is small, the performance estimate is more volatile. In this case, the number of repetitions needs to be higher to deal with the uncertainty of the evaluation results. We can use the same function (createDataPartition ()) as before. If you look back, you will see times = 1. The only thing to change is to set it to the number of repetitions. trainIndex <- createDataPartition(sim.dat$segment, p = .8, list = FALSE, times = 5)
dplyr::glimpse(trainIndex)
##  int [1:800, 1:5] 1 3 4 5 6 7 8 9 10 11 ...
##  - attr(*, "dimnames")=List of 2
##   ..$: NULL ## ..$ : chr [1:5] "Resample1" "Resample2" "Resample3" "Resample4" ...

Once know how to split the data, the repetition comes naturally.

#### 7.2.2.3 Bootstrap Methods

Bootstrap is a powerful statistical tool (a little magic too). It can be used to analyze the uncertainty of parameter estimates (Efron and Tibshirani 1986) quantitatively. For example, estimate the standard deviation of linear regression coefficients. The power of this method is that the concept is so simple that it can be easily applied to any model as long as the computation allows. However, you can hardly obtain the standard deviation for some models by using the traditional statistical inference.

Since it is with replacement, a sample can be selected multiple times, and the bootstrap sample size is the same as the original data. So for every bootstrap set, there are some left-out samples, which is also called “out-of-bag samples.” The out-of-bag sample is used to evaluate the model. Efron points out that under normal circumstances (Efron 1983), bootstrap estimates the error rate of the model with more certainty.The probability of an observation $$i$$ in bootstrap sample B is:

$$\begin{array}{ccc} Pr{i\in B} & = & 1-\left(1-\frac{1}{N}\right)^{N}\\ & \approx & 1-e^{-1}\\ & = & 0.632 \end{array}$$

On average, 63.2% of the observations appeared at least once in a bootstrap sample, so the estimation bias is similar to 2-fold cross-validation. As mentioned earlier, the smaller the number of folds, the larger the bias. Increasing the sample size will ease the problem. In general, bootstrap has larger bias and smaller uncertainty than cross-validation. Efron came up the following “.632 estimator” to alleviate this bias:

$(0.632 × original\ bootstrap\ estimate) + (0.368 × apparent\ error\ rate)$

The apparent error rate is the error rate when the data is used twice, both to fit the model and to check its accuracy and it is apparently over-optimistic. The modified bootstrap estimate reduces the bias but can be unstable with small samples size. This estimate can also be unduly optimistic when the model severely over-fits since the apparent error rate will be close to zero. Efron and Tibshirani (Efron and Tibshirani 1997) discuss another technique, called the “632+ method,” for adjusting the bootstrap estimates.

### References

Efron, B. 1983. “Estimating the Error Rate of a Prediction Rule: Improvement on Cross-Validation.” Journal of the American Statistical Association, 316–31.

Efron, B, and R Tibshirani. 1986. “Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy.” Statistical Science, 54–75.

Efron, B, and R Tibshirani. 1997. “Improvements on Cross-Validation: The 632+ Bootstrap Method.” Journal of the American Statistical Association 92 (438): 548–60.

Hastie T, Friedman J, Tibshirani R. 2008. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer.

Hyndman, R.J., and G. Athanasopoulos. 2013. Forecasting: Principles and Practice. Vol. Section 2/5. OTect: Melbourne, Australia.

Willett, Peter. 2004. “Dissimilarity-Based Algorithms for Selecting Structurally Diverse Sets of Compounds.” Journal of Computational Biology 6(3-4) (doi:10.1089/106652799318382): 447–57.