Chapter 7 Model Tuning Strategy

When training a machine learning model, there are many decisions to make. For example, when training a random forest, you need to decide the number of trees and the number of variables to randomly sample at each split. For the lasso method, you need to determine the penalty parameter. Unlike the parameters derived during training (such as the coefficients in a linear regression model), these parameters control the learning process itself and are called hyperparameters. To train a model, you need to set the values of its hyperparameters first.
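As a minimal sketch of the distinction, the two examples above can be written in R (assuming the `randomForest` and `glmnet` packages are installed; the data set and hyperparameter values are chosen only for illustration):

```r
library(randomForest)
library(glmnet)

# Random forest: ntree (number of trees) and mtry (variables sampled
# at each split) are hyperparameters set before training begins
rf_fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500, mtry = 3)

# Lasso: lambda (the penalty parameter) is a hyperparameter;
# the fitted coefficients, by contrast, are learned from the data
x <- as.matrix(mtcars[, -1])
y <- mtcars$mpg
lasso_fit <- glmnet(x, y, alpha = 1, lambda = 0.1)
```

In both calls, the hyperparameters (`ntree`, `mtry`, `lambda`) are inputs you choose, while the splits of the trees and the lasso coefficients are outputs of training.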

A common way to make these decisions is to split the data into training and testing sets: fit models with different hyperparameter values on the training data, apply the fitted models to the testing data, and then choose the hyperparameter value that gives the best testing performance. Data splitting is also used in model selection and evaluation, where you assess the accuracy of a model on an evaluation set and compare different models to find the best one.
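The tuning procedure just described can be sketched in R for the lasso penalty (assuming the `glmnet` package is installed; the data set, split ratio, and candidate values are illustrative assumptions, not a recommendation):

```r
library(glmnet)

set.seed(100)
# split the data: 70% for training, 30% for testing
idx   <- sample(nrow(mtcars), size = 0.7 * nrow(mtcars))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# candidate values of the penalty parameter lambda
lambdas <- c(0.001, 0.01, 0.1, 1)

# fit on the training set, evaluate RMSE on the testing set
rmse <- sapply(lambdas, function(l) {
  fit  <- glmnet(as.matrix(train[, -1]), train$mpg,
                 alpha = 1, lambda = l)
  pred <- predict(fit, newx = as.matrix(test[, -1]))
  sqrt(mean((test$mpg - pred) ^ 2))
})

# pick the hyperparameter value with the best testing performance
best_lambda <- lambdas[which.min(rmse)]
```

A single train/test split like this is the simplest version of the idea; later in the chapter we will see why resampling methods such as cross-validation are usually preferred.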

In practice, applying machine learning is a highly iterative process. This chapter illustrates the practical aspects of model tuning. We will talk about the different types of model error, the sources of model error, hyperparameter tuning, how to set up your data, and how to make sure your model implementation is correct (i.e., model selection and evaluation).

Load the R packages first:

# install packages from CRAN
p_needed <- c('ggplot2', 'tidyr', 'caret', 'dplyr',
              'lattice', 'proxy')
packages <- rownames(installed.packages())
p_to_install <- p_needed[!(p_needed %in% packages)]
if (length(p_to_install) > 0) {
    install.packages(p_to_install)
}

lapply(p_needed, require, character.only = TRUE)