It is the most straightforward data transformation: it centers and scales a variable so that it has mean 0 and standard deviation 1. This ensures that the criterion for finding linear combinations of the predictors is based on how much variation they explain, and it therefore improves numerical stability. Models that find linear combinations of the predictors to explain variation in the response or in the predictors need centered and scaled data. Examples are principal component analysis (PCA) (Jolliffe 2002), partial least squares (PLS) (Geladi P 1986), and factor analysis (Mulaik 2009). You can quickly write the code yourself to carry out this transformation.
Let's standardize the variable income:

    income <- sim.dat$income
    # calculate the mean of income
    mux <- mean(income, na.rm = TRUE)
    # calculate the standard deviation of income
    sdx <- sd(income, na.rm = TRUE)
    # centering
    tr1 <- income - mux
    # scaling
    tr2 <- tr1/sdx
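As a cross-check on the manual steps, base R's `scale()` performs the same centering and scaling in one call. The sketch below is only illustrative: the `income` values are made-up stand-ins for `sim.dat$income`, and the names `manual` and `builtin` are hypothetical.

```r
# Hypothetical toy data standing in for sim.dat$income (values are made up)
income <- c(52000, 61000, NA, 48000, 75000, 90000)

# Manual standardization, following the steps in the text
mux <- mean(income, na.rm = TRUE)
sdx <- sd(income, na.rm = TRUE)
manual <- (income - mux) / sdx

# Base R shortcut: scale() centers and scales a numeric vector,
# skipping NAs when it computes the mean and standard deviation
builtin <- as.numeric(scale(income))

# Compare the two results; the NA stays NA in both
all.equal(manual, builtin)
```

`scale()` returns a one-column matrix with the centering and scaling constants stored as attributes, which is why the result is wrapped in `as.numeric()` here.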
Alternatively, the function preProcess() in the caret package can apply this transformation to a set of predictors at once.
    # preProcess() comes from the caret package
    library(caret)
    sdat <- subset(sim.dat, select = c("age", "income"))
    # set the 'method' option
    trans <- preProcess(sdat, method = c("center", "scale"))
    # use predict() function to get the final result
    transformed <- predict(trans, sdat)
Now the two variables are on the same scale. You can check the result using summary(transformed). Note that the result still contains missing values: centering and scaling do not remove or impute them.
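Beyond `summary()`, it can help to check the means, standard deviations, and missing-value counts directly. The following is a minimal sketch under stated assumptions: the data frame is a made-up stand-in for `sim.dat`, and base R's `scale()` is used in place of the `preProcess()`/`predict()` pair so that the example runs without extra packages. The names `col_means`, `col_sds`, and `na_counts` are hypothetical.

```r
# Hypothetical toy data standing in for sim.dat (values are made up)
sdat <- data.frame(
  age    = c(23, 35, 47, 51, NA, 62),
  income = c(52000, 61000, NA, 48000, 75000, 90000)
)

# Stand-in for predict(trans, sdat): center and scale each column with base R
transformed <- as.data.frame(scale(sdat))

# Column means should be numerically zero and standard deviations one
col_means <- sapply(transformed, mean, na.rm = TRUE)
col_sds   <- sapply(transformed, sd, na.rm = TRUE)

# Missing values are carried through unchanged, not imputed
na_counts <- colSums(is.na(transformed))

# summary() reports the NA count per column alongside the quantiles
summary(transformed)
```

The same three checks can be applied to the `transformed` object produced by `predict(trans, sdat)`; missing values would have to be handled in a separate imputation step.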