5.3 Centering and Scaling
Centering and scaling is the most straightforward data transformation. It shifts and rescales a variable so that it has mean 0 and standard deviation 1. This puts all predictors on a common scale, so the criterion for finding linear combinations of the predictors reflects how much variation they explain rather than their units of measurement, and it also improves numerical stability. Models that rely on linear combinations of the predictors to explain variation in the response or the predictors therefore need centered and scaled data. Examples include principal component analysis (PCA) (Jolliffe 2002), partial least squares (PLS) (Geladi P 1986) and factor analysis (Mulaik 2009). The transformation is simple enough to code yourself.
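Written out, the transformation described above takes the following form. The symbols are introduced here for illustration: $x_i$ is the $i$-th observed value, $\bar{x}$ the sample mean, and $s$ the sample standard deviation (with the $n-1$ denominator that R's sd() uses). When missing values are dropped, $n$ counts only the non-missing observations.

$$
x_i^{*} = \frac{x_i - \bar{x}}{s}, \qquad
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^{2}}
$$

By construction, the transformed values $x_i^{*}$ have sample mean 0 and sample standard deviation 1.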
Let’s standardize the variable income from sim.dat:
    income <- sim.dat$income
    # calculate the mean of income
    mux <- mean(income, na.rm = TRUE)
    # calculate the standard deviation of income
    sdx <- sd(income, na.rm = TRUE)
    # centering
    tr1 <- income - mux
    # scaling
    tr2 <- tr1/sdx
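As a quick sanity check on the manual approach, the sketch below repeats the same steps on a small made-up vector and compares the result with base R's scale(). The numbers are hypothetical stand-ins for sim.dat$income, chosen only so the block runs on its own.

```r
# Hypothetical stand-in for sim.dat$income (values are made up for illustration)
income <- c(120000, 85000, NA, 64000, 150000, 98000)

# Manual standardization: subtract the mean, divide by the standard deviation
mux <- mean(income, na.rm = TRUE)
sdx <- sd(income, na.rm = TRUE)
tr2 <- (income - mux) / sdx

# The standardized values should have mean 0 and standard deviation 1
# (up to floating-point error)
mean(tr2, na.rm = TRUE)
sd(tr2, na.rm = TRUE)

# Base R's scale() does the same computation; it returns a one-column
# matrix, so convert it back to a plain numeric vector before comparing
tr2_base <- as.numeric(scale(income))
all.equal(tr2, tr2_base)

# Missing values are not filled in; they stay missing after the transformation
is.na(tr2)
```

Both routes ignore missing values when estimating the mean and standard deviation and leave the NA entries in place.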
Alternatively, the function preProcess() from the caret package can apply this transformation to a set of predictors.
    library(caret)  # provides preProcess()
    sdat <- subset(sim.dat, select = c("age", "income"))
    # set the 'method' option
    trans <- preProcess(sdat, method = c("center", "scale"))
    # use predict() function to get the final result
    transformed <- predict(trans, sdat)
Now the two variables are on the same scale. You can check the result using summary(transformed). Note that the missing values are still present after the transformation.
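The preProcess() and predict() calls follow a two-step pattern: first estimate the centering and scaling parameters, then apply them to a data set. The base-R sketch below illustrates that pattern only. It is not caret's implementation, and the data frames train and new as well as the helper standardize() are hypothetical names made up for this example.

```r
# Hypothetical training data standing in for sim.dat[, c("age", "income")]
train <- data.frame(age    = c(25, 40, NA, 58, 33),
                    income = c(60000, 95000, 120000, NA, 72000))
# Hypothetical new observations to be transformed with the training parameters
new <- data.frame(age = c(30, 50), income = c(80000, 110000))

# Step 1: estimate the parameters column by column, ignoring missing values
centers <- sapply(train, mean, na.rm = TRUE)
scales  <- sapply(train, sd, na.rm = TRUE)

# Step 2: apply the stored parameters to any data frame with the same columns
# (standardize() is an illustrative helper, not a caret function)
standardize <- function(df, centers, scales) {
  as.data.frame(scale(df, center = centers, scale = scales))
}

train_std <- standardize(train, centers, scales)
new_std   <- standardize(new, centers, scales)

# Each training column now has mean 0 and standard deviation 1;
# the new data are expressed on that same scale
summary(train_std)
new_std
```

Keeping the estimated parameters separate from the data makes it possible to transform later observations in exactly the same way as the data the parameters came from.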