5.3 Centering and Scaling

Centering and scaling is the most straightforward data transformation. It shifts and rescales a variable so that it has mean 0 and standard deviation 1. This ensures that the criterion for finding linear combinations of the predictors reflects how much variation they explain rather than the units they are measured in, and it also improves numerical stability. Models that find linear combinations of the predictors to explain variation in the response or the predictors need centered and scaled data; examples are principal component analysis (PCA) (Jolliffe 2002), partial least squares (PLS) (Geladi and Kowalski 1986), and factor analysis (Mulaik 2009). The transformation is simple enough to code yourself.

Let’s standardize the variable income from sim.dat:

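income <- sim.dat$income
# calculate the mean of income, ignoring missing values
mux <- mean(income, na.rm = TRUE)
# calculate the standard deviation of income, ignoring missing values
sdx <- sd(income, na.rm = TRUE)
# centering: subtract the mean
tr1 <- income - mux
# scaling: divide the centered values by the standard deviation
tr2 <- tr1 / sdx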
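
As a quick sanity check, the manual calculation can be compared with base R's scale(), which centers and scales in one call. The sketch below uses a small made-up vector in place of sim.dat$income (an assumption, so that the snippet runs on its own); the names manual and auto are illustrative only.

```r
# Hypothetical toy data standing in for sim.dat$income
income <- c(52000, 61000, NA, 48000, 75000)

# Manual standardization, as in the steps above
mux <- mean(income, na.rm = TRUE)
sdx <- sd(income, na.rm = TRUE)
manual <- (income - mux) / sdx

# scale() returns a one-column matrix; as.numeric() drops the attributes
auto <- as.numeric(scale(income))

# Both approaches should agree, with NA kept in the same position
print(all.equal(manual, auto))
```

Both approaches skip missing values when computing the mean and standard deviation, and leave them as NA in the result.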

Alternatively, the function preProcess() in the caret package can apply this transformation to a set of predictors at once.

library(caret)
sdat <- subset(sim.dat, select = c("age", "income"))
# set the 'method' option; this estimates the means and standard deviations
trans <- preProcess(sdat, method = c("center", "scale"))
# use predict() function to apply the transformation and get the final result
transformed <- predict(trans, sdat)

Now the two variables are on the same scale. You can check the result using summary(transformed). Note that the data still contain missing values: centering and scaling does not fill them in.
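
Beyond summary(), the check can be made explicit by computing the column means, standard deviations, and missing-value counts. The following is a minimal sketch under stated assumptions: the data frame is made up to stand in for the age and income columns of sim.dat, and transformed is produced with base R's scale() rather than preProcess(), so the snippet runs without caret installed.

```r
# Hypothetical toy data standing in for sim.dat[, c("age", "income")]
sdat <- data.frame(
  age    = c(25, 34, 47, NA, 52),
  income = c(52000, 61000, NA, 48000, 75000)
)

# Stand-in for predict(trans, sdat): center and scale each column with base R
transformed <- as.data.frame(scale(sdat))

# Column means should be numerically 0 and standard deviations 1
col_means <- colMeans(transformed, na.rm = TRUE)
col_sds <- sapply(transformed, sd, na.rm = TRUE)

# Missing values are carried through unchanged
na_counts <- colSums(is.na(transformed))

print(round(col_means, 10))
print(col_sds)
print(na_counts)
```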

References

Geladi, P., and B. Kowalski. 1986. “Partial Least Squares Regression: A Tutorial.” Analytica Chimica Acta 185: 1–17.

Jolliffe, I. T. 2002. Principal Component Analysis. 2nd ed. Springer.

Mulaik, S. A. 2009. Foundations of Factor Analysis. 2nd ed. Chapman & Hall/CRC.