Introduction to Data Science
9.2 PCR and PLS