Preface
Goal of the Book
What This Book Covers
Who This Book Is For
How to Use This Book
What the Book Assumes
How to Run R and Python Code
Complementary Reading
About the Authors
1
Introduction
1.1
A Brief History of Data Science
1.2
Data Science Role and Skill Tracks
1.2.1
Engineering
1.2.2
Analysis
1.2.3
Modeling/Inference
1.3
What Kind of Questions Can Data Science Solve?
1.3.1
Prerequisites
1.3.2
Problem Type
1.4
Structure of Data Science Team
1.5
Data Science Roles
2
Soft Skills for Data Scientists
2.1
Comparison between Statistician and Data Scientist
2.2
Beyond Data and Analytics
2.3
Three Pillars of Knowledge
2.4
Data Science Project Cycle
2.4.1
Types of Data Science Projects
2.4.2
Problem Formulation and Project Planning Stage
2.4.3
Project Modeling Stage
2.4.4
Model Implementation and Post Production Stage
2.4.5
Project Cycle Summary
2.5
Common Mistakes in Data Science
2.5.1
Problem Formulation Stage
2.5.2
Project Planning Stage
2.5.3
Project Modeling Stage
2.5.4
Model Implementation and Post Production Stage
2.5.5
Summary of Common Mistakes
3
Introduction to the Data
3.1
Customer Data for a Clothing Company
3.2
Swine Disease Breakout Data
3.3
MNIST Dataset
3.4
IMDB Dataset
4
Big Data Cloud Platform
4.1
Power of Cluster of Computers
4.2
Evolution of Cluster Computing
4.2.1
Hadoop
4.2.2
Spark
4.3
Introduction of Cloud Environment
4.3.1
Open Account and Create a Cluster
4.3.2
R Notebook
4.3.3
Markdown Cells
4.4
Leverage Spark Using R Notebook
4.5
Databases and SQL
4.5.1
History
4.5.2
Database, Table and View
4.5.3
Basic SQL Statement
4.5.4
Advanced Topics in Database
5
Data Pre-processing
5.1
Data Cleaning
5.2
Missing Values
5.2.1
Impute missing values with median/mode
5.2.2
K-nearest neighbors
5.2.3
Bagging Tree
5.3
Centering and Scaling
5.4
Resolve Skewness
5.5
Resolve Outliers
5.6
Collinearity
5.7
Sparse Variables
5.8
Re-encode Dummy Variables
6
Data Wrangling
6.1
Summarize Data
6.1.1
dplyr package
6.1.2
apply(), lapply() and sapply() in base R
6.2
Tidy and Reshape Data
7
Model Tuning Strategy
7.1
Variance-Bias Trade-Off
7.2
Data Splitting and Resampling
7.2.1
Data Splitting
7.2.2
Resampling
8
Measuring Performance
8.1
Regression Model Performance
8.2
Classification Model Performance
8.2.1
Confusion Matrix
8.2.2
Kappa Statistic
8.2.3
ROC
8.2.4
Gain and Lift Charts
9
Regression Models
9.1
Ordinary Least Square
9.1.1
The Magic P-value
9.1.2
Diagnostics for Linear Regression
9.2
Principal Component Regression and Partial Least Square
10
Regularization Methods
10.1
Ridge Regression
10.2
LASSO
10.3
Elastic Net
10.4
Penalized Generalized Linear Model
10.4.1
Introduction to glmnet package
10.4.2
Penalized logistic regression
11
Tree-Based Methods
11.1
Tree Basics
11.2
Splitting Criteria
11.2.1
Gini impurity
11.2.2
Information Gain (IG)
11.2.3
Information Gain Ratio (IGR)
11.2.4
Sum of Squared Error (SSE)
11.3
Tree Pruning
11.4
Regression and Decision Tree Basic
11.4.1
Regression Tree
11.4.2
Decision Tree
11.5
Bagging Tree
11.6
Random Forest
11.7
Gradient Boosted Machine
11.7.1
Adaptive Boosting
11.7.2
Stochastic Gradient Boosting
12
Deep Learning
12.1
Feedforward Neural Network
12.1.1
Logistic Regression as Neural Network
12.1.2
Stochastic Gradient Descent
12.1.3
Deep Neural Network
12.1.4
Activation Function
12.1.5
Optimization
12.1.6
Deal with Overfitting
12.1.7
Image Recognition Using FFNN
12.2
Convolutional Neural Network
12.2.1
Convolution Layer
12.2.2
Padding Layer
12.2.3
Pooling Layer
12.2.4
Convolution Over Volume
12.2.5
Image Recognition Using CNN
12.3
Recurrent Neural Network
12.3.1
RNN Model
12.3.2
Long Short Term Memory
12.3.3
Word Embedding
12.3.4
Sentiment Analysis Using RNN
Appendix
13
Handling Large Local Data
13.1
readr
13.2
data.table: enhanced data.frame
14
R code for data simulation
14.1
Customer Data for Clothing Company
14.2
Swine Disease Breakout Data
References
Practitioner’s Guide to Data Science
Chapter 14
R code for data simulation