Interest in data science-related careers has surged over the last few years. Data scientists come from a variety of backgrounds and disciplines, making it difficult to provide a concise answer when asked what data science is all about. Data science is a widely discussed topic, yet few can accurately define it.
The media has been hyping “Data Science,” “Big Data,” and “Artificial Intelligence” over the past few years. An amusing statement circulates on the internet:
“When you’re fundraising, it’s AI. When you’re hiring, it’s ML. When you’re implementing, it’s logistic regression.”
For outsiders, data science is the magic that can extract useful information from data. Everyone is familiar with the concept of big data. Data science trainees must now possess the skills to manage large data sets. These skills may include Hadoop, a system that uses Map/Reduce to process large data sets distributed across a cluster of computers, or Spark, a system that builds on top of Hadoop and speeds up processing by loading massive data sets into memory (RAM) across the cluster, with an additional suite of machine learning functions for big data.
These new skills are essential for large-scale cluster computing and for handling data sets that exceed a single computer’s memory or hard disk. However, they are not necessary for deriving meaningful insights from data.
A lot of data means more sophisticated tinkering with computers, especially clusters of computers. The computing and programming skills needed to handle big data used to be the biggest hurdle for traditional analysis practitioners who wanted to become successful data scientists. However, this barrier has been significantly lowered thanks to the cloud computing revolution, as discussed in Chapter 2. After all, it isn’t the size of the data that’s important, but what you do with it. You may be feeling a mix of skepticism and confusion. We understand; we had the same reaction.
To declutter, let’s start with a brief history of data science. If you search on Google Trends, which shows search keyword information over time, you will see that the term “data science” was already in use in 2004, the earliest year the tool covers, so the term dates back further than that. Media coverage may give the impression that machine learning algorithms are a recent invention and that there was no “big” data before Google. However, this is not true. While there are new and exciting developments in data science, many of the techniques we use are based on decades of work by statisticians, computer scientists, mathematicians, and scientists from a variety of other fields.
In the early 19th century, Legendre and Gauss came up with the least squares method for linear regression. At the time, it was mainly used by physicists to fit their data. Nowadays, nearly anyone can build linear regression models using a spreadsheet with just a little bit of self-guided online training.
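As a rough illustration of what least squares does, here is a minimal sketch for the one-predictor case, using only the Python standard library. The function name `fit_least_squares` and the sample data are made up for this example; the formulas are the standard closed-form solution for a straight-line fit.

```python
def fit_least_squares(xs, ys):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical measurements, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
intercept, slope = fit_least_squares(xs, ys)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

A spreadsheet trendline or a statistics package performs the same calculation, usually generalized to several predictors.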
In 1936, Fisher came up with linear discriminant analysis. In the 1940s, logistic regression became a widely used model. Then, in the 1970s, Nelder and Wedderburn formulated the “generalized linear model (GLM)” which:
“generalized linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.” [from Wikipedia]
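In symbols, the quoted definition can be sketched as follows. The notation is not from the original text and is chosen here only for illustration:

$$
g(\mu_i) = \mathbf{x}_i^{\top}\boldsymbol{\beta}, \qquad \mu_i = \mathrm{E}[Y_i], \qquad \mathrm{Var}(Y_i) = \phi \, V(\mu_i)
$$

Here $g$ is the link function, $V$ is the variance function, and $\phi$ is a dispersion parameter. Ordinary linear regression is the special case with the identity link $g(\mu) = \mu$ and constant variance $V(\mu) = 1$. Logistic regression uses the logit link $g(\mu) = \log\{\mu / (1 - \mu)\}$ with $V(\mu) = \mu(1 - \mu)$.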
By the end of the 1970s, a variety of models existed, most of them linear due to the limited computing power available at the time. Non-linear models could not be fitted in practice until the 1980s.
In 1984, Breiman and his coauthors introduced the Classification and Regression Tree (CART), one of the oldest and most widely used classification and regression techniques (L. Breiman et al. 1984).
After that, Ross Quinlan developed tree algorithms such as ID3, C4.5, and C5.0. In the 1990s, ensemble techniques, which combine the predictions of many models, began to emerge. Bagging is a general approach that uses bootstrapping in conjunction with regression or classification models to construct an ensemble. Based on the ensemble idea, Breiman came up with the random forest model in 2001 (Leo Breiman 2001a). In the same year, Leo Breiman published a paper “Statistical Modeling: The Two Cultures” (Leo Breiman 2001b), in which he identified two cultures in the use of statistical modeling to extract information from data:
- The data are assumed to come from a given stochastic data model
- The data mechanism is treated as unknown, and people approach the data using an algorithmic model
Most of the classical statistical models belong to the first type, the stochastic data model. Black-box models, such as random forest, Gradient Boosting Machine (GBM), and deep learning, are algorithmic models. As Breiman pointed out, algorithmic models can be used on large, complex data as a more accurate and informative alternative to stochastic modeling on smaller datasets. These algorithms have developed rapidly, with much-expanded applications in fields outside of traditional statistics, which is one of the most important reasons why statisticians are not in the mainstream of today’s data science, both in theory and practice.
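To make the bagging idea mentioned earlier concrete, here is a minimal sketch in plain Python. It draws bootstrap resamples of the data, fits a simple least squares line to each one, and averages the predictions. The function names `fit_line` and `bagged_predict`, the default of 200 models, and the sample data are all assumptions made for this illustration. A straight line is used as the base model only to keep the code short; in practice bagging is usually paired with less stable learners such as trees, which is the setting random forest builds on.

```python
import random


def fit_line(xs, ys):
    """Least squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def bagged_predict(xs, ys, x_new, n_models=200, seed=0):
    """Average the predictions of models fitted to bootstrap resamples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(xs)
    predictions = []
    for _ in range(n_models):
        # Bootstrap: sample n row indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        boot_x = [xs[i] for i in idx]
        boot_y = [ys[i] for i in idx]
        if len(set(boot_x)) < 2:
            continue  # skip degenerate resamples where no line can be fitted
        intercept, slope = fit_line(boot_x, boot_y)
        predictions.append(intercept + slope * x_new)
    if not predictions:
        raise ValueError("no usable bootstrap resample was drawn")
    # Aggregate: for regression, the ensemble prediction is the mean.
    return sum(predictions) / len(predictions)


# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(bagged_predict(xs, ys, x_new=6.0))
```

For classification, the averaging step would typically be replaced by a majority vote over the fitted models.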
Python is overtaking R as the most popular language in data science, mainly due to the backgrounds of many data scientists. Since 2000, the approaches to getting information out of data have shifted from traditional statistical models to a more diverse toolbox that includes machine learning and deep learning models. To help readers who are traditional data practitioners, we provide both R and Python code.
What is the driving force behind the shifting trend? John Tukey identified four forces driving data analysis (there was no “data science” when this was written in 1962):
- The formal theories of math and statistics
- Acceleration of developments in computers and display devices
- The challenge, in many fields, of more and ever larger bodies of data
- The emphasis on quantification in an ever-wider variety of disciplines
Tukey’s 1962 list is surprisingly modern, even when viewed in today’s context. People often develop theories long before they find potential applications. Over the past 50 years, statisticians, mathematicians, and computer scientists have laid the theoretical groundwork for the construction of “data science” as we know it today.
The development of computers has enabled us to apply algorithmic models (which can be very computationally expensive) and deliver results in a friendly and intuitive way. The transition to the internet and the internet of things has generated vast amounts of commercial data. Industries have also recognized the value of exploiting this data. Data science seems sure to be a significant preoccupation of commercial life in the coming decades. All four forces Tukey identified exist today and have been driving data science.
The applications have been expanding fast, benefiting from the increasing availability of digitized information and the ability to distribute it through the internet. Today, people apply data science in a variety of fields, such as business, health, biology, social science, and politics. But what is data science today?