1.1 A brief history of data science
Interest in data science-related careers has exploded in the last few years and is at an all-time high. Data scientists today come from various backgrounds. If someone you ran into asked what data science is all about, what would you tell them? It is not an easy question to answer. Data science is one of those areas that everyone is talking about, but no one can define well.
The media has been hyping “Data Science,” “Big Data,” and “Artificial Intelligence” over the past few years. There is an amusing statement from the internet:
“When you’re fundraising, it’s AI. When you’re hiring, it’s ML. When you’re implementing, it’s logistic regression.”
For outsiders, data science is whatever magic gets useful information out of data. Almost everyone has heard of big data, and data science trainees now need the skills to cope with such large data sets. What are those skills? You may hear about Hadoop, a system that uses Map/Reduce to process large data sets distributed across a cluster of computers, or about Spark, a system that builds atop Hadoop to speed up the same processing by loading massive data sets into shared memory (RAM) across the cluster, with an additional suite of machine learning functions for big data. These new skills deal with the organizational artifacts of data sets too large for a single computer’s memory or hard disk, and with large-scale cluster computing; they do not by themselves solve the real problem any better. A lot of data means more sophisticated tinkering with computers, especially clusters of computers. The computing and programming skills needed to handle big data used to be the biggest hurdle for traditional analysis practitioners who wanted to become successful data scientists. However, the cloud computing revolution has significantly lowered this hurdle, as described in Chapter 2. After all, it isn’t the size of the data that’s important. It’s what you do with it.
Your first reaction to all of this might be some combination of skepticism and confusion. We want to say upfront that we had that exact reaction.
To declutter, let’s start with a brief history of data science. If you go to the Google Trends website, which shows search keyword information over time, and check the term “data science,” you will find that the history of data science goes back a little further than 2004. The way the media describes it, you may feel that machine learning algorithms are new and that there was never “big” data before Google. That is not true. There are new and exciting developments in data science, but many of the techniques we use are based on decades of work by statisticians, computer scientists, mathematicians, and scientists in many other fields.
In the early 19th century, when Legendre and Gauss came up with the least-squares method for linear regression, probably only physicists would use it to fit their data. Now, nearly anyone can build a linear regression in Excel with just a little self-guided online training. In 1936, Fisher came up with linear discriminant analysis. In the 1940s, we had another widely used model: logistic regression. In the 1970s, Nelder and Wedderburn formulated the “generalized linear model (GLM)” which:
“generalized linear regression by allowing the linear model to be related to the response variable via a link function and by allowing the magnitude of the variance of each measurement to be a function of its predicted value.” [from Wikipedia]
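To make the quoted definition concrete, here is a minimal Python sketch that uses logistic regression as the GLM. It assumes a Bernoulli response with the logit link; the function names and the coefficients are hypothetical, chosen only for illustration, and nothing is fitted to data.

```python
import math


def logit(mu):
    """Link function g: maps a mean in (0, 1) to the real line."""
    return math.log(mu / (1.0 - mu))


def inv_logit(eta):
    """Inverse link: maps a linear predictor back to a mean in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


def bernoulli_variance(mu):
    """Variance function V(mu) for a Bernoulli response."""
    return mu * (1.0 - mu)


# Hypothetical coefficients, not estimated from any data set.
intercept, slope = -1.0, 0.5

for x in (0.0, 2.0, 4.0):
    eta = intercept + slope * x  # linear predictor, any real number
    mu = inv_logit(eta)          # predicted mean, always in (0, 1)
    print(x, round(mu, 3), round(bernoulli_variance(mu), 3))
```

The linear predictor can take any real value, while the link keeps the predicted mean between 0 and 1, and the variance changes with the predicted mean instead of staying constant as in ordinary linear regression.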
By the end of the 1970s, there was a range of analytical models, and most of them were linear because computers were not powerful enough to fit non-linear models until the 1980s.
In 1984, Breiman and colleagues introduced the classification and regression tree (CART), one of the oldest and most utilized classification and regression techniques (Breiman et al. 1984).
After that, Ross Quinlan came up with more tree algorithms such as ID3, C4.5, and C5.0. In the 1990s, ensemble techniques (methods that combine many models’ predictions) began to appear. Bagging is a general approach that uses bootstrapping in conjunction with a regression or classification model to construct an ensemble. Based on the ensemble idea, Breiman came up with the random forest model in 2001 (Breiman 2001a). In the same year, he published the paper “Statistical Modeling: The Two Cultures” (Breiman 2001b), in which he pointed out two cultures in the use of statistical modeling to get information from data:
- Data modeling: the data are assumed to come from a given stochastic data model
- Algorithmic modeling: the data mechanism is treated as unknown, and people approach the data with algorithmic models
Most classical statistical models belong to the first type, stochastic data models. Black-box models, such as random forest, GBM, and deep learning, are algorithmic models. As Breiman pointed out, algorithmic models can be used on large, complex data as a more accurate and informative alternative to stochastic data modeling on smaller data sets. Those algorithms have developed rapidly, with much-expanded applications in fields outside traditional statistics. That is one of the most important reasons that statisticians are not in the mainstream of today’s data science, both in theory and practice. We observe that Python is passing R as the most commonly used language in data science, mainly because of many data scientists’ backgrounds. Since 2000, the approaches to getting information out of data have shifted from traditional statistical models to a more diverse toolbox that includes machine learning and deep learning models. To help readers who are traditional data practitioners, we provide both R and Python code.
What is the driving force behind the shifting trend? In 1962, John Tukey identified four forces driving data analysis (there was no “data science” back then):
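As a small illustration of the ensemble idea, the sketch below implements bagging in plain Python: it draws bootstrap samples, fits a one-split regression tree (a “stump”) to each, and averages the predictions. The stump learner, the toy step-shaped data, and all function names are assumptions made for this example. It is not the random forest algorithm, which additionally samples the features considered at each split.

```python
import random


def fit_stump(xs, ys):
    """Fit a one-split regression tree by minimizing squared error.

    Returns (threshold, left_mean, right_mean).
    """
    pairs = sorted(zip(xs, ys))
    best = None  # (sse, threshold, left_mean, right_mean)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm = sum(left) / len(left)
        rm = sum(right) / len(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    if best is None:  # every x in the sample is identical: predict the mean
        m = sum(ys) / len(ys)
        return (float("inf"), m, m)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean


def bagged_predict(xs, ys, x_new, n_models=25, seed=0):
    """Average the predictions of stumps fitted to bootstrap samples."""
    rng = random.Random(seed)
    n = len(xs)
    preds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stump = fit_stump([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(predict_stump(stump, x_new))
    return sum(preds) / n_models


# Toy data: a step from 0 to 10 at x = 5.
xs = [float(x) for x in range(10)]
ys = [0.0 if x < 5 else 10.0 for x in xs]

print(bagged_predict(xs, ys, 1.0), bagged_predict(xs, ys, 8.0))
```

Each bootstrap sample places the split at a slightly different point, so the averaged prediction is smoother and less sensitive to any single sample than one stump alone.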
- The formal theories of math and statistics
- Acceleration of developments in computers and display devices
- The challenge, in many fields, of more and ever larger bodies of data
- The emphasis on quantification in an ever-wider variety of disciplines
Tukey’s 1962 list is surprisingly modern. Let’s inspect those points in today’s context. People usually develop theories well before they find potential applications. In the past 50 years, statisticians, mathematicians, and computer scientists have laid the theoretical groundwork for today’s “data science.” The development of computers enables us to apply algorithmic models (which can be very computationally expensive) and deliver results in a friendly and intuitive way. The striking transition to the internet and the internet of things generates vast amounts of commercial data, and industries have sensed the value of exploiting that data. Data science seems sure to be a significant preoccupation of commercial life in the coming decades. All four forces Tukey identified exist today and have been driving data science.
The toolbox and its applications have been expanding fast, benefiting from the increasing availability of digitized information and the possibility of distributing it through the internet. Today, people apply data science in many areas, including business, health, biology, social science, and politics. Data science is now everywhere. But what is today’s data science?
Breiman, Leo. 2001a. “Random Forests.” Machine Learning 45: 5–32.
Breiman, Leo. 2001b. “Statistical Modeling: The Two Cultures.” Statistical Science 16 (3): 199–231.
Breiman, L., J. H. Friedman, R. A. Olshen, and C. J. Stone. 1984. Classification and Regression Trees. CRC. ISBN 978-0412048418.