1.2 What should data science do?

1.2.1 Let’s dream big

Here is my two points for the question:

  • Make human better human by alleviating bounded rationality and minimize politics/emotion (rather than make machine more like human)
  • Strive for the “democratization” of data as legally possible: empower everyone in the organization to acquire, process, and leverage data in a timely and efficient fashion

I know it is vague. Behold, I am going to explain more.

It’s easy to pretend that you are data driven. But if you get into the mindset to collect and measure everything you can, and think about what the data you’ve collected means, you’ll be ahead of most of the organizations that claim to be data driven. If you know the difference between “data driven” and “data confirmed”, you’ll be sailing at the right direction. What on earth is the difference?

Imagine that you are buying something online and you need to decide whether or not to trust the product without seeing it physically. You see the average rating is 4.1 out of 5.0. Is this a good score? It depends on your subconscious decision. If you really need the thing, you may happily cheer “It is more than 4.0!”. If you are still not sure whether you need it, you can’t help to check the few low rating reviews and tell yourself “look at those 1-star reviews”. Sounds familiar? Psychologists call it confirmation bias.

Confirmation bias is the tendency to search for, interpret, favor, and recall information in a way that confirms one’s preexisting beliefs or hypotheses [Wikipedia]

So if you use data to feel better (confirm) decisions/assumptions that are already made before you analyze the data, that is “data confirmed”. A clear sign of confirmation bias is when you go back to tinker the definition of your metic because the current result is not impressive. However, this bias is not always easy to see. It is not only misleading but also expensive. Because it could take data science team days of toil to boil everything down to that magic number and put the result on the report. Data scientists are not totally immune to the bias either. Good news is that there is antidote to confirmation bias.

Antidote 1: Do the brainstorming of data definition and set the goal in advance and resist temptation to move them later. In other words, the decision makers have to set decision criteria and the boundary up front in your data science project.

Antidote 2: Data democratization. Keep in mind that data isn’t just for the professionals or a small group of people in the organization that are “key decision makers”. Everyone should be able to get access to and look at the data (as much as legally possible). In that way, there will be more eyes on the decision.

The way data science can help is to provide a sound data framework and necessary training for the organization to access data with least amount of pain. Also be clear about the data definition and documentation. Data science holds the responsibility for data stewardship in the organization with high integrity. (there is data science for social good which is data science’s responsibility for outside the organization but we are not going to discuss that here)

That is still very abstract, I hear you. Now, Let’s be more specific…

1.2.2 What kind of questions can data science solve?

1.2.2.1 Prerequisites

Data science is not a panacea and there are problems data science can’t help. It is best to make a judgment as early in the analytical cycle as possible. Tell your clients honestly and clearly when you think data analytics can’t give the answer they want. What kind of questions can data science solve?

  1. Your question needs to be specific enough

Look at two examples:

  • Question 1: How can I increase product sales?
  • Question 2: Is the new promotional tool introduced at the beginning of this year boosting the annual sales of P1197 in Iowa and Wisconsin? (P1197 is an impressive corn seed product from DuPont Pioneer)

It is easy to see the difference between the two questions. Question 1 is a grammatically correct question, but it is proper for data analysis to answer. Why? It is too general. What is the response variable here? Product sales? Which product? Is it annual sales or monthly sales? What are the candidate predictors? You nearly can’t get any useful information from the questions. In contrast, question 2 is much more specific. From the analysis point of view, the response variable is clearly “annual sales of P1197 in Iowa and Wisconsin”. Even we don’t know all the predictors, but the variable of interest is “the new promotional tool introduced early this year.” We want to study the impact of the promotion of sales. You can start from there and move on to figure out other variables need to include in the model by further communication.

As a data scientist, you may start with something general and unspecific like question 1 and eventually get to question 2. Effective communication and in-depth domain knowledge about the business problem are essential to convert a general business question into a solvable analytical problem. Domain knowledge helps data scientist communicate with the language the other people can understand and obtain the required information.

However, defining the question and variables involved won’t guarantee that you can answer it. For example, I encountered this situation with a well-defined supply chain problem. My client asked me to estimate the stock needed for a product in a particular area. Why can’t this question be answered? I tried fitting a Multivariate Adaptive Regression Spline (MARS) model and thought I found a reasonable solution. But it turned out later that the data my client gave me was inaccurate. In this case, only estimates rather than actual values of past supply figures were available and there was no way to get accurate data. The lesson lends itself to the next point.

  1. You need to have sound and relevant data

One cannot make a silk purse out of a sow’s ear. Data scientists need data, sound and relevant data. The supply problem is a case in point. There was relevant data, but not sound. All the later analytics based on that data was a building on sand. Of course, data nearly almost have noise, but it has to be in a certain range. Generally speaking, the accuracy requirement for the independent variables of interest and response variable is higher than others. In question 2, it is data related to the “new promotion” and “sales of P1197”.

The data has to be helpful for the question. If you want to predict which product consumers are most likely to buy in the next three months, you need to have historical purchasing data: the last buying time, the amount of invoice, coupons and so on. Information about customers’ credit card number, ID number, the email address is not going to help.

Often the quality of the data is more important than the quantity, but the quantity cannot be overlooked. In the premise of guaranteeing quality, usually the more data, the better. If you have a specific and reasonable question, also sound and relevant data, then congratulations, you can start playing data science!

1.2.2.2 Problem type

Many of the data science books classify the various models from a technical point of view. Such as supervised vs. unsupervised models, linear vs. nonlinear models, parametric models vs. non-parametric models, and so on. Here we will continue on “problem-oriented” track. We first introduce different groups of real problems and then present which models can be used to answer the corresponding category of questions.

  1. Description

The basic analytic problem is to summarize and explore a data set with descriptive statistics (mean, standard deviation, and so forth) and visualization methods. It is the simplest problem and yet the most crucial and common one. You will need to describe and explore the dataset before moving on to more complex analysis. In the problem such as customer segmentation, after you cluster the sample, the next step is to figure out the profile of each class by comparing the descriptive statistics of the various variables. Questions of this kind are:

  • How does the annual income distribute?
  • Are there outliers?
  • What are the mean active days of different accounts?

Data description is often used to check data, find the appropriate data preprocessing method, and demonstrate the model results.

  1. Comparison

The first common problem is to compare different groups. Such as: Is A better in some way than B? Or more comparisons: Is there any difference among A, B, and C in a particular aspect? Here are some examples:

  • Are males more inclined to buy our products than females?
  • Are there any differences in customer satisfaction in different business districts?
  • Do soybean carrying a particular gene have higher oil content?

For those problems, it is usually to start exploring from the summary statistics and visualization by groups. After a preliminary visualization, you can test the differences between treatment and control group statistically. The commonly used statistical tests are chi-square test, t-test, and ANOVA. There are also methods using Bayesian methods. In biology industry, such as new drug development, crop breeding, mixed effect models are the dominant technique.

  1. Clustering

Clustering is a widespread problem, which is usually related to classification. Clustering answers questions like:

  • Which customers have similar product preference?
  • Which printer performs a similar pattern to the broken ones?
  • How many different themes are there in the corpus?

Note that clustering is unsupervised learning. The most common clustering algorithms include K-Means and Hierarchical Clustering.

  1. Classification

Usually, a labeled sample set is used as a training set to train the classifier. Then the classifier is used to predict the category of a future sample. Here are some example questions:

  • Who is more likely to buy our product?
  • Is the borrower going to pay back?
  • Is it spam?

There are hundreds of classifiers. In practice, we do not need to try all the models but several models that perform well generally.

  1. Regression

In general, regression deals with the problem of “how much is it?” and return a numerical answer. In some cases, it is necessary to coerce the model results to be 0, or round the result to the nearest integer. It is the most common problem.

  • What will be the temperature tomorrow?
  • What is the projected net income for the next season?
  • How much inventory should we have?