1.3 What kind of questions can data science solve?

1.3.1 Prerequisites

Data science is not a panacea, and there are problems data science can’t help with. It is best to make that judgment as early in the analytical cycle as possible. First and foremost, after carefully evaluating the request, data availability, computation resources, and modeling details, we need to tell customers, clients, and stakeholders honestly and clearly when we think data analytics can’t answer their question. Often, we can tell them what we can do as an alternative. It is essential to “negotiate” with others what data science can do specifically; simply answering “we cannot do what you want” will end the collaboration. Now let’s see what kind of questions data science can solve:

  1. The question needs to be specific enough

Let us look at the two examples below:

  • Question 1: How can I increase product sales?
  • Question 2: Is the new promotional tool introduced at the beginning of this year boosting the annual sales of P1197 in Iowa and Wisconsin? (P1197 is an impressive corn seed product from DuPont Pioneer)

It is easy to see the difference between the two questions. Question 1 is grammatically correct, but it is not suitable for data analysis. Why? It is too general. What is the response variable here? Product sales? Which product? Is it annual sales or monthly sales? What are the candidate predictors? We can get almost no useful information from the question.

In contrast, question 2 is much more specific. From the analysis point of view, the response variable is clearly “annual sales of P1197 in Iowa and Wisconsin”. Even though we don’t know all the predictors, the variable of interest is “the new promotional tool introduced early this year.” We want to study the impact of the promotion on sales. We can start there and figure out which other variables need to be included in the model through further communication.

As data scientists, we may start with general questions from customers, clients, or stakeholders and eventually arrive at more specific questions that data science can solve through a series of communication, evaluation, and negotiation. Effective communication and in-depth knowledge about the business problem are essential to converting a general business question into a solvable analytical problem. Domain knowledge helps data scientists communicate in a language other people can understand and obtain the required information.

However, defining the question and the variables involved won’t guarantee that we can answer it. For example, we could encounter this situation with a well-defined supply chain problem. The client may ask us to estimate the stock needed for a product in a particular area. Why might this question be unanswerable? We can try to fit various models, such as a Multivariate Adaptive Regression Spline (MARS) model, and find a reasonable solution from a modeling perspective. But it can turn out later that the client’s data are estimated values, not actual observations. With inaccurate data, there is no good way for data science to solve the problem with the desired accuracy.

  2. You need to have sound and relevant data

One cannot make a silk purse out of a sow’s ear. Data scientists need data, and the data must be sound and relevant. The supply problem mentioned above is a case in point: the data were relevant but not sound, so all the later analytics based on them were built on sand. Of course, data almost always contain noise, but the noise has to be within a certain range. Generally speaking, the accuracy requirement for the independent variables of interest and the response variable is higher than for other variables. For question 2 above, these are the variables related to the “new promotion” and “sales of P1197”.

The data has to be helpful for the question. If we want to predict which product consumers are most likely to buy in the next three months, we need to have historical purchasing data: the last purchase time, the invoice amount, coupons used, etc. Information about customers’ credit card numbers, ID numbers, and email addresses will not help much.

Often, data quality is more important than quantity, but you cannot overlook quantity. If you can guarantee the quality, then usually the more data, the better. If we have a specific and reasonable question with sound and relevant data, then congratulations, we can start doing data science!

1.3.2 Problem type

Many data science books classify models from a technical point of view, such as supervised vs. unsupervised models, linear vs. nonlinear models, parametric vs. non-parametric models, and so on. Here we will continue on a “problem-oriented” track. We first introduce different groups of real-world problems and then present which models can answer the corresponding category of questions.

  1. Description

The most basic analytic problem is to summarize and explore a data set with descriptive statistics (mean, standard deviation, and so forth) and visualization methods. It is the most straightforward problem and yet the most crucial and common one. We will need to describe and explore the dataset before moving on to a more complex analysis. For problems such as customer segmentation, after we cluster the sample, the next step is to figure out each cluster’s profile by comparing the descriptive statistics of various variables. Questions of this kind are:

  • What is the annual income distribution?
  • Are there any outliers?
  • What are the mean active days of different accounts?

Data description is often used to check data, find the appropriate data preprocessing method, and demonstrate the model results.
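As a minimal sketch of the description questions above, the following Python snippet summarizes a small sample and flags outliers with the 1.5 × IQR rule of thumb. The income values and the helper names `describe` and `iqr_outliers` are hypothetical and serve only as an illustration; the outlier rule is one common convention, not the only choice.

```python
import statistics

# Hypothetical annual incomes (in thousands of dollars); illustration only.
incomes = [42, 48, 51, 55, 58, 60, 63, 67, 72, 250]


def describe(values):
    """Return basic descriptive statistics for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


summary = describe(incomes)
outliers = iqr_outliers(incomes)
print(summary)
print("Outliers:", outliers)
```

In this made-up sample, the single very large income pulls the mean well above the median, which is exactly the kind of pattern a descriptive check is meant to surface before modeling.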

  2. Comparison

The first common modeling problem is to compare different groups. Is A better in some way than B? Or, with more than two groups: Is there any difference among A, B, and C in a particular aspect? Here are some examples:

  • Are males more inclined to buy our products than females?
  • Are there any differences in customer satisfaction in different business districts?
  • Do soybeans carrying a particular gene have higher oil content?

Analysis of such problems usually starts with summary statistics and visualization by group. After a preliminary visualization, you can statistically test the differences between groups, for example between treatment and control groups. Commonly used statistical tests include the chi-square test, t-test, and ANOVA. There are also Bayesian approaches. In biology-related industries such as new drug development and crop breeding, fixed, random, and mixed effect models are standard techniques.
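To make the idea of a group comparison concrete, here is a small sketch of a Pearson chi-square test of independence on a 2×2 table, written with the Python standard library only. The purchase counts and the function name `chi_square_2x2` are hypothetical. The sketch omits the continuity correction and relies on the fact that, with one degree of freedom, the upper tail probability equals `erfc(sqrt(x / 2))`.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    table = [[a, b], [c, d]]; returns (statistic, p_value) for 1 degree
    of freedom, without continuity correction.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical counts: rows = male, female; columns = bought, did not buy.
counts = [[30, 70], [45, 55]]
stat, p_value = chi_square_2x2(counts)
print(f"chi-square = {stat:.2f}, p-value = {p_value:.4f}")
```

With these made-up counts the statistic is 4.8 and the p-value is roughly 0.03, so at the conventional 5% level we would conclude that purchase rates differ between the two groups. In real projects, a statistics library would normally be used instead of a hand-written test.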

  3. Clustering

Clustering is a widespread problem, and it can answer questions like:

  • How many reasonable customer segments are there based on historical purchase patterns?
  • How are the customer segments different from each other?

Please note that clustering is unsupervised learning; there are no response variables. The most common clustering algorithms include K-Means and Hierarchical Clustering.
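The following is a bare-bones sketch of K-Means (Lloyd's algorithm) in plain Python, meant only to show the two alternating steps: assign each point to its nearest centroid, then move each centroid to the mean of its members. The customer data, the choice of starting centroids, and the function name `kmeans` are assumptions made for illustration; production work would normally rely on a tested library implementation with better initialization.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm for K-Means on a list of numeric tuples."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centroids.append(old)  # leave an empty cluster in place
        if new_centroids == centroids:
            break  # no centroid moved, so the assignment is stable
        centroids = new_centroids
    return labels, centroids


# Hypothetical customers: (orders per year, average basket value in dollars).
customers = [(2, 20), (3, 25), (2, 22), (20, 90), (22, 95), (21, 88)]
labels, centers = kmeans(customers, centroids=[customers[0], customers[3]])
print(labels)
print(centers)
```

Note that no response variable appears anywhere in the code: the segments come purely from distances between observations. Once the labels are available, each segment can be profiled with the descriptive statistics discussed earlier.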

  4. Classification

For classification problems, there are one or more label columns to define the ground truth of classes. We use other features of the training dataset as explanatory variables for model training. We can use the trained classifier to predict the labels of a new observation. Here are some example questions:

  • Is this customer likely to buy our product?
  • Is the borrower going to pay us back?
  • Is it spam email or not?

There are hundreds of different classifiers. In practice, we do not need to try all of them, only several that generally perform well. For example, the random forest algorithm is often used as the baseline model to set model performance expectations.
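The train-then-predict workflow described above can be sketched with a very simple classifier. The example below uses a k-nearest-neighbors vote written in plain Python as a stand-in; it is not the random forest mentioned in the text, which would normally come from a machine learning library. The email features, labels, and the function name `knn_predict` are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    nearest = sorted(
        zip(train_X, train_y), key=lambda pair: math.dist(pair[0], x)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical features per email: (number of links, number of exclamation marks).
train_X = [(0, 0), (1, 0), (0, 1), (1, 1), (6, 5), (7, 4), (8, 6), (5, 7)]
train_y = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "spam"]

# Held-out observations that the classifier has not seen during training.
test_X = [(1, 2), (6, 6)]
test_y = ["ham", "spam"]

predictions = [knn_predict(train_X, train_y, x) for x in test_X]
accuracy = sum(p == t for p, t in zip(predictions, test_y)) / len(test_y)
print(predictions, accuracy)
```

The label column (`train_y`) plays the role of the ground truth, the remaining columns are the explanatory variables, and accuracy on held-out data is one simple way to compare candidate classifiers against a baseline.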

  5. Regression

In general, regression deals with a question like “how much is it?” and returns a numerical answer. In some cases, it is necessary to post-process the model results, for example by coercing a prediction to 0 when a negative value is not meaningful or by rounding it to the nearest integer. Regression is still the most common problem in the data science world. Example questions are:

  • What will be the temperature tomorrow?
  • What is the projected net income for the next season?
  • How much inventory should we have?
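As a minimal sketch of a regression question, the snippet below fits a straight line by ordinary least squares and then post-processes the prediction into a usable count by clipping negative values to 0 and rounding. The temperature and sales figures, and the function names `fit_line` and `predict_units`, are hypothetical; a single-predictor line is only the simplest possible regression model.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_units(intercept, slope, x):
    """Predict a count: clip negative predictions to 0, then round to an integer."""
    raw = intercept + slope * x
    return max(0, round(raw))


# Hypothetical data: daily temperature (degrees Celsius) vs. units sold.
temperature = [5, 10, 15, 20, 25]
units_sold = [2, 11, 22, 29, 41]

intercept, slope = fit_line(temperature, units_sold)
cold_day = predict_units(intercept, slope, 0)   # raw prediction is negative
warm_day = predict_units(intercept, slope, 30)
print(intercept, slope, cold_day, warm_day)
```

For a temperature of 0 the fitted line returns a negative number of units, which has no business meaning, so the helper reports 0 instead. This is the kind of coercion the paragraph above refers to.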

  6. Optimization

Optimization is another common type of problem in data science: finding an optimal solution by tuning a few controllable variables, given other environmental variables that we cannot control. It is an extension of the comparison problem and can answer questions such as:

  • What is the best route to deliver the packages?
  • What is the optimal advertisement strategy to promote a new product?
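For the delivery route question, a toy version can be solved by exhaustive search: enumerate every visiting order and keep the shortest round trip. The sketch below assumes straight-line distances and a handful of stops; the coordinates and the function names `route_length` and `best_route` are hypothetical. Because the number of orders grows factorially with the number of stops, realistic routing problems need heuristics or dedicated solvers rather than this brute-force approach.

```python
import itertools
import math


def route_length(depot, stops):
    """Total length of the round trip depot -> stops in order -> depot."""
    path = [depot, *stops, depot]
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def best_route(depot, stops):
    """Try every visiting order and keep the shortest round trip."""
    return min(
        itertools.permutations(stops),
        key=lambda order: route_length(depot, order),
    )


# Hypothetical depot and delivery addresses on a flat grid (units are arbitrary).
depot = (0, 0)
stops = [(0, 3), (4, 3), (4, 0)]

route = best_route(depot, stops)
length = route_length(depot, route)
print(route, length)
```

Here the visiting order is the controllable variable and the stop locations are the fixed environment. The search compares all candidate routes, which is why optimization can be seen as an extension of the comparison problem.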