2.5 Common Mistakes in Data Science
Data science projects can go wrong at different stages and in many ways. Most textbooks and online blogs focus on technical mistakes in machine learning models, algorithms, or theory, such as mishandling outliers and overfitting. It is important to avoid these technical mistakes. However, there are common systematic mistakes across data science projects that are rarely discussed in textbooks, and summarizing them requires experience with real-world projects. In this section, we describe these common mistakes in detail so that readers can proactively identify and avoid them in their own data science projects.
2.5.1 Problem Formulation Stage
The most challenging part of a data science project is problem formulation. A data science project stems from the pain points of the business. The initial statement of the project goal is often vague, with little quantification, or reflects the gut feeling of the leadership team. Multiple teams are often involved at this stage, and they may hold different views, so it is easy to have misalignment across teams on resource allocation, milestone deliverables, and timeline. At the problem formulation stage, data science team members with technical backgrounds are sometimes not even invited to the initial discussion. It sounds ridiculous, but it is sadly true that a lot of resources are spent on solving the wrong problem, the number one systematic mistake in data science. Formulating a business problem into the right data science project requires an in-depth understanding of the business context, data availability and quality, computation infrastructure, and the methodology to leverage the data to quantify business value.
We have seen people overpromise on business value all the time, another common mistake that dooms the project from the beginning. With the hype around big data and machine learning, leaders across industries often have unrealistically high expectations of data science. This is especially true during enterprise transformation, when there is a strong push to adopt new technology to extract value from data. These unrealistic expectations rest on assumptions that are far off the mark, made without checking data availability, data quality, computation resources, or current best practices in the field. Even when the data science team has done some exploratory analysis at the problem formulation stage, project leaders sometimes ignore their data-driven voice.
These two systematic mistakes undermine the organization's data science strategy. The higher the expectation, the bigger the disappointment when the project cannot deliver business value. Data and business context are essential to formulating the business problem and setting an achievable goal for business value. It helps to have a strong data science leader with a broad technical background and to let data scientists coordinate and drive the problem formulation, setting realistic goals based on data and business context.
2.5.2 Problem Planning Stage
Now suppose the data science project is formulated correctly, with reasonable expectations of the business value. The next step is to plan the project: allocating resources, setting up milestones and timelines, and defining deliverables. In most cases, project managers coordinate the different teams involved in the project and use agile project management tools similar to those in software development. Unfortunately, the project management team may not have experience with data science projects and hence fail to account for the uncertainties at the planning stage. The fundamental difference between data science projects and traditional software projects leads to another common mistake: being too optimistic about the timeline. For example, data exploration and preparation may take 60% to 80% of the total time of a given data science project, but people often do not realize that.
When lots of data have already been collected across the organization, people assume there is enough data for everything. This leads to the mistake of being too optimistic about data availability and quality. What you need is not "big data," but data that can help you solve the problem. The data available may be of low quality, and you may need to put in substantial effort to clean it before you can use it. There is "unexpected" work involved in bringing the right and relevant data to a specific data science project, and to ensure smooth delivery you need to account for this "unexpected" work at the planning stage. We all know that data pre-processing and feature engineering are usually the most time-consuming parts of a data science project. However, people outside data science are often not aware of this, and we need to educate other team members and the leadership team.
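As a minimal illustration of this "unexpected" cleaning work, the sketch below profiles a small, hypothetical raw table for missing values, duplicate records, and implausible entries before any modeling; the column names and value ranges are made up for illustration:

```python
import pandas as pd
import numpy as np

# Hypothetical raw data with common quality issues
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3, 4],
    "age": [34, np.nan, np.nan, 29, 120],      # missing and implausible values
    "income": [52000, 61000, 61000, np.nan, 48000],
})

# A quick quality profile: missing rates, duplicate ids, out-of-range values
missing_rate = raw.isna().mean()
n_duplicates = int(raw.duplicated(subset="customer_id").sum())
n_bad_age = int(((raw["age"] < 0) | (raw["age"] > 100)).sum())

print(missing_rate)
print(f"duplicated ids: {n_duplicates}, implausible ages: {n_bad_age}")
```

Profiles like this, run early, turn "unexpected" cleaning work into line items that can be budgeted for in the project plan.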
2.5.3 Modeling Stage
Finally, you start to look at the data and fit some models. One common mistake at this stage is using unrepresentative data. A model trained on historical data may not generalize to the future, and there is always some degree of bias or unrepresentativeness in the data. As data scientists, we need to use data that is as close as possible to the situation where the model will be applied and quantify the impact of the model output in production. Another mistake at this stage is overfitting and an obsession with complicated models. We can now easily get hundreds or even thousands of features, and machine learning models are getting more complicated. With open source libraries, people can try all kinds of models, and they are sometimes obsessed with complicated models instead of choosing the simplest among a set of comparable models.
Because the data used to build the models is always biased or unrepresentative to some extent, simpler models tend to generalize better and have a higher chance of providing consistent business value once the model passes testing and is implemented in the production environment. It is possible that you cannot solve the business problem with the existing data and methods. You can try to collect more data, do more feature engineering, or create your own models. However, if there is a fundamental gap between the data and the business problem, the data scientist has to make the tough call to unplug the project. On the other hand, data science projects usually have high visibility and may be initiated by senior leadership. Even when the data science team provides sufficient evidence that the project cannot deliver the expected business value, people may not want to stop it, which leads to another common mistake at the modeling stage: taking too long to fail. The earlier we can stop a failing project, the better, because we can redirect valuable resources to more promising projects. A long data science project that is doomed to fail damages the data science strategy and hurts everyone involved.
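To make the overfitting point concrete, here is a sketch comparing the train-test accuracy gap of a shallow decision tree against a fully grown one on synthetic data with injected label noise; the dataset, depths, and library (scikit-learn) are illustrative choices, not a prescription:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% label noise: a model that fits the training
# set perfectly is memorizing noise rather than learning signal
X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

gaps = {}
for depth in [3, None]:  # a shallow tree vs. a fully grown one
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    gaps[depth] = tree.score(X_tr, y_tr) - tree.score(X_te, y_te)
    print(f"max_depth={depth}: train-test accuracy gap = {gaps[depth]:.3f}")
```

The fully grown tree scores nearly perfectly on the training set but shows a much larger gap on held-out data, which is exactly the generalization risk the simpler model avoids.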
2.5.4 Production Stage
Now suppose you have found a model that works well on the training and testing data. If it is an online application, you are only halfway there. The next step is to put the model in production, which may sound like alien work to a data scientist. It is true that the data engineering team can help with model production. However, as a data scientist, you need to know the potential mistakes at this stage. One big mistake is missing A/B testing and assuming that the model performance observed during training and testing stays the same in the production environment. Unfortunately, a model trained and evaluated on historical data almost never performs the same in production: the data used in offline training may be significantly different from the online data, and the business context may have changed. Whenever possible, machine learning models in production should go through A/B testing to evaluate their performance.
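As a sketch of how such an online comparison might be evaluated, the example below runs a two-proportion z-test on hypothetical conversion counts for a control group (current model) and a treatment group (new model); the numbers are made up, and the test choice is one common option, not the only one:

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical A/B results: conversions out of users exposed to the
# current model (control) vs. the new model (treatment)
conversions = np.array([120, 168])   # control, treatment
exposures = np.array([2400, 2400])

stat, p_value = proportions_ztest(conversions, exposures)
rates = conversions / exposures
print(f"control {rates[0]:.1%} vs treatment {rates[1]:.1%}, p = {p_value:.4f}")
```

A small p-value here would indicate the observed difference in conversion rates is unlikely to be noise, which is the evidence an offline accuracy number alone cannot provide.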
In the model training stage, people usually focus on model performance metrics, such as accuracy, without paying much attention to execution time. When a model runs online in real time, the total run time for each instance (i.e., model latency) should not degrade the user experience: nobody wants to wait even one second to see results after clicking the "Search" button. In the production stage, feature availability is also crucial for running a real-time model, and engineering resources are essential for model production. However, in traditional companies, it is common for a data science project to fail to scale in real-time applications due to a lack of computation capacity, engineering resources, or a non-tech culture and environment.
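One way to sanity-check latency before deployment is to time repeated calls to the scoring function and look at the tail of the distribution, since worst-case latency is what users notice; `predict` below is a trivial stand-in for a real model:

```python
import time
import statistics

def predict(x):
    # Trivial stand-in for a real model's scoring function
    return sum(x) > 0

# Time many individual calls and report tail latency (p99), not the mean
latencies_ms = []
for _ in range(1000):
    start = time.perf_counter()
    predict([0.1, -0.2, 0.3])
    latencies_ms.append((time.perf_counter() - start) * 1000)

p99 = statistics.quantiles(latencies_ms, n=100)[-1]
print(f"p99 latency: {p99:.4f} ms")
```

Reporting a tail percentile rather than the average matters because a model that is fast on average but occasionally slow still produces a poor user experience for those requests.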
As the business problem evolves rapidly, the data and model in the production environment need to change accordingly, or the performance of the model deteriorates over time. The online production environment is also more complicated than model training and testing: for example, online features pulled from different sources may be missing at a specific time, the model may time out, and all kinds of software and data exceptions may occur. We need regular checkups during the entire life cycle of the model, from implementation to retirement. Unfortunately, people often do not set up a monitoring system for data science projects, which is another common mistake: missing necessary online checkup. It is essential to set up a monitoring dashboard and automatic alarms, and to create model tuning, retraining, and retirement plans.
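A minimal sketch of one such checkup, assuming we monitor a single input feature for drift using the Population Stability Index (PSI), a common drift metric where values above roughly 0.1 to 0.25 are often used as alarm thresholds; the data here is simulated:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline and a live sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    # Small constant avoids log(0) for empty bins
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected) + 1e-6
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual) + 1e-6
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 10_000)   # feature distribution at training time
stable = rng.normal(0, 1, 10_000)     # production sample, no drift
shifted = rng.normal(0.5, 1, 10_000)  # production sample, mean has shifted

print(f"no drift: PSI = {psi(baseline, stable):.3f}")
print(f"drifted:  PSI = {psi(baseline, shifted):.3f}")
```

Wiring a check like this to an automatic alarm is one concrete way to implement the "regular checkup" described above, triggering retraining or investigation when the live feature distribution departs from the training baseline.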
A data science project is a combination of art and engineering, and it can fail in different ways. However, if we put data and business context at the center of the project, get familiar with the data science project cycle, and proactively identify and avoid these potential mistakes, a data science project can provide significant business value. Here is a summary of the mistakes:
- Solving the wrong problem
- Overpromising on business value
- Being too optimistic about the timeline
- Being too optimistic about data availability and quality
- Using unrepresentative data
- Overfitting and obsession with complicated models
- Taking too long to fail
- Missing A/B testing
- Failing to scale in real-time applications
- Missing necessary online checkup