2.5 Common Mistakes in Data Science
Data science projects can go wrong at different stages and in many ways. Most textbooks and online blogs focus on technical mistakes related to machine learning models, algorithms, or theories, such as outlier detection and overfitting. It is important to avoid these technical mistakes. However, there are also systematic mistakes that recur across data science projects and are rarely discussed in textbooks. In this section, we describe these mistakes in detail so that readers can proactively identify and avoid them in their own projects.
2.5.1 Problem Formulation Stage
The most challenging part of a data science project is problem formulation. A data science project stems from business pain points. The initial version of the project goal is often vague, with little quantification, or reflects the leadership team's gut feeling. Multiple teams are usually involved at this stage, and they bring different views, so it is easy to have misalignment across teams on resource allocation, milestone deliverables, and timelines. Data science team members with technical backgrounds are sometimes not even invited to the initial discussion at the problem formulation stage. It sounds ridiculous, but it is sadly true that a lot of resources are spent on solving the wrong problem, the number one systematic mistake in data science. Formulating a business problem into the right data science project requires an in-depth understanding of the business context, data availability and quality, computation infrastructure, and the methodology to leverage the data to quantify business value.
Another common mistake that can doom a project from the beginning is overpromising on business value. With the hype around big data and machine learning, leaders across many industries often have unrealistically high expectations of data science. This is especially true during enterprise transformation, when there is a strong push to adopt new technology and get value out of the data. The unrealistic expectations rest on assumptions that are far off the mark, made without checking data availability, data quality, computation resources, or current best practices in the field. Even when the data science team has done some exploratory analysis at the problem formulation stage, project leaders sometimes ignore its data-driven voice.
These two systematic mistakes undermine the organization's data science strategy. The higher the expectation, the bigger the disappointment when the project cannot deliver the promised business value. Data and business context are essential to formulating the business problem and setting a reachable goal for business value. Having a strong data science leader with a broad technical background, and letting data scientists coordinate and drive the problem formulation and set realistic goals based on data and business context, helps avoid these mistakes.
2.5.2 Project Planning Stage
Now suppose the data science project is formulated correctly with reasonable expectations on the business value. The next step is to plan the project by allocating resources, setting up milestones and timelines, and defining deliverables. In most cases, project managers coordinate the different teams involved in the project and use agile project management tools similar to those used in software development. Unfortunately, the project management team may not have experience with data science projects and hence fails to account for their uncertainties at the planning stage. This fundamental difference between data science projects and other projects leads to another common mistake: being too optimistic about the timeline. For example, data exploration and data preparation may take 60% to 80% of the total time of a data science project, but people often do not realize that.
When an organization has already collected a lot of data, people tend to assume there is enough data for everything. This leads to another common mistake: being too optimistic about data availability and quality. What we need is not "big data," but data that can help us solve the problem. The data available may be of low quality, and we may need to put substantial effort into cleaning it before we can use it. It often takes "unexpected" effort to bring together the right and relevant data for a specific data science project. To ensure smooth delivery of data science projects, we need to account for this "unexpected" work at the planning stage. Data scientists all know that data preprocessing and feature engineering are usually the most time-consuming parts of a data science project. However, people outside data science are often not aware of it, and we need to educate other team members and the leadership team.
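To make the "unexpected" data work a little more concrete, below is a minimal sketch of a first-pass data quality profile in Python with pandas. The file name, key column, and function are hypothetical illustrations, not anything prescribed in this chapter.

```python
import pandas as pd

def profile_data_quality(df: pd.DataFrame, key: str) -> pd.DataFrame:
    """Summarize per-column missing rates, constant columns, and duplicate keys."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing_rate": df.isna().mean(),      # fraction of missing values per column
        "n_unique": df.nunique(dropna=True),   # number of distinct non-missing values
    })
    summary["is_constant"] = summary["n_unique"] <= 1
    n_dup = df.duplicated(subset=[key]).sum()
    print(f"duplicate rows on key '{key}': {n_dup}")
    return summary.sort_values("missing_rate", ascending=False)

orders = pd.read_csv("orders.csv")             # hypothetical raw data extract
print(profile_data_quality(orders, key="order_id").head(10))
```

A profile like this often surfaces the cleaning work early, which makes the timeline discussion at the planning stage far more realistic.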
2.5.3 Project Modeling Stage
Finally, we start to look at the data and fit some models. One common mistake at this stage is using unrepresentative data. A model trained on historical data may not generalize to the future, and there is always some degree of bias or unrepresentativeness in the data. As data scientists, we need to use data that are as close as possible to the situation where the model will be applied and quantify the impact of the model output in production. Another mistake at this stage is overfitting and obsession with complicated models. Today we can easily get hundreds or even thousands of features, and machine learning models keep getting more complicated. With open source libraries, people can try all kinds of models and sometimes become obsessed with complicated ones instead of choosing the simplest among a set of comparable models with similar results.
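One practical safeguard against the "historical data may not generalize to the future" problem is to evaluate on an out-of-time holdout rather than a random split. The sketch below, in Python with scikit-learn, trains on older data and tests on the most recent period; the file, timestamp, feature, and label names are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical data with a timestamp, a binary label, and a few features
df = pd.read_csv("events.csv", parse_dates=["event_time"])
df = df.sort_values("event_time").reset_index(drop=True)

# Hold out the most recent 20% of the timeline as the "future"
cutoff_idx = int(len(df) * 0.8)
train, test = df.iloc[:cutoff_idx], df.iloc[cutoff_idx:]

features = ["f1", "f2", "f3"]                  # hypothetical feature columns
model = LogisticRegression(max_iter=1000)
model.fit(train[features], train["label"])

auc = roc_auc_score(test["label"], model.predict_proba(test[features])[:, 1])
print(f"AUC on the most recent (out-of-time) data: {auc:.3f}")
```

If the out-of-time score is much worse than the random-split score, the model is unlikely to deliver what the offline evaluation promised.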
The data used to build the models is always somewhat biased or unrepresentative, and simpler models tend to generalize better. A simpler model has a higher chance of providing consistent business value once it passes the tests and is implemented in the production environment. The existing data and methods at hand may be insufficient to solve the business problem. In that case, we can try to collect more data, do more feature engineering, or develop new models. However, if there is a fundamental gap between the data and the business problem, the data scientist must make the tough decision to pull the plug on the project.
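As a minimal sketch of "choose the simplest among comparable models," the following Python example cross-validates a logistic regression against a gradient boosting model on simulated data; the models, metric, and data are illustrative assumptions, not a prescription. If the scores are similar, we keep the simpler model.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated binary classification data for illustration
X, y = make_classification(n_samples=2000, n_features=30, n_informative=8,
                           random_state=0)

simple = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
complex_model = GradientBoostingClassifier(random_state=0)

for name, estimator in [("logistic regression", simple),
                        ("gradient boosting", complex_model)]:
    scores = cross_val_score(estimator, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean AUC = {scores.mean():.3f} (+/- {scores.std():.3f})")
```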
On the other hand, data science projects usually have high visibility and may be initiated by senior leadership. Even after the data science team has provided enough evidence that the project cannot deliver the expected business value, people may not want to stop it, which leads to another common mistake at the modeling stage: taking too long to fail. The earlier we can stop a failing project, the better, because we can put the valuable resources into other, more promising projects. A long data science project that is doomed to fail damages the overall data science strategy and hurts everyone involved.
2.5.4 Model Implementation and Post-Production Stage
Now suppose we have found a model that works well on the training and testing data. If it is an online application, we are only halfway there. The next step is to implement the model in the production system, which can feel like alien work for a data scientist without software engineering experience. The data engineering team can help put the model into production. However, as data scientists, we need to know the potential mistakes at this stage. One big mistake is missing shadow mode and A/B testing, that is, assuming that the model performance measured during training and testing stays the same in the production environment. Unfortunately, a model trained and evaluated on historical data almost never performs the same in production. The data used in offline training may be significantly different from online data, and the business context may have changed. Whenever possible, a machine learning model should go through shadow mode and A/B testing before full deployment to evaluate its performance.
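In shadow mode the new model only logs its predictions without affecting customers, while an A/B test routes a fraction of traffic to the new model and compares business metrics. As a minimal illustration of the A/B testing step, the Python sketch below runs a two-proportion z-test on conversion rates; the counts are made up for illustration only.

```python
from math import sqrt
from scipy.stats import norm

# Hypothetical conversion counts from an A/B test
control_conv, control_n = 1150, 24000        # current model (control)
treatment_conv, treatment_n = 1290, 24200    # candidate model (treatment)

p_c = control_conv / control_n
p_t = treatment_conv / treatment_n

# Pooled standard error under the null hypothesis of equal conversion rates
p_pool = (control_conv + treatment_conv) / (control_n + treatment_n)
se = sqrt(p_pool * (1 - p_pool) * (1 / control_n + 1 / treatment_n))

z = (p_t - p_c) / se
p_value = 2 * norm.sf(abs(z))                # two-sided p-value

print(f"lift = {p_t - p_c:.4f}, z = {z:.2f}, p-value = {p_value:.4f}")
```

Only when the lift is both practically meaningful and statistically convincing should the candidate model replace the current one.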
In the model training stage, people usually focus on model performance metrics, such as accuracy, without paying much attention to execution time. When a model runs online in real time, the total run time for each instance (i.e., model latency) should not hurt the customer's experience; nobody wants to wait even one second to see the results after clicking the "search" button. In the production stage, feature availability is crucial for running a real-time model, and engineering resources are essential for model production. However, in traditional companies it is common for a data science project to fail to scale in real-time applications due to a lack of computation capacity, engineering resources, or a non-tech culture and environment.
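A minimal latency check is sketched below; `predict_one` is a hypothetical stand-in for the real scoring call. Tail percentiles (p95, p99) usually matter more for user experience than the average.

```python
import time
import numpy as np

def predict_one(features: dict) -> float:
    """Hypothetical placeholder for a real model inference call."""
    time.sleep(0.002)                       # simulate ~2 ms of model work
    return 0.5

latencies_ms = []
for _ in range(1000):
    start = time.perf_counter()
    predict_one({"f1": 1.0})
    latencies_ms.append((time.perf_counter() - start) * 1000)

p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(f"latency (ms): p50={p50:.1f}  p95={p95:.1f}  p99={p99:.1f}")
```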
As the business problem evolves rapidly, the data and model in the production environment need to change accordingly, or the model's performance will deteriorate over time. The online production environment is also more complicated than model training and testing. For example, when we pull online features from different sources, some may be missing at a specific time; the model may time out; and version mismatches across software components can cause problems. We need regular checkups during the entire model life cycle, from implementation to retirement. Unfortunately, people often do not set up a monitoring system for data science projects, which is another common mistake: missing necessary online checkups. It is essential to set up a monitoring dashboard and automatic alarms, and to create model tuning, retraining, and retirement plans.
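One simple drift check that can feed such a dashboard is the Population Stability Index (PSI), which compares a training-time distribution with a recent production window. The sketch below uses simulated scores and the common rule-of-thumb alarm threshold of 0.2; both are assumptions for illustration, not anything prescribed here.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between two samples of one numeric variable."""
    edges = np.percentile(expected, np.linspace(0, 100, bins + 1))
    expected_frac = np.histogram(expected, bins=edges)[0] / len(expected)
    # Clip production values into the training range so every value falls in a bin
    actual_clipped = np.clip(actual, edges[0], edges[-1])
    actual_frac = np.histogram(actual_clipped, bins=edges)[0] / len(actual)
    # Guard against empty bins before taking logs
    expected_frac = np.clip(expected_frac, 1e-6, None)
    actual_frac = np.clip(actual_frac, 1e-6, None)
    return float(np.sum((actual_frac - expected_frac) * np.log(actual_frac / expected_frac)))

rng = np.random.default_rng(0)
train_scores = rng.normal(0.0, 1.0, 50_000)   # score distribution at training time
prod_scores = rng.normal(0.3, 1.2, 5_000)     # simulated, drifted production window

value = psi(train_scores, prod_scores)
print(f"PSI = {value:.3f}")
if value > 0.2:                               # rule-of-thumb alarm threshold
    print("Significant shift detected: trigger an alert and review the model")
```

A check like this, run on a schedule for key features and model scores, turns the "online checkup" from a manual afterthought into an automatic alarm.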
2.5.5 Summary of Common Mistakes
A data science project is a combination of art, science, and engineering, and it can fail in many different ways. However, a data science project can provide significant business value if we put data and business context at the center of the project, get familiar with the data science project cycle, and proactively identify and avoid these potential mistakes. Here is a summary of the mistakes:
- Solving the wrong problem
- Overpromising on business value
- Being too optimistic about the timeline
- Being too optimistic about data availability and quality
- Using unrepresentative data
- Overfitting and obsession with complicated models
- Taking too long to fail
- Missing shadow mode and A/B testing
- Failing to scale in real-time applications
- Missing necessary online checkups