A data science project has various stages. Many textbooks and blogs focus on one or two specific stages, and it is rare to see an end-to-end life cycle of a data science project. Getting a good grasp of the end-to-end process requires years of real-world experience. Seeing a holistic picture of the whole cycle helps data scientists better prepare for real-world applications. We will walk through the full project cycle in this section.
People often use the term “data science project” to describe any project that uses data to solve a business problem, including traditional business analytics, data visualization, or machine learning modeling. Here we limit our discussion to data science projects that involve data and some statistical or machine learning models and exclude basic analytics or visualization. The business problem itself gives us the flavor of the project. We can view data as the raw ingredient to start with, and the machine learning model makes the dish. The types of data used and the final model development define the different kinds of data science projects.
There are offline and online data. Offline data are historical data stored in databases or data warehouses. With the development of data storage techniques, the cost to store a large amount of data is low. Offline data are versatile and rich in general (for example, websites may track and keep each user’s mouse position, click and typing information while the user is visiting the website). Offline data are usually stored in a distributed system and can be extracted in batches to create features used in model training.
Online data are real-time information that flows to models to take automatic actions. Real-time data can change frequently (for example, the keywords a customer is searching for can change at any given time). Capturing and using real-time online data requires integrating a machine learning model into the production infrastructure. This used to involve a steep learning curve for data scientists not familiar with computer engineering, but cloud infrastructure makes it much more manageable. Based on the offline and online data and model properties, we can separate data science projects into three different categories as described below.
This type of data science project is for a specific business problem that needs to be solved once or multiple times. But the dynamic and disruptive nature of this type of business problem requires substantial work every time. One example of such a project is “whether a brand-new business workflow is going to improve efficiency.” In this case, we often use internal/external offline data and business insight to build models. The final results are delivered as a report to answer the specific business question. It is similar to the traditional business intelligence project but with more focus on data and models. Sometimes the data size and model complexity are beyond the capacity of a single computer. Then we need to use distributed storage and computation. Since the model uses historical data, and the output is a report, there is no need for real-time execution. Usually, there is no run-time constraint on the machine learning model unless the model runs beyond a reasonable time frame, such as a few days. We can call this type of data science project an “offline training, offline application” project.
Another type of data science project uses offline data for training and applies the trained model to real-time online data in the production environment. For example, we can use historical data to train a personalized advertisement recommendation model that provides a real-time ad recommendation. The model training uses historical offline data. The trained model then takes customers’ online real-time data as input features and runs in real time to provide an automatic action. The model training is very similar to the “offline training, offline application” project. But to put the trained model into production, there are specific requirements. For example, the features used in offline training have to be available online in real time, and the model’s online run-time has to be short enough that it does not affect the user experience. In most cases, data science projects in this category create continuous and scalable business value as the model could run millions of times a day. We will use this type of data science project to describe the typical data science project cycle from section 2.4.2 to section 2.4.5.
Some business problems are so dynamic that even yesterday’s data is out of date. In this case, we can use online data to train the model and apply it in real time. We call this type of data science project “online training, online application.” This type of data science project requires high automation and low latency.
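As a minimal sketch of what “online training, online application” can look like, the following pure-Python example scores each incoming observation and then immediately learns from it with one stochastic gradient step. The class name `OnlineLogisticRegression`, the learning rate, and the toy stream are illustrative assumptions, not a description of any particular production system.

```python
import math


class OnlineLogisticRegression:
    """Hypothetical logistic regression trained one observation at a time."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """Take one gradient step on the log-loss for a single (x, y) pair."""
        error = self.predict_proba(x) - y
        self.weights = [w - self.learning_rate * error * xi
                        for w, xi in zip(self.weights, x)]
        self.bias -= self.learning_rate * error


def run_stream(model, stream):
    """Score each observation first (application), then learn from it (training)."""
    scores = []
    for x, y in stream:
        scores.append(model.predict_proba(x))
        model.update(x, y)
    return scores


# Toy stream (made-up data): the label is 1 when the single feature is positive.
toy_stream = [([1.0], 1), ([-1.0], 0)] * 50
model = OnlineLogisticRegression(n_features=1)
scores = run_stream(model, toy_stream)
```

Because the model is updated after every observation, there is no separate batch training step; in a real system the update and scoring would need to fit within the latency budget mentioned above.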
A data-driven and fact-based planning stage is essential to ensure a successful data science project. With the recent big data and data science hype, there is a high demand for data science projects to create business value across different business sectors. Usually, the leaders of an organization are those who initiate the data science project proposals. This top-down style of data science project typically has high visibility with some human and computation resources pre-allocated. However, it is crucial to understand the business problem first and align the goal across different teams, including:
- the business team, which may include members from the business operation team, business analytics, insight, and metrics reporting team;
- the technology team, which may include members from the database and data warehouse team, data engineering team, infrastructure team, core machine learning team, and software development team;
- the project, program, and product management team depending on the scope of the data science project.
To start the conversation, we can ask everyone on the team the following questions:
- What are the pain points in the current business operation?
- What data are available, and what are the quality and quantity of the data?
- What might be the most significant impacts of a data science project?
- Is there any positive or negative impact on other teams?
- What computation resources are available for model training and model execution?
- Can we define key metrics to compare and quantify business value?
- Are there any data security, privacy, and legal concerns?
- What are the desired milestones, checkpoints, and timeline?
- Is the final application online or offline?
- Are the data sources online or offline?
We should expect a series of intense meetings and heated discussions before the project is framed reasonably. After the planning stage, we should be able to define a set of key metrics related to the project, identify some offline and online data sources, request the needed computation resources, draft a tentative timeline and milestones, and form a team of data scientists, data engineers, software developers, project managers, and members from business operations. Data scientists should play a significant role in these discussions. If data scientists do not lead the project formulation and planning, the project may not meet the desired timeline and milestones.
Even though we already set some strategies, milestones, and timelines at the problem formulation and project planning stage, data science projects are dynamic. There could be uncertainties along the road. As a data scientist, communicating any newly encountered difficulties or opportunities during the modeling stage to the entire team is essential to keep the data science project progressing. Data cleaning, data wrangling, and exploratory data analysis are great starting points toward modeling with the available data sources identified at the planning stage. Meanwhile, abstracting the business problem into a set of statistical and machine learning problems is an iterative process. Business problems can rarely be solved by using just one statistical or machine learning model. Using a sequence of methods to decompose the business problem is one of the critical responsibilities for a senior data scientist. The process requires iterative rounds of discussions with the business and data engineering team based on each iteration’s new learnings. Each iteration includes both data-related and model-related parts.
For offline application data science projects, the end product is often a detailed report with model results and output. However, for online application projects, a trained model is only halfway to the finish line. The offline data is stored and processed in a different environment from the online production environment. Building the online data pipeline and implementing machine learning models in a production environment requires lots of additional work. Even though recent advances in cloud infrastructure have lowered the barrier dramatically, it still takes effort to implement an offline model in the online production system. Before we promote the model to production, there are two more steps to go:
- Shadow mode
- A/B testing
Shadow mode is like an observation period in which the data pipeline and machine learning models run fully functional, but we only record the model output without taking any action. Some people call it proof of concept (POC). During shadow mode, people frequently check the data pipeline and model to detect bugs such as timeouts, missing features, version conflicts (for example, Python 2 vs. Python 3), data type mismatches, etc.
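The idea of recording model output without acting on it can be sketched as a thin wrapper around the model call. This is only an illustration under stated assumptions: `shadow_score`, `toy_model`, the feature names, and the 0.1-second timeout are all hypothetical, and a real pipeline would write to a logging or monitoring service rather than a Python list.

```python
import time


def shadow_score(model_fn, features, required_features, log, timeout_seconds=0.1):
    """Run the model on live input, record what happened, and take no action.

    All names and the timeout value are hypothetical, for illustration only.
    """
    record = {"features": features, "score": None, "issues": []}

    missing = [name for name in required_features if name not in features]
    if missing:
        record["issues"].append(f"missing features: {missing}")
    else:
        start = time.perf_counter()
        try:
            record["score"] = model_fn(features)
        except (TypeError, ValueError) as exc:
            # Typical shadow-mode finding: a data type mismatch between pipelines.
            record["issues"].append(f"model error: {exc}")
        elapsed = time.perf_counter() - start
        if elapsed > timeout_seconds:
            record["issues"].append(f"slow response: {elapsed:.3f}s")

    log.append(record)  # the output is only recorded
    return None         # shadow mode never returns an action to the caller


def toy_model(features):
    """Made-up scoring rule standing in for a trained model."""
    return 0.01 * features["clicks"] + 0.5 * features["is_returning"]


required = ["clicks", "is_returning"]
shadow_log = []
shadow_score(toy_model, {"clicks": 10, "is_returning": 1}, required, shadow_log)
shadow_score(toy_model, {"clicks": 10}, required, shadow_log)                       # missing feature
shadow_score(toy_model, {"clicks": "10", "is_returning": 1}, required, shadow_log)  # wrong data type
```

Reviewing the `issues` entries in `shadow_log` is the kind of check described above: the second call surfaces a missing feature and the third a data type mismatch, while no call changes what the user sees.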
Once the online model passes shadow mode, A/B testing is the next stage. During A/B testing, all incoming observations are randomly separated into two groups: control and treatment. The control group skips the machine learning model, while the treatment group goes through it. After that, people monitor a list of pre-defined key metrics during a specific period to compare the control and treatment groups. The differences in these metrics determine whether the machine learning model provides business value or not. Real applications can be complicated. For example, there can be multiple treatment groups, or hundreds or even thousands of A/B tests run by different teams at any given time in the same production environment.
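A simplified sketch of the two core pieces, group assignment and metric comparison, is shown below. It makes several assumptions that are not in the text: groups are assigned by hashing a unit ID (a common stand-in for random assignment that keeps each unit in the same group), the key metric is a conversion rate, and the comparison uses a two-proportion z-test. The function names, the experiment name, and the counts are made up for illustration.

```python
import hashlib
import math


def assign_group(unit_id, experiment_name, treatment_share=0.5):
    """Deterministically map a unit to 'control' or 'treatment' via hashing."""
    digest = hashlib.sha256(f"{experiment_name}:{unit_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two-sided p-value) for the difference between two rates."""
    rate_a, rate_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    normal_cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    return z, 2 * (1 - normal_cdf)


# Assignment: the same unit always lands in the same group for a given experiment.
groups = [assign_group(user_id, "ad_model_v1") for user_id in range(10_000)]
treatment_fraction = groups.count("treatment") / len(groups)

# Comparison with made-up counts: 100/1000 conversions in control, 130/1000 in treatment.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
```

Including the experiment name in the hash is one way to keep assignments independent across the many experiments that may run in the same production environment.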
Once A/B testing shows that the model provides significant business value, we can put it into full production. Ideally, the model runs as expected and continues to offer scalable value. However, the business can change: a machine learning model that works now can break tomorrow, and features available now may not be available tomorrow. We need a monitoring system to notify us when one or multiple features change. When the model performance degrades below a pre-defined level, we need to fine-tune the parameters and thresholds, re-train the model with more recent data, or add or remove features to improve model performance. Eventually, any model will fail or be retired, ideally following a pre-defined model retirement plan.
The end-to-end data science project cycle is a complicated process that requires close collaboration among many teams. The data scientist, who may be the only scientist on the team, has to lead the planning discussion and model development based on the data available and communicate key assumptions and uncertainties. A data science project may fail at any stage, and a clear end-to-end cycle view of the project helps avoid some mistakes.
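A monitoring check of this kind might look like the sketch below. It is a deliberately simple illustration under stated assumptions: drift is measured only as a shift in a feature's mean relative to a baseline sample, the threshold of 3 standard errors is arbitrary, and the function names, feature names, and values are hypothetical. Production systems usually compare full distributions and track many more signals.

```python
import statistics


def feature_drift_alerts(baseline, recent, z_threshold=3.0):
    """Flag features that disappeared or whose recent mean moved far from baseline.

    baseline and recent map a feature name to a list of numeric values.
    """
    alerts = []
    for name, base_values in baseline.items():
        recent_values = recent.get(name)
        if not recent_values:
            alerts.append(f"{name}: no recent values (feature unavailable?)")
            continue
        base_mean = statistics.mean(base_values)
        recent_mean = statistics.mean(recent_values)
        # Standard error of the recent mean, using the baseline spread.
        std_err = statistics.stdev(base_values) / len(recent_values) ** 0.5
        if std_err > 0 and abs(recent_mean - base_mean) / std_err > z_threshold:
            alerts.append(f"{name}: mean shifted from {base_mean:.2f} to {recent_mean:.2f}")
    return alerts


def needs_retraining(recent_metric, minimum_acceptable):
    """True when the monitored performance metric falls below the pre-defined level."""
    return recent_metric < minimum_acceptable


# Made-up data: one feature drifts upward and one stops arriving.
baseline = {"session_length": [10, 12, 11, 9, 13, 10, 11, 12],
            "clicks": [1, 2, 3, 2, 1, 2, 3, 2]}
recent = {"session_length": [20, 22, 21, 19], "clicks": []}

alerts = feature_drift_alerts(baseline, recent)
retrain = needs_retraining(recent_metric=0.71, minimum_acceptable=0.75)
```

An alert or a `True` retraining flag would then trigger one of the responses listed above: tuning thresholds, re-training on recent data, or revisiting the feature set.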