2.4 Data Science Project Cycle
A data science project has several stages. Many textbooks and blogs focus on one or two specific stages, and it is rare to see the end-to-end life cycle of a data science project described. In fact, getting a good grasp of the end-to-end cycle requires many years of experience doing real-world data science. We share our views on it in this section. Seeing the whole cycle helps you prepare better for real-world applications.
2.4.1 Types of Data Science Projects
People often use the term “data science project” to describe any project that uses data to solve a business problem, including traditional business analytics or data visualization. Here we limit our discussion to data science projects that involve data and some statistical or machine learning models. The business problem gives the project its flavor, data is the raw ingredient to start with, and the model makes the dish. Data science projects can be classified by the type of data used and by how the final model is developed and implemented.
2.4.1.1 Offline and online data
Data can be offline or online. Offline data are historical, archived data stored in databases or data warehouses. As data storage has developed, storing a large amount of data has become cheap, and offline data are generally very rich (for example, a website may track and store each individual user’s mouse position, clicks, and typing while the user is visiting the website). Offline data are usually stored in a distributed system and can be extracted in batches as raw material for creating the features used in model training. Online data are real-time information that can be fed to models to trigger automatic actions. Real-time information, such as the keywords a customer is searching for, can change frequently. Capturing and using real-time online data requires integrating machine learning into the production infrastructure. This used to involve a steep learning curve for data scientists, but cloud infrastructure makes it much easier.
2.4.1.2 Offline training and offline application
This type of data science project addresses a specific business problem that needs to be solved once or a few times, but the dynamic nature of the problem requires substantial work each time. One example of such a project is “whether a new workflow is going to improve efficiency.” In this situation, we often use offline internal and external data, build models, and deliver the final results as a report that answers the specific business question. It is similar to a traditional business intelligence project but with more focus on data and models. Sometimes the data size and model complexity are beyond the capacity of a single computer, so you need to use distributed storage and computation. Since the model is based on historical data and the output is a report, there is no need for real-time execution. Usually, there is no run-time constraint on the machine learning model unless it runs beyond a reasonable time frame, such as a few hours or a few days. We call this type of data science project an “offline training, offline application” project.
2.4.1.3 Offline training and online application
Another type of data science project uses offline data for training and applies the trained model to real-time online data in the production environment. One example of such a project is “using historical data to train a personalized advertisement model, and then providing real-time ad recommendations when customers visit the website.” The model is trained on offline data and then uses a customer’s real-time online data as features, running in real time to trigger an automatic action. The model training is very similar to that of the “offline training, offline application” project, but because the trained model will be put into production, there are specific requirements: the features used in offline training have to be available online in real time, and the online run time of the model has to be short enough not to affect the user experience. In most cases, data science projects in this category create continuous and scalable business value. We will use this type of data science project to describe the project cycle.
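The two production requirements just mentioned (feature availability and run time) can be checked mechanically before deployment. The following is only a minimal sketch: the feature names, the `score` stand-in model, and the 50 ms budget are hypothetical values chosen for illustration, not taken from a real system.

```python
import time

def missing_online_features(training_features, online_features):
    """Return the training features that the online system cannot supply."""
    return sorted(set(training_features) - set(online_features))

def within_latency_budget(predict, example, budget_ms, n_calls=100):
    """Call `predict` repeatedly and compare the slowest call with the budget."""
    worst_ms = 0.0
    for _ in range(n_calls):
        start = time.perf_counter()
        predict(example)
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        worst_ms = max(worst_ms, elapsed_ms)
    return worst_ms <= budget_ms, worst_ms

# Hypothetical feature lists for an ad recommendation model.
TRAINING_FEATURES = ["past_purchases", "search_keyword", "device_type", "lifetime_value"]
ONLINE_FEATURES = ["search_keyword", "device_type", "past_purchases"]

def score(example):
    """Stand-in for a trained model: a fixed linear score (assumption)."""
    return 0.3 * example["past_purchases"] + 0.1 * len(example["search_keyword"])

if __name__ == "__main__":
    # Features used offline that are not available online in real time.
    print(missing_online_features(TRAINING_FEATURES, ONLINE_FEATURES))
    ok, worst = within_latency_budget(
        score, {"past_purchases": 4, "search_keyword": "shoes"}, budget_ms=50
    )
    print(ok, worst)
```

A feature that appears in the first list has to be either built into the online pipeline or dropped from training. The latency check here times only the model call; a real check would also include feature retrieval.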
2.4.1.4 Online training and online application
Some business problems are so dynamic that even yesterday’s data is out of date. In such cases, we can use online data to train the model and then apply it in real time. We call this type of data science project “online training, online application.” It requires a high degree of automation and low latency.
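One common way to train on online data is to update the model one observation at a time. The sketch below uses logistic regression with stochastic gradient descent as an illustrative choice; the class name, learning rate, and the toy data stream are assumptions and not a method prescribed by this section.

```python
import math

class OnlineLogisticRegression:
    """Logistic regression updated one observation at a time with SGD."""

    def __init__(self, n_features, learning_rate=0.1):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        z = max(-35.0, min(35.0, z))  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, y):
        """Score the observation first, then learn from its label (0 or 1)."""
        p = self.predict_proba(x)
        error = p - y  # gradient of the log loss with respect to z
        self.weights = [
            w - self.learning_rate * error * xi for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error
        return p

if __name__ == "__main__":
    model = OnlineLogisticRegression(n_features=1)
    # Toy stream (assumption): positive feature values tend to be labeled 1.
    stream = [([1.0], 1), ([-1.0], 0)] * 200
    for features, label in stream:
        model.update(features, label)
    print(model.predict_proba([1.0]), model.predict_proba([-1.0]))
```

Because each update touches only the current observation, the cost per event is small and constant, which is what makes this style of training compatible with low-latency requirements.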
2.4.2 At the Planning Stage
A data-driven and fact-based planning stage is essential to a successful data science project. With the recent big data and data science hype, there is high demand across business sectors for data science projects that create business value. Often, these project proposals are initiated by the leaders of an organization. Such top-down data science projects usually have high visibility, with certain human and computation resources pre-allocated. However, it is crucial to understand the business problem first and align the goal across different teams, including
- the business team, which may include members from the business operation team, business analysts, and the insight and reporting team;
- the technology team, which may include members from the database and data warehouse team, data engineering team, infrastructure team, core machine learning team, and software development team;
- the project management team, which may include the program management team and product management team, depending on the scope of the data science project.
To start the conversation, we can ask the following questions to everyone in the team:
- What are the pain points in current business operation?
- What data are available, and what are the quality and quantity of the data?
- What might be the most significant impacts of a data science project?
- Are there any negative impacts on other teams?
- What computation resources are available for model training and model execution?
- Can we define key metrics to compare and quantify business value?
- Are there any data security, privacy and legal concerns?
- What are the desired checkpoints and timeline?
- Is the final application online or offline?
- Are the data online or offline?
Framing the project to a reasonable scope is likely to take a series of intense meetings and heated discussions. After the planning stage, we should be able to define a set of key metrics related to the project, identify some offline and online data sources, request the needed computation resources, draft a tentative timeline and milestones, and form a team of data scientists, data engineers, software developers, project managers, and members from business operation. Data scientists should play a major role in these discussions. If a data scientist is not leading the formulation of the data science project, the project is very likely to miss its timeline and milestones.
2.4.3 At the Modeling Stage
Even though we have already set a strategy, milestones, and a timeline at the planning stage, data science projects are dynamic in nature, and there can be uncertainties along the way. As a data scientist, clearly communicating any newly encountered difficulties to the entire team during the modeling stage is essential to keep the project moving. With the available data sources identified at the planning stage, data cleaning, data wrangling, and exploratory data analysis are great starting points for modeling. Meanwhile, abstracting the business problem into a set of statistical and machine learning problems is an iterative process. It is rare that a business problem can be solved with just one statistical or machine learning model. The ability to decompose the business problem and address it with a sequence of methods is one of the key responsibilities of a senior data scientist. The process requires iterative rounds of discussion with the business team and data engineering team, based on what was learned in each iteration. Each iteration includes both a data-related part and a model-related part.
2.4.4 At the Production Stage
For offline application data science projects, the end product is often a detailed report with model results and output. For online application projects, however, a trained model is only halfway to the finish line. The offline data are stored and processed in an environment totally different from the online production environment. Building the online data pipeline and implementing machine learning models in a production environment require a lot of additional work. Even though recent advances in cloud infrastructure have lowered the barrier dramatically, it still takes effort to implement an offline model in the online production system. Before you promote the model to production, there are two more steps to go:
- shadow mode
- A/B testing
Shadow mode is an observation period in which the data pipeline and machine learning models run as if they were fully functional, but we only record the model output without taking any action. Some people call it proof of concept (POC). During POC, people frequently check the data pipeline and model to detect bugs such as timeouts, missing features, version conflicts (for example, Python 2 vs. Python 3), and data type mismatches.
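The idea of recording model output without acting on it can be sketched as a thin wrapper around the existing decision path. This is an illustrative sketch only: the function names, the request layout, and `candidate_model` are hypothetical, and a real system would write to durable storage rather than an in-memory list.

```python
import logging

logger = logging.getLogger("shadow_mode")

def serve_in_shadow_mode(request, current_action, model, shadow_log):
    """Return the current system's action; run the model only to record its output."""
    record = {"request_id": request.get("id"), "served": current_action}
    try:
        record["shadow_output"] = model(request["features"])
        record["error"] = None
    except Exception as exc:  # a shadow failure must never affect the user
        record["shadow_output"] = None
        record["error"] = f"{type(exc).__name__}: {exc}"
        logger.warning("shadow model failed: %s", record["error"])
    shadow_log.append(record)
    return current_action

def candidate_model(features):
    """Hypothetical model; raises KeyError when the 'clicks' feature is missing."""
    return 2 * features["clicks"]

if __name__ == "__main__":
    log = []
    serve_in_shadow_mode({"id": 1, "features": {"clicks": 3}}, "default_ad", candidate_model, log)
    serve_in_shadow_mode({"id": 2, "features": {}}, "default_ad", candidate_model, log)
    for entry in log:
        print(entry)
```

Reviewing the recorded errors is how bugs such as missing features or type mismatches surface before the model is allowed to act.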
Once the online model passes shadow mode, A/B testing is the next stage. During A/B testing, all incoming observations are randomly separated into two groups: control and treatment. The control group skips the machine learning model, while the treatment group goes through it. After that, people monitor a list of pre-defined key metrics over a specific time period to compare the control and treatment groups. The differences in these key metrics determine whether the machine learning model provides business value. Real applications can be complicated. For example, there can be multiple treatment groups, or hundreds, even thousands, of A/B tests run by different teams at any given time.
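The assignment and comparison steps can be illustrated with a small sketch. It makes two assumptions that are not from the text: users are assigned by hashing an ID (a deterministic stand-in for random assignment, so a returning user stays in the same group), and the key metric is a conversion rate compared with a two-sided two-proportion z-test. The experiment name and the counts are made up.

```python
import hashlib
import math

def assign_group(user_id, experiment="ad_model_v1", treatment_share=0.5):
    """Deterministically map a user to 'control' or 'treatment' via a hash."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"

def two_proportion_z_test(successes_c, n_c, successes_t, n_t):
    """Two-sided z-test for a difference in conversion rates (pooled variance)."""
    p_c, p_t = successes_c / n_c, successes_t / n_t
    pooled = (successes_c + successes_t) / (n_c + n_t)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    if se == 0:  # all successes or all failures: no detectable difference
        return 0.0, 1.0
    z = (p_t - p_c) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area
    return z, p_value

if __name__ == "__main__":
    print(assign_group("user-42"))
    # Made-up counts: 100 of 1000 control users convert, 150 of 1000 treated users.
    z, p = two_proportion_z_test(100, 1000, 150, 1000)
    print(z, p)
```

Including the experiment name in the hash keeps assignments independent across experiments, which matters when many tests run at the same time.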
Once A/B testing shows that the model provides significant business value, you can put it into full production. Ideally, the model runs as expected and continues to provide scalable value. However, the business can change: a machine learning model that works now can break tomorrow, and features available now may not be available tomorrow. You need a monitoring system that automatically notifies you when one or more features change. When model performance degrades below a pre-defined level, you need to fine-tune the parameters and thresholds, re-train the model with more recent data, or add or remove features to improve performance. Eventually, any model will fail or be retired.
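A very simple form of feature monitoring compares recent feature values with a baseline saved at training time. The sketch below is a crude illustration under stated assumptions: the two checks (missing rate and mean shift measured in baseline standard deviations), the thresholds, and the feature names are all hypothetical choices, and production systems typically use richer drift statistics.

```python
import statistics

def feature_alerts(baseline, recent, max_missing_rate=0.05, max_shift_in_sd=3.0):
    """Compare recent feature values with a training-time baseline and list alerts."""
    alerts = []
    for name, base_values in baseline.items():
        values = recent.get(name, [])  # a feature that disappeared yields no values
        observed = [v for v in values if v is not None]
        missing_rate = 1.0 - len(observed) / len(values) if values else 1.0
        if not observed or missing_rate > max_missing_rate:
            alerts.append(f"{name}: missing rate {missing_rate:.0%}")
            continue
        base_sd = statistics.pstdev(base_values)
        shift = abs(statistics.mean(observed) - statistics.mean(base_values))
        if base_sd > 0 and shift / base_sd > max_shift_in_sd:
            alerts.append(f"{name}: mean shifted by {shift / base_sd:.1f} baseline SDs")
    return alerts

if __name__ == "__main__":
    baseline = {"age": [30, 40, 50], "clicks": [1, 2, 3]}
    recent = {"age": [31, 41, 49], "clicks": [None, None, 2]}
    for alert in feature_alerts(baseline, recent):
        print(alert)
```

In practice the returned alerts would be routed to a notification channel, and a similar threshold check on a model performance metric would trigger re-training.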
The end-to-end data science project cycle is a complicated process that requires close collaboration among many teams. The data scientist, who may be the only scientist on the team, has to lead the planning discussion and model development based on the data available and clearly communicate key assumptions and uncertainties to the entire team. A data science project may fail at any stage, and a clear end-to-end view of the project cycle helps avoid some mistakes.