Over the last few weeks, we’ve analyzed why data science efforts, despite their importance to business executives, routinely fail. Last week, we explored how selecting the wrong use case, especially for teams just getting started with data science, can derail projects before they even start. Beyond selecting a good use case (the first step of the data science lifecycle), how should organizations think about prioritizing their data science investments? To answer this question, you have to understand where in the lifecycle your projects actually get delayed or completely stall. While each project is unique, our breadth of experience has taught us how the best teams strategically prioritize their data science investments. In this two-part series, we’ll share our learnings in process and technology investment prioritization.
As you can see, we organize steps of the lifecycle into four distinct stages: Define, Prepare, Model, and Predict. Many organizations jump to investing in the latter stages of the lifecycle without recognizing that model training can’t begin until features have been built and vetted on the right, properly prepared data. We recommend the exact opposite: prioritize investments in the technology and processes that scale and optimize the earliest steps first, so that data scientists have the building blocks needed to train models that deliver valuable predictions.
And now, the four steps we recommend you invest in first:
Data preparation is the first step that requires a focus on process and technology. While standard tools like Pandas and SQL can be used, it is important to systematize this process. A system that documents the data preparation process lets all work that draws on the same data use the same definitions and criteria. This means new projects do not need to start from scratch and can get to results much more quickly. It also prevents similar features from being created with slightly different criteria and definitions. Too often, businesses rely on dashboards and models that use almost, but not quite, the same definitions, leading to confusion or bad business decisions.
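To make the idea of shared definitions concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for a real warehouse. The `DEFINITIONS` registry, the `count_matching` helper, the table, and the cutoff date are all hypothetical names and values chosen for illustration, not part of any particular product.

```python
import sqlite3

# Hypothetical shared definitions: one vetted SQL condition per business concept.
# Dashboards and models look the condition up by name instead of rewriting it.
DEFINITIONS = {
    "active_customer": "last_order_date >= '2023-01-01'",
}


def count_matching(conn, table, definition_name):
    """Count rows in `table` that satisfy a named, shared definition."""
    condition = DEFINITIONS[definition_name]
    query = f"SELECT COUNT(*) FROM {table} WHERE {condition}"
    return conn.execute(query).fetchone()[0]


# Toy data in an in-memory database (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, last_order_date TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "2023-03-15"), (2, "2022-11-02"), (3, "2023-06-30")],
)

active = count_matching(conn, "customers", "active_customer")
print(active)
```

Because every consumer pulls the condition from one place, changing what "active customer" means is a one-line edit rather than a hunt through notebooks and dashboards.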
Beyond ensuring consistent definitions, there is a second reason to invest here: this portion of the data science lifecycle can consume a majority of the time spent on the entire project. Most data scientists do not enjoy this work and would rather spend the bulk of their time on modeling, so a heavy data preparation burden can lead to job dissatisfaction and high attrition on the data science team.
While data science teams can start with simple, open-source tools (Pandas, SQL, etc.), businesses can often justify investing in tools that both speed up data preparation and help document or share existing work, leading to better data science collaboration. In addition, hiring one or more data engineers to support data scientists with data preparation can pay large dividends.
While ROI can be generated using only the data identified during a well-run business problem definition step, there is almost always additional data within the organization that could be used to improve the model’s performance. Unfortunately, even with a data warehouse, it can be very challenging for the data scientist to incorporate, or often even find, this data. From a business perspective, the problem may be even bigger: if the business is storing the data but no one is using it, that storage is nothing more than a cost. A well-designed data discovery/exploration process can help unlock this data and improve the output of the data science team.
The key need of the data science team is to easily explore the data warehouse to identify tables, and the fields within them, that may be relevant to the problem being solved. Once the tables and fields have been identified, the data scientist needs to understand the dimensions of the tables and determine the correct way to join the data together. This information can then be used in the data preparation phase to create the modeling data.
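The exploration described above can be sketched in a few lines. This example uses SQLite (via Python's standard library) as a stand-in for a warehouse catalog; the helper names `tables_with_column` and `is_unique_key` and the toy tables are assumptions made for illustration. A real warehouse would expose similar metadata through its own catalog or information schema.

```python
import sqlite3


def tables_with_column(conn, column_name):
    """Return the names of tables that contain a column called `column_name`."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    matches = []
    for table in tables:
        # PRAGMA table_info returns one row per column; index 1 is the column name.
        columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        if column_name in columns:
            matches.append(table)
    return matches


def is_unique_key(conn, table, column):
    """True when `column` holds no duplicate values in `table`."""
    total, distinct = conn.execute(
        f"SELECT COUNT({column}), COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()
    return total == distinct


# Toy warehouse (illustrative only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, region TEXT)")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL)")
conn.execute("CREATE TABLE products (product_id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "east"), (2, "west")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0)],
)

candidates = tables_with_column(conn, "customer_id")
customers_unique = is_unique_key(conn, "customers", "customer_id")
orders_unique = is_unique_key(conn, "orders", "customer_id")
print(candidates, customers_unique, orders_unique)
```

Here `customer_id` is unique in `customers` but repeats in `orders`, which tells the data scientist that orders must be aggregated per customer before joining if the modeling data is meant to have one row per customer.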
No single tool accelerates this phase. Instead, it takes a combination of data catalogs, BI tools to visualize the data, and data processing tools. In addition, ways to document the findings are key to allowing other members of the data science team to leverage these new sources of data in their own models. The addition of a data engineer can also provide a boost to the data science team at this point.
Many data scientists enjoy this portion of the data science lifecycle, as it is seen as one of the most important things they can do to provide value and create good models. The team’s skills can drive success in data science without a strong process around feature engineering, but only to a point. Feature engineering is time-consuming, and much of the work is simply applying standard techniques to preprocess categorical, text, and numeric data. Because these techniques are standard, solutions that can speed up this process are beneficial.
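As an illustration of what those standard techniques look like, the sketch below implements two common ones in plain Python: one-hot encoding for a categorical column and standardization for a numeric column. The function names and sample values are assumptions for this example; in practice most teams would reach for a library rather than hand-roll these, and text preprocessing is not covered here.

```python
from statistics import mean, pstdev


def one_hot(values):
    """Encode a categorical column as one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return {cat: [1 if v == cat else 0 for v in values] for cat in categories}


def standardize(values):
    """Rescale a numeric column to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no signal; map it to zeros instead of dividing by 0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Toy columns (illustrative only).
plan = ["basic", "pro", "basic", "enterprise"]
spend = [10.0, 20.0, 30.0, 40.0]

encoded = one_hot(plan)
scaled = standardize(spend)
print(encoded)
print(scaled)
```

Because the same two functions apply to almost any categorical or numeric column, this is exactly the kind of repetitive work that tooling can take off a data scientist's plate.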
Outside of these standard techniques, the real power of feature engineering is in applying business and problem understanding to create features that help the models work better. This complex feature engineering allows the data scientist to express their creativity and directly improve the performance of the model they are building. There are no tools that can replace the data scientist at this point.
However, there is significant benefit in investing in a system that will document the features the data scientist is creating and even allow the same features to easily be recreated or applied to new data. Having existing code that can be reused by a different data scientist allows the organization to avoid competing definitions of the same concept and allows the models to be compared to one another.
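One lightweight way to picture such a system is a small feature registry, sketched below. The `Feature` class, the `REGISTRY` dictionary, and the `avg_order_value` feature are hypothetical and exist only to illustrate the idea of a documented feature that is defined once and reapplied to new data; a production feature repository would add storage, versioning, and access control.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Feature:
    """A documented feature: a name, a plain-language description, and the code."""
    name: str
    description: str
    compute: Callable[[dict], float]


REGISTRY: Dict[str, Feature] = {}


def register(feature: Feature) -> None:
    """Add a feature, refusing a second definition under the same name."""
    if feature.name in REGISTRY:
        raise ValueError(f"feature {feature.name!r} is already defined")
    REGISTRY[feature.name] = feature


def build_features(rows: List[dict], names: List[str]) -> List[dict]:
    """Apply the registered features to each row, on training or new data alike."""
    return [{n: REGISTRY[n].compute(row) for n in names} for row in rows]


register(Feature(
    name="avg_order_value",
    description="Total spend divided by number of orders; 0 when there are no orders.",
    compute=lambda row: row["total_spend"] / row["orders"] if row["orders"] else 0.0,
))

# Toy rows (illustrative only).
training_rows = [{"total_spend": 120.0, "orders": 4}]
new_rows = [{"total_spend": 0.0, "orders": 0}]

train = build_features(training_rows, ["avg_order_value"])
new = build_features(new_rows, ["avg_order_value"])
print(train, new)
```

Rejecting duplicate names is the design choice that matters here: it forces a second data scientist to reuse or explicitly rename a concept rather than quietly create a competing definition.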
Data science projects depend on data, typically data stored in a data warehouse. Significant value can be extracted simply by using these data assets during the phases detailed above. However, adding data from internal sources that aren’t yet in the warehouse and from external sources (customer data from third-party data providers, economic or physical data such as census results or weather, or related data from partners) can measurably improve models and decisions.
There are really two types of tools that should be considered. First, there is the technology that provides the data warehouse or data lake, either in the cloud or on premises. Often these decisions have already been made by the company's IT department. Second, because simply storing the data is not enough, the data science team can invest in an orchestration tool to help it manage adding and processing these new sources of data before they are incorporated into the data engineering process.
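At its core, an orchestration tool runs processing steps in dependency order. The sketch below shows only that core idea using `graphlib` from the Python standard library (Python 3.9+); the step names and the `run_pipeline` helper are hypothetical, and real orchestration tools add scheduling, retries, and monitoring on top.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the set of steps it depends on.
PIPELINE = {
    "load_weather": set(),
    "load_census": set(),
    "clean_weather": {"load_weather"},
    "join_external": {"clean_weather", "load_census"},
    "publish_to_warehouse": {"join_external"},
}


def run_pipeline(pipeline, actions):
    """Run every step after its dependencies and return the execution order."""
    order = list(TopologicalSorter(pipeline).static_order())
    for step in order:
        actions[step]()
    return order


# Stand-in actions that just record that they ran (illustrative only).
log = []
actions = {step: (lambda s=step: log.append(s)) for step in PIPELINE}

order = run_pipeline(PIPELINE, actions)
print(order)
```

Declaring dependencies rather than hard-coding a run order means a new data source can be added as one more entry without rethinking the rest of the pipeline.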
Invest early in optimizing the data science lifecycle steps that account for over 80% of data science project delays and failures. Without powerful tools and thoughtful processes, the Prepare stage quickly becomes such a severe bottleneck that subsequent modeling steps can’t even begin. Next week, we’ll wrap up this series by looking at how to prioritize investments in the remaining data science lifecycle steps.
If you are looking to invest in the Prepare stage, Rasgo’s Accelerated Modeling Preparation (AMP) platform focuses exclusively on optimizing and centralizing the steps and activities in this stage with built-in collaboration tools such as a feature repository.