“Data is the new oil.” This phrase has been circulating for over a decade, and, to be honest, I’ve always disliked it. I am uncomfortable with the idea that data has value just because we have it, and I’ve seen too many companies follow this idea, simply storing all of their data in massive data warehouses or data lakes and never touching it again. Data sitting unused in a data warehouse has no value; it is just a cost.
It’s the data scientist’s role to work with the data to identify and extract the value. Instead of this quote, it was Andrew Ng’s statement, “Applied machine learning is basically feature engineering,” that really resonated with me. This sentiment was reinforced by Dr. Pedro Domingos, who stated, “At the end of the day, some machine learning projects succeed and some fail. What makes the difference? Easily the most important factor is the features used.” Focusing on data as valuable by itself hides the effort that is needed to extract that value. In many cases, this effort is more of a slog.
In investigating the source of the “Data is the new oil” quote, I discovered that it can be traced back to 2006 and Clive Humby. Interestingly, the familiar phrase is only part of the quote. The full quote reads, “Data is the new oil. It’s valuable, but if unrefined it cannot really be used. It has to be changed into gas, plastic, chemicals, etc to create a valuable entity that drives profitable activity; so must data be broken down, analyzed for it to have value.” He is absolutely right. There is value in data, just as there is in oil. But just as oil in the ground generates no immediate value, data sitting in a warehouse generates no value. It is only through refining and distribution that both oil and data generate value.
The refinement of data yields features that can be used in dashboards and predictive models, and it encompasses everything from identifying the right data, to cleaning it, to merging it with other data, to performing feature engineering. At Rasgo, we believe that data scientists primarily add value by refining raw data into features that allow their descriptive and predictive models to capture the signal from the noise.
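To make “refining raw data into features” concrete, here is a minimal sketch in Python using pandas and entirely hypothetical transaction data: a raw table of purchases is rolled up into customer-level features suitable for a dashboard or a model.

```python
import pandas as pd

# Hypothetical raw transaction data: the kind of unrefined
# input a data scientist typically starts from.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount": [20.0, 35.0, 5.0, 12.0, 8.0],
})

# Refine the raw rows into customer-level features:
# total spend and average order size per customer.
features = (
    transactions.groupby("customer_id")["amount"]
    .agg(total_spend="sum", avg_order="mean")
    .reset_index()
)
print(features)
```

The names (`customer_id`, `total_spend`, `avg_order`) are illustrative; the point is that the modeling-ready table is something the data scientist builds, not something the warehouse hands over.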
This refinement process, sometimes called data preparation (and sometimes with feature engineering broken out as a separate step), is estimated to account for between 50% and 80% of the time spent on a project. Further, much of the data preparation process is tedious and time-consuming, leading the New York Times to famously call it data janitorial work.
If we applied today’s data preparation process to filling a car with gasoline, the driver would receive crude oil and be expected to jury-rig a refinery to create their own gasoline. Further, each data scientist would be building their own, slightly different refinery.
Why is the most important part of the job still a slog? Before we answer this question, let’s take a step back and examine the state of data science. Data science is on the verge of a crisis. Repeated surveys show that executives believe data science and AI will be key to their business going forward. Elon Musk claimed, “Companies have to race to build AI or they will be made uncompetitive. Essentially, if your competitor is racing to build AI, they will crush you.”
Despite executive sponsorship and funding, data science is failing to deliver on its promise. In 2017, CIO Dive reported just 13% of data science projects reached completion and only 8% of executives were satisfied with the outcome. It hasn’t gotten any better since. In September of 2020, ESI THOUGHTLAB released the results of a survey of 1,200 firms. This survey showed that 40% of data science projects delivered no or negative returns and that, overall, data science projects showed only a 1.3% return on investment.
The data science community has been talking about this lack of return on investment for over five years. If something doesn’t change, business executives will eventually conclude that data science will not deliver on its promise, downsize their data science teams, and focus those resources elsewhere in the business.
This has happened before. The history of AI contains numerous examples of AI winters, in which the potential of AI became overhyped, leading to disappointing results and funding cuts. The first AI winter occurred in 1966, when the United States government stopped funding the machine translation effort it had supported since 1954, after spending $20 million. In 1973, the Lighthill report for the United Kingdom Parliament criticized the state of AI as suitable only for solving simple problems and unable to deliver on its promise. This led to a decade with little AI research in the UK.
The data science community is much more robust than it was during these AI winters: the tools are better, and data science is driving some of the largest companies in the world. In the attempt to deliver on data science’s promise, there has been significant investment in five areas: education, data storage, visualization, model building and model deployment.
Five years ago, there was an extreme shortage of data scientists, and it was exceedingly challenging for most businesses to hire the data scientists they needed to complete their projects. Since then, there has been a proliferation of training options, from online courses and tutorials, to bootcamps, to academic programs. As the shortage of data scientists drove up their salaries, many individuals began learning data science skills, and as this market developed, most universities created undergraduate and graduate programs to train additional data scientists. These training options have produced a large number of junior data scientists looking for work. We have gone from a time when it was hard to hire data scientists to a point where any job posting is inundated with hundreds of applications. There is still a shortage of senior data scientists and data science leaders, but with the recent surge in junior data scientists, this shortage should soon ease as well.
The data storage space has also seen a surge in options and flexibility. We have moved past the days when most employees could only access the data stored on their own computers in Excel or Access databases, or pull data from relational databases such as Oracle, Microsoft SQL Server or DB2 running on large servers in the business’s data center. The growth of cloud databases, NoSQL databases, Hadoop, and now Snowflake gives businesses the ability to store more data of different kinds and make it accessible to everyone who needs it.
Recently, there has been a recognition that existing data science workflows and data warehouse solutions can make it challenging to create the necessary features in production systems. Most data scientists write new features in tools and languages — Python, R, Alteryx — that are not used in production systems. This requires software engineering resources to translate the feature engineering code into the language of the production system. Beyond the challenge of finding time for this translation, moving the code from one language to another creates multiple chances for errors to be introduced into the production model. To help with this challenge, new specialized storage solutions, feature stores, have been developed to serve features directly to models in production.
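A small, hypothetical illustration of this translation problem: a feature prototyped in pandas during research often has to be re-expressed in the production system's language (here, an assumed SQL table named `purchases`) before it can be served. That means two implementations of the same logic, and two chances for them to disagree.

```python
import pandas as pd

# The feature as the data scientist defines it during research:
# each user's average purchase amount (toy data, hypothetical names).
df = pd.DataFrame({"user_id": [1, 1, 2], "purchase": [10.0, 30.0, 7.0]})
research_feature = df.groupby("user_id")["purchase"].transform("mean")

# The same feature, re-expressed by an engineer for a SQL-based
# production system (table and column names are assumptions):
sql_equivalent = """
SELECT user_id,
       AVG(purchase) OVER (PARTITION BY user_id) AS avg_purchase
FROM purchases
"""
print(research_feature.tolist())
```

A feature store aims to remove this duplication by letting both research and production read the same computed feature.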
As we mentioned at the beginning, data by itself is just a cost. But with the rise of visualization tools like Tableau, Looker and PowerBI, the business analysts who used to work in Excel with just the data they had on their computer can now leverage data warehouses and visualization tools to gain nearly complete views of the entire business with the ability to drill into details as necessary.
Machine learning and predictive analytics used to be the purview of a select few who could code the algorithms to train neural networks and random forests by hand. With the rise of Python and R packages, not to mention AutoML, more people are able to build machine learning models quickly and effectively. Packages like Scikit-learn, XGBoost, caret, Microsoft’s DMTK, TensorFlow from Google, Torch and Keras allow data scientists comfortable in Python or R to quickly build models in their language of choice. AutoML tools like DataRobot and H2O Driverless AI extend this ability to analysts who do not code. These tools allow for rapid model building and easy comparisons between different techniques.
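As a rough illustration of how little code these packages now require, here is a sketch using scikit-learn's bundled Iris dataset and a random forest; the dataset and hyperparameters are arbitrary choices for demonstration, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a toy dataset and hold out a test split.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Train a random forest in a few lines and score the held-out data.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy: {acc:.2f}")
```

Note that the hard part, producing a clean, feature-rich `X` from raw business data, is exactly what this conveniently pre-prepared dataset skips.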
Finally, with the recognition that many models never make it to production, and that those that do are often not monitored for performance, resources have poured into developing MLOps tools. These tools make it easy to take a model from a machine learning package or an AutoML tool and package it so that it is easily deployed, called to score new data, and integrated into production systems. In addition, most MLOps tools track the input data to detect data drift and monitor the predictions to alert if there is a change in the production system.
We’re fans of many of these tools, and they make the data scientist’s life easier and more productive. But there has been little development to help with the bulk of the data scientist’s work: the data slog. Data scientists are using tools that, at their core, were not built to help them with the portion of their work where they spend the most time.
In part two of this blog, we will dig deeper into this data slog: what the challenges are, why it is so important, and how tools can help improve the data preparation and feature engineering process.