Data Engineers are experts at building automated data pipelines that acquire, transform, and load data from source systems into a data warehouse. That data is typically stored in structures designed to support pre-built business intelligence reports and ad hoc queries by business analysts. Data Scientists, on the other hand, are experts in statistical and mathematical analysis and in applying algorithms to train machine learning models. While Data Scientists typically require data from the same source systems as business analysts, the data must be organized and structured very differently to work with machine learning algorithms. Further, Data Scientists often require additional data, some of which is external to the organization.
The Prepare stage of the data science lifecycle requires you to get the source data in shape for model training. Unfortunately, the work performed in this stage does not fall squarely into the domain of either the Data Engineer or the Data Scientist; it requires a hybrid skill set. Picture a Venn diagram: data science expertise is needed to explore and identify the source data that will serve as inputs to train models, while data engineering skills are used to write the code that transforms that data into optimized machine learning features.
Unfortunately, this hybrid job function doesn’t exist. As a result, Data Scientists, who are typically more proficient at writing software than Data Engineers are at data science, take ownership of the Prepare stage. However, writing the software to extract, explore, and clean data, identify joins between datasets, create join keys, and perform data transformations is highly complex and requires data source expertise. If a project can’t get through data extraction, exploration, and preparation, then Data Scientists can’t engineer features. Without features, models can’t be trained. Without models, predictions can’t be made. Without predictions, the business questions driving projects cannot be answered.
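To make the Prepare-stage work concrete, here is a minimal sketch of the kind of code it involves: cleaning raw extracts, normalizing a join key, joining datasets, and aggregating the result into model-ready features. The datasets, column names, and cleaning rules are all hypothetical, chosen purely for illustration; real projects involve many more sources and far messier data.

```python
import pandas as pd

# Hypothetical raw extracts from two source systems; names and values are illustrative.
orders = pd.DataFrame({
    "order_id": [101, 102, 103, 104],
    "cust": ["A-1", "a-1 ", "B-2", "C-3"],        # inconsistent casing and whitespace
    "amount": [250.0, None, 125.5, 80.0],          # missing value to clean
})
customers = pd.DataFrame({
    "customer_id": ["A-1", "B-2", "C-3"],
    "segment": ["enterprise", "smb", "smb"],
})

# 1. Clean: normalize the customer key and impute the missing amount.
orders["cust_key"] = orders["cust"].str.strip().str.upper()
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# 2. Join: the normalized key lets the two datasets be merged.
joined = orders.merge(
    customers, left_on="cust_key", right_on="customer_id", how="left"
)

# 3. Transform: aggregate into one feature row per customer for model training.
features = (
    joined.groupby("customer_id")
    .agg(total_spend=("amount", "sum"), order_count=("order_id", "count"))
    .reset_index()
)
```

Even this toy version shows why the work demands both skill sets: deciding that the median is a sensible imputation is a data science judgment, while knowing that the key needs trimming and case-folding before the join will succeed is source-system expertise.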
Data Scientists, stuck in the model preparation abyss, expend their energy, time, and the project’s budget preparing to do data science. After suffering through a failed project or two, it’s easy to conclude that data science and machine learning just don’t work in practice: you’re glad you tried it, it was a good experiment, but it’s time to move on. Most blame the media, vendors, and industry pundits for overhyping the technology and making enticing but hyperbolic statements about the outsized benefits for organizations that jump on the data science gravy train.
While there is some validity to those accusations, and there is certainly no shortage of hype and unsubstantiated claims in the industry, the truth is that most organizations went all in on data science and machine learning before the industry, frankly, was ready. There have been significant innovations at the bookends of the data science lifecycle: new cloud data warehousing platforms have drastically improved data acquisition and access, while innovations in model training tools and infrastructure have made the Model and Predict stages more efficient. Yet the Prepare stage has been largely untouched by the industry. Data Scientists, through no fault of their own, have been forced to execute the Sisyphean task of getting raw data into model training shape without proper arrows in their quiver. In hindsight, it’s not surprising that 87% of data science projects fail.
At Rasgo, we’re helping Data Scientists get out of the model preparation abyss through our centralized Accelerated Modeling Preparation (AMP) platform. With Rasgo, data science teams collaborate on features; explore and identify applicable source data; prepare, join, and transform data using built-in platform capabilities; create training datasets; and access their training data in their modeling tool of choice through direct integrations with Rasgo.
Interested in learning more about Rasgo? Schedule a time to talk with one of our data science experts.