In this tutorial we will cover how to impute missing data using SQL, why it is important, and a step-by-step example.
Imputation is a technique used to clean data by filling in missing values with representative values deduced from the non-missing data.
Imputation is important because missing values commonly occur in data science and data analyst workflows. Missing values are a problem because they can not only mess up your data but also impact business decisions made off of insights derived from these base tables.
In many cases the missing values could be the result of joining two tables where one of the tables doesn’t have a matching record. Or, it could be created when generating aggregated transactions and there were no transactions during a certain period. In fact, in many cases, the missing data that occurs in data science workflows is missing because no data exists for a record. Often, the value zero needs to be imputed for the missing values. You would use Impute to add in these missing values so that further transformations can occur.
Let’s use the example of building a dashboard to show total and average sales amounts by promotion per week over time. We’ll demo this example using the internet sales data from AdventureWorks. Three data transformation steps need to occur to go from the raw form of the data to the dashboard-ready form. First, the data needs to be aggregated to the promotion-week level. Second, the missing weeks at the promotion-week level need to be created. Finally, weeks that were created (missing values) need zero to be imputed for the total sales and average sales amounts.
This can be done in a BI tool such as Tableau, PowerBI or Looker, but it’s faster and more consistent to calculate this directly in the database with SQL. This way, when the source data updates, the new data can also be cleaned up as needed.
To start with, the weekly aggregations need to be created. The first step is to get the week from the ORDERDATE by calling DATE_TRUNC. This can be done in a subquery, or as done here, with a Common Table Expression (CTE)
Next, to fill in the missing weeks, the above SELECT statement needs to be converted to a CTE as well as three new CTEs to define the global range of dates to consider, the individual weeks that need to be populated, and the combination of these two to create the actual weeks that will be joined in the final query.
These five CTEs can then be combined along with a final query to generate a row for each date-promotionkey combination.
This is complicated SQL and can be a challenge to develop and debug. Despite this challenge, with large datasets, the performance improvement from running this in the database can be worth it.
By following the steps above you have combined the power of Impute and the power of SQL to transform data directly in your cloud data warehouse. Please feel free to reach out to a member of our team if you have any additional questions.