Cleaning Up Missing Data using SQL

In this tutorial we will cover how to impute missing data using SQL, why it is important, and a step-by-step example.

‍

What is Imputation and why is it important?

Imputation is a technique used to clean data by filling in missing values with representative values deduced from the non-missing data.

‍

Imputation is important because missing values commonly occur in data science and data analyst workflows. Missing values are a problem because they can not only mess up your data but also impact business decisions made off of insights derived from these base tables.

‍

In many cases the missing values could be the result of joining two tables where one of the tables doesn’t have a matching record. Or, it could be created when generating aggregated transactions and there were no transactions during a certain period. In fact, in many cases, the missing data that occurs in data science workflows is missing because no data exists for a record. Often, the value zero needs to be imputed for the missing values. You would use Impute to add in these missing values so that further transformations can occur.

‍

Imputation Example:

Let’s use the example of building a dashboard to show total and average sales amounts by promotion per week over time. We’ll demo this example using the internet sales data from AdventureWorks. Three data transformation steps need to occur to go from the raw form of the data to the dashboard-ready form. First, the data needs to be aggregated to the promotion-week level. Second, the missing weeks at the promotion-week level need to be created. Finally, weeks that were created (missing values) need zero to be imputed for the total sales and average sales amounts.

‍

This can be done in a BI tool such as Tableau, PowerBI or Looker, but it’s faster and more consistent to calculate this directly in the database with SQL. This way, when the source data updates, the new data can also be cleaned up as needed.

‍

SQL:

To start with, the weekly aggregations need to be created. The first step is to get the week from the ORDERDATE by calling DATE_TRUNC. This can be done in a subquery, or as done here, with a Common Table Expression (CTE)


WITH WEEKLY AS (SELECT *
, DATE_TRUNC(week, ORDERDATE) as WEEK 
FROM FACTINTERNETSALES)

SELECT
WEEK
, PROMOTIONKEY
, SUM(SALESAMOUNT) as SALESAMOUNT_SUM
, AVG(SALESAMOUNT) as SALESAMOUNT_AVG
FROM WEEKLY
GROUP BY WEEK, PROMOTIONKEY

‍

Next, to fill in the missing weeks, the above SELECT statement needs to be converted to a CTE as well as three new CTEs to define the global range of dates to consider, the individual weeks that need to be populated, and the combination of these two to create the actual weeks that will be joined in the final query.


WITH GLOBAL_SPINE AS (
SELECT ROW_NUMBER() OVER (ORDER BY NULL) as INTERVAL_ID
, DATEADD('week', (INTERVAL_ID - 1), '2010-12-01'::timestamp_ntz) as SPINE_START
, DATEADD('week', INTERVAL_ID, '2010-12-01'::timestamp_ntz) as SPINE_END
FROM TABLE (GENERATOR(ROWCOUNT => 165))), 

GROUPS AS (
SELECT PROMOTIONKEY
, MIN(WEEK) AS LOCAL_START
, MAX(WEEK) AS LOCAL_END
FROM WEEKLY_AGG
GROUP BY PROMOTIONKEY), 

GROUP_SPINE AS (
SELECT PROMOTIONKEY
, SPINE_START AS WEEK_START
, SPINE_END AS WEEK_END
FROM GROUPS G
CROSS JOIN LATERAL (SELECT SPINE_START, SPINE_END
    FROM GLOBAL_SPINE S))

‍

These five CTEs can then be combined along with a final query to generate a row for each date-promotionkey combination.


WITH WEEKLY AS (SELECT *
, DATE_TRUNC(week, ORDERDATE) as WEEK 
FROM FACTINTERNETSALES),

WEEKLY_AGG AS (SELECT
WEEK
, PROMOTIONKEY
, SUM(SALESAMOUNT) as SALESAMOUNT_SUM
, AVG(SALESAMOUNT) as SALESAMOUNT_AVG
FROM WEEKLY
GROUP BY WEEK, PROMOTIONKEY),

GLOBAL_SPINE AS (
SELECT ROW_NUMBER() OVER (ORDER BY NULL) as INTERVAL_ID
, DATEADD('week', (INTERVAL_ID - 1), '2010-12-01'::timestamp_ntz) as SPINE_START
, DATEADD('week', INTERVAL_ID, '2010-12-01'::timestamp_ntz) as SPINE_END
FROM TABLE (GENERATOR(ROWCOUNT => 165))), 

GROUPS AS (
SELECT PROMOTIONKEY
, MIN(WEEK) AS LOCAL_START
, MAX(WEEK) AS LOCAL_END
FROM WEEKLY_AGG
GROUP BY PROMOTIONKEY), 

GROUP_SPINE AS (
SELECT PROMOTIONKEY
, SPINE_START AS WEEK_START
, SPINE_END AS WEEK_END
FROM GROUPS G
CROSS JOIN LATERAL (SELECT SPINE_START, SPINE_END
    FROM GLOBAL_SPINE S))

SELECT 
G.PROMOTIONKEY AS PROMOTIONKEY
, WEEK_START
, COALESCE(T.SALESAMOUNT_SUM, 0)
, COALESCE(T.SALESAMOUNT_AVG, 0)
FROM GROUP_SPINE G
LEFT JOIN WEEKLY_AGG T
ON WEEK >= G.WEEK_START 
AND WEEK < G.WEEK_END 
AND G.PROMOTIONKEY = T.PROMOTIONKEY
ORDER BY WEEK_START, PROMOTIONKEY;

‍

This is complicated SQL and can be a challenge to develop and debug. Despite this challenge, with large datasets, the performance improvement from running this in the database can be worth it.

‍

Conclusion:

By following the steps above you have combined the power of Impute and the power of SQL to transform data directly in your cloud data warehouse. Please feel free to reach out to a member of our team if you have any additional questions.

‍

Data Analysis Tutorials

Cleaning Up Missing Data using SQL

Data Analysis Tutorials

What is Imputation and why is it important?

Imputation Example:

SQL:

Conclusion:

No-code/low-code data prep and visualization

Get your data science on.

Book a Enterprise GPT Demo

Data Analysis Tutorials

Cleaning Up Missing Data using SQL

Data Analysis Tutorials

What is Imputation and why is it important?

Imputation Example:

SQL:

Conclusion:

No-code/low-code data prep and visualization

Get your data science on.

Book a
Enterprise GPT Demo