In this tutorial we will cover what a missing value is, what causes them, and a step-by-step example of how to fill in missing values using SQL.
In almost any analytics project, a large amount of time will be spent cleaning the data before beginning analysis. One of the most common tasks is to replace or impute missing values. Missing values can occur due to failure to collect the data in the first place, errors in recording the data, or from SQL JOINS that do not have a matching record. Often it is necessary to replace the values.
In this example we will be looking at sales counts according to different dates. Consider the following data stored in a table SALES_COUNTS
In this data the sales counts for January 23, 2019, and February 13, 2019, are both missing. If this was missing due to a SQL JOIN failing to find a single record, we would want to replace it with a 0. Alternatively, we may want to replace the missing values with the mode, 8, the median value, also 8, or the average, 7.
Replacing missing variables in Excel is straightforward. However, in many problems faced in a business setting, the data exists in a data warehouse and needs to be extracted to the Excel spreadsheet before analysis can begin. At best, this is an added step and wastes time downloading the data. At worst, this can require filtering in the database as there is too much data for Excel to handle. Even if all the data fit in Excel, it can be very tedious to calculate the imputed value and fill in the missing values with that new data. In either case, it is easy to perform these imputations directly in the data warehouse and just return the result.
The COALESCE function in SQL can be used to replace NULL (missing values). The function takes a list of arguments
And returns the first argument that is not NULL. This means that we can replace the missing value in a given column
In particular, to replace a missing value in Sales_counts with 0, use
If instead, we wanted to replace the missing value by the mean, we would use
Here we can’t simply call AVG as the average function by itself is an aggregate function and will try to aggregate data into a smaller number of rows. If called with a GROUP BY statement, this would return a single row for each group with the average for that group. Alternatively, if called without a GROUP BY, it would return a single row containing the average value over the entire table.
In order to get an average for each row in the original table, we use a WINDOW function. The general syntax of a WINDOW function is
Where the PARTITION BY and ORDER BY specify what window to calculate FUNCTION. As we want the mean of all the rows in the data, we replace FUNCTION with AVG and everything in the parentheses after OVER by just () to specify that we want the average over all rows in the table.
The missing value can be replaced by the median by replacing the AVG with MEDIAN.
And for the mode
Using the SQL COALESCE function, it is easy to replace missing or NULL values in SELECT statements. Specific values can be set directly with COALESCE and the mean, median or mode can be used by combining COALESCE with WINDOW functions.
By following the steps above you have combined the power of filling missing values and SQL to transform data directly in your cloud data warehouse. Please feel free to reach out to a member of our team if you have any additional questions.