In this tutorial you will see many examples of different calculations to aggregate data at multiple levels all designed to provide insights for your business. In all of these examples, new columns are created and potentially renamed. In some cases, the column exists in the data with a name that does not clearly explain what information it contains. Renaming these columns prior to analysis can be extremely useful.
Often only certain columns are kept, and you drop the columns that are not needed for the final analysis. In many analytics tasks more columns than are needed for the final analysis are carried along with the base data. This is often done as it is not clear whether or not a column will be needed until nearly the end of the analysis. Usually, this does not cause a problem with the final output. However, it can represent wasted work in a production process, and, more significantly, it can cause confusion for others when they try to extend the analysis to solve new problems.
For example: sometimes only certain classes of transactions are of interest. This could be products sold with or without a promotion, products purchased by customers in a certain location, etc. Being able to filter the columns to use only the data that is relevant before loading it into the modeling, visualization or other business intelligence tool can save significant time.
To understand all these issues, consider the problem of calculating a monthly purchase amount for products sold on promotion. Monthly sales averages and totals for internet sales that had a promotion are calculated, using the AdventureWorks sample data. These totals will then be used to populate a dashboard showing how sales are changing over time.
Since the AdventureWorks internet sales data is at the transaction level and multiple transactions can occur on a given day, the data needs to be aggregated to a monthly level. First, the relevant month needs to be calculated for each transaction.
Here, the monthly date gets assigned the name “DATE_TRUNC(MONTH, ORDERDATE)”. This is not convenient to work with. This column can be assigned a more meaningful name by assigning that name with the AS command.
In this analysis, the only resulting data that is needed is the MONTH, the PROMOTIONKEY (to determine if the sale occurred with a promotion) and the SALESAMOUNT; all the other columns can be dropped by only specifying the columns that are needed.
Using this block (and wrapping it in a ‘CREATE VIEW AS’ or common table expression, ‘WITH DATETRUNC AS …’) the monthly aggregates can be calculated. Here, a CTE is used.
This doesn’t quite work as the sum and mean are calculated for all transactions. Sales that occur with no promotion are identified in the data as having a PROMOTIONKEY of 1. To get sales that occur without a promotion, these transactions need to be excluded using a WHERE clause.
This where clause can be applied in the last SELECT statement.
Alternatively, it can be applied in the common table expression (CTE).
Either approach works.
By following the steps above you have combined the power of categorical aggregates and the power of SQL to transform data directly in your cloud data warehouse. Please feel free to reach out to a member of our team if you have any additional questions.