# Rasgo SQL Generator

Rasgo is happy to announce the release of the SQL Generator, a browser tool that enables anyone to generate a complex SQL query without writing a line of code.

If you've ever worked with data, you've likely experienced that moment of terror when you know what you need to do to the data to get it in the right form, but you're not exactly sure how to express it in your database's dialect of SQL. As cloud data warehouses have grown in popularity, SQL has only become more important, and our users are always pushing us to add new transforms to our open source library of SQL transforms.

Now, you can access that same library and put it to use through the SQL Generator. In just a few clicks, you can:

- Upload / create your table schema
- Choose a SQL transform (like 'Moving Avg' or 'Clean')
- Pick your SQL syntax (e.g. Snowflake, BigQuery, PostgreSQL)
- Generate SQL!
- Copy the URL and share it with a friend

Try it out, and let us know what you think in Slack!

P.S. If you want to contribute your own useful SQL to the Rasgo SQL Generator, all you have to do is open a pull request in our open source repo. Thanks in advance!

With this dynamic generator, you input your column headers and the transform you are trying to replicate, and the generator produces the exact SQL you need to run in your cloud data warehouse!

Have a dataset in Excel? Feel free to import your CSV for even more customization!

Welcome to the Rasgo SQL Generator - this will be your user interface:

You can either scroll or search all of the different transforms. We are going to use “Filter” for this example.

- Columns can have different input types: the SQL Generator supports Text, Number, Decimal, Boolean (true/false), and Date.

- Base Table is the table the transform is applied to. As you can see in this example, the column options change depending on which base table you select.

In this example, you select the column you would like to filter, choose the filter function, and then type in your desired filter value. It's important to note that you can add additional filters by clicking "+ Add Item".
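Under the hood, a Filter transform like this compiles to a WHERE clause, with each added item becoming another AND-ed predicate. Here is a minimal sketch using a hypothetical ORDERS table, run against SQLite standing in for a cloud warehouse (the table, columns, and filter values are illustrative, not the generator's literal output):

```python
import sqlite3

# Hypothetical ORDERS table; the table, columns, and filter values are
# illustrative, not the generator's literal output.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, status TEXT, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "SHIPPED", 10.0), (2, "PENDING", 25.0), (3, "SHIPPED", 40.0)],
)

# Each "+ Add Item" filter becomes another AND-ed predicate in WHERE.
sql = """
SELECT *
FROM orders
WHERE status = 'SHIPPED'
  AND amount > 15
"""
rows = con.execute(sql).fetchall()
print(rows)  # [(3, 'SHIPPED', 40.0)]
```

The same WHERE pattern carries over to Snowflake, BigQuery, and the other supported dialects; only dialect-specific details change.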

First you pick which table you will be performing the transformation on.

In our example using the Filter transformation, you must then choose which filter you would like to use. The process is consistent across the other transformations: you will use these sections to adjust which function you would like to perform.

After choosing which filter you would like to use you must input the value that you are searching for.

Next, you need to choose which data warehouse you are using; the generator defaults to Snowflake.

The last step is to generate your SQL and copy it into your project!

Below is a list of all of the transforms currently supported by the Rasgo SQL Generator; this list will continue to be updated. It can also be found in the Rasgo Docs for greater clarification.

Groups rows by the group_by items, applying aggregation functions to the selected columns for each resulting group.
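An aggregate transform like this compiles to a GROUP BY with one aggregation per selected column. A minimal sketch against SQLite, with a hypothetical SALES table (names and aggregations are illustrative):

```python
import sqlite3

# Illustrative SALES table; SQLite stands in for a warehouse dialect.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("EAST", 10.0), ("EAST", 30.0), ("WEST", 5.0)])

# GROUP BY on the group_by items, one aggregation per selected column.
sql = """
SELECT region,
       SUM(amount) AS amount_sum,
       AVG(amount) AS amount_avg
FROM sales
GROUP BY region
ORDER BY region
"""
rows = con.execute(sql).fetchall()
print(rows)  # [('EAST', 40.0, 20.0), ('WEST', 5.0, 5.0)]
```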

Aggregate strings across rows by concatenating them together, and grouping by other columns.

Uses a text separator to aggregate the string values together, and returns a single column where the rows are the aggregated strings.
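Snowflake typically expresses this with LISTAGG; the closest SQLite equivalent, GROUP_CONCAT, is used in this sketch (the TAGS table and separator are illustrative):

```python
import sqlite3

# Illustrative TAGS table. Snowflake expresses this with LISTAGG;
# SQLite's GROUP_CONCAT shown here is the closest equivalent.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tags (item TEXT, tag TEXT)")
con.executemany("INSERT INTO tags VALUES (?, ?)",
                [("a", "red"), ("a", "blue"), ("b", "green")])

# One row per group; the separator joins the string values together.
sql = """
SELECT item, GROUP_CONCAT(tag, ', ') AS tags
FROM tags
GROUP BY item
ORDER BY item
"""
rows = con.execute(sql).fetchall()
print(rows)
```

Note that without an explicit ordering inside the aggregate, the concatenation order within each group is not guaranteed.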

A transform that accepts a custom template to execute. Must use the sql template argument source_table to reference the Rasgo dataset which will serve as the base of any SELECT.

This function will categorize or bin an input column such that for N bins, an output column is created with values [1-N] where each value represents some bin.

This transformation supports two binning methods (called "binning_type" in the arguments): ntile and equalwidth.
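The ntile method maps naturally onto SQL's NTILE window function, which assigns each row a bucket number from 1 to N by rank. A sketch with a hypothetical SCORES table (SQLite standing in for the warehouse):

```python
import sqlite3

# Illustrative SCORES table; NTILE assigns each row a bucket 1..N.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (score REAL)")
con.executemany("INSERT INTO scores VALUES (?)",
                [(10.0,), (20.0,), (30.0,), (40.0,)])

# binning_type = 'ntile' with 2 bins: equal-count buckets by rank.
sql = """
SELECT score, NTILE(2) OVER (ORDER BY score) AS bin
FROM scores
ORDER BY score
"""
rows = con.execute(sql).fetchall()
print(rows)  # [(10.0, 1), (20.0, 1), (30.0, 2), (40.0, 2)]
```

Equal-width binning instead divides the value range (max minus min) into N equally sized intervals, so bins can hold different row counts.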

Cast selected columns to a new type.

Cast data types, rename or drop columns, impute missing values, and filter values in a dataset.

This function creates a new column that concatenates fixed values and columns in your dataset.

Pass in a list named "concat_list", containing the names of the columns and the static string values to concatenate, in order.

Pass in a list of filter rules, and aggregate rows that match.

If multiple rules are passed, they are combined and aggregated both together and separately.

Run pairwise correlation on all numeric columns in the source_table.

Row-based; Calculates a cumulative aggregate based on a relative row window. Pass in order_by columns to create a row-based look-back or look-forward window. Example use case: Aggregate all sales for a customer from the beginning of time until this row.
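The sales-to-date example above corresponds to an unbounded look-back window. A sketch with a hypothetical SALES table (SQLite in place of the warehouse):

```python
import sqlite3

# Illustrative SALES table: all sales per customer from the beginning
# of time up to each row.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer TEXT, order_date TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("c1", "2021-01-01", 10.0),
    ("c1", "2021-02-01", 20.0),
    ("c1", "2021-03-01", 5.0),
])

# An unbounded look-back window, ordered by the order_by column.
sql = """
SELECT order_date,
       SUM(amount) OVER (
           PARTITION BY customer
           ORDER BY order_date
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS amount_to_date
FROM sales
ORDER BY order_date
"""
rows = con.execute(sql).fetchall()
print(rows)  # [('2021-01-01', 10.0), ('2021-02-01', 30.0), ('2021-03-01', 35.0)]
```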

Retrieves predictions from a DataRobot model that was deployed in Snowflake.

Increments a date by the specified interval value.

Calculates the difference between two date, time, or timestamp expressions based on the date or time part requested. Difference is calculated as date_1 - date_2.

Extracts a specific part of a date column. For example, if the input is '2021-01-01', you can ask for the year and get back 2021.

An exhaustive list of valid date parts can be found here.

This transform generates a date spine for your date index, which can replace your date index column for modeling.

All intervals are considered to be start-inclusive and end-exclusive, or [start, end). The join with the date spine will be an outer join, such that all intervals are present and all data that does not fall into one of those intervals is excluded.

If start_timestamp or end_timestamp are left blank, they will be automatically generated as the min or max timestamp in the 'date_col' column.

Generates a date spine across custom groupings. Each grouping will be returned with a continuous set of dates. Spines can be calculated globally using the table min/max or locally using each group min/max.

All intervals are considered to be start-inclusive and end-exclusive, or [start, end). The join with the date spine will be an outer join, such that all intervals are present and all data that does not fall into one of those intervals is excluded.

Truncates a date to the datepart you specify. For example, if you truncate the date '10-31-2022' to the 'month', you would get '10-1-2022'.

For a list of valid dateparts, refer to Supported Date and Time Parts

Deduplicate a table based on a passed-in composite key. Once an order column and an order method are selected, only the top record from the resulting grouped and ordered dataset will be kept.
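A common way to express this in SQL is ROW_NUMBER over the composite key, keeping only the first row per partition. A sketch with a hypothetical EVENTS table (names and ordering are illustrative):

```python
import sqlite3

# Illustrative EVENTS table; user_id is the composite key, ts is the
# order column, and descending order keeps the most recent record.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (user_id INTEGER, ts TEXT, action TEXT)")
con.executemany("INSERT INTO events VALUES (?, ?, ?)", [
    (1, "2021-01-01", "old"),
    (1, "2021-02-01", "new"),
    (2, "2021-01-15", "only"),
])

# Number the rows within each key and keep only the top record.
sql = """
SELECT user_id, ts, action
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY user_id ORDER BY ts DESC
           ) AS rn
    FROM events
)
WHERE rn = 1
ORDER BY user_id
"""
rows = con.execute(sql).fetchall()
print(rows)  # [(1, '2021-02-01', 'new'), (2, '2021-01-15', 'only')]
```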

Describes the dataset using a consistent set of metrics, based on data type.
Numeric: DTYPE, COUNT, NULL_COUNT, UNIQUE_COUNT, MOST_FREQUENT, MEAN, STD_DEV, MIN, _25_PERCENTILE, _50_PERCENTILE, _75_PERCENTILE, MAX
Other: DTYPE, COUNT, NULL_COUNT, UNIQUE_COUNT, MOST_FREQUENT, MIN, MAX

Drop columns by passing either an include_cols list of columns to include or an exclude_cols list of columns to exclude.

Passing both include_cols and exclude_cols will result in an error.

Remove missing values.

Extracts sequences of consecutive increases/decreases from a time-series dataset.

Apply one or more column filters to the dataset.

Analyze the hierarchical record count of a series of columns by counting the number of records in each pair of values in hierarchically adjacent columns.

Generate an x / y heatmap, which uses the number of rows in each x/y bin as a density overlay to a 2-d histogram.

Analyze the value distribution of a single continuous variable by binning it and calculating frequencies in each bin.

This function creates a new column based on the conditions provided in the conditions argument.

Output values should be of the same type, since they are constructing one new column.

A default value for the new column should be set, as should the output column name.

Impute missing values in one or more columns with the mean, median, mode, or a specified value.

Join a dataset with another dataset, by matching on one or more columns between the two tables.

If you pass a join_prefix, all column names in the join table will be named "{join_prefix}_{columnname}". If you don't pass a join_prefix, then for columns that share a name in both tables, only the column from the base table will be included in the final output.

Join one or more datasets together using SQL joins. Supported join types include INNER, LEFT, RIGHT.

Encode target labels with value between 0 and n_classes-1. See scikit-learn's LabelEncoder for full documentation.

Lag shifts your features on a partition index, creating a lookback feature offset by an amount. Lag supports generating multiple lags in one transform by generating each unique combination of columns and amounts from your inputs.
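In SQL this is the LAG window function: one new column per (column, amount) combination. A sketch with a hypothetical PRICES table:

```python
import sqlite3

# Illustrative PRICES table; the partition index is ticker, and the
# offset of 1 creates a one-row look-back feature.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE prices (ticker TEXT, day TEXT, price REAL)")
con.executemany("INSERT INTO prices VALUES (?, ?, ?)", [
    ("ABC", "2021-01-01", 100.0),
    ("ABC", "2021-01-02", 105.0),
    ("ABC", "2021-01-03", 103.0),
])

# One LAG column per (column, amount) combination.
sql = """
SELECT day, price,
       LAG(price, 1) OVER (
           PARTITION BY ticker ORDER BY day
       ) AS price_lag_1
FROM prices
ORDER BY day
"""
rows = con.execute(sql).fetchall()
print(rows)  # the first row has no prior value, so price_lag_1 is NULL
```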

Impute missing values in ALL columns with the latest value seen in rows prior.

Calculate the edit distance between pairwise combinations of string columns.

Fit a simple linear regression and return the formula. Optionally, use one or more group_by columns to create a regression per unique grouping.

Currently, only supports a single independent variable.

Analyze historical transaction contents to understand products that are frequently purchased together.

This approach uses a transactional table to aggregate each product purchased in a transaction, and then aggregates transactions together to look for common patterns.

Calculate one or more new columns using math functions.

This function scales a column by subtracting the min value in the column and dividing by the range between the max and min values.

If you omit minimums and maximums, the function will compute the minimum and maximum of each column. Pass minimums and maximums values if you want to override the calculation with static values for each column.
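The formula is (x - min) / (max - min), with the column-wide min and max computable via window functions when no static values are passed. A sketch against SQLite (table and column names are illustrative):

```python
import sqlite3

# Illustrative single-column table; min and max are computed over the
# whole column with window functions when no static values are passed.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (x REAL)")
con.executemany("INSERT INTO t VALUES (?)", [(0.0,), (5.0,), (10.0,)])

# (x - min) / (max - min) maps the column onto the [0, 1] range.
sql = """
SELECT x,
       (x - MIN(x) OVER ()) * 1.0
       / (MAX(x) OVER () - MIN(x) OVER ()) AS x_scaled
FROM t
ORDER BY x
"""
rows = con.execute(sql).fetchall()
print(rows)  # [(0.0, 0.0), (5.0, 0.5), (10.0, 1.0)]
```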

Generates moving averages per column and per window size.

Union n number of datasets with the 'base' dataset, using a common list of columns between each dataset, and selecting them in order.

One hot encode a column. Create a null value flag for the column if any of the values are NULL.
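One-hot encoding compiles to one CASE WHEN flag column per unique value, plus a NULL flag when missing values are present. A sketch with a hypothetical USERS table:

```python
import sqlite3

# Illustrative USERS table; one CASE WHEN flag column per unique value,
# plus a NULL flag because one value is missing.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, color TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)",
                [(1, "red"), (2, "blue"), (3, None)])

sql = """
SELECT id,
       CASE WHEN color = 'red'  THEN 1 ELSE 0 END AS color_red,
       CASE WHEN color = 'blue' THEN 1 ELSE 0 END AS color_blue,
       CASE WHEN color IS NULL  THEN 1 ELSE 0 END AS color_is_null
FROM users
ORDER BY id
"""
rows = con.execute(sql).fetchall()
print(rows)  # [(1, 1, 0, 0), (2, 0, 1, 0), (3, 0, 0, 1)]
```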

Order a dataset by specified columns, in a specified order.

This function determines which records in the table are an outlier based on a given statistical method (z-score, IQR, or manual threshold) and a target column. It produces a new column named 'OUTLIER_<target_column>' which is TRUE for records that are outliers, and FALSE for records that aren't.

Transpose unique values in a single column to generate multiple columns, aggregating as needed. The pivot will dynamically generate a column per unique value, or you can pass a list_of_vals with the unique values you wish to create columns for.

Add a prefix to each column name.

Create a ranking of values in a column.

Performs empirical bayesian estimation with shrinkage towards a beta prior. Given a dataset with a numerator and a denominator, will calculate the raw ratio as numerator / denom, as well as provide an adjusted ratio that shrinks the ratio towards the observed beta prior.

This is a simplified version that establishes the priors directly from the data given a min_cutoff count of observations.

NOTE: your data should already be aggregated before performing this operation.

Rename columns by passing a renames dict.

Returns the subject with the specified pattern (or all occurrences of the pattern) either removed or replaced by a replacement string. If no matches are found, returns the original subject.

Row-based; Calculates a rolling aggregate based on a relative row window.

Pass in order_by columns and offsets to create row-based look-back or look-forward windows.

Example use case: Aggregate the last 10 sales for a customer regardless of when they occurred.
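The last-10-sales example corresponds to a fixed ROWS window; a sketch with a 3-row look-back over a hypothetical SALES table (SQLite in place of the warehouse):

```python
import sqlite3

# Illustrative SALES table; a fixed row window of the last 3 sales per
# customer, regardless of the dates on which they occurred.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (customer TEXT, order_date TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("c1", "2021-01-01", 10.0),
    ("c1", "2021-02-01", 20.0),
    ("c1", "2021-03-01", 30.0),
    ("c1", "2021-04-01", 40.0),
])

# Offsets set the ROWS BETWEEN bounds of the look-back window.
sql = """
SELECT order_date,
       SUM(amount) OVER (
           PARTITION BY customer
           ORDER BY order_date
           ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
       ) AS amount_last_3
FROM sales
ORDER BY order_date
"""
rows = con.execute(sql).fetchall()
print(rows)  # [('2021-01-01', 10.0), ('2021-02-01', 30.0), ('2021-03-01', 60.0), ('2021-04-01', 90.0)]
```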

Calculates the relative strength index on a given column, on a row-by-row basis. Developed by J. Welles Wilder, the Relative Strength Index (RSI) is a momentum oscillator that measures the speed and change of price movements. RSI oscillates between zero and 100. According to Wilder, RSI is considered overbought when above 70 and oversold when below 30. Signals can also be generated by looking for divergences, failure swings and centerline crossovers. RSI can also be used to identify the general trend.

Pass in a partition_col, an order_col, and a lookback window size.

NOTE: Your data should be a properly formatted timeseries dataset before applying this transformation. In other words, each period should only appear once, and periods considered zero should be imputed with 0 already.

Example use case: On daily stock data, calculate RSI by TICKER, with a 14-period lookback window.

Take a sample of a dataset using a specific number of rows or a probability that each row will be selected.

Sample n rows for each value of a column.

Analyze the hierarchical record count of a series of columns by counting the number of records in each pair of values in hierarchically adjacent columns.

Simple join between two datasets that uses a 'USING' clause. Returns all columns from both tables in the result set.

Calculates the linear slope on a given row, looking backwards for a user-defined window of periods.

Pass in a partition_col, an order_col, and a lookback window size.

NOTE: Your data should be a properly formatted timeseries dataset before applying this transformation. In other words, each period should only appear once, and periods considered zero should be imputed with 0 already. NOTE: Slope calculations are notoriously sensitive to large outliers, especially with smaller windows.

Example use case: On daily stock data, calculate SLOPE by TICKER, with a 14-period lookback window.

Takes a delimiter and target column, splitting the column values by the delimiter into multiple columns.

This function scales a column by removing the mean and scaling by standard deviation.

If you omit averages and standarddevs, the function will compute the average and standard deviation of each column. Pass averages and standarddevs values if you want to override the calculation with static values for each column.

This function creates a new column that contains a substring of either a fixed value or another column in your dataset.

Add a suffix to each column name.

Compare categorical values across one or more metrics.

Encode a categorical column with the average value of a target column for the corresponding value of the categorical column.

See scikit-learn's TargetEncoder for full documentation.

Date-based; Calculates a rolling aggregate based on a relative datetime window.

Pass in a date column, date_part and offsets to create look-back or look-forward windows.

Example use case: Aggregate all sales for a product with order dates within 2 months of this current order.

Creates a column of a date/timestamp type from a string or other non-date column.

See this Snowflake doc for information about valid formats.

Label rows as part of the train or test set based on the percentage split you want to apply to the data.

If you want a row-wise random sample applied, do not pass an order_by column. If you want an ordered split, then pass the order_by column.
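One way to express an ordered split in SQL is to compare each row's running fraction (row number over total count) to the cutoff. A sketch of an 80/20 split over a hypothetical 10-row table:

```python
import sqlite3

# Illustrative 10-row table; an 80/20 ordered split. For a random
# split, order the ROW_NUMBER window by a random expression instead.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (id INTEGER)")
con.executemany("INSERT INTO obs VALUES (?)", [(i,) for i in range(1, 11)])

# Rows whose running fraction falls at or under the cutoff are TRAIN.
sql = """
SELECT id,
       CASE
           WHEN ROW_NUMBER() OVER (ORDER BY id) * 1.0
                / COUNT(*) OVER () <= 0.8
           THEN 'TRAIN' ELSE 'TEST'
       END AS split
FROM obs
ORDER BY id
"""
rows = con.execute(sql).fetchall()
labels = [split for _, split in rows]
print(labels)  # 8 'TRAIN' labels followed by 2 'TEST' labels
```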

Performs a SQL UNION or UNION ALL of the parent dataset and another dataset. The operation will only merge columns with matching column names in both datasets and will drop all other columns. Column data types are not validated.

Performs an UNPIVOT operation, rotating a table by transforming columns into rows.

If your data is in a data warehouse, you should take advantage of its performance and do as much work inside the warehouse as possible. This reduces the risk of working with incomplete data or making irreversible changes. The easiest way to achieve this is through SQL.

Although it is not perfect, SQL continues to be one of the top recommended skills for both analysts and data scientists, and there is no shortage of resources out there to enhance your skills. For example, here is a short list of some of our favorite learning materials:

For Beginners:

For Intermediate:

For Practice:

- SQLHabit
- SelectStarSQL - an interactive book
- SuperCoolData (expected 2022)

Additionally, for hands-on, step-by-step examples, Rasgo offers a number of tutorials to help data practitioners learn new skills or work through full accelerators.

This collection of data analysis tutorials contains step-by-step guides for common SQL data transformations. For each transformation, we provide background on the transform, why it's important, and the actual SQL you can copy/paste into existing tools to support your analysis.

Insight Accelerator Tutorials:

Insight Accelerators are step-by-step guides for using SQL to solve critical use cases that organizations face. For each accelerator, we provide background on the use case, why it's important, and the actual SQL you can copy/paste into existing tools to support your analysis.

Each Model Accelerator gives context of the business challenge, guides target variable identification, and recommends data sources for high-value features for a specific use case.