Rasgo Product

SQL Generator

by Jared Parker on 6/22/2022

Rasgo is happy to announce the release of the SQL Generator, a browser tool that enables anyone to generate a complex SQL query without writing a line of code.

If you've ever worked with data, you've likely felt that moment of terror when you know exactly what you need to do to get the data into the right shape, but you're not sure how to express it in your database's dialect of SQL. As cloud data warehouses have grown in popularity, SQL has only become more important, and our users are always pushing us to add new transforms to our open source library of SQL transforms.

Now, you can access that same library and put it to use through the SQL Generator. In just a few clicks, you can:

  • Upload / create your table schema
  • Choose a SQL transform (like 'Moving Avg' or 'Clean')
  • Pick your SQL syntax (e.g. Snowflake, BigQuery, PostgreSQL)
  • Generate SQL!
  • Copy the URL and share it with a friend

Try it out, and let us know what you think in Slack!

P.S. If you want to contribute your own useful SQL to the Rasgo SQL Generator, all you have to do is open a pull request in our open source repo. Thanks in advance!

Rasgo SQL Generator - How it works!

With this dynamic generator, you input your column headers and the transform you want to replicate, and the generator produces the exact SQL you need to run in your cloud data warehouse.

Have a dataset in Excel? Feel free to import your CSV for even more customization!

Step 1: Open the SQL Generator

Welcome to the Rasgo SQL generator - this will be your user interface: 

[Screenshot: The Rasgo free SQL generator interface]

Step 2: Pick the transform you would like to perform! 

You can either scroll through or search all of the available transforms. We'll use “Filter” for this example.

[Screenshot: The different transforms offered by the SQL generator]

Step 3: Input your column names or Import a CSV

[Screenshot: Where you build out the different tables and columns]

A few things to note: 

  • Columns can have different types: the SQL generator supports Text, Number, Decimal, Boolean (true/false), and Date.
[Screenshot: Different column data types]
  • Base Table is the table used as the base for the transform. As you can see in this example, the column options change depending on the base table.
[Screenshot: Picking the columns you wish to transform]

[Screenshot: Picking the columns on a new base table]

Step 4: Adjust your transform and pick your data warehouse!

In this example, you pick the column you would like to filter, choose the filter function, and then type in your desired filter value. It’s important to note that you can add additional filters by clicking “+ Add Item”.

[Screenshot: Adjusting your transform]

First you pick which table you will be performing the transformation on.

[Screenshot: Picking the column you wish to transform]

In our Filter example, you must then choose which filter you would like to use. The process is consistent across the other transformations: you use these sections to adjust the function you would like to perform.

[Screenshot: Pick which filter you want]

After choosing which filter you would like to use, you must input the value that you are searching for.

[Screenshot: Input your desired filter value]

Next you need to choose which data warehouse you are using: the generator is pre-set to use Snowflake.

[Screenshot: Pick your desired cloud data warehouse]

Step 5: Generate your SQL and use it in your project

The last step is to generate your SQL and copy it into your project!

[Screenshot: The generated SQL in the Rasgo free SQL generator]
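For the Filter walkthrough above, the generated query might look something like this (the table and column names here are hypothetical, and the exact SQL depends on the dialect you pick):

```sql
-- Filter: keep only the rows where the chosen column matches the chosen value
SELECT *
FROM MY_TABLE
WHERE STATUS = 'active'
```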

List of all transforms:

Below is a list of all of the transforms currently supported by the Rasgo SQL Generator; this list will continue to be updated. It can also be found on the Rasgo Docs, with more detail.

Aggregate

Groups rows by the group_by items, applying aggregation functions to the selected columns for each resulting group.
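As a sketch, an Aggregate over a hypothetical ORDERS table, grouping by CUSTOMER_ID, might generate SQL like:

```sql
-- Group by CUSTOMER_ID and apply aggregation functions to ORDER_TOTAL
SELECT CUSTOMER_ID,
       SUM(ORDER_TOTAL) AS ORDER_TOTAL_SUM,
       AVG(ORDER_TOTAL) AS ORDER_TOTAL_AVG
FROM ORDERS
GROUP BY CUSTOMER_ID
```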

Aggregate String

Aggregate strings across rows by concatenating them together, and grouping by other columns.

Uses a text separator to aggregate the string values together, and returns a single column where the rows are the aggregated strings.

Apply

A transform that accepts a custom template to execute. Must use the sql template argument source_table to reference the Rasgo dataset which will serve as the base of any SELECT.

Bin

This function will categorize or bin an input column such that for N bins, an output column is created with values [1-N] where each value represents some bin.

This transformation supports two binning methods (called "binning_type" in the arguments): ntile and equalwidth.
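As a rough illustration in Snowflake syntax (the PRODUCTS table and PRICE column are hypothetical), the two binning methods map to different SQL:

```sql
-- binning_type = 'ntile': 4 bins with roughly equal row counts
SELECT *, NTILE(4) OVER (ORDER BY PRICE) AS PRICE_BIN
FROM PRODUCTS;

-- binning_type = 'equalwidth': 4 bins of equal width over a value range
SELECT *, WIDTH_BUCKET(PRICE, 0, 100, 4) AS PRICE_BIN
FROM PRODUCTS;
```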

Cast

Cast selected columns to a new type.

Clean

Cast data types, rename or drop columns, impute missing values, and filter values in a dataset.

Concat

This function creates a new column that concatenates fixed values and columns in your dataset.

Pass in a list named "concat_list", containing the names of the columns and the static string values to concatenate, in order.

Conditional Agg

Pass in a list of filter rules, and aggregate rows that match.

If multiple rules are passed, they are combined and aggregated both together and separately.

Correlation

Run pairwise correlation on all numeric columns in the source_table.

Cumulative Agg

Row-based; Calculates a cumulative aggregate based on a relative row window. Pass in order_by columns to create a row-based look-back or look-forward window. Example use case: Aggregate all sales for a customer from the beginning of time until this row.
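In window-function terms, that use case might be sketched like this (SALES, CUSTOMER_ID, SALE_DATE, and SALE_AMOUNT are hypothetical names):

```sql
-- Cumulative sum of sales per customer, from the first row up to this row
SELECT *,
       SUM(SALE_AMOUNT) OVER (
         PARTITION BY CUSTOMER_ID
         ORDER BY SALE_DATE
         ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS SALES_CUMULATIVE
FROM SALES
```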

Datarobot Score

Retrieves predictions from a DataRobot model that was deployed in Snowflake.

Dateadd

Increments a date by the specified interval value.

Datediff

Calculates the difference between two date, time, or timestamp expressions based on the date or time part requested. Difference is calculated as date_1 - date_2.

Datepart

Extracts a specific part of a date column. For example, if the input is '2021-01-01', you can ask for the year and get back 2021.

An exhaustive list of valid date parts can be found here.
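For example, extracting the year in Snowflake syntax (ORDERS and ORDER_DATE are hypothetical names):

```sql
-- '2021-01-01' -> 2021
SELECT ORDER_DATE,
       DATE_PART('year', ORDER_DATE) AS ORDER_YEAR
FROM ORDERS
```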

Datespine

This transform generates a date spine for your date index, which can replace your date index column for modeling.

All intervals are considered to be start-inclusive and end-exclusive, or [start, end). The join with the date spine will be an outer join such that all intervals are present and all data that does not fall into one of those intervals is excluded.

If start_timestamp or end_timestamp are left blank, they will be automatically generated as the min or max timestamp in the 'date_col' column.

Datespine Groups

Generates a date spine across custom groupings. Each grouping will be returned with a continuous set of dates. Spines can be calculated globally using the table min/max or locally using each group min/max.

All intervals are considered to be start-inclusive and end-exclusive, or [start, end). The join with the date spine will be an outer join such that all intervals are present and all data that does not fall into one of those intervals is excluded.

Datetrunc

Truncates a date to the datepart you specify. For example, if you truncate the date '10-31-2022' to the 'month', you would get '10-1-2022'.

For a list of valid dateparts, refer to Supported Date and Time Parts.

Dedupe

Deduplicate a table based on a passed-in composite key. Once an order column and an order method are selected, only the top record from the resulting grouped and ordered dataset will be kept.
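One common way to express this in Snowflake syntax is ROW_NUMBER with QUALIFY; here EVENTS, USER_ID, EVENT_TYPE, and EVENT_TS are hypothetical names:

```sql
-- Keep only the most recent record per (USER_ID, EVENT_TYPE) key
SELECT *
FROM EVENTS
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY USER_ID, EVENT_TYPE
  ORDER BY EVENT_TS DESC
) = 1
```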

Describe

Describes the dataset using a consistent set of metrics, based on data type.

Numeric: DTYPE, COUNT, NULL_COUNT, UNIQUE_COUNT, MOST_FREQUENT, MEAN, STD_DEV, MIN, _25_PERCENTILE, _50_PERCENTILE, _75_PERCENTILE, MAX

Other: DTYPE, COUNT, NULL_COUNT, UNIQUE_COUNT, MOST_FREQUENT, MIN, MAX

Drop Columns

Drop columns by passing either an include_cols list of columns to include or an exclude_cols list of columns to exclude.

Passing both include_cols and exclude_cols will result in an error.

Drop NA

Remove missing values.

Extract Sequences

Extracts sequences of consecutive increases or decreases from a time-series dataset.

Filter

Apply one or more column filters to the dataset.

Funnel

Analyze the hierarchical record count of a series of columns by counting the number of records in each pair of values in hierarchically adjacent columns.

Heatmap

Generate an x / y heatmap, which uses the number of rows in each x/y bin as a density overlay to a 2-d histogram.

Histogram

Analyze the value distribution of a single continuous variable by binning it and calculating frequencies in each bin.

If Then

This function creates a new column based on the conditions provided in the conditions argument.

Output values should be of the same type, since they are constructing one new column.

A default value for the new column should be set, as should the output column name.
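Under the hood, this kind of transform typically maps to a CASE expression; a sketch with hypothetical names:

```sql
-- New ORDER_SIZE column built from conditions, with a default of 'small'
SELECT *,
       CASE
         WHEN ORDER_TOTAL >= 100 THEN 'large'
         WHEN ORDER_TOTAL >= 20  THEN 'medium'
         ELSE 'small'
       END AS ORDER_SIZE
FROM ORDERS
```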

Impute

Impute missing values in one or more columns with the mean, median, mode, or a specified value.

Join

Join a dataset with another dataset, by matching on one or more columns between the two tables.

If you pass a join_prefix, all column names from the join table will be named "{join_prefix}_{columnname}". If you don't pass a join_prefix, then for columns that share the same name in both tables, only the column from the base table will be included in the final output.
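A sketch of the prefixed case, with hypothetical ORDERS and CUSTOMERS tables and a join_prefix of CUST:

```sql
-- Join-table columns are renamed with the CUST_ prefix
SELECT o.*,
       c.NAME   AS CUST_NAME,
       c.REGION AS CUST_REGION
FROM ORDERS o
LEFT JOIN CUSTOMERS c
  ON o.CUSTOMER_ID = c.CUSTOMER_ID
```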

Joins

Join one or more datasets together using SQL joins. Supported join types include INNER, LEFT, RIGHT.

Label Encode

Encode target labels with value between 0 and n_classes-1. See scikit-learn's LabelEncoder for full documentation.

Lag

Lag shifts your features on a partition index, creating a lookback feature offset by an amount. Lag supports generating multiple lags in one transform by generating each unique combination of columns and amounts from your inputs.
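For instance, generating 1-row and 7-row lags in one pass might look like this (names hypothetical):

```sql
-- Lookback features per customer, ordered by sale date
SELECT *,
       LAG(SALE_AMOUNT, 1) OVER (
         PARTITION BY CUSTOMER_ID ORDER BY SALE_DATE
       ) AS SALE_AMOUNT_LAG_1,
       LAG(SALE_AMOUNT, 7) OVER (
         PARTITION BY CUSTOMER_ID ORDER BY SALE_DATE
       ) AS SALE_AMOUNT_LAG_7
FROM SALES
```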

Latest

Impute missing values in ALL columns with the latest value seen in rows prior.

Levenshtein

Calculate the edit distance between pairwise combinations of string columns.

Linear Regression

Fit a simple linear regression and return the formula. Optionally, use one or more group_by columns to create a regression per unique grouping. 

Currently, only supports a single independent variable.

Market Basket

Analyze historical transaction contents to understand products that are frequently purchased together.

This approach uses a transactional table to aggregate each product purchased in a transaction, and then aggregates transactions together to look for common patterns.

Math

Calculate one or more new columns using math functions.

Min Max Scaler

This function scales a column by subtracting the min value in the column and dividing by the range between the max and min values.

If you omit minimums and maximums, the function will compute the min and max of each column. Pass minimums and maximums values if you want to override the calculation with static values for each column.

Moving Avg

Generates moving averages per column and per window size. 

Multi Union

Union n number of datasets with the 'base' dataset, using a common list of columns between each dataset, and selecting them in order.

One Hot Encode

One hot encode a column. Create a null value flag for the column if any of the values are NULL.

Order

Order a dataset by specified columns, in a specified order.

Outliers

This function determines which records in the table are outliers based on a given statistical method (z-score, IQR, or manual threshold) and a target column. It produces a new column named 'OUTLIER_<target_column>' which is TRUE for records that are outliers, and FALSE for records that aren't.

Pivot

Transpose unique values in a single column to generate multiple columns, aggregating as needed. The pivot will dynamically generate a column per unique value, or you can pass a list_of_vals with the unique values you wish to create columns for.
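In Snowflake syntax, passing a list_of_vals of ('EAST', 'WEST') might yield a query along these lines (SALES, REGION, and AMOUNT are hypothetical names):

```sql
-- One column per listed REGION value, aggregating AMOUNT with SUM
SELECT *
FROM (SELECT PRODUCT, REGION, AMOUNT FROM SALES)
PIVOT (SUM(AMOUNT) FOR REGION IN ('EAST', 'WEST'))
```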

Prefix

Add a prefix to each column name.

Rank

Create a ranking of values in a column.

Ratio With Shrinkage

Performs empirical bayesian estimation with shrinkage towards a beta prior. Given a dataset with a numerator and a denominator, will calculate the raw ratio as numerator / denom, as well as provide an adjusted ratio that shrinks the ratio towards the observed beta prior.

This is a simplified version that establishes the priors directly from the data given a min_cutoff count of observations.

NOTE: your data should already be aggregated before performing this operation.

Rename

Rename columns by passing a renames dict.

Replace String

Returns the subject with the specified pattern (or all occurrences of the pattern) either removed or replaced by a replacement string. If no matches are found, returns the original subject.

Rolling Agg

Row-based; Calculates a rolling aggregate based on a relative row window.

Pass in order_by columns and offsets to create a row-based look-back or look-forward windows.

Example use case: Aggregate the last 10 sales for a customer regardless of when they occurred.
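That example might be sketched as a row-based window (names hypothetical):

```sql
-- Sum of the last 10 sales per customer, regardless of when they occurred
SELECT *,
       SUM(SALE_AMOUNT) OVER (
         PARTITION BY CUSTOMER_ID
         ORDER BY SALE_DATE
         ROWS BETWEEN 9 PRECEDING AND CURRENT ROW
       ) AS SALES_LAST_10
FROM SALES
```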

RSI

Calculates the relative strength index on a given column, on a row-by-row basis. Developed by J. Welles Wilder, the Relative Strength Index (RSI) is a momentum oscillator that measures the speed and change of price movements. RSI oscillates between zero and 100. According to Wilder, RSI is considered overbought when above 70 and oversold when below 30. Signals can also be generated by looking for divergences, failure swings and centerline crossovers. RSI can also be used to identify the general trend.

Pass in a partition_col, an order_col, and a lookback window size. 

NOTE: Your data should be a properly formatted timeseries dataset before applying this transformation. In other words, each period should only appear once, and periods considered zero should be imputed with 0 already.

Example use case: On daily stock data, calculate RSI by TICKER, with a 14-period lookback window.

Sample

Take a sample of a dataset using a specific number of rows or a probability that each row will be selected.

Sample Class

Sample n rows for each value of a column.

Sankey

Analyze the hierarchical record count of a series of columns by counting the number of records in each pair of values in hierarchically adjacent columns.

Simple Join

Simple join between two datasets that uses a 'USING' clause. Returns all columns from both tables in the result set.

Sliding Slope

Calculates the linear slope on a given row, looking backwards for a user-defined window of periods.

Pass in a partition_col, an order_col, and a lookback window size.

NOTE: Your data should be a properly formatted timeseries dataset before applying this transformation. In other words, each period should only appear once, and periods considered zero should be imputed with 0 already. NOTE: Slope calculations are notoriously sensitive to large outliers, especially with smaller windows.

Example use case: On daily stock data, calculate SLOPE by TICKER, with a 14-period lookback window.

Split

Takes a delimiter and target column, splitting the column values by the delimiter into multiple columns.

Standard Scaler

This function scales a column by removing the mean and scaling by standard deviation.

If you omit averages and standarddevs, the function will compute the average and standard deviation of each column. Pass averages and standarddevs values if you want to override the calculation with static values for each column.

Substring

This function creates a new column that contains a substring of either a fixed value or another column in your dataset.

Suffix

Add a suffix to each column name.

Table

Compare categorical values across one or more metrics.

Target Encode

Encode a categorical column with the average value of a target column for the corresponding value of the categorical column.

See scikit-learn's TargetEncoder for full documentation.

Timeseries Agg

Date-based; Calculates a rolling aggregate based on a relative datetime window.

Pass in a date column, date_part and offsets to create look-back or look-forward windows.

Example use case: Aggregate all sales for a product with order dates within 2 months of this current order.

To Date

Creates a column of a date/timestamp type from a string or other non-date column.

See this Snowflake doc for information about valid formats.

Train Test Split

Label rows as part of the train or test set based on the percentage split you want to apply to the data.

If you want a row-wise random sample applied, do not pass an order_by column. If you want an ordered split, then pass the order_by column.

Union

Performs a SQL UNION or UNION ALL for the parent dataset and another dataset. The operation will only merge columns with matching names in both datasets and drop all other columns. Column data types are not validated.
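A sketch of the resulting query, assuming two hypothetical datasets that share CUSTOMER_ID and ORDER_TOTAL:

```sql
-- Only the matching column names are merged; all other columns are dropped
SELECT CUSTOMER_ID, ORDER_TOTAL FROM ORDERS_2021
UNION ALL
SELECT CUSTOMER_ID, ORDER_TOTAL FROM ORDERS_2022
```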

Unpivot

Performs an UNPIVOT operation, rotating a table by transforming columns into rows.

Why SQL / Other Resources:

If your data is in a data warehouse, you should take advantage of its performance and do as much inside the warehouse as possible. This reduces the risk of working with incomplete data or making irreversible changes outside the warehouse. The easiest way to achieve this is through SQL.

Although it is not perfect, SQL continues to be one of the top recommended skills for both analysts and data scientists, and there is no shortage of resources out there to enhance your skills. For example, here is a short list of some of our favorite learning materials:

For Beginners:

For Intermediate:

For Practice:

Additionally, for very hands-on, step-by-step examples, Rasgo offers a number of tutorials to help data practitioners learn skills or work through full accelerators.

Data Analysis Tutorials:

This collection of data analysis tutorials contains step-by-step guides for common SQL data transformations. For each transformation, we provide background on the transform, why it's important, and the actual SQL you can copy/paste into existing tools to support your analysis.

Insight Accelerator Tutorials:

Insight Accelerators are step-by-step guides for using SQL to solve critical use cases that organizations face. For each accelerator, we provide background on the use case, why it's important, and the actual SQL you can copy/paste into existing tools to support your analysis.

Model Accelerator Tutorials:

Each Model Accelerator gives context of the business challenge, guides target variable identification, and recommends data sources for high-value features for a specific use case.
