How To Do Robust Scaler Normalization With Pandas and Scikit-learn

This tutorial explains how to use the robust scaler from scikit-learn. It normalizes the data by subtracting the median and dividing by the interquartile range (IQR), which makes it robust to outliers, unlike the standard scaler.
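
As a quick illustration (not from the tutorial's flight data), here is the formula applied to a tiny array with scikit-learn's default 25th/75th percentile quantile range:


from sklearn.preprocessing import RobustScaler
import numpy as np

# Toy data with one large outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# RobustScaler computes (x - median) / IQR per column
scaled = RobustScaler().fit_transform(x)
print(scaled.ravel())
# [-1.  -0.5  0.   0.5  48.5]  -> median 3, IQR 2; the outlier stays extreme
# but does not compress the scale of the other values
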

For this tutorial, you'll use data on flights in and out of NYC in 2013.

Packages

This tutorial uses:

  • pandas
  • statsmodels
  • statsmodels.api
  • numpy
  • scikit-learn
  • sklearn.model_selection
  • sklearn.preprocessing
  • category_encoders
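
If you don't already have these installed, all of them are available from PyPI, for example:


pip install pandas statsmodels numpy scikit-learn category_encoders
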

Open up a new Jupyter notebook and import the following:


import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
import category_encoders as ce

Reading the data

The data is from rdatasets imported using the Python package statsmodels.


df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()


RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   year            336776 non-null  int64  
 1   month           336776 non-null  int64  
 2   day             336776 non-null  int64  
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64  
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64  
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object 
 10  flight          336776 non-null  int64  
 11  tailnum         334264 non-null  object 
 12  origin          336776 non-null  object 
 13  dest            336776 non-null  object 
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64  
 16  hour            336776 non-null  int64  
 17  minute          336776 non-null  int64  
 18  time_hour       336776 non-null  object 
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB


Feature Engineering

Handle null values


df.isnull().sum()

year                 0
month                0
day                  0
dep_time          8255
sched_dep_time       0
dep_delay         8255
arr_time          8713
sched_arr_time       0
arr_delay         9430
carrier              0
flight               0
tailnum           2512
origin               0
dest                 0
air_time          9430
distance             0
hour                 0
minute               0
time_hour            0
dtype: int64


Since this model will predict arrival delay, the null values come from flights that were cancelled or diverted. These can be excluded from this analysis.


df.dropna(inplace=True)


Convert the times from floats or ints to hours and minutes

The raw times are encoded as HHMM (for example, 1530 means 15:30), so the hour is the value divided by 100, rounded down, and the minute is the remainder.


df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['flight'] = df.flight.astype(str)
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
                   

Prepare data for modeling

Set up train-test split


target = 'arr_delay'
y = df[target]
X = df.drop(columns=[target, 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
X_train.dtypes

month                 int64
day                   int64
carrier              object
flight               object
tailnum              object
origin               object
dest                 object
air_time            float64
distance              int64
dep_hour              int64
dep_minute            int64
arr_hour              int64
arr_minute            int64
sched_arr_hour        int64
sched_arr_minute      int64
sched_dep_hour        int64
sched_dep_minute      int64
dtype: object

Encode categorical variables

We convert the categorical features to numeric with the leave-one-out encoder from category_encoders. Leave-one-out encoding replaces each category with the mean of the target over all other rows in that category (excluding the current row), leaving a single numeric feature in place of each categorical feature. This is needed so the scaler can be applied to all features in the training data.


encoder = ce.LeaveOneOutEncoder(return_df=True)
# Fit the encoder on the training data only, then apply the same mapping to the test set
X_train_loo = encoder.fit_transform(X_train, y_train)
X_test_loo = encoder.transform(X_test)
X_train_loo.shape

(261876, 17)
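
To make the leave-one-out idea concrete, here is a minimal sketch on a toy frame (illustrative only, assuming the encoder's default settings with no added noise):


import pandas as pd
import category_encoders as ce

# Toy data: one categorical column and a numeric target
toy_X = pd.DataFrame({'cat': ['a', 'a', 'b', 'b']})
toy_y = pd.Series([1.0, 3.0, 2.0, 4.0])

toy_encoded = ce.LeaveOneOutEncoder().fit_transform(toy_X, toy_y)
print(toy_encoded['cat'].tolist())
# [3.0, 1.0, 4.0, 2.0] -> each row is encoded with the mean target of the
# other rows in its category, so a row's own target never leaks into its encoding
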

We apply the robust scaler from scikit-learn.


scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train_loo, y_train)
X_train_scaled.shape

(261876, 17)

X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_train_scaled_df.describe()

	month	day	carrier	flight	tailnum	origin	dest	air_time	distance	dep_hour	dep_minute	arr_hour	arr_minute	sched_arr_hour	sched_arr_minute	sched_dep_hour	sched_dep_minute
count	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000	261876.000000
mean	-0.071959	-0.018142	-0.081372	0.114950	0.036065	0.313371	-0.062060	0.198117	0.181391	0.017205	-0.076880	-0.034667	0.015306	0.004101	-0.032336	0.017205	-0.076880
std	0.569163	0.585523	0.714443	0.803131	0.781421	0.465258	0.671881	0.858414	0.835307	0.582418	0.535955	0.665654	0.559931	0.621451	0.580158	0.582418	0.535955
min	-1.000000	-1.000000	-2.465264	-4.611989	-6.179514	-0.068899	-3.291851	-1.000000	-0.918182	-1.000000	-0.805556	-1.875000	-0.935484	-1.875000	-1.000000	-1.000000	-0.805556
25%	-0.500000	-0.533333	-0.723248	-0.433402	-0.509045	-0.064665	-0.648931	-0.431193	-0.430682	-0.500000	-0.583333	-0.500000	-0.483871	-0.500000	-0.533333	-0.500000	-0.583333
50%	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
75%	0.500000	0.466667	0.276752	0.566598	0.490955	0.935335	0.351069	0.568807	0.569318	0.500000	0.416667	0.500000	0.516129	0.500000	0.466667	0.500000	0.416667
max	0.833333	1.000000	1.639266	12.749470	19.021181	0.935628	5.286232	5.192661	4.653409	1.250000	0.833333	1.125000	0.967742	1.000000	0.966667	1.250000	0.833333

Scale the test set with the scaler that was fit on the training data; this keeps test-set information out of the scaling. The scaled test data can now be passed into the predict or predict_proba functions of a trained model.


X_test_scaled = scaler.transform(X_test_loo)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_train.columns)
X_test_scaled_df.describe()

	month	day	carrier	flight	tailnum	origin	dest	air_time	distance	dep_hour	dep_minute	arr_hour	arr_minute	sched_arr_hour	sched_arr_minute	sched_dep_hour	sched_dep_minute
count	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000	65470.000000
mean	-0.074828	-0.013822	-0.084067	0.116335	0.036416	0.311938	-0.063946	0.202323	0.185636	0.019310	-0.076631	-0.033573	0.014072	0.006992	-0.029823	0.019310	-0.076631
std	0.567883	0.583688	0.714802	0.797747	0.770960	0.464947	0.679075	0.863958	0.840058	0.584118	0.536172	0.667538	0.559794	0.621859	0.583223	0.584118	0.536172
min	-1.000000	-1.000000	-2.266061	-3.428253	-3.853578	-0.064742	-3.044672	-0.990826	-0.918182	-1.000000	-0.805556	-1.875000	-0.935484	-1.875000	-1.000000	-1.000000	-0.805556
25%	-0.500000	-0.533333	-0.722841	-0.429206	-0.511570	-0.064742	-0.648951	-0.431193	-0.438636	-0.500000	-0.583333	-0.500000	-0.483871	-0.500000	-0.533333	-0.500000	-0.583333
50%	0.000000	0.000000	-0.528929	0.000202	0.002514	-0.000045	-0.000365	0.000000	0.000000	0.000000	0.000000	0.000000	0.016129	0.000000	0.000000	0.000000	0.000000
75%	0.500000	0.466667	0.276620	0.564906	0.487063	0.935338	0.350599	0.577982	0.581818	0.500000	0.416667	0.500000	0.516129	0.500000	0.500000	0.500000	0.416667
max	0.833333	1.000000	1.636003	7.225370	12.134945	0.935338	5.158320	5.110092	4.653409	1.250000	0.833333	1.125000	0.967742	1.000000	0.966667	1.250000	0.833333
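
To close the loop, here is a minimal sketch of training a model on the scaled data; the tutorial itself stops at scaling, and the choice of a plain linear regression here is only illustrative:


from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Fit a simple model on the scaled training data
model = LinearRegression()
model.fit(X_train_scaled, y_train)

# Predict arrival delay on the scaled test set and evaluate
y_pred = model.predict(X_test_scaled)
print(mean_absolute_error(y_test, y_pred))
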