This tutorial explains how to use the robust scaler from scikit-learn. The scaler centers each feature by subtracting its median and scales it by dividing by its interquartile range (IQR). Because the median and the IQR are barely affected by extreme values, this scaler is robust to outliers, unlike the standard scaler, which relies on the mean and standard deviation.
For this tutorial you'll be using data for flights that departed from NYC airports in 2013.
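To make the definition above concrete, here is a minimal sketch that applies the formula by hand to a small array and compares the result with RobustScaler. The sample values are made up for illustration, and the comparison assumes the scaler's default settings (centering on the median and scaling by the range between the 25th and 75th percentiles).

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Made-up sample with one extreme outlier (illustrative values only).
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [1000.0]])

# Manual robust scaling: subtract the median, divide by the interquartile range.
median = np.median(x, axis=0)
q25, q75 = np.percentile(x, [25, 75], axis=0)
manual = (x - median) / (q75 - q25)

# RobustScaler with its default settings, for comparison.
scaled = RobustScaler().fit_transform(x)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

The outlier still ends up far from the other points after scaling, but it does not distort the scale applied to them, since the median and IQR are computed from the bulk of the data.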
Packages
This tutorial uses statsmodels, pandas, numpy, scikit-learn, and category_encoders.
Open up a new Jupyter notebook and import the following:
import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler
import category_encoders as ce
Reading the data
The data is from rdatasets imported using the Python package statsmodels.
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 336776 non-null int64
1 month 336776 non-null int64
2 day 336776 non-null int64
3 dep_time 328521 non-null float64
4 sched_dep_time 336776 non-null int64
5 dep_delay 328521 non-null float64
6 arr_time 328063 non-null float64
7 sched_arr_time 336776 non-null int64
8 arr_delay 327346 non-null float64
9 carrier 336776 non-null object
10 flight 336776 non-null int64
11 tailnum 334264 non-null object
12 origin 336776 non-null object
13 dest 336776 non-null object
14 air_time 327346 non-null float64
15 distance 336776 non-null int64
16 hour 336776 non-null int64
17 minute 336776 non-null int64
18 time_hour 336776 non-null object
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB
Feature Engineering
Handle null values
The number of null values in each column is:
year 0
month 0
day 0
dep_time 8255
sched_dep_time 0
dep_delay 8255
arr_time 8713
sched_arr_time 0
arr_delay 9430
carrier 0
flight 0
tailnum 2512
origin 0
dest 0
air_time 9430
distance 0
hour 0
minute 0
time_hour 0
dtype: int64
This model will predict arrival delay, so rows with a null arr_delay have no target value. These nulls come from flights that were cancelled or diverted, and they can be excluded from this analysis.
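The code for this step is not shown, so the following is only a sketch of one way to do it, using a tiny made-up frame in place of the flights data. The name `flights` and its values are hypothetical. It counts the nulls per column and then drops the rows that lack a target value.

```python
import numpy as np
import pandas as pd

# Tiny made-up stand-in for the flights data (illustrative values only).
flights = pd.DataFrame({
    "arr_time": [830.0, np.nan, 1215.0, 1742.0],
    "arr_delay": [11.0, np.nan, -4.0, np.nan],
    "carrier": ["UA", "AA", "B6", "DL"],
})

# Count the null values in each column.
null_counts = flights.isnull().sum()
print(null_counts)

# Keep only the rows that have a target value.
flights_clean = flights.dropna(subset=["arr_delay"]).reset_index(drop=True)
print(len(flights_clean))
```

On the real data, the equivalent call under the same assumption would be `df = df.dropna(subset=['arr_delay'])`. The 261,876 training rows and 65,470 test rows reported in the scaled output add up to 327,346, the number of non-null arr_delay values, which is consistent with dropping on this column.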
Convert the times from floats or ints to hours and minutes
# Clock times are stored as HHMM numbers (e.g. 1530), so dividing by 100 splits off the hour
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
# Treat the flight number as a categorical label rather than a quantity
df['flight'] = df.flight.astype(str)
# In nycflights13, hour and minute are derived from the scheduled departure time
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
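The same HHMM split can also be written without a row-by-row lambda. The sketch below uses made-up times and assumes the column contains no nulls, as is the case for arr_time once the cancelled and diverted flights are removed. Integer division and the remainder are applied to the whole column at once.

```python
import pandas as pd

# Made-up HHMM clock times stored as floats, like the arr_time column.
times = pd.Series([517.0, 1530.0, 2359.0, 5.0])

# Integer division by 100 gives the hour; the remainder gives the minute.
hours = (times // 100).astype(int)
minutes = (times % 100).astype(int)

print(pd.DataFrame({"time": times, "hour": hours, "minute": minutes}))
```

Applied to the tutorial's data, this would look like `df['arr_hour'] = (df.arr_time // 100).astype(int)`, which should give the same values as the lambda version for non-negative times.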
Prepare data for modeling
Set up train-test split
target = 'arr_delay'
y = df[target]
X = df.drop(columns=[target, 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
X_train.dtypes
month int64
day int64
carrier object
flight object
tailnum object
origin object
dest object
air_time float64
distance int64
dep_hour int64
dep_minute int64
arr_hour int64
arr_minute int64
sched_arr_hour int64
sched_arr_minute int64
sched_dep_hour int64
sched_dep_minute int64
dtype: object
Encode categorical variables
We convert the categorical features to numeric ones with the leave-one-out encoder from category_encoders. It replaces each categorical feature with a single numeric feature, so the number of columns stays the same. This step is needed because the scaler can only be applied to numeric features.
encoder = ce.LeaveOneOutEncoder(return_df=True)
X_train_loo = encoder.fit_transform(X_train, y_train)
X_test_loo = encoder.transform(X_test)
X_train_loo.shape
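To show what leave-one-out encoding computes, here is a simplified sketch in plain pandas with made-up data. It is not the library's implementation and it omits edge cases that category_encoders handles, such as categories with a single row or categories not seen during fitting. Each training row is encoded as the mean target of the other rows in its category, while new rows get the plain category mean from the training data.

```python
import pandas as pd

# Made-up training data: one categorical feature and a numeric target.
train = pd.DataFrame({
    "carrier": ["UA", "UA", "UA", "AA", "AA"],
    "arr_delay": [10.0, 20.0, 30.0, 0.0, 4.0],
})

grouped = train.groupby("carrier")["arr_delay"]
cat_sum = grouped.transform("sum")      # per-row sum of the target in its category
cat_count = grouped.transform("count")  # per-row size of its category
category_means = grouped.mean()         # one mean per category

# Training rows: mean target of the OTHER rows in the same category.
train["carrier_loo"] = (cat_sum - train["arr_delay"]) / (cat_count - 1)

# New rows (for example the test set): per-category mean from the training data.
test = pd.DataFrame({"carrier": ["AA", "UA"]})
test["carrier_loo"] = test["carrier"].map(category_means)

print(train)
print(test)
```

Excluding a row's own target from its encoding reduces how much the feature leaks the target during training. It also means that training rows in the same category receive slightly different encoded values, while test rows in the same category share one value.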
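We apply the robust scaler from scikit-learn, fitting it on the training data only. fit_transform returns a NumPy array, so we wrap the result in a DataFrame to inspect it.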
scaler = RobustScaler()
X_train_scaled = scaler.fit_transform(X_train_loo, y_train)
X_train_scaled.shape
X_train_scaled_df = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_train_scaled_df.describe()
month day carrier flight tailnum origin dest air_time distance dep_hour dep_minute arr_hour arr_minute sched_arr_hour sched_arr_minute sched_dep_hour sched_dep_minute
count 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000
mean -0.071959 -0.018142 -0.081372 0.114950 0.036065 0.313371 -0.062060 0.198117 0.181391 0.017205 -0.076880 -0.034667 0.015306 0.004101 -0.032336 0.017205 -0.076880
std 0.569163 0.585523 0.714443 0.803131 0.781421 0.465258 0.671881 0.858414 0.835307 0.582418 0.535955 0.665654 0.559931 0.621451 0.580158 0.582418 0.535955
min -1.000000 -1.000000 -2.465264 -4.611989 -6.179514 -0.068899 -3.291851 -1.000000 -0.918182 -1.000000 -0.805556 -1.875000 -0.935484 -1.875000 -1.000000 -1.000000 -0.805556
25% -0.500000 -0.533333 -0.723248 -0.433402 -0.509045 -0.064665 -0.648931 -0.431193 -0.430682 -0.500000 -0.583333 -0.500000 -0.483871 -0.500000 -0.533333 -0.500000 -0.583333
50% 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
75% 0.500000 0.466667 0.276752 0.566598 0.490955 0.935335 0.351069 0.568807 0.569318 0.500000 0.416667 0.500000 0.516129 0.500000 0.466667 0.500000 0.416667
max 0.833333 1.000000 1.639266 12.749470 19.021181 0.935628 5.286232 5.192661 4.653409 1.250000 0.833333 1.125000 0.967742 1.000000 0.966667 1.250000 0.833333
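Scale the test set with the scaler that was fitted on the training data, using transform rather than fit_transform. The result can then be passed to the predict function (or predict_proba, for a classifier) of a trained model.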
X_test_scaled = scaler.transform(X_test_loo)
X_test_scaled_df = pd.DataFrame(X_test_scaled, columns=X_train.columns)
X_test_scaled_df.describe()
month day carrier flight tailnum origin dest air_time distance dep_hour dep_minute arr_hour arr_minute sched_arr_hour sched_arr_minute sched_dep_hour sched_dep_minute
count 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000
mean -0.074828 -0.013822 -0.084067 0.116335 0.036416 0.311938 -0.063946 0.202323 0.185636 0.019310 -0.076631 -0.033573 0.014072 0.006992 -0.029823 0.019310 -0.076631
std 0.567883 0.583688 0.714802 0.797747 0.770960 0.464947 0.679075 0.863958 0.840058 0.584118 0.536172 0.667538 0.559794 0.621859 0.583223 0.584118 0.536172
min -1.000000 -1.000000 -2.266061 -3.428253 -3.853578 -0.064742 -3.044672 -0.990826 -0.918182 -1.000000 -0.805556 -1.875000 -0.935484 -1.875000 -1.000000 -1.000000 -0.805556
25% -0.500000 -0.533333 -0.722841 -0.429206 -0.511570 -0.064742 -0.648951 -0.431193 -0.438636 -0.500000 -0.583333 -0.500000 -0.483871 -0.500000 -0.533333 -0.500000 -0.583333
50% 0.000000 0.000000 -0.528929 0.000202 0.002514 -0.000045 -0.000365 0.000000 0.000000 0.000000 0.000000 0.000000 0.016129 0.000000 0.000000 0.000000 0.000000
75% 0.500000 0.466667 0.276620 0.564906 0.487063 0.935338 0.350599 0.577982 0.581818 0.500000 0.416667 0.500000 0.516129 0.500000 0.500000 0.500000 0.416667
max 0.833333 1.000000 1.636003 7.225370 12.134945 0.935338 5.158320 5.110092 4.653409 1.250000 0.833333 1.125000 0.967742 1.000000 0.966667 1.250000 0.833333
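In the test summary, the 50% row is close to zero but not exactly zero for every column. That is expected, because the test data are scaled with the medians and interquartile ranges learned from the training data. The sketch below illustrates this with random made-up data standing in for the encoded train and test sets. The fitted attributes `center_` and `scale_` hold the training statistics.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(0)
# Made-up numeric data standing in for the encoded train and test sets.
train = rng.normal(loc=50.0, scale=10.0, size=(200, 2))
test = rng.normal(loc=50.0, scale=10.0, size=(50, 2))

# Fit on the training data only.
scaler = RobustScaler().fit(train)
print(scaler.center_)  # per-feature medians of the training data
print(scaler.scale_)   # per-feature interquartile ranges of the training data

# transform reuses the training statistics on the test data.
test_scaled = scaler.transform(test)
manual = (test - scaler.center_) / scaler.scale_
print(np.allclose(test_scaled, manual))

# Training medians land on zero; test medians are only approximately zero.
train_medians = np.median(scaler.transform(train), axis=0)
test_medians = np.median(test_scaled, axis=0)
print(train_medians, test_medians)
```

Reusing the training statistics keeps information from the test set out of the preprocessing step, so the test set remains a fair check of the model.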