This tutorial explains how to use leave-one-out encoding from category_encoders. Leave-one-out encoding is a variant of target encoding in which the mean of the target for a category is calculated without the value in the current row.
This tutorial will use data for flights departing from NYC in 2013.
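To make the definition concrete, here is a minimal sketch that computes plain target encoding and leave-one-out encoding by hand with pandas. The toy carriers and delays are made up for illustration, and the arithmetic is a simplified picture of the idea rather than the library's exact implementation.

```python
import pandas as pd

# Toy data: hypothetical carriers and arrival delays (not taken from the flights dataset).
toy = pd.DataFrame({
    'carrier': ['AA', 'AA', 'AA', 'UA', 'UA'],
    'arr_delay': [10.0, 20.0, 30.0, 4.0, 8.0],
})

grouped = toy.groupby('carrier')['arr_delay']
category_sum = grouped.transform('sum')      # per-row sum of the row's category
category_count = grouped.transform('count')  # per-row size of the row's category

# Plain target encoding: every row gets its category mean, own target included.
toy['target_enc'] = category_sum / category_count

# Leave-one-out: remove the row's own target before averaging the rest of the category.
toy['loo_enc'] = (category_sum - toy['arr_delay']) / (category_count - 1)

print(toy)
```

The first AA row has a delay of 10, so its leave-one-out value is the mean of the other two AA rows, (20 + 30) / 2 = 25, while plain target encoding gives every AA row 20. A category that appears only once has no other rows to average, so this sketch would produce NaN there; a real encoder needs a fallback for that case.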
Packages
This tutorial uses:
import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import category_encoders as ce
Reading the data
The data comes from the Rdatasets collection and is loaded with the Python package statsmodels.
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 336776 non-null int64
1 month 336776 non-null int64
2 day 336776 non-null int64
3 dep_time 328521 non-null float64
4 sched_dep_time 336776 non-null int64
5 dep_delay 328521 non-null float64
6 arr_time 328063 non-null float64
7 sched_arr_time 336776 non-null int64
8 arr_delay 327346 non-null float64
9 carrier 336776 non-null object
10 flight 336776 non-null int64
11 tailnum 334264 non-null object
12 origin 336776 non-null object
13 dest 336776 non-null object
14 air_time 327346 non-null float64
15 distance 336776 non-null int64
16 hour 336776 non-null int64
17 minute 336776 non-null int64
18 time_hour 336776 non-null object
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB
Count the missing values in each column:
df.isnull().sum()
year 0
month 0
day 0
dep_time 8255
sched_dep_time 0
dep_delay 8255
arr_time 8713
sched_arr_time 0
arr_delay 9430
carrier 0
flight 0
tailnum 2512
origin 0
dest 0
air_time 9430
distance 0
hour 0
minute 0
time_hour 0
dtype: int64
Feature Engineering
Handle null values
This model will predict arrival delay. The null values in arr_delay come from flights that were cancelled or diverted, so these rows can be excluded from the analysis.
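The step that removes these rows is not shown in the tutorial, so the following is only a sketch of one way to do it, assuming the rows to drop are exactly those with a missing arr_delay. The small flights frame is hypothetical; on the real data the equivalent call would be df = df.dropna(subset=['arr_delay']).

```python
import numpy as np
import pandas as pd

# Hypothetical mini version of the flights table; NaN marks a flight with no recorded arrival.
flights = pd.DataFrame({
    'carrier': ['UA', 'AA', 'B6', 'DL'],
    'arr_time': [830.0, np.nan, 1004.0, np.nan],
    'arr_delay': [11.0, np.nan, -18.0, np.nan],
})

# Keep only the rows where the target is known.
flights = flights.dropna(subset=['arr_delay'])

print(len(flights))
print(flights['arr_delay'].isnull().sum())
```

Removing the 9,430 rows with a missing arr_delay from the 336,776 flights leaves 327,346 rows, which is consistent with the 65,470 rows (20%, rounded up) that appear in the test set later in this tutorial.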
The time columns store clock times as HHMM numbers (for example, 1530 means 15:30). Split each one into separate hour and minute columns.
# hour = value // 100 (drop the last two digits), minute = value % 100 (keep them)
for prefix in ['arr', 'sched_arr', 'sched_dep']:
    df[prefix + '_hour'] = (df[prefix + '_time'] // 100).astype(int)
    df[prefix + '_minute'] = (df[prefix + '_time'] % 100).astype(int)
# The flight number is an identifier, not a quantity, so treat it as categorical
df['flight'] = df.flight.astype(str)
# In nycflights13, hour and minute are the scheduled departure time split into parts
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
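As a quick check of the hour and minute arithmetic, this sketch applies it to a few hypothetical HHMM values. The sample times are chosen for illustration and are not read from the dataset.

```python
import pandas as pd

# Hypothetical HHMM values: 5 is 00:05, 1530 is 15:30, 2400 is one way midnight can be recorded.
times = pd.Series([5.0, 859.0, 1530.0, 2400.0])

hours = (times // 100).astype(int)   # integer division drops the last two digits
minutes = (times % 100).astype(int)  # the remainder keeps the last two digits

print(list(zip(hours.tolist(), minutes.tolist())))
```

Note that a value of 2400 yields an hour of 24 and not 0, which is consistent with the maximum arr_hour of 24 in the test-set summary at the end of this tutorial.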
Prepare data for modeling
Set up train-test split
target = 'arr_delay'
y = df[target]
X = df.drop(columns=[target, 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
X_train.dtypes
You should get something back like this:
month int64
day int64
carrier object
flight object
tailnum object
origin object
dest object
air_time float64
distance int64
dep_hour int64
dep_minute int64
arr_hour int64
arr_minute int64
sched_arr_hour int64
sched_arr_minute int64
sched_dep_hour int64
sched_dep_minute int64
dtype: object
Encode categorical variables
We use a leave-one-out encoder because it produces a single column for each categorical variable, instead of one column for each level of the variable as one-hot encoding does. This makes it easier to interpret the impact of categorical variables with feature impact. Once the features are encoded, a model can be trained with any modeling algorithm on the feature set contained in X_train_loo.
encoder = ce.LeaveOneOutEncoder(return_df=True)
X_train_loo = encoder.fit_transform(X_train, y_train)
X_train_loo.dtypes
month int64
day int64
carrier float64
flight float64
tailnum float64
origin float64
dest float64
air_time float64
distance int64
dep_hour int64
dep_minute int64
arr_hour int64
arr_minute int64
sched_arr_hour int64
sched_arr_minute int64
sched_dep_hour int64
sched_dep_minute int64
dtype: object
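The column-count argument for preferring leave-one-out encoding over one-hot encoding can be seen on a small example. This sketch uses pandas' get_dummies as the one-hot encoder and a hypothetical dest column with four levels.

```python
import pandas as pd

# Hypothetical categorical feature with four distinct levels.
toy = pd.DataFrame({'dest': ['ORD', 'LAX', 'ATL', 'BOS', 'ORD', 'LAX']})

# One-hot encoding creates one indicator column per level.
one_hot = pd.get_dummies(toy['dest'], prefix='dest')

print(toy['dest'].nunique())  # number of levels in the original column
print(one_hot.shape[1])       # number of columns after one-hot encoding
```

A leave-one-out encoded feature stays a single numeric column however many levels it has. On the real data, X_train[['carrier', 'flight', 'tailnum', 'origin', 'dest']].nunique() would show how many one-hot columns each of these variables would need.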
Encode the test set with the encoder fitted on the training data. The result can now be passed into the predict or predict_proba method of a trained model.
X_test_loo = encoder.transform(X_test)
X_test_loo.describe()
month day carrier flight tailnum origin dest air_time distance dep_hour dep_minute arr_hour arr_minute sched_arr_hour sched_arr_minute sched_dep_hour sched_dep_minute
count 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000 65470.000000
mean 6.551031 15.792668 6.862181 6.907732 6.880032 6.877746 6.869284 151.053200 1051.359279 13.154483 26.241301 14.731419 29.436230 15.055934 29.105300 13.154483 26.241301
std 3.407300 8.755319 5.457000 11.119727 8.419143 1.625658 4.849410 94.171406 739.250702 4.672941 19.302202 5.340305 17.353617 4.974869 17.496692 4.672941 19.302202
min 1.000000 1.000000 -9.795775 -42.500000 -35.600000 5.560707 -14.416667 21.000000 80.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 0.000000
25% 4.000000 8.000000 1.985603 -0.696517 0.895833 5.560707 2.691649 82.000000 502.000000 9.000000 8.000000 11.000000 14.000000 11.000000 14.000000 9.000000 8.000000
50% 7.000000 16.000000 3.465982 5.288961 6.509804 5.786915 7.323326 129.000000 888.000000 13.000000 29.000000 15.000000 29.500000 15.000000 30.000000 13.000000 29.000000
75% 10.000000 23.000000 9.615767 13.160326 11.801242 9.057426 9.829630 192.000000 1400.000000 17.000000 44.000000 19.000000 45.000000 19.000000 45.000000 17.000000 44.000000
max 12.000000 31.000000 19.993676 106.000000 139.000000 9.057426 44.162500 686.000000 4983.000000 23.000000 59.000000 24.000000 59.000000 23.000000 59.000000 23.000000 59.000000
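When the test set is encoded, no target values are passed in, so there is no row to leave out. The sketch below shows the idea by hand with pandas: each category maps to its mean target from the training data. The fallback to the overall training mean for unseen categories is an assumption about the encoder's default handling of unknown values, so check the category_encoders documentation before relying on it. The data here is hypothetical.

```python
import pandas as pd

# Hypothetical training and test data (not from the flights dataset).
train = pd.DataFrame({
    'origin': ['JFK', 'JFK', 'LGA', 'LGA', 'LGA'],
    'arr_delay': [10.0, 20.0, 3.0, 6.0, 9.0],
})
test = pd.DataFrame({'origin': ['JFK', 'LGA', 'EWR']})  # EWR never appears in train

# Statistics learned from the training set only.
category_means = train.groupby('origin')['arr_delay'].mean()
global_mean = train['arr_delay'].mean()

# Each test category maps to its full training mean;
# categories not seen in training fall back to the global mean (assumed default).
test['origin_enc'] = test['origin'].map(category_means).fillna(global_mean)

print(test)
```

This is consistent with the summary above, where the encoded origin column has identical minimum and 25% values and identical 75% and maximum values, as expected when every test row from the same airport receives the same number.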