This tutorial explains how to use target encoding from category_encoders. Target encoding replaces each categorical value with a blend of two quantities: the probability (or expected value) of the target given that category, and the probability (or expected value) of the target over the entire training set.
This tutorial uses data for flights in and out of NYC in 2013.
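To make the blend described above concrete, the sketch below computes a smoothed target encoding by hand with pandas. It is an illustration only: the function name blend_target_encode, the sigmoid weighting, the parameters min_samples and smoothing, and the toy carrier data are all assumptions made for this example, not the library's implementation or values from the flights data.

```python
import numpy as np
import pandas as pd


def blend_target_encode(categories, target, min_samples=20, smoothing=10.0):
    """Illustrative smoothed target encoding (hypothetical helper)."""
    frame = pd.DataFrame({"cat": categories, "y": target})
    # Overall target mean across the whole training set.
    prior = frame["y"].mean()
    # Per-category row count and target mean.
    stats = frame.groupby("cat")["y"].agg(["count", "mean"])
    # Weight approaches 1 for well-populated categories and 0 for rare ones.
    weight = 1.0 / (1.0 + np.exp(-(stats["count"] - min_samples) / smoothing))
    # Blend the category mean with the overall mean.
    encoding = weight * stats["mean"] + (1.0 - weight) * prior
    return encoding, prior


# Made-up data: two common carriers and one rare carrier with extreme delays.
carriers = ["AA"] * 50 + ["UA"] * 50 + ["ZZ"] * 2
delays = [10.0] * 50 + [0.0] * 50 + [100.0] * 2

encoding, prior = blend_target_encode(carriers, delays)
print(encoding.round(2))
print(round(prior, 2))
```

Here the rare level ZZ has only two rows, so its encoding stays much closer to the overall mean than to its own mean of 100, while the well-populated levels land close to their own means.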
Packages
This tutorial uses statsmodels, pandas, NumPy, scikit-learn, and category_encoders.
Open up a new Jupyter notebook and import the following:
import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import category_encoders as ce
Reading the data
The data is from rdatasets imported using the Python package statsmodels.
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()
You should get the following output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 336776 non-null int64
1 month 336776 non-null int64
2 day 336776 non-null int64
3 dep_time 328521 non-null float64
4 sched_dep_time 336776 non-null int64
5 dep_delay 328521 non-null float64
6 arr_time 328063 non-null float64
7 sched_arr_time 336776 non-null int64
8 arr_delay 327346 non-null float64
9 carrier 336776 non-null object
10 flight 336776 non-null int64
11 tailnum 334264 non-null object
12 origin 336776 non-null object
13 dest 336776 non-null object
14 air_time 327346 non-null float64
15 distance 336776 non-null int64
16 hour 336776 non-null int64
17 minute 336776 non-null int64
18 time_hour 336776 non-null object
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB
Feature Engineering
Handle null values
df.isnull().sum()
year 0
month 0
day 0
dep_time 8255
sched_dep_time 0
dep_delay 8255
arr_time 8713
sched_arr_time 0
arr_delay 9430
carrier 0
flight 0
tailnum 2512
origin 0
dest 0
air_time 9430
distance 0
hour 0
minute 0
time_hour 0
dtype: int64
This model will predict arrival delay, so rows without an arrival delay cannot be used. These null values come from flights that were cancelled or diverted, and they can be excluded from this analysis.
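The source does not show the call that removes these rows. The training summary later in the tutorial reports 261,876 rows, which is consistent with an 80% split of the 327,346 rows that have a non-null arr_delay, so a filter along the following lines was presumably run. A minimal sketch on a made-up frame (the toy values are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the flights frame; NaN marks a cancelled or diverted flight.
toy = pd.DataFrame({
    "arr_time": [830.0, np.nan, 1547.0, 2359.0],
    "arr_delay": [11.0, np.nan, -4.0, np.nan],
})

# Keep only rows where the target is present, then renumber the index.
kept = toy.dropna(subset=["arr_delay"]).reset_index(drop=True)
print(len(toy), "->", len(kept))
```

Applied to the tutorial's frame, this would read df = df.dropna(subset=['arr_delay']). It needs to run before the time conversions, because int() cannot convert a NaN arr_time.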
Convert the times from floats or ints to hour and minute
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['flight'] = df.flight.astype(str)
df.rename(columns={'hour': 'dep_hour',
'minute': 'dep_minute'}, inplace=True)
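The hour and minute columns come from clock times stored as HHMM numbers, so 1547 means 15:47. As a sketch of the same split done with vectorized integer arithmetic instead of apply (the toy Series is hypothetical):

```python
import pandas as pd

# Clock times stored as HHMM integers, e.g. 830 is 08:30.
times = pd.Series([5, 830, 1547, 2400])

# Floor division strips the last two digits; modulo keeps them.
hours = times // 100
minutes = times % 100

print(pd.DataFrame({"hhmm": times, "hour": hours, "minute": minutes}))
```

For a float column such as arr_time, casting with .astype(int) after the nulls are removed gives the same integer results. A time of 2400 yields hour 24, which matches the maximum arr_hour in the summary further down.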
Prepare data for modeling
Set up train-test split
target = 'arr_delay'
y = df[target]
X = df.drop(columns=[target, 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
X_train.dtypes
month int64
day int64
carrier object
flight object
tailnum object
origin object
dest object
air_time float64
distance int64
dep_hour int64
dep_minute int64
arr_hour int64
arr_minute int64
sched_arr_hour int64
sched_arr_minute int64
sched_dep_hour int64
sched_dep_minute int64
dtype: object
Encode categorical variables
We use a target encoder because it creates a single column for each categorical variable, rather than one column per level as one-hot encoding does. This makes it easier to interpret the impact of categorical variables with feature impact. Once encoded, the feature set in X_train_loo can be used to train a model with any modeling algorithm.
encoder = ce.TargetEncoder(return_df=True)
X_train_loo = encoder.fit_transform(X_train, y_train)
X_train_loo.dtypes
month int64
day int64
carrier float64
flight float64
tailnum float64
origin float64
dest float64
air_time float64
distance int64
dep_hour int64
dep_minute int64
arr_hour int64
arr_minute int64
sched_arr_hour int64
sched_arr_minute int64
sched_dep_hour int64
sched_dep_minute int64
dtype: object
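To illustrate the one-column-per-variable point, the sketch below compares one-hot encoding with a plain per-category mean on a toy frame. The data are made up, and the unsmoothed mean is a simplification of what TargetEncoder computes, used here only to show the output shape.

```python
import pandas as pd

# Made-up rows: three origin airports and an arrival delay for each flight.
toy = pd.DataFrame({
    "origin": ["JFK", "LGA", "EWR", "JFK", "LGA", "EWR"],
    "arr_delay": [12.0, -3.0, 20.0, 8.0, 1.0, 10.0],
})

# One-hot encoding: one new column per level of the variable.
one_hot = pd.get_dummies(toy["origin"], prefix="origin")

# Mean encoding: a single numeric column holding each level's average delay.
mean_encoded = toy["origin"].map(toy.groupby("origin")["arr_delay"].mean())

print(one_hot.shape[1])
print(mean_encoded.tolist())
```

With thousands of distinct flight or tailnum values, the one-hot approach would add thousands of columns, while the encoded frame keeps the original column count.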
X_train_loo.describe()
month day carrier flight tailnum origin dest air_time distance dep_hour dep_minute arr_hour arr_minute sched_arr_hour sched_arr_minute sched_dep_hour sched_dep_minute
count 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000 261876.000000
mean 6.568246 15.727864 6.882754 6.890976 6.876203 6.882754 6.882753 150.594774 1047.624311 13.137641 26.232320 14.722663 29.474499 15.032809 29.029907 13.137641 26.232320
std 3.414977 8.782851 5.454216 11.053420 8.365960 1.626746 4.797671 93.567094 735.070110 4.659342 19.294383 5.325232 17.357855 4.971609 17.404733 4.659342 19.294383
min 1.000000 1.000000 -9.795775 -38.252832 -34.835896 5.560707 -14.416311 20.000000 80.000000 5.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.000000 0.000000
25% 4.000000 8.000000 1.985603 -0.698324 0.929825 5.560707 2.691649 82.000000 509.000000 9.000000 8.000000 11.000000 14.000000 11.000000 14.000000 9.000000 8.000000
50% 7.000000 16.000000 7.514331 5.288961 6.493671 5.786915 7.323326 129.000000 888.000000 13.000000 29.000000 15.000000 29.000000 15.000000 30.000000 13.000000 29.000000
75% 10.000000 23.000000 9.615767 13.142857 11.827068 9.057426 9.829630 191.000000 1389.000000 17.000000 44.000000 19.000000 45.000000 19.000000 44.000000 17.000000 44.000000
max 12.000000 31.000000 19.993676 94.184935 173.543714 9.057426 44.162500 695.000000 4983.000000 23.000000 59.000000 24.000000 59.000000 23.000000 59.000000 23.000000 59.000000
Encode the test set with the encoder that was fitted on the training data. The encoded test set can now be passed into the predict or predict_proba functions of a trained model.
X_test_loo = encoder.transform(X_test)
X_test_loo.describe()
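The key point is that the encoder is fitted on the training data only and then reused on the test set. The hand-rolled sketch below mimics that flow with a per-category mean and falls back to the overall training mean for a level never seen in training. The toy data and the fallback rule are assumptions for illustration; in category_encoders, the treatment of unseen or missing levels is governed by the handle_unknown and handle_missing parameters.

```python
import pandas as pd

# Made-up training rows with a known target.
train = pd.DataFrame({
    "dest": ["ATL", "ATL", "ORD", "ORD"],
    "arr_delay": [10.0, 20.0, 0.0, 4.0],
})
# Made-up test rows; HNL never appears in the training data.
test = pd.DataFrame({"dest": ["ORD", "ATL", "HNL"]})

# "Fit": learn the statistics from the training data only.
prior = train["arr_delay"].mean()
dest_means = train.groupby("dest")["arr_delay"].mean()

# "Transform": look up each test level, using the prior for unseen levels.
test_encoded = test["dest"].map(dest_means).fillna(prior)
print(test_encoded.tolist())
```

Because the statistics come from the training rows alone, no information about the test targets leaks into the encoded features.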