This tutorial explains how to use scikit-learn's one-hot encoding, using data for flights departing NYC in 2013.
Packages
This tutorial uses statsmodels, pandas, NumPy, and scikit-learn.
Open a new Jupyter notebook and import the following:
import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
Reading the data
The data comes from Rdatasets and is loaded through the Python package statsmodels.
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()
Feature Engineering
Handle null values
year 0
month 0
day 0
dep_time 8255
sched_dep_time 0
dep_delay 8255
arr_time 8713
sched_arr_time 0
arr_delay 9430
carrier 0
flight 0
tailnum 2512
origin 0
dest 0
air_time 9430
distance 0
hour 0
minute 0
time_hour 0
dtype: int64
As this model will predict arrival delay, rows with null values can be excluded from this analysis. The nulls come from flights that were cancelled or diverted, so they have no arrival delay to predict.
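The tutorial does not show the calls used to count and remove the nulls. A minimal sketch of one way to do both, assuming a plain `isnull().sum()` and `dropna()`, is shown here on a made-up toy frame (the `toy` data and its values are illustrative, not taken from the flights dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the flights frame; the values are made up for illustration.
toy = pd.DataFrame({
    "dep_time": [517.0, np.nan, 542.0, 554.0],
    "arr_time": [830.0, np.nan, 923.0, np.nan],
    "arr_delay": [11.0, np.nan, 33.0, np.nan],
    "carrier": ["UA", "AA", "B6", "DL"],
})

# Count nulls per column, the same kind of summary as the listing above.
null_counts = toy.isnull().sum()
print(null_counts)

# Drop every row that contains at least one null value.
toy_clean = toy.dropna().reset_index(drop=True)
print(toy_clean.shape)
```

Applied to the flights data, the same idea would be `df = df.dropna()`. Dropping on a subset of columns (for example only `arr_delay`) is an alternative if other nulls should be kept.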
Convert the times from floats or ints to hours and minutes
# Times are stored as HHMM numbers (e.g. 1530 for 15:30).
# Rows with null times must already have been removed, or astype will fail.
# Integer division by 100 gives the hour; the remainder gives the minutes.
df['arr_hour'] = (df.arr_time // 100).astype('int64')
df['arr_minute'] = (df.arr_time % 100).astype('int64')
df['sched_arr_hour'] = (df.sched_arr_time // 100).astype('int64')
df['sched_arr_minute'] = (df.sched_arr_time % 100).astype('int64')
df['sched_dep_hour'] = (df.sched_dep_time // 100).astype('int64')
df['sched_dep_minute'] = (df.sched_dep_time % 100).astype('int64')
# Flight numbers are identifiers, so store them as strings rather than quantities.
df['flight'] = df.flight.astype(str)
# Rename the existing hour and minute columns to mark them as departure fields.
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
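To make the hour and minute arithmetic concrete, here is a small sketch on a hypothetical series of HHMM clock times (the `times` values are invented for illustration):

```python
import pandas as pd

# Hypothetical clock times in HHMM form, like those in sched_dep_time.
times = pd.Series([517, 1530, 2359, 5], name="sched_dep_time")

# Integer division by 100 gives the hour; the remainder gives the minutes.
hours = times // 100
minutes = times % 100

parts = pd.DataFrame({"hour": hours, "minute": minutes})
print(parts)
```

A value such as 5 (00:05) yields hour 0 and minute 5, which is why the split is done arithmetically rather than by slicing a string.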
Prepare data for modeling
Set up train-test split
target = 'arr_delay'
y = df[target]
X = df.drop(columns=[target, 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1066)
X_train.dtypes
That should return:
month int64
day int64
carrier object
flight object
tailnum object
origin object
dest object
air_time float64
distance int64
dep_hour int64
dep_minute int64
arr_hour int64
arr_minute int64
sched_arr_hour int64
sched_arr_minute int64
sched_dep_hour int64
sched_dep_minute int64
dtype: object
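To see what the split does in isolation, the following sketch applies the same `train_test_split` call to a ten-row toy frame (the `toy` data is made up; only the call pattern mirrors the tutorial):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with one categorical column, one numeric column, and the target.
toy = pd.DataFrame({
    "carrier": ["UA", "AA", "B6", "DL", "UA", "AA", "B6", "DL", "UA", "AA"],
    "distance": [1400, 1089, 1576, 762, 719, 1065, 229, 944, 733, 1028],
    "arr_delay": [11, 20, 33, -18, -25, 12, 19, -14, -8, 8],
})

target = "arr_delay"
y = toy[target]
X = toy.drop(columns=[target])

# Hold out 20% of the rows; a fixed random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1066
)

print(X_train.shape, X_test.shape)
# Features and target stay aligned on the same row index.
print(X_train.index.equals(y_train.index))
```

With ten rows and `test_size=0.2`, eight rows go to training and two to testing, and no row appears in both sets.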
Encode categorical variables
We convert the categorical features using one-hot encoding to create a new binary feature for each category in the column.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train_ohe = encoder.fit_transform(X_train, y_train)
X_train_ohe.shape
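The one-hot encoding has created nearly 9000 new features to account for all of the levels in the encoded columns. Because the encoder is fit on every column of X_train, the numeric columns are treated as categorical too, and each distinct value becomes its own feature.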
Encode the test set with the encoder fitted on the training set. The result can then be passed to the predict or predict_proba method of a trained model.
X_test_ohe = encoder.transform(X_test)
X_test_ohe.shape
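The `handle_unknown="ignore"` setting matters at this step, because the test set can contain levels (for example a tail number) that never appeared in training. A minimal sketch of that behavior, using made-up `train` and `test` frames:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"carrier": ["UA", "AA", "B6"]})
test = pd.DataFrame({"carrier": ["AA", "ZZ"]})  # "ZZ" never appears in train

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Learned levels are sorted: AA, B6, UA.
# The known level "AA" gets a 1 in its column; the unseen "ZZ" becomes all zeros.
test_encoded = encoder.transform(test).toarray()
print(test_encoded)
```

The encoded test set always has the same number of columns as the encoded training set, so it lines up with whatever model was trained on the latter. With the default `handle_unknown="error"`, the `transform` call would raise an error on the unseen level instead.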