This tutorial explains how to one-hot encode categorical features with scikit-learn, using data on flights in and out of NYC in 2013.
This tutorial uses:
Open a new Jupyter notebook and import the following:
The data comes from Rdatasets and is loaded using the Python package statsmodels.
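As a rough sketch of that loading step, the helper below wraps the statsmodels call `get_rdataset`. The function name `load_flights` is hypothetical, and the dataset name "flights" in the package "nycflights13" is an assumption about which Rdatasets table is meant. The call downloads the data, so it needs a network connection.

```python
def load_flights():
    """Download a flights table from Rdatasets and return it as a pandas DataFrame."""
    # Imported inside the function so that defining the helper does not
    # itself require statsmodels or a network connection.
    import statsmodels.api as sm

    # Assumption: the tutorial's data is the "flights" table of "nycflights13".
    dataset = sm.datasets.get_rdataset("flights", "nycflights13")
    return dataset.data
```

Calling `flights = load_flights()` in a notebook cell would then give a DataFrame to work with in the following steps.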
The null values are caused by flights that were cancelled or diverted. As this model will predict arrival delay, these rows can be excluded from this analysis.
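A minimal sketch of that filtering step is shown here on a tiny made-up table rather than the real data. The column names `carrier`, `origin`, `dest` and `arr_delay` are assumptions about the flights table, and the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the flights table (values are made up).
flights = pd.DataFrame({
    "carrier": ["UA", "AA", "B6", "UA"],
    "origin": ["EWR", "JFK", "JFK", "LGA"],
    "dest": ["IAH", "MIA", "BQN", "ORD"],
    "arr_delay": [11.0, np.nan, -18.0, np.nan],
})

# Count the missing values in each column before dropping anything.
missing_per_column = flights.isnull().sum()

# Keep only the flights that have a recorded arrival delay.
flights_clean = flights.dropna(subset=["arr_delay"]).reset_index(drop=True)

print(missing_per_column)
print(flights_clean.shape)
```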
That should return:
We convert the categorical features using one-hot encoding to create a new binary feature for each category in the column.
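The idea can be sketched with scikit-learn's `OneHotEncoder` on a small made-up training frame. The column names and values are illustrative assumptions, and `handle_unknown="ignore"` is one possible setting for categories that appear only at prediction time, not necessarily the one this tutorial used.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data with two categorical columns.
train = pd.DataFrame({
    "carrier": ["UA", "AA", "B6", "UA"],
    "origin": ["EWR", "JFK", "JFK", "LGA"],
})

# Fit the encoder on the training data and encode it in one step.
# The result is a sparse matrix with one binary column per category.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded_train = encoder.fit_transform(train)

# The categories learned for each input column, in sorted order.
print(encoder.categories_)

# Dense view, for inspection only; large data is better kept sparse.
print(encoded_train.toarray())
```

Each row contains exactly one 1 per original column, so with two categorical columns every row of the encoded matrix sums to 2.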
The one-hot encoding has created nearly 9000 new features to account for all of the levels in the categorical features.
Encode the test set using the same fitted encoder. The result can now be passed into the predict or predict_proba methods of a trained model.
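A minimal end-to-end sketch of this step follows, again on made-up data. `LogisticRegression` and the binary labels are stand-ins chosen only because they provide both `predict` and `predict_proba`; they are not necessarily the model or target used in this tutorial. The test frame deliberately contains a carrier ("DL") that was not in the training data.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training data and labels (1 = arrived late, 0 = on time).
train = pd.DataFrame({
    "carrier": ["UA", "AA", "B6", "UA"],
    "origin": ["EWR", "JFK", "JFK", "LGA"],
})
y_train = [1, 0, 0, 1]

# Hypothetical test data; "DL" does not appear in the training set.
test = pd.DataFrame({
    "carrier": ["AA", "DL"],
    "origin": ["JFK", "EWR"],
})

# Fit on the training set only, then reuse the fitted encoder on the test set
# so that both matrices have the same columns in the same order.
encoder = OneHotEncoder(handle_unknown="ignore")
X_train = encoder.fit_transform(train)
X_test = encoder.transform(test)

# Stand-in model trained on the encoded training data.
model = LogisticRegression().fit(X_train, y_train)

predictions = model.predict(X_test)
probabilities = model.predict_proba(X_test)
print(X_test.toarray())
print(probabilities.shape)
```

With `handle_unknown="ignore"`, the unseen carrier is encoded as all zeros in the carrier columns instead of raising an error, which is why `transform` is called on the test set rather than `fit_transform`.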