This tutorial explains how to use tree-based (Gini) feature importance from a scikit-learn tree-based model to perform feature selection.
It works with a flights dataset to build a model that predicts arrival delay.
This tutorial uses: pandas, statsmodels, scikit-learn, and category_encoders.
The data is from rdatasets imported using the Python package statsmodels.
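The loading step isn't shown in this section; a minimal sketch, assuming the hflights data (Houston flights, 2011) from rdatasets, might look like:

```python
import statsmodels.api as sm

# Pull the hflights table from rdatasets via statsmodels
# (the dataset name is an assumption; the text only says "rdatasets")
df = sm.datasets.get_rdataset("hflights", "hflights").data
df.head()
```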
This should return a table resembling the following:
Because this model will predict arrival delay, the null values are caused by flights that were cancelled or diverted. These rows can be excluded from the analysis.
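A one-line sketch of that exclusion, assuming the arrival-delay column is named ArrDelay as in hflights:

```python
# Cancelled or diverted flights have no recorded arrival delay,
# so dropping rows with a null ArrDelay removes them
df = df.dropna(subset=["ArrDelay"])
```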
We use a leave-one-out encoder because it creates a single column for each categorical variable, instead of one column per level as one-hot encoding does. This makes it easier to interpret the impact of each categorical variable on the feature importance scores.
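A sketch of the encoding step with category_encoders; the predictor and categorical column names below are assumptions, taken from hflights:

```python
import category_encoders as ce

# Target and predictors (column names assumed from hflights)
y = df["ArrDelay"]
X = df[["Month", "DayofMonth", "DayOfWeek", "DepTime", "Distance",
        "UniqueCarrier", "Origin", "Dest"]]

# Leave-one-out encoding yields one numeric column per categorical
# variable instead of one column per level
encoder = ce.LeaveOneOutEncoder(cols=["UniqueCarrier", "Origin", "Dest"])
X_encoded = encoder.fit_transform(X, y)
```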
Use SelectFromModel to select features based on model.feature_importances_. Since the model has already been fit, pass prefit=True to reuse it. Because Origin has outsized Gini importance, set a low threshold with threshold='0.01*mean' so that multiple features are kept.
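A sketch of the selection step; the estimator type and settings are assumptions (any scikit-learn model exposing feature_importances_ works the same way):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

# Fit a tree-based model (assumed estimator and settings)
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_encoded, y)

# prefit=True reuses the already-fitted model; the low threshold keeps
# features beyond the dominant Origin column
selector = SelectFromModel(model, prefit=True, threshold="0.01*mean")
selected_features = X_encoded.columns[selector.get_support()]
print(list(selected_features))
```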
Create a data frame to hold the importance scores.
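For example, reusing model and X_encoded from the sketches above:

```python
import pandas as pd

# Pair each feature with its Gini importance from the fitted model
df_importance = pd.DataFrame(
    {"feature": X_encoded.columns, "importance": model.feature_importances_}
).sort_values("importance", ascending=False)
df_importance
```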
Create a list of the features with Gini importance greater than 0.005 and use that list to retrain the model.
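A sketch, reusing the assumed estimator settings from above:

```python
# Keep features whose Gini importance exceeds 0.005, then refit
keep = df_importance.loc[df_importance["importance"] > 0.005, "feature"].tolist()
model_reduced = RandomForestRegressor(n_estimators=100, random_state=0)
model_reduced.fit(X_encoded[keep], y)
```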
Alternatively, to keep the top 5 features, use something like the following instead (a sketch built on the df_importance table defined above):
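```python
# Keep only the five features with the highest Gini importance
top5 = df_importance.nlargest(5, "importance")["feature"].tolist()
model_top5 = RandomForestRegressor(n_estimators=100, random_state=0)
model_top5.fit(X_encoded[top5], y)
```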