This tutorial explains how to use Shapley importance from SHAP and a scikit-learn tree-based model to perform feature selection.
This notebook works with an OpenML dataset to predict who pays for internet access; the data has 10,108 observations and 69 columns.
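The dataset can be pulled directly from OpenML with scikit-learn. The name "kdd_internet_usage" is an assumption here (it matches the 10,108-row, 69-column description); substitute the identifier the tutorial actually uses if it differs.

```python
from sklearn.datasets import fetch_openml

# Fetch the dataset as a pandas DataFrame; "kdd_internet_usage" is an
# assumed OpenML name -- replace it if the tutorial uses another dataset.
df = fetch_openml(name="kdd_internet_usage", as_frame=True).frame
print(df.shape)
```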
This tutorial uses:
The data is imported from OpenML using scikit-learn's fetch_openml function.
This should return a table resembling the following:
Observations containing null values can be excluded from this analysis before modeling.
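Excluding the null rows is a one-liner with pandas; the frame and column names below are illustrative stand-ins for the tutorial's data.

```python
import pandas as pd

# Toy frame standing in for the tutorial's data; column names are illustrative.
df = pd.DataFrame({"target": [1.0, None, 0.0], "feature": ["a", "b", "c"]})

# Drop the observations whose target is null before modeling.
df = df.dropna(subset=["target"])
print(len(df))  # the row with the null target is gone
```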
We use a leave-one-out encoder because it creates a single column for each categorical variable, rather than one column per level as one-hot encoding does. This makes it easier to interpret the impact of each categorical variable on the model.
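The tutorial most likely uses LeaveOneOutEncoder from the category_encoders package; the pandas sketch below shows what that encoder computes for each row: the mean target of the *other* rows in the same category, so a row's own label never leaks into its encoding. The data here is synthetic.

```python
import pandas as pd

# Synthetic example: one categorical column and a binary target.
df = pd.DataFrame({"cat": ["a", "a", "a", "b", "b"],
                   "y":   [1.0, 0.0, 1.0, 0.0, 1.0]})

# Leave-one-out encoding: for each row, the mean target of the other rows
# in its category, i.e. (category sum - own y) / (category count - 1).
grp = df.groupby("cat")["y"]
sums, counts = grp.transform("sum"), grp.transform("count")
df["cat_loo"] = (sums - df["y"]) / (counts - 1)
print(df["cat_loo"].tolist())  # [0.5, 1.0, 0.5, 1.0, 0.0]
```

Note the result is a single numeric column per categorical variable, which is what makes the downstream SHAP values easy to read.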
Create a data frame to hold the SHAP values.
Create a list of the features whose Shapley importance (mean absolute SHAP value) is greater than 0.5 and use that list to retrain the model.
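A sketch of that selection step, assuming importance is measured as the mean absolute SHAP value per column of the SHAP table built earlier; the data and SHAP values here are synthetic stand-ins.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-ins: only f1 actually drives the target, and the fake
# SHAP table reflects that.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(50, 3)), columns=["f1", "f2", "f3"])
y = 3 * X["f1"] + rng.normal(scale=0.1, size=50)
df_shap = pd.DataFrame({"f1": 3 * X["f1"],
                        "f2": 0.01 * X["f2"],
                        "f3": 0.02 * X["f3"]})

# Importance of a feature = mean absolute SHAP value across observations.
shap_importance = df_shap.abs().mean()

# Keep the features that clear the 0.5 threshold and retrain on them only.
selected = shap_importance[shap_importance > 0.5].index.tolist()
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X[selected], y)
print(selected)  # only f1 survives the threshold
```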
Alternatively, to keep only the top five features, use the following instead:
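Assuming the same mean-absolute-SHAP measure of importance, keeping the five largest scores can be done with pandas nlargest; df_shap here is a synthetic stand-in for the SHAP table built earlier.

```python
import pandas as pd

# Synthetic SHAP table with seven features; in the tutorial df_shap comes
# from the TreeExplainer step.
df_shap = pd.DataFrame({f"f{i}": [i * 0.1, -i * 0.1] for i in range(1, 8)})

# Rank features by mean absolute SHAP value and keep the five largest.
top5 = df_shap.abs().mean().nlargest(5).index.tolist()
print(top5)  # ['f7', 'f6', 'f5', 'f4', 'f3']
```

The resulting list can be passed to the model exactly as in the threshold variant, e.g. model.fit(X[top5], y).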