This tutorial explains how to use feature importance from CatBoost to perform backward stepwise feature selection. The importance measure used is the SHAP importance from a CatBoost model.
This prunes the features used to model arrival delay for flights in and out of NYC in 2013.
This tutorial uses:
The data is from rdatasets imported using the Python package statsmodels.
This should return a table resembling the following:
As this model will predict arrival delay, the null values correspond to flights that were cancelled or diverted. These rows can be excluded from the analysis.
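Dropping those rows is a one-liner with pandas; here is a self-contained sketch using a tiny synthetic stand-in for the flights table:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the flights data; the real table comes from Rdatasets.
flights = pd.DataFrame({
    "dep_delay": [5.0, np.nan, 12.0, 0.0],
    "arr_delay": [11.0, np.nan, np.nan, -3.0],
})

# Cancelled or diverted flights have no arrival delay, so drop them.
flights = flights.dropna(subset=["arr_delay"])
print(len(flights))  # → 2
```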
Function to build the model.
Function to return the feature to be dropped.
Function that incrementally removes the feature with the lowest feature importance, as calculated by CatBoost, until the RMSE stops decreasing.
Call backward_selection on the modeling DataFrame. reduced_df will contain the selected features and will be our reduced modeling dataset.
```
11.309300297508369 with 25
11.363724303677218 with 24
11.398055158351227 with 23
11.347547316020568 with 22
11.311533268825302 with 21
11.343828983988557 with 20
11.282547263583197 with 19
11.252227369040204 with 18
11.315881600344504 with 17
```