This tutorial explains how to use feature importance from pyrasgo to perform backward stepwise feature selection. The feature importance is calculated from the SHAP values of a CatBoost model.

This notebook will prune the features to model arrival delay for flights in and out of NYC in 2013.


This tutorial uses:

import statsmodels.api as sm
import pandas as pd
import numpy as np
import pyrasgo

Connect to Rasgo

Enter your email and password to create an account. This account gives you free access to the Rasgo API, which will calculate dataframe profiles, generate feature importance scores, and produce feature explainability for your analysis. In addition, this account allows you to maintain access to your analysis and share it with your colleagues.

Note: This only needs to be run the first time you use pyrasgo.

#pyrasgo.register(email='<your email>', password='<your password>')

Enter the email and password you used at registration to connect to Rasgo.

rasgo = pyrasgo.login(email='<your email>', password='<your password>')

Create experiment to track the changes in performance

Activated existing experiment with name Stepwise Feature Selection Tutorial for dataframe: UjNaU_zBWCfXrKzpEF5hN5JNkMGQnprAn6iLhn4qfNA

Reading the Data

The data is from rdatasets imported using the Python package statsmodels.

df = sm.datasets.get_rdataset('flights', 'nycflights13').data

Calling df.info() on the result should print a summary resembling this:

RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   year            336776 non-null  int64  
 1   month           336776 non-null  int64  
 2   day             336776 non-null  int64  
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64  
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64  
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object 
 10  flight          336776 non-null  int64  
 11  tailnum         334264 non-null  object 
 12  origin          336776 non-null  object 
 13  dest            336776 non-null  object 
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64  
 16  hour            336776 non-null  int64  
 17  minute          336776 non-null  int64  
 18  time_hour       336776 non-null  object 
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB

Feature Engineering

Handle Null Values

As this model will predict arrival delay, the null values are caused by flights that were cancelled or diverted. These can be excluded from this analysis.
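
The notebook's own cell for this step is not shown. A minimal sketch of one way to do it, on a toy frame, is to drop rows with missing values using pandas `dropna`. Restricting the check to the target and the recorded time columns via `subset` is an assumption, as are the toy values.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the flights dataframe (values are illustrative only).
toy = pd.DataFrame({
    'dep_time': [517.0, np.nan, 542.0],
    'arr_time': [830.0, np.nan, np.nan],
    'arr_delay': [11.0, np.nan, np.nan],
    'air_time': [227.0, np.nan, np.nan],
})

# Keep only flights that actually arrived: the target and the
# recorded times must all be present for a row to be usable.
required = ['dep_time', 'arr_time', 'arr_delay', 'air_time']
clean = toy.dropna(subset=required).reset_index(drop=True)
print(len(toy), '->', len(clean))
```

On the tutorial's dataframe the equivalent call would be `df = df.dropna(subset=required)`.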


Convert the Times From Floats or Ints to Hour and Minutes

df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
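
The times in this dataset are stored as HHMM numbers (for example, 517 means 5:17), which is why dividing by 100 separates hours from minutes. The same split can be sketched with vectorized integer arithmetic instead of `apply`. The toy series below is illustrative, not taken from the data. Float columns such as arr_time would need their null values removed and a cast to int first, since NaN cannot be converted to an integer.

```python
import pandas as pd

# Toy HHMM-encoded times, as stored in columns such as sched_arr_time.
times = pd.Series([517, 830, 2359, 5], name='sched_arr_time')

# Integer division by 100 gives the hour; the remainder gives the minute.
hours = times // 100
minutes = times % 100

print(pd.DataFrame({'hhmm': times, 'hour': hours, 'minute': minutes}))
```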

Feature Selection

Define Function

Create a function that incrementally removes the feature with the lowest feature importance, as calculated by PyRasgo, until the RMSE stops decreasing.

def backward_selection(df, target, max_features=None):
    """
    This function uses the pyrasgo.evaluate.feature_importance and pyrasgo.prune.features functions
    to incrementally remove features from the training set until the RMSE no longer improves.
    This function returns the dataframe with the features that give the best RMSE.
    Return at most max_features.
    """
    # get baseline RMSE
    select_df = df.copy()
    total_features = df.shape[1]
    response = rasgo.evaluate.feature_importance(select_df, target, return_cli_only=True)
    rmse = response['modelPerformance']['RMSE']
    print(f"{rmse} with {select_df.shape[1]}")
    last_rmse = rmse
    # Drop least important feature and recalculate model performance
    if max_features is None:
        max_features = total_features-1
    for num_features in range(total_features-1, 1, -1):
        tmp_df = rasgo.prune.features(select_df, target, top_n=num_features)
        response = rasgo.evaluate.feature_importance(tmp_df, target, return_cli_only=True)
        rmse = response['modelPerformance']['RMSE']
        print(f"{rmse} with {tmp_df.shape[1]}")
        if (num_features < max_features) and (rmse > last_rmse):
            # RMSE increased, return last dataframe
            return select_df
        else:
            # RMSE improved, continue dropping features
            last_rmse = rmse
            select_df = tmp_df
    return select_df
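
To see the same loop without a Rasgo account, here is a minimal offline sketch of backward elimination. It is not the Rasgo method: absolute least-squares coefficients stand in for the SHAP-based importance, RMSE is measured on a held-out split, and the data is synthetic. The helper names `holdout_rmse_and_importance` and `backward_selection_sketch` are hypothetical.

```python
import numpy as np

def holdout_rmse_and_importance(X_train, y_train, X_valid, y_valid):
    """Fit least squares on the training split; score on the validation split."""
    A_train = np.column_stack([X_train, np.ones(len(X_train))])  # intercept
    A_valid = np.column_stack([X_valid, np.ones(len(X_valid))])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    rmse = float(np.sqrt(np.mean((A_valid @ coef - y_valid) ** 2)))
    # Stand-in importance: coefficient magnitude (features share one scale here).
    return rmse, np.abs(coef[:-1])

def backward_selection_sketch(X, y, names, n_train):
    """Drop the least important feature until validation RMSE gets worse."""
    kept = list(range(X.shape[1]))

    def score(cols):
        return holdout_rmse_and_importance(
            X[:n_train, cols], y[:n_train], X[n_train:, cols], y[n_train:])

    best_rmse, importance = score(kept)
    while len(kept) > 1:
        candidate = kept.copy()
        del candidate[int(np.argmin(importance))]  # remove weakest feature
        rmse, cand_importance = score(candidate)
        if rmse > best_rmse:
            break  # RMSE increased, keep the previous feature set
        kept, best_rmse, importance = candidate, rmse, cand_importance
    return [names[i] for i in kept], best_rmse

# Synthetic data: x0 and x1 drive the target, x2..x5 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 6))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=600)
names = [f"x{i}" for i in range(6)]

kept_names, final_rmse = backward_selection_sketch(X, y, names, n_train=400)
print(kept_names, round(final_rmse, 3))
```

Scoring on a held-out split matters in this sketch: in-sample least-squares error never improves when a feature is removed, so the loop would stop at the first step.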

Run Stepwise Feature Selection

Call backward_selection on the modeling dataframe. reduced_df will contain the selected features and will be our reduced modeling dataset.

target = 'arr_delay'
reduced_df = backward_selection(df, target, max_features=20)

Importance URL: with 25
Prune Method: Keeping top 24 features
Importance URL: features not in top 24: ['year']
Importance URL: with 24
Prune Method: Keeping top 23 features
Importance URL: features not in top 23: ['sched_dep_time']
Importance URL: with 23
Prune Method: Keeping top 22 features
Importance URL: features not in top 22: ['tailnum']
Importance URL: with 22
Prune Method: Keeping top 21 features
Importance URL: features not in top 21: ['sched_dep_minute']
Importance URL: with 21
Prune Method: Keeping top 20 features
Importance URL: features not in top 20: ['dep_minute']
Importance URL: with 20
Prune Method: Keeping top 19 features
Importance URL: features not in top 19: ['sched_dep_hour']
Importance URL: with 19
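
A quick way to confirm which columns were removed is to compare the column lists of the original and reduced dataframes. The sketch below uses two toy frames (`full` and `reduced`, both hypothetical) in place of df and reduced_df.

```python
import pandas as pd

# Toy stand-ins for df and reduced_df; only the column names matter here.
full = pd.DataFrame(columns=['year', 'dep_delay', 'tailnum', 'distance', 'arr_delay'])
reduced = full.drop(columns=['year', 'tailnum'])

# Columns present in the original frame but absent from the reduced one.
dropped = [c for c in full.columns if c not in reduced.columns]
print('dropped:', dropped)
print('kept:', list(reduced.columns))
```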

