Feature Selection with PyRasgo

This tutorial explains how to use feature importance plots from pyrasgo to perform feature selection. The feature importance is calculated from SHAP values produced by catboost.

This notebook calculates the SHAP feature importance when predicting arrival delay for flights that departed NYC in 2013.
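For intuition, one common way to turn SHAP values into a single importance score per feature is to average their absolute values across rows. The sketch below assumes that convention (this tutorial does not spell out how pyrasgo aggregates), and the small SHAP matrix and feature names are made-up illustration values rather than output from catboost.

```python
import numpy as np

# Hypothetical SHAP matrix: one row per flight, one column per feature.
# The numbers are invented for illustration only.
feature_names = ['dep_delay', 'distance', 'month']
shap_values = np.array([
    [12.0, -1.0,  0.5],
    [-8.0,  2.0, -0.5],
    [10.0, -3.0,  0.0],
])

# Assumed aggregation: mean absolute SHAP value per feature.
importance = np.abs(shap_values).mean(axis=0)

# Rank features from most to least important.
ranking = [feature_names[i] for i in np.argsort(importance)[::-1]]
print(dict(zip(feature_names, importance)))
print(ranking)
```

Features near the bottom of such a ranking are the usual candidates to drop during feature selection.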

Packages

This tutorial uses statsmodels, pandas, numpy, and pyrasgo:


import statsmodels.api as sm
import pandas as pd
import numpy as np
import pyrasgo

Connect to Rasgo

Enter your email and password to create an account. This account gives you free access to the Rasgo API, which calculates dataframe profiles, generates feature importance scores, and produces feature explainability for your analysis. In addition, the account lets you keep access to your analysis and share it with your colleagues.

Note: This only needs to be run the first time you use pyrasgo.

#pyrasgo.register(email='<your email>', password='<your password>')

Enter the email and password you used at registration to connect to Rasgo.

rasgo = pyrasgo.login(email='<your email>', password='<your password>')

Reading the Data

The data comes from the Rdatasets collection and is loaded with the Python package statsmodels.


df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()

This should print a summary resembling the following:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype
---  ------          --------------   -----
 0   year            336776 non-null  int64
 1   month           336776 non-null  int64
 2   day             336776 non-null  int64
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object
 10  flight          336776 non-null  int64
 11  tailnum         334264 non-null  object
 12  origin          336776 non-null  object
 13  dest            336776 non-null  object
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64
 16  hour            336776 non-null  int64
 17  minute          336776 non-null  int64
 18  time_hour       336776 non-null  object
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB

Feature Engineering

Handle Null Values

Because this model predicts arrival delay, rows with null values correspond to flights that were cancelled or diverted. These can be excluded from this analysis.


df.dropna(inplace=True)
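
Before dropping rows, it can help to confirm where the nulls sit and how many rows are lost. The sketch below uses a small made-up frame that borrows a few column names from the flights data (the values are invented) and relies only on standard pandas calls.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the flights data; values are invented.
toy = pd.DataFrame({
    'dep_delay': [2.0, np.nan, 5.0, -1.0],
    'arr_delay': [11.0, np.nan, np.nan, -4.0],
    'distance': [1400, 1416, 1089, 762],
})

# Count nulls per column before dropping anything.
null_counts = toy.isna().sum()
print(null_counts)

# Drop every row that has at least one null, as the tutorial does.
rows_before = len(toy)
cleaned = toy.dropna()
print(f'dropped {rows_before - len(cleaned)} of {rows_before} rows')
```

Running the same two checks on df shows how much of the data the dropna call removes.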

Convert the Times From Floats or Ints to Hours and Minutes


# Times are stored as HHMM numbers: integer division by 100 gives the hour,
# and the remainder gives the minute.
df['arr_hour'] = (df.arr_time // 100).astype(int)
df['arr_minute'] = (df.arr_time % 100).astype(int)
df['sched_arr_hour'] = (df.sched_arr_time // 100).astype(int)
df['sched_arr_minute'] = (df.sched_arr_time % 100).astype(int)
df['sched_dep_hour'] = (df.sched_dep_time // 100).astype(int)
df['sched_dep_minute'] = (df.sched_dep_time % 100).astype(int)
# Rename the dataset's existing hour and minute columns.
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
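
The times in this dataset are stored as HHMM numbers, so 517 means 5:17 and 1545 means 15:45. As a quick check of that arithmetic, the following sketch splits a few made-up HHMM values into hours and minutes. The helper split_hhmm is hypothetical, written for this illustration, and is not part of pandas or pyrasgo.

```python
import pandas as pd

def split_hhmm(times: pd.Series) -> pd.DataFrame:
    """Split HHMM-encoded times into integer hour and minute columns."""
    return pd.DataFrame({
        'hour': (times // 100).astype(int),   # 1545 -> 15
        'minute': (times % 100).astype(int),  # 1545 -> 45
    })

# Made-up sample values: 5:17, 15:45, and 0:05.
sample = pd.Series([517.0, 1545.0, 5.0])
parts = split_hhmm(sample)
print(parts)
```

The same check works on integer columns such as sched_arr_time, since integer division and remainder behave the same way for whole-number floats and ints.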