This tutorial explains how to use feature importance from pyrasgo to perform backward stepwise feature selection. The feature importance is calculated from SHAP values from CatBoost.
This notebook prunes the features used to model arrival delay for flights that departed NYC airports in 2013.
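For readers new to SHAP-based importance, a common convention is to score each feature by its mean absolute SHAP value across rows. The sketch below shows only that aggregation step on a made-up matrix. Whether pyrasgo aggregates in exactly this way is an assumption, and the helper name `mean_abs_shap` and all numbers are hypothetical.

```python
from statistics import mean

def mean_abs_shap(shap_rows, feature_names):
    """Score each feature by the mean absolute SHAP value over all rows."""
    columns = zip(*shap_rows)  # transpose rows into per-feature columns
    return {
        name: mean(abs(value) for value in column)
        for name, column in zip(feature_names, columns)
    }

# Hypothetical SHAP values: one row per flight, one column per feature.
toy_shap = [
    [8.0, -1.0, 0.0],
    [-6.0, 2.0, 0.5],
    [4.0, -3.0, -0.5],
]
scores = mean_abs_shap(toy_shap, ['dep_delay', 'distance', 'year'])

# Rank features from most to least important.
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Backward selection then removes features from the bottom of such a ranking, one at a time.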
This tutorial uses:
import statsmodels.api as sm
import pandas as pd
import numpy as np
import pyrasgo
Enter your email and password to create an account. This account gives you free access to the Rasgo API, which calculates dataframe profiles, generates feature importance scores, and produces feature explainability for your analysis. It also lets you retain access to your analysis and share it with your colleagues.
Note: This only needs to be run the first time you use pyrasgo.
#pyrasgo.register(email='<your email>', password='<your password>')
Enter the email and password you used at registration to connect to Rasgo.
rasgo = pyrasgo.login(email='<your email>', password='<your password>')
Create an experiment to track the changes in performance.
Activated existing experiment with name Stepwise Feature Selection Tutorial for dataframe: UjNaU_zBWCfXrKzpEF5hN5JNkMGQnprAn6iLhn4qfNA
The data comes from Rdatasets and is loaded using the Python package statsmodels.
df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()
This should print a summary resembling the following:
RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 year 336776 non-null int64
1 month 336776 non-null int64
2 day 336776 non-null int64
3 dep_time 328521 non-null float64
4 sched_dep_time 336776 non-null int64
5 dep_delay 328521 non-null float64
6 arr_time 328063 non-null float64
7 sched_arr_time 336776 non-null int64
8 arr_delay 327346 non-null float64
9 carrier 336776 non-null object
10 flight 336776 non-null int64
11 tailnum 334264 non-null object
12 origin 336776 non-null object
13 dest 336776 non-null object
14 air_time 327346 non-null float64
15 distance 336776 non-null int64
16 hour 336776 non-null int64
17 minute 336776 non-null int64
18 time_hour 336776 non-null object
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB
Because this model predicts arrival delay, the null values correspond to flights that were cancelled or diverted. These rows can be excluded from this analysis.
df.dropna(inplace=True)
df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
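The apply calls above rely on the clock times being stored as HHMM numbers, so 1530 means 15:30. A minimal plain-Python sketch of the same split, using a hypothetical helper name `split_hhmm`:

```python
def split_hhmm(value):
    """Split an HHMM clock value such as 1530 into (hour, minute)."""
    hour, minute = divmod(int(value), 100)
    return hour, minute

print(split_hhmm(517.0))  # early-morning time stored as a float
print(split_hhmm(1530))
```

In pandas the same arithmetic can be written column-wise with `// 100` and `% 100` followed by `astype(int)`, which avoids calling a Python lambda for every row.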
Create a function that incrementally removes the feature with the lowest feature importance, as calculated by PyRasgo, until the RMSE stops decreasing.
def backward_selection(df, target, max_features=None):
    """
    This function uses the pyrasgo.evaluate.feature_importance and pyrasgo.prune.features functions
    to incrementally remove features from the training set until the RMSE no longer improves.
    This function returns the dataframe with the features that give the best RMSE.
    Return at most max_features.
    """
    # get baseline RMSE
    select_df = df.copy()
    total_features = df.shape[1]
    response = rasgo.evaluate.feature_importance(select_df, target, return_cli_only=True)
    rmse = response['modelPerformance']['RMSE']
    print(f"{rmse} with {select_df.shape[1]}")
    last_rmse = rmse

    # Drop least important feature and recalculate model performance
    if max_features is None:
        max_features = total_features-1

    for num_features in range(total_features-1, 1, -1):
        tmp_df = rasgo.prune.features(select_df, target, top_n=num_features)
        response = rasgo.evaluate.feature_importance(tmp_df, target, return_cli_only=True)
        rmse = response['modelPerformance']['RMSE']
        print(f"{rmse} with {tmp_df.shape[1]}")
        if (num_features < max_features) and (rmse > last_rmse):
            # RMSE increased, return last dataframe
            return select_df
        else:
            # RMSE improved (or max_features not yet reached), continue dropping features
            last_rmse = rmse
            select_df = tmp_df
    return select_df
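The control flow of backward elimination can be tried out without calling the Rasgo API. The sketch below is a simplified stand-in, not the pyrasgo implementation: `backward_select`, the fixed `importance` scores, and the `toy_rmse` scoring function are all hypothetical, and the numbers are invented for illustration rather than taken from the flights data. It assumes a stopping rule where pruning is forced while more than `max_features` remain, and otherwise stops at the first drop that makes the score worse.

```python
def backward_select(features, importance, score, max_features=None):
    """Drop the least important feature until the score (lower is better) gets worse."""
    selected = list(features)
    if max_features is None:
        max_features = len(selected)
    best = score(selected)
    while len(selected) > 1:
        # Candidate set: everything except the currently weakest feature.
        weakest = min(selected, key=importance.__getitem__)
        candidate = [f for f in selected if f != weakest]
        candidate_score = score(candidate)
        # Stop only if we are already within the limit and the score got worse.
        if len(selected) <= max_features and candidate_score > best:
            break
        selected, best = candidate, candidate_score
    return selected

# Hypothetical importances and a toy error function (assumptions for illustration).
importance = {'dep_delay': 0.9, 'distance': 0.3, 'noise_a': 0.01, 'noise_b': 0.02}

def toy_rmse(features):
    error = 10.0
    error -= 4.0 if 'dep_delay' in features else 0.0
    error -= 1.0 if 'distance' in features else 0.0
    error += 0.5 * sum(f.startswith('noise') for f in features)
    return error

selected = backward_select(list(importance), importance, toy_rmse)
print(selected)
```

In this toy setup the two noise features are removed because each removal lowers the error, and the loop stops when removing `distance` would raise it.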
Call backward_selection on the modeling dataframe. reduced_df will contain the selected features and will be our reduced modeling dataset.
target = 'arr_delay'
reduced_df = backward_selection(df, target, max_features=20)
reduced_df.shape[1]
Importance URL: https://app.rasgoml.com/dataframes/UjNaU_zBWCfXrKzpEF5hN5JNkMGQnprAn6iLhn4qfNA/importance
5.774396196448282 with 25
Prune Method: Keeping top 24 features
Importance URL: https://app.rasgoml.com/dataframes/UjNaU_zBWCfXrKzpEF5hN5JNkMGQnprAn6iLhn4qfNA/importance
Dropped features not in top 24: ['year']
Importance URL: https://app.rasgoml.com/dataframes/UjNaU_zBWCfXrKzpEF5hN5JNkMGQnprAn6iLhn4qfNA/importance
5.622926184499909 with 24
Prune Method: Keeping top 23 features
Importance URL: https://app.rasgoml.com/dataframes/UjNaU_zBWCfXrKzpEF5hN5JNkMGQnprAn6iLhn4qfNA/importance
Dropped features not in top 23: ['sched_dep_time']
Importance URL: https://app.rasgoml.com/dataframes/UjNaU_zBWCfXrKzpEF5hN5JNkMGQnprAn6iLhn4qfNA/importance
6.0614605291484205 with 23
Prune Method: Keeping top 22 features
Importance URL: https://app.rasgoml.com/dataframes/UjNaU_zBWCfXrKzpEF5hN5JNkMGQnprAn6iLhn4qfNA/importance
Dropped features not in top 22: ['tailnum']
Importance URL: https://app.rasgoml.com/dataframes/UjNaU_zBWCfXrKzpEF5hN5JNkMGQnprAn6iLhn4qfNA/importance
5.669525870310706 with 22
Prune Method: Keeping top 21 features
Importance URL: https://app.rasgoml.com/dataframes/UjNaU_zBWCfXrKzpEF5hN5JNkMGQnprAn6iLhn4qfNA/importance
Dropped features not in top 21: ['sched_dep_minute']
Importance URL: https://app.rasgoml.com/dataframes/UjNaU_zBWCfXrKzpEF5hN5JNkMGQnprAn6iLhn4qfNA/importance
5.682320841457851 with 21
Prune Method: Keeping top 20 features
Importance URL: https://app.rasgoml.com/dataframes/UjNaU_zBWCfXrKzpEF5hN5JNkMGQnprAn6iLhn4qfNA/importance
Dropped features not in top 20: ['dep_minute']
Importance URL: https://app.rasgoml.com/dataframes/UjNaU_zBWCfXrKzpEF5hN5JNkMGQnprAn6iLhn4qfNA/importance
5.9402484082388 with 20
Prune Method: Keeping top 19 features
Importance URL: https://app.rasgoml.com/dataframes/UjNaU_zBWCfXrKzpEF5hN5JNkMGQnprAn6iLhn4qfNA/importance
Dropped features not in top 19: ['sched_dep_hour']
Importance URL: https://app.rasgoml.com/dataframes/UjNaU_zBWCfXrKzpEF5hN5JNkMGQnprAn6iLhn4qfNA/importance
6.31078646630211 with 19
20
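The printed log interleaves RMSE values with pruning messages, which makes the trend hard to read. Below is a small sketch that pulls out the "<rmse> with <columns>" pairs and finds the step with the lowest RMSE. It assumes the print format used in backward_selection; the log excerpt is copied from the run above with the URL lines omitted.

```python
import re

# Excerpt of the run log (RMSE lines and dropped-feature lines only).
log = """
5.774396196448282 with 25
Dropped features not in top 24: ['year']
5.622926184499909 with 24
Dropped features not in top 23: ['sched_dep_time']
6.0614605291484205 with 23
Dropped features not in top 22: ['tailnum']
5.669525870310706 with 22
Dropped features not in top 21: ['sched_dep_minute']
5.682320841457851 with 21
Dropped features not in top 20: ['dep_minute']
5.9402484082388 with 20
Dropped features not in top 19: ['sched_dep_hour']
6.31078646630211 with 19
"""

# Match only lines of the form "<float> with <int>".
pattern = re.compile(r'^(\d+\.\d+) with (\d+)$', re.MULTILINE)
steps = [(int(columns), float(rmse)) for rmse, columns in pattern.findall(log)]

for columns, rmse in steps:
    print(f"{columns:>2} columns: RMSE {rmse:.3f}")

best_step = min(steps, key=lambda step: step[1])
print("lowest RMSE at", best_step[0], "columns")
```

In this run the lowest RMSE occurs at 24 columns, while the function returned 20. With max_features=20, the loop only stops on an RMSE increase once fewer than 20 features are requested, so earlier increases do not end the search.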