This tutorial explains how to generate K-folds for cross-validation with groups using scikit-learn, so that machine learning models can be evaluated on out-of-sample data.

In this notebook you will work with flights that departed from NYC airports in 2013.


This tutorial uses statsmodels, pandas, NumPy and scikit-learn.

Open up a new Jupyter notebook and import the following:

import statsmodels.api as sm
import pandas as pd
import numpy as np
from sklearn.model_selection import GroupKFold

Reading the data

The data is from rdatasets imported using the Python package statsmodels.

df = sm.datasets.get_rdataset('flights', 'nycflights13').data
df.info()

RangeIndex: 336776 entries, 0 to 336775
Data columns (total 19 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   year            336776 non-null  int64  
 1   month           336776 non-null  int64  
 2   day             336776 non-null  int64  
 3   dep_time        328521 non-null  float64
 4   sched_dep_time  336776 non-null  int64  
 5   dep_delay       328521 non-null  float64
 6   arr_time        328063 non-null  float64
 7   sched_arr_time  336776 non-null  int64  
 8   arr_delay       327346 non-null  float64
 9   carrier         336776 non-null  object 
 10  flight          336776 non-null  int64  
 11  tailnum         334264 non-null  object 
 12  origin          336776 non-null  object 
 13  dest            336776 non-null  object 
 14  air_time        327346 non-null  float64
 15  distance        336776 non-null  int64  
 16  hour            336776 non-null  int64  
 17  minute          336776 non-null  int64  
 18  time_hour       336776 non-null  object 
dtypes: float64(5), int64(9), object(5)
memory usage: 48.8+ MB

Feature Engineering

Handle null values

df.isnull().sum()

year                 0
month                0
day                  0
dep_time          8255
sched_dep_time       0
dep_delay         8255
arr_time          8713
sched_arr_time       0
arr_delay         9430
carrier              0
flight               0
tailnum           2512
origin               0
dest                 0
air_time          9430
distance             0
hour                 0
minute               0
time_hour            0
dtype: int64

As this model will predict arrival delay, the null values come from flights that were cancelled or diverted. These can be excluded from this analysis.

df.dropna(inplace=True)
df.reset_index(drop=True, inplace=True)
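Resetting the index after removing rows matters later, because the fold indices returned by scikit-learn are positions counted from zero. A minimal sketch on a made-up frame (the `toy` DataFrame and its values are hypothetical, not part of the flights data) shows the gaps that dropping rows leaves in the index and how `reset_index` closes them:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing arrival delays
toy = pd.DataFrame({"arr_delay": [5.0, np.nan, -3.0, np.nan, 12.0]})

# Dropping the null rows leaves gaps in the index labels
toy = toy.dropna()
print(toy.index.tolist())   # [0, 2, 4]

# Resetting the index makes labels match row positions again
toy = toy.reset_index(drop=True)
print(toy.index.tolist())   # [0, 1, 2]
```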

Convert the times from floats or ints in HHMM form to separate hour and minute columns

df['arr_hour'] = df.arr_time.apply(lambda x: int(np.floor(x/100)))
df['arr_minute'] = df.arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_arr_hour'] = df.sched_arr_time.apply(lambda x: int(np.floor(x/100)))
df['sched_arr_minute'] = df.sched_arr_time.apply(lambda x: int(x - np.floor(x/100)*100))
df['sched_dep_hour'] = df.sched_dep_time.apply(lambda x: int(np.floor(x/100)))
df['sched_dep_minute'] = df.sched_dep_time.apply(lambda x: int(x - np.floor(x/100)*100))
df.rename(columns={'hour': 'dep_hour',
                   'minute': 'dep_minute'}, inplace=True)
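The time columns store clock times as HHMM numbers, so 517 means 5:17. As a minimal sketch of the same split, assuming a made-up Series of HHMM values (the `times` data is hypothetical), integer division and the remainder give the hour and minute without a lambda:

```python
import pandas as pd

# Hypothetical HHMM-encoded times, in the same form as sched_arr_time
times = pd.Series([517, 1004, 2359, 5])

# Integer division by 100 gives the hour, the remainder gives the minute
hours = times // 100
minutes = times % 100

print(hours.tolist())    # [5, 10, 23, 0]
print(minutes.tolist())  # [17, 4, 59, 5]
```

This is equivalent in effect to the floor-based lambdas used in this tutorial and is only an alternative spelling, not a required change.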

Cross-validation splitting

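Scikit-learn's GroupKFold splits the data into N folds (default of 5) that can be used to perform cross-validation during machine learning training. Rather than sampling rows at random, it keeps all records from the same group in the same fold, so a group never appears in both the training and test sets of a split.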

In this case, group records by individual plane (tailnum): once a plane is late, its subsequent flights are more likely to be delayed as well, so flights from the same plane should not be spread across training and test data.
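To see what grouping does to the folds, here is a minimal sketch on made-up data. The arrays `X_toy` and `groups_toy` and the plane labels A to D are hypothetical and only illustrate the behaviour; they are not taken from the flights data.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 8 rows belonging to 4 hypothetical planes
X_toy = np.arange(16).reshape(8, 2)
groups_toy = np.array(['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D'])

gkf_toy = GroupKFold(n_splits=2)
for train_idx, test_idx in gkf_toy.split(X_toy, groups=groups_toy):
    train_groups = set(groups_toy[train_idx])
    test_groups = set(groups_toy[test_idx])
    # No plane appears on both sides of the split
    assert train_groups.isdisjoint(test_groups)
    print("Train groups:", sorted(train_groups),
          "Test groups:", sorted(test_groups))
```

Each plane's rows land entirely in the training set or entirely in the test set of a given split.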

group = df.tailnum.tolist()

Create the features and target before running cross-validation

target = 'arr_delay'
y = df[target]
X = df.drop(columns=[target, 'flight', 'time_hour', 'year', 'dep_time', 'sched_dep_time', 'arr_time', 'sched_arr_time', 'dep_delay'])

gkf = GroupKFold(n_splits=10)
for train_index, test_index in gkf.split(X, groups=group):
    print("Train:", train_index, "Test:", test_index)
    # The fold indices are positional, so use .iloc for both X and y
    X_train = X.iloc[train_index, :]
    y_train = y.iloc[train_index]
    X_test = X.iloc[test_index, :]
    y_test = y.iloc[test_index]

Train: [     0      1      2 ... 327340 327342 327345] Test: [     8     10     14 ... 327341 327343 327344]
Train: [     0      2      4 ... 327341 327343 327344] Test: [     1      3      6 ... 327339 327342 327345]
Train: [     0      1      2 ... 327343 327344 327345] Test: [    26     57     73 ... 327314 327317 327325]
Train: [     0      1      2 ... 327343 327344 327345] Test: [    22     51     71 ... 327326 327332 327340]
Train: [     0      1      2 ... 327343 327344 327345] Test: [     9     33     35 ... 327321 327331 327338]
Train: [     0      1      2 ... 327343 327344 327345] Test: [     7     15     30 ... 327278 327313 327330]
Train: [     1      2      3 ... 327343 327344 327345] Test: [     0     11     12 ... 327300 327312 327322]
Train: [     0      1      2 ... 327343 327344 327345] Test: [     4      5     17 ... 327276 327299 327307]
Train: [     0      1      2 ... 327343 327344 327345] Test: [    13     16     34 ... 327316 327327 327333]
Train: [     0      1      3 ... 327343 327344 327345] Test: [     2     24     29 ... 327328 327335 327337]
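Once the folds exist, they can be handed to a model evaluation routine. The following is a rough sketch, not the method of this tutorial: it uses synthetic data (`X_demo`, `y_demo`, `groups_demo` are made up), and the choice of LinearRegression and mean absolute error is an assumption for illustration only. It passes a GroupKFold splitter and the group labels to scikit-learn's cross_val_score:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic stand-in data: 200 rows, 3 features, 20 hypothetical groups
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
groups_demo = np.repeat(np.arange(20), 10)

# One score per fold; each group is held out in exactly one fold
scores = cross_val_score(
    LinearRegression(),
    X_demo,
    y_demo,
    groups=groups_demo,
    cv=GroupKFold(n_splits=5),
    scoring="neg_mean_absolute_error",
)
print("Mean absolute error per fold:", -scores)
```

For the flights data, the categorical columns left in X (such as carrier, origin and dest) would need encoding before most estimators could be fitted.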

