This tutorial explains how to generate K-folds for grouped cross-validation with scikit-learn, so that machine learning models are evaluated on out-of-sample data.
In this notebook you will work with flights that departed from NYC airports in 2013.
This tutorial uses:
Open up a new Jupyter notebook and import the following:
The data comes from the Rdatasets collection and is loaded with the Python package statsmodels.
Because this model will predict arrival delay, rows with a null arrival delay have to be handled first. These null values are caused by flights that were cancelled or diverted, so they can be excluded from this analysis.
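One way to drop those rows is sketched below on a small hand-made table, so that it runs without a download. The column names `tailnum`, `dep_delay` and `arr_delay` are assumed to match the flights data, and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the flights table (invented values, assumed column names)
flights = pd.DataFrame({
    "tailnum": ["N101", "N101", "N202", "N202", "N303"],
    "dep_delay": [5.0, np.nan, 12.0, -3.0, 40.0],
    "arr_delay": [2.0, np.nan, np.nan, -8.0, 35.0],
})

# Count rows where the target is missing before removing them
n_missing = int(flights["arr_delay"].isna().sum())

# Keep only flights with a recorded arrival delay
flights_clean = flights.dropna(subset=["arr_delay"]).reset_index(drop=True)

print(f"Dropped {n_missing} rows, kept {len(flights_clean)}")
```

Restricting `dropna` to the target column keeps rows that are only missing values in columns the model does not use.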
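Scikit-learn's GroupKFold splits the data into N folds (default of 5) that can be used to perform cross-validation during machine learning training. By default the assignment is not random: all records that share a group are placed in the same fold, so a group never appears in both the training and the test set of a split.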
In this case, group the records by individual plane: once a plane is late, its subsequent flights are more likely to be delayed as well, so flights by the same plane should not be divided between training and test data.
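The grouping can be illustrated with a short sketch on toy data. The column name `tailnum` is assumed to be the plane identifier, and the values are invented. Each fold holds out complete planes, which the loop checks explicitly.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold

# Toy data: five planes with two flights each (invented values)
flights = pd.DataFrame({
    "tailnum": ["N101", "N101", "N202", "N202", "N303",
                "N303", "N404", "N404", "N505", "N505"],
    "dep_delay": [5.0, 7.0, 12.0, -3.0, 40.0, 38.0, 0.0, 2.0, -5.0, 15.0],
    "arr_delay": [2.0, 9.0, 10.0, -8.0, 35.0, 41.0, -4.0, 1.0, -9.0, 20.0],
})

groups = flights["tailnum"]
gkf = GroupKFold(n_splits=5)  # n_splits cannot exceed the number of groups

folds = []
for fold, (train_idx, test_idx) in enumerate(gkf.split(flights, groups=groups)):
    train_planes = set(groups.iloc[train_idx])
    test_planes = set(groups.iloc[test_idx])
    # No plane may appear on both sides of a split
    assert train_planes.isdisjoint(test_planes)
    folds.append((train_idx, test_idx))
    print(f"fold {fold}: held-out planes = {sorted(test_planes)}")
```

`split` yields positional row indices, so `iloc` is used to look up the planes on each side.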
Create the features and target before running cross-validation.
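A possible version of this step is sketched below on synthetic data. The feature columns (`dep_delay`, `distance`), the `LinearRegression` model and the mean absolute error metric are assumptions chosen for illustration, not choices prescribed by this tutorial.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GroupKFold, cross_val_score

# Synthetic stand-in for the cleaned flights table: 10 planes, 6 flights each
rng = np.random.default_rng(0)
n_planes, flights_per_plane = 10, 6
tailnum = np.repeat([f"N{i:03d}" for i in range(n_planes)], flights_per_plane)
dep_delay = rng.normal(10, 20, size=tailnum.size)
distance = rng.integers(100, 2500, size=tailnum.size)
arr_delay = dep_delay + rng.normal(0, 5, size=tailnum.size)
flights = pd.DataFrame({
    "tailnum": tailnum,
    "dep_delay": dep_delay,
    "distance": distance,
    "arr_delay": arr_delay,
})

# Features, target, and the group label for each row
feature_cols = ["dep_delay", "distance"]  # assumed feature set
X = flights[feature_cols]
y = flights["arr_delay"]
groups = flights["tailnum"]

# Grouped cross-validation: each plane is scored only in folds
# where none of its flights were used for training
scores = cross_val_score(
    LinearRegression(),
    X,
    y,
    groups=groups,
    cv=GroupKFold(n_splits=5),
    scoring="neg_mean_absolute_error",
)
print("MAE per fold:", -scores)
```

The plane identifier is passed through `groups` and left out of `X`, so it controls the split without becoming a model input.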