This tutorial explains how to generate stratified K-folds for cross-validation with scikit-learn, so that machine learning models can be evaluated on out-of-sample data. With stratified sampling, the relative proportions of the classes in the overall dataset are maintained in each fold.
During this tutorial you will work with an OpenML dataset of 10108 observations and 69 columns to predict who pays for internet.
This tutorial uses:
Open up a new Jupyter notebook and import the following:
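A minimal set of imports for the steps in this tutorial might look like the following. pandas is an assumption here, since the text names only scikit-learn; `fetch_openml` and `StratifiedKFold` are the scikit-learn entry points for the loading and splitting steps.

```python
# pandas is assumed for handling the tabular data as a DataFrame.
import pandas as pd

# Loader for OpenML datasets and the stratified K-fold splitter.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import StratifiedKFold
```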
The data comes from OpenML and is imported with the scikit-learn module sklearn.datasets.
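As a rough sketch of the loading step, `fetch_openml` from `sklearn.datasets` can return the data as a pandas DataFrame when `as_frame=True`. The helper name `load_openml_frame` and the placeholder dataset name below are hypothetical; substitute the OpenML name or version of the dataset you are using.

```python
from sklearn.datasets import fetch_openml


def load_openml_frame(dataset_name, version=1):
    """Download an OpenML dataset and return features plus target as one DataFrame.

    `dataset_name` and `version` are supplied by the caller; this helper is an
    illustrative wrapper, not part of scikit-learn.
    """
    bunch = fetch_openml(name=dataset_name, version=version, as_frame=True)
    return bunch.frame


# Example call (requires network access; the name is a placeholder):
# df = load_openml_frame("<openml-dataset-name>")
# print(df.shape)
```

Wrapping the call in a function keeps the download out of import time, which is convenient when the notebook is rerun often.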
Split the data into target and features.
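The split can be sketched with pandas as follows. The small DataFrame stands in for the OpenML data, and every column name in it (including the target `pays_work`) is hypothetical.

```python
import pandas as pd

# Toy stand-in for the OpenML frame; all column names are hypothetical.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52, 38],
    "pays_self": [1, 0, 0, 1, 0, 0],
    "pays_parents": [0, 0, 1, 0, 0, 0],
    "pays_work": [0, 1, 0, 0, 1, 1],  # assumed target column
})

target_col = "pays_work"
y = df[target_col]                 # target as a Series
X = df.drop(columns=[target_col])  # features: every remaining column

print(X.shape, y.shape)
```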
Drop the features that leak the target, namely the columns for the other options to pay.
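One way to drop such columns is to select them by a shared name prefix. This is a sketch on toy data: the `pays_` prefix and all column names are hypothetical, so adjust the filter to the actual column names in your dataset.

```python
import pandas as pd

# Toy feature matrix; the "pays_*" columns play the role of the other
# payment options, which would reveal the target if left in.
X = pd.DataFrame({
    "age": [23, 35, 41, 29],
    "pays_self": [1, 0, 0, 1],
    "pays_parents": [0, 0, 1, 0],
    "pays_school": [0, 1, 0, 0],
})

# Collect the leakage columns by prefix, then drop them in one call.
leak_cols = [col for col in X.columns if col.startswith("pays_")]
X = X.drop(columns=leak_cols)

print(list(X.columns))
```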
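Scikit-learn's StratifiedKFold splits the data into N folds (n_splits, default of 5) while keeping each class's share approximately the same in every fold; the folds can then be used for cross-validation during machine learning training. By default the data is not shuffled before splitting, so pass shuffle=True (optionally with random_state for reproducibility) to sample randomly within each class.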
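The behaviour can be checked on a small synthetic example. The data below is made up for illustration (80 samples of class 0 and 20 of class 1), and the `random_state` value is arbitrary.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic, imbalanced data: 80 rows of class 0 and 20 rows of class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 80 + [1] * 20)

# shuffle=True randomizes which rows of each class land in which fold.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

test_sizes = []
positive_rates = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    rate = y[test_idx].mean()  # share of class 1 in this test fold
    test_sizes.append(len(test_idx))
    positive_rates.append(rate)
    print(f"fold {fold}: train={len(train_idx)}, test={len(test_idx)}, "
          f"positive rate={rate:.2f}")
```

Because both class counts divide evenly by 5, every test fold holds 20 rows with a positive rate of 0.20, matching the 20% share in the full dataset. The same `skf` object can be passed as the `cv` argument to helpers such as `cross_val_score`.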