This tutorial explains how to generate a time series split with scikit-learn for out-of-time validation of machine learning models, why that approach may not be what you need, and how to create true time-based splits with pandas.
The examples use hourly 2013 weather data for multiple weather stations (identified by the origin column) at New York airports.
This tutorial uses:
The data comes from Rdatasets and is loaded with the Python package statsmodels.
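As a rough sketch of the loading step, statsmodels exposes `get_rdataset` for fetching tables from Rdatasets. The dataset name `weather` and package name `nycflights13` are assumptions based on the description above, and `load_weather` is a hypothetical helper name.

```python
def load_weather():
    """Fetch the hourly weather table as a pandas DataFrame (needs network access)."""
    # Imported here so statsmodels is only required when the data is fetched.
    import statsmodels.api as sm

    # Assumption: the tutorial's data is the "weather" table of the
    # nycflights13 package on Rdatasets.
    dataset = sm.datasets.get_rdataset("weather", "nycflights13")
    return dataset.data
```

Calling `load_weather()` downloads the table, so it needs an internet connection the first time it runs.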
The time_hour column holds the hour of each observation as a string. Convert it to a datetime column named observation_time. The year, month, day and hour columns duplicate that information and can be dropped from the dataframe.
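The conversion could look something like the sketch below. The three rows are made-up stand-ins for the real data, and dropping time_hour after the conversion is an assumption rather than something the text requires.

```python
import pandas as pd

# Tiny stand-in for the weather data; the values are made up for illustration.
weather = pd.DataFrame(
    {
        "origin": ["EWR", "JFK", "LGA"],
        "year": [2013, 2013, 2013],
        "month": [1, 1, 1],
        "day": [1, 1, 1],
        "hour": [1, 1, 2],
        "temp": [39.0, 39.0, 39.9],
        "time_hour": [
            "2013-01-01 01:00:00",
            "2013-01-01 01:00:00",
            "2013-01-01 02:00:00",
        ],
    }
)

# Parse the string timestamps into a proper datetime column.
weather["observation_time"] = pd.to_datetime(weather["time_hour"])

# Drop the columns that duplicate the timestamp. Dropping time_hour itself
# is an assumption; keep it if the original string is still needed.
weather = weather.drop(columns=["year", "month", "day", "hour", "time_hour"])

print(weather.dtypes)
```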
TimeSeriesSplit does not split on time values. Instead, it assumes that the data contains a single series of evenly spaced observations ordered by timestamp. Given that, each split assigns the first n rows to the train set and the following test_size rows to the test set.
Note that this will not work here, because the weather data contains three different weather stations: EWR, JFK and LGA. The data could be re-sorted purely by timestamp, but TimeSeriesSplit would still split on row count, not on a date or time boundary.
Splitting by row count is not the same as splitting on the time value, which is what this analysis requires.
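To see the row-count behaviour, here is a small sketch on twelve synthetic rows standing in for a single ordered series. The choice of `n_splits=3` is arbitrary and only for illustration.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Twelve evenly spaced observations from a single, already ordered series.
X = np.arange(12).reshape(-1, 1)

# With the default test_size, each test fold gets 12 // (3 + 1) = 3 rows.
splitter = TimeSeriesSplit(n_splits=3)
folds = [(train.tolist(), test.tolist()) for train, test in splitter.split(X)]

for train_rows, test_rows in folds:
    print("train rows:", train_rows, "test rows:", test_rows)
```

Each fold is defined purely by row position: the splitter never looks at a timestamp column.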
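The problem can be illustrated with a small synthetic example: three stations reporting at the same four hours, sorted by timestamp. The rows and the choice of `n_splits=2` are assumptions made for the sketch, not the tutorial's actual data.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

# Three stations reporting at the same four hours (synthetic rows).
hours = pd.to_datetime(
    ["2013-01-01 00:00", "2013-01-01 01:00", "2013-01-01 02:00", "2013-01-01 03:00"]
)
weather = pd.DataFrame(
    [(station, hour) for hour in hours for station in ("EWR", "JFK", "LGA")],
    columns=["origin", "observation_time"],
)
weather = weather.sort_values("observation_time", kind="stable").reset_index(drop=True)

# Take the first fold: 4 train rows followed by 4 test rows.
splitter = TimeSeriesSplit(n_splits=2)
train_idx, test_idx = next(splitter.split(weather))

train_times = set(weather.iloc[train_idx]["observation_time"])
test_times = set(weather.iloc[test_idx]["observation_time"])

# Timestamps that appear on both sides of the split.
shared = sorted(train_times & test_times)
print(shared)
```

Because the fold boundary falls in the middle of the 01:00 observations, that hour ends up in both the train and test rows, even though the data is sorted by timestamp.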
Calculate the date to split on
Calculate the train-test cutoff date
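One way to pick the cutoff, sketched under assumptions: take a fixed fraction of the observed time span and round down to midnight. The 80% `train_fraction` and the three sample timestamps are hypothetical; the original tutorial may choose its cutoff differently.

```python
import pandas as pd

# Stand-in for the observation_time column (synthetic timestamps).
observation_time = pd.Series(
    pd.to_datetime(["2013-01-01 00:00", "2013-01-04 12:00", "2013-01-10 23:00"])
)

train_fraction = 0.8  # hypothetical share of the time span used for training

first = observation_time.min()
last = observation_time.max()

# Move train_fraction of the way through the observed time span.
cutoff = first + (last - first) * train_fraction

# Round down to midnight so the boundary falls on a whole date.
cutoff = cutoff.normalize()

print(cutoff)
```

The fraction applies to elapsed time, not to the number of rows, so the cutoff is unaffected by how many stations report in each hour.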
Create the train and test dataframes
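With a cutoff in hand, the split itself is a pair of boolean filters on the timestamp column. The sketch below uses synthetic rows and a hypothetical cutoff of 2013-01-08. Putting rows exactly at the cutoff into the test set is a choice made here, not something the text specifies.

```python
import pandas as pd

# Synthetic rows: three stations observed at three different times.
weather = pd.DataFrame(
    {
        "origin": ["EWR", "JFK", "LGA"] * 3,
        "observation_time": pd.to_datetime(
            ["2013-01-07 12:00"] * 3
            + ["2013-01-08 00:00"] * 3
            + ["2013-01-09 12:00"] * 3
        ),
    }
)

cutoff = pd.Timestamp("2013-01-08")  # hypothetical cutoff for illustration

# Everything strictly before the cutoff trains; the rest is held out.
train = weather[weather["observation_time"] < cutoff]
test = weather[weather["observation_time"] >= cutoff]

print(len(train), len(test))
```

Every station appears on both sides of the boundary, and no timestamp is shared between the two dataframes.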
The train and test datasets now contain all of the observation sites with no overlap in dates. These can now be used as the train and test sets in machine learning model training.