This tutorial explains how to create time series features with tsfresh using the Beijing Multi-Site Air-Quality Data downloaded from the UCI Machine Learning Repository.
The documentation for each package used in this tutorial is linked below:
Open up a new Jupyter notebook and import the following:
The zip file is downloaded from the UCI Machine Learning Repository with urllib and extracted with zipfile. It contains one CSV file per reporting station. Read each CSV file and concatenate the results into a single pandas DataFrame.
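The download-and-combine step could be sketched roughly as follows. The helper names `download_bytes` and `read_csvs_from_zip` are hypothetical, and the code assumes only that every member of the archive ending in `.csv` is a station file with the same columns. No download URL is hard-coded; take the link to the zip file from the dataset's page on the UCI repository.

```python
import io
import urllib.request
import zipfile

import pandas as pd


def download_bytes(url: str) -> bytes:
    """Fetch a URL and return the response body as raw bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def read_csvs_from_zip(zip_bytes: bytes) -> pd.DataFrame:
    """Read every CSV member of a zip archive into one DataFrame."""
    frames = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # Skip folders and any non-CSV members of the archive.
            if name.lower().endswith(".csv"):
                with archive.open(name) as handle:
                    frames.append(pd.read_csv(handle))
    # pd.concat is used because DataFrame.append was removed in pandas 2.0.
    return pd.concat(frames, ignore_index=True)
```

With the zip link stored in a variable such as `data_url` (a placeholder name), the combined frame would be built with `df = read_csvs_from_zip(download_bytes(data_url))`.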
tsfresh doesn't handle missing values well, so check the dataframe for missing values before extracting features.
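One way to run this check is to count the nulls per column with pandas. This is a minimal sketch; the helper name `missing_value_report` is hypothetical and not part of the original tutorial.

```python
import pandas as pd


def missing_value_report(df: pd.DataFrame) -> pd.Series:
    """Return the number of missing values per column, largest first.

    Columns without any missing values are left out of the report.
    """
    counts = df.isnull().sum()
    return counts[counts > 0].sort_values(ascending=False)
```

Calling `missing_value_report(df)` on the combined station dataframe lists only the columns that contain gaps. How to fill them (interpolation, forward fill, dropping rows) depends on the data and is left open here.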
You should see this in your output:
A dictionary of features and settings can also be created to control which features are calculated. Below is an example:
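A settings dictionary of this kind might look like the sketch below. The keys are names of tsfresh feature calculators; a value of `None` means the calculator takes no parameters, and a list holds one parameter dictionary per variant to compute. The particular features chosen, the helper `extract_with_settings`, and the sort column name `timestamp` are illustrative assumptions, not the original tutorial's choices. The dataframe passed in is assumed to hold only numeric value columns besides the id and sort columns.

```python
import pandas as pd

# Feature calculators to run, keyed by tsfresh calculator name.
fc_settings = {
    "mean": None,
    "maximum": None,
    "minimum": None,
    "standard_deviation": None,
    "quantile": [{"q": 0.25}, {"q": 0.75}],
    "autocorrelation": [{"lag": 1}, {"lag": 24}],
}


def extract_with_settings(df: pd.DataFrame, settings: dict) -> pd.DataFrame:
    """Extract only the features named in `settings`, one row per station."""
    # Imported here so the settings above can be inspected without tsfresh.
    from tsfresh import extract_features

    return extract_features(
        df,
        column_id="station",        # one output row per station
        column_sort="timestamp",    # assumed name of the time column
        default_fc_parameters=settings,
    )
```

Passing the dictionary through `default_fc_parameters` applies the same settings to every value column in the dataframe.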
The method above rolls all time series data up into a single record per column_id (the station, in this case). For time series, the summary often needs to be computed at each timestamp, using only the data observed up to that timestamp. roll_time_series creates a dataframe of such windows so that tsfresh can calculate the features at each timestamp correctly. The max_timeshift parameter caps how far back each window reaches.
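A rolling step along these lines could be written as the sketch below. The wrapper `roll_station_data` and the sort column name `timestamp` are assumptions made for illustration; `roll_time_series` and its `column_id`, `column_sort`, and `max_timeshift` arguments come from tsfresh.

```python
import pandas as pd


def roll_station_data(df: pd.DataFrame, max_timeshift: int) -> pd.DataFrame:
    """Build one sub-series per (station, timestamp) window."""
    # Imported here so the helper can be defined before tsfresh is installed.
    from tsfresh.utilities.dataframe_functions import roll_time_series

    return roll_time_series(
        df,
        column_id="station",          # windows never mix stations
        column_sort="timestamp",      # assumed name of the time column
        max_timeshift=max_timeshift,  # caps how far back a window reaches
    )
```

The rolled dataframe is much larger than the input because each row is repeated in every window that covers it, so a small `max_timeshift` keeps memory use and run time manageable.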
Now that the rolled dataframe has been created, extract_features can be run on it just as before.
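As a rough sketch, extraction on the rolled data might look like this. It assumes the rolled dataframe carries the window identifier in a column named `id`, as in tsfresh's documented rolling examples, and that the original `station` column is still present and should be dropped so it is not treated as a value column. `MinimalFCParameters` is used only to keep the example fast; the helper name and the `timestamp` column name are hypothetical.

```python
import pandas as pd


def extract_rolled_features(df_rolled: pd.DataFrame) -> pd.DataFrame:
    """Extract features for every window of a rolled dataframe."""
    # Imported here so the helper can be defined before tsfresh is installed.
    from tsfresh import extract_features
    from tsfresh.feature_extraction import MinimalFCParameters

    # The window id replaces the station as the grouping column.
    values = df_rolled.drop(columns=["station"], errors="ignore")
    return extract_features(
        values,
        column_id="id",             # assumed name of the window identifier
        column_sort="timestamp",    # assumed name of the time column
        default_fc_parameters=MinimalFCParameters(),
    )
```

The result has one row per window rather than one row per station, so the features can be lined up with the timestamp at which each window ends.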