This tutorial explains how to identify date gaps in time series data with pyrasgo.
Packages
This tutorial uses pandas, numpy, and pyrasgo.
Open a new Jupyter Notebook and import the following:
import pandas as pd
import numpy as np
import pyrasgo
Connect to Rasgo
If you haven't done so already, head over to https://docs.rasgoml.com/rasgo-docs/onboarding/initial-setup and follow the steps outlined there to create your free account. This account gives you free access to the Rasgo API, which will calculate dataframe profiles, generate feature importance scores, and produce feature explainability for your analysis. In addition, this account lets you keep access to your analysis and share it with your colleagues.
rasgo = pyrasgo.login(email='', password='')
Creating the data
We will create a dataframe that contains multiple time series, one for each group.
np.random.seed(1066)
dates = pd.date_range(start='2010-01-01', end='2010-12-31', freq='D')
# DataFrame.append was removed in pandas 2.0, so build the groups with pd.concat.
# The random values are drawn in the order A, B, C.
df = pd.concat([pd.DataFrame({'date': dates,
                              'group': group,
                              'value': np.random.randint(0, 100, size=len(dates))})
                for group in ['A', 'B', 'C']]).reset_index(drop=True)
df
Your dataframe should look like:
           date group  value
0    2010-01-01     A     57
1    2010-01-02     A     11
2    2010-01-03     A     83
3    2010-01-04     A     83
4    2010-01-05     A     93
...         ...   ...    ...
1090 2010-12-27     C     50
1091 2010-12-28     C     59
1092 2010-12-29     C     85
1093 2010-12-30     C     32
1094 2010-12-31     C      3
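Before introducing gaps, it can help to confirm that every series starts out complete. The following is a minimal sketch of such a check in plain pandas; the names check_df, rows_per_group, and has_duplicates are illustrative and not part of the tutorial's API.

```python
import numpy as np
import pandas as pd

# Rebuild the tutorial's dataframe (three groups, one row per day of 2010).
np.random.seed(1066)
dates = pd.date_range(start='2010-01-01', end='2010-12-31', freq='D')
check_df = pd.concat(
    [pd.DataFrame({'date': dates, 'group': g,
                   'value': np.random.randint(0, 100, size=len(dates))})
     for g in ['A', 'B', 'C']]
).reset_index(drop=True)

# 2010 is not a leap year, so each group should hold 365 daily rows.
rows_per_group = check_df.groupby('group').size()

# No (group, date) pair should appear more than once.
has_duplicates = check_df.duplicated(subset=['group', 'date']).any()

print(rows_per_group)
print('duplicates present:', has_duplicates)
```

If any group has fewer than 365 rows at this point, the data already contains gaps before any rows are dropped.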
Next, drop some rows at random to create gaps in the data. The 100 sampled row positions can repeat, so slightly fewer than 100 rows may be dropped.
length = df.shape[0]
droplist = np.unique(np.sort(np.random.randint(0, length, size=100))).tolist()
df = df.drop(droplist).reset_index(drop=True)
df
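To see what the drop step does to each series, one option is to compare every group against the full calendar. The sketch below uses a small stand-in dataframe (two groups, ten days) rather than the tutorial's data, and the names demo, rows_per_group, and missing_dates are hypothetical.

```python
import pandas as pd

# Small stand-in for the tutorial's dataframe: two groups, ten daily rows each.
dates = pd.date_range(start='2010-01-01', end='2010-01-10', freq='D')
demo = pd.concat(
    [pd.DataFrame({'date': dates, 'group': g, 'value': range(len(dates))})
     for g in ['A', 'B']]
).reset_index(drop=True)

# Drop three rows by position: rows 2 and 3 belong to A, row 15 belongs to B.
demo = demo.drop([2, 3, 15]).reset_index(drop=True)

# Rows remaining per group.
rows_per_group = demo.groupby('group').size()

# Calendar dates that are absent from each group.
missing_dates = {
    g: dates.difference(pd.DatetimeIndex(sub['date']))
    for g, sub in demo.groupby('group')
}

print(rows_per_group)
print(missing_dates)
```

The same comparison should work on the tutorial's df by swapping in its full-year dates index, assuming every series is expected to cover the whole of 2010.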
Identify Date Gaps
In a single series
The function evaluate.timeseries_gaps identifies date gaps in the data. To check a single series, filter the dataframe to one group before passing it in.
gaps = rasgo.evaluate.timeseries_gaps(df[df.group == 'A'], datetime_column='date', partition_columns=['group'])
gaps
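That should return something like the following, where TSGAPLastDate and TSGAPNextDate hold the previous and next dates present in the series for each returned row: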
          date group  value TSGAPLastDate TSGAPNextDate
0   2010-01-01     A     57           NaT    2010-01-02
38  2010-02-08     A     58    2010-02-07    2010-02-10
39  2010-02-10     A     97    2010-02-08    2010-02-11
43  2010-02-14     A     54    2010-02-13    2010-02-17
44  2010-02-17     A     93    2010-02-14    2010-02-19
45  2010-02-19     A     88    2010-02-17    2010-02-20
56  2010-03-02     A     93    2010-03-01    2010-03-04
57  2010-03-04     A     92    2010-03-02    2010-03-05
76  2010-03-23     A     21    2010-03-22    2010-03-25
77  2010-03-25     A     44    2010-03-23    2010-03-26
80  2010-03-28     A     10    2010-03-27    2010-03-30
81  2010-03-30     A     94    2010-03-28    2010-03-31
85  2010-04-03     A     47    2010-04-02    2010-04-05
86  2010-04-05     A      7    2010-04-03    2010-04-06
88  2010-04-07     A     67    2010-04-06    2010-04-10
89  2010-04-10     A     65    2010-04-07    2010-04-11
91  2010-04-12     A     75    2010-04-11    2010-04-15
92  2010-04-15     A     85    2010-04-12    2010-04-16
98  2010-04-21     A     24    2010-04-20    2010-04-23
99  2010-04-23     A      7    2010-04-21    2010-04-24
114 2010-05-08     A     89    2010-05-07    2010-05-10
115 2010-05-10     A     46    2010-05-08    2010-05-11
128 2010-05-23     A     45    2010-05-22    2010-05-25
129 2010-05-25     A     50    2010-05-23    2010-05-26
131 2010-05-27     A      3    2010-05-26    2010-05-29
132 2010-05-29     A     71    2010-05-27    2010-05-30
135 2010-06-01     A     67    2010-05-31    2010-06-03
136 2010-06-03     A     42    2010-06-01    2010-06-04
163 2010-06-30     A     83    2010-06-29    2010-07-02
164 2010-07-02     A     26    2010-06-30    2010-07-03
174 2010-07-12     A     30    2010-07-11    2010-07-14
175 2010-07-14     A     50    2010-07-12    2010-07-15
197 2010-08-05     A     95    2010-08-04    2010-08-07
198 2010-08-07     A      6    2010-08-05    2010-08-08
200 2010-08-09     A     21    2010-08-08    2010-08-11
201 2010-08-11     A     84    2010-08-09    2010-08-12
208 2010-08-18     A     49    2010-08-17    2010-08-20
209 2010-08-20     A     14    2010-08-18    2010-08-21
211 2010-08-22     A     23    2010-08-21    2010-08-24
212 2010-08-24     A     60    2010-08-22    2010-08-25
237 2010-09-18     A     88    2010-09-17    2010-09-20
238 2010-09-20     A     39    2010-09-18    2010-09-21
245 2010-09-27     A     94    2010-09-26    2010-09-29
246 2010-09-29     A     34    2010-09-27    2010-09-30
258 2010-10-11     A      2    2010-10-10    2010-10-13
259 2010-10-13     A     27    2010-10-11    2010-10-14
269 2010-10-23     A     68    2010-10-22    2010-10-25
270 2010-10-25     A     19    2010-10-23    2010-10-27
271 2010-10-27     A      3    2010-10-25    2010-10-28
290 2010-11-15     A     69    2010-11-14    2010-11-17
291 2010-11-17     A      1    2010-11-15    2010-11-18
296 2010-11-22     A     12    2010-11-21    2010-11-24
297 2010-11-24     A     39    2010-11-22    2010-11-26
298 2010-11-26     A     28    2010-11-24    2010-11-27
301 2010-11-29     A     35    2010-11-28    2010-12-01
302 2010-12-01     A      3    2010-11-29    2010-12-02
332 2010-12-31     A     70    2010-12-30           NaT
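For readers who want to see the idea behind this output without calling the Rasgo API, here is a rough pandas-only sketch that flags the rows bordering a gap in one daily series. It is not pyrasgo's implementation, only an approximation of what the table above suggests. The column names last_date and next_date and the toy data are assumptions made for illustration.

```python
import pandas as pd

# Toy daily series with 2010-01-03 and 2010-01-06/07 missing.
series = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-04',
                            '2010-01-05', '2010-01-08']),
    'value': [5, 3, 8, 1, 9],
}).sort_values('date').reset_index(drop=True)

# Previous and next observed dates for every row (hypothetical column names).
series['last_date'] = series['date'].shift(1)
series['next_date'] = series['date'].shift(-1)

one_day = pd.Timedelta(days=1)

# A row borders a gap when a neighbouring date is more than one day away.
# Comparisons against NaT evaluate to False, so the first and last rows are
# not flagged here, unlike in the pyrasgo output above.
borders_gap = (
    ((series['date'] - series['last_date']) > one_day)
    | ((series['next_date'] - series['date']) > one_day)
)
gap_rows = series[borders_gap]

print(gap_rows)
```

This sketch assumes a daily frequency; for other frequencies the one_day threshold would need to change.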
In multiple time series
To check several series at once, pass the series identifier (group in this case) to evaluate.timeseries_gaps through the partition_columns parameter. Each series is then checked for date gaps independently.
gaps = rasgo.evaluate.timeseries_gaps(df, datetime_column='date', partition_columns=['group'])
gaps
Your dataframe should look like this:
          date group  value TSGAPLastDate TSGAPNextDate
0   2010-01-01     A     57           NaT    2010-01-02
38  2010-02-08     A     58    2010-02-07    2010-02-10
39  2010-02-10     A     97    2010-02-08    2010-02-11
43  2010-02-14     A     54    2010-02-13    2010-02-17
44  2010-02-17     A     93    2010-02-14    2010-02-19
...        ...   ...    ...           ...           ...
973 2010-12-05     C     51    2010-12-04    2010-12-08
974 2010-12-08     C      1    2010-12-05    2010-12-09
984 2010-12-18     C     71    2010-12-17    2010-12-20
985 2010-12-20     C     39    2010-12-18    2010-12-21
996 2010-12-31     C      3    2010-12-30           NaT
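The per-series behaviour can be illustrated in plain pandas by shifting dates within each group, so that one series never borrows a neighbouring date from another. The sketch below is an assumption-laden illustration, not the pyrasgo implementation; prev_date, days_since_prev, missing_days, and gap_summary are hypothetical names, and the data is a toy example.

```python
import pandas as pd

# Two toy series: A is missing 2010-01-03/04, B is missing 2010-01-02.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-05',
                            '2010-01-01', '2010-01-03', '2010-01-04']),
    'group': ['A', 'A', 'A', 'B', 'B', 'B'],
}).sort_values(['group', 'date']).reset_index(drop=True)

# Shift within each group; the first row of every group gets NaT.
toy['prev_date'] = toy.groupby('group')['date'].shift(1)
toy['days_since_prev'] = (toy['date'] - toy['prev_date']).dt.days

# One row per gap: the group, the dates on either side, and the days missing.
gap_summary = (
    toy[toy['days_since_prev'] > 1]
    .assign(missing_days=lambda d: (d['days_since_prev'] - 1).astype(int))
    [['group', 'prev_date', 'date', 'missing_days']]
    .reset_index(drop=True)
)

print(gap_summary)
```

Without the groupby, the first row of group B would be compared with the last row of group A, which would report a spurious negative or oversized interval at every series boundary.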