Tutorials that help Data Scientists get their pandas on.

Data Cleaning

How to Create Lag Variables in Pandas

This tutorial explains how to create lag variables in pandas using gold and silver price data from rdatasets.

Packages

The documentation for each package used in this tutorial is linked below:


import statsmodels.api as sm
import pandas as pd

Create initial dataset

The data is from rdatasets imported using the Python package statsmodels.


df = sm.datasets.get_rdataset('GoldSilver', 'AER').data.reset_index().rename(columns={'index': 'date'})
df.info()


df['date'] = pd.to_datetime(df.date)

Lag Variables

Create lag variables, using the shift function. shift(1) creates a lag of a single record, while shift(5) creates a lag of five records.


df.sort_values('date', inplace=True)
df['silver_lag_1'] = df['silver'].shift(1)
df['silver_lag_5'] = df['silver'].shift(5)

df.head(20)

	date	gold	silver	silver_lag_1	silver_lag_5
0	1977-12-30	100.00	223.42	NaN	NaN
1	1978-01-02	100.00	223.42	223.42	NaN
2	1978-01-03	100.00	229.84	223.42	NaN
3	1978-01-04	100.00	224.58	229.84	NaN
4	1978-01-05	100.00	227.99	224.58	NaN
5	1978-01-06	100.00	227.19	227.99	223.42
6	1978-01-09	101.23	229.62	227.19	223.42
7	1978-01-10	100.95	228.97	229.62	229.84
8	1978-01-11	102.25	231.22	228.97	224.58
9	1978-01-12	100.88	227.89	231.22	227.99
10	1978-01-13	102.62	232.55	227.89	227.19
11	1978-01-16	103.09	232.49	232.55	229.62
12	1978-01-17	101.78	230.18	232.49	228.97
13	1978-01-18	100.87	227.96	230.18	231.22
14	1978-01-19	101.70	228.88	227.96	227.89
15	1978-01-20	101.72	228.92	228.88	232.55
16	1978-01-23	103.62	234.00	228.92	232.49
17	1978-01-24	103.59	233.63	234.00	230.18
18	1978-01-25	103.55	234.50	233.63	227.96
19	1978-01-26	103.28	231.49	234.50	228.88

This creates a lag variable based on the prior observations, but shift can also take a time offset to specify the time to use in shift. For example, 1D and 5D can be used to lag by 1 and 5 days respectively.

First, a datetime index must be created from date.


df.set_index('date', inplace=True)

df['silver_lag_1d'] = df['silver'].shift(freq='1D')
df['silver_lag_5d'] = df['silver'].shift(freq='5D')

df.head(20)

	gold	silver	silver_lag_1	silver_lag_5	silver_lag_1d	silver_lag_5d
date						
1977-12-30	100.00	223.42	NaN	NaN	NaN	NaN
1978-01-02	100.00	223.42	223.42	NaN	NaN	NaN
1978-01-03	100.00	229.84	223.42	NaN	223.42	NaN
1978-01-04	100.00	224.58	229.84	NaN	229.84	223.42
1978-01-05	100.00	227.99	224.58	NaN	224.58	NaN
1978-01-06	100.00	227.19	227.99	223.42	227.99	NaN
1978-01-09	101.23	229.62	227.19	223.42	NaN	224.58
1978-01-10	100.95	228.97	229.62	229.84	229.62	227.99
1978-01-11	102.25	231.22	228.97	224.58	228.97	227.19
1978-01-12	100.88	227.89	231.22	227.99	231.22	NaN
1978-01-13	102.62	232.55	227.89	227.19	227.89	NaN
1978-01-16	103.09	232.49	232.55	229.62	NaN	231.22
1978-01-17	101.78	230.18	232.49	228.97	232.49	227.89
1978-01-18	100.87	227.96	230.18	231.22	230.18	232.55
1978-01-19	101.70	228.88	227.96	227.89	227.96	NaN
1978-01-20	101.72	228.92	228.88	232.55	228.88	NaN
1978-01-23	103.62	234.00	228.92	232.49	NaN	227.96
1978-01-24	103.59	233.63	234.00	230.18	234.00	228.88
1978-01-25	103.55	234.50	233.63	227.96	233.63	228.92
1978-01-26	103.28	231.49	234.50	228.88	234.50	NaN
No items found.