I have a time-series dataset of products, as shown below:
date product price amount
11/17/2019 A 10 20
11/19/2019 A 15 20
11/24/2019 A 20 30
12/01/2019 C 40 50
12/05/2019 C 45 35
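The data is missing days between the start date and end date of each product's data (dates are in "MM/dd/YYYY" format). I am trying to fill the missing dates with rows of zeros, converting the table above into the one below: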
date product price amount
11/17/2019 A 10 20
11/18/2019 A 0 0
11/19/2019 A 15 20
11/20/2019 A 0 0
11/21/2019 A 0 0
11/22/2019 A 0 0
11/23/2019 A 0 0
11/24/2019 A 20 30
12/01/2019 C 40 50
12/02/2019 C 0 0
12/03/2019 C 0 0
12/04/2019 C 0 0
12/05/2019 C 45 35
To get this transformation, I used the following code:
import pandas as pd
import numpy as np
data=pd.read_csv("test.txt", sep="\t", parse_dates=['date'])
data=data.set_index(["date", "product"])
start=data.first_valid_index()[0]
end=data.last_valid_index()[0]
df=data.set_index("date").reindex(pd.date_range(start,end, freq="1D"), fill_values=0)
However, the code gives an error. Is there an efficient way to do this transformation?
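If you need to add rows of 0 for the missing datetimes of each product, use a custom function in GroupBy.apply that calls DataFrame.reindex over the range between each group's minimal and maximal datetime: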
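import pandas as pd
df = pd.read_csv("test.txt", sep="\t", parse_dates=['date'])
# reindex one product's rows onto every day between its first and last date
f = lambda x: x.reindex(pd.date_range(x.index.min(),
                                      x.index.max(), name='date'), fill_value=0)
df = (df.set_index('date')
        .groupby('product')
        .apply(f)
        # older pandas keeps a zero-filled 'product' column in each group; drop it if present
        .drop(columns='product', errors='ignore')
        .reset_index())
print (df)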
product date price amount
0 A 2019-11-17 10 20
1 A 2019-11-18 0 0
2 A 2019-11-19 15 20
3 A 2019-11-20 0 0
4 A 2019-11-21 0 0
5 A 2019-11-22 0 0
6 A 2019-11-23 0 0
7 A 2019-11-24 20 30
8 C 2019-12-01 40 50
9 C 2019-12-02 0 0
10 C 2019-12-03 0 0
11 C 2019-12-04 0 0
12 C 2019-12-05 45 35
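For anyone who wants to try the per-product reindex without test.txt, here is a minimal self-contained sketch. It is a variation rather than the exact code above: the sample frame is typed in by hand from the question, fill_missing_days is a hypothetical helper name, and it loops over the groups explicitly instead of using GroupBy.apply. It also assumes each product has at most one row per date, since reindex does not accept duplicate index labels.

```python
import pandas as pd

# Hand-typed copy of the question's sample data (stands in for test.txt).
sample = pd.DataFrame({
    "date": pd.to_datetime(
        ["11/17/2019", "11/19/2019", "11/24/2019", "12/01/2019", "12/05/2019"],
        format="%m/%d/%Y",
    ),
    "product": ["A", "A", "A", "C", "C"],
    "price": [10, 15, 20, 40, 45],
    "amount": [20, 20, 30, 50, 35],
})


def fill_missing_days(frame):
    """Return one row per calendar day per product, with zeros on the gap days."""
    pieces = []
    for product, group in frame.groupby("product"):
        # Every day between this product's first and last observation.
        days = pd.date_range(group["date"].min(), group["date"].max(), freq="D")
        filled = (
            group.set_index("date")[["price", "amount"]]
            .reindex(days, fill_value=0)  # gap days get price=0, amount=0
            .rename_axis("date")
            .reset_index()
        )
        filled.insert(1, "product", product)
        pieces.append(filled)
    return pd.concat(pieces, ignore_index=True)


filled = fill_missing_days(sample)
print(filled)
```

Because fill_value=0 is passed to reindex, the numeric columns stay integers; filling with fillna after the fact would first turn them into floats.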
Another option is to use the complete function from pyjanitor to expose the rows missing from each group:
#pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor
# build the dates to be applied per group
dates = dict(date = lambda df: pd.date_range(df.min(), df.max(), freq='1D'))
df.complete(dates, by='product', sort = True).fillna(0, downcast='infer')
date product price amount
0 2019-11-17 00:00:00 A 10 20
1 2019-11-18 00:00:00 A 0 0
2 2019-11-19 00:00:00 A 15 20
3 2019-11-20 00:00:00 A 0 0
4 2019-11-21 00:00:00 A 0 0
5 2019-11-22 00:00:00 A 0 0
6 2019-11-23 00:00:00 A 0 0
7 2019-11-24 00:00:00 A 20 30
8 2019-12-01 00:00:00 C 40 50
9 2019-12-02 00:00:00 C 0 0
10 2019-12-03 00:00:00 C 0 0
11 2019-12-04 00:00:00 C 0 0
12 2019-12-05 00:00:00 C 45 35
For this case, there is a simpler way:
from datetime import timedelta
#create the full date range, and then create a DataFrame with the range
#if needed, you can expand the range a bit using datetime.timedelta()
#(this assumes data is indexed by date only, not by the (date, product) pair)
alldates=pd.DataFrame(pd.date_range(data.index.min()-timedelta(1),data.index.max()+timedelta(4), freq="1D",name="newdate"))
#make 'newdate' the index, and you no longer need it as a column
alldates.index=alldates.newdate
alldates.drop(columns="newdate", inplace=True)
#now, join the tables, missing dates in the original table will be filled with NaN
data=alldates.join(data)
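Note that the join uses one global date range, so it fits a frame holding a single product (or one product at a time), and it leaves NaN rather than 0 on the missing days. A small sketch of the same idea carried through to zeros, assuming a hand-built frame for product A only (data_a and all_days are illustrative names, not from the code above):

```python
import pandas as pd

# Hypothetical single-product frame indexed by date (product A from the question).
data_a = pd.DataFrame(
    {"price": [10, 15, 20], "amount": [20, 20, 30]},
    index=pd.to_datetime(["2019-11-17", "2019-11-19", "2019-11-24"]).rename("date"),
)

# Column-less frame whose index is every calendar day in the observed range.
all_days = pd.DataFrame(
    index=pd.date_range(data_a.index.min(), data_a.index.max(), freq="D", name="date")
)

# The left join keeps every day; days with no match come back as NaN, which
# also turns the integer columns into floats, so fill with 0 and cast back.
joined = all_days.join(data_a).fillna(0).astype(int)
print(joined)
```

For several products in one frame, the same join would have to be repeated per product, which is what the groupby-based answers do.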