I have the following DataFrame of business-day data:
In [23]: d = pd.DataFrame({'date' : ['20070105', '20070105', '20070106', '20070106', '20070106', '20070109'], 's' : [1, 2, 1,2,3,1], 'i': ['a', 'b', 'a', 'b', 'c', 'a']})
In [26]: d['date'] = pd.to_datetime(d['date'], format='%Y%m%d')
In [27]: d
Out[27]:
date i s
0 2007-01-05 a 1
1 2007-01-05 b 2
2 2007-01-06 a 1
3 2007-01-06 b 2
4 2007-01-06 c 3
5 2007-01-09 a 1
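I want to fill in the data for the missing dates (according to an "all days" calendar), so the output should look like the following. Basically, 20070107 and 20070108 are missing, and their data is copied from 20070106.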
Out[31]:
date i s
0 2007-01-05 a 1
1 2007-01-05 b 2
2 2007-01-06 a 1
3 2007-01-06 b 2
4 2007-01-06 c 3
5 2007-01-07 a 1
6 2007-01-07 b 2
7 2007-01-07 c 3
8 2007-01-08 a 1
9 2007-01-08 b 2
10 2007-01-08 c 3
11 2007-01-09 a 1
What is the best way to do this in pandas?
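Use:
- set_index together with cumcount to get a unique MultiIndex
- reshape with unstack to get a unique DatetimeIndex
- reindex over the range from the minimal to the maximal date in the index, adding the missing dates with method='ffill'
- reshape back with stack
- finally, remove the second level of the MultiIndex and convert the DatetimeIndex to a column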
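To see why the counter from cumcount is needed, the first two steps can be inspected on their own. This is only a minimal sketch that rebuilds the question's data; the names counter, by_date, by_date_and_counter and wide are illustrative and not part of the original answer.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
d = pd.DataFrame({
    "date": ["20070105", "20070105", "20070106", "20070106", "20070106", "20070109"],
    "s": [1, 2, 1, 2, 3, 1],
    "i": ["a", "b", "a", "b", "c", "a"],
})
d["date"] = pd.to_datetime(d["date"], format="%Y%m%d")

# Number the rows within each date: 0, 1, 2, ...
counter = d.groupby("date").cumcount()
print(counter.tolist())  # [0, 1, 0, 1, 2, 0]

# The date alone repeats, so it is not a unique index ...
by_date = d.set_index("date")
# ... but the (date, counter) pair is unique.
by_date_and_counter = d.set_index(["date", counter])
print(by_date.index.is_unique, by_date_and_counter.index.is_unique)  # False True

# Unstacking the counter leaves one row per date, which is what reindex needs.
wide = by_date_and_counter.unstack()
print(wide.shape)  # (3, 6): 3 dates, 2 columns x 3 counter values
```

The wide frame has one row per observed date, so reindexing it over a daily date_range only has to add whole rows.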
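# Assumes `d` is the DataFrame built in the question.
df = (
    d.set_index(['date', d.groupby('date').cumcount()])  # unique (date, counter) MultiIndex
     .unstack()                                          # move the counter into the columns
     .reindex(pd.date_range(d['date'].min(), d['date'].max()), method='ffill')  # add missing days
     .stack()                                            # back to long format
     .reset_index(level=1, drop=True)                    # drop the counter level
     .rename_axis('date')
     .reset_index()                                      # turn the dates into a column
)
print(df)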
date i s
0 2007-01-05 a 1.0
1 2007-01-05 b 2.0
2 2007-01-06 a 1.0
3 2007-01-06 b 2.0
4 2007-01-06 c 3.0
5 2007-01-07 a 1.0
6 2007-01-07 b 2.0
7 2007-01-07 c 3.0
8 2007-01-08 a 1.0
9 2007-01-08 b 2.0
10 2007-01-08 c 3.0
11 2007-01-09 a 1.0
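Note that s comes back as float in the output above, because unstack introduces NaN values for dates that have fewer rows than others. If the integer dtype matters, one possible alternative (a different technique from the unstack/stack chain, shown here only as a sketch) is to map every calendar day to the most recent date that has data and then merge. The names all_days, observed, source, lookup, filled and the helper column source_date are hypothetical.

```python
import pandas as pd

# Rebuild the DataFrame from the question.
d = pd.DataFrame({
    "date": ["20070105", "20070105", "20070106", "20070106", "20070106", "20070109"],
    "s": [1, 2, 1, 2, 3, 1],
    "i": ["a", "b", "a", "b", "c", "a"],
})
d["date"] = pd.to_datetime(d["date"], format="%Y%m%d")

# Every calendar day between the first and the last observed date.
all_days = pd.date_range(d["date"].min(), d["date"].max())

# For each calendar day, look up the latest date that actually has rows.
observed = pd.DatetimeIndex(d["date"].unique()).sort_values()
source = pd.Series(observed, index=observed).reindex(all_days, method="ffill")

# Pair each calendar day with its source date, then pull in that date's rows.
lookup = pd.DataFrame({"date": all_days, "source_date": source.to_numpy()})
filled = (
    lookup.merge(d.rename(columns={"date": "source_date"}), on="source_date", how="left")
    .drop(columns="source_date")
    .loc[:, ["date", "i", "s"]]
)
print(filled)
```

Because every calendar day finds a source date, the merge introduces no missing values and s should stay an integer column. If you prefer to keep the unstack/stack chain, casting afterwards with .astype({'s': 'int64'}) is another option.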