This is a simple problem, but it has me stumped. I have a data frame consisting of daily observations of individuals. However, not all individuals were observed on every day. I need to create rows for the days between an individual's existing dates on which it was not observed (i.e., counting NaN or 0).
The relevant data look like this:
ID date StartDate EndDate Count Cov1 Cov2 Cov3
A 05/05/2005 04/04/05 06/06/2006 3 1 F 1
A 06/05/2005 04/04/05 06/06/2006 5 1 F 1
A 07/05/2005 04/04/05 06/06/2006 2 1 F 1
A 10/05/2005 04/04/05 06/06/2006 7 1 F 1
B 05/05/2005 04/04/05 06/06/2006 6 0 M 2
B 07/05/2005 04/04/05 06/06/2006 1 0 M 2
C 01/05/2005 04/04/05 06/06/2006 3 1 F 1
C 03/05/2005 04/04/05 06/06/2006 7 1 F 1
However, I need it to look like this:
ID date StartDate EndDate Count Cov1 Cov2 Cov3
A 05/05/2005 04/04/05 06/06/2006 3 1 F 1
A 06/05/2005 04/04/05 06/06/2006 5 1 F 1
A 07/05/2005 04/04/05 06/06/2006 2 1 F 1
A 08/05/2005 04/04/05 06/06/2006 0 1 F 1
A 09/05/2005 04/04/05 06/06/2006 0 1 F 1
A 10/05/2005 04/04/05 06/06/2006 7 1 F 1
B 05/05/2005 04/04/05 06/06/2006 6 0 M 2
B 06/05/2005 04/04/05 06/06/2006 0 0 M 2
B 07/05/2005 04/04/05 06/06/2006 1 0 M 2
C 01/05/2005 04/04/05 06/06/2006 3 1 F 1
C 02/05/2005 04/04/05 06/06/2006 0 1 F 1
C 03/05/2005 04/04/05 06/06/2006 7 1 F 1
So the data should be expanded between each individual's first and last dates, rather than only counting the dates originally present. However, the Count variable must not be copied forward, while all the other covariates must be.
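For reference, the sample frame above can be reconstructed like this (a sketch; all columns are assumed to be plain strings apart from Count, Cov1 and Cov3):

```python
import pandas as pd

# sample data transcribed from the table above
data = {
    'ID': ['A', 'A', 'A', 'A', 'B', 'B', 'C', 'C'],
    'date': ['05/05/2005', '06/05/2005', '07/05/2005', '10/05/2005',
             '05/05/2005', '07/05/2005', '01/05/2005', '03/05/2005'],
    'StartDate': ['04/04/05'] * 8,
    'EndDate': ['06/06/2006'] * 8,
    'Count': [3, 5, 2, 7, 6, 1, 3, 7],
    'Cov1': [1, 1, 1, 1, 0, 0, 1, 1],
    'Cov2': ['F', 'F', 'F', 'F', 'M', 'M', 'F', 'F'],
    'Cov3': [1, 1, 1, 1, 2, 2, 1, 1],
}
df = pd.DataFrame(data)
print(df.shape)  # (8, 8)
```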
First create a DatetimeIndex from the column date, so that DataFrame.asfreq can be used inside a custom lambda function per group; then drop the first level of the resulting MultiIndex and convert the index back to the column date; finally use Series.dt.strftime to restore the original DD/MM/YYYY format.
First, test for possible duplicated rows per ID and date:
print (df[df.duplicated(['ID','date'], keep=False)])
ID date StartDate EndDate Count Cov1 Cov2 Cov3
0 A 05/05/2005 04/04/05 06/06/2006 3 1 F 1
1 A 05/05/2005 04/04/05 06/06/2006 3 1 F 1
If duplicates exist, drop them:
import pandas as pd

df = df.drop_duplicates(['ID','date'])
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df1 = (df.set_index('date').groupby('ID')
.apply(lambda x: x.asfreq('D', method='ffill'))
.droplevel(0)
.reset_index())
print (df1)
date ID StartDate EndDate Count Cov1 Cov2 Cov3
0 2005-05-05 A 04/04/05 06/06/2006 3 1 F 1
1 2005-05-06 A 04/04/05 06/06/2006 5 1 F 1
2 2005-05-07 A 04/04/05 06/06/2006 2 1 F 1
3 2005-05-08 A 04/04/05 06/06/2006 2 1 F 1
4 2005-05-09 A 04/04/05 06/06/2006 2 1 F 1
5 2005-05-10 A 04/04/05 06/06/2006 7 1 F 1
6 2005-05-05 B 04/04/05 06/06/2006 6 0 M 2
7 2005-05-06 B 04/04/05 06/06/2006 6 0 M 2
8 2005-05-07 B 04/04/05 06/06/2006 1 0 M 2
9 2005-05-01 C 04/04/05 06/06/2006 3 1 F 1
10 2005-05-02 C 04/04/05 06/06/2006 3 1 F 1
11 2005-05-03 C 04/04/05 06/06/2006 7 1 F 1
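The asfreq('D', method='ffill') call forward-fills every column, which is why Count shows 2 on the gap days 08/05 and 09/05 for ID A; that is corrected in the next step. A tiny standalone illustration of the behaviour (toy data, not the question's frame):

```python
import pandas as pd

# two observations with a one-day gap between them
s = pd.Series([3, 7], index=pd.to_datetime(['2005-05-01', '2005-05-03']))

# asfreq('D') inserts the missing day; method='ffill' carries the last value forward
print(s.asfreq('D', method='ffill').tolist())  # [3, 3, 7]
```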
print (df1.index.name)
None
If in your real data the index name is ID instead of None, reset it with:
df1 = df1.rename_axis(None)
m = df1.merge(df, indicator=True, how='left')['_merge'].eq('left_only')
df1.loc[m, 'Count'] = 0
df1['date'] = df1['date'].dt.strftime('%d/%m/%Y')
print (df1)
date ID StartDate EndDate Count Cov1 Cov2 Cov3
0 05/05/2005 A 04/04/05 06/06/2006 3 1 F 1
1 06/05/2005 A 04/04/05 06/06/2006 5 1 F 1
2 07/05/2005 A 04/04/05 06/06/2006 2 1 F 1
3 08/05/2005 A 04/04/05 06/06/2006 0 1 F 1
4 09/05/2005 A 04/04/05 06/06/2006 0 1 F 1
5 10/05/2005 A 04/04/05 06/06/2006 7 1 F 1
6 05/05/2005 B 04/04/05 06/06/2006 6 0 M 2
7 06/05/2005 B 04/04/05 06/06/2006 0 0 M 2
8 07/05/2005 B 04/04/05 06/06/2006 1 0 M 2
9 01/05/2005 C 04/04/05 06/06/2006 3 1 F 1
10 02/05/2005 C 04/04/05 06/06/2006 0 1 F 1
11 03/05/2005 C 04/04/05 06/06/2006 7 1 F 1
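The zeroing of Count relies on merge with indicator=True: rows of the expanded frame that have no match in the original observations get _merge equal to 'left_only'. A minimal illustration on hypothetical toy frames:

```python
import pandas as pd

# expanded frame (left) and original observations (right)
left = pd.DataFrame({'ID': ['A', 'A', 'A'], 'date': ['01/05', '02/05', '03/05']})
right = pd.DataFrame({'ID': ['A', 'A'], 'date': ['01/05', '03/05']})

# indicator=True adds a _merge column marking where each row was found;
# with no 'on' given, the merge uses the common columns ID and date
m = left.merge(right, indicator=True, how='left')['_merge'].eq('left_only')
print(m.tolist())  # [False, True, False] -> '02/05' exists only in the expanded frame
```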
Another option is complete from pyjanitor, which abstracts away the process of exposing missing rows:
# pip install pyjanitor
import pandas as pd
import janitor
df = pd.read_clipboard() # ignore this, just me reading in the data
# convert the date column to datetime
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# create an anonymous function to generate all dates
# between the max and min date for each combination
# of ID, start date and end date
new_dates = {'date':lambda d: pd.date_range(d.min(), d.max(), freq='D')}
(df
.complete(new_dates,
by = ['ID','StartDate','EndDate'],
fill_value = {'Count':0})
.ffill(downcast='infer')
)
ID date StartDate EndDate Count Cov1 Cov2 Cov3
0 A 2005-05-05 04/04/05 06/06/2006 3 1 F 1
1 A 2005-05-06 04/04/05 06/06/2006 5 1 F 1
2 A 2005-05-07 04/04/05 06/06/2006 2 1 F 1
3 A 2005-05-08 04/04/05 06/06/2006 0 1 F 1
4 A 2005-05-09 04/04/05 06/06/2006 0 1 F 1
5 A 2005-05-10 04/04/05 06/06/2006 7 1 F 1
6 B 2005-05-05 04/04/05 06/06/2006 6 0 M 2
7 B 2005-05-06 04/04/05 06/06/2006 0 0 M 2
8 B 2005-05-07 04/04/05 06/06/2006 1 0 M 2
9 C 2005-05-01 04/04/05 06/06/2006 3 1 F 1
10 C 2005-05-02 04/04/05 06/06/2006 0 1 F 1
11 C 2005-05-03 04/04/05 06/06/2006 7 1 F 1