Pandas:如何在日期之间扩展数据框并向新行添加nan



这是一个简单的问题,但它难倒了我。我有一个数据框架,由个人每天的观察组成。然而,并不是所有的人都在同一天被观察。我需要创建行,包括在他们存在的日期之间未观察到个人的天数(即,计数NaN或0)。

相关数据如下

ID  date        StartDate  EndDate    Count    Cov1  Cov2 Cov3
A   05/05/2005    04/04/05   06/06/2006   3      1     F    1   
A   06/05/2005    04/04/05   06/06/2006   5      1     F    1
A   07/05/2005    04/04/05   06/06/2006   2      1     F    1
A   10/05/2005    04/04/05   06/06/2006   7      1     F    1
B   05/05/2005    04/04/05   06/06/2006   6      0     M    2   
B   07/05/2005    04/04/05   06/06/2006   1      0     M    2
C   01/05/2005    04/04/05   06/06/2006   3      1     F    1
C   03/05/2005    04/04/05   06/06/2006   7      1     F    1

但是,我需要它看起来像这样:

ID  date        StartDate  EndDate    Count    Cov1  Cov2 Cov3
A   05/05/2005    04/04/05   06/06/2006   3      1     F    1   
A   06/05/2005    04/04/05   06/06/2006   5      1     F    1
A   07/05/2005    04/04/05   06/06/2006   2      1     F    1
A   08/05/2005    04/04/05   06/06/2006   0      1     F    1
A   09/05/2005    04/04/05   06/06/2006   0      1     F    1
A   10/05/2005    04/04/05   06/06/2006   7      1     F    1
B   05/05/2005    04/04/05   06/06/2006   6      0     M    2   
B   06/05/2005    04/04/05   06/06/2006   0      0     M    2   
B   07/05/2005    04/04/05   06/06/2006   1      0     M    2
C   01/05/2005    04/04/05   06/06/2006   3      1     F    1
C   02/05/2005    04/04/05   06/06/2006   0      1     F    1
C   03/05/2005    04/04/05   06/06/2006   7      1     F    1

因此,数据应该在开始日期和结束日期之间扩展,而不是在原始日期中计算。但是,count变量不能复制,而所有其他协变量都必须复制。

首先通过列date创建DatetimeIndex,因此可以使用DataFrame.asfreq的自定义lambda函数,删除MultiIndex的第一层并将索引转换为列date,最后使用Series.dt.strftime用于原始格式DD/MM/YYYY:

第一个是可能的测试重复行ID, date:

print (df[df.duplicated(['ID','date'], keep=False)])
ID        date StartDate     EndDate  Count  Cov1 Cov2  Cov3
0  A  05/05/2005  04/04/05  06/06/2006      3     1    F     1
1  A  05/05/2005  04/04/05  06/06/2006      3     1    F     1

如果可能,删除重复项:

df = df.drop_duplicates(['ID','date'])

df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df1 = (df.set_index('date').groupby('ID')
.apply(lambda x: x.asfreq('D', method='ffill'))
.droplevel(0)
.reset_index())
print (df1)
date ID StartDate     EndDate  Count  Cov1 Cov2  Cov3
0  2005-05-05  A  04/04/05  06/06/2006      3     1    F     1
1  2005-05-06  A  04/04/05  06/06/2006      5     1    F     1
2  2005-05-07  A  04/04/05  06/06/2006      2     1    F     1
3  2005-05-08  A  04/04/05  06/06/2006      2     1    F     1
4  2005-05-09  A  04/04/05  06/06/2006      2     1    F     1
5  2005-05-10  A  04/04/05  06/06/2006      7     1    F     1
6  2005-05-05  B  04/04/05  06/06/2006      6     0    M     2
7  2005-05-06  B  04/04/05  06/06/2006      6     0    M     2
8  2005-05-07  B  04/04/05  06/06/2006      1     0    M     2
9  2005-05-01  C  04/04/05  06/06/2006      3     1    F     1
10 2005-05-02  C  04/04/05  06/06/2006      3     1    F     1
11 2005-05-03  C  04/04/05  06/06/2006      7     1    F     1
print (df1.index.name)
None

如果在实际数据中可能是ID使用:

df1 = df1.rename_axis(None)

m = df1.merge(df, indicator=True, how='left')['_merge'].eq('left_only')
df1.loc[m, 'Count'] = 0
df1['date'] = df1['date'].dt.strftime('%d/%m/%Y')
print (df1)
date ID StartDate     EndDate  Count  Cov1 Cov2  Cov3
0   05/05/2005  A  04/04/05  06/06/2006      3     1    F     1
1   06/05/2005  A  04/04/05  06/06/2006      5     1    F     1
2   07/05/2005  A  04/04/05  06/06/2006      2     1    F     1
3   08/05/2005  A  04/04/05  06/06/2006      0     1    F     1
4   09/05/2005  A  04/04/05  06/06/2006      0     1    F     1
5   10/05/2005  A  04/04/05  06/06/2006      7     1    F     1
6   05/05/2005  B  04/04/05  06/06/2006      6     0    M     2
7   06/05/2005  B  04/04/05  06/06/2006      0     0    M     2
8   07/05/2005  B  04/04/05  06/06/2006      1     0    M     2
9   01/05/2005  C  04/04/05  06/06/2006      3     1    F     1
10  02/05/2005  C  04/04/05  06/06/2006      0     1    F     1
11  03/05/2005  C  04/04/05  06/06/2006      7     1    F     1

一个选项是with complete from pyjanitor,它抽象了暴露缺失行的过程:

# pip install pyjanitor
import pandas as pd
import janitor
df = pd.read_clipboard() # ignore this, just me reading in the data
# converting the date to date time, 
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
# create an anonymous function to generate all dates 
# between the max and min date for each combination 
# of ID, start date and end date
new_dates = {'date':lambda d: pd.date_range(d.min(), d.max(), freq='D')}
(df
.complete(new_dates, 
by = ['ID','StartDate','EndDate'], 
fill_value = {'Count':0})
.ffill(downcast='infer')
)
ID       date StartDate     EndDate  Count  Cov1 Cov2  Cov3
0   A 2005-05-05  04/04/05  06/06/2006      3     1    F     1
1   A 2005-05-06  04/04/05  06/06/2006      5     1    F     1
2   A 2005-05-07  04/04/05  06/06/2006      2     1    F     1
3   A 2005-05-08  04/04/05  06/06/2006      0     1    F     1
4   A 2005-05-09  04/04/05  06/06/2006      0     1    F     1
5   A 2005-05-10  04/04/05  06/06/2006      7     1    F     1
6   B 2005-05-05  04/04/05  06/06/2006      6     0    M     2
7   B 2005-05-06  04/04/05  06/06/2006      0     0    M     2
8   B 2005-05-07  04/04/05  06/06/2006      1     0    M     2
9   C 2005-05-01  04/04/05  06/06/2006      3     1    F     1
10  C 2005-05-02  04/04/05  06/06/2006      0     1    F     1
11  C 2005-05-03  04/04/05  06/06/2006      7     1    F     1

最新更新