Pandas:根据来自多个数据框架的信息过滤数据



我正在分析临床数据,并试图根据来自另一个数据框的信息过滤掉一个数据框中的信息。

其中一个数据框列出了患者来治疗的日期

dfTreatments = pd.DataFrame({'PatientID': [4,4,4,9,9,9,11,11,11], 'TreatmentDate': ['2016-01-01', '2016-01-15', '2016-03-25','2016-01-01','2016-01-15','2016-01-29','2016-01-01','2016-03-15','2016-03-25']})
dfTreatments['TreatmentDate'] = pd.to_datetime(dfTreatments['TreatmentDate'])
   PatientID TreatmentDate
0          4    2016-01-01
1          4    2016-01-15
2          4    2016-03-25
3          9    2016-01-01
4          9    2016-01-15
5          9    2016-01-29
6         11    2016-01-01
7         11    2016-03-15
8         11    2016-03-25

和其他数据框列出了患者因并发症到医院就诊的日期。

dfHospitalVisits = pd.DataFrame({'PatientID': [4,4,9,11], 'HospitalVisitDate': ['2016-01-14','2016-03-10','2016-01-28','2016-01-03']})
dfHospitalVisits['HospitalVisitDate'] = pd.to_datetime(dfHospitalVisits['HospitalVisitDate'])
  HospitalVisitDate  PatientID
0        2016-01-14          4
1        2016-03-10          4
2        2016-01-28          9
3        2016-01-03         11

在我们的研究中,如果患者在20天内未接受治疗,我们希望将住院就诊排除在我们的分析之外。我们在间隔20天前的最后一次治疗开始排除他们。例:我们将排除病人4在2016-01-15之后的任何住院检查。

在本例中,患者4的第二次医院就诊患者11的医院就诊将从dfhospitalrisks中删除。

编辑:@Merlin,到目前为止,我使用dfTreatments.groupby('PatientID')['TreatmentDate'].diff()来获得按患者分组的治疗日期的差距。我被卡住的部分是,我不知道如何使用>20的治疗日期差异来过滤dfhospital探视中的值。

我建议如下:

# Make a sorted dataframe to calculate the time gap before the next treatment 
dfTreatments_sorted = dfTreatments.sort_values(['PatientID','TreatmentDate'], ascending=False)
# Calculate the time gap before the next treatment
df_diff = dfTreatments_sorted.groupby('PatientID').TreatmentDate.diff(periods=1).rename('Gap_before_next_treatment')
# Add the time gaps as a new column to your existing dfTreatments dataframe
dfTreatments = pd.concat([dfTreatments, -df_diff], axis=1, join='inner').sort_index()
# Join dfTreatments and dfHospitalVisits into new dataframe (df)
df = dfHospitalVisits.set_index('PatientID').join(dfTreatments.set_index('PatientID'))
# Select combination where TreatmentDate is before corresponding HospitalVisitDate
df = df[(df.HospitalVisitDate>df.TreatmentDate)]
# The TreatmentDate that is important is latest before the HospitalVisitDate
df = df.reset_index().groupby(['PatientID','HospitalVisitDate']).max()
# Now you can filter hospital visits given the calculated time gap
df = df[df.Gap_before_next_treatment<'20 days'].reset_index()[['PatientID','HospitalVisitDate']]

最新更新