I have two dataframes that store different types of patient medical information. The elements common to both are the encounter ID (`hadm_id`) and the time the information was recorded (`ce_charttime` / `ne_charttime`). One dataframe (`df_str`) contains structured information such as vital signs and lab test values, along with values derived from them (e.g., statistics of changes over 24 hours). The other dataframe (`df_notes`) contains a column with the clinical notes recorded at the given times. Both dataframes contain multiple encounters, keyed by the encounter ID (`hadm_id`).

Here is an example of each dataframe, with a subset of variables, for a single encounter ID (`hadm_id`):
df_str
hadm_id ce_charttime hr resp magnesium hr_24hr_mean
0 196673 2108-03-05 15:34:00 95.0 12.0 NaN 95.000000
1 196673 2108-03-05 16:00:00 85.0 11.0 NaN 90.000000
2 196673 2108-03-05 16:16:00 85.0 11.0 1.8 88.333333
3 196673 2108-03-05 17:00:00 109.0 12.0 1.8 93.500000
4 196673 2108-03-05 18:00:00 97.0 12.0 1.8 94.200000
5 196673 2108-03-05 19:00:00 99.0 16.0 1.8 95.000000
6 196673 2108-03-05 20:00:00 98.0 13.0 1.8 95.428571
7 196673 2108-03-05 21:00:00 97.0 14.0 1.8 95.625000
8 196673 2108-03-05 22:00:00 101.0 12.0 1.8 96.222222
9 196673 2108-03-05 23:00:00 97.0 13.0 1.8 96.300000
10 196673 2108-03-06 00:00:00 93.0 13.0 1.8 96.000000
11 196673 2108-03-06 01:00:00 89.0 12.0 1.8 95.416667
12 196673 2108-03-06 02:00:00 88.0 10.0 1.8 94.846154
13 196673 2108-03-06 03:00:00 87.0 12.0 1.8 94.285714
14 196673 2108-03-06 04:00:00 97.0 19.0 1.8 94.466667
15 196673 2108-03-06 05:00:00 95.0 11.0 1.8 94.500000
16 196673 2108-03-06 05:43:00 95.0 11.0 2.0 94.529412
17 196673 2108-03-06 06:00:00 103.0 17.0 2.0 95.000000
18 196673 2108-03-06 07:00:00 101.0 12.0 2.0 95.315789
19 196673 2108-03-06 08:00:00 103.0 20.0 2.0 95.700000
20 196673 2108-03-06 09:00:00 84.0 11.0 2.0 95.142857
21 196673 2108-03-06 10:00:00 89.0 11.0 2.0 94.863636
22 196673 2108-03-06 11:00:00 91.0 14.0 2.0 94.695652
23 196673 2108-03-06 12:00:00 85.0 10.0 2.0 94.291667
24 196673 2108-03-06 13:00:00 98.0 14.0 2.0 94.440000
25 196673 2108-03-06 14:00:00 100.0 18.0 2.0 94.653846
26 196673 2108-03-06 15:00:00 95.0 12.0 2.0 94.666667
27 196673 2108-03-06 16:00:00 96.0 20.0 2.0 95.076923
28 196673 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
df_notes
hadm_id ne_charttime note
0 196673 2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ...
1 196673 2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note...
2 196673 2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note...
3 196673 2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (...
4 196673 2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n...
5 196673 2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain...
6 196673 2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (...
7 196673 2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain...
8 196673 2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (...
9 196673 2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:...
10 196673 2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON...
11 196673 2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3...
12 196673 2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*...
13 196673 2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3...
14 196673 2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5...
What I want to do is merge the two dataframes based on the time the information was recorded. More specifically, for each row in `df_notes`, I want the corresponding row of `df_str` with the latest `ce_charttime <= ne_charttime`.

For example, the first row of `df_notes` has `ne_charttime = 2108-03-05 16:54:00`. `df_str` has three rows with chart times earlier than this: `ce_charttime = 2108-03-05 15:34:00`, `ce_charttime = 2108-03-05 16:00:00`, and `ce_charttime = 2108-03-05 16:16:00`. The latest of these is the row with `ce_charttime = 2108-03-05 16:16:00`, so in my resulting dataframe, for `ne_charttime = 2108-03-05 16:54:00`, I would have `hr = 85.0, resp = 11.0, magnesium = 1.8, hr_24hr_mean = 88.33`.

Essentially, for this example the resulting dataframe would look like this:
hadm_id ne_charttime note hr resp magnesium hr_24hr_mean
0 196673 2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ... 85.0 11.0 1.8 88.333333
1 196673 2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note... 109.0 12.0 1.8 93.500000
2 196673 2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note... 97.0 12.0 1.8 94.200000
3 196673 2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (... 103.0 17.0 2.0 95.000000
4 196673 2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n... 103.0 20.0 2.0 95.700000
5 196673 2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain... 85.0 10.0 2.0 94.291667
6 196673 2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (... 98.0 14.0 2.0 94.440000
7 196673 2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain... 106.0 21.0 2.0 95.360000
8 196673 2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (... NaN NaN NaN NaN
9 196673 2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:... NaN NaN NaN NaN
10 196673 2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON... NaN NaN NaN NaN
11 196673 2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... NaN NaN NaN NaN
12 196673 2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*... NaN NaN NaN NaN
13 196673 2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... NaN NaN NaN NaN
14 196673 2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5... NaN NaN NaN NaN
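The "latest `ce_charttime <= ne_charttime`" lookup for a single note time can be illustrated with `searchsorted` on a sorted datetime index (a minimal sketch using the first three chart times from above, not the full merge):

```python
import pandas as pd

# Sorted structured chart times and one note time (values from the example above)
ce_times = pd.to_datetime(['2108-03-05 15:34:00',
                           '2108-03-05 16:00:00',
                           '2108-03-05 16:16:00'])
ne_time = pd.Timestamp('2108-03-05 16:54:00')

# Position of the latest ce_charttime <= ne_charttime:
# side='right' gives the insertion point after any equal element, so -1 steps back
idx = ce_times.searchsorted(ne_time, side='right') - 1
print(ce_times[idx])  # 2108-03-05 16:16:00
```

This is only the per-row idea; the merge itself is what the accepted answer below vectorizes.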
The resulting dataframe will be the same length as `df_notes`. I was able to get this result with a very inefficient piece of code using a `for` loop and explicit indexing:
import numpy as np

cols = list(df_str.columns[2:])
final_df = df_notes.copy()
for col in cols:
    final_df[col] = np.nan

idx = 0
for i, note_row in final_df.iterrows():
    ne = note_row['ne_charttime']
    for j, str_row in df_str.iterrows():
        ce = str_row['ce_charttime']
        if ne < ce:
            idx += 1
            for col in cols:
                final_df.iloc[i, final_df.columns.get_loc(col)] = df_str.iloc[j-1][col]
            break

for col in cols:
    final_df.iloc[idx, final_df.columns.get_loc(col)] = df_str.iloc[-1][col]
This code is bad because it is extremely inefficient. While it may work for this example, my full dataset has over 30 different structured-variable columns and over 10,000 encounters.

EDIT-2: @Stef provided a great answer that seems to work, replacing my painstakingly crafted loop code with a single line (amazing). However, while it works for this particular example, I ran into problems when applying it to a larger subset containing multiple encounters. For example, consider the following:
df_str.shape, df_notes.shape
((217, 386), (35, 4))
df_notes[['hadm_id', 'ne_charttime']]
hadm_id ne_charttime
0 100104 2201-06-21 20:00:00
1 100104 2201-06-21 22:51:00
2 100104 2201-06-22 05:00:00
3 100104 2201-06-23 04:33:00
4 100104 2201-06-23 12:59:00
5 100104 2201-06-24 05:15:00
6 100372 2115-12-20 02:29:00
7 100372 2115-12-21 10:15:00
8 100372 2115-12-22 13:05:00
9 100372 2115-12-25 17:16:00
10 100372 2115-12-30 10:58:00
11 100372 2115-12-30 13:07:00
12 100372 2115-12-30 14:16:00
13 100372 2115-12-30 22:34:00
14 100372 2116-01-03 09:10:00
15 100372 2116-01-07 11:08:00
16 100975 2126-03-02 06:06:00
17 100975 2126-03-02 17:44:00
18 100975 2126-03-03 05:36:00
19 100975 2126-03-03 18:27:00
20 100975 2126-03-04 05:29:00
21 100975 2126-03-04 10:48:00
22 100975 2126-03-04 16:42:00
23 100975 2126-03-05 22:12:00
24 100975 2126-03-05 23:01:00
25 100975 2126-03-06 11:02:00
26 100975 2126-03-06 13:38:00
27 100975 2126-03-08 13:39:00
28 100975 2126-03-11 10:41:00
29 101511 2199-04-30 09:29:00
30 101511 2199-04-30 09:53:00
31 101511 2199-04-30 18:06:00
32 101511 2199-05-01 08:28:00
33 111073 2195-05-01 01:56:00
34 111073 2195-05-01 21:49:00
This example has 5 encounters. The dataframe is sorted by `hadm_id`, and `ne_charttime` is sorted within each `hadm_id`. However, the `ne_charttime` column itself is not sorted overall, as row 0 (`ne_charttime = 2201-06-21 20:00:00`) and row 6 (`ne_charttime = 2115-12-20 02:29:00`) show. When I try to perform the `merge_asof`, I get the following error:

ValueError: left keys must be sorted

Is this because the `ne_charttime` column is not sorted? If so, how can I correct this while keeping the encounter-ID groups intact?
EDIT-1: I was also able to loop over the encounters:
cols = list(dev_str.columns[1:])  # get the cols to merge (everything except hadm_id)
final_dfs = []
grouped = dev_notes.groupby('hadm_id')  # get groups of encounter ids
for name, group in grouped:
    final_df = group.copy().reset_index(drop=True)  # make a copy of the notes for this encounter
    for col in cols:
        final_df[col] = np.nan  # set the values to nan
    idx = 0  # index tracking the final row in the given encounter
    sub = dev_str.loc[dev_str['hadm_id'] == name].reset_index(drop=True)  # get the df corresponding to the encounter
    for i, note_row in final_df.iterrows():
        ne = note_row['ne_charttime']
        for j, str_row in sub.iterrows():
            ce = str_row['ce_charttime']
            if ne < ce:  # if the note charttime < the variable charttime
                idx += 1
                # grab the previous values for the variables and break
                for col in cols:
                    final_df.iloc[i, final_df.columns.get_loc(col)] = sub.iloc[j-1][col]
                break
    # get the last values in the df for the variables
    for col in cols:
        final_df.iloc[idx, final_df.columns.get_loc(col)] = sub.iloc[-1][col]
    final_dfs.append(final_df)  # append the df to the list

# concat the list to get the final df and reset the index
final_df = pd.concat(final_dfs)
final_df.reset_index(inplace=True, drop=True)
Again, extremely inefficient, but it gets the job done.

Is there a better way to achieve what I want? Any help is appreciated. Thanks.
You can use `merge_asof` (both dataframes must be sorted by the columns you are merging them on, which is already the case in your example):
final_df = pd.merge_asof(df_notes, df_str, left_on='ne_charttime', right_on='ce_charttime', by='hadm_id')
Result:
hadm_id ne_charttime note ce_charttime hr resp magnesium hr_24hr_mean
0 196673 2108-03-05 16:54:00 Nursing\nNursing Progress Note\nPt is a 43 yo ... 2108-03-05 16:16:00 85.0 11.0 1.8 88.333333
1 196673 2108-03-05 17:54:00 Physician \nPhysician Resident Admission Note... 2108-03-05 17:00:00 109.0 12.0 1.8 93.500000
2 196673 2108-03-05 18:09:00 Physician \nPhysician Resident Admission Note... 2108-03-05 18:00:00 97.0 12.0 1.8 94.200000
3 196673 2108-03-06 06:11:00 Nursing\nNursing Progress Note\nPain control (... 2108-03-06 06:00:00 103.0 17.0 2.0 95.000000
4 196673 2108-03-06 08:06:00 Physician \nPhysician Resident Progress Note\n... 2108-03-06 08:00:00 103.0 20.0 2.0 95.700000
5 196673 2108-03-06 12:40:00 Nursing\nNursing Progress Note\nChief Complain... 2108-03-06 12:00:00 85.0 10.0 2.0 94.291667
6 196673 2108-03-06 13:01:00 Nursing\nNursing Progress Note\nPain control (... 2108-03-06 13:00:00 98.0 14.0 2.0 94.440000
7 196673 2108-03-06 17:09:00 Nursing\nNursing Transfer Note\nChief Complain... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
8 196673 2108-03-06 17:12:00 Nursing\nNursing Transfer Note\nPain control (... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
9 196673 2108-03-07 15:25:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-7**] 3:... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
10 196673 2108-03-07 18:34:00 Radiology\nCTA CHEST W&W/O C&RECONS, NON-CORON... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
11 196673 2108-03-09 09:10:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
12 196673 2108-03-09 12:22:00 Radiology\nCT ABDOMEN W/CONTRAST\n[**2108-3-9*... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
13 196673 2108-03-10 05:26:00 Radiology\nABDOMEN (SUPINE & ERECT)\n[**2108-3... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
14 196673 2108-03-10 05:27:00 Radiology\nCHEST (PA & LAT)\n[**2108-3-10**] 5... 2108-03-06 17:00:00 106.0 21.0 2.0 95.360000
PS: This gives you the correct result for all rows. There is a logical flaw in your code: you look for the first `ce_charttime` > `ne_charttime` and then take the previous row. If there is no such time, you never get the chance to take a previous row, hence the `NaN`s in your result table from row 8 on.
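This backward-fill behaviour can be seen on a tiny made-up example: for a note recorded after the last `ce_charttime`, `merge_asof` still carries the last available structured row forward instead of leaving `NaN`:

```python
import pandas as pd

# Made-up minimal frames: one encounter, two structured rows, one late note
df_str = pd.DataFrame({'hadm_id': [1, 1],
                       'ce_charttime': pd.to_datetime(['2108-03-05 15:00',
                                                       '2108-03-05 17:00']),
                       'hr': [95.0, 106.0]})
df_notes = pd.DataFrame({'hadm_id': [1],
                         'ne_charttime': pd.to_datetime(['2108-03-06 12:00'])})

# The note time is after every ce_charttime; the last row is still matched
out = pd.merge_asof(df_notes, df_str, left_on='ne_charttime',
                    right_on='ce_charttime', by='hadm_id')
print(out['hr'].iloc[0])  # 106.0, not NaN
```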
PPS: This includes `ce_charttime` in the final dataframe. You can replace it with a column for the age of the information and/or drop it:
final_df['info_age'] = final_df.ne_charttime - final_df.ce_charttime
final_df = final_df.drop(columns='ce_charttime')
Update for EDIT-2: As I wrote at the very beginning, repeated in the comments, and as the docs state explicitly: both `ce_charttime` and `ne_charttime` must be sorted (`hadm_id` need not be sorted). If this condition is not met, you have to (temporarily) sort the dataframes as required. See the following example:
import string
import numpy as np
import pandas as pd

df_str = pd.DataFrame({'hadm_id': np.tile([111111, 222222], 10),
                       'ce_charttime': pd.date_range('2019-10-01 00:30', periods=20, freq='30T'),
                       'hr': np.random.randint(80, 120, 20)})
df_notes = pd.DataFrame({'hadm_id': np.tile([111111, 222222], 3),
                         'ne_charttime': pd.date_range('2019-10-01 00:45', periods=6, freq='40T'),
                         'note': [''.join(np.random.choice(list(string.ascii_letters), 10)) for _ in range(6)]
                        }).sort_values('hadm_id')
final_df = pd.merge_asof(df_notes.sort_values('ne_charttime'), df_str,
                         left_on='ne_charttime', right_on='ce_charttime',
                         by='hadm_id').sort_values(['hadm_id', 'ne_charttime'])
print(df_str); print(df_notes); print(final_df)
Output:
hadm_id ce_charttime hr
0 111111 2019-10-01 00:30:00 118
1 222222 2019-10-01 01:00:00 93
2 111111 2019-10-01 01:30:00 92
3 222222 2019-10-01 02:00:00 86
4 111111 2019-10-01 02:30:00 88
5 222222 2019-10-01 03:00:00 86
6 111111 2019-10-01 03:30:00 106
7 222222 2019-10-01 04:00:00 91
8 111111 2019-10-01 04:30:00 109
9 222222 2019-10-01 05:00:00 95
10 111111 2019-10-01 05:30:00 113
11 222222 2019-10-01 06:00:00 92
12 111111 2019-10-01 06:30:00 104
13 222222 2019-10-01 07:00:00 83
14 111111 2019-10-01 07:30:00 114
15 222222 2019-10-01 08:00:00 98
16 111111 2019-10-01 08:30:00 110
17 222222 2019-10-01 09:00:00 89
18 111111 2019-10-01 09:30:00 98
19 222222 2019-10-01 10:00:00 109
hadm_id ne_charttime note
0 111111 2019-10-01 00:45:00 jOcRWVdPDF
2 111111 2019-10-01 02:05:00 mvScJNrwra
4 111111 2019-10-01 03:25:00 FBAFbJYflE
1 222222 2019-10-01 01:25:00 ilNuInOsYZ
3 222222 2019-10-01 02:45:00 ysyolaNmkV
5 222222 2019-10-01 04:05:00 wvowGGETaP
hadm_id ne_charttime note ce_charttime hr
0 111111 2019-10-01 00:45:00 jOcRWVdPDF 2019-10-01 00:30:00 118
2 111111 2019-10-01 02:05:00 mvScJNrwra 2019-10-01 01:30:00 92
4 111111 2019-10-01 03:25:00 FBAFbJYflE 2019-10-01 02:30:00 88
1 222222 2019-10-01 01:25:00 ilNuInOsYZ 2019-10-01 01:00:00 93
3 222222 2019-10-01 02:45:00 ysyolaNmkV 2019-10-01 02:00:00 86
5 222222 2019-10-01 04:05:00 wvowGGETaP 2019-10-01 04:00:00 91
You can do a full merge and then filter with `query`:

df_notes.merge(df_str, on='hadm_id').query('ce_charttime <= ne_charttime')
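Note that this keeps every earlier `ce_charttime` for each note, not just the latest one, and it materializes the full within-encounter cross product before filtering. To mirror the as-of semantics you would still need to reduce to the last match per note; a sketch with tiny made-up frames:

```python
import pandas as pd

# Made-up minimal frames: one encounter, two candidate structured rows
df_str = pd.DataFrame({'hadm_id': [196673, 196673],
                       'ce_charttime': pd.to_datetime(['2108-03-05 16:00',
                                                       '2108-03-05 16:16']),
                       'hr': [85.0, 86.0]})
df_notes = pd.DataFrame({'hadm_id': [196673],
                         'ne_charttime': pd.to_datetime(['2108-03-05 16:54']),
                         'note': ['Nursing note']})

# Full merge, filter to earlier chart times, keep only the latest per note
out = (df_notes.merge(df_str, on='hadm_id')
               .query('ce_charttime <= ne_charttime')
               .sort_values('ce_charttime')
               .groupby(['hadm_id', 'ne_charttime'], as_index=False)
               .last())
print(out)  # one row per note, hr = 86.0 (the latest match)
```

For 10,000+ encounters the intermediate cross product can get large, which is why `merge_asof` is usually preferable at scale.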